## Logistic regression

- This is the perceptron with a **sigmoid** activation
  - It actually computes the *probability* that the input belongs to class 1
  - Decision boundaries may be obtained by comparing the probability to a threshold
    - These boundaries will be lines (hyperplanes in higher dimensions)
    - The sigmoid perceptron is a linear classifier
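
As a minimal sketch of the points above, the sigmoid perceptron can be written in a few lines of NumPy. The weights `w0` and `w`, the sample points, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w0, w):
    """P(class 1 | x) for each row of X under the sigmoid perceptron."""
    return sigmoid(w0 + X @ w)

def predict(X, w0, w, threshold=0.5):
    """Compare the probability to a threshold to obtain a 0/1 label."""
    return (predict_proba(X, w0, w) >= threshold).astype(int)

# Hypothetical weights and inputs, for illustration only.
w0, w = -1.0, np.array([2.0, -1.0])
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [0.5, 0.0]])

probs = predict_proba(X, w0, w)
labels = predict(X, w0, w)
print(probs, labels)
```

Because the sigmoid is monotonic, thresholding the probability at 0.5 is the same as testing the sign of `w0 + w @ x`, which is why the boundary is a line (or hyperplane).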

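### Estimating the model

- Given training data: $\left(X_{1}, y_{1}\right),\left(X_{2}, y_{2}\right), \ldots,\left(X_{N}, y_{N}\right)$
  - $X$ are vectors, $y$ are binary class values (0/1; the compact formula $\frac{1}{1+e^{-y\left(w_{0}+w^{T} X\right)}}$ used below takes the equivalent coding $y \in\{-1,+1\}$)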
- Total probability of data

$$
\begin{aligned}
P\left(\left(X_{1}, y_{1}\right),\left(X_{2}, y_{2}\right), \ldots,\left(X_{N}, y_{N}\right)\right) &= \prod_{i} P\left(X_{i}, y_{i}\right) \\
&= \prod_{i} P\left(y_{i} \mid X_{i}\right) P\left(X_{i}\right)=\prod_{i} \frac{1}{1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}} P\left(X_{i}\right)
\end{aligned}
$$

- Likelihood

$$ P(\text {Training data})=\prod_{i} \frac{1}{1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}} P\left(X_{i}\right) $$

- Log likelihood

$$ \begin{array}{l} \log P(\text {Training data})= \sum_{i} \log P\left(X_{i}\right)-\sum_{i} \log \left(1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}\right) \end{array} $$

- Maximum Likelihood Estimate

$$ w_{0}, w=\underset{w_{0}, w}{\operatorname{argmax}} \log P(\text {Training data}) $$

- The term $\sum_{i} \log P\left(X_{i}\right)$ does not depend on the weights, so dropping it and flipping the sign gives an equivalent problem (note argmin rather than argmax)

$$ w_{0}, w=\underset{w_{0}, w}{\operatorname{argmin}} \sum_{i} \log \left(1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}\right) $$

- Identical to **minimizing the KL divergence** between the desired output and the actual output $\frac{1}{1+e^{-\left(w_{0}+w^{T} X_{i}\right)}}$
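
The minimization above can be sketched with plain batch gradient descent. This is an illustrative sketch rather than a reference implementation: it assumes labels coded as $\pm 1$ (the coding under which the compact loss is valid), and the toy data, learning rate, step count, and function names are all hypothetical. The helper `kl_to_one_hot` checks numerically that the summed KL divergence to one-hot targets equals the same loss.

```python
import numpy as np

def log_loss_pm1(w0, w, X, y):
    """sum_i log(1 + exp(-y_i (w0 + w^T x_i))), labels y_i in {-1, +1}."""
    margins = y * (w0 + X @ w)
    return np.sum(np.logaddexp(0.0, -margins))

def kl_to_one_hot(w0, w, X, y):
    """Sum of KL(desired || actual) when the desired output is one-hot.

    For a one-hot target the KL divergence reduces to -log P(correct class).
    """
    p1 = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # P(class +1 | x)
    p_correct = np.where(y == 1, p1, 1.0 - p1)
    return np.sum(-np.log(p_correct))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Batch gradient descent on the mean of the loss above."""
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(steps):
        margins = y * (w0 + X @ w)
        # d/dz log(1 + exp(-y z)) = -y / (1 + exp(y z))
        coeff = -y / (1.0 + np.exp(margins))
        w0 -= lr * coeff.mean()
        w -= lr * (X.T @ coeff) / n
    return w0, w

# Toy 1-D data with overlapping classes (illustrative only).
X = np.array([[-2.0], [-1.0], [0.5], [-0.5], [1.0], [2.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

w0_hat, w_hat = fit_logistic(X, y)
print(w0_hat, w_hat, log_loss_pm1(w0_hat, w_hat, X, y))
```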

## MLP

### Separable case

- The rest of the network (everything below the output layer) may be viewed as a transformation that maps classes that are not linearly separable into linearly separable features
  - We can now attach **any** linear classifier above it for perfect classification
    - *Need not be a perceptron*
    - Could even train an SVM on top of the features!
- For **insufficient** structures, the network may *attempt* to transform the inputs to linearly separable features
  - Will fail to separate exactly, but will try to minimize error
- The network until the **second-to-last** layer is a non-linear function $f(X)$ that converts the input space of $X$ into a feature space where the classes are maximally linearly separable
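
To make the idea concrete, here is a small sketch on XOR, which is not linearly separable in the input space. The feature map `features` is a hand-picked two-unit ReLU layer standing in for a trained $f(X)$, and the perceptron trainer on top is one arbitrary choice of linear classifier; both are assumptions for illustration.

```python
import numpy as np

# XOR: no single line separates the two classes in the input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])  # labels coded as -1 / +1

def features(X):
    """Hand-picked hidden layer f(X) with two ReLU units (not learned)."""
    s = X.sum(axis=1)
    return np.stack([np.maximum(s, 0.0), np.maximum(s - 1.0, 0.0)], axis=1)

def train_perceptron(F, y, max_epochs=1000):
    """Classic perceptron rule; stops once an epoch makes no mistakes."""
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for f, t in zip(F, y):
            if t * (f @ w + b) <= 0:   # misclassified (or on the boundary)
                w += t * f
                b += t
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

F = features(X)                 # features are now linearly separable
w, b = train_perceptron(F, y)
pred = np.sign(F @ w + b).astype(int)
print(F, w, b, pred)
```

Any other linear classifier (for example an SVM) could replace `train_perceptron` here, since all of the non-linear work is done by `features`.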

### Lower layers

- **Manifold hypothesis**: For separable classes, the classes are linearly separable on a non-linear manifold
  - Layers sequentially "straighten" the data manifold
- The "**feature extraction**" layer transforms the data such that the posterior probability may now be modelled by a logistic

### Weight as a template

- In high dimensional space, all vectors are more or less the same length
  - Which means all $x$ lie on the surface of a sphere
- The perceptron fires if the input is within a **specified angle of the weight**
  - Represents a convex region on the surface of the sphere!
  - The network is a Boolean function over these regions
- Neuron **fires** if the input vector is close enough to the weight vector
  - *If the input pattern matches the weight pattern closely enough*
- The perceptron is a **correlation filter**!
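
A short sketch of the "fires within an angle" view, assuming unit-length inputs and weights so that the correlation $w^{T} x$ equals the cosine of the angle between them. The template `w`, the 30-degree cutoff, and the test vectors are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length (puts it on the sphere)."""
    return v / np.linalg.norm(v)

def angle_deg(x, w):
    """Angle between x and w in degrees."""
    cos = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def fires(x, w, threshold):
    """Threshold perceptron: fires when the correlation w^T x is high enough."""
    return bool(w @ x >= threshold)

w = unit(np.array([1.0, 0.0, 0.0]))          # hypothetical weight "template"
max_angle = 30.0                              # hypothetical angular cutoff
threshold = np.cos(np.radians(max_angle))     # w^T x = cos(angle) on the unit sphere

close = unit(np.array([1.0, 0.2, 0.0]))       # roughly 11 degrees from w
far = unit(np.array([1.0, 1.0, 0.0]))         # 45 degrees from w

print(angle_deg(close, w), fires(close, w, threshold))
print(angle_deg(far, w), fires(far, w, threshold))
```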

## Autoencoder

- The lowest layers of a network detect significant features in the signal
  - **The signal could be (partially) reconstructed using these features**
  - Will retain all the significant components of the signal

### Simplest autoencoder

- This is just PCA!
- The autoencoder finds the **direction of maximum energy**
  - Simply varying the hidden representation will result in an output that lies along the major axis
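
The PCA connection can be checked numerically. The sketch below assumes a one-unit linear autoencoder with tied weights (encoder $z = w^{T} x$, decoder $\hat{x} = z w$) trained by gradient descent on squared reconstruction error; the tied weights, synthetic data, learning rate, and step count are simplifying assumptions, not something the notes specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data stretched along a 30-degree axis (illustrative only).
theta = np.radians(30.0)
major = np.array([np.cos(theta), np.sin(theta)])
minor = np.array([-np.sin(theta), np.cos(theta)])
n = 500
X = (rng.normal(0.0, 3.0, size=(n, 1)) * major
     + rng.normal(0.0, 0.3, size=(n, 1)) * minor)
X -= X.mean(axis=0)

C = X.T @ X / n  # sample covariance of the centered data

# Tied-weight linear autoencoder: x_hat = w (w^T x).
# Mean loss: tr(C) - 2 w^T C w + (w^T w)(w^T C w)
w = np.array([0.1, 0.1])
lr = 0.005
for _ in range(3000):
    Cw = C @ w
    grad = -4.0 * Cw + 2.0 * (w @ Cw) * w + 2.0 * (w @ w) * Cw
    w -= lr * grad

# First principal direction from the SVD of the centered data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]

cosine = abs(w @ pc1) / np.linalg.norm(w)
print(w, pc1, cosine)
```

Decoding any hidden value `z` gives `z * w`, a point on the line through the origin along the learned direction, which is the major axis of the data.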

### Terminology

- Encoder
  - The "Analysis" net, which computes the hidden representation
- Decoder
  - The "Synthesis" net, which recomposes the data from the hidden representation

### Nonlinearity

- When the hidden layer has a linear activation, the decoder represents the best **linear** manifold to fit the data
  - Varying the hidden value will move along this linear manifold
- When the hidden layer has **non-linear activation**, the net performs **nonlinear** PCA
  - The decoder represents the best non-linear manifold to fit the data
  - Varying the hidden value will move along this non-linear manifold
- The model is specific to the training data
  - Varying the hidden layer value only generates data along the learned manifold
  - Any input will result in an output along the learned manifold
    - **But may not generalize beyond the manifold**
    - Inputs unlike the training data may produce unintuitive outputs, since nothing constrains the network there
  - **The decoder can only generate data on the manifold that the training data lie on**
- This also makes it an excellent "**generator**" of the distribution of the training data
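
The claim that every output lands on the learned manifold is easiest to see in the linear case, where the manifold is a line. The sketch below assumes a one-unit linear autoencoder whose learned direction `w` is given (a hypothetical value); with a non-linear decoder the same reasoning applies, except that the decoder's image is a curve instead of a line.

```python
import numpy as np

# Hypothetical learned direction of a one-unit linear autoencoder (unit length).
w = np.array([np.cos(np.radians(30.0)), np.sin(np.radians(30.0))])

def encode(x):
    """Hidden value: projection of the input onto w."""
    return float(w @ x)

def decode(z):
    """Reconstruction: a point on the line spanned by w."""
    return z * w

def off_manifold(p):
    """Distance of a 2-D point from the line spanned by w."""
    return abs(w[0] * p[1] - w[1] * p[0])

# Varying the hidden value only moves along the learned line.
generated = np.array([decode(z) for z in (-2.0, 0.0, 1.5, 10.0)])

# An input far from the manifold is still mapped onto it.
x_unseen = np.array([-3.0, 4.0])
x_hat = decode(encode(x_unseen))

print(generated)
print(off_manifold(x_unseen), off_manifold(x_hat))
```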

## Dictionary-based techniques

- The decoder represents a **source-specific generative dictionary**
  - Exciting it will produce typical data from the source!

### Signal separation

- Separation: Identify the combination of entries from both dictionaries that compose the mixed signal

- Given mixed signal and source dictionaries, find excitation that best recreates mixed signal
- Simple backpropagation

- Intermediate results are separated signals
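
A rough sketch of the separation procedure, under the simplifying assumption that each source dictionary is a fixed linear decoder (a matrix of atoms) rather than a trained network. The dictionaries `D1` and `D2`, the true excitations, and the step size are hypothetical. The dictionaries stay frozen and only the excitations are updated by gradient descent on the reconstruction error of the mixture.

```python
import numpy as np

# Hypothetical source dictionaries: columns are atoms, rows are signal samples.
D1 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
D2 = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

# Build a mixed signal from known excitations so the result can be checked.
h1_true = np.array([1.0, 0.5])
h2_true = np.array([0.3, -0.2])
mixed = D1 @ h1_true + D2 @ h2_true

# Find the excitations that best recreate the mixture; dictionaries are fixed.
h1 = np.zeros(2)
h2 = np.zeros(2)
lr = 0.1
for _ in range(500):
    residual = D1 @ h1 + D2 @ h2 - mixed   # error of the recomposed mixture
    h1 -= lr * (D1.T @ residual)            # gradient of 0.5 * ||residual||^2
    h2 -= lr * (D2.T @ residual)

# The intermediate outputs of each decoder are the separated signals.
separated_1 = D1 @ h1
separated_2 = D2 @ h2
print(separated_1, separated_2)
```

Recovery is exact in this toy case because the atoms of the two dictionaries are mutually orthogonal; with overlapping dictionaries the split between sources is generally not unique.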