# Tony Robinson


## Unsupervised Learning

### Why we need it

AGI operates by learning as much as possible about the world. Unsupervised learning is the second layer in doing this - the correlation layer. The layers are memory, correlation, causation and explanation.

### Why Latents should be hierarchical

If we use a single flat latent space then parts of it carry the main structure and parts of it must be preserved and isolated for detail. If we accept this then it's hierarchical anyway. Why should some parts be preserved? Because the use of fine detail depends on the structure - e.g. the main structure may say “wavy hair” and the detail shows the waves. If the structure said “straight hair” then the use of the detail would be very different. So a random choice at the top level dictates the possible random choices lower down - you've got to know what top-level choices were made for the lower ones to make any sense. Also, most of the information is low down - putting it all at one level would make it very hard to decide what is top-level information and what isn't.

### Current Ideas

Sigmoid units define a hyperplane; the output of the unit can be inverted to constrain the input space to that hyperplane. We expect uncertainty, so we are dealing with a distribution on input and output. We hope to have Gaussian distributions and independent hyperplanes, and to end up with a Gaussian distribution in input space which can be sampled from and sent back another layer.
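
A quick sketch of the inversion (my own toy numbers, not from above): observing the output y of a sigmoid pins w.x + b = logit(y), which is a hyperplane, and a Gaussian sample in input space can be projected onto it.

```python
import numpy as np

# Toy example: one sigmoid unit y = sigmoid(w.x + b) in a 2-D input space.
# Inverting an observed output y gives the constraint w.x + b = logit(y),
# i.e. a hyperplane in input space.
rng = np.random.default_rng(0)
w = np.array([1.0, 2.0])
b = -0.5

def logit(y):
    return np.log(y / (1.0 - y))

y = 0.8                      # observed (mean) output of the unit
c = logit(y) - b             # the constraint is w.x = c

# Project Gaussian samples onto the constraint hyperplane:
# x' = x + w (c - w.x) / (w.w) is the closest point on the plane.
x = rng.standard_normal((5, 2))
x_proj = x + np.outer((c - x @ w) / (w @ w), w)
assert np.allclose(x_proj @ w, c)   # all projected points satisfy w.x = c
```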

Training: Compute mu_i as the mean of the i'th output unit. For any input x_nj compute all outputs y_ni. If independent, then the log likelihood is \sum_ni y_ni ln(mu_i) + (1-y_ni) ln(1-mu_i). We can see this as the priors on the classes being mu_i and 1-mu_i, which we encode in -log2(mu_i) and -log2(1-mu_i) bits respectively. We need to transmit a different distribution, so it costs y_ni ln(mu_i) for one class and (1-y_ni) ln(1-mu_i) for the other. Now, the y_ni are not necessarily independent. If we assume we know all but one, we can remove any linear independence as u_
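
A minimal numerical check of the independence scoring above (toy data, shapes assumed):

```python
import numpy as np

# Treat each output unit's activation y_ni as a Bernoulli variable with
# prior mean mu_i, and score the batch with
# sum_ni y_ni ln(mu_i) + (1 - y_ni) ln(1 - mu_i).
rng = np.random.default_rng(1)
y = rng.uniform(0.05, 0.95, size=(100, 8))   # y_ni: outputs for 100 inputs

mu = y.mean(axis=0)                           # mu_i: mean of the i'th unit
log_like = np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Equivalently, -log2(mu_i) and -log2(1 - mu_i) are code lengths in bits:
bits = -np.sum(y * np.log2(mu) + (1 - y) * np.log2(1 - mu))
assert np.isclose(bits, -log_like / np.log(2))
```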

Training: Orthogonalise and run a 2-class GMM - the log likelihood gain says how much of the variance these new parameters absorbed and therefore how to scale the result in comparison with the input space. Problem - if we represent as a probability of one class then it's non-linear but not very Gaussian. If we represent as the log ratio input to the sigmoid then it's linear. Could have different mixes and different variances, but then it's not a hyperplane and we would need to solve quadratics to reverse project.

Training: GMM but make sure that it doesn't end up one-hot - then scale, eigenvector and GMM again until we can sample.

Training: Lloyd-Max each dimension with the number of points proportional to variance (or stddev?). Find the cumulative variance and use that to partition into equal sizes, so the end eigenvectors may get summed and only one or two points used. Then output posteriors in each dimension, do eigenvectors and repeat.
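
A sketch of Lloyd-Max in one dimension (the standard alternation, details assumed): assign points to the nearest code point, then move each code point to the mean of its cell (for squared error the optimal reproduction point is the centroid).

```python
import numpy as np

# Minimal 1-D Lloyd-Max quantiser sketch.
rng = np.random.default_rng(2)
x = rng.standard_normal(1000)

def lloyd_max_1d(x, n_points, n_iters=50):
    # spread the initial code points over the data with quantiles
    codes = np.quantile(x, np.linspace(0.1, 0.9, n_points))
    for _ in range(n_iters):
        # nearest-code assignment
        assign = np.argmin(np.abs(x[:, None] - codes[None, :]), axis=1)
        # centroid update
        for k in range(n_points):
            if np.any(assign == k):
                codes[k] = x[assign == k].mean()
    return codes

# As suggested above, the number of points per dimension could be made
# proportional to that dimension's variance (or standard deviation).
codes = lloyd_max_1d(x, n_points=4)
```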

Training: Build a virtual 2^D sized codebook - each dimension has two points, tied variance. Can also rotate and scale the space, so we are asking to minimise MSE when the binary codebook is projected through a matrix. Training would be something like: project the image through the inverse of the weight matrix, find either the closest code or a distribution over codes, then adjust the weight matrix to minimise the MSE distortion. This would seem to be achievable.
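
A sketch of the virtual-codebook idea; the alternation between nearest-code assignment and a least-squares refit of the matrix is my assumption of what "adjust the weight matrix" would look like, not spelled out above.

```python
import numpy as np

# Codes are binary in D dims, projected through a matrix W into data space.
# Alternate: pick the nearest code for each point, then refit W by least
# squares to minimise MSE. Each step is a minimisation, so MSE never rises.
rng = np.random.default_rng(3)
D, N, obs = 4, 200, 6
W_true = rng.standard_normal((D, obs))
codes_true = rng.integers(0, 2, size=(N, D)).astype(float)
X = codes_true @ W_true + 0.01 * rng.standard_normal((N, obs))

# enumerate all 2^D binary codes, shape (16, D)
all_codes = ((np.arange(2**D)[:, None] >> np.arange(D)) & 1).astype(float)

W = rng.standard_normal((D, obs))
for _ in range(10):
    recon = all_codes @ W                                  # (16, obs)
    assign = np.argmin(((X[:, None, :] - recon[None]) ** 2).sum(-1), axis=1)
    C = all_codes[assign]                                  # nearest codes
    W, *_ = np.linalg.lstsq(C, X, rcond=None)              # refit W

mse = ((C @ W - X) ** 2).mean()
```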

Training: Train up a big GMM/VQ. Get Gaussian stats on the nearest 2 neighbours (or all short distances). Model as a fraction of the probability mass between two centres and a distribution around the line. Pass on the distance between centres. Reconstruct by taking a weighted average of the points between centres plus some of the observed Gaussian noise.

Training: Each layer is very simple: it has an input of data points and estimated variance, it has a smaller latent space of latent values and estimated variance, and it has an output which is in the same space as the input, with variance.

• All dims are assumed independent.
• Training is KL divergence.
• Sampling is natural, just call randn().
• Trained using fast dropout techniques, not sampling.
• Share encoding and decoding weights if at all possible (like eigenvectors/GMM/VQ).
• Aim is to use a linear combination of the smallest number of dimensions to reconstruct - sort of like the best compression. If data was in ndim points or less then encoding would be one hot and reconstruction perfect.
• If data has no structure then work in linear region of sigmoid and work like PCA.
• When expanding back out, scale to unit circle, that way averages don't wash out, they pick a plausible data point.
• cos \theta is distance squared. Do we want this? Even if we do, what sets the scale? (high temperature gives linear, low gives one-hot)
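
Two of the bullets above can be sketched numerically (the mechanics here are my assumption of what "scale to unit circle" means, i.e. rescaling to the unit sphere):

```python
import numpy as np

# Sampling is natural: just call randn(). And when expanding back out,
# rescaling to the unit sphere stops averages washing out toward the
# origin - the rescaled average lands on a plausible point between them.
rng = np.random.default_rng(4)

z = rng.standard_normal(16)              # sampling a latent: just randn()

a = rng.standard_normal(16)
b = rng.standard_normal(16)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

avg = (a + b) / 2                        # plain average shrinks inwards
avg_sphere = avg / np.linalg.norm(avg)   # rescaled average stays on the sphere

assert np.linalg.norm(avg) < 1.0
assert np.isclose(np.linalg.norm(avg_sphere), 1.0)
```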

Outstanding questions:

• How do we ensure that all planes form a basis? Is this even necessary or desirable? The solution to XOR has parallel planes. We don't need orthogonal, but it may help to keep things independent.
• How do we combine distributions? Surely independent Gaussians convolve to be Gaussian, but this needs to be proved.
• What is the distance metric that allows all of this to train? Maybe:
• two codebook GMM and maximise separation of means or minimise the shared variance?
• compute a confusion matrix and use grad descent to zero all but the diagonal elements

### Why is dot product so dominant?

Dot product is used in:

• Sigmoid/Softmax - dot product gives log likelihood which is then normalised
• Cosine distance - cos(\theta) = A . B / (|A| |B|)
• can arrange for |A| and/or |B| to be 1.0 or use the norm as a prior (e.g. in word embeddings)
• cos(x) ~= 1 - x^2/2 + x^4/24 … - if x small then sensibly quadratic
• cos(x) has a finite range - 1 to -1
• Euclidean distance (e.g. in GMMs) - |A - B|^2 = A.A - 2 A.B + B.B
• again can arrange for A.A and/or B.B to be 1.0

Let's constrain |A| = |B| = 1 then:

• Distance along the sphere is x = r theta, but r = 1, so cos(x) = A.B and cos(x) ~= 1 - x^2 / 2, giving x^2 ~= 2 - 2 A.B (for small x)
• x^2 = |A - B|^2 = 2 - 2 A.B

This is cool - we have the property that distance is the dot product and we can minimise distance by averaging (so learning is fast) - also data points and parameters are essentially the same - given only one example we'd dot product with that.
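
The identity above is easy to check numerically (random unit vectors, my own toy dimensions):

```python
import numpy as np

# For unit vectors A and B: |A - B|^2 = 2 - 2 A.B exactly,
# and for small angles x, x^2 ~= 2 - 2 cos(x).
rng = np.random.default_rng(5)
A = rng.standard_normal(8); A /= np.linalg.norm(A)
B = rng.standard_normal(8); B /= np.linalg.norm(B)

assert np.isclose(np.sum((A - B) ** 2), 2 - 2 * A @ B)

x = 0.1                                   # a small angle
# the error of the small-angle approximation is bounded by x^4/12
assert abs(x**2 - (2 - 2 * np.cos(x))) < x**4 / 12 + 1e-12
```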

### Orthonormal weights

Orthonormal means the weight transpose is the inverse, that is W W^T = I.

There are only n (n-1)/2 free variables as both the rows and columns must be unit vectors.

They fit in very well with efficient estimation and dot products (above) - the data is the same as the weights. If we had to memorise one data point then use that as weights.

Question: How do we ensure orthogonality?

• exact methods - decompose an (n,n) matrix into (n-1,n-1) and ensure each step maintains orthonormality - that's n steps to backprop through, which can't be good.
• approximate methods - enforce rows or columns to be unit vectors and then minimise ||W W^T - I||^2 using (say) LMS (the plain trace Tr(W W^T - I) vanishes once the rows are unit vectors, so the squared form is needed)
• near exact - use exp(W) and approximate the exp - http://proceedings.mlr.press/v97/lezcano-casado19a/lezcano-casado19a.pdf
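
A sketch of the approximate method (learning rate and iteration count are my own guesses): gradient descent on the squared Frobenius penalty ||W W^T - I||^2.

```python
import numpy as np

# Drive a random matrix toward orthonormality by descending
# ||W W^T - I||_F^2, whose gradient is 4 (W W^T - I) W.
rng = np.random.default_rng(6)
n = 5
W = rng.standard_normal((n, n)) / np.sqrt(n)  # scaled init keeps descent stable

lr = 0.03
for _ in range(500):
    R = W @ W.T - np.eye(n)
    W -= lr * 4 * R @ W

assert np.allclose(W @ W.T, np.eye(n), atol=1e-6)
```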

### Embedding data onto unit sphere

Can do it using distances:

### Dear Santa 2019

Tony Robinson 12/23/19 1:49 PM

Dear Santa,

I'd really like to do good unsupervised learning, so here is my Christmas wish list:

• I'd like it to have a proper generative model, whenever I sample from it I get something meaningful (not like the yucky faces I have already).
• I'd like it to be properly probabilistic everywhere and invertible so that I have proper two-way communication between the observation space and the final latent space. This would mean that I don't need the encoder/decoder structure, I can constrain the observation space and get latents, or the latent space and get observations or part of the observation space and it'll fill in the rest - just like stacked RBMs.
• Thinking of stacked RBMs, I'd like to be able to tweak it a bit with supervised data and gradient descent, just like the first big DNNs.
• I like efficient algorithms, EM is my favourite. So please let me train this fast, even if it isn't closed form or EM please let it be simple so that gradient descent doesn't take ages, I only have old GPUs with limited memory.
• I like continuous spaces, real world tasks don't often fit the Bernoulli distribution, so maybe not quite like stacked RBMs. Gaussians are nice - everything turns into a Gaussian when you mix it up enough.
• Thinking of Gaussians, I like Eigenspaces, so please can the representation at every level be orthogonal and nearly Gaussian? That way I can sample at whatever level I want.
• I'd like it to be non-linear. Surely a good latent space has to be non-linear in the observations. I realise that this is difficult as I also want it to be invertible.
• I'd like to be able to stack these like I can convnets, say using a 5×5 input field and then subsample the spatial dimensions by a factor of two and double the channels. I realise this isn't easy to generate from as now we have several sources generating the same points.
• So if you can give me eigenspaces that are nearly Gaussian, can the next level just pick up on the remaining structure? That would make the next level up more compact. I'd really like to construct this level in an optimal way, not using a greedy algorithm, and please remember I'd like to use the same thing one level above, so this should look like a (non-linear) eigenspace.
• Lastly, does this already exist? Is it Stacked Gaussian Processes or Stacked hyperparameters or something similar? A reference from arXiv.org and/or a github repo would be super!

Thanks Santa!

Edit: every layer has a larger space at the bottom, closer to the observations, and a smaller space at the top, closer to the latent space. The job of the layer is to represent the information in the observation space as closely as possible; what can be preserved is taken as signal, what can't is taken as noise. The signal is passed on, the noise density is modelled. The very top layers have no signal left; they are just hyperparameters (of hyperparameters) of noise. Everything is second degree orthogonal and Gaussian, so you can sample at any level, and as you go down towards observations you gain more detail in the way of signal and also the parameters of the noise, which becomes fake detail for the layer below it (which is believable because we've put a lot of effort into modelling it as noise).