AGI operates by learning as much as possible about the world. Unsupervised learning is the second layer in doing this - the correlation layer. The layers are memory, correlation, causation and explanation.
If we use a single flat latent space then parts of it are main structure and parts of it must be preserved and isolated for detail. If we accept this then it's hierarchical anyway. Why should some parts be preserved? Because the use of fine detail depends on the structure - e.g. the main structure may say “wavy hair” - the detail shows the waves. If the structure said “straight hair” then the use of the detail is very different. So some random choice at the top level dictates the possible random choices lower down - you've got to know what top level choices were made for the lower ones to make any sense. Also, most of the information is low down - putting it all at one level would make it very hard to decide what is top level information and what isn't.
Sigmoid units define a hyperplane; the output of the unit can be inverted to constrain the input space to that hyperplane. We expect uncertainty, so we are dealing with a distribution on input and output. We hope to have Gaussian distributions and independent hyperplanes, and end up with a Gaussian distribution in input space which can be sampled from and sent back another layer.
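A minimal numpy sketch of the inversion idea, assuming a single sigmoid unit y = sigmoid(w.x + b): every x on the hyperplane w.x + b = logit(y) produces the same output y, so an arbitrary point can be projected onto that constraint. The weights and data here are synthetic, not from the notes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(y):
    return np.log(y / (1.0 - y))

rng = np.random.default_rng(0)
w = rng.normal(size=4)
b = 0.3
y = 0.8                                   # observed output we want to invert

# Project an arbitrary point x0 onto the hyperplane w.x + b = logit(y).
x0 = rng.normal(size=4)
a_target = logit(y)
x_proj = x0 + (a_target - (w @ x0 + b)) * w / (w @ w)

# The projected point reproduces the observed output exactly.
assert np.isclose(sigmoid(w @ x_proj + b), y)
```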
Training: Compute mu_i as the mean of the i'th output unit. For any input x_nj compute all outputs y_ni. If independent, then the log likelihood is \sum_ni y_ni ln(mu_i) + (1-y_ni) ln(1-mu_i). We can see this as the priors on the classes being mu_i and 1-mu_i, which we encode in -log2(mu_i) and -log2(1-mu_i) bits respectively. We need to transmit a different distribution, so it costs y_ni ln(mu_i) for one class and (1-y_ni) ln(1-mu_i) for the other. Now, the y_ni are not necessarily independent. If we assume we know all but one, we can remove any linear independence as u_
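The mean and log-likelihood computation above can be sketched as follows, on synthetic stand-in outputs; the identity at the end just checks that summing over n collapses the expression to N times the per-unit entropy-style terms, since sum_n y_ni = N mu_i by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.random((100, 5))              # stand-in outputs y_ni in (0, 1)

mu = Y.mean(axis=0)                   # mu_i: mean of the i'th output unit

# Independent per-unit log likelihood:
#   sum_ni  y_ni ln(mu_i) + (1 - y_ni) ln(1 - mu_i)
log_like = np.sum(Y * np.log(mu) + (1 - Y) * np.log(1 - mu))

# Since sum_n y_ni = N mu_i, this collapses to
#   N * sum_i  mu_i ln(mu_i) + (1 - mu_i) ln(1 - mu_i)
check = 100 * np.sum(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))
```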
Training: Orthogonalise and run a 2-class GMM - the log likelihood gain says how much of the variance these new parameters absorbed and therefore how to scale the result in comparison with the input space. Problem - if represented as a probability of one class then it's non-linear but not very Gaussian. If represented as the log ratio input to the sigmoid then it's linear. Could have different mixes and different variances, but then it's not a hyperplane and we would need to solve quadratics to reverse project.
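A hedged sketch of the 2-class GMM step with tied variance, as a minimal EM loop (the function name fit_gmm2 and the synthetic bimodal data are mine, not from the notes). The final lines compute the log-likelihood gain over a single Gaussian, which is the "how much variance did the split absorb" quantity.

```python
import numpy as np

def fit_gmm2(x, iters=50):
    # Minimal EM for a 1-D, 2-component Gaussian mixture with tied variance.
    mu = np.array([x.min(), x.max()])
    var = x.var()
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities (the shared-variance constant cancels)
        d2 = (x[:, None] - mu[None, :])**2
        log_p = np.log(pi) - 0.5 * d2 / var
        r = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and the tied variance
        n = r.sum(axis=0)
        pi = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu[None, :])**2).sum() / len(x)
    return mu, var, pi

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.5, 500)])
mu, var, pi = fit_gmm2(x)

# Log-likelihood gain over a single Gaussian fit to the same data.
ll_one = np.sum(-0.5 * (x - x.mean())**2 / x.var()
                - 0.5 * np.log(2 * np.pi * x.var()))
d2 = (x[:, None] - mu[None, :])**2
ll_mix = np.sum(np.log((pi * np.exp(-0.5 * d2 / var)
                        / np.sqrt(2 * np.pi * var)).sum(axis=1)))
gain = ll_mix - ll_one
```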
Training: GMM but make sure that it doesn't end up one-hot - then scale, eigenvector and GMM again until we can sample.
Training: Lloyd-Max each dimension with the number of points proportional to variance (or stddev?). Find the cumulative variance and use that to partition into equal sizes, so the end eigenvectors may get summed and only one or two points used. Then output posteriors in each dimension, do eigenvectors and repeat.
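A small sketch of the per-dimension Lloyd-Max step on synthetic Gaussian data (the variance-proportional allocation of points across dimensions is left out; lloyd_max_1d is my name). It alternates nearest-level assignment with centroid updates, the standard Lloyd iteration.

```python
import numpy as np

def lloyd_max_1d(x, k, iters=50):
    # Lloyd-Max scalar quantiser: alternate nearest-level assignment
    # with conditional-mean (centroid) updates.
    c = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial levels
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                c[j] = x[idx == j].mean()
    return np.sort(c)

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
levels = lloyd_max_1d(x, 4)

# Quantisation distortion should be well below the raw variance.
idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
distortion = np.mean((x - levels[idx])**2)
```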
Training: Build a virtual 2^D sized codebook - each dimension has two points, tied variance. Can also rotate and scale the space, so we are asking to minimise MSE when the binary codebook is projected through a matrix. Training would be something like: project the image through the inverse of the weight matrix, find either the closest code or a distribution over codes, adjust the weight matrix to minimise MSE distortion. This would seem to be achievable.
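The "project, assign, refit" loop above could look something like this sketch: alternate nearest-code assignment with a least-squares refit of the projection matrix, over a virtual 2^D codebook of ±1 codes (the function name and synthetic data are mine, and this is one plausible reading of the note, not a definitive implementation).

```python
import numpy as np
from itertools import product

def fit_binary_codebook(X, D, iters=20, seed=0):
    # X ~ C A, where each row of C is a code in {-1,+1}^D and A is the
    # projection matrix. Alternate: assign nearest projected code, then
    # refit A by least squares to minimise MSE distortion.
    rng = np.random.default_rng(seed)
    codes = np.array(list(product([-1.0, 1.0], repeat=D)))   # virtual 2^D codebook
    A = rng.normal(size=(D, X.shape[1]))
    idx = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        proj = codes @ A                                     # (2^D, data dim)
        idx = ((X[:, None, :] - proj[None, :, :])**2).sum(-1).argmin(1)
        A = np.linalg.lstsq(codes[idx], X, rcond=None)[0]
    mse = np.mean((X - codes[idx] @ A)**2)
    return A, mse

rng = np.random.default_rng(4)
true_A = rng.normal(size=(3, 5))
C_true = rng.choice([-1.0, 1.0], size=(400, 3))
X = C_true @ true_A + 0.05 * rng.normal(size=(400, 5))      # code * matrix + noise
A, mse = fit_binary_codebook(X, D=3)
```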
Training: Train up a big GMM/VQ. Get Gaussian stats on the nearest 2 neighbours (or all short distances). Model as a fraction of the probability mass between two centres and a distribution around the line. Pass on the distance between centres. Reconstruct by taking a weighted average of the points between the centres plus some of the observed Gaussian noise.
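A minimal sketch of the decomposition above, assuming plain Euclidean nearest neighbours on synthetic centres: the observation is split into a fraction t of the way between its two nearest centres plus a residual, which is what the note models as Gaussian noise around the line.

```python
import numpy as np

rng = np.random.default_rng(5)
centres = 3.0 * rng.normal(size=(8, 2))   # stand-in VQ/GMM centres
x = rng.normal(size=2)                    # observation to encode

d = np.linalg.norm(centres - x, axis=1)
i, j = np.argsort(d)[:2]                  # two nearest centres
c1, c2 = centres[i], centres[j]

# Fraction t of the way from c1 to c2 (projection onto the connecting line).
u = c2 - c1
t = np.clip((x - c1) @ u / (u @ u), 0.0, 1.0)
on_line = c1 + t * u                      # weighted average of the two centres
residual = x - on_line                    # modelled as noise around the line

# Reconstruction: interpolate between centres, then add the noise term.
x_hat = (1 - t) * c1 + t * c2 + residual
```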
Training: Each layer is very simple: it has input, which is data points and estimated variance; it has a smaller latent space, which is latent values and estimated variance; and it has output, which is in the same space as the input, with variance.
Dot product is used in:
Let's constrain |A| = |B| = 1; then |A - B|^2 = |A|^2 + |B|^2 - 2 A.B = 2 - 2 A.B.
This is cool - we have the property that distance is the dot product and we can minimise distance by averaging (so learning is fast) - also data points and parameters are essentially the same - given only one example we'd dot product with that.
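For unit vectors this identity and the averaging claim can be checked directly (synthetic data; the prototype w is just the renormalised mean):

```python
import numpy as np

rng = np.random.default_rng(6)
a = rng.normal(size=5); a /= np.linalg.norm(a)
b = rng.normal(size=5); b /= np.linalg.norm(b)

# For unit vectors, squared distance is an affine function of the dot product:
#   |a - b|^2 = |a|^2 + |b|^2 - 2 a.b = 2 - 2 a.b
assert np.isclose(np.sum((a - b)**2), 2 - 2 * (a @ b))

# So minimising mean distance = maximising mean dot product, and the
# renormalised mean of the data is the maximiser.
P = rng.normal(size=(50, 5))
P /= np.linalg.norm(P, axis=1, keepdims=True)
w = P.mean(axis=0); w /= np.linalg.norm(w)
dots = P @ w

# The averaged prototype beats any single data point used as weights.
assert dots.mean() >= (P @ P[0]).mean()
```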
Orthonormal means the weight transpose is the inverse, that is W W^T = I.
There are only n (n-1)/2 free variables as both the rows and columns must be unit vectors.
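A quick numpy check of the W W^T = I property, using QR decomposition to generate a random orthonormal matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
W, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthonormal matrix

# Transpose is the inverse:
assert np.allclose(W @ W.T, np.eye(n))
assert np.allclose(W.T, np.linalg.inv(W))

# Both the rows and the columns are unit vectors:
assert np.allclose(np.linalg.norm(W, axis=0), 1)
assert np.allclose(np.linalg.norm(W, axis=1), 1)
```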
They fit in very well with efficient estimation and dot products (above) - the data is the same as the weights. If we had to memorise one data point then use that as weights.
Question: How to ensure orthogonality?
Can do it using distances:
Tony Robinson 12/23/19 1:49 PM
I'd really like to do good unsupervised learning, so here is my Christmas wish list:
Edit: every layer has a larger space at the bottom, closer to the observations, and a smaller space at the top, closer to the latent space. The job of the layer is to represent the information in the observation space as closely as possible: what can be preserved is taken as signal, what can't is taken as noise. The signal is passed on; the noise density is modelled. The very top layers have no signal left, they are just hyperparameters (of hyperparameters) of noise. Everything is second degree orthogonal and Gaussian, so you can sample at any level, and as you go down towards the observations you gain more detail in the way of signal and also the parameters of the noise, which becomes fake detail for the layer below it (which is believable because we've put a lot of effort into modelling it as noise).
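One way to make this concrete, as a hedged sketch: a single PCA-style layer that passes the top-k directions up as signal and keeps only a noise variance for the rest; sampling down decodes the signal and adds fake detail drawn from that noise model (function names and the synthetic data are mine, not from the notes).

```python
import numpy as np

def fit_layer(X, k):
    # One layer: keep the top-k principal directions as signal and
    # summarise everything thrown away as a single noise variance.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    signal = (X - mean) @ Vt[:k].T                  # smaller latent space, passed up
    residual = (X - mean) - signal @ Vt[:k]
    noise_var = np.mean(residual**2)                # noise density, modelled not passed
    return {"mean": mean, "basis": Vt[:k], "noise_var": noise_var}, signal

def sample_down(layer, z, rng):
    # Going down a layer: decode the signal, then add fake detail drawn
    # from the modelled noise density.
    x = layer["mean"] + z @ layer["basis"]
    return x + rng.normal(scale=np.sqrt(layer["noise_var"]), size=x.shape)

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # correlated toy data
layer, Z = fit_layer(X, k=2)
X_fake = sample_down(layer, Z, rng)
```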