Design and license cores for the efficient execution of neural networks. The efficiency is obtained by having vector and matrix “registers” and hardware to perform the vector * matrix = vector operation (plus associated operations, such as addition, ReLU, sigmoid and softmax on vectors). The engine performs complete stochastic gradient descent, not just inference.
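As a minimal numpy sketch of the core operation described above (the names `relu`, `softmax` and `engine_step` are illustrative, not a real API):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def softmax(v):
    e = np.exp(v - v.max())        # subtract max for numerical stability
    return e / e.sum()

def engine_step(W, x, activation=relu):
    """One engine operation: vector * matrix = vector, plus an elementwise op."""
    return activation(W @ x)

y = engine_step(np.array([[1.0, -2.0], [0.5, 3.0]]), np.array([2.0, 1.0]))
# y == [0.0, 4.0]: W @ x gives [0.0, 4.0] and ReLU leaves it unchanged
```

In hardware the matmul is the expensive part; the elementwise activations are the cheap "associated" operations applied to the result vector.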
The market is dominated by GPUs, with FPGAs upcoming. Fully analog (neuromorphic) hardware has been on the horizon for a while but can't be trained. Neuracore will both be trainable with standard software (TensorFlow, PyTorch) and achieve a massive improvement in TFLOP/watt - it will be the most power-efficient NN hardware available.
* PoC - simulation (SPICE)
* Patent protection of key ideas
* Partner with ARM, get them to fund joint work
* Sell to ARM, nVidia, Intel for mass production
Hardware acceleration for neural nets is already huge; the whole current wave of Deep Learning happened because GPUs became cheap enough. Google have enough services that need NNs that they built their own ASIC, the TPU1.
Stage 0: Come up with a reasonable hardware design
Stage 1: DO PATENT REVIEW, then get team together
Stage 2+ as previous section
Aim for getting it out there in 5 years - any sooner and FPGA will dominate, any later and too much risk.
Use the AI Index 2018 annual report for evidence of the AI gold rush. Neuracore sells the shovels: “During the gold rush it's a good time to be in the pick and shovel business” - Mark Twain
From: https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx
If we assume that neural nets will be a major consumer of power in the future, and that power is limited by convenience (on a phone), cost (servers) or CO2 emissions (climate change), then there is a case for a power-efficient hardware implementation of neural networks.
DNNs are everywhere and growing in popularity; however, the popular hardware is very general and not power efficient. This limits both the scale that can be trained and the scope for deployment. Typically a 2-slot PCIe card can consume 300W and only a small number of them fit in a server. GPUs from nVidia are the current favourite; these perform fp16 calculations (was fp32) using a dedicated architecture of local SIMD processors and local data. FPGAs are also receiving more attention; they are good at convolutional neural networks (say why). Any 10x improvement over current technology must both reduce the transistor count (so as to reduce power) and be very memory-bandwidth efficient (so as not to have a memory bottleneck). The field is moving fast, so any solution must be easily adoptable in a short time period.
In order to make an impact any solution must be complete, that is, almost invisible to the user. It needs to improve on the three major operations: the forward pass, the backward pass and the gradient accumulation.
The final pass, model update, can be formulated to have a lower computational cost (e.g. updating only whenever there is a significant change) and is also not standardised in approach (ref: S. Ruder). There are also other operations (e.g. softmax and batch normalisation) that are best suited to a general purpose processor.
Guiding Principle | Why |
---|---|
Minimise power | Aids deployability: (1) researchers get more compute so can build bigger models, so will buy; (2) sells into areas not currently accessible (e.g. mobile). Cost and transistor count probably correlate with power, but they are secondary considerations |
Scalable | 200W for data centre, 20W for laptop and 2W for phone |
Sufficiently flexible | Blocker: if can't implement what's needed then it won't be used |
State of art results | Blocker: if better results elsewhere then people will go elsewhere |
Easily trainable | Blocker: if not TensorFlow/pyTorch then adoption will be too slow |
Analog: Originally it was thought that the main limitation on power and speed was the MAC operations; switched capacitors can do both multiply and accumulate. The main objection to analog is that it's not reproducible, so it wouldn't gain the acceptance of the scientific community, which drives adoption. Also, it's throughput not speed that counts, and ternary weights with hierarchical carry save adders can accumulate with few gates compared with the memory subsystem.
Massively parallel accumulate: Memory bandwidth can be kept to a minimum by keeping the weights in processor memory. However, it's too much bandwidth to accumulate changes in main memory, so one solution is to have a local carry save adder for every weight. This was considered a hardware step too far: it would mean several times more transistors, and it pushes the problem to later.
Binarised Neural Networks (BNN) consider the activations and weights to be stochastic binary or ternary values. That is, the weights/activations are real valued and give the probability that +1, (0) or -1 is used. In raw form they still underperform (ref: BNN+) but there are promising results if continuous values are used (ref: Neural Networks with Few Multiplications).
So, create a processor with new, very wide registers and instructions on those registers. Registers can hold an array of nbit values and multiply/accumulate them with binary or ternary weights (binary weights would be stored in a register with 1 bit per weight, ternary with two; the same hardware does both operations). nbit is small, say 4 or 8. Implement nbit 8 first as it can be used to emulate nbit 4.
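A software model of the stochastic ternary quantiser that would fill a weight bank might look like this (the function name and the choice of |w| as the firing probability follow the BNN formulation above; the rest is an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_ternary(w, rng=rng):
    """Quantise real weights in [-1, 1] to {-1, 0, +1}; E[result] == w."""
    p = np.abs(np.clip(w, -1.0, 1.0))       # P(emit sign(w)) is |w|
    fire = rng.random(np.shape(w)) < p
    return np.where(fire, np.sign(w), 0.0)

q = stochastic_ternary(np.array([0.9, -0.9, 0.05, 0.0]))
# each q[i] is in {-1, 0, +1}; large |w| almost always fires, w == 0 never does
```

Because the quantiser is unbiased, averaging over many stochastic draws recovers the real-valued weight, which is what makes training on the quantised hardware plausible.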
The main operation is a single tile of a large matrix multiply.
Assuming only one layer, which fits into hardware, the aim is to get training into 3 * nvec cycles.
```
# training
(background) stochastically quantise weights and store in one of nbank

# fwd
(background) load one of nbank into the weight vector
for t in 0:nvec
    pull in quant(obs(t)) from RAM and pass through mat mul
    store out(t) back in RAM
    (background) quantise out(t) and write to one of nbank
calculate deltas

# bwd
(background) load the weight vector transposed
for t in 0:nvec
    pull in quant(delta(t)) from RAM and pass through mat mul
    store out(t) back in RAM

# acc
switch to quantised outputs as weights
for t in 0:nvec
    pull in quant(delta(t)) from RAM and pass through mat mul
    accumulate result in 16 bit output vector
    special accumulate results in the nbig max and min indices within nvec (2*nbig in total)

asynchronously update weights using recipe of choice
```
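The three phases can be modelled in numpy for a single layer that fits in hardware. Everything here (the quantiser, squared loss, the sizes and the learning rate) is an assumption for illustration; the point is the data flow of fwd, bwd and acc with quantised weights:

```python
import numpy as np

rng = np.random.default_rng(1)

def quantise(w):                            # stochastic ternary quantiser
    p = np.abs(np.clip(w, -1.0, 1.0))
    return np.where(rng.random(w.shape) < p, np.sign(w), 0.0)

nvec, nin, nout = 8, 4, 3
W = rng.normal(0.0, 0.3, (nout, nin))       # real-valued master weights
obs = rng.normal(size=(nvec, nin))          # nvec observation vectors
target = rng.normal(size=(nvec, nout))

Wq = quantise(W)                            # load one quantised bank
out = obs @ Wq.T                            # fwd: nvec mat-vec products
delta = out - target                        # calculate deltas (squared loss)
grad_in = delta @ Wq                        # bwd: same weights, transposed
acc = delta.T @ obs                         # acc: per-weight gradient sums
W -= 0.01 * acc                             # asynchronous update, recipe of choice
```

Each phase is nvec matrix-vector products through the same quantised weight array, matching the 3 * nvec cycle budget above.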
Need to think through more layers and the tiled matrix multiply.
* Neural Network Processor is really general, but just an application for which there must be prior art
I seem to have just reinvented the TPU, albeit with lower precision and lower latency.
For inference, sample the top bit of the weights, and possibly lower bits as well. Run all of these and combine to get a reasonable-precision weight vector (what is reasonable: 2 or 4 bits?).
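One way to sketch "run one binary pass per weight bit and combine" is to split a k-bit fixed-point weight matrix into binary bit planes, do one binary matmul per plane, and combine the partial results with shifts. This is an illustrative model (unsigned weights in [0, 1), hypothetical `bitplane_matmul` name), not the hardware design:

```python
import numpy as np

def bitplane_matmul(W, x, k=4):
    """Split k-bit fixed-point weights in [0, 1) into binary planes,
    run one binary matmul per plane, combine partial results by shifts."""
    Wfix = np.floor(W * (1 << k)) / (1 << k)       # quantise to k bits
    acc = np.zeros(W.shape[0])
    for b in range(1, k + 1):
        plane = np.floor(Wfix * (1 << b)) % 2      # b-th fractional bit
        acc += (plane @ x) / (1 << b)              # binary matmul, then shift
    return acc, Wfix

W = np.array([[0.5, 0.25], [0.75, 0.0625]])
x = np.array([1.0, 2.0])
approx, Wfix = bitplane_matmul(W, x)
# approx reproduces Wfix @ x exactly: the planes reconstruct the k-bit weights
```

The cost is k binary passes, so "how many bits is reasonable" is directly a question of how many passes can be afforded.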
For training, randomly sample the weights. Include support for transposing the quantised weights so that the backward pass uses the same random sampling.
For the accumulate (acc) phase, write the activations and gradient to memory and randomly sample the activations to get the binary/ternary weights. Again need fancy addressing modes.
Need to flesh all of this out to find out the memory bottleneck.
Need to simulate it all to check that it works.
Can use FPGA if simulation checks out fine.
Need supporting functions, like ReLU or scaling so that everything stays in fixed point range.
All designs need local memory for weights. Use banked weights, as the operation needs to be able to deal with larger matrix multiplies than can fit in hardware. The backward pass is to scale the gradients and use the same hardware; it is not yet known whether the weights need to be stored transposed or not. The accumulate pass is to use the gradients and outputs and accumulate changes in gradient. Some of the biggest are selected and their addresses used as outputs, there being a separate call to read and reset the accumulated changes. This is based on (US10152676) Distributed training of models using stochastic gradient descent, and we need to steer clear of that patent.
Some fault tolerance is low cost, a broken MAC unit can be detected and wired around.
Problem: Say one in a million (1000 x 1000) weights are selected for update; then it takes too long to update them all. Also we don't have the time to read/reset the value, along with any other time that weight is duplicated, multiply by a learning rate and change the weight. How should this be done? Lots of ARM cores running the same program?
How many bits for weights, activations, etc.? Neural Networks with Few Multiplications has ternary weights, which is appealing. Longer training times are fine if there are better results; indeed, it could be a reason for a monopoly - Neuracore is the only way to get the best results. Look at https://github.com/hantek/BinaryConnect. Weights are only binary in training; at inference time they are full precision - so need to explore “We also explored the performance if we sample those weights during test time. With ternary connect at test time, the same model (the one reaches 1.15% error rate) yields 1.49% error rate, which is still fairly acceptable”, i.e. how many bits of weights would give negligible degradation? Could do 2 bits at once and run twice for 4-bit weights. Also see https://www.reddit.com/r/MachineLearning/comments/3p1nap/151003009_neural_networks_with_few/. Still relevant.
Slightly later work from the same authors: https://arxiv.org/abs/1602.02830. Claim binary weights and activations in this paper. TOP OF READING LIST.
BIG PROBLEM WITH WORK TO DATE - still need a MUL for computing changes - somehow had assumed that it wasn't needed.
BINARY WEIGHTS HACK: Instead of clipping, do gradient descent in a hidden variable that is passed through tanh, that is w = tanh(x). Do gradient descent in x, but only keep w. d(tanh x)/dx = 1 - tanh^2 x, or dw/dx = 1 - w^2, so given a gradient, multiply it by (1 - w^2) then make the change. To see this works: when w = 0, (1 - w^2) = 1, so gradient descent happens as normal. However, as w approaches -1 or 1 then (1 - w^2) goes to zero, so the weights never get to -1 or +1.
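A numeric check of the hack, applying the (1 - w^2)-scaled step directly to w as described (a first-order version of stepping in x and re-applying tanh; the function name is illustrative):

```python
# w = tanh(x), descent happens in x, so the gradient seen in w-space is
# scaled by dw/dx = 1 - w**2: unchanged near w = 0, vanishing as |w| -> 1,
# so w never escapes the interval (-1, 1).

def tanh_step(w, grad_w, lr=0.1):
    """One descent step on the hidden variable, expressed directly in w."""
    return w - lr * grad_w * (1.0 - w**2)

w = 0.0
for _ in range(100):
    w = tanh_step(w, grad_w=-1.0)   # constant pull towards +1
# w has crept towards +1 but cannot cross it
```

Even under a constant gradient pulling towards +1, the step size shrinks like (1 - w^2), so w saturates asymptotically rather than clipping.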
Main article http://www.jmlr.org/papers/volume18/16-456/16-456.pdf
Easy overview: https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx
Three big GEMMs: forward, backward and accumulate.
Question: is acc a fixed batch size followed by an update, or is it more Hogwild-like: accumulate until big enough to do something significant? The first is a GEMM, the second isn't. If all are GEMMs, then do they tile? That is, does the hardware support operating on all submatrices to compute an arbitrarily large solution?
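The tiling question can be made concrete in software: if the hardware holds only a TILE x TILE weight block, an arbitrarily large GEMM is a triple loop over submatrices that accumulates partial products over the inner dimension. The tile size and divisibility requirement here are illustrative assumptions:

```python
import numpy as np

TILE = 4                                    # hardware tile size (illustrative)

def tiled_matmul(A, B):
    """Compute A @ B one TILE x TILE submatrix product at a time,
    accumulating partial products over the inner dimension."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % TILE == 0 and m % TILE == 0 and k % TILE == 0
    C = np.zeros((n, m))
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            for p in range(0, k, TILE):     # accumulate over inner dim
                C[i:i+TILE, j:j+TILE] += (
                    A[i:i+TILE, p:p+TILE] @ B[p:p+TILE, j:j+TILE])
    return C

rng = np.random.default_rng(2)
A, B = rng.normal(size=(8, 12)), rng.normal(size=(12, 8))
# tiled_matmul(A, B) matches the single big GEMM A @ B
```

If all three passes are GEMMs, this same loop structure serves fwd, bwd and acc; the hardware question is whether the accumulation over p can happen in the output registers.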
Question: is it possible to reverse address the same memory or do we need to store a transposed copy?
Multiplication is local. It could be an n-bit by n-bit multiply, or one value could be in the log domain so multiplication is just a bit shift, or both could be in the log domain and use an addition.
Hierarchical summation to output. Expect vector size of about 1024 so 10 steps to output. Use Carry Save Adders so that none of those 10 steps are full adds. Only do the carry propagate at the very end.
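A bit-level model of the carry save idea: a 3:2 compressor turns three addends into two (per-bit sum and shifted carry) using only parallel gates, so a tree of them reduces the ~1024 inputs with no carry propagation until a single full add at the very end. This sketch uses Python ints as bit vectors and assumes non-negative values:

```python
def compress_3_2(a, b, c):
    """3:2 carry-save compressor; a + b + c == s + carry."""
    s = a ^ b ^ c                                   # per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1      # per-bit carry, shifted
    return s, carry

def csa_sum(values):
    """Reduce a list of addends with a tree of 3:2 compressors."""
    vals = list(values)
    while len(vals) > 2:                # one tree level per iteration
        nxt = []
        while len(vals) >= 3:
            nxt.extend(compress_3_2(vals.pop(), vals.pop(), vals.pop()))
        nxt.extend(vals)                # 0-2 leftovers pass straight through
        vals = nxt
    return sum(vals)                    # the only carry-propagate add
```

Each 3:2 level is constant depth in gates, which is why 10 levels over 1024 inputs costs far less than 10 full adds.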
Gradient accumulators: Maintain a Carry Save Adder accumulator for each weight.
Inputs are digital, the summation is analog based on switched capacitors. The multiplication may be done in analog or by selection of appropriate capacitors. The summation is the connection of all capacitors, most of which will cancel each other out.
Disadvantages: Switched capacitors aren't well developed. It's analog, so it won't give the same results each time. Gradient accumulation has to be analog, so there are potential problems as it's long-term analog (it will decay as it accumulates, and will drop charge if we context switch to another application).
Analog is low power, longer time to develop, more IP. As unknown, larger scales, 32nm?
Digital is repeatable, known, can use 7nm or whatever TSMC is at.
neuralcomputer.ai, neuralprocessor.ai, neurocpu.ai, neuroprocessor, neurocode (or .io), neurochip.ai, neuromorph.{ai,io}, neurocore.io
4 a b = (a + b)^2 - (a - b)^2
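A quick check of the quarter-square identity: a multiplier can be built from two squaring elements (e.g. a device with a square-law response) plus add/subtract. The function name is illustrative:

```python
# 4ab = (a + b)^2 - (a - b)^2, so two squarers and a subtract give a multiply

def quarter_square_mul(a, b):
    return ((a + b) ** 2 - (a - b) ** 2) / 4

# e.g. quarter_square_mul(3.0, -7.0) == -21.0
```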
Diodes implement exponential which, as a Taylor series, is dominated by square for a useful region.
Saturated MOSFET has a square law https://inst.eecs.berkeley.edu/~ee105/fa03/handouts/lectures/Lecture12.pdf
Abandon: we only need mul in a few units, and the deviation from the square law is likely to be very high.
TO DO
Review all these: https://cacm.acm.org/magazines/2018/4/226374-chips-for-artificial-intelligence/fulltext
Need a fast Hamming distance.
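For bit-packed binary vectors, Hamming distance is just XOR plus popcount, and for {-1, +1} vectors encoded as bit masks the dot product follows directly. A sketch (function names are illustrative):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit-packed vectors: XOR then popcount."""
    return bin(a ^ b).count("1")

def pm1_dot(a: int, b: int, n: int) -> int:
    """Dot product of two n-long {-1, +1} vectors packed as bit masks:
    n matching bits contribute +1 each, differing bits -1 each."""
    return n - 2 * hamming(a, b)
```

In hardware this is an XOR layer feeding the same hierarchical adder tree used for accumulation, which is what makes binary activations attractive.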
Analysis and Design of a Passive Switched-Capacitor Matrix Multiplier for Approximate Computing. Paper: https://arxiv.org/abs/1612.00933. S. Simon Wong. Target this as it's close to implementable and still accurate enough - unlike the floating memory.
Design Automation for Binarized Neural Networks: A Quantum Leap Opportunity?
https://www.sciencedirect.com/science/article/pii/S092523121000216X
http://hasler.ece.gatech.edu/ including her course and the reference list
https://en.wikipedia.org/wiki/Neuromorphic_engineering
An ultra-low energy internally analog, externally digital vector-matrix multiplier based on NOR flash memory technology
Fixed capacitor patent: http://www.freepatentsonline.com/9069995.html
https://www.eetimes.com/document.asp?doc_id=1332971&page_number=7