Neuracore

What is the company going to do?

Design and license cores for the efficient execution of neural networks. The efficiency is obtained by having vector and matrix “registers” and hardware to perform the vector * matrix = vector operations (plus associated operations on vectors, such as addition, ReLU, sigmoid and softmax). The engine does complete stochastic gradient descent, not just inference.

Why is it unique?

The market is dominated by GPUs, with FPGAs upcoming. Totally analog (neuromorphic) hardware has been on the horizon for a while but can't be trained. Neuracore will both be trainable with standard software (TensorFlow, PyTorch) and achieve a massive improvement in TFLOPS/watt - it will be the most power efficient NN hardware available.

How is it going to be successful?

  • PoC - simulation (SPICE)
  • Patent protection of key ideas
  • Partner with ARM, get them to fund joint work
  • Sell to ARM, nVidia, Intel for mass production

Draft a quick business plan so that you have a story to tell others

Hardware acceleration for Neural Nets is already huge, the whole current wave of Deep Learning happened because GPUs became cheap enough. Google have enough services that need NNs to build their own ASIC, the TPU1.

Stage 0: Come up with a reasonable hardware design

Stage 1: DO PATENT REVIEW then Get team together

Stage 2+ as previous section

Aim for getting it out there in 5 years - any sooner and FPGA will dominate, any later and too much risk.

Use the AI Index 2018 annual report for evidence of the AI gold rush. Neuracore sells the shovels: “During the gold rush it's a good time to be in the pick and shovel business” (Mark Twain).

Competitors

From: https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx

Idea killers

  • Consumer grade
    • Faster than GPU, FPGA or TPU
    • Cheaper than GPU and FPGA (e.g. has more RAM)
    • Easy enough to use (will be less precision than fp16)
    • Ideally more accurate - runs ternary weights or something like that
  • Can't get memory side-by-side with logic so don't get the bandwidth
  • Must be able to do training on chip as something will need this in 5 years time, e.g. AGI
  • Must be flexible enough to keep up with the NN developments in the next 5 years, including training
  • Memory banks must be close to MAC - does it use too much memory? GDDR6 can run at 320 GB/s, so a 1024×1024 matrix (~1 MB at one byte per weight) can be streamed about 300,000 times/s - not fast!
  • Hardware people have fixated on CNNs - are they right? What does everyone want to use?
  • Must be able to use all common SGD optimisation techniques.
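
The memory bandwidth bullet above is a back-of-envelope calculation, and a small sketch makes it easy to redo for other weight widths. The function name and the choice of bits per weight are assumptions for illustration; only the 320 GB/s figure and the 1024×1024 tile come from the notes.

```python
def matrix_loads_per_second(bandwidth_bytes_per_s, rows, cols, bits_per_weight):
    """How many times per second a rows x cols weight matrix can be
    streamed from memory at the given bandwidth (hypothetical helper)."""
    matrix_bytes = rows * cols * bits_per_weight / 8
    return bandwidth_bytes_per_s / matrix_bytes


GDDR6_BYTES_PER_S = 320e9  # figure quoted in the notes

# Assumed weight widths: 2 bits (ternary), 8 bits, 16 bits.
loads_ternary = matrix_loads_per_second(GDDR6_BYTES_PER_S, 1024, 1024, 2)
loads_8bit = matrix_loads_per_second(GDDR6_BYTES_PER_S, 1024, 1024, 8)
loads_16bit = matrix_loads_per_second(GDDR6_BYTES_PER_S, 1024, 1024, 16)
```

Under these assumptions, narrower weights buy a proportional increase in reload rate, which is one more argument for keeping weights local and low precision.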

If we assume that neural nets will be a major consumption of power in the future, and that power is limited by convenience (on a phone) or cost (servers) or CO2 emissions (climate change) then there is the case for a power efficient hardware implementation of neural networks.

Technical Summary

Problem statement/Diagnosis

DNNs are everywhere and are growing in popularity; however, the popular hardware is very general and not power efficient. This limits both the scale that can be trained and the scope for deployment. Typically a 2 slot PCIe card can consume 300W and only a small number of them can fit in a server. GPUs from nVidia are the current favourite; these perform fp16 calculations (previously fp32) using a dedicated architecture of local SIMD processors and local data. FPGAs are also receiving more attention; they are good at convolutional neural networks (say why). Any 10x improvement over current technology must both reduce the transistor count (so as to reduce power) and be very memory bandwidth efficient (so as not to have a memory bottleneck). The field is moving fast, so it must be easily adoptable in a short time period.

In order to make an impact any solution must be complete, that is almost invisible to the user. It needs to improve on the three major operations:

  • fwd: The inference, or forward pass of a DNN model
  • bwd: The backward error propagation pass of stochastic gradient descent
  • acc: The combination of fwd and bwd results to get an error signal which is accumulated over a batch

The final pass, model update, can be formulated to have a lower computational cost (e.g. updating only when there is a significant change) and is also not standardised in approach (ref: S. Ruder). There are also other operations (e.g. softmax and batch normalisation) that are best suited to a general purpose processor.

Guiding Principles

Guiding Principle | Why
Minimise power | Aids deployability: (1) researchers get more power so can build bigger models so will buy (2) sells into areas not currently accessible (e.g. mobile). Cost and transistor count probably correlate with power, but they are secondary considerations
Scalable | 200W for data centre, 20W for laptop and 2W for phone
Sufficiently flexible | Blocker: if it can't implement what's needed then it won't be used
State of art results | Blocker: if better results are available elsewhere then people will go elsewhere
Easily trainable | Blocker: if not TensorFlow/PyTorch then adoption will be too slow

Rejected

Analog: Originally it was thought that the main limitation on power and speed was the MAC operations; switched capacitors can do both multiply and accumulate. The main objection to analog is that it's not reproducible, so it wouldn't gain the acceptance of the scientific community which drives adoption. Also, it's throughput not speed that counts, and with ternary weights and hierarchical carry save adders the accumulation needs few gates compared with the memory subsystem.

Massively parallel accumulate: Memory bandwidth can be kept to a minimum by keeping the weights in processor memory. However, it takes too much bandwidth to accumulate changes in main memory, so one solution is to have a local carry save adder for every weight. This was considered a hardware step too far: it would mean several times more transistors and it pushes the problem to later.

Best Foot Forward

Binarised Neural Networks (BNN) consider the activations and weights to be stochastic binary or ternary values. That is, the weights/activations are real valued and give the probability that the values +1, (0) or -1 are used. In raw form they still underperform (ref: BNN+) but there are promising results if continuous values are used (ref: Neural Networks with Few Multiplications).
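
A minimal sketch of the stochastic sampling described above, assuming the real-valued weight lies in [-1, 1] and that the probabilities are chosen so the expected sample equals the weight. The function names and the exact probability mapping are assumptions for illustration, not taken from the referenced papers.

```python
import random


def sample_ternary(w, rng):
    """Return -1, 0 or +1 such that the expected value is w.

    Assumes -1 <= w <= 1: the sample is non-zero with probability |w|
    and then takes the sign of w.
    """
    if rng.random() < abs(w):
        return 1 if w > 0 else -1
    return 0


def sample_binary(w, rng):
    """Return -1 or +1 such that the expected value is w.

    Assumes -1 <= w <= 1: +1 is drawn with probability (w + 1) / 2.
    """
    return 1 if rng.random() < (w + 1) / 2 else -1


rng = random.Random(0)  # fixed seed so the sketch is repeatable
ternary_samples = [sample_ternary(0.3, rng) for _ in range(20000)]
binary_samples = [sample_binary(-0.5, rng) for _ in range(20000)]
ternary_mean = sum(ternary_samples) / len(ternary_samples)
binary_mean = sum(binary_samples) / len(binary_samples)
```

Averaged over many draws the sampled weights recover the real-valued weight, which is what lets a 1 or 2 bit register stand in for the full precision value.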

So, create a processor with new very wide registers and instructions on those registers. Registers can hold an array of nbit values and multiply/accumulate them with binary or ternary weights (binary weights would be stored in a register with 1 bit per weight, ternary with two, same hardware does both operations). nbit is small, say 4 or 8. Implement nbit 8 first as it can be used to emulate nbit 4.

The main operation is a single tile of a large matrix multiply.

  • The vector size, nvec, is large, at least 256, maybe 1024 or 4096.
    • If we could get 1000 * 1000 operations per cycle at 1GHz that would be 1 petaop/s
  • There are nvec * nvec cells and each cell does a low precision multiply, expected to be binary or ternary:
    • Ternary multiplication is xor with the sign bit and zero if the zero bit is set, plus setting a carry bit if needed
    • Ternary multiplication may be repeated to give arbitrary precision bit-sliced multiplication
  • There are very many vector sized registers:
    • output registers are 16 bit and hold the accup
  • Summation is achieved with hierarchical carry save adders. Carry bits are only resolved at the very end. The precision of the addition increases by one bit at every stage (so guaranteed no overflow - it would be possible to save a few transistors, but not many).
  • There is hardware support for randomly sampling from the full precision (say s16) weight to get the binary/ternary weight vector.
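
The ternary multiply in the list above can be sketched in software. This assumes a two's complement nbit activation and a weight encoded as two bits (sign, zero); that encoding and the function names are assumptions for illustration. Negation is the xor with the sign bit plus a carry of 1, and the carry is left unresolved so that a later adder stage can absorb it.

```python
def ternary_mul(x, sign, zero, nbit=8):
    """Multiply a signed nbit integer x by a ternary weight.

    The weight is encoded as (sign, zero): zero set means 0,
    otherwise sign set means -1 and sign clear means +1.
    Returns (bits, carry_in) where bits is an nbit pattern and
    carry_in is the deferred +1 of the two's complement negation.
    """
    mask = (1 << nbit) - 1
    if zero:
        return 0, 0
    flip = mask if sign else 0          # xor every bit with the sign bit
    return (x & mask) ^ flip, 1 if sign else 0


def resolve(bits, carry_in, nbit=8):
    """Add the deferred carry and reinterpret as a signed nbit integer."""
    total = (bits + carry_in) & ((1 << nbit) - 1)
    return total - (1 << nbit) if total >> (nbit - 1) else total
```

The most negative value (-128 for nbit 8) cannot be negated within nbit bits, so the sketch is only exact for -127..127; in the design above the adder tree widens by one bit per stage, which would cover that case.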

Operations

Assuming only one layer which fits into hardware the aim is to get training into 3 * nvec cycles.

# training

(background) stochastically quantise weights and store in one of nbank

# fwd
(background) load one of nbank into the weight vector
for t in 0:nvec
  pull in quant(obs(t)) from RAM and pass through mat mul
  store out(t) back in RAM
  (background) quantise out(t) and write to one of nbank

calculate deltas

# bwd
(background) load the weight vector transposed
for t in 0:nvec
  pull in quant(delta(t)) from RAM and pass through mat mul
  store out(t) back in RAM
  
# acc
switch to quantised outputs as weights
for t in 0:nvec
  pull in quant(delta(t)) from RAM and pass through mat mul
  accumulate result in 16 bit output vector
    special accumulate results in nbig max and min index within nvec
  for 2*nbig asynchronously update weights using recipe of choice
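
The "special accumulate" step in the acc pass above reports the nbig largest and nbig smallest accumulated entries, so that only 2*nbig weights need an update. A minimal software sketch of that selection, with the read-and-reset behaviour assumed rather than specified (function names are hypothetical):

```python
import heapq


def select_updates(acc, nbig):
    """Return (indices of the nbig largest, indices of the nbig smallest)
    entries of the accumulator vector acc."""
    idx = range(len(acc))
    top = heapq.nlargest(nbig, idx, key=acc.__getitem__)
    bottom = heapq.nsmallest(nbig, idx, key=acc.__getitem__)
    return top, bottom


def read_and_reset(acc, index):
    """Read one accumulated change and clear it, as a separate call."""
    value = acc[index]
    acc[index] = 0
    return value
```

The update recipe itself (learning rate, momentum and so on) is deliberately left out, matching the "recipe of choice" in the schedule above.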

Need to think through more layers and tiled matrix multiply

FAQ

  1. Q: Why binary/ternary weights? A: the main guiding principle is power consumption. s4, s8 or fp16 multiply at each cell would dominate the active transistor count. Well maybe not s4, that's only two more conditional bit shifts and carry save adders (3bit x 3bit multiply then sign change to next stage. sign change is just xor plus set a low carry bit, but can do sign change on carry save??). Ummm….
  2. Q: How does this improve over the TPU?
  3. Q: Will this run sparse Neural Networks? A: Yes, block sparse ones, in the same way that other tiled matrix accelerators can miss out some blocks.

Patents

  • Neural Network Processor: really general, but just an application for which there must be prior art

THE END

I seem to have just reinvented the TPU, albeit with lower precision and lower latency.


Old below

For inference sample the top bit of the weights, and possibly lower bits as well. Run all of these and combine to get a reasonable precision weight vector (what is reasonable, 2, or 4 bits?).
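
Combining several bit slices into a higher precision result can be sketched as follows. This assumes a sign-magnitude split of each integer weight into ternary digits, so that one ternary pass per bit position is run and the partial sums are combined with shifts; the decomposition and the function names are assumptions for illustration.

```python
def ternary_slices(w, nslices):
    """Split a signed integer w into digits t_k in {-1, 0, 1}
    with w == sum(t_k << k). Requires abs(w) < 2**nslices."""
    sign = -1 if w < 0 else 1
    mag = abs(w)
    return [sign * ((mag >> k) & 1) for k in range(nslices)]


def bit_sliced_dot(xs, ws, nslices):
    """Dot product of xs and ws using one ternary pass per weight bit."""
    slices = [ternary_slices(w, nslices) for w in ws]
    total = 0
    for k in range(nslices):
        # One pass through the ternary hardware with the k-th slice.
        partial = sum(x * s[k] for x, s in zip(xs, slices))
        total += partial << k
    return total
```

With nslices passes the result is exact for weights up to nslices bits of magnitude; stopping after the top one or two slices gives the reduced precision variant discussed above.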

For training, randomly sample the weights. Include support for transposing the quantised weights so that the backward pass uses the same random sampling.

For the accumulate (acc) phase, write the activations and gradient to memory and randomly sample the activations to get the binary/ternary weights. Again need fancy addressing modes.

Need to flesh all of this out to find out the memory bottleneck.

Need to simulate it all to check that it works.

Can use FPGA if simulation checks out fine.

Need supporting functions, like ReLU or scaling so that everything stays in fixed point range.

All designs need local memory for weights. Use banked weights, as the operation needs to be able to deal with larger matrix multiplies than can fit in hardware. The backward pass scales the gradients and uses the same hardware; it is not yet known whether the weights need to be stored transposed or not. The accumulate pass uses the gradients and outputs to accumulate changes in gradient. Some of the biggest are selected and their addresses used as outputs, there being a separate call to read and reset the accumulated changes. This is based on (US10152676) Distributed training of models using stochastic gradient descent, and we need to steer clear of it.

Some fault tolerance is low cost, a broken MAC unit can be detected and wired around.

Problem: Say one in a million (1000 x 1000) weights is selected for update, then it takes too long to update them all. Also don't have the time to read/reset the value along with any other time that weight is duplicated, multiply by a learning rate and change the weight. How should this be done? Lots of ARM cores running the same program?

Design decisions

How many bits for weights, activations, etc.? Neural Networks with Few Multiplications has ternary weights, which is appealing. Longer training times are fine if they give better results; indeed it could be a reason for a monopoly - Neuracore is the only way to get the best results. Look at https://github.com/hantek/BinaryConnect. Weights are only binary in training, at inference time they are full precision - so need to explore “We also explored the performance if we sample those weights during test time. With ternary connect at test time, the same model (the one reaches 1.15% error rate) yields 1.49% error rate, which is still fairly acceptable”, i.e. how many bits of weights would give negligible degradation? Could do 2 bits at once and run twice for 4 bit weights. Also see https://www.reddit.com/r/MachineLearning/comments/3p1nap/151003009_neural_networks_with_few/. Still relevant.

Slightly later work from the same authors: https://arxiv.org/abs/1602.02830. Claim binary weights and activations in this paper. TOP OF READING LIST.

  • can use real sigmoid, not hard sigmoid, just backprop through it then you don't need any clipping

BIG PROBLEM WITH WORK TO DATE - still need a MUL for computing changes - somehow had assumed that it wasn't needed.

BINARY WEIGHTS HACK: Instead of clipping, do gradient descent in a hidden variable that is passed through tanh, that is w = tanh(x). Do gradient descent in x, but only keep w. d (tanh x) / dx = 1 - tanh^2 x or dw/dx = 1 - w^2, so given a gradient multiply it by (1 - w^2) then make the change. To see that this works, when w = 0, (1 - w^2) = 1, so gradient descent happens as normal. However, as w approaches -1 or +1, (1 - w^2) goes to zero, so weights never get to -1 or +1.
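
A minimal sketch of the tanh reparameterisation above. Only w is stored; the hidden variable x is recovered with atanh, stepped with the chain-rule gradient, and mapped back through tanh. Recovering x this way, and the function name, are assumptions for illustration (the notes only say to multiply the gradient by 1 - w^2 and make the change).

```python
import math


def tanh_step(w, grad_w, lr):
    """One gradient descent step on the hidden variable x, with w = tanh(x).

    w      : current weight, strictly inside (-1, 1)
    grad_w : gradient of the loss with respect to w
    lr     : learning rate
    """
    x = math.atanh(w)                    # recover the hidden variable
    x -= lr * grad_w * (1.0 - w * w)     # dL/dx = dL/dw * (1 - w^2)
    return math.tanh(x)                  # only w is kept


w_mid = tanh_step(0.0, 1.0, 0.01)        # near zero: an ordinary step
w_edge = tanh_step(0.999, -1.0, 0.1)     # near +1: the step shrinks
```

Near w = 0 the step is almost exactly -lr * grad_w, while near the ends the same gradient barely moves the weight, so no clipping is needed.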

Main article http://www.jmlr.org/papers/volume18/16-456/16-456.pdf

Easy overview: https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx

Three big GEMMs:

  • fwd: Forward pass - the only one needed for inference
  • bwd: Backward pass - may need to scale deltas to keep them in range
  • acc: Accumulate - accumulate gradients
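
For a single fully connected layer the three GEMMs above are the standard ones: fwd is Y = X W, bwd is dX = dY W^T and acc is dW = X^T dY. A dependency-free sketch, with ternary weights used only as an example and all function names hypothetical:

```python
def matmul(a, b):
    """Plain list-of-lists matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def fwd(x, w):
    """Forward pass: (batch, n_in) @ (n_in, n_out)."""
    return matmul(x, w)


def bwd(delta, w):
    """Backward pass: propagate deltas through the transposed weights."""
    return matmul(delta, transpose(w))


def acc(x, delta):
    """Accumulate: weight gradient summed over the batch."""
    return matmul(transpose(x), delta)


x = [[1, 2], [3, 4]]                 # batch of two observations
w = [[1, 0, -1], [0, 1, 1]]          # ternary weights, n_in=2, n_out=3
y = fwd(x, w)
delta = y                            # deltas for the loss 0.5 * sum(y^2)
dx = bwd(delta, w)
dw = acc(x, delta)
```

All three calls have the same shape of inner loop, which is why one tile of matrix multiply hardware can serve fwd, bwd and acc if the transposed operands can be addressed.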

Question: is acc a fixed batch size followed by an update, or is it more Hogwild-like: accumulate until big enough to do something significant? The first is a GEMM, the second isn't. If all are GEMMs then do they tile? That is, does the hardware support operating on all submatrices to compute an arbitrarily large solution?

Question: is it possible to reverse address the same memory or do we need to store a transposed copy?

All digital

Multiplication is local. It could be an n-bit by n-bit multiply, or one value could be in the log domain so that multiplication is just a bit shift, or both could be in the log domain and use an addition.
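
The two log domain options above can be sketched as follows, assuming a log domain value is stored as a sign bit and an integer exponent, i.e. it represents ±2^exponent. That representation and the function names are assumptions for illustration.

```python
def log_weight_mul(x, sign, exponent):
    """Multiply integer x by a weight of (-1)**sign * 2**exponent
    using only a shift and an optional negation."""
    if exponent >= 0:
        shifted = x << exponent
    else:
        shifted = x >> -exponent     # arithmetic shift, rounds down
    return -shifted if sign else shifted


def log_log_mul(a, b):
    """Multiply two log domain values, each a (sign, exponent) pair:
    signs xor, exponents add."""
    return a[0] ^ b[0], a[1] + b[1]
```

Neither form needs a multiplier array, at the cost of only representing powers of two.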

Hierarchical summation to output. Expect vector size of about 1024 so 10 steps to output. Use Carry Save Adders so that none of those 10 steps are full adds. Only do the carry propagate at the very end.
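
The carry save idea can be checked in software. This sketch uses 3:2 compressors (three operands in, a sum word and a carry word out, with no carry propagation) and only performs a full add on the last two words. Using 3:2 compressors is an assumption; the number of levels it reports follows that reduction and need not match the 10 pairwise steps mentioned above.

```python
def csa(a, b, c):
    """3:2 compressor: a + b + c == s + carry, computed bitwise."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def csa_tree_sum(values):
    """Sum integers with a tree of carry save adders.

    Returns (total, levels) where levels is the number of
    compressor stages used before the final carry propagate add.
    """
    terms = list(values)
    levels = 0
    while len(terms) > 2:
        nxt = []
        for i in range(0, len(terms) - 2, 3):
            nxt.extend(csa(terms[i], terms[i + 1], terms[i + 2]))
        # Pass through the one or two operands left over at this level.
        nxt.extend(terms[len(terms) - len(terms) % 3:])
        terms = nxt
        levels += 1
    return sum(terms), levels        # the only carry propagate add
```

Python integers behave as unbounded two's complement values, so the sketch also holds for negative operands without modelling the one bit of growth per stage.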

Gradient accumulators: Maintain a Carry Save Adder accumulator for each weight.

Analog summation

Inputs are digital, the summation is analog based on switched capacitors. The multiplication may be done in analog or by selection of appropriate capacitors. The summation is the connection of all capacitors, most of which will cancel each other out.

Disadvantages: Switched capacitors aren't well developed. It's analog so won't get the same results each time. Gradient accumulation has to be analog, so there are potential problems as it's long term analog storage (it will decay as it accumulates, and will drop charge if there is a context switch to another application).

Analog vs digital

Analog is low power, takes a longer time to develop, and has more IP. As it is unknown, it would be at larger scales, 32nm?

Digital is repeatable, known, can use 7nm or whatever TSMC is at.

Old ideas

on names

neuralcomputer.ai, neuralprocessor.ai neurocpu.ai neuroprocessor neurocode (or .io) neurochip.ai neuromorph.{ai,io}, neurocore.io

Multiplication via square

4 a b = (a + b)^2 - (a - b)^2

Diodes implement exponential which, as a Taylor series, is dominated by square for a useful region.

Saturated MOSFET has a square law https://inst.eecs.berkeley.edu/~ee105/fa03/handouts/lectures/Lecture12.pdf

Abandon as we don't need mul in a few units and the deviation from square is likely to be very high.

Multiplication via decomposition into bits

private/neuracore.txt · Last modified: 2019/06/05 12:41 by admin