how_to_start_up_a_startup

Differences

This shows you the differences between two versions of the page.


Previous revision: how_to_start_up_a_startup [2020/04/27 06:56] admin
Current revision: how_to_start_up_a_startup [2020/05/17 07:32] admin [Example: Neuracore.ai]
Line 76: Line 76:
  * Do the right thing, act responsibly and with integrity.  Your reputation is more important than your current startup.
  
-=== Example:  Neuracore.ai ===
+==== Example:  Neuracore.ai ====
  
-==== What is the company going to do? ====
+Neuracore was incorporated on 12 December 2018 and dissolved on 30 April 2019.
  
-Design and license cores for the efficient execution of neural networks.  The efficiency is obtained by having vector and matrix "registers" and hardware to perform the vector * matrix operations (plus associated operations such as addition, ReLU, sigmoid and softmax on vectors).  The engine does complete stochastic gradient descent, not just inference.
+=== What is the company going to do? ===
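To make the workload concrete, here is a minimal NumPy sketch of the per-layer operations such a core would accelerate (the sizes and NumPy usage are illustrative assumptions, not Neuracore's actual design):

<code python>
# Illustrative only: the vector/matrix ops the core is meant to accelerate.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 1024, 1024                 # assumed tile-sized layer
W = rng.standard_normal((n_out, n_in)) * 0.01
x = rng.standard_normal(n_in)

z = W @ x                                # matrix * vector (the expensive part)
relu = np.maximum(z, 0.0)                # ReLU on a vector
sigmoid = 1.0 / (1.0 + np.exp(-z))       # sigmoid on a vector
softmax = np.exp(z - z.max())
softmax /= softmax.sum()                 # softmax on a vector
</code>

Training additionally needs the transposed product and an outer-product gradient accumulation, which is why the engine targets complete SGD rather than inference only.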
  
-==== Why is it unique? ====
+Design and license cores for the efficient execution of neural networks.  This will enable "AI Everywhere".
  
-The market is dominated by GPUs with FPGAs upcoming.  Totally analog (neuromorphic) hardware has been on the horizon for a while but can't be trained.  Neuracore will both be trainable with standard software (TensorFlow, PyTorch) and achieve a massive improvement in TFLOP/watt - it will be the most power efficient NN hardware available.
+=== Why is it unique? ===
  
-==== How is it going to be successful? ====
+Extreme power efficiency obtained through low precision integer operation (with supporting software): single propagation delay addition and very low propagation delay multiplication.
  
-PoC - simulation (SPICE)
+=== How is it going to be successful? ===
-Patent protection of key ideas
-Partner with ARM, get them to fund joint work
-Sell to ARM, nVidia, Intel for mass production
  
-==== Draft quick business plan so that you have a story to tell others ====
+License technology into a massive market, from servers through laptops, phones and smart watches.
  
-Hardware acceleration for Neural Nets is already huge, the whole current wave of Deep Learning happened because GPUs became cheap enough.  Google have enough services that need NNs to build their own ASIC, the TPU1.
+=== Draft a quick business plan so that you have a story to tell others ===
  
-Stage 0:  Come up with a reasonable hardware design
+Hardware acceleration for Neural Nets is already huge; the whole current wave of Deep Learning happened because GPUs became cheap enough.  Google have enough services that need NNs to build their own ASIC, the TPU.  Facebook is driven by AI, and the trend towards increasing automation is massive and well known.
  
-Stage 1:  DO PATENT REVIEW then get team together
-Stage 2+ as previous section
+  * Stage 0:  Come up with a reasonable hardware design
+  * Stage 1:  Do patent review then get team together
+  * Stage 2:  Partner with ARM (local and known), get them to fund joint work
+  * Stage 3:  Sell to ARM, broaden base.  Retain sufficient IP to be independent.
  
 Aim for getting it out there in 5 years - any sooner and FPGA will dominate, any later and too much risk.
  
 Use the AI Index 2018 annual report for evidence of the AI gold rush.  Neuracore sells the shovels: "During the gold rush it's a good time to be in the pick and shovel business" - Mark Twain.
  
-==== Competitors ====
+=== Competitors ===
  
-From: https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx
+From: [[https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx|The Great Debate of AI Architecture]]
  
  * Nvidia - DNN training is a major part of their strategy
  * Intel ([[https://ai.intel.com/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon|Nervana]] (estimated $408 million) and [[https://www.movidius.com|Movidius]]) - Need to maintain leading position
  * ARM [[https://developer.arm.com/products/processors/machine-learning/arm-ml-processor|ML Processor]] - FPGA to rewire a fixed point unit with local controller and memory.  Claim 4 TOps/s per Watt.
Line 127: Line 125:
  * [[https://globalnewstribune.com/2019/02/20/global-deep-learning-chipsets-market-2019-2025-markets-major-players-are-google-intel-xilinx-amd-nvidia-arm-qualcomm-ibm-graphcore-brainchip-mobileye-wave-computing-ceva-movidius-nerv/|Global Deep Learning Chipsets Market 2019-2025: Markets Major Players Are-, Google, Intel, Xilinx, AMD...]]
  
-==== Idea killers ====
+=== Idea killers ===
  
-  * Consumer grade
+  * Consumer/research grade has to be:
    * Faster than GPU, FPGA or TPU
    * Cheaper than GPU and FPGA (e.g. has more RAM)
    * Easy enough to use (will be less precision than fp16)
-    * Ideally more accurate - runs ternary weights or something like that
-  * Can't get memory side-by-side with logic so don't get the bandwidth
+  * Need to get memory side-by-side with logic to get the bandwidth
  * Must be able to do training on chip as something will need this in 5 years time, e.g. AGI
  * Must be flexible enough to keep up with the NN developments in the next 5 years, including training
-  * Memory banks must be close to the MACs - does it use too much memory?  GDDR6 can run at ~320 GB/s, so ~1 GB of weights can only be streamed ~320 times/s - not fast!  (see the worked example after this list)
  * Hardware people have fixated on CNNs - are they right?  What does everyone want to use?
  * Must be able to use all common SGD optimisation techniques.
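A rough check of the bandwidth point above; the 320 GB/s figure is quoted from the list, while the weight sizes are assumptions for illustration:

<code python>
# Rough bandwidth arithmetic for the "memory side-by-side with logic" point.
# 320 GB/s is the GDDR6 figure quoted above; the weight sizes are assumptions.
BANDWIDTH = 320e9  # bytes/s

def streams_per_second(n_weights, bytes_per_weight):
    """How many times per second the whole weight set can be read from DRAM."""
    return BANDWIDTH / (n_weights * bytes_per_weight)

# A single 1024x1024 layer at fp16 is small - it could be streamed ~150k times/s.
print(streams_per_second(1024 * 1024, 2))

# A ~1 GB model (e.g. 500M fp16 weights) can only be streamed ~320 times/s,
# so weights must live next to the MACs rather than be re-fetched per sample.
print(streams_per_second(500e6, 2))
</code>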
Line 143: Line 139:
 If we assume that neural nets will be a major consumer of power in the future, and that power is limited by convenience (on a phone) or cost (servers) or CO2 emissions (climate change), then there is a case for a power efficient hardware implementation of neural networks.
  
-===== Technical Summary =====
+=== Technical Summary ===

-==== Problem statement/Diagnosis ====
+== Problem statement/Diagnosis ==
  
 DNNs are everywhere and are growing in popularity; however, the popular hardware is very general and not power efficient.  This limits both the scale which can be trained and the scope for deployment.  Typically a 2 slot PCIe card can consume 300W and a small number of them can fit in a server.  GPUs from Nvidia are the current favourite; these perform fp16 calculations (was fp32) using a dedicated architecture of local SIMD processors and local data.  FPGAs are also receiving more attention; they are good at convolutional neural networks (say why).  Any 10x improvement over current technology must both reduce the transistor count (so as to reduce power) and be very memory bandwidth efficient (so as not to have a memory bottleneck).  The field is moving fast, so it must be easily adoptable in a short time period.
  
 In order to make an impact any solution must be complete, that is, almost invisible to the user.  It needs to improve on the three major operations (a sketch follows below):
-  * fwd:  The inference, or forward pass of a DNN model
-  * bwd:  The backward error propagation pass of stochastic gradient descent
-  * acc:  The combination of fwd and bwd results to get an error signal which is accumulated over a batch
+  * forward:  The inference, or forward pass of a DNN model
+  * backward:  The backward error propagation pass of stochastic gradient descent, which accumulates gradients over a batch
+  * update:  The work needed to scale the batch gradient into a weight update (may need complex CPU-like operations)
-The final pass, model update, can be formulated as computationally lower cost (e.g. updating only whenever there is a significant change) and is also not standardised in approach (ref: [[http://ruder.io/optimizing-gradient-descent|S. Ruder]]).  There are also other operations (e.g. softmax and batch normalisation) that are best suited to a general purpose processor.
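As a reference point for those operations, here is a minimal NumPy sketch of one dense layer processed as three big matrix multiplies over a batch (the shapes and names are illustrative assumptions):

<code python>
# The three big matrix multiplies for one dense layer, batch form.
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 32, 512, 256        # assumed sizes
W = rng.standard_normal((n_out, n_in)) * 0.01
X = rng.standard_normal((batch, n_in))   # activations coming in
dY = rng.standard_normal((batch, n_out)) # error signal coming back

Y  = X @ W.T            # forward:  inference pass
dX = dY @ W             # backward: propagate the error to the previous layer
dW = dY.T @ X           # accumulate: batch gradient for the weights

W -= 0.01 * dW / batch  # update: scale the batch gradient into a weight update
</code>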
  
-==== Guiding Principles ====
+== Guiding Principles ==
  
 ^ Guiding Principle ^ Why ^
Line 162: Line 157:
 | Sufficiently flexible | Blocker:  if can't implement what's needed then it won't be used |
 | State of art results  | Blocker:  if better results elsewhere then people will go elsewhere |
 | Easily trainable      | Blocker:  if not TensorFlow/PyTorch then adoption will be too slow |
-
-==== Rejected ====
-
-Analog:  Originally it was thought that the main limitation on power and speed was the MAC operations; switched capacitors can do both multiply and accumulate.  The main objection to analog is that it's not reproducible, so it wouldn't gain the acceptance of the scientific community, which drives adoption.  Also, it's throughput not speed that counts, and ternary weights and hierarchical carry save adders can accumulate without many gates compared with the memory subsystem.
-
-Massively parallel accumulate:  Memory bandwidth can be kept to a minimum by keeping the weights in processor memory.  However, it's too much bandwidth to accumulate changes in main memory, so one solution is to have a local carry save adder for every weight.  This was considered a hardware step too far: it would mean several times more transistors and it pushes the problem to later.
-
-==== Best Foot Forward ====
-
-Binarised Neural Networks (BNN) consider the activations and weights to be stochastic binary or ternary values.  That is, the weights/activations are real valued and give the probability that the values +1, (0), -1 are used.  In raw form they still underperform (ref: [[https://openreview.net/forum?id=SJfHg2A5tQ|BNN+]]) but there are promising results if continuous values are used (ref: [[https://arxiv.org/abs/1510.03009|Neural Networks with Few Multiplications]]).
-
-So, create a processor with new very wide registers and instructions on those registers.  Registers can hold an array of nbit values and multiply/accumulate them with binary or ternary weights (binary weights would be stored in a register with 1 bit per weight, ternary with two; the same hardware does both operations).  nbit is small, say 4 or 8.  Implement nbit 8 first as it can be used to emulate nbit 4.
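A small sketch of the stochastic ternary idea described above; the clipping and scaling choices here are my assumptions, as the papers differ on the details:

<code python>
# Stochastic ternary quantisation: a real-valued weight in [-1, 1] is read as
# the probability of emitting +1 or -1, with the remainder going to 0.
import numpy as np

rng = np.random.default_rng(0)

def quantise_ternary(w, rng):
    """Sample ternary weights {-1, 0, +1} so that E[q] == w (w clipped to [-1, 1])."""
    w = np.clip(w, -1.0, 1.0)
    u = rng.random(w.shape)
    return np.where(u < np.abs(w), np.sign(w), 0.0)

W = rng.uniform(-1, 1, size=(4, 6))
samples = np.mean([quantise_ternary(W, rng) for _ in range(10000)], axis=0)
print(np.max(np.abs(samples - W)))   # small: the sampled weights are unbiased
</code>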
-
-The main operation is a single tile of a large matrix multiply (see the sketch after this list).
-
-  * The vector size, nvec, is large, at least 256, maybe 1024 or 4096.
-    * If we could get 1000 * 1000 at 1 GHz that would be 1 petaop/s
-  * There are nvec * nvec cells and each cell:
-    * has nbank of (write only) weight storage and one weight active
-    * has a low precision multiply
-      * ternary is lowest gate count and power
-      * nibble would be accepted better
-    * feeds into a hierarchical carry save adder
-      * Hierarchical is debatable, could be difficult to route
-      * Add 4 (or 8?) numbers at once to speed up and flatten the hierarchy.  Tutorial at http://www.ecs.umass.edu/ece/koren/arith/slides/Part5c-add.ppt
-  * Low precision multiply is expected to be binary or ternary
-    * Ternary multiplication is xor with the sign bit, and zero if the zero bit is set, plus setting a carry bit if needed
-    * Ternary multiplication may be repeated to give arbitrary precision bit-sliced multiplication
-  * There are very many vector sized registers:
-    * output registers are 16 bit and hold the accumulated result
-  * Summation is achieved with hierarchical carry save adders.  Carry bits are only resolved at the very end.  The precision of the addition increases by one bit at every stage (so guaranteed no overflow - it would be possible to save a few transistors, but not many).
-  * There is hardware support for randomly sampling from the full precision (say s16) weight to get the binary/ternary weight vector.
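To show that a tile of the multiply really can be done with sign flips and additions only, here is a small emulation (pure NumPy, not a gate-level model; nvec is shrunk so it runs instantly, and the bit widths are assumptions):

<code python>
# Emulate one nvec x nvec tile: ternary weights (sign bit + zero bit) times an
# integer activation vector, using only negation, masking and addition.
import numpy as np

rng = np.random.default_rng(0)
nvec = 256                                   # real design: 1024+; 1000x1000 at 1 GHz ~ 1 petaop/s
acts = rng.integers(-8, 8, size=nvec)        # low precision (e.g. s4) activations
sign = rng.integers(0, 2, size=(nvec, nvec)) # 1 bit per weight: negate or not
zero = rng.random((nvec, nvec)) < 0.5        # 1 bit per weight: weight is zero

# Per cell: conditional negation from the sign bit, zero if the zero bit is
# set, then accumulate into a wide output register (64 bit here).
cell = np.where(zero, 0, np.where(sign == 1, -acts, acts))
out = cell.sum(axis=1, dtype=np.int64)

# Check against an ordinary matrix multiply with the equivalent ternary matrix.
W = np.where(zero, 0, np.where(sign == 1, -1, 1))
assert np.array_equal(out, W @ acts)
</code>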
-
-=== Operations ===
-
-Assuming only one layer, which fits into hardware, the aim is to get training into 3 * nvec cycles.
-
-<code>
-# training
-
-(background) stochastically quantise weights and store in one of nbank
-
-# fwd
-(background) load one of nbank into the weight vector
-for t in 0:nvec
-  pull in quant(obs(t)) from RAM and pass through mat mul
-  store out(t) back in RAM
-  (background) quantise out(t) and write to one of nbank
-
-calculate deltas
-
-# bwd
-(background) load the weight vector transposed
-for t in 0:nvec
-  pull in quant(delta(t)) from RAM and pass through mat mul
-  store out(t) back in RAM
-
-# acc
-switch to quantised outputs as weights
-for t in 0:nvec
-  pull in quant(delta(t)) from RAM and pass through mat mul
-  accumulate result in 16 bit output vector
-    special accumulate records the nbig max and min indices within nvec
-  for the 2*nbig selected weights, asynchronously update them using the recipe of choice
-
-</code>
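The "special accumulate" at the end of the acc phase is the least standard step, so here is a tiny NumPy sketch of the idea as I read it: accumulate a 16-bit gradient vector, pick out the nbig largest and smallest entries, and only update those weights (the names and the plain-SGD update recipe are assumptions):

<code python>
# Sketch of the acc phase: accumulate gradients, then only touch the weights
# whose accumulated change is largest in magnitude (nbig maxima and minima).
import numpy as np

rng = np.random.default_rng(0)
nvec, nbig, lr = 1024, 8, 0.01

weights = rng.standard_normal(nvec)
acc = np.zeros(nvec, dtype=np.int16)          # 16 bit accumulator register

for t in range(32):                           # stand-in for the nvec steps
    grad = rng.integers(-64, 64, size=nvec)   # quantised per-step contribution
    acc = (acc + grad).astype(np.int16)       # wraps like real 16 bit hardware would

# "special accumulate": indices of the nbig most positive and most negative sums
order = np.argsort(acc)
selected = np.concatenate([order[:nbig], order[-nbig:]])

# asynchronous update with a recipe of choice (plain SGD here)
weights[selected] -= lr * acc[selected]
acc[selected] = 0                             # read-and-reset the accumulator
</code>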
-
-Need to think through more layers and tiled matrix multiply.
-
-=== FAQ ===
-
-  - Q: Why binary/ternary weights?  A: The main guiding principle is power consumption.  s4, s8 or fp16 multiply at each cell would dominate the active transistor count.  Well, maybe not s4, that's only two more conditional bit shifts and carry save adders (3bit x 3bit multiply then sign change to next stage; sign change is just xor plus set a low carry bit, but can we do sign change on carry save??).  Ummm....
-  - Q: How does this improve over the [[http://njiot.blogspot.com/2017/04/google-tpu-tensor-process-unit.html|TPU]] ([[https://translate.google.com/translate?hl=en&sl=zh-TW&u=http://njiot.blogspot.com/2017/04/google-tpu-tensor-process-unit.html&prev=search|English]])?
-    * TPU uses the same big matrix multiply
-    * TPU uses a systolic array, no hierarchical add.  So O(n) latency, not O(log(n)), but same throughput
-    * Latest found article: https://medium.com/@antonpaquin/whats-inside-a-tpu-c013eb51973e
-    * [[https://github.com/UCSBarchlab/OpenTPU|OpenTPU]]
-    * Why better than [[https://www.mdpi.com/2079-9292/8/1/78|Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights]]?
-  - Q: Will this run sparse Neural Networks?  A: Yes, block sparse ones, in the same way that other tiled matrix accelerators can miss out some blocks (see the sketch below).
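On the block-sparse point in the last answer, the mechanism is just "skip tiles that are all zero"; a minimal sketch (tile size and layout are assumptions):

<code python>
# Block-sparse matrix-vector multiply: a tiled accelerator can simply skip
# any weight tile that is entirely zero.
import numpy as np

rng = np.random.default_rng(0)
n, tile = 512, 128
W = rng.standard_normal((n, n))
W[:, tile:2 * tile] = 0.0                # one column-block of tiles pruned away
x = rng.standard_normal(n)

y = np.zeros(n)
for i in range(0, n, tile):
    for j in range(0, n, tile):
        block = W[i:i + tile, j:j + tile]
        if not block.any():              # hardware: a per-tile "all zero" flag
            continue                     # tile skipped - no MACs, no weight fetch
        y[i:i + tile] += block @ x[j:j + tile]

assert np.allclose(y, W @ x)
</code>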
-==== Patents ====
-
-  * [[https://patents.google.com/patent/US20160342891A1/en|Neural Network Processor]] - really general, but just an application for which there must be prior art
-
-===== THE END =====
-
-I seem to have just reinvented the TPU, albeit with lower precision and lower latency.
-
-----
-
-==== Old below ====
-
-For inference, sample the top bit of the weights, and possibly lower bits as well.  Run all of these and combine to get a reasonable precision weight vector (what is reasonable, 2 or 4 bits?).
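A sketch of the "sample the top bit, and possibly lower bits, then combine" idea; treating the weight as a signed integer and running one ternary pass per bit plane is my reading of it (an assumption, not a worked-out spec):

<code python>
# Bit-sliced multiplication: run the same low-precision pass once per bit plane
# and combine the partial results with shifts.  The weights here are s4
# integers in [-7, 7] (sign plus 3 magnitude bits), an assumption for the demo.
import numpy as np

rng = np.random.default_rng(0)
n = 256
W = rng.integers(-7, 8, size=(n, n))          # s4 weights
x = rng.integers(-8, 8, size=n)               # low precision activations

signs = np.sign(W)
mag = np.abs(W)

total = np.zeros(n, dtype=np.int64)
for bit in range(3):                          # one pass per magnitude bit plane
    plane = ((mag >> bit) & 1) * signs        # a ternary {-1, 0, +1} matrix
    total += (plane @ x) << bit               # combine with a shift

assert np.array_equal(total, W @ x)
</code>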
-
-For training, randomly sample the weights.  Include support for transposing the quantised weights so that the backward pass uses the same random sampling.
-
-For the accumulate (acc) phase, write the activations and gradients to memory and randomly sample the activations to get the binary/ternary weights.  Again, fancy addressing modes are needed.
-
-Need to flesh all of this out to find the memory bottleneck.
-
-Need to simulate it all to check that it works.
-
-Can use an FPGA if simulation checks out fine.
-
-Need supporting functions, like ReLU or scaling, so that everything stays in fixed point range.
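For the supporting functions, a sketch of what "ReLU plus scaling so everything stays in fixed point range" might look like (the wide-accumulator-to-s8-activation convention here is an assumption):

<code python>
# Fixed point supporting ops: ReLU, then shift-and-saturate the wide
# accumulator back into a narrow s8 activation for the next layer.
import numpy as np

def relu_rescale(acc, shift):
    """ReLU an integer accumulator, divide by 2**shift, saturate to s8."""
    acc = np.maximum(acc, 0)                  # ReLU
    acc = acc >> shift                        # cheap scaling: arithmetic shift
    return np.clip(acc, 0, 127).astype(np.int8)

acc = np.array([-30000, -5, 0, 90, 1000, 32000], dtype=np.int32)
print(relu_rescale(acc, shift=4))             # [0 0 0 5 62 127]
</code>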
-
-All designs need local memory for weights.  Use banked weights, as the operation needs to be able to deal with larger matrix multiplies than can fit in hardware.  The backward pass is to scale the gradients and use the same hardware; it is not yet known whether the weights need to be stored transposed or not.  The accumulate pass is to use the gradients and outputs and accumulate changes in gradient.  Some of the biggest are selected and their addresses used as outputs, there being a separate call to read and reset the accumulated changes.  Based on [[http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&f=G&l=50&d=PTXT&S1=%2210152676+%22&OS=|(US10152676) Distributed training of models using stochastic gradient descent]], and we need to steer clear of this.
-
-Some fault tolerance is low cost; a broken MAC unit can be detected and wired around.
-
-Problem:  Say one in a million (1000 x 1000) weights are selected for update; then it takes too long to update them all.  Also, we don't have the time to read/reset the value (along with any other time that weight is duplicated), multiply by a learning rate and change the weight.  How should this be done?  Lots of ARM cores running the same program?
-
-===== Design decisions =====
-
-How many bits for weights, activations, etc.?  [[https://arxiv.org/pdf/1510.03009.pdf|Neural Networks with Few Multiplications]] has ternary weights, which is appealing.  Longer training times are fine if the results are better; indeed it could be a reason for a monopoly - Neuracore is the only way to get the best results.  Look at https://github.com/hantek/BinaryConnect.  Weights are only binary in training; at inference time they are full precision - so need to explore "We also explored the performance if we sample those weights during test time. With ternary connect at test time, the same model (the one reaches 1.15% error rate) yields 1.49% error rate, which is still fairly acceptable", i.e. how many bits of weights would give negligible degradation?  Could do 2 bits at once and run twice for 4 bit weights.  Also see https://www.reddit.com/r/MachineLearning/comments/3p1nap/151003009_neural_networks_with_few/.  Still relevant.
-
-Slightly later work from the same authors: https://arxiv.org/abs/1602.02830.  They claim binary weights and activations in this paper.  TOP OF READING LIST.
-  * can use a real sigmoid, not a hard sigmoid; just backprop through it then you don't need any clipping
-
-BIG PROBLEM WITH WORK TO DATE - still need a MUL for computing changes - somehow I had assumed that it wasn't needed.
-
-BINARY WEIGHTS HACK:  Instead of clipping, do gradient descent in a hidden variable that is passed through tanh, that is w = tanh(x).  Do gradient descent in x, but only keep w.  d(tanh x)/dx = 1 - tanh^2(x), or dw/dx = 1 - w^2, so given a gradient, multiply it by (1 - w^2) then make the change.  To see this works: when w = 0, (1 - w^2) = 1, so gradient descent happens as normal.  However, as w approaches -1 or 1, (1 - w^2) goes to zero, so the weights never get to -1 or +1.
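A quick numerical check of the hack above (the constant gradient and learning rate are illustrative):

<code python>
# Binary weights hack: w = tanh(x) with the descent done via the (1 - w^2)
# factor from d(tanh)/dx, so w stays inside (-1, 1) with no explicit clipping.
lr = 0.1
w = 0.0
for step in range(20):
    grad = -1.0                        # illustrative: a constant pull towards +1
    w -= lr * grad * (1 - w ** 2)      # scaled step; shrinks as w nears +/-1
print(w)                               # ~0.97: approaching 1 but still inside (-1, 1)
</code>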
-
-Main article: http://www.jmlr.org/papers/volume18/16-456/16-456.pdf
-
-Easy overview: https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx
-
-Three big GEMMs:
-  * fwd:  Forward pass - the only one needed for inference
-  * bwd:  Backward pass - may need to scale deltas to keep them in range
-  * acc:  Accumulate - accumulate gradients
-
-Question:  Is acc a fixed batch size followed by an update, or is it more Hogwild-like, accumulating until big enough to do something significant?  The first is a GEMM, the second isn't.  If all are GEMMs then do they tile?  That is, does the hardware support operating on all submatrices to compute an arbitrarily large solution?
-
-Question:  Is it possible to reverse address the same memory or do we need to store a transposed copy?
-
-===== All digital =====
-
-Multiplication is local.  It could be an n-bit by n-bit multiply, or one value could be in the log domain so multiplication is just a bit shift, or both could be in the log domain and use an addition.
-
-Hierarchical summation to output.  Expect a vector size of about 1024, so 10 steps to output.  Use Carry Save Adders so that none of those 10 steps are full adds.  Only do the carry propagate at the very end (see the sketch below).
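A tiny demonstration of the carry save idea referenced above: a 3:2 compressor reduces three addends to a sum word and a carry word using only bitwise operations, so the expensive carry propagation happens once at the end (plain Python; the operand values are arbitrary):

<code python>
# Carry save addition: reduce many addends to two words with 3:2 compressors
# (XOR for the sum, majority for the carry), then do one real add at the end.
def csa(a, b, c):
    """3:2 compressor: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word

def csa_tree_sum(values):
    values = list(values)
    while len(values) > 2:
        a, b, c = values.pop(), values.pop(), values.pop()
        values.extend(csa(a, b, c))      # no carry propagation yet
    return sum(values)                   # the single full add at the very end

nums = [17, 255, 1023, 4, 99, 384, 7, 12345]
assert csa_tree_sum(nums) == sum(nums)
print(csa_tree_sum(nums))
</code>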
  
-Gradient accumulators:  Maintain a Carry Save Adder accumulator for each weight.
+=== Rejected and company closed ===
  
+After considering many designs, including analog and ternary weights, I ended up with 4 bit weights and activations.  This achieves the goals, albeit uncomfortably similar to the TPU.  The scale of work needed to make the transition from fp32/fp16 to 4 bit is too great - the first prototype would be noticed by the giants and the company would be overtaken (defending IP is very expensive).  This could well lead to a forced sale, which isn't great for anyone (especially founders/ordinary shareholders).
  
 +Start October 2018, end February 2019, minimal external costs.