====== How to start up a startup ======

=== Why is it unique? ===
  
Extreme power efficiency is obtained through low precision integer operation (with supporting software): single propagation delay addition and very low propagation delay multiplication.
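
As a rough illustration of what the supporting software has to do (a hypothetical sketch, not the actual design), the idea is that software quantises weights and activations down to small integers and keeps track of the scale factors, so the hardware only ever sees cheap integer multiply-accumulate:

<code python>
# Hypothetical sketch (not the actual design): symmetric 4-bit quantisation of
# weights and activations, with the dot product done entirely in integer arithmetic.
# Tracking the scale factors is the "supporting software" part; the integer
# multiply-accumulate is the piece the hardware would make cheap.
import numpy as np

def quantise_int4(x):
    """Map float values onto the integer grid [-8, 7] with a per-tensor scale."""
    scale = max(np.max(np.abs(x)) / 7.0, 1e-12)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int32)
    return q, scale

def int4_dot(a, b):
    """Dot product using only integer multiply-accumulate, rescaled once at the end."""
    qa, sa = quantise_int4(a)
    qb, sb = quantise_int4(b)
    acc = int(np.dot(qa, qb))      # integer accumulation
    return acc * sa * sb           # single float rescale per output

rng = np.random.default_rng(0)
a, b = rng.standard_normal(64), rng.standard_normal(64)
print(int4_dot(a, b), np.dot(a, b))  # quantised vs full-precision result
</code>

Everything inside the accumulation stays integer; the one float rescale per output is the cost paid by the supporting software.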
  
=== How is it going to be successful? ===
  
License technology into a massive market: from servers through laptops, phones and smart watches.
  
=== Draft a quick business plan so that you have a story to tell others ===
  
Hardware acceleration for Neural Nets is already huge; the whole current wave of Deep Learning happened because GPUs became cheap enough.  Google have enough services that need NNs to build their own ASIC, the TPU.  Facebook is driven by AI, and the trend towards increasing automation is massive and well known.
  
  * Stage 0:  Come up with a reasonable hardware design
  * Stage 1:  Do patent review, then get team together
  * Stage 2:  Partner with ARM (local and known), get them to fund joint work
  * Stage 3:  Sell to ARM, broaden base.  Retain sufficient IP to be independent.
  
Aim for getting it out there in 5 years - any sooner and FPGA will dominate, any later and too much risk.

=== Competitors ===
  
From: [[https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx|The Great Debate of AI Architecture]]
  
  * Nvidia - DNN training is a major part of their strategy
  * Intel ([[https://ai.intel.com/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon|Nervana]] (estimated $408 million) and [[https://www.movidius.com|Movidius]]) - Need to maintain leading position
  * ARM [[https://developer.arm.com/products/processors/machine-learning/arm-ml-processor|ML Processor]] - FPGA to rewire a fixed point unit with local controller and memory.  Claim 4 TOPS per Watt.

=== Idea killers ===
  
  * Consumer/research grade has to be:
    * Faster than GPU, FPGA or TPU
    * Cheaper than GPU and FPGA (e.g. has more RAM)
    * Easy enough to use (will be less precision than fp16)
  * Need to get memory side-by-side with logic to get the bandwidth (see the back-of-envelope sketch after this list)
  * Must be able to do training on chip as something will need this in 5 years time, e.g. AGI
  * Must be flexible enough to keep up with the NN developments in the next 5 years, including training
  * Hardware people have fixated on CNNs - are they right?  What does everyone want to use?
  * Must be able to use all common SGD optimisation techniques.
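
A back-of-envelope check of the bandwidth point (illustrative assumptions only - the ~320 GB/s figure is a rough GDDR6-class number): if weights have to be streamed from off-chip memory for every matrix-vector product, bandwidth rather than arithmetic sets the ceiling, which is why the memory has to sit next to the logic.

<code python>
# Back-of-envelope: how many matrix-vector products per second can be streamed
# from off-chip memory?  Illustrative assumptions, not measured figures.
bandwidth_bytes_per_s = 320e9      # assumed GDDR6-class bandwidth, ~320 GB/s
n = 4096                           # assumed square weight matrix dimension
bytes_per_weight = 0.5             # 4-bit weights = half a byte each

weight_bytes = n * n * bytes_per_weight
matvecs_per_s = bandwidth_bytes_per_s / weight_bytes
print(f"{weight_bytes / 1e6:.1f} MB of weights -> {matvecs_per_s:,.0f} matrix-vector products/s")

# If the weights instead live in on-chip memory next to the MAC units, this
# streaming limit disappears and the arithmetic sets the pace.
</code>
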
== Problem statement/Diagnosis ==
  
DNNs are everywhere and are growing in popularity; however, the popular hardware is very general and not power efficient.  This limits both the scale at which models can be trained and the scope for deployment.  Typically a 2-slot PCIe card can consume 300W and only a small number of them can fit in a server.  GPUs from Nvidia are the current favourite; these perform fp16 calculations (previously fp32) using a dedicated architecture of local SIMD processors and local data.  FPGAs are also receiving more attention; they are good at convolutional neural networks (say why).  Any 10x improvement over current technology must both reduce the transistor count (so as to reduce power) and be very memory bandwidth efficient (so as not to have a memory bottleneck).  The field is moving fast, so any solution must be easily adoptable in a short time period.
  
In order to make an impact any solution must be complete, that is, almost invisible to the user.  It needs to improve on the three major operations (a minimal worked example follows the list):
  * forward:  The inference, or forward pass of a DNN model
  * backward:  The backward error propagation pass of stochastic gradient descent, which accumulates gradients over the batch
  * update:  The work needed to scale the batch gradient into a weight update (may need complex CPU-like operations)
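
To make the three operations concrete, here is a minimal single-layer example in plain NumPy (just an illustration, not tied to any particular accelerator):

<code python>
# Minimal single linear layer trained with SGD, showing the three operations.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 4)) * 0.1   # weights: 4 inputs -> 10 outputs
x = rng.standard_normal((32, 4))         # a batch of 32 inputs
t = rng.standard_normal((32, 10))        # matching regression targets
lr = 0.01

# forward: inference / forward pass of the model
y = x @ W.T

# backward: error propagation, gradient accumulated over the whole batch
err = y - t                              # dLoss/dy for a squared-error loss
grad_W = err.T @ x                       # sums the contribution of all 32 samples

# update: scale the batch gradient into a weight update
W -= lr * grad_W / x.shape[0]            # the step that may want general-purpose compute
</code>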
  
== Guiding Principles ==

| Sufficiently flexible | Blocker:  if can't implement what's needed then it won't be used |
| State of art results  | Blocker:  if better results elsewhere then people will go elsewhere |
| Easily trainable      | Blocker:  if not TensorFlow/PyTorch then adoption will be too slow (see sketch below) |
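
One way the "Easily trainable" blocker could be handled (a hypothetical sketch, assuming a PyTorch front end): wrap the accelerator's low-precision matmul in a custom autograd Function so that existing training scripts keep working unchanged; here a simple round-to-4-bit fake quantiser stands in for the real hardware call.

<code python>
# Hypothetical sketch: exposing a low-precision accelerator to PyTorch so users
# keep their normal training loop.  fake_int4 is a software stand-in for what
# would really be a call into the accelerator's driver.
import torch

def fake_int4(x):
    """Round a tensor onto a symmetric 4-bit grid (stand-in for hardware)."""
    scale = x.abs().max().clamp(min=1e-8) / 7.0
    return torch.clamp(torch.round(x / scale), -8, 7) * scale

class Int4Linear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return fake_int4(x) @ fake_int4(w).t()   # would run on the accelerator

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        # straight-through estimate: gradients taken w.r.t. the unquantised values
        return grad_out @ w, grad_out.t() @ x

x = torch.randn(32, 4, requires_grad=True)
w = torch.randn(10, 4, requires_grad=True)
loss = Int4Linear.apply(x, w).sum()
loss.backward()   # the rest of the PyTorch workflow is untouched
</code>

The point is that the user-facing API stays standard PyTorch; only the matmul inside changes.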
  
=== Rejected and company closed ===
  
After considering many designs, including analog and ternary weights, I ended up with 4 bit weights and activations.  This achieves the goals, albeit uncomfortably similar to the TPU.  The scale of work needed to make the transition from fp32/fp16 to 4 bit is too great - the first prototype would be noticed by the giants and the company would be overtaken (defending IP is very expensive).  This could well lead to a forced sale, which isn't great for anyone (especially founders/ordinary shareholders).

Start October 2018, end February 2019, minimal external costs.

EDIT: Reopened 4 Nov 2021