====== How to start up a startup ======

=== Why is it unique? ===
  
Extreme power efficiency is obtained through low precision integer operation (with supporting software): single propagation delay addition and very low propagation delay multiplication.
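
As a rough illustration of what the supporting software has to do (a hypothetical sketch, not the actual design), the idea is that software quantises weights and activations down to small integers and keeps track of the scale factors, so the hardware only ever sees cheap integer multiply-accumulate:

<code python>
# Hypothetical sketch (not the actual design): symmetric 4-bit quantisation of
# weights and activations, with the dot product done entirely in integer arithmetic.
# Tracking the scale factors is the "supporting software" part; the integer
# multiply-accumulate is the piece the hardware would make cheap.
import numpy as np

def quantise_int4(x):
    """Map float values onto the integer grid [-8, 7] with a per-tensor scale."""
    scale = max(np.max(np.abs(x)) / 7.0, 1e-12)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int32)
    return q, scale

def int4_dot(a, b):
    """Dot product using only integer multiply-accumulate, rescaled once at the end."""
    qa, sa = quantise_int4(a)
    qb, sb = quantise_int4(b)
    acc = int(np.dot(qa, qb))      # integer accumulation
    return acc * sa * sb           # single float rescale per output

rng = np.random.default_rng(0)
a, b = rng.standard_normal(64), rng.standard_normal(64)
print(int4_dot(a, b), np.dot(a, b))  # quantised vs full-precision result
</code>

Everything inside the accumulation stays integer; the one float rescale per output is the cost paid by the supporting software.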
  
=== How is it going to be successful? ===
  
License technology into a massive market: from servers through laptops, phones and smart watches.
  
=== Draft a quick business plan so that you have a story to tell others ===
  
Hardware acceleration for Neural Nets is already huge; the whole current wave of Deep Learning happened because GPUs became cheap enough.  Google have enough services that need NNs to build their own ASIC, the TPU.  Facebook is driven by AI, and the trend towards increasing automation is massive and well known.
  
  * Stage 0:  Come up with a reasonable hardware design
  * Stage 1:  Do patent review, then get team together
  * Stage 2:  Partner with ARM (local and known), get them to fund joint work
  * Stage 3:  Sell to ARM, broaden base.  Retain sufficient IP to be independent.
  
Aim for getting it out there in 5 years - any sooner and FPGA will dominate, any later and too much risk.

=== Competitors ===
  
From: [[https://www.engineering.com/Hardware/ArticleID/16753/The-Great-Debate-of-AI-Architecture.aspx|The Great Debate of AI Architecture]]
  
  * Nvidia - DNN training is a major part of their strategy
  * Intel ([[https://ai.intel.com/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon|Nervana]] (estimated $408 million) and [[https://www.movidius.com|Movidius]]) - Need to maintain leading position
  * ARM [[https://developer.arm.com/products/processors/machine-learning/arm-ml-processor|ML Processor]] - FPGA to rewire a fixed point unit with local controller and memory.  Claim 4 TOPS per Watt.

=== Idea killers ===
  
  * Consumer/research grade has to be:
    * Faster than GPU, FPGA or TPU
    * Cheaper than GPU and FPGA (e.g. has more RAM)
    * Easy enough to use (will be less precision than fp16)
  * Need to get memory side-by-side with logic to get the bandwidth (see the back-of-envelope sketch after this list)
  * Must be able to do training on chip as something will need this in 5 years time, e.g. AGI
  * Must be flexible enough to keep up with the NN developments in the next 5 years, including training
  * Hardware people have fixated on CNNs - are they right?  What does everyone want to use?
  * Must be able to use all common SGD optimisation techniques.
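
A back-of-envelope check of the bandwidth point (illustrative assumptions only - the ~320 GB/s figure is a rough GDDR6-class number): if weights have to be streamed from off-chip memory for every matrix-vector product, bandwidth rather than arithmetic sets the ceiling, which is why the memory has to sit next to the logic.

<code python>
# Back-of-envelope: how many matrix-vector products per second can be streamed
# from off-chip memory?  Illustrative assumptions, not measured figures.
bandwidth_bytes_per_s = 320e9      # assumed GDDR6-class bandwidth, ~320 GB/s
n = 4096                           # assumed square weight matrix dimension
bytes_per_weight = 0.5             # 4-bit weights = half a byte each

weight_bytes = n * n * bytes_per_weight
matvecs_per_s = bandwidth_bytes_per_s / weight_bytes
print(f"{weight_bytes / 1e6:.1f} MB of weights -> {matvecs_per_s:,.0f} matrix-vector products/s")

# If the weights instead live in on-chip memory next to the MAC units, this
# streaming limit disappears and the arithmetic sets the pace.
</code>
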
== Problem statement/Diagnosis ==
  
DNNs are everywhere and are growing in popularity; however, the popular hardware is very general and not power efficient.  This limits both the scale at which models can be trained and the scope for deployment.  Typically a 2-slot PCIe card can consume 300W and only a small number of them can fit in a server.  GPUs from Nvidia are the current favourite; these perform fp16 calculations (previously fp32) using a dedicated architecture of local SIMD processors and local data.  FPGAs are also receiving more attention; they are good at convolutional neural networks (say why).  Any 10x improvement over current technology must both reduce the transistor count (so as to reduce power) and be very memory bandwidth efficient (so as not to have a memory bottleneck).  The field is moving fast, so any solution must be easily adoptable in a short time period.
  
In order to make an impact any solution must be complete, that is, almost invisible to the user.  It needs to improve on the three major operations (a minimal worked example follows the list):
  * forward:  The inference, or forward pass of a DNN model
  * backward:  The backward error propagation pass of stochastic gradient descent, which accumulates gradients over the batch
  * update:  The work needed to scale the batch gradient into a weight update (may need complex CPU-like operations)
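
To make the three operations concrete, here is a minimal single-layer example in plain NumPy (just an illustration, not tied to any particular accelerator):

<code python>
# Minimal single linear layer trained with SGD, showing the three operations.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 4)) * 0.1   # weights: 4 inputs -> 10 outputs
x = rng.standard_normal((32, 4))         # a batch of 32 inputs
t = rng.standard_normal((32, 10))        # matching regression targets
lr = 0.01

# forward: inference / forward pass of the model
y = x @ W.T

# backward: error propagation, gradient accumulated over the whole batch
err = y - t                              # dLoss/dy for a squared-error loss
grad_W = err.T @ x                       # sums the contribution of all 32 samples

# update: scale the batch gradient into a weight update
W -= lr * grad_W / x.shape[0]            # the step that may want general-purpose compute
</code>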
  
== Guiding Principles ==

| Sufficiently flexible | Blocker:  if can't implement what's needed then it won't be used |
| State of art results  | Blocker:  if better results elsewhere then people will go elsewhere |
| Easily trainable      | Blocker:  if not TensorFlow/PyTorch then adoption will be too slow (see sketch below) |
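
One way the "Easily trainable" blocker could be handled (a hypothetical sketch, assuming a PyTorch front end): wrap the accelerator's low-precision matmul in a custom autograd Function so that existing training scripts keep working unchanged; here a simple round-to-4-bit fake quantiser stands in for the real hardware call.

<code python>
# Hypothetical sketch: exposing a low-precision accelerator to PyTorch so users
# keep their normal training loop.  fake_int4 is a software stand-in for what
# would really be a call into the accelerator's driver.
import torch

def fake_int4(x):
    """Round a tensor onto a symmetric 4-bit grid (stand-in for hardware)."""
    scale = x.abs().max().clamp(min=1e-8) / 7.0
    return torch.clamp(torch.round(x / scale), -8, 7) * scale

class Int4Linear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return fake_int4(x) @ fake_int4(w).t()   # would run on the accelerator

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        # straight-through estimate: gradients taken w.r.t. the unquantised values
        return grad_out @ w, grad_out.t() @ x

x = torch.randn(32, 4, requires_grad=True)
w = torch.randn(10, 4, requires_grad=True)
loss = Int4Linear.apply(x, w).sum()
loss.backward()   # the rest of the PyTorch workflow is untouched
</code>

The point is that the user-facing API stays standard PyTorch; only the matmul inside changes.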
  
=== Rejected and company closed ===
  
After considering many designs, including analog and ternary weights, I ended up with 4 bit weights and activations.  This achieves the goals, albeit uncomfortably similar to the TPU.  The scale of work needed to make the transition from fp32/fp16 to 4 bit is too great - the first prototype would be noticed by the giants and the company would be overtaken (defending IP is very expensive).  This could well lead to a forced sale, which isn't great for anyone (especially founders/ordinary shareholders).

Start October 2018, end February 2019, minimal external costs.

EDIT: Reopened 4 Nov 2021