Channel: PDP-11
The latest paper by David Patterson and the Google TPU team reveals details of the world's most efficient, and one of the most powerful, supercomputers for DNN acceleration: TPU v3, the one used to train BERT.
We definitely recommend reading the full text, but here are the insights and TL;DR highlights.
Key Insight:
The co-design of an ML-specific programming system (TensorFlow), compiler (XLA), architecture (TPU), floating-point arithmetic (Brain Float16), interconnect (ICI), and chip (TPUv2/v3) lets production ML applications scale at 96%-99% of perfect linear speedup, with 10x gains in performance/Watt over the most efficient general-purpose supercomputers.
More highlights:
Three generations
There are three generations of TPU released so far. TPU v1 used fixed-point arithmetic and was used for inference only. TPU v2 and v3 operate in floating point and are used for training. TPU v4 results were presented in the MLPerf summer release, but no public information is available yet. The TPU architecture differs from CPUs in several ways:
▪️ Two-dimensional array processing units (instead of the 1D vector SIMD units in CPUs)
▪️ Narrower data (8-16 bits)
▪️ Dropped complex CPU features such as caches and branch prediction
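Narrower data works because DNN inference tolerates coarse values. As a minimal sketch (the scheme and names here are illustrative, not the TPU's exact format), symmetric linear quantization maps reals to small integers via one scale factor:

```python
def quantize(xs, bits=8):
    qmax = 2 ** (bits - 1) - 1              # 127 for int8
    scale = max(abs(x) for x in xs) / qmax  # one scale for the whole tensor
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

q, s = quantize([0.1, -0.5, 1.0])
# dequantize(q, s) recovers each input to within one quantization step `s`
```

Storing 8-bit integers instead of 32-bit floats quarters memory traffic, which is where most of the energy goes.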
Fewer cores per chip (two oxen vs. 1024 chickens)
NVIDIA puts thousands of CUDA cores inside each chip. TPU v3 has only two TensorCores per chip. It is far easier to generate a program for two beefier cores than for a swarm of wimpier cores.
Each TensorCore includes the following units:
▪️ ICI (Inter-Core Interconnect) - connects cores across different chips
▪️ HBM - stacked DRAM on the same interposer substrate
▪️ Core Sequencer - manages instructions and performs scalar operations
▪️ Vector Processing Unit - performs vector operations on 1D and 2D vectors
▪️ Matrix Multiply Unit (MXU)
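The MXU does the bulk of the FLOPs: weights stay resident in a systolic array (reportedly 128x128 on TPU v2/v3) while activations stream through. A toy pure-Python model of the arithmetic it performs (this shows only the math, not the hardware dataflow; dimensions are toy-sized):

```python
def mxu_matmul(A, B):
    # weight-stationary view: one rank-1 accumulation per streamed step t
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0.0] * m for _ in range(n)]
    for t in range(k):
        for i in range(n):
            for j in range(m):
                acc[i][j] += A[i][t] * B[t][j]
    return acc

mxu_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19.0, 22.0], [43.0, 50.0]]
```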
From inference to training chip
Key challenges on the way from the inference chip (v1) to training hardware (v2):
▪️ Harder parallelization
▪️ More computation
▪️ More memory
▪️ More programmability
▪️ Wider dynamic range of data
Brain Float
IEEE FP32 and FP16 use (1+8+23) and (1+5+10) bits for the sign, exponent, and mantissa, respectively. In practice, DNNs don't need the mantissa precision of FP32, but the dynamic range of FP16 is not enough, and using FP16 also requires loss scaling. The compromise, bf16, keeps the same 8 exponent bits as FP32 but a reduced mantissa: only 7 bits instead of 23.
BF16 delivers reduced memory footprint and power consumption, with no loss scaling required in software.
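Because bf16 is simply the top 16 bits of an IEEE-754 float32, conversion is nearly free. A sketch using plain truncation (real hardware typically adds round-to-nearest-even):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    # bf16 = top 16 bits of float32: 1 sign + 8 exponent + 7 mantissa bits
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16

def bf16_to_fp32(b: int) -> float:
    # widening back is free: just append 16 zero mantissa bits
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

bf16_to_fp32(fp32_to_bf16_bits(3.14159265))  # -> 3.140625
```

The round trip keeps the full FP32 exponent range but only about 2-3 decimal digits of mantissa precision.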
Torus topology and ICI
TPU v1 was an accelerator card for a CPU-based computer. TPU v2 and v3 are building blocks of a supercomputer. Chips are connected through the ICI interface, each link running at ~500 Gbit/s. ICI enables direct connections between chips, so no extra interfaces are needed. GPU/CPU-based supercomputers have to combine NVLink and PCIe inside the computer chassis with InfiniBand networks and switches to connect the machines.
Chips in TPU v2 and v3 clusters are connected in a 2D torus topology (a doughnut) and achieve near-linear performance scaling as the number of chips grows.
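The torus wiring is part of why scaling stays near-linear: every chip has four equal-bandwidth neighbors, and the wraparound links halve the worst-case hop count versus a plain mesh. A minimal sketch of the neighbor wiring (coordinates are illustrative):

```python
def torus_neighbors(x, y, w, h):
    # four ICI links per chip; edges wrap, so a w x h torus has a worst-case
    # hop distance of w//2 + h//2 instead of (w-1) + (h-1) for a mesh
    return [((x - 1) % w, y), ((x + 1) % w, y),
            (x, (y - 1) % h), (x, (y + 1) % h)]

torus_neighbors(0, 0, 4, 4)  # [(3, 0), (1, 0), (0, 3), (0, 1)]: the corner wraps around
```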
XLA compiler (to orchestrate them all)
TF programs are graphs of operations, where tensor arrays are first-class citizens. The XLA compiler front end transforms the TF graph into an intermediate representation, which is then efficiently mapped onto the selected TPU (or CPU/GPU) architecture. XLA maps TF-graph parallelism across hundreds of chips, the TensorCores within each chip, and the multiple units within each core. XLA provides precise reasoning about memory use at every point in the program.
The young XLA compiler has more room for improvement than the more mature CUDA stack.
Green Power (forest animals approve)
The TPU v3 supercomputer has already climbed to 4th place in the TOP500 ranking, but what is remarkable is its overwhelming 146.3 GFLOPS/Watt efficiency; the nearest competitor's number is about 10 times lower.
Original Paper
A Domain Specific Computer for training DNN
cacm.acm.org
A Domain-Specific Supercomputer for Training Deep Neural Networks
Google's TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.
Accelerating decision trees
It may be surprising, but DNNs do not exhaust the list of ML algorithms. In fact, few businesses can find an application for CV or NLP; few have the significant amounts of speech or photo data where DNNs show SOTA results.
But most of them have huge amounts of irregular tabular data: financial market prices, customer data, base-station activity logs, or windmill breakdown statistics.
And that's where decision trees take the stage. There are three major frameworks on the market today for training ensembles of decision trees with gradient boosting: XGBoost, CatBoost, and LightGBM.
Thanks to a good match between the algorithm and the hardware organization, decision trees can be significantly accelerated on FPGAs.
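To see the algorithm-hardware match, note that tree inference is just a chain of independent threshold compares over flat arrays, which is exactly what FPGA lookup logic and block RAM do well. A hypothetical flattened tree (layout and values invented for illustration):

```python
# each node i: compare x[feature[i]] against threshold[i]; -1 marks a leaf
feature   = [0,   1,   -1,  -1,  -1]
threshold = [0.5, 0.3,  0,   0,   0]
left      = [1,   2,   -1,  -1,  -1]
right     = [4,   3,   -1,  -1,  -1]
value     = [0,   0,  -1.0, 0.5, 2.0]

def predict(x):
    i = 0
    while feature[i] != -1:   # walk the tree; an FPGA can pipeline one level per cycle
        i = left[i] if x[feature[i]] <= threshold[i] else right[i]
    return value[i]

predict([0.2, 0.9])  # 0.2 <= 0.5, then 0.9 > 0.3 -> leaf 3 -> 0.5
```

Every comparison touches a different array index, i.e. the irregular parallel accesses that on-chip FPGA memory handles well.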
We will cover two stories here.
Xelera Decision Tree Acceleration
Germany-based startup Xelera offers FPGA devices as a hardware backend for decision-tree inference acceleration. The company claims a 700x improvement in both throughput and latency. The FPGA results were measured on cloud AWS F1 FPGA instances and on the Xilinx Alveo U50.
The secret to why FPGAs perform so well on this class of workloads is their unique memory architecture, which consists of thousands of independent blocks of on-chip memory. This memory is not only highly parallel; a key difference from GPU memory is that it handles highly parallel, irregular memory accesses very well.
FPGAs for particle classification
HLS4ML is an open-source framework from a Cornell University team working with CERN, where FPGAs are used for trigger-condition detection and particle classification.
HLS4ML generates an HLS description of the ML algorithm, which you can feed to an HLS synthesis tool (e.g., Vivado HLS) to generate an FPGA configuration file.
A recent paper describes how to use HLS4ML to generate FPGA firmware and host software for decision-tree acceleration:
"Taking as an example a multiclass classification problem from high energy physics, we show how a state-of-the-art algorithm could be deployed on an FPGA with a typical inference time of 12 clock cycles (i.e., 60 ns at a clock frequency of 200 MHz)"
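The quoted latency is straightforward to check from the cycle count and clock frequency:

```python
def inference_time_ns(cycles, clock_hz):
    # latency = cycles / frequency, reported here in nanoseconds
    return cycles / clock_hz * 1e9

inference_time_ns(12, 200e6)  # 12 cycles at 200 MHz -> about 60 ns, as in the paper
```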
How PCIe 5 and its Smart Friends Will Change Solution Acceleration
Nice article by Scott Schweitzer, Xilinx
Key points:
▪️ PCIe Gen5 offers not only a doubling of throughput; Compute Express Link (CXL) and the Cache Coherent Interconnect for Accelerators (CCIX) also promise efficient communication between CPUs and accelerators such as SmartNICs or co-processors.
▪️ CCIX configurations include direct-attached, switched, and hybrid daisy-chain topologies. CCIX can take memory from different devices, each with varying performance characteristics, pool it together, and map it into a single non-uniform memory access (NUMA) architecture. It then establishes a virtual address space, giving every device in the pool access to the full range of NUMA memory.
▪️ SmartSSDs, also known as computational storage, place a computing device, often an FPGA accelerator, alongside the storage controller within a solid-state drive. This lets the computing device in the SmartSSD operate on data as it enters and exits the drive, potentially redefining both how data is accessed and how it is stored.
▪️ SmartNICs are a special class of accelerators that sit at the nexus between the PCIe bus and the external network. While SmartSSDs place computing close to the data, SmartNICs place computing close to the network.
▪️ SmartNICs and DPUs (data processing units) that leverage PCIe 5 and CXL or CCIX will offer richly interconnected accelerators, enabling the development of complex and highly performant solutions.
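The bandwidth doubling in the first point is easy to quantify. A sketch of effective x16 bandwidth per generation (Gen3-5 all use 128b/130b line encoding):

```python
def pcie_gbytes_per_s(gen, lanes=16):
    gt_per_lane = {3: 8, 4: 16, 5: 32}[gen]       # raw GT/s per lane doubles each gen
    return gt_per_lane * lanes * (128 / 130) / 8  # encoding overhead, bits -> bytes

pcie_gbytes_per_s(5)  # ~63 GB/s per direction for a Gen5 x16 link
```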
Onur Mutlu, a world-leading researcher of computer architectures (SAFARI group, ETH Zurich), published a short, thought-provoking keynote:
Intelligent Architectures for Intelligent Machines
Submitted on 13 Aug 2020
Highlights:
Data access is still a major bottleneck.
The current processor-centric design paradigm creates a dichotomy between processing and memory/storage: data has to be brought from storage and memory units to compute units that are far away from them. This processor-memory dichotomy leads to large amounts of data movement across the entire system, degrading performance and expending large amounts of energy.
Modern architectures are poor at:
▪️ Dealing with data: they are designed mainly to store and move data, as opposed to actually computing on it
▪️ Taking advantage of the vast amounts of data and metadata available to them during online operation and over time
▪️ Exploiting the different properties of application data: they are designed to treat all data the same
Intelligent architecture
Intelligent architecture should handle (i.e., store, access, and process) data well.
Key principles:
▪️ Data-centric: minimize data movement and maximize the efficiency with which data is handled
▪️ Data-driven: the architecture should make data-driven, self-optimizing decisions in its components
▪️ Data-aware: the architecture should make data-characteristics-aware decisions in its components and across the entire system
Read the full keynote here
How to Evaluate Deep Neural Network Processors
Vivienne Sze
Why the FOPS/W metric is not enough
A common metric for measuring hardware efficiency is FOPS/W (floating-point operations per second per watt), or TOPS/W (tera-FOPS/W). However, TOPS/W alone is not enough. It usually goes along with the peak performance in TOPS, which gives the maximum efficiency, since it assumes maximum utilization and thus maximum amortization of overhead. But this does not tell the complete story, because processors typically do not operate at their peak TOPS, and their efficiency degrades at lower utilization. The following metrics, and their combination, must be considered:
1. Accuracy determines whether the system can perform the given task. To evaluate it, several benchmarks have been proposed, MLPerf among them.
2. Latency and throughput determine whether the system can run fast enough, and in real time. Throughput is the number of inferences per second; latency is the time between the arrival of an input sample and the generation of its result. Batching improves throughput but degrades latency, so achieving low latency and high throughput simultaneously can be at odds depending on the approach, and both metrics should be reported.
3. Energy and power consumption primarily dictate the form factor of the device where the processing can operate. Memory reads and writes, not arithmetic, are still the main consumers of power: a 32-bit DRAM read takes 640 pJ, while a 32-bit FP multiply takes 0.9 pJ.
4. Cost, which is primarily dictated by chip area and external memory bandwidth requirements, determines how much one would pay for the solution. Custom DNN processors have a higher design cost (after amortization) than off-the-shelf CPUs and GPUs. We consider anything beyond this, e.g., the economics of the semiconductor business, including how to price platforms, outside the scope of this article. Considering the hardware cost of a design is important from both an industry and a research perspective, as it dictates whether a system is financially viable.
5. Flexibility determines the range of tasks the processor can support. The hardware should not rely on particular properties of DNN models to achieve efficiency, because those properties are diverse and evolving rapidly. For instance, a DNN processor that efficiently supports the case where the entire model (i.e., all of the weights) fits on chip may perform extremely poorly when the model grows larger, which is likely, given that DNN model sizes continue to increase over time.
6. Scalability determines whether the same design effort can be amortized across deployments in multiple domains (e.g., in the cloud and at the edge) and whether the system can be scaled efficiently with DNN model size.
7. Interplay among the metrics: all of them must be accounted for to fairly evaluate design tradeoffs.
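The latency/throughput tension from metric 2 can be sketched with a toy cost model (the per-sample and fixed-overhead numbers are invented for illustration):

```python
def batch_stats(batch, per_sample_ms=1.0, overhead_ms=5.0):
    latency_ms = overhead_ms + batch * per_sample_ms  # every sample waits for the batch
    throughput = batch / latency_ms * 1000            # inferences per second
    return latency_ms, throughput

batch_stats(1)   # (6.0, ~167/s)
batch_stats(32)  # (37.0, ~865/s): ~5x the throughput at ~6x the latency
```

This is why both numbers should be reported: quoting only peak throughput hides the latency price of large batches.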
Case 1. A tiny binarized NN architecture with very low power consumption and high throughput, but unacceptable accuracy.
Case 2. A complete floating-point DNN chip with high throughput and moderate chip power consumption. But it is a pure arithmetic chip with MACs; all data reads, writes, and storage happen off chip, so total system power consumption will be very high.
Five trends that will shape the future semiconductor technology landscape
Sri Samavedam, senior vice president CMOS technologies at imec
⤴️ Trend 1: Moore's Law will continue - CMOS transistor density scaling will roughly keep following Moore's Law for the next eight to ten years.
⤵️ Trend 2: ... but logic performance improvement at fixed power will slow down.
Node-to-node performance improvements at fixed power (referred to as Dennard scaling) have slowed down due to the inability to scale the supply voltage. Researchers worldwide are looking for ways to compensate for this slowdown and further improve chip performance.
Trend 3: More heterogeneous integration, enabled by 3D technologies.
We see more and more systems built through heterogeneous integration leveraging 2.5D or 3D connectivity - for example, SoC, FPGA, HBM, CPU, and GPU on the same interposer.
Trend 4: NAND and DRAM being pushed to their limits.
Emerging non-volatile memories are on the rise: that market is expected to grow at a >50% compound annual growth rate, driven mainly by demand for embedded magnetic random-access memory (MRAM) and standalone phase-change memory (PCM).
Trend 5: Spectacular rise of the edge AI chip industry.
With expected growth above 100% over the next five years, edge AI is one of the biggest trends in the chip industry. As opposed to cloud-based AI, inference functions are embedded locally on Internet of Things (IoT) endpoints at the edge of the network, such as cell phones and smart speakers.
Read the full text
DataFest2020 will take place this weekend!
The OpenDataScience community presents DataFest2020, which will take place this coming weekend, September 19-20.
This year's festival will be fully online.
Please check the festival landing page, where you will find a full list of fascinating sections.
Each section includes several tracks - videos and reading materials - and interactive chat rooms for discussions and Q&A.
The @PDP11ML channel and our good friends will present the Domain-Specific Hardware section.
We will talk about market trends and technical details, share success stories, and highlight SOTA solutions.
The hardware section's content and schedule will be presented here.
Participation is free of charge, all tracks in our section are in English, and you are very welcome to join us.
See you soon!
fest.ai
Data Fest
Largest free and open Data Science conference
Why Nvidia wants ARM, by WSJ
▪️ Huang's Law: silicon chips that power artificial intelligence more than double in performance every two years.
▪️ AI is moving from the cloud to the edge (dishwashers, smartphones, watches, vacuum cleaners).
▪️ ARM develops ultra-low-power CPUs and ML cores.
▪️ This movement of AI processing from the cloud to the "edge" - that is, onto the devices themselves - explains Nvidia's desire to buy Arm, says Nexar co-founder and CEO Eran Shir.
▪️ The pace of improvement in AI-specific hardware will make possible a range of applications both utopian and dystopian.
▪️ Uses of mobile AI are multiplying in phones and in smart devices ranging from dishwashers to door locks to lightbulbs, as well as in the millions of sensors making their way into cities, factories, and industrial facilities. And chip designer Arm Holdings - whose patents Apple, among many tech companies large and small, licenses for its iPhone chips - is at the center of this revolution.
▪️ Over the last three to five years, machine-learning networks have been increasing by orders of magnitude in efficiency, says Dennis Laudick, vice president of marketing in Arm's machine-learning group. "Now it's more about making things work in a smaller and smaller environment," he adds.
Source: WSJ
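Huang's Law compounds quickly. A one-liner to see the implied growth, taking "more than double every two years" as exactly 2x for a lower-bound estimate:

```python
def huang_factor(years, doubling_period_years=2.0):
    # performance multiple implied by doubling every `doubling_period_years`
    return 2 ** (years / doubling_period_years)

huang_factor(10)  # -> 32.0: at least ~32x AI chip performance in a decade
```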
WSJ
Huang's Law Is the New Moore's Law, and Explains Why Nvidia Wants Arm
The rule that the same dollar buys twice the computing power every 18 months is no longer true, but a new law, which we named for the CEO of Nvidia, the company now most emblematic of commercial AI, is in full effect.
Intel: IoT-Enhanced Processors Increase Performance, AI, Security
Motivation: by 2023, up to 70% of all enterprises will process data at the edge.
AI-inferencing algorithms can run on up to 96 graphics execution units (INT8) or on the CPU with built-in vector neural network instructions (VNNI). With Intel® Time Coordinated Computing (Intel® TCC Technology) and time-sensitive networking (TSN) technologies, 11th Gen processors meet real-time computing demands.
Software tools: Edge Software Hub's Edge Insights for Industrial and the Intel® Distribution of OpenVINO™ toolkit.
Use cases: Industrial, Retail, Healthcare, Smart City, Transportation
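The INT8 path mentioned above rests on quantization: floats are mapped to 8-bit integers, multiply-accumulates run in integer arithmetic, and the result is scaled back. A minimal sketch of the underlying arithmetic in plain Python; this is not Intel's VNNI or OpenVINO code:

```python
# Symmetric per-tensor INT8 quantization, the idea behind INT8 inference.

def quantize(xs):
    """Map float values into the int8 range [-127, 127] with one shared scale."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def int8_dot(a, b):
    """Dot product computed on int8 values, then dequantized back to float."""
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # wide integer accumulator
    return acc * sa * sb

x = [0.5, -1.0, 0.25]
w = [1.0, 0.5, -0.5]
print(int8_dot(x, w), "vs exact", sum(a * b for a, b in zip(x, w)))
```

Hardware such as VNNI fuses exactly this pattern (int8 multiplies into a wider accumulator) into a single instruction, which is where the throughput gain comes from.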
NVIDIA DPU
by Forbes
Domain-specific processors (accelerators) are playing a greater role in off-loading CPUs and improving the performance of computing systems.
The NVIDIA BlueField-2 DPU (Data Processing Unit), a new domain-specific computing technology, is a SmartNIC enabled by the company's Data-Center-Infrastructure-on-a-Chip software (DOCA SDK). Off-loading processing to a DPU can result in overall cost savings and improved performance for data centers.
NVIDIA's current DPU lineup includes two PCIe cards, the BlueField-2 and BlueField-2X DPUs. BlueField-2 is based on the ConnectX®-6 Dx SmartNIC combined with powerful Arm cores. BlueField-2X includes all the key features of a BlueField-2 DPU, enhanced with an NVIDIA Ampere GPU's AI capabilities that can be applied to data-center security, networking and storage tasks.
Read more about DPUs:
- Product page
- Mellanox product brief
- Servethehome
- Nextplatform
- Nextplatform
ESP SoC design platform https://telegra.ph/ESP-01-12
The ESP Project page, a platform for FPGA and ASIC SoC accelerator design, has been significantly updated since the last time we mentioned it.
It now includes many more materials, guides and videos:
- How to design accelerators in Vivado HLS and Mentor Graphics Catapult HLS
- A description of the HLS4ML flow. The original HLS4ML papers covered only the core design path, but this guide helps you integrate the core into the computer system
- Integrating with Nvidia GPUs through NVDLA
- A list of related papers
ESP is run by the System-Level Design (SLD) group at Columbia University, led by Professor Luca P. Carloni.
ESP - open SoC platform
Documentation
The ESP website.
AMD Is in Advanced Talks to Buy Xilinx
by WSJ
According to people familiar with the matter, the deal could be valued at more than $30 billion and would mark the latest big tie-up in the rapidly consolidating semiconductor industry.
AMD's market value now tops $100 billion after its shares soared 89% this year, as the coronavirus pandemic stokes demand for PCs, gaming consoles and other devices.
Xilinx has a market value of about $26 billion, with its shares up about 9% so far this year, just ahead of the S&P 500's 7% rise.
Should AMD and Xilinx reach an agreement, three of the year's largest deals so far would be in the semiconductor industry:
- Analog Devices paid $20B for Maxim Integrated
- Nvidia agreed to acquire Arm for $40B
A boom in low-cost edge-AI chips using RISC-V technology is coming, says Facebook's chief AI scientist Yann LeCun.
The move to RISC-V for running neural networks in edge-AI applications is being accelerated by the proposed takeover of ARM by Nvidia, said Yann LeCun, chief AI scientist at Facebook, speaking at the Innovation Day of French research lab CEA-Leti.
"There is a change in the industry, and ARM with Nvidia makes people uneasy, but the emergence of RISC-V sees chips with a RISC-V core and an NPU (neural processing unit)," he said.
"These are incredibly cheap, less than $10, with many out of China, and these will become ubiquitous," he said. "I'm wondering if RISC-V will take over the world there."
"Certainly edge AI is a super important topic," he said. "In the next two to three years, it's not going to be exotic technologies, it's about reducing the power consumption as much as possible, pruning the neural net, optimising the weights, shutting down parts of the system that aren't used."
"The target is AR devices, with chips in the next two to three years and devices in five years, and that's coming," he said.
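Magnitude pruning, one of the power-saving techniques LeCun lists, simply zeroes the smallest weights so their multiply-accumulates can be skipped at inference time. A minimal sketch with illustrative numbers, not any particular framework's API:

```python
# Magnitude pruning: drop the smallest-magnitude fraction of the weights.

def prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= cutoff else w for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(prune(w, sparsity=0.5))  # → [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

In practice the network is fine-tuned after pruning to recover accuracy, and the zeros pay off only if the hardware or runtime can actually skip them.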
eeNews Europe
RISC-V boom from edge AI says Facebook's chief AI scientist
- Can I run my Neural Network on the FPGA?
- Does Vivado HLS run my CPP code on the FPGA?
- What is the difference between OneAPI and Intel OpenCL?
- Vitis: it is a sort of HLS for Vivado, isn't it?
There are two main FPGA vendors today, Xilinx and Intel…
2021_Book_DataParallelC.pdf (15.3 MB)
Presentation by Philip Harris & Jeff Krupa (MIT)
Heterogeneous Computing at the LHC
Proton collisions (events) occur at 40 MHz in the CMS detector: a new collision every 25 ns with 8 Mb of data per collision, which gives 320 Tb/s. There's no chance to catch them all for now.
There are three triggering levels that select only "interesting" events for offline computing, at a rate of 8 Gb/s. ML models (decision trees and DNNs) are used for event classification. This creates huge challenges for both throughput and latency requirements.
The described system integrates FPGA and GPU accelerators in the cloud through the network, to make them available to researchers.
This huge, large-scale work includes many famous institutions, among them Fermilab, MIT, CERN, AWS and the Microsoft Brainwave project, and can be applied not only to HEP but also to astrophysics and gravitational-wave detection.
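The rates quoted above can be sanity-checked in a few lines, assuming Mb and Tb denote megabits and terabits so the units stay consistent:

```python
# Sanity check of the CMS data-rate arithmetic: 40 MHz collision rate times
# 8 Mbit per event gives the raw detector output; the trigger levels reduce
# it to 8 Gbit/s for offline computing.

collision_rate_hz = 40e6    # one collision every 25 ns
bits_per_collision = 8e6    # 8 Mbit of data per event

raw_rate = collision_rate_hz * bits_per_collision  # bits per second
print(f"raw rate: {raw_rate / 1e12:.0f} Tbit/s")
print(f"trigger reduction: {raw_rate / 8e9:,.0f}x")
```

The trigger system thus has to discard all but roughly one part in 40,000 of the data stream, which is why classification quality and latency both matter so much.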
- YouTube video
- Slides Link (Dropbox)
TL;DR: The FastML Collaboration is a group founded by P. Harris and Nhan Tran to adapt DNNs to the LHC data flow, but it already goes far beyond that. The HLS4ML tools are part of the project.
YouTube
Heterogeneous Computing at the Large Hadron Collider: Massachusetts Institute of Technology (MIT)
MIT Massachusetts Institute of Technology: Investigating Heterogeneous Computing at the Large Hadron Collider
Presenter: Philip Harris
Only a small fraction of the 40 million collisions per second at the Large Hadron Collider are stored and analyzed due…