SI
SI
discoversearch

We've detected that you're using an ad content blocking browser plug-in or feature. Ads provide a critical source of revenue to the continued operation of Silicon Investor.  We ask that you disable ad blocking while on Silicon Investor in the best interests of our community.  If you are not using an ad blocker but are still receiving this message, make sure your browser's tracking protection is set to the 'standard' level.
Technology Stocks : NVIDIA Corporation (NVDA)
NVDA 176.99-1.8%12:59 PM EST

 Public ReplyPrvt ReplyMark as Last ReadFilePrevious 10Next 10PreviousNext  
From: Frank Sully8/20/2021 3:01:37 PM
   of 2646
 
Tesla Packs 50 Billion Transistors Onto D1 Dojo Chip Designed to Conquer Artificial Intelligence Training

By Aleksandar Kostovic

D1 deliver 362 TeraFLOPs of power



(Image credit: Dennis Hong / Twitter)

Artificial intelligence (AI) has seen a broad adoption over the past couple of years. At Tesla, who as many know is a company that works on electric and autonomous vehicles, AI has a massive value to every aspect of the company's work. To speed up the AI software workloads, Tesla has today presented its D1 Dojo custom application-specific integrated circuit (ASIC) for AI training.

There are currently many companies building ASICs for AI workloads. Everyone from countless number of start-ups, all the way to big players like Amazon, Baidu, Intel and NVIDIA. However, not everyone gets the formula right and not everyone can satisfy each workload perfectly. That is the reason why Tesla opted to develop its own ASIC for AI training purposes.

Called the D1, the chip resembles a part of the Dojo supercomputer used to train AI models inside Tesla HQ, which are later deployed in various applications. The D1 chip is a product of TSMC's manufacturing efforts, forged in a 7nm semiconductor node. Packing over 50 billion transistors, the chip boasts a huge die size of 645mm^2.

The chip has some impressive performance claims, where Tesla states that it can output as much as 362 TeraFLOPs at FP16/CFP8 precision or about 22.6 TeraFLOPs of single-precision FP32 tasks. It is clear that Tesla optimized for FP16 data types, where they have managed to beat even the current leader in compute power - Nvidia. Nvidia's A100 Ampere GPU is capable of producing "only" 312 TeraFLOPs of power at FP16 workloads — and with sparsity, it can do up to double that.

Going to the silicon level, we see that Tesla built a mesh of functional units (FUs) that are interconnected together to form one massive chip. Each FU contains a 64-bit CPU with custom ISA, designed for transposes, gathers, broadcasts, and link traversals. The CPU design itself is a superscalar implementation with a 4-wide scalar and 2-wide vector pipelines. Looking at the images below, you can see that the FU is built with a big block for single instruction multiple data (SIMD) floating-point and integer processing elements. Each FU has its own 1.25MB scratchpad SRAM memory.



(Image credit: Tesla)



(Image credit: Tesla)

The FU itself is capable of performing one TeraFLOP of BF16 or CFP8, 64 GigaFLOPs of FP32 computation, and has 512 GB/s bandwidth in any direction in the mesh. The mesh is designed to traverse the FUs in only a single clock cycle, meaning that latencies are reduced and performance is increased. For more details, you can watch the Tesla AI day replay here.

google.com
Report TOU ViolationShare This Post
 Public ReplyPrvt ReplyMark as Last ReadFilePrevious 10Next 10PreviousNext