I'd say Qualcomm's push is, at a high level, all in on energy efficiency and inference.
Unpacking this, their neural compute strategy:
1) Doesn't chase training, where nVidia has a large lead and a moat in its CUDA stack, and instead goes after the >15x higher volume of inference hardware.
2) Starts from power efficiency --the strategic push begins at the edge and encroaches into the data center.
Because of this, the high-precision data types and Von Neumann-style architecture of GPUs, while useful for training, shuttle data back and forth to and from memory needlessly during inference. That consumes vastly more energy than a purpose-built accelerator whose design aligns with the dataflow of pre-trained network architectures.
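To put rough numbers on the data-movement point, here's a back-of-envelope sketch in Python. The per-operation energies are illustrative ballpark figures in the spirit of the oft-cited Horowitz ISSCC 2014 estimates, not measurements of any particular chip, and the 7B-parameter model is just a hypothetical workload:

    # Back-of-envelope: why data movement dominates inference energy.
    # Per-op energies are illustrative ballpark figures, not vendor data.
    PJ_MAC       = 1.0     # pJ per 16-bit multiply-accumulate
    PJ_SRAM_READ = 5.0     # pJ per 16-bit word read from on-chip SRAM
    PJ_DRAM_READ = 640.0   # pJ per 16-bit word read from off-chip DRAM

    params = 7e9           # hypothetical 7B-parameter model
    # One autoregressive decode step touches every weight roughly once.
    compute   = params * PJ_MAC        # the math itself
    dram_path = params * PJ_DRAM_READ  # weights streamed from DRAM each step
    sram_path = params * PJ_SRAM_READ  # weights held in near-compute memory

    for label, pj in [("compute", compute),
                      ("DRAM-streamed weights", dram_path),
                      ("near-compute weights", sram_path)]:
        print(f"{label:22s} ~{pj * 1e-12:.3f} J per token")

The point of the arithmetic: streaming weights from off-chip DRAM costs a couple of orders of magnitude more energy than the multiply-accumulates themselves, which is exactly the cost a dataflow-style accelerator that keeps weights near the compute units avoids.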
nVidia targeted its tensor units specifically at DLSS as a visual application starting with the Turing series, and for that class of visual applications, which require interactivity at low latency, doing inference within the GPU makes a lot of sense. For more general-purpose inference in non-visual applications such as an LLM, the GPU's more traditional dataflow architecture and mix of precisions make it a lot less efficient.
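To make the precision point concrete, here's a minimal sketch of symmetric per-tensor INT8 weight quantization in plain NumPy (not any vendor toolchain) --the reduced-precision path inference-first silicon is built around, whereas a GPU also carries the full mixed-precision machinery needed for training:

    import numpy as np

    # Minimal sketch: symmetric per-tensor INT8 quantization of a weight matrix.
    rng = np.random.default_rng(0)
    w_fp32 = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

    scale  = np.abs(w_fp32).max() / 127.0                 # map max magnitude into int8 range
    w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    w_deq  = w_int8.astype(np.float32) * scale            # what the accelerator effectively uses

    print("memory: %.0f MB fp32 -> %.0f MB int8"
          % (w_fp32.nbytes / 2**20, w_int8.nbytes / 2**20))
    print("mean abs quantization error: %.2e" % np.abs(w_fp32 - w_deq).mean())

Four times less memory traffic per weight, plus integer MACs that are far cheaper in energy than fp32 ones, is the trade a fixed-function inference engine gets to bake in.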
I'm guessing that, given the >3-year cycle times for major GPU design revisions, Eric had already completed the designs for the next major revision of Adreno with soon-to-be industry-standard tensor units in the mix, in preparation for the next-gen graphics APIs. He was probably hoping the GPUs would lead the charge for Qualcomm's AI push in the data center, potentially covering training in addition to inference, but Qualcomm pivoted to the more focused NPUs out of a desire for company-wide focus on 1) and 2) above. Still, it's sad to lose a legend in the GPU space, and I hope Qualcomm either has a strong internal candidate to fill Eric's role or poaches senior talent from a startup or an established GPU house...
Intel, which sees the data center as its stronghold after losing the MacBook SoC socket, is trying to mount a challenge to nVidia and push back out from the data center to reclaim the client space with GPUs that can be used for both training and inference. Eric has 14 years of experience shipping billions of energy-efficient GPUs, so he's a natural candidate to design that part of their comeback. That said, the scale of energy required for inference will be beyond what our existing and planned energy infrastructure can bear if GPUs alone carry the load, and at the edge NPUs can simply do more per watt than GPUs.
Once the novelty of the current generation of science-experiment-style LLMs wears off, people will care about things like privacy, scoping applications to where correctness holds, and cost. That means client-side processing, along with smaller, more focused models that do things like prediction in addition to generation, will matter more --and so the energy metrics and edge-first approach of Qualcomm's NPUs will matter.
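For a sense of scale on the client-side point (purely illustrative arithmetic, not a claim about any particular device or model):

    # Rough feasibility arithmetic for on-device models (illustrative only).
    def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
        # Approximate weight storage only, ignoring activations and KV cache.
        return params_billion * 1e9 * bits_per_weight / 8 / 2**30

    for params_b, bits in [(3, 4), (7, 4), (7, 8)]:
        print(f"{params_b}B params @ int{bits}: ~{weight_footprint_gb(params_b, bits):.1f} GB")

A 3B-7B model at 4 bits per weight lands roughly in the 1.5-3.5 GB range, which fits the RAM budget of a current flagship phone --exactly the class of workload an edge-first NPU strategy is betting on.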