After Three Years, Modular’s CUDA Alternative Is Ready
By Sally Ward-Foxton | 04.22.2025
SAN JOSE, Calif. – Building a CUDA alternative was never going to be an easy task.
Chris Lattner’s team of 120 at Modular has been working on it for three years, aiming to replace not just CUDA, but the entire AI software stack from scratch.
“What does that take? Well, it’s actually pretty hard building a replacement for CUDA. It takes years,” Lattner told EE Times. “For the last three years, we’ve been working on the programming language, the graph compiler, the LLM optimizations, getting all these things sorted out, implemented at scale, tested and validated.”
Problems with the existing AI software stack stem from how rapidly it emerged, and it is still evolving fast: layers get added quickly to keep up with new use cases and models. On top of CUDA today sit libraries like oneMKL, vLLM for inference serving, Nvidia’s TensorRT-LLM, and now Nvidia’s NIM microservices, what Lattner calls “a gigantic stack of stuff.”
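To give a sense of how thick the stack above CUDA has become, standing up LLM inference with vLLM takes only a few lines of Python, with everything underneath (kernels, attention implementations, memory management) hidden from the developer. The minimal sketch below assumes the vllm package is installed; the model name is only a placeholder.

```python
from vllm import LLM, SamplingParams

# Placeholder model name; any weights accessible to vLLM would do.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

# vLLM handles batching, KV-cache management and the CUDA kernels
# underneath; the caller only sees prompts in and text out.
outputs = llm.generate(["What does an inference-serving layer do?"], params)
print(outputs[0].outputs[0].text)
```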
CUDA itself, Lattner pointed out, is 16 years old. In other words, it existed well before the generative AI use case and before GPU hardware features like tensor cores and FP4 were invented.
Nor do what Lattner calls “disposable frameworks” help: parts of the stack that get adopted but have a short shelf life before being superseded.
“Everything changes, and it’s not designed for generality, and it falls away,” he said. “What we’re building for enterprises is a technology platform that can actually scale, so they can keep up with AI.”
CUDA Alternatives
There have been other projects aiming to replace CUDA, or to provide some level of code portability from CUDA, or both.
One of the most successful has been the open-source project Apache TVM. TVM’s main aim is to enable AI to run efficiently on diverse hardware by automating kernel fusion, but generative AI proved to be a technical challenge: its algorithms are larger and more complex than those of older computer vision applications, and they are also more hardware-specific (FlashAttention, for example). TVM’s core contributors formed a company called OctoAI, which developed a generative AI inference stack for enterprise clusters, but OctoAI was recently acquired by Nvidia, casting some doubt on the project’s future.
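For readers unfamiliar with the term, kernel fusion collapses a chain of operations that would each launch their own GPU kernel into a single kernel, saving round-trips through memory. The toy NumPy sketch below is not TVM code, just an illustration of the idea applied to a scale-shift-ReLU chain.

```python
import numpy as np

a = np.random.rand(4096).astype(np.float32)

def unfused(x):
    # Each operation is its own "kernel": a separate pass over memory
    # that materializes a full intermediate array, roughly how an eager
    # framework dispatches work to a GPU one op at a time.
    t = x * 2.0                 # kernel 1
    t = t + 1.0                 # kernel 2
    return np.maximum(t, 0.0)   # kernel 3 (ReLU)

def fused(x):
    # The same arithmetic applied in a single pass per element with no
    # intermediates; a fusing compiler such as TVM searches for rewrites
    # like this automatically and emits one kernel for the whole chain.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = max(x[i] * 2.0 + 1.0, 0.0)
    return out

assert np.allclose(unfused(a), fused(a))
```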
Another widely known technology, OpenCL, is a standard designed to enable code portability between GPUs and other hardware types. It has been broadly adopted in mobile and embedded devices. However, critics of this standard (including Lattner) point to its lack of agility to keep up with fast-moving AI technology, in part because it is driven by a “co-opetition” of competing companies who generally decline to share anything about future hardware features.
Other commercial projects of this nature are still in the early stages, Lattner said.
“There’s a big gap between building a demo, solving one model and one use case, versus building something that’s generalized at scale, that can actually take on the pace of AI research, which is very significant,” he said.
Modular, as a software-only company, is better positioned to build a stack that works for all hardware, according to Lattner.
“We just want software developers to use their silicon,” he said. “We’re helping to break down those barriers, investing over many years across many generations of hardware that can enable [that].”
Performant portability
Modular’s AI inference engine, Max, launched in 2023 with support for x86 and Arm CPUs; support for Nvidia GPUs was added recently. That means Modular now has a full-stack replacement for CUDA, including replacements for the CUDA programming language and for the LLM serving stack that builds on top of it.
Crucially, Lattner said Max can meet the performance of CUDA for Nvidia A100 and H100 GPUs.
“[Nvidia] had a bit of a head start on us—they had the help of the entire world that was tuning for their hardware, and A100 at that point was 4 years old, and that was very well understood and optimized [for], so it was a very high bar,” he said. “What [meeting CUDA performance for A100] told me is: we have a stack that can scale and we have a team that can execute.”
Meeting or beating CUDA’s performance for generative AI inference on H100s took two months from first introducing H100 support—an achievement Lattner is confident the team can reproduce for its next target hardware: Nvidia Blackwell-generation GPUs.
“We are engineering this in a way that is able to scale,” Lattner said. “We got to competitive performance on H100 in two months, not two years, because [our] technology investment has allowed us to scale up and actually lean in to these problems.”
The eventual aim is to enable performant portability between all types of AI hardware.
“No other stack can do that,” Lattner said. “Even Nvidia doesn’t have a performance portability story… CUDA can sort of run on A100 and H100, but practically speaking, you have to rewrite your code to get good performance, because [Nvidia] introduced new features like the TMA units in the H100.”
Tensor memory accelerator, or TMA, units were introduced in Hopper-generation GPUs to allow the asynchronous transfer of tensors between global and shared memory. Performant portability is enabled by Modular’s higher-level abstractions over hardware features like this. The company aims to become a bridge between chip makers and software developers who simply want to use the hardware, Lattner said.
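To make the idea concrete, the hypothetical Python sketch below is not Modular’s actual API; the Device type, its has_tma flag and both copy backends are invented for illustration. It shows how a single portable primitive could route to a TMA-style asynchronous copy on hardware that has one and fall back to ordinary per-thread loads everywhere else, so kernel code written against the primitive stays unchanged across generations.

```python
from dataclasses import dataclass

# Hypothetical sketch, not Modular's actual API: a portable "copy a
# tile into fast on-chip memory" primitive that hides a hardware
# feature like Hopper's TMA behind one call. Device, has_tma, and
# both backend methods are invented stand-ins for illustration.
@dataclass
class Device:
    name: str
    has_tma: bool

    def tma_async_copy(self, tile):
        # Stand-in for an asynchronous bulk tensor copy issued to the
        # TMA unit (global -> shared memory) on Hopper-class GPUs.
        return f"{self.name}: TMA async copy of {len(tile)} elements"

    def threadwise_copy(self, tile):
        # Stand-in for ordinary per-thread loads on hardware without
        # a TMA unit.
        return f"{self.name}: threadwise copy of {len(tile)} elements"

def copy_tile_to_shared(tile, device):
    # Kernel authors call this one primitive; the abstraction layer
    # picks the fastest mechanism the target hardware offers.
    if device.has_tma:
        return device.tma_async_copy(tile)
    return device.threadwise_copy(tile)

print(copy_tile_to_shared(list(range(256)), Device("H100", has_tma=True)))
print(copy_tile_to_shared(list(range(256)), Device("A100", has_tma=False)))
```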
“As we unlock [the power of this technology], which we’re just coming into now, we can enable an entirely new category of people to be able to program all the new hardware that’s coming on to the market, and do so in a consistent way,” he said. “Developers don’t have to know about all the complexity on the hardware side or on the AI research side. They can focus on building their agentic workflow or their custom RAG solution and benefit from all the innovation that’s happening in the ecosystem; we can make it simple and adoptable.”
Modular support for non-Nvidia GPUs and other types of accelerators will begin towards the end of 2025.