Strategies & Market Trends : ajtj's Post-Lobotomy Market Charts and Thoughts


To: sandeep who wrote (94650) 9/1/2025 4:03:08 PM
From: Sun Tzu
2 Recommendations (ajtj99, nicewatch)
 
Is this why nVidia is publishing research on what would happen if we could speed up AI models by 50x? Because they are not interested in more efficient AI models?

arxiv.org

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.


Their latest research asks: What if we could make AI models 50x faster without making them worse? And their answer might change how we think about building efficient language models.

Here's the problem they solved: Today's most capable AI models use something called "full attention" - imagine every word in a conversation needing to check in with every other word, every single time. It works well, but it's like having everyone in a stadium talk to everyone else individually.

The computational cost explodes as conversations get longer.

Researchers have tried replacing this with more efficient "linear attention" mechanisms (where the communication scales linearly rather than quadratically), but these models typically perform much worse at scale. It's been a painful trade-off: speed or performance, pick one.
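To make that scaling difference concrete, here is a toy sketch (my own, not from the paper) of the two shapes of computation: full attention builds an n-by-n score matrix, while a kernelized linear-attention variant summarizes the keys and values once and reuses that summary for every query. The feature map phi is just a simple illustrative choice.

import numpy as np

def full_attention(Q, K, V):
    # O(n^2 * d): every token scores against every other token.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # (n, n) matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # O(n * d^2): the n-by-n matrix is never materialized.
    KV = phi(K).T @ V                                        # (d, d) summary
    Z = phi(K).sum(axis=0)                                   # (d,) normalizer
    return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(full_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)

Double the conversation length and the first function does roughly four times the work, while the second only does twice as much. That is the trade-off the whole paper is attacking.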

Instead of building new models from scratch, NVIDIA started with existing high-performing models and surgically modified them. They froze the parts that already worked well (the parts that store knowledge) and systematically explored which attention layers actually matter for different tasks.
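In code terms, that freezing step is about as simple as it sounds. A rough sketch under my own assumptions (a Hugging Face-style checkpoint whose feed-forward parameters have "mlp" in their names; the base model here is just an example, not necessarily what NVIDIA started from):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")  # example base, not theirs
for name, param in model.named_parameters():
    if "mlp" in name:              # keep the knowledge-storing feed-forward weights fixed
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after freezing MLPs: {trainable:,}")

Everything that still has gradients enabled is then fair game for the attention-block search; the expensive pre-trained knowledge stays untouched.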

From previous research, we already know that not all attention layers are created equal. For general knowledge tasks, only 2 out of 28 layers were critical. For information retrieval, it was a different set of 2-3 layers. This insight let them keep full attention only where absolutely necessary and replace the rest with their new "JetBlock" - a linear attention mechanism enhanced with dynamic convolution that adapts to the input.
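Put differently, what the search produces is essentially a per-layer layout. A toy illustration (the layer indices are made up; the real critical layers are whatever the search finds, and the actual replacement block in the paper is JetBlock, which I am not reproducing here):

NUM_LAYERS = 28
CRITICAL_FULL_ATTENTION = {2, 15}    # hypothetical indices, for illustration only

layout = [
    "full_attention" if i in CRITICAL_FULL_ATTENTION else "linear_attention"
    for i in range(NUM_LAYERS)
]
print(layout.count("full_attention"), "full /", layout.count("linear_attention"), "linear")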

They aren't the first to do this; it has been a theme for quite a while. What stands out is the efficiency with which they did it: Jet-Nemotron-2B matches or beats models like Qwen3 and Gemma3 across benchmarks while being 47x faster at generation. At longer contexts (256K tokens), the speedup reaches 53.6x. They even outperform recent MoE models with 7x more parameters.

How are these numbers even possible? By using what they call "PostNAS". NAS stands for Neural Architecture Search - automatically finding the best architecture for your neural network - a process that usually takes an enormous amount of time and/or compute, often more than is feasible.

But since they are starting with "already proven" pre-trained models (remember?), they reduce the exploration cost from hundreds of millions of dollars to just thousands in compute.
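A back-of-the-envelope sketch of why that is so much cheaper: instead of pre-training every candidate architecture from scratch, you only have to score different attention layouts on top of one frozen base. The loop below is a deliberately dumb random search with a placeholder scoring function, just to show the shape of the pipeline; none of it is NVIDIA's actual code.

import random

NUM_LAYERS = 28
CHOICES = ["full_attention", "linear_attention"]   # hypothetical per-layer search space

def proxy_score(layout):
    # Placeholder: a real pipeline would briefly train the new attention blocks
    # (MLPs stay frozen) and measure accuracy plus throughput on target hardware.
    return random.random() - 0.1 * layout.count("full_attention")

best_layout, best_score = None, float("-inf")
for _ in range(200):                               # tiny search budget
    layout = [random.choice(CHOICES) for _ in range(NUM_LAYERS)]
    score = proxy_score(layout)
    if score > best_score:
        best_layout, best_score = layout, score

print("best layout keeps full attention in layers:",
      [i for i, kind in enumerate(best_layout) if kind == "full_attention"])

Each candidate only costs a short evaluation rather than a full training run, which is where the drop from hundreds of millions of dollars to thousands comes from.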

Not only is this clever, it also makes architectural innovations much more accessible, allowing smaller teams to compete with tech giants in designing efficient AI systems.



To: sandeep who wrote (94650) 9/1/2025 9:50:13 PM
From: Sun Tzu
1 Recommendation (ajtj99)
 
BTW, nVidia also released a research paper:

Small Language Models are the Future of Agentic AI

Which is to say, while there was some logic to your hypothesis that nVidia has no interest in efficient models, a hypothesis without validation is not worth the ink it is written in.