For years, the AI world has been dominated by a single, powerful idea: the Transformer architecture, introduced in the groundbreaking 2017 paper "Attention Is All You Need." This concept, centered around a mechanism called "attention," has been the engine behind virtually every major Large Language Model (LLM) you've heard of – from OpenAI's GPT series to Google's Gemini and Meta's Llama. Attention allows AI models to cleverly focus on the most important parts of a vast amount of information, much like how we humans selectively recall details when reading or listening. It's been a golden age for AI, driving incredible progress.
However, even the most brilliant ideas have limitations. The very mechanism that made Transformers so powerful – attention – is now showing its age. Imagine trying to remember every single word from a very long book perfectly. As the book gets longer, your brain needs more and more energy and space to keep track of it all. Similarly, for AI models, the computational cost and memory required to process a text (or codebase, or video) grow with the square of its length. This "quadratic scaling" with context length is becoming a major bottleneck, making it difficult and expensive to build AI that can truly understand and reason over vast amounts of data, like entire codebases or days of video footage.
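To make the quadratic growth concrete, here is a tiny back-of-the-envelope calculation: a full attention layer compares every token with every other token, so the number of comparisons is the square of the context length.

```python
# Illustration: attention's cost grows with the square of context length,
# because every token is compared against every other token.
def attention_pairs(context_length: int) -> int:
    """Number of token-to-token comparisons one full attention layer makes."""
    return context_length * context_length

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):>18,} comparisons")
# 1,000 tokens is a million comparisons; 100,000 tokens is ten billion.
```

Multiplying the context by 100 multiplies the work by 10,000 – which is why long-context Transformers get expensive so quickly.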
This is where a lesser-known AI startup, Manifest AI, steps into the spotlight. They've introduced a variant of a leading open-source model called Qwen3, which they've dubbed Brumby-14B-Base. What makes Brumby revolutionary is that it ditches attention altogether. Instead, Manifest AI has developed a new mechanism called Power Retention. This approach is different; it's a "recurrent" system, meaning it processes information sequentially, like a conveyor belt. But crucially, it does so in a way that's highly efficient with computer hardware. Power Retention is designed to store and update information over arbitrarily long contexts without the memory explosion problem that plagues attention.
The results are striking. Manifest AI claims they trained this 14-billion-parameter Brumby model for a mere $4,000. Despite its novel architecture, it performs on par with established Transformer models on many reasoning and comprehension tasks. This is not just a small improvement; it's a potential paradigm shift, suggesting that groundbreaking AI capabilities might soon be accessible at a fraction of the current cost.
To understand why this is so significant, let's simplify how Transformers and Power Retention work. In a Transformer, for every piece of information (a "token"), the model compares it to every other piece of information in the input. This is like asking every student in a huge classroom to raise their hand if they understand a specific concept, and then having each student compare their understanding to every other student. It's thorough but incredibly time-consuming and resource-intensive, especially as the classroom size (context length) grows.
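The classroom analogy maps directly onto the standard scaled dot-product attention computation. This minimal NumPy sketch shows where the quadratic cost lives: the `scores` matrix holds one entry for every query–key pair.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Scaled dot-product attention: every query scores every key.
    The (n, n) scores matrix is the source of the quadratic cost."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted mix of values

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)  # (8, 4): one output per token
```

Doubling `n` quadruples the size of `scores` – the "every student compares with every other student" step made explicit.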
Power Retention takes a different path. It still considers the incoming information (query, key, value, or Q, K, V), but instead of doing a full comparison with everything else, it updates a continuous "memory state." Think of it like a dedicated scribe continuously taking notes. As new information arrives, the scribe updates their summary of everything that's happened so far, compressing past knowledge into a compact, fixed-size note. This means whether the AI is processing 1,000 pieces of information or 1,000,000, the effort for each new piece remains roughly the same. This "constant-time per-token computation" is a game-changer for handling long contexts.
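The scribe analogy corresponds to a recurrent state update. The sketch below is a generic linear-attention-style recurrence, not Manifest AI's exact Power Retention rule (which uses tensor powers of the key, discussed next), but it shows the key property: the memory state has a fixed size, so each new token costs the same amount of work no matter how long the history is.

```python
import numpy as np

def recurrent_step(S, z, k, v, q):
    """One constant-time update of a fixed-size memory state.
    S accumulates outer products k v^T (the scribe's running summary);
    z accumulates keys for normalization. Generic linear-attention
    recurrence -- an illustration, not Power Retention's actual formula."""
    S = S + np.outer(k, v)        # fold the new token into the summary
    z = z + k
    y = (q @ S) / (q @ z + 1e-9)  # read out with the current query
    return S, z, y

d = 4
rng = np.random.default_rng(1)
S, z = np.zeros((d, d)), np.zeros(d)
for _ in range(1_000):            # per-step cost is independent of history length
    k, v, q = (rng.standard_normal(d) for _ in range(3))
    S, z, y = recurrent_step(S, z, k, v, q)
print(S.shape)  # (4, 4): the state stays this size after 1,000 tokens or 1,000,000
```

Whether the loop runs 1,000 or 1,000,000 times, `S` never grows – that is the "constant-time per-token computation" in code.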
But does this efficiency come at the cost of intelligence? Manifest AI argues no. The "power" in Power Retention comes from its ability to use mathematical operations (tensor powers) that can capture complex, higher-order relationships between past and present information. This means it can theoretically remember long-term patterns just as well as attention, but far more efficiently, much like an efficient RNN but with the expressive power often associated with Transformers.
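A hedged sketch of the "power" idea: replace the key with its p-th tensor power, so the fixed-size state can record higher-order interactions between features, not just pairwise ones. This only illustrates the feature map; the released kernels are described as avoiding ever materializing these expanded features explicitly.

```python
import numpy as np

def power_features(k: np.ndarray, p: int) -> np.ndarray:
    """Flattened p-th tensor power of k. For p = 1 this is plain
    linear attention; p = 2 captures pairwise feature interactions;
    higher p captures higher-order ones. Illustrative only."""
    phi = k
    for _ in range(p - 1):
        phi = np.outer(phi, k).ravel()
    return phi

k = np.array([1.0, 2.0, 3.0])
print(power_features(k, 1).shape)  # (3,)  -- linear
print(power_features(k, 2).shape)  # (9,)  -- all pairwise products
print(power_features(k, 3).shape)  # (27,) -- all third-order products
```

Raising `p` grows the implicit feature space exponentially, which is how a recurrent state can, in principle, approach the expressiveness of attention while keeping per-token cost constant.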
One of the most astonishing aspects of Brumby's development is how it was trained. Manifest AI didn't build Brumby from scratch. They took an existing, powerful Transformer model (Qwen3-14B-Base) and essentially "retrained" it with their new Power Retention layers. This process, which took only 60 hours on 32 high-end GPUs and cost around $4,000, is significantly cheaper than training a comparable model from zero.
This retraining approach is key. It means they leveraged the immense knowledge already embedded in the Qwen3 weights. However, because the underlying architecture changed – the attention layers were swapped out – the model initially "forgot" some of its learned abilities. The retraining (about 3,000 steps) was like a quick crash course for a brilliant musician learning a new instrument. They already know music theory; they just need to learn how to play this specific guitar. By the end of this brief period, Brumby regained its performance, matching its Transformer predecessor while gaining the efficiency benefits of the new architecture.
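The recipe – keep the pretrained weights, swap each attention layer for a retention layer, then briefly resume training – can be sketched as follows. The class and function names here are illustrative stand-ins, not Manifest AI's actual API.

```python
# Hedged sketch of the retraining recipe: swap layers in place while
# carrying the pretrained weights over, then fine-tune briefly.
# All names below are hypothetical, not a real library's API.

class AttentionLayer:
    def __init__(self, weights):
        self.weights = weights

class RetentionLayer:
    def __init__(self, weights):
        self.weights = weights  # inherits the pretrained weights

def swap_attention_for_retention(layers):
    """Replace every attention layer, preserving its learned weights."""
    return [RetentionLayer(l.weights) if isinstance(l, AttentionLayer) else l
            for l in layers]

model = [AttentionLayer(w) for w in ("W0", "W1", "W2")]
model = swap_attention_for_retention(model)
print([type(l).__name__ for l in model])
# ...then fine-tune for ~3,000 steps to recover the temporarily lost abilities
```

The point of the sketch is that the weights survive the swap; only the mechanism that consumes them changes, which is why a short fine-tune suffices instead of full pretraining.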
This efficiency in retraining is crucial. As Jacob Buckman, founder of Manifest AI, explained, the ability to build upon existing models is a "critical accelerant" for adopting new paradigms. It means future advancements might not require rebuilding the entire AI from scratch, drastically lowering the barrier to entry for experimentation and innovation.
Brumby isn't an isolated effort; it's part of a growing wave of research exploring alternatives to the Transformer. Mamba is another promising architecture that aims to solve the long-context problem, using "selective state spaces." Like Power Retention, Mamba processes information linearly, avoiding the quadratic cost of attention. Both represent a move towards more computationally efficient sequence modeling, and comparing their performance and hardware utilization offers valuable insight into the future direction of AI architectures. The resurgence of Recurrent Neural Networks (RNNs), exemplified by the RWKV model, further underscores this trend, suggesting that classic sequential processing techniques, when modernized, can offer significant advantages for handling long data streams.
The $4,000 training cost is perhaps the most attention-grabbing claim from the Brumby release. Training a 14-billion-parameter model from scratch typically costs hundreds of thousands, if not millions, of dollars. If Manifest AI's claims hold true and can be replicated, this could democratize AI development on an unprecedented scale. Smaller research labs, startups, and even individual researchers could afford to experiment with and train powerful AI models. This accessibility is vital for fostering innovation and ensuring that the benefits of AI are not concentrated in the hands of a few tech giants.
This shift has profound implications for businesses. Imagine being able to fine-tune or even train custom LLMs for specific industry needs without astronomical upfront investment. This could lead to hyper-specialized AI solutions in fields like healthcare, law, finance, and creative arts, driving efficiency and creating new business models. The economic landscape of AI development is poised for a dramatic change, moving from an era of prohibitive costs to one of greater accessibility and faster iteration.
The efficiency of Power Retention isn't just about software; it's also about hardware. Manifest AI reports that their specialized kernels (pieces of code that run on GPUs) can offer hundreds-fold speedups for processing long contexts. They claim significantly higher hardware utilization compared to other efficient architectures like Mamba, meaning the GPUs are working harder and more effectively. This hardware-level optimization is crucial for real-world deployment, enabling faster inference (when the AI generates responses) and more efficient training.
The development of these optimized kernels and frameworks, such as their in-house Vidrial CUDA framework, highlights the symbiotic relationship between AI architecture innovation and hardware acceleration. As AI models become more complex and handle larger datasets, the demand for specialized hardware and efficient software to run them will only increase. Advances in areas like GPU and TPU acceleration are fundamental to unlocking the full potential of new architectures like Power Retention and Mamba. The ability to run these models efficiently on existing hardware, or with modest upgrades, will accelerate adoption across industries.
For businesses, this architectural shift signals a potential future of cheaper custom models, far longer usable contexts, and faster, more efficient inference.
The integration process for adopting these new architectures is also becoming simpler. Manifest AI suggests it can be as easy as "pip install retention, change one line of your architecture code, and resume training." This ease of adoption is critical for widespread acceptance. While integration with popular inference engines is still in progress, the direction is clear: more efficient, more capable AI is on the horizon.
The Brumby-14B-Base release is more than just a technical achievement; it's a potent signal that the Transformer's long reign might be nearing a challenge. By demonstrating performance parity with Transformers at a fraction of the cost and complexity, Manifest AI has opened a crack in the seemingly impenetrable wall of Transformer dominance. This suggests a future where AI architectures are more diverse, more efficient, and more accessible.
While the Transformer era is far from over – its established ecosystem and vast research base are undeniable strengths – the success of Power Retention, alongside other emerging architectures like Mamba, marks the beginning of a significant march towards more sustainable, scalable, and powerful AI. The coming years will likely see intense innovation in architectural design, driving down costs, expanding capabilities, and ultimately, reshaping how we build and use artificial intelligence across every facet of our lives.
A new AI architecture called "Power Retention," demonstrated by Manifest AI's Brumby-14B-Base model, offers a potentially cheaper and more efficient alternative to the dominant Transformer architecture. By ditching the costly "attention" mechanism, Power Retention allows AI to handle much longer contexts without immense computational cost, making advanced AI development more accessible and enabling new applications that require understanding vast amounts of data.