For years, the AI world has been dominated by a groundbreaking idea from a 2017 Google paper: "Attention Is All You Need." This paper introduced a mechanism called "attention" that became the backbone of nearly every major Artificial Intelligence (AI) model we use today, from the text generators we chat with to the tools that help us understand complex data. Think of attention as the AI's ability to look at all the information it's given and figure out which parts are most important to focus on. It's incredibly powerful and has led to the AI advancements we've seen explode recently.
However, even the most brilliant inventions have limits. The "attention" mechanism, while revolutionary, is also very demanding. As we ask AI models to process longer and longer pieces of information – like entire books, vast codebases, or hours of video – the computational cost and memory needed by attention grow quadratically: double the length of the input, and the work roughly quadruples. This means it becomes incredibly slow and expensive, acting like a bottleneck that's holding back further progress.
But what if there's a different way? Recently, a little-known AI startup called Manifest AI introduced a fascinating new approach with their model, Brumby-14B-Base. This model takes a leading open-source AI, Qwen3, and fundamentally changes how it "thinks" by getting rid of the attention mechanism altogether. Instead, Brumby uses a new technique called Power Retention. This is a big deal because it suggests we might be entering a "post-transformer" era, moving beyond the architecture that has defined AI for nearly a decade.
To understand why Brumby is so important, let's look closer at the problem with attention. When an AI uses attention, it's like every word in a sentence checking in with every other word to see how related they are. For a short sentence, this is quick. But for a long document, imagine every single word needing to compare itself to thousands or millions of other words. This is where the problem lies: the more information you give the AI (the longer the "context"), the quadratically more work the attention mechanism has to do. It's like trying to have a conversation where every person has to remember and compare everything everyone else has ever said – it quickly becomes unmanageable.
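To make the "every word compares itself with every other word" idea concrete, here is a deliberately naive toy sketch of self-attention in NumPy. This is an illustration of the scaling behavior, not Brumby's or Qwen3's actual code: for n tokens, the score matrix has n × n entries, which is where the quadratic cost comes from.

```python
import numpy as np

def naive_attention(x):
    """Toy self-attention: every token is compared with every other token.

    x: (n, d) array of n token vectors.
    The score matrix is (n, n), so compute and memory grow
    quadratically with sequence length n.
    """
    scores = x @ x.T / np.sqrt(x.shape[1])           # (n, n) pairwise comparisons
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over each row
    return weights @ x                               # each output mixes ALL tokens

n, d = 1000, 64
x = np.random.randn(n, d)
out = naive_attention(x)
# Doubling n quadruples the score matrix: 2000**2 == 4 * 1000**2
```

Real Transformer implementations add learned projections, multiple heads, and heavy kernel optimizations, but none of that changes the underlying n² comparison count.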
Manifest AI's Power Retention offers a different path. Instead of comparing every piece of information to every other piece, it uses a method more like a continuous flow. Imagine a flowing river: it carries information downstream, but it also has a way of summarizing or "retaining" what has passed. Power Retention does something similar. It keeps a compressed summary of past information in a fixed-size "memory" (called a latent state). As new information comes in, the model updates this summary. The key breakthrough is that the effort needed to process new information doesn't drastically increase with how much information it has already seen.
This means Brumby can handle arbitrarily long contexts – think of processing an entire historical archive or a complex scientific paper – with a consistent processing cost per piece of information. This is a monumental shift from attention, where cost balloons with length. Crucially, Power Retention doesn't sacrifice the AI's ability to understand complex relationships. By using clever mathematical techniques (involving "tensor powers," hence the name "power retention"), it can still grasp intricate, long-term dependencies, much like attention, but far more efficiently.
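The "fixed-size memory updated as new information arrives" idea can be sketched with the generic linear-attention-style recurrence that this family of architectures builds on. To be clear, this is a simplified stand-in, not Manifest AI's published Power Retention formula: the real method applies tensor powers of the keys to make the fixed state far more expressive. The point the sketch shows is that the cost per new token depends only on the state size, never on how many tokens came before.

```python
import numpy as np

def retention_step(state, k, v):
    """Fold one new token into a fixed-size memory (illustrative only).

    state: (d, d) compressed summary of everything seen so far.
    k, v:  (d,) key/value vectors for the new token.
    Per-token cost is O(d^2) – constant with respect to context length.
    """
    return state + np.outer(k, v)

def retention_read(state, q):
    """Query the compressed memory; again O(d^2) regardless of length."""
    return q @ state

d = 64
state = np.zeros((d, d))
for _ in range(10_000):              # arbitrarily long stream, constant per-step cost
    k, v = np.random.randn(d), np.random.randn(d)
    state = retention_step(state, k, v)
out = retention_read(state, np.random.randn(d))
```

Contrast this with the attention sketch above: there, every new token re-touches the whole history; here, the history lives only inside `state`, whose size never grows.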
Perhaps the most striking aspect of the Brumby-14B-Base release is its training cost. Manifest AI reported training this 14-billion-parameter model for just $4,000. To put this in perspective, training state-of-the-art models of similar size typically costs millions of dollars. This dramatic cost reduction is achieved by retraining an existing Transformer model rather than building one from scratch. While Brumby isn't a fully "from-scratch" foundation model in the traditional sense, this retraining approach is a crucial accelerant. It demonstrates that new architectures can achieve impressive results by building upon the knowledge embedded in existing models, at a fraction of the investment.
This economic efficiency has profound implications for the future of AI development. It suggests that groundbreaking research and development could become accessible to a much wider range of organizations, from smaller startups and academic labs to even individual researchers. This could democratize AI, fostering more diverse ideas and applications.
The $4,000 figure has drawn discussion, with some noting that it relies on reusing pre-trained weights. Manifest AI clarifies that this efficiency is precisely the point: leveraging existing knowledge makes adoption of new paradigms feasible at unprecedentedly low costs. Jacob Buckman, founder of Manifest AI, explained that the ability to "build on the weights of the previous generation of model architectures is a critical accelerant for the adoption of a new modeling paradigm." This means researchers can experiment and iterate much faster and more cheaply.
The emergence of architectures like Power Retention, alongside others like Mamba (which also achieves linear scaling, using a different state-space approach), signals a potential turning point. The era of Transformer dominance may be starting to evolve, with consequences for both businesses and society.
For businesses, this isn't just an academic discussion; it has real-world implications: lower training costs and cheap long-context processing translate directly into product and infrastructure decisions.
For society, the democratization of AI is particularly exciting. It means that the benefits of advanced AI could be more broadly shared. We might see AI tools tailored to local needs, support for underserved languages, or breakthroughs in scientific research happening at a faster pace, driven by a wider pool of innovators.
The AI landscape is evolving rapidly, and staying ahead requires a proactive approach.
The Transformer architecture has been a monumental achievement, ushering in the current golden age of AI. However, like all technologies, it has reached its current limits. The work by Manifest AI and others exploring alternatives like Power Retention and Mamba suggests that we are on the cusp of a new wave of AI innovation. This wave promises not only more powerful and capable AI but also more accessible, efficient, and democratized AI development. The journey beyond attention has begun, and its implications will shape the future of technology and society for years to come.