On October 28, 2025, the AI landscape witnessed a significant shift with the introduction of Brumby-14B-Base by Manifest AI, a relatively unknown startup that has made waves in the artificial intelligence community. This new model is a retrained variant of Qwen3-14B-Base, one of the leading open-source transformer models, but it comes with a groundbreaking twist: it completely abandons the attention mechanism that has been the cornerstone of transformer architectures since their inception.
The transformer architecture, introduced in the seminal 2017 paper “Attention Is All You Need” by Google researchers, revolutionized the field of natural language processing (NLP) and artificial intelligence (AI). It enabled models to process sequences of data by focusing on the most relevant parts of the input, allowing for unprecedented performance in tasks such as translation, summarization, and question-answering. However, as the demand for larger context lengths has grown—spanning documents, codebases, and even video streams—the limitations of attention mechanisms have become increasingly apparent. The computational and memory costs associated with attention scale quadratically with the length of the input, creating a bottleneck that poses challenges for both research and industry.
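To make the scaling problem concrete, the sketch below shows standard single-head scaled dot-product attention (a generic textbook form, not Qwen3's actual implementation): the score matrix alone holds seq_len × seq_len entries, so doubling the context length quadruples the cost of that step.

```python
# Standard single-head scaled dot-product attention (illustrative, not Qwen3's code).
import torch

def attention(q, k, v):
    # q, k, v: (seq_len, d) tensors for one head
    scores = q @ k.T / (q.shape[-1] ** 0.5)  # (seq_len, seq_len): quadratic in context length
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                        # (seq_len, d)

seq_len, d = 8192, 128
q, k, v = torch.randn(3, seq_len, d)
out = attention(q, k, v)  # doubling seq_len quadruples the FLOPs and memory spent on `scores`
```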
Brumby-14B-Base seeks to address these limitations through its innovative architecture, known as Power Retention. This novel approach leverages a recurrent, hardware-efficient design that allows the model to store and update information over arbitrarily long contexts without the quadratic growth in compute and memory that attention incurs as inputs lengthen. By replacing traditional attention layers with Power Retention, Manifest AI claims to have developed a model that not only matches the performance of established transformer models but does so at a fraction of the cost.
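Manifest AI's exact Power Retention equations are not reproduced here, but the core idea of a fixed-size recurrent state can be sketched with a generic linear-attention-style recurrence. The update rule and names below are illustrative assumptions, not the actual mechanism: the point is that per-token cost depends only on the state size, not on how much context has already been seen.

```python
# Illustrative recurrent state update (a generic linear-attention-style sketch,
# not Manifest AI's actual Power Retention equations).
import torch

def recurrent_step(S, q_t, k_t, v_t):
    # S: (d, d) running state; q_t, k_t, v_t: (d,) vectors for the current token.
    S = S + torch.outer(k_t, v_t)  # update cost is O(d^2), independent of context length
    y_t = q_t @ S                  # read-out for the current token
    return S, y_t

d = 128
S = torch.zeros(d, d)
for _ in range(100_000):           # arbitrarily long context, constant memory
    q_t, k_t, v_t = torch.randn(3, d)
    S, y_t = recurrent_step(S, q_t, k_t, v_t)
```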
The training process for Brumby-14B-Base was remarkably efficient. Manifest AI trained the model for just 60 hours using 32 Nvidia H100 GPUs, incurring a total cost of approximately $4,000. This figure represents less than 2% of what it would typically cost to train a conventional model of this scale from scratch. However, it is essential to note that this low training cost was made possible by leveraging the existing weights of the Qwen3 model. Jacob Buckman, the founder of Manifest AI, emphasized that while the ability to train for such a low cost is impressive, it relies on the foundation laid by previous transformer models. Brumby could not have been trained from scratch for that price.
The significance of this achievement lies in the potential for Power Retention systems to catch up to transformer performance with far lower investment. In the loss curves released by Manifest AI, Brumby’s training loss converged to the Qwen3 baseline within roughly 3,000 training steps, demonstrating how quickly the model adapted to its new architecture. Although Brumby-14B-Base began its life as Qwen3-14B-Base, it underwent a fundamental transformation: Manifest AI removed the attention layers that define how transformers process information and replaced them with the new Power Retention mechanism. This architectural change effectively restructured the model’s internal wiring, giving it a new operational framework while preserving much of its prior knowledge.
To illustrate this process, Buckman likened the transition to that of a world-class pianist learning to play the guitar. While the pianist possesses a deep understanding of music theory, rhythm, and melody, they must learn entirely new patterns to produce music on a different instrument. Similarly, Brumby had to relearn how to apply its existing knowledge through a new computational instrument. The retraining phase, consisting of about 3,000 additional training steps, served to recalibrate the model’s weights, aligning them with the Power Retention framework without starting from zero.
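A minimal sketch of what such a conversion could look like in PyTorch is shown below. `RetentionBlock`, `convert_to_retention`, and the `self_attn` attribute are hypothetical placeholders, since Manifest AI's actual conversion code is not reproduced in this article; the sketch only illustrates the idea of swapping the sequence-mixing sub-layer while keeping every other pretrained weight.

```python
# Hypothetical conversion sketch: replace each attention sub-layer with a retention
# sub-layer while keeping all other pretrained weights. Names are placeholders.
import torch.nn as nn

class RetentionBlock(nn.Module):
    """Stand-in for a recurrent sequence-mixing layer (real retention math omitted)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.proj(x)

def convert_to_retention(layers, d_model: int):
    for layer in layers:
        # MLPs, norms, and embeddings keep their pretrained weights;
        # only the attention sub-layer is swapped out.
        layer.self_attn = RetentionBlock(d_model)
    return layers
```

After such a swap, a short retraining run, on the order of the roughly 3,000 steps described above, lets the surrounding weights adapt to the new sequence-mixing layer rather than learning everything from scratch.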
The results of this retraining process are promising. Across standard evaluation tasks, Brumby-14B-Base consistently performs at or near parity with transformer baselines of comparable scale. In the ARC benchmark, Brumby achieved a score of 0.89 compared to Qwen3’s 0.94. On GSM8K, Brumby scored 0.88, surpassing Qwen3’s 0.84. Notably, Brumby outperformed Qwen3 in mathematical reasoning, scoring 0.62 on the MATH benchmark against Qwen3’s 0.54. However, it lagged notably behind on knowledge-heavy evaluations such as MMLU-Pro, where it scored 0.36 compared to Qwen3’s 0.55. This performance pattern reinforces the idea that recurrent or retention-based systems may hold a structural advantage for reasoning over extended temporal or logical dependencies.
One of the standout features of Brumby’s Power Retention design is its hardware efficiency. The state update mechanism involves only local matrix operations, allowing inference to be implemented with linear complexity in sequence length. Manifest AI reports that their fastest kernels, developed through their in-house CUDA framework Vidrial, can deliver hundreds-fold speedups over attention mechanisms when processing very long contexts. Buckman noted that the alpha-stage Power Retention kernels achieve typical hardware utilization rates of 80–85%, which surpasses the 70–75% utilization seen with FlashAttention-2 and the 50–60% utilization of Mamba, another emerging post-transformer architecture.
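The hardware-efficiency argument is easiest to see in a chunkwise form, where the sequence is processed one block at a time using dense chunk-local matrix multiplications plus a single small state update per chunk. The sketch below is a generic chunkwise linear recurrence written under those assumptions, not Manifest AI's Vidrial kernels; it shows why total cost grows linearly with sequence length while the work stays matmul-heavy and GPU-friendly.

```python
# Generic chunkwise recurrence sketch (illustrative only, not the Vidrial kernels):
# each chunk is handled with dense chunk-local matmuls plus one (d, d) state update,
# so cost grows linearly with sequence length.
import torch

def process_chunks(x, W_q, W_k, W_v, chunk_size=256):
    d = x.shape[-1]
    S = torch.zeros(d, d)                     # fixed-size carried state
    outputs = []
    for chunk in x.split(chunk_size):
        q, k, v = chunk @ W_q, chunk @ W_k, chunk @ W_v
        y = q @ S + torch.tril(q @ k.T) @ v   # past-state contribution + causal within-chunk mixing
        S = S + k.T @ v                       # state update: one (d, d) matmul per chunk
        outputs.append(y)
    return torch.cat(outputs)

d = 128
x = torch.randn(16_384, d)                    # long sequence, constant state size
W_q, W_k, W_v = (torch.randn(d, d) * d**-0.5 for _ in range(3))
y = process_chunks(x, W_q, W_k, W_v)
```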
The implications of Brumby-14B-Base extend beyond its immediate performance metrics. The model’s low training cost represents a two-order-of-magnitude reduction in the cost of foundation model development, potentially democratizing access to large-scale experimentation. Smaller research groups and organizations may now have the opportunity to retrain or repurpose existing transformer checkpoints without facing prohibitive compute costs. Buckman confirmed that the ease of retraining improves with scale, suggesting that as models grow larger, the number of steps required for successful retraining decreases.
Integration and deployment of Brumby-14B-Base are designed to be straightforward. Companies already engaged in retraining, post-training, or fine-tuning open-source models can easily convert an existing transformer into a Power Retention model. The process involves a simple command: “pip install retention,” followed by a minor adjustment to the architecture code. After only a small number of GPU-hours, the model typically recovers its original performance, gaining the efficiency benefits of the attention-free design.
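The article does not document the `retention` package's API, so the outline below is only a hedged sketch of how such a conversion might be scripted; the Hugging Face repository name and the conversion step are assumptions for illustration, not verified usage.

```python
# Hedged outline of the integration workflow (the `retention` package's API is not
# documented in the article; the conversion step below is a placeholder).
#
#   pip install retention
#
from transformers import AutoModelForCausalLM

# 1. Load an existing open-weight transformer checkpoint (repo name assumed here).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B-Base")

# 2. Swap the attention sub-layers for retention sub-layers (see the conversion
#    sketch earlier), keeping all other pretrained weights.

# 3. Retrain briefly -- the article reports roughly 3,000 steps on 32 H100s over
#    about 60 hours -- until the loss matches the transformer baseline.
```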
On the infrastructure side, the main Brumby kernels are written in Triton, making them compatible with both NVIDIA and AMD accelerators. Specialized CUDA kernels are also available through Manifest’s Vidrial framework. While integration with vLLM and other inference engines is still a work in progress, Buckman expressed confidence that the recurrent-state architecture would not exacerbate any instability concerns. In fact, he noted that context-parallel training and GPU partitioning for multi-user inference become significantly cleaner technically when using the Power Retention approach.
Beyond the technical details, Buckman articulated Manifest AI’s broader mission: to train a neural network capable of modeling all human output. The team aims to move beyond merely modeling “artifacts of intelligence” toward capturing “the intelligent processes that generated them.” This ambitious vision requires a fundamental rethinking of how models are designed and trained, and the introduction of Power Retention represents just the beginning of this journey.
The launch of Brumby-14B-Base has sparked considerable discussion within the AI community. Some researchers have raised questions about the framing of Manifest AI’s announcement, particularly regarding the “$4,000 foundation model” claim. Critics argue that the training involved reusing pretrained transformer weights rather than training from scratch, suggesting that the tagline may be misleading. Buckman responded to these concerns by clarifying that the initial tweet was part of a longer thread explaining the retraining approach. He acknowledged that while the claim is technically accurate, it challenges expectations about the costs associated with experimenting at the frontier of AI development.
In conclusion, the release of Brumby-14B-Base marks more than just an engineering milestone; it serves as a proof of concept that the dominance of transformers may finally face credible competition. By replacing attention with Power Retention, Manifest AI has demonstrated that achieving performance parity with state-of-the-art transformers is possible at a fraction of the computational cost. The long-context bottleneck that has plagued attention-based architectures can be addressed without the need for exotic hardware.
The broader implications of this development are twofold. First, the economics of training and serving large models could shift dramatically, lowering barriers to entry for open research and smaller organizations. Second, the architectural diversity of AI models may expand once again, reigniting theoretical and empirical exploration after years of transformer monoculture. As Buckman aptly stated, “The end of the transformer era is not yet here. Our release is just one step forward in a long march toward the future.” The journey toward more efficient and capable AI architectures has only just begun, and Brumby-14B-Base may well be a pivotal chapter in that ongoing narrative.
