AUI Launches Apollo-1, Promising Unmatched Reliability in Enterprise AI Task Execution

For over a decade, conversational AI has promised human-like assistants that can carry out complex tasks, not just chat. Despite significant advances in large language models (LLMs) such as ChatGPT, Claude, and Gemini, one critical problem remains largely unresolved: reliably executing real-world tasks outside of chat interfaces. Current models often behave inconsistently and fall short of the reliability that enterprises and users demand.

Recent evaluations show that even the most advanced AI models score poorly on benchmarks of task completion. For instance, top-performing models score only in the 30th percentile on Terminal-Bench Hard, a third-party benchmark that evaluates AI agents on a variety of terminal-based tasks. Similarly, on task-specific benchmarks such as TAU-Bench Airline, which measures how reliably AI agents find and book flights, even the leading models pass just 56% of the time, failing nearly half of their attempts.

In response to these challenges, Augmented Intelligence (AUI) Inc., a New York City-based startup co-founded by Ohad Elhelo and Ori Cohen, is introducing a model intended to make AI agents reliable enough that enterprises can trust them to perform as instructed. Its new foundation model, Apollo-1, is in preview with early testers and is slated for general release in November 2025. The model is built on a principle known as “stateful neuro-symbolic reasoning,” a hybrid architecture that combines the strengths of symbolic reasoning with the fluency of neural networks.

Elhelo, in a recent interview, articulated the dual nature of conversational AI, stating, “Conversational AI is essentially two halves. The first half—open-ended dialogue—is handled beautifully by LLMs. They’re designed for creative or exploratory use cases. The other half is task-oriented dialogue, where there’s always a specific goal behind the conversation. That half has remained unsolved because it requires certainty.” AUI defines this certainty as the distinction between an agent that “probably” performs a task and one that almost “always” does.

AUI reports strong benchmark results for Apollo-1. The model posts a 92.5% pass rate on the TAU-Bench Airline benchmark, compared with 56% for Claude 3.7 Sonnet. In live tests, Apollo-1 completed 83% of booking tasks on Google Flights, while Gemini 2.5-Flash completed 22%. In retail scenarios on Amazon, Apollo-1 completed 91% of tasks, compared with 17% for Rufus.

The architecture of Apollo-1 is designed to ensure deterministic behavior, addressing a fundamental limitation of traditional transformer models. While LLMs generate plausible text based on statistical patterns, Apollo-1 is engineered to predict the next action in a conversation, operating within what AUI refers to as a “typed symbolic state.” This approach lets Apollo-1 maintain a closed reasoning loop: an encoder translates natural language into a symbolic state, a state machine maintains that state, a decision engine determines the next action, a planner executes it, and a decoder converts the result back into language. The loop repeats until the task is completed, which AUI says yields a level of reliability that traditional models cannot match.
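To make the loop concrete, here is a minimal Python sketch of the five stages described above. It is an illustration only, not AUI's implementation: the BookingState class, the function names, the slot names, and the regex-based encoder are all hypothetical stand-ins, and a real system would presumably use neural components for encoding and decoding.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BookingState:
    """Hypothetical typed symbolic state for a flight-booking task."""
    origin: Optional[str] = None
    destination: Optional[str] = None
    confirmed: bool = False


def encode(utterance: str) -> dict:
    """Toy encoder: regexes stand in for the neural language-to-symbol step."""
    slots = {}
    origin = re.search(r"\bfrom ([A-Z]{3})\b", utterance)
    destination = re.search(r"\bto ([A-Z]{3})\b", utterance)
    if origin:
        slots["origin"] = origin.group(1)
    if destination:
        slots["destination"] = destination.group(1)
    if re.search(r"\byes\b", utterance, re.IGNORECASE):
        slots["confirmed"] = True
    return slots


def update_state(state: BookingState, slots: dict) -> None:
    """State machine step: write the extracted slots into the typed state."""
    for name, value in slots.items():
        setattr(state, name, value)


def decide(state: BookingState) -> str:
    """Decision step: the same state always yields the same next action."""
    if state.origin is None:
        return "ask_origin"
    if state.destination is None:
        return "ask_destination"
    if not state.confirmed:
        return "ask_confirmation"
    return "book_flight"


def execute(action: str, state: BookingState) -> dict:
    """Planner step: only 'book_flight' would call a (stubbed) booking tool."""
    if action == "book_flight":
        return {"status": "booked", "route": f"{state.origin}-{state.destination}"}
    return {"status": "pending"}


def decode(action: str, result: dict, state: BookingState) -> str:
    """Decoder step: fixed templates stand in for neural text generation."""
    if action == "book_flight":
        return f"Booked {result['route']}."
    if action == "ask_confirmation":
        return f"Book {state.origin} to {state.destination}? Reply yes to confirm."
    if action == "ask_destination":
        return "Where are you flying to?"
    return "Where are you flying from?"


def run_turn(state: BookingState, utterance: str) -> tuple:
    """One pass through the loop: encode, update, decide, execute, decode."""
    update_state(state, encode(utterance))
    action = decide(state)
    result = execute(action, state)
    return action, decode(action, result, state)


state = BookingState()
for text in ["I want to fly from JFK", "to LAX please", "yes"]:
    action, reply = run_turn(state, text)
    print(action, "|", reply)
```

The point of the sketch is that the next action is a pure function of the symbolic state, so the booking step cannot fire until every required slot is filled and confirmed, however the user phrases the request.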

Cohen elaborated on the significance of this neuro-symbolic approach, stating, “Neuro-symbolic means we’re merging the two dominant paradigms. The symbolic layer gives you structure—it knows what an intent, an entity, and a parameter are—while the neural layer provides language fluency. The neuro-symbolic reasoner sits between them. It’s a different kind of brain for dialogue.” This unique architecture enables Apollo-1 to execute tasks with a level of certainty that is crucial for enterprise applications.

Unlike conventional chatbots or bespoke automation systems, Apollo-1 is designed to function as a foundation model for task-oriented dialogue. It is a domain-agnostic system that can be configured for various industries, including banking, travel, retail, and insurance, through what AUI calls a “System Prompt.” This System Prompt is not merely a configuration file; it serves as a behavioral contract that defines precisely how the agent must behave in specific situations. Organizations can use this prompt to encode symbolic slots—intents, parameters, and policies—as well as tool boundaries and state-dependent rules.

For instance, a food delivery app might enforce a rule stating, “if allergy mentioned, always inform the restaurant,” while a telecom provider could define a policy such as, “after three failed payment attempts, suspend service.” In both cases, AUI says Apollo-1 executes the specified behavior deterministically rather than relying on statistical probabilities.
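As a rough illustration of what state-dependent rules like these could look like in code, the Python sketch below expresses the two example policies as explicit conditions over a state dictionary. AUI has not published the System Prompt format, so the Rule class, the state keys (allergy_mentioned, failed_payment_attempts), and the action names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical policy: if condition(state) holds, the action is mandatory."""
    name: str
    condition: Callable[[dict], bool]
    action: str


# The two example policies from the text, written as explicit predicates.
RULES = [
    Rule(
        name="allergy_notice",
        condition=lambda s: s.get("allergy_mentioned", False),
        action="inform_restaurant",
    ),
    Rule(
        name="payment_lockout",
        condition=lambda s: s.get("failed_payment_attempts", 0) >= 3,
        action="suspend_service",
    ),
]


def required_actions(state: dict) -> List[str]:
    """Return every mandatory action for this state, in rule order."""
    return [rule.action for rule in RULES if rule.condition(state)]


print(required_actions({"allergy_mentioned": True}))
print(required_actions({"failed_payment_attempts": 2}))
print(required_actions({"failed_payment_attempts": 3}))
```

Because each rule is a plain predicate rather than a sampled model output, the same state always triggers the same actions, which is the property the article attributes to Apollo-1's policies.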

The journey to develop Apollo-1 began in 2017 when AUI’s team started encoding millions of real task-oriented conversations handled by a workforce of 60,000 human agents. This extensive data collection led to the creation of a symbolic language capable of distinguishing between procedural knowledge—steps, constraints, and flows—and descriptive knowledge, which includes entities and attributes. Elhelo noted, “The insight was that task-oriented dialogue has universal procedural patterns. Food delivery, claims processing, and order management all share similar structures. Once you model that explicitly, you can compute over it deterministically.”

The results of AUI’s evaluations underscore the effectiveness of the neuro-symbolic architecture. Apollo-1 achieved over 90% task completion on the τ-Bench-Airline benchmark, compared with 60% for Claude-4. It also completed 83% of live booking chats on Google Flights, versus 22% for Gemini 2.5-Flash, and 91% of tasks in retail scenarios on Amazon, versus 17% for Rufus. Cohen emphasized that these differences are not merely incremental; he described them as order-of-magnitude improvements in reliability.

Importantly, AUI positions Apollo-1 not as a replacement for large language models but as a necessary complement. Elhelo explained, “Transformers optimize for creative probability. Apollo-1 optimizes for behavioral certainty. Together, they form the complete spectrum of conversational AI.” This complementary relationship allows organizations to leverage the strengths of both types of models, enhancing their overall AI capabilities.

Currently, Apollo-1 is undergoing limited pilot programs with undisclosed Fortune 500 companies across various sectors, including finance, travel, and retail. AUI has also confirmed a strategic partnership with Google, which will facilitate the integration of Apollo-1 into existing workflows. The company plans to make Apollo-1 generally available in November 2025, at which point it will open APIs, release full documentation, and introduce voice and image capabilities.

As the launch date approaches, AUI is keeping details under wraps, hinting at forthcoming announcements that may further illuminate the potential of Apollo-1. Elhelo teased, “Let’s just say we’re preparing an announcement. Soon.”

The overarching mission of AUI is clear: to democratize access to AI that works reliably in real-world applications. The promise of Apollo-1 is not just about creating AI that can converse but about building systems that businesses can trust to act decisively and accurately. Whether Apollo-1 will become the new standard for task-oriented dialogue remains to be seen, but if AUI’s architecture performs as anticipated, it could bridge the long-standing divide between chatbots that sound human and agents that can reliably perform human tasks.

In conclusion, the introduction of Apollo-1 marks an attempt to advance reliable enterprise AI agents. By combining the strengths of symbolic reasoning with the fluency of neural networks, AUI aims to set a new benchmark for task execution in conversational AI. As organizations increasingly seek dependable AI for their operations, Apollo-1 could prove to be the breakthrough that lets businesses put AI to work in their workflows with confidence.