Thinking Machines Builds Real-Time Interaction Models for Continuous AI Collaboration

Thinking Machines, the AI company founded by former OpenAI CTO Mira Murati, has announced it’s working on a new approach to how people interact with AI—something the company calls “interaction models.” The announcement, shared Monday alongside a detailed explanation of the concept, is less about another incremental upgrade to text generation and more about changing the rhythm of the conversation itself: how an AI perceives you, how it decides what to do next, and how quickly it can respond while you’re still speaking, moving, or thinking out loud.

At the center of the idea is a critique of today’s dominant model experience. In most current systems, the user effectively drives the interaction in discrete chunks. You type a message, or you speak until you stop, and only then does the model “turn on” its full attention to produce a response. That structure isn’t just a UX quirk—it shapes what the model can perceive and when it can act. Thinking Machines argues that this creates a fundamentally different kind of collaboration than the one we’re used to with other people.

The company’s framing is straightforward: today’s models experience reality in a single thread. Until the user finishes typing or speaking, the system waits. During that waiting period, it has no perception of what the user is doing or how the user is doing it. The result is an interaction that feels sequential rather than continuous. Even when the AI is fast, the “pause-and-respond” pattern makes the exchange feel like turn-taking between two separate processes rather than a shared, real-time activity.

Interaction models are proposed as a way to break that pattern. Instead of treating the user’s input as something that arrives all at once, the system would continuously take in audio, video, and text—then think, respond, and act in real time. The goal is not merely faster responses, but a different relationship between perception and action: the AI should be able to track what’s happening as it happens, and adjust its behavior without waiting for a formal “end” signal from the user.

That distinction matters because real communication is rarely tidy. When you talk to a person, you don’t wait for them to finish speaking before you start reacting. You listen while they speak, you notice changes in tone, you interpret pauses, and you often respond midstream—sometimes with words, sometimes with gestures, sometimes with a shift in attention. Thinking Machines is essentially aiming to bring that same continuity into AI interaction, where the model doesn’t just generate text after the fact, but participates as an ongoing collaborator.

What makes this announcement particularly notable is the multi-modal emphasis. Interaction models, as described by Thinking Machines, are designed to continuously ingest audio, video, and text. That combination suggests a system that can interpret not only what you say, but how you say it and what you’re doing while you say it. Video input introduces a layer of context that many conversational AI experiences currently ignore: facial expressions, body language, gaze direction, and environmental cues. Audio adds prosody and timing—information that can signal uncertainty, urgency, excitement, or confusion even when the words themselves are ambiguous. Text provides the explicit content. Together, these channels could allow the AI to maintain a richer, more stable understanding of the interaction over time.
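To make that multi-channel picture concrete, here is a deliberately simplified sketch of how continuously arriving audio, video, and text signals might be kept on one shared timeline. Every name and structure below is illustrative, invented for this article, and not a description of Thinking Machines’ actual system:

```python
# Hypothetical sketch: one way to represent continuously arriving multimodal
# signals on a shared timeline. All names are illustrative.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class PerceptEvent:
    """A single timestamped observation from one input channel."""
    t: float                                   # seconds since the session started
    channel: Literal["audio", "video", "text"]
    payload: dict                              # e.g. partial transcript, gaze estimate

@dataclass
class InteractionState:
    """Rolling model of the interaction, updated as events stream in."""
    events: list[PerceptEvent] = field(default_factory=list)
    current_intent: str | None = None          # best guess so far, always revisable
    confidence: float = 0.0

    def observe(self, event: PerceptEvent) -> None:
        self.events.append(event)
        # In a real system, intent and confidence would be re-estimated here
        # from the combined audio/video/text context, not from text alone.

state = InteractionState()
state.observe(PerceptEvent(t=0.4, channel="audio",
                           payload={"partial_transcript": "can you pull up"}))
```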

The company’s description also implies a shift in the internal mechanics of the model’s “attention” across the timeline of an interaction. If the system is continuously perceiving, it needs a way to decide what to do next while new information is still arriving. That’s a hard problem, because real-time perception creates a moving target: the user’s intent may evolve mid-sentence, and the AI’s best guess early in the interaction might be wrong once additional audio or visual cues arrive. An interaction model, in theory, would be built to revise its interpretation and update its actions as the stream continues.

This is where the concept becomes more than a marketing phrase. Continuous perception and real-time action require architectural choices that differ from the typical “single-thread” pipeline. In many existing systems, the model’s job is to produce an output given a completed prompt. Interaction models, by contrast, must operate like a system that is always “on,” constantly updating its internal state. That means the model must handle partial inputs, manage uncertainty, and coordinate timing so that responses feel natural rather than jittery or overly reactive.
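The announcement does not describe that machinery, but the general shape of an “always on” loop can be sketched. The toy example below assumes perception arrives as a stream of partial inputs and that a separate policy decides when a response is warranted; the function names and the trivial logic are invented for illustration, not drawn from Thinking Machines:

```python
# Minimal sketch of an "always on" interaction loop, under the assumption that
# perception and response share one continuously updated state. Everything
# here is a toy stand-in, not a real implementation.
import asyncio

async def perceive():
    """Toy stream of partial inputs (stand-in for live audio/video/text)."""
    for fragment in ["I want to", "I want to book", "I want to book a flight to Tokyo"]:
        await asyncio.sleep(0.2)               # new information keeps arriving
        yield fragment

def should_respond(partial: str) -> bool:
    """Toy policy: act once the partial input looks specific enough."""
    return "Tokyo" in partial

async def interaction_loop():
    async for partial in perceive():           # no explicit end-of-turn signal
        print(f"[state updated] best guess so far: {partial!r}")
        if should_respond(partial):
            print("[responding midstream] Checking flights to Tokyo...")
            break

asyncio.run(interaction_loop())
```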

If Thinking Machines succeeds at this, the user experience could change in ways that are difficult to fully capture in a static demo. Imagine speaking to an AI while walking through a room. As you point, gesture, or turn your head, the AI could adjust what it’s focusing on. Or consider a scenario where you’re explaining a problem while showing a screen: the AI could watch your actions and respond in sync with your explanation, rather than waiting for you to finish describing everything. In such cases, the AI isn’t just answering questions; it’s participating in the workflow.

There’s also a subtle but important implication for how people trust AI. When an AI responds only after you finish speaking, it can feel like it’s “catching up” rather than collaborating. Continuous interaction could make the AI feel more present—like it’s tracking you rather than processing you. That presence can reduce friction, but it also raises expectations. If the AI is always perceiving, users will naturally assume it understands more than it actually does. That creates a new responsibility for transparency and calibration: the system must communicate its confidence and limitations in a way that fits real-time interaction.

Thinking Machines’ announcement doesn’t spell out every technical detail, but the direction is clear enough to suggest where the biggest impact could land. Voice assistants are an obvious candidate, but the opportunity extends beyond consumer chat. Real-time interaction models could matter most in environments where timing and context are essential—settings where delays are costly, misunderstandings are expensive, or the user’s intent is dynamic.

Customer support is one example. A support agent doesn’t wait for a customer to finish typing before responding; they listen, ask clarifying questions, and react to what the customer is doing. An interaction model could potentially mirror that style, especially if it can interpret audio cues and visual context (for instance, a customer showing an error message on a device). Similarly, training and coaching applications—fitness, language learning, technical instruction—benefit from feedback that arrives while the learner is still performing the task.

Another area is creative and collaborative work. Writers, designers, and developers often iterate in bursts: they explain an idea, show a draft, react to feedback, and refine. If an AI can continuously perceive what’s happening—what’s being said, what’s being shown, and how the user is behaving—it could become a more fluid partner in the iteration cycle. Instead of “prompt, wait, response,” the interaction could resemble co-editing in real time.

There’s also a more strategic angle to this announcement. Many AI companies have spent the last year competing on benchmarks, model sizes, and text quality. Those improvements are real, but they don’t automatically solve the core mismatch between how humans communicate and how many AI systems currently operate. Interaction models represent a bet that the next leap won’t come solely from better language modeling, but from better integration of perception, timing, and action.

In other words, the company appears to be targeting the interface layer as much as the intelligence layer. That’s a meaningful shift. Intelligence that is impressive in a chat window can still feel awkward when it’s asked to participate in a live conversation with interruptions, overlapping speech, and evolving context. By focusing on continuous audio/video/text ingestion and real-time response, Thinking Machines is trying to align the AI’s operational behavior with human communication patterns.

The “single thread” critique is also worth unpacking. A single-thread experience doesn’t just mean the model waits; it means the model’s understanding is anchored to a moment after the user stops. That can lead to a kind of cognitive lag. Even if the model is accurate, the interaction can feel like it’s happening in hindsight. Interaction models aim to reduce that lag by letting the AI perceive the user’s ongoing state. That could improve not only responsiveness, but also relevance—because the AI can tailor its response to what the user is doing right now, not what they did right before they stopped.

However, continuous perception introduces its own challenges. Audio and video streams are noisy. People move, lighting changes, microphones pick up background sounds, and speech recognition can fluctuate. A robust interaction model must filter and interpret those signals reliably enough to be useful. It also must decide when to act and when to wait. Acting too early could lead to errors that feel intrusive; acting too late could recreate the old “wait for the end” problem. Striking the right balance is likely one of the hardest parts of building interaction models.
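One way to picture that balance is as an explicit act-or-wait policy driven by confidence and timing. The thresholds and categories below are invented purely to illustrate the trade-off the company describes:

```python
# Illustrative act-or-wait policy, assuming the system tracks a confidence
# estimate and how long it has been since the user last added new information.
# Thresholds are arbitrary and exist only for the sketch.

def decide(confidence: float, seconds_since_new_info: float) -> str:
    """Balance acting too early (intrusive errors) against waiting too long."""
    if confidence > 0.9:
        return "act"                 # strong evidence: respond even mid-utterance
    if confidence > 0.6 and seconds_since_new_info > 1.5:
        return "act"                 # moderate evidence plus a natural pause
    if seconds_since_new_info > 4.0:
        return "clarify"             # long silence with low confidence: ask, don't guess
    return "wait"                    # keep listening; new cues may change the intent

print(decide(0.95, 0.2))  # "act"
print(decide(0.70, 2.0))  # "act"
print(decide(0.30, 5.0))  # "clarify"
```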

There’s also the question of how the system handles interruptions. In natural conversation, people interrupt each other, correct themselves, and change direction. A continuous AI must be able to absorb those changes without derailing. That requires mechanisms for updating intent midstream and for maintaining coherence across a conversation that has no clean endpoint until the user decides to stop. The model must avoid getting stuck on earlier interpretations while still preserving enough context to stay coherent.
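A rough way to think about midstream revision is a tracker that keeps earlier interpretations around as context while the current best guess changes. Again, this is a conceptual sketch with invented names, not a claim about how the company implements it:

```python
# Sketch of midstream intent revision, assuming the system keeps earlier
# interpretations as context rather than discarding them outright.
from dataclasses import dataclass, field

@dataclass
class IntentTracker:
    history: list[str] = field(default_factory=list)   # earlier guesses, kept for coherence
    current: str | None = None

    def revise(self, new_intent: str) -> None:
        """Absorb a correction or change of direction without derailing."""
        if self.current is not None and new_intent != self.current:
            self.history.append(self.current)           # remember it, but stop acting on it
        self.current = new_intent

tracker = IntentTracker()
tracker.revise("book a flight to Tokyo")
tracker.revise("actually, make it Osaka")               # user interrupts and corrects
print(tracker.current)   # "actually, make it Osaka"
print(tracker.history)   # ["book a flight to Tokyo"]
```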

Thinking Machines’ emphasis on real-time “think, respond, and act” suggests the company is considering more than just conversational output. “Act” implies that the AI could trigger actions—perhaps controlling software, guiding tasks, or interacting with devices. If the AI can perceive continuously and act immediately, it becomes closer to an agent than a chatbot. That raises additional considerations around safety and control. Real-time action increases the stakes of mistakes, so the system likely needs guardrails that are sensitive to context and confidence.
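If “act” really does mean triggering actions in software or on devices, one plausible (and entirely hypothetical) guardrail is to gate each action on how reversible it is and how confident the system is at that moment:

```python
# Hypothetical guardrail check for real-time actions, assuming each candidate
# action carries a reversibility flag and the model reports a confidence score.
# This is a sketch of the idea, not Thinking Machines' safety design.

def allow_action(action: str, confidence: float, reversible: bool) -> bool:
    """Gate immediate actions more strictly when mistakes are hard to undo."""
    threshold = 0.7 if reversible else 0.95    # irreversible actions need near-certainty
    return confidence >= threshold

print(allow_action("highlight the relevant menu item", 0.8, reversible=True))   # True
print(allow_action("submit the order", 0.8, reversible=False))                  # False
```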

Even without knowing the exact product roadmap, the announcement signals a direction that many users will find compelling: AI that feels less like a tool you consult and more like a collaborator you can work alongside. That shift is not purely emotional; it changes how workflows are structured. If the AI can respond while you’re still explaining, you can compress steps. You can iterate faster.