Uber Questions Value of AI Spend as Link to Customer Features Proves Hard to Establish – Superintelligence Digest

Uber’s AI spending is no longer just a question of whether the company can build smarter systems—it’s becoming a question of whether those systems can be tied to outcomes that customers actually notice. In a candid interview with Rapid Response, Uber president and chief operating officer Andrew Macdonald suggested that the internal story connecting AI investment to deliverable product value is still incomplete, even as usage metrics continue to rise.

The tension at the center of his comments is familiar across the tech industry, but it lands differently at Uber because the company’s AI work isn’t happening in a lab. It’s embedded in a business where reliability, speed, and customer experience are measurable every day—where “token consumption” can climb without necessarily producing a clear, externally visible improvement. Macdonald’s point wasn’t that AI is failing. It was that the line between spending and impact is getting harder to draw, and that makes justification more difficult as budgets tighten.

What Uber is seeing: more AI usage, unclear customer payoff

According to Macdonald, Uber has observed increased token consumption for Claude Code, an AI coding tool used internally. Tokens are often treated as a proxy for activity: more tokens can mean more requests, more experimentation, or more reliance on AI systems during development. But Macdonald emphasized that higher usage hasn’t yet translated into a straightforward relationship with more useful features being shipped to consumers.

In other words, Uber is not simply asking “are we using AI?” The question is “are we using AI in a way that demonstrably improves the product?” And right now, he implied, the answer is murkier than executives would like.

He described the challenge of attribution—how to connect one set of internal metrics to another set of outcomes. Even if AI tools are helping teams move faster, reduce friction, or generate code more efficiently, the causal chain to customer value can be indirect. Features may take time to develop, roll out gradually, and be influenced by many factors beyond AI assistance. Meanwhile, internal metrics like token usage can spike for reasons that don’t necessarily correspond to meaningful progress: teams might be testing, refactoring, or exploring ideas that never reach production.

Macdonald’s framing suggests Uber is wrestling with a problem that many companies hit after the initial wave of AI adoption: the early phase is about capability and throughput; the next phase is about measurable impact. The first phase can be tracked with usage and experimentation. The second phase requires a measurement framework that can survive real-world complexity.

Why “drawing a line” is so hard in practice

Attribution is difficult for reasons that go beyond simple measurement. AI systems can influence development in multiple ways that don’t show up cleanly in a single metric.

For example, AI coding tools can:
1) accelerate routine tasks (documentation, boilerplate, test generation)
2) help engineers explore alternative implementations
3) reduce time spent debugging or searching for patterns
4) increase the number of iterations teams can afford
5) improve code quality indirectly through better test coverage or refactoring suggestions

Any of these could contribute to better customer outcomes, but none of them guarantees a direct, immediate link between token usage and a specific feature improvement. A team might use AI heavily to generate prototypes, but only a subset becomes production-ready. Another team might use AI less but ship a high-impact feature. If you only look at tokens, you might wrongly conclude that AI is “working” or “not working.”

There’s also the issue of time lag. Even if AI accelerates development, customer-facing improvements might appear weeks or months later. Token consumption can be a leading indicator, but leading indicators are only useful if you can map them to downstream results with confidence. Macdonald’s comments imply Uber hasn’t yet built that mapping well enough to satisfy internal scrutiny.
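One way to picture the mapping problem is a lagged correlation: compare each week's token usage with a customer-facing metric from a few weeks later and see which delay lines up best. The sketch below is a minimal illustration, not a description of anything Uber does. The function names and both data series are invented for the example, and the outcome series is deliberately constructed to trail usage by three weeks. A strong correlation at some lag would still not establish causation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlations(leading, outcome, max_lag):
    """Correlate leading[t] with outcome[t + lag] for each lag from 0 to max_lag.

    Assumes both series have the same length and the same weekly spacing.
    """
    results = {}
    for lag in range(max_lag + 1):
        xs = leading[: len(leading) - lag]
        ys = outcome[lag:]
        if len(xs) < 3:  # too few overlapping points to say anything
            break
        results[lag] = pearson(xs, ys)
    return results


# Made-up weekly series, for illustration only (not Uber data).
weekly_tokens = [10, 12, 15, 11, 18, 20, 17, 25, 22, 30]
# Synthetic outcome built to trail token usage by exactly three weeks.
weekly_outcome = [5, 5, 5] + [2 * t + 1 for t in weekly_tokens[:-3]]

correlations = lagged_correlations(weekly_tokens, weekly_outcome, max_lag=4)
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

In real data the relationship would be far noisier, and the best-fitting lag could differ by team or product area, which is part of why a usage metric alone is hard to defend as a leading indicator.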

And then there’s the human factor. AI tools can change how teams work, but they don’t eliminate decision-making. Product priorities, engineering tradeoffs, and operational constraints still determine what gets shipped. If a company wants to justify AI spend, it needs to show that AI isn’t just increasing activity—it’s improving decisions and outcomes.

The budget pressure behind the conversation

Uber’s skepticism comes at a moment when AI spending is under heightened scrutiny across the industry. The Verge reported that Uber exhausted its annual AI budget just four months into 2026. Whether that reflects a strict “burn rate” or a broader accounting reality, the headline itself signals urgency: AI costs are rising quickly, and leadership wants to know what those costs buy.

This is where Macdonald’s comments become more than a technical observation. They’re a signal that Uber is moving from “AI as experimentation” to “AI as a portfolio with accountability.” When budgets are consumed early, the organization can’t treat AI spend as an open-ended R&D line. It has to decide what to scale, what to constrain, and what to stop.

But scaling AI isn’t just about cost control. It’s about proving that the investment is producing value that outweighs alternatives. If the company can’t clearly connect internal AI metrics to consumer-visible improvements, it becomes easier for stakeholders to question whether the spend is justified—or whether the company should shift toward different approaches, such as smaller models, more targeted workflows, or tighter integration into specific product areas.

A unique challenge for Uber: AI inside a marketplace

Uber’s environment adds another layer of complexity. Unlike a software company building a single product surface, Uber operates a dynamic marketplace with real-time constraints: matching supply and demand, managing pricing and incentives, handling fraud and safety issues, optimizing routing, and maintaining service quality across regions.

AI can help in many of these areas, but the outcomes are often probabilistic and system-level. A small improvement in prediction accuracy might reduce cancellations or improve ETA reliability, but the effect might be distributed across many components rather than concentrated in one feature. That makes it harder to say, “this token spend produced this measurable customer benefit.”

Even if AI coding tools help engineers build better systems, the customer impact might show up as:
– fewer incidents
– improved reliability
– better matching performance
– reduced wait times
– smoother rider/driver experiences
– improved support outcomes

Those are valuable, but they’re not always easy to attribute to a specific AI-driven development workflow. The measurement problem becomes organizational: you need instrumentation, baselines, and causal reasoning—not just dashboards.

So when Macdonald says it’s hard to draw a line between internal stats and consumer value, he’s describing a structural issue: Uber’s AI work touches many parts of a complex system, and the customer experience is the result of interactions across those parts.

What “more shipped” might mean—and why it still isn’t enough

Macdonald suggested that more may implicitly be getting shipped, but that it’s difficult to connect the stats to a clear claim like “we’re producing 25 percent more useful consumer features.” That distinction matters.

Many companies can measure shipping velocity: how many pull requests, how many deployments, how many features. But “useful” is a different standard. Useful implies user impact, not just output volume. It implies that the features shipped are better aligned with user needs, reduce friction, improve outcomes, or solve problems customers feel.

AI can increase output, but usefulness depends on product judgment, experimentation design, and feedback loops. If AI helps teams ship more, but the additional shipping doesn’t translate into better user outcomes, then the ROI story weakens.

This is why the internal debate is likely shifting toward outcome-based metrics. Instead of asking whether AI increases throughput, Uber likely needs to ask whether AI improves:
– conversion rates
– retention
– satisfaction scores
– incident rates
– time-to-resolution for support
– operational efficiency that affects customer experience
– reliability metrics that riders and drivers feel

The challenge is that these metrics are influenced by many variables. Seasonality, market conditions, policy changes, and infrastructure updates all affect them. To justify AI spend, Uber needs to isolate the contribution of AI-driven development and deployment.
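A common way to reason about isolating one factor among many is a difference-in-differences comparison: measure how a metric changed for teams that adopted an AI workflow, then subtract the change seen over the same period in comparable teams that did not. The sketch below is illustrative only. The function name and all numbers are hypothetical, and the method rests on the assumption that both groups would have followed parallel trends without the intervention, which is itself hard to verify.

```python
from statistics import mean


def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the control group.

    Shared shocks (seasonality, market conditions) hit both groups and cancel
    out, provided the two groups would otherwise have moved in parallel.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical incident counts per month, invented for illustration.
ai_teams_before = [10.0, 12.0]
ai_teams_after = [8.0, 8.0]
other_teams_before = [11.0, 11.0]
other_teams_after = [10.0, 10.0]

effect = difference_in_differences(
    ai_teams_before, ai_teams_after, other_teams_before, other_teams_after
)
print(effect)  # -2.0: two fewer incidents per month beyond the shared trend
```

Here incidents fell by three for the AI-assisted teams and by one for the others, so only the remaining two are attributed to the workflow change, and only if the parallel-trends assumption holds.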

A broader industry pattern: from “AI adoption” to “AI governance”

Uber’s comments fit a larger shift happening across enterprises. Early AI adoption often focuses on capability: Can the model help? Can the tool generate code? Can it summarize documents? Can it automate tasks?

But as costs rise and expectations grow, companies increasingly focus on governance:
– cost controls and budgeting
– model selection and routing strategies
– evaluation frameworks
– risk management and compliance
– measurement of business impact

Macdonald’s statement reads like governance in action. He’s not rejecting AI. He’s questioning whether the current measurement approach is sufficient to justify continued spend at the same level.

That’s a subtle but important difference. Many headlines about AI spending flatten the nuance by implying that AI is “not working.” Uber’s framing suggests something more realistic: AI may be working, but the organization hasn’t yet built a reliable method to prove that it’s working in the way leadership cares about.

What Uber may do next: tighter evaluation, narrower bets, better attribution

While the interview doesn’t lay out a roadmap, the logic of Macdonald’s comments points toward likely next steps.

First, Uber will probably invest more in evaluation. Not just model evaluation in isolation, but workflow evaluation: which AI-assisted processes lead to better outcomes. For example, does AI help most when generating tests? When drafting code for specific modules? When assisting with debugging? When summarizing logs? Each of these could have different ROI profiles.

Second, Uber may narrow the scope of AI usage. If token consumption is rising but outcomes aren’t clearly improving, leadership may push teams to use AI more selectively—only where it has demonstrated impact. That could mean:
– restricting AI tool access to certain workflows
– requiring approvals for high-cost usage
– implementing cost-aware routing (using cheaper models when possible)
– setting targets for measurable improvements rather than raw usage
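Cost-aware routing and approval gates can be sketched in a few lines: pick the cheapest model tier that is capable enough for the task, and escalate when even that choice would exceed the remaining budget. Everything in this example is an assumption made for illustration, including the tier names, prices, and capability scores. It does not reflect Uber's tooling or any vendor's actual pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                  # hypothetical tier label
    cost_per_1k_tokens: float  # hypothetical price, arbitrary currency units
    capability: int            # hypothetical score: 1 (basic) to 3 (strongest)


# Invented tiers; real catalogs and prices will differ.
TIERS = [
    ModelTier("small", 0.5, 1),
    ModelTier("medium", 2.0, 2),
    ModelTier("large", 10.0, 3),
]


def route(required_capability, estimated_tokens, budget_remaining):
    """Return (tier, estimated_cost) for the cheapest adequate tier.

    Returns (None, estimated_cost) when no tier is capable enough or the
    cheapest adequate tier exceeds the remaining budget, signalling that the
    request needs manual approval.
    """
    capable = [t for t in TIERS if t.capability >= required_capability]
    if not capable:
        return None, 0.0
    cheapest = min(capable, key=lambda t: t.cost_per_1k_tokens)
    cost = cheapest.cost_per_1k_tokens * estimated_tokens / 1000
    if cost > budget_remaining:
        return None, cost
    return cheapest, cost


tier, cost = route(required_capability=2, estimated_tokens=1000, budget_remaining=5.0)
print(tier.name, cost)
```

The hard part in practice is the input this sketch takes for granted: deciding how much capability a given task really requires.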

Third, Uber may strengthen causal measurement. This could involve:
– A/B testing AI-assisted development changes where feasible
– tracking feature-level outcomes back to development workflows
– building internal “AI
