Business Flywheels: Use AI to Autonomously Improve Performance
Most businesses using AI are generating the data to improve it — and not using it.
Andrej Karpathy left his computer running for two days and went to relax. He came back to 700 experiments, 20 real improvements, and an 11% performance gain on a benchmark he'd been manually tuning for two decades.
Karpathy helped build modern AI at Tesla and OpenAI. Even he was surprised. "Seeing the agent do this entire workflow end-to-end and all by itself is wild," he wrote.
He called the tool autoresearch. It's open source and built for AI researchers. But he ended his announcement talking to everyone else:
"Any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too."
I think most problems do.
The short version
Autoresearch works on the model itself, not the outputs. An agent modifies the training code (architecture, hyperparameters, optimizer, loss function), retrains, checks if the metric improved, keeps or discards, and repeats. You wake up to a better model.
To use this pattern you need two things: a model you're fine-tuning for a specific task, and historical data with labeled outcomes to evaluate against. The agent runs experiments against that evaluation set the same way Karpathy's runs against validation loss.
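In code, the loop itself is small. Here's a minimal sketch of the shape, not Karpathy's implementation: `train_and_evaluate` and `propose_change` are placeholders for your real training code and for whatever change an agent proposes (autoresearch edits the training code itself, not just hyperparameters).

```python
import random

def train_and_evaluate(config, eval_set):
    """Placeholder: retrain with `config` and score on the held-out evaluation set.
    In practice this is your real training code and your real metric."""
    return random.uniform(0.70, 0.80)   # stand-in for accuracy, reply rate, AUC ...

def propose_change(config):
    """Placeholder: an agent would propose an edit to the training setup here;
    we just perturb two hyperparameters to show the shape of the loop."""
    candidate = dict(config)
    candidate["learning_rate"] = config["learning_rate"] * random.choice([0.5, 1.0, 2.0])
    candidate["weight_decay"] = random.choice([0.0, 0.01, 0.1])
    return candidate

eval_set = None                          # your historical, labeled outcomes
best_config = {"learning_rate": 3e-4, "weight_decay": 0.01}
best_score = train_and_evaluate(best_config, eval_set)

for i in range(700):                     # run overnight; each iteration is one experiment
    candidate = propose_change(best_config)
    score = train_and_evaluate(candidate, eval_set)
    if score > best_score:               # keep improvements, discard everything else
        best_config, best_score = candidate, score
        print(f"experiment {i}: kept, new best score {score:.4f}")

print("best config:", best_config)
```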
If you're calling OpenAI or Anthropic's API, this doesn't apply directly. You don't control the training code. But that's worth examining on its own. You're paying per-token for a general-purpose model when you could be running a smaller, specialized model that you fine-tune for your specific task.
A single GPU on a cloud instance running an open-weight model like Llama or Mistral is enough to get started. That's the setup Karpathy used. It's more infrastructure than calling an API, but you own the training code, which means you can actually improve it.
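If the infrastructure sounds heavier than it is: loading an open-weight model for fine-tuning is a few lines with Hugging Face transformers. The model name and settings below are illustrative, not a recommendation.

```python
# Assumes: pip install transformers accelerate torch, and a GPU with enough memory
# for whichever open-weight model you pick.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"   # any open-weight model you have access to

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # half precision to fit on a single GPU
    device_map="auto",            # let accelerate place layers on available devices
)

# From here the training code is yours to modify: the loss, the data mix, the
# fine-tuning method. That is exactly what an agent can experiment with.
```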
Here are 8 places where that loop is waiting.
1. Support ticket routing
You have an AI classifying tickets by product area, urgency, or team. Every day, you get ground truth: was it routed correctly? How fast did it get resolved? Did it escalate?
Most companies treat each misrouted ticket as a one-off annoyance. But every correction is a labeled training example. An agent could use those corrections to retrain the classification model overnight.
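A sketch of what that overnight job could look like, assuming you log every human re-route as a (ticket text, correct team) pair. scikit-learn stands in for whatever classifier you actually run.

```python
# Nightly retrain of a ticket-routing classifier from human corrections.
# Assumes a corrections log of (ticket_text, correct_team) pairs; sklearn is a stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

corrections = [
    ("Invoice charged twice this month", "billing"),
    ("App crashes when exporting a report", "bugs"),
    ("How do I add a teammate to my plan?", "account"),
    ("Payment failed but card is valid", "billing"),
    # ... every human re-route from the last N days
]

texts, teams = zip(*corrections)
X_train, X_test, y_train, y_test = train_test_split(texts, teams, test_size=0.25, random_state=0)

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(X_train, y_train)

# Only promote the new model if it beats yesterday's on held-out corrections.
print("accuracy on held-out corrections:", accuracy_score(y_test, router.predict(X_test)))
```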
2. Lead scoring
Your CRM has a model predicting which leads are worth pursuing. It was probably trained on data from 12-18 months ago and nobody has touched it since.
Every closed deal is new signal. Buying patterns from last year don't predict this year's customers. A model that retrains on recent outcomes stays accurate. One that doesn't retrain drifts quietly while your sales team wonders why conversions are dropping.
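One way to make the drift visible, assuming a table of closed leads with features, a won/lost label, and a close date (all column and file names below are illustrative): train one model on the old window and one on a recent window, then score both on the latest quarter.

```python
# Sketch: quantify lead-scoring drift by comparing an old-window model and a
# recent-window model on the latest quarter's outcomes. Column names are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

leads = pd.read_csv("closed_leads.csv", parse_dates=["closed_at"])  # hypothetical export
features = ["employee_count", "pages_visited", "demo_requested", "days_to_first_reply"]

latest = leads[leads["closed_at"] >= "2025-07-01"]            # evaluation: most recent quarter
old_window = leads[leads["closed_at"] < "2024-07-01"]         # what the current model was trained on
recent_window = leads[leads["closed_at"].between("2024-07-01", "2025-06-30")]

for name, window in [("stale model", old_window), ("retrained model", recent_window)]:
    model = GradientBoostingClassifier().fit(window[features], window["won"])
    auc = roc_auc_score(latest["won"], model.predict_proba(latest[features])[:, 1])
    print(f"{name}: AUC on latest quarter = {auc:.3f}")
```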
3. Email personalization
Open rate, reply rate, revenue per send: about as clean as feedback signals get.
If you have an LLM generating outreach emails, you're sitting on thousands of labeled examples: sends paired with reply outcomes. That's a fine-tuning dataset. An agent can experiment with how the model is trained (adjusting the architecture, the loss function, how it weighs positive vs negative examples) and retrain overnight against your historical reply data.
You're building a model that's better at understanding what makes someone respond.
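A sketch of the data-preparation half, assuming your outbound tool can export each send with a replied flag. The field names and the 3:1 weighting are assumptions; dropping non-replies entirely, or using them as negatives in a preference-tuning setup, are equally reasonable choices.

```python
# Sketch: turn historical sends + reply outcomes into a weighted fine-tuning dataset.
# Field names and the 3:1 weighting are assumptions; the point is that replies and
# non-replies both carry signal, just not equally.
import json

sends = [
    {"prospect": "VP Eng at a 200-person fintech", "email": "Hi Dana, saw your team ...", "replied": True},
    {"prospect": "Head of Ops at a logistics startup", "email": "Hi Sam, quick question ...", "replied": False},
    # ... thousands more from your outbound tool
]

with open("outreach_finetune.jsonl", "w") as f:
    for s in sends:
        record = {
            "prompt": f"Write a cold email to: {s['prospect']}",
            "completion": s["email"],
            "weight": 3.0 if s["replied"] else 1.0,   # up-weight emails that got a reply
        }
        f.write(json.dumps(record) + "\n")
```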
4. Contract and document review
Legal teams reviewing AI-extracted clauses generate feedback constantly. Every correction, edit, and approval tells you where the model got it wrong.
Most companies capture none of it. The model makes the same mistakes in month six that it made in month one because nobody connected the lawyer's red ink to the training pipeline. Wire that loop and the model learns your organization's contracts and edge cases instead of staying generic.
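Capture is usually the missing piece, not modeling. A sketch, assuming your review tool can call a hook when a lawyer saves an edited clause; the event shape is hypothetical.

```python
# Sketch: turn each lawyer's edit into a labeled training example instead of losing it.
# The record structure is hypothetical; the idea is model output + human correction, stored.
import json
from datetime import datetime, timezone

def record_correction(document_id, clause_type, model_extraction, lawyer_final,
                      path="clause_corrections.jsonl"):
    example = {
        "document_id": document_id,
        "clause_type": clause_type,
        "model_output": model_extraction,      # what the AI extracted
        "corrected_output": lawyer_final,      # what the lawyer actually approved
        "accepted_as_is": model_extraction == lawyer_final,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(example) + "\n")

# Called from the review tool's "save" handler:
record_correction(
    document_id="msa-2025-0142",
    clause_type="limitation_of_liability",
    model_extraction="Liability capped at fees paid in the prior 6 months.",
    lawyer_final="Liability capped at fees paid in the prior 12 months, excluding indemnification.",
)
```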
5. Internal knowledge search (RAG)
An internal chatbot that pulls from your company's documents has one honest feedback signal: did the employee get what they needed?
Did they click the result? Did they ask a follow-up immediately (that usually means the answer was wrong), or walk over to someone's desk instead?
Almost nobody uses this data after launch. The retrieval model, the embedding weights, the re-ranker: all frozen at whatever someone configured during the pilot. An agent could retrain the retrieval model against real usage data overnight. That's the most honest evaluation set you have.
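A sketch of turning those signals into an evaluation set, assuming your query log records the clicked document and whether a follow-up came within a few minutes (treated here as a failure signal). The log fields and the recall@5 choice are assumptions.

```python
# Sketch: build a retrieval eval set from real usage and score a retriever with recall@k.
# `retrieve` is a placeholder for your actual retrieval stack; log fields are assumptions.
def retrieve(query, k=5):
    """Placeholder: return the top-k document IDs from your retriever / re-ranker."""
    return ["doc-17", "doc-03", "doc-88", "doc-41", "doc-09"]

usage_log = [
    {"query": "how do I file travel expenses", "clicked_doc": "doc-03", "follow_up_within_5m": False},
    {"query": "parental leave policy for contractors", "clicked_doc": "doc-55", "follow_up_within_5m": True},
    # ... every query since launch
]

# Keep only interactions where the employee plausibly got what they needed.
eval_set = [row for row in usage_log if row["clicked_doc"] and not row["follow_up_within_5m"]]

hits = sum(row["clicked_doc"] in retrieve(row["query"], k=5) for row in eval_set)
print(f"recall@5 on real usage: {hits / len(eval_set):.2%}")
```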
6. Ad copy generation
If AI is generating your ad variations, you already have what you need: CTR and ROAS.
You've got a clear metric and a growing pile of labeled outcomes. An agent can retrain the model that generates your ad copy against that conversion data, experimenting with the training setup until it gets measurably better at producing copy that converts. The feedback is unambiguous: each ad either converted or it didn't.
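A sketch of the dataset side, assuming you can pull per-variant CTR by campaign: pair each campaign's best and worst performer into preference data that a fine-tuning or preference-tuning run can consume. Field names and the output format are assumptions.

```python
# Sketch: turn per-variant CTR into preference pairs (chosen vs. rejected copy per campaign).
# Field names are assumptions; the output is one common shape for preference tuning.
import json
from collections import defaultdict

variants = [
    {"campaign": "spring-sale", "copy": "Last chance: 30% off ends tonight", "ctr": 0.041},
    {"campaign": "spring-sale", "copy": "Our spring sale is now live", "ctr": 0.012},
    {"campaign": "onboarding", "copy": "Set up your workspace in 5 minutes", "ctr": 0.033},
    {"campaign": "onboarding", "copy": "Welcome! Here's what to do next", "ctr": 0.019},
]

by_campaign = defaultdict(list)
for v in variants:
    by_campaign[v["campaign"]].append(v)

with open("ad_copy_preferences.jsonl", "w") as f:
    for campaign, vs in by_campaign.items():
        vs.sort(key=lambda v: v["ctr"], reverse=True)
        pair = {"prompt": f"Write ad copy for campaign: {campaign}",
                "chosen": vs[0]["copy"], "rejected": vs[-1]["copy"]}
        f.write(json.dumps(pair) + "\n")
```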
7. Code review and PR summarization
Dismissed AI suggestions, edited summaries, post-merge bugs from code that passed AI review — these are all labeled outcomes the model behind your code review tool never sees.
Most code review tools don't capture any of it. The model behind the suggestions stays static, calibrated to generic best practices rather than your codebase, your team's standards, or the patterns that have historically broken things.
The feedback loop here is slower (outcomes take days or weeks to materialize). But those dismissed suggestions and post-merge bugs are labeled data. An agent could retrain the model against your team's actual preferences overnight.
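A sketch of what "labeled data" means here, assuming you can export the AI's suggestions and later join them to what happened: accepted, dismissed, or dismissed and then implicated in a bug. The schema is hypothetical.

```python
# Sketch: join AI review suggestions to their eventual outcome so they become labeled examples.
# The schema is hypothetical; the label is what the team (or a later bug) decided.
suggestions = [
    {"pr": 1412, "file": "billing/invoice.py", "suggestion": "Possible off-by-one in proration loop"},
    {"pr": 1412, "file": "billing/invoice.py", "suggestion": "Rename `tmp` for clarity"},
    {"pr": 1430, "file": "api/auth.py", "suggestion": "Token expiry not checked on refresh"},
]

outcomes = {
    (1412, "Possible off-by-one in proration loop"): "accepted",
    (1412, "Rename `tmp` for clarity"): "dismissed",
    (1430, "Token expiry not checked on refresh"): "dismissed_then_caused_bug",  # the expensive label
}

labeled = [
    {**s, "label": outcomes.get((s["pr"], s["suggestion"]), "unknown")}
    for s in suggestions
]
for example in labeled:
    print(example["label"], "|", example["suggestion"])
```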
8. Churn prediction
A churn model trained last year doesn't know what this year's churn looks like. The customers at risk today behave differently than the ones who left 18 months ago because usage patterns and products have shifted underneath the model.
Every customer who churned (or didn't) after being flagged is a labeled outcome. Without regular retraining, your customer success team ends up chasing the wrong accounts while the actual at-risk customers leave quietly.
Getting this right means earlier warnings. Earlier warnings mean more options: discounts, check-ins, or feature introductions before someone has already decided to go.
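A sketch under the usual assumptions (a per-customer-month feature table with a churned-within-90-days label; column and file names are illustrative): retrain on a recent window, rank accounts by risk, and check whether the top of the list is who actually left.

```python
# Sketch: rolling-window churn retrain, evaluated as "of the top-N accounts we would have
# flagged, how many actually churned". Column names and the CSV are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("customer_months.csv", parse_dates=["month"])   # hypothetical export
features = ["logins_30d", "seats_active", "tickets_open", "days_since_last_admin_login"]

train = df[df["month"] < "2025-04-01"]     # recent history, not last year's snapshot
test = df[df["month"] >= "2025-04-01"]     # the period we want earlier warnings for

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(train[features], train["churned_90d"])

test = test.assign(risk=model.predict_proba(test[features])[:, 1])
top = test.nlargest(50, "risk")            # 50 = roughly what the CS team can actually work
print("churners among the 50 flagged accounts:", int(top["churned_90d"].sum()))
```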
The pattern
Every one of these has a model in production, a measurable outcome, and historical data that could be used to retrain it. Nobody has wired the loop together.
Karpathy ran 700 experiments in two days and improved a system he'd been manually tuning for his entire career. The data to do the same thing is sitting in your production logs right now. Businesses that close this loop will have models that keep getting better automatically, while everyone else keeps running whatever they shipped in 2024.

Need a Fractional Head of AI?
I help companies build an AI operating system — shared context across teams, AI handling the repetitive work, and your people focused on what actually matters.
15+ years in tech · 12+ AI products shipped · 3 Fortune 500 brands