Managing System Evolution
Learn to guide AI systems through continuous improvement while maintaining user trust and system stability.
Unlike traditional apps that stay the same until you update them, AI systems constantly change as they learn from millions of interactions. This creates a fascinating paradox: the very feature that makes AI powerful, its ability to adapt, can also make it unpredictable. A model update might improve accuracy for most users while completely breaking the experience for others. New data patterns might shift the system's behavior in unexpected ways. Even successful improvements can confuse users who've grown comfortable with how things work. Managing this evolution isn't just about technical excellence. It's about maintaining the delicate balance between innovation and stability, between getting better and staying reliable. You'll discover how successful teams monitor their AI's health, communicate changes without causing alarm, and ensure that as their systems grow smarter, they never lose sight of the humans they serve.
Data cascades occur when initial data decisions create downstream effects throughout the development pipeline. A plant identification app trained mostly on North American species might work well at launch but fail when South American users arrive. The mismatch between training data and real-world usage wasn't visible until users reported errors.
These cascades can be hard to diagnose. You might not see their impact until users experience problems. The effects of poor data choices compound over time, making early planning critical.
Your AI's performance can degrade as the world changes around it. Language patterns shift, user behaviors evolve, and new trends emerge that weren't in your original dataset. Regular monitoring helps detect when reality diverges from your training assumptions. Planning for high-quality data from the start prevents many evolution problems. This means considering how your data might age and building in processes for updates.[1]
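As a concrete illustration, one lightweight way to detect that divergence is to compare a feature's distribution in recent traffic against its distribution at training time. The sketch below uses SciPy's two-sample Kolmogorov-Smirnov test; the synthetic data, sample sizes, and significance threshold are assumptions made for the example, not a prescription.

```python
# Minimal sketch of a data-drift check: compare a feature's distribution in
# recent production traffic against the training-time distribution.
import numpy as np
from scipy.stats import ks_2samp

def check_drift(training_values, recent_values, alpha=0.01):
    """Flag drift when the two samples are unlikely to share a distribution."""
    result = ks_2samp(training_values, recent_values)
    return {
        "ks_statistic": result.statistic,
        "p_value": result.pvalue,
        "drift_detected": result.pvalue < alpha,
    }

# Synthetic data standing in for logged feature values; the shift in the
# recent sample represents the world drifting away from training assumptions.
rng = np.random.default_rng(0)
training_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)
recent_sample = rng.normal(loc=0.4, scale=1.0, size=5_000)
print(check_drift(training_sample, recent_sample))
```

A check like this won't tell you why reality diverged, only that it did; the value is in running it on a schedule so the investigation starts before users report problems.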
Once your model is running, you need to interpret the signals it generates in production, including:
- Recommendation acceptance rates
- Error frequencies across user segments
- Confidence score distributions
Tools like the What-If Tool and Language Interpretability Tool help inspect your model and identify blind spots. Monitor behavioral signals alongside technical metrics. Track how often users accept recommendations, complete suggested actions, or override AI decisions. If users consistently ignore suggestions despite high confidence scores, something needs investigation. Establish regular review cycles with cross-functional teams. Engineers track technical performance, product managers notice experience changes, and customer service identifies complaint patterns. Monthly reviews examining trends catch issues that single metrics miss.
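As one sketch of what tracking those behavioral signals could look like, the snippet below aggregates hypothetical interaction logs per user segment; the field names (`segment`, `accepted`, `overrode`, `confidence`) are assumptions for the example.

```python
# Sketch of summarizing behavioral signals from interaction logs per segment.
# Field names (segment, accepted, overrode, confidence) are illustrative.
from collections import defaultdict

def summarize_signals(events):
    by_segment = defaultdict(lambda: {"n": 0, "accepted": 0, "overrode": 0, "confidence": 0.0})
    for event in events:
        stats = by_segment[event["segment"]]
        stats["n"] += 1
        stats["accepted"] += int(event["accepted"])
        stats["overrode"] += int(event["overrode"])
        stats["confidence"] += event["confidence"]
    return {
        segment: {
            "acceptance_rate": stats["accepted"] / stats["n"],
            "override_rate": stats["overrode"] / stats["n"],
            "mean_confidence": stats["confidence"] / stats["n"],
        }
        for segment, stats in by_segment.items()
    }

events = [
    {"segment": "new_users", "accepted": True, "overrode": False, "confidence": 0.91},
    {"segment": "new_users", "accepted": False, "overrode": True, "confidence": 0.88},
    {"segment": "power_users", "accepted": True, "overrode": False, "confidence": 0.75},
]
print(summarize_signals(events))
```

A segment showing high mean confidence alongside a low acceptance rate is exactly the "users ignore suggestions despite high confidence scores" pattern worth bringing to the monthly review.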
When your AI system changes, users notice, and how you communicate those changes determines whether they feel informed or unsettled.
Timing matters as much as content. Announce major changes before they happen, giving users time to adjust expectations. For subtle improvements, consider communicating after implementation when users can immediately experience the benefits.
Match communication depth to user needs. Power users might want detailed changelogs, while casual users need simple notices about improved experiences. Avoid technical jargon like "retrained the model with updated embeddings." Instead, say "improved understanding of your preferences."
Show, don't just tell. When possible, let users experience improvements through guided examples or interactive walkthroughs rather than reading about them in a release note.
Training data quality directly determines your system's output and user experience quality. When models evolve through updates, maintaining data quality becomes even more critical. Document your data collection plan to avoid quality issues. Include what data you're collecting, how often it's refreshed, and what preprocessing steps you apply. This documentation helps future teams understand why certain choices were made. When updating models, consider your data maintenance plan:
- Preventive maintenance stops problems before they occur.
- Adaptive maintenance keeps your dataset relevant as the real world changes.
- Corrective maintenance fixes errors that arise from data cascades.
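One way to make that documentation concrete is a small, versioned record kept alongside the dataset. The fields and values below are illustrative assumptions rather than a standard schema.

```python
# Illustrative dataset record capturing the collection and maintenance plan.
# All field names and values are assumptions made up for this sketch.
dataset_record = {
    "name": "plant_images_v3",
    "collected_from": ["partner_gardens", "user_uploads"],
    "refresh_cadence": "monthly",
    "preprocessing": ["deduplicate", "resize_to_224px", "strip_exif"],
    "known_gaps": ["few South American species"],
    "maintenance": {
        "preventive": "validate new uploads against a schema before ingestion",
        "adaptive": "add newly reported species to the labeling queue each quarter",
        "corrective": "re-label batches flagged by user error reports",
    },
    "changelog": [
        {"date": "2024-05-01", "change": "added 12k South American species images"},
    ],
}
```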
Keep detailed logs of everything you change in datasets. Problems can occur from unforeseen issues and human error. Having comprehensive records helps diagnose issues when user complaints arise after updates. Split your data carefully between training and test sets for each model version. The split depends on factors like example count and data distribution. A typical split might be 60% training and 40% testing, but this varies by use case.
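A minimal sketch of such a versioned split, assuming labeled data and using scikit-learn with the 60/40 ratio mentioned above; stratifying on labels is one way to keep class proportions consistent between the two sets.

```python
# Sketch of a reproducible 60/40 train/test split for one model version.
from sklearn.model_selection import train_test_split

def split_for_version(examples, labels, version_seed):
    # test_size=0.4 mirrors the 60/40 example above; adjust for your use case.
    # A fixed seed per model version keeps the split reproducible across runs.
    return train_test_split(
        examples,
        labels,
        test_size=0.4,
        random_state=version_seed,
        stratify=labels,
    )
```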
What users consider an error depends on their expectations. A recommendation system that's useful 60% of the time might be seen as a success or a failure depending on users' context. These perceptions establish or correct mental models and calibrate trust. Consider different error types. A medical diagnosis AI might fail by missing a condition (false negative) or flagging healthy patients (false positive). A translation app might produce grammatically correct but culturally inappropriate phrases. A route planner might suggest technically shorter paths through unsafe areas.
Design your system knowing some people will intentionally abuse it. Make failure safe and boring. Avoid making dangerous failures interesting or over-explaining vulnerabilities, which can incentivize reproduction.
When AI fails, often the easiest path forward is letting users take over. Users need awareness of the situation, understanding of what to do next, and the ability to take action. Error messages should be human, not machine-like. Address mistakes with humanity and explain limits while inviting people forward.[2]
Your system will sometimes produce results it has little confidence in. Communicating that uncertainty honestly is better than presenting a shaky guess as a sure thing.
High confidence but irrelevant output creates different problems. Booking a trip for a funeral and receiving "fun vacation activity" suggestions shows high confidence in the wrong context. These relevance errors frustrate users even when technically correct. Plan for times when your system can't provide good results. Explain why certain outputs couldn't be given and provide alternative paths. "Not enough data to predict prices for next year. Try checking again in a month," acknowledges limitations helpfully. Create feedback mechanisms for users to report relevance issues. When the system works technically but fails contextually, user feedback helps improve future performance.
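One possible shape for that kind of graceful fallback is an explicit check before answering, with an explanatory message and an alternative path when the data can't support a prediction. The minimum-sample threshold and the `model.predict` call below are hypothetical.

```python
# Sketch of a fallback path when the system can't give a good result.
# The threshold, messages, and model interface are illustrative assumptions.
def predict_next_year_prices(history, model, min_samples=30):
    if len(history) < min_samples:
        return {
            "prediction": None,
            "message": "Not enough data to predict prices for next year. "
                       "Try checking again in a month.",
            "alternative": "Show last year's prices instead.",
        }
    return {"prediction": model.predict(history), "message": None, "alternative": None}
```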
People form mental models for everything they interact with, including your AI product.
Onboarding starts before purchase or download and continues indefinitely. Marketing messages, ads, and early interactions all set expectations about what the AI can and cannot do.
Many products set users up for disappointment by hiding complexity. While shielding users from technical details has merit, hiding how products work creates confusion and breaks trust. Balance clarity with simplicity.
When introducing changed features, connect feedback with personalization. Let users know how their input helps improve their experience. Tie benefits to user value: "You can improve recommendations by rating what you see."
Mental models need reinforcement over time. Daily products build strong models through repetition. Occasional-use products like tax software or travel booking apps need reminders. Users might confidently navigate tax software in April but feel lost returning next year after interface updates.
Defining success for an evolving AI system takes more than a single accuracy number.
Think about different types of mistakes your AI might make. A running app might suggest runs users don't want, or miss runs they would enjoy. Deciding which mistake is worse shapes how your system develops over time.
Consider the balance between being careful and being thorough. Being careful (precision) means you're confident in what you recommend, but might miss good options. Being thorough (recall) means you catch more possibilities but include more mistakes. Your roadmap should plan how this balance changes.
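To make the tradeoff concrete, precision and recall can be computed directly from counts of the two mistake types; the running-app numbers below are invented for illustration.

```python
# Precision vs. recall from the two mistake types in the running-app example.
# The counts are made up purely for illustration.
def precision_recall(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)  # careful: suggestions users actually wanted
    recall = true_positives / (true_positives + false_negatives)     # thorough: wanted runs we managed to surface
    return precision, recall

# 80 suggestions users liked, 20 they didn't want, 40 runs they'd have enjoyed but never saw.
p, r = precision_recall(true_positives=80, false_positives=20, false_negatives=40)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.80, recall=0.67
```

Shifting a confidence threshold moves these two numbers in opposite directions, which is why the roadmap, not a one-time tuning pass, should own the balance.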
Look at the long-term effects of your choices. A simple goal applied broadly can create problems later. Making users share more content might lift engagement metrics today while eroding content quality and trust over time.
Building a system that serves everyone requires treating fairness as an ongoing measurement, not a one-time check.
Check if your training data represents the diversity of your actual users. Data that works for early users might fail as your user base grows. Track whether all user groups are represented fairly as your system evolves.
Monitor if your product fails more often for certain groups of people. A voice assistant shouldn't work poorly for people with accents. If it does, your system isn't serving everyone equally as it develops. Watch how AI decisions affect people's opportunities. Your system's choices can impact people's access to resources and quality of life. Long-term measurement must track whether these impacts get better or worse over time.
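A simple sketch of that kind of monitoring computes failure rates per user group on each release and flags groups that fall noticeably behind the rest; the group names and allowed gap are assumptions for the example.

```python
# Sketch of per-group failure-rate monitoring for a release.
# Group names and the allowed gap are illustrative assumptions.
def failure_rates_by_group(outcomes):
    """outcomes: iterable of (group, failed) pairs, e.g. ("accented_speech", True)."""
    totals, failures = {}, {}
    for group, failed in outcomes:
        totals[group] = totals.get(group, 0) + 1
        failures[group] = failures.get(group, 0) + int(failed)
    return {group: failures[group] / totals[group] for group in totals}

def flag_disparities(rates, max_gap=0.05):
    overall = sum(rates.values()) / len(rates)
    return {group: rate for group, rate in rates.items() if rate - overall > max_gap}

rates = failure_rates_by_group([
    ("accented_speech", True), ("accented_speech", False), ("accented_speech", True),
    ("unaccented_speech", False), ("unaccented_speech", False), ("unaccented_speech", True),
])
print(rates)                    # accented ≈ 0.67, unaccented ≈ 0.33
print(flag_disparities(rates))  # flags the group falling behind the overall rate
```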
Create measurements beyond technical performance. Fairness and helping users matter as much as accuracy. True success means all users benefit more over time, not just average numbers improving. Document which groups benefit from each update. If improvements consistently help some users more than others, small advantages add up to create unfairness even when individual changes seem neutral.