
Setting appropriate benchmarks

Creating good benchmarks for AI features requires careful thought about what to compare against. Human performance often provides a useful reference point, especially for systems designed to enhance what people can do. When comparing AI to humans, look beyond just speed and accuracy to include consistency, stamina, and handling of unusual cases. A meaningful benchmark system should include several measurement types:

  • Absolute benchmarks establish minimum thresholds that must be met before release, such as "speech recognition must achieve 95% accuracy across all accents."
  • Relative benchmarks track improvement over time, like "customer satisfaction with recommendations should increase 5% quarterly."
  • Competitive benchmarks compare your system against alternatives to understand your market position (a sketch covering all three types follows this list).
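To make the three types concrete, here is a minimal Python sketch of how they might be applied to a single metrics report. The metric names (speech_accuracy, satisfaction, and so on), the report structure, and the thresholds are illustrative assumptions rather than part of any standard; in practice each would come from your own measurement plan.

```python
from dataclasses import dataclass

# Hypothetical metrics snapshot for one evaluation period; field names,
# values, and thresholds are placeholders, not recommendations.
@dataclass
class MetricsReport:
    speech_accuracy: float          # overall recognition accuracy, 0.0-1.0
    satisfaction: float             # mean satisfaction score this quarter
    prev_satisfaction: float        # same score from the previous quarter
    competitor_satisfaction: float  # strongest known alternative, if measured

def evaluate_benchmarks(report: MetricsReport) -> dict:
    """Apply absolute, relative, and competitive checks to one report."""
    return {
        # Absolute: a hard release threshold that must always hold.
        "absolute_accuracy": report.speech_accuracy >= 0.95,
        # Relative: improvement over the previous period (5% quarterly target).
        "relative_satisfaction": report.satisfaction >= report.prev_satisfaction * 1.05,
        # Competitive: at least parity with the strongest alternative.
        "competitive_position": report.satisfaction >= report.competitor_satisfaction,
    }

print(evaluate_benchmarks(MetricsReport(
    speech_accuracy=0.96,
    satisfaction=4.3,
    prev_satisfaction=4.0,
    competitor_satisfaction=4.2,
)))
```

Keeping the three checks separate, rather than collapsing them into a single score, makes it clear which kind of benchmark a release is failing.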

Start by defining what "good" looks like from users’ perspective. For a customer service AI, include metrics like time-to-resolution, satisfaction scores, and successful handoff rates to human agents. Establish baseline measurements before implementation to enable valid before-and-after comparisons, and include both technical metrics (accuracy, speed) and experience metrics (satisfaction, trust, continued usage).

The most valuable benchmarks reflect real user situations rather than artificial test environments. Check performance across different user groups to ensure the system works well for everyone, not just the majority of users. This prevents creating systems that perform well in controlled tests but fail for important segments of your actual audience.
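As one way to picture that segment check, the sketch below groups hypothetical interaction records by user segment and compares each segment against a pre-launch baseline. The record fields, segment labels, and baseline numbers are all invented for illustration; the point is that each group is evaluated on its own rather than being averaged into an overall score.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical interaction records; fields and segment labels are placeholders.
interactions = [
    {"segment": "native_speaker", "resolved": True,  "resolution_minutes": 4,  "satisfaction": 4.6},
    {"segment": "native_speaker", "resolved": True,  "resolution_minutes": 6,  "satisfaction": 4.2},
    {"segment": "non_native",     "resolved": False, "resolution_minutes": 15, "satisfaction": 2.8},
    {"segment": "non_native",     "resolved": True,  "resolution_minutes": 9,  "satisfaction": 3.9},
]

# Baseline measured before the AI feature shipped (assumed values).
baseline = {"resolution_minutes": 12.0, "satisfaction": 3.8}

def metrics_by_segment(records):
    """Group records by user segment and compute per-segment metrics."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["segment"]].append(r)
    report = {}
    for segment, rows in grouped.items():
        report[segment] = {
            "resolution_rate": mean(1.0 if r["resolved"] else 0.0 for r in rows),
            "resolution_minutes": mean(r["resolution_minutes"] for r in rows),
            "satisfaction": mean(r["satisfaction"] for r in rows),
        }
    return report

for segment, stats in metrics_by_segment(interactions).items():
    # Compare each segment against the pre-launch baseline, so a regression
    # for a smaller group is not hidden by strong majority results.
    faster = stats["resolution_minutes"] < baseline["resolution_minutes"]
    happier = stats["satisfaction"] > baseline["satisfaction"]
    print(segment, stats, "improved_vs_baseline:", faster and happier)
```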
