Unit testing AI apps
How do you check that your software is doing what it's supposed to do?
Do you test all your app's possible cases, branches and states? I don't, at least not manually. Ain't nobody got time to click through all the edge cases by hand. QA'ing a simple login form takes time, let alone testing complex applications.
Having robots do that helps a ton, and I recommend writing automated tests to help you sleep well at night (and release fewer bugs)!
Ignoring the burden of writing and maintaining tests, testing a "normal" web application is straightforward because it's predictable. Throw something at your app and expect a result. The same input should always produce the same output. Most apps are CRUD apps anyway, easy peasy.
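To make "throw something at your app and expect a result" concrete, here's a minimal sketch of a deterministic test. The `slugify` function and the `expectEqual` helper are made up for illustration and stand in for whatever predictable logic your app has.

```typescript
// Hypothetical pure function standing in for "normal", predictable app logic.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything that isn't a-z or 0-9 into one dash
    .replace(/^-+|-+$/g, ""); // drop leading and trailing dashes
}

// Tiny assertion helper (an assumption, so the sketch needs no test framework).
function expectEqual<T>(actual: T, expected: T): void {
  if (actual !== expected) {
    throw new Error(`Expected ${String(expected)}, got ${String(actual)}`);
  }
}

// Same input, same output, every single run.
expectEqual(slugify("Unit testing AI apps"), "unit-testing-ai-apps");
expectEqual(slugify("  Hello, World!  "), "hello-world");
```

Exact-match assertions like these only work because the function is deterministic.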
But what if there are unpredictable parts in your app's core?
If you're riding the AI buzzword wave, you probably implemented an "I know everything" smart-ass right in your app's core that's known for lying and spreading fake news. (Yes, I mean some sort of LLM.)
How would you test your app's quality if you're building software on top of software you probably don't understand?
Here's Hamel Husain's recommendation:
There are three levels of evaluation to consider:
- Level 1: Unit Tests
- Level 2: Model & Human Eval (this includes debugging)
- Level 3: A/B testing
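One way to picture Level 1 is asserting on properties of a reply instead of on its exact wording. The following is my own rough sketch, not code from Hamel's post: the `Model` type, the stubbed `fakeModel`, and the specific checks in `checkReply` are all assumptions chosen for illustration.

```typescript
// Hypothetical model signature; a real app would call an LLM API here.
type Model = (prompt: string) => Promise<string>;

// Stubbed model so the sketch runs without network access.
const fakeModel: Model = async () =>
  "Sure! Here are 3 listings in Berlin under 500000 EUR.";

// Property checks: the exact wording may change between runs,
// but some things should hold for every reply.
function checkReply(reply: string): string[] {
  const problems: string[] = [];
  if (reply.trim().length === 0) problems.push("empty reply");
  if (reply.length > 500) problems.push("reply too long");
  if (/as an ai language model/i.test(reply)) problems.push("boilerplate disclaimer");
  // Rough pattern for the start of a UUID, assumed here to be an internal id.
  if (/[0-9a-f]{8}-[0-9a-f]{4}-/i.test(reply)) problems.push("leaked internal id");
  return problems;
}

// Ask the model, then run the property checks on whatever came back.
async function runCheck(model: Model, prompt: string): Promise<string[]> {
  return checkReply(await model(prompt));
}

runCheck(fakeModel, "Find listings in Berlin").then((problems) => {
  if (problems.length > 0) throw new Error(problems.join(", "));
});
```

Swapping `fakeModel` for a real model call would turn this into a check you can run on every prompt change, with the caveat that a passing run says nothing about whether the answer is actually correct.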
I'm not planning to get into serious AI work or LLM programming anytime soon, but unit testing software sitting on top of LLMs is fascinating and worth more than a bookmark!