
How do you verify that your software does what it's supposed to do?

Do you test all your app's possible cases, branches, and states? I don't, at least not manually. Nobody's got time to click through all the edge cases by hand. QA'ing a simple login form takes time, let alone testing a complex application.

Having robots do that helps a ton, and I recommend writing automated tests to help you sleep well at night (and release fewer bugs)!

Ignoring the burden of writing and maintaining tests, testing a "normal" web application is straightforward because it's predictable. Throw something at your app and expect a result. Given the same input, it should always produce the same output. Most apps are CRUD apps anyway — easy peasy.
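That predictability is the whole trick: a test is just input in, expected output out. A minimal sketch (the `slugify` function is an illustrative example, not from this article):

```typescript
// A deterministic function is trivial to test: same input, same output,
// every single time.
function slugify(title: string): string {
  return title.toLowerCase().trim().replace(/\s+/g, "-");
}

// Run this a million times and it will never surprise you.
console.assert(slugify("Hello World") === "hello-world");
console.assert(slugify("  Testing LLM Apps  ") === "testing-llm-apps");
```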

But what if there are unpredictable parts in your app's core?

If you're riding the AI buzzword wave, you probably implemented an "I know everything" smart-ass right in your app's core that's known for lying and spreading fake news. (Yes, I mean some sort of LLM.)

How would you test your app's quality if you're building software on top of software you probably don't understand?

Here's Hamel Husain's recommendation:

There are three levels of evaluation to consider:

  • Level 1: Unit Tests
  • Level 2: Model & Human Eval (this includes debugging)
  • Level 3: A/B testing
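Level 1 is the interesting one here: you don't grade the model's "intelligence", you assert cheap, deterministic properties of its output — does it parse, does it contain the required fields, does it avoid known failure modes? A minimal sketch of what such assertions could look like (check names, the JSON shape, and the refusal pattern are my assumptions, not Hamel's):

```typescript
// Level 1 unit tests: deterministic checks on an LLM's raw text output.
// No model call needed — you run these against captured or stubbed responses.

type Check = { name: string; pass: (output: string) => boolean };

const checks: Check[] = [
  {
    // We asked the model to answer in JSON, so the output must parse.
    name: "is valid JSON",
    pass: (o) => {
      try { JSON.parse(o); return true; } catch { return false; }
    },
  },
  {
    // A required field must exist and be a string.
    name: "contains a 'summary' field",
    pass: (o) => {
      try { return typeof JSON.parse(o).summary === "string"; }
      catch { return false; }
    },
  },
  {
    // Guard against a common failure mode: refusing instead of answering.
    name: "does not refuse to answer",
    pass: (o) => !/as an ai language model/i.test(o),
  },
];

// Returns the names of all failed checks; an empty array means pass.
export function runLevel1Checks(output: string): string[] {
  return checks.filter((c) => !c.pass(output)).map((c) => c.name);
}

// A well-formed response passes every check.
const good = JSON.stringify({ summary: "Login form works." });
console.assert(runLevel1Checks(good).length === 0);

// A refusal fails all three.
const bad = "As an AI language model, I cannot do that.";
console.assert(runLevel1Checks(bad).length === 3);
```

The point of Level 1 is that these checks are fast and boring, so you can run them on every commit before you spend money on model-graded or human evaluation.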

I'm not planning to get into serious AI work or LLM programming anytime soon, but unit testing software sitting on top of LLMs is fascinating and worth more than a bookmark!


About Stefan Judis

Frontend nerd with over ten years of experience, freelance dev, "Today I Learned" blogger, conference speaker, and Open Source maintainer.
