The success of an AI solution depends on clear goals and robust testing.
Goals may relate to:
- Performance on the task, including edge cases
- Consistency of the performance
- Style of the response
- Utilization of the context
- Latency
- Cost
and can be:
- Quantitative - e.g. F1, accuracy, precision, recall
- Qualitative - e.g. user satisfaction on a Likert scale
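
As an illustration of the quantitative goals above, the following is a minimal sketch of how accuracy, precision, recall and F1 could be computed for a binary task. The function names and the sample labels are assumptions made for this example, not part of any particular library.

```python
def accuracy(expected, predicted):
    """Fraction of predictions that match the expected labels."""
    if not expected:
        return 0.0
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


def precision_recall_f1(expected, predicted, positive=True):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if p == positive and e == positive)
    fp = sum(1 for e, p in pairs if p == positive and e != positive)
    fn = sum(1 for e, p in pairs if p != positive and e == positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: one true positive, one false negative,
# one false positive, one true negative.
expected = [True, True, False, False]
predicted = [True, False, True, False]

acc = accuracy(expected, predicted)
precision, recall, f1 = precision_recall_f1(expected, predicted)
```

Which metric to target depends on the task: precision matters more when false positives are costly, recall when misses are costly, and F1 when both need to be balanced.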
Test cases should cover a broad range of expected tasks and include edge cases. The number of test cases matters most, so it is best to design for automated testing.
Tests can be graded by (in order of preference):
- Code
- LLM
- Human
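
Code-based grading, the first option in the list, can be sketched as a small harness that runs every test case through the model and reports a pass rate. Everything here is hypothetical: `EvalCase`, `exact_match`, `run_suite` and `stub_model` are names invented for the example, and `stub_model` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Code-based grader: case- and whitespace-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_suite(
    model: Callable[[str], str],
    cases: Sequence[EvalCase],
    grader: Callable[[str, str], bool] = exact_match,
) -> float:
    """Run every case through the model and return the fraction that pass."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = sum(grader(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    canned = {"2 + 2 =": "4", "Capital of France?": " Paris "}
    return canned.get(prompt, "")


cases = [
    EvalCase("2 + 2 =", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("", "unknown"),  # edge case: empty prompt
]

pass_rate = run_suite(stub_model, cases)
```

Because the grader is passed in as a function, the same harness could accept an LLM-based or human-in-the-loop grader later without changing the test cases.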