LLM-as-a-Judge Evaluation
Manually checking every answer an AI gives does not scale. Instead, we use a stronger AI (the "Judge") to grade the answers of a smaller AI, comparing each answer against a known reference.
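
As a rough illustration of the grading step, the sketch below builds the text a judge model might be shown and parses its reply into a verdict. The template wording, the `CORRECT`/`INCORRECT` labels, and the function names `build_judge_prompt` and `parse_verdict` are assumptions made for this example. The call to the judge model itself is left out, since it depends on the provider you use.

```python
# Hypothetical judge prompt; the wording and labels are illustrative assumptions.
JUDGE_TEMPLATE = """You are grading an answer against a reference.
Question: {question}
Model answer: {answer}
Reference answer: {reference}
Reply with exactly one word: CORRECT or INCORRECT."""


def build_judge_prompt(question, answer, reference):
    """Fill the template with one row of evaluation data."""
    return JUDGE_TEMPLATE.format(
        question=question, answer=answer, reference=reference
    )


def parse_verdict(judge_reply):
    """Map the judge's reply to True/False, or None if it cannot be parsed."""
    words = judge_reply.strip().upper().split()
    if not words:
        return None
    # Tolerate trailing punctuation such as "Correct." or "INCORRECT:".
    first = words[0].strip(".,:;!\"'")
    if first == "CORRECT":
        return True
    if first == "INCORRECT":
        return False
    return None


prompt = build_judge_prompt("Sum 2 + 2", "It is 5.", "4")
```

Returning `None` for an unparseable reply, rather than guessing, keeps malformed judge output visible so it can be retried or reviewed by hand.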
Golden Dataset v1.0
| Input Prompt | Model Output | Reference Truth | Judge's Verdict |
|---|---|---|---|
| What is the capital of France? | The capital of France is Paris. | Paris | Waiting... |
| Sum 2 + 2 | It is 5. | 4 | Waiting... |
| Who wrote Hamlet? | William Shakespeare wrote Hamlet. | Shakespeare | Waiting... |
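
The loop that would fill in the verdict column can be sketched as follows. This is a minimal sketch: `stand_in_judge` is a placeholder that only checks whether the reference tokens appear in the model output, and it is not an LLM. In a real setup you would swap it for a call to your judge model. The names `GOLDEN_DATASET`, `run_evaluation`, and the `PASS`/`FAIL` labels are likewise assumptions.

```python
import re

# The three rows of the table above, as plain dictionaries.
GOLDEN_DATASET = [
    {"prompt": "What is the capital of France?",
     "output": "The capital of France is Paris.",
     "reference": "Paris"},
    {"prompt": "Sum 2 + 2",
     "output": "It is 5.",
     "reference": "4"},
    {"prompt": "Who wrote Hamlet?",
     "output": "William Shakespeare wrote Hamlet.",
     "reference": "Shakespeare"},
]


def _tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stand_in_judge(prompt, output, reference):
    """Placeholder for the judge model (assumption, not an LLM call).

    Returns "PASS" if every reference token appears in the output.
    """
    output_tokens = set(_tokens(output))
    matched = all(tok in output_tokens for tok in _tokens(reference))
    return "PASS" if matched else "FAIL"


def run_evaluation(dataset, judge=stand_in_judge):
    """Grade every row and return the verdicts plus the overall pass rate."""
    verdicts = [
        judge(row["prompt"], row["output"], row["reference"])
        for row in dataset
    ]
    pass_rate = verdicts.count("PASS") / len(verdicts) if verdicts else 0.0
    return verdicts, pass_rate


verdicts, pass_rate = run_evaluation(GOLDEN_DATASET)
for row, verdict in zip(GOLDEN_DATASET, verdicts):
    print(f"{row['prompt']!r}: {verdict}")
```

Passing the judge in as a parameter means the same loop works with the placeholder during testing and with a real judge model later. The token check is much weaker than an LLM judge: it would fail a correct paraphrase such as "four" against the reference "4", which is exactly the kind of case a judge model is meant to handle.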