About Race the AI by Parity
Read about how and why we made this here.

Without anything like SWE-bench for Kubernetes tasks, evaluating our AI agent's effectiveness was challenging. We created these tasks to build an internal benchmark for our agent, and decided to release this subset as a fun way of demonstrating its capabilities.

Here's how it works

1. Simulating the Cluster State

We use an LLM to simulate the state of the cluster. Whenever you enter a command, an LLM that knows the root cause produces simulated output consistent with both the root cause and the output history so far.

2. Evaluating Your Answer

When you submit a root-cause answer, an LLM acting as a judge scores how close your guess is to the actual root cause. We use this score, along with the judge's feedback, to determine whether the answer is correct, partially correct, or incorrect.
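The two steps above can be sketched in Python. This is a minimal illustration, not the actual implementation: the function names, prompt layout, and score thresholds are all assumptions, and the LLM call itself is stubbed out.

```python
# A hedged sketch of the two LLM roles described above. Everything here
# (names, prompt format, thresholds) is illustrative, not Parity's code.

def simulate_output(root_cause: str, history: list[tuple[str, str]], command: str) -> str:
    """Step 1: give the LLM the hidden root cause plus the transcript so far,
    so its simulated cluster output stays consistent with both."""
    prompt = (
        f"You are simulating a Kubernetes cluster. Hidden root cause: {root_cause}\n"
        + "".join(f"$ {cmd}\n{out}\n" for cmd, out in history)
        + f"$ {command}\n"
    )
    # Real system: return an LLM completion of `prompt`. Stubbed here:
    return f"(simulated output for: {command})"

def verdict(score: int) -> str:
    """Step 2: map the LLM judge's 0-100 closeness score to an outcome.
    The thresholds are made up; in the real game the judge's written
    feedback is shown to the player alongside this verdict."""
    if score >= 80:
        return "correct"
    if score >= 40:
        return "partially correct"
    return "incorrect"
```

In this sketch the simulator is a pure function of the root cause, the transcript, and the new command, which is what keeps successive simulated outputs mutually consistent.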