Run one input through two prompts or two models at the same time, watch both stream in parallel, then compare output, cost and latency on a scored verdict. Turns "which prompt is better" from a guess into a measurement. Bring your own key, no backend.

Arena