Evaluating LLM judges is a significant part of AI benchmarking, because these judges shape how model outputs are assessed and ranked.
Recent analyses raise questions about the robustness of these judges, especially regarding how post-decision interactions may influence evaluations.
The assumptions underlying current benchmarking pipelines therefore need scrutiny to confirm that the pipelines are effective and reliable.