Evaluating the Performance of Large Language Models via Debates | Synapse