Neurometric

Thinking Algorithm Leaderboard

Continuously measuring what actually works across model-algorithm combinations on CRMArena-Pro. Navigate the jagged frontier systematically rather than guessing.

Best Ensemble

72.7%
5 systems

Cost-Efficient Ensemble

61.3% ($0.008)
2 systems

Latency-Efficient Ensemble

61.7% (24.9s)
2 systems

Best Single System

63.8%
gpt.oss.120b + Weighted (n=3) - Haiku 3.5 judge

Cost-Efficient Single

54.6% ($0.011)
gpt.oss.120b + CoT

Latency-Efficient Single

56.3% (6.4s)
gpt.oss.120b + Weighted (n=3) - Nova Pro judge

The Jagged Frontier

Matching the algorithm to the model and the task yields large performance gains.

CRMArena-Pro

CRMArena-Pro is a novel benchmark for holistic and realistic assessment of LLM agents in diverse professional settings. We explore eight of the nineteen expert-validated tasks covering three scenarios (sales, customer service, and configure, price, and quote (CPQ)) and four skills (workflow routing, policy compliance, information retrieval & textual reasoning, and database querying & numerical computation).

Methodology

This leaderboard aggregates results from the TTC Benchmark consistency experiments across multiple runs. The aggregation strategy depends on the evaluation method:

  • Chain of Thought (CoT): Averaged across Run 1, Run 2, and Run 3
  • Best of N methods (Best of N and Best of N - Weighted): Averaged across Run 1, Run 2, and Run 3. Run 1 used Claude Haiku 3.5 as the judge model; Run 2 and Run 3 used Amazon Nova Pro.
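
The averaging step above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual pipeline: it assumes an unweighted mean of per-run accuracies, and the function name `aggregate_runs` and the sample values are hypothetical.

```python
from statistics import mean

# Hypothetical per-run accuracies (fractions in [0, 1]) for one
# model-algorithm combination. Illustrative values, not leaderboard results.
run_accuracy = {"Run 1": 0.50, "Run 2": 0.60, "Run 3": 0.70}


def aggregate_runs(per_run: dict[str, float]) -> float:
    """Return the unweighted mean accuracy across runs (an assumption)."""
    if not per_run:
        raise ValueError("need at least one run to aggregate")
    return mean(per_run.values())


averaged = aggregate_runs(run_accuracy)
print(f"Averaged accuracy: {averaged:.1%}")
```

If runs contained different numbers of instances, pooling all instances before computing accuracy would give a different figure than this per-run mean; the source does not say which is used.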

Accuracy measures the percentage of benchmark instances that were successfully resolved by each model, providing a comprehensive view of practical performance.
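
As a rough sketch of the accuracy definition above, the snippet below computes the percentage of resolved instances from a list of pass/fail outcomes. The helper `accuracy_percent` and the sample outcomes are hypothetical, and how an instance is judged "resolved" is left out because the source does not specify it.

```python
def accuracy_percent(resolved: list[bool]) -> float:
    """Percentage of benchmark instances marked as successfully resolved."""
    if not resolved:
        raise ValueError("need at least one benchmark instance")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(resolved) / len(resolved)


# Hypothetical outcomes for four instances: three resolved, one not.
outcomes = [True, False, True, True]
print(accuracy_percent(outcomes))  # 75.0
```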

Data covers results across CRMArena-Pro tasks, including lead qualification, case routing, activity priority, and more.
