
Thinking Algorithm Leaderboard

Continuously measuring what actually works across model-algorithm combinations on CRMArena-Pro. Navigate the jagged frontier systematically rather than guessing.

Best Ensemble: 72.7% (5 systems)
Cost-Efficient Ensemble: 61.3% ($0.008), 2 systems
Latency-Efficient Ensemble: 61.7% (24.9s), 2 systems
Best Single System: 63.8%, gpt.oss.120b + Weighted (n=3) - Haiku 3.5 judge
Cost-Efficient Single: 54.6% ($0.011), gpt.oss.120b + CoT
Latency-Efficient Single: 56.3% (6.4s), gpt.oss.120b + Weighted (n=3) - Nova Pro judge

The Jagged Frontier

Matching the algorithm to the model and task can yield large performance gains.

[Figure: strip plot of results by method and model]

CRMArena-Pro

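CRMArena-Pro is a novel benchmark for holistic and realistic assessment of LLM agents in diverse professional settings. We explore eight of its nineteen expert-validated tasks, which cover three scenarios (sales; customer service; and configure, price, and quote) and four skills (workflow routing; policy compliance; information retrieval and textual reasoning; and database querying and numerical computation). The eight tasks are Lead Routing, Top Issue Identification, Named Entity Disambiguation, Monthly Trend Analysis, Case Routing, Quote Approval, Lead Qualification, and Activity Priority.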


Methodology

This leaderboard aggregates results from the TTC Benchmark consistency experiments across multiple runs. The aggregation strategy depends on the evaluation method:

Chain of Thought (CoT): Averaged across Run 1, Run 2, and Run 3
Best of N methods (Best of N and Best of N - Weighted): Averaged across Run 1, Run 2, and Run 3. Run 1 used Claude Haiku 3.5 as the judge model; Run 2 and Run 3 used Amazon Nova Pro.
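
As a minimal sketch of this aggregation, assume each system has one accuracy value per run and that the leaderboard score is the unweighted mean over the three runs. The function name, system names, and numbers below are hypothetical illustrations, not leaderboard results or the benchmark's actual code.

```python
from statistics import mean


def average_across_runs(run_scores):
    """Return the unweighted mean accuracy per system across its runs.

    run_scores maps a system name to a list of per-run accuracies (percent).
    """
    return {system: mean(scores) for system, scores in run_scores.items()}


# Hypothetical accuracies (percent) for Run 1, Run 2, Run 3; not real results.
example = {
    "model-a + CoT": [50.0, 52.0, 54.0],
    "model-a + Best of N": [60.0, 58.0, 62.0],
}

averaged = average_across_runs(example)
for system, score in averaged.items():
    print(f"{system}: {score:.1f}%")
```

Note that for the Best of N methods, a plain mean like this mixes runs scored by different judge models, so differences between runs may partly reflect the judge rather than the method.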

Accuracy is the percentage of benchmark instances that each model resolved successfully.
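
The accuracy definition can be illustrated with a short sketch, assuming each benchmark instance is recorded as a simple resolved/unresolved flag. The function name and the sample outcomes are hypothetical.

```python
def accuracy_percent(resolved):
    """Percentage of instances marked resolved (True) in a list of booleans."""
    if not resolved:
        raise ValueError("need at least one benchmark instance")
    return 100.0 * sum(resolved) / len(resolved)


# Hypothetical outcomes for four instances: three resolved, one not.
outcomes = [True, True, False, True]
score = accuracy_percent(outcomes)
print(f"{score:.1f}%")
```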

The data covers CRMArena-Pro tasks including lead qualification, case routing, and activity priority.

