
Measure agent performance locally

LLM evaluations that make it effortless to run and compare your agents

Get started for free. No signup required.


Analyze multi-step agent behavior across runs

FAST ITERATION CYCLES

Run evaluations in seconds and iterate as fast as you build. No waiting on pipelines or external services.

LOCAL-FIRST EVALUATION

Keep runs, data, and prompts on your machine. Nothing leaves it, so there is no added risk.

FULLY CUSTOMIZABLE

Use custom evaluators or tailor metrics to your workflow, whether it’s tool usage, task success, or multi-step reasoning.
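To make the idea of a custom evaluator concrete, here is a minimal sketch in plain Python. It does not use the Railtracks or Evaluations API: `AgentRun`, `tool_usage_score`, and `task_success_rate` are hypothetical names chosen for illustration, and the real evaluator interface may look different.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run (not a Railtracks type)."""
    task_id: str
    tool_calls: list[str] = field(default_factory=list)
    succeeded: bool = False


def tool_usage_score(run: AgentRun, expected_tools: set[str]) -> float:
    """Fraction of the expected tools that the agent called at least once."""
    if not expected_tools:
        return 1.0  # nothing was expected, so nothing was missed
    used = set(run.tool_calls) & expected_tools
    return len(used) / len(expected_tools)


def task_success_rate(runs: list[AgentRun]) -> float:
    """Share of runs marked as successful; 0.0 for an empty list."""
    if not runs:
        return 0.0
    return sum(r.succeeded for r in runs) / len(runs)
```

A metric like this is just a function from a run (or a list of runs) to a number, which is what makes it easy to tailor to a specific workflow.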

MEASURED PROGRESS

Track performance across every run with clear metrics and charts. Regressions are visible the moment they happen.
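As a rough illustration of what spotting a regression between two evaluations can mean, the sketch below compares two sets of metric values. It is an assumption-laden example, not the product's implementation: `find_regressions` and the `tolerance` parameter are hypothetical, and it assumes every metric is "higher is better".

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return metrics whose value dropped by more than `tolerance`.

    Assumes higher values are better for every metric. Metrics missing
    from the candidate evaluation are skipped rather than flagged.
    """
    regressions: dict[str, float] = {}
    for name, old_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None:
            continue
        delta = new_value - old_value
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Example values are made up for illustration only.
previous = {"task_success": 0.90, "tool_usage": 0.80}
current = {"task_success": 0.85, "tool_usage": 0.82}
flagged = find_regressions(previous, current)
```

With these made-up numbers, only `task_success` is flagged, because it fell by more than the tolerance while `tool_usage` improved.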

HOW IT WORKS

AgentHub Evaluation Workflow

RUN

Evaluate observed runs or datasets

EVALUATE

Drill into evaluator details and metrics

COMPARE

Validate changes vs previous evaluations

Before You Run Your First Evaluation

Do I need Railtracks?

Yes, for now. Evaluations is built on Railtracks. We’re exploring support for other frameworks and import formats.

100%. Local runs never leave your machine. 

Human evaluators are supported in Conductr cloud (coming soon!). Local stays automated.