# Bench

Modaic internal SDK for benchmarking judges and training confidence probes.
## Installation

```bash
cd cli
uv sync
```
## CLI Commands

All commands are run from the `cli` directory via `uv run mo <command>`.
### create

Create benchmark datasets for training confidence probes. This command runs a judge on examples, extracts embeddings via Modal, and pushes the resulting dataset to HuggingFace Hub.
**Subcommands:**

- `create ppe` - Create a dataset from the PPE (human-preference + correctness) benchmarks
- `create judge_bench` - Create a dataset from the JudgeBench benchmark
**Usage:**

```bash
# Interactive mode (recommended) - prompts for configuration
uv run mo create ppe
uv run mo create judge_bench

# With config file
uv run mo create ppe --config config.yaml
uv run mo create judge_bench --config config.yaml
```
**Options:**

| Option | Short | Description |
|---|---|---|
| `--config` | `-c` | Path to config file (YAML) |
**Config File Example:**

```yaml
judge: tyrin/ppe-judge-gepa
output: tytodd/my-probe-dataset
n_train: 500
n_test: 100
embedding_layer: -1 # -1 for middle layer
```
**What it does:**

- Loads examples from the benchmark dataset
- Runs the specified judge on each example to get predictions
- Extracts embeddings from the judge's LLM via Modal (GPU)
- Creates a HuggingFace dataset with columns: `question`, `response_a`, `response_b`, `label`, `predicted`, `messages`, `embeddings`
- Pushes to HuggingFace Hub
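For reference, a single row of the resulting dataset has roughly this shape (all values below are illustrative placeholders, not real data):

```python
# Illustrative row of a probe dataset produced by `mo create`.
# Only the column names come from the docs above; every value is made up.
row = {
    "question": "Which response better answers the user?",
    "response_a": "First candidate answer...",
    "response_b": "Second candidate answer...",
    "label": "a",        # ground-truth preference / correctness label
    "predicted": "a",    # the judge's prediction for this example
    "messages": [        # the judge transcript the embeddings were extracted from
        {"role": "user", "content": "Judge prompt..."},
        {"role": "assistant", "content": "Judge output..."},
    ],
    "embeddings": [0.12, -0.05, 0.33],  # hidden-state vector (truncated here)
}

expected_columns = {"question", "response_a", "response_b",
                    "label", "predicted", "messages", "embeddings"}
assert set(row) == expected_columns
```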
### train

Train a confidence probe on an embeddings dataset created with `create`.
Usage:
# Interactive mode (recommended) - prompts for all configuration
uv run mo train
# With config file
uv run mo train --config config.yaml
# With CLI arguments
uv run mo train --dataset tytodd/my-embeddings --epochs 10 --lr 0.0001
**Options:**

| Option | Short | Description | Default |
|---|---|---|---|
| `--config` | `-c` | Path to config file (YAML) | - |
| `--dataset` | `-d` | Dataset path (HuggingFace Hub or local); must be a dataset created with `create` | - |
| `--model-path` | `-m` | Output path for trained model | `{dataset}_probe` |
| `--batch-size` | | Batch size | 4 |
| `--epochs` | | Number of training epochs | 10 |
| `--lr` | | Learning rate | 0.0001 |
| `--weight-decay` | | Weight decay | 0.01 |
| `--test-size` | | Validation split ratio (if no test split) | 0.2 |
| `--seed` | | Random seed | 42 |
| `--project` | | W&B project name | `model_path` |
| `--hub-path` | | HuggingFace Hub path to push model | - |
**Config File Example:**

```yaml
dataset_path: tytodd/my-probe-dataset
model_path: ./best_probe
hub_path: tytodd/my-probe # Optional: push to HF Hub
batch_size: 4
epochs: 10
learning_rate: 0.0001
weight_decay: 0.01
test_size: 0.2
seed: 42
```
**What it does:**

- Loads an embeddings dataset (from HuggingFace Hub or local)
- Creates binary labels: 1 if `predicted == label`, 0 otherwise
- Trains a linear probe using MSE loss (Brier score optimization)
- Logs metrics to Weights & Biases (Brier, ECE, MCE, Kuiper, AUROC)
- Saves the best model based on validation Brier score
- Optionally pushes to HuggingFace Hub
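The core idea - a linear probe fit to binary correctness labels with MSE (Brier) loss - can be sketched in a few lines of NumPy. This is a simplified stand-in, not the actual trainer: the function and variable names are hypothetical, and the real CLI additionally handles batching, W&B logging, and checkpointing.

```python
import numpy as np

def train_linear_probe(X, y, epochs=10, lr=1e-4, weight_decay=0.01):
    """Fit sigmoid(X @ w + b) to 0/1 correctness labels with MSE (Brier) loss."""
    rng = np.random.default_rng(42)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))           # predicted confidence
        err = p - y
        grad_logit = 2.0 * err * p * (1.0 - p) / len(y)  # d(Brier)/d(logit)
        w -= lr * (X.T @ grad_logit + weight_decay * w)
        b -= lr * grad_logit.sum()
    return w, b

# Binary targets come from the judge's predictions, roughly:
# y = (ds["predicted"] == ds["label"]).astype(float)
```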
### eval

Evaluate a trained confidence probe on a dataset. Computes calibration and discrimination metrics.
**Usage:**

```bash
# Interactive mode (recommended) - prompts for probe and dataset
uv run mo eval

# With CLI arguments
uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings

# Evaluate on train split instead of test
uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings --split train
```
**Options:**

| Option | Short | Description | Default |
|---|---|---|---|
| `--probe` | `-p` | Probe path (HuggingFace Hub or local) | - |
| `--dataset` | `-d` | Dataset path (HuggingFace Hub or local) | - |
| `--split` | `-s` | Dataset split to evaluate on | test |
| `--batch-size` | `-b` | Batch size for evaluation | 64 |
| `--normalize/--no-normalize` | `-n` | Normalize embeddings with StandardScaler | probe config |
**Metrics computed:**
| Metric | Description |
|---|---|
| Brier Score | Mean squared error between predictions and labels |
| Accuracy | Classification accuracy at 0.5 threshold |
| F1 Score | Harmonic mean of precision and recall |
| ECE | Expected Calibration Error (10 bins) |
| MCE | Maximum Calibration Error |
| Kuiper | Kuiper statistic for calibration |
| AUROC | Area Under the ROC Curve (discrimination) |
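The two headline metrics are easy to compute by hand; a minimal NumPy sketch of both (the remaining metrics follow their standard definitions):

```python
import numpy as np

def brier_score(conf, labels):
    """Mean squared error between predicted confidences and 0/1 labels."""
    return float(np.mean((conf - labels) ** 2))

def ece(conf, labels, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(conf), 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # last bin is closed on the right so conf == 1.0 is counted
        mask = (conf >= lo) & ((conf < hi) if i < n_bins - 1 else (conf <= hi))
        if mask.any():
            gap = abs(conf[mask].mean() - labels[mask].mean())
            err += mask.sum() / total * gap  # bin weight * |confidence - accuracy|
    return float(err)
```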
**What it does:**

- Loads a pretrained probe from HuggingFace Hub or local path
- Loads a dataset created with `create`
- Creates binary labels: 1 if `predicted == label`, 0 otherwise
- Runs inference and computes calibration/discrimination metrics
- Displays results in a formatted table
### compile

Compile (optimize) a judge using GEPA over a dataset. GEPA iteratively improves the judge's prompt based on training examples.
**Subcommands:**

- `compile` (base) - Compile with a custom dataset and parameter mapping
- `compile ppe` - Compile specifically for PPE datasets (human-preference + correctness)
**Usage:**

```bash
# Interactive mode
uv run mo compile
uv run mo compile ppe

# With config file
uv run mo compile --config config.yaml
uv run mo compile ppe --config config.yaml
```
**Options:**

| Option | Short | Description |
|---|---|---|
| `--config` | `-c` | Path to config file (YAML) |
**Config File Example:**

```yaml
judge: tyrin/ppe-judge
dataset: tytodd/ppe-human-preference
inputs: # selects which input columns of the dataset to use (not necessary when using a compile subcommand like ppe or judge_bench)
  - name: question
  - name: response_a
    column: response_A # Map param name to dataset column
  - name: response_b
    column: response_B
label_column: label
n_train: 100
n_val: 50
base_model: gpt-4o-mini
reflection_model: gpt-4o
output: tyrin/ppe-judge-gepa
seed: 42
```
**What it does:**
- Loads a judge from Modaic Hub
- Loads training/validation examples from a HuggingFace dataset
- Maps judge parameters to dataset columns
- Runs GEPA optimization to improve the judge's prompt
- Pushes the optimized judge to Modaic Hub
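The `inputs` mapping can be pictured as a small rename step applied to each dataset row before it reaches the judge. A rough sketch (the helper and its name are hypothetical; only the mapping semantics come from the config above):

```python
# Each entry maps a judge parameter name to a dataset column;
# `column` defaults to the parameter name itself.
inputs = [
    {"name": "question"},
    {"name": "response_a", "column": "response_A"},
    {"name": "response_b", "column": "response_B"},
]

def row_to_judge_kwargs(row, inputs):
    """Select and rename dataset columns into judge keyword arguments."""
    return {spec["name"]: row[spec.get("column", spec["name"])] for spec in inputs}

row = {"question": "Which is better?", "response_A": "first",
       "response_B": "second", "label": "a"}
kwargs = row_to_judge_kwargs(row, inputs)
# kwargs == {"question": "Which is better?", "response_a": "first", "response_b": "second"}
```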
### embed

Regenerate embeddings for an existing dataset using a different model or layer. Useful for experimenting with different embedding configurations without re-running the judge.
**Usage:**

```bash
# Interactive mode
uv run mo embed

# With CLI arguments
uv run mo embed --dataset tytodd/my-dataset --hf-model Qwen/Qwen3-VL-32B-Instruct --layer -1
```
**Options:**

| Option | Short | Description |
|---|---|---|
| `--dataset` | `-d` | Dataset path (HuggingFace Hub or local) |
| `--hf-model` | `-m` | HuggingFace model path for embeddings |
| `--layer` | `-l` | Hidden layer index (-1 for middle layer) |
**What it does:**

- Loads an existing dataset (must have a `messages` column)
- Regenerates embeddings using the specified model/layer via Modal
- Replaces the `embeddings` column in the dataset
- Prompts to push the updated dataset to HuggingFace Hub
**Example workflow:**

```bash
# Original dataset was created with layer 32
# Now try middle layer instead
uv run mo embed \
  --dataset tytodd/my-embeddings \
  --hf-model Qwen/Qwen3-VL-32B-Instruct \
  --layer -1
```
## Recommended Embedding Layers

When extracting embeddings, use these recommended layer indices for the best probe performance:

| Model | HuggingFace Path | Recommended Layer |
|---|---|---|
| GPT-OSS 20B | `openai/gpt-oss-20b` | 8 |
| Qwen3-VL 32B | `Qwen/Qwen3-VL-32B-Instruct` | 16 |
| Llama 3.3 70B | `meta-llama/Llama-3.3-70B-Instruct` | 32 |

Use `-1` for the middle layer when experimenting with an unlisted model.
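The `-1` convention can be resolved against a model's depth roughly like this (a sketch; the helper name is hypothetical, and `num_hidden_layers` would come from the model's HuggingFace config):

```python
def resolve_embedding_layer(layer, num_hidden_layers):
    """Map the CLI's layer argument to a concrete hidden-layer index.

    -1 is documented to mean "the middle layer"; any non-negative
    value is used as-is.
    """
    if layer == -1:
        return num_hidden_layers // 2
    return layer
```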
## Typical Workflow

```bash
# 1. Create a probe dataset from a benchmark
uv run mo create ppe

# 2. Train a confidence probe
uv run mo train --dataset tytodd/ppe-qwen3-embeddings

# 3. Evaluate the probe on a test set
uv run mo eval --probe tytodd/my-probe --dataset tytodd/ppe-qwen3-embeddings

# 4. (Optional) Compile/optimize a judge with GEPA
uv run mo compile ppe

# 5. (Optional) Re-embed with a different layer
uv run mo embed --dataset tytodd/my-dataset --layer 32
```
## Environment Variables

Create a `.env` file with:

```bash
OPENAI_API_KEY=...
WANDB_API_KEY=...
HF_TOKEN=...
MODAIC_TOKEN=...
TOGETHER_API_KEY=...
```