# Bench

Modaic internal SDK for benchmarking judges and training confidence probes.

## Installation

```bash
cd cli
uv sync
```

## CLI Commands

All commands are run from the `cli` directory via `uv run mo <command>`.

### `create`

Create benchmark datasets for training confidence probes. This command runs a judge on examples, extracts embeddings via Modal, and pushes the resulting dataset to HuggingFace Hub.

**Subcommands:**

- `create ppe` - Create dataset from PPE (human-preference + correctness) benchmarks
- `create judge_bench` - Create dataset from the JudgeBench benchmark

**Usage:**

```bash
# Interactive mode (recommended) - prompts for configuration
uv run mo create ppe
uv run mo create judge_bench

# With config file
uv run mo create ppe --config config.yaml
uv run mo create judge_bench --config config.yaml
```

**Options:**

| Option     | Short | Description                |
| ---------- | ----- | -------------------------- |
| `--config` | `-c`  | Path to config file (YAML) |

**Config File Example:**

```yaml
judge: tyrin/ppe-judge-gepa
output: tytodd/my-probe-dataset
n_train: 500
n_test: 100
embedding_layer: -1 # -1 for middle layer
```

**What it does:**

1. Loads examples from the benchmark dataset
2. Runs the specified judge on each example to get predictions
3. Extracts embeddings from the judge's LLM via Modal (GPU)
4. Creates a HuggingFace dataset with columns: `question`, `response_a`, `response_b`, `label`, `predicted`, `messages`, `embeddings`
5. Pushes to HuggingFace Hub

---

### `train`

Train a confidence probe on an embeddings dataset created with `create`.
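Conceptually, a confidence probe maps an embedding to a confidence in [0, 1] and is fit with MSE loss (the Brier score) against 0/1 correctness labels. A minimal NumPy sketch with synthetic data — the shapes, learning rate, and sigmoid output head are illustrative assumptions, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: 200 examples with 64-dim embeddings
# (real dimensions come from the judge's LLM).
X = rng.standard_normal((200, 64))
# Binary correctness targets: 1.0 where the judge's prediction matched the label.
y = (rng.random(200) < 0.7).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(64), 0.0, 0.5
for _ in range(300):
    p = sigmoid(X @ w + b)            # probe confidence in [0, 1]
    grad_logit = (p - y) * p * (1.0 - p)  # d(MSE)/d(logit), up to a constant
    w -= lr * X.T @ grad_logit / len(y)   # plain gradient descent on the Brier score
    b -= lr * grad_logit.mean()

brier = float(np.mean((sigmoid(X @ w + b) - y) ** 2))
```

An untrained probe that always outputs 0.5 scores a Brier of 0.25, so anything below that indicates the probe learned something about when the judge is right.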
**Usage:**

```bash
# Interactive mode (recommended) - prompts for all configuration
uv run mo train

# With config file
uv run mo train --config config.yaml

# With CLI arguments
uv run mo train --dataset tytodd/my-embeddings --epochs 10 --lr 0.0001
```

**Options:**

| Option           | Short | Description                                                                      | Default           |
| ---------------- | ----- | -------------------------------------------------------------------------------- | ----------------- |
| `--config`       | `-c`  | Path to config file (YAML)                                                       | -                 |
| `--dataset`      | `-d`  | Dataset path (HuggingFace Hub or local); must be a dataset created with `create` | -                 |
| `--model-path`   | `-m`  | Output path for trained model                                                    | `{dataset}_probe` |
| `--batch-size`   |       | Batch size                                                                       | 4                 |
| `--epochs`       |       | Number of training epochs                                                        | 10                |
| `--lr`           |       | Learning rate                                                                    | 0.0001            |
| `--weight-decay` |       | Weight decay                                                                     | 0.01              |
| `--test-size`    |       | Validation split ratio (if no test split)                                        | 0.2               |
| `--seed`         |       | Random seed                                                                      | 42                |
| `--project`      |       | W&B project name                                                                 | model_path        |
| `--hub-path`     |       | HuggingFace Hub path to push model                                               | -                 |

**Config File Example:**

```yaml
dataset_path: tytodd/my-probe-dataset
model_path: ./best_probe
hub_path: tytodd/my-probe # Optional: push to HF Hub
batch_size: 4
epochs: 10
learning_rate: 0.0001
weight_decay: 0.01
test_size: 0.2
seed: 42
```

**What it does:**

1. Loads an embeddings dataset (from HuggingFace Hub or local)
2. Creates binary labels: 1 if `predicted == label`, 0 otherwise
3. Trains a linear probe using MSE loss (Brier score optimization)
4. Logs metrics to Weights & Biases (Brier, ECE, MCE, Kuiper, AUROC)
5. Saves the best model based on validation Brier score
6. Optionally pushes to HuggingFace Hub

---

### `eval`

Evaluate a trained confidence probe on a dataset. Computes calibration and discrimination metrics.
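The two headline metrics can be sketched in a few lines of NumPy. Bin count follows the 10-bin convention used for ECE here; the data and exact binning details are illustrative, and the CLI's implementation may differ:

```python
import numpy as np

def brier_score(conf, labels):
    """Mean squared error between predicted confidences and 0/1 labels."""
    return float(np.mean((conf - labels) ** 2))

def ece(conf, labels, n_bins=10):
    """Expected Calibration Error: per-bin |confidence - accuracy| gap,
    weighted by the fraction of examples in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins, except the last bin which includes 1.0.
        mask = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if mask.any():
            gap = abs(conf[mask].mean() - labels[mask].mean())
            total += mask.mean() * gap
    return float(total)

conf = np.array([0.9, 0.8, 0.7, 0.3, 0.2])
labels = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
print(brier_score(conf, labels))  # 0.134
print(ece(conf, labels))          # 0.3
```

Brier rewards both calibration and discrimination at once, while ECE isolates calibration, which is why the CLI reports them separately.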
**Usage:**

```bash
# Interactive mode (recommended) - prompts for probe and dataset
uv run mo eval

# With CLI arguments
uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings

# Evaluate on train split instead of test
uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings --split train
```

**Options:**

| Option                       | Short | Description                              | Default      |
| ---------------------------- | ----- | ---------------------------------------- | ------------ |
| `--probe`                    | `-p`  | Probe path (HuggingFace Hub or local)    | -            |
| `--dataset`                  | `-d`  | Dataset path (HuggingFace Hub or local)  | -            |
| `--split`                    | `-s`  | Dataset split to evaluate on             | test         |
| `--batch-size`               | `-b`  | Batch size for evaluation                | 64           |
| `--normalize/--no-normalize` | `-n`  | Normalize embeddings with StandardScaler | probe config |

**Metrics computed:**

| Metric      | Description                                       |
| ----------- | ------------------------------------------------- |
| Brier Score | Mean squared error between predictions and labels |
| Accuracy    | Classification accuracy at 0.5 threshold          |
| F1 Score    | Harmonic mean of precision and recall             |
| ECE         | Expected Calibration Error (10 bins)              |
| MCE         | Maximum Calibration Error                         |
| Kuiper      | Kuiper statistic for calibration                  |
| AUROC       | Area Under the ROC Curve (discrimination)         |

**What it does:**

1. Loads a pretrained probe from HuggingFace Hub or local path
2. Loads a dataset created with `create`
3. Creates binary labels: 1 if `predicted == label`, 0 otherwise
4. Runs inference and computes calibration/discrimination metrics
5. Displays results in a formatted table

---

### `compile`

Compile (optimize) a judge using GEPA over a dataset. GEPA iteratively improves the judge's prompt based on training examples.
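A compile config's `inputs` list maps judge parameter names to dataset columns; conceptually it is a per-example rename before the row reaches the judge. A minimal sketch, assuming `column` defaults to the parameter name when omitted (values are illustrative):

```python
# Each entry maps a judge parameter to a dataset column;
# assumption: `column` defaults to the parameter name when omitted.
inputs = [
    {"name": "question"},
    {"name": "response_a", "column": "response_A"},
    {"name": "response_b", "column": "response_B"},
]

# One illustrative dataset row.
row = {
    "question": "Which response is better?",
    "response_A": "First answer",
    "response_B": "Second answer",
    "label": 1,
}

# Build the keyword arguments the judge would receive for this row.
judge_kwargs = {spec["name"]: row[spec.get("column", spec["name"])] for spec in inputs}
# {'question': 'Which response is better?',
#  'response_a': 'First answer', 'response_b': 'Second answer'}
```

The `label_column` key plays the same role for the target column, which stays out of the judge's inputs.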
**Subcommands:**

- `compile` (base) - Compile with custom dataset and parameter mapping
- `compile ppe` - Compile specifically for PPE datasets (human-preference + correctness)

**Usage:**

```bash
# Interactive mode
uv run mo compile
uv run mo compile ppe

# With config file
uv run mo compile --config config.yaml
uv run mo compile ppe --config config.yaml
```

**Options:**

| Option     | Short | Description                |
| ---------- | ----- | -------------------------- |
| `--config` | `-c`  | Path to config file (YAML) |

**Config File Example:**

```yaml
judge: tyrin/ppe-judge
dataset: tytodd/ppe-human-preference
inputs: # selects which input columns of the dataset to use (not necessary if using a compile subcommand like ppe or judge_bench)
  - name: question
  - name: response_a
    column: response_A # Map param name to dataset column
  - name: response_b
    column: response_B
label_column: label
n_train: 100
n_val: 50
base_model: gpt-4o-mini
reflection_model: gpt-4o
output: tyrin/ppe-judge-gepa
seed: 42
```

**What it does:**

1. Loads a judge from Modaic Hub
2. Loads training/validation examples from a HuggingFace dataset
3. Maps judge parameters to dataset columns
4. Runs GEPA optimization to improve the judge's prompt
5. Pushes the optimized judge to Modaic Hub

---

### `embed`

Regenerate embeddings for an existing dataset using a different model or layer. Useful for experimenting with different embedding configurations without re-running the judge.

**Usage:**

```bash
# Interactive mode
uv run mo embed

# With CLI arguments
uv run mo embed --dataset tytodd/my-dataset --hf-model Qwen/Qwen3-VL-32B-Instruct --layer -1
```

**Options:**

| Option       | Short | Description                              |
| ------------ | ----- | ---------------------------------------- |
| `--dataset`  | `-d`  | Dataset path (HuggingFace Hub or local)  |
| `--hf-model` | `-m`  | HuggingFace model path for embeddings    |
| `--layer`    | `-l`  | Hidden layer index (-1 for middle layer) |

**What it does:**

1. Loads an existing dataset (must have a `messages` column)
2. Regenerates embeddings using the specified model/layer via Modal
3. Replaces the `embeddings` column in the dataset
4. Prompts to push the updated dataset to HuggingFace Hub

**Example workflow:**

```bash
# Original dataset was created with layer 32
# Now try middle layer instead
uv run mo embed \
  --dataset tytodd/my-embeddings \
  --hf-model Qwen/Qwen3-VL-32B-Instruct \
  --layer -1
```

---

## Recommended Embedding Layers

When extracting embeddings, use these recommended layer indices for best probe performance:

| Model         | HuggingFace Path                    | Recommended Layer |
| ------------- | ----------------------------------- | ----------------- |
| GPT-OSS 20B   | `openai/gpt-oss-20b`                | 8                 |
| Qwen3-VL 32B  | `Qwen/Qwen3-VL-32B-Instruct`        | 16                |
| Llama 3.3 70B | `meta-llama/Llama-3.3-70B-Instruct` | 32                |

Use `-1` for the middle layer if experimenting with an unlisted model.

---

## Typical Workflow

```bash
# 1. Create a probe dataset from a benchmark
uv run mo create ppe

# 2. Train a confidence probe
uv run mo train --dataset tytodd/ppe-qwen3-embeddings

# 3. Evaluate the probe on a test set
uv run mo eval --probe tytodd/my-probe --dataset tytodd/ppe-qwen3-embeddings

# 4. (Optional) Compile/optimize a judge with GEPA
uv run mo compile ppe

# 5. (Optional) Re-embed with different layer
uv run mo embed --dataset tytodd/my-dataset --layer 32
```

## Environment Variables

Create a `.env` file with:

```bash
OPENAI_API_KEY=...
WANDB_API_KEY=...
HF_TOKEN=...
MODAIC_TOKEN=...
TOGETHER_API_KEY=...
```
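As a footnote to the embedding-layer convention above: a `--layer` of `-1` resolves to the model's middle hidden layer. A small sketch of that selection plus one possible pooling step, with random arrays standing in for real per-layer hidden states; the mean-pooling choice is an assumption for illustration, not necessarily what the CLI does:

```python
import numpy as np

def resolve_layer(layer: int, num_layers: int) -> int:
    """-1 selects the middle hidden layer; any other index is used as given."""
    return num_layers // 2 if layer == -1 else layer

rng = np.random.default_rng(0)
num_layers, seq_len, hidden = 32, 16, 64  # illustrative sizes

# One fake (seq_len, hidden) array per layer, standing in for the
# per-layer hidden states a real forward pass would return.
hidden_states = [rng.standard_normal((seq_len, hidden)) for _ in range(num_layers)]

idx = resolve_layer(-1, num_layers)           # middle layer of a 32-layer model
embedding = hidden_states[idx].mean(axis=0)   # mean-pool over tokens (one pooling choice)
```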