(no commit message)

This commit is contained in:
2026-02-16 13:41:05 -08:00
parent d199de153c
commit bd953051b0
3 changed files with 410 additions and 1 deletions

312
README.md
View File

@@ -1,2 +1,312 @@
# sentiment
# Bench
Modaic internal SDK for benchmarking judges and training confidence probes.
## Installation
```bash
cd cli
uv sync
```
## CLI Commands
All commands are run from the `cli` directory via `uv run mo <command>`.
### `create`
Create benchmark datasets for training confidence probes. This command runs a judge on examples, extracts embeddings via Modal, and pushes the resulting dataset to HuggingFace Hub.
**Subcommands:**
- `create ppe` - Create dataset from PPE (human-preference + correctness) benchmarks
- `create judge_bench` - Create dataset from the JudgeBench benchmark
**Usage:**
```bash
# Interactive mode (recommended) - prompts for configuration
uv run mo create ppe
uv run mo create judge_bench
# With config file
uv run mo create ppe --config config.yaml
uv run mo create judge_bench --config config.yaml
```
**Options:**
| Option | Short | Description |
| ---------- | ----- | -------------------------- |
| `--config` | `-c` | Path to config file (YAML) |
**Config File Example:**
```yaml
judge: tyrin/ppe-judge-gepa
output: tytodd/my-probe-dataset
n_train: 500
n_test: 100
embedding_layer: -1 # -1 for middle layer
```
**What it does:**
1. Loads examples from the benchmark dataset
2. Runs the specified judge on each example to get predictions
3. Extracts embeddings from the judge's LLM via Modal (GPU)
4. Creates a HuggingFace dataset with columns: `question`, `response_a`, `response_b`, `label`, `predicted`, `messages`, `embeddings`
5. Pushes to HuggingFace Hub
---
### `train`
Train a confidence probe on an embeddings dataset created with `create`.
**Usage:**
```bash
# Interactive mode (recommended) - prompts for all configuration
uv run mo train
# With config file
uv run mo train --config config.yaml
# With CLI arguments
uv run mo train --dataset tytodd/my-embeddings --epochs 10 --lr 0.0001
```
**Options:**
| Option | Short | Description | Default |
| ---------------- | ----- | --------------------------------------------------------------------------------- | ----------------- |
| `--config` | `-c` | Path to config file (YAML) | - |
| `--dataset` | `-d` | Dataset path (HuggingFace Hub or local) (must be a dataset created with `create`) | - |
| `--model-path` | `-m` | Output path for trained model | `{dataset}_probe` |
| `--batch-size` | | Batch size | 4 |
| `--epochs` | | Number of training epochs | 10 |
| `--lr` | | Learning rate | 0.0001 |
| `--weight-decay` | | Weight decay | 0.01 |
| `--test-size` | | Validation split ratio (if no test split) | 0.2 |
| `--seed` | | Random seed | 42 |
| `--project` | | W&B project name | model_path |
| `--hub-path` | | HuggingFace Hub path to push model | - |
**Config File Example:**
```yaml
dataset_path: tytodd/my-probe-dataset
model_path: ./best_probe
hub_path: tytodd/my-probe # Optional: push to HF Hub
batch_size: 4
epochs: 10
learning_rate: 0.0001
weight_decay: 0.01
test_size: 0.2
seed: 42
```
**What it does:**
1. Loads an embeddings dataset (from HuggingFace Hub or local)
2. Creates binary labels: 1 if `predicted == label`, 0 otherwise
3. Trains a linear probe using MSE loss (Brier score optimization)
4. Logs metrics to Weights & Biases (Brier, ECE, MCE, Kuiper, AUROC)
5. Saves the best model based on validation Brier score
6. Optionally pushes to HuggingFace Hub
---
### `eval`
Evaluate a trained confidence probe on a dataset. Computes calibration and discrimination metrics.
**Usage:**
```bash
# Interactive mode (recommended) - prompts for probe and dataset
uv run mo eval
# With CLI arguments
uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings
# Evaluate on train split instead of test
uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings --split train
```
**Options:**
| Option | Short | Description | Default |
| ---------------------------- | ----- | ---------------------------------------- | ------------ |
| `--probe` | `-p` | Probe path (HuggingFace Hub or local) | - |
| `--dataset` | `-d` | Dataset path (HuggingFace Hub or local) | - |
| `--split` | `-s` | Dataset split to evaluate on | test |
| `--batch-size` | `-b` | Batch size for evaluation | 64 |
| `--normalize/--no-normalize` | `-n` | Normalize embeddings with StandardScaler | probe config |
**Metrics computed:**
| Metric | Description |
| ----------- | ------------------------------------------------- |
| Brier Score | Mean squared error between predictions and labels |
| Accuracy | Classification accuracy at 0.5 threshold |
| F1 Score | Harmonic mean of precision and recall |
| ECE | Expected Calibration Error (10 bins) |
| MCE | Maximum Calibration Error |
| Kuiper | Kuiper statistic for calibration |
| AUROC | Area Under the ROC Curve (discrimination) |
**What it does:**
1. Loads a pretrained probe from HuggingFace Hub or local path
2. Loads a dataset created with `create`
3. Creates binary labels: 1 if `predicted == label`, 0 otherwise
4. Runs inference and computes calibration/discrimination metrics
5. Displays results in a formatted table
---
### `compile`
Compile (optimize) a judge using GEPA over a dataset. GEPA iteratively improves the judge's prompt based on training examples.
**Subcommands:**
- `compile` (base) - Compile with custom dataset and parameter mapping
- `compile ppe` - Compile specifically for PPE datasets (human-preference + correctness)
**Usage:**
```bash
# Interactive mode
uv run mo compile
uv run mo compile ppe
# With config file
uv run mo compile --config config.yaml
uv run mo compile ppe --config config.yaml
```
**Options:**
| Option | Short | Description |
| ---------- | ----- | -------------------------- |
| `--config` | `-c` | Path to config file (YAML) |
**Config File Example:**
```yaml
judge: tyrin/ppe-judge
dataset: tytodd/ppe-human-preference
inputs: # selects which input columns of the dataset to use (not necearry if using a compile subcommand like ppe or judge_bench)
- name: question
- name: response_a
column: response_A # Map param name to dataset column
- name: response_b
column: response_B
label_column: label
n_train: 100
n_val: 50
base_model: gpt-4o-mini
reflection_model: gpt-4o
output: tyrin/ppe-judge-gepa
seed: 42
```
**What it does:**
1. Loads a judge from Modaic Hub
2. Loads training/validation examples from a HuggingFace dataset
3. Maps judge parameters to dataset columns
4. Runs GEPA optimization to improve the judge's prompt
5. Pushes the optimized judge to Modaic Hub
---
### `embed`
Regenerate embeddings for an existing dataset using a different model or layer. Useful for experimenting with different embedding configurations without re-running the judge.
**Usage:**
```bash
# Interactive mode
uv run mo embed
# With CLI arguments
uv run mo embed --dataset tytodd/my-dataset --hf-model Qwen/Qwen3-VL-32B-Instruct --layer -1
```
**Options:**
| Option | Short | Description |
| ------------ | ----- | ---------------------------------------- |
| `--dataset` | `-d` | Dataset path (HuggingFace Hub or local) |
| `--hf-model` | `-m` | HuggingFace model path for embeddings |
| `--layer` | `-l` | Hidden layer index (-1 for middle layer) |
**What it does:**
1. Loads an existing dataset (must have a `messages` column)
2. Regenerates embeddings using the specified model/layer via Modal
3. Replaces the `embeddings` column in the dataset
4. Prompts to push the updated dataset to HuggingFace Hub
**Example workflow:**
```bash
# Original dataset was created with layer 32
# Now try middle layer instead
uv run mo embed \
--dataset tytodd/my-embeddings \
--hf-model Qwen/Qwen3-VL-32B-Instruct \
--layer -1
```
---
## Recommended Embedding Layers
When extracting embeddings, use these recommended layer indices for best probe performance:
| Model | HuggingFace Path | Recommended Layer |
| ------------- | ----------------------------------- | ----------------- |
| GPT-OSS 20B | `openai/gpt-oss-20b` | 8 |
| Qwen3-VL 32B | `Qwen/Qwen3-VL-32B-Instruct` | 16 |
| Llama 3.3 70B | `meta-llama/Llama-3.3-70B-Instruct` | 32 |
Use `-1` for the middle layer if experimenting with an unlisted model.
---
## Typical Workflow
```bash
# 1. Create a probe dataset from a benchmark
uv run mo create ppe
# 2. Train a confidence probe
uv run mo train --dataset tytodd/ppe-qwen3-embeddings
# 3. Evaluate the probe on a test set
uv run mo eval --probe tytodd/my-probe --dataset tytodd/ppe-qwen3-embeddings
# 4. (Optional) Compile/optimize a judge with GEPA
uv run mo compile ppe
# 5. (Optional) Re-embed with different layer
uv run mo embed --dataset tytodd/my-dataset --layer 32
```
## Environment Variables
Create a `.env` file with:
```bash
OPENAI_API_KEY=...
WANDB_API_KEY=...
HF_TOKEN=...
MODAIC_TOKEN=...
TOGETHER_API_KEY=...
```