README.md
@@ -1,2 +1,312 @@

# Bench

Modaic internal SDK for benchmarking judges and training confidence probes.

## Installation

```bash
cd cli
uv sync
```

## CLI Commands

All commands are run from the `cli` directory via `uv run mo <command>`.

### `create`

Create benchmark datasets for training confidence probes. This command runs a judge on examples, extracts embeddings via Modal, and pushes the resulting dataset to HuggingFace Hub.

**Subcommands:**

- `create ppe` - Create dataset from PPE (human-preference + correctness) benchmarks
- `create judge_bench` - Create dataset from the JudgeBench benchmark

**Usage:**

```bash
# Interactive mode (recommended) - prompts for configuration
uv run mo create ppe
uv run mo create judge_bench

# With config file
uv run mo create ppe --config config.yaml
uv run mo create judge_bench --config config.yaml
```

**Options:**

| Option     | Short | Description                |
| ---------- | ----- | -------------------------- |
| `--config` | `-c`  | Path to config file (YAML) |

**Config File Example:**

```yaml
judge: tyrin/ppe-judge-gepa
output: tytodd/my-probe-dataset
n_train: 500
n_test: 100
embedding_layer: -1 # -1 for middle layer
```

**What it does:**

1. Loads examples from the benchmark dataset
2. Runs the specified judge on each example to get predictions
3. Extracts embeddings from the judge's LLM via Modal (GPU)
4. Creates a HuggingFace dataset with columns: `question`, `response_a`, `response_b`, `label`, `predicted`, `messages`, `embeddings`
5. Pushes to HuggingFace Hub
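
A row of the pushed dataset can be pictured as a plain record with the columns from step 4. A minimal sketch with hypothetical values (not actual SDK output):

```python
# One hypothetical row of a dataset produced by `create`.
row = {
    "question": "Which response is better?",  # prompt shown to the judge
    "response_a": "Answer A",                 # candidate response A
    "response_b": "Answer B",                 # candidate response B
    "label": "A",                             # gold preference
    "predicted": "A",                         # judge's prediction
    "messages": [                             # judge's chat transcript
        {"role": "user", "content": "Which response is better?"},
    ],
    "embeddings": [0.12, -0.07, 0.31],        # hidden-state vector (truncated)
}

print(sorted(row))  # the seven column names, alphabetically
```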

---

### `train`

Train a confidence probe on an embeddings dataset created with `create`.

**Usage:**

```bash
# Interactive mode (recommended) - prompts for all configuration
uv run mo train

# With config file
uv run mo train --config config.yaml

# With CLI arguments
uv run mo train --dataset tytodd/my-embeddings --epochs 10 --lr 0.0001
```

**Options:**

| Option           | Short | Description                                                                   | Default           |
| ---------------- | ----- | ----------------------------------------------------------------------------- | ----------------- |
| `--config`       | `-c`  | Path to config file (YAML)                                                     | -                 |
| `--dataset`      | `-d`  | Dataset path (HuggingFace Hub or local); must be a dataset made with `create` | -                 |
| `--model-path`   | `-m`  | Output path for trained model                                                  | `{dataset}_probe` |
| `--batch-size`   |       | Batch size                                                                     | 4                 |
| `--epochs`       |       | Number of training epochs                                                      | 10                |
| `--lr`           |       | Learning rate                                                                  | 0.0001            |
| `--weight-decay` |       | Weight decay                                                                   | 0.01              |
| `--test-size`    |       | Validation split ratio (if no test split)                                      | 0.2               |
| `--seed`         |       | Random seed                                                                    | 42                |
| `--project`      |       | W&B project name                                                               | model_path        |
| `--hub-path`     |       | HuggingFace Hub path to push model                                             | -                 |

**Config File Example:**

```yaml
dataset_path: tytodd/my-probe-dataset
model_path: ./best_probe
hub_path: tytodd/my-probe # Optional: push to HF Hub
batch_size: 4
epochs: 10
learning_rate: 0.0001
weight_decay: 0.01
test_size: 0.2
seed: 42
```

**What it does:**

1. Loads an embeddings dataset (from HuggingFace Hub or local)
2. Creates binary labels: 1 if `predicted == label`, 0 otherwise
3. Trains a linear probe using MSE loss (Brier score optimization)
4. Logs metrics to Weights & Biases (Brier, ECE, MCE, Kuiper, AUROC)
5. Saves the best model based on validation Brier score
6. Optionally pushes to HuggingFace Hub
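
Steps 2 and 3 can be sketched in plain Python. This is an illustration of the objective, not the SDK's actual training code:

```python
def binary_labels(predicted, gold):
    """Step 2: 1 where the judge's prediction matches the gold label, else 0."""
    return [int(p == g) for p, g in zip(predicted, gold)]

def brier_score(confidences, labels):
    """Step 3's objective: mean squared error between the probe's
    confidences and the binary correctness labels."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)

labels = binary_labels(["A", "B", "A"], ["A", "A", "A"])  # [1, 0, 1]
print(brier_score([0.9, 0.4, 0.8], labels))               # approximately 0.07
```

A lower Brier score means the probe's confidences track the judge's actual correctness more closely; a perfectly calibrated, perfectly discriminating probe scores 0.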

---

### `eval`

Evaluate a trained confidence probe on a dataset. Computes calibration and discrimination metrics.

**Usage:**

```bash
# Interactive mode (recommended) - prompts for probe and dataset
uv run mo eval

# With CLI arguments
uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings

# Evaluate on train split instead of test
uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings --split train
```

**Options:**

| Option                       | Short | Description                              | Default      |
| ---------------------------- | ----- | ---------------------------------------- | ------------ |
| `--probe`                    | `-p`  | Probe path (HuggingFace Hub or local)    | -            |
| `--dataset`                  | `-d`  | Dataset path (HuggingFace Hub or local)  | -            |
| `--split`                    | `-s`  | Dataset split to evaluate on             | test         |
| `--batch-size`               | `-b`  | Batch size for evaluation                | 64           |
| `--normalize/--no-normalize` | `-n`  | Normalize embeddings with StandardScaler | probe config |

**Metrics computed:**

| Metric      | Description                                       |
| ----------- | ------------------------------------------------- |
| Brier Score | Mean squared error between predictions and labels |
| Accuracy    | Classification accuracy at 0.5 threshold          |
| F1 Score    | Harmonic mean of precision and recall             |
| ECE         | Expected Calibration Error (10 bins)              |
| MCE         | Maximum Calibration Error                         |
| Kuiper      | Kuiper statistic for calibration                  |
| AUROC       | Area Under the ROC Curve (discrimination)         |
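
As an illustration of how binned ECE is typically computed (a sketch with 10 equal-width bins; the SDK's implementation may differ in edge handling):

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """Bin predictions by confidence; ECE is the weighted average gap
    between each bin's mean confidence and its empirical accuracy."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(labels[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Three predictions, each landing in its own bin.
print(expected_calibration_error([0.95, 0.9, 0.1], [1, 1, 0]))
```

MCE is the same quantity with `max` over bins instead of the weighted sum.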

**What it does:**

1. Loads a pretrained probe from HuggingFace Hub or local path
2. Loads a dataset created with `create`
3. Creates binary labels: 1 if `predicted == label`, 0 otherwise
4. Runs inference and computes calibration/discrimination metrics
5. Displays results in a formatted table

---

### `compile`

Compile (optimize) a judge using GEPA over a dataset. GEPA iteratively improves the judge's prompt based on training examples.

**Subcommands:**

- `compile` (base) - Compile with custom dataset and parameter mapping
- `compile ppe` - Compile specifically for PPE datasets (human-preference + correctness)

**Usage:**

```bash
# Interactive mode
uv run mo compile
uv run mo compile ppe

# With config file
uv run mo compile --config config.yaml
uv run mo compile ppe --config config.yaml
```

**Options:**

| Option     | Short | Description                |
| ---------- | ----- | -------------------------- |
| `--config` | `-c`  | Path to config file (YAML) |

**Config File Example:**

```yaml
judge: tyrin/ppe-judge
dataset: tytodd/ppe-human-preference
inputs: # selects which input columns of the dataset to use (not necessary if using a compile subcommand like ppe or judge_bench)
  - name: question
  - name: response_a
    column: response_A # Map param name to dataset column
  - name: response_b
    column: response_B
label_column: label
n_train: 100
n_val: 50
base_model: gpt-4o-mini
reflection_model: gpt-4o
output: tyrin/ppe-judge-gepa
seed: 42
```

**What it does:**

1. Loads a judge from Modaic Hub
2. Loads training/validation examples from a HuggingFace dataset
3. Maps judge parameters to dataset columns
4. Runs GEPA optimization to improve the judge's prompt
5. Pushes the optimized judge to Modaic Hub
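
Step 3's parameter mapping (the `inputs` block in the config) can be sketched as follows. `map_inputs` is a hypothetical helper for illustration, not part of the SDK:

```python
def map_inputs(row, inputs):
    """Build judge kwargs from a dataset row, renaming columns where the
    config's `column` key differs from the judge's parameter name."""
    return {spec["name"]: row[spec.get("column", spec["name"])]
            for spec in inputs}

# The `inputs` block from the config example above.
inputs = [
    {"name": "question"},
    {"name": "response_a", "column": "response_A"},
    {"name": "response_b", "column": "response_B"},
]
row = {"question": "Q?", "response_A": "first", "response_B": "second"}
print(map_inputs(row, inputs))
# -> {'question': 'Q?', 'response_a': 'first', 'response_b': 'second'}
```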

---

### `embed`

Regenerate embeddings for an existing dataset using a different model or layer. Useful for experimenting with different embedding configurations without re-running the judge.

**Usage:**

```bash
# Interactive mode
uv run mo embed

# With CLI arguments
uv run mo embed --dataset tytodd/my-dataset --hf-model Qwen/Qwen3-VL-32B-Instruct --layer -1
```

**Options:**

| Option       | Short | Description                              |
| ------------ | ----- | ---------------------------------------- |
| `--dataset`  | `-d`  | Dataset path (HuggingFace Hub or local)  |
| `--hf-model` | `-m`  | HuggingFace model path for embeddings    |
| `--layer`    | `-l`  | Hidden layer index (-1 for middle layer) |

**What it does:**

1. Loads an existing dataset (must have a `messages` column)
2. Regenerates embeddings using the specified model/layer via Modal
3. Replaces the `embeddings` column in the dataset
4. Prompts to push the updated dataset to HuggingFace Hub
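
Steps 2 and 3 amount to mapping each stored `messages` transcript to a fresh hidden-state vector. A schematic sketch with a stand-in embedder (the real computation runs on a model via Modal):

```python
def reembed(dataset, embed_fn):
    """Recompute one vector per row from its stored `messages`
    transcript and overwrite the `embeddings` column."""
    for row in dataset:
        row["embeddings"] = embed_fn(row["messages"])
    return dataset

# Stand-in embedder for illustration only; it returns the transcript
# length as a one-element "vector" instead of a real hidden state.
def fake_embed(messages):
    return [float(len(messages))]

data = [{"messages": [{"role": "user", "content": "hi"}], "embeddings": None}]
print(reembed(data, fake_embed)[0]["embeddings"])  # -> [1.0]
```

Because the transcripts are kept, only the GPU embedding pass is repeated; the judge is never re-run.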

**Example workflow:**

```bash
# Original dataset was created with layer 32
# Now try middle layer instead
uv run mo embed \
  --dataset tytodd/my-embeddings \
  --hf-model Qwen/Qwen3-VL-32B-Instruct \
  --layer -1
```

---

## Recommended Embedding Layers

When extracting embeddings, use these recommended layer indices for best probe performance:

| Model         | HuggingFace Path                    | Recommended Layer |
| ------------- | ----------------------------------- | ----------------- |
| GPT-OSS 20B   | `openai/gpt-oss-20b`                | 8                 |
| Qwen3-VL 32B  | `Qwen/Qwen3-VL-32B-Instruct`        | 16                |
| Llama 3.3 70B | `meta-llama/Llama-3.3-70B-Instruct` | 32                |

Use `-1` for the middle layer if experimenting with an unlisted model.
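
The `-1` convention reads as "resolve to the model's middle hidden layer". A hedged sketch of that resolution (the helper name and the 32-layer model are illustrative, not taken from the SDK):

```python
def resolve_embedding_layer(layer, num_hidden_layers):
    """Interpret -1 as the middle hidden layer; pass other
    indices through unchanged."""
    return num_hidden_layers // 2 if layer == -1 else layer

# For a hypothetical 32-layer model:
print(resolve_embedding_layer(-1, 32))  # -> 16
print(resolve_embedding_layer(8, 32))   # -> 8
```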

---

## Typical Workflow

```bash
# 1. Create a probe dataset from a benchmark
uv run mo create ppe

# 2. Train a confidence probe
uv run mo train --dataset tytodd/ppe-qwen3-embeddings

# 3. Evaluate the probe on a test set
uv run mo eval --probe tytodd/my-probe --dataset tytodd/ppe-qwen3-embeddings

# 4. (Optional) Compile/optimize a judge with GEPA
uv run mo compile ppe

# 5. (Optional) Re-embed with a different layer
uv run mo embed --dataset tytodd/my-dataset --layer 32
```

## Environment Variables

Create a `.env` file with:

```bash
OPENAI_API_KEY=...
WANDB_API_KEY=...
HF_TOKEN=...
MODAIC_TOKEN=...
TOGETHER_API_KEY=...
```

config.json (new file)
@@ -0,0 +1,55 @@

{
  "model": null,
  "signature": {
    "description": "Given a review title and content, determine the overall sentiment.\n\nTask: Analyze the text to determine if the reviewer has a positive or negative\nopinion about the product/service.\n\n- positive: The review expresses satisfaction, appreciation, or a favorable view\n- negative: The review expresses dissatisfaction, criticism, or an unfavorable view\n\nConsider:\n1. Overall tone and emotional language\n2. Whether the reviewer recommends the product\n3. Specific praise or complaints mentioned\n\nFirst reason through your thought process in the `reasoning` field.\nBe sure to verbalize any uncertainty in your thought process.\nThen output your conclusion in the `label` field.",
    "properties": {
      "title": {
        "__dspy_field_type": "input",
        "anyOf": [
          {
            "type": "string"
          },
          {
            "type": "null"
          }
        ],
        "default": null,
        "desc": "The title of the review",
        "prefix": "Title:",
        "title": "Title"
      },
      "content": {
        "__dspy_field_type": "input",
        "desc": "The full review text/content",
        "prefix": "Content:",
        "title": "Content",
        "type": "string"
      },
      "reasoning": {
        "__dspy_field_type": "output",
        "desc": "Your step by step reasoning for the sentiment classification. Verbalize uncertainty.",
        "prefix": "Reasoning:",
        "title": "Reasoning",
        "type": "string"
      },
      "label": {
        "__dspy_field_type": "output",
        "desc": "The sentiment: 'positive' or 'negative'",
        "enum": [
          "positive",
          "negative"
        ],
        "prefix": "Label:",
        "title": "Label",
        "type": "string"
      }
    },
    "required": [
      "content",
      "reasoning",
      "label"
    ],
    "title": "Sentiment",
    "type": "object"
  }
}

program.json (new file)
@@ -0,0 +1,44 @@

{
  "traces": [],
  "train": [],
  "demos": [],
  "signature": {
    "instructions": "Given a review title and content, determine the overall sentiment.\n\nTask: Analyze the text to determine if the reviewer has a positive or negative\nopinion about the product/service.\n\n- positive: The review expresses satisfaction, appreciation, or a favorable view\n- negative: The review expresses dissatisfaction, criticism, or an unfavorable view\n\nConsider:\n1. Overall tone and emotional language\n2. Whether the reviewer recommends the product\n3. Specific praise or complaints mentioned\n\nFirst reason through your thought process in the `reasoning` field.\nBe sure to verbalize any uncertainty in your thought process.\nThen output your conclusion in the `label` field.",
    "fields": [
      {
        "prefix": "Title:",
        "description": "The title of the review"
      },
      {
        "prefix": "Content:",
        "description": "The full review text/content"
      },
      {
        "prefix": "Reasoning:",
        "description": "Your step by step reasoning for the sentiment classification. Verbalize uncertainty."
      },
      {
        "prefix": "Label:",
        "description": "The sentiment: 'positive' or 'negative'"
      }
    ]
  },
  "lm": {
    "model": "together_ai/Qwen/Qwen3-VL-32B-Instruct",
    "model_type": "chat",
    "cache": true,
    "num_retries": 3,
    "finetuning_model": null,
    "launch_kwargs": {},
    "train_kwargs": {},
    "temperature": null,
    "max_tokens": null
  },
  "metadata": {
    "dependency_versions": {
      "python": "3.11",
      "dspy": "3.1.2",
      "cloudpickle": "3.1"
    }
  }
}