Research Review

Start by reading the Evaluation Design. It explains the methodology, task set, rubric, scoring protocol, and limitations of the study.

Core Path

| Step | Title | Purpose |
| --- | --- | --- |
| 0 | Setup | Clone the repo, install dependencies, configure .env, and verify the local commands. |
| 1 | Project Overview | Understand the research question, data design, deliverables, and guardrails. |
| 2 | Evaluation Design | Review the task set, rubric, scoring protocol, and methodology behind the study. |
| 3 | Work Area 1: Conversation Runner | Implement run_conversations.py so the harness can build and save scripted transcripts. |
| 4 | Work Area 2: Judge Scorer | Implement run_judge.py so an LLM judge can score saved transcripts. |
| 5 | Work Area 3: Analysis | Implement analyze.py so human and judge scores become useful tables. |
| 6 | Research Workflow and Follow-Up | Run the study in the right order, write findings, and choose optional extensions. |

What You Will Build

This project turns a scaffolded repo into a working evaluation harness for AI proof tutoring. You will build the pieces needed to collect model-student transcripts, score those transcripts with both human and automated judgment, and analyze where models help or fail as proof-writing tutors.

The work has three implementation areas; a minimal sketch of each follows the list:

  • Conversation collection: build the runner that loads the proof tasks, simulates three-turn student conversations, supports dry-run and stub testing, calls real models only when ready, and saves transcripts in JSON and Markdown.
  • Automated judging: build the scorer that reads transcripts, combines them with the rubric and private reference notes, asks an LLM judge for structured scores, and writes judge results to CSV.
  • Analysis: build the report generator that compares human and judge scores, summarizes performance by model/task/rubric dimension, and produces Markdown tables for the final findings writeup.
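For the conversation runner, a minimal sketch of the three-turn loop and the dual JSON/Markdown save is below. All names here (stub_reply, run_conversation, save_transcript, the task fields, and the transcripts/ directory) are illustrative assumptions, not the scaffold's actual API; the real run_conversations.py must follow the repo's task format and output layout.

```python
# Hypothetical sketch of a conversation-runner loop; the real task schema,
# model client, and output paths come from the repo scaffold.
import json
from pathlib import Path

def stub_reply(prompt: str) -> str:
    """Stand-in for a real model call so the dry-run/stub path needs no API key."""
    return f"[stub tutor reply to: {prompt[:40]}...]"

def run_conversation(task: dict, student_turns: list[str], reply_fn=stub_reply) -> dict:
    """Build a scripted three-turn transcript for one proof task."""
    messages = [{"role": "system", "content": task["tutor_prompt"]}]
    for turn in student_turns:
        messages.append({"role": "student", "content": turn})
        messages.append({"role": "tutor", "content": reply_fn(turn)})
    return {"task_id": task["id"], "messages": messages}

def save_transcript(transcript: dict, out_dir: Path) -> None:
    """Write the transcript as JSON (for scoring) and Markdown (for human review)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = out_dir / transcript["task_id"]
    stem.with_suffix(".json").write_text(json.dumps(transcript, indent=2))
    md_lines = [f"**{m['role']}**: {m['content']}" for m in transcript["messages"]]
    stem.with_suffix(".md").write_text("\n\n".join(md_lines))

if __name__ == "__main__":
    task = {"id": "task-00", "tutor_prompt": "You are a proof-writing tutor."}
    turns = ["Here is my induction attempt...", "Is the base case right?", "How do I finish?"]
    save_transcript(run_conversation(task, turns), Path("transcripts"))
```

Swapping reply_fn from the stub to a real API client is what separates the dry-run path from real collection.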
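For the judge scorer, the sketch below shows one way to combine a saved transcript with the rubric and reference notes, call a judge, and write a CSV row per transcript. The prompt format, rubric dimensions, fake_judge stand-in, and CSV columns are all assumptions for illustration; the real run_judge.py must match the repo's rubric and judge prompt.

```python
# Hypothetical sketch of a judge-scoring pass over saved transcripts.
import csv
import json
from pathlib import Path

def build_judge_prompt(transcript: dict, rubric: str, reference_notes: str) -> str:
    """Combine one transcript with the rubric and private reference notes."""
    convo = "\n".join(f"{m['role']}: {m['content']}" for m in transcript["messages"])
    return (
        f"Rubric:\n{rubric}\n\nReference notes:\n{reference_notes}\n\n"
        f"Transcript:\n{convo}\n\nReturn JSON scores for each rubric dimension."
    )

def fake_judge(prompt: str) -> dict:
    """Stand-in for the real LLM judge call; returns structured scores."""
    return {"correctness": 3, "pedagogy": 2, "honesty": 3}

def score_transcripts(transcript_dir: Path, rubric: str, notes: str, out_csv: Path, judge=fake_judge):
    rows = []
    for path in sorted(transcript_dir.glob("*.json")):
        transcript = json.loads(path.read_text())
        scores = judge(build_judge_prompt(transcript, rubric, notes))
        rows.append({"task_id": transcript["task_id"], **scores})
    if not rows:
        raise SystemExit("no transcripts found")
    with out_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    score_transcripts(Path("transcripts"), "rubric text", "reference notes", Path("judge_scores.csv"))
```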
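For the analysis step, a small sketch of one comparison, mean human versus mean judge score per rubric dimension, rendered as a Markdown table, is shown below. The dimension names and CSV filenames are assumptions; the real analyze.py should also break results down by model and task.

```python
# Hypothetical sketch of a human-vs-judge comparison; column names are assumed.
import csv
from pathlib import Path
from statistics import mean

DIMENSIONS = ["correctness", "pedagogy", "honesty"]  # assumed rubric dimensions

def load_scores(path: Path) -> list[dict]:
    with path.open() as f:
        return list(csv.DictReader(f))

def markdown_summary(human_rows: list[dict], judge_rows: list[dict]) -> str:
    """One Markdown table: mean human vs. mean judge score for each rubric dimension."""
    lines = ["| Dimension | Human mean | Judge mean | Gap |", "| --- | --- | --- | --- |"]
    for dim in DIMENSIONS:
        h = mean(float(r[dim]) for r in human_rows)
        j = mean(float(r[dim]) for r in judge_rows)
        lines.append(f"| {dim} | {h:.2f} | {j:.2f} | {j - h:+.2f} |")
    return "\n".join(lines)

if __name__ == "__main__":
    table = markdown_summary(load_scores(Path("human_scores.csv")), load_scores(Path("judge_scores.csv")))
    Path("analysis.md").write_text(table + "\n")
    print(table)
```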

Plan on roughly 12-20 hours for the implementation and first study pass, depending on debugging time and API setup. A reasonable breakdown is 1 hour for setup and reading, 4-7 hours for the conversation runner, 3-5 hours for the judge scorer, 2-4 hours for analysis, and 2-3 hours to run the study workflow and start the findings writeup. If you run and hand-score all six configured models, reserve additional time for human scoring because the full suite produces 60 transcripts.

Work through the project in order. First make the dry-run and stub paths work without API calls. Then add real model calls, collect transcripts, score them manually, run the LLM judge, and use the analysis output to write findings.