# Research Review
Start by reading the Evaluation Design; it explains the methodology, task set, rubric, scoring protocol, and limitations of the study.
## Core Path
| Step | Title | Purpose |
|---|---|---|
| 0 | Setup | Clone the repo, install dependencies, configure .env, and verify the local commands. |
| 1 | Project Overview | Understand the research question, data design, deliverables, and guardrails. |
| 2 | Evaluation Design | Review the task set, rubric, scoring protocol, and methodology behind the study. |
| 3 | Work Area 1: Conversation Runner | Implement run_conversations.py so the harness can build and save scripted transcripts. |
| 4 | Work Area 2: Judge Scorer | Implement run_judge.py so an LLM judge can score saved transcripts. |
| 5 | Work Area 3: Analysis | Implement analyze.py so human and judge scores become useful tables. |
| 6 | Research Workflow and Follow-Up | Run the study in the right order, write findings, and choose optional extensions. |
## What You Will Build
This project turns a scaffolded repo into a working evaluation harness for AI proof tutoring. You will build the pieces needed to collect model-student transcripts, score those transcripts with both human and automated judgment, and analyze where models help or fail as proof-writing tutors.
The work has three implementation areas, each illustrated with a short sketch after this list:
- Conversation collection: build the runner that loads the proof tasks, simulates three-turn student conversations, supports dry-run and stub testing, calls real models only when ready, and saves transcripts in JSON and Markdown.
- Automated judging: build the scorer that reads transcripts, combines them with the rubric and private reference notes, asks an LLM judge for structured scores, and writes judge results to CSV.
- Analysis: build the report generator that compares human and judge scores, summarizes performance by model/task/rubric dimension, and produces Markdown tables for the final findings writeup.
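To make the first area concrete, here is a minimal sketch of how a transcript might be represented and saved in both formats. The `Transcript` dataclass, the `save_transcript` helper, and the file naming are illustrative assumptions, not the scaffold's actual API.

```python
# Illustrative sketch only: the real run_conversations.py defines its own types.
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

@dataclass
class Transcript:
    model: str                                   # tutor model identifier
    task_id: str                                 # proof task being tutored
    turns: list = field(default_factory=list)    # [{"role": ..., "content": ...}, ...]

def save_transcript(t: Transcript, out_dir: Path) -> None:
    """Write one transcript as JSON (for the judge) and Markdown (for human scoring)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = f"{t.model}__{t.task_id}"
    (out_dir / f"{stem}.json").write_text(json.dumps(asdict(t), indent=2))
    md_body = "\n\n".join(f"**{m['role']}**: {m['content']}" for m in t.turns)
    (out_dir / f"{stem}.md").write_text(md_body)
```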
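The second area, automated judging, can be pictured as one CSV row of rubric scores per saved transcript. The dimension names, column layout, and the `judge_fn` callable below are assumptions; the actual rubric and judge prompt live in the repo.

```python
# Illustrative sketch only: column names and rubric dimensions are placeholders.
import csv
import json
from pathlib import Path

RUBRIC_DIMENSIONS = ["correctness", "guidance", "error_detection"]  # placeholder names

def score_transcripts(transcript_dir: Path, out_csv: Path, judge_fn) -> None:
    """judge_fn(prompt) -> {dimension: score}; wire it to whatever LLM client the repo uses."""
    with out_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "task_id", *RUBRIC_DIMENSIONS])
        writer.writeheader()
        for path in sorted(transcript_dir.glob("*.json")):
            t = json.loads(path.read_text())
            prompt = (f"Score this tutoring transcript on {RUBRIC_DIMENSIONS}.\n"
                      f"Transcript: {json.dumps(t['turns'])}")
            writer.writerow({"model": t["model"], "task_id": t["task_id"], **judge_fn(prompt)})
```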
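The third area, analysis, is essentially a merge-and-groupby over the human and judge score files. The file layout and column names below are assumptions about what the earlier steps produce.

```python
# Illustrative sketch only: adapt column names to what your scoring steps actually emit.
import pandas as pd

def summarize(human_csv: str, judge_csv: str) -> str:
    """Return a Markdown table of mean human vs. judge scores per model."""
    human = pd.read_csv(human_csv)   # assumed columns: model, task_id, dimension, score
    judge = pd.read_csv(judge_csv)
    merged = human.merge(judge, on=["model", "task_id", "dimension"],
                         suffixes=("_human", "_judge"))
    summary = (merged.groupby("model")[["score_human", "score_judge"]]
                     .mean().round(2).reset_index())
    return summary.to_markdown(index=False)  # needs the optional 'tabulate' package
```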
Plan on roughly 12-20 hours for the implementation and first study pass, depending on debugging time and API setup. A reasonable breakdown is 1 hour for setup and reading, 4-7 hours for the conversation runner, 3-5 hours for the judge scorer, 2-4 hours for analysis, and 2-3 hours to run the study workflow and start the findings writeup. If you run and hand-score all six configured models, reserve additional time for human scoring because the full suite produces 60 transcripts.
The project should be completed in order. First make the dry-run and stub paths work without API calls. Then add real model calls, collect transcripts, score them manually, run the LLM judge, and use the analysis output to write findings.
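One way to make the stub path concrete: a stub model is just a callable that returns a canned tutor reply, so the whole pipeline can be exercised offline before any API key is configured. The signature below is an assumption about how the runner might accept a model callable.

```python
# Illustrative stub only: match it to whatever interface run_conversations.py expects.
def stub_model(messages: list[dict]) -> str:
    """Return a canned tutor reply so transcripts can be generated without API calls."""
    last_student_turn = messages[-1]["content"] if messages else ""
    return f"[stub tutor reply to: {last_student_turn[:60]}]"
```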