Research Review

Start by reading the Evaluation Design. It explains the methodology, task set, rubric, scoring protocol, and limitations of the study.

Core Path

| Step | Title | Purpose |
| --- | --- | --- |
| 0 | Setup | Clone the repo, install dependencies, configure .env, and verify the local commands. |
| 1 | Project Overview | Understand the research question, data design, deliverables, and guardrails. |
| 2 | Evaluation Design | Review the task set, rubric, scoring protocol, and methodology behind the study. |
| 3 | Work Area 1: Conversation Runner | Implement run_conversations.py so the harness can build and save scripted transcripts. |
| 4 | Work Area 2: Judge Scorer | Implement run_judge.py so an LLM judge can score saved transcripts. |
| 5 | Work Area 3: Analysis | Implement analyze.py so human and judge scores become useful tables. |
| 6 | Research Workflow and Follow-Up | Run the study in the right order, write findings, and choose optional extensions. |

What You Will Build

This project turns a scaffolded repo into a working evaluation harness for AI proof tutoring. You will build the pieces needed to collect model-student transcripts, score those transcripts with both human and automated judgment, and analyze where models help or fail as proof-writing tutors.

The work has three implementation areas; a minimal sketch of each follows the list:

  • Conversation collection: build the runner that loads the proof tasks, simulates three-turn student conversations, supports dry-run and stub testing, calls real models only when ready, and saves transcripts in JSON and Markdown.
  • Automated judging: build the scorer that reads transcripts, combines them with the rubric and private reference notes, asks an LLM judge for structured scores, and writes judge results to CSV.
  • Analysis: build the report generator that compares human and judge scores, summarizes performance by model/task/rubric dimension, and produces Markdown tables for the final findings writeup.
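For the conversation runner, a minimal sketch of the three-turn loop and the dual JSON/Markdown save is below. All names here (stub_reply, run_conversation, save_transcript, the task fields, and the transcripts/ directory) are illustrative assumptions, not the scaffold's actual API; the real run_conversations.py must follow the repo's task format and output layout.

```python
# Hypothetical sketch of a conversation-runner loop; the real task schema,
# model client, and output paths come from the repo scaffold.
import json
from pathlib import Path

def stub_reply(prompt: str) -> str:
    """Stand-in for a real model call so the dry-run/stub path needs no API key."""
    return f"[stub tutor reply to: {prompt[:40]}...]"

def run_conversation(task: dict, student_turns: list[str], reply_fn=stub_reply) -> dict:
    """Build a scripted three-turn transcript for one proof task."""
    messages = [{"role": "system", "content": task["tutor_prompt"]}]
    for turn in student_turns:
        messages.append({"role": "student", "content": turn})
        messages.append({"role": "tutor", "content": reply_fn(turn)})
    return {"task_id": task["id"], "messages": messages}

def save_transcript(transcript: dict, out_dir: Path) -> None:
    """Write the transcript as JSON (for scoring) and Markdown (for human review)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = out_dir / transcript["task_id"]
    stem.with_suffix(".json").write_text(json.dumps(transcript, indent=2))
    md_lines = [f"**{m['role']}**: {m['content']}" for m in transcript["messages"]]
    stem.with_suffix(".md").write_text("\n\n".join(md_lines))

if __name__ == "__main__":
    task = {"id": "task-00", "tutor_prompt": "You are a proof-writing tutor."}
    turns = ["Here is my induction attempt...", "Is the base case right?", "How do I finish?"]
    save_transcript(run_conversation(task, turns), Path("transcripts"))
```

Swapping reply_fn from the stub to a real API client is what separates the dry-run path from real collection.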
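For the judge scorer, the sketch below shows one way to combine a saved transcript with the rubric and reference notes, call a judge, and write a CSV row per transcript. The prompt format, rubric dimensions, fake_judge stand-in, and CSV columns are all assumptions for illustration; the real run_judge.py must match the repo's rubric and judge prompt.

```python
# Hypothetical sketch of a judge-scoring pass over saved transcripts.
import csv
import json
from pathlib import Path

def build_judge_prompt(transcript: dict, rubric: str, reference_notes: str) -> str:
    """Combine one transcript with the rubric and private reference notes."""
    convo = "\n".join(f"{m['role']}: {m['content']}" for m in transcript["messages"])
    return (
        f"Rubric:\n{rubric}\n\nReference notes:\n{reference_notes}\n\n"
        f"Transcript:\n{convo}\n\nReturn JSON scores for each rubric dimension."
    )

def fake_judge(prompt: str) -> dict:
    """Stand-in for the real LLM judge call; returns structured scores."""
    return {"correctness": 3, "pedagogy": 2, "honesty": 3}

def score_transcripts(transcript_dir: Path, rubric: str, notes: str, out_csv: Path, judge=fake_judge):
    rows = []
    for path in sorted(transcript_dir.glob("*.json")):
        transcript = json.loads(path.read_text())
        scores = judge(build_judge_prompt(transcript, rubric, notes))
        rows.append({"task_id": transcript["task_id"], **scores})
    if not rows:
        raise SystemExit("no transcripts found")
    with out_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    score_transcripts(Path("transcripts"), "rubric text", "reference notes", Path("judge_scores.csv"))
```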
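For the analysis step, a small sketch of one comparison, mean human versus mean judge score per rubric dimension, rendered as a Markdown table, is shown below. The dimension names and CSV filenames are assumptions; the real analyze.py should also break results down by model and task.

```python
# Hypothetical sketch of a human-vs-judge comparison; column names are assumed.
import csv
from pathlib import Path
from statistics import mean

DIMENSIONS = ["correctness", "pedagogy", "honesty"]  # assumed rubric dimensions

def load_scores(path: Path) -> list[dict]:
    with path.open() as f:
        return list(csv.DictReader(f))

def markdown_summary(human_rows: list[dict], judge_rows: list[dict]) -> str:
    """One Markdown table: mean human vs. mean judge score for each rubric dimension."""
    lines = ["| Dimension | Human mean | Judge mean | Gap |", "| --- | --- | --- | --- |"]
    for dim in DIMENSIONS:
        h = mean(float(r[dim]) for r in human_rows)
        j = mean(float(r[dim]) for r in judge_rows)
        lines.append(f"| {dim} | {h:.2f} | {j:.2f} | {j - h:+.2f} |")
    return "\n".join(lines)

if __name__ == "__main__":
    table = markdown_summary(load_scores(Path("human_scores.csv")), load_scores(Path("judge_scores.csv")))
    Path("analysis.md").write_text(table + "\n")
    print(table)
```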

Plan on roughly 12-20 hours for the implementation and first study pass, depending on debugging time and API setup. A reasonable breakdown is 1 hour for setup and reading, 4-7 hours for the conversation runner, 3-5 hours for the judge scorer, 2-4 hours for analysis, and 2-3 hours to run the study workflow and start the findings writeup. If you run and hand-score all six configured models, reserve additional time for human scoring because the full suite produces 60 transcripts.

Work through the project in order. First make the dry-run and stub paths work without API calls. Then add real model calls, collect transcripts, score them manually, run the LLM judge, and use the analysis output to write findings.