---
title: Proof Assistance Evals Project
path:
  columns: [step, title, purpose]
---

[@map]: ./index.map.md

## Research Review

Review the study design first by reading the [Evaluation Design](./evals.guide.md). It explains the methodology, task set, rubric, scoring protocol, and limitations.

---

## Core Path

| Step | Title                                                              | Purpose                                                                                  |
| ---- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| 0    | [Setup](./setup.guide.md)                                          | Clone the repo, install dependencies, configure `.env`, and verify the local commands.   |
| 1    | [Project Overview](./proof-assistance-evals.project.md)            | Understand the research question, data design, deliverables, and guardrails.             |
| 2    | [Evaluation Design](./evals.guide.md)                              | Review the task set, rubric, scoring protocol, and methodology behind the study.         |
| 3    | [Work Area 1: Conversation Runner](./conversation-runner.guide.md) | Implement `run_conversations.py` so the harness can build and save scripted transcripts. |
| 4    | [Work Area 2: Judge Scorer](./judge-scorer.guide.md)               | Implement `run_judge.py` so an LLM judge can score saved transcripts.                    |
| 5    | [Work Area 3: Analysis](./analysis.guide.md)                       | Implement `analyze.py` so human and judge scores become useful tables.                   |
| 6    | [Research Workflow and Follow-Up](./research-workflow.guide.md)    | Run the study in the right order, write findings, and choose optional extensions.        |

---

## What You Will Build

This project turns a scaffolded repo into a working evaluation harness for AI proof tutoring. You will build the pieces needed to collect model-student transcripts, score those transcripts with both human and automated judgment, and analyze where models help or fail as proof-writing tutors.

The work has three implementation areas, each sketched briefly after this list:

- **Conversation collection:** build the runner that loads the proof tasks, simulates three-turn student conversations, supports dry-run and stub testing, calls real models only when ready, and saves transcripts in JSON and Markdown.
- **Automated judging:** build the scorer that reads transcripts, combines them with the rubric and private reference notes, asks an LLM judge for structured scores, and writes judge results to CSV.
- **Analysis:** build the report generator that compares human and judge scores, summarizes performance by model/task/rubric dimension, and produces Markdown tables for the final findings writeup.
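The authoritative interfaces live in the Work Area guides; the sketches below only illustrate the rough shape of each piece. First, the conversation runner. This is a minimal sketch of a three-turn loop with a dry-run stub; every name in it (`Transcript`, `call_model`, `load_tasks`-style task dicts, the file-naming scheme) is hypothetical, not the repo's actual API.

```python
# Hypothetical sketch of the conversation-runner core loop.
# All names here are illustrative, not the interfaces that
# Work Area 1 (conversation-runner.guide.md) actually defines.
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Transcript:
    task_id: str
    model: str
    turns: list = field(default_factory=list)  # [{"role": ..., "content": ...}]


def call_model(model: str, turns: list) -> str:
    # Placeholder: wire up the real model client here.
    raise NotImplementedError("real model calls come after the stub path works")


def run_conversation(task: dict, model: str, *, dry_run: bool = False) -> Transcript:
    transcript = Transcript(task_id=task["id"], model=model)
    student_msg = task["opening_message"]
    for turn in range(3):  # scripted three-turn student conversation
        transcript.turns.append({"role": "student", "content": student_msg})
        if dry_run:
            tutor_reply = f"[stub tutor reply for turn {turn + 1}]"
        else:
            tutor_reply = call_model(model, transcript.turns)
        transcript.turns.append({"role": "tutor", "content": tutor_reply})
        if turn < 2:
            student_msg = task["follow_ups"][turn]  # scripted follow-ups
    return transcript


def save_transcript(transcript: Transcript, out_dir: str) -> None:
    # Save both machine-readable JSON and human-readable Markdown.
    out = Path(out_dir)
    stem = f"{transcript.model}__{transcript.task_id}"
    (out / f"{stem}.json").write_text(json.dumps(asdict(transcript), indent=2))
    md = "\n\n".join(f"**{t['role']}**: {t['content']}" for t in transcript.turns)
    (out / f"{stem}.md").write_text(md)
```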
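Second, the judge scorer. The prompt layout, the rubric dimensions (`correctness` and `pedagogy` are stand-ins), and the JSON reply contract below are all placeholders; the rubric and scoring protocol in [Evaluation Design](./evals.guide.md) are authoritative.

```python
# Hypothetical sketch of the judge-scoring step for Work Area 2.
import csv
import json


def build_judge_prompt(transcript_md: str, rubric: str, reference_notes: str) -> str:
    return (
        "Score the tutoring transcript below against the rubric.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Private reference notes:\n{reference_notes}\n\n"
        f"Transcript:\n{transcript_md}\n\n"
        'Reply with JSON only: {"correctness": 1-5, "pedagogy": 1-5, "rationale": "..."}'
    )


def parse_judge_reply(reply: str) -> dict:
    # A real implementation should validate field names and score
    # ranges, and retry or flag replies that are not valid JSON.
    return json.loads(reply)


def write_scores(rows: list[dict], csv_path: str) -> None:
    fieldnames = ["task_id", "model", "correctness", "pedagogy", "rationale"]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```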
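Third, the analysis step. Assuming the human and judge scores land in CSVs keyed by task and model (the column names here are guesses), a pandas summary could look like this:

```python
# Hypothetical sketch of the analysis step for Work Area 3: join
# human and judge scores, summarize by model, emit Markdown.
# Column names are placeholders for whatever the real CSVs contain.
import pandas as pd


def summarize(human_csv: str, judge_csv: str) -> str:
    human = pd.read_csv(human_csv)
    judge = pd.read_csv(judge_csv)
    merged = human.merge(judge, on=["task_id", "model"], suffixes=("_human", "_judge"))
    # Mean score per model across rubric dimensions and scorers.
    by_model = merged.groupby("model").mean(numeric_only=True).round(2)
    # Simple human/judge agreement check on one dimension.
    corr = merged["correctness_human"].corr(merged["correctness_judge"])
    table = by_model.to_markdown()  # to_markdown requires the tabulate package
    return f"{table}\n\nHuman/judge correlation (correctness): {corr:.2f}\n"
```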

Plan on roughly **12-20 hours** for the implementation and first study pass, depending on debugging time and API setup. A reasonable breakdown is 1 hour for setup and reading, 4-7 hours for the conversation runner, 3-5 hours for the judge scorer, 2-4 hours for analysis, and 2-3 hours to run the study workflow and start the findings writeup. If you run and hand-score all six configured models, reserve additional time for human scoring because the full suite produces 60 transcripts.

Complete the project in order: first make the dry-run and stub paths work without API calls, then add real model calls, collect transcripts, score them manually, run the LLM judge, and use the analysis output to write findings.

---

## Backlinks

The following sources link to this document:

- [Core Path](/research-workflow.guide.llm.md)
