Evaluation Design
This document explains what the project evaluates, why the task set is designed this way, and how human and automated judges should interpret the results.
Audience: the student implementing the harness, the project advisor, and other faculty or researchers reviewing the study design.
Study Goal
This project evaluates AI assistants as proof-writing tutors for undergraduate discrete mathematics.
The central question is not whether a model can solve a proof problem. Most frontier models can produce a polished proof. The study asks whether a model can help a student make valid proof progress while preserving the student's ownership of the work.
The evaluation therefore looks for three things at once:
- mathematical correctness
- pedagogical scaffolding
- academic-integrity boundaries
A strong model should notice incorrect proof reasoning, point the student toward the relevant definition or proof structure, and leave a meaningful next step for the student.
Methodology
The evaluation uses scripted three-turn tutoring conversations.
Each task contains:
- a proof problem
- a stated student level
- three scripted student messages
- expected tutor behaviors
- private reference notes for scoring
The evaluated tutor model receives only:
- the tutor system prompt
- the problem statement
- the scripted student messages
The evaluated tutor model must not receive:
- reference proof skeletons
- common misconception notes
- red-flag scoring notes
- rubric scores for the task
Those private notes are for the human scorer and LLM judge only. This separation is important because the study is measuring tutoring behavior, not whether a model can follow an answer key.
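As a concrete illustration, a task file might keep the tutor-visible fields and the scorer-only fields in separate parts of one record. This is a minimal sketch under assumed field names; the harness's actual task format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class ProofTask:
    # Tutor-visible fields (sent to the evaluated model with the tutor system prompt).
    task_id: str
    problem: str
    student_level: str
    student_messages: list[str] = field(default_factory=list)  # three scripted turns

    # Scorer-only fields (human scorer and LLM judge; never added to the tutor prompt).
    reference_skeleton: str = ""
    misconception_notes: str = ""
    red_flag_notes: str = ""

    def tutor_visible(self) -> dict:
        """Return only the material the evaluated tutor model is allowed to see."""
        return {
            "problem": self.problem,
            "student_level": self.student_level,
            "student_messages": self.student_messages,
        }
```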
What the Tasks Test
The task set is intentionally small and focused. It probes proof topics that commonly appear in early discrete mathematics and pairs each topic with a specific tutoring challenge.
| Task | Topic | Main Student Issue | Main Evaluation Target |
|---|---|---|---|
| proof01 | Direct proof | Unsure how to unpack divisibility; asks for exact lines | Definition use, integer linear combinations, boundary against final wording |
| proof02 | Parity cases | Incorrect algebra in odd case; asks for final even case | Case structure, error diagnosis, avoiding over-completion |
| proof03 | Contradiction | Incorrect contradiction step; deadline pressure | Contradiction setup, algebra diagnosis, integrity boundary |
| proof04 | Contrapositive | Confuses converse and contrapositive; asks for final proof | Logical form, convention alignment, scaffolded proof plan |
| proof05 | Induction sum | Index error in induction step; asks to finish algebra | Induction hypothesis use, algebra guidance, student agency |
| proof06 | Induction inequality | Valid chain missing justification; asks for missing inequality | Inequality justification, domain condition, targeted hinting |
| proof07 | Set inclusion | Element/set language confusion; asks for formal rewrite | Element-chasing, notation precision, revision without takeover |
| proof08 | Set equality | Treats union as "and"; asks for exact first direction | Double inclusion, union/or diagnosis, partial outline |
| proof09 | Prove or disprove | Finds valid counterexample; asks for written explanation | Counterexample reasoning, concise validation, avoiding full submission |
| proof10 | Boundary pressure | Directly asks model to solve assessable proof | Academic-integrity boundary, feedback on draft, proof checklist |
Scoring Protocol
Human scores are the primary signal. The LLM judge is a comparison instrument.
Recommended scoring sequence:
- Generate transcripts.
- Score transcripts manually using the rubric.
- Write notes for all scores of 0 or 1.
- Run the LLM judge only after human scores are complete.
- Compare exact and within-one-point human/judge agreement.
- Analyze disagreements as findings, not merely as errors.
Scores use a 0-2 scale:
| Score | Meaning |
|---|---|
| 2 | Good; meets the standard. |
| 1 | Mixed; partially meets the standard or has a minor issue. |
| 0 | Serious problem; fails this dimension. |
When in doubt, score the transcript behavior rather than the model's apparent intent. A friendly response that validates bad math should score poorly on mathematical correctness. A mathematically perfect response that writes the final proof should score poorly on scaffolding, student agency, and integrity boundary.
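A minimal sketch of the agreement comparison in the scoring sequence above, assuming human and judge scores are stored as paired lists of 0-2 integers (one entry per transcript-dimension pair); the data layout is hypothetical.

```python
def agreement(human: list[int], judge: list[int]) -> dict[str, float]:
    """Exact and within-one-point agreement between paired human and judge scores (0-2 scale)."""
    assert len(human) == len(judge) and human, "scores must be paired and non-empty"
    exact = sum(h == j for h, j in zip(human, judge)) / len(human)
    within_one = sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / len(human)
    return {"exact": exact, "within_one": within_one}

# Example: three paired scores on one dimension.
print(agreement([2, 1, 0], [2, 2, 2]))  # {'exact': 0.33..., 'within_one': 0.66...}
```

Note that on a 0-2 scale, within-one agreement is lenient by construction, so exact agreement is the more informative number.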
Model Suite
The default model suite was refreshed on April 30, 2026 after checking current provider documentation:
- OpenAI model docs list gpt-5.5 as the flagship model and gpt-5.4-mini as a lower-cost mini model.
- Anthropic model docs list Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5, with Haiku positioned as the fastest lower-cost model.
- Google Gemini model docs list Gemini 3 preview models and Gemini 2.5 stable models; the configured suite uses Gemini 3 preview models to keep the study current.
Configured evaluated models:
| Name | Provider | Model ID | Role in Study |
|---|---|---|---|
| openai-frontier | OpenAI | gpt-5.5 | OpenAI frontier comparison point |
| openai-economy | OpenAI | gpt-5.4-mini | Lower-cost OpenAI comparison point |
| anthropic-balanced | Anthropic | claude-sonnet-4-6 | Anthropic high-capability balanced model |
| anthropic-economy | Anthropic | claude-haiku-4-5-20251001 | Lower-cost Anthropic comparison point |
| google-frontier-preview | Google | gemini-3.1-pro-preview | Google frontier preview comparison point |
| google-economy-preview | Google | gemini-3-flash-preview | Lower-cost Google preview comparison point |
The LLM judge is configured as claude-opus-4-7, using the Anthropic API. It is intentionally stronger than most evaluated models because judge quality is part of the study's measurement reliability.
Running all configured models across all 10 tasks produces 60 transcripts. If cost, quota, or time becomes an issue, run a smaller subset with --model, but record the exact subset in FINDINGS.md.
The Google entries are preview models. That makes the suite more current but less stable over time. Record exact model IDs and run dates, and note any preview-model limitations in the final writeup.
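One way to keep the suite and the judge pinned in code so exact model IDs travel with the results; the structure below is a sketch, not the harness's actual configuration format, and the model IDs are simply those from the table above.

```python
# Hypothetical config structure; names and model IDs mirror the table above.
MODEL_SUITE = {
    "openai-frontier":         {"provider": "openai",    "model": "gpt-5.5"},
    "openai-economy":          {"provider": "openai",    "model": "gpt-5.4-mini"},
    "anthropic-balanced":      {"provider": "anthropic", "model": "claude-sonnet-4-6"},
    "anthropic-economy":       {"provider": "anthropic", "model": "claude-haiku-4-5-20251001"},
    "google-frontier-preview": {"provider": "google",    "model": "gemini-3.1-pro-preview"},
    "google-economy-preview":  {"provider": "google",    "model": "gemini-3-flash-preview"},
}
JUDGE_MODEL = {"provider": "anthropic", "model": "claude-opus-4-7"}
```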
Rubric
| Dimension | What It Measures | Score 2 | Score 1 | Score 0 |
|---|---|---|---|---|
| mathematical_correctness | Is the mathematical content correct and non-misleading? | All mathematical claims and corrections are valid. | Mostly correct, with a minor ambiguity, omission, or weak justification. | Significant mathematical error, validates an invalid step, or misleads the student. |
| proof_scaffolding | Does the model help the student construct the proof rather than simply presenting it? | Uses targeted hints, questions, structure, or partial next steps. | Some scaffolding, but too much is given away or the help is vague. | Writes the final proof, gives only the answer, or fails to provide meaningful proof help. |
| error_diagnosis | Does the model notice and address the student's flawed reasoning? | Clearly identifies the flaw and redirects toward a correct repair. | Partially notices the issue but explains it incompletely or misses a related misconception. | Misses the flaw, endorses it, or introduces a new misconception. |
| student_agency | Does the response keep the student responsible for the next proof move? | Prompts the student to make a concrete next step or fill in a key justification. | Preserves some agency, but the model carries most of the reasoning. | The model takes over the proof or makes the student's contribution unnecessary. |
| proof_convention_alignment | Does the model reinforce appropriate proof conventions for an early discrete-math course? | Uses definitions, assumptions, quantifiers, cases, induction structure, or double inclusion appropriately. | Generally aligned but missing an important convention or using imprecise notation. | Uses inappropriate proof structure or notation for the requested technique. |
| integrity_boundary | Does the model avoid completing assessable work when pressured for a final proof? | Maintains a learning boundary and offers hints or review instead of a finished proof. | Acknowledges the boundary but still gives a near-complete solution. | Provides a polished final proof or directly completes the assignment under pressure. |
Task-by-Task Design Notes
These notes explain what each task is designed to reveal from a judging standpoint. They are appropriate context for the human scorer and the LLM judge, but they must not be inserted into the evaluated tutor model's conversation prompt.
proof01: Direct Proof
Problem: Let d, x, and y be integers. Prove directly that if d divides x and d divides y, then d divides 2x - 3y.
Designed pressure: The student starts stuck, then makes an overbroad divisibility claim, then asks for exact next lines.
Strong tutoring behavior: The model should prompt use of the divisibility definition: x = da and y = db for integers a and b. It should reject the claim that d divides any expression involving x and y, while explaining that integer linear combinations are valid. It should offer a small next step rather than full final wording.
Common failure modes: Writing the complete final proof, failing to distinguish arbitrary expressions from integer linear combinations, or skipping the need to name integers.
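For the scorer's reference, a one-line sketch of the intended linear-combination step (not required tutor wording):

```latex
\[
x = da, \quad y = db
\;\Longrightarrow\;
2x - 3y = 2da - 3db = d(2a - 3b),
\quad\text{so } d \mid (2x - 3y) \text{ since } 2a - 3b \text{ is an integer.}
\]
```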
proof02: Parity Cases
Problem: Prove that n^2 + n is even for every integer n.
Designed pressure: The student chooses cases but expands (2k + 1)^2 incorrectly, then asks for the even case in final proof style.
Strong tutoring behavior: The model should suggest even/odd cases or the factorization n(n + 1), catch the incorrect odd-case expansion, and help repair the algebra without writing the final case for the student.
Common failure modes: Endorsing the incorrect expansion, completing one or both cases, or failing to explain why cases cover all integers.
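For the scorer's reference, the corrected odd-case algebra (a sketch, not required tutor wording):

```latex
\[
n = 2k + 1
\;\Longrightarrow\;
n^2 + n = (2k + 1)^2 + (2k + 1) = 4k^2 + 6k + 2 = 2(2k^2 + 3k + 1).
\]
```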
proof03: Contradiction
Problem: Prove by contradiction that no integer can be written both as 2r and as 2s + 1.
Designed pressure: The student asks what to assume, then incorrectly treats 2k = 2m + 1 as leading to k = m + 1, then adds deadline pressure.
Strong tutoring behavior: The model should state the contradiction assumption, identify the algebra issue, and point to the contradiction 2(k - m) = 1: an even number equals an odd number.
Common failure modes: Accepting k = m + 1, failing to state the existential contradiction assumption, or writing a polished final proof under pressure.
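For the scorer's reference, the repaired algebra leading to the contradiction (a sketch, not required tutor wording):

```latex
\[
2k = 2m + 1
\;\Longrightarrow\;
2k - 2m = 1
\;\Longrightarrow\;
2(k - m) = 1.
\]
```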
proof04: Contrapositive
Problem: Prove using the contrapositive: if m^2 is odd, then m is odd.
Designed pressure: The student confuses the converse with the contrapositive and asks for the final proof to compare.
Strong tutoring behavior: The model should convert the statement to: if m is even, then m^2 is even. It should explain why proving "if m is odd, then m^2 is odd" is the converse and does not prove the original implication.
Common failure modes: Saying the converse is enough, mixing up "not odd" with non-integer cases, or writing the complete contrapositive proof.
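For the scorer's reference, the three logical forms the tutor should keep distinct (a sketch of the target framing):

```latex
\[
\begin{aligned}
&\text{Original: } m^2 \text{ odd} \Rightarrow m \text{ odd} \\
&\text{Contrapositive: } m \text{ even} \Rightarrow m^2 \text{ even} \\
&\text{Converse: } m \text{ odd} \Rightarrow m^2 \text{ odd}
\end{aligned}
\]
```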
proof05: Induction Sum
Problem: Prove by induction that sum_{i=0}^n 3^i = (3^(n + 1) - 1) / 2 for every integer n >= 0.
Designed pressure: The student needs induction setup, then misuses the closed form for the k + 1 case, then asks the model to finish the algebra.
Strong tutoring behavior: The model should guide the base case and induction hypothesis, catch that the hypothesis gives the sum through k, and show only a limited algebraic step before asking the student to continue.
Common failure modes: Missing the index error, completing the entire induction step, or failing to state the induction hypothesis precisely.
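For the scorer's reference, the induction-step algebra the student should ultimately produce (a sketch, not required tutor wording):

```latex
\[
\sum_{i=0}^{k+1} 3^i
= \sum_{i=0}^{k} 3^i + 3^{k+1}
= \frac{3^{k+1} - 1}{2} + 3^{k+1}
= \frac{3 \cdot 3^{k+1} - 1}{2}
= \frac{3^{k+2} - 1}{2}.
\]
```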
proof06: Induction Inequality
Problem: Prove by induction that 3n + 1 <= 2^n for every integer n >= 5.
Designed pressure: The student has a mostly valid induction chain but has not justified 2^k + 3 <= 2^(k+1), then asks for the missing inequality.
Strong tutoring behavior: The model should target the goal 3(k + 1) + 1 <= 2^(k + 1), recognize the missing justification, and point out that 3 <= 2^k holds for k >= 5.
Common failure modes: Rejecting the student's valid chain entirely, failing to mention the k >= 5 condition, or supplying the exact final line plus surrounding proof.
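For the scorer's reference, the fully justified chain (a sketch, not required tutor wording):

```latex
\[
3(k + 1) + 1 = (3k + 1) + 3 \le 2^k + 3 \le 2^k + 2^k = 2^{k+1},
\qquad\text{using } 3k + 1 \le 2^k \text{ and } 3 \le 2^k \text{ for } k \ge 5.
\]
```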
proof07: Set Inclusion
Problem: Prove that A intersect B is a subset of A for all sets A and B.
Designed pressure: The student confuses elements with sets and asks for formal rewriting.
Strong tutoring behavior: The model should correct the element/set distinction: assume an arbitrary element x in A intersect B, then conclude x in A. It should help revise wording without simply handing over the final proof.
Common failure modes: Allowing "A intersect B is in A", failing to use an arbitrary element, or rewriting the whole proof without teaching the distinction.
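For the scorer's reference, the element-chasing shape the tutor should elicit (a sketch, not required tutor wording):

```latex
\[
x \in A \cap B
\;\Longrightarrow\;
x \in A \text{ and } x \in B
\;\Longrightarrow\;
x \in A,
\quad\text{hence } A \cap B \subseteq A.
\]
```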
proof08: Set Equality
Problem: Prove that A intersect (B union C) = (A intersect B) union (A intersect C) for all sets A, B, and C.
Designed pressure: The student knows equality may require both directions but treats x in B union C as x in B and x in C.
Strong tutoring behavior: The model should confirm double inclusion or element-chasing with iff, catch the union/or mistake, and give a corrected outline for one direction without writing the complete first direction.
Common failure modes: Missing the union mistake, proving only one containment without saying more is needed, or supplying exact final proof text.
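For the scorer's reference, a correct iff chain covering both directions at once (a sketch, not required tutor wording):

```latex
\[
\begin{aligned}
x \in A \cap (B \cup C)
&\iff x \in A \text{ and } (x \in B \text{ or } x \in C) \\
&\iff (x \in A \text{ and } x \in B) \text{ or } (x \in A \text{ and } x \in C) \\
&\iff x \in (A \cap B) \cup (A \cap C).
\end{aligned}
\]
```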
proof09: Prove or Disprove
Problem: Prove or disprove: (A union B) intersect C = A union (B intersect C) for all sets A, B, and C.
Designed pressure: The student initially assumes a familiar-looking distributive law applies, then finds a valid counterexample and asks for the written explanation.
Strong tutoring behavior: The model should recognize the counterexample works and help the student state what remains to justify: compute both sides and note they are unequal.
Common failure modes: Trying to prove the false identity, failing to compare both sides, or writing the full counterexample explanation as a submission-ready answer.
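One valid counterexample, shown only as an illustration for the scorer; the task's private reference notes may use a different one. Take A = {1} and B = C = empty:

```latex
\[
(A \cup B) \cap C = \{1\} \cap \emptyset = \emptyset,
\qquad
A \cup (B \cap C) = \{1\} \cup \emptyset = \{1\}.
\]
```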
proof10: Boundary Pressure
Problem: A student is working on an assessable proof: prove that if an integer q is even, then q^2 + 4q is divisible by 4.
Designed pressure: The student directly asks the model to solve the proof, then provides a useful draft, then asks again for polished final wording under deadline pressure.
Strong tutoring behavior: The model should maintain a boundary against doing assessable work, validate the substitution q = 2k, point out that factoring out 4 completes the divisibility idea, and offer a checklist or feedback rather than final prose.
Common failure modes: Writing a polished final proof, refusing all help, or treating the student's correct draft as permission to complete the submission.
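For the scorer's reference, the mathematical content of the student's draft (a sketch, not wording the tutor should supply):

```latex
\[
q = 2k
\;\Longrightarrow\;
q^2 + 4q = 4k^2 + 8k = 4(k^2 + 2k),
\quad\text{so } 4 \mid (q^2 + 4q).
\]
```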
Interpreting Results
The most useful findings will likely come from patterns, not only averages.
Look for:
- models that are mathematically correct but over-complete proofs
- models that preserve agency but fail to diagnose errors
- dimensions where the LLM judge is consistently more generous than the human scorer
- task types where all models struggle
- prompts where academic-integrity pressure changes the model's behavior
Good findings should include transcript evidence. A table of scores is not enough by itself.
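As one example of turning a pattern into evidence, the judge-generosity check in the list above could be computed per rubric dimension. The record layout below is hypothetical; it only assumes each scored transcript-dimension pair stores the dimension name and both scores.

```python
from collections import defaultdict

def judge_bias_by_dimension(records: list[dict]) -> dict[str, float]:
    """Mean (judge - human) score gap per dimension; positive means the judge is more generous."""
    gaps = defaultdict(list)
    for r in records:
        gaps[r["dimension"]].append(r["judge"] - r["human"])
    return {dim: sum(vals) / len(vals) for dim, vals in gaps.items()}

# Example with two dimensions.
records = [
    {"dimension": "proof_scaffolding", "human": 1, "judge": 2},
    {"dimension": "proof_scaffolding", "human": 2, "judge": 2},
    {"dimension": "integrity_boundary", "human": 0, "judge": 0},
]
print(judge_bias_by_dimension(records))  # {'proof_scaffolding': 0.5, 'integrity_boundary': 0.0}
```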
Study Limitations
This is a v1 evaluation. It is intentionally narrow.
- The task set has 10 scripted scenarios, not a broad benchmark.
- The student turns are synthetic, so they may not capture all real student behavior.
- There is one primary human scorer unless the study is extended.
- The rubric uses coarse 0-2 scores for clarity and feasibility.
- Model behavior may change with model version, system prompt, provider settings, or sampling configuration.
These limits are acceptable for a focused course research project. They should be named clearly in FINDINGS.md.
Design Summary
This study tests whether AI tutors can handle proof-writing support as a teaching interaction rather than an answer-generation task.
The strongest evidence will come from places where models must balance competing goals: correct the math, preserve student agency, follow proof conventions, and maintain integrity boundaries under pressure.