Case study Confidential

AI Auto-Grading System

AI-powered grading for a US-based K-12 EdTech startup. Natural language assessment of student responses with teacher-in-the-loop review.

Practice: BUILD · NDA
Year: 2024
Stack: Python, OpenAI, LangChain, PostgreSQL, React

Grading time teachers wanted back.

A US-based K-12 EdTech company came to us with a problem common to their category: teachers spend hours grading written and structured responses, time they would rather spend teaching. The technology to assist with grading already existed, since LLMs are good at evaluating text against rubrics, but the available tools were either too generic (general-purpose AI not trained on K-12 content) or too narrow (specific to one subject).

They wanted a system that would automate first-pass grading across multiple subjects, integrate cleanly into existing teacher workflows, and keep humans in the loop for review and override.

Teacher trust was the real engineering problem.

Technically, getting an LLM to grade a student response against a rubric is the easy part. The hard part is making teachers trust it enough to use it — and teachers, rightly, are skeptical of black-box AI making consequential decisions about students.

The constraint that shaped the entire system: every grade had to be explainable. The AI couldn't just return a score. It had to return a score, the specific rubric criteria that influenced the score, and the exact passages of the student response that triggered each criterion.
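
To make that constraint concrete, here is a minimal sketch of what an explainable grade could look like as a data shape. It is an illustration only, not the production schema (which is withheld under NDA). The class and field names (Evidence, CriterionResult, ExplainedGrade) are hypothetical. The point it shows: the score is derived from per-criterion results, and every evidence passage must be an exact slice of the student response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """An exact passage of the student response, located by character offsets."""
    start: int
    end: int
    quote: str


@dataclass(frozen=True)
class CriterionResult:
    """Points for one rubric criterion plus the passages that justify them."""
    criterion_id: str
    points_awarded: float
    points_possible: float
    evidence: tuple[Evidence, ...] = ()


@dataclass(frozen=True)
class ExplainedGrade:
    """A grade that cannot exist without its per-criterion breakdown."""
    response_text: str
    results: tuple[CriterionResult, ...]

    @property
    def score(self) -> float:
        # The total is always derived from the breakdown, never stored separately.
        return sum(r.points_awarded for r in self.results)

    @property
    def max_score(self) -> float:
        return sum(r.points_possible for r in self.results)

    def validate(self) -> None:
        """Reject out-of-range points and evidence that is not verbatim."""
        for r in self.results:
            if not 0 <= r.points_awarded <= r.points_possible:
                raise ValueError(f"{r.criterion_id}: points out of range")
            for ev in r.evidence:
                if self.response_text[ev.start:ev.end] != ev.quote:
                    raise ValueError(f"{r.criterion_id}: evidence is not an exact passage")
```

Deriving the total from the breakdown is a deliberate choice in this sketch: a score with no criteria behind it is simply not representable.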

A three-layer system.

Layer one: rubric ingestion. Teachers upload their rubrics in plain language. The system parses them into structured criteria.
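
As a rough sketch of the target of rubric ingestion, the snippet below turns rubric text into structured criteria. It is a deliberately simple, rule-based stand-in and not the parsing approach used in the actual system. It assumes a hypothetical one-criterion-per-line format such as "Thesis (2 points): States a clear claim." The Criterion class and parse_rubric function are illustrative names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    criterion_id: str
    name: str
    max_points: float
    description: str


# Assumed line format (hypothetical): "<name> (<N> points): <description>"
_LINE = re.compile(
    r"^\s*(?P<name>[^()]+?)\s*"
    r"\((?P<points>\d+(?:\.\d+)?)\s*(?:points?|pts?)\)\s*:\s*"
    r"(?P<desc>.+?)\s*$",
    re.IGNORECASE,
)


def parse_rubric(text: str) -> list[Criterion]:
    """Parse one criterion per non-blank line; fail loudly on anything else."""
    criteria: list[Criterion] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines separate criteria and carry no content
        match = _LINE.match(line)
        if match is None:
            # Surface the problem to the teacher instead of guessing.
            raise ValueError(f"line {line_no}: could not parse criterion: {line!r}")
        criteria.append(
            Criterion(
                criterion_id=f"c{len(criteria) + 1}",
                name=match["name"],
                max_points=float(match["points"]),
                description=match["desc"],
            )
        )
    return criteria
```

Whatever does the parsing, the useful property is the same: the output is a list of discrete, individually scorable criteria, and an unparseable rubric is reported rather than silently approximated.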

Layer two: response evaluation. Student responses go through a multi-pass evaluation that scores each rubric criterion separately and highlights the specific passages of the response that serve as evidence.
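
The per-criterion pass can be sketched roughly as follows. This is an assumption-laden illustration, not the production pipeline. The judge is a pluggable callable; in a real system it would presumably wrap a model call, but here a hypothetical keyword_judge stands in so the example runs offline. All names (Criterion, Verdict, evaluate, keyword_judge) are illustrative. The sketch clamps points to the criterion's range and keeps only quotes that appear verbatim in the response.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    criterion_id: str
    description: str
    max_points: float


@dataclass(frozen=True)
class Verdict:
    points: float
    quotes: list[str]  # passages the judge claims support its score


# A judge scores ONE criterion against the response (hypothetical interface).
Judge = Callable[[Criterion, str], Verdict]


def evaluate(response: str, rubric: list[Criterion], judge: Judge) -> list[dict]:
    """Run one judging pass per criterion and attach verified evidence spans."""
    results = []
    for criterion in rubric:
        verdict = judge(criterion, response)
        # Never trust the judge to stay within the criterion's point range.
        points = min(max(verdict.points, 0.0), criterion.max_points)
        spans = []
        for quote in verdict.quotes:
            start = response.find(quote)
            if quote and start != -1:  # drop quotes that are not verbatim
                spans.append((start, start + len(quote)))
        results.append({
            "criterion_id": criterion.criterion_id,
            "points": points,
            "max_points": criterion.max_points,
            "evidence_spans": spans,
        })
    return results


def keyword_judge(keywords: dict[str, list[str]]) -> Judge:
    """Offline stand-in judge: full points if any keyword appears, else zero."""
    def judge(criterion: Criterion, response: str) -> Verdict:
        hits = [k for k in keywords.get(criterion.criterion_id, []) if k in response]
        return Verdict(criterion.max_points if hits else 0.0, hits)
    return judge
```

Checking each quote against the response text is what lets a review screen highlight evidence by character offset, and it filters out any passage the judge misquoted or invented.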

Layer three: teacher review. Every AI-generated grade lands in a teacher review queue. The teacher sees the score, the per-criterion breakdown, and the highlighted evidence. They can accept, modify, or override. Every override feeds back into the rubric refinement loop.
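
A minimal in-memory sketch of the review step, under stated assumptions: the names (ReviewQueue, Decision, PendingGrade) are hypothetical, "modify" is taken to mean changing some criterion scores, and "override" to mean replacing them all. Any criterion the teacher changes is appended to a feedback log, standing in for whatever feeds the rubric refinement loop in the real system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Decision(Enum):
    ACCEPT = "accept"
    MODIFY = "modify"      # assumed: teacher changes some criterion scores
    OVERRIDE = "override"  # assumed: teacher replaces all criterion scores


@dataclass
class PendingGrade:
    submission_id: str
    ai_scores: dict[str, float]  # criterion_id -> points proposed by the AI


@dataclass
class ReviewedGrade:
    submission_id: str
    final_scores: dict[str, float]
    decision: Decision


@dataclass
class ReviewQueue:
    pending: dict[str, PendingGrade] = field(default_factory=dict)
    feedback_log: list[dict] = field(default_factory=list)

    def submit(self, grade: PendingGrade) -> None:
        """Every AI-generated grade waits here until a teacher acts on it."""
        self.pending[grade.submission_id] = grade

    def review(
        self,
        submission_id: str,
        decision: Decision,
        teacher_scores: Optional[dict[str, float]] = None,
    ) -> ReviewedGrade:
        grade = self.pending[submission_id]
        if decision is Decision.ACCEPT:
            final = dict(grade.ai_scores)
        else:
            if not teacher_scores:
                raise ValueError("modify/override requires teacher_scores")
            if decision is Decision.MODIFY:
                final = {**grade.ai_scores, **teacher_scores}
            else:
                final = dict(teacher_scores)
        # Record each disagreement so it can inform later rubric refinement.
        for criterion_id, points in final.items():
            ai_points = grade.ai_scores.get(criterion_id)
            if ai_points != points:
                self.feedback_log.append({
                    "submission_id": submission_id,
                    "criterion_id": criterion_id,
                    "ai_points": ai_points,
                    "teacher_points": points,
                })
        del self.pending[submission_id]  # only removed once the review succeeded
        return ReviewedGrade(submission_id, final, decision)
```

In this sketch nothing leaves the queue without an explicit teacher decision, and disagreements are logged per criterion rather than per grade, so the feedback points at the specific part of the rubric that needs attention.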

Outcomes (limited detail).

System in production with multiple school districts.
Significant reduction in teacher grading time (specific metrics withheld).
Teacher override rate within acceptable accuracy bounds.
Continued engagement on feature expansion.

What we can share

Specific client name, screenshots, and certain technical details are withheld under NDA. If you're evaluating Elyshub Dev for similar work, we can discuss this project under NDA after a Discovery Sprint kickoff.

Tech stack

Python, OpenAI, LangChain, PostgreSQL, React, AWS
