Smarter Together: Enhancing Human-AI Collaborative Grading With Teacher-Cognition Multi-Agent LLM Framework

A man with tattoos on his arm sits at a wooden desk in natural light, writing or reviewing papers while surrounded by stacked books, educational materials, an orange, and a bowl, with shelving visible in the background.
Image Credit: Photo by cottonbro studio on Pexels

AI Summary of Scholarly Research

This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article. See the Disclosure section below for details.

Publication Signals show what we were able to verify about where this research was published.

STANDARD: Available publication signals for this source were verified. Publication Signals reflect the source's verifiable credentials, not the quality of the research.

Fewer signals were independently confirmable for this source. That reflects the limits of what’s on record — not a judgment about the research.

  • ✔ Published in indexed journal
  • ✔ No retraction or integrity flags

Overview

This study addresses limitations in automated grading of open-ended short-answer responses, particularly partial-credit assignment, model calibration, and interpretability in resource-constrained educational settings. It introduces the Teacher-Cognition Multi-Agent Grading framework (TC-MAG), which models teacher decision-making through multiple anchored language model agents. The framework proceeds through rubric creation, guideline validation, independent double marking, arbitration, and confidence calibration, and it produces an explanation at each stage so that teachers can target their review.
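The staged flow described above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: every name here (`Marker`, `GradingResult`, `grade_response`) is hypothetical, the markers and arbiter are plain functions standing in for LLM agents, the rubric is assumed to have been generated upstream, and the confidence value is a toy agreement-based stand-in rather than the paper's calibration method.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: a marker maps (rubric, answer) to an integer mark.
# In a real system each marker would wrap an LLM call; here they are plain functions.
Marker = Callable[[str, str], int]


@dataclass
class GradingResult:
    mark: int
    confidence: float
    explanations: List[str] = field(default_factory=list)


def grade_response(answer: str, rubric: str, max_mark: int,
                   marker_a: Marker, marker_b: Marker,
                   arbiter: Marker) -> GradingResult:
    """Run one answer through validation, double marking, arbitration, confidence."""
    # Stage 1: a trivial stand-in for guideline validation.
    if max_mark < 1 or not rubric.strip():
        raise ValueError("rubric must be non-empty and max_mark must be positive")
    explanations = ["validation: rubric is non-empty and max_mark is positive"]

    # Stage 2: two markers grade independently (neither sees the other's mark).
    mark_a = marker_a(rubric, answer)
    mark_b = marker_b(rubric, answer)
    explanations.append(f"double marking: A={mark_a}, B={mark_b}")

    # Stage 3: arbitrate only when the two markers disagree.
    if mark_a == mark_b:
        final = mark_a
        explanations.append("arbitration: skipped, markers agree")
    else:
        final = arbiter(rubric, answer)
        explanations.append(f"arbitration: arbiter awarded {final}")

    # Stage 4: toy confidence (an assumption): 1.0 on agreement,
    # reduced in proportion to the gap between the two marks.
    confidence = max(0.0, 1.0 - abs(mark_a - mark_b) / max_mark)
    explanations.append(f"confidence: {confidence:.2f}")
    return GradingResult(final, confidence, explanations)
```

Each stage appends one line to `explanations`, so a reviewer can see where a low-confidence grade came from instead of reading a single summary.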

Methods and Approach

A preliminary motivating study informed the design of TC-MAG's operational structure. Validation used a dataset of 2,000 primary school student responses to mathematics questions worth 1 to 4 marks, evaluated against teacher-established gold-standard labels. The multi-agent architecture decomposed grading into discrete modules: rubric generation, compliance checking, two independent markings, conflict resolution via arbitration, and cross-verification with confidence scoring. Quantitative evaluation measured inter-rater reliability using Cohen's kappa for single-mark items and quadratic-weighted kappa for multi-mark items. A mixed-methods teacher study (N=14, mean experience 12.1 years) assessed explanation format, the effects of confidence scoring, and teachers' delegation decisions through structured qualitative analysis.
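For readers unfamiliar with the two agreement statistics, the following is a minimal pure-Python sketch of Cohen's kappa and its quadratic-weighted variant, using the standard textbook definitions. The function name `kappa` and its parameters are illustrative assumptions; the summary does not say which software the authors used.

```python
from typing import Hashable, Sequence


def kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable],
          categories: Sequence[Hashable], quadratic: bool = False) -> float:
    """Cohen's kappa; with quadratic=True, the quadratic-weighted variant.

    `categories` must list every possible label in rank order.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty set of items")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Joint distribution of (rater A label, rater B label) as proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        # Disagreement weight: 0/1 for plain kappa, squared distance otherwise.
        if quadratic and k > 1:
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    observed_dis = sum(weight(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected_dis = sum(weight(i, j) * row[i] * col[j]
                       for i in range(k) for j in range(k))
    if expected_dis == 0:
        raise ValueError("kappa is undefined when chance disagreement is zero")
    return 1.0 - observed_dis / expected_dis
```

With 0/1 weights the expression reduces to the familiar (p_o - p_e) / (1 - p_e). The quadratic weights penalize a 1-versus-4 disagreement far more than a 3-versus-4 one, which is why the weighted form suits items that carry several marks.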

Key Findings

TC-MAG achieved κ=0.968 on single-mark items and quadratic-weighted κ=0.936 on multi-mark items, demonstrating deployment-level reliability. Agreement with the gold-standard labels exceeded the human teacher baseline by Δκ=+0.063 (p<.001) and exceeded state-of-the-art LLM baselines by at least Δκ=+0.012 (p<.001). The teacher study indicated that explanation format and confidence scores significantly influenced grading delegation decisions. Staged explanations were more diagnostic than summarized formats (positive likelihood ratio 11.5 versus 4.60), suggesting that explanation structure shapes teacher trust and oversight behavior.
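Assuming the likelihood ratios above follow the standard diagnostic-test definition, LR+ is sensitivity divided by the false positive rate (1 minus specificity). The sketch below computes it from confusion-matrix counts. The summary does not state which event counts as a "positive" in the teacher study, so the function name and the example counts are purely illustrative and are not the study's data.

```python
def positive_likelihood_ratio(tp: int, fn: int, fp: int, tn: int) -> float:
    """LR+ = sensitivity / (1 - specificity), from confusion-matrix counts."""
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    sensitivity = tp / (tp + fn)            # true positive rate
    false_positive_rate = fp / (fp + tn)    # equals 1 - specificity
    if false_positive_rate == 0:
        raise ValueError("LR+ is unbounded when there are no false positives")
    return sensitivity / false_positive_rate


# Invented counts for illustration: 80 of 100 positives flagged,
# 10 of 100 negatives flagged by mistake.
example_lr = positive_likelihood_ratio(tp=80, fn=20, fp=10, tn=90)
```

A higher LR+ means a positive signal shifts the odds more strongly toward the condition actually being present, which is the sense in which one explanation format can be called more diagnostic than another.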

Implications

The TC-MAG framework demonstrates the feasibility of encoding pedagogical expertise in multi-agent LLM architectures for automated assessment. Reliability that exceeds the human baseline, combined with interpretability through staged explanations, addresses a central tension in deployment: automated systems must preserve teacher agency and oversight. These results support structured multi-agent approaches as intermediate solutions in resource-constrained educational contexts where human grading capacity is limited.

Disclosure

  • Research title: Smarter Together: Enhancing Human-AI Collaborative Grading With Teacher-Cognition Multi-Agent LLM Framework
  • Authors: Sanskriti Uma, Surjya Ghosh, Dio Dzaky Achmad Mustaqim
  • Institutions: Birla Institute of Technology and Science, Pilani, James Cook University Singapore, Universitas Negeri Surabaya
  • Publication date: 2026-03-03
  • DOI: https://doi.org/10.1145/3742413.3789130
  • Image credit: Photo by cottonbro studio on Pexels
  • Disclosure: This post was generated by Claude (Anthropic). The original authors did not write or review this post.
