AI grading is often framed as all or nothing. Either a system marks everything on its own, or humans keep every answer on their desks. Two recent papers from our research team point to a more useful view. In practice, the key question is whether a model can separate confident cases from uncertain ones.
With short answers, a solid average accuracy is not enough. If a system waves through wrong answers, points are awarded where they should not be. If it grades too harshly, correct answers get penalized. Both failure modes cause real trouble for teachers, exam teams, and learning systems.
The ICALT paper therefore does not stop at one summary number. It separates false passes from false fails and asks a practical deployment question: should a model decide every case on its own, or only the ones where it is clearly confident?
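To make the distinction concrete, both error types can be computed from a binary pass/fail view of the labels. The sketch below is illustrative only; the function name and the boolean encoding are our own, not taken from the paper.

```python
def error_rates(predicted, reference):
    """Split disagreements into false passes and false fails.

    predicted, reference: lists of booleans, True = answer counted as correct.
    Illustrative helper; the paper's exact evaluation protocol may differ.
    """
    false_pass = sum(p and not r for p, r in zip(predicted, reference))
    false_fail = sum(r and not p for p, r in zip(predicted, reference))
    n = len(reference)
    return false_pass / n, false_fail / n

# One false pass and one false fail out of four answers:
fp, ff = error_rates(
    predicted=[True, False, True, False],
    reference=[True, True, False, False],
)
print(fp, ff)  # 0.25 0.25
```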
The ICALT study evaluates short answers from a bachelor-level course on version control with Git. The analyzed subset contains 1,056 answers with a human reference label, spread across nine open questions. Eight language models were tested under two prompt variants.
The main observation is straightforward. Not every answer needs to be graded automatically. Once only high-confidence answers were accepted, agreement with the human reference increased. In the strict setup, agreement reached 94.0 percent. The trade-off was coverage: only 80.0 percent of answers were decided without human intervention.
The workflow became even more cautious when a single model's verdict was not enough and all eight models had to agree. In that setting, agreement rose to 97.8 percent, while automatic decisions covered only 61.0 percent of answers. That is not a flaw. It is a clean operating rule: obvious cases move forward, borderline cases go to a human reviewer.
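As a rough illustration of that operating rule, the routing logic fits in a few lines. Everything below is an assumption made for the sketch: the threshold value, the idea that each model returns a label together with a confidence score, and the route_answer interface. The paper describes the rule, not this code.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed value, not taken from the paper

def route_answer(answer, models):
    """Decide automatically only when the ensemble is clearly confident.

    models: callables returning (label, confidence); hypothetical interface.
    Returns ("auto", label) for obvious cases, ("human", None) otherwise.
    """
    votes = [model(answer) for model in models]
    labels = {label for label, _ in votes}
    all_confident = all(conf >= CONFIDENCE_THRESHOLD for _, conf in votes)

    if len(labels) == 1 and all_confident:
        return "auto", labels.pop()  # unanimous and confident: grade automatically
    return "human", None             # disagreement or low confidence: escalate
```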
The paper also shows the downside. Changes in prompt wording can shift the balance between false passes and false fails. Teams that deploy these systems need re-audits after prompt updates, not blind trust in a score from an earlier test run.
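One lightweight way to build that re-audit into a deployment is a regression check that reruns a fixed, human-labeled audit set after every prompt change. The tolerances below and the grade_with_prompt callable are assumptions for the sketch, reusing the error_rates helper from above; nothing here is prescribed by the paper.

```python
MAX_FALSE_PASS = 0.03  # assumed tolerance; set to your own risk appetite
MAX_FALSE_FAIL = 0.05

def audit_prompt(grade_with_prompt, prompt, audit_answers, audit_labels):
    """Rerun a fixed audit set after a prompt update and flag drift.

    grade_with_prompt: hypothetical callable (prompt, answer) -> bool.
    Uses error_rates() from the earlier sketch.
    """
    predicted = [grade_with_prompt(prompt, answer) for answer in audit_answers]
    fp, ff = error_rates(predicted, audit_labels)
    if fp > MAX_FALSE_PASS or ff > MAX_FALSE_FAIL:
        raise RuntimeError(
            f"Prompt audit failed: false pass {fp:.1%}, false fail {ff:.1%}"
        )
    return fp, ff
```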
The second paper comes from the BEA 2026 Shared Task on rubric-based short answer scoring for German. In simple terms, this is an official benchmark: every team gets the same dataset and the same task. Systems had to classify answers as correct, partially correct, or incorrect based on a textual rubric.
The dataset contains 7,899 labeled answers across 78 STEM questions. The WSE Research submission combined three building blocks: a clearly structured rubric inside the prompt, automatically selected comparison examples from similar answers, and fine-tuned open Qwen models at several scales. For some tracks, the system also used weighted aggregation across multiple models.
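The aggregation step can be pictured as a weighted vote over the three labels. The weights and the example values below are illustrative assumptions; the submission's actual weighting scheme is described in the paper itself.

```python
def weighted_vote(predictions, weights):
    """Aggregate per-model labels into one decision by weighted voting.

    predictions: labels such as "correct", "partially correct", "incorrect"
    weights: one weight per model, e.g. derived from validation performance
    Illustrative sketch; not the submission's exact scheme.
    """
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

# Two lighter models outvote one heavier model:
print(weighted_vote(
    ["correct", "partially correct", "partially correct"],
    [0.5, 0.3, 0.3],
))  # -> "partially correct"
```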
The result matters because no single trick carried the whole system. On the trial set, the best fine-tuned open model outperformed the best prompt-based commercial model. In the official leaderboard, the approach finished second on three of four tracks and third on the remaining one. The gap to first place was only 0.006 to 0.017 points, depending on the track.
Model aggregation helped on familiar question types. On previously unseen questions, the largest fine-tuned model was stronger. That is the practical message of the paper: strong AI grading does not come from a famous model name alone. It comes from rubrics, examples, and careful task-specific adaptation.
Taken together, the two papers tell the same story in different settings. The ICALT paper shows what a cautious operating mode can look like. The BEA paper shows which ingredients still matter when the method is tested against other teams on the same benchmark.
We already covered the architectural side of this work in our post on ICWE 2026. If your team is thinking about AI in grading, training, or certification, we would be glad to continue the conversation in a ScormIQ demo.