
Why it feels different
LGR22 assessment often feels straightforward on paper and surprisingly complex in practice. The criteria invite a holistic judgement, but pre-test season can quietly narrow our thinking. When time is tight, it is tempting to build one large, high-stakes test and let it “decide” the grade. The risk is not just stress; it is validity. A single sitting can over-reward test technique, under-represent practical or oral evidence, and miss the “over time” pattern that the criteria imply.
This is also where AI can help and harm. Used well, it can speed up drafting and improve alignment. Used carelessly, it can produce polished but mis-targeted tasks, or nudge you into criterion drift. If you are building inspection-ready routines around LGR22, it helps to connect assessment design to the wider curriculum throughlines and documentation habits described in LGR22 Section 2 throughlines and micro-tools.
Value words and evidence
A quick refresher helps before you generate anything. LGR22 betygskriterier often hinge on value words that signal quality rather than quantity. In many subjects, E-level evidence tends to show that the pupil can do the core task with basic relevance and some correctness. C-level evidence often adds security, clearer reasoning, and more appropriate choices. A-level evidence typically includes precision, independence, and well-judged adaptations to context.
The key is that these qualities rarely show up consistently in one attempt. A pupil may produce an A-like response on a familiar topic and an E-like response when the context changes. So “evidence over time” is less about collecting lots of marks and more about sampling performance across moments and modes. If you want a broader map from LGR22 intentions to practical classroom workflows, LGR22 three years on: gap-to-tool map is a useful companion.
Fairness principles
Before you open any generator, set fairness design principles. They act like guardrails when the tool is producing large amounts of text quickly.
Start with coverage. Decide which content and abilities you are sampling, and which you are explicitly not sampling yet. A fair test is often narrower than we think, because it is honest about what it can evidence.
Then accessibility. Reduce construct-irrelevant barriers: overly dense language in a maths paper, culturally narrow examples in history, or a reading load that turns a content check into a decoding test. Provide reasonable scaffolds, but keep an eye on over-scaffolding, where the support starts doing the thinking.
Add bias checks. Ask whether names, contexts, or assumed background knowledge advantage some pupils. AI is good at generating “neutral” contexts, but it can also recycle stereotypes unless you prompt carefully.
Finally, transparency. Pupils should understand what quality looks like. That does not mean giving away answers; it means making the success criteria legible. If you are tightening whole-school routines around this, it is worth aligning with your AI policy refresh so staff are consistent about what is allowed and how outputs are checked, as outlined in Annual AI acceptable use policy refresh.
Answer key workflow
Here is the core workflow for pre-test season: paste the relevant History betygskriterier into an Answer Key tool, then generate (1) E/C/A-targeted questions, (2) three-tier model answers, and (3) a justification checklist mapped directly to the criteria.
In practice, you might paste the betygskriterier for the ability you are assessing (for example, using historical concepts, reasoning about causes and consequences, or evaluating sources). Then you instruct the tool to produce three model answers to the same question: one that clearly meets E, one that clearly meets C, and one that clearly meets A. The point is not to create “perfect” scripts for pupils to copy. The point is to make the differences visible to you and your team: what changes between a response that is “in broad terms functioning” and one that is “well functioning”?
The justification checklist is the real safeguard. For each model answer, the tool should list which phrases or moves provide evidence for E, C, or A, using the language of the betygskriterier. When you later mark pupils’ work, you can annotate with the same checklist. This reduces the chance that you unconsciously reward length, confidence, or your favourite facts.
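If your Answer Key tool accepts a free-text instruction block, a small template keeps all three outputs tied to the pasted criteria rather than the model's own ideas about quality. A minimal sketch in Python, assuming nothing about any particular tool or API: the function only assembles the prompt text, and the delimiters, function name, and example question are illustrative.

```python
def build_answer_key_prompt(criteria: str, question: str) -> str:
    """Assemble one instruction block for an Answer Key tool.

    Nothing here calls an API: `criteria` is whatever betygskriterier
    text you paste, and the function only joins the pieces together.
    """
    return "\n".join([
        "Use ONLY the assessment criteria below; do not invent criteria.",
        "--- CRITERIA ---",
        criteria.strip(),
        "--- TASK ---",
        f"Question: {question.strip()}",
        "Write three model answers to this question:",
        "one that clearly meets E, one that clearly meets C, one that clearly meets A.",
        "Then write a justification checklist: for each model answer,",
        "list the phrases or moves that evidence E, C or A,",
        "quoting the exact wording of the criteria above.",
    ])


# Placeholder example; paste your own criteria and question.
print(build_answer_key_prompt(
    "[paste the betygskriterier for the ability you are assessing]",
    "Explain one cause and one consequence of the Kalmar Union.",
))
```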
If you are concerned about compliance and documentation expectations when using AI with pupil-facing materials, keep an eye on EU AI Act and Swedish schools for a practical framing of risk, transparency, and human oversight.
Example 1: Quiz generator
Imagine Year 5 History on Medieval Nordic societies. You want a mixed quiz that samples knowledge, concepts, and simple reasoning, without turning into a reading endurance task. Ask the Quiz Generator for 12 questions with a balance of multiple choice, short answer, and one extended response. Crucially, request an E/C/A target label for each item so you can see the intended demand.
A workable set might include a couple of retrieval items on key terms (E-targeted), short explanations of roles in society or trade routes (E/C), and one item asking pupils to compare two sources or accounts at a basic level (C/A depending on the prompt). Your extended response could ask pupils to explain one change over time and give reasons, with an optional stretch prompt for evaluating which reason mattered most. When you mark it, you use the justification checklist to decide whether the reasoning is present and relevant, not whether the pupil wrote the longest paragraph.
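Writing the intended balance down as data before you generate makes the coverage decision explicit and lets you audit the tool's output against it. A minimal sketch; the item mix, target labels, and sampled content below are illustrative, not a required split.

```python
# A quiz blueprint you can paste into a Quiz Generator prompt, or use to
# audit its output afterwards. The exact balance below is illustrative.
QUIZ_BLUEPRINT = [
    # (item type, count, target level, what it samples)
    ("multiple choice",   4, "E",   "retrieval of key terms"),
    ("short answer",      4, "E/C", "roles in society and trade routes"),
    ("short answer",      2, "C",   "simple reasoning about change"),
    ("source comparison", 1, "C/A", "comparing two accounts at a basic level"),
    ("extended response", 1, "C/A", "one change over time, with reasons"),
]

def blueprint_summary(blueprint):
    """Confirm the blueprint really holds 12 items and show the demand spread."""
    total = sum(count for _, count, _, _ in blueprint)
    rows = [f"{count} x {kind} ({level}): {samples}"
            for kind, count, level, samples in blueprint]
    return f"{total} items\n" + "\n".join(rows)

print(blueprint_summary(QUIZ_BLUEPRINT))
```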
Example 2: Parallel maths problems
For Year 8 Maths on percentages, AI shines when you ask it to generate parallel problems with controlled variation. You can produce a short set where the surface context changes but the method stays stable, which supports fairness and reduces copying.
Ask for problems at three difficulty tiers, but keep the same core methods: finding a percentage of an amount, percentage change, and reverse percentages. Then require full worked solutions plus method descriptors written in LGR22-style language. For example, an E-aligned method descriptor might emphasise a correct set-up with occasional slips and a mostly functioning strategy. A C-aligned descriptor might describe a secure, efficient method with correct representations. An A-aligned descriptor might add well-chosen methods, clear justification, and the ability to check reasonableness.
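Because implausible data is a real risk, one cautious pattern is to generate the numbers and worked solutions yourself and let AI vary only the wording. A minimal sketch for the reverse-percentages method; the contexts, price lists, and seed are placeholders, and the other two methods would follow the same shape.

```python
import random

# Surface contexts vary; the underlying method stays fixed, so parallel
# versions are comparable. All contexts and numbers are illustrative.
CONTEXTS = ["a jacket", "a bus pass", "a concert ticket", "a phone plan"]

def reverse_percentage_problem(rng: random.Random):
    """Reverse percentage: given the price AFTER a discount, find the original."""
    item = rng.choice(CONTEXTS)
    rate = rng.choice([10, 15, 20, 25])       # discount in percent
    original = rng.choice([40, 60, 80, 120])  # original price in kr
    sale = original * (100 - rate) / 100
    question = (f"After a {rate}% discount, {item} costs {sale:.2f} kr. "
                f"What was the original price?")
    solution = (f"The sale price is {100 - rate}% of the original, so "
                f"original = {sale:.2f} / {(100 - rate) / 100} = {original:.2f} kr.")
    return question, solution

rng = random.Random(22)  # fixed seed: each pupil's set is reproducible
for n in range(3):
    q, s = reverse_percentage_problem(rng)
    print(f"Version {n + 1}: {q}\n  Worked solution: {s}\n")
```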
When you discuss feedback with pupils, you can point to method quality rather than “got it right”. That keeps the assessment focused on mathematical proficiency, not just final answers.
Example 3: Reading comprehension
For Year 9 Swedish, you might want a longer expository text (around 1,200 words) that supports critical reading without relying on niche background knowledge. Use a Reading Comprehension tool to generate a text with a clear structure, subheadings, and a balanced register. Then generate questions that deliberately separate literal understanding from interpretation and evaluation.
To support evidence over time, include prompts that ask pupils to cite and explain evidence. For example, one question can require locating the author’s main claim and two supporting reasons (E/C), while a higher-demand question asks pupils to evaluate the strength of one reason and discuss what would improve it (C/A). Add “evidence prompts” under each question, such as “Quote or paraphrase one sentence that supports your answer”, so you are collecting traceable evidence rather than impressions. If you are also running revision cycles, you can connect these tasks to retrieval and integrity routines similar to those in Mock exam revision ops.
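Pairing each question with its evidence prompt at authoring time makes the traceability automatic rather than something you remember at marking. A minimal sketch; the levels and wording are illustrative.

```python
# Pair each comprehension question with a traceable evidence prompt, so marking
# records what the pupil pointed to rather than an impression.
QUESTIONS = [
    ("E/C", "Identify the author's main claim and two supporting reasons."),
    ("C/A", "Evaluate the strength of one reason. What would make it stronger?"),
]

EVIDENCE_PROMPT = "Quote or paraphrase one sentence from the text that supports your answer."

for level, question in QUESTIONS:
    print(f"[{level}] {question}\n    Evidence: {EVIDENCE_PROMPT}\n")
```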
Store and moderate evidence
To avoid “one test decides the grade”, store evidence lightly but deliberately. A simple portfolio plan can be one folder per pupil (digital or paper) with three labelled samples per ability area across the term: one early, one mid, one late. Each sample should include the task, the pupil response, and a brief teacher sign-off noting what the evidence shows against E/C/A. You do not need to keep everything; you need a defensible sample.
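If the portfolio is digital, a light naming convention makes the three-samples-per-ability rule checkable at a glance. A minimal sketch, assuming one folder per pupil and file stems of the form ability_label; the folder layout and labels are illustrative.

```python
from pathlib import Path

# Labels are illustrative: one sample early, mid and late in the term,
# per ability area, stored as e.g. "sources_early.pdf" in the pupil folder.
SAMPLE_LABELS = ["early", "mid", "late"]

def expected_samples(ability_areas):
    """File stems one pupil folder should eventually contain."""
    return [f"{area}_{label}" for area in ability_areas for label in SAMPLE_LABELS]

def missing_samples(pupil_folder: Path, ability_areas):
    """Report which labelled samples are not yet on file for this pupil."""
    have = {p.stem for p in pupil_folder.glob("*") if p.is_file()}
    return [s for s in expected_samples(ability_areas) if s not in have]

# Hypothetical use: missing_samples(Path("evidence/pupil_017"), ["sources", "reasoning"])
```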
Moderation becomes easier when everyone uses the same justification checklist language. In a short department meeting, compare two anonymised scripts and ask, “Which checklist items are evidenced here?” rather than “What grade is this?” That small shift reduces drift and builds shared interpretations.
Failure modes and QA
The most common failure mode is criterion drift: the task starts assessing something adjacent because the AI produced a more interesting prompt than you asked for. The second is over-scaffolding: sentence starters and hints that turn an A-quality response into a fill-in-the-blanks exercise. The third is hallucination: incorrect facts in history texts, implausible data in maths, or invented quotations in reading passages.
A simple QA checklist helps: verify factual accuracy; check reading load and language level; confirm each question maps to a specific criterion; ensure model answers genuinely differ in quality, not just length; and run a bias and accessibility scan. If you want a deeper look at classroom evaluation and integrity when using modern models, Claude 4 deep dive offers practical considerations that translate well across tools.
Prompt patterns and run sheet
Keep prompts to the minimum data needed: never paste pupil names or other sensitive information. The betygskriterier, the topic, the year group, and your constraints are enough.
Useful patterns include: “Create 12 questions, label each E/C/A target, and keep language accessible”; “Generate three model answers (E/C/A) to the same prompt and annotate which phrases evidence which criterion”; and “Produce a justification checklist using the exact wording of the pasted criteria.”
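Kept as fill-in templates, these patterns stay minimum-data by construction: only the placeholders you deliberately fill ever reach the tool. A minimal sketch; the template names and wording are illustrative.

```python
# Fill-in templates: only the placeholders in braces ever carry your data.
PATTERNS = {
    "quiz": ("Create {n} questions on {topic} for year {year}. "
             "Label each question's E/C/A target and keep language accessible."),
    "model_answers": ("Generate three model answers (E/C/A) to the same prompt and "
                      "annotate which phrases evidence which criterion.\n"
                      "Criteria:\n{criteria}"),
    "checklist": ("Produce a justification checklist using the exact wording of "
                  "the pasted criteria.\nCriteria:\n{criteria}"),
}

print(PATTERNS["quiz"].format(n=12, topic="Medieval Nordic societies", year=5))
```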
A one-page run sheet for building an assessment can be: paste criteria; define the sampled content and abilities; generate questions; generate three-tier model answers; generate the justification checklist; run QA checks; pilot two questions with a small group; revise; then schedule a follow-up mini-task two weeks later to add evidence over time.
To steady your planning across Years 7–9, it can help to borrow the “late improvement” logic from writing instruction, where you deliberately plan a second attempt after feedback. The structure in GY25–LGR22 bridge guide is a helpful model for that kind of planned re-evidence, even beyond writing.
May your pre-test season bring clearer evidence and calmer judgement calls.
The Automated Education Team