AI Detection Accuracy: The Evidence

What the research really says about Turnitin, GPTZero and fair assessment

Why accuracy matters now

AI writing detectors have arrived in schools faster than the policies, training plans and ethical frameworks needed to govern them. Many platforms now include an “AI percentage” next to similarity scores, and tools like GPTZero are only a browser tab away. Under pressure to respond to generative AI, some institutions are quietly treating these scores as de facto proof of misconduct.

This is risky. Independent studies show that AI detectors can be wrong in ways that are not random. They are more likely to mislabel certain students’ work as AI‑generated, particularly non‑native writers and those using simpler vocabulary. That has real consequences: stress, damaged trust, formal investigations and, in some systems, serious disciplinary records.

Understanding what the evidence actually says is therefore not a technical luxury; it is a safeguarding issue. If schools are to uphold fairness and academic integrity, they need to know where detectors work, where they fail, and how to make assessment decisions that do not rely on false certainty.

How AI detectors work

Most AI writing detectors use statistical patterns in language rather than “recognising” ChatGPT directly. In simple terms, they ask: “How predictable is this text, given what we know about AI and human writing?”

Two concepts appear often in the research:

Perplexity describes how surprising each word is in context. Large language models, like those behind ChatGPT, tend to produce text with relatively low perplexity: fluent, smooth and statistically predictable. Human writing, especially from less experienced writers, can be more uneven and surprising.

Burstiness refers to variation in sentence length and structure. Human writers often mix short, sharp sentences with longer, more complex ones. AI‑generated text can be more uniform, although this is changing as models improve.
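
To make these two ideas concrete, here is a toy calculation of burstiness as variation in sentence length. This is a deliberately simplified sketch for illustration only; commercial detectors compute their signals very differently and do not publish exact formulas.

```python
# Toy "burstiness" measure: how uneven are sentence lengths?
# A simplified illustration only -- not how any commercial detector works.
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths, in words.
    Higher = more uneven, 'human-like' rhythm; near 0 = very uniform."""
    for mark in "!?":
        text = text.replace(mark, ".")
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

human = "I ran. The rain came down harder than anyone expected that night. We stopped."
uniform = "The cat sat on the mat. The dog lay on the rug. The bird sat in the cage."
print(round(burstiness(human), 2))    # ~0.99: sentence lengths 2, 10, 2
print(round(burstiness(uniform), 2))  # 0.0: every sentence is 6 words
```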

Detectors are trained on samples of “known AI” and “known human” writing. They then learn patterns that distinguish the two and output a probability or score. However, they are only as good as their training data and assumptions. If the tools are trained mostly on native‑speaker university essays and early generations of AI models, they may struggle with school‑age writing, multilingual students and newer AI systems.
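
As a rough sketch of that training step, the snippet below fits a simple classifier on invented stylistic features. The features, numbers and labels are all assumptions chosen for illustration; real detectors use far richer signals, but the structural point stands: the output is a probability shaped entirely by the training data.

```python
# Minimal sketch of the detector training idea, with invented data.
# Real detectors use far richer features (e.g. model-based perplexity).
from sklearn.linear_model import LogisticRegression

# Each sample: [burstiness, average word length] -- assumed features.
X = [
    [0.15, 5.1], [0.20, 5.3], [0.10, 5.0],  # uniform, fluent -> labelled "AI"
    [0.90, 4.2], [0.75, 4.8], [1.10, 4.0],  # uneven, varied  -> labelled "human"
]
y = [1, 1, 1, 0, 0, 0]  # 1 = AI, 0 = human

clf = LogisticRegression().fit(X, y)

# A genuinely human text with a uniform, simple style lands on the
# "AI" side of the boundary -- the core false-positive problem.
print(clf.predict_proba([[0.25, 5.0]])[0][1])  # probability of "AI"
```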

Evidence on Turnitin’s AI detection

Turnitin’s AI detection is widely used because it is bundled with plagiarism checking. The company has published its own validation claims, but independent evaluations tell a more cautious story.

Studies that tested Turnitin on sets of purely AI‑generated and purely human texts often found reasonably high accuracy when conditions were simple. When the task was “Can Turnitin spot a block of unedited ChatGPT text?”, it performed relatively well, especially with longer samples.

However, accuracy dropped in more realistic scenarios. Research examining texts where AI‑generated passages were lightly edited, mixed with human paragraphs, or produced by newer models reported higher error rates. Short answers, reflective pieces and creative writing were particularly problematic because the statistical patterns differ from the academic prose detectors expect.

Of most concern are reports that Turnitin has produced false positives for genuine student work. In some case studies, multilingual students’ essays were flagged at high AI percentages despite being written under supervised conditions. Turnitin itself warns that its AI scores should not be used as sole evidence of misconduct, yet institutional practices do not always follow that guidance.

Evidence on GPTZero and similar tools

GPTZero and similar standalone detectors (such as Originality.ai, Copyleaks’ AI features and others) use related underlying ideas, with their own training data and thresholds. Independent tests have repeatedly shown three broad patterns.

First, detectors can often distinguish between large blocks of unedited AI text and typical adult human writing, especially in English, when the samples are long. In these simple cases, accuracy can look impressive.

Second, they struggle with nuance. When human writers simplify their style, or when AI text is heavily edited, paraphrased or combined with human writing, accuracy drops sharply. Some studies have shown detectors assigning high “AI” probabilities to texts written by secondary or undergraduate students, simply because their language is more predictable or formulaic.

Third, detectors do not generalise well. A tool tuned on one AI model (for example, an older version of GPT) may perform poorly on text from a newer model or from different languages and genres. This means that any accuracy figure is a moving target. A detector that worked acceptably last term may be far less reliable once students start using updated AI tools or translation systems.

False positives, bias and impact

False positives are not just technical errors; they are equity issues. Several studies and media investigations have shown that AI detectors are more likely to misclassify texts by non‑native English writers as AI‑generated. When students use straightforward vocabulary, repetitive structures or translation tools, their writing can resemble the “low perplexity” patterns detectors associate with AI.

High‑performing students can also be at risk. A well‑structured, polished essay written by a diligent learner who has practised extensively may look “too good” compared with their earlier work. If staff rely heavily on detector scores, such students can find themselves defending their integrity precisely because they have improved.

The impact on learners can be significant: anxiety, reluctance to experiment with language, and a perception that the system is stacked against them. For multilingual students already navigating additional barriers, an unjust accusation can be particularly damaging. From a safeguarding and inclusion perspective, any tool with documented bias deserves extreme caution.

Why mixed texts confuse detectors

Detectors are built on the assumption that a text is either human or AI. Classroom reality is rarely so neat. Students might:

  • Draft with AI, then rewrite heavily in their own words
  • Use AI to generate only an outline, example or paragraph
  • Translate their own ideas with machine translation and then edit

In these mixed cases, detectors receive conflicting signals. Some sentences look “AI‑like”; others look more human. Different tools react differently: some label the whole piece “likely AI”, others highlight individual sentences. None can tell you who had the idea, who chose the structure, or how much cognitive effort the student invested.
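
The snippet below shows why the same piece can receive contradictory verdicts. The per‑sentence scores are invented, and the three aggregation rules are simplified stand‑ins for the different design choices real tools make.

```python
# Same invented per-sentence "AI" probabilities, three plausible
# document-level rules -- and contradictory verdicts.
scores = [0.92, 0.15, 0.88, 0.10, 0.20]

mean_rule = sum(scores) / len(scores) > 0.5          # average over the document
any_rule = any(s > 0.8 for s in scores)              # flag if any sentence looks AI
majority_rule = sum(s > 0.5 for s in scores) > len(scores) / 2

print(mean_rule)      # False -> "likely human"
print(any_rule)       # True  -> "contains AI"
print(majority_rule)  # False -> "likely human"
```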

As AI models become more diverse and students learn to edit and blend outputs, the statistical fingerprints detectors rely on become even less stable. This is why many researchers argue that, over time, detection will only get harder, not easier.

Interpreting AI scores carefully

A key message from the research is that AI scores are not evidence of cheating; they are, at best, weak indicators that deserve context. A 90% AI score does not prove that 90% of a piece is AI‑generated, and a 0% score certainly does not prove originality.
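
Base rates make the point sharper. The worked example below uses illustrative numbers, not measured rates for any real product, but it shows how even a seemingly accurate detector produces a large share of false accusations when genuine AI use is uncommon.

```python
# Illustrative base-rate arithmetic; all figures are assumptions.
sensitivity = 0.90          # share of AI texts correctly flagged
false_positive_rate = 0.05  # share of human texts wrongly flagged
prevalence = 0.10           # assume 1 in 10 submissions is actually AI

true_flags = sensitivity * prevalence
false_flags = false_positive_rate * (1 - prevalence)

# Chance that a flagged submission really is AI (positive predictive value)
ppv = true_flags / (true_flags + false_flags)
print(round(ppv, 2))  # 0.67 -> roughly one in three flagged students is innocent
```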

Educators should resist the temptation to treat these numbers as diagnostic. At most, they can be one small signal among many: writing style compared with previous work, process evidence such as drafts and notes, and the conditions under which the work was produced.

If you would not be comfortable defending an academic misconduct decision based solely on a spellchecker or grammar checker, you should not do so based on an AI detector. The standard for evidence should be consistent and transparent.

When to avoid detectors and when to use them

Given the evidence, there are clear situations where detectors should be avoided entirely. High‑stakes decisions about progression, graduation or serious sanctions should never rest on AI detection scores. This is especially true in contexts with many multilingual learners or where students have limited access to appeals.

Detectors are also ill‑suited to formative work, creative writing and early‑stage language learning. In these spaces, the risk of chilling students’ confidence outweighs any potential benefit.

If your institution chooses to use detectors at all, they should be used cautiously and with safeguards. That might mean limiting them to internal, advisory use by staff, never sharing raw scores with students, and always combining them with other evidence, such as in‑class writing and oral checks of understanding. Clear protocols can help here, much like those used for interpreting similarity scores in plagiarism reports.

For more on designing tasks where detectors become less central, see our guide on AI‑resilient assessment design, which focuses on task structures and process evidence rather than policing tools.

Building fair workflows

Fairness depends less on the tools you adopt and more on the workflows around them. If a detector raises a concern, staff should have a standard, transparent process. That might include reviewing the student’s previous work, inviting them to discuss their process, and asking them to explain key sections orally or through a brief in‑class task.

Documentation is essential. Record not just the detector score, but the additional evidence you considered and the reasoning behind any decision. This protects both students and staff and helps ensure that similar cases are treated consistently.
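
For teams that track cases digitally, a record might capture fields like the following. This is only a sketch; the field names are suggestions rather than any prescribed standard.

```python
# A minimal case-record sketch; field names are illustrative suggestions.
from dataclasses import dataclass, field

@dataclass
class IntegrityCaseRecord:
    student_id: str
    assignment: str
    detector_score: float            # recorded, but never decisive on its own
    prior_work_reviewed: bool
    student_discussion_notes: str
    additional_evidence: list[str] = field(default_factory=list)
    decision: str = ""
    reasoning: str = ""              # why the decision was reached

record = IntegrityCaseRecord(
    student_id="S-1042",
    assignment="History essay, term 2",
    detector_score=0.87,
    prior_work_reviewed=True,
    student_discussion_notes="Explained structure and sources confidently.",
    additional_evidence=["draft history", "in-class writing sample"],
    decision="no further action",
    reasoning="Process evidence consistent with independent authorship.",
)
```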

Policies also need updating. Many existing academic integrity policies pre‑date generative AI and say nothing about detectors. Institutions should explicitly state how, if at all, AI detection tools will be used, their limitations, and students’ rights to respond. Our guide on creating an AI acceptable use policy offers templates and discussion points that can be adapted to different contexts.

Most importantly, talk openly with students. Explain what detectors can and cannot do, why you are cautious about them, and what counts as acceptable AI support. This builds trust and reduces the sense that AI is a secret trap.

Moving beyond policing

The research on AI detection accuracy points to a clear conclusion: relying on detectors as our main defence against misconduct is neither fair nor sustainable. Instead, schools need to shift towards assessment designs and classroom habits that make dishonest use of AI less attractive and easier to detect through normal pedagogical practice.

This might involve more in‑class writing, oral defences of projects, iterative drafts with feedback, and tasks that connect to personal experience or local contexts. These approaches not only reduce opportunities for unacknowledged AI use but also strengthen learning. Our articles on why AI use is not automatically cheating and on how students actually cheat with AI explore this balance between integrity and innovation in more depth.

Ultimately, the goal is not to catch students out but to help them learn to use AI responsibly as part of their toolkit. That means teaching citation of AI assistance, discussing ethical boundaries, and designing work where the thinking process is visible, not just the final product.

AI detectors may have a limited role as one signal among many, but the evidence is clear: they are too inaccurate and too biased to serve as judges. Educators, supported by thoughtful policy and assessment design, remain the best interpreters of student work.

Happy assessing!
The Automated Education Team
