What Research Says About AI Tutoring

A research briefing that takes school leaders beyond the marketing claims


Why AI tutoring is back

AI tutoring is enjoying a powerful second act. The idea of computers providing one-to-one support is not new; intelligent tutoring systems have been studied since the 1980s. What has changed is the arrival of large language models (LLMs) that can hold fluid conversations, generate explanations on demand and adapt to student responses in real time.

Vendors now promise “a tutor for every child”, often citing Benjamin Bloom’s classic research on the “2 sigma problem” – the finding that one-to-one tutoring can dramatically outperform typical classroom teaching. Understandably, school leaders want to know whether today’s AI tutors actually deliver anything close to those gains, or simply add another layer of complexity and cost.

This article synthesises what robust studies tell us so far, focusing on learning outcomes, equity and the implementation conditions that matter. For a more general overview of AI’s trajectory in schools, you may also find our State of AI in UK Education (September 2024) briefing useful.

What we mean by ‘AI tutoring’

Definitions matter because “AI tutor” is used to describe very different tools. In the research literature, three broad types appear:

First, there are classic intelligent tutoring systems (ITS). These are structured platforms, often subject-specific, that present problems, analyse responses and offer step-by-step feedback. Their “intelligence” is usually rule-based and tightly constrained to a curriculum.

Second, we have LLM-based conversational tutors such as Khan Academy’s Khanmigo. These tools use general-purpose language models, wrapped in educational safeguards, to simulate a human-like dialogue: asking probing questions, explaining concepts and supporting metacognition.

Third, there are AI-enhanced practice apps. Duolingo’s AI features fall largely into this category. The core platform remains a structured practice environment, but AI is used to adapt difficulty, generate feedback, create exercises or simulate communicative tasks.

When you read claims about “AI tutoring”, it is crucial to ask: is this a fully fledged tutor that guides a learning sequence, or an assistant layered on top of existing materials? Research findings are not interchangeable across these categories.

What the research says overall

The most robust evidence still comes from pre-LLM intelligent tutoring systems, especially in mathematics and science. Meta-analyses of these systems typically report moderate positive effects on learning, with average effect sizes around 0.3–0.4 standard deviations compared with business-as-usual teaching. That roughly equates to several months of additional progress over a school year, though estimates vary by study quality and context.
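For leaders less familiar with effect sizes, the figure of 0.3–0.4 standard deviations is a standardised mean difference (often called Cohen's d). A minimal sketch of how it is computed from two groups' test scores is shown below; the class names and scores are invented purely for illustration, not drawn from any study cited here.

```python
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Standardised mean difference between two groups,
    using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    # Pooled SD weights each group's variance by its degrees of freedom
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical end-of-term scores for a class using an AI tutor
# and a comparison class taught as usual
ai_tutor_class = [71, 74, 68, 72, 70, 73, 69, 75]
control_class  = [70, 73, 67, 71, 69, 72, 68, 74]

print(round(cohens_d(ai_tutor_class, control_class), 2))  # prints 0.41
```

An effect in this range means the average pupil in the AI-tutored group scores higher than roughly two-thirds of the comparison group – meaningful, but well short of the "2 sigma" benefit attributed to expert human tutoring.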

The pattern is fairly consistent:

  • Subjects: Mathematics and some STEM subjects show the strongest and most reliable gains. Reading comprehension and writing are more mixed, partly because they rely heavily on open-ended language use that earlier systems struggled with.
  • Age groups: Upper primary and lower secondary pupils often benefit most. Very young learners require more scaffolding and behaviour management than current tools can reliably provide, while older students may use AI more for homework support than structured learning.
  • Outcomes: AI tutors tend to be particularly effective for procedural fluency and near-transfer tasks (similar problems to those practised). Effects on deeper conceptual understanding and far transfer are positive but smaller and less consistently measured.

The evidence base for LLM-based tutors is newer and less extensive. Early trials suggest they can achieve similar or slightly higher gains than older ITS tools under controlled conditions, especially when teachers are actively involved. However, we do not yet have large-scale, multi-year studies showing sustained impact at system level.

For context on when AI helps versus hinders learning more broadly, you may find our When AI Helps vs. Harms Learning analysis a helpful companion.

Case study 1: Khanmigo

Khan Academy’s Khanmigo is one of the most visible LLM-based tutors in schools. It uses a large language model constrained by prompts and guardrails to act as a Socratic tutor, encouraging pupils to explain their thinking rather than simply giving answers.

Early evaluation work, mostly from pilot districts, points to several patterns:

Khanmigo appears particularly promising in mathematics problem-solving and coding, where the structured nature of tasks makes it easier to guide pupils through step-by-step reasoning. Small randomised or quasi-experimental studies have reported modest but statistically significant improvements in maths performance over a term or semester, especially when pupils use Khanmigo regularly and teachers integrate it into lesson plans rather than treating it as optional enrichment.

Teachers in these pilots often report improved engagement and more productive help-seeking behaviour. Pupils who might otherwise sit stuck on a problem can get immediate prompts or hints that nudge them forward. Importantly, teachers value the way Khanmigo can model metacognitive strategies, such as checking work or breaking problems into sub-steps.

However, there are important caveats. Impact tends to drop sharply when usage is sporadic, when pupils are left entirely to self-manage, or when teachers are not trained to interpret Khanmigo’s data and adjust instruction. Some studies also note that while the system is designed not to give direct answers, determined pupils can still “game” it, especially outside supervised settings.

Overall, the emerging picture is that Khanmigo can support meaningful gains when embedded into a well-planned mathematics or computing curriculum, with clear routines and adult oversight. It is not a plug-and-play replacement for teacher-led explanation or guided practice.

Case study 2: Duolingo’s AI features

Duolingo occupies a different space: language learning rather than core curriculum subjects, and primarily outside formal timetabled lessons. Its AI features include adaptive difficulty, personalised review schedules and, more recently, conversational practice powered by language models.

The broader research on Duolingo and similar apps shows that:

Learners who use the platform regularly achieve measurable gains in vocabulary and basic grammar, often comparable to traditional beginner courses, especially when starting from scratch. The spaced repetition algorithm is particularly effective for retention of word forms and simple phrases.

AI-generated feedback on pronunciation and grammar can accelerate error correction, though it is not flawless. Studies comparing automated feedback with human teacher feedback suggest that AI works well for high-frequency, rule-governed errors, but struggles with nuance, pragmatics and culturally appropriate language use.

More experimental features, such as AI-driven role-play conversations, are still being evaluated. Early findings indicate that they boost confidence and willingness to speak, but evidence that they translate into real-world communicative competence is still limited.

For school leaders, the key takeaway is that AI-enhanced language apps can play a valuable supplementary role, particularly for homework, independent practice and summer continuity. They are less suited as the sole vehicle for language instruction, where exposure to rich, authentic language and human interaction remains critical.

Beyond the big brands

Looking beyond headline products, several large-scale trials and meta-analyses of AI tutors offer useful guidance.

Randomised controlled trials of maths and science ITS in diverse school systems have repeatedly shown that well-designed AI tutors can deliver gains equivalent to roughly an extra term of learning over a full year, especially for pupils who start behind. Some systems have demonstrated positive effects across hundreds of schools, suggesting that scale is possible when implementation is well supported.

At the same time, meta-analyses highlight large variation. Tools that are tightly aligned to curriculum standards, provide immediate targeted feedback and include clear teacher dashboards tend to outperform more generic systems. Programmes that assume pupils will self-regulate, or that provide little guidance to teachers on integration, show much weaker or inconsistent effects.

One consistent finding is that AI tutors are most effective as part of a broader instructional model, not as stand-alone solutions. When teachers use AI data to identify misconceptions, adjust grouping and target small-group instruction, gains are larger than when AI is treated as a separate “lab” activity.

Equity, access and unintended effects

Equity cuts both ways with AI tutoring. On the positive side, several studies suggest that lower-attaining pupils can benefit disproportionately from AI tutors, because they receive immediate, patient feedback that would be hard to provide in a busy classroom. In some trials, achievement gaps narrowed when access and support were carefully structured.

However, there are also risks. Access gaps are obvious: pupils without reliable devices or connectivity cannot benefit from AI-based homework or holiday programmes. Our briefing on Summer Learning Loss and AI Tutors explores how this plays out during breaks in schooling.

Less visible are differential usage patterns. In some pilots, more advantaged pupils used AI tutors more consistently and for longer periods, even when access was nominally equal. Without deliberate strategies, AI can end up amplifying existing inequalities.

There are also concerns about over-reliance. Some pupils begin to consult AI at the slightest difficulty, reducing productive struggle and independent problem-solving. Others learn to exploit loopholes to obtain answers rather than support. These behaviours can quietly erode the very learning gains AI is supposed to deliver.


Implementation conditions that matter

Across studies and case reports, several conditions recur when AI tutoring works well in real schools.

First, leadership clarity is critical. Schools that define specific goals for AI tutors – for example, closing gaps in algebra for Year 8, or supporting new arrivals with language practice – see better results than those that adopt tools in a general spirit of innovation.

Second, structured routines make a difference. Regular, timetabled use in class or targeted intervention slots yields stronger outcomes than ad hoc or purely voluntary use. Teachers need to know when and how pupils will use the tool, and what they will do with the resulting data.

Third, teacher mediation is non-negotiable. Effective implementations train teachers not only in the mechanics of the platform, but in interpreting dashboards, spotting unproductive usage, and weaving AI-supported practice into whole-class instruction. AI that sits “off to the side” rarely changes learning at scale.

Finally, safeguarding and data protection must be addressed up front. Clear policies on chat logging, content filters, data storage and parental communication are essential, especially when pupils interact conversationally with an AI system. Our September AI Readiness Checklist offers a useful framework for these governance questions.

Decision guide for school leaders

Translating the research into action, several decision rules emerge for when, where and how to deploy AI tutoring.

Use AI tutors when there is a clearly defined learning need that aligns with the tool’s strengths: for example, mathematics practice, language vocabulary and grammar, or specific exam-preparation tasks. They are particularly useful for targeted catch-up, structured homework and summer continuity, provided access is equitable.

Avoid positioning AI tutors as replacements for teacher explanation, relationship-building or formative assessment. The evidence supports AI as a complement, not a substitute. Be wary of claims that suggest otherwise.

Prioritise tools that offer curriculum alignment, transparent data and strong teacher controls over those that are simply impressive demonstrations of conversational ability. Ask vendors to share independent evaluations, not just internal case studies, and scrutinise the populations and contexts in which those studies were conducted.

Plan for sustained implementation, not short-term pilots. Most positive studies involve regular use over at least a term, with ongoing support for teachers. If you cannot commit to that level of integration, temper expectations accordingly.

Practical next steps and vendor questions

For leaders considering AI tutoring, three practical steps can keep decisions grounded in evidence.

First, clarify your problem statement. Are you trying to reduce maths attainment gaps, support language learners, extend challenge for high attainers, or maintain learning over holidays? Different goals point to different tools and implementation models.

Second, run a small but rigorous trial. Identify a limited number of classes, define success metrics in advance, and compare outcomes with similar groups not using the tool. Pay attention not only to test scores, but to usage patterns, teacher workload and pupil wellbeing.

Third, prepare a set of questions for vendors, such as:

  • What independent research (not commissioned marketing studies) supports your claims, and in which subjects, age groups and contexts?
  • How does your tool align with our curriculum and assessment practices?
  • What teacher dashboards and controls are available, and how do they support instructional decision-making?
  • How do you address data protection, safeguarding and content moderation, especially in open-ended chat?
  • What training and ongoing support do you provide for staff, and what level of usage is needed to see typical gains?

By approaching AI tutoring as a targeted intervention, grounded in research and shaped by local context, school leaders can move beyond hype and towards thoughtful, equitable deployment.

Best wishes!
The Automated Education Team
