Large Language Models for Thematic Analysis in Healthcare Research: A Blinded Mixed-Methods Comparison with Human Analysts

Hill, C.; Dahil, A.; Simpson, G.; Hardisty, D.; Keast, J.; Pinn, C. K.; Dambha-Miller, H.

2025-12-29 · health informatics
DOI: 10.64898/2025.12.25.25343031
Abstract

Large language models (LLMs) are increasingly used for qualitative thematic analysis, yet evidence on their performance in analysing focus-group data, where polyvocality and context complicate coding, remains limited. Given the growing role of such models in thematic analysis, methodological frameworks are needed that enable systematic, metric-based comparisons between human and model-based analyses. We conducted a blinded mixed-methods comparison of two general-purpose LLMs (ChatGPT-5 and Claude 4 Sonnet), an LLM-based qualitative coding application (QualiGPT), and blinded human analysts on an in-person focus-group transcript informing an AI-enabled digital health proposal. We evaluated deductive coding using a 10-code, 6-theme codebook against an expert consensus adjudication; inductive coding with a structured Likert-scale comparison to a reference-standard set of inductive themes generated by expert consensus; and manual quote verification of LLM segments to define LLM hallucination (evidence absent or non-supportive) and error rate (including partial matches and speaker-coded segments). In deductive coding against the expert consensus adjudication, LLMs yielded a mean agreement of 93.5% (95% CI 92.5-94.5) with κ = 0.34 (95% CI 0.26-0.40); blinded human coders achieved 92.7% (95% CI 91.6-93.9) agreement with κ = 0.34 (95% CI 0.26-0.41). Mean Gwet's AC1 was 0.92 (95% CI 0.90-0.93) for the blinded human analysis and 0.93 (95% CI 0.92-0.94) for the LLM-assisted deductive analysis, reflecting high agreement despite the low overall code prevalence (7.8%, SD = 3.2%). Only one model achieved non-inferiority in inductive analysis of the transcript (p = 0.043). The strict hallucination rate in inductive analysis was 1.2% (SD = 2.1%). LLMs were non-inferior to human analysts for deductive coding of the focus-group data, with variable performance in inductive analysis. Low hallucination rates but substantial overall error rates indicate that LLMs can augment qualitative analysis but require human verification.

Author summary

Qualitative research plays an important role in digital health, assisting in the implementation of healthcare technologies and innovations. However, analysing qualitative data from focus groups is time-consuming and requires human expertise. Large language models (LLMs) are increasingly used in qualitative research analysis, although evidence on their performance in analysing focus-group data is limited. We compared the performance of LLMs with that of blinded human analysts in analysing a focus-group transcript on AI implementation in healthcare, using both qualitative and quantitative metrics to evaluate their thematic analysis. We found that the LLMs performed similarly to humans when applying pre-defined codes (deductive analysis), with a low rate of hallucination. However, in open-ended theme generation (inductive analysis) their performance was more variable, particularly in areas requiring interpretation of tone, nuance, or conversational context. These findings suggest that LLMs can support, rather than replace, human interpretation of qualitative data. We provide a reproducible framework for evaluating the performance of LLMs in qualitative analysis.
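The gap the abstract reports between κ ≈ 0.34 and Gwet's AC1 ≈ 0.92 at a code prevalence of about 8% is an instance of the well-known kappa prevalence paradox: when a code is rare, Cohen's chance-agreement term is dominated by joint negatives, depressing κ even when raw agreement is high, whereas Gwet's AC1 conditions chance agreement on mean prevalence. A minimal sketch with illustrative counts (not the study's data) reproduces the pattern:

```python
# Cohen's kappa vs Gwet's AC1 for two raters on one binary code.
# Counts are hypothetical, chosen to mimic ~8% prevalence and ~90% agreement.
n11, n10, n01, n00 = 30, 50, 45, 875   # both+, A-only, B-only, both-
n = n11 + n10 + n01 + n00

po = (n11 + n00) / n                   # observed agreement = 0.905
p_a = (n11 + n10) / n                  # rater A's positive rate
p_b = (n11 + n01) / n                  # rater B's positive rate

# Cohen's kappa: chance agreement from the product of marginal rates
pe_kappa = p_a * p_b + (1 - p_a) * (1 - p_b)
kappa = (po - pe_kappa) / (1 - pe_kappa)

# Gwet's AC1: chance agreement from the mean prevalence pi
pi = (p_a + p_b) / 2
pe_ac1 = 2 * pi * (1 - pi)
ac1 = (po - pe_ac1) / (1 - pe_ac1)

print(f"observed agreement = {po:.3f}")    # 0.905
print(f"Cohen's kappa      = {kappa:.2f}") # 0.34
print(f"Gwet's AC1         = {ac1:.2f}")   # 0.89
```

With 90.5% raw agreement on a rare code, κ sits at 0.34 while AC1 reaches 0.89, matching the qualitative pattern the study reports.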

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association (53 papers, Top 0.7%) - 13.5%
2. PLOS Digital Health (88 papers, Top 1%) - 8.4%
3. Journal of Medical Internet Research (81 papers, Top 2%) - 7.7%
4. BMJ Open (553 papers, Top 17%) - 6.5%
5. BMJ Health & Care Informatics (13 papers, Top 0.1%) - 5.1%
6. JAMIA Open (35 papers, Top 2%) - 4.6%
7. BMC Medical Informatics and Decision Making (36 papers, Top 3%) - 4.6%

50% of probability mass above

8. PLOS ONE (1737 papers, Top 77%) - 3.0%
9. npj Digital Medicine (85 papers, Top 6%) - 2.9%
10. BMC Medical Research Methodology (41 papers, Top 1%) - 2.9%
11. International Journal of Medical Informatics (25 papers, Top 2%) - 2.9%
12. Frontiers in Digital Health (18 papers, Top 0.7%) - 2.9%
13. DIGITAL HEALTH (11 papers, Top 0.3%) - 2.9%
14. Journal of Clinical Epidemiology (29 papers, Top 0.8%) - 2.5%
15. The Lancet Digital Health (25 papers, Top 0.9%) - 2.5%
16. JMIR Medical Informatics (16 papers, Top 3%) - 2.4%
17. Journal of Biomedical Informatics (37 papers, Top 3%) - 2.4%
18. BMJ Open Quality (15 papers, Top 1%) - 1.6%
19. Healthcare (14 papers, Top 1.0%) - 1.6%
20. JMIR Formative Research (31 papers, Top 3%) - 1.6%
21. Journal of General Internal Medicine (19 papers, Top 2%) - 1.3%
22. BMJ (49 papers, Top 4%) - 1.3%
23. JAMA Network Open (125 papers, Top 18%) - 0.8%
24. BMJ Global Health (95 papers, Top 12%) - 0.8%
25. IEEE Journal of Biomedical and Health Informatics (14 papers, Top 3%) - 0.7%
26. Wellcome Open Research (34 papers, Top 4%) - 0.7%
27. Research Synthesis Methods (17 papers, Top 1%) - 0.7%