Beyond Accuracy: Multidimensional Evaluation of Large Language Models in Hepatocellular Carcinoma Management Emphasizing Prompting

Luo, J.; Ma, J.; Wang, T.; Qiu, Y.; Yang, Y.; Qiu, H.; Chen, G.; Wang, W.

2025-07-15 oncology
10.1101/2025.07.15.25331552 medRxiv
Background & Aims: Hepatocellular carcinoma is the most common type of primary liver cancer and remains a major global health challenge. In resource-limited settings, patients often face barriers such as low screening rates, poor adherence, and limited access to medical information. Despite comprehensive clinical guidelines, issues like inadequate patient education and ineffective communication persist. While large language models show promise in clinical communication and decision support, their performance in hepatocellular carcinoma management has not been systematically evaluated across multiple dimensions.

Methods: Ten emerging language models, including general-purpose and medical-domain models, were assessed under prompted and unprompted conditions using a standardized question set covering five key stages: general knowledge, screening, diagnosis, treatment, and follow-up. Accuracy was rated by experts, while semantic consistency, local interpretability, information entropy, and readability were measured computationally.

Results: ChatGPT-4o and Grok-3 achieved the highest accuracy (2.62 ± 0.06, 93%; 2.60 ± 0.06, 95%) and interpretability (0.43; 0.43). Prompting significantly improved accuracy (p < 0.001) and interpretability (p < 0.001) across all models. Semantic consistency declined slightly in most models; information entropy generally increased; readability changes varied.

Conclusions: This study presents the first multidimensional evaluation of large language models in hepatocellular carcinoma-related clinical tasks. General-purpose models outperformed some medical models, revealing limitations in domain-specific fine-tuning. Prompt design strongly influenced model performance. Further research should integrate diverse prompt strategies and clinical scenarios to improve the usability of language models in real-world oncology settings.

Lay summary: This study evaluated how well advanced language-based artificial intelligence models can answer clinical questions related to hepatocellular carcinoma. The results showed that some models, especially when guided with structured instructions, provided accurate and understandable responses. These findings suggest that such tools may help improve communication and access to information for both doctors and patients managing liver cancer.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank. Journal: papers in training set; percentile; predicted probability

1. Biology Methods and Protocols: 53 papers in training set; Top 0.1%; 19.7%
2. Artificial Intelligence in Medicine: 15 papers in training set; Top 0.1%; 10.2%
3. Scientific Reports: 3102 papers in training set; Top 17%; 6.4%
4. Frontiers in Oncology: 95 papers in training set; Top 0.7%; 4.9%
5. Cancer Medicine: 24 papers in training set; Top 0.3%; 4.0%
6. PLOS ONE: 4510 papers in training set; Top 35%; 4.0%
7. BMC Bioinformatics: 383 papers in training set; Top 3%; 3.6%

50% of probability mass above

8. JCO Clinical Cancer Informatics: 18 papers in training set; Top 0.2%; 3.6%
9. Computers in Biology and Medicine: 120 papers in training set; Top 1%; 2.9%
10. PeerJ: 261 papers in training set; Top 3%; 2.8%
11. PLOS Digital Health: 91 papers in training set; Top 1%; 2.4%
12. BMC Cancer: 52 papers in training set; Top 1.0%; 2.1%
13. Journal of Medical Internet Research: 85 papers in training set; Top 2%; 2.1%
14. BMC Medical Informatics and Decision Making: 39 papers in training set; Top 1%; 1.9%
15. Journal of Translational Medicine: 46 papers in training set; Top 0.9%; 1.7%
16. npj Digital Medicine: 97 papers in training set; Top 2%; 1.7%
17. JAMA Network Open: 127 papers in training set; Top 2%; 1.7%
18. JMIR Formative Research: 32 papers in training set; Top 0.9%; 1.5%
19. International Journal of Medical Informatics: 25 papers in training set; Top 1%; 1.2%
20. BMJ Open: 554 papers in training set; Top 11%; 1.2%
21. Frontiers in Public Health: 140 papers in training set; Top 6%; 1.1%
22. JMIR Medical Informatics: 17 papers in training set; Top 1%; 1.1%
23. BMC Medical Education: 20 papers in training set; Top 0.7%; 1.0%
24. iScience: 1063 papers in training set; Top 26%; 0.9%
25. BMC Research Notes: 29 papers in training set; Top 0.4%; 0.9%
26. PLOS Computational Biology: 1633 papers in training set; Top 23%; 0.8%
27. Cancers: 200 papers in training set; Top 5%; 0.8%
28. Frontiers in Bioinformatics: 45 papers in training set; Top 0.9%; 0.8%
29. British Journal of Cancer: 42 papers in training set; Top 2%; 0.7%
30. European Journal of Cancer: 10 papers in training set; Top 0.7%; 0.5%
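
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below is illustrative only: the function name `rank_reaching` is hypothetical, and the values are the rounded percentages shown for the first ten journals, so the result is approximate and may not match how the site computes its cutoff.

```python
from itertools import accumulate

# Predicted probabilities (%) for the ten highest-ranked journals,
# copied from the ranking on this page (rounded values).
probabilities = [19.7, 10.2, 6.4, 4.9, 4.0, 4.0, 3.6, 3.6, 2.9, 2.8]


def rank_reaching(mass_percent, probs):
    """Return the 1-based rank at which the running total first reaches mass_percent."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= mass_percent:
            return rank
    return None  # threshold not reached within the supplied entries


cutoff_rank = rank_reaching(50.0, probabilities)
print(cutoff_rank)
```

With these rounded values the running total is about 49.2% after rank 6 and about 52.8% after rank 7, which agrees with the cutoff marker placed after the seventh journal.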