Document Type: Original Article
Authors
1 Department of Emergency Medicine, Woodlands Health, National Healthcare Group (NHG) Health, Singapore
2 Department of Orthopaedic Surgery, Khoo Teck Puat Hospital, National Healthcare Group (NHG) Health, Singapore
Abstract
Background: Generative artificial intelligence (AI) holds promise for medical education, yet the realism and contextual relevance of AI-generated toxicology vignettes in Southeast Asia are not well established. This study evaluated the face and content validity of vignettes produced by ChatGPT-4.0, assessing their plausibility and relevance for Singaporean emergency medicine education, training, and clinical decision support.
Methods: Ten vignettes were generated using ChatGPT-4.0 in March 2025 and independently evaluated by five Singapore-based clinical toxicologists from four public hospitals. Each vignette was scored on a six-domain rubric, adapted from established validity frameworks, covering presentation realism, typicality of exposure, toxidrome representation, clinical progression, appropriateness for toxicology consultation, and alignment with local practice. Inter-rater reliability was calculated with a two-way random-effects, average-measures intraclass correlation coefficient, ICC(2,k).
Results: The mean total score was 20.1/24 (SD = 1.8). Inter-rater agreement was excellent (ICC = 0.87; 95% CI: 0.80–0.94). Face validity averaged 4.4/5 (SD = 0.5) and content validity averaged 4.2/5 (SD = 0.6). Most vignettes reflected common regional poisoning patterns, with some depicting rare but plausible exposures relevant to local practice.
Conclusion: ChatGPT-4.0 can generate toxicology vignettes with high expert-rated realism and contextual relevance when tailored to Singaporean practice. These findings support its potential role in medical education, simulation, and decision-support tools. Further research should compare AI-generated and clinician-authored materials to determine educational impact and applicability in real-world clinical settings.
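As an illustration of the reliability analysis described in the Methods, the sketch below computes a two-way random-effects, average-measures intraclass correlation, ICC(2,k) in Shrout and Fleiss notation, from a ratings matrix. The matrix shape (10 vignettes × 5 raters) mirrors the study design, but the values and the `icc_2k` helper are hypothetical illustrations, not the authors' actual analysis code.

```python
import numpy as np

def icc_2k(ratings: np.ndarray) -> float:
    """ICC(2,k): two-way random effects, absolute agreement,
    average of k raters (Shrout & Fleiss).
    `ratings` is an (n targets x k raters) score matrix."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()    # between-vignette SS
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()    # between-rater SS
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols  # residual SS
    bms = ss_rows / (n - 1)             # between-targets mean square
    jms = ss_cols / (k - 1)             # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical data: 10 vignettes x 5 raters, total rubric scores out of 24.
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(20, 2, size=(10, 5)).round(), 0, 24)
print(f"ICC(2,k) = {icc_2k(scores):.2f}")
```

An equivalent result can be obtained from standard statistical packages (e.g., the ICC2k row returned by pingouin's `intraclass_corr` in Python); the manual version above simply makes the mean-square decomposition explicit.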