Document Type: Original Article

Authors

1 Department of Emergency Medicine, Woodlands Health, National Healthcare Group (NHG) Health, Singapore

2 Department of Orthopaedic Surgery, Khoo Teck Puat Hospital, National Healthcare Group (NHG) Health, Singapore

Abstract

Background: Generative artificial intelligence (AI) holds promise for medical education, yet the realism and contextual relevance of AI-generated toxicology vignettes in Southeast Asia are not well established. This study evaluated the face and content validity of vignettes produced by ChatGPT-4.0 to assess their plausibility and relevance for Singaporean emergency medicine education, training, and clinical decision support.
Methods: Ten vignettes were generated with ChatGPT-4.0 in March 2025 and independently evaluated by five Singapore-based clinical toxicologists from four public hospitals. A six-domain rubric adapted from established validity frameworks rated each vignette on presentation realism, typicality of exposure, toxidrome representation, clinical progression, appropriateness for toxicology consultation, and alignment with local practice. Inter-rater reliability was calculated using a two-way random-effects intraclass correlation coefficient [ICC(2,k)]; an illustrative computation is sketched after the abstract.
Results: The mean total score was 20.1/24 (SD = 1.8). Inter-rater agreement was excellent (ICC = 0.87; 95% CI: 0.80–0.94). Face validity averaged 4.4/5 (SD = 0.5) and content validity averaged 4.2/5 (SD = 0.6). Most vignettes reflected common regional poisoning patterns, with some depicting rare but plausible exposures relevant to local practice.
Conclusion: ChatGPT-4.0 can generate toxicology vignettes with high expert-rated realism and contextual relevance when tailored to Singaporean practice. These findings support its potential role in medical education, simulation, and decision-support tools. Further research should compare AI-generated and clinician-authored materials to determine educational impact and applicability in real-world clinical settings.
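
The ICC(2,k) named in Methods can be reproduced with standard statistical software. The following is a minimal sketch, not the study's analysis: it assumes the Python pingouin library and substitutes synthetic scores for the study's raw ratings, which are not reproduced here.

```python
import numpy as np
import pandas as pd
import pingouin as pg  # assumed dependency; install with `pip install pingouin`

rng = np.random.default_rng(seed=42)
n_vignettes, n_raters = 10, 5

# Long format: one row per (vignette, rater) pair. Scores are synthetic
# stand-ins for the six-domain rubric totals (maximum 24 points).
ratings = pd.DataFrame({
    "vignette": np.repeat(np.arange(n_vignettes), n_raters),
    "rater": np.tile(np.arange(n_raters), n_vignettes),
    "score": rng.integers(16, 25, size=n_vignettes * n_raters),
})

icc = pg.intraclass_corr(data=ratings, targets="vignette",
                         raters="rater", ratings="score")
# The "ICC2k" row is the two-way random-effects, absolute-agreement,
# average-of-k-raters estimate, i.e. Shrout and Fleiss's ICC(2,k).
print(icc.loc[icc["Type"] == "ICC2k", ["ICC", "CI95%"]])
```

pingouin reports all six Shrout–Fleiss ICC forms from the same call; only the ICC2k row corresponds to the model reported in Results, together with its 95% confidence interval.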
