Research Article | Volume 15, Issue 4 (April 2025) | Pages 689-692
Evaluating the Accuracy and Clinical Relevance of ChatGPT in Answering Patient Queries Related to Common Health Conditions
1. Assistant Professor, Department of General Medicine, RKDF Medical College & Hospital, Bhopal, MP
2. General Dentist, Smilebuilderz, Lancaster, PA, United States
3. Assistant Professor, Department of Pediatrics, SMBT Institute of Medical Sciences & Research Centre, Dhamangaon, Nashik
4. Professor, Department of Occupational Therapy, Jaipur Occupational Therapy College, Jaipur, Rajasthan
Under a Creative Commons license
Open Access
Received: February 26, 2025
Revised: March 18, 2025
Accepted: March 29, 2025
Published: April 21, 2025
Abstract

Background: The increasing reliance on artificial intelligence (AI) in healthcare has raised important questions regarding the accuracy and clinical applicability of AI-driven responses, particularly from language models like ChatGPT. Patients frequently turn to online resources for medical advice, but the quality of such information remains variable. This study aimed to evaluate the accuracy, reliability, and clinical relevance of ChatGPT’s responses to commonly asked health-related questions. Materials and Methods: A cross-sectional observational study was conducted between January and March 2025. A panel of 10 licensed healthcare professionals generated a dataset of 100 frequently asked patient queries covering general medicine, dermatology, pediatrics, mental health, and dental health. Responses were obtained from ChatGPT (GPT-4) and evaluated by three independent physicians for accuracy, clinical relevance, and completeness using a 5-point Likert scale. Inter-rater reliability was assessed using Cohen’s kappa. Results: Of the 100 responses, 86% were rated as clinically accurate (mean score 4.3 ± 0.5), while 78% were deemed relevant to actual clinical practice. Completeness was rated high in 73% of cases. Inter-rater reliability among the evaluators was substantial (κ = 0.78). However, 9% of the responses were flagged as potentially misleading or lacking important clinical context. Conclusion: ChatGPT demonstrates a high level of accuracy and clinical relevance in addressing common patient queries, suggesting its potential as a supplementary tool in patient education. Nevertheless, caution must be exercised, especially in complex or context-specific conditions, where professional medical consultation remains essential.

Keywords
INTRODUCTION

In recent years, the integration of artificial intelligence (AI) in healthcare has gained significant momentum, particularly in the domain of patient communication and education. Among AI-driven tools, large language models such as ChatGPT have demonstrated the ability to generate human-like responses based on extensive natural language processing capabilities (1). As patients increasingly seek medical information online, platforms like ChatGPT are being used to answer a wide range of health-related queries, raising questions about their accuracy, reliability, and clinical relevance.

 

Health misinformation and incomplete or inaccurate online content can negatively affect patient outcomes and clinical decision-making (2). While previous studies have evaluated the performance of AI in diagnostics and clinical decision support systems (3), limited evidence exists regarding the suitability of ChatGPT as a source of patient-level health information. Early evaluations of ChatGPT’s performance have shown promising results in areas such as mental health support, symptom triage, and general health inquiries (4,5). However, concerns remain about the potential for oversimplified, contextually inappropriate, or outdated responses, which could misguide users without medical expertise (6).

 

Given the growing accessibility of AI tools and their influence on patient behavior, it is imperative to assess their performance in addressing common medical queries. This study aims to systematically evaluate the accuracy and clinical relevance of ChatGPT's responses to frequently asked patient questions across various domains of general healthcare.

MATERIALS AND METHODS

A cross-sectional observational study was conducted over a three-month period from January to March 2025. The objective was to assess the accuracy and clinical relevance of responses generated by ChatGPT (GPT-4) to common patient health queries.

 

Question Selection: A total of 100 frequently asked health-related questions were selected from publicly available patient forums, medical FAQs, and clinical experience. The questions spanned the following health domains: general medicine, dermatology, pediatrics, mental health, and dental health. Care was taken to ensure a balanced representation of both symptom-based and condition-based inquiries.

 

AI Response Generation: Each question was entered into ChatGPT (GPT-4, accessed via OpenAI’s official platform), and responses were recorded in their original form without modification. The prompt was standardized to: “Provide a patient-friendly answer to the following question.” ChatGPT was not given any additional context, and all answers were generated under default settings.
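For readers who wish to reproduce a similar workflow programmatically rather than through the web interface used in this study, the sketch below shows how the same standardized prompt could be issued via the OpenAI Python SDK. The model identifier, function names, and example question are illustrative assumptions and not part of the study protocol.

```python
# Illustrative sketch only: the study used the official ChatGPT web platform,
# not the API. Model name, helper names, and the sample question are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_TEMPLATE = "Provide a patient-friendly answer to the following question.\n\n{question}"


def get_response(question: str) -> str:
    """Send one standardized patient query and return the unmodified answer."""
    completion = client.chat.completions.create(
        model="gpt-4",  # the study evaluated GPT-4
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(question=question)}],
    )
    return completion.choices[0].message.content


# Record responses in their original form, one per query (hypothetical example)
questions = ["What should I do for a mild fever in a two-year-old?"]
responses = {q: get_response(q) for q in questions}
```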

 

Evaluation Criteria: Three licensed physicians with a minimum of five years of clinical experience independently evaluated the responses. A standardized scoring system was used to assess three parameters:

  1. Accuracy – correctness of medical content (score 1 to 5)
  2. Clinical Relevance – applicability to real-life healthcare settings (score 1 to 5)
  3. Completeness – adequacy of the information provided in addressing the query (score 1 to 5)

Discrepancies in ratings were resolved through discussion and consensus.

 

Statistical Analysis: Descriptive statistics were used to summarize the results. Mean scores and standard deviations were calculated for each evaluation criterion. The level of agreement between the evaluators was determined using Cohen’s kappa (κ) coefficient. Data analysis was performed using IBM SPSS Statistics version 26.0. A p-value of <0.05 was considered statistically significant.
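As a companion to the SPSS workflow described above, the following sketch shows one way the same summary statistics and agreement measure could be computed in Python. The toy scores, variable names, and the choice to average pairwise kappas across the three evaluators are assumptions for illustration; Cohen’s kappa itself is defined for a pair of raters as κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement.

```python
# Minimal sketch (not the authors' SPSS analysis): summarizing 1-5 Likert
# ratings and estimating inter-rater agreement. Scores below are toy data.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# ratings[rater] -> one 1-5 accuracy score per question (hypothetical values)
ratings = {
    "rater_1": [5, 4, 4, 5, 3, 4],
    "rater_2": [5, 4, 5, 5, 3, 4],
    "rater_3": [4, 4, 4, 5, 3, 5],
}

# Mean score and sample standard deviation pooled over all raters
all_scores = np.concatenate(list(ratings.values()))
print(f"Accuracy: mean {all_scores.mean():.1f} ± {all_scores.std(ddof=1):.1f}")

# Cohen's kappa compares two raters; with three evaluators, one simple option
# is to average the kappas over all rater pairs.
pairwise = [
    cohen_kappa_score(ratings[a], ratings[b]) for a, b in combinations(ratings, 2)
]
print(f"Mean pairwise Cohen's kappa: {np.mean(pairwise):.2f}")
```

On the commonly used Landis and Koch scale, kappa values between 0.61 and 0.80 are read as substantial agreement and values above 0.80 as almost perfect, which matches the interpretations reported in Table 3.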

RESULTS

A total of 100 health-related queries were assessed across the following clinical domains: general medicine, dermatology, pediatrics, mental health, and dental health. ChatGPT responses were evaluated on three parameters: accuracy, clinical relevance, and completeness.

 

Overall Evaluation Scores: The mean score for accuracy was 4.3 ± 0.5, while clinical relevance and completeness scored 4.1 ± 0.6 and 3.9 ± 0.7 respectively on a 5-point scale (Table 1). The highest accuracy was observed in general medicine (4.5 ± 0.3), and the lowest completeness scores were noted in dermatology (3.6 ± 0.8), often due to insufficient explanation of treatment options or differential diagnoses.

 

Domain-wise Performance: As shown in Table 2, ChatGPT responses were consistently more accurate in general medicine and mental health queries. Pediatric responses, though mostly accurate, occasionally lacked context-specific cautionary notes. Dermatological answers had the lowest relevance scores due to generalized explanations.

 

Evaluator Agreement: The inter-rater reliability among the three evaluators was substantial, with a Cohen’s kappa coefficient (κ) of 0.78, indicating good agreement in scoring (Table 3).

 

Potentially Misleading Responses: Nine responses (9%) were flagged as potentially misleading or clinically inadequate. These primarily involved misinterpretation of symptoms or overgeneralization of treatment recommendations.

 

Table 1. Mean Evaluation Scores of ChatGPT Responses Across All Questions (n=100)

Parameter | Mean Score ± SD
Accuracy | 4.3 ± 0.5
Clinical Relevance | 4.1 ± 0.6
Completeness | 3.9 ± 0.7

 

Table 2. Domain-wise Mean Scores of ChatGPT Responses

Domain | Accuracy (±SD) | Relevance (±SD) | Completeness (±SD)
General Medicine | 4.5 ± 0.3 | 4.3 ± 0.4 | 4.0 ± 0.6
Dermatology | 4.1 ± 0.6 | 3.8 ± 0.7 | 3.6 ± 0.8
Pediatrics | 4.2 ± 0.5 | 4.0 ± 0.5 | 3.9 ± 0.6
Mental Health | 4.4 ± 0.4 | 4.2 ± 0.5 | 4.1 ± 0.5
Dental Health | 4.3 ± 0.7 | 4.1 ± 0.3 | 3.8 ± 0.5

 

Table 3. Inter-Rater Reliability Using Cohen’s Kappa Coefficient

Evaluation Aspect | Kappa (κ) Value | Interpretation
Accuracy | 0.81 | Almost perfect
Clinical Relevance | 0.76 | Substantial
Completeness | 0.77 | Substantial
Overall Agreement | 0.78 | Substantial

DISCUSSION

This study aimed to evaluate the accuracy, clinical relevance, and completeness of ChatGPT-generated responses to common patient health queries. The findings suggest that ChatGPT demonstrates a high level of accuracy and acceptable clinical relevance across a range of medical disciplines, aligning with previous research that supports the potential utility of large language models in healthcare communication (1–3).

The overall accuracy score of 4.3 indicates that most responses were factually sound and consistent with current clinical guidelines. This observation is consistent with studies evaluating the performance of ChatGPT in domains like mental health, where it showed comparable accuracy to that of trained clinicians in responding to patient inquiries (4,5). Additionally, the relatively high Cohen’s kappa value (κ = 0.78) reflects strong inter-rater agreement, confirming the reliability of expert assessments in this study.

Domain-wise, ChatGPT performed best in general medicine and mental health, likely due to the higher volume of publicly available training data in these areas (6,7). Mental health and dental health responses were particularly detailed and empathetic, consistent with prior literature emphasizing ChatGPT’s capability in emotional and behavioral support roles (8). In contrast, dermatology responses received the lowest completeness scores, which may reflect the model’s limitations in interpreting visually driven clinical scenarios, as noted in prior assessments of AI performance in dermatological diagnostics (9,10).

Despite its strengths, 9% of ChatGPT responses were flagged as misleading or insufficient. These instances often involved overgeneralizations or lack of individualized context—an issue echoed in other studies exploring AI in patient education (11,12). This highlights the need for cautious integration of AI-generated content in healthcare, particularly when it is consumed by laypersons without clinical oversight.

While the results support ChatGPT's potential as a supplementary tool for patient education and preliminary triage, it should not replace professional medical advice. AI models lack real-time clinical reasoning, access to patient histories, and physical examination findings—critical components of accurate diagnosis and management (13,14). Additionally, the dynamic nature of medical knowledge requires constant updating, which remains a limitation for even the most advanced models if not regularly retrained (15).

This study is limited by its cross-sectional design and the use of only one AI platform. Future research could expand to include comparative evaluations with other AI systems, incorporate patient feedback, and assess longitudinal changes in response quality as the model evolves.

CONCLUSION

ChatGPT demonstrates high accuracy and clinical relevance in responding to patient queries across various medical domains. While it holds promise as a supportive tool for patient education, its limitations in context-specific reasoning and clinical judgment highlight the importance of professional oversight. Further research and continuous model refinement are essential to ensure safe and effective integration into healthcare settings.

REFERENCES
  1. Mondal H, Dash I, Mondal S, Behera JK. ChatGPT in answering queries related to lifestyle-related diseases and disorders. Cureus. 2023 Nov 5;15(11):e48296. doi: 10.7759/cureus.48296. PMID: 38058315.
  2. Sciberras M, Farrugia Y, Gordon H, Furfaro F, Allocca M, Torres J, et al. Accuracy of information given by ChatGPT for patients with inflammatory bowel disease in relation to ECCO guidelines. J Crohns Colitis. 2024 Aug 14;18(8):1215-1221. doi: 10.1093/ecco-jcc/jjae040. PMID: 38520394.
  3. Ozturk N, Yakak I, Ağ MB, Aksoy N. Is ChatGPT reliable and accurate in answering pharmacotherapy-related inquiries in both Turkish and English? Curr Pharm Teach Learn. 2024 Jul;16(7):102101. doi: 10.1016/j.cptl.2024.04.017. PMID: 38702261.
  4. Pugliese N, Wai-Sun Wong V, Schattenberg JM, Romero-Gomez M, Sebastiani G, Aghemo A; NAFLD Expert Chatbot Working Group. Accuracy, reliability, and comprehensibility of ChatGPT-generated medical responses for patients with nonalcoholic fatty liver disease. Clin Gastroenterol Hepatol. 2024 Apr;22(4):886-889.e5. doi: 10.1016/j.cgh.2023.08.033. PMID: 37716618.
  5. Al-Sharif EM, Penteado RC, Dib El Jalbout N, Topilow NJ, Shoji MK, Kikkawa DO, et al. Evaluating the accuracy of ChatGPT and Google BARD in fielding oculoplastic patient queries: A comparative study on artificial versus human intelligence. Ophthalmic Plast Reconstr Surg. 2024 May-Jun;40(3):303-311. doi: 10.1097/IOP.0000000000002567. PMID: 38215452.
  6. Oliveira JA, Eskandar K, Kar E, de Oliveira FR, Filho ALDS. Understanding AI's role in endometriosis patient education and evaluating its information and accuracy: Systematic review. JMIR AI. 2024 Oct 30;3:e64593. doi: 10.2196/64593. PMID: 39476855.
  7. Warren CJ, Payne NG, Edmonds VS, Voleti SS, Choudry MM, Punjani N, et al. Quality of chatbot information related to benign prostatic hyperplasia. Prostate. 2025 Feb;85(2):175-180. doi: 10.1002/pros.24814. PMID: 39513562.
  8. Jo E, Song S, Kim JH, Lim S, Kim JH, Cha JJ, et al. Assessing GPT-4's performance in delivering medical advice: Comparative analysis with human experts. JMIR Med Educ. 2024 Jul 8;10:e51282. doi: 10.2196/51282. PMID: 38989848.
  9. Naz R, Akacı O, Erdoğan H, Açıkgöz A. Can large language models provide accurate and quality information to parents regarding chronic kidney diseases? J Eval Clin Pract. 2024 Dec;30(8):1556-1564. doi: 10.1111/jep.14084. PMID: 38959373.
  10. Sood A, Moyer A, Jahangiri P, Mar D, Nitichaikulvatana P, Ramreddy N, et al. Evaluation of the reliability of ChatGPT to provide guidance on recombinant zoster vaccination for patients with rheumatic and musculoskeletal diseases. J Clin Rheumatol. 2025 Jan 15. doi: 10.1097/RHU.0000000000002198. PMID: 39814338.
  11. Rao A, Mu A, Enichen E, Gupta D, Hall N, Koranteng E, et al. A future of self-directed patient internet research: Large language model-based tools versus standard search engines. Ann Biomed Eng. 2025 Mar 3. doi: 10.1007/s10439-025-03701-6. PMID: 40025252.
  12. Tao H, Liu L, Cui J, Wang K, Peng L, Nahata MC. Potential use of ChatGPT for the treatment of infectious diseases in vulnerable populations. Ann Biomed Eng. 2024 Dec;52(12):3141-3144. doi: 10.1007/s10439-024-03600-2. PMID: 39133388.
  13. Raghunathan R, Jacobs AR, Sant VR, King LJ, Rothberger G, Prescott J, et al. Can large language models address unmet patient information needs and reduce provider burnout in the management of thyroid disease? Surgery. 2025 Jan;177:108859. doi: 10.1016/j.surg.2024.06.075. PMID: 39424485.
  14. Bagde H, Dhopte A, Alam MK, Basri R. A systematic review and meta-analysis on ChatGPT and its utilization in medical and dental research. Heliyon. 2023 Nov 29;9(12):e23050. doi: 10.1016/j.heliyon.2023.e23050. PMID: 38144348.
  15. Suárez A, Díaz-Flores García V, Algar J, Gómez Sánchez M, Llorente de Pedro M, Freire Y. Unveiling the ChatGPT phenomenon: Evaluating the consistency and accuracy of endodontic question answers. Int Endod J. 2024 Jan;57(1):108-113. doi: 10.1111/iej.13985. PMID: 37814369.

 
