Background: The increasing reliance on artificial intelligence (AI) in healthcare has raised important questions regarding the accuracy and clinical applicability of AI-driven responses, particularly from language models like ChatGPT. Patients frequently turn to online resources for medical advice, but the quality of such information remains variable. This study aimed to evaluate the accuracy, reliability, and clinical relevance of ChatGPT’s responses to commonly asked health-related questions. Materials and Methods: A cross-sectional observational study was conducted between January and March 2025. A panel of 10 licensed healthcare professionals generated a dataset of 100 frequently asked patient queries covering general medicine, pediatrics, dermatology, mental health, and dental health. Responses were obtained from ChatGPT (GPT-4) and evaluated by three independent physicians for accuracy, clinical relevance, and completeness using a 5-point Likert scale. Inter-rater reliability was assessed using Cohen’s kappa. Results: Of the 100 responses, 86% were rated as clinically accurate (mean score 4.3 ± 0.5), while 78% were deemed relevant to actual clinical practice. Completeness was rated high in 73% of cases. Inter-rater reliability among the evaluators was substantial (κ = 0.78). However, 9% of the responses were flagged as potentially misleading or lacking important clinical context. Conclusion: ChatGPT demonstrates a high level of accuracy and clinical relevance in addressing common patient queries, suggesting its potential as a supplementary tool in patient education. Nevertheless, caution must be exercised, especially in complex or context-specific conditions, where professional medical consultation remains essential.
In recent years, the integration of artificial intelligence (AI) in healthcare has gained significant momentum, particularly in the domain of patient communication and education. Among AI-driven tools, large language models such as ChatGPT have demonstrated the ability to generate human-like responses based on extensive natural language processing capabilities (1). As patients increasingly seek medical information online, platforms like ChatGPT are being used to answer a wide range of health-related queries, raising questions about their accuracy, reliability, and clinical relevance.
Health misinformation and incomplete or inaccurate online content can negatively affect patient outcomes and clinical decision-making (2). While previous studies have evaluated the performance of AI in diagnostics and clinical decision support systems (3), limited evidence exists regarding the suitability of ChatGPT as a source of patient-level health information. Early evaluations of ChatGPT’s performance have shown promising results in areas such as mental health support, symptom triage, and general health inquiries (4,5). However, concerns remain about the potential for oversimplified, contextually inappropriate, or outdated responses, which could misguide users without medical expertise (6).
Given the growing accessibility of AI tools and their influence on patient behavior, it is imperative to assess their performance in addressing common medical queries. This study aims to systematically evaluate the accuracy and clinical relevance of ChatGPT's responses to frequently asked patient questions across various domains of general healthcare.
A cross-sectional observational study was conducted over a three-month period from January to March 2025. The objective was to assess the accuracy and clinical relevance of responses generated by ChatGPT (GPT-4) to common patient health queries.
Question Selection: A total of 100 frequently asked health-related questions were selected from publicly available patient forums, medical FAQs, and clinical experience. The questions represented five health domains: general medicine, dermatology, pediatrics, mental health, and dental health. Care was taken to ensure a balanced representation of both symptom-based and condition-based inquiries.
AI Response Generation: Each question was entered into ChatGPT (GPT-4, accessed via OpenAI’s official platform), and responses were recorded in their original form without modification. The prompt was standardized to: “Provide a patient-friendly answer to the following question.” ChatGPT was not provided with any additional context, and all answers were generated under default settings.
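For transparency, the standardized prompting procedure can also be reproduced programmatically. The sketch below is illustrative only: the study used the official ChatGPT web interface, and the openai Python package, API key handling, and example question shown here are assumptions rather than part of the study protocol.

```python
# Illustrative sketch: the study used the ChatGPT web interface, but the same
# standardized prompt could be issued via the API as shown here.
# Assumes the `openai` Python package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STANDARD_PROMPT = "Provide a patient-friendly answer to the following question."

def get_response(question: str, model: str = "gpt-4") -> str:
    """Send one patient question with the standardized prompt under default settings."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{STANDARD_PROMPT}\n\n{question}"}],
    )
    return completion.choices[0].message.content

# Hypothetical example query, not taken from the study dataset
if __name__ == "__main__":
    print(get_response("Is a mild fever in a two-year-old a reason to see a doctor?"))
```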
Evaluation Criteria: Three licensed physicians with a minimum of five years of clinical experience independently evaluated the responses. A standardized scoring system was used to assess three parameters: accuracy, clinical relevance, and completeness, each rated on a 5-point Likert scale.
Discrepancies in ratings were resolved through discussion and consensus.
Statistical Analysis: Descriptive statistics were used to summarize the results. Mean scores and standard deviations were calculated for each evaluation criterion. The level of agreement between the evaluators was determined using Cohen’s kappa (κ) coefficient. Data analysis was performed using IBM SPSS Statistics version 26.0. A p-value of <0.05 was considered statistically significant.
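For readers unfamiliar with the agreement statistic, Cohen’s kappa compares observed agreement p_o with the agreement expected by chance p_e, κ = (p_o - p_e) / (1 - p_e). The analysis itself was performed in SPSS; the sketch below shows an equivalent calculation in Python using hypothetical ratings (not the study data), assuming numpy and scikit-learn are available. Note that Cohen’s kappa is defined for pairs of raters; with three evaluators, pairwise values are typically averaged (or Fleiss’ kappa used) to obtain an overall figure.

```python
# Illustrative re-implementation of the analysis described above (the study used
# IBM SPSS). The ratings below are hypothetical placeholders, not the study data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point Likert ratings of the same responses by two evaluators
rater_a = np.array([5, 4, 4, 3, 5, 4, 2, 4])
rater_b = np.array([5, 4, 3, 3, 5, 4, 3, 4])

# Descriptive statistics: mean score and sample standard deviation for one criterion
print(f"Mean accuracy rating: {rater_a.mean():.1f} ± {rater_a.std(ddof=1):.1f}")

# Cohen's kappa between the two raters; 'linear' weights respect the ordinal
# nature of Likert ratings (unweighted kappa is also commonly reported)
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"Cohen's kappa: {kappa:.2f}")
```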
A total of 100 health-related queries were assessed across five clinical domains: general medicine, dermatology, pediatrics, mental health, and dental health. ChatGPT responses were evaluated on three parameters: accuracy, clinical relevance, and completeness.
Overall Evaluation Scores: The mean score for accuracy was 4.3 ± 0.5, while clinical relevance and completeness scored 4.1 ± 0.6 and 3.9 ± 0.7, respectively, on a 5-point scale (Table 1). The highest accuracy was observed in general medicine (4.5 ± 0.3), and the lowest completeness scores were noted in dermatology (3.6 ± 0.8), often due to insufficient explanation of treatment options or differential diagnoses.
Domain-wise Performance: As shown in Table 2, ChatGPT responses were consistently more accurate in general medicine and mental health queries. Pediatric responses, though mostly accurate, occasionally lacked context-specific cautionary notes. Dermatological answers had the lowest relevance scores due to generalized explanations.
Evaluator Agreement: The inter-rater reliability among the three evaluators was substantial, with a Cohen’s kappa coefficient (κ) of 0.78, indicating good agreement in scoring (Table 3).
Potentially Misleading Responses: Nine responses (9%) were flagged as potentially misleading or clinically inadequate. These primarily involved misinterpretation of symptoms or overgeneralization of treatment recommendations.
Table 1. Mean Evaluation Scores of ChatGPT Responses Across All Questions (n=100)
Parameter | Mean Score ± SD
Accuracy | 4.3 ± 0.5
Clinical Relevance | 4.1 ± 0.6
Completeness | 3.9 ± 0.7
Table 2. Domain-wise Mean Scores of ChatGPT Responses
Domain | Accuracy (±SD) | Relevance (±SD) | Completeness (±SD)
General Medicine | 4.5 ± 0.3 | 4.3 ± 0.4 | 4.0 ± 0.6
Dermatology | 4.1 ± 0.6 | 3.8 ± 0.7 | 3.6 ± 0.8
Pediatrics | 4.2 ± 0.5 | 4.0 ± 0.5 | 3.9 ± 0.6
Mental Health | 4.4 ± 0.4 | 4.2 ± 0.5 | 4.1 ± 0.5
Dental Health | 4.3 ± 0.7 | 4.1 ± 0.3 | 3.8 ± 0.5
Table 3. Inter-Rater Reliability Using Cohen’s Kappa Coefficient
Evaluation Aspect | Kappa (κ) Value | Interpretation
Accuracy | 0.81 | Almost perfect
Clinical Relevance | 0.76 | Substantial
Completeness | 0.77 | Substantial
Overall Agreement | 0.78 | Substantial
This study aimed to evaluate the accuracy, clinical relevance, and completeness of ChatGPT-generated responses to common patient health queries. The findings suggest that ChatGPT demonstrates a high level of accuracy and acceptable clinical relevance across a range of medical disciplines, aligning with previous research that supports the potential utility of large language models in healthcare communication (1–3).
The overall accuracy score of 4.3 indicates that most responses were factually sound and consistent with current clinical guidelines. This observation is consistent with studies evaluating the performance of ChatGPT in domains like mental health, where it showed comparable accuracy to that of trained clinicians in responding to patient inquiries (4,5). Additionally, the relatively high Cohen’s kappa value (κ = 0.78) reflects strong inter-rater agreement, confirming the reliability of expert assessments in this study.
Domain-wise, ChatGPT performed best in general medicine and mental health, likely due to the higher volume of publicly available training data in these areas (6,7). Mental health responses were particularly detailed and empathetic, as supported by prior literature emphasizing ChatGPT’s capability in emotional and behavioral support roles (8). On the other hand, dermatology responses received the lowest completeness scores, which might be attributed to the model’s limitation in interpreting visually driven clinical scenarios, as noted in prior assessments of AI performance in dermatological diagnostics (9,10).
Despite its strengths, 9% of ChatGPT responses were flagged as misleading or insufficient. These instances often involved overgeneralizations or lack of individualized context—an issue echoed in other studies exploring AI in patient education (11,12). This highlights the need for cautious integration of AI-generated content in healthcare, particularly when it is consumed by laypersons without clinical oversight.
While the results support ChatGPT's potential as a supplementary tool for patient education and preliminary triage, it should not replace professional medical advice. AI models lack real-time clinical reasoning, access to patient histories, and physical examination findings—critical components of accurate diagnosis and management (13,14). Additionally, the dynamic nature of medical knowledge requires constant updating, which remains a limitation for even the most advanced models if not regularly retrained (15).
This study is limited by its cross-sectional design and the use of only one AI platform. Future research could expand to include comparative evaluations with other AI systems, incorporate patient feedback, and assess longitudinal changes in response quality as the model evolves.
ChatGPT demonstrates high accuracy and clinical relevance in responding to patient queries across various medical domains. While it holds promise as a supportive tool for patient education, its limitations in context-specific reasoning and clinical judgment highlight the importance of professional oversight. Further research and continuous model refinement are essential to ensure safe and effective integration into healthcare settings.