neuroGPT-X: toward a clinic-ready large language model

Edward Guo, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Clinical Neurosciences, Project neuroArm, Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada

Mehul Gupta, MD, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada

Sarthak Sinha, BSc, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada

Karl Rössler, MD, Department of Neurosurgery, Medical University of Vienna, Vienna, Austria

Marcos Tatagiba, MD, Department of Neurosurgery, Tubingen University, Tubingen, Germany

Ryojo Akagami, MD, Department of Surgery, University of British Columbia, Vancouver, British Columbia, Canada

Ossama Al-Mefty, MD, Department of Neurosurgery, Harvard School of Medicine, Boston, Massachusetts

Taku Sugiyama, MD, PhD, Department of Neurosurgery, Hokkaido University Graduate School of Medicine, Sapporo, Japan

Philip E. Stieg, PhD, MD, Department of Neurosurgery, Weill Cornell Medicine/NewYork-Presbyterian Hospital, New York, New York

Gwynedd E. Pickett, MD, Department of Surgery, Dalhousie University, Halifax, Nova Scotia, Canada

Madeleine de Lotbiniere-Bassett, MD, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Clinical Neurosciences, Project neuroArm, Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada

Rahul Singh, BSc, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Clinical Neurosciences, Project neuroArm, Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada

Sanju Lama, MD, PhD, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Clinical Neurosciences, Project neuroArm, Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada

Garnette R. Sutherland, MD, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Clinical Neurosciences, Project neuroArm, Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada
Open access

OBJECTIVE

The objective was to assess the performance of a context-enriched large language model (LLM) compared with international neurosurgical experts on questions related to the management of vestibular schwannoma. A further objective was to develop a chat-based platform incorporating in-text citations, references, and memory to enable accurate, relevant, and reliable information in real time.

METHODS

The analysis involved 1) creating a data set through web scraping, 2) developing a chat-based platform called neuroGPT-X, 3) enlisting 8 expert neurosurgeons across international centers to independently create questions (n = 1) and to answer (n = 4) and evaluate responses (n = 3) while blinded, and 4) analyzing the evaluation results on the management of vestibular schwannoma. In the blinded phase, all answers were assessed for accuracy, coherence, relevance, thoroughness, speed, and overall rating. In the unblinded phase, all neurosurgeons completed a Likert scale survey and long-answer questions regarding the clinical utility, likelihood of use, and limitations of the tool. The tool was then evaluated against a set of 103 consensus statements on vestibular schwannoma care from the 8th Quadrennial International Conference on Vestibular Schwannoma.

RESULTS

Responses from the naive and context-enriched Generative Pretrained Transformer (GPT) models were consistently rated as not significantly different from expert responses in terms of accuracy, coherence, relevance, thoroughness, and overall performance, and they were often rated significantly higher. Both the naive and context-enriched GPT models provided faster responses to the standardized question set than expert neurosurgeon respondents (p < 0.01). The context-enriched GPT model agreed with 98 of the 103 (95%) consensus statements. Of interest, all expert surgeons expressed concerns about the reliability of GPT in accurately addressing the nuances and controversies surrounding the management of vestibular schwannoma. Furthermore, the authors developed neuroGPT-X, a chat-based platform designed to provide point-of-care clinical support and mitigate the limitations of human memory. neuroGPT-X incorporates features such as in-text citations and references to enable accurate, relevant, and reliable information in real time.

CONCLUSIONS

The present study, with its subspecialist-level performance in generating written responses to complex neurosurgical problems for which evidence-based consensus for management is lacking, suggests that context-enriched LLMs show promise as a point-of-care medical resource. The authors anticipate that this work will be a springboard for expansion into more medical specialties, incorporating evidence-based clinical information and developing expert-level dialogue surrounding LLMs in healthcare.

ABBREVIATIONS

ACM = Association for Computing Machinery; API = application programming interface; GPT = Generative Pretrained Transformer; LLM = large language model.

In Brief

As large language models became openly available, a curiosity arose: Could they be tested and used as an expert opinion for complex neurosurgical disease? The publicly available, context-enriched ChatGPT model revealed that such a trained model could indeed generate comprehensive responses to curated questions on vestibular schwannoma, similar to a cadre of international neurosurgical experts. The study is only a beginning and heralds the inevitable utility of machine intelligence in medicine.

Over the past several decades, information storage and retrieval have shifted from textbooks and journals to digital resources.1 Web-based knowledge, including mobile applications, has provided the medical community with rapid access to an ever-increasing body of knowledge.2–4 Although physicians may access information and guidelines using search engines, results are usually high volume, necessitating selective extraction of clinically relevant information. In addition, physicians are continually required to keep their knowledge and skills up to date, relying on experience and the latest published standards of care.4 With more than 1 million publications per year (2 papers/minute) added to online databases in the biomedical field alone, the current mode of learning, teaching, and practicing medicine is rapidly becoming volume intensive and memory prohibitive.2,5,6 Such an approach to education is time consuming, with information often segregated, variable, or incomplete, requiring further in-depth review and analysis to achieve optimal results.6,7 In this era of rapid medical advancement, it is imperative that physicians have access to the latest and most relevant information to ensure the best outcomes for their patients.8,9

Large language models (LLMs),10–13 such as Generative Pretrained Transformer (GPT),14 are a type of neural network that can process and generate human-like text by learning patterns from vast amounts of data. LLMs show promise as clinical reasoning tools, as they can perform tasks such as text generation, question answering, and language translation.14 The release of ChatGPT in November 2022, which features a chat-like interface and human-like conversation skills, highlights the potential of LLMs in fields that require complex decision-making. This is exemplified by recent studies that demonstrate the performance of ChatGPT and GPT on standardized examinations, such as the United States Medical Licensing Examination15 and the National Conference of Bar Examiners Unified Bar Examination.16

The use of LLMs as a point-of-care medical resource for subspecialty physicians has yet to be investigated. Despite their potential, issues such as interpretability and accountability have limited the application of LLMs in high-risk environments. However, context enrichment and zero-shot learning (a machine learning approach that enables models to recognize and classify objects or concepts they have never seen before by leveraging transferable knowledge learned from related tasks or domains) have shown marked domain-specific performance improvements,17,18 making this an area of considerable promise. Combining zero-shot learning and publication references with LLM responses may address interpretability and accountability concerns, thereby improving the clinical utility of these models.

To establish whether LLMs can assist in educating residents and delivering treatment options at an experienced physician level, it would be important to challenge LLMs with a complex and controversial disease paradigm. One such example is acoustic/vestibular schwannoma, which bears the nuances of intricate anatomical landmarks, diverse diagnostic modalities, global consensus treatment approaches, and surgeon expertise, whereby treatments and outcomes may vary between centers of excellence.19,20 Staying abreast of the literature and guidelines while maintaining technical excellence may be challenging, requiring a balance between knowledge extraction and reasoning.

In this study, we make three key contributions toward assessing LLMs as a clinical tool. First, we developed a framework to enrich GPT with context relevant to vestibular schwannoma. Second, we compared the performance of ChatGPT, termed naive GPT, and that of a context-enriched GPT model against leading neurosurgical experts worldwide. Finally, we introduced a proof-of-concept clinical tool, neuroGPT-X, which incorporates working memory and sources with its responses in a web-based chat platform, aimed at addressing the challenges of LLMs in a clinical setting (e.g., interpretability, reliability, accountability, safety). Here, we hypothesized that a well-trained, context-enriched GPT would perform similarly to or better than expert surgeons in answering questions commonly encountered in day-to-day practice.

Methods

Data Set Curation

Terminologies unique to vestibular schwannoma were used to obtain articles and abstracts via web scraping (automated extraction of data from websites) from Wikipedia and PubMed. The Wikipedia Python application programming interface (API) was used to scrape all webpages from the starting webpage of "Vestibular Schwannoma," and the content from each page was extracted by heading. The findpapers Python application21 was used to scrape abstracts from journals and conference proceedings related to vestibular schwannoma into a structured JavaScript Object Notation (JSON) format from PubMed regardless of publication language from January 1, 2000, until January 1, 2023. The query was ([vestibular schwannoma] OR [acoustic neurilemmoma] OR [perineural fibroblastoma] OR [neurinoma of the acoustic nerve] OR [neurofibroma of the acoustic nerve] OR [schwannoma of the acoustic nerve] OR [acoustic neuroma]). All journals flagged as potentially predatory in the metadata of the findpapers application output were removed on the basis of Beall’s List of potentially predatory journals and publishers.22

A structured data set was compiled with these column names: Title, Heading, Content, Authors, Tokens, and Embeddings. For Wikipedia articles, "Title" was the name of the article, "Heading" was the article title, and "Authors" was set as "Wikipedia." For PubMed abstracts, "Title" was "Vestibular Schwannoma," "Heading" was the publication title, and "Authors" was the contributing author(s). The "Content" column included the scraped Wikipedia section or PubMed abstract. The "Tokens" column had the number of tokens for each row in the "Content" column computed using the GPT2TokenizerFast function provided by OpenAI. Tokens were used to calculate the maximum input length to provide to the OpenAI GPT API. The "Embeddings" column included embeddings from the "Content" column computed using the OpenAI embedding model, text-embedding-ada-002. These embeddings were used for thematic analysis and search purposes to find the most similar documents to a user’s input, providing context enrichment for the OpenAI GPT-3 API.
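As a concrete sketch, the row schema described above can be expressed as plain Python dictionaries. The token and embedding computations are stubbed here (a whitespace-based count and a fixed-length dummy vector) because the actual pipeline used GPT2TokenizerFast and text-embedding-ada-002, which require OpenAI tooling; only the column layout is meant to be faithful.

```python
# Sketch of the structured data set rows. count_tokens and embed are stubs
# standing in for GPT2TokenizerFast and text-embedding-ada-002.

def count_tokens(text):
    # Stand-in for GPT2TokenizerFast: rough whitespace-based token count.
    return len(text.split())

def embed(text):
    # Stand-in for the OpenAI embedding call (text-embedding-ada-002).
    return [0.0] * 8

def wikipedia_row(article_name, section_text):
    # For Wikipedia articles, "Title" and "Heading" are the article name,
    # and "Authors" is set to "Wikipedia".
    return {
        "Title": article_name,
        "Heading": article_name,
        "Content": section_text,
        "Authors": "Wikipedia",
        "Tokens": count_tokens(section_text),
        "Embeddings": embed(section_text),
    }

def pubmed_row(publication_title, authors, abstract):
    # For PubMed abstracts, "Title" is fixed and "Heading" is the
    # publication title.
    return {
        "Title": "Vestibular Schwannoma",
        "Heading": publication_title,
        "Content": abstract,
        "Authors": authors,
        "Tokens": count_tokens(abstract),
        "Embeddings": embed(abstract),
    }
```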

Thematic Analysis

Thematic analysis was performed using the data set embeddings. K-means clustering classified clusters of thematic similarity; the clusters were chosen on the basis of the elbow method and silhouette score for clusters ranging from 2 to 25. The t-distributed stochastic neighbor embedding (t-SNE) algorithm was employed for data visualization with two components: a perplexity of the square root of the number of rows in the structured data set, principal component analysis (PCA) as an initializer, and a learning rate of 200. The InstructGPT Davinci model,11 text-davinci-003, was used to perform thematic analysis on a randomly sampled subset of articles from each cluster. Each cluster’s theme was summarized into keywords using the same text-davinci-003 model with the hyperparameters of temperature = 0, max_tokens = 64, top_p = 1, frequency_penalty = 0, and presence_penalty = 0. The prompt for the thematic analysis was "What do the following abstracts have in common?" and the prompt for the summarization was "Summarize this phrase into keywords."
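To illustrate the cluster-selection criterion, the sketch below implements a toy K-means and the silhouette score in pure Python and picks the k that maximizes the mean silhouette over a small invented data set. This is illustrative only; the study's actual pipeline (elbow method and silhouette over the ada-002 embeddings, t-SNE visualization) is not reproduced.

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)

def kmeans(points, k, iters=50, seed=0):
    # Toy Lloyd's algorithm: random initial centers, then alternate
    # assignment and centroid updates for a fixed number of iterations.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    # Mean of (b - a) / max(a, b): a = mean intra-cluster distance,
    # b = mean distance to the nearest other cluster.
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same) if same else 0.0
        others = set(labels) - {labels[i]}
        if not others:
            scores.append(0.0)
            continue
        b = min(sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c) /
                labels.count(c) for c in others)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated blobs of invented points: silhouette should peak at k = 2.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_k = max(range(2, 5), key=lambda k: silhouette(points, kmeans(points, k)))
```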

Context Enrichment With Prompt Construction

User input was enriched with a priming prompt and content for all requests to the GPT-3 API. Standard prompts were constructed using the following structure: Prompt = priming prompt + content + user input.

The priming prompt was "Pretend you are a physician writing an exam. Please answer every question to the highest degree of medical accuracy. Provide detailed reasoning for your answer. Use the context provided to supplement your knowledge base." The content comprised Wikipedia article sections and PubMed abstracts sorted by cosine similarity with a maximum token length of 2000. Context-enriched GPT output had a maximum token length of 2000.

Cosine similarity between user input and structured data set content was used to obtain relevant context enrichment for GPT via OpenAI’s embedding model, text-embedding-ada-002. More precisely, given user input as an embedding vector U and embedding within a structured data set E, the cosine similarity between U and E was defined as the dot product between U and E divided by the product of the magnitudes of U and E.
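Putting the two preceding steps together, a minimal sketch of retrieval-then-prompt construction might look as follows. Here `rows` is assumed to carry the precomputed "Tokens" and "Embeddings" columns of the structured data set, and the cosine similarity follows the definition above; in the real system the user input would first be embedded with text-embedding-ada-002, whereas here `user_embedding` is simply supplied.

```python
import math

def cosine_similarity(u, e):
    # dot(U, E) / (|U| |E|), per the definition above.
    dot = sum(a * b for a, b in zip(u, e))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in e))
    return dot / norms if norms else 0.0

def build_prompt(priming_prompt, user_input, user_embedding, rows, max_tokens=2000):
    # Rank data set rows by similarity to the user input, then pack the most
    # similar "Content" entries into the prompt until the token budget is hit.
    ranked = sorted(rows,
                    key=lambda r: cosine_similarity(user_embedding, r["Embeddings"]),
                    reverse=True)
    content, used = [], 0
    for row in ranked:
        if used + row["Tokens"] > max_tokens:
            break
        content.append(row["Content"])
        used += row["Tokens"]
    # Prompt = priming prompt + content + user input, per the structure above.
    return priming_prompt + "\n" + "\n".join(content) + "\n" + user_input
```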

Web Application, Memory, and Inclusion of Sources

A web application was developed to create a user chat interface for the context-enriched GPT model using the Python Flask micro web framework23 paired with HTML, CSS, and JavaScript. We call this platform neuroGPT-X. User messages are permitted up to 1500 characters (approximately 210–380 words). The key features of neuroGPT-X include conversation memory of the current session, in-text citations and references for its responses, restriction of conversations to relevant medical topics, timestamps for all messages, animated loading icons while neuroGPT-X computes its response, and prevention of user input spam.

Memory was implemented using Conversation Summary Buffer Memory from LangChain, a framework that provides a standard memory interface with LLMs. This type of memory summarizes both new and old interactions in memory and keeps a buffer of recent interactions. In our implementation, the buffer memory had a maximum token limit of 750. Logistically, we save a serialized Conversation Summary Buffer Memory via the Python pickle module after neuroGPT-X completes its response, and we load the memory upon new user inputs. This memory file is deleted after the user completes or terminates each session.
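A stdlib-only sketch of the summary-buffer idea follows (the platform itself uses LangChain's Conversation Summary Buffer Memory, not this code): recent turns are kept verbatim, and once the buffer exceeds its token limit, the oldest turns are folded into a running summary. The `summarize` and `count_tokens` functions are stubs standing in for an LLM summarization call and a real tokenizer.

```python
# Minimal summary-buffer memory: verbatim recent turns plus a rolling
# summary of evicted older turns.

def count_tokens(text):
    # Rough whitespace-based stand-in for a real tokenizer.
    return len(text.split())

def summarize(summary, turns):
    # Placeholder for an LLM call that condenses old turns into the summary;
    # here we just keep a truncated trace of each evicted turn.
    return (summary + " " + " ".join(t[:20] for t in turns)).strip()

class SummaryBufferMemory:
    def __init__(self, max_token_limit=750):
        self.max_token_limit = max_token_limit
        self.summary = ""   # condensed record of older interactions
        self.buffer = []    # recent turns kept verbatim

    def add_turn(self, user_msg, ai_msg):
        self.buffer.append(f"User: {user_msg}\nAI: {ai_msg}")
        # Evict the oldest turns into the summary while over the token limit.
        while (sum(count_tokens(t) for t in self.buffer) > self.max_token_limit
               and len(self.buffer) > 1):
            self.summary = summarize(self.summary, [self.buffer.pop(0)])

    def context(self):
        # What gets prepended to the next prompt.
        return (self.summary + "\n" + "\n".join(self.buffer)).strip()
```

A real implementation would also persist the memory between requests (the platform serializes it with the Python pickle module and deletes the file when the session ends).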

On the backend of neuroGPT-X, the prompt was constructed with the following primer:

Pretend you are a specialist physician. Answer every question to the highest degree of medical accuracy. Provide detailed reasoning for your answer. Use the context provided to supplement your knowledge base. Cite the relevant context in IEEE style using the enumerated context before punctuation, e.g., [1], [3, 5]. Ignore the context if it is not relevant to the question. Do not create fake citations or more citations than were provided. Write "This is outside the scope of my functionality." if the question does not relate to vestibular schwannoma or neurosurgery.

Note that the Institute of Electrical and Electronics Engineers (IEEE) style is a citation style. The context was provided using an enumerated list of relevant articles sorted by cosine similarity to the user input up to a token length of 1500. The neuroGPT-X model was given API completion parameters of temperature = 0, model = text-davinci-003, max_tokens = 1500, and top_p = 1.

Study Methodology

A neurosurgeon with > 30 years of experience with active clinical practice in vestibular schwannoma curated 15 general questions regarding anatomy, surgical management, imaging, clinical contexts, and genetic predispositions (Supplementary Table 1). These questions were answered by 1) the January 30, 2023, ChatGPT model, termed naive GPT; 2) context-enriched GPT; and 3) 4 experienced neurosurgeons active in the treatment of vestibular schwannoma, hereafter referred to as experts. The context-enriched GPT model had API completion parameters of temperature = 0, model = text-davinci-003, max_tokens = 2000, and top_p = 1.

The 4 expert surgeons were asked to answer the 15 curated questions ad hoc as if speaking to residents or patients while timing themselves. The same questions were independently posed to naive GPT and context-enriched GPT, and response times were recorded. The expert, naive GPT, and context-enriched GPT responses were reviewed and compiled by an independent investigator. The evaluation of these answers was then divided into blinded and unblinded phases.

In the blinded phase, 3 independent neurosurgeons blindly evaluated the responses from naive GPT, context-enriched GPT, and the 4 experts (who were separate from the 3 evaluators). The evaluation metrics included accuracy, coherence, relevance, thoroughness, and overall rating on a 0–4 Likert scale, with 4 indicating a better answer (Supplementary Table 2). The evaluators were also asked whether they thought the response was provided by an expert or naive GPT/context-enriched GPT.

After the initial evaluations, the experts and evaluators were unblinded to the naive and context-enriched GPT responses. In this phase, all 7 expert surgeons were asked questions regarding the GPT-generated responses, including overall satisfaction, likelihood of use in the clinic, value, and likelihood they would recommend the tool to colleagues via a Likert scale from 0 to 4, with 4 indicating a better answer (Supplementary Table 2). The evaluators were also asked long-form questions about their thoughts on GPT in clinical practice.

Prospectively, GPT-4 was used to power the context-enriched model, and it was evaluated on a set of 103 consensus statements from Carlson et al.,24 which provide information regarding hearing preservation with radiosurgery (n = 16), microsurgery (n = 20), and observation (n = 13), along with tumor control and imaging surveillance (n = 20), preferred treatment (n = 24), operative considerations (n = 4), and complications (n = 6).

Statistical Analysis

Statistical analyses were performed using Python version 3.9 and visualized in GraphPad Prism. Statistical significance was set at an alpha of 0.05. Krippendorff’s alpha was computed to measure interrater agreement between evaluators in the blinded phase. In the blinded phase, aggregate scores and timing measures were compiled across all questions for naive GPT, context-enriched GPT, and the neurosurgeon experts. The normality of the data was evaluated using the Shapiro-Wilk test, and the Kruskal-Wallis test, followed by adjustment for multiple comparisons when significant, was used to compute statistical differences between responses. Data were visualized to correspond to the dimensions of the evaluation metrics, i.e., accuracy, coherence, relevance, thoroughness, and overall rating. In the affective phase, pie charts were used to visualize the expert perceptions of the clinical utility of the GPT-generated responses.
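For reference, the tie-corrected Kruskal-Wallis H statistic underlying the group comparison can be sketched in pure Python. This toy omits the subsequent p-value lookup (H is compared against a chi-square distribution with k − 1 degrees of freedom) and the multiple-comparison adjustments; it is not the study's analysis code.

```python
from itertools import chain

def rankdata(values):
    # Assign 1-based ranks, averaging ranks across ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j + 2) / 2  # mean of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def kruskal_h(*groups):
    # H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1), divided by a tie correction.
    data = list(chain.from_iterable(groups))
    n = len(data)
    ranks = rankdata(data)
    h, idx = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[idx:idx + len(g)])
        h += rank_sum ** 2 / len(g)
        idx += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over tie-group sizes t.
    counts = {}
    for v in data:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1 - ties / (n ** 3 - n)
    return h / correction if correction else h
```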

Results

Data Set Curation, Thematic Analysis, and neuroGPT-X

Web scraping with the Wikipedia API returned 157 articles, which were further divided into 1659 sections separated by headings. The findpapers Python application returned 4004 publications from PubMed, 15 publications from the Association for Computing Machinery (ACM) Digital Library, 19 publications from arXiv, 4 publications from bioRxiv, and 3 papers from medRxiv. Four publications flagged as published in potentially predatory journals were removed from the PubMed search. Thus, the total number of Wikipedia articles and publications was 4198. Figure 1 visualizes these results.

FIG. 1.

Overview of the study pipeline and data set characteristics. A: The data-processing pipeline and the 3 comparisons in this study. B: Data inclusion and exclusion criteria. C: Thematic analysis of the embedding vectors of the vestibular schwannoma data set. Clusters were computed using K-means clustering. The InstructGPT Davinci model was used to classify each cluster. D: Peer-reviewed abstract publication numbers by year and type. tSNE = t-distributed stochastic neighbor embedding. Figure is available in color online only.

In thematic analyses, the elbow method returned an optimal number of 11 clusters with a silhouette coefficient of 859.35. The InstructGPT Davinci model produced a thematic summary of each cluster, and the keywords associated with clusters 1–11 are shown in Supplementary Table 3. Figure 2 shows the neuroGPT-X user interface and an example of an interaction with the platform. An entire conversation, including answers to the 15 curated questions in this study, is included in the supplementary data (Supplementary Table 4). The website can be accessed here: https://neurogpt-x.azurewebsites.net/.

FIG. 2.

The neuroGPT-X chat interface. The features include conversation memory of the current session, references for the information it provides, restriction of conversations to relevant medical topics, timestamp for each message, animated loading icons while responses are computed, and prevention of user input spam. The website can be accessed here: https://neurogpt-x.azurewebsites.net/. Figure is available in color online only.

Evaluation of Naive GPT, Context-Enriched GPT, and Expert Neurosurgeons

The mean ± SD response times aggregated across all questions were 63.33 ± 66.08 seconds for neurosurgeon gamma, 30.00 ± 30.88 seconds for neurosurgeon delta, 42.00 ± 22.1 seconds for neurosurgeon epsilon, 134.67 ± 187.15 seconds for neurosurgeon zeta, 49.03 ± 10.67 seconds for naive GPT, and 16.67 ± 9.29 seconds for enriched GPT. The timing normalized by character count was 513.9 ± 140 msec/character for neurosurgeon gamma, 380.7 ± 820 msec/character for neurosurgeon delta, 694.9 ± 710 msec/character for neurosurgeon epsilon, 365.0 ± 240 msec/character for neurosurgeon zeta, 36.5 ± 10 msec/character for naive GPT, and 19.8 ± 10 msec/character for neuroGPT-X. Pairwise comparisons of normalized response speed demonstrated that both models were significantly faster than the expert neurosurgeon responses (p < 0.01), with both models having very similar response times (p > 0.999). These results are visualized in Fig. 3A and tabulated in Supplementary Table 5.

FIG. 3.

Mean blinded evaluation scores from 3 expert neurosurgeons. Responses to 15 curated questions were measured on the basis of normalized response speed by response character length (A), coherence score (B), thoroughness score (C), relevance score (D), accuracy score (E), and overall performance score (F). Gamma, Delta, Epsilon, and Zeta represent the individual neurosurgeons. Error bars represent standard deviation.

The evaluations by 3 additional neurosurgeons demonstrated that the naive and enriched GPT models were statistically noninferior and often superior to the responses obtained from expert neurosurgeons. Kruskal-Wallis 1-way ANOVA showed significant differences in performance across accuracy, coherence, relevance, thoroughness, and overall performance metrics (p < 0.0001). The Kruskal-Wallis test corrected for multiple comparisons was used to assess pairwise performance differences across responses and demonstrated the consistently noninferior performance of the naive and enriched GPT models compared with that of the expert neurosurgeons across all metrics. The models outperformed most expert neurosurgeon responses for coherence and thoroughness. Finally, the models had weaker scores in relevance, though both remained noninferior to expert neurosurgeon responses. A tabular summary of the results can be seen in Table 1 and is visualized in Fig. 3B–F. The mean ± SD Krippendorff’s alpha was 0.41 ± 0.05 for accuracy, 0.31 ± 0.03 for coherence, 0.35 ± 0.02 for relevance, 0.35 ± 0.05 for thoroughness, and 0.52 ± 0.01 for overall performance (Table 1). A visualization of the unblinded affective survey results is shown in Fig. 4.

TABLE 1.

Blinded evaluation scores from 3 expert neurosurgeons

Comparison columns give p values for pairwise comparisons with the respondent named in the column header.

| Criterion | Respondent* | Mean (SD) | Agreement (alpha) | vs GPT Naive | vs GPT Enriched | vs Delta | vs Epsilon |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy | GPT naive | 3.36 (0.84) | 0.35 |  |  |  |  |
|  | GPT enriched | 3.09 (1.1) | 0.43 | >0.9999 |  |  |  |
|  | Delta | 2.82 (1.09) | 0.40 | 0.2486 | >0.9999 |  |  |
|  | Epsilon | 2.49 (1.01) | 0.44 | 0.0009 | 0.0474 | >0.9999 |  |
|  | Gamma | 2.5 (1.19) | 0.49 | 0.0035 | 0.1312 | >0.9999 | >0.9999 |
|  | Zeta | 2.82 (1.09) | 0.35 | 0.2534 | >0.9999 | >0.9999 | >0.9999 |
| Coherence | GPT naive | 3.39 (0.69) | 0.22 |  |  |  |  |
|  | GPT enriched | 3.11 (1.19) | 0.28 | >0.9999 |  |  |  |
|  | Delta | 3.09 (0.85) | 0.32 | >0.9999 | >0.9999 |  |  |
|  | Epsilon | 2.64 (0.93) | 0.39 | 0.0028 | 0.0518 | 0.4424 |  |
|  | Gamma | 2.61 (0.84) | 0.38 | 0.0007 | 0.0165 | 0.1755 | >0.9999 |
|  | Zeta | 2.56 (1.1) | 0.28 | 0.0017 | 0.0352 | 0.3258 | >0.9999 |
| Relevance | GPT naive | 3.09 (1.01) | 0.27 |  |  |  |  |
|  | GPT enriched | 2.87 (1.29) | 0.31 | >0.9999 |  |  |  |
|  | Delta | 2.76 (1.11) | 0.35 | >0.9999 | >0.9999 |  |  |
|  | Epsilon | 2.51 (1.01) | 0.36 | 0.1058 | 0.6264 | >0.9999 |  |
|  | Gamma | 2.23 (1.31) | 0.43 | 0.0114 | 0.099 | 0.8648 | >0.9999 |
|  | Zeta | 3.04 (1.07) | 0.35 | >0.9999 | >0.9999 | >0.9999 | 0.1692 |
| Thoroughness | GPT naive | 3.25 (0.92) | 0.26 |  |  |  |  |
|  | GPT enriched | 2.98 (1.19) | 0.32 | >0.9999 |  |  |  |
|  | Delta | 2 (1.22) | 0.34 | <0.0001 | 0.0019 |  |  |
|  | Epsilon | 1.78 (0.95) | 0.39 | <0.0001 | <0.0001 | >0.9999 |  |
|  | Gamma | 1.64 (1.1) | 0.44 | <0.0001 | <0.0001 | >0.9999 | >0.9999 |
|  | Zeta | 2.56 (1.37) | 0.29 | 0.1503 | >0.9999 | 0.3219 | 0.014 |
| Overall performance | GPT naive | 2.54 (1.4) | 0.57 |  |  |  |  |
|  | GPT enriched | 2.29 (1.53) | 0.57 | >0.9999 |  |  |  |
|  | Delta | 2 (0.99) | 0.44 | 0.7223 | >0.9999 |  |  |
|  | Epsilon | 1.43 (0.91) | 0.51 | 0.0005 | 0.0097 | 0.4067 |  |
|  | Gamma | 1.37 (0.94) | 0.59 | 0.0002 | 0.0048 | 0.2421 | >0.9999 |
|  | Zeta | 2.1 (1.19) | 0.42 | >0.9999 | >0.9999 | >0.9999 | 0.104 |

Boldface type indicates statistical significance (p < 0.05).

Gamma, Delta, Epsilon, and Zeta represent individual neurosurgeons.

FIG. 4.

Unblinded survey metrics of all 7 neurosurgeon experts and evaluators regarding a context-enriched GPT model measuring overall satisfaction (A), likelihood of recommendation to colleagues (B), likelihood of use in the clinic (C), and perceived value of the tool relative to current resources (D).

The GPT-4–powered context-enriched model agreed with 98 of the 103 consensus statements (95%) from Carlson et al.24 The model had an agreement of 15/16 (94%) for hearing preservation with radiosurgery, 19/20 (95%) for hearing preservation with microsurgery, 11/13 (85%) for hearing preservation with observation, 19/20 (95%) for tumor control and imaging surveillance, 24/24 (100%) for preferred treatment modality, 4/4 (100%) for operative considerations, and 6/6 (100%) for operative complications. Table 2 highlights correct and incorrect examples alongside reasoning from the context-enriched GPT-4 model.

TABLE 2.

Correct and incorrect classification examples of the GPT-4–powered context-enriched model on a set of 103 consensus statements regarding the management of vestibular schwannoma

Statement: When performed by an experienced surgeon, the overall probability of maintaining serviceable hearing (≤50 dB PTA, ≥50% WRS) following gross-total microsurgical resection of an intracanalicular vestibular schwannoma using the middle fossa approach is 50–70% immediately after surgery, 40–60% at 2 yrs, 40–60% at 5 yrs, & 20–40% at 10 yrs following surgery.

Response:* I agree with the statement. The studies provided consistently show that a significant proportion of patients maintain serviceable hearing following microsurgical resection of vestibular schwannoma using the middle fossa approach. The percentages mentioned in the statement are in line with the results from the studies, which show that 55-75% of patients maintain serviceable hearing immediately after surgery, and 71.8-86.2% at 5 years. Although there is no specific data provided for 2 years and 10 years, the statement’s percentages seem reasonable based on the available data. It is important to note that the success of the surgery and hearing preservation is highly dependent on the surgeon’s experience and skill.

Statement: The max posterior fossa diameter size of a vestibular schwannoma that can be safely managed w/ single-fraction radiosurgery is 3 cm in greatest posterior fossa diameter (vol 14 cm3). Factors including 4th ventricle distortion, T2 signal change in the brainstem, symptomatic brainstem or cerebellar compression, or evidence of hydrocephalus indicate an elevated risk of complications w/ radiosurgery.

Response:* I agree with the statement, with some reservations. The studies mentioned in the context provide evidence that single-session radiosurgery can be a successful treatment for vestibular schwannomas up to 3 cm in greatest posterior fossa diameter (volume of 14 cm3). However, it is important to note that the tumor control rates and cranial nerve morbidity are lower for larger vestibular schwannomas managed with radiosurgery compared to smaller ones.

The factors mentioned in the statement, such as fourth ventricle distortion, T2 signal change in the brainstem, symptomatic brainstem or cerebellar compression, or evidence of hydrocephalus, do indicate an elevated risk of complications with radiosurgery. These factors should be carefully considered when deciding on the appropriate treatment for a patient with a vestibular schwannoma.

It is also important to consider the individual patient’s age, medical comorbidities, and preferences when deciding on the best treatment approach. In some cases, surgery may be a more appropriate option, especially for younger patients with minimal symptoms of tumor mass effect.

In conclusion, while single fraction radiosurgery can be a safe and effective treatment for vestibular schwannomas up to 3 cm in greatest posterior fossa diameter, it is crucial to consider the individual patient’s circumstances and the potential risks associated with radiosurgery. A multidisciplinary approach involving neurosurgeons, radiation oncologists, and other specialists is essential for determining the best treatment plan for each patient.

Statement: The estimated lifetime risk of recurrence after gross-total resection of a vestibular schwannoma in a hearing preservation operation is ≤5%.

Response:* I disagree with the statement.

Reasoning: The statement suggests that the estimated lifetime risk of recurrence following gross total resection of a vestibular schwannoma in a hearing preservation operation is ≤5%. However, the provided context indicates that the recurrence rate may be higher than this estimate.

In one study, the long-term retrospective analysis after translabyrinthine approach indicated a 9.2% recurrence rate. Another study reported a tumor recurrence rate of 2.3% during a mean follow-up period of 6.4 years, with an average time to recurrence diagnosis of 8.8 years. In a different study, 83% of patients with residual tumors following incomplete resection did not experience tumor growth during a mean follow-up of 69 months.

These findings suggest that the risk of recurrence following vestibular schwannoma resection may be higher than 5%, especially in cases with residual tumors or incomplete resection. Additionally, the risk of recurrence may depend on factors such as the surgical approach, tumor size, and patient’s condition. Therefore, it is important to consider these factors when estimating the lifetime risk of recurrence for individual patients.

PTA = pure tone average; WRS = word recognition score.

*The raw output from GPT is shown.

Discussion

In this study, we demonstrated that LLM-based model responses to 15 common subspecialist-level questions on vestibular schwannoma were statistically noninferior to those of experienced neurosurgeons. Moreover, we enriched GPT-3 with relevant, publicly available information via filtered abstracts from scientific publications to supplement the model’s knowledge base and provide reference material. To elicit expert-level responses from the model, a distinct priming prompt was employed, which differed from the prompt provided to the neurosurgeons. To bolster accessibility, functionality, and safety, we developed neuroGPT-X, incorporating a zero-shot learning framework with the GPT-3 API into a feature-enriched web application. These features included conversation memory, in-text citations, full references, restriction of conversations to relevant medical topics, conversation timestamps, and spam prevention. To our knowledge, this is the first study to show the subspecialist-level performance of a context-enriched LLM.
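The context-enrichment step described here can be sketched as a retrieve-then-prompt pipeline. The function names, the toy 2-dimensional embeddings, and the wording of the priming prompt below are illustrative assumptions, not the study's actual implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_enriched_prompt(question, question_vec, abstracts, top_k=3):
    """Rank filtered abstracts by similarity to the question and prepend
    the best matches as context for the model."""
    ranked = sorted(abstracts, key=lambda a: cosine(question_vec, a["vec"]), reverse=True)
    context = "\n\n".join(f"[{a['id']}] {a['text']}" for a in ranked[:top_k])
    # Illustrative priming prompt, not the exact one used in the study.
    return ("You are a neurosurgery subspecialist. Using only the abstracts "
            f"below, answer with in-text citations.\n\n{context}\n\n"
            f"Question: {question}")

# Toy example with 2-dimensional stand-in embeddings.
abstracts = [
    {"id": 1, "text": "Hearing preservation after microsurgery...", "vec": [0.9, 0.1]},
    {"id": 2, "text": "Radiosurgery tumor control rates...", "vec": [0.1, 0.9]},
]
prompt = build_enriched_prompt("Does microsurgery preserve hearing?",
                               [1.0, 0.0], abstracts, top_k=1)
print(prompt)
```

In a live system the assembled prompt would then be sent to the model API; only the retrieval and prompt-assembly logic is shown here.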

The observation that response coherence and thoroughness were rated significantly higher in both GPT models compared with neurosurgeon experts suggests the importance of integrating domain-specific knowledge in the GPT models to improve response utility. Similarly, accuracy, relevance, and overall performance metrics were roughly equivalent between the GPT models and neurosurgeons. Interestingly, the GPT models were rated as having "minimal inaccuracies," while the neurosurgeons were rated as having "some inaccuracies" to "minimal inaccuracies." Furthermore, the GPT models and neurosurgeons were rated as having "average relevance" to being "relatively relevant." In the unblinded survey, there was a mix of responses across affective metrics, though the majority expressed positive opinions on GPT (Fig. 4). The relative superiority of GPT may relate to the impromptu nature of the expert surgeon responses, i.e., as if they were speaking to a trainee and not at a board examination. The GPT-4 context-enriched model also showed promising results, agreeing with 98 of the 103 consensus statements from Carlson et al.24 These results further suggest that LLMs bear promise as a point-of-care clinical aid.

LLMs in the Clinic

Clinical decision-making is a complex process that involves the application of practitioner knowledge and the integration of disease-, patient-, and system-level factors learned over time.25,26 Traditional methods of clinical decision-making can be hindered by limited access to relevant information, the need to integrate large amounts of complex information, and the possibility of human error, particularly in a fast-paced clinical environment.25,26 This may lead to variations in clinical practice among institutions, practitioners, subspecialties (e.g., neurosurgery, otolaryngology, radiology, radiation oncology), and patient perceptions.27–29

The introduction of neuroGPT-X offers a unique tool for clinicians by providing rapid access to a wealth of information that includes patient-related findings and evidence-based guidelines, all integrated, summarized, and presented in a tailored and comprehensive manner.30 Such a tool can support decision-making by providing standardized processes and assisting practitioners with point-of-care informational aids.28,31,32 Interestingly and perhaps as expected, neuroGPT-X was rated as having more comprehensive responses than the neurosurgeons, whose answers reflected problems or nuances associated with the care of an individual patient rather than factual information that can be extracted from the literature. Nonetheless, clinicians retain and reference relevant textbook or literature information in their daily practice and, in particular, have knowledge reinforced by their clinical experiences.33 It is well known that human memory and recall are limited and can easily be overwhelmed while also being susceptible to biases and heuristics, which can influence information recall.5,29,34–37 Through rapid processing of current publications, neuroGPT-X provides clinicians with information that would otherwise be impossible to read or assimilate.

Furthermore, the neuroGPT-X framework is transferable between LLMs, allowing the potential for rapid advancements in the performance of artificial intelligence–based medical decision-making systems. For example, integrating the framework with future iterations of BioGPT12 may benefit from improved clinical communications, as BioGPT was trained on millions of biomedical research articles to perform tasks such as answering questions, data extraction, and text generation. Indeed, as a framework with continual selective data input, neuroGPT-X enables the model and, by extension, its users to adapt to incoming discoveries, guidelines, and evidence-based best practices.
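Transferability between LLMs amounts to programming the framework against a narrow model interface so that backends can be swapped. A minimal sketch, in which the `Backend` protocol and `EchoBackend` stand-in are assumed names for illustration only:

```python
from typing import Protocol

class Backend(Protocol):
    """The only surface the framework needs from any LLM."""
    def generate(self, prompt: str) -> str: ...

class EchoBackend:
    """Stand-in for a real model client (e.g., a GPT or BioGPT wrapper)."""
    def generate(self, prompt: str) -> str:
        return f"answer to: {prompt}"

def ask(backend: Backend, context: str, question: str) -> str:
    # The enrichment logic stays the same no matter which model answers.
    return backend.generate(f"{context}\n\nQ: {question}")

print(ask(EchoBackend(), "abstract text", "What is the recurrence risk?"))
```

Swapping in a future model then means writing one new class that satisfies `Backend`, leaving the enrichment pipeline untouched.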

LLMs With a Word of Caution

Upon unblinding the evaluation results and responses, the expert surgeons shared that the context-enriched GPT was "quite useful in clinical practice in the present age when physicians have to deal with huge amounts of data and knowledge" and that the answers "appear to be well backed up by evidence." However, several salient concerns and questions were raised regarding the model’s limitations or interpretations (Table 3). The concerns, in part, may reflect a lack of level I evidence in managing vestibular schwannoma, with surgeons and sites opting for center-specific preferences, e.g., surgery versus radiosurgery versus observation.20 The GPT model relied primarily on the available literature with possible biases. Indeed, if neuroGPT-X were to be advanced toward a reliable and clinically usable tool, relevant thematic breadth and depth of information would need to be incorporated. Moreover, the model must continually mature through an automated feed of filtered, selected, and complete (full paper) literature sources. Similarly, safeguards against online abuse, malware, and spam queries to the model must be duly incorporated.
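The spam-prevention safeguard mentioned here can be approximated with a sliding-window rate limiter; the limit and window values below are arbitrary assumptions, not neuroGPT-X's actual settings:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` messages per user within `window` seconds."""
    def __init__(self, limit=5, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # user id -> timestamps of recent messages

    def allow(self, user, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[user]
        while q and now - q[0] > self.window:  # drop hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False  # over budget: reject this message
        q.append(now)
        return True

rl = RateLimiter(limit=3, window=60.0)
decisions = [rl.allow("u1", now=t) for t in (0, 1, 2, 3)]
print(decisions)  # [True, True, True, False]
```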

TABLE 3.

Expert concerns from the unblinded study phase with consensus recommendations

Expert Concerns | Discussion, Risk Mitigation Strategies, & Future Directions
"Garbage in, garbage out" pitfall: Contextual data are inaccurate, incomplete, biased, or unreliable. As a result, predictions generated by LLMs may be unreliable, leading to poor decision-making.
Data preprocessing & curation: Engaging domain-specific experts in curating training data to develop accurate & reliable LLMs.

Addressing biases in the literature: Engaging domain-specific experts in assessing whether training data (& the available literature broadly) include a representative sample of patients w/ different ages, sex, disease severities, comorbidities, etc.

Differential context enrichment weightage: Applying higher weights during context enrichment to papers w/ more citations or those published in higher–impact factor journals on the basis of the assumption that these papers are more influential or of higher quality.
Context enrichment & token length limitations: LLMs may not generate the full range of opinions to capture the complexity of the underlying literature. Moreover, current LLMs have input limits that restrict their capacity to understand longer texts.
Memory networks & attention mechanism: Incorporating memory modules that store information from previous inputs paired w/ an attention mechanism that selectively focuses on relevant parts of the memory when processing new inputs.

Previous prompt summary: Summarizing previous inputs to reduce the required amount of contextual input.

Longer context lengths: Increasing the token length for LLMs, enabling them to process full articles rather than abstracts or shorter texts.
Risk of medical mistakes, malpractice, or misuse: LLMs may provide unsafe recommendations. Furthermore, medical professionals & insurance providers may overly rely on LLMs w/o applying sufficient clinical judgment.
LLM recommendations as a tool: Ensuring LLM-generated recommendations are decision-making aids & not a replacement for clinical judgment, which takes individual patient contexts into account.

Transparent & interpretable LLMs: Providing citations used in the LLM responses.

Benchmarking & clinical validation: Validating LLM performance w/ real-world patient data across diverse clinical scenarios & comparing their performance to existing medical knowledge databases & human subspecialists.

Appropriate use policies: Incorporating actionable & evolving medicolegal dialogue & policies regarding LLM use in healthcare; restriction of unrelated, nonmedical questions; filtering toxic inputs & outputs.

Patient-facing education: Educating patients about the changing nature of medical technology & knowledge, e.g., drawing comparisons btwn new medical knowledge & medical texts. Emphasizing the importance of physician consultation.
Standardized recommendations vs tailored treatment: Indiscriminately following LLM recommendations overlooks a patient’s context, including social & psychological factors such as race, sex, & socioeconomic status, all of which impact outcomes.
Awareness of LLM training limitations: Training clinicians on appropriate LLM use should emphasize that these models, trained on large data sets, do not fully capture the complexity of individual patients or their specific medical conditions.

Integration of multiple investigation modalities: Integrating nontextual data (e.g., radiographic evidence) into LLMs; potential solutions include conversion into textual data or training multimodal models.

Proliferation in precision healthcare literature: Incorporating the expanding evidence base for personalized treatment in LLM training can enable tailored responses to individual patients & their unique medical needs.

Tailoring responses to diverse audiences: Training or priming LLMs to communicate to various audiences, e.g., patients who may benefit from a similar tool w/ accessible language.
Uncertainty associated w/ LLM responses: Clinicians may find it challenging to determine the significance of LLM recommendations & how to incorporate their own judgment w/ LLMs.
Probability scoring of responses: Integrating factors such as the relevance & reliability of input data, training history, & complexity of the prompt to indicate the level of certainty of the recommendation; LLMs should provide alternative recommendations w/ varying levels of confidence, enabling clinicians to evaluate a range of potential options.

Incorporating clinician’s feedback: Encouraging clinicians to report inaccuracies in LLM recommendations to aid in the continuous improvement of LLM accuracy.
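One mitigation listed above, summarizing or trimming previous inputs to respect token length limits, can be sketched as a sliding window over conversation turns. The whitespace word count is a crude stand-in for a real tokenizer, and the budget is an arbitrary choice:

```python
def trim_history(turns, budget=50):
    """Keep the most recent conversation turns whose combined length
    fits a token budget; older turns would be summarized instead."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = len(turn.split())          # crude token estimate
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order

history = ["first question " * 10, "second question " * 10, "latest question"]
trimmed = trim_history(history, budget=25)
print(trimmed)
```

A production system would replace the word count with the model's own tokenizer and summarize (rather than drop) the evicted turns.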

Although the ability of LLMs to exceed human performance in clinical acumen is promising, it is critical to ensure that their responses are safe and reliable. LLMs are known to "hallucinate," which refers to the generation of factually incorrect statements.38,39 More specifically, LLMs operate on the principle of "next best guess," i.e., maximum likelihood estimation based on the probability distributions they learned via training. This raises questions about the nature of decision-making, both human and artificial. For example, human surgeons do not need a precise understanding of the trajectory of their joints when reaching out to grab a scalpel; they make a series of educated guesses based on their training and experience. Similarly, LLMs make estimates based on the probability distributions learned from the training data. However, LLMs cannot yet understand the implications of their decisions in the same way as humans. Therefore, it is essential to ensure that the responses provided by LLMs are accurate, safe, and reliable, a factor we attempted to address in neuroGPT-X through the inclusion of references and in-text citations.
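The "next best guess" principle reduces, in its simplest (greedy) form, to emitting the highest-probability token from the model's learned distribution; the probabilities below are invented for illustration:

```python
# A toy next-token distribution a model might assign after the prompt
# "vestibular" (the probabilities are invented for illustration).
next_token_probs = {"schwannoma": 0.72, "nerve": 0.15, "system": 0.08, "migraine": 0.05}

# Greedy (maximum-likelihood) decoding: always emit the argmax token.
best = max(next_token_probs, key=next_token_probs.get)
print(best)  # schwannoma
```

The argmax token is emitted whether or not the continuation is factually correct, which is one route to the hallucinations described above.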

Human Superiority Over LLMs

Human memory is not solely storage and recall. It is also closely tied to experiences, emotions, intuition, and beliefs, all of which shape worldviews. This multifaceted nature of human memory may be challenging for LLMs to replicate. Moreover, human memory has advantages over LLMs in information processing and ideating. For example, humans routinely engage in creative thinking and problem-solving, enabling them to generate new ideas and innovative solutions. In contrast, LLMs are limited to answer generation based on their training data set and the context supplied by their creator; their reliance on probability distributions also makes them more susceptible to inaccurate "hallucinations" or guessing.38,39 In this sense, LLMs fundamentally differ from human beings regarding their decision-making processes and abilities. These differences in memory, information processing, and recall necessitate the coupling of LLM technologies such as neuroGPT-X with experienced practitioners to ensure safe and reliable practices in the clinical setting.

Future Directions

Many exciting possibilities exist for the ongoing improvement, evolution, and integration of LLMs in the clinical environment. A promising area is the development of multimodal LLMs, which can interpret and integrate numerous inputs such as images, text, and other sensory data.40,41 Many medical specialties, including neurosurgery, rely heavily on multiple information modalities (e.g., imaging, patient demography, genetics, comorbidities) to make well-reasoned decisions. Furthermore, the integration of clinical practice guidelines will not only ensure that LLMs are consistent with the latest standards of care but also help to address the limitations of incorporating patient preferences and values. Therefore, incorporating this information into decision-making can lead to a holistic care plan more attuned to a patient-specific clinical scenario. Continual growth and maturity of the model would also ensure up-to-date answers consistent with the latest standards of care. In particular, incorporating patient-related factors, such as preferences and values, into the model would further refine the reliability and inclusivity of LLMs.42–44

As with all healthcare informatics, it is paramount to consider data security and confidentiality in applying LLMs to healthcare.45 The neuroGPT-X model, hosted on a secure Microsoft cloud platform, already benefits from the security and compliance measures that the platform provides. From a systems-level perspective, a targeted effort to define the scope and policies regarding LLMs in the healthcare setting must be considered as the platform becomes globally accessible.

Conclusions

Our study presents a responsive and clinically relevant artificial intelligence tool, neuroGPT-X, utilizing context enrichment of the OpenAI GPT platform. The framework is readily transferable to other LLMs, opening possibilities for expanding its utility across neurosurgical diseases and other medical subspecialties. The study further highlights the importance of responsible artificial intelligence development and the potential for machine intelligence to enhance clinical judgment while recognizing its limitations to ensure safe implementation into clinical practice.

Disclosures

The domain name neuroGPT-X.com was purchased and is owned by Guo, Sinha, and Drs. Gupta, Lama, and Sutherland. This research was not funded.

Author Contributions

Conception and design: Sutherland, Guo, Gupta, Sinha, de Lotbiniere-Bassett, Lama. Acquisition of data: Sutherland, Guo, Gupta, Rössler, Tatagiba, Akagami, Al-Mefty, Sugiyama, Stieg, Pickett, Singh. Analysis and interpretation of data: Sutherland, Guo, Gupta, Sinha, Tatagiba. Drafting the article: Guo, Gupta, Sinha, de Lotbiniere-Bassett, Lama. Critically revising the article: Sutherland, Guo, Gupta, Sinha, Rössler, Tatagiba, Akagami, Al-Mefty, Sugiyama, Pickett, de Lotbiniere-Bassett, Lama. Reviewed submitted version of manuscript: Sutherland, Guo, Gupta, Sinha, Rössler, Tatagiba, Akagami, Sugiyama, Pickett, de Lotbiniere-Bassett, Lama. Approved the final version of the manuscript on behalf of all authors: Sutherland. Statistical analysis: Guo, Gupta, Sinha. Administrative/technical/material support: Sutherland, Lama. Study supervision: Sutherland, Guo, Gupta, Lama. Study participant/contributed data: Akagami.

Supplemental Information

Online-Only Content

Supplemental material is available with the online version of the article.

Preprint Server

An earlier version of this article can be found on a preprint server.

Preprint server name: medRxiv.

Preprint DOI: 10.1101/2023.02.25.23286117.

References

  • 1

    Dillon M. Introduction to modern information retrieval. Inf Process Manage. 1983;19(6):402–403.

  • 2

    Landhuis E. Scientific literature: Information overload. Nature. 2016;535(7612):457–458.

  • 3

    Charbonneau DH, James LN. DynaMed Plus®: an evidence-based clinical reference resource. Med Ref Serv Q. 2018;37(2):168–176.

  • 4

    Albarqouni L, Hoffmann T, Straus S, et al. Core competencies in evidence-based practice for health professionals: consensus statement based on a systematic review and Delphi survey. JAMA Netw Open. 2018;1(2):e180281.

  • 5

    Cowan N. The magical mystery four: how is working memory capacity limited, and why? Curr Dir Psychol Sci. 2010;19(1):51–57.

  • 6

    Wartman SA, Combs CD. Medical education must move from the information age to the age of artificial intelligence. Acad Med. 2018;93(8):1107–1109.

  • 7

    Hunt RE, Newman RG. Medical knowledge overload: a disturbing trend for physicians. Health Care Manage Rev. 1997;22(1):70–75.

  • 8

    Dawes M, Sampson U. Knowledge management in clinical practice: a systematic review of information seeking behavior in physicians. Int J Med Inform. 2003;71(1):9–15.

  • 9

    Daei A, Soleymani MR, Ashrafi-Rizi H, Zargham-Boroujeni A, Kelishadi R. Clinical information seeking behavior of physicians: a systematic review. Int J Med Inform. 2020;139:104144.

  • 10

    Chowdhery A, Narang S, Devlin J, et al. PaLM: scaling language modeling with pathways. arXiv. Preprint published online April 5, 2022. doi: 10.48550/arXiv.2204.02311

  • 11

    Smith S, Patwary M, Norick B, et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv. Preprint published online January 28, 2022. doi: 10.48550/arXiv.2201.11990

  • 12

    Luo R, Sun L, Xia Y, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6):bbac409.

  • 13

    Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019. ACL; 2019. Accessed September 7, 2023. https://aclanthology.org/N19-1423.pdf

  • 14

    Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems. NeurIPS; 2020. Accessed September 7, 2023. https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

  • 15

    Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.

  • 16

    Bommarito MJ, Katz DM. GPT takes the bar exam. SSRN. Published online December 31, 2022. doi: 10.2139/ssrn.4314839

  • 17

    Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. arXiv. Preprint published online May 24, 2022. doi: 10.48550/ARXIV.2205.11916

  • 18

    Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620:172–180.

  • 19

    Carlson ML, Link MJ. Vestibular schwannomas. N Engl J Med. 2021;384(14):1335–1348.

  • 20

    Goldbrunner R, Weller M, Regis J, et al. EANO guideline on the diagnosis and treatment of vestibular schwannoma. Neuro Oncol. 2020;22(1):31–45.

  • 21

    Grosman J. Findpapers: A tool for helping researchers who are looking for related works. GitHub. Accessed September 7, 2023. https://github.com/jonatasgrosman/findpapers

  • 22

    Marina T, Sterligov I. Prevalence of potentially predatory publishing in Scopus on the country level. Scientometrics. 2021;126(6):5019–5077.

  • 23

    Grinberg M. Flask Web Development: Developing Web Applications with Python. 2nd ed. O’Reilly Media; 2018.

  • 24

    Carlson ML, Link MJ, Driscoll CLW, et al. Working toward consensus on sporadic vestibular schwannoma care: a modified Delphi study. Otol Neurotol. 2020;41(10):e1360–e1371.

  • 25

    Bate L, Hutchinson A, Underhill J, Maskrey N. How clinical decisions are made. Br J Clin Pharmacol. 2012;74(4):614–620.

  • 26

    Hajjaj FM, Salek MS, Basra MK, Finlay AY. Non-clinical influences on clinical decision-making: a major challenge to evidence-based practice. J R Soc Med. 2010;103(5):178–187.

  • 27

    Bensing J. Bridging the gap. The separate worlds of evidence-based medicine and patient-centered medicine. Patient Educ Couns. 2000;39(1):17–25.

  • 28

    Harrison MB, Légaré F, Graham ID, Fervers B. Adapting clinical practice guidelines to local context and assessing barriers to their use. CMAJ. 2010;182(2):E78–E84.

  • 29

    Cabana MD, Rand CS, Powe NR, et al. Why don’t physicians follow clinical practice guidelines? A framework for improvement. JAMA. 1999;282(15):1458–1465.

  • 30

    Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digit Med. 2020;3(1):17.

  • 31

    Morris AH. Developing and implementing computerized protocols for standardization of clinical decisions. Ann Intern Med. 2000;132(5):373–383.

  • 32

    Jia P, Zhang L, Chen J, Zhao P, Zhang M. The effects of clinical decision support systems on medication safety: an overview. PLoS One. 2016;11(12):e0167683.

  • 33

    Tulving E. Episodic and semantic memory. In: Tulving E, Donaldson W, eds. Organization of Memory. Academic Press; 1972:381–403.

  • 34

    Atkinson RC, Shiffrin RM. Human memory: a proposed system and its control processes. Psychol Learn Motiv. 1968;2:89–195.

  • 35

    Ben-Artzi I, Luria R, Shahar N. Working memory capacity estimates moderate value learning for outcome-irrelevant features. Sci Rep. 2022;12(1):19677.

  • 36

    Brady TF, Störmer VS, Alvarez GA. Working memory is not fixed-capacity: more active storage capacity for real-world objects than for simple stimuli. Proc Natl Acad Sci U S A. 2016;113(27):7459–7464.

  • 37

    Correa VC, Lugo-Agudelo LH, Aguirre-Acevedo DC, et al. Individual, health system, and contextual barriers and facilitators for the implementation of clinical practice guidelines: a systematic metareview. Health Res Policy Syst. 2020;18(1):74.

  • 38

    Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):248.

  • 39

    Rashkin H, Reitter D, Tomar GS, Das D. Increasing faithfulness in knowledge-grounded dialogue with controllable features. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Vol 1. ACL; 2021. Accessed September 7, 2023. https://aclanthology.org/2021.acl-long.58/

  • 40

    Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal biomedical AI. Nat Med. 2022;28(9):1773–1784.

  • 41

    Huang S, Dong L, Wang W, et al. Language is not all you need: aligning perception with language models. arXiv. Preprint posted online February 27, 2023. doi: 10.48550/arXiv.2302.14045

  • 42

    Epstein RM, Peters E. Beyond information: exploring patients’ preferences. JAMA. 2009;302(2):195–197.

  • 43

    Fletcher RR, Nakeshimana A, Olubeko O. Addressing fairness, bias, and appropriate use of artificial intelligence and machine learning in global health. Front Artif Intell. 2021;3:561802.

  • 44

    Leavy S. Gender bias in artificial intelligence: The need for diversity and gender theory in machine learning. In: 2018 IEEE/ACM 1st International Workshop on Gender Equality in Software Engineering (GE). IEEE; 2018. Accessed September 7, 2023. https://ieeexplore.ieee.org/document/8452744

  • 45

    Murdoch B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics. 2021;22(1):122.

Supplementary Materials



  • FIG. 1.

    Overview of the study pipeline and data set characteristics. A: The data-processing pipeline and the 3 comparisons in this study. B: Data inclusion and exclusion criteria. C: Thematic analysis of the embedding vectors of the vestibular schwannoma data set. Clusters were computed using K-means clustering. The InstructGPT Davinci model was used to classify each cluster. D: Peer-reviewed abstract publication numbers by year and type. tSNE = t-distributed stochastic neighbor embedding. Figure is available in color online only.

  • FIG. 2.

    The neuroGPT-X chat interface. The features include conversation memory of the current session, references for the information it provides, restriction of conversations to relevant medical topics, timestamp for each message, animated loading icons while responses are computed, and prevention of user input spam. The website can be accessed here: https://neurogpt-x.azurewebsites.net/. Figure is available in color online only.

  • FIG. 3.

    Mean blinded evaluation scores from 3 expert neurosurgeons. Responses to 15 curated questions were measured on the basis of normalized response speed by response character length (A), coherence score (B), thoroughness score (C), relevance score (D), accuracy score (E), and overall performance score (F). Gamma, Delta, Epsilon, and Zeta represent the individual neurosurgeons. Error bars represent standard deviation.

  • FIG. 4.

    Unblinded survey metrics of all 7 neurosurgeon experts and evaluators regarding a context-enriched GPT model measuring overall satisfaction (A), likelihood of recommendation to colleagues (B), likelihood of use in the clinic (C), and perceived value of the tool relative to current resources (D).
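The K-means thematic clustering described in the Fig. 1 legend can be illustrated with a minimal routine; the 2-dimensional toy points below stand in for the study's high-dimensional abstract embeddings, and the first-k centroid initialization is a deliberate simplification:

```python
def kmeans(points, k, iters=10):
    """Minimal K-means: assign each point to its nearest centroid,
    then recompute centroids, repeating for a fixed number of passes."""
    centroids = list(points[:k])  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated groups of toy 2-D "embeddings".
pts = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The study's pipeline additionally used tSNE for visualization and an InstructGPT model to label each cluster; only the clustering step itself is sketched here.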

  • 1

Dillon M. Introduction to modern information retrieval. Inf Process Manage. 1983;19(6):402-403.

  • 2

Landhuis E. Scientific literature: Information overload. Nature. 2016;535(7612):457-458.

  • 3

Charbonneau DH, James LN. DynaMed Plus®: an evidence-based clinical reference resource. Med Ref Serv Q. 2018;37(2):168-176.

  • 4

    Albarqouni L, Hoffmann T, Straus S, et al. Core competencies in evidence-based practice for health professionals: consensus statement based on a systematic review and Delphi survey. JAMA Netw Open. 2018;1(2):e180281.

  • 5

Cowan N. The magical mystery four: how is working memory capacity limited, and why? Curr Dir Psychol Sci. 2010;19(1):51-57.

  • 6

Wartman SA, Combs CD. Medical education must move from the information age to the age of artificial intelligence. Acad Med. 2018;93(8):1107-1109.

  • 7

Hunt RE, Newman RG. Medical knowledge overload: a disturbing trend for physicians. Health Care Manage Rev. 1997;22(1):70-75.

  • 8

Dawes M, Sampson U. Knowledge management in clinical practice: a systematic review of information seeking behavior in physicians. Int J Med Inform. 2003;71(1):9-15.

  • 9

    Daei A, Soleymani MR, Ashrafi-Rizi H, Zargham-Boroujeni A, Kelishadi R. Clinical information seeking behavior of physicians: a systematic review. Int J Med Inform. 2020;139:104144.

  • 10

    Chowdhery A, Narang S, Devlin J, et al. PaLM: scaling language modeling with pathways. arXiv. Preprint published online April 5, 2022. doi: 10.48550/arXiv.2204.02311

  • 11

Smith S, Patwary M, Norick B, et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv. Preprint published online January 28, 2022. doi: 10.48550/arXiv.2201.11990

  • 12

    Luo R, Sun L, Xia Y, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6):bbac409.

  • 13

    Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019. ACL; 2019. Accessed September 7, 2023. https://aclanthology.org/N19-1423.pdf

  • 14

    Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems. NeurIPS; 2020. Accessed September 7, 2023. https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

  • 15

    Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.

  • 16

    Bommarito MJ, Katz DM. GPT takes the bar exam. SSRN. Published online December 31, 2022. doi: 10.2139/ssrn.4314839

  • 17

Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. arXiv. Preprint published online May 24, 2022. doi: 10.48550/arXiv.2205.11916

  • 18

Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620:172-180.

  • 19

Carlson ML, Link MJ. Vestibular schwannomas. N Engl J Med. 2021;384(14):1335-1348.

  • 20

Goldbrunner R, Weller M, Regis J, et al. EANO guideline on the diagnosis and treatment of vestibular schwannoma. Neuro Oncol. 2020;22(1):31-45.

  • 21

Grosman J. Findpapers: A tool for helping researchers who are looking for related works. GitHub. Accessed September 7, 2023. https://github.com/jonatasgrosman/findpapers

  • 22

Marina T, Sterligov I. Prevalence of potentially predatory publishing in Scopus on the country level. Scientometrics. 2021;126(6):5019-5077.

  • 23

    Grinberg M. Flask. Web Development: Developing Web Applications with Python. 2nd ed. O’Reilly Media; 2018.

  • 24

Carlson ML, Link MJ, Driscoll CLW, et al. Working toward consensus on sporadic vestibular schwannoma care: a modified Delphi study. Otol Neurotol. 2020;41(10):e1360-e1371.

  • 25

Bate L, Hutchinson A, Underhill J, Maskrey N. How clinical decisions are made. Br J Clin Pharmacol. 2012;74(4):614-620.

  • 26

Hajjaj FM, Salek MS, Basra MK, Finlay AY. Non-clinical influences on clinical decision-making: a major challenge to evidence-based practice. J R Soc Med. 2010;103(5):178-187.

  • 27

Bensing J. Bridging the gap. The separate worlds of evidence-based medicine and patient-centered medicine. Patient Educ Couns. 2000;39(1):17-25.

  • 28

Harrison MB, Légaré F, Graham ID, Fervers B. Adapting clinical practice guidelines to local context and assessing barriers to their use. CMAJ. 2010;182(2):E78-E84.

  • 29

Cabana MD, Rand CS, Powe NR, et al. Why don’t physicians follow clinical practice guidelines? A framework for improvement. JAMA. 1999;282(15):1458-1465.

  • 30

    Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digit Med. 2020;3(1):17.

  • 31

Morris AH. Developing and implementing computerized protocols for standardization of clinical decisions. Ann Intern Med. 2000;132(5):373-383.

  • 32

    Jia P, Zhang L, Chen J, Zhao P, Zhang M. The effects of clinical decision support systems on medication safety: an overview. PLoS One. 2016;11(12):e0167683.

  • 33

    Tulving E. Episodic and semantic memory. In: Tulving E, Donaldson W, eds. Organization of Memory. Academic Press; 1972:381-403.

  • 34

Atkinson RC, Shiffrin RM. Human memory: a proposed system and its control processes. Psychol Learn Motiv. 1968;2:89-195.

  • 35

    Ben-Artzi I, Luria R, Shahar N. Working memory capacity estimates moderate value learning for outcome-irrelevant features. Sci Rep. 2022;12(1):19677.

  • 36

Brady TF, Störmer VS, Alvarez GA. Working memory is not fixed-capacity: more active storage capacity for real-world objects than for simple stimuli. Proc Natl Acad Sci U S A. 2016;113(27):7459-7464.

  • 37

    Correa VC, Lugo-Agudelo LH, Aguirre-Acevedo DC, et al. Individual, health system, and contextual barriers and facilitators for the implementation of clinical practice guidelines: a systematic metareview. Health Res Policy Syst. 2020;18(1):74.

  • 38

    Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):248.

  • 39

    Rashkin H, Reitter D, Tomar GS, Das D. Increasing faithfulness in knowledge-grounded dialogue with controllable features. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Vol 1. ACL; 2021. Accessed September 7, 2023. https://aclanthology.org/2021.acl-long.58/

  • 40

Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal biomedical AI. Nat Med. 2022;28(9):1773-1784.

  • 41

    Huang S, Dong L, Wang W, et al. Language is not all you need: aligning perception with language models. arXiv. Preprint posted online February 27, 2023. doi: 10.48550/arXiv.2302.14045

  • 42

Epstein RM, Peters E. Beyond information: exploring patients’ preferences. JAMA. 2009;302(2):195-197.

  • 43

    Fletcher RR, Nakeshimana A, Olubeko O. Addressing fairness, bias, and appropriate use of artificial intelligence and machine learning in global health. Front Artif Intell. 2021;3:561802.

  • 44

    Leavy S. Gender bias in artificial intelligence: The need for diversity and gender theory in machine learning. In: 2018 IEEE/ACM 1st International Workshop on Gender Equality in Software Engineering (GE). IEEE; 2018. Accessed September 7, 2023. https://ieeexplore.ieee.org/document/8452744

  • 45

    Murdoch B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics. 2021;22(1):122.
