Determination of geographic variance in stroke prevalence using Internet search engine analytics

Object

Previous methods to determine stroke prevalence, such as nationwide surveys, are labor intensive. Recent advances in search engine query analytics have led to a new metric for disease surveillance of symptomatic phenomena, such as influenza. The authors hypothesized that search engine query data can be used to determine the prevalence of stroke.

Methods

The Google Insights for Search database was accessed to analyze anonymized search engine query data. The authors' search strategy utilized common search queries used when attempting either to identify the signs and symptoms of a stroke or to perform stroke education. The search logic was as follows: (stroke signs + stroke symptoms + mini stroke − heat) from January 1, 2005, to December 31, 2010.

The relative number of searches performed (the interest level) for this search logic was established for all 50 states and the District of Columbia. A Pearson product-moment correlation coefficient was calculated between the interest level and the previously reported state-specific stroke prevalence data.

Results

Web search engine interest level was available for all 50 states and the District of Columbia for the period from January 1, 2005, to December 31, 2010. The interest level was highest in Alabama and Tennessee (100 and 96, respectively) and lowest in California and Virginia (58 and 53, respectively). The Pearson correlation coefficient (r) was calculated to be 0.47 (p = 0.0005, 2-tailed).

Conclusions

Search engine query data analysis allows for the determination of relative stroke prevalence. Further investigation will reveal the reliability of this metric for temporal pattern analysis and for determining prevalence in this and other symptomatic diseases.

Abbreviation used in this paper: BRFSS = Behavioral Risk Factor Surveillance System.

Stroke represents one of the leading causes of morbidity and death in the US.6 An estimated 795,000 new strokes occur each year in the US, making stroke the third leading cause of death overall in America.5 Surveillance data have been notoriously difficult to obtain because of sampling limitations and reporting bias. In 2007 the Centers for Disease Control and Prevention analyzed the 2005 Behavioral Risk Factor Surveillance System (BRFSS) survey and reported the first state-specific variability in stroke prevalence.7 The prevalence estimates from the BRFSS are produced from a state-based, random-digit-dial telephone survey, a method that is probably costly and labor intensive.

Recent advances in search engine query analytics have led to the development of a new metric for disease surveillance to evaluate symptomatic phenomena in the infectious disease literature. Because about 90 million American adults are believed to search online for information about specific diseases or medical problems each year,2 Internet search engine data represent a potentially valuable source of information about health trends. Recently, the frequency of queries related to influenza infection was shown to be highly correlated with the percentage of physician visits in which a patient presented with influenza-like symptoms.3 Essentially, near real-time prevalence data are available using search engine query data. With this in mind, we hypothesized that the use of search engine query data could determine the prevalence of a noninfectious but symptomatic disease: stroke.

Methods

We accessed the Google Insights for Search database (Google, Inc.) to analyze anonymized search engine query data (http://www.google.com/insights/search). This database contains 50 million of the most common search queries on all possible topics, without prefiltering. While billions of search queries occur every day, only those that occur frequently are included in the database. The Internet protocol (IP) address associated with each search query is recorded, allowing the location of each search to be determined to the nearest major city within the US.

Our search strategy utilized common search queries that either a patient or a patient's representative might use when attempting to identify the signs and symptoms of a stroke or to perform stroke education. Our search logic (stroke signs + stroke symptoms + mini stroke − heat) included all dates from January 1, 2005, to December 31, 2010. The term “− heat” was used to exclude all queries relating to heat stroke. We filtered for Web searches only and included all search categories.
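The search logic above can be illustrated with a minimal sketch. It assumes that "+" acts as a logical OR between phrases and that the minus sign excludes any query containing the term, and it uses plain substring and word tests on lowercased text. How Google Insights for Search actually matches queries is not described here, so the matching rule, the function name `matches_search_logic`, and the sample queries are all hypothetical.

```python
# Illustrative only: the phrase lists mirror the search logic in the text;
# the matching rule (substring test for phrases, word test for exclusions)
# is an assumption, not Google's documented behavior.
INCLUDE_PHRASES = ("stroke signs", "stroke symptoms", "mini stroke")
EXCLUDE_TERMS = ("heat",)


def matches_search_logic(query: str) -> bool:
    """True if a query fits (stroke signs + stroke symptoms + mini stroke - heat)."""
    text = " ".join(query.lower().split())  # normalize case and whitespace
    words = text.split()
    included = any(phrase in text for phrase in INCLUDE_PHRASES)
    excluded = any(term in words for term in EXCLUDE_TERMS)
    return included and not excluded


sample_queries = [  # hypothetical queries, not taken from the database
    "mini stroke recovery",
    "heat stroke symptoms",
    "stroke symptoms in women",
    "golf stroke tips",
]
kept = [q for q in sample_queries if matches_search_logic(q)]
print(kept)
```

In this sketch the heat stroke query is dropped by the exclusion term, and the golf query is dropped because it matches none of the included phrases.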

The number of searches matching this search logic was expressed relative to the total number of searches performed on Google over the same period for each of the 50 states and the District of Columbia. All data were normalized and then scaled to yield the interest level. A Pearson product-moment correlation coefficient was then calculated between the interest level and the previously reported state-specific stroke prevalence data.7
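A rough sketch of how such a normalized and scaled interest level could be derived is given below. It assumes that each region's matching searches are divided by that region's total searches and that the resulting shares are rescaled so the highest region equals 100. The exact procedure used by Google is not given in the text, and the region names, counts, and the function name `interest_levels` are hypothetical.

```python
def interest_levels(matching: dict, total: dict) -> dict:
    """Scale each region's share of matching searches so the top region is 100."""
    # Normalize: share of all searches in the region that match the search logic.
    shares = {region: matching[region] / total[region] for region in matching}
    top_share = max(shares.values())
    # Scale: express every share as a percentage of the highest share.
    return {region: round(100 * share / top_share) for region, share in shares.items()}


# Hypothetical counts for three made-up regions (not real data).
matching_searches = {"Region A": 500, "Region B": 300, "Region C": 1000}
total_searches = {"Region A": 1_000_000, "Region B": 1_000_000, "Region C": 4_000_000}

levels = interest_levels(matching_searches, total_searches)
print(levels)
```

Region C has the most matching searches in absolute terms but the lowest interest level, because the measure is relative to overall search traffic in each region.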

Results

The Web search engine interest level was available for all 50 states and the District of Columbia for the time period from January 1, 2005, to December 31, 2010. Interest level was highest in Alabama and Tennessee (100 and 96, respectively) and lowest in California and Virginia (58 and 53, respectively) (Table 1 and Fig. 1). The Pearson correlation coefficient calculated with previously reported state prevalence rates was determined to be 0.47 (p = 0.0005, 2-tailed; Fig. 2).7

TABLE 1:

Interest level and stroke prevalence for 50 states and the District of Columbia for January 1, 2005–December 31, 2010

State | Interest Level | Stroke Prevalence*
Alabama | 100 | 3.2
Alaska | 64 | 2.5
Arizona | 82 | 2.1
Arkansas | 90 | 3.0
California | 58 | 2.6
Colorado | 77 | 1.7
Connecticut | 67 | 1.5
Delaware | 68 | 2.6
District of Columbia | 70 | 3.4
Florida | 82 | 2.8
Georgia | 84 | 2.9
Hawaii | 74 | 2.8
Idaho | 76 | 2.4
Illinois | 77 | 3.0
Indiana | 82 | 2.5
Iowa | 73 | 2.6
Kansas | 78 | 2.3
Kentucky | 93 | 3.1
Louisiana | 89 | 3.3
Maine | 79 | 2.4
Maryland | 79 | 2.1
Massachusetts | 71 | 2.1
Michigan | 83 | 3.0
Minnesota | 77 | 1.7
Mississippi | 92 | 4.3
Missouri | 90 | 3.1
Montana | 74 | 2.1
Nebraska | 82 | 2.2
Nevada | 74 | 3.2
New Hampshire | 72 | 2.6
New Jersey | 74 | 2.1
New Mexico | 81 | 2.2
New York | 68 | 2.4
North Carolina | 90 | 2.8
North Dakota | 82 | 1.8
Ohio | 83 | 2.3
Oklahoma | 93 | 3.4
Oregon | 78 | 2.5
Pennsylvania | 82 | 2.2
Rhode Island | 68 | 2.1
South Carolina | 87 | 2.9
South Dakota | 89 | 2.6
Tennessee | 96 | 3.1
Texas | 84 | 3.0
Utah | 87 | 2.6
Vermont | 71 | 2.1
Virginia | 53 | 2.7
Washington | 69 | 2.4
West Virginia | 90 | 3.0
Wisconsin | 76 | 1.9
Wyoming | 67 | 1.9

* Percentage of respondents ages ≥ 18 years who reported a history of stroke.

Fig. 1.

Map demonstrating the search volume index (interest level) in the US from January 1, 2005, to December 31, 2010.

Fig. 2.

Scatterplot of interest level in the US from January 1, 2005, to December 31, 2010, and stroke prevalence defined by state/area using the BRFSS, US, 2005 (r = 0.47, p = 0.0005, 2-tailed).
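For readers who wish to recompute the correlation, the following sketch calculates a Pearson product-moment coefficient from the interest level and prevalence pairs in Table 1. The pairs are transcribed from the table as printed, so any transcription error would change the result. The helper name `pearson_r` is hypothetical, and the t statistic is included only as a step toward a significance test, not as a reproduction of the authors' software or procedure.

```python
import math

# (interest level, stroke prevalence %) transcribed from Table 1,
# in the table's alphabetical order (Alabama ... Wyoming).
TABLE_1 = [
    (100, 3.2), (64, 2.5), (82, 2.1), (90, 3.0), (58, 2.6), (77, 1.7),
    (67, 1.5), (68, 2.6), (70, 3.4), (82, 2.8), (84, 2.9), (74, 2.8),
    (76, 2.4), (77, 3.0), (82, 2.5), (73, 2.6), (78, 2.3), (93, 3.1),
    (89, 3.3), (79, 2.4), (79, 2.1), (71, 2.1), (83, 3.0), (77, 1.7),
    (92, 4.3), (90, 3.1), (74, 2.1), (82, 2.2), (74, 3.2), (72, 2.6),
    (74, 2.1), (81, 2.2), (68, 2.4), (90, 2.8), (82, 1.8), (83, 2.3),
    (93, 3.4), (78, 2.5), (82, 2.2), (68, 2.1), (87, 2.9), (89, 2.6),
    (96, 3.1), (84, 3.0), (87, 2.6), (71, 2.1), (53, 2.7), (69, 2.4),
    (90, 3.0), (76, 1.9), (67, 1.9),
]


def pearson_r(pairs):
    """Pearson product-moment correlation coefficient for (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


n = len(TABLE_1)
r = pearson_r(TABLE_1)
# t statistic for testing r against zero, with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r * r))
print(f"n = {n}, r = {r:.2f}, t = {t_stat:.2f}")
```

The recomputed coefficient should land close to the reported r = 0.47. A 2-tailed p value could then be obtained from the t distribution with n - 2 degrees of freedom using a statistics package.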

Discussion

We analyzed a large database of search engine queries to identify searches related to stroke. Previous methods for prevalence analysis, such as survey methods, are time and labor intensive. Based on the previously demonstrated effectiveness of search engine analytics to identify epidemiological data for symptomatic disease, we hypothesized that this tool could be used as a surrogate marker for actual disease prevalence. The Pearson correlation coefficient of 0.47 is moderate but should be interpreted with caution, as outliers can skew data interpretation.1 The state-specific dataset used for comparison is itself estimated and admittedly limited.7

Many of the states with the highest Internet interest estimates, such as Alabama, Mississippi, and Arkansas, are concentrated in the southeastern US. These correspond to the high rates of stroke prevalence previously observed in this region, which has been traditionally called the “stroke belt.”4 However, certain states outside of this region, such as Oklahoma, Missouri, and West Virginia, also had Internet interest levels among the highest in the country. This finding is consistent with the 2005 Centers for Disease Control and Prevention BRFSS survey and can be explained by geographic differences in risk factors associated with stroke, such as race, obesity, diet, diabetes mellitus, and socioeconomic status. When using Internet interest estimates as surrogate markers for disease prevalence, major geographic disparities continue to exist.

Just as patients with symptoms of the flu use a search engine to investigate their symptoms, patients plausibly enter stroke symptoms into search engines at the onset of stroke-like symptoms, transient ischemic attack, or stroke. These searches can be performed by either the patients themselves or, depending on the clinical condition, their representative. Additionally, these search queries may be entered at any time after a stroke and do not necessarily depend on patient survival.

Search engine query data represent self-reported disease history in its purest form; they are generated by patients or their representatives of their own accord and in a private setting. A major limitation of this analysis is that users may misinterpret symptoms of other conditions as those of stroke. Many different conditions could masquerade as stroke, falsely increasing the estimated prevalence. Another limitation of this methodology is that Internet access is not uniform and, in particular, may not be readily available to persons living in nursing homes, prisons, military bases, or other institutions. Internet use also varies widely by sex, socioeconomic background, and English literacy.6,9 Reported Internet usage by individuals also varies geographically by state;8 however, our model normalizes for overall search engine traffic.

The heterogeneity of stroke symptoms precludes the analysis of individual query terms. The disease-specific search related to stroke was therefore made the core of our model. The identification of these geographic differences provides a metric to evaluate health disparities and may help direct public health programs related to stroke risk-factor prevention and education toward disproportionately affected states.

Conclusions

Search engine query data analysis allows for the determination of relative stroke prevalence in the US. Further investigation will reveal the reliability of this metric for temporal pattern analysis and for determining prevalence in other symptomatic diseases.

Disclosure

The authors report no conflict of interest concerning the materials or methods used in this study or the findings specified in this paper.

Author contributions to the study and manuscript preparation include the following. Conception and design: Walcott. Acquisition of data: Walcott. Analysis and interpretation of data: Walcott, Redjal. Drafting the article: all authors. Critically revising the article: all authors. Reviewed final version of the manuscript and approved it for submission: all authors. Statistical analysis: Walcott, Coumans. Study supervision: Coumans.

Article Information

Address correspondence to: Brian P. Walcott, M.D., Massachusetts General Hospital, 55 Fruit Street, White Building, Room 502, Boston, Massachusetts 02114. email: walcott.brian@mgh.harvard.edu.

© AANS, except where prohibited by US copyright law.

References

1. Cohen J: Statistical Power Analysis for the Behavioral Sciences, ed 2. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988

2. Fox S: Online Health Search 2006. Washington, DC: Pew Internet & American Life Project (http://www.pewinternet.org/~/media//Files/Reports/2006/PIP_Online_Health_2006.pdf.pdf)

3. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L: Detecting influenza epidemics using search engine query data. Nature 457:1012–1014, 2009

4. Lanska DJ, Kuller LH: The geography of stroke mortality in the United States and the concept of a stroke belt. Stroke 26:1145–1149, 1995

5. Lloyd-Jones D, Adams RJ, Brown TM, Carnethon M, Dai S, De Simone G: Heart disease and stroke statistics—2010 update: a report from the American Heart Association. Circulation 121:e46–e215, 2010. (Erratum in Circulation 121:

6. Neckerman KM: Social Inequality. New York: Russell Sage Foundation, 2004

7. Neyer J, Greenlund K, Denny C, Kennan N, Casper M, Labarthe D: Prevalence of stroke—United States, 2005. MMWR Morb Mortal Wkly Rep 56:469–474, 2007

8. US Census Bureau: Internet Use in the United States: October 2009 (http://www.census.gov/population/www/socdemo/computer/2009.html)

9. Wasserman IM, Richmond-Abbott M: Gender and the Internet: causes of variation in access, level, and scope of use. Soc Sci Q 86:252–270, 2005
