Observer reliability of arteriovenous malformations grading scales using current imaging modalities

Clinical article

Object

The aim of this study was to examine observer reliability of frequently used arteriovenous malformation (AVM) grading scales, including the 5-tier Spetzler-Martin scale, the 3-tier Spetzler-Ponce scale, and the Pollock-Flickinger radiosurgery-based scale, using current imaging modalities in a setting closely resembling routine clinical practice.

Methods

Five experienced raters, including 1 vascular neurosurgeon, 2 neuroradiologists, and 2 senior neurosurgical residents independently reviewed 15 MRI studies, 15 CT angiograms, and 15 digital subtraction angiograms obtained at the time of initial diagnosis. Assessments of 5 scans of each imaging modality were repeated for measurement of intrarater reliability. Three months after the initial assessment, raters reassessed those scans where there was disagreement. In this second assessment, raters were asked to justify their rating with comments and illustrations. Generalized kappa (κ) analysis for multiple raters, Kendall's coefficient of concordance (W), and intraclass correlation coefficient (ICC) were applied to determine interrater reliability. For intrarater reliability analysis, Cohen's kappa (κ), Kendall's correlation coefficient (tau-b), and ICC were used to assess repeat measurement agreement for each rater.

Results

Interrater reliability for the overall 5-tier Spetzler-Martin scale was fair to good (ICC = 0.69) to extremely strong (Kendall's W = 0.73) on initial assessment and improved on reassessment. Assessment of CT angiograms resulted in the highest agreement, followed by MRI and digital subtraction angiography. Agreement for the overall 3-tier Spetzler-Ponce grade was fair to good (ICC = 0.68) to strong (Kendall's W = 0.70) on initial assessment, improved on reassessment, and was comparable to agreement for the 5-tier Spetzler-Martin scale. Agreement for the overall Pollock-Flickinger radiosurgery-based grade was excellent (ICC = 0.89) to extremely strong (Kendall's W = 0.81). Intrarater reliability for the overall 5-tier Spetzler-Martin grade was excellent (ICC > 0.75) in 3 of the 5 raters and fair to good (ICC > 0.40) in the other 2 raters.

Conclusion

The 5-tier Spetzler-Martin scale, the 3-tier Spetzler-Ponce scale, and the Pollock-Flickinger radiosurgery-based scale all showed a high level of agreement. The improved reliability on reassessment was attributed to a training effect from the initial assessment and to the requirement to defend the rating, which highlights a potential limitation of using grades assigned during routine clinical practice for scientific purposes.

Abbreviations used in this paper: AVM = arteriovenous malformation; CTA = CT angiography; DS = digital subtraction; DSA = DS angiography; ICC = intraclass correlation coefficient.

Grading systems for the treatment of arteriovenous malformations (AVMs) were developed to guide clinicians in medical decision making. Additionally, these objective measures aid in establishing a prognosis, facilitating communication, standardizing research, and assessing outcome following treatment. The success of a clinical grading scale depends on its being valid and reliable. Arteriovenous malformations are commonly classified according to the Spetzler-Martin scale22 and the Pollock-Flickinger radiosurgery-based scale.19,20 The 5-tier Spetzler-Martin scale was first described in 1986 and is designed to predict the risk of morbidity and mortality associated with operative treatment of the malformation.22 The 3 variables considered are size, pattern of venous drainage, and eloquence of adjacent brain. Increasing Spetzler-Martin grades correlate with a higher incidence of minor and major neurological deficits following surgery.22 Spetzler and Ponce recently modified the original 5-tier Spetzler-Martin system by combining Grade I with Grade II and Grade IV with Grade V.23 This simplification of the original grading system was also found to be predictive of outcome. With respect to radiosurgery, the Spetzler-Martin scale has limitations and has not been found to consistently correlate with successful AVM obliteration. Given this shortcoming, Pollock and Flickinger developed20 and subsequently modified19 a separate grading system specific to the radiosurgical treatment of AVMs. The score is calculated with the following formula: score = (0.1 × volume in ml) + (0.02 × age in years) + (0.5 × location), where location = 0 for hemispheric, corpus callosum, or cerebellar AVMs and 1 for basal ganglia, thalamus, or brainstem AVMs.
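
The radiosurgery-based score is a simple linear formula, so it can be illustrated with a short Python function. The function name and signature below are hypothetical helpers for illustration, not part of the published scale:

```python
def pollock_flickinger_score(volume_ml: float, age_years: float,
                             deep_location: bool) -> float:
    """Pollock-Flickinger radiosurgery-based AVM score.

    deep_location: True for basal ganglia/thalamus/brainstem (location = 1),
    False for hemispheric/corpus callosum/cerebellar (location = 0),
    per the formula given in the text.
    """
    location = 1 if deep_location else 0
    return 0.1 * volume_ml + 0.02 * age_years + 0.5 * location

# e.g., a 10-ml hemispheric AVM in a 40-year-old:
# 0.1*10 + 0.02*40 + 0.5*0 = 1.8
```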

Any proposed grading scale is susceptible to variability in implementation and has to be analyzed for reliability. Several studies have examined the observer reliability for the 5-tier Spetzler-Martin scale.1,7,12,13,15,22 No study has yet examined the observer reliability for the 3-tier Spetzler-Ponce scale or the radiosurgery-based Pollock-Flickinger scale. The purpose of this study was to comprehensively determine inter- and intrarater reliability of these AVM scales using imaging modalities routinely obtained for diagnosis in a setting closely resembling routine clinical practice.

Methods

Five experienced raters, including one vascular neurosurgeon, two neuroradiologists, and two senior neurosurgical residents, independently reviewed 15 MR images, 15 CT angiograms, and 15 digital subtraction (DS) angiograms obtained at one institution at the time of initial AVM diagnosis. The images were reviewed in random order. Sample size was calculated using a method developed by Walter et al.,25 which showed that increasing the number of raters per subject (up to n = 5) will decrease the total number of observations required to achieve adequate sample size. In the present study, assessments of the same 5 raters were used to determine interrater reliability of the 3 imaging modalities. Using 0.40 as the minimum acceptable level of interrater reliability and 0.70 as the desired level of interrater reliability, based on α = 0.05 and 80% power, the sample size requirement was 14 observations per imaging modality. Therefore, to ensure detection of an acceptable level of interrater agreement among the 5 raters, 15 images per imaging modality were used.

The images allocated to the study were randomly selected from the institution's AVM database and included a wide variety with respect to AVM location, size, and vascular complexity. The MR images, CT angiograms, and DS angiograms did not come from the same patients. The MR images assessed were obtained according to the institution's protocol for brain MRIs and included T1-weighted images with and without gadolinium enhancement, as well as T2-weighted, FLAIR, and T2-weighted gradient echo images. The CT angiograms included axial, coronal, and sagittal images at slice thicknesses of 6, 3, and 3 mm, respectively. The DS angiograms included anteroposterior and lateral views of global internal carotid and vertebral artery injections. Five scans of each imaging modality were repeated for intrarater reliability assessment, for a total of 60 sets of images. Three months after the initial assessment, raters reassessed those scans where there was disagreement, and the raters were asked to justify their rating with comments and illustrations.

Generalized kappa (κ) analysis for multiple raters, Kendall's coefficient of concordance (W), and intraclass correlation coefficient (ICC) were applied to determine interrater reliability. For intrarater reliability analysis, Cohen's kappa (κ),5 Kendall's correlation coefficient (tau-b), and ICC were used to assess repeat measurement agreement for each rater.

Generalized kappa (κ) as proposed and revised by Fleiss for multiple raters9,10 was used for nominally scaled variables (for example, the components “eloquence” and “drainage” of the Spetzler-Martin scale and “location” of the Pollock-Flickinger scale). Agreement measured by multiple rater–generalized kappa (κ) was interpreted as almost perfect with κ values between 0.81 and 1.00, substantial with κ values between 0.61 and 0.80, moderate with κ values between 0.41 and 0.60, fair with κ values between 0.21 and 0.40, and poor with κ values between 0 and 0.2.17
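
Generalized (Fleiss) kappa compares the observed within-subject agreement with the agreement expected by chance from the marginal category frequencies. The statistic itself (though not the SAS macro used in the study) can be sketched in a few lines of Python; the helper below is a hypothetical illustration assuming every subject is rated by the same number of raters:

```python
def fleiss_kappa(counts):
    """Generalized (Fleiss) kappa for multiple raters, nominal categories.

    counts[i][j] = number of raters who assigned subject i to category j.
    Assumes the same number of raters rated every subject.
    """
    N = len(counts)              # subjects
    n = sum(counts[0])           # raters per subject
    k = len(counts[0])           # categories
    # marginal proportion of all ratings falling in each category
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # per-subject observed agreement among rater pairs
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_obs = sum(P_i) / N
    P_exp = sum(pj * pj for pj in p)
    return (P_obs - P_exp) / (1 - P_exp)
```

With perfect agreement (every rater choosing the same category for each subject) the statistic equals 1; chance-level agreement gives values near 0.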

Scores for the overall Pollock-Flickinger scale were assessed both as continuous variables and grouped in categories as follows: less than 1.0, 1.0–1.5, 1.5–2.0, and greater than 2.0. Categorical ordinal data (for example, the overall Spetzler-Martin and Spetzler-Ponce scores, the component “size” of the Spetzler-Martin scale, and the overall Pollock-Flickinger score, expressed as categorical data) were assessed using Kendall's W and ICC. Continuous data (for example, overall Pollock-Flickinger score, expressed as continuous data, and the “volume” component of the Pollock-Flickinger scale) were assessed using ICC.

Agreement measured by Kendall's W was interpreted as extremely strong agreement with a coefficient between 0.71 and 1.00, strong agreement between 0.51 and 0.70, moderate agreement between 0.31 and 0.50, and weak agreement between 0.11 and 0.30.16
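
Kendall's W relates the spread of the subjects' rank sums across raters to the maximum possible spread. The Python sketch below is a hypothetical illustration that assumes no tied ratings within a rater (so the tie correction applied by standard statistical software is omitted):

```python
def kendalls_w(ratings):
    """Kendall's coefficient of concordance, no tie correction.

    ratings[r][i] = rating given by rater r to subject i.
    Each rater's ratings are assumed distinct so ranks are unambiguous.
    """
    m = len(ratings)          # raters
    n = len(ratings[0])       # subjects
    # convert each rater's scores to ranks (1 = smallest)
    ranked = []
    for row in ratings:
        order = sorted(range(n), key=lambda i: row[i])
        ranks = [0] * n
        for rank, i in enumerate(order, start=1):
            ranks[i] = rank
        ranked.append(ranks)
    # rank sum per subject, then squared deviations from the mean rank sum
    R = [sum(ranked[r][i] for r in range(m)) for i in range(n)]
    mean_R = sum(R) / n
    S = sum((Ri - mean_R) ** 2 for Ri in R)
    return 12 * S / (m * m * (n ** 3 - n))
```

Identical rankings from all raters yield W = 1; perfectly opposed rankings from two raters yield W = 0.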

ICC, a flexible reliability coefficient designed to compare the variability of multiple raters rating the same image to the total variation across all ratings and all images, was also computed for ordinal, interval, and ratio variables.21 The first source of variability measured using an ICC is the proportion of variability related to the differences among the patient images themselves. Since this study design involved 5 raters who each rated all patient images, the variability among the raters was treated as a second potential source of systematic variance and represents the second factor in a 2-way random effects model. The raters in this study were assumed to be a random subset from a larger population of experienced neuroradiology and neurosurgery professionals. Therefore, the Shrout and Fleiss ICC Model 2, single rating (ICC[2,1]), was chosen and analyzed for absolute agreement to permit generalization to other potential raters.18 Using the cutoffs proposed by Fleiss, an ICC greater than 0.75 was considered to indicate excellent inter- and intrarater reliability, an ICC between 0.40 and 0.75 fair to good reliability, and an ICC less than 0.40 poor reliability.8
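
Under this 2-way random effects model, ICC(2,1) for absolute agreement can be computed from the ANOVA mean squares. The sketch below is a hypothetical illustration of the Shrout-Fleiss formula, not the SPSS procedure used in the study:

```python
def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    data[i][j] = rating of subject i by rater j (complete design).
    """
    n = len(data)       # subjects (rows)
    k = len(data[0])    # raters (columns)
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]
    # ANOVA sums of squares for rows (subjects), columns (raters), error
    ss_rows = k * sum((rm - grand) ** 2 for rm in row_means)
    ss_cols = n * sum((cm - grand) ** 2 for cm in col_means)
    ss_total = sum((data[i][j] - grand) ** 2
                   for i in range(n) for j in range(k))
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC(2,1) measures absolute agreement, a rater who is internally consistent but systematically offset from the others (for example, always one point higher) still lowers the coefficient.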

Cohen's kappa (κ), an unweighted kappa statistic appropriate for comparing two sets of nominal ratings from the same rater, was used to calculate intrarater reliability for individual raters reproducing nominal ratings.5 Kendall's tau-b was used to compute intrarater test-retest reliability for categorical ordinal ratings.16 The ICCs used as intrarater reliability indices to quantify test-retest reliability were likewise computed using ICC Model 2.
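
For a single rater's test-retest data, Cohen's kappa compares observed agreement with the agreement expected from each rating set's marginal category frequencies. A minimal hypothetical Python sketch:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two rating sets, e.g., one rater's
    initial vs. repeat ratings of the same scans (nominal categories)."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    cats = sorted(set(ratings_a) | set(ratings_b))
    # observed proportion of exact agreement
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # chance agreement from the two sets' marginal frequencies
    p_e = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
              for c in cats)
    return (p_o - p_e) / (1 - p_e)
```

Identical rating sets give κ = 1, while agreement at exactly the chance-expected level gives κ = 0.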

All statistical analysis was performed using SAS version 9.3 (SAS Institute, Inc.), with the exception of the ICCs, which were computed using SPSS version 21.0 (SPSS, Inc.). Multiple-rater generalized kappa (κ) and Kendall's W were computed using the SAS MAGREE macro (version 1.0).4

Results

Interrater Reliability

The 5-Tier Spetzler-Martin Scale

On initial assessment, the agreement on the overall grade across all imaging modalities was 0.73 using Kendall's W and 0.69 using ICC (Table 1). Among the 3 variables, agreement was highest on size followed by drainage and then eloquence. Examples for disagreement on drainage (Fig. 1) and eloquence (Fig. 2) are provided. The agreement on overall grade by modality was highest for CTA followed by MRI and DSA (Table 2). With respect to size also, agreement was highest for CTA followed by MRI and DSA. With respect to eloquence, MRI was superior to DSA and CTA. With respect to drainage, CTA was superior to DSA and MRI.

TABLE 1:

Five-tier Spetzler-Martin grading scale*

Parameter & Assessment | κ (95% CI) | W | ICC (95% CI)
overall grade
 initial assessment | — | 0.73 | 0.69 (0.58–0.77)
 reassessment | — | 0.86 | 0.82 (0.75–0.89)
size
 initial assessment | — | 0.78 | 0.78 (0.69–0.86)
 reassessment | — | 0.87 | 0.86 (0.79–0.91)
eloquence
 initial assessment | 0.39 (0.29–0.49) | — | —
 reassessment | 0.71 (0.60–0.80) | — | —
drainage
 initial assessment | 0.53 (0.43–0.63) | — | —
 reassessment | 0.74 (0.64–0.83) | — | —

* ICC = intraclass correlation coefficient; κ = Fleiss's generalized kappa for multiple raters; W = Kendall's W.

Fig. 1.

AVM 7. Disagreement on drainage (Spetzler-Martin/Spetzler-Ponce scale). On initial assessment and reassessment, all but 1 rater (a neuroradiologist) assigned a "0" for drainage. On reassessment the neuroradiologist highlighted a draining vein with arrows (B and D), justifying his rating. A: Lateral view of left carotid artery injection (arterial phase). B: Lateral view of left carotid artery injection (venous phase). C: Anteroposterior view of left carotid artery injection (arterial phase). D: Anteroposterior view of left carotid artery injection (venous phase).

TABLE 2:

Five-tier Spetzler-Martin grading scale by modality

Parameter | MRI | CTA | DSA
overall grade (W; ICC, 95% CI)
 initial assessment | 0.59; 0.64 (0.42–0.83) | 0.83; 0.76 (0.59–0.90) | 0.50; 0.35 (0.13–0.63)
 reassessment | 0.80; 0.83 (0.69–0.93) | 0.89; 0.86 (0.75–0.94) | 0.72; 0.62 (0.40–0.82)
size (W; ICC, 95% CI)
 initial assessment | 0.85; 0.90 (0.81–0.96) | 0.90; 0.90 (0.81–0.96) | 0.56; 0.42 (0.20–0.69)
 reassessment | 0.87; 0.92 (0.84–0.97) | 0.91; 0.92 (0.84–0.97) | 0.79; 0.71 (0.51–0.87)
eloquence (κ, 95% CI)
 initial assessment | 0.42 (0.26–0.58) | 0.25 (0.09–0.41) | 0.40 (0.24–0.56)
 reassessment | 0.81 (0.65–0.97) | 0.57 (0.49–0.80) | 0.68 (0.52–0.84)
drainage (κ, 95% CI)
 initial assessment | 0.33 (0.17–0.49) | 0.62 (0.46–0.78) | 0.55 (0.39–0.71)
 reassessment | 0.54 (0.38–0.70) | 0.82 (0.66–0.98) | 0.80 (0.64–0.96)
Fig. 2.

AVM 20. Disagreement on eloquence (Spetzler-Martin/Spetzler-Ponce scale). On initial assessment 2 of the raters (1 neuroradiologist and 1 senior neurosurgery resident) assigned a “0” for eloquence. On reassessment, all but one of the raters (the neuroradiologist) assigned a “1” for eloquence. The neuroradiologist did not change his “0” rating with the justification that the AVM was anterior to the motor strip. A: Axial CT angiogram at the level of the nidus. B: Coronal CT angiogram. C: Sagittal CT angiogram.

On reassessment, the agreement on the overall grade across all modalities was 0.86 using Kendall's W and 0.82 using ICC (Table 1). Among the 3 variables, agreement was highest on size followed by drainage and eloquence. The agreement on reassessment of overall grade by modality was highest for CTA followed by MRI and DSA (Table 2). With respect to size, agreement was highest for CTA followed by MRI and DSA. For assessment of eloquence, MRI was superior to DSA and CTA. With respect to drainage, CTA was superior to DSA and MRI.

The 3-Tier Spetzler-Ponce Scale

On initial assessment, the agreement on the overall grade was 0.70 using Kendall's W and 0.68 using ICC (Table 3). Assessment according to imaging modality resulted in comparable agreement for MRI and CTA followed by DSA (Table 4).

TABLE 3:

Three-tier Spetzler-Ponce grading scale

Parameter | W | ICC (95% CI)
overall grade
 initial assessment | 0.70 | 0.68 (0.56–0.78)
 reassessment | 0.84 | 0.82 (0.74–0.88)

TABLE 4:

Three-tier Spetzler-Ponce grading scale by modality

Parameter | MRI (W; ICC, 95% CI) | CTA (W; ICC, 95% CI) | DSA (W; ICC, 95% CI)
overall grade
 initial assessment | 0.69; 0.75 (0.57–0.89) | 0.78; 0.76 (0.59–0.90) | 0.52; 0.34 (0.13–0.63)
 reassessment | 0.87; 0.89 (0.79–0.96) | 0.87; 0.86 (0.74–0.94) | 0.70; 0.62 (0.40–0.82)

On reassessment, the agreement on the overall grade was 0.84 using Kendall's W and 0.82 using ICC (Table 3). Reassessment according to imaging modality resulted in the highest agreement for MRI and CTA followed by DSA (Table 4).

The Pollock-Flickinger Radiosurgery-Based Scale

On initial assessment, agreement on the overall grade treated as a categorical ordinal variable was 0.81 using Kendall's W and 0.89 using ICC (Table 5). Agreement on overall grade treated as a continuous variable was 0.98 using ICC. Agreement for volume treated as a continuous variable using ICC was also 0.98. Assessment on overall score according to imaging modality resulted in the highest agreement for CTA followed by DSA and MRI (Table 6). With respect to location, agreement was comparable for MRI and CTA followed by DSA. For volume, CTA was superior to MRI and DSA. An example for disagreement on location is provided (Fig. 3).

TABLE 5:

Pollock-Flickinger radiosurgery-based grading scale*

Parameter | κ (95% CI) | W | ICC (95% CI)
overall (categ) | — | 0.81 | 0.89 (0.85–0.94)
overall (cont) | — | — | 0.98 (0.97–0.99)
location | 0.63 (0.47–0.79) | — | —
volume (cont) | — | — | 0.98 (0.85–0.94)

* categ = categorical; cont = continuous.

TABLE 6:

Pollock-Flickinger radiosurgery-based grading scale by modality*

Parameter | MRI | CTA | DSA
overall (categ) (W; ICC, 95% CI) | 0.72; 0.64 (0.43–0.79) | 0.89; 0.86 (0.73–0.94) | 0.78; 0.72 (0.52–0.87)
overall (cont) (ICC, 95% CI) | 0.86 (0.74–0.94) | 0.98 (0.97–0.99) | 0.63 (0.41–0.83)
location (κ, 95% CI) | 0.74 (0.58–0.90) | 0.71 (0.55–0.87) | 0.46 (0.30–0.62)
volume (cont) (ICC, 95% CI) | 0.87 (0.75–0.95) | 0.98 (0.97–0.99) | 0.55 (0.33–0.78)

* categ = categorical; cont = continuous.

Fig. 3.

AVM 58. Disagreement on location (Pollock-Flickinger radiosurgery-based scale). On initial assessment and reassessment all but 1 rater (a neuroradiologist) assigned a “1” for location. On reassessment, the neuroradiologist justified his rating and commented that the nidus was extraaxial. A: Axial CT angiogram at the level of the nidus. B: Coronal CT angiogram. C: Sagittal CT angiogram.

Intrarater Reliability

Intrarater reliability was assessed from the repeated scans included in the initial assessment; agreement for the individual raters is shown in Tables 7 and 8. Intrarater reliability for the overall 5-tier Spetzler-Martin grade was excellent in 3 of the 5 raters (ICC > 0.75) and fair to good in the other 2 (ICC > 0.40). Intrarater reliability for the overall 3-tier Spetzler-Ponce grade was comparable to the results for the 5-tier Spetzler-Martin scale. The Pollock-Flickinger radiosurgery-based scale had excellent intrarater reliability (ICC > 0.75) for all raters. Rater 1 had noticeably lower intrarater reliability than all other raters, with the overall effect of lowering the interrater reliability of the group.

TABLE 7:

Intrarater reliability*

Scale & Parameter | Rater 1 | Rater 2 | Rater 3 | Rater 4 | Rater 5

Each cell gives tau-b (95% CI); ICC (95% CI), except the "(cont)" rows, which give ICC (95% CI) only.

5-tier SM | 0.58 (0.20–0.96); 0.57 (0.12–0.83) | 0.76 (0.48–1.00); 0.88 (0.69–0.96) | 0.73 (0.57–0.89); 0.78 (0.47–0.92) | 0.73 (0.46–1.00); 0.85 (0.61–0.95) | 0.65 (0.32–0.98); 0.74 (0.39–0.91)
 size | 0.86 (0.67–1.00); 0.86 (0.63–0.95) | 0.89 (0.70–1.00); 0.92 (0.78–0.97) | 0.79 (0.64–0.94); 0.76 (0.38–0.91) | 1.00 (1.00–1.00); 1.00 (1.00–1.00) | 0.89 (0.70–1.00); 0.92 (0.78–0.97)
 eloquence | (see Table 8)
 drainage | (see Table 8)
3-tier SP | 0.40 (0.00–0.87); 0.43 (0.00–0.76) | 0.76 (0.48–1.00); 0.84 (0.59–0.94) | 0.66 (0.40–0.92); 0.74 (0.38–0.91) | 0.66 (0.37–0.95); 0.76 (0.41–0.91) | 0.66 (0.25–1.00); 0.67 (0.27–0.88)
Pollock-Flickinger
 total (categ) | 0.70 (0.50–0.90); 0.82 (0.54–0.94) | 0.70 (0.50–0.90); 0.82 (0.54–0.94) | 0.76 (0.61–0.91); 0.86 (0.65–0.95) | 0.84 (0.72–0.96); 0.87 (0.63–0.96) | 0.70 (0.35–1.00); 0.75 (0.39–0.91)
 total (cont) | 0.98 (0.96–1.00) | 0.99 (0.96–1.00) | 0.98 (0.95–0.99) | 0.97 (0.92–0.99) | 0.97 (0.89–0.99)
 location | (see Table 8)
 volume (cont) | 0.98 (0.95–0.99) | 0.99 (0.99–1.00) | 0.98 (0.95–0.99) | 0.97 (0.92–0.99) | 0.96 (0.89–0.99)

* SM = Spetzler-Martin scale; SP = Spetzler-Ponce scale; tau-b = Kendall's tau-b; categ = categorical; cont = continuous.

TABLE 8:

Intrarater reliability for the parameters "eloquence," "drainage," and "location"*

Scale & Parameter | Rater 1 | Rater 2 | Rater 3 | Rater 4 | Rater 5

Each cell gives κ (95% CI).

5-tier SM
 eloquence | 0.74 (0.41–1.00) | 0.87 (0.61–1.00) | 0.87 (0.62–1.00) | 0.60 (0.19–1.00) | 0.71 (0.34–1.00)
 drainage | 0.87 (0.61–1.00) | 0.67 (0.26–1.00) | 0.57 (0.15–1.00) | 0.70 (0.32–1.00) | 0.53 (0.06–0.99)
Pollock-Flickinger
 location | 0.47 (0.02–0.93) | 1.00 (1.00–1.00) | 0.84 (0.55–1.00) | 0.82 (0.47–1.00) | 1.00 (1.00–1.00)

* Missing in Table 7.

Discussion

In this study we sought to evaluate observer agreement for common AVM grading scales using imaging modalities routinely obtained for diagnosis. Grading scales for AVMs are critical in the decision-making process for determining the preferred treatment modality for an individual patient, as well as for facilitating communication between researchers regarding care. They are expected to be robust and to yield similar results among multispecialty observers involved in the management of the disease. Observer reliability has been evaluated for the 5-tier Spetzler-Martin scale (Table 9),1,7,12,13,15,22 but it had not been evaluated for the 3-tier Spetzler-Ponce scale or the Pollock-Flickinger radiosurgery-based scale prior to this study. Another objective of the study was to assess the relative value of CTA and MRI in the determination of AVM grades and in comparison with DSA, as this has not been previously assessed.

TABLE 9:

Interrater and intrarater reliability of the 5-tier Spetzler-Martin grading scale*

Authors & Year | Raters | Modalities | Interrater Reliability (Overall; Size; Eloquence; Drainage) | Intrarater Reliability (Overall; Size; Eloquence; Drainage)
Spetzler & Martin, 1986 | 3 neurosurgeons | DSA | agreement in 92% of all cases; agreement in 100% of all cases; 8% disagreement; agreement in 100% of cases | NR; NR; NR; NR
Al-Shahi et al., 2002 | 5 neuroradiologists | DSA | κ = 0.47; NR; NR; NR | κ = 0.63; NR; NR; NR
Du et al., 2005 | 1 neurosurgeon, 1 neuroradiologist | DSA, MRI | κ = 0.61; κ = 0.67; κ = 0.71; κ = 0.9 | NR; NR; NR; NR
Iancu-Gontard et al., 2007 | 2 neuroradiologists | DSA (CT, MRI) | κ = 0.7; κ = 0.75; κ = 0.71; κ = 0.8 | κ = 0.75; κ = 0.81; κ = 0.86; κ = 0.76
present study | 1 vascular neurosurgeon, 2 neuroradiologists, 2 neurosurgical residents | MRI, CTA, DSA | W = 0.73/0.86, ICC = 0.69/0.82; W = 0.78/0.87, ICC = 0.78/0.86; κ = 0.39/0.71; κ = 0.53/0.74 | tau-b = 0.58–0.76, ICC = 0.57–0.88; tau-b = 0.79–1.00, ICC = 0.76–1.00; κ = 0.60–0.87; κ = 0.53–0.87

* NR = not reported. For the present study, interrater values are given as initial assessment/reassessment and intrarater values are given as the range for all 5 raters.

The 5-Tier Spetzler-Martin Scale

The 5-tier Spetzler-Martin scale is the most established AVM classification system, and its reliability has been assessed in several studies. As part of the original publication, a reliability assessment using DSA found complete agreement among all observers in 92% of cases. The observers included one of the authors and 2 other neurosurgeons. In 2 cases, the grades differed by a single point due to disagreement on whether the AVM involved eloquent cortex.22 A study by Du et al. compared assessments of preoperative angiograms by a neurosurgeon and an interventional neuroradiologist and reported substantial (κ = 0.61) agreement for overall 5-tier Spetzler-Martin grade. Of the individual components, agreement was highest for venous drainage (κ = 0.9), followed by eloquence (κ = 0.71) and size (κ = 0.67).7 There was a trend for the neurosurgeon to give higher overall grades and higher scores for size, explained by the influence of the neurosurgeon's surgical experience, as dissection planes are usually wider than the arterial outlines.7 In the present study, neither the raters' specialty (neurosurgery or neuroradiology) nor the level of training had a significant impact on the level of agreement. It is noted that Du et al. assessed interrater agreement for overall Spetzler-Martin grade and nidus size using Cohen's κ, a reliability measure more appropriately reserved for binary and nominal ratings; interrater reliability methods more suitable for ordinal grading elements (for example, Kendall's W and ICC) exist. The agreement between the neurosurgeon and interventional radiologist was greater than the agreement reported in Al-Shahi and colleagues' study of 5 interventional radiologists, in which interrater reliability was only moderate (κ = 0.47). However, intrarater agreement in that same study was substantial (κ = 0.63), and the study included variables not considered in the Spetzler-Martin scale, such as venous stenosis, venous ectasia, angiogenesis, angiopathy, diffuse versus compact nidus border, and presence or absence of aneurysms, which generally yielded slight to fair interrater reliability (κ ≤ 0.40).1 Iancu-Gontard et al. assessed agreement of 2 interventional neuroradiologists in a study that included axial CT and/or MRI in addition to angiography and found substantial to almost perfect inter- and intrarater reliability for the overall 5-tier Spetzler-Martin grade and its individual components.15 The authors did not provide details on the axial imaging assessed.

Caution is advised against directly comparing κ between different studies and populations, since κ is affected by well-documented issues relating to the number of rating categories along with paradoxical effects of rater bias and prevalence of the outcome in the underlying population.2,3 There are also numerous κ-like statistics that are used to report interrater reliability, and many studies fail to report which κ variant was used to calculate the statistics.14 Because of numerous criticisms of the kappa index by researchers, ICC and Kendall's W are increasingly preferred for assessing inter- and intrarater reliability of non-nominal data.6,11,24

In the present study, agreement on the overall grade was substantial and increased on reassessment; it would have been even higher had Rater 1 been excluded. When individual components were analyzed, agreement was greatest with respect to size, followed by drainage and eloquence, contrary to previous studies in which size was the least reliable parameter.7 When comparing modalities, agreement on the overall grade was actually higher for MRI and CTA than for DSA. The finding that MRI had the highest agreement on eloquence is expected given its spatial capabilities. Likewise, CTA and DSA had superior agreement on venous drainage when compared with MRI, which lacks detailed imaging of the vascular architecture. Thus, CTA appears to be adequate for determination of AVM grade using the various grading scales. The purpose of this study was to determine the individual observer reliabilities of the different imaging modalities routinely used for the assessment of AVMs. Nevertheless, in clinical practice the different imaging modalities have to be viewed as complementary and assessed in combination, as each is essential for a comprehensive evaluation of an AVM.

The 3-Tier Spetzler-Ponce Scale

The 3-tier Spetzler-Ponce scale combines Spetzler-Martin Grades I with II and IV with V. The individual components evaluated are identical to those of the 5-tier Spetzler-Martin scale; thus observer reliability can be expected to correlate between 3- and 5-tier scales. The agreement on overall 3-tier Spetzler-Ponce grade was comparable to that on the 5-tier Spetzler-Martin grade in our study. The 3-tier Spetzler-Ponce scale was found to be accurate in predicting outcome23 and, as demonstrated in this study, is comparable to the 5-tier scale from a reliability standpoint.

The Pollock-Flickinger Radiosurgery-Based Scale

Observer reliability for the Pollock-Flickinger radiosurgery-based scale has not been evaluated previously, even in the original manuscript. We found the overall agreement to range from excellent to extremely strong. Of the individual modalities evaluated, no single type of imaging study proved to be superior to another.

Intrarater Reliability

Intrarater reliability was substantial for all raters across all the scales assessed: the 5-tier Spetzler-Martin scale, the 3-tier Spetzler-Ponce scale, and the Pollock-Flickinger radiosurgery-based scale. Only Rater 1 showed generally lower intrarater agreement than the other raters; this rater's assessments may have lowered the measured overall agreement below its true level. Including this rater, intrarater reliability was comparable between the 5-tier Spetzler-Martin and 3-tier Spetzler-Ponce scales and generally highest for the Pollock-Flickinger radiosurgery-based scale. This is not surprising given the weight of size in the overall Pollock-Flickinger radiosurgery-based score and the superior agreement on size observed in the present study.

Conclusions

The 5-tier Spetzler-Martin scale, the 3-tier Spetzler-Ponce scale, and the Pollock-Flickinger radiosurgery-based scale all achieved a high level of agreement. When observers who routinely interpret AVM imaging as part of their clinical duties were provided only with the original descriptions of the grading scales, overall reliability was lower than on reassessment.7–10 On reassessment, observers were asked to justify their ratings with comments and illustrations. We hypothesized that results from the initial assessment more closely represent the observer reliability to be expected when the grading scales are applied as part of routine clinical practice, and that the requirement to defend the rating with comments and illustrations during reassessment, in conjunction with the experience gained during the initial assessment, produced the improved reliability. The findings of this study highlight the importance of training personnel whenever a grading scale is applied clinically or for scientific purposes, even if the raters are specialists familiar with the topic and the imaging modalities.

Acknowledgment

We acknowledge Kevin L. Junck, Ph.D., Department of Radiology, University of Alabama at Birmingham, for his assistance with this project.

Disclosure

The authors report no conflict of interest concerning the materials or methods used in this study or the findings specified in this paper.

Author contributions to the study and manuscript preparation include the following. Conception and design: Griessenauer, Walters. Acquisition of data: Griessenauer, Miller, Fisher, Curé, Chapman, Witcher, Fisher. Analysis and interpretation of data: Griessenauer, Miller, Walters. Drafting the article: Griessenauer. Critically revising the article: Griessenauer, Miller, Agee, Walters. Reviewed submitted version of manuscript: Griessenauer, Miller, Agee, Fisher, Foreman, Walters. Approved the final version of the manuscript on behalf of all authors: Griessenauer. Statistical analysis: Griessenauer, Agee. Administrative/technical/material support: Griessenauer. Study supervision: Griessenauer.

References

1. Al-Shahi R, Pal N, Lewis SC, Bhattacharya JJ, Sellar RJ, Warlow CP: Observer agreement in the angiographic assessment of arteriovenous malformations of the brain. Stroke 33:1501-1508, 2002
2. Banerjee M: Beyond kappa: a review of interrater agreement measures. Can J Stat 27:3-23, 1999
3. Brenner H, Kliebsch U: Dependence of weighted kappa coefficients on the number of categories. Epidemiology 7:199-202, 1996
4. Chen B, Zaebst D, Seel L: A macro to calculate kappa statistics for categorizations by multiple raters. Proceedings of the Thirtieth Annual SAS Users Group International Conference (http://www2.sas.com/proceedings/sugi30/155-30.pdf) [Accessed February 9, 2014]
5. Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37-46, 1960
6. Czuczman AD, Thomas LE, Boulanger AB, Peak DA, Senecal EL, Brown DF: Interpreting red blood cells in lumbar puncture: distinguishing true subarachnoid hemorrhage from traumatic tap. Acad Emerg Med 20:247-256, 2013
7. Du R, Dowd CF, Johnston SC, Young WL, Lawton MT: Interobserver variability in grading of brain arteriovenous malformations using the Spetzler-Martin system. Neurosurgery 57:668-675, 2005
8. Fleiss J: The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons, 1986
9. Fleiss J: Measuring nominal scale agreement among many raters. Psychol Bull 76:378-382, 1971
10. Fleiss J: Statistical Methods for Rates and Proportions, ed 2. New York: John Wiley & Sons, 1981
11. Gwet K: Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters, ed 3. Gaithersburg, MD: Advanced Analytics Press, 2012
12. Hadizadeh DR, Kukuk GM, Steck DT, Gieseke J, Urbach H, Tschampa HJ: Noninvasive evaluation of cerebral arteriovenous malformations by 4D-MRA for preoperative planning and postoperative follow-up in 56 patients: comparison with DSA and intraoperative findings. AJNR Am J Neuroradiol 33:1095-1101, 2012
13. Hadizadeh DR, von Falkenhausen M, Gieseke J, Meyer B, Urbach H, Hoogeveen R: Cerebral arteriovenous malformation: Spetzler-Martin classification at subsecond-temporal-resolution four-dimensional MR angiography compared with that at DSA. Radiology 246:205-213, 2008
14. Hallgren KA: Computing inter-rater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol 8:23-34, 2012
15. Iancu-Gontard D, Weill A, Guilbert F, Nguyen T, Raymond J, Roy D: Inter- and intraobserver variability in the assessment of brain arteriovenous malformation angioarchitecture and endovascular treatment results. AJNR Am J Neuroradiol 28:524-527, 2007
16. Kendall M, Gibbons JD: Rank Correlation Methods, ed 5. London: Edward Arnold, 1990
17. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics 33:159-174, 1977
18. McGraw KO, Wong SP: Forming inferences about some intraclass correlation coefficients. Psychol Methods 1:30-46, 1996
19. Pollock BE, Flickinger JC: Modification of the radiosurgery-based arteriovenous malformation grading system. Neurosurgery 63:239-243, 2008
20. Pollock BE, Flickinger JC: A proposed radiosurgery-based grading system for arteriovenous malformations. J Neurosurg 96:79-85, 2002
21. Shrout PE, Fleiss JL: Intraclass correlations: uses in assessing rater reliability. Psychol Bull 86:420-428, 1979
22. Spetzler RF, Martin NA: A proposed grading system for arteriovenous malformations. J Neurosurg 65:476-483, 1986
23. Spetzler RF, Ponce FA: A 3-tier classification of cerebral arteriovenous malformations. Clinical article. J Neurosurg 114:842-849, 2011
24. Thaler M, Lechner R, Gstöttner M, Luegmair M, Liebensteiner M, Nogler M: Interrater and intrarater reliability of the Kuntz et al new deformity classification system. Neurosurgery 71:47-57, 2012
25. Walter SD, Eliasziw M, Donner A: Sample size and optimal designs for reliability studies. Stat Med 17:101-110, 1998

Article Information

Address correspondence to: Christoph J. Griessenauer, M.D., 1530 3rd Ave. S., Birmingham, AL 35294. email: cgriessenauer@uabmc.edu.

Please include this information when citing this paper: published online March 14, 2014; DOI: 10.3171/2014.2.JNS131262.

© AANS, except where prohibited by US copyright law.

Figures

    AVM 7. Disagreement on drainage (Spetzler-Martin/Spetzler-Ponce scale). On initial assessment and reassessment, all but 1 rater (a neuroradiologist) assigned a “0” for drainage. On reassessment the neuroradiologist highlighted a draining vein with arrows (B and D) to justify his rating. A: Lateral view of left carotid artery injection (arterial phase). B: Lateral view of left carotid artery injection (venous phase). C: Anteroposterior view of left carotid artery injection (arterial phase). D: Anteroposterior view of left carotid artery injection (venous phase).

    AVM 20. Disagreement on eloquence (Spetzler-Martin/Spetzler-Ponce scale). On initial assessment, 2 of the raters (1 neuroradiologist and 1 senior neurosurgery resident) assigned a “0” for eloquence. On reassessment, all but 1 rater (the neuroradiologist) assigned a “1” for eloquence. The neuroradiologist did not change his “0” rating, with the justification that the AVM was anterior to the motor strip. A: Axial CT angiogram at the level of the nidus. B: Coronal CT angiogram. C: Sagittal CT angiogram.

    AVM 58. Disagreement on location (Pollock-Flickinger radiosurgery-based scale). On initial assessment and reassessment all but 1 rater (a neuroradiologist) assigned a “1” for location. On reassessment, the neuroradiologist justified his rating and commented that the nidus was extraaxial. A: Axial CT angiogram at the level of the nidus. B: Coronal CT angiogram. C: Sagittal CT angiogram.
