| Abstract |
|---|
|
|
|---|
| Series Outline |
|---|
|
|
|---|
Correspondence: Address all correspondence and requests for reprints to: Dr. Gordon Guyatt, McMaster University, Department of Clinical Epidemiology and Biostatistics, 1200 Main Street West, Room 2C12, Hamilton, Ontario, L8N 3Z5, Canada. E-mail: guyatt{at}McMaster.ca
| A. Introduction |
|---|
|
|
|---|
| B. The importance of osteoporosis |
|---|
|
|
|---|
There are limitations associated with the WHO definition. The predictive value of bone density measurement for fractures varies depending on the site selected, comparison database, and the technology used. Thus, t-scores do not provide a good basis for establishing comparable diagnostic thresholds between sites and techniques (5). This between-site and between-technique variability introduces a potential for misclassification and the unnecessary treatment of some individuals. Furthermore, how to apply the WHO criterion to men, and to different ethnic groups, is not clear. These limitations reflect the state of development of the science of diagnosis and treatment of osteoporosis.
The most important complications of osteoporosis are fractures of the hip, forearm, and vertebrae. Excess mortality associated with hip fractures in older women in the first year may be as high as 20%, and the increased mortality risk may persist for several years (6). The excess mortality, however, may not be directly attributable to the hip fracture and may be secondary to poor general health status and underlying medical conditions (7, 8). Most studies have not evaluated whether the mortality after a fracture is due to the fracture itself or to underlying medical conditions. Browner et al. (8) noted in a prospective cohort of 9198 elderly women that there were 361 fractures of the hip or pelvis, and 69% of the deaths did not have a clear relationship to the fracture (7).
Cauley et al. (6) have recently demonstrated an excess mortality in women who experienced a clinical vertebral fracture.
The cumulative lifetime fracture risk for a 50-yr-old woman with osteoporosis is as high as 60% (9). Effective fracture prevention would have a major impact on women's morbidity and a smaller but still important impact on mortality.
Patients suffering from osteoporosis are concerned about pain and impaired quality of life associated with vertebral and nonvertebral fractures. Fortunately for patients, but unfortunately for the efficient conduct of clinical trials, clinical fractures are relatively rare events. Therefore, investigators have often focused on treatment effects on bone density, hoping it will provide insight into the impact on fractures. Indeed, population studies have shown that bone density is an independent predictor of fractures, and there is a consistent relationship between bone density and fracture. Because of clinicians' interest in the impact of treatment on bone density, we have included both bone density and fractures among the outcomes we have evaluated in this series. In the final section, we will examine the relation between changes in bone density and fracture reduction.
| C. Problems of a world without systematic reviews |
|---|
|
|
|---|
Typically, authors of traditional reviews make little or no attempt to be systematic in their formulation of the questions they are addressing, their search for relevant evidence, or their summary of the evidence they consider. Reviewers may selectively cite studies that are concordant with their prior opinions. Finally, if their summary of the findings is not systematic, reviewers run the risk of presenting a distorted picture of the overall impact of treatment. The authors' expertise in the area of the review article's content may not guard against these problems. Indeed, in one study, self-rated expertise was inversely related to the methodological rigor of the review (10). A recent critical appraisal of 158 review articles in general medicine journals revealed that only a minority of the review articles specified rigorous, systematic methods of identifying, evaluating, and synthesizing the evidence (11).
Because of these limitations of unsystematic reviews, experts often make recommendations that disagree with one another, and their advice may lag behind or be inconsistent with the best available evidence. For example, in the cardiovascular area, data from randomized trials demonstrated mortality reduction with thrombolytic therapy and cholesterol-lowering agents a decade before experts were consistently recommending treatment with these agents (12). In addition, experts recommended routine use of lidocaine and calcium channel antagonists despite data from randomized trials showing trends toward increased mortality (13).
Systematic reviews will not protect clinicians against biased assessments of treatment effect unless the primary data come from methodologically strong studies. Studies that are less rigorous methodologically tend to overestimate the effectiveness of interventions, whereas consistent results from studies that are methodologically strong are more compelling (14, 15). For instance, experts have made strong recommendations for use of postmenopausal HRT on the basis of systematic reviews of observational studies suggesting a 50% reduction in cardiovascular risk in users vs. nonusers (15). The first large, randomized controlled trial (RCT) of HRT in postmenopausal women at high risk for cardiovascular events has demonstrated no benefit in cardiovascular events and an increase in thromboembolism (16). If they are to yield unbiased estimates of treatment impact, systematic reviews must focus on the results of RCTs.
RCTs have certain limitations. Summaries of published RCTs may produce biased results if publication bias has resulted in trials with less sanguine results not having appeared in the literature. Therefore, we pay close attention to issues of potential publication bias in our reviews. RCTs often fail to carefully measure all potential side-effects and are underpowered for rare effects. Thus, meta-analyses of RCTs generally cannot provide definitive information about drug toxicity, and this is true of the meta-analyses of osteoporosis therapies in this series. Finally, RCTs enroll select populations, and results may not be fully generalizable to patient populations that clinicians see in frontline practice. Nevertheless, the risk of bias in observational studies, and the balance of both known and unknown prognostic factors that RCTs provide, justifies a focus on RCTs in systematic reviews that address treatment benefits.
The limitations of unsystematic reviews and methodologically weak studies suggest that clinicians treating osteoporosis patients, and experts providing recommendations for those clinicians, could benefit from systematic summaries of RCTs regarding the effectiveness of osteoporosis treatments. Ultimately, we hope our meta-analyses can be used to help formulate practice guidelines and establish health policy (17). The next few paragraphs of this article summarize the process of conducting a systematic review.
| D. What is a systematic review? |
|---|
|
|
|---|
|
Such approaches have the merit of a parsimonious summary of information that may otherwise be difficult for readers to integrate. Their limitations include the arbitrariness that is inevitable in placing relative weights on different aspects of methodological quality. A second limitation is that established summary measures do not take into account recent developments in methodology. For example, existing summary measures often refer to double blinding. Investigators have recently demonstrated that both clinicians and expert methodologists do not agree on who is blind in double-blind studies (21), suggesting that specification of exactly who was blind to allocation is important.
As a result of these limitations, reviewers may choose to evaluate quality by examining the individual components that bear on the likelihood of an RCT yielding an unbiased estimate of treatment effect. In general, we have used this latter approach in our reviews.
Finally, the reviewers will summarize the data including, if appropriate, a quantitative synthesis or meta-analysis (22). The analysis includes an examination of the features of the eligible studies; an attempt to explain discrepancies in results on the basis of differences in patients, interventions, measurement of outcome, or methodological differences (exploring heterogeneity); a summary of the overall results; and an assessment of the precision and validity of those results.
| E. Eligibility criteria for these reviews: defining the questions |
|---|
|
|
|---|
This rule helps, but, because the expectation of a similar magnitude of effect is open to judgment, it does not solve the problem of selecting the right range of patients, interventions, and outcomes. There are a number of advantages to choosing to pool broadly. First, broad pooling reduces the likelihood of making spurious inferences about subgroup differences that are actually chance phenomena (23, 24). Second, if the magnitude of effect is indeed similar across the range of patients, interventions, and outcomes, broad eligibility criteria will result in a larger number of eligible studies and therefore narrower confidence intervals. Finally, the clinician will be able to generalize the results across a broader group of patients and a broader range of ways of administering the intervention.
On the other hand, if the magnitude of effect varies across the range of patients or interventions, pooling will provide misleading results. If, for instance, treatment does not reduce fractures in patients with low bone density but no prior vertebral fractures, but results in a large reduction in fracture rate in women who have a previous vertebral fracture, pooling across these patient groups will produce an apparent magnitude of effect that is inaccurate for either patient group.
Fortunately, meta-analysts have a solution to this dilemma. When they have completed their initial analysis, investigators can examine the results to see the extent of variability in results across studies. This allows them to check the initial assumption on which pooling is based: that the magnitude of intervention effects is similar across the range of patients, interventions, outcomes, and methodologies included in the analysis.
In general, we chose to pool broadly. We then evaluated our results, checking carefully to see if we could find explanations for variability or heterogeneity of the results across studies. To minimize the likelihood of making spurious inferences about subgroup effects, we specified a priori hypotheses concerning possible explanations of heterogeneity. In describing the methodology of our reviews, we outline our eligibility criteria and subgroup hypotheses. We describe the statistical methodology by which we tested to see if our a priori hypotheses could explain variability in study results in the section of this article devoted to statistical analysis.
Before presenting the summary of the methodological approaches we chose in conducting our reviews, we describe the background of our group.
| F. Background of the group |
|---|
|
|
|---|
Initially, Merck provided the initiative for a group of experts to lead an evidence-based review of pharmacological interventions for osteoporosis, and Merck provided most of the funds for this project. Procter & Gamble contributed a small grant. All funds were provided as unrestricted educational grants, and the Ottawa group received no Merck funds.
After consultation on accepting funding from companies directly involved in the products being reviewed, ORAG requested that the companies not be present for methodological discussions involved in preparing the review. Thus, Merck personnel provided input into initial, but not later, ORAG discussions. Merck did not attend the later ORAG meetings at which ORAG made their final recommendations regarding content and presentation of data. We consulted with each company about the reviews of their products, but no company had any part in writing the manuscripts.
The source of funding and our interactions could suggest a possible threat to the objectivity of our reviews. Aware of this threat, we have tried in all ways to remain at arm's length from the companies and have endeavored to be scrupulous both in our methods and our conclusions. Reviewers and readers must ultimately judge our success in providing an unbiased summary of the data.
Having summarized the process of conducting a systematic review, and having provided the background of our group, we will review the specific methodology of the reviews we undertook. To minimize the likelihood of bias, we developed our methods in advance and applied them consistently across all the reviews.
| G. Methodology |
|---|
|
|
|---|
1. Specify eligibility criteria.
a. Population.
We included only studies of postmenopausal women (defined as greater than 6 months postmenopausal) presented in the international literature. In the first instance, we pooled across all patients, irrespective of their degree of osteoporosis. However, we hypothesized that the magnitude of the treatment effect may vary in early postmenopausal women with bone density in the normal or near-normal range (prevention) vs. women with established osteoporosis (treatment). Because studies defined their populations differently, we developed a hierarchy of criteria for separating studies into primary and secondary prevention trials. If the t-score was available, we separated studies that restricted their population to women whose bone density was at least 2 SD values below peak bone mass from those that included women whose bone density was within 2 SD of the mean. Because of controversy about this cut point, for the alendronate and risedronate analyses we also analyzed the data using a cut point of -2.5 SD.
When bone density measurements were not available, we classified studies as treatment if the prevalence of vertebral fracture at baseline was greater than 20%. If these data were not available, we considered studies in which the average age was above 62 as treatment studies.
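The hierarchy of criteria described above can be sketched in code. This is a minimal illustration rather than the procedure as we implemented it: the function name, the parameter names, and the "unclassified" fallback are hypothetical, and representing the bone density entry criterion as the highest t-score a trial admitted is an assumption made for the sketch.

```python
from typing import Optional


def classify_trial(
    max_entry_t_score: Optional[float] = None,
    baseline_vertebral_fracture_prevalence: Optional[float] = None,
    mean_age: Optional[float] = None,
    sd_cut_point: float = -2.0,
) -> str:
    """Classify a trial as 'treatment' or 'prevention' using the first criterion available.

    max_entry_t_score is a hypothetical field: the highest t-score a woman could
    have and still be enrolled (None if the trial reported no such criterion).
    """
    # 1. Bone density entry criterion, if the trial reported one.
    if max_entry_t_score is not None:
        return "treatment" if max_entry_t_score <= sd_cut_point else "prevention"
    # 2. Otherwise, prevalence of vertebral fracture at baseline (greater than 20%).
    if baseline_vertebral_fracture_prevalence is not None:
        if baseline_vertebral_fracture_prevalence > 0.20:
            return "treatment"
        return "prevention"
    # 3. Otherwise, average age of participants (above 62 yr).
    if mean_age is not None:
        return "treatment" if mean_age > 62 else "prevention"
    # None of the three criteria was reported (fallback label is an assumption).
    return "unclassified"
```

Passing `sd_cut_point=-2.5` reproduces the alternative cut point used in the sensitivity analysis for alendronate and risedronate.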
Because of the wide heterogeneity of the severity of osteoporosis in women with decreased bone density, when there were sufficient data available, we also looked at two subgroups within those trials with an entry criterion of lumbar spine bone mineral density at least 2 SD values below the mean for young adult women. We grouped trials with a prevalent fracture rate of greater than 10% at study inclusion in comparison to those without fractures or prevalence of 10% or below.
b. Interventions.
We included only studies that followed patients for at least 1 yr. Initially, we chose to pool across all doses and all durations of treatment. A priori hypotheses included that both dose and duration of therapy might be responsible for differences in the magnitude of treatment effect across studies.
For treatments other than calcium and vitamin D, we initially pooled across all studies irrespective of calcium and vitamin D intake. We hypothesized, however, that treatment effect might differ across varying levels of baseline calcium intake. We used a cut point of 1250 mg (at least 1250 mg of total calcium intake on average for participants vs. <1250 mg). We also hypothesized that vitamin D intake might lead to differences in treatment effect (and divided studies into those in which there was any vitamin D supplementation vs. those in which there was not).
c. Outcomes.
We included only studies that reported on the effect of treatment on either fracture incidence (vertebral or nonvertebral) or bone density. We conducted separate meta-analyses for vertebral and nonvertebral fractures. We constructed an a priori hypothesis that treatment effects would be larger in nonvertebral fracture sites in which a prior study had shown an association between low calcaneal bone density and the particular type of fracture (relative risk of fracture 1.5 or greater) (26). These sites include forearm, hip, rib, leg, patella, pelvis, and hands (osteoporotic fractures) vs. all fractures in which the risk was less than 1.5 (nonosteoporotic fractures).
We conducted separate meta-analyses for bone density at the forearm, hip, and lumbar spine sites. For hip bone density, we pooled across the different measurement sites, and used the same strategy for forearm sites. If a trial reported more than one hip site, the order of preference selected was: total hip, femoral neck, and trochanter. For the forearm, the order of preference was one third distal radius and ulna and then one third distal radius. We refer to the pooling across sites as the combined hip and combined forearm.
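The order of preference among measurement sites amounts to taking the first reported site from a fixed list. A minimal sketch follows; the site labels, the `pick_site` helper, and the dictionary layout of a trial's results are hypothetical and chosen only for illustration.

```python
from typing import Mapping, Optional, Sequence

# Orders of preference as stated in the text.
HIP_PREFERENCE = ("total hip", "femoral neck", "trochanter")
FOREARM_PREFERENCE = (
    "one third distal radius and ulna",
    "one third distal radius",
)


def pick_site(reported: Mapping[str, float], preference: Sequence[str]) -> Optional[str]:
    """Return the most preferred site that the trial actually reported, if any."""
    for site in preference:
        if site in reported:
            return site
    return None
```

For a trial reporting only femoral neck and trochanter results, `pick_site` would select the femoral neck value for the combined hip analysis.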
We conducted separate meta-analyses examining the proportion of patients who were unable to continue the study medication because of side effects.
2. Literature search and selection.
The Cochrane search strategy outlined by Haynes et al. (27) and Dickersin et al. (28) provided the basis for our search for relevant literature. For published data, this included a search of electronic databases including MEDLINE, EMBASE, Current Contents, and the Cochrane Controlled Trials Registry using a time frame from 1966 to 1999. There were no language restrictions applied to the search strategy. We hand-searched conference abstract books from international meetings and the results of Food and Drug Administration proceedings. We reviewed citations of relevant articles and enlisted the collaboration of the company that developed and manufactured the drug under study. For investigator-initiated trials, we sought information from authors, generally with good success.
The inclusion of unpublished studies remains controversial, but omission of unpublished studies increases the chance that studies with positive results will be over-represented (29, 30). Many studies have documented that trials with positive treatment effects are more likely to be published, and publication bias therefore threatens the validity of systematic reviews (31, 32). Abstract books from conferences, Food and Drug Administration proceedings, and company contacts all provided sources of unpublished studies. In addition, ORAG members, many of whom had worked for many years in osteoporosis research, were aware of unpublished studies.
Another threat to the validity of meta-analyses is the issue of duplicate publications. Trials with positive results tend to be published more than once (33). When we suspected a duplicate publication, we contacted the original author to identify the more complete data set.
3. Application of eligibility criteria.
Two reviewers evaluated titles and abstracts from the search and identified potentially eligible studies. We obtained the full articles of these potentially eligible titles and abstracts, and two reviewers used these full articles to make final judgments about eligibility (34, 35). We calculated chance-corrected agreement on study selection using the κ statistic for these judgments (36). Reviewers resolved disagreements by consensus.
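Chance-corrected agreement between two reviewers on include/exclude decisions can be computed as follows. The sketch assumes the unweighted two-rater form of the statistic (Cohen's kappa); the function name and the counts in the usage note are hypothetical.

```python
def cohens_kappa(both_include: int, only_a: int, only_b: int, both_exclude: int) -> float:
    """Unweighted kappa for two reviewers making include/exclude judgments."""
    n = both_include + only_a + only_b + both_exclude
    if n == 0:
        raise ValueError("no judgments to compare")
    # Proportion of studies on which the two reviewers agreed.
    observed = (both_include + both_exclude) / n
    # Agreement expected by chance, from each reviewer's marginal inclusion rate.
    a_include = (both_include + only_a) / n
    b_include = (both_include + only_b) / n
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is complete")
    return (observed - expected) / (1 - expected)
```

With 20 studies included by both reviewers, 5 by the first only, 10 by the second only, and 15 excluded by both, observed agreement is 0.70, chance agreement is 0.50, and kappa is 0.40.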
4. Quality assessment.
We included only RCTs (37). We evaluated four aspects of methodological quality: concealment of randomization, blinding, completeness of follow-up, and intention-to-treat analysis. Concealment refers to whether those responsible for determining eligibility are aware of the arm to which the patient will be allocated if the patient enters the trial. Studies are less likely to yield biased results if patients, caregivers, and those assessing outcomes are blind to allocation. We chose three cut points for loss to follow-up: 1%, 5%, and 20%, and used one or more cut points in our analyses depending on the number of studies and the distribution of loss to follow-up. Investigators conduct intention-to-treat analysis when they analyze patients in the arm to which they are randomized, irrespective of whether or not they received the intended treatment. We specified a priori that these four aspects of methodological quality might explain heterogeneity of study results.
5. Data abstraction.
At least two reviewers independently abstracted data including study characteristics, results, and methodological quality and resolved disagreements by consensus.
6. Analysis
a. Method of pooling for bone density.
For each bone density site (lumbar spine, total body, combined hip, and combined forearm), we conducted separate analyses using the difference between the change in bone density for the treatment and the change in the placebo arm. The first challenge of our analyses was to address whether we could legitimately pool across doses and across years. We addressed this issue by constructing regression models in which the independent variables were year and dose and the dependent variable the effect size. The analysis accounted for the covariance between dosage groups within a trial by subtracting the same placebo group.
Using repeated data from the same patients (as we do when we examine the effect of measurement time: 1 yr, 2 yr, etc.) raises the problem of covariation between measurements. The findings of models assessing the impact of year could depend on the extent to which patients' repeated observations are closely related to one another. Thus, the analysis requires an assumption regarding the extent of covariation. We dealt with this issue by conducting two analyses. One used the smallest, and the other the largest, correlations (0.45 and 0.78) between changes in bone density from year to year in one large study for which we had these data available (38). Fortunately, in almost all cases, use of the two correlations yielded essentially identical results.
The principle by which we proceeded was to compare alternative models and eventually choose the most parsimonious, that is, the model with the fewest parameters, allowing greatest pooling but still explaining as much of the variability in treatment effect as possible. We began by comparing a full model with a parameter for each dose of the treatment to a model that ignored dose. Table 1 illustrates this approach using total body bone density for alendronate. In this case, the model with a parameter for each dose explained a statistically significantly greater proportion of the variance than the model that did not consider dose (Table 1, row 1). This indicates that dose explains some of the variability in results from study to study, and demands that we reject pooling across all doses. Next, we compared the full model with a parameter for each dose to a model with a single parameter for the highest two doses (in this case, 20 and 40 mg of alendronate). If the full model explained no statistically significant additional variance, it indicated that doses of 20 and 40 mg had more or less the same effect, and we could pool these two doses in subsequent analyses. This indeed proved to be the case (Table 1, row 2).
This regression approach allowed us to find the most parsimonious model for both dose and year of follow-up for the bone density data. If the analysis suggested that we could pool across years, we used the latest year from each trial in the analysis. For instance, if the regression suggested that results from yr 2 and 3 were similar, in studies that presented both yr 2 and 3 results, we used the latter.
In addition to using the regression analyses to inform our pooling decisions, we used another criterion, which, when it conflicted with the regression approach, took precedence. We pooled across two doses (or years) if the random-effects confidence interval for one of those doses or years was completely contained within the random-effects confidence interval for the other dose or year.
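The containment criterion is simple to state in code. The sketch below is illustrative only: the function names are hypothetical, intervals are given as (lower, upper) pairs, and treating shared end points as "contained" is an assumption.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower limit, upper limit)


def interval_contains(outer: Interval, inner: Interval) -> bool:
    """True if `inner` lies completely within `outer` (end points inclusive)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def may_pool(ci_a: Interval, ci_b: Interval) -> bool:
    """Pool two doses (or years) if either confidence interval contains the other."""
    return interval_contains(ci_a, ci_b) or interval_contains(ci_b, ci_a)
```

Note that two intervals that merely overlap, such as (1.0, 3.0) and (2.0, 4.0), do not satisfy the criterion.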
b. Primary results.
For each separate meta-analysis, we calculated the weighted mean difference in bone density between treatment and control groups using the percentage change from baseline in the treatment and placebo groups and the associated SD values. If the SD values were not reported, we obtained the results from the authors. When authors could not provide us with the SD values, we estimated them using a regression model with duration of follow-up and the control group SD as predictors of missing treatment group SD values. To assess whether the magnitude of heterogeneity (differences in apparent treatment effect across studies) was greater than one might expect by chance, we conducted a test based on the χ² distribution with N-1 degrees of freedom, where N is the number of studies (39). Our criterion for statistical significance of this test was a P value of 0.10.
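The following sketch shows one common way to compute a weighted mean difference and the accompanying heterogeneity statistic. It assumes inverse-variance weights and the usual Q statistic, which is referred to the χ² distribution with N-1 degrees of freedom; the class and function names are hypothetical, and the sketch is not a reproduction of the software used for the reviews.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Arm:
    mean_change: float  # mean percentage change from baseline in bone density
    sd: float           # SD of that percentage change
    n: int              # number of patients analyzed


def mean_difference(treatment: Arm, control: Arm) -> Tuple[float, float]:
    """Difference in mean change (treatment minus control) and its variance."""
    diff = treatment.mean_change - control.mean_change
    variance = treatment.sd ** 2 / treatment.n + control.sd ** 2 / control.n
    return diff, variance


def pooled_difference_and_q(effects: List[Tuple[float, float]]) -> Tuple[float, float, int]:
    """Inverse-variance weighted mean difference, Q statistic, and degrees of freedom.

    `effects` holds one (difference, variance) pair per study.
    """
    weights = [1.0 / variance for _, variance in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared deviations of each study from the pooled value.
    q = sum(w * (d - pooled) ** 2 for w, (d, _) in zip(weights, effects))
    return pooled, q, len(effects) - 1
```

The P value for Q can then be read from χ² tables or obtained from a statistics library and compared with the 0.10 criterion.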
Although some studies have counted patients with more than one fracture as having more than one event, we retained patient as the unit for each of our analyses. Thus, a patient with multiple fractures was counted as only a single event. For each fracture analysis we calculated a risk ratio (a relative risk) using methods described by Fleiss (39). We derived risk ratios by constructing two by two tables for vertebral and nonvertebral fractures.
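Counting patients rather than fractures, and deriving a risk ratio from the resulting two by two table, can be sketched as follows. The function names and example numbers are hypothetical, and the confidence interval uses the common large-sample approximation on the log scale, which is an assumption and may differ in detail from the method we applied.

```python
import math
from typing import Iterable, Tuple


def patients_with_fracture(fractures_per_patient: Iterable[int]) -> int:
    """Count each patient once, however many fractures she had."""
    return sum(1 for count in fractures_per_patient if count > 0)


def risk_ratio(
    events_treatment: int, n_treatment: int,
    events_control: int, n_control: int,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Risk ratio with an approximate 95% confidence interval (log method)."""
    if events_treatment == 0 or events_control == 0:
        raise ValueError("this simple sketch does not handle arms with zero events")
    rr = (events_treatment / n_treatment) / (events_control / n_control)
    # Standard error of log(RR) from the two by two table.
    se_log_rr = math.sqrt(
        1 / events_treatment - 1 / n_treatment + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper
```

For example, a patient record of `[0, 2, 1, 0]` fractures contributes two events, not three, because the second patient is counted once.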
c. Exploring heterogeneity.
Irrespective of whether or not we found statistically significant heterogeneity, for both bone density and fracture analyses we tested whether our a priori hypotheses could explain variability in the magnitude of treatment effects across studies using a procedure described by Hedges and Olkin (40). Table 2 presents the results of one such exploration of heterogeneity using data from bone mineral density in our meta-analysis of the effects of alendronate. The first column of the table states the bone density site, the second the drug dose, the third the duration of therapy, and the fourth the P value associated with the test of heterogeneity. The low P values indicate that, for each site, chance is an unlikely explanation for the variability in results between studies.
Although we consistently used the same approach to explore explanations for differences in study results irrespective of the results of the formal statistical test of heterogeneity, we typically present the detailed results only when these tests met our criterion of a P value of 0.10.
d. Random vs. fixed-effects models.
Two fundamental alternative statistical models are available for meta-analysis. The random-effects model is based on the theory that we are interested in all trials that may be conducted, and the trials we have are a random sample of those studies. The fixed-effects model assumes that we are interested only in deriving a best estimate of the true underlying treatment effect from the trials for which data are available. The fixed-effects model assumes identical underlying treatment effects in the studies, and the variance around each mean depends primarily on the size of the study. The random-effects model includes between-study variability in the assessment of error variance and therefore provides wider, more conservative confidence intervals (41). In general, we prefer a random-effects model both for theoretical reasons (we wish to generalize beyond the sample of patients included in the studies) and practical reasons (in general, we prefer the wider, more conservative confidence intervals that the random-effects model provides). Thus, with one exception, we have used random-effects models as the basis for analysis in all reviews.
With a random-effects model, the size of the study becomes less important and smaller trials have relatively greater weight than in the fixed-effects model. This heavier weighting of smaller studies may be problematic in certain situations. In the raloxifene review, we found two trials that reported fractures. One trial enrolled more than 50 times as many patients as the other (7705 vs. 143), and the two trials yielded disparate estimates of treatment effect. The larger trial (42) showed a significant reduction in vertebral fractures with a narrow confidence interval, whereas the smaller trial (43) showed a trend in favor of the control group with a wide confidence interval. Because the fixed-effects method does not incorporate the large variation in results between the two studies, the results of the larger study dominate the pooled estimate (point estimate of relative risk, 0.65). Further, the confidence interval around the pooled estimate is narrow because the between-study variability is ignored (confidence interval around the relative risk, 0.56–0.75). The random-effects model, however, includes between-study variability in generating the confidence interval around the pooled estimate (point estimate, 0.80). Due to the large variability in the results and nonoverlapping confidence intervals, the confidence interval around the summary estimate was very wide (0.42–1.52). Given the narrow confidence interval [point estimate of relative risk, 0.59 (95% CI 0.50–0.70)] generated by the larger trial, we found the confidence interval generated by the random-effects model implausible. Therefore, in this instance, we decided against pooling of the two trials.
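The shift in weights between the two models can be illustrated with a short sketch. It assumes inverse-variance fixed-effects pooling and the DerSimonian-Laird moment estimator of between-study variance, a widely used random-effects method that may not match our implementation in every detail. The function name is hypothetical, and the inputs in the usage note are invented numbers for one large and one small trial, not the raloxifene data.

```python
import math
from typing import Dict, List


def pool_fixed_and_random(log_effects: List[float], variances: List[float]) -> Dict[str, object]:
    """Pool log relative risks under fixed- and random-effects models."""
    w_fixed = [1.0 / v for v in variances]
    sum_w = sum(w_fixed)
    fixed = sum(w * y for w, y in zip(w_fixed, log_effects)) / sum_w

    # Between-study variance (tau squared) by the method of moments.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(w_fixed, log_effects))
    df = len(log_effects) - 1
    c = sum_w - sum(w ** 2 for w in w_fixed) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau squared to every study's variance,
    # which flattens the weights and favors smaller studies.
    w_random = [1.0 / (v + tau2) for v in variances]
    sum_wr = sum(w_random)
    random_est = sum(w * y for w, y in zip(w_random, log_effects)) / sum_wr

    return {
        "fixed": fixed,
        "fixed_se": math.sqrt(1.0 / sum_w),
        "random": random_est,
        "random_se": math.sqrt(1.0 / sum_wr),
        "tau2": tau2,
        "fixed_weights": w_fixed,
        "random_weights": w_random,
    }
```

Calling `pool_fixed_and_random([math.log(0.6), math.log(1.5)], [0.01, 0.25])` gives the small trial under 4% of the fixed-effects weight but over a third of the random-effects weight, with a correspondingly larger standard error for the random-effects estimate.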
e. Exploring data for publication bias.
Despite our exhaustive attempts to identify and include both published and unpublished studies, we may have failed to identify unpublished studies. To the extent that these studies remained unpublished in part because of small or absent treatment effects, our results will represent an overestimate of the true underlying effect of treatment.
To address this issue, we constructed plots of the relationship between sample size and the magnitude of the treatment effect. Assuming that there is no publication bias, one would anticipate less random error, and thus more consistent results, among larger trials. Smaller trials are subject to greater random error and, thus, would demonstrate greater variability in results. These errors should be symmetrical about the true value, and thus the data should resemble a funnel (Fig. 2A). When the funnel plot suggests asymmetry, investigators must suspect the presence of publication bias (Fig. 2B).
| H. Strengths and limitations of our meta-analyses |
|---|
|
|
|---|
Our study has the limitations of all meta-analyses, and particularly meta-analyses that do not use individual patient data. These limitations include, despite an exhaustive search for unpublished material and an examination of data for patterns that suggest publication bias, a residual susceptibility to unpublished studies with different results. As is often the case, we were frequently left with heterogeneity of study results that we could not explain with our a priori hypotheses. This may, in part, be due to our having to consider studies as a unit in our analyses. If we had access to individual patient data, some of these analyses may have been more revealing.
As a result of our not having individual patient data, readers may find discrepancies between the point estimates of treatment effect that we report and those presented in the initial publication. This is likely to be particularly true for estimates of relative risk of fracture with experimental treatment. There are three primary reasons for these discrepancies. First, the primary investigators may have calculated odds ratios rather than relative risks (and may even have calculated odds ratios but reported them as relative risks). Second, the primary investigators may have reported analyses in which they considered differences in baseline characteristics between groups (so-called adjusted analyses). None of our analyses adjust for baseline differences between groups. Third, the primary investigators may have conducted survival analyses and reported hazard ratios. In simple terms, the hazard ratio represents the average relative risk over time. In general, a survival analysis provides a more sensitive analytic approach than simply considering proportions of patients with the events of interest.
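The first source of discrepancy, odds ratios vs. relative risks, is easy to demonstrate from a single two by two table. The sketch below uses hypothetical function names and invented counts purely for illustration.

```python
def relative_risk(events_t: int, n_t: int, events_c: int, n_c: int) -> float:
    """Ratio of the proportion with the event in treatment vs. control."""
    return (events_t / n_t) / (events_c / n_c)


def odds_ratio(events_t: int, n_t: int, events_c: int, n_c: int) -> float:
    """Ratio of the odds of the event in treatment vs. control."""
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    return odds_t / odds_c
```

With 20 of 100 treated patients and 40 of 100 control patients experiencing a fracture, the relative risk is 0.50 but the odds ratio is 0.375, so the odds ratio makes the treatment appear more effective. When events are rare (for instance, 2 and 4 per 1000), the two measures nearly coincide.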
| I. Conclusions |
|---|
|
|
|---|
| Footnotes |
|---|
| K. Bibliography |
|---|
|
|
|---|