Gateway to Think Tanks
Source Type | Journal Publication |
Normalized Type | Other |
The Advantages of Experimental Designs for Evaluating Sex Education Programs |
Charles E. Metcalf |
Publication Date | 1997 |
Publication Year | 1997 |
Language | English |
Abstract | This paper examines issues related to using experimental designs in evaluations of sex education and abstinence programs for teenagers. In addition to reviewing traditional arguments for using random selection rather than comparison group designs to evaluate social programs, the paper discusses how school-based interventions have adapted different types of random assignment designs. The paper then discusses a variety of evaluation design issues related to risk-reduction programs. For example, it looks critically at the difficulties that can arise in implementing randomized demonstrations. It also examines ethical issues and problems related to program operator behavior and resistance as well as internal and external validity concerns. Evaluators are cautioned about creating an artificial environment that will distort the usefulness and policy relevance of their findings. Potential problems with using random assignment, particularly within schools, to evaluate school-based risk-reduction initiatives are identified. To enhance success, evaluators are urged to resolve complex and sensitive questions during the design process. These questions relate to evaluation objectives, criteria for success, and measurement of these criteria. In interpreting evaluation results, evaluators must also take care to examine how the initiative affects nonparticipants. Finally, the paper looks at the promise—as well as the risks—of using across-school designs, which involve random selection of treatment and control schools rather than students within schools. Although use of these designs has been rare, they have the potential to be extremely useful in evaluating programs to reduce teenagers’ high-risk behavior.

Since the first income-maintenance experiments in the late 1960s, experimental methods that involve the random assignment of a target population to treatment and control groups have proved both feasible and valuable for evaluating social programs and policy interventions. In a properly designed experiment, the control group represents “what would have happened” to the treatment group had it not received the intervention being tested. This approach has been established as the most defensible method for determining the extent to which specific policy interventions affect behavior or outcomes of interest. Randomized experiments have been used to test interventions in welfare reform, including programs affecting teen parents; employment and training; food stamp benefit cash-out; health care delivery; long-term care; offender rehabilitation; domestic violence; family preservation services; and school initiatives for targeted groups of students. A number of randomized experiments of school-based initiatives have been conducted, including the three examples below, which are distinguished by the manner in which random assignment was conducted:

An evaluation of the Upward Bound program (Myers et al., 1993; Myers and Schirm, 1996) is one of three recent studies that broke new ground in measuring the impacts of an existing broad-based program by diverting a nationally representative sample of program applicants into a randomized control group. In a stratified random sample of 70 Upward Bound projects, eligible ninth- through eleventh-grade applicants were randomly assigned to Upward Bound or to a waiting list. Those on the waiting list not later (randomly) selected to fill vacant slots were retained as the control group. (This example illustrates random assignment of targeted students within schools.)
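The within-school assignment design just described can be made concrete with a short sketch. This is an illustrative assumption, not the procedure actually used in the Upward Bound evaluation: a list of hypothetical eligible applicants is split at random into program slots and a waiting-list control group.

```python
# Minimal sketch (not the study's actual procedure): randomly assign eligible
# applicants within one school to program slots or a waiting-list control group.
# Applicant names and group sizes below are hypothetical.
import random

def assign_within_school(applicants, n_slots, seed=0):
    """Shuffle the applicant list and split it into program and waiting-list groups."""
    rng = random.Random(seed)        # fixed seed keeps the assignment reproducible and auditable
    shuffled = list(applicants)
    rng.shuffle(shuffled)
    return shuffled[:n_slots], shuffled[n_slots:]

applicants = [f"student_{i:03d}" for i in range(120)]     # hypothetical eligible applicants
program_group, waiting_list = assign_within_school(applicants, n_slots=60)
print(len(program_group), len(waiting_list))              # 60 60
```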
A number of health curriculum studies have chosen the classroom as the unit of randomization, with a random sample of health classes for a given student cohort chosen for a standardized, experimental curriculum. (This example illustrates random assignment of targeted students within schools, by classroom groupings.)

The Child and Adolescent Trial for Cardiovascular Health (CATCH) was designed to study health behavior interventions, in an elementary school environment, for the primary prevention of cardiovascular disease (Luepker et al., 1996). CATCH used a randomized field trial in four states with 56 intervention and 60 control elementary schools. (This example illustrates random assignment of schools to treatment or control status, with no within-school randomization of students.)

None of these examples relates directly to initiatives designed to promote sexual abstinence among adolescent females or to promote the rejection of other high-risk behavior among teenagers. The examples are highly relevant from a methodological perspective, however, because they illustrate how the strengths of an experimental approach can be adapted in a variety of contexts to help evaluate policy initiatives. The following discussion reviews the conventional arguments for using randomly selected control groups rather than comparison groups to test program or intervention impacts whenever possible. We then examine some issues related to school-based interventions designed to reduce high-risk behavior among teenagers, in light of how those issues relate to evaluation design.

Superiority of Randomly Selected Control Groups to Comparison Groups

The classical statistical methodology underlying randomized experiments requires that we compare two independent random samples—one that receives the intervention of interest—drawn from the same population. When this condition is met, simple statistical tests reveal the likelihood that any observed differences could be due to chance rather than to systematic differences created by the intervention. Random assignment fulfills this condition proactively, provided that neither the sample selection and randomization process nor the method of introducing the intervention creates contaminating effects that could be confused with the intervention’s impact. Comparison group methods, in contrast, use assumptions, measurement of other sources of differences, and statistical models to eliminate differences that could be caused by reasons other than the intervention. If these efforts are successful, a residual difference can be identified as resulting from the intervention, perhaps with some measure of statistical confidence. Continuing debate about whether nonexperimental comparison groups can be used to provide convincing measures of program impacts has been fueled by a number of studies comparing impact estimates based on control and comparison groups. The debate has also been advanced by an increasingly rich econometric literature about methods to deal with the problem of “selection bias,” which results from sources of unmeasured or unmeasurable differences between treatment and comparison groups. Successful use of nonrandomized comparison groups requires that we be able to measure and control for all systematic differences (other than the intervention) between the samples. Even if all differences can be measured and controlled for, we must keep in mind that this correction process “uses up” statistical power that is no longer available for testing the intervention’s primary impact.
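To make the two-independent-samples logic above concrete, the sketch below simulates a randomized experiment and applies a simple difference-of-means test. The outcome rates (55 percent versus 45 percent) and the sample sizes are assumptions chosen for illustration, not figures from any study.

```python
# Illustrative only: simulate a randomized experiment and apply the simple
# difference-of-means test that genuine random assignment makes valid.
# The 55% / 45% outcome rates and the sample sizes are assumed for the example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
treatment = rng.binomial(1, 0.55, size=500)   # binary outcome for randomly assigned treatment students
control = rng.binomial(1, 0.45, size=500)     # binary outcome for randomly assigned control students

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"estimated impact = {treatment.mean() - control.mean():.3f}, p-value = {p_value:.4f}")
```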
Time and time again, statistical tests appropriate for randomized experiments are misapplied to nonrandomized comparison groups, resulting in a vast overstatement of the strength of the results. Consider the following example. A demonstration is conducted in a single school, with 500 students participating in an intervention designed to promote sexual abstinence. The comparison group is made up of 500 similarly identified students in another similar school without the intervention. If we apply a difference-of-means test, as if we have two independent samples of 500 students, we in effect assert the following: There are no differences between the two schools or in the methods used to identify students in these schools. There are no differences between the schools that introduce additional sources of variance into the calculation. We assume that we know with certainty that the two schools are a perfect match; therefore, all differences are due to the intervention. We could control for school-specific effects, except that the effective sample size for such corrections is the number of schools, not the number of students. Similar problems exist with the statistical methods available to test for the presence of selection bias and correct for it. Tests for selection bias produce three possible outcomes: (1) bias is present, but we lack an acceptable method to correct for it or perhaps even to detect it; (2) bias is present, and available methods permit us to correct for it; and (3) no systematic bias appears to exist. Each of these outcomes poses problems: In the first case, the researcher cannot obtain internally valid impact estimates and must seek alternative data sets. This is a useful result for researchers evaluating alternative secondary data sets but scarce comfort for those who have just completed a demonstration with a primary data collection effort. In the second case, increasingly sophisticated statistical methods have been developed to correct for the source of bias. They typically require, however, the availability of measures for both the treatment and the comparison groups that correlate with program participation but not with program impacts. Furthermore, these methods tend to produce unstable, nondefinitive results. Even when successful, these methods absorb statistical power in the correction process and often produce standard errors of impact estimates that are appreciably larger than those produced with demonstrations using control groups. When this happens, sample sizes in a comparison group design have to be substantially larger than those in a properly designed randomized experiment to measure program impacts with the same statistical precision. Only in the last case can the researcher proceed with no statistical correction for bias. Again, however, using the full sample as if random assignment had occurred implies not only that “we have failed to detect evidence of selection bias” but also that “we know with certainty that it is absent.” In any event, we would not know which case applies until a demonstration has been completed and the data have been collected. The methods available to measure program impacts with nonexperimental data are extremely valuable when time, resources, or other circumstances preclude designing and conducting a randomized experiment to test a new policy intervention or an existing program. These methods are also important for helping to counteract the inevitable imperfections in formal experiments implemented in actual demonstration or program environments.
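The single-school example above can be illustrated with a small simulation. Under assumed school-level and student-level variance components (the values below are illustrative, not from the paper), the naive two-independent-samples test rejects a true null far more often than its nominal 5 percent rate, because school-to-school differences masquerade as an intervention effect.

```python
# Illustration of the 500-students-per-school example: analyze one treatment school
# and one comparison school as if they were two independent random samples.
# The variance components below are assumed for this sketch, not taken from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma_school, sigma_student, n_reps = 0.3, 1.0, 2000

false_positives = 0
for _ in range(n_reps):
    # each school gets its own random "school effect"; the intervention itself has zero effect
    school_a = rng.normal(loc=rng.normal(0, sigma_school), scale=sigma_student, size=500)
    school_b = rng.normal(loc=rng.normal(0, sigma_school), scale=sigma_student, size=500)
    _, p = stats.ttest_ind(school_a, school_b)
    false_positives += p < 0.05

# With these assumptions the naive test "finds" an effect in most replications even
# though none exists, because school differences are mistaken for program impacts.
print(f"nominal 5% test rejects in {false_positives / n_reps:.0%} of null replications")
```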
Yet nonrandom comparison groups—whether “made to order” or drawn from currently available or future longitudinal data sets—are unlikely to be the methodology of choice for major impact evaluations that put a priority on producing convincing results.

Potential Flaws in Random Assignment Designs

Alternatives to random assignment clearly fall far short of the analytic power offered by demonstrations that use random assignment. The question remains: how far do randomized demonstrations as conventionally implemented also fall short of these idealized standards? Unfortunately, advocates of random assignment are often far too sanguine about what it can accomplish. Problems raised by those skeptical about the feasibility of using random assignment are germane in many instances. There is ample room for disagreement, however, on whether a careful design can overcome or minimize these problems. Two concepts central to sound evaluation design are internal validity and external validity. Internal validity deals with whether what we observe—for example, a measured reduction in sexual activity—is in fact caused by the intervention. External validity has to do with whether observed demonstration impacts would be replicated if the intervention were implemented in broader settings or on a larger scale. Both concepts are crucial for policy makers. Failure to demonstrate either type of validity weakens the usefulness of evaluation results as a guide for policy. In the realm of internal validity, well-designed randomized experiments are clearly superior to comparison group and other methodologies. Only with random assignment do we have a basis for attributing what we observe to the impact of an intervention, with a known degree of statistical precision. Experiments as typically implemented do less well with external validity, leaving the analyst to engage in nonexperimental, often judgmental, methods to establish policy relevance. But only a carefully crafted randomized design can produce a robust and internally valid measure of intervention impacts as a starting point for policy interpretation. By yielding weak or even misleading results, weaker designs jeopardize the opportunity to learn from a demonstration. We must caution, however, that just as economists are sometimes accused of tilting toward the measurable at the expense of relevance, experimentalists often focus on creating an analytically precise environment amenable to a structured evaluation, perhaps distancing their “test tube” from direct policy relevance. Frequently, we face the tension between “asking the right question with a weak methodology” and “asking the wrong question with a sound methodology.” Next, we turn to some ethical issues and problems related to the behavior of program operators and resistance to random assignment. We then address additional threats to the internal validity of an experiment, as well as to the external validity or policy applicability of experimental results. During random assignment, ethical issues come into play in at least two respects. First, implementing certain types of experiments—as well as demonstrations not using random assignment—may violate ethical norms held by many. Second, the steps taken to protect ethical standards may severely weaken the methodological power or the demonstration’s relevance.
The ethical issues we have in mind include notions of informed consent and whether participation can be mandated, whether subjects can be placed at risk of being made worse off by an experimental intervention, and whether control group members can be denied services to which they are otherwise entitled. The early income-maintenance experiments of the 1960s and 1970s adhered to the principles of informed consent. They also specified voluntary participation, involved new programs or initiatives, and limited treatments to those generally regarded as more generous than the prevailing entitlements available to controls. In most cases, institutional review boards were consulted to ensure the rights of participants. Beginning with the Carter administration and then continuing more extensively in the Reagan years, experimentation involved interventions that modified existing programs and made participation mandatory. For example, the age of the youngest child was lowered for mandatory Work Incentive Program (WIN) participation. The ethical logic was as follows: if we can mandate a change in policy through legislation without knowing whether the policy will be effective, why can’t we mandate it for a subgroup to gain knowledge before subjecting the full target population to the policy change? More recently, the idea of “leaving behind” a control group for a transitional evaluation period, while the remainder of the target population is subject to revised legislation, came into vogue. Sometimes this strategy was implemented when the benefit to participants was ambiguous or when the intervention was only partially funded rather than an entitlement. This movement reached a crescendo some five years ago. A now infamous Texas case, canceled before it was implemented, provided an extreme example: the entire caseload of the Aid to Families with Dependent Children program would have received extended Medicaid eligibility after leaving AFDC, except for a control sample of 800 households that would be subject to the old, more limited, Medicaid eligibility rules. This case was properly labeled as a denial of services, pure and simple. It also illustrated a classic interplay between ethics and methodology: either we had to examine the consequences of extending Medicaid eligibility before enacting legislation, or we had to forgo that knowledge when we proceeded to a broad-based entitlement. Another problem with randomized experiments is that program operators are usually suspicious of them. Demonstration procedures, particularly the process of random assignment to alternative treatments, can seem harsh and disruptive to program operators. These perceptions reflect a combination of biases to be overcome and a realistic appreciation of the disruptive effects a research effort can have on a program. Program operators’ views often boil down to “I know what works; why are we experimenting?” and “I know which clients would benefit from the service, so why not rely on my judgment rather than random assignment?” Beyond these views, which must be set aside if we believe an evaluation is needed to inform the policy process, lies a basic fact that evaluators cannot overlook. The process by which clients are identified, assessed, and referred—whether by needs assessment or, for some programs, the principle of universal participation—is an integral part of an intervention.
By creating an artificial, evaluation-specific selection process to accommodate random assignment, we may do far more than disrupt program operations: we may actually distort either the nature of the intervention being tested or the existing treatment of the control group, thus distorting the relevance of the demonstration results. Therefore, experiments must be designed so that they disrupt normal program enrollment procedures as little as possible, for both the treatment and the control groups. Furthermore, steps must be taken to prevent agencies and staff—well intentioned or not—from distorting the internal or external validity of an experiment. Common forms of distortion or “contamination” include corrupting the random assignment mechanism, providing the treatment intervention to controls despite proper randomization, permitting other forms of “compensatory” treatment of controls or unintended changes in the control group environment, and distorting referral flows. These threats are real—they have occurred and do occur in randomized demonstrations. Techniques have been developed to minimize them, however, such as giving the evaluator rather than the school control of the randomization process. We have already touched on a number of internal validity concerns as they relate to program staff’s actions. Three related issues are noteworthy. First, we may not be able to control the status quo or “control” environment completely. If there are unmeasured or unanticipated changes in the control environment, we may lose track of what the intervention is being compared with. When an experimental intervention is a precursor to broader policy change, the control environment is at risk of becoming more like the intervention. When this drift of the control environment cannot be resisted for ethical or other reasons, it is critical—at a minimum—to be able to observe what is happening to the control group by measurement standards that are comparable to those used for the treatment group. Second—and we will return to this issue—the intervention may have communitywide consequences rather than impacts solely on selected participants. If so, the concept of a within-school control group or a comparison group is moot. Third, the data collection process may be biased by the demonstration; conversely, data collection may itself be an intervening influence. For example, information may be differentially available for treatment and control students. The process of collecting measures of attitudes or behavior from teenagers may in fact affect their behavior. There is a trade-off between data sources that are indirect but unobtrusive (for example, program records) and those that require direct collection from teenagers. In the area of external validity, we have already mentioned that the process of randomization, student selection, data collection, or other features of a demonstration may change the character of the intervention or how the control group is treated. This, in turn, affects the relevance of the demonstration results for external interpretation. Beyond those circumstances in which demonstration implementation may overtly contaminate external validity, most randomized demonstrations provide internal validity at best. They also require the use of nonexperimental or judgmental methods to extrapolate the results to a broader environment.
The early income-maintenance experiments were criticized for (among other things) not testing the negative income tax on a nationally representative sample of the population. The designers of these experiments responded that testing the intervention in a fully and statistically representative environment was not feasible. They also said that it was preferable to conduct an internally valid experiment in a few communities—which they referred to as “test bores”—and then to interpret the results judgmentally for a “representative” environment. That debate reinforces the point that demonstration results can rarely be applied directly to a broader policy environment in the “what you saw is what you’ll get” sense without further interpretation. And even if the results are directly applicable, the classical measures of statistical precision available for making internally valid impact statements do not apply. Even if we conduct a broad-based demonstration in multiple, “representative” sites, full-scale implementation may have short- or long-term effects unmeasurable in a limited experiment. Furthermore, demonstrations may involve “best practices” and caseworker commitment, which may not be fully replicated with broader implementation, for example, by those who preferred the old policy.

Problems in Evaluating Risk-Reduction Initiatives with Randomized Designs

Initiatives to reduce risk-taking behavior may be difficult to implement in an evaluation context that can generate credible evidence of their effectiveness, with or without random assignment. To establish an appropriate evaluation context, a number of complex and sensitive issues must be addressed: What is our ultimate objective? What is our criterion for success? For example, is it sexual abstinence? reduction in risks associated with sexual behavior? reductions in adverse consequences, such as out-of-wedlock births or sexually transmitted diseases? How are our success criteria to be measured, given the age of the target population and the intrinsic difficulties with interview data and self-reports of high-risk behavior? Are externally available consequence measures, such as birth records, sufficient? Does the intervention affect only its direct participants, or are there displacement, spillover, or interaction effects on others? This last question strikes at the heart of both program design and evaluation methodology issues. Let us consider two types of interventions: (1) a hypothetical curriculum-based approach tested on a small fraction of a peer cohort and not focused on peer interactions; and (2) an approach that, in addition to curriculum content, focuses on peer pressure and interactions and applies to the entire peer cohort, for example, all sixth- through ninth-grade females in a school. Turning to the first approach, that applied to a limited student group, let us assume that the intervention is successful in promoting sexual abstinence among the group. Two additional consequences, both on nonparticipants, might follow: Male teenagers could shift their focus to females not participating in the intervention, in search of more willing participants in joint high-risk behavior. This could reduce abstinence in the nonparticipating group, causing measures based on program participants to overstate the community impact of the intervention. Participants, through their interaction with nonparticipating peers, could cause impacts to “spill over” to their peers.
Conversely, the influence of nonparticipating peers could undermine the success of the intervention. In addition to influencing how we might interpret the effects of the intervention, these factors could also call into question the use of nonparticipating students as a control or comparison group—because they would have been “contaminated” by program influences. This would be true regardless of whether random assignment was used to select participants. This brings us to the second approach—of which Best Friends would be an example. This approach explicitly considers peer interactions, and the intervention strategy calls for involving the entire peer group. By its very concept, this approach undermines the traditional design methodology that relies on within-site (or within-school) treatment and control or comparison groups, for the following reasons: If implemented on its intended scale, the intervention precludes the availability of a within-school control or comparison group. If the scale of the intervention is artificially reduced to create a nonparticipant group, the methodology is double-damned: (1) the intervention itself is undermined, since the peer-interactive structure is altered; and (2) the created control or comparison group is likely to be contaminated by the intervention. This discussion has little, if anything, to do with the merits of a randomized evaluation design. Rather, it is an indictment of basing comparisons on a within-school group of nonparticipants, regardless of the method of selection. As we look outside the participating schools to define a standard of evaluation comparison, the issue of whether to adopt a randomized design can be properly addressed. Dealing with the issues discussed here does not preclude using demonstrations effectively, but it does mandate a rigorous and thorough design effort to preserve both the integrity of the program environment and the validity of the evaluation structure.

The Promise and Risks of Randomized Cross-School Demonstrations

Most randomized policy experiments have involved within-site assignment of cases to treatment and control status, and much of the preceding discussion presumed the use of within-school control groups. The same principles of randomization, however, can be applied to a less frequently observed—but in many contexts very promising—design involving the random selection of treatment and control schools. Under this design, all targeted students in the respective groups of schools make up the treatment and control samples of students. As noted in the preceding discussion, within-school random assignment requires that (1) the scale of the program intervention in each site is smaller than the potentially eligible population; (2) the nature of the intervention is such that none of its benefits spills over to the control group (as can happen when instructional methods for the control group are affected by what teachers learn from the demonstration, or when innovations or reforms are schoolwide in their potential impact); and (3) school resistance to denying the intervention on a random basis to some eligible students can be overcome. Across-school random assignment avoids these difficulties. Treatment group schools do not have to deal with the mechanics of random assignment or with denial of service, while control schools can be more easily insulated than within-school control groups from spillover or peer group effects. Across-school designs pose their own risks, and at least two appear to be significant.
First, innovations tested with this approach must be applied either to the entire eligible student population or to subgroups identified by student characteristics that can be readily measured in data collected for students in the control schools. Interventions—such as Upward Bound—that are targeted at a small number of volunteer applicants from a larger, nominally eligible group would not be well suited for this approach, because attempts to identify the comparable group of students in the control schools would suffer from the same selection bias problems that plague nonrandomized comparison groups. Second, in situations for which either design would be methodologically appropriate, across-school designs typically require larger sample sizes than within-school designs for equal statistical precision. This is because both individual and school characteristics would vary randomly between the treatment and the control groups. Consider, for example, a cross-school design where 2,000 intervention students are spread across 10 intervention schools and another 2,000 control students are spread across 10 control schools. Analysts might be tempted to use a simple statistical test based on comparing two independent samples of 2,000 each. This approach falsely assumes, however, that site-specific variability is nonexistent. Schools will differ in socioeconomic conditions, statutory guidelines, and the manner in which sex education programs are implemented; thus, schools will not be perfectly matched. As a result, the “effective” sample size depends on both the parameters “10” and “2,000.” With plausible assumptions, such a treatment sample has the statistical power of a simple random sample in the range of 200 to 400. More critically, increases in the size of the sample per school do little to increase statistical power, since the total number of schools becomes the sharply constraining influence. Within-school random assignment, in contrast, eliminates variations between the treatment and the control group in school characteristics. Available methods of selecting schools for participation in a randomized demonstration depend on whether we want the results to be externally as well as internally valid, as previously defined. Planners of demonstrations typically solicit applications from schools or sites willing to participate. Strategies to achieve external validity require approaching a random sample of schools or school districts—in a domain of interest, such as schools in large urban areas—and inviting them to participate. For experiments of this sort to be effective, certain conditions have to be met: This approach is feasible only if the offer of participation is attractive enough so that a large proportion of the selected schools agree to participate, because the treatment group is properly defined as all who are offered participation (not just actual participants). Furthermore, nonparticipants dilute the power of the experiment. As noted, the potential impacts of the intervention have to be schoolwide or have to serve a high fraction of eligible students. The impacts also have to be tangible and measurable in terms that can be applied consistently across schools. If all relevant output measures can be obtained from external sources (for example, birth data, surveys conducted through household frames), there is no need to obtain any special consent from the control schools.
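The effective-sample-size point made earlier in this passage can be checked with the standard design-effect formula for cluster-randomized samples, n_eff = n / (1 + (m − 1)ρ), where m is the number of students per school and ρ is the intraclass correlation. The ρ values below are illustrative assumptions, not estimates from the paper, but they reproduce the cited range.

```python
# Back-of-the-envelope check of the effective-sample-size discussion above, using the
# standard design-effect formula for clustered samples: n_eff = n / (1 + (m - 1) * rho).
# The intraclass correlations (rho) are assumed values chosen for illustration.
def effective_sample_size(n_total, students_per_school, rho):
    design_effect = 1 + (students_per_school - 1) * rho
    return n_total / design_effect

for rho in (0.02, 0.05):
    n_eff = effective_sample_size(n_total=2000, students_per_school=200, rho=rho)
    print(f"rho = {rho:.2f}: effective sample size ≈ {n_eff:.0f}")
# rho = 0.02 gives roughly 400; rho = 0.05 gives roughly 180, consistent with the
# 200-to-400 range cited above. Adding students per school barely helps: as the
# per-school count grows, n_eff approaches (number of schools) / rho, so the number
# of schools is the binding constraint on precision.
```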
Given the difficulties of screening for eligible respondents and obtaining parental consent for interviewing minors with a household-based survey, however, it may not be possible to proceed without the active cooperation of control schools. While control schools do not have to be involved in the demonstration in any material way other than facilitating data collection, the expected lower participation rate of control schools is a serious impediment to implementing a demonstration intended to provide externally valid results. If we are willing to exclude small school districts from the evaluation frame, however, a promising alternative involves approaching districts with an offer of participation for a (preferably matched) pair of schools. (This strategy requires that agreement be obtained from districts before identification of the treatment schools.) If the initiative is widely publicized, inexpensive, and easy to implement, there is a risk that control schools will implement a similar program on their own. If this happens too quickly, the outside world will catch up to the innovation before its effects can be measured. The demonstration is more likely to be successful in measuring those effects if the innovation requires significant resources or technical assistance to implement, and if premature publicity surrounding the demonstration is kept to a minimum. If we are willing to forgo external v |
Subject | Poverty Studies |
URL | https://www.aei.org/research-products/journal-publication/the-advantages-of-experimental-designs-for-evaluating-sex-education-programs/ |
Source Think Tank | American Enterprise Institute (United States) |
Resource Type | Think Tank Publication |
Item Identifier | http://119.78.100.153/handle/2XGU8XDN/209050 |
Recommended Citation (GB/T 7714) | Charles E. Metcalf. The Advantages of Experimental Designs for Evaluating Sex Education Programs. 1997. |
Files in This Item | There are no files associated with this item. |