# 1 REVIEW OF BASIC CONCEPTS

1.1 Introduction

This textbook assumes that students have taken basic courses in statistics and research methods. A typical first course in statistics includes methods to describe the distribution of scores on a single variable (such as frequency distribution tables, medians, means, variances, and standard deviations) and a few widely used bivariate statistics. Bivariate statistics (such as the Pearson correlation, the independent samples t test, and one-way analysis of variance or ANOVA) assess how pairs of variables are related. This textbook is intended for use in a second course in statistics; the presentation assumes that students recognize the names of statistics such as the t test and the Pearson correlation but may not yet fully understand how to apply and interpret these analyses.

The first goal in this course is for students to develop a better understanding of these basic bivariate statistics and the problems that arise when these analytic methods are applied to real-life research problems. Chapters 1 through 3 deal with basic concepts that are often a source of confusion because actual researcher behaviors often differ from the recommended methods described in basic textbooks; this includes issues such as sampling and statistical significance testing. Chapter 4 discusses methods for preliminary data screening; before doing any statistical analysis, it is important to remove errors from data, assess whether the data violate assumptions for the statistical procedures, and decide how to handle any violations of assumptions that are detected. Chapters 5 through 9 review familiar bivariate statistics that can be used to assess how scores on one X predictor variable are related to scores on one Y outcome variable (such as the Pearson correlation and the independent samples t test). Chapters 10 through 13 discuss the questions that arise when a third variable is added to the analysis. Later chapters discuss analyses that include multiple predictor and/or multiple outcome variables.

When students begin to read journal articles or conduct their own research, it is a challenge to understand how textbook knowledge is applied in real-life research situations. This textbook provides guidance on dealing with the problems that arise when researchers apply statistical methods to actual data.

1.2 A Simple Example of a Research Problem

Suppose that a student wants to do a simple experiment to assess the effect of caffeine on anxiety. (This study would not yield new information because there has already been substantial research on the effects of caffeine; however, this is a simple research question that does not require a complicated background story about the nature of the variables.) In the United States, before researchers can collect data, they must have the proposed methods for the study reviewed and approved by an institutional review board (IRB). If the research poses unacceptable risks to participants, the IRB may require modification of procedures prior to approval.

To run a simple experiment to assess the effects of caffeine on anxiety, the researcher would obtain IRB approval for the procedure, recruit a sample of participants, divide the participants into two groups, give one group a beverage that contains some fixed dosage level of caffeine, give the other group a beverage that does not contain caffeine, wait for the caffeine to take effect, and measure each person’s anxiety, perhaps by using a self-report measure. Next, the researcher would decide what statistical analysis to apply to the data to evaluate whether anxiety differs between the group that received caffeine and the group that did not receive caffeine. After conducting an appropriate data analysis, the researcher would write up an interpretation of the results that takes the design and the limitations of the study into account. The researcher might find that participants who consumed caffeine have higher self-reported anxiety than participants who did not consume caffeine. Researchers generally hope that they can generalize the results obtained from a sample to make inferences about outcomes that might conceivably occur in some larger population. If caffeine increases anxiety for the participants in the study, the researcher may want to argue that caffeine would have similar effects on other people who were not actually included in the study.

This simple experiment will be used to illustrate several basic problems that arise in actual research:

1. Selection of a sample from a population

2. Evaluating whether a sample is representative of a population

3. Descriptive versus inferential applications of statistics

4. Levels of measurement and types of variables

5. Selection of a statistical analysis that is appropriate for the type of data

The following discussion focuses on the problems that arise when these concepts are applied in actual research situations and comments on the connections between research methods and statistical analyses.

1.3 Discrepancies Between Real and Ideal Research Situations

Terms that appear simple (such as sample vs. population) can be a source of confusion because the actual behaviors of researchers often differ from the idealized research process described in introductory textbooks. Researchers need to understand how compromises that are often made in actual research (such as the use of convenience samples) affect the interpretability of research results. Each of the following sections describes common practices in actual research in contrast to idealized textbook approaches. Unfortunately, because of limitations in time and money, researchers often cannot afford to conduct studies in the most ideal manner.

1.4 Samples and Populations

A sample is a subset of members of a population.1 Usually, it is too costly and time-consuming to collect data for all members of an actual population of interest (such as all registered voters in the United States), and therefore researchers usually collect data for a relatively small sample and use the results from that sample to make inferences about behavior or attitudes in larger populations. In the ideal research situation described in research methods and statistics textbooks, there is an actual population of interest. All members of that population of interest should be identifiable; for example, the researcher should have a list of names for all members of the population of interest. Next, the researcher selects a sample from that population using either simple random sampling or other sampling methods (Cozby, 2004).

In a simple random sample, sample members are selected from the population using methods that should give every member of the population an equal chance of being included in the sample. Random sampling can be done in a variety of ways. For a small population, the researcher can put each participant’s name on a slip of paper, mix up the slips of paper in a jar, and draw names from the jar. For a larger population, if the names of participants are listed in a spreadsheet such as Excel, the researcher can generate a column of random numbers next to the names and make decisions about which individuals to include in the sample based on those random numbers. For instance, if the researcher wants to select one tenth of the members of the population at random, the researcher may decide to include each participant whose name is next to a random number that ends in one arbitrarily chosen digit (such as 3).
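The two selection methods just described might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the population list of 500 placeholder names, the seed, and all variable names are hypothetical and do not come from any real data set.

```python
import random

# Hypothetical population list: 500 placeholder names (not real data).
population = [f"student_{i:03d}" for i in range(500)]

rng = random.Random(42)  # fixed seed so that the draw can be reproduced

# Method 1: draw exactly 50 names without replacement,
# analogous to drawing slips of paper from a jar.
sample_jar = rng.sample(population, k=50)

# Method 2: pair each name with a random digit (0-9) and keep the names
# whose digit equals one arbitrarily chosen value (here, 3).
# Each member has a 1-in-10 chance of selection, so the sample size is
# close to 50 on average but is not fixed in advance.
chosen_digit = 3
sample_digit = [name for name in population if rng.randrange(10) == chosen_digit]

print(len(sample_jar), len(sample_digit))
```

Note the design difference: the first method fixes the sample size, whereas the second fixes only the selection probability for each member.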

In theory, if a sample is chosen randomly from a population, that simple random sample should be representative of the population from which it is drawn. A sample is representative if it has characteristics similar to those of the population. Suppose that the population of interest to a researcher is all the 500 students at Corinth College in the United States. Suppose the researcher randomly chooses a sample of 50 students from this population by using one of the methods just described. The researcher can evaluate whether this random sample is representative of the entire population of all Corinth College students by comparing the characteristics of the sample with the characteristics of the entire population. For example, if the entire population of Corinth College students has a mean age of 19.5 years and is 60% female and 40% male, the sample would be representative of the population with respect to age and gender composition if the sample had a mean age close to 19.5 years and a gender composition of about 60% female and 40% male. Representativeness of a sample can be assessed for many other characteristics, of course. Some characteristics may be particularly relevant to a research question; for example, if a researcher were primarily interested in the political attitudes of the population of students at Corinth, it would be important to evaluate whether the composition of the sample was similar to that of the overall Corinth College population in terms of political party preference.
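A check of representativeness along these lines could look like the sketch below. The population is simulated (300 women, 200 men, ages drawn from 18 to 21), so the records, the seed, and the helper name `summarize` are assumptions made purely for illustration.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed for reproducibility

# Hypothetical population of 500 students: 60% female, 40% male,
# with ages chosen at random from 18 to 21 (simulated, not real data).
population = (
    [{"gender": "F", "age": rng.choice([18, 19, 20, 21])} for _ in range(300)]
    + [{"gender": "M", "age": rng.choice([18, 19, 20, 21])} for _ in range(200)]
)

# Simple random sample of 50 students.
sample = rng.sample(population, k=50)

def summarize(group):
    """Return the mean age and the proportion of women in a list of records."""
    return {
        "mean_age": mean(person["age"] for person in group),
        "prop_female": sum(person["gender"] == "F" for person in group) / len(group),
    }

pop_summary = summarize(population)
sample_summary = summarize(sample)

# The sample is representative on these two characteristics to the extent
# that the two summaries are close to each other.
print("population:", pop_summary)
print("sample:    ", sample_summary)
```

The same comparison could be repeated for any other characteristic that matters for the research question, such as political party preference.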

Random selection may be combined with systematic sampling methods such as stratification. A stratified random sample is obtained when the researcher divides the population into “strata,” or groups (such as Buddhist/Christian/Hindu/Islamic/Jewish/other religion or male/female), and then draws a random sample from each stratum or group. Stratified sampling can be used to ensure equal representation of groups (such as 50% women and 50% men in the sample) or that the proportional representation of groups in the sample is the same as in the population (if the entire population of students at Corinth College consists of 60% women and 40% men, the researcher might want the sample to contain the same proportion of women and men).2 Basic sampling methods are reviewed in Cozby (2004); more complex survey sampling methods are discussed by Kalton (1983).
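Proportional stratified sampling, as described above, can be sketched as follows. The strata sizes follow the Corinth College example (60% women, 40% men in a population of 500); the ID labels, the seed, and the variable names are hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical strata: 300 women and 200 men, identified only by ID labels.
strata = {
    "women": [f"W{i}" for i in range(300)],
    "men": [f"M{i}" for i in range(200)],
}
sample_size = 50
population_size = sum(len(members) for members in strata.values())

stratified_sample = {}
for name, members in strata.items():
    # Proportional allocation: each stratum's share of the sample
    # matches its share of the population.
    n_stratum = round(sample_size * len(members) / population_size)
    # Random selection takes place separately within each stratum.
    stratified_sample[name] = rng.sample(members, k=n_stratum)

print({name: len(drawn) for name, drawn in stratified_sample.items()})
# prints {'women': 30, 'men': 20}
```

For equal representation instead (such as 50% women and 50% men), `n_stratum` would simply be set to the same value for every stratum. With strata whose shares do not divide evenly, rounding can make the allocations sum to slightly more or less than the intended sample size, so a real design would need a rule for handling the remainder.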

In some research domains (such as the public opinion polls done by the Gallup and Harris organizations), sophisticated sampling methods are used, and great care is taken to ensure that the sample is representative of the population of interest. In contrast, many behavioral and social science studies do not use such rigorous sampling procedures. Researchers in education, psychology, medicine, and many other disciplines often use accidental or convenience samples (instead of random samples). An accidental or convenience sample is not drawn randomly from a well-defined population of interest. Instead, a convenience sample consists of participants who are readily available to the researcher. For example, a teacher might use his class of students or a physician might use her current group of patients.

A systematic difference between the characteristics of a sample and a population can be termed bias. For example, if 25% of Corinth College students are in each of the 4 years of the program, but 80% of the members of the convenience sample obtained through the subject pool are first-year students, this convenience sample is biased (it includes more first-year students, and fewer second-, third-, and fourth-year students, than the population).

The widespread use of convenience samples in disciplines such as psychology leads to underrepresentation of many types of people. Convenience samples that consist primarily of first-year North American college students typically underrepresent many kinds of people, such as persons younger than 17 and older than 30 years, persons with serious physical health problems, people who are not interested in or eligible for a college education, persons living in poverty, and persons from cultural backgrounds that are not numerically well represented in North America. For many kinds of research, it would be highly desirable for researchers to obtain samples from more diverse populations, particularly when the outcome variables of interest are likely to differ across age and cultural background. The main reason for the use of convenience samples is the low cost. The extensive use of college students as research participants limits the potential generalizability of results (Sears, 1986); this limitation should be explicitly acknowledged when researchers report and interpret research results.

1.5 Descriptive Versus Inferential Uses of Statistics

Statistics that are used only to summarize information about a sample are called descriptive statistics. One common situation where statistics are used only as descriptive information occurs when teachers compute summary statistics, such as a mean for exam scores for students in a class. A teacher at Corinth College would typically use a mean exam score only to describe the performance of that specific classroom of students and not to make inferences about some broader population (such as the population of all students at Corinth College or all college students in North America).

Researchers in the behavioral and social sciences almost always want to make inferences beyond their samples; they hope that the attitudes or behaviors that they find in the small groups of college students who actually participate in their studies will provide evidence about attitudes or behaviors in broader populations in the world outside the laboratory. Thus, almost all the statistics reported in journal articles are inferential statistics. Researchers may want to estimate a population mean from a sample mean or a population correlation from a sample correlation. When means or correlations based on samples of scores are used to make inferences about (i.e., estimates of) the means or correlations for broader populations, they are called inferential statistics. If a researcher finds a strong correlation between self-esteem and popularity in a convenience sample of Corinth College students, the researcher typically hopes that these variables are similarly related in broader populations, such as all North American college students.
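The idea of using a sample mean to estimate a population mean can be illustrated with a small simulation. Everything in this sketch is assumed for the purpose of illustration: the population of 500 anxiety scores is generated artificially, and the seed and variable names are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(2024)  # fixed seed for reproducibility

# Hypothetical population: simulated anxiety scores for 500 students.
population = [rng.gauss(50, 10) for _ in range(500)]
population_mean = mean(population)

# Draw five independent random samples of 50 students each
# and compute the mean of each sample.
sample_means = [mean(rng.sample(population, k=50)) for _ in range(5)]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    # Each sample mean is an estimate of the population mean;
    # the estimation error differs from sample to sample.
    print(f"sample {i} mean: {m:.2f} (error {m - population_mean:+.2f})")
```

In actual research the population mean is unknown and only one sample is usually available, which is why the sample mean is treated as an estimate rather than as the population value itself.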

In some applications of statistics, such as political opinion polling, researchers often obtain representative samples from actual, well-defined populations by using well-thought-out sampling procedures (such as a combination of stratified and random sampling). When good sampling methods are used to obtain representative samples, it increases researcher confidence that the results from a sample (such as the stated intention to vote for one specific candidate in an election) will provide a good basis for making inferences about outcomes in the broader population of interest.

However, in many types of research (such as experiments and small-scale surveys in psychology, education, and medicine), it is not practical to obtain random samples from the entire population of a country. Instead, researchers in these disciplines often use convenience samples when they conduct small-scale studies.

Consider the example introduced earlier: A researcher wants to run an experiment to assess whether caffeine increases anxiety. It would not be reasonable to try to obtain a sample of participants from the entire adult population of the United States (consider the logistics involved in travel, for example). In practice, studies similar to this are usually conducted using convenience samples. At most colleges or universities in the United States, convenience samples primarily include persons between 18 and 22 years of age.

When researchers obtain information about behavior from convenience samples, they cannot confidently use their results to make inferences about the responses of an actual, well-defined population. For example, if the researcher shows that a convenience sample of Corinth College students scores higher on anxiety after consuming a dose of caffeine, it would not be safe to assume that this result is generalizable to all adults or to all college students in the United States. Why not? For example, the effects of caffeine might be quite different for adults older than 70 than for 20-year-olds. The effects of caffeine might differ between people who regularly consume large amounts of caffeine and people who never use caffeine. The effects of caffeine might also depend on physical health.

Although this is rarely explicitly discussed, most researchers implicitly rely on a principle that Campbell (cited in Trochim, 2001) has called “proximal similarity” when they evaluate the potential generalizability of research results based on convenience samples. It is possible to imagine a hypothetical population (that is, a larger group of people that is similar in many ways to the participants who were included in the convenience sample) and to make cautious inferences about this hypothetical population based on the responses of the sample. Campbell suggested that researchers evaluate the degree of similarity between a sample and hypothetical populations of interest and limit generalizations to hypothetical populations that are similar to the sample of participants actually included in the study. If the convenience sample consists of 50 Corinth College students who are between the ages of 18 and 22 and mostly of Northern European family background, it might be reasonable to argue (cautiously, of course) that the results of this study potentially apply to a hypothetical broader population of 18- to 22-year-old U.S. college students who come from similar ethnic or cultural backgrounds. This hypothetical population (all U.S. college students between 18 and 22 years from a Northern European family background) has a composition fairly similar to the composition of the convenience sample. It would be questionable to generalize about response to caffeine for populations that have drastically different characteristics from the members of the sample (such as persons who are older than age 50 or who have health problems that members of the convenience sample do not have).

Generalization of results beyond a sample to make inferences about a broader population is always risky, so researchers should be cautious in making generalizations. An example involving research on drugs highlights the potential problems that can arise when researchers are too quick to assume that results from convenience samples provide accurate information about the effects of a treatment on a broader population. For example, suppose that a researcher conducts a series of studies to evaluate the effects of a new antidepressant drug on depression. Suppose that the participants are a convenience sample of depressed young adults between the ages of 18 and 22. If the researcher uses appropriate experimental designs and finds that the new drug significantly reduces depression in these studies, the researcher might tentatively say that this drug may be effective for other depressed young adults in this age range. It could be misleading, however, to generalize the results of the study to children or to older adults. A drug that appears to be safe and effective for a convenience sample of young adults might not be safe or effective in patients who are younger or older.

To summarize, when a study uses data from a convenience sample, the researcher should clearly state that the nature of the sample limits the potential generalizability of the results. Of course, inferences about hypothetical or real populations based on data from a single study are never conclusive, even when random selection procedures are used to obtain the sample. An individual study may yield incorrect or misleading results for many reasons. Replication across many samples and studies is required before researchers can begin to feel confident about their conclusions.

1.6 Levels of Measurement and Types of Variables

A controversial issue introduced early in statistics courses involves types of measurement for variables. Many introductory textbooks list the classic levels of measurement defined by S. Stevens (1946): nominal, ordinal, interval, and ratio (see Table 1.1 for a summary and Note 3 for a more detailed review of these levels of measurement).3 Strict adherents to the Stevens theory of measurement argue that the level of measurement of a variable limits the set of logical and arithmetic operations that can appropriately be applied to scores. That, in turn, limits the choice of statistics. For example, if scores are nominal or categorical level of measurement, then according to Stevens, the only things we can legitimately do with the scores are count how many persons belong to each group (and compute proportions or percentages of persons in each group); we can also note whether two persons have equal or unequal scores. It would be nonsense to add up scores for a nominal variable such as eye color (coded 1 = blue, 2 = green, 3 = brown, 4 = hazel, 5 = other) and calculate a “mean eye color” based on a sum of these scores.4
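The contrast between legitimate and nonsensical operations on a nominal variable can be made concrete with a short sketch. The eye color codes are the ones given above; the ten scores themselves are hypothetical values invented for illustration.

```python
from collections import Counter
from statistics import mean

# Eye color codes from the text.
labels = {1: "blue", 2: "green", 3: "brown", 4: "hazel", 5: "other"}

# Hypothetical scores for 10 people (illustrative, not real data).
eye_color = [1, 3, 3, 2, 3, 1, 4, 3, 5, 1]

# Counting group membership and computing proportions are meaningful
# operations for a nominal variable.
counts = Counter(labels[code] for code in eye_color)
proportions = {color: n / len(eye_color) for color, n in counts.items()}
print(counts)
print(proportions)

# The computer will happily compute a "mean eye color", but the result
# depends entirely on the arbitrary code numbers and has no interpretation.
mean_code = mean(eye_color)
print(mean_code)
```

Recoding the categories (for example, swapping the numbers assigned to blue and other) would change the "mean" while leaving the counts and proportions untouched, which is one way to see why the mean carries no information here.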

Table 1.1 Levels of Measurement, Arithmetic Operations, and Types of Statistics

a.   Jaccard and Becker (2002).

b.   Many variables that are widely used in the social and behavioral sciences, such as 5-point ratings for attitude and personality measurement, probably fall short of satisfying the requirement that equal differences between scores represent exactly equal changes in the amount of the underlying characteristics being measured. However, most authors (such as Harris, 2001) argue that application of parametric statistics to scores that fall somewhat short of the requirements for interval level of measurement does not necessarily lead to problems.

In recent years, many statisticians have argued for less strict application of level of measurement requirements. In practice, many common types of variables (such as 5-point ratings of degree of agreement with an attitude statement) probably fall short of meeting the strict requirements for equal interval level of measurement. A strict enforcement of the level of measurement requirements outlined in many introductory textbooks creates a problem: Can researchers legitimately compute statistics (such as mean, t test, and correlation) for scores such as 5-point ratings when the differences between these scores may not represent exactly equal amounts of change in the underlying variable that the researcher wants to measure (in this case, strength of agreement)? Many researchers implicitly assume that the answer to this question is yes.

The variables that are presented as examples when ordinal, interval, and ratio levels of measurement are defined in introductory textbooks are generally classic examples that are easy to classify. In actual practice, however, it is often difficult to decide whether scores on a variable meet the requirements for interval and ratio levels of measurement. The scores on many types of variables (such as 5-point ratings) probably fall into a fuzzy region somewhere between the ordinal and interval levels of measurement. How crucial is it that scores meet the strict requirements for interval level of measurement?

Many statisticians have commented on this problem, noting that there are strong differences of opinion among researchers. Vogt (1999) noted that there is considerable controversy about the need for a true interval level of measurement as a condition for the use of statistics such as mean, variance, and Pearson’s r, stating that “as with constitutional law, there are in statistics strict and loose constructionists in the interpretation of adherence to assumptions” (p. 158). Although some statisticians adhere closely to Stevens’s recommendations, many authors argue that it is not necessary to have data that satisfy the strict requirements for interval level of measurement to obtain interpretable and useful results for statistics such as mean and Pearson’s r.

Howell (1992) reviewed the arguments and concluded that the underlying level of measurement is not crucial in the choice of a statistic:

The validity of statements about the objects or events that we think we are measuring hinges primarily on our knowledge of those objects or events, not on the measurement scale. We do our best to ensure that our measures relate as closely as possible to what we want to measure, but our results are ultimately only the numbers we obtain and our faith in the relationship between those numbers and the underlying objects or events … the underlying measurement scale is not crucial in our choice of statistical techniques … a certain amount of common sense is required in interpreting the results of these statistical manipulations. (pp. 8–9)

Harris (2001) says,

I do not accept Stevens’s position on the relationship between strength [level] of measurement and “permissible” statistical procedures … the most fundamental reason for [my] willingness to apply multivariate statistical techniques to such data, despite the warnings of Stevens and his associates, is the fact that the validity of statistical conclusions depends only on whether the numbers to which they are applied meet the distributional assumptions … used to derive them, and not on the scaling procedures used to obtain the numbers. (pp. 444–445)

Gaito (1980) reviewed these issues and concluded that “scale properties do not enter into any of the mathematical requirements” for various statistical procedures, such as ANOVA. Tabachnick and Fidell (2007) addressed this issue in their multivariate textbook: “The property of variables that is crucial to application of multivariate procedures is not type of measurement so much as the shape of the distribution” (p. 6). Zumbo and Zimmerman (1993) used computer simulations to demonstrate that varying the level of measurement for an underlying empirical structure (between ordinal and interval) did not lead to problems when several widely used statistics were applied.

Based on these arguments, it seems reasonable to apply statistics (such as the sample mean, Pearson’s r, and ANOVA) to scores that do not satisfy the strict requirements for interval level of measurement. (Some teachers and journal reviewers continue to prefer the more conservative statistical practices advocated by Stevens; they may advise you to avoid the computation of means, variances, and Pearson correlations for data that aren’t clearly interval/ratio level of measurement.)

When making decisions about the type of statistical analysis to apply, it is useful to make a simpler distinction between two types of variables: categorical versus quantitative (Jaccard & Becker, 2002). For a categorical or nominal variable, each number is merely a label for group membership. A categorical variable may represent naturally occurring groups or categories (the categorical variable gender can be coded 1 = male, 2 = female). Alternatively, a categorical variable can identify groups that receive different treatments in an experiment. In the hypothetical study described in this chapter, the categorical variable treatment can be coded 1 for participants who did not receive caffeine and 2 for participants who received 150 mg of caffeine. It is possible that the outcome variable for this imaginary study, anxiety, could also be a categorical or nominal variable; that is, a researcher could classify each participant as either 1 = anxious or 0 = not anxious, based on observations of behaviors such as speech rate or fidgeting.

Quantitative variables have scores that provide information about the magnitude of differences between participants in terms of the amount of some characteristic (such as anxiety in this example). The outcome variable, anxiety, can be measured in several different ways. An observer who does not know whether each person had caffeine could observe behaviors such as speech rate and fidgeting and make a judgment about each individual’s anxiety level. An observer could rank order the participants in order of anxiety: 1 = most anxious, 2 = second most anxious, and so forth. (Note that ranking can be quite time-consuming if the total number of persons in the study is large.)

A more typical measurement method for this type of research situation would be self-report of anxiety, perhaps using a 5-point rating similar to the one below. Each participant would be asked to choose a number from 1 to 5 in response to a statement such as “I am very anxious.”

Conventionally, 5-point ratings (where the five response alternatives correspond to “degrees of agreement” with a statement about an attitude, a belief, or a behavior) are called Likert scales. However, questions can have any number of response alternatives, and the response alternatives may have different labels (for example, reports of the frequency of a behavior). (See Chapter 21 for further discussion of self-report questions and response alternatives.)

What level of measurement does a 5-point rating similar to the one above provide? The answer is that we really don’t know. Scores on 5-point ratings probably do not have true equal interval measurement properties; we cannot demonstrate that the increase in the underlying amount of anxiety represented by a difference between 4 points and 3 points corresponds exactly to the increase in the amount of anxiety represented by the difference between 5 points and 4 points. None of the responses may represent a true 0 point. Five-point ratings similar to the example above probably fall into a fuzzy category somewhere between ordinal and interval levels of measurement. In practice, many researchers apply statistics such as means and standard deviations to this kind of data despite the fact that these ratings may fall short of the strict requirements for equal interval level of measurement; the arguments made by Harris and others above suggest that this common practice is not necessarily problematic. Usually researchers sum or average scores across a number of Likert items to obtain total scores for scales (as discussed in Chapter 21). These total scale scores are often nearly normally distributed; Carifio and Perla (2008) review evidence that application of parametric statistics to these scale scores produces meaningful results.
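Summing or averaging across Likert items to obtain a scale score might look like the following sketch. The four-item scale, the three participants, and their responses are all hypothetical; the only assumption taken from the text is that each item is a 5-point rating.

```python
# Hypothetical responses: each row holds one participant's answers to four
# 5-point anxiety items (higher numbers = stronger agreement).
responses = [
    [4, 5, 4, 3],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
]

# Basic screening: every rating must fall in the 1-5 response range.
for items in responses:
    if not all(1 <= score <= 5 for score in items):
        raise ValueError(f"rating outside the 1-5 range: {items}")

# Total scale score: sum of the item ratings (possible range 4 to 20).
total_scores = [sum(items) for items in responses]

# Mean scale score: keeps the result on the original 1-5 metric.
mean_scores = [sum(items) / len(items) for items in responses]

print(total_scores)  # [16, 7, 13]
print(mean_scores)   # [4.0, 1.75, 3.25]
```

Total and mean scores carry the same information when every participant answers every item; the mean is sometimes preferred because it can be read on the same 1 to 5 metric as the individual items.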

Tabachnick and Fidell (2007) and the other authors cited above have argued that it is more important to consider the distribution shapes for scores on quantitative variables (rather than their levels of measurement). Many of the statistical tests covered in introductory statistics books were developed based on assumptions that scores on quantitative variables are normally distributed. To evaluate whether a batch of scores in a sample has a nearly normal distribution shape, we need to know what an ideal normal distribution looks like. The next section reviews the characteristics of the standard normal distribution.

1.7 The Normal Distribution

Introductory statistics books typically present both empirical and theoretical distributions. An empirical distribution is based on frequencies of scores from a sample, while a theoretical distribution is defined by a mathematical function or equation.

A description of an empirical distribution can be presented as a table of frequencies or in a graph such as a histogram. Sometimes it is helpful to group scores to obtain a more compact view of the distribution; thus, each bar in an SPSS® histogram may correspond to an interval that contains a group of scores (rather than a single score). SPSS makes reasonable default decisions about grouping scores and the number of intervals and interval widths to use; these decisions can be modified by the user. Details about the decisions involved in grouping scores are provided in most introductory statistics textbooks and will not be discussed here. An example of an empirical distribution appears in Figure 1.1. This shows a distribution of measurements of women’s heights (in inches). The height of each bar is proportional to the number of cases; for example, the tallest bar in the histogram corresponds to the number of women whose height is 64 in. For this empirical sample distribution, the mean female height is M = 64 in., and the standard deviation for female height (denoted by s or SD) is 2.56 in. Note that if these heights were transformed into centimeters (M = 162.56 cm, s = 6.50 cm), the shape of the distribution would be identical; the labels of values on the X axis are the only feature of the graph that would change.
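The claim that a change of units leaves the shape of a distribution unchanged can be checked with a small sketch. The nine height values below are hypothetical (chosen only so that their mean is 64 in.); the conversion factor of 2.54 cm per inch is exact.

```python
from statistics import mean, stdev

CM_PER_INCH = 2.54

# Hypothetical heights in inches (illustrative values, not the Figure 1.1 data).
heights_in = [60, 62, 63, 64, 64, 64, 65, 66, 68]
heights_cm = [h * CM_PER_INCH for h in heights_in]

def z_scores(scores):
    """Express each score as a distance from the mean in SD units."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

# The mean and the standard deviation are both multiplied by 2.54 ...
print(mean(heights_in), mean(heights_cm))
print(stdev(heights_in), stdev(heights_cm))

# ... but each score's position relative to the mean (in SD units) is the
# same in either unit, so the shape of the distribution does not change.
z_in = z_scores(heights_in)
z_cm = z_scores(heights_cm)
print(all(abs(a - b) < 1e-9 for a, b in zip(z_in, z_cm)))

# The values reported in the text convert the same way.
print(round(64 * CM_PER_INCH, 2), round(2.56 * CM_PER_INCH, 2))
```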

Figure 1.1 A Histogram Showing an Empirical Distribution of Scores That Is Nearly Normal in Shape

The smooth curve superimposed on the histogram is a plot of the mathematical (i.e., theoretical) function for an ideal normal distribution with a population mean μ = 64 and a population standard deviation σ = 2.56.

Empirical distributions can have many different shapes. For example, the distribution of number of births across days of the week in the bar chart in Figure 1.2 is approximately uniform; that is, approximately one seventh of the births take place on each of the 7 days of the week.

Some empirical distributions have shapes that can be closely approximated by mathematical functions, and it is often convenient to use that mathematical function and a few parameters (such as mean and standard deviation) as a compact way to summarize information about the distribution of scores on a variable. The proportion of area that falls within a slice of an empirical distribution (in a bar chart or histogram) can be interpreted as a probability. Thus, based on the bar chart in Figure 1.2, we can say descriptively that “about one seventh of the births occurred on Monday”; we can also say that if we draw an individual case from the distribution at random, the probability that the birth occurred on a Monday is approximately one seventh.

When a relatively uncommon behavior (such as crying) is assessed through self-report, the distribution of frequencies is often a J-shaped, or roughly exponential, curve, as in Figure 1.3, which shows responses to the question, “How many times did you cry last week?” (data from Brackett, Mayer, & Warner, 2004). Most people reported crying 0 times per week, a few reported crying 1 to 2 times a week, and very few reported crying more than 11 times per week. (When variables are frequency counts of behaviors, distributions are often skewed; in some cases, 0 is the most common behavior frequency. Statistics reviewed in this book assume normal distribution shapes; researchers whose data are extremely skewed and/or include a large number of 0s may need to consider alternative methods based on Poisson or negative binomial distributions, as discussed by Atkins & Gallop, 2007.)

Figure 1.2 A Bar Chart Showing a Fairly Uniform Distribution for Number of Births (Y Axis) by Day of the Week (X Axis)

A theoretical distribution shape that is of particular interest in statistics is the normal (or Gaussian) distribution illustrated in Figure 1.4. Students should be familiar with the shape of this distribution from introductory statistics. The curve is symmetrical, with a peak in the middle and tails that fall off gradually on both sides. The normal curve is often described as a bell-shaped curve. A precise mathematical definition of the theoretical normal distribution is given by the following equations:
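In conventional notation, the height Y of the normal curve at each value of X can be written as a single expression. This is the standard textbook form of the normal density, offered here as a sketch in the symbols π, e, μ, and σ; it is left unnumbered, and no assumption is made about how it is divided between Equations 1.1 and 1.2.

```latex
Y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(X - \mu)^{2}}{2\sigma^{2}}}
```

Setting μ = 0 and σ = 1 in this expression gives the standard normal curve, Y = (1/√(2π)) e^(−X²/2).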

Figure 1.3 Bar Chart Showing a J-Shaped or Exponential Distribution

Figure 1.4 A Standard Normal Distribution, Showing the Correspondence Between Distance From the Mean (Given as Number of σ Units or z Scores) and Proportion of Area Under the Curve

where

π is a mathematical constant, approximate value 3.1416 …

e is a mathematical constant, approximate value 2.7183 …

μ is the mean—that is, the center of the distribution.

σ is the standard deviation; that is, it corresponds to the dispersion of the distribution.

In Figure 1.4, the X value is mapped on the horizontal axis, and the Y height of the curve is mapped on the vertical axis.

For the normal distribution curve defined by this mathematical function, there is a fixed relationship between the distance from the center of the distribution and the area under the curve, as shown in Figure 1.4. The Y value (the height) of the normal curve asymptotically approaches 0 as the X distance from the mean increases; thus, the curve theoretically has a range of X from −∞ to +∞. Despite this infinite range of X values, the area under the normal curve is finite. The total area under the normal curve is set equal to 1.0 so that the proportions of this area can be interpreted as probabilities. The standard normal distribution is defined by Equations 1.1 and 1.2 with μ set equal to 0 and σ set equal to 1. In Figure 1.4, distances from the mean are marked in numbers of standard deviations—for example, +1σ, +2σ, and so forth.

From Figure 1.4, one can see that the proportion of area under the curve that lies between 0 and +1σ is about .3413; the proportion of area that lies above +3σ is .0013. In other words, about 1 of 1,000 cases lies more than 3σ above the mean, or, to state this another way, the probability that a randomly sampled individual from this population will have a score that is more than 3σ above the mean is about .0013.
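These proportions of area can be reproduced with a few lines of Python. The sketch below uses NormalDist from the standard library statistics module (Python 3.8 or later); the variable names are illustrative choices.

```python
from statistics import NormalDist

# Standard normal distribution: mu = 0, sigma = 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Proportion of area between the mean (0) and +1 sigma.
area_0_to_1 = std_normal.cdf(1.0) - std_normal.cdf(0.0)

# Proportion of area in the upper tail, above +3 sigma.
area_above_3 = 1.0 - std_normal.cdf(3.0)

print(round(area_0_to_1, 4))   # 0.3413
print(round(area_above_3, 4))  # 0.0013
```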

A “family” of normal distributions (with different means and standard deviations) can be created by substituting in any specific values for the population parameters μ and σ. The term parameter is (unfortunately) used to mean many different things in various contexts. Within statistics, the term generally refers to a characteristic of a population distribution that can be estimated by using a corresponding sample statistic. The parameters that are most often discussed in introductory statistics are the population mean μ (estimated by a sample mean M) and the population standard deviation σ (estimated by the sample standard deviation, usually denoted by either s or SD). For example, assuming that the empirical distribution of women’s heights has a nearly normal shape with a mean of 64 in. and a standard deviation of 2.56 in. (as in Figure 1.1), a theoretical normal distribution with μ = 64 and σ = 2.56 will approximately match the location and shape of the empirical distribution of heights. Other values of μ and σ describe other members of the family. For example, intelligence quotient (IQ) scores are normally distributed with μ = 100 and σ = 15; heart rate for a population of healthy young adults might have a mean of μ = 70 beats per minute (bpm) and a standard deviation σ of 11 bpm.

When a population has a known shape (such as “normal”) and known parameters (such as μ = 70 and σ = 11), this is sufficient information to draw a curve that represents the shape of the distribution. If a population has an unknown distribution shape, we can still compute a mean and standard deviation, but that information is not sufficient to draw a sketch of the distribution.

The standard normal distribution is the distribution generated by Equations 1.1 and 1.2 for the specific values μ = 0 and σ = 1.0 (i.e., population mean of 0 and population standard deviation of 1). When normally distributed X scores are rescaled so that they have a mean of 0 and a standard deviation of 1, they are called standard scores or z scores. Figure 1.4 shows the standard normal distribution for z. There is a fixed relationship between distance from the mean and area, as shown in Figure 1.4. For example, the proportion of area under the curve that lies between z = 0 and z = +1 under the standard normal curve is always .3413 (34.13% of the area).
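The rescaling that produces a z score is z = (X − μ)/σ. A minimal Python sketch follows; the population parameters are the IQ, heart rate, and height values mentioned in this section, while the individual raw scores (130, 59, and 64) and the function name z_score are hypothetical choices for illustration.

```python
def z_score(x, mu, sigma):
    """Rescale a raw score X to a z score: its distance from mu in sigma units."""
    return (x - mu) / sigma

# IQ: mu = 100, sigma = 15; a hypothetical raw score of 130.
print(z_score(130, mu=100, sigma=15))   # 2.0

# Heart rate: mu = 70 bpm, sigma = 11 bpm; a hypothetical raw score of 59.
print(z_score(59, mu=70, sigma=11))     # -1.0

# Height: mu = 64 in., sigma = 2.56 in.; a score equal to the mean.
print(z_score(64, mu=64, sigma=2.56))   # 0.0
```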

Recall that a proportion of area under the uniform distribution can be interpreted as a probability; similarly, a proportion of area under a section of the normal curve can also be interpreted as a probability. Because the normal distribution is widely used, it is useful for students to remember some of the areas that correspond to z scores. For example, the bottom 2.5% and top 2.5% of the area of a standard normal distribution lie below z = −1.96 and above z = +1.96, respectively. That is, 5% of the scores in a normally distributed population lie more than 1.96 standard deviations above or below the mean. We will want to know when scores or test statistics have extreme or unusual values relative to some distribution of possible values. In most situations, the outcomes that correspond to the most extreme 5% of a distribution are the outcomes that are considered extreme or unlikely.
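The correspondence between z = ±1.96 and the most extreme 5% of the distribution can be verified directly. This sketch again relies on NormalDist from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # defaults to mu = 0, sigma = 1

# z value that leaves 2.5% of the area in the upper tail.
z_crit = std_normal.inv_cdf(0.975)

# Combined area in both tails, below -1.96 and above +1.96.
two_tail_area = 2 * (1 - std_normal.cdf(1.96))

print(round(z_crit, 2))         # 1.96
print(round(two_tail_area, 3))  # 0.05
```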

The statistics covered in introductory textbooks, such as the t test, ANOVA, and Pearson’s r, were developed based on the assumption that scores on quantitative variables are normally distributed in the population. Thus, distribution shape is one factor that is taken into account when deciding what type of statistical analysis to use.

Students should be aware that although normal distribution shapes are relatively common, many variables are not normally distributed. For example, income tends to have a distribution that is asymmetric; it has a lower limit of 0, but there is typically a long tail on the upper end of the distribution. Relatively uncommon behaviors, such as crying, often have a J-shaped distribution (as shown in Figure 1.3). Thus, it is important to examine distribution shapes for quantitative variables before applying statistical analyses such as ANOVA or Pearson’s r. Methods for assessing whether variables are normally distributed are described in Chapter 4.

1.8 Research Design

Up to this point, the discussion has touched on two important issues that should be taken into account when deciding what statistical analysis to use: the types of variables involved (the level of measurement and whether the variables are categorical or quantitative) and the distribution shapes of scores on quantitative variables. We now turn from a discussion of individual variables (e.g., categorical vs. quantitative types of variables and the shapes of distributions of scores on variables) to a brief consideration of research design.

It is extremely important for students to recognize that a researcher’s ability to draw causal inferences is based on the nature of the research design (i.e., whether the study is an experiment) rather than the type of analysis (such as correlation vs. ANOVA). This section briefly reviews basic research design terminology. Readers who have not taken a course in research methods may want to consult a basic research methods textbook (such as Cozby, 2004) for a more thorough discussion of these issues.

1.8.1 Experimental Design

In behavioral and social sciences, an experiment typically includes the following elements:

1. Random assignment of participants to groups or treatment levels (or other methods of assignment of participants to treatments, such as matched samples or repeated measures, that ensure equivalence of participant characteristics across treatments).

2. Two or more researcher-administered treatments, dosage levels, or interventions.

3. Experimental control of other “nuisance” or “error” variables that might influence the outcome: The goal is to avoid confounding other variables with the different treatments and to minimize random variations in response due to other variables that might influence participant behavior; the researcher wants to make certain that no other variable is confounded with treatment dosage level. If the caffeine group is tested before a midterm exam and the no-caffeine group is tested on a Friday afternoon before a holiday weekend, there would be a confound between the effects of the exam and those of the caffeine; to avoid a confound, both groups should be tested at the same time under similar circumstances. The researcher also wants to make sure that random variation of scores within each treatment condition due to nuisance variables is not too great; for example, testing persons in the caffeine group at many different times of the day and days of the week could lead to substantial variability in anxiety scores within this group. One simple way to ensure that a nuisance variable is neither confounded with treatment nor a source of variability of scores within groups is to “hold the variable constant”; for example, to avoid any potential confound of the effects of cigarette smoking with the effects of caffeine and to minimize the variability of anxiety within groups that might be associated with smoking, the researcher could “hold the variable smoking constant” by including only those participants who are not smokers.

4. Assessment of an outcome variable after the treatment has been administered.

5. Comparison of scores on outcome variables across people who have received different treatments, interventions, or dosage levels: Statistical analyses are used to compare group means and assess how strongly scores on the outcome variable are associated with scores on the treatment variable.

In contrast, nonexperimental studies usually lack a researcher-administered intervention, experimenter control of nuisance or error variables, and random assignment of participants to groups; they typically involve measuring or observing several variables in naturally occurring situations.

The goal of experimental design is to create a situation in which it is possible to make a causal inference about the effect of a manipulated treatment variable (X) on a measured outcome variable (Y). A study that satisfies the conditions for causal inferences is said to have internal validity. The conditions required for causal inference, that is, a claim of the form “X causes Y” (from Cozby, 2004), include the following:

1. The X and Y variables that represent the “cause” and the “effect” must be systematically associated in the study. That is, it only makes sense to theorize that X might cause Y if X and Y covary (i.e., if X and Y are statistically related). Covariation between variables can be assessed using statistical methods such as ANOVA and Pearson’s r. Covariation of X and Y is a necessary, but not sufficient, condition for causal inference. In practice, we do not require perfect covariation between X and Y before we are willing to consider causal theories. However, we look for evidence that X and Y covary “significantly” (i.e., we use statistical significance tests to try to rule out chance as an explanation for the obtained pattern of results).

2. The cause, X, must precede the effect, Y, in time. In an experiment, this requirement of temporal precedence is met by manipulating the treatment variable X prior to measuring or observing the outcome variable Y.

3. There must not be any other variable confounded with (or systematically associated with) the X treatment variable. If there is a confound between X and some other variable, the confounded variable is a rival explanation for any observed differences between groups. Random assignment of participants to treatment groups is supposed to ensure equivalence in the kinds of participants in the treatment groups and to prevent a confound between individual difference variables and treatment condition. Holding other situational factors constant for groups that receive different treatments should avoid confounding treatment with situational variables, such as day of the week or setting.

4. There should be some reasonable theory that would predict or explain a cause-and-effect relationship between the variables.

Of course, the results of a single study are never sufficient to prove causality. However, if a relationship between variables is replicated many times across well-designed experiments, belief that there could be a potential causal connection tends to increase as the amount of supporting evidence increases. Compared with quasi-experimental or nonexperimental designs, experimental designs provide relatively stronger evidence for causality, but causality cannot be proved conclusively by a single study even if it has a well-controlled experimental design.

There is no necessary connection between the type of design (experimental vs. nonexperimental) and the type of statistic applied to the data (such as t test vs. Pearson’s r) (see Table 1.2). Because experiments often (but not always) involve comparison of group means, ANOVA and t tests are often applied to experimental data. However, the choice of a statistic depends on the type of variables in the dataset rather than on experimental design. An experiment does not have to compare a small number of groups. For example, participants in a drug study might be given 20 different dosage levels of a drug (the independent variable X could be the amount of drug administered), and a response to the drug (such as self-reported pain, Y) could be measured. Pairs of scores (for X = drug dosage and Y = reported pain) could be analyzed using methods such as Pearson correlation, although this type of analysis is uncommon in experiments.
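The drug dosage example can be sketched in Python to show that a Pearson correlation is computed in the same way whether or not X was manipulated by the researcher. All of the numbers below are hypothetical (20 invented dosage levels and invented pain ratings), and pearson_r is an illustrative helper that applies the usual definition of r.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: sum of cross products divided by the
    square root of the product of the two sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical experiment: 20 dosage levels (mg), one participant per level.
dosage_mg = [10 * k for k in range(1, 21)]

# Hypothetical self-reported pain ratings (higher = more pain).
pain = [9, 9, 8, 8, 8, 7, 7, 6, 7, 6, 5, 5, 5, 4, 4, 3, 4, 3, 2, 2]

r = pearson_r(dosage_mg, pain)
print(round(r, 2))  # strongly negative in this invented dataset
```

Because dosage was assigned by the researcher in this hypothetical study, a causal interpretation would rest on the design, not on the fact that r was the statistic reported.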

Internal validity is the degree to which the results of a study can be used to make causal inferences; internal validity increases with greater experimental control of extraneous variables. On the other hand, external validity is the degree to which the results of a study can be generalized to groups of people, settings, and events that occur in the real world. A well-controlled experimental situation should have good internal validity.

Table 1.2 Statistical Analysis and Research Designs

External validity is the degree to which the results of a study can be generalized (beyond the specific participants, setting, and materials involved in the study) to apply to real-world situations. Some well-controlled experiments involve such artificial situations that it is unclear whether results are generalizable. For example, an experiment that involves systematically presenting different schedules of reinforcement or reward to a rat in a Skinner box, holding all other variables constant, typically has high internal validity (if the rat’s behavior changes, the researcher can be reasonably certain that the changes in behavior are caused by the changes in reward schedule, because all other variables are held constant). However, this type of research may have lower external validity; it is not clear whether the results of a study of rats isolated in Skinner boxes can be generalized to populations of children in school classrooms, because the situation of children in a classroom is quite different from the situation of rats in a Skinner box. External validity is better when the research situation is closely analogous to, or resembles, the real-world situations that the researcher wants to learn about. Some experimental situations achieve strong internal validity at the cost of external validity. However, it is possible to conduct experiments in field settings or to create extremely lifelike and involving situations in laboratory settings, and this can improve the external validity or generalizability of research results.

The strength of internal and external validity depends on the nature of the research situation; it is not determined by the type of statistical analysis that happens to be applied to the data.

1.8.2 Quasi-Experimental Design

Quasi-experimental designs typically include some, but not all, of the features of a true experiment. Often they involve comparison of groups that have received different treatments and/or comparison of groups before versus after an intervention program. Often they are conducted in field rather than in laboratory settings. Usually, the groups in quasi-experimental designs are not formed by random assignment, and thus, the assumption of equivalence of participant characteristics across treatment conditions is not satisfied. Often the intervention in a quasi experiment is not completely controlled by the researcher, or the researcher is unable to hold other variables constant. To the extent that a quasi experiment lacks the controls that define a well-designed experiment, a quasi-experimental design provides much weaker evidence about possible causality (and, thus, weaker internal validity). Because quasi experiments often focus on interventions that take place in real-world settings (such as schools), they may have stronger external validity than laboratory-based studies. Campbell and Stanley (1966) provide an extensive outline of threats to internal and external validity in quasi-experimental designs that is still valuable. Shadish, Cook, and Campbell (2001) provide further information about issues in the design and analysis of quasi-experimental studies.

1.8.3 Nonexperimental Research Design

Many studies do not involve any manipulated treatment variable. Instead, the researcher measures a number of variables that are believed to be meaningfully related. Variables may be measured at one point in time or, sometimes, at multiple points in time. Then, statistical analyses are done to see whether the variables are related in ways that are consistent with the researcher’s expectations. The problem with nonexperimental research design is that any potential independent variable is usually correlated or confounded with other possible independent variables; therefore, it is not possible to determine which, if any, of the variables have a causal impact on the dependent variable. In some nonexperimental studies, researchers make distinctions between independent and dependent variables (based on implicit theories about possible causal connections). However, in some nonexperimental studies, there may be little or no basis to make such a distinction. Nonexperimental research is sometimes called “correlational” research. This use of terminology is unfortunate because it can confuse beginning students. It is helpful to refer to studies that do not involve interventions as “nonexperimental” (rather than correlational) to avoid possible confusion between the Pearson’s r correlation statistic and nonexperimental design.

As shown in Table 1.2, a Pearson correlation can be performed on data that come from experimental designs, although it is much more often encountered in reports of nonexperimental data. A t test or ANOVA is often used to analyze data from experiments, but these tests are also used to compare means between naturally occurring groups in nonexperimental studies. (In other words, to judge whether a study is experimental or nonexperimental, it is not useful to ask whether the reported statistics were ANOVAs or correlations. We have to look at the way the study was conducted, that is, whether it has the features typically found in experiments that were listed earlier.)

The degree to which research results can be interpreted as evidence of possible causality depends on the nature of the design (experimental vs. nonexperimental), not on the type of statistic that happens to be applied to the data (Pearson’s r vs. t test or ANOVA). While experiments often involve comparison of groups, group comparisons are not necessarily experimental.

A nonexperimental study usually has weak internal validity; that is, merely observing that two variables are correlated is not a sufficient basis for causal inferences. If a researcher finds a strong correlation between an X and Y variable in a nonexperimental study, the researcher typically cannot rule out rival explanations (e.g., changes in Y might be caused by some other variable that is confounded with X rather than by X). On the other hand, some nonexperimental studies (particularly those that take place in field settings) may have good external validity; that is, they may examine naturally occurring events and behaviors.

1.8.4 Between-Subjects Versus Within-Subjects or Repeated Measures

When an experiment involves comparisons of groups, there are many different ways in which participants or cases can be placed in these groups. Despite the fact that most writers now prefer to use the term participant rather than subject to refer to a person who contributes data in a study, the letter S is still widely used to stand for “subjects” when certain types of research designs are described.

When a study involves a categorical or group membership variable, we need to pay attention to the composition of the groups when we decide how to analyze the data. One common type of group composition is called between-subjects (between-S) or independent groups. In a between-S or independent groups study, each participant is a member of one and only one group. A second common type of group composition is called within-subjects (within-S) or repeated measures. In a repeated measures study, each participant is a member of every group; if the study includes several different treatments, each participant is tested under every treatment condition.

For example, consider the caffeine/anxiety study. This study could be done using either a between-S or a within-S design. In a between-S version of this study, a sample of 30 participants could be divided randomly into two groups of 15 each. Each group would be given only one treatment (Group 1 would receive a beverage that contains no caffeine; Group 2 would receive a beverage that contains 150 mg caffeine). In a within-S or repeated measures version of this study, each of the 30 participants would be observed twice: once after drinking a beverage that does not contain caffeine and once after drinking a beverage that contains 150 mg caffeine. Another possible variation of design would be to use both within-S and between-S comparisons. For example, the researcher could randomly assign 15 people to each of the two groups, caffeine versus no caffeine, and then assess each person’s anxiety level at two points in time, before and after consuming a beverage. This design has both a between-S comparison (caffeine vs. no caffeine) and a within-S comparison (anxiety before vs. after drinking a beverage that may or may not contain caffeine).
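Random division of 30 participants into two groups of 15, as in the between-S version of the study, might be carried out as in the following Python sketch. The participant ID numbers, the function name randomly_assign, and the seed value are hypothetical; the seed is fixed only so that the illustration is reproducible.

```python
import random

def randomly_assign(participant_ids, n_groups=2, seed=None):
    """Shuffle the participants, then deal them into n_groups groups in turn,
    so that each participant ends up in one and only one group."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

participants = list(range(1, 31))  # hypothetical ID numbers 1 through 30
no_caffeine_group, caffeine_group = randomly_assign(participants, seed=42)

print(len(no_caffeine_group), len(caffeine_group))  # 15 15
```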

Within-S or repeated measures designs raise special problems, such as the need to control for order and carryover effects (discussed in basic research methods textbooks such as Shaughnessy, Zechmeister, & Zechmeister, 2003). In addition, different statistical tests are used to compare group means for within-S versus between-S designs. For a between-S design, one-way ANOVA for independent samples is used; for a within-S design, repeated measures ANOVA is used. A thorough discussion of repeated measures analyses is provided by Keppel (1991), and a brief introduction to repeated measures ANOVA is provided in Chapter 22 of this textbook. Thus, a researcher has to know whether the composition of groups in a study is between-S or within-S in order to choose an appropriate statistical analysis.
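To see why the composition of groups matters, consider the simplest case of two conditions, where the between-S and within-S analyses reduce to the independent samples t test and the paired samples t test. The Python sketch below applies both formulas to the same ten hypothetical anxiety scores; the data and function names are invented for illustration, and in practice only the formula that matches the actual design would be appropriate.

```python
from math import sqrt
from statistics import mean, stdev, variance

def independent_t(group1, group2):
    """Between-S: t ratio based on the pooled within-group variance."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / se_diff

def paired_t(cond1, cond2):
    """Within-S: t ratio based on each participant's difference score."""
    diffs = [a - b for a, b in zip(cond1, cond2)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical anxiety scores under the two beverage conditions.
caffeine = [13.0, 14.0, 12.0, 17.0, 13.0]
no_caffeine = [10.0, 12.0, 9.0, 14.0, 11.0]

# If the scores came from 10 different people (between-S):
t_between = independent_t(caffeine, no_caffeine)

# If the same 5 people were tested under both conditions (within-S):
t_within = paired_t(caffeine, no_caffeine)

print(round(t_between, 2), round(t_within, 2))
```

With these invented scores the paired t ratio is much larger than the independent samples t ratio, because the difference scores remove stable individual differences from the error term.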

In nonexperimental studies, the groups are almost always between-S because they are usually based on previously existing participant characteristics (e.g., whether each participant is male or female, a smoker or a nonsmoker). Generally, when we talk about groups based on naturally occurring participant characteristics, group memberships are mutually exclusive; for example, a person cannot be classified as both a smoker and a nonsmoker. (We could, of course, create a larger number of groups such as nonsmoker, occasional smoker, and heavy smoker if the simple distinction between smokers and nonsmokers does not provide a good description of smoking behavior.)

In experiments, researchers can choose to use either within-S or between-S designs. An experimenter typically assigns participants to treatment groups in ways that are intended to make the groups equivalent prior to treatment. For example, in a study that examines three different types of stress, each participant may be randomly assigned to one and only one treatment group or type of stress.

In this textbook, all group comparisons are assumed to be between-S unless otherwise specified. Chapter 22 deals specifically with repeated measures ANOVA.

1.9 Combinations of These Design Elements

For clarity and simplicity, each design element (e.g., comparison of groups formed by an experimenter vs. naturally occurring groups; between-S vs. within-S design) has been discussed separately. This book covers most of the “building blocks” that can be included in more complex designs. These design elements can be combined in many ways. Within an experiment, some treatments may be administered using a between-S and others using a within-S or repeated measures design. A factorial study may include a factor that corresponds to an experimental manipulation and also a factor that corresponds to naturally occurring group memberships (as discussed in Chapter 13 on factorial ANOVA). Later chapters in this book include some examples of more complex designs. For example, a study may include predictor variables that represent group memberships and also quantitative predictor variables called covariates (as in analysis of covariance [ANCOVA], Chapter 17). For complex experiments (e.g., experiments that include both between-S and within-S factors), researchers should consult books that deal with these in greater detail (e.g., Keppel & Zedeck, 1989; Myers & Well, 1995).

1.10 Parametric Versus Nonparametric Statistics

Another issue that should be considered when choosing a statistical method is whether the data satisfy the assumptions for parametric statistical methods. Definitions of the term parametric statistics vary across textbooks. When a variable has a known distribution shape (such as normal), we can draw a sketch of the entire distribution of scores based on just two pieces of information: the shape of the distribution (such as normal) and a small number of population parameters for that distribution (for normally distributed scores, we need to know only two parameters, the population mean μ and the population standard deviation σ, in order to draw a picture of the entire distribution). Parametric statistics involve obtaining sample estimates of these population parameters (e.g., the sample mean M is used to estimate the population mean μ; the sample standard deviation s is used to estimate the population standard deviation σ).

Most authors include the following points in their discussion of parametric statistics:

1. Parametric statistics include the analysis of means, variances, and sums of squares. For example, the t test, ANOVA, Pearson’s r, and regression are parametric statistics.

2. Parametric statistics require quantitative dependent variables that are at least approximately interval/ratio level of measurement. In practice, as noted in Section 1.6, this requirement is often not strictly observed. For example, parametric statistics (such as mean and correlation) are often applied to scores from 5-point ratings, and these scores may fall short of satisfying the strict requirements for interval level of measurement.

3. The parametric statistics included in this book and in most introductory texts assume that scores on quantitative variables are normally distributed. (This assumption is violated when scores have a uniform or J-shaped distribution, as shown in Figures 1.2 and 1.3, or when there are extreme outliers.)

4. For analyses that involve comparisons of group means, the variances of dependent variable scores are assumed to be equal across the populations that correspond to the groups in the study.

5. Parametric analyses often have additional assumptions about the distributions of scores on variables (e.g., we need to assume that X and Y are linearly related to use Pearson’s r).

It is unfortunate that some students receive little or no education on nonparametric statistics. Most introductory statistics textbooks include one or two chapters on nonparametric methods; however, these are often at the end of the book, and instructors rarely have enough time to cover this material in a one-semester course. Nonparametric statistics do not necessarily involve estimation of population parameters; they often rely on quite different approaches to sample data, such as comparing the sum of ranks across groups in the Wilcoxon rank sum test. There is no universally agreed-on definition for nonparametric statistics, but most discussions of nonparametric statistics include the following:

1. Nonparametric statistics include the median, the chi-square (χ²) test of association between categorical variables, the Wilcoxon rank sum test, the sign test, and the Friedman one-way ANOVA by ranks. Many nonparametric methods involve counting frequencies or finding medians.

2. The dependent variables for nonparametric tests may be either nominal or ordinal level of measurement. (Scores may be obtained as ranks initially, or raw scores may be converted into ranks as one of the steps involved in performing a nonparametric analysis.)

3. Nonparametric statistics do not require scores on the outcome variable to be normally distributed.

4. Nonparametric statistics do not typically require an assumption that variances are equal across groups.

5. Outliers are not usually a problem in nonparametric analyses; these are unlikely to arise in ordinal (rank) or nominal (categorical) data.
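The conversion of raw scores into ranks mentioned in the list above can be made concrete with a short Python sketch. It is an illustration only: the function names (`average_ranks`, `rank_sums`) and the scores are hypothetical, and the sketch stops at the rank sums that the Wilcoxon rank sum test works from, without computing a p value.

```python
def average_ranks(scores):
    """Return 1-based ranks for scores; tied scores share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_sums(group_a, group_b):
    """Rank the pooled scores, then sum the ranks that belong to each group."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    return sum(ranks[:n_a]), sum(ranks[n_a:])


# Hypothetical scores; the 40 is an extreme outlier in raw-score terms.
group_a = [7, 9, 12, 40]
group_b = [3, 5, 6, 8]
w_a, w_b = rank_sums(group_a, group_b)
```

Once the scores are ranked, the value 40 simply becomes rank 8, one step above the next highest score. This is why extreme outliers are usually less of a concern in rank-based analyses.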

Researchers should consider the use of nonparametric statistics when their data fail to meet some or all of the requirements for parametric statistics (see Note 5). The issues outlined above are summarized in Table 1.3.

Jaccard and Becker (2002) pointed out that there is disagreement among behavioral scientists about when to use parametric versus nonparametric analyses. Some conservative statisticians argue that parametric analyses should be used only when all the assumptions listed in the discussion of parametric statistics above are met (i.e., only when scores on the dependent variable are quantitative, interval/ratio level of measurement, and normally distributed and meet all other assumptions for the use of a specific statistical test). On the other hand, Bohrnstedt and Carter (1971) have advocated a very liberal position; they argued that many parametric techniques are fairly robust (see Note 6) to violations of assumptions and concluded that even for variables measured at an ordinal level, “parametric analyses not only can be, but should be, applied.”

The recommendation made here is a compromise between the conservative and liberal positions. It is useful to review all the factors summarized in Table 1.3 when making the choice between parametric and nonparametric tests. A researcher can safely use parametric statistics when all the requirements listed for parametric tests are met. That is, if scores on the dependent variable are quantitative and normally distributed, scores are interval/ratio level of measurement and have equal variances across groups, and there is a minimum N per group of at least 20 or 30, parametric statistics may be used.

When only one or two of the requirements for a parametric statistic are violated, or if the violations are not severe (e.g., the distribution shape for scores on the outcome variable is only slightly different from normal), then it may still be reasonable to use a parametric statistic. When in doubt about whether to choose parametric or nonparametric statistics, many researchers lean toward choosing parametric statistics. There are several reasons for this preference. First, the parametric tests are more familiar to most students, researchers, and journal editors. Second, it is widely thought that parametric tests have better statistical power; that is, they give the researcher a better chance of obtaining a statistically significant outcome (however, this is not necessarily always the case). An additional issue becomes relevant when researchers begin to work with more than one predictor and/or more than one outcome variable. For some combinations of predictor and outcome variables, a parametric analysis exists, but there is no analogous nonparametric test. Thus, researchers who use only nonparametric analyses may be limited to working with fewer variables. (This is not necessarily a bad thing.)

When violations of the assumptions for the use of parametric statistics are severe, it is more appropriate to use nonparametric analyses. Violations of assumptions (such as the assumption that scores are distributed normally in the population) become much more problematic when they are accompanied by small (and particularly small and unequal) group sizes.

Table 1.3 Parametric and Nonparametric Statistics

| Parametric Tests (Such as M, t, F, Pearson’s r) Are More Appropriate When | Nonparametric Tests (Such as Median, Wilcoxon Rank Sum Test, Friedman One-Way ANOVA by Ranks, Spearman r) Are More Appropriate When |
| --- | --- |
| The outcome variable Y is interval/ratio level of measurement (a) | The outcome variable Y is nominal or rank ordinal level of measurement (data may be collected as ranks or converted to ranks) |
| Scores on Y are approximately normally distributed (b) | Scores on Y are not necessarily normally distributed |
| There are no extreme outlier values of Y (c) | There can be extreme outlier Y scores |
| Variances of Y scores are approximately equal across populations that correspond to groups in the study (d) | Variances of scores are not necessarily equal across groups |
| The N of cases in each group is “large” (e) | The N of cases in each group can be “small” |

a.   Many variables that are widely used in psychology (such as 5-point or 7-point attitude ratings, personality test scores, and so forth) have scores that probably do not have true equal interval-level measurement properties. For example, consider 5-point degree of agreement ratings: The difference between a score of 4 and 5 and the difference between a score of 1 and 2 probably do not correspond to the same amount of change in agreement. Thus, 5-point ratings probably do not have true equal-interval measurement properties. On the basis of the arguments reported in Section 1.5, many researchers go ahead and apply parametric statistics (such as Pearson’s r and t test) to data from 5-point ratings and personality tests and other measures that probably fall short of satisfying the requirements for a true interval level of measurement as defined by S. Stevens (1946).

b.   Chapter 4 discusses how to assess this by looking at histograms of Y scores to see if the shape resembles the bell curve shown in Figure 1.4.

c.   Chapter 4 also discusses identification and treatment of outliers or extreme scores.

d.   Parametric statistics such as the t test and ANOVA were developed based on the assumption that the Y scores have equal variances in the populations that correspond to the samples in the study. Data that violate the assumption of equal variances can, in theory, lead to misleading results (an increased risk of Type I error, discussed in Chapter 3). In practice, however, the t test and ANOVA can yield fairly accurate results even when the equal variance assumption is violated, unless the Ns of cases within groups are small and/or unequal across groups. Also, there is a modified version of the t test (usually called the “separate variances t test” or “equal variances not assumed t test”) that takes violations of the equal variance assumption into account and corrects for this problem.

e.   There is no agreed-on standard about an absolute minimum sample size required in the use of parametric statistics. The suggested guideline given here is as follows: Consider nonparametric tests when N is less than 20, and definitely use nonparametric tests when N is less than 10 per group; but this is arbitrary. Smaller Ns are most problematic when there are other problems with the data, such as outliers.

In practice, it is useful to consider this entire set of criteria. If the data fail to meet just one criterion for the use of parametric tests (for example, if scores on Y do not quite satisfy the requirements for interval/ratio level of measurement), researchers often go ahead and use parametric tests, as long as there are no other serious problems. However, the larger the number of problems with the data, the stronger the case becomes for the use of nonparametric tests. If the data are clearly ordinal or if Y scores have a drastically nonnormal shape and if, in addition, the Ns within groups are small, a nonparametric test would be strongly preferred. Group Ns that are unequal can make other problems (such as unequal variances across groups) more serious.

Almost all the statistics reported in this textbook are parametric. (The only nonparametric statistics reported are the χ2 test of association in Chapter 8 and the binary logistic regression in Chapter 23.) If a student or researcher anticipates that his or her data will usually require nonparametric analysis, that student should take a course or at least buy a good reference book on nonparametric methods.
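As an illustration of the separate variances t test mentioned in note d, the following Python sketch computes the test statistic and its adjusted degrees of freedom. It assumes the usual Welch form of the test with the Welch-Satterthwaite approximation for degrees of freedom. The function name `welch_t` and the scores are hypothetical, and no p value is computed.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_1, group_2):
    """Return (t, df) for the separate variances (Welch) t test."""
    n1, n2 = len(group_1), len(group_2)
    # Squared standard error contributed by each group: s^2 / n.
    se1 = variance(group_1) / n1
    se2 = variance(group_2) / n2
    t = (mean(group_1) - mean(group_2)) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical scores: small, unequal groups with clearly unequal spread.
group_1 = [10, 12, 14]
group_2 = [1, 5, 9, 13, 17]
t, df = welch_t(group_1, group_2)
```

For these scores the adjusted df is about 5.16, smaller than the n1 + n2 − 2 = 6 that the equal variances version of the test would use. Using fewer degrees of freedom makes the test somewhat more conservative.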

Special statistical methods have been developed to handle ordinal data—that is, scores that are obtained as ranks or that are converted into ranks during preliminary data handling. Strict adherence to the Stevens theory would lead us to use medians instead of means for ordinal data (because finding a median involves only rank ordering and counting scores, not summing them).
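A small Python sketch, using hypothetical scores, shows why the median depends only on rank order while the mean depends on the sum of the scores. Replacing the highest score with an extreme value leaves the median unchanged but shifts the mean substantially.

```python
from statistics import mean, median

# Hypothetical scores; the second list replaces the top score with an outlier.
scores = [2, 3, 3, 4, 5]
scores_with_outlier = [2, 3, 3, 4, 50]

median_plain = median(scores)                 # middle score after rank ordering
median_outlier = median(scores_with_outlier)  # same rank order, so same median
mean_plain = mean(scores)                     # 17 / 5
mean_outlier = mean(scores_with_outlier)      # 62 / 5, pulled upward by the outlier
```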

The choice between parametric and nonparametric statistics is often difficult because there are no generally agreed-on decision standards. In research methods and statistics, generally, it is more useful to ask, “What are the advantages and disadvantages of each approach to the problem?” than to ask, “Which is the right and which is the wrong answer?” Parametric and nonparametric statistics each have strengths and limitations. Experimental and nonexperimental designs each have advantages and problems. Self-report and behavioral observations each have advantages and disadvantages. In the discussion section of a research report, the author should point out the advantages and also acknowledge the limitations of the choices that he or she has made in research design and data analysis.

The limited coverage of nonparametric techniques in this book should not be interpreted as a negative judgment about their value. There are situations (particularly designs with small Ns and severely nonnormal distributions of scores on the outcome variables) where nonparametric analyses are preferable. In particular, when data come in the form of ranks, nonparametric procedures developed specifically for the analysis of ranks may be preferable. For a thorough treatment of nonparametric statistics, see Siegel and Castellan (1988).

Some additional assumptions are so basic that they generally are not even mentioned, but these are important considerations whether a researcher uses parametric or nonparametric statistics:

1. Scores on the outcome variable are assumed to be independent of each other (except in repeated measures data, where correlations among scores are expected and the dependence among scores has a relatively simple pattern). It is easier to explain the circumstances that lead to “nonindependent” scores than to define independence of observations formally. Suppose a teacher gives an examination in a crowded classroom, and students in the class talk about the questions and exchange information. The scores of students who communicate with each other will not be independent; that is, if Bob and Jan jointly decide on the same answer to several questions on the exam, they are likely to have similar (statistically related or dependent) scores. Nonindependence among scores can arise due to many kinds of interactions among participants, apart from sharing information or cheating; nonindependence can arise from persuasion, competition, or other kinds of social influence. This problem is not limited to human research subjects. For example, if a researcher measures the heights of trees in a grove, the heights of neighboring trees are not independent; a tree that is surrounded by other tall trees has to grow taller to get exposure to sunlight.

   For both parametric and nonparametric statistics, different types of analysis are used when the data involve repeated measures than when they involve independent outcome scores.

2. The number of cases in each group included in the analysis should be reasonably large. Parametric tests typically require larger numbers per group than nonparametric tests to yield reasonable results. However, even for nonparametric analyses, extremely small Ns are undesirable. (A researcher does not want to be in a situation where a 1- or 2-point change in score for one participant would completely change the nature of the outcome, and very small Ns sometimes lead to this kind of instability.) There is no agreed-on absolute minimum N for each group. For each analysis presented in this textbook, a discussion about sample size requirements is included.

3. The analysis will yield meaningful and interpretable information about the relations between variables only if we have a “correctly specified model”; that is, we have included all the variables that should be included in the analysis, and we have not included any irrelevant or inappropriate variables. In other words, we need a theory that correctly identifies which variables should be included and which variables should be excluded. Unfortunately, we can never be certain that we have a correctly specified model. It is always possible that adding or dropping a variable in the statistical analysis might change the outcome of the analysis substantially. Our inability to be certain about whether we have a correctly specified model is one of the many reasons why we can never take the results of a single study as proof that a theory is correct (or incorrect).

1.12 Selection of an Appropriate Bivariate Analysis

Bivariate statistics assess the relation between a pair of variables. Often, one variable is designated as the independent variable and the other as dependent. When one of the variables is manipulated by the researcher, that variable is designated as the independent variable. In nonexperimental research situations, the decision regarding which variable to treat as an independent variable may be arbitrary. In some research situations, it may be preferable not to make a distinction between independent and dependent variables; instead, the researcher may merely report that two variables are correlated without identifying one as the predictor of the other. When a researcher has a theory that X might cause or influence Y, the researcher generally uses scores on X as predictors of Y even when the study is nonexperimental. However, the results of a nonexperimental study cannot be used to make a causal inference, and researchers need to be careful to avoid causal language when they interpret results from nonexperimental studies.

During introductory statistics courses, the choice of an appropriate statistic for various types of data is not always explicitly addressed. Aron and Aron (2002) and Jaccard and Becker (2002) provide good guidelines for the choice of bivariate analyses. The last part of this chapter summarizes the issues that have been discussed up to this point and shows how consideration of these issues influences the choice of an appropriate bivariate statistical analysis. Similar issues continue to be important when the analyses include more than one predictor and/or more than one outcome variable.

The choice of an appropriate bivariate analysis to assess the relation between two variables is often based, in practice, on the types of variables involved: categorical versus quantitative. The following guidelines for the choice of statistic are based on a discussion in Jaccard and Becker (2002). Suppose that a researcher has a pair of variables X and Y. There are three possible combinations of types of variables (see Table 1.4):

Case I: Both X and Y are categorical.

Case II: X is categorical, and Y is quantitative (or Y is categorical, and X is quantitative).

Case III: Both X and Y are quantitative.

Consider Case I: The X and Y variables are both categorical; the data are usually summarized in a contingency table that summarizes the numbers of scores in each XY group. The chi-square test of association (or one of many other contingency table statistics) can be used to assess whether X and Y are significantly related. This will be discussed in Chapter 8. There are many other types of statistics for contingency tables (Everitt, 1977).
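To make Case I concrete, the following Python sketch computes the chi-square statistic for a small contingency table. The counts and the function name `chi_square` are hypothetical. The sketch uses the familiar formula that sums (observed − expected)² / expected over all cells, with no continuity correction, and it does not compute a p value.

```python
def chi_square(table):
    """Return (chi-square, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count in this cell if X and Y were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed_count - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical counts: rows are the X groups, columns are the Y categories.
observed = [[30, 20],
            [15, 35]]
chi2, df = chi_square(observed)
```

For these counts the statistic is about 9.09 with 1 df. It would then be compared with a critical value from the chi-square distribution.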

Consider Case III: The X and Y variables are both quantitative variables. If X and Y are linearly related (and if other assumptions required for the use of Pearson’s r are reasonably well satisfied), a researcher is likely to choose Pearson’s r to assess the relation between the X and Y variables. Other types of correlation (such as Spearman r) may be preferred when the assumptions for Pearson’s r are violated; Spearman r is an appropriate analysis when the X and Y scores consist of ranks (or are converted to ranks to get rid of problems such as extreme outliers).
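The relation between the two correlations can be sketched in Python. This illustration treats Spearman r as Pearson’s r computed on ranks, with tied scores given the mean of their ranks. The function names and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def to_ranks(scores):
    """Rank scores from 1 upward; tied scores receive the mean of their ranks."""
    ordered = sorted(scores)
    # First position of the score in sorted order, plus half the number of extra ties.
    return [ordered.index(s) + 1 + (ordered.count(s) - 1) / 2 for s in scores]


def spearman_r(x, y):
    """Spearman correlation: Pearson's r applied to the ranks of x and y."""
    return pearson_r(to_ranks(x), to_ranks(y))


# Hypothetical scores; the last case is an extreme outlier on Y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 7, 100]
r = pearson_r(x, y)
rs = spearman_r(x, y)
```

Y increases with X throughout, so the ranks agree perfectly and Spearman r equals 1. Pearson’s r is noticeably lower (about .74) because the outlier makes the relation far from linear.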

Now consider Case II: One variable (usually the X or independent variable) is categorical, and the other variable (usually the Y or dependent variable) is quantitative. In this situation, the analysis involves comparing means, medians, or sums of ranks on the Y variable across the groups that correspond to scores on the X variable. The choice of an appropriate statistic in this situation depends on several factors; the following list is adapted from Jaccard and Becker (2002):

1. Whether scores on Y satisfy the assumptions for parametric analyses or violate these assumptions badly enough so that nonparametric analyses should be used

2. The number of groups that are compared (i.e., the number of levels of the X variable)

3. Whether the groups correspond to a between-S design (i.e., there are different participants in each group) or to a within-S or repeated measures design

One cell in Table 1.4 includes a decision tree for Case II from Jaccard and Becker (2002); this decision tree maps out choices among several common bivariate statistical methods based on the answers to these questions. For example, if the scores meet the assumptions for a parametric analysis, two groups are compared, and the design is between-S, the independent samples t test is a likely choice. Note that although this decision tree leads to just one analysis for each situation, sometimes other analyses could be used.
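The three questions listed above can be expressed as a simple lookup, sketched here in Python. Only some of the pairings are named in this chapter (for example, the independent samples t test for two groups in a parametric between-S design). The remaining entries follow conventional textbook pairings and are assumptions that may not match every cell of the decision tree in Table 1.4. The function name `choose_case_ii_test` is hypothetical.

```python
def choose_case_ii_test(parametric_ok, n_groups, within_s):
    """Suggest a common test for a categorical X and a quantitative Y."""
    if n_groups < 2:
        raise ValueError("At least two groups are needed for a comparison.")
    two_groups = n_groups == 2
    # Keys: (parametric assumptions met?, exactly two groups?, within-S design?)
    tree = {
        (True, True, False): "independent samples t test",
        (True, True, True): "paired samples t test",
        (True, False, False): "one-way between-S ANOVA",
        (True, False, True): "one-way repeated measures ANOVA",
        (False, True, False): "Wilcoxon rank sum test",
        (False, True, True): "Wilcoxon signed-rank test",
        (False, False, False): "Kruskal-Wallis test",
        (False, False, True): "Friedman one-way ANOVA by ranks",
    }
    return tree[(bool(parametric_ok), two_groups, bool(within_s))]


# Two groups, assumptions met, different participants in each group.
suggestion = choose_case_ii_test(parametric_ok=True, n_groups=2, within_s=False)
```

A lookup of this kind returns one suggestion per situation. As the text notes, other analyses could sometimes be used instead.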

This textbook covers only parametric statistics (i.e., statistics in the parametric branch of the decision tree for Case II in Table 1.4). In some situations, however, nonparametric statistics may be preferable (see Siegel & Castellan, 1988, for a thorough presentation of nonparametric methods).

Table 1.4 Selecting an Appropriate Bivariate Statistic Based on Type of Independent Variable (IV) and Dependent Variable (DV)

SOURCE: Decision tree adapted from Jaccard and Becker (2002).

1.13 Summary

Reconsider the hypothetical experiment to assess the effects of caffeine on anxiety. Designing a study and choosing an appropriate analysis raise a large number of questions even for this very simple research question.

A nonexperimental study could be done; that is, instead of administering caffeine, a researcher could ask participants to self-report the amount of caffeine consumed within the past 3 hours and then self-report anxiety.

A researcher could do an experimental study (i.e., administer caffeine to one group and no caffeine to a comparison group under controlled conditions and subsequently measure anxiety). If the study is conducted as an experiment, it could be done using a between-S design (each participant is tested under only one condition, either with caffeine or without caffeine), or it could be done as a within-S or repeated measures study (each participant is tested under both conditions, with and without caffeine).

Let’s assume that the study is conducted as a simple experiment with a between-S design. The outcome measure of anxiety could be a categorical variable (i.e., each participant is identified by an observer as a member of the “anxious” or “nonanxious” group). In this case, a table could be set up to report how many of the persons who consumed caffeine were classified as anxious versus nonanxious, as well as how many of those who did not receive caffeine were classified as anxious versus nonanxious, and a chi-square test could be performed to assess whether people who received caffeine were more likely to be classified as anxious than people who did not receive caffeine.

If the outcome variable, anxiety, is assessed by having people self-report their level of anxiety using 5-point ratings, an independent samples t test could be used to compare mean anxiety scores between the caffeine and no-caffeine groups. If examination of the data indicated serious violations of the assumptions for this parametric test (such as nonnormally distributed scores or unequal group variances, along with very small numbers of participants in the groups), the researcher might choose to use the Wilcoxon rank sum test to analyze the data from this study.
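For the caffeine example, the independent samples t test can be sketched in Python as follows. The ratings are invented for illustration, and the groups are far smaller than a real study would need. The function name `independent_t` is hypothetical. The sketch uses the equal variances (pooled) form of the test and does not compute a p value.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_1, group_2):
    """Return (t, df) for the equal variances independent samples t test."""
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    # Pooled variance: each group's variance weighted by its degrees of freedom.
    pooled = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / df
    se_diff = sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(group_1) - mean(group_2)) / se_diff, df


# Hypothetical 5-point anxiety ratings for the two conditions.
caffeine = [4, 5, 3, 4, 4]
no_caffeine = [2, 3, 2, 1, 2]
t, df = independent_t(caffeine, no_caffeine)
```

With these ratings the group means are 4 and 2, and the t ratio is about 4.47 with 8 df.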

This chapter reviewed issues that generally are covered in early chapters of introductory statistics and research methods textbooks. On the basis of this material, the reader should be equipped to think about the following issues, both when reading published research articles and when planning a study:

1. Evaluate whether the sample is a convenience sample or a random sample from a well-defined population, and recognize how the composition of the sample in the study may limit the ability to generalize results to broader populations.

2. Understand that the ability to make inferences about a population from a sample requires that the sample be reasonably representative of or similar to the population.

3. For each variable in a study, understand whether it is categorical or quantitative.

4. Recognize the differences between experimental, nonexperimental, and quasi-experimental designs.

5. Understand the difference between between-S (independent groups) and within-S (repeated measures) designs.

6. Recognize that research designs differ in internal validity (the degree to which they satisfy the conditions necessary to make a causal inference) and external validity (the degree to which results are generalizable to participants, settings, and materials different from those used in the study).

7. Understand why experiments typically have stronger internal validity and why experiments may have weaker external validity compared with nonexperimental studies.

8. Understand the issues involved in making a choice between parametric and nonparametric statistical methods.

9. Be able to identify an appropriate statistical analysis to describe whether scores on two variables are related, taking into account whether the data meet the assumptions for parametric tests; the type(s) of variables, categorical versus quantitative; whether the design is between-S or within-S; and the number of groups that are compared. The decision tree in Table 1.4 identifies the most widely used statistical procedure for each of these situations.

10. Most important, readers should remember that whatever choices researchers make, each choice typically has both advantages and disadvantages. The discussion section of a research report can point out the advantages and strengths of the approach used in the study, but it should also acknowledge potential weaknesses and limitations. If the study was not an experiment, the researcher must avoid using language that implies that the results of the study are proof of causal connections. Even if the study is a well-designed experiment, the researcher should keep in mind that no single study provides definitive proof for any claim. If the sample is not representative of any well-defined, real population of interest, limitations in the generalizability of the results should be acknowledged (e.g., if a study that assesses the safety and effectiveness of a drug is performed on a sample of persons 18 to 22 years old, the results may not be generalizable to younger and older persons). If the data violate many of the assumptions for the statistical tests that were performed, this may invalidate the results.

Table 1.5 provides an outline of the process involved in doing research. Some issues that are included (such as the IRB review) apply only to research that involves human participants as subjects, but most of the issues are applicable to research projects in many different disciplines. It is helpful to think about the entire process and anticipate later steps when making early decisions. For example, it is useful to consider what types of variables you will have and what statistical analyses you will apply to those variables at an early stage in planning. It is essential to keep in mind how the planned statistical analyses are related to the primary research questions. This can help researchers avoid collecting data that are difficult or impossible to analyze.

Table 1.5 Preview of a Typical Research Process for an Honors Thesis, Master’s Thesis, or Dissertation

### Notes

1. Examples in this textbook assume that researchers are dealing with human populations; however, similar issues arise when samples are obtained from populations of animals, plants, geographic locations, or other entities. In fact, many of these statistics were originally developed for use in industrial quality control and agriculture, where the units of analysis were manufactured products and plots of ground that received different treatments.

2. Systematic differences in the composition of the sample, compared with the population, can be corrected for by using case-weighting procedures. If the population includes 500 men and 500 women, but the sample includes 25 men and 50 women, case weights could be used so that, in effect, each of the 25 scores from men would be counted twice in computing summary statistics.

3. The four levels of measurement are called nominal, ordinal, interval, and ratio. In nominal level of measurement, each number code serves only as a label for group membership. For example, the nominal variable gender might be coded 1 = male, 2 = female, and the nominal variable religion might be coded 1 = Buddhist, 2 = Christian, 3 = Hindu, 4 = Islamic, 5 = Jewish, 6 = other religion. The sizes of the numbers associated with groups do not imply any rank ordering among groups. Because these numbers serve only as labels, Stevens argued that the only logical operations that could appropriately be applied to the scores are = and ≠. That is, persons with scores of 2 and 3 on religion could be labeled as “the same” or “not the same” on religion. In ordinal measurement, numbers represent ranks, but the differences between scores do not necessarily correspond to equal intervals with respect to any underlying characteristic. The runners in a race can be ranked in terms of speed (runners are tagged 1, 2, and 3 as they cross the finish line, with 1 representing the fastest time). These scores supply information about rank (1 is faster than 2), but the numbers do not necessarily represent equal intervals. The difference in speed between Runners 1 and 2 (i.e., 2 − 1) might be much larger or smaller than the difference in speed between Runners 2 and 3 (i.e., 3 − 2), despite the difference in scores in both cases being one unit. For ordinal scores, the operations > and < would be meaningful (in addition to = and ≠). However, according to Stevens, addition or subtraction would not produce meaningful results with ordinal measures (because a one-unit difference does not correspond to the same “amount of speed” for all pairs of scores). Scores that have interval level of measurement qualities supply ordinal information and, in addition, represent equally spaced intervals. 
That is, no matter which pair of scores is considered (such as 3 − 2 or 7 − 6), a one-unit difference in scores should correspond to the same amount of the thing that is being measured. Interval level of measurement does not necessarily have a true 0 point. The centigrade temperature scale is a good example of interval level of measurement: The 10-point difference between 40°C and 50°C is equivalent to the 10-point difference between 50°C and 60°C (in each case, 10 represents the same number of degrees of change in temperature). However, because 0°C does not correspond to a complete absence of any heat, it does not make sense to look at a ratio of two temperatures. For example, it would be incorrect to say that 40°C is “twice as hot” as 20°C. Based on this reasoning, it makes sense to apply the plus and minus operations to interval scores (as well as the equality and inequality operators). However, by this reasoning, it would be inappropriate to apply multiplication and division to numbers that do not have a true 0 point. Ratio-level measurements are interval-level scores that also have a true 0 point. A clear example of a ratio-level measurement is height. It is meaningful to say that a person who is 6 ft tall is twice as tall as a person 3 ft tall because there is a true 0 point for height measurements. The narrowest interpretation of this reasoning would suggest that ratio level is the only type of measurement for which multiplication and division would yield meaningful results.

Thus, strict adherence to Stevens’s measurement theory would imply that statistics that involve addition, subtraction, multiplication, and division (such as mean, Pearson’s r, t test, analysis of variance, and all other multivariate techniques covered later in this textbook) can legitimately be applied only to data that are at least interval (and preferably ratio) level of measurement.

4. There is one exception. When a nominal variable has only two categories and the codes assigned to these categories are 0 and 1 (e.g., the nominal variable gender could be coded 0 = male, 1 = female), the mean of these scores represents the proportion of persons who are female.

5. Violations of the assumptions for parametric statistics create more serious problems when they are accompanied by small Ns in the groups (and/or unequal Ns in the groups). Sometimes, just having very small Ns is taken as sufficient reason to prefer nonparametric statistics. When Ns are very small, it becomes quite difficult to evaluate whether the assumptions for parametric statistics are satisfied (such as normally distributed scores on quantitative variables).

6. A nontechnical definition of robust is provided at this point. A statistic is robust if it provides “accurate” results even when one or more of its assumptions are violated. A more precise definition of this term will be provided in Chapter 2.
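The case-weighting idea described in Note 2 can be illustrated with a scaled-down Python sketch. The scores are hypothetical, and the function name `weighted_mean` is an assumption made for this example. As in Note 2, men are half as numerous as women in the sample although the population is evenly split, so each man’s score receives a weight of 2.

```python
def weighted_mean(scores, weights):
    """Mean in which each score counts in proportion to its case weight."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical sample: 2 men and 4 women drawn from a population with
# equal numbers of men and women.
men = [10, 12]
women = [20, 22, 18, 20]
scores = men + women
weights = [2] * len(men) + [1] * len(women)  # count each man's score twice

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
```

The unweighted mean (17.0) is pulled toward the women’s scores because women are overrepresented in the sample. The weighted mean (15.5) equals the average of the two group means (11 and 20), which is what an evenly split sample would be expected to produce.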