Observational research can be used to measure an infant’s attachment to a caregiver.
3.4 Observational Research
Moving further along the continuum of control, we come to the descriptive design with the greatest amount of researcher control. Observational research involves studies that directly observe behavior and record these observations in an objective and systematic way. Your previous psychology courses may have explored the concept of attachment theory, which argues that an infant’s bond with his or her primary caregiver has implications for later social and emotional development. Mary Ainsworth, a Canadian developmental psychologist, and John Bowlby, a British psychologist and psychiatrist, articulated this theory in the 1960s. They argued that children can form either “secure” or a variety of “insecure” attachments with their caregivers (Ainsworth & Bell, 1970; Bowlby, 1963).
To assess these classifications, Ainsworth and Bell developed an observational technique called the “strange situation.” Mothers would arrive at their laboratory with their children for a series of structured interactions, including having the mother play with the infant, leave him alone with a stranger, and then return to the room after a brief absence. The researchers were most interested in coding the ways in which the infant responded to these various episodes (eight in total). One group of infants, for example, was curious when the mother left but then returned to playing with toys, trusting that she would return. Another group showed immediate distress when the mother left and clung to her nervously upon her return. Based on these and other behavioral observations, Ainsworth and colleagues classified these groups of infants as “securely” and “insecurely” attached to their mothers, respectively.
Research: Making an Impact
In the 1950s, U.S. psychologist Harry Harlow conducted a landmark series of studies on the mother–infant bond using rhesus monkeys. Although contemporary standards would consider his research unethical, the results of his work revealed the importance of affection, attachment, and love on healthy childhood development.
Prior to Harlow’s findings, it was believed that infants attached to their mothers as a part of a drive to fulfill exclusively biological needs, in this case obtaining food and water and avoiding pain (Herman, 2007; van der Horst & van der Veer, 2008). In an effort to clarify the reasons that infants so clearly need maternal care, Harlow removed rhesus monkeys from their natural mothers several hours after birth, giving the young monkeys a choice between two surrogate “mothers.” Both mothers were made of wire, but one was bare and one was covered in terry cloth. Although the wire mother provided food via an attached bottle, the monkeys preferred the softer, terry-cloth mother, even though the latter provided no food (Harlow & Zimmerman, 1958; Herman, 2007).
Further research with the terry-cloth mothers contributed to the understanding of healthy attachment and childhood development (van der Horst & van der Veer, 2008). When the young monkeys were given the chance to explore a room with their terry-cloth mothers present, they used the mothers as a safe base. Similarly, when exposed to novel stimuli such as a loud noise, the monkeys would seek comfort from the cloth-covered surrogate (Harlow & Zimmerman, 1958). However, when the monkeys were left in the room without their cloth mothers, they reacted poorly—freezing up, crouching, crying, and screaming.
A control group of monkeys who were never exposed to either their real mothers or one of the surrogates revealed stunted forms of attachment and affection. They were left incapable of forming lasting emotional attachments with other monkeys (Herman, 2007). Based on this research, Harlow discovered the importance of proper emotional attachment, stressing the importance of physical and emotional bonding between infants and mothers (Harlow & Zimmerman, 1958; Herman, 2007).
Harlow’s influential research led to improved understanding of maternal bonding and child development (Herman, 2007). His research paved the way for improvements in infant and child care and in helping children cope with separation from their mothers (Bretherton, 1992; Du Plessis, 2009). In addition, Harlow’s work contributed to the improved treatment of children in orphanages, hospitals, day care centers, and schools (Herman, 2007; van der Horst & van der Veer, 2008).
Pros and Cons of Observational Research
Observational designs are well suited to a wide range of research questions, provided the questions can be addressed through directly observable behaviors and events. For example, researchers can observe parent–child interactions, or nonverbal cues to emotion, or even crowd behavior. However, if they are interested in studying thought processes—such as how close mothers feel to their children—then observation will not suffice. This point harkens back to the discussion of behavioral measures in Chapter 2 (2.2): In exchange for giving up access to internal processes, researchers gain access to unfiltered behavioral responses.
To capture these unfiltered behaviors, it is vital for the researcher to be as unobtrusive as possible. As we have already discussed, people have a tendency to change their behavior when they are being observed. In the bullying study by Craig and Pepler (1997) discussed at the beginning of this chapter, the researchers used video cameras to record children’s behavior unobtrusively. Imagine how (artificially) low the occurrence of bullying might be if the playground had been surrounded by researchers with clipboards!
If researchers conduct an observational study in a laboratory setting, they have no way to hide the fact that people are being observed, but the use of one-way mirrors and video recordings can help people to become comfortable with the setting. Researchers who conduct an observational study out in the real world have even more possibilities for blending into the background, including using observers who are literally hidden. For example, suppose someone hypothesizes that people are more likely to pick up garbage when the weather is nicer. Rather than station an observer with a clipboard by the trash can, the researcher could place someone out of sight behind a tree, or perhaps sitting on a park bench pretending to read a magazine. In both cases, people would be less conscious of being observed and therefore more likely to behave naturally.
One extremely clever strategy for blending in comes from a study by the social psychologist Muzafer Sherif et al. (1954), involving observations of cooperative and competitive behaviors among boys at a summer camp. For Sherif, it was particularly important to make observations in this context without the boys realizing they were part of a research study. Sherif took on the role of camp janitor, which allowed him to be a presence in nearly all of the camp activities. The boys never paid enough attention to the “janitor” to realize his omnipresence—or his discreet note-taking. The brilliance of this idea is that it takes advantage of the fact that people tend to blend into the background once we become used to their presence.
Types of Observational Research
Several variations of observational research exist, according to the amount of control that a researcher has over the data collection process. Structured observation involves creating a standard situation in a controlled setting and then observing participants’ responses to a predetermined set of events. The “strange situation” studies of parent–child attachment (discussed above) are a good example of structured observation—mothers and infants are subjected to a series of eight structured episodes, and researchers systematically observe and record the infants’ reactions. Even though these types of studies are conducted in a laboratory, they differ from experimental studies in an important way: Rather than systematically manipulate a variable to make comparisons, researchers present the same set of conditions to all participants.
Another example of structured observation comes from the research of John Gottman, a psychologist at the University of Washington. For nearly three decades, Gottman and his colleagues have conducted research on the interaction styles of married couples. Couples who take part in this research are invited for a three-hour session in a laboratory that closely resembles a living room. Gottman’s goal is to make couples feel reasonably comfortable and natural in the setting to get them talking as they might do at home. After allowing them to settle in, Gottman adds the structured element by asking the couple to discuss an “ongoing issue or problem” in their marriage. The researchers then sit back to watch the sparks fly, recording everything from verbal and nonverbal communication to measures of heart rate and blood pressure. Gottman has observed and tracked so many couples over the decades that he is able to predict, with remarkable accuracy, which couples will divorce in the 18 months following the lab visit (Gottman & Levenson, 1992).
Naturalistic observation, meanwhile, involves observing and systematically recording behavior in the real world. This can be conducted in two broad ways—with or without intervention on the part of the researcher. Intervention in this context means that the researcher manipulates some aspect of the environment and then observes people’s responses. For example, a researcher might leave a shopping cart just a few feet away from the cart-return area and track whether people move the cart. (Given the number of carts that are abandoned just inches away from their proper destination, someone must be doing this research all the time.) Recall an example from Chapter 1 (the discussion of ethical dilemmas in section 1.5) in which Harari et al. (1995) used naturalistic observation to study whether people would help in emergency situations. In brief, these researchers staged what appeared to be an attempted rape in a public park and then observed whether groups or individual males were more likely to rush to the victim’s aid.
The ABC network has developed a hit reality show that mimics this type of research. The show, What Would You Do?, sets up provocative situations in public settings and videotapes people’s reactions. An unwitting participant in one of these episodes might witness a customer stealing tips from a restaurant table, or a son berating his father for being gay, or a man proposing to his girlfriend who minutes earlier had been kissing another man at the bar. Of course, these observation “studies” are more interested in shock value than data collection (or Institutional Review Board [IRB] approval; see Section 1.5), but the overall approach can be a useful strategy to assess people’s reactions to various situations. In fact, some of the scenarios on the show are based on classic studies in social psychology, such as the well-documented phenomenon that people are reluctant to take responsibility for helping in emergencies.
Alternatively, naturalistic studies can involve simply recording ongoing behavior without any attempt by the researchers to intervene or influence the situation. In these cases, the goal is to observe and record behavior in a completely natural setting. For example, researchers might station themselves at a liquor store and observe the numbers of men and women who buy beer versus wine. Or, they might observe the numbers of people who give money to the Salvation Army bell-ringers during the holiday season. A researcher can use this approach to compare different conditions, provided the differences occur naturally. That is, researchers could observe whether people donate more money to the Salvation Army on sunny or snowy days, or compare donation rates when the bell-ringers are different genders or races. Do people give more money when the bell-ringer is an attractive female? Or do they give more to someone who looks needier? These are all research questions that could be addressed using a well-designed naturalistic observation study.
Finally, participant observation involves having the researcher(s) conduct observations while engaging in the same activities as the participants. The goal is to interact with these participants to gain better access and insight into their behaviors. In one famous example, the psychologist David Rosenhan (1973) was interested in the experience of people hospitalized for mental illness. To study these experiences, he had eight perfectly sane people gain admission to different mental hospitals. These fake patients were instructed to give accurate life histories to a doctor but lie about one diagnostic symptom. They all claimed to hear an occasional voice saying the words “empty,” “hollow,” and “thud.” Such auditory hallucinations are a symptom of schizophrenia, and Rosenhan chose these words to vaguely suggest an existential crisis.
Psychologist David Rosenhan’s study of staff and patients in a mental hospital found that patients tended to be treated based on their diagnosis, not on their actual behavior.
Once admitted, these “patients” behaved in a normal and cooperative manner, with instructions to convince hospital staff that they were healthy enough to be released. In the meantime, they observed life in the hospital and took notes on their experiences—a behavior that many doctors interpreted as “paranoid note-taking.” The main finding of this study was that hospital staff tended to view all patient behaviors through the lens of their initial diagnoses. Despite immediately acting “normally,” these fake patients were hospitalized an average of 19 days (with a range from 7 to 52) before being released. All but one were diagnosed with “schizophrenia in remission” upon release. Rosenhan’s other striking finding was that treatment was generally depersonalized, with staff spending little time with individual patients.
In another example of participant observation, Festinger, Riecken, and Schachter (1956) decided to join a doomsday cult to test their new theory of cognitive dissonance. Briefly, this theory argues that people are motivated to maintain a sense of consistency among their various thoughts and behaviors. So, for example, a person who smokes a cigarette despite being aware of the health risks might rationalize smoking by convincing herself that lung-cancer risk is really just genetic. In this case, Festinger and colleagues stumbled upon the case of a woman named Mrs. Keach, who was predicting the end of the world, via alien invasion, at 11 p.m. on a specific date six months in the future. What would happen, they wondered, when this prophecy failed to come true? (One can only imagine how shocked they would have been had the prophecy turned out to be correct.)
To answer this question, the researchers pretended to be new converts and joined the cult, living among the members and observing them as they made their preparations for doomsday. Sure enough, the day came, and 11 p.m. came and went without the world ending. Mrs. Keach first declared that she had forgotten to account for a time-zone difference, but as sunrise started to approach, the group members became restless. Finally, after a short absence to communicate with the aliens, Mrs. Keach returned with some good news: The aliens were so impressed with the devotion of the group that they decided to postpone their invasion. The group members rejoiced, rallying around this brilliant piece of rationalizing, and quickly began a new campaign to recruit new members.
As these examples illustrate, participant observation can provide access to amazing and one-of-a-kind data, including insights into group members’ thoughts and feelings. This approach also provides access to groups that might be reluctant to allow outside observers. However, the participant approach has two clear disadvantages compared with other types of observation. The first problem is ethical; data are collected from individuals who do not have the opportunity to give informed consent. Indeed, the whole point of the technique is to observe people without their knowledge. Before an IRB can approve this kind of study, researchers must show an extremely compelling reason to ignore informed consent, as well as extremely rigorous measures to protect identities. The second problem is methodological; the approach provides ample opportunity for the objectivity of observations to be compromised by the close contact between researcher and participant. Because the researchers are a part of the group, they can change the dynamics in subtle ways, possibly leading the group to confirm their hypothesis. In addition, the group can shape the researchers’ interpretations in subtle ways, leading them to miss important details.
Another spin on participant observation is called ethnography, or the scientific study of the customs of people and cultures. This is very much a qualitative method that focuses on observing people in the real world and learning about a culture from the perspective of the person being studied—that is, learning from the ground up rather than testing hypotheses. Ethnography is used primarily in other social-science fields, such as anthropology. In one famous example, the cultural anthropologist Margaret Mead (1928) used this approach to shed light on differences in social norms around adolescence between American and Samoan societies. Mead’s conclusions were based on interviews she conducted over a six-month period, observing and living alongside a group of 68 young women. Mead concluded from these interviews that Samoan children and adolescents are largely ignored until they reach the age of 16 and become full members of society. Among her more provocative claims was the idea that Samoan adolescents were much more liberal in their sexual attitudes and behaviors than American adolescents.
Mead’s work has been the subject of criticism by a handful of other anthropologists, one of whom has even suggested that Mead was taken in by an elaborate joke played by the group of young girls. Still others have come to Mead’s rescue and challenged the critics’ interpretations. The nature of this debate between Mead’s critics and her supporters highlights a distinctive characteristic of qualitative methods: “Winning” the argument is based on challenging interpretations of the original interviews and observations. In contrast, disagreements around quantitative methods are generally based on examining statistical results from hypothesis testing. While quantitative methods may lose much of the richness of people’s experiences, they do offer an arguably more objective way of settling theoretical disputes.
Steps in Observational Research
One of the major strengths of observational research is its high degree of ecological validity; that is, the research can be conducted in situations that closely resemble the real world. Think of the chapter examples so far—married couples observed in a living-room-like laboratory; doomsday cults observed from within; bullying behaviors on the school playground. In every case, people’s behaviors are observed in the natural environment or something very close to it. However, this ecological validity comes at a price; the real world is a jumble of information, some relevant, some not so much. The challenge for researchers, then, is to decide on a system that provides the best test of their hypothesis, one that can sort out the signal from the noise. This section discusses a three-step process for conducting observational research. The key point to note right away is that most of this process involves making decisions ahead of time so that the process of data collection is smooth, simple, and systematic.
Step 1—Develop a Hypothesis
For research to be systematic, it is important to impose structure by having a clear research question, and, in the case of quantitative research, a clear hypothesis as well. Other chapters have covered hypotheses in detail, but the main points bear repeating: A hypothesis must be testable and falsifiable, meaning that it must be framed in such a way that it can be addressed through empirical data and might be disconfirmed by these data. In the example involving Salvation Army donations, we predicted that people might donate more money to an attractive bell-ringer. This hypothesis could easily be tested empirically and could just as easily be disconfirmed by the right set of data—say, if attractive bell-ringers brought in the fewest donations.
This particular example also highlights an additional important feature of observational hypotheses; namely, they must be based on observable behaviors. That is, we can safely make predictions about the amount of money people will donate because we can directly observe it. We are, nonetheless, unable to make predictions in this context about the reasons for donations. We would have no way to observe, say, that people donate more to attractive bell-ringers because they are trying to impress them. In sum, one limitation of observing behavior in the real world is that it prevents researchers from delving into the cognitive and motivational reasons behind the behaviors.
Step 2—Decide What and How to Sample
Once a researcher has developed a hypothesis that is testable, falsifiable, and observable, the next step is to decide what kind of information to gather from the environment to test this hypothesis. The simple fact is that the world is too complex to sample everything. Imagine that someone wanted to observe the dinner rush at a restaurant. A nearly infinite list of possibilities for observation presents itself: What time does the restaurant get crowded? How often do people send their food back to the kitchen? What are the most popular dishes? How often do people get in arguments with the wait staff? To simplify the process of observing behavior, the researcher will need to take a sample, or a smaller portion of the population, that is relevant to the hypothesis. That is, rather than observing “dinner at the restaurant,” the researcher’s goal is to narrow his or her focus to something as specific as “the number of people waiting in line for a table at 6 p.m. versus 9 p.m.”
The dinner scene at a busy restaurant offers a wide variety of behaviors to observe. In order to simplify the observation process, researchers should narrow the focus by taking a sample.
The choice of what and how to sample will ultimately depend on the best fit for the hypothesis. The context of observational research offers three strategies for sampling behaviors and events. The first strategy, time sampling, involves comparing behaviors during different time intervals. For example, to test the hypothesis that football teams make more mistakes when they start to get tired, researchers could count the number of penalties in the first five minutes and the last five minutes of the game. These data would allow researchers to compare mistakes at one time interval with mistakes at another time interval. In the case of Festinger and colleagues’ (1956) study of a doomsday cult, time sampling was used to compare how the group members behaved before and after their prophecy failed to come true.
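As a rough illustration of the time-sampling comparison in the football example, the short Python sketch below counts how many recorded events fall into two time windows. The penalty timestamps, the 60-minute game length, and the names `count_in_window` and `penalty_minutes` are all hypothetical, invented only to show the logic.

```python
def count_in_window(event_minutes, start, end):
    """Count events whose timestamp falls in the interval [start, end)."""
    return sum(start <= minute < end for minute in event_minutes)


# Hypothetical data: minute of game time at which each penalty was called.
penalty_minutes = [1.5, 3.2, 14.0, 27.8, 41.1, 55.4, 56.0, 57.9, 59.2]

GAME_LENGTH = 60  # assumed length of the game clock, in minutes
WINDOW = 5        # width of each sampled interval, in minutes

first_window = count_in_window(penalty_minutes, 0, WINDOW)
last_window = count_in_window(penalty_minutes, GAME_LENGTH - WINDOW, GAME_LENGTH)

print(f"Penalties in first {WINDOW} minutes: {first_window}")
print(f"Penalties in last {WINDOW} minutes: {last_window}")
```

With these invented timestamps, two penalties fall in the first window and four in the last. The point of the sketch is that the windows are fixed before any data are examined, so both intervals are counted by the same rule.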
The second strategy, individual sampling, involves collecting data by observing one person at a time to test hypotheses about individual behaviors. Many of the examples already discussed involve individual sampling: Ainsworth and colleagues (1970) tested their hypotheses about attachment behaviors by observing individual infants, while Gottman (1992) tests his hypotheses about romantic relationships by observing one married couple at a time. These types of data allow researchers to examine behavior at the individual level and test hypotheses about the kinds of things people do—from the way they argue with their spouses to whether they wear team colors to a football game.
The third strategy, event sampling, involves observing and recording behaviors that occur throughout an event. For example, we could track the number of fights that break out during an event such as a football game, or the number of times people leave the restaurant without paying the check. This strategy allows for testing hypotheses about the types of behaviors that occur in a particular environment or setting. For instance, a researcher might compare the number of fights that break out in a professional football game versus a professional hockey game. Or, the next time we host a party, we could count the number of wine bottles versus beer bottles that end up in the recycling bin. The distinguishing feature of this strategy is its focus on the occurrence of behaviors more than on the individuals performing these behaviors.
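One way to picture event sampling is as a tally of occurrences per event, grouped by setting. The Python sketch below assumes a hypothetical log in which each entry records the sport and the number of fights observed in one game; the data and the function name `mean_events_per_setting` are illustrative only.

```python
from collections import defaultdict


def mean_events_per_setting(records):
    """Average number of recorded behaviors per event, grouped by setting."""
    totals = defaultdict(int)
    events = defaultdict(int)
    for setting, count in records:
        totals[setting] += count
        events[setting] += 1
    return {setting: totals[setting] / events[setting] for setting in totals}


# Hypothetical log: (sport, fights observed during one game).
observations = [
    ("football", 1), ("hockey", 3),
    ("football", 0), ("hockey", 2),
    ("football", 2), ("hockey", 4),
]

averages = mean_events_per_setting(observations)
for sport, mean_fights in averages.items():
    print(f"{sport}: {mean_fights:.1f} fights per game")
```

Notice that the log contains no information about who was fighting. That mirrors the focus of event sampling on the occurrence of the behavior, not on the individuals involved.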
Step 3—Record and Code Behavior
Having formulated a hypothesis and decided on the best sampling strategy, researchers must perform one final and critical step before beginning data collection. Namely, they have to develop good operational definitions of the variables by translating the underlying concepts into measurable variables. Gottman’s research turns the concept of marital interactions into a range of measurable variables, such as the number of dismissive comments and passive-aggressive sighing—all things that can be observed and counted objectively. Rosenhan’s 1973 study involving fake schizophrenic patients turned the concept of patient experience into measurable variables such as the amount of time staff members spent with each patient—again, something very straightforward to observe.
It is vital that researchers decide up front what kinds and categories of behavior they will be observing and recording. In the last section, we narrowed down our observation of dinner at the restaurant to the number of people in line at 6 p.m. versus the number of people in line at 9 p.m. But how can we be sure of an accurate count? What if two people are waiting by the door while the other two members of the group are sitting at the bar? Are those at the bar waiting for a table or simply having drinks? One possibility might be to count the number of individuals who walk through the door in different time periods, although our count could be inflated by those who give up on waiting or who only enter to sneak in and out of the restroom.
In short, observing behavior in the real world can be messy. The best way to deal with this mess is to develop a clear and consistent categorization scheme and stick with it. That is, in testing a hypothesis about the most crowded time at a restaurant, researchers would choose one method of counting people and use it for the duration of the study. In part, this choice of a method is a judgment call, but researchers’ judgment should be informed by three criteria. First, they should consider practical issues, such as whether their categories can be directly observed. A researcher can observe the number of people who leave the restaurant but cannot observe whether they got impatient. Second, they should consider theoretical issues, such as how well the categories represent the underlying theory. Why did researchers decide to study the most crowded time at the restaurant? Perhaps this particular restaurant is in a new, up-and-coming neighborhood, and they expect the restaurant to become crowded over the course of the evening. This rationale would also lead researchers to count people sitting both at tables and at the bar, because this crowd may come to the restaurant with the sole intention of staying at the bar. Finally, researchers should consider previous research in choosing their categories. Have other researchers studied dining patterns in restaurants? What kinds of behaviors did they observe? If these categories make sense for the project, researchers may feel free to reuse them—no need to reinvent the wheel.
Last but not least, a researcher should take a step back and evaluate both the validity and the reliability of the coding system. (See Section 2.2 for a review of these terms.) Validity in this case means making sure the categories capture the underlying variables in the hypothesis (i.e., construct validity; see Section 2.2). For example, in Gottman’s studies of marital interactions, some of the most important variables are the emotions expressed by both partners. One way to observe emotions would be to count the number of times a person smiles. However, we would have to think carefully about the validity of this measure, because smiling could indicate either genuine happiness or condescension. As a general rule, the better and more specific researchers’ operational definitions, the more valid their measures will be (Chapter 2).
Reliability in this context means making sure data are collected in a consistent way. If research involves more than one observer using the same system, their data should look roughly the same (i.e., interrater reliability). This reliability is accomplished in part by making the observation task simple and straightforward—for example, having trained assistants use a checklist to record behaviors rather than depending on open-ended notes. The other key to improving reliability is careful training of the observers, giving them detailed instructions and ample opportunities to practice the rating system.
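To make the idea of interrater reliability concrete, the Python sketch below compares the codes assigned by two observers to the same ten observation intervals. It computes simple percent agreement and also Cohen’s kappa, a chance-corrected index that this chapter does not discuss and that is included here only as one common option. The behavior categories, the observer data, and the function names are hypothetical.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of observations that two observers coded identically."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both observers must code the same non-empty set of observations.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of each observer's category proportions, summed.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both observers used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical checklist codes for the same ten intervals.
observer_1 = ["smile", "smile", "sigh", "neutral", "sigh",
              "smile", "neutral", "neutral", "sigh", "smile"]
observer_2 = ["smile", "neutral", "sigh", "neutral", "sigh",
              "smile", "neutral", "sigh", "sigh", "smile"]

agreement = percent_agreement(observer_1, observer_2)
kappa = cohens_kappa(observer_1, observer_2)
print(f"Percent agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

With these invented codes, the observers agree on 8 of 10 intervals (0.80), and kappa is about 0.70. Kappa is lower because some matches would be expected even if both observers assigned codes at random.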
To explain how all of this comes together, we will explore a pair of examples, from research question to data collection.
Example 1—Theater Restroom Usage
First, imagine, for the sake of this example, that someone is interested in whether people are more likely to use the restroom before or after watching a movie. Such a research question could provide valuable information for theater owners in planning employee schedules (i.e., when are bathrooms most likely to need cleaning). Thus, studying patterns of human behavior results in valuable applied knowledge.
The first step is to develop a specific, testable, and observable hypothesis. In this case, we might predict that people are more likely to use the restroom after the movie, as a result of consuming those 64-ounce sodas during the movie. Just for fun, we will also compare the restroom usage of men and women. Perhaps men are more likely to wait until after the movie, whereas women are just as likely to go before as after? This pattern of data might look something like the percentages in Table 3.1. That is, men make 80% of their restroom visits after the movie and 20% before the movie, while women make about 50% of their restroom visits at each time.
Table 3.1: Hypothesized restroom visits
Gender          Men     Women
Before movie    20%     50%
After movie     80%     50%
Total           100%    100%
The next step is to decide on the best sampling strategy to test this hypothesis. Of the three sampling strategies discussed—individual, event, and time—which one seems most relevant here? The best option would probably be time sampling because the hypothesis involves comparing the number of restroom visitors in two time periods (before versus after the movie). So, in this case, we would need to define a time interval for collecting data. We could limit our observations to the 10 minutes before the previews begin and the 10 minutes after the credits end. The potential problem here, of course, is that some people might use either the previews or the end credits as a chance to use the restroom. Another complication arises in trying to determine which movie people are watching; in a giant multiplex theater, movies start just as others are finishing. One possible solution, then, would be to narrow the sample to movie theaters that show only one movie at a time and to define the sampling times based on the actual movie start- and end-times.
Having determined a sampling strategy, the next step is to identify the types of behaviors we want to record. This particular hypothesis poses a challenge because it deals with a rather private behavior. To faithfully record people “using the restroom,” we would need to station researchers in both men’s and women’s restrooms to verify that people actually, well, “use” the restroom while they are in it. However, this strategy poses the potential downside that the researcher’s presence (standing in the corner of the restroom) will affect people’s behavior. Another, less intrusive option would be to stand outside the restroom and simply count “the number of people who enter.” The downside to that, of course, is that we technically do not know why people are going into the restroom. But sometimes research involves making these sorts of compromises—in this case, we chose to sacrifice a bit of precision in favor of a less-intrusive measurement. This compromise would also serve to reduce ethical issues with observing people in the restroom.
So, in sum, we started with the hypothesis that men are more likely to use the restroom after a movie, while women use the restroom equally before and after. We then decided that the best sampling strategy would be to identify a movie theater showing only one movie and to sample from the 10-minute periods before and after the actual movie’s running time. Finally, we decided that the best strategy for recording behavior would be to station observers outside the restrooms and count the number of people who enter. Now, say we conduct these observations every evening for one week and collect the data in Table 3.2.
Table 3.2: Findings from observing restroom visits
Time of visit     Men            Women
Before movie      75 (25%)       300 (60%)
After movie       225 (75%)      200 (40%)
Total             300 (100%)     500 (100%)
Notice that more women (N = 500) than men (N = 300) entered the restroom during our week of sampling. The real test of our hypothesis, however, comes from examining the percentages within gender groups. That is, of the 300 men who went into the restroom, what percentage of them did so before the movie and what percentage of them did so after the movie? In this dataset, women used the restroom with relatively equal frequency before (60%) and after (40%) the movie. Men, in contrast, were three times as likely to use the restroom after the movie (75%) as before it (25%). In other words, our hypothesis appears to be confirmed by examining these percentages.
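The within-group percentages in Table 3.2 can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original study materials; the dictionary layout and the function name `within_group_percentages` are illustrative assumptions, while the counts come directly from the table.

```python
# Counts of restroom entries from Table 3.2 (one week of observations).
counts = {
    "men": {"before": 75, "after": 225},
    "women": {"before": 300, "after": 200},
}


def within_group_percentages(group_counts):
    """Convert raw counts for one group into percentages of that group's total."""
    total = sum(group_counts.values())
    return {timing: 100 * n / total for timing, n in group_counts.items()}


percentages = {group: within_group_percentages(c) for group, c in counts.items()}

for group, pct in percentages.items():
    total = sum(counts[group].values())
    print(f"{group} (N = {total}): before {pct['before']:.0f}%, after {pct['after']:.0f}%")

# How many times more often men entered after the movie than before it.
men_ratio = counts["men"]["after"] / counts["men"]["before"]
print(f"men, after-to-before ratio: {men_ratio:.1f}")
```

Dividing by each group's own total, rather than by the overall total of 800, is what allows the two groups to be compared even though more women than men were observed.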
Example 2—Cell Phone Usage While Driving
Imagine that we are interested in patterns of cell phone usage among drivers. Several recent studies have reported that drivers using cell phones are as impaired as drunk drivers, making this an important public safety issue. Thus, if we could understand the contexts in which people are most likely to use cell phones, it would provide valuable information for developing guidelines for safe and legal use of these devices. So, this study might count the number of drivers using cell phones in two settings: while navigating rush-hour traffic and while moving on the freeway.
The first step is to develop a specific, testable, and observable hypothesis. In this case, we might predict that people are more likely to use cell phones when they are bored in the car. So, we hypothesize that we will see more drivers using cell phones while stuck in rush-hour traffic than while moving on the freeway.
The next step is to decide on the best sampling strategy to test this hypothesis. Of the three sampling strategies discussed—individual, event, and time—which one seems most relevant here? The best option would probably be individual sampling because we are interested in the cell phone usage of individual drivers. That is, for each individual car we see during the observation period, we want to know whether the driver is using a cell phone. One strategy for collecting these observations would be to station observers along a fast-moving stretch of freeway, as well as along a stretch of road that is clogged during rush hour. These observers would keep a record of each passing car and note whether the driver is on the phone.
After selecting a sampling strategy, we next must decide the types of behaviors to record. One challenge this study presents is how broadly to define cell phone usage. Should we include both talking and text messaging? Given our interest in distraction and public safety, we probably want to include text messaging. Several states have recently banned this practice while driving, often in response to tragic accidents. Because we will be observing moving vehicles, the most reliable approach might be to simply note whether drivers have a cell phone in their hand. As with the restroom study, we sacrifice a little bit of precision (i.e., knowing what the driver is using the cell phone for) to capture behaviors that are easier to record.
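To make the recording scheme concrete, here is a minimal Python sketch of how an observer's coding sheet might be tallied. Each record notes only the setting and whether the driver had a phone in hand. The records below are invented for illustration and are not the study's data; the names `observations` and `tally` are likewise hypothetical.

```python
from collections import Counter

# Hypothetical coding-sheet entries: (setting, phone_in_hand) for each passing car.
# These records are invented for illustration and are not the study's data.
observations = [
    ("rush_hour", True),
    ("rush_hour", False),
    ("rush_hour", False),
    ("freeway", True),
    ("freeway", True),
    ("freeway", False),
]


def tally(records):
    """Count cars in each (setting, phone_in_hand) cell."""
    return Counter(records)


cell_counts = tally(observations)
for setting in ("rush_hour", "freeway"):
    with_phone = cell_counts[(setting, True)]
    without_phone = cell_counts[(setting, False)]
    print(f"{setting}: {with_phone} with phone, {without_phone} without")
```

Coding a single yes/no behavior per car keeps the observer's task simple, which is the trade-off between precision and ease of recording described above.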
To sum up, we started with the hypothesis that drivers would be more likely to use cell phones when stuck in traffic. We then decided that the best sampling strategy would be to station observers along two stretches of road, where they would note whether drivers were using cell phones. Finally, we decided that cell phone usage would be defined as a driver holding a cell phone. Now, suppose we conducted these observations over a 24-hour period and collected the data in Table 3.3.
Table 3.3: Findings from observing cell phone usage
Phone use         Rush Hour      Highway
Cell Phone        30 (30%)       200 (67%)
No Cell Phone     70 (70%)       100 (33%)
Total             100 (100%)     300 (100%)
The results show that more cars passed by on the highway (N = 300) than on the street during the rush-hour stretch (N = 100). The real test of our hypothesis, though, comes from examining the percentages within each stretch. That is, of the 100 people observed during rush hour and the 300 observed on the highway, what percentage was using cell phones? In this data set, 30% of those in rush hour were using cell phones, compared with 67% of those on the highway. In other words, the data did not confirm our hypothesis. Drivers in rush hour were less than half as likely to be using cell phones. The next step in this research program would be to speculate on the reasons the data contradicted the hypothesis.
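The comparison can be written out as a short calculation. This Python sketch uses the counts from Table 3.3; the variable names are illustrative assumptions, and the check looks only at the direction of the difference, not at whether it is statistically reliable.

```python
# Counts from Table 3.3.
phone_users = {"rush_hour": 30, "highway": 200}
drivers_observed = {"rush_hour": 100, "highway": 300}

# Proportion of observed drivers holding a phone in each setting.
rates = {s: phone_users[s] / drivers_observed[s] for s in drivers_observed}

# The hypothesis predicted a higher rate in rush hour than on the highway.
hypothesis_supported = rates["rush_hour"] > rates["highway"]

# Rush-hour rate expressed relative to the highway rate.
relative_rate = rates["rush_hour"] / rates["highway"]

print(f"rush hour: {rates['rush_hour']:.0%}, highway: {rates['highway']:.0%}")
print(f"hypothesis supported: {hypothesis_supported}")
print(f"rush-hour rate relative to highway rate: {relative_rate:.2f}")
```

A relative rate below 0.5 corresponds to the statement that rush-hour drivers were less than half as likely to be using cell phones.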
Qualitative versus Quantitative Approaches
The general method of observation lends itself equally well to qualitative and quantitative approaches, although some types of observation fit one approach better than the other. For example, structured observation tends to focus on hypothesis testing and quantification of responses. In Mary Ainsworth’s (1970) “strange situation” research (described previously), the primary goal was to expose children to a predetermined script of events and to test hypotheses about how children with secure and insecure attachments would respond to these events. In contrast, naturalistic observation (and, to a greater extent, participant observation) tends to focus on learning from events as they occur naturally. In Leon Festinger’s “doomsday cult” study, the researchers joined the group to observe the ways members reacted when their prophecy failed to come true. Margaret Mead (1928) spent several months living with Samoan adolescents to understand social norms around coming of age.
Research: Thinking Critically
“Irritable Heart” Syndrome in Civil War Veterans
Follow the link below to an article by science writer and editor K. Kris Hirst. In this article, Hirst reviews compelling research from health psychologist Roxanne Cohen Silver and her colleagues at the University of California, Irvine. Cohen Silver and her colleagues reviewed the service records of 15,027 Civil War veterans, finding an astounding rate of mental illness long before post-traumatic stress disorder was recognized. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.
Think about it:
1. What hypotheses are the researchers testing in this study?
2. How did the researchers quantify trauma experienced by Civil War soldiers? Do you think this is a valid way to operationalize trauma? Explain why or why not.
3. Would this research be best described as case studies, archival research, or naturalistic observation? Does the study involve elements of more than one type? Explain.
By the end of this chapter, you should be able to:
Describe the distinguishing features of survey research.
Outline best practices for designing questionnaires to ensure quality responses.
Explain the reasons for sampling from the population.
Distinguish the different types of sampling strategies.
Explain the logic behind common approaches to analyzing survey data.
In a highly influential book published in the 1960s, the sociologist Erving Goffman (1963) defined stigma as an unusual characteristic that triggers a negative evaluation. In his words, “the stigmatized person is one who is reduced in our minds from a whole and usual person to a tainted, discounted one” (p. 3). People’s beliefs about stigmatized characteristics exist largely in the eye of the beholder, but have substantial influence on social interactions with the stigmatized (see Snyder, Tanke, & Berscheid, 1977). A large research tradition in psychology has been devoted to understanding both the origins of stigma and the consequences of being stigmatized. According to Goffman and others, the characteristics associated with the greatest degree of stigma have three features in common: they are highly visible, they are perceived as controllable, and they are misunderstood by the public.
4 Survey Designs—Predicting Behavior
Recently, researchers have taken considerable interest in people’s attitudes toward members of the gay and lesbian community. Although these attitudes have become more positive over time, this group still encounters harassment and other forms of discrimination on a regular basis (see Almeida, Johnson, Corliss, Molnar, & Azrael, 2009; Berrill, 1990). One of the top recognized experts on this subject is Gregory Herek, professor of psychology at the University of California at Davis (http://psychology.ucdavis.edu/herek/). In a 1988 article, Herek conducted a survey of heterosexuals’ attitudes toward both lesbians and gay men, with the goal of understanding the predictors of negative attitudes. Herek approached this research question by constructing a questionnaire to measure people’s attitudes toward these groups. In three studies, participants were asked to complete this attitude measure, along with other existing scales assessing attitudes about gender roles, religion, and traditional ideologies.
Herek’s (1988) research revealed that, as hypothesized, heterosexual males tended to hold more negative attitudes about gay men and lesbians than did heterosexual females. However, the same psychological mechanisms seemed to explain the prejudice in both genders. That is, negative attitudes toward gays and lesbians were associated with increased religiosity, more traditional beliefs about family and gender, and fewer experiences actually interacting with gay men and lesbians. These associations meant that Herek could predict people’s attitudes toward gay men and lesbians based on knowing their views about family, gender, and religion, as well as their past interactions with the stigmatized group. In this paper, Herek’s primary contribution to the literature was the insight that reducing stigma toward gay men and lesbians “may require confronting deeply held, socially reinforced values” (1988, p. 473). This insight was only possible because people were asked to report these values directly.
This chapter continues along the continuum of control, moving on to survey research, in which the primary goal is either describing or predicting attitudes and behavior. For our purposes, survey research refers to any method that relies on people’s direct reports of their own attitudes, feelings, and behaviors. So, for example, in Herek’s (1988) study, the participants reported their attitudes toward lesbians and gay men, rather than these attitudes being somehow directly observed by the researchers. Compared to the descriptive designs we discussed in Chapter 3, survey designs tend to have more control over both data collection and question content. Thus, survey research falls somewhere between purely descriptive research (Chapter 3) and the explanatory power of experimental designs (Chapter 5). This chapter provides an overview of survey research from conceptualization through analysis. It will discuss the types of research questions that are best suited to survey research and provide an overview of the decisions to consider in designing and conducting a survey study. We will then cover the process of data collection, with a focus on selecting the people who will complete surveys. Finally, the chapter will describe the three most common approaches for analyzing survey data.
Research: Making an Impact
Alfred Kinsey’s research on human sexuality is an example of social research that changed the way society thought about a complex issue: in this case, ideas about “normal” sexual behavior. Kinsey’s research, particularly two books on male and female sexuality known together as the Kinsey Reports, illuminated the discrepancies between the assumptions made by a “moral public” and the actual behavior of individuals. His shift in the approach to studying sex (applying scientific methods and reasoning rather than basing conclusions on medical speculation and dogmatic opinions) changed the nature of sex research and the general public’s view of sex for decades to come.
Kinsey’s major contribution was in challenging the prevailing assumptions about sexual activity in the United States and obtaining descriptive data from both men and women that described their own sexual practices (Bullough, 1998). By collecting actual data instead of relying on speculation, Kinsey made the study of sexuality more scientifically based. The results of his surveys revealed a variety of sexual behaviors that shocked many members of society and redefined the sexual morality of modern America.
Until Kinsey’s research, the general, Victorian viewpoint was that women should not show any interest in sex and should submit to their husbands without any sign of pleasure (Davis, 1929). Kinsey’s data challenged the prevailing assumption that women were asexual. His studies revealed that 25% of the women studied had experienced an orgasm by the age of 15 and more than half by the age of 20 (Kinsey, Pomeroy, Martin, & Gebhard, 1953). Eventually, these results were bundled into the various elements that fueled the women’s movement of the 1960s and encouraged further examination of female sexuality (Bullough, 1998).
Kinsey’s data also contributed to the budding gay and lesbian liberation movement. Until the Kinsey Reports, studies of human sexuality were based on the assumption that homosexuals were mentally ill (Bullough, 1998). When Kinsey’s data revealed that many males and females practiced homosexuality to some degree, he suggested that sexuality was more of a continuum than a series of categories into which people fit. In addition, the Kinsey Reports revealed that the number of extramarital relationships people were having was higher than most expected. Forty percent of married American males reported having an extramarital relationship (Kinsey et al., 1953).
These ideas, though controversial, prompted society to take a realistic look at the actual sexual practices of its members. The topic of sexuality became less dogmatic as society became more open about sexual activities and preferences.
Kinsey’s data not only encouraged social change but also revolutionized the way in which scientists study sexuality. By examining data and studying sex from an unbiased standpoint, Kinsey successfully transformed the study of human sexuality into a science. His research not only changed our way of studying sexual behavior but also allowed society to become less restrictive in its expectations of “normal” sexual behavior.
Think About It
1. What type of data formed the basis of Kinsey’s reports? What are the pros and cons of this type?
2. How did applying the scientific method change the national conversation about sexuality?
Surveys are used to describe or predict attitudes and behavior.
4.1 Introduction to Survey Research
Whether we are aware of it or not, most of us encounter survey research throughout our lives. Every time we decide to answer that call from an unknown number, and the person on the other end of the line insists on knowing our household income and favorite brand of laundry detergent, we are helping to conduct survey research. When news programs try to predict the winner of an election two weeks early, these reports are based on survey research of eligible voters. In both cases, the researcher is trying to make predictions about the products people buy or the candidates they will elect based on what people say about their own attitudes, feelings, and behaviors.
Surveys can be used in a variety of contexts and are most appropriate for questions that involve people describing their attitudes, their behaviors, or a combination of the two. For example, if we want to examine the predictors of attitudes toward the death penalty, we could ask people their opinions on this topic and also ask them about their political party affiliation. Based on these responses, we could test whether political affiliation predicted attitudes toward the death penalty. Or, imagine we want to know whether students who spend more time studying are more likely to do well on their exams. This question could be answered using a survey that asked students about their study habits and then tracked their exam grades. We will return to this example near the end of the chapter, as we discuss the process of analyzing survey data to test our hypotheses about predictions.
The common thread of these two examples is that they require people to report either their thoughts (e.g., opinions about the death penalty) or their behaviors (e.g., the hours they spend studying). Contrast these with an example that might be a poor fit for survey research: If a researcher wanted to test whether a new drug led to increased risk of developing blood clots, it would be much safer to test for these clots using medical technology, rather than asking people for their beliefs (“on a scale from 1 to 5, how many clots have you developed this week?”). Thus, when deciding whether a survey is the best fit for a research question, a researcher must consider whether people will be both able and willing to report the opinions or behaviors accurately. The next section expands on both of these issues.
Distinguishing Features of Surveys
Survey research designs have three distinguishing features that set them apart from other designs. First, all survey research relies on either written or verbal self-reports of people’s attitudes, feelings, and behaviors. This self-reporting means that researchers will ask participants a series of questions and record their responses. The approach has several advantages, including being relatively straightforward and allowing a degree of access to psychological processes (e.g., “Why do you support candidate X?”). However, researchers should also be cautious in their interpretation of self-report data because participants’ responses often reflect a combination of their true attitude and concern over how this attitude will be perceived. Scientists refer to this concern as social desirability, which means that people may be reluctant to report unpopular attitudes. For example, if we were to ask people their attitudes about different racial groups, their answers might reflect both their true attitude and their desire not to appear racist. We return to the issue of social desirability later in this chapter and discuss some tactics for designing questions that can help to sidestep these concerns and capture respondents’ true attitudes.
The second distinguishing feature of survey research is its ability to access internal states that cannot be measured through direct observation. The discussion of observational designs in Chapter 3 explained that one limitation of these designs was a lack of insight into why people behave the way they do. Survey research can address this limitation directly: By asking people what they think, how they feel, and why they behave in certain ways, researchers come closer to capturing the underlying psychological processes. However, people’s reports of their internal states should be taken with a grain of salt, for two reasons. First, as mentioned, these reports may be biased by social-desirability concerns, particularly when unpopular attitudes are involved. Second, a large body of literature in social psychology suggests that people may not understand the true reasons for their behavior. In an influential review paper, psychologists Richard Nisbett and Tim Wilson (1977) argued that we make poor guesses after the fact about why we do things, based more on our assumptions than on any real introspection. Thus, survey questions can provide access to internal states, but researchers should always interpret responses with caution.
Third, on a more practical note, survey research allows us to collect large amounts of data with relatively little effort and few resources. Many of the descriptive designs Chapter 3 discussed require observing one person at a time, and the same will hold true when Chapter 5 explores experimental designs. Survey-research designs stand out as the most efficient, because surveys can be distributed to large groups of people simultaneously. Still, their actual efficiency depends on the decisions researchers make during the design process. In reality, efficiency is often in a delicate balance with the accuracy and completeness of the data.
Broadly speaking, survey research can be conducted using either verbal or written self-reports (or a combination of the two). Before diving into the details of writing and formatting a survey, we need to understand the pros and cons of administering a survey as an interview (i.e., a verbal survey) or a questionnaire (i.e., a written survey).
An interview involves a verbal question-and-answer exchange between the researcher and the participant. This verbal exchange can take place either face-to-face or over the phone. So, our earlier telemarketer example represents an interview because the questions are asked verbally via phone. Likewise, if we are approached in a shopping mall and asked to answer questions about our favorite products, we experience a survey in interview form because the questions are administered verbally face-to-face. And, if a person has ever participated in a focus group, during which a group of people gives their reactions to a new product, the researchers are essentially conducting an interview with the group.
Interview Schedules
Regardless of how the interview is administered, the interviewer (i.e., the researcher) has a predetermined plan, or script, for how the interview should go. This plan is known as an interview schedule. When conducting an interview, including those telemarketing calls, the researcher/interviewer has a detailed plan for the order of questions to be asked, along with follow-up questions that depend on the participant’s responses.
Broadly speaking, researchers employ two types of interview schedules. A linear (also called “structured”) schedule will ask the same questions, in the same order, for all participants. In contrast, a branching schedule unfolds more like a flowchart, with the next question dependent on participants’ answers. Interviewers typically use a branching schedule in cases with follow-up questions that only make sense for some of the participants. For example, a researcher might first ask people whether they have children; if they answer “yes,” the interviewer might then follow up by asking how many.
One danger in using a branching schedule is that it is based partly on the researcher’s assumptions about the relationships between variables. Granted, to ask only people with children to indicate how many they have is fairly uncontroversial. Imagine the following scenario, however. Say we first ask participants for their household income, and then ask about their political donations:
“How much money do you make? $18,000? OK, how likely are you to donate to the Democratic Party?”
“How much money do you make? $250,000? OK, how likely are you to donate money to the Republican Party?”
The way these questions branch implicitly assumes that wealthier people are more likely to be Republicans, and less wealthy people are more likely to be Democrats. The data might support this assumption or they might not. By planning the follow-up questions in this way, though, we are unable to capture cases that do not fit our stereotypes (i.e., the wealthy Democrats and the poor Republicans). Researchers must therefore be careful about letting their biases shape the data-collection process.
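The flowchart-like logic of a branching schedule can be sketched in a few lines of Python. This is one possible representation under stated assumptions, not a standard tool: the question wording, the keys, and the function `run_interview` are all hypothetical, and the answers are scripted so the sketch runs without live input.

```python
# A minimal branching interview schedule: each question names the next
# question to ask for each possible answer. All wording and keys are hypothetical.
SCHEDULE = {
    "has_children": {
        "text": "Do you have children?",
        "next": {"yes": "num_children", "no": "end"},
    },
    "num_children": {
        "text": "How many children do you have?",
        "next": {},  # no branching here: any answer ends the interview
    },
}


def run_interview(schedule, start, answers):
    """Walk the schedule and return the (question_id, answer) pairs actually asked.

    `answers` maps question ids to scripted responses.
    """
    asked = []
    current = start
    while current != "end" and current in schedule:
        answer = answers[current]
        asked.append((current, answer))
        # Follow the branch for this answer; stop if no branch is defined.
        current = schedule[current]["next"].get(answer, "end")
    return asked


parent = run_interview(
    SCHEDULE, "has_children", {"has_children": "yes", "num_children": "2"}
)
non_parent = run_interview(SCHEDULE, "has_children", {"has_children": "no"})
print(parent)
print(non_parent)
```

A linear schedule is the special case in which every answer leads to the same next question. Writing the branches out explicitly, as in the `next` mappings here, also makes it easier to spot a branch that encodes an untested assumption, such as routing respondents to different donation questions based on income.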
Advantages and Disadvantages of Interviews
Interviews offer a number of advantages over written surveys. For one, people are often more motivated to talk than they are to write. Consider the example of an actual undergraduate research assistant who was dispatched to a local shopping mall to interview people about their experiences in romantic relationships. He had no trouble at all recruiting participants, many of whom would go on and on (and on, and on) about recent relationships; one woman even confided to him that she had just left an abusive spouse earlier that week. For better or for worse, these experiences would have been more difficult to capture in writing.
Related to this benefit, people’s oral responses are typically richer and more detailed than their written responses. Think of the difference between asking someone to “describe your views on gun control” and asking someone to “indicate on a scale of 1 to 7 the degree to which you support gun control.” The former is more likely to capture the richness and subtlety involved in people’s attitudes about guns. On a practical note, an interview format also allows the researcher to ensure that respondents understand the questions. Poorly worded written-questionnaire items force survey participants to guess at the researcher’s meaning, and these guesses introduce a large source of error variance. On the other hand, if an interview question is poorly asked, people can easily ask the interviewer to clarify. Finally, using an interview format allows researchers to reach a broader cross-section of people and to include those who are unable to read and write (or, perhaps, unable to read and write the language of the survey).
Interviews also have two clear disadvantages compared to written surveys. First, interviews cost more in terms of both time and money. It took more time for the research assistant to go to a shopping mall than it would have taken to mail out packets of surveys (but no more money, since research-assistant positions tend to be unpaid). Second, the interview format allows many opportunities for interviewers to pass on their personal biases. These biases are unlikely to be deliberate, but participants can often pick up on body language and subtle facial expressions when the interviewer disagrees with their answers. Such cues may influence them to shape their responses to make the interviewer happier. The best way to understand the pros and cons of interviewing is to recognize that both are a consequence of personal interaction. The interaction between interviewer and interviewee allows for richer responses but also the potential for these responses to be biased. Researchers must weigh these pros and cons and decide which method is the best fit for their survey. The next section turns to the process of administering surveys in writing.
One additional problem with interviews is the increasing difficulty of obtaining representative samples for interviews over the telephone due to low or declining use of landline phones, coupled with the use of unlisted numbers and call-screening devices. In the United States, the Pew Research Center (2012) reports that the overall response rate (the ratio of completed interviews to the number of phone numbers dialed) was just 9% in 2012, one-fourth of the 36% level from 1997. Thus, significant differences may exist between people who elect to respond to phone surveys and those who do not.
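The response-rate definition above is a simple ratio, shown here as a minimal Python sketch. The function name and the call volumes are hypothetical; the volumes were chosen only so that the results match the reported 36% and 9% figures.

```python
def response_rate(completed, dialed):
    """Completed interviews divided by phone numbers dialed."""
    if dialed <= 0:
        raise ValueError("dialed must be positive")
    return completed / dialed


# Hypothetical call volumes chosen to reproduce the reported 36% and 9% rates.
rate_1997 = response_rate(completed=360, dialed=1000)
rate_2012 = response_rate(completed=90, dialed=1000)

print(f"1997: {rate_1997:.0%}, 2012: {rate_2012:.0%}")
print(f"2012 rate as a share of the 1997 rate: {rate_2012 / rate_1997:.2f}")
```

Under these assumed volumes, the same 1,000 dialed numbers would have produced 360 completed interviews in 1997 and 90 in 2012.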
A questionnaire is a survey that involves a written question-and-answer exchange between the researcher and the participant. The exchange is a bit different from interview formats—in this case, the questions are designed ahead of time, then distributed to participants, who write their responses and return the questionnaire to the researcher. The next section discusses details for designing these questions. First, however, we will take a quick look at the process of administering written surveys.
Questionnaires can be distributed in three primary ways, each with its own pattern of advantages and disadvantages:
Distributing by mail: Until recently, researchers commonly distributed surveys by sending paper copies through the mail to a group of participants (see the section on “Sampling” for more discussion on how this group is selected). Mailing surveys is relatively cheap and relatively easy to do, but it is unfortunately one of the worst methods in terms of response rates. People tend to ignore questionnaires that they receive in the mail, dismissing them as one more piece of junk. Researchers have a few methods available for increasing response rates, including providing incentives, making the survey interesting, and making it as easy as possible to return the results (e.g., with a postage-paid envelope). However, even using all of these tactics, researchers consider themselves extremely lucky to obtain a 30% response rate from a mail survey. That means a researcher who mails 1,000 surveys will be doing well to receive 300 back. More typical response rates for mail surveys can be in the single digits. Because of this low return on investment, researchers have begun relying on other methods for their written surveys.
Distributing in person: Another option for researchers is to distribute a written survey in person, simply handing out copies and asking participants to fill them out on the spot. This method is certainly more time-consuming; a researcher has to be stationed for long periods of time to collect data. In addition, people are less likely to answer the questions honestly because the presence of a researcher makes them worry about social desirability. Last, the sample for this method is limited to people who are in the physical area at the time that questionnaires are being distributed. As the chapter discusses later, this limitation might lead to problems in the composition of the sample. On the plus side, however, this method tends to result in higher compliance rates because people find it harder to say no to someone face-to-face than to ignore a piece of mail.
Distributing online: During the last two decades, online surveys have become the dominant method of data collection, for both market research and academic research. Online distribution involves posting a questionnaire on a web page, and then directing participants to this web page to complete the questionnaire. Online surveys offer many benefits over other forms of data collection, including the ability to present audio and visual stimuli, randomize the order of questions, and implement complex branching logic (e.g., asking people to evaluate local grocery stores depending on where they live).
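As one illustration of the randomization that online surveys make possible, the following Python sketch shuffles question order separately for each participant. The question wording and the function `randomized_order` are hypothetical, and seeding by participant ID is just one assumed way to make each participant's order reproducible for later analysis.

```python
import random

# Hypothetical questionnaire items.
QUESTIONS = [
    "How satisfied are you with your local grocery store?",
    "How often do you shop online?",
    "How likely are you to recommend this store to a friend?",
]


def randomized_order(questions, participant_id):
    """Return the questions in a shuffled order that is repeatable per participant."""
    rng = random.Random(participant_id)  # seeding by id makes each order repeatable
    shuffled = list(questions)           # copy so the master list stays untouched
    rng.shuffle(shuffled)
    return shuffled


for participant_id in (1, 2):
    print(participant_id, randomized_order(QUESTIONS, participant_id))
```

Every participant still answers the same set of questions; only the order in which they appear varies.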
Most recently, researchers have begun exploring the best ways to design surveys for mobile devices. According to a report from the International Telecommunication Union, in 2013, 6.8 billion mobile phones were in use, compared to a world population of 7.2 billion. In 2012, 44% of Americans slept next to their phones (Pew Research Center, 2012). Not surprisingly, consensus in the market research industry is that approximately 20–30% of online surveys are actually completed on a mobile device (Poynter, Williams, & York, 2014). Why does this matter? People take surveys on their smartphones because it is convenient (or, in some cases, because it is their only Internet device). However, despite recent exponential advancement, mobile phones still have smaller screens, less functional keyboards, and less predictability in displaying images and videos. (Imagine someone being asked to view a set of two-minute-long advertisements on an iPhone while trying to complete a survey before a doctor’s appointment.) Researchers do have ways to make this experience more pleasant for respondents and consequently to increase the quality of data obtained. For example, mobile surveys work best when they are shorter overall, when the question text is short and straightforward, and when response scales (discussed below) are kept at five points (see Poynter et al., 2014, for a review). The latter point is a direct result of small screen size: Longer response scales require respondents to scroll back and forth on their screens to see the entire scale. Unfortunately, but understandably, some applied research suggests that people tend to ignore the scale points that they cannot see, perhaps using only four points out of a ten-point scale.
Because these methods are relatively new, the jury is still out on whether online and mobile distribution results in biased samples or biased responses. However, it is worth keeping in mind that approximately 13% of the U.S. population does not have Internet access (Internet Users by Country, 2014). This group is disproportionately older (65+) and represents the lowest income and least educated segments of the population. Thus, if research questions involve reaching these groups, it is necessary to supplement online surveys with other distribution methods. For readers interested in more information on designing and conducting Internet research, Sam Gosling and John Johnson’s (2010) book provides an excellent resource. In addition, several groups of psychological researchers have been attempting to understand the psychology of Internet users (read about recent studies on this website: http://www.spring.org.uk/2010/10/internet-psychology.php).
Advantages and Disadvantages of Questionnaires
Just as interview methods do, written questionnaires claim their own set of advantages and disadvantages. Written surveys allow researchers to collect large amounts of data with little cost or effort, and they can offer a greater degree of anonymity than interviews. Anonymity can be a particular advantage in dealing with sensitive or potentially embarrassing topics. That is, people may be more willing to answer a questionnaire about their alcohol use or their sexual history than they would be to discuss these things face-to-face with an interviewer. On the downside, written surveys miss out on one advantage of interviews because no one is available to clarify confusing questions. Fortunately, researchers have one relatively easy way of minimizing this problem: make survey questions as clear as possible. The next section explains the process of questionnaire design.
4.2 Questionnaire Design
One of the most important steps in conducting survey research is deciding how to construct and assemble the questionnaire items. In some cases, a researcher will be able to answer research questions using questionnaires that other researchers have already developed. For example, quite a bit of psychology research uses standard scales that measure self-esteem, prejudice, depression, or stress levels. The advantage of these ready-made measures is that other people have already gone to the trouble of making sure they are valid and reliable. So, someone interested in the relationship between stress and depression could distribute the Perceived Stress Scale (Cohen, Kamarck, & Mermelstein, 1983) and the Beck Depression Inventory (Beck, Steer, Ball, & Ranieri, 1996) to a group of participants and move on more quickly to the fun part, the data analysis.
However, in many cases, no perfect measure exists for a research question, either because no one has studied the topic before or because the current measures are all flawed in some way. When this happens, researchers need to go through the process of designing their own questions. This section discusses strategies for writing questions and choosing the most appropriate response format.
Five Rules for Better Questions
Each of the rules listed below is designed to make research questions as clear and easy to understand as possible so as to minimize the potential for error variance. We discuss each rule below and illustrate it with contrasting pairs of items: “bad” items that do not follow the rule and “better” items that do.
1. Use simple language. One of the simplest and most important rules to keep in mind is that people have to be able to understand the survey questions. This means avoiding jargon and specialized language whenever possible.
BAD: “Have you ever had an STD?”
BETTER: “Have you ever had a sexually transmitted disease?”
BAD: “What is your opinion of the S-CHIP program?”
BETTER: “What is your opinion of the State Children’s Health Insurance Program?”
It is also a good idea to simplify the language as much as possible, so that people spend time answering the question rather than trying to decode its meaning. For example, words like assist and consider can be replaced with simpler words like help and think. This may seem odd—or perhaps even condescending to participants—but it is always better to err on the side of simplicity. Remember, when people are forced to guess at the meaning of questions, these guesses add error variance to their answers.
2. Be precise. Another way to ensure that people understand the question is to be as precise as possible with wording. Ambiguously (or vaguely) worded questions will introduce an extra source of error variance into the data because people may interpret these questions in varying ways.
BAD: “What drugs do you take?” (Legal drugs? Illegal drugs? Now? In college?)
BETTER: “What prescription drugs are you currently taking?”
BAD: “Do you like sports?” (Playing? Watching? Which sports?)
BETTER: “How much do you enjoy watching basketball on television?”
3. Use neutral language. Questions should be designed to measure participants’ attitudes, feelings, or behaviors rather than to manipulate these things. That is, avoid leading questions that are written in such a way that they suggest an answer.
BAD: “Do you beat your children?” (Who would say yes?)
BETTER: “Is it acceptable to use physical forms of discipline?”
BAD: “Do you agree that the president is an idiot?”
BETTER: “How would you rate the president’s job performance?”
This guideline can be used to sidestep social desirability concerns. If the researcher suspects that people may be reluctant to report an attitude or a behavior (for example, using corporal punishment with their children), it helps to phrase the question in a nonthreatening way: “using physical forms of discipline” versus “beating your children.” Many current measures of prejudice adopt this technique. For example, McConahay’s (1986) “modern racism” scale contains items such as “Discrimination against Blacks is no longer a problem in the United States.” People who hold prejudicial attitudes are more likely to confess agreement with statements like this one than with blunter ones, like “I hate people from Group X.”
4. Ask one question at a time. One remarkably common error that people make in designing questions is to include a double-barreled question (one which asks more than one question at a time). A new-patient questionnaire at a doctor’s office often asks whether the patient suffers from “headaches and nausea.” What if an individual only suffers from one of these or has a lot of nausea and an occasional headache? The better approach is to ask about each of these symptoms separately.
BAD: “Do you suffer from pain and numbness?”
BETTER: “How often do you suffer from pain?” “How often do you suffer from numbness?”
BAD: “Do you like watching football and boxing?”
BETTER: “How much do you enjoy watching football?” “How much do you enjoy watching boxing?”
5. Avoid negations. One final and simple way to clarify questions is to avoid questions with negative statements because these can often be difficult to understand. The first example below may be a little silly, but the second comes from a real survey of voter opinion.
BAD: “Do you never not cheat on your exams?” (Wait, what? Do I cheat? Do I not cheat? What is this asking?)
BETTER: “Have you ever cheated on an exam?”
BAD: “Are you against rejecting the ban on pesticides?” (Wait, so, am I for the ban? Against the ban? What is this asking?)
BETTER: “Do you support the current ban on pesticides?”
This section discusses the issue of deciding how participants should respond to survey questions. The decisions researchers make at this stage will affect the type of data they ultimately collect, so it is important to choose carefully. This section reviews the primary decisions a researcher will need to make about response options, as well as the pros and cons of each one.
Thirty percent of participants selected the invention of computers as the most significant event of the past 50 years when presented with fixed-format responses, but when a different group was asked the same question in an open-ended format, only 20% listed the invention of computers.
One of the first choices to make is whether to collect open-ended or fixed-format responses. As the names imply, fixed-format responses require participants to choose from a list of options (e.g., “Choose your favorite color”), while open-ended responses ask participants to provide unstructured responses to a question or statement (e.g., “How do you feel about legalizing marijuana?”). Open-ended responses tend to be richer and more flexible but harder to translate into quantifiable data, analogous to the tradeoff we discussed in comparing written versus oral survey methods. To put it another way, some concepts are difficult to reduce to a seven-point fixed-format scale, but number ratings on these scales are easier to analyze than a paragraph of free-flowing text.
Another reason to think carefully about this decision is that fixed-format responses will, by definition, restrict people’s options in answering the question. In some cases, these restrictions can even act as leading questions. In a study of people’s perceptions of history, Dario Páez Rovira and his colleagues (Rovira, Deschamps, & Pennebaker, 2006) asked respondents to indicate the “most significant event over the last 50 years.” When this was asked in an open-ended way (i.e., “list the most significant event”), 2% of participants listed the invention of computers. Another version of the survey asked the question in a fixed-format way (i.e., “choose the most significant event”). When asked to select from a list of four options (World War II, invention of computers, Tiananmen Square, or man on the moon), 30% chose the invention of computers. In exchange for having easily coded data, the researchers accidentally forced participants into a smaller number of options. The result, in this case, was a distorted sense of the importance of computers in people’s perceptions of history.
Fixed-Format Options
Although fixed-format responses can sometimes constrain or skew participants’ answers, researchers tend to use them more often than not. This decision is largely practical; fixed-format responses allow for more efficient data collection from a much larger sample. (Imagine the chore of having to hand-code 2,000 essays.) But once researchers have decided on this option for the questionnaire, the decision process is far from over. In this section, we discuss three possibilities for constructing a fixed-format response scale.
True/false. One fixed-format option asks questions using a true/false format, which asks participants to indicate whether they endorse a statement. For example:
“I attended church last Sunday.” True False
“I am a U.S. citizen.” True False
“I am in favor of abortion.” True False
This last example may strike you as odd, and in fact it illustrates an important limitation in the use of true/false formats: They are best used for statements of facts rather than attitudes. It is relatively straightforward to answer whether we attended church or are a U.S. citizen. However, people’s attitudes toward abortion are often complicated: one might be “pro-choice” but still support some restrictions, or “pro-life” but support exceptions (e.g., in cases of rape). For most people, a true/false question cannot even come close to capturing the complexity of these beliefs. However, for survey items that involve simple statements of fact, the true/false format can be a good option.
Multiple choice. A second option uses a multiple-choice format, which asks participants to select from a set of predetermined responses.
“Which of the following is your favorite fast-food restaurant?”
a) McDonald’s b) Burger King c) Wendy’s d) Taco Bell
“Whom did you vote for in the 2012 presidential election?”
a) Mitt Romney b) Barack Obama
“How do you travel to work most days? (Select all that apply.)”
a) drive alone b) carpool c) public transportation
As these examples show, multiple-choice questions offer quite a bit of freedom in both the content and the response scaling of questions. A researcher can ask participants either to select one answer or, as in the last example, to select all applicable answers. A survey can cover everything from preferences (e.g., favorite fast-food restaurant) to behaviors (e.g., how people travel to work).
Multiple-choice formats do have a downside. Whenever the survey provides a set of responses, it restricts participants’ responses to that set. This is the problem that Rovira and colleagues (2006) encountered in asking people about the most significant events of the last 50 years. In each of the examples above, the categories fail to capture all possible responses. What if someone’s favorite restaurant is In-N-Out Burger? What if a respondent voted for Ralph Nader? What if a person telecommutes or bicycles to work? Researchers have two relatively easy ways to avoid (or at least minimize) this problem. First, when choosing the response options, plan carefully. During the design process, it helps to brainstorm with other people to ensure the survey is capturing the most likely range of responses. However, it is often impossible to provide every option that people might conceive. The second solution is to provide an “other” response to a multiple-choice question, which allows people to write in an option that the survey neglected to include. For example, our last question about traveling to work could be rewritten as:
“How do you travel to work on most days? (Select all that apply.)”
a) drive alone b) carpool c) public transportation d) other (please specify): __________________
This way, people who telecommute, or bicycle, or even ride their trained pony to work will have a way to respond rather than skipping the question. And, if researchers start to notice a pattern in these write-in responses (e.g., 20% of people added “bicycle”), then they have valuable knowledge to improve the next incarnation of the survey.
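Spotting a pattern in write-in responses is largely a counting exercise. The following is a minimal sketch, assuming the write-ins have been collected as a list of strings; the example answers and the helper `tally_write_ins` are hypothetical.

```python
from collections import Counter

# Hypothetical write-in answers to "other (please specify)".
write_ins = ["bicycle", "Bicycle ", "telecommute", "bicycle", "walk", "pony"]


def tally_write_ins(responses):
    """Count write-in answers after trimming whitespace and lowercasing."""
    cleaned = [r.strip().lower() for r in responses if r.strip()]
    return Counter(cleaned)


counts = tally_write_ins(write_ins)
top_answer, top_count = counts.most_common(1)[0]
share = top_count / len(write_ins)
print(f"Most common write-in: {top_answer} ({share:.0%} of write-ins)")
```

If one write-in accounts for a large share of the "other" responses, it is a strong candidate for its own response option in the next version of the survey.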
Rating scales. Last, but certainly not least, a third option uses a rating-scale format, which asks participants to respond on a scale representing a continuum.
“Sometimes it is necessary to sacrifice liberty in the name of security.”
1 (not at all necessary)   2   3   4   5 (very necessary)
“I would vote for a candidate who supported the death penalty.”
1 (not at all likely)   2   3   4   5 (very likely)
“The political party in power right now has really messed things up.”
1 (strongly disagree)   2   3   4   5 (strongly agree)
This format is well suited to capturing attitudes and opinions, and, indeed, is one of the most common approaches to attitude research. Rating scales are easy to score, and they give participants some flexibility in indicating their agreement with or endorsement of the questions. Researchers have two critical decisions to make about the construction of rating-scale items; both have implications for how they analyze and interpret results.
First, a researcher needs to decide the anchors, or labels, for the response scale. Rating scales offer a good deal of flexibility in these anchors, as the examples above demonstrate. A survey can frame questions in terms of “agreement” with a statement or “likelihood” of a behavior, or researchers can customize the anchors to match their questions (e.g., “not at all necessary”). Scales that use anchors of “strongly agree” and “strongly disagree” are also referred to as Likert scales. At a fairly simple level, the choice of labels affects the interpretation of the results. For example, if we asked the “political party in power” question above, we would have to be aware that the anchors are phrased in terms of agreement with the statement. In discussing these results, we would be able to discuss how much people agreed with the statement, on average, and whether agreement correlated with other things. If this seems like an obvious point, readers would be amazed how often researchers (or the media) will take an item like this and spin the results to talk about the “likelihood of voting” for the party in power, confusing an attitude with a behavior. So, in short, researchers must make sure they are being honest when presenting and interpreting research data.
At a more conceptual level, a researcher needs to decide whether the anchors for the rating scale make use of a bipolar scale, which has polar opposites at its endpoints, or a unipolar scale, which assesses a single construct. The difference between these options is best illustrated by an example:
Bipolar: How would you rate your current mood?
1 (sad)   2   3   4   5   6   7 (happy)
Unipolar: How would you rate your current mood?
1 (not at all sad)   2   3   4   5   6   7 (very sad)
1 (not at all happy)   2   3   4   5   6   7 (very happy)
The bipolar option requires participants to place themselves on a continuous scale somewhere between “sad” and “happy,” which are polar opposites. The bipolar scale assumes that the endpoints represent the only two options; participants can be sad, happy, or somewhere in between. In contrast, the unipolar option asks participants to rate themselves on two scales, indicating their level of both “sadness” and “happiness.” A pair of unipolar scales assumes that it is possible to experience varying degrees of each item—participants can be moderately happy, but also a little bit sad, for example. The decision to use a bipolar or a unipolar scale comes down to the context. What is the most logical way to think about these constructs? What have previous researchers done?
Sandra Lipsitz Bem insisted that people have varying degrees of masculine and feminine traits.
In the 1970s, Sandra Lipsitz Bem revolutionized the way researchers thought about gender roles by arguing against a bipolar approach. Previously, gender role identification had been measured on a bipolar scale from “masculine” to “feminine”; the scale assumed that a person could be one or the other. Bem (1974) argued instead that people could easily have varying degrees of masculine and feminine traits. Her scale, the Bem Sex Role Inventory, asks respondents to rate themselves on a set of 60 unipolar traits. Someone with mostly feminine and hardly any masculine traits would be described as “feminine.” Someone with high ratings on both masculine and feminine traits would be described as “androgynous.” And, someone with low ratings on both masculine and feminine traits would be described as “undifferentiated.” View and complete Bem’s scale online at: http://garote.bdmonkeys.net/bsri.html.
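Because the two unipolar scores are treated separately, the classification described above amounts to a two-way split. The sketch below is illustrative only: the function name, the 1-to-7 score range, and the cutoff of 4.0 are assumptions made for this example and are not the inventory's official scoring rules. The "masculine" branch is the mirror image of the "feminine" case described in the text.

```python
def classify_gender_role(masculine, feminine, cutoff=4.0):
    """Classify a respondent from two unipolar average scores.

    `cutoff` is a hypothetical threshold separating "high" from "low";
    it is an assumption for this sketch, not a published scoring rule.
    """
    high_m = masculine >= cutoff
    high_f = feminine >= cutoff
    if high_m and high_f:
        return "androgynous"        # high on both sets of traits
    if high_f:
        return "feminine"           # high feminine, low masculine
    if high_m:
        return "masculine"          # high masculine, low feminine
    return "undifferentiated"       # low on both sets of traits


print(classify_gender_role(masculine=5.8, feminine=6.1))
print(classify_gender_role(masculine=2.3, feminine=2.9))
```

A single bipolar score could not distinguish the first respondent from the second, because both would land near the middle of a masculine-to-feminine continuum.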
After settling on the best way to anchor the scale, the researcher’s second critical decision is to decide on the number of points in the response scale. Notice that all of the examples in this section have an odd number of points (i.e., five or seven). Odd numbers are usually preferable for rating-scale items because the middle of the scale (i.e., “3” or “4”) allows respondents to give a neutral, middle-of-the-road answer. That is, on a scale from “strongly disagree” to “strongly agree,” the midpoint can be used to indicate “neither” or “I’m not sure.” However, in some cases, a researcher may not want to allow a neutral option in a scale. Using an even number of points (e.g., four or six) essentially compels people either to agree or disagree with the statement; this type of scaling is referred to as forced choice.
So, how many points should the scale have? As a general rule, more points will translate into more variability in responses: the more choice people have (up to a point), the more likely they are to distribute their responses among those choices. From a researcher’s perspective, the big question is whether this variability is meaningful. For example, if we assess college students’ attitudes about a student-fee increase, student opinions will likely vary depending on the size of the fee and the ways in which it will be used. Thus, we might prefer a five- or seven-point scale to a two-point (yes or no) scale. However, past a certain point, increases in the scale range cease to connect to meaningful variation in attitudes. In other words, the difference between a 5 and a 6 on a seven-point scale is fairly intuitive for participants to grasp. What is the real difference, though, between an 80 and an 81 on a 100-point scale? When scales become too large, researchers risk introducing another source of error variance as participants impose their own interpretations on the scaling. In sum, more points do not always translate to a better scale.
Back to the question: How many points should the scale have? The ideal compromise supported by most statisticians is to use a seven-point scale whenever possible because of the differences between scales of measurement. As the discussion in Chapter 2 explained, the way variables are measured has implications for data analyses. For the most popular statistical tests to be legitimate, variables need to be on an interval scale (i.e., with equal intervals between points) or a ratio scale (i.e., with a true zero point). Based on mathematical modeling research, statisticians have concluded that the variability generated by a seven-point scale is most likely to mimic an interval scale (e.g., Nunnally, 1978). So, from a statistical perspective, a seven-point scale is ideal because it allows us the most flexibility in data analyses.
Finalizing the Questionnaire
After constructing the questionnaire items, researchers face one last important step before beginning data collection. This section discusses a few guidelines for assembling the items into a coherent questionnaire. One main goal at this stage is to think carefully about the order of the individual items.
First, keep in mind that the first few questions will set the tone for the rest of the questionnaire. It is best to start with questions that are both interesting and nonthreatening to help ensure that respondents complete the questionnaire with open minds. For example:
BAD OPENING: “Do you agree that your child’s teacher is an idiot?” (threatening, and also a leading question)
BETTER OPENING: “How would you rate the performance of your child’s teacher?”
BAD OPENING: “Would you support a 1% sales tax increase?” (boring)
BETTER OPENING: “How do you feel about raising taxes to help fund education?”
Second, strive whenever possible to have continuity in the different sections of the questionnaire. Imagine constructing a survey to give to college freshmen. It might include questions on family background, stress levels, future plans, campus engagement, and so on. The survey will be most effective if it groups questions by topic. So, for instance, students respond to a set of questions about future plans on one page and then answer a set of questions about campus engagement on another page. This approach makes it easier for participants to progress through the questions without having to switch mentally between topics.
Third, remember that individual questions are always read in context. This means that if the college-student survey begins with a question about plans for the future and then asks about stress, respondents will likely have their future plans in mind when they think about their stress level. Consider again the example of the graduate assistant. His department used to administer a gigantic survey packet (on paper) to the 2,000 students enrolled in Introductory Psychology each semester. One year, a faculty member included a measure of identity, asking participants to complete the statements “I am______” and “I am not______.” As researchers started to analyze data from this survey, they discovered an astonishing 60% of students had filled in the blank with “I am not a homosexual!” This response seemed downright strange, until the surveyors realized that the questionnaire immediately preceding the identity one measured prejudice toward gay and lesbian individuals. So, as these students completed the identity measure, they had homosexuality on their minds and felt compelled to point out that they were not homosexual. In other words, responses are all about context.
Finally, after assembling a draft version of the questionnaire, perform a test run. This test run, called pilot testing, involves giving the questionnaire to a small sample of people, getting their feedback, and making any necessary changes. One of the best ways to pilot test is to find a patient group of friends who will complete the questionnaire and provide extensive feedback. In soliciting their feedback, ask questions like the following:
Was anything confusing or unclear?
Was anything offensive or threatening?
How long did the questionnaire take you to complete?
Did it seem repetitive or boring? Did it seem too long?
Were there particular questions that you liked or disliked? Why?
The answers to these questions will supply valuable information to revise and clarify the questionnaire before devoting resources to a full round of data collection. The next section turns to the question of how to find and select participants for this stage of the research.
Research: Thinking Critically
Beauty and First Impressions
Follow the link below to a press release from the University of British Columbia, describing a recent publication by researchers in the psychology department. This study suggests that physical beauty may play a role in how easily we form first impressions of other people. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.
http://news.ubc.ca/2010/12/21/beautiful-people-convey-personality-traits-better-during-first-impressions/
Think About It:
1. Suppose the following questions were part of the questionnaire given after the three-minute one-on-one conversations in this study. Based on the goals of the study and the rules discussed in this chapter, identify the problem with each of the following questions and suggest a better item.
a. Jane is very neat. 1 2 3 4 5
b. Jane is generous and organized. 1 2 3 4 5
c. Jane is extremely attractive. TRUE FALSE
2. What are the strengths and weaknesses of using a fixed-format questionnaire in this study versus open-ended responses?
3. The researchers state that they took steps to control for the “positive bias that can occur in self-reporting.” How might social desirability influence the outcome of this particular study? What might the researchers have done to reduce the effect of social desirability?
4.3 Sampling From the Population
At this point, the chapter should have conveyed an understanding of how to construct survey items. The next step is to find a group of people to fill out the survey. But where does a researcher find this group? And how many people are needed? On the one hand, researchers want as many people as possible to capture the full range of attitudes and experiences. On the other hand, they have to conserve time and other resources, which often means choosing a smaller sample of people. This section examines the strategies researchers can use to select samples for their studies.
Researchers refer to the entire collection of people who could possibly be relevant to a study as the population. For example, if we were interested in the effects of prison overcrowding, our population would consist of prisoners in the United States. If we wanted to study voting behavior in the next presidential election, the population would be U.S. residents eligible to vote. And if we wanted to know how well college students cope with the transition from high school, our population would include every college student enrolled in every college in the country.
These populations suggest an obvious practical complication. How can we get every college student, much less every prisoner, in the country to fill out our questionnaire? We cannot; instead, researchers will collect data from a sample, a subset of the population. Instead of trying to reach all prisoners, we might sample inmates from a handful of state prisons. Rather than attempt to survey all college students in the country, researchers often restrict their studies to a collection of students at one university.
The goal in choosing a sample is to make it as representative as possible of the larger population. That is, if researchers choose students at one university, they need to be reasonably similar to college students elsewhere in the country. If the phrase “reasonably similar” sounds vague, this is because the basis for evaluating a sample varies depending on the hypothesis and the key variables. For example, if we wanted to study the relationship between family income and stress levels, we would need to make sure that our sample mirrored the population in the distribution of income levels. Thus, a sample of students from a state university might be a better choice than students from, say, Harvard (which costs about $60,000 per year including room and board). On the other hand, if the research question deals with the pressures faced by students in selective private schools, then Harvard students could be a representative sample for the study.
Figure 4.1 shows a conceptual illustration of both a representative and a nonrepresentative sample, drawn from a larger population. The population in this case consists of 144 individuals, split evenly between Xs and Os. Thus, we would want our sample to come as close as possible to capturing this 50/50 split. The sample of 20 individuals on the left is representative of the population because it is split evenly between Xs and Os. But the sample of 20 individuals on the right is nonrepresentative because it contains 75% Xs. Because this sample has far fewer Os than we would expect from the population, it does not accurately represent the population. This failure of the sample to represent the population is also referred to as sampling bias.
Figure 4.1: Representative and nonrepresentative samples of a population
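The comparison in Figure 4.1 can be expressed as a simple difference in proportions. The sketch below rebuilds the population and the two samples from the numbers given in the text (144 individuals split evenly, and samples of 20 containing 50% and 75% Xs); the function names are hypothetical.

```python
def share_of(group, members):
    """Proportion of `members` that belong to `group`."""
    return members.count(group) / len(members)


def sampling_gap(sample, population, group="X"):
    """Difference between the sample share and the population share of a group."""
    return share_of(group, sample) - share_of(group, population)


population = ["X"] * 72 + ["O"] * 72      # 144 individuals, 50/50 split
left_sample = ["X"] * 10 + ["O"] * 10     # representative: 50% Xs
right_sample = ["X"] * 15 + ["O"] * 5     # nonrepresentative: 75% Xs

print(sampling_gap(left_sample, population))
print(sampling_gap(right_sample, population))
```

A gap near zero suggests the sample mirrors the population on this characteristic; a large gap in either direction is one simple indicator of sampling bias.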
Where do these samples come from? Researchers have two broad categories of sampling strategies at their disposal: probability sampling and nonprobability sampling.
Researchers use probability sampling when each person in the population has a known chance of being in the sample. This is possible only in cases where researchers know the exact size of the population. For instance, the current population of the United States is 322.1 million people (www.census.gov/popclock/). If we were to select a U.S. resident at random, each resident would have a one in 322.1 million chance of being selected. Whenever researchers have this information, probability-sampling strategies are the most powerful approach because they greatly increase the odds of getting a representative sample. Within this broad category of probability sampling are three specific strategies: simple random sampling, stratified random sampling, and cluster sampling.
Simple random sampling, the most straightforward approach, involves randomly picking study participants from a list of everyone in the population. The term for this list is a sampling frame (e.g., imagine a list of every resident of the United States). To have a truly representative random sample, researchers must have a sampling frame; they must choose from it randomly; and they must have a 100% response rate from those selected. (As Chapter 2 discussed, if people drop out of a study, it can threaten the validity of the hypothesis test.)
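As a rough illustration of simple random sampling, the sketch below draws participants without replacement from a sampling frame. The frame of 1,000 numbered residents and the function `simple_random_sample` are hypothetical; the `seed` argument is included only so the draw can be repeated.

```python
import random


def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n distinct members from the frame, each with an equal chance."""
    rng = random.Random(seed)
    return rng.sample(sampling_frame, n)   # sampling without replacement


# Hypothetical sampling frame: a complete list of 1,000 residents.
frame = [f"resident_{i:04d}" for i in range(1, 1001)]
sample = simple_random_sample(frame, 50, seed=42)
print(sample[:5])
```

Note that the code only covers the first two requirements (a sampling frame and random selection); it cannot guarantee that everyone who is selected will actually respond.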
In a neighborhood with a majority of Caucasian residents, stratified random sampling is needed to capture the perspective of all ethnic groups in the community.
Researchers use stratified random sampling, a variation of simple random sampling, when subgroups of the population might be left out of a purely random sampling process. Imagine a city with a population that is 80% Caucasian, 10% Hispanic, 5% African American, and 5% Asian. If we were to pick 100 residents at random, the chances are very good that our sample would consist overwhelmingly of Caucasian residents and include only a handful of residents from each ethnic minority group, too few to capture their perspectives. To prevent this problem, researchers use stratified random sampling: breaking the sampling frame into subgroups and then sampling a random number from each subgroup. In this example, we could divide the list of residents into four ethnic groups and then pick a random 25 from each of these groups. The end result would be a sample of 100 people that captured opinions from each ethnic group in the population. Notice that this approach results in a sample that does not exactly represent the underlying population; that is, Hispanics constitute 25% of the sample, rather than 10%. One way to correct for this issue is to use a statistical technique known as “weighting” the data. Although the full details are beyond the scope of this book, weighting involves trying to correct for problems in representation by assigning each participant a weighting coefficient for analyses. In essence, people from groups that are underrepresented would have a weight greater than 1, while those from groups that are overrepresented would have a weight less than 1. For more information on weighting and its uses, see http://www.applied-survey-methods.com/weight.html.
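The stratified draw and the weighting idea can be sketched in a few lines of Python. This is a minimal sketch, not a full weighting procedure: it assumes one common definition of a weight (a group's share of the population divided by its share of the sample), and the frame of 10,000 residents and the function names are hypothetical.

```python
import random

# Population shares from the city example in the text.
population_share = {
    "Caucasian": 0.80,
    "Hispanic": 0.10,
    "African American": 0.05,
    "Asian": 0.05,
}

# Hypothetical sampling frame of 10,000 residents, already split by group.
frame_by_group = {
    group: [f"{group}_{i}" for i in range(round(share * 10_000))]
    for group, share in population_share.items()
}

def stratified_sample(frame, n_per_group, seed=None):
    """Draw the same number of people at random from every subgroup."""
    rng = random.Random(seed)
    return {group: rng.sample(members, n_per_group)
            for group, members in frame.items()}

def weighting_coefficients(shares, sample):
    """Assumed definition: population share divided by sample share."""
    total = sum(len(members) for members in sample.values())
    return {group: shares[group] / (len(members) / total)
            for group, members in sample.items()}

sample = stratified_sample(frame_by_group, n_per_group=25, seed=7)
weights = weighting_coefficients(population_share, sample)
for group, weight in weights.items():
    print(group, round(weight, 2))
```

Under this definition, the underrepresented Caucasian group receives a weight of about 3.2, while the overrepresented Hispanic group receives about 0.4 and the other two groups about 0.2, matching the "greater than 1" and "less than 1" pattern described above.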
Finally, researchers employ cluster sampling, another variation of random sampling, when they do not have access to a full sampling frame (i.e., a full list of everyone in the population). Imagine that we want to study how cancer patients in the United States cope with their illness. Because no list exists of every cancer patient in the country, we have to get a little creative with our sampling. The best way to think about cluster sampling is as “samples within samples.” Just as with stratified sampling, we divide the overall population into groups, but cluster sampling differs in that we are dividing into groups based on more than one level of analysis. In our cancer example, we could start by dividing the country into regions, then randomly selecting cities from within each region, and then randomly selecting hospitals from within each city, and finally randomly selecting cancer patients from each hospital. The end result would be a random sample of cancer patients from, say, Phoenix, Miami, Dallas, Cleveland, Albany, and Seattle; taken together, these patients would provide a fairly representative sample of cancer patients around the country.
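The "samples within samples" logic can be sketched as nested random draws. Everything in the sketch below is hypothetical: the region, city, hospital, and patient labels are placeholders, and the number selected at each stage is an arbitrary choice made for illustration.

```python
import random

def multistage_sample(regions, n_cities, n_hospitals, n_patients, seed=None):
    """Randomly pick cities per region, hospitals per city, patients per hospital."""
    rng = random.Random(seed)
    selected = []
    for cities in regions.values():
        for city in rng.sample(sorted(cities), n_cities):
            hospitals = cities[city]
            for hospital in rng.sample(sorted(hospitals), n_hospitals):
                selected.extend(rng.sample(hospitals[hospital], n_patients))
    return selected

# Hypothetical population: 3 regions x 4 cities x 3 hospitals x 20 patients.
regions = {
    f"region_{r}": {
        f"city_{r}{c}": {
            f"hospital_{r}{c}{h}": [f"patient_{r}{c}{h}_{p}" for p in range(20)]
            for h in range(3)
        }
        for c in range(4)
    }
    for r in range(3)
}

sample = multistage_sample(regions, n_cities=2, n_hospitals=1, n_patients=5, seed=1)
print(len(sample))  # 3 regions * 2 cities * 1 hospital * 5 patients = 30
```

Note that the sketch only ever needs a list of patients for the hospitals that were actually selected, which is why this approach works without a full sampling frame.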
The other broad category of sampling strategies is known as nonprobability sampling. These strategies are used in the (remarkably common) case in which researchers do not know the odds of any given individual’s being in the sample. This uncertainty represents an obvious shortcoming—if we do not know the exact size of the population and do not have a list of everyone in it, we have no way to know that our sample is representative. Despite this limitation, researchers use nonprobability sampling on a regular basis. We will discuss two of the most common nonprobability strategies here.
In many cases, it is not possible to obtain a sampling frame. When researchers study rare or hard-to-reach populations or study potentially stigmatizing conditions, they often recruit by word-of-mouth. The term for this is snowball sampling: imagine a snowball rolling down a hill, picking up more snow (or participants) as it goes. If we wanted to study how often homeless people took advantage of social services, we would be hard pressed to find a sampling frame that listed the homeless population. Instead, we could recruit a small group of homeless people and ask each of them to pass the word along to others, and so on. If we wanted to study changes in people’s identities following sex-reassignment surgery, we would find it difficult to track down this population via public records. Instead, we could recruit one or two patients and ask for referrals to others. The resulting sample in both cases is unlikely to be representative, but researchers often have to compromise for the sake of obtaining access to a population. Snowball sampling is most often used in qualitative research, where the advantages of gaining a rich narrative from these individuals outweigh the loss of representativeness.
One of the most popular nonprobability strategies is known as convenience sampling, or simply including people who show up for the study. Any time a 24-hour news station announces the results of a viewer poll, those results are likely based on a convenience sample. CNN and Fox News do not randomly select from a list of their viewers; they post a question onscreen or online, and people who are motivated (or bored) enough to respond will do so. As a matter of fact, the vast majority of psychology research studies are based on convenience samples of undergraduate college students. Research in psychology departments often works like this: Experimenters advertise their studies on a website, and students enroll in these studies, either to earn extra cash or to fulfill a research requirement for a course. Students often pick a particular study based on whether it fits their busy schedules or whether the advertisement sounds interesting. These decisions are hardly random and, consequently, neither is the sample. The goal here is not to disparage all psychology research (that would be self-defeating) but to emphasize that all of the decisions researchers make have both pros and cons.
Choosing a Sampling Strategy
Although researchers always strive for a representative sample, a perfectly representative sample does not exist. Some degree of sampling error, defined as the degree to which the characteristics of the sample differ from the characteristics of the population, is always present. Instead of aiming for perfection, then, researchers aim for an estimate of how far from perfection their samples are. This estimate is known as the margin of error, or the degree to which the results from a particular sample are expected to deviate from the population as a whole.
One of the main advantages of a probability sample is that we are able to calculate these errors, as long as we know our sample size and desired level of confidence. In fact, most of us encounter margins of error every time we see the results of an opinion poll. For example, CNN may report that “Candidate A is leading the race with 60% of the vote, ± 3%.” This means Candidate A’s percentage in the sample is 60%, but based on statistical calculations, her real percentage in the population is likely to fall between 57% and 63%. The smaller the error (3% in this example), the more closely the results from the sample match the population. Naturally, researchers conducting these opinion polls want the error of estimation to be as small as possible. How persuaded would anyone be to learn that “Candidate A has a 10-point lead, plus or minus 20 points”? This margin of error ought to trigger our skepticism, because the real difference could be anywhere between 30 points and –10 points (i.e., a 10-point lead for the other candidate).
Researchers’ most direct means of controlling the margin of error is by changing the sample size. Most survey research aims for a margin of error of less than five percentage points. Based on standard calculations, this requires a sample size of 400 people per group. That is, if we want to draw conclusions about the entire sample (e.g., “30% of registered voters said X”), then we would need at least 400 respondents to say this with some confidence. If we want to draw conclusions about subgroups (e.g., “30% of women compared to 50% of men”), then we would actually need at least 400 respondents of each gender to draw conclusions with confidence.
The magic number of 400 represents a compromise: a researcher is willing to accept 5% error for the sake of keeping time and costs down. It is worth noting, however, that some types of research have more stringent standards: For political polls to be reported by the media, they must have at least 1,000 respondents, which brings the margin of error down to three percentage points. In contrast, some areas of applied research may have more relaxed standards. In marketing research, for example, budget considerations sometimes lead to smaller samples, which means drawing conclusions at lower levels of confidence. For example, with a sample size of 100 people per group, researchers have to contend with 8–10% margin of error, which is almost double the error but at a fraction of the cost.
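The link between sample size and margin of error can be checked with a short calculation. The text does not spell out its "standard calculations," so the sketch below assumes the common formula for a proportion at the 95% confidence level, z × √(p(1 − p)/n), with z = 1.96 and the most conservative case of p = 0.5. The function name is hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Assumed formula: 95% margin of error for a proportion p with n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 400, 1000):
    print(n, f"{margin_of_error(n):.1%}")
# 100  -> 9.8%
# 400  -> 4.9%
# 1000 -> 3.1%
```

Under these assumptions, the results line up with the figures above: roughly five points at 400 respondents, three points at 1,000, and close to ten points at 100. Because the sample size sits under a square root, cutting the error in half requires about four times as many respondents.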
If probability sampling is so powerful, why are nonprobability strategies so popular? One reason is that convenience samples are more practical; they are cheaper, easier, and almost always possible to conduct with relatively few resources because researchers can avoid the costs of large-scale sampling. A second reason is that convenience is often a good-enough starting point for a new line of research. For example, if we wanted to study the predictors of relationship satisfaction, we could start by testing our hypotheses in a controlled setting using college student participants and then extend the research to the study of adult married couples. Finally, and relatedly, in many cases it is acceptable to have a nonrepresentative sample because researchers do not need to generalize results. If we want to study the prevalence of alcohol use in college students, it may be perfectly acceptable to use a convenience sample of college students. Even in this case, however, researchers would have to keep in mind that they are studying drinking behaviors among students who volunteered to complete a study on drinking behaviors.
In some cases, however, it is critical to use probability sampling, despite the extra effort required. Specifically, researchers use probability samples any time it is important to generalize and any time it is important to predict behavior of a population. The best way of understanding these criteria is to think of political polls. In the lead-up to an election, each campaign is invested in knowing exactly what the voting public thinks of its candidate. In contrast to a CNN poll, which is based on a convenience sample of viewers, polls conducted by a campaign will be based on randomly selected households from a list of registered voters. The resulting sample is much more likely to be representative, much more likely to tell the campaign how the entire population views its candidate, and therefore much more likely to be useful.
4.4 Analyzing Survey Data
Now comes the fun part. Once researchers have designed a survey, chosen an appropriate sample, and collected some data, it is time for analyses. As with the descriptive designs Chapter 3 explained, the goal of these analyses is to subject hypotheses to a statistical test. Surveys can be used both to describe and predict thoughts, feelings, and behaviors. Since Chapter 3 already covered the basics of descriptive analysis, this section will focus on predictive analyses, which are designed to assess the associations between and among variables. Researchers typically use three approaches to test predictive hypotheses: correlational analyses, chi-square analyses, and regression analyses. Each has its advantages and disadvantages, and each is most appropriate for a different kind of data. This section will walk through the basics of each analysis. Because the statistics course discusses these approaches in more detail, the goal here is to acquire a more conceptual overview of each technique and its usefulness in answering research questions.
The beginning of this chapter described an example of a survey research question: What is the relationship between the number of hours that students spend studying and their grades in the class? In this case, the hypothesis claims that we can predict something about students’ grades by knowing how many hours they spend studying.
Imagine we collected a small amount of data (shown in Table 4.1) to test this hypothesis. (Of course, a true test of this hypothesis would require more than 10 people in the sample, but these data will do as an illustration.)
Table 4.1: Data for quiz grade/hours studied example
Participant Hours Studied Quiz Grade
1 1 2
2 1 3
3 2 4
4 3 5
5 3 6
6 3 6
7 4 7
8 4 8
9 4 9
10 5 9
The Logic of Correlation
The important question here is whether and to what extent we can predict grades based on study time. One common statistic for testing these kinds of hypotheses is a correlation, which gives an assessment of the linear relationship between two variables. A stronger correlation between two variables indicates a stronger association between them. In the case of the current example, the stronger the correlation between study time and quiz grade, the more accurately we can predict grades based on knowing how long the student spends studying.
Before we calculate the correlation between these variables, it is always a good idea to visualize the data on a graph. Chapter 3 discussed a type of graph, called a scatterplot, that displays points of data on two variables at a time. The scatterplot in Figure 4.2 shows our sample data from the studying/quiz grade study.
Figure 4.2: Scatterplot for quiz grade/hours studied example
Each point on the graph represents one participant. For example, the point in the top right corner represents a student who studied for five hours and earned a 9 on the quiz. The two points in the bottom left represent students who studied for only one hour and earned a 2 and a 3 on the quiz.
Researchers have three reasons to graph data before conducting statistical tests. First, a graph allows us to get a general sense of the pattern; in this case, students who study less appear to do worse on the quiz. As a result, we will be better informed going into our statistical calculations. Second, the graph lets us examine the raw data for any outliers, or points that stand out as clear exceptions to the overall pattern. These outlier points may indicate that a respondent misunderstood the question and should be dropped from analyses. On the other hand, a cluster of outlier points could indicate the presence of subgroups within our data. Perhaps most students do worse if they study less, but a group of students is able to ace the quizzes without any preparation. Examining this cluster of people in more detail might suggest either a refinement of our hypothesis or an interesting direction for future research.
Third, the graph assures researchers that there is a linear relationship between the variables. This is a very important point about correlations: The math of the standard correlation formula is based on how well the data points fit a straight line, which means nonlinear relationships might be overlooked. Figure 4.3 demonstrates a robust nonlinear finding in psychology regarding the relationship between task performance and physiological arousal. As this graph shows, people tend to perform their best on just about any task when they have a moderate level of arousal. When arousal is too high, people find it difficult to calm down and concentrate; when arousal is too low, people find it difficult to care about the task at all. If we simply ran a standard correlation with data on performance and arousal, the correlation would be close to zero because the points do not fit a straight line. Thus, it is critical to visualize the data before jumping ahead to the statistics. Otherwise, researchers risk overlooking an important finding in the data. (It is important to note that nonlinear relationships like this one can still be analyzed, but the calculations quickly become complex. In fact, running these analyses in statistical software requires specialized knowledge.)
Figure 4.3: Curvilinear relationship between arousal and performance
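The warning about nonlinear relationships can be checked numerically. The sketch below uses made-up arousal and performance scores (not data from any study) that follow a perfect inverted-U shape, together with a hand-written helper, `pearson_r`, that implements the standard correlation formula.

```python
import math

def pearson_r(x, y):
    """Standard (Pearson) correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores: performance peaks at moderate arousal (5) and
# falls off symmetrically on either side.
arousal = list(range(1, 10))
performance = [16 - (a - 5) ** 2 for a in arousal]

r = pearson_r(arousal, performance)
print(round(r, 3))  # essentially zero
```

Even though performance is completely determined by arousal in these made-up data, the correlation comes out at essentially zero, which is exactly the kind of finding a scatterplot would catch and the statistic alone would miss.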
Interpreting Coefficients
Once we are satisfied that our data look linear, it is time to calculate our statistics. Researchers typically calculate correlations using a computer software program, such as SPSS, SAS, or Microsoft Excel. The number used to quantify the correlation is called the correlation coefficient. This number ranges from –1 to +1 and contains two important pieces of information:
The direction of the relationship is based on the sign of the correlation coefficient. A +0.8 would indicate a positive correlation, meaning that as one variable increases, so does the other variable. A –0.8 would indicate a negative correlation, meaning that as one variable increases, the other variable decreases. (Refer back to Section 2.1 for a review of these two terms.)
The size of the relationship is based on the absolute value of the correlation coefficient. The farther the coefficient is from zero in either direction, the stronger the relationship between variables. For example, both a +0.8 and a –0.8 indicate strong relationships.
So, for example, a +0.2 represents a weak positive relationship and a –0.7 represents a strong negative relationship.
Calculating the correlation for our quiz-grade study produces a coefficient of 0.962, indicating a strong positive relationship between studying and quiz grade. What does this mean in plain English? Students who spend more hours studying tend to score higher on the quiz.
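For readers who want to see where this number comes from, the sketch below computes the standard (Pearson) correlation directly from the ten rows of Table 4.1. The helper name `pearson_r` is hypothetical; statistical software would report the same coefficient.

```python
import math

# Data from Table 4.1, one entry per participant.
hours_studied = [1, 1, 2, 3, 3, 3, 4, 4, 4, 5]
quiz_grade = [2, 3, 4, 5, 6, 6, 7, 8, 9, 9]

def pearson_r(x, y):
    """Standard (Pearson) correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # How much the two variables vary together...
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # ...relative to how much each varies on its own.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(hours_studied, quiz_grade)
print(round(r, 3))  # 0.962
```

The positive sign gives the direction of the relationship, and the closeness to 1 gives its size.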
How do we know whether to get excited about a correlation of 0.962? As with all of our statistical analyses, we look this value up in a critical value table, or, more commonly, let the computer software do this for us. This step provides a p value, which represents the probability of finding a correlation at least this strong by random chance alone if the two variables were actually unrelated. In this case, the p value is less than 0.001. This means that a correlation this strong would turn up by chance less than 1 time in 1,000; we can feel pretty confident in our results.
When interpreting correlation results, realize that statistical significance is closely tied to the sample size. In a small sample, it is possible to see moderate to strong relationships that do not meet the threshold for statistical significance. One good option in these cases is to collect additional data. If the correlation maintains its size and also attains statistical significance, researchers can have some confidence in the results. It is also possible to have the opposite problem: Large sample sizes can make even the smallest relationships show high levels of statistical significance. In a 2008 journal article, Newman, Groom, Handelman, and Pennebaker analyzed differences in language use between men and women. Because the authors had a sample of over 14,000 text samples, even the tiniest differences in language were statistically significant. For example, men used words related to anger about 4% more than women; with such a large sample, this trivial difference was significant at p < 0.05. To deal with this issue, the authors chose to use a more conservative threshold of p < 0.001, considering all other results to be too trivial.
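The tie between sample size and significance can be illustrated with the t statistic that is commonly used to test a correlation, t = r√(N − 2) / √(1 − r²). The sketch below is an illustration under stated assumptions: the function name is hypothetical, the second and third cases use invented values of r and N, and the thresholds are the standard two-tailed critical values of t at p < 0.05 (2.306 with 8 degrees of freedom, about 1.96 for very large samples).

```python
import math

def t_statistic(r, n):
    """t value for testing whether a correlation r from n cases differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Quiz-grade study from the text: strong correlation, small sample.
print(round(t_statistic(0.962, 10), 2))     # far above 2.306

# Hypothetical moderate correlation in a sample of 10: not significant.
print(round(t_statistic(0.50, 10), 2))      # 1.63, below 2.306

# Hypothetical tiny correlation in a sample of 14,000: significant.
print(round(t_statistic(0.05, 14_000), 2))  # 5.92, above 1.96
```

A moderate correlation of 0.50 fails to reach significance with 10 participants, while a tiny correlation of 0.05 clears the threshold easily with 14,000 cases. This is why the size of the coefficient and its p value need to be read together.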
Returning to our quiz-grade study, we now have all the information we need to report this correlation in a research paper. The standard way of reporting a correlation coefficient includes information about the sample size (N) and p value, as well as the coefficient itself. Our quiz-grade study would be reported as Figure 4.4 depicts.
Figure 4.4: Correlation coefficient diagram
Where, then, does this leave our hypothesis? We started by predicting that students who spent more time studying would perform better on their quizzes than those who spent less time studying. We then designed a study to test this hypothesis by collecting data on study habits and quiz grades. Finally, we analyzed these data and found a significant, strong, positive correlation between hours studied and quiz grade. Based on this study, our hypothesis has been supported: students who study more have higher quiz grades. Of course, because this is a correlational study, we are unable to make causal statements. It could be that studying more for an exam helps students to learn more. Or, it could be the case that previous low quiz grades make students give up and study less. Alternatively, a third variable, such as motivation, could cause students both to study more and to perform better on the quizzes. To tease these explanations apart and determine causality calls for a different type of research design, which Chapter 5 will discuss.
Multiple Regression Analysis
Correlations are the best tool to test the linear relationship between pairs of quantitative variables. However, in many cases, researchers are interested in comparing the influence of several variables at once. Imagine we want to expand the study about hours studying and quiz grade by looking at other variables that might predict students’ quiz grades. We have already learned that the hours students spend studying correlate positively with their grades. But what about SAT scores? Will students with higher standardized-test scores do better in all of their college classes? What about the number of classes that students have previously taken in the subject area? Will increased familiarity with the subject be associated with higher scores? To compare the influence of all three variables, we can use a slightly different analytic approach. Multiple regression is a variation on correlational analysis in which more than one predictor variable is used to predict a single outcome variable. In this example, we would attempt to predict the outcome variable of quiz scores based on three predictor variables: SAT scores, number of previous classes, and hours studied.
Multiple regression requires an extensive set of calculations; consequently, it is always performed by computer software. A detailed look at these calculations is beyond the scope of this book, but a conceptual overview will help convey the unique advantages of this analysis. Essentially, the calculations for multiple regression are based on the correlation coefficients between each of our predictor variables, as well as between each of these variables and the outcome variable. Table 4.2 shows these correlations for our revised quiz-grade study. If we scan the top row, we see the correlations between quiz grade and the three predictor variables: SAT (r = 0.14), previous classes (r = 0.24), and hours studied (r = 0.25). The remainder of the table shows correlations between the various predictor variables; for example, hours studied and previous classes correlate at r = 0.24. When researchers conduct multiple regression analysis using computer software, the software will use all of these correlations in performing its calculations.
Table 4.2: Correlations for a multiple regression analysis
Quiz Grade SAT Score Previous Classes Hours Studied
Quiz Grade — 0.14 0.24* 0.25*
SAT Score — .02 –.02
Previous Classes — 0.24*
Hours Studied —
Note. Correlations marked with an asterisk (*) are statistically significant at the 95% confidence level. This notation in results tables is common and allows researchers to quickly spot the most interesting findings.
The advantage of multiple regression is that it considers both the individual and the combined influence of the predictor variables. Figure 4.5 shows a visual diagram of the individual predictors of quiz grades. The numbers along each line are known as regression coefficients, or beta weights. These values are very similar to correlation coefficients but differ in an important way: They represent the effects of each predictor variable while controlling for the effects of all the other predictors. That is, the value of b = 0.21 linking hours studied with quiz grades is the independent contribution of hours studied, controlling for SAT scores and previous classes. If we compare the size of these regression coefficients, we see that, in fact, hours spent studying is still the largest predictor of quiz grades (b = 0.21), compared to both SAT scores (b = 0.14) and previous classes (b = 0.19).
Even if individual variables only have a small influence, they can add up to a larger combined influence. So, if we were to analyze the predictors of quiz grades in this study, we would find a combined multiple correlation coefficient of r = 0.34. The multiple correlation coefficient represents the combined association between the outcome variable and the full set of predictor variables. Note that in this case, the combined r of 0.34 is larger than any of the individual correlations in Table 4.2, which ranged from 0.14 to 0.25. These numbers mean that we are better able to predict quiz grades from examining all three variables than we are from examining any single variable. In other words, the full set of predictors tells us more than any one predictor does on its own.
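Although the full calculations are left to software, the link between the correlations in Table 4.2 and the regression results can be sketched for the curious. The sketch assumes that the coefficients in Figure 4.5 are standardized beta weights, in which case they solve a small system of equations built from the correlations among the predictors, and the squared multiple correlation is the sum of each beta times that predictor's correlation with the outcome. The `solve` helper is a generic, hypothetical routine and not part of any statistics package.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

predictors = ["SAT score", "previous classes", "hours studied"]
# Correlations among the predictors (Table 4.2).
r_xx = [[1.00, 0.02, -0.02],
        [0.02, 1.00, 0.24],
        [-0.02, 0.24, 1.00]]
# Correlation of each predictor with quiz grade (top row of Table 4.2).
r_xy = [0.14, 0.24, 0.25]

betas = solve(r_xx, r_xy)
multiple_r = math.sqrt(sum(b * r for b, r in zip(betas, r_xy)))

for name, beta in zip(predictors, betas):
    print(name, round(beta, 2))
print("multiple correlation:", round(multiple_r, 2))
```

Under that assumption, the sketch yields coefficients of about 0.14, 0.19, and 0.21 and a multiple correlation of about 0.34, in line with the values discussed above. Notice that the beta for previous classes (0.19) is smaller than its simple correlation (0.24) because part of its association with quiz grades overlaps with hours studied.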
Figure 4.5: Predictors of quiz grades
Multiple regression is an incredibly useful and powerful analytic approach, but it can also be a difficult concept to grasp. Before moving on, we will revisit the concept in the form of an analogy. Imagine someone has just eaten the most delicious hamburger of his life and is determined to understand what makes it so good. Many things contribute to the taste of the hamburger: the quality of the meat, the type and amount of cheese, the freshness of the bun, perhaps the smoked chili peppers layered on top. If the diner were to approach this investigation using multiple regression, he would be able to distinguish the influence of each variable (how important is the cheese compared to the smoked peppers?) as well as take into account the full set of ingredients (does the freshness of the bun really matter when the other elements taste so good?). Ultimately, the individual would be armed with the knowledge of which elements are most important in crafting the perfect hamburger and would understand more about the perfect hamburger than if he had examined each ingredient in isolation.
Both correlations and regressions are well suited to testing hypotheses about prediction, as long as we can demonstrate a linear relationship between two variables. Linear relationships, however, require that variables be measured on one of the quantitative scales, that is, ordinal, interval, or ratio scales (see Section 2.3 for a review). What if we want to test an association between nominal, or categorical, variables? In these cases, we need an alternative statistic called the chi-square statistic, which determines whether two nominal variables are independent from or related to one another. Chi-square is often abbreviated with the symbol χ², which shows the Greek letter chi with the superscript 2 for squared. (This statistic is also referred to as the chi-square test for independence, a slightly longer but more descriptive synonym.)
The idea behind this test is similar to that of the correlation coefficient. If two variables are independent, then knowing the value of one variable does not tell us anything about the value of the other variable. As we will see in the example below, a larger chi-square reflects a larger deviation from what we would expect by chance and is thus an index of statistical significance.
Imagine that we want to know whether people in rural or urban areas are more likely to support a sales-tax increase. We can easily speculate why either group might be more likely to do so: perhaps people living in cities are more politically liberal, or perhaps people living in small towns are better able to see benefits of higher local taxes. So, we might survey a sample of 100 people, asking them to indicate both their location (rural or urban) and their support for a sales-tax proposal. The survey produces the following results (in Table 4.3), presented in a contingency table, which displays the number of individuals in each combination of our nominal variables. Notice that we have more urban than rural residents, reflecting the higher population density in cities.
Table 4.3: Chi-square example: support for a sales tax increase
Rural Urban Total
Support 10 45 55
Don’t Support 30 15 45
Total 40 60 100
But, as it turns out, the raw numbers are less important than the ratios within each group. The chi-square calculation works by first considering what each cell in the table would look like if there were no relationship at all (i.e., under the null hypothesis), and then determining how much the data differ from that reference point.
In this example, our final chi-square value is 24.24; this represents the total difference across the table between actual and expected data. The larger this number is, the more our observed data differ from the expected frequencies, and the more our variables relate to one another. In the current example, this means we can predict a person’s support for a sales-tax increase based on where he or she lives, which is consistent with our initial hypothesis.
Still, how do we know if our value of 24.24 is meaningful? As with the other statistical tests we have discussed, determining the significance requires looking up the result in a critical-value table to assess whether the calculated value is above threshold. In this case, the critical value for a chi-square with a 2 × 2 table is 3.84 (at the conventional p < 0.05 level), so we can feel confident in our value of 24.24, which is more than 6 times higher than the threshold value.
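The chi-square logic can be traced step by step with the counts in Table 4.3. The sketch below assumes the standard Pearson chi-square formula without a continuity correction: each cell's expected count is its row total times its column total divided by the grand total, and the statistic sums (observed − expected)² / expected across cells. The function name is hypothetical.

```python
# Observed counts from Table 4.3.
observed = {
    "Support":       {"Rural": 10, "Urban": 45},
    "Don't Support": {"Rural": 30, "Urban": 15},
}

def chi_square(table):
    """Return the chi-square statistic and the expected counts for a table."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())
    # Expected count in each cell if location and support were unrelated.
    expected = {r: {c: row_totals[r] * col_totals[c] / grand_total for c in cols}
                for r in rows}
    statistic = sum((table[r][c] - expected[r][c]) ** 2 / expected[r][c]
                    for r in rows for c in cols)
    return statistic, expected

chi2, expected = chi_square(observed)
print(expected["Support"]["Rural"])  # 22.0 rural supporters expected by chance
print(round(chi2, 2))                # 24.24
```

If location and support were unrelated, we would expect 22 rural supporters rather than the 10 observed. For these counts the calculation gives a chi-square of about 24.24, well above the critical value of 3.84.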
However, unlike correlation and regression coefficients, our chi-square results cannot tell us anything about the direction or magnitude of the relationship. The size of the chi-square reflects only how far the data deviate from what we would expect by chance, which makes it an index of statistical significance. To interpret the patterns of our data, we need to visually inspect the numbers in our data table. Better yet, we can create a bar graph like we did in Chapter 3 to display these frequencies visually.
As Figure 4.6 shows, the cell frequencies suggest a fairly clear interpretation: People who live in urban settings are much more likely than people who live in rural settings to support a sales-tax increase. In fact, urban residents support the increase by a 3-to-1 margin, while rural residents oppose the increase by a 3-to-1 margin.
Figure 4.6: Graph of chi-square results
Research: Thinking Critically
Self-Esteem in Youth and Early Adulthood
Follow the link below to read a press release from the American Psychological Association, describing recent research on self-esteem during adolescence. This study, by a group of Swiss researchers, challenges some of our popular assumptions about gender differences in self-esteem. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.
Think About It:
1. Why is self-esteem a good topic to study using survey research methods? Does using a survey to study self-esteem present any weaknesses?
2. What type of sampling was used in this study? Was this an appropriate strategy?
3. What type of data analysis discussed in this chapter is appropriate to understanding the influence of multiple variables (mastery, health, income) on self-esteem?
Summary and Resources
Chapter Summary
This chapter has covered the process of survey research from conceptualization through analysis. We first discussed the types of research questions that are best suited to survey research: essentially, those that can be answered based on people’s observations of their own behavior. Survey research can involve either verbal reports (i.e., interviews) or written reports (i.e., questionnaires). In both cases, surveys are distinguished by their reliance on people’s self-reports of their attitudes, feelings, and behaviors.
This chapter covered several key points for writing survey items. The key takeaway of the five rules for better questions is that questions should be written as clearly and unambiguously as possible. This helps to minimize the error variance that might result from participants imposing their own guesses and interpretations on the material. When designing survey items, researchers also have a broad choice between open-ended and fixed-format responses. The former provide richer and more extensive data but are harder to score and code; the latter are easier to code but can constrain people’s responses to a researcher’s choice of categories. If and when researchers settle on a fixed-format response, they have another set of decisions to make regarding the response scaling, labels, and general format.
Once researchers have constructed the scale, it is time to begin data collection. This chapter discussed the concept of sampling, or choosing a portion of the population to use for a study. Broadly speaking, sampling can be either “probability” or “nonprobability,” depending on whether researchers have a known population size from which they sample randomly. Probability sampling is more likely to result in a representative sample, but this approach is not possible in all studies. In fact, a significant proportion of psychology research studies use a form of nonprobability sampling called convenience sampling, meaning that the sample consists of those who show up for the study.
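The contrast between the two sampling approaches can be made concrete with a short sketch. The function names, the participant IDs, and the idea of modeling "those who show up" as the first people in an arrival list are all illustrative assumptions, not procedures taken from any particular study.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Probability sampling: every member of a known population
    has the same chance of being selected."""
    rng = random.Random(seed)  # seed is optional; it only makes the draw repeatable
    return rng.sample(population, n)


def convenience_sample(arrivals, n):
    """Nonprobability sampling: take whoever shows up first.
    `arrivals` is a hypothetical list ordered by arrival time."""
    return arrivals[:n]


# Hypothetical population of 100 participant IDs.
population = list(range(100))

random_draw = simple_random_sample(population, 10, seed=1)
walk_ins = convenience_sample(population, 10)
```

In the random draw, selection does not depend on anything about the participants, which is why it is more likely to be representative. In the convenience sample, whatever determines who arrives first (interest in the topic, free time, proximity to the lab) also determines who ends up in the study.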
Finally, this chapter covered three approaches to analyzing survey data and testing hypotheses about prediction. The first, correlational analysis, is a very popular way to analyze survey data. The correlation is a statistical test that assesses the linear relationship between two variables. The stronger the correlation between variables, the more we can accurately predict one based on knowing the other. Second, regression analyses allow us to expand our investigations into multiple predictors. Multiple regression offers the advantage of considering both the individual and the combined influence of the predictor variables. However, both correlation and regression require the variables to be quantitative, that is, measured on an ordinal, interval, or ratio scale. In cases where our survey produces nominal or categorical data, we use an alternative called the chi-square statistic, which determines whether two nominal variables are independent or related. The chi-square works by examining the extent to which our observed data deviate from the pattern we would expect if the variables were unrelated.
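As a rough illustration of two of these analyses, the sketch below computes a Pearson correlation for two quantitative variables and a chi-square statistic for a table of counts. The function names and the idea of pairing "mastery" with "self-esteem" scores are assumptions made for the example; in practice a statistics package would also report a p value, which this sketch omits.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship
    between two quantitative variables (ranges from -1 to +1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How much the two variables vary together around their means.
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # How much each variable varies on its own.
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return co_variation / sqrt(variation_x * variation_y)


def chi_square(observed):
    """Chi-square statistic for a table of counts (rows x columns).
    Compares each observed count with the count expected if the
    two nominal variables were unrelated."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical scores: mastery and self-esteem for five respondents.
mastery = [2, 4, 5, 7, 9]
self_esteem = [3, 4, 6, 8, 9]
r = pearson_r(mastery, self_esteem)

# Hypothetical counts: rows are two groups, columns are yes/no answers.
counts = [[20, 10],
          [10, 20]]
chi2 = chi_square(counts)
```

When the observed counts match the expected counts exactly, the chi-square statistic is zero; the further the observed pattern departs from what independence would predict, the larger the statistic becomes.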
The common thread in all these analyses is that while they measure the association between variables, they do not tell us anything about the causal relationship between them. To make causal statements, we have to conduct experiments, which the next chapter will discuss.