Ecological Intelligence: An Adaptation for Frequencies

When I left a restaurant in a charming town in Tuscany one night, I looked for my yellow-green rented Renault 4 in the parking lot. There was none. Instead there was a blue Renault 4 sitting in the lot, the same model, but the wrong color. I still feel my fingers hesitating to put my key into the lock of this car, but the lock opened. I drove the car home. When I looked out the window the next morning, there was a yellow-green Renault 4 standing in bright sunlight outside. What had happened? My color-constancy system did not work with the artificial light at the parking lot. Color constancy, an impressive adaptation of the human perceptual system, allows us to see the same color under changing illuminations: under the bluish light of day as well as the reddish light of the setting sun. Color constancy, however, fails under certain artificial lights, such as sodium or mercury vapor lamps, which were not present in the environment when mammals evolved (Shepard, 1992). Human color vision is adapted to the spectral properties of natural sunlight. More generally, our perceptual system has been shaped by the environment in which our ancestors evolved, which is often referred to as the "environment of evolutionary adaptiveness" or EEA (Tooby & Cosmides, 1992). Similarly, human morphology, physiology, and the nervous and immune systems show exquisite adaptations: The tubular form of the bones maximizes strength and flexibility while minimizing weight, and bones are, pound for pound, stronger than solid steel bars; the best man-made heart valves cannot yet match the way natural valves open and close (Nesse & Williams, 1996). Like color constancy, however, these systems can be fooled and may break down when stable, long-term properties of the environment to which they were adapted change. In this chapter I apply this argument to human reasoning.
I propose that human reasoning algorithms are, like those of color constancy,
designed for information that comes in a format that was present in
the EEA. I will focus on a class of inductive reasoning processes technically
known as Bayesian inference, and on a simple version thereof
where an organism infers from one or a few indicators which of two events
is true.

Bayesian Inference

David Eddy (1982) asked physicians to estimate the probability that a woman has breast cancer given that she has a positive mammography, on the basis of the following information:

The probability that a patient has breast cancer is 1% (the physician's prior probability). If the patient has breast cancer, the probability that the radiologist will correctly diagnose it is 79% (sensitivity or hit rate). If the patient has a benign lesion (no breast cancer), the probability that the radiologist will incorrectly diagnose it as cancer is 9.6% (false positive rate). Question: What is the probability that a patient with a positive mammography actually has breast cancer?

Eddy reported that 95 out of 100 physicians estimated the probability of breast cancer after a positive mammography to be about 75%. The inference from an observation (positive test) to a disease, or more generally, from data D to a hypothesis H, is often referred to as "Bayesian inference," because it can be modeled by Bayes's rule:

$$p(H \mid D) = \frac{p(H)\,p(D \mid H)}{p(H)\,p(D \mid H) + p(\neg H)\,p(D \mid \neg H)} \tag{1}$$

Equation 1 shows how the probability p(H|D) that the woman has breast cancer (H) after a positive mammography (D) is computed from the prior probability p(H) that the patient has breast cancer, the sensitivity p(D|H), and the false positive rate p(D|¬H) of the mammography. Inserting Eddy's numbers gives p(H|D) = (.01)(.79) / [(.01)(.79) + (.99)(.096)] ≈ .077. That is, fewer than 8 out of 100 women with a positive mammography actually have breast cancer; the physicians' typical estimate of 75% was off by an order of magnitude.

This result, together with an avalanche of studies reporting
that lay people's reasoning does not follow Bayes's rule either, has
(mis-)led many to believe that Homo sapiens is unable to reason
the Bayesian way. Listen to some influential voices: "In his evaluation
of evidence, man is apparently not a conservative Bayesian: he is not
a Bayesian at all" (Kahneman & Tversky, 1972, p. 450). "Tversky
and Kahneman argue, correctly, I think, that our minds are not built
(for whatever reason) to work by the rules of probability" (Gould,
1992, p. 469).[2] The literature
of the last 25 years has repeated, again and again, the message that people are bad reasoners, neglect base rates most of the time, neglect false positive rates, and are unable to integrate base rate, hit rate, and false positive rate the Bayesian way (for a recent review see Koehler, 1996). Probability problems such as the mammography problem have become the stock-in-trade of textbooks, lectures, and party entertainment. It is guaranteed fun to point out how dumb others are. And aren't they? There seem to be many customers eager to buy the message of "inevitable illusions" wired into our brain (Piattelli-Palmarini, 1994).

Ecological Bayesian Inference: An Adaptation for Frequencies

Back to color constancy. If a human visual system enters an environment illuminated by sodium-vapor lamps, its color-constancy algorithms will fail. But this does not mean that human minds are not built to work by color-constancy algorithms. Similarly, if a human reasoning system enters an environment where statistical information is formatted differently from that encountered in the environment in which humans evolved, the reasoning algorithms may fail. But this does not imply that human minds are not built to reason the Bayesian way.

The issue is not whether Nature has equipped our minds with good or with bad statistical software, as the "optimists" versus "pessimists" discussion about human rationality suggests (Jungermann, 1986). The issue I address here is the adaptation of mental algorithms to their environment. By "mental algorithms" I mean induction mechanisms that perform classification, estimation, or other forms of uncertain inference, such as deciding what color an object is, or inferring whether a person has a disease.

For what information formats have mental algorithms been designed? What matters for an algorithm that makes inductive inferences is the format of numerical information. Eddy presented information (about the prevalence of breast cancer, the sensitivity, and the false positive rate of the test) in terms of probabilities and percentages, just as most experimenters did who found humans making irrational judgments. What was the format of the numerical information humans encountered during their evolution? We know too little about these environments, for instance, about the historically normal conditions of childbirth, or how strong a factor religious doctrines were, and most likely, these varied considerably between societies.
But concerning the format of numerical information, I believe we can be as certain as we ever can be: Probabilities and percentages were not the way organisms encountered information. Probabilities and percentages are quite recent forms of representing uncertainty. Mathematical probability emerged only in the mid-17th century (Hacking, 1975), and the concept of probability itself did not gain prominence over the original, primitive notion of "expectation" before the mid-18th century (Daston, 1988). Percentages became a common notation only during the 19th century, after the metric system was introduced during the French Revolution (and then mainly for interest and taxes rather than for representing uncertainty). Only in the second half of the 20th century did probabilities and percentages become entrenched in the everyday language of Western countries as representations of uncertainty (Gigerenzer et al., 1989). To summarize, probabilities and percentages took millennia of literacy and numeracy to evolve as a format for representing degrees of uncertainty. In what format did humans acquire numerical information before that time? I propose that the original format was event frequencies, acquired by natural sampling.

Let me explain what this means by a parallel to the mammography problem, using the same numbers. Think about a physician in an illiterate society. Her people have been afflicted by a new, severe disease. She has no books, no statistical surveys, and must rely solely on her experience. Fortunately, she has discovered a symptom that signals the disease, although not with certainty. In her lifetime, she has seen 1,000 people, 10 of whom had the disease. Of those 10, eight showed the symptom; of the 990 not afflicted, 95 did. Thus there were 8 + 95 = 103 who showed the symptom, and only 8 of these had the disease. Now a new patient appears. He has the symptom. What is the probability that he actually has the disease?
The physician in the illiterate society does not need a pocket calculator to estimate the Bayesian posterior probability. All she needs to do is to keep track of the number of symptom and disease cases (here: 8) and the number of symptom and no disease cases (here: 95). The probability that the new patient actually has the disease can be "seen" easily from these frequencies:

$$p(\text{disease} \mid \text{symptom}) = \frac{a}{a+b} = \frac{8}{8+95} \tag{2}$$
Equation 2 is Bayes's rule for event frequencies, where a is the number of cases with symptom and disease, and b is the number of cases having the symptom but lacking the disease. The chance that the new patient has the disease is less than 8 out of 100, or 8%. Our physician, who learns from experience, cannot be fooled as easily into believing that the chances are about 75%, as many of her modern colleagues were. The comparison between Equations 1 and 2 reveals an important theoretical result: Bayesian reasoning is computationally simpler (in terms of the number of operations performed, such as additions and multiplications) when the information is in a frequency format (Equation 2) than in a probability format (Equation 1) (see Kleiter, 1994). Incidentally, as Equation 2 shows, the base rates of event frequencies (such as 10 in 1,000) need not be kept in memory; they are implicit in the two frequencies a and b.

Now let me be clear about how the terms "natural sampling" and "frequency format" relate (Gigerenzer & Hoffrage, 1995). Natural sampling is the sequential process of updating event frequencies from experience. A foraging organism that, day after day, samples potential food resources and learns the frequencies with which a cue (e.g., the presence of other species) predicts food performs natural sampling by updating the frequencies a and b from observation to observation. Natural sampling is different from systematic experimentation, where the sample sizes (the base rates) of each treatment group are fixed in advance. For instance, in a clinical experiment, one might select 100 patients with cancer and 100 without cancer, and then perform tests on these groups. Because the base rates are fixed by design, the frequencies obtained in such experiments no longer carry information about the base rates. This is not to say that controlled sampling in systematic experiments is useless; it just serves a different purpose.
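The contrast between Equation 1 and Equation 2 can be made concrete with a minimal Python sketch, assuming the illiterate physician's tallies from above (8 of 10 diseased and 95 of 990 healthy people show the symptom). The function names are hypothetical, natural sampling is modeled as a single pass that increments a or b, and exact fractions are used so the two routes can be compared without rounding.

```python
from fractions import Fraction

def natural_sampling(observations):
    """Update the tallies a and b one observation at a time.

    Each observation is a (symptom, disease) pair of booleans.
    a counts symptom-and-disease cases, b counts symptom-but-no-disease cases.
    Cases without the symptom never enter Equation 2, so nothing is stored for them.
    """
    a = b = 0
    for symptom, disease in observations:
        if symptom and disease:
            a += 1
        elif symptom:
            b += 1
    return a, b

def posterior_from_frequencies(a, b):
    """Equation 2: one addition and one division; no base rate needed."""
    return Fraction(a, a + b)

def posterior_from_probabilities(base_rate, hit_rate, false_pos_rate):
    """Equation 1: the base rate must be supplied and weighted explicitly."""
    hits = base_rate * hit_rate
    false_alarms = (1 - base_rate) * false_pos_rate
    return hits / (hits + false_alarms)

# The physician's 1,000 cases, grouped for brevity (the order does not matter).
cases = ([(True, True)] * 8 + [(False, True)] * 2
         + [(True, False)] * 95 + [(False, False)] * 895)

a, b = natural_sampling(cases)
print(a, b)                               # 8 95
print(posterior_from_frequencies(a, b))   # 8/103
print(posterior_from_probabilities(Fraction(10, 1000), Fraction(8, 10),
                                   Fraction(95, 990)))  # 8/103
```

Both routes return exactly 8/103 (about .078). The difference lies in the work done: Equation 1 multiplies and sums three normalized rates, whereas the natural-sampling route only counts and then divides once.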
Brunswik's (1955) method of "representative sampling" in a natural environment is an example of applying the idea of natural sampling to experimental design.

A "frequency format" reports the final tally of a natural sampling process. There is more than one way to present the final tally. In the case of the physician in the illiterate tribe, I specified the total number of observations (1,000), the frequency of the disease, and the frequencies a and b of hits and false positives, respectively: "In her lifetime, she has seen 1,000 people, 10 of whom had the disease. Of those 10, eight showed the symptom; of the 990 not afflicted, 95 did." This is a straightforward translation of the base rates, hit rates, and false positive rates into a frequency format. Alternatively, one can communicate just the frequencies a and b: "In her lifetime, she has seen 8 people with symptom and disease, and 95 people with symptom and no disease." The former frequency format uses a standard menu ("standard" because slicing up the information in terms of base rate, hit rate, and false positive rate is deeply entrenched today), the latter a short menu (Gigerenzer & Hoffrage, 1995). Both lead to the same result.

A "frequency format" must not be confused with a representation in terms of relative frequencies (e.g., a base rate of .01, a hit rate of .79, and a false positive rate of .096). Relative frequencies are, like probabilities and percentages, normalized numbers that no longer carry information about the natural base rates (Gigerenzer & Hoffrage, 1995). Relative frequencies, probabilities, and percentages are to human reasoning algorithms that perform Bayesian-type inference what sodium-vapor lamps are to human color-constancy algorithms. This analogy has, like every analogy, its limits. For instance, humans can be taught, although with some mental agony, to reason with probabilities, but not, I believe, to maintain color constancy under sodium-vapor illumination.
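The translation from normalized rates into a standard-menu tally, and from there to the short menu, can be sketched in Python. This assumes Eddy's rates and a reference population of 1,000; the function names and the rounding to whole cases are illustrative choices, not part of the original studies.

```python
def standard_menu(base_rate, hit_rate, false_pos_rate, population=1000):
    """Translate normalized rates into a natural-frequency tally.

    population is the size of the reference group; any convenient number
    works, but small groups force the counts to be rounded to whole cases.
    """
    sick = round(population * base_rate)
    healthy = population - sick
    return {
        "population": population,
        "sick": sick,
        "a": round(sick * hit_rate),            # symptom and disease (hits)
        "b": round(healthy * false_pos_rate),   # symptom, no disease (false positives)
    }

def short_menu(tally):
    """Keep only a and b; the base rate is implicit in them."""
    return tally["a"], tally["b"]

tally = standard_menu(0.01, 0.79, 0.096)
print(tally)   # {'population': 1000, 'sick': 10, 'a': 8, 'b': 95}
a, b = short_menu(tally)
print(a / (a + b))   # posterior via Equation 2, about .078
```

With a reference population of 10,000 the same rates yield 100 sick people, a = 79, and b = 950, so the posterior (about .077) barely moves; rounding is the only source of difference. Dividing a by the number of sick people, by contrast, returns the relative frequency .79 whatever the population, which is exactly the information about sample size that normalized numbers discard.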
Note that the total number of observations in a frequency format (which is communicated only in the standard menu) need not be the actual total number of observations. It can be any convenient number such as 100 or 1,000. The computational simplicity of the frequency format holds regardless of whether the actual or a convenient number is used. For example, if the actual sample size was 5,167 patients, one can nevertheless represent the information in the same frequency format as above: "For every 1,000 patients we expect 10 who have breast cancer, and 8 out of these 10 will test positive...."[3]

The hypothesis that mental algorithms were designed for frequency formats is consistent with (a) a body of studies reporting that humans can monitor frequencies fairly accurately (e.g., Barsalou & Ross, 1986; Hintzman & Block, 1972; Jonides & Jones, 1992), (b) the thesis that humans process frequencies (almost) automatically, that is, with little or no effort, awareness, or interference with other processes (Hasher & Zacks, 1984), (c) the thesis that probability learning and transfer derive from frequency learning (Estes, 1976), and (d) developmental studies on counting in children and animals (e.g., Gallistel & Gelman, 1992). This is not to say that humans and animals count all possible events equally well, nor could they, since a conceptual mechanism must first decide what the units of observation are, so that a frequency encoding mechanism can count them. This preceding conceptual process is not dealt with by the hypothesis that mental algorithms are designed for frequency formats (but see the connection proposed by Brase, Cosmides, & Tooby, in press).

Thus, my argument has two parts: evolutionary (and developmental)
primacy of frequency formats, and ease of computation. First, mental
algorithms, from color constancy to inductive reasoning, have evolved
in an environment with fairly stable characteristics. If there are mental
algorithms that perform Bayesian-type inferences from data to hypotheses,
these are designed for event frequencies acquired by natural sampling,
that is, for frequency formats, and not for probabilities or percentages.
Second, when numerical information is represented in a frequency format,
Bayesian computations are reduced to a minimum. Both parts of
the argument are necessary. For instance, the computational part could
be countered by hypothesizing that there might be a single neuron in
the human mind that computes Equation 1 on the basis of probability
information, and thus in no time. The evolutionary part of the argument makes it unlikely that such a neuron evolved, because it would compute with an information format that was not present in the environment in which our ancestors evolved.

Predictions

This argument has testable consequences. First, lay people (that is, persons with no professional expertise in diagnostic inference) are more likely to reason the Bayesian way when the information is presented in a frequency format than in a probability format. This effect should occur without any instruction in Bayesian inference. Second, experts such as physicians who make diagnostic inferences on a daily basis should, despite their experience, show the same effect. Third, the "inevitable illusions" (Piattelli-Palmarini, 1994), such as base rate neglect, should become evitable by using frequency formats. Finally, frequency formats should provide a superior vehicle for teaching Bayesian inference. In what follows I report tests of these predictions and several examples drawn from a broad variety of everyday situations.

This is not to say that probabilities are useless or
perverse. In mathematics they play their role independent of whether
or not they suit human reasoning, just like Riemannian and other non-Euclidean
geometries play their roles independent of the fact that human spatial
reasoning is Euclidean. Breast CancerEddy (1982) provides only a scant, one-paragraph description of his study of physician's intuitions, and refers to a study by Casscells, Schoenberger, & Grayboys (1978) with similar results. Both studies used a probability format. Would a frequency format make any difference to experts such as physicians? Ulrich Hoffrage and I tested 48 physicians in Munich, Germany, on the mammography problem. These physicians had an average professional experience of 14 years. Twenty-four physicians read the information in a probability format as in Eddy's study, the other 24 read the same information in a frequency format. In the probability format, physicians were always asked for a single-event probability (as in Eddy's study); in the frequency format, physicians were always asked for a frequency judgment. The probability and the frequency formats of the mammography problem are shown in Table 1. Each physician got four diagnostic problems (including the mammography problem), two in a probability and two in a frequency format (the details are in Gigerenzer, 1996b; Gigerenzer & Hoffrage, 1996; Hoffrage & Gigerenzer, in press). In the probability format, only 2 out of 24 physicians (8%) came up with the Bayesian answer. The median estimate of the probability of breast cancer after a positive mammography was 70%, consistent with Eddy's findings. In the frequency format, however, 11 out of 24 physicians (46%) responded with the Bayesian answer. Across all four diagnostic problems, similar results were obtained: 10% Bayesian responses in the probability format, and 46% in the frequency format. Frequency formats also changed the physicians' non-Bayesian inferences. When information was in the form of probabilities, the two dominant non-Bayesian strategies were subtracting the false positive rate from the sensitivity, or simply taking the sensitivity. Both strategies ignore base rates. 
With frequency formats, however, these two strategies largely disappeared, and physicians' dominant non-Bayesian strategies focused exclusively on base rates: on the base rate of the disease or on the base rate of a positive test. Frequency formats not only changed physicians' reasoning but also made them feel less nervous and more relaxed, and they needed less time to complete the task. We obtained essentially identical results when we tested lay people (students at the University of Salzburg, Austria, from various disciplines) with probability and frequency formats of the mammography problem (Gigerenzer & Hoffrage, 1995). Lay people and experienced physicians alike were equally helpless with probabilities and did not spontaneously translate probability information into frequencies. Some even retranslated frequencies into percentages, because they believed that this was the only correct way to represent uncertainty. A remarkable result was that when students worked on 30 problems where frequency and probability formats alternated randomly from problem to problem, students continued to fail on the probability formats and to solve the frequency formats at about the same rate, with little spontaneous transfer (Gigerenzer & Hoffrage, 1995). Even those who are experienced with statistics can have problems "seeing" through probabilities as easily as through frequencies. Colleagues who work with Bayes's theorem on a daily basis often falter when confronted with a specific problem to be solved on the spot. I grant that few people are skilled at mental arithmetic under any circumstances. But it is nevertheless noteworthy that experts as well as laymen seem to do better when calculations involve absolute frequencies rather than probabilities.

The moral of these results is not to blame physicians'
or students' minds when they stumble over probabilities. The lesson
is to represent information in textbooks, in curricula, and in physician-patient
interactions in frequency formats that correspond, according to my thesis,
to the way information was encountered in the environment in which human
minds evolved.

Colon Cancer

The fecal occult blood test is a widely used and well-known test for colon cancer. Windeler and Köbberling (1986) reported that physicians overestimated the (posterior) probability that a patient has colon cancer if the fecal occult blood test is positive, and that physicians overestimated the base rate of colon cancer, the sensitivity (hit rate), and the false positive rate of the test. Windeler and Köbberling asked these physicians about probabilities and percentages. Would a frequency format improve physicians' estimates of what a positive test reveals about the presence of colon cancer? The 48 physicians in the study reported above were given the best available estimates for the base rate, sensitivity, and false positive rate, as published in Windeler and Köbberling (1986). Here is a shortened version of the full text given to the physicians, which was structured like the mammography problem in Table 1. In the probability format, the information was:

The probability that a person has colon cancer is 0.3%. If a person has colon cancer, the probability that the test is positive is 50%. If a person does not have colon cancer, the probability that the test is positive is 3%. What is the probability that a person who tests positive actually has colon cancer?

If one inserts these values into Bayes's rule (Equation 1), the resulting probability is 4.8%. In the frequency format, the information was:

30 out of every 10,000 people have colon cancer. Of these 30 people with colon cancer, 15 will test positive. Of the remaining 9,970 people without colon cancer, 300 will still test positive. Imagine a group of people who test positive. How many of these will actually have colon cancer?
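A quick Python check confirms that both versions of the colon cancer problem lead to the same answer of about 4.8%. The only assumption beyond the text is the rounding noted in the comment: 3% of 9,970 is 299.1, which the frequency version states as 300.

```python
# Probability format: Equation 1 with the values given to the physicians.
base_rate, hit_rate, false_pos_rate = 0.003, 0.50, 0.03
p_format = (base_rate * hit_rate) / (
    base_rate * hit_rate + (1 - base_rate) * false_pos_rate)

# Frequency format: Equation 2 with the counts given to the physicians
# (15 true positives and 300 false positives among 10,000 people; the text
# rounds 9,970 x 3% = 299.1 to 300).
a, b = 15, 300
f_format = a / (a + b)

print(f"probability format: {p_format:.1%}")   # probability format: 4.8%
print(f"frequency format:   {f_format:.1%}")   # frequency format:   4.8%
```

The physicians' median estimate of 47% in the probability format is therefore about ten times too high, whereas the frequency version reduces the task to reading off 15 out of 315.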
1 out of 24 physicians (4%) could find the Bayesian answer, or anything
close to it. The median estimate was one order of magnitude higher,
namely 47%. When the information was presented in the frequency format,
16 out of 24 physicians (67%) came up with the Bayesian answer (details
are in Gigerenzer, 1996b; Hoffrage & Gigerenzer, in press).

Wife Battering

Alan Dershowitz, a Harvard professor of law who advised the defense in the O. J. Simpson trial, stated on U.S. television in March 1995 that only about a tenth of 1% of wife batterers actually murder their wives. I. J. Good, a leading Bayesian statistician, published a response in Nature to correct possible misunderstandings of what that statement implies for the probability that O. J. Simpson actually murdered his wife in 1994 (Good, 1995). Good's argument is that the relevant probability is not the probability that a husband murders his wife if he batters her, but the probability that a husband has murdered his wife if he battered her and if she was actually murdered by someone. More precisely, the relevant probability is not p(G|Bat) but p(G|Bat and M), where G stands for "the husband is guilty" (that is, did the murder in 1994), Bat means that "the husband battered his wife," and M means that "the wife was actually murdered by somebody in 1994."

My point concerns the way Good presented his argument, not the argument itself. Good presented the information in terms of single-event probabilities and odds (rather than in a frequency format). Good based his calculations of p(G|Bat and M) on the odds version of Bayes's rule, posterior odds = prior odds × likelihood ratio, which in the present case is

$$O(G \mid \text{Bat and } M) = O(G \mid \text{Bat}) \times \frac{p(M \mid G \text{ and Bat})}{p(M \mid \neg G \text{ and Bat})}$$

where O denotes odds. The following six equations, marked Good-1 to Good-6, were Good's method of explaining to the reader how to estimate p(G|Bat and M). Good started with Dershowitz' figure of a tenth of 1% and argued that if the husband commits the murder, the probability is at least 1/10 that he will do it in 1994:[4]

$$p(G \mid \text{Bat}) > \tfrac{1}{10{,}000} \tag{Good-1}$$

Therefore the prior odds (O) are

$$O(G \mid \text{Bat}) > \tfrac{1}{9{,}999} \approx \tfrac{1}{10{,}000} \tag{Good-2}$$

Furthermore, the probability of a woman being murdered given that her husband has murdered her (whether he is a batterer or not) is unity:

$$p(M \mid G \text{ and Bat}) = p(M \mid G) = 1 \tag{Good-3}$$

Because there are about 25,000 murders per year in the U.S. population of about 250,000,000, Good estimates the probability of a woman being murdered, but not by her husband, as

$$p(M \mid \neg G \text{ and Bat}) \approx p(M \mid \neg G) \approx \tfrac{1}{10{,}000} \tag{Good-4}$$

From Equations (Good-3) and (Good-4) it follows that the likelihood ratio is about 10,000/1; therefore the posterior odds can be calculated:

$$O(G \mid M \text{ and Bat}) > \tfrac{10{,}000}{10{,}000} = 1 \tag{Good-5}$$

That is, the probability that a murdered, battered wife was killed by her husband is

$$p(G \mid \text{Bat and } M) > \tfrac{1}{2} \tag{Good-6}$$

Good's point is that "most members of a jury or of the public, not being familiar with elementary probability theory, would readily confuse this with P(G|Bat), and would thus be badly misled by Dershowitz' statement" (Good, 1995, p. 541). He adds that he sent a copy of this note to both Dershowitz and the Los Angeles Police Department and reminds us that Bayesian reasoning should be taught at the precollege level. Good's persuasive argument, I believe, could be understood more easily by his readers and the Los Angeles Police Department if the information were presented in a frequency format rather than by the single-event probabilities and odds in the six equations. As with breast cancer and colon cancer, one way to represent information in a frequency format is to start with a concrete sample of persons and break it down into subclasses, in the same way as it would be experienced by natural sampling. Here is a frequency version of Good's argument.

Good's Argument in a Frequency Format

Think of 10,000 battered women. Within one year, at least one will be murdered by her husband. Of the remaining 9,999 who are not killed by their husbands, one will be murdered by someone else. Thus, we expect at least two battered women to be murdered, at least one by her husband and one by someone else. Therefore, the probability p(G|Bat and M) that a murdered, battered woman was killed by her husband is at least 1/2. In a frequency format, Good's argument is short and transparent.
My conjecture is that more ordinary people, including employees of the
Los Angeles Police Department, could understand and communicate the
argument if the information is represented in a frequency format rather
than in probabilities or odds.

AIDS Counseling

Under the headline "A false HIV test caused 18 months of hell," the Chicago Tribune (3/5/1993) published the following letter and response:

Dear Ann Landers: In March 1991, I went to an anonymous testing center for a routine HIV test. In two weeks, the results came back positive. I was devastated. I was 20 years old and doomed. I became severely depressed and contemplated a variety of ways to commit suicide. After encouragement from family and friends, I decided to fight back. My doctors in Dallas told me that California had the best care for HIV patients, so I packed everything and headed west. It took three months to find a doctor I trusted. Before this physician would treat me, he insisted on running more tests. Imagine my shock when the new results came back negative. The doctor tested me again, and the results were clearly negative. I'm grateful to be healthy, but the 18 months I thought I had the virus changed my life forever. I'm begging doctors to be more careful. I also want to tell your readers to be sure and get a second opinion. I will continue to be tested for HIV every six months, but I am no longer terrified.
David in Dallas

Dear Dallas: Yours is truly a nightmare with a happy ending, but don't blame the doctor. It's the lab that needs to shape up. The moral of your story is this: Get a second opinion. And a third. Never trust a single test. Ever.
Ann Landers
David does not mention what his Dallas doctors told him about the chances that he actually had the virus after the positive test. But he seems to have inferred that a positive test means that he has the virus, period. In fact, a study of AIDS counseling in Germany found that many doctors and social workers (erroneously) tell their clients that a positive test implies the virus is present (Gigerenzer, Hoffrage & Ebert, 1995). These counselors know that a single ELISA (enzyme-linked immunosorbent assay) test can produce a false positive, but they erroneously assume that the whole series of ELISA and Western blot tests would wipe out every false positive.

How could a doctor have explained the actual risk to David and saved him the nightmare? I do not have the statistics for Dallas, so I use the German figures for illustration. (The specific numbers are not the point here.) In Germany, the prevalence of HIV infection in heterosexual men aged 20-30 who belong to no known risk group is estimated at about 0.01%. The corresponding base rate for homosexual men is estimated at about 1.5%. The hit rate (sensitivity) of the typical test series (repeated ELISA and Western blot tests) is estimated at about 99.99%. The estimates of the false positive rate vary somewhat; a reasonable estimate seems to be 0.01% (Gigerenzer, Hoffrage & Ebert, 1995). Given these values, and assuming that David was, at the time of the routine HIV test, a heterosexual man with an average sex life, what is the probability that he actually had the virus after testing positive? If his physician had actually given David these probabilities, David might nevertheless not have understood what to conclude. But the physician could have communicated the information in a frequency format. He might have said: "Your situation is the following. Think of 10,000 heterosexual men like you. We expect one to be infected with the virus, and he will, with practical certainty, test positive.
From the 9,999 men who are not infected, one more will test positive. Thus we get two who test positive, but only one of them actually has the virus. This is your situation. The chances that you actually have the virus after the positive test are around 50%." If the physician had explained the risk in this way, David might have understood that there was, as yet, no reason to contemplate suicide, or to move to California. If David had been in a risk group, say a homosexual man with a 1.5% base rate of HIV infection, the estimate would be different. Here, the physician might have explained, "Think of 10,000 homosexual men. We expect 150 to be infected with the virus, and every one of these men will test positive with practical certainty. From the 9,850 men who are not infected, we expect that one other will test positive. Thus we have 151 men who test positive, and 150 of these do have the virus. Your chances of not having the virus are 1 out of 151, that is, less than one percent." Still, David might be the lucky one, and this luck is certainly more likely than a lottery win.

We do not know what risk group David was in. However,
whatever the statistics are, most people of average intelligence can
understand the risk of HIV after a positive test when the numbers are
represented by a counselor in a frequency format. Not one of the 20
AIDS counselors studied by Gigerenzer et al. (1996), however, explained
the client's risk to him in frequencies. Ann Landers' answer -- don't blame
the doctor, blame the lab -- overlooks the fact that, whatever the reasons
for false positives (such as blood samples being confused in the lab),
a doctor should inform the patient that false positives occur, and about
how many.

Expert Witnesses

Evidentiary problems such as the evaluation of eyewitness testimony constituted one of the first domains of probability theory (Gigerenzer et al., 1989, chap. 1). Statisticians have taken the stand as expert witnesses for almost a century now: in the Dreyfus case in late 19th-century France, or more recently, in People v. Collins in California (Gigerenzer et al., 1989, chap. 7; Koehler, 1992). The convictions in both cases were ultimately reversed and the statistical arguments discredited. Part of the problem seems to have been that the statistical arguments were couched in probabilities rather than frequencies, which confused both the prosecution, who were making the arguments, and the jury and the judges, who tried to understand them. I will explain this point with the case of a chimney sweep who was accused of having committed a murder in Wuppertal, Germany. The Rheinische Merkur (No. 39, 1974) reported:

On the evening of July 20, 1972, the 40-year-old Wuppertal painter Wilhelm Fink and his 37-year-old wife Ingeborg took a walk in the woods and were attacked by a stranger. The husband was hit by three bullets in the throat and the chest, and fell down. Then the stranger attempted to rape his wife. When she defended herself and, unexpectedly, the shot-down husband got back on his feet to help her, the stranger shot two bullets into the wife's head and fled. Three days later, 20 kilometers from the scene of the crime, a forest ranger discovered the car of Werner Wiegand, a 25-year-old chimney sweep who used to spend his weekends there. The husband, who had survived, at first thought he recognized the chimney sweep in a photo, but became undecided in a confrontation, and later tended to think that another suspect was the murderer. But when the other suspect was proven to be innocent, the prosecution came back to the chimney sweep and put him on trial. 
The chimney sweep had no previous convictions and denied being the murderer. The Rheinische Merkur described the trial: After the experts had testified and explained their "probability theories," the case seemed to be clear: Wiegand, despite his denial, must have been the murderer. Dr. Christian Rittner, a lecturer at the University of Bonn, evaluated the traces of blood as follows: 17.29% of German citizens share Wiegand's blood group, traces of which have been found underneath the fingernails of the murdered woman; 15.69% of Germans share [her] blood group that was also found on Wiegand's boots; based on a so-called "cross-combination" the expert subsequently calculated an overall probability of 97.3% that Wiegand "can be considered the murderer." And concerning the textile fiber traces which were found both on Wiegand's clothes and on those of the victim [...] Dr. Ernst Röhm from the Munich branch of the State Crime Department explained: "The probability that textile microfibers of this kind are transmitted from a human to another human who was not in contact with the victim is at most 0.06%. From this results a 99.94% certainty for Wiegand being the murderer...." Both expert witnesses agreed that with a high probability, the chimney sweep was the murderer. These expert calculations, however, collapsed when the court discovered that the defendant was at the time of the crime in his hometown, 100 kilometers away from the scene of the crime. So what was wrong with the expert calculations? One can dissolve the confusion at court by representing the uncertainties in a frequency format. Let us assume that the blood underneath the fingernails of the victim was indeed the blood of the murderer, that the murderer carried traces of the victim's blood (as the expert witnesses assumed), and that there were 10,000,000 men in Germany who could have committed the crime (Schrage, n.d.). 
Assume further that on 1 of every 100 of these men a close examination would find microscopic traces of foreign blood, that is, on 100,000 men. Of these, some 15,690 men (15.69%) will have traces from blood that is of the victim's blood type. Of these 15,690 men, some 2,710 (17.29%) will also have the blood type that was found underneath the victim's fingernails (here I assume independence between the two pieces of evidence). Thus, there are some 2,710 men (including the murderer) who might appear guilty from the two pieces of blood evidence, and the chimney sweep is one of those. Therefore, the probability that the chimney sweep was the murderer given the two pieces of blood evidence is about 1 in 2,710, and not 97.3%, as the first expert witness testified. The same method can be applied to the textile traces. Assume the second expert witness was correct when he said that the probability that the chimney sweep carries the textile trace if he is not the murderer is at most 0.06%, and that the murderer actually carries that trace. Then some 6,000 of the 10,000,000 men would carry this textile trace, and only one of them is the murderer. Thus, the probability that the chimney sweep is the murderer given the textile fibers is about 1 in 6,000, and not 99.94%, as the second expert witness testified. What if one combines both the blood and the textile evidence
together, which seems not to have happened at the trial? One of the
2,710 men who match both pieces of blood evidence is the murderer, and
he will show the textile traces. Of the remaining innocent men, we expect
that one or two (0.06% of them) will also show the textile traces (assuming
mutual independence of the three pieces of evidence). Thus, there will be
two or three men who satisfy all three types of evidence, and one of them
is the murderer. Therefore, the probability that the chimney sweep is the
murderer given the two pieces of blood evidence and the textile evidence
is between 1/3 and 1/2. A probability of this size does not establish guilt
beyond reasonable doubt.

Cognitive Illusions

Frequency formats not only make everyday inferences easier,
they also tend to make "cognitive illusions" of the laboratory
type largely disappear. I have summarized this evidence elsewhere (Gigerenzer,
1991, 1994). One example is a cognitive illusion called "overconfidence
bias." Students were given questions such as "Which city has
more inhabitants: Islamabad or Hyderabad?" and asked to estimate
the probability (confidence) that their answer was correct. The typical
result was that when students said they were 100% confident, they were correct
in only about 85% of cases; when they said they were 90% confident, they
were correct in only 75%, and so on (Lichtenstein, Fischhoff & Phillips,
1982). This discrepancy between probability and frequency was labeled
the "overconfidence bias," and human disasters of many kinds,
from deadly accidents in industry to errors in the legal process, have
been attributed to that "cognitive illusion." However, when
we replaced the probability judgments by frequency judgments, the apparently
stable cognitive illusion disappeared: Students were asked after every
50 questions to estimate how many they got correct, and these frequency
judgments no longer overestimated the actual frequencies of correct
answers but turned out to be fairly accurate, even with a tendency
towards underestimation (Gigerenzer, Hoffrage, & Kleinbölting,
1991). A second example is a medical disease problem with which Casscells,
Schoenberger and Grayboys (1978) had demonstrated base rate neglect
by staff and students at Harvard Medical School. Cosmides and Tooby
(1996) replaced the probabilities with frequencies, and Bayesian responses
increased from 12% (in the original probability format) to 76% in the
frequency format. An instruction to visualize frequencies in a grid
boosted the performance up to 92%. More generally, frequency formats
reduce the "base rate fallacy" in the cab problem and similar
"toy" problems (Gigerenzer, 1994; Gigerenzer & Hoffrage,
1995). A third and final example is the "Linda" problem. People
read a description of Linda that suggests she is a feminist, and thereafter
are asked which is more probable, (a) Linda is a bank teller, or (b)
Linda is a bank teller and active in the feminist movement. Some 80%
to 90% of subjects usually choose (b), a response which Tversky and Kahneman
(1983) labeled the "conjunction fallacy," because the probability
of a conjunction of two events (teller and feminist) cannot be larger
than the probability of one of these events (teller). This "conjunction
fallacy" in the Linda problem and related tasks, however, largely
disappeared when people were asked for judgments of frequencies: Think
of 200 women like Linda. How many of them are (a) bank tellers? (b)
bank tellers and active in the feminist movement? Replacing probabilities
by frequencies made conjunction violations drop from 80-90% to 10-20%
(Hertwig & Gigerenzer, 1996; similar results were reported
by Fiedler, 1988; Tversky & Kahneman, 1983). The effect of frequency
representations and judgments on "cognitive illusions" is
the strongest and most consistent "debiasing method" known
today.

Teaching Statistical Reasoning

The teaching of statistical reasoning is, like that of reading and writing, part of forming an educated citizenry. Our technological world, with its abundance of statistical information, makes the art of dealing with uncertain information particularly relevant. Reading and writing are taught to every child in modern Western democracies, but statistical thinking is not (Shaughnessy, 1992). The result has been termed "innumeracy" (Paulos, 1988). But can statistical reasoning be taught? Previous studies that attempted to teach Bayesian inference, mostly by corrective feedback, had little or no training effect (e.g., Peterson, DuCharme, & Edwards, 1968; Schaefer, 1976), a result that seems to be consistent with the view that the mind does not naturally reason the Bayesian way. However, the argument developed in this chapter suggests a "natural" method of teaching: instruct people how to represent probability information in a frequency format. Recall that students and physicians alike did not do this spontaneously, with very few exceptions (e.g., Gigerenzer & Hoffrage, 1995). Peter Sedlmeier and I designed a tutorial program that teaches Bayesian reasoning, based on the assumption that cognitive algorithms have evolved for dealing with frequency formats (Sedlmeier, in press; Sedlmeier & Gigerenzer, 1996). The goal of this tutorial is to teach participants how to reason the Bayesian way when the information is represented in probabilities, as is usually the case in newspapers, medical textbooks, and other information sources. The computerized tutorial instructs participants in how to represent the probability information in terms of a frequency format, rather than teaching them how to insert probabilities into Bayes's rule (in the form of Equation 1). It consists of two parts. 
In the first part, participants are shown how to translate probability information into frequencies, visually aided by a frequency tree (or a frequency grid); the method is illustrated by two medical problems, one of them the mammography problem. In the second part, participants solve eight other problems, with step-by-step guidance on what to do as well as step-by-step feedback. If participants have difficulties, the system provides immediate help that ensures that every participant solves all problems correctly. We conducted an evaluation study with four groups: two groups were taught how to represent probabilities in a frequency format (visually demonstrated by a frequency tree and a frequency grid, respectively), one control group was taught how to insert probabilities into Bayes's rule using a similar computer tutorial ("rule training"), and a second control group received no training. Sixty-two University of Chicago students participated in the study. In the rule-training group, the median percentage of Bayesian solutions increased from 0% (pretraining baseline) to 35% after training, and the value for the no-training control group increased slightly, from 0% to 5%. In the two groups that learned to construct frequency representations with frequency trees or frequency grids, the median percentage of Bayesian solutions increased from 0% and 5% to 80% and 70%, respectively. Thus, the immediate success of the frequency tutorials was about twice as high as that of the rule training. But did the performance last over time, or was it subject to the usual steep forgetting curve following a successful test? In a five-week follow-up, the median performance of the rule-training group was down to a mere 15%, but not so the effect of teaching frequency representations. The median performance of each of the two frequency representation groups five weeks after training was a strong 90%. Thus, there is evidence that (what I take to be) the
natural format of information in the environment in which humans evolved
can be used to teach people how to deal with probability information.
This may be good news for instructors who plan to design pre-college-level
curricula that teach young people how to infer risks in a technological
world, and for those unfortunate souls among us charged with teaching
undergraduate statistics.

Summing Up

Information needs representation. If a representation
is recurrent and stable during human evolution, one can expect that
mental algorithms are designed to operate on this representation. In
this chapter, I applied this argument to the understanding of human
inferences under uncertainty. The thesis is that mental algorithms were
designed for frequency formats, which were the recurrent format of information
until very recently. I have dealt with a specific class of inferences
that correspond to a simple form of Bayesian inference, where one of
several possible states is inferred from one or a few cues. Here mental
computations are simpler when information is encountered in the same
form as in the environment in which our ancestors evolved, rather than
in the modern form of probabilities or percentages. The evidence from
a broad variety of everyday situations and laboratory experiments shows
that frequencies can make human minds smarter.
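The class of inference summarized above, inferring one state from one or a few cues (simplified here to two states), amounts to counting down a frequency tree. The sketch below is a minimal illustration under stated assumptions, not the author's own procedure: the names `Cue` and `natural_frequency_posterior` are hypothetical, and, as in the chapter's worked examples, the cues are assumed to be conditionally independent so that each one can be applied as a single multiplication. The inputs are the HIV figures and the Wuppertal trace-evidence figures used earlier in this chapter.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cue:
    """One piece of evidence, described by how often it occurs in each group."""
    hit_rate: float          # P(cue | hypothesis true), e.g. test sensitivity
    false_alarm_rate: float  # P(cue | hypothesis false), e.g. false-positive rate


def natural_frequency_posterior(population: float, base_rate: float,
                                cues: list[Cue]) -> tuple[float, float, float]:
    """Hypothetical helper: follow expected counts down a frequency tree.

    Returns (cases where the hypothesis is true and every cue is present,
             cases where it is false and every cue is present,
             posterior = first count / (first count + second count)).
    Assumes the cues are conditionally independent given the hypothesis.
    """
    true_cases = population * base_rate     # e.g. infected men, or the one murderer
    false_cases = population - true_cases   # everyone else
    for cue in cues:
        true_cases *= cue.hit_rate          # keep only cases that also show this cue
        false_cases *= cue.false_alarm_rate
    return true_cases, false_cases, true_cases / (true_cases + false_cases)


if __name__ == "__main__":
    # HIV test series, German estimates: sensitivity ~99.99%, false-positive rate ~0.01%.
    hiv_test = Cue(hit_rate=0.9999, false_alarm_rate=0.0001)
    for label, prevalence in [("no known risk group", 0.0001), ("risk group", 0.015)]:
        infected, not_infected, p = natural_frequency_posterior(10_000, prevalence, [hiv_test])
        print(f"{label}: {infected:.1f} infected and {not_infected:.1f} not infected "
              f"test positive; P(virus | positive) = {p:.3f}")

    # Wuppertal case: 10,000,000 possible perpetrators, one of whom is the murderer,
    # who is assumed to show every trace (hit rate 1). An innocent man matches the
    # blood evidence if he carries foreign blood (1%) of the victim's type (15.69%)
    # and also has the blood type found under the fingernails (17.29%).
    blood = Cue(hit_rate=1.0, false_alarm_rate=0.01 * 0.1569 * 0.1729)
    textile = Cue(hit_rate=1.0, false_alarm_rate=0.0006)
    for label, cues in [("blood", [blood]), ("textile", [textile]),
                        ("blood and textile", [blood, textile])]:
        _, innocent, p = natural_frequency_posterior(10_000_000, 1 / 10_000_000, cues)
        print(f"{label}: about {innocent:,.1f} innocent men match; "
              f"P(murderer) = {p:.4f}, about 1 in {1 / p:,.1f}")
```

With these inputs the low-risk HIV case yields about one infected and one uninfected man among those who test positive (a posterior near 50%), the blood evidence leaves about 2,713 innocent matches, the textile evidence about 6,000, and all three traces together leave between two and three matching men, in line with the rounded figures in the text. If the pieces of evidence were correlated, the joint frequencies would be needed instead of a product of rates.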
Author's Note

I would like to thank Ralph Hertwig, Ulrich Hoffrage, Jim Magnussen, Laura Martignon, and Anita Todd for their helpful comments.
Table 1: Frequency format and probability format of the mammography problem.

To facilitate early detection of breast cancer, women are encouraged from a particular age on to participate at regular intervals in routine screening, even if they have no obvious symptoms. Imagine you conduct in a certain region such a breast cancer screening using mammography. For symptom-free women aged 40 to 50 who participate in screening using mammography, the following information is available for this region:

Probability format

The probability that one of these women has breast cancer is 1%. If a woman has breast cancer, the probability is 80% that she will have a positive mammography test. If a woman does not have breast cancer, the probability is 10% that she will still have a positive mammography test. Imagine a woman (aged 40 to 50, no symptoms) who has a positive mammography test in your breast cancer screening. What is the probability that she actually has breast cancer? _____%

Frequency format

Ten out of every 1,000 women have breast cancer. Of these 10 women with breast cancer, 8 will have a positive mammography test. Of the remaining 990 women without breast cancer, 99 will still have a positive mammography test. Imagine a sample of women (aged 40 to 50, no symptoms) who have positive mammography tests in your breast cancer screening. How many of these women do actually have breast cancer? ___ out of ____
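Table 1 poses the same problem in two formats. As a minimal check that both lead to the same answer, the sketch below (the function names are hypothetical) computes it once by Bayes's rule on the probabilities and once by counting in the frequency format, using Python's exact `fractions.Fraction` so that no rounding gets in the way. All numbers are taken from the table.

```python
from fractions import Fraction


def bayes_probability_format(prior: Fraction, sensitivity: Fraction,
                             false_positive_rate: Fraction) -> Fraction:
    """Hypothetical helper: Bayes's rule applied to single-event probabilities."""
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)


def frequency_format(cancer_and_positive: int, no_cancer_and_positive: int) -> Fraction:
    """Hypothetical helper: women with cancer among all women who test positive."""
    return Fraction(cancer_and_positive, cancer_and_positive + no_cancer_and_positive)


if __name__ == "__main__":
    # Probability format: 1% base rate, 80% hit rate, 10% false-positive rate.
    p = bayes_probability_format(Fraction(1, 100), Fraction(80, 100), Fraction(10, 100))
    # Frequency format: 8 of the 10 women with cancer and
    # 99 of the 990 women without cancer test positive.
    f = frequency_format(8, 99)
    print(p, f, float(f))  # both are 8/107, roughly 0.075
    assert p == f
```

Both routes give 8 out of 107, roughly 7.5%. What differs is the computation: the frequency version needs only the two counts of positive tests, whereas the probability version first has to multiply each conditional probability by the base rate (or its complement) before normalizing.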
References

Barsalou, L. W., & Ross, B. H. (1986). The roles of automatic and strategic processing in sensitivity to superordinate and property frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 116-134.
Brase, G., Cosmides, L., & Tooby, J. (in press). Individuation, counting, and statistical inference: The role of frequency and whole object representation in judgment under uncertainty. Journal of Experimental Psychology: General.
Casscells, W., Schoenberger, A., & Grayboys, T. (1978). Interpretation by physicians of clinical laboratory results. New England Journal of Medicine, 299, 999-1000.
Cohen, L. J. (1981). Can human irrationality be experimentally demonstrated? The Behavioral and Brain Sciences, 4, 317-370.
Cosmides, L., & Tooby, J. (1996). Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition, 58, 1-73.
Daston, L. J. (1988). Classical probability in the Enlightenment. Princeton, NJ: Princeton University Press.
Ebert, A. (1995). Base Rate Neglect im Kontext der Risikokommunikation bei der Beratung zum HIV-Test: Ein Feldexperiment in deutschen Gesundheitsämtern. [Base rate neglect in the context of risk communication concerning the HIV test: A field experiment in German Health Departments.] Diploma thesis, University of Salzburg.
Estes, W. K. (1976). The cognitive side of probability learning. Psychological Review, 83, 37-64.
Fiedler, K. (1988). The dependence of the conjunction fallacy on subtle linguistic factors. Psychological Research, 50, 123-129.
Gallistel, C. R., & Gelman, R. (1992). Preverbal and verbal counting and computation. Cognition, 44, 43-74.
Gigerenzer, G. (1991). How to make cognitive illusions disappear: Beyond "heuristics and biases". In W. Stroebe & M. Hewstone (Eds.), European Review of Social Psychology, volume 2 (pp. 83-115). Chichester, England: Wiley.
Gigerenzer, G. (1994). Why the distinction between single-event probabilities and frequencies is relevant for psychology (and vice versa). In G. Wright & P. Ayton (Eds.), Subjective probability (pp. 129-162). New York: Wiley.
Gigerenzer, G. (1996a). On narrow norms and vague heuristics: A reply to Kahneman and Tversky (1996). Psychological Review, 103, 592-596.
Gigerenzer, G. (1996b). The psychology of good judgment: Frequency formats and simple algorithms. Journal of Medical Decision Making, 16, 000-000.
Gigerenzer, G., & Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102, 684-704.
Gigerenzer, G., & Hoffrage, U. (1996). How to improve diagnostic inferences in physicians. Manuscript submitted for publication.
Gigerenzer, G., Hoffrage, U., & Ebert, A. (1996). Counseling for AIDS: What is the client told about the meaning of a positive HIV test? Manuscript submitted for publication.
Gigerenzer, G., Hoffrage, U., & Kleinbölting, H. (1991). Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review, 98, 506-528.
Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. Cambridge: Cambridge University Press.
Good, I. J. (1995). When batterer turns murderer. Nature, 375, 541.
Gould, S. J. (1992). Bully for brontosaurus: Further reflections in natural history. New York: Penguin Books.
Hasher, L., & Zacks, R. T. (1984). Automatic processing of fundamental information. American Psychologist, 39, 1372-1388.
Hertwig, R., & Gigerenzer, G. (1996). The "conjunction fallacy" revisited: Polysemy, conversational maxims, and frequency judgments. Manuscript. Max Planck Institute for Psychological Research, Munich.
Hintzman, D. L., & Block, R. A. (1972). Repetition and memory: Evidence for a multiple trace hypothesis. Journal of Experimental Psychology, 88, 297-306.
Hoffrage, U., & Gigerenzer, G. (in press). The impact of information representation on Bayesian reasoning. Proceedings of the Cognitive Science Society.
Jonides, J., & Jones, C. M. (1992). Direct coding for frequency of occurrence. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 368-378.
Jungermann, H. (1983). The two camps on rationality. In R. W. Scholz (Ed.), Decision making under uncertainty (pp. 63-86). Amsterdam: Elsevier.
Kahneman, D., & Tversky, A. (1996). On the reality of cognitive illusions: A reply to Gigerenzer's critique. Psychological Review, 103, 582-591.
Kleiter, G. D. (1994). Natural sampling: Rationality without base rates. In G. H. Fischer & D. Laming (Eds.), Contributions to mathematical psychology, psychometrics, and methodology (pp. 375-388). New York: Springer.
Koehler, J. J. (1992). Probabilities in the courtroom: An evaluation of the objections and policies. In D. K. Kagehiro & W. S. Laufer (Eds.), Handbook of psychology and law (pp. xxx-xxx). New York: Springer.
Koehler, J. J. (1996). The base rate fallacy reconsidered: Descriptive, normative and methodological challenges. Behavioral and Brain Sciences, 19, 1-53.
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of probabilities: The state of the art to 1980. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 306-334). Cambridge: Cambridge University Press.
Lopes, L. L. (1991). The rhetoric of irrationality. Theory and Psychology, 1, 65-82.
Nesse, R. M., & Williams, G. C. (1995). Why we get sick: The new science of Darwinian medicine. New York: Vintage Books.
Paulos, J. A. (1988). Innumeracy: Mathematical illiteracy and its consequences. New York: Vintage Books.
Peterson, C. R., DuCharme, W. M., & Edwards, W. (1968). Sampling distributions and probability revision. Journal of Experimental Psychology, 76, 236-243.
Piattelli-Palmarini, M. (1994). Inevitable illusions: How mistakes of reason rule our minds (M. Piattelli-Palmarini & K. Botsford, Trans.). New York: Wiley.
Schaefer, R. E. (1976). The evaluation of individual and aggregated subjective probability distributions. Organizational Behavior and Human Performance, 17, 199-210.
Schrage, G. (n.d.). Schwierigkeiten mit der stochastischen Modellbildung -- zwei Beispiele aus der Praxis. [Problems with stochastic models -- two examples from real life.] Unpublished manuscript.
Sedlmeier, P. (in press). BasicBayes: A tutor system for simple Bayesian inference. Behavior Research Methods, Instruments, & Computers.
Sedlmeier, P., & Gigerenzer, G. (1996). Teaching Bayesian reasoning in less than two hours. Manuscript submitted for publication.
Shaughnessy, J. M. (1992). Research on probability and statistics: Reflections and directions. In D. A. Grouws (Ed.), Handbook of research on mathematical teaching and learning (pp. 465-499). New York: Macmillan.
Shepard, R. N. (1992). The perceptual organization of colors: An adaptation to regularities of the terrestrial world? In J. H. Barkow, L. Cosmides, & J. Tooby (Eds.), The adapted mind: Evolutionary psychology and the generation of culture (pp. 495-532). New York: Oxford University Press.
Stigler, S. M. (1983). Who discovered Bayes's Theorem? American Statistician, 37(4), 290-296.
Tooby, J., & Cosmides, L. (1992). The psychological foundations of culture. In J. Barkow, L. Cosmides, & J. Tooby (Eds.), The adapted mind: Evolutionary psychology and the generation of culture (pp. 19-136). New York: Oxford University Press.
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293-315.
Windeler, J., & Köbberling, J. (1986). Empirische Untersuchung zur Einschätzung diagnostischer Verfahren am Beispiel des Haemoccult-Tests. [An empirical study of the judgments about diagnostic procedures using the example of the Hemoccult test.] Klinische Wochenschrift, 64, 1106-1112.
Contact Author
Gerd Gigerenzer
This is an electronic archival version of a published print book chapter. Please cite according to the published version.
Update 6/2001 » webmaster-library(at)mpib-berlin.mpg.de » ©Copyright