Don’t trust me, I’m an expert

In 2002, when I visited Santa Barbara, I went to a grocery store called Trader Joe’s. It had its own line of milk. Trader Joe’s Vitamin D Milk (Grade A, pasteurized, homogenized) had some ‘nutrition facts’ on the label, one of which was:

“This milk does not contain the growth hormone rBST”.

The first time I saw the label, I wondered what on earth rBST could be. I examined the bottle more closely. The cap said something different. “Our Farmers Guarantee MILK from cows not treated with rbst No significant difference has been shown between milk derived from cows treated with artificial hormones and those not treated with artificial hormones”.

rBST (or rBGH, as it is sometimes labelled) is a genetically engineered growth hormone that can increase milk production from cows.

My search on the web at the time revealed that the US Food and Drug Administration (FDA) concluded in 1985 that milk from rBST-supplemented cows is safe. In November 1993, the FDA approved rBST for milk production in dairy cows.

Trade in products related to rBST was a contentious issue between the United States and Europe. Dairies such as those supplying Trader Joe’s in the United States were permitted to use rBST-free labels as long as they added a disclaimer that no harmful human health effects have been linked to the hormone.

My first reading of Trader Joe’s bottle cap suggested to me that someone in authority (it doesn’t say who) thinks rBST milk is okay. There are a few other things the bottle cap doesn’t say. It doesn’t imply there are no effects on the cows, just no effects on their milk. It doesn’t say there are no differences in the milk, just that there are no “significant” differences. And it doesn’t even say that there are no “significant” differences, but that no significant differences “have been shown”. All the authority is willing to admit is that they haven’t found a “significant” difference.

I have absolutely no idea whether rBST milk is harmful to people or beasts. I have nothing against GMOs, in principle. But a shred of common sense would lead anyone (other than a scientist) to ask, “Well, how hard have you looked for a difference? And what do you mean by ‘significant’?” Effectively, science doesn’t have answers to these questions. Technically (theoretically) it does, but in fact, very few working biologists have the faintest idea of how to come up with the answers.

This failing is best characterized as a disease. It is present to varying extents in the scientific community, is spread by textbooks and editorial conventions, has readily identifiable symptoms, and results in costly and debilitating outcomes.

Scientists are human

In the 1960s, Daniel Kahneman and Amos Tversky began to look at how people make decisions under uncertainty. They found that behind the illusion of rational thought lay a psychological pathology of unexpected proportions. People, it turns out, are barely rational, even in life and death circumstances. Kahneman and Tversky’s observations led a cohort of cognitive psychologists to explore the vagaries of decision-making over the next few decades.

Their work has produced some wonderful generalizations. Some are funny. Others are plain depressing. The mistakes that people make can be summarized under headings that make a kind of pathology that is identifiable and predictable (and perhaps even treatable). Not everyone reacts in seemingly unreasonable ways, and not in all circumstances. But most people do, most of the time.

Some of the primary symptoms especially relevant to scientists include:

• Insensitivity to sample size: scientists ascribe inferences to their samples that are only justifiable from much larger samples.

• Judgement bias and overconfidence: people tend to be optimistic about their ability to predict, and often they make “predictions” that are, in fact, the product of hindsight.

• Anchoring: people tend to stick close to the number they first thought of, or that someone else said, for fear of seeming unconvincing or capricious.

Kahneman, Tversky and their colleagues also found that much of the apparent arbitrariness of decisions can be explained by the context in which decisions are set.

In 1999, Kammen and Hassenzahl described a beautiful contradiction that illustrates the importance of context. Two artificial substances were in our food in the 1980s: Saccharin (used to sweeten the taste of things such as diet soda) and Alar (a pesticide used on apple and pear crops).

During the 1970s, the US FDA banned Saccharin because it was a potential human carcinogen. Congress passed legislation to make Saccharin legal after public outcry.

In contrast, in the 1980s, the EPA concluded the amount of Alar reaching consumers was too small to warrant banning it. A public interest group released a report that children are particularly susceptible because they weigh less, consume a lot of apples and apple juice, and are more susceptible to toxins than adults. The public outcry that followed the release of the report convinced the maker of Alar to withdraw it from the market.

Take the checklist of evidence for Saccharin first:

• In high doses, it causes cancer in rats.

• The US FDA concluded that studies indicated cancer-forming potential in humans, giving a ‘very small’ (10⁻⁵ to 10⁻⁷) additional lifetime risk of dying from cancer at ‘normal’ (average) consumption levels.

• There is an additional 4.6 × 10⁻⁴ risk for someone with high exposure, such as someone who drinks a can of diet soda each day.

Next, look at the checklist of evidence for Alar:

• In high doses, it causes cancer in rats.

• The US EPA concluded that studies indicated cancer-forming potential in humans, giving a ‘very small’ (10⁻⁵ to 10⁻⁷) additional lifetime risk of dying from cancer at ‘normal’ (average) consumption levels.

• There is an additional 3 × 10⁻⁴ risk for someone with high exposure, such as someone who consumes a lot of apple products (such as children).

The two chemicals have nearly the same potential to form cancers and appear in the general population with the same kinds of exposures. Yet the extrapolations from high doses to low doses in Saccharin were ridiculed, whereas the same extrapolations for Alar were accepted.
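To make the comparison concrete, the high-exposure figures quoted above can be turned into expected additional cancer deaths in a group of highly exposed people. The sketch below is only arithmetic on the numbers in the text. The cohort size of 100,000 and the function names are my own illustrative choices, not anything taken from the FDA or EPA assessments.

```python
# Additional lifetime cancer risk for highly exposed people, as quoted in the text.
HIGH_EXPOSURE_RISK = {
    "Saccharin": 4.6e-4,
    "Alar": 3e-4,
}

COHORT = 100_000  # illustrative cohort size (an assumption, not from the source)


def expected_extra_deaths(risk, cohort=COHORT):
    """Expected additional cancer deaths in a cohort exposed at this lifetime risk."""
    return risk * cohort


def risk_ratio(first, second):
    """How many times larger the first chemical's high-exposure risk is."""
    return HIGH_EXPOSURE_RISK[first] / HIGH_EXPOSURE_RISK[second]


for name, risk in HIGH_EXPOSURE_RISK.items():
    deaths = expected_extra_deaths(risk)
    print(f"{name}: about {deaths:.0f} extra deaths per {COHORT:,} highly exposed people")

print(f"Saccharin / Alar risk ratio: {risk_ratio('Saccharin', 'Alar'):.2f}")
```

On these figures the two high-exposure risks differ by a factor of about 1.5, which is small next to the hundredfold range (10⁻⁵ to 10⁻⁷) quoted for average consumption of either chemical.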

One chemical was banned and people demanded its return. The other was deemed safe and people demanded that it be withdrawn. Why?

Cognitive psychologists and risk analysts such as Paul Slovic and his colleagues delight in explaining these apparent contradictions. People were used to Saccharin and could accept the risk knowingly (you don’t have to buy diet soda).

In contrast, people could avoid Alar only by avoiding apple products altogether. In addition, Saccharin has benefits, such as reducing problems for diabetics and reducing the risk of heart disease in overweight people.

Alar didn’t improve apples, except perhaps by making them cheaper. Some growers’ claims of Alar-free apples turned out to be false, eroding trust. In addition, the most susceptible group was children.

Scientists, like other people, are very poor judges of risky circumstances. Yet scientists feel that they are immune to the failings that plague ordinary humans. We’ve been trained to believe we are objective, usually without much training in how to achieve objectivity. This delusion makes us susceptible to the scientific disease.

Symptoms of the disease

The pathology surfaces in scientists in some peculiar ways.

Technical myopia

Take the case of Saccharin and Alar. The decision by the public to accept one and not the other is contradictory only if you are myopic enough to look just at the technical risks. Very little else was the same: the context, the framing, the degree of control, the prospects of benefits, and the impact groups were completely different.

Most scientists suffer from technical myopia. They are surprised and bemused when people don’t do what they tell them to do. Our solution to problems like these is to argue that if people just understood the technical details, they would be rational.

This train of thought turns ugly when we scientists take it upon ourselves to provide people with both the information and the decisions. We rationalize that people should be rational (like us), and that, to save time and trouble, we’ll do the thinking and decide what’s safe and what isn’t. We create panels of experts (composed largely of people like ourselves), and the priesthood of scientists then decides the questions, collects the data, interprets them, and makes the decisions. Others don’t have to worry about a thing.

In my view, when this happens, we have imposed on the rest of society the values of a bunch of (mostly) middle-aged, middle-class men and women who often are compromised by conflicts of interest. In such circumstances, we are not to be trusted.

Mary O’Brien wrote a compelling book in 2000 on the frailties of risk analysis, in which she described numerous examples where scientists made important social decisions, such as where to store nuclear waste, in almost complete isolation from the people who would carry the burden of the outcomes if the decision turned out to be wrong.

Optimism

We scientists are typically, heroically optimistic about our ability to predict. Many experts are wildly and unjustifiably confident about their ability to guess parameters, even within their field of technical expertise. Unfortunately, it is very difficult to distinguish a reliable expert from a crank. In a 1993 study on expert opinion in assessments of earthquake risks, Krinitzsky said experts may be “…fee-hungry knaves, time servers, dodderers in their dotage…Yet, these and all sorts of other characters can pass inspections, especially when their most serious deficiencies are submerged in tepid douches of banality”.

Often, we depend on ecological monitoring and surveillance systems to reassure ourselves that our decisions about putative environmental impacts are sound and that ecological systems are in control. The purpose of environmental monitoring is to protect the environment, society and the economy. The systems are supposed to (i) tell us there is a serious problem when one exists (thus avoiding unjustified over-confidence, called “false negatives”) and (ii) tell us there is not a serious problem when there isn’t one (thus avoiding false alarms, called “false positives”).

The first is crucial for detecting serious damage to environmental and social values, the second for ensuring that the economy is not damaged by unnecessary environmental regulations. Unfortunately, standard procedures implicitly assume that if no problem is observed, none exists. This logic appears in the disclaimer on Trader Joe’s milk bottle.

But detecting important environmental damage against a background of natural variation, measurement error, and poorly understood biological processes often is difficult. Furthermore, standard monitoring procedures do not attempt to determine whether the intensity of monitoring is too little, potentially overlooking impacts, or excessive, laying an unnecessary burden upon a proponent.

The remedy is to design monitoring and auditing protocols that report the probability that they will detect ecologically important changes.

Thus a monitoring system designer should demonstrate that the system would be reasonably certain of detecting unacceptable impacts (for a defined set of indicators, at an agreed level of reliability). Unfortunately, these issues are simply not directly addressed in most current monitoring systems. They are intimately related to the deployment of significance tests in science, and especially in ecology.
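One way to report that probability is a conventional power calculation. The sketch below uses a normal approximation for comparing the mean of an indicator at an impact site with the mean at a control site, and it assumes the standard deviation is known and equal at both sites. The function names and the numbers in the usage lines are mine, chosen only for illustration; a real monitoring design would need its own indicators, variance estimates and agreed effect size.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def detection_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate probability that a two-sided test detects a true difference.

    effect: the smallest change considered ecologically important
    sd: assumed standard deviation of the indicator (same at both sites)
    n_per_group: number of samples taken at each site
    alpha: accepted false-alarm (Type I) rate
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / standard_error
    # Probability of landing in either rejection tail when the effect is real.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def samples_needed(effect, sd, target_power=0.8, alpha=0.05, max_n=100_000):
    """Smallest per-site sample size that reaches the target power."""
    for n in range(2, max_n + 1):
        if detection_power(effect, sd, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reachable within max_n samples")


# Illustrative numbers: an important change of half a standard deviation.
print(f"power with 10 samples per site: {detection_power(0.5, 1.0, 10):.2f}")
print(f"samples per site for 80% power: {samples_needed(0.5, 1.0)}")
```

Under these assumptions, ten samples per site would detect a half-standard-deviation change only about one time in five, so a report of "no impact observed" from such a design says very little. Stating the power alongside the result lets a reader judge how hard anyone actually looked.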

The curious phenomenon of one-sided logic

Karl Popper created a revolution in thinking that led to current acceptance of the notion of hypothesis testing in science. Today, students are exhorted to find a hypothesis to test. A thesis that fails to present and test a stark hypothesis is likely to fail. Publishable papers hinge on the results of such tests.

A nasty linguistic ambiguity slipped into the picture between about 1930 and 1960, when RA Fisher invented much of the mathematical machinery that underlies modern statistical methods. Ecology has adopted the machinery of Fisherian statistics over the last five decades. In doing so, it has equated Fisher’s null hypothesis test with Popper’s hypothesis testing.

Classical statistical training teaches us to deal with measurement errors, observational bias, and natural variation. Unfortunately, while p-values relate exclusively to rejecting the null hypothesis when it is true (Type I errors), most scientists believe that they say something about accepting the null hypothesis when it is false (Type II errors) as well. Often, scientists conclude that there is no difference between two samples when the only sensible question is: how big an effect is there? Yet scientific conventions and formal training are sufficiently powerful that the most gifted and insightful scientists trip over their own feet when confronted with this problem.

Fiona Fidler noted that many and perhaps most papers in applied ecology that use statistical tests misinterpret a lack of a statistically significant effect to be evidence that there is no “real” effect. This one-eyed view of evidence is only moderately damaging to the progress of science, but it becomes especially important in environmental science where the costs of Type II errors are counted in damage to the environment or human health.
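The difference between "no significant effect" and "no effect" is easier to see when the result is reported as an interval for the size of the effect. The sketch below computes an approximate confidence interval for a difference in means and then asks two separate questions: does the interval exclude zero, and does it exclude a change large enough to matter? The counts are invented for illustration, the threshold for an "important" change is an assumption the analyst must supply, and the interval uses a normal quantile, which is somewhat too narrow for samples this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def difference_interval(control, impact, level=0.95):
    """Return (difference, lower, upper) for mean(impact) - mean(control)."""
    diff = mean(impact) - mean(control)
    se = sqrt(stdev(control) ** 2 / len(control) + stdev(impact) ** 2 / len(impact))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


def interpret(control, impact, important_change, level=0.95):
    """Classify a result without treating non-significance as proof of no effect."""
    _, lower, upper = difference_interval(control, impact, level)
    if lower > 0 or upper < 0:
        return "effect detected"
    if -important_change < lower and upper < important_change:
        return "important effect ruled out"
    return "inconclusive"


# Hypothetical counts of an indicator species at a control and an impact site.
control_counts = [12, 15, 9, 14, 10]
impact_counts = [8, 13, 6, 12, 7]

diff, lower, upper = difference_interval(control_counts, impact_counts)
print(f"difference {diff:.1f}, 95% interval {lower:.1f} to {upper:.1f}")
print(interpret(control_counts, impact_counts, important_change=3))
```

With these made-up counts the interval runs from about -6.3 to 0.7. It includes zero, so a conventional test would report no significant difference, but it also includes a decline of 3 or more, so the data cannot rule out an important impact. The honest verdict is "inconclusive", not "no effect".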

Marine scientists and researchers in a handful of other subdisciplines in ecology can feel a little smug about this issue because Type II error rates became part of conventional thinking in the 1990s. Other branches of ecology lag far behind.

The advent of null hypothesis tests and the disproportionate focus on Type I error rates have created a system in which human activities are considered to be benign, until we find otherwise. There are many examples of this propensity. Species are considered to be extant until we are reasonably sure they are extinct. Ecotoxicologists condition decisions on “No Observed Adverse Effect Levels”. The bias from one-sided inference emerges particularly strongly in monitoring programs.

Linguistic uncertainty

Scientists communicate with words, and language is inexact. Scientists have little formal training in how to deal with language-based uncertainty, and they do not acknowledge that it exists. Helen Regan and her colleagues outlined a taxonomy of uncertainty in 2002 that distinguishes epistemic uncertainty, in which there is some determinate fact but we are uncertain about its status, from linguistic uncertainty, in which there is no specific fact. Linguistic uncertainty may be decomposed into ambiguity (words have two meanings, and it’s not clear which is meant), vagueness (terms allow borderline cases), underspecificity (terms are undefined), and context dependence (the meaning of words depends on context, but the context is not indicated). Language-based scientific methods typically assume linguistic uncertainties are trivial or non-existent. Methods for dealing with these kinds of uncertainty have been applied successfully in companion disciplines such as engineering and psychology for decades, but they are rarely used in applied ecology. For instance, Resit Akcakaya and his colleagues developed a tool for encompassing linguistic uncertainty in Red List conservation assessments in 2001, but little use has been made of this tool since.

The antidotes: common sense, caution, pictures and tests

People are bad at interpreting and deciding what’s best, when faced with uncertain information. A blanket of ambiguous and vague language overlies this disability. In science, a curious, one-sided approach to inference has been adopted and added to the mix. Taken together, this cocktail leads to irrational interpretations of evidence. The disease can be treated. But the first step in treatment is to admit the problem.

We can learn from other disciplines that are further down the road than ecology, such as psychology and medicine. Strategies in editorial policy and student training may be adopted to alleviate the problem.

The precautionary principle is an appeal to common sense, emerging from the broader population, and meant to be taken to heart by scientists. It is an informal suggestion to be aware of Type II errors and to weigh them against the costs and benefits of Type I errors.

One of the best protections against the irrational interpretation of evidence is to use pictures, rather than numbers, to represent data. Scatterplots, histograms and confidence interval diagrams are much less likely to be misinterpreted than are tables of numbers.

In applied ecology, we will rely on expert judgements for the foreseeable future. Experts have specialist knowledge; they can use it efficiently, and they should be deferred to in its interpretation. However, the frailties and idiosyncrasies of experts are hidden by the fact that the feedback between an expert’s predictions in ecology and the outcomes typically is slow, ambiguous and impersonal.

A Dutch engineer, Roger Cooke, has shown us another way. Expert judgements are routine in nuclear and transport risk assessments. Cooke engages experts to make estimates of facts, and sprinkles the questions he puts to them with questions to which he already knows the answers. Experts who are routinely close to the correct answer and who are confident provide more information than experts who deviate from the truth or are very underconfident. Opinions from more informative experts are weighted more heavily. The opinions of some experts may be discarded altogether, and the feedback from the “test” questions can be used to help experts to improve their performance.
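The idea of weighting experts by their performance on test questions can be illustrated with a toy example. The sketch below is not Cooke’s scoring rule, which is more elaborate; it is a deliberately simplified scheme of my own in which each expert gives a low and high bound for every test question, earns credit for capturing the known answer, and loses credit for hedging with wide bounds. The class, the function names, the cutoff of 0.5 and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedQuestion:
    """A test question whose answer the analyst already knows."""
    truth: float
    scale: float  # width of the plausible range, used to judge how wide an interval is


def score_expert(intervals, seeds, min_hit_rate=0.5):
    """Score one expert from the (low, high) bounds given for each seed question."""
    if not seeds or len(intervals) != len(seeds):
        raise ValueError("need one interval per seed question")
    pairs = list(zip(intervals, seeds))
    hit_rate = sum(low <= q.truth <= high for (low, high), q in pairs) / len(seeds)
    if hit_rate < min_hit_rate:
        return 0.0  # discard experts who miss the known answers too often
    mean_rel_width = sum((high - low) / q.scale for (low, high), q in pairs) / len(seeds)
    sharpness = max(0.0, 1.0 - mean_rel_width)  # narrow bounds score close to 1
    return hit_rate * sharpness


def expert_weights(answers, seeds, min_hit_rate=0.5):
    """Normalize the scores so that the weights sum to one."""
    scores = {name: score_expert(iv, seeds, min_hit_rate) for name, iv in answers.items()}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no expert earned a positive score")
    return {name: score / total for name, score in scores.items()}


def pooled_estimate(weights, estimates):
    """Weighted average of the experts' best guesses on a new question."""
    return sum(weights[name] * estimates[name] for name in weights)


# Hypothetical seed questions and three hypothetical experts.
seeds = [SeedQuestion(10, 20), SeedQuestion(50, 100), SeedQuestion(200, 400)]
answers = {
    "accurate and sharp": [(8, 12), (45, 60), (180, 230)],
    "accurate but vague": [(0, 20), (10, 90), (50, 400)],
    "confident but wrong": [(14, 15), (70, 72), (100, 110)],
}
weights = expert_weights(answers, seeds)
for name, weight in weights.items():
    print(f"{name}: weight {weight:.2f}")
```

In this toy case the expert who is both accurate and sharp receives most of the weight, the vague expert receives a little, and the confident but wrong expert is dropped. The same weights can then be passed to `pooled_estimate` together with each expert’s best guess on a question whose answer is not known.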

In the last decade, there have been some significant advances in gathering and interpreting expert judgement. The opportunity exists for the discipline to embrace novel ideas about how to engage with and learn the most from experts, thereby improving the information gleaned from them and the decisions to which it contributes, avoiding unnecessary environmental damage without unnecessarily constraining economic development.

This article first appeared in the March 2011 edition of the Bulletin of the British Ecological Society.

References

Aspinall, W. 2010. A route to more tractable expert advice. Nature 463, 294-295.

Cooke, R. M. & Goossens, L. H. J. 2000. Procedures guide for structured expert judgement in accident consequence modelling. Radiation Protection Dosimetry 90, 303-309.

Fidler, F., Burgman, M. A., Cumming, G., Buttrose, R. & Thomason, N. 2006. Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology 20, 1539-1544.

Kahneman, D. & Tversky, A. 1984. Choices, values, and frames. American Psychologist 39, 342-347.

Kammen, D. M. & Hassenzahl, D. M. 1999. Should we risk it? Exploring environmental, health, and technological problem solving. Princeton University Press, Princeton.

Krinitzsky, E. L. 1993. Earthquake probability in engineering – Part 1: The use and misuse of expert opinion. Engineering Geology 33, 257-288.

O’Brien, M. 2000. Making better environmental decisions: an alternative to risk assessment. MIT Press, Cambridge, Massachusetts.

Regan, H. M., Colyvan, M. & Burgman, M. A. 2002. A taxonomy and treatment of uncertainty for ecology and conservation biology. Ecological Applications 12, 618-628.

Slovic, P. 1999. Trust, emotion, sex, politics, and science: surveying the risk-assessment battlefield. Risk Analysis 19, 689-701.
