The Relevance of 95% Significance Level
by Elizabeth Horn, Ph.D.
Like many others this time of year, I’m contemplating New Year’s resolutions. In particular, I’m intrigued by the practice of hara hachi bu, which has been linked to increased longevity. This Japanese saying translates to “eat until you are 8 out of 10 parts full.” The modern interpretation is “eat until you are 80% full.” The idea is that being mindful of the “fullness” sensation results in fewer calories consumed and, ultimately, weight loss.
An obvious question about hara hachi bu is how one knows when 80% fullness is reached. A more important question, though, is why 80%? It seems rather arbitrary. Hara hachi bu originated in feudal Japan in the 1300s, when percentages were not used in everyday speech. Therefore, stating a whole number out of 10 made more sense when considering food portions than, say, a fraction such as 7½ out of 10 (or 75%). Still, why not 7 out of 10 or even 9 out of 10? If followers of this teaching were asked “why 80,” they would likely explain that this practice has been ingrained into the culture. It is well known and accepted, a kind of universal truth.
This led me to ponder numbers that I consider “truth.” Most, if not all, scientific research involves some type of significance testing, with 95% as a universal benchmark for statistical significance. Significance testing is philosophically and scientifically based, and it identifies which of many differences between treatments (conditions, interventions, etc.) are worth pursuing. Declaring that a result is significantly different from another at the 95% significance level means that, if the treatments were in fact the same, a difference this large would turn up by chance no more than 5% of the time.
In school, the use of 95% was drilled repeatedly into my head in various quantitatively oriented classes. My fellow students and I accepted that 95% was king, and that lower significance levels were to be viewed with suspicion. This was reinforced when I discovered that peer-reviewed academic journals, regardless of discipline, would not publish a research study without a demonstrated statistical difference at 95% significance. If investigators were unable to show significance at the proper level, they might as well pack up and go home. After graduate school, I took a position in the market research industry. In the midst of this strange, new world, my old friend, 95%, was there to greet me.
Careers rise and fall based on the 95% rule. Potentially beneficial pharmaceuticals and other therapies can die in early-stage clinical trials when they fail to meet the required level of statistical significance. Viable new product ideas can be scrapped. Entire advertising campaigns can be deemed failures. If our fates are judged by 95%, perhaps we should better understand its origins.
British statistician and geneticist R. A. Fisher developed the idea of significance testing in the early 1920s. In Statistical Methods for Research Workers, he wrote:
“The value for which P = .05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is considered to be significant or not.”
One in 20 is interpreted as the probability that a researcher will make a mistake in declaring a study result to be significant when no real difference exists. This represents a 5% risk of being incorrect in that situation or, on the more positive side, a 95% chance of being correct. Fisher’s ideas on significance testing and the criterion to assess significance were embraced and furthered by the scientific community. Soon most researchers were using 95% as the standard for determining significance.
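Fisher’s arithmetic is easy to check. The short Python sketch below is an illustration only, not anything from Fisher or from a particular statistics package. It assumes a two-sided test on a standard normal curve, and the helper names two_sided_critical_z and two_sided_p_value are made up for this example. It shows where the 1.96 cutoff comes from and why “nearly 2” is a fair shorthand.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist()


def two_sided_critical_z(confidence):
    """Return the z cutoff that leaves (1 - confidence) split across both tails."""
    alpha = 1.0 - confidence
    return std_normal.inv_cdf(1.0 - alpha / 2.0)


def two_sided_p_value(z):
    """Return the chance of a deviation at least as large as |z| when no real difference exists."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))


# Cutoffs for the levels most often offered in practice.
for confidence in (0.90, 0.95, 0.99):
    cutoff = two_sided_critical_z(confidence)
    print(f"{confidence:.0%} level -> critical z = {cutoff:.3f}")

# Fisher's "1.96 or nearly 2": the exact cutoff versus the rounded one.
print(f"P for z = 1.96: {two_sided_p_value(1.96):.4f}")
print(f"P for z = 2.00: {two_sided_p_value(2.0):.4f}")
```

Under these assumptions, the 95% cutoff comes out at about 1.96, and rounding it up to 2 moves P from roughly .05 to roughly .046. That small difference is why Fisher could treat the two as interchangeable for convenience.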
Notice, though, Fisher’s choice of words: “...it is convenient to take this point....” He didn’t offer definitive proof or point to a body of research. He simply presented 95% as a reasonable level. Fisher’s suggestion was never intended to be taken at face value. In a 1992 retrospective paper on his late mentor’s career, C. R. Rao, one of Fisher’s most influential and prolific students, attempted to set the record straight.
“[Fisher] does not recommend any fixed level of significance, but suggests that the observed level of significance has to be used with other evidence that the experimenter may have in making a decision.”
Fisher never intended the scientific community to adopt 95% as the only standard. Rather, he advocated a more holistic approach of carefully weighing other research outcomes and data before declaring a result to be significant.
I wonder how many beneficial scientific insights have been discounted or flat-out ignored because of adherence to 95%. And in the market research industry, what great nuggets of truth have we inadvertently missed? To its credit, market research is less rigid in its significance testing. Some companies even maintain a standard of 90%, recognizing that 95% may cause them to overlook something important.
Like Fisher, I believe it is vital not to rely solely on the outcome of a statistical test to judge a study result “significant.” Other factors, often more meaningful, are required to triangulate on the final decision. In this spirit, I encourage consideration of the following:
Assess the cost of making the wrong decision. The U.S. FDA demands high statistical significance in clinical trials because approving a therapy that is later shown to harm health would be disastrous. Decisions in the market research field are not life or death (thank goodness). Still, the cost of making a bad decision can be significant (poor resource investment and allocation). If the penalty for making a mistake is high, the significance level should be high as well, and more external evidence should be required.
Evaluate significance at multiple levels. Significance testing is easy and cheap. Most software programs allow testing at the 90%, 95%, and 99% levels at a minimum. It’s not unheard of to test at 85% in certain applications. One note of caution: large base sizes (2,000+) tend to produce a massive number of significant results. In these cases, you might want to use 99%.
Measure the magnitude of the difference. Suppose a company is making a “go/no go” decision for a costly advertising campaign and is conducting some primary research to assess the current and the proposed campaigns on a number of metrics. A key study outcome will be the size of the gap between the two campaigns. Is it large or small? Large gaps are preferred, of course. However, in some situations, a small gap could lead to a large increase in company revenues.
Review other evidence. Place the result in context. Do the results align with other findings within the current research? Do the current results align with previous research? The company from the previous example may decide to move forward with the new campaign if it outperforms the current campaign on all key metrics and if separate tests of the component ads in the campaign indicate success.
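To make the points above about multiple levels, base sizes, and the size of the gap more concrete, here is a minimal Python sketch. It assumes the metric is a simple proportion (for example, the share of respondents stating purchase intent) and uses a textbook pooled two-proportion z-test. The function names and all of the counts are hypothetical, invented purely for illustration, and your own software may use a different test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (gap, z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


def significant_at(p_value, levels=(0.85, 0.90, 0.95, 0.99)):
    """Map each significance level to whether the result clears it."""
    return {level: p_value < (1 - level) for level in levels}


# Hypothetical counts: (current campaign successes, base, proposed campaign successes, base).
scenarios = {
    "small study, 8-point gap": (80, 200, 96, 200),
    "large study, 3-point gap": (2000, 5000, 2150, 5000),
}

for label, counts in scenarios.items():
    gap, z, p_value = two_proportion_test(*counts)
    passed = [f"{level:.0%}" for level, ok in significant_at(p_value).items() if ok]
    print(f"{label}: gap = {gap:.1%}, z = {z:.2f}, P = {p_value:.4f}")
    print(f"  significant at: {', '.join(passed) if passed else 'none of the levels tested'}")
```

With these made-up numbers, the 8-point gap in the small study clears 85% but falls short of 90%, while the much smaller 3-point gap in the large study clears even 99%. Neither outcome settles the decision on its own. The gap still has to be weighed against the cost of being wrong and the other evidence at hand.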
So along with the other resolutions made this new year, let’s resolve to think more critically about the application of significance testing in general and, more specifically, the 95% significance level. Our clients and the market research industry would benefit greatly from our efforts, I believe. On a more personal front, I resolve to do my best to incorporate the practice of hara hachi bu into my life, though I’m still working on determining when I am 80% full!
Fisher, R. A. Statistical Methods for Research Workers, 2nd ed. Edinburgh, Scotland: Oliver and Boyd, 1928.
Rao, C. Radhakrishna. “R. A. Fisher: The Founder of Modern Statistics.” Statistical Science, Vol. 7, No. 1, 1992, pp. 34-48.
About the Author
Elizabeth Horn, Ph.D. (firstname.lastname@example.org) is Senior Vice President, Advanced Analytics at Decision Analyst. She may be reached at 1-800-262-5974 or 1-817-640-6166.
Copyright © 2018 by Decision Analyst, Inc.
This posting may not be copied, published, or used in any way without written permission of Decision Analyst.