The Journal of Heredity 77:218-220, 1986

**A Solution to the Too-good-to-be-true
Paradox and Gregor Mendel**

by Ira Pilgrim

In 1936, R.A. Fisher(1) concluded, on the basis of a Chi-square analysis, that Gregor Mendel had falsified his data. Because of my experience with experimental data, I believed, empirically, that Fisher was wrong. The first step in refuting him was to logically demonstrate that his method and conclusions might be suspect. In 1984(3), I was able to show that Fisher's reasoning contained a number of paradoxical elements, and that Mendel's honesty or dishonesty could not be conclusively demonstrated using chi square. Recently, Monaghan and Corcos(2) reviewed Fisher's(1), Weiling's(5) and others' work. They concluded that "although the statistical procedure is undoubtedly correct, the conclusion seems to us to be illogical. We have a series of independent experiments, none of which show evidence of bias and whose Chi-square values show no systematic trend. Yet the sum of these individually unbiased experiments is judged as showing bias. There seems to be no satisfactory solution to this problem at present, at least not in the statistics."

The second step in refuting Fisher is to show where he went wrong. I have now solved that problem and can show why chi square is an inappropriate tool for the detection of falsified data.

Chi-square is a method, devised by Pearson, which assists the investigator in accepting or rejecting a hypothesis by calculating the probability that the data are compatible with the hypothesis. If the data cannot possibly be compatible with the hypothesis, the probability approaches zero and if the data conform exactly to the hypothesis, the probability approaches one.

It cannot be emphasized too strongly that any test of goodness-of-fit can only assist an investigator in making up his mind. It neither proves nor disproves a hypothesis and, despite the best of experimental methods and statistics, it is still possible to accept a false hypothesis or reject a true one.

The question that is asked with Chi-square is: What is the probability that the experimental data are compatible with the hypothesis? Chi-square answers this question remarkably well.

When Fisher used the Chi-square test on Mendel's data, he was asking (or should have been asking) a very different question: We know that the hypothesis is correct. We also know that the data fit the hypothesis. What is the probability that the data represent a truly random sampling? Chi-square does not provide an answer to that question, as I will demonstrate.

Since I am trying to evaluate the suitability of Chi-square as a means of testing whether data are "too good", it would be preferable to use a different method for testing Mendel's data. Accordingly, I returned to first principles and calculated the exact probability of Mendel's data falling where they did.

To find out how probable or improbable Mendel's results were, I calculated the actual probability of his obtaining the results that he did. This was done with the aid of a computer, using the standard formula for the probability of occurrence of any distribution in a sample of N individuals, where one group consists of X individuals with a probability of occurrence P, and the other group consists of (N-X) individuals with a probability of occurrence Q.

$$\text{Probability} = \frac{N!}{(N-X)!\,X!}\,P^{X}\,Q^{N-X}$$

The ratio appropriate to the experiment was used. Stirling's Approximation was used for the factorials of large numbers.
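The formula above can be evaluated directly. What follows is a minimal sketch and not the program used for this paper: the function names and the `stirling_cutoff` threshold are assumptions made for illustration. It works with the logarithms of the factorials, so that samples of several thousand do not overflow, and switches to Stirling's Approximation, ln N! ≈ N ln N - N + ½ ln(2πN), once N is large.

```python
import math

def log_factorial(n, stirling_cutoff=50):
    """Natural log of n!, using Stirling's Approximation for large n.

    stirling_cutoff is a hypothetical threshold chosen for this sketch.
    """
    if n < stirling_cutoff:
        return math.lgamma(n + 1)  # exact to floating-point precision
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

def binomial_probability(n, x, p):
    """Probability of exactly x individuals of one kind (probability p) among n.

    Assumes 0 < p < 1 and 0 <= x <= n.
    """
    q = 1.0 - p
    log_prob = (log_factorial(n) - log_factorial(n - x) - log_factorial(x)
                + x * math.log(p) + (n - x) * math.log(q))
    return math.exp(log_prob)

# Experiment 1 of Table I: N = 7324, 5474 of the first kind, 3:1 ratio.
print(binomial_probability(7324, 5474, 0.75))
```

With this basic form of Stirling's Approximation the relative error for samples of 100 or more is a fraction of one percent, which is small next to the probabilities being compared.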

First, let us examine the distribution of Mendel's data (fig. 1) where his sample size was 100 against the appropriate curve (n=100, and a 2:1 ratio). Approximately 2/3 of all values would be expected to fall within one standard deviation, +/- 5, of the expected value. Mendel's deviations, in the 5 experiments where he used a sample size of 100, were +4, 0, -3, +5, and -7. Inspection of figure 1 indicates that this is what one might reasonably expect with a random sampling. For one to consider them "too good", they would all have to be within +/- one or two. There is nothing about their distribution that would lead one to suspect that they had been derived in anything but a random manner.
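As a worked check of the +/- 5 figure, taking the standard deviation as the square root of npq (the definition given in the footnote to Table I), with n = 100, p = 2/3 and q = 1/3:

$$\sigma = \sqrt{npq} = \sqrt{100 \cdot \tfrac{2}{3} \cdot \tfrac{1}{3}} = \sqrt{22.2} \approx 4.7$$

which rounds to the 5 used above.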

To evaluate the rest of Mendel's data, what we need is not the probability of the data being what they are, but the probability of each datum falling as close to the expected value as it did, which I will call Pd. The probability of the datum falling as far from the expected, or farther, would be 1-Pd. To find Pd, we add up all of the probabilities from the expected value to the derived value, plus a similar group of probabilities on the other side of the curve. For example: suppose that we toss a coin 100 times and get 52 heads and 48 tails. The value Pd is the sum of the probabilities of getting 52 heads and 48 tails, 51 heads and 49 tails, 50 heads and 50 tails, 49 heads and 51 tails, and 48 heads and 52 tails. This total is the probability of the results being as good as 48:52 or better.
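The summation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the original program: the names `binomial_probability` and `pd_value` are hypothetical, and the expected count is taken to be n times p rounded to the nearest whole number unless one is supplied.

```python
import math

def binomial_probability(n, x, p):
    """Probability of exactly x of one kind (probability p) among n individuals."""
    log_prob = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
                + x * math.log(p) + (n - x) * math.log(1.0 - p))
    return math.exp(log_prob)

def pd_value(n, observed, p, expected=None):
    """Probability of a result as close to the expected count as `observed`, or closer."""
    if expected is None:
        expected = round(n * p)  # assumption: expected count rounded to an integer
    d = abs(observed - expected)
    low = max(0, expected - d)
    high = min(n, expected + d)
    # Sum the probabilities from expected - d to expected + d, inclusive.
    return sum(binomial_probability(n, x, p) for x in range(low, high + 1))

# The coin example: 100 tosses, 52 heads, so the sum runs from 48 to 52 heads.
coin_pd = pd_value(100, 52, 0.5)
print(round(coin_pd, 3))
```

For the coin example the sum comes to about 0.38, and 1 minus that value is the probability of a result as far from 50:50 as 52:48, or farther.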

**Table I.** The probability of occurrence of Mendel's data.

| Exp't * | N | Ratio | Obs. | Exp. | Obs. | Exp. | S.D.** | O-E*** | Pd**** |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 7324 | 3:1 | 5474 | 5493 | 1850 | 1831 | 37 | -19 | 0.378 |
| 2 | 8023 | 3:1 | 6022 | 6017 | 2001 | 2006 | 39 | +5 | 0.113 |
| 3 | 929 | 3:1 | 705 | 697 | 224 | 232 | 13 | +8 | 0.481 |
| 4 | 1181 | 3:1 | 882 | 886 | 299 | 295 | 15 | -4 | 0.238 |
| 5 | 580 | 3:1 | 428 | 435 | 152 | 145 | 10 | -7 | 0.524 |
| 6 | 858 | 3:1 | 651 | 644 | 207 | 214 | 13 | +7 | 0.444 |
| 7 | 1064 | 3:1 | 787 | 798 | 277 | 266 | 15 | -11 | 0.712 |
| 1A | 565 | 2:1 | 372 | 379 | 193 | 186 | 12 | -7 | 0.516 |
| 2A | 519 | 2:1 | 353 | 346 | 166 | 173 | 11 | +7 | 0.492 |
| 3A | 100 | 2:1 | 64 | 67 | 36 | 33 | 5 | -3 | 0.545 |
| 4A | 100 | 2:1 | 71 | 67 | 29 | 33 | 5 | +4 | 0.644 |
| 5A | 100 | 2:1 | 60 | 67 | 40 | 33 | 5 | -7 | 0.910 |
| 6A | 100 | 2:1 | 67 | 67 | 33 | 33 | 5 | 0 | 0.084 |
| 7A | 100 | 2:1 | 72 | 67 | 28 | 33 | 5 | +5 | 0.734 |

* "Exp't" is Mendel's experiment number; the letter A has been added by the author to distinguish between Mendel's first and second series of experiments.

** "S.D.", the standard deviation = the square root of npq. The mean plus or minus one standard deviation encompasses two thirds of the area under the curve; within this range two thirds of all experimental values will fall.

*** "O-E" is the difference between one set of observed values and the expected value; the + or - refers to the first set of data. The second set will bear the opposite sign.

**** "Pd" is the sum of all of the possible probability values between the observed value and the mean, plus an equal amount on the other side of the mean; it is the probability of the value being as close to the mean as it is, or closer.

Let us examine the first three of Mendel's experiments (Table I):

Experiment 1 has a Pd of 0.378. This means that 38% of the time the results would have been this close or closer to the expected. This falls between the probability of tossing a coin and getting three heads or three tails in a row, and the probability of getting two heads or two tails.

Experiment 2 has a Pd of 0.113, which is about the probability of getting four heads or four tails in a row, or the probability of having a family of four children, all of the same sex.

Experiment 3 has a Pd of 0.481 which is about the probability of getting either two heads or two tails, when tossing a coin twice, or the probability of having two children of the same sex.
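For reference, the coin-toss probabilities used in these comparisons follow from the fact that a run of k like results can be all heads or all tails, each with probability (1/2)^k:

$$P(2\ \text{alike}) = 2\left(\tfrac{1}{2}\right)^{2} = 0.5, \qquad P(3\ \text{alike}) = 2\left(\tfrac{1}{2}\right)^{3} = 0.25, \qquad P(4\ \text{alike}) = 2\left(\tfrac{1}{2}\right)^{4} = 0.125$$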

These could hardly be considered extraordinary results. The rest of the values are also where one would reasonably expect them to be with a random sampling. Even the one value where the observation agreed precisely with the expectation would not be unexpected, provided it didn't happen too often. The mean Pd is 0.487, which suggests that 49% of the time, the results would have been this close to expectations or closer, and 51% of the time they would have been more deviant. (When I showed these figures to my wife, she said "Ira, that's too good to be true.") The reader can judge whether Mendel's results are "too good to be true", as Fisher contends.
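The mean Pd quoted above can be checked from the Pd column of Table I. The short sketch below only averages the fourteen tabulated values; the variable names are illustrative.

```python
# Pd values copied from Table I, experiments 1-7 followed by 1A-7A.
pd_values = [0.378, 0.113, 0.481, 0.238, 0.524, 0.444, 0.712,
             0.516, 0.492, 0.545, 0.644, 0.910, 0.084, 0.734]

mean_pd = sum(pd_values) / len(pd_values)
print(round(mean_pd, 3))
```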

Can I say, on the basis of the foregoing analysis, that Mendel's data were not falsified? No, I can only say that I have no reason to suspect that the data were not honestly derived: i.e. if Mendel's data were falsified, it was done with such skill as to defy detection. I suspect that if a number of investigators repeated Mendel's experiments using similar numbers, they would obtain similar results. Weiling(5) has pointed out that others did, indeed, obtain similar results.

In conformity with the Law of Large Numbers, "The larger the sample, the less will be the variability in the sample proportions. Tosses of pennies illustrate the same thing. If a fair coin is tossed 50 times, the proportion of heads may well be as little as 0.4 or as much as 0.6. But if a fair coin is tossed 5,000 times, the proportion of heads is unlikely to fall outside the range 0.48 to 0.52."(p.149)(4). One can, therefore, increase the precision of genetic ratios, with a relative decrease in expected random error, by increasing sample size. Since Mendel was a teacher of mathematics and experimental physics, it is reasonable to assume that he was trained in the theories of probability. He was probably aware of what he might expect and chose his sample size so that it would provide ratios that were very close to what he expected to find. When his sample size is relatively small, he seems almost apologetic that its size was not large enough to give almost perfect ratios.
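The coin-tossing illustration in the quotation can be put in numbers. The sketch below is an illustration only, with a hypothetical function name: it sums exact binomial probabilities for a fair coin to find how often the proportion of heads stays within 0.4 to 0.6 for 50 tosses, and within 0.48 to 0.52 for 5,000 tosses.

```python
import math

def prob_heads_between(n, low_count, high_count):
    """Probability that a fair coin gives between low_count and high_count heads in n tosses."""
    favourable = sum(math.comb(n, k) for k in range(low_count, high_count + 1))
    return favourable / 2 ** n  # exact integer arithmetic until the final division

# 50 tosses: proportion of heads from 0.4 to 0.6 means 20 to 30 heads.
p_50 = prob_heads_between(50, 20, 30)
# 5,000 tosses: proportion from 0.48 to 0.52 means 2400 to 2600 heads.
p_5000 = prob_heads_between(5000, 2400, 2600)
print(round(p_50, 3), round(p_5000, 4))
```

The narrower band for 5,000 tosses captures a larger share of outcomes than the wider band does for 50 tosses, which is the sense in which a larger sample gives ratios closer to expectation.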

It is obvious that Mendel's data are not "too good to be true". This being the case, there must be something about the use of Chi-square that is responsible for the discrepancy between Fisher's conclusions and mine. If one analyzes genetic data using Chi-square, high probability values are not unusual if the data agree with the hypothesis. It seems to me that at some point, it is possible that results might become suspiciously good. Fisher does not define where that point is. He assumes that the Chi-square derived probability does that and that the higher the "p", the more suspicious the data become.

To illustrate what Chi-square will do with data that are honestly derived, and data that are fabricated, let us take four sets of invented data and apply the test to them. In the first case, the data represent a random sampling. In the second case, a random sampling with excessive deviation. In the third case, the data are heavily skewed to one side of the normal curve, and in the fourth case, the data are too good (i.e. clustered much closer to the mean than one would expect of a random sampling).

We will use 5 samples of 100 each. The data will be tested against the hypothesis that we can expect half of one kind and half of another.

1. This group will have data distributed as follows:

| Samples | Position | Ratio |
|---|---|---|
| 1 sample at | 50:50 | 50:50 (e.g. 50 heads, 50 tails) |
| 1 sample at | +0.5 s.d. | 48:52 |
| 1 sample at | -0.5 s.d. | 52:48 |
| 1 sample at | +1.0 s.d. | 45:55 |
| 1 sample at | -1.0 s.d. | 55:45 |

These data are not far from what one might obtain with a series of fair coin tosses. I have been quite conservative so that they could not be considered excessively good. We find a Chi-square of 2.32, which, with 5 degrees of freedom, gives a probability of 0.8.

2. Here I assume that our fictitious experimenter was very unlucky and his data were spread beyond what one might usually expect. It would consist of:

| Samples | Position | Ratio |
|---|---|---|
| 1 sample at | 50:50 | 50:50 |
| 1 sample at | +1 s.d. | 45:55 |
| 1 sample at | -1 s.d. | 55:45 |
| 1 sample at | +2 s.d. | 40:60 |
| 1 sample at | -2 s.d. | 60:40 |

In this extreme case, we get a Chi-square of 10, which, with 5 degrees of freedom, yields a probability between 0.1 and 0.05.

3. This group consists of:

| Samples | Position | Ratio |
|---|---|---|
| 1 sample at | 50:50 | 50:50 |
| 2 samples at | -0.5 s.d. | 52:48 |
| 2 samples at | -1.0 s.d. | 55:45 |

These data do not fit the hypothesis, since they are all skewed to one side of the expected mean. Since chi square does not discriminate between data on either side of the mean, we obtain the same high value as in the first group: a chi square of 2.32 and a probability of 0.8. I included this group to point out the desirability of graphically visualizing data.

4. This group consists of data that are really "too good":

| Samples | Position | Ratio |
|---|---|---|
| 3 samples at | 50:50 | 50:50 |
| 1 sample at | +0.5 s.d. | 48:52 |
| 1 sample at | -0.5 s.d. | 52:48 |

We obtain a Chi-square of 0.32, which gives a probability of greater than 0.99.
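The four Chi-square values can be reproduced from the invented samples. The following is a minimal sketch with illustrative names: for each sample of 100 it adds (observed - expected)² / expected for both classes, against an expected 50:50, and sums over the five samples in a group.

```python
def chi_square(samples, expected_fraction=0.5):
    """Sum of (observed - expected)^2 / expected over both classes of every sample."""
    total = 0.0
    for first, second in samples:
        n = first + second
        expected_first = n * expected_fraction
        expected_second = n - expected_first
        total += (first - expected_first) ** 2 / expected_first
        total += (second - expected_second) ** 2 / expected_second
    return total

# The four invented groups described in the text, five samples of 100 each.
groups = {
    "1 (random)": [(50, 50), (48, 52), (52, 48), (45, 55), (55, 45)],
    "2 (excessive deviation)": [(50, 50), (45, 55), (55, 45), (40, 60), (60, 40)],
    "3 (skewed)": [(50, 50), (52, 48), (52, 48), (55, 45), (55, 45)],
    "4 (too good)": [(50, 50), (50, 50), (50, 50), (48, 52), (52, 48)],
}

for name, samples in groups.items():
    print(name, round(chi_square(samples), 2))
```

Because each deviation is squared, groups 1 and 3 give the same total even though group 3 lies entirely on one side of the mean.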

When Fisher analyzed Mendel's data, he combined the Chi-square values from all of Mendel's experiments. To find out what happens when the data from several experiments are combined, let us consider several hypothetical scenarios:

1. Combining several sets of data from group 1 (reasonable data). If we do this we find that with 2 sets of such data we get a Chi-square of 4.64 which, with 10 degrees of freedom, gives a p of 0.9. If we take 3 sets of such data, we get a Chi-square of 6.96 which, with 15 degrees of freedom, yields a p of about 0.96. Thus, by combining sets of reasonable data we increase the p value. This is essentially what Fisher did with Mendel's data.

2. If we do the same thing with the data from our 2nd group, the unusually bad data, we find that two sets of data yield a p of 0.03, and three sets of data yield a p of 0.01. Combining sets of poor data yields a smaller probability.
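The effect of combining sets can be sketched as follows. This is an illustration under stated assumptions rather than the calculation Fisher or the author performed: the function names are hypothetical, Chi-square values and degrees of freedom are simply added across sets, and the probability is obtained from a series for the incomplete gamma function instead of from printed tables.

```python
import math

def chi_square_probability(chi2, df):
    """Upper-tail probability of a Chi-square value with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma function.
    """
    if chi2 <= 0:
        return 1.0
    a, h = df / 2.0, chi2 / 2.0
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-15 * total:
        k += 1
        term *= h / (a + k)
        total += term
    lower = total * math.exp(-h + a * math.log(h) - math.lgamma(a))
    return 1.0 - lower

def combine(chi2_per_set, df_per_set, sets):
    """Chi-square values and degrees of freedom add when sets are combined."""
    return chi2_per_set * sets, df_per_set * sets

# Group 1 (Chi-square 2.32) and group 2 (Chi-square 10), 5 degrees of freedom each.
for label, chi2 in (("group 1", 2.32), ("group 2", 10.0)):
    for sets in (1, 2, 3):
        total_chi2, total_df = combine(chi2, 5, sets)
        p = chi_square_probability(total_chi2, total_df)
        print(f"{label}, {sets} set(s): Chi-square = {total_chi2:.2f}, "
              f"df = {total_df}, p = {p:.3f}")
```

Run on the two groups, the probability for the reasonable data climbs toward 1 as sets are added, while that for the widely spread data falls toward 0.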

That combining good data yields a much higher Chi-square probability, and combining poor data yields a much lower one, is as it should be and is what Chi-square was designed to do. If we repeat an experiment which yields data that are a mediocre fit to the hypothesis, and the repeated experiment also yields data which only marginally fit the hypothesis, it would indicate that the data probably do not fit the hypothesis; while "good" data, when repeated, make it more likely that the data confirm the hypothesis. This is when Chi-square is used as it was meant to be used. When, however, it is used as Fisher used it, to detect falsified data, virtually any sets of reasonable data, if they are a fairly good fit to the hypothesis, would seem to be "too good" if combined.

Fisher's use of Chi-square to detect falsified data represented a radically new application of a standard method. It is incumbent upon anyone doing this to document its validity, either theoretically or empirically. Fisher did not do this.

Throughout his paper, Fisher(1) reiterates his contention that Mendel falsified his data. He states in a number of places that Mendel's data fall within the standard error. The implication throughout is that one would expect the data (even with a correct hypothesis) to do otherwise. Since 2/3 of all sets of data would fall within the standard error, I cannot understand why Fisher would find this remarkable.

There are other errors in Fisher's paper which would cast doubt upon his conclusions, but since the use of Chi-square is inappropriate, any further critique of Fisher's paper seems superfluous.

I can conclude from the above that:

1. Chi-square is not an appropriate test for detecting falsified data.

2. There is no reason whatever to question Mendel's honesty.

The author is indebted to Dr. Leon Rosenblatt and Dr. Marshall Natapoff for their criticism and counsel.

References

1. Fisher, R.A. Has Mendel's work been rediscovered? Annals Sci. 1:115-137, 1936. (Reprinted in Stern, C. and Sherwood, E.R. The Origin of Genetics: A Mendel Source Book. W.H. Freeman and Co., San Francisco and London.)

2. Monaghan, F. and Corcos, A. Chi-square and Mendel's experiments: where's the bias? J. Hered. 76:307-309, 1985.

3. Pilgrim, I. The too-good-to-be-true paradox and Gregor Mendel. J. Hered. 75:501-502, 1984.

4. Wallis, W.A. and Roberts, H.V. The Nature of Statistics. Collier Books, New York, 1962.

5. Weiling, F. Hat J.G. Mendel bei seinen Versuchen "zu genau" gearbeitet? Der χ²-Test und seine Bedeutung für die Beurteilung genetischer Spaltungsverhältnisse. Der Züchter 36:359-365, 1966.