In the evaluation of ABX Double Blind Test results, statistics are used to get
as much information as possible from each test. The statistical methods
reviewed here are by no means the only ones available for use with an ABX
Double Blind Comparator. They are, however, the methods that we at the ABX
Company have found useful in ABX Double Blind Comparisons. Experienced
researchers may utilize methods beyond the scope of this paper.
|
One traditional approach to audibility thresholds defined the threshold point
as midway between all correct (100%) and guessing (50%). Thus less than 75%
correct indicated the effect was not detected by the listeners. With
statistical evaluation, ABX Double Blind Comparisons can show audible
differences with as little as 51% correct, provided the experiment is
carefully designed.
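
To illustrate why a score as low as 51% can still be significant, the sketch
below computes the exact chance of reaching a given score by pure guessing
for several experiment sizes. The function name `guessing_probability` and the
trial counts are hypothetical choices made for this illustration only.

```python
from math import comb

def guessing_probability(correct, trials):
    """Exact chance of scoring `correct` or more out of `trials` by guessing."""
    # Count every outcome of fair coin tosses with at least `correct` hits.
    favorable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favorable / 2 ** trials

# Hypothetical experiments that all score exactly 51% correct.
for trials in (100, 1000, 10000):
    correct = trials * 51 // 100
    p = guessing_probability(correct, trials)
    print(f"{correct}/{trials}: p = {p:.3f}")
```

Under these assumptions, only the largest experiment brings p below 0.05,
which suggests that a 51% score needs a very large number of trials.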
|
The statistics that provide this high level of resolution are elementary. The
discussion to follow can serve both as an explanation for those new to
statistics and as a reminder to statisticians who don't use these simple
methods regularly.
|
WHAT IS BEING TESTED ?
|
Although novel experiments may be designed to utilize the Double Blind
features of the ABX Comparator System, it was designed initially for
comparisons of audio components. In this application, the ABX Company has
adopted the following definitions:
|
It is important not to confuse SUBJECT and LISTENER. Listeners thinking their
ears are on trial may be intimidated and thereby not do their best. This
caution is worth explaining at the beginning of each test.
|
Since the randomization of the ABX Comparator is internal, the EXPERIMENTER
and the LISTENER may be the same person. With the ABX Comparator, Double Blind
tests may be done by a single person working alone.
|
MISSING A FEW
|
Listeners don't have to be correct on each and every response to show that the
effect being tested is detectable. When some responses are incorrect, the
science of statistics is called upon to show the responses do relate to the
effect being tested. The statistics are basic, so basic that they are rarely
explained in practical terms in statistics texts.
|
WHAT IS p ?
|
Statistics is a science of probabilities. It offers no tests of absolute
proof; the result of a statistical analysis is itself a probability. Whether
a result is random or not is stated statistically as the likelihood that a
result at least as good could arise by chance. This is determined by
comparing the experimental result with results that have been theoretically
studied. These reference points are called "distributions". To test whether
an ABX Double Blind score is random or a significant event, the score is
compared with a known distribution.
|
Tossing a single coin repeatedly is statistically identical to a listener's
score in an ABX Double Blind Comparison when the listener is guessing. The
coin is random; its distribution is well known. In a very large number of
tosses, heads comes up half the time. If the listeners just guessed in an ABX
Comparison, the score would likewise be about 50%. If the score is above 50%,
it may be random or not; which it is depends on the number of trials run.
Generally the evidence that the score was not like random guessing is
stronger the more trials are run, although advanced researchers may be able
to calculate an optimum number of trials in certain circumstances. We at the
ABX Company have found sixteen trials to be fairly sensitive without an undue
burden on the listeners.
|
If a coin is tossed sixteen times, there won't always be exactly 8 heads.
Five to eleven heads may occur quite frequently in each group of sixteen
tosses. If twelve heads occur in sixteen tosses, the coin may be out of
balance.
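
The sixteen-toss figures can be checked directly. The sketch below, which
assumes a perfectly fair coin, lists the chance of each possible number of
heads and adds up the two ranges mentioned above. The function name
`heads_distribution` is hypothetical.

```python
from math import comb

def heads_distribution(tosses):
    """Return P(exactly k heads) for k = 0..tosses with a fair coin."""
    total = 2 ** tosses
    return [comb(tosses, k) / total for k in range(tosses + 1)]

dist = heads_distribution(16)
p_5_to_11 = sum(dist[5:12])    # five through eleven heads
p_12_or_more = sum(dist[12:])  # twelve or more heads

print(f"P(5 to 11 heads)    = {p_5_to_11:.3f}")     # 0.923
print(f"P(12 or more heads) = {p_12_or_more:.3f}")  # 0.038
```

Five to eleven heads cover about 92% of all sixteen-toss groups, while twelve
or more heads turn up by chance in fewer than 4% of them.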
|
The dividing line falls between eleven and twelve because the coin toss
result has been compared with known distributions. The ones that apply are
the Chi-Squared Distribution and the Binomial Distribution. The Chi-Squared
Test is easy to use, requiring only a simple calculation and a table of
Chi-Squared values and the corresponding probability values, but it is
approximate for small scores. The Binomial Distribution Test is exact, but
tables are not readily available for large numbers. The ABX Comparator
User's Manual includes an
exact Binomial Distribution Table for up to 40 trials. The ABX Company has
prepared a Binomial Table for up to 600 trials. It is a 2 inch thick
printout. [This was written in 1982.]
|
USING THE BINOMIAL TABLE
|
The table is easy to use. If, for example, the experiment's result was 20 X's
out of 30 correctly identified, 20 is found across the top of the page and 30
at the side. The number at the row-column intersection is the probability
that this score or better could have occurred in a coin toss or guessing
situation. For 20/30, the probability listed in the table is 0.049. This
means that in 100 random fair coin toss experiments consisting of 30 unknowns,
a score of 20/30 or better would be expected to happen about five times. Thus
20/30 is a fairly rare event. In a scientific report this would be stated:
67% correct (p=0.049).
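
A table entry of this kind can be recomputed rather than looked up. The
sketch below is an illustration, not the ABX Company's own program: it assumes
each entry is the exact binomial chance of the given score or better under
fair guessing, and the name `table_entry` is hypothetical.

```python
from math import comb

def table_entry(correct, trials):
    """Chance of `correct` or more right answers in `trials` fair guesses."""
    favorable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favorable / 2 ** trials

# Excerpt of the row for 30 trials, one column per score.
for correct in range(18, 23):
    print(f"{correct}/30: p = {table_entry(correct, 30):.3f}")

print(round(table_entry(20, 30), 3))  # 0.049
```

The 20/30 entry agrees with the 0.049 quoted from the table.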
|
In rigorous scientific experiments the value of p is chosen in advance as a
goal. The most common p value thus set is 0.05. Lower values are sometimes
sought and achieved, but higher values are almost never tolerated. It would
be unwise to report an audibility result if p is not below 0.05.
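
When p is fixed in advance, the score needed to reach it can also be worked
out before the test is run. The sketch below assumes the exact binomial
calculation and a strict "p below the goal" rule; the function name
`minimum_significant_score` and the `alpha` parameter are hypothetical.

```python
from math import comb

def minimum_significant_score(trials, alpha=0.05):
    """Smallest number correct whose guessing probability is below alpha.

    Returns None when even a perfect score cannot reach alpha.
    """
    total = 2 ** trials
    tail = 0     # running count of outcomes with this score or better
    best = None
    # Walk down from a perfect score, widening the upper tail each step.
    for correct in range(trials, -1, -1):
        tail += comb(trials, correct)
        if tail / total >= alpha:
            break
        best = correct
    return best

print(minimum_significant_score(16))  # 12
print(minimum_significant_score(30))  # 20
```

For sixteen trials the result is 12, which matches the dividing line between
eleven and twelve described for the coin toss.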
|
USING CHI-SQUARED
|
In the same 20 X's correct out of 30 X's example, a value of Chi-Squared is
calculated by the formula below. The probability of the Chi-Squared value is
determined from the Chi-Squared Table. In the case of an ABX Double Blind
Comparison, the Chi-Squared value is:
|
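As a minimal sketch only, the block below computes one common form of the
Chi-Squared value for a right/wrong score: the goodness-of-fit statistic
against a 50/50 guessing expectation, with an optional continuity correction.
This is an assumption about the general technique and may not be the exact
formula used by the ABX Company. The function names and the `continuity` flag
are hypothetical.

```python
from math import erfc, sqrt

def chi_squared_abx(correct, trials, continuity=True):
    """Chi-Squared value (1 degree of freedom) against 50/50 guessing."""
    expected = trials / 2
    deviation = abs(correct - expected)
    if continuity:
        # Assumed continuity correction: shrink the deviation by one half.
        deviation = max(deviation - 0.5, 0.0)
    # The correct and incorrect counts each contribute deviation^2 / expected.
    return 2 * deviation ** 2 / expected

def chi_squared_p(chi2):
    """Chance of a Chi-Squared value this large or larger (1 d.f.)."""
    return erfc(sqrt(chi2 / 2))

chi2 = chi_squared_abx(20, 30)
print(f"Chi-Squared = {chi2:.2f}")  # 2.70
print(f"p (either direction) = {chi_squared_p(chi2):.3f}")
print(f"p (better than guessing only) = {chi_squared_p(chi2) / 2:.3f}")
```

The Chi-Squared probability counts deviations in either direction, so it is
halved here for comparison with the one-directional binomial figure. Under
these assumptions the halved value comes out close to, but not exactly, the
0.049 given by the exact Binomial Table.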
In the ABX Double Blind Comparison, the goal is to statistically disprove the
null hypothesis (that A and B sound the same) and thereby support the
hypothesis that they sound different.
|
When an individual ABX Comparison is completed, the responses are checked
against the key in the ABX Comparator in ANSWER mode. The number of correct
responses is compared to the number of response attempts to give a score such
as the example of 20 correct out of 30 attempts, which we have written briefly
as a fraction or a percentage: 20 / 30 = 67%. The score is then compared
with the probability table from which the probability that it is a random
score is determined. Thus the result is stated: 20 / 30 = 67% (p = 0.049).
This literally means that random guessing would produce a score of 20/30 or
better only 4.9% of the time. Thus the null hypothesis is rejected at the
0.05 level, or in audiophile terms, A sounds different from B, with a 4.9%
chance that guessing alone would have scored this well.
|
Note that no matter what score is achieved, A = B cannot be proven. That is,
the ABX Double Blind Comparison can never be used to prove two audio
components sound the same. The notion that ABX can prove components sound the
same is a common misconception about ABX.
|
A second common misconception about ABX is the claim that an ABX test result
is not a preference: it doesn't tell which audio component sounds better.
While this is literally true, once an ABX test confirms that a difference is
heard, selecting one's preference is easy and completely justified.
|
If the score had not been significant, say 19 / 30 = 63% (p = 0.1), all that
could be reported is that the experimenter failed to disprove the null
hypothesis. No further conclusion could be reached about the similarity or
difference of component A and component B from this Double Blind experiment.
Of course a near miss score like 19 / 30 may tempt the experimenter to attempt
more trials in the hope that the new score and the combined score will be
significant.
|