We on the FTDNA Taylor Surname Project Admin Team get
asked many times “Is the match I’ve found meaningful and what does it mean?” This
article tries to help answer that. A portion has been published on the FTDNA Taylor
Project blog.
Hold on, the ride could get bumpy. Do not read further
if you are not comfortable with Y-DNA basics & terminology. We'll assume you know what a marker is,
what STR means, & what a match is. First, visit one of the many sites on the subject
of Y-DNA in genealogy, including
this one.
Do not read further if you're hoping to substitute DNA for
traditional genealogy, based on documents and records. DNA can not tell you your
ancestors' names or where they lived and can only estimate a probable range of
time in which they might have lived. DNA helps focus documentary research;
it can't replace it. You should know genealogical time frames and standards.
Probability can prove or
disprove many questions to a high degree of reliability. It’s defined as a way of quantitatively
expressing knowledge or belief that an event will occur or has occurred. In
this article, we’ll draw from the concepts of probability theory,
a branch of mathematics.
The beginning question in our
exploration becomes “What are the odds that my found match could not be
due to chance?” Eventually, we’ll proceed to “When, probably, did our common
ancestor live?”
Perhaps, you’ve seen a graph
something like the simplified one below from FTDNA’s site:
“How”, you may wonder, “do they do that?”
“What does it mean?”
This article is to partly answer the first question. For the second, this
graph shows how more markers produce a "tighter" probability distribution,
leading to greater precision. We'll
use the same techniques in a different way.
Probability is not limited to DNA. It’s used to predict
weather, estimate chances of airplanes crashing, gauge public opinion, guide gambling
(its earliest use), control manufacturing quality, manage complex systems, and
many other applications.
In an egregious example, probability was widely used to
estimate the risk of default on collateralized mortgage securities & credit
default swaps. It failed
— mostly because of problems we'll discuss later.
Probabilities always are between zero (0) & one (1),
sometimes expressed in percentages – between 0% & 100%. Mathematically, it's expressed
as "0 < p < 1", where "p" represents the probability of a thing. Zero means a thing is
impossible; one (100%) means it is certain. Unlike athletes, probability can’t
“give 110%”.
Next, probability involves math, often complicated math. “Talking Barbie” said “Math is hard.”;
actually, the hardest part these days is coming up with the right questions to
ask and applying the right formulas relevant to the questions before letting
the computer do the math.
Third, probability models are approximations of selected aspects of the real world; they
do not represent it exactly and completely. They may come close enough for their
intended use, but do not over-interpret.
The last & most important thing to remember is that there is
no absolute certainty in probability;
absolute certainty is beyond its power. We might be 99.99% sure of a
probabilistic statement,
but can never be 100%. (The surety of a probability-based statement is called a
“confidence level”.)
Start with the definition,
basic to all probabilistic methods, 0 < p < 1. The probability is
greater than zero and less than one; it lies between them. Since that’s built
into the definition, no formulation based on it can yield a value <= zero or >= one.
We needn’t let an element of doubt dissuade us from using probabilities to reach highly valid
conclusions. 90% confidence is much better than knowing nothing.
Some probability terms
"Mean" = Average of all values in the distribution,
μ = x̄ = ∑x / N.
"Mode" = Most frequent value, where curve reaches its peak. Some
distributions have more than one peak (bi- or multi-modal), some have no
peak.
"Median" = the middle value when all values are sorted; half the values lie above it & half below.
"Variance" = a measure of dispersion, the spread about
the mean, Var = ∑(x-x̄)²/N; for samples, Var =
∑(x-x̄)²/(n-1). A large variance means less clustering.
"Standard Deviation" = a measure of dispersion derived by
taking the square root of the variance, σ =
√[∑(x-x̄)²/N]; for samples, s = √[∑(x-x̄)²/(n-1)]. The standard deviation
figures prominently in probability calculations.
"Slope" = the rate of increase or decrease on
the vertical axis with increase in the horizontal axis.
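As a rough illustration of the terms above, the sketch below computes each one with Python's standard statistics module. The sample values are made up for the purpose; they are not DNA data.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)          # sum of the values / how many there are
median = statistics.median(values)      # middle of the sorted values
mode = statistics.mode(values)          # most frequent value
pop_var = statistics.pvariance(values)  # sum((x - mean)**2) / N
samp_var = statistics.variance(values)  # sum((x - mean)**2) / (n - 1)
pop_sd = statistics.pstdev(values)      # square root of pop_var

print("mean", mean, "median", median, "mode", mode)
print("variance", pop_var, "sample variance", samp_var, "std dev", pop_sd)
```

Note that the mean (5) and the median (4) differ here because one large value pulls the mean upward.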
Mutations in Y-DNA, as reflected in STR marker/allele counts, aren’t common. A change in a marker
happens — on average — only every 250 to 400 generations. For simplicity’s sake, we’ll classify them
as
“Fast” markers, about 1 mutation in every 250 generations (1/250) — a frequency per generation
of 0.0040 = 0.40%, and
“Slow” markers, about 1 in every 400 generations (1/400) — a frequency of 0.0025 = 0.25%.
These are very small probabilities, but greater than zero. They allow us to
use probabilities for estimating & predicting.
To quote from Dr. Bruce Walsh, at the U. of Arizona
(emphasis added):
The basic idea is simple: Individuals that match at a higher fraction of markers are more
closely related. The formal logic is as follows: One can image {imagine?} the chromosome as a clock that slowly
ticks (i.e., one "tick" of the clock equals one mutation). Thus, a chromosome is a
molecular clock that ticks randomly within a specified rate. This paradoxically sounding phrase means that a clock
running longer has a higher probability of having more ticks than a clock that has been running
shorter. The more time, the more ticks and the older the time back to the MRCA.
Estimates of TMRCA are thus based on the observed number of mutations by which the two Y chromosomes
differ. Since mutations occur at random, the estimate of a TMRCA is not an exact number (i.e., 7
generations), but rather a probability distribution, a function that gives the probability that the TMRCA is a certain number of generations or less (i.e., a 47% probability that the TMRCA is 16
generations or less). This website shows the plot of these functions for the various marker matches
for 12 and 21 marker tests. As one uses more and more markers, the distribution becomes tighter and
tighter about its mean value, and estimates have higher precision.
MRCA = Most Recent Common Ancestor. (Earlier ancestors would also be
shared.)
TMRCA = Time (in generations) to Most Recent Common Ancestor
Statisticians use sampling
techniques to estimate characteristics of the real world when it’s impractical to
measure the entire population. However, the present sample sizes are small in relation to
what's being sampled; human
Y-DNA exists in billions of varieties, only a tiny fraction of which have been
tested. Existing DNA databases are very small in
relation to the population and this presents problems in finding matches.
We think, for
example, that there may be as many as 10,000 different haplotypes (individual
Y-DNA patterns) for the tens of millions of people bearing the Taylor surname. In the
FTDNA Taylor project, we have identified 172 haplotypes (29 groups plus 143
ungrouped). We’ve barely scratched the surface, yet 89 members (38%) have found
matches of sufficient quality to be formed into groups.
The samples are sufficiently large overall (in the tens of thousands) to
prove the concepts & theory on which genetic genealogy is based. The sampling problem here is that many specific Y-DNA haplotypes are not
included in the existing samples. As we get more people
testing, the chances of matching will improve.
Reading Probability Graphs
Graphs show probability data in "picture" form, take less space than tables &
are easier to understand than formulas. A two-dimensional graph has two axes, the vertical & the horizontal.
For the graphs we're concerned with:
The vertical or "Y" axis is always probability or confidence level. It's
a number between 0 & 1, or 0% & 100%.
The horizontal or "X" axis is the number of
past generations to
the MRCA. Depending on the graph, this can be either a specific number of
generations or a maximum number ("upper bound").
We are most interested in cumulative probabilities. The individual
probabilities for specific generations are small, so we want a range of
generations starting at 1 (our fathers) or some other number.
The Normal Curve, as illustration
By way of illustration, let's look at graphs of the familiar normal curve &
its cumulative probabilities. (The example is a standardized normal curve.)
We are NOT saying that DNA mutations follow normal distribution.
However, it demonstrates how probabilities can be accumulated over a range.
We'll abbreviate cumulative probability "cum p".
Highest probability is at mean (average), about 40%.
Probability decreases away from the mean.
Slope is relatively flat at tails, becomes steeper, then flattens & reverses at
mean.
Highest cum p (→100%) is at right of distribution.
Slope is almost flat at distance from mean, but always positive.
Slope is steepest near the mean, where individual probabilities are highest.
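For readers who want to see the numbers behind those two curves, here is a minimal sketch of the standardized normal curve and its cumulative probability. The function names are our own, for illustration; as noted above, this is not a model of DNA mutations.

```python
import math

def normal_pdf(z):
    """Height of the standardized normal curve at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    """Cumulative probability ("cum p") from the far left up to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# The height peaks at the mean (z = 0); cum p keeps rising toward 1.
for z in (-3, -2, -1, 0, 1, 2, 3):
    print(f"z={z:+d}  height={normal_pdf(z):.4f}  cum p={normal_cdf(z):.4f}")
```

The height at the mean comes out just under 0.40, and cum p at the mean is exactly one half.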
Binomial Distributions
The FTDNA graph (above) is of cumulative probabilities, based on one that looks more like this
graph of binomial distributions:
This is a traditional Y-DNA probability distribution, showing the probability of the MRCA being at each generation past.
(We've artificially cut off the Y axis at 5%; a 37 of 37 match reaches 11.5%
probability at 1 generation; scaling the graph to show this would make the other
curves harder to see.) Notice that the point of maximum probability
keeps shifting right & getting smaller as the match degrades from 37 to 33 of 37
markers.
These features are characteristic of binomial distributions. Also
characteristic is the nature of the "n=k" (e.g., 37 of 37) distribution. The curve for perfect matches — all tested markers are in agreement
— looks different from curves for good matches. The "perfect-match" curve has a very high
probability
at zero (0) generations. In math terms, it is asymptotic to the vertical
axis.
Each individual generation has a relatively low probability, less than 3% for
a 35 of 37 match; that's not very useful. But, we can add up the probabilities
of each generation as we go back in time and get a graph like this:
Illustrative only. Not to be used for research.
Here's something we can work with.
We can choose our degree of match, select a minimum probability level and
read down to the maximum number of generations.
We can also instantly see that a 33/37 match has only a 50% probability for
a MRCA at 40 generations or less (roughly 800 AD to present) and we have to
go out to 75 generations for 90%; therefore, it
might not be worth pursuing.
Or, we can do it for perfect matches, involving different numbers of markers:
Because these are "perfect" matches, the curves are asymptotic to the Y
axis.
With more markers, the cum p curve is steeper & 90% CL is reached
sooner.
The “confidence level”
(abbreviated CL) of a statistical statement describes how much confidence we
may have in it. FTDNA’s graph uses four points on the confidence spectrum (the
vertical axis) — 0, 50%, 90%, & 99%. Here’s what you need to know
about them:
<50% — less than 50%
confidence (or probability) means that a
statement is more likely to be wrong than right.
These levels are more useful in "null hypotheses" which scientists want to
disprove.
For example, a statement with a 10% CL is 90% likely to be wrong.
50% is even-steven; a statement with a 50% CL is
just as
likely to be wrong as right. It’s the same confidence that your next coin-toss
will come up heads. Most genealogists dismiss any conclusion with only a 50%
chance of truth as too “iffy”.
90% is the minimum CL you should accept. You can be 90%
sure that your MRCA falls that number of generations, or fewer, back in your
family tree.
95% is better confidence. Now, you’re 95% sure and the
chance of being wrong is only 1 out of 20.
99% is pretty darn good. The chance of being wrong falls
to 1 in 100.
A "confidence interval" is a more general term, which includes any part of
the probability distribution between two defined points. A confidence level
implies all of a distribution from the defined point to the left or right (for us,
the left).
To sum up, do not fuss about a 50% probability. The number associated with it
is as likely to be low as high. Use higher probabilities & confidence levels for
guidance.
Generations:
The horizontal axis of the
graph is maximum number of generations (upper bound) to reach the stated
confidence level. DNA doesn’t know when your ancestors were born or how old
their parents were and can only estimate the number of "transmission events" or
opportunities for mutation. Each generation is another opportunity for mutation.
The question of "How many years per generation?" has no simple or
universally-accepted answer. Estimates vary depending on time, area, & culture.
We think that 18th & 19th century American generations — on average — were
in the range of 25 to 30 years. If you prefer a different estimate, please feel free to substitute
your own.
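The conversion from generations to years is simple multiplication. A minimal sketch, assuming our 25 to 30 years per generation (the function name and defaults are only illustrative; substitute your own estimate):

```python
def generations_to_years(generations, low=25, high=30):
    """Return the (shortest, longest) span in years for a number of generations."""
    return generations * low, generations * high

print(generations_to_years(10))  # (250, 300)
print(generations_to_years(23))  # (575, 690)
```

The year ranges quoted in this article are rounded to the nearest ten.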
Matches of the minimum qualities
indicated below indicate a common ancestor. The question then becomes "When
might he have lived?" We’ll try to estimate the time of that MRCA, so that traditional documentary research can help identify
him specifically. However, be aware that ancestors who lived before about 1350
probably did not have surnames.
Perhaps, the other person only tested 12 markers; that’s all you can compare.
CL      12 of 12            11 of 12            10 of 12
        Gen    Years        Gen    Years        Gen    Years
50%     7      180-210      17     430-510      16.5   410-500
90%     23     580-690      39     980-1170     56     1400-1680
95%     29     730-870      47     1180-1410    72     1800-2160
The best possible match (12/12) carries 99% confidence of sharing a common ancestor.
But the upper bound for an
acceptable confidence level (90%) is 23 generations or 580-690 years.
This places your MRCA, at earliest, about
the mid-14th century, the time surnames came into general use.
Genealogically speaking, that’s a tough find.
Poorer quality matches (i.e.,
<12 of 12) may not be worth pursuing, due to the huge time window.
"The 12 Marker Y DNA test is an excellent tool to determine those whom are
not related within a group of people that share the same or similar surname."
Read the source for this quote.
"A 12 marker STR test is usually not discriminating enough to provide conclusive results for a common surname."
This source.
"..most laboratories and surname projects recommend testing at least 25
{markers}. The more markers
that are tested, the more discriminating and powerful the results will be. "
Source.
With a match of 23 to 25 out of 25, the probability is >= 99% that you share a common
ancestor.
CL      25 of 25            24 of 25            23 of 25
        Gen    Years        Gen    Years        Gen    Years
50%     3      80-90 *      7      180-210      11     280-330
90%     10     250-300      16     400-480      23     580-690
95%     13     330-390      20     500-600      27     680-810
25/25 match: you can be 90%
confident that your MRCA lived within the past 300 years (roughly,
1700 or later) & 95% sure of within the past 400 years; this is a solvable
genealogical problem. Research back to about 1850 is relatively easy through US
censuses; for the Colonial era, you'll need to become familiar with other
resources.
24/25 match: at 90% CL, the
upper bound is ~500 years ago, before European migration to America. You may have to find the MRCA in
the “Old Country”.
23/25 or poorer match: may not be worth pursuing, as the time window
is large at an acceptable CL.
Not enough people have yet opted for the 67-marker panel to make it likely that you will find a
match across 67 markers. But, if you do, a match of 60 or more of 67 markers carries a >= 99% probability of sharing a
common ancestor.
CL      67 of 67            66 of 67            65 of 67
        Gen    Years        Gen    Years        Gen    Years
50%     2      50-60        4      100-120      6      150-180
90%     4      100-120      8      200-240      12     300-360
95%     6      150-180      9      230-270      14     350-420
67/67 match: carries 90% confidence that the MRCA is no earlier than your
second-great-grandfather (4 generations) & 95% confidence he’s no earlier than your 4GGF (6 generations). (You may need
the AGI technique below.)
66/67 match: gives you a 240 year time frame at 90% CL & 270 at 95% CL.
65/67 match: gives you a 360 year window at 90% CL & 420 at 95% CL.
Some people have excellent documentation proving the MRCA can not be in the
latest few generations; let's call it additional genealogical information, "AGI". It
leads to the question "What happens to the probabilities assigned to those
generations?"
Suppose I have a 50% probability of being related to another person within 7
generations, and we know for certain that our MRCA can not possibly be within
the first four generations, so those generations are eliminated from the scope.
Is the probability of being related in the remaining 3 generations
increased? If so, by how much?
Dr. Walsh wrote on 11 Mar 2009:
"When you have two individuals with an EXACT match, the probabilities for each generation fall off by
exactly the same amount. The net result is that if (say) the probabilities for generations 1, 2, 3
are 0.4, 0.2, 0.1, then if you condition on the first (say) 5 generations being excluded, the
resulting CONDITIONAL probabilities for 6, 7, 8 become 0.4, 0.2, 0.1 (the same as for 1 , 2, 3)
Hence, the whole distribution is shifted without a shape change to start [at] the value you set.
This only happens with an exact match and can cause some confusion."
The sum of probabilities for those impossible
generations gets re-allocated to the remaining, earlier generations. The entire probability (& cumulative) curve simply gets shifted to the right by the
number of generations in which the MRCA can not exist. We can call this
an "AGI adjustment" and the resulting curve is a "conditional probability
distribution". The remaining probabilities are conditioned on the MRCA not
existing in the generations for which the paper trail shows it cannot be, as
demonstrated in the graph below.
For the probability points listed in the graphs & tables above, add the number of generations in which the MRCA is found not to exist.
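Dr. Walsh's point can be checked numerically. The sketch below assumes, purely for illustration, an exact-match distribution in which each generation's probability is half the one before it (a geometric fall-off with a hypothetical ratio of 0.5); the real ratio depends on the markers tested, and the function names are our own.

```python
def pmf(g, ratio=0.5):
    """Probability that the MRCA is exactly g generations back (g >= 1)."""
    return (1 - ratio) * ratio ** (g - 1)

def conditional_pmf(g, excluded, ratio=0.5):
    """Same probability, given the MRCA is NOT in generations 1..excluded."""
    if g <= excluded:
        return 0.0
    tail = ratio ** excluded  # total probability lying beyond the excluded generations
    return pmf(g, ratio) / tail

excluded = 5  # generations ruled out by the paper trail (hypothetical)
for g in (1, 2, 3):
    shifted = g + excluded
    print(f"gen {g}: {pmf(g):.4f}   gen {shifted} after AGI: "
          f"{conditional_pmf(shifted, excluded):.4f}")

# An upper bound read from a table shifts by the same amount.
print("10-generation bound becomes", 10 + excluded)
```

Under this assumption, the conditional probabilities for generations 6, 7 & 8 equal the original ones for 1, 2 & 3, which is the shift described above.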
See also Dr. John Chandler's reply to Roy.
Wrapping up this part:
Hopefully, we’ve explained
how probabilities help us make sense of Y-DNA matches and how you can use them
to focus your research on a common ancestor of you & your match.
Part 2 adds more detailed information
and explanations.
This part reviews some of the
prior discussion on probabilities with DNA in more detail. It’s for those who
are comfortable with statistics or want to learn more.
In Part 1, we said “The ..
most important thing to remember with probabilities is that there is
no
absolute certainty; absolute certainty is beyond its power. We might be 99.99% sure of a
statement, but can never be 100%.” That statement is controversial and some
(perhaps, followers of Descartes) don’t buy it. So, we’ll take more space to
explore the issues.
The proposition that some things can not be absolutely
known with complete confidence is hard to accept. (It shocked the world of
nuclear physics when Werner Heisenberg published his "Uncertainty Principle" in
the 1920s.) It is, however, crucial to understanding
why DNA can not give you an absolutely certain answer.
Probability theory applies to things that we wish to predict in advance or which can not practically be observed
directly and, therefore, must be inferred. When a thing can readily be
observed directly, we don’t need inference or prediction; observe it and remove
the doubt. There is always an element of doubt,
however tiny, to probability-derived statements. In fact, probabilistic
functions start with the definition “0 < p < 1”, so that any
probability for which they’re to be used is greater than zero and less than one.
Some may say that – if they
have flipped five coins and all come up heads – the probability of having
flipped five heads is 100%. They’re misreading the concept; probability applies
before you flip the coins or draw to that inside straight, not after. When you’ve directly
observed a thing or event, you do not need or want a probabilistic answer about
it; you simply count the five heads.
When we engage in
probability, we step into a “parallel universe” which operates by different
rules than 0/1, black/white, certain/impossible. In this universe, all things
are possible, but some are highly unlikely. A butterfly may, indeed, cause a
hurricane in Florida by flapping its wings in Brazil; it just isn’t very
probable. We may talk about theoretical
certainties or impossibilities, but they don’t show up in the world of probabilities;
confidence levels never reach 100%; there is always some small (perhaps minute)
element of doubt.
A failure to appreciate the lack of certainty in probabilistic conclusions
can have serious consequences. The 2008 collapse of global
credit markets is mostly due to this misunderstanding. Sophisticated financial
institutions did not realize the risks they were taking with complex,
probability-based investments.
It is true that the theoretical sum of an entire probability distribution
equals unity (100%). However, the sum includes a great many very small
probabilities, out at the "tails" of the distribution.
You may also want to look at a talk on aircraft software reliability. In short, the speaker
concludes “If claims for dependability can never be made with certainty,
we need a formalism that handles the uncertainty.”
The sources of probabilistic doubt are many; they include:
Measurement
error: Even the best scientific instruments are subject to some error.
Sampling error: There are many ways in which a sample may not accurately represent
an entire population; proper sampling methods must be followed to attain any
confidence in its representativeness. Further, even a properly drawn sample is
subject to limitations, usually expressed as a sampling error, largely
dependent on the sample’s size.
Approximation: The mathematical functions (formulas) which yield
probabilities are approximations of the real world.
Assumptions: “Housing prices never decline.” "Lehman Brothers is too big to
fail." Need we say more?
For a mathematical
demonstration, take the binomial distribution:
Most people ignore it, but we
start with the definition, 0 < p < 1. (The probability, p, lies between
zero and one and never equals either.) We'll show why this matters.
The formula for the standard deviation (basic to calculating probabilities) is σ = √(npq)
(the square root of n times p times q), where n is the number of trials, p is the probability & q = 1-p; for a single trial, σ = √(pq).
As p→0, q→1 and as p→1, q→0.
(The symbol → means "approaches" or "gets very close to".)
In either case, pq→0 and σ→0.
At p=0 or p=1, the distribution has no deviation; all the data points are concentrated at a
single place on the horizontal axis. A probability can not be calculated.
Skew: (“Skew” relates to the shape
of the function's graph curve, in which most of the distribution is shifted
either left or right and has an abnormally long left or right “tail”.)
The formula is Skew = (q-p)/√(npq), the difference
between q & p divided by the standard deviation.
As p→1, q→0 and their product, pq→0. (Similarly, as p→0, q→1 and their product, pq→0.)
As pq→0, 1/√(pq) →∞ [“infinity”] and becomes what mathematicians
call “undefined”; likewise the size of (q-p)/√(npq) →∞.
Attempt to use your computer spreadsheet program to divide by zero and it will
signal an error.
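The same behavior shows up in a few lines of Python. This sketch uses the standard single-trial forms, σ = √(pq) and skew = (q-p)/√(pq); the function names are ours, chosen for illustration.

```python
import math

def sigma(p):
    """Standard deviation of a single trial: sqrt(p * q)."""
    return math.sqrt(p * (1 - p))

def skew(p):
    """Skewness of a single trial: (q - p) / sqrt(p * q)."""
    q = 1 - p
    return (q - p) / math.sqrt(p * q)

# As p approaches 1, sigma shrinks toward 0 and the skew grows without limit.
for p in (0.5, 0.9, 0.99, 0.9999):
    print(f"p={p}  sigma={sigma(p):.5f}  skew={skew(p):.2f}")

# At exactly p = 1 the division fails, just as in a spreadsheet.
try:
    skew(1.0)
except ZeroDivisionError:
    print("p = 1: division by zero, skew is undefined")
```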
The frequency (or frequencies) of the random Y-DNA mutations establishes the
marginal probabilities we need to work with. Unfortunately, these frequencies
aren't wholly known.
Mutations in Y-DNA, as
reflected in STR marker/allele counts, aren’t common. In a 2000 study, Manfred
Kayser & Antti Sajantila looked at Y-DNA transmission in 15 loci (markers)
for each of 4,999 father/son pairs; they found only 14 total mutations and only
2 instances of two mutations per pair. (They
also found a relationship between the frequency of mutation and a locus’
molecular structure.) For 4,999 pairs
times 15 loci, the frequency of mutation reached a high of 8.6 × 10^-3 (0.0086)
and averaged 2.8 × 10^-3 (0.0028).
These changes in the loci are
(we think) random & independent of each other. Charles Kerchner, http://www.kerchner.com/dna-info.htm,
has studied the subject and estimated rates of mutation in individual markers.
For simplicity’s sake, we’ll classify them as
“Fast” markers, at about 1
mutation in every 250 generations (1/250), and
“Slow” markers, at about 1 in
every 400 generations (1/400).
Even the “fast” markers are
not especially quick. The frequency per generation (another term for these rates is
“marginal probability” per generation) calculates to:
“Fast”: 1/250 = 0.0040 = 0.40%
“Slow”: 1/400 = 0.0025 = 0.25%
The FTDNA Y-DNA STR panels have, respectively, by count:
“Fast” markers                3     8    13    13
“Slow” markers                9    17    24    54
Panel size, total markers    12    25    37    67
The probability of a
mutation, from one generation to the next, in any specific marker is very
small. However, it’s not zero and that fact enables our use of statistical
tools.
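To get a feel for how small these numbers are, the sketch below combines the “fast” & “slow” rates with the marker counts for each panel. It assumes, as we do above, that markers mutate independently; the names are ours and the result is only a rough per-generation illustration, not FTDNA's method.

```python
FAST_RATE = 1 / 250  # 0.0040 per marker per generation
SLOW_RATE = 1 / 400  # 0.0025 per marker per generation

# (fast markers, slow markers) for each panel size
PANELS = {12: (3, 9), 25: (8, 17), 37: (13, 24), 67: (13, 54)}

def expected_mutations(fast, slow):
    """Average number of mutations across the panel in one generation."""
    return fast * FAST_RATE + slow * SLOW_RATE

def prob_no_mutation(fast, slow):
    """Chance that no marker in the panel changes in one generation."""
    return (1 - FAST_RATE) ** fast * (1 - SLOW_RATE) ** slow

for size, (fast, slow) in PANELS.items():
    print(f"{size} markers: expect {expected_mutations(fast, slow):.4f} "
          f"mutations per generation, "
          f"{prob_no_mutation(fast, slow):.1%} chance of none")
```

Even across 67 markers, the expected number of mutations in a single father-to-son transmission stays well below one.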
Dr. Bruce Walsh of the University of Arizona likens the mutation probability
to a ticking clock. The clock ticks slowly and randomly, allowing us to estimate
a range of time back to the MRCA, based on the number of observed mutations.
Dice provide an example: A die (1/2 a pair of dice) can come up only with
integer values from 1 to 6; it can not come up 2.43746 or any other
fractional value. STR allele counts are reported as integer
values, making them discrete random variables.
Imagine a roulette wheel with about 500 slots, ~499 of them labeled
"Win" and 1 labeled "Lose". What are the chances the ball will land on
a Win slot 12 times out of 12? 24 of 25? 34 of 37?
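The roulette questions can be answered with the binomial formula. A minimal sketch, assuming exactly 500 slots with 499 marked "Win" (the text says "about"), and independent spins:

```python
from math import comb

P_WIN = 499 / 500  # assumed chance of landing on "Win" in one spin

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

for k, n in ((12, 12), (24, 25), (34, 37)):
    print(f"{k} of {n}: {binomial_pmf(k, n, P_WIN):.6f}")
```

Winning every spin is by far the most likely outcome; each additional loss makes the result much rarer.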
Out near the tail of a distribution, the probability curve's slope gets so very
nearly flat that assuming no slope at all is a reasonable approximation.
This simplifies probability calculations for rare events.
With each new generation, a "transmission event", there's a small opportunity for Y-DNA
to change. The DNA of the present generation is dependent on the results of
all the transmission events that have gone before.
If you’re interested in the
deleted material beyond the links or what we have below, please ask. It’s fascinating stuff for the
math-junky.
Short version: The FTDNA
probability tables & graph are based on comparisons of probabilities to
derive a confidence level that the match you see is not due to random
chance (represents a common ancestor) and to estimate the time back to the
most recent common ancestor (MRCA).
Statisticians use the term
“significance level” when talking about a single probability, for questions
like “Is my 6’10” nephew tall compared to other boys?” (His height would be
compared to the mean and standard deviation of the normal distribution.)
A “confidence level” applies to
questions like “Are boys taller than girls?” (The distributions of boys’
heights and girls’ heights would be compared.)
In our earlier discussion, we didn't distinguish between significance &
confidence levels.
Binomial Distributions
Dr. Walsh wrote on 11 Mar 2009:
"The actual distribution is a multivariate geometric --- essentially a series of binomials."
The formula for the probability of k successes in n trials, p the probability
of success in each trial, is P = n! / [k!(n-k)!] * p^k * (1-p)^(n-k).
Note that at k = n, n!/[k!(n-k)!] = 1 {(n-n)! = 0! = 1},
& (1-p)^(n-k) = (1-p)^0 = 1. At k=n, the formula reduces to
P = p^k = p^n.
Poisson Distribution
One advantage of the Poisson distribution is not needing to know very much
about the underlying distribution; all you really need is an average per unit of
time, space, etc. {The mean & variance are the same.}
Another advantage is its applicability to integer variables.
Its very first application was in 1898, related to Prussian
Army deaths by horse-kicks. It's heavily relied on in "Queuing theory", for
study of traffic congestion or how much capacity is needed.
Imagine a bakery bakes four loaves of a specialty bread every day and, on
average, sells three (3) of them. What are the chances it will have customers asking
for five (5) loaves?
Formula:
Pr(k;λ) = λ^k / [e^λ k!], with λ = 3, k = 5,
e ~ 2.718
Substituting, Pr = 3^5 / [e^3 5!] = 243 / [~20.09*120] = 243 /
[~2410] ~ 0.101 ~ 10.1%
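The bakery arithmetic is easy to verify. This sketch computes the chance of exactly five requests and, as a related figure we have added, the chance of more requests than the four loaves baked; it assumes requests follow a Poisson distribution with an average of 3 per day.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

AVERAGE = 3  # loaves sold on an average day

p_exactly_five = poisson_pmf(5, AVERAGE)
# Chance that demand exceeds the 4 loaves baked: 1 minus P(0, 1, 2, 3 or 4).
p_more_than_four = 1 - sum(poisson_pmf(k, AVERAGE) for k in range(5))

print(f"exactly 5 requests: {p_exactly_five:.3f}")
print(f"more than 4 requests: {p_more_than_four:.3f}")
```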
If the probability of an event is very small {0 < p << 1}, the Poisson
distribution resembles the binomial distribution. It may become a fair
approximation to the binomial for approximate mutation probabilities, where
0.002 <= p <= 0.004.
For this flat-tail condition: Pr(Y=y) = e^(-µ) * µ^y / y! =
µ^y / [e^µ y!], where
e = base of natural logarithms (~2.718), µ = mean, y = occurrences
for which probability is desired,
y! = factorial of y
{y*(y-1)*(y-2)*(y-3)...1}.
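How close is the approximation? The sketch below compares the two distributions for a hypothetical case: 37 markers, each with a 0.004 chance of mutating in one generation, so that the Poisson mean is µ = 37 * 0.004. The setup is ours, for illustration only.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(y, mu):
    """Poisson probability of exactly y events when the mean is mu."""
    return mu ** y * exp(-mu) / factorial(y)

n, p = 37, 0.004  # hypothetical panel size and per-marker mutation probability
mu = n * p        # the mean number of mutations

for y in range(4):
    b = binomial_pmf(y, n, p)
    q = poisson_pmf(y, mu)
    print(f"{y} mutations: binomial {b:.5f}  Poisson {q:.5f}  difference {abs(b - q):.5f}")
```

For probabilities this small, the two sets of figures agree to about three decimal places.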
Multiple Independent Events
The probability of all of the events occurring is the product of their
individual probabilities: p(A & B) = p(A)*p(B) ; p(A, B & C) = p(A)*p(B)*p(C), etc.
A jar has 15 marbles: 5 red, 5 blue and 5 yellow. You draw a
marble at random, replace it & then repeat this twice more. What are
the chances of drawing three (3) red marbles?
The individual chance of drawing red is p(red) = 5/15 = 1/3; the chance of drawing
three reds is
p(3 reds) = 1/3 * 1/3 * 1/3 = 1/27 ~ 0.037 = 3.7%.
When events are not independent, p(A & B) = p(A) * p(B|A), where p(B|A) is the conditional probability of B
given A.
Or, p(B|A) = p(A&B)/p(A)
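In the marble example, assume you drew a red first and do not
replace drawn marbles.
What are the chances of drawing blue in either of the next two tries? By not
replacing the red marble, you affect the chances for future draws.
First try:
p(b|r) = 5/14 ~ 0.357 = 35.7%. Within two tries: p = 5/14 + (9/14 * 5/13) = 110/182 ~ 0.604
= 60.4%. (The second term is the chance of missing blue on the first try, 9/14, times the chance of then drawing blue from the 13 marbles left, 5/13.)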
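Exact fractions make marble calculations like these easy to check. A minimal sketch using Python's fractions module; the variable names are ours:

```python
from fractions import Fraction

# Independent draws (with replacement): three reds in a row.
p_red = Fraction(5, 15)
p_three_reds = p_red ** 3

# Dependent draws (no replacement), after one red has been removed:
# 14 marbles remain, 5 of them blue.
p_blue_first = Fraction(5, 14)
p_miss_then_blue = Fraction(9, 14) * Fraction(5, 13)
p_blue_within_two = p_blue_first + p_miss_then_blue

# Cross-check: 1 minus the chance of missing blue on both tries.
p_check = 1 - Fraction(9, 14) * Fraction(8, 13)

print("three reds:", p_three_reds, float(p_three_reds))
print("blue on first try:", p_blue_first, float(p_blue_first))
print("blue within two tries:", p_blue_within_two, float(p_blue_within_two))
```

The two routes to "blue within two tries" give the same fraction, which is a useful habit when checking conditional probabilities by hand.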
A confidence level (CL,
or confidence
interval) tells you the balance of certainty versus uncertainty in a
conclusion, such as “The MRCA for me and my match was no more than 8 generations
ago.” With a CL of 90%, you can be 9/10 certain and 1/10 doubtful of the
statement.
FTDNA’s graph uses four
points on the confidence spectrum (the vertical axis). Here’s what you
need to know about confidence levels:
50% is as likely to be wrong as right;
do not fall into the trap of thinking it means something. It’s the same confidence
that your next coin-toss will come up heads. A 50% CL is mostly good for
providing an anchor point on the graph lines; it isn’t much good for prediction
or estimating because your MRCA could, with equal probability, lie on
either side of the line.
90% is the minimum acceptable in most scientific studies. You
can be 90% sure that your MRCA falls that number of generations or fewer back
in your family tree.
95% is better confidence. Now, you’re
95% sure and the chance of being wrong is only 1 out of 20.
99% is pretty darn good. The chance of being wrong falls to 1/100.
99.5% & 99.9% are occasionally
found in some scientific studies – if the statistics hold up to this close a
scrutiny.
Bounds: A confidence interval has both an upper bound and a
lower bound. We’re mostly concerned with the upper bounds because the lower
bounds in DNA interpretation tend to fall within those generations (0-4) where
documentary research tells us what we need to know.
The horizontal axis is time, the
maximum number of generations back to the MRCA. It’s measured in generations
because DNA doesn’t know the parents' age when the children were
conceived.
The data used in Part 1 was
taken from this table (& others as noted) & reorganized for a (hopefully) simpler explanation
for the reader. The calculation of year ranges is our own.
Notice that the data & graph
talk about maximum number of generations (“upper bound”), for which we’ve used
the less-than-or-equal-to sign (<=). For a given match & CL, the MRCA
will be no more than that number in the past.
Notice, too, that – as we ask for
more confidence – the generations necessary to reach that confidence level
increase.
And, as the quality of the match
decreases, the generations’ upper bound increases – sometimes at an exponential
rate.
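Binomial probabilities best fit the phenomenon of DNA matches. The equation
for the probability of exactly k successes in n trials is
$$P(k) = \frac{n!}{k!\,(n-k)!}\; p^{k}\,(1-p)^{n-k}$$
where
n = the number of trials (markers);
k = the number of successes (matching markers), k <= n;
p = the probability of success on a single trial (here, that a marker still matches; 1-p is then the probability of a change (mutation) in that marker);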
! = the factorial sign, e.g., k! = k*(k-1)*(k-2)*(k-3)....*1. (The
factorial terms are needed to account for the various combinations &
permutations of k successes in n trials.)
A number or letter written above & to the right of the one preceding is
an exponent, meaning the preceding is "raised to the power of", or
multiplied by itself that many times.
The formula
explains why a perfect match, n=k, is asymptotic to the vertical axis.
The factorial term evaluates to 1: (n-k)! = 0! = 1, so n! is divided by itself.
The last term evaluates to 1: (1-p)^(n-k) = (1-p)^0 = 1.
We're left with p^k = p^n.
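As a minimal sketch (ours, not FTDNA's method), the binomial formula can be tried out in a few lines of Python. Here p is treated as the per-marker probability of a success, and the value 0.98 is a made-up figure used only for illustration. The last lines check that a perfect match collapses to p raised to the power n.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    # comb(n, k) is the factorial term n! / (k! * (n-k)!)
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical per-marker success probability, for illustration only.
p = 0.98
n = 37

perfect = binomial_pmf(n, n, p)   # perfect match, k = n
print(perfect, p**n)              # the two values agree
```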
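The formula for the cumulative probability, the chance of k or fewer successes, is
$$P(X \le k) = \sum_{i=0}^{k} \frac{n!}{i!\,(n-i)!}\; p^{i}\,(1-p)^{n-i}$$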
We suspect, in the absence of knowing, that the quoted generational
probabilities represent the ratios of two (or more?) binomial distributions.
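A cumulative binomial probability can be sketched in Python as below. This assumes the usual textbook definition (the chance of k or fewer successes in n trials); we do not know that FTDNA computes its figures this way, and the function name is ours.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Probability of k or fewer successes in n independent trials."""
    # Add up the single-outcome binomial probabilities for 0, 1, ..., k.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Small check with a fair coin tossed twice: at most one head.
print(binomial_cdf(1, 2, 0.5))  # 0.75
```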
Converting generations to
years is tricky. We haven’t found a solid reference. Part of our problem is
that we're searching for an average figure for many families over many
generations. Estimates vary from as
few as 15 to as many as 35 years per generation. Undoubtedly, the time, place
and culture of the population matter.
One not-so-hypothetical family illustrates the complexity of the
problem:
Jim marries Betsy and their first child is born when both are
18; they continue having a child every other year or so until Betsy dies
at 35. Jim then marries Sarah, age 24, and they have children until Jim
is 45. When the last child is born, the first is an adult and also
having children, a generational overlap. The years-per-generation for Jim's children ranges from 18 to 45.
One British study used an
average of 35 years per generation. While this may apply to a mature British population,
the American population in the 18th & 19th centuries
was heavily weighted toward the young. Fewer than 32% of Americans in 1810 were
over 25 years of age, while almost 35% were under 10.
Americans married and bore
their first children young, often as young as 18. While later siblings may have
been born at a later age, by age 40 most families were done having children. I
believe something closer to 27 applies for my American Taylors. So, let’s take
a range of 25 to 30.
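The conversion itself is simple arithmetic. A small sketch, using the 25-to-30-year range chosen above; the function name and the sample figure of 8 generations are ours, picked only for illustration.

```python
def generations_to_years(generations: int, low: int = 25, high: int = 30) -> tuple[int, int]:
    """Convert a generation count to a (low, high) range of years.

    low and high are the assumed years per generation.
    """
    return generations * low, generations * high


print(generations_to_years(8))  # (200, 240)
```

Changing the two defaults shows how sensitive the year range is to the years-per-generation assumption.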
The graph below shows the age
distribution for free white persons enumerated in the 1810 U.S. federal census:
Be sure you've read the Interpretations section in Part 1. To reprise a part, in plain language:
A 35/37 match indicates you can be 90% confident your MRCA was no more than 12
generations in the past and 95% confident he was no more than 14 generations in
the past.
A 36/37 match means you can be 90%
confident your MRCA was no more than 8 generations in the past & 95%
confident your MRCA was no more than 10 generations in the past.
A 37/37 match means you can be 90%
confident your MRCA was no more than 3 generations in the past and 95%
confident he was no more than 7 generations past.
Interpreting non-matching markers
Ken Nordtvedt has published an article in the
Fall 2008
Journal of Genetic Genealogy discussing a method for going beyond simple
interpretation of matching vs. non-matching markers and into the question of
meaning of the differences in non-matching markers. In short, he establishes
modal values to infer the CMA haplotype and computes distances from that mode. He
goes on to use Bayesian analysis to estimate a probability distribution. His
estimate for the probability distribution's peak is:
G = n/(2M), where
G = number of generations,
n = number of mutations (not markers as previously), and
M = the sum of marker mutation rates.
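To show how the peak estimate is evaluated, here is a minimal Python sketch. The flat rate of 0.004 mutations per marker per generation is a hypothetical placeholder, not a figure from Nordtvedt's article; real rates differ from marker to marker.

```python
def peak_generations(n_mutations: int, mutation_rates: list[float]) -> float:
    """Peak of the distribution: G = n / (2M)."""
    M = sum(mutation_rates)  # M = sum of the marker mutation rates
    return n_mutations / (2 * M)


# Hypothetical: 37 markers, each given the same made-up mutation rate.
rates = [0.004] * 37
print(peak_generations(2, rates))  # roughly 6.8 generations for 2 mutations
```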
Cumulative probability is given by:
$$\mathrm{Prob}(G) \sim \int G^{GD}\, e^{-2MG}\, dG$$
where
GD = genetic distance (marker differences),
e = base of natural logarithms ~ 2.718,
G & M are generations & mutation rates, as above.
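One way to turn that expression into numbers is sketched below. It rests on two assumptions of ours: that the integral runs from 0 up to G, and that it is divided by its total from 0 to infinity so the result runs from 0 to 1. Under those assumptions, and for a whole-number GD, the integral has a known closed form (the Erlang distribution), so no numerical integration is needed. The value of M is hypothetical.

```python
from math import exp, factorial


def cumulative_prob(g: float, gd: int, m: float) -> float:
    """Normalized integral of x**gd * exp(-2*m*x) for x from 0 to g.

    Assumes gd is a whole number; uses the closed form of the integral.
    """
    x = 2.0 * m * g
    tail = sum(x**i / factorial(i) for i in range(gd + 1))
    return 1.0 - exp(-x) * tail


M = 0.148  # hypothetical sum of marker mutation rates
for g in (5, 10, 20, 40):
    print(g, round(cumulative_prob(g, gd=2, m=M), 3))
```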
This author confesses to less than full understanding of Nordtvedt's article.
Read it by clicking on the
link above.
We've run some numbers using these formulas and they don't seem to produce usable results.
In Part 1, we talked about "conditional probabilities" in relation to documentation proving the MRCMA can not be in the latest few generations. It's good to
revisit this in a more formal way.
The formula for the probability of B, given A, is:
$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$
Example:
A jar contains black and white marbles. Two marbles are chosen without
replacement. The probability of selecting a black marble and then a
white marble is 0.34, and the probability of selecting a black marble on
the first draw is 0.47. What is the probability of selecting a white
marble on the second draw, given that the first marble drawn was black?
Solution:
P(White|Black) = P(Black and White) / P(Black) = 0.34 / 0.47 = 0.72 = 72%
In the same way, given that the MRCMA can not be in the latest 4 generations,
the probability of being in earlier generations is increased.
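Both the marble example and the MRCA case can be sketched in Python. The cumulative figures used for the MRCA (30% within 4 generations, 90% within 8) are invented for illustration and are not taken from FTDNA's tables; the function names are ours.

```python
def conditional(p_a_and_b: float, p_a: float) -> float:
    """P(B | A) = P(A and B) / P(A)."""
    return p_a_and_b / p_a


# Marble example from the text.
print(round(conditional(0.34, 0.47), 2))  # 0.72


def band_given_excluded(cum_g: float, cum_excluded: float) -> float:
    """Chance the MRCA lies after the excluded generations but within g,
    given that records rule out the excluded generations.

    cum_g        = cumulative probability of an MRCA within g generations
    cum_excluded = cumulative probability of an MRCA within the excluded ones
    """
    return conditional(cum_g - cum_excluded, 1.0 - cum_excluded)


# Hypothetical: 30% within 4 generations, 90% within 8.
# Without the records, generations 5-8 carry 0.90 - 0.30 = 0.60.
print(round(band_given_excluded(0.90, 0.30), 3))  # about 0.857 once 1-4 are ruled out
```

The probability removed from the ruled-out generations is spread over the remaining ones in proportion, which is why every earlier generation becomes more likely.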
Conclusion:
The statistical methods used by FTDNA are, in some
sense, a “black box”. We do not know the formulas used. Nor do we have access
to the data. We’ve used the summary data they’ve made available and supplemented
it with estimates from other sources. Should we get
more information, we’ll do a Part 3.