Probabilities in DNA

by Ralph Taylor

Part 1: Introduction and Overview

We on the FTDNA Taylor Surname Project Admin Team get asked many times “Is the match I’ve found meaningful and what does it mean?” This article tries to help answer that. A portion has been published on the FTDNA Taylor Project blog.

Hold on, the ride could get bumpy. Do not read further if you are not comfortable with Y-DNA basics & terminology. We'll assume you know what a marker is, what STR means, & what a match is. First, visit one of the many sites on the subject of Y-DNA in genealogy, including this one.

Do not read further if you're hoping to substitute DNA for traditional genealogy, based on documents and records. DNA can not tell you your ancestors' names or where they lived and can only estimate a probable range of time in which they might have lived. DNA helps focus documentary research; it can't replace it. You should know genealogical time frames and standards.

Always, answers to the meaning of matches involve probabilities – statistics. Changes in Y-DNA from father to son are infrequent, but they happen, we think, randomly. Statistical probability analysis provides the best tools for dealing with randomness. To quote Dr. Bruce Walsh of the University of Arizona, “Since mutations occur at random, the estimate of a TMRCA is not an exact number (i.e., 7 generations), but rather a probability distribution.”

Probability can prove or disprove many questions to a high degree of reliability. It’s defined as a way of quantitatively expressing knowledge or belief that an event will occur or has occurred. In this article, we’ll draw from the concepts of probability theory, a branch of mathematics.

The beginning question in our exploration becomes “What are the odds that my found match could not be due to chance?” Eventually, we’ll proceed to “When, probably, did our common ancestor live?”

Perhaps, you’ve seen a graph something like the simplified one below from FTDNA’s site:

“How”, you may wonder, “do they do that?”
“What does it mean?”

Source:
http://www.familytreedna.com/faq-markers.aspx

This article is to partly answer the first question. For the second, this graph shows how more markers produce a "tighter" probability distribution, leading to greater precision. We'll use the same techniques in a different way.

Relatedness:

In Y-DNA, “related” has a special meaning, "sharing a common direct male ancestor with another person". It does not include

your in-laws or
even your paternal grandmother or maternal grandfather.

It does include:

your brother or

any direct male ancestor such as your father,

his father,
his father’s father,
his father’s father’s father’s father, etc. or
his brother's sons.
But, this must be an unbroken gender line; any females in the descendancy will break the transmission chain of Y-DNA.

Probability Applications:

Probability is not limited to DNA. It’s used to predict weather, estimate chances of airplanes crashing, gauge public opinion, guide gambling (its earliest use), control manufacturing quality, manage complex systems, and many other applications.

In an egregious example, probability was widely used to estimate the risk of default on collateralized mortgage securities & credit default swaps. It failed — mostly because of problems we'll discuss later.

Probability Basics:

Probabilities always are between zero (0) & one (1), sometimes expressed in percentages – between 0% & 100%. Mathematically, it's expressed as "0 < p < 1", where "p" represents the probability of a thing. Zero means a thing is impossible; one (100%) means it is certain. Unlike athletes, probability can’t “give 110%”.

Next, probability involves math, often complicated math. “Talking Barbie” said “Math is hard.”; actually, the hardest part these days is coming up with the right questions to ask and applying the right formulas relevant to the questions before letting the computer do the math.

Third, probability models are approximations of selected aspects of the real world; they do not represent it exactly and completely. They may come close enough for their intended use, but do not over-interpret.

The last & most important thing to remember is that there is no absolute certainty in probability; absolute certainty is beyond its power. We might be 99.99% sure of a probabilistic statement, but can never be 100%. (The surety of a probability-based statement is called a “confidence level”.)

Start with the definition, basic to all probabilistic methods, 0 < p < 1. The probability is greater than zero and less than one; it lies between them. Since that’s built into the definition, no formulation based on it can yield <= zero or >= one.
See Part 2 for more discussion of this controversial statement.
We needn’t let an element of doubt dissuade us from using probabilities to reach highly valid conclusions. 90% confidence is much better than knowing nothing.

Some probability terms

"Mean" = Average of all values in the distribution, μ = x̄ = ∑x/N.
"Mode" = Most frequent value, where curve reaches its peak. Some distributions have more than one peak (bi- or multi-modal), some have no peak.
"Median" = the middle value when the data are sorted; half of the values lie above it & half below.
"Variance" = a measure of dispersion, the spread about the mean, Var = ∑(x − x̄)²/N; for samples, Var = ∑(x − x̄)²/(n − 1). A large variance means less clustering.
"Standard Deviation" = a measure of dispersion derived by taking the square root of the variance, σ = √[∑(x − x̄)²/N]; for samples, s = √[∑(x − x̄)²/(n − 1)]. The standard deviation figures prominently in probability calculations.
"Slope" = the rate of increase or decrease on the vertical axis with increase in the horizontal axis.

[Figures: positive slope; negative slope; slope positive, then negative]
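
As a rough illustration of the terms defined above, this Python sketch computes them for a small data set using the standard library's statistics module. The values are made up purely to keep the arithmetic easy; they have nothing to do with DNA.

```python
import statistics

# Hypothetical values, chosen only so the results come out as round numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)            # sum of the values / how many there are
mode = statistics.mode(values)            # most frequent value
median = statistics.median(values)        # middle value of the sorted data
pop_var = statistics.pvariance(values)    # sum((x - mean)^2) / N
sample_var = statistics.variance(values)  # sum((x - mean)^2) / (n - 1)
pop_sd = statistics.pstdev(values)        # square root of pop_var

print(mean, mode, median, pop_var, sample_var, pop_sd)
```

Note that the median (4.5) and the mean (5) differ here, because the data are not symmetric about the mean.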

Probability of Mutations:

Mutations in Y-DNA, as reflected in STR marker/allele counts, aren’t common. A change in a marker happens — on average — only every 250 to 400 generations. For simplicity’s sake, we’ll classify them as

“Fast” markers, about 1 mutation in every 250 generations (1/250) — a frequency per generation of 0.0040 = 0.40%, and
“Slow” markers, about 1 in every 400 generations (1/400) — a frequency of 0.0025 = 0.25%.

These are very small probabilities, but greater than zero. They allow us to use probabilities for estimating & predicting.

To quote from Dr. Bruce Walsh, at the U. of Arizona (emphasis added):

The basic idea is simple:
Individuals that match at a higher fraction of markers are more closely related. The formal logic is as follows: One can image {imagine?} the chromosome as a clock that slowly ticks (i.e., one "tick" of the clock equals one mutation). Thus, a chromosome is a molecular clock that ticks randomly within a specified rate. This paradoxically sounding phrase means that a clock running longer has a higher probability of having more ticks than a clock that has been running shorter. The more time, the more ticks and the older the time back to the MRCA.

Estimates of TMRCA are thus based on the observed number of mutations by which the two Y chromosomes differ. Since mutations occur at random, the estimate of a TMRCA is not an exact number (i.e., 7 generations), but rather a probability distribution, a function that gives the probability that the TMRCA is a certain number of generations or less (i.e., a 47% probability that the TMRCA is 16 generations or less). This website shows the plot of these functions for the various marker matches for 12 and 21 marker tests. As one uses more and more markers, the distribution becomes tighter and tighter about its mean value, and estimates have higher precision.

MRCA = Most Recent Common Ancestor. (Earlier ancestors would also be shared.)
TMRCA = Time (in generations) to Most Recent Common Ancestor

The Sampling Problem:

Statisticians use sampling techniques to estimate characteristics of the real world when it’s impractical to measure the entire population. However, the present sample sizes are small in relation to what's being sampled; human Y-DNA exists in billions of varieties, only a tiny fraction of which have been tested. Existing DNA databases are very small in relation to the population and this presents problems in finding matches.

We think, for example, that there may be as many as 10,000 different haplotypes (individual Y-DNA patterns) for the tens of millions of people bearing the Taylor surname. In the FTDNA Taylor project, we have identified 172 haplotypes (29 groups plus 143 ungrouped). We’ve barely scratched the surface, yet 89 members (38%) have found matches of sufficient quality to be formed into groups.

The samples are sufficiently large overall (in the tens of thousands) to prove the concepts & theory on which genetic genealogy is based. The sampling problem here is that many specific Y-DNA haplotypes are not included in the existing samples. As we get more people testing, the chances of matching will improve.

Reading Probability Graphs

Graphs show probability data in "picture" form, take less space than tables & are easier to understand than formulas. A two-dimensional graph has two axes, the vertical & the horizontal. For the graphs we're concerned with:

The vertical or "Y" axis is always probability or confidence level. It's a number between 0 & 1, or 0% & 100%.
The horizontal or "X" axis is the number of past generations to the MRCA. Depending on the graph, this can be either a specific number of generations or a maximum number ("upper bound").

Cumulative Probabilities

We are most interested in cumulative probabilities. The individual probabilities for specific generations are small, so we want a range of generations starting at 1 (our fathers) or some other number.

The Normal Curve, as illustration

By way of illustration, let's look at graphs of the familiar normal curve & its cumulative probabilities. (The example is a standardized normal curve.) We are NOT saying that DNA mutations follow normal distribution. However, it demonstrates how probabilities can be accumulated over a range. We'll abbreviate cumulative probability "cum p".

Highest probability density is at the mean (average), about 0.40.
Probability decreases away from the mean.
Slope is relatively flat at tails, becomes steeper, then flattens & reverses at mean.

cumulative probability of normal distribution

Highest cum p (→100%) is at right of distribution.
Slope is almost flat at distance from mean, but always positive.
Slope is steepest near the mean, where individual probabilities are highest.
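
For readers who want to reproduce the two curves described above, here is a minimal Python sketch of the standardized normal curve and its cumulative probability. The function names are our own; the formulas are the standard textbook ones, and, as noted above, none of this implies that DNA mutations follow a normal distribution.

```python
import math

def normal_pdf(z: float) -> float:
    """Height of the standardized normal curve at z (mean 0, std. deviation 1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cum_p(z: float) -> float:
    """Cumulative probability: the area under the curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-3, -2, -1, 0, 1, 2, 3):
    print(f"z={z:+d}  height={normal_pdf(z):.4f}  cum p={normal_cum_p(z):.4f}")
```

The printed heights peak at z = 0 and fall away on both sides, while cum p keeps rising from left to right, steepest near the mean.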

Binomial Distributions

The FTDNA graph (above) is of cumulative probabilities, based on one that looks more like this graph of binomial distributions:

probability of TMRCA

This is a traditional Y-DNA probability distribution, showing the probability of the MRCA being at each generation past. (We've artificially cut off the Y axis at 5%; a 37 of 37 match reaches 11.5% probability at 1 generation; scaling the graph to show this would make the other curves harder to see.) Notice that the point of maximum probability keeps shifting right & getting smaller as the match degrades from 37 to 33 of 37 markers.

These features are characteristic of binomial distributions. Also characteristic is the nature of the "n=k" (e.g., 37 of 37) distribution. The curve for perfect matches — all tested markers are in agreement — looks different from curves for good matches. The "perfect-match" curve has a very high probability at zero (0) generations. In math terms, it is asymptotic to the vertical axis.

Each individual generation has a relatively low probability, less than 3% for a 35 of 37 match; that's not very useful. But, we can add up the probabilities of each generation as we go back in time and get a graph like this:

Illustrative only. Not to be used for research.

Here's something we can work with.

We can choose our degree of match, select a minimum probability level and read down to the maximum number of generations.

We can also instantly see that a 33/37 match has only a 50% probability for a MRCA at 40 generations or less (roughly 800 AD to present) and we have to go out to 75 generations for 90%; therefore, it might not be worth pursuing.
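
The "add up, then read off" procedure can be sketched in a few lines of Python. The per-generation probabilities below are hypothetical (each is simply half the one before) and are NOT real Y-DNA figures; the function name is also our own. The point is only to show how a cumulative curve is built and how a generation count is read from it at a chosen confidence level.

```python
from itertools import accumulate

# Hypothetical per-generation probabilities for generations 1, 2, 3, ...
# Illustrative only; not taken from any TMRCA calculator.
per_generation = [0.5 ** g for g in range(1, 21)]

# Running total: cum_p[i] is the probability of an MRCA within i + 1 generations.
cum_p = list(accumulate(per_generation))

def generations_for(confidence: float) -> int:
    """Smallest number of generations whose cumulative probability reaches `confidence`."""
    for generation, total in enumerate(cum_p, start=1):
        if total >= confidence:
            return generation
    raise ValueError("confidence level not reached within the listed generations")

print(generations_for(0.50), generations_for(0.90), generations_for(0.99))
```

With these made-up numbers, higher confidence levels require going back more generations, which is the same pattern the real curves show.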

Or, we can do it for perfect matches, involving different numbers of markers:

Because these are "perfect" matches, the curves are asymptotic to the Y axis.

With more markers, the cum p curve is steeper & 90% CL is reached sooner.

Confidence:

The “confidence level” (abbreviated CL) of a statistical statement describes how much confidence we may have in it. FTDNA’s graph uses four points on the confidence spectrum (the vertical axis) — 0, 50%, 90%, & 99%. Here’s what you need to know about them:

<50% — less than 50% confidence (or probability) means that a statement is more likely to be wrong than right. These levels are more useful in "null hypotheses" which scientists want to disprove.
- For example, a statement with a 10% CL is 90% likely to be wrong.
50% is even-steven; a statement with a 50% CL is just as likely to be wrong as right. It’s the same confidence that your next coin-toss will come up heads. Most genealogists dismiss any conclusion with only a 50% chance of truth as too “iffy”.
90% is the minimum CL you should accept. You can be 90% sure that your MRCA falls that number of generations, or fewer, back in your family tree.
95% is better confidence. Now, you’re 95% sure and the chance of being wrong is only 1 out of 20.
99% is pretty darn good. The chance of being wrong falls to 1 in 100.

A "confidence interval" is a more general term, which includes any part of the probability distribution between two defined points. A confidence level implies all of a distribution from the defined point to the left or right (for us, the left).

To sum up, do not fuss about a 50% probability. The number associated with it is as likely to be low as high. Use higher probabilities & confidence levels for guidance.

Generations:

The horizontal axis of the graph is maximum number of generations (upper bound) to reach the stated confidence level. DNA doesn’t know when your ancestors were born or how old their parents were and can only estimate the number of "transmission events" or opportunities for mutation. Each generation is another opportunity for mutation.

The question of "How many years per generation?" has no simple or universally-accepted answer. Estimates vary depending on time, area, & culture.

We think that 18th & 19th century American generations — on average — were in the range of 25 to 30 years. If you prefer a different estimate, please feel free to substitute your own.
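
Converting a generation count into a span of years is simple multiplication. The sketch below assumes the 25-to-30-year range given above; the constant and function names are our own, and you can substitute a different range.

```python
# Assumed years per generation (low, high), from the estimate in the text.
YEARS_PER_GENERATION = (25, 30)

def generations_to_years(generations: int) -> tuple[int, int]:
    """Return the (low, high) number of years spanned by `generations`."""
    low, high = YEARS_PER_GENERATION
    return generations * low, generations * high

print(generations_to_years(23))
```

For 23 generations this gives 575 to 690 years.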

Interpretation:

Matches of the minimum qualities indicated below indicate a common ancestor. The question then becomes "When might he have lived?" We’ll try to estimate the time of that MRCA, so that traditional documentary research can help identify him specifically. However, be aware that ancestors who lived before about 1350 probably did not have surnames.

TMRCA Calculators

A TMRCA calculator has been published by Doug McDonald here; it appears to be on the conservative side.
And, another table-generator calculation, based on Dr. Bruce Walsh's work here.
Also, Ann Turner has posted a visual basic calculator here; plug in your numbers and see the result.

These calculators seem to assume a mutation rate of about 0.002 per generation.

12-marker match:

Perhaps, the other person only tested 12 markers; that’s all you can compare.

CL	12 of 12		11 of 12		10 of 12
CL	Gen	Years	Gen	Years	Gen	Years
50%	7	180-210	17	430-510	16.5	410-500
90%	23	580-690	39	980-1170	56	1400-1680
95%	29	730-870	47	1180-1410	72	1800-2160

The best possible match (12/12) carries 99% confidence of sharing a common ancestor,

But, the upper bound for an acceptable confidence level (90%) is 23 generations or 580-690 years.
This places your MRCA — at earliest — about the mid-14th century, the time surnames came into general use;
Genealogically speaking, that’s a tough find.
Poorer quality matches (i.e., <12 of 12) may not be worth pursuing, due to the huge time window.

"The 12 Marker Y DNA test is an excellent tool to determine those whom are not related within a group of people that share the same or similar surname." Read the source for this quote.

"A 12 marker STR test is usually not discriminating enough to provide conclusive results for a common surname. ". This source.

25-marker match:

"..most laboratories and surname projects recommend testing at least 25 {markers}. The more markers that are tested, the more discriminating and powerful the results will be. " Source.

With a match of 23 to 25 out of 25, the probability is >= 99% that you share a common ancestor.

CL	25 of 25		24 of 25		23 of 25
CL	Gen	Years	Gen	Years	Gen	Years
50%	3	80-90 *	7	180-210	11	280-330
90%	10	250-300	16	400-480	23	580-690
95%	13	330-390	20	500-600	27	680-810

25/25 match, you can be 90% confident that your MRCA lived within the past 300 years (roughly, 1700 or later) & 95% sure of within the past 400 years; this is a solvable genealogical problem. Research back to about 1850 is relatively easy through US censuses; for the Colonial era, you'll need to become familiar with other resources.
24/25 match, at 90% CL, has an upper bound ~500 years ago, before European migration to America. You may have to find the MRCA in the “Old Country”.
23/25 or poorer match, may not be worth pursuing, as the time window is large at an acceptable CL.

37-marker match:

A 33 to 37 of 37 match carries a >= 99% probability of sharing a common ancestor.

CL	37 of 37		36 of 37		35 of 37
CL	Gen	Years	Gen	Years	Gen	Years
50%	2-3	50-90 *	4	100-120	6	150-180
90%	5	130-150	8	200-240	12	300-360
95%	7	180-210	10	250-300	14	350-420

[Graph from data returned by McDonald calculator]

* Now, you have to add your own age to the years given, because the window is small enough to make your year of birth a significant factor.

37/37 match means, with 90% confidence, that your MRCA probably lived 150 years or less before you were born. (You may need the AGI technique below.)
36/37 means a <= 240 year time window at 90% CL
35/37 means a <= 360 year window at 90% CL
34/37 means a <= 1680 year window at 90% CL

67-marker matches:

Not enough people have yet opted for the 67-marker panel to make it likely that you will find a match across 67 markers. But, if you do, a match of 60 or more of 67 markers carries a >= 99% probability of sharing a common ancestor.

CL	67 of 67		66 of 67		65 of 67
CL	Gen	Years	Gen	Years	Gen	Years
50%	2	50-60	4	100-120	6	150-180
90%	4	100-120	8	200-240	12	300-360
95%	6	150-180	9	230-270	14	350-420

67/67 match carries 90% confidence that the MRCA is no earlier than your second-great-grandfather & 95% confidence he’s no earlier than your 4GGF. (You may need the AGI technique below.)
66/67 gives you a 240 year time frame at 90% CL & 270 at 95% CL
65/67 gives you a 360 year window at 90% CL & 420 at 95% CL

What if I know he's not there? The "missing probability" paradox:

Some people have excellent documentation proving the MRCA can not be in the latest few generations; let's call it additional genealogical information, "AGI". It leads to the question "What happens to the probabilities assigned to those generations?"

On 07 Mar 2003, Roy wrote to [email protected]:

If I have a 50% probability of being related to another person within 7 generations. And we know for certain that our MRCA can not possibly be within the first four generations. So those generations are eliminated from the scope. Then is the probability of being related in the remaining 3 generations increased? If so, by how much?

Dr. Walsh wrote on 11 Mar 2009:

"When you have two individuals with an EXACT match, the probabilities for each generation fall off by exactly the same amount. The net result is that if (say) the probabilities for generations 1, 2, 3 are 0.4, 0.2, 0.1, then if you condition on the first (say) 5 generations being excluded, the resulting CONDITIONAL probabilities for 6, 7, 8 become 0.4, 0.2, 0.1 (the same as for 1 , 2, 3) Hence, the whole distribution is shifted without a shape change to start that the value you set. This only happens with an exact match and can cause some confusion."

The sum of probabilities for those impossible generations gets re-allocated to the remaining, earlier generations. The entire probability (& cumulative) curve simply gets shifted to the right by the number of generations in which the MRCA can not exist. We can call this an "AGI adjustment" and the resulting curve is a "conditional probability distribution". The remaining probabilities are conditioned on the MRCA not existing in the generations for which the paper trail shows it cannot be, as demonstrated in the graph below.

graph: AGI adjustment

For the probability points listed in the graphs & tables above, add the number of generations in which the MRCA is found not to exist. See also Dr. John Chandler's reply to Roy.
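
The shift Dr. Walsh describes for exact matches can be checked with a small Python sketch. It assumes, as in his description, that each generation's probability is a fixed fraction of the one before (a geometric fall-off). The first-generation probability of 1/2 is hypothetical, and the function names are our own.

```python
from fractions import Fraction

def p_generation(t: int, p1: Fraction) -> Fraction:
    """Probability the MRCA is exactly t generations back (t = 1, 2, ...),
    assuming each generation's probability is (1 - p1) times the previous one."""
    return p1 * (1 - p1) ** (t - 1)

def p_beyond(m: int, p1: Fraction) -> Fraction:
    """Probability the MRCA is more than m generations back."""
    return (1 - p1) ** m

def p_conditional(t: int, excluded: int, p1: Fraction) -> Fraction:
    """Probability of generation t, given the first `excluded` generations are ruled out."""
    if t <= excluded:
        return Fraction(0)
    return p_generation(t, p1) / p_beyond(excluded, p1)

p1 = Fraction(1, 2)   # hypothetical first-generation probability
excluded = 5          # generations ruled out by the paper trail
for t in (1, 2, 3):
    print(t, p_generation(t, p1), "->", t + excluded, p_conditional(t + excluded, excluded, p1))
```

Under this assumption, the conditional probabilities for generations 6, 7, 8 come out identical to the original probabilities for generations 1, 2, 3: the curve shifts right without changing shape. As Dr. Walsh notes, this exact shift holds only for exact matches.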

Wrapping up this part:

Hopefully, we’ve explained how probabilities help us make sense of Y-DNA matches and how you can use them to focus your research on a common ancestor of you & your match.

Part 2 adds more detailed information and explanations.

Part 2: For those who want more.

This part reviews some of the prior discussion on probabilities with DNA in more detail. It’s for those who are comfortable with statistics or want to learn more.

Confidence & Doubt — No Absolute Certainty:

In Part 1, we said “The .. most important thing to remember with probabilities is that there is no absolute certainty; absolute certainty is beyond its power. We might be 99.99% sure of a statement, but can never be 100%.” That statement is controversial and some (perhaps, followers of Descartes) don’t buy it. So, we’ll take more space to explore the issues.

The proposition that some things can not be absolutely known with complete confidence is hard to accept. (It shocked the world of nuclear physics when Werner Heisenberg published his "Uncertainty Principle" in the 1920s.) It is, however, crucial to understanding why DNA can not give you an absolutely certain answer.

Probability theory applies to things that we wish to predict in advance or which can not practically be observed directly and, therefore, must be inferred. When a thing can readily be observed directly, we don’t need inference or prediction; observe it and remove the doubt. There is always an element of doubt, however tiny, to probability-derived statements. In fact, probabilistic functions start with the definition “0 < p < 1”, so that any probability for which they’re to be used is greater than zero and less than one.

Some may say that – if they have flipped five coins and all come up heads – the probability of having flipped five heads is 100%. They’re misreading the concept; probability applies before you flip the coins or draw to that inside straight, not after. When you’ve directly observed a thing or event, you do not need or want a probabilistic answer about it; you simply count the five heads.

When we engage in probability, we step into a “parallel universe” which operates by different rules than 0/1, black/white, certain/impossible. In this universe, all things are possible, but some are highly unlikely. A butterfly may, indeed, cause a hurricane in Florida by flapping its wings in Brazil; it just isn’t very probable. We may talk about theoretical certainties or impossibilities, but they don’t show up in the world of probabilities; confidence levels never reach 100%; there is always some small (perhaps minute) element of doubt.

A failure to appreciate the lack of certainty in probabilistic conclusions can have serious consequences. The 2008 collapse of global credit markets is mostly due to this misunderstanding. Sophisticated financial institutions did not realize the risks they were taking with complex, probability-based investments.

It is true that the theoretical sum of an entire probability distribution equals unity (100%). However, the sum includes a great many very small probabilities, out at the "tails" of the distribution.

You may also want to look at a talk on aircraft software reliability. In short, the speaker concludes “If claims for dependability can never be made with certainty, we need a formalism that handles the uncertainty.”

The sources of probabilistic doubt are many; they include:

Measurement error: Even the best scientific instruments are subject to some error.
Sampling error: There are many ways in which a sample may not accurately represent an entire population; proper sampling methods must be followed to attain any confidence in its representativeness. Further, even a properly drawn sample is subject to limitations, usually expressed as a sampling error, largely dependent on the sample’s size.
Approximation: The mathematical functions (formulas) which yield probabilities are approximations of the real world.
Assumptions: “Housing prices never decline.” "Lehman Brothers is too big to fail." Need we say more?
Limitations on Inference: “Mathematical theorems are true; statistical methods are sometimes effective when used with judgment.” (page 307 of “For All Practical Purposes: Mathematical Literacy in Today’s World” , by COMAP, the Consortium for Mathematics and Its Applications, Macmillan, 2002; section on “The Limitations of Inference”)

Those who wish to delve into the philosophical underpinnings, may want to read this paper, “Probability without Certainty, or “Adler on Certainty and Probability”. A counter-argument (that probability can exceed 100%) is at “Probability, certainty and natural distribution”.

For a mathematical demonstration, take the binomial distribution:

Most people ignore it, but we start with the definition, 0 < p <1. (The probability, p, lies between zero and one and never equals either.) We'll show why this matters.
The formula for the standard deviation (basic to calculating probabilities) is σ = √(npq) (the square root of n times p times q), where n is the number of trials, p is the probability & q = 1-p.
- As p→0, q→1 and as p→1, q→0. (The symbol → means "approaches" or "gets very close to".)
- In either case, pq→0 and σ→0.
- At p=0 or p=1, the distribution has no deviation; all the data points are concentrated at a single place on the horizontal axis. A probability can not be calculated.
Skew: (“Skew” relates to the shape of the function's graph curve, in which most of the distribution is shifted either left or right and has an abnormally long left or right “tail”.)
- The formula is Skew = (q-p)/√(npq), the difference between q & p divided by the standard deviation.
- As p→1, q→0 and their product, pq→0. (Similarly, as p→0, q→1 and their product, pq→0.) As pq→0,
  1/√(npq) →∞ [“infinity”] and becomes what mathematicians call “undefined”; likewise the size of (q-p)/√(npq) →∞.
- Attempt to use your computer spreadsheet program to divide by zero and it will signal an error.
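
The behavior at the extremes can be seen numerically. This Python sketch uses the standard textbook formulas for a binomial distribution, σ = √(npq) and skew = (q − p)/√(npq); the function name and the choice of n = 37 trials are ours, for illustration only.

```python
import math

def binomial_sigma_and_skew(n: int, p: float) -> tuple[float, float]:
    """Standard deviation and skew of a binomial distribution with n trials."""
    q = 1.0 - p
    sigma = math.sqrt(n * p * q)
    skew = (q - p) / sigma  # raises ZeroDivisionError when p is exactly 0 or 1
    return sigma, skew

for p in (0.5, 0.1, 0.004, 0.0001):
    sigma, skew = binomial_sigma_and_skew(37, p)
    print(f"p={p}: sigma={sigma:.4f}, skew={skew:.2f}")
```

As p shrinks toward zero, σ shrinks with it and the skew grows without limit; at p = 0 or p = 1 exactly, Python (like a spreadsheet) refuses to divide by zero.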

Probability of Mutations, revisited:

The frequency (or frequencies) of the random Y-DNA mutations establishes the marginal probabilities we need to work with. Unfortunately, these frequencies aren't wholly known.

Mutations in Y-DNA, as reflected in STR marker/allele counts, aren’t common. In a 2000 study, Manfred Kayser & Antti Sajantila looked at Y-DNA transmission in 15 loci (markers) across a total of 4,999 father-to-son transmissions; they found only 14 total mutations and only 2 instances of two mutations per pair. (They also found a relationship between the frequency of mutation and a locus’ molecular structure.) Across those 4,999 transmissions, the frequency of mutation reached a high of 8.6*10^-3 (0.0086) at a single locus and averaged 2.8*10^-3 (0.0028), i.e., 14/4,999.

These changes in the loci are (we think) random & independent of each other. Charles Kerchner, http://www.kerchner.com/dna-info.htm, has studied the subject and estimated rates of mutation in individual markers. For simplicity’s sake, we’ll classify them as

“Fast” markers, at about 1 mutation in every 250 generations (1/250), and
“Slow” markers, at about 1 in every 400 generations (1/400).

Even the “fast” markers are not especially quick. The frequency per generation (another term for these rates is “marginal probability”) calculates to:

“Fast”: 1/250 = 0.0040 = 0.40%
“Slow”: 1/400 = 0.0025 = 0.25%

The FTDNA Y-DNA STR panels have, respectively, by count:

“Fast” markers	3	8	13	13
“Slow” markers	9	17	24	54
Panel size, total markers	12	25	37	67

The probability of a mutation, from one generation to the next, in any specific marker is very small. However, it’s not zero and that fact enables our use of statistical tools.
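
To get a feel for what these rates mean for a whole panel, the following Python sketch combines the "fast" and "slow" rates with the marker counts in the table above. It assumes every marker falls neatly into one of the two classes and mutates independently of the others; the names are our own.

```python
FAST_RATE = 1 / 250  # per marker, per generation (the "fast" class above)
SLOW_RATE = 1 / 400  # per marker, per generation (the "slow" class above)

# (fast markers, slow markers) for each FTDNA panel, from the table above.
PANELS = {12: (3, 9), 25: (8, 17), 37: (13, 24), 67: (13, 54)}

def expected_mutations(fast: int, slow: int) -> float:
    """Average number of marker changes in one father-to-son transmission."""
    return fast * FAST_RATE + slow * SLOW_RATE

def p_no_change(fast: int, slow: int) -> float:
    """Probability that no marker in the panel changes in one transmission,
    assuming the markers mutate independently."""
    return (1 - FAST_RATE) ** fast * (1 - SLOW_RATE) ** slow

for size, (fast, slow) in PANELS.items():
    print(size, round(expected_mutations(fast, slow), 4), round(p_no_change(fast, slow), 4))
```

The larger the panel, the more likely it is that at least one marker changes in any given generation, which is part of why larger panels give tighter time estimates.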

Dr. Bruce Walsh of the University of Arizona likens the mutation probability to a ticking clock. The clock ticks slowly and randomly, allowing us to estimate a range of time back to the MRCA, based on the number of observed mutations.

Skipping Ahead of the Math:

You’ve been reading the shortened & simplified form of this exploration. We’ve saved for later discussions of:

Discrete Random Variables,
- Dice provide an example: A die (1/2 a pair of dice) can come up only with integer values from 1 to 6; it can not come up 2.43746 or any other fractional value. STR allele counts are reported as integer values, making them discrete random variables.
the Bernoulli Theorem and Binomial Distribution (success/fail probabilities),
- Imagine a roulette wheel with about 500 slots, ~499 of them labeled "Win" and 1 labeled "Lose". What are the chances the ball will land on a Win slot 12 times out of 12? 24 of 25? 34 of 37?
the Poisson Distribution (very small probabilities),
- Out near the tail of a distribution, the probability curve's slope gets so very nearly flat that assuming no slope at all is a reasonable approximation. This simplifies probability calculations for rare events.
Probability of Multiple Independent Events (mutations of markers in one generation)
- Mutation (change in allele count) in any STR marker is independent of what's happening with any other marker.
Probability of Multiple Dependent Events (multiple generations).
- With each new generation, a "transmission event", there's a small opportunity for Y-DNA to change. The DNA of the present generation is dependent on the results of all the transmission events that have gone before.

If you’re interested in the deleted material beyond the links or what we have below, please ask. It’s fascinating stuff for the math-junky.

Short version: The FTDNA probability tables & graph are based on comparisons of probabilities to derive a confidence level that the match you see is not due to random chance (represents a common ancestor) and to estimate the time back to the most recent common ancestor (MRCA).

Statisticians use the term “significance level” when talking about a single probability, for questions like “Is my 6’10” nephew tall compared to other boys?” (His height would be compared to the mean and standard deviation of the normal distribution.)
A “confidence level” applies to questions like “Are boys taller than girls?” (The distributions of boys’ heights and girls’ heights would be compared.)
In our earlier discussion, we didn't distinguish between significance & confidence levels.

Binomial Distributions

Dr. Walsh wrote on 11 Mar 2009:

"The actual distribution is a multivariate geometric --- essentially a series of binomials."

The formula for the probability of k successes in n trials, p the probability of success in each trial, is
P = n! / [k!(n-k)!] * p^k * (1-p)^(n-k).

Note that as k → n, n!/[k!(n-k)!] → 1 {(n-n)! = 0! = 1}, & (1-p)^(n-k) → (1-p)^0 = 1. At k=n, the formula reduces to
P = p^n.

Poisson Distribution

One advantage of the Poisson distribution is not needing to know very much about the underlying distribution; all you really need is an average per unit of time, space, etc. {The mean & variance are the same.} Another advantage is its applicability to integer variables.

Its very first application was in 1898, related to Prussian Army deaths by horse-kicks. It's heavily relied on in "Queuing theory", for study of traffic congestion or how much capacity is needed.

Imagine a bakery bakes four loaves of a specialty bread every day and, on average, sells three (3) of them. What are the chances it will have customers asking for five (5) loaves?
Formula: Pr(k;λ) = λ^k / [e^λ k!], with λ = 3, k = 5, e ~ 2.718
Substituting: Pr = 3^5 / [e^3 5!] = 243 / [~20.09*120] = 243 / [~2410] ~ 0.101 ~ 10.1%
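
The bakery arithmetic can be checked with a few lines of Python. This is only a minimal sketch; the function name poisson_pmf is our own label for illustration, not something taken from FTDNA or Dr. Walsh.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the long-run average is lam."""
    return lam ** k / (math.exp(lam) * math.factorial(k))

# Bakery example: average demand of 3 loaves, chance of exactly 5 requests
p_five = poisson_pmf(5, 3)
print(round(p_five, 3))  # 0.101
```

Using math.exp avoids having to type in a rounded value for e.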

If the probability of an event is very small {0<p<<1}, the Poisson distribution resembles the binomial distribution. It can be a fair approximation to the binomial at typical mutation probabilities, where 0.002<=p<=0.004.

For this flat-tail condition: Pr(Y=y) = e^(-µ) * µ^y / y! = µ^y / [e^µ y!], where
e = base of natural logarithms (~2.718), µ = mean, y = number of occurrences for which the probability is desired,
y! = factorial of y {y*(y-1)*(y-2)*(y-3)...1}.
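
To see how close the two distributions get when p is small, here is a rough Python comparison. The numbers are assumptions for illustration only: 37 markers and a per-marker mutation probability of 0.003 per generation (the middle of the range quoted above), with µ = n*p. The function names are ours.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(y, mu):
    """Probability of exactly y occurrences when the mean is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

n, p = 37, 0.003   # assumed: 37 markers, 0.003 mutation probability each
mu = n * p         # Poisson mean that corresponds to the binomial

# Print y, the binomial probability and the Poisson probability side by side
for y in range(3):
    print(y, round(binomial_pmf(y, n, p), 5), round(poisson_pmf(y, mu), 5))
```

With these assumed inputs the two columns agree to about three decimal places, which is the sense in which the Poisson is a "fair approximation".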

Multiple Independent Events

The probability of all of the events occurring is the product of their individual probabilities:
p(A & B) = p(A)*p(B) ; p(A, B & C) = p(A)*p(B)*p(C), etc.
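
Applied to markers, the product rule gives the chance that none of them changes in one generation. The sketch below assumes, purely for illustration, 37 markers that each mutate with probability 0.003 per generation; real markers have differing rates, and prob_all is our own helper name.

```python
import math

def prob_all(probabilities):
    """Product rule: chance that every one of several independent events occurs."""
    return math.prod(probabilities)

rate = 0.003    # assumed per-marker mutation probability per generation
markers = 37    # assumed panel size

# Each marker staying unchanged is an independent event with probability 1 - rate
p_no_change = prob_all([1 - rate] * markers)
print(round(p_no_change, 3))  # 0.895
```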

A jar has 15 marbles: 5 red, 5 blue and 5 yellow. You draw a marble at random, replace it & then repeat this twice more. What are the chances of drawing three (3) red marbles?
The individual chance of drawing red is p(red) = 5/15 = 1/3; the chance of drawing three reds is
p(3 reds) = 1/3 * 1/3 * 1/3 = 1/27 ~ 0.037 = 3.7%.

Multiple Dependent Events, Conditional Probabilities

p(A & B) = p(A) * p(B|A), where P(B|A) is the conditional probability of B given A.
Or, p(B|A) = p(A&B)/p(A)
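
A small Python sketch of the two formulas, using the same 15-marble jar (5 red, 5 blue, 5 yellow) but this time without replacement. The question asked here, two reds in a row, is our own illustration; exact fractions are used so no rounding creeps in.

```python
from fractions import Fraction

# First draw: 5 red marbles among 15
p_first_red = Fraction(5, 15)

# Second draw, given the first was red and not replaced: 4 red among 14
p_second_red_given_first = Fraction(4, 14)

# p(A & B) = p(A) * p(B|A)
p_both_red = p_first_red * p_second_red_given_first
print(p_both_red)  # 2/21

# Rearranged: p(B|A) = p(A & B) / p(A)
print(p_both_red / p_first_red == p_second_red_given_first)  # True
```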

In the marble example, assume you drew a red first and do not replace drawn marbles. What are the chances of drawing blue on the next try, or within the next two tries? By not replacing the red marble, you affect the chances for future draws.
First try: p(b|r) = 5/14 ~ 0.357 = 35.7%. Within two tries: blue on the first, or not-blue on the first and then blue on the second, p = 5/14 + (9/14 * 5/13) ~ 0.604 = 60.4%.

Confidence levels:

A confidence level (CL, or confidence interval) tells you the balance of certainty versus uncertainty in a conclusion, such as “The MRCA for me and my match was no more than 8 generations ago.” With a CL of 90%, you can be 9/10 certain and 1/10 doubtful of the statement.

FTDNA’s graph uses four points on the confidence spectrum (the vertical axis). Here’s what you need to know about confidence levels:

50% is as likely to be wrong as right; do not fall into the trap of thinking it means something. It’s the same confidence that your next coin-toss will come up heads. A 50% CL is mostly good for providing an anchor point on the graph lines; it isn’t much good for prediction or estimating because your most recent CMA could, with equal probability, lie on either side of the line.
90% is the minimum acceptable in most scientific studies. You can be 90% sure that your MRCA falls that number of generations or fewer back in your family tree.
95% is better confidence. Now, you’re 95% sure and the chance of being wrong is only 1 out of 20.
99% is pretty darn good. The chance of being wrong falls to 1/100.
99.5% & 99.9% are occasionally found in some scientific studies – if the statistics hold up to this close a scrutiny.
Bounds: A confidence interval has both an upper bound and a lower bound. We’re mostly concerned with the upper bounds because the lower bounds in DNA interpretation tend to fall within those generations (0-4) where documentary research tells us what we need to know.

The horizontal axis is time, the maximum number of generations back to the MRCA. It’s measured in generations because DNA doesn’t know the parents' age when the children were conceived.

Here’s an (edited) copy of the table on which the graph is based. You can find the original at http://www.familytreedna.com/faq-markers.aspx.

Matching markers	50% probability MRCA within (generations)	90% probability MRCA within (generations)	95% probability MRCA within (generations)
12 markers
11 of 12	17	39	47
12 of 12	7	23	29
25 markers
23 of 25	11	23	27
24 of 25	7	16	20
25 of 25	3	10	13
37 markers
35 of 37	6	12	14
36 of 37	4	8	10
37 of 37	2 to 3	5	7
67 markers
65 of 67	6	12	14
66 of 67	4	8	9
67 of 67	2	4	6

The data used in Part 1 was taken from this table (& others as noted) & reorganized for a (hopefully) simpler explanation for the reader. The calculation of year ranges is our own.

FTDNA confidence levels for DNA matches

Notice that the data & graph talk about maximum number of generations (“upper bound”), for which we’ve used the less-than-or-equal-to sign (<=). For a given match & CL, the MRCA will be no more than that number in the past.
Notice, too, that – as we ask for more confidence – the generations necessary to reach that confidence level increase.
And, as the quality of the match decreases, the generations’ upper bound increases – sometimes at an exponential rate.

Binomials Again!

Binomial probabilities best fit the phenomenon of DNA matches. The equation for the probability of k successes in n trials is

P = n! / [k!(n-k)!] * p^k * (1-p)^(n-k)

where

n = the number of trials (markers);
k = the number of successes, k <= n; (matching markers)
p = the probability of success in a single trial (here, that a marker still matches, i.e., one minus its probability of mutation)
! = the factorial sign, e.g., k! = k*(k-1)*(k-2)*(k-3)....*1. (The factorial terms are needed to account for the various combinations & permutations of k successes in n trials.)
A number or letter written above & to the right of the one preceding (or following a ^) is an exponent, meaning the preceding is "raised to the power of", or multiplied by itself that many times.

The formula explains why a perfect match, n=k, is asymptotic to the vertical axis.

The factorial term evaluates to 1; (n-k)! = 0! = 1, so n!/[k!(n-k)!] = n!/n! = 1.
The last term evaluates to 1; (1-p)^0 = 1.
We're left with p^k = p^n.
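
The perfect-match case can be confirmed numerically. In this minimal Python sketch the value p = 0.9 is purely illustrative (it is not an FTDNA figure), n = 37 stands for a 37-marker test, and binomial_pmf is our own name for the formula above.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials, single-trial probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 37     # number of markers compared
p = 0.9    # illustrative single-trial probability, not a measured value

# With k = n the factorial term and the (1-p) term both become 1
perfect = binomial_pmf(n, n, p)
print(math.isclose(perfect, p ** n))  # True
```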

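The formula for the cumulative probability (k or fewer successes in n trials) is

P(X <= k) = sum over i = 0 to k of n! / [i!(n-i)!] * p^i * (1-p)^(n-i).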
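
A cumulative binomial probability simply adds up the individual binomial terms from 0 through k. The Python sketch below shows that summation under our own assumptions: n = 37 trials and an illustrative single-trial probability of 0.05. We do not know whether FTDNA computes its tables this way.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """Probability of k or fewer successes in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

n, p = 37, 0.05   # illustrative values only

# Chance of seeing 0, 1 or 2 successes out of 37
at_most_two = binomial_cdf(2, n, p)
print(round(at_most_two, 3))
```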

We suspect, in the absence of knowing, that the quoted generational probabilities represent the ratios of two (or more?) binomial distributions.

What does “generations” mean in dates?

Converting generations to years is tricky. We haven’t found a solid reference. Part of our problem is that we're searching for an average figure for many families over many generations. Estimates vary from as little as 15 to as many as 35 years per generation. Undoubtedly, the time, place and culture of the population matter.

One not-so-hypothetical family illustrates the complexity of the problem:

Jim marries Betsy and their first child is born when both are 18; they continue having a child every other year or so until Betsy dies at 35. Jim then marries Sarah, age 24, and they have children until Jim is 45. When the last child is born, the first is an adult and also having children, a generational overlap. The years-per-generation for Jim's children ranges from 18 to 45.

One British study used an average of 35 years per generation. While this may apply to a mature British population, the American population in the 18th & 19th centuries was heavily weighted toward the young. Fewer than 32% of Americans in 1810 were over 25 years of age, while almost 35% were under 10.

Americans married and bore their first children young, often as young as 18. While later siblings may have been born at a later age, by age 40 most families were done having children. I believe something closer to 27 applies for my American Taylors. So, let’s take a range of 25 to 30.
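
Converting a generation count to a span of years under that 25-to-30 assumption is plain multiplication. A minimal Python sketch follows; the function name is ours, and the 12 is the 90% figure for a 35 of 37 match from the table above.

```python
def generations_to_years(generations, low=25, high=30):
    """Return the (shortest, longest) span in years for a number of generations."""
    return generations * low, generations * high

# 35 of 37 match at 90% confidence: MRCA no more than 12 generations back
print(generations_to_years(12))  # (300, 360)
```

Other populations would call for different low and high values, which is why they are left as parameters.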

The graph below shows the age distribution for free white persons enumerated in the 1810 U.S. federal census:

graph: age distribution in 1810 US census by male & female

Source: (2004). Historical Census Browser. Retrieved 23 Feb 2009, from the University of Virginia, Geospatial and Statistical Data Center:
http://fisher.lib.virginia.edu/collections/stats/histcensus/index.html

(Note that almost 50% of the population is <=15 and almost 70% <=25.)

Interpretation:

Be sure you've read the Interpretations section in Part 1. To reprise a part, in plain language:

A 35/37 match indicates you can be 90% confident your MRCA was no more than 12 generations in the past and 95% confident he was no more than 14 generations in the past.
A 36/37 match means you can be 90% confident your MRCA was no more than 8 generations in the past & 95% confident your MRCA was no more than 10 generations in the past.
A 37/37 match means you can be 90% confident your MRCA was no more than 5 generations in the past and 95% confident he was no more than 7 generations past.

Interpreting non-matching markers

Ken Nordtvedt has published an article in the Fall 2008 Journal of Genetic Genealogy discussing a method for going beyond simple interpretation of matching vs. non-matching markers and into the question of meaning of the differences in non-matching markers. In short, he establishes modal values to infer the CMA haplotype and computes distances from that mode. He goes on to use Bayesian analysis to estimate a probability distribution. His estimate for the probability distribution's peak is:

G = n/(2M), where

G = number of generations,
n = number of mutations (not markers as previously), and
M = the sum of marker mutation rates.
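
As a rough illustration of the peak formula, here it is in Python. The inputs are our assumptions, not Nordtvedt's data: 37 markers, each with a mutation rate of 0.003 per generation, and 2 observed mutations. The function name is also ours.

```python
def peak_generations(mutations, rate_sum):
    """Peak of the distribution: G = n / (2M)."""
    return mutations / (2 * rate_sum)

rates = [0.003] * 37   # assumed: one rate per marker, all equal
M = sum(rates)         # sum of marker mutation rates

print(round(peak_generations(2, M), 1))  # 9.0
```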

Cumulative probability is given by:

Prob(G) ~ ∫ G^GD * e^(-2MG) dG, where

GD = genetic distance (marker differences)
e = base of natural logarithms ~ 2.718,
G & M are generations & mutation rates, as above.
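
One way to put numbers to this, offered as our own reading rather than Nordtvedt's stated method: if the integrand is G^GD * e^(-2MG), integrated from 0 to G and scaled so the total probability is 1, then for a whole-number GD the integral has a closed form, 1 - e^(-x) * (1 + x + x^2/2! + ... + x^GD/GD!) with x = 2MG. The Python sketch below uses that form with an assumed M of 0.111 (37 markers at 0.003 each).

```python
import math

def cumulative_probability(generations, genetic_distance, rate_sum):
    """Normalized integral of G^GD * e^(-2MG) from 0 to `generations`.

    Assumes genetic_distance is a whole number and that the distribution
    is scaled to total 1; both are our assumptions.
    """
    x = 2 * rate_sum * generations
    tail = sum(x ** i / math.factorial(i) for i in range(genetic_distance + 1))
    return 1 - math.exp(-x) * tail

M = 0.111   # assumed sum of marker mutation rates
for g in (5, 10, 20, 40):
    print(g, round(cumulative_probability(g, 2, M), 3))
```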

This author confesses to less than full understanding of Nordtvedt's article. Read it by clicking on the link above.

We've run some numbers using these formulas and they don't seem to produce usable results.

Conditional Probabilities Revisited

In Part 1, we talked about "conditional probabilities" in relation to documentation proving the MRCMA can not be in the latest few generations. It's good to revisit this in a more formal way.

The formula for the probability of B, given A, is:

P(B|A) = P(A & B) / P(A)

Example:

A jar contains black and white marbles. Two marbles are chosen without replacement. The probability of selecting a black marble and then a white marble is 0.34, and the probability of selecting a black marble on the first draw is 0.47. What is the probability of selecting a white marble on the second draw, given that the first marble drawn was black?

Solution:

P(White|Black) = P(Black and White) / P(Black) = 0.34 / 0.47 ~ 0.72 = 72%

In the same way, given that the MRCMA can not be in the latest 4 generations, the probability of being in earlier generations is increased.
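
Both the marble solution and the MRCMA statement can be sketched in Python. The second function is our own illustration: it treats the table's confidence figures as cumulative probabilities and uses the 66 of 67 row (50% by 4 generations, 90% by 8). The function names are hypothetical.

```python
def conditional(p_joint, p_condition):
    """P(B|A) = P(A & B) / P(A)."""
    return p_joint / p_condition

# Marble example: P(White|Black) = 0.34 / 0.47
print(round(conditional(0.34, 0.47), 2))  # 0.72

def prob_within_given_beyond(cum_at_g, cum_at_excluded):
    """P(MRCA within g generations | MRCA is beyond the excluded generations)."""
    return (cum_at_g - cum_at_excluded) / (1 - cum_at_excluded)

# 66 of 67 match, documents rule out the latest 4 generations
within_8 = prob_within_given_beyond(0.90, 0.50)
print(round(within_8, 2))      # 0.8
print(round(1 - within_8, 2))  # 0.2
```

Under these assumptions the chance that the MRCMA lies more than 8 generations back doubles, from 10% to 20%, once the latest 4 generations are excluded.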

Conclusion:

The statistical methods used by FTDNA are, in some sense, a “black box”. We do not know the formulas used. Nor do we have access to the data. We’ve used the summary data they’ve made available and supplemented it with estimates from other sources. Should we get more information, we’ll do a Part 3.