Probabilities in DNA

by Ralph Taylor

Part 1: Introduction and Overview

We on the FTDNA Taylor Surname Project Admin Team get asked many times “Is the match I’ve found meaningful and what does it mean?” This article tries to help answer that. A portion has been published on the FTDNA Taylor Project blog.

Hold on, the ride could get bumpy. Do not read further if you are not comfortable with Y-DNA basics & terminology. We'll assume you know what a marker is, what STR means, & what a match is. First, visit one of the many sites on the subject of Y-DNA in genealogy, including this one.

Do not read further if you're hoping to substitute DNA for traditional genealogy, based on documents and records. DNA can not tell you your ancestors' names or where they lived and can only estimate a probable range of time in which they might have lived.  DNA helps focus documentary research; it can't replace it. You should know genealogical time frames and standards.

Always, answers to the meaning of matches involve probabilities – statistics. Changes in Y-DNA from father to son are infrequent, but they happen, we think, randomly. Statistical probability analysis provides the best tools for dealing with randomness. To quote Dr. Bruce Walsh of the University of Arizona, “Since mutations occur at random, the estimate of a TMRCA is not an exact number (i.e., 7 generations), but rather a probability distribution.”

Probability can prove or disprove many questions to a high degree of reliability. It’s defined as a way of quantitatively expressing knowledge or belief that an event will occur or has occurred. In this article, we’ll draw from the concepts of probability theory, a branch of mathematics.

The beginning question in our exploration becomes “What are the odds that my found match could not be due to chance?” Eventually, we’ll proceed to “When, probably,  did our common ancestor live?”

Perhaps, you’ve seen a graph something like the simplified one below from FTDNA’s site:

“How”, you may wonder, “do they do that?”
“What does it mean?”

Source:
http://www.familytreedna.com/faq-markers.aspx

This article is to partly answer the first question. For the second, this graph shows how more markers produce a "tighter" probability distribution, leading to greater precision. We'll use the same techniques in a different way.

Relatedness:

In Y-DNA, “related” has a special meaning,  "sharing a common direct male ancestor with another person". It does not include

It does include:

Probability Applications:

Probability is not limited to DNA. It’s used to predict weather, estimate chances of airplanes crashing, gauge public opinion, guide gambling (its earliest use), control manufacturing quality, manage complex systems, and many other applications.

In an egregious example, probability was widely used to estimate the risk of default on collateralized mortgage securities & credit default swaps. It failed — mostly because of problems we'll discuss later.

Probability Basics:

Probabilities always are between zero (0) & one (1), sometimes expressed in percentages – between 0% & 100%. Mathematically, it's expressed as "0 < p < 1", where "p" represents the probability of a thing. Zero means a thing is impossible; one (100%) means it is certain. Unlike athletes, probability can’t “give 110%”.

Next, probability involves math, often complicated math. “Talking Barbie” said “Math is hard.”; actually, the hardest part these days is coming up with the right questions to ask and applying the right formulas relevant to the questions before letting the computer do the math.

Third, probability models are approximations of selected aspects of the real world; they do not represent it exactly and completely. They may come close enough for their intended use, but do not over-interpret.

The last & most important thing to remember is that there is no absolute certainty in probability; absolute certainty is beyond its power. We might be 99.99% sure of a probabilistic statement, but can never be 100%. (The surety of a probability-based statement is called a “confidence level”.) 

Some probability terms

[Graphs: positive slope; negative slope; slope positive, then negative]

Probability of Mutations:

Mutations in Y-DNA, as reflected in STR marker/allele counts, aren’t common. A change in a marker happens — on average — only every 250 to 400 generations. For simplicity’s sake, we’ll classify them as

These are very small probabilities, but greater than zero. They allow us to use probabilities for estimating & predicting.
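A rough sketch of how such small per-generation rates add up over many markers and generations. It assumes every marker mutates independently at the same constant rate; the two rates used (1/400 and 1/250) are taken from the 250-to-400-generation figure above, and the function name is ours, for illustration only.

```python
def p_any_mutation(rate, markers, generations):
    """Chance that at least one of `markers` changes over `generations`
    father-to-son transmissions.

    Assumption (a simplification): markers are independent and share one
    constant per-generation mutation rate.
    """
    p_no_change = (1.0 - rate) ** (markers * generations)
    return 1.0 - p_no_change


if __name__ == "__main__":
    for rate in (1 / 400, 1 / 250):
        for markers in (12, 25, 37, 67):
            p = p_any_mutation(rate, markers, generations=10)
            print(f"rate={rate:.4f}  markers={markers:2d}  "
                  f"P(any change in 10 generations)={p:.1%}")
```

A single marker rarely changes, but a panel of many markers followed over several generations has a noticeable chance of showing at least one change.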

To quote from Dr. Bruce Walsh, at the U. of Arizona (emphasis added):

The basic idea is simple:
Individuals that match at a higher fraction of markers are more closely related. The formal logic is as follows: One can imagine the chromosome as a clock that slowly ticks (i.e., one "tick" of the clock equals one mutation). Thus, a chromosome is a molecular clock that ticks randomly within a specified rate. This paradoxically sounding phrase means that a clock running longer has a higher probability of having more ticks than a clock that has been running shorter. The more time, the more ticks and the older the time back to the MRCA.

Estimates of TMRCA are thus based on the observed number of mutations by which the two Y chromosomes differ. Since mutations occur at random, the estimate of a TMRCA is not an exact number (i.e., 7 generations), but rather a probability distribution, a function that gives the probability that the TMRCA is a certain number of generations or less (i.e., a 47% probability that the TMRCA is 16 generations or less). This website shows the plot of these functions for the various marker matches for 12 and 21 marker tests. As one uses more and more markers, the distribution becomes tighter and tighter about its mean value, and estimates have higher precision.

The Sampling Problem:

Statisticians use sampling techniques to estimate characteristics of the real world when it’s impractical to measure the entire population. However, the present sample sizes are small in relation to what's being sampled; human Y-DNA exists in billions of varieties, only a tiny fraction of which have been tested.  Existing DNA databases are very small in relation to the population and this presents problems in finding matches.

We think, for example, that there may be as many as 10,000 different haplotypes (individual Y-DNA patterns) for the tens of millions of people bearing the Taylor surname. In the FTDNA Taylor project, we have identified 172 haplotypes (29 groups plus 143 ungrouped). We’ve barely scratched the surface, yet 89 members (38%) have found matches of sufficient quality to be formed into groups.

The samples are sufficiently large overall (in the tens of thousands) to prove the concepts & theory on which genetic genealogy is based. The sampling problem here is that many specific Y-DNA haplotypes are not included in the existing samples. As we get more people testing, the chances of matching will improve.

Reading Probability Graphs

Graphs show probability data in "picture" form, take less space than tables & are easier to understand than formulas. A two-dimensional graph has two axes, the vertical & the horizontal. For the graphs we're concerned with:

  1. The vertical or "Y" axis is always probability or confidence level. It's a number between 0 & 1, or 0% & 100%. 
  2. The horizontal or "X" axis is the number of past generations to the MRCA. Depending on the graph, this can be either a specific number of generations or a maximum number ("upper bound").

Cumulative Probabilities

We are most interested in cumulative probabilities. The individual probabilities for specific generations are small, so we want a range of generations starting at 1 (our fathers) or some other number.

The Normal Curve, as illustration

By way of illustration, let's look at graphs of the familiar normal curve & its cumulative probabilities. (The example is a standardized normal curve.) We are NOT saying that DNA mutations follow normal distribution. However, it demonstrates how probabilities can be accumulated over a range. We'll abbreviate cumulative probability "cum p".

normal distribution curve
Highest probability is at mean (average), about 40%.
Probability decreases away from the mean.
Slope is relatively flat at tails, becomes steeper, then flattens & reverses at mean.
cumulative probability of normal distribution
Highest cum p (→100%) is at right of distribution.
Slope is almost flat at distance from mean, but always positive.
Slope is steepest near the mean, where individual probabilities are highest.
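The two normal-curve graphs can be reproduced numerically with a short sketch. It uses only the standard formulas for the standardized normal curve; the function names are ours.

```python
import math


def normal_pdf(z):
    """Height of the standardized normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative probability of the standardized normal curve up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    for z in (-3, -2, -1, 0, 1, 2, 3):
        print(f"z={z:+d}  p={normal_pdf(z):.4f}  cum p={normal_cdf(z):.4f}")
```

The curve's height peaks at the mean (about 0.40), while cum p keeps rising from left to right and is steepest at the mean.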

Binomial Distributions

The FTDNA graph (above) is of cumulative probabilities, based on one that looks more like this graph of binomial distributions:

probability of TMRCA

This is a traditional Y-DNA probability distribution, showing the probability of the MRCA being at each generation past. (We've artificially cut off the Y axis at 5%; a 37 of 37 match reaches 11.5% probability at 1 generation; scaling the graph to show this would make the other curves harder to see.) Notice that the point of maximum probability keeps shifting right & getting smaller as the match degrades from 37 to 33 of 37 markers.

These features are characteristic of binomial distributions. Also characteristic is the nature of the "n=k" (e.g., 37 of 37) distribution. The curve for perfect matches — all tested markers are in agreement — looks different from curves for good matches. The "perfect-match" curve has a very high probability at zero (0) generations. In math terms, it is asymptotic to the vertical axis.

Each individual generation has a relatively low probability, less than 3% for a 35 of 37 match; that's not very useful. But, we can add up the probabilities of each generation as we go back in time and get a graph like this:

cum. p of TMRCA
Illustrative only. Not to be used for research.
Here's something we can work with.

We can choose our degree of match, select a minimum probability level and read down to the maximum number of generations.
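The mechanics of "add up the generations until you reach the chosen level" can be sketched as follows. This is a toy model under our own assumptions, not FTDNA's method: a marker still matches after g generations only if neither line mutated (2*g transmissions), all markers share one rate (0.002 here), and every generation is equally likely beforehand. The function names are hypothetical.

```python
from math import comb


def tmrca_distribution(markers, mismatches, rate=0.002, max_gen=2000):
    """Per-generation probabilities for the MRCA, generations 1..max_gen.

    Toy model, NOT FTDNA's method: independent markers, one shared
    mutation rate, and a flat prior over generations.
    """
    weights = []
    for g in range(1, max_gen + 1):
        still_match = (1.0 - rate) ** (2 * g)  # neither line mutated
        weights.append(comb(markers, mismatches)
                       * (1.0 - still_match) ** mismatches
                       * still_match ** (markers - mismatches))
    total = sum(weights)
    return [w / total for w in weights]


def generations_for_confidence(probs, level):
    """Smallest generation count whose cumulative probability >= level."""
    running = 0.0
    for g, p in enumerate(probs, start=1):
        running += p
        if running >= level:
            return g
    return None


if __name__ == "__main__":
    for mismatches in (0, 1, 2):
        probs = tmrca_distribution(37, mismatches)
        g90 = generations_for_confidence(probs, 0.90)
        print(f"{37 - mismatches} of 37: 90% level reached at {g90} generations")
```

Because the rate and the model are assumptions, the numbers will not reproduce the published figures; the sketch only shows why a better match reaches a given level in fewer generations.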

We can also instantly see that a 33/37 match has only a 50% probability for a MRCA at 40 generations or less (roughly 800 AD to present) and we have to go out to 75 generations for 90%; therefore, it might not be worth pursuing.

Or, we can do it for perfect matches, involving different numbers of markers:

p TMRCA
Because these are "perfect" matches, the curves are asymptotic to the Y axis.
cum p TMRCA
With more markers, the cum p curve is steeper & 90% CL is reached sooner.

Confidence:

The “confidence level” (abbreviated CL) of a statistical statement describes how much confidence we may have in it. FTDNA’s graph uses four points on the confidence spectrum (the vertical axis) — 0, 50%, 90%, & 99%. Here’s what you need to know about them:

A "confidence interval" is a more general term, which includes any part of the probability distribution between two defined points. A confidence level implies all of a distribution from the defined point to the left or right (for us, the left).

To sum up, do not fuss about a 50% probability. The number associated with it is as likely to be low as high. Use higher probabilities & confidence levels for guidance.

Generations:

The horizontal axis of the graph is maximum number of generations (upper bound) to reach the stated confidence level. DNA doesn’t know when your ancestors were born or how old their parents were and can only estimate the number of "transmission events" or opportunities for mutation. Each generation is another opportunity for mutation.

The question of "How many years per generation?" has no simple or universally-accepted answer. Estimates vary depending on time, area, & culture.

We think that 18th & 19th century American generations — on average — were in the range of 25 to 30 years. If you prefer a different estimate, please feel free to substitute your own.
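Converting a generation count into a range of years is simple arithmetic. The sketch below uses the 25-to-30-year range given above and rounds to the nearest 10 years (halves rounded up), a rounding rule that is our guess but appears to agree with the year ranges in the tables that follow. The function names are ours.

```python
import math


def round_half_up_to_10(x):
    """Round to the nearest 10, with exact halves rounded up."""
    return int(math.floor(x / 10.0 + 0.5)) * 10


def generations_to_years(generations, low=25, high=30):
    """Return (fewest, most) years for a generation count,
    assuming `low` to `high` years per generation."""
    return (round_half_up_to_10(generations * low),
            round_half_up_to_10(generations * high))


if __name__ == "__main__":
    for gens in (7, 23, 29):
        fewest, most = generations_to_years(gens)
        print(f"{gens} generations: {fewest}-{most} years")
```

To use a different estimate of years per generation, pass your own `low` and `high` values.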

Interpretation:

Matches of the minimum qualities indicated below indicate a common ancestor. The question then becomes "When might he have lived?" We’ll try to estimate the time of that MRCA, so that traditional documentary research can help identify him specifically. However, be aware that ancestors who lived before about 1350 probably did not have surnames.

TMRCA Calculators

These calculators seem to assume a mutation rate of about 0.002 per generation.

12-marker match:

Perhaps, the other person only tested 12 markers; that’s all you can compare.

CL     12 of 12           11 of 12            10 of 12
       Gen   Years        Gen   Years         Gen    Years
50%     7    180-210      17    430-510       16.5   410-500
90%    23    580-690      39    980-1170      56     1400-1680
95%    29    730-870      47    1180-1410     72     1800-2160

[Graph: cum p, 12 markers]

The best possible match (12/12) carries 99% confidence of sharing a common ancestor.

25-marker match:

"..most laboratories and surname projects recommend testing at least 25 {markers}. The more markers that are tested, the more discriminating and powerful the results will be. " Source.

With a match of 23 to 25 out of 25, the probability is >= 99% that you share a common ancestor.

CL     25 of 25           24 of 25            23 of 25
       Gen   Years        Gen   Years         Gen    Years
50%     3    80-90 *       7    180-210       11     280-330
90%    10    250-300      16    400-480       23     580-690
95%    13    330-390      20    500-600       27     680-810

[Graph: cum p, 25 markers]

37-marker match:

A 33 to 37 of 37 match carries a >= 99% probability of sharing a common ancestor.

CL     37 of 37           36 of 37            35 of 37
       Gen   Years        Gen   Years         Gen    Years
50%    2-3   50-90 *       4    100-120        6     150-180
90%     5    130-150       8    200-240       12     300-360
95%     7    180-210      10    250-300       14     350-420

[Graph: 37-marker matches, from data returned by McDonald calculator]

* Now, you have to add your own age to the years given, because the window is small enough to make your year of birth a significant factor. 

67-marker matches:

Not enough people have yet opted for the 67-marker panel to make it likely that you will find a match across 67 markers.  But, if you do, a match of 60 or more of 67 markers carries a >= 99% probability of sharing a common ancestor.

CL     67 of 67           66 of 67            65 of 67
       Gen   Years        Gen   Years         Gen    Years
50%     2    50-60         4    100-120        6     150-180
90%     4    100-120       8    200-240       12     300-360
95%     6    150-180       9    230-270       14     350-420

[Graph: 67-marker match]

What if I know he's not there? The "missing probability" paradox:

Some people have excellent documentation proving the MRCA can not be in the latest few generations; let's call it additional genealogical information, "AGI". It leads to the question "What happens to the probabilities assigned to those generations?"

On 07 Mar 2003, Roy wrote to [email protected]:

If I have a 50% probability of being related to another person within 7 generations. And we know for certain that our MRCA can not possibly be within the first four generations. So those generations are eliminated from the scope. Then is the probability of being related in the remaining 3 generations increased? If so, by how much?

Dr. Walsh wrote on 11 Mar 2009:

"When you have two individuals with an EXACT match, the probabilities for each generation fall off by exactly the same amount. The net result is that if (say) the probabilities for generations 1, 2, 3 are 0.4, 0.2, 0.1, then if you condition on the first (say) 5 generations being excluded, the resulting CONDITIONAL probabilities for 6, 7, 8 become 0.4, 0.2, 0.1 (the same as for 1, 2, 3). Hence, the whole distribution is shifted without a shape change to start at the value you set. This only happens with an exact match and can cause some confusion."

The sum of probabilities for those impossible generations gets re-allocated to the remaining, earlier generations. The entire probability (& cumulative) curve simply gets shifted to the right by the number of generations in which the MRCA can not exist. We can call this an "AGI adjustment" and the resulting curve is a "conditional probability distribution". The remaining probabilities are conditioned on the MRCA not existing in the generations for which the paper trail shows it cannot be, as demonstrated in the graph below.

graph: AGI adjustment

For the probability points listed in the graphs & tables above, add the number of generations in which the MRCA is found not to exist.  See also Dr. John Chandler's reply to Roy.
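The shift Dr. Walsh describes for exact matches can be checked with a small sketch. It assumes the exact-match probabilities fall by a constant factor q from one generation to the next (q = 0.5 echoes the halving in his illustration); the function names and the 60-generation cutoff are ours.

```python
def geometric_probs(q, max_gen):
    """P(MRCA at generation g), g = 1..max_gen, when each generation's
    probability is q times the previous one (the exact-match case)."""
    weights = [q ** (g - 1) for g in range(1, max_gen + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def exclude_recent(probs, excluded):
    """Condition on the MRCA NOT being in the first `excluded` generations
    and re-normalize what is left."""
    remaining = probs[excluded:]
    total = sum(remaining)
    return [p / total for p in remaining]


if __name__ == "__main__":
    original = geometric_probs(0.5, 60)
    adjusted = exclude_recent(original, 5)  # paper trail rules out 5 generations
    for offset in range(3):
        print(f"generation {offset + 1}: {original[offset]:.3f}   "
              f"generation {offset + 6} after adjustment: {adjusted[offset]:.3f}")
```

Under this assumption the adjusted curve has the same shape and simply starts later. For matches that are not exact, Dr. Walsh's reply indicates the pure shift does not hold, so re-normalizing the actual per-generation probabilities (as `exclude_recent` does) is the safer reading.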

Wrapping up this part:

Hopefully, we’ve explained how probabilities help us make sense of Y-DNA matches and how you can use them to focus your research on a common ancestor of you & your match.

 Part 2 adds more detailed information and explanations.


 

Part 2: For those who want more.

This part reviews some of the prior discussion on probabilities with DNA  in more detail. It’s for those who are comfortable with statistics or want to learn more.

Confidence & Doubt — No Absolute Certainty:

In Part 1, we said “The .. most important thing to remember with probabilities is that there is no absolute certainty; absolute certainty is beyond its power. We might be 99.99% sure of a statement, but can never be 100%.” That statement is controversial and some (perhaps, followers of Descartes) don’t buy it. So, we’ll take more space to explore the issues.

The proposition that some things can not be absolutely known with complete confidence is hard to accept. (It shocked the world of nuclear physics when Werner Heisenberg published his "Uncertainty Principle" in the 1920s.) It is, however, crucial to understanding why DNA can not give you an absolutely certain answer.

Probability theory applies to things that we wish to predict in advance or which can not practically be observed directly and, therefore, must be inferred. When a thing can readily be observed directly, we don’t need inference or prediction; observe it and remove the doubt. There is always an element of doubt, however tiny, to probability-derived statements. In fact, probabilistic functions start with the definition “0 < p < 1”, so that any probability for which they’re to be used is greater than zero and less than one.

Some may say that – if they have flipped five coins and all come up heads – the probability of having flipped five heads is 100%. They’re misreading the concept; probability applies before you flip the coins or draw to that inside straight, not after. When you’ve directly observed a thing or event, you do not need or want a probabilistic answer about it; you simply count the five heads.

 When we engage in probability, we step into a “parallel universe” which operates by different rules than 0/1, black/white, certain/impossible. In this universe, all things are possible, but some are highly unlikely. A butterfly may, indeed, cause a hurricane in Florida by flapping its wings in Brazil; it just isn’t very probable. We may talk about theoretical certainties or impossibilities, but they don’t show up in the world of probabilities; confidence levels never reach 100%; there is always some small (perhaps minute) element of doubt.

A failure to appreciate the lack of certainty in probabilistic conclusions can have serious consequences. The 2008 collapse of global credit markets is mostly due to this misunderstanding. Sophisticated financial institutions did not realize the risks they were taking with complex, probability-based investments.

It is true that the theoretical sum of an entire probability distribution equals unity (100%). However, the sum includes a great many very small probabilities, out at the "tails" of the distribution.

You may also want to look at a talk on aircraft software reliability. In short, the speaker concludes “If claims for dependability can never be made with certainty, we  need a formalism that handles the uncertainty.”

The sources of probabilistic doubt are many; they include:

 Those who wish to delve into the philosophical underpinnings may want to read “Probability without Certainty” or “Adler on Certainty and Probability”. A counter-argument (that probability can exceed 100%) is at “Probability, certainty and natural distribution”.

 For a mathematical demonstration, take the binomial distribution:

Probability of Mutations, revisited:

The frequency (or frequencies) of the random Y-DNA mutations establishes the marginal probabilities we need to work with. Unfortunately, these frequencies aren't wholly known.

Mutations in Y-DNA, as reflected in STR marker/allele counts, aren’t common. In a 2000 study, Manfred Kayser & Antti Sajantila looked at Y-DNA transmission at 15 loci (markers) in father/son pairs, 4,999 transmissions in all; they found only 14 total mutations and only 2 instances of two mutations per pair. (They also found a relationship between the frequency of mutation and a locus’ molecular structure.) Across the 15 loci, the frequency of mutation reached a high of 8.6×10^-3 (0.0086) and averaged 2.8×10^-3 (0.0028), which is 14 mutations divided by 4,999 transmissions.

 These changes in the loci are (we think) random & independent of each other. Charles Kerchner, http://www.kerchner.com/dna-info.htm,  has studied the subject and estimated rates of mutation in individual markers. For simplicity’s sake, we’ll classify them as

Even the “fast” markers are not especially quick. The frequency rate (Another term for these rates is “marginal probability” per generation.) calculates to:

The FTDNA Y-DNA STR panels have, respectively, by count:

Panel                        12    25    37    67
“Fast” markers                3     8    13    13
“Slow” markers                9    17    24    54
Panel size, total markers    12    25    37    67

The probability of a mutation, from one generation to the next, in any specific marker is very small.  However, it’s not zero and that fact enables our use of statistical tools.

Dr. Bruce Walsh of the University of Arizona likens the mutation probability to a ticking clock. The clock ticks slowly and randomly, allowing us to estimate a range of time back to the MRCA, based on the number of observed mutations.

Skipping Ahead of the Math:

You’ve been reading the shortened & simplified form of this exploration. We’ve saved for later, discussions of :

 If you’re interested in the deleted material beyond the links or what we have below, please ask. It’s fascinating stuff for the math-junky.

 Short version:  The FTDNA probability tables & graph are based on comparisons of probabilities to derive a confidence level that the match you see is not due to random chance  (represents a common ancestor) and to estimate the time back to the most recent common ancestor (MRCA).

Binomial Distributions

Dr. Walsh wrote on 11 Mar 2009:

"The actual distribution is a multivariate geometric --- essentially a series of binomials."

The formula for the probability of k successes in n trials, where p is the probability of success in each trial, is
$P = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$.

Note that as k → n, $n!/[k!\,(n-k)!] \to 1$ {since (n-n)! = 0! = 1} and $(1-p)^{n-k} \to (1-p)^0 = 1$. At k = n, the formula reduces to
$P = p^n$.
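A short sketch of the binomial formula in code. As an illustration only, it treats a "success" as a marker that did not change in one transmission, with p = 1 - 0.002 (the mutation rate is an assumed figure); the function name is ours.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


if __name__ == "__main__":
    p_unchanged = 1.0 - 0.002  # assumed chance a marker does NOT mutate
    for k in (37, 36, 35):
        print(f"{k} of 37 markers unchanged in one transmission: "
              f"{binomial_pmf(k, 37, p_unchanged):.4f}")
```

When k equals n, the binomial coefficient and the (1-p) term both become 1, so the result is simply p multiplied by itself n times.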

Poisson Distribution

One advantage of the Poisson distribution is not needing to know very much about the underlying distribution; all you really need is an average per unit of time, space, etc. {The mean & variance are the same.} Another advantage is its applicability to integer variables.

Its very first application was in 1898, related to Prussian Army deaths by horse-kicks. It's heavily relied on in "Queuing theory", for study of traffic congestion or how much capacity is needed.

Imagine a bakery bakes four loaves of a specialty bread every day and, on average, sells three (3) of them. What are the chances it will have customers asking for five (5) loaves?  
Formula: $\Pr(k;\lambda) = \frac{\lambda^k}{e^{\lambda}\, k!}$, with λ = 3, k = 5, e ≈ 2.718
Substituting: $\Pr = 3^5 / (e^3 \cdot 5!) = 243 / (20.09 \times 120) \approx 243 / 2410 \approx 0.101$, or about 10.1%

If the probability of an event is very small {0 < p ≪ 1}, the Poisson distribution resembles the binomial distribution. It may become a fair approximation to the binomial for approximate mutation probabilities, where 0.002 ≤ p ≤ 0.004.

For this flat-tail condition: $\Pr(Y=y) = e^{-\mu}\,\mu^y / y! = \mu^y / (e^{\mu}\, y!)$, where
e = base of natural logarithms (≈ 2.718), µ = mean (equal to the variance), y = occurrences for which probability is desired,
y! = factorial of y {y*(y-1)*(y-2)*(y-3)...1}
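Both points can be checked numerically: the bakery example, and how closely the Poisson tracks the binomial when p is small. In the comparison, the panel size (37) and per-marker probability (0.003) are illustrative assumptions, and the function names are ours.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of exactly k occurrences when the mean is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def binomial_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


if __name__ == "__main__":
    # Bakery example: average demand 3 loaves, chance of demand for exactly 5.
    print(f"P(5 loaves) = {poisson_pmf(5, 3):.4f}")

    # Small-p comparison (assumed values): 37 markers, p = 0.003 each.
    n, p = 37, 0.003
    for k in range(4):
        print(f"k={k}  binomial={binomial_pmf(k, n, p):.5f}  "
              f"poisson={poisson_pmf(k, n * p):.5f}")
```

For small p the two columns nearly coincide, which is why the Poisson can stand in for the binomial at mutation-sized probabilities.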

Multiple Independent Events

The probability of all of the events occurring is the product of their individual probabilities:
p(A & B) = p(A)*p(B)
p(A, B & C) = p(A)*p(B)*p(C), etc.

A jar has 15 marbles: 5 red, 5 blue and 5 yellow. You draw a marble at random, replace it & then repeat this twice more. What are the chances of drawing three (3) red marbles?
The individual chance of drawing red is p(red) = 5/15 = 1/3; the chance of drawing three reds is
p(3 reds) = 1/3 * 1/3 * 1/3 = 1/27 ~ 0.037 = 3.7%.
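Exact fractions make the product rule easy to verify for the marble jar. This is a minimal sketch; the variable names are ours.

```python
from fractions import Fraction

# Drawing with replacement: each draw is independent, 5 reds out of 15.
p_red = Fraction(5, 15)

# Independent events multiply.
p_three_reds = p_red ** 3

if __name__ == "__main__":
    print(f"p(3 reds) = {p_three_reds} = {float(p_three_reds):.3f}")
```

The result is 1/27, a little under 4%.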
 

Multiple Dependent Events, Conditional Probabilities

p(A & B) = p(A) * p(B|A),  where P(B|A) is the conditional probability of B given A.
Or, p(B|A) = p(A&B)/p(A)

In the marble example, assume you drew a red first and do not replace drawn marbles. What are the chances of drawing blue within the next one or two tries? By not replacing the red marble, you affect the chances for future draws.
First try: p(b|r) = 5/14 ~ 0.357 = 35.7%. Within two tries: p(b|r) + p(<>b|r) * p(b|r,<>b) = 5/14 + (9/14 * 5/13) ~ 0.604 = 60.4%.
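The dependent-draw arithmetic can be checked with exact fractions. After one red marble is removed, 14 marbles remain, 5 of them blue; blue within two tries means blue on the next draw, or not-blue followed by blue. The variable names are ours.

```python
from fractions import Fraction

# After a red is drawn and kept out: 14 marbles left, 5 blue.
p_blue_next = Fraction(5, 14)
p_not_blue_next = 1 - p_blue_next          # 9/14

# If the next draw was not blue, 13 marbles remain and 5 are still blue.
p_blue_second = Fraction(5, 13)

# Blue on the first try, OR not blue and then blue.
p_blue_within_two = p_blue_next + p_not_blue_next * p_blue_second

# Cross-check: 1 minus the chance of no blue in either draw.
p_no_blue_twice = Fraction(9, 14) * Fraction(8, 13)

if __name__ == "__main__":
    print(f"p(blue next) = {float(p_blue_next):.3f}")
    print(f"p(blue within two tries) = {float(p_blue_within_two):.3f}")
```

The two probabilities cannot simply be added, because the second draw only happens if the first one was not blue.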

Confidence levels:

A confidence level (CL, or confidence interval) tells you the balance of certainty versus uncertainty in a conclusion, such as “The MRCA for me and my match was no more than 8 generations ago.” With a CL of 90%, you can be 9/10 certain and 1/10 doubtful of the statement. 

 FTDNA’s graph uses four points on the confidence spectrum (the vertical axis). Here’s what you need to know about confidence levels:

 The horizontal axis is time, the maximum number of generations back to the MRCA. It’s measured in generations because DNA doesn’t know  the parents' age when the children were conceived.

 Here’s a (edited) copy of the table on which the graph is based. You can find the original at http://www.familytreedna.com/faq-markers.aspx.

Matching markers    50% p MRCA was <= gen.    90% p MRCA was <= gen.    95% p MRCA was <= gen.

12 markers
  11 of 12          17                        39                        47
  12 of 12          7                         23                        29

25 markers
  23 of 25          11                        23                        27
  24 of 25          7                         16                        20
  25 of 25          3                         10                        13

37 markers
  35 of 37          6                         12                        14
  36 of 37          4                         8                         10
  37 of 37          2 to 3                    5                         7

67 markers
  65 of 67          6                         12                        14
  66 of 67          4                         8                         9
  67 of 67          2                         4                         6

The data used in Part 1, was taken from this table (& others as noted) & reorganized for a (hopefully) simpler explanation for the reader. The calculation of year ranges is our own.

 FTDNA confidence levels for DNA matches

Binomials Again!

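Binomial probabilities best fit the phenomenon of DNA matches. The equation for the probability of k successes in n trials is

$P = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$

where n is the number of trials, k the number of successes, and p the probability of success in each trial.

The formula explains why a perfect match, n=k, is asymptotic to the vertical axis.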

The formula for the cumulative probability is

We suspect, in the absence of knowing, that the quoted generational probabilities represent the ratios of two (or more?) binomial distributions.

What does “generations” mean in dates?

Converting generations to years is tricky.  We haven’t found a solid reference. Part of our problem is that we're searching for an average figure for many families over many generations. Estimates vary from as little as 15 to as many as 35 years per generation. Undoubtedly, the time, place and culture of the population matter.

One not-so-hypothetical family illustrates the complexity of the problem:

Jim marries Betsy and their first child is born when both are 18; they continue having a child every other year or so until Betsy dies at 35. Jim then marries Sarah, age 24, and they have children until Jim is 45. When the last child is born, the first is an adult and also having children, a generational overlap. The years-per-generation for Jim's children ranges from 18 to 45.

One British study used an average of 35 years per generation. While this may apply to a mature British population, the American population in the 18th & 19th centuries was heavily weighted toward the young. Fewer than 32% of Americans in 1810 were over 25 years of age, while almost 35% were under 10.

 Americans married and bore their first children young, often as young as 18. While later siblings may have been born at a later age, by age 40 most families were done having children. I believe something closer to 27 applies for my American Taylors. So, let’s take a range of 25 to 30.

 The graph below shows the age distribution for free white persons enumerated in the 1810 U.S. federal census:

graph: age distribution in 1810 US census by male & female Source: (2004). Historical Census Browser. Retrieved 23 Feb 2009, from the University of Virginia, Geospatial and Statistical Data Center:
http://fisher.lib.virginia.edu/collections/stats/histcensus/index.html 

(Note that almost 50% of the population is <=15 and almost 70% <=25.)

Interpretation:

Be sure you've read the Interpretations section in Part 1. To reprise a part, in plain language:

Interpreting non-matching markers

Ken Nordtvedt has published an article in the Fall 2008 Journal of Genetic Genealogy discussing a method for going beyond simple interpretation of matching vs. non-matching markers and into the question of meaning of the differences in non-matching markers. In short, he establishes modal values to infer the CMA haplotype and computes distances from that mode. He goes on to use Bayesian analysis to estimate a probability distribution. His estimate for the probability distribution's peak is:

G = n/(2M), where

Cumulative probability is given by:

$\mathrm{Prob}(G) \sim G^{GD}\, e^{-2MG}$, where

This author confesses to less than full understanding of Nordtvedt's article. Read it by clicking on the link above.

We've run some numbers using these formulas and they don't seem to produce usable results.

Conditional Probabilities Revisited

In Part 1, we talked about "conditional probabilities" in relation to documentation proving the MRCMA can not be in the latest few generations. It's good to revisit this in a more formal way.

The formula for the probability of B, given A, is: p(B|A) = p(A&B)/p(A)
Example:

A jar contains black and white marbles. Two marbles are chosen without replacement. The probability of selecting a black marble and then a white marble is 0.34, and the probability of selecting a black marble on the first draw is 0.47. What is the probability of selecting a white marble on the second draw, given that the first marble drawn was black?
 

Solution:
P(White|Black) = P(Black and White) / P(Black) = 0.34 / 0.47 ≈ 0.72 = 72%

In the same way, given that the MRCMA can not be in the latest 4 generations, the probability of being in earlier generations is increased.
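The same division applies to generations. The sketch below checks the marble answer and then applies the formula to a question like Roy's from Part 1. The 50% figure for 7 generations comes from his question; the 30% figure for 4 generations is a made-up value for illustration, and the function names are ours.

```python
def conditional(p_a_and_b, p_a):
    """P(B|A) = P(A and B) / P(A)."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_and_b / p_a


def p_within_given_excluded(cum_probs, within, excluded):
    """P(MRCA <= within | MRCA > excluded), where cum_probs[g] is the
    cumulative probability that the MRCA lived within g generations."""
    p_excluded = cum_probs[excluded]
    return conditional(cum_probs[within] - p_excluded, 1.0 - p_excluded)


if __name__ == "__main__":
    print(f"P(White|Black) = {conditional(0.34, 0.47):.2f}")

    # Hypothetical cumulative probabilities: 30% by generation 4 (assumed),
    # 50% by generation 7 (from the question quoted in Part 1).
    cum = {4: 0.30, 7: 0.50}
    inside = p_within_given_excluded(cum, within=7, excluded=4)
    print(f"P(MRCA in generations 5-7 | not in 1-4) = {inside:.2f}")
    print(f"P(MRCA earlier than generation 7 | not in 1-4) = {1 - inside:.2f}")
```

With these assumed numbers, ruling out the first four generations lowers the chance that the MRCA falls within 7 generations and raises the chance that he lived earlier.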

Conclusion:

The statistical methods used by FTDNA are, in some sense, a “black box”. We do not know the formulas used. Nor, do we have access to the data. We’ve used the summary data they’ve made available and supplemented it with estimates from other sources. Should we get more information, we’ll do a Part 3.