GENEALOGY-DNA-L Archives

From: Jonathan Day <>
Subject: Re: [DNA] TMRCA
Date: Thu, 15 Feb 2007 00:31:44 -0800 (PST)
In-Reply-To: <BAY105-F11B8B5D355D5DA2C6AC20CCC970@phx.gbl>


Some calculators I have seen ask for the number of
markers and offer a range of choices on the difference
(genetic difference of 1, genetic difference of 2
split over two markers, genetic difference of 2 in a
single marker, and so on).
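
To make the difference between those choices concrete,
here is a minimal Python sketch. The repeat counts are
made up for illustration, and counting each repeat of
difference as one step is an assumption about how such
calculators tally differences, not a description of any
particular one.

```python
def genetic_distance(a, b):
    """Sum of absolute repeat-count differences over shared markers."""
    return sum(abs(a[m] - b[m]) for m in a if m in b)


def markers_differing(a, b):
    """Number of shared markers whose values differ at all."""
    return sum(1 for m in a if m in b and a[m] != b[m])


# Hypothetical haplotypes: marker name -> repeat count.
person = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
split = {"DYS393": 14, "DYS390": 25, "DYS19": 14}   # 1 step on each of two markers
single = {"DYS393": 13, "DYS390": 26, "DYS19": 14}  # 2 steps on a single marker

print(genetic_distance(person, split), markers_differing(person, split))
print(genetic_distance(person, single), markers_differing(person, single))
```

Both comparisons have a genetic difference of 2, but
the first spreads it over two markers and the second
puts it all on one, which is the distinction the
calculators are asking about.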

Clearly, such calculators are crude. They make no
pretense to be otherwise, offering a probability curve
from 0% to 99.9% chance that the common ancestor is
before some given timeframe. It's a simple one-tailed
test using very basic methods.
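
As a rough illustration of what such a crude calculator
might be doing, here is a sketch in Python. It is not
the method of any specific calculator. It assumes an
infinite-alleles model (a marker still matches after g
generations with probability exp(-2*mu*g)), a flat
prior on g, and a per-marker mutation rate mu of 0.002
per generation, which is a placeholder value chosen
only for the example.

```python
import math


def tmrca_cdf(n_markers, n_mismatch, mu=0.002, max_gen=2000):
    """Cumulative probability that the MRCA lived within g generations.

    Assumptions (not from any real calculator): infinite-alleles model,
    flat prior on g truncated at max_gen, and an assumed mutation rate mu.
    """
    weights = []
    for g in range(1, max_gen + 1):
        q = math.exp(-2.0 * mu * g)  # chance a single marker still matches
        # Binomial likelihood without its constant coefficient, which
        # cancels when the weights are normalized.
        like = (q ** (n_markers - n_mismatch)) * ((1.0 - q) ** n_mismatch)
        weights.append(like)
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    return cdf  # cdf[g - 1] = P(TMRCA <= g generations)


def generations_for(cdf, level):
    """Smallest g whose cumulative probability reaches `level`."""
    for g, p in enumerate(cdf, start=1):
        if p >= level:
            return g
    return len(cdf)


cdf = tmrca_cdf(n_markers=25, n_mismatch=2)
for level in (0.50, 0.95, 0.999):
    print(level, generations_for(cdf, level))
```

Note that only the count of mismatching markers goes
in. The actual marker values, and the size of each
difference, are thrown away.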

I guess the first thing to do is to identify what a
good calculator MUST be able to do to qualify as good,
then decide how inaccurate the calculator must become
if it lacks a specific capability.

It would seem to me that a good calculator must make
some attempt at figuring out what the common
ancestor's markers most likely were. The absolute
values probably aren't important, but how you divvy up
the mutations and whether the mutations were
cumulative or subtractive would most certainly matter.

My guess would be that halving the time to the most
recent common ancestor would more than double the
probability of getting an accurate window. If this is
the case, then having multiple, well-spaced
individuals would be good, as you could then derive a
set of "most likely marker values" for relatively
close common ancestors and restart the calculation
from them.
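
One simple way to get a set of "most likely marker
values" from several related individuals is to take
the most common value at each marker (often called a
modal haplotype). The sketch below does that in Python.
Treating the modal value as the ancestral value is an
assumption, and the haplotypes are invented for the
example.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common value at each marker across a group of related people.

    Ties go to whichever value was seen first, an arbitrary choice
    made here for simplicity.
    """
    modal = {}
    for marker in haplotypes[0]:
        counts = Counter(h[marker] for h in haplotypes if marker in h)
        modal[marker] = counts.most_common(1)[0][0]
    return modal


# Hypothetical group: marker name -> repeat count for each person.
group = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    {"DYS393": 13, "DYS390": 24, "DYS19": 15},
    {"DYS393": 14, "DYS390": 24, "DYS19": 14},
]

print(modal_haplotype(group))
```

Each person here differs from the modal values at no
more than one marker, so restarting the calculation
from the modal set shortens every comparison.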

The carbon dating reference is a good one, here. It
turns out that C12/C14 dating has a lot of variables
which are not immediately obvious. A known C12/C14
reference value from a tree in Canada would be next to
useless for dating an object in North Africa, for
example. Your reference values have to be comparable
in date and place.

Likewise, mutation rates quite likely vary across the
globe, depending on how much the local environment
increases (or decreases) the chances of an error in
copying. The rate will not necessarily be a fixed value
for a place, but may vary over time.

All that seems straightforward enough. So, from this,
I think we can say the following. A calculator, to be
accurate, must:

1) Take the actual marker values of the individuals,
not just a genetic difference

2) Take additional reference values from as many
relations as possible for both individuals, provided
that there is a reasonably even and reasonably broad
spread in relationships.

3) Take the exact nature of the relationships between
the reference individuals and the individual from that
group you are trying to match, where known. (Family
trees aren't always perfect, and surname projects may
be able to cluster individuals with an extremely high
probability without being able to say in what way the
individuals are related.)

4) Take the geographic location of the two individuals
to be matched, plus as many of the reference
individuals as possible, plus the oldest known
ancestor in each group.

5) Take the earliest possible timeframe for this most
recent common ancestor.

6) Take one or more individuals who are definitely
related PRIOR to this most recent common ancestor.
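
The six inputs above could be gathered into a structure
along these lines. This is only a sketch in Python. All
of the class and field names are hypothetical, and the
check at the end simply reports which requirements have
not been supplied.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Haplotype = Dict[str, int]  # marker name -> repeat count


@dataclass
class Person:
    name: str
    markers: Haplotype                           # requirement 1
    location: Optional[str] = None               # requirement 4
    oldest_known_ancestor: Optional[str] = None  # requirement 4
    relationship: Optional[str] = None           # requirement 3, None if unknown


@dataclass
class TmrcaQuery:
    subject_a: Person
    subject_b: Person
    references_a: List[Person] = field(default_factory=list)  # requirement 2
    references_b: List[Person] = field(default_factory=list)  # requirement 2
    earliest_year: Optional[int] = None                       # requirement 5
    outgroup: List[Person] = field(default_factory=list)      # requirement 6

    def missing_inputs(self) -> List[str]:
        """List which requirements are not yet satisfied.

        Requirement 3 is skipped because relationships are only
        needed "where known".
        """
        missing = []
        if not self.subject_a.markers or not self.subject_b.markers:
            missing.append("1: marker values")
        if not self.references_a or not self.references_b:
            missing.append("2: reference relations")
        if self.subject_a.location is None or self.subject_b.location is None:
            missing.append("4: locations")
        if self.earliest_year is None:
            missing.append("5: earliest timeframe")
        if not self.outgroup:
            missing.append("6: individuals related before the MRCA")
        return missing


# Made-up example: two subjects and an earliest year, nothing else.
a = Person("A", {"DYS393": 13}, location="Ohio")
b = Person("B", {"DYS393": 13})
query = TmrcaQuery(a, b, earliest_year=1600)
print(query.missing_inputs())
```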

There are two ways you can go from there. You could
build up candidate family trees, starting from the
group and working back to a most recent common
ancestor in each tree, rejecting a tree whenever the
probability of some node within it being correct falls
below some threshold. With a large enough set of
individuals in the group, you should reduce to a
handful of near-identical candidate trees. You pick
three of them - the one over the shortest timespan,
the one over the longest timespan, and the most likely
- and calculate the TMRCA using the predicted marker
values for each of the MRCAs at the top of each tree.

The heuristic used to draw up the trees and prune the
unlikely ones is well-known and is used to search
extremely large structures of data organized by time.
The main drawback to this approach is that your
information and your pruning methods need to be
excellent. It's a hack-and-slash approach, which is
fine if you know exactly what needs to be cut.
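
A toy version of the build-and-prune idea, in Python,
might look like the sketch below. It is not necessarily
the heuristic referred to above. It builds candidate
ancestral marker values for two descendants one marker
at a time and drops any partial candidate whose score
falls below a threshold. The score (step_prob raised to
the number of repeat steps needed) and both default
values are assumptions made for illustration, not real
mutation probabilities.

```python
def candidate_ancestors(x, y, step_prob=0.1, threshold=0.005):
    """Enumerate plausible ancestral haplotypes for two descendants.

    x, y: lists of repeat counts in the same marker order.
    step_prob: assumed relative weight of each one-repeat change.
    threshold: partial candidates scoring below this are pruned.
    """
    partial = [((), 1.0)]  # (ancestor values so far, score)
    for xv, yv in zip(x, y):
        lo, hi = min(xv, yv) - 1, max(xv, yv) + 1
        extended = []
        for values, score in partial:
            for a in range(lo, hi + 1):
                steps = abs(a - xv) + abs(a - yv)
                new_score = score * (step_prob ** steps)
                if new_score >= threshold:  # prune unlikely branches early
                    extended.append((values + (a,), new_score))
        partial = extended
    # Most likely candidates first.
    return sorted(partial, key=lambda c: c[1], reverse=True)


# Made-up repeat counts for two people who differ at one marker.
survivors = candidate_ancestors([13, 24, 14], [13, 25, 14])
for values, score in survivors:
    print(values, score)
```

With these inputs only two candidates survive, one
matching each descendant, which shows both the appeal
and the risk: everything hinges on the threshold and
the scoring being right.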

The second method is to use a self-learning network.
This is a mechanism that is trained on what is known
and from there derives the relationships necessary for
all constraints to hold true.

Self-learning networks, at least in their simplest
single-layer form, only work if the problem space
is linearly separable. In other words, if the machine
were to draw up an imaginary family tree, with one
"person" in that tree only at the points a mutation
occurs, the machine needs to be able to draw some
number of straight lines through that tree such that
each "person" is separated from every other "person".

You also need one person in each group for each line
that would be drawn. That's how it figures out where
the lines are.
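
The linear separability point can be seen with the
simplest self-learning network, a single-layer
perceptron. The Python sketch below uses the standard
textbook toy problems (logical AND and XOR) rather than
marker data, which is a substitution made purely for
illustration. AND can be split with one straight line,
so training settles. XOR cannot, so training never
settles. Networks with more than one layer do not have
this particular limitation.

```python
def train_perceptron(samples, epochs=100):
    """Train a single-layer perceptron on 2-input samples.

    samples: list of ((x1, x2), label) with label 0 or 1.
    Returns (weights, bias, converged), where converged is True
    if a full pass over the samples produced no errors.
    """
    w = [0, 0]
    b = 0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), label in samples:
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = label - predicted
            if delta != 0:
                # Classic perceptron update: nudge the line toward
                # the misclassified point.
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND converged:", train_perceptron(AND)[2])
print("XOR converged:", train_perceptron(XOR)[2])
```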

The three big drawbacks with self-learning networks
are that (a) the simplest kind only work on linearly
separable problems and produce nonsense otherwise, (b)
they need a lot of computing power, and (c) if the
method of training them is incorrect, then the network
will produce valid results but not for the problem you
want to solve.

For only a handful of generations, heuristics are the
way to go. The odds of an error will be small, the
number of people you would need will be small, and the
odds that the calculations are being done right will
be very large.

If you're looking back many generations (I'm going to
guess that this would be anything much above 10
generations), then self-learning networks would be a
much better approach, in terms of what you'd need to
put in to get at least as good results.

(Using heuristics at 10 generations, you'd want the
total number of people involved to be at least 45.
Self-learning networks would need about 30.)



--- Steven Bird <> wrote:

> The problem as I see it is two-fold, at least. On the
> one hand, MRCA calculators are trying to offer the
> consumer/lay person a predictor about where to look.
> Sometimes, this works quite well. I plugged in my
> allele values and those of my sixth cousin into
> Wimpy's TMRCA calculator and it predicted a birth date
> of 1747 (with 1950 as the average birth year for the
> present generation) for our MRCA. Strangely enough,
> the person involved, Samuel Bird III, was born in
> 1747! We know this through a conventional paper trail.
> That would appear to be pretty good.
>
> OTOH, the same calculator appeared to underestimate
> the TMRCA badly for my ninth cousin and myself, both
> descendants of Thomas Bird, who was born in about
> 1600. The result was in the early 1720's, obviously
> much too short.
>
> Still, I would not be looking for a common ancestor in
> the late 1800's from either calculation. That is
> useful sometimes.
>
> The other problem has to do with estimating population
> ages, which is much more difficult. One can only hope
> for a very rough estimate in most cases, I think. I
> keep hoping that a better, more accurate approach will
> evolve.
>
> > 'I just have no confidence in these MRCA
> > calculations. You are working with mother nature at
> > her random best with mutations.'
>
> This is similar to complaining about the margin of
> error in carbon dating, IMO. What's the alternative
> presently? Carbon dating became more accurate once
> someone figured out that ancient tree rings were a
> useful source of information for the purpose of
> calibrating carbon 14 levels, year by year. We just
> need some clever person to figure out a way to
> calibrate the estimates better.



