GENEALOGY-DNA-L Archives
From: James Heald
Subject: Re: [DNA] Central Limit Theorem in Action
Date: Wed, 05 Mar 2008 13:22:46 +0000
Ken Nordtvedt wrote:
> But I don't believe the variance of the sum of marker variances grows as MG,
> M being total mutation rate and G being generations. It certainly is
> proportional to M from what I see, and at one time I entertained the
> proportionality to G, but everything I have learned since indicates that
> variance of the variance goes as M times a sum over generations "g" of 1 /
> n(g) with n(g) being the population of "contributors" in the population in
> generation "g"; n(1) starting with 2. (Contributors are those males left
> in the pruned tree who actually are ancestors of one or more of the people
> in the present-day population under examination.)
Ken, that sounds interesting. I'd be interested to read about the theory
behind it in more detail. Do you have an exposition on your web page?
As you yourself wrote, a few posts earlier, it is important to
distinguish the calculation of the TMRCA between two samples, and the
overall TMRCA for a group of samples. The group calculation is a lot
harder.
When there are only two samples, I think your formula agrees with me.
n(g)=2 for all g; so Sum_g (1/n(g)) = G/2.
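
As a quick check of the two-sample case, here is a minimal sketch that evaluates Sum_g (1/n(g)) for a given list of contributor counts. The function name, the choice of G = 20, and the "expanding" tree (doubling each generation, capped at 64 contributors) are all hypothetical values chosen for illustration, not anything taken from Ken's calculation.

```python
def sum_inverse_contributors(n_of_g):
    """Return Sum_g 1/n(g) for a sequence of contributor counts, one per generation."""
    return sum(1.0 / n for n in n_of_g)

G = 20  # hypothetical number of generations

# Two samples: n(g) = 2 in every generation, so the sum is G/2.
pair = [2] * G

# Hypothetical expanding tree: n(1) = 2, doubling each generation, capped at 64.
growing = [min(2 ** g, 64) for g in range(1, G + 1)]

pair_sum = sum_inverse_contributors(pair)
growing_sum = sum_inverse_contributors(growing)

print(pair_sum, growing_sum)
```

For the pair the sum grows linearly with G. For the toy expanding tree it is dominated by the early generations, but each later generation still adds 1/64, so some dependence on G remains.
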
I'm not so clear how to apply your formula to the more complicated case,
but I suppose one would consider the joint forward probability of the
ASD and the tree shape, and then sum over the possible tree shapes:
    Prob(ASD | G) = Sum_{tree_shape} Prob(ASD, tree_shape | G)
                  = Sum_{tree_shape} [ Prob(ASD | tree_shape, G)
                                       x Prob(tree_shape | G) ]
and, provided the mean of the ASD given G is the same for every tree
shape, one can also see that
    Var(ASD | G) = Sum_{tree_shape} [ Var(ASD | tree_shape, G)
                                      x Prob(tree_shape | G) ]
Is this the sort of thing you've been calculating?
But it looks to me as though this would still be likely to have quite a
strong functional dependence on G?
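
The decomposition of Var(ASD | G) over tree shapes can be checked on a toy mixture. By the law of total variance, the total variance is the probability-weighted sum of the per-shape variances plus a "between-shape" term that vanishes only when every tree shape gives the same mean ASD. The sketch below computes both terms; the two tree shapes, their probabilities, means, and variances are made-up numbers, and `mixture_variance` is a hypothetical helper.

```python
def mixture_variance(probs, means, variances):
    """Split the variance of a mixture into its two law-of-total-variance terms.

    within  = Sum_t p_t * Var_t
    between = Sum_t p_t * (mean_t - overall_mean)**2
    The total variance is within + between.
    """
    overall_mean = sum(p * m for p, m in zip(probs, means))
    within = sum(p * v for p, v in zip(probs, variances))
    between = sum(p * (m - overall_mean) ** 2 for p, m in zip(probs, means))
    return within, between

probs = [0.25, 0.75]     # hypothetical Prob(tree_shape | G)
variances = [4.0, 8.0]   # hypothetical Var(ASD | tree_shape, G)

# Case 1: both tree shapes give the same mean ASD.
within_same, between_same = mixture_variance(probs, [10.0, 10.0], variances)

# Case 2: the mean ASD differs between tree shapes.
within_diff, between_diff = mixture_variance(probs, [8.0, 12.0], variances)

print(within_same, between_same)
print(within_diff, between_diff)
```

In the first case the weighted sum of conditional variances is the whole answer; in the second the between-shape term adds to it.
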
I'm also not entirely convinced that the ASD is a "sufficient statistic"
for the problem, given a whole group of data samples. I'd need some
argument that it really does capture all the relevant information in the
data.
Have you compared your group TMRCA estimates with BATWING? BATWING
goes about things in a strictly Bayesian way (though I still need to
think some more about the coalescent prior it assumes). Do you know
whether your estimates match up?
> When you have things in the form of Prob(ASD | G) after applying Bayes
> theorem, one needs to take a derivative of that expression with respect to G
> to find the most likely G. If it is a Gaussian-like function with the mean
> ASD equal to 2MG, but the variance of the ASDs about 2MG being independent
> of G, there is no further derivative with respect to a normalizing function
> involving G; so one is done and the most likely G will be ASD/2M (for many
> markers which have pushed the ASD distribution to Gaussian-like)
Per the above: I am dubious that the predicted variance Var(ASD | G)
should be independent of G. I can't see any reason why it should be.
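
To make the point at issue concrete, here is a minimal sketch of how the most likely G moves depending on whether the variance is taken to be independent of G or to grow with G. It assumes a Gaussian likelihood for the ASD with mean 2MG and no prior on G (a flat prior), and finds the maximum by a simple grid search. The mutation rate, the observed ASD, and both variance functions are hypothetical values picked for illustration only.

```python
import math

def log_likelihood(asd, G, M, variance):
    """Gaussian log-likelihood of an observed ASD with mean 2*M*G and the given variance."""
    mean = 2.0 * M * G
    return -0.5 * math.log(2.0 * math.pi * variance) - (asd - mean) ** 2 / (2.0 * variance)

def most_likely_G(asd, M, variance_of_G, G_grid):
    """Grid search for the G that maximises the likelihood."""
    return max(G_grid, key=lambda G: log_likelihood(asd, G, M, variance_of_G(G)))

M = 0.1      # hypothetical total mutation rate per generation
asd = 10.0   # hypothetical observed ASD
G_grid = [g / 10.0 for g in range(1, 2001)]  # candidate G from 0.1 to 200.0

# Variance independent of G: the maximum sits at ASD / 2M.
G_const = most_likely_G(asd, M, lambda G: 4.0, G_grid)

# Variance proportional to G (assumed here to be 0.2 * G): the extra
# G-dependent normalising term pulls the maximum below ASD / 2M.
G_growing = most_likely_G(asd, M, lambda G: 0.2 * G, G_grid)

print(G_const, G_growing)
```

With a constant variance the estimate is ASD/2M; once the variance depends on G the normalising factor contributes to the derivative and the estimate shifts.
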
>
> Months ago when I thought that variance might depend on G rather than just
> the early generation history, I had the modified estimator due to additional
> terms in the derivative with respect to G derived and mentioned on the list.
> If somehow one believed that the early time history varied with respect to
> consideration of different total G for the history, then those modifications
> would still be there. But there is no definite link; in other words,
> comparing populations with different G to find a most likely history to fit
> the ASD is really a subset of different conceivable population histories
> that could be under comparison. In a sound bite: "there's more to a
> population history than its total age".
Sure, G is only part of the story. The distribution for the whole
history would be something more like
P(G, tree_shape | data)
-- which is what BATWING aims to generate samples from.
>
> Certainly none of the above has anything to do with the extreme boundary
> case of comparing a pair of haplotypes having zero ASD; that's a much
> simpler problem with a different applicable answer of zero generations being
> most likely but a non-zero average number of generations to MRCA which only
> collapses toward zero as 1/M.
True.
But I hope I've also explained why the distribution for the 2-individual
TMRCA remains so skewed for so long when the ASD is not zero.
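
For the zero-ASD boundary case Ken describes, a rough numerical sketch shows the most likely G sitting at zero while the mean stays well above zero and shrinks roughly as 1/M. It assumes a flat prior on whole generations G = 0..G_max and a likelihood of exp(-2*M*G) for seeing no differences, i.e. no mutation on either line of descent, with compensating back-mutations ignored. The function name, G_max, and the two mutation rates are hypothetical.

```python
import math

def zero_asd_posterior_summary(M, G_max=10000):
    """Posterior mode and mean of G for a pair of haplotypes with no differences.

    Assumes a flat prior on G = 0..G_max and likelihood exp(-2*M*G).
    """
    weights = [math.exp(-2.0 * M * G) for G in range(G_max + 1)]
    total = sum(weights)
    mode = max(range(G_max + 1), key=lambda G: weights[G])
    mean = sum(G * w for G, w in enumerate(weights)) / total
    return mode, mean

mode_a, mean_a = zero_asd_posterior_summary(M=0.05)
mode_b, mean_b = zero_asd_posterior_summary(M=0.10)

print(mode_a, mean_a)
print(mode_b, mean_b)
```

Doubling M roughly halves the mean while the mode stays at zero, so the distribution is strongly skewed even in this simplest case.
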
The many-individual case is more difficult. Can you spell out in more
detail (or give a link, if you've done so before) your argument for why
the mean of the ASD given G should depend so strongly on G, but the
variance not at all?
-- James.