FRDB Archives

Freethought & Rationalism Archive

The archives are read only.


Go Back   FRDB Archives > Archives > IIDB ARCHIVE: 200X-2003, PD 2007 > IIDB Philosophical Forums (PRIOR TO JUN-2003)
Welcome, Peter Kirby.
You last visited: Today at 05:55 AM

 
 
Thread Tools Search this Thread
Old 03-18-2002, 08:20 PM   #21
Regular Member
 
Join Date: Feb 2002
Location: Ginnungagap
Posts: 162
Post

Quote:
Originally posted by Scientiae:
<strong>Here's a good candidate question.

Suppose we have a polymer of N nucleotides.

What is the a priori probability that it contains the following sequence, AATTCGAATT, at least once?

We make the following simplifying assumptions:

1) A,C,T,G are i.i.d.
2) We make no assumptions about the biology of the molecule (i.e. possible modification of nucleotides, like methylation). I speak of ACTG in the sense of a code.
3) N &gt; 10

I would like:
1) An *exact* formula for P(sequence at least once | N)
2) An approximation *derived from the exact formula* as N becomes &gt;&gt; 10, and a threshold value of N for which the approximation is valid to 2 decimal places.

Should be a fairly trivial question for the properly trained, right?

EDIT: I will entertain pleas that this problem is too difficult.

SC

[ March 18, 2002: Message edited by: Scientiae ]</strong>

Am I missing something or is this just

P = [(N-L+1) * K ^ (N-L)]/[K^N]

= (N-L+1)* K^(-L)

Where N is the length of the string, L the length of the message (in your case 10) and K the number of possible values in any given spot (in your case 4).

The reasoning is thus: We are looking for at least one occurence of the message in the string. There are obviously N-L+1 possible positions for that message. For each position there are K^(N-L) possible "garbage" strings and there are K^N total strings of length N. Hence my formula.

In the limit N -&gt; infinity we get

P -&gt; [N * K^N * K^(-L)]/K^N = N * K^(-L)

using the assumption that N-L+1 -&gt; N as N -&gt; inf and L stay constant.

The two digit accuracy threshold is somewhere around N = 800 using simple trial and error comparing the limit with the actual value.
Ragnarok is offline  
Old 03-18-2002, 08:49 PM   #22
Veteran Member
 
Join Date: Mar 2002
Location: anywhere
Posts: 1,976
Post

Quote:
Originally posted by MoCk:
The reasoning is thus: We are looking for at least one occurence of the message in the string. There are obviously N-L+1 possible positions for that message. For each position there are K^(N-L) possible "garbage" strings and there are K^N total strings of length N. Hence my formula.
Mock,

2 observations:

1) In the K^(N-L) garbage strings, there is a non-zero probability that there is a recurrence of the sequence L (e.g. if N &gt; 2L, then you can easily have the sequence L-L). Consequently, you are possibly counting any sequence more than once.

2) Notice the sequence is AATTCGAATT. You are counting garbage strings outside of this sequence (i.e. N-L). However, consider the following sequence: AATTCGAATTCGAATT. N = 16, but the sequence has occurred twice.

Scientiae

EDIT: your analysis is pretty standard of how molecular biologists go about estimating the distribution of, say, a restriction site in a piece of DNA. However, I think most of them have not considered why their estimates are valid.

[ March 18, 2002: Message edited by: Scientiae ]</p>
Principia is offline  
Old 03-18-2002, 09:04 PM   #23
Regular Member
 
Join Date: Feb 2002
Location: Ginnungagap
Posts: 162
Post

Quote:
Originally posted by Scientiae:
<strong>

Mock,

2 observations:

1) In the K^(N-L) garbage strings, there is a non-zero probability that there is a recurrence of the sequence L (e.g. if N &gt; 2L, then you can easily have the sequence L-L). Consequently, you are possibly counting any sequence more than once.

[ March 18, 2002: Message edited by: Scientiae ]</strong>
Hmm... yes, I see your point. I suspected this was more complex than my quick analysis suggested. Obviously the numerator is somewhat smaller than I proposed due to repeating groups. Will need to reconsider repeating messages and recompute.

[ March 18, 2002: Message edited by: MoCk ]</p>
Ragnarok is offline  
Old 03-19-2002, 03:03 AM   #24
Junior Member
 
Join Date: Mar 2002
Location: Glenbrook
Posts: 6
Post

Scientiae,

Since I know nothing of molecular biology, I am treating your problem from a purely mathematical viewpoint.

On a very brief consideration I think I can give you upper and lower limits for the probability.

Consider 1: there are 4^10 possible combinations assuming order is vital [ie AAAAAAAAAG is considered a different nucleotide from GAAAAAAAAA]

2: the chance that any single randomly generated nucleotide is not ours is 1-[1/[4^10]]

3: the chance that in N selections our nucleotide does not occur at least once =
1 - [1-[1/[4^10]]^N

The trick is thus to correctly calculate out of a string Q long how many strings can possibly hold our nucleotide.

The two extreme points are Q/10 and Q - 9

Q/10 assumes that each string is an indivisible entity. This is obviously an underestimate since it ignores the fact that our string might commence on the 2nd character of the Q main string

Q - 9 assumes that any sub string could contain our string. This is obviously an over estimate since if the first string is GGGGGGGGGG then none of the first 10 strings can be our string

Hence p &gt; 1-[1-[1/[4^10]]]^[Q/10]

and p &lt; 1-[1-[1/[4^10]]]^[Q-9]

As you would expect when Q =10 then the two expression are identical and = 1/[4^10]

Gerald
gerald_the_okapi is offline  
Old 03-19-2002, 10:45 AM   #25
Veteran Member
 
Join Date: Mar 2002
Location: anywhere
Posts: 1,976
Post

Quote:
Originally posted by gerald_the_okapi:
Hence p &gt; 1-[1-[1/[4^10]]]^[Q/10]

and p &lt; 1-[1-[1/[4^10]]]^[Q-9]

As you would expect when Q =10 then the two expression are identical and = 1/[4^10]
Gerald,

I agree with these boundaries. It would be more interesting to extend your conclusion and see if the limits converge to the one predicted by Mock.

In fact, (1 - x)^n for n &gt;&gt; 1 is approximately 1 - nx, and thus p -&gt; Q/4^10, which is intuitively correct. However, the limits do not help us to determine at what point the probabilities are accurate to with a tolerance of, say 1%. This is an important problem, because I have witnessed molecular biologists estimate the probability of, say, a 10bp restriction site occurring in a 30 bp sequence.

As a matter of strategy, I would stick with Mock's approach to calculate the exact formula, since it would be easier to count how many polymers contain the sequence than not.

Problem still unresolved,
Scientiae
Principia is offline  
Old 03-19-2002, 02:58 PM   #26
Regular Member
 
Join Date: Feb 2002
Location: Ginnungagap
Posts: 162
Post

Quote:
Originally posted by Scientiae:
<strong>

Gerald,

I agree with these boundaries. It would be more interesting to extend your conclusion and see if the limits converge to the one predicted by Mock.


Problem still unresolved,
Scientiae</strong>
Actually I think Gerald had the right approach. This is the development I came up with today.

Definitions

K: The number of possible values of a character in the string

L: The length of the string being sought

S: The sought string itself

A: A random string I will use also of length L

N: The total length of the larger string being searched.

Begin by picking a "seed" string A of length L made up of random values.
The only constraint is that it may not be S. The likelihood of it not
being S is 1 - 1/(K^L), call this p_seed. We will now pick, at random,
another character and append it to the end of A. We wish to avoid the
possibility that this results in the construction of the string S. The
only way this construction could happen, though, is if the last L-1
characters of A were the same as the first L-1 characters of S and if we
chose the very last character in S as our random character. This gives us
two independent probabilities to evaluate which together determine the
probability that we will construct S.

There are obviously K possible values of A that have their last L-1
characters the same as the first L-1 characters of S. There are exactly
K^L - 1 allowed values of A (remember we disallow S itself). So the
probability of choosing one of these values of A is K/(K^L - 1). The
probability of randomly choosing the last character of S to add to our
string is simply 1/K so the combined probability of constructing S on this
iteration is 1/(K^L - 1) - (multiplying and cancelling the K's). This
means the probability of not constructing S on this iteration is 1 -
1/(K^L - 1), call this p_1.

We can continue to proceed in this fashion right up to the end of the
larger string N. We will iterate this process exactly N-L times
(remembering that we start adding characters to very end of the seed
string A). The combined probability of not ever constructing S in this way in a
string of length N is therefore

p = p_seed * p_1 * p_2 * ... p_(N-L)
= [1 - 1/K^L] * [1 - 1/(K^L - 1)]^(N-L)

The probability of ever encountering S in N (once, twice or as many times
as you like) is simply 1 - p.

The binomial expansion should provide your approximation to as close as you want to come to the real value.
Ragnarok is offline  
Old 03-19-2002, 03:33 PM   #27
Veteran Member
 
Join Date: Mar 2002
Location: anywhere
Posts: 1,976
Post

Nice work, Mock. <img src="graemlins/notworthy.gif" border="0" alt="[Not Worthy]" />

SC
Principia is offline  
Old 03-20-2002, 12:38 AM   #28
Junior Member
 
Join Date: Mar 2002
Location: Glenbrook
Posts: 6
Post

Mock

Excellent.

I love it when an inductive approach cracks a directly irreducible problem.

Gerald
gerald_the_okapi is offline  
 

Thread Tools Search this Thread
Search this Thread:

Advanced Search

Forum Jump


All times are GMT -8. The time now is 05:06 PM.

Top

This custom BB emulates vBulletin® Version 3.8.2
Copyright ©2000 - 2015, Jelsoft Enterprises Ltd.