Freethought & Rationalism Archive

Ragnarok · 03-18-2002, 08:20 PM

Quote:

Originally posted by Scientiae:
Here's a good candidate question.

Suppose we have a polymer of N nucleotides.

What is the a priori probability that it contains the following sequence, AATTCGAATT, at least once?

We make the following simplifying assumptions:

1) A,C,T,G are i.i.d.
2) We make no assumptions about the biology of the molecule (i.e. possible modification of nucleotides, like methylation). I speak of ACTG in the sense of a code.
3) N > 10

I would like:
1) An *exact* formula for P(sequence at least once | N)
2) An approximation *derived from the exact formula* as N becomes >> 10, and a threshold value of N for which the approximation is valid to 2 decimal places.

Should be a fairly trivial question for the properly trained, right?

EDIT: I will entertain pleas that this problem is too difficult.

SC

[ March 18, 2002: Message edited by: Scientiae ]

Am I missing something or is this just

P = [(N-L+1) * K ^ (N-L)]/[K^N]

= (N-L+1)* K^(-L)

Where N is the length of the string, L the length of the message (in your case 10) and K the number of possible values in any given spot (in your case 4).

The reasoning is thus: We are looking for at least one occurence of the message in the string. There are obviously N-L+1 possible positions for that message. For each position there are K^(N-L) possible "garbage" strings and there are K^N total strings of length N. Hence my formula.

In the limit N -> infinity we get

P -> [N * K^N * K^(-L)]/K^N = N * K^(-L)

using the assumption that N-L+1 -> N as N -> inf and L stay constant.

The two digit accuracy threshold is somewhere around N = 800 using simple trial and error comparing the limit with the actual value.

Principia · 03-18-2002, 08:49 PM

Quote:

Originally posted by MoCk:
The reasoning is thus: We are looking for at least one occurence of the message in the string. There are obviously N-L+1 possible positions for that message. For each position there are K^(N-L) possible "garbage" strings and there are K^N total strings of length N. Hence my formula.

Mock,

2 observations:

1) In the K^(N-L) garbage strings, there is a non-zero probability that there is a recurrence of the sequence L (e.g. if N > 2L, then you can easily have the sequence L-L). Consequently, you are possibly counting any sequence more than once.

2) Notice the sequence is AATTCGAATT. You are counting garbage strings outside of this sequence (i.e. N-L). However, consider the following sequence: AATTCGAATTCGAATT. N = 16, but the sequence has occurred twice.

Scientiae

EDIT: your analysis is pretty standard of how molecular biologists go about estimating the distribution of, say, a restriction site in a piece of DNA. However, I think most of them have not considered why their estimates are valid.

[ March 18, 2002: Message edited by: Scientiae ]

Ragnarok · 03-18-2002, 09:04 PM

Quote:

Originally posted by Scientiae:


Mock,

2 observations:

1) In the K^(N-L) garbage strings, there is a non-zero probability that there is a recurrence of the sequence L (e.g. if N > 2L, then you can easily have the sequence L-L). Consequently, you are possibly counting any sequence more than once.

[ March 18, 2002: Message edited by: Scientiae ]

Hmm... yes, I see your point. I suspected this was more complex than my quick analysis suggested. Obviously the numerator is somewhat smaller than I proposed due to repeating groups. Will need to reconsider repeating messages and recompute.

[ March 18, 2002: Message edited by: MoCk ]

gerald_the_okapi · 03-19-2002, 03:03 AM

Scientiae,

Since I know nothing of molecular biology, I am treating your problem from a purely mathematical viewpoint.

On a very brief consideration I think I can give you upper and lower limits for the probability.

Consider 1: there are 4^10 possible combinations assuming order is vital [ie AAAAAAAAAG is considered a different nucleotide from GAAAAAAAAA]

2: the chance that any single randomly generated nucleotide is not ours is 1-[1/[4^10]]

3: the chance that in N selections our nucleotide does not occur at least once =
1 - [1-[1/[4^10]]^N

The trick is thus to correctly calculate out of a string Q long how many strings can possibly hold our nucleotide.

The two extreme points are Q/10 and Q - 9

Q/10 assumes that each string is an indivisible entity. This is obviously an underestimate since it ignores the fact that our string might commence on the 2nd character of the Q main string

Q - 9 assumes that any sub string could contain our string. This is obviously an over estimate since if the first string is GGGGGGGGGG then none of the first 10 strings can be our string

Hence p > 1-[1-[1/[4^10]]]^[Q/10]

and p < 1-[1-[1/[4^10]]]^[Q-9]

As you would expect when Q =10 then the two expression are identical and = 1/[4^10]

Gerald

Principia · 03-19-2002, 10:45 AM

Quote:

Originally posted by gerald_the_okapi:
Hence p > 1-[1-[1/[4^10]]]^[Q/10]

and p < 1-[1-[1/[4^10]]]^[Q-9]

As you would expect when Q =10 then the two expression are identical and = 1/[4^10]

Gerald,

I agree with these boundaries. It would be more interesting to extend your conclusion and see if the limits converge to the one predicted by Mock.

In fact, (1 - x)^n for n >> 1 is approximately 1 - nx, and thus p -> Q/4^10, which is intuitively correct. However, the limits do not help us to determine at what point the probabilities are accurate to with a tolerance of, say 1%. This is an important problem, because I have witnessed molecular biologists estimate the probability of, say, a 10bp restriction site occurring in a 30 bp sequence.

As a matter of strategy, I would stick with Mock's approach to calculate the exact formula, since it would be easier to count how many polymers contain the sequence than not.

Problem still unresolved,
Scientiae

Ragnarok · 03-19-2002, 02:58 PM

Quote:

Originally posted by Scientiae:


Gerald,

I agree with these boundaries. It would be more interesting to extend your conclusion and see if the limits converge to the one predicted by Mock.

Problem still unresolved,
Scientiae

Actually I think Gerald had the right approach. This is the development I came up with today.

Definitions

K: The number of possible values of a character in the string

L: The length of the string being sought

S: The sought string itself

A: A random string I will use also of length L

N: The total length of the larger string being searched.

Begin by picking a "seed" string A of length L made up of random values.
The only constraint is that it may not be S. The likelihood of it not
being S is 1 - 1/(K^L), call this p_seed. We will now pick, at random,
another character and append it to the end of A. We wish to avoid the
possibility that this results in the construction of the string S. The
only way this construction could happen, though, is if the last L-1
characters of A were the same as the first L-1 characters of S and if we
chose the very last character in S as our random character. This gives us
two independent probabilities to evaluate which together determine the
probability that we will construct S.

There are obviously K possible values of A that have their last L-1
characters the same as the first L-1 characters of S. There are exactly
K^L - 1 allowed values of A (remember we disallow S itself). So the
probability of choosing one of these values of A is K/(K^L - 1). The
probability of randomly choosing the last character of S to add to our
string is simply 1/K so the combined probability of constructing S on this
iteration is 1/(K^L - 1) - (multiplying and cancelling the K's). This
means the probability of not constructing S on this iteration is 1 -
1/(K^L - 1), call this p_1.

We can continue to proceed in this fashion right up to the end of the
larger string N. We will iterate this process exactly N-L times
(remembering that we start adding characters to very end of the seed
string A). The combined probability of not ever constructing S in this way in a
string of length N is therefore

p = p_seed * p_1 * p_2 * ... p_(N-L)
= [1 - 1/K^L] * [1 - 1/(K^L - 1)]^(N-L)

The probability of ever encountering S in N (once, twice or as many times
as you like) is simply 1 - p.

The binomial expansion should provide your approximation to as close as you want to come to the real value.

Principia · 03-19-2002, 03:33 PM

Nice work, Mock. <img src="graemlins/notworthy.gif" border="0" alt="[Not Worthy]" />

SC

gerald_the_okapi · 03-20-2002, 12:38 AM

Mock

Excellent.

I love it when an inductive approach cracks a directly irreducible problem.

Gerald

Thread Tools	Search this Thread
Show Printable Version	Search this Thread: Advanced Search

03-19-2002, 03:03 AM	#24
gerald_the_okapi Junior Member Join Date: Mar 2002 Location: Glenbrook Posts: 6	Scientiae, Since I know nothing of molecular biology, I am treating your problem from a purely mathematical viewpoint. On a very brief consideration I think I can give you upper and lower limits for the probability. Consider 1: there are 4^10 possible combinations assuming order is vital [ie AAAAAAAAAG is considered a different nucleotide from GAAAAAAAAA] 2: the chance that any single randomly generated nucleotide is not ours is 1-[1/[4^10]] 3: the chance that in N selections our nucleotide does not occur at least once = 1 - [1-[1/[4^10]]^N The trick is thus to correctly calculate out of a string Q long how many strings can possibly hold our nucleotide. The two extreme points are Q/10 and Q - 9 Q/10 assumes that each string is an indivisible entity. This is obviously an underestimate since it ignores the fact that our string might commence on the 2nd character of the Q main string Q - 9 assumes that any sub string could contain our string. This is obviously an over estimate since if the first string is GGGGGGGGGG then none of the first 10 strings can be our string Hence p > 1-[1-[1/[4^10]]]^[Q/10] and p < 1-[1-[1/[4^10]]]^[Q-9] As you would expect when Q =10 then the two expression are identical and = 1/[4^10] Gerald

03-19-2002, 03:33 PM	#27
Principia Veteran Member Join Date: Mar 2002 Location: anywhere Posts: 1,976	Nice work, Mock. <img src="graemlins/notworthy.gif" border="0" alt="[Not Worthy]" /> SC

03-20-2002, 12:38 AM	#28
gerald_the_okapi Junior Member Join Date: Mar 2002 Location: Glenbrook Posts: 6	Mock Excellent. I love it when an inductive approach cracks a directly irreducible problem. Gerald

Freethought & Rationalism Archive

The archives are read only.