Freethought & Rationalism ArchiveThe archives are read only. |
03-18-2002, 08:20 PM | #21 | |
Regular Member
Join Date: Feb 2002
Location: Ginnungagap
Posts: 162
|
Quote:
Am I missing something or is this just P = [(N-L+1) * K ^ (N-L)]/[K^N] = (N-L+1)* K^(-L) Where N is the length of the string, L the length of the message (in your case 10) and K the number of possible values in any given spot (in your case 4). The reasoning is thus: We are looking for at least one occurence of the message in the string. There are obviously N-L+1 possible positions for that message. For each position there are K^(N-L) possible "garbage" strings and there are K^N total strings of length N. Hence my formula. In the limit N -> infinity we get P -> [N * K^N * K^(-L)]/K^N = N * K^(-L) using the assumption that N-L+1 -> N as N -> inf and L stay constant. The two digit accuracy threshold is somewhere around N = 800 using simple trial and error comparing the limit with the actual value. |
|
03-18-2002, 08:49 PM | #22 | |
Veteran Member
Join Date: Mar 2002
Location: anywhere
Posts: 1,976
|
Quote:
2 observations: 1) In the K^(N-L) garbage strings, there is a non-zero probability that there is a recurrence of the sequence L (e.g. if N > 2L, then you can easily have the sequence L-L). Consequently, you are possibly counting any sequence more than once. 2) Notice the sequence is AATTCGAATT. You are counting garbage strings outside of this sequence (i.e. N-L). However, consider the following sequence: AATTCGAATTCGAATT. N = 16, but the sequence has occurred twice. Scientiae EDIT: your analysis is pretty standard of how molecular biologists go about estimating the distribution of, say, a restriction site in a piece of DNA. However, I think most of them have not considered why their estimates are valid. [ March 18, 2002: Message edited by: Scientiae ]</p> |
|
03-18-2002, 09:04 PM | #23 | |
Regular Member
Join Date: Feb 2002
Location: Ginnungagap
Posts: 162
|
Quote:
[ March 18, 2002: Message edited by: MoCk ]</p> |
|
03-19-2002, 03:03 AM | #24 |
Junior Member
Join Date: Mar 2002
Location: Glenbrook
Posts: 6
|
Scientiae,
Since I know nothing of molecular biology, I am treating your problem from a purely mathematical viewpoint. On a very brief consideration I think I can give you upper and lower limits for the probability. Consider 1: there are 4^10 possible combinations assuming order is vital [ie AAAAAAAAAG is considered a different nucleotide from GAAAAAAAAA] 2: the chance that any single randomly generated nucleotide is not ours is 1-[1/[4^10]] 3: the chance that in N selections our nucleotide does not occur at least once = 1 - [1-[1/[4^10]]^N The trick is thus to correctly calculate out of a string Q long how many strings can possibly hold our nucleotide. The two extreme points are Q/10 and Q - 9 Q/10 assumes that each string is an indivisible entity. This is obviously an underestimate since it ignores the fact that our string might commence on the 2nd character of the Q main string Q - 9 assumes that any sub string could contain our string. This is obviously an over estimate since if the first string is GGGGGGGGGG then none of the first 10 strings can be our string Hence p > 1-[1-[1/[4^10]]]^[Q/10] and p < 1-[1-[1/[4^10]]]^[Q-9] As you would expect when Q =10 then the two expression are identical and = 1/[4^10] Gerald |
03-19-2002, 10:45 AM | #25 | |
Veteran Member
Join Date: Mar 2002
Location: anywhere
Posts: 1,976
|
Quote:
I agree with these boundaries. It would be more interesting to extend your conclusion and see if the limits converge to the one predicted by Mock. In fact, (1 - x)^n for n >> 1 is approximately 1 - nx, and thus p -> Q/4^10, which is intuitively correct. However, the limits do not help us to determine at what point the probabilities are accurate to with a tolerance of, say 1%. This is an important problem, because I have witnessed molecular biologists estimate the probability of, say, a 10bp restriction site occurring in a 30 bp sequence. As a matter of strategy, I would stick with Mock's approach to calculate the exact formula, since it would be easier to count how many polymers contain the sequence than not. Problem still unresolved, Scientiae |
|
03-19-2002, 02:58 PM | #26 | |
Regular Member
Join Date: Feb 2002
Location: Ginnungagap
Posts: 162
|
Quote:
Definitions K: The number of possible values of a character in the string L: The length of the string being sought S: The sought string itself A: A random string I will use also of length L N: The total length of the larger string being searched. Begin by picking a "seed" string A of length L made up of random values. The only constraint is that it may not be S. The likelihood of it not being S is 1 - 1/(K^L), call this p_seed. We will now pick, at random, another character and append it to the end of A. We wish to avoid the possibility that this results in the construction of the string S. The only way this construction could happen, though, is if the last L-1 characters of A were the same as the first L-1 characters of S and if we chose the very last character in S as our random character. This gives us two independent probabilities to evaluate which together determine the probability that we will construct S. There are obviously K possible values of A that have their last L-1 characters the same as the first L-1 characters of S. There are exactly K^L - 1 allowed values of A (remember we disallow S itself). So the probability of choosing one of these values of A is K/(K^L - 1). The probability of randomly choosing the last character of S to add to our string is simply 1/K so the combined probability of constructing S on this iteration is 1/(K^L - 1) - (multiplying and cancelling the K's). This means the probability of not constructing S on this iteration is 1 - 1/(K^L - 1), call this p_1. We can continue to proceed in this fashion right up to the end of the larger string N. We will iterate this process exactly N-L times (remembering that we start adding characters to very end of the seed string A). The combined probability of not ever constructing S in this way in a string of length N is therefore p = p_seed * p_1 * p_2 * ... p_(N-L) = [1 - 1/K^L] * [1 - 1/(K^L - 1)]^(N-L) The probability of ever encountering S in N (once, twice or as many times as you like) is simply 1 - p. The binomial expansion should provide your approximation to as close as you want to come to the real value. |
|
03-19-2002, 03:33 PM | #27 |
Veteran Member
Join Date: Mar 2002
Location: anywhere
Posts: 1,976
|
Nice work, Mock. <img src="graemlins/notworthy.gif" border="0" alt="[Not Worthy]" />
SC |
03-20-2002, 12:38 AM | #28 |
Junior Member
Join Date: Mar 2002
Location: Glenbrook
Posts: 6
|
Mock
Excellent. I love it when an inductive approach cracks a directly irreducible problem. Gerald |
Thread Tools | Search this Thread |
|