Freethought & Rationalism Archive. The archives are read only. |
03-26-2003, 04:05 PM | #1 | |||
Veteran Member
Join Date: Jan 2001
Location: USA
Posts: 1,072
|
SIPF dipeptides too close of a fit to nature to be coincidental?
1999 Rode paper vastly overestimates link between SIPF and primitive proteins
DNAunion: For years now I just could not get over the statistical significance of the correlation between the dipeptides formed by the SIPF [salt-induced peptide formation] reaction and those found in ‘primitive’ proteins as reported by Bernd Michael Rode in his 1999 article “Peptides and the Origin of Life”. If he was correct, then the probability that the SIPF would preferentially produce the same dipeptides found in ‘primitive’ proteins, by chance alone, was 1 in 10^18. Who would argue against those odds by claiming that the SIPF reaction was not involved? I have invested the time needed to look at his claim and find it erroneous. Before explaining in detail why his calculations are wrong, I will present his case from the article. Quote:
1) The SIPF [salt-induced peptide formation] reaction tends to favor production of certain amino acid pairings (dipeptides) over others. By restricting the number of dipeptides that form frequently, the SIPF also limits the number of longer sequences that could form from the joining of those shorter sequences. In a prebiotic context, this could be seen as a blessing in that a search through all long sequences would be, for all practical purposes, impossible. Since a great many sequences would tend not to form, saturation of a restricted search space might occur, finding all functional proteins that exist within it. The problem is, if the proteins needed to kickstart life are not within that restricted sequence space, then the SIPF would actually hinder the origin of life by leading chemistry away from where it needs to go. That brings us to point 2.

2) The amino-acid pairings (dipeptides) that are preferentially formed by the SIPF reaction match up extremely closely to those in some of the earliest proteins. In fact, by chance alone, the likelihood of the match being as tight as it is is only about one chance in a million trillion.

A key point here is that the smaller the probability of the correspondence between SIPF and primitive proteins arising by chance, the more likely it is that the SIPF was involved in the creation of those primitive proteins. For example, if the probability of the match between the SIPF dipeptides and those in primitive proteins were only, say, 1 in 5, then there would be little to no statistical significance to the match: chance alone would be an adequate explanation. But with a probability of only 1 in 10^18, chance alone can hardly be relied upon as being the best explanation: there must be some connection between the two. Again, crucial here is how small or large the probability is. Unfortunately for the argument, it is nowhere near 1 in 10^18.
Before proceeding, perhaps we should answer the obvious question, “How did Rode arrive at his figure of 1 in 10^18?” Clues can be found in the following repeated material. Quote:
Quote:
So to begin with, Rode’s comparison is incomplete. What if other amino acids occur more frequently in the ‘primitive’ dipeptides than the nine he looked at? Apparently they are ignored, and the nine of interest are bumped up, possibly moving from outside of the top four to being within it. If so, they would be counted as hits even though they were actually too far down in line originally to be counted as such. For the rest of the discussion, this potential flaw will be overlooked.

Now, since the comparison is made to “the four most frequently occurring” amino acids joined to a given one, a match will exist for four out of the nine possible ‘primitive’ amino acids. Thus, the probability for a single coincidence between a ‘primitive’ amino acid and one of the four SIPF ones is 4/9.

This logic can be extended to see how Rode arrived at his other probabilities. For two coincidences, there is a 4/9 chance of a match with the first ‘primitive’ amino acid, and then a 3/8 chance for the second (we have to assume the first one matches before we can calculate the probability for the second match; with one ‘primitive’ amino acid already matched up, that leaves eight remaining, of which three will match SIPF ones). Thus we have

P(2 matches) = 4/9 * 3/8 = 1/6

For 3 coincidences, we assume the first ‘primitive’ amino acid matches one of the four target SIPF amino acids, and that the second does too. That leaves us with seven remaining ‘primitive’ amino acids, of which two will match SIPF ones. Therefore, all together,

P(3 matches) = 4/9 * 3/8 * 2/7 = 1/21

Extending this just one more time gives us

P(4 matches) = 4/9 * 3/8 * 2/7 * 1/6 = 1/126

To find the overall probability of correspondence between SIPF dipeptides and ‘primitive’ ones, Rode apparently takes these individual probabilities and looks at how many times each occurs, using that value as an exponent.
For example, he lists two single matches in table 8a, so the combined probability is 4/9 * 4/9, or simply, (4/9)^2. When all single, double, triple, and quadruple matches are taken into account, the overall probability of correspondence comes to 6.438 in 10^18, which he rounds down to 1 in 10^18.

RODE’S BASIC PROBABILITIES ARE WRONG

So far, I have managed only to confirm Rode’s probability. But there is a problem in his fundamental calculations: his probability for a single match is wrong, as is his probability for a double match, as is his probability for a triple match, etc.

Rode takes into account only enough trials to cover the number of coincidences. For example, for a single coincidence, Rode considers only a single trial. Sure, if you are only going to get one shot at an event with a probability of 4/9, then of course your chance of success is 4/9. But that is not the case here. There are four chances to get a single match. For example, one of his single matches is for the amino acid Ala, in which the archaebacteria have joined to it Ala, Glu, Val, and Leu, and the SIPF has joined to it Ala, Pro, Gly, and His. So there were four attempts (Ala, Glu, Val, and Leu) at matching any of the four SIPF amino acids. That changes the probability of a single match dramatically: let’s take a look.

What we will do first is calculate the probability that none of the four ‘primitive’ amino acids match the SIPF ones, then from it calculate the opposite probability (that at least one would match).

‘Primitive’ Amino Acid 1: To start with, there are four SIPF targets and nine possible ‘primitive’ amino acids that could be compared to them. So the probability of ‘primitive’ aa-1 matching one of the four target SIPF amino acids is 4/9. Therefore, the probability of its not matching is 1 – 4/9 = 5/9.

‘Primitive’ Amino Acid 2: Since the probability of this aa is dependent upon the previous one, we have to assume that aa-1 did not match.
That leaves eight possible ‘primitive’ amino acids and still four target SIPF ones. So the probability of getting a match here is 4/8, which means the probability of not getting a match is 1 – 4/8 = 4/8 = 1/2.

‘Primitive’ Amino Acid 3: We have to assume that the previous attempt failed to match, leaving seven ‘primitive’ amino acids and still four target SIPF ones. So the probability of matching here is 4/7, meaning that the probability of a non-match is 1 – 4/7 = 3/7.

‘Primitive’ Amino Acid 4: Since the last one failed to match also, we are left with six ‘primitive’ amino acids and still have four target SIPF ones. So the probability of a match on this final step is 4/6, which means the probability of a non-match is 1 – 4/6 = 2/6 = 1/3.

To figure out the overall probability -- that is, the probability of not getting any matches in four attempts -- we just multiply the four individual probabilities for non-matches.

P(no matches) = 5/9 * 1/2 * 3/7 * 1/3 = 5/126

And, looking at the opposite case…

P(at least one match) = 1 – P(no matches) = 1 – 5/126 = 121/126 = 96.03%

So whereas Rode tells us the probability of a single match is a mere 44.44%, we see that it is actually more than twice that: 96.03%. The same calculations show his other probabilities to be vastly underestimated: two matches are much more likely to occur than Rode leads us to believe, as are triple and quadruple matches.

In addition, Rode does not take into account that in either of the eighteen instances, five attempts to match are used since there are five ‘primitive’ amino acids listed. This raises the probability that those rows will have a match even higher.

COMPUTER MODEL MORE ACCURATE THAN RODE’S

After spotting the error in Rode’s methodology, I modeled the comparison in silico.
Unlike Rode’s simple (and flawed) calculations, my model took into account a couple of additional factors (beyond how many coincidences each row had) for each entry in the table: how many ‘primitive’ amino acids are listed (how many attempts at making a match are used), and how many target SIPF amino acids are listed. The program (see code at end) performed one million iterations for each table entry to calculate an empirical probability (the law of large numbers indicates the empirical probability should be close to the theoretical probability). Then, the individual values (which were now vastly more accurate than those Rode used) were multiplied together as in the Rode method to arrive at a final overall probability.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Was the final value close to Rode’s 1 in 10^18? Nope, not at all. In fact, the probability of correspondence between the ‘primitive’ and SIPF dipeptides was many, many orders of magnitude greater than what Rode stated; that is, billions of times more likely to be due to chance. The computer model produced a probability of 2.916 in 10^7.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Computer code (Language = Visual FoxPro 6.0)

*calculate_empirical_probability.prg
******************************************************
* This program models choosing lettered tiles from an urn in order
* to calculate an empirical probability.
******************************************************
CLEAR

lcDiscardOrReplaceTilesOnceChosen = "DISCARD"
lnNumberOfLetteredTiles = 9
lnNumberOfTargetTiles = 4
lnMatchesNeededForSuccess = 1
lnTrialsPerIteration = 4
lnIterations = 1000000
lnSuccessfulIterations = 0

* Initially seed pseudo-random number generator using the system clock
=RAND(-1)

* Column 1 = LETTER (a unique symbol on a tile that gets placed into the urn)
*
* Column 2 = CHOSEN (has the letter/tile already been chosen from the urn?
*                    if so, it may be discarded and not available any more, or
*                    may be used again, depending upon the value of the variable
*                    lcDiscardOrReplaceTilesOnceChosen)
*
* Column 3 = TARGET (is this tile one of the target letters?)
LOCAL ARRAY aUrn[lnNumberOfLetteredTiles, 3]
FOR lnIndex = 1 TO lnNumberOfLetteredTiles
    lcLetter = CHR(64 + lnIndex)
    aUrn[lnIndex, 1] = lcLetter
    aUrn[lnIndex, 2] = "F"
    aUrn[lnIndex, 3] = "F"
ENDFOR

* Choose x number of different targets from the urn
FOR lnLooper = 1 TO lnNumberOfTargetTiles
    DO WHILE .T.
        lnIndex = GetRandomNumber(1, lnNumberOfLetteredTiles)
        IF (aUrn[lnIndex, 3] = "F")
            aUrn[lnIndex, 3] = "T"
            EXIT
        ENDIF
    ENDDO
ENDFOR

* Should have at least 100 iterations in order to figure out percent of
* success. Much larger numbers give more accurate results.
FOR lnIteration = 1 TO lnIterations
    IF (lnIteration % 9999 = 0)
        WAIT WINDOW NOWAIT "Iteration " + ;
            STR(lnIteration) + " of " + STR(lnIterations)
    ENDIF

    * New iteration - clear all CHOSEN flags
    FOR lnIndex = 1 TO lnNumberOfLetteredTiles
        aUrn[lnIndex, 2] = "F"
    ENDFOR
    lnMatches = 0

    * Select multiple tiles to try to match a target
    FOR lnTrial = 1 TO lnTrialsPerIteration
        * First, randomly choose a single lettered tile from the urn
        DO WHILE .T.
            lnIndex = GetRandomNumber(1, lnNumberOfLetteredTiles)
            DO CASE
                CASE lcDiscardOrReplaceTilesOnceChosen == "REPLACE"
                    * Whether tile #x has been chosen before or
                    * not does not matter: it is in the urn now,
                    * available to be chosen
                    EXIT
                CASE lcDiscardOrReplaceTilesOnceChosen == "DISCARD"
                    * A tile that has been chosen is discarded after use, and so
                    * cannot be chosen a second time. Need to check the value
                    * of this tile's CHOSEN column.
                    DO CASE
                        CASE aUrn[lnIndex, 2] = "T"
                            * This tile has already been chosen -
                            * it can't be used again. Allow the
                            * program to loop to try choosing
                            * a different tile
                        CASE aUrn[lnIndex, 2] = "F"
                            * This tile has not been chosen before -
                            * okay to choose it
                            EXIT
                        OTHERWISE
                            WAIT WINDOW "Invalid value of " + aUrn[lnIndex, 2] + ;
                                " for aUrn[" + ALLTRIM(STR(lnIndex)) + ", 2]"
                    ENDCASE
                OTHERWISE
                    WAIT WINDOW "Invalid value of " + lcDiscardOrReplaceTilesOnceChosen + ;
                        " for lcDiscardOrReplaceTilesOnceChosen"
            ENDCASE
        ENDDO

        * This tile has now been chosen - flag it as such
        aUrn[lnIndex, 2] = "T"

        * Does the chosen tile match one of the targets?
        DO CASE
            CASE aUrn[lnIndex, 3] == "F"
                * Does not match a target
            CASE aUrn[lnIndex, 3] == "T"
                * Does match one of the targets
                lnMatches = lnMatches + 1
                * No need to continue pulling tiles if we have
                * enough matches already
                IF (lnMatches >= lnMatchesNeededForSuccess)
                    EXIT
                ENDIF
            OTHERWISE
                WAIT WINDOW "Invalid value of " + aUrn[lnIndex, 3] + ;
                    " for aUrn[" + ALLTRIM(STR(lnIndex)) + ", 3]"
        ENDCASE
    ENDFOR

    * Did we get enough matches?
    IF (lnMatches >= lnMatchesNeededForSuccess)
        lnSuccessfulIterations = lnSuccessfulIterations + 1
    ENDIF
ENDFOR

WAIT CLEAR
? "Chosen tiles discarded or replaced: " + lcDiscardOrReplaceTilesOnceChosen
? "Number of lettered tiles: " + ALLTRIM(STR(lnNumberOfLetteredTiles))
? "Number of target tiles: " + ALLTRIM(STR(lnNumberOfTargetTiles))
? "Number of matches needed: " + ALLTRIM(STR(lnMatchesNeededForSuccess))
? "Trials per iteration: " + ALLTRIM(STR(lnTrialsPerIteration))
? "Total iterations: " + ALLTRIM(STR(lnIterations))
? "Successful iterations: " + ALLTRIM(STR(lnSuccessfulIterations))
? "Empirical probability: " + ALLTRIM(STR((lnSuccessfulIterations / lnIterations) * 100, 10, 4))

*************************
* ********************* *
* *     FUNCTIONS     * *
* ********************* *
*************************
FUNCTION GetRandomNumber(lnMin, lnMax)
    LOCAL lnRandomNumber
    * The pseudo-random number generator was already seeded with the system
    * clock - all calls after that initialization should not pass any value
    DO WHILE .T.
        * Scale RAND() directly onto lnMin..lnMax. This honors lnMin and
        * avoids the slight bias of taking a remainder of a 0..9999 value.
        * The range check below retries in case RAND() returns exactly 1.
        lnRandomNumber = lnMin + FLOOR(RAND() * (lnMax - lnMin + 1))
        IF (lnRandomNumber >= lnMin AND lnRandomNumber <= lnMax)
            EXIT
        ENDIF
    ENDDO
    RETURN lnRandomNumber
ENDFUNC
|
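As a cross-check on the hand calculation and the simulation, the same urn model can be evaluated exactly instead of sampled. The sketch below is a minimal C++ version under the same assumptions as the program above (tiles are discarded once drawn). The function names ProbNoMatch, ProbAtLeastOneMatch and ProbAllDrawsMatch are illustrative only and are not part of the original program.

```cpp
// exact_urn.cpp (hypothetical file name): exact counterpart of the urn model.
#include <cassert>

// Probability that `draws` tiles pulled WITHOUT replacement from an urn of
// `tiles` tiles contain no target, when `targets` of the tiles are targets.
// Each factor is (non-target tiles left) / (tiles left).
double ProbNoMatch(int tiles, int targets, int draws)
{
    assert(tiles > 0 && targets >= 0 && targets <= tiles);
    assert(draws >= 0 && draws <= tiles);
    double p = 1.0;
    for (int i = 0; i < draws; ++i) {
        const int missesLeft = tiles - targets - i;
        if (missesLeft <= 0) {
            return 0.0; // only targets remain, so a match is forced
        }
        p *= static_cast<double>(missesLeft) / (tiles - i);
    }
    return p;
}

// Complement of the above: at least one of the drawn tiles is a target.
double ProbAtLeastOneMatch(int tiles, int targets, int draws)
{
    return 1.0 - ProbNoMatch(tiles, targets, draws);
}

// The chained product described for Rode's figures: every one of `draws`
// tiles is a target. Each factor is (targets left) / (tiles left).
double ProbAllDrawsMatch(int tiles, int targets, int draws)
{
    assert(tiles > 0 && targets >= 0 && targets <= tiles);
    assert(draws >= 0 && draws <= tiles);
    double p = 1.0;
    for (int i = 0; i < draws; ++i) {
        const int targetsLeft = targets - i;
        if (targetsLeft <= 0) {
            return 0.0; // more draws than targets: cannot all match
        }
        p *= static_cast<double>(targetsLeft) / (tiles - i);
    }
    return p;
}
```

With these definitions, ProbAtLeastOneMatch(9, 4, 4) evaluates to 121/126 (about 0.9603), and ProbAllDrawsMatch(9, 4, k) for k = 1 to 4 reproduces 4/9, 1/6, 1/21 and 1/126.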
|||
03-26-2003, 09:33 PM | #2 |
Veteran Member
Join Date: Nov 2001
Location: NCSU
Posts: 5,853
|
DNAUnion,
On first note, "in silico" is not correct Latin. "In silice" is. (I point this out to everyone I see using it!) I'm going to have a look at Rode's paper and get back to you on the other points. |
03-27-2003, 11:27 AM | #4 |
Veteran Member
Join Date: Jun 2000
Posts: 1,302
|
Why don't you submit your critique as a letter to the journal?
|
03-27-2003, 01:08 PM | #5 | |||
Veteran Member
Join Date: Mar 2002
Location: anywhere
Posts: 1,976
|
Quote:
1) Whether it's 1e-7 or 1e-18, the actual magnitude does not matter so much as the statistical significance of the number. The test here is presumably a rejection of the null hypothesis of a uniform probability. But wait. Where is the statistical test? What's the p-value? Rode had a good reason not to provide one, since he thought his calculation of 1e-18 was small enuf to beat any statistical challenge. DNAunion however claims that 1e-7 is not significant enough. Quote:
Quote:
2) DNAunion, in his zeal to "correct" Rode's probability analysis, forgets to look at the overall picture. Did the evidence show a bias in dipeptide formation? Yes (or more exactly, this was not challenged). Did the evidence show a bias in dipeptide content of early organisms? Maybe (but once again, this was not challenged). Did the evidence show that SIPF produced a bias in dipeptide formation? Yes (but, this was not challenged). Was there a plausible mechanism for the SIPF bias? Yes (the CuII coordination hypothesis was especially intriguing). In light of the accomplishments published in the paper, is there sufficient reason to doubt the results on the basis of one errant probability analysis? No. Logically speaking, even if the bias seen in SIPF and in nature is not statistically significant, this alone does not rule out SIPF as a possible mechanism of generating prebiotic peptides.

3) As a matter of fact, there exist better probability studies than the one proposed by Rode. That is to say, there exist several flaws that are more significant than the ones that DNAunion picked up on, which of course still remained in DNAunion's analysis. First, certain linkages are counted twice in the analysis -- a fact which completely escaped DNAunion. For instance, Ala-Ala is counted in both the A-B and B-A linkages. Why didn't DNAunion notice something this obvious, especially when he used the Ala example? Second of all, and more importantly, the organization of the data is weak. The top four amino acids that preferentially link to a particular residue may in fact be relatively weak compared to other linkages. Yet both DNAunion's and Rode's analyses assume that only ranking matters. What is needed is a complete ordering of prevalence from all 81 possible dipeptides in archaebacteria, compared with a complete ordering of yields for SIPF. Then, perform a statistical test for the significance of each ordering against a suitable null hypothesis.
In this way, one avoids the faulty conclusion that A-B linkage preferences for one 'A' amino acid are independent of any linkage preferences for any other 'A' amino acid. This is what is tacitly assumed when one simply multiplies probabilities together (as did DNAunion in his "fixed" analysis).

4) A Monte Carlo analysis written in Visual FoxPro (!!) for a combinatorial analysis of 9 sequence elements? I think there's a more scientific way of skinning this cat.

There are other issues that I will bring up when I have time. |
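Point 3 asks for a comparison of two complete orderings but does not name a test. One possible ingredient, chosen here purely for illustration, is Spearman's rank correlation between the archaebacterial prevalence ranks and the SIPF yield ranks. The sketch below assumes both rankings cover the same items, use ranks 1..n, and contain no ties; SpearmanRho is a hypothetical helper name, and no real dipeptide data is implied.

```cpp
// spearman.cpp (hypothetical file name): rank agreement between two orderings.
#include <cassert>
#include <cstddef>
#include <vector>

// Spearman's rho for two tie-free rankings of the same n items:
//   rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),  d_i = rankA[i] - rankB[i]
// rho is 1 for identical orderings and -1 for exactly reversed ones.
double SpearmanRho(const std::vector<int>& rankA, const std::vector<int>& rankB)
{
    assert(rankA.size() == rankB.size() && rankA.size() >= 2);
    const double n = static_cast<double>(rankA.size());
    double sumSquaredDiff = 0.0;
    for (std::size_t i = 0; i < rankA.size(); ++i) {
        const double d = static_cast<double>(rankA[i] - rankB[i]);
        sumSquaredDiff += d * d;
    }
    return 1.0 - 6.0 * sumSquaredDiff / (n * (n * n - 1.0));
}
```

On its own, rho only measures agreement. Judging significance would still need a null distribution, for example by recomputing rho over many random shufflings of one ranking.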
|||
03-27-2003, 07:21 PM | #6 |
Veteran Member
Join Date: Jan 2001
Location: USA
Posts: 1,072
|
DNAunion: Principia, since you seem to be so good in math, I have a question for you.
I was tutoring someone tonight in college algebra and one of the problems we had was as follows:

|x^2 - 4| = x - 2

Here's what I did.

1) "Split" it into two equations to eliminate the absolute value sign:
a. x^2 - 4 = x - 2
b. x^2 - 4 = -(x - 2)

2) Recognizing x^2 - 4 as being the difference of two perfect squares, with one of the two factors being the same as the right side of the equation, I factored the left side.
a. (x + 2)(x - 2) = x - 2
b. (x + 2)(x - 2) = -(x - 2)

3) I then divided both sides of the equation by the common (x - 2).
a. [(x + 2)(x - 2)] / (x - 2) = (x - 2) / (x - 2)
b. [(x + 2)(x - 2)] / (x - 2) = -(x - 2) / (x - 2)

4) Reducing leads to:
a. x + 2 = 1
b. x + 2 = -1

5) Subtracting 2 from both sides of both equations to isolate the variable gives:
a. x = -1
b. x = -3

6) Finally, I checked my answers by plugging them back into the original equation (the one that has the absolute value in it). None of the "solutions" worked. This appears to indicate that there is no solution to the problem. However, x = 2 is a solution.

My question is, how can I follow perfectly valid rules of algebra at every step and yet fail to come up with the solution?

PS: I am not asking how to obtain the solution: I know that. I just don't get how I can do nothing illegal yet fail to get the solution. How does one get x = 2 following the method I used? |
03-27-2003, 10:29 PM | #7 |
Veteran Member
Join Date: Nov 2001
Location: NCSU
Posts: 5,853
|
Because when you divided you forgot to check if x-2 = 0. In other words your algebra is valid for every value of x except for when x=2.
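To show where the lost root goes, the two equations from step 2 can be solved by moving everything to one side and factoring out (x - 2) instead of dividing by it. This is a worked sketch of that alternative, not something posted in the thread:

```latex
\begin{align*}
\text{(a)}\quad (x+2)(x-2) - (x-2) &= 0
  &&\Longrightarrow\quad (x-2)(x+1) = 0
  &&\Longrightarrow\quad x = 2 \ \text{or}\ x = -1 \\
\text{(b)}\quad (x+2)(x-2) + (x-2) &= 0
  &&\Longrightarrow\quad (x-2)(x+3) = 0
  &&\Longrightarrow\quad x = 2 \ \text{or}\ x = -3
\end{align*}
```

Checking the candidates in the original equation: x = 2 gives |0| = 0, which holds; x = -1 gives 3 = -3 and x = -3 gives 5 = -5, which do not. So x = 2 is the only solution, and it is exactly the root that dividing by (x - 2) discards.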
|
03-29-2003, 09:33 AM | #8 | |
Veteran Member
Join Date: Jan 2001
Location: USA
Posts: 1,072
|
Quote:
|
|
03-29-2003, 09:56 AM | #9 |
Regular Member
Join Date: Jun 2000
Location: St. Louis, MO
Posts: 417
|
Don't feel bad
You actually stumbled across one of my favorite math tricks...
Proof that 2 = 1

Let
1) a = b
then multiply both sides by a, giving
2) a^2 = a*b
then subtract b^2 from both sides, giving
3) a^2 - b^2 = a*b - b^2
then factor both sides, giving
4) (a+b)(a-b) = b*(a-b)
then we can cancel (a-b) from both sides, giving
5) (a+b) = b
then substitute a for b (based on #1), giving
6) a+a = a
which simplifies to
7) 2*a = a
dividing both sides by a gives:
8) 2 = 1

Being very brusque with the phrase "we can cancel (a-b)", I've even left fellow Math majors scratching their heads over this. |
03-29-2003, 12:50 PM | #10 | |
Veteran Member
Join Date: Jan 2001
Location: USA
Posts: 1,072
|
Quote:
Perhaps it's just that you can't "speak" VFP: maybe you can only decipher a "real" programming language like C++. Well here, I took the time to recode it in that language. Now you can examine the code and point out my errors.

// calcprob.cpp
// This program models choosing lettered tiles from an urn in
// order to calculate an empirical probability
#include <iostream>
#include <stdlib.h> // needed for the rand function
#include <time.h>   // needed to get current time to seed rand function
using namespace std;

long GetRandomNumber(int nMin, int nMax);

int main()
{
    const int nDiscardTilesOnceChosen = 1;
    const int nLetteredTiles = 9;
    const int nTargetTiles = 4;
    const int nMatchesNeededForSuccess = 1;
    const int nTrialsPerIteration = 4;
    const long lIterations = 1000000;
    // Column positions within cUrn. C++ arrays are zero-based, so a
    // three-column array only has valid column indexes 0, 1 and 2.
    const int nLetterCol = 0;
    const int nChosenCol = 1;
    const int nTargetCol = 2;
    int nTrial = 0;
    int nMatches = 0;
    int nIndex = 0;
    int nLooper = 0;
    int nFoundOne = 0;
    long lIteration = 0;
    long lSuccessfulIterations = 0;
    char cLetter = ' ';
    char cUrn[nLetteredTiles][3];

    // Before doing anything else, initialize the random number generator
    // using the system clock
    srand((unsigned)time(NULL));

    // Initialize array (fill the urn with lettered tiles)
    // The columns of the multidimensional array break down as follows:
    // [nLetterCol] = LETTER: a unique symbol on a tile that gets placed into the urn.
    //                As the program currently stands, this value is not used.
    // [nChosenCol] = CHOSEN: has this letter/tile already been chosen from the urn?
    //                If so, it may have been discarded and so not available
    //                any more, or it may have been replaced and available to
    //                be selected again. Which occurs for an already selected
    //                tile depends upon the value of the const variable
    //                nDiscardTilesOnceChosen.
    // [nTargetCol] = TARGET: is this letter/tile one of the targets?
    for (nIndex = 0; nIndex < nLetteredTiles; nIndex++)
    {
        cLetter = (char)('A' + nIndex); // nIndex starts at 0, so the first tile is 'A'
        cUrn[nIndex][nLetterCol] = cLetter;
        cUrn[nIndex][nChosenCol] = 'F';
        cUrn[nIndex][nTargetCol] = 'F';
    }

    // Choose x number of tiles from the Urn to serve as targets
    for (nLooper = 1; nLooper <= nTargetTiles; nLooper++)
    {
        nFoundOne = 0;
        while (nFoundOne == 0)
        {
            nIndex = (int) GetRandomNumber(0, nLetteredTiles - 1);
            if (cUrn[nIndex][nTargetCol] == 'F')
            {
                cUrn[nIndex][nTargetCol] = 'T';
                nFoundOne = 1;
            }
        }
    }

    // Begin selecting tiles from the Urn.
    for (lIteration = 1; lIteration <= lIterations; lIteration++)
    {
        // New iteration: need to clear all CHOSEN flags
        for (nIndex = 0; nIndex < nLetteredTiles; nIndex++)
        {
            cUrn[nIndex][nChosenCol] = 'F';
        }
        nMatches = 0;

        // Give the user some output throughout the process (only every
        // 9999th iteration, as in the VFP version, to keep the run fast)
        if (lIteration % 9999 == 0)
        {
            cout << "Iteration " << lIteration << " of " << lIterations << endl;
        }

        for (nTrial = 1; nTrial <= nTrialsPerIteration; nTrial++)
        {
            // Pull a single tile out of the urn
            nFoundOne = 0;
            while (nFoundOne == 0)
            {
                nIndex = (int) GetRandomNumber(0, nLetteredTiles - 1);
                if (nDiscardTilesOnceChosen == 0)
                {
                    // Doesn't matter if the tile has been
                    // chosen previously because selected
                    // tiles are placed back into the urn.
                    nFoundOne = 1;
                }
                else if (cUrn[nIndex][nChosenCol] == 'T')
                {
                    // This tile has already been chosen and
                    // discarded: it can't be selected again.
                    nFoundOne = 0;
                }
                else if (cUrn[nIndex][nChosenCol] == 'F')
                {
                    // This tile has not been chosen previously.
                    nFoundOne = 1;
                }
            }

            // A tile has been chosen: flag it as such
            cUrn[nIndex][nChosenCol] = 'T';

            // Does the chosen tile match one of the targets?
            if (cUrn[nIndex][nTargetCol] == 'T')
            {
                nMatches += 1;
            }

            // No need to continue pulling tiles for this iteration
            // if we've obtained enough matches for success
            if (nMatches >= nMatchesNeededForSuccess)
            {
                nTrial = nTrialsPerIteration + 1;
            }
        }

        // Did we get enough matches for this iteration?
        if (nMatches >= nMatchesNeededForSuccess)
        {
            lSuccessfulIterations += 1;
        }
    }

    cout << "Tiles discarded after being chosen? ";
    cout << (nDiscardTilesOnceChosen == 1 ? "Yes" : "No") << endl;
    cout << "Number of lettered tiles in Urn: ";
    cout << nLetteredTiles << endl;
    cout << "Number of target tiles: ";
    cout << nTargetTiles << endl;
    cout << "Number of matches needed: ";
    cout << nMatchesNeededForSuccess << endl;
    cout << "Trials per iteration: ";
    cout << nTrialsPerIteration << endl;
    cout << "Total iterations: ";
    cout << lIterations << endl;
    cout << "Successful iterations: ";
    cout << lSuccessfulIterations << endl;
    cout << "Empirical probability: ";
    cout << ((float)lSuccessfulIterations / lIterations) * 100 << "%" << endl;
    return (0);
}

long GetRandomNumber(int nMin, int nMax)
{
    // Map rand() onto nMin..nMax. Raw values at or above ulLimit are
    // rejected so that every result is produced by the same number of
    // raw values. (Waiting for rand() itself to land inside nMin..nMax
    // would need thousands of calls per tile, or far more when RAND_MAX
    // is large.)
    const unsigned long ulRange = (unsigned long)(nMax - nMin) + 1;
    const unsigned long ulSpan = (unsigned long)RAND_MAX + 1;
    const unsigned long ulLimit = ulSpan - (ulSpan % ulRange);
    unsigned long ulRaw = (unsigned long)rand();
    while (ulRaw >= ulLimit)
    {
        ulRaw = (unsigned long)rand();
    }
    return nMin + (long)(ulRaw % ulRange);
}
|
|
|