Freethought & Rationalism Archive. The archives are read only. |
03-25-2005, 05:22 PM | #1 |
Veteran Member
Join Date: Jul 2001
Location: the reliquary of Ockham's razor
Posts: 4,035
|
A Lexical Look at the Paulines
Hello,
I have been toying with the morphologically tagged New Testament prepared by Dr. Tauber, based on the NA-26 Greek text. The feature that I needed was the resolution of each word in the New Testament into its lexical form. The reason for doing so was to look at an old problem, the lexical frequencies of the Pauline epistles, from a slightly new perspective, my own "max hapax" formula. It can be described in simple enough terms.

First, one divides the material into maximally large chunks that will tell you something interesting about your data set, but that won't result in too many chunks (too many take too long to process, and each one ends up with too small a sample size). For a quick analysis, I chose to use just 8 chunks:

00 Romans + Galatians (9341 words)
01 First and Second Corinthians (11307 words)
02 Philippians and First Thessalonians (3110 words)
03 Colossians (1582 words)
04 Ephesians (2422 words)
05 Second Thessalonians (823 words)
06 First Timothy, Second Timothy, Titus (3488 words)
07 Hebrews (4953 words)

I left out Philemon, for now, because it may be too short to analyze.

Then one chooses a number of authors between which the chunks can be parceled out. I chose 2, 3, 4, 5, and 6. (This also increases processing time. The program has to cycle through all possible permutations of author distribution.)

The "max hapax" formula is this. For each word, the number of occurrences per 500 words is calculated for each author. Then the highest rate of occurrence is found. Then one goes through the rest of the authors, and for each author that does not have the word, or has it less than 1 time per 500 words, the value of the highest rate of occurrence is added to the "hapax" score for that particular distribution of authors. The "max" part comes in displaying the top two distributions of authors in terms of the "hapax" score.
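The procedure just described can be sketched in Python. This is only an illustration under stated assumptions, not the program used for the results in this thread (its source is not shown here): each chunk is assumed to be a plain list of lexical forms, and the names `max_hapax_score`, `partitions`, and `top_two` are hypothetical.

```python
from collections import Counter


def max_hapax_score(groups):
    """Score one distribution of authors.

    groups: one list of lexical forms per author (all of that author's
    chunks concatenated). Every group is assumed to be non-empty.
    """
    counts = [Counter(g) for g in groups]
    sizes = [len(g) for g in groups]
    score = 0.0
    for word in set().union(*counts):
        # Occurrences per 500 words for each author.
        rates = [500.0 * c[word] / n for c, n in zip(counts, sizes)]
        top = max(rates)
        top_index = rates.index(top)
        for i, rate in enumerate(rates):
            # Each other author below 1 per 500 words adds the top rate.
            if i != top_index and rate < 1.0:
                score += top
    return score


def partitions(n_chunks, k):
    """Yield every way to label n_chunks chunks with exactly k authors.

    Labels are generated as restricted-growth sequences, so relabelings
    of the same grouping are not produced twice.
    """
    def extend(labels, used):
        if len(labels) == n_chunks:
            if used == k:
                yield tuple(labels)
            return
        for label in range(min(used + 1, k)):
            yield from extend(labels + [label], max(used, label + 1))

    yield from extend([], 0)


def top_two(chunks, k):
    """Return the two highest-scoring (score, labels) pairs for k authors."""
    scored = []
    for labels in partitions(len(chunks), k):
        groups = [[] for _ in range(k)]
        for chunk, label in zip(chunks, labels):
            groups[label].extend(chunk)
        scored.append((max_hapax_score(groups), labels))
    scored.sort(reverse=True)
    return scored[:2]
```

With 8 chunks and 2 authors, `partitions(8, 2)` yields 127 groupings, so the exhaustive search is cheap at this size; the count grows quickly as chunks and authors are added.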
The reasoning behind this is, basically, that the more distinctive the lexical style of each author in the distribution, the more likely that distribution is. And, of course, I wanted to see what would happen if one went forward with this kind of analysis. Here were the results.

For two authors:
Highest: Pastorals by themselves, the rest grouped together
Second Highest: Hebrews by itself, the rest grouped together

For three authors:
Highest: Hebrews; Pastorals; the rest grouped together
Second Highest: Hebrews; Pastorals and 2 Thessalonians; the rest

For four authors:
Highest: Hebrews; Pastorals; 2 Thessalonians; rest
2nd Highest: Hebrews; Pastorals; Ephesians; rest

For five authors:
Highest: Hebrews; Pastorals; 2 Thess; Ephesians; rest
2nd Highest: Hebrews; Pastorals; 2 Thess; Colossians; rest

For six authors:
Highest: Hebrews; Pastorals; 2 Thess; Eph; Philippians+1Thess; rest
2nd Highest: Hebrews; Pastorals; 2 Thess; Eph; Colossians; rest

The results interpreted. Romans and Galatians are always grouped together with 1 Corinthians and 2 Corinthians. The author of these four epistles may be called "Paul" and provides the basis for determining what is Pauline style. Hebrews and the Pastorals are certainly outliers in terms of lexical style. Hebrews is usually taken as non-Pauline; the Pastorals should be also, and, in fact, they often are on separate grounds. 2 Thessalonians is also on the periphery of Pauline lexical style. So is Ephesians. Their non-Pauline status is probable, though not as certain as that of Hebrews and the Pastorals. Colossians is up in the air for me. Philippians and First Thessalonians I would take as Pauline, more probably than not.

Thoughts, suggestions, criticisms, requests for source code? I may do different types of studies with Tauber's NT in the future. Let me know if you have ideas.

-- Peter Kirby (Undergrad in History at CSU Fullerton) Web Site: http://www.peterkirby.com/ |
03-26-2005, 04:13 PM | #2 | ||
Contributor
Join Date: Jan 2001
Location: Barrayar
Posts: 11,866
|
Peter, this is a fascinating piece of research.
Quote:
Quote:
Vorkosigan |
||
03-26-2005, 07:31 PM | #3 | |||
Veteran Member
Join Date: Jul 2001
Location: the reliquary of Ockham's razor
Posts: 4,035
|
Quote:
I was wondering if you had thoughts on "Mark 16 and Beyond"? I know you must be busy. Quote:
Quote:
However, I am fine-tuning a new formula and looking for more corpora on which to perform analysis. Things are looking good so far. best, Peter Kirby |
|||
03-27-2005, 12:40 AM | #4 | |
Veteran Member
Join Date: Jul 2001
Location: the reliquary of Ockham's razor
Posts: 4,035
|
I discovered a fatal flaw in the first formula that I tried, described in my first post. (This flaw was discovered when working on the first data set described below.) So I have come up with a new formula, and it is one that has met with surprising success in the first three trial runs that I have made. The data sets chosen as "control groups"--cases other than the problem of the Pauline Epistles--are not ideal, but they are the best that I could do with the resources available. The only tagged Greek corpora of which I know are the Bible and some of the apostolic fathers.
First, I will share the results obtained with the new formula, because they should build confidence that there is *something* validly identified by the method--whatever that may be. Then I will briefly describe the procedure of the program in English.

The first data set I created consisted of eighteen chunks of text. They are:

00 Acts 1:1-4:10 2016 words
01 Acts 4:10-7:24 2016 words
02 Acts 7:24-9:37 2016 words
03 Acts 9:37-13:1 2016 words
04 Acts 13:1-15:37 2016 words
05 Acts 15:37-19:3 2016 words
06 Acts 19:3-21:32 2016 words
07 Acts 21:32-25:9 2016 words
08 Acts 25:9-28:31 2322 words
09 Romans 1:1-6:2 2268 words
10 Romans 6:2-10:18 2268 words
11 Romans 10:18-15:33 2151 words
12 1 Cor 1:1-7:14 2268 words
13 1 Cor 7:14-12:12 2268 words
14 1 Cor 12:12-16:24 2294 words
15 Galatians 1:1-end 2230 words
16 2 Cor 1:1-7:12 2248 words
17 2 Cor 7:13-end 2229 words

The Results for Two Groups:
Highest score: 0-8 grouped together, 9-17 grouped together. A very clean cut between the blocks of Acts text and the blocks of Paul text.
The second highest score separated the first part of Acts (thus grouping it with Paul).

This was, to me, a satisfying result in favor of the idea that the present formula is onto something.

I decided to do something a little more complicated, to test the program in making groupings among more than two groups. I wanted to mix genres within one putative author, so I chose to isolate the letters to the seven churches in Revelation from the rest, which is split into two parts. The other three documents--Acts, Romans, and 1 Corinthians--are split into three parts.
Here are the documents in the second test:

00 Acts 6048 words
01 Acts 6048 words
02 Acts 6354 words
03 Rev: letters 1555 words
04 Revelation 3935 words
05 Revelation 4300 words
06 Romans 2268 words
07 Romans 2268 words
08 Romans 2151 words
09 1 Cor 2268 words
10 1 Cor 2268 words
11 1 Cor 2294 words

Here are the results:

For Two Groupings:
Highest (10947): Acts...the rest
Second (10765): Revelation...the rest

For Three Groupings:
Highest (16424): Acts...Revelation...Paul
Second (16058): Acts...4+5 (Rev Body)...Rev Letters (3)+Paul

For Four Groupings:
Highest (17626): Acts...3...4+5...Paul
Second (17469): Acts...Rev...10...rest of Paul

I stopped there before going on to five groupings.

But how does one know how many groupings should be considered optimal? I have hit upon a rough guide that seems to work well enough so far, which is: take the highest score for each number of authors, divide it by the square root of that number of authors, and choose the number of authors with the largest result. So, for example, 10947 becomes an adjusted score of 7741, 16424 becomes an adjusted score of 9482, and 17626 becomes an adjusted score of 8813. Here, then, by this procedure, the optimal choice is that there are three authors. Like much else here, I am not sure about the theoretical basis as to *why* this works--if it does, in fact, work.

The third data set that I worked on is the Apostolic Fathers. Someone pointed out to me by e-mail where to find some of them morphologically tagged: http://www.skrbc.org/bw_files/

Here there are ten documents:

00 First Clement
01 Second Clement
02 Didache
03 Ignatius to the Ephesians
04 Ignatius to the Magnesians
05 Ignatius to the Philadelphians
06 Ignatius to Polycarp
07 Ignatius to the Romans
08 Ignatius to the Trallians
09 Polycarp to the Philippians

Here are the results for the Apostolic Fathers.
With Two Groupings:
Highest Score: 0-2 grouped...3-9 grouped (i.e., Didache+1Clem+2Clem grouped against Ignatius+Polycarp)
Second Highest Score: 1, 3-8 grouped...0, 2, 9 grouped (i.e., 2Clem+Ignatius grouped against 1Clem, Didache, and Polycarp)

With Three Groupings:
Highest: 0, 2...1...3-9
Second: 0, 2, 9...1...3-8

With Four Groupings:
Highest: 0, 2...1...3-8...9
Second: 0...1...2-8...9

With Five Groupings:
Highest: 0...1...2...3-8...9 SCORE 19412
Second: 0, 7...1...2...3-6, 8...9 SCORE 17855

With Six Groupings:
Highest: 0...1...2...3-8...9 SCORE 19412
Second: 0...1...2...6...9...3,4,5,7,8 SCORE 19232

What I noticed about this:

It is quite significant that a grouping of five won out even when competing with groupings of six. Although this won't always happen, it's a sure sign that the 'optimal' grouping based on lexical distinctiveness has been achieved, as determined by the formula.

0 and 2 are grouped together with two, three, or four groupings. Here could be one of those cases when documents are grouped on a basis other than authorship. Perhaps both came from a very early era in church history and that is reflected in their diction.

Also, 2 Clement is by itself for any grouping of three or higher. I believe it to be the latest of these documents, and I conjecture that this is also reflected in its diction.

With that review of the results in these cases, allow me to give a brief description of the (new) formula. What my program does is, first, to calculate the frequency of each word in each of these texts.
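That first step, counting each lexical form in each text, might look like the following in Python. This is a minimal sketch rather than the actual program: it assumes each text has already been reduced to a list of lexical forms, and `frequency_table` is a hypothetical name.

```python
from collections import Counter


def frequency_table(texts):
    """Count every lexical form in every text.

    texts: dict mapping a text name to its list of lexical forms.
    Returns a dict mapping each lexical form to a list of counts,
    one per text, in the order the texts were given.
    """
    counters = [Counter(forms) for forms in texts.values()]
    vocabulary = sorted(set().union(*counters))
    return {form: [c[form] for c in counters] for form in vocabulary}


# Toy input, not real data: two tiny "texts" of already-lemmatized words.
toy_table = frequency_table({"A": ["x", "y", "x"], "B": ["y"]})
```

Each row of the result has one count per text, which is the same shape as the excerpts that follow.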
Let me take some excerpts from the frequency table that the program makes (for the Apostolic Fathers in this case; the columns are documents 00 through 09 as numbered above):

lexical form       00  01  02  03  04  05  06  07  08  09
)Efe/sios           0   0   0   2   1   1   0   1   1   0
)Israh/l            7   0   0   0   0   0   0   0   0   0
A)/bel              4   0   0   0   0   0   0   0   0   0
Ai)/guptos          6   0   0   0   0   0   0   0   0   0
Eu)xaristou^me/n    0   0   3   0   0   0   0   0   0   0
Ka/i)^n             6   0   0   0   0   0   0   0   0   0
Mwu+sh^s            9   0   0   0   0   0   0   0   0   0
Qeofo/ros           0   0   0   1   1   1   1   1   1   0
Suri/a              0   0   0   2   2   2   3   4   1   1
Xristo/s           45  12   1  33  22  21   3  20  19   9
a(/gios            24   2   9   2   1   4   0   0   1   0
a(marti/a          24   4   3   0   0   0   0   0   0   4
a)/ggelos           6   0   0   0   0   0   0   0   0   0
a)/rshn             1   6   0   0   0   0   0   0   0   0
a)delfo/s          20  15   1   2   0   3   1   1   0   1
a)gaphto/s         18   0   0   0   1   1   2   0   0   0
a)mh/n             10   1   1   0   0   0   0   0   0   0

Okay, I got bored after copying this many over. I mostly picked out "interesting" ones, the ones that will actually factor into the total "lexical distinctiveness score" (more about this term below). I included an "uninteresting" one for example: "Christ" is found across the board and will not tell us much (unless we decide to group everything against Didache perhaps).

Qeofo/ros is, of course, the second name of Ignatius. It is found only in those six epistles, and at a rate of 6 to 0 elsewhere when added up, it will do its part to group these texts together in the highest scoring permutations (possible selections of the ten texts to go into a certain number of "boxes" where their frequencies are combined together, to be compared with the frequencies of words in the texts in the other "boxes").

An example of a negative, instead of a positive, characteristic of the Ignatian letters is that none of them contain a(marti/a. This also will do a little bit to group them together.

1 Clement is the only document to contain )Israh/l, A)/bel, Ai)/guptos, Ka/i)^n, Mwu+sh^s, and a)/ggelos. All of these things will set it apart.

Here is how the "lexical distinctiveness score" is calculated in the current algorithm: Quote:
For each word, such as in the table of numbers for each word above, the following is done:

Nothing is added to the "lexical distinctiveness score" if there is only one 'author' assigned to the selected texts, or if the word occurs only one time in the 'author' that has it the most times. ('Author' in quotes because, as Carlson has pointed out to me, other reasons besides authorship can account for lexical distinctiveness. I've also been using the phrase 'author box', but perhaps one could say more accurately 'selection of texts' or 'grouping of texts'.)

Otherwise, we add to that score in accordance with the formula on the last line.

First, we start with the frequency of the word of concern per 1000 words in the author/grouping that has that word with the highest frequency.

Then, if no other author/grouping has that word, we multiply by 100 (that is, divide by .01). Otherwise, we divide by the word's frequency per 1000 words in all of the other authors/groupings combined (plus .01).

Then we multiply by a "calibrating" factor based on the amount of text outside of the highest-frequency grouping. It is the percentage of all the words that lie outside of the highest-frequency grouping.

That's it. And, no, I'm not completely sure myself why this would work.

I thank everyone who has commented and who will comment on the matter.

best wishes, Peter Kirby |
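For readers who prefer code, here is one possible reading of the description above as a Python sketch. It is not the original program, and it rests on several assumptions: the "calibrating" percentage is treated as a fraction between 0 and 1, the 'author that has it the most times' is taken to be the grouping with the highest frequency per 1000 words, every grouping is non-empty, and the names `word_contribution`, `distinctiveness_score`, and `adjusted_score` are hypothetical. The last function applies the square-root rule of thumb from earlier in this post.

```python
import math


def word_contribution(counts, sizes):
    """Contribution of one word to the score for one distribution.

    counts[i]: occurrences of the word in grouping i.
    sizes[i]:  total number of words in grouping i (assumed > 0).
    """
    if len(counts) < 2:
        return 0.0  # only one 'author': nothing is added
    rates = [1000.0 * c / n for c, n in zip(counts, sizes)]
    top = max(range(len(counts)), key=lambda i: rates[i])
    if counts[top] <= 1:
        return 0.0  # the word occurs at most once where it is most frequent
    other_count = sum(counts) - counts[top]
    other_size = sum(sizes) - sizes[top]
    other_rate = 1000.0 * other_count / other_size
    # Assumption: "percentage" is used as a fraction of the whole corpus.
    outside_share = other_size / sum(sizes)
    # Adding .01 covers both cases: with other_rate == 0 this divides by .01.
    return rates[top] / (other_rate + 0.01) * outside_share


def distinctiveness_score(table, labels, text_sizes, n_groups):
    """Sum the contributions of all words for one assignment of texts.

    table: dict mapping a word to its list of per-text counts.
    labels[j]: index of the grouping that text j is placed in.
    """
    sizes = [0] * n_groups
    for label, size in zip(labels, text_sizes):
        sizes[label] += size
    total = 0.0
    for row in table.values():
        counts = [0] * n_groups
        for label, count in zip(labels, row):
            counts[label] += count
        total += word_contribution(counts, sizes)
    return total


def adjusted_score(best_score, n_groups):
    """Rule of thumb: best score divided by sqrt(number of groupings)."""
    return best_score / math.sqrt(n_groups)


# The three best scores reported for the second test, for 2, 3, and 4 groupings.
adjusted = [round(adjusted_score(s, k)) for s, k in [(10947, 2), (16424, 3), (17626, 4)]]
```

Here `adjusted` comes out as `[7741, 9482, 8813]`, matching the adjusted scores quoted for the second test, with three groupings scoring highest.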
|
03-27-2005, 12:49 AM | #5 | |||
Contributor
Join Date: Jan 2001
Location: Barrayar
Posts: 11,866
|
Quote:
Quote:
Quote:
|
|||
03-27-2005, 02:55 AM | #6 |
Contributor
Join Date: Mar 2003
Location: London UK
Posts: 16,024
|
Ellegard did a similar exercise but looking at individual words and how and where they occurred.
I couldn't find a good reference, but he does look at just about all the texts you have used. In 1962, he wrote "A Statistical Method for Determining Authorship." The classic terms he uses are synagogue, apostle and saints. Interestingly, he asserts Jesus of Nazareth is a mistranslation of Jesus the Essene. |
03-27-2005, 10:59 AM | #7 |
Veteran Member
Join Date: Jul 2001
Location: the reliquary of Ockham's razor
Posts: 4,035
|
Ellegard's method is really quite different, IMHO. I have his book, and I've been meaning for some years to replicate his type of study with a more comprehensive set of data, specifically on theological words such as names for Jesus, names for God, names for believers, etc.
best, Peter Kirby |
03-28-2005, 06:55 AM | #8 |
Regular Member
Join Date: Nov 2004
Location: KY
Posts: 415
|
Peter,
Add my compliments to Vork's. This is fascinating stuff. Have you done any experimentation with algorithms that include the proximity of various words to one another? It seems this could be pretty powerful in terms of detecting similarities in phrases and construction, though the coding would likely be a little more hairy (and you might have to develop your criteria and weights from scratch). Regards, V. |
03-28-2005, 11:32 AM | #9 |
Veteran Member
Join Date: Feb 2004
Location: Washington, DC (formerly Denmark)
Posts: 3,789
|
Peter, do you have a link to Tauber's files?
Thanks, Julian |
03-28-2005, 01:23 PM | #10 | |
Veteran Member
Join Date: Jul 2001
Location: the reliquary of Ockham's razor
Posts: 4,035
|
Quote:
http://www.jtauber.com/morphgnt best, Peter Kirby |
|