FRDB Archives

Freethought & Rationalism Archive

Old 03-25-2005, 05:22 PM   #1
A Lexical Look at the Paulines

Hello,

I have been toying with the morphologically tagged New Testament prepared by Dr. Tauber, based on the NA-26 Greek text. The feature that I needed was the resolution of each word in the New Testament into its lexical form.

The reason for doing so was to look at an old problem, the lexical frequencies of the Pauline epistles, from a slightly new perspective, my own "max hapax" formula.

It can be described in simple enough terms. First, one divides the material into chunks that are as large as possible while still telling you something interesting about the data set. Too many chunks take too long to process, and chunks that are too small do not give a large enough sample. For a quick analysis, I chose to use just 8 chunks:

00 Romans + Galatians (9341 words)
01 First and Second Corinthians (11307 words)
02 Philippians and First Thessalonians (3110 words)
03 Colossians (1582 words)
04 Ephesians (2422 words)
05 Second Thessalonians (823 words)
06 First Timothy, Second Timothy, Titus (3488 words)
07 Hebrews (4953 words)

I left out Philemon, for now, because it may be too short to analyze.

Then one chooses a number of authors among which the chunks can be parceled out. I chose 2, 3, 4, 5, and 6. (More authors also means more processing time, since the program has to cycle through every possible distribution of the chunks among the authors.)
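To give a feel for the search space, here is a minimal sketch of one way to enumerate the distributions. It is not the program used for these results. It assumes the author boxes are interchangeable, so relabeling the same grouping is not counted twice; the names forEachGrouping and countGroupings are made up for the illustration.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Visit every way to split `numChunks` chunks among at most `maxAuthors`
// interchangeable author boxes. Each assignment is a "restricted growth
// string": chunk 0 always goes to author 0, and a chunk may open a new
// author only with the next unused id, so relabelings are never repeated.
void forEachGrouping(std::size_t numChunks, int maxAuthors,
                     const std::function<void(const std::vector<int>&)>& visit)
{
    std::vector<int> assignment(numChunks, 0);

    std::function<void(std::size_t, int)> recurse =
        [&](std::size_t chunk, int authorsUsed) {
            if (chunk == numChunks) {
                visit(assignment);
                return;
            }
            // Reuse an author already in play, or open exactly one new one.
            for (int a = 0; a <= authorsUsed && a < maxAuthors; ++a) {
                assignment[chunk] = a;
                recurse(chunk + 1,
                        a == authorsUsed ? authorsUsed + 1 : authorsUsed);
            }
        };

    if (numChunks > 0 && maxAuthors > 0) {
        recurse(0, 0);
    }
}

// Count the groupings, to see how quickly the work grows.
long countGroupings(std::size_t numChunks, int maxAuthors)
{
    long count = 0;
    forEachGrouping(numChunks, maxAuthors,
                    [&](const std::vector<int>&) { ++count; });
    return count;
}
```

Under these assumptions, eight chunks give 128 groupings with at most two authors and 4111 with at most six, so the cost rises quickly with the number of authors.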

The "max hapax" formula is this. For each word, the number of occurrences per 500 words is calculated for each author, and the highest rate of occurrence is found. Then one goes through the rest of the authors: for each author that does not have the word, or has it less than 1 time per 500 words, the value of the highest rate of occurrence is added to the "hapax" score for that particular distribution of authors.

The "max" part comes in displaying the top two distributions of authors in terms of the "hapax" score.

The reasoning behind this is, basically, that the more distinctive the lexical style of each author in the distribution, the more likely that distribution is. And, of course, I wanted to see what would happen if one went forward with this kind of analysis.

Here were the results.

For two authors:
Highest: Pastorals by themselves, the rest grouped together
Second Highest: Hebrews by itself, the rest grouped together

For three authors:
Highest: Hebrews; Pastorals; the rest grouped together
Second Highest: Hebrews; Pastorals and 2 Thessalonians; the rest

For four authors:
Highest: Hebrews; Pastorals; 2 Thessalonians; rest
2nd Highest: Hebrews; Pastorals; Ephesians; rest

For five authors:
Highest: Hebrews; Pastorals; 2 Thess; Ephesians; rest
2nd Highest: Hebrews; Pastorals; 2 Thess; Colossians; rest

For six authors:
Highest: Hebrews; Pastorals; 2 Thess; Eph; Philippians+1Thess; rest
2nd Highest: Hebrews; Pastorals; 2 Thess; Eph; Colossians; rest

The results interpreted.

Romans and Galatians are always grouped together with 1 Corinthians and 2 Corinthians. The author of these four epistles may be called "Paul" and provide the basis for determining what is Pauline style.

Hebrews and the Pastorals are certainly outliers in terms of lexical style. Hebrews is usually taken as non-Pauline; the Pastorals should be also, and, in fact, they often are on separate grounds.

2 Thessalonians is also on the periphery of Pauline lexical style. So is Ephesians. Their non-Pauline status is probable, though not as certain as Hebrews and the Pastorals.

Colossians is up in the air for me. For Philippians and First Thessalonians, I would take them as Pauline more probably than not.

Thoughts, suggestions, criticisms, requests for source code?

I may do different types of studies with Tauber's NT in the future. Let me know if you have ideas.

--
Peter Kirby (Undergrad in History at CSU Fullerton)
Web Site: http://www.peterkirby.com/
Old 03-26-2005, 04:13 PM   #2

Peter, this is a fascinating piece of research.

Quote:
Romans and Galatians are always grouped together with 1 Corinthians and 2 Corinthians. The author of these four epistles may be called "Paul" and provide the basis for determining what is Pauline style.
That is the same set of hits I have in Mark, minus 2 Cor, and plus Philippians. Basically, "Paul" is Rom, Gal, and 1 Cor.

Quote:
The "max hapax" formula is this. For each word, the number of occurrences per 500 words is calculated for each author, and the highest rate of occurrence is found. Then one goes through the rest of the authors: for each author that does not have the word, or has it less than 1 time per 500 words, the value of the highest rate of occurrence is added to the "hapax" score for that particular distribution of authors.
I'm having trouble visualizing this without a concrete example. Is the data accessible and pastable in here?

Vorkosigan
Old 03-26-2005, 07:31 PM   #3

Quote:
Originally Posted by Vorkosigan
Peter, this is a fascinating piece of research.
Thank you, and thanks for replying.

I was wondering if you had thoughts on "Mark 16 and Beyond"? I know you must be busy.

Quote:
Originally Posted by Vorkosigan
That is the same set of hits I have in Mark, minus 2 Cor, and plus Philippians. Basically, "Paul" is Rom, Gal, and 1 Cor.
Do you think that 2 Corinthians is by another author?

Quote:
Originally Posted by Vorkosigan
I'm having trouble visualizing this without a concrete example. Is the data accessible and pastable in here?
Actually, it turns out that my original formula had a fatal flaw. This was revealed when I took nine samples of 2000 words each from Acts and nine samples of 2000 words each from Paul's "Hauptbriefe". Instead of something sensible, or even something surprising, the results were erratic: they were skewed towards one large selection of chunks, with each of the remaining selections holding only one chunk. This invalidates most of my first post above.

However, I am fine-tuning a new formula and looking for more corpora on which to perform analysis. Things are looking good so far.

best,
Peter Kirby
Old 03-27-2005, 12:40 AM   #4

I discovered a fatal flaw in the first formula that I tried, as described in my previous post. (This flaw was discovered when working on the first data set described below.) So I have come up with a new formula, and it is one that has met with surprising success in the first three trial runs that I have made. The data sets chosen as "control groups"--cases other than the problem of the Pauline Epistles--are not ideal, but they are the best that I could do with the resources available. The only tagged Greek corpora of which I know are the Bible and some of the apostolic fathers.

First, I will share the results obtained with the new formula, because they should build confidence that there is *something* validly identified by the method--whatever that may be. Then I will briefly describe the procedure of the program in English.

The first data set I created consisted of eighteen chunks of text. They are:

00 Acts 1:1-4:10 2016 words
01 Acts 4:10-7:24 2016 words
02 Acts 7:24-9:37 2016 words
03 Acts 9:37-13:1 2016 words
04 Acts 13:1-15:37 2016 words
05 Acts 15:37-19:3 2016 words
06 Acts 19:3-21:32 2016 words
07 Acts 21:32-25:9 2016 words
08 Acts 25:9-28:31 2322 words
09 Romans 1:1-6:2 2268 words
10 Romans 6:2-10:18 2268 words
11 Romans 10:18-15:33 2151 words
12 1 Cor 1:1-7:14 2268 words
13 1 Cor 7:14-12:12 2268 words
14 1 Cor 12:12-16:24 2294 words
15 Galatians 1:1-end 2230 words
16 2 Cor 1:1-7:12 2248 words
17 2 Cor 7:13-end 2229 words

The Results for Two Groups:

highest score: 0-8 grouped together, 9-17 grouped together
A very clean cut between the blocks of Acts text and the blocks of Paul text
The second highest score separated the first part of Acts (thus grouping it with Paul).

This was, to me, a satisfying result in favor of the idea that the present formula is onto something. I decided to do something a little more complicated, to test the program in making groupings among more than two groups. I wanted to mix genres within one putative author, so I chose to isolate the letters to the seven churches in Revelation from the rest, which is split into two parts. The other three documents--Acts, Romans, and 1 Corinthians--are split into three parts.

Here are the documents in the second test:

00 Acts 6048 words
01 Acts 6048 words
02 Acts 6354 words
03 Rev: letters 1555 words
04 Revelation 3935 words
05 Revelation 4300 words
06 Romans 2268 words
07 Romans 2268 words
08 Romans 2151 words
09 1 Cor 2268 words
10 1 Cor 2268 words
11 1 Cor 2294 words

Here are the results:

For Two Groupings:
Highest (10947): Acts...the rest
Second (10765): Revelation...the rest

For Three Groupings:
Highest (16424): Acts...Revelation...Paul
Second (16058): Acts...4+5 (Rev Body)...Rev Letters (3)+Paul

For Four Groupings:
Highest (17626): Acts...3...4+5...Paul
Second (17469): Acts...Rev...10...rest of Paul

I stopped there before going on to five groupings. But how does one know how many groupings should be considered optimal?

I have hit upon a rough guide that seems to work well enough so far, which is:

Choose the highest score divided by the square root of the number of authors.

So, for example, 10947 becomes an adjusted score of 7741, 16424 becomes an adjusted score of 9482, and 17626 becomes an adjusted score of 8813. Here, then, by this procedure, the optimal choice is that there are three authors.
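The arithmetic of this rule of thumb can be sketched as follows. The struct Candidate and the function names are invented for the illustration, and the rule itself is only the rough guide described above, not an established statistic.

```cpp
#include <cmath>
#include <vector>

// The best raw score found for a given number of authors (hypothetical type).
struct Candidate {
    int numAuthors;
    double highestScore;
};

// Highest score divided by the square root of the number of authors.
double adjustedScore(const Candidate& c)
{
    return c.highestScore / std::sqrt(static_cast<double>(c.numAuthors));
}

// Number of authors with the best adjusted score, or 0 if none is usable.
int bestNumAuthors(const std::vector<Candidate>& candidates)
{
    int best = 0;
    double bestAdjusted = 0.0;
    for (const Candidate& c : candidates) {
        if (c.numAuthors <= 0) {
            continue;  // skip meaningless entries
        }
        const double adjusted = adjustedScore(c);
        if (best == 0 || adjusted > bestAdjusted) {
            best = c.numAuthors;
            bestAdjusted = adjusted;
        }
    }
    return best;
}
```

Fed the three highest scores from the second test (10947 for two authors, 16424 for three, 17626 for four), bestNumAuthors picks three.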

Like much else here, I am not sure about the theoretical basis as to *why* this works--if it does, in fact, work.

The third data set that I worked on is the Apostolic Fathers. Someone pointed out to me by e-mail where to find some of them morphologically tagged:

http://www.skrbc.org/bw_files/

Here there are ten documents:

00 First Clement
01 Second Clement
02 Didache
03 Ignatius to the Ephesians
04 Ignatius to the Magnesians
05 Ignatius to the Philadelphians
06 Ignatius to Polycarp
07 Ignatius to the Romans
08 Ignatius to the Trallians
09 Polycarp to the Philippians

Here are the results for the Apostolic Fathers.

With Two Groupings:
Highest Score: 0-2 grouped...3-9 grouped (i.e., Didache+1Clem+2Clem grouped against Ignatius+Polycarp)
Second Highest Score: 1, 3-8 grouped...0, 2, 9 grouped (i.e., 2Clem+Ignatius grouped against 1Clem, Didache, and Polycarp)

With Three Groupings:
Highest: 0, 2...1...3-9
Second: 0, 2, 9...1...3-8

With Four Groupings:
Highest: 0, 2...1...3-8...9
Second: 0...1...2-8...9

With Five Groupings:
Highest: 0...1...2...3-8...9 SCORE 19412
Second: 0, 7...1...2...3-6, 8...9 SCORE 17855

With Six Groupings:
Highest: 0...1...2...3-8...9 SCORE 19412
Second: 0...1...2...6...9...3,4,5,7,8 SCORE 19232

What I noticed about this:

It is quite significant that a grouping of five won out even when competing with groupings of six. Although this won't always happen, it's a sure sign that the 'optimal' grouping based on lexical distinctiveness has been achieved, as determined by the formula.

0 and 2 are grouped together when there are two, three, or four groupings. This could be one of those cases where documents are grouped on a basis other than authorship. Perhaps both came from a very early era in church history and that is reflected in their diction. Also, 2 Clement is by itself for any number of groupings from three upward. I believe it to be the latest of these documents, and I conjecture that this is also reflected in its diction.

With that review of the results in these cases, allow me to give a brief description of the (new) formula.

What my program does is, first, to calculate the frequency of each word in each of these texts. Let me take some excerpts from the frequency table that the program makes (for the Apostolic Fathers in this case):

Word                00  01  02  03  04  05  06  07  08  09
)Efe/sios            0   0   0   2   1   1   0   1   1   0
)Israh/l             7   0   0   0   0   0   0   0   0   0
A)/bel               4   0   0   0   0   0   0   0   0   0
Ai)/guptos           6   0   0   0   0   0   0   0   0   0
Eu)xaristou^me/n     0   0   3   0   0   0   0   0   0   0
Ka/i)^n              6   0   0   0   0   0   0   0   0   0
Mwu+sh^s             9   0   0   0   0   0   0   0   0   0
Qeofo/ros            0   0   0   1   1   1   1   1   1   0
Suri/a               0   0   0   2   2   2   3   4   1   1
Xristo/s            45  12   1  33  22  21   3  20  19   9
a(/gios             24   2   9   2   1   4   0   0   1   0
a(marti/a           24   4   3   0   0   0   0   0   0   4
a)/ggelos            6   0   0   0   0   0   0   0   0   0
a)/rshn              1   6   0   0   0   0   0   0   0   0
a)delfo/s           20  15   1   2   0   3   1   1   0   1
a)gaphto/s          18   0   0   0   1   1   2   0   0   0
a)mh/n              10   1   1   0   0   0   0   0   0   0

Okay, I got bored after copying this many over. I mostly picked out "interesting" ones, the ones that will actually factor into the total "lexical distinctiveness score" (more about this term below). I included an "uninteresting" one for example: "Christ" is found across the board and will not tell us much (unless we decide to group everything against Didache perhaps).
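A table like the one above can be built in a few lines. This is only a sketch and not the program's actual code: it assumes each text has already been reduced to a list of lexical forms (lemmas), and the names FrequencyTable and buildFrequencyTable are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One row per lexical form; one column per text, holding the raw count.
using FrequencyTable = std::map<std::string, std::vector<int>>;

// `texts[t]` is the sequence of lexical forms found in text number t.
FrequencyTable buildFrequencyTable(
    const std::vector<std::vector<std::string>>& texts)
{
    FrequencyTable table;
    for (std::size_t t = 0; t < texts.size(); ++t) {
        for (const std::string& lemma : texts[t]) {
            std::vector<int>& row = table[lemma];
            if (row.empty()) {
                row.assign(texts.size(), 0);  // first sighting: all zeros
            }
            ++row[t];
        }
    }
    return table;
}
```

Because std::map keeps its keys sorted, the rows come out in plain byte order, which would also put the capitalized forms ahead of the lower-case ones.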

Qeofo/ros is, of course, the second name of Ignatius. It is found only in those six epistles, so with 6 occurrences there against 0 elsewhere, it will do its part to group these texts together in the highest-scoring permutations. (A permutation here is one possible selection of the ten texts into a certain number of "boxes"; the word frequencies of the texts in each box are combined, to be compared with the frequencies of words in the texts in the other "boxes".) An example of a negative, instead of a positive, characteristic of the Ignatian letters is that none of them contains a(marti/a. This also will do a little bit to group them together.

1 Clement is the only document to contain )Israh/l, A)/bel, Ai)/guptos, Ka/i)^n, Mwu+sh^s, and a)/ggelos. All of these things will set it apart.
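The combining of texts into "boxes" can be illustrated with a small sketch. The function name combineCounts and the way the assignment is represented (one box number per text) are assumptions made for the example, not details taken from the program.

```cpp
#include <cstddef>
#include <vector>

// Sum one word's per-text counts into per-box counts.
// `boxOfText[t]` is the box (0 .. numBoxes-1) that text t was placed in.
std::vector<int> combineCounts(const std::vector<int>& countsPerText,
                               const std::vector<int>& boxOfText,
                               int numBoxes)
{
    std::vector<int> countsPerBox(numBoxes > 0 ? numBoxes : 0, 0);
    for (std::size_t t = 0;
         t < countsPerText.size() && t < boxOfText.size(); ++t) {
        const int box = boxOfText[t];
        if (box >= 0 && box < numBoxes) {
            countsPerBox[box] += countsPerText[t];
        }
    }
    return countsPerBox;
}
```

For the five-way grouping 0...1...2...3-8...9, the Qeofo/ros row becomes 0 0 0 6 0 and the a(marti/a row becomes 24 4 3 0 4, which is the contrast described above.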

Here is how the "lexical distinctiveness score" is calculated in the current algorithm:

Quote:
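// Runs once per word, for one assignment of texts to authors.
// numperauthor[] is the word's frequency per 1000 words for each author;
// intcountperauthor[] is its raw count for each author.
otheroccurences = 0.01; // so, no division by 0 below
bool iscounted = false; // this will be false if (1) there's only one author, or (2) the word occurs at most once for the top author

for ( authorcounter = 0; authorcounter < NUMAUTHORS; authorcounter++ )
{
    // add up the frequencies of every other author that has at least one text
    if ( ( authorcounter != highestauthorid ) && ( numbooksperauthor[ authorcounter ] != 0 ) )
    {
        iscounted = true;
        otheroccurences += numperauthor[ authorcounter ];
    }
}

// a word found no more than once in its top author is not counted
if ( intcountperauthor[ highestauthorid ] <= 1 )
{
    iscounted = false;
}

if ( iscounted )
{
    // ratio of the top frequency to the others, scaled by the share of text outside the top author
    totalscore += ( numperauthor[ highestauthorid ] / otheroccurences ) * ( WORDCOUNT - numwordsperauthor[ highestauthorid ] ) / double( WORDCOUNT );
}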
In English:

For each word, such as in the table of numbers for each word above, the following is done:

Nothing is added to the "lexical distinctiveness score" if there is only one 'author' assigned to the selected texts, or if the word occurs only one time in the 'author' that has it the most times. ('Author' in quotes because, as Carlson has pointed out to me, other reasons besides authorship can account for lexical distinctiveness. I've also been using the phrase 'author box', but perhaps one could say more accurately 'selection of texts' or 'grouping of texts'.)

Otherwise, we add to that score in accordance with the formula on the last line. First, we start with the frequency of the word in question per 1000 words in the author/grouping that has that word with the highest frequency.

Then, if no other author/grouping has that word, we multiply by 100 (divide by .01).

Otherwise, we divide by the sum of the word's frequencies per 1000 words in the other authors/groupings (plus .01).

Then we multiply by a "calibrating" factor based on the amount of text outside of the highest-frequency grouping. It is the fraction of all the words that lie outside of the highest-frequency grouping.

That's it. And, no, I'm not completely sure myself why this would work.

I thank everyone who has commented and who will comment on the matter.

best wishes,
Peter Kirby
Old 03-27-2005, 12:49 AM   #5

Quote:
Originally Posted by Peter Kirby
Thank you, and thanks for replying.

I was wondering if you had thoughts on "Mark 16 and Beyond"? I know you must be busy.
I do. I shall be there soon!

Quote:
Do you think that 2 Corinthians is by another author?
No, but I am sure puzzled by why Mark used it so little.

Quote:
However, I am fine-tuning a new formula and looking for more corpora on which to perform analysis. Things are looking good so far.
best,
Peter Kirby
Good luck! Looks great so far.
Vorkosigan
Old 03-27-2005, 02:55 AM   #6

Ellegard did a similar exercise but looking at individual words and how and where they occurred.

I couldn't find a good reference, but he does look at just about all the texts you have used. In 1962, he wrote "A Statistical Method for Determining Authorship." The classic terms he uses are synagogue, apostle and saints.

Interestingly, he asserts Jesus of Nazareth is a mistranslation of Jesus the Essene.
Clivedurdle
Old 03-27-2005, 10:59 AM   #7

Ellegard's method is really quite different, IMHO. I have his book, and I've been meaning for some years to replicate his type of study with a more comprehensive set of data, specifically on theological words such as names for Jesus, names for God, names for believers, etc.

best,
Peter Kirby
Old 03-28-2005, 06:55 AM   #8

Peter,

Add my compliments to Vork's. This is fascinating stuff.

Have you done any experimentation with algorithms that include the proximity of various words to one another? Seems this could be pretty powerful in terms of detecting similarities in phrases and construction, though the coding would likely be a little more hairy (and you might have to develop your criteria and weights from scratch).

Regards,

V.
Vivisector
Old 03-28-2005, 11:32 AM   #9

Peter, do you have a link to Tauber's files?

Thanks,
Julian
Old 03-28-2005, 01:23 PM   #10

Quote:
Originally Posted by Julian
Peter, do you have a link to Tauber's files?
Here it is:

http://www.jtauber.com/morphgnt

best,
Peter Kirby
 
