FRDB Archives

Freethought & Rationalism Archive

The archives are read only.


Old 03-14-2007, 11:01 AM   #211
Banned
 
Join Date: Jan 2007
Location: Canada
Posts: 528
Default

Step 1: Zeroing the Data - Subtract the 'mean'


For PCA to work properly, you have to subtract the mean (average value) from each of the data dimensions.


Subtracting the 'Mean'

The actual 'mean' that is subtracted from each entry is the average across each dimension. What is meant is actually simple: each 'dimension' is a column of data from the data-table.

We get the Mean Average by simply adding up all the entries, and dividing by the number of entries. This gives us the simplest and crudest kind of average value: the Mean. The formula is simple. If the set of entries for a column is:

X = [X1, X2, X3, ...Xn ]

...then the formula for the Mean average is:

x~ = ( X1 + X2 + ... + Xn ) / n

So, all the x values have x~ (the Mean of x) subtracted, and all the y values have y~ subtracted from them, etc. This produces a data set whose mean (average value) is zero for every column (dimension).
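For anyone who wants to try the Zero Mean step themselves, it is only a few lines in Python with the NumPy library. (The toy data-table here is made up purely for illustration.)

```python
import numpy as np

# Toy data-table: rows are observations, columns are 'dimensions'
data = np.array([[2.0, 4.0],
                 [4.0, 8.0],
                 [6.0, 12.0]])

col_means = data.mean(axis=0)   # the Mean of each column (dimension)
centered = data - col_means     # subtract the Mean from every entry

# Every column of the centered table now averages to zero
print(centered.mean(axis=0))
```

This is exactly the 'weighted center' idea above: the new Origin sits at the column means, so clusters of points pull it toward themselves.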


An Explanation:

What are we doing when we do this? This is something like finding the 'center of mass' for a floating cloud of particles. It places the Origin of our coordinate axes somewhere in the middle of the cloud.

By 'somewhere', we stress that this may not be the actual geometrical 'center'. We might think we want to do that when plotting a cloud, just as a photographer might center his 'shot' so that the subject is in the middle. This saves graph paper. But Zero Mean conversion is different.

The Origin is located at a 'center' that is really a weighted center, weighted by the number of points in each range of values, for each axis. So if there were for instance a lot of points in a certain range, this would 'pull' the Origin toward that group or cluster.
In order to plot all the points (and save paper), we would then place the origin and the 'view-window' independently:



This is what the 'cross-hairs' and zero-lines in a PCA plot are all about. They show the weighted 'center' of the data, and its spread from this core in various directions. The reason that these lines are not usually centered themselves is that the viewport has to accommodate all the plotted points, and is chosen independently.
Nazaroo is offline  
Old 03-14-2007, 11:02 AM   #212
Banned
 
Join Date: Dec 2005
Location: Chicago, IL
Posts: 1,289
Default

Quote:
Originally Posted by Nazaroo View Post
I was going to simply critique Willker for his lame application of PCA to the PA.
Have you also sent this to Wieland?

JG
jgibson000 is offline  
Old 03-14-2007, 11:10 AM   #213
Banned
 
Join Date: Jan 2007
Location: Canada
Posts: 528
Default

Step 2: The 1st Principal Component Axis

Calculating the Principal Component Axis is the first step in a multivariate projection. We draw this new co-ordinate axis along the line of maximum variation of the data. That is, in a way that spans the maximum length or spread of the cloud of points. It can also be called a "line of best fit".

This axis is known as the First Principal Component Axis (PC1 in the picture). All the observations are projected down onto this new axis and the score values are read off. These become the new entries for a new Hidden Variable, the First Principal Component. This variable becomes a column in a new 'transformed' data-table with all new columns and entries.
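As a rough sketch of how the score values get "read off", here is the whole procedure in Python/NumPy on a synthetic centered cloud (the data is invented for the example; the Eigenvector machinery is explained further below):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D cloud of 50 observations, stretched along one direction
cloud = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.3])
cloud -= cloud.mean(axis=0)                # Zero Mean step from before

cov = np.cov(cloud, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
pc1 = eigvecs[:, np.argmax(eigvals)]       # axis of maximum variation

scores = cloud @ pc1                       # project each observation onto PC1
```

The `scores` column is the new Hidden Variable: one score per observation, and its variance is exactly the largest Eigenvalue.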




This First Principal Component is said to "explain", as well as possible, the spread of the "Observations" (MSS) in our original 3-dimensional space. This spread is expressed as a percentage of the total variation in position, and the component is said to "explain" that % of it.
From the picture below, you can see that this involves somehow minimizing the distances of all the points to this line. That could be a pretty messy job, involving a lot of measuring and adjustment: luckily there is a handy technique that does this for us automatically.





MATRIX ALGEBRA (Linear Algebra)

In Matrix Algebra we can calculate the Eigenvectors, which happen to be the very axis lines we are looking for. We will explain how to do that later. The important point is that these Eigenvectors are also a set of orthogonal (at right angles) lines in space. But this time they are aligned to our data, and can serve as an alternate set of axes.


(1) Euclidean Vector Space Needed

In this application of Matrix Algebra, the coordinate space is treated as a Vector Space, and the points in space are treated as Vectors. This Vector Space and the coordinate points in it must obey a strict set of Transformation Rules, and the Vector Space must be Euclidean.

All of the distances between points in this Vector Space, for instance, the length of Eigenvectors, are calculated using the Generalized Pythagorean Formula:

For instance, if En is an Eigenvector, the Length of that vector is:

L(En) = [ X1^2 + X2^2 + ... + Xn^2 ]^(1/2)

Once again we see that the data must already be in a form that conforms to coordinates in a Euclidean Space, in order for proper distance calculations to be made.
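The Generalized Pythagorean Formula is one line of code. A quick check with a made-up vector:

```python
import numpy as np

v = np.array([3.0, 4.0, 12.0])
length = np.sqrt(np.sum(v**2))   # [ x1^2 + x2^2 + ... + xn^2 ]^(1/2)
print(length)                    # 3-4-12 is a Pythagorean triple: length 13
```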


(2) Square Data Table Needed

Now we run into one more snag! For large numbers of dimensions or observations, we have to resort to Eigenvector techniques. But Eigenvectors can only be found in square matrices. That is, our data table must have the same number of columns as it has rows.

If we have more columns than rows, or vice versa, we have to discard some data, often a significant amount! Or we have to break it down into smaller squares, and apply PCA techniques separately to the pieces, an obviously arbitrary and dubious method.

Once again, we are either artificially 'designing' the experiment to fit the method, biasing the result, or we are going to needlessly complicate the process of applying PCA, again with questionable results.
Even with a relatively large 'sample' of our data-table, the calculated Eigenvector values may be way off from the true 'maximum spread' axes.
Nazaroo is offline  
Old 03-14-2007, 11:15 AM   #214
Banned
 
Join Date: Jan 2007
Location: Canada
Posts: 528
Default

Step 3: The Second Principal Component Axis

We now add a second principal component – PC2 in the picture. After PC1, this defines the next best direction for approximating the original data and is orthogonal (at right angles) to PC1.

To do this graphically, we would keep the PC2 line at a right angle to the PC1 line, and rotate it around until it was spanning the widest part of the cloud. In actuality, we simply select the 2nd largest Eigenvector, using the Eigenvalues as a guide.




These two axes give the "best possible" two-dimensional window into our original 3-dimensional data, accounting for the largest % of the original variation.
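The "select the 2nd largest Eigenvector" step looks like this in a Python/NumPy sketch (synthetic data again, for illustration only). Note that the right angle between PC1 and PC2 comes for free, because the Eigenvectors of a symmetric covariance matrix are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(1)
cloud = rng.normal(size=(60, 3)) @ np.diag([4.0, 2.0, 0.5])
cloud -= cloud.mean(axis=0)

cov = np.cov(cloud, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]              # Eigenvalues, largest first
pc1 = eigvecs[:, order[0]]                     # First Principal Component axis
pc2 = eigvecs[:, order[1]]                     # Second: next-largest Eigenvalue

# PC2 is automatically at right angles to PC1
print(abs(pc1 @ pc2))
```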



Best View?


This opinion of 'best view' is based upon the idea that the projection will maximize the distance between the points in the projection. This is supposed to give maximum clarity and allow accurate identification of patterns or groups in the data.

This approach may be generally 'sound' as the best compromise for a projection, when nothing is known about the data other than the range of its values. But a little reflection should reveal that it is hardly an effective pattern-detecting method.

In fact, other than giving a very basic snap-shot of the data, the PCA technique is not very useful, except in cases where the data is already in a form or has features that are conveniently 'lucky' for this crude projection technique.

We'll see shortly how even a careful and correct application of PCA can catastrophically fail.

Because the success of the method is entirely dependent upon the data itself, it is a wholly unreliable and unrealistic technique for general data analysis.
Nazaroo is offline  
Old 03-14-2007, 11:16 AM   #215
Banned
 
Join Date: Jan 2007
Location: Canada
Posts: 528
Default

Once the PC2 axis is fixed, the observations (MSS) will be projected down onto the plane made from these two axes to create the score plot. Imagine the shadow cast by a three-dimensional (or multi-dimensional) swarm of points onto a wall.




With the light source positioned optimally, the underlying structure of the data is hopefully revealed even though the dimensionality of the data has been dramatically reduced (and spatial information has been lost).

There is no doubt that the PCA method does do what it says it does: It efficiently displays the spread of the data-points according to the Covariance Matrix. What is that? It's a collected set of measurements that summarizes the basic spread of all the variables.

That is, PCA, done properly will give us the projection that has the widest spread of the data points possible in two dimensions. It does this without adding any further distance distortion other than that caused by dropping the dimension perpendicular to the projection.
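The 'shadow on the wall' score plot is just the cloud multiplied by the PC1/PC2 plane. A minimal sketch, on invented data:

```python
import numpy as np

rng = np.random.default_rng(2)
cloud = rng.normal(size=(40, 3)) @ np.diag([5.0, 2.0, 0.4])
cloud -= cloud.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(cloud, rowvar=False))
order = np.argsort(eigvals)[::-1]
plane = eigvecs[:, order[:2]]          # PC1 and PC2 span the projection plane

score_plot = cloud @ plane             # each row: (PC1 score, PC2 score)
explained = eigvals[order[:2]].sum() / eigvals.sum()   # fraction 'explained'
```

The `explained` fraction is the % of total variation that the two-dimensional window accounts for; the rest is lost with the dropped perpendicular dimension.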

But is this 'data analysis'? No. There may be all kinds of critically important geometrical relationships in the data-points that will be left undiscovered. The PCA method has about as much chance of tripping over these as a farmer searching for a needle in a haystack without a magnet.
Nazaroo is offline  
Old 03-14-2007, 11:20 AM   #216
Banned
 
Join Date: Jan 2007
Location: Canada
Posts: 528
Default

What the PCA Projection Missed...

To show the whopping errors possible with PCA, we need only imagine rotating the Principal Component Axis (the one "accounting" for the greatest % of the spread!) end over end as we continue the projection:




What happened? On a set of axes not related to either our original data Variables, or the PCA axes, there was a correlated sine/cosine function, a Hidden Variable.

By freely rotating our cloud under a light, we were able to reveal the hidden pattern, which in turn suggests a hidden and simple rule organizing and possibly 'causing' the readings.

A 'by the book' PCA analysis failed to expose the key pattern in the data, while a simple Rotation Transform, a rotation of the data through a series of angles, nailed it. At the same time, this exercise shows that PCA projections don't really expose hidden variables in data, and can't.
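A Rotation Transform itself is nothing exotic: it is just multiplication by a rotation matrix, which leaves all distances intact. A minimal two-dimensional sketch:

```python
import numpy as np

theta = np.pi / 4                      # rotate through 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

points = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
rotated = points @ R.T                 # apply the rotation to each point

# Rotations preserve lengths, so the points stay on the unit circle
print(np.linalg.norm(rotated, axis=1))
```

Stepping `theta` through a series of angles and re-plotting is the 'rotating the cloud under a light' procedure described above.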


How did the PCA method fail?

Suppose the data plots were position recordings. The object is a tiny magnet suspended in an electromagnetic field. The motion of the object under investigation was a simple circular orbit. Unfortunately, the observer's apparatus is tilted at 45 degrees and mounted in a moving truck travelling along a bumpy road. This adds a random vertical component to each position of the object at 45 degrees to the orbit:




Now the spread of the position measurements is larger on a non-relevant axis than on the important pair, the plane of the orbit.

One might think this kind of situation is rare. But the opposite is the case. Raw field data is quite often compounded by multiple hidden factors influencing the recorded entries. It only takes a couple of layers of influencing factors to totally defeat a PCA projection.

The PCA technique actually failed because there was too much 'noise' in the data, that is, unimportant or unwanted measurements. But one of the claims of PCA proponents is that it is able to 'separate the noise from the signal'.

The sobering fact is that the noise must be an order of magnitude smaller than the signal for PCA to work! But if that were the case above, PCA techniques would be redundant. The essential quality and pattern in the data (a circular orbit) would be obvious from almost any orientation or projection. We wouldn't need Principal Components at all.

This is true generally. If the data-table properly records the positions in a Euclidean space, clusters and patterns of all kinds will retain their essential shapes and groupings from many angles, making PCA redundant. If the 'noise' level is too high, PCA cannot reliably help.
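The bumpy-truck scenario can be reproduced numerically. In this sketch (all numbers invented), a perfect circular orbit in the x-y plane is swamped by large vertical noise, and the First Principal Component locks onto the noise axis instead of the orbit plane:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
orbit = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])  # circle, x-y plane

noisy = orbit.copy()
noisy[:, 2] += rng.normal(scale=3.0, size=t.size)  # big 'bumpy road' noise on z

centered = noisy - noisy.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

# PC1 now points almost straight along the noise axis, not the orbit plane
print(abs(pc1[2]))
```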



What then is PCA?


And what is it really good for? PCA is a good method for displaying spatial relationships or affinities between observations, when the data is already in the right form, relatively free of error, and well-behaved.

It is an excellent final polish, when an experiment has already been designed properly, and unwanted influences have been eliminated from consideration, and the key variables are already identified.

It is best when the data-points in the modeling Space DON'T have any special ordering, or pattern, other than an uneven spread (varying mean and median deviations) in the space. In this case, PCA does a good job as a general compromise for a 'best view' of the data, giving maximum separation.

But is it an effective method to identify hidden patterns in data? No more than any other arbitrary projection method.

Is it a reliable method to establish groupings under investigation? No, unless the data is already free of noise and grouped independently of any chosen axes.


PCA is ideal, when there are no patterns or hidden variables in the data at all, other than an inequality of spread between dimensions.
Nazaroo is offline  
Old 03-14-2007, 05:41 PM   #217
Banned
 
Join Date: Jan 2007
Location: Canada
Posts: 528
Default

Now that most of us have a good grasp of what goes on under the hood in a PCA projection, we are in a good position to assess Willker's alleged PCA plot.



From the chart it is clear that Willker has posted a Score Plot.

We can see that Willker has put the 1st Principal Component along the X -axis, and the 2nd along the Y-axis.

Assuming he really did use an Excel spreadsheet and an add-on package to calculate this, he must have started with an N x N data-table of numbers.

Counting the number of MSS in the plot, he seems to have had at least 15 independent texts. (Some have been doubled up, like D+1071, S+Omega, and f13+Lambda+1424, a whole family of MSS treated as a unit).

It's unlikely that f13, for instance, was entered as separate MSS. Instead Willker probably used a printed critical text of this family. For one thing, the actual detailed readings of the MSS in this family are only available as footnotes to a critical text (f13 can be found on my website).


This shows that Julian's claim that Willker didn't use Groups of MSS (like f13) as input is plainly false, going by the labels Willker himself has used.


As a MS base, Willker has only used 15 MSS and one family (f13) containing about a half-dozen useful MSS that can witness the text.

That's right: Willker has used about 20-25 MSS for his PCA group analysis. The remaining question is, does Willker's 20 MS sample adequately represent the seven already identified texts distinguished by von Soden?



According to his own list, Willker believes he has represented all the groups, but some of his own sample groups are doing 'double-duty', representing two or more of von Soden's groups.

Remarkably, Willker uses only THREE MSS to represent two whole Groups: M3 and M2 are represented by only S, Omega, and MS 28.

Willker uses only SIX MSS to represent two more whole Groups, representing some 700 MSS.

It hardly seems surprising that Willker is unable to distinguish the Groups in either clustering of MSS.


One important question that comes up, since Willker has crapped all over von Soden as 'inaccurate' and 'unreliable', is what about more recent collations and investigations into the groupings of these MSS?

Willker himself notes that Maurice Robinson, probably the world's leading living expert on this passage at this time, "thinks that there are [actually] about 10 different text-types of the PA." The version in Codex D is clearly not the parent of any of these, but it "must represent a near-final descendant of a complex line of transmission."

Robinson is the only man alive to have personally collated ALL the 1,350 extant continuous-text MSS that contain the passage, as well as 1,000 Lectionaries.

So the tendency of recent scholarship is to find MORE text-types (Groups) within the available MS tradition, not less.

Yet Willker's plot, based on only a handful of MSS, shows only 4 groups.

--------------------------------------------------------


This also means that in order to do a PCA plot, Willker must have had at least 15 columns, but no more than 25 columns, of Variation Units. But we know that all the critical texts of the PA show between 30 and 40 Variation Units.

If Willker really did a PCA plot, then he must have ignored over half the variation units, as well as the vast bulk of MSS.

Such a procedure can only be described as poking out one eye, in order to see the lay of the land better.

Can this really be a better way to tease out the delicate groupings of 1,350 MSS falling into apparently at least 10 different text-types?

--------------------------------------------------------

The second important question is posed when we naturally ask:

What Variation Units were used to characterize the MSS?

We would at least know this, if Willker had published his accompanying LOADING PLOT, which is necessary in order to interpret the SCORE PLOT.

This would assume that the Loading Plot was properly labelled, and told us by a name or I.D. number which Variant was being collated in each column.

The names would in turn refer to a list of Variation Units, like the ones in his own appendix.

Note that in Willker's Appendix, he only lists 13 Variation Units of note.

But to have done a PCA plot he would have required a minimum of 15 Units. The Data Table has to be Square, having the same columns as rows, in order to use PCA software that calculates Eigenvectors and Eigenvalues.

Why the discrepancy? Willker must have used a different number of Units, or divided them up differently in his PCA Data-Table.

One way to hide these problems, and avoid other investigators challenging his findings, is to fail to produce the LOADING PLOT. The Loading Plot would inevitably show an equal number of Variables, and be expected to be supported by a mapping and an explanation of the group clusterings found in the Score Plot.

---------------------------------------------------------

Far from Willker's conclusion that there are (only) four groups of MSS, we rather conclude that an inadequate PCA technique was performed.

It looks like Willker cut corners with MSS, readings, supporting documentation, and proper analysis.

The reason is, Willker seems to have simply used the variants and MSS found in the UBS text produced by Metzger and Cardinal Fang for 'translators'. But no adequate PCA analysis can be performed on such a half-assed sampling of variants and MSS as presented in that 'student's text' of the NT.

Willker seems to have used PCA projection as simply a quick and dirty means of displaying the 'groups', without doing any of the necessary work of collating the MSS, or encompassing the full set of variants.

Naturally no rationale for the groupings is offered. No analysis was done.

The result is more predictable than the Score Plot.

We have been given a fuzzy amateur 'snapshot' of SOME of the MSS and SOME of the readings, from a bad angle, in the dark, while the camera was shaking and the exposure and focus were maladjusted.
Nazaroo is offline  
Old 03-14-2007, 05:44 PM   #218
Banned
 
Join Date: Dec 2005
Location: Chicago, IL
Posts: 1,289
Default

Quote:
Originally Posted by Nazaroo View Post
Now that most of us have a good grasp of what goes on under the hood in a PCA projection, we are in a good position to assess Willker's alleged PCA plot.
Have you sent -- or do you intend to send -- this directly to Wieland?

JG
jgibson000 is offline  
Old 03-14-2007, 08:25 PM   #219
Veteran Member
 
Join Date: Feb 2004
Location: Washington, DC (formerly Denmark)
Posts: 3,789
Default

There is no reason to even respond to the silliness posted by Nazaroo and no point in sending it to Wieland unless it would be for entertainment value. Of course, I must make a few comments in the interest of showing just how far you can push sarcasm. In reality, I am actually rather mad that charlatan posts such as Nazaroo's are allowed to stand without comment, and I therefore consider it my duty to show just how ridiculous his words are. While there is no doubt that Nazaroo suddenly figured out that he needed to put some actual numbers out there to illustrate his point and proceeded accordingly, one must be amazed at the final bits. It starts out fairly reasonably: it shows in a limited way what PCA is, cleverly phrased in such a way that he skews the representation to fit his later nazaroo-math-is-from-an-alternate-universe conclusion, but even so, I started out pleasantly surprised.

It then slowly starts to degenerate into eyebrow-raising holy-ghost-statistics, for lack of a better term, where we see entirely unwarranted statements, such as pointing out that if the signal-to-noise differential is very small, it doesn't work as well. Really? It's not magical? Who knew? And how about: if your data is really messy, unrelated and/or disorganized, it also would not produce great results. The PCA is a never-ending disaster! He talks about a rotation transformation. Hmmm, a rotation transformation, if I remember correctly what I have been doing for a living for the last two-and-a-half decades, is the transformation of a vector (n-dimensional point, if you will) through a transformation matrix. Gee, I guess exactly like a PCA... Except, of course, that the PCA does its best to emphasize results of significance in the amplitude domain whereas a rotational view based on a personal 'heuristics' approach (spin it until it fits) would be entirely subjective, in other words, very useful to your evangelist statistician. Of course, in Nazaroo's inept example the researcher would know the nature of the distortion in his data and could adjust for it. The example is contrived to show that the PCA is not perfect, which nobody has ever claimed. But you all know how it works: if a hole can be shown, then an elephant the size of 'faith' can easily pass through.

He then goes on to some genuine, well, I am not sure how I can classify them since this board prohibits anything I might say to describe their nature... Let's take a few examples of his -representations:

Quote:
Originally Posted by Nazaroo, the master physician
But to have done a PCA plot he would have required a minimum of 15 Units. The Data Table has to be Square, having the same columns as rows, in order to use PCA software that calculates Eigenvectors and Eigenvalues.

Why the discrepancy? Willker must have used a different number of Units, or divided them up differently in his PCA Data-Table.

One way to hide these problems, and avoid other investigators challenging his findings, is to fail to produce the LOADING PLOT. The Loading Plot would inevitably show an equal number of Variables, and be expected to be supported by a mapping and an explanation of the group clusterings found in the Score Plot.
The Data Table has to be square. Hmmm. Gee, isn't it interesting that a COVARIANCE MATRIX IS ALWAYS SQUARE!!! If I start out with two manuscripts, for example, and generate a covariance matrix, then the resulting covariance matrix is 2x2, no matter how many measurements I start out with. I could measure 1000 things for, say, three manuscripts, and the result shown in a covariance matrix WOULD BE A 3 x 3 MATRIX!!! Now we must wonder if Nazaroo truly doesn't understand the basics of covariance matrices, or he... A covariance matrix measures the relation of one axis to another.
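Anyone with NumPy installed can check this in ten seconds — the covariance matrix comes out square whatever the shape of the data table (the numbers here are random stand-ins, of course):

```python
import numpy as np

rng = np.random.default_rng(4)
# 1000 measurements for each of 3 'manuscripts': the data table is 1000 x 3
measurements = rng.normal(size=(1000, 3))

# The covariance matrix relates each column (axis) to every other column,
# so it is 3 x 3 regardless of how many rows of measurements there are
cov = np.cov(measurements, rowvar=False)
print(cov.shape)
```

No square data table required, and hence no need to discard measurements to reach the same number of rows as columns.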

Moving right along, we notice that Nazaroo has added von Soden's groups to the PCA plot graphic. He doesn't say that he added them but he did. Of course, von Soden's classifications are not used very much anymore, being somewhat antiquated. He then complains that Willker treats a group like a single manuscript. In this case he goes on about f13 but it might as well have been f1, two manuscript groups that are routinely, traditionally and always treated collectively and not a single text crit scholar in the world would argue with that. Nazaroo would argue with that because it suits his purpose otherwise, rest assured, he would have dismissed the complaint. You will notice that this never comes up in Willker's PCA study which was my stated focus all along. Besides, those two families are from the 12th and 13th centuries, fairly useless in many ways (but they do have a number of interesting features), this, of course, makes them excellent for byzantine purposes. Except when they don't. Willker never uses groups in his PCA study. He states that he uses Swanson which in GJohn makes use of 50 or so manuscripts (I didn't bother counting them, being tired and all, I leave this as an exercise for the reader. I wouldn't trust Nazaroo's counting abilities, though).

Statement after statement crashes on the rocks and sinks into ignominy, " he must have started with an N x N data-table of numbers," WRONG.

Scholarship is looking for more text types, WRONG. They are looking for a better definition or delineation, no telling where it might lead.

Julian's claim about Willker not using groups as input is false, WRONG. I have posted the link to the PCA study and referred to that. Someone show me where, in that study, Willker uses groups as input. Good luck, hope you live that long. I am not talking about his commentary, never was.

Most of Nazaroo's points rely on Willker's necessity of needing a square matrix which is entirely and absolutely incorrect and shows his the-count-from-sesame-street level of understanding of linear algebra.

In conclusion, I will say that I respect the rules of this board and as a moderator I am bound to uphold them. This prevents me from stating my actual opinion and true emotions regarding Nazaroo as a person, and that's where I will leave that issue. As for his posts, I can certainly say a lot, but frankly, his infantile and amateurish crap (and I shall allow myself such language on this occasion) posted on this forum as well as on his website just goes to illustrate that there is no substitute for thinking for yourself. May Thor (who is cool) have mercy upon those that use Nazaroo's madness-proximity-posts to rot their brains. Amen.

Julian,
An ignorant-in-over-his-head-ditchdigger

P.S. I will not reply to, or comment on, any more posts from anyone who posts material that is below the level of intellect required to comprehend 'My Pet Goat.' Naturally, that includes Nazaroo's posts.
Julian is offline  
Old 03-14-2007, 10:05 PM   #220
Veteran Member
 
Join Date: Jun 2004
Location: none
Posts: 9,879
Default

Quote:
Originally Posted by jgibson000 View Post
Have you sent -- or do you intend to send -- this directly to Wieland?

JG
:rolling:
Chris Weimer is offline  
 
