Freethought & Rationalism Archive

Nazaroo · 03-13-2007, 08:09 PM

While police like to lay as many charges as they can, once they start, so that *some* guilt looks more credible, judges tend to frown on wasting the court's time, and prosecutors usually drop the lesser charges in order to pursue the serious ones.

You strain everyone's patience and your own credibility when you make a mountain out of a molehill.

I am willing to concede that the victim should have some kind of say in whether or not there has been a legitimate slight or injury, but this must be guided by the perspective that common sense and experience of the world offers.

Thus, I would pay close attention to a man who claimed I stepped on his foot, because I know that can be painful and the injury is the same whether by accident or purpose, and I can accommodate his anger, suspicion and over-reaction. I know I would probably react similarly.

But it is difficult to believe that my inattention or tiredness, and a few harmless errors as noted above, are any kind of similar injury for a mature adult.

As to calling you, or *anyone* "doctor", who doesn't chop open bodies to perform life-saving operations, or who hasn't spent 5 years interning on an emergency ward, I can assure you that will never happen in a million years.

Perhaps your 'doctorate' is of great personal value to you, seeing as you probably spent 5 years and possibly $20,000 to get it. But it will have to be your personal trophy. I would no more honour that than I would honour a hooligan for being on the team that won a world-cup soccer final.

When I am a guest in someone's house, and they proudly point to their diplomas, hockey trophies, or souvenirs from African safaris, I nod politely and inch toward the door, excusing myself for having to get back to work.

You can earn all the honours among men you like, but a Christian will always evaluate them as toilet paper in honour of a far greater teacher.

Quote:

How can you believe, who receive honour from one another,
and seek not the honour that comes from God the Father only?

(John 5:41-44)

jgibson000 · 03-13-2007, 08:19 PM

Quote:

Originally Posted by Nazaroo

While police like to lay as many charges as they can, once they start, so that *some* guilt looks more credible, judges tend to frown on wasting the court's time, and prosecutors usually drop the lesser charges in order to pursue the serious ones.

You strain everyone's patience and your own credibility when you make a mountain out of a molehill.

I am willing to concede that the victim should have some kind of say in whether or not there has been a legitimate slight or injury, but this must be guided by the perspective that common sense and experience of the world offers.

Thus, I would pay close attention to a man who claimed I stepped on his foot, because I know that can be painful and the injury is the same whether by accident or purpose, and I can accommodate his anger, suspicion and over-reaction. I know I would probably react similarly.

But it is difficult to believe that my inattention or tiredness, and a few harmless errors as noted above, are any kind of similar injury for a mature adult.

As to calling you, or *anyone* "doctor", who doesn't chop open bodies to perform life-saving operations, or who hasn't spent 5 years interning on an emergency ward, I can assure you that will never happen in a million years.

Perhaps your 'doctorate' is of great personal value to you, seeing as you probably spent 5 years and possibly $20,000 to get it. But it will have to be your personal trophy. I would no more honour that than I would honour a hooligan for being on the team that won a world-cup soccer final.

When I am a guest in someone's house, and they proudly point to their diplomas, hockey trophies, or souvenirs from African safaris, I nod politely and inch toward the door, excusing myself for having to get back to work.

You can earn all the honours among men you like, but a Christian will always evaluate them as toilet paper in honour of a far greater teacher.

And yet you, who brag about your IQ and how superior a scientist and textual critic you are, still can't get your numbers and your names right.

JG

Apikorus · 03-13-2007, 08:49 PM

Whoa! We need a "time out" in this thread.

Quote:

Originally Posted by Nazaroo

As to calling you, or *anyone* "doctor", who doesn't chop open bodies to perform life-saving operations, or who hasn't spent 5 years interning on an emergency ward, I can assure you that will never happen in a million years.

This would exclude dermatologists from such an honor. Hmmm...maybe not such a bad idea.

Amaleq13 · 03-13-2007, 08:56 PM

Please return to a discussion of the thread subject.

Thanks in advance,

Doug aka Amaleq13, BC&H moderator

Nazaroo · 03-14-2007, 10:20 AM

I was going to simply critique Willker for his lame application of PCA to the PA.
Others diverted us to the question of PA in Textual Criticism generally.

I've decided to just finish this, by showing a few things about PCA that make it clear that many claims made for it have been greatly exaggerated.

So I'm stating my thesis, and then I'm going to do a 'walk through' of PCA, pointing out some of its features and the implications.

Opening Thesis:

(1) Principal Component Analysis (PCA) is not new.

The basic ideas and techniques have been around almost as long as Cartesian coordinates, and as long as the connection of algebra to geometry.

(2) PCA is not an 'automatic' process.

In any important analysis of data, a lot of theoretical groundwork, experiment, trial-and-error, and re-thinking of the problem in its details and its larger context is necessary. Theoretical frameworks and hypotheses must be used to make any sense or practical use of the data.

(3) PCA doesn't give the 'right' answer.

Data often contains many spurious patterns, and can have complex layers of independent cause and effect. There is no guarantee that PCA or any other projection technique will expose the 'correct' pattern in the data, among many. At best, a successful application of PCA will expose the 'loudest' patterns, or the crudest. But only an intelligent theoretical framework can assign importance, relative weights, or even meaning to 'patterns' in data.

(4) PCA doesn't separate the 'noise' from the 'signal'.

PCA is not any reliable indicator or sorting method for separating signal/noise. If the 'noise' is also patterned and significant in magnitude, it can even be mistaken for the signal.

(5) PCA puts severe restrictions on the type, form and content of data.

This effectively requires the experiment designer to 'prefix' the outcome of a PCA analysis, by forcing the kind of data collected to allow a legitimate application of PCA. Secondly, the data must be pre-filtered of non-relevant measurements, (Components!), and finally the data must by good fortune already have a certain type of content and conform to a 'best behaviour for the method' standard, for the PCA projection to be ideal and useful.

All of this together makes PCA a great way to project some kinds of data onto a 2-d chart, but a lousy way to establish relationships between objects under investigation, even for such simple questions as 'grouping'.

PCA is a 'method' without a guiding scientific basis or philosophy, and its requirements and presumptions about the data make it of limited use.
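Point (4) of the thesis can be illustrated with a small numerical sketch. The data below are invented for the purpose: eight hypothetical objects fall into two true groups separated by one unit on the first variable, while the second variable carries a large systematic drift unrelated to the grouping. The names `group`, `drift` and `pc1_scores` are assumptions of this sketch, and NumPy's SVD stands in for whatever PCA software is actually used.

```python
import numpy as np

# Hypothetical data: 8 objects in two true groups (the 'signal'),
# separated by 1 unit along the first variable.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
signal = group.astype(float)

# A patterned but irrelevant effect (the 'noise'): a large systematic drift
# along the second variable that has nothing to do with group membership.
drift = np.array([-6.0, -2.0, 2.0, 6.0, -6.0, -2.0, 2.0, 6.0])

X = np.column_stack([signal, drift])
Xc = X - X.mean(axis=0)                      # centre each column

# Principal axes come from the SVD of the centred data.
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                                  # direction of greatest variance
explained = S**2 / np.sum(S**2)              # share of variance per component

# Position of every object along the first principal component.
pc1_scores = Xc @ pc1

print("PC1 direction:", np.round(pc1, 3))
print("variance share of PC1:", round(float(explained[0]), 3))
print("mean PC1 score, group 0:", round(float(pc1_scores[group == 0].mean()), 6))
print("mean PC1 score, group 1:", round(float(pc1_scores[group == 1].mean()), 6))
```

In this constructed case the first component follows the drift variable, and the two true groups have the same average position along it, so the 'loudest' pattern is not the grouping. Whether real data behave this way depends entirely on the data.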

Nazaroo · 03-14-2007, 10:29 AM

Multivariate Data Analysis:
The Big Picture

Multivariate Data Analysis (MDA) is about separating the signal from the noise in sets of data involving many variables, and presenting the results as easy to interpret plots or graphs. Any large complex table of data can in theory be easily transformed into intuitive pictures, summarizing the essential information.

In the field of MDA, several popular methods that are based on mathematical projection have been developed to meet different needs: PCA, PLS, and PLS-DA.

Getting an Overview with PCA

Principal Components Analysis provides a concise overview of a dataset and is often the first step in any analysis. It can be useful for recognising patterns in data: outliers, trends, groups etc.

PCA is a very useful method to analyze numerical data already organized in a two-dimensional table (of M 'observations' by N 'variables'). It allows us to continue analyzing the data and:

- Quickly identify correlations between the N variables,
- Display the groupings of the M observations (initially described by the N variables) on a low dimensional map, which in turn helps identify a variability criterion,
- Build a set of P uncorrelated factors (P <= N ) that can be reused as input for other statistical methods (such as regression).
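
The three uses listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the 6 x 3 table is invented, the variable names are assumptions, and PCA is done here with NumPy's SVD on column-centred data.

```python
import numpy as np

# Hypothetical table: M = 6 observations (rows) by N = 3 variables (columns).
# The numbers are invented purely for illustration.
X = np.array([
    [90.0, 85.0, 40.0],
    [88.0, 80.0, 45.0],
    [85.0, 82.0, 42.0],
    [50.0, 55.0, 88.0],
    [48.0, 52.0, 90.0],
    [52.0, 50.0, 85.0],
])

# 1. Correlations between the N variables.
corr = np.corrcoef(X, rowvar=False)

# 2. Centre the columns and take the SVD; the rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of the M observations on all factors, and on a 2-D map.
scores = Xc @ Vt.T
map_2d = scores[:, :2]

# 3. The factors are uncorrelated: their covariance matrix is diagonal.
factor_cov = np.cov(scores, rowvar=False)

print(np.round(corr, 2))
print(np.round(map_2d, 2))
print(np.round(factor_cov, 6))
```

The columns of `scores` could then be passed to another method such as regression in place of the original, correlated variables.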

The limits of PCA stem from the fact that it is a projection method, and sometimes the visualization can lead to false interpretations. There are however some tricks to avoid these pitfalls.

Establishing Relationships with PLS

With Projections to Latent Structures, the aim is to establish relationships between input and output variables, creating predictive models that can be tested by further data collection.

Classification of New Objects with PLS-DA

PLS-Discriminant Analysis is another powerful method for classification. Again, the aim is to create a predictive model, but one which is general and robust enough to accurately classify future unknown samples or measurements.

Summary

Each of these methods or techniques has a variety of related goals associated with it, and their appropriateness and success are partly determined by the nature of the problem and the data to which they are applied. They are simply some of the many mathematical tools available to scientific researchers.

Secondly, each method requires that a meaningful framework for the experiment be prepared beforehand, and that the data be properly prepared. Finally, interpretation of the results must be guided by a rational theoretical model with plausible explanatory power, relating the numerical and graphical objects to the real world in an unambiguous and scientific manner.

Nazaroo · 03-14-2007, 10:34 AM

The Mechanics of PCA
A 'Walk Through'

Suppose we are investigating textual patterns in different manuscripts (abbrev. = 'MSS', singular = 'MS'). The numbers in our data-table could refer to the percentage agreement between manuscripts and known Versions (standard translations) or Text-types (what a layman would call a 'version' of a text, i.e., a certain group of associated 'readings'.).

It's hard to see and understand the relationships among and between the MSS and the versions by just staring at numbers. We can't fully interpret most tables by inspection. We need a pictorial representation of the data, a graph or chart.

Converting Numbers into Pictures

By using Multivariate Projection Methods, we can transform the numbers into pictures. Note the names of the columns and rows in our sample data-table. These are the conceptual 'objects' we want to understand (manuscripts and versions).
The first plot is like a map, showing how the MSS relate to each other based on their "agreement profiles" (nearness = agreement). MSS with similar profiles lie close to each other in the map while MSS with different profiles are farther apart. 'Distance' is related to similarity, or affinity. We call such a map a Score Plot. We will see how a plot is created shortly.

For example, the Alexandrian MSS might cluster together in the top right quadrant while the Byzantine MSS form a cluster to the left.

The co-ordinates of the map (or number scales) are called the first two Principal Components, and the (x,y) values for each object are calculated from all of the data taken together (we'll see how shortly).

Having examined the score plot, it is natural to ask why, say, the Egyptian and Caesarian MSS cluster together. This information can be revealed by the second plot called a Loading Plot.

In the top right quadrant (the same area as the Egyptian cluster) we might find Coptic and Alexandrian Versions all of which have close affinities to an Egyptian/Alexandrian Text-type. By contrast, the Byzantine MSS contain many Western/Latin readings.

What PCA does then, is give us a pair of 'maps', a map for each set of conceptual objects, and these maps are related to one another through the numerical data itself. One map turns the rows from our data-table into dots on a map (in an X-Y 'object-space'), and the other map converts the columns into dots and places them on a similar map.

Each map can tell us about its own objects, but to really understand either map we need both, because we need to know the basis upon which each map was made. Having both maps is the only way we can truly know if either map has any meaning or significance.
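
As a rough sketch of how the two maps come out of one table, the following computes score and loading coordinates for an invented percentage-agreement table. The manuscript and version labels and all numbers are hypothetical, and the SVD-based calculation is one common way of doing PCA, assumed here for illustration.

```python
import numpy as np

# Hypothetical percentage-agreement table (invented numbers).
mss = ["MS-A", "MS-B", "MS-C", "MS-D"]            # rows: manuscripts
versions = ["Coptic", "Latin", "Syriac"]          # columns: versions
X = np.array([
    [92.0, 40.0, 55.0],
    [88.0, 45.0, 60.0],
    [42.0, 90.0, 58.0],
    [38.0, 86.0, 62.0],
])

Xc = X - X.mean(axis=0)                           # centre each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U[:, :2] * S[:2]      # one (x, y) point per manuscript: the Score Plot
loadings = Vt[:2].T            # one (x, y) point per version: the Loading Plot

# The two maps are tied together by the data: scores @ loadings.T is the
# best rank-2 approximation of the centred table.
approx = scores @ loadings.T
residual = np.abs(Xc - approx).max()

for name, (x, y) in zip(mss, scores):
    print(f"{name:8s} score   = ({x:7.2f}, {y:7.2f})")
for name, (x, y) in zip(versions, loadings):
    print(f"{name:8s} loading = ({x:7.2f}, {y:7.2f})")
print("largest entry lost by the 2-D approximation:", round(float(residual), 3))
```

The sign of each axis is arbitrary in an SVD, so a cluster may appear mirrored from one run or package to another; only relative positions carry meaning.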

Nazaroo · 03-14-2007, 10:39 AM

How It's Done

Getting Started
To transform a table of numbers into PCA 'maps' we first create a model of the data, in a Multi-dimensional Object Space. Then we use a Normalization technique to size and shape it, a Transform technique to orient it, and a Projection method to flatten it onto an X-Y plane.

A PCA 'shadow projection' can tell us a lot about a situation, even though the information has been greatly simplified or reduced. But we should always keep in mind that severe distortion and information loss are inevitably involved in PCA methods. This is a 'lossy' process, meaning that some information is completely lost in going from data-table to picture. This can mean both missing cues in the picture, and artificial mirage-like 'artifacts' or distortions. The result can be severely misleading, and any 'discovery' appearing in a projection must be independently verified using other techniques.
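
The loss can be put in numbers. In the sketch below, six invented points in three dimensions are projected onto their first two principal components. The data are constructed (an assumption of this example) so that two of the points differ only along the discarded direction.

```python
import numpy as np

# Hypothetical 3-D data: most of the spread lies along x and y,
# while the last two points differ only along z.
X = np.array([
    [-4.0, 0.0, 0.0],
    [ 4.0, 0.0, 0.0],
    [ 0.0, -3.0, 0.0],
    [ 0.0, 3.0, 0.0],
    [ 0.0, 0.0, 1.0],
    [ 0.0, 0.0, -1.0],
])

Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of the total variance that survives in the 2-D picture.
retained = np.sum(S[:2] ** 2) / np.sum(S ** 2)

# The flattened 'shadow': coordinates on the first two components only.
proj = Xc @ Vt[:2].T

true_dist = np.linalg.norm(Xc[4] - Xc[5])      # distance in the full space
shown_dist = np.linalg.norm(proj[4] - proj[5]) # distance on the 2-D map

print("variance retained:", round(float(retained), 4))
print("true distance between last two points:", round(float(true_dist), 4))
print("distance shown on the map:", round(float(shown_dist), 4))
```

Here the map keeps most of the variance, yet two distinct points land on the same spot. A high retained-variance figure therefore does not guarantee that any particular pair of objects is shown faithfully.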

We start with some "Observations" (objects, individuals, in our case MSS…) and "Variables" (measurements, reading sets, in our case 'Versions', …) in a data-table. Choosing and organizing the data-table itself is part of the setup and experimental design, and will be guided by initial presumptions (axioms) and hypotheses (ideas to test).

Our data-table has two dimensions (or at least we usually take two at a time). Each group of objects (dimensions of the table) will get its own 'map'. The Row-objects ("Observations") will be the 'Score Plot' and the Column-objects ("Variables") will be the 'Loading Plot'. In our example, the Row-objects are MSS, and the Column-objects are Versions.

If we looked at just one Variable ( i.e., one column, = one 'Version' ), we could plot all the Observations (MSS) on a single line, but we don't really need to. We can just rearrange the rows to put the measurements in order, instead.

With two Variables (Versions), we have an X-Y plane, an abstract 'Data Space' to plot the MSS in terms of the two chosen Variables (the two columns from the data-table).

(To assist those without much math or geometry experience, this Phase Space, or 'Data Space' has no relation to physical space. It's an abstract space, a world we are creating where MSS simply 'float'. In this imaginary world, distance and position represent some kind of similarity between the MSS. )

Only Numbers Can be Used

Something we haven't mentioned yet is that the data-table must be made of numbers. The actual measurements have to be numerical, grouped by type, and in the same units. We can't mix apples and oranges, distance and time, or miles and meters. More importantly, we can't even use non-numerical measurements at all, unless we can place them on a numerical scale.

We can't use colors or shapes, or even variant readings in a manuscript! Not unless we can somehow represent them as numbers on a linear scale.

And not just any numbers, but quantities having some clear and rational meaning. For instance, colors could be recorded by frequency or wavelength, each color having a unique place on a properly graduated scale of only one dimension.

At the very least, the numbers in each column have to be of the same kind and in the same units. As members of a set of possible values on a single-dimensional scale, they are undirected quantities, that is, simple scalars, not vectors. (vector measurements having associated directions would be broken down into scalar components).

We will talk about the problem of representing MSS readings as numbers again later. Generating a table of "percentages of agreement" like ours is no trivial task. We skipped that part of the problem by having a table ready. Here we are just assuming it's done, so we can illustrate PCA techniques.
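
To give a feel for what such a table involves, here is one deliberately simple way to turn readings into percentages. Everything in it is an assumption made for illustration: the witnesses, the reading labels, and especially the rule of skipping places where a witness is lacking. Real collation involves many more decisions than this.

```python
# Hypothetical collation: each witness lists its reading ("a", "b", ...) at
# five variation units; None marks a place where the witness is lacking.
versions = {
    "V1": ["a", "a", "b", "a", "b"],
    "V2": ["b", "a", "a", "b", "b"],
}
manuscripts = {
    "MS-A": ["a", "a", "b", "a", "a"],
    "MS-B": ["b", "a", "a", None, "b"],
}

def percent_agreement(ms, version):
    """Percentage of units, among those where both witnesses are extant, that agree."""
    shared = [(m, v) for m, v in zip(ms, version) if m is not None and v is not None]
    if not shared:
        return None  # no overlap, so no percentage can be stated
    agree = sum(1 for m, v in shared if m == v)
    return 100.0 * agree / len(shared)

# Rows are manuscripts, columns are versions, as in the walk-through.
table = {
    name: [percent_agreement(readings, v) for v in versions.values()]
    for name, readings in manuscripts.items()
}

for name, row in table.items():
    print(name, row)
```

Note that MS-B's percentages rest on four units instead of five, so the numbers in one row are not strictly comparable with those in another. That is one of the choices hidden inside a ready-made table.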

Nazaroo · 03-14-2007, 10:47 AM

Here I'm going to link to a few pictures from the Umetrics site, for review purposes:

Moving to More Dimensions

With 3 Variables (or Versions), we can make a 3-dimensional space. More than 3 dimensions may be hard to picture, but it's easy to handle mathematically. We can actually have any number of dimensions we want. Each dimension will represent an independent 'Variable', in our example, a 'Version' or Text-type.

The essential idea is that one dimension of our data-table (the MSS) will become the Objects or points to be plotted. The other dimension of the table (the Versions) will become the new dimensions or coordinate axes of our Phase Space. To make this work, the numbers in any given column must be of the same kind and units, and each must be a simple scalar.

A Euclidean Metric

Something that is rarely talked about in simple introductions to PCA is the problem of the Metric of the Phase Space. The Metric of a 'Space' can be called a statement of the relations between points or coordinate axes, but more plainly, it's about 'true distance' in space, regardless of direction.
If we are going to represent MSS (or anything else) as a set of points some fixed set of relative distances apart, then we have to be able to measure distances. We also need to be able to compare distances, independently of direction or position.

And we need to be able to say things like "MS A is the same distance from MS B as MS F is from G", or "It is a greater distance than that between MSS P and Q." Only being able to reliably express these relations will allow us to group MSS together, and understand their true relation to other groups.

So distances in a space are expected to be constant (invariable) regardless of a change in orientation of coordinates. In fact, this property is mandatory if 'distance' in space is to have any objective meaning, and if a multi-dimensional model is to properly represent the 'distances' between objects.
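
A quick numerical check of this property, using two invented points and an arbitrary rotation of the coordinate axes (the coordinates and the angle are assumptions of the sketch):

```python
import math

def rotate(point, theta):
    """Rotate a 2-D point about the origin by `theta` radians."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

ms_a, ms_b = (1.0, 2.0), (4.0, 6.0)     # hypothetical coordinates of two MSS
theta = math.radians(37.0)              # arbitrary change of orientation

before = math.dist(ms_a, ms_b)
after = math.dist(rotate(ms_a, theta), rotate(ms_b, theta))

print(before, after)   # the two values agree up to floating-point rounding
```

The individual coordinates change under the rotation, but the distance between the two points does not.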

In Euclidean Geometry, distances are expressed by the generalized Pythagorean Theorem. In algebraic terms, the theorem states that a^2 + b^2 = c^2, where c is the hypotenuse and a and b are the legs of a right triangle.

But the formula's most important use is in calculating the distance between any two arbitrary points, given their coordinates. The unknown distance is made the hypotenuse, while the shorter sides correspond to the difference between the two points in X terms ( Δx ) and Y terms ( Δy ).

The equation is then solved for c:

( (Δx)^2 + (Δy)^2 )^(1/2) = c .

The distance formula is generalized for any number of dimensions:

distance = [ (Δx1)^2 + (Δx2)^2 + (Δx3)^2 + ... ]^(1/2)
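
The generalized formula translates directly into a short function. The function name and the sample points below are illustrative assumptions.

```python
import math

def euclidean_distance(p, q):
    """Distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d2 = euclidean_distance((0.0, 0.0), (3.0, 4.0))                # 2 dimensions
d3 = euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 15.0))     # 3 dimensions
print(d2, d3)
```

The same function works unchanged for any number of dimensions, which is all that "generalized" means here.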

For the formula to have any use, and for distance to be meaningful, it must be constant. But for this to be true, scales of the coordinate axes must be linear (the numbers must be equally spaced on the axis), and be of the same size units. Furthermore, the units must be the same as those used to measure distance in any direction in the space.
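
The dependence on units is easy to demonstrate. In this sketch three invented objects are described by two columns in different units, and the only change between the two runs is the unit of the second column. The object names and values are assumptions made for the example.

```python
import math

# Hypothetical objects described by two columns: (length in miles, height in meters).
objects_m = {"A": (1.0, 100.0), "B": (2.0, 100.0), "C": (1.0, 900.0)}

def nearest(name, table):
    """Return the name of the object closest to `name` by Euclidean distance."""
    others = (k for k in table if k != name)
    return min(others, key=lambda k: math.dist(table[name], table[k]))

# Same measurements, but with the second column re-expressed in kilometers.
objects_km = {k: (x, h / 1000.0) for k, (x, h) in objects_m.items()}

print(nearest("A", objects_m))    # second column in meters
print(nearest("A", objects_km))   # second column in kilometers
```

The nearest neighbour of A changes when the unit changes, although nothing about the objects themselves has changed. Any grouping based on such distances inherits the same arbitrariness unless the columns share a common, meaningful scale.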

Said another way, we require our space to be a Euclidean Space, in which distance follows the generalized Pythagorean formula.

Why?

Because one of the first things we do in PCA is to take distance measurements of the objects in this space, to determine which axes efficiently cover the maximum variation in distance between the objects. Even though we do this indirectly (i.e., algebraically not graphically), it's done using actual coordinates, and we rely upon calculations that use the generalized Pythagorean Theorem throughout the operation.

The PCA technique requires then, and assumes, that the data columns to be operated on and converted into coordinate axes are linear in magnitude, are equivalent in nature and type, are expressed in the same units, and are capable of forming an N-Dimensional Euclidean Space.

Nazaroo · 03-14-2007, 10:52 AM

A Generalized Example

A PCA Transform and Projection Begins

Obviously with a space having a large number of dimensions, we cannot visualize or recognise the patterns, or even make a useful model. The first part of the technique involves reducing the number of dimensions to something manageable.

But if we just plot 3 columns of our data-table as a 3-Dimensional Space we may just get an arbitrary cloud of points in space with no apparent order or pattern, at least from the point of view of our observed variables (the 3 selected columns).

Purpose and Goal of PCA

A 3-d model of the Observed Variables lets us down. No plain correlation or pattern seems to appear. These observed variables are called Manifest Variables to distinguish them from any Hidden Variables, that will expose patterns and possibly reveal underlying causes or processes.

The key with PCA is to find new hidden variables (in our current example, two) for the plot, that will organize the data and reveal underlying causes or effects. These hidden variables will become the new coordinate axes for a projection that reveals the hidden structure.

At the same time, the left-over hidden variables (dimensions) that retain unused information or complications that veil the pattern are ignored. This reduces the model back down to two dimensions, for a clean projection.
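
The search for hidden variables can be sketched as an eigen-decomposition of the covariance matrix. The data below are constructed (an assumption of this example) so that a single hidden quantity `t` drives all three observed columns, with a small invented disturbance `e` added. The eigenvector route shown is one standard way to compute the components.

```python
import numpy as np

# A hidden quantity that none of the three observed columns records directly.
t = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])

# Small invented disturbances added on top of the hidden quantity.
e = 0.1 * np.array([
    [ 1.0, -1.0,  0.0],
    [ 0.0,  1.0, -1.0],
    [-1.0,  0.0,  1.0],
    [ 1.0,  0.0, -1.0],
    [-1.0,  1.0,  0.0],
    [ 0.0, -1.0,  1.0],
])

# Three manifest variables, each equal to the hidden one plus a disturbance.
X = np.outer(t, np.ones(3)) + e
Xc = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix are the candidate hidden axes;
# eigenvalues give the variance along each. eigh returns ascending order.
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

pc1 = vecs[:, 0]
alignment = abs(pc1 @ np.ones(3)) / np.sqrt(3.0)   # 1.0 means an equal mix of all columns
fraction = vals[0] / vals.sum()                    # variance share of the first axis

# Keep two hidden variables as the new coordinates; the third is ignored.
hidden = Xc @ vecs[:, :2]

print("first hidden axis:", np.round(pc1, 3))
print("alignment with an equal mix of the columns:", round(float(alignment), 4))
print("variance share of first axis:", round(float(fraction), 4))
```

Each hidden variable is only a weighted combination of the observed columns. The decomposition ranks those combinations by variance; it does not say what, if anything, they correspond to in the real world.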

The unused information is treated like 'noise', and the isolated pattern is viewed as the 'signal'. In this way, PCA enthusiasts can talk of PCA as a "method of separating the signal from the noise", although strictly speaking this implies unsubstantiated assumptions about the data.

03-14-2007, 10:34 AM	#207
Nazaroo Banned Join Date: Jan 2007 Location: Canada Posts: 528	The Mechanics of PCA A 'Walk Through' Suppose we are investigating textual patterns in different manuscripts (abbrev. = 'MSS', singular = 'MS'). The numbers in our data-table could refer to the percentage agreement between manuscripts and known Versions (standard translations) or Text-types (what a layman would call a 'version' of a text, i.e., a certain group of associated 'readings'.). Its hard to see and understand the relationships among and between the MSS and the versions by just staring at numbers. We can't fully interpret most tables by inspection. We need a pictorial representation of the data, a graph or chart. Converting Numbers into Pictures By using Multivariate Projection Methods, we can transform the numbers into pictures. Note the names of the columns and rows in our sample data-table. These are the conceptual 'objects' we want to understand (manuscripts and versions). The first plot is like a map, showing how the MSS relate to each other based on their "agreement profiles" (nearness = agreement). MSS with similar profiles lie close to each other in the map while MSS with different profiles are farther apart. 'Distance' is related to similarity, or affinity. We call such a map a Score Plot. We will see how a plot is created shortly. For example, the Alexandrian MSS might cluster together in the top right quadrant while the Byzantine MSS form a cluster to the left. The co-ordinates of the map (or number scales) are called the first two Principal Components, and the (x,y) values for each object are calculated from all of the data taken together (we'll see how shortly). Having examined the score plot, it is natural to ask why, say, the Egyptian and Caesarian MSS cluster together. This information can be revealed by the second plot called a Loading Plot. 
In the top right quadrant (the same area as the Egyptian cluster) we might find Coptic and Alexandrian Versions all of which have close affinities to an Egyptian/Alexandrian Text-type. By contrast, the Byzantine MSS contain many Western/Latin readings. What PCA does then, is give us a pair of 'maps', a map for each set of conceptual objects, and these maps are related to each another through the numerical data itself. One map turns the rows from our data-table into dots on a map (in an X-Y 'object-space'), and the other map converts the columns into dots and places them on a similar map. Each map can tell us about its own objects, but to really understand either map we need both, because we need to know the basis upon which each map was made. Having both maps is the only way we can truly know if either map has any meaning or significance.

03-14-2007, 10:39 AM	#208
Nazaroo Banned Join Date: Jan 2007 Location: Canada Posts: 528	How Its Done Getting Started To transform a table of numbers into PCA 'maps' we first create a model of the data, in a Multi-dimensional Object Space. Then we use a Normalization technique to size and shape it, a Transform technique to orient it, and a Projection method to flatten it onto an X-Y plane. A PCA 'shadow projection' can tell us a lot about a situation, even though the information has been greatly simplified or reduced. But we should always keep in mind that severe distortion and information loss are inevitably involved in PCA methods. This is a 'lossy' process, meaning that some information is completely lost in going from data-table to picture. This can mean both missing cues in the picture, and artificial mirage-like 'artifacts' or distortions. The result can be severely misleading, and any 'discovery' appearing in a projection must be independantly verified using other techniques. We start with some "Observations" (objects, individuals, in our case MSS…) and "Variables" (measurements, reading sets, in our case 'Versions', …) in a data-table. Choosing and organizing the data-table itself is part of the setup and experimental design, and will be guided by initial presumptions (axioms) and hypotheses (ideas to test). Our data-table has two dimensions (or at least we usually take two at a time). Each group of objects (dimensions of the table) will get its own 'map'. The Row-objects ("Observations") will be the 'Score Plot' and the Column-objects ("Variables") will be the 'Loading Plot'. In our example, the Row-objects are MSS, and the Column-objects are Versions. If we looked at just one Variable ( i.e., one column, = one 'Version' ), we could plot the all the Observations (MSS) on a single line, but we don't really need to. We can just rearrange the rows to put the measurements in order, instead. 
With two Variables (Versions), we have an X-Y plane, an abstract 'Data Space' in which to plot the MSS in terms of the two chosen Variables (the two columns from the data-table).

(To assist those without much math or geometry experience, this Phase Space, or 'Data Space', has no relation to physical space. It's an abstract space, a world we are creating where MSS simply 'float'. In this imaginary world, distance and position represent some kind of similarity between the MSS.)

Only Numbers Can Be Used

Something we haven't mentioned yet is that the data-table must be made of numbers. The actual measurements have to be numerical, grouped by type, and in the same units. We can't mix apples and oranges, distance and time, or miles and meters.

More importantly, we can't use non-numerical measurements at all, unless we can place them on a numerical scale. We can't use colors or shapes, or even variant readings in a manuscript! Not unless we can somehow represent them as numbers on a linear scale. And not just any numbers, but quantities having some clear and rational meaning. For instance, colors could be recorded by frequency or wavelength, each color having a unique place on a properly graduated scale of only one dimension.

At the very least, the numbers in each column have to be of the same kind and in the same units. As members of a set of possible values on a single-dimensional scale, they are undirected quantities, that is, simple scalars, not vectors. (Vector measurements having associated directions would be broken down into scalar components.)

We will talk about the problem of representing MSS readings as numbers again later. Generating a table of "percentages of agreement" like ours is no trivial task. We skipped that part of the problem by having a table ready. Here we are just assuming it's done, so we can illustrate PCA techniques.
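For readers who prefer to see the three steps from this post (Normalization, Transform, Projection) in concrete form, here is a minimal sketch in Python with NumPy. The MS names, Version names and percentages are all invented for illustration and are not taken from the actual data-table. Scaling each column to unit variance and using the SVD are assumptions on my part, one common way of carrying out the steps, not necessarily the way any particular PCA package does it.

```python
import numpy as np

# Hypothetical data-table: rows are MSS (Observations), columns are
# Versions (Variables). All values are invented percentages of agreement.
mss_names = ["MS_A", "MS_B", "MS_C", "MS_D", "MS_E"]
version_names = ["Version_1", "Version_2", "Version_3"]
table = np.array([
    [90.0, 60.0, 55.0],
    [85.0, 65.0, 50.0],
    [60.0, 88.0, 70.0],
    [58.0, 92.0, 72.0],
    [70.0, 75.0, 90.0],
])

# Normalization: centre each column on its mean, then scale it to unit
# sample variance so that no single column dominates.
centred = table - table.mean(axis=0)
scaled = centred / centred.std(axis=0, ddof=1)

# Transform: the SVD finds a new set of orthogonal axes, ordered by how
# much of the spread in the data each one covers.
u, s, vt = np.linalg.svd(scaled, full_matrices=False)

# Projection: keep only the first two of the new axes.
scores = u[:, :2] * s[:2]   # one row per MS      -> the 'Score Plot'
loadings = vt[:2].T         # one row per Version -> the 'Loading Plot'

for name, (x, y) in zip(mss_names, scores):
    print(f"{name}: ({x:.2f}, {y:.2f})")
for name, (x, y) in zip(version_names, loadings):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

The two arrays correspond to the two 'maps': each row of `scores` is a dot for one MS, and each row of `loadings` is a dot for one Version, both expressed on the same pair of new axes.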

03-14-2007, 10:47 AM	#209
Nazaroo Banned Join Date: Jan 2007 Location: Canada Posts: 528

Here I'm going to link to a few pictures from the Umetrics site, for review purposes:

Moving to More Dimensions

With 3 Variables (or Versions), we can make a 3-dimensional space. More than 3 dimensions may be hard to picture, but it's easy to handle mathematically. We can actually have any number of dimensions we want. Each dimension will represent an independent 'Variable', in our example, a 'Version' or Text-type.

The essential idea is that one dimension of our data-table (the MSS) will become the Objects or points to be plotted. The other dimension of the table (the Versions) will become the new dimensions or coordinate axes of our Phase Space. To make this work, the numbers in any given column must be of the same kind and units, and each must be a simple scalar.

A Euclidean Metric

Something that is rarely talked about in simple introductions to PCA is the problem of the Metric of the Phase Space. The Metric of a 'Space' can be called a statement of the relations between points or coordinate axes, but more plainly, it's about 'true distance' in space, regardless of direction.

If we are going to represent MSS (or anything else) as a set of points some fixed set of relative distances apart, then we have to be able to measure distances. We also need to be able to compare distances, independently of direction or position. And we need to be able to say things like "MS A is the same distance from MS B as MS F is from G", or "It is a greater distance than that between MSS P and Q." Only being able to reliably express these relations will allow us to group MSS together, and understand their true relation to other groups.

So distances in a space are expected to be constant (invariable) regardless of a change in orientation of coordinates.
In fact, this property is mandatory if 'distance' in space is to have any objective meaning, and if a multi-dimensional model is to properly represent the 'distances' between objects.

In Euclidean Geometry, distances are expressed by the generalized Pythagorean Theorem. In algebraic terms, the theorem states that a^2 + b^2 = c^2, where c is the hypotenuse while a and b are the legs of a right triangle.

But the formula's most important use is in calculating the distance between any two arbitrary points, given their coordinates. The unknown distance is made the hypotenuse, while the shorter sides correspond to the difference between the two points in X terms (Δx) and Y terms (Δy). The equation is then solved for c:

c = (Δx^2 + Δy^2)^(1/2)

The distance formula is generalized for any number of dimensions:

distance = [ (Δx1)^2 + (Δx2)^2 + (Δx3)^2 + ... ]^(1/2)

For the formula to have any use, and for distance to be meaningful, it must be constant. But for this to be true, the scales of the coordinate axes must be linear (the numbers must be equally spaced on the axis), and be of the same size units. Furthermore, the units must be the same as those used to measure distance in any direction in the space.

Said another way, we require our space to be a Euclidean Space, in which distance follows the generalized Pythagorean formula. Why? Because one of the first things we do in PCA is to take distance measurements of the objects in this space, to determine which axes efficiently cover the maximum variation in distance between the objects. Even though we do this indirectly (i.e., algebraically, not graphically), it's done using actual coordinates, and we rely upon calculations that use the generalized Pythagorean Theorem throughout the operation.
The PCA technique then requires, and assumes, that the data columns to be operated on and converted into coordinate axes are linear in magnitude, are equivalent in nature and type, are expressed in the same units, and are capable of forming an N-Dimensional Euclidean Space.
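To make the point about the Metric concrete, here is a small sketch in Python with NumPy. It computes the generalized Pythagorean distance between two points, shows that the distance is unchanged when the coordinates are re-oriented, and shows that it does change when one axis is put in different units. The two 'MSS' and their coordinates are invented, and the 30-degree rotation and the factor of 10 are arbitrary choices for illustration.

```python
import numpy as np

def distance(p, q):
    """Generalized Pythagorean distance between two points in N dimensions."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

# Two hypothetical MSS as points in a 3-dimensional space (invented values).
ms_a = np.array([90.0, 60.0, 55.0])
ms_b = np.array([85.0, 65.0, 50.0])

# Re-express both points in coordinates turned 30 degrees about the third axis.
theta = np.radians(30.0)
rotation = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])

before = distance(ms_a, ms_b)
after = distance(rotation @ ms_a, rotation @ ms_b)

# Put the first axis in units ten times smaller than the other two.
stretch = np.diag([10.0, 1.0, 1.0])
distorted = distance(stretch @ ms_a, stretch @ ms_b)

print(before, after, distorted)
```

The first two numbers agree (up to rounding), while the third does not. That is the practical reason the columns need to share the same kind of units before any distances between MSS are compared.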

03-14-2007, 10:52 AM	#210
Nazaroo Banned Join Date: Jan 2007 Location: Canada Posts: 528

A Generalized Example

A PCA Transform and Projection Begins

Obviously with a space having a large number of dimensions, we cannot visualize or recognise the patterns, or even make a useful model. The first part of the technique involves reducing the number of dimensions to something manageable.

But if we just plot 3 columns of our data-table as a 3-Dimensional Space we may just get an arbitrary cloud of points in space with no apparent order or pattern, at least from the point of view of our observed variables (the 3 selected columns).

Purpose and Goal of PCA

A 3-d model of the Observed Variables lets us down. No plain correlation or pattern seems to appear. These observed variables are called Manifest Variables to distinguish them from any Hidden Variables that will expose patterns and possibly reveal underlying causes or processes.

The key with PCA is to find new hidden variables (in our current example, two) for the plot, that will organize the data and reveal underlying causes or effects. These hidden variables will become the new coordinate axes for a projection that reveals the hidden structure.

At the same time, the left-over hidden variables (dimensions), which retain unused information or complications that veil the pattern, are ignored. This reduces the model back down to two dimensions, for a clean projection.

The unused information is treated like 'noise', and the isolated pattern is viewed as the 'signal'. In this way, PCA enthusiasts can talk of PCA as a "method of separating the signal from the noise", although strictly speaking this implies unsubstantiated assumptions about the data.
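As a rough illustration of what "keeping two hidden variables and ignoring the rest" amounts to numerically, here is a sketch in Python with NumPy. The data are entirely synthetic: I assume 50 objects whose 3 manifest variables are generated from 2 hidden variables plus a little random noise, so the outcome is built in by construction. The mixing weights, noise level and random seed are arbitrary assumptions, and nothing here says real MSS data behave this way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (an assumption): 2 hidden variables drive 3 manifest
# variables through arbitrary mixing weights, plus a small amount of noise.
hidden = rng.normal(size=(50, 2))
mixing = np.array([
    [1.0,  0.5, -0.3],
    [0.2, -1.0,  0.8],
])
manifest = hidden @ mixing + 0.1 * rng.normal(size=(50, 3))

# Centre the manifest variables and find the new axes with the SVD.
centred = manifest - manifest.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)

# Share of the total spread covered by each new axis, largest first.
variance_share = s ** 2 / np.sum(s ** 2)
kept = variance_share[:2].sum()

# Project onto the first two axes, then map back into the original
# 3 dimensions to measure what the 2-d picture cannot represent.
scores = centred @ vt[:2].T
reconstructed = scores @ vt[:2]
lost = np.sum((centred - reconstructed) ** 2) / np.sum(centred ** 2)

print("share kept by two axes:", kept)
print("share discarded:", lost)
```

The discarded share is exactly what the projection treats as 'noise'. Whether it really is noise, or part of the pattern, is a judgment the arithmetic cannot make for us.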
