orngPCA: Principal component analysis module

This module contains tools for performing principal component analysis on data stored in an ExampleTable.

PCA class

PCA(dataset = None, attributes = None, rows = None, standardize = 0, imputer = defaultImputer, continuizer = defaultContinuizer, maxNumberOfComponents = 10, varianceCovered = 0.95, useGeneralizedVectors = 0)

dataset
ExampleTable instance on which PCA will be performed. If None, only the parameters are set and the PCA instance is returned. The projection can then be performed like this:

    import orange, orngPCA

    dataset = orange.ExampleTable('iris.tab')
    # no dataset given: only the parameters are stored
    pca = orngPCA.PCA(standardize = True)
    # calling the instance with the data performs PCA
    pca = pca(dataset)
attributes
List of attributes that will be used in the projection. The names must match those in the ExampleTable instance, and there should be at least two. If None, the whole domain is used.

part of PCA1.py

    import orange, orngPCA

    data = orange.ExampleTable("iris.tab")
    attributes = ['sepal length', 'sepal width', 'petal length', 'petal width']
    pca = orngPCA.PCA(data, standardize = True, attributes = attributes)

    print "PCA on attributes sepal.length, sepal.width, petal.length, petal.width:"
    print pca
rows
True/False array or list with the same length as the number of examples in the ExampleTable instance. Only examples that correspond to True will be used for the projection. If None, all data is used.

part of PCA1.py

    import orange, orngPCA

    data = orange.ExampleTable("iris.tab")
    # alternate 1 and 0 so that every second example is selected
    rows = [1, 0] * (len(data) // 2)
    pca = orngPCA.PCA(data, standardize = True, rows = rows)

    print "PCA on every second row:"
    print pca
standardize
If True, standardization of data is performed before projection.
imputer
orange.Imputer instance. Defines how missing values are imputed. Must NOT be trained. The default is average imputation.
continuizer

orange.Continuizer instance. Defines how data is continuized. Default values:

- Multinomial -> as normalized ordinal

- Class -> ignore

- Continuous -> leave

Example on how to use your own imputer and continuizer (PCA2.py)

    import orange, orngPCA

    data = orange.ExampleTable("bridges.tab")

    # the imputer is passed untrained
    imputer = orange.ImputerConstructor_maximal

    continuizer = orange.DomainContinuizer()
    continuizer.multinomialTreatment = continuizer.AsNormalizedOrdinal
    continuizer.classTreatment = continuizer.Ignore
    continuizer.continuousTreatment = continuizer.Leave

    pca = orngPCA.PCA(data, standardize = True, imputer = imputer, continuizer = continuizer)
    print pca
maxNumberOfComponents
Defines how many components will be retained. The default is 10; if -1, all components are retained.
varianceCovered
Defines the proportion of the variance of the original data that the retained components should explain. The default is 0.95.
useGeneralizedVectors
If True, generalized vectors are used.

part of PCA3.py

    import orange, orngPCA

    data = orange.ExampleTable("iris.tab")
    attributes = ['sepal length', 'sepal width', 'petal length', 'petal width']
    # -1 and 1.0 together keep every component
    pca = orngPCA.PCA(data, standardize = True, attributes = attributes,
                      maxNumberOfComponents = -1, varianceCovered = 1.0)
    print pca

Output:

    PCA SUMMARY

    Center:

    sepal length  sepal width  petal length  petal width
          5.8433       3.0540        3.7587       1.1987

    Deviation:

    sepal length  sepal width  petal length  petal width
          0.8253       0.4321        1.7585       0.7606

    Importance of components:

    eigenvalues  proportion  cumulative
         2.9108      0.7277      0.7277
         0.9212      0.2303      0.9580
         0.1474      0.0368      0.9948
         0.0206      0.0052      1.0000

    Loadings:

        PC1      PC2      PC3      PC4
     0.5224  -0.3723  -0.7210   0.2620  sepal length
    -0.2634  -0.9256   0.2420  -0.1241  sepal width
     0.5813  -0.0211   0.1409  -0.8012  petal length
     0.5656  -0.0654   0.6338   0.5235  petal width
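
The proportion and cumulative columns of "Importance of components" follow directly from the eigenvalues. The standalone sketch below (plain Python, independent of orngPCA) recomputes them from the eigenvalues printed for iris.tab and then picks a number of components. The helper names importance and components_to_keep are hypothetical, and the rule for combining maxNumberOfComponents with varianceCovered (stop at whichever limit is reached first) is an assumption made for illustration; the module's own selection logic may differ.

```python
def importance(eigenvalues):
    """Return (proportion, cumulative) lists of explained variance."""
    total = float(sum(eigenvalues))
    proportion = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in proportion:
        running += p
        cumulative.append(running)
    return proportion, cumulative


def components_to_keep(eigenvalues, max_components=10, variance_covered=0.95):
    """Assumed rule: keep components until either limit is reached."""
    _, cumulative = importance(eigenvalues)
    if max_components == -1:
        limit = len(eigenvalues)
    else:
        limit = min(max_components, len(eigenvalues))
    for k in range(1, limit + 1):
        # small tolerance guards against floating-point round-off near 1.0
        if cumulative[k - 1] >= variance_covered - 1e-12:
            return k
    return limit


# eigenvalues copied from the summary printed for iris.tab
eigenvalues = [2.9108, 0.9212, 0.1474, 0.0206]
proportion, cumulative = importance(eigenvalues)
for ev, p, c in zip(eigenvalues, proportion, cumulative):
    print("%8.4f %8.4f %8.4f" % (ev, p, c))

kept_default = components_to_keep(eigenvalues)        # default limits
kept_all = components_to_keep(eigenvalues, -1, 1.0)   # keep everything
```

Because the printed eigenvalues are already rounded to four decimals, the recomputed columns can differ from the table in the last digit. Under the assumed rule, the default settings keep two components for this data, since the first two already explain about 95.8% of the variance.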

PCAClassifier class

An object of this class is returned when PCA completes successfully. It contains the domain of the data on which PCA was performed, the imputer and continuizer used for projection, and the center, deviation, eigenvalues (evalues) and loadings of the PCA. It also stores the data of the last projection performed, for use with biplot.

A summary of the projection can be obtained by printing the PCAClassifier instance once PCA has completed successfully.

Performing projection:

Projection is performed by calling the PCAClassifier instance with an ExampleTable instance. Projection will fail if the domain of the ExampleTable instance is not the same as that of the training set (the attributes do not, however, have to be in the same order). The returned ExampleTable instance contains the projected data, and its domain consists of attributes named PC+N, where N goes from 1 to the number of components.

part of PCA4.py

    import orange, orngPCA

    data = orange.ExampleTable("iris.tab")
    attributes = ['sepal length', 'sepal width', 'petal length', 'petal width']
    pca = orngPCA.PCA(data, attributes = attributes, standardize = True)
    # project the data onto the principal components
    projected = pca(data)
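
To illustrate what such a projection computes, here is a minimal standalone sketch in plain Python that does not use orngPCA. It assumes the usual PCA projection: subtract the center, divide by the deviation (as when standardize is True), and take dot products with the loading columns. Imputation and continuization are skipped, and the function name project is hypothetical. The numbers are copied from the PCA summary printed for iris.tab in this document.

```python
# Values copied from the PCA summary for iris.tab
center = [5.8433, 3.0540, 3.7587, 1.1987]
deviation = [0.8253, 0.4321, 1.7585, 0.7606]
# loadings[i][j]: weight of attribute i in component PC(j+1)
loadings = [
    [0.5224, -0.3723, -0.7210, 0.2620],   # sepal length
    [-0.2634, -0.9256, 0.2420, -0.1241],  # sepal width
    [0.5813, -0.0211, 0.1409, -0.8012],   # petal length
    [0.5656, -0.0654, 0.6338, 0.5235],    # petal width
]


def project(example, center, deviation, loadings):
    """Assumed projection: standardize, then multiply by the loadings."""
    z = [(x - c) / d for x, c, d in zip(example, center, deviation)]
    n_components = len(loadings[0])
    return [sum(z[i] * loadings[i][j] for i in range(len(z)))
            for j in range(n_components)]


# one example: sepal length 5.1, sepal width 3.5, petal length 1.4, petal width 0.2
scores = project([5.1, 3.5, 1.4, 0.2], center, deviation, loadings)
names = ["PC%d" % (j + 1) for j in range(len(scores))]
for name, score in zip(names, scores):
    print("%s %8.4f" % (name, score))
```

The result has one value per component, named PC1 to PC4, which mirrors the PC+N domain of the returned ExampleTable.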

Plotting functions:

Matplotlib is needed.

plot(title = 'Scree plot', filename = 'Scree_plot.png')

Creates a scree plot for the current PCA.

title: title of the scree plot

filename: path and filename where the figure should be saved. If None, the figure is displayed directly.

biplot(choices = [1,2], scale = 1., title = 'Biplot', filename = 'biplot.png')

Creates a biplot for current projection.

Before calling biplot at least one projection must be performed (it will plot last performed projection).

- choices: numbers of the two components to plot, the first on the x-axis and the second on the y-axis. Components are numbered from 1 to N, where N is the number of components returned by PCA. Biplot does not work if only one component is available. Only the default is a biplot in the strict sense.

- scale: transformed data is scaled by lambda ^ scale and loadings are scaled by 1/(lambda ^ scale), where lambda are the singular values as computed by princomp, multiplied by the square root of the number of examples. Normally scale is inside [0, 1], and a warning is printed if the specified scale is outside this range.

- title and filename: same as for plot
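
The scale parameter described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the module's code: the "singular values as computed by princomp" are taken to be the square roots of the eigenvalues, the number of examples is 150 (the size of the iris data set), and the function names biplot_factors and scale_for_biplot are hypothetical.

```python
import math


def biplot_factors(eigenvalues, n_examples, choices=(1, 2), scale=1.0):
    """Return lambda ** scale for the chosen components (numbered from 1)."""
    if not 0.0 <= scale <= 1.0:
        print("Warning: scale is outside [0, 1]")
    factors = []
    for c in choices:
        # assumed lambda: component standard deviation times sqrt(n)
        lam = math.sqrt(eigenvalues[c - 1]) * math.sqrt(n_examples)
        factors.append(lam ** scale)
    return factors


def scale_for_biplot(scores, loadings, factors):
    """Multiply the scores by the factors and divide the loadings by them."""
    scaled_scores = [[s * f for s, f in zip(row, factors)] for row in scores]
    scaled_loadings = [[l / f for l, f in zip(row, factors)] for row in loadings]
    return scaled_scores, scaled_loadings


# eigenvalues copied from the summary printed for iris.tab
eigenvalues = [2.9108, 0.9212, 0.1474, 0.0206]
factors = biplot_factors(eigenvalues, 150, choices=(1, 2), scale=1.0)
unit = biplot_factors(eigenvalues, 150, choices=(1, 2), scale=0.0)  # [1.0, 1.0]
```

With scale = 0 both factors are 1, so scores and loadings are plotted unchanged; larger values of scale stretch the scores and shrink the loadings by the same factor.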

part of PCA5.py

    import orange, orngPCA

    data = orange.ExampleTable("iris.tab")
    attributes = ['sepal length', 'sepal width', 'petal length', 'petal width']
    pca = orngPCA.PCA(data, standardize = True, attributes = attributes)
    # biplot needs at least one projection to have been performed
    pca(data)
    pca.biplot()

The output is stored in the file biplot.png.

Utility functions

defaultImputer(dataset)
Returns orange.ImputerConstructor_average(dataset).
defaultContinuizer(dataset)

Creates default continuizer with:

- multinomial -> as normalized ordinal

- class -> ignore

- continuous -> leave

centerData(dataMatrix)
Performs centering of data along rows; returns the center and the centered data. dataMatrix is a numpy.array instance.
standardizeData(dataMatrix)
Performs standardization of data along rows; returns the scale and the scaled data. Raises an error if a constant variable is present.
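
The two functions above can be sketched in a few lines of plain Python. This is a hedged illustration, not the module's implementation: plain lists are used instead of numpy.array so the sketch has no dependencies, the names center_data and standardize_data are hypothetical, examples are assumed to be stored one per row with each attribute (column) treated separately, and the use of the population standard deviation (division by n) and of ValueError are assumptions.

```python
import math


def center_data(rows):
    """Subtract each column's mean; return (center, centered rows)."""
    n = float(len(rows))
    center = [sum(col) / n for col in zip(*rows)]
    centered = [[x - c for x, c in zip(row, center)] for row in rows]
    return center, centered


def standardize_data(rows):
    """Divide each column by its standard deviation; return (scale, scaled rows)."""
    n = float(len(rows))
    scale = []
    for col in zip(*rows):
        mean = sum(col) / n
        # population standard deviation (division by n) is an assumption
        scale.append(math.sqrt(sum((x - mean) ** 2 for x in col) / n))
    if any(s == 0 for s in scale):
        raise ValueError("constant variable present: cannot standardize")
    scaled = [[x / s for x, s in zip(row, scale)] for row in rows]
    return scale, scaled


# three examples with two attributes each
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
center, centered = center_data(data)              # center is [2.0, 30.0]
scale, standardized = standardize_data(centered)
```

After both steps every column has mean 0 and standard deviation 1. A column whose values are all equal has zero deviation, so dividing by it is impossible, which is why a constant variable is reported as an error.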