orngCA: Orange Correspondence Analysis

Correspondence analysis is an exploratory technique for analysing contingency tables. This module implements correspondence analysis for two-way frequency cross-tabulation tables.

The module contains one class, CA, which wraps all the mathematical functions, and a function, input, for loading a contingency table from a file. The class is constructed by passing a contingency table to the constructor. The contingency table is encoded as nested Python lists (a "list of lists") or as a numpy matrix or array. The function input(filename) reads the contingency table from a file in which each row of the table is a line of comma-separated numbers. The following snippet illustrates both ways of passing a contingency table to the analysis:

>>> import orngCA
>>> data = [[72, 39, 26, 23, 4],
...         [95, 58, 66, 84, 41],
...         [80, 73, 83, 4, 96],
...         [79, 93, 35, 73, 63]]
>>> c = orngCA.CA(data)
>>>
>>> data = orngCA.input('contigencyTable')
>>> c = orngCA.CA(data)
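The file format described above (one table row per line, numbers separated by commas) is simple enough to parse by hand. The sketch below is not the module's own input function; read_contingency_table is a hypothetical stand-in, and it assumes that blank lines should be skipped and that every value parses as a number.

```python
def read_contingency_table(filename):
    """Read a comma-separated contingency table into a list of lists.

    Hypothetical helper, not part of orngCA. Assumes one table row per
    line and skips blank lines.
    """
    table = []
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore empty lines
            table.append([float(value) for value in line.split(",")])
    # A contingency table must be rectangular.
    if len(set(len(row) for row in table)) > 1:
        raise ValueError("rows have different numbers of columns")
    return table
```

The returned list of lists has the same shape as the literal table in the snippet above, so it could be passed to the CA constructor in the same way.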

Class CA

Attributes

The attributes provide access to the contingency table and various matrices created in the analysis process.

dataMatrix
A contingency table as provided by the user.
A
Principal axes of the column clouds.
B
Principal axes of the row clouds.
D
Matrix whose diagonal elements are singular values of the decomposition.
F
Coordinates of the row profiles with respect to principal axes in the matrix B.
G
Coordinates of the column profiles with respect to principal axes in the matrix A.
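One common way to obtain matrices with the roles listed above is a singular value decomposition of the standardized residuals of the table. The sketch below follows that textbook construction with numpy. It is an assumption about how the matrices relate, not a copy of the module's internals, and the function name correspondence_analysis is made up for illustration.

```python
import numpy as np

def correspondence_analysis(table):
    """Textbook correspondence analysis via SVD (illustrative sketch only)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                  # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    # Standardized residuals from the independence model r * c'.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    A = np.sqrt(r)[:, None] * U      # principal axes of the column cloud
    B = np.sqrt(c)[:, None] * Vt.T   # principal axes of the row cloud
    D = np.diag(sv)                  # singular values on the diagonal
    F = (U / np.sqrt(r)[:, None]) * sv      # row profile coordinates
    G = (Vt.T / np.sqrt(c)[:, None]) * sv   # column profile coordinates
    return A, B, D, F, G
```

Under this construction the diagonal of D holds the singular values, and the squared singular values sum to the total inertia of the table.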

Methods

getA(), getB(), ..., getG()
Returns the matrices A to G, respectively.
getPrincipalRowProfilesCoordinates(dim = (0,1))
Returns the co-ordinates of the row profiles with respect to the principal axes B. Only the co-ordinates for the dimensions listed in the tuple dim are returned. dim is optional; if omitted, the first two dimensions are returned.
getPrincipalColProfilesCoordinates(dim = (0,1))
Returns the co-ordinates of the column profiles with respect to the principal axes A. Only the co-ordinates for the dimensions listed in the tuple dim are returned. If dim is omitted, the first two dimensions are returned.
DecompositionOfInertia(axis = 0)
Returns the decomposition of inertia across the axes. The columns of this matrix represent the contributions of the rows or columns to the inertia of each axis. If axis is 0, inertia is decomposed across rows; if axis is 1, it is decomposed across columns. The parameter is optional and defaults to 0.
InertiaOfAxis(percentage = 0)
Returns a numpy array whose elements are the inertias of the axes. If percentage = 1, the inertia of each axis is returned as a percentage.
ContributionOfPointsToAxis(rowColumn = 0, axis = 0, percentage = 0)
Returns a numpy array whose elements are the contributions of points to the inertia of the given axis. The argument rowColumn defines whether the calculation is performed for row points (the default) or column points. The values are expressed as percentages if percentage = 1.
PointsWithMostInertia(rowColumn = 0, axis = (0, 1))
Returns the indices of row or column points, sorted by decreasing contribution to the axes given in the tuple axis.
PlotScreeDiagram()
Creates a canvas and plots a scree diagram in it.
Biplot(dim = (0, 1))
Plots row points and column points on a 2D canvas. If dim is omitted, the first two dimensions are displayed; otherwise the tuple dim selects the principal axes.
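The contribution and ranking methods above can be illustrated with a common definition of a point's contribution to an axis: its mass times its squared principal coordinate, divided by the inertia of that axis. The sketch below uses that definition for row points. The function names are hypothetical, the module may compute or scale its values differently, and the chosen axis is assumed to have non-zero inertia.

```python
import numpy as np

def row_contributions(table, axis=0):
    """Relative contributions of row points to one axis (illustrative sketch)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)                       # row masses
    c = P.sum(axis=0)                       # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U / np.sqrt(r)[:, None]) * sv      # principal row coordinates
    # Inertia of the axis is sv[axis]**2, which equals sum_i r_i * F[i, axis]**2,
    # so the returned values sum to 1.
    return r * F[:, axis] ** 2 / sv[axis] ** 2

def rank_points(contributions):
    """Indices of points sorted by decreasing contribution."""
    return [int(i) for i in np.argsort(contributions)[::-1]]
```

Multiplying the result of row_contributions by 100 gives percentages, and rank_points applied to it gives an ordering in the spirit of PointsWithMostInertia for a single axis.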

Examples of use

The table below shows the smoking habits of different groups of employees in a company.


                       Smoking category
Staff Group            (1) None  (2) Light  (3) Medium  (4) Heavy  Row Totals
(1) Senior managers           4          2           3          2          11
(2) Junior Managers           4          3           7          4          18
(3) Senior Employees         25         10          12          4          51
(4) Junior Employees         18         24          33         13          88
(5) Secretaries              10          6           7          2          25
Column Totals                61         45          62         25         193

The 4 column values in each row of the table can be viewed as coordinates in a 4-dimensional space, and the (Euclidean) distances between the 5 row points in that space can be computed. These distances summarize all information about the similarities between the rows of the table above. The correspondence analysis module can be used to find a lower-dimensional space in which the row points are positioned so as to retain all, or almost all, of the information about the differences between the rows. The information about the similarities between the rows (types of employees in this case) can then be presented in a simple 2-dimensional graph. While this may not appear particularly useful for small tables like the one shown above, the presentation and interpretation of very large tables (e.g., differential preference for 10 consumer items among 100 groups of respondents in a consumer survey) can benefit greatly from the simplification achieved via correspondence analysis (e.g., representing the 10 consumer items in a 2-dimensional space). The same analysis can be performed on the columns of the table.

The following lines load the modules and data needed for the analysis. The analysis itself is run in the last line.

 1 import orange
 2 from orngCA import CA
 3 from numpy import diag, square
 4 data = [[4, 2, 3, 2],
 5         [4, 3, 7, 4],
 6         [25, 10, 12, 4],
 7         [18, 24, 33, 13],
 8         [10, 6, 7, 2]]
 9
10 c = CA(data)

After the analysis finishes, the results can be inspected:

11 print "Column profiles:"
12 print c._CA__colProfiles
13 print
14 print "Row profiles:"
15 print c._CA__rowProfiles
16 print

Column profiles:
[[ 0.06557377  0.06557377  0.40983607  0.29508197  0.16393443]
 [ 0.04444444  0.06666667  0.22222222  0.53333333  0.13333333]
 [ 0.0483871   0.11290323  0.19354839  0.53225806  0.11290323]
 [ 0.08        0.16        0.16        0.52        0.08      ]]

Row profiles:
[[ 0.36363636  0.18181818  0.27272727  0.18181818]
 [ 0.22222222  0.16666667  0.38888889  0.22222222]
 [ 0.49019608  0.19607843  0.23529412  0.07843137]
 [ 0.20454545  0.27272727  0.375       0.14772727]
 [ 0.4         0.24        0.28        0.08      ]]

Points that lie close to each other in the two-dimensional display are similar with regard to the pattern of relative frequencies across the columns, i.e. they have similar row profiles. In the plot, along the most important first axis, the Senior employees and Secretaries are relatively close together. This can also be seen by examining the row profiles: these two groups of employees show very similar patterns of relative frequencies across the categories of smoking intensity.
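The profiles printed above are plain relative frequencies, so they can be reproduced without the module. The sketch below uses hypothetical helper names, and the plain Euclidean distance between profiles is only a crude stand-in for the distances that correspondence analysis itself works with.

```python
def row_profiles(table):
    """Divide every row by its total so that each row sums to 1."""
    return [[value / float(sum(row)) for value in row] for row in table]

def column_profiles(table):
    """Profiles of the columns, returned one column per output row."""
    return row_profiles([list(column) for column in zip(*table)])

def profile_distance(p, q):
    """Plain Euclidean distance between two profiles (crude similarity check)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

data = [[4, 2, 3, 2],
        [4, 3, 7, 4],
        [25, 10, 12, 4],
        [18, 24, 33, 13],
        [10, 6, 7, 2]]

profiles = row_profiles(data)
# Senior employees (index 2) against Secretaries (index 4) and
# against Junior employees (index 3).
close = profile_distance(profiles[2], profiles[4])
far = profile_distance(profiles[2], profiles[3])
```

Here close comes out smaller than far, which agrees with the observation that Senior employees and Secretaries have similar row profiles.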

Lines 17-19 print the singular values, the eigenvalues and the percentages of inertia explained. These values help decide how many axes are needed to represent the data. The dimensions are "extracted" to maximize the distances between the row or column points, and successive dimensions "explain" less and less of the overall inertia.

17 print "Singular values: " + str(diag(c.D))
18 print "Eigen values: " + str(square(diag(c.D)))
19 print "Percentage of Inertia:" + str(c.PercentageOfInertia())
20 print

Singular values: [ 2.73421115e-01 1.00085866e-01 2.03365208e-02 1.20036007e-16]
Eigen values: [ 7.47591059e-02 1.00171805e-02 4.13574080e-04 1.44086430e-32]
Percentage of Inertia: [ 8.78492893e+01 1.16387938e+01 5.11916964e-01 1.78671526e-29]
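One simple way to turn the percentages printed above into a decision is to keep adding axes until their cumulative share of inertia passes a chosen threshold. The sketch below does that. The helper name axes_needed and the 95% default are assumptions made for illustration, not a rule taken from the module.

```python
def axes_needed(percentages, threshold=95.0):
    """Smallest number of leading axes whose cumulative share of inertia
    reaches the threshold (hypothetical helper, threshold in percent)."""
    cumulative = 0.0
    for count, share in enumerate(percentages, start=1):
        cumulative += share
        if cumulative >= threshold:
            return count
    return len(percentages)

# Percentages of inertia copied from the output above.
percentages = [87.8492893, 11.6387938, 0.511916964, 1.78671526e-29]
n_axes = axes_needed(percentages)  # two axes pass the 95% threshold
```

With these figures the first axis alone stays below 95%, while the first two together exceed 99%, so a two-dimensional display retains nearly all of the inertia.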

Lines 21-22 print the principal row co-ordinates with respect to the first two axes, and lines 24-25 show the decomposition of inertia.

21 print "Principal row coordinates:"
22 print c.getPrincipalRowProfilesCoordinates()
23 print
24 print "Decomposition Of Inertia:"
25 print c.DecompositionOfInertia()

The last two statements plot a scree diagram and a biplot. A scree diagram plots the amount of inertia accounted for by successive dimensions, i.e. the percentage of inertia against the components, in order of magnitude from largest to smallest. It is usually used to select the components that contribute the most inertia: one looks for a change in slope in the diagram, after which the remaining factors appear to be mere debris at the bottom of the slope and are discarded. A biplot is a plot of row and column points in a two-dimensional space.

27 c.PlotScreeDiagram()

28 c.Biplot()