orngLR: Orange Logistic Regression Module

Module orngLR is a set of wrappers around the classes LogisticLearner and LogisticClassifier, which are implemented in core Orange. This module extends the use of logistic regression to discrete attributes and helps handle various anomalies among attributes, such as constant variables and singularities, that make fitting logistic regression nearly impossible. It also implements a function for stepwise logistic regression, which is a good technique for preventing overfitting and a useful feature subset selection technique as well.


Functions

LogRegLearner([examples=None, weightID=0, removeSingular=0, fitter = None, stepwiseLR = 0, addCrit=0.2, deleteCrit=0.3, numAttr=-1])
Returns a LogisticClassifier if examples are given. If examples are not specified, an instance of LogisticLearner with its parameters appropriately initialized is returned. Parameter weightID defines the ID of the weight meta attribute. Set removeSingular to 1 if you want disturbing attributes, such as constants and singularities, removed automatically. Examples can contain both discrete and continuous attributes. Parameter fitter is used to select an alternative fitting algorithm; by default a Newton-Raphson fitting algorithm is used, but you can change it to something else (a bayesianFitter is available in orngLR if you want to try it out). The last three parameters, addCrit, deleteCrit and numAttr, set the parameters for stepwise attribute selection (see next method). If you wish to use stepwise selection within LogRegLearner, stepwiseLR must be set to 1.
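To give an idea of what the default fitter does, here is a minimal pure-Python sketch of Newton-Raphson fitting for a single continuous attribute plus an intercept. The names fit_logreg and predict are illustrative only, not part of orngLR, and a real fitter would of course handle arbitrarily many attributes and guard against singular Hessians.

```python
import math

def fit_logreg(xs, ys, steps=20):
    """Newton-Raphson fit of P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0 = b1 = 0.0
    for _ in range(steps):
        # gradient and Hessian of the log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system H * delta = g
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

def predict(b0, b1, x):
    """Predicted probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
```

At the maximum-likelihood solution the score equations hold, so the fitted probabilities sum to the number of positive examples; that is a convenient sanity check on convergence.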
 
stepWiseFSS([examples=None, addCrit=0.2, deleteCrit=0.3, numAttr=-1])
If examples are specified, stepwise logistic regression as implemented in stepWiseFSS_class is performed and a list of chosen attributes is returned. If examples are not specified, an instance of stepWiseFSS_class with all parameters set is returned. Parameters addCrit, deleteCrit and numAttr are explained in the description of stepWiseFSS_class.
bestNAtt([examples, N, addCrit=0.2, deleteCrit=0.3])
Returns "best" N attributes selected with stepwise logistic regression. Parameter examples is required.
printOUT([classifier])
Prints a formatted summary of all major attributes of a logistic regression classifier to the console. Parameter classifier is a logistic regression classifier.
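The Wald statistics that appear in such a printout are easy to compute from the coefficients. A minimal sketch, where wald_z and wald_p are illustrative helpers and not the orngLR implementation:

```python
import math

def wald_z(beta, stderr):
    """Wald Z statistic: the coefficient divided by its standard error."""
    return beta / stderr

def wald_p(z):
    """Two-sided p-value for a Wald Z under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For example, a coefficient twice its standard error gives Z = 2 and a p-value just under 0.05.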

Classes

stepWiseFSS_class([examples=None, addCrit=0.2, deleteCrit=0.3, numAttr=-1])
Performs stepwise logistic regression and returns a list of the "most" informative attributes. Each step of the algorithm is composed of two parts. The first is backward elimination, where each attribute already in the model is tested for a significant contribution to the overall model. If the worst among all tested attributes has a significance higher than deleteCrit, it is removed from the model. The second part is forward selection, which is similar to backward elimination: it loops through all attributes that are not in the model and tests whether they contribute to the model with a significance lower than addCrit. The algorithm stops when no attribute in the model can be removed and no attribute outside the model can be added. By setting numAttr larger than -1, the algorithm will stop adding attributes once their number in the model would exceed that limit.
Significances are assessed via the likelihood ratio chi-square test. The usual F-test is not appropriate here, because the errors are assumed to follow a binomial distribution.
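The add/remove loop described above can be sketched as follows. Here significance(attr, model) is an assumed callback returning the p-value of attr's contribution to model (a likelihood-ratio chi-square test in the real implementation); stepwise_select is an illustrative name, not part of orngLR.

```python
def stepwise_select(attrs, significance, add_crit=0.2, delete_crit=0.3, num_attr=-1):
    """Sketch of stepwise selection: backward elimination, then forward selection,
    repeated until neither part changes the model."""
    model = []
    while True:
        changed = False
        # backward elimination: drop the least significant chosen attribute
        if model:
            worst = max(model, key=lambda a: significance(a, model))
            if significance(worst, model) > delete_crit:
                model.remove(worst)
                changed = True
        # forward selection: add the most significant remaining attribute,
        # unless the model has already reached num_attr attributes
        candidates = [a for a in attrs if a not in model]
        if candidates and (num_attr < 0 or len(model) < num_attr):
            best = min(candidates, key=lambda a: significance(a, model + [a]))
            if significance(best, model + [best]) < add_crit:
                model.append(best)
                changed = True
        if not changed:
            return model
```

With a toy significance function that assigns each attribute a fixed p-value, only the attributes below add_crit end up in the model.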

Examples

The first example shows a very simple induction of a logistic regression classifier.

import orange, orngLR

data = orange.ExampleTable("titanic")
lr = orngLR.LogRegLearner(data)

# compute classification accuracy
correct = 0.0
for ex in data:
    if lr(ex) == ex.getclass():
        correct += 1
print "Classification accuracy:", correct/len(data)
orngLR.printOUT(lr)

Result:

Classification accuracy: 0.778282598819

class attribute = survived
class values = <yes, no>

    Attribute    beta   st. error   wald Z      P
    Intercept    0.38        0.14     2.73   0.01
status=second    1.02        0.20     5.13   0.00
 status=third    1.78        0.17    10.24   0.00
  status=crew    0.86        0.16     5.39   0.00
    age=child   -1.06        0.25    -4.30   0.00
   sex=female   -2.42        0.14   -17.04   0.00

The next example shows how to handle singularities in data sets.

import orange, orngLR

data = orange.ExampleTable("adult_sample.tab")
lr = orngLR.LogRegLearner(data, removeSingular = 1)

for ex in data[:5]:
    print ex.getclass(), lr(ex)
orngLR.printOUT(lr)

Result:

removing education=Preschool
removing education-num
removing workclass=Never-worked
removing native-country=Honduras
removing native-country=Thailand
removing native-country=Ecuador
removing native-country=Portugal
removing native-country=France
removing native-country=Yugoslavia
removing native-country=Trinadad&Tobago
removing native-country=Hong
removing native-country=Hungary
removing native-country=Holand-Netherlands

<=50K <=50K
<=50K <=50K
<=50K <=50K
>50K >50K
<=50K >50K

class attribute = y
class values = <>50K, <=50K>

Attribute                                   beta   st. error   wald Z      P
Intercept                                   1.39        0.82     1.71   0.09
age                                        -0.04        0.01    -3.60   0.00
workclass=Self-emp-not-inc                 -0.10        0.37    -0.27   0.79
workclass=Self-emp-inc                      0.49        0.50     0.97   0.33
workclass=Federal-gov                      -1.38        0.52    -2.69   0.01
workclass=Local-gov                        -0.12        0.41    -0.29   0.77
workclass=State-gov                         0.30        0.47     0.63   0.53
workclass=Without-pay                       2.50        2.55     0.98   0.33
fnlwgt                                     -0.00        0.00    -0.66   0.51
education=Some-college                      1.12        0.32     3.44   0.00
education=11th                              1.51        0.75     2.02   0.04
education=HS-grad                           1.16        0.31     3.76   0.00
education=Prof-school                       0.03        1.18     0.03   0.98
education=Assoc-acdm                        0.42        0.48     0.88   0.38
education=Assoc-voc                         2.05        0.58     3.55   0.00
education=9th                               2.67        0.99     2.71   0.01
education=7th-8th                           2.04        0.77     2.64   0.01
education=12th                              3.82        1.11     3.44   0.00
education=Masters                          -0.14        0.41    -0.34   0.73
education=1st-4th                           1.05        1.52     0.69   0.49
education=10th                              3.00        0.91     3.29   0.00
education=Doctorate                         0.03        0.80     0.04   0.97
education=5th-6th                           0.33        1.28     0.26   0.80
marital-status=Divorced                     3.71        1.09     3.39   0.00
marital-status=Never-married                3.36        1.03     3.28   0.00
marital-status=Separated                    3.21        1.22     2.64   0.01
marital-status=Widowed                      3.66        1.20     3.04   0.00
marital-status=Married-spouse-absent        5.28        1.72     3.07   0.00
marital-status=Married-AF-spouse            4.93        5.06     0.97   0.33
occupation=Craft-repair                     0.67        0.51     1.30   0.19
occupation=Other-service                    1.97        0.64     3.06   0.00
occupation=Sales                            0.43        0.53     0.81   0.42
occupation=Exec-managerial                  0.45        0.51     0.89   0.37
occupation=Prof-specialty                   0.54        0.52     1.03   0.30
occupation=Handlers-cleaners                1.71        0.74     2.33   0.02
occupation=Machine-op-inspct                1.15        0.62     1.84   0.07
occupation=Adm-clerical                     0.67        0.53     1.27   0.20
occupation=Farming-fishing                  1.40        0.78     1.80   0.07
occupation=Transport-moving                 0.92        0.57     1.62   0.10
occupation=Priv-house-serv                  2.38        1.81     1.32   0.19
occupation=Protective-serv                  0.47        0.76     0.61   0.54
occupation=Armed-Forces                     1.89        6.36     0.30   0.77
relationship=Own-child                     -0.18        1.05    -0.17   0.87
relationship=Husband                        1.21        0.51     2.37   0.02
relationship=Not-in-family                 -0.31        1.12    -0.28   0.78
relationship=Other-relative                -1.00        1.23    -0.81   0.42
relationship=Unmarried                     -0.47        1.17    -0.40   0.69
race=Asian-Pac-Islander                    -0.66        0.90    -0.74   0.46
race=Amer-Indian-Eskimo                     1.65        1.91     0.86   0.39
race=Other                                  2.67        1.53     1.75   0.08
race=Black                                  0.48        0.38     1.26   0.21
sex=Male                                   -0.18        0.37    -0.49   0.62
capital-gain                               -0.00        0.00    -6.74   0.00
capital-loss                               -0.00        0.00    -2.96   0.00
hours-per-week                             -0.04        0.01    -4.37   0.00
native-country=Cuba                        -1.04        5.24    -0.20   0.84
native-country=Jamaica                     -4.48        2.25    -1.99   0.05
native-country=India                        1.03        1.42     0.73   0.47
native-country=Mexico                       0.77        0.95     0.81   0.42
native-country=South                        1.36        5.84     0.23   0.82
native-country=Puerto-Rico                  0.52        5.00     0.10   0.92
native-country=England                      1.50        2.40     0.63   0.53
native-country=Canada                      -0.68        1.41    -0.48   0.63
native-country=Germany                     -0.61        0.91    -0.67   0.50
native-country=Iran                         3.31        3.46     0.96   0.34
native-country=Philippines                  1.56        1.98     0.79   0.43
native-country=Italy                       -2.10        1.40    -1.50   0.13
native-country=Poland                       0.84        2.60     0.32   0.75
native-country=Columbia                    -1.93        1.78    -1.08   0.28
native-country=Cambodia                     1.19        5.52     0.21   0.83
native-country=Laos                         3.40        3.34     1.02   0.31
native-country=Taiwan                      -0.14        1.81    -0.08   0.94
native-country=Haiti                       -3.22        2.07    -1.55   0.12
native-country=Dominican-Republic          -3.90        7.85    -0.50   0.62
native-country=El-Salvador                  1.23        2.87     0.43   0.67
native-country=Guatemala                   -1.17        5.41    -0.22   0.83
native-country=China                        0.95        1.69     0.56   0.58
native-country=Japan                       -2.31        3.31    -0.70   0.48
native-country=Peru                        -0.40        4.39    -0.09   0.93
native-country=Outlying-US(Guam-USVI-etc)   0.95        4.78     0.20   0.84
native-country=Scotland                     1.88        3.15     0.60   0.55
native-country=Greece                      -0.12        5.00    -0.02   0.98
native-country=Nicaragua                    1.26        2.80     0.45   0.65
native-country=Vietnam                      2.33        2.53     0.92   0.36
native-country=Ireland                     -1.00        1.45    -0.69   0.49

If we had set removeSingular to 0, inducing a logistic regression classifier would raise an error:

Traceback (most recent call last):
  File "C:\Python23\lib\site-packages\Pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\Python23\Lib\site-packages\logreg1.py", line 4, in ?
    lr = Ndomain.LogisticLearner(data, removeSingular = 1)
  File "C:\Python23\Lib\site-packages\Ndomain.py", line 49, in LogisticLearner
    return lr(examples)
  File "C:\Python23\Lib\site-packages\Ndomain.py", line 68, in __call__
    lr = orange.LogisticLearner(nexamples, showSingularity = self.showSingularity)
KernelException: 'orange.LogisticFitterMinimization': singularity in education=Preschool
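To see the kind of problem that causes this, here is a simplified, illustrative check (find_disturbing_columns is a hypothetical helper, not part of orngLR). It flags indicator columns that would make the fit singular: constant columns, which carry no information, and exact duplicates of an earlier column, which are perfectly collinear with it. The real removeSingular machinery detects singularities during fitting rather than by this simple scan.

```python
def find_disturbing_columns(columns):
    """Given (name, values) pairs of 0/1 indicator columns, return the names
    of columns that are constant or exact duplicates of an earlier column."""
    disturbing = []
    seen = {}
    for name, values in columns:
        if len(set(values)) == 1:       # constant column: no information
            disturbing.append(name)
        elif tuple(values) in seen:     # duplicate column: perfectly collinear
            disturbing.append(name)
        else:
            seen[tuple(values)] = name
    return disturbing
```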

We can see that the attribute education=Preschool is causing a singularity. We can either remove Preschool manually or let LogRegLearner remove it automatically. The last example shows how stepwise logistic regression can help achieve better classification.

import orange, orngFSS, orngTest, orngStat, orngLR

def StepWiseFSS_Filter(examples = None, **kwds):
    """ check function StepWiseFSS() """
    filter = apply(StepWiseFSS_Filter_class, (), kwds)
    if examples:
        return filter(examples)
    else:
        return filter

class StepWiseFSS_Filter_class:
    def __init__(self, addCrit=0.2, deleteCrit=0.3, numAttr = -1):
        self.addCrit = addCrit
        self.deleteCrit = deleteCrit
        self.numAttr = numAttr
    def __call__(self, examples):
        attr = orngLR.StepWiseFSS(examples, addCrit=self.addCrit, deleteCrit = self.deleteCrit, numAttr = self.numAttr)
        return examples.select(orange.Domain(attr, examples.domain.classVar))

data = orange.ExampleTable("d:\\data\\ionosphere.tab")
lr = orngLR.LogRegLearner(removeSingular=1)
learners = (orngLR.LogRegLearner(name='logistic', removeSingular=1),
            orngFSS.FilteredLearner(lr, filter=StepWiseFSS_Filter(addCrit=0.05, deleteCrit=0.9), name='filtered'))
results = orngTest.crossValidation(learners, data, storeClassifiers=1)

# output the results
print "Learner CA"
for i in range(len(learners)):
    print "%-12s %5.3f" % (learners[i].name, orngStat.CA(results)[i])

# find out which attributes were retained by filtering
print "\nNumber of times attributes were used in cross-validation:"
attsUsed = {}
for i in range(10):
    for a in results.classifiers[i][1].atts():
        if a.name in attsUsed.keys():
            attsUsed[a.name] += 1
        else:
            attsUsed[a.name] = 1
for k in attsUsed.keys():
    print "%2d x %s" % (attsUsed[k], k)

Result:

Learner      CA
logistic     0.835
filtered     0.846

Number of times attributes were used in cross-validation:
 1 x a20
 1 x a21
10 x a22
 7 x a23
 5 x a24
 2 x a25
10 x a26
10 x a27
 3 x a29
 3 x a17
 1 x a16
 4 x a12
 2 x a32
 7 x a15
10 x a14
10 x a31
 8 x a30
10 x a11
 1 x a10
 1 x a13
10 x a34
 1 x a18
10 x a3
10 x a5
 5 x a4
 3 x a7
 7 x a6
 7 x a9
10 x a8

References

David W. Hosmer, Stanley Lemeshow. Applied Logistic Regression - 2nd ed. Wiley, New York, 2000