
Source Code for Module ete2.clustering.stats

# #START_LICENSE###########################################################
#
# Copyright (C) 2009 by Jaime Huerta Cepas. All rights reserved.
# email: jhcepas@gmail.com
#
# This file is part of the Environment for Tree Exploration program (ETE).
# http://ete.cgenomics.org
#
# ETE is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# ETE is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with ETE.  If not, see <http://www.gnu.org/licenses/>.
#
# #END_LICENSE#############################################################
__VERSION__="ete2-2.0rev104"
# Copyright (c) 1999-2007 Gary Strangman; All Rights Reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#
# Comments and/or additions are welcome (send e-mail to:
# strang@nmr.mgh.harvard.edu).
#
"""
stats.py module

(Requires pstat.py module.)

#################################################
#######  Written by:  Gary Strangman  ###########
#######  Last modified:  Dec 18, 2007 ###########
#################################################

A collection of basic statistical functions for python.  The function
names appear below.

IMPORTANT:  There are really *3* sets of functions.  The first set has an 'l'
prefix, which can be used with list or tuple arguments.  The second set has
an 'a' prefix, which can accept NumPy array arguments.  These latter
functions are defined only when NumPy is available on the system.  The third
type has NO prefix (i.e., has the name that appears below).  Functions of
this set are members of a "Dispatch" class, c/o David Ascher.  This class
allows different functions to be called depending on the type of the passed
arguments.  Thus, stats.mean is a member of the Dispatch class and
stats.mean(range(20)) will call stats.lmean(range(20)) while
stats.mean(Numeric.arange(20)) will call stats.amean(Numeric.arange(20)).
This is a handy way to keep consistent function names when different
argument types require different functions to be called.  Having
implemented the Dispatch class, however, means that to get info on
a given function, you must use the REAL function name ... that is
"print stats.lmean.__doc__" or "print stats.amean.__doc__" work fine,
while "print stats.mean.__doc__" will print the doc for the Dispatch
class.  NUMPY FUNCTIONS ('a' prefix) generally have more argument options
but should otherwise be consistent with the corresponding list functions.

Disclaimers:  The function list is obviously incomplete and, worse, the
functions are not optimized.  All functions have been tested (some more
so than others), but they are far from bulletproof.  Thus, as with any
free software, no warranty or guarantee is expressed or implied. :-)  A
few extra functions that don't appear in the list below can be found by
interested treasure-hunters.  These functions don't necessarily have
both list and array versions, but were deemed useful.

CENTRAL TENDENCY:  geometricmean
                   harmonicmean
                   mean
                   median
                   medianscore
                   mode

MOMENTS:  moment
          variation
          skew
          kurtosis
          skewtest   (for Numpy arrays only)
          kurtosistest (for Numpy arrays only)
          normaltest (for Numpy arrays only)

ALTERED VERSIONS:  tmean  (for Numpy arrays only)
                   tvar   (for Numpy arrays only)
                   tmin   (for Numpy arrays only)
                   tmax   (for Numpy arrays only)
                   tstdev (for Numpy arrays only)
                   tsem   (for Numpy arrays only)
                   describe

FREQUENCY STATS:  itemfreq
                  scoreatpercentile
                  percentileofscore
                  histogram
                  cumfreq
                  relfreq

VARIABILITY:  obrientransform
              samplevar
              samplestdev
              signaltonoise (for Numpy arrays only)
              var
              stdev
              sterr
              sem
              z
              zs
              zmap (for Numpy arrays only)

TRIMMING FCNS:  threshold (for Numpy arrays only)
                trimboth
                trim1
                round (round all vals to 'n' decimals; Numpy only)

CORRELATION FCNS:  covariance  (for Numpy arrays only)
                   correlation (for Numpy arrays only)
                   paired
                   pearsonr
                   spearmanr
                   pointbiserialr
                   kendalltau
                   linregress

INFERENTIAL STATS:  ttest_1samp
                    ttest_ind
                    ttest_rel
                    chisquare
                    ks_2samp
                    mannwhitneyu
                    ranksums
                    wilcoxont
                    kruskalwallish
                    friedmanchisquare

PROBABILITY CALCS:  chisqprob
                    erfcc
                    zprob
                    ksprob
                    fprob
                    betacf
                    gammln
                    betai

ANOVA FUNCTIONS:  F_oneway
                  F_value

SUPPORT FUNCTIONS:  writecc
                    incr
                    sign  (for Numpy arrays only)
                    sum
                    cumsum
                    ss
                    summult
                    sumdiffsquared
                    square_of_sums
                    shellsort
                    rankdata
                    outputpairedstats
                    findwithin
"""
## CHANGE LOG:
## ===========
## 07-11.26 ... conversion for numpy started
## 07-05-16 ... added Lin's Concordance Correlation Coefficient (alincc) and acov
## 05-08-21 ... added "Dice's coefficient"
## 04-10-26 ... added ap2t(), an ugly fcn for converting p-vals to T-vals
## 04-04-03 ... added amasslinregress() function to do regression on N-D arrays
## 03-01-03 ... CHANGED VERSION TO 0.6
##              fixed atsem() to properly handle limits=None case
##              improved histogram and median functions (estbinwidth) and
##                   fixed atvar() function (wrong answers for neg numbers?!?)
## 02-11-19 ... fixed attest_ind and attest_rel for div-by-zero Overflows
## 02-05-10 ... fixed lchisqprob indentation (failed when df=even)
## 00-12-28 ... removed aanova() to separate module, fixed licensing to
##                   match Python License, fixed doc string & imports
## 00-04-13 ... pulled all "global" statements, except from aanova()
##              added/fixed lots of documentation, removed io.py dependency
##              changed to version 0.5
## 99-11-13 ... added asign() function
## 99-11-01 ... changed version to 0.4 ... enough incremental changes now
## 99-10-25 ... added acovariance and acorrelation functions
## 99-10-10 ... fixed askew/akurtosis to avoid divide-by-zero errors
##              added aglm function (crude, but will be improved)
## 99-10-04 ... upgraded acumsum, ass, asummult, asamplevar, avar, etc. to
##                   all handle lists of 'dimension's and keepdims
##              REMOVED ar0, ar2, ar3, ar4 and replaced them with around
##              reinserted fixes for abetai to avoid math overflows
## 99-09-05 ... rewrote achisqprob/aerfcc/aksprob/afprob/abetacf/abetai to
##                   handle multi-dimensional arrays (whew!)
## 99-08-30 ... fixed l/amoment, l/askew, l/akurtosis per D'Agostino (1990)
##              added anormaltest per same reference
##              re-wrote azprob to calc arrays of probs all at once
## 99-08-22 ... edited attest_ind printing section so arrays could be rounded
## 99-08-19 ... fixed amean and aharmonicmean for non-error(!) overflow on
##                   short/byte arrays (mean of #s btw 100-300 = -150??)
## 99-08-09 ... fixed asum so that the None case works for Byte arrays
## 99-08-08 ... fixed 7/3 'improvement' to handle t-calcs on N-D arrays
## 99-07-03 ... improved attest_ind, attest_rel (zero-division errortrap)
## 99-06-24 ... fixed bug(?) in attest_ind (n1=a.shape[0])
## 04/11/99 ... added asignaltonoise, athreshold functions, changed all
##                   max/min in array section to N.maximum/N.minimum,
##                   fixed square_of_sums to prevent integer overflow
## 04/10/99 ... !!! Changed function name ... sumsquared ==> square_of_sums
## 03/18/99 ... Added ar0, ar2, ar3 and ar4 rounding functions
## 02/28/99 ... Fixed aobrientransform to return an array rather than a list
## 01/15/99 ... Essentially ceased updating list-versions of functions (!!!)
## 01/13/99 ... CHANGED TO VERSION 0.3
##              fixed bug in a/lmannwhitneyu p-value calculation
## 12/31/98 ... fixed variable-name bug in ldescribe
## 12/19/98 ... fixed bug in findwithin (fcns needed pstat. prefix)
## 12/16/98 ... changed amedianscore to return float (not array) for 1 score
## 12/14/98 ... added atmin and atmax functions
##              removed umath from import line (not needed)
##              l/ageometricmean modified to reduce chance of overflows (take
##                   nth root first, then multiply)
## 12/07/98 ... added __version__ variable (now 0.2)
##              removed all 'stats.' from anova() fcn
## 12/06/98 ... changed those functions (except shellsort) that altered
##                   arguments in-place ... cumsum, ranksort, ...
##              updated (and fixed some) doc-strings
## 12/01/98 ... added anova() function (requires NumPy)
##              incorporated Dispatch class
## 11/12/98 ... added functionality to amean, aharmonicmean, ageometricmean
##              added 'asum' function (added functionality to N.add.reduce)
##              fixed both moment and amoment (two errors)
##              changed name of skewness and askewness to skew and askew
##              fixed (a)histogram (which sometimes counted points <lowerlimit)
 
import pstat               # required 3rd party module
import math, string, copy  # required python modules
from types import *

__version__ = 0.6

############# DISPATCH CODE ##############
class Dispatch:
    """
    The Dispatch class, care of David Ascher, allows different functions to
    be called depending on the argument types.  This way, there can be one
    function name regardless of the argument type.  To access function doc
    in stats.py module, prefix the function with an 'l' or 'a' for list or
    array arguments, respectively.  That is, print stats.lmean.__doc__ or
    print stats.amean.__doc__ or whatever.
    """

    def __init__(self, *tuples):
        self._dispatch = {}
        for func, types in tuples:
            for t in types:
                if t in self._dispatch.keys():
                    raise ValueError, "can't have two dispatches on "+str(t)
                self._dispatch[t] = func
        self._types = self._dispatch.keys()

    def __call__(self, arg1, *args, **kw):
        if type(arg1) not in self._types:
            raise TypeError, "don't know how to dispatch %s arguments" % type(arg1)
        return apply(self._dispatch[type(arg1)], (arg1,) + args, kw)


##########################################################################
########################  LIST-BASED FUNCTIONS  ##########################
##########################################################################

### Define these regardless

####################################
#######  CENTRAL TENDENCY  #########
####################################
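The Dispatch pattern above can be sketched in modern Python 3 syntax. This is an illustrative re-implementation, not the module's own code; the `lmean`/`tmean` helpers and the `mean` dispatcher name are mine, chosen to mirror the l/a-prefix convention the docstring describes:

```python
class Dispatch:
    """Route a call to a different function based on the type of the
    first argument, mirroring the class described above."""
    def __init__(self, *pairs):
        self._dispatch = {}
        for func, types in pairs:
            for t in types:
                if t in self._dispatch:
                    raise ValueError("can't have two dispatches on " + str(t))
                self._dispatch[t] = func

    def __call__(self, arg1, *args, **kw):
        if type(arg1) not in self._dispatch:
            raise TypeError("don't know how to dispatch %s arguments" % type(arg1))
        return self._dispatch[type(arg1)](arg1, *args, **kw)

def lmean(seq):   # list version (hypothetical stand-in)
    return sum(seq) / float(len(seq))

def tmean(seq):   # tuple version, just to show a second dispatch route
    return sum(seq) / float(len(seq))

# one public name, two implementations, selected by argument type
mean = Dispatch((lmean, (list,)), (tmean, (tuple,)))
```

Calling `mean([1, 2, 3])` routes to `lmean`, while `mean((1, 2, 3))` routes to `tmean`, which is exactly the behavior the module's docstring describes for `stats.mean`.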
def lgeometricmean (inlist):
    """
    Calculates the geometric mean of the values in the passed list.
    That is:  n-th root of (x1 * x2 * ... * xn).  Assumes a '1D' list.

    Usage:   lgeometricmean(inlist)
    """
    mult = 1.0
    one_over_n = 1.0/len(inlist)
    for item in inlist:
        mult = mult * pow(item,one_over_n)
    return mult


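A minimal modern sketch of the same computation (my helper name, not the module's): taking the mean of the logs and exponentiating is algebraically equivalent for positive inputs and further reduces overflow risk:

```python
import math

def geometric_mean(xs):
    # n-th root of the product, computed as exp(mean of logs);
    # equivalent to the per-item pow() loop above for positive xs
    return math.exp(sum(math.log(x) for x in xs) / len(xs))
```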
def lharmonicmean (inlist):
    """
    Calculates the harmonic mean of the values in the passed list.
    That is:  n / (1/x1 + 1/x2 + ... + 1/xn).  Assumes a '1D' list.

    Usage:   lharmonicmean(inlist)
    """
    sum = 0
    for item in inlist:
        sum = sum + 1.0/item
    return len(inlist) / sum


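The same formula as a one-liner sketch (my helper name; note the function is undefined if any value is zero, which the original would also hit as a ZeroDivisionError):

```python
def harmonic_mean(xs):
    # n over the sum of reciprocals: n / (1/x1 + ... + 1/xn)
    return len(xs) / sum(1.0 / x for x in xs)
```

For example, the harmonic mean of 2 and 3 is 2 / (1/2 + 1/3) = 12/5 = 2.4.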
def lmean (inlist):
    """
    Returns the arithmetic mean of the values in the passed list.
    Assumes a '1D' list, but will function on the 1st dim of an array(!).

    Usage:   lmean(inlist)
    """
    sum = 0
    for item in inlist:
        sum = sum + item
    return sum/float(len(inlist))


def lmedian (inlist,numbins=1000):
    """
    Returns the computed median value of a list of numbers, given the
    number of bins to use for the histogram (more bins brings the computed value
    closer to the median score, default number of bins = 1000).  See G.W.
    Heiman's Basic Stats (1st Edition), or CRC Probability & Statistics.

    Usage:   lmedian (inlist, numbins=1000)
    """
    (hist, smallest, binsize, extras) = histogram(inlist,numbins,[min(inlist),max(inlist)]) # make histog
    cumhist = cumsum(hist)               # make cumulative histogram
    for i in range(len(cumhist)):        # get 1st(!) index holding 50%ile score
        if cumhist[i]>=len(inlist)/2.0:
            cfbin = i
            break
    LRL = smallest + binsize*cfbin       # get lower read limit of that bin
    cfbelow = cumhist[cfbin-1]
    freq = float(hist[cfbin])            # frequency IN the 50%ile bin
    median = LRL + ((len(inlist)/2.0 - cfbelow)/float(freq))*binsize # median formula
    return median


def lmedianscore (inlist):
    """
    Returns the 'middle' score of the passed list.  If there is an even
    number of scores, the mean of the 2 middle scores is returned.

    Usage:   lmedianscore(inlist)
    """
    newlist = copy.deepcopy(inlist)
    newlist.sort()
    if len(newlist) % 2 == 0:   # if even number of scores, average middle 2
        index = len(newlist)/2  # integer division correct
        median = float(newlist[index] + newlist[index-1]) /2
    else:
        index = len(newlist)/2  # int division gives mid value when count from 0
        median = newlist[index]
    return median


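The same middle-score logic in a short Python 3 sketch (my helper name; `//` replaces Python 2's truncating `/` on ints):

```python
def median_score(xs):
    # sort a copy, then take the middle element, or the mean of
    # the two middle elements when the count is even
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 0:
        return (s[mid - 1] + s[mid]) / 2.0
    return s[mid]
```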
def lmode(inlist):
    """
    Returns a list of the modal (most common) score(s) in the passed
    list.  If there is more than one such score, all are returned.  The
    bin-count for the mode(s) is also returned.

    Usage:   lmode(inlist)
    Returns: bin-count for mode(s), a list of modal value(s)
    """
    scores = pstat.unique(inlist)
    scores.sort()
    freq = []
    for item in scores:
        freq.append(inlist.count(item))
    maxfreq = max(freq)
    mode = []
    stillmore = 1
    while stillmore:
        try:
            indx = freq.index(maxfreq)
            mode.append(scores[indx])
            del freq[indx]
            del scores[indx]
        except ValueError:
            stillmore=0
    return maxfreq, mode
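The delete-and-rescan loop above can be expressed more directly with `collections.Counter` (a sketch under my naming, returning the same `(bin-count, modal values)` shape as `lmode`):

```python
from collections import Counter

def mode(xs):
    # count every distinct value, then keep all values tied at the
    # maximum count, sorted to match lmode's sorted-scores output
    counts = Counter(xs)
    maxfreq = max(counts.values())
    return maxfreq, sorted(v for v, c in counts.items() if c == maxfreq)
```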

####################################
############  MOMENTS  #############
####################################

def lmoment(inlist,moment=1):
    """
    Calculates the nth moment about the mean for a sample (defaults to
    the 1st moment).  Used to calculate coefficients of skewness and kurtosis.

    Usage:   lmoment(inlist,moment=1)
    Returns: appropriate moment (r) from ... 1/n * SUM((inlist(i)-mean)**r)
    """
    if moment == 1:
        return 0.0
    else:
        mn = mean(inlist)
        n = len(inlist)
        s = 0
        for x in inlist:
            s = s + (x-mn)**moment
        return s/float(n)


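The central-moment formula above, 1/n * SUM((x_i - mean)**r), in a compact sketch (my helper name). The 2nd moment is the descriptive variance, and skew/kurtosis below are built from the ratios m3/m2**1.5 and m4/m2**2:

```python
def central_moment(xs, r):
    # r-th moment about the mean: (1/n) * sum((x - mean)**r)
    mn = sum(xs) / float(len(xs))
    return sum((x - mn) ** r for x in xs) / float(len(xs))
```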
def lvariation(inlist):
    """
    Returns the coefficient of variation, as defined in CRC Standard
    Probability and Statistics, p.6.

    Usage:   lvariation(inlist)
    """
    return 100.0*samplestdev(inlist)/float(mean(inlist))


def lskew(inlist):
    """
    Returns the skewness of a distribution, as defined in Numerical
    Recipes (alternate defn in CRC Standard Probability and Statistics, p.6.)

    Usage:   lskew(inlist)
    """
    return moment(inlist,3)/pow(moment(inlist,2),1.5)


def lkurtosis(inlist):
    """
    Returns the kurtosis of a distribution, as defined in Numerical
    Recipes (alternate defn in CRC Standard Probability and Statistics, p.6.)

    Usage:   lkurtosis(inlist)
    """
    return moment(inlist,4)/pow(moment(inlist,2),2.0)


def ldescribe(inlist):
    """
    Returns some descriptive statistics of the passed list (assumed to be 1D).

    Usage:   ldescribe(inlist)
    Returns: n, (min,max), mean, standard deviation, skew, kurtosis
    """
    n = len(inlist)
    mm = (min(inlist),max(inlist))
    m = mean(inlist)
    sd = stdev(inlist)
    sk = skew(inlist)
    kurt = kurtosis(inlist)
    return n, mm, m, sd, sk, kurt


####################################
#######  FREQUENCY STATS  ##########
####################################

def litemfreq(inlist):
    """
    Returns a list of pairs.  Each pair consists of one of the scores in inlist
    and its frequency count.  Assumes a 1D list is passed.

    Usage:   litemfreq(inlist)
    Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies)
    """
    scores = pstat.unique(inlist)
    scores.sort()
    freq = []
    for item in scores:
        freq.append(inlist.count(item))
    return pstat.abut(scores, freq)


def lscoreatpercentile (inlist, percent):
    """
    Returns the score at a given percentile relative to the distribution
    given by inlist.

    Usage:   lscoreatpercentile(inlist,percent)
    """
    if percent > 1:
        print "\nDividing percent>1 by 100 in lscoreatpercentile().\n"
        percent = percent / 100.0
    targetcf = percent*len(inlist)
    h, lrl, binsize, extras = histogram(inlist)
    cumhist = cumsum(copy.deepcopy(h))
    for i in range(len(cumhist)):
        if cumhist[i] >= targetcf:
            break
    score = binsize * ((targetcf - cumhist[i-1]) / float(h[i])) + (lrl+binsize*i)
    return score


def lpercentileofscore (inlist, score,histbins=10,defaultlimits=None):
    """
    Returns the percentile value of a score relative to the distribution
    given by inlist.  Formula depends on the values used to histogram the data(!).

    Usage:   lpercentileofscore(inlist,score,histbins=10,defaultlimits=None)
    """
    h, lrl, binsize, extras = histogram(inlist,histbins,defaultlimits)
    cumhist = cumsum(copy.deepcopy(h))
    i = int((score - lrl)/float(binsize))
    pct = (cumhist[i-1]+((score-(lrl+binsize*i))/float(binsize))*h[i])/float(len(inlist)) * 100
    return pct


def lhistogram (inlist,numbins=10,defaultreallimits=None,printextras=0):
    """
    Returns (i) a list of histogram bin counts, (ii) the smallest value
    of the histogram binning, and (iii) the bin width (the last 2 are not
    necessarily integers).  Default number of bins is 10.  If no sequence object
    is given for defaultreallimits, the routine picks (usually non-pretty) bins
    spanning all the numbers in the inlist.

    Usage:   lhistogram (inlist, numbins=10, defaultreallimits=None, printextras=0)
    Returns: list of bin values, lowerreallimit, binsize, extrapoints
    """
    if (defaultreallimits <> None):
        if type(defaultreallimits) not in [ListType,TupleType] or len(defaultreallimits)==1: # only one limit given, assumed to be lower one & upper is calc'd
            lowerreallimit = defaultreallimits
            upperreallimit = 1.000001 * max(inlist)
        else: # assume both limits given
            lowerreallimit = defaultreallimits[0]
            upperreallimit = defaultreallimits[1]
        binsize = (upperreallimit-lowerreallimit)/float(numbins)
    else:     # no limits given for histogram, both must be calc'd
        estbinwidth=(max(inlist)-min(inlist))/float(numbins) +1e-6 #1=>cover all
        binsize = ((max(inlist)-min(inlist)+estbinwidth))/float(numbins)
        lowerreallimit = min(inlist) - binsize/2 #lower real limit,1st bin
    bins = [0]*(numbins)
    extrapoints = 0
    for num in inlist:
        try:
            if (num-lowerreallimit) < 0:
                extrapoints = extrapoints + 1
            else:
                bintoincrement = int((num-lowerreallimit)/float(binsize))
                bins[bintoincrement] = bins[bintoincrement] + 1
        except:
            extrapoints = extrapoints + 1
    if (extrapoints > 0 and printextras == 1):
        print '\nPoints outside given histogram range =',extrapoints
    return (bins, lowerreallimit, binsize, extrapoints)


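The binning logic can be sketched compactly in Python 3 (my helper and parameter names; same return shape as `lhistogram`, with out-of-range points counted as extras rather than raising):

```python
def histogram_counts(xs, numbins=10, limits=None):
    # Mirrors lhistogram's binning: when no limits are given, pad the
    # data range slightly so max(xs) falls inside the last bin.
    if limits is None:
        est = (max(xs) - min(xs)) / float(numbins) + 1e-6
        binsize = (max(xs) - min(xs) + est) / float(numbins)
        lower = min(xs) - binsize / 2.0
    else:
        lower, upper = limits
        binsize = (upper - lower) / float(numbins)
    bins = [0] * numbins
    extras = 0
    for x in xs:
        i = int((x - lower) / binsize)
        if 0 <= i < numbins:
            bins[i] += 1
        else:
            extras += 1          # point outside the histogram range
    return bins, lower, binsize, extras
```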
def lcumfreq(inlist,numbins=10,defaultreallimits=None):
    """
    Returns a cumulative frequency histogram, using the histogram function.

    Usage:   lcumfreq(inlist,numbins=10,defaultreallimits=None)
    Returns: list of cumfreq bin values, lowerreallimit, binsize, extrapoints
    """
    h,l,b,e = histogram(inlist,numbins,defaultreallimits)
    cumhist = cumsum(copy.deepcopy(h))
    return cumhist,l,b,e


def lrelfreq(inlist,numbins=10,defaultreallimits=None):
    """
    Returns a relative frequency histogram, using the histogram function.

    Usage:   lrelfreq(inlist,numbins=10,defaultreallimits=None)
    Returns: list of relative frequency bin values, lowerreallimit, binsize, extrapoints
    """
    h,l,b,e = histogram(inlist,numbins,defaultreallimits)
    for i in range(len(h)):
        h[i] = h[i]/float(len(inlist))
    return h,l,b,e


####################################
#####  VARIABILITY FUNCTIONS  ######
####################################

def lobrientransform(*args):
    """
    Computes a transform on input data (any number of columns).  Used to
    test for homogeneity of variance prior to running one-way stats.  From
    Maxwell and Delaney, p.112.

    Usage:   lobrientransform(*args)
    Returns: transformed data for use in an ANOVA
    """
    TINY = 1e-10
    k = len(args)
    n = [0.0]*k
    v = [0.0]*k
    m = [0.0]*k
    nargs = []
    for i in range(k):
        nargs.append(copy.deepcopy(args[i]))
        n[i] = float(len(nargs[i]))
        v[i] = var(nargs[i])
        m[i] = mean(nargs[i])
    for j in range(k):
        for i in range(n[j]):
            t1 = (n[j]-1.5)*n[j]*(nargs[j][i]-m[j])**2
            t2 = 0.5*v[j]*(n[j]-1.0)
            t3 = (n[j]-1.0)*(n[j]-2.0)
            nargs[j][i] = (t1-t2) / float(t3)
    check = 1
    for j in range(k):
        if v[j] - mean(nargs[j]) > TINY:
            check = 0
    if check <> 1:
        raise ValueError, 'Problem in obrientransform.'
    else:
        return nargs


def lsamplevar (inlist):
    """
    Returns the variance of the values in the passed list using
    N for the denominator (i.e., DESCRIBES the sample variance only).

    Usage:   lsamplevar(inlist)
    """
    n = len(inlist)
    mn = mean(inlist)
    deviations = []
    for item in inlist:
        deviations.append(item-mn)
    return ss(deviations)/float(n)


def lsamplestdev (inlist):
    """
    Returns the standard deviation of the values in the passed list using
    N for the denominator (i.e., DESCRIBES the sample stdev only).

    Usage:   lsamplestdev(inlist)
    """
    return math.sqrt(samplevar(inlist))


def lcov (x,y, keepdims=0):
    """
    Returns the estimated covariance of the values in the two passed
    lists, using N-1 in the denominator (i.e., estimates the population
    covariance).  The keepdims argument is accepted for symmetry with the
    array version but is unused here.

    Usage:   lcov(x,y,keepdims=0)
    """
    n = len(x)
    xmn = mean(x)
    ymn = mean(y)
    xdeviations = [0]*len(x)
    ydeviations = [0]*len(y)
    for i in range(len(x)):
        xdeviations[i] = x[i] - xmn
        ydeviations[i] = y[i] - ymn
    ss = 0.0
    for i in range(len(xdeviations)):
        ss = ss + xdeviations[i]*ydeviations[i]
    return ss/float(n-1)


def lvar (inlist):
    """
    Returns the variance of the values in the passed list using N-1
    for the denominator (i.e., for estimating population variance).

    Usage:   lvar(inlist)
    """
    n = len(inlist)
    mn = mean(inlist)
    deviations = [0]*len(inlist)
    for i in range(len(inlist)):
        deviations[i] = inlist[i] - mn
    return ss(deviations)/float(n-1)


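The N vs. N-1 distinction that separates `lsamplevar` from `lvar` can be shown side by side (my helper names, a Python 3 sketch):

```python
def sample_var(xs):
    # descriptive variance: divides the sum of squared deviations by N
    mn = sum(xs) / float(len(xs))
    return sum((x - mn) ** 2 for x in xs) / len(xs)

def est_var(xs):
    # inferential (unbiased) variance: divides by N-1
    mn = sum(xs) / float(len(xs))
    return sum((x - mn) ** 2 for x in xs) / (len(xs) - 1)
```

For [1, 2, 3, 4] the sum of squared deviations is 5.0, giving 5/4 = 1.25 descriptively and 5/3 as the population estimate.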
def lstdev (inlist):
    """
    Returns the standard deviation of the values in the passed list
    using N-1 in the denominator (i.e., to estimate population stdev).

    Usage:   lstdev(inlist)
    """
    return math.sqrt(var(inlist))


def lsterr(inlist):
    """
    Returns the standard error of the values in the passed list using N-1
    in the denominator (i.e., to estimate population standard error).

    Usage:   lsterr(inlist)
    """
    return stdev(inlist) / float(math.sqrt(len(inlist)))


def lsem (inlist):
    """
    Returns the estimated standard error of the mean (sx-bar) of the
    values in the passed list.  sem = stdev / sqrt(n)

    Usage:   lsem(inlist)
    """
    sd = stdev(inlist)
    n = len(inlist)
    return sd/math.sqrt(n)


def lz (inlist, score):
    """
    Returns the z-score for a given input score, given that score and the
    list from which that score came.  Not appropriate for population calculations.

    Usage:   lz(inlist, score)
    """
    z = (score-mean(inlist))/samplestdev(inlist)
    return z


def lzs (inlist):
    """
    Returns a list of z-scores, one for each score in the passed list.

    Usage:   lzs(inlist)
    """
    zscores = []
    for item in inlist:
        zscores.append(z(inlist,item))
    return zscores


####################################
#######  TRIMMING FUNCTIONS  #######
####################################

def ltrimboth (l,proportiontocut):
    """
    Slices off the passed proportion of items from BOTH ends of the passed
    list (i.e., with proportiontocut=0.1, slices 'leftmost' 10% AND 'rightmost'
    10% of scores.  Assumes list is sorted by magnitude.  Slices off LESS if
    proportion results in a non-integer slice index (i.e., conservatively
    slices off proportiontocut).

    Usage:   ltrimboth (l,proportiontocut)
    Returns: trimmed version of list l
    """
    lowercut = int(proportiontocut*len(l))
    uppercut = len(l) - lowercut
    return l[lowercut:uppercut]


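The conservative trimming behavior is easy to see in a small sketch (my helper name): `int()` truncates, so a proportion that does not land on an integer index cuts fewer points, never more:

```python
def trimboth(sorted_list, proportiontocut):
    # int() truncation means we never cut more than the requested share
    lowercut = int(proportiontocut * len(sorted_list))
    return sorted_list[lowercut:len(sorted_list) - lowercut]
```

With 10 sorted points, both 0.10 and 0.19 cut exactly one point from each end.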
def ltrim1 (l,proportiontocut,tail='right'):
    """
    Slices off the passed proportion of items from ONE end of the passed
    list (i.e., if proportiontocut=0.1, slices off 'leftmost' or 'rightmost'
    10% of scores).  Slices off LESS if proportion results in a non-integer
    slice index (i.e., conservatively slices off proportiontocut).

    Usage:   ltrim1 (l,proportiontocut,tail='right')  or set tail='left'
    Returns: trimmed version of list l
    """
    if tail == 'right':
        lowercut = 0
        uppercut = len(l) - int(proportiontocut*len(l))
    elif tail == 'left':
        lowercut = int(proportiontocut*len(l))
        uppercut = len(l)
    return l[lowercut:uppercut]


####################################
#####  CORRELATION FUNCTIONS  ######
####################################

def lpaired(x,y):
    """
    Interactively determines the type of data and then runs the
    appropriate statistic for paired group data.

    Usage:   lpaired(x,y)
    Returns: appropriate statistic name, value, and probability
    """
    samples = ''
    while samples not in ['i','r','I','R','c','C']:
        print '\nIndependent or related samples, or correlation (i,r,c): ',
        samples = raw_input()

    if samples in ['i','I','r','R']:
        print '\nComparing variances ...',
        # USE O'BRIEN'S TEST FOR HOMOGENEITY OF VARIANCE, Maxwell & delaney, p.112
        r = obrientransform(x,y)
        f,p = F_oneway(pstat.colex(r,0),pstat.colex(r,1))
        if p<0.05:
            vartype='unequal, p='+str(round(p,4))
        else:
            vartype='equal'
        print vartype
        if samples in ['i','I']:
            if vartype[0]=='e':
                t,p = ttest_ind(x,y,0)
                print '\nIndependent samples t-test: ', round(t,4),round(p,4)
            else:
                if len(x)>20 or len(y)>20:
                    z,p = ranksums(x,y)
                    print '\nRank Sums test (NONparametric, n>20): ', round(z,4),round(p,4)
                else:
                    u,p = mannwhitneyu(x,y)
                    print '\nMann-Whitney U-test (NONparametric, ns<20): ', round(u,4),round(p,4)
        else:  # RELATED SAMPLES
            if vartype[0]=='e':
                t,p = ttest_rel(x,y,0)
                print '\nRelated samples t-test: ', round(t,4),round(p,4)
            else:
                t,p = ranksums(x,y)
                print '\nWilcoxon T-test (NONparametric): ', round(t,4),round(p,4)
    else:  # CORRELATION ANALYSIS
        corrtype = ''
        while corrtype not in ['c','C','r','R','d','D']:
            print '\nIs the data Continuous, Ranked, or Dichotomous (c,r,d): ',
            corrtype = raw_input()
        if corrtype in ['c','C']:
            m,b,r,p,see = linregress(x,y)
            print '\nLinear regression for continuous variables ...'
            lol = [['Slope','Intercept','r','Prob','SEestimate'],[round(m,4),round(b,4),round(r,4),round(p,4),round(see,4)]]
            pstat.printcc(lol)
        elif corrtype in ['r','R']:
            r,p = spearmanr(x,y)
            print '\nCorrelation for ranked variables ...'
            print "Spearman's r: ",round(r,4),round(p,4)
        else:  # DICHOTOMOUS
            r,p = pointbiserialr(x,y)
            print '\nAssuming x contains a dichotomous variable ...'
            print 'Point Biserial r: ',round(r,4),round(p,4)
    print '\n\n'
    return None


857 -def lpearsonr(x,y):
858 """ 859 Calculates a Pearson correlation coefficient and the associated 860 probability value. Taken from Heiman's Basic Statistics for the Behav. 861 Sci (2nd), p.195. 862 863 Usage: lpearsonr(x,y) where x and y are equal-length lists 864 Returns: Pearson's r value, two-tailed p-value 865 """ 866 TINY = 1.0e-30 867 if len(x) <> len(y): 868 raise ValueError, 'Input values not paired in pearsonr. Aborting.' 869 n = len(x) 870 x = map(float,x) 871 y = map(float,y) 872 xmean = mean(x) 873 ymean = mean(y) 874 r_num = n*(summult(x,y)) - sum(x)*sum(y) 875 r_den = math.sqrt((n*ss(x) - square_of_sums(x))*(n*ss(y)-square_of_sums(y))) 876 r = (r_num / r_den) # denominator already a float 877 df = n-2 878 t = r*math.sqrt(df/((1.0-r+TINY)*(1.0+r+TINY))) 879 prob = betai(0.5*df,0.5,df/float(df+t*t)) 880 return r, prob
881 882
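For readers porting this routine, the same sum-based formula can be written standalone in modern Python 3 (the listing itself is Python 2). `pearson_r` is a hypothetical name for illustration, not part of this module:

```python
import math

def pearson_r(x, y):
    # Same computational formula as lpearsonr above:
    # r = (n*sum(xy) - sum(x)*sum(y)) /
    #     sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den
```

A perfectly linear pair gives `r = 1.0` (or `-1.0` for a decreasing line), matching the r returned by `lpearsonr`.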
883 -def llincc(x,y):
884 """ 885 Calculates Lin's concordance correlation coefficient. 886 887 Usage: alincc(x,y) where x, y are equal-length arrays 888 Returns: Lin's CC 889 """ 890 covar = lcov(x,y)*(len(x)-1)/float(len(x)) # correct denom to n 891 xvar = lvar(x)*(len(x)-1)/float(len(x)) # correct denom to n 892 yvar = lvar(y)*(len(y)-1)/float(len(y)) # correct denom to n 893 lincc = (2 * covar) / ((xvar+yvar) +((amean(x)-amean(y))**2)) 894 return lincc
895 896
897 -def lspearmanr(x,y):
898 """ 899 Calculates a Spearman rank-order correlation coefficient. Taken 900 from Heiman's Basic Statistics for the Behav. Sci (1st), p.192. 901 902 Usage: lspearmanr(x,y) where x and y are equal-length lists 903 Returns: Spearman's r, two-tailed p-value 904 """ 905 TINY = 1e-30 906 if len(x) <> len(y): 907 raise ValueError, 'Input values not paired in spearmanr. Aborting.' 908 n = len(x) 909 rankx = rankdata(x) 910 ranky = rankdata(y) 911 dsq = sumdiffsquared(rankx,ranky) 912 rs = 1 - 6*dsq / float(n*(n**2-1)) 913 t = rs * math.sqrt((n-2) / ((rs+1.0)*(1.0-rs))) 914 df = n-2 915 probrs = betai(0.5*df,0.5,df/(df+t*t)) # t already a float 916 # probability values for rs are from part 2 of the spearman function in 917 # Numerical Recipies, p.510. They are close to tables, but not exact. (?) 918 return rs, probrs
919 920
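The ranked-difference formula is easy to sketch on its own in modern Python 3; `spearman_rho` is a hypothetical name, and this sketch assumes no tied values (the module's `rankdata` assigns average ranks to ties):

```python
def spearman_rho(x, y):
    # rs = 1 - 6*sum(d^2) / (n*(n^2-1)), where d is the difference
    # between the rank of x[i] and the rank of y[i].
    def ranks(v):
        # 1-based ranks; assumes all values in v are distinct
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    dsq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * dsq / (n * (n * n - 1))
```

Monotonically increasing pairs give `1.0`; a reversed ordering gives `-1.0`.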
921 -def lpointbiserialr(x,y):
922 """ 923 Calculates a point-biserial correlation coefficient and the associated 924 probability value. Taken from Heiman's Basic Statistics for the Behav. 925 Sci (1st), p.194. 926 927 Usage: lpointbiserialr(x,y) where x,y are equal-length lists 928 Returns: Point-biserial r, two-tailed p-value 929 """ 930 TINY = 1e-30 931 if len(x) <> len(y): 932 raise ValueError, 'INPUT VALUES NOT PAIRED IN pointbiserialr. ABORTING.' 933 data = pstat.abut(x,y) 934 categories = pstat.unique(x) 935 if len(categories) <> 2: 936 raise ValueError, "Exactly 2 categories required for pointbiserialr()." 937 else: # there are 2 categories, continue 938 codemap = pstat.abut(categories,range(2)) 939 recoded = pstat.recode(data,codemap,0) 940 x = pstat.linexand(data,0,categories[0]) 941 y = pstat.linexand(data,0,categories[1]) 942 xmean = mean(pstat.colex(x,1)) 943 ymean = mean(pstat.colex(y,1)) 944 n = len(data) 945 adjust = math.sqrt((len(x)/float(n))*(len(y)/float(n))) 946 rpb = (ymean - xmean)/samplestdev(pstat.colex(data,1))*adjust 947 df = n-2 948 t = rpb*math.sqrt(df/((1.0-rpb+TINY)*(1.0+rpb+TINY))) 949 prob = betai(0.5*df,0.5,df/(df+t*t)) # t already a float 950 return rpb, prob
951 952
953 -def lkendalltau(x,y):
954 """ 955 Calculates Kendall's tau ... correlation of ordinal data. Adapted 956 from function kendl1 in Numerical Recipies. Needs good test-routine.@@@ 957 958 Usage: lkendalltau(x,y) 959 Returns: Kendall's tau, two-tailed p-value 960 """ 961 n1 = 0 962 n2 = 0 963 iss = 0 964 for j in range(len(x)-1): 965 for k in range(j,len(y)): 966 a1 = x[j] - x[k] 967 a2 = y[j] - y[k] 968 aa = a1 * a2 969 if (aa): # neither list has a tie 970 n1 = n1 + 1 971 n2 = n2 + 1 972 if aa > 0: 973 iss = iss + 1 974 else: 975 iss = iss -1 976 else: 977 if (a1): 978 n1 = n1 + 1 979 else: 980 n2 = n2 + 1 981 tau = iss / math.sqrt(n1*n2) 982 svar = (4.0*len(x)+10.0) / (9.0*len(x)*(len(x)-1)) 983 z = tau / math.sqrt(svar) 984 prob = erfcc(abs(z)/1.4142136) 985 return tau, prob
986 987
988 -def llinregress(x,y):
989 """ 990 Calculates a regression line on x,y pairs. 991 992 Usage: llinregress(x,y) x,y are equal-length lists of x-y coordinates 993 Returns: slope, intercept, r, two-tailed prob, sterr-of-estimate 994 """ 995 TINY = 1.0e-20 996 if len(x) <> len(y): 997 raise ValueError, 'Input values not paired in linregress. Aborting.' 998 n = len(x) 999 x = map(float,x) 1000 y = map(float,y) 1001 xmean = mean(x) 1002 ymean = mean(y) 1003 r_num = float(n*(summult(x,y)) - sum(x)*sum(y)) 1004 r_den = math.sqrt((n*ss(x) - square_of_sums(x))*(n*ss(y)-square_of_sums(y))) 1005 r = r_num / r_den 1006 z = 0.5*math.log((1.0+r+TINY)/(1.0-r+TINY)) 1007 df = n-2 1008 t = r*math.sqrt(df/((1.0-r+TINY)*(1.0+r+TINY))) 1009 prob = betai(0.5*df,0.5,df/(df+t*t)) 1010 slope = r_num / float(n*ss(x) - square_of_sums(x)) 1011 intercept = ymean - slope*xmean 1012 sterrest = math.sqrt(1-r*r)*samplestdev(y) 1013 return slope, intercept, r, prob, sterrest
1014 1015 1016 #################################### 1017 ##### INFERENTIAL STATISTICS ##### 1018 #################################### 1019
1020 -def lttest_1samp(a,popmean,printit=0,name='Sample',writemode='a'):
1021 """ 1022 Calculates the t-obtained for the independent samples T-test on ONE group 1023 of scores a, given a population mean. If printit=1, results are printed 1024 to the screen. If printit='filename', the results are output to 'filename' 1025 using the given writemode (default=append). Returns t-value, and prob. 1026 1027 Usage: lttest_1samp(a,popmean,Name='Sample',printit=0,writemode='a') 1028 Returns: t-value, two-tailed prob 1029 """ 1030 x = mean(a) 1031 v = var(a) 1032 n = len(a) 1033 df = n-1 1034 svar = ((n-1)*v)/float(df) 1035 t = (x-popmean)/math.sqrt(svar*(1.0/n)) 1036 prob = betai(0.5*df,0.5,float(df)/(df+t*t)) 1037 1038 if printit <> 0: 1039 statname = 'Single-sample T-test.' 1040 outputpairedstats(printit,writemode, 1041 'Population','--',popmean,0,0,0, 1042 name,n,x,v,min(a),max(a), 1043 statname,t,prob) 1044 return t,prob
1045 1046
1047 -def lttest_ind (a, b, printit=0, name1='Samp1', name2='Samp2', writemode='a'):
1048 """ 1049 Calculates the t-obtained T-test on TWO INDEPENDENT samples of 1050 scores a, and b. From Numerical Recipies, p.483. If printit=1, results 1051 are printed to the screen. If printit='filename', the results are output 1052 to 'filename' using the given writemode (default=append). Returns t-value, 1053 and prob. 1054 1055 Usage: lttest_ind(a,b,printit=0,name1='Samp1',name2='Samp2',writemode='a') 1056 Returns: t-value, two-tailed prob 1057 """ 1058 x1 = mean(a) 1059 x2 = mean(b) 1060 v1 = stdev(a)**2 1061 v2 = stdev(b)**2 1062 n1 = len(a) 1063 n2 = len(b) 1064 df = n1+n2-2 1065 svar = ((n1-1)*v1+(n2-1)*v2)/float(df) 1066 t = (x1-x2)/math.sqrt(svar*(1.0/n1 + 1.0/n2)) 1067 prob = betai(0.5*df,0.5,df/(df+t*t)) 1068 1069 if printit <> 0: 1070 statname = 'Independent samples T-test.' 1071 outputpairedstats(printit,writemode, 1072 name1,n1,x1,v1,min(a),max(a), 1073 name2,n2,x2,v2,min(b),max(b), 1074 statname,t,prob) 1075 return t,prob
1076 1077
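The pooled-variance t statistic used by `lttest_ind` is straightforward to reproduce in modern Python 3; `t_ind` is a hypothetical name and the sketch returns the statistic and df without the p-value (which needs `betai`):

```python
import math

def t_ind(a, b):
    # Independent-samples t with pooled variance, as in lttest_ind.
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((v - m1) ** 2 for v in a) / (n1 - 1)  # unbiased sample variances
    v2 = sum((v - m2) ** 2 for v in b) / (n2 - 1)
    df = n1 + n2 - 2
    svar = ((n1 - 1) * v1 + (n2 - 1) * v2) / df    # pooled variance
    t = (m1 - m2) / math.sqrt(svar * (1.0 / n1 + 1.0 / n2))
    return t, df
```

Identical samples give t = 0; shifting one sample by a constant produces a nonzero t with df = n1 + n2 - 2.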
1078 -def lttest_rel (a,b,printit=0,name1='Sample1',name2='Sample2',writemode='a'):
1079 """ 1080 Calculates the t-obtained T-test on TWO RELATED samples of scores, 1081 a and b. From Numerical Recipies, p.483. If printit=1, results are 1082 printed to the screen. If printit='filename', the results are output to 1083 'filename' using the given writemode (default=append). Returns t-value, 1084 and prob. 1085 1086 Usage: lttest_rel(a,b,printit=0,name1='Sample1',name2='Sample2',writemode='a') 1087 Returns: t-value, two-tailed prob 1088 """ 1089 if len(a)<>len(b): 1090 raise ValueError, 'Unequal length lists in ttest_rel.' 1091 x1 = mean(a) 1092 x2 = mean(b) 1093 v1 = var(a) 1094 v2 = var(b) 1095 n = len(a) 1096 cov = 0 1097 for i in range(len(a)): 1098 cov = cov + (a[i]-x1) * (b[i]-x2) 1099 df = n-1 1100 cov = cov / float(df) 1101 sd = math.sqrt((v1+v2 - 2.0*cov)/float(n)) 1102 t = (x1-x2)/sd 1103 prob = betai(0.5*df,0.5,df/(df+t*t)) 1104 1105 if printit <> 0: 1106 statname = 'Related samples T-test.' 1107 outputpairedstats(printit,writemode, 1108 name1,n,x1,v1,min(a),max(a), 1109 name2,n,x2,v2,min(b),max(b), 1110 statname,t,prob) 1111 return t, prob
1112 1113
1114 -def lchisquare(f_obs,f_exp=None):
1115 """ 1116 Calculates a one-way chi square for list of observed frequencies and returns 1117 the result. If no expected frequencies are given, the total N is assumed to 1118 be equally distributed across all groups. 1119 1120 Usage: lchisquare(f_obs, f_exp=None) f_obs = list of observed cell freq. 1121 Returns: chisquare-statistic, associated p-value 1122 """ 1123 k = len(f_obs) # number of groups 1124 if f_exp == None: 1125 f_exp = [sum(f_obs)/float(k)] * len(f_obs) # create k bins with = freq. 1126 chisq = 0 1127 for i in range(len(f_obs)): 1128 chisq = chisq + (f_obs[i]-f_exp[i])**2 / float(f_exp[i]) 1129 return chisq, chisqprob(chisq, k-1)
1130 1131
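The statistic itself (without the `chisqprob` tail probability) is a one-liner in modern Python 3; `chisquare_stat` is a hypothetical name for illustration:

```python
def chisquare_stat(f_obs, f_exp=None):
    # One-way chi-square: sum over cells of (obs - exp)^2 / exp.
    # Expected frequencies default to an equal split of N, as in lchisquare.
    k = len(f_obs)
    if f_exp is None:
        f_exp = [sum(f_obs) / float(k)] * k
    return sum((o - e) ** 2 / float(e) for o, e in zip(f_obs, f_exp))
```

Observations exactly matching expectations give 0; e.g. observed [10, 20, 30] against expected [20, 20, 20] gives 100/20 + 0 + 100/20 = 10.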
1132 -def lks_2samp (data1,data2):
1133 """ 1134 Computes the Kolmogorov-Smirnof statistic on 2 samples. From 1135 Numerical Recipies in C, page 493. 1136 1137 Usage: lks_2samp(data1,data2) data1&2 are lists of values for 2 conditions 1138 Returns: KS D-value, associated p-value 1139 """ 1140 j1 = 0 1141 j2 = 0 1142 fn1 = 0.0 1143 fn2 = 0.0 1144 n1 = len(data1) 1145 n2 = len(data2) 1146 en1 = n1 1147 en2 = n2 1148 d = 0.0 1149 data1.sort() 1150 data2.sort() 1151 while j1 < n1 and j2 < n2: 1152 d1=data1[j1] 1153 d2=data2[j2] 1154 if d1 <= d2: 1155 fn1 = (j1)/float(en1) 1156 j1 = j1 + 1 1157 if d2 <= d1: 1158 fn2 = (j2)/float(en2) 1159 j2 = j2 + 1 1160 dt = (fn2-fn1) 1161 if math.fabs(dt) > math.fabs(d): 1162 d = dt 1163 try: 1164 en = math.sqrt(en1*en2/float(en1+en2)) 1165 prob = ksprob((en+0.12+0.11/en)*abs(d)) 1166 except: 1167 prob = 1.0 1168 return d, prob
1169 1170
1171 -def lmannwhitneyu(x,y):
1172 """ 1173 Calculates a Mann-Whitney U statistic on the provided scores and 1174 returns the result. Use only when the n in each condition is < 20 and 1175 you have 2 independent samples of ranks. NOTE: Mann-Whitney U is 1176 significant if the u-obtained is LESS THAN or equal to the critical 1177 value of U found in the tables. Equivalent to Kruskal-Wallis H with 1178 just 2 groups. 1179 1180 Usage: lmannwhitneyu(data) 1181 Returns: u-statistic, one-tailed p-value (i.e., p(z(U))) 1182 """ 1183 n1 = len(x) 1184 n2 = len(y) 1185 ranked = rankdata(x+y) 1186 rankx = ranked[0:n1] # get the x-ranks 1187 ranky = ranked[n1:] # the rest are y-ranks 1188 u1 = n1*n2 + (n1*(n1+1))/2.0 - sum(rankx) # calc U for x 1189 u2 = n1*n2 - u1 # remainder is U for y 1190 bigu = max(u1,u2) 1191 smallu = min(u1,u2) 1192 T = math.sqrt(tiecorrect(ranked)) # correction factor for tied scores 1193 if T == 0: 1194 raise ValueError, 'All numbers are identical in lmannwhitneyu' 1195 sd = math.sqrt(T*n1*n2*(n1+n2+1)/12.0) 1196 z = abs((bigu-n1*n2/2.0) / sd) # normal approximation for prob calc 1197 return smallu, 1.0 - zprob(z)
1198 1199
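The rank-sum arithmetic behind U can be sketched in modern Python 3; `mann_whitney_u` is a hypothetical name, and the sketch assumes no tied values across the pooled sample (the module handles ties via average ranks plus `tiecorrect`):

```python
def mann_whitney_u(x, y):
    # U1 = n1*n2 + n1*(n1+1)/2 - sum(ranks of x in the pooled sample);
    # U2 = n1*n2 - U1; report the smaller of the two, as lmannwhitneyu does.
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks, no ties
    n1, n2 = len(x), len(y)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2.0 - sum(rank[v] for v in x)
    return min(u1, n1 * n2 - u1)
```

Completely separated samples give U = 0 (maximal evidence of a shift); perfectly interleaved samples give a U near n1*n2/2.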
1200 -def ltiecorrect(rankvals):
1201 """ 1202 Corrects for ties in Mann Whitney U and Kruskal Wallis H tests. See 1203 Siegel, S. (1956) Nonparametric Statistics for the Behavioral Sciences. 1204 New York: McGraw-Hill. Code adapted from |Stat rankind.c code. 1205 1206 Usage: ltiecorrect(rankvals) 1207 Returns: T correction factor for U or H 1208 """ 1209 sorted,posn = shellsort(rankvals) 1210 n = len(sorted) 1211 T = 0.0 1212 i = 0 1213 while (i<n-1): 1214 if sorted[i] == sorted[i+1]: 1215 nties = 1 1216 while (i<n-1) and (sorted[i] == sorted[i+1]): 1217 nties = nties +1 1218 i = i +1 1219 T = T + nties**3 - nties 1220 i = i+1 1221 T = T / float(n**3-n) 1222 return 1.0 - T
1223 1224
1225 -def lranksums(x,y):
1226 """ 1227 Calculates the rank sums statistic on the provided scores and 1228 returns the result. Use only when the n in each condition is > 20 and you 1229 have 2 independent samples of ranks. 1230 1231 Usage: lranksums(x,y) 1232 Returns: a z-statistic, two-tailed p-value 1233 """ 1234 n1 = len(x) 1235 n2 = len(y) 1236 alldata = x+y 1237 ranked = rankdata(alldata) 1238 x = ranked[:n1] 1239 y = ranked[n1:] 1240 s = sum(x) 1241 expected = n1*(n1+n2+1) / 2.0 1242 z = (s - expected) / math.sqrt(n1*n2*(n1+n2+1)/12.0) 1243 prob = 2*(1.0 -zprob(abs(z))) 1244 return z, prob
1245 1246
1247 -def lwilcoxont(x,y):
1248 """ 1249 Calculates the Wilcoxon T-test for related samples and returns the 1250 result. A non-parametric T-test. 1251 1252 Usage: lwilcoxont(x,y) 1253 Returns: a t-statistic, two-tail probability estimate 1254 """ 1255 if len(x) <> len(y): 1256 raise ValueError, 'Unequal N in wilcoxont. Aborting.' 1257 d=[] 1258 for i in range(len(x)): 1259 diff = x[i] - y[i] 1260 if diff <> 0: 1261 d.append(diff) 1262 count = len(d) 1263 absd = map(abs,d) 1264 absranked = rankdata(absd) 1265 r_plus = 0.0 1266 r_minus = 0.0 1267 for i in range(len(absd)): 1268 if d[i] < 0: 1269 r_minus = r_minus + absranked[i] 1270 else: 1271 r_plus = r_plus + absranked[i] 1272 wt = min(r_plus, r_minus) 1273 mn = count * (count+1) * 0.25 1274 se = math.sqrt(count*(count+1)*(2.0*count+1.0)/24.0) 1275 z = math.fabs(wt-mn) / se 1276 prob = 2*(1.0 -zprob(abs(z))) 1277 return wt, prob
1278 1279
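The signed-rank T computed by `lwilcoxont` can be sketched in modern Python 3; `wilcoxon_t` is a hypothetical name, and this sketch assumes the nonzero |differences| are distinct (no tied ranks):

```python
def wilcoxon_t(x, y):
    # Drop zero differences, rank the absolute differences, then take
    # the smaller of the positive/negative rank sums, as in lwilcoxont.
    d = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    r_plus = r_minus = 0.0
    for rank, i in enumerate(order, start=1):
        if d[i] < 0:
            r_minus += rank
        else:
            r_plus += rank
    return min(r_plus, r_minus)
```

For x = [10, 20, 30], y = [11, 18, 26] the differences are [-1, 2, 4], ranked 1, 2, 3, so the negative rank sum is 1 and T = 1.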
1280 -def lkruskalwallish(*args):
1281 """ 1282 The Kruskal-Wallis H-test is a non-parametric ANOVA for 3 or more 1283 groups, requiring at least 5 subjects in each group. This function 1284 calculates the Kruskal-Wallis H-test for 3 or more independent samples 1285 and returns the result. 1286 1287 Usage: lkruskalwallish(*args) 1288 Returns: H-statistic (corrected for ties), associated p-value 1289 """ 1290 args = list(args) 1291 n = [0]*len(args) 1292 all = [] 1293 n = map(len,args) 1294 for i in range(len(args)): 1295 all = all + args[i] 1296 ranked = rankdata(all) 1297 T = tiecorrect(ranked) 1298 for i in range(len(args)): 1299 args[i] = ranked[0:n[i]] 1300 del ranked[0:n[i]] 1301 rsums = [] 1302 for i in range(len(args)): 1303 rsums.append(sum(args[i])**2) 1304 rsums[i] = rsums[i] / float(n[i]) 1305 ssbn = sum(rsums) 1306 totaln = sum(n) 1307 h = 12.0 / (totaln*(totaln+1)) * ssbn - 3*(totaln+1) 1308 df = len(args) - 1 1309 if T == 0: 1310 raise ValueError, 'All numbers are identical in lkruskalwallish' 1311 h = h / float(T) 1312 return h, chisqprob(h,df)
1313 1314
1315 -def lfriedmanchisquare(*args):
1316 """ 1317 Friedman Chi-Square is a non-parametric, one-way within-subjects 1318 ANOVA. This function calculates the Friedman Chi-square test for repeated 1319 measures and returns the result, along with the associated probability 1320 value. It assumes 3 or more repeated measures. Only 3 levels requires a 1321 minimum of 10 subjects in the study. Four levels requires 5 subjects per 1322 level(??). 1323 1324 Usage: lfriedmanchisquare(*args) 1325 Returns: chi-square statistic, associated p-value 1326 """ 1327 k = len(args) 1328 if k < 3: 1329 raise ValueError, 'Less than 3 levels. Friedman test not appropriate.' 1330 n = len(args[0]) 1331 data = apply(pstat.abut,tuple(args)) 1332 for i in range(len(data)): 1333 data[i] = rankdata(data[i]) 1334 ssbn = 0 1335 for i in range(k): 1336 ssbn = ssbn + sum(args[i])**2 1337 chisq = 12.0 / (k*n*(k+1)) * ssbn - 3*n*(k+1) 1338 return chisq, chisqprob(chisq,k-1)
1339 1340 1341 #################################### 1342 #### PROBABILITY CALCULATIONS #### 1343 #################################### 1344
1345 -def lchisqprob(chisq,df):
1346 """ 1347 Returns the (1-tailed) probability value associated with the provided 1348 chi-square value and df. Adapted from chisq.c in Gary Perlman's |Stat. 1349 1350 Usage: lchisqprob(chisq,df) 1351 """ 1352 BIG = 20.0 1353 def ex(x): 1354 BIG = 20.0 1355 if x < -BIG: 1356 return 0.0 1357 else: 1358 return math.exp(x)
1359 1360 if chisq <=0 or df < 1: 1361 return 1.0 1362 a = 0.5 * chisq 1363 if df%2 == 0: 1364 even = 1 1365 else: 1366 even = 0 1367 if df > 1: 1368 y = ex(-a) 1369 if even: 1370 s = y 1371 else: 1372 s = 2.0 * zprob(-math.sqrt(chisq)) 1373 if (df > 2): 1374 chisq = 0.5 * (df - 1.0) 1375 if even: 1376 z = 1.0 1377 else: 1378 z = 0.5 1379 if a > BIG: 1380 if even: 1381 e = 0.0 1382 else: 1383 e = math.log(math.sqrt(math.pi)) 1384 c = math.log(a) 1385 while (z <= chisq): 1386 e = math.log(z) + e 1387 s = s + ex(c*z-a-e) 1388 z = z + 1.0 1389 return s 1390 else: 1391 if even: 1392 e = 1.0 1393 else: 1394 e = 1.0 / math.sqrt(math.pi) / math.sqrt(a) 1395 c = 0.0 1396 while (z <= chisq): 1397 e = e * (a/float(z)) 1398 c = c + e 1399 z = z + 1.0 1400 return (c*y+s) 1401 else: 1402 return s 1403 1404
1405 -def lerfcc(x):
1406 """ 1407 Returns the complementary error function erfc(x) with fractional 1408 error everywhere less than 1.2e-7. Adapted from Numerical Recipies. 1409 1410 Usage: lerfcc(x) 1411 """ 1412 z = abs(x) 1413 t = 1.0 / (1.0+0.5*z) 1414 ans = t * math.exp(-z*z-1.26551223 + t*(1.00002368+t*(0.37409196+t*(0.09678418+t*(-0.18628806+t*(0.27886807+t*(-1.13520398+t*(1.48851587+t*(-0.82215223+t*0.17087277))))))))) 1415 if x >= 0: 1416 return ans 1417 else: 1418 return 2.0 - ans
1419 1420
1421 -def lzprob(z):
1422 """ 1423 Returns the area under the normal curve 'to the left of' the given z value. 1424 Thus, 1425 for z<0, zprob(z) = 1-tail probability 1426 for z>0, 1.0-zprob(z) = 1-tail probability 1427 for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability 1428 Adapted from z.c in Gary Perlman's |Stat. 1429 1430 Usage: lzprob(z) 1431 """ 1432 Z_MAX = 6.0 # maximum meaningful z-value 1433 if z == 0.0: 1434 x = 0.0 1435 else: 1436 y = 0.5 * math.fabs(z) 1437 if y >= (Z_MAX*0.5): 1438 x = 1.0 1439 elif (y < 1.0): 1440 w = y*y 1441 x = ((((((((0.000124818987 * w 1442 -0.001075204047) * w +0.005198775019) * w 1443 -0.019198292004) * w +0.059054035642) * w 1444 -0.151968751364) * w +0.319152932694) * w 1445 -0.531923007300) * w +0.797884560593) * y * 2.0 1446 else: 1447 y = y - 2.0 1448 x = (((((((((((((-0.000045255659 * y 1449 +0.000152529290) * y -0.000019538132) * y 1450 -0.000676904986) * y +0.001390604284) * y 1451 -0.000794620820) * y -0.002034254874) * y 1452 +0.006549791214) * y -0.010557625006) * y 1453 +0.011630447319) * y -0.009279453341) * y 1454 +0.005353579108) * y -0.002141268741) * y 1455 +0.000535310849) * y +0.999936657524 1456 if z > 0.0: 1457 prob = ((x+1.0)*0.5) 1458 else: 1459 prob = ((1.0-x)*0.5) 1460 return prob
1461 1462
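Modern Python's `math.erf` makes the left-tail normal probability that `lzprob` approximates polynomially a one-liner; `normal_cdf` is a hypothetical name, useful for checking ported values:

```python
import math

def normal_cdf(z):
    # Area under the standard normal curve to the left of z,
    # i.e. the quantity lzprob above approximates with polynomials.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

As with `lzprob`, `normal_cdf(0)` is exactly 0.5, `normal_cdf(1.96)` is about 0.975, and `normal_cdf(-z) + normal_cdf(z)` sums to 1.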
1463 -def lksprob(alam):
1464 """ 1465 Computes a Kolmolgorov-Smirnov t-test significance level. Adapted from 1466 Numerical Recipies. 1467 1468 Usage: lksprob(alam) 1469 """ 1470 fac = 2.0 1471 sum = 0.0 1472 termbf = 0.0 1473 a2 = -2.0*alam*alam 1474 for j in range(1,201): 1475 term = fac*math.exp(a2*j*j) 1476 sum = sum + term 1477 if math.fabs(term) <= (0.001*termbf) or math.fabs(term) < (1.0e-8*sum): 1478 return sum 1479 fac = -fac 1480 termbf = math.fabs(term) 1481 return 1.0 # Get here only if fails to converge; was 0.0!!
1482 1483
1484 -def lfprob (dfnum, dfden, F):
1485 """ 1486 Returns the (1-tailed) significance level (p-value) of an F 1487 statistic given the degrees of freedom for the numerator (dfR-dfF) and 1488 the degrees of freedom for the denominator (dfF). 1489 1490 Usage: lfprob(dfnum, dfden, F) where usually dfnum=dfbn, dfden=dfwn 1491 """ 1492 p = betai(0.5*dfden, 0.5*dfnum, dfden/float(dfden+dfnum*F)) 1493 return p
1494 1495
1496 -def lbetacf(a,b,x):
1497 """ 1498 This function evaluates the continued fraction form of the incomplete 1499 Beta function, betai. (Adapted from: Numerical Recipies in C.) 1500 1501 Usage: lbetacf(a,b,x) 1502 """ 1503 ITMAX = 200 1504 EPS = 3.0e-7 1505 1506 bm = az = am = 1.0 1507 qab = a+b 1508 qap = a+1.0 1509 qam = a-1.0 1510 bz = 1.0-qab*x/qap 1511 for i in range(ITMAX+1): 1512 em = float(i+1) 1513 tem = em + em 1514 d = em*(b-em)*x/((qam+tem)*(a+tem)) 1515 ap = az + d*am 1516 bp = bz+d*bm 1517 d = -(a+em)*(qab+em)*x/((qap+tem)*(a+tem)) 1518 app = ap+d*az 1519 bpp = bp+d*bz 1520 aold = az 1521 am = ap/bpp 1522 bm = bp/bpp 1523 az = app/bpp 1524 bz = 1.0 1525 if (abs(az-aold)<(EPS*abs(az))): 1526 return az 1527 print 'a or b too big, or ITMAX too small in Betacf.'
1528 1529
1530 -def lgammln(xx):
1531 """ 1532 Returns the gamma function of xx. 1533 Gamma(z) = Integral(0,infinity) of t^(z-1)exp(-t) dt. 1534 (Adapted from: Numerical Recipies in C.) 1535 1536 Usage: lgammln(xx) 1537 """ 1538 1539 coeff = [76.18009173, -86.50532033, 24.01409822, -1.231739516, 1540 0.120858003e-2, -0.536382e-5] 1541 x = xx - 1.0 1542 tmp = x + 5.5 1543 tmp = tmp - (x+0.5)*math.log(tmp) 1544 ser = 1.0 1545 for j in range(len(coeff)): 1546 x = x + 1 1547 ser = ser + coeff[j]/x 1548 return -tmp + math.log(2.50662827465*ser)
1549 1550
1551 -def lbetai(a,b,x):
1552 """ 1553 Returns the incomplete beta function: 1554 1555 I-sub-x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt) 1556 1557 where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma 1558 function of a. The continued fraction formulation is implemented here, 1559 using the betacf function. (Adapted from: Numerical Recipies in C.) 1560 1561 Usage: lbetai(a,b,x) 1562 """ 1563 if (x<0.0 or x>1.0): 1564 raise ValueError, 'Bad x in lbetai' 1565 if (x==0.0 or x==1.0): 1566 bt = 0.0 1567 else: 1568 bt = math.exp(gammln(a+b)-gammln(a)-gammln(b)+a*math.log(x)+b* 1569 math.log(1.0-x)) 1570 if (x<(a+1.0)/(a+b+2.0)): 1571 return bt*betacf(a,b,x)/float(a) 1572 else: 1573 return 1.0-bt*betacf(b,a,1.0-x)/float(b)
1574 1575 1576 #################################### 1577 ####### ANOVA CALCULATIONS ####### 1578 #################################### 1579
1580 -def lF_oneway(*lists):
1581 """ 1582 Performs a 1-way ANOVA, returning an F-value and probability given 1583 any number of groups. From Heiman, pp.394-7. 1584 1585 Usage: F_oneway(*lists) where *lists is any number of lists, one per 1586 treatment group 1587 Returns: F value, one-tailed p-value 1588 """ 1589 a = len(lists) # ANOVA on 'a' groups, each in it's own list 1590 means = [0]*a 1591 vars = [0]*a 1592 ns = [0]*a 1593 alldata = [] 1594 tmp = map(N.array,lists) 1595 means = map(amean,tmp) 1596 vars = map(avar,tmp) 1597 ns = map(len,lists) 1598 for i in range(len(lists)): 1599 alldata = alldata + lists[i] 1600 alldata = N.array(alldata) 1601 bign = len(alldata) 1602 sstot = ass(alldata)-(asquare_of_sums(alldata)/float(bign)) 1603 ssbn = 0 1604 for list in lists: 1605 ssbn = ssbn + asquare_of_sums(N.array(list))/float(len(list)) 1606 ssbn = ssbn - (asquare_of_sums(alldata)/float(bign)) 1607 sswn = sstot-ssbn 1608 dfbn = a-1 1609 dfwn = bign - a 1610 msb = ssbn/float(dfbn) 1611 msw = sswn/float(dfwn) 1612 f = msb/msw 1613 prob = fprob(dfbn,dfwn,f) 1614 return f, prob
1615 1616
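The sum-of-squares decomposition used by `lF_oneway` can be sketched without Numeric arrays in modern Python 3; `f_oneway` here is a hypothetical standalone name returning only the F value (the p-value needs `fprob`):

```python
def f_oneway(*groups):
    # SStotal = SSbetween + SSwithin; F = MSbetween / MSwithin,
    # with dfbetween = a-1 and dfwithin = N-a for a groups of N total scores.
    alldata = [v for g in groups for v in g]
    bign = len(alldata)
    grand = sum(alldata)
    sstot = sum(v * v for v in alldata) - grand * grand / float(bign)
    ssbn = (sum(sum(g) ** 2 / float(len(g)) for g in groups)
            - grand * grand / float(bign))
    sswn = sstot - ssbn
    dfbn = len(groups) - 1
    dfwn = bign - len(groups)
    return (ssbn / dfbn) / (sswn / dfwn)
```

For the groups [1,2,3], [2,3,4], [3,4,5], SSbetween = SSwithin = 6 with df 2 and 6, so F = 3.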
1617 -def lF_value (ER,EF,dfnum,dfden):
1618 """ 1619 Returns an F-statistic given the following: 1620 ER = error associated with the null hypothesis (the Restricted model) 1621 EF = error associated with the alternate hypothesis (the Full model) 1622 dfR-dfF = degrees of freedom of the numerator 1623 dfF = degrees of freedom associated with the denominator/Full model 1624 1625 Usage: lF_value(ER,EF,dfnum,dfden) 1626 """ 1627 return ((ER-EF)/float(dfnum) / (EF/float(dfden)))
1628 1629 1630 #################################### 1631 ######## SUPPORT FUNCTIONS ####### 1632 #################################### 1633
1634 -def writecc (listoflists,file,writetype='w',extra=2):
1635 """ 1636 Writes a list of lists to a file in columns, customized by the max 1637 size of items within the columns (max size of items in col, +2 characters) 1638 to specified file. File-overwrite is the default. 1639 1640 Usage: writecc (listoflists,file,writetype='w',extra=2) 1641 Returns: None 1642 """ 1643 if type(listoflists[0]) not in [ListType,TupleType]: 1644 listoflists = [listoflists] 1645 outfile = open(file,writetype) 1646 rowstokill = [] 1647 list2print = copy.deepcopy(listoflists) 1648 for i in range(len(listoflists)): 1649 if listoflists[i] == ['\n'] or listoflists[i]=='\n' or listoflists[i]=='dashes': 1650 rowstokill = rowstokill + [i] 1651 rowstokill.reverse() 1652 for row in rowstokill: 1653 del list2print[row] 1654 maxsize = [0]*len(list2print[0]) 1655 for col in range(len(list2print[0])): 1656 items = pstat.colex(list2print,col) 1657 items = map(pstat.makestr,items) 1658 maxsize[col] = max(map(len,items)) + extra 1659 for row in listoflists: 1660 if row == ['\n'] or row == '\n': 1661 outfile.write('\n') 1662 elif row == ['dashes'] or row == 'dashes': 1663 dashes = [0]*len(maxsize) 1664 for j in range(len(maxsize)): 1665 dashes[j] = '-'*(maxsize[j]-2) 1666 outfile.write(pstat.lineincustcols(dashes,maxsize)) 1667 else: 1668 outfile.write(pstat.lineincustcols(row,maxsize)) 1669 outfile.write('\n') 1670 outfile.close() 1671 return None
1672 1673
1674 -def lincr(l,cap): # to increment a list up to a max-list of 'cap'
1675 """ 1676 Simulate a counting system from an n-dimensional list. 1677 1678 Usage: lincr(l,cap) l=list to increment, cap=max values for each list pos'n 1679 Returns: next set of values for list l, OR -1 (if overflow) 1680 """ 1681 l[0] = l[0] + 1 # e.g., [0,0,0] --> [2,4,3] (=cap) 1682 for i in range(len(l)): 1683 if l[i] > cap[i] and i < len(l)-1: # if carryover AND not done 1684 l[i] = 0 1685 l[i+1] = l[i+1] + 1 1686 elif l[i] > cap[i] and i == len(l)-1: # overflow past last column, must be finished 1687 l = -1 1688 return l 1689 1690
1691 -def lsum (inlist):
1692 """ 1693 Returns the sum of the items in the passed list. 1694 1695 Usage: lsum(inlist) 1696 """ 1697 s = 0 1698 for item in inlist: 1699 s = s + item 1700 return s
1701 1702
1703 -def lcumsum (inlist):
1704 """ 1705 Returns a list consisting of the cumulative sum of the items in the 1706 passed list. 1707 1708 Usage: lcumsum(inlist) 1709 """ 1710 newlist = copy.deepcopy(inlist) 1711 for i in range(1,len(newlist)): 1712 newlist[i] = newlist[i] + newlist[i-1] 1713 return newlist
1714 1715
1716 -def lss(inlist):
1717 """ 1718 Squares each value in the passed list, adds up these squares and 1719 returns the result. 1720 1721 Usage: lss(inlist) 1722 """ 1723 ss = 0 1724 for item in inlist: 1725 ss = ss + item*item 1726 return ss
1727 1728
1729 -def lsummult (list1,list2):
1730 """ 1731 Multiplies elements in list1 and list2, element by element, and 1732 returns the sum of all resulting multiplications. Must provide equal 1733 length lists. 1734 1735 Usage: lsummult(list1,list2) 1736 """ 1737 if len(list1) <> len(list2): 1738 raise ValueError, "Lists not equal length in summult." 1739 s = 0 1740 for item1,item2 in pstat.abut(list1,list2): 1741 s = s + item1*item2 1742 return s
1743 1744
1745 -def lsumdiffsquared(x,y):
1746 """ 1747 Takes pairwise differences of the values in lists x and y, squares 1748 these differences, and returns the sum of these squares. 1749 1750 Usage: lsumdiffsquared(x,y) 1751 Returns: sum[(x[i]-y[i])**2] 1752 """ 1753 sds = 0 1754 for i in range(len(x)): 1755 sds = sds + (x[i]-y[i])**2 1756 return sds
1757 1758
1759 -def lsquare_of_sums(inlist):
1760 """ 1761 Adds the values in the passed list, squares the sum, and returns 1762 the result. 1763 1764 Usage: lsquare_of_sums(inlist) 1765 Returns: sum(inlist[i])**2 1766 """ 1767 s = sum(inlist) 1768 return float(s)*s
1769 1770
1771 -def lshellsort(inlist):
1772 """ 1773 Shellsort algorithm. Sorts a 1D-list. 1774 1775 Usage: lshellsort(inlist) 1776 Returns: sorted-inlist, sorting-index-vector (for original list) 1777 """ 1778 n = len(inlist) 1779 svec = copy.deepcopy(inlist) 1780 ivec = range(n) 1781 gap = n/2 # integer division needed 1782 while gap >0: 1783 for i in range(gap,n): 1784 for j in range(i-gap,-1,-gap): 1785 while j>=0 and svec[j]>svec[j+gap]: 1786 temp = svec[j] 1787 svec[j] = svec[j+gap] 1788 svec[j+gap] = temp 1789 itemp = ivec[j] 1790 ivec[j] = ivec[j+gap] 1791 ivec[j+gap] = itemp 1792 gap = gap / 2 # integer division needed 1793 # svec is now sorted inlist, and ivec has the order svec[i] = vec[ivec[i]] 1794 return svec, ivec
1795 1796
1797 -def lrankdata(inlist):
1798 """ 1799 Ranks the data in inlist, dealing with ties appropritely. Assumes 1800 a 1D inlist. Adapted from Gary Perlman's |Stat ranksort. 1801 1802 Usage: lrankdata(inlist) 1803 Returns: a list of length equal to inlist, containing rank scores 1804 """ 1805 n = len(inlist) 1806 svec, ivec = shellsort(inlist) 1807 sumranks = 0 1808 dupcount = 0 1809 newlist = [0]*n 1810 for i in range(n): 1811 sumranks = sumranks + i 1812 dupcount = dupcount + 1 1813 if i==n-1 or svec[i] <> svec[i+1]: 1814 averank = sumranks / float(dupcount) + 1 1815 for j in range(i-dupcount+1,i+1): 1816 newlist[ivec[j]] = averank 1817 sumranks = 0 1818 dupcount = 0 1819 return newlist
1820 1821
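The average-rank tie handling in `lrankdata` can be sketched in modern Python 3 without the shellsort helper; `rankdata_avg` is a hypothetical name for illustration:

```python
def rankdata_avg(vals):
    # Assign 1-based ranks; every member of a run of tied scores
    # receives the mean of the ranks that run spans, as lrankdata does.
    n = len(vals)
    order = sorted(range(n), key=lambda i: vals[i])  # argsort of vals
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and vals[order[j + 1]] == vals[order[i]]:
            j += 1                      # extend the run of ties
        avg = (i + j) / 2.0 + 1.0       # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks
```

For [1, 2, 2, 3] the two tied 2s share rank (2+3)/2 = 2.5, giving [1.0, 2.5, 2.5, 4.0].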
1822 -def outputpairedstats(fname,writemode,name1,n1,m1,se1,min1,max1,name2,n2,m2,se2,min2,max2,statname,stat,prob):
1823 """ 1824 Prints or write to a file stats for two groups, using the name, n, 1825 mean, sterr, min and max for each group, as well as the statistic name, 1826 its value, and the associated p-value. 1827 1828 Usage: outputpairedstats(fname,writemode, 1829 name1,n1,mean1,stderr1,min1,max1, 1830 name2,n2,mean2,stderr2,min2,max2, 1831 statname,stat,prob) 1832 Returns: None 1833 """ 1834 suffix = '' # for *s after the p-value 1835 try: 1836 x = prob.shape 1837 prob = prob[0] 1838 except: 1839 pass 1840 if prob < 0.001: suffix = ' ***' 1841 elif prob < 0.01: suffix = ' **' 1842 elif prob < 0.05: suffix = ' *' 1843 title = [['Name','N','Mean','SD','Min','Max']] 1844 lofl = title+[[name1,n1,round(m1,3),round(math.sqrt(se1),3),min1,max1], 1845 [name2,n2,round(m2,3),round(math.sqrt(se2),3),min2,max2]] 1846 if type(fname)<>StringType or len(fname)==0: 1847 print 1848 print statname 1849 print 1850 pstat.printcc(lofl) 1851 print 1852 try: 1853 if stat.shape == (): 1854 stat = stat[0] 1855 if prob.shape == (): 1856 prob = prob[0] 1857 except: 1858 pass 1859 print 'Test statistic = ',round(stat,3),' p = ',round(prob,3),suffix 1860 print 1861 else: 1862 file = open(fname,writemode) 1863 file.write('\n'+statname+'\n\n') 1864 file.close() 1865 writecc(lofl,fname,'a') 1866 file = open(fname,'a') 1867 try: 1868 if stat.shape == (): 1869 stat = stat[0] 1870 if prob.shape == (): 1871 prob = prob[0] 1872 except: 1873 pass 1874 file.write(pstat.list2string(['\nTest statistic = ',round(stat,4),' p = ',round(prob,4),suffix,'\n\n'])) 1875 file.close() 1876 return None
1877 1878
1879 -def lfindwithin (data):
1880 """ 1881 Returns an integer representing a binary vector, where 1=within- 1882 subject factor, 0=between. Input equals the entire data 2D list (i.e., 1883 column 0=random factor, column -1=measured values (those two are skipped). 1884 Note: input data is in |Stat format ... a list of lists ("2D list") with 1885 one row per measured value, first column=subject identifier, last column= 1886 score, one in-between column per factor (these columns contain level 1887 designations on each factor). See also stats.anova.__doc__. 1888 1889 Usage: lfindwithin(data) data in |Stat format 1890 """ 1891 1892 numfact = len(data[0])-1 1893 withinvec = 0 1894 for col in range(1,numfact): 1895 examplelevel = pstat.unique(pstat.colex(data,col))[0] 1896 rows = pstat.linexand(data,col,examplelevel) # get 1 level of this factor 1897 factsubjs = pstat.unique(pstat.colex(rows,0)) 1898 allsubjs = pstat.unique(pstat.colex(data,0)) 1899 if len(factsubjs) == len(allsubjs): # fewer Ss than scores on this factor? 1900 withinvec = withinvec + (1 << col) 1901 return withinvec
1902 1903 1904 ######################################################### 1905 ######################################################### 1906 ####### DISPATCH LISTS AND TUPLES TO ABOVE FCNS ######### 1907 ######################################################### 1908 ######################################################### 1909 1910 ## CENTRAL TENDENCY: 1911 geometricmean = Dispatch ( (lgeometricmean, (ListType, TupleType)), ) 1912 harmonicmean = Dispatch ( (lharmonicmean, (ListType, TupleType)), ) 1913 mean = Dispatch ( (lmean, (ListType, TupleType)), ) 1914 median = Dispatch ( (lmedian, (ListType, TupleType)), ) 1915 medianscore = Dispatch ( (lmedianscore, (ListType, TupleType)), ) 1916 mode = Dispatch ( (lmode, (ListType, TupleType)), ) 1917 1918 ## MOMENTS: 1919 moment = Dispatch ( (lmoment, (ListType, TupleType)), ) 1920 variation = Dispatch ( (lvariation, (ListType, TupleType)), ) 1921 skew = Dispatch ( (lskew, (ListType, TupleType)), ) 1922 kurtosis = Dispatch ( (lkurtosis, (ListType, TupleType)), ) 1923 describe = Dispatch ( (ldescribe, (ListType, TupleType)), ) 1924 1925 ## FREQUENCY STATISTICS: 1926 itemfreq = Dispatch ( (litemfreq, (ListType, TupleType)), ) 1927 scoreatpercentile = Dispatch ( (lscoreatpercentile, (ListType, TupleType)), ) 1928 percentileofscore = Dispatch ( (lpercentileofscore, (ListType, TupleType)), ) 1929 histogram = Dispatch ( (lhistogram, (ListType, TupleType)), ) 1930 cumfreq = Dispatch ( (lcumfreq, (ListType, TupleType)), ) 1931 relfreq = Dispatch ( (lrelfreq, (ListType, TupleType)), ) 1932 1933 ## VARIABILITY: 1934 obrientransform = Dispatch ( (lobrientransform, (ListType, TupleType)), ) 1935 samplevar = Dispatch ( (lsamplevar, (ListType, TupleType)), ) 1936 samplestdev = Dispatch ( (lsamplestdev, (ListType, TupleType)), ) 1937 var = Dispatch ( (lvar, (ListType, TupleType)), ) 1938 stdev = Dispatch ( (lstdev, (ListType, TupleType)), ) 1939 sterr = Dispatch ( (lsterr, (ListType, TupleType)), ) 1940 sem = Dispatch ( (lsem, 
(ListType, TupleType)), ) 1941 z = Dispatch ( (lz, (ListType, TupleType)), ) 1942 zs = Dispatch ( (lzs, (ListType, TupleType)), ) 1943 1944 ## TRIMMING FCNS: 1945 trimboth = Dispatch ( (ltrimboth, (ListType, TupleType)), ) 1946 trim1 = Dispatch ( (ltrim1, (ListType, TupleType)), ) 1947 1948 ## CORRELATION FCNS: 1949 paired = Dispatch ( (lpaired, (ListType, TupleType)), ) 1950 pearsonr = Dispatch ( (lpearsonr, (ListType, TupleType)), ) 1951 spearmanr = Dispatch ( (lspearmanr, (ListType, TupleType)), ) 1952 pointbiserialr = Dispatch ( (lpointbiserialr, (ListType, TupleType)), ) 1953 kendalltau = Dispatch ( (lkendalltau, (ListType, TupleType)), ) 1954 linregress = Dispatch ( (llinregress, (ListType, TupleType)), ) 1955 1956 ## INFERENTIAL STATS: 1957 ttest_1samp = Dispatch ( (lttest_1samp, (ListType, TupleType)), ) 1958 ttest_ind = Dispatch ( (lttest_ind, (ListType, TupleType)), ) 1959 ttest_rel = Dispatch ( (lttest_rel, (ListType, TupleType)), ) 1960 chisquare = Dispatch ( (lchisquare, (ListType, TupleType)), ) 1961 ks_2samp = Dispatch ( (lks_2samp, (ListType, TupleType)), ) 1962 mannwhitneyu = Dispatch ( (lmannwhitneyu, (ListType, TupleType)), ) 1963 ranksums = Dispatch ( (lranksums, (ListType, TupleType)), ) 1964 tiecorrect = Dispatch ( (ltiecorrect, (ListType, TupleType)), ) 1965 wilcoxont = Dispatch ( (lwilcoxont, (ListType, TupleType)), ) 1966 kruskalwallish = Dispatch ( (lkruskalwallish, (ListType, TupleType)), ) 1967 friedmanchisquare = Dispatch ( (lfriedmanchisquare, (ListType, TupleType)), ) 1968 1969 ## PROBABILITY CALCS: 1970 chisqprob = Dispatch ( (lchisqprob, (IntType, FloatType)), ) 1971 zprob = Dispatch ( (lzprob, (IntType, FloatType)), ) 1972 ksprob = Dispatch ( (lksprob, (IntType, FloatType)), ) 1973 fprob = Dispatch ( (lfprob, (IntType, FloatType)), ) 1974 betacf = Dispatch ( (lbetacf, (IntType, FloatType)), ) 1975 betai = Dispatch ( (lbetai, (IntType, FloatType)), ) 1976 erfcc = Dispatch ( (lerfcc, (IntType, FloatType)), ) 1977 gammln = Dispatch ( 
(lgammln, (IntType, FloatType)), ) 1978 1979 ## ANOVA FUNCTIONS: 1980 F_oneway = Dispatch ( (lF_oneway, (ListType, TupleType)), ) 1981 F_value = Dispatch ( (lF_value, (ListType, TupleType)), ) 1982 1983 ## SUPPORT FUNCTIONS: 1984 incr = Dispatch ( (lincr, (ListType, TupleType)), ) 1985 sum = Dispatch ( (lsum, (ListType, TupleType)), ) 1986 cumsum = Dispatch ( (lcumsum, (ListType, TupleType)), ) 1987 ss = Dispatch ( (lss, (ListType, TupleType)), ) 1988 summult = Dispatch ( (lsummult, (ListType, TupleType)), ) 1989 square_of_sums = Dispatch ( (lsquare_of_sums, (ListType, TupleType)), ) 1990 sumdiffsquared = Dispatch ( (lsumdiffsquared, (ListType, TupleType)), ) 1991 shellsort = Dispatch ( (lshellsort, (ListType, TupleType)), ) 1992 rankdata = Dispatch ( (lrankdata, (ListType, TupleType)), ) 1993 findwithin = Dispatch ( (lfindwithin, (ListType, TupleType)), ) 1994 1995 1996 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2015 2016 try: # DEFINE THESE *ONLY* IF NUMERIC IS AVAILABLE 2017 import numpy as N 2018 import numpy.linalg as LA 2019 2020 2021 ##################################### 2022 ######## ACENTRAL TENDENCY ######## 2023 ##################################### 2024
2025 - def ageometricmean (inarray,dimension=None,keepdims=0):
2026 """ 2027 Calculates the geometric mean of the values in the passed array. 2028 That is: n-th root of (x1 * x2 * ... * xn). Defaults to ALL values in 2029 the passed array. Use dimension=None to flatten array first. REMEMBER: if 2030 dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and 2031 if dimension is a sequence, it collapses over all specified dimensions. If 2032 keepdims is set to 1, the resulting array will have as many dimensions as 2033 inarray, with only 1 'level' per dim that was collapsed over. 2034 2035 Usage: ageometricmean(inarray,dimension=None,keepdims=0) 2036 Returns: geometric mean computed over dim(s) listed in dimension 2037 """ 2038 inarray = N.array(inarray,N.float_) 2039 if dimension == None: 2040 inarray = N.ravel(inarray) 2041 size = len(inarray) 2042 mult = N.power(inarray,1.0/size) 2043 mult = N.multiply.reduce(mult) 2044 elif type(dimension) in [IntType,FloatType]: 2045 size = inarray.shape[dimension] 2046 mult = N.power(inarray,1.0/size) 2047 mult = N.multiply.reduce(mult,dimension) 2048 if keepdims == 1: 2049 shp = list(inarray.shape) 2050 shp[dimension] = 1 2051 mult = N.reshape(mult,shp) 2052 else: # must be a SEQUENCE of dims to average over 2053 dims = list(dimension) 2054 dims.sort() 2055 dims.reverse() 2056 size = N.array(N.multiply.reduce(N.take(inarray.shape,dims)),N.float_) 2057 mult = N.power(inarray,1.0/size) 2058 for dim in dims: 2059 mult = N.multiply.reduce(mult,dim) 2060 if keepdims == 1: 2061 shp = list(inarray.shape) 2062 for dim in dims: 2063 shp[dim] = 1 2064 mult = N.reshape(mult,shp) 2065 return mult
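With modern NumPy (not part of this module), the same quantity is usually computed through logarithms, which avoids overflow of the running product; a minimal sketch, assuming strictly positive values:

```python
import numpy as np

def geometric_mean(a, axis=None):
    # exp of the mean log equals the n-th root of the product
    a = np.asarray(a, dtype=float)
    return np.exp(np.mean(np.log(a), axis=axis))

print(geometric_mean([2.0, 8.0]))  # 4.0
```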
2066 2067
2068 - def aharmonicmean (inarray,dimension=None,keepdims=0):
2069 """ 2070 Calculates the harmonic mean of the values in the passed array. 2071 That is: n / (1/x1 + 1/x2 + ... + 1/xn). Defaults to ALL values in 2072 the passed array. Use dimension=None to flatten array first. REMEMBER: if 2073 dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and 2074 if dimension is a sequence, it collapses over all specified dimensions. If 2075 keepdims is set to 1, the resulting array will have as many dimensions as 2076 inarray, with only 1 'level' per dim that was collapsed over. 2077 2078 Usage: aharmonicmean(inarray,dimension=None,keepdims=0) 2079 Returns: harmonic mean computed over dim(s) in dimension 2080 """ 2081 inarray = inarray.astype(N.float_) 2082 if dimension == None: 2083 inarray = N.ravel(inarray) 2084 size = len(inarray) 2085 s = N.add.reduce(1.0 / inarray) 2086 elif type(dimension) in [IntType,FloatType]: 2087 size = float(inarray.shape[dimension]) 2088 s = N.add.reduce(1.0/inarray, dimension) 2089 if keepdims == 1: 2090 shp = list(inarray.shape) 2091 shp[dimension] = 1 2092 s = N.reshape(s,shp) 2093 else: # must be a SEQUENCE of dims to average over 2094 dims = list(dimension) 2095 dims.sort() 2096 nondims = [] 2097 for i in range(len(inarray.shape)): 2098 if i not in dims: 2099 nondims.append(i) 2100 tinarray = N.transpose(inarray,nondims+dims) # put keep-dims first 2101 idx = [0] *len(nondims) 2102 if idx == []: 2103 size = len(N.ravel(inarray)) 2104 s = asum(1.0 / inarray) 2105 if keepdims == 1: 2106 s = N.reshape([s],N.ones(len(inarray.shape))) 2107 else: 2108 idx[0] = -1 2109 loopcap = N.array(tinarray.shape[0:len(nondims)]) -1 2110 s = N.zeros(loopcap+1,N.float_) 2111 while incr(idx,loopcap) <> -1: 2112 s[idx] = asum(1.0/tinarray[idx]) 2113 size = N.multiply.reduce(N.take(inarray.shape,dims)) 2114 if keepdims == 1: 2115 shp = list(inarray.shape) 2116 for dim in dims: 2117 shp[dim] = 1 2118 s = N.reshape(s,shp) 2119 return size / s
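The core formula n / (1/x1 + ... + 1/xn) that aharmonicmean implements reduces to a one-liner in modern NumPy (a sketch, assuming nonzero values; the module's multi-dimension bookkeeping is omitted):

```python
import numpy as np

def harmonic_mean(a, axis=None):
    # n divided by the sum of reciprocals
    a = np.asarray(a, dtype=float)
    n = a.shape[axis] if axis is not None else a.size
    return n / np.sum(1.0 / a, axis=axis)

print(harmonic_mean([1.0, 2.0, 4.0]))  # 12/7, about 1.714
```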
2120 2121
2122 - def amean (inarray,dimension=None,keepdims=0):
2123 """ 2124 Calculates the arithmetic mean of the values in the passed array. 2125 That is: 1/n * (x1 + x2 + ... + xn). Defaults to ALL values in the 2126 passed array. Use dimension=None to flatten array first. REMEMBER: if 2127 dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and 2128 if dimension is a sequence, it collapses over all specified dimensions. If 2129 keepdims is set to 1, the resulting array will have as many dimensions as 2130 inarray, with only 1 'level' per dim that was collapsed over. 2131 2132 Usage: amean(inarray,dimension=None,keepdims=0) 2133 Returns: arithmetic mean calculated over dim(s) in dimension 2134 """ 2135 if inarray.dtype in [N.int_, N.short,N.ubyte]: 2136 inarray = inarray.astype(N.float_) 2137 if dimension == None: 2138 inarray = N.ravel(inarray) 2139 sum = N.add.reduce(inarray) 2140 denom = float(len(inarray)) 2141 elif type(dimension) in [IntType,FloatType]: 2142 sum = asum(inarray,dimension) 2143 denom = float(inarray.shape[dimension]) 2144 if keepdims == 1: 2145 shp = list(inarray.shape) 2146 shp[dimension] = 1 2147 sum = N.reshape(sum,shp) 2148 else: # must be a TUPLE of dims to average over 2149 dims = list(dimension) 2150 dims.sort() 2151 dims.reverse() 2152 sum = inarray *1.0 2153 for dim in dims: 2154 sum = N.add.reduce(sum,dim) 2155 denom = N.array(N.multiply.reduce(N.take(inarray.shape,dims)),N.float_) 2156 if keepdims == 1: 2157 shp = list(inarray.shape) 2158 for dim in dims: 2159 shp[dim] = 1 2160 sum = N.reshape(sum,shp) 2161 return sum/denom
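The keepdims=1 convention used throughout these array functions matches the keepdims flag of modern NumPy reductions, and dimension sequences correspond to tuple axes; a quick illustration of the shape behaviour the docstring describes:

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)
# collapsing dimension 0 with keepdims keeps a size-1 'level' in its place
m = np.mean(a, axis=0, keepdims=True)
print(m.shape)                   # (1, 3)
print(np.mean(a, axis=(0, 1)))   # collapsing over a sequence of dims -> scalar
```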
2162 2163
2164 - def amedian (inarray,numbins=1000):
2165 """ 2166 Calculates the COMPUTED median value of an array of numbers, given the 2167 number of bins to use for the histogram (more bins approaches finding the 2168 precise median value of the array; default number of bins = 1000). From 2169 G.W. Heiman's Basic Stats, or CRC Probability & Statistics. 2170 NOTE: THIS ROUTINE ALWAYS uses the entire passed array (flattens it first). 2171 2172 Usage: amedian(inarray,numbins=1000) 2173 Returns: median calculated over ALL values in inarray 2174 """ 2175 inarray = N.ravel(inarray) 2176 (hist, smallest, binsize, extras) = ahistogram(inarray,numbins,[min(inarray),max(inarray)]) 2177 cumhist = N.cumsum(hist) # make cumulative histogram 2178 otherbins = N.greater_equal(cumhist,len(inarray)/2.0) 2179 otherbins = list(otherbins) # list of 0/1s, 1s start at median bin 2180 cfbin = otherbins.index(1) # get 1st(!) index holding 50%ile score 2181 LRL = smallest + binsize*cfbin # get lower real limit of that bin 2182 cfbelow = N.add.reduce(hist[0:cfbin]) # cum. freq. below bin 2183 freq = hist[cfbin] # frequency IN the 50%ile bin 2184 median = LRL + ((len(inarray)/2.0-cfbelow)/float(freq))*binsize # MEDIAN 2185 return median
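The binned interpolation above can be reproduced with np.histogram (a sketch, not the module's code): find the first bin whose cumulative count reaches 50% of the data, then interpolate linearly within that bin.

```python
import numpy as np

def binned_median(a, numbins=1000):
    a = np.ravel(np.asarray(a, dtype=float))
    hist, edges = np.histogram(a, bins=numbins, range=(a.min(), a.max()))
    binsize = edges[1] - edges[0]
    cumhist = np.cumsum(hist)
    cfbin = int(np.argmax(cumhist >= len(a) / 2.0))  # first bin reaching 50%
    cfbelow = cumhist[cfbin] - hist[cfbin]           # cum. freq. below that bin
    lrl = edges[cfbin]                               # lower real limit of bin
    return lrl + ((len(a) / 2.0 - cfbelow) / float(hist[cfbin])) * binsize

print(binned_median([1.0, 2.0, 3.0, 4.0, 5.0]))  # close to 3.0
```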
2186 2187
2188 - def amedianscore (inarray,dimension=None):
2189 """ 2190 Returns the 'middle' score of the passed array. If there is an even 2191 number of scores, the mean of the 2 middle scores is returned. Can function 2192 with 1D arrays, or on the FIRST dimension of 2D arrays (i.e., dimension can 2193 be None, to pre-flatten the array, or else dimension must equal 0). 2194 2195 Usage: amedianscore(inarray,dimension=None) 2196 Returns: 'middle' score of the array, or the mean of the 2 middle scores 2197 """ 2198 if dimension == None: 2199 inarray = N.ravel(inarray) 2200 dimension = 0 2201 inarray = N.sort(inarray,dimension) 2202 if inarray.shape[dimension] % 2 == 0: # if even number of elements 2203 indx = inarray.shape[dimension]/2 # integer division correct 2204 median = N.asarray(inarray[indx]+inarray[indx-1]) / 2.0 2205 else: 2206 indx = inarray.shape[dimension] / 2 # integer division correct 2207 median = N.take(inarray,[indx],dimension) 2208 if median.shape == (1,): 2209 median = median[0] 2210 return median
2211 2212
2213 - def amode(a, dimension=None):
2214 """ 2215 Returns an array of the modal (most common) score in the passed array. 2216 If there is more than one such score, ONLY THE FIRST is returned. 2217 The bin-count for the modal values is also returned. Operates on whole 2218 array (dimension=None), or on a given dimension. 2219 2220 Usage: amode(a, dimension=None) 2221 Returns: array of bin-counts for mode(s), array of corresponding modal values 2222 """ 2223 2224 if dimension == None: 2225 a = N.ravel(a) 2226 dimension = 0 2227 scores = pstat.aunique(N.ravel(a)) # get ALL unique values 2228 testshape = list(a.shape) 2229 testshape[dimension] = 1 2230 oldmostfreq = N.zeros(testshape) 2231 oldcounts = N.zeros(testshape) 2232 for score in scores: 2233 template = N.equal(a,score) 2234 counts = asum(template,dimension,1) 2235 mostfrequent = N.where(counts>oldcounts,score,oldmostfreq) 2236 oldcounts = N.where(counts>oldcounts,counts,oldcounts) 2237 oldmostfreq = mostfrequent 2238 return oldcounts, mostfrequent
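With modern NumPy the same "first modal value plus its count" behaviour falls out of np.unique, since np.unique sorts the values and argmax returns the first maximal index (a sketch for the 1D, dimension=None case only):

```python
import numpy as np

def mode(a):
    # first (smallest) most-common value and its bin count
    vals, counts = np.unique(np.ravel(a), return_counts=True)
    i = int(np.argmax(counts))   # argmax picks the FIRST maximal count
    return counts[i], vals[i]

print(mode([1, 2, 2, 3]))  # (2, 2): the value 2 occurs twice
```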
2239 2240
2241 - def atmean(a,limits=None,inclusive=(1,1)):
2242 """ 2243 Returns the arithmetic mean of all values in an array, ignoring values 2244 strictly outside the sequence passed to 'limits'. Note: either limit 2245 in the sequence, or the value of limits itself, can be set to None. The 2246 inclusive list/tuple determines whether the lower and upper limiting bounds 2247 (respectively) are open/exclusive (0) or closed/inclusive (1). 2248 2249 Usage: atmean(a,limits=None,inclusive=(1,1)) 2250 """ 2251 if a.dtype in [N.int_, N.short,N.ubyte]: 2252 a = a.astype(N.float_) 2253 if limits == None: 2254 return mean(a) 2255 assert type(limits) in [ListType,TupleType,N.ndarray], "Wrong type for limits in atmean" 2256 if inclusive[0]: lowerfcn = N.greater_equal 2257 else: lowerfcn = N.greater 2258 if inclusive[1]: upperfcn = N.less_equal 2259 else: upperfcn = N.less 2260 if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce(N.ravel(a)): 2261 raise ValueError, "No array values within given limits (atmean)." 2262 elif limits[0]==None and limits[1]<>None: 2263 mask = upperfcn(a,limits[1]) 2264 elif limits[0]<>None and limits[1]==None: 2265 mask = lowerfcn(a,limits[0]) 2266 elif limits[0]<>None and limits[1]<>None: 2267 mask = lowerfcn(a,limits[0])*upperfcn(a,limits[1]) 2268 s = float(N.add.reduce(N.ravel(a*mask))) 2269 n = float(N.add.reduce(N.ravel(mask))) 2270 return s/n
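The masking strategy atmean uses (build a boolean mask from the limits, then average only the surviving values) is easier to see with boolean indexing; a minimal modern sketch with the same open/closed-bound convention:

```python
import numpy as np

def trimmed_mean(a, limits=None, inclusive=(True, True)):
    a = np.ravel(np.asarray(a, dtype=float))
    if limits is None:
        return a.mean()
    lower = np.greater_equal if inclusive[0] else np.greater
    upper = np.less_equal if inclusive[1] else np.less
    mask = np.ones(a.shape, dtype=bool)
    if limits[0] is not None:
        mask &= lower(a, limits[0])   # keep values at/above the lower bound
    if limits[1] is not None:
        mask &= upper(a, limits[1])   # keep values at/below the upper bound
    return a[mask].mean()

print(trimmed_mean([1, 2, 3, 4, 100], limits=(None, 10)))  # mean of 1..4 = 2.5
```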
2271 2272
2273 - def atvar(a,limits=None,inclusive=(1,1)):
2274 """ 2275 Returns the sample variance of values in an array, (i.e., using N-1), 2276 ignoring values strictly outside the sequence passed to 'limits'. 2277 Note: either limit in the sequence, or the value of limits itself, 2278 can be set to None. The inclusive list/tuple determines whether the lower 2279 and upper limiting bounds (respectively) are open/exclusive (0) or 2280 closed/inclusive (1). ASSUMES A FLAT ARRAY (OR ELSE PREFLATTENS). 2281 2282 Usage: atvar(a,limits=None,inclusive=(1,1)) 2283 """ 2284 a = a.astype(N.float_) 2285 if limits == None or limits == [None,None]: 2286 return avar(a) 2287 assert type(limits) in [ListType,TupleType,N.ndarray], "Wrong type for limits in atvar" 2288 if inclusive[0]: lowerfcn = N.greater_equal 2289 else: lowerfcn = N.greater 2290 if inclusive[1]: upperfcn = N.less_equal 2291 else: upperfcn = N.less 2292 if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce(N.ravel(a)): 2293 raise ValueError, "No array values within given limits (atvar)." 2294 elif limits[0]==None and limits[1]<>None: 2295 mask = upperfcn(a,limits[1]) 2296 elif limits[0]<>None and limits[1]==None: 2297 mask = lowerfcn(a,limits[0]) 2298 elif limits[0]<>None and limits[1]<>None: 2299 mask = lowerfcn(a,limits[0])*upperfcn(a,limits[1]) 2300 2301 a = N.compress(mask,a) # squish out excluded values 2302 return avar(a)
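The same compress-then-compute idea, with NumPy's ddof=1 giving the N-1 denominator that atvar's docstring promises (a sketch; the open/closed inclusive handling is omitted for brevity and both bounds are treated as closed):

```python
import numpy as np

def trimmed_var(a, limits=None):
    # sample variance (N-1 denominator) of the values inside the limits
    a = np.ravel(np.asarray(a, dtype=float))
    if limits is not None:
        if limits[0] is not None:
            a = a[a >= limits[0]]
        if limits[1] is not None:
            a = a[a <= limits[1]]
    return a.var(ddof=1)

print(trimmed_var([1, 2, 3, 4, 100], limits=(None, 10)))  # var of 1..4 = 5/3
```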
2303 2304
2305 - def atmin(a,lowerlimit=None,dimension=None,inclusive=1):
2306 """ 2307 Returns the minimum value of a, along dimension, including only values greater 2308 than (or equal to, if inclusive=1) lowerlimit. If the limit is set to None, 2309 all values in the array are used. 2310 2311 Usage: atmin(a,lowerlimit=None,dimension=None,inclusive=1) 2312 """ 2313 if inclusive: lowerfcn = N.greater_equal 2314 else: lowerfcn = N.greater 2315 if dimension == None: 2316 a = N.ravel(a) 2317 dimension = 0 2318 if lowerlimit == None: 2319 lowerlimit = N.minimum.reduce(N.ravel(a))-11 2320 biggest = N.maximum.reduce(N.ravel(a)) 2321 ta = N.where(lowerfcn(a,lowerlimit),a,biggest) 2322 return N.minimum.reduce(ta,dimension)
2323 2324
2325 - def atmax(a,upperlimit,dimension=None,inclusive=1):
2326 """ 2327 Returns the maximum value of a, along dimension, including only values less 2328 than (or equal to, if inclusive=1) upperlimit. If the limit is set to None, 2329 a limit larger than the max value in the array is used. 2330 2331 Usage: atmax(a,upperlimit,dimension=None,inclusive=1) 2332 """ 2333 if inclusive: upperfcn = N.less_equal 2334 else: upperfcn = N.less 2335 if dimension == None: 2336 a = N.ravel(a) 2337 dimension = 0 2338 if upperlimit == None: 2339 upperlimit = N.maximum.reduce(N.ravel(a))+1 2340 smallest = N.minimum.reduce(N.ravel(a)) 2341 ta = N.where(upperfcn(a,upperlimit),a,smallest) 2342 return N.maximum.reduce(ta,dimension)
2343 2344
2345 - def atstdev(a,limits=None,inclusive=(1,1)):
2346 """ 2347 Returns the standard deviation of all values in an array, ignoring values 2348 strictly outside the sequence passed to 'limits'. Note: either limit 2349 in the sequence, or the value of limits itself, can be set to None. The 2350 inclusive list/tuple determines whether the lower and upper limiting bounds 2351 (respectively) are open/exclusive (0) or closed/inclusive (1). 2352 2353 Usage: atstdev(a,limits=None,inclusive=(1,1)) 2354 """ 2355 return N.sqrt(tvar(a,limits,inclusive))
2356 2357
2358 - def atsem(a,limits=None,inclusive=(1,1)):
2359 """ 2360 Returns the standard error of the mean for the values in an array, 2361 (i.e., using N for the denominator), ignoring values strictly outside 2362 the sequence passed to 'limits'. Note: either limit in the sequence, 2363 or the value of limits itself, can be set to None. The inclusive list/tuple 2364 determines whether the lower and upper limiting bounds (respectively) are 2365 open/exclusive (0) or closed/inclusive (1). 2366 2367 Usage: atsem(a,limits=None,inclusive=(1,1)) 2368 """ 2369 sd = tstdev(a,limits,inclusive) 2370 if limits == None or limits == [None,None]: 2371 n = float(len(N.ravel(a))) 2372 limits = [min(a)-1, max(a)+1] 2373 assert type(limits) in [ListType,TupleType,N.ndarray], "Wrong type for limits in atsem" 2374 if inclusive[0]: lowerfcn = N.greater_equal 2375 else: lowerfcn = N.greater 2376 if inclusive[1]: upperfcn = N.less_equal 2377 else: upperfcn = N.less 2378 if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce(N.ravel(a)): 2379 raise ValueError, "No array values within given limits (atsem)." 2380 elif limits[0]==None and limits[1]<>None: 2381 mask = upperfcn(a,limits[1]) 2382 elif limits[0]<>None and limits[1]==None: 2383 mask = lowerfcn(a,limits[0]) 2384 elif limits[0]<>None and limits[1]<>None: 2385 mask = lowerfcn(a,limits[0])*upperfcn(a,limits[1]) 2386 term1 = N.add.reduce(N.ravel(a*a*mask)) 2387 n = float(N.add.reduce(N.ravel(mask))) 2388 return sd/math.sqrt(n)
2389 2390 2391 ##################################### 2392 ############ AMOMENTS ############# 2393 ##################################### 2394
2395 - def amoment(a,moment=1,dimension=None):
2396 """ 2397 Calculates the nth moment about the mean for a sample (defaults to the 2398 1st moment). Generally used to calculate coefficients of skewness and 2399 kurtosis. Dimension can equal None (ravel array first), an integer 2400 (the dimension over which to operate), or a sequence (operate over 2401 multiple dimensions). 2402 2403 Usage: amoment(a,moment=1,dimension=None) 2404 Returns: appropriate moment along given dimension 2405 """ 2406 if dimension == None: 2407 a = N.ravel(a) 2408 dimension = 0 2409 if moment == 1: 2410 return 0.0 2411 else: 2412 mn = amean(a,dimension,1) # 1=keepdims 2413 s = N.power((a-mn),moment) 2414 return amean(s,dimension)
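The nth central moment amoment computes is simply the mean of the nth-power deviations; a compact modern sketch (the 1st central moment is zero by definition, as the function's early return reflects):

```python
import numpy as np

def central_moment(a, moment=1, axis=None):
    # nth moment about the mean
    a = np.asarray(a, dtype=float)
    if moment == 1:
        return 0.0
    mn = np.mean(a, axis=axis, keepdims=axis is not None)
    return np.mean((a - mn) ** moment, axis=axis)

print(central_moment([1.0, 2.0, 3.0], 2))  # 2nd moment = population variance
```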
2415 2416
2417 - def avariation(a,dimension=None):
2418 """ 2419 Returns the coefficient of variation, as defined in CRC Standard 2420 Probability and Statistics, p.6. Dimension can equal None (ravel array 2421 first), an integer (the dimension over which to operate), or a 2422 sequence (operate over multiple dimensions). 2423 2424 Usage: avariation(a,dimension=None) 2425 """ 2426 return 100.0*asamplestdev(a,dimension)/amean(a,dimension)
2427 2428
2429 - def askew(a,dimension=None):
2430 """ 2431 Returns the skewness of a distribution (normal ==> 0.0; >0 means extra 2432 weight in the right tail). Use askewtest() to see if it's close enough. 2433 Dimension can equal None (ravel array first), an integer (the 2434 dimension over which to operate), or a sequence (operate over multiple 2435 dimensions). 2436 2437 Usage: askew(a, dimension=None) 2438 Returns: skew of vals in a along dimension, returning ZERO where all vals equal 2439 """ 2440 denom = N.power(amoment(a,2,dimension),1.5) 2441 zero = N.equal(denom,0) 2442 if type(denom) == N.ndarray and asum(zero) <> 0: 2443 print "Number of zeros in askew: ",asum(zero) 2444 denom = denom + zero # prevent divide-by-zero 2445 return N.where(zero, 0, amoment(a,3,dimension)/denom) 2446 2447
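For a 1D array the ratio askew builds, m3 / m2**1.5, can be written directly, with the same zero guard for constant input (a sketch of the statistic, not this module's multi-dimension machinery):

```python
import numpy as np

def skew(a):
    # third central moment over the 1.5 power of the second, zero-guarded
    a = np.ravel(np.asarray(a, dtype=float))
    m2 = np.mean((a - a.mean()) ** 2)
    m3 = np.mean((a - a.mean()) ** 3)
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

print(skew([1.0, 2.0, 3.0]))        # 0.0 for symmetric data
print(skew([1.0, 1.0, 1.0, 10.0]))  # positive: heavy right tail
```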
2448 - def akurtosis(a,dimension=None):
2449 """ 2450 Returns the kurtosis of a distribution (normal ==> 3.0; >3 means 2451 heavier in the tails, and usually more peaked). Use akurtosistest() 2452 to see if it's close enough. Dimension can equal None (ravel array 2453 first), an integer (the dimension over which to operate), or a 2454 sequence (operate over multiple dimensions). 2455 2456 Usage: akurtosis(a,dimension=None) 2457 Returns: kurtosis of values in a along dimension, and ZERO where all vals equal 2458 """ 2459 denom = N.power(amoment(a,2,dimension),2) 2460 zero = N.equal(denom,0) 2461 if type(denom) == N.ndarray and asum(zero) <> 0: 2462 print "Number of zeros in akurtosis: ",asum(zero) 2463 denom = denom + zero # prevent divide-by-zero 2464 return N.where(zero,0,amoment(a,4,dimension)/denom)
2465 2466
2467 - def adescribe(inarray,dimension=None):
2468 """ 2469 Returns several descriptive statistics of the passed array. Dimension 2470 can equal None (ravel array first), an integer (the dimension over 2471 which to operate), or a sequence (operate over multiple dimensions). 2472 2473 Usage: adescribe(inarray,dimension=None) 2474 Returns: n, (min,max), mean, standard deviation, skew, kurtosis 2475 """ 2476 if dimension == None: 2477 inarray = N.ravel(inarray) 2478 dimension = 0 2479 n = inarray.shape[dimension] 2480 mm = (N.minimum.reduce(inarray),N.maximum.reduce(inarray)) 2481 m = amean(inarray,dimension) 2482 sd = astdev(inarray,dimension) 2483 skew = askew(inarray,dimension) 2484 kurt = akurtosis(inarray,dimension) 2485 return n, mm, m, sd, skew, kurt
2486 2487 2488 ##################################### 2489 ######## NORMALITY TESTS ########## 2490 ##################################### 2491
2492 - def askewtest(a,dimension=None):
2493 """ 2494 Tests whether the skew is significantly different from a normal 2495 distribution. Dimension can equal None (ravel array first), an 2496 integer (the dimension over which to operate), or a sequence (operate 2497 over multiple dimensions). 2498 2499 Usage: askewtest(a,dimension=None) 2500 Returns: z-score and 2-tail z-probability 2501 """ 2502 if dimension == None: 2503 a = N.ravel(a) 2504 dimension = 0 2505 b2 = askew(a,dimension) 2506 n = float(a.shape[dimension]) 2507 y = b2 * N.sqrt(((n+1)*(n+3)) / (6.0*(n-2)) ) 2508 beta2 = ( 3.0*(n*n+27*n-70)*(n+1)*(n+3) ) / ( (n-2.0)*(n+5)*(n+7)*(n+9) ) 2509 W2 = -1 + N.sqrt(2*(beta2-1)) 2510 delta = 1/N.sqrt(N.log(N.sqrt(W2))) 2511 alpha = N.sqrt(2/(W2-1)) 2512 y = N.where(y==0,1,y) 2513 Z = delta*N.log(y/alpha + N.sqrt((y/alpha)**2+1)) 2514 return Z, (1.0-zprob(Z))*2
2515 2516
2517 - def akurtosistest(a,dimension=None):
2518 """ 2519 Tests whether a dataset has normal kurtosis (i.e., 2520 kurtosis=3(n-1)/(n+1)) Valid only for n>20. Dimension can equal None 2521 (ravel array first), an integer (the dimension over which to operate), 2522 or a sequence (operate over multiple dimensions). 2523 2524 Usage: akurtosistest(a,dimension=None) 2525 Returns: z-score and 2-tail z-probability, returns 0 for bad pixels 2526 """ 2527 if dimension == None: 2528 a = N.ravel(a) 2529 dimension = 0 2530 n = float(a.shape[dimension]) 2531 if n<20: 2532 print "akurtosistest only valid for n>=20 ... continuing anyway, n=",n 2533 b2 = akurtosis(a,dimension) 2534 E = 3.0*(n-1) /(n+1) 2535 varb2 = 24.0*n*(n-2)*(n-3) / ((n+1)*(n+1)*(n+3)*(n+5)) 2536 x = (b2-E)/N.sqrt(varb2) 2537 sqrtbeta1 = 6.0*(n*n-5*n+2)/((n+7)*(n+9)) * N.sqrt((6.0*(n+3)*(n+5))/ 2538 (n*(n-2)*(n-3))) 2539 A = 6.0 + 8.0/sqrtbeta1 *(2.0/sqrtbeta1 + N.sqrt(1+4.0/(sqrtbeta1**2))) 2540 term1 = 1 -2/(9.0*A) 2541 denom = 1 +x*N.sqrt(2/(A-4.0)) 2542 denom = N.where(N.less(denom,0), 99, denom) 2543 term2 = N.where(N.equal(denom,0), term1, N.power((1-2.0/A)/denom,1/3.0)) 2544 Z = ( term1 - term2 ) / N.sqrt(2/(9.0*A)) 2545 Z = N.where(N.equal(denom,99), 0, Z) 2546 return Z, (1.0-zprob(Z))*2
2547 2548
2549 - def anormaltest(a,dimension=None):
2550 """ 2551 Tests whether skew and/OR kurtosis of dataset differs from normal 2552 curve. Can operate over multiple dimensions. Dimension can equal 2553 None (ravel array first), an integer (the dimension over which to 2554 operate), or a sequence (operate over multiple dimensions). 2555 2556 Usage: anormaltest(a,dimension=None) 2557 Returns: z-score and 2-tail probability 2558 """ 2559 if dimension == None: 2560 a = N.ravel(a) 2561 dimension = 0 2562 s,p = askewtest(a,dimension) 2563 k,p = akurtosistest(a,dimension) 2564 k2 = N.power(s,2) + N.power(k,2) 2565 return k2, achisqprob(k2,2)
2566 2567 2568 ##################################### 2569 ###### AFREQUENCY FUNCTIONS ####### 2570 ##################################### 2571
2572 - def aitemfreq(a):
2573 """ 2574 Returns a 2D array of item frequencies. Column 0 contains the sorted item 2575 values, column 1 contains their respective counts. Assumes a 1D array is passed. 2576 @@@sorting OK? 2577 2578 Usage: aitemfreq(a) 2579 Returns: a 2D frequency table (col 0=scores, col 1=frequencies) 2580 """ 2581 scores = pstat.aunique(a) 2582 scores = N.sort(scores) 2583 freq = N.zeros(len(scores)) 2584 for i in range(len(scores)): 2585 freq[i] = N.add.reduce(N.equal(a,scores[i])) 2586 return N.array(pstat.aabut(scores, freq))
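In modern NumPy the whole loop collapses into one np.unique call (a sketch; np.unique already returns the values sorted, matching the function above):

```python
import numpy as np

vals, counts = np.unique([3, 1, 3, 3, 1], return_counts=True)
table = np.column_stack([vals, counts])  # col 0 = scores, col 1 = frequencies
print(table)
```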
2587 2588
2589 - def ascoreatpercentile (inarray, percent):
2590 """ 2591 Usage: ascoreatpercentile(inarray,percent) 0<percent<100 2592 Returns: score at given percentile, relative to inarray distribution 2593 """ 2594 percent = percent / 100.0 2595 targetcf = percent*len(inarray) 2596 h, lrl, binsize, extras = histogram(inarray) 2597 cumhist = cumsum(h*1) 2598 for i in range(len(cumhist)): 2599 if cumhist[i] >= targetcf: 2600 break 2601 score = binsize * ((targetcf - cumhist[i-1]) / float(h[i])) + (lrl+binsize*i) 2602 return score
2603 2604
2605 - def apercentileofscore (inarray,score,histbins=10,defaultlimits=None):
2606 """ 2607 Note: result of this function depends on the values used to histogram 2608 the data(!). 2609 2610 Usage: apercentileofscore(inarray,score,histbins=10,defaultlimits=None) 2611 Returns: percentile-position of score (0-100) relative to inarray 2612 """ 2613 h, lrl, binsize, extras = histogram(inarray,histbins,defaultlimits) 2614 cumhist = cumsum(h*1) 2615 i = int((score - lrl)/float(binsize)) 2616 pct = (cumhist[i-1]+((score-(lrl+binsize*i))/float(binsize))*h[i])/float(len(inarray)) * 100 2617 return pct
2618 2619
2620 - def ahistogram (inarray,numbins=10,defaultlimits=None,printextras=1):
2621 """ 2622 Returns (i) an array of histogram bin counts, (ii) the smallest value 2623 (lower real limit) of the histogram binning, (iii) the bin width, and 2624 (iv) the number of points outside the limits (the middle two are not 2625 necessarily integers). Default number of bins is 10. Defaultlimits 2626 can be None (the routine picks bins spanning all the numbers in the 2627 inarray) or a 2-sequence (lowerlimit, upperlimit). 2628 2629 Usage: ahistogram(inarray,numbins=10,defaultlimits=None,printextras=1) 2630 Returns: (array of bin counts, bin-minimum, min-width, #-points-outside-range) 2631 """ 2632 inarray = N.ravel(inarray) # flatten any >1D arrays 2633 if (defaultlimits <> None): 2634 lowerreallimit = defaultlimits[0] 2635 upperreallimit = defaultlimits[1] 2636 binsize = (upperreallimit-lowerreallimit) / float(numbins) 2637 else: 2638 Min = N.minimum.reduce(inarray) 2639 Max = N.maximum.reduce(inarray) 2640 estbinwidth = float(Max - Min)/float(numbins) + 1e-6 2641 binsize = (Max-Min+estbinwidth)/float(numbins) 2642 lowerreallimit = Min - binsize/2.0 #lower real limit,1st bin 2643 bins = N.zeros(numbins) 2644 extrapoints = 0 2645 for num in inarray: 2646 try: 2647 if (num-lowerreallimit) < 0: 2648 extrapoints = extrapoints + 1 2649 else: 2650 bintoincrement = int((num-lowerreallimit) / float(binsize)) 2651 bins[bintoincrement] = bins[bintoincrement] + 1 2652 except: # point outside lower/upper limits 2653 extrapoints = extrapoints + 1 2654 if (extrapoints > 0 and printextras == 1): 2655 print '\nPoints outside given histogram range =',extrapoints 2656 return (bins, lowerreallimit, binsize, extrapoints)
2657 2658
2659 - def acumfreq(a,numbins=10,defaultreallimits=None):
2660 """ 2661 Returns a cumulative frequency histogram, using the histogram function. 2662 Defaultreallimits can be None (use all data), or a 2-sequence containing 2663 lower and upper limits on values to include. 2664 2665 Usage: acumfreq(a,numbins=10,defaultreallimits=None) 2666 Returns: array of cumfreq bin values, lowerreallimit, binsize, extrapoints 2667 """ 2668 h,l,b,e = histogram(a,numbins,defaultreallimits) 2669 cumhist = cumsum(h*1) 2670 return cumhist,l,b,e
2671 2672
2673 - def arelfreq(a,numbins=10,defaultreallimits=None):
2674 """ 2675 Returns a relative frequency histogram, using the histogram function. 2676 Defaultreallimits can be None (use all data), or a 2-sequence containing 2677 lower and upper limits on values to include. 2678 2679 Usage: arelfreq(a,numbins=10,defaultreallimits=None) 2680 Returns: array of relative frequency bin values, lowerreallimit, binsize, extrapoints 2681 """ 2682 h,l,b,e = histogram(a,numbins,defaultreallimits) 2683 h = N.array(h/float(a.shape[0])) 2684 return h,l,b,e
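The three frequency helpers above (ahistogram, acumfreq, arelfreq) reduce to one np.histogram call plus a cumulative sum or a normalisation; a sketch using explicit range limits rather than the module's padded default binning:

```python
import numpy as np

data = [1.0, 2.0, 2.0, 3.0]
counts, edges = np.histogram(data, bins=3, range=(1.0, 3.0))
cumfreq = np.cumsum(counts)          # cumulative frequency histogram
relfreq = counts / float(len(data))  # relative frequency histogram
print(counts, cumfreq, relfreq)
```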
2685 2686 2687 ##################################### 2688 ###### AVARIABILITY FUNCTIONS ##### 2689 ##################################### 2690
2691 - def aobrientransform(*args):
2692 """ 2693 Computes a transform on input data (any number of columns). Used to 2694 test for homogeneity of variance prior to running one-way stats. Each 2695 array in *args is one level of a factor. If an F_oneway() run on the 2696 transformed data is found significant, the variances are unequal. From 2697 Maxwell and Delaney, p.112. 2698 2699 Usage: aobrientransform(*args) *args = 1D arrays, one per level of factor 2700 Returns: transformed data for use in an ANOVA 2701 """ 2702 TINY = 1e-10 2703 k = len(args) 2704 n = N.zeros(k,N.float_) 2705 v = N.zeros(k,N.float_) 2706 m = N.zeros(k,N.float_) 2707 nargs = [] 2708 for i in range(k): 2709 nargs.append(args[i].astype(N.float_)) 2710 n[i] = float(len(nargs[i])) 2711 v[i] = var(nargs[i]) 2712 m[i] = mean(nargs[i]) 2713 for j in range(k): 2714 for i in range(int(n[j])): 2715 t1 = (n[j]-1.5)*n[j]*(nargs[j][i]-m[j])**2 2716 t2 = 0.5*v[j]*(n[j]-1.0) 2717 t3 = (n[j]-1.0)*(n[j]-2.0) 2718 nargs[j][i] = (t1-t2) / float(t3) 2719 check = 1 2720 for j in range(k): 2721 if v[j] - mean(nargs[j]) > TINY: 2722 check = 0 2723 if check <> 1: 2724 raise ValueError, 'Lack of convergence in obrientransform.' 2725 else: 2726 return N.array(nargs)
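A useful sanity check on the transform above (and what the convergence test at the end verifies): for each group, the mean of the transformed scores equals the group's sample variance, which is what makes an ANOVA on the transformed data a test of variance homogeneity. A modern sketch of the per-group formula:

```python
import numpy as np

def obrien_transform(x):
    # O'Brien (Maxwell & Delaney) transform of one group
    x = np.asarray(x, dtype=float)
    n, v, m = float(len(x)), x.var(ddof=1), x.mean()
    return ((n - 1.5) * n * (x - m) ** 2 - 0.5 * v * (n - 1)) / ((n - 1) * (n - 2))

g = [1.0, 2.0, 3.0, 4.0, 5.0]
t = obrien_transform(g)
print(np.mean(t), np.var(g, ddof=1))  # both 2.5: mean(transform) == sample var
```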
2727 2728
2729 - def asamplevar (inarray,dimension=None,keepdims=0):
2730 """ 2731 Returns the sample standard deviation of the values in the passed 2732 array (i.e., using N). Dimension can equal None (ravel array first), 2733 an integer (the dimension over which to operate), or a sequence 2734 (operate over multiple dimensions). Set keepdims=1 to return an array 2735 with the same number of dimensions as inarray. 2736 2737 Usage: asamplevar(inarray,dimension=None,keepdims=0) 2738 """ 2739 if dimension == None: 2740 inarray = N.ravel(inarray) 2741 dimension = 0 2742 if dimension == 1: 2743 mn = amean(inarray,dimension)[:,N.NewAxis] 2744 else: 2745 mn = amean(inarray,dimension,keepdims=1) 2746 deviations = inarray - mn 2747 if type(dimension) == ListType: 2748 n = 1 2749 for d in dimension: 2750 n = n*inarray.shape[d] 2751 else: 2752 n = inarray.shape[dimension] 2753 svar = ass(deviations,dimension,keepdims) / float(n) 2754 return svar
2755 2756
2757 - def asamplestdev (inarray, dimension=None, keepdims=0):
2758 """ 2759 Returns the sample standard deviation of the values in the passed 2760 array (i.e., using N). Dimension can equal None (ravel array first), 2761 an integer (the dimension over which to operate), or a sequence 2762 (operate over multiple dimensions). Set keepdims=1 to return an array 2763 with the same number of dimensions as inarray. 2764 2765 Usage: asamplestdev(inarray,dimension=None,keepdims=0) 2766 """ 2767 return N.sqrt(asamplevar(inarray,dimension,keepdims))
2768 2769
2770 - def asignaltonoise(instack,dimension=0):
2771 """ 2772 Calculates signal-to-noise. Dimension can equal None (ravel array 2773 first), an integer (the dimension over which to operate), or a 2774 sequence (operate over multiple dimensions). 2775 2776 Usage: asignaltonoise(instack,dimension=0): 2777 Returns: array containing the value of (mean/stdev) along dimension, 2778 or 0 when stdev=0 2779 """ 2780 m = mean(instack,dimension) 2781 sd = stdev(instack,dimension) 2782 return N.where(sd==0,0,m/sd)
2783 2784
2785 - def acov (x,y, dimension=None,keepdims=0):
2786 """ 2787 Returns the estimated covariance of the values in the passed 2788 array (i.e., N-1). Dimension can equal None (ravel array first), an 2789 integer (the dimension over which to operate), or a sequence (operate 2790 over multiple dimensions). Set keepdims=1 to return an array with the 2791 same number of dimensions as inarray. 2792 2793 Usage: acov(x,y,dimension=None,keepdims=0) 2794 """ 2795 if dimension == None: 2796 x = N.ravel(x) 2797 y = N.ravel(y) 2798 dimension = 0 2799 xmn = amean(x,dimension,1) # keepdims 2800 xdeviations = x - xmn 2801 ymn = amean(y,dimension,1) # keepdims 2802 ydeviations = y - ymn 2803 if type(dimension) == ListType: 2804 n = 1 2805 for d in dimension: 2806 n = n*x.shape[d] 2807 else: 2808 n = x.shape[dimension] 2809 covar = N.sum(xdeviations*ydeviations)/float(n-1) 2810 return covar
2811 2812
2813 - def avar (inarray, dimension=None,keepdims=0):
2814 """ 2815 Returns the estimated population variance of the values in the passed 2816 array (i.e., N-1). Dimension can equal None (ravel array first), an 2817 integer (the dimension over which to operate), or a sequence (operate 2818 over multiple dimensions). Set keepdims=1 to return an array with the 2819 same number of dimensions as inarray. 2820 2821 Usage: avar(inarray,dimension=None,keepdims=0) 2822 """ 2823 if dimension == None: 2824 inarray = N.ravel(inarray) 2825 dimension = 0 2826 mn = amean(inarray,dimension,1) 2827 deviations = inarray - mn 2828 if type(dimension) == ListType: 2829 n = 1 2830 for d in dimension: 2831 n = n*inarray.shape[d] 2832 else: 2833 n = inarray.shape[dimension] 2834 var = ass(deviations,dimension,keepdims)/float(n-1) 2835 return var
2836 2837
2838 - def astdev (inarray, dimension=None, keepdims=0):
2839 """ 2840 Returns the estimated population standard deviation of the values in 2841 the passed array (i.e., N-1). Dimension can equal None (ravel array 2842 first), an integer (the dimension over which to operate), or a 2843 sequence (operate over multiple dimensions). Set keepdims=1 to return 2844 an array with the same number of dimensions as inarray. 2845 2846 Usage: astdev(inarray,dimension=None,keepdims=0) 2847 """ 2848 return N.sqrt(avar(inarray,dimension,keepdims))
2849 2850
2851 - def asterr (inarray, dimension=None, keepdims=0):
2852 """ 2853 Returns the estimated population standard error of the values in the 2854 passed array (i.e., N-1). Dimension can equal None (ravel array 2855 first), an integer (the dimension over which to operate), or a 2856 sequence (operate over multiple dimensions). Set keepdims=1 to return 2857 an array with the same number of dimensions as inarray. 2858 2859 Usage: asterr(inarray,dimension=None,keepdims=0) 2860 """ 2861 if dimension == None: 2862 inarray = N.ravel(inarray) 2863 dimension = 0 2864 return astdev(inarray,dimension,keepdims) / float(N.sqrt(inarray.shape[dimension]))
2865 2866
2867 - def asem (inarray, dimension=None, keepdims=0):
2868 """ 2869 Returns the standard error of the mean (i.e., using N) of the values 2870 in the passed array. Dimension can equal None (ravel array first), an 2871 integer (the dimension over which to operate), or a sequence (operate 2872 over multiple dimensions). Set keepdims=1 to return an array with the 2873 same number of dimensions as inarray. 2874 2875 Usage: asem(inarray,dimension=None, keepdims=0) 2876 """ 2877 if dimension == None: 2878 inarray = N.ravel(inarray) 2879 dimension = 0 2880 if type(dimension) == ListType: 2881 n = 1 2882 for d in dimension: 2883 n = n*inarray.shape[d] 2884 else: 2885 n = inarray.shape[dimension] 2886 s = asamplestdev(inarray,dimension,keepdims) / N.sqrt(n-1) 2887 return s
2888 2889
2890 - def az (a, score):
2891 """ 2892 Returns the z-score of a given input score, given thearray from which 2893 that score came. Not appropriate for population calculations, nor for 2894 arrays > 1D. 2895 2896 Usage: az(a, score) 2897 """ 2898 z = (score-amean(a)) / asamplestdev(a) 2899 return z
2900 2901
2902 - def azs (a):
2903 """ 2904 Returns a 1D array of z-scores, one for each score in the passed array, 2905 computed relative to the passed array. 2906 2907 Usage: azs(a) 2908 """ 2909 zscores = [] 2910 for item in a: 2911 zscores.append(z(a,item)) 2912 return N.array(zscores)
2913 2914
2915 - def azmap (scores, compare, dimension=0):
2916 """ 2917 Returns an array of z-scores the shape of scores (e.g., [x,y]), compared to 2918 array passed to compare (e.g., [time,x,y]). Assumes collapsing over dim 0 2919 of the compare array. 2920 2921 Usage: azs(scores, compare, dimension=0) 2922 """ 2923 mns = amean(compare,dimension) 2924 sstd = asamplestdev(compare,0) 2925 return (scores - mns) / sstd
2926 2927 2928 ##################################### 2929 ####### ATRIMMING FUNCTIONS ####### 2930 ##################################### 2931 2932 ## deleted around() as it's in numpy now 2933
2934 - def athreshold(a,threshmin=None,threshmax=None,newval=0):
2935 """ 2936 Like Numeric.clip() except that values <threshmid or >threshmax are replaced 2937 by newval instead of by threshmin/threshmax (respectively). 2938 2939 Usage: athreshold(a,threshmin=None,threshmax=None,newval=0) 2940 Returns: a, with values <threshmin or >threshmax replaced with newval 2941 """ 2942 mask = N.zeros(a.shape) 2943 if threshmin <> None: 2944 mask = mask + N.where(a<threshmin,1,0) 2945 if threshmax <> None: 2946 mask = mask + N.where(a>threshmax,1,0) 2947 mask = N.clip(mask,0,1) 2948 return N.where(mask,newval,a)
2949 2950
2951 - def atrimboth (a,proportiontocut):
2952 """ 2953 Slices off the passed proportion of items from BOTH ends of the passed 2954 array (i.e., with proportiontocut=0.1, slices 'leftmost' 10% AND 2955 'rightmost' 10% of scores. You must pre-sort the array if you want 2956 "proper" trimming. Slices off LESS if proportion results in a 2957 non-integer slice index (i.e., conservatively slices off 2958 proportiontocut). 2959 2960 Usage: atrimboth (a,proportiontocut) 2961 Returns: trimmed version of array a 2962 """ 2963 lowercut = int(proportiontocut*len(a)) 2964 uppercut = len(a) - lowercut 2965 return a[lowercut:uppercut]
2966 2967
2968 - def atrim1 (a,proportiontocut,tail='right'):
2969 """ 2970 Slices off the passed proportion of items from ONE end of the passed 2971 array (i.e., if proportiontocut=0.1, slices off 'leftmost' or 'rightmost' 2972 10% of scores). Slices off LESS if proportion results in a non-integer 2973 slice index (i.e., conservatively slices off proportiontocut). 2974 2975 Usage: atrim1(a,proportiontocut,tail='right') or set tail='left' 2976 Returns: trimmed version of array a 2977 """ 2978 if string.lower(tail) == 'right': 2979 lowercut = 0 2980 uppercut = len(a) - int(proportiontocut*len(a)) 2981 elif string.lower(tail) == 'left': 2982 lowercut = int(proportiontocut*len(a)) 2983 uppercut = len(a) 2984 return a[lowercut:uppercut]
2985 2986 2987 ##################################### 2988 ##### ACORRELATION FUNCTIONS ###### 2989 ##################################### 2990
2991 - def acovariance(X):
2992 """ 2993 Computes the covariance matrix of a matrix X. Requires a 2D matrix input. 2994 2995 Usage: acovariance(X) 2996 Returns: covariance matrix of X 2997 """ 2998 if len(X.shape) <> 2: 2999 raise TypeError, "acovariance requires 2D matrices" 3000 n = X.shape[0] 3001 mX = amean(X,0) 3002 return N.dot(N.transpose(X),X) / float(n) - N.multiply.outer(mX,mX)
3003 3004
3005 - def acorrelation(X):
3006 """ 3007 Computes the correlation matrix of a matrix X. Requires a 2D matrix input. 3008 3009 Usage: acorrelation(X) 3010 Returns: correlation matrix of X 3011 """ 3012 C = acovariance(X) 3013 V = N.diagonal(C) 3014 return C / N.sqrt(N.multiply.outer(V,V))
3015 3016
3017 - def apaired(x,y):
3018 """ 3019 Interactively determines the type of data in x and y, and then runs the 3020 appropriated statistic for paired group data. 3021 3022 Usage: apaired(x,y) x,y = the two arrays of values to be compared 3023 Returns: appropriate statistic name, value, and probability 3024 """ 3025 samples = '' 3026 while samples not in ['i','r','I','R','c','C']: 3027 print '\nIndependent or related samples, or correlation (i,r,c): ', 3028 samples = raw_input() 3029 3030 if samples in ['i','I','r','R']: 3031 print '\nComparing variances ...', 3032 # USE O'BRIEN'S TEST FOR HOMOGENEITY OF VARIANCE, Maxwell & delaney, p.112 3033 r = obrientransform(x,y) 3034 f,p = F_oneway(pstat.colex(r,0),pstat.colex(r,1)) 3035 if p<0.05: 3036 vartype='unequal, p='+str(round(p,4)) 3037 else: 3038 vartype='equal' 3039 print vartype 3040 if samples in ['i','I']: 3041 if vartype[0]=='e': 3042 t,p = ttest_ind(x,y,None,0) 3043 print '\nIndependent samples t-test: ', round(t,4),round(p,4) 3044 else: 3045 if len(x)>20 or len(y)>20: 3046 z,p = ranksums(x,y) 3047 print '\nRank Sums test (NONparametric, n>20): ', round(z,4),round(p,4) 3048 else: 3049 u,p = mannwhitneyu(x,y) 3050 print '\nMann-Whitney U-test (NONparametric, ns<20): ', round(u,4),round(p,4) 3051 3052 else: # RELATED SAMPLES 3053 if vartype[0]=='e': 3054 t,p = ttest_rel(x,y,0) 3055 print '\nRelated samples t-test: ', round(t,4),round(p,4) 3056 else: 3057 t,p = ranksums(x,y) 3058 print '\nWilcoxon T-test (NONparametric): ', round(t,4),round(p,4) 3059 else: # CORRELATION ANALYSIS 3060 corrtype = '' 3061 while corrtype not in ['c','C','r','R','d','D']: 3062 print '\nIs the data Continuous, Ranked, or Dichotomous (c,r,d): ', 3063 corrtype = raw_input() 3064 if corrtype in ['c','C']: 3065 m,b,r,p,see = linregress(x,y) 3066 print '\nLinear regression for continuous variables ...' 
3067 lol = [['Slope','Intercept','r','Prob','SEestimate'],[round(m,4),round(b,4),round(r,4),round(p,4),round(see,4)]] 3068 pstat.printcc(lol) 3069 elif corrtype in ['r','R']: 3070 r,p = spearmanr(x,y) 3071 print '\nCorrelation for ranked variables ...' 3072 print "Spearman's r: ",round(r,4),round(p,4) 3073 else: # DICHOTOMOUS 3074 r,p = pointbiserialr(x,y) 3075 print '\nAssuming x contains a dichotomous variable ...' 3076 print 'Point Biserial r: ',round(r,4),round(p,4) 3077 print '\n\n' 3078 return None
3079 3080
3081 - def dices(x,y):
3082 """ 3083 Calculates Dice's coefficient ... (2*number of common terms)/(number of terms in x + 3084 number of terms in y). Returns a value between 0 (orthogonal) and 1. 3085 3086 Usage: dices(x,y) 3087 """ 3088 import sets 3089 x = sets.Set(x) 3090 y = sets.Set(y) 3091 common = len(x.intersection(y)) 3092 total = float(len(x) + len(y)) 3093 return 2*common/total
3094 3095
3096 - def icc(x,y=None,verbose=0):
3097 """ 3098 Calculates intraclass correlation coefficients using simple, Type I sums of squares. 3099 If only one variable is passed, assumed it's an Nx2 matrix 3100 3101 Usage: icc(x,y=None,verbose=0) 3102 Returns: icc rho, prob ####PROB IS A GUESS BASED ON PEARSON 3103 """ 3104 TINY = 1.0e-20 3105 if y: 3106 all = N.concatenate([x,y],0) 3107 else: 3108 all = x+0 3109 x = all[:,0] 3110 y = all[:,1] 3111 totalss = ass(all-mean(all)) 3112 pairmeans = (x+y)/2. 3113 withinss = ass(x-pairmeans) + ass(y-pairmeans) 3114 withindf = float(len(x)) 3115 betwdf = float(len(x)-1) 3116 withinms = withinss / withindf 3117 betweenms = (totalss-withinss) / betwdf 3118 rho = (betweenms-withinms)/(withinms+betweenms) 3119 t = rho*math.sqrt(betwdf/((1.0-rho+TINY)*(1.0+rho+TINY))) 3120 prob = abetai(0.5*betwdf,0.5,betwdf/(betwdf+t*t),verbose) 3121 return rho, prob
3122 3123
3124 - def alincc(x,y):
3125 """ 3126 Calculates Lin's concordance correlation coefficient. 3127 3128 Usage: alincc(x,y) where x, y are equal-length arrays 3129 Returns: Lin's CC 3130 """ 3131 x = N.ravel(x) 3132 y = N.ravel(y) 3133 covar = acov(x,y)*(len(x)-1)/float(len(x)) # correct denom to n 3134 xvar = avar(x)*(len(x)-1)/float(len(x)) # correct denom to n 3135 yvar = avar(y)*(len(y)-1)/float(len(y)) # correct denom to n 3136 lincc = (2 * covar) / ((xvar+yvar) +((amean(x)-amean(y))**2)) 3137 return lincc
3138 3139
3140 - def apearsonr(x,y,verbose=1):
3141 """ 3142 Calculates a Pearson correlation coefficient and returns p. Taken 3143 from Heiman's Basic Statistics for the Behav. Sci (2nd), p.195. 3144 3145 Usage: apearsonr(x,y,verbose=1) where x,y are equal length arrays 3146 Returns: Pearson's r, two-tailed p-value 3147 """ 3148 TINY = 1.0e-20 3149 n = len(x) 3150 xmean = amean(x) 3151 ymean = amean(y) 3152 r_num = n*(N.add.reduce(x*y)) - N.add.reduce(x)*N.add.reduce(y) 3153 r_den = math.sqrt((n*ass(x) - asquare_of_sums(x))*(n*ass(y)-asquare_of_sums(y))) 3154 r = (r_num / r_den) 3155 df = n-2 3156 t = r*math.sqrt(df/((1.0-r+TINY)*(1.0+r+TINY))) 3157 prob = abetai(0.5*df,0.5,df/(df+t*t),verbose) 3158 return r,prob
3159 3160
3161 - def aspearmanr(x,y):
3162 """ 3163 Calculates a Spearman rank-order correlation coefficient. Taken 3164 from Heiman's Basic Statistics for the Behav. Sci (1st), p.192. 3165 3166 Usage: aspearmanr(x,y) where x,y are equal-length arrays 3167 Returns: Spearman's r, two-tailed p-value 3168 """ 3169 TINY = 1e-30 3170 n = len(x) 3171 rankx = rankdata(x) 3172 ranky = rankdata(y) 3173 dsq = N.add.reduce((rankx-ranky)**2) 3174 rs = 1 - 6*dsq / float(n*(n**2-1)) 3175 t = rs * math.sqrt((n-2) / ((rs+1.0)*(1.0-rs))) 3176 df = n-2 3177 probrs = abetai(0.5*df,0.5,df/(df+t*t)) 3178 # probability values for rs are from part 2 of the spearman function in 3179 # Numerical Recipies, p.510. They close to tables, but not exact.(?) 3180 return rs, probrs
3181 3182
3183 - def apointbiserialr(x,y):
3184 """ 3185 Calculates a point-biserial correlation coefficient and the associated 3186 probability value. Taken from Heiman's Basic Statistics for the Behav. 3187 Sci (1st), p.194. 3188 3189 Usage: apointbiserialr(x,y) where x,y are equal length arrays 3190 Returns: Point-biserial r, two-tailed p-value 3191 """ 3192 TINY = 1e-30 3193 categories = pstat.aunique(x) 3194 data = pstat.aabut(x,y) 3195 if len(categories) <> 2: 3196 raise ValueError, "Exactly 2 categories required (in x) for pointbiserialr()." 3197 else: # there are 2 categories, continue 3198 codemap = pstat.aabut(categories,N.arange(2)) 3199 recoded = pstat.arecode(data,codemap,0) 3200 x = pstat.alinexand(data,0,categories[0]) 3201 y = pstat.alinexand(data,0,categories[1]) 3202 xmean = amean(pstat.acolex(x,1)) 3203 ymean = amean(pstat.acolex(y,1)) 3204 n = len(data) 3205 adjust = math.sqrt((len(x)/float(n))*(len(y)/float(n))) 3206 rpb = (ymean - xmean)/asamplestdev(pstat.acolex(data,1))*adjust 3207 df = n-2 3208 t = rpb*math.sqrt(df/((1.0-rpb+TINY)*(1.0+rpb+TINY))) 3209 prob = abetai(0.5*df,0.5,df/(df+t*t)) 3210 return rpb, prob
3211 3212
3213 - def akendalltau(x,y):
3214 """ 3215 Calculates Kendall's tau ... correlation of ordinal data. Adapted 3216 from function kendl1 in Numerical Recipies. Needs good test-cases.@@@ 3217 3218 Usage: akendalltau(x,y) 3219 Returns: Kendall's tau, two-tailed p-value 3220 """ 3221 n1 = 0 3222 n2 = 0 3223 iss = 0 3224 for j in range(len(x)-1): 3225 for k in range(j,len(y)): 3226 a1 = x[j] - x[k] 3227 a2 = y[j] - y[k] 3228 aa = a1 * a2 3229 if (aa): # neither array has a tie 3230 n1 = n1 + 1 3231 n2 = n2 + 1 3232 if aa > 0: 3233 iss = iss + 1 3234 else: 3235 iss = iss -1 3236 else: 3237 if (a1): 3238 n1 = n1 + 1 3239 else: 3240 n2 = n2 + 1 3241 tau = iss / math.sqrt(n1*n2) 3242 svar = (4.0*len(x)+10.0) / (9.0*len(x)*(len(x)-1)) 3243 z = tau / math.sqrt(svar) 3244 prob = erfcc(abs(z)/1.4142136) 3245 return tau, prob
3246 3247
3248 - def alinregress(*args):
3249 """ 3250 Calculates a regression line on two arrays, x and y, corresponding to x,y 3251 pairs. If a single 2D array is passed, alinregress finds dim with 2 levels 3252 and splits data into x,y pairs along that dim. 3253 3254 Usage: alinregress(*args) args=2 equal-length arrays, or one 2D array 3255 Returns: slope, intercept, r, two-tailed prob, sterr-of-the-estimate, n 3256 """ 3257 TINY = 1.0e-20 3258 if len(args) == 1: # more than 1D array? 3259 args = args[0] 3260 if len(args) == 2: 3261 x = args[0] 3262 y = args[1] 3263 else: 3264 x = args[:,0] 3265 y = args[:,1] 3266 else: 3267 x = args[0] 3268 y = args[1] 3269 n = len(x) 3270 xmean = amean(x) 3271 ymean = amean(y) 3272 r_num = n*(N.add.reduce(x*y)) - N.add.reduce(x)*N.add.reduce(y) 3273 r_den = math.sqrt((n*ass(x) - asquare_of_sums(x))*(n*ass(y)-asquare_of_sums(y))) 3274 r = r_num / r_den 3275 z = 0.5*math.log((1.0+r+TINY)/(1.0-r+TINY)) 3276 df = n-2 3277 t = r*math.sqrt(df/((1.0-r+TINY)*(1.0+r+TINY))) 3278 prob = abetai(0.5*df,0.5,df/(df+t*t)) 3279 slope = r_num / (float(n)*ass(x) - asquare_of_sums(x)) 3280 intercept = ymean - slope*xmean 3281 sterrest = math.sqrt(1-r*r)*asamplestdev(y) 3282 return slope, intercept, r, prob, sterrest, n
3283
3284 - def amasslinregress(*args):
3285 """ 3286 Calculates a regression line on one 1D array (x) and one N-D array (y). 3287 3288 Returns: slope, intercept, r, two-tailed prob, sterr-of-the-estimate, n 3289 """ 3290 TINY = 1.0e-20 3291 if len(args) == 1: # more than 1D array? 3292 args = args[0] 3293 if len(args) == 2: 3294 x = N.ravel(args[0]) 3295 y = args[1] 3296 else: 3297 x = N.ravel(args[:,0]) 3298 y = args[:,1] 3299 else: 3300 x = args[0] 3301 y = args[1] 3302 x = x.astype(N.float_) 3303 y = y.astype(N.float_) 3304 n = len(x) 3305 xmean = amean(x) 3306 ymean = amean(y,0) 3307 shp = N.ones(len(y.shape)) 3308 shp[0] = len(x) 3309 x.shape = shp 3310 print x.shape, y.shape 3311 r_num = n*(N.add.reduce(x*y,0)) - N.add.reduce(x)*N.add.reduce(y,0) 3312 r_den = N.sqrt((n*ass(x) - asquare_of_sums(x))*(n*ass(y,0)-asquare_of_sums(y,0))) 3313 zerodivproblem = N.equal(r_den,0) 3314 r_den = N.where(zerodivproblem,1,r_den) # avoid zero-division in 1st place 3315 r = r_num / r_den # need to do this nicely for matrix division 3316 r = N.where(zerodivproblem,0.0,r) 3317 z = 0.5*N.log((1.0+r+TINY)/(1.0-r+TINY)) 3318 df = n-2 3319 t = r*N.sqrt(df/((1.0-r+TINY)*(1.0+r+TINY))) 3320 prob = abetai(0.5*df,0.5,df/(df+t*t)) 3321 3322 ss = float(n)*ass(x)-asquare_of_sums(x) 3323 s_den = N.where(ss==0,1,ss) # avoid zero-division in 1st place 3324 slope = r_num / s_den 3325 intercept = ymean - slope*xmean 3326 sterrest = N.sqrt(1-r*r)*asamplestdev(y,0) 3327 return slope, intercept, r, prob, sterrest, n
3328 3329 3330 ##################################### 3331 ##### AINFERENTIAL STATISTICS ##### 3332 ##################################### 3333
3334 - def attest_1samp(a,popmean,printit=0,name='Sample',writemode='a'):
3335 """ 3336 Calculates the t-obtained for the independent samples T-test on ONE group 3337 of scores a, given a population mean. If printit=1, results are printed 3338 to the screen. If printit='filename', the results are output to 'filename' 3339 using the given writemode (default=append). Returns t-value, and prob. 3340 3341 Usage: attest_1samp(a,popmean,Name='Sample',printit=0,writemode='a') 3342 Returns: t-value, two-tailed prob 3343 """ 3344 if type(a) != N.ndarray: 3345 a = N.array(a) 3346 x = amean(a) 3347 v = avar(a) 3348 n = len(a) 3349 df = n-1 3350 svar = ((n-1)*v) / float(df) 3351 t = (x-popmean)/math.sqrt(svar*(1.0/n)) 3352 prob = abetai(0.5*df,0.5,df/(df+t*t)) 3353 3354 if printit <> 0: 3355 statname = 'Single-sample T-test.' 3356 outputpairedstats(printit,writemode, 3357 'Population','--',popmean,0,0,0, 3358 name,n,x,v,N.minimum.reduce(N.ravel(a)), 3359 N.maximum.reduce(N.ravel(a)), 3360 statname,t,prob) 3361 return t,prob
3362 3363
3364 - def attest_ind (a, b, dimension=None, printit=0, name1='Samp1', name2='Samp2',writemode='a'):
3365 """ 3366 Calculates the t-obtained T-test on TWO INDEPENDENT samples of scores 3367 a, and b. From Numerical Recipies, p.483. If printit=1, results are 3368 printed to the screen. If printit='filename', the results are output 3369 to 'filename' using the given writemode (default=append). Dimension 3370 can equal None (ravel array first), or an integer (the dimension over 3371 which to operate on a and b). 3372 3373 Usage: attest_ind (a,b,dimension=None,printit=0, 3374 Name1='Samp1',Name2='Samp2',writemode='a') 3375 Returns: t-value, two-tailed p-value 3376 """ 3377 if dimension == None: 3378 a = N.ravel(a) 3379 b = N.ravel(b) 3380 dimension = 0 3381 x1 = amean(a,dimension) 3382 x2 = amean(b,dimension) 3383 v1 = avar(a,dimension) 3384 v2 = avar(b,dimension) 3385 n1 = a.shape[dimension] 3386 n2 = b.shape[dimension] 3387 df = n1+n2-2 3388 svar = ((n1-1)*v1+(n2-1)*v2) / float(df) 3389 zerodivproblem = N.equal(svar,0) 3390 svar = N.where(zerodivproblem,1,svar) # avoid zero-division in 1st place 3391 t = (x1-x2)/N.sqrt(svar*(1.0/n1 + 1.0/n2)) # N-D COMPUTATION HERE!!!!!! 3392 t = N.where(zerodivproblem,1.0,t) # replace NaN/wrong t-values with 1.0 3393 probs = abetai(0.5*df,0.5,float(df)/(df+t*t)) 3394 3395 if type(t) == N.ndarray: 3396 probs = N.reshape(probs,t.shape) 3397 if probs.shape == (1,): 3398 probs = probs[0] 3399 3400 if printit <> 0: 3401 if type(t) == N.ndarray: 3402 t = t[0] 3403 if type(probs) == N.ndarray: 3404 probs = probs[0] 3405 statname = 'Independent samples T-test.' 3406 outputpairedstats(printit,writemode, 3407 name1,n1,x1,v1,N.minimum.reduce(N.ravel(a)), 3408 N.maximum.reduce(N.ravel(a)), 3409 name2,n2,x2,v2,N.minimum.reduce(N.ravel(b)), 3410 N.maximum.reduce(N.ravel(b)), 3411 statname,t,probs) 3412 return 3413 return t, probs
3414
3415 - def ap2t(pval,df):
3416 """ 3417 Tries to compute a t-value from a p-value (or pval array) and associated df. 3418 SLOW for large numbers of elements(!) as it re-computes p-values 20 times 3419 (smaller step-sizes) at which point it decides it's done. Keeps the signs 3420 of the input array. Returns 1000 (or -1000) if t>100. 3421 3422 Usage: ap2t(pval,df) 3423 Returns: an array of t-values with the shape of pval 3424 """ 3425 pval = N.array(pval) 3426 signs = N.sign(pval) 3427 pval = abs(pval) 3428 t = N.ones(pval.shape,N.float_)*50 3429 step = N.ones(pval.shape,N.float_)*25 3430 print "Initial ap2t() prob calc" 3431 prob = abetai(0.5*df,0.5,float(df)/(df+t*t)) 3432 print 'ap2t() iter: ', 3433 for i in range(10): 3434 print i,' ', 3435 t = N.where(pval<prob,t+step,t-step) 3436 prob = abetai(0.5*df,0.5,float(df)/(df+t*t)) 3437 step = step/2 3438 print 3439 # since this is an ugly hack, we get ugly boundaries 3440 t = N.where(t>99.9,1000,t) # hit upper-boundary 3441 t = t+signs 3442 return t #, prob, pval
3443 3444
3445 - def attest_rel (a,b,dimension=None,printit=0,name1='Samp1',name2='Samp2',writemode='a'):
3446 """ 3447 Calculates the t-obtained T-test on TWO RELATED samples of scores, a 3448 and b. From Numerical Recipies, p.483. If printit=1, results are 3449 printed to the screen. If printit='filename', the results are output 3450 to 'filename' using the given writemode (default=append). Dimension 3451 can equal None (ravel array first), or an integer (the dimension over 3452 which to operate on a and b). 3453 3454 Usage: attest_rel(a,b,dimension=None,printit=0, 3455 name1='Samp1',name2='Samp2',writemode='a') 3456 Returns: t-value, two-tailed p-value 3457 """ 3458 if dimension == None: 3459 a = N.ravel(a) 3460 b = N.ravel(b) 3461 dimension = 0 3462 if len(a)<>len(b): 3463 raise ValueError, 'Unequal length arrays.' 3464 x1 = amean(a,dimension) 3465 x2 = amean(b,dimension) 3466 v1 = avar(a,dimension) 3467 v2 = avar(b,dimension) 3468 n = a.shape[dimension] 3469 df = float(n-1) 3470 d = (a-b).astype('d') 3471 3472 denom = N.sqrt((n*N.add.reduce(d*d,dimension) - N.add.reduce(d,dimension)**2) /df) 3473 zerodivproblem = N.equal(denom,0) 3474 denom = N.where(zerodivproblem,1,denom) # avoid zero-division in 1st place 3475 t = N.add.reduce(d,dimension) / denom # N-D COMPUTATION HERE!!!!!! 3476 t = N.where(zerodivproblem,1.0,t) # replace NaN/wrong t-values with 1.0 3477 probs = abetai(0.5*df,0.5,float(df)/(df+t*t)) 3478 if type(t) == N.ndarray: 3479 probs = N.reshape(probs,t.shape) 3480 if probs.shape == (1,): 3481 probs = probs[0] 3482 3483 if printit <> 0: 3484 statname = 'Related samples T-test.' 3485 outputpairedstats(printit,writemode, 3486 name1,n,x1,v1,N.minimum.reduce(N.ravel(a)), 3487 N.maximum.reduce(N.ravel(a)), 3488 name2,n,x2,v2,N.minimum.reduce(N.ravel(b)), 3489 N.maximum.reduce(N.ravel(b)), 3490 statname,t,probs) 3491 return 3492 return t, probs
3493 3494
3495 - def achisquare(f_obs,f_exp=None):
3496 """ 3497 Calculates a one-way chi square for array of observed frequencies and returns 3498 the result. If no expected frequencies are given, the total N is assumed to 3499 be equally distributed across all groups. 3500 @@@NOT RIGHT?? 3501 3502 Usage: achisquare(f_obs, f_exp=None) f_obs = array of observed cell freq. 3503 Returns: chisquare-statistic, associated p-value 3504 """ 3505 3506 k = len(f_obs) 3507 if f_exp == None: 3508 f_exp = N.array([sum(f_obs)/float(k)] * len(f_obs),N.float_) 3509 f_exp = f_exp.astype(N.float_) 3510 chisq = N.add.reduce((f_obs-f_exp)**2 / f_exp) 3511 return chisq, achisqprob(chisq, k-1)
3512 3513
3514 - def aks_2samp (data1,data2):
3515 """ 3516 Computes the Kolmogorov-Smirnof statistic on 2 samples. Modified from 3517 Numerical Recipies in C, page 493. Returns KS D-value, prob. Not ufunc- 3518 like. 3519 3520 Usage: aks_2samp(data1,data2) where data1 and data2 are 1D arrays 3521 Returns: KS D-value, p-value 3522 """ 3523 j1 = 0 # N.zeros(data1.shape[1:]) TRIED TO MAKE THIS UFUNC-LIKE 3524 j2 = 0 # N.zeros(data2.shape[1:]) 3525 fn1 = 0.0 # N.zeros(data1.shape[1:],N.float_) 3526 fn2 = 0.0 # N.zeros(data2.shape[1:],N.float_) 3527 n1 = data1.shape[0] 3528 n2 = data2.shape[0] 3529 en1 = n1*1 3530 en2 = n2*1 3531 d = N.zeros(data1.shape[1:],N.float_) 3532 data1 = N.sort(data1,0) 3533 data2 = N.sort(data2,0) 3534 while j1 < n1 and j2 < n2: 3535 d1=data1[j1] 3536 d2=data2[j2] 3537 if d1 <= d2: 3538 fn1 = (j1)/float(en1) 3539 j1 = j1 + 1 3540 if d2 <= d1: 3541 fn2 = (j2)/float(en2) 3542 j2 = j2 + 1 3543 dt = (fn2-fn1) 3544 if abs(dt) > abs(d): 3545 d = dt 3546 # try: 3547 en = math.sqrt(en1*en2/float(en1+en2)) 3548 prob = aksprob((en+0.12+0.11/en)*N.fabs(d)) 3549 # except: 3550 # prob = 1.0 3551 return d, prob
3552 3553
3554 - def amannwhitneyu(x,y):
3555 """ 3556 Calculates a Mann-Whitney U statistic on the provided scores and 3557 returns the result. Use only when the n in each condition is < 20 and 3558 you have 2 independent samples of ranks. REMEMBER: Mann-Whitney U is 3559 significant if the u-obtained is LESS THAN or equal to the critical 3560 value of U. 3561 3562 Usage: amannwhitneyu(x,y) where x,y are arrays of values for 2 conditions 3563 Returns: u-statistic, one-tailed p-value (i.e., p(z(U))) 3564 """ 3565 n1 = len(x) 3566 n2 = len(y) 3567 ranked = rankdata(N.concatenate((x,y))) 3568 rankx = ranked[0:n1] # get the x-ranks 3569 ranky = ranked[n1:] # the rest are y-ranks 3570 u1 = n1*n2 + (n1*(n1+1))/2.0 - sum(rankx) # calc U for x 3571 u2 = n1*n2 - u1 # remainder is U for y 3572 bigu = max(u1,u2) 3573 smallu = min(u1,u2) 3574 T = math.sqrt(tiecorrect(ranked)) # correction factor for tied scores 3575 if T == 0: 3576 raise ValueError, 'All numbers are identical in amannwhitneyu' 3577 sd = math.sqrt(T*n1*n2*(n1+n2+1)/12.0) 3578 z = abs((bigu-n1*n2/2.0) / sd) # normal approximation for prob calc 3579 return smallu, 1.0 - azprob(z)
3580 3581
3582 - def atiecorrect(rankvals):
3583 """ 3584 Tie-corrector for ties in Mann Whitney U and Kruskal Wallis H tests. 3585 See Siegel, S. (1956) Nonparametric Statistics for the Behavioral 3586 Sciences. New York: McGraw-Hill. Code adapted from |Stat rankind.c 3587 code. 3588 3589 Usage: atiecorrect(rankvals) 3590 Returns: T correction factor for U or H 3591 """ 3592 sorted,posn = ashellsort(N.array(rankvals)) 3593 n = len(sorted) 3594 T = 0.0 3595 i = 0 3596 while (i<n-1): 3597 if sorted[i] == sorted[i+1]: 3598 nties = 1 3599 while (i<n-1) and (sorted[i] == sorted[i+1]): 3600 nties = nties +1 3601 i = i +1 3602 T = T + nties**3 - nties 3603 i = i+1 3604 T = T / float(n**3-n) 3605 return 1.0 - T
3606 3607
3608 - def aranksums(x,y):
3609 """ 3610 Calculates the rank sums statistic on the provided scores and returns 3611 the result. 3612 3613 Usage: aranksums(x,y) where x,y are arrays of values for 2 conditions 3614 Returns: z-statistic, two-tailed p-value 3615 """ 3616 n1 = len(x) 3617 n2 = len(y) 3618 alldata = N.concatenate((x,y)) 3619 ranked = arankdata(alldata) 3620 x = ranked[:n1] 3621 y = ranked[n1:] 3622 s = sum(x) 3623 expected = n1*(n1+n2+1) / 2.0 3624 z = (s - expected) / math.sqrt(n1*n2*(n1+n2+1)/12.0) 3625 prob = 2*(1.0 - azprob(abs(z))) 3626 return z, prob
3627 3628
3629 - def awilcoxont(x,y):
3630 """ 3631 Calculates the Wilcoxon T-test for related samples and returns the 3632 result. A non-parametric T-test. 3633 3634 Usage: awilcoxont(x,y) where x,y are equal-length arrays for 2 conditions 3635 Returns: t-statistic, two-tailed p-value 3636 """ 3637 if len(x) <> len(y): 3638 raise ValueError, 'Unequal N in awilcoxont. Aborting.' 3639 d = x-y 3640 d = N.compress(N.not_equal(d,0),d) # Keep all non-zero differences 3641 count = len(d) 3642 absd = abs(d) 3643 absranked = arankdata(absd) 3644 r_plus = 0.0 3645 r_minus = 0.0 3646 for i in range(len(absd)): 3647 if d[i] < 0: 3648 r_minus = r_minus + absranked[i] 3649 else: 3650 r_plus = r_plus + absranked[i] 3651 wt = min(r_plus, r_minus) 3652 mn = count * (count+1) * 0.25 3653 se = math.sqrt(count*(count+1)*(2.0*count+1.0)/24.0) 3654 z = math.fabs(wt-mn) / se 3655 z = math.fabs(wt-mn) / se 3656 prob = 2*(1.0 -zprob(abs(z))) 3657 return wt, prob
3658 3659
3660 - def akruskalwallish(*args):
3661 """ 3662 The Kruskal-Wallis H-test is a non-parametric ANOVA for 3 or more 3663 groups, requiring at least 5 subjects in each group. This function 3664 calculates the Kruskal-Wallis H and associated p-value for 3 or more 3665 independent samples. 3666 3667 Usage: akruskalwallish(*args) args are separate arrays for 3+ conditions 3668 Returns: H-statistic (corrected for ties), associated p-value 3669 """ 3670 assert len(args) == 3, "Need at least 3 groups in stats.akruskalwallish()" 3671 args = list(args) 3672 n = [0]*len(args) 3673 n = map(len,args) 3674 all = [] 3675 for i in range(len(args)): 3676 all = all + args[i].tolist() 3677 ranked = rankdata(all) 3678 T = tiecorrect(ranked) 3679 for i in range(len(args)): 3680 args[i] = ranked[0:n[i]] 3681 del ranked[0:n[i]] 3682 rsums = [] 3683 for i in range(len(args)): 3684 rsums.append(sum(args[i])**2) 3685 rsums[i] = rsums[i] / float(n[i]) 3686 ssbn = sum(rsums) 3687 totaln = sum(n) 3688 h = 12.0 / (totaln*(totaln+1)) * ssbn - 3*(totaln+1) 3689 df = len(args) - 1 3690 if T == 0: 3691 raise ValueError, 'All numbers are identical in akruskalwallish' 3692 h = h / float(T) 3693 return h, chisqprob(h,df)
3694 3695
3696 - def afriedmanchisquare(*args):
3697 """ 3698 Friedman Chi-Square is a non-parametric, one-way within-subjects 3699 ANOVA. This function calculates the Friedman Chi-square test for 3700 repeated measures and returns the result, along with the associated 3701 probability value. It assumes 3 or more repeated measures. With only 3 3702 levels, a minimum of 10 subjects is required; four levels 3703 requires 5 subjects per level(??). 3704 3705 Usage: afriedmanchisquare(*args) args are separate arrays for 3+ conditions 3706 Returns: chi-square statistic, associated p-value 3707 """ 3708 k = len(args) 3709 if k < 3: 3710 raise ValueError, '\nLess than 3 levels. Friedman test not appropriate.\n' 3711 n = len(args[0]) 3712 data = apply(pstat.aabut,args) 3713 data = data.astype(N.float_) 3714 for i in range(len(data)): 3715 data[i] = arankdata(data[i]) 3716 ssbn = asum(asum(data,0)**2) 3717 chisq = 12.0 / (k*n*(k+1)) * ssbn - 3*n*(k+1) 3718 return chisq, achisqprob(chisq,k-1)
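The rank-within-subject logic can be sketched in self-contained Python 3 (illustrative name; this sketch assumes no ties within any subject's row, whereas the function above rank-averages ties via arankdata):

```python
def friedman_chisq(*conditions):
    """Friedman chi-square for k >= 3 repeated measures; no-tie sketch."""
    k = len(conditions)
    if k < 3:
        raise ValueError("need at least 3 conditions")
    n = len(conditions[0])
    rank_sums = [0.0] * k
    for s in range(n):
        # Rank this subject's scores across the k conditions (1-based).
        order = sorted(range(k), key=lambda i: conditions[i][s])
        for rank, cond in enumerate(order, start=1):
            rank_sums[cond] += rank
    ssbn = sum(r * r for r in rank_sums)
    return 12.0 / (k * n * (k + 1)) * ssbn - 3.0 * n * (k + 1)
```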
3719 3720 3721 ##################################### 3722 #### APROBABILITY CALCULATIONS #### 3723 ##################################### 3724
3725 - def achisqprob(chisq,df):
3726 """ 3727 Returns the (1-tail) probability value associated with the provided chi-square 3728 value and df. Heavily modified from chisq.c in Gary Perlman's |Stat. Can 3729 handle multiple dimensions. 3730 3731 Usage: achisqprob(chisq,df) chisq=chisquare stat., df=degrees of freedom 3732 """ 3733 BIG = 200.0 3734 def ex(x): 3735 BIG = 200.0 3736 exponents = N.where(N.less(x,-BIG),-BIG,x) 3737 return N.exp(exponents)
3738 3739 if type(chisq) == N.ndarray: 3740 arrayflag = 1 3741 else: 3742 arrayflag = 0 3743 chisq = N.array([chisq]) 3744 if df < 1: 3745 return N.ones(chisq.shape,N.float) 3746 probs = N.zeros(chisq.shape,N.float_) 3747 probs = N.where(N.less_equal(chisq,0),1.0,probs) # set prob=1 for chisq<0 3748 a = 0.5 * chisq 3749 if df > 1: 3750 y = ex(-a) 3751 if df%2 == 0: 3752 even = 1 3753 s = y*1 3754 s2 = s*1 3755 else: 3756 even = 0 3757 s = 2.0 * azprob(-N.sqrt(chisq)) 3758 s2 = s*1 3759 if (df > 2): 3760 chisq = 0.5 * (df - 1.0) 3761 if even: 3762 z = N.ones(probs.shape,N.float_) 3763 else: 3764 z = 0.5 *N.ones(probs.shape,N.float_) 3765 if even: 3766 e = N.zeros(probs.shape,N.float_) 3767 else: 3768 e = N.log(N.sqrt(N.pi)) *N.ones(probs.shape,N.float_) 3769 c = N.log(a) 3770 mask = N.zeros(probs.shape) 3771 a_big = N.greater(a,BIG) 3772 a_big_frozen = -1 *N.ones(probs.shape,N.float_) 3773 totalelements = N.multiply.reduce(N.array(probs.shape)) 3774 while asum(mask)<>totalelements: 3775 e = N.log(z) + e 3776 s = s + ex(c*z-a-e) 3777 z = z + 1.0 3778 # print z, e, s 3779 newmask = N.greater(z,chisq) 3780 a_big_frozen = N.where(newmask*N.equal(mask,0)*a_big, s, a_big_frozen) 3781 mask = N.clip(newmask+mask,0,1) 3782 if even: 3783 z = N.ones(probs.shape,N.float_) 3784 e = N.ones(probs.shape,N.float_) 3785 else: 3786 z = 0.5 *N.ones(probs.shape,N.float_) 3787 e = 1.0 / N.sqrt(N.pi) / N.sqrt(a) * N.ones(probs.shape,N.float_) 3788 c = 0.0 3789 mask = N.zeros(probs.shape) 3790 a_notbig_frozen = -1 *N.ones(probs.shape,N.float_) 3791 while asum(mask)<>totalelements: 3792 e = e * (a/z.astype(N.float_)) 3793 c = c + e 3794 z = z + 1.0 3795 # print '#2', z, e, c, s, c*y+s2 3796 newmask = N.greater(z,chisq) 3797 a_notbig_frozen = N.where(newmask*N.equal(mask,0)*(1-a_big), 3798 c*y+s2, a_notbig_frozen) 3799 mask = N.clip(newmask+mask,0,1) 3800 probs = N.where(N.equal(probs,1),1, 3801 N.where(N.greater(a,BIG),a_big_frozen,a_notbig_frozen)) 3802 return probs 3803 else: 3804 return 
s 3805 3806
3807 - def aerfcc(x):
3808 """ 3809 Returns the complementary error function erfc(x) with fractional error 3810 everywhere less than 1.2e-7. Adapted from Numerical Recipes. Can 3811 handle multiple dimensions. 3812 3813 Usage: aerfcc(x) 3814 """ 3815 z = abs(x) 3816 t = 1.0 / (1.0+0.5*z) 3817 ans = t * N.exp(-z*z-1.26551223 + t*(1.00002368+t*(0.37409196+t*(0.09678418+t*(-0.18628806+t*(0.27886807+t*(-1.13520398+t*(1.48851587+t*(-0.82215223+t*0.17087277))))))))) 3818 return N.where(N.greater_equal(x,0), ans, 2.0-ans)
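A scalar Python 3 transcription of the same rational approximation can be cross-checked against the stdlib's exact `math.erfc` (editorial sketch, not the module's code):

```python
import math

def erfcc(x):
    """Scalar version of the Numerical Recipes 'erfcc' fit used above."""
    z = abs(x)
    t = 1.0 / (1.0 + 0.5 * z)
    ans = t * math.exp(-z * z - 1.26551223 + t * (1.00002368 + t * (0.37409196 +
        t * (0.09678418 + t * (-0.18628806 + t * (0.27886807 + t * (-1.13520398 +
        t * (1.48851587 + t * (-0.82215223 + t * 0.17087277)))))))))
    return ans if x >= 0 else 2.0 - ans
```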
3819 3820
3821 - def azprob(z):
3822 """ 3823 Returns the area under the normal curve 'to the left of' the given z value. 3824 Thus, 3825 for z<0, zprob(z) = 1-tail probability 3826 for z>0, 1.0-zprob(z) = 1-tail probability 3827 for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability 3828 Adapted from z.c in Gary Perlman's |Stat. Can handle multiple dimensions. 3829 3830 Usage: azprob(z) where z is a z-value 3831 """ 3832 def yfunc(y): 3833 x = (((((((((((((-0.000045255659 * y 3834 +0.000152529290) * y -0.000019538132) * y 3835 -0.000676904986) * y +0.001390604284) * y 3836 -0.000794620820) * y -0.002034254874) * y 3837 +0.006549791214) * y -0.010557625006) * y 3838 +0.011630447319) * y -0.009279453341) * y 3839 +0.005353579108) * y -0.002141268741) * y 3840 +0.000535310849) * y +0.999936657524 3841 return x
3842 3843 def wfunc(w): 3844 x = ((((((((0.000124818987 * w 3845 -0.001075204047) * w +0.005198775019) * w 3846 -0.019198292004) * w +0.059054035642) * w 3847 -0.151968751364) * w +0.319152932694) * w 3848 -0.531923007300) * w +0.797884560593) * N.sqrt(w) * 2.0 3849 return x 3850 3851 Z_MAX = 6.0 # maximum meaningful z-value 3852 x = N.zeros(z.shape,N.float_) # initialize 3853 y = 0.5 * N.fabs(z) 3854 x = N.where(N.less(y,1.0),wfunc(y*y),yfunc(y-2.0)) # get x's 3855 x = N.where(N.greater(y,Z_MAX*0.5),1.0,x) # kill those with big Z 3856 prob = N.where(N.greater(z,0),(x+1)*0.5,(1-x)*0.5) 3857 return prob 3858 3859
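azprob approximates the standard-normal CDF with the two polynomial fits above; in modern Python the same quantity is available in closed form through `math.erf`. A minimal scalar sketch (illustrative name):

```python
import math

def zprob(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```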
3860 - def aksprob(alam):
3861 """ 3862 Returns the probability value for a K-S statistic computed via ks_2samp. 3863 Adapted from Numerical Recipes. Can handle multiple dimensions. 3864 3865 Usage: aksprob(alam) 3866 """ 3867 if type(alam) == N.ndarray: 3868 frozen = -1 *N.ones(alam.shape,N.float64) 3869 alam = alam.astype(N.float64) 3870 arrayflag = 1 3871 else: 3872 frozen = N.array(-1.) 3873 alam = N.array(alam,N.float64) 3874 arrayflag = 0 3875 mask = N.zeros(alam.shape) 3876 fac = 2.0 *N.ones(alam.shape,N.float_) 3877 sum = N.zeros(alam.shape,N.float_) 3878 termbf = N.zeros(alam.shape,N.float_) 3879 a2 = N.array(-2.0*alam*alam,N.float64) 3880 totalelements = N.multiply.reduce(N.array(mask.shape)) 3881 for j in range(1,201): 3882 if asum(mask) == totalelements: 3883 break 3884 exponents = (a2*j*j) 3885 overflowmask = N.less(exponents,-746) 3886 frozen = N.where(overflowmask,0,frozen) 3887 mask = mask+overflowmask 3888 term = fac*N.exp(exponents) 3889 sum = sum + term 3890 newmask = N.where(N.less_equal(abs(term),(0.001*termbf)) + 3891 N.less(abs(term),1.0e-8*sum), 1, 0) 3892 frozen = N.where(newmask*N.equal(mask,0), sum, frozen) 3893 mask = N.clip(mask+newmask,0,1) 3894 fac = -fac 3895 termbf = abs(term) 3896 if arrayflag: 3897 return N.where(N.equal(frozen,-1), 1.0, frozen) # 1.0 if doesn't converge 3898 else: 3899 return N.where(N.equal(frozen,-1), 1.0, frozen)[0] # 1.0 if doesn't converge
3900 3901
3902 - def afprob (dfnum, dfden, F):
3903 """ 3904 Returns the 1-tailed significance level (p-value) of an F statistic 3905 given the degrees of freedom for the numerator (dfR-dfF) and the degrees 3906 of freedom for the denominator (dfF). Can handle multiple dims for F. 3907 3908 Usage: afprob(dfnum, dfden, F) where usually dfnum=dfbn, dfden=dfwn 3909 """ 3910 if type(F) == N.ndarray: 3911 return abetai(0.5*dfden, 0.5*dfnum, dfden/(1.0*dfden+dfnum*F)) 3912 else: 3913 return abetai(0.5*dfden, 0.5*dfnum, dfden/float(dfden+dfnum*F))
3914 3915
3916 - def abetacf(a,b,x,verbose=1):
3917 """ 3918 Evaluates the continued fraction form of the incomplete Beta function, 3919 betai. (Adapted from: Numerical Recipies in C.) Can handle multiple 3920 dimensions for x. 3921 3922 Usage: abetacf(a,b,x,verbose=1) 3923 """ 3924 ITMAX = 200 3925 EPS = 3.0e-7 3926 3927 arrayflag = 1 3928 if type(x) == N.ndarray: 3929 frozen = N.ones(x.shape,N.float_) *-1 #start out w/ -1s, should replace all 3930 else: 3931 arrayflag = 0 3932 frozen = N.array([-1]) 3933 x = N.array([x]) 3934 mask = N.zeros(x.shape) 3935 bm = az = am = 1.0 3936 qab = a+b 3937 qap = a+1.0 3938 qam = a-1.0 3939 bz = 1.0-qab*x/qap 3940 for i in range(ITMAX+1): 3941 if N.sum(N.ravel(N.equal(frozen,-1)))==0: 3942 break 3943 em = float(i+1) 3944 tem = em + em 3945 d = em*(b-em)*x/((qam+tem)*(a+tem)) 3946 ap = az + d*am 3947 bp = bz+d*bm 3948 d = -(a+em)*(qab+em)*x/((qap+tem)*(a+tem)) 3949 app = ap+d*az 3950 bpp = bp+d*bz 3951 aold = az*1 3952 am = ap/bpp 3953 bm = bp/bpp 3954 az = app/bpp 3955 bz = 1.0 3956 newmask = N.less(abs(az-aold),EPS*abs(az)) 3957 frozen = N.where(newmask*N.equal(mask,0), az, frozen) 3958 mask = N.clip(mask+newmask,0,1) 3959 noconverge = asum(N.equal(frozen,-1)) 3960 if noconverge <> 0 and verbose: 3961 print 'a or b too big, or ITMAX too small in Betacf for ',noconverge,' elements' 3962 if arrayflag: 3963 return frozen 3964 else: 3965 return frozen[0]
3966 3967
3968 - def agammln(xx):
3969 """ 3970 Returns the natural log of the gamma function of xx. 3971 Gamma(z) = Integral(0,infinity) of t^(z-1)exp(-t) dt. 3972 Adapted from: Numerical Recipes in C. Can handle multiple dims ... but 3973 probably doesn't normally have to. 3974 3975 Usage: agammln(xx) 3976 """ 3977 coeff = [76.18009173, -86.50532033, 24.01409822, -1.231739516, 3978 0.120858003e-2, -0.536382e-5] 3979 x = xx - 1.0 3980 tmp = x + 5.5 3981 tmp = tmp - (x+0.5)*N.log(tmp) 3982 ser = 1.0 3983 for j in range(len(coeff)): 3984 x = x + 1 3985 ser = ser + coeff[j]/x 3986 return -tmp + N.log(2.50662827465*ser)
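A scalar Python 3 transcription of the same 6-term Lanczos series agrees with the stdlib's `math.lgamma` (editorial sketch; accuracy is best for xx >= 1):

```python
import math

def gammln(xx):
    """log Gamma(xx) via the 6-term Lanczos series used above (xx > 0)."""
    coeff = [76.18009173, -86.50532033, 24.01409822,
             -1.231739516, 0.120858003e-2, -0.536382e-5]
    x = xx - 1.0
    tmp = x + 5.5
    tmp -= (x + 0.5) * math.log(tmp)
    ser = 1.0
    for c in coeff:
        x += 1.0
        ser += c / x
    return -tmp + math.log(2.50662827465 * ser)
```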
3987 3988
3989 - def abetai(a,b,x,verbose=1):
3990 """ 3991 Returns the incomplete beta function: 3992 3993 I-sub-x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt) 3994 3995 where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma 3996 function of a. The continued fraction formulation is implemented 3997 here, using the betacf function. (Adapted from: Numerical Recipies in 3998 C.) Can handle multiple dimensions. 3999 4000 Usage: abetai(a,b,x,verbose=1) 4001 """ 4002 TINY = 1e-15 4003 if type(a) == N.ndarray: 4004 if asum(N.less(x,0)+N.greater(x,1)) <> 0: 4005 raise ValueError, 'Bad x in abetai' 4006 x = N.where(N.equal(x,0),TINY,x) 4007 x = N.where(N.equal(x,1.0),1-TINY,x) 4008 4009 bt = N.where(N.equal(x,0)+N.equal(x,1), 0, -1) 4010 exponents = ( gammln(a+b)-gammln(a)-gammln(b)+a*N.log(x)+b* 4011 N.log(1.0-x) ) 4012 # 746 (below) is the MAX POSSIBLE BEFORE OVERFLOW 4013 exponents = N.where(N.less(exponents,-740),-740,exponents) 4014 bt = N.exp(exponents) 4015 if type(x) == N.ndarray: 4016 ans = N.where(N.less(x,(a+1)/(a+b+2.0)), 4017 bt*abetacf(a,b,x,verbose)/float(a), 4018 1.0-bt*abetacf(b,a,1.0-x,verbose)/float(b)) 4019 else: 4020 if x<(a+1)/(a+b+2.0): 4021 ans = bt*abetacf(a,b,x,verbose)/float(a) 4022 else: 4023 ans = 1.0-bt*abetacf(b,a,1.0-x,verbose)/float(b) 4024 return ans
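The scalar core of abetai/abetacf translates to modern Python 3 almost line for line, using `math.lgamma` in place of gammln (editorial sketch of the Numerical Recipes scheme, not the module's array-capable code):

```python
import math

def betacf(a, b, x, itmax=200, eps=3.0e-7):
    """Continued fraction for the incomplete beta (scalar NR form)."""
    bm = az = am = 1.0
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    bz = 1.0 - qab * x / qap
    for i in range(itmax):
        em = i + 1.0
        tem = em + em
        d = em * (b - em) * x / ((qam + tem) * (a + tem))
        ap, bp = az + d * am, bz + d * bm
        d = -(a + em) * (qab + em) * x / ((qap + tem) * (a + tem))
        app, bpp = ap + d * az, bp + d * bz
        aold = az
        am, bm, az, bz = ap / bpp, bp / bpp, app / bpp, 1.0
        if abs(az - aold) < eps * abs(az):
            return az
    raise RuntimeError("a or b too big, or itmax too small in betacf")

def betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x < 0.0 or x > 1.0:
        raise ValueError("x must be in [0, 1]")
    if x in (0.0, 1.0):
        bt = 0.0
    else:
        bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                      + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * betacf(a, b, x) / a
    return 1.0 - bt * betacf(b, a, 1.0 - x) / b
```

With `betai` in hand, F and t p-values follow as in afprob and aglm above.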
4025 4026 4027 ##################################### 4028 ####### AANOVA CALCULATIONS ####### 4029 ##################################### 4030 4031 import LinearAlgebra, operator 4032 LA = LinearAlgebra 4033
4034 - def aglm(data,para):
4035 """ 4036 Calculates a linear model fit ... anova/ancova/lin-regress/t-test/etc. Taken 4037 from: 4038 Peterson et al. Statistical limitations in functional neuroimaging 4039 I. Non-inferential methods and statistical models. Phil Trans Royal Soc 4040 Lond B 354: 1239-1260. 4041 4042 Usage: aglm(data,para) 4043 Returns: statistic, p-value ??? 4044 """ 4045 if len(para) <> len(data): 4046 print "data and para must be same length in aglm" 4047 return 4048 n = len(para) 4049 p = pstat.aunique(para) 4050 x = N.zeros((n,len(p))) # design matrix 4051 for l in range(len(p)): 4052 x[:,l] = N.equal(para,p[l]) 4053 b = N.dot(N.dot(LA.inv(N.dot(N.transpose(x),x)), # i.e., b=inv(X'X)X'Y 4054 N.transpose(x)), 4055 data) 4056 diffs = (data - N.dot(x,b)) 4057 s_sq = 1./(n-len(p)) * N.dot(N.transpose(diffs), diffs) 4058 4059 if len(p) == 2: # ttest_ind 4060 c = N.array([1,-1]) 4061 df = n-2 4062 fact = asum(1.0/asum(x,0)) # i.e., 1/n1 + 1/n2 + 1/n3 ... 4063 t = N.dot(c,b) / N.sqrt(s_sq*fact) 4064 probs = abetai(0.5*df,0.5,float(df)/(df+t*t)) 4065 return t, probs
4066 4067
4068 - def aF_oneway(*args):
4069 """ 4070 Performs a 1-way ANOVA, returning an F-value and probability given 4071 any number of groups. From Heiman, pp.394-7. 4072 4073 Usage: aF_oneway (*args) where *args is 2 or more arrays, one per 4074 treatment group 4075 Returns: f-value, probability 4076 """ 4077 na = len(args) # ANOVA on 'na' groups, each in its own array 4078 means = [0]*na 4079 vars = [0]*na 4080 ns = [0]*na 4081 alldata = [] 4082 tmp = map(N.array,args) 4083 means = map(amean,tmp) 4084 vars = map(avar,tmp) 4085 ns = map(len,args) 4086 alldata = N.concatenate(args) 4087 bign = len(alldata) 4088 sstot = ass(alldata)-(asquare_of_sums(alldata)/float(bign)) 4089 ssbn = 0 4090 for a in args: 4091 ssbn = ssbn + asquare_of_sums(N.array(a))/float(len(a)) 4092 ssbn = ssbn - (asquare_of_sums(alldata)/float(bign)) 4093 sswn = sstot-ssbn 4094 dfbn = na-1 4095 dfwn = bign - na 4096 msb = ssbn/float(dfbn) 4097 msw = sswn/float(dfwn) 4098 f = msb/msw 4099 prob = fprob(dfbn,dfwn,f) 4100 return f, prob
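The sums-of-squares decomposition above can be sketched in self-contained Python 3 (illustrative name; returns only the F statistic, since the p-value needs the incomplete beta function):

```python
def f_oneway(*groups):
    """One-way ANOVA F from between/within sums of squares; sketch."""
    all_data = [v for g in groups for v in g]
    bign = len(all_data)
    grand = sum(all_data)
    sstot = sum(v * v for v in all_data) - grand * grand / bign
    ssbn = sum(sum(g) ** 2 / len(g) for g in groups) - grand * grand / bign
    sswn = sstot - ssbn                     # within = total - between
    dfbn = len(groups) - 1
    dfwn = bign - len(groups)
    return (ssbn / dfbn) / (sswn / dfwn)
```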
4101 4102
4103 - def aF_value (ER,EF,dfR,dfF):
4104 """ 4105 Returns an F-statistic given the following: 4106 ER = error associated with the null hypothesis (the Restricted model) 4107 EF = error associated with the alternate hypothesis (the Full model) 4108 dfR = degrees of freedom associated with the Restricted model 4109 dfF = degrees of freedom associated with the Full model 4110 """ 4111 return ((ER-EF)/float(dfR-dfF) / (EF/float(dfF)))
4112 4113
4114 - def outputfstats(Enum, Eden, dfnum, dfden, f, prob):
4115 Enum = round(Enum,3) 4116 Eden = round(Eden,3) 4117 dfnum = round(dfnum,3) 4118 dfden = round(dfden,3) 4119 f = round(f,3) 4120 prob = round(prob,3) 4121 suffix = '' # for *s after the p-value 4122 if prob < 0.001: suffix = ' ***' 4123 elif prob < 0.01: suffix = ' **' 4124 elif prob < 0.05: suffix = ' *' 4125 title = [['EF/ER','DF','Mean Square','F-value','prob','']] 4126 lofl = title+[[Enum, dfnum, round(Enum/float(dfnum),3), f, prob, suffix], 4127 [Eden, dfden, round(Eden/float(dfden),3),'','','']] 4128 pstat.printcc(lofl) 4129 return
4130 4131
4132 - def F_value_multivariate(ER, EF, dfnum, dfden):
4133 """ 4134 Returns an F-statistic given the following: 4135 ER = error associated with the null hypothesis (the Restricted model) 4136 EF = error associated with the alternate hypothesis (the Full model) 4137 dfnum = degrees of freedom for the numerator (dfR-dfF) 4138 dfden = degrees of freedom for the denominator (dfF) 4139 where ER and EF are matrices from a multivariate F calculation. 4140 """ 4141 if type(ER) in [IntType, FloatType]: 4142 ER = N.array([[ER]]) 4143 if type(EF) in [IntType, FloatType]: 4144 EF = N.array([[EF]]) 4145 n_um = (LA.det(ER) - LA.det(EF)) / float(dfnum) 4146 d_en = LA.det(EF) / float(dfden) 4147 return n_um / d_en
4148 4149 4150 ##################################### 4151 ####### ASUPPORT FUNCTIONS ######## 4152 ##################################### 4153
4154 - def asign(a):
4155 """ 4156 Usage: asign(a) 4157 Returns: array shape of a, with -1 where a<0, 0 where a==0, +1 where a>0 4158 """ 4159 if ((type(a) == type(1.4)) or (type(a) == type(1))): 4160 return a-a-N.less(a,0)+N.greater(a,0) 4161 else: 4162 a = N.asarray(a) 4163 return N.zeros(N.shape(a))-N.less(a,0)+N.greater(a,0)
4164 4165
4166 - def asum (a, dimension=None,keepdims=0):
4167 """ 4168 An alternative to the Numeric.add.reduce function, which allows one to 4169 (1) collapse over multiple dimensions at once, and/or (2) to retain 4170 all dimensions in the original array (squashing summed dimensions down 4171 to size 1). Dimension can equal None (ravel array first), an integer (the 4172 dimension over which to operate), or a sequence (operate over multiple 4173 dimensions). If keepdims=1, the resulting array will have as many 4174 dimensions as the input array. 4175 4176 Usage: asum(a, dimension=None, keepdims=0) 4177 Returns: array summed along 'dimension'(s), same _number_ of dims if keepdims=1 4178 """ 4179 if type(a) == N.ndarray and a.dtype in [N.int_, N.short, N.ubyte]: 4180 a = a.astype(N.float_) 4181 if dimension == None: 4182 s = N.sum(N.ravel(a)) 4183 elif type(dimension) in [IntType,FloatType]: 4184 s = N.add.reduce(a, dimension) 4185 if keepdims == 1: 4186 shp = list(a.shape) 4187 shp[dimension] = 1 4188 s = N.reshape(s,shp) 4189 else: # must be a SEQUENCE of dims to sum over 4190 dims = list(dimension) 4191 dims.sort() 4192 dims.reverse() 4193 s = a *1.0 4194 for dim in dims: 4195 s = N.add.reduce(s,dim) 4196 if keepdims == 1: 4197 shp = list(a.shape) 4198 for dim in dims: 4199 shp[dim] = 1 4200 s = N.reshape(s,shp) 4201 return s
4202 4203
4204 - def acumsum (a,dimension=None):
4205 """ 4206 Returns an array consisting of the cumulative sum of the items in the 4207 passed array. Dimension can equal None (ravel array first), an 4208 integer (the dimension over which to operate), or a sequence (operate 4209 over multiple dimensions, but this last one just barely makes sense). 4210 4211 Usage: acumsum(a,dimension=None) 4212 """ 4213 if dimension == None: 4214 a = N.ravel(a) 4215 dimension = 0 4216 if type(dimension) in [ListType, TupleType, N.ndarray]: 4217 dimension = list(dimension) 4218 dimension.sort() 4219 dimension.reverse() 4220 for d in dimension: 4221 a = N.add.accumulate(a,d) 4222 return a 4223 else: 4224 return N.add.accumulate(a,dimension)
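For the flat (dimension=None) case, the same result can be sketched in Python 3 with `itertools.accumulate`; the array version above generalizes this to one or more chosen dimensions:

```python
from itertools import accumulate

def cumsum(seq):
    """Cumulative sum of a flat sequence (flat-case sketch of acumsum)."""
    return list(accumulate(seq))
```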
4225 4226
4227 - def ass(inarray, dimension=None, keepdims=0):
4228 """ 4229 Squares each value in the passed array, adds these squares & returns 4230 the result. Unfortunate function name. :-) Defaults to ALL values in 4231 the array. Dimension can equal None (ravel array first), an integer 4232 (the dimension over which to operate), or a sequence (operate over 4233 multiple dimensions). Set keepdims=1 to maintain the original number 4234 of dimensions. 4235 4236 Usage: ass(inarray, dimension=None, keepdims=0) 4237 Returns: sum-along-'dimension' for (inarray*inarray) 4238 """ 4239 if dimension == None: 4240 inarray = N.ravel(inarray) 4241 dimension = 0 4242 return asum(inarray*inarray,dimension,keepdims)
4243 4244
4245 - def asummult (array1,array2,dimension=None,keepdims=0):
4246 """ 4247 Multiplies elements in array1 and array2, element by element, and 4248 returns the sum (along 'dimension') of all resulting multiplications. 4249 Dimension can equal None (ravel array first), an integer (the 4250 dimension over which to operate), or a sequence (operate over multiple 4251 dimensions). A trivial function, but included for completeness. 4252 4253 Usage: asummult(array1,array2,dimension=None,keepdims=0) 4254 """ 4255 if dimension == None: 4256 array1 = N.ravel(array1) 4257 array2 = N.ravel(array2) 4258 dimension = 0 4259 return asum(array1*array2,dimension,keepdims)
4260 4261
4262 - def asquare_of_sums(inarray, dimension=None, keepdims=0):
4263 """ 4264 Adds the values in the passed array, squares that sum, and returns the 4265 result. Dimension can equal None (ravel array first), an integer (the 4266 dimension over which to operate), or a sequence (operate over multiple 4267 dimensions). If keepdims=1, the returned array will have the same 4268 NUMBER of dimensions as the original. 4269 4270 Usage: asquare_of_sums(inarray, dimension=None, keepdims=0) 4271 Returns: the square of the sum over dim(s) in dimension 4272 """ 4273 if dimension == None: 4274 inarray = N.ravel(inarray) 4275 dimension = 0 4276 s = asum(inarray,dimension,keepdims) 4277 if type(s) == N.ndarray: 4278 return s.astype(N.float_)*s 4279 else: 4280 return float(s)*s
4281 4282
4283 - def asumdiffsquared(a,b, dimension=None, keepdims=0):
4284 """ 4285 Takes pairwise differences of the values in arrays a and b, squares 4286 these differences, and returns the sum of these squares. Dimension 4287 can equal None (ravel array first), an integer (the dimension over 4288 which to operate), or a sequence (operate over multiple dimensions). 4289 If keepdims=1, the return keeps the same NUMBER of dimensions as a and b. 4290 4291 Usage: asumdiffsquared(a,b) 4292 Returns: sum[ravel(a-b)**2] 4293 """ 4294 if dimension == None: 4295 a,b = N.ravel(a),N.ravel(b) 4296 dimension = 0 4297 return asum((a-b)**2,dimension,keepdims)
4298 4299
4300 - def ashellsort(inarray):
4301 """ 4302 Shellsort algorithm. Sorts a 1D-array. 4303 4304 Usage: ashellsort(inarray) 4305 Returns: sorted-inarray, sorting-index-vector (for original array) 4306 """ 4307 n = len(inarray) 4308 svec = inarray *1.0 4309 ivec = range(n) 4310 gap = n/2 # integer division needed 4311 while gap >0: 4312 for i in range(gap,n): 4313 for j in range(i-gap,-1,-gap): 4314 while j>=0 and svec[j]>svec[j+gap]: 4315 temp = svec[j] 4316 svec[j] = svec[j+gap] 4317 svec[j+gap] = temp 4318 itemp = ivec[j] 4319 ivec[j] = ivec[j+gap] 4320 ivec[j+gap] = itemp 4321 gap = gap / 2 # integer division needed 4322 # svec is now sorted input vector, ivec has the order svec[i] = vec[ivec[i]] 4323 return svec, ivec
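A Python 3 sketch of the same gap-sequence sort, returning the sorted copy together with the order vector so that `svec[i] == seq[ivec[i]]` (editorial illustration, not the module's code):

```python
def shellsort(seq):
    """Shellsort a flat sequence; return (sorted copy, sorting indices)."""
    svec = list(seq)
    ivec = list(range(len(svec)))
    gap = len(svec) // 2
    while gap > 0:
        for i in range(gap, len(svec)):
            j = i - gap
            # Bubble the element at i back through its gap-spaced chain.
            while j >= 0 and svec[j] > svec[j + gap]:
                svec[j], svec[j + gap] = svec[j + gap], svec[j]
                ivec[j], ivec[j + gap] = ivec[j + gap], ivec[j]
                j -= gap
        gap //= 2
    return svec, ivec
```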
4324 4325
4326 - def arankdata(inarray):
4327 """ 4328 Ranks the data in inarray, dealing with ties appropriately. Assumes 4329 a 1D inarray. Adapted from Gary Perlman's |Stat ranksort. 4330 4331 Usage: arankdata(inarray) 4332 Returns: array of length equal to inarray, containing rank scores 4333 """ 4334 n = len(inarray) 4335 svec, ivec = ashellsort(inarray) 4336 sumranks = 0 4337 dupcount = 0 4338 newarray = N.zeros(n,N.float_) 4339 for i in range(n): 4340 sumranks = sumranks + i 4341 dupcount = dupcount + 1 4342 if i==n-1 or svec[i] <> svec[i+1]: 4343 averank = sumranks / float(dupcount) + 1 4344 for j in range(i-dupcount+1,i+1): 4345 newarray[ivec[j]] = averank 4346 sumranks = 0 4347 dupcount = 0 4348 return newarray
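The tie-averaging logic can be sketched directly in Python 3 without the shellsort helper (illustrative name; runs of equal values share the mean of the ranks they occupy):

```python
def rankdata(seq):
    """1-based average ranks with ties shared, as arankdata does above."""
    n = len(seq)
    order = sorted(range(n), key=lambda i: seq[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and seq[order[j + 1]] == seq[order[i]]:
            j += 1                           # extend the run of ties
        avg = (i + j) / 2.0 + 1.0            # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks
```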
4349 4350
4351 - def afindwithin(data):
4352 """ 4353 Returns a binary vector, 1=within-subject factor, 0=between. Input 4354 equals the entire data array (i.e., column 1=random factor, last 4355 column = measured values). 4356 4357 Usage: afindwithin(data) data in |Stat format 4358 """ 4359 numfact = len(data[0])-2 4360 withinvec = [0]*numfact 4361 for col in range(1,numfact+1): 4362 rows = pstat.linexand(data,col,pstat.unique(pstat.colex(data,1))[0]) # get 1 level of this factor 4363 if len(pstat.unique(pstat.colex(rows,0))) < len(rows): # if fewer subjects than scores on this factor 4364 withinvec[col-1] = 1 4365 return withinvec
4366 4367 4368 ######################################################### 4369 ######################################################### 4370 ###### RE-DEFINE DISPATCHES TO INCLUDE ARRAYS ######### 4371 ######################################################### 4372 ######################################################### 4373 4374 ## CENTRAL TENDENCY: 4375 geometricmean = Dispatch ( (lgeometricmean, (ListType, TupleType)), 4376 (ageometricmean, (N.ndarray,)) ) 4377 harmonicmean = Dispatch ( (lharmonicmean, (ListType, TupleType)), 4378 (aharmonicmean, (N.ndarray,)) ) 4379 mean = Dispatch ( (lmean, (ListType, TupleType)), 4380 (amean, (N.ndarray,)) ) 4381 median = Dispatch ( (lmedian, (ListType, TupleType)), 4382 (amedian, (N.ndarray,)) ) 4383 medianscore = Dispatch ( (lmedianscore, (ListType, TupleType)), 4384 (amedianscore, (N.ndarray,)) ) 4385 mode = Dispatch ( (lmode, (ListType, TupleType)), 4386 (amode, (N.ndarray,)) ) 4387 tmean = Dispatch ( (atmean, (N.ndarray,)) ) 4388 tvar = Dispatch ( (atvar, (N.ndarray,)) ) 4389 tstdev = Dispatch ( (atstdev, (N.ndarray,)) ) 4390 tsem = Dispatch ( (atsem, (N.ndarray,)) ) 4391 4392 ## VARIATION: 4393 moment = Dispatch ( (lmoment, (ListType, TupleType)), 4394 (amoment, (N.ndarray,)) ) 4395 variation = Dispatch ( (lvariation, (ListType, TupleType)), 4396 (avariation, (N.ndarray,)) ) 4397 skew = Dispatch ( (lskew, (ListType, TupleType)), 4398 (askew, (N.ndarray,)) ) 4399 kurtosis = Dispatch ( (lkurtosis, (ListType, TupleType)), 4400 (akurtosis, (N.ndarray,)) ) 4401 describe = Dispatch ( (ldescribe, (ListType, TupleType)), 4402 (adescribe, (N.ndarray,)) ) 4403 4404 ## DISTRIBUTION TESTS 4405 4406 skewtest = Dispatch ( (askewtest, (ListType, TupleType)), 4407 (askewtest, (N.ndarray,)) ) 4408 kurtosistest = Dispatch ( (akurtosistest, (ListType, TupleType)), 4409 (akurtosistest, (N.ndarray,)) ) 4410 normaltest = Dispatch ( (anormaltest, (ListType, TupleType)), 4411 (anormaltest, (N.ndarray,)) ) 4412 4413 ## FREQUENCY STATS: 4414 
itemfreq = Dispatch ( (litemfreq, (ListType, TupleType)), 4415 (aitemfreq, (N.ndarray,)) ) 4416 scoreatpercentile = Dispatch ( (lscoreatpercentile, (ListType, TupleType)), 4417 (ascoreatpercentile, (N.ndarray,)) ) 4418 percentileofscore = Dispatch ( (lpercentileofscore, (ListType, TupleType)), 4419 (apercentileofscore, (N.ndarray,)) ) 4420 histogram = Dispatch ( (lhistogram, (ListType, TupleType)), 4421 (ahistogram, (N.ndarray,)) ) 4422 cumfreq = Dispatch ( (lcumfreq, (ListType, TupleType)), 4423 (acumfreq, (N.ndarray,)) ) 4424 relfreq = Dispatch ( (lrelfreq, (ListType, TupleType)), 4425 (arelfreq, (N.ndarray,)) ) 4426 4427 ## VARIABILITY: 4428 obrientransform = Dispatch ( (lobrientransform, (ListType, TupleType)), 4429 (aobrientransform, (N.ndarray,)) ) 4430 samplevar = Dispatch ( (lsamplevar, (ListType, TupleType)), 4431 (asamplevar, (N.ndarray,)) ) 4432 samplestdev = Dispatch ( (lsamplestdev, (ListType, TupleType)), 4433 (asamplestdev, (N.ndarray,)) ) 4434 signaltonoise = Dispatch( (asignaltonoise, (N.ndarray,)),) 4435 var = Dispatch ( (lvar, (ListType, TupleType)), 4436 (avar, (N.ndarray,)) ) 4437 stdev = Dispatch ( (lstdev, (ListType, TupleType)), 4438 (astdev, (N.ndarray,)) ) 4439 sterr = Dispatch ( (lsterr, (ListType, TupleType)), 4440 (asterr, (N.ndarray,)) ) 4441 sem = Dispatch ( (lsem, (ListType, TupleType)), 4442 (asem, (N.ndarray,)) ) 4443 z = Dispatch ( (lz, (ListType, TupleType)), 4444 (az, (N.ndarray,)) ) 4445 zs = Dispatch ( (lzs, (ListType, TupleType)), 4446 (azs, (N.ndarray,)) ) 4447 4448 ## TRIMMING FCNS: 4449 threshold = Dispatch( (athreshold, (N.ndarray,)),) 4450 trimboth = Dispatch ( (ltrimboth, (ListType, TupleType)), 4451 (atrimboth, (N.ndarray,)) ) 4452 trim1 = Dispatch ( (ltrim1, (ListType, TupleType)), 4453 (atrim1, (N.ndarray,)) ) 4454 4455 ## CORRELATION FCNS: 4456 paired = Dispatch ( (lpaired, (ListType, TupleType)), 4457 (apaired, (N.ndarray,)) ) 4458 lincc = Dispatch ( (llincc, (ListType, TupleType)), 4459 (alincc, (N.ndarray,)) ) 
4460 pearsonr = Dispatch ( (lpearsonr, (ListType, TupleType)), 4461 (apearsonr, (N.ndarray,)) ) 4462 spearmanr = Dispatch ( (lspearmanr, (ListType, TupleType)), 4463 (aspearmanr, (N.ndarray,)) ) 4464 pointbiserialr = Dispatch ( (lpointbiserialr, (ListType, TupleType)), 4465 (apointbiserialr, (N.ndarray,)) ) 4466 kendalltau = Dispatch ( (lkendalltau, (ListType, TupleType)), 4467 (akendalltau, (N.ndarray,)) ) 4468 linregress = Dispatch ( (llinregress, (ListType, TupleType)), 4469 (alinregress, (N.ndarray,)) ) 4470 4471 ## INFERENTIAL STATS: 4472 ttest_1samp = Dispatch ( (lttest_1samp, (ListType, TupleType)), 4473 (attest_1samp, (N.ndarray,)) ) 4474 ttest_ind = Dispatch ( (lttest_ind, (ListType, TupleType)), 4475 (attest_ind, (N.ndarray,)) ) 4476 ttest_rel = Dispatch ( (lttest_rel, (ListType, TupleType)), 4477 (attest_rel, (N.ndarray,)) ) 4478 chisquare = Dispatch ( (lchisquare, (ListType, TupleType)), 4479 (achisquare, (N.ndarray,)) ) 4480 ks_2samp = Dispatch ( (lks_2samp, (ListType, TupleType)), 4481 (aks_2samp, (N.ndarray,)) ) 4482 mannwhitneyu = Dispatch ( (lmannwhitneyu, (ListType, TupleType)), 4483 (amannwhitneyu, (N.ndarray,)) ) 4484 tiecorrect = Dispatch ( (ltiecorrect, (ListType, TupleType)), 4485 (atiecorrect, (N.ndarray,)) ) 4486 ranksums = Dispatch ( (lranksums, (ListType, TupleType)), 4487 (aranksums, (N.ndarray,)) ) 4488 wilcoxont = Dispatch ( (lwilcoxont, (ListType, TupleType)), 4489 (awilcoxont, (N.ndarray,)) ) 4490 kruskalwallish = Dispatch ( (lkruskalwallish, (ListType, TupleType)), 4491 (akruskalwallish, (N.ndarray,)) ) 4492 friedmanchisquare = Dispatch ( (lfriedmanchisquare, (ListType, TupleType)), 4493 (afriedmanchisquare, (N.ndarray,)) ) 4494 4495 ## PROBABILITY CALCS: 4496 chisqprob = Dispatch ( (lchisqprob, (IntType, FloatType)), 4497 (achisqprob, (N.ndarray,)) ) 4498 zprob = Dispatch ( (lzprob, (IntType, FloatType)), 4499 (azprob, (N.ndarray,)) ) 4500 ksprob = Dispatch ( (lksprob, (IntType, FloatType)), 4501 (aksprob, (N.ndarray,)) ) 4502 
fprob = Dispatch ( (lfprob, (IntType, FloatType)), 4503 (afprob, (N.ndarray,)) ) 4504 betacf = Dispatch ( (lbetacf, (IntType, FloatType)), 4505 (abetacf, (N.ndarray,)) ) 4506 betai = Dispatch ( (lbetai, (IntType, FloatType)), 4507 (abetai, (N.ndarray,)) ) 4508 erfcc = Dispatch ( (lerfcc, (IntType, FloatType)), 4509 (aerfcc, (N.ndarray,)) ) 4510 gammln = Dispatch ( (lgammln, (IntType, FloatType)), 4511 (agammln, (N.ndarray,)) ) 4512 4513 ## ANOVA FUNCTIONS: 4514 F_oneway = Dispatch ( (lF_oneway, (ListType, TupleType)), 4515 (aF_oneway, (N.ndarray,)) ) 4516 F_value = Dispatch ( (lF_value, (ListType, TupleType)), 4517 (aF_value, (N.ndarray,)) ) 4518 4519 ## SUPPORT FUNCTIONS: 4520 incr = Dispatch ( (lincr, (ListType, TupleType, N.ndarray)), ) 4521 sum = Dispatch ( (lsum, (ListType, TupleType)), 4522 (asum, (N.ndarray,)) ) 4523 cumsum = Dispatch ( (lcumsum, (ListType, TupleType)), 4524 (acumsum, (N.ndarray,)) ) 4525 ss = Dispatch ( (lss, (ListType, TupleType)), 4526 (ass, (N.ndarray,)) ) 4527 summult = Dispatch ( (lsummult, (ListType, TupleType)), 4528 (asummult, (N.ndarray,)) ) 4529 square_of_sums = Dispatch ( (lsquare_of_sums, (ListType, TupleType)), 4530 (asquare_of_sums, (N.ndarray,)) ) 4531 sumdiffsquared = Dispatch ( (lsumdiffsquared, (ListType, TupleType)), 4532 (asumdiffsquared, (N.ndarray,)) ) 4533 shellsort = Dispatch ( (lshellsort, (ListType, TupleType)), 4534 (ashellsort, (N.ndarray,)) ) 4535 rankdata = Dispatch ( (lrankdata, (ListType, TupleType)), 4536 (arankdata, (N.ndarray,)) ) 4537 findwithin = Dispatch ( (lfindwithin, (ListType, TupleType)), 4538 (afindwithin, (N.ndarray,)) ) 4539 4540 ###################### END OF NUMERIC FUNCTION BLOCK ##################### 4541 4542 ###################### END OF STATISTICAL FUNCTIONS ###################### 4543 4544 except ImportError: 4545 pass 4546