This package contains tools for nonlinear least-squares curve fitting of data. In general a fit has four inputs:
- The dependent data y that is to be fit — typically y is a Python dictionary in an lsqfit analysis. Its values y[k] are either gvar.GVars or arrays (any shape or dimension) of gvar.GVars that specify the values of the dependent variables and their errors.
- A collection x of independent data — x can have any structure and contain any data (or no data).
- A fit function f(x, p) whose parameters p are adjusted by the fit until f(x, p) equals y to within y's errors — parameters p are usually specified by a dictionary whose values p[k] are individual parameters or (numpy) arrays of parameters. The fit function is assumed independent of x (that is, f(p)) if x = False (or if x is omitted from the input data).
- Initial estimates or priors for each parameter in p — priors are usually specified using a dictionary prior whose values prior[k] are gvar.GVars or arrays of gvar.GVars that give initial estimates (values and errors) for parameters p[k].
A typical code sequence has the structure:
... collect x, y, prior ...
def f(x, p):
    ... compute fit to y[k], for all k in y, using x, p ...
    ... return dictionary containing the fit values for the y[k]s ...
fit = lsqfit.nonlinear_fit(data=(x, y), prior=prior, fcn=f)
print(fit) # variable fit is of type nonlinear_fit
The parameters p[k] are varied until the chi**2 for the fit is minimized.
The best-fit values for the parameters are recovered after fitting using, for example, p=fit.p. Then the p[k] are gvar.GVars or arrays of gvar.GVars that give best-fit estimates and fit uncertainties in those estimates. The print(fit) statement prints a summary of the fit results.
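For uncorrelated data and priors, the chi**2 being minimized reduces to a sum of squared, error-weighted residuals plus analogous terms for the priors. A minimal sketch of that objective in plain Python (illustrative only — lsqfit handles correlated data through the full covariance matrix):

```python
def chi2(p, x, y, yerr, prior_mean, prior_err, fcn):
    # data term: squared residuals weighted by the y errors
    total = sum(((yi - fi) / si) ** 2
                for yi, si, fi in zip(y, yerr, fcn(x, p)))
    # prior term: each parameter is also "data" with a mean and a width
    total += sum(((pk - mk) / sk) ** 2
                 for pk, mk, sk in zip(p, prior_mean, prior_err))
    return total

# illustrative straight-line fit function f(x, p) = p[0] + p[1] * x
def f(x, p):
    return [p[0] + p[1] * xi for xi in x]

print(chi2([0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.1, 0.1],
           [0.0, 1.0], [1.0, 1.0], f))   # exact fit: chi**2 = 0
```

The fitter varies p to drive this quantity to its minimum.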
The dependent variable y above could be an array instead of a dictionary, which is less flexible in general but possibly more convenient in simpler fits. Then the approximate y returned by fit function f(x, p) must be an array with the same shape as the dependent variable. The prior prior could also be represented by an array instead of a dictionary.
The lsqfit tutorial contains extended explanations and examples. The first appendix in the paper at http://arxiv.org/abs/arXiv:1406.2279 provides conceptual background on the techniques used in this module for fits and, especially, error budgets.
Nonlinear least-squares fit.
lsqfit.nonlinear_fit fits a (nonlinear) function f(x, p) to data y by varying parameters p, and stores the results: for example,
fit = nonlinear_fit(data=(x, y), fcn=f, prior=prior) # do fit
print(fit) # print fit results
The best-fit values for the parameters are in fit.p, while the chi**2, the number of degrees of freedom, the logarithm of the Gaussian Bayes Factor, the number of iterations, and the cpu time needed for the fit are in fit.chi2, fit.dof, fit.logGBF, fit.nit, and fit.time, respectively. Results for individual parameters in fit.p are of type gvar.GVar, and therefore carry information about errors and correlations with other parameters. The fit data and prior can be recovered using fit.x (equals False if there is no x), fit.y, and fit.prior; the data and prior are corrected for the svd cut, if there is one (that is, their covariance matrices have been modified in accordance with the svd cut).
The results from the fit are accessed through the following attributes (of fit where fit = nonlinear_fit(...)):
The minimum chi**2 for the fit. fit.chi2 / fit.dof is usually of order one in good fits; values much less than one suggest that the actual standard deviations in the input data and/or priors are smaller than the standard deviations used in the fit.
Covariance matrix of the best-fit parameters from the fit.
Number of degrees of freedom in the fit, which equals the number of pieces of data being fit when priors are specified for the fit parameters. Without priors, it is the number of pieces of data minus the number of fit parameters.
The logarithm of the probability (density) of obtaining the fit data by randomly sampling the parameter model (priors plus fit function) used in the fit. This quantity is useful for comparing fits of the same data to different models, with different priors and/or fit functions. The model with the largest value of fit.logGBF is the one preferred by the data. The exponential of the difference in fit.logGBF between two models is the ratio of probabilities (Bayes factor) for those models. Differences in fit.logGBF smaller than 1 are not very significant. Gaussian statistics are assumed when computing fit.logGBF.
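For example, given two hypothetical fits of the same data (the logGBF values below are made up purely for illustration):

```python
import math

# hypothetical logGBF values from two fits of the same data
logGBF_A = -12.3   # model A
logGBF_B = -14.8   # model B

# ratio of probabilities (Bayes factor) favoring model A
bayes_factor = math.exp(logGBF_A - logGBF_B)
print(bayes_factor)   # about 12.2, so model A is preferred
```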
Best-fit parameters from fit. Depending upon what was used for the prior (or p0), it is either: a dictionary (gvar.BufferDict) of gvar.GVars and/or arrays of gvar.GVars; or an array (numpy.ndarray) of gvar.GVars. fit.p represents a multi-dimensional Gaussian distribution which, in Bayesian terminology, is the posterior probability distribution of the fit parameters.
Means of the best-fit parameters from fit (dictionary or array).
Standard deviations of the best-fit parameters from fit (dictionary or array).
Same as fit.p except that the errors are computed directly from fit.cov. This is faster but means that no information about correlations with the input data is retained (unlike in fit.p); and, therefore, fit.palt cannot be used to generate error budgets. fit.p and fit.palt give the same means and normally give the same errors for each parameter. They differ only when the input data’s covariance matrix is too singular to invert accurately (because of roundoff error), in which case an SVD cut is advisable.
Same as fit.p but augmented to include the transforms of any log-normal or other parameter implemented using decorator lsqfit.transform_p. In the case of a log-normal variable fit.p['logXX'], for example, fit.transformed_p['XX'] is defined equal to exp(fit.p['logXX']).
The parameter values used to start the fit.
The probability that the chi**2 from the fit could have been larger, by chance, assuming the best-fit model is correct. Good fits have Q values larger than 0.1 or so. Also called the p-value of the fit.
The sum of all SVD corrections, if any, added to the fit data y or the prior prior.
The number of eigenmodes modified (and/or deleted) by the SVD cut.
A dictionary where nblocks[s] equals the number of block-diagonal sub-matrices of the y–prior covariance matrix that are size s-by-s. This is sometimes useful for debugging.
CPU time (in secs) taken by fit.
The input parameters to the fit can be accessed as attributes. Note in particular attributes:
Prior used in the fit. This may differ from the input prior if an SVD cut is used. It is either a dictionary (gvar.BufferDict) or an array (numpy.ndarray), depending upon the input. Equals None if no prior was specified.
The first field in the input data. This is sometimes the independent variable (as in ‘y vs x’ plot), but may be anything. It is set equal to False if the x field is omitted from the input data. (This also means that the fit function has no x argument: so f(p) rather than f(x,p).)
Fit data used in the fit. This may differ from the input data if an SVD cut is used. It is either a dictionary (gvar.BufferDict) or an array (numpy.ndarray), depending upon the input.
Additional methods are provided for printing out detailed information about the fit, testing fits with simulated data, doing bootstrap analyses of the fit errors, dumping (for later use) and loading parameter values, and checking for roundoff errors in the final error estimates:
Formats fit output details into a string for printing.
The output tabulates the chi**2 per degree of freedom of the fit (chi2/dof), the number of degrees of freedom, the logarithm of the Gaussian Bayes Factor for the fit (logGBF), and the number of fit-algorithm iterations needed by the fit. Optionally, it will also list the best-fit values for the fit parameters together with the prior for each (in [] on each line). It can also list all of the data and the corresponding values from the fit. At the end it lists the SVD cut, the number of eigenmodes modified by the SVD cut, the relative and absolute tolerances used in the fit, and the time in seconds needed to do the fit.
Returns: String containing detailed information about fit.
Tabulate error budget for outputs[ko] due to inputs[ki].
For each output outputs[ko], fmt_errorbudget computes the contributions to outputs[ko]'s standard deviation coming from the gvar.GVars collected in inputs[ki]. This is done for each key combination (ko,ki) and the results are tabulated with columns and rows labeled by ko and ki, respectively. If a gvar.GVar in inputs[ki] is correlated with other gvar.GVars, the contribution from the others is included in the ki contribution as well (since contributions from correlated gvar.GVars cannot be resolved). The table is returned as a string.
Returns: A table (str) containing the error budget. Output variables are labeled by the keys in outputs (columns); sources of uncertainty are labeled by the keys in inputs (rows).
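The idea behind an error budget can be sketched without lsqfit: for an output that depends approximately linearly on independent inputs, each input's contribution to the output's variance is the squared partial derivative times the input's variance. The numbers below are hypothetical; fmt_errorbudget itself derives the contributions from gvar.GVar correlations:

```python
import math

# hypothetical output g = 2*a + 3*b with independent inputs a, b
a_sdev, b_sdev = 0.10, 0.05   # input standard deviations
dg_da, dg_db = 2.0, 3.0       # partial derivatives of g

contrib_a = (dg_da * a_sdev) ** 2     # variance contribution from a
contrib_b = (dg_db * b_sdev) ** 2     # variance contribution from b
total_sdev = math.sqrt(contrib_a + contrib_b)

# per-source standard deviations, as rows of an error-budget table
print(math.sqrt(contrib_a), math.sqrt(contrib_b), total_sdev)
```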
Tabulate gvar.GVars in outputs.
Returns: A table (str) containing values and standard deviations for variables in outputs, labeled by the keys in outputs.
Iterator that returns simulation copies of a fit.
Fit reliability can be tested using simulated data which replaces the mean values in self.y with random numbers drawn from a distribution whose mean equals self.fcn(pexact) and whose covariance matrix is the same as self.y's. Simulated data is very similar to the original fit data, self.y, but corresponds to a world where the correct values for the parameters (i.e., averaged over many simulated data sets) are given by pexact. pexact is usually taken equal to fit.pmean.
Each iteration of the iterator creates new simulated data, with different random numbers, and fits it, returning the lsqfit.nonlinear_fit that results. The simulated data has the same covariance matrix as fit.y. Typical usage is:
...
fit = nonlinear_fit(...)
...
for sfit in fit.simulated_fit_iter(n=3):
    ... verify that sfit.p agrees with pexact=fit.pmean within errors ...
Only a few iterations are needed to get a sense of the fit’s reliability since we know the correct answer in each case. The simulated fit’s output results should agree with pexact (=fit.pmean here) within the simulated fit’s errors.
Simulated fits can also be used to estimate biases in the fit’s output parameters or functions of them, should non-Gaussian behavior arise. This is possible, again, because we know the correct value for every parameter before we do the fit. Again only a few iterations may be needed for reliable estimates.
The (possibly non-Gaussian) probability distributions for parameters, or functions of them, can be explored in more detail by setting option bootstrap=True and collecting results from a large number of simulated fits. With bootstrap=True, the means of the priors are also varied from fit to fit, as in a bootstrap simulation; the new prior means are chosen at random from the prior distribution. Variations in the best-fit parameters (or functions of them) from fit to fit define the probability distributions for those quantities. For example, one would use the following code to analyze the distribution of function g(p) of the fit parameters:
fit = nonlinear_fit(...)
...
glist = []
for sfit in fit.simulated_fit_iter(n=100, bootstrap=True):
    glist.append(g(sfit.pmean))
... analyze samples glist[i] from g(p) distribution ...
This code generates n=100 samples glist[i] from the probability distribution of g(p). If everything is Gaussian, the mean and standard deviation of glist[i] should agree with g(fit.p).mean and g(fit.p).sdev.
The only difference between simulated fits with bootstrap=True and bootstrap=False (the default) is that the prior means are varied. It is essential that they be varied in a bootstrap analysis since one wants to capture the impact of the priors on the final distributions, but it is not necessary and probably not desirable when simply testing a fit’s reliability.
Returns: An iterator that returns lsqfit.nonlinear_fits for different simulated data.
Note that additional keywords can be added to overwrite keyword arguments in lsqfit.nonlinear_fit.
Iterator that returns bootstrap copies of a fit.
A bootstrap analysis involves three steps: 1) make a large number of “bootstrap copies” of the original input data and prior that differ from each other by random amounts characteristic of the underlying randomness in the original data; 2) repeat the entire fit analysis for each bootstrap copy of the data, extracting fit results from each; and 3) use the variation of the fit results from bootstrap copy to bootstrap copy to determine an approximate probability distribution (possibly non-gaussian) for the fit parameters and/or functions of them: the results from each bootstrap fit are samples from that distribution.
Bootstrap copies of the data for step 2 are provided in datalist. If datalist is None, they are generated instead from the means and covariance matrix of the fit data (assuming gaussian statistics). The maximum number of bootstrap copies considered is specified by n (None implies no limit).
Variations in the best-fit parameters (or functions of them) from bootstrap fit to bootstrap fit define the probability distributions for those quantities. For example, one could use the following code to analyze the distribution of function g(p) of the fit parameters:
fit = nonlinear_fit(...)
...
glist = []
for sfit in fit.bootstrapped_fit_iter(n=100, datalist=datalist, bootstrap=True):
    glist.append(g(sfit.pmean))
... analyze samples glist[i] from g(p) distribution ...
This code generates n=100 samples glist[i] from the probability distribution of g(p). If everything is Gaussian, the mean and standard deviation of glist[i] should agree with g(fit.p).mean and g(fit.p).sdev.
Returns: Iterator that returns an lsqfit.nonlinear_fit object containing results from the fit to the next data set in datalist.
Dump parameter values (fit.p) into file filename.
fit.dump_p(filename) saves the best-fit parameter values (fit.p) from a nonlinear_fit called fit. These values are recovered using p = nonlinear_fit.load_parameters(filename) where p's layout is the same as that of fit.p.
Dump parameter means (fit.pmean) into file filename.
fit.dump_pmean(filename) saves the means of the best-fit parameter values (fit.pmean) from a nonlinear_fit called fit. These values are recovered using p0 = nonlinear_fit.load_parameters(filename) where p0's layout is the same as fit.pmean. The saved values can be used to initialize a later fit (nonlinear_fit parameter p0).
Load parameters stored in file filename.
p = nonlinear_fit.load_p(filename) is used to recover the values of fit parameters dumped using fit.dump_p(filename) (or fit.dump_pmean(filename)) where fit is of type lsqfit.nonlinear_fit. The layout of the returned parameters p is the same as that of fit.p (or fit.pmean).
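The dump/load cycle behaves like a round-trip serialization of the parameter values. A stand-in sketch using pickle on a plain dictionary (this is only an analogy — fit.dump_p also preserves gvar.GVar error and correlation information, which this sketch does not):

```python
import os
import pickle
import tempfile

pmean = {'a': [1.0, 2.0], 'b': 0.5}     # stand-in for fit.pmean

fname = os.path.join(tempfile.mkdtemp(), 'pmean.p')
with open(fname, 'wb') as ofile:        # analogous to fit.dump_pmean(fname)
    pickle.dump(pmean, ofile)

with open(fname, 'rb') as ifile:        # analogous to load_parameters(fname)
    p0 = pickle.load(ifile)

print(p0 == pmean)   # True: layout and values are preserved
```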
Check for roundoff errors in fit.p.
Compares standard deviations from fit.p and fit.palt to see if they agree to within relative tolerance rtol and absolute tolerance atol. Generates a warning if they do not (in which case an svd cut might be advisable).
Call lsqfit.nonlinear_fit(**fitargs(z)) varying z, starting at z0, to maximize logGBF (empirical Bayes procedure).
The fit is redone for each value of z that is tried, in order to determine logGBF.
Returns: A tuple containing the best fit (object of type lsqfit.nonlinear_fit) and the optimal value for parameter z.
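The empirical Bayes idea can be sketched with a toy example: scan a parameter z (for example, a prior width) and keep the value that maximizes a logGBF-like objective. The quadratic objective below is entirely made up for illustration; the real procedure redoes a full lsqfit.nonlinear_fit at each z:

```python
# toy logGBF(z): peaks at z = 0.3 (made-up objective, not a real fit)
def toy_logGBF(z):
    return -(z - 0.3) ** 2

# crude grid-search stand-in for the maximization empbayes_fit performs
zgrid = [i * 0.01 for i in range(1, 101)]
zbest = max(zgrid, key=toy_logGBF)
print(zbest)   # close to 0.3
```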
Weighted average of gvar.GVars or arrays/dicts of gvar.GVars.
The weighted average of several gvar.GVars is what one obtains from a least-squares fit of the collection of gvar.GVars to the one-parameter fit function
def f(p):
    return N * [p[0]]
where N is the number of gvar.GVars. The average is the best-fit value for p[0]. gvar.GVars with smaller standard deviations carry more weight than those with larger standard deviations. The averages computed by wavg take account of correlations between the gvar.GVars.
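For uncorrelated gvar.GVars this least-squares average reduces to the familiar inverse-variance weighted mean; a plain-Python sketch (wavg itself additionally accounts for correlations):

```python
import math

def simple_wavg(means, sdevs):
    # inverse-variance weights: smaller errors carry more weight
    weights = [1.0 / s ** 2 for s in sdevs]
    wtot = sum(weights)
    mean = sum(w * m for w, m in zip(weights, means)) / wtot
    return mean, 1.0 / math.sqrt(wtot)

mean, sdev = simple_wavg([1.0, 1.2], [0.1, 0.2])
print(mean, sdev)   # about 1.04 and 0.089
```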
If prior is not None, it is added to the list of data used in the average. Thus wavg([x2, x3], prior=x1) is the same as wavg([x1, x2, x3]).
Typical usage is
x1 = gvar.gvar(...)
x2 = gvar.gvar(...)
x3 = gvar.gvar(...)
xavg = wavg([x1, x2, x3]) # weighted average of x1, x2 and x3
where the result xavg is a gvar.GVar containing the weighted average.
The individual gvar.GVars in the last example can be replaced by multidimensional distributions, represented by arrays of gvar.GVars or dictionaries of gvar.GVars (or arrays of gvar.GVars). For example,
x1 = [gvar.gvar(...), gvar.gvar(...)]
x2 = [gvar.gvar(...), gvar.gvar(...)]
x3 = [gvar.gvar(...), gvar.gvar(...)]
xavg = wavg([x1, x2, x3])
# xavg[i] is wgtd avg of x1[i], x2[i], x3[i]
where each array x1, x2 ... must have the same shape. The result xavg in this case is an array of gvar.GVars, where the shape of the array is the same as that of x1, etc.
Another example is
x1 = dict(a=[gvar.gvar(...), gvar.gvar(...)], b=gvar.gvar(...))
x2 = dict(a=[gvar.gvar(...), gvar.gvar(...)], b=gvar.gvar(...))
x3 = dict(a=[gvar.gvar(...), gvar.gvar(...)])
xavg = wavg([x1, x2, x3])
# xavg['a'][i] is wgtd avg of x1['a'][i], x2['a'][i], x3['a'][i]
# xavg['b'] is wgtd avg of x1['b'], x2['b']
where different dictionaries can have (some) different keys. Here the result xavg is a gvar.BufferDict having the same keys as x1, etc.
Weighted averages can become costly when the number of random samples being averaged is large (100s or more). In such cases it might be useful to set parameter fast=True. This causes wavg to estimate the weighted average by incorporating the random samples one at a time into a running average:
result = prior
for dataseq_i in dataseq:
    result = wavg([result, dataseq_i], ...)
This method is much faster when len(dataseq) is large, and gives the exact result when there are no correlations between different elements of list dataseq. The results are approximately correct when dataseq[i] and dataseq[j] are correlated for i!=j.
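That the running pairwise average reproduces the all-at-once average for uncorrelated data can be checked directly with the inverse-variance formula (plain Python, independent of lsqfit):

```python
import math

def wavg2(m1, s1, m2, s2):
    # weighted average of two uncorrelated values
    w1, w2 = 1.0 / s1 ** 2, 1.0 / s2 ** 2
    return (w1 * m1 + w2 * m2) / (w1 + w2), 1.0 / math.sqrt(w1 + w2)

data = [(1.0, 0.1), (1.2, 0.2), (0.9, 0.15)]

# all at once
w = [1.0 / s ** 2 for _, s in data]
full_mean = sum(wi * m for wi, (m, _) in zip(w, data)) / sum(w)

# one at a time, as with fast=True
run_mean, run_sdev = data[0]
for m, s in data[1:]:
    run_mean, run_sdev = wavg2(run_mean, run_sdev, m, s)

print(abs(full_mean - run_mean) < 1e-12)   # True
```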
The following function attributes are also set:
chi**2 for weighted average.
Effective number of degrees of freedom.
The probability that the chi**2 could have been larger, by chance, assuming that the data are all Gaussian and consistent with each other. Values smaller than 0.1 or so suggest that the data are not Gaussian or are inconsistent with each other. Also called the p-value.
Quality factor Q (or p-value) for fit.
Time required to do average.
The svd corrections made to the data when svdcut is not None.
Fit output from average.
These same attributes are also attached to the output gvar.GVar, array or dictionary from gvar.wavg().
Return the normalized incomplete gamma function Q(a,x) = 1-P(a,x).
Q(a, x) = 1/Gamma(a) * \int_x^\infty dt exp(-t) t ** (a-1) = 1 - P(a, x)
Note that gammaQ(ndof/2., chi2/2.) is the probability that one could get a chi**2 larger than chi2 with ndof degrees of freedom even if the model used to construct chi2 is correct.
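For integer a (that is, even numbers of degrees of freedom) Q(a, x) has the closed form Q(a, x) = exp(-x) * sum_{n=0}^{a-1} x**n / n!, which gives a quick way to check p-values by hand (a sketch of the identity, not the implementation used here):

```python
import math

def gammaQ_int(a, x):
    # normalized incomplete gamma Q(a, x) for integer a >= 1,
    # via Q(a, x) = exp(-x) * sum_{n=0}^{a-1} x**n / n!
    term, total = 1.0, 1.0
    for n in range(1, a):
        term *= x / n
        total += term
    return math.exp(-x) * total

# p-value for chi**2 = 2.0 with ndof = 2: Q(1, 1.0) = exp(-1)
print(gammaQ_int(1, 1.0))   # about 0.368
```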
Decorate fit function to allow log/sqrt-normal priors.
This decorator can be applied to fit functions whose parameters are stored in a dictionary-like object. It searches the parameter keys for string-valued keys of the form "log(XX)", "logXX", "sqrt(XX)", or "sqrtXX" where "XX" is an arbitrary string. For each such key it adds a new entry to the parameter dictionary with key "XX" where:
p["XX"] = exp(p[k]) for k = "log(XX)" or "logXX"
or
p["XX"] = p[k] ** 2 for k = "sqrt(XX)" or "sqrtXX"
This means that the fit function can be expressed entirely in terms of p["XX"] even if the actual fit parameter is the logarithm or square root of that quantity. Since fit parameters have Gaussian/normal priors, p["XX"] has a log-normal or “sqrt-normal” distribution in the first or second cases above, respectively. In either case p["XX"] is guaranteed to be positive.
This is a convenience function. It allows for the rapid replacement of a fit parameter by its logarithm or square root without having to rewrite the fit function — only the prior need be changed. The decorator needs to be told if the fit function has an x as its first argument, followed by the parameters p:
@lsqfit.transform_p(prior.keys(), has_x=True)
def fitfcn(x, p):
    ...
versus
@lsqfit.transform_p(prior.keys())
def fitfcn(p):
    ...
A list of the specific keys that need transforming can be used instead of the list of all keys (prior.keys()). The decorator assigns a copy of itself to the function as an attribute: fitfcn.transform_p.
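The key substitution can be sketched in a few lines. This is a simplified stand-in for what the decorator does, handling only the "logXX"/"log(XX)" forms (the real transform_p also handles "sqrt" keys and array-valued parameters):

```python
import math

def add_transformed(p):
    # add p["XX"] = exp(p[k]) for every key k of the form "logXX" or "log(XX)"
    out = dict(p)
    for k, v in p.items():
        if k.startswith('log(') and k.endswith(')'):
            out[k[4:-1]] = math.exp(v)
        elif k.startswith('log'):
            out[k[3:]] = math.exp(v)
    return out

p = add_transformed({'logE0': 0.0, 'log(a)': 1.0})
print(p['E0'], p['a'])   # 1.0 and about 2.718
```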
Create transformed copy of dictionary p.
Create a copy of parameter-dictionary p that includes new entries for each "logXX", etc entry corresponding to "XX". The values in p can be any type that supports logarithms, exponentials, and arithmetic.
Undo self.transform(p).
Reconstruct p0 where p == self.transform(p0); that is, remove entries for keys "XX" that were added by transform_p.transform() (because "logXX" or "sqrtXX" or ... appeared in p0).
Return parameter key corresponding to prior-key k.
Strip off any "log" or "sqrt" prefix.
Return key in prior corresponding to k.
Add in "log" or "sqrt" as needed to find a key in prior.
Fitter for nonlinear least-squares multidimensional fits.
multifit is a function-class whose constructor does a least squares fit by minimizing sum_i f_i(x)**2 as a function of vector x. The following attributes are available:
Location of the most recently computed (best) fit point.
Covariance matrix at the minimum point.
The fit function f(x) at the minimum in the most recent fit.
Gradient J_ij = df_i/dx[j] for most recent fit.
Number of iterations used in last fit to find the minimum.
None if fit successful; an error message otherwise.
multifit is a wrapper for the multifit GSL routine.
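The kind of problem multifit solves can be illustrated with the simplest case, f_i(x) = y_i - x, for which minimizing sum_i f_i(x)**2 gives the mean of the y_i (plain Python, not a call into GSL):

```python
def residuals(x, y):
    # f_i(x) = y_i - x
    return [yi - x for yi in y]

def sum_sq(x, y):
    # the objective multifit minimizes: sum_i f_i(x)**2
    return sum(fi ** 2 for fi in residuals(x, y))

y = [1.0, 2.0, 3.0, 6.0]

# setting d/dx sum_sq = -2 * sum(y_i - x) = 0 gives x = mean(y)
xbest = sum(y) / len(y)
print(xbest, sum_sq(xbest, y))   # 3.0 is the minimizer
```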
Minimizer for multidimensional functions.
multiminex is a function-class whose constructor minimizes a multidimensional function f(x) by varying vector x. This routine does not use user-supplied information about the gradient of f(x). The following attributes are available:
Location of the most recently computed minimum (1-d array).
Value of function f(x) at the most recently computed minimum.
Number of iterations required to find most recent minimum.
None if fit successful; an error message otherwise.
multiminex is a wrapper for the multimin GSL routine.
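The derivative-free idea behind multiminex can be illustrated in one dimension by a golden-section search, which shrinks a bracket using only function values (an illustration of gradient-free minimization, not the simplex algorithm GSL actually uses):

```python
import math

def golden_min(f, a, b, tol=1e-8):
    # shrink bracket [a, b] using only function evaluations (no gradient)
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

xmin = golden_min(lambda x: (x - 2.0) ** 2 + 1.0, 0.0, 5.0)
print(xmin)   # close to 2.0
```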