It is common engineering practice to "fit a line" to a set of data in order to determine some useful parameter in a mathematical model or perhaps to generate a calibration curve. A straight line is a simple polynomial and goal of the fit is to determine the coefficients (the slope and intercept) of the polynomial that lead to the "best fit" of a line to the data.
The fitting process can be generalized to determine the coefficients of the Nth-order polynomial that best fits N+1 (or more, usually) data points. The determination of the coefficients is usually termed "polynomial regression" and, in MATLAB, is accomplished by the function polyfit.
There is more to life than fitting just polynomials to data. Check here for other options in Fitting Land.
Assuming you have two N-by-1 vectors of data values, x and y, the best coefficients for a straight-line fit (in the least-squares sense) are found through the command
coeff = polyfit(x,y,1)while those for the least-squares polynomial fit of degree n (< N-1) are found by
coeff = polyfit(x,y,n)
Make sure that x and y are the same shape (i.e., both art row vectors or both are column vectors) or you'll get an error.
Note that here is a very definite order of the coefficients in the vector coeff: the coefficient of the highest order monomial comes first and the rest follow in descending order (down to the constant term). Thus, for a straight-line fit,
slope = coeff(1) intercept = coeff(2)
Also, note that the coefficients from polyfit are
returned as a row vector.
1. Does the fit look good?
A picture is worth a thousand words and so a plot of the curve that represents the fitted polynomial overlaid on the data is a powerful way of assessing the quality of the fit. Using the coefficients determined by polyfit, you can create such a plot via the commands
coeff = polyfit(x,y,n) xfit = linspace(min(x),max(x),npts) yfit = polyval(coeff,xfit) plot(x,y,'o',xfit,yfit)
where n is the order of the polynomial you are fitting and npts is the number of points used to plot the polynomial you are fitting. npts = 2 is fine for a line but more points would be needed to make something like a cubic equation (with all its curves) look nice. Try something like 200 for npts to start things off.
It is important that the coefficients returned by polyfit are used without truncation (i.e., they should be used to full machine precision). This is accomplished by using the output of polyfit (the variable coeff) directly as the input to polyval.
2. Are the residuals small and unpatterned?
The residuals are defined as the mismatch of the predicted values of ypred (computed using the x-values in the data you are trying to fit) and the actual (data) values of y.
If the coefficients of the fit are given in the vector coeff, the residuals are calculated via the commands
ypred = polyval(coeff,x); % predicted values resid = y - ypred; % mismatch
Two different plots of the residuals can sometimes be quite helpful:
A. A plot of the residuals versus the predicted values
plot(ypred,resid,'o')
should not show any patterns or trends. The plot should simply show random noise, according to the theory that underlies all the computations in polyfit. The degree of pattern and trend is a very good measure of the quality of the fit (no trend = good fit). If there is a trend, your model is missing something!
B. A normal-probability plot of the residuals should be a straight line if the premises of linear regression are correct and if the fit is good. If the Statistics toolbox is avalilable, the function normplot will do this for you. An alternative is to use nsplot from BISKIT (available at Bucknell).
Numerical methods for assessing the fit are given
below.
For what follows, it is assumed that you have a set of x-y data pairs and that you have used polyfit to compute the coefficients for a polynomial of given order (and have stored in a vector called coeff).
ypred = polyval(coeff,x); % predictions dev = y - mean(y); % deviations - measure of spread SST = sum(dev.^2); % total variation to be accounted for resid = y - ypred; % residuals - measure of mismatch SSE = sum(resid.^2); % variation NOT accounted for Rsq = 1 - SSE/SST; % percent of error explained
The closer that Rsq is to 1, the more completely the
fitted model "explains" the data.
f = MSR/MSE
where MSR = SSR/dfr and MSE = SSE/dfe. The statistic is computed via the following commands (which assume the commands given above for the R2 computation have been executed)
SSR = SST - SSE; % the "ANOVA identity" dfr = <order of the polynomial> dfe = length(x) - 1 - dfr; % degrees of freedom for error MSE = SSE/dfe; % mean-square error of residuals MSR = SSR/dfr; % mean-square error for regression f = MSR/MSE; % f-statistic for regression
"Large" values of this f-statistic (typically > 6 but check an
F(dfr,dfe)-table to be sure)
indicate that the fit is significant.
The standard errors may be obtained from an alternate use of polyfit. Specifically, if polyfit is used to provide two outputs,
[coef,S] = polyfit(x,y,n)
the second output is a structure that contains three fields (as of version
5.2, at any rate):
R = S.R; % The "R" from A = QR d = (R'*R)\eye(n+1); % The covariance matrix d = diag(d)'; % ROW vector of the diagonal elements MSE = (S.normr^2)/S.df; % variance of the residuals se = sqrt(MSE*d); % the standard errors t = coef./se; % observed T-values
Note that a transpose is used when the diagonal elements are extracted from the covariance matrix. If this step is omitted, there will be a mismatch of dimension in the ./ step that follows because polyfit returns a row vector while diag returns a column vector.
A coefficient is usually significant if its t-value is 2 or
better. To be specific, check a t-table at a selected level of
significance and S.df degrees of freedom.
The half-width of an a-%, 2-sided confidence interval is computed according to
wj = ta,dfe sej
In words, this means find a value t such that the chance of finding ofther t-values greater than this value is (100 - a)/2 %. The computation reqires either a good T-table or a root find using an expression for the cumulative T-distribution. sej is the standard error for the jth coefficient and its computation is given above.
The confidence interval is then, [cj - wj, cj + wj].
Continuing the code from above, the half-widths can be computed simply from the commmands
tval = <a given or computed value based on dfe> width = tval*se; % half widths ci = [coeff-width ; coef+width]; % confidence intervals
The lower and upper limits of the confidence intervals are stored in a 2-row vector, ci with the lower limits in the first row and the upper limits in the second row.
The bottom line: polyfit makes polynomial fitting easy. It is vital, then, that you assess the quality of the results generated by polyfit. Hence,