It is common engineering practice to "fit a line" to a set of data in order to determine some useful parameter in a mathematical model or perhaps to generate a calibration curve. A straight line is a simple polynomial, and the goal of the fit is to determine the coefficients (the slope and intercept) of the polynomial that lead to the "best fit" of a line to the data.

The fitting process can be generalized to determine the coefficients of
the *N*th-order polynomial that best fits *N*+1 (or more,
usually) data points. The determination of the coefficients is usually
termed "polynomial regression" and, in MATLAB, is accomplished by the
function **polyfit**.
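As a quick illustration (a sketch with made-up data, not part of the measurements discussed below), the commands here generate noisy straight-line data and recover the slope and intercept:

x = (0:0.5:10)';                        % sample locations
y = 2.5*x + 1.0 + 0.3*randn(size(x));   % noisy "measurements" (true slope 2.5, intercept 1.0)
coeff = polyfit(x,y,1)                  % coeff(1) is the slope, coeff(2) the intercept

The recovered values will be close to, but not exactly, 2.5 and 1.0 because of the added noise.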

- Computing the coefficients
- Evaluating the fit (i.e., using the coefficients)
- Computing statistical summaries
- The bottom line

There is more to life than fitting just polynomials to data. Check here for other options in Fitting Land.

[Back to main page] [Back to "How do I ..."]

Assuming you have two *N*-by-1 vectors of data values, **x** and **y**, the coefficients for the least-squares straight-line fit are computed via

coeff = polyfit(x,y,1)

while those for the least-squares polynomial fit of degree *n* are computed via

coeff = polyfit(x,y,n)

Make sure that `x` and `y` have the same length; otherwise `polyfit` will report an error.

Note that there is a very definite order of the coefficients in the vector
**coeff**: the coefficients are returned in descending powers of *x*, so the
coefficient of the highest power comes first. For a straight-line fit, this means

slope = coeff(1)
intercept = coeff(2)

Also, note that the coefficients from `polyfit` are
returned as a *row* vector.
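For example (an illustrative sketch), a quadratic fit returns three coefficients in descending powers of *x*:

coeff = polyfit(x,y,2);                          % quadratic fit: coeff = [a2 a1 a0]
yfit = coeff(1)*x.^2 + coeff(2)*x + coeff(3);    % evaluated by hand ...
yfit = polyval(coeff,x);                         % ... or, equivalently, with polyval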

**1. Does the fit look good?**

A picture is worth a thousand words and so a plot of the curve that
represents the fitted polynomial overlaid on the data is a powerful way of
assessing the quality of the fit. Using the coefficients determined by
**polyfit**, you can create such a plot via the commands

coeff = polyfit(x,y,n)
xfit = linspace(min(x),max(x),npts)
yfit = polyval(coeff,xfit)
plot(x,y,'o',xfit,yfit)

where **n** is the order of the polynomial you are fitting
and **npts** is the number of points used to draw the fitted curve (100 or so is usually plenty).

It is important that the coefficients returned by **polyfit**
are used with **polyval** exactly as returned, that is, in descending powers of *x*; do not reorder them.

**2. Are the residuals small and unpatterned?**

The residuals are defined as the mismatch between the predicted values
*y*_{pred} (computed using the fitted coefficients) and the measured values of *y*.

If the coefficients of the fit are given in the vector
**coeff**, the residuals are calculated via the commands

ypred = polyval(coeff,x); % predicted values resid = y - ypred; % mismatch

Two different plots of the residuals can sometimes be quite helpful:

**A.** A plot of the residuals versus the predicted values

plot(ypred,resid,'o')

should not show any patterns or trends. The plot should simply show
random noise, according to the theory that underlies all the computations
in **polyfit**. The degree of pattern and trend is a very
good measure of the quality of the fit (no trend = good fit). If there is
a trend, your model is missing something!
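To see what a "patterned" residual plot looks like, consider a made-up example: fit a straight line to data that is actually quadratic. The residuals trace out a parabola instead of random scatter, signaling that the model is missing a term.

x = linspace(0,1,50);            % made-up data that is actually quadratic
y = x.^2;
coeff = polyfit(x,y,1);          % (mis)fit a straight line
ypred = polyval(coeff,x);
resid = y - ypred;
plot(ypred,resid,'o')            % residuals trace a parabola, not random noise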

**B.** A normal-probability plot of the residuals should be a straight
line if the premises of linear regression are correct and if the fit is
good. If the Statistics Toolbox is available, the function
**normplot** will do this for you. An alternative is to construct the plot yourself from the sorted residuals and the corresponding standard-normal quantiles.
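One way to build such a plot by hand (a sketch using only base MATLAB; `erfinv` is a core function, and `resid` is the residual vector computed above):

sr = sort(resid);                % ordered residuals
m = length(sr);
p = ((1:m) - 0.5)/m;             % plotting positions
q = sqrt(2)*erfinv(2*p - 1);     % standard-normal quantiles
plot(q,sr,'o')                   % nearly straight line = nearly normal residuals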

Numerical methods for assessing the fit are given
below.

For what follows, it is assumed that you have a set of *x*-*y*
data pairs and that you have used **polyfit** to compute the
coefficients for a polynomial of given order (and have stored them in a vector
called **coeff**).

- The **coefficient of determination** (also referred to as the R^{2} value) for the fit indicates the fraction of the variation in the data that is explained by the model. This coefficient can be computed via the commands

ypred = polyval(coeff,x);   % predictions
dev = y - mean(y);          % deviations - measure of spread
SST = sum(dev.^2);          % total variation to be accounted for
resid = y - ypred;          % residuals - measure of mismatch
SSE = sum(resid.^2);        % variation NOT accounted for
Rsq = 1 - SSE/SST;          % fraction of variation explained

The closer that `Rsq` is to 1, the more completely the fitted model "explains" the data.

- The **observed *f*-statistic** for the fit compares the "size" of the variation in the data explained by the model to the "size" of the variation left unexplained by the model. The basis for this comparison is the ratio of the mean-square variations for the regression and the error (residuals):

*f* = MSR/MSE

where MSR = SSR/*df*_{r} and MSE = SSE/*df*_{e}. The statistic is computed via the following commands (which assume the commands given above for the R^{2} computation have been executed):

SSR = SST - SSE;                   % the "ANOVA identity"
dfr = <order of the polynomial>;   % degrees of freedom for regression
dfe = length(x) - 1 - dfr;         % degrees of freedom for error
MSE = SSE/dfe;                     % mean-square error of residuals
MSR = SSR/dfr;                     % mean-square error for regression
f = MSR/MSE;                       % f-statistic for regression

"Large" values of this *f*-statistic (typically > 6, but check an *F*(*df*_{r}, *df*_{e}) table to be sure) indicate that the fit is significant.

- The **observed *t*-statistics** for the coefficients indicate the level of significance of any one of the coefficients. The *t*-statistic is defined as the ratio of the value of a coefficient to its standard error. Hence, computation of these statistics requires computation of the standard errors.

The standard errors may be obtained from an alternate use of `polyfit`. Specifically, if `polyfit` is used to provide two outputs,

[coef,S] = polyfit(x,y,n)

the second output is a structure that contains three fields (as of version 5.2, at any rate):

- `S.R`: the triangular factor from the *Q*-*R* decomposition of the matrix used in computing the fit
- `S.df`: the degrees of freedom for the residuals
- `S.normr`: the 2-norm of the vector of the residuals for the fit

To compute the vector of standard errors of the coefficients and the observed *t*-values, use the commands

R = S.R;                  % The "R" from A = QR
d = (R'*R)\eye(n+1);      % The covariance matrix (up to a factor of MSE)
d = diag(d)';             % ROW vector of the diagonal elements
MSE = (S.normr^2)/S.df;   % variance of the residuals
se = sqrt(MSE*d);         % the standard errors
t = coef./se;             % observed t-values

Note that a transpose is used when the diagonal elements are extracted from the covariance matrix. If this step is omitted, there will be a mismatch of dimension in the `./` step that follows, because `polyfit` returns a row vector while `diag` returns a column vector.

A coefficient is usually significant if its *t*-value is 2 or better. To be specific, check a *t*-table at a selected level of significance and `S.df` degrees of freedom.

- The **confidence intervals of the coefficients** are an alternative way of expressing the accuracy of the results. The idea is to specify a level of confidence (*a*) and then use it to compute a *t*-statistic that scales the standard error of each coefficient (see the discussion of the observed *t*-statistics, above) to define an interval about the value of that coefficient returned by `polyfit`.

The half-width of an *a*-%, 2-sided confidence interval is computed according to

*w*_{j} = *t*_{a,dfe} *se*_{j}

where *t*_{a,dfe} is the *t*-value determined using

*P*(*T* > *t*; *df*_{e}) = (1 - *a*/100)/2

In words, this means: find a value *t* such that the chance of finding other *t*-values greater than this value is (100 - *a*)/2 %. The computation requires either a good *t*-table or a root find using an expression for the cumulative *t*-distribution. *se*_{j} is the standard error for the *j*^{th} coefficient, and its computation is given above.

The confidence interval is then [*c*_{j} - *w*_{j}, *c*_{j} + *w*_{j}]. Continuing the code from above, the half-widths can be computed simply from the commands

tval = <a given or computed value based on df_{e}>
width = tval*se;                 % half-widths
ci = [coef-width ; coef+width];  % confidence intervals

The lower and upper limits of the confidence intervals are stored in a 2-row array, `ci`, with the lower limits in the first row and the upper limits in the second row.
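Putting the pieces together, here is a compact sketch of the whole interval computation. It assumes the Statistics Toolbox is available for `tinv`, and a 95% confidence level is assumed for illustration:

[coef,S] = polyfit(x,y,n);       % fit plus diagnostic structure
MSE = (S.normr^2)/S.df;          % variance of the residuals
d = diag((S.R'*S.R)\eye(n+1))';  % ROW vector of the covariance diagonal
se = sqrt(MSE*d);                % standard errors of the coefficients
tval = tinv(0.975,S.df);         % two-sided 95% t-value (Statistics Toolbox)
width = tval*se;                 % half-widths
ci = [coef-width ; coef+width]   % row 1: lower limits, row 2: upper limits

Without the toolbox, replace the `tinv` line with a value read from a *t*-table at `S.df` degrees of freedom.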

**The bottom line**: **polyfit** makes polynomial fitting
easy. It is vital, then, that you assess the quality of the fit before you rely on the coefficients it returns.


Comments? Contact Jim Maneval at