## Fitting Polynomial Models to Data

It is common engineering practice to "fit a line" to a set of data in order to determine some useful parameter in a mathematical model or perhaps to generate a calibration curve. A straight line is a simple polynomial, and the goal of the fit is to determine the coefficients (the slope and intercept) of the polynomial that lead to the "best fit" of a line to the data.

The fitting process can be generalized to determine the coefficients of the Nth-order polynomial that best fits N+1 (or more, usually) data points. The determination of the coefficients is usually termed "polynomial regression" and, in MATLAB, is accomplished by the function polyfit.


### Computing the coefficients

Assuming you have two N-by-1 vectors of data values, x and y, the best coefficients for a straight-line fit (in the least-squares sense) are found through the command

```
coeff = polyfit(x,y,1)
```

while those for the least-squares polynomial fit of degree n (n < N-1) are found by

```
coeff = polyfit(x,y,n)
```

Make sure that x and y are the same shape (i.e., both are row vectors or both are column vectors) or you'll get an error.

Note that there is a very definite order of the coefficients in the vector coeff: the coefficient of the highest-order monomial comes first and the rest follow in descending order (down to the constant term). Thus, for a straight-line fit,

```
slope = coeff(1)
intercept = coeff(2)
```

Also, note that the coefficients from polyfit are returned as a row vector.
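This ordering convention (highest power first, returned as a row) is shared by NumPy's np.polyfit, so the behavior can be sketched outside MATLAB. The data below are made up purely for illustration:

```python
import numpy as np

# Three made-up points lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0])
y = 2.0 * x + 1.0

# Degree-1 fit; coefficients come back highest-order term first,
# just as MATLAB's polyfit returns them.
coeff = np.polyfit(x, y, 1)
slope = coeff[0]       # close to 2
intercept = coeff[1]   # close to 1
```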

### Evaluating the fit

polyfit makes the computation of the coefficients easy. The tough part of polynomial regression is knowing whether the fit is a good one. Determining the quality of the fit requires experience, a sense of balance, and some statistical summaries.

1. Does the fit look good?

A picture is worth a thousand words and so a plot of the curve that represents the fitted polynomial overlaid on the data is a powerful way of assessing the quality of the fit. Using the coefficients determined by polyfit, you can create such a plot via the commands

```
coeff = polyfit(x,y,n)
xfit = linspace(min(x),max(x),npts)
yfit = polyval(coeff,xfit)
plot(x,y,'o',xfit,yfit)
```

where n is the order of the polynomial you are fitting and npts is the number of points used to plot the fitted curve. npts = 2 is fine for a line, but more points are needed to make something like a cubic (with all its curves) look smooth. Try something like 200 for npts to start things off.

It is important that the coefficients returned by polyfit are used without truncation (i.e., they should be used to full machine precision). This is accomplished by using the output of polyfit (the variable coeff) directly as the input to polyval.

2. Are the residuals small and unpatterned?

The residuals are defined as the mismatch between the predicted values ypred (computed using the x-values in the data you are trying to fit) and the actual (data) values of y.

If the coefficients of the fit are given in the vector coeff, the residuals are calculated via the commands

```
ypred = polyval(coeff,x);      % predicted values
resid = y - ypred;             % mismatch
```

Two different plots of the residuals can sometimes be quite helpful:

A. A plot of the residuals versus the predicted values

```
plot(ypred,resid,'o')
```

should not show any patterns or trends. The plot should simply show random noise, according to the theory that underlies all the computations in polyfit. The degree of pattern and trend is a very good measure of the quality of the fit (no trend = good fit). If there is a trend, your model is missing something!

B. A normal-probability plot of the residuals should be a straight line if the premises of linear regression are correct and if the fit is good. If the Statistics toolbox is available, the function normplot will do this for you. An alternative is to use nsplot from BISKIT (available at Bucknell).

Numerical methods for assessing the fit are given below.

### Computing statistical summaries of the fit

There are any number of statistical measures of the "quality" and "appropriateness" of a model fit to a set of data. This section shows how to compute some of the more common measures (coefficient of determination, observed f-value for the fit and the observed t-values/confidence intervals for the coefficients). Note that all of these measures are less informative than the by-eye views discussed above.

For what follows, it is assumed that you have a set of x-y data pairs and that you have used polyfit to compute the coefficients for a polynomial of given order (and have stored them in a vector called coeff).

• The coefficient of determination (also referred to as the R2 value) for the fit indicates the fraction of the variation in the data that is explained by the model. This coefficient can be computed via the commands

```
ypred = polyval(coeff,x);   % predictions
dev = y - mean(y);          % deviations - measure of spread
SST = sum(dev.^2);          % total variation to be accounted for
resid = y - ypred;          % residuals - measure of mismatch
SSE = sum(resid.^2);        % variation NOT accounted for
Rsq = 1 - SSE/SST;          % fraction of variation explained
```

The closer that Rsq is to 1, the more completely the fitted model "explains" the data.
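The same sequence of steps can be cross-checked in Python with NumPy (an illustration with made-up data, not part of the MATLAB workflow above; np.polyfit and np.polyval mirror the MATLAB calls):

```python
import numpy as np

# Hypothetical data: roughly, but not exactly, linear.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])

coeff = np.polyfit(x, y, 1)      # straight-line fit
ypred = np.polyval(coeff, x)     # predictions
dev = y - y.mean()               # deviations - measure of spread
SST = np.sum(dev**2)             # total variation to be accounted for
resid = y - ypred                # residuals - measure of mismatch
SSE = np.sum(resid**2)           # variation NOT accounted for
Rsq = 1 - SSE / SST              # fraction of variation explained
print(Rsq)                       # about 0.85 for this data
```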

• The observed f-statistic for the fit compares the "size" of the fraction of the data variation explained by the model to the "size" of the variation unexplained by the model. The basis for this comparison is the ratio of the variances for the model and the error (residuals).

f = MSR/MSE

where MSR = SSR/dfr and MSE = SSE/dfe. The statistic is computed via the following commands (which assume the commands given above for the R2 computation have been executed)

```
SSR = SST - SSE;                  % the "ANOVA identity"
dfr = <order of the polynomial>;  % degrees of freedom for regression
dfe = length(x) - 1 - dfr;        % degrees of freedom for error
MSE = SSE/dfe;                    % mean-square error of residuals
MSR = SSR/dfr;                    % mean-square error for regression
f = MSR/MSE;                      % f-statistic for regression
```

"Large" values of this f-statistic (typically > 6 but check an F(dfr,dfe)-table to be sure) indicate that the fit is significant.

• The observed t-statistics for the coefficients indicate the level of significance for any one of the coefficients. This t-statistic is defined as the ratio of the value of the coefficient to its standard error. Hence, computation of these statistics requires computation of the standard errors.

The standard errors may be obtained from an alternate use of polyfit. Specifically, if polyfit is used to provide two outputs,

```
[coef,S] = polyfit(x,y,n)
```

the second output is a structure that contains three fields (as of version 5.2, at any rate):

1. S.R: an (n+1)-by-(n+1) upper-triangular matrix from the QR decomposition of the matrix used in computing the fit
2. S.df: the degrees of freedom for the residuals
3. S.normr: the 2-norm of the vector of the residuals for the fit

To compute the vector of standard errors of the coefficients and the observed t-values, use the commands
```
R = S.R;                % The "R" from A = QR
d = (R'*R)\eye(n+1);    % inverse of R'*R (unscaled covariance matrix)
d = diag(d)';           % ROW vector of the diagonal elements
MSE = (S.normr^2)/S.df; % variance of the residuals
se = sqrt(MSE*d);       % the standard errors
t = coef./se;           % observed t-values
```

Note that a transpose is used when the diagonal elements are extracted from the covariance matrix. If this step is omitted, there will be a mismatch of dimension in the ./ step that follows because polyfit returns a row vector while diag returns a column vector.

A coefficient is usually significant if its t-value is 2 or better. To be specific, check a t-table at a selected level of significance and S.df degrees of freedom.
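Since the S structure is specific to MATLAB, an equivalent Python/NumPy sketch has to rebuild the pieces explicitly: form the Vandermonde matrix, take its QR decomposition, and apply the same algebra. The data are made up, and the resulting t-values mean nothing outside this illustration:

```python
import numpy as np

# Hypothetical data and polynomial order.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
n = 1

# Vandermonde matrix with columns [x^n, ..., x, 1], matching
# polyfit's highest-order-first coefficient ordering.
A = np.vander(x, n + 1)
coef = np.linalg.lstsq(A, y, rcond=None)[0]

Q, R = np.linalg.qr(A)                 # the "R" from A = QR
d = np.diag(np.linalg.inv(R.T @ R))    # diagonal of inv(R'R)
dfe = len(x) - (n + 1)                 # degrees of freedom for residuals
resid = y - A @ coef
MSE = np.sum(resid**2) / dfe           # variance of the residuals
se = np.sqrt(MSE * d)                  # the standard errors
t = coef / se                          # observed t-values
print(t)
```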

• The confidence intervals of the coefficients are an alternative way of expressing the accuracy of the results. The idea is to specify a level of confidence (a) and then use it to compute a t-statistic that scales the standard error of the coefficient (see the discussion of the observed t-statistics, above) to define an interval about the value of each coefficient returned by polyfit.

The half-width of an a-%, two-sided confidence interval is computed according to

w_j = t_(a,dfe) * se_j

where t_(a,dfe) is the t-value determined using

P(T > t; dfe) = (1 - a/100)/2

In words, this means find a value t such that the chance of finding other t-values greater than this value is (100 - a)/2 %. The computation requires either a good t-table or a root find using an expression for the cumulative t-distribution. se_j is the standard error for the jth coefficient and its computation is given above.

The confidence interval is then [c_j - w_j, c_j + w_j].

Continuing the code from above, the half-widths can be computed simply from the commands

```
tval = <a given or computed value based on dfe>
width = tval*se;                  % half widths
ci = [coef-width ; coef+width];   % confidence intervals
```

The lower and upper limits of the confidence intervals are stored in a two-row array, ci, with the lower limits in the first row and the upper limits in the second row.
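A Python/NumPy sketch of the interval computation, using made-up data and the standard t-table value 4.303 (95% confidence, two-sided, 2 degrees of freedom):

```python
import numpy as np

# Hypothetical data and a straight-line fit.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
n = 1
A = np.vander(x, n + 1)
coef = np.linalg.lstsq(A, y, rcond=None)[0]

# Standard errors of the coefficients (see the t-statistic discussion).
dfe = len(x) - (n + 1)
MSE = np.sum((y - A @ coef)**2) / dfe
se = np.sqrt(MSE * np.diag(np.linalg.inv(A.T @ A)))

tval = 4.303                 # t-table value: 95%, two-sided, dfe = 2
width = tval * se            # half widths
ci = np.vstack([coef - width, coef + width])   # lower row, then upper row
print(ci)
```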

The bottom line: polyfit makes polynomial fitting easy. It is vital, then, that you assess the quality of the results generated by polyfit, using the graphical checks and statistical summaries described above.