allanswers.org - comp.ai.neural-nets FAQ, Part 3 of 7: Generalization


linear. Hence the generalization error of "good" subsets will differ only in
the estimation variance. The estimation variance is (2p/t)s^2 where p
is the number of inputs in the subset, t is the training set size, and s^2
is the noise variance. The "best" subset is better than other "good" subsets
only because the "best" subset has (by definition) the smallest value of p.
But the t in the denominator means that differences in generalization error
among the "good" subsets will all go to zero as t goes to infinity.
Therefore it is difficult to guess which subset is "best" based on the
generalization error even when t is very large. It is well known that
unbiased estimates of the generalization error, such as those based on AIC,
FPE, and C_p, do not produce consistent estimates of the "best" subset
(e.g., see Stone, 1979). 
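
As a rough numerical illustration of the argument above, the following
sketch evaluates the estimation variance (2p/t)s^2 for two "good" subsets
and prints the gap between them as t grows. The subset sizes, the noise
variance, and the function name estimation_variance are made up for the
example and are not taken from any of the cited papers.

```python
def estimation_variance(p, t, s2):
    """Estimation variance (2p/t)s^2 as given in the text."""
    return (2.0 * p / t) * s2


s2 = 1.0               # assumed noise variance
p_best, p_good = 3, 5  # assumed sizes of the "best" and another "good" subset

for t in (10, 100, 1000, 10000):
    gap = estimation_variance(p_good, t, s2) - estimation_variance(p_best, t, s2)
    print(f"t={t:6d}  gap in estimation variance = {gap:.5f}")
```

With these numbers the gap is simply 4/t, so it becomes very hard to
detect from noisy error estimates once t is large.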

In leave-v-out cross-validation, t=n-v. The differences of the
cross-validation estimates of generalization error among the "good" subsets
contain a factor 1/t, not 1/n. Therefore by making t small enough (and
thereby making each regression based on t cases bad enough), we can make
the differences of the cross-validation estimates large enough to detect. It
turns out that to make t small enough to guess the "best" subset
consistently, we have to have t/n go to 0 as n goes to infinity. 
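
The bookkeeping can be sketched as follows. This is only an illustration:
it draws random leave-v-out splits (a Monte Carlo stand-in for enumerating
every subset of v cases), and the schedule t = n**0.75 is an arbitrary
assumption chosen so that t/n shrinks as n grows. The names
leave_v_out_splits and training_size are hypothetical.

```python
import random


def leave_v_out_splits(n, v, n_splits, seed=0):
    """Yield (train, valid) index lists with t = n - v training cases each."""
    rng = random.Random(seed)
    indices = list(range(n))
    for _ in range(n_splits):
        valid = sorted(rng.sample(indices, v))  # v held-out cases
        held_out = set(valid)
        train = [i for i in indices if i not in held_out]
        yield train, valid


def training_size(n, exponent=0.75):
    """Assumed schedule t = n**exponent, so that t/n -> 0 as n grows."""
    return max(2, int(round(n ** exponent)))


for n in (100, 1000, 10000):
    t = training_size(n)
    v = n - t
    train, valid = next(leave_v_out_splits(n, v, n_splits=1))
    print(f"n={n:6d}  t={len(train):5d}  v={len(valid):5d}  t/n={t / n:.3f}")
```

Each candidate subset of inputs would be fitted on the t training cases
and scored on the v held-out cases, averaging the scores over the splits.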

The crucial distinction between cross-validation and split-sample validation
is that with cross-validation, after guessing the "best" subset, we train
the linear regression model for that subset using all n cases, but with
split-sample validation, only t cases are ever used for training. If our
main purpose were really to choose the "best" subset, I suspect we would
still have to have t/n go to 0 even for split-sample validation. But
choosing the "best" subset is not the same thing as getting the best
generalization. If we are more interested in getting good generalization
than in choosing the "best" subset, we do not want to make our regression
estimate based on only t cases as bad as we do in cross-validation, because
in split-sample validation that bad regression estimate is what we're stuck
with. So there is no conflict between Shao and Kearns, but there is a
conflict between the two goals of choosing the "best" subset and getting the
best generalization in split-sample validation. 

Bootstrapping
+++++++++++++

Bootstrapping seems to work better than cross-validation in many cases
(Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly
analyzing subsets of the data, you repeatedly analyze subsamples of the
data. Each subsample is a random sample with replacement from the full
sample. Depending on what you want to do, anywhere from 50 to 2000
subsamples might be used. There are many more sophisticated bootstrap
methods that can be used not only for estimating generalization error but
also for estimating confidence bounds for network outputs (Efron and
Tibshirani, 1993). For estimating generalization error in classification
problems, the .632+ bootstrap (an improvement on the popular .632 bootstrap;
Efron and Tibshirani, 1997) is one of the currently favored methods and has
the advantage of performing well even when there is severe overfitting. Use
of bootstrapping for NNs is
described in Baxt and White (1995), Tibshirani (1996), and Masters (1995).
However, the results obtained so far are not very thorough, and it is known
that bootstrapping does not work well for some other methodologies such as
empirical decision trees (Breiman, Friedman, Olshen, and Stone, 1984;
Kohavi, 1995), for which it can be excessively optimistic. 
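
A minimal sketch of the simplest bootstrap error estimate, together with
the plain .632 combination (0.368 times the training error plus 0.632
times the out-of-bag error), is given below. It is not the .632+ method,
and it averages the out-of-bag error per resample rather than per case,
which is a simplification. The 1-nearest-neighbour classifier, the
synthetic data, and all function names are assumptions made for the
example.

```python
import random


def nearest_neighbor_predict(train, x):
    """Label of the training case closest to x; train holds (x, label) pairs."""
    return min(train, key=lambda case: abs(case[0] - x))[1]


def error_rate(train, test):
    """Fraction of test cases misclassified by 1-NN fitted on train."""
    wrong = sum(nearest_neighbor_predict(train, x) != y for x, y in test)
    return wrong / len(test)


def bootstrap_632(data, n_boot=200, seed=0):
    rng = random.Random(seed)
    n = len(data)
    apparent = error_rate(data, data)  # training (resubstitution) error
    oob_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        chosen = set(idx)
        out_of_bag = [data[i] for i in range(n) if i not in chosen]
        if not out_of_bag:
            continue  # nothing left to test on in this resample
        boot = [data[i] for i in idx]
        oob_errors.append(error_rate(boot, out_of_bag))
    oob = sum(oob_errors) / len(oob_errors)
    return 0.368 * apparent + 0.632 * oob, apparent, oob


# Two overlapping synthetic classes in one dimension.
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(30)]
data += [(rng.gauss(1.5, 1.0), 1) for _ in range(30)]

estimate, apparent, oob = bootstrap_632(data)
print(f"apparent={apparent:.3f}  out-of-bag={oob:.3f}  .632={estimate:.3f}")
```

A 1-nearest-neighbour rule reproduces its own training labels, so the
apparent error is zero here; that is the kind of severe overfitting for
which the .632+ refinement was designed.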

For further information
+++++++++++++++++++++++

Cross-validation and bootstrapping become considerably more complicated for
time series data; see Hjorth (1994) and Snijders (1988). 

More information on jackknife and bootstrap confidence intervals is
available at ftp://ftp.sas.com/pub/neural/jackboot.sas (this is a plain-text
file). 

References: 

   Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals for
   clinical input variable effects in a network trained to identify the
   presence of acute myocardial infarction", Neural Computation, 7, 624-638.

   Breiman, L. (1996), "Heuristics of instability and stabilization in model
   selection," Annals of Statistics, 24, 2350-2383. 

   Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), 
   Classification and Regression Trees, Belmont, CA: Wadsworth. 

   Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation
   in regression: The X-random case," International Statistical Review, 60,
   291-319. 

   Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical
   Implications, Proceedings of a workshop held in Groningen, The
   Netherlands, September 25-26, 1986, Berlin: Springer-Verlag. 

   Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling
   Plans, Philadelphia: SIAM. 

   Efron, B. (1983), "Estimating the error rate of a prediction rule:
   Improvement on cross-validation," J. of the American Statistical
   Association, 78, 316-331. 

   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
   London: Chapman & Hall. 

   Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation:
   The .632+ bootstrap method," J. of the American Statistical Association,
   92, 548-560. 

   Goutte, C. (1997), "Note on free lunches and cross-validation," Neural
   Computation, 9, 1211-1215,
   ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz. 

   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods:
   Validation, Model Selection, and Bootstrap, London: Chapman & Hall. 

   Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model
   selection in small samples," Biometrika, 76, 297-307. 

   Kearns, M. (1997), "A bound on the error of cross validation using the
   approximation and estimation rates, with consequences for the
   training-test split," Neural Computation, 9, 1143-1161. 

   Kohavi, R. (1995), "A study of cross-validation and bootstrap for
   accuracy estimation and model selection," International Joint Conference
   on Artificial Intelligence (IJCAI), pp. ?, 
   http://robotics.stanford.edu/users/ronnyk/ 

   Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 

   Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation
   estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) 
   Advances in Neural Information Processing Systems 6, San Mateo, CA:
   Morgan Kaufmann, pp. 391-398. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Shao, J. (1993), "Linear model selection by cross-validation," J. of the
   American Statistical Association, 88, 486-494. 

   Shao, J. (1995), "An asymptotic theory for linear model selection,"
   Statistica Sinica ?. 

   Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York:
   Springer-Verlag. 

   Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in
   time series," in Dijkstra (1988), pp. 56-69. 

   Stone, M. (1977), "Asymptotics for and against cross-validation,"
   Biometrika, 64, 29-35. 

   Stone, M. (1979), "Comments on model selection criteria of Akaike and
   Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278. 

   Tibshirani, R. (1996), "A comparison of some error estimates for neural
   network models," Neural Computation, 8, 152-163. 

   Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn,
   Morgan Kaufmann. 

   Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation,"
   Neural Computation, 8, 1421-1426. 

------------------------------------------------------------------------

Subject: How to compute prediction and confidence intervals (error bars)?
=========================================================================

(This answer is only about half finished. I will get around to the other
half eventually.) 

In addition to estimating over-all generalization error, it is often useful
to be able to estimate the accuracy of the network's predictions for
individual cases. 

Let: 

   Y      = the target variable
   y_i    = the value of Y for the ith case
   X      = the vector of input variables
   x_i    = the value of X for the ith case
   N      = the noise in the target variable
   n_i    = the value of N for the ith case
   m(X)   = E(Y|X) = the conditional mean of Y given X
   w      = a vector of weights for a neural network
   w^     = the weight vector obtained via training the network
   p(X,w) = the output of a neural network given input X and weights w
   p_i    = p(x_i,w)
   L      = the number of training (learning) cases, (y_i,x_i), i=1, ..., L
   Q(w)   = the objective function

Assume the data are generated by the model: 

   Y = m(X) + N
   E(N|X) = 0
   N and X are independent

The network is trained by attempting to minimize the objective function 
Q(w), which, for example, could be the sum of squared errors or the
negative log likelihood based on an assumed family of noise distributions. 
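
To make the notation concrete, here is a small sketch that generates
cases (y_i, x_i) from the model above and evaluates Q(w) as the sum of
squared errors. The particular m(X), the one-hidden-unit tanh network,
the noise standard deviation, and the weight values are all arbitrary
assumptions chosen for illustration.

```python
import math
import random


def m(x):
    """Assumed conditional mean E(Y|X=x); any smooth function would do."""
    return math.sin(x)


def p(x, w):
    """Output of a one-hidden-unit network with weights w = (a, b, c, d)."""
    a, b, c, d = w
    return a + b * math.tanh(c + d * x)


def Q(w, cases):
    """Objective function: sum of squared errors over cases (y_i, x_i)."""
    return sum((y - p(x, w)) ** 2 for y, x in cases)


rng = random.Random(0)
L = 50  # number of training cases
cases = []
for _ in range(L):
    x = rng.uniform(-2.0, 2.0)
    noise = rng.gauss(0.0, 0.1)  # drawn independently of x, with mean zero
    cases.append((m(x) + noise, x))

print(Q((0.0, 1.0, 0.0, 1.0), cases))
```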

Given a test input x_0, a 100c% prediction interval for y_0 is an
interval [LPB_0,UPB_0] such that Pr(LPB_0 <= y_0 <= UPB_0) = c, where c is
typically .95 or .99, and the probability is computed over repeated random
selection of the training set and repeated observation of Y given the test
input x_0. A 100c% confidence interval for p_0 is an interval [LCB_0,UCB_0]
such that Pr(LCB_0 <= p_0 <= UCB_0) = c, where again the probability is
computed over repeated random selection of the training set. Note that p_0
is a nonrandom quantity, since x_0 is given. A confidence interval is
narrower
than the corresponding prediction interval, since the prediction interval
must include variation due to noise in y_0, while the confidence interval
does not. Both intervals include variation due to sampling of the training
set and possible variation in the training process due, for example, to
random initial weights and local minima of the objective function. 
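
The difference between the two kinds of interval is easiest to see in a
textbook case that involves no network at all: the model is a constant
(the sample mean of L cases) and the noise is Gaussian with a known
standard deviation s. The sketch below is that simplified case only; the
function name intervals and the data are invented for the example.

```python
from statistics import NormalDist, mean


def intervals(y, s, c=0.95):
    """Return (confidence interval, prediction interval) for a constant model.

    Assumes Gaussian noise with known standard deviation s.
    """
    L = len(y)
    center = mean(y)                         # fitted value
    z = NormalDist().inv_cdf(0.5 + c / 2.0)  # two-sided normal quantile
    half_cb = z * s / L ** 0.5               # sampling variation only
    half_pb = z * s * (1.0 + 1.0 / L) ** 0.5  # sampling variation plus noise
    return ((center - half_cb, center + half_cb),
            (center - half_pb, center + half_pb))


y = [1.0, 2.0, 3.0, 4.0]
cb, pb = intervals(y, s=1.0)
print("confidence interval:", cb)
print("prediction interval:", pb)
```

As L grows, the confidence interval shrinks toward zero width, while the
prediction interval cannot become narrower than the noise allows.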

Traditional statistical methods for nonlinear models depend on several
assumptions (Gallant, 1987): 

1. The inputs for the training cases are either fixed or obtained by simple
   random sampling or some similarly well-behaved process. 
2. Q(w) has continuous first and second partial derivatives with respect
   to w over some convex, bounded subset S_W of the weight space. 
3. Q(w) has a unique global minimum at w^, which is an interior point of
   S_W. 
4. The model is well-specified, which requires (a) that there exist weights 
   w$ in the interior of S_W such that m(x) = p(x,w$), and (b)
   that the assumptions about the noise distribution are correct. (Sorry
   about the w$ notation, but I'm running out of plain text symbols.) 

These traditional methods are based on a linear approximation to p(x,w)
in a neighborhood of w$, yielding a quadratic approximation to Q(w).
Hence the Hessian of Q(w) (the square matrix of second-order partial
derivatives with respect to w) frequently appears in these methods. 
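
For the special case where Q(w) is the sum of squared errors, the step
from the linear approximation to the quadratic one can be written out as
follows. This is a sketch under that assumption; w^{*} stands for the w$
of the text, and the symbols g, r_i, and \delta are introduced here only
for the derivation.

```latex
\begin{align*}
p(x, w) &\approx p(x, w^{*}) + g(x)^{\top} \delta,
  \qquad g(x) = \left.\frac{\partial p(x, w)}{\partial w}\right|_{w = w^{*}},
  \quad \delta = w - w^{*}, \\
Q(w) = \sum_{i=1}^{L} \bigl(y_i - p(x_i, w)\bigr)^2
  &\approx \sum_{i=1}^{L} \bigl(r_i - g(x_i)^{\top} \delta\bigr)^2,
  \qquad r_i = y_i - p(x_i, w^{*}), \\
  &= \sum_{i=1}^{L} r_i^2
     \;-\; 2\, \delta^{\top} \sum_{i=1}^{L} g(x_i)\, r_i
     \;+\; \delta^{\top} \Bigl( \sum_{i=1}^{L} g(x_i)\, g(x_i)^{\top} \Bigr) \delta .
\end{align*}
```

The last expression is quadratic in \delta, and its Hessian is twice the
sum of the outer products g(x_i) g(x_i)^T, a cross-product matrix of
first derivatives.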

Assumption (3) is not satisfied for neural nets, because networks with
hidden units always have multiple global minima, and the global minima are
often improper. Hence, confidence intervals for the weights cannot be
obtained using standard Hessian-based methods. However, Hwang and Ding
(1997) have shown that confidence intervals for predicted values can be
obtained because the predicted values are statistically identified even
though the weights are not. 

Cardell, Joerding, and Li (1994) describe a more serious violation of
assumption (3), namely that for some m(x), no finite global minimum
exists. In such situations, it may be possible to use regularization methods
such as weight decay to obtain valid confidence intervals (De Veaux, Schumi,
Schweinsberg, and Ungar, 1998), but more research is required on this
subject, since the derivation in the cited paper assumes a finite global
minimum. 

For large samples, the sampling variability in w^ can be approximated in
various ways: 

 o Fisher's information matrix, which is the expected value of the Hessian
   of Q(w) divided by L, can be used when Q(w) is the negative log
   likelihood (Spall, 1998). 
 o The delta method, based on the Hessian of Q(w) or the Gauss-Newton
   approximation using the cross-product Jacobian of Q(w), can also be
   used when Q(w) is the negative log likelihood (Tibshirani, 1996; Hwang
   and Ding, 1997; De Veaux, Schumi, Schweinsberg, and Ungar, 1998). 
 o The sandwich estimator, a more elaborate Hessian-based method, relaxes
   assumption (4) (Gallant, 1987; White, 1989; Tibshirani, 1996). 
 o Bootstrapping can be used without knowing the form of the noise
   distribution and takes into account variability introduced by local
   minima in training, but requires training the network many times on
   different resamples of the training set (Tibshirani, 1996; Heskes, 1997). 
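
The bootstrap item in the list above can be sketched as follows. To keep
the example short and self-contained, a least-squares line stands in for
the network and its training procedure, cases are resampled as (y_i, x_i)
pairs, and the interval is taken from simple percentiles of the
bootstrap predictions. These are choices made for this illustration and
are not claimed to be the procedure of any of the cited papers.

```python
import random


def fit_line(cases):
    """Least-squares fit to (y, x) pairs; stand-in for training a network."""
    L = len(cases)
    mean_x = sum(x for _, x in cases) / L
    mean_y = sum(y for y, _ in cases) / L
    sxx = sum((x - mean_x) ** 2 for _, x in cases)
    sxy = sum((x - mean_x) * (y - mean_y) for y, x in cases)
    slope = sxy / sxx
    return (mean_y - slope * mean_x, slope)  # w^ = (intercept, slope)


def predict(w, x):
    return w[0] + w[1] * x


def bootstrap_confidence_interval(cases, x0, c=0.95, n_boot=1000, seed=0):
    """Percentile interval for the prediction at x0 over bootstrap refits."""
    rng = random.Random(seed)
    L = len(cases)
    preds = []
    for _ in range(n_boot):
        resample = [cases[rng.randrange(L)] for _ in range(L)]
        preds.append(predict(fit_line(resample), x0))  # retrain, then predict
    preds.sort()
    lower = preds[int((1.0 - c) / 2.0 * n_boot)]
    upper = preds[int((1.0 + c) / 2.0 * n_boot) - 1]
    return lower, upper


# Synthetic training set: y = 1 + 2x + Gaussian noise (assumed for the demo).
rng = random.Random(42)
cases = []
for _ in range(40):
    x = rng.uniform(0.0, 1.0)
    cases.append((1.0 + 2.0 * x + rng.gauss(0.0, 0.3), x))

p0 = predict(fit_line(cases), 0.5)
lcb, ucb = bootstrap_confidence_interval(cases, 0.5)
print(f"p_0={p0:.3f}  interval=[{lcb:.3f}, {ucb:.3f}]")
```

With a real network, fit_line would be replaced by a full training run,
ideally with fresh random initial weights for each resample so that
variation due to local minima is also reflected in the interval.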

References: 

   Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why some feedforward
   networks cannot learn some polynomials," Neural Computation, 6, 761-766. 

   De Veaux, R.D., Schumi, J., Schweinsberg, J., and Ungar, L.H. (1998),
   "Prediction intervals for neural networks via nonlinear regression,"
   Technometrics, 40, 273-282. 

   Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley. 

   Heskes, T. (1997), "Practical confidence and prediction intervals," in
   Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural
   Information Processing Systems 9, Cambridge, MA: The MIT Press, pp.
   176-182. 

   Hwang, J.T.G., and Ding, A.A. (1997), "Prediction intervals for
   artificial neural networks," J. of the American Statistical Association,
   92, 748-757. 

   Nix, D.A., and Weigend, A.S. (1995), "Learning local error bars for
   nonlinear regression," in Tesauro, G., Touretzky, D., and Leen, T.,
   (eds.) Advances in Neural Information Processing Systems 7, Cambridge,
   MA: The MIT Press, pp. 489-496. 

   Spall, J.C. (1998), "Resampling-based calculation of the information
   matrix in nonlinear statistical models," Proceedings of the 4th Joint
   Conference on Information Sciences, October 23-28, Research Triangle
   Park, NC, USA, Vol 4, pp. 35-39. 

   Tibshirani, R. (1996), "A comparison of some error estimates for neural
   network models," Neural Computation, 8, 152-163. 

   White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
   Layer Feedforward Network Models", J. of the American Statistical Assoc.,
   84, 1008-1013. 

------------------------------------------------------------------------

Next part is part 4 (of 7). Previous part is part 2. 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
