![]() |
| Home > Science > ai-faq > neural-nets > |
comp.ai.neural-nets FAQ, Part 3 of 7: Generalization |
Section 4 of 4 - Prev - Next
All sections - 1 - 2 - 3 - 4
linear. Hence the generalization error of "good" subsets will differ only in the estimation variance. The estimation variance is (2p/t)s^2 where p is the number of inputs in the subset, t is the training set size, and s^2 is the noise variance. The "best" subset is better than other "good" subsets only because the "best" subset has (by definition) the smallest value of p. But the t in the denominator means that differences in generalization error among the "good" subsets will all go to zero as t goes to infinity. Therefore it is difficult to guess which subset is "best" based on the generalization error even when t is very large. It is well known that unbiased estimates of the generalization error, such as those based on AIC, FPE, and C_p, do not produce consistent estimates of the "best" subset (e.g., see Stone, 1979). In leave-v-out cross-validation, t=n-v. The differences of the cross-validation estimates of generalization error among the "good" subsets contain a factor 1/t, not 1/n. Therefore by making t small enough (and thereby making each regression based on t cases bad enough), we can make the differences of the cross-validation estimates large enough to detect. It turns out that to make t small enough to guess the "best" subset consistently, we have to have t/n go to 0 as n goes to infinity. The crucial distinction between cross-validation and split-sample validation is that with cross-validation, after guessing the "best" subset, we train the linear regression model for that subset using all n cases, but with split-sample validation, only t cases are ever used for training. If our main purpose were really to choose the "best" subset, I suspect we would still have to have t/n go to 0 even for split-sample validation. But choosing the "best" subset is not the same thing as getting the best generalization. If we are more interested in getting good generalization than in choosing the "best" subset, we do not want to make our regression estimate based on only t cases as bad as we do in cross-validation, because in split-sample validation that bad regression estimate is what we're stuck with. So there is no conflict between Shao and Kearns, but there is a conflict between the two goals of choosing the "best" subset and getting the best generalization in split-sample validation. Bootstrapping +++++++++++++ Bootstrapping seems to work better than cross-validation in many cases (Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data, you repeatedly analyze subsamples of the data. Each subsample is a random sample with replacement from the full sample. Depending on what you want to do, anywhere from 50 to 2000 subsamples might be used. There are many more sophisticated bootstrap methods that can be used not only for estimating generalization error but also for estimating confidence bounds for network outputs (Efron and Tibshirani 1993). For estimating generalization error in classification problems, the .632+ bootstrap (an improvement on the popular .632 bootstrap) is one of the currently favored methods that has the advantage of performing well even when there is severe overfitting. Use of bootstrapping for NNs is described in Baxt and White (1995), Tibshirani (1996), and Masters (1995). However, the results obtained so far are not very thorough, and it is known that bootstrapping does not work well for some other methodologies such as empirical decision trees (Breiman, Friedman, Olshen, and Stone, 1984; Kohavi, 1995), for which it can be excessively optimistic. For further information +++++++++++++++++++++++ Cross-validation and bootstrapping become considerably more complicated for time series data; see Hjorth (1994) and Snijders (1988). More information on jackknife and bootstrap confidence intervals is available at ftp://ftp.sas.com/pub/neural/jackboot.sas (this is a plain-text file). References: Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction", Neural Computation, 7, 624-638. Breiman, L. (1996), "Heuristics of instability and stabilization in model selection," Annals of Statistics, 24, 2350-2383. Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth. Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation in regression: The X-random case," International Statistical Review, 60, 291-319. Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical Implications, Proceedings of a workshop held in Groningen, The Netherlands, September 25-26, 1986, Berlin: Springer-Verlag. Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans, Philadelphia: SIAM. Efron, B. (1983), "Estimating the error rate of a prediction rule: Improvement on cross-validation," J. of the American Statistical Association, 78, 316-331. Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall. Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: The .632+ bootstrap method," J. of the American Statistical Association, 92, 548-560. Goutte, C. (1997), "Note on free lunches and cross-validation," Neural Computation, 9, 1211-1215, ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz. Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods Validation, Model Selection, and Bootstrap, London: Chapman & Hall. Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model selection in small samples," Biometrika, 76, 297-307. Kearns, M. (1997), "A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split," Neural Computation, 9, 1143-1161. Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI), pp. ?, http://robotics.stanford.edu/users/ronnyk/ Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufman, pp. 391-398. Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press. Shao, J. (1993), "Linear model selection by cross-validation," J. of the American Statistical Association, 88, 486-494. Shao, J. (1995), "An asymptotic theory for linear model selection," Statistica Sinica ?. Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer-Verlag. Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in time series," in Dijkstra (1988), pp. 56-69. Stone, M. (1977), "Asymptotics for and against cross-validation," Biometrika, 64, 29-35. Stone, M. (1979), "Comments on model selection criteria of Akaike and Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278. Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163. Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann. Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation," Neural Computation, 8, 1421-1426. ------------------------------------------------------------------------ Subject: How to compute prediction and confidence ================================================= intervals (error bars)? ======================= (This answer is only about half finished. I will get around to the other half eventually.) In addition to estimating over-all generalization error, it is often useful to be able to estimate the accuracy of the network's predictions for individual cases. Let: Y = the target variable y_i = the value of Y for the ith case X = the vector of input variables x_i = the value of X for the ith case N = the noise in the target variable n_i = the value of N for the ith case m(X) = E(Y|X) = the conditional mean of Y given X w = a vector of weights for a neural network w^ = the weight obtained via training the network p(X,w) = the output of a neural network given input X and weights w p_i = p(x_i,w) L = the number of training (learning) cases, (y_i,x_i), i=1, ..., L Q(w) = the objective function Assume the data are generated by the model: Y = m(X) + N E(N|X) = 0 N and X are independent The network is trained by attempting to minimize the objective function Q(w), which, for example, could be the sum of squared errors or the negative log likelihood based on an assumed family of noise distributions. Given a test input x_0, a 100c% prediction interval for y_0 is an interval [LPB_0,UPB_0] such that Pr(LPB_0 <= y_0 <= UPB_0) = c, where c is typically .95 or .99, and the probability is computed over repeated random selection of the training set and repeated observation of Y given the test input x_0. A 100c% confidence interval for p_0 is an interval [LCB_0,UCB_0] such that Pr(LCB_0 <= p_0 <= UCB_0) = c, where again the probability is computed over repeated random selection of the training set. Note that p_0 is a nonrandom quantity, since x_0 is given. A confidence interval is narrower than the corresponding prediction interval, since the prediction interval must include variation due to noise in y_0, while the confidence interval does not. Both intervals include variation due to sampling of the training set and possible variation in the training process due, for example, to random initial weights and local minima of the objective function. Traditional statistical methods for nonlinear models depend on several assumptions (Gallant, 1987): 1. The inputs for the training cases are either fixed or obtained by simple random sampling or some similarly well-behaved process. 2. Q(w) has continuous first and second partial derivatives with respect to w over some convex, bounded subset S_W of the weight space. 3. Q(w) has a unique global minimum at w^, which is an interior point of S_W. 4. The model is well-specified, which requires (a) that there exist weights w$ in the interior of S_W such that m(x) = p(x,w$), and (b) that the assumptions about the noise distribution are correct. (Sorry about the w$ notation, but I'm running out of plain text symbols.) These traditional methods are based on a linear approximation to p(x,w) in a neighborhood of w$, yielding a quadratic approximation to Q(w). Hence the Hessian of Q(w) (the square matrix of second-order partial derivatives with respect to w) frequently appears in these methods. Assumption (3) is not satisfied for neural nets, because networks with hidden units always have multiple global minima, and the global minima are often improper. Hence, confidence intervals for the weights cannot be obtained using standard Hessian-based methods. However, Hwang and Ding (1997) have shown that confidence intervals for predicted values can be obtained because the predicted values are statistically identified even though the weights are not. Cardell, Joerding, and Li (1994) describe a more serious violation of assumption (3), namely that that for some m(x), no finite global minimum exists. In such situations, it may be possible to use regularization methods such as weight decay to obtain valid confidence intervals (De Veaux, Schumi, Schweinsberg, and Ungar, 1998), but more research is required on this subject, since the derivation in the cited paper assumes a finite global minimum. For large samples, the sampling variability in w^ can be approximated in various ways: o Fisher's information matrix, which is the expected value of the Hessian of Q(w) divided by L, can be used when Q(w) is the negative log likelihood (Spall, 1998). o The delta method, based on the Hessian of Q(w) or the Gauss-Newton approximation using the cross-product Jacobian of Q(w), can also be used when Q(w) is the negative log likelihood (Tibshirani, 1996; Hwang and Ding, 1997; De Veaux, Schumi, Schweinsberg, and Ungar, 1998). o The sandwich estimator, a more elaborate Hessian-based method, relaxes assumption (4) (Gallant, 1987; White, 1989; Tibshirani, 1996). o Bootstrapping can be used without knowing the form of the noise distribution and takes into account variability introduced by local minima in training, but requires training the network many times on different resamples of the training set (Tibshirani, 1996; Heskes 1997). References: Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why some feedforward networks cannot learn some polynomials," Neural Computation, 6, 761-766. De Veaux,R.D., Schumi, J., Schweinsberg, J., and Ungar, L.H. (1998), "Prediction intervals for neural networks via nonlinear regression," Technometrics, 40, 273-282. Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley. Heskes, T. (1997), "Practical confidence and prediction intervals," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambrideg, MA: The MIT Press, pp. 176-182. Hwang, J.T.G., and Ding, A.A. (1997), "Prediction intervals for artificial neural networks," J. of the American Statistical Association, 92, 748-757. Nix, D.A., and Weigend, A.S. (1995), "Learning local error bars for nonlinear regression," in Tesauro, G., Touretzky, D., and Leen, T., (eds.) Advances in Neural Information Processing Systems 7, Cambridge, MA: The MIT Press, pp. 489-496. Spall, J.C. (1998), "Resampling-based calculation of the information matrix in nonlinear statistical models," Proceedings of the 4th Joint Conference on Information Sciences, October 23-28, Research Triangle PArk, NC, USA, Vol 4, pp. 35-39. Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163. White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models", J. of the American Statistical Assoc., 84, 1008-1013. ------------------------------------------------------------------------ Next part is part 4 (of 7). Previous part is part 2. -- Warren S. Sarle SAS Institute Inc. The opinions expressed here saswss@unx.sas.com SAS Campus Drive are mine and not necessarily (919) 677-8000 Cary, NC 27513, USA those of SAS Institute.
Section 4 of 4 - Prev - Next
All sections - 1 - 2 - 3 - 4
| Back to category neural-nets - Use Smart Search |
| Home - Smart Search - About the project - Feedback |
© allanswers.org | Terms of use