![]() |
| Home > Science > ai-faq > neural-nets > |
comp.ai.neural-nets FAQ, Part 3 of 7: Generalization |
Section 1 of 4 - Prev - Next
All sections - 1 - 2 - 3 - 4
Archive-name: ai-faq/neural-nets/part3
Last-modified: 2001-05-21
URL: ftp://ftp.sas.com/pub/neural/FAQ3.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)
Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC,
USA. Answers provided by other authors as cited below are copyrighted by
those authors, who by submitting the answers for the FAQ give permission for
the answer to be reproduced as part of the FAQ in any of the ways specified
in part 1 of the FAQ.
This is part 3 (of 7) of a monthly posting to the Usenet newsgroup
comp.ai.neural-nets. See the part 1 of this posting for full information
what it is all about.
========== Questions ==========
********************************
Part 1: Introduction
Part 2: Learning
Part 3: Generalization
How is generalization possible?
How does noise affect generalization?
What is overfitting and how can I avoid it?
What is jitter? (Training with noise)
What is early stopping?
What is weight decay?
What is Bayesian learning?
How to combine networks?
How many hidden layers should I use?
How many hidden units should I use?
How can generalization error be estimated?
What are cross-validation and bootstrapping?
How to compute prediction and confidence intervals (error bars)?
Part 4: Books, data, etc.
Part 5: Free software
Part 6: Commercial software
Part 7: Hardware and miscellaneous
------------------------------------------------------------------------
Subject: How is generalization possible?
=========================================
During learning, the outputs of a supervised neural net come to approximate
the target values given the inputs in the training set. This ability may be
useful in itself, but more often the purpose of using a neural net is to
generalize--i.e., to have the outputs of the net approximate target values
given inputs that are not in the training set. Generalizaton is not always
possible, despite the blithe assertions of some authors. For example,
Caudill and Butler, 1990, p. 8, claim that "A neural network is able to
generalize", but they provide no justification for this claim, and they
completely neglect the complex issues involved in getting good
generalization. Anyone who reads comp.ai.neural-nets is well aware from the
numerous posts pleading for help that artificial neural networks do not
automatically generalize.
Generalization requires prior knowledge, as pointed out by Hume (1739/1978),
Russell (1948), and Goodman (1954/1983) and rigorously proved by Wolpert
(1995a, 1996a, 1996b). For any practical application, you have to know what
the relevant inputs are (you can't simply include every imaginable input).
You have to know a restricted class of input-output functions that contains
an adequate approximation to the function you want to learn (you can't use a
learning method that is capable of fitting every imaginable function). And
you have to know that the cases you want to generalize to bear some
resemblance to the training cases. Thus, there are three conditions that are
typically necessary--although not sufficient--for good generalization:
1. The first necessary condition is that the inputs to the network contain
sufficient information pertaining to the target, so that there exists a
mathematical function relating correct outputs to inputs with the desired
degree of accuracy. You can't expect a network to learn a nonexistent
function--neural nets are not clairvoyant! For example, if you want to
forecast the price of a stock, a historical record of the stock's prices
is rarely sufficient input; you need detailed information on the
financial state of the company as well as general economic conditions,
and to avoid nasty surprises, you should also include inputs that can
accurately predict wars in the Middle East and earthquakes in Japan.
Finding good inputs for a net and collecting enough training data often
take far more time and effort than training the network.
2. The second necessary condition is that the function you are trying to
learn (that relates inputs to correct outputs) be, in some sense, smooth.
In other words, a small change in the inputs should, most of the time,
produce a small change in the outputs. For continuous inputs and targets,
smoothness of the function implies continuity and restrictions on the
first derivative over most of the input space. Some neural nets can learn
discontinuities as long as the function consists of a finite number of
continuous pieces. Very nonsmooth functions such as those produced by
pseudo-random number generators and encryption algorithms cannot be
generalized by neural nets. Often a nonlinear transformation of the input
space can increase the smoothness of the function and improve
generalization.
For classification, if you do not need to estimate posterior
probabilities, then smoothness is not theoretically necessary. In
particular, feedforward networks with one hidden layer trained by
minimizing the error rate (a very tedious training method) are
universally consistent classifiers if the number of hidden units grows at
a suitable rate relative to the number of training cases (Devroye,
Györfi, and Lugosi, 1996). However, you are likely to get better
generalization with realistic sample sizes if the classification
boundaries are smoother.
For Boolean functions, the concept of smoothness is more elusive. It
seems intuitively clear that a Boolean network with a small number of
hidden units and small weights will compute a "smoother" input-output
function than a network with many hidden units and large weights. If you
know a good reference characterizing Boolean functions for which good
generalization is possible, please inform the FAQ maintainer
(saswss@unx.sas.com).
3. The third necessary condition for good generalization is that the
training cases be a sufficiently large and representative subset
("sample" in statistical terminology) of the set of all cases that you
want to generalize to (the "population" in statistical terminology). The
importance of this condition is related to the fact that there are,
loosely speaking, two different types of generalization: interpolation
and extrapolation. Interpolation applies to cases that are more or less
surrounded by nearby training cases; everything else is extrapolation. In
particular, cases that are outside the range of the training data require
extrapolation. Cases inside large "holes" in the training data may also
effectively require extrapolation. Interpolation can often be done
reliably, but extrapolation is notoriously unreliable. Hence it is
important to have sufficient training data to avoid the need for
extrapolation. Methods for selecting good training sets are discussed in
numerous statistical textbooks on sample surveys and experimental design.
Thus, for an input-output function that is smooth, if you have a test case
that is close to some training cases, the correct output for the test case
will be close to the correct outputs for those training cases. If you have
an adequate sample for your training set, every case in the population will
be close to a sufficient number of training cases. Hence, under these
conditions and with proper training, a neural net will be able to generalize
reliably to the population.
If you have more information about the function, e.g. that the outputs
should be linearly related to the inputs, you can often take advantage of
this information by placing constraints on the network or by fitting a more
specific model, such as a linear model, to improve generalization.
Extrapolation is much more reliable in linear models than in flexible
nonlinear models, although still not nearly as safe as interpolation. You
can also use such information to choose the training cases more efficiently.
For example, with a linear model, you should choose training cases at the
outer limits of the input space instead of evenly distributing them
throughout the input space.
References:
Caudill, M. and Butler, C. (1990). Naturally Intelligent Systems. MIT
Press: Cambridge, Massachusetts.
Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of
Pattern Recognition, NY: Springer.
Goodman, N. (1954/1983), Fact, Fiction, and Forecast, 1st/4th ed.,
Cambridge, MA: Harvard University Press.
Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R. (1986),
Induction: Processes of Inference, Learning, and Discovery, Cambridge, MA:
The MIT Press.
Howson, C. and Urbach, P. (1989), Scientific Reasoning: The Bayesian
Approach, La Salle, IL: Open Court.
Hume, D. (1739/1978), A Treatise of Human Nature, Selby-Bigge, L.A.,
and Nidditch, P.H. (eds.), Oxford: Oxford University Press.
Plotkin, H. (1993), Darwin Machines and the Nature of Knowledge,
Cambridge, MA: Harvard University Press.
Russell, B. (1948), Human Knowledge: Its Scope and Limits, London:
Routledge.
Stone, C.J. (1977), "Consistent nonparametric regression," Annals of
Statistics, 5, 595-645.
Stone, C.J. (1982), "Optimal global rates of convergence for
nonparametric regression," Annals of Statistics, 10, 1040-1053.
Vapnik, V.N. (1995), The Nature of Statistical Learning Theory, NY:
Springer.
Wolpert, D.H. (1995a), "The relationship between PAC, the statistical
physics framework, the Bayesian framework, and the VC framework," in
Wolpert (1995b), 117-214.
Wolpert, D.H. (ed.) (1995b), The Mathematics of Generalization: The
Proceedings of the SFI/CNLS Workshop on Formal Approaches to
Supervised Learning, Santa Fe Institute Studies in the Sciences of
Complexity, Volume XX, Reading, MA: Addison-Wesley.
Wolpert, D.H. (1996a), "The lack of a priori distinctions between
learning algorithms," Neural Computation, 8, 1341-1390.
Wolpert, D.H. (1996b), "The existence of a priori distinctions between
learning algorithms," Neural Computation, 8, 1391-1420.
------------------------------------------------------------------------
Subject: How does noise affect generalization?
===============================================
"Statistical noise" means variation in the target values that is
unpredictable from the inputs of a specific network, regardless of the
architecture or weights. "Physical noise" refers to variation in the target
values that is inherently unpredictable regardless of what inputs are used.
Noise in the inputs usually refers to measurement error, so that if the same
object or example is presented to the network more than once, the input
values differ.
Noise in the actual data is never a good thing, since it limits the accuracy
of generalization that can be achieved no matter how extensive the training
set is. On the other hand, injecting artificial noise (jitter) into the
inputs during training is one of several ways to improve generalization for
smooth functions when you have a small training set.
Certain assumptions about noise are necessary for theoretical results.
Usually, the noise distribution is assumed to have zero mean and finite
variance. The noise in different cases is usually assumed to be independent
or to follow some known stochastic model, such as an autoregressive process.
The more you know about the noise distribution, the more effectively you can
train the network (e.g., McCullagh and Nelder 1989).
If you have noise in the target values, what the network learns depends
mainly on the error function. For example, if the noise is independent with
finite variance for all training cases, a network that is well-trained using
least squares will produce outputs that approximate the conditional mean of
the target values (White, 1990; Bishop, 1995). Note that for a binary 0/1
variable, the mean is equal to the probability of getting a 1. Hence, the
results in White (1990) immediately imply that for a categorical target with
independent noise using 1-of-C coding (see "How should categories be
encoded?"), a network that is well-trained using least squares will produce
outputs that approximate the posterior probabilities of each class (see
Rojas, 1996, if you want a simple explanation of why least-squares estimates
probabilities). Posterior probabilities can also be learned using
cross-entropy and various other error functions (Finke and Müller, 1994;
Bishop, 1995). The conditional median can be learned by least-absolute-value
training (White, 1992a). Conditional modes can be approximated by yet other
error functions (e.g., Rohwer and van der Rest 1996). For noise
distributions that cannot be adequately approximated by a single location
estimate (such as the mean, median, or mode), a network can be trained to
approximate quantiles (White, 1992a) or mixture components (Bishop, 1995;
Husmeier, 1999).
If you have noise in the target values, the mean squared generalization
error can never be less than the variance of the noise, no matter how much
training data you have. But you can estimate the mean of the target values,
conditional on a given set of input values, to any desired degree of
accuracy by obtaining a sufficiently large and representative training set,
assuming that the function you are trying to learn is one that can indeed be
learned by the type of net you are using, and assuming that the complexity
of the network is regulated appropriately (White 1990).
Noise in the target values increases the danger of overfitting (Moody 1992).
Noise in the inputs limits the accuracy of generalization, but in a more
complicated way than does noise in the targets. In a region of the input
space where the function being learned is fairly flat, input noise will have
little effect. In regions where that function is steep, input noise can
degrade generalization severely.
Furthermore, if the target function is Y=f(X), but you observe noisy inputs
X+D, you cannot obtain an arbitrarily accurate estimate of f(X) given X+D no
matter how large a training set you use. The net will not learn f(X), but
will instead learn a convolution of f(X) with the distribution of the noise
D (see "What is jitter?)"
For more details, see one of the statistically-oriented references on neural
nets such as:
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press, especially section 6.4.
Finke, M., and Müller, K.-R. (1994), "Estimating a-posteriori
probabilities using stochastic network models," in Mozer, Smolensky,
Touretzky, Elman, & Weigend, eds., Proceedings of the 1993 Connectionist
Models Summer School, Hillsdale, NJ: Lawrence Erlbaum Associates, pp.
324-331.
Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
Husmeier, D. (1999), Neural Networks for Conditional Probability
Estimation: Forecasting Beyond Point Predictions, Berlin: Springer
Verlag, ISBN 185233095.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
ed., London: Chapman & Hall.
Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
Generalization and Regularization in Nonlinear Learning Systems", in
Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural
Information Processing Systems 4, 847-854.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
Cambridge University Press.
Rohwer, R., and van der Rest, J.C. (1996), "Minimum description length,
regularization, and multimodal data," Neural Computation, 8, 595-609.
Rojas, R. (1996), "A short proof of the posterior probability property of
classifier neural networks," Neural Computation, 8, 41-43.
White, H. (1990), "Connectionist Nonparametric Regression: Multilayer
Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3,
535-550. Reprinted in White (1992).
White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles
Using Neural Networks," in Page, C. and Le Page, R. (eds.), Proceedings
of the 23rd Sympsium on the Interface: Computing Science and Statistics,
Alexandria, VA: American Statistical Association, pp. 190-199. Reprinted
in White (1992b).
White, H. (1992b), Artificial Neural Networks: Approximation and
Learning Theory, Blackwell.
------------------------------------------------------------------------
Subject: What is overfitting and how can I avoid it?
=====================================================
The critical issue in developing a neural network is generalization: how
well will the network make predictions for cases that are not in the
training set? NNs, like other flexible nonlinear estimation methods such as
kernel regression and smoothing splines, can suffer from either underfitting
or overfitting. A network that is not sufficiently complex can fail to
detect fully the signal in a complicated data set, leading to underfitting.
A network that is too complex may fit the noise, not just the signal,
leading to overfitting. Overfitting is especially dangerous because it can
easily lead to predictions that are far beyond the range of the training
data with many of the common types of NNs. Overfitting can also produce wild
predictions in multilayer perceptrons even with noise-free data.
For an elementary discussion of overfitting, see Smith (1996). For a more
rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992)
on the bias/variance trade-off (it's not really a dilemma). We are talking
about statistical bias here: the difference between the average value of an
estimator and the correct value. Underfitting produces excessive bias in the
outputs, whereas overfitting produces excessive variance. There are
graphical examples of overfitting and underfitting in Sarle (1995, 1999).
The best way to avoid overfitting is to use lots of training data. If you
have at least 30 times as many training cases as there are weights in the
network, you are unlikely to suffer from much overfitting, although you may
get some slight overfitting no matter how large the training set is. For
noise-free data, 5 times as many training cases as weights may be
sufficient. But you can't arbitrarily reduce the number of weights for fear
of underfitting.
Given a fixed amount of training data, there are at least six approaches to
avoiding underfitting and overfitting, and hence getting good
generalization:
o Model selection
o Jittering
o Early stopping
o Weight decay
o Bayesian learning
o Combining networks
The first five approaches are based on well-understood theory. Methods for
combining networks do not have such a sound theoretical basis but are the
subject of current research. These six approaches are discussed in more
detail under subsequent questions.
The complexity of a network is related to both the number of weights and the
size of the weights. Model selection is concerned with the number of
weights, and hence the number of hidden units and layers. The more weights
there are, relative to the number of training cases, the more overfitting
amplifies noise in the targets (Moody 1992). The other approaches listed
above are concerned, directly or indirectly, with the size of the weights.
Reducing the size of the weights reduces the "effective" number of
weights--see Moody (1992) regarding weight decay and Weigend (1994)
regarding early stopping. Bartlett (1997) obtained learning-theory results
in which generalization error is related to the L_1 norm of the weights
instead of the VC dimension.
Overfitting is not confined to NNs with hidden units. Overfitting can occur
in generalized linear models (networks with no hidden units) if either or
both of the following conditions hold:
1. The number of input variables (and hence the number of weights) is large
with respect to the number of training cases. Typically you would want at
least 10 times as many training cases as input variables, but with
noise-free targets, twice as many training cases as input variables would
be more than adequate. These requirements are smaller than those stated
above for networks with hidden layers, because hidden layers are prone to
creating ill-conditioning and other pathologies.
2. The input variables are highly correlated with each other. This condition
is called "multicollinearity" in the statistical literature.
Multicollinearity can cause the weights to become extremely large because
of numerical ill-conditioning--see "How does ill-conditioning affect NN
training?"
Methods for dealing with these problems in the statistical literature
include ridge regression (similar to weight decay), partial least squares
(similar to Early stopping), and various methods with even stranger names,
such as the lasso and garotte (van Houwelingen and le Cessie, ????).
References:
Bartlett, P.L. (1997), "For valid generalization, the size of the weights
is more important than the size of the network," in Mozer, M.C., Jordan,
M.I., and Petsche, T., (eds.) Advances in Neural Information Processing
Systems 9, Cambrideg, MA: The MIT Press, pp. 134-140.
Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
Generalization and Regularization in Nonlinear Learning Systems", in
Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural
Information Processing Systems 4, 847-854.
Sarle, W.S. (1995), "Stopped Training and Other Remedies for
Overfitting," Proceedings of the 27th Symposium on the Interface of
Computing Science and Statistics, 352-360,
ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
compressed postscript file, 747K, 10 pages)
Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results,"
ftp://ftp.sas.com/pub/neural/dojo/dojo.html
Smith, M. (1996). Neural Networks for Statistical Modeling, Boston:
International Thomson Computer Press, ISBN 1-850-32842-0.
van Houwelingen,H.C., and le Cessie, S. (????), "Shrinkage and penalized
likelihood as methods to improve predictive accuracy,"
http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/shrinkage.pdf and
http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/figures.pdf
Weigend, A. (1994), "On overfitting and the effective number of hidden
units," Proceedings of the 1993 Connectionist Models Summer School,
335-342.
------------------------------------------------------------------------
Subject: What is jitter? (Training with noise)
===============================================
Jitter is artificial noise deliberately added to the inputs during training.
Training with jitter is a form of smoothing related to kernel regression
(see "What is GRNN?"). It is also closely related to regularization methods
such as weight decay and ridge regression.
Training with jitter works because the functions that we want NNs to learn
are mostly smooth. NNs can learn functions with discontinuities, but the
functions must be piecewise continuous in a finite number of regions if our
network is restricted to a finite number of hidden units.
In other words, if we have two cases with similar inputs, the desired
outputs will usually be similar. That means we can take any training case
and generate new training cases by adding small amounts of jitter to the
inputs. As long as the amount of jitter is sufficiently small, we can assume
that the desired output will not change enough to be of any consequence, so
we can just use the same target value. The more training cases, the merrier,
so this looks like a convenient way to improve training. But too much jitter
will obviously produce garbage, while too little jitter will have little
effect (Koistinen and Holmström 1992).
Consider any point in the input space, not necessarily one of the original
training cases. That point could possibly arise as a jittered input as a
result of jittering any of several of the original neighboring training
cases. The average target value at the given input point will be a weighted
average of the target values of the original training cases. For an infinite
number of jittered cases, the weights will be proportional to the
probability densities of the jitter distribution, located at the original
training cases and evaluated at the given input point. Thus the average
target values given an infinite number of jittered cases will, by
definition, be the Nadaraya-Watson kernel regression estimator using the
jitter density as the kernel. Hence, training with jitter is an
approximation to training with the kernel regression estimator as target.
Choosing the amount (variance) of jitter is equivalent to choosing the
bandwidth of the kernel regression estimator (Scott 1992).
When studying nonlinear models such as feedforward NNs, it is often helpful
first to consider what happens in linear models, and then to see what
difference the nonlinearity makes. So let's consider training with jitter in
a linear model. Notation:
x_ij is the value of the jth input (j=1, ..., p) for the
ith training case (i=1, ..., n).
X={x_ij} is an n by p matrix.
y_i is the target value for the ith training case.
Y={y_i} is a column vector.
Without jitter, the least-squares weights are B = inv(X'X)X'Y, where
"inv" indicates a matrix inverse and "'" indicates transposition. Note that
if we replicate each training case c times, or equivalently stack c copies
of the X and Y matrices on top of each other, the least-squares weights are
inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before.
With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ...,
c, where z_ijk is produced by some random number generator, usually with
a normal distribution with mean 0 and standard deviation s, and the
z_ijk's are all independent. In place of the n by p matrix X, this
gives us a big matrix, say Q, with cn rows and p columns. To compute the
least-squares weights, we need Q'Q. Let's consider the jth diagonal
element of Q'Q, which is
2 2 2
sum (x_ij+z_ijk) = sum (x_ij + z_ijk + 2 x_ij z_ijk)
i,k i,k
which is approximately, for c large,
2 2
c(sum x_ij + ns )
i
which is c times the corresponding diagonal element of X'X plus ns^2.
Now consider the u,vth off-diagonal element of Q'Q, which is
sum (x_iu+z_iuk)(x_iv+z_ivk)
i,k
which is approximately, for c large,
c(sum x_iu x_iv)
i
which is just c times the corresponding element of X'X. Thus, Q'Q equals
c(X'X+ns^2I), where I is an identity matrix of appropriate size.
Similar computations show that the crossproduct of Q with the target values
is cX'Y. Hence the least-squares weights with jitter of variance s^2 are
given by
2 2 2
B(ns ) = inv(c(X'X+ns I))cX'Y = inv(X'X+ns I)X'Y
In the statistics literature, B(ns^2) is called a ridge regression
estimator with ridge value ns^2.
If we were to add jitter to the target values Y, the cross-product X'Y
would not be affected for large c for the same reason that the off-diagonal
elements of X'X are not afected by jitter. Hence, adding jitter to the
targets will not change the optimal weights; it will just slow down training
(An 1996).
The ordinary least squares training criterion is (Y-XB)'(Y-XB).
Weight decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B,
where d is the decay rate. Weight decay can also be implemented by
inventing artificial training cases. Augment the training data with p new
training cases containing the matrix dI for the inputs and a zero vector
for the targets. To put this in a formula, let's use A;B to indicate the
matrix A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D.
Thus the augmented inputs are X;dI and the augmented targets are Y;0,
where 0 indicates the zero vector of the appropriate size. The squared error
for the augmented training data is:
(Y;0-(X;dI)B)'(Y;0-(X;dI)B)
= (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
= Y'Y - 2Y'XB + B'(X'X+d^2I)B
= Y'Y - 2Y'XB + B'X'XB + B'(d^2I)B
= (Y-XB)'(Y-XB)+d^2B'B
which is the weight-decay training criterion. Thus the weight-decay
estimator is:
inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2I)X'Y
which is the same as the jitter estimator B(d^2), i.e. jitter with
variance d^2/n. The equivalence between the weight-decay estimator and
the jitter estimator does not hold for nonlinear models unless the jitter
variance is small relative to the curvature of the nonlinear function (An
1996). However, the equivalence of the two estimators for linear models
suggests that they will often produce similar results even for nonlinear
models. Details for nonlinear models, including classification problems, are
given in An (1996).
B(0) is obviously the ordinary least-squares estimator. It can be shown
that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in
other words, adding jitter causes the weights to shrink. It can also be
shown that under the usual statistical assumptions, there always exists some
value of ns^2 > 0 such that B(ns^2) provides better expected
generalization than B(0). Unfortunately, there is no way to calculate a
value of ns^2 from the training data that is guaranteed to improve
generalization. There are other types of shrinkage estimators called Stein
estimators that do guarantee better generalization than B(0), but I'm not
aware of a nonlinear generalization of Stein estimators applicable to neural
networks.
The statistics literature describes numerous methods for choosing the ridge
value. The most obvious way is to estimate the generalization error by
cross-validation, generalized cross-validation, or bootstrapping, and to
choose the ridge value that yields the smallest such estimate. There are
also quicker methods based on empirical Bayes estimation, one of which
yields the following formula, useful as a first guess:
2 p(Y-XB(0))'(Y-XB(0))
s = --------------------
1 n(n-p)B(0)'B(0)
You can iterate this a few times:
2 p(Y-XB(0))'(Y-XB(0))
s = --------------------
l+1 2 2
n(n-p)B(s )'B(s )
l l
Note that the more training cases you have, the less noise you need.
References:
An, G. (1996), "The effects of adding noise during backpropagation
training on a generalization performance," Neural Computation, 8,
643-674.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Section 1 of 4 - Prev - Next
All sections - 1 - 2 - 3 - 4
| Back to category neural-nets - Use Smart Search |
| Home - Smart Search - About the project - Feedback |
© allanswers.org | Terms of use