![]() |
| Home > Science > ai-faq > neural-nets > |
comp.ai.neural-nets FAQ, Part 2 of 7: Learning |
Section 6 of 6 - Prev - Next
All sections - 1 - 2 - 3 - 4 - 5 - 6
Hebbian learning is the other most common variety of unsupervised learning
(Hertz, Krogh, and Palmer 1991). Hebbian learning minimizes the same error
function as an auto-associative network with a linear hidden layer, trained
by least squares, and is therefore a form of dimensionality reduction. This
error function is equivalent to the sum of squared distances between each
training case and a linear subspace of the input space (with distances
measured perpendicularly), and is minimized by the leading principal
components (Pearson 1901; Hotelling 1933; Rao 1964; Joliffe 1986; Jackson
1991; Diamantaras and Kung 1996). There are variations of Hebbian learning
that explicitly produce the principal components (Hertz, Krogh, and Palmer
1991; Karhunen 1994; Deco and Obradovic 1996; Diamantaras and Kung 1996).
Perhaps the most novel form of unsupervised learning in the NN literature is
Kohonen's self-organizing (feature) map (SOM, Kohonen 1995). SOMs combine
competitive learning with dimensionality reduction by smoothing the clusters
with respect to an a priori grid (see "How many kinds of Kohonen networks
exist?") for more explanation). But Kohonen's original SOM algorithm does
not optimize an "energy" function (Erwin et al., 1992; Kohonen 1995, pp.
126, 237). The SOM algorithm involves a trade-off between the accuracy of
the quantization and the smoothness of the topological mapping, but there is
no explicit combination of these two properties into an energy function.
Hence Kohonen's SOM is not simply an information-compression method like
most other unsupervised learning networks. Neither does Kohonen's SOM have a
clear interpretation as a density estimation method. Convergence of
Kohonen's SOM algorithm is allegedly demonstrated by Yin and Allinson
(1995), but their "proof" assumes the neighborhood size becomes zero, in
which case the algorithm reduces to VQ and no longer has topological
ordering properties (Kohonen 1995, p. 111). The best explanation of what a
Kohonen SOM learns seems to be provided by the connection between SOMs and
principal curves and surfaces explained by Mulier and Cherkassky (1995) and
Ritter, Martinetz, and Schulten (1992). For further explanation, see "How
many kinds of Kohonen networks exist?"
A variety of energy functions for SOMs have been proposed (e.g., Luttrell,
1994), some of which show a connection between SOMs and multidimensional
scaling (Goodhill and Sejnowski 1997). There are also other approaches to
SOMs that have clearer theoretical justification using mixture models with
Bayesian priors or constraints (Utsugi, 1996, 1997; Bishop, Svensén, and
Williams, 1997).
For additional references on cluster analysis, see
ftp://ftp.sas.com/pub/neural/clus_bib.txt.
References:
Anderberg, M.R. (1973), Cluster Analysis for Applications, New York:
Academic Press, Inc.
Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A
study of the classification capabilities of neural networks using
unsupervised learning: A comparison with k-means clustering",
Psychometrika, 59, 509-525.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press.
Bishop, C.M., Svensén, M., and Williams, C.K.I (1997), "GTM: A principled
alternative to the self-organizing map," in Mozer, M.C., Jordan, M.I.,
and Petsche, T., (eds.) Advances in Neural Information Processing
Systems 9, Cambrideg, MA: The MIT Press, pp. 354-360. Also see
http://www.ncrg.aston.ac.uk/GTM/
Deco, G. and Obradovic, D. (1996), An Information-Theoretic Approach to
Neural Computing, NY: Springer-Verlag.
Diamantaras, K.I., and Kung, S.Y. (1996) Principal Component Neural
Networks: Theory and Applications, NY: Wiley.
Erwin, E., Obermayer, K., and Schulten, K. (1992), "Self-organizing maps:
Ordering, convergence properties and energy functions," Biological
Cybernetics, 67, 47-55.
Flury, B. (1990), "Principal points," Biometrika, 77, 33-41.
Flury, B. (1993), "Estimation of principal points," Applied Statistics,
42, 139-151.
Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal
Compression, Boston: Kluwer Academic Publishers.
Goodhill, G.J., and Sejnowski, T.J. (1997), "A unifying objective
function for topographic mappings," Neural Computation, 9, 1291-1303.
Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley.
Hartigan, J.A., and Wong, M.A. (1979), "Algorithm AS136: A k-means
clustering algorithm," Applied Statistics, 28-100-108.
Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.
Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of
Neural Computation. Addison-Wesley: Redwood City, California.
Hotelling, H. (1933), "Analysis of a Complex of Statistical Variables
into Principal Components," Journal of Educational Psychology, 24,
417-441, 498-520.
Ismail, M.A., and Kamel, M.S. (1989), "Multidimensional data clustering
utilizing hybrid search strategies," Pattern Recognition, 22, 75-89.
Jackson, J.E. (1991), A User's Guide to Principal Components, NY: Wiley.
Jolliffe, I.T. (1986), Principal Component Analysis, Springer-Verlag.
Karhunen, J. (1994), "Stability of Oja's PCA subspace rule," Neural
Computation, 6, 739-747.
Kohonen, T. (1995/1997), Self-Organizing Maps, Berlin: Springer-Verlag.
Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
N.J.: Prentice-Hall.
Linde, Y., Buzo, A., and Gray, R. (1980), "An algorithm for vector
quantizer design," IEEE Transactions on Communications, 28, 84-95.
Lloyd, S. (1982), "Least squares quantization in PCM," IEEE Transactions
on Information Theory, 28, 129-137.
Luttrell, S.P. (1994), "A Bayesian analysis of self-organizing maps,"
Neural Computation, 6, 767-794.
McLachlan, G.J. and Basford, K.E. (1988), Mixture Models, NY: Marcel
Dekker, Inc.
MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of
Multivariate Observations,"Proceedings of the Fifth Berkeley Symposium on
Mathematical Statistics and Probability, 1, 281-297.
Max, J. (1960), "Quantizing for minimum distortion," IEEE Transactions on
Information Theory, 6, 7-12.
Mulier, F. and Cherkassky, V. (1995), "Self-Organization as an Iterative
Kernel Smoothing Process," Neural Computation, 7, 1165-1177.
Pearson, K. (1901) "On Lines and Planes of Closest Fit to Systems of
Points in Space," Phil. Mag., 2(6), 559-572.
Rao, C.R. (1964), "The Use and Interpretation of Principal Component
Analysis in Applied Research," Sankya A, 26, 329-358.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
Cambridge University Press.
Ritter, H., Martinetz, T., and Schulten, K. (1992), Neural Computation
and Self-Organizing Maps: An Introduction, Reading, MA: Addison-Wesley.
Sarle, W.S. (1994), "Neural Networks and Statistical Models," in SAS
Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group
International Conference, Cary, NC: SAS Institute Inc., pp 1538-1550,
ftp://ftp.sas.com/pub/neural/neural1.ps.
Tarpey, T., Luning, L>, and Flury, B. (1994), "Principal points and
self-consistent points of elliptical distributions," Annals of
Statistics, ?.
Utsugi, A. (1996), "Topology selection for self-organizing maps,"
Network: Computation in Neural Systems, 7, 727-740,
http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html
Utsugi, A. (1997), "Hyperparameter selection for self-organizing maps,"
Neural Computation, 9, 623-635, available on-line at
http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html
Yin, H. and Allinson, N.M. (1995), "On the Distribution and Convergence
of Feature Space in Self-Organizing Maps," Neural Computation, 7,
1178-1187.
Zeger, K., Vaisey, J., and Gersho, A. (1992), "Globally optimal vector
quantizer design by stochastic relaxation," IEEE Transactions on Signal
Procesing, 40, 310-322.
------------------------------------------------------------------------
Subject: Help! My NN won't learn! What should I do?
====================================================
The following advice is intended for inexperienced users. Experts may try
more daring methods.
If you are using a multilayer perceptron (MLP):
o Check data for outliers. Transform variables or delete bad cases as
appropriate to the purpose of the analysis.
o Standardize quantitative inputs as described in "Should I standardize the
input variables?"
o Encode categorical inputs as described in "How should categories be
encoded?"
o Make sure you have more training cases than the total number of input
units. The number of training cases required depends on the amount of
noise in the targets and the complexity of the function you are trying to
learn, but as a starting point, it's a good idea to have at least 10
times as many training cases as input units. This may not be enough for
highly complex functions. For classification problems, the number of
cases in the smallest class should be at least several times the number
of input units.
o If the target is:
o quantitative, then it is usually a good idea to standardize the target
variable as described in "Should I standardize the target variables?"
Use an identity (usually called "linear") output activation function.
o binary, then use 0/1 coding and a logistic output activation function.
o categorical with 3 or more categories, then use 1-of-C encoding as
described in "How should categories be encoded?" and use a softmax
output activation function as described in "What is a softmax
activation function?"
o Use a tanh (hyperbolic tangent) activation function for the hidden units.
See "Why use activation functions?" for more information.
o Use a bias term (sometimes called a "threshold") in every hidden and
output unit. See "Why use a bias/threshold?" for an explanation of why
biases are important.
o When the network has hidden units, the results of training may depend
critically on the random initial weights. You can set each initial weight
(including biases) to a random number such as any of the following:
o A uniform random variable between -2 and 2.
o A uniform random variable between -0.2 and 0.2.
o A normal random variable with a mean of 0 and a standard deviation of
1.
o A normal random variable with a mean of 0 and a standard deviation of
0.1.
If any layer in the network has a large number of units, you will need to
adjust the initial weights (not including biases) of the connections from
the large layer to subsequent layers. Generate random initial weights as
described above, but then divide each of these random weights by the
square root of the number of units in the large layer. More sophisticated
methods are described by Bishop (1995).
Train the network using several (anywhere from 10 to 1000) different sets
of random initial weights. For the operational network, you can either
use the weights that produce the smallest training error, or combine
several trained networks as described in "How to combine networks?"
o If possible, use conventional numerical optimization techniques as
described in "What are conjugate gradients, Levenberg-Marquardt, etc.?"
If those techniques are unavailable in the software you are using, get
better software. If you can't get better software, use RPROP or Quickprop
as described in "What is backprop?" Only as a last resort should you use
standard backprop.
o Use batch training, because there are fewer mistakes that can be made
with batch training than with incremental (sometimes called "online")
training. If you insist on using incremental training, present the
training cases to the network in random order. For more details, see
"What are batch, incremental, on-line, off-line, deterministic,
stochastic, adaptive, instantaneous, pattern, epoch, constructive, and
sequential learning?"
o If you have to use standard backprop, you must set the learning rate by
trial and error. Experiment with different learning rates. If the weights
and errors change very slowly, try higher learning rates. If the weights
fluctuate wildly and the error increases during training, try lower
learning rates. If you follow all the instructions given above, you could
start with a learning rate of .1 for batch training or .01 for
incremental training.
Momentum is not as critical as learning rate, but to be safe, set the
momentum to zero. A larger momentum requires a smaller learning rate.
For more details, see What learning rate should be used for backprop?"
o Use a separate test set to estimate generalization error. If the test
error is much higher than the training error, the network is probably
overfitting. Read Part 3: Generalization of the FAQ and use one of the
methods described there to improve generalization, such as early
stopping, weight decay, or Bayesian learning.
o Start with one hidden layer.
For a classification problem with many categories, start with one unit in
the hidden layer; otherwise, start with zero hidden units. Train the
network, add one or few hidden units, retrain the network, and repeat.
When you get overfitting, stop adding hidden units. For more information
on the number of hidden layers and hidden units, see "How many hidden
layers should I use?" and "How many hidden units should I use?" in Part 3
of the FAQ.
If the generalization error is still not satisfactory, you can try:
o adding a second hidden layer
o using an RBF network
o transforming the input variables
o deleting inputs that are not useful
o adding new input variables
o getting more training cases
o etc.
If you are writing your own software, the opportunities for mistakes are
limitless. Perhaps the most critical thing for gradient-based algorithms
such as backprop is that you compute the gradient (partial derivatives)
correctly. The usual backpropagation algorithm will give you the partial
derivatives of the objective function with respect to each weight in the
network. You can check these partial derivatives by using finite-difference
approximations (Gill, Murray, and Wright, 1981) as follows:
1. Be sure to standardize the variables as described above.
2. Initialize the weights W as described above. For convenience of
notation, let's arrange all the weights in one long vector so we can use
a single subsbcript i to refer to different weights W_i. Call the
entire set of values of the initial weights w0. So W is a vector of
variables, and w0 is a vector of values of those variables.
3. Let's use the symbol F(W) to indicate the objective function you are
trying to optimize with respect to the weights. If you are using batch
training, F(W) is computed over the entire training set. If you are
using incremental training, choose any one training case and compute
F(W) for that single training case; use this same training case for all
the following steps.
4. Pick any one weight W_i. Initially, W_i = w0_i.
5. Choose a constant called h with a value anywhere from .0001 to
.00000001.
6. Change the value of W_i from w0_i to w0_i + h. Do not change any
of the other weights. Compute the value of the objective function f1 =
F(W) using this modified value of W_i.
7. Change the value of W_i to w0_i - h. Do not change any of the other
weights. Compute another new value of the objective function f2 =
F(W).
8. The central finite difference approximation to the partial derivative for
W_i is (f2-f1)/(2h). This value should usually be within about
10% of the partial derivative computed by backpropagation, except for
derivatives close to zero. If the finite difference approximation is very
different from the partial derivative computed by backpropagation, try a
different value of h. If no value of h provides close agreement between
the finite difference approximation and the partial derivative computed
by backpropagation, you probably have a bug.
9. Repeat the above computations for each weight W_i for i=1, 2, 3,
... up to the total number of weights.
References:
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press.
Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization,
Academic Press: London.
------------------------------------------------------------------------
Next part is part 3 (of 7). Previous part is part 1.
--
Warren S. Sarle SAS Institute Inc. The opinions expressed here
saswss@unx.sas.com SAS Campus Drive are mine and not necessarily
(919) 677-8000 Cary, NC 27513, USA those of SAS Institute.
Section 6 of 6 - Prev - Next
All sections - 1 - 2 - 3 - 4 - 5 - 6
| Back to category neural-nets - Use Smart Search |
| Home - Smart Search - About the project - Feedback |
© allanswers.org | Terms of use