allanswers.org - comp.ai.neural-nets FAQ, Part 2 of 7: Learning

 Home >  Scienceai-faqneural-nets >

comp.ai.neural-nets FAQ, Part 2 of 7: Learning

Section 1 of 6 - Prev - Next
All sections - 1 - 2 - 3 - 4 - 5 - 6


Archive-name: ai-faq/neural-nets/part2
Last-modified: 2002-10-11
URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)

Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC,
USA. Answers provided by other authors as cited below are copyrighted by
those authors, who by submitting the answers for the FAQ give permission for
the answer to be reproduced as part of the FAQ in any of the ways specified
in part 1 of the FAQ. 

This is part 2 (of 7) of a monthly posting to the Usenet newsgroup
comp.ai.neural-nets. See the part 1 of this posting for full information
what it is all about.

========== Questions ========== 
********************************

Part 1: Introduction
Part 2: Learning

   What are combination, activation, error, and objective functions?
      Combination functions
      Activation functions
      Error functions
      Objective functions
   What are batch, incremental, on-line, off-line, deterministic,
   stochastic, adaptive, instantaneous, pattern, epoch, constructive, and
   sequential learning?
      Batch vs. Incremental Learning (also Instantaneous, Pattern, and
      Epoch)
      On-line vs. Off-line Learning
      Deterministic, Stochastic, and Adaptive Learning
      Constructive Learning (Growing networks)
      Sequential Learning, Catastrophic Interference, and the
      Stability-Plasticity Dilemma
   What is backprop?
   What learning rate should be used for backprop?
   What are conjugate gradients, Levenberg-Marquardt, etc.?
   How does ill-conditioning affect NN training?
   How should categories be encoded?
   Why not code binary inputs as 0 and 1?
   Why use a bias/threshold?
   Why use activation functions?
   How to avoid overflow in the logistic function?
   What is a softmax activation function?
   What is the curse of dimensionality?
   How do MLPs compare with RBFs?
      Hybrid training and the curse of dimensionality
      Additive inputs
      Redundant inputs
      Irrelevant inputs
   What are OLS and subset/stepwise regression?
   Should I normalize/standardize/rescale the data?
      Should I standardize the input variables?
      Should I standardize the target variables?
      Should I standardize the variables for unsupervised learning?
      Should I standardize the input cases?
   Should I nonlinearly transform the data?
   How to measure importance of inputs?
   What is ART?
   What is PNN?
   What is GRNN?
   What does unsupervised learning learn?
   Help! My NN won't learn! What should I do?

Part 3: Generalization
Part 4: Books, data, etc.
Part 5: Free software
Part 6: Commercial software
Part 7: Hardware and miscellaneous

------------------------------------------------------------------------

Subject: What are combination, activation, error, and
=====================================================
objective functions? 
=====================

Most neural networks involve combination, activation, error, and objective
functions. 

Combination functions
+++++++++++++++++++++

Each non-input unit in a neural network combines values that are fed into it
via synaptic connections from other units, producing a single value called
the "net input". There is no standard term in the NN literature for the
function that combines values. In this FAQ, it will be called the
"combination function". The combination function is a vector-to scalar
function. Most NNs use either a linear combination function (as in MLPs) or
a Euclidean distance combination function (as in RBF networks). There is a
detailed discussion of networks using these two kinds of combination
function under "How do MLPs compare with RBFs?" 

Activation functions
++++++++++++++++++++

Most units in neural networks transform their net input by using a
scalar-to-scalar function called an "activation function", yielding a value
called the unit's "activation". Except possibly for output units, the
activation value is fed via synaptic connections to one or more other units.
The activation function is sometimes called a "transfer", and activation
functions with a bounded range are often called "squashing" functions, such
as the commonly used tanh (hyperbolic tangent) and logistic (1/(1+exp(-x))))
functions. If a unit does not transform its net input, it is said to have an
"identity" or "linear" activation function. The reason for using
non-identity activation functions is explained under "Why use activation
functions?" 

Error functions
+++++++++++++++

Most methods for training supervised networks require a measure of the
discrepancy between the networks output value and the target (desired
output) value (even unsupervised networks may require such a measure of
discrepancy--see "What does unsupervised learning learn?"). 

Let: 

 o j be an index for cases 
 o X or X_j be an input vector 
 o W be a collection (vector, matrix, or some more complicated structure)
   of weights and possibly other parameter estimates 
 o y or y_j be a target scalar 
 o M(X,W) be the output function computed by the network (the letter M
   is used to suggest "mean", "median", or "mode") 
 o p or p_j = M(X_j,W) be an output (the letter p is used to suggest
   "predicted value" or "posterior probability") 
 o r or r_j = y_j - p_j be a residual 
 o Q(y,X,W) be the case-wise error function written to show the
   dependence on the weights explicitly 
 o L(y,p) be the case-wise error function in simpler form where the
   weights are implicit (the letter L is used to suggest "loss" function) 
 o D be a list of indices designating a data set, including inputs and
   target values 
    o DL designate the training (learning) set 
    o DV designate the validation set 
    o DT designate the test set 
 o #(D) be the number of elements (cases) in D 
    o NL be the number of cases in the training (learning) set 
    o NV be the number of cases in the validation set 
    o NT be the number of cases in the test set 
 o TQ(D,W) be the total error function 
 o AQ(D,W) be the average error function 

The difference between the target and output values for case j, r_j =
y_j - p_j, is called the "residual" or "error". This is NOT the
"error function"! Note that the residual can be either positive or negative,
and negative residuals with large absolute values are typically considered
just as bad as large positive residuals. Error functions, on the other hand,
are defined so that bigger is worse. 

Usually, an error function Q(y,X,W) is applied to each case and is
defined in terms of the target and output values Q(y,X,W) =
L(y,M(X,W)) = L(y,p). Error functions are also called "loss"
functions, especially when the two usages of the term "error" would sound
silly when used together. For example, instead of the awkward phrase
"squared-error error", you would typically use "squared-error loss" to mean
an error function equal to the squared residual, L(y,p) = (y -
p)^2. Another common error function is the classification loss for a
binary target y in {0, 1}: 

   L(y,p) = 0 if |y-p| < 0.5
            1 otherwise

The error function for an entire data set is usually defined as the sum of
the case-wise error functions for all the cases in a data set: 

   TQ(D,W) =  sum    Q(y_j,X_j,W)
             j in D

Thus, for squared-error loss, the total error is the sum of squared errors
(i.e., residuals), abbreviated SSE. For classification loss, the total error
is the number of misclassified cases. 

It is often more convenient to work with the average (i.e, arithmetic mean)
error: 

   AQ(D,W) = TQ(D,W)/#(D)

For squared-error loss, the average error is the mean or average of squared
errors (i.e., residuals), abbreviated MSE or ASE (statisticians have a
slightly different meaning for MSE in linear models, 
TQ(D,W)/[#(D)-#(W)] ). For classification loss, the average error
is the proportion of misclassified cases. The average error is also called
the "empirical risk." 

Using the average error instead of the total error is especially convenient
when using batch backprop-type training methods where the user must supply a
learning rate to multiply by the negative gradient to compute the change in
the weights. If you use the gradient of the average error, the choice of
learning rate will be relatively insensitive to the number of training
cases. But if you use the gradient of the total error, you must use smaller
learning rates for larger training sets. For example, consider any training
set DL_1 and a second training set DL_2 created by duplicating every
case in DL_1. For any set of weights, DL_1 and DL_2 have the same
average error, but the total error of DL_2 is twice that of DL_1. Hence
the gradient of the total error of DL_2 is twice the gradient for DL_1.
So if you use the gradient of the total error, the learning rate for DL_2
should be half the learning rate for DL_1. But if you use the gradient of
the average error, you can use the same learning rate for both training
sets, and you will get exactly the same results from batch training. 

The term "error function" is commonly used to mean any of the functions, 
Q(y,X,W), L(y,p), TQ(D,W), or AQ(D,W). You can usually tell
from the context which function is the intended meaning. The term "error
surface" refers to TQ(D,W) or AQ(D,W) as a function of W. 

Objective functions
+++++++++++++++++++

The objective function is what you directly try to minimize during training.

Neural network training is often performed by trying to minimize the total
error TQ(DL,W) or, equivalently, the average error AQ(DL,W) for the
training set, as a function of the weights W. However, as discussed in Part
3 of the FAQ, minimizing training error can lead to overfitting and poor
generalization if the number of training cases is small relative to the
complexity of the network. A common approach to improving generalization
error is regularization, i.e., trying to minimize an objective function that
is the sum of the total error function and a regularization function. The
regularization function is a function of the weights W or of the output
function M(X,W). For example, in weight decay, the regularization
function is the sum of squared weights. A crude form of Bayesian learning
can be done using a regularization function that is the log of the prior
density of the weights (weight decay is a special case of this). For more
information on regularization, see Part 3 of the FAQ. 

If no regularization function is used, the objective function is equal to
the total or average error function (or perhaps some other monotone function
thereof). 

------------------------------------------------------------------------

Subject: What are batch, incremental, on-line, off-line,
========================================================
deterministic, stochastic, adaptive, instantaneous,
===================================================
pattern, constructive, and sequential learning? 
================================================

There are many ways to categorize learning methods. The distinctions are
overlapping and can be confusing, and the terminology is used very
inconsistently. This answer attempts to impose some order on the chaos,
probably in vain. 

Batch vs. Incremental Learning (also Instantaneous, Pattern, and
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Epoch)
++++++

Batch learning proceeds as follows: 

   Initialize the weights.
   Repeat the following steps:
      Process all the training data.
      Update the weights.

Incremental learning proceeds as follows: 

   Initialize the weights.
   Repeat the following steps:
      Process one training case.
      Update the weights.

In the above sketches, the exact meaning of "Process" and "Update" depends
on the particular training algorithm and can be quite complicated for
methods such as Levenberg-Marquardt (see "What are conjugate gradients,
Levenberg-Marquardt, etc.?"). Standard backprop (see What is backprop?) is
quite simple, though. Batch standard backprop (without momentum) proceeds as
follows: 

   Initialize the weights W.
   Repeat the following steps:
      Process all the training data DL to compute the gradient
         of the average error function AQ(DL,W).
      Update the weights by subtracting the gradient times the
         learning rate.

Incremental standard backprop (without momentum) can be done as follows: 

   Initialize the weights W.
   Repeat the following steps for j = 1 to NL:
      Process one training case (y_j,X_j) to compute the gradient
         of the error (loss) function Q(y_j,X_j,W).
      Update the weights by subtracting the gradient times the
         learning rate.

Alternatively, the index j can be chosen randomly each type the loop is
executed, or j can be chosen from a random permutation. 

The question of when to stop training is very complicated. Some of the
possibilities are: 

 o Stop when the average error function for the training set becomes small. 
 o Stop when the gradient of the average error function for the training set
   becomes small. 
 o Stop when the average error function for the validation set starts to go
   up, and use the weights from the step that yielded the smallest
   validation error. For details, see "What is early stopping?" 
 o Stop when your boredom level is no longer tolerable. 

It is very important NOT to use the following method--which does not work
--but is often mistakenly used by beginners: 

   Initialize the weights W.
   Repeat the following steps for j = 1 to NL:
      Repeat the following steps until Q(y_j,X_j,W) is small: 
         Compute the gradient of Q(y_j,X_j,W).
         Update the weights by subtracting the gradient times the
            learning rate.

The reason this method does not work is that by the time you have finished
processing the second case, the network will have forgotten what it learned
about the first case, and when you are finished with the third case, the
network will have forgotten what it learned about the first two cases, and
so on. If you really need to use a method like this, see the section below
on "Sequential Learning, Catastrophic Interference, and the
Stability-Plasticity Dilemma". 

The term "batch learning" is used quite consistently in the NN literature,
but "incremental learning" is often used for on-line, constructive, or
sequential learning. The usage of "batch" and "incremental" in the NN FAQ
follows Bertsekas and Tsitsiklis (1996), one of the few references that
keeps the relevant concepts straight. 

"Epoch learning" is synonymous with "batch learning." 

"Instantaneous learning" and "pattern learning" are usually synonyms for
incremental learning. "Instantaneous learning" is a misleading term because
people often think it means learning instantaneously. "Pattern learning" is
easily confused with "pattern recognition". Hence these terms are not used
in the FAQ. 

There are also intermediate methods, sometimes called mini-batch: 

   Initialize the weights.
   Repeat the following steps:
      Process two or more, but not all training cases.
      Update the weights.

Conventional numerical optimization techniques (see "What are conjugate
gradients, Levenberg-Marquardt, etc.?") are batch algorithms. Conventional
stochastic approximation techniques (see below) are incremental algorithms.
For a theoretical discussion comparing batch and incremental learning, see
Bertsekas and Tsitsiklis (1996, Chapter 3). 

For more information on incremental learning, see Saad (1998), but note that
the authors in that volume do not reliably distinguish between on-line and
incremental learning. 

On-line vs. Off-line Learning
+++++++++++++++++++++++++++++

In off-line learning, all the data are stored and can be accessed
repeatedly. Batch learning is always off-line. 

In on-line learning, each case is discarded after it is processed and the
weights are updated. On-line training is always incremental. 

Incremental learning can be done either on-line or off-line. 

With off-line learning, you can compute the objective function for any fixed
set of weights, so you can see whether you are making progess in training.
You can compute a minimum of the objective function to any desired
precision. You can use a variety of algorithms for avoiding bad local
minima, such as multiple random initializations or global optimization
algorithms. You can compute the error function on a validation set and use 
early stopping or choose from different networks to improve generalization.
You can use cross-validation and bootstrapping to estimate generalization
error. You can compute prediction and confidence intervals (error bars).
With on-line learning you can do none of these things because you cannot
compute the objective function on the training set or the error function on
the validation set for a fixed set of weights, since these data sets are not
stored. Hence, on-line learning is generally more difficult and unreliable
than off-line learning. Off-line incremental learning does not have all
these problems of on-line learning, which is why it is important to
distinguish between the concepts of on-line and incremental learning. 

Some of the theoretical difficulties of on-line learning are alleviated when
the assumptions of stochastic learning hold (see below) and training is
assumed to proceed indefinitely. 

For more information on on-line learning, see Saad (1998), but note that the
authors in that volume do not reliably distinguish between on-line and
incremental learning. 

Deterministic, Stochastic, and Adaptive Learning
++++++++++++++++++++++++++++++++++++++++++++++++

Deterministic learning is based on optimization of an objective function
that can be recomputed many times and always produces the same value given
the same weights. Deterministic learning is always off-line. 

Stochastic methods are used when computation of the objective function is
corrupted by noise. In particular, basic stochastic approximation is a form
of on-line gradient descent learning in which the training cases are
obtained by a stationary random process: 

   Initialize the weights.
   Initialize the learning rate.
   Repeat the following steps:
      Randomly select one (or possibly more) case(s)
         from the population.
      Update the weights by subtracting the gradient
         times the learning rate.
      Reduce the learning rate according to an
         appropriate schedule.

In stochastic on-line learning, the noise-corrupted objective function is
the error function for any given case, assuming that the case-wise error
function has some stationary random distribution. The learning rate must
gradually decrease towards zero during training to guarantee convergence of
the weights to a local minimum of the noiseless objective function. This
gradual reduction of the learning rate is often called "annealing." 

If the function that the network is trying to learn changes over time, the
case-wise error function does not have a stationary random distribution. To
allow the network to track changes over time, the learning rate must be kept
strictly away from zero. Learning methods that track a changing environment
are often called "adaptive" (as in adaptive vector quantization, Gersho and
Gray, 1992) or "continuous" rather than "stochastic". There is a trade-off
between accuracy and speed of adaptation. Adaptive learning does not
converge in a stationary environment. Hence the long-run properties of
stochastic learning and adaptive learning are quite different, even though
the algorithms may differ only in the sequence of learning rates. 

The object of adaptive learning is to forget the past when it is no longer
relevant. If you want to remember the past in a changing learning
environment, then you would be more interested in sequential learning (see
below). 

In stochastic learning with a suitably annealed learning rate, overtraining
does not occur because the more you train, the more data you have, and the
network converges toward a local optimum of the objective function for the
entire population, not a local optimum for a finite training set. Of course,
this conclusion does not hold if you train by cycling through a finite
training set instead of collecting new data on every step. 

For a theoretical discussion of stochastic learning, see Bertsekas and
Tsitsiklis (1996, Chapter 4). For further references on stochastic
approximation, see "What is backprop?" For adaptive filtering, see Haykin
(1996). 

The term "adaptive learning" is sometimes used for gradient methods in which
the learning rate is changed during training. 

Constructive Learning (Growing networks)
++++++++++++++++++++++++++++++++++++++++

Constructive learning adds units or connections to the network during
training. Typically, constructive learning begins with a network with no
hidden units, which is trained for a while. Then without altering the
existing weights, one or more new hidden units are added to the network,
training resumes, and so on. Many variations are possible, involving
different patterns of connections and schemes for freezing and thawing
weights. The most popular constructive algorithm is cascade correlation
(Fahlman and Lebiere, 1990), of which many variations are possible (e.g.,
Littmann and Ritter, 1996; Prechelt, 1997). Various other constructive
algorithms are summarized in Smieja (1993), Kwok and Yeung (1997; also other
papers at http://info.cs.ust.hk/faculty/dyyeung/paper/cnn.html), and Reed
and Marks (1999). For theory, see Baum (1989), Jones (1992), and Meir and
Maiorov (1999). Lutz Prechelt has a bibliography on constructive algorithms
at http://wwwipd.ira.uka.de/~prechelt/NN/construct.bib 

Constructive algorithms can be highly effective for escaping bad local
minima of the objective function, and are often much faster than algorithms
for global optimization such as simulated annealing and genetic algorithms.
A well-known example is the two-spirals problem, which is very difficult to
learn with standard backprop but relatively easy to learn with cascade
correlation (Fahlman and Lebiere, 1990). Some of the Donoho-Johnstone
benchmarks (especially "bumps") are almost impossible to learn with standard
backprop but can be learned very accurately with constructive algorithms
(see ftp://ftp.sas.com/pub/neural/dojo/dojo.html.) 

Constructive learning is commonly used to train multilayer perceptrons in
which the activation functions are step functions. Such networks are
difficult to train nonconstructively because the objective function is
discontinuous and gradient-descent methods cannot be used. Several clever
constructive algorithms (such as Upstart, Tiling, Marchand, etc.) have been
devised whereby a multilayer perceptron is constructed by training a
sequence of perceptrons, each of which is trained by some standard method
such as the well-known perceptron or pocket algorithms. Most constructive
algorithms of this kind are designed so that the training error goes to zero
when the network gets sufficiently large. Such guarantees do not apply to
the generalization error; you should guard against overfitting when you are
using constructive algorithms just as with nonconstructive algorithms (see
part 3 of the FAQ, especially "What is overfitting and how can I avoid it?")

Logically, you would expect "destructive" learning to start with a large
network and delete units during training, but I have never seen this term
used. The process of deleting units or connections is usually called
"pruning" (Reed, 1993; Reed and Marks 1999). The term "selectionism" has
also been used as the opposite of "constructivism" in cognitive neuroscience
(Quartz and Sejnowski, 1997). 

Sequential Learning, Catastrophic Interference, and the
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
Stability-Plasticity Dilemma
++++++++++++++++++++++++++++

"Sequential learning" sometimes means incremental learning but also refers
to a very important problem in neuroscience (e.g., McClelland, McNaughton,
and O'Reilly 1995). To reduce confusion, the latter usage should be
preferred. 

Sequential learning in its purest form operates on a sequence of training
sets as follows: 

   Initialize the weights.
   Repeat the following steps:
      Collect one or more training cases.
      Train the network on the current training set
         using any method.
      Discard the current training set and all other
         information related to training except the
         weights.

Pure sequential learning differs from on-line learning in that most on-line
algorithms require storage of information in addition to the weights, such
as the learning rate or aproximations to the objective function or Hessian
Matrix. Such additional storage is not allowed in pure sequential learning. 

Pure sequential learning is important as a model of how learning occurs in
real, biological brains. Humans and other animals are often observed to
learn new things without forgetting old things. But pure sequential learning
tends not to work well in artificial neural networks. With a fixed
architecture, distributed (rather than local) representations, and a
training algorithm based on minimizing an objective function, sequential
learning results in "catastrophic interference", because the minima of the
objective function for one training set may be totally diferent than the
minima for subsequent training sets. Hence each successive training set may
cause the network to forget completely all previous training sets. This
problem is also called the "stability-plasticity dilemma." 

Successful sequential learning usually requires one or more of the
following: 

 o Noise-free data. 
 o Constructive architectures. 
 o Local representations. 
 o Storage of extra information besides the weights (this is impure
   sequential learning and is not biologically plausible unless a a
   biological mechanism for storing the extra information is provided) 

The connectionist literature on catastrophic interference seems oblivious to
statistical and numerical theory, and much of the research is based on the
ludicrous idea of using an autoassociative backprop network to model
recognition memory. Some of the problems with this approach are explained by
Sharkey and Sharkey (1995). Current research of the PDP variety is reviewed
by French (1999). PDP remedies for catastrophic forgetting generally require
cheating, i.e., storing information outside the network. For example,
pseudorehearsal (Robins, 1995) requires storing the distribution of the
inputs, although this fact is often overlooked. It appears that the only way
to avoid catastrophic interference in a PDP-style network is to combine two
networks modularly: a fast-learning network to memorize data, and a
slow-learning network to generalize from data memorized by the fast-learning
network (McClelland, McNaughton, and O'Reilly 1995). The PDP literature
virtually ignores the ART literature (See "What is ART?"), which provides a
localist constructive solution to what the ART people call the
"stability-plasticity dilemma." None of this research deals with sequential
learning in a statistically sound manner, and many of the methods proposed
for sequential learning require noise-free data. The statistical theory of
sufficient statistics makes it obvious that efficient sequential learning
requires the storage of additional information besides the weights in a
standard feedforward network. I know of no references to this subject in the
NN literature, but Bishop (1991) provided a mathematically sound treatment
of a closely-related problem. 

References: 

   Baum, E.B. (1989), "A proposal for more powerful learning algorithms,"
   Neural Computation, 1, 201-207. 

   Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic
   Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8. 

   Bishop, C. (1991), "A fast procedure for retraining the multilayer
   perceptron," International Journal of Neural Systems, 2, 229-236. 

   Fahlman, S.E., and Lebiere, C. (1990), "The Cascade-Correlation Learning
   Architecture", in Touretzky, D. S. (ed.), Advances in Neural Information
   Processing Systems 2,, Los Altos, CA: Morgan Kaufmann Publishers, pp.
   524-532, 
   ftp://archive.cis.ohio-state.edu/pub/neuroprose/fahlman.cascor-tr.ps.Z, 
   http://www.rafael-ni.hpg.com.br/arquivos/fahlman-cascor.pdf 

   French, R.M. (1999), "Catastrophic forgetting in connectionist networks:
   Causes, consequences and solutions," Trends in Cognitive Sciences, 3,
   128-135, http://www.fapse.ulg.ac.be/Lab/Trav/rfrench.html#TICS_cat_forget

   Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal
   Compression, Boston: Kluwer Academic Publishers. 

   Haykin, S. (1996), Adaptive Filter Theory, Englewood Cliffs, NJ:
   Prentice-Hall. 

   Jones, L. (1992), "A simple lemma on greedy approximation in Hilbert
   space and convergence rate for projection pursuit regression and neural
   network training," Annals of Statistics, 20, 608-613. 

   Kwok, T.Y. and Yeung, D.Y. (1997), "Constructive algorithms for structure
   learning in feedforward neural networks for regression problems," IEEE
   Transactions on Neural Networks, volume 8, 630-645. 

   Littmann, E., and Ritter, H. (1996), "Learning and generalization in
   cascade network architectures," Neural Computation, 8, 1521-1539. 

   McClelland, J., McNaughton, B. and O'Reilly, R. (1995), "Why there are
   complementary learning systems in the hippocampus and neocortex: Insights
   from the successes and failures of connectionist models of learning and
   memory," Psychological Review, 102, 419-457. 

   Meir, R., and Maiorov, V. (1999), "On the optimality of incremental
   neural network algorithms," in Kerans, M.S., Solla, S.A., amd Cohn, D.A.
   (eds.), Advances in Neural Information Processing Systems 11,
   Cambridge,MA: The MIT Press, pp. 295-301. 

   Prechelt, L. (1997), "Investigation of the CasCor Family of Learning
   Algorithms," Neural Networks, 10, 885-896, 
   http://wwwipd.ira.uka.de/~prechelt/Biblio/#CasCor 

   Quartz, S.R., and Sejnowski, T.J. (1997), "The neural basis of cognitive
   development: A constructivist manifesto," Behavioral and Brain Sciences,
   20, 537-596, ftp://ftp.princeton.edu/pub/harnad/BBS/bbs.quartz 

   Reed, R. (1993), "Pruning algorithms--A survey," IEEE Transactions on
   Neural Networks, 4, 740-747. 

   Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised
   Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The
   MIT Press, ISBN 0-262-18190-8.

   Robins, A. (1995), "Catastrophic Forgetting, Rehearsal, and
   Pseudorehearsal," Connection Science, 7, 123-146. 

   Saad, D., ed. (1998), On-Line Learning in Neural Networks, Cambridge:
   Cambridge University Press. 

   Sharkey, N.E., and Sharkey, A.J.C. (1995), "An analysis of catastrophic

Section 1 of 6 - Prev - Next
All sections - 1 - 2 - 3 - 4 - 5 - 6

Back to category neural-nets - Use Smart Search
Home - Smart Search - About the project - Feedback

© allanswers.org | Terms of use

LiveInternet