comp.ai.neural-nets FAQ, Part 4 of 7: Books, data, etc.
total of 20 form faces represented in the database. Each image is stored
in bi-level black and white raster format. The images in this database
appear to be real forms prepared by individuals but the images have been
automatically derived and synthesized using a computer and contain no
"real" tax data. The entry field values on the forms have been
automatically generated by a computer in order to make the data available
without the danger of distributing privileged tax information. In
addition to the images the database includes 5,590 answer files, one for
each image. Each answer file contains an ASCII representation of the data
found in the entry fields on the corresponding image. Image format
documentation and example software are also provided. The uncompressed
database totals approximately 5.9 gigabytes of data.
NIST special database 3: Binary Images of Handwritten Segmented
---------------------------------------------------------------
Characters (HWSC)
-----------------
Contains 313,389 isolated character images segmented from the 2,100
full-page images distributed with "NIST Special Database 1". 223,125
digits, 44,951 upper-case, and 45,313 lower-case character images. Each
character image has been centered in a separate 128 by 128 pixel region.
The error rate of the segmentation and assigned classification is less
than 0.1%. The uncompressed database totals approximately 2.75 gigabytes
of image data and includes image format documentation and example software.
The system requirements for all databases are a 5.25" CD-ROM drive with
software to read ISO-9660 format. Contact: Darrin L. Dimmick;
dld@magi.ncsl.nist.gov; (301)975-4147
The prices of the databases are between US$ 250 and US$ 1895. If you wish to
order a database, please contact: Standard Reference Data; National
Institute of Standards and Technology; 221/A323; Gaithersburg, MD 20899;
Phone: (301)975-2208; FAX: (301)926-0416
Samples of the data can be found by ftp on sequoyah.ncsl.nist.gov in
directory /pub/data. A more complete description of the available
databases can be obtained from the same host as
/pub/databases/catalog.txt
8. CEDAR CD-ROM 1: Database of Handwritten Cities, States,
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ZIP Codes, Digits, and Alphabetic Characters
++++++++++++++++++++++++++++++++++++++++++++
The Center Of Excellence for Document Analysis and Recognition (CEDAR),
State University of New York at Buffalo, announces the availability of
CEDAR CDROM 1: USPS Office of Advanced Technology. The database contains
handwritten words and ZIP Codes in high resolution grayscale (300 ppi
8-bit) as well as binary handwritten digits and alphabetic characters
(300 ppi 1-bit). This database is intended to encourage research in
off-line handwriting recognition by providing access to handwriting
samples digitized from envelopes in a working post office.
Specifications of the database include:
+ 300 ppi 8-bit grayscale handwritten words (cities,
  states, ZIP Codes)
   o 5632 city words
   o 4938 state words
   o 9454 ZIP Codes
+ 300 ppi binary handwritten characters and digits:
   o 27,837 mixed alphas and numerics segmented
     from address blocks
   o 21,179 digits segmented from ZIP Codes
+ every image supplied with a manually determined
  truth value
+ extracted from live mail in a working U.S. Post
  Office
+ word images in the test set supplied with
  dictionaries of postal words that simulate partial
  recognition of the corresponding ZIP Code.
+ digit images included in test set that simulate
  automatic ZIP Code segmentation. Results on these
  data can be projected to overall ZIP Code
  recognition performance.
+ image format documentation and software included
System requirements are a 5.25" CD-ROM drive with software to read
ISO-9660 format. For further information, see
http://www.cedar.buffalo.edu/Databases/CDROM1/ or send email to Ajay
Shekhawat at
There is also a CEDAR CDROM-2, a database of machine-printed Japanese
character images.
9. AI-CD-ROM (see question "Other sources of information")
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
10. Time series
+++++++++++++++
Santa Fe Competition
--------------------
Various datasets of time series (to be used for prediction learning
problems) are available for anonymous ftp from ftp.santafe.edu in
/pub/Time-Series. Data sets include:
o Fluctuations in a far-infrared laser;
o Physiological data of patients with sleep apnea;
o High frequency currency exchange rate data;
o Intensity of a white dwarf star;
o J.S. Bach's final (unfinished) fugue from "Die Kunst der Fuge"
Some of the datasets were used in a prediction contest and are described
in detail in the book "Time series prediction: Forecasting the future and
understanding the past", edited by Weigend/Gershenfeld, Proceedings
Volume XV in the Santa Fe Institute Studies in the Sciences of Complexity
series of Addison-Wesley (1994).
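For readers who want to use such a series for prediction learning, a
common first step is to cut it into fixed-length input windows, each
paired with the value to be predicted. The following Python sketch shows
one way to do that. The names make_windows and chronological_split, the
window length, and the toy series are illustrative assumptions, not part
of the Santa Fe distribution, and the file formats of the actual data
sets are not covered here.

```python
def make_windows(series, n_lags, horizon=1):
    """Turn a 1-D series into (inputs, target) pairs for prediction.

    Each pair holds `n_lags` consecutive values as inputs and the value
    `horizon` steps after the last input as the target.
    """
    if n_lags < 1 or horizon < 1:
        raise ValueError("n_lags and horizon must be positive")
    pairs = []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        inputs = list(series[start:start + n_lags])
        target = series[start + n_lags + horizon - 1]
        pairs.append((inputs, target))
    return pairs


def chronological_split(pairs, train_fraction=0.8):
    """Split pairs in time order: earlier pairs train, later pairs test."""
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Toy series standing in for real data (hypothetical values).
toy = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
windows = make_windows(toy, n_lags=3)
train, test = chronological_split(windows)
```

Splitting in time order rather than at random keeps later observations
out of the training set, which is usually what one wants when judging a
forecasting method.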
M3 Competition
--------------
3003 time series from the M3 Competition can be found at
http://forecasting.cwru.edu/Data/index.html
The numbers of series of various types are given in the following table:
Interval    Micro  Industry  Macro  Finance  Demog  Other  Total
Yearly        146       102     83       58    245     11    645
Quarterly     204        83    336       76     57      0    756
Monthly       474       334    312      145    111     52   1428
Other           4         0      0       29      0    141    174
Total         828       519    731      308    413    204   3003
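The marginal totals in the table can be checked with a few lines of
Python. The figures below are copied from the table; the variable names
are chosen for illustration only.

```python
# Series counts per interval, in the column order of the table above.
columns = ["Micro", "Industry", "Macro", "Finance", "Demog", "Other"]
counts = {
    "Yearly":    [146, 102,  83,  58, 245,  11],
    "Quarterly": [204,  83, 336,  76,  57,   0],
    "Monthly":   [474, 334, 312, 145, 111,  52],
    "Other":     [  4,   0,   0,  29,   0, 141],
}

# Sum across each row, down each column, and over the whole table.
row_totals = {interval: sum(row) for interval, row in counts.items()}
column_totals = dict(zip(columns, (sum(col) for col in zip(*counts.values()))))
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```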
Rob Hyndman's Time Series Data Library
--------------------------------------
A collection of over 500 time series on subjects including agriculture,
chemistry, crime, demography, ecology, economics & finance, health,
hydrology & meteorology, industry, physics, production, sales, simulated
series, sport, transport & tourism, and tree-rings can be found at
http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/
11. Financial data
++++++++++++++++++
http://chart.yahoo.com/d?s=
http://www.chdwk.com/data/index.html
12. USENIX Faces
++++++++++++++++
The USENIX faces archive is a public database, accessible by ftp, that
can be of use to people working in the fields of human face recognition,
classification and the like. It currently contains 5592 different faces
(taken at USENIX conferences) and is updated twice each year. The images
are mostly 96x128 greyscale frontal images and are stored in ASCII files
in a way that makes it easy to convert them to any usual graphic format
(GIF, PCX, PBM etc.). Source code for viewers, filters, etc. is provided.
Each image file takes approximately 25K.
For further information, see http://facesaver.usenix.org/
According to the archive administrator, Barbara L. Dijker
(barb.dijker@labyrinth.com), there is no restriction on their use.
However, the image files are stored in separate directories corresponding
to the Internet site to which the person represented in the image
belongs, with each directory containing a small number of images (two on
average). This makes it difficult to retrieve by ftp even a small
part of the database, as you have to get each one individually.
A solution, as Barbara suggested to me, would be to compress the whole set
of images (in separate files of, say, 100 images) and maintain them as a
specific archive for research on face processing, similar to the ones
that already exist for fingerprints and others. The whole compressed
database would take some 30 megabytes of disk space. I encourage anyone
willing to host this database on his/her site, available for anonymous
ftp, to contact her for details (unfortunately I don't have the resources
to set up such a site).
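As a rough illustration of the proposed packaging, the sketch below walks
a local copy of the archive and writes gzip-compressed tar files of at
most 100 images each. The directory layout, the archive naming scheme
(faces-000.tar.gz and so on) and the function names are assumptions made
for this example; nothing here describes how such an archive would
actually be built.

```python
import os
import tarfile


def find_images(root):
    """Return every file below `root`, sorted so batches are reproducible."""
    paths = []
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            paths.append(os.path.join(directory, name))
    return sorted(paths)


def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_archives(root, out_dir, batch_size=100):
    """Pack the files under `root` into numbered .tar.gz archives.

    `out_dir` should lie outside `root`, otherwise a second run would
    pack the archives themselves. Returns the archive paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    archive_paths = []
    for index, group in enumerate(batches(find_images(root), batch_size)):
        archive_path = os.path.join(out_dir, "faces-%03d.tar.gz" % index)
        with tarfile.open(archive_path, "w:gz") as archive:
            for path in group:
                # Keep the per-site directory in the stored name so that
                # files with the same name at different sites do not clash.
                archive.add(path, arcname=os.path.relpath(path, root))
        archive_paths.append(archive_path)
    return archive_paths


# Number of archives needed for 5592 files at 100 files per archive.
n_archives = len(list(batches(list(range(5592)), 100)))
```

With 5592 files and 100 files per archive this gives 56 archives, the
last one holding 92 images.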
Please consider that UUNET has graciously provided the ftp server for the
FaceSaver archive and may discontinue that service if it becomes a
burden. This means that people should not download more than maybe 10
faces at a time from uunet.
A last remark: each file represents a different person (except for
isolated cases). This makes the database quite unsuitable for training
neural networks, since for proper generalisation several instances of the
same subject are required. However, it is still useful as a test set for
a trained network.
13. Linguistic Data Consortium
++++++++++++++++++++++++++++++
The Linguistic Data Consortium (URL:
http://www.ldc.upenn.edu/ldc/noframe.html) is an open consortium of
universities, companies and government research laboratories. It creates,
collects and distributes speech and text databases, lexicons, and other
resources for research and development purposes. The University of
Pennsylvania is the LDC's host institution. The LDC catalog includes
pronunciation lexicons, varied lexicons, broadcast speech, microphone
speech, mobile-radio speech, telephone speech, broadcast text,
conversation text, newswire text, parallel text, and varied text, at
widely varying fees.
Linguistic Data Consortium
University of Pennsylvania
3615 Market Street, Suite 200
Philadelphia, PA 19104-2608
Tel (215) 898-0464 Fax (215) 573-2175
Email: ldc@ldc.upenn.edu
14. Otago Speech Corpus
+++++++++++++++++++++++
The Otago Speech Corpus contains speech samples in RIFF WAVE format that
can be downloaded from
http://divcom.otago.ac.nz/infosci/kel/software/RICBIS/hyspeech_main.html
15. Astronomical Time Series
++++++++++++++++++++++++++++
Prepared by Paul L. Hertz (Naval Research Laboratory) & Eric D. Feigelson
(Pennsylvania State University):
o Detection of variability in photon counting observations 1
(QSO1525+337)
o Detection of variability in photon counting observations 2 (H0323+022)
o Detection of variability in photon counting observations 3 (SN1987A)
o Detecting orbital and pulsational periodicities in stars 1 (binaries)
o Detecting orbital and pulsational periodicities in stars 2 (variables)
o Cross-correlation of two time series 1 (Sun)
o Cross-correlation of two time series 2 (OJ287)
o Periodicity in a gamma ray burster (GRB790305)
o Solar cycles in sunspot numbers (Sun)
o Deconvolution of sources in a scanning operation (HEAO A-1)
o Fractal time variability in a Seyfert galaxy (NGC5506)
o Quasi-periodic oscillations in X-ray binaries (GX5-1)
o Deterministic chaos in an X-ray pulsar? (Her X-1)
URL: http://xweb.nrl.navy.mil/www_hertz/timeseries/timeseries.html
16. Miscellaneous Images
++++++++++++++++++++++++
The USC-SIPI Image Database:
http://sipi.usc.edu/services/database/Database.html
CityU Image Processing Lab:
http://www.image.cityu.edu.hk/images/database.html
Center for Image Processing Research: http://cipr.rpi.edu/
Computer Vision Test Images:
http://www.cs.cmu.edu:80/afs/cs/project/cil/ftp/html/v-images.html
Lenna 97: A Complete Story of Lenna:
http://www.image.cityu.edu.hk/images/lenna/Lenna97.html
17. StatLib
+++++++++++
The StatLib repository at http://lib.stat.cmu.edu/ at Carnegie Mellon
University has a large collection of data sets, many of which can be used
with NNs.
------------------------------------------------------------------------
Next part is part 5 (of 7). Previous part is part 3.
--
Warren S. Sarle SAS Institute Inc. The opinions expressed here
saswss@unx.sas.com SAS Campus Drive are mine and not necessarily
(919) 677-8000 Cary, NC 27513, USA those of SAS Institute.