comp.ai.neural-nets FAQ, Part 4 of 7: Books, data, etc.



   total of 20 form faces represented in the database. Each image is stored
   in bi-level black and white raster format. The images in this database
   appear to be real forms prepared by individuals but the images have been
   automatically derived and synthesized using a computer and contain no
   "real" tax data. The entry field values on the forms have been
   automatically generated by a computer in order to make the data available
   without the danger of distributing privileged tax information. In
   addition to the images the database includes 5,590 answer files, one for
   each image. Each answer file contains an ASCII representation of the data
   found in the entry fields on the corresponding image. Image format
   documentation and example software are also provided. The uncompressed
   database totals approximately 5.9 gigabytes of data. 

   NIST special database 3: Binary Images of Handwritten Segmented
   ---------------------------------------------------------------
   Characters (HWSC)
   -----------------

   Contains 313,389 isolated character images segmented from the 2,100
   full-page images distributed with "NIST Special Database 1": 223,125
   digits, 44,951 upper-case, and 45,313 lower-case character images. Each
   character image has been centered in a separate 128 by 128 pixel region.
   The error rate of the segmentation and assigned classification is less
   than 0.1%. The uncompressed database totals approximately 2.75 gigabytes
   of image data and includes image format documentation and example
   software.

   The system requirements for all databases are a 5.25" CD-ROM drive with
   software to read ISO-9660 format. Contact: Darrin L. Dimmick;
   dld@magi.ncsl.nist.gov; (301)975-4147

   The prices of the databases are between US$ 250 and 1895. If you wish to
   order a database, please contact: Standard Reference Data; National
   Institute of Standards and Technology; 221/A323; Gaithersburg, MD 20899;
   Phone: (301)975-2208; FAX: (301)926-0416

   Samples of the data can be found by ftp on sequoyah.ncsl.nist.gov in
   directory /pub/data. A more complete description of the available
   databases can be obtained from the same host as 
   /pub/databases/catalog.txt 

8. CEDAR CD-ROM 1: Database of Handwritten Cities, States,
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   ZIP Codes, Digits, and Alphabetic Characters
   ++++++++++++++++++++++++++++++++++++++++++++

   The Center of Excellence for Document Analysis and Recognition (CEDAR),
   State University of New York at Buffalo, announces the availability of
   CEDAR CDROM 1: USPS Office of Advanced Technology. The database contains
   handwritten words and ZIP Codes in high resolution grayscale (300 ppi
   8-bit) as well as binary handwritten digits and alphabetic characters
   (300 ppi 1-bit). This database is intended to encourage research in
   off-line handwriting recognition by providing access to handwriting
   samples digitized from envelopes in a working post office. 

        Specifications of the database include:
        +    300 ppi 8-bit grayscale handwritten words (cities,
             states, ZIP Codes)
             o    5632 city words
             o    4938 state words
             o    9454 ZIP Codes
        +    300 ppi binary handwritten characters and digits:
             o    27,837 mixed alphas and numerics segmented
                  from address blocks
             o    21,179 digits segmented from ZIP Codes
        +    every image supplied with a manually determined
             truth value
        +    extracted from live mail in a working U.S. Post
             Office
        +    word images in the test set supplied with
             dictionaries of postal words that simulate partial
             recognition of the corresponding ZIP Code
        +    digit images included in test set that simulate
             automatic ZIP Code segmentation. Results on these
             data can be projected to overall ZIP Code
             recognition performance.
        +    image format documentation and software included

   System requirements are a 5.25" CD-ROM drive with software to read
   ISO-9660 format. For further information, see 
   http://www.cedar.buffalo.edu/Databases/CDROM1/ or contact Ajay Shekhawat
   by email.

   There is also a CEDAR CDROM-2, a database of machine-printed Japanese
   character images. 

9. AI-CD-ROM (see question "Other sources of information")
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

10. Time series
+++++++++++++++

   Santa Fe Competition
   --------------------

   Various datasets of time series (to be used for prediction learning
   problems) are available by anonymous ftp from ftp.santafe.edu in 
   /pub/Time-Series. Data sets include:
    o Fluctuations in a far-infrared laser; 
    o Physiological data of patients with sleep apnea; 
    o High frequency currency exchange rate data; 
    o Intensity of a white dwarf star; 
    o J.S. Bach's final (unfinished) fugue from "Die Kunst der Fuge". 

   Some of the datasets were used in a prediction contest and are described
   in detail in the book "Time series prediction: Forecasting the future and
   understanding the past", edited by Weigend and Gershenfeld, Proceedings
   Volume XV in the Santa Fe Institute Studies in the Sciences of Complexity
   series, Addison-Wesley (1994). 
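
   To use a series like these for prediction learning, it is usually
   recast as input/target pairs with a sliding window of lagged values.
   The following is only a minimal sketch of that idea; the function name
   make_windows, its parameters, and the toy series are hypothetical and
   are not taken from the competition files or the book.

```python
def make_windows(series, n_lags, horizon=1):
    """Turn a 1-D series into (inputs, targets) pairs for prediction learning.

    Each input is a list of n_lags consecutive values; the matching target
    is the value `horizon` steps after the end of that window.
    """
    inputs, targets = [], []
    # Number of windows that still leave room for a target value.
    last_start = len(series) - n_lags - horizon + 1
    for start in range(last_start):
        inputs.append(list(series[start:start + n_lags]))
        targets.append(series[start + n_lags + horizon - 1])
    return inputs, targets


# Toy data standing in for a real series (an assumption for illustration).
toy = [1, 2, 3, 4, 5, 6]
X, y = make_windows(toy, n_lags=3)
print(X)
print(y)
```

   Each row of X can then be presented to a network as an input vector,
   with the corresponding entry of y as the training target. The window
   length and forecast horizon are modelling choices, not something the
   datasets prescribe.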

   M3 Competition
   --------------

   3003 time series from the M3 Competition can be found at 
   http://forecasting.cwru.edu/Data/index.html 

   The numbers of series of various types are given in the following table: 

   Interval  Micro Industry    Macro  Finance    Demog    Other    Total
   Yearly      146      102       83       58      245       11      645
   Quarterly   204       83      336       76       57        0      756
   Monthly     474      334      312      145      111       52     1428
   Other         4        0        0       29        0      141      174
   Total       828      519      731      308      413      204     3003
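
   As a quick consistency check, the row and column totals in the table
   can be recomputed from the individual counts. The sketch below simply
   transcribes the table into a dictionary (the variable names are
   illustrative) and sums it.

```python
# Counts transcribed from the table above, in the column order
# Micro, Industry, Macro, Finance, Demog, Other.
categories = ["Micro", "Industry", "Macro", "Finance", "Demog", "Other"]
counts = {
    "Yearly":    [146, 102,  83,  58, 245,  11],
    "Quarterly": [204,  83, 336,  76,  57,   0],
    "Monthly":   [474, 334, 312, 145, 111,  52],
    "Other":     [  4,   0,   0,  29,   0, 141],
}

# Sum across each row (per interval) and down each column (per category).
row_totals = {interval: sum(row) for interval, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())

for interval, total in row_totals.items():
    print(interval, total)
print(dict(zip(categories, col_totals)))
print("Total", grand_total)
```

   The recomputed sums agree with the Total row and column, including the
   overall figure of 3003 series.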

   Rob Hyndman's Time Series Data Library
   --------------------------------------

   A collection of over 500 time series on subjects including agriculture,
   chemistry, crime, demography, ecology, economics & finance, health,
   hydrology & meteorology, industry, physics, production, sales, simulated
   series, sport, transport & tourism, and tree-rings can be found at 
   http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/ 

11. Financial data
++++++++++++++++++

   http://chart.yahoo.com/d?s=

   http://www.chdwk.com/data/index.html

12. USENIX Faces
++++++++++++++++

   The USENIX faces archive is a public database, accessible by ftp, that
   can be of use to people working in the fields of human face recognition,
   classification and the like. It currently contains 5592 different faces
   (taken at USENIX conferences) and is updated twice each year. The images
   are mostly 96x128 greyscale frontal images and are stored in ASCII files
   in a way that makes it easy to convert them to any usual graphic format
   (GIF, PCX, PBM etc.). Source code for viewers, filters, etc. is provided.
   Each image file takes approximately 25K. 

   For further information, see http://facesaver.usenix.org/

   According to the archive administrator, Barbara L. Dijker
   (barb.dijker@labyrinth.com), there are no restrictions on using the
   images. However, the image files are stored in separate directories
   corresponding to the Internet site to which the person represented in the
   image belongs, with each directory containing a small number of images
   (two on average). This makes it difficult to retrieve even a small part
   of the database by ftp, as you have to get each image individually.
   A solution, as Barbara suggested to me, would be to compress the whole
   set of images (in separate files of, say, 100 images) and maintain them
   as a specific archive for research on face processing, similar to the
   ones that already exist for fingerprints and other data. The whole
   compressed database would take some 30 megabytes of disk space. I
   encourage anyone willing to host this database at his/her site, available
   for anonymous ftp, to contact her for details (unfortunately I don't have
   the resources to set up such a site). 

   Please bear in mind that UUNET has graciously provided the ftp server for
   the FaceSaver archive and may discontinue that service if it becomes a
   burden. This means that people should not download more than about 10
   faces at a time from UUNET. 

   A last remark: each file represents a different person (except for
   isolated cases). This makes the database quite unsuitable for training
   neural networks, since for proper generalisation several instances of the
   same subject are required. However, it is still useful as a test set for
   a trained network. 

13. Linguistic Data Consortium
++++++++++++++++++++++++++++++

   The Linguistic Data Consortium (URL: 
   http://www.ldc.upenn.edu/ldc/noframe.html) is an open consortium of
   universities, companies and government research laboratories. It creates,
   collects and distributes speech and text databases, lexicons, and other
   resources for research and development purposes. The University of
   Pennsylvania is the LDC's host institution. The LDC catalog includes
   pronunciation lexicons, varied lexicons, broadcast speech, microphone
   speech, mobile-radio speech, telephone speech, broadcast text,
   conversation text, newswire text, parallel text, and varied text, at
   widely varying fees. 

      Linguistic Data Consortium 
      University of Pennsylvania 
      3615 Market Street, Suite 200 
      Philadelphia, PA 19104-2608 
      Tel (215) 898-0464 Fax (215) 573-2175
      Email: ldc@ldc.upenn.edu 
      

14. Otago Speech Corpus
+++++++++++++++++++++++

   The Otago Speech Corpus contains speech samples in RIFF WAVE format that
   can be downloaded from 
   http://divcom.otago.ac.nz/infosci/kel/software/RICBIS/hyspeech_main.html 
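
   RIFF WAVE files that hold plain PCM audio can be inspected with the
   wave module from the Python standard library. The sketch below is
   hedged accordingly: it assumes PCM encoding (the FAQ does not state
   the encoding of the corpus files), and because no corpus file is
   bundled here it writes a short synthetic tone to a temporary file and
   reads that back. The helper names and the file name tone.wav are
   hypothetical.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return basic header information for a PCM RIFF WAVE file."""
    with wave.open(path, "rb") as wav:
        n_frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "n_frames": n_frames,
            "duration_s": n_frames / rate,
        }


def write_test_tone(path, rate=8000, seconds=0.5, freq=440.0):
    """Write a mono 16-bit sine tone so the reader has something to open."""
    n = int(rate * seconds)
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n)
    ]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % n, *samples))


# Stand-in for a downloaded corpus file (an assumption for illustration).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "tone.wav")
write_test_tone(demo_path)
info = describe_wav(demo_path)
print(info)
```

   To inspect a real sample, describe_wav would be called with the path
   of a downloaded file instead of the synthetic tone. Files that use a
   compressed or otherwise non-PCM encoding may be rejected by the wave
   module and would need a different reader.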

15. Astronomical Time Series
++++++++++++++++++++++++++++

   Prepared by Paul L. Hertz (Naval Research Laboratory) & Eric D. Feigelson
   (Pennsylvania State University): 
    o Detection of variability in photon counting observations 1
      (QSO1525+337) 
    o Detection of variability in photon counting observations 2 (H0323+022)
    o Detection of variability in photon counting observations 3 (SN1987A) 
    o Detecting orbital and pulsational periodicities in stars 1 (binaries) 
    o Detecting orbital and pulsational periodicities in stars 2 (variables)
    o Cross-correlation of two time series 1 (Sun) 
    o Cross-correlation of two time series 2 (OJ287) 
    o Periodicity in a gamma ray burster (GRB790305) 
    o Solar cycles in sunspot numbers (Sun) 
    o Deconvolution of sources in a scanning operation (HEAO A-1) 
    o Fractal time variability in a Seyfert galaxy (NGC5506) 
    o Quasi-periodic oscillations in X-ray binaries (GX5-1) 
    o Deterministic chaos in an X-ray pulsar? (Her X-1) 
   URL: http://xweb.nrl.navy.mil/www_hertz/timeseries/timeseries.html 

16. Miscellaneous Images
++++++++++++++++++++++++

   The USC-SIPI Image Database: 
   http://sipi.usc.edu/services/database/Database.html

   CityU Image Processing Lab: 
   http://www.image.cityu.edu.hk/images/database.html

   Center for Image Processing Research: http://cipr.rpi.edu/

   Computer Vision Test Images: 
   http://www.cs.cmu.edu:80/afs/cs/project/cil/ftp/html/v-images.html

   Lenna 97: A Complete Story of Lenna: 
   http://www.image.cityu.edu.hk/images/lenna/Lenna97.html

17. StatLib
+++++++++++

   The StatLib repository at http://lib.stat.cmu.edu/ at Carnegie Mellon
   University has a large collection of data sets, many of which can be used
   with NNs. 

------------------------------------------------------------------------

Next part is part 5 (of 7). Previous part is part 3. 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
