mauricio kugler
TM & © 2015, Mauricio Kugler Inc. All rights reserved.
UCI Databases
Formatted and Organized
PRD File Format

PRD (Pattern Recognition Data) is an ASCII file format created to make the databases easy to read from C/C++ programs. It is defined as follows (the number at the start of each line is shown only for reference):

1          PRD
2          <number of samples>
3          <number of features>
4          s[0][0],s[0][1],...,s[0][n-2],s[0][n-1],<label>
5          s[1][0],s[1][1],...,s[1][n-2],s[1][n-1],<label>
6          s[2][0],s[2][1],...,s[2][n-2],s[2][n-1],<label>
...        ...
m+2        s[m-2][0],s[m-2][1],...,s[m-2][n-2],s[m-2][n-1],<label>
m+3        s[m-1][0],s[m-1][1],...,s[m-1][n-2],s[m-1][n-1],<label>

where 'm' is the number of samples, 'n' is the number of features and s[a][b] means the feature 'b' of sample 'a'.

The class label must be an integer, and the sequence of class labels must be contiguous. For example, in a 10-class database the classes must be labeled from 0 to 9, but they can appear in the file in any order. The features of each sample are separated by commas, with no spaces before or after each comma. The lines are numbered here just for reference; they are not numbered in the database files. Finally, there are no spaces after the three header lines.
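
The layout described above can be read with a few lines of C++. The following is only a minimal sketch, not the code used to produce these files: the names PrdData, nextLine and readPrd are hypothetical, and the parser assumes every field is a plain decimal number and that the file may use either LF or CRLF line endings.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for one PRD file (names are illustrative only).
struct PrdData {
    std::size_t samples = 0;                  // 'm' in the format description
    std::size_t features = 0;                 // 'n' in the format description
    std::vector<std::vector<double>> values;  // values[a][b] = feature b of sample a
    std::vector<int> labels;                  // labels[a] = class label of sample a
};

// Reads one line and drops a trailing '\r' (assumption: CRLF files may exist).
static std::string nextLine(std::istream& in) {
    std::string line;
    if (!std::getline(in, line))
        throw std::runtime_error("unexpected end of PRD data");
    if (!line.empty() && line.back() == '\r') line.pop_back();
    return line;
}

// Parses the three header lines, then one comma-separated line per sample.
PrdData readPrd(std::istream& in) {
    if (nextLine(in) != "PRD")
        throw std::runtime_error("missing PRD signature");
    PrdData d;
    d.samples = std::stoul(nextLine(in));
    d.features = std::stoul(nextLine(in));
    for (std::size_t a = 0; a < d.samples; ++a) {
        std::istringstream row(nextLine(in));
        std::vector<std::string> fields;
        std::string field;
        while (std::getline(row, field, ',')) fields.push_back(field);
        if (fields.size() != d.features + 1)  // n features plus the label
            throw std::runtime_error("wrong number of fields in sample line");
        std::vector<double> sample;
        for (std::size_t b = 0; b < d.features; ++b)
            sample.push_back(std::stod(fields[b]));
        d.values.push_back(sample);
        d.labels.push_back(std::stoi(fields.back()));  // label is the last field
    }
    return d;
}
```

To use it, open a file with std::ifstream and pass the stream to readPrd; the function throws if the signature, the header or a sample line does not match the layout.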

Database Files

bupa
Training Data: bupa_data.prd
Test Data: none
Information: bupa_info.txt
Obs: -

dermatology
Training Data: dermatology_data.prd
Test Data: none
Information: dermatology_info.txt
Obs: see note 1

ecoli
Training Data: ecoli_data.prd
Test Data: none
Information: ecoli_info.txt
Obs: -

forest (UCI KDD)
Training Data: forest_training_data.zip
Test Data: forest_test_data.zip
Information: forest_info.txt
Obs: see note 2

glass
Training Data: glass_data.prd
Test Data: none
Information: glass_info.txt
Obs: see note 3

ionosphere
Training Data: ionosphere_data.prd
Test Data: none
Information: ionosphere_info.txt
Obs: see note 4

iris
Training Data: iris_data.prd
Test Data: none
Information: iris_info.txt
Obs: -

isolet
Training Data: isolet_training_data.zip
Test Data: isolet_test_data.zip
Information: isolet_info.txt
Obs: -

letter
Training Data: letter_training_data.prd
Test Data: letter_test_data.prd
Information: letter_info.txt
Obs: see note 5

lrs
Training Data: lrs_data.prd
Test Data: none
Information: lrs_info.txt
Obs: see note 6

lung
Training Data: lung_data.prd
Test Data: none
Information: lung_info.txt 
Obs: see note 7

optdigits
Training Data: optdigits_training_data.prd
Test Data: optdigits_test_data.prd
Information: optdigits_info.txt
Obs: -

pendigits
Training Data: pendigits_training_data.prd
Test Data: pendigits_test_data.prd
Information: pendigits_info.txt
Obs: -

pima
Training Data: pima_data.prd
Test Data: none
Information: pima_info.txt
Obs: -

satimage
Training Data: satimage_training_data.prd
Test Data: satimage_test_data.prd
Information: satimage_info.txt 
Obs: see note 8 

segment
Training Data: segment_training_data.prd
Test Data: segment_test_data.prd
Information: segment_info.txt
Obs: see note 9

shuttle
Training Data: shuttle_training_data.zip
Test Data: shuttle_test_data.zip
Information: shuttle_info.txt
Obs: -

sonar
Training Data: sonar_data.prd
Test Data: none
Information: sonar_info.txt 
Obs: -

thyroid
Training Data: thyroid_data.prd
Test Data: none
Information: thyroid_info.txt 
Obs: -

vehicle
Training Data: vehicle_data.prd
Test Data: none
Information: vehicle_info.txt 
Obs: -

vowel
Training Data: vowel_training_data.prd
Test Data: vowel_test_data.prd
Information: vowel_info.txt
Obs: -

wdbc
Training Data: wdbc_data.prd
Test Data: none
Information: wdbc_info.txt 
Obs: -

wine
Training Data: wine_data.prd
Test Data: none
Information: wine_info.txt 
Obs: -

Notes

1. For the dermatology database, the 8 patterns with missing values were removed, changing the number of samples from 366 to 358;
2. The forest database was split into training and test data, with 387344 samples for training and 193668 for test, keeping the class proportions. See the problem statistics for details;
3. In the glass database, one of the 7 classes has no patterns. This class was not considered in the PRD file class labeling, so the labels run from 0 to 5;
4. In the ionosphere database, the second feature is constant (always 0) and was removed, changing the number of features from 34 to 33;
5. The letter database was split into training and test data, with the first 12200 samples for training and the remaining 7800 for test;
6. The classes in the original lrs database are numbered from 0 to 99, but this numbering is not contiguous. The 48 classes present were relabeled from 0 to 47. Also, the 10 "header" features were eliminated, changing the number of features from 103 to 93;
7. For the lung database, the 26th sample and the 5th feature were removed because of missing values, changing the number of samples from 32 to 31 and the number of features from 56 to 55;
8. In the satimage database, one of the 7 classes has no patterns. This class was not considered in the PRD file class labeling, so the labels run from 0 to 5;
9. For the segment database, the feature that contains the number of pixels ("region-pixel-count") is constant (always 9) and was removed, changing the original number of features from 19 to 18.
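
Several of the notes above (3, 6 and 8) involve relabeling classes so that the labels form a contiguous range starting at 0, as the PRD format requires. The sketch below shows one way to do this. The function name relabelContiguous is hypothetical, and assigning the new labels in ascending order of the original labels is an assumption; the page does not say which order was used.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Rewrites arbitrary integer class labels in place as 0..k-1 and returns k,
// the number of distinct classes. Assumption: new labels follow the
// ascending order of the original labels.
std::size_t relabelContiguous(std::vector<int>& labels) {
    std::map<int, int> remap;  // original label -> new label, keys kept sorted
    for (int label : labels) remap.emplace(label, 0);  // collect distinct labels
    int next = 0;
    for (auto& entry : remap) entry.second = next++;   // assign 0..k-1 in key order
    for (int& label : labels) label = remap[label];    // rewrite the labels
    return remap.size();
}
```

Classes with no patterns never appear in the input, so they simply receive no new label, which is the behavior described for glass and satimage.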

Other Repositories

http://ida.first.fhg.de/projects/bench/benchmarks.htm
http://www.faqs.org/faqs/ai-faq/neural-nets/part4/section-7.html
http://www.ncrg.aston.ac.uk/NN/databases.html
http://vision.ece.ucsb.edu/download.html

UCI Knowledge Discovery in Databases Archive
http://kdd.ics.uci.edu

Delve Datasets
http://www.cs.toronto.edu/~delve/data/datasets.html

Bilkent University - Function Approximation Repository
http://funapp.cs.bilkent.edu.tr/DataSets/

Time Series Data Library
http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/

StatLib - Dataset Archive
http://lib.stat.cmu.edu/modules.php?op=modload&name=PostWrap&file=index&page=datasets/ 
(contains links to other repositories)



I used these databases for the experiments in my doctoral thesis. "Formatted and organized" does not mean that the UCI repository is disorganized! I created a file format, which I called PRD, and converted all the databases to it. I also organized the data in a way that was easy to use in my experiments. See the notes for the differences you may find when comparing against the original UCI information. All of this took a lot of work, so I hope these files will be useful to someone else. If you find a mistake on this page, please tell me by e-mail.
Don't forget to cite the UCI database in your work!

Summary
Database      Classes  Training Set  Test Set  Features
bupa          2        345           -         6
dermatology   6        358           -         34
ecoli         8        336           -         7
forest*       7        387344        193668    54
glass         6        214           -         9
ionosphere    2        351           -         33
iris          3        150           -         4
isolet        26       6238          1559      617
letter        26       12200         7800      16
lrs           48       531           -         93
lung          2        31            -         55
optdigits     10       3823          1797      64
pendigits     10       7494          3498      16
pima          2        768           -         8
satimage      6        4435          2000      36
segment       7        210           2100      18
shuttle       7        43500         14500     9
sonar         2        208           -         60
thyroid       3        215           -         5
vehicle       4        846           -         18
vowel         11       528           462       11
wdbc          2        569           -         30
wine          3        178           -         13

* UCI Knowledge Discovery in Databases