Downloading the machine learning example data
You need several data sets to run the machine learning examples. You can download these data sets from the Vertica GitHub repository.
Important
The GitHub examples are based on the latest Vertica version. If you note differences, please upgrade to the latest version.You can download the example data in either of two ways:
-
Download the ZIP file. Extract the contents of the file into a directory.
-
Clone the Vertica Machine Learning GitHub repository. Using a terminal window, run the following command:
$ git clone https://github.com/vertica/Machine-Learning-Examples
Loading the example data
You can load the example data by doing one of the following. Note that models are not automatically dropped. You must either rerun the load_ml_data.sql
script to drop models or manually drop them.
-
Copying and pasting the DDL and DML operations in
load_ml_data.sql
in a vsql prompt or another Vertica client. -
Running the following command from a terminal window within the data folder in the Machine-Learning-Examples directory:
$ /opt/vertica/bin/vsql -d <name of your database> -f load_ml_data.sql
You must also load the naive_bayes_data_prepration.sql
script in the Machine-Learning-Examples directory:
$ /opt/vertica/bin/vsql -d <name of your database> -f ./naive_bayes/naive_bayes_data_preparation.sql
Example data descriptions
The repository contains the following data sets.
Name | Description |
---|---|
agar_dish | Synthetic data set meant to represent clustering of bacteria on an agar dish. Contains the following columns: id, x-coordinate, and y-coordinate. |
agar_dish_2 | 125 rows sampled randomly from the original 500 rows of the agar_dish data set. |
agar_dish_1 | 375 rows sampled randomly from the original 500 rows of the agar_dish data set. |
baseball | Contains statistics from a fictional baseball league. The statistics included are: first name, last name, date of birth, team name, homeruns, hits, batting average, and salary. |
daily-min-temperatures | Contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990. |
dem_votes |
Contains data on the number of yes and no votes by Democrat members of U.S. Congress for each of the 16 votes in the house84 data set. The table must be populated by running the naive_bayes_data_prepration.sql script. Contains the following columns: vote, yes, no. |
faithful |
Wait times between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. Reference Härdle, W. (1991) Smoothing Techniques with Implementation in S. New York: Springer. Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365. |
faithful_testing | Roughly 60% of the original 272 rows of the faithful data set. |
faithful_training | Roughly 40% of the original 272 rows of the faithful data set. |
house84 |
The house84 data set includes votes for each of the U.S. House of Representatives Congress members on 16 votes. Contains the following columns: id, party, vote1, vote2, vote3, vote4, vote5, vote6, vote7, vote8, vote9, vote10, vote11, vote12, vote13, vote14, vote15, vote16. Reference Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Washington, D.C., 1985. |
iris |
The iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Reference Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole. |
iris1 | 90 rows sampled randomly from the original 150 rows in the iris data set. |
iris2 | 60 rows sampled randomly from the original 150 rows in the iris data set. |
mtcars |
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Reference Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411. |
rep_votes |
Contains data on the number of yes and no votes by Republican members of U.S. Congress for each of the 16 votes in the house84 data set. The table must be populated by running the naive_bayes_data_prepration.sql script. Contains the following columns: vote, yes, no. |
salary_data | Contains fictional employee data. The data included are: employee id, first name, last name, years worked, and current salary. |
transaction_data | Contains fictional credit card transactions with a BOOLEAN column indicating whether there was fraud associated with the transaction. The data included are: first name, last name, store, cost, and fraud. |
titanic_testing | Contains passenger information from the Titanic ship including sex, age, passenger class, and whether or not they survived. |
titanic_training | Contains passenger information from the Titanic ship including sex, age, passenger class, and whether or not they survived. |
world | Contains country-specific information about human development using HDI, GDP, and CO2 emissions. |