<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>OpenText Analytics Database 26.2.x – Machine learning for predictive analytics</title>
    <link>/en/data-analysis/ml-predictive-analytics/</link>
    <description>Recent content in Machine learning for predictive analytics on OpenText Analytics Database 26.2.x</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/en/data-analysis/ml-predictive-analytics/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Data-Analysis: Download the machine learning example data</title>
      <link>/en/data-analysis/ml-predictive-analytics/download-ml-example-data/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/download-ml-example-data/</guid>
      <description>
        
        
        &lt;p&gt;You need several data sets to run the machine learning examples. You can download these data sets from the Vertica GitHub repository.&lt;/p&gt;

&lt;div class=&#34;admonition important&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Important&lt;/h4&gt;
The GitHub examples are based on the latest OpenText™ Analytics Database version. If the examples behave differently on your installation, upgrade to the latest database version.
&lt;/div&gt;
&lt;p&gt;You can download the example data in either of two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Download the ZIP file. Extract the contents of the file into a directory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clone the Machine Learning GitHub repository. Using a terminal window, run the following command:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ git clone https://github.com/vertica/Machine-Learning-Examples
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;loading-the-example-data&#34;&gt;Loading the example data&lt;/h2&gt;
&lt;p&gt;You can load the example data in either of the following ways. Note that models are not automatically dropped; you must either rerun the &lt;code&gt;load_ml_data.sql&lt;/code&gt; script to drop them or drop them manually.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Copying and pasting the DDL and DML operations in &lt;code&gt;load_ml_data.sql&lt;/code&gt; into a vsql prompt or another database client.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Running the following command from a terminal window within the data folder in the Machine-Learning-Examples directory:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ /opt/vertica/bin/vsql -d &amp;lt;name of your database&amp;gt; -f load_ml_data.sql
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;
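&lt;p&gt;If you need to drop models manually, you can list them and drop them from vsql; for example (the model name here is hypothetical):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&gt; SELECT model_name FROM models;
=&gt; DROP MODEL IF EXISTS my_linear_model;
&lt;/code&gt;&lt;/pre&gt;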
&lt;p&gt;You must also run the &lt;code&gt;naive_bayes_data_preparation.sql&lt;/code&gt; script from the Machine-Learning-Examples directory:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ /opt/vertica/bin/vsql -d &amp;lt;name of your database&amp;gt; -f ./naive_bayes/naive_bayes_data_preparation.sql
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;example-data-descriptions&#34;&gt;Example data descriptions&lt;/h2&gt;
&lt;p&gt;The repository contains the following data sets.&lt;/p&gt;

&lt;table class=&#34;table table-bordered&#34; &gt;



&lt;tr&gt; 

&lt;th &gt;
Name&lt;/th&gt; 

&lt;th &gt;
Description&lt;/th&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
agar_dish&lt;/td&gt; 

&lt;td &gt;
Synthetic data set meant to represent clustering of bacteria on an agar dish. Contains the following columns: id, x-coordinate, and y-coordinate.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
agar_dish_1&lt;/td&gt; 

&lt;td &gt;
375 rows sampled randomly from the original 500 rows of the agar_dish data set.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
agar_dish_2&lt;/td&gt; 

&lt;td &gt;
125 rows sampled randomly from the original 500 rows of the agar_dish data set.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
baseball&lt;/td&gt; 

&lt;td &gt;
Contains statistics from a fictional baseball league. The statistics included are: first name, last name, date of birth, team name, homeruns, hits, batting average, and salary.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
daily-min-temperatures&lt;/td&gt; 

&lt;td &gt;
Contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
dem_votes&lt;/td&gt; 

&lt;td &gt;
Contains data on the number of yes and no votes cast by Democratic members of the U.S. House of Representatives for each of the 16 votes in the house84 data set. The table must be populated by running the &lt;code&gt;naive_bayes_data_preparation.sql&lt;/code&gt; script. Contains the following columns: vote, yes, no.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
faithful&lt;/td&gt; 

&lt;td &gt;








&lt;p&gt;Wait times between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Härdle, W. (1991) &lt;em&gt;Smoothing Techniques with Implementation in S&lt;/em&gt;. New York: Springer.&lt;/p&gt;
&lt;p&gt;Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. &lt;em&gt;Applied Statistics&lt;/em&gt; 39, 357–365.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
faithful_testing&lt;/td&gt; 

&lt;td &gt;
Roughly 60% of the original 272 rows of the faithful data set.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;


faithful_training&lt;/td&gt; 

&lt;td &gt;
Roughly 40% of the original 272 rows of the faithful data set.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
house84&lt;/td&gt; 

&lt;td &gt;






&lt;p&gt;The house84 data set includes votes for each of the U.S. House of Representatives Congress members on 16 votes. Contains the following columns: id, party, vote1, vote2, vote3, vote4, vote5, vote6, vote7, vote8, vote9, vote10, vote11, vote12, vote13, vote14, vote15, vote16.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Washington, D.C., 1985.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
iris&lt;/td&gt; 

&lt;td &gt;






&lt;p&gt;The iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) &lt;em&gt;The New S Language&lt;/em&gt;. Wadsworth &amp;amp; Brooks/Cole.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
iris1&lt;/td&gt; 

&lt;td &gt;
90 rows sampled randomly from the original 150 rows in the iris data set.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
iris2&lt;/td&gt; 

&lt;td &gt;
60 rows sampled randomly from the original 150 rows in the iris data set.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
mtcars&lt;/td&gt; 

&lt;td &gt;






&lt;p&gt;The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Henderson and Velleman (1981), Building multiple regression models interactively. &lt;em&gt;Biometrics&lt;/em&gt;, 37, 391–411.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
rep_votes&lt;/td&gt; 

&lt;td &gt;
Contains data on the number of yes and no votes cast by Republican members of the U.S. House of Representatives for each of the 16 votes in the house84 data set. The table must be populated by running the &lt;code&gt;naive_bayes_data_preparation.sql&lt;/code&gt; script. Contains the following columns: vote, yes, no.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
salary_data&lt;/td&gt; 

&lt;td &gt;
Contains fictional employee data. The data included are: employee id, first name, last name, years worked, and current salary.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
transaction_data&lt;/td&gt; 

&lt;td &gt;
Contains fictional credit card transactions with a BOOLEAN column indicating whether there was fraud associated with the transaction. The data included are: first name, last name, store, cost, and fraud.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
titanic_testing&lt;/td&gt; 

&lt;td &gt;
Contains passenger information from the Titanic, including sex, age, passenger class, and whether or not the passenger survived.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
titanic_training&lt;/td&gt; 

&lt;td &gt;
Contains passenger information from the Titanic, including sex, age, passenger class, and whether or not the passenger survived.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
world&lt;/td&gt; 

&lt;td &gt;
Contains country-specific information about human development using HDI, GDP, and CO2 emissions.&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Data preparation</title>
      <link>/en/data-analysis/ml-predictive-analytics/data-preparation/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/data-preparation/</guid>
      <description>
        
        
        &lt;p&gt;Before you can analyze your data, you must prepare it. You can do the following data preparation tasks in OpenText™ Analytics Database:&lt;/p&gt;
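&lt;p&gt;For example, normalizing numeric columns is one such task; the following sketch uses the &lt;code&gt;NORMALIZE&lt;/code&gt; function on the example &lt;code&gt;salary_data&lt;/code&gt; set (the column names are assumed from the data description):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&gt; SELECT NORMALIZE(&#39;normalized_salary_data&#39;, &#39;salary_data&#39;, &#39;current_salary, years_worked&#39;, &#39;minmax&#39;);
&lt;/code&gt;&lt;/pre&gt;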

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Regression algorithms</title>
      <link>/en/data-analysis/ml-predictive-analytics/regression-algorithms/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/regression-algorithms/</guid>
      <description>
        
        
&lt;p&gt;Regression is an important and popular machine learning tool that makes predictions from data by learning the relationship between some features of the data and an observed response value. Regression is used to make predictions about profits, sales, temperature, stocks, and more. For example, you could use regression to predict the price of a house based on its location, square footage, lot size, and so on. In this example, the house&#39;s value is the response, and the other factors, such as location, are the features.&lt;/p&gt;
&lt;p&gt;The optimal set of coefficients found for the regression&#39;s equation is known as the model. The relationship between the outcome and the features is summarized in the model, which can then be applied to different data sets, where the outcome value is unknown.&lt;/p&gt;
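&lt;p&gt;As a sketch of this workflow, you could train a linear regression model on the example &lt;code&gt;faithful_training&lt;/code&gt; data set and apply it to &lt;code&gt;faithful_testing&lt;/code&gt; (the model name is arbitrary and the column names are assumptions):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&gt; SELECT LINEAR_REG(&#39;linear_reg_faithful&#39;, &#39;faithful_training&#39;, &#39;eruption_duration&#39;, &#39;eruption_interval&#39;);
=&gt; SELECT PREDICT_LINEAR_REG(eruption_interval USING PARAMETERS model_name=&#39;linear_reg_faithful&#39;) FROM faithful_testing;
&lt;/code&gt;&lt;/pre&gt;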

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Classification algorithms</title>
      <link>/en/data-analysis/ml-predictive-analytics/classification-algorithms/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/classification-algorithms/</guid>
      <description>
        
        
&lt;p&gt;Classification is an important and popular machine learning tool that assigns items in a data set to different categories. Classification is used to predict risk over time, in fraud detection, text categorization, and more. Classification functions begin with a data set where the different categories are known. For example, suppose you want to classify students based on how likely they are to get into graduate school. In addition to factors like admission exam scores and grades, you could also track work experience.&lt;/p&gt;
&lt;p&gt;Binary classification means the outcome, in this case, admission, only has two possible values: admit or do not admit. Multiclass outcomes have more than two values. For example, low, medium, or high chance of admission. During the training process, classification algorithms find the relationship between the outcome and the features. This relationship is summarized in the model, which can then be applied to different data sets, where the categories are unknown.&lt;/p&gt;
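&lt;p&gt;For a binary outcome, a logistic regression classifier follows the same train-then-predict pattern; a minimal sketch using the example &lt;code&gt;mtcars&lt;/code&gt; data set (the response and predictor column names are assumptions):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&gt; SELECT LOGISTIC_REG(&#39;logistic_reg_mtcars&#39;, &#39;mtcars&#39;, &#39;am&#39;, &#39;cyl, wt&#39;);
=&gt; SELECT PREDICT_LOGISTIC_REG(cyl, wt USING PARAMETERS model_name=&#39;logistic_reg_mtcars&#39;) FROM mtcars;
&lt;/code&gt;&lt;/pre&gt;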

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Clustering algorithms</title>
      <link>/en/data-analysis/ml-predictive-analytics/clustering-algorithms/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/clustering-algorithms/</guid>
      <description>
        
        
&lt;p&gt;Clustering is an important and popular machine learning tool used to find groups of items in a data set that are similar to one another. The goal of clustering is to produce clusters in which the items within each cluster are highly similar. Like classification, clustering segments the data; however, in clustering, the categorical groups are not defined in advance.&lt;/p&gt;
&lt;p&gt;Clustering data into related groupings has many useful applications. If you already know how many clusters your data contains, the &lt;a href=&#34;../../../en/data-analysis/ml-predictive-analytics/clustering-algorithms/k-means/#&#34;&gt;K-means&lt;/a&gt; algorithm may be sufficient to train your model and use that model to predict cluster membership for new data points.&lt;/p&gt;
&lt;p&gt;However, in the more common case, you do not know before analyzing the data how many clusters it contains. In these cases, the &lt;a href=&#34;../../../en/data-analysis/ml-predictive-analytics/clustering-algorithms/bisecting-k-means/#&#34;&gt;Bisecting k-means&lt;/a&gt; algorithm is much more effective at finding the correct clusters in your data.&lt;/p&gt;
&lt;p&gt;Both k-means and bisecting k-means predict the clusters for a given data set. A model trained using either algorithm can then be used to predict the cluster to which new data points are assigned.&lt;/p&gt;
&lt;p&gt;Clustering can be used to find anomalies in data and find natural groups of data. For example, you can use clustering to analyze a geographical region and determine which areas of that region are most likely to be hit by an earthquake. For a complete example, see &lt;a href=&#34;https://www.researchgate.net/publication/312033426_Earthquake_Cluster_Analysis_K-Means_Approach&#34;&gt;Earthquake Cluster Analysis Using the KMeans Approach&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In OpenText™ Analytics Database, clustering is computed based on Euclidean distance. Through this computation, data points are assigned to the cluster with the nearest center.&lt;/p&gt;
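&lt;p&gt;As a sketch, you could train a k-means model on the &lt;code&gt;agar_dish_1&lt;/code&gt; sample and then assign the points in &lt;code&gt;agar_dish_2&lt;/code&gt; to clusters (the model name is arbitrary; the &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; column names are assumed from the data description):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&gt; SELECT KMEANS(&#39;agar_dish_kmeans&#39;, &#39;agar_dish_1&#39;, &#39;*&#39;, 5 USING PARAMETERS exclude_columns=&#39;id&#39;);
=&gt; SELECT id, APPLY_KMEANS(x, y USING PARAMETERS model_name=&#39;agar_dish_kmeans&#39;) FROM agar_dish_2;
&lt;/code&gt;&lt;/pre&gt;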

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Time series forecasting</title>
      <link>/en/data-analysis/ml-predictive-analytics/time-series-forecasting/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/time-series-forecasting/</guid>
      <description>
        
        
        &lt;p&gt;Time series models are trained on stationary time series (that is, time series where the mean doesn&#39;t change over time) of stochastic processes with consistent time steps. These algorithms forecast future values by taking into account the influence of values at some number of preceding timesteps (lags).&lt;/p&gt;
&lt;p&gt;Examples of applicable datasets include those for temperature, stock prices, earthquakes, product sales, etc.&lt;/p&gt;
&lt;p&gt;To normalize datasets with inconsistent timesteps, see &lt;a href=&#34;../../../en/data-analysis/time-series-analytics/gap-filling-and-interpolation-gfi/#&#34;&gt;Gap filling and interpolation (GFI)&lt;/a&gt;.&lt;/p&gt;
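&lt;p&gt;For example, an autoregressive model can be trained on a stationary series and then used to forecast ahead; a minimal sketch (the table, column, and model names are hypothetical):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&gt; SELECT AUTOREGRESSOR(&#39;ar_temperature&#39;, &#39;temp_data&#39;, &#39;temperature&#39;, &#39;time&#39; USING PARAMETERS p=3);
=&gt; SELECT PREDICT_AUTOREGRESSOR(temperature USING PARAMETERS model_name=&#39;ar_temperature&#39;, npredictions=10) OVER(ORDER BY time) FROM temp_data;
&lt;/code&gt;&lt;/pre&gt;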

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Model management</title>
      <link>/en/data-analysis/ml-predictive-analytics/model-management/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/model-management/</guid>
      <description>
        
        
        &lt;p&gt;OpenText™ Analytics Database provides a number of tools to manage existing models. You can view model summaries and attributes, alter model characteristics like name and privileges, drop models, and version models.&lt;/p&gt;
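&lt;p&gt;For example, assuming a model named &lt;code&gt;my_model&lt;/code&gt; already exists, you can inspect and rename it as follows:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&gt; SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name=&#39;my_model&#39;);
=&gt; ALTER MODEL my_model RENAME TO my_model_v2;
&lt;/code&gt;&lt;/pre&gt;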

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Using external models</title>
      <link>/en/data-analysis/ml-predictive-analytics/using-external-models-with/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/using-external-models-with/</guid>
      <description>
        
        
        &lt;p&gt;To give you the utmost in machine learning flexibility and scalability, OpenText™ Analytics Database supports importing, exporting, and predicting with PMML and TensorFlow models.&lt;/p&gt;
&lt;p&gt;The machine learning configuration parameter &lt;a href=&#34;../../../en/sql-reference/config-parameters/ml-parameters/&#34;&gt;MaxModelSizeKB&lt;/a&gt; sets the maximum size of a model that can be imported into the database.&lt;/p&gt;

&lt;h2 id=&#34;support-for-pmml-models&#34;&gt;Support for PMML models&lt;/h2&gt;
&lt;p&gt;OpenText™ Analytics Database supports the import and export of machine learning models in Predictive Model Markup Language (PMML) format. Support for this platform-independent model format allows you to use models trained on other platforms to predict on data stored in your database. You can also use the database as your model repository. OpenText™ Analytics Database supports &lt;a href=&#34;https://dmg.org/pmml/pmml-v4-4-1.html&#34;&gt;PMML version 4.4.1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With the &lt;a href=&#34;../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-pmml/#&#34;&gt;PREDICT_PMML&lt;/a&gt; function, you can use an archived PMML model to run prediction on data stored in the database. For more information, see &lt;a href=&#34;../../../en/data-analysis/ml-predictive-analytics/using-external-models-with/using-pmml-models/#&#34;&gt;Using PMML models&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For details on the PMML models, tags, and attributes that OpenText™ Analytics Database supports, see &lt;a href=&#34;../../../en/data-analysis/ml-predictive-analytics/using-external-models-with/using-pmml-models/pmml-features-and-attributes/#&#34;&gt;PMML features and attributes&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;support-for-tensorflow-models&#34;&gt;Support for TensorFlow models&lt;/h2&gt;
&lt;p&gt;OpenText™ Analytics Database supports importing trained TensorFlow models and using those models to predict on data stored in the database. The database supports TensorFlow models trained in TensorFlow version 1.15.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-tensorflow/#&#34;&gt;PREDICT_TENSORFLOW&lt;/a&gt; and &lt;a href=&#34;../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-tensorflow-scalar/#&#34;&gt;PREDICT_TENSORFLOW_SCALAR&lt;/a&gt; functions let you predict on data with TensorFlow models.&lt;/p&gt;
&lt;p&gt;For additional information, see &lt;a href=&#34;../../../en/data-analysis/ml-predictive-analytics/using-external-models-with/tensorflow-models/#&#34;&gt;TensorFlow models&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;additional-external-model-support&#34;&gt;Additional external model support&lt;/h2&gt;
&lt;p&gt;The following functions support both PMML and TensorFlow models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/ml-functions/model-management/import-models/#&#34;&gt;IMPORT_MODELS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/ml-functions/model-management/export-models/#&#34;&gt;EXPORT_MODELS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/ml-functions/model-management/get-model-attribute/#&#34;&gt;GET_MODEL_ATTRIBUTE&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/ml-functions/model-management/get-model-summary/#&#34;&gt;GET_MODEL_SUMMARY&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
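&lt;p&gt;For example, to bring an externally trained PMML model into the database and predict with it (the file path, model name, and column names are hypothetical):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&gt; SELECT IMPORT_MODELS(&#39;/home/dbadmin/my_kmeans_pmml.xml&#39; USING PARAMETERS category=&#39;PMML&#39;);
=&gt; SELECT PREDICT_PMML(col1, col2 USING PARAMETERS model_name=&#39;my_kmeans_pmml&#39;) FROM new_data;
&lt;/code&gt;&lt;/pre&gt;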

      </description>
    </item>
    
  </channel>
</rss>
