BALANCE

Returns a view with an equal distribution of the input data based on the response_column.

Returns a view with an equal distribution of the input data based on the response_column.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

BALANCE ( 'output-view', 'input-relation', 'response-column', 'balance-method'
       [ USING PARAMETERS sampling_ratio=ratio ] )

Arguments

output-view
The name of the view where Vertica saves the balanced data from the input relation.
input-relation
The table or view that contains the data the function uses to create a more balanced data set. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
response-column
Name of the input column that represents the dependent variable, of type VARCHAR or INTEGER.
balance-method
Specifies a method to select data from the minority and majority classes, one of the following.
  • hybrid_sampling: Performs over-sampling and under-sampling on different classes so each class is equally represented.

  • over_sampling: Over-samples on all classes, with the exception of the most majority class, towards the most majority class's cardinality.

  • under_sampling: Under-samples on all classes, with the exception of the most minority class, towards the most minority class's cardinality.

  • weighted_sampling: An alias of under_sampling.

Parameters

ratio
The desired ratio between the majority class and the minority class. This value has no effect when used with balance method hybrid_sampling.

Default: 1.0

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE privileges on the output view schema

Examples

=> CREATE TABLE backyard_bugs (id identity, bug_type int, finder varchar(20));
CREATE TABLE

=> COPY backyard_bugs FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Ants
>> 1|Beetles
>> 3|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 3|Caterpillars
>> 2|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 1|Ladybugs
>> 3|Ladybugs
>> \.

=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs GROUP BY bug_type;
 bug_type | COUNT
----------+-------
        2 |     1
        1 |     3
        3 |     7
(3 rows)

=> SELECT BALANCE('backyard_bugs_balanced', 'backyard_bugs', 'bug_type', 'under_sampling');
         BALANCE
--------------------------
 Finished in 1 iteration

(1 row)

=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs_balanced GROUP BY bug_type;
----------+-------
        2 |     1
        1 |     2
        3 |     1
(3 rows)

See also