BALANCE

Returns a view with an equal distribution of the input data based on the response_column.

Returns a view with an equal distribution of the input data based on the response_column.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

BALANCE ( 'output‑view', 'input‑relation', 'response‑column', 'balance‑method'
       [ USING PARAMETERS sampling_ratio=ratio ] )

Arguments

output‑view
The name of the view where Vertica saves the balanced data from the input relation.
input‑relation
The table or view that contains the data the function uses to create a more balanced data set. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
response‑column
Name of the input column that represents the dependent variable, of type VARCHAR or INTEGER.
balance‑method
Specifies a method to select data from the minority and majority classes, one of the following.
  • hybrid_sampling: Performs over-sampling and under-sampling on different classes so each class is equally represented.

  • over_sampling: Over-samples on all classes, with the exception of the most majority class, towards the most majority class's cardinality.

  • under_sampling: Under-samples on all classes, with the exception of the most minority class, towards the most minority class's cardinality.

  • weighted_sampling: An alias of under_sampling.

Parameters

ratio
The desired ratio between the majority class and the minority class. This value has no effect when used with balance method hybrid_sampling.

Default: 1.0

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE privileges on the output view schema

Examples

=> CREATE TABLE backyard_bugs (id identity, bug_type int, finder varchar(20));
CREATE TABLE

=> COPY backyard_bugs FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Ants
>> 1|Beetles
>> 3|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 3|Caterpillars
>> 2|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 1|Ladybugs
>> 3|Ladybugs
>> \.

=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs GROUP BY bug_type;
 bug_type | COUNT
----------+-------
        2 |     1
        1 |     3
        3 |     7
(3 rows)

=> SELECT BALANCE('backyard_bugs_balanced', 'backyard_bugs', 'bug_type', 'under_sampling');
         BALANCE
--------------------------
 Finished in 1 iteration

(1 row)

=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs_balanced GROUP BY bug_type;
----------+-------
        2 |     1
        1 |     2
        3 |     1
(3 rows)

See also