BALANCE
Returns a view with an equal distribution of the input data based on the response_column.
Returns a view with an equal distribution of the input data based on the response_column.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
VolatileSyntax
BALANCE ( 'output-view', 'input-relation', 'response-column', 'balance-method'
[ USING PARAMETERS sampling_ratio=ratio ] )
Arguments
output-view
- The name of the view where Vertica saves the balanced data from the input relation.
Note
Note: The view that results from this function employs a random function. Its content can differ each time it is used in a query. To make the operations on the view predictable, store it in a regular table. input-relation
- The table or view that contains the data the function uses to create a more balanced data set. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync thehcatalog
schema, and then run the machine learning function. response-column
- Name of the input column that represents the dependent variable, of type VARCHAR or INTEGER.
balance-method
- Specifies a method to select data from the minority and majority classes, one of the following.
-
hybrid_sampling
: Performs over-sampling and under-sampling on different classes so each class is equally represented. -
over_sampling
: Over-samples on all classes, with the exception of the most majority class, towards the most majority class's cardinality. -
under_sampling
: Under-samples on all classes, with the exception of the most minority class, towards the most minority class's cardinality. -
weighted_sampling
: An alias ofunder_sampling
.
-
Parameters
ratio
- The desired ratio between the majority class and the minority class. This value has no effect when used with balance method
hybrid_sampling
.Default: 1.0
Privileges
Non-superusers:
-
SELECT privileges on the input relation
-
CREATE privileges on the output view schema
Examples
=> CREATE TABLE backyard_bugs (id identity, bug_type int, finder varchar(20));
CREATE TABLE
=> COPY backyard_bugs FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Ants
>> 1|Beetles
>> 3|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 3|Caterpillars
>> 2|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 1|Ladybugs
>> 3|Ladybugs
>> \.
=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs GROUP BY bug_type;
bug_type | COUNT
----------+-------
2 | 1
1 | 3
3 | 7
(3 rows)
=> SELECT BALANCE('backyard_bugs_balanced', 'backyard_bugs', 'bug_type', 'under_sampling');
BALANCE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs_balanced GROUP BY bug_type;
----------+-------
2 | 1
1 | 2
3 | 1
(3 rows)