Balancing imbalanced data

Imbalanced data occurs when the classes in a data set are unevenly distributed.

Building a predictive model on an imbalanced data set can produce a model that appears highly accurate but does not generalize well to new data in the minority class. To avoid models with misleading accuracy, rebalance the data before you build a predictive model.
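To see why accuracy alone can mislead, consider a hypothetical baseline that labels every row as the majority class. The sketch below assumes a data set of 1,000 transactions of which 19 are fraudulent (the counts used in the example later in this topic). The column aliases are illustrative only.

```sql
-- Hypothetical "always predict not fraudulent" baseline.
-- Assumed counts: 981 non-fraudulent rows, 19 fraudulent rows.
SELECT 981 / 1000.0 AS baseline_accuracy,  -- share of rows labeled correctly
       0 / 19.0     AS minority_recall;    -- share of fraudulent rows detected
```

Under these assumptions the baseline labels 98.1% of rows correctly while detecting none of the fraudulent transactions, which is the kind of misleading accuracy that rebalancing is meant to guard against.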

Before you begin the example, load the Machine Learning sample data.

Imbalanced data is common in financial transaction data, where most transactions are not fraudulent and only a small number are fraudulent, as shown in the following example.

  1. View the distribution of the classes.

    => SELECT fraud, COUNT(fraud) FROM transaction_data GROUP BY fraud;
     fraud | COUNT
    -------+-------
     TRUE  |    19
     FALSE |   981
    (2 rows)
    
  2. Use the BALANCE function to create a more balanced data set.

    => SELECT BALANCE('balance_fin_data', 'transaction_data', 'fraud', 'under_sampling'
                      USING PARAMETERS sampling_ratio = 0.2);
             BALANCE
    --------------------------
     Finished in 1 iteration
    
    (1 row)
    
  3. View the new distribution of the classes.

    => SELECT fraud, COUNT(fraud) FROM balance_fin_data GROUP BY fraud;
     fraud | COUNT
    -------+-------
     t     |    19
     f     |   236
    (2 rows)
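The raw counts in steps 1 and 3 can be easier to compare as percentages. The following query is a minimal sketch that uses only standard aggregate and window functions. The aliases `row_count` and `pct_of_rows` are illustrative and are not part of the sample data.

```sql
-- Show each class's share of the rows in the balanced table.
-- To compare with the original data, run the same query against transaction_data.
SELECT fraud,
       COUNT(*) AS row_count,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_rows
FROM balance_fin_data
GROUP BY fraud;
```

Based on the counts shown above, the fraudulent class makes up about 1.9% of the original rows (19 of 1,000) and about 7.5% of the balanced rows (19 of 255).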
    
