[P] Proportionately split dataframe with multiple target columns
I have a dataframe with 30 rows and 10 columns. 5 of the columns are input features and the other 5 are output/target columns. The target columns contain classes represented as 0, 1, 2. I want to split the dataset into train and test such that, in the train set, for each output column, the proportion of class 1 is between 0.15 and 0.3. (I am not bothered about the distribution of classes in the test set).
ADDITIONAL CONTEXT: I am trying to balance the output classes in a multi-class and multi-output dataset. My understanding is that this would be an optimization problem with 25 (?) degrees of freedom. So if I have any input dataset, I would be able to create a subset of that input dataset which is my training data and which has the desired class balance (i.e class 1 between 0.15 and 0.3 for each output column).
I make the dataframe using this
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(42)
data = pd.DataFrame({
'A': np.random.rand(30),
'B': np.random.rand(30),
'C': np.random.rand(30),
'D': np.random.rand(30),
'E': np.random.rand(30),
'F': np.random.choice([0, 1, 2], 30),
'G': np.random.choice([0, 1, 2], 30),
'H': np.random.choice([0, 1, 2], 30),
'I': np.random.choice([0, 1, 2], 30),
'J': np.random.choice([0, 1, 2], 30)
})
My current silly/harebrained solution for this problem involves using two separate functions. I have a helper function that checks if the proportions of class 1 in each column is within my desired range
def check_proportions(df, cols, min_prop = 0.15, max_prop = 0.3, class_category = 1):
for col in cols:
prop = (df[col] == class_category).mean()
if not (min_prop <= prop <= max_prop):
return False
return True
def proportionately_split_data(data, target_cols, min_prop = 0.15, max_prop = 0.3):
while True:
random_state = np.random.randint(100_000)
train_df, test_df = train_test_split(data, test_size = 0.3, random_state = random_state)
if check_proportions(train_df, target_cols, min_prop, max_prop):
return train_df, test_df
Finally, I run the code using
target_cols = ["F", "G", "H", "I", "J"]
train, test = proportionately_split_data(data, target_cols)
My worry with this current "solution" is that it is probabilistic and not deterministic. I can see the proportionately_split_data
getting stuck in an infinite loop if none of the random state I set in train_test_split can randomly generate data with the desired proportion. Any help would be much appreciated!
I apologize for not providing this earlier, for a Minimal working example, the input (data) could be
A | B | C | D | E | OUTPUT_1 | OUTPUT_2 | OUTPUT_3 | OUTPUT_4 | OUTPUT_5 |
---|---|---|---|---|---|---|---|---|---|
5.65 | 3.56 | 0.94 | 9.23 | 6.43 | 0 | 1 | 1 | 0 | 1 |
7.43 | 3.95 | 1.24 | 7.22 | 2.66 | 0 | 0 | 0 | 1 | 2 |
9.31 | 2.42 | 2.91 | 2.64 | 6.28 | 2 | 1 | 2 | 2 | 0 |
8.19 | 5.12 | 1.32 | 3.12 | 8.41 | 1 | 2 | 0 | 1 | 2 |
9.35 | 1.92 | 3.12 | 4.13 | 3.14 | 0 | 1 | 1 | 0 | 1 |
8.43 | 9.72 | 7.23 | 8.29 | 9.18 | 1 | 0 | 0 | 2 | 2 |
4.32 | 2.12 | 3.84 | 9.42 | 8.19 | 0 | 0 | 0 | 0 | 0 |
3.92 | 3.91 | 2.90 | 8.19 | 8.41 | 2 | 2 | 2 | 2 | 1 |
7.89 | 1.92 | 4.12 | 8.19 | 7.28 | 1 | 1 | 2 | 0 | 2 |
5.21 | 2.42 | 3.10 | 0.31 | 1.31 | 2 | 0 | 1 | 1 | 0 |
which has 10 rows and 10 columns,
and an expected output (train set) could be
A | B | C | D | E | OUTPUT_1 | OUTPUT_2 | OUTPUT_3 | OUTPUT_4 | OUTPUT_5 |
---|---|---|---|---|---|---|---|---|---|
5.65 | 3.56 | 0.94 | 9.23 | 6.43 | 0 | 1 | 1 | 0 | 1 |
7.43 | 3.95 | 1.24 | 7.22 | 2.66 | 0 | 0 | 0 | 1 | 2 |
9.31 | 2.42 | 2.91 | 2.64 | 6.28 | 2 | 1 | 2 | 2 | 0 |
8.19 | 5.12 | 1.32 | 3.12 | 8.41 | 1 | 2 | 0 | 1 | 2 |
8.43 | 9.72 | 7.23 | 8.29 | 9.18 | 1 | 0 | 0 | 2 | 2 |
3.92 | 3.91 | 2.90 | 8.19 | 8.41 | 2 | 2 | 2 | 2 | 1 |
5.21 | 2.42 | 3.10 | 0.31 | 1.31 | 2 | 0 | 1 | 1 | 0 |
Whereby each output column in the train set has at least 2 (>= 0.15 * number of rows in input data) instances of Class 1 and at most 3 (<= 0.3 * number of rows in input data). I guess I also didn't clarify that the proportion is in relation to the number of examples (or rows) in the input dataset. My test set would be the remaining rows in the input dataset.