Version: 0.5.0

Data Duplication Utilities

This module provides functions to inject duplicated records into a dataset. These utilities are useful to simulate real-world issues such as:

There are two main strategies:

Global duplication: Randomly duplicates rows from the whole dataset.
Class-based duplication: Duplicates rows from a specific class in the target variable.

Each duplication method is available in two forms:

New methods: Inject new duplicates starting from a clean dataset.
Extended methods: Add additional duplicates to a dataset that may already contain repeated rows, based on a clean reference.

Randomly duplicates a percentage of rows from the entire dataset, excluding already existing duplicates.

train_df (pd.DataFrame):
The input dataset.
percentage (float):
Target percentage (0–1) of duplicated rows relative to the original dataset.

Extends the number of duplicated rows in the dataset to reach a specified percentage, using a clean reference dataset to calculate the delta.

original_df (pd.DataFrame):
The original dataset without duplicates.
train_df (pd.DataFrame):
Dataset that may already contain some duplicated rows.
percentage (float):
Desired final duplication rate (0–1) based on the original dataset.

Duplicates a percentage of rows belonging to a specific class in the dataset.

Extends the number of duplicated rows for a specific class until the desired percentage is reached, using the original dataset as a reference.

original_df (pd.DataFrame):
The original clean dataset.
train_df (pd.DataFrame):
Dataset that may already contain duplicated rows.
target (str):
Name of the column representing the class/label.
value (any):
The class value to duplicate.
percentage (float):
Desired final duplication rate (0–1) for that class, relative to the clean dataset.