Complexifier

Make your pandas even worse!

Complexifier is a Python library crafted to transform clean datasets into messy versions by introducing random errors and anomalies. This is particularly useful for educational purposes, where students learn to clean data through practical experience.

Problem

When teaching students to work with data, an important lesson is how to clean it.

The difficulty is that datasets available online tend to fall into one of two categories:

Data that is good, but already cleaned
Data that is not cleaned, but is terrible and incomprehensible

Complexifier solves this problem by allowing you to take the former and turn it into a better version of the latter!

Dependencies

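Complexifier relies on the following third-party packages:

pandas
typo

It also uses random, which is part of the Python standard library and needs no separate installation.

Ensure the third-party packages are installed in your environment.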

Installation

You can install complexifier via pip:

pip install complexifier

Usage

Once installed, use complexifier to add mistakes and simulate anomalies in your data. This library provides several methods:

Methods

create_spag_error

complexifier.create_spag_error(word: str) → str

Introduces a spelling error into a word with a 10% probability.

Parameters:

word (str) – The original word to potentially alter.

Returns:

The original word or a word with a random spelling error.

Return type:

str
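As a rough illustration of the 10% behaviour described above, the sketch below uses only the standard library. The helper name maybe_misspell, its chance and rng parameters, and the adjacent-letter swap are all assumptions made for this example; they are not Complexifier's implementation, which may generate errors differently.

```python
import random


def maybe_misspell(word, chance=0.1, rng=None):
    """Hypothetical helper: with probability `chance`, swap two adjacent letters."""
    rng = rng or random
    # Leave very short words alone, and leave most words untouched.
    if len(word) < 2 or rng.random() >= chance:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


# Run the helper many times to see roughly how often a word changes.
rng = random.Random(0)
results = [maybe_misspell("pandas", rng=rng) for _ in range(1000)]
changed = sum(w != "pandas" for w in results)
print(f"{changed} of 1000 words were altered")
```

Because the error is applied per call, a short column of words will often come back with no errors at all; the effect only becomes visible over many values.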

introduce_spag_error

complexifier.introduce_spag_error(df: DataFrame, columns=None) → DataFrame

Applies random spelling errors to specified columns in a DataFrame.

Parameters:

df (pd.DataFrame) – The DataFrame to alter.
columns (list or str, optional) – Column names to apply errors to. Defaults to all string columns.

Returns:

The DataFrame with potential spelling errors introduced.

Return type:

pd.DataFrame
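The column handling described above (a single name, a list of names, or all string columns by default) can be sketched with plain pandas. Everything here is an assumption for illustration: the function name misspell_columns, the chance and seed parameters, the letter-swap error, and the choice to return a copy are not taken from Complexifier itself.

```python
import random

import pandas as pd


def misspell_columns(df, columns=None, chance=0.1, seed=None):
    """Hypothetical sketch: randomly swap adjacent letters in string columns."""
    rng = random.Random(seed)
    if columns is None:
        # Default to every column that holds strings.
        columns = [c for c in df.columns if pd.api.types.is_string_dtype(df[c])]
    elif isinstance(columns, str):
        columns = [columns]

    def swap(value):
        if not isinstance(value, str) or len(value) < 2 or rng.random() >= chance:
            return value
        i = rng.randrange(len(value) - 1)
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]

    out = df.copy()  # this sketch leaves the input untouched
    for col in columns:
        out[col] = out[col].map(swap)
    return out


cities = pd.DataFrame(
    {"city": ["London", "Paris", "Madrid"], "population": [8800000, 2100000, 3300000]}
)
# chance=1.0 forces an error in every value so the effect is easy to see.
messy_cities = misspell_columns(cities, chance=1.0, seed=1)
print(messy_cities)
```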

add_or_subtract_outliers

complexifier.add_or_subtract_outliers(df: DataFrame, columns=None) → DataFrame

Adds or subtracts a random integer from the values in the selected columns, affecting between 1% and 10% of the rows.

Parameters:

df (pd.DataFrame) – The DataFrame to modify.
columns (list or str, optional) – Column names to adjust. Defaults to all numeric columns if not specified.

Returns:

The DataFrame with outliers added.

Return type:

pd.DataFrame
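One way to picture this behaviour is the pandas sketch below. It is not Complexifier's code: the name nudge_random_rows, the max_shift and seed parameters, and the decision to pick a fresh 1% to 10% sample for each column are assumptions made for the example.

```python
import random

import pandas as pd


def nudge_random_rows(df, columns=None, max_shift=1000, seed=None):
    """Hypothetical sketch: add or subtract a random integer in 1-10% of rows."""
    rng = random.Random(seed)
    if columns is None:
        columns = df.select_dtypes(include="number").columns.tolist()
    elif isinstance(columns, str):
        columns = [columns]

    out = df.copy()
    for col in columns:
        fraction = rng.uniform(0.01, 0.10)
        n_rows = min(len(out), max(1, round(len(out) * fraction)))
        for idx in rng.sample(list(out.index), n_rows):
            # Random magnitude, random sign.
            shift = rng.randint(1, max_shift) * rng.choice([-1, 1])
            out.loc[idx, col] = out.loc[idx, col] + shift
    return out


prices = pd.DataFrame({"price": range(100), "label": ["x"] * 100})
messy_prices = nudge_random_rows(prices, seed=42)
n_changed = int((messy_prices["price"] != prices["price"]).sum())
print(f"{n_changed} price values were shifted")
```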

add_standard_deviations

complexifier.add_standard_deviations(df: DataFrame, columns=None, min_std=1, max_std=5) → DataFrame

Adds random deviations to entries in specified numeric columns to simulate data anomalies.

Parameters:

df (pd.DataFrame) – The DataFrame to manipulate.
columns (list or str, optional) – Column names to modify. Defaults to numeric columns if not specified.
min_std (int, optional) – Minimum number of standard deviations to add. Defaults to 1.
max_std (int, optional) – Maximum number of standard deviations to add. Defaults to 5.

Returns:

The DataFrame with deviations added.

Return type:

pd.DataFrame
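The min_std and max_std parameters can be read as bounds on a multiplier applied to the column's standard deviation. The sketch below shows that reading with pandas. The name shift_by_std, the fraction and seed parameters, and the choice to affect 5% of rows are assumptions; the source does not say how many rows Complexifier alters.

```python
import random

import pandas as pd


def shift_by_std(df, columns=None, min_std=1, max_std=5, fraction=0.05, seed=None):
    """Hypothetical sketch: add k standard deviations to a sample of rows."""
    rng = random.Random(seed)
    if columns is None:
        columns = df.select_dtypes(include="number").columns.tolist()
    elif isinstance(columns, str):
        columns = [columns]

    out = df.copy()
    for col in columns:
        std = out[col].std()
        # Work in floats so integer columns can hold the shifted values.
        out[col] = out[col].astype(float)
        n_rows = min(len(out), max(1, round(len(out) * fraction)))
        for idx in rng.sample(list(out.index), n_rows):
            k = rng.randint(min_std, max_std)  # whole number of deviations
            out.loc[idx, col] += k * std
    return out


heights = pd.DataFrame({"height_cm": [float(v) for v in range(100)]})
shifted = shift_by_std(heights, seed=7)
diff = shifted["height_cm"] - heights["height_cm"]
print(diff[diff != 0])
```

Values shifted this way stand out in a histogram or box plot, which makes them a useful target for an outlier-detection exercise.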

duplicate_rows

complexifier.duplicate_rows(df: DataFrame, sample_size=None) → DataFrame

Adds duplicate rows to a DataFrame.

Parameters:

df (pd.DataFrame) – DataFrame to which duplicates will be added.
sample_size (int, optional) – Number of rows to duplicate. If not specified, a random proportion of between 1% and 10% of the rows is duplicated.

Returns:

The DataFrame with duplicate rows added.

Return type:

pd.DataFrame
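Duplicating rows can be expressed in pandas with DataFrame.sample and pd.concat. The sketch below is only an approximation of the behaviour described above; the name append_duplicates and the seed parameter are assumptions, and Complexifier may place or order the duplicates differently.

```python
import random

import pandas as pd


def append_duplicates(df, sample_size=None, seed=None):
    """Hypothetical sketch: append copies of randomly chosen rows."""
    if sample_size is None:
        # Fall back to a random 1-10% of the rows, with at least one row.
        rng = random.Random(seed)
        sample_size = max(1, round(len(df) * rng.uniform(0.01, 0.10)))
    copies = df.sample(n=sample_size, random_state=seed)
    return pd.concat([df, copies], ignore_index=True)


orders = pd.DataFrame({"order_id": range(50), "total": [v * 1.5 for v in range(50)]})
with_dupes = append_duplicates(orders, sample_size=5, seed=0)
print(len(with_dupes), int(with_dupes.duplicated().sum()))
```

Students can then recover the original table with drop_duplicates, which makes the exercise easy to check.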

add_nulls

complexifier.add_nulls(df: DataFrame, columns=None, min_percent=1, max_percent=10) → DataFrame

Inserts null values into specified DataFrame columns.

Parameters:

df (pd.DataFrame) – The DataFrame to modify.
columns (list or str, optional) – Specific columns to add nulls to. Defaults to all columns if not specified.
min_percent (int, optional) – Minimum percentage of null values to insert. Defaults to 1 (1%).
max_percent (int, optional) – Maximum percentage of null values to insert. Defaults to 10 (10%).

Returns:

The DataFrame with null values inserted.

Return type:

pd.DataFrame
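The percentage bounds can be pictured as follows: for each column, pick a percentage between min_percent and max_percent, then blank out that share of the rows. The sketch below does this with Series.mask. The name blank_out_cells, the seed parameter, and the per-column sampling are assumptions rather than Complexifier's actual implementation.

```python
import random

import pandas as pd


def blank_out_cells(df, columns=None, min_percent=1, max_percent=10, seed=None):
    """Hypothetical sketch: replace a random share of each column with nulls."""
    rng = random.Random(seed)
    if columns is None:
        columns = list(df.columns)
    elif isinstance(columns, str):
        columns = [columns]

    out = df.copy()
    for col in columns:
        percent = rng.uniform(min_percent, max_percent)
        n_rows = round(len(out) * percent / 100)
        rows = rng.sample(list(out.index), n_rows)
        # mask() swaps the selected cells for missing values and
        # changes the column dtype if that is needed to hold them.
        out[col] = out[col].mask(out.index.isin(rows))
    return out


people = pd.DataFrame({"age": range(200), "name": ["a"] * 200})
with_nulls = blank_out_cells(people, seed=3)
print(with_nulls.isna().sum())
```

Note that an integer column becomes a float column in this sketch once it contains missing values, which is itself a common surprise worth discussing with students.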

mess_it_up

complexifier.mess_it_up(df: DataFrame, columns=None, min_std=1, max_std=5, sample_size=None, min_percent=1, max_percent=10, introduce_spag=True, add_outliers=True, add_std=True, duplicate=True, add_null=True) → DataFrame

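Applies several of the functions above in one call to add spelling errors, outliers, standard deviations, duplicate rows and null values. Each step can be switched off with its boolean flag.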

Parameters:

df (pd.DataFrame) – The DataFrame to modify.
columns (list or str, optional) – Specific columns to modify. Defaults to all columns if not specified.
min_std (int, optional) – Minimum number of standard deviations to add. Defaults to 1.
max_std (int, optional) – Maximum number of standard deviations to add. Defaults to 5.
sample_size (int, optional) – Number of rows to duplicate. Randomly selected if not specified.
min_percent (int, optional) – Minimum percentage of null values to insert. Defaults to 1 (1%).
max_percent (int, optional) – Maximum percentage of null values to insert. Defaults to 10 (10%).
introduce_spag (bool, optional) – Adds spelling and grammar errors into string data. Defaults to True.
add_outliers (bool, optional) – Adds outliers to numerical data. Defaults to True.
add_std (bool, optional) – Adds standard deviations to the data. Defaults to True.
duplicate (bool, optional) – Adds duplicate rows to the data. Defaults to True.
add_null (bool, optional) – Adds null values to the dataset. Defaults to True.

Returns:

The modified DataFrame.

Return type:

pd.DataFrame
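The boolean flags suggest a pipeline in which each step runs only when it is enabled. The sketch below shows that pattern in isolation. The function run_enabled_steps and the two stand-in steps are hypothetical, and the order in which Complexifier applies its own steps is not stated in this document.

```python
import pandas as pd


def run_enabled_steps(df, steps):
    """Apply each (enabled, function) pair in order, skipping disabled ones."""
    for enabled, step in steps:
        if enabled:
            df = step(df)
    return df


# Stand-in steps for illustration only.
def duplicate_first_row(df):
    return pd.concat([df, df.head(1)], ignore_index=True)


def blank_first_score(df):
    return df.assign(score=df["score"].mask(df.index == 0))


clean = pd.DataFrame({"name": ["Ana", "Ben", "Cy"], "score": [7, 8, 9]})

# Only the duplication step is enabled here.
messy = run_enabled_steps(
    clean,
    [(True, duplicate_first_row), (False, blank_first_score)],
)
print(messy)
```

Turning individual steps off in this way lets a lesson focus on one kind of problem at a time, for example duplicates only.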

Contributing

Feel free to contribute by submitting a pull request on GitHub. For large changes, please open an issue first so they can be discussed before you start work.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact Information

For support or inquiries, please contact Ruy at ruyzambrano@gmail.com

Changelog

Version 0.3.3
