Complexifier

Make your pandas even worse!

Complexifier is a Python library that turns clean datasets into messy ones by introducing random errors and anomalies. It is aimed at teaching, where students learn to clean data through hands-on practice.

Problem

When teaching students to work with data, an important lesson is how to clean it.

The trouble is that the datasets available on the internet tend to fall into two types:

  1. Data that is good, but already cleaned

  2. Data that is not cleaned, but is terrible and incomprehensible

Complexifier solves this problem by allowing you to take the former and turn it into a better version of the latter!

Dependencies

Complexifier relies on the following packages:

  • pandas

  • typo

  • random (part of the Python standard library, so no separate installation is needed)

Ensure these dependencies are installed in your environment.

Installation

You can install complexifier via pip:

pip install complexifier

Usage

Once installed, use complexifier to add mistakes and simulate anomalies in your data. This library provides several methods:

Methods

create_spag_error

complexifier.create_spag_error(word: str) -> str

Introduces a spelling error into a word with a 10% probability.

Parameters:

word (str) – The original word to potentially alter.

Returns:

The original word or a word with a random spelling error.

Return type:

str
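To make the 10% behaviour concrete, here is a minimal standard-library sketch. It is not the library's implementation: the function name `maybe_misspell`, the `chance` and `rng` parameters, and the choice of swapping two adjacent letters as the "typo" are all assumptions made for illustration.

```python
import random


def maybe_misspell(word, chance=0.1, rng=None):
    """Return `word` unchanged most of the time; otherwise swap two adjacent letters.

    Hypothetical helper: `chance` and `rng` are illustrative parameters,
    not part of complexifier's documented API.
    """
    if rng is None:
        rng = random.Random()
    # Words shorter than two characters cannot be swapped; skip them.
    if len(word) < 2 or rng.random() >= chance:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


# Applying it to many words leaves most of them untouched.
words = ["pandas"] * 20
print([maybe_misspell(w, rng=random.Random(seed)) for seed, w in enumerate(words)])
```

Passing a seeded `random.Random` makes the result reproducible, which can help when preparing the same messy dataset for a whole class.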

introduce_spag_error

complexifier.introduce_spag_error(df: DataFrame, columns=None) -> DataFrame

Applies random spelling errors to specified columns in a DataFrame.

Parameters:
  • df (pd.DataFrame) – The DataFrame to alter.

  • columns (list or str, optional) – Column names to apply errors to. Defaults to all string columns.

Returns:

The DataFrame with potential spelling errors introduced.

Return type:

pd.DataFrame

add_or_subtract_outliers

complexifier.add_or_subtract_outliers(df: DataFrame, columns=None) -> DataFrame

Adds or subtracts a random integer in the selected columns for between 1% and 10% of the rows.

Parameters:
  • df (pd.DataFrame) – The DataFrame to modify.

  • columns (list or str, optional) – Column names to adjust. Defaults to all numeric columns if not specified.

Returns:

The DataFrame with outliers added.

Return type:

pd.DataFrame
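The idea can be sketched on a plain list standing in for one numeric column. This is an illustration under stated assumptions, not the library's code: the name `add_or_subtract_outliers_sketch`, the `max_shift` bound on the random integer, and the `rng` parameter are hypothetical.

```python
import random


def add_or_subtract_outliers_sketch(values, max_shift=1000, rng=None):
    """Shift a random 1% to 10% of the entries up or down by a random integer.

    Hypothetical helper: `max_shift` and `rng` are illustrative and not
    taken from complexifier's documented API.
    """
    if rng is None:
        rng = random.Random()
    result = list(values)  # work on a copy so the input is left untouched
    if not result:
        return result
    fraction = rng.uniform(0.01, 0.10)
    n_rows = max(1, round(len(result) * fraction))
    for idx in rng.sample(range(len(result)), n_rows):
        shift = rng.randint(1, max_shift)
        result[idx] += rng.choice((-1, 1)) * shift
    return result


column = [10] * 200
noisy = add_or_subtract_outliers_sketch(column, rng=random.Random(1))
print(sum(1 for a, b in zip(column, noisy) if a != b), "rows changed")
```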

add_standard_deviations

complexifier.add_standard_deviations(df: DataFrame, columns=None, min_std=1, max_std=5) -> DataFrame

Adds random deviations to entries in specified numeric columns to simulate data anomalies.

Parameters:
  • df (pd.DataFrame) – The DataFrame to manipulate.

  • columns (list or str, optional) – Column names to modify. Defaults to numeric columns if not specified.

  • min_std (int, optional) – Minimum number of standard deviations to add. Defaults to 1

  • max_std (int, optional) – Maximum number of standard deviations to add. Defaults to 5

Returns:

The DataFrame with deviations added.

Return type:

pd.DataFrame
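As a rough illustration of what "adding standard deviations" means, the sketch below works on a plain list in place of a numeric column. It is an assumption-laden sketch rather than the library's implementation: the function name, the `fraction` of entries that get modified, and the `rng` parameter are hypothetical, since the documentation above does not say which entries are changed.

```python
import random
import statistics


def add_standard_deviations_sketch(values, min_std=1, max_std=5, fraction=0.05, rng=None):
    """Add between `min_std` and `max_std` standard deviations to some entries.

    Hypothetical helper: `fraction` and `rng` are illustrative parameters,
    not part of complexifier's documented API.
    """
    if rng is None:
        rng = random.Random()
    result = list(values)
    if len(result) < 2:
        return result  # a sample standard deviation needs at least two points
    std = statistics.stdev(result)
    n_rows = max(1, round(len(result) * fraction))
    for idx in rng.sample(range(len(result)), n_rows):
        # Pick a whole number of standard deviations in [min_std, max_std].
        result[idx] += rng.randint(min_std, max_std) * std
    return result


column = [float(i) for i in range(100)]
shifted = add_standard_deviations_sketch(column, rng=random.Random(2))
print([round(b - a, 2) for a, b in zip(column, shifted) if a != b])
```

Because the shift is scaled by the column's own standard deviation, the anomalies stand out regardless of the column's units.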

duplicate_rows

complexifier.duplicate_rows(df: DataFrame, sample_size=None) -> DataFrame

Adds duplicate rows to a DataFrame.

Parameters:
  • df (pd.DataFrame) – DataFrame to which duplicates will be added.

  • sample_size (int, optional) – Number of rows to duplicate. If not specified, a random 1% to 10% of the rows is duplicated.

Returns:

The DataFrame with duplicate rows added.

Return type:

pd.DataFrame
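The `sample_size` default can be illustrated with a short sketch that uses a list of dictionaries in place of a DataFrame. It is not the library's implementation: the function name and the `rng` parameter are hypothetical, and appending the duplicates at the end is an assumption.

```python
import random


def duplicate_rows_sketch(rows, sample_size=None, rng=None):
    """Append `sample_size` randomly chosen copies of existing rows.

    Hypothetical helper: `rng` is illustrative, and placing duplicates at
    the end is an assumption, not documented complexifier behaviour.
    """
    if rng is None:
        rng = random.Random()
    rows = list(rows)
    if not rows:
        return rows
    if sample_size is None:
        # Fall back to a random 1% to 10% of the rows, with at least one row.
        sample_size = max(1, round(len(rows) * rng.uniform(0.01, 0.10)))
    duplicates = [dict(rng.choice(rows)) for _ in range(sample_size)]
    return rows + duplicates


table = [{"id": i, "name": f"item-{i}"} for i in range(50)]
print(len(duplicate_rows_sketch(table, sample_size=3)))
```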

add_nulls

complexifier.add_nulls(df: DataFrame, columns=None, min_percent=1, max_percent=10) -> DataFrame

Inserts null values into specified DataFrame columns.

Parameters:
  • df (pd.DataFrame) – The DataFrame to modify.

  • columns (list or str, optional) – Specific columns to add nulls to. Defaults to all columns if not specified.

  • min_percent (int, optional) – Minimum percentage of null values to insert. Defaults to 1%

  • max_percent (int, optional) – Maximum percentage of null values to insert. Defaults to 10%

Returns:

The DataFrame with null values inserted.

Return type:

pd.DataFrame
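How `min_percent` and `max_percent` might interact can be sketched on a plain list standing in for one column, with `None` playing the role of a null value. This is a sketch under assumptions, not the library's implementation: the function name and the `rng` parameter are hypothetical.

```python
import random


def add_nulls_sketch(values, min_percent=1, max_percent=10, rng=None):
    """Replace a random share of the entries, between the two percentages, with None.

    Hypothetical helper: `rng` is illustrative and not part of
    complexifier's documented API.
    """
    if rng is None:
        rng = random.Random()
    result = list(values)
    percent = rng.uniform(min_percent, max_percent)
    n_nulls = round(len(result) * percent / 100)
    for idx in rng.sample(range(len(result)), n_nulls):
        result[idx] = None
    return result


column = list(range(200))
with_nulls = add_nulls_sketch(column, rng=random.Random(3))
print(with_nulls.count(None), "values replaced with None")
```

Setting both percentages to the same value fixes the share of nulls, which can be handy when every student should receive an equally damaged dataset.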

mess_it_up

complexifier.mess_it_up(df: DataFrame, columns=None, min_std=1, max_std=5, sample_size=None, min_percent=1, max_percent=10, introduce_spag=True, add_outliers=True, add_std=True, duplicate=True, add_null=True) -> DataFrame

Applies several of the functions above in one call to add spelling errors, outliers, standard deviations, duplicate rows and null values. Each step can be switched off with its boolean flag.

Parameters:
  • df (pd.DataFrame) – The DataFrame to modify.

  • columns (list or str, optional) – Specific columns to modify. Defaults to all columns if not specified.

  • min_std (int, optional) – Minimum number of standard deviations to add. Defaults to 1

  • max_std (int, optional) – Maximum number of standard deviations to add. Defaults to 5

  • sample_size (int, optional) – Number of rows to duplicate. Randomly selected if not specified.

  • min_percent (int, optional) – Minimum percentage of null values to insert. Defaults to 1%

  • max_percent (int, optional) – Maximum percentage of null values to insert. Defaults to 10%

  • introduce_spag (bool, optional) – Adds spelling and grammar errors into string data. Defaults to True

  • add_outliers (bool, optional) – Adds outliers to numerical data. Defaults to True

  • add_std (bool, optional) – Adds standard deviations to the data. Defaults to True

  • duplicate (bool, optional) – Adds duplicate rows to the data. Defaults to True

  • add_null (bool, optional) – Adds null values to the dataset. Defaults to True

Returns:

The modified DataFrame.

Return type:

pd.DataFrame

Contributing

Feel free to contribute by submitting a pull request on GitHub. For large changes, please open an issue first to discuss them.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact Information

For support or inquiries, please contact Ruy at ruyzambrano@gmail.com

Changelog

Version 0.3.3

Badges

https://github.com/ruyzambrano/complexifier/workflows/Test/badge.svg