Complexifier
Make your pandas even worse!
Complexifier is a Python library crafted to transform clean datasets into messy versions by introducing random errors and anomalies. This is particularly useful for educational purposes, where students learn to clean data through practical experience.
Problem
When teaching students to work with data, an important lesson is how to clean it.
The difficulty is that datasets available on the internet tend to fall into two types:
Data that is good, but already cleaned
Data that is not cleaned, but is terrible and incomprehensible
Complexifier solves this problem by allowing you to take the former and turn it into a better version of the latter!
Dependencies
Complexifier relies on the following packages:
pandas
typo
random
Ensure pandas and typo are installed in your environment. random is part of the Python standard library and needs no separate installation.
Installation
You can install complexifier via pip:
pip install complexifier
Usage
Once installed, use complexifier to add mistakes and simulate anomalies in your data. This library provides several methods:
Methods
create_spag_error
- complexifier.create_spag_error(word: str) -> str
Introduces a spelling error into a word with a 10% probability.
- Parameters:
word (str) – The original word to potentially alter.
- Returns:
The original word or a word with a random spelling error.
- Return type:
str
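The 10% chance described above can be pictured with a small standard-library sketch. This is not complexifier's implementation: the helper name maybe_misspell, its probability parameter, and the choice of swapping two adjacent letters as the "spelling error" are assumptions made purely for illustration.

```python
import random


def maybe_misspell(word: str, probability: float = 0.1, rng=None) -> str:
    """Hypothetical helper: return `word`, or a copy with two adjacent letters swapped."""
    if rng is None:
        rng = random.Random()
    # Words shorter than two characters have nothing to swap.
    if len(word) < 2 or rng.random() >= probability:
        return word
    i = rng.randrange(len(word) - 1)
    # Swap the characters at positions i and i + 1.
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


rng = random.Random(0)  # seeded so the run is repeatable
words = ["penguin"] * 1000
altered = [maybe_misspell(w, rng=rng) for w in words]
changed = sum(1 for before, after in zip(words, altered) if before != after)
print(f"{changed} of {len(words)} words were altered")
```

Running the helper over many copies of the same word shows roughly one in ten coming back altered, which is the kind of sparse noise students have to hunt for.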
introduce_spag_error
- complexifier.introduce_spag_error(df: DataFrame, columns=None) -> DataFrame
Applies random spelling errors to specified columns in a DataFrame.
- Parameters:
df (pd.DataFrame) – The DataFrame to alter.
columns (list or str, optional) – Column names to apply errors to. Defaults to all string columns.
- Returns:
The DataFrame with potential spelling errors introduced.
- Return type:
pd.DataFrame
add_or_subtract_outliers
- complexifier.add_or_subtract_outliers(df: DataFrame, columns=None) -> DataFrame
Adds or subtracts a random integer in the selected columns for between 1% and 10% of the rows.
- Parameters:
df (pd.DataFrame) – The DataFrame to modify.
columns (list or str, optional) – Column names to adjust. Defaults to all numeric columns if not specified.
- Returns:
The DataFrame with outliers added.
- Return type:
pd.DataFrame
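To make the "between 1% and 10% of the rows" behaviour concrete, here is a plain-Python sketch that works on a list rather than a DataFrame. It is an illustration under assumptions, not the library's code: the helper name nudge_some_values, the max_shift parameter, and the rule of always touching at least one row are all hypothetical.

```python
import random


def nudge_some_values(values, max_shift=100, rng=None):
    """Hypothetical helper: shift 1%-10% of entries by a random non-zero integer."""
    if rng is None:
        rng = random.Random()
    result = list(values)
    if not result:
        return result
    # Pick how many rows to touch: between 1% and 10% of them, at least one.
    fraction = rng.uniform(0.01, 0.10)
    count = max(1, round(len(result) * fraction))
    for index in rng.sample(range(len(result)), count):
        shift = rng.randint(1, max_shift)
        # Add or subtract with equal probability.
        result[index] += shift if rng.random() < 0.5 else -shift
    return result


rng = random.Random(1)  # seeded so the run is repeatable
ages = [30] * 200
noisy = nudge_some_values(ages, rng=rng)
touched = sum(1 for before, after in zip(ages, noisy) if before != after)
print(f"{touched} of {len(ages)} values were shifted")
```

Because the affected rows are sampled without replacement, each chosen row is shifted exactly once.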
add_standard_deviations
- complexifier.add_standard_deviations(df: DataFrame, columns=None, min_std=1, max_std=5) -> DataFrame
Adds random deviations to entries in specified numeric columns to simulate data anomalies.
- Parameters:
df (pd.DataFrame) – The DataFrame to manipulate.
columns (list or str, optional) – Column names to modify. Defaults to numeric columns if not specified.
min_std (int, optional) – Minimum number of standard deviations to add. Defaults to 1
max_std (int, optional) – Maximum number of standard deviations to add. Defaults to 5
- Returns:
The DataFrame with deviations added.
- Return type:
pd.DataFrame
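The idea of adding a number of standard deviations can be sketched with the statistics module. This is only an approximation of the behaviour described above: the helper name add_deviations is hypothetical, and shifting a single randomly chosen entry upward by a whole number of standard deviations is an assumption, since the description does not say how many entries are altered or in which direction.

```python
import random
import statistics


def add_deviations(values, min_std=1, max_std=5, rng=None):
    """Hypothetical helper: push one random entry several standard deviations upward."""
    if rng is None:
        rng = random.Random()
    result = list(values)
    if len(result) < 2:
        return result  # a sample standard deviation needs at least two points
    spread = statistics.stdev(result)
    index = rng.randrange(len(result))
    # Choose a whole number of standard deviations between min_std and max_std.
    multiplier = rng.randint(min_std, max_std)
    result[index] += multiplier * spread
    return result


rng = random.Random(2)  # seeded so the run is repeatable
heights = [170.0, 172.0, 168.0, 171.0, 169.0]
shifted = add_deviations(heights, rng=rng)
print(shifted)
```

An entry moved this way stands out against the rest of the column, which makes it a natural target for z-score or interquartile-range checks in a cleaning exercise.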
duplicate_rows
- complexifier.duplicate_rows(df: DataFrame, sample_size=None) -> DataFrame
Adds duplicate rows to a DataFrame.
- Parameters:
df (pd.DataFrame) – DataFrame to which duplicates will be added.
sample_size (int, optional) – Number of rows to duplicate. If not specified, a random percentage between 1% and 10% of the rows is used.
- Returns:
The DataFrame with duplicate rows added.
- Return type:
pd.DataFrame
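Row duplication, including the random 1% to 10% default, can be pictured with a list of dictionaries standing in for a DataFrame. This is a sketch under assumptions rather than complexifier's implementation: the helper name duplicate_some_rows is hypothetical, as are the choices to sample with replacement and to append the copies at the end.

```python
import random


def duplicate_some_rows(rows, sample_size=None, rng=None):
    """Hypothetical helper: append copies of randomly chosen rows."""
    if rng is None:
        rng = random.Random()
    if not rows:
        return []
    if sample_size is None:
        # Fall back to a random share between 1% and 10% of the rows, at least one.
        sample_size = max(1, round(len(rows) * rng.uniform(0.01, 0.10)))
    # choices() samples with replacement, so a row may be duplicated more than once.
    extras = [dict(row) for row in rng.choices(rows, k=sample_size)]
    return list(rows) + extras


rows = [{"id": i, "name": f"user{i}"} for i in range(50)]
with_dupes = duplicate_some_rows(rows, sample_size=5, rng=random.Random(3))
print(len(with_dupes))  # 55: the 50 originals plus 5 copies
```

In a real exercise the duplicates would usually be shuffled into the data rather than appended, so that students cannot find them by looking at the tail of the table.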
add_nulls
- complexifier.add_nulls(df: DataFrame, columns=None, min_percent=1, max_percent=10) -> DataFrame
Inserts null values into specified DataFrame columns.
- Parameters:
df (pd.DataFrame) – The DataFrame to modify.
columns (list or str, optional) – Specific columns to add nulls to. Defaults to all columns if not specified.
min_percent (int, optional) – Minimum percentage of null values to insert. Defaults to 1%
max_percent (int, optional) – Maximum percentage of null values to insert. Defaults to 10%
- Returns:
The DataFrame with null values inserted.
- Return type:
pd.DataFrame
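The effect of min_percent and max_percent can be illustrated on a single column represented as a list. The helper below is hypothetical and is not taken from complexifier: the name blank_out_values, the use of None as the null marker, and the rule of always blanking at least one entry are assumptions.

```python
import random


def blank_out_values(values, min_percent=1, max_percent=10, rng=None):
    """Hypothetical helper: replace a random share of entries with None."""
    if rng is None:
        rng = random.Random()
    result = list(values)
    if not result:
        return result
    # Draw the share of entries to blank, expressed as a percentage.
    percent = rng.uniform(min_percent, max_percent)
    count = max(1, round(len(result) * percent / 100))
    for index in rng.sample(range(len(result)), count):
        result[index] = None
    return result


rng = random.Random(5)  # seeded so the run is repeatable
cities = ["Lima"] * 300
with_gaps = blank_out_values(cities, rng=rng)
missing = sum(1 for value in with_gaps if value is None)
print(f"{missing} of {len(cities)} values are now missing")
```

With the default bounds, a 300-entry column ends up with between 3 and 30 missing values.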
mess_it_up
- complexifier.mess_it_up(df: DataFrame, columns=None, min_std=1, max_std=5, sample_size=None, min_percent=1, max_percent=10, introduce_spag=True, add_outliers=True, add_std=True, duplicate=True, add_null=True) -> DataFrame
Applies several of the functions above to add spelling errors, outliers, standard deviations, duplicate rows and null values. Each step can be switched off with its boolean flag.
- Parameters:
df (pd.DataFrame) – The DataFrame to modify.
columns (list or str, optional) – Specific columns to modify. Defaults to all columns if not specified.
min_std (int, optional) – Minimum number of standard deviations to add. Defaults to 1
max_std (int, optional) – Maximum number of standard deviations to add. Defaults to 5
sample_size (int, optional) – Number of rows to duplicate. Randomly selected if not specified.
min_percent (int, optional) – Minimum percentage of null values to insert. Defaults to 1%
max_percent (int, optional) – Maximum percentage of null values to insert. Defaults to 10%
introduce_spag (bool, optional) – Adds spelling and grammar errors into string data. Defaults to True
add_outliers (bool, optional) – Adds outliers to numerical data. Defaults to True
add_std (bool, optional) – Adds standard deviations to the data. Defaults to True
duplicate (bool, optional) – Adds duplicate rows to the data. Defaults to True
add_null (bool, optional) – Adds null values to the dataset. Defaults to True
- Returns:
The modified DataFrame.
- Return type:
pd.DataFrame
Contributing
Feel free to contribute by submitting a pull request on GitHub. For large changes, please open an issue to discuss before implementing changes.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact Information
For support or inquiries, please contact Ruy at ruyzambrano@gmail.com
Changelog
Version 0.3.3