Recently, I took on a project challenge that required building a machine learning algorithm to anonymize the personal information in invoices. You can find the code on GitHub here.

The dataset contained sensitive personal information, such as client names, signatures, and handwritten notes, that needed to be anonymized. The goal was for the anonymized data to remain effective as training data by preserving the style, orientation, complexity, and imperfections of the original data.

The anonymized data was also supposed to be simpler and clearer to read than the original data. So I wrote a short script that performs these tasks and transforms the dataset while preserving the same key information.

Below, I will detail every step that I took, together with the AWS and GitHub links for the source code.

THE TASK 🙂

Design and develop a system that uses a machine learning algorithm to replace the content of a dataset with an anonymized version, such that the personally identifying information is removed and the anonymized data remains an effective training dataset.

For this task, I worked with a dataset of 25,000 invoices and invoice-like documents from the RVL-CDIP dataset, along with the ground truth labels of 1,000 labelled invoices from the same dataset.

To start, we will use a small file that contains only three images to anonymize before moving on to the larger dataset. The zip folder with the three images contains a PNG file, a PNG file with the bounding boxes, and a JSON file with the coordinates of the bounding boxes. So, we will start with the three images first ….

Okay, Let’s Start ….

First, run:

pip freeze

This lists all the packages installed with pip. It is better to check whether a package of interest is already installed before installing anything :).
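To check for a single package from inside the notebook, rather than scanning the full pip freeze output, something like the following could work. This is a minimal sketch: installed_version is a hypothetical helper name, and the package names in the loop are simply the ones used in this post.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the packages this post relies on.
for name in ["pandas", "numpy", "scipy", "sklearn-pandas"]:
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

Anything reported as not installed can then be installed with %pip install from a notebook cell.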

I will be using Jupyter notebook as my environment. First, let us import all the required libraries:

import pandas as pd
import numpy as np
import scipy.stats
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder

# get rid of warnings
import warnings
warnings.filterwarnings("ignore")
# get more than one output per Jupyter cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# for functions we implement later
from utils import best_fit_distribution
from utils import plot_result

I will assume that you are familiar with the libraries imported above. One worth mentioning is sklearn_pandas, a convenience library that bridges the gap between pandas and scikit-learn. It provides a DataFrameMapper class that makes working with pandas DataFrames easier, because it lets us change the encoding of the variables with very little code.

If you do not have sklearn-pandas, you can install it by using:

%pip install sklearn-pandas

The IPython.core.interactiveshell setting lets a cell display more than one output. More Jupyter tips can be found in this blog post. Lastly, we will put some of the code into a file called utils.py, placed next to the notebook.

import json

# Load the JSON file into a DataFrame; the with block closes the file for us
with open('imagesh_htl67e00_2063610012.json') as openfile:
    jsondata = json.load(openfile)

df = pd.DataFrame(jsondata)
print(df)

The output will be as shown below:

For the analysis step, we will use the three dataset images:

df.shape
df.head()

Output:

Now that the dataset is loaded, we can strip the personally identifying information by removing the names, signatures, and handwritten notes, if they are present. In this case, we do not have them.

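# errors="ignore" skips any listed column that is not present in the DataFrame
df.drop(columns=["name", "signature"], inplace=True, errors="ignore")  # dropped because the values are unique to each record
df.drop(columns=["handwritten notes"], inplace=True, errors="ignore")  # dropped because the values are almost unique to each record
df.dropna(inplace=True)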

The drop function removes the name, signature, and handwritten notes columns from the dataset. This dataset does not have those columns, so the drop step is not needed here. Then check the output again using:

df.shape
df.head()

The next step is to use the DataFrameMapper class from the sklearn_pandas package. It accepts a list of tuples in which the first item of each tuple is a column name and the second item is a transformer.

We will use LabelEncoder(), but you can also use other transformers such as MinMaxScaler(), FunctionTransformer(), and/or StandardScaler(). We then join the encoded data with the rest of the data to make it more readable.
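As a rough illustration of the encode-then-join idea, here is a minimal sketch that applies LabelEncoder directly with pandas instead of going through DataFrameMapper. The toy DataFrame and its column names (vendor, amount, vendor_code) are made up for the example and are not from the invoice dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the column names are hypothetical.
toy = pd.DataFrame({
    "vendor": ["acme", "globex", "acme", "initech"],
    "amount": [120.0, 75.5, 310.0, 42.0],
})

encoder = LabelEncoder()
# fit_transform maps each distinct string to an integer code.
encoded = pd.Series(
    encoder.fit_transform(toy["vendor"]),
    index=toy.index,
    name="vendor_code",
)

# Join the encoded column with the remaining columns, dropping the raw text.
result = toy.drop(columns=["vendor"]).join(encoded)
print(result)
```

The original strings stay recoverable through encoder.classes_, so for real anonymization the fitted encoder should not be shared along with the encoded data.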

ANONYMIZING BY SAMPLING FROM SAME DISTRIBUTION

The snippet below is used to show that we have one column and 1,000 rows in the dataset.

import numpy as np

X = np.array([df])
X.shape

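Next, we process the categorical and continuous variables separately, which also helps in creating a histogram later. Here, categorical and continuous are assumed to be lists of column names defined beforehand: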

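# Categorical columns: sample values with their observed frequencies
for c in categorical:
    counts = df[c].value_counts()
    sample = np.random.choice(list(counts.index), p=(counts/len(df)).values, size=5)
    print(c, sample)

# Continuous columns: record the best-fitting distribution of each column
best_distributions = []
for c in continuous:
    data = df[c]
    best_fit_name, best_fit_params = best_fit_distribution(data, 50)
    best_distributions.append((best_fit_name, best_fit_params))
print(best_distributions)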
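The best_fit_distribution helper is imported from utils.py and is not shown in this post. The sketch below is one plausible way such a helper could work, not the actual implementation: it fits a few scipy.stats distributions and keeps the one whose density is closest to the data histogram in squared error. The CANDIDATES list and the meaning of the second argument (number of histogram bins) are assumptions.

```python
import warnings

import numpy as np
import scipy.stats as st

# Small, hypothetical candidate list; a real run might try many more.
CANDIDATES = ["norm", "expon", "gamma", "lognorm"]


def best_fit_distribution(data, bins=50):
    """Return (name, params) of the candidate with the lowest squared error."""
    # Empirical density and bin centres of the observed data.
    y, edges = np.histogram(data, bins=bins, density=True)
    x = (edges[:-1] + edges[1:]) / 2.0

    best_name, best_params, best_sse = None, None, np.inf
    for name in CANDIDATES:
        dist = getattr(st, name)
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                # fit returns (*shape_params, loc, scale).
                params = dist.fit(data)
                pdf = dist.pdf(x, *params)
            sse = np.sum((y - pdf) ** 2)
        except Exception:
            continue  # skip candidates that fail to fit
        if np.isfinite(sse) and sse < best_sse:
            best_name, best_params, best_sse = name, params, sse
    return best_name, best_params


# Usage on synthetic data drawn from a normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=100.0, scale=15.0, size=2000)
name, params = best_fit_distribution(sample, 50)
print(name, params)
```

The returned pair has the same (name, params) shape that the loop above appends to best_distributions.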

PUTTING THE CODE IN ONE FUNCTION

def generate_like_df(df, categorical_cols, continuous_cols, best_distributions, n, seed=0):
    # Fix the seed so that the generated data is reproducible
    np.random.seed(seed)
    d = {}

    # Categorical columns: sample values with their observed frequencies
    for c in categorical_cols:
        counts = df[c].value_counts()
        d[c] = np.random.choice(list(counts.index), p=(counts/len(df)).values, size=n)

    # Continuous columns: draw from the best-fitting distribution of each column
    for c, bd in zip(continuous_cols, best_distributions):
        dist = getattr(scipy.stats, bd[0])
        d[c] = dist.rvs(*bd[1], size=n)

    return pd.DataFrame(d, columns=categorical_cols+continuous_cols)

The function is used to create new observations:

gendf = generate_like_df(df, categorical, continuous, best_distributions, n=100)
gendf.shape
gendf.head()
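One way to sanity-check that generated data still resembles the original is to compare category shares and run a two-sample Kolmogorov-Smirnov test on a continuous column. The sketch below does this on toy data under stated assumptions: the column names (currency, total) are made up, and a normal distribution stands in for whatever best_fit_distribution would pick.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(0)

# Toy "original" data; column names are hypothetical.
original = pd.DataFrame({
    "currency": rng.choice(["USD", "EUR", "GBP"], p=[0.6, 0.3, 0.1], size=1000),
    "total": rng.normal(loc=250.0, scale=40.0, size=1000),
})

# Generate synthetic data in the same spirit as generate_like_df: empirical
# frequencies for the categorical column, a fitted distribution for the
# continuous one.
counts = original["currency"].value_counts(normalize=True)
loc, scale = st.norm.fit(original["total"])
synthetic = pd.DataFrame({
    "currency": rng.choice(counts.index.to_numpy(), p=counts.to_numpy(), size=1000),
    "total": st.norm.rvs(loc, scale, size=1000, random_state=rng),
})

# Categorical check: largest gap between category shares.
syn_counts = synthetic["currency"].value_counts(normalize=True)
max_gap = (counts - syn_counts.reindex(counts.index, fill_value=0.0)).abs().max()

# Continuous check: two-sample Kolmogorov-Smirnov test.
ks_stat, p_value = st.ks_2samp(original["total"], synthetic["total"])
print(f"max category gap: {max_gap:.3f}, KS statistic: {ks_stat:.3f}, p-value: {p_value:.3f}")
```

A small category gap and a small KS statistic suggest that the synthetic columns follow the original distributions; neither check says anything about relationships between columns, which this sampling approach does not preserve.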

gendf.columns = list(range(gendf.shape[1]))

Let’s save the results in an output file now:

gendf.to_csv("output.csv", index_label="id")

Plotting the results for visualization:

def plot_result(df, continuous, best_distributions):
    for c, (best_fit_name, best_fit_params) in zip(continuous, best_distributions):
        best_dist = getattr(scipy.stats, best_fit_name)
        # make_pdf is a helper that is not shown in this post
        pdf = make_pdf(best_dist, best_fit_params)
        _ = plt.figure(figsize=(12,8))
        ax = pdf.plot(lw=2, label='PDF', legend=True)
        # density=True replaces normed=True, which newer matplotlib releases no longer accept
        _ = df[c].plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)
        # Build a readable title such as "norm(loc=0.00, scale=1.00)"
        param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
        param_str = ', '.join([f'{k}={v:0.2f}' for k,v in zip(param_names, best_fit_params)])
        dist_str = f'{best_fit_name}({param_str})'
        _ = ax.set_title(c+ " " + dist_str)
        _ = ax.set_ylabel('Frequency')
        plt.show()
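The make_pdf helper called in plot_result is not defined anywhere in this post. A minimal sketch of what it might look like is given below, assuming the parameters are ordered the way scipy.stats fit returns them (shape parameters first, then loc and scale). The size argument and the 1% to 99% plotting range are choices made for this example.

```python
import numpy as np
import pandas as pd
import scipy.stats as st


def make_pdf(dist, params, size=10000):
    """Return the fitted density as a pandas Series indexed by x values."""
    # scipy fit results are ordered as (*shape_params, loc, scale).
    shape_args, loc, scale = params[:-2], params[-2], params[-1]

    # Cover the central 98% of the distribution.
    start = dist.ppf(0.01, *shape_args, loc=loc, scale=scale)
    end = dist.ppf(0.99, *shape_args, loc=loc, scale=scale)

    x = np.linspace(start, end, size)
    y = dist.pdf(x, *shape_args, loc=loc, scale=scale)
    return pd.Series(y, index=x)


# Usage: density of a standard normal distribution.
pdf = make_pdf(st.norm, (0.0, 1.0))
print(pdf.head())
```

Because the result is a Series indexed by x, calling pdf.plot(...) draws the curve directly, which is how plot_result uses it.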

You can find all the source code on GitHub here.

If you have any questions or comments, do not hesitate to ask us.

Quote: The moon looks upon many night flowers; the night flowers see but one moon. – Jean Ingelow