Scaling data is a constant challenge in analytics projects. Whether you need a visualization to fit on screen or an independent variable with its outliers tamed, scaling transforms data from one domain into another.
Let's think about this by example. Say you wanted to analyze the US Government budget. You would find that the US Federal government spends about $2 billion on deposit insurance and $1 trillion on Medicare. A huge range. But what if we wanted to know what share each would get if the entire US national budget were only $100?
In that case we would scale the budgets to between $0.00 and $100.00. There are many mathematical approaches to scaling data. In this cookbook, we will compare the most common ones in an attempt to pick the best.
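Before reaching for a library, the core idea can be sketched in a few lines of plain Python: rescale each budget to its share of a hypothetical $100 total. The two figures below are the rounded amounts mentioned above, used purely for illustration.

```python
# Illustrative figures (rounded from the text above)
budgets = {"deposit_insurance": 2e9, "medicare": 1e12}

# Rescale so the whole "budget" sums to $100
total = sum(budgets.values())
share_of_100 = {name: 100 * amount / total for name, amount in budgets.items()}

for name, dollars in share_of_100.items():
    print(f"{name}: ${dollars:.2f}")
```

Even this naive proportional rescaling makes the point: Medicare swallows nearly the whole $100, while deposit insurance barely registers.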
About the data
The dataset for this analysis is the US Federal Account balances, which can be downloaded from the USAspending website.
Let's load the data and take a look. Specifically, we are going to focus on one main column:
- budget_authority_appropriated_amount: The amount Congress decided this account should have.
import pandas as pd
import matplotlib.pyplot as plt

# Display floats as dollar amounts
pd.options.display.float_format = '${:,.2f}'.format

budget_data_raw = pd.read_csv("data/all_federal_account_account_balances_1.csv")
budget_data_raw.head()
Let's group the data by the "budget subfunction" as defined by the Department of the Treasury.
from matplotlib.ticker import FormatStrFormatter

budget_function_data = (
    budget_data_raw
    .groupby("budget_subfunction")['budget_authority_appropriated_amount']
    .sum()
    .sort_values()
)

ax1 = budget_function_data.plot(kind='bar')
ax1.get_xaxis().set_visible(False)  # the subfunction labels are too long to display
ax1.yaxis.set_major_formatter(FormatStrFormatter('$%.0f'))
scikit-learn contains most of the functions necessary to support scaling in a wide variety of use cases.
Below we will create a DataFrameMapper that applies five of the most common scaling functions.
from sklearn.preprocessing import (MinMaxScaler, MaxAbsScaler, RobustScaler,
                                   QuantileTransformer, PowerTransformer)
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
(['budget_authority_appropriated_amount'], MinMaxScaler(feature_range=(0,100))
, {'alias': 'MinMaxScaler'}),
(['budget_authority_appropriated_amount'], MaxAbsScaler()
, {'alias': 'MaxAbsScaler'}),
(['budget_authority_appropriated_amount'], RobustScaler()
, {'alias': 'RobustScaler'}),
(['budget_authority_appropriated_amount'], QuantileTransformer(n_quantiles=10)
, {'alias': 'QuantileTransformer'}),
(['budget_authority_appropriated_amount'], PowerTransformer()
, {'alias': 'PowerTransformer'})
], df_out=True)
scaled_data = mapper.fit_transform(budget_function_data.reset_index())
Now let's take a look at the results:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter
fig = plt.figure(figsize=(20,10))
fig.suptitle('SkLearn Scaling Functions', fontsize=16)
# 3 x 2 grid of bar charts: the original data plus the five scaled versions
ax0 = plt.subplot2grid((3, 2), (0, 0))
budget_function_data.plot(kind='bar', title='Original Data', ax=ax0)
ax0.yaxis.set_major_formatter(FormatStrFormatter('$%.0f'))

ax1 = plt.subplot2grid((3, 2), (1, 0))
scaled_data['MinMaxScaler'].plot(kind='bar', title='MinMaxScaler', ax=ax1)

ax2 = plt.subplot2grid((3, 2), (2, 0))
scaled_data['MaxAbsScaler'].plot(kind='bar', title='MaxAbsScaler', ax=ax2)

ax3 = plt.subplot2grid((3, 2), (0, 1))
scaled_data['RobustScaler'].plot(kind='bar', title='RobustScaler', ax=ax3)

ax4 = plt.subplot2grid((3, 2), (1, 1))
scaled_data['QuantileTransformer'].plot(kind='bar', title='QuantileTransformer', ax=ax4)

ax5 = plt.subplot2grid((3, 2), (2, 1))
scaled_data['PowerTransformer'].plot(kind='bar', title='PowerTransformer', ax=ax5)

# Hide the x-axis labels on every subplot
for axis in fig.get_axes():
    axis.get_xaxis().set_visible(False)
The effects and drawbacks should be fairly obvious. Both MinMaxScaler and MaxAbsScaler keep the same overall distribution: the relative difference between, say, "Executive Management" and "Medicare", whose budgets differ by roughly 2000x, is maintained. RobustScaler appears similar but is subtly different. It centers the data on the median and scales it by the interquartile range (the spread between the 25th and 75th percentiles). In this case Medicare and a few other giant accounts still dominate, but the scaling is driven by the bulk of the accounts rather than by the extremes. Thus, outliers have less influence on the scaling while relative differences are maintained across the whole population.
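The median-and-IQR behavior of RobustScaler can be verified by reproducing it by hand. The sketch below uses a small made-up array with one extreme outlier, not the budget data:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy data with one large outlier (illustrative values, not the budget data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaled = RobustScaler().fit_transform(x)

# Reproduce the transform manually: (x - median) / IQR
median = np.median(x)
q75, q25 = np.percentile(x, [75, 25])
manual = (x - median) / (q75 - q25)

print(np.allclose(scaled, manual))
```

Note that the outlier is not clipped; it is simply no longer allowed to dictate the scale of the other points.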
The last two, PowerTransformer and QuantileTransformer, are deliberately manipulative. Each forces the population to fit a specific distribution. This removes the vast differences in scale inherent in the data and makes things seem more comparable. But beware: looking at these charts it appears that Medicare and the Defense Department are just a little more expensive than, say, interest payments on the debt; in reality they are an order of magnitude apart, as shown below:
budget_function_data.tail()
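To make that compression concrete, here is a small sketch, using made-up values spanning five orders of magnitude, of how QuantileTransformer maps values to their ranks rather than their magnitudes:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Illustrative values spanning five orders of magnitude
x = np.array([[1.0], [10.0], [100.0], [1000.0], [100000.0]])

# n_quantiles matches the sample count, so each point lands exactly on a quantile
qt = QuantileTransformer(n_quantiles=5)
print(qt.fit_transform(x).ravel())
```

The output is evenly spaced between 0 and 1: only the ordering of the values survives, which is exactly why Medicare and interest payments look deceptively close in the chart above.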