Scaling data is a constant challenge in analytics projects. Whether you need a visualization to fit on screen or an independent variable with its outliers tamed, scaling transforms data from one domain into another.
Let's think about this by example. Say you wanted to analyze the US Government budget. You would find that the US Federal government spends about $2 billion on deposit insurance and $1 trillion on Medicare. A huge range. But what if we wanted to know what share each would get if the entire US national budget were only $100?
In that case we would scale the budgets to between $0.00 and $100.00. There are many mathematical approaches to scaling data. In this cookbook, we will compare the most common ones in an attempt to pick the best.
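Before reaching for a library, the core idea can be sketched in a few lines of plain Python: rescale each budget to its share of a hypothetical $100 total. The two figures below are the rounded amounts mentioned above, used purely for illustration.

```python
# Illustrative figures (rounded from the text above)
budgets = {"deposit_insurance": 2e9, "medicare": 1e12}

# Rescale so the whole "budget" sums to $100
total = sum(budgets.values())
share_of_100 = {name: 100 * amount / total for name, amount in budgets.items()}

for name, dollars in share_of_100.items():
    print(f"{name}: ${dollars:.2f}")
```

Even this naive proportional rescaling makes the point: Medicare swallows nearly the whole $100, while deposit insurance barely registers.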
About the data
The dataset for this analysis is the US Federal Account balances, which can be downloaded from the USAspending website.
Let's load the data and take a look. Specifically, we are going to focus on one main column:
- budget_authority_appropriated_amount: The amount Congress decided this account should have.
import pandas as pd
import matplotlib.pyplot as plt

# Display floats as dollar amounts
pd.options.display.float_format = '${:,.2f}'.format

budget_data_raw = pd.read_csv("data/all_federal_account_account_balances_1.csv")
budget_data_raw.head()
Let's group the data by the "budget subfunction" as defined by the Department of the Treasury.
from matplotlib.ticker import FormatStrFormatter

budget_function_data = (
    budget_data_raw
    .groupby("budget_subfunction")['budget_authority_appropriated_amount']
    .sum()
    .sort_values()
)

ax1 = budget_function_data.plot(kind='bar')
ax1.get_xaxis().set_visible(False)  # the subfunction labels are too long to display
ax1.yaxis.set_major_formatter(FormatStrFormatter('$%.0f'))
scikit-learn contains most of the functions necessary to support scaling in a wide variety of use cases.
Below we will create a DataFrameMapper that applies five of the most common scaling functions.
from sklearn.preprocessing import (MinMaxScaler, MaxAbsScaler, RobustScaler,
                                   QuantileTransformer, PowerTransformer)
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
(['budget_authority_appropriated_amount'], MinMaxScaler(feature_range=(0,100))
, {'alias': 'MinMaxScaler'}),
(['budget_authority_appropriated_amount'], MaxAbsScaler()
, {'alias': 'MaxAbsScaler'}),
(['budget_authority_appropriated_amount'], RobustScaler()
, {'alias': 'RobustScaler'}),
(['budget_authority_appropriated_amount'], QuantileTransformer(n_quantiles=10)
, {'alias': 'QuantileTransformer'}),
(['budget_authority_appropriated_amount'], PowerTransformer()
, {'alias': 'PowerTransformer'})
], df_out=True)
scaled_data = mapper.fit_transform(budget_function_data.reset_index())
Now let's take a look at the results:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter
fig = plt.figure(figsize=(20,10))
fig.suptitle('SkLearn Scaling Functions', fontsize=16)
# 3 x 2 grid of bar charts: the original data plus the five scaled versions
ax0 = plt.subplot2grid((3, 2), (0, 0))
budget_function_data.plot(kind='bar', title='Original Data', ax=ax0)
ax0.yaxis.set_major_formatter(FormatStrFormatter('$%.0f'))

ax1 = plt.subplot2grid((3, 2), (1, 0))
scaled_data['MinMaxScaler'].plot(kind='bar', title='MinMaxScaler', ax=ax1)

ax2 = plt.subplot2grid((3, 2), (2, 0))
scaled_data['MaxAbsScaler'].plot(kind='bar', title='MaxAbsScaler', ax=ax2)

ax3 = plt.subplot2grid((3, 2), (0, 1))
scaled_data['RobustScaler'].plot(kind='bar', title='RobustScaler', ax=ax3)

ax4 = plt.subplot2grid((3, 2), (1, 1))
scaled_data['QuantileTransformer'].plot(kind='bar', title='QuantileTransformer', ax=ax4)

ax5 = plt.subplot2grid((3, 2), (2, 1))
scaled_data['PowerTransformer'].plot(kind='bar', title='PowerTransformer', ax=ax5)

# Hide the x-axis labels on every subplot
for axis in fig.get_axes():
    axis.get_xaxis().set_visible(False)
The effects and drawbacks should be fairly obvious. Both MinMaxScaler and MaxAbsScaler keep the same overall distribution: the relative difference between, say, "Executive Management" and "Medicare", whose budgets differ by roughly 2000x, is maintained. RobustScaler appears similar but is subtly different. It centers the data on the median and scales it by the interquartile range (the spread between the 25th and 75th percentiles). In this case Medicare and a few other giant accounts still dominate, but the scaling is driven by the bulk of the accounts rather than by the extremes. Thus, outliers have less influence on the scaling while relative differences are maintained across the whole population.
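The median-and-IQR behavior of RobustScaler can be verified by reproducing it by hand. The sketch below uses a small made-up array with one extreme outlier, not the budget data:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy data with one large outlier (illustrative values, not the budget data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaled = RobustScaler().fit_transform(x)

# Reproduce the transform manually: (x - median) / IQR
median = np.median(x)
q75, q25 = np.percentile(x, [75, 25])
manual = (x - median) / (q75 - q25)

print(np.allclose(scaled, manual))
```

Note that the outlier is not clipped; it is simply no longer allowed to dictate the scale of the other points.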
The last two, PowerTransformer and QuantileTransformer, are deliberately manipulative. Each forces the population to fit a specific distribution. This removes the vast differences in scale inherent in the data and makes things seem more comparable. But beware: looking at these charts it appears that Medicare and the Defense Department are just a little more expensive than, say, interest payments on the debt; in reality they are an order of magnitude apart, as shown below:
budget_function_data.tail()
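To make that compression concrete, here is a small sketch, using made-up values spanning five orders of magnitude, of how QuantileTransformer maps values to their ranks rather than their magnitudes:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Illustrative values spanning five orders of magnitude
x = np.array([[1.0], [10.0], [100.0], [1000.0], [100000.0]])

# n_quantiles matches the sample count, so each point lands exactly on a quantile
qt = QuantileTransformer(n_quantiles=5)
print(qt.fit_transform(x).ravel())
```

The output is evenly spaced between 0 and 1: only the ordering of the values survives, which is exactly why Medicare and interest payments look deceptively close in the chart above.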