Explorations in Data Science and Health

Who is going to Orlando?

This week the health world turns to Orlando for the annual conference of the Healthcare Information and Management Systems Society (HIMSS). It is one of the biggest events in health IT and one of the top 25 trade shows in the entire country.

How many people attend HIMSS each year? We could search one of the hundreds of press releases about the event put out by exhibitors, or we could follow the Trade Show News link above.

Big Data Approach

How about we measure how many people fly to Orlando during the quarter when HIMSS takes place? The Bureau of Transportation Statistics releases quarterly data files drawn from a 10% sample of all airline itineraries sold each quarter by the major US carriers (those with more than 5% market share).

For this analysis, we downloaded the Q1 2017 file, which corresponds to the last time HIMSS visited Orlando, and the Q1 2015 file, which was presumably a non-HIMSS year, though the conference center likely booked other events for the same period.

However, US airports see about 2.6 million passengers per day, so even a 10% sample of tickets for one quarter is a massive file.

Here are some basic stats on just one of the files:

In [7]:
import subprocess

# Count the rows in the Q1 2017 file with wc -l
lines_in_file = subprocess.run(["wc", "-l", "data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv"], stdout=subprocess.PIPE)
print("Lines in the file {0}".format(lines_in_file.stdout))
Lines in the file b'5866624 data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv\n'

The first step in processing data for our visualizations is typically to load the data into a pandas dataframe, where we can manipulate and prepare it before visualizing. The flexibility and extensibility of pandas come at a cost: it is extremely memory intensive. One of the original authors of the library recommends having 5 to 10 times as much RAM as the size of your dataset. For one quarter of the airline data (1.4 GB), that could mean well over 10 GB of RAM, more than my laptop has.
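
To get a feel for that overhead, a rough check (a minimal sketch; the 100,000-row sample size is an arbitrary choice, not part of the original analysis) is to load a slice of the file into pandas and ask how much memory it actually occupies:

In [ ]:
import pandas as pd

# Read only the first 100,000 rows of the Q1 2017 file as a rough sample
sample = pd.read_csv(
    "data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv",
    nrows=100000)

# deep=True measures the real memory used by object (string) columns
sample_mb = sample.memory_usage(deep=True).sum() / 1e6
print("100,000 rows occupy roughly {0:.0f} MB in memory".format(sample_mb))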

Enter Dask, a project developed by Anaconda that mirrors the pandas API but underpins it with an architecture that parallelizes operations and leaves the data on disk until it is called for. At its most powerful, Dask can spread a dataframe across an entire cluster. But even locally, it is an easy way to keep the same Python data science workflow while deferring the data until it is needed, or in our case, until it has been aggregated for our analysis.
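
As a quick illustration of that lazy model (a minimal sketch on the same Q1 2017 file; it is not part of the original workflow), reading a CSV with Dask returns almost instantly because nothing is loaded yet, and only compute() actually streams through the data:

In [ ]:
import dask.dataframe as dd

# read_csv only inspects the file layout; the rows stay on disk
lazy_df = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv")
print(lazy_df.npartitions)  # number of on-disk chunks Dask will process

# Building an aggregation just records a task graph...
passenger_total = lazy_df['Passengers'].sum()

# ...and compute() is what actually reads the file and runs the sum
print(passenger_total.compute())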

Let's take a look at the data now.

In [8]:
import dask.dataframe as dd
from vega_datasets import data
import altair as alt
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString
import json
import altair_theme
%matplotlib inline
In [9]:
data_Q1_2014 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2014_1.csv")
data_Q1_2015 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2015_1.csv")
data_Q1_2016 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2016_1.csv")
data_Q1_2017 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv")
data_Q1_2018 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2018_1.csv")
In [10]:
data_Q1_2014.head()
Out[10]:
ItinID MktID MktCoupons Year Quarter OriginAirportID OriginAirportSeqID OriginCityMarketID Origin OriginCountry ... BulkFare Passengers MktFare MktDistance MktDistanceGroup MktMilesFlown NonStopMiles ItinGeoType MktGeoType Unnamed: 41
0 201412918419 20141291841902 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
1 201412918420 20141291842001 1 2014 1 13930 1393003 30977 ORD US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
2 201412918420 20141291842002 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
3 201412918421 20141291842101 1 2014 1 13930 1393003 30977 ORD US ... 0.0 1.0 268.0 719.0 2 719.0 719.0 2 2 NaN
4 201412918421 20141291842102 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 268.0 719.0 2 719.0 719.0 2 2 NaN

5 rows × 42 columns

In [11]:
# Set some indicators so we know which period the data came from. There is no period indicator in the raw files.
data_Q1_2014['Period'] = 201401
data_Q1_2014['HIMSS'] = 'No HIMSS'
data_Q1_2015['Period'] = 201501
data_Q1_2015['HIMSS'] = 'HIMSS'
data_Q1_2016['Period'] = 201601
data_Q1_2016['HIMSS'] = 'No HIMSS'
data_Q1_2017['Period'] = 201701
data_Q1_2017['HIMSS'] = 'HIMSS'
data_Q1_2018['Period'] = 201801
data_Q1_2018['HIMSS'] = 'No HIMSS'

# Dask uses append because it fits with the model of lazily parallel calculation,
# whereas concat would rely on the whole dataset being available
all_years = data_Q1_2014.append(data_Q1_2015).append(data_Q1_2016).append(data_Q1_2017).append(data_Q1_2018)

mco_totals_by_year_task = all_years[all_years.Dest == 'MCO'].groupby(['Period','HIMSS']).agg({'Passengers':'sum', 'MktFare':'mean'})
mco_totals_by_airport_task = all_years[all_years.Dest == 'MCO'].groupby('Origin').agg({'Passengers':'sum', 'MktFare':'mean'})

Each row is an itinerary for one passenger between an origin and a destination, with the fare and distance. We are focused only on Orlando.

The commands above essentially create two tasks; nothing is processed until we ask for it. In the next command, Dask will compute the aggregates and return a dataframe with the results.

In [ ]:
mco_totals_by_year = mco_totals_by_year_task.compute()
mco_totals_by_airport = mco_totals_by_airport_task.compute()
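
Because that compute chews through several gigabytes of CSVs across five files, it can take a while. A hedged alternative (using Dask's built-in diagnostics, not something the original notebook does) is to wrap the same calls in a progress bar:

In [ ]:
from dask.diagnostics import ProgressBar

# ProgressBar prints a live progress line while Dask works through the task graph
with ProgressBar():
    mco_totals_by_year = mco_totals_by_year_task.compute()
    mco_totals_by_airport = mco_totals_by_airport_task.compute()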
In [13]:
mco_totals_by_airport.sort_values('Passengers').tail()
Out[13]:
Passengers MktFare
Origin
ORD 66146.0 217.768325
MSP 66435.0 201.077036
DTW 67107.0 195.274883
PHL 81802.0 184.273720
EWR 84991.0 223.100379

Minneapolis, Detroit, and Chicago are all major airline hubs. For the visualization we will need a few other data sources: geographic shapes for US states, and latitude and longitude locations for major airports. These are actually included in the altair/vega example datasets and thus are easily obtained using the imports at the top.

In [14]:
us_states =  alt.topo_feature(data.us_10m.url,  'states')

The following chart will provide a base United States map for the visualization.

In [15]:
usa_chart = alt.Chart(us_states).mark_geoshape(fill='#dedede', stroke='white')
usa_chart
Out[15]:

We can see that it looks a little funny. Because the default projection is a world mercator, we are seeing the map stretched to fit the relative positions of Alaska and Hawaii. This is something we will correct later.
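
For reference, the fix is a one-line projection change, previewed here (the same projection gets applied to the combined map at the end):

In [ ]:
# albersUsa moves Alaska and Hawaii into insets below the lower 48
usa_chart.project(type='albersUsa')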

In [16]:
airports = pd.read_csv(data.airports.url)
In [17]:
merged_airport_data = airports.join(mco_totals_by_airport, on='iata',   how='inner')
# data_json = merged_airport_data[merged_airport_data.Passengers > 5000].to_json()
# data_to_chart = alt.Data(values=json.loads(data_json)['features'])

We are also going to filter the data to focus on the larger airports (those with more than 10,000 passengers of Orlando traffic) in the 50 states.

In [18]:
key_airports = merged_airport_data[(merged_airport_data.Passengers > 10000) & 
                                   (merged_airport_data.state != 'PR')]
key_airports.head()
Out[18]:
iata name city state country latitude longitude Passengers MktFare
771 ACY Atlantic City International Atlantic City NJ USA 39.457583 -74.577167 16133.0 88.635223
825 ALB Albany Cty Albany NY USA 42.748119 -73.802979 20567.0 204.663398
880 ATL William B Hartsfield-Atlanta Intl Atlanta GA USA 33.640444 -84.426944 55434.0 176.386318
890 AUS Austin-Bergstrom International Austin TX USA 30.194533 -97.669872 17309.0 193.862941
933 BDL Bradley International Windsor Locks CT USA 41.938874 -72.683228 34742.0 188.418242
In [19]:
airport_chart = alt.Chart(key_airports[['name', 'latitude', 'Passengers', 'longitude']]).mark_point(
    fill='steelblue', stroke='white', filled=True
).encode(
    tooltip='name:N',
    size='Passengers:Q',
    latitude='latitude:Q',
    longitude='longitude:Q')
In [20]:
airport_chart
Out[20]:

Great, this shows us the airports sized by their traffic to Orlando. Like our US map, it looks a little funny, but we will fix it at the end. We can add one more element: lines representing the routes between the cities. The code below creates columns for the destination coordinates, which are Orlando's in every case, so we set them to a constant.

In [24]:
key_airports['dest_latitude'] = 28.428889
key_airports['dest_longitude'] = -81.316028

We are also going to give each route a name so that we can instruct the visualization to draw one line per route.

In [25]:
key_airports['route_name'] = merged_airport_data.iata + '_' + 'MCO'

Still, the data is not yet shaped correctly for a line chart. Line charts work in two ways: one connects multiple points into a single continuous line; the other draws a separate line for each series, which is what we want. So we need to arrange the data so that the origin and destination points sit in the same columns, stacked vertically, with 'route_name' identifying which line is which. The code below creates that structure by concatenating the origin points with the destination points in the same columns.

In [26]:
line_airport_points = pd.concat((key_airports[['route_name','latitude','longitude','Passengers']]
          , key_airports[['route_name','dest_latitude','dest_longitude','Passengers']].rename(
              columns={'dest_latitude':'latitude', 'dest_longitude':'longitude'})
          )
         )
line_airport_points.head()
Out[26]:
route_name latitude longitude Passengers
771 ACY_MCO 39.457583 -74.577167 16133.0
825 ALB_MCO 42.748119 -73.802979 20567.0
880 ATL_MCO 33.640444 -84.426944 55434.0
890 AUS_MCO 30.194533 -97.669872 17309.0
933 BDL_MCO 41.938874 -72.683228 34742.0
In [27]:
routes = alt.Chart(line_airport_points).mark_trail(strokeOpacity=.5).encode(
    latitude='latitude', 
    longitude='longitude', 
    color=alt.Color('Passengers', legend=None, scale=alt.Scale(scheme='purples')), 
    size=alt.Size('Passengers:Q', bin=True, scale=alt.Scale(range=(1,2)), legend=None),
    tooltip='route_name')
In [28]:
routes
Out[28]:

The use of mark_trail allowed us to size each line. Now all that's left is to put everything together and change the projection so that everything is clearly visible on a US-centric map.

In [29]:
usa_map_combined = alt.layer(usa_chart, airport_chart,  routes, 
                            width=600, height=600).project(type='albersUsa')
usa_map_combined
Out[29]:

Comparing Multiple Years

There we have the legions of HIT vendors streaming through American airports to descend on Orlando. But how does it compare to non-HIMSS years? For that, we will use the Q1 data from 2014 through 2018.

The Dask pipeline we set up above loads the data for each year and gives the totals arriving in Orlando for each. This is where Dask really shines: it is essentially chunking through more than 5 GB of rows across five files.

In [31]:
from importlib import reload
reload(altair_theme)

base = alt.Chart(mco_totals_by_year.reset_index()).encode(
    x='Period:O')

passengers = base.mark_bar(size=20).encode(
    y=alt.Y('Passengers:Q', axis=alt.Axis(format=",.0f", title='Number of Passengers'),
            scale=alt.Scale(domain=(0, 600000))),
    color=alt.Color('HIMSS:N', scale=alt.Scale(range=['#7e78d2', '#fb62f6'], domain=['No HIMSS', 'HIMSS']))
)

fares = base.mark_line().encode(
    y=alt.Y('MktFare:Q', axis=alt.Axis(format='$.0f', title='Average Fare'),
            scale=alt.Scale(domain=(150, 300)))
)

stats_chart = alt.layer(
    passengers,
    fares,
    width=600,
    height=100
).resolve_scale(
    y='independent'
)
In [32]:
final_map = alt.vconcat(usa_map_combined, stats_chart,
                        title="Passenger Traffic Headed to Orlando 2014-2018").configure_title(
    fontSize=20, anchor='middle')
final_map
Out[32]:

In summary, HIMSS's 50,000+ attendees barely register as a blip for the greater Orlando area. At least the average fare does seem to be getting a little lower each year.

In [33]:
usa_chart.save('static/flights/usa-chart.html')
airport_chart.save('static/flights/airport_chart.html')
routes.save('static/flights/routes.html')
usa_map_combined.save('static/flights/usa_map_combined.html')
final_map.save('static/flights/final_map.html')