This week the health world turns to Orlando for the annual conference of the Healthcare Information and Management Systems Society (HIMSS). It is one of the biggest events in health IT and one of the top 25 trade shows in the entire country.
How many people attend HIMSS each year? We could search one of the hundreds of press releases that exhibitors put out about the event. We could visit the link above from Trade Show News.
Big Data Approach
How about we measure how many people fly to Orlando during the quarter when HIMSS takes place? The Bureau of Transportation Statistics releases quarterly data files drawn from a 10% sample of all airline itineraries sold each quarter by the major US carriers (those with more than 5% market share).
For this analysis, we downloaded the Q1 files for 2014 through 2018. Q1 2017 corresponds to the most recent time HIMSS visited Orlando, and the non-HIMSS quarters serve as a baseline, though the convention center likely booked other events during the same periods.
However, US airports see about 2.6 million passengers per day, so even a 10% sample of tickets for one quarter makes for a massive file.
Here are some basic stats on just one of the files:
import subprocess
# Count the rows without loading the file into memory; text=True decodes the output
result = subprocess.run(
    ["wc", "-l", "data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv"],
    stdout=subprocess.PIPE, text=True)
print("Lines in the file: {0}".format(result.stdout))
The first step in preparing data for our visualizations is typically to load it into a pandas dataframe, where we can manipulate it before plotting. The flexibility and extensibility of pandas come at a cost: it is extremely memory intensive. One of the original authors of the library recommends having 5 to 10 times as much RAM as the size of your dataset. For one quarter of the airline data (1.4 GB), that works out to as much as 14 GB, more than my laptop has.
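As a quick sanity check on that rule of thumb, pandas can report a frame's actual footprint via `memory_usage(deep=True)`. This is a small sketch using a synthetic frame whose columns mirror the DB1B names (Origin, Dest, Passengers, MktFare); the values are invented, not real ticket data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a slice of the DB1B file; values are made up
df = pd.DataFrame({
    "Origin": np.random.choice(["MSP", "DTW", "ORD"], size=100_000),
    "Dest": "MCO",  # scalar broadcasts to every row
    "Passengers": np.random.randint(1, 5, size=100_000),
    "MktFare": np.random.uniform(100, 400, size=100_000),
})

# deep=True counts the actual bytes behind object (string) columns too
in_memory_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{in_memory_mb:.1f} MB in memory for 100,000 rows")
```

Scaling that per-row cost up to tens of millions of rows, plus the intermediate copies pandas makes during transformations, is where the 5-10x rule of thumb comes from.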
Enter Dask, a project developed by Anaconda that exposes the same API as pandas but underpins it with an architecture that parallelizes operations, leaving the data on disk until it is called for. Dask's power is immense, allowing a dataframe to span an entire cluster. But locally, it is also an easy way to keep the same Python data science workflows while leaving the data untouched until it is needed, or in our case, aggregated for our analysis.
Let's take a look at the data now.
import dask.dataframe as dd
from vega_datasets import data
import altair as alt
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString
import json
import altair_theme
%matplotlib inline
data_Q1_2014 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2014_1.csv")
data_Q1_2015 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2015_1.csv")
data_Q1_2016 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2016_1.csv")
data_Q1_2017 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv")
data_Q1_2018 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2018_1.csv")
data_Q1_2014.head()
# Add indicators so we know which quarter each row came from; the raw files
# carry no such column. HIMSS met in Orlando in Q1 of 2014 and 2017.
data_Q1_2014['Period'] = 201401
data_Q1_2014['HIMSS'] = 'HIMSS'
data_Q1_2015['Period'] = 201501
data_Q1_2015['HIMSS'] = 'No HIMSS'
data_Q1_2016['Period'] = 201601
data_Q1_2016['HIMSS'] = 'No HIMSS'
data_Q1_2017['Period'] = 201701
data_Q1_2017['HIMSS'] = 'HIMSS'
data_Q1_2018['Period'] = 201801
data_Q1_2018['HIMSS'] = 'No HIMSS'
# Dask uses append because it fits the lazily parallel computation model,
# whereas concat would rely on the whole dataset being available
all_years = data_Q1_2014.append(data_Q1_2015).append(data_Q1_2016).append(data_Q1_2017).append(data_Q1_2018)
mco_totals_by_year_task = all_years[all_years.Dest == 'MCO'].groupby(['Period','HIMSS']).agg({'Passengers':'sum', 'MktFare':'mean'})
mco_totals_by_airport_task = all_years[all_years.Dest == 'MCO'].groupby('Origin').agg({'Passengers':'sum', 'MktFare':'mean'})
Each row is an itinerary for one passenger between an origin and a destination, with the fare and distance. We are focused only on Orlando (MCO).
The commands above essentially create two tasks; nothing is processed until the results are called for. In the next commands, Dask computes the aggregates and returns each result as a pandas dataframe.
mco_totals_by_year = mco_totals_by_year_task.compute()
mco_totals_by_airport = mco_totals_by_airport_task.compute()
mco_totals_by_airport.sort_values('Passengers').tail()
Minneapolis, Detroit, and Chicago are all major airline hubs. To create the visualization we will need a few other data sources: geographic shapes for the US states, and latitude/longitude locations for major airports. Both are included in the altair/vega example datasets, so they are easily obtained with the imports at the top.
us_states = alt.topo_feature(data.us_10m.url, 'states')
The following chart will provide a base United States map for the visualization.
usa_chart = alt.Chart(us_states).mark_geoshape(fill='#dedede', stroke='white')
usa_chart
We can see that looks a little funny. Because the standard projection is world mercator, we are seeing the map based on the relative positions of Alaska and Hawaii. This is something we will correct later.
airports = pd.read_csv(data.airports.url)
merged_airport_data = airports.join(mco_totals_by_airport, on='iata', how='inner')
We are also going to filter the data to focus on large airports (those sending more than 10,000 passengers to Orlando) in the 50 states.
key_airports = merged_airport_data[(merged_airport_data.Passengers > 10000) &
(merged_airport_data.state != 'PR')]
key_airports.head()
airport_chart = alt.Chart(key_airports[['name', 'latitude','Passengers','longitude']]).mark_point(
fill='steelblue',stroke='white', filled=True).encode(
tooltip='name:N', size='Passengers:Q', latitude='latitude:Q', longitude='longitude:Q')
airport_chart
Great, this shows us the airports sized by their traffic to Orlando. As with our US map, it looks a little funny, but we will fix that at the end. We can add one more element: lines representing the routes between the cities. Since the destination is Orlando in every case, the code below sets the destination coordinates to constants.
key_airports['dest_latitude'] = 28.428889
key_airports['dest_longitude'] = -81.316028
We are also going to give each route a name so that we can instruct the visualization to draw one line per route.
key_airports['route_name'] = key_airports.iata + '_MCO'
Still, the data is not yet shaped correctly for a line chart. Line charts work in two ways: one connects many points into a single continuous line; the other draws multiple separate lines, one per series, which is what we want. So we arrange the data with the origin and destination points stacked vertically in the same columns, with 'route_name' identifying which line each point belongs to. The code below creates that structure by concatenating the origin points with the destination points.
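In miniature, the target long format looks like this. The route and coordinates below are a toy example (approximate MSP and MCO values), used purely for illustration:

```python
import pandas as pd

# One origin row and one destination row share a route_name,
# so the plotting library can connect them as one line segment.
origins = pd.DataFrame({"route_name": ["MSP_MCO"],
                        "latitude": [44.88], "longitude": [-93.22]})
dests = pd.DataFrame({"route_name": ["MSP_MCO"],
                      "latitude": [28.43], "longitude": [-81.32]})

# Stack origin and destination points into the same columns
long_form = pd.concat((origins, dests), ignore_index=True)
print(long_form)
```

With many routes, each distinct route_name yields two rows, and therefore one drawn line per route.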
line_airport_points = pd.concat((
    key_airports[['route_name', 'latitude', 'longitude', 'Passengers']],
    key_airports[['route_name', 'dest_latitude', 'dest_longitude', 'Passengers']].rename(
        columns={'dest_latitude': 'latitude', 'dest_longitude': 'longitude'})
))
line_airport_points.head()
routes = alt.Chart(line_airport_points).mark_trail(strokeOpacity=.5).encode(
latitude='latitude',
longitude='longitude',
color=alt.Color('Passengers', legend=None, scale=alt.Scale(scheme='purples')),
size=alt.Size('Passengers:Q', bin=True, scale=alt.Scale(range=(1,2)), legend=None),
tooltip='route_name')
routes
The use of mark_trail allowed us to size each line. Now all that's left is to put everything together and change the projection, which allows us to see everything clearly on a US-centric map.
usa_map_combined = alt.layer(usa_chart, airport_chart, routes,
width=600, height=600).project(type='albersUsa')
usa_map_combined
Comparing Multiple Years
There we have the legions of health IT vendors streaming through America's airports to descend on Orlando. But how does it compare to non-HIMSS years? For that, we can use the other quarters we loaded.
We set up a Dask pipeline that loads the data for each year and produces the totals arriving in Orlando for each. This is where Dask really shines, as it is essentially chunking through more than 5 GB of rows across five files.
from importlib import reload
reload(altair_theme)
base = alt.Chart(mco_totals_by_year.reset_index()).encode(
x='Period:O')
passengers = base.mark_bar(size=20).encode(
y=alt.Y('Passengers:Q', axis=alt.Axis(format=",.0f", title='Number of Passengers'), scale=alt.Scale(domain=(0,600000))),
color=alt.Color('HIMSS:N', scale=alt.Scale(range=['#7e78d2', '#fb62f6'], domain=['No HIMSS','HIMSS']))
)
fares = base.mark_line().encode(
y=alt.Y('MktFare:Q', axis=alt.Axis(format='$.0f', title='Average Fare'),scale=alt.Scale(domain=(150,300)))
)
stats_chart = alt.layer(
passengers,
fares,
width=600,
height=100
).resolve_scale(
y='independent'
)
final_map = alt.vconcat(usa_map_combined, stats_chart, title="Passenger Traffic Headed to Orlando 2014-2018").configure_title(
fontSize=20, anchor='middle')
final_map
In summary, HIMSS' 50,000+ attendees don't even register as a blip for the greater Orlando area. At least the average fare does seem to be getting a little lower each year.
usa_chart.save('static/flights/usa-chart.html')
airport_chart.save('static/flights/airport_chart.html')
routes.save('static/flights/routes.html')
usa_map_combined.save('static/flights/usa_map_combined.html')
final_map.save('static/flights/final_map.html')