Explorations in Data Science and Health

Who is going to Orlando?

This week the health world turns to Orlando for the annual conference of the Healthcare Information and Management Systems Society (HIMSS). It is one of the biggest events in health IT and one of the top 25 trade shows in the entire country.

How many people attend HIMSS each year? We could search one of the hundreds of press releases about the event put out by exhibitors, or we could follow the Trade Show News link above.

Big Data Approach

How about we measure how many people fly to Orlando during the quarter when HIMSS takes place? The Bureau of Transportation Statistics releases quarterly data files drawn from a 10% sample of all airline itineraries sold each quarter by the major US carriers (those with more than 5% market share).

For this analysis, we downloaded the Q1 2017 file, which corresponds to the last time HIMSS visited Orlando, and the Q1 2015 file, which was presumably a non-HIMSS year, though the conference center likely booked other events for the same period.

However, US airports see about 2.6 million passengers per day, so even a 10% sample of tickets for one quarter is a massive file.

Here are some basic stats on just one of the files:

In [7]:
import subprocess

# Count the rows in the Q1 2017 file with wc -l
lines_in_file = subprocess.run(["wc", "-l", "data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv"], stdout=subprocess.PIPE)
print("Lines in the file {0}".format(lines_in_file.stdout))
Lines in the file b'5866624 data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv\n'

The first step in processing data for our visualizations is typically to load the data into a pandas dataframe, where we can manipulate and prepare it before visualizing. The flexibility and extensibility of pandas come at a cost: it is extremely memory intensive. One of the original authors of the library recommends having 5 to 10 times as much RAM as the size of your dataset. For one quarter of the airline data (1.4 GB), that could mean well over 10 GB of RAM, more than my laptop has.
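
To get a feel for that overhead, a rough check (a minimal sketch; the 100,000-row sample size is an arbitrary choice, not part of the original analysis) is to load a slice of the file into pandas and ask how much memory it actually occupies:

In [ ]:
import pandas as pd

# Read only the first 100,000 rows of the Q1 2017 file as a rough sample
sample = pd.read_csv(
    "data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv",
    nrows=100000)

# deep=True measures the real memory used by object (string) columns
sample_mb = sample.memory_usage(deep=True).sum() / 1e6
print("100,000 rows occupy roughly {0:.0f} MB in memory".format(sample_mb))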

Enter Dask, a project developed by Anaconda that mirrors the pandas API but underpins it with an architecture that parallelizes operations and leaves the data on disk until it is called for. At its most powerful, Dask can spread a dataframe across an entire cluster. But even locally, it is an easy way to keep the same Python data science workflow while deferring the data until it is needed, or in our case, until it has been aggregated for our analysis.
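
As a quick illustration of that lazy model (a minimal sketch on the same Q1 2017 file; it is not part of the original workflow), reading a CSV with Dask returns almost instantly because nothing is loaded yet, and only compute() actually streams through the data:

In [ ]:
import dask.dataframe as dd

# read_csv only inspects the file layout; the rows stay on disk
lazy_df = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv")
print(lazy_df.npartitions)  # number of on-disk chunks Dask will process

# Building an aggregation just records a task graph...
passenger_total = lazy_df['Passengers'].sum()

# ...and compute() is what actually reads the file and runs the sum
print(passenger_total.compute())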

Let's take a look at the data now.

In [8]:
import dask.dataframe as dd
from vega_datasets import data
import altair as alt
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString
import json
import altair_theme
%matplotlib inline
In [9]:
data_Q1_2014 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2014_1.csv")
data_Q1_2015 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2015_1.csv")
data_Q1_2016 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2016_1.csv")
data_Q1_2017 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv")
data_Q1_2018 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2018_1.csv")
In [10]:
data_Q1_2014.head()
Out[10]:
ItinID MktID MktCoupons Year Quarter OriginAirportID OriginAirportSeqID OriginCityMarketID Origin OriginCountry ... BulkFare Passengers MktFare MktDistance MktDistanceGroup MktMilesFlown NonStopMiles ItinGeoType MktGeoType Unnamed: 41
0 201412918419 20141291841902 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
1 201412918420 20141291842001 1 2014 1 13930 1393003 30977 ORD US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
2 201412918420 20141291842002 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
3 201412918421 20141291842101 1 2014 1 13930 1393003 30977 ORD US ... 0.0 1.0 268.0 719.0 2 719.0 719.0 2 2 NaN
4 201412918421 20141291842102 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 268.0 719.0 2 719.0 719.0 2 2 NaN

5 rows × 42 columns

In [11]:
# Set some indicators so we know which period the data came from. There is no period indicator in the raw files.
data_Q1_2014['Period'] = 201401
data_Q1_2014['HIMSS'] = 'No HIMSS'
data_Q1_2015['Period'] = 201501
data_Q1_2015['HIMSS'] = 'HIMSS'
data_Q1_2016['Period'] = 201601
data_Q1_2016['HIMSS'] = 'No HIMSS'
data_Q1_2017['Period'] = 201701
data_Q1_2017['HIMSS'] = 'HIMSS'
data_Q1_2018['Period'] = 201801
data_Q1_2018['HIMSS'] = 'No HIMSS'

# Dask uses append because it fits with the model of lazily parallel calculation,
# whereas concat would rely on the whole dataset being available
all_years = data_Q1_2014.append(data_Q1_2015).append(data_Q1_2016).append(data_Q1_2017).append(data_Q1_2018)

mco_totals_by_year_task = all_years[all_years.Dest == 'MCO'].groupby(['Period','HIMSS']).agg({'Passengers':'sum', 'MktFare':'mean'})
mco_totals_by_airport_task = all_years[all_years.Dest == 'MCO'].groupby('Origin').agg({'Passengers':'sum', 'MktFare':'mean'})

Each row is an itinerary for one passenger between an origin and a destination, with the fare and distance. We are focused only on Orlando.

The commands above essentially create two tasks; nothing is processed until we ask for it. In the next command, Dask will compute the aggregates and return a dataframe with the results.

In [ ]:
mco_totals_by_year = mco_totals_by_year_task.compute()
mco_totals_by_airport = mco_totals_by_airport_task.compute()
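
Because that compute chews through several gigabytes of CSVs across five files, it can take a while. A hedged alternative (using Dask's built-in diagnostics, not something the original notebook does) is to wrap the same calls in a progress bar:

In [ ]:
from dask.diagnostics import ProgressBar

# ProgressBar prints a live progress line while Dask works through the task graph
with ProgressBar():
    mco_totals_by_year = mco_totals_by_year_task.compute()
    mco_totals_by_airport = mco_totals_by_airport_task.compute()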
In [13]:
mco_totals_by_airport.sort_values('Passengers').tail()
Out[13]:
Passengers MktFare
Origin
ORD 66146.0 217.768325
MSP 66435.0 201.077036
DTW 67107.0 195.274883
PHL 81802.0 184.273720
EWR 84991.0 223.100379

Minneapolis, Detroit, and Chicago are all major airline hubs. For the visualization we will need a few other data sources: geographic shapes for US states, and latitude and longitude locations for major airports. These are actually included in the altair/vega example datasets and thus are easily obtained using the imports at the top.

In [14]:
us_states =  alt.topo_feature(data.us_10m.url,  'states')

The following chart will provide a base United States map for the visualization.

In [15]:
usa_chart = alt.Chart(us_states).mark_geoshape(fill='#dedede', stroke='white')
usa_chart
Out[15]:

We can see that it looks a little funny. Because the default projection is a world mercator, we are seeing the map stretched to fit the relative positions of Alaska and Hawaii. This is something we will correct later.
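
For reference, the fix is a one-line projection change, previewed here (the same projection gets applied to the combined map at the end):

In [ ]:
# albersUsa moves Alaska and Hawaii into insets below the lower 48
usa_chart.project(type='albersUsa')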

In [16]:
airports = pd.read_csv(data.airports.url)
In [17]:
merged_airport_data = airports.join(mco_totals_by_airport, on='iata',   how='inner')
# data_json = merged_airport_data[merged_airport_data.Passengers > 5000].to_json()
# data_to_chart = alt.Data(values=json.loads(data_json)['features'])

We are also going to filter the data to focus on the larger airports (those with more than 10,000 passengers of Orlando traffic) in the 50 states.

In [18]:
key_airports = merged_airport_data[(merged_airport_data.Passengers > 10000) & 
                                   (merged_airport_data.state != 'PR')]
key_airports.head()
Out[18]:
iata name city state country latitude longitude Passengers MktFare
771 ACY Atlantic City International Atlantic City NJ USA 39.457583 -74.577167 16133.0 88.635223
825 ALB Albany Cty Albany NY USA 42.748119 -73.802979 20567.0 204.663398
880 ATL William B Hartsfield-Atlanta Intl Atlanta GA USA 33.640444 -84.426944 55434.0 176.386318
890 AUS Austin-Bergstrom International Austin TX USA 30.194533 -97.669872 17309.0 193.862941
933 BDL Bradley International Windsor Locks CT USA 41.938874 -72.683228 34742.0 188.418242
In [19]:
airport_chart = alt.Chart(key_airports[['name', 'latitude', 'Passengers', 'longitude']]).mark_point(
    fill='steelblue', stroke='white', filled=True
).encode(
    tooltip='name:N',
    size='Passengers:Q',
    latitude='latitude:Q',
    longitude='longitude:Q')
In [20]:
airport_chart
Out[20]:

Great, this shows us the airports sized by their traffic to Orlando. Like our US map, it looks a little funny, but we will fix it at the end. We can add one more element: lines representing the routes between the cities. The code below creates columns for the destination coordinates, which are Orlando's in every case, so we set them to a constant.

In [24]:
key_airports['dest_latitude'] = 28.428889
key_airports['dest_longitude'] = -81.316028

We are also going to give each route a name so that we can instruct the visualization to draw one line per route.

In [25]:
key_airports['route_name'] = merged_airport_data.iata + '_' + 'MCO'

Still, the data is not yet shaped correctly for a line chart. Line charts work in two ways: one connects multiple points into a single continuous line; the other draws a separate line for each series, which is what we want. So we need to arrange the data so that the origin and destination points sit in the same columns, stacked vertically, with 'route_name' identifying which line is which. The code below creates that structure by concatenating the origin points with the destination points in the same columns.

In [26]:
line_airport_points = pd.concat((key_airports[['route_name','latitude','longitude','Passengers']]
          , key_airports[['route_name','dest_latitude','dest_longitude','Passengers']].rename(
              columns={'dest_latitude':'latitude', 'dest_longitude':'longitude'})
          )
         )
line_airport_points.head()
Out[26]:
route_name latitude longitude Passengers
771 ACY_MCO 39.457583 -74.577167 16133.0
825 ALB_MCO 42.748119 -73.802979 20567.0
880 ATL_MCO 33.640444 -84.426944 55434.0
890 AUS_MCO 30.194533 -97.669872 17309.0
933 BDL_MCO 41.938874 -72.683228 34742.0
In [27]:
routes = alt.Chart(line_airport_points).mark_trail(strokeOpacity=.5).encode(
    latitude='latitude', 
    longitude='longitude', 
    color=alt.Color('Passengers', legend=None, scale=alt.Scale(scheme='purples')), 
    size=alt.Size('Passengers:Q', bin=True, scale=alt.Scale(range=(1,2)), legend=None),
    tooltip='route_name')
In [28]:
routes
Out[28]:

The use of mark_trail allowed us to size each line. Now all that's left is to put everything together and change the projection so that everything is clearly visible on a US-centric map.

In [29]:
usa_map_combined = alt.layer(usa_chart, airport_chart,  routes, 
                            width=600, height=600).project(type='albersUsa')
usa_map_combined
Out[29]:

Comparing Multiple Years

There we have the legions of HIT vendors streaming through American airports to descend on Orlando. But how does it compare to non-HIMSS years? For that, we will use the Q1 data from 2014 through 2018.

The Dask pipeline we set up above loads the data for each year and gives the totals arriving in Orlando for each. This is where Dask really shines: it is essentially chunking through more than 5 GB of rows across five files.

In [31]:
from importlib import reload
reload(altair_theme)

base = alt.Chart(mco_totals_by_year.reset_index()).encode(
    x='Period:O')

passengers = base.mark_bar(size=20).encode(
    y=alt.Y('Passengers:Q', axis=alt.Axis(format=",.0f", title='Number of Passengers'),
            scale=alt.Scale(domain=(0, 600000))),
    color=alt.Color('HIMSS:N', scale=alt.Scale(range=['#7e78d2', '#fb62f6'], domain=['No HIMSS', 'HIMSS']))
)

fares = base.mark_line().encode(
    y=alt.Y('MktFare:Q', axis=alt.Axis(format='$.0f', title='Average Fare'),
            scale=alt.Scale(domain=(150, 300)))
)

stats_chart = alt.layer(
    passengers,
    fares,
    width=600,
    height=100
).resolve_scale(
    y='independent'
)
In [32]:
final_map = alt.vconcat(usa_map_combined, stats_chart,
                        title="Passenger Traffic Headed to Orlando 2014-2018").configure_title(
    fontSize=20, anchor='middle')
final_map
Out[32]:

In summary, HIMSS's 50,000+ attendees barely register as a blip for the greater Orlando area. At least the average fare does seem to be getting a little lower each year.

In [33]:
usa_chart.save('static/flights/usa-chart.html')
airport_chart.save('static/flights/airport_chart.html')
routes.save('static/flights/routes.html')
usa_map_combined.save('static/flights/usa_map_combined.html')
final_map.save('static/flights/final_map.html')