Explorations in Data Science and Health

Layering Maps and Data?

At Algorex Health, we make a lot of maps and get a lot of questions about the tools we use to make them. So, I thought I would briefly describe our process and the tools we use. I covered the basics of geographic charting in a past blog post and some of that terminology will be repeated here.

We use to two primary data visualization systems at Algorex Health both of which were chosen for their support withthin the Python/Jupyter stack, ease of use, and/or aesthetic flexibility:

  1. Altair: Altair is a python library that generates schemas that use Vega and Vegalite specification for visualization, those in turn rely on D3 for robust charting in a variety of formats. Altair becuase it relies on Vega-lite has a broad library of chart types including maps however Altair is all about the data. Every element of a map must be expressed in data and there is no base layer or tile layer. The tile layer is the part of a map you probably most recognize. It is the streets, borders, lakes, oceans that are automatically placed behind your map elements from a service like google maps. Altair maps don't support having a tile layer so when that is needed for interactivity we move on to Folium.
  2. Folium: Folium is a python api that generates maps using the Leaflet javascript library which is one of the defacto standards for web based interactive maps. In rare cases, we write leafletJS directly but generally Folium supports everything we need.

For this post, we will utilize Folium. I think the difference between most of our maps and the ones our clients are creating are in the rich use of layers. Tableau and other GUI applications make great maps but their geographies are predefined and it can be difficult to layer data from multiple types of shapes. Folium, which visualizes maps only provides the most flexibility and interactivity.

In the example here we will layer the Median Income by census tract for the Boston area and transit stations on the MBTA. This will layer two types of data: Median income by shape, and transit stations which are specific points.

Loading the data

There are three data files we need for this blog post all freely accessible:

  1. Locations and Shapes for the MBTA Mass Transit system which is available from the MassGIS agency.
  2. Census Tract shapes, available from the US Census.
  3. The median income data also available from the US Census.

We will load these data files using Pandas and GeoPandas to look at the data.

In [1]:
import pandas as pd
import geopandas as gpd

mbta_stations = gpd.read_file("data/mbta/MBTA_NODE.shp")
census_tracts = gpd.read_file("data/census_tracts_ma/tl_2018_25_tract.shp")
median_income_data = pd.read_csv("data/census_tracts_ma/ACS_17_5YR_S1903.csv", dtype={'GEO.id2':str}).rename(columns={
    'HC03_EST_VC02':'median_income', 'GEO.id2':'GEOID'
})

The MBTA data has a point geometry for each MBTA station.

In [2]:
mbta_stations.head()
Out[2]:
STATION LINE TERMINUS ROUTE geometry
0 Ashmont RED Y A - Ashmont C - Alewife POINT (236007.5383913815 892693.0225654766)
1 Harvard RED N A - Ashmont B - Braintree C - Alewife POINT (231387.2739390731 902684.0159827843)
2 Kendall/MIT RED N A - Ashmont B - Braintree C - Alewife POINT (234087.917269595 901406.5508935936)
3 Capen Street RED N Mattapan Trolley POINT (234055.4375255629 890869.3748197779)
4 Tufts Medical Center ORANGE N Forest Hills to Oak Grove POINT (235900.3235432804 899934.3133242205)

Those values don't look like latitude and longitudes. We need these geometries to be standardized if we are going to layer everything together. We fix this by changing the CRS of the file which will convert the geometry to a standard value.

In [3]:
crs = {'init' :'epsg:4326'}
mbta_stations = mbta_stations.to_crs(crs)
mbta_stations.head()
Out[3]:
STATION LINE TERMINUS ROUTE geometry
0 Ashmont RED Y A - Ashmont C - Alewife POINT (-71.06343014439 42.28388354622533)
1 Harvard RED N A - Ashmont B - Braintree C - Alewife POINT (-71.11890607237824 42.37402923068517)
2 Kendall/MIT RED N A - Ashmont B - Braintree C - Alewife POINT (-71.08619149399846 42.36241534488428)
3 Capen Street RED N Mattapan Trolley POINT (-71.08720482389784 42.2675530566276)
4 Tufts Medical Center ORANGE N Forest Hills to Oak Grove POINT (-71.06428155077408 42.34908000707444)

Much better now let's look at census tracts which we will convert as well just to be safe.

In [4]:
census_tracts = census_tracts.to_crs(crs)
census_tracts.head()
Out[4]:
STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry
0 25 027 724100 25027724100 7241 Census Tract 7241 G5020 S 53031300 1639215 +42.2566908 -072.1581690 POLYGON ((-72.21781799999999 42.27018, -72.217...
1 25 027 759100 25027759100 7591 Census Tract 7591 G5020 S 25570221 1427830 +42.2096822 -072.0401777 POLYGON ((-72.07887699999999 42.214751, -72.07...
2 25 023 510503 25023510503 5105.03 Census Tract 5105.03 G5020 S 2561883 3203 +42.0942619 -071.0655746 POLYGON ((-71.080483 42.095539, -71.0801759999...
3 25 005 652000 25005652000 6520 Census Tract 6520 G5020 S 439925 0 +41.6266977 -070.9278159 POLYGON ((-70.931105 41.63193, -70.929509 41.6...
4 25 023 510600 25023510600 5106 Census Tract 5106 G5020 S 5303844 49230 +42.0805604 -071.0602559 POLYGON ((-71.07875799999999 42.088274, -71.07...

Lastly, here is the median income data. The median income for each census tract is on column 7 which we named median_income.

In [5]:
median_income_data.head()
Out[5]:
GEO.id GEOID GEO.display-label HC01_EST_VC02 HC01_MOE_VC02 HC02_EST_VC02 HC02_MOE_VC02 median_income HC03_MOE_VC02 HC01_EST_VC04 ... HC02_EST_VC52 HC02_MOE_VC52 HC03_EST_VC52 HC03_MOE_VC52 HC01_EST_VC53 HC01_MOE_VC53 HC02_EST_VC53 HC02_MOE_VC53 HC03_EST_VC53 HC03_MOE_VC53
0 1400000US25001010100 25001010100 Census Tract 101, Barnstable County, Massachus... 1772 135 1772 135 47500.0 9839.0 1635 ... 41.3 8.6 31595.0 9777.0 113 50 10.3 4.5 NaN NaN
1 1400000US25001010206 25001010206 Census Tract 102.06, Barnstable County, Massac... 1503 157 1503 157 59042.0 13567.0 1449 ... 32.5 16.3 22177.0 17369.0 60 48 9.0 7.2 80882.0 12715.0
2 1400000US25001010208 25001010208 Census Tract 102.08, Barnstable County, Massac... 796 134 796 134 62844.0 12976.0 772 ... 38.9 14.4 43676.0 25949.0 19 22 5.8 7.2 NaN NaN
3 1400000US25001010304 25001010304 Census Tract 103.04, Barnstable County, Massac... 1145 142 1145 142 71250.0 15039.0 1113 ... 27.4 12.4 22986.0 18810.0 0 12 0.0 7.4 NaN NaN
4 1400000US25001010306 25001010306 Census Tract 103.06, Barnstable County, Massac... 1301 150 1301 150 55694.0 9955.0 1208 ... 26.9 14.6 NaN NaN 10 14 1.9 2.6 NaN NaN

5 rows × 243 columns

We need to merge the median income data with the tract shapes before we proceed which uses a little pandas magic. The GEOID is the common field between them. We will also select out only the columns we need from the median income data.

In [6]:
columns = ['median_income','GEOID']
census_tracts = census_tracts.join(median_income_data[columns],rsuffix='ic')

Make the Map

For the first map, we will use Altair. This map won't have base tiles but we will be able layer each element. The first layer will be a choropleth of median income followed by a layer of each MBTA station. Altair expects JSON data so there is some data wrangling before the map is drawn.

Filter Criteria

Because we do this all the time, it is helpful to filter out some of the data right now. This comes from having some hard lessons.

  1. Restrict to only the counties you need. The MBTA mass transit lines are only operating in three counties so we can restrict to just those.
  2. Remove census geographies with a lot of water. The field AWATER has the area in square meters of the census geography that is made of water while ALAND is the area of land. Anything with more water than land we are going to filter out.
In [7]:
filtered_census_tracts = census_tracts[(census_tracts.COUNTYFP.isin(('021','025','017'))) &
                                      (census_tracts.AWATER/census_tracts.ALAND < .55)]

To initiate a folium map, we will first set a few parameters.

In [8]:
import folium 
import branca.colormap as cm

folium_parameters = { 'location':[42.3601, -71.0589],
    'zoom_start':11,
    'width':300,
    'height':300}

boston_map = folium.Map(**folium_parameters)
boston_map
Out[8]:

Now we will begin to add layers starting with the census tracts. Because these are colored shapes, we will use the folium GeoJson class. This will draw one shape for every feature in the data. We will also style each feature using a linear color map so that the colors will be assigned based on the median income of the census tract. The colormap functions are supplied by the branca companion package.

In [9]:
color_scale= cm.linear.PuRd_07.scale(filtered_census_tracts.median_income.min(), 175000)
def tract_styles(feature):
    return {
        'fillColor': color_scale(feature['properties']['median_income']),
        'color': 'grey',
        'weight': .5,
    'dashArray': '5, 5'
        ,'fillOpacity':.8
    }
    
choropleth = folium.GeoJson(filtered_census_tracts.dropna().to_json(), style_function=tract_styles)
choropleth.add_to(boston_map)
Out[9]:
<folium.features.GeoJson at 0x7f7a7c52c080>

Adding the stations works a little differently. We could use the same GeoJson class above but the result would be that the stations would be visualized as markers which looks a little funny.

In [10]:
marker_only_map = folium.Map(**folium_parameters)

folium.GeoJson(mbta_stations.to_json()).add_to(marker_only_map)
marker_only_map
Out[10]:

Because we want to control the points drawn and make them circles not map markers, we will instead add each one to the map individually calling a style function that will assign it's color, based on its MBTA line.

In [11]:
def station_style(station):
    def station_colors(line):
        station_colors = {'ORANGE':'orange','BLUE':'blue','RED':'red','SILVER':'silver','GREEN':'green'}
        return station_colors.get(line,'grey')
    return folium.CircleMarker(
        radius=2,
        location=(station.geometry.y, station.geometry.x),
        color=station_colors(station.LINE),
        fill=True,
        fill_opacity=.8)
stations = folium.FeatureGroup()
for station in mbta_stations.itertuples():
    station_style(station).add_to(stations)

stations.add_to(boston_map)
Out[11]:
<folium.map.FeatureGroup at 0x7f7a6b570f98>

The other note is that the base layer, which defaults to OpenStreetMaps, is a little crowded. We don't need to show how someone navigates in Boston, we just need the minimal geography to give the viewer their bearings. To do this, we will change the default map tiles. Folium and Leaflet support a web standard for fetching tiles but you can also use any of the freely available tile styles shown in the leaflet documentation.

For this chart a basic gray style works best.

In [12]:
boston_map.add_tile_layer(**{'tiles':'https://server.arcgisonline.com/ArcGIS/rest/services/Canvas/World_Light_Gray_Base/MapServer/tile/{z}/{y}/{x}',
    'attr':'Tiles &copy; Esri &mdash; Esri, DeLorme, NAVTEQ'})
boston_map.width = 500
boston_map
Out[12]: