the fivethirtyeight urban index

tl;dr Link to heading

I have attempted to recreate the fivethirtyeight urban index. Here’s the repo.

fivethirtyeight have a new urban index Link to heading

fivethirtyeight have come up with a method for quantifying urban or rural-ness. The data, quite helpfully, are posted on GitHub. Less helpfully, (and unlike one of the other sources they mention) they don’t show how their results are derived, so I thought I’d spend some time over the weekend reproducing the calculation.

the process Link to heading

get data for every census tract in the US
figure out which tracts are within five miles of each other
add up the population for tracts within five miles of each other

Geocomputation with R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow is very helpful for getting up to speed on the fundamentals of analyzing geospatial data, of which I know precious little.

getting the data Link to heading

{tidycensus} by Kyle Walker makes pulling census data into R incredibly easy.

Once you have the data, {sf} makes it easy to do spatial calculations. Judging from my many searches through stackoverflow, it hasn’t always been this easy.

getting it wrong Link to heading

How many people live within 5 miles of you? If you calculate this number for each person (well, each Census Tract) in the state, take the natural logarithm, then average them together (weighed based on the Census Tract’s population), you can come up with a nifty “urbanization index”

–Nate Silver

Did you think that meant you should compute five-mile buffers around each tract centroid and then perform an areal weighted interpolation of the tract populations into those buffers? No? Good. Neither did I.

Here are the first ten rows of that attempt.

# A tibble: 52 x 2
   state          avg_pop_within_five
   <chr>                        <dbl>
 1 Idaho                         8.54
 2 Texas                         8.51
 3 Georgia                       8.51
 4 Utah                          8.50
 5 Washington                    8.48
 6 California                    8.45
 7 Oregon                        8.45
 8 Florida                       8.44
 9 Massachusetts                 8.41
10 North Carolina                8.40

🙅🏾‍♂️

getting it less wrong Link to heading

Using spatial joins, I created a dataset of intersections between each tract and a five-mile buffer around its centroid. This is surprisingly fast, and gives results that are close to the reference solution:

inner_join(pop_within_5_mi, states) %>% 
  filter(pop_within_five != 0) %>% # this drops 18 or so tracts
  mutate(log_pop_within_five = log(pop_within_five)) %>% 
  group_by(state) %>% 
  summarise(avg_log_pop_within_five = mean(log_pop_within_five, na.rm = TRUE)) %>% 
  arrange(desc(avg_log_pop_within_five))

state	avg_log_pop_within_five
District of Columbia	13.609898
New York	12.923677
New Jersey	12.561995
California	12.555172
Massachusetts	12.250144
Maryland	12.223699
Nevada	12.176492
Illinois	12.132436
Rhode Island	12.114554
Florida	11.989963
Puerto Rico	11.987275
Arizona	11.907076
Connecticut	11.885447
Pennsylvania	11.796463
Texas	11.743707
Hawaii	11.690124
Utah	11.684077
Colorado	11.679461
Virginia	11.669739
Ohio	11.669211
Washington	11.661034
Delaware	11.646092
Michigan	11.545489
Georgia	11.408529
Oregon	11.371397
Indiana	11.267387
Louisiana	11.249018
North Carolina	11.240352
Minnesota	11.188313
Tennessee	11.160029
Wisconsin	11.149746
Missouri	11.146195
South Carolina	11.056779
New Hampshire	10.935114
Kentucky	10.910523
Oklahoma	10.900421
Alabama	10.791752
New Mexico	10.750496
Nebraska	10.725683
Kansas	10.713352
Idaho	10.535197
West Virginia	10.445101
Arkansas	10.360037
Mississippi	10.343037
Iowa	10.298608
Maine	10.135474
Vermont	10.094946
Alaska	9.940388
Wyoming	9.773519
South Dakota	9.607456
Montana	9.514699
North Dakota	9.415500

one more alternative approach Link to heading

Rather than intersecting tract boundaries with the five-mile buffer from their centroids, I tried using st_is_within_distance to join the data frame of tract centroids to itself. Since no index is built, this is not as fast. It also gives results that match the reference less well. You can install the package from the centroid-self-join branch for a look at those.

how to get even closer? Link to heading

fivethirtyeight do some tweaking of their index for tracts whose centroid is further than five miles away from other tract centroids:

For a census tract that is more than 5 miles away any other census tract (centroid to centroid), this number is decreased based on the minimum distance to the nearest census tract

I might be using the wrong projection for this kind of analysis.

There may be a better spatial join for expressing the relationship described in the article.