tl;dr
I have attempted to recreate the fivethirtyeight urban index. Here’s the repo.
fivethirtyeight have a new urban index
fivethirtyeight have come up with a method for quantifying urban or rural-ness. The data, quite helpfully, are posted on GitHub. Less helpfully (and unlike one of the other sources they mention), they don’t show how their results are derived, so I thought I’d spend some time over the weekend reproducing the calculation.
the process
- get data for every census tract in the US
- figure out which tracts are within five miles of each other
- add up the population for tracts within five miles of each other
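The steps above can be sketched in base R on toy data, with the natural log and population-weighted state average from fivethirtyeight's description added as a last step. The tracts, coordinates, and populations are made up, and `haversine_miles()` is a hypothetical helper written for this illustration (the real analysis uses {sf} rather than hand-rolled distances), so treat this as a minimal sketch rather than the actual implementation.

```r
# Toy tracts: centroid coordinates (degrees) and populations are made up.
tracts <- data.frame(
  tract = c("a", "b", "c", "d"),
  state = c("X", "X", "X", "Y"),
  lon   = c(-100.00, -100.05, -100.50, -90.00),
  lat   = c(40.00, 40.00, 40.00, 35.00),
  pop   = c(1000, 3000, 500, 2000)
)

# Hypothetical helper: great-circle distance in miles between lon/lat points.
haversine_miles <- function(lon1, lat1, lon2, lat2, radius = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Step 2: pairwise centroid-to-centroid distances (row i, column j).
n <- nrow(tracts)
dist_mi <- outer(
  seq_len(n), seq_len(n),
  function(i, j) haversine_miles(tracts$lon[i], tracts$lat[i],
                                 tracts$lon[j], tracts$lat[j])
)

# Step 3: population within five miles of each tract, itself included.
tracts$pop_within_five <- vapply(
  seq_len(n),
  function(i) sum(tracts$pop[dist_mi[i, ] <= 5]),
  numeric(1)
)

# Last step: population-weighted mean of the natural logs, one per state.
urban_index <- vapply(
  split(tracts, tracts$state),
  function(d) weighted.mean(log(d$pop_within_five), w = d$pop),
  numeric(1)
)
urban_index
```

Here tracts a and b sit about 2.6 miles apart, so each counts the other's population, while c and d only count their own.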
Geocomputation with R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow is very helpful for getting up to speed on the fundamentals of analyzing geospatial data, of which I know precious little.
getting the data
{tidycensus} by Kyle Walker makes pulling census data into R incredibly easy.
Once you have the data, {sf} makes it easy to do spatial calculations. Judging from my many searches through stackoverflow, it hasn’t always been this easy.
getting it wrong
How many people live within 5 miles of you? If you calculate this number for each person (well, each Census Tract) in the state, take the natural logarithm, then average them together (weighed based on the Census Tract’s population), you can come up with a nifty “urbanization index”
Did you think that meant you should compute five-mile buffers around each tract centroid and then perform an areal weighted interpolation of the tract populations into those buffers? No? Good. Neither did I.
Here are the first ten rows of that attempt.
# A tibble: 52 x 2
   state          avg_pop_within_five
   <chr>                        <dbl>
 1 Idaho                         8.54
 2 Texas                         8.51
 3 Georgia                       8.51
 4 Utah                          8.50
 5 Washington                    8.48
 6 California                    8.45
 7 Oregon                        8.45
 8 Florida                       8.44
 9 Massachusetts                 8.41
10 North Carolina                8.40
getting it less wrong
Using spatial joins, I built a dataset of every tract that intersects the five-mile buffer around each tract’s centroid. This is surprisingly fast, and gives results that are close to the reference solution (fivethirtyeight’s published index):
inner_join(pop_within_5_mi, states) %>%
  filter(pop_within_five != 0) %>% # this drops 18 or so tracts
  mutate(log_pop_within_five = log(pop_within_five)) %>%
  group_by(state) %>%
  summarise(avg_log_pop_within_five = mean(log_pop_within_five, na.rm = TRUE)) %>%
  arrange(desc(avg_log_pop_within_five))
state | avg_log_pop_within_five |
---|---|
District of Columbia | 13.609898 |
New York | 12.923677 |
New Jersey | 12.561995 |
California | 12.555172 |
Massachusetts | 12.250144 |
Maryland | 12.223699 |
Nevada | 12.176492 |
Illinois | 12.132436 |
Rhode Island | 12.114554 |
Florida | 11.989963 |
Puerto Rico | 11.987275 |
Arizona | 11.907076 |
Connecticut | 11.885447 |
Pennsylvania | 11.796463 |
Texas | 11.743707 |
Hawaii | 11.690124 |
Utah | 11.684077 |
Colorado | 11.679461 |
Virginia | 11.669739 |
Ohio | 11.669211 |
Washington | 11.661034 |
Delaware | 11.646092 |
Michigan | 11.545489 |
Georgia | 11.408529 |
Oregon | 11.371397 |
Indiana | 11.267387 |
Louisiana | 11.249018 |
North Carolina | 11.240352 |
Minnesota | 11.188313 |
Tennessee | 11.160029 |
Wisconsin | 11.149746 |
Missouri | 11.146195 |
South Carolina | 11.056779 |
New Hampshire | 10.935114 |
Kentucky | 10.910523 |
Oklahoma | 10.900421 |
Alabama | 10.791752 |
New Mexico | 10.750496 |
Nebraska | 10.725683 |
Kansas | 10.713352 |
Idaho | 10.535197 |
West Virginia | 10.445101 |
Arkansas | 10.360037 |
Mississippi | 10.343037 |
Iowa | 10.298608 |
Maine | 10.135474 |
Vermont | 10.094946 |
Alaska | 9.940388 |
Wyoming | 9.773519 |
South Dakota | 9.607456 |
Montana | 9.514699 |
North Dakota | 9.415500 |
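The buffer-and-intersect join described in this section can be sketched with {sf} on toy data. Everything here is an assumption made for illustration: the square "tracts", their populations, the `square()` helper, and the choice of EPSG:5070 (a projected CRS in meters). It is a minimal sketch of the technique, not the code from the repo.

```r
library(sf)

# Hypothetical helper: a square polygon with lower-left corner (x, y).
square <- function(x, y, size) {
  st_polygon(list(rbind(
    c(x, y), c(x + size, y), c(x + size, y + size), c(x, y + size), c(x, y)
  )))
}

five_miles <- 5 * 1609.344  # meters, matching the CRS units

# Three made-up tracts: a and b are near each other, c is far away.
tracts <- st_sf(
  tract = c("a", "b", "c"),
  pop   = c(1000, 3000, 500),
  geometry = st_sfc(
    square(0, 0, 4000),
    square(9000, 0, 4000),
    square(50000, 0, 4000),
    crs = 5070
  )
)

# Five-mile buffer around each tract centroid.
buffers <- st_sf(
  focal_tract = tracts$tract,
  geometry    = st_buffer(st_centroid(st_geometry(tracts)), dist = five_miles)
)

# One row per (buffer, tract) pair whose geometries intersect.
pairs <- st_join(buffers, tracts, join = st_intersects)

# Sum the population of every tract touching each buffer.
pop_within_5_mi <- aggregate(
  pop ~ focal_tract,
  data = st_drop_geometry(pairs),
  FUN  = sum
)
names(pop_within_5_mi)[2] <- "pop_within_five"
pop_within_5_mi
```

Note that tracts a and b are counted as neighbors even though their centroids are 9,000 meters (about 5.6 miles) apart, because each tract's boundary reaches into the other's buffer. An intersection join counts the whole population of any tract that overlaps the buffer at all.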
one more alternative approach
Rather than intersecting tract boundaries with the five-mile buffer from their centroids, I tried using st_is_within_distance to join the data frame of tract centroids to itself. Since no index is built, this is not as fast. It also gives results that match the reference less well. You can install the package from the centroid-self-join branch for a look at those.
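A centroid self-join of this kind might look roughly like the following. The centroids, populations, and CRS (EPSG:5070, in meters) are made up for illustration, and the sketch calls the predicate directly rather than going through a join, so it may differ from what the branch actually does.

```r
library(sf)

five_miles <- 5 * 1609.344  # meters, matching the CRS units

# Made-up tract centroids along a line, in a projected CRS.
centroids <- st_as_sf(
  data.frame(
    tract = c("a", "b", "c", "d"),
    pop   = c(1000, 3000, 500, 2000),
    x     = c(0, 6000, 9000, 50000),
    y     = c(0, 0, 0, 0)
  ),
  coords = c("x", "y"),
  crs    = 5070
)

# For each centroid, the row indices of all centroids within five miles.
# Every centroid is within distance zero of itself, so it is always included.
neighbors <- st_is_within_distance(centroids, centroids, dist = five_miles)

# Add up the population of each centroid's neighbors.
centroids$pop_within_five <- vapply(
  neighbors,
  function(idx) sum(centroids$pop[idx]),
  numeric(1)
)
st_drop_geometry(centroids)
```

Unlike the buffer intersection, this only counts a tract when its centroid falls within five miles, so a large tract that merely overlaps the buffer contributes nothing.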
how to get even closer?
fivethirtyeight do some tweaking of their index for tracts whose centroid is farther than five miles from any other tract centroid:
For a census tract that is more than 5 miles away any other census tract (centroid to centroid), this number is decreased based on the minimum distance to the nearest census tract
I might be using the wrong projection for this kind of analysis.
There may be a better spatial join for expressing the relationship described in the article.
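On the projection question, one rough check is to measure the same short distance under different coordinate reference systems. The sketch below uses two made-up points a little under five miles apart at 45°N and compares the geodesic distance with planar distances in Web Mercator (EPSG:3857) and CONUS Albers (EPSG:5070). Both projections are picked purely for illustration; they are assumptions, not necessarily what this analysis uses.

```r
library(sf)

# Two made-up points on the same parallel, 0.1 degrees of longitude apart.
pts <- st_sfc(
  st_point(c(-93.0, 45.0)),
  st_point(c(-92.9, 45.0)),
  crs = 4326
)

# Geodesic distance in meters, computed on the lon/lat coordinates.
geodesic <- as.numeric(st_distance(pts)[1, 2])

# Planar distances in meters after projecting.
mercator <- as.numeric(st_distance(st_transform(pts, 3857))[1, 2])
albers   <- as.numeric(st_distance(st_transform(pts, 5070))[1, 2])

# Ratios near 1 mean a five-mile buffer in that CRS is close to five real miles.
c(mercator = mercator / geodesic, albers = albers / geodesic)
```

If the ratio for the CRS in use is far from 1 at the latitudes that matter, a fixed five-mile buffer will capture too much or too little area, which would shift the population totals.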