Framework for aligning New England place nodes to census categories

Minh_Nguyen · April 27, 2024, 10:06pm

It seems to me that, over the years, overclassification east of the Mississippi and underclassification west of it have always stemmed from a naïve overreliance on administrative areas or their population counts. Your experiment also relies on population counts of administrative areas, namely, the populations of cities and towns, counties (aggregated into CBSAs), and states. However, it uses a formula, based on some qualitative “bonus factors”, that allows more city nodes per MSA depending on a state’s population density.

As a dabbling map designer, I appreciate that you’ve replicated the label density scrubber found in some GIS tools, which I long for renderers like MapLibre to provide. Unfortunately, there’s no way for any of us to judge whether these factors are versatile enough for OSM, other than the one constraint that “the map” “looks right”. This reduces place=* to a purely presentational attribute, rather than one about a populated place’s function in society. If place classification formerly suffered from “garbage in, garbage out”, this greener recycling process will require more transparency the moment anyone disagrees with its results. I’m not entirely sure we’ll ever be able to hone place classification into a science, but any formula we come up with deserves scrutiny.

In the other threads linked at the top, I’ve floated a half-baked idea for how we could classify places nationally with a minimum of fudge factors and magic numbers. We could restrict place=town/city/suburb to places within Urban Areas. Within a UA, place=city would be subject to a simple test of whether any surrounding places would be considered its suburbs or those of another city within the UA, and perhaps the same would extend to choosing a place=town in a smaller UA. This most likely aligns with the Census Bureau’s practice of titling UAs based on their “high-density nuclei”, except that we wouldn’t reclassify any MCD or directional place name as place=town or city. Palmer, Massachusetts, falls within the Springfield UA and would not be its place=city.

Where I get stuck is how to draw the line between a UA that has at least one place=city at its core and a UA that has only a place=town at its core. The same uncertainty recently caused the Census Bureau to drop its distinction between Urban Clusters and Urbanized Areas, which had been set at a population of 50,000 since 1950. They say we’re now free to categorize UAs based on any population threshold we want. Thanks a lot, Census Bureau!

Maybe we could scale up the old threshold to 109,515, based on the country’s population growth since 1950. That’s right around the median UA population of 101,536 and the formerly documented place=city cutoff of 100,000. Maybe we set a budget for the number of place=city nodes within a UA, based on the UA’s population divided by that threshold. But I think these arbitrary cutoffs are only useful to the extent that they align with real-world differences in how places function.
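
To make the arithmetic behind those two ideas concrete, here is a minimal Python sketch. The census totals (151,325,798 for 1950 and 331,449,281 for 2020) are my assumption about the inputs; they do reproduce the 109,515 figure. The function names, rounding the budget down, and the sample UA populations are hypothetical choices for illustration, not a settled proposal.

```python
import math

# Assumed inputs (not stated in the post): decennial census totals for the
# U.S. population. They reproduce the 109,515 figure quoted above.
POP_1950 = 151_325_798
POP_2020 = 331_449_281
OLD_THRESHOLD = 50_000  # former Urbanized Area cutoff


def scaled_threshold(old=OLD_THRESHOLD, pop_then=POP_1950, pop_now=POP_2020):
    """Scale the old cutoff by national population growth."""
    return old * pop_now / pop_then


def city_budget(ua_population, threshold):
    """Number of place=city nodes allowed within one UA.

    Rounding down is an assumption; the idea above only says "divided by".
    """
    return math.floor(ua_population / threshold)


threshold = scaled_threshold()
print(round(threshold))

# Hypothetical UA populations, not real Census figures.
for ua_pop in (60_000, 250_000, 1_000_000):
    print(ua_pop, city_budget(ua_pop, threshold))
```

Under these assumptions, a UA below the threshold gets a budget of zero, so at most it would have a place=town at its core. Whether the budget should round down, round to nearest, or have a floor of one is exactly the kind of magic number that would need scrutiny.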

There have also been concerns that some sparsely populated regions of the country would go blank if we rely on UAs, which like CBSAs require a certain minimum housing density. Nome, Alaska, would be relegated to a village unless we come up with some exception for it. In general, though, I don’t think our goal should really be to pad out the map artificially. A stylesheet should pull in place=village and rely on symbol collision if it needs to maintain an even label density everywhere.

Perhaps we could also consider how Natural Earth classifies places at its three scales. I would rather leave the subjective curation to them and focus on value that we can add independently as a data-driven project, but many renderers mash up Natural Earth at low zoom levels and OSM at high zoom levels, so some degree of alignment may benefit the broader ecosystem. As well, some consistency between regions would benefit our users. Whether it’s our methodology or the resulting density that is consistent, predictability will encourage more data consumers to make more thorough use of our data.