Without any local knowledge, I can’t evaluate your tuning for accuracy. However, I would caution that this exercise reminds me of how geocoders and routers are typically implemented: by tweaking factor weights until certain representative queries or routes yield the desired results. Essentially, it’s an act of curation rather than data gathering. To the extent that a geocoder would factor place=*
into its weights, it would want the tag to communicate some objective fact that it could not derive itself. But a renderer might appreciate a subjective hint, because algorithmic curation is a Hard Problem. I could see this becoming a source of tension in the long term, since both kinds of data consumers rely heavily on this same key.
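To make the distinction concrete, here’s a minimal sketch of the kind of scoring a renderer might do. Everything here is hypothetical: the rank values and the log-population blend are illustrative choices, not anyone’s actual implementation.

```python
import math

# Hypothetical ranks a renderer might assign to place=* values.
# These are a curatorial judgment call, not something derivable from data.
PLACE_RANK = {"city": 4, "town": 3, "village": 2, "hamlet": 1}

def label_priority(place_tag: str, population: int) -> float:
    """Blend a subjective place=* hint with an objectively derived figure."""
    hint = PLACE_RANK.get(place_tag, 0)
    # Log scale so a metropolis doesn't drown out every mid-sized town.
    derived = math.log10(max(population, 1))
    return hint + derived
```

A renderer is happy to take the subjective `hint` at face value when deciding which labels survive a crowded zoom level; a geocoder would rather the tag told it something it couldn’t compute from `population` on its own. That’s the tension.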
The inputs to your spreadsheet are the populations of cities or towns and the populations of CBSAs, both of which have been shown to be poor proxies for the geographies of populated places, and thus inaccurate sources of population figures. You’ve attempted to correct for this inaccuracy by blending them together. How stable would these weights be as populations change? And if we extend this approach beyond New England, can we be sure that it won’t unduly bias the map for bedroom communities at the expense of commercial hubs?
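Here’s a toy illustration of the bias I’m worried about, under the assumption (which may not match your spreadsheet) that the blend is a simple weighted sum; the 0.6/0.4 split and the population figures are invented for the example.

```python
# Hypothetical blend of a municipality's own population with its CBSA's.
def blended_population(muni_pop: int, cbsa_pop: int, muni_weight: float = 0.6) -> float:
    return muni_weight * muni_pop + (1 - muni_weight) * cbsa_pop

# A small bedroom community inside a huge CBSA inherits much of the
# metro's weight...
suburb = blended_population(30_000, 4_900_000)

# ...and can outscore a genuine commercial hub anchoring a modest CBSA.
hub = blended_population(180_000, 700_000)
```

With these made-up numbers the suburb scores far above the hub, even though the hub is six times its size. Whatever weights fix that for one region may need re-tuning as populations shift.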
These are problems that the demographers at the Census Bureau have attempted to solve with their Urban Areas. I must admit that I have no idea what LODES data looks like, but the overall process seems to do a better job of identifying non-rural areas than we could on our own. What if you repeat your experiment but this time with UAs instead of official city and town boundaries? Would you be able to get away with fewer fudge factors? There’s still the issue of UA titles that don’t quite line up with populated place names, but it doesn’t seem as severe as with CBSAs.
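On the title-alignment point, a crude normalization might get most of the way there. This is only a sketch: the title format shown (lead community, `--`-separated components, state suffix) is my assumption about how UA titles are written, and real matching would need more care.

```python
# Rough, assumed normalization of a Census Urban Area title down to the
# lead community's name, for matching against OSM place names.
def primary_place(ua_title: str) -> str:
    name = ua_title.split(",")[0]       # drop the state suffix
    return name.split("--")[0].strip()  # keep the first named community

primary_place("Leominster--Fitchburg, MA")
```

The leftovers that this heuristic mishandles would be the cases worth eyeballing by hand, and my guess is there would be far fewer of them than with CBSA titles.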