Framework for aligning New England place nodes to census categories

Previous discussions on this forum and on the Slack have largely agreed that Berlin has been too small to be a place=city since the end of the timber boom. Its three largest employers are the hospital, the prison, and the other prison.

By that logic, Presque Isle should not be tagged as a city, either. And unlike Presque Isle, at least Berlin has a defined micropolitan area.

If Berlin, NH, a city with a defined micropolitan area, is unacceptable, then why is Presque Isle, ME still tagged as a city?

Berlin has a NECTA because of its isolation. The distance between Berlin, NH and Concord, NH is over twice the distance between Newport, RI and Providence, RI. Wakefield, Westerly, Woonsocket, and Pawtucket are even closer together. There’s a reason they’re all part of the Providence metropolitan area. The entire state of Rhode Island is smaller in size than the Bangor, ME and Portland, ME metro areas.

Agreed, Presque Isle should not be tagged as a city.

I don’t think you’re aware that there have already been attempts to downsize New England city labels in the past to only include larger cities with metro areas. Because of cases such as Montpelier and Augusta, it will never be accepted.

New England has less of a city inflation problem, and much more of a lack of standardisation problem. I support a standard that comes from an independent, government source, that doesn’t drastically change the map that currently exists.

If you try to limit a state like Vermont to Burlington only, or a state like Maine to Portland, Bangor, and Lewiston only, you will never get a large enough consensus to implement those changes, at least if you expect them to last more than a week.

On the contrary, there is a fair amount of consensus emerging from the people that actually live here as we continue to explore the issue.

I would suggest toning down the combative attitude if you seriously intend to collaborate.

The OMB has demoted Berlin as it no longer even meets the population threshold for a micropolitan area. See: Is Your Locality Impacted by the Changes to the 2023 Core Based Statistical Area Definitions? | Chmura.

A prerequisite for classifying the place=* points in New England is making sure that they actually represent what they purport to represent. The place=* values more or less fall into three orthogonal hierarchies:

Administrative areas: country, state, county, municipality
Population centers: city, town, village, hamlet
Parts of population centers: borough, suburb, quarter, neighbourhood

The vast majority of place=city/town/village/hamlet nodes were imported from the Populated Place class in GNIS. The coordinates are located at a “downtown” rather than the centroid of any administrative area, and the Feature ID in gnis:feature_id=* specifically represents a population center. (The Civil class in GNIS represents administrative areas, but we never imported those features.) To the extent that these population centers correspond well to TIGER-imported administrative areas in name and function, they are label members of administrative boundary relations.

If you were to map a place point for an administrative area, it would be located at the area’s centroid, regardless of where people live or work, and it would be the label member of an administrative boundary relation. Most states and counties have such place points. This is also an option for the cities and towns that evenly partition New England counties, which would be tagged as place=municipality, but to date only two have been mapped as points: Clarksburg, Massachusetts, and Colchester, Vermont. (place=municipality is the standard tagging for towns in New York and Wisconsin.)

Unfortunately, the TIGER import sometimes incorrectly conflated some places with larger, identically named places elsewhere in the state, inflating both the population and place classification. Compounding the problem, subsequent mass edits conflated many more populated place points with administrative areas by the same name, even though only a small portion of the administrative area is built up.

In the Midwest, many sleepy unincorporated communities wound up with population=*, wikipedia=*, and wikidata=* tags corresponding to the entire surrounding township, even though the point remained at a tiny population center in one corner of the township. Even today, in New England, this QLever query finds 413 place=city/town/village/hamlet/neighbourhood points that are linked to cities or towns in Wikidata and generally have the populations of those entire cities or towns, as if the feature were really tagged place=municipality. These points should ideally match CDPs that the Census Bureau has created to approximate the town’s central village.

In some cases, this overconflation may be the best we can do. In Maine, the Census Bureau abolishes any CDPs within a town when it reorganizes as a city, even if nothing has changed about the city’s population distribution. But ignoring that problem, this QLever query still found 48 place=town/village/hamlet/neighbourhood points that were linked to towns in Wikidata despite there being a CDP by the same name within the town.

I’ve gone through these results, replacing each point’s population=*, wikidata=*, and wikipedia=* tags with those of the CDP; changing its role in any administrative boundary relation to admin_centre; and adding it to a boundary=census relation, if available, as a label member. In some cases, the population rose as I simply updated the figure from 2006 estimates to 2020 figures, but in most cases, the population fell – in some cases, to as little as two percent of the previous figure.
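For anyone who wants to repeat this cleanup elsewhere, here is a minimal sketch of the retagging step in Python. It is not the tooling actually used for these edits. The function names, the place name, and the `Q_TOWN`/`Q_CDP` identifiers are all hypothetical placeholders, and dropping a town-wide tag when the CDP has no equivalent is my assumption.

```python
def retag_point_from_cdp(point_tags, cdp_tags):
    """Return a copy of a place point's tags with the CDP's figures swapped in.

    place=* is deliberately left untouched, as in the cleanup described above.
    """
    updated = dict(point_tags)
    for key in ("population", "wikidata", "wikipedia"):
        if key in cdp_tags:
            updated[key] = cdp_tags[key]
        else:
            # Assumption: a town-wide value is dropped rather than kept
            # when the CDP has no equivalent tag.
            updated.pop(key, None)
    return updated


def membership_roles(has_census_boundary):
    """Roles the point should hold in its boundary relations after the fix."""
    roles = {"administrative": "admin_centre"}
    if has_census_boundary:
        roles["census"] = "label"
    return roles


# Hypothetical example data; the name, figures, and IDs are placeholders.
point = {
    "place": "town",
    "name": "Exampleville",
    "population": "9000",
    "wikidata": "Q_TOWN",
    "wikipedia": "en:Exampleville (town)",
}
cdp = {"population": "1800", "wikidata": "Q_CDP"}

updated = retag_point_from_cdp(point, cdp)
print(updated)
print(membership_roles(has_census_boundary=True))
```

The original dictionary is left unmodified, so the before and after tags can be compared when reviewing a changeset.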

I haven’t changed any place=* classifications, so some of them stick out more when you compare their population figures to a global distribution of place classifications. For example, at 17,000 residents, Brunswick, Maine, would be a median town globally but is tagged as a city. Hillsboro, New Hampshire, with a population of 2,200, is tagged as a town but is about one standard deviation away from both town and village. The 23 residents of Bolton, Vermont, can hardly claim to be a hamlet, let alone a village as currently tagged.

With these more accurate population figures, we have a more sound basis for evaluating the current classifications and any replacement criteria.

For those that missed this, I believe the standard deviation Minh is referring to is from Brian’s global analysis of the Distribution of primary populated place values.

Thank you Minh for this clear disambiguation of these population figures between CDPs (which generally align with more densely populated places) and their enclosing Towns.

Unfortunately, many New England Towns contain densely populated places for which the Census Bureau has not defined CDPs. In these cases, it has been common practice to attribute the Town’s population and Wikidata ID to the place node representing the dense settlement rather than to the administrative boundary.

For reference, here is the Census map showing CDPs in Vermont; note the numerous Towns without CDPs.

Two examples of this problem are Moretown (Town, settlement) and Middlesex (Town, settlement), Vermont. Both are tiny settlements on the border between hamlet and village, each surrounded by a large rural municipal area that likely has a greater population than the settlement itself.

In these cases I think that for internal consistency we should move the population and Wikidata tags to the administrative boundary relation as the settlements do not have a known-to-the-Census population and are not the municipality. An additional task would be to create new Wikidata items for the unincorporated settlements themselves if that was desired.

Agreed. For example, Madawaska, Caribou, Maine, had gotten conflated with a town by the same name elsewhere in Aroostook County. Since Caribou is a city, there’s no CDP for this Madawaska and therefore no convenient population figure to tag. Caribou’s central village similarly has no specific data. I stripped the population tag off Madawaska but haven’t stripped them off these cities’ central villages yet, since I was focusing on more incontrovertible changes.

In some cases, we could figure out which census blocks correspond to which places within the city. If the Urban Area fits within the city limits, it probably approximates what the place point represents and we can use its population directly. If the city organized since the 2000 census, we could track down the geometry of the former CDPs and correlate them to current census blocks, but that would be more time-consuming.

Wikidata generally doesn’t have preexisting items about places like Madawaska yet, because Wikipedia never considered them notable enough for an article that Wikidata could import. So I created an item for it based on GNIS. We could automate this item creation using a GNIS dump and OpenRefine.

This might be a little tricky, but I found P.L. 94-171 County Block Map (2020 Census) which provides detailed maps of census blocks. It looks like these blocks generally are bounded by through-roads, rivers, and rail lines.

Here is a screenshot from Washington County Vermont with the Moretown village area roughly circled:

If a dataset with these census blocks and their populations were available in a format that could be browsed in QGIS, then it may be fairly straightforward to make up our own CDP-equivalents to get estimates of place populations. Definitely not perfect, but probably more realistic than just taking the entire municipality’s population. Please post a link if anyone has found such a dataset!
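To make the idea concrete, here is a rough sketch of how a home-made CDP-equivalent could be computed once block data is in hand. It assumes each census block has already been reduced to a representative point with a population, and that someone has hand-drawn an outline around the village. The coordinates, populations, and function names below are made up for illustration and are not real census data.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?

    `polygon` is a list of (x, y) vertices; the ring is closed implicitly.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def estimate_population(blocks, outline):
    """Sum the population of blocks whose representative point is in the outline."""
    return sum(
        block["population"]
        for block in blocks
        if point_in_polygon(block["x"], block["y"], outline)
    )


# Hypothetical hand-drawn village outline and block points (toy coordinates).
village_outline = [(0, 0), (4, 0), (4, 4), (0, 4)]
blocks = [
    {"x": 1, "y": 1, "population": 40},
    {"x": 3, "y": 2, "population": 25},
    {"x": 5, "y": 5, "population": 100},  # outside the outline
]

print(estimate_population(blocks, village_outline))  # → 65
```

A block that straddles the outline is counted entirely or not at all, so the result is only as good as the outline and the block granularity.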

I did this for Colchester village within the Town of Colchester, VT (this place is like Barnstable, MA), but at the census tract level. This is not ideal, as the tract is much larger than the actual village area, so it’s probably a significant overestimate. I’d be perfectly happy somehow indicating that a place node’s population is unknown. I’m mostly concerned with clearly distinguishing between a hamlet/village and a surrounding municipality that shares the same name. Separate Wikidata items certainly seem a reasonable way to handle this.

A place=municipality node at the admin boundary centroid with the label role also seems like a reasonable way to head off future mappers adding or upgrading nodes to place=town just because we call municipalities Towns here. A label node for each municipality shouldn’t be strictly necessary since data consumers can calculate a centroid from the boundary polygon, but with counties, states, and nations all having these centroid place nodes it seems ok for municipal boundaries to have them as well.
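As a sketch of what "calculate a centroid from the boundary polygon" involves, here is the standard area-weighted (shoelace) centroid for a single ring. It treats coordinates as planar and ignores holes and multipolygons, which are simplifying assumptions. The L-shaped boundary is a made-up example.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices.

    The ring is closed implicitly. Planar math only: fine for a
    municipality-sized area, but an approximation on raw lon/lat.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area.
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


# Hypothetical L-shaped municipal boundary (toy coordinates).
l_shaped_town = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
print(polygon_centroid(l_shaped_town))
```

For this L shape the centroid lands outside the boundary itself, which is one practical argument for an explicit label node on oddly shaped municipalities.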

Just to be clear, while we can go down the rabbit hole of obtaining super-precise population figures, I’m only updating these population tags so that we can get people out of the mindset of oversimplifying the whole town as a monolith. Ultimately, if we rely on Urban Areas to classify places, then the places that lack CDPs will naturally be classified as village or less, so the precise population becomes less relevant to rendering or geocoding.

Also we should consider that these values are on overlapping distributions. In a dense region, I’d look for places to sit on the higher end of those curves, and vice versa for sparse areas.

Assuming that population density correlates to “major places density” to some extent, perhaps you could analyze the distribution of place classifications versus population density by quadtile or something along those lines. I’d be curious to see if that overlap becomes more or less pronounced. Of course, any analysis based on existing OSM data is affected by past practices like rigidly defining city, town, and village based on the order of magnitude of the population count.
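A first step toward that kind of analysis might look like the sketch below, which bins place nodes into slippy-map tiles at a fixed zoom and counts the classifications in each tile. The zoom level, node list, and function names are assumptions for illustration. Joining each tile to a population density figure from an independent source is left out.

```python
import math
from collections import Counter, defaultdict


def lonlat_to_tile(lon, lat, zoom):
    """Standard slippy-map (Web Mercator) tile index for a coordinate.

    Assumes lon is in [-180, 180) and lat is within the Mercator limits.
    """
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def classes_by_tile(nodes, zoom=8):
    """Count place=* values per tile."""
    counts = defaultdict(Counter)
    for node in nodes:
        tile = lonlat_to_tile(node["lon"], node["lat"], zoom)
        counts[tile][node["place"]] += 1
    return counts


# Hypothetical place nodes; coordinates and classes are illustrative only.
nodes = [
    {"lon": -72.60, "lat": 44.30, "place": "village"},
    {"lon": -72.70, "lat": 44.35, "place": "hamlet"},
    {"lon": -68.00, "lat": 46.70, "place": "city"},
]

for tile, counter in sorted(classes_by_tile(nodes).items()):
    print(tile, dict(counter))
```

Tiles cover less ground area toward the poles, so an equal-area grid might be a better choice for a global comparison.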

Ultimately, I think the considerations for sparsely populated areas can only go so far. The remotest places in the country aren’t comparable to the most densely populated places in the country. A city needs a certain critical mass. Without that critical mass, it might have some of the elements and characteristics of a city without necessarily being a city.

A rendered map doesn’t need to concern itself with these nuances, because it can always surface smaller, more remote places as space allows. The allotted space can depend on plenty of factors that are irrelevant to this database, including the size of place labels, the density of natural feature labels, whether labels near the coasts are biased toward the sea, hanging off the continent – and above all, the map projection.

That would be a cool analysis though I’d have to think harder about how to compose it. I think at this point I’d offer a conceptual framework.

Consider the boundary between city and town. From my analysis, the midpoint of that boundary, picking a round number between the one-sigma limits of each distribution, is a population of around 35,000.

As such, in an area of “average” population density (I use the scare quotes because I’m deliberately not defining what average means), I would expect that to be roughly the dividing line. And thus in such a hypothetical average region, we would expect, from the analysis above, Portsmouth, Bangor, Lewiston, and perhaps even one or both of Leominster/Fitchburg to make the cut.

Now, I could also imagine, using a not-yet-described heuristic, saying that all of New England is an area of higher than usual population density, and thus the threshold is not 35K, but rather something higher.

I could also imagine a line of logic that says that Maine, or perhaps most of Maine except the southernmost part, has below-average population density. And thus, the cutoffs should be somewhat below 35K, and could include places like Brunswick or Augusta.

I don’t have any numbers or specific heuristics behind this, but conceptually it does seem consistent that there is some level of population density bonus or penalty.
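Purely to make the shape of the idea concrete, and not as a proposed rule, here is one way a density bonus or penalty could be expressed. Only the 35,000 baseline comes from the discussion above. The power-law form, the exponent, and the density figures are hypothetical choices that would need real analysis behind them.

```python
BASE_CITY_THRESHOLD = 35_000  # rough city/town midpoint from the global analysis


def adjusted_city_threshold(regional_density, reference_density, exponent=0.5):
    """Scale the baseline threshold up in dense regions, down in sparse ones.

    Hypothetical heuristic: the power-law form and default exponent are
    placeholders, not values derived from any data.
    """
    if regional_density <= 0 or reference_density <= 0:
        raise ValueError("densities must be positive")
    ratio = regional_density / reference_density
    return BASE_CITY_THRESHOLD * ratio ** exponent


# Hypothetical densities in people per square kilometre.
print(adjusted_city_threshold(100, 100))  # average region → 35000.0
print(adjusted_city_threshold(400, 100))  # denser region → 70000.0
print(adjusted_city_threshold(25, 100))   # sparser region → 17500.0
```

An exponent of 0 would recover a single fixed threshold, and larger exponents make the adjustment more aggressive, so the exponent is where most of the judgment would sit.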

Nothing about your analysis was specific to New England, right? Would you then expect 35,000 to be a useful threshold anywhere in the country, or anywhere in the world? If not, and it needs to be scaled up or down in each region based on population density or named place density, then how do we know that New England was a good starting point for validating this threshold to begin with?

I’m also wary of defining cutoffs by taking a literal global distribution of existing population figures at face value, because some regions actually have stubbornly tied place (and highway) classification to legal designations. This is what led to the notorious china_population proposal. Hopefully mappers in China have had subsequent discussions along the lines of this one, since they face exactly the same problem as in New England.

Rather than any particular threshold, your analysis highlighted for me that town is underused relative to village and city. One would expect a more normal distribution where town ends up being used for a wider band of population figures than the other classifications. This would also (very roughly) line up with how “town” is a much more overloaded term in everyday speech than “village” or “city”.

Correct, my analysis was global, and perhaps more importantly, it does not in any way account for any lumps due to imports or localized classification rules. Perhaps there is some glut of lesser-population city values in Bulgaria that throws the whole thing out of whack. You’d need a deeper analysis to suss that out. Without that, the only thing we can really say is how our classifications compare to the database as a whole. I’m relying on something akin to the central limit theorem to assume (rephrasing with some liberty) that on a big enough data set, you tend to see normal distributions.

town is certainly under-used compared to village, though city includes several bars on the far right of the graph that come from so few nodes that I would be careful in drawing conclusions about the relative width of the city section of the graph compared to other values. If values were assigned ideally, I’d expect equal widths of hamlet, village, and town, with city and isolated_dwelling perhaps having tails that go wider.

I think it also highlights the overuse of village in places of very low population, where hamlet or isolated_dwelling would be more appropriate.

No, I would expect “some threshold value” to be a useful threshold value, that is then adjusted up or down based on some heuristic based on regional population density.

What I am sketching out is more of a general framework for how we might consider assigning place values that’s smarter than strict population thresholds but is still well short of anything that might be algorithmic.

Maybe, but as you point out, all it takes is one import to really skew the results. In fact, the place nodes in much of Asia came from several poorly documented imports from GNS and other local datasets back in the 2000s. These datasets don’t come with any population data, so there’s no telling what criteria were used to classify them. I’m not sure that most of the local communities have gotten around to systematically scrutinizing the classifications, especially in rural areas.

Edit: I reread this comment and I don’t know how population-less nodes would skew a distribution of populations. :joy: