Proposed Import of Rockbridge County, VA address points

It appears that SHENANRIDGE in the original data became Sheridge in the translated data. For example, this record from the original data:

ADDPTKEY=12916208435
ADDRNUM=70
ESN=305
FIPS=51163
FULLADDR=70 SHENANRIDGE LN
FULLNAME=SHENANRIDGE LN
LASTUPDATE=3/16/2023 0:00:00
LAT=37.877975929000058
LONG=-79.353159246999951
MSAG_COMMUNITY=FAIRFIELD
MUNICIPALITY=Rockbridge County
OID=-1
PO_NAME=FAIRFIELD
PSAP=7194
SITEADDID=51163000010819.000000000000000
STATE=VA
STATUS=CURRENT
STREET_NAME=SHENANRIDGE
STREET_TYPE=LN
USNGCOORD=17SPB4483393554
ZIP_5=24435

Became this record in the translated data (1_Complete data in one file.csv):

addr:city=Fairfield
addr:housenumber=70
addr:postcode=24435
addr:state=VA
addr:street=Sheridge Lane

Given what I am seeing, I don’t think this ChatGPT approach is appropriate. It is one thing to get upper/lower case mixed up, but to change the name of a street for no reason and to drop records is… weird. I worry about what else ChatGPT did that I haven’t been able to detect.

Here is another strange translation by ChatGPT: TWIN HOLLOWS LN was translated to Twin Hollow Lane (Hollows [plural] became Hollow [singular]). Here is an example of a record from the original data:

ADDPTKEY=1218221308
ADDRNUM=30
ESN=301
FIPS=51163
FULLADDR=30 TWIN HOLLOWS LN
FULLNAME=TWIN HOLLOWS LN
LASTUPDATE=3/16/2023 0:00:00
LAT=37.994073436000058
LONG=-79.486138322999977
MSAG_COMMUNITY=GOSHEN
MUNICIPALITY=Rockbridge County
OID=-1
PO_NAME=GOSHEN
PSAP=7194
SITEADDID=51163000002390.000000000000000
STATE=VA
STATUS=CURRENT
STREET_NAME=TWIN HOLLOWS
STREET_TYPE=LN
USNGCOORD=17SPC3292806238
ZIP_5=24439

and here is what it was translated to:

addr:city=Goshen
addr:housenumber=30
addr:postcode=24439
addr:state=VA
addr:street=Twin Hollow Lane

The associated street in OSM is named “Twin Hollows Lane.”

I will say… ChatGPT is really annoying to try to force to help with an import. We really do need good, comprehensive scripts available that are adapted to work with modern address datasets.

I’m replying with some easy answers here. I’ll reply to the others tomorrow.

Regarding the ADDRNUMSUF, I intentionally had that put into addr:unit. While I agree it would be written as 143A, I did not think that was appropriate for OSM. I can change this; I’m just bringing up that it was intentional.

I removed the address line with I64 and I81 manually from the processed data file because it was missing the addr:housenumber. I see in your example here that the source data has that info. Good catch; I’ll add it back.

For the road names with ‘and’: first, they were capitalized as ‘And’. When fixing them, I replaced the word with the symbol ‘&’. If the road sign uses the & symbol, should the name use the word or the symbol? In Augusta County, in some places where I retained the ‘and’ spelling, the sign actually reads ‘&’. However, if the USPS requires ‘and’, and ‘and’ is how the data came, I’m thinking I should revert to ‘and’.

Good catch on Saint Andrews, and Off the Beaten Path.

I removed the STO and OFF, because I didn’t know what they meant.

One comment on the road names: I do use MapWithAI to validate that addr:street matches a nearby road, using the VGIN road centerlines shapefile layer as the source for any road names in question. Obviously that’s only for finding and fixing problems during the import that were missed while reviewing the data, not the kinds of problems found here.
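For anyone wanting to do a similar check outside of JOSM, here is a rough Python sketch of the idea using geopandas. The file names, the centerline name field (FULLNAME), and the 50 m tolerance are all assumptions — adjust to the actual VGIN schema and your output CSV (this assumes the processed CSV still carries LAT/LONG columns):

import geopandas as gpd
import pandas as pd

# Hypothetical file names; UTM 17N so distances are in meters.
roads = gpd.read_file("VGIN_centerlines.shp").to_crs(epsg=26917)
pts = pd.read_csv("rockbridge.csv")
pts = gpd.GeoDataFrame(
    pts, geometry=gpd.points_from_xy(pts["LONG"], pts["LAT"]), crs=4326
).to_crs(epsg=26917)

# Match each address point to the nearest centerline within 50 m, then
# list the points whose addr:street disagrees with the road name.
joined = gpd.sjoin_nearest(pts, roads[["FULLNAME", "geometry"]], max_distance=50)
mismatch = joined[joined["addr:street"].str.upper() != joined["FULLNAME"].str.upper()]
print(mismatch[["addr:street", "FULLNAME"]])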

I believe the consensus here is that while ChatGPT did wonders, it is not predictable, and it is not always careful to avoid corrupting data.

Maybe it can be helpful to me in developing a Python script (Python is how ChatGPT processed this data), and by tweaking that script I might have a solid process. The biggest concern I have from these comments is where ChatGPT renamed the road. That is unacceptable.

Here’s where I’m coming from: I’m busy (like everyone else here, I’m sure). I work as a career EMS provider and am a mechanic on my days off. I thoroughly enjoy mapping. But I don’t have the desire to put the time needed into learning to code (from scratch), as doing that would require me to say no to other priorities. So I’m trying to think outside the box here. I really appreciate y’all’s support and review of this process.

By the way, it’s possible to automate this process in large part using something like PostGIS or QGIS. You’d join the address points to VGIN’s statewide parcel layer to associate each parcel polygon with an address, then join it to the building footprint layer to associate the largest building on each parcel with the address. The “largest building in parcel” heuristic isn’t perfect – it tends to snag boat garages – but it’ll save you a lot of time, especially if you intend to repeat this process in other counties.
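For example, here is a minimal geopandas sketch of that join chain. The file names and the PARCEL_ID column are placeholders, and it assumes all three layers share a projected CRS so building areas are meaningful:

import geopandas as gpd

# Placeholder file/field names; all layers assumed in the same projected CRS.
parcels = gpd.read_file("parcels.shp")[["PARCEL_ID", "geometry"]]
points = gpd.read_file("address_points.shp")
buildings = gpd.read_file("buildings.shp")

# 1. Attach each address point to the parcel that contains it.
addr = gpd.sjoin(points, parcels, predicate="within").drop(columns="index_right")

# 2. Attach each building to a parcel ("intersects" is more forgiving than
#    "within" for footprints that straddle a boundary), keep the largest.
bldg = gpd.sjoin(buildings, parcels, predicate="intersects").drop(columns="index_right")
bldg["area"] = bldg.geometry.area
largest = bldg.sort_values("area").groupby("PARCEL_ID").tail(1)

# 3. Put each parcel's address on its largest building.
result = largest.merge(addr.drop(columns="geometry"), on="PARCEL_ID")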

The Virginia Address Points Geospatial Data Standard details the standard procedure for deriving the full street name from the various fields and includes a comprehensive table of abbreviations.
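In code terms, the derivation mostly amounts to joining the populated fields in the standard’s order and expanding abbreviations. A sketch — STREET_P_1 (the direction prefix column) comes up later in this thread, but treat the exact field list as a guess against the standard:

# Sketch; only fields visible in this dataset are used.
def full_street_name(rec):
    parts = [
        rec.get("STREET_P_1", ""),   # direction prefix, e.g. N
        rec.get("STREET_NAME", ""),  # e.g. SHENANRIDGE
        rec.get("STREET_TYPE", ""),  # e.g. LN
    ]
    return " ".join(p for p in parts if p)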

PO_NAME is defined as the “name of the postal office servicing area”, which sounds like the postal city. :+1: In some cases, the postal “city” might be a county name, but that’s expected.

If the dataset doesn’t have sub-building precision, you can tag the building with the full list of units within. But I’ve been arranging them neatly within the building and adding a fixme=* about verifying the relative locations. This makes it easier for another mapper to later retag an individual unit as a POI.

It may seem cheesy to create an artful arrangement of points within the building, but I’ve seen official address point datasets use a variety of patterns within buildings, such as a grid, a random distribution, or a beautiful arc around the entrance (can’t make this up).

I wrote that bit based on existing practice, but I dislike that we’ve been discarding the designator. The letter will reach its destination regardless of the designator, or even if you smoosh the unit number into the house number. But the designator would still be a nice level of polish for address formatting – otherwise the state wouldn’t have kept track of it.

A couple mappers have experimented with isolating the designator as addr:flats:label=*. For individual address nodes within a building, addr:unit:label=* would pair nicely with addr:unit=*. My only qualm is that addr:flats:label=* might run into trouble where a single building uses one designator for some of its units and another designator for the rest. But that’s quite an edge case.
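To make that concrete (hypothetical values), an address point that would otherwise be translated as addr:unit=Suite B would instead carry:

addr:housenumber=123
addr:street=Main Street
addr:unit=B
addr:unit:label=Suite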

@pmfox Thanks for your response and your continued interest in getting these data into OSM.

As I mentioned before, I have a Python program that I am willing to share with you (and the community) once I make a few changes. However, all of my free time yesterday was spent QCing the ChatGPT output, so I didn’t have a chance to work on the program. If you are open to the idea, I propose that over the next couple of days I make the changes and provide the program to you, and then you, I, and the rest of the community can review both the output and the program. But if more ChatGPT output is provided in the meantime, I will spend my time reviewing that.

While I agree that 123A Main Street Suite A is a bit redundant, I am not the authority on such things; the addressing authority is, and unless the state of VA made an error in compiling these data from the individual local authorities, it should be 123A Main Street Suite A. Also, in some cases addr:unit came out as A Suite A, which doesn’t seem to make sense. In one case, addr:unit in the translated data was B B.

I am pretty sure, based on location, that OFF should be translated as Office, and I am pretty sure that STO was meant to mean Stop but is an error.

That is a big concern I have as well. There are more examples that I didn’t bother to document. For example, in the street name, PARK seems to have always become Parks, and the street type SPUR became Spurs. In the street name, any time a word contained NAN, that substring was removed from the word; perhaps ChatGPT interpreted NAN to mean “not a number”? Trails End Lane became Trail End Lane.

@tekim Thank you for your willingness to help, and for reviewing my data. I’m happy to wait on your script. I was exploring more with ChatGPT, but I hadn’t considered the effect of consuming your time in reviewing it.

@Minh_Nguyen I like the idea of separating the unit label into addr:unit:label. That just makes sense (to my brain), because that should work for map rendering and data consumers alike.

Is there any reason not to implement the practice of removing the addr:unit designator and placing it in an addr:unit:label field for this import?

I couldn’t resist messing with this some more; I made another attempt, in Dropbox as 1_edited_street_data.csv. @tekim, please don’t feel the need to review; I’m satisfied to wait for your script. I mainly wanted to implement some of the formatting discussed here. Here are the changes in this file:

  1. ADDRNUMSUF is concatenated to addr:housenumber: 143 becomes 143A. (No addr:unit on any of these.)
  2. No misspelled street names. However, there may well still be abbreviations.
  3. Added fixme=verify unit to all rows with addr:unit tag data.

Data is sorted by city, latitude, longitude.

It can’t hurt. Any tags you like. If we decide to call it addr:unit_designator in the future, it would be a straightforward find and replace operation. You’ll have to decide whether to format the values as freeform text (e.g., Apartment) or as keywords (apartment), with implications for tagging addresses in, say, Spanish-speaking Puerto Rico. I would lean towards freeform text, if only because few other addr:* keys are set to keywords.

I will code the script to do that. Note that the source data should already have the “label” broken out, but it does not in all cases. No big deal, just a little extra coding. Also, as we have seen, sometimes the “label” and unit are smashed together, e.g. “APTB”. Wish these data providers would follow their own schemas.
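One possible way to handle the smashed cases — a sketch, assuming a designator table along the lines of the USPS one (the excerpt here is abbreviated); the ordering matters so longer abbreviations win:

# Excerpt of a designator table; the real one has many more entries.
DESIGNATORS = {"APT": "Apartment", "STE": "Suite", "UNIT": "Unit", "OFC": "Office"}

def split_unit(value):
    """Split a smashed value like 'APTB' into ('Apartment', 'B')."""
    # Try longer abbreviations first so e.g. 'UNIT2' isn't misread.
    for abbrev in sorted(DESIGNATORS, key=len, reverse=True):
        if value.startswith(abbrev) and len(value) > len(abbrev):
            return DESIGNATORS[abbrev], value[len(abbrev):].strip()
    return None, value  # no designator recognized; leave the unit alone

print(split_unit("APTB"))  # ('Apartment', 'B')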

Of course, that is what the community is all about. Thanks for being open to feedback.

I made progress today. I just need to handle the addr:unit and addr:unit:label and do some code cleanup.

@pmfox This is a little messy, but it works.

For the moment, everything is hardcoded. You need to have a file in the same directory as the Python program called VirginiaSiteAddressesPoint.txt

The program produces a file called rockbridge.csv

@pmfox FYI, upon completion the program prints out a summary of all of the unique addr:street and addr:city values. This aids in QCing those fields.
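The summary is roughly what you would get from something like this (a sketch, assuming the output CSV’s column names):

import csv
from collections import Counter

streets, cities = Counter(), Counter()
with open("rockbridge.csv", newline="") as f:
    for row in csv.DictReader(f):
        streets[row["addr:street"]] += 1
        cities[row["addr:city"]] += 1

# Odd names (typos, stray abbreviations) stand out in a sorted list.
for name in sorted(streets):
    print(f"{streets[name]:5}  {name}")
print(sorted(cities))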

Just popping in as a spectator for now, but I’d like to use this same dataset in the future for a few Virginia counties near me (Campbell, Amherst, and Bedford), so any lessons learned or tools built would be invaluable when I start working on it. Using the same procedure and tools across Virginia would be ideal, and I don’t believe running these steps on a different county from the same base dataset would be significantly different.

Hi @OptikalCrow

I will run some tests on those counties with the script I wrote and let you know. For now I have done the bare minimum to get it to work for Rockbridge County, so I anticipate some changes will be needed. Higher priority will be supporting @pmfox for Rockbridge.

Yep, of course, I will wait until this import goes smoothly to start mine anyways. Thanks for taking a look, though!

@tekim this is fantastic! At first I tried running it with the ogr2osm script, but then I discovered I was overcomplicating it. I simply opened Python IDLE, opened the script, and ran it, and it worked great.

In reviewing the unique name list, I’ve only found a couple of issues, and I don’t know if it would be possible to automate these anyway.

  • Bv Farm Lane → BV Farm Lane
  • La Lane: not sure if this is the musical tone ‘la’ or an abbreviation for the city in California. Leaving as-is.

Thank you @tekim

Dropbox folder of results
To do: split the file into parts. Then, when the community is satisfied with this data, I’ll begin the import.

@tekim, a feature request for the script: include the name field with a fixme tag, because some of the names seem to have formatting errors and weird symbols.

I’ve added the names, and have reviewed some of them already, removing the fixme tag from reviewed names. I added a website=* tag to names I looked up on their websites.

I’m satisfied this data is ready to begin import into OSM. I’ll give it a few days for any further discussion to happen.

@tekim, I’m working on adding all the USPS abbreviations to the Python file. The most recent version can be found in the python scripts folder of the Dropbox link. Hopefully this is helpful so you can spend your efforts writing code rather than copying text.

I’ve finished adding all USPS street suffix abbreviations, direction prefixes, and addr:unit designator abbreviations.
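In the script those tables boil down to dictionaries along these lines (excerpts only — the full versions are in the Dropbox script):

# Excerpts; the full USPS tables are much longer.
SUFFIXES = {"LN": "Lane", "RD": "Road", "DR": "Drive", "CIR": "Circle"}
DIRECTIONS = {"N": "North", "S": "South", "E": "East", "W": "West",
              "NE": "Northeast", "NW": "Northwest",
              "SE": "Southeast", "SW": "Southwest"}
UNIT_DESIGNATORS = {"APT": "Apartment", "STE": "Suite", "BLDG": "Building"}

def expand_suffix(abbrev):
    # Fall back to simple title case for anything not in the table.
    return SUFFIXES.get(abbrev.upper(), abbrev.title())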

@tekim I then ran the script for Augusta County. I had to add a bunch of Mc and Mac regex lines. I also added a few lines to fix specific data-source entry errors. Those lines are commented as such, as I want them to be easy to find if they cause someone else a problem.
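For reference, the Mc fixes can mostly be handled with a single pattern; Mac is riskier (plenty of names like Mackey shouldn’t get an interior capital), which is probably why it takes a bunch of individual lines. A sketch:

import re

def fix_mc(name):
    # After title-casing, restore the interior capital: Mcdonald -> McDonald.
    return re.sub(r"\bMc([a-z])", lambda m: "Mc" + m.group(1).upper(), name)

print(fix_mc("Mcdonald Lane"))  # McDonald Lane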

I found a few things that I don’t know how to fix (when running with Augusta County).

  • C-BO-G LN becomes C Bo G Lane. Usually in these instances, county GIS has several addresses entered with hyphens and several entered without. But on this lane, they all agree and have hyphens. What’s strange is that the following examples leave hyphens in place.
  • Gran-Duke Lane & Mikes-Knob Lane. Mikes Knob exists both with and without the hyphen.
  • N and W Lane becomes North and W Lane. Obviously, this is due to direction-prefix expansion. Is there a way to limit the Python script to obtaining the direction prefix from the STREET_P_1 column (see the sketch after this list)? In Rockbridge, this is the column that contains the direction prefix, but maybe that’s not true for all municipalities, which would make this difficult.
  • There are some instances where the suite is included in the street name. This is a data-source error, not a script problem. But I thought to myself: I wonder if Python, when it recognizes one of the abbreviations from the addr:unit:label dictionary within the street name, could move that abbreviation and any text after it to the appropriate columns. This is probably easier said than done, I know. (Or maybe simply add a fixme tag if it detects one of these abbreviations in the street name; the sketch below takes that simpler approach.)
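Sketches for those last two items, assuming rows are dicts keyed by the source column names (the designator excerpt is hypothetical):

DIRECTIONS = {"N": "North", "S": "South", "E": "East", "W": "West"}
UNIT_DESIGNATORS = {"STE": "Suite", "APT": "Apartment", "OFC": "Office"}  # excerpt

def direction_prefix(row):
    # Expand a direction only when it comes from STREET_P_1, so a street
    # whose *name* starts with "N" (like N and W Lane) is left alone.
    raw = (row.get("STREET_P_1") or "").strip()
    return DIRECTIONS.get(raw.upper(), raw)

def flag_embedded_unit(street_name, tags):
    # Don't try to split automatically; just mark the row for review.
    if any(word.upper() in UNIT_DESIGNATORS for word in street_name.split()):
        tags["fixme"] = "possible unit designator in street name"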

@OptikalCrow, you might try this script now for your municipalities. I added comments (# change me) to the three lines I changed to export Augusta. One determines which municipality to export; the other two relate to the name of the generated CSV file.