Ok, I’m stopping for today. It will be a few days until I can pick this back up. I’ve done several conversions with ogr2ogr, ogr2osm, and gdal_translate.
I’m impressed by how quickly it processes the files, so much quicker than QGIS.
I think where I’m at is figuring out how to make a translation file. I understand this file to be the current format I need, but need to do some modification to match my data. Maybe if I can find enough examples online I will be able to come up with something functional.
Here are the fields, and what I need to do to them:
ADDRNUM - Reliable, addr:housenumber=*
ADDRNUMSUF - Probably should be addr:unit=*
UNITTYPE - Abbreviations expanded, concatenated with UNITID for addr:unit=*
UNITID - Concatenated with the above UNITTYPE. Some of these fields have the UNITTYPE abbreviation in front of the unit ID instead of in the UNITTYPE field.
STREET_P_1 - Direction prefix (N=North, etc.)
STREET_NAM - Street name, reliable, needs capitalization fixed. 3RD is correct when listed that way and should not be changed to ‘Third’. 3RD → 3rd.
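As a rough sketch of the street-name handling in the list above, something like the following Python could expand the direction prefix and fix capitalization while keeping ordinals such as 3RD as 3rd. The function names and the DIRECTIONS table are my own assumptions, not part of any existing translation file, and a street type field is not handled because the list above does not mention one.

```python
import re

# Assumed expansion table for STREET_P_1 values; extend it to match the source data.
DIRECTIONS = {
    "N": "North", "S": "South", "E": "East", "W": "West",
    "NE": "Northeast", "NW": "Northwest", "SE": "Southeast", "SW": "Southwest",
}

# Matches upper-case ordinals such as 1ST, 2ND, 3RD, 14TH.
ORDINAL = re.compile(r"^\d+(ST|ND|RD|TH)$")


def format_word(word):
    """Title-case one upper-case word, keeping ordinals numeric (3RD -> 3rd)."""
    if ORDINAL.match(word):
        return word.lower()
    return word.capitalize()


def format_street(prefix, name):
    """Build a street name from STREET_P_1 (prefix) and STREET_NAM (name)."""
    parts = []
    prefix = (prefix or "").strip().upper()
    if prefix:
        # Unknown prefixes are passed through unchanged rather than guessed at.
        parts.append(DIRECTIONS.get(prefix, prefix))
    parts.extend(format_word(w) for w in name.strip().upper().split())
    return " ".join(parts)


print(format_street("N", "3RD"))
print(format_street("", "JACOBS LADDER"))
```

Names with internal capitals (Mc..., O'...) would still need separate handling after this step.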
Any node with an addr:unit=* tag should have a fixme=* tag added. Many times there are multiple nodes placed on a building: one will not have the unit ID, and the others will. Many times the nodes with addr:unit=* are deleted.
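The unit handling could be sketched along these lines: expand UNITTYPE, pull the designator out of UNITID when it was entered there instead, and add a fixme whenever a unit is present. The UNIT_TYPES table lists only a few designators and the fixme wording is just a placeholder; both are assumptions to adjust against the real data.

```python
# Assumed abbreviation table; only a few common designators are listed.
UNIT_TYPES = {
    "APT": "Apartment",
    "STE": "Suite",
    "UNIT": "Unit",
    "BLDG": "Building",
    "OFC": "Office",
}

# Placeholder wording; any free-text note for the reviewer would do.
UNIT_FIXME = "Verify unit; several address points may share this building"


def unit_tags(unittype, unitid):
    """Return addr:unit and fixme tags built from UNITTYPE and UNITID."""
    unittype = (unittype or "").strip().upper()
    unitid = (unitid or "").strip()

    # Some records carry the designator inside UNITID, e.g. "STE 101".
    first, _, rest = unitid.partition(" ")
    if first.upper() in UNIT_TYPES and unittype in ("", first.upper()):
        unittype, unitid = first.upper(), rest.strip()

    # Unknown abbreviations are kept as-is so a human can review them.
    expanded = UNIT_TYPES.get(unittype, unittype)
    value = " ".join(part for part in (expanded, unitid) if part)
    if not value:
        return {}
    return {"addr:unit": value, "fixme": UNIT_FIXME}


print(unit_tags("APT", "3"))
print(unit_tags("", "STE 101"))
```

Whether the designator should stay in addr:unit at all is a separate decision; dropping the expanded part would be a one-line change.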
I was reminded today of the various populated areas that span 2 or more states (ex: Texarkana) as a good example of why including addr:state can be valuable.
A couple of additional observations about the source data:
It appears that the STREET_NAME itself may contain abbreviations rather than just the STREET_PREFIX and STREET_TYPE. For example, in some cases Mountain is abbreviated Mtn (but in other cases it is spelled out).
In some cases the STREET_NAME seems to be possessive, but is missing the apostrophe. For example, “JACOBS LADDER”: I am not sure whether there should be an apostrophe. I guess during the manual step you can check how the name of the associated street is spelled.
If a word in the STREET_NAME starts with MC, can we assume that the third letter should always be upper case, or do we have to handle every case individually, e.g. McDonald, McCrown, etc.?
Excellent observation and attention to detail. Many expansions skip this for fear of “guessing too much” but I think it ends up being cleaner looking. I ended up handling the few cases around me somewhat ad-hoc. My giant switch statement of cases is here.
As for possessive apostrophes, I add them manually when I see opportunities that are super clear but skip otherwise.
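For the MC question, one cautious option is a hand-checked exception table plus a helper that lists MC words nobody has reviewed yet, rather than a blanket rule. This is only a sketch of that idea and not the switch statement mentioned above; the table entries are the examples from this thread and the function names are made up.

```python
# Hypothetical exception table: names confirmed by hand, keyed by upper-case form.
NAME_EXCEPTIONS = {
    "MCDONALD": "McDonald",
    "MCCROWN": "McCrown",
}


def format_name_word(word):
    """Use the confirmed spelling if there is one, else plain capitalization."""
    key = word.upper()
    if key in NAME_EXCEPTIONS:
        return NAME_EXCEPTIONS[key]
    return key.capitalize()


def unreviewed_mc_words(street_names):
    """Collect MC-prefixed words not yet in the exception table, for manual review."""
    found = set()
    for name in street_names:
        for word in name.upper().split():
            if word.startswith("MC") and len(word) > 2 and word not in NAME_EXCEPTIONS:
                found.add(word)
    return sorted(found)


print(format_name_word("MCDONALD"))
print(unreviewed_mc_words(["MCDONALD RD", "MCKINLEY ST", "MAIN ST"]))
```

Running the second helper over the whole source file first gives a short list to check by hand before deciding on each spelling.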
I have some results! I used ChatGPT to format and filter my data (some may call it cheating, I call it using my resources). You can read over my entire chat here. Scroll all the way down to the last reply for a good summary of what I did, and the prompt I’ll try next time to make it quicker. I’m probably close to my use limit of 50 prompts every 3 hours right now.
I’ll be adding this to the wiki as well. I need to do some more work to break up the file into manageable portions. Maybe I’ll ask ChatGPT to sort by road name, then give me blocks of 250 addresses.
In JOSM, after conflating the address points, I use the review plugin with a little AutoHotkey script, so I can approve an edit and go to the next with a single button press.
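Splitting the file does not need a chatbot; a few lines of Python with the standard csv module can sort by road name and write blocks of 250 addresses. This is a sketch that assumes the converted data is a CSV with an addr:street column; the function name, the column name, and the output naming are all assumptions.

```python
import csv


def split_by_street(in_path, out_prefix, chunk_size=250, street_field="addr:street"):
    """Sort rows by street name and write them out in files of chunk_size rows."""
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        # sorted() is stable, so rows on the same street keep their original order.
        rows = sorted(reader, key=lambda row: row.get(street_field) or "")

    paths = []
    for start in range(0, len(rows), chunk_size):
        path = f"{out_prefix}_{start // chunk_size + 1:03d}.csv"
        with open(path, "w", newline="", encoding="utf-8") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows[start:start + chunk_size])
        paths.append(path)
    return paths
```

Calling split_by_street("addresses.csv", "block") would write block_001.csv, block_002.csv, and so on, where addresses.csv is a placeholder file name. A street can still straddle two blocks with this approach.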
Thank you, @tekim. My apologies for not examining the results in more detail prior to posting. I understand the problems you found and definitely need to fix them. I’ll do some more work and post results.
@pmfox No problem. Btw, since I am interested in adding addresses in my area, I started writing a Python script to do these types of conversions (borrowing from the work of others). It is nearly ready. I can send it to you if you are interested.
I am most certainly interested. While ChatGPT is useful, it’s not deterministic: the same prompt with the same info doesn’t get the same results. I guess sometimes that’s a feature and other times a flaw. A Python script would not be like that.
I would like to know this as well. For context, the data has both styles in it, probably half have the designator in front of the unit identifier, and the other half does not.
For a commercial building with Suites, I think retaining the Suite is desirable. 2654 Anyplace Way, Suite 101, Sometown, VA 99999 seems better than 2654 Anyplace Way, 101 (or #101), Sometown, VA 99999.
However in the case of a duplex house, it might be written out as 2654A (or 2654 A) Anyplace Way, Sometown, VA 99999.
The latter case would not use address line #2, as found on many forms where one enters an address, but would be entered entirely in address line #1.
My personal preference would be to leave the designator in place when provided by the data source. However, I do believe there needs to be consistency within one building. All units in one building must be formatted the same.
Almost all of the addr:unit that I have seen around the US just have the identifying portion (number or letters). Clicking through the values in the US taginfo page for this key seems to show that stripping identifiers down to just the number/letter is typical. That said, values like “Unit A” and “Apartment A” do show up in non-trivial quantities.
There’s also ~600 items with “addr:unit=Apartment” which probably warrants a follow-up. I’ll add that to my list.
It looks like any time “AND” appeared in the street name in the original data it got changed to “&”. For the cases I tested, USPS says that “and” is correct. I wouldn’t say that USPS is the ultimate authority on this, but given that the source data from VA and the USPS agree, “and” is probably correct.
The unit (OFF/Office) was removed from “481 Steeles Fort Road, Office, Raphine, VA 24472”. According to USPS, an address can have a unit type of “Office” without a following number. The original data contained “OFF” for the unit, which isn’t the official abbreviation for “Office” (it is “OFC”).
The unit (STO) was removed from “481 Steeles Fort Road, STO, Raphine, VA 24472”. “STO” is probably an error, but it probably needs to be replaced by something else, perhaps “Apartment B14” based on the units around it. For now, leaving “STO” and adding a fixme=* tag is probably the way to go. If you are local you might visit the complex, or call the local authorities or even the apartment management. Note that “STO” is not an accepted abbreviation according to the USPS. Perhaps it means “Stop” (as in Mail Stop), but that would require a following number/letter designator, which this doesn’t have.
It seems that if the original data contained “APT” immediately followed by a number or letter (no space), “APT” was not expanded to “Apartment”.
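Problems like these can be caught with a small audit pass over the translated tags before anything is uploaded. The sketch below flags addr:unit values that still start with an upper-case abbreviation and street names that contain “&”. The abbreviation list and the function name are assumptions, and the check is deliberately loose, so treat hits as items to review rather than confirmed errors.

```python
import re

# Assumed list of upper-case designators to look for; extend as needed.
# The negative lookahead skips expanded words such as "Office" or "Suite".
UNEXPANDED = re.compile(r"^(APT|STE|BLDG|OFC|OFF|STO)(?![a-z])")


def audit(tags):
    """Return a list of human-readable problems found in one set of tags."""
    problems = []
    unit = tags.get("addr:unit", "")
    if UNEXPANDED.match(unit):
        problems.append("unit designator not expanded: " + unit)
    if "&" in tags.get("addr:street", ""):
        problems.append("street contains '&'; source may have had 'AND'")
    return problems


print(audit({"addr:unit": "APT3", "addr:street": "Gravel Lane"}))
print(audit({"addr:unit": "Apartment 3", "addr:street": "Gravel Lane"}))
```

Units that were dropped entirely cannot be found this way; those need a comparison against the source record.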
addr:city=Lexington
addr:housenumber=143
addr:postcode=24450
addr:state=VA
addr:street=Gravel Lane
addr:unit=A
With all of that data it may not be obvious what happened, but I wanted to make sure that everyone had the full context. The ADDRNUMSUF in the original record became addr:unit in the translated record, when it should have been appended to ADDRNUM to create addr:housenumber. This is supported by the fact that FULLADDR=143A GRAVEL LN in the original record.
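A minimal sketch of that rule: append ADDRNUMSUF to ADDRNUM and use FULLADDR as a cross-check. The function name and the idea of returning a "confirmed" flag are my own assumptions; a record where the first token of FULLADDR does not match would be one to look at by hand.

```python
def build_housenumber(addrnum, suffix, fulladdr):
    """Return (housenumber, confirmed) from ADDRNUM, ADDRNUMSUF and FULLADDR.

    confirmed is True when the first token of FULLADDR equals the combined
    number, e.g. "143A" for FULLADDR "143A GRAVEL LN".
    """
    candidate = str(addrnum).strip() + (suffix or "").strip()
    tokens = (fulladdr or "").split()
    confirmed = bool(tokens) and tokens[0].upper() == candidate.upper()
    return candidate, confirmed


print(build_housenumber("143", "A", "143A GRAVEL LN"))
```

For the record above this yields addr:housenumber=143A with no addr:unit, which matches how FULLADDR writes the address.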
Does not appear in the output (1_Complete data in one file.csv). I am guessing that ChatGPT didn’t like the fact that STREET_NAME was actually the intersection of two highways.