Proposed import of Contra Costa County, CA Address points

CCC Address Import

Hello OSMUS and CCC mappers, I am proposing to import the Contra Costa County Addresses dataset, sourced from the Contra Costa County Department of Information Technology's GIS group.

Documentation

This is the wiki page for my import:
wiki.openstreetmap.org/wiki/Contra_Costa_County_Address_Import
This is the source dataset’s website:
https://www.contracosta.ca.gov/1818/GIS
The data download is available here:
https://gis.cccounty.us/Downloads/General%20County%20Data/CCC_Adddress_Points.zip
This is a file I have prepared which shows the data after it was translated to OSM schema:

License

Works by California governmental agencies should be in the public domain under state law. However, the GIS disclaimer contains the following line: NOTE: “THIS DATA CONTAINS COPYRIGHTED INFORMATION OF THE COUNTY OF CONTRA COSTA”, which is really confusing.
I emailed the county more than a week ago and have not heard back from them.
The data SHOULD be in the public domain.

Abstract

  • I came up with the idea after I noticed that no addresses exist in CCC
  • Zyphlar helped me a lot with preparing data
  • Writeup can be found in the git repo linked above
  • The dataset will be imported by Frigyes06_Import using JOSM
  • The formatted dataset is 97 megabytes and contains only address points
  • Conflation will be aided and carefully reviewed using the JOSM Conflation plugin
  • Data will be carefully validated by me, aided by the MapWithAI JOSM plugin

Thanks for the proposal, just a couple of quick comments (perhaps more later):

  • Some of the address points are actually outside the county by a significant distance.
  • I suggest the MapWithAI JOSM plugin, as it has a validator rule that checks whether the address matches a street that it is near (a rough standalone sketch of that kind of check follows this list).
  • Opinions differ, but some suggest adding addr:state=*
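
For anyone who wants to run that kind of check outside of JOSM, here is a minimal sketch in Python with geopandas. The file names, the addr:street property, and the roads extract's name column are all placeholder assumptions, not the actual import files:

```python
# Rough standalone version of a "does addr:street match a nearby road?" check.
# File and column names below are placeholders.
import geopandas as gpd

addresses = gpd.read_file("addresses.geojson").to_crs(epsg=26910)  # UTM 10N, metres
roads = gpd.read_file("roads.geojson").to_crs(epsg=26910)
roads = roads.rename(columns={"name": "road_name"})[["road_name", "geometry"]]

# For each address point, find the nearest road within 100 m.
joined = gpd.sjoin_nearest(addresses, roads, how="left",
                           max_distance=100, distance_col="dist_m")

def norm(s):
    return str(s).strip().lower()

# Flag points whose tagged street differs from the nearest road's name.
mismatch = joined[joined["road_name"].notna()
                  & (joined["addr:street"].map(norm) != joined["road_name"].map(norm))]
print(len(mismatch), "addresses whose addr:street differs from the nearest road")
```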

Good ideas!

  • Sometimes county borders have outliers, but I will check when conflating to make sure they are correct.
  • MapWithAI is a great idea, I’ll add it to the plan!
  • Why do opinions differ on addr:state? I think it’s a good addition.

The usual argument is that addr:state is redundant to the state boundary relation; a geocoder should be able to automatically infer that the feature lies within California. However, this is really an argument against is_in:state. addr:state is actually the state of the ZIP code, or the state of the post office that serves the delivery route, not of the delivery point itself. So many communities near state lines have a different addr:state than one would expect from checking if it’s in a certain state boundary.

Contra Costa County is far enough away from the Nevada state line that there’s no risk of this sort of spillover creating ambiguity. For this reason, some of the other imports in the Bay Area have also omitted addr:state. But I still add the tag anyway while mapping manually in the South Bay. It’s just customary to refer to a city and state together, like referring to “Main” and “Street” together.

addr:city is also mostly redundant to addr:postcode, but we still tag it most of the time for human-readability and to capture alternative postal city names that residents prefer. Just make sure that addr:city is actually the postal city, not whichever municipal boundary it happens to lie within. A telltale sign would be if addresses in unincorporated areas have a “city” like “Unincorporated”.


While “I am not a lawyer,” I 100% agree. The mumbo-jumbo above that line seems to be a release of liability, that the data are offered “as is.” But then, apart from the rest, in a paragraph (single line) of its own, and at the very end, we get that ALL CAPS line above. It is clear from both state law (California Government Code § 6254.9, the California Public Records Act) and rulings by the California Supreme Court (Sierra Club v. Superior Court (2013) Cal.4th, Case No. S194708) that these data are (must be, by law) public domain.

(And, a pet peeve of mine, Contra Costa County: the correct usage is “THESE DATA,” not “THIS DATA.”)

If you DO get an answer back from the County (from County Counsel, imo), it should be something like “oops, our bad… we can’t say such a thing here, but we likely did as an oversight.”

There are a small number of exceptions to “all state data are public domain” (like certain aspects of personnel records, the answer keys to tests for state licensing boards…), but GIS data are specifically denoted as being public domain (the cited court case from over a decade ago makes this quite clear).


I am looking more closely at these data.

In some cases addr:street has two values, separated by a semicolon:


There are 573 cases of this.
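
For what it’s worth, a quick way to count and inspect these before opening the file in JOSM might look like this (a sketch assuming the translated GeoJSON already uses OSM-style keys; the file name is a placeholder):

```python
# Count features whose addr:street contains a semicolon and show the most
# common offending name pairs. The file name is a placeholder.
import json
from collections import Counter

with open("ccc_addresses_translated.geojson") as f:
    features = json.load(f)["features"]

bad = [feat for feat in features
       if ";" in str(feat["properties"].get("addr:street", ""))]
print(len(bad), "features with a semicolon in addr:street")

pairs = Counter(feat["properties"]["addr:street"] for feat in bad)
for names, count in pairs.most_common(10):
    print(count, names)
```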

Two addresses less than a meter apart: same addr:housenumber, different addr:postcode, different addr:street. One is probably wrong.

One is in Sacramento, 35 miles away from the nearest bit of CCC:


No harm in including extra addresses I suppose, but if you have included extra addresses, have you also omitted addresses you intended to include?

One is addressed to Pleasanton (not in CCC), but is located 35 miles away in the San Joaquin River National Wildlife Refuge (also not in CCC, 25 miles from it):

Here is another one. Location is in the middle of a street. addr:postcode doesn’t match the addr:postcode of any other address nearby. USPS says address doesn’t even exist:

And another: addr:postcode doesn’t match the addr:postcode of any other address nearby. USPS says the address does not exist:

And another: addr:postcode doesn’t match the addr:postcode of any other addresses nearby. USPS says address does not exist.
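
One way to catch the out-of-county points ahead of the import is to test every point against a county boundary polygon. Here is a sketch with geopandas, under the assumption that a boundary file is available (both file names are placeholders); the stray-postcode cases are harder to automate, but could be approached similarly by comparing each point against its neighbours:

```python
# Flag address points that fall outside the county boundary.
# Both file names are placeholders.
import geopandas as gpd

addresses = gpd.read_file("ccc_addresses_translated.geojson")
county = gpd.read_file("contra_costa_boundary.geojson").to_crs(addresses.crs)

boundary = county.unary_union  # merge boundary features into one (multi)polygon
outside = addresses[~addresses.within(boundary)]
print(len(outside), "address points fall outside the county boundary")
outside.to_file("outside_county.geojson", driver="GeoJSON")
```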

Hi!
All the data massaging was done by Zyphlar, as documented in the git repo. Thanks for pointing out the flaws with the current data processing; I’ll correct them as soon as possible!
Also, if you have ready solutions, feel free to share them with me. I am just getting started with QGIS, so I might not know how to properly fix a data error!

As for the out-of-county addresses, I’ll probably have to go in and remove them manually, as they are out of scope (and sometimes incorrect).

Thanks for the feedback!

I can help, but not until this weekend. I have written conversion programs for address data in Virginia and Colorado which could be adapted.

Here is a partial list of errors I have found so far:

Lots of missing values and improperly formatted values.

Cool! Will look into it. Zyphlar and I are working on re-generating the GeoJSON based on the feedback provided. Also, some problems are caused by JOSM messing up the data. Most of the errors highlighted here would have been caught at import time, usually with automated tools (MapWithAI, etc.), but we should work on getting the data decently clean before that.

Could you elaborate? I am not aware of JOSM messing up geojson data.

People above were complaining about street names having two values separated by a “;”.
That is not present in the original GeoJSON, but gets introduced after opening it in JOSM.
Our guess is that JOSM merges points at identical coordinates.
The root cause is dirty county data (“W 7th Street” vs. “7th West Street”: two points, same place, different addresses). We’re working on a fix that will keep the name used on the ground.
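
If the merging really is coordinate-based, a pre-flight check along these lines could list every set of coincident points with conflicting street names before JOSM ever sees the file (a sketch; the file name and the rounding precision are assumptions):

```python
# Group features by rounded coordinates and report groups that disagree on addr:street.
import json
from collections import defaultdict

with open("ccc_addresses_translated.geojson") as f:
    features = json.load(f)["features"]

by_coord = defaultdict(list)
for feat in features:
    lon, lat = feat["geometry"]["coordinates"][:2]
    by_coord[(round(lon, 7), round(lat, 7))].append(feat["properties"])

for coord, props in by_coord.items():
    streets = {p.get("addr:street") for p in props}
    if len(props) > 1 and len(streets) > 1:
        print(coord, streets)
```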

I know y’all are working on the file and that is lovely. Just for a laugh, I ran the file through the JOSM validator, but with some extra rules I have for roadway name expansion and other address checks. The good news: it found some issues! The even better news: it’s all relatively easy to get patched up.

It’s probably easiest if you or Zyphlar run it locally but I’m happy to provide more specifics if that’s helpful.

The overall list (ignore the MapWithAI warnings; I had no roadways loaded, which makes it upset). Mostly expansion issues or stuff you’ve identified above:

Three things where it’s hard to reason about how they ended up like this. Maybe the source data also needs some scrubbing? Maybe there are some very oddly named streets?



Looks like a nice test, looks like a nice handoff!


Sorry, I’ve been behind. Schoolwork is crushing me. We’re currently looking into sanitizing the data more and then possibly splitting the import in the tasking manager into more manageable chunks.
Will keep you all posted!

I am busy with work, but have done a little more looking.

Yes, I think the source data is of very poor quality. @Frigyes06 and company: you have your work cut out for you. CCC also publishes parcel data, which presumably contains addresses, and that might be a better starting point.

Interesting, I wasn’t aware of this. You might produce a .osm file directly, as opposed to using GeoJSON. JOSM will complain, but then you can pull the points apart as appropriate, which should be a lot easier than dealing with semicolon-separated values. Making a .osm file of nodes is pretty easy.
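
A minimal sketch of that approach, writing the translated GeoJSON straight to a .osm file of new nodes (negative ids mark them as new objects for JOSM); the input file name and the property keys are assumptions:

```python
# Convert GeoJSON address points to a simple .osm file of new nodes.
import json
import xml.etree.ElementTree as ET

with open("ccc_addresses_translated.geojson") as f:
    features = json.load(f)["features"]

osm = ET.Element("osm", version="0.6", generator="ccc-address-prep")
for i, feat in enumerate(features, start=1):
    lon, lat = feat["geometry"]["coordinates"][:2]
    node = ET.SubElement(osm, "node", id=str(-i),
                         lat=f"{lat:.7f}", lon=f"{lon:.7f}")
    for key, value in feat["properties"].items():
        if value not in (None, ""):
            ET.SubElement(node, "tag", k=key, v=str(value))

ET.ElementTree(osm).write("ccc_addresses.osm",
                          encoding="utf-8", xml_declaration=True)
```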

Excellent idea!

IDK, there are a lot of null values, such as for house numbers. I am not sure if those can easily be filled in. Perhaps match them to the parcel data, if that is any better?

Here are some other things that I am noticing (a rough cleanup sketch follows this list):

  • Repeated words, e.g. “Laurentia Way Way”
  • Non-printable characters
  • Directionals not always expanded, e.g. N E S W
  • Street types not always expanded, e.g. PKWY
  • Unit types not always expanded, e.g. STE
  • Capitalization special cases, e.g. Macarthur should probably be MacArthur
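
A rough cleanup pass covering some of the items above might look like the following. The expansion tables are only illustrative, the blanket expansion of single letters would need guarding against cases like “Avenue E”, and capitalization special cases (MacArthur) would still need a manual list:

```python
# Collapse repeated words, strip non-printable characters, and expand a few
# common abbreviations. The tables here are illustrative, not complete.
import re

DIRECTIONS = {"N": "North", "E": "East", "S": "South", "W": "West"}
STREET_TYPES = {"PKWY": "Parkway", "AVE": "Avenue", "BLVD": "Boulevard", "CT": "Court"}
UNIT_TYPES = {"STE": "Suite", "APT": "Apartment"}
EXPANSIONS = {**DIRECTIONS, **STREET_TYPES, **UNIT_TYPES}

def clean_street(name: str) -> str:
    name = re.sub(r"[^ -~]", "", name)                          # drop non-printables
    name = re.sub(r"\b(\w+)( \1\b)+", r"\1", name, flags=re.I)  # "Way Way" -> "Way"
    return " ".join(EXPANSIONS.get(w.upper(), w) for w in name.split())

print(clean_street("Laurentia Way Way"))  # Laurentia Way
print(clean_street("N Main PKWY"))        # North Main Parkway
```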

That is a very good idea considering the amount of cleanup that is likely to be necessary. You could also consider a MapRoulette cooperative challenge. Another idea is to split the tasks by street, or groups of streets, as opposed to arbitrary rectangular areas.

@Frigyes06 definitely no rush. The map will be here.

@tekim fortunately the roadway data usually provides a nice cross check on the expansions and capitalization… but it does save a lot of typing if we can catch stuff like that in the processing steps!

Yes, which is a good argument for grouping tasks by street (or streets if there are not many addresses on a single street).

However, it is going to be difficult to fix missing housenumbers, bogus housenumbers, duplicate addresses (1270 sets of duplicates), etc. Perhaps referencing the parcel data may help?
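
Counting the duplicate sets is at least easy to automate; here is a sketch that groups features by housenumber, street, and city (the file name is a placeholder):

```python
# Count exact duplicate addresses (same housenumber + street + city).
import json
from collections import Counter

with open("ccc_addresses_translated.geojson") as f:
    features = json.load(f)["features"]

def key(props):
    return tuple(str(props.get(k, "")).strip().lower()
                 for k in ("addr:housenumber", "addr:street", "addr:city"))

counts = Counter(key(feat["properties"]) for feat in features)
dupes = {k: n for k, n in counts.items() if n > 1 and any(k)}
print(len(dupes), "duplicate address sets")
```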

It is also difficult to detect things like non-printable characters in JOSM (maybe your validator catches these?). I only noticed because the indentation in the printout changed.

…and gives the community a chance to review the actual data that will be uploaded to OSM prior to the upload.

I think JOSM catches some set of weird characters but it’s pretty easy to add some more into the checks I have. Are there any in particular from this file? Happy to add something to the repo.


Here is the regex that I am using: `[^ -~]`
In this file I have seen \n and \r.
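
Applied to the parsed GeoJSON rather than the raw file (so the line breaks between features are not flagged), that character class could be used like this; note that it also flags legitimate non-ASCII such as accented names, so hits need review rather than blind stripping (the file name is a placeholder):

```python
# Report property values containing characters outside printable ASCII.
import json
import re

pattern = re.compile(r"[^ -~]")  # same class as above; catches \n, \r, tabs, etc.

with open("ccc_addresses_translated.geojson", encoding="utf-8") as f:
    features = json.load(f)["features"]

for i, feat in enumerate(features):
    for key, value in feat["properties"].items():
        if pattern.search(str(value)):
            print(f"feature {i}: {key} = {value!r}")
```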
