Bottom Line Up Front
In my opinion, this dataset is not suitable for import into OSM, even with manual review. It will be too easy for someone to just “click through” the MR challenge as a lot of locations will look reasonable, but will not necessarily be correct, and thus a significant amount of incorrect data could be rapidly injected into OSM.
Process
I downloaded the source data from the HIFLD website as a shapefile. In the rest of this post I will refer to this as the “complete source data.” I used QGIS to select all of the records where STATE=’FL’, and then Export → Save Selected Features as… (shapefile). I will refer to this as the “FL source data.” I also downloaded the GeoJSON file of converted/translated data that James linked to (thanks for providing that). I will refer to this as the “converted data.”
Source Data Quality
Original Geocoding Precision
In the FL source data, only 39 of 6,418 features are geocoded as “ONENTITY” (GEOPREC=ONENTITY), that is, on the actual facility, as we require in OSM. The balance are geocoded as “BLOCK FACE”, that is, only to the correct block and the correct side of the street. This is as claimed by the data provider. So, even by the provider’s own account, less than one percent of the features in the FL source data have a location precise enough for OSM, and we can’t assume that the provider got the right block and side of the street for the rest.
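For anyone who wants to repeat this tally outside QGIS, here is a minimal sketch of the count. It assumes the attribute rows have already been read into a list of dicts; the stand-in rows below only reproduce the totals quoted above, and the spelling "BLOCKFACE" for the second value is my assumption, not checked against the file.

```python
from collections import Counter

def geoprec_share(records, field="GEOPREC"):
    """Return {value: (count, percent of all records)} for one attribute."""
    counts = Counter(rec.get(field) for rec in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: (n, 100.0 * n / total) for value, n in counts.items()}

# Stand-in rows that reproduce the counts quoted in the post (39 of 6,418).
# "BLOCKFACE" is an assumed spelling of the block-face value.
records = [{"GEOPREC": "ONENTITY"}] * 39 + [{"GEOPREC": "BLOCKFACE"}] * (6418 - 39)

share = geoprec_share(records)
count, percent = share["ONENTITY"]
print(f"ONENTITY: {count} features ({percent:.2f}% of {len(records)})")
```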
Difficult to Determine or Verify Location
So in the FL source data, 99% of the features are not located with the precision we require in OSM, and the person doing the import would therefore have to somehow determine where each of these features should be placed. In addition, the remaining features would still need their locations verified. Without a reference address layer or street level imagery (which may not always be helpful), this will be very difficult. Some locations may appear reasonable but actually be that of another business.
Missing House Numbers
In the FL source data, 17 features do not have a house number as part of their street address. For these it would be impossible to determine the correct location using automated methods, or even manual methods, short of an in-person visit or consulting street level imagery (which would only help if the manufacturing business has a sign visible in the imagery).
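A rough sketch of how such records could be flagged in a script follows. The pattern is my own heuristic, not the query I ran in QGIS: it treats an address as having a house number only if it starts with digits (optionally a letter suffix or a hyphenated range) followed by a space and more text, so that a street such as "1ST STREET" is not mistaken for a house number. The sample addresses other than the Curv-a-Tech one are made up.

```python
import re

# Leading digits, optional letter suffix, optional "-digits" range,
# then whitespace and at least one more character (the street name).
HOUSE_NUMBER = re.compile(r"^\s*\d+[A-Za-z]?(?:-\d+[A-Za-z]?)?\s+\S")

def lacks_house_number(address):
    """True if the address string is empty or does not start with a house number."""
    if not address:
        return True
    return HOUSE_NUMBER.match(address) is None

# Illustrative addresses; only the first comes from the post.
samples = ["930-940 WEST 23RD STREET", "INDUSTRIAL PARK ROAD", "1ST STREET", "12B MAIN ST"]
flagged = [a for a in samples if lacks_house_number(a)]
print(flagged)
```

A heuristic like this will still miss odd cases (PO boxes, suite-only addresses), so anything it flags or passes would need a manual look.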
Data is Old
I can find no evidence that the complete source data has been updated since it was published in 2009, and the data for Florida may be even older, as the dataset-level metadata states:
“Eighteen (18) states have been updated in this delivery [i.e. the delivery of the data in 2009]: Alaska, Arizona, Hawaii, Idaho, Massachusetts, Missouri, Nevada, New Hampshire, New York, Ohio, Oklahoma, Oregon, South Carolina, South Dakota, Tennessee, Utah, Wisconsin, and Wyoming. In addition to American Samoa, Guam, and the Commonwealth of the Northern Mariana Islands, two (2) US territories have been added to the dataset from the 2009 D1 of 2 update: Puerto Rico, and US Virgin Islands.”
Note that Florida is not mentioned, implying that its last update was even earlier!
The metadata goes on to state:
“This totals 48,930 companies.”
If I query for those states and territories in QGIS I get 48,927, which would mean a net change of only three features since 2009. Given the large shift of manufacturing out of the US, this seems unlikely in reality. In fact, I suspect that the decrease of three features in the listed states and territories is due to data corruption, as the complete source data has seven records where most of the fields are NULL (which is itself a red flag).
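To make that query reproducible, here is a sketch of the same count in Python. It assumes the STATE field holds two-letter postal codes for the territories as well as the states (I only verified this for 'FL'), and that the rows are available as a list of dicts; the toy rows are placeholders, not real data.

```python
# Postal codes for the 18 states and 5 territories named in the metadata.
UPDATED_2009 = {
    "AK", "AZ", "HI", "ID", "MA", "MO", "NV", "NH", "NY",
    "OH", "OK", "OR", "SC", "SD", "TN", "UT", "WI", "WY",
    "AS", "GU", "MP", "PR", "VI",
}

METADATA_TOTAL = 48930  # "This totals 48,930 companies."

def count_updated(records, field="STATE"):
    """Count records whose state code is in the 2009 update list."""
    return sum(1 for rec in records if rec.get(field) in UPDATED_2009)

# Toy rows only; the real input would be the complete source data.
toy = [{"STATE": "NY"}, {"STATE": "FL"}, {"STATE": "PR"}, {"STATE": None}]
found = count_updated(toy)
print(f"{found} matching records; metadata claims {METADATA_TOTAL}")
```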
“Duplicate Features” - Approx same location, different names and addresses.
I have not had a chance to write a query to identify all such cases, but I did notice several. For example, Mako Millwork, Inc. and Aircraft Tubular Components are located only 10 meters apart. While not impossible, this seems unlikely.
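A query for this could look something like the sketch below. It is not something I have run against the data: it simply compares every pair of points using the haversine distance and reports differently named pairs within a threshold. The input format (name, lat, lon) and the coordinates are placeholders. The brute-force comparison is quadratic, so for the roughly 6,400 Florida records it means about 20 million pairs, which is slow but workable; a spatial index would be the better tool for the complete source data.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def close_pairs(features, max_m=10.0):
    """Return name pairs for differently named features within max_m meters.

    features: list of (name, lat, lon) tuples.
    """
    return [
        (a[0], b[0])
        for a, b in combinations(features, 2)
        if a[0] != b[0] and haversine_m(a[1], a[2], b[1], b[2]) <= max_m
    ]

# Placeholder points: A and B are about 5.5 m apart, C is far away.
sample = [("A", 28.0, -81.0), ("B", 28.00005, -81.0), ("C", 29.0, -81.0)]
print(close_pairs(sample))
```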
Data conversion issues
Loss of records upon translation to OSM
The FL source data has 6,418 records, but the converted data only has 5,905. What accounts for the discrepancy?
Records merged
In 83 cases, two (or more) businesses were combined into a single record in the converted data. Each business should have its own record (and feature in OSM), even if it shares an address with another business. To find these, load the converted data into JOSM and search with the following expression: "name"~"^.*;.*$"
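The same check can be scripted against the GeoJSON directly, which is handy for re-running it after each conversion fix. This is a minimal sketch that assumes the converted data is a standard FeatureCollection with the OSM tags stored under each feature's "properties"; the sample names are invented.

```python
import json

def merged_names(geojson_text):
    """Return the name of every feature whose name contains a semicolon."""
    data = json.loads(geojson_text)
    hits = []
    for feature in data.get("features", []):
        props = feature.get("properties") or {}
        name = props.get("name") or ""
        if ";" in name:
            hits.append(name)
    return hits

# Invented sample; the real input would be the converted data file.
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None, "properties": {"name": "Alpha Co.;Beta Inc."}},
        {"type": "Feature", "geometry": None, "properties": {"name": "Gamma Corporation"}},
        {"type": "Feature", "geometry": None, "properties": {}},
    ],
})
print(merged_names(sample))
```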
Abbreviations
Abbreviations in the name tag may need to be expanded, e.g. Div. => Division, Inc. => Incorporated, Co. => Company, etc. Some of this has already been done, but some abbreviations remain. There may be exceptions to the rule of abbreviation expansion; consult the community.
Records that have address ranges for house numbers
In the converted data, it seems the second half of the address range was removed in the conversion. E.g., in the FL source data, Curv-a-Tech Corporation, 930-940 WEST 23RD STREET
Became:
addr:housenumber=930
(and no addr:street tag)
The same happens to other house numbers with hyphens: the house number is cut off before the hyphen.
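As an illustration of a split that keeps both the full range and the street, here is a small sketch. It is not the converter's actual code (I have not seen it), and whether a range such as 930-940 belongs in a single addr:housenumber value is something for the community to decide. Street name case and abbreviation handling are left out on purpose.

```python
import re

# House number: digits with optional letter suffix, optionally followed by
# a hyphenated second number. Everything after the next space is the street.
ADDRESS = re.compile(
    r"^\s*(?P<number>\d+[A-Za-z]?(?:\s*-\s*\d+[A-Za-z]?)?)\s+(?P<street>\S.*?)\s*$"
)

def split_address(address):
    """Split a source address into addr:housenumber and addr:street tags.

    Returns an empty dict when no leading house number is found.
    """
    match = ADDRESS.match(address or "")
    if match is None:
        return {}
    number = re.sub(r"\s*-\s*", "-", match["number"])  # normalize "930 - 940"
    return {"addr:housenumber": number, "addr:street": match["street"]}

print(split_address("930-940 WEST 23RD STREET"))
```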
Missing addr:street tags
In the converted data, 53 features do not have “addr:street” tags, even though this information is present in the FL source data. Use the JOSM search string:
-("addr:street":)
to find these.
Address2 field appears to have been dropped
Consider using addr:unit