Proposed import of The HIFLD General Manufacturing Facilities dataset

As I understand … “Pelican Seafoods.” -> “Yakobi Fisheries is a family owned & operated, based out of Pelican, AK, ” (Yakobi Fisheries: “Family owned & operated in Pelican, Alaska. We buy wild fish from our friends and neighbors.”)

“Seth stopped fishing full time, and founded Yakobi Fisheries in 2010. Around the same time, Pelican Seafoods, which had been operating at full capacity for many years, closed its doors.”

“They were soon able to expand, and in 2015, made the move to our current facility, The Old Crab Plant, previously operated by Pelican Seafoods.”

So the “Pelican Seafoods.” database record was valid as of 2009.

comments:

  • Perhaps the web domain verification could be semi-automated (Python script?), to check if

    • the website is accessible
    • and there’s no text on it like ~ “domain for sale”
  • The data reminds me of the “Poultry Import” request from a year ago; I think it’s worth rereading that thread as well … (The Imports April 2022 Archive by thread), especially since my example from then, “Mar-lees Seafood, Inc.”, is also among the current data…
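
A minimal sketch of the semi-automated check suggested in the first bullet, using only the Python standard library. The phrase list in `PARKED_PATTERNS` and the function names are my own assumptions, not something from the dataset; the patterns would need tuning against real parked pages.

```python
import http.client
import re
import urllib.request

# Hypothetical phrase list; extend it after looking at real parked pages.
PARKED_PATTERNS = [
    r"domain\s+(name\s+)?(is\s+)?for\s+sale",
    r"buy\s+this\s+domain",
    r"this\s+domain\s+(is\s+)?parked",
]
PARKED_RE = re.compile("|".join(PARKED_PATTERNS), re.IGNORECASE)


def looks_parked(page_text):
    """Return True if the text contains a 'domain for sale' style phrase."""
    return bool(PARKED_RE.search(page_text))


def check_website(url, timeout=10):
    """Classify a URL as 'unreachable', 'parked' or 'ok' (needs network access)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Only the first part of the page is read; parked pages are small.
            text = response.read(200_000).decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return "unreachable"
    return "parked" if looks_parked(text) else "ok"
```

A record classified as "ok" is still only a candidate: the check cannot tell whether the site belongs to the same business as in 2009.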

checking Florida proposal : generalmanufacturing-florida.geojson

invalid housenumbers:

                "addr:housenumber": "FL",
                "addr:housenumber": "MCFARLAND",
                "addr:housenumber": "NORTH",
                "addr:housenumber": "OLD",
                "addr:housenumber": "STATE",
                "addr:housenumber": "US",

strange ‘;’ in the city name

                "addr:city": "Baldwin;Jacksonville",
                "addr:city": "Fort Lauderdale;Davie",
                "addr:city": "Hialeah;Miami Lakes",
                "addr:city": "Medley;Miami",
                "addr:city": "Miami Lakes;Miami",
                "addr:city": "Opa Locka;Miami",
                "addr:city": "Riviera Beach;West Palm Beach",

checking the website links - with “fping”, probably 1/3 need a double check …

   2700 is alive
    759 is unreachable
    547 Name or service not known

Strange streetnames

  "addr:street": "Street",
  "addr:street": "Nw",
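
The three kinds of oddities listed above (house numbers without digits, ';' in the city, one-word street names) could be flagged automatically before anything reaches MapRoulette. This is only a sketch; `suspicious_tags` is a made-up name, and the "single word" street rule is an assumed heuristic that will also flag legitimate names such as "Broadway", so the output is a review list, not a reject list.

```python
import re


def suspicious_tags(props):
    """Return (key, reason) pairs for address tags that look wrong."""
    problems = []

    housenumber = props.get("addr:housenumber")
    if housenumber is not None and not re.search(r"\d", housenumber):
        problems.append(("addr:housenumber", "no digit"))

    city = props.get("addr:city")
    if city is not None and ";" in city:
        problems.append(("addr:city", "multiple values"))

    street = props.get("addr:street")
    # Assumed heuristic: a usable street name has at least two words.
    if street is not None and len(street.split()) < 2:
        problems.append(("addr:street", "single word"))

    return problems
```

Applied to the `properties` of each GeoJSON feature, this returns an empty list for records that pass all three checks.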

@SherbetS this proposal is still scoped to Florida, right? I think most of the issues raised so far will probably apply in Florida too, but I just wanted to confirm so we can focus on examples that are more directly relevant to the import.

I’ve never seen anything like this done. I’m imagining something like showing the raw PRODUCT field, as well as the NAICS description, and asking the user to determine a reasonable product value for OSM.

This is something I’ve seen discussed frequently recently, and I’d love to see another thread discussing this topic.

Good sleuthing! Is there any way to use these results to purge entries from the pool of candidate data?

Yes. Previously I pulled two examples from a random location in QGIS to show the issues present with the PRODUCT key. The current scope is Florida, to cut down on size and make it easier to focus on any specific issues that appear, and make it easier if we decide to try and import nationwide.

this is a sample of “Name or service not known” websites

  • ( based on checking from Hungary/Europe )
www.us-gf.com: Name or service not known
www.usibm.com: Name or service not known
www.usmileandmill.com: Name or service not known
www.trutwinind.com: Name or service not known
www.ussamplecorp.com: Name or service not known
www.vandeeservices.com: Name or service not known
www.velocitechmfg.com: Name or service not known
www.vikingcoachworks.com: Name or service not known
www.wakullatrusses.com: Name or service not known
www.wallinnovators.com: Name or service not known
www.wekiwaconcrete.com: Name or service not known
www.westcoastmold.com: Name or service not known
www.westonfoods.com: Name or service not known
www.whitehallprinting.com: Name or service not known
www.willgarrett.com: Name or service not known
www.wittbiomedical.com: Name or service not known
www.worldpub.net: Name or service not known
www.wpi-interconnect.com: Name or service not known
www.wyliedynamics.com: Name or service not known
www.zf-marine.com: Name or service not known
www.zmanscreen.com: Name or service not known

What I really wanted to signal is that if we generalize from the quality of the domain-named data to all the data, my estimate is that ~40-50% of the data is already completely outdated.

  • 1/3 of the domain names are inaccessible,
  • and on some of the accessible domain names you can read “domain is for sale”,
  • or the owner has completely changed since…

And even if we clean up the domain-named data, a similar (at least 40-50%) error rate is to be expected for the records without a domain name; unfortunately, these errors can’t be detected from the armchair and will erroneously enter the database.

Unfortunately, judging from the quality of the domain names, these are still largely data from 2009, despite being updated a few weeks ago.

So, a decision needs to be made about what “known error rate” is acceptable.

proposal:

  • One way to ensure data quality could be to make calling the phone number and verifying the data based on that a part of the MapRoulette task.

Yes, essentially. The idea of a preliminary, crowdsourced tagfest comes up once in a while, for example when cleaning up tree species tagging. On a smaller scale, for the Silicon Valley point of interest import, each MapRoulette task has instructions that suggest likely POI tags by official business category, based on a series of workshops the local community held over Zoom beforehand.

@watmildon is proposing a spreadsheet-based tagfest for the full set of NAICS codes, which would be broadly useful and not specific to product=*. You could decide to piggyback off of that effort or start a separate spreadsheet limited to just the PRODUCT values found in Florida.

Alternatively, you could fold this step into the MapRoulette challenge, asking mappers to come up with the product=* themselves on a case-by-case basis according to PRODUCT and NAICS. This would allow you to start sooner, but it would impose an extra burden on mappers (making the challenge less likely to reach completion) and introduce potential inconsistency.

I’m not sure that a typical mapper, settling into MapRoulette for some light editing at the end of the day, would be inclined to do cold calls on behalf of OSM, just to check whether there’s a matching voicemail greeting. But it’s an interesting idea.

Re-tested today:

My test method:

# generate website links ..
cat generalmanufacturing-florida.geojson | grep '"website":' | cut -d'"' -f4 | sed 's/;/ /g' | sort -u > fping_links.txt

# test links with fping ( ~ 20 min .. 60 min ) 
time fping --file=fping_links.txt --stats 2>&1 | tee fping_result.txt
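
For anyone who prefers not to rely on grep/cut matching the exact layout of the file, the link list can also be built by parsing the GeoJSON. A rough Python equivalent of the first command, assuming the `website` values are bare host names as in my samples (`website_hosts` is a name I made up):

```python
import json


def website_hosts(geojson_text):
    """Return the sorted, de-duplicated 'website' values of all features."""
    collection = json.loads(geojson_text)
    hosts = set()
    for feature in collection.get("features", []):
        value = (feature.get("properties") or {}).get("website") or ""
        # A tag may hold several values separated by ';'.
        for part in value.split(";"):
            part = part.strip()
            if part:
                hosts.add(part)
    return sorted(hosts)
```

Writing the returned list to fping_links.txt, one host per line, gives the input for the fping run.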

Results:

.....

    3416 targets
    2670 alive
     747 unreachable
     537 unknown addresses

     747 timeouts (waiting for response)
    5683 ICMP Echos sent
    2670 ICMP Echo Replies received
      65 other ICMP received

 0.10 ms (min round trip time)
 90.0 ms (avg round trip time)
 453 ms (max round trip time)
       59.883 sec (elapsed real time)
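
To turn the per-host lines in fping_result.txt into the share of links that need a double check, something like the following sketch could be used. The function names are hypothetical; the three line endings are the ones fping printed in my runs.

```python
from collections import Counter


def tally_fping(lines):
    """Count alive / unreachable / unknown hosts in fping per-host output."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.endswith(" is alive"):
            counts["alive"] += 1
        elif line.endswith(" is unreachable"):
            counts["unreachable"] += 1
        elif line.endswith("Name or service not known"):
            counts["unknown"] += 1
    return counts


def failing_share(counts):
    """Share of hosts that are unreachable or do not resolve."""
    failing = counts["unreachable"] + counts["unknown"]
    total = counts["alive"] + failing
    return failing / total if total else 0.0
```

With the counts above (2670 alive, 747 unreachable, 537 unknown) the failing share comes out at about 32%, which is where the "1/3" comes from.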

comment:

Just because it is claimed that the dataset was updated a few weeks ago does not mean that every record was reviewed for accuracy at that time. The maintainers of the data could have only added, deleted, or edited one record, or none at all (the update may have concerned only reformatting the data). In fact, in the metadata, the most recent “Process step:” (dated 2017-09-18), only involves changing the formatting of the data, for example, filling in “NOT AVAILABLE” for blank entries (an action that arguably adds no value).

In my opinion, even with individual manual review, we should not be “importing” data from 2009. There is too much risk that incorrect data will make its way into the database.

~2000 of the sites in this portion (FL) of the HIFLD dataset do not respond to HTTP requests. The list is here: HIFLD_FL_Dead_Link.txt · GitHub. Probably good to filter those out.
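
Filtering could look roughly like this. It is a sketch that assumes the dead-link list has one host or URL per line and that each feature carries a single `website` value; `normalize_site` and `drop_dead_websites` are illustrative names, and the optional `require_website` switch also drops features that have no website at all.

```python
def normalize_site(site):
    """Lower-case a host or URL and strip the scheme and trailing slash."""
    site = site.strip().lower()
    for prefix in ("http://", "https://"):
        if site.startswith(prefix):
            site = site[len(prefix):]
    return site.rstrip("/")


def drop_dead_websites(features, dead_sites, require_website=False):
    """Return the features whose website is not in the dead-link list."""
    dead = {normalize_site(site) for site in dead_sites}
    kept = []
    for feature in features:
        website = (feature.get("properties") or {}).get("website")
        if not website:
            if not require_website:
                kept.append(feature)
            continue
        if normalize_site(website) not in dead:
            kept.append(feature)
    return kept
```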

Given the large currently known error rate,
those without website information should also be filtered out.

So, if a decision is made that the import efforts are worth it
… then only the POIs with currently active website information
would be put into MapRoulette tasks.

extra ideas

1.) The registration date of active websites could be queried,
and if it is later than 2010, the record is suspicious.
For example, this website is in the FL request and is reachable: www.monierlifetile.com

  • but according to ICANN Lookup it is: “Created: 2021-01-25 19:24:38 UTC”.
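
The registration date can be read programmatically from RDAP, the JSON successor of WHOIS. A rough sketch follows; it assumes the registry returns a "registration" event in the record's `events` list, and it uses the public rdap.org redirector, which I have not tested against every TLD in the dataset. The function names are made up, and the domain must be passed without the "www." prefix.

```python
import json
import urllib.request
from datetime import datetime


def registration_date(rdap_record):
    """Return the 'registration' event date of an RDAP domain record, or None."""
    for event in rdap_record.get("events", []):
        if event.get("eventAction") == "registration":
            # fromisoformat() in older Python versions does not accept 'Z'.
            return datetime.fromisoformat(event["eventDate"].replace("Z", "+00:00"))
    return None


def registered_after(rdap_record, year=2010):
    """True if the domain was registered after the given year (suspicious)."""
    created = registration_date(rdap_record)
    return created is not None and created.year > year


def fetch_rdap(domain):
    """Fetch the RDAP record for a domain (needs network access)."""
    url = f"https://rdap.org/domain/{domain}"
    with urllib.request.urlopen(url, timeout=15) as response:
        return json.load(response)
```

A late registration date only marks the record for a closer look; a company may simply have moved to a new domain.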

2.) MapRoulette: I would separate those with a Note tag from those without. My assumption is that the ‘Note’ information is given for a reason, because the address is not clear - and additional information is needed, but this free-text information cannot be taken into account by geocoders. Therefore, I would put those with a Note into a different group with a higher difficulty level.

"note": "FROM HIGHWAY 19 TURN WEST ONTO US 98 AND TRAVEL APPROXIMATELY 20 MILES ON SOUTH SIDE",
"note": "FROM STATE ROUTE 89 TURN EAST ONTO US HIGHWAY 90 AND TRAVEL APPROXIMATELY 12 MILES ON SOUTH SIDE",
"note": "APPROXIMATELY 9 MILES EAST OF AVON PARK ON EAST SIDE OF STATE ROAD 64",
"note": "APPROXIMATELY 2 MILES SOUTH OF COUNTY ROAD 827 ON THE WEST SIDE OF SOUTH US HIGHWAY 27",
"note": "FROM SE 144TH STREET TURN SOUTH ONTO US HIGHWAY 301 AND TRAVEL APPROXIMATELY 3 MILES ON WEST SIDE",
"note": "FROM INTERSTATE 0 EXTENSION 74 TURN EAST ONTO SAND LAKE ROAD AND TRAVEL APPROXIMATELY 4 MILES ON SOUTH SIDE",
"note": "FROM US 441 TURN NW ONTO NW COUNTY ROAD 25A AND TRAVEL APPROXIMATELY 2 MILES ON WEST SIDE",
"note": "APPROXIMATELY 25 MILES SOUTH FROM BEACH BOULEVARD ON EAST SIDE OF SAINT JOHNS BLUFF ROAD",
"note": "FROM INTERSTATE 275 TURN NORTH ONTO US HIGHWAY 41 AND TRAVEL APPROXIMATELY 3 MILES ON EAST SIDE",

Bottom Line Up Front

In my opinion, this dataset is not suitable for import into OSM, even with manual review. It will be too easy for someone to just “click through” the MR challenge as a lot of locations will look reasonable, but will not necessarily be correct, and thus a significant amount of incorrect data could be rapidly injected into OSM.

Process

I downloaded the source data from the HIFLD website as a shapefile. In the rest of this post I will refer to this as the “complete source data.” I used QGIS to select all of the records where STATE=’FL’, and then Export → Save Selected Features as… (shapefile). I will refer to this as the “FL source data.” I also downloaded the GeoJSON file of converted/translated data that James linked to (thanks for providing that). I will refer to this as the “converted data.”

Source Data Quality

Original Geocoding Precision

In the FL source data, only 39 of 6418 features are located “ONENTITY” (GEOPREC=ONENTITY), that is, on the actual facility (as we require in OSM); the balance are located “BLOCK FACE”, that is, only on the correct side of the street and in the correct block. This is as claimed by the data provider. So, even by the provider of the FL source data, less than one percent of the features have a location precise enough for OSM, and we can’t assume that they got the right block and side of the street.

Difficult to Determine or Verify Location

So in the FL source data we see that 99% of the features are not located with the precision we require within OSM, and therefore, during import, the person doing the import would have to somehow determine where these features should be placed. In addition, the balance of the data will have to have their location verified. Without a reference address layer, or street level imagery (which may not always be helpful), it will be very difficult to do this. Some locations may appear to be reasonable, but the location may be that of another business.

Missing House Numbers

In the FL source data, 17 features do not have a house number as part of their street address, and thus it would be impossible to determine the correct location using automated methods, or even manual methods short of either an in-person visit or consulting street level imagery (which would only be helpful if the manufacturing business has a sign that shows in such imagery).

Data is Old

I can find no evidence that the complete source data has been updated since it was published in 2009, and the data for Florida may be even older, as the dataset level metadata states:

“Eighteen (18) states have been updated in this delivery [referring to the delivery of the data in 2009]: Alaska, Arizona, Hawaii, Idaho, Massachusetts, Missouri, Nevada, New Hampshire, New York, Ohio, Oklahoma, Oregon, South Carolina, South Dakota, Tennessee, Utah, Wisconsin, and Wyoming. In addition to American Samoa, Guam, and the Commonwealth of the Northern Mariana Islands, two (2) US territories have been added to the dataset from the 2009 D1 of 2 update: Puerto Rico, and US Virgin Islands.”

Note that Florida is not mentioned, implying that its last update was even earlier!
The metadata goes on to state:

“This totals 48,930 companies.”

If I query for those states and territories in QGIS I get 48,927, which would mean that there has only been a net change of three features since 2009. Given the vast shift of manufacturing to places other than the US, this seems unlikely in reality. In fact I suspect that the decrease in three features in the mentioned states and territories is due to data corruption as the complete source data has seven records where most of the fields are NULL (which itself is a red flag).

“Duplicate Features” - Approx same location, different names and addresses.

I have not had a chance to write a query to identify all such cases, but I did notice a number of cases, for example, Mako Millwork, Inc. and Aircraft Tubular Components are located only 10 meters apart. While not impossible, this seems unlikely.
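
A query for such cases could be sketched as follows, outside QGIS. This is an untested-on-the-real-data sketch: the 15 m threshold is an arbitrary assumption, the function names are mine, and the grid cell of 0.001 degrees (roughly 100 m in Florida) only works while the threshold stays well below the cell size.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6371000.0


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def close_pairs(points, max_m=15.0, cell=0.001):
    """points: list of (name, lon, lat). Return (name_a, name_b, meters)
    for differently named points that lie within max_m of each other."""
    grid = defaultdict(list)
    for index, (_, lon, lat) in enumerate(points):
        grid[(math.floor(lon / cell), math.floor(lat / cell))].append(index)

    pairs = []
    for (cx, cy), members in grid.items():
        # Compare against the same cell and its eight neighbors.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    for i in members:
                        if i >= j:
                            continue  # report each pair only once
                        name_a, lon_a, lat_a = points[i]
                        name_b, lon_b, lat_b = points[j]
                        meters = haversine_m(lon_a, lat_a, lon_b, lat_b)
                        if meters <= max_m and name_a != name_b:
                            pairs.append((name_a, name_b, meters))
    return pairs
```

Each reported pair still needs a human look, since two businesses can legitimately share a building.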

Data conversion issues

Loss of records upon translation to OSM

The FL source data has 6,418 records, but the converted data only has 5,905. What accounts for the discrepancy?

Records merged

In 83 cases, two (or more) businesses were combined into a single record in the converted data. Each business should have its own record (and feature in OSM), even if they share an address with another business. To find these, load the converted data into JOSM and search with the following expression: "name"~"^.*;.*$"

Abbreviations

Abbreviations in the name tag may need to be expanded, e.g. Div. => Division, Inc. => Incorporated, Co. => Company, etc. Some of this has already been done, but some abbreviations remain. There may be exceptions to the rule of abbreviation expansion; consult the community.

Records that have address ranges for house numbers

In the converted data, it seems that the second half of the address range was removed in the conversion. For example, in the FL source data, Curv-a-Tech Corporation, 930-940 WEST 23RD STREET
became:
addr:housenumber=930
(and no addr:street tag)
Likewise, house numbers containing a hyphen were cut off before the hyphen.
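
One way the conversion script could keep the range and the street together is sketched below. I have not seen the actual script, so this is not a patch for it; `split_address` and the regular expression are my own illustration. Whether a range such as 930-940 belongs in addr:housenumber as-is is a tagging question for the community.

```python
import re

# A number, optionally followed by a letter, optionally followed by
# "-<number>", then whitespace and the rest of the address.
ADDRESS_RE = re.compile(
    r"^\s*(\d+[A-Za-z]?(?:\s*-\s*\d+[A-Za-z]?)?)\s+(.+?)\s*$"
)


def split_address(address):
    """Split 'NUMBER[-NUMBER] STREET' into (housenumber, street).

    Returns (None, address) when no leading house number is found.
    """
    match = ADDRESS_RE.match(address)
    if not match:
        return None, address.strip()
    housenumber = re.sub(r"\s*-\s*", "-", match.group(1))
    return housenumber, match.group(2)
```

For the example above, `split_address("930-940 WEST 23RD STREET")` keeps the full range as the house number and "WEST 23RD STREET" as the street.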

Missing addr:street tags

In the converted data, 53 features do not have “addr:street” tags, even though this information is present in the FL source data. Use the JOSM search string:
-("addr:street":) to find these.

Address2 field appears to have been dropped

Consider using addr:unit

This is a great idea!

Unfortunately, I have found cases where the content of the “DIRECTIONS” field (which became the note tag) seems to have been written to fit the geocoded location, so while it may offer a clue as to the actual location, I don’t think it can be relied upon.

Given these concerns, I wonder if it might be worth taking a step back and looking for an alternative dataset specific to Florida. Unlike a lot of states, works by most agencies of the Florida state government are in the public domain under state law, and the few agencies that are permitted to copyright their works have open data portals anyways or have even attempted to contribute data to OSM directly. If such a dataset can be found, it won’t be as satisfying as using Florida as a trial run for an HIFLD import throughout the rest of the country, but it might result in better data in this state and serve as a template for state-level imports elsewhere.

In general, I think that looking at state level data is a better option than HIFLD, especially for entities that the states operate or regulate. The data is likely to be much more current than HIFLD. We will still have to take a critical look at the data. Also, I am assuming that the translation issues I pointed out could also be present with another dataset as a script was probably used. Having said all of that, I took a quick look at the Florida Geospatial Open Data portal (https://geodata.floridagio.gov/), and didn’t see anything for manufacturing per se.

Thanks for the input Mike and everyone else. I will not be importing this data. I’ll leave a note on the wiki.

Manufacturing seems to only have been compiled in earnest in this national dataset, hosted on the HIFLD; I haven’t found anywhere else that records this sort of data. It would be really nice if it was a little more up to date than 2009.

James, thanks for considering the input of the community. I can tell you put a lot of work into this proposal - thanks for doing that. Having a clear, solid proposal made providing feedback easier.

I have to say I am impressed by the analysis done by the rest of the community, such as the work @ImreSamu and @watmildon did with pinging the websites for these businesses.

I am not aware of any other nationwide database of manufacturing facilities that has a license compatible with OSM - but that doesn’t mean it doesn’t exist. However, let’s think about what state and Federal agencies might regulate manufacturing. For example, some manufacturers will be regulated by the EPA (because they emit, or potentially could emit, pollutants). Perhaps we could get a listing from one or more of these agencies of the facilities they regulate. If the dataset doesn’t have coordinates, but does have addresses, we could potentially use addresses from https://openaddresses.io/ to geocode the records (need to check the terms of use/license of the address data). I know I have been critical of automatic geocoding in this thread, but if we have a high quality point address database (as opposed to just address ranges on street segments), and we don’t make assumptions like “James Street is the same as James Road” (automatic geocoders do this, and other crazy things, sometimes to force a match), I think we can get results that can be fed into a MR challenge. I can help with the geocoding if needed.
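
To make the "no forced matches" idea concrete, here is a minimal sketch of exact-match geocoding against a point address table. The field names (`number`, `street`, `city`, `lon`, `lat`) are assumptions about how the address data would be loaded, not a documented schema, and the function names are made up. Only whitespace and letter case are normalized, so "James Street" never matches "James Road".

```python
def normalize_address(number, street, city):
    """Normalize case and whitespace only; no abbreviation or fuzzy logic."""
    return (
        number.strip().upper(),
        " ".join(street.upper().split()),
        " ".join(city.upper().split()),
    )


def build_index(address_points):
    """Map normalized addresses to (lon, lat); drop ambiguous addresses."""
    index = {}
    ambiguous = set()
    for point in address_points:
        key = normalize_address(point["number"], point["street"], point["city"])
        if key in index:
            ambiguous.add(key)
        index[key] = (point["lon"], point["lat"])
    for key in ambiguous:
        # Two points with the same address: leave it to a human.
        del index[key]
    return index


def geocode_exact(record, index):
    """Return (lon, lat) for an exact address match, else None."""
    key = normalize_address(record["number"], record["street"], record["city"])
    return index.get(key)
```

Records that come back as None would go to manual review instead of being placed on the nearest look-alike street.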

I’m glad you mentioned EPA! The HIFLD has a couple more datasets from the EPA, see:
https://wiki.openstreetmap.org/wiki/HIFLD/Chemicals

If you are interested in these data, why not go directly to the EPA for an up to date copy? I took a look at the “EPA Emergency Response (ER) Toxic Substances Control Act (TSCA) Facilities” dataset, and it is at least six years old according to the metadata, and depending on what HIFLD did during the last update, many records may actually be from 2014.