Proposed import of The HIFLD General Manufacturing Facilities dataset

HIFLD General Manufacturing (Florida)

Hello, I am proposing to import the General Manufacturing Facilities dataset, sourced from HIFLD, in the state of Florida.

Documentation

This is the wiki page for my import:
https://wiki.openstreetmap.org/wiki/HIFLD/Commercial
This is the source dataset’s website:

The data download is available here:

This is a file I have prepared which shows the data after it was translated to OSM schema:
https://cloud.disroot.org/s/tEeaZwF9seB4Y44/download/generalmanufacturing-florida.geojson

License

I have checked that this data is compatible with the ODbL.
This data is distributed under public domain.

Abstract

The General Manufacturing Facilities dataset provides data about factories and headquarters in the US. The dataset has 162,185 objects last I checked, but the current plan is to start by adding the data only in Florida, which totals 5,905 objects. I want to set up a maproulette task where users can review individual objects, and add to OSM after cleaning any issues.

The tag translations as listed on the wiki:

HIFLD Tag OSM Tag
man_made=works
NAME name
PHONE phone
ADDRESS addr:*
CITY addr:city
STATE addr:state
ZIP addr:postcode
PRODUCT product
WEB website
DIRECTIONS note

The Changeset tags as listed on the wiki:

Key Value
comment Importing Facility from HIFLD General Manufacturing Facilities
import yes
source HIFLD
source:url General Manufacturing Facilities
source:date 2021-03-15
import:page HIFLD/Commercial - OpenStreetMap Wiki
source:license PD

No conflation process has been done on the dataset, contributors will be expected to review if the object already exists, and add metadata or not import the object if already present.
These are the instructions I plan on giving on the MR project:

1: Visit the website, if one is provided, or sleuth using the internet to verify if the company is still operating
1: Verify address is correct, and the street is correctly formatted to match the name of the street on OSM.
3: Verify the object is in the correct location on the correct facility
4: Verify if the object is a factory or other type of facility, and change tags appropriately.

Don’t try to guess where a facility is located. If it’s not possible to figure it out, mark the task as too hard.

Please take a look at the GeoJSON provided and share your thoughts.

Thanks,
–James Crawford

1 Like

Is there need to process product value? Or is it somehow compatible with osm values already?

the product value is provided in specific and descriptive terms on the database, and it fits pretty well already with OSM’s ATYL for that key. It comes in all caps by default, but it was easy just to lowercase it all and add semicolons using QGIS.

1 Like

Couple of quick notes on first review, targeted at keeping the MR tasks as simple as possible (less typing etc etc). I’m also happy to help with some of the work I’ve described below. Writing a few little programs to do some legwork is always nice.

  • website should probably have https:// added to the fronts
  • websites can be checked via script to see if they are still alive (404 usually means the thing isn’t operating any more), with either a note or remove if not responsive
  • product= should have spaces replaced with ‘_’, maybe also ‘-’?
  • the most common things in product could be checked against taginfo values to see if they are way way off
  • leaving the location info in the note= field is great. The instructions should probably say to remove it before upload.
  • I am happy to help out and do the conflation described here if you think that could be of use watmildon's Diary | Using HIFLD dataset in JOSM to find unmapped Hospitals in the United States | OpenStreetMap

Love getting more of this cross-checked into the db.

While it isn’t difficult to add a changeset tag in most editors, it is fairly inconvenient to add six different tags for every MapRoulette task you complete. Most likely mappers will flub one or several of these tags on any given changeset. One workaround would be to provide all these tags in plain text as a list of key=value pairs that the mapper can paste into the tag editor’s text mode in iD.

product=* isn’t a freeform key that happens to be use snake_case: the values are supposed to be keywords, subject to OSM naming conventions including British English. Some values in this dataset would require some additional attention, because the PRODUCT field is descriptive rather than a particular ontology. For example:

PRODUCT SIC NAICS 2017 OSM product
BAKED & PACKAGED COOKIES 2051 311812 biscuits
BISCUITS 2051 311812 quick_bread?
FLASHLIGHTS 3648 335129 torches?
BASEBALL & FOOTBALL HELMETS 3949 339920 baseball_helmet;american_football_helmet?
DRIVEWAY & ROOF COATINGS 2952 324121 driveway2_coating? :wink:

As you can see, this can get into some unfortunate dialectal drama between American and British English. The dataset does include a Standard Industrial Classification code in the SIC field and a North American Industry Classification System 2017 code in the NAICS field that would be more useful for coming up with precise tagging. So one possible approach would be to stick to more general values of product, based on the NAICS parent industry (omitting one or two digits from the end of the code), then clarify the dataset’s precise intent using industry:sic_code or industry:naics_code. If we really want to capture the level of detail in PRODUCT, I think it probably belongs in description.

SIC is deprecated in favor of NAICS, which is also used by Canada and Mexico. A newer edition of NAICS came out last year with some differences. For example, 335129 is gone; the FLASHLIGHTS above has been renumbered 335139. I would expect a future version of this dataset to use NAICS 2022 codes instead. You can then use these tables to translate NAICS codes to likely OSM tags.

The NAICS correspondence tables have a lot of gaps in them. This effort could be a good opportunity to identify and explore these gaps and consider tagging that everyone can benefit from, not just in North America. NAICS is designed to be largely compatible with the UN’s ISIC standard, which is already being tagged as industry:isic_code.

For the curious here’s the products field, broken down by counts for each type of product: HIFLDProductsCount.tsv · GitHub

are there any tools presently out there to translate NAICS tags into OSM product values? I remember seeing a wiki page on this topic, but no code.

This table translates from NAICS 2017 to OSM tags (not just product=* but a number of keys). I also posted it to Gist as a TSV file back in 2018, but it doesn’t have the modifications you or I made to the wiki table since.

I think the GLOBALID field is the unique identifier for records in this data set. That field should be preserved in a ref:US:hifld tag in every HIFLD feature that is imported into OSM so that we can tell if the feature was previously imported into OSM from HIFLD or if the feature may be out of date because the record in HIFLD is newer.

I know it sounds like this is information OSM doesn’t need, but from experience working with GNIS features, having this data makes a huge difference when you come back to something that was imported ten years ago and the source data set has changed a lot.

2 Likes

Thanks for documenting this. I will do some analysis over the weekend, but I think this data is largely from 2009. Also I found several cases where two features are within a couple of meters of each other, but with different addresses. -Mike

2 Likes

I actually would go against trying to record this on the HIFLD, since in some cases, they’ve pulled from other datasets in some hosting arrangement, and their database numbers might not align with whoever’s upstream.

I was just talking with @watmildon, and have found issues with both the NAICS approach to determining product=, as well as using the PRODUCT= key to determine the value.

On database object 76685, the raw PRODUCT key reads:

MEAT PROCESSING & PACKING

The translation into OSM produced:

meat processing;packing

The NAICS code for this object was 311612:

MEAT PROCESSED FROM CARCASSES

Which would be enough to determine to the same level of detail for product= that the PRODUCT key could determine.

On database object 76736, the raw PRODUCT key reads:

VINYL-LINED SWIMMING POOLS, SPAS & ACCESSORIES

The translation into OSM produced:

vinyl-lined swimming pools;spas;accessories

But the NAICS value for this object was 339920:

SPORTING AND ATHLETIC GOODS MANUFACTURING

which is too general of a description to determine the actual output of the facility.
This might have to be something we pass onto the MapRoulette editors to determine the actual product value.

To be clear, the NAICS 2017 table was just my best guess with bleary eyes and a bit too much coffee-substitute. Definitely feel free to correct anything that’s blatantly wrong.

In terms of detail, NAICS has a bias for agriculture, wholesale, and manufacturing over retail and professional services, and within manufacturing it has a bias for heavy industry over light industry. I assume they just devote more detail to larger industries than smaller industries. This is why they separate and combine industries with every revision.

In my opinion, product=sport wouldn’t be entirely unreasonable, even if it could describe anything from bowling balls to baseball caps. But if we take the more literal approach, vinyl-lined_swimming_pools would be extremely specific while accessories would be so generic as to be useless. If someone were mapping this business from a field survey, they probably wouldn’t mention vinyl lining in product. On the other hand, an SEO specialist probably would mention the vinyl in description.

Sentry Scuba and Sentry Pool at the end of a strip mall

Generally speaking, how much precision do we expect in the product key? What are the anticipated use cases for this key that would benefit from greater specificity? Would it be a problem if a facility also manufactures related products that we’re only able to describe as “et cetera”?

Tagging aside, do you know if this example is representative of the coordinate precision and accuracy in the dataset? The dataset has this pool manufacturing facility some 500 feet away from the actual facility (which actually appears to be a pool showroom and attached scuba training center). This imprecision is not unreasonable for a dataset of this scale, but I’ve previously seen MapRoulette users get very confused by datasets with a similar precision.

If this example is representative, then step 3 of the instructions should be upfront that the coordinate is just a starting point. It should explicitly recommend looking for the building with a matching street address or, if the area hasn’t had an address import, stepping through street-level imagery to find the building.

A MapRoulette challenge is more likely to reach completion if each task requires fewer repetitive steps. What if beforehand we stick the 7,000 unique values in a collaborative spreadsheet and encourage people to contribute to a translation there? (Maybe we clean them up, split them by commas and & and unique them again to reduce the number of rows.) At a glance, the PRODUCT values are descriptive enough that I don’t think I would need street-level imagery to translate a given value into decent OSM tagging, but it couldn’t hurt to include a sample coordinate in each row.

2 Likes

That may be the case, but I’d encourage you to give the long term maintenance of this data some thought.

1 Like

It’s much easier to blow away the ref: field later if we decide it’s not useful than it is to include it after the fact.

1 Like

Yes, as I mentioned in a similar thread on the OSMUS Slack, I think it should be a requirement to keep a mapping between source identifiers and OSM object identifiers – but that mapping can be external, if that seems easier or more useful than keeping it on the OSM objects. Traceability is a Good Thing.

I think the rule of thumb should be based on whether we expect a future revision of this dataset – or a similar replacement dataset – to carry object IDs that are consistent with this dataset’s object IDs. We would only want to popularize a ref:HIFLD key if it would have some utility for reference or conflation as that key’s name would suggest.

On the one extreme is the automatically generated OBJECTID field in every ArcGIS dataset, which definitely carries no meaning outside of the dataset. On the other extreme is a GNIS Feature ID or an FAA antenna registration number, which is used for cross-referencing real datasets in the real world; the usefulness of these external IDs is not hypothetical or aspirational by any means.

In the middle, we have something like tiger:tlid, which was eventually removed, because having TIGER/Line IDs in the database didn’t really help us conflate later versions of TIGER any more than geometry- and name-based heuristics would. After all, neither the TIGER Roads overlay nor the TIGER Battlegrid tool have ever used TIGER/Line IDs to flag missing roads. That said, there are inherently different strategies for conflating line features (as in TIGER) and point features (as in this HIFLD dataset or any address point dataset). External IDs are a lot more practical for point features, while heuristics-based conflation is often easier with line features.

Unfortunately, this HIFLD dataset doesn’t contain meaningful IDs. The values in the OBJECTID field will most likely be 100% different if HIFLD publishes a new dataset with this same information. Meanwhile, you’d think the UNIQUE_ID field would contain something useful, but it’s 100% nulls.

2 Likes

Just to be clear, I don’t think a spreadsheet-based step should block a MapRoulette challenge. You could have this step feed into a MapRoulette challenge on a rolling basis: wait until some number of PRODUCTs get translated to OSM tags (500?), then release the features with those PRODUCTs to MapRoulette.

1 Like

I immediately found data that is located in water … which is strange … even if it’s related to the sea (‘PELICAN SEAFOODS’);
( this might be because the shoreline is at a different place on the ESRI background map …)

But I think we should initially filter out those data where the POI is placed in the sea or water.

( link )

Indeed, the coordinates are in the water, but there does appear to be an industrial building by a dock about a quarter mile to the southeast, in Pelican Harbor. That could plausibly be a seafood processing plant.

This kind of error could very easily occur because the data source took street addresses and geocoded them based on TIGER address interpolation ranges. Compounding the issue, the dataset only specifies latitudes and longitudes to the thousandth of a degree, which is not as much precision as we typically work with in OSM.

In this case, up in Alaska, a remote mapper won’t have any street-level imagery to definitively verify the location, as I did with the pool manufacturer above. However, in general, I think we should allow mappers to verify these POIs by looking around, rather than automatically discounting them as invalid based on the raw location.

2 Likes