Rethinking the import catalog

Continuing the discussion from OSMF Strategy 2023:

Indeed, the import catalog likely isn’t comprehensive, even though it’s listed as one of the steps in the import guidelines. The page currently lists 518 proposed, ongoing, and completed imports, but the category of U.S. import proposals alone has over 400 import writeups, not counting the writeups for other countries.

We could scour the import category and its subcategories for unlisted import proposals, but I think our time would be better spent decentralizing the notion of an import catalog in favor of the individual proposal pages. It would be less effort to ensure that each of the table’s entries has a wiki page with all the same information, as the guidelines already require, and then update the guidelines to remove the bit about the catalog.

In the long run, a less centralized catalog would be easier to maintain. For one thing, we wouldn’t have to fret over the fact that some people put new entries at the top of the table and others at the bottom. :scream: If there’s any benefit to listing, say, all the Creative Commons Attribution–licensed imports or all the 2019 imports in one place, we can create categories for them, and an infobox template could automatically sort the pages into the right categories. This is how we’ve long organized tagging proposals.

Does anyone know of a legal or logistical reason why these specific details about each import proposal need to be duplicated in a single massive wiki table? Note that a different page contains the attribution legally required by some data sources.

3 Likes

I’m the one who made the recent massive changes to that article, finally tackling how messy it had gotten… it was bad. Now that everything is tidy, it would be quite easy to use automated preprocessing to convert entries into templates.

My next phase was going to be converting all the table entries into templates, with TemplateData parameters to warn about missing fields and to support the visual editor.

This would hopefully make adding new entries painless.

This would be done using the same methods that Wikipedia uses, turning each table into its own subpage template.
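As a very rough sketch of the preprocessing, something like this could turn one row of the existing table into a template call. The template name and parameter names below are placeholders, not the real markup; the actual ones would come out of the TemplateData work.

```typescript
// Sketch: convert one catalog table row into a template invocation.
// "Import catalog entry" and the parameter names are placeholders.

interface CatalogRow {
  name: string;    // import name, typically a wiki link to the proposal page
  region: string;  // country or region covered
  source: string;  // upstream data source
  license: string; // license or permission status
  status: string;  // proposed / ongoing / completed
  date: string;    // date of the proposal or import
}

function rowToTemplate(row: CatalogRow): string {
  // One parameter per line keeps the wikitext diff-friendly and easy for
  // the visual editor to pick apart.
  return [
    "{{Import catalog entry",
    `| name    = ${row.name}`,
    `| region  = ${row.region}`,
    `| source  = ${row.source}`,
    `| license = ${row.license}`,
    `| status  = ${row.status}`,
    `| date    = ${row.date}`,
    "}}",
  ].join("\n");
}

// Example: one row as it might appear in the current table (invented values).
console.log(rowToTemplate({
  name: "[[Example Building Import]]",
  region: "United States",
  source: "Example County GIS",
  license: "Public domain",
  status: "Completed",
  date: "2019-05",
}));
```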

Should I continue with this or just stop?

This never mattered anyway; the tables can be sorted by date. :slight_smile:

1 Like

I did notice that the table is a lot cleaner than before – I appreciate the attention you’re giving it.

I think templatizing the table could make it easier to add entries to it, as long as we take care not to break the visual editor. There is a risk that, eventually, there might be enough imports to list that the page would exceed the template transclusion limit, just as the old calendar template did before we migrated it to an OSMCal widget.

Anyhow, the issue I’m confronting is not so much that it’s difficult to add entries to the table, but rather that we don’t have a robust way of keeping it in sync with the actual wiki pages. It isn’t clear to me that the difficulty of editing that table was the reason it wound up missing so many entries. At least categories could be automated by adding a (yet to be written) infobox template up top.

4 Likes

Given that everybody is digging up old grievances, my turn.

The way we manage import metadata has never been fit for purpose.

The OSMF is the entity publishing the data and carrying the legal burden of adhering to any terms associated with use of the data. At a minimum, the OSMF should be able to easily determine the where, when, source, and terms of any data import.*

In the case of non-standard licences and special permissions, both the OSM community contact and the data source’s contact need to be identified and available with their full current addresses, and a good-quality scan of the original permission/terms document needs to be filed with the OSMF.

All this works against a distributed system; in fact, it implies that the wiki is completely unsuitable for documenting imports.

* A side note: having easy access to this information would help OSM contributors too.

Fair points, though I don’t think any of this necessarily precludes devolving the metadata to individual wiki pages. After all, no one is proposing that import metadata be moved to hundreds of arbitrary GitHub repositories.

Let’s suppose there’s a need to track down the permission related to a suspected import of buildings and addresses in Cupertino, California. In theory, one could simply search for “Cupertino” in the import catalog and follow the link to the import proposal, but there is no such link. Instead, one needs to search the wiki or look for it in the California import category. One could cast aspersions on this import, eight years on, for not following the formality of “registering” it in the catalog, but the importer otherwise covered their bases, including by posting the permission e-mail.

Not every import proposal is so well written. Many import pages lack one or more important pieces of information, especially the import’s status. Part of my proposal is to develop a standardized template containing all the information from the catalog and add it to the top of each of the import pages. This would make the information just as searchable through categories or properly formatted search terms. In the future, we could develop a boilerplate (a “prefill” in MediaWiki lingo) that would walk importers through the considerations in this list to fill out the rest of the page in a consistent format.
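To make “properly formatted search terms” a bit more concrete, a query along these lines could list every page that uses the standardized template within a given category. The template and category names are placeholders, and it assumes the wiki’s search supports CirrusSearch keywords such as hastemplate: and incategory:, as on Wikipedia.

```typescript
// Hypothetical query against the MediaWiki search API: pages that use the
// (placeholder) "Import proposal" template and sit in a given category.

const WIKI_API = "https://wiki.openstreetmap.org/w/api.php";

async function findImportPages(category: string): Promise<string[]> {
  const params = new URLSearchParams({
    action: "query",
    list: "search",
    srsearch: `hastemplate:"Import proposal" incategory:"${category}"`,
    srlimit: "50",
    format: "json",
    origin: "*", // anonymous cross-origin access to the MediaWiki API
  });
  const response = await fetch(`${WIKI_API}?${params}`);
  const data = await response.json();
  return data.query.search.map((hit: { title: string }) => hit.title);
}

// e.g. every cataloged import touching California (placeholder category name)
findImportPages("California imports").then(console.log);
```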

An overreliance on the import catalog has already created some avoidable conflicts. For example, a volunteer mapper started working on a conventional import of buildings and addresses in Lee County, Florida. They apparently thought it necessary to dot their i’s and cross their t’s before “officially” announcing their proposal on the mailing list and catalog. In the meantime, Esri came forward with their own boilerplate proposal for importing the same dataset piecemeal through RapiD, apparently unaware of the other proposal because it had not been cataloged yet. Thus began a slow process of getting the two to talk and not stomp on each other’s edits. In the end, Esri won by preempting the conventional import, which would’ve been comprehensive, with scattershot RapiD edits.

If we were to move all this information to the OSMF’s official site (which is technically a wiki), that’s fine I guess, but I’d still contend that a single massive wiki table would be a poor choice compared to individual searchable pages. Also, a nontrivial share of these proposals are just that, proposals, and aren’t particularly noteworthy from an OSMF perspective.

3 Likes

My thinking would be more: zoom/pan to the place in question on openstreetmap.org, click on “Show data sources”, and get a list of used and potential sources for the area in question, with privacy-related details naturally hidden. Obviously other kinds of searches should be possible. Added bonus: generate attribution pages (which are a disaster in and of themselves) automatically / on the fly.

See also Google Summer of Code/2017/Project Ideas - OpenStreetMap Wiki

At one point I prototyped part of the functionality, i.e. adding a source that an account was responsible for and generating a list; no big deal, just boring stuff with a fair bit of business logic thrown in.

1 Like

I think that’s a fantastic idea, the logical next step beyond the dynamic contributors widget that iD has in the corner. It would probably align more closely with what users and data sources expect when they think of attribution. At the same time, osm.org isn’t nearly the only way that people experience OSM data, so there probably does have to be something less dynamic as a fallback. I’d be happy if the fallback could just be a little more wieldy than it is today, but I fully admit to thinking small.

Just noodling on what this could look like: as you alluded to, a Git repository structured like editor-layer-index or name-suggestion-index or osm-community-index could contain GeoJSON polygons tagged with basic metadata, including an attribution string and the URL of the full import proposal on the OSM Wiki.
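For instance, one entry in that index might look something like the sketch below. Every field name and value is invented for illustration; the real schema would need to be hashed out in the repository itself.

```typescript
// Sketch of one index entry: a GeoJSON feature whose properties carry the
// attribution string and a link back to the wiki proposal. All values are
// hypothetical.

interface ImportIndexFeature {
  type: "Feature";
  properties: {
    id: string;          // stable identifier for the import
    name: string;        // human-readable name
    attribution: string; // text the data source requires to be shown
    proposal: string;    // URL of the full proposal on the OSM Wiki
    status: "proposed" | "ongoing" | "completed" | "abandoned";
    start?: string;      // ISO date, if known
  };
  geometry: {
    type: "Polygon";
    coordinates: number[][][]; // [lon, lat] rings
  };
}

const example: ImportIndexFeature = {
  type: "Feature",
  properties: {
    id: "us-ca-example-buildings",
    name: "Example building and address import",
    attribution: "Building data courtesy of Example City",
    proposal: "https://wiki.openstreetmap.org/wiki/Example_Building_Import", // hypothetical page title
    status: "completed",
    start: "2015-06-01",
  },
  geometry: {
    type: "Polygon",
    coordinates: [[[-122.09, 37.28], [-121.99, 37.28],
                   [-121.99, 37.34], [-122.09, 37.34],
                   [-122.09, 37.28]]],
  },
};
```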

A website, such as the OSM website, could fetch the GeoJSON on demand (just as iD does from the aforementioned repos) to determine the relevant polygon as the user browses the map. This could keep the effort limited to the client side and avoid needing to work on the backend API, although that might eventually become necessary once there are enough imports to catalog.
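A bare-bones sketch of that client-side lookup, with a placeholder index URL and a hand-rolled ray-casting containment test standing in for a proper geometry library:

```typescript
// Fetch the index and find every import polygon that contains the current
// map centre. Only the outer ring is tested; holes are ignored for brevity.

const INDEX_URL =
  "https://raw.githubusercontent.com/example/import-index/main/imports.geojson"; // placeholder

type IndexFeature = {
  properties: { name: string; attribution: string; proposal: string };
  geometry: { type: "Polygon"; coordinates: number[][][] }; // [lon, lat] rings
};

function pointInRing(lon: number, lat: number, ring: number[][]): boolean {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    if ((yi > lat) !== (yj > lat) &&
        lon < ((xj - xi) * (lat - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

async function importsAt(lon: number, lat: number): Promise<IndexFeature[]> {
  const response = await fetch(INDEX_URL);
  const collection = await response.json();
  return collection.features.filter((feature: IndexFeature) =>
    pointInRing(lon, lat, feature.geometry.coordinates[0])
  );
}

// e.g. list what has been imported around a given map centre
importsAt(-122.03, 37.32).then((imports) =>
  imports.forEach((i) => console.log(i.properties.name, i.properties.attribution))
);
```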

The proposals would continue to be categorized on the wiki for maintenance purposes, but the import catalog as we know it would be pretty much redundant. In this way, the original idea I tossed out would complement the broader, user-facing idea.

1 Like

The GeoJSON-file-on-GitHub approach won’t work because of the privacy/confidentiality aspect of some of the information that would need to be included.

I realize that this is a bit of a departure from the dump-everything-on-the-wiki approach, but for exactly those privacy reasons, the stuff there tends to be heavily redacted, which is a significant downside once you need to get hold of the involved parties.

If any of the information needs to be confidential, it shouldn’t be shown to end users anyway, right? I was thinking of the GeoJSON files as an index into what’s on the wiki, but the same field that stores the URL to a wiki page could just as well be a URL to somewhere else, or the wiki page could link out to somewhere more secure.

More concretely, are you referring to the longstanding practice of pasting the correspondent’s full e-mail response with headers into the wiki page? Some proposals redact details to protect privacy, but it’s really inconsistent. It would be nice to instead use OTRS or its replacement for that purpose, then just include the ticket number in the proposal page and the proposed repository.

Yes and no. Having the contact information and a non-redacted version of the permission documents is important, but we should further be recording who (as in real name, etc.) was the contact on the OSM side of things, and keeping both of these up to date as far as possible.

I know the WMF maintains the permissions in OTRS (which is a very weird choice), and I could argue that the correspondence with third parties on data access should really be logged in a CRM system, but that would be a very radical departure from what we do now, with all kinds of consequences that we probably don’t want. So I would be happy with the basics: source and OSM contact, a copy of the licence at the time the data was retrieved (or a copy of the permission document), bounding box/polygon, required attribution text, and any required OSM tagging.
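Just to pin those basics down as a record, here is one possible reading of that list; the field names and example values are made up and don’t reflect any agreed format.

```typescript
// One possible shape for the per-import record described above.

interface ImportRecord {
  source: {
    organisation: string;
    contact: string;            // current contact at the data source
  };
  osmContact: string;           // real name of the contact on the OSM side
  licence: {
    retrievedOn: string;        // date the data was retrieved (ISO format)
    text: string;               // licence text as it stood at that time,
                                // or a reference to the permission document
  };
  coverage: number[][][];       // bounding box or polygon, [lon, lat] rings
  requiredAttribution: string;  // attribution text the source requires
  requiredTags?: Record<string, string>; // any OSM tagging the source requires
}

// Invented example values, purely to show the fields in use.
const example: ImportRecord = {
  source: { organisation: "Example Agency", contact: "gis@example.org" },
  osmContact: "Jane Mapper",
  licence: { retrievedOn: "2023-01-15", text: "CC BY 4.0 (copy on file)" },
  coverage: [[[8.5, 47.3], [8.6, 47.3], [8.6, 47.4], [8.5, 47.4], [8.5, 47.3]]],
  requiredAttribution: "Data © Example Agency",
  requiredTags: { source: "Example Agency" },
};
```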

Just to set expectations, wherever this information lives, getting it to a high degree of comprehensiveness would require a lot of research. For example, it took me a while to document this bus stop import, despite the anomalous license tagging, because the only documentation existed on the imports mailing list. Unfortunately, this would be a larger project than I’d be able to take on at the moment, but hopefully someone else has the right combination of interest in intellectual property rights and Web archaeology.