External Datasets that would add value to the map

This is intended as a brainstorming thread, not a proposal to import any specific dataset.

As map consumers, what would people find of the highest value that is currently incomplete or difficult to achieve with human-scale best efforts?

Examples off the top of my head of things we have already come across:

  • openaddresses.io - (Addresses) It’s a complementary dataset; discussed many times; not really suitable to import, as per OpenAddresses.io - OpenStreetMap Wiki; but it solves navigating from A to B.
  • alltheplaces.xyz - (Businesses/opening hours/contact attributes). Also discussed a number of times; with careful guardrails, some of the data is useful when coupled with human review. Solves a bit of “where is an X?” and “when can I go to X?”.
  • Building footprints - e.g. Microsoft/Rapid. The main use cases are 3D maps and disaster response (e.g. HOT mapping buildings for the Ebola response, etc.)
  • LIDAR measurements of building height; much less widely available, but often collected by aerial mapping companies and occasionally permissively licensed by local governments.
  • Public amenities or facilities - e.g. council or state governments publishing public parks, forestry, playgrounds, fire stations, police stations, bike parking, public toilets, or even public picnic/BBQ areas
  • GTFS transit data from a variety of agencies is generally useful, though OSM tends to lag

Pushing this further: what’s the next level of data that could be inferred from open data and made easier for a human to review, validate, and add, versus the value it provides?

What would people most want to see?

What would be the challenges with long term maintenance?

What would be the challenges with QA, and what kind of guard rails would we want to see over and above the standard import guidelines?

2 Likes

Hello

There are plenty of worldwide or local datasets that would deserve to be compared with OSM, with a view to tending our own garden. Anyone can raise needs here; the list could be pretty long.

I want to advocate for a methodology for processing these situations, to ensure quality in the long run.

Whatever the dataset, we need a systematic QA approach so that we grow together. Osmose QA is my preferred automation tool chain for that. It already provides a reviewed and shared framework for dealing with external datasets and getting warnings about what is missing in OSM, not going for a full import but encouraging individual review of each warning.

An example in France: pharmacies from the national healthcare equipment repository.

Let’s improve such a systematic, common process instead of reinventing the wheel for each particular purpose.
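To make the “warn, don’t import” approach concrete, here is a minimal sketch (my own illustration, not Osmose’s actual code; the POI names and the 100 m threshold are made up) of a missing-object check: compare an external POI list against OSM points and flag every external entry with no nearby OSM candidate, leaving the actual edit to a human reviewer.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def missing_in_osm(external_pois, osm_pois, max_dist_m=100):
    """Return external entries with no OSM candidate within max_dist_m.

    Each POI is a dict with 'lat' and 'lon'. This is the core of an
    Osmose-style "missing object" check: emit warnings, never edit.
    """
    warnings = []
    for ext in external_pois:
        if not any(haversine_m(ext["lat"], ext["lon"], o["lat"], o["lon"]) <= max_dist_m
                   for o in osm_pois):
            warnings.append(ext)
    return warnings

# Hypothetical data: one pharmacy already mapped, one missing from OSM.
external = [
    {"name": "Pharmacie A", "lat": 48.8570, "lon": 2.3520},
    {"name": "Pharmacie B", "lat": 48.8600, "lon": 2.3400},
]
osm = [{"name": "Pharmacie A", "lat": 48.8571, "lon": 2.3521}]

print([p["name"] for p in missing_in_osm(external, osm)])  # ['Pharmacie B']
```

A real check would of course also compare tags and handle one-to-many matches, but the shape is the same: the output is a review queue, not a changeset.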

3 Likes

Globally I guess, the highest-value yet still incomplete data for map consumers includes accurate and comprehensive addresses, detailed road networks with surface and condition information, up-to-date business listings, pedestrian and cycling infrastructure, indoor maps, and reliable opening hours and accessibility details. But we already know all that. :popcorn:

Isn’t the real “next level” simply the data that already exists in public databases but hasn’t been released under an OSM-compatible license yet - like complete address datasets - so the real innovation would just be finally being allowed to use what’s already there? :clown_face:

A complete map? :grimacing:

Updates. Updates, oh and: updates. Plus how to ensure quality control, prevent vandalism, and deal with outdated import data and licensing.

Automatic conflation checks, rollback plans, monitoring, reviews, documentation, and so on. :slight_smile:

4 Likes

In my opinion, anything that is difficult to achieve with human-scale best efforts should be kept elsewhere, not in OSM. Because if it cannot be added by humans then likewise humans will not be able to sustain it, to keep it accurate. We should make it easier for people to “mix in” third party sources when they render maps, instead of mixing these third party sources into OSM.

11 Likes

This already exists (or has existed) in combination with Osmose, with very mixed results; see Lustige Sachen :D (2022-2024) - #217 by Nadjita and some of the following posts

1 Like

The task of doing these comparisons or value-adds is best placed in the end-user mapping application. We can think about how to make data joins easier, but the selection of which data joins to do isn’t something that OSM itself should be spending too much brain power on.

2 Likes

With the caveats stated by others, I’d strongly encourage this brainstorming exercise at a more local level. Many of us are fortunate to hail from localities where there are half-decent local datasets for a variety of themes, often published by government agencies under licenses that are suitable (or almost there, just need to ask). A local community can decide priorities relevant to them and their neighbors, that will motivate them to diligently clean up and maintain the data, working towards making OSM the gold standard for coverage of that feature type in that locality. Involving local community members early in the planning process will hopefully make them more invested in the outcome.

In this global forum, we can trade notes about strategies that have worked well in our communities. For example, thanks to the particular datasets available in my locality, we were able to import buildings and parcel-derived addresses, then use that coverage as a foundation for importing points of interest. Based on a foundation of imported streets and sidewalks, we were able to map crosswalks, bike lanes, and bus stops in larger numbers than we would’ve been able to otherwise. The procedures and code for each of these initiatives are available as open source, potentially enabling other communities to benefit from our experience.

External datasets can be valuable at a broader scale too, but the global community has less control over specific outcomes. In fact, national or global bulk imports could easily preempt more careful local imports by burdening local, less resourced communities with the difficult task of conflation. So at that scale, external datasets are more valuable for less invasive purposes, such as gauging completeness and flagging discrepancies.

5 Likes

One of the main draws of OSM for data users is that it’s a single global dataset that does not require puzzling together hundreds of local sources from various open data portals. While a strategy of letting others do the mixing allows us to avoid dealing with that challenge ourselves, it makes OSM less attractive than it could be. There’s also the issue that integrating many of these datasets will require at least some manual work.

While it’s not an easy problem, I think the OSM community would benefit from improving tools to make it easier for mappers to use external sources. That way, we would raise the bar of what’s possible to achieve with human-scale best efforts. Much like how the availability of aerial imagery means that each mapper can now achieve more than when all they had were GPS tracks.

And, of course, these tools and processes should ideally be reusable to avoid each local community having to roll their own.

6 Likes

A word of caution: many mappers, including me, have a tendency to import objects and details just because they are available. A possible (future) use case is very easy to conceive for practically any dataset. Currently, import and conflation are a PITA, which prevents a lot of dataset hoarding. Making import and conflation easier will probably also increase the prevalence of this kind of data in OSM.

Maybe it’s important to think about the “add value to the map” part.

5 Likes

Hi,

One of the main draws of OSM for data users is that it’s a single global dataset that does not require puzzling together hundreds of local sources from various open data portals. While a strategy of letting others do the mixing allows us to avoid dealing with that challenge ourselves, it makes OSM less attractive than it could be.

I am thinking of something like the MS building footprints. The dataset is too low-quality to “just import it”, but some commercial providers have chosen to render those buildings into their maps nonetheless.

With vector tiles it would be relatively easy for someone to publish a vector tile dataset that has “all MS building footprints that do not intersect with an OSM building”, allowing someone who builds a web map to integrate these buildings into their map with just a line of code or two. The work required to publish this “add-on” dataset is a fraction of the work required to make a good import, and the existence of such an “add-on” dataset would relieve the pressure that drives some people to drop our quality standards and import low-quality building outlines wholesale in their region.

I am sure similar approaches are possible elsewhere.
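As an illustration of the filter step behind such an “add-on” dataset (a sketch under simplifying assumptions, not real tooling), the selection “all MS footprints that do not intersect an OSM building” can be prototyped with axis-aligned bounding boxes standing in for true polygon intersection; a real pipeline would use shapely or PostGIS before baking the result into vector tiles.

```python
def bbox(poly):
    """Axis-aligned bounding box (min_lon, min_lat, max_lon, max_lat)
    of a polygon given as a list of (lon, lat) vertices."""
    lons = [p[0] for p in poly]
    lats = [p[1] for p in poly]
    return (min(lons), min(lats), max(lons), max(lats))

def bboxes_overlap(a, b):
    """True if two bounding boxes overlap (touching counts as overlap)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def addon_footprints(ms_buildings, osm_buildings):
    """Keep MS footprints whose bbox overlaps no OSM building bbox.

    Crude stand-in for true polygon intersection, but it shows the
    shape of the idea: publish only what OSM does not already cover.
    """
    osm_boxes = [bbox(b) for b in osm_buildings]
    return [m for m in ms_buildings
            if not any(bboxes_overlap(bbox(m), ob) for ob in osm_boxes)]

# Hypothetical example: one OSM building, two MS footprints.
osm = [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]]
ms = [
    [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)],  # overlaps OSM -> dropped
    [(2.0, 2.0), (3.0, 2.0), (3.0, 3.0), (2.0, 3.0)],  # clear -> kept
]
print(len(addon_footprints(ms, osm)))  # 1
```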

While it’s not an easy problem, I think the OSM community would benefit from improving tools to make it easier for mappers to use external sources.

As long as it doesn’t make it so easy that it essentially becomes a cloaked import (and when challenged, the puzzled mapper says “I just clicked on the big flashing ‘improve the map’ button in my editor”), that’s fine with me.

Bye
Frederik

2 Likes

While we see a lot of potential for accelerating and improving power mapping via QA (as @infosreseaux suggested), automation, and tooling, I agree with @woodpeck that we should not bulk-upload data to OpenStreetMap without the human capacity to maintain it. The high quality and global coverage of the transmission grid and power plants was only possible thanks to the human-in-the-loop approach.

This does not mean that we cannot empower these mappers with better tools, training and ‘hint’ datasets, but ultimately a human should validate and create the data using open imagery and publicly verifiable information as a source. What is really needed is human capacity building on the mapping side.

Unfortunately, this is becoming increasingly difficult as many false prophets are telling people all around the world that AI will automate most software and data work. Much of today’s impressive AI technology would not be possible without access to large amounts of high-quality open data and open-source resources. However, these resources are shrinking over time because most AI use cases do not give back to the resources they exploit.

One area in which mappers could be given more power is the mapping of rooftop and ground-based solar farms. As part of MapYourGrid, we have identified multiple AI-generated datasets around the world. Most of these AIs used OpenStreetMap data to train their detection algorithms. Feeding this data back into OSM could create significant value, as in many countries, the rapid growth of rooftop solar and solar farms has created a situation where grid operators and even governments don’t know what is actually out there, making energy planning extremely challenging. This also includes use cases like the solar nowcasting @CloCkWeRX mentioned. Here is some data we have found that still needs to be harmonized:

  1. https://www.transitionzero.org/products/solar-asset-mapper
  2. ChinaPV: the spatial distribution of solar photovoltaic installation dataset across China in 2015 and 2020
  3. GM-SEUS: A harmonized dataset of ground-mounted solar energy in the US with enhanced metadata
  4. CPVPD-2024: A photovoltaic plant vector dataset derived from Chinese remote sensing imagery via a topography-enhanced deep learning framework with dynamic spatial-frequency attention
  5. GitHub - microsoft/global-renewables-watch

3 Likes

The challenge for us is to help data consumers appreciate the value that our data brings beyond consuming that external dataset wholesale. For example, Mapbox used to include OSM address data in the U.S. but later switched to some other dataset, probably proprietary. At just that time, my local community was carrying out a very careful import of addresses that avoided the systematic errors found in other datasets’ coverage of the area, including Mapbox’s OSM replacement.

To us, it felt kind of like dumping OSM’s streets in favor of TIGER. Yuck! But whoever made that decision wasn’t thinking about one city. They wanted the assurance of coverage everywhere in the country, even rural areas far away from any local mapping community, even if it meant inferior coverage in some urban areas. OSM didn’t have enough of a critical mass of address coverage to justify figuring out address conflation (which is harder than building conflation). OSM does have that critical mass of buildings, but we wouldn’t if we had been more conservative about building imports and tasking manager projects. We know that our value goes beyond the raw numbers, but it can be difficult to make that case to developers and product managers who are focused on their own selling points.

By the way, if I’m not mistaken, publishing a conflated or intersected dataset of OSM buildings is permissible but could trigger the ODbL’s share-alike provisions depending on how it’s done. Of course, the Microsoft building dataset is also under the ODbL, and the other mentioned external datasets are presumably compatibly licensed; otherwise we wouldn’t be discussing them at all. But the overall message encouraging data consumers to join datasets would need to be careful not to incentivize violations of the license.

2 Likes

GTFS transit data from a variety of agencies is generally useful, though OSM tends to lag

The https://transitous.org/ project exists to create a worldwide open database of links to GTFS files. Instead of importing data from all of them, I would like to see, in the future, a way to more properly link OSM and Transitous (especially stop and platform/quay locations, for accurate pedestrian routing).

The same applies to similar open projects. I think that we should strive to create an interoperable ecosystem of them, which could then allow for the creation of truly complete and open map clients by combining these linked databases.
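A hedged sketch of what such linking could look like in practice (illustrative only; the stop names, IDs, and thresholds are made up): match GTFS stops from a Transitous-listed feed to OSM stop nodes by proximity plus fuzzy name similarity, producing candidate links for human review rather than an import.

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def link_stops(gtfs_stops, osm_stops, max_dist_m=50, min_name_sim=0.6):
    """Propose GTFS -> OSM stop links: the nearest OSM stop within
    max_dist_m whose name is similar enough. The result is a review
    queue for a human mapper, not something to upload automatically."""
    links = {}
    for g in gtfs_stops:
        best = None
        for o in osm_stops:
            d = haversine_m(g["lat"], g["lon"], o["lat"], o["lon"])
            if d > max_dist_m:
                continue
            sim = SequenceMatcher(None, g["name"].lower(), o["name"].lower()).ratio()
            if sim >= min_name_sim and (best is None or d < best[0]):
                best = (d, o["osm_id"])
        if best:
            links[g["stop_id"]] = best[1]
    return links

# Hypothetical example data.
gtfs = [{"stop_id": "G1", "name": "Central Station", "lat": 52.0000, "lon": 4.0000}]
osm = [
    {"osm_id": "node/1", "name": "Central station", "lat": 52.0001, "lon": 4.0001},
    {"osm_id": "node/2", "name": "Market Square", "lat": 52.0002, "lon": 4.0002},
]
print(link_stops(gtfs, osm))  # {'G1': 'node/1'}
```

Real-world matching would also have to cope with stop clusters, platforms versus stop positions, and multilingual names, which is exactly why the output should stay a suggestion list.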

3 Likes

This is a good example where part of the data can be useful (stop locations), while part (daily changes to timetables) is utterly out of scope and should be integrated by data consumers.

I would rather look for human-scale drudgery that can be supported or partially automated by using extra datasets.

For example, an automated notification that a new road was built and is now visible on aerial imagery and can be traced.

I am scared by imports too large to maintain.

1 Like

I can relate: I used to tediously import buildings in Belgium and burned myself out of OSM entirely because of it (enough to delete my account out of sheer frustration, though I recently made a new one to start afresh).

That and getting scolded because I didn’t keep the history for this or that specific item :face_with_bags_under_eyes:

Welcome back, I hope you will have a more satisfying run this time!
In the Netherlands, importing buildings and addresses from the BAG has become a cornerstone of the mapping community.

1 Like

We are already linking datasets for power plants. OpenStreetMap power plants are linked to more detailed information in Global Energy Monitor using Wikidata. We are also planning similar approaches for power lines and substations.

However, I have to say that the tooling for linking Wikidata in JOSM is still in its infancy. Better tooling and better cooperation with Wikidata could really help here.

Here is how we link power plants via Wikidata to other resources:
https://openinframap.org/#9.24/51.0275/-1.3816

2 Likes

The ability to produce combined extracts from the usual tools like Overpass could be an enabler as well, but it is not trivial:

  • Technically, it needs very strong interoperability between APIs, which doesn’t yet exist
  • It may lead to licensing burdens as well

Such views could lower the cost of linking datasets instead of adding them directly to OSM, which is currently the cheapest solution.

I think we’re already quite close with Turtle extracts, federated queries in QLever, and visualizations in Ultra.

1 Like

Mayhaps (I got banned from the OSM World Discord for being myself anyway so I refuse to spend nearly as much time doing stuff on OSM)