Thanks all for the wonderful insight.
How was it verified? Based on manual aerial classification (which also has some error rate), or by survey?
The dataset was generated using a randomized subset of OSM ways that did have surface tags, scattered all throughout the US. So it’s assumed that this is a fair mix of aerial classification and survey.
Have you considered publishing a data dump of the output - perhaps just a CSV of `way_id,surface`? That way, routers could get up and running with the project right away.
This is a wonderful idea! It's a great way to make this project helpful right away, while being tolerant of error. I'll plan on this.
The TIGER A41 roads are still a massive problem, and making the situation 90% better would not only make OSM much better in itself, it would help us to leapfrog other mapping providers who also have very poor data for these areas.
Thanks for the compliment! Polishing the TIGER import is the main motivation behind this project. The data in my area is very poor and makes OSM-based routing nearly unusable. This is what I'm hoping to improve.
OK, I see the point that it could be helpful in cases that are 100% wrong right now.
Right: I agree this shouldn't be used to directly import into OSM, but it's still a powerful tool with tolerable accuracy for a lot of use cases. I think the way forward is publishing inference results for a wide swath of the US, plus providing a JOSM plugin so the model can assist mappers (who provide the ultimate ground-truth decision).
You may be interested to know that Mapbox allows their satellite data to be used for offline machine learning stuff.
Thanks for this info! I will reach out to them. Of note: the NAIP imagery, at 2.3 m/px, is probably the main limitation for this algo at the moment. But with this imagery I proved that this is a well-formed classification problem and that my ML model is a feasible solution. Getting hold of higher-res imagery should drastically improve these results.
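To put the resolution limitation in rough numbers (a back-of-the-envelope sketch; the 2.3 m/px figure is NAIP's as discussed above, but the road width is my own assumption):

```python
# Rough estimate of how many pixels span a road at different imagery resolutions.
# Assumes a typical two-lane rural road width of ~7 m (an assumption, not a NAIP spec).
road_width_m = 7.0

for name, m_per_px in [("NAIP (as used)", 2.3), ("high-res commercial", 0.3)]:
    px_across = road_width_m / m_per_px
    print(f"{name}: ~{px_across:.1f} px across the road")
```

At ~3 px across, surface texture is essentially invisible to the model; at ~20+ px, gravel vs. asphalt texture becomes distinguishable, which is why higher-res imagery should help so much.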
I see you provide 20 random classification samples, all of which are correctly classified.
It would be more interesting to have a sample of cases in which the model fails to correctly recognize whether the road is paved or unpaved, so that we can understand when this happens and mappers can more easily recognize mistakes when approving these changes.
That's actually complete coincidence. Thanks for the input. I'll provide several examples of correctly and incorrectly classified results instead.
does it recognize the case when a road is not visible (hidden by trees, in a tunnel, etc.)?
No, this is an edge case. I will note that the model does well even when the road is hidden by trees; I’m guessing even getting a few pixels of road through the canopy is enough for it to make a decent decision. But this is only speculation.
I second @Mateusz_Konieczny's questions. How did you calculate the accuracy? Did you compare it to self-collected ground truth? Did you compare it with existing `surface=*` tags in OSM?
See above: yes, the dataset used for training this model and validating its accuracy came from a random sampling of existing `surface=*` tags in OSM, spread across the entire US. I detail the data generation process carefully on the project site, under the "data prep notebook", if you're curious.
Because mapped data, in contrast to unmapped data, tends to make people more reluctant to edit OSM (see the consequences of the TIGER import on the US and the number of volunteer mappers per inhabitant compared to various European countries), I am against this import.
Don't worry, I'm not hoping to use this as-is to import mass amounts of data into OSM. I think a JOSM plugin giving mappers access to this algo, plus a separate dataset dumping `way_id`, `surface`, etc., is the best way to leverage this at the moment.
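A minimal sketch of what consuming such a dump might look like on the router side (the column layout, sample rows, and the 0.6 penalty are all assumptions for illustration; no published format exists yet):

```python
import csv
import io

# Hypothetical dump format: one row per OSM way with its inferred surface.
sample = io.StringIO(
    "way_id,surface\n"
    "123456,paved\n"
    "789012,unpaved\n"
)

surface_by_way = {int(row["way_id"]): row["surface"] for row in csv.DictReader(sample)}

# A router could then penalize unpaved ways when computing edge costs.
def speed_factor(way_id: int, default: float = 1.0) -> float:
    return 0.6 if surface_by_way.get(way_id) == "unpaved" else default

print(speed_factor(789012))  # → 0.6
print(speed_factor(123456))  # → 1.0
```

Keeping the dump as a flat `way_id,surface` CSV means routers can join it against their own graph without needing any of the model's internals.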
Y'know, StreetComplete has been available for years, and has helped add a lot of `surface` tags to OSM.
Agreed that StreetComplete is a good way to get decent "on-the-ground" surveyed ground truth for surface tags. But as others hinted, I'm a little worried that training solely on these would bias the model toward cities and populated areas, which is not what it's trying to target. Also, thanks for the project link, I'll check it out!
Looks useful! I’m in CO as well, and have struggled with surface tags. They are valuable for routing, but surveying is next to impossible in these larger states. I think the sheer size of the open spaces in the western US is hard for people to grasp if they haven’t lived there…
We’re on the same page, thanks for the compliment!
I do know Boulder County has been pretty rigorously updated with surface tags due to all the cyclists, so it could be an interesting area to examine for accuracy.
This is a great idea for a local case study and to tell a good story. I’ll look into this!
Either way, thanks all: this gives me a lot to work with going forward. I'm hoping to take this project forward in three ways:
- Seeking out higher-res imagery via Mapbox and/or others, to get higher accuracy. Improving the model would always be beneficial.
- Writing a JOSM plugin (already started) so other mappers can use the algo as they wish for aerial-assisted mapping. Selfishly, I'll likely use this along with the nicer ESRI imagery to mark paved vs. unpaved in the TIGER deserts I'm most interested in.
- Publishing a CSV mapping `way_id`s to inferred `surface=*` predictions, so routers can use it immediately and others can get a feel for the quality of the data this provides.
Jon