Thank you for your comment; it is very insightful. I love what you are working on.
So my point for debate here is:
do you think there are, or will be, specific cases where machine-generated features match the desired quality level and belong in OSM?
If so, how can someone validate the performance in a measurable, objective manner (assuming the community has been informed and has agreed to the import)?
Here is what I did for this project; maybe it could be extended to gather more human input for evaluation:
I made some tools to compute metrics (e.g. IoU for the polygons) between changesets. I used those to compare the degree of agreement/consensus between different humans (mostly me and my wife) but also against machine-generated outputs. I only published the tool once I reached a point where the agreement score of the machine-generated outputs was higher than the human consensus (and, as we have seen, this was clearly not enough).
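To make that concrete, here is a minimal sketch of the kind of metric I mean (this is not the actual tool; the shapely-based implementation and the coordinates are just for illustration):

```python
# Minimal sketch of a polygon-agreement metric (illustrative only, not the
# actual tool): IoU between two tracings of the same pool, using shapely.
from shapely.geometry import Polygon

def polygon_iou(a: Polygon, b: Polygon) -> float:
    """Intersection-over-Union; 1.0 means identical, 0.0 means no overlap."""
    union = a.union(b).area
    return a.intersection(b).area / union if union > 0 else 0.0

# Hypothetical example: the same pool traced by two different mappers.
mapper_a = Polygon([(0.0, 0.0), (10.0, 0.0), (10.0, 5.0), (0.0, 5.0)])
mapper_b = Polygon([(0.4, 0.3), (10.2, 0.3), (10.2, 5.2), (0.4, 5.2)])
print(f"agreement (IoU): {polygon_iou(mapper_a, mapper_b):.2f}")
```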
I am hoping to find the time to implement that; thanks for the idea.
I am not sure I understand how you reach the conclusion that OSM is not where that info is useful. Is that the case even if there is a point in time (maybe unreachable) where that AI-generated data matches the level of what humans annotate today?
In Poland someone made a bot that auto-detected highway=crossing zebra crossings and added them. It was not error-free, and the quality was comparable with human mapping, though it made different errors than a human would.
Out of curiosity, can you elaborate on why OSM data wasn’t good enough to serve as training data for this purpose? Was our coverage too sparse or limited to certain kinds of pools, or were you seeing too many poorly drawn pools even among manually drawn ones?
Related to that, as you mentioned having trained the model in Spain, I wondered if you were using an area where swimming pools have been systematically imported as part of the Catastro buildings import project, or where they have been added in a less systematic way by tracing. My experience has been that the Catastro imports give good results for swimming pools. Of course no data source is perfect, but I think areas such as the one in this screenshot are probably as good as it gets - if this is not good enough it would suggest that OSM data can never be good enough for this purpose.
Well, some people will always object to any AI-related contributions, though I believe that is unreasonable.
Just throwing out an idea: I wonder if it would be helpful to include, as part of an import proposal, images of the geometry to be added (with examples for “lowest quality” / “average quality” / “highest quality”) on which the community can comment. (“The bar for ‘lowest quality’ is too low - please increase the confidence requirement” or “The detections are great, but the geometry is not good enough - please add only nodes for the detected features.”)
The attitude I perceive in the OSM community (or at least, those of us who are active in the typical communication channels - which is a tiny percentage of OSM mappers) is that machine-generated data is welcome in OSM only when it is at the quality level of the data produced by experienced mappers.
That might not seem “fair” - after all, OSM has plenty of human contributions of terrible quality - but that doesn’t justify adding even more poor-quality data.
Convenient timing - I’m at our OpenThePaths 2025 conference right now and the keynote speaker just mentioned that the expectation for autonomous vehicles is to be better than humans before they are cleared for operation.
I did try to train with different filter variations (e.g. using only Catastro imports, or only human edits). In the end, the results were roughly the same.
The Catastro imports still suffer from common issues like misalignment with the aerial imagery. From the area you linked, this is just the very first example I checked, overlaid on PNOA tiles:
Then there is a separate issue with filtering: in some areas, a mix of sources is visible within the same tile. Filtering would cause more problems, as there would be clearly visible swimming pools without a label.
Again in the example area you shared, the closest pool to the previous one is not from Catastro but a (questionable?) edit on top:
If it was not imported from Catastro, it looks like it must have been traced from it. And I’m not sure it is questionable - Bing imagery (which is much clearer here) suggests this is indeed the shape of the pool.
In any case, this specific example aside, I agree that as the Catastro import here was about 6 years ago, there have inevitably been some changes on the ground. This means either that some pools have been traced manually (if the changes were noticed) or that the OSM data will not match current imagery (if they haven’t been).
So what does this all mean for @Minh_Nguyen’s question about why OSM data wasn’t good enough? Where swimming pools (or buildings etc) have been drawn by OSM mappers, there will be badly traced objects, or objects where the outline is simply unclear from imagery. Where they are imported, they are not subject to tracing errors but they will reflect any imperfections in the source data. In either case there will always be timing mismatches between the date the objects were added and any particular set of aerial images. Are these various imperfections the reason OSM is not good enough?
Sorry for the lack of clarity: it was first imported from the Catastro and then edited, so the version you get by default (the latest) is the manual edit.
To clarify, I think OSM data is not good enough to train a model to consistently generate polygons as good or better than humans.
It is good enough to train a model to generate bounding boxes (to ultimately be converted to point features), which is already useful (I would say roaming around to find swimming pools is the most time-consuming part).
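For illustration only, the “bounding box to point feature” step can be as simple as taking the centre of the detection; the bbox order (min_lon, min_lat, max_lon, max_lat) and the GeoJSON output here are my assumptions, not the tool’s actual format:

```python
# Hypothetical sketch: turning a detected bounding box into a reviewable
# point feature (illustrative assumptions, not the tool's actual format).
import json

def bbox_to_point_feature(min_lon, min_lat, max_lon, max_lat):
    """Return a GeoJSON point at the centre of a detected bounding box."""
    lon = (min_lon + max_lon) / 2
    lat = (min_lat + max_lat) / 2
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        # A mapper would still review the location before adding it to OSM.
        "properties": {"leisure": "swimming_pool"},
    }

print(json.dumps(bbox_to_point_feature(2.1530, 41.3870, 2.1534, 41.3873)))
```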
The imperfections add up with some other factors. In my experience, the most impactful one is (in)consistency.
It is hard to achieve consistency even if you hire a dedicated team for labeling and give them clear rules they need to study.
In OSM, there are new contributors, experienced contributors, machine-generated data, automatically imported data, etc. You can imagine it is not feasible to expect consistency in that setting.
My point with the project is that general-purpose segmentors (e.g. SAM2) are getting close enough to remove the need to train a dedicated polygon predictor (which requires a lot of consistency in the training data), provided they are given a good bounding box to guide the prediction.
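To show what I mean by guiding a general-purpose segmentor with a box, here is a rough sketch based on SAM2’s published image-predictor interface; the checkpoint name, tile path, and box coordinates are placeholders, and this is not my exact pipeline:

```python
# Rough sketch of box-prompted segmentation with SAM2 (illustrative only).
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder checkpoint and aerial tile.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
image = np.array(Image.open("tile.png").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # Detector output in pixel coordinates: [x_min, y_min, x_max, y_max]
    box = np.array([120, 80, 210, 150])
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)

# masks[0] is a binary mask that can then be vectorised into a polygon
# before any human review.
```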
I believe there are more than a few swimming pools generated by the tool that are as good as or better than what an average(?) contributor would draw (I can at least speak for myself, having labelled and edited hundreds of pools manually).
Somewhat prompted by the above conversation about the inevitable triumph of our machine overlords (and also by the date), I thought I’d share this “advertisement” from some time ago:
Business Software
TLO Toolkit: Trying to get to grips with The Last One? Our TLO
Toolkit makes the job a cinch, gets TLO working, produces exactly
the code you want. _Definitely_ the last program you'll need to
buy (revised version coming in September). £260.00.
For those not born at the time, “The Last One” was supposed to replace the actual writing of code - apparently it did that for you. Wikipedia says “The name derived from the idea that The Last One was the last program that would ever need writing, as it could be used to generate all subsequent software”.
I first saw this spoof in Cambridge in the UK in the early 1980s. It claimed to be “From Personal Computer World, April 1982”. Archive.org has this pdf of an inexpertly scanned copy of “Australian Personal Computer, April 1983”, which looks like a “repurposing” of the same thing, presumably from a year later, complete with IBM 370 jokes that perhaps needed a bit more explaining down under.
Edit: I’ve just realised that neither my quoting here nor the Australian “borrowing” link to the likely author, so I’ll correct that now.
This is a fascinating thread, and it seems to follow a consistent pattern we’ve seen with regard to AI mapping approaches: they tend to be ham-fisted and produce results worse than what a human would do. The proponents tend to have overconfidence about the quality of what the AI is producing.
Now, I’m a huge fan of what AI can, or will soon be able to do. I can’t wait for AI to take over the tedious parts of mapping, so we can map more, at larger scale, for the same amount of human effort.
However, each time one of these types of conflicts happens, it’s a step backwards and the community gets more and more skeptical of AI approaches. I am looking forward to the first AI-assisted mapping tool that actually understands the community zeitgeist and works to build trust. We need to introduce automation slowly, and in deliberate increments where we can all try things on for size and work out the kinks.
“Go fast and break stuff” doesn’t work here and will be regarded as “dumping crap into the map”. Instead, we need “Go slow and take pains not to break stuff”.