Simplifying lake with 200,000 nodes

I’d like to throw one other thing into the discussion.
I’ve been told - but I never verified this - that all history of OSM is kept. That would mean the 200k nodes would stay on the OSM hard disks and some 60k nodes would be added to the database, thus not simplifying it but expanding it.
Is this true and is it a problem?

Where such “data reduction efforts” show improvements over time (well, one place anyway) is in downloading “scrubbed” data from a planet file (or regionalized pieces of it). By “scrubbed” I mean only the latest data, the latest version, not all data including every version’s history. Far more often, it is the “latest version of the data” that gets used as “OSM data,” and many, many fewer copies exist of the entire “snail dragging every bit of its history along behind it like a long tail.” So it’s a win in the long run for OSM to have efficient data.

Especially data “of this sort” (big pigs of rough edges, no offense to pigs). A lot of bloated data can be spun into silk purses, and that’s what we’re talking about here.

Blow a bubble, make a silk purse, follow the breadcrumbs. The pieces are mostly there to do this, the specifics aren’t yet, but they are a relatively simple toolchain-sketch away from being so.

1 Like

@stevea please explain your last four sentences to me. I don’t get what you’re saying.

1 Like

By “blow a bubble,” I mean we’re “chewing the gum” of describing a “rough edge in OSM,” and the people in this thread are smart enough to “blow a bubble” by attempting (hopefully successfully) to describe a “short toolchain process.”

By “make a silk purse,” I was using the saying (“from a sow’s ear, make a silk purse” because I described the data as piggy). It means to “turn what otherwise seems like offal into a beautiful, valuable thing.” (In fact, there is a story of someone who boiled a pig’s ear and made a sort of protein jelly, which was then spun into a fiber, which was constructed into a purse, which did resemble silk. Maybe I should have offered a trigger warning, that’s pretty graphic).

By “follow the breadcrumbs,” I again allude to the story of the children who left breadcrumbs in the forest so they could retrace their steps. The people who have posted to this thread have provided a lot of “breadcrumbs” (navigation points of importance to follow) for constructing the path: maybe 0.5 meter is a pretty tight value, maybe 1.5 or 2.0 will work, we might have some node-reduction numbers, so how about we “do these” and see how our efforts turn into visual data that actually “hugs the shoreline with tight tolerances.” That builds into a small “let’s see” pilot-project sketch, which seems like what comes next.

Get what I’m saying? We have the community to do this. Let the volunteers who want to do so, do so. Your breadcrumbs (and the toolchains of your own invention that you know and can make work) are here for the offering. Now roll up (your) sleeves. See, I don’t want to point my finger at anybody; it’s more like saying “step right up.”

Edit: The birds ate the breadcrumbs, so the children were lost, but the idea of leaving breadcrumbs to follow a trail is sound, as long as nobody eats the breadcrumbs, something I don’t think OSM has any danger of experiencing here. I’ll tone down such allusion and allegory; though it caused some confusion (thank you for calling it out) it also seemed to provide a bit of mirth.

2 Likes

Yes

No.

Everything currently in OSM (as you can see here) is 68 GB in total. That’s not a lot, in the great scheme of things. Your phone might have more storage than that.

Everything that has ever been in OSM since 2012 (see here) is 115 GB in total. That’s also not a huge amount. Text, and a few bytes per node, compress really really well.

10 Likes

Thanks boys, I think I get it :ok_hand:

I am quite a deleter myself, so now I know I am indeed shrinking the ballast.

Another one I had to look up but you’re right :slight_smile:

1 Like

I did load the relation linked above into the editor. JOSM is great; it can deal with a million objects, e.g. when loading commercially or administratively supplied GIS data. Still, just trying to figure out how such a monster came into existence proved quite demanding on my computer system.

Am I supposed to provide an ArcGIS-capable workstation to work on OSM?

Recently, some (imported) waterways in my area got simplified. Nobody complained :slight_smile: And the first thing every sane person does when correcting bad import data is simplify. This overnoded stuff is just a data prison.

What I found funny too: the shoreline is mapped in such detail that it only corresponds to nature for a limited part of the year, but islands the size of football fields are missing.

I am loving this discussion. OSM is a wonderfully step-wise project. We get an early sketch, and data either get deleted as junk or improved and improved and improved. Sure, there are some SSDs or spinning rust in several back rooms somewhere that store the hoary old stuff, but the “latest version” gets both slimmer-trimmer-smarter (if data growth holds still) and improved (because data growth doesn’t hold still).

Step-wise. Smart, dedicated-to-quality volunteers polishing, sharpening, improving. What’s not to love?!

Computationally, this is not about SSD vs. magnetic storage; it may be about RAM and processing power, perhaps graphics too.

But when modifying multipolygons and wanting to be certain that nothing gets mutilated, you had better download the whole beast and all of the surroundings you work on, or someone else will have to resurrect some square kilometres of wood or whatever the next day.

I repeat: overnoding makes data less accessible for change. And yes, the world is in flux.

1 Like

Oh, I only mentioned SSDs as storage for “data back to 2012” and spinning rust (traditional hard drives) for “really old data, if you wanted to go back to a 2007 version” (a 2007-back-to-OSM’s-origin archive would most likely be found on a hard drive). The storage technology is irrelevant to this discussion, yes.

@SomeoneElse ties it up nicely (“your phone might have more” is great!). Those are compressed sizes; uncompressed, it takes some “medium-to-big iron” to work with data as large as that (not your average laptop or desktop, although PCIe 5.0 storage, at around 12 gigabytes per second, is now a thing, and that up-to-the-month tech is appropriate for working on big data).

What is important is that “downstream, curated OSM data” (versions of them, like “the latest OSM of California,” that I might cook up or download and burn onto a microSD card to pop into my GPS when I hike in the wilderness and find my trails “right there”) are “only one version of OSM’s data.” (You might consider “the latest” to be “the best you might have”; many do.) Not the “snail and the whole history behind it.”

Data be nimble? Data be fast! Data aren’t nimble when they are compressed and/or have their entire history behind them. But there are good reasons both to compress and to “carry the whole shell on your back,” and OSM does both of those where/when/as it is appropriate to do so.

Then, somebody loads an image (on their screen, on their phone, onto an SD card, specifically for exactly those bike routes you just selected, or all the passenger train routes into, out of and around that particular city…) and “the latest” (OSM data) appears. Snappily. Prettily. Just as the algorithms designed to present our data to you intended: beautifully.

Keep improving OSM, fellow volunteers. The whole world loves it.

3 Likes

On systems starved of processing power, it likely proves superior not to compress, storage being cheaper.

(I read your statement like this, e.g. on a hiking path: OSM has the way, not the individual steps someone may have taken. Regarding the shoreline: fishers should still be able to find the places where the fish are hiding, even from a coarser shoreline.)

1 Like

It does seem to be true that with recent advances in storage technology (I saw early PCIe 5.0 prototypes described as “face-meltingly fast,” and they are now real products with high, but not sky-high, prices) getting faster and reasonably priced for how fast it is, compression isn’t as important as it used to be. (Though compression will always have its place.) Of course, when editing, you don’t need the history and you don’t want the compressed version.

When editing, you only need a short toolchain of instructions that tell you a process to follow and some tolerances to stay within, and you (and maybe some AI or well-written software, or both) can reduce even a lake with 200,000 nodes down to short (≤2000 nodes per way? ≤1899 nodes per way?) segments, cleverly hugging the coast to a tight tolerance of “within so many centimeters of the actual shoreline” and maybe 1.5 meters between nodes at most.
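
To make the “tolerances to stay within” part concrete: JOSM’s Tools > Simplify Way does a Douglas-Peucker-style reduction with a maximum-error setting. Here is a minimal, purely illustrative pure-Python sketch of that idea; the `shoreline_m` name and the assumption that coordinates are already projected to meters are mine, not anything from JOSM itself.

```python
from math import hypot

def point_segment_distance(p, a, b):
    """Distance in meters from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    # project p onto the segment, clamped to its endpoints
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify(points, tolerance):
    """Douglas-Peucker: drop nodes that deviate less than `tolerance` meters
    from the simplified line; always keeps the first and last node."""
    if len(points) < 3:
        return list(points)
    # find the node farthest from the straight line between the endpoints
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [points[0], points[-1]]
    # recurse on both halves; the shared node appears only once in the result
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right

# Hypothetical usage: shoreline_m is a list of (x, y) tuples in a metric projection.
# simplified = simplify(shoreline_m, tolerance=0.5)   # keep nodes to within 0.5 m
```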

I’d start by chopping up this beast into the right number of segments (is 1899 the right number of nodes per way, or is 2000 OK?). We can see how closely we hug the shoreline (to within how many cm does it hug right now? that would be a good thing to know) and play around with whether 0.5 meter or 1.5 meter or whatever is preferable after we “see.” The reason so many graphics have been provided in this thread (no need to apologize, @Graptemys, “a picture says 1000 words”) is that we do need to see, to judge how coarse (or fine) our choices and efforts are.
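
And to answer my own “to within how many cm does it hug right now?” question in a measurable way, a rough check could look like the sketch below (reusing `point_segment_distance` and `simplify` from the sketch above; brute-force and slow, but fine for a one-off comparison on a local copy).

```python
def max_deviation(original, simplified):
    """Worst-case distance, in meters, from any original node to the
    simplified polyline -- i.e. how tightly the result still hugs the shore."""
    worst = 0.0
    for p in original:
        nearest = min(
            point_segment_distance(p, simplified[i], simplified[i + 1])
            for i in range(len(simplified) - 1)
        )
        worst = max(worst, nearest)
    return worst

# Try a few candidate tolerances on a local copy before touching the live data:
# for tol in (0.5, 1.0, 1.5, 2.0):
#     s = simplify(shoreline_m, tol)
#     print(f"{tol} m -> {len(s)} nodes, max deviation {max_deviation(shoreline_m, s):.2f} m")
```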

1 Like

I’d say that needs a new model for the data, where nodes are not first-class citizens.

PS: Except for POIs of course. These will always be welcome.

1 Like

Well, that certainly is fair. I was speaking in the very restricted context of “let’s treat this lake as a test / pilot project in edge reduction.” I wouldn’t usually treat nodes as anything other than “first-class data citizens,” but I’m tossing out discussion dialog on a Discourse thread to further what we might do here.

As I think about it, maybe chopping up into 1899-node segments first is not necessarily the right order of things. We might want to come closer to deciding whether simplifying the existing data at 1.5 meters of spacing is “acceptable data quality to the community,” and then simply do that conversion (to reduce nodes). Then we might chop into 1899-node-max segments.
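
A sketch of that second, chopping step, under the same assumptions as above (a plain list of nodes per way; 1899 is just the candidate limit from this thread, not a hard rule):

```python
def chop(way_nodes, max_nodes=1899):
    """Split one long run of nodes into consecutive ways of at most `max_nodes`
    nodes each, sharing end nodes so the shoreline stays connected."""
    segments = []
    start = 0
    while start < len(way_nodes) - 1:
        end = min(start + max_nodes - 1, len(way_nodes) - 1)
        segments.append(way_nodes[start : end + 1])
        start = end  # last node of this segment is the first node of the next
    return segments

# The two-step order discussed above (hypothetical):
# 1. simplified = simplify(shoreline_m, tolerance=1.5)   # or 0.5, per community consensus
# 2. ways = chop(simplified, max_nodes=1899)
```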

If @Graptemys found those data to be “efficiently loaded” into his (what might be characterized as an “average-sized,” no slight meant, I hope no slight taken) JOSM editing environment, we might have something here: a two-step process of “big data edge” improvement.

For this case, we might agree to do that. I’ve actually put out some feelers to the communauté québécoise to chime in about this, let’s see if that “please smell our pretty flowers, we might improve this lake edge for everybody, what do you think?” results in any response from the Québec community. But my French is weak (écolier fleuri, “schoolboy flowery”) and I’m not québécois, so I’ve asked @PierZen for some wider exposure to this thread (and its pretty graphics).

It’s crazy fun to be doing this.

Edit: BTW, I should say that I’ve now seen that the 735 ways are already <2000 nodes each. So if we do this, it would be as easy as @Graptemys says: using JOSM’s Tools > Simplify Way. What we might continue to discuss here is whether 1.5 meters (deleting 53% of the 207,982 nodes, down to ~97,000) is acceptable. Depending on the results, the sorts of double-node data cleanup illustrated by @Richard might then be additionally applied. But I don’t think that’s an automatic process; it might be manual. Or it might be cleaned up by the suggested toolchain, now just the simple step of Simplify Way (with a community-agreed-upon choice of resolution; I’m only tossing out 1.5 m as a suggestion, AND we’d still need to decide we’d like to do this to these data). The current consensus from the local community (which @PierZen is representing) is 0.5 meter. That still removes 18.5% of the nodes, and that’s at least something. @ezekielf agrees with 0.5 m.
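
For anyone sanity-checking those percentages, the back-of-the-envelope arithmetic (just restating the figures quoted in this thread, not new measurements) is:

```python
total = 207_982  # nodes currently in the lake's 735 ways

# reduction figures quoted above, not re-measured here
after_1_5_m = round(total * (1 - 0.53))    # ~97,752 nodes left at 1.5 m
after_0_5_m = round(total * (1 - 0.185))   # ~169,505 nodes left at 0.5 m

print(after_1_5_m, after_0_5_m)
```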

What I wonder is whether the performance hit JOSM experiences is due to the number of members in a relation (>700 really is large), the number of nodes (>200,000, ditto), whether ways longer than 1900 or 2000 nodes or so make a huge difference, or some combination. I think @SimonPoole was asking something similar, but the way I’ve just stated it seems a more focused way to ask “why?”

2 Likes

fishers should still be able to find the places where the fish are hiding, even from a coarser shoreline

I don’t think it is acceptable to reduce actual detail in data because it doesn’t matter for some applications. It takes a lot of work to add detailed, accurate data, and it is just the push of a button and some automatic processing to reduce its quality. Reducing data by wiping away the details has to be done on local copies, not on the shared asset which is the central database.

5 Likes

Does anybody else experience the following problem? This is with giving JOSM a 16 GB heap using the method I described before:

I open the relation in JOSM (nearly a minute passes), call up the relation editor (nearly a minute passes), click the first element (nearly a minute passes), scroll to the end of the list and shift-click so all elements of the relation are selected, then click the penultimate button under “Selection” (to transfer all the elements to the right-hand-side “Selection” pane). Nearly a minute later, they “jump over” to that list. I can move the relation editor window around and see in JOSM’s Selection list (underneath it) that 735 ways are selected. But after I do this, I can neither click “OK” in the relation editor’s window (and get any response) nor close the relation editor window (on a Mac, that’s the red button in the upper left).

In effect, JOSM is hung: the relation window remains open, but I can’t really do anything. Indeed, my Activity Monitor (a fancy way of looking at process status, memory, threads, etc.) shows “JOSM (Not Responding).” I’ve never hung JOSM like this before. I’ve borked it pretty badly (slowed it down for minutes at a time until it “catches up with itself”), but I’ve never outright hung it like this. It’s been over 20 minutes and…nothing.

1 Like

The data we are talking about stem from an import. We do not know how they were created. The level of detail is completely unreal; search for “false precision” in this topic. The work spent is equivalent to mapping a hiking way from an untreated GPX dump.

Wait until zoom 22 arrives :slight_smile:

3 Likes

So?
That’s not a reason to delete, nor is it a reason to keep the data.
It’s irrelevant.

1 Like

I implemented zoom 24 in “normal” raster tiles based on a 2011 mailing list post…

Zoom 22 has not only arrived, it’s already in senior school!

1 Like

From my previous couple of posts, it may be obvious I’m trying to do exactly what @dieterdreist suggests: make a local copy of this beast, so that it can be compared with the existing data now in our map. For example, w.r.t. its performance in slowing down or crippling JOSM: maybe the fix is fewer nodes, maybe it’s fewer ways, I’m not sure. But as these data are now, they’re too much for OSM to comfortably edit. That is a problem we continue to need to address.

Unfortunately, even with the largest heap I’ve ever thrown at a Java app (16 GB), JOSM still seems to choke on these data. That’s a brand-new experience for me. So I can’t trim the beast to see whether smaller versions stop crippling JOSM, because attempting to do so doesn’t just cripple JOSM during the process, it hangs it!

This has been a long, interesting, fun-at-times, even tedious (I’ll plead “guilty,” though I was never formally charged) thread. But with this JOSM hang, I’m currently stuck. Thanks to @Graptemys for bringing it up — it’s a valid concern, though even with my ample computing resources, I’m flummoxed.

Maybe somebody will offer suggestions on how to remedy my JOSM hang, but I expect for the most part I will simply watch as others continue to mull this over. And via @PierZen, we still might get perspective from the Québec community on what they think of these gigantic data.