Automated edit proposal - convert non-standard dashes to standard dashes

matheusgomesms · March 6, 2024, 4:54pm

Before creating a wiki page etc., I just wanted to confirm first whether this is feasible and makes sense:

Basically I want to convert non-standard dashes to standard dashes for opening_hours.

Problem: some (maybe all?) software can’t parse non-standard dashes, which breaks the opening hours data. See Organic Maps for example.

Possible solution: editors could fix that on the go, but iD doesn’t yet.

Proposed solution:

1 - retrieve all cases where there are opening_hours with non-standard dashes through an Overpass Turbo query.
2 - export the results to Level0 editor
3 - replace them in a local text editor (copy from Level0, paste into the text editor, replace, then copy and paste back into Level0)
4 - upload data
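
Step 3 can be sketched in a few lines of Python instead of a manual find-and-replace, which would also touch dashes in tags such as name. This is only a sketch: the function name fix_pasted_text is hypothetical, and it assumes the text copied from Level0 has one tag per line in the form "key = value". Only en dashes (U+2013) are replaced, matching the simple query.

```python
import re

EN_DASH = "\u2013"

# Assumed tag-line shape in the pasted text: optional indent, key, "=", value.
TAG_LINE = re.compile(r"^(\s*)([^=\s]+)\s*=\s*(.*)$")


def fix_pasted_text(text: str, key: str = "opening_hours") -> str:
    """Replace en dashes with '-' only on lines that set the given key."""
    fixed = []
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match and match.group(2) == key:
            indent, value = match.group(1), match.group(3)
            line = f"{indent}{key} = {value.replace(EN_DASH, '-')}"
        fixed.append(line)
    return "\n".join(fixed)


sample = (
    "node 1: 1.0, 2.0\n"
    "  name = Caf\u00e9 \u2013 Bar\n"
    "  opening_hours = Mo\u2013Fr 08:00\u201318:00"
)
print(fix_pasted_text(sample))
```

The name line keeps its en dash, while the opening_hours line comes back with plain hyphens.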

I did a small test in a city, and I don’t think it broke anything.

Is there anything missing? I know I could improve the query to include all types of non-standard dashes, but just wanted to make it simple (everything simple, actually).

I appreciate if someone could come up with a more robust solution, but otherwise, I think this works quite fine.

Edit: obviously I would try to make “small” upload bbox areas, maybe country-wise.

Thanks!

matheusgomesms · March 6, 2024, 4:56pm

Using that query, in the world there are right now 8617 nodes, 704 ways and 13 relations with that problem.

queerthoughts · March 6, 2024, 4:59pm

You probably already know this, but I thought I’d mention it nevertheless: your query only looks for en dashes (U+2013), but if you want to convert all types of dashes to U+002D (the “standard” ASCII dash), you should look for all kinds of dashes. Wikipedia has a nice overview of Unicode characters with the Dash=yes property.
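
As a rough illustration of catching more than en dashes: Python's standard unicodedata module does not expose the Dash property itself, so the sketch below uses the general category Pd (dash punctuation) as an approximation and lists U+2212 MINUS SIGN explicitly, because its category is Sm. The helper names are hypothetical, and the set of characters covered is not identical to Dash=yes.

```python
import unicodedata

# MINUS SIGN is category Sm, not Pd, so it has to be listed by hand.
EXTRA_DASHES = {"\u2212"}


def is_nonstandard_dash(ch: str) -> bool:
    """True for dash-like characters other than ASCII hyphen-minus."""
    if ch == "-":
        return False
    return unicodedata.category(ch) == "Pd" or ch in EXTRA_DASHES


def normalize_dashes(value: str) -> str:
    """Replace every non-standard dash in value with U+002D."""
    return "".join("-" if is_nonstandard_dash(ch) else ch for ch in value)


print(normalize_dashes("Mo\u2013Fr 08:00\u201418:00"))
```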

Xvtn · March 6, 2024, 5:03pm

Why not extend this to all tags? (Not just opening_hours) Perhaps other than name.

queerthoughts · March 6, 2024, 5:06pm

The type of dash used depends on language. (Same goes for quotes, by the way. Think about the difference between English, German and French for example.) For opening_hours=*, there’s a formal specification, but some tags are language specific: e.g. brand=*, operator=*, description=* and of course name=*. Different languages use different types of dashes, however I’d say in the case of opening_hours=* it would be appropriate to simplify everything to using just U+002D.

Xvtn · March 6, 2024, 5:08pm

That makes sense!

os-emmer · March 6, 2024, 5:31pm

Converting such non-standard dashes to the standard ones makes it easier for data consumers. I approve of this edit. Please go ahead with documenting your plan in the wiki and, after waiting long enough for other comments, execute it! After that you could propose more edits for the other non-standard dashes.

That said, I have a question anyway:

Where do these numbers come from? Searching for "opening_hours"~"–" gives me 2137 nodes, 558 ways and 13 relations.

Minh_Nguyen · March 6, 2024, 6:12pm

Should non-ASCII dashes inside "comments" remain?

Also, are there any non-ASCII dashes inside conditional restriction syntax, which embeds the opening hours syntax?
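
If dashes inside comments should stay untouched, one possible approach is to split the value on double-quoted comments and replace only in the parts outside them. This is a minimal sketch, assuming comments are delimited by plain double quotes and never nested; the function name is hypothetical, and only en and em dashes are handled.

```python
import re

# A comment is assumed to be a double-quoted run with no embedded quotes.
COMMENT = re.compile(r'("[^"]*")')

# Map en dash and em dash to ASCII hyphen-minus.
DASH_TABLE = str.maketrans({"\u2013": "-", "\u2014": "-"})


def fix_outside_comments(value: str) -> str:
    """Replace en/em dashes except inside double-quoted comments."""
    parts = COMMENT.split(value)
    # With one capturing group, odd indices hold the captured comments.
    return "".join(
        part if index % 2 else part.translate(DASH_TABLE)
        for index, part in enumerate(parts)
    )


value = 'Mo\u2013Fr 10:00\u201312:00 "by appointment \u2013 call first"'
print(fix_outside_comments(value))
```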

matheusgomesms · March 6, 2024, 6:17pm

I did mention in my previous message, but thanks! I’ll try to improve the query with the list you provided!

From Overpass Turbo, but I think it gets all nodes from the areas, making the numbers larger. Your number is probably correct, thanks!

Probably, but just a newbie question: does it bork the data if non-ASCII dashes are replaced by ASCII dashes?

Didn’t check, as you can see by the very simple query. But again, same question as above.

I know the proposed solution is very simple, and I don’t want to break any data. Can someone then improve the step-by-step?

While I don’t want to break anything, I also don’t want to overcomplicate this. If I have to go to Python, create a large script etc. and spend a lot of time on this, I’ll likely skip it altogether.

queerthoughts · March 6, 2024, 7:41pm

Adding onto this: apart from conditional restrictions, there’s also opening_hours subkeys (opening_hours:*=*), service_times=* and collection_times=*, and probably other keys as well. Although once you have a script for performing the replacement in opening_hours=*, it shouldn’t be too hard to modify it to work on other keys as well (maybe conditional restrictions will be trickier because of the more complex syntax).
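
A key filter along those lines could look like the sketch below. The function name and the key list are assumptions taken from the keys named above, not a complete inventory. Some opening_hours subkeys may hold values that are not opening hours syntax at all (a URL, for instance), so a match here is only a first filter and the values would still need checking.

```python
# Keys named in the discussion; other keys may use the same syntax.
TIME_KEYS = ("opening_hours", "service_times", "collection_times")


def may_use_opening_hours_syntax(key: str) -> bool:
    """First-pass filter for keys whose values may follow the OH syntax."""
    return key in TIME_KEYS or key.startswith("opening_hours:")


for key in ("opening_hours", "opening_hours:kitchen", "service_times", "name"):
    print(key, may_use_opening_hours_syntax(key))
```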

I would help out more if I could, but sadly I have very little experience with Overpass or automated edits.

SimonPoole · March 6, 2024, 9:03pm

It doesn’t really. I’m not quite sure who is supposed to be helped by this edit to start with. Anybody actually using the data needs a fairly complex parser + evaluator that caters for Unicode lookalike characters in any case (in particular, these won’t vanish, as the reasons for them being used in lieu of the ASCII variants are not going away any time soon).

arctic-rocinante · March 6, 2024, 9:38pm

The OP linked to an example where this isn’t the case (Organic Maps), so the number of impacted consumers definitely isn’t zero. What makes you confident there aren’t other examples @SimonPoole?

It would of course be great if all consumers handled this situation nicely, but I’m not sure if that’s realistic to expect in practice.

matheusgomesms · March 6, 2024, 9:47pm

As I said in my first message, I can see Organic Maps (and consequently Maps.me) benefiting from that, so I would say a lot of end-users are directly impacted.

(OsmAND can handle that, though).

I agree that data consumers should be able to parse that, but:

1 - the data in OSM is wrong, since the ASCII dash (U+002D) is the only correct char
2 - we can’t expect everyone to adjust their thousands of programs to handle our bad data.

SimonPoole · March 6, 2024, 9:57pm

There are at least two non-ASCII dashes and half a dozen different space variants in use. As these are essentially indistinguishable from the ‘correct’ variants there can be no expectation that the problem will go away even if you fix them now.

It would be more sustainable and in the long run less work simply to fix maps.me and OM, which is completely possible.

Minh_Nguyen · March 6, 2024, 10:07pm

It probably won’t bork anything, but your stated rationale is interoperability rather than aesthetics or consistency, so I figure that you’d want to be more conservative with a worldwide mass mechanical edit than with something more local or manual.

That said, there are only 37 elements with en or em dashes inside comments, which shouldn’t be difficult to manually excise from your changeset. Meanwhile, there are 44 elements with en or em dashes in conditional restrictions (not counting the values or comments of those restrictions).

Skimming through the results, I question whether fixing the dashes alone would meaningfully benefit data consumers. Most seem to be copy-pasted from human-readable opening hours on webpages; Mo-Thu 7.00 - 11.00 would be no more consumable than Mo – Thu 7.00 - 11.00. Maybe this is better suited for a MapRoulette challenge than a mechanical edit, similar to how various local communities have been approaching poorly formatted phone numbers?

riiga · March 7, 2024, 7:57am

Performing this edit (replacing dashes) would still be a welcome first pass of fixes on the data, and the average user who might do MapRoulette challenges is more likely to be able to identify and fix errors like Thu or 7.00 than spotting marginally different dashes, so doing both would probably yield the best results.

Regardless, I support this mechanical edit.

Duja · March 7, 2024, 9:32am

The “ASCII dash” U+002D is actually called hyphen or, more precisely, hyphen-minus, and is part of the standard syntax of mainstream programming languages. Other types of dashes are not. They are typically visually distinguishable as well, since the hyphen is shorter than most other dash-like characters.

If we rightfully expect a parser for C++, Java, Python, Ruby, …, to balk at expressions such as x=4–2, x=4—2, x=4−2, per the respective language syntax specification, we surely should not expect our data consumers to handle those. Particularly as we have a very formal specification of the syntax, which explicitly states the permitted tokens:

Basic elements
<plus_or_minus> + -

So yes, those are syntax errors that ought to be fixed, per proposal.
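
The comparison with programming-language parsers is easy to check in Python itself. The sketch below uses the built-in compile() to test whether a snippet parses; the helper name parses is hypothetical.

```python
def parses(source: str) -> bool:
    """Return True if Python accepts source as valid syntax."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# Hyphen-minus, then en dash, em dash and minus sign in the same expression.
for expr in ("x=4-2", "x=4\u20132", "x=4\u20142", "x=4\u22122"):
    print(repr(expr), parses(expr))
```

Only the first expression, written with U+002D, is accepted; the other three are rejected as containing an invalid character.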

matheusgomesms · March 7, 2024, 2:34pm

I agree, the issue on GitHub is open, but I’m not able to fix that. I can correct the data, however, even if it happens again in the future. (And I insist: if the editors warned about this while editing, software probably wouldn’t need to come up with a solution.)

Yes, I’m worried about not breaking anything, and making the data correct. Good you came with different queries, it helps a lot, thanks!

I don’t quite agree about opening a MapRoulette to fix the dashes (but I think in the future, AFTER fixing the dashes, a more general task to fix opening_hours would be valid, just like phone structure you mentioned). As @riiga said, it’s hard to spot the dash error, so there’s no guarantee a mapper would manually fix the dashes, focusing on the other opening hours elements with errors.

Would it be better if I manually fixed these other elements (37+44) and then ran the query to automatically fix the rest?

Minh_Nguyen · March 7, 2024, 3:30pm

Yes, I’m just saying the proposed dash fix is necessary but insufficient.

Sure. I didn’t check if other kinds of syntax errors are prevalent in opening_hours outside of comments, but if you’re confident that the query I came up with reliably detects comments, then that would be a fine step.

Agreed, and even better if editors could present a UI for editing opening hours that doesn’t require an introductory compilers course as a prerequisite.

I’m assuming that most of the non-hyphen-minuses are the result of copy-pasting from another website, or manually typing something human-readable (possibly triggering the automatic dash formatting in e.g. Safari). Either scenario points to another cleanup you could perform: there are 218 opening hours containing non-breaking spaces, not to mention other spaces that may be difficult for both mappers and data consumers to work with.
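
Non-breaking and other unusual spaces are invisible in most editors, so a report is a safer first step than a blind replacement. The sketch below lists every character in the Unicode general category Zs (space separator) other than the ordinary U+0020 space. The function name is hypothetical, and spaces outside category Zs are not detected.

```python
import unicodedata


def odd_spaces(value: str) -> list[str]:
    """List space separators in value other than the ASCII space."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
        for ch in value
        if ch != " " and unicodedata.category(ch) == "Zs"
    ]


print(odd_spaces("Mo-Fr\u00a008:00-18:00"))
```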

Mateusz_Konieczny · March 7, 2024, 8:21pm

Well, obviously fixing data once does not protect against it being broken in future.

But if some dash variant is used in a place where - was supposed to be used, as defined by the OH syntax, then the change is helpful.

Also, the “data consumers should deal with it” solution would require fixing not only OM but also all other future data consumers, and it makes using OSM data harder.

How many of the tags with these bad dashes would become parsable after the proposed bot edit?