Automated edit proposal - convert non-standard dashes to standard dashes

The OP linked to an example where this isn’t the case (Organic Maps), so the number of impacted consumers definitely isn’t zero. What makes you confident there aren’t other examples @SimonPoole?

It would of course be great if all consumers handled this situation nicely, but I’m not sure if that’s realistic to expect in practice.

4 Likes

As I said in my first message, I can see Organic Maps (and consequently Maps.me) benefiting from that, so I would say a lot of end-users are directly impacted.

(OsmAND can handle that, though).

I agree that data consumers should be able to parse that, but:

1 - the data in OSM is wrong, since the ASCII dash is the only correct char
2 - we can’t expect everyone to adjust their thousands of programs to handle our bad data.

There are at least two non-ASCII dashes and half a dozen different space variants in use. As these are essentially indistinguishable from the ‘correct’ variants there can be no expectation that the problem will go away even if you fix them now.

It would be more sustainable and in the long run less work simply to fix maps.me and OM, which is completely possible.

1 Like

It probably won’t bork anything, but your stated rationale is interoperability rather than aesthetics or consistency, so I figure that you’d want to be more conservative with a worldwide mass mechanical edit than with something more local or manual.

That said, there are only 37 elements with en or em dashes inside comments, which shouldn’t be difficult to manually excise from your changeset. Meanwhile, there are 44 elements with en or em dashes in conditional restrictions (not counting the values or comments of those restrictions).

Skimming through the results, I question whether fixing the dashes alone would meaningfully benefit data consumers. Most seem to be copy-pasted from human-readable opening hours on webpages; Mo-Thu 7.00 - 11.00 would be no more consumable than Mo – Thu 7.00 - 11.00. Maybe this is better suited for a MapRoulette challenge than a mechanical edit, similar to how various local communities have been approaching poorly formatted phone numbers?

4 Likes

Performing this edit (replacing dashes) would still be a welcome first pass of fixes on the data, and the average user who might do MapRoulette challenges is more likely to be able to identify and fix errors like Thu or 7.00 than to spot marginally different dashes, so doing both would probably yield the best results.

Regardless, I support this mechanical edit.

4 Likes

The “ASCII dash” U+002D is actually called hyphen or, more precisely, hyphen minus and is part of the standard syntax for all programming languages. Other types of dashes are not. And they are typically visually distinguishable as well, since the hyphen is shorter than most other dash-like characters.

If we rightfully expect a parser for C++, Java, Python, Ruby, …, to balk at expressions such as x=4–2, x=4—2, x=4−2, per the respective language syntax specification, we surely should not expect our data consumers to handle those. Particularly as we have a very formal specification of the syntax, which explicitly states the permitted tokens:

Basic elements
<plus_or_minus> + -

So yes, those are syntax errors that ought to be fixed, per proposal.
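To illustrate the point about language parsers, here is a minimal Python sketch that feeds the expressions above to Python's own compiler. The names `CANDIDATES` and `parses` are made up for this example, and it only demonstrates Python's behaviour, not that of any opening_hours parser.

```python
# Check how Python's own parser treats dash look-alikes in an expression.
# CANDIDATES and parses() are hypothetical names used only for this sketch.
CANDIDATES = {
    "hyphen-minus U+002D": "x=4-2",
    "en dash U+2013": "x=4\u20132",
    "em dash U+2014": "x=4\u20142",
    "minus sign U+2212": "x=4\u22122",
}


def parses(source: str) -> bool:
    """Return True if `source` compiles as Python, False on a syntax error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


results = {name: parses(src) for name, src in CANDIDATES.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

Only the hyphen-minus variant compiles; the other three are rejected by the tokenizer as invalid characters.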

7 Likes

I agree, the issue on GitHub is open, but I’m not able to fix that myself. I can correct the data, however, even if the problem recurs in the future. (And I insist: if the editors warned about this while editing, software probably wouldn’t need to come up with a solution.)

Yes, I’m worried about not breaking anything, and about making the data correct. Good that you came up with different queries, it helps a lot, thanks!

I don’t quite agree about opening a MapRoulette to fix the dashes (but I think in the future, AFTER fixing the dashes, a more general task to fix opening_hours would be valid, just like phone structure you mentioned). As @riiga said, it’s hard to spot the dash error, so there’s no guarantee a mapper would manually fix the dashes, focusing on the other opening hours elements with errors.

Would it be better if I manually fixed these other elements (37+44) and then ran the query to automatically fix the rest?

Yes, I’m just saying the proposed dash fix is necessary but insufficient.

Sure. I didn’t check if other kinds of syntax errors are prevalent in opening_hours outside of comments, but if you’re confident that the query I came up with reliably detects comments, then that would be a fine step.
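For what it's worth, the "leave comments alone" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: comments are taken to be anything between a pair of double quotes, an unterminated quote is treated as ordinary text, and the set of dash characters is my own pick rather than anything from the specification. The function name `fix_outside_comments` is hypothetical.

```python
import re

# A double-quoted run is treated as a comment (assumption for this sketch).
COMMENT = re.compile(r'("[^"]*")')
# U+2010..U+2014 plus U+2212 MINUS SIGN; an illustrative, non-exhaustive set.
DASHES = re.compile("[\u2010-\u2014\u2212]")


def fix_outside_comments(value: str) -> str:
    """Replace dash look-alikes with '-' everywhere except inside comments."""
    parts = COMMENT.split(value)
    # With one capture group, re.split puts the quoted comments at odd indices.
    return "".join(
        part if i % 2 else DASHES.sub("-", part)
        for i, part in enumerate(parts)
    )


sample = 'Mo\u2013Fr 09:00\u201317:00 "by appointment \u2013 call first"'
fixed = fix_outside_comments(sample)
print(fixed)
```

The en dash inside the quoted comment is kept as the mapper wrote it, while the two outside it become hyphen-minuses. A real mechanical edit would still need a manual review of the values where quoting is unbalanced.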

Agreed, and even better if editors could present a UI for editing opening hours that doesn’t require an introductory compilers course as a prerequisite.

I’m assuming that most of the non-hyphen-minuses are the result of copy-pasting from another website, or manually typing something human-readable (possibly triggering the automatic dash formatting in e.g. Safari). Either scenario points to another cleanup you could perform: there are 218 opening hours containing non-breaking spaces, not to mention other spaces that may be difficult for both mappers and data consumers to work with.
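A rough way to find such spaces, as a Python sketch: it flags every character in Unicode category Zs (space separator) other than the plain U+0020 space. The helper names `odd_spaces` and `normalize_spaces` are invented for this example, and zero-width characters (category Cf) are deliberately not covered.

```python
import unicodedata


def odd_spaces(value: str) -> list:
    """List the code points of space separators other than U+0020."""
    return [
        f"U+{ord(ch):04X}"
        for ch in value
        if ch != " " and unicodedata.category(ch) == "Zs"
    ]


def normalize_spaces(value: str) -> str:
    """Replace every space separator (category Zs) with a plain space."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in value
    )


sample = "Mo-Fr\u00a008:00-17:00"  # contains a no-break space
found = odd_spaces(sample)
cleaned = normalize_spaces(sample)
print(found, cleaned)
```

Listing the offending code points first, rather than replacing blindly, makes it easier to review what a cleanup would actually touch.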

3 Likes

Well, obviously fixing data once does not protect against it being broken in future.

But if some hyphen variant is used in a place where - was supposed to be used, as defined by the OH syntax, then the change is helpful.

Also, the “data consumers should deal with it” solution would require fixing not only OM but also all other future data consumers. And it makes using OSM data harder.

How many of the tags with these bad dashes would become parsable after the proposed bot edit?

5 Likes

Discouraging folks from fixing things seems a pretty bad look IMO. No edit ever fixes everything in the db in one swoop and for all time. Holding an edit to that standard is preposterous. We all get to contribute in our own way. This kind of edit is fine and should be encouraged. It’s not making anything worse and will make things better in some cases.

7 Likes

This is not your run-of-the-mill unstructured tag. You need to invest a substantial amount of effort one way or the other to parse and evaluate the values mechanically.

Normalizing the characters is a 0.0001% thing, literally totally irrelevant from an effort pov and anything robust will do that in any case. For human consumption the lookalike chars are exactly that and the differences are completely irrelevant.
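As a rough idea of how small that normalization step is on the consumer side, here is a Python sketch using `str.translate`. The list of look-alikes is an illustrative assumption, not taken from any parser or from the specification, and `normalize_dashes` is a hypothetical helper name.

```python
# Dash-like characters to fold into U+002D HYPHEN-MINUS before parsing.
# Illustrative, non-exhaustive list: hyphen, non-breaking hyphen, figure dash,
# en dash, em dash, minus sign.
DASH_LOOKALIKES = "\u2010\u2011\u2012\u2013\u2014\u2212"
DASH_TABLE = str.maketrans({ch: "-" for ch in DASH_LOOKALIKES})


def normalize_dashes(value: str) -> str:
    """Return `value` with every listed look-alike replaced by '-'."""
    return value.translate(DASH_TABLE)


normalized = normalize_dashes("Mo\u2013Fr 08:00\u201317:00")
print(normalized)
```

Note that this version also rewrites dashes inside quoted comments, which is harmless for parsing but would not be appropriate for an edit to the data itself.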

And I’m not advocating doing nothing, I’m suggesting to fix the subpar implementation.

This is true, though opening_hours=* is the antithesis of the sort of freeform key that mappers generally take a laissez-faire approach toward. I’m not surprised to see both mappers and developers expecting more consistent usage of this key. Thanks to its presentation and complexity, the opening hours specification seems a lot more airtight and unambiguous than it really is.

A similarly superficially rigorous key is phone=*. It turns out that, in some languages and regions, popular style guides or publishing standards mandate the use of en dashes, perhaps one of the reasons why 786 phone numbers contain en dashes. (There’s a particularly high concentration in the D–A–CH region, possibly due to the wiki’s stipulation of DIN 5008, though I don’t have access to the standard to be sure.)

Per Postel’s law, we could strive to make the database more consistent and developers could make their software more forgiving of syntax errors that slip through.

6 Likes

I looked at the “not a reference implementation” OpeningHoursParser/src/main/java/ch/poole/openinghoursparser/OpeningHoursParser.jj at 42d1821b0db3ada0fd9b3cd0da609b99c10d6354 · simonpoole/OpeningHoursParser · GitHub and indeed, adding more look-alikes seems trivial. Perhaps this topic can lengthen the list?

Some of these comments make it sound like we’re browser makers…

Downstream consumers are (mostly) out of our control, but the data isn’t. According to the spec (as @Duja has already linked), it’s illegal to use a non-ASCII minus. So values using a non-ASCII minus are illegal.

Yes, maybe downstream consumers can/could/will/would/shall/should handle this themselves, but only because the upstream data is wrong. Stop yak shaving y’all.

Thanks @matheusgomesms for the initiative! Hope it’s clear that I support this (after the usual bureaucratic stuff).

6 Likes

It does not match the specification, but calling it illegal is taking it a bit too far :slight_smile:

Note that in English “illegal” means “breaking the law”, and misformatting opening hours in OSM is not something that breaks any legal rules anywhere.

“Not allowed” would be a better term. (I would have sent this as a PM, but it seems you have blocked them.)

2 Likes

For what it’s worth, the term “illegal” in the sense “not according to specification” has a history in IT circles:
(https://www.pcmag.com/encyclopedia/term/illegal-operation)

An operation that is not authorized or understood. An “illegal operation” error message typically means that the computer has been directed to execute an invalid instruction and has stopped or has terminated the offending application (see abend).

…although it has fallen out of fashion lately.

6 Likes

As @Duja said. It’s also commonly used in (formal) language theory: an illegal string is one that does not conform to the (formal) language in question.

7 Likes

Hahahahaha.

Calling that page a “specification” (and yes, it is the name of the page) is a bit of a misnomer. A specification implies that people adding opening hours values have been told that they need to add values that conform to that specification, when in practice almost no-one editing data in OSM has seen it. With a bit of luck, they’ll be using an editor that can extract what they know about opening hours and format that into something that other people can understand, but there will always be edge cases (e.g. “I know that this place stays open late on a Thursday but don’t know any other opening hours”).

To be clear, “open late on a Thursday” is in no way a machine-readable opening hours string but it is more useful than writing nothing at all.

2 Likes

Yes, let us shave a few yaks more.

Next you’re gonna tell me that ISO standards, IETF RFCs, HTML, ECMA, and others are not specs because nobody reads them. But first tell me something serious, I can’t stop laughing…

Now, apparently this has a name – “derailing”, someone in this forum called it. So can we get back on the rails? What do you have against the proposed edit?

It is machine-readable if you tell the machine how to read it. No magic. And that’s within the spec, it’s called a comment, as long as it’s surrounded by double quotes.

1 Like

Now it’s me that’s laughing :smiley:

(my emphasis, obviously)