Hashing to track contributors of notes

emvee · January 20, 2024, 11:52am

Instead of IP addresses I would opt for something more and at the same time less.

IP addresses are easy to change (VPN etc.) and at the same time somewhat private information.

Browsers leak quite a lot of information, which gives a reasonably unique fingerprint.

Even if no browser is used, some of this information is present.

What I would prefer is to create a fingerprint and put it through a hash. Because the hash is one-way, you cannot get back the information that was used to create the fingerprint.

InsertUser · January 20, 2024, 12:46pm

I don’t really care if it’s a true IP address or some unique ID per IP address. Either way I think we need something that gives a common identifier to comments that apparently share an origin.

This is non-trivial: if you hash after creating the fingerprint, then you leave yourself in a situation where e.g. a slightly different browser window size gives a completely different ID. We need a bit more persistence than that.
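To illustrate the persistence problem, here is a minimal sketch in Python. The attribute names and values are made up for the example, and SHA-256 is just one possible hash; nothing here is an actual OSM implementation.

```python
import hashlib

def fingerprint_id(attributes: dict) -> str:
    """Hash a canonical rendering of the attributes (field names are hypothetical)."""
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    # Keep only the first 16 hex characters to get a short, displayable ID.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Two visits by the same browser; only the window height differs by one pixel.
visit_1 = {"user_agent": "ExampleBrowser/1.0", "language": "en-GB", "window": "1280x720"}
visit_2 = dict(visit_1, window="1280x721")

id_1 = fingerprint_id(visit_1)
id_2 = fingerprint_id(visit_2)
print(id_1, id_2)  # two unrelated-looking IDs
```

Any change in any hashed attribute yields an unrelated ID, so the ID is only as stable as the least stable attribute fed into the hash.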

emvee · January 20, 2024, 1:01pm

Yes, true. But in my view the average user is as likely, or even more likely, to know how to change his/her IP address than to understand the fingerprint-hash.

This “fingerprint-hash” could be published without saying what it is, making it harder for casual users to understand its use, and it is more private than publishing an IP address.

We need a bit more persistence than that.

It will always be possible for an advanced user with more resources to abuse things, so let’s not design for that.

Do you agree that an IP address is not more persistent?

If not, how about the privacy aspects?
If so, what other alternative?

InsertUser · January 20, 2024, 3:51pm

IP addresses have to be acquired somewhere; headers used by fingerprints can be changed on the fly. If we ended up blocking whole IP ranges associated with popular VPN services then I’m OK with that, especially if legitimate contributors only have to spend five minutes setting up an account in order to post.

It’s a bit of a weird double negative, but no, I don’t think that an “IP address is not more persistent”. If you irreversibly hash the IP along with more transient information, your persistence is limited by the most transient property hashed (at least as I understand it).

An IP-based sequential token or hash with a server-side salt (that doesn’t live in GitHub) would make it either very difficult or impossible for another website to use its own information to de-anonymise OSM’s identifier. To my mind they shouldn’t be able to reconstruct the ID just by knowing the IP that went into it, as they don’t have the salt. I don’t know enough about cryptography to be sure of this though, so someone more technically literate would need to weigh in. FWIW I’m also not sure how far OSM needs to go legally, so I don’t really know if it is workable.
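As a rough sketch of the salted-hash idea, a keyed hash (HMAC) from Python's standard library could be used. The secret values, the function name `visitor_id` and the example addresses are assumptions for illustration only; this is not how OSM does or would do it.

```python
import hashlib
import hmac
import ipaddress

def visitor_id(ip: str, server_secret: bytes) -> str:
    """Derive an ID from an IP address and a secret known only to the server."""
    # Normalise first so different spellings of one address give one ID.
    normalised = ipaddress.ip_address(ip).compressed
    digest = hmac.new(server_secret, normalised.encode("ascii"), hashlib.sha256)
    return digest.hexdigest()

# Hypothetical secrets: one for this site, one for some other website.
secret_a = b"example-secret-kept-outside-the-repository"
secret_b = b"a-different-sites-secret"

same_site_1 = visitor_id("203.0.113.7", secret_a)
same_site_2 = visitor_id("203.0.113.7", secret_a)
other_site = visitor_id("203.0.113.7", secret_b)
print(same_site_1 == same_site_2, same_site_1 == other_site)
```

The secret matters because the IPv4 address space is small enough to enumerate: an unkeyed hash of a bare IP could be reversed by hashing every possible address, whereas someone without the secret cannot recompute the keyed version.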

If we went the hash route we could also consider deliberately truncating the result so that with the final ID length we would intend to get some collisions. This would make it very difficult for someone without the admin rights to see the full IP to be sure that any particular comment was definitely the same person as two or three IPs might have been merged during the process and they wouldn’t know if this has happened. This would of course mean that there is a risk of sweeping up a few extra people when a “bucket” is blocked but with low odds, probably dwarfed by the chance they are behind a shared institutional IP address. I’ve seen this approach suggested in a different context (p 12 onward although take this with a grain of salt as the author isn’t a cryptographic or legal expert). I don’t know how well it stacks up against UK and European (e.g. IE) guidance.
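A minimal sketch of the truncation idea, again assuming a keyed hash and a made-up secret. The bit lengths and the private-range example addresses are arbitrary choices for the demonstration.

```python
import hashlib
import hmac

SERVER_SECRET = b"example-secret-not-in-the-repository"  # hypothetical value

def truncated_id(ip: str, bits: int = 16, secret: bytes = SERVER_SECRET) -> str:
    """Keep only the top `bits` bits of the keyed hash so that IDs can collide."""
    digest = hmac.new(secret, ip.encode("ascii"), hashlib.sha256).digest()
    value = int.from_bytes(digest, "big") >> (256 - bits)
    width = (bits + 3) // 4  # hex digits needed to show `bits` bits
    return format(value, f"0{width}x")

# 2000 example addresses squeezed into 8-bit IDs: at most 256 buckets exist,
# so several addresses necessarily share each published ID.
addresses = [f"10.0.{i // 256}.{i % 256}" for i in range(2000)]
ids = [truncated_id(ip, bits=8) for ip in addresses]
distinct = len(set(ids))
print(f"{len(addresses)} addresses fall into {distinct} buckets")
```

The shorter the ID, the more addresses share a bucket, which increases deniability but also the chance of blocking an unrelated visitor along with an abuser.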

Minh_Nguyen · January 20, 2024, 4:12pm

We would be entering a cat-and-mouse game with browser vendors. Browsers are continually revising the signals that they send out to prevent fingerprinting, because most fingerprinting is for more nefarious purposes. In some cases, the browser stops reporting some attribute; in other cases, it continues to report the attribute but only provides a bogus value or introduces some random noise to make it unreliable. (Classic examples are various parts of a user agent string that used to identify hardware models etc., or visited link coloring in CSS.)

KoiAndBlueBird · January 21, 2024, 1:12pm

Yes, and that is disgusting.

The fact that you are even proposing this, is crazy to me. I am mad. This is incompatible with what OSM stands for. Tracking users?! If this gets implemented I sincerely hope that someone writes a script that completely circumvents this.

SomeoneElse · January 21, 2024, 1:36pm

Maybe I’ve misunderstood what you’re trying to say here, but OSM already tracks users in the sense that it doesn’t allow anonymous edits, and users can see who edited what and talk to them about it. It’s an important part of verifiability.

This thread is suggesting that some of the same approach used for map edits also be applied to map notes.

KoiAndBlueBird · January 21, 2024, 1:51pm

I must admit that I did not phrase this well. Here is my point of view:

To me, using nicknames is not what I meant by tracking (but I see how you could make a point for it). Users can change their nicknames. Users have influence over their nicknames. Users can make a new account. Users can use a VPN or something like that to edit.

Users cannot (easily) change their fingerprints. They can make a new account but it continues to track them. It deanonymises users of VPNs. I guess this is exactly what was intended. But I think we need better, more private approaches to this. OSM should not manage this kind of user data. It is a database for edits and accounts, not for browser fingerprints, tracking cookies and whatever else can be used for tracking. In my understanding, OSM is not an anonymous but a pseudonymous community (with the username as the key element). This allows users to stay private without the loss of community and verifiability.

Regarding the “We just hash 'em” idea: SHA-1 is broken already, who knows how long SHA-256 will take? Also: implementing this securely is hard. Do we really want to spend more time on this? I guess you can give some insight, @SomeoneElse? Is the situation that bad right now with vandalism?

I am slightly in favor of anonymous notes, but if the alternative to deprecating them would be to make them track users, I’d rather ditch them for good.

TL;DR

Doing this properly is hard.
This will cost more time than it is worth.
Regardless, OSM should not store this data about internet users.

emvee · January 21, 2024, 3:54pm

Maybe I am wrong (it would be good if anybody could explain how), but it for sure does not deanonymise VPN users, and it looks to me that the proposed hash implements exactly what would be compatible with privacy concerns. An IP address for sure is not.

The only thing you can do with the hash is see that two hashes are the same, so very likely the same (anonymous) user made them. Does anybody (apart from scammers) see that as a privacy concern?

Please come up with a better approach or is this another way of saying you do not like the proposal?

That is a good point: how big is the problem, and would it be helpful to check whether the same (still) anonymous user created multiple notes?

KoiAndBlueBird · January 21, 2024, 6:02pm

First off I’d like to apologise for the sharpness of my post. It wasn’t really up to the standards of discussion on this forum (except for landcover proposals). Thank you for staying on topic and requesting proper arguments.

Well, for starters: all Tor Browser users and all users of the “resist fingerprinting” setting have the same settings. Those parameters are known. Hashing them (even with a salt/nonce) is not a good idea, as it would be easy to reverse engineer.

Give a browser session a unique ID, independent of the parameters/browser/time. I like the idea of assigning the anonymous users a long string like af657e678a..., but remove the hash part. Just assign it randomly.

emvee · January 21, 2024, 9:26pm

If you really think hashing is not secure against reverse engineering, I think you need to educate yourself first; for example, the HTTPS session that you use to connect securely to this forum or to your bank uses hash functions.

How to do that? By setting a cookie?
Is that unique ID persistent over sessions?

KoiAndBlueBird · January 21, 2024, 10:46pm

I know how hashes work. But the data fed into them is problematic, as it is not as diverse as you think.
All “private” browsers will have the same fingerprint. By design. This way, the hash will stay the same. You just need to compute it once by yourself and you will get the hash. This leads to two problems:

a) It will be easy to see who is resisting fingerprinting. This is a privacy concern.
b) People who resist fingerprinting will be wrongfully placed into a batch with vandals who resist fingerprinting as well.
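The two problems above can be sketched as follows. The attribute values, the secret and the visitor labels are invented for the example; the point is only that identical inputs give identical IDs, with or without a salt.

```python
import hashlib
import hmac
from collections import Counter

SERVER_SECRET = b"example-secret"  # hypothetical server-side salt

def fingerprint_id(attributes: dict, secret: bytes = SERVER_SECRET) -> str:
    """Keyed hash over a canonical rendering of the attributes."""
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Made-up values standing in for what a fingerprint-resisting browser reports.
STANDARDISED = {"user_agent": "StandardisedBrowser/1.0", "language": "en-US", "window": "1000x1000"}

visitors = {
    "careful_mapper": dict(STANDARDISED),
    "vandal": dict(STANDARDISED),
    "ordinary_visitor": {"user_agent": "ExampleBrowser/1.0", "language": "en-GB", "window": "1280x720"},
}

ids = {name: fingerprint_id(attrs) for name, attrs in visitors.items()}
bucket_sizes = Counter(ids.values())
print(ids)
```

Here the careful mapper and the vandal end up with the same ID, and anyone who posts once from such a browser learns which ID that shared bucket has.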

Not persistent over sessions. Just assign a random (not yet used?) 512-bit number to a user.
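A minimal sketch of such a random ID, using Python's `secrets` module. The in-memory set used for the "not yet used" check and the function name are assumptions for the example; a real service would need persistent storage.

```python
import secrets

# Hypothetical store of IDs that have already been handed out.
issued_ids = set()

def new_session_id(bits: int = 512) -> str:
    """Return a random hex ID of the given size that has not been issued before."""
    while True:
        candidate = secrets.token_hex(bits // 8)  # 64 random bytes -> 128 hex characters
        if candidate not in issued_ids:
            issued_ids.add(candidate)
            return candidate

first = new_session_id()
second = new_session_id()
print(first[:10] + "...", second[:10] + "...")
```

Because the ID is drawn at random rather than derived from browser or network properties, it reveals nothing about the visitor, but it also disappears whenever the session does. With 512 random bits a repeat is so unlikely that the uniqueness check is mostly a formality.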

pnorman · January 22, 2024, 12:19pm

As someone who was around when notes were introduced, you are correct. It’s also because notes are basically OpenStreetBugs on osm.org, but the word bugs was considered too negative and restrictive.

SomeoneElse · January 21, 2024, 11:01pm

I’m not sure what any of this has to do with this topic (“We don’t need anonymous notes”)? Either people have to sign in before adding new OSM notes, or they don’t. As mentioned above, we already require map edits to be at least from a particular OSM account. Is there a genuine privacy reason why people should not need to sign up to OSM before adding notes?

I’d always assumed that the reason behind allowing anonymous notes was “ease of use” - allow people to quickly comment without signing up; nothing to do with privacy.

InsertUser · January 23, 2024, 4:04pm

I think the ease of use is a good reason to keep them. To my mind the main reason to not sign up probably isn’t privacy but the effort to create a login.

I don’t think we would put too many people off if we said something like:

As you are not logged in your note will be published showing your visitor ID as author ({VISITOR ID}). If you do not wish to share this publicly please create a login first before adding a note.

This would be similar to Wikipedia’s use of IP addresses (and probably IP based), but without showing the actual IP as that is apparently beginning to be a problem from a privacy standpoint.

I think the benefit of this would be that normal users are more likely to be able to spot if there is a frequent abuser of the anonymous notes function active in their area and flag them to mods more easily.

Concerns have been raised about people who are worried about reprisals for adding notes, but I think people like that should be taking stronger steps to preserve their anonymity anyway and a warning that they might not be completely untraceable is beneficial to them in the long term.

Edit: Changed “connection ID” to “visitor ID”. We’d need a name that implies some degree of permanence.