Validating data: how many UK mosques do I need to sample?

I have two lists of similar POIs. Each list is ~1500 items.

I need to know precisely which POIs are in list A, and not in B, and vice versa.

I have a formula to match the items, but it won’t be perfect.

How many should I manually sample before I can conclude what the error rate is for the match formula?

(False negatives and false positives are equally interesting, but false positives are more serious.)

Thanks

What’s your expected confidence level and margin of error?

Keep in mind that the error rate may vary significantly between different geographical areas and types of POIs.

Ah, you see now you’re asking maths-type questions and I’m no mathematician :slight_smile:

If I understand the question:

I will be happy if I know the false positive rate with 95% confidence.

(Is the false negative rate necessarily the same?)

No, you can trade false negatives for false positives.
Simple example: Your algorithm returns everything as a negative. You will necessarily have 0 false positives (nothing is positive), but you also have a 100% false negative rate. Flip the algorithm and you get a 100% false positive rate and no false negatives.
Good algorithms can reduce the sum of both, making improvements to both.
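
To make the trade-off concrete, here is a minimal sketch of the two degenerate matchers described above. The labels are made up, and the rates are taken per actual negative and per actual positive, which is an assumption on my part (other definitions exist).

```python
# Hypothetical hand-checked labels: True means the pair really is the same place.
truth = [True, True, True, False, False, False, False, False]

def rates(predicted, actual):
    """Return (false positive rate, false negative rate), measured against
    the actual negatives and the actual positives respectively."""
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)
    positives = sum(actual)
    return fp / negatives, fn / positives

all_negative = [False] * len(truth)  # matcher that never reports a match
all_positive = [True] * len(truth)   # matcher that reports everything as a match

print(rates(all_negative, truth))  # (0.0, 1.0)
print(rates(all_positive, truth))  # (1.0, 0.0)
```

A real matcher sits somewhere between these two extremes, and tuning its threshold moves errors from one column to the other.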

No math magician here either, so pls take my advice with a grain of salt :slight_smile:
In theory, you should be able to validate your error rate with fewer than 100 tests, but my concern is that the distribution is likely lumpy (though randomizing the sets should mitigate this).

The false-pos/false-neg rates will likely vary.

What specific sets are you comparing? @wcedmisten recently compared restaurants in OSM vs. Overture; his methodology may be useful to you. @Minh_Nguyen’s reply in that thread raises good points (especially the links to research papers on OSM quality assessment).

Thanks, interesting read.

I’m doing something comparable, but much closer to home.

I have a privately-curated list of mosques in the UK & Ireland (not curated by me, and it’s not open data).

I want to compare it against OSM to see what overlap there is, and whether each list can improve the other.

First task is simply to match the data, and see which mosques are in both lists.

This should be fairly straightforward, with some caveats (see below).

Once you’ve geocoded the addresses from your list, you can perform an around query (or a simple buffered bbox) to pick up candidate matches in the OSM data.
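
As a rough sketch of the candidate-matching step, the snippet below does the "around" check in plain Python using a great-circle distance. The 100 m radius, the coordinates, and the ids are all made up for illustration; in practice the features would come from a GeoDesk or Overpass query.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def candidates(point, osm_features, radius_m=100.0):
    """Return (distance, id) for OSM features within radius_m of a geocoded
    point, nearest first. The default radius is an arbitrary assumption."""
    lat, lon = point
    hits = [(haversine_m(lat, lon, f["lat"], f["lon"]), f["id"]) for f in osm_features]
    return sorted((d, i) for d, i in hits if d <= radius_m)

# Made-up coordinates and ids, for illustration only.
osm = [
    {"id": "node/1", "lat": 53.4000, "lon": -1.5000},
    {"id": "way/2", "lat": 53.4005, "lon": -1.5000},
    {"id": "node/3", "lat": 53.5000, "lon": -1.5000},
]
print(candidates((53.4001, -1.5000), osm))
```

An entry with no candidates is a likely "in A but not in B", and one with several candidates is exactly the kind of pair worth checking by hand.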

Here are some stats of current OSM data for the UK and Ireland (there may be some overlap, as Northern Ireland is in both sets):

$ gol query uk na[amenity=place_of_worship][religion=muslim] -f=count
977

$ gol query ireland na[amenity=place_of_worship][religion=muslim] -f=count
18

$ gol query uk na[building=mosque] -f=count
197

$ gol query ireland na[building=mosque] -f=count
5

Curiously, over a third of building=mosque features are not tagged amenity=place_of_worship. From a quick glance, it appears most of these are in active use, so amenity=place_of_worship is appropriate. So there’s definitely room for improvement on the OSM side.

There are also cases of feature duplication (a smaller structure within a larger complex, or a building and a node mapped separately). The former should probably become a building:part; in the latter case, the tags of the node should be merged into the building itself.

(The latter case appears to be a common pattern in the UK: building=mosque with no other tags, separate node that contains the details)

I just got around to considering your question mathematically.

First of all some definitions:

list_A = a1, a2, a3, ...
list_B = b1, b2, b3, ...

Eq = algorithm(list_A, list_B)
positive: (a, b) in Eq
False positive: (a, b) in Eq and a != b
False positive rate: #(false positive)/#positive

Assuming a constant false positive rate (FPR) and a sample small enough to be roughly random (probably around 100 max.), we can consider the probability of finding s false positives among n sampled matches, given an FPR of p:

P(s | p) = p**s * (1-p)**(n-s) * binomial(n, s)

Obviously, p = s/n gives the highest value, though that value is quite small. But we want the probability that the FPR is in a certain range, so we normalise against the total, i.e. we integrate p**s * (1-p)**(n-s) between 0 and 1, which happens to be 1/(binomial(n, s) * (n+1)). The constant factor binomial(n, s) cancels in the normalisation.
The integral from p_1 to p_2 is also useful to get the probability of a range. So we define the function P_{n,s}(p_1, p_2) = integral(p_1,p_2,n,s) / integral(0,1,n,s). This function gives us the probability that the FPR is between p_1 and p_2, i.e. our confidence level.
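
For anyone who wants to play with the numbers, here is a minimal sketch of P_{n,s}(p_1, p_2) using plain numerical integration (midpoint rule). The function names and the step count are my own choices, and the example figures are hypothetical. It should be fine for samples in the hundreds; much larger n would need logarithms to avoid underflow.

```python
def likelihood(p, n, s):
    """Unnormalised likelihood p**s * (1-p)**(n-s).
    The binomial(n, s) factor is left out because it cancels in the ratio."""
    return p ** s * (1 - p) ** (n - s)

def integral(p1, p2, n, s, steps=20000):
    """Midpoint-rule approximation of the integral of the likelihood from p1 to p2."""
    width = (p2 - p1) / steps
    return width * sum(likelihood(p1 + (i + 0.5) * width, n, s) for i in range(steps))

def range_probability(n, s, p1, p2):
    """P_{n,s}(p1, p2): probability that the false positive rate lies in [p1, p2]."""
    return integral(p1, p2, n, s) / integral(0.0, 1.0, n, s)

# Hypothetical outcome: 5 wrong matches found among 100 sampled matches.
print(range_probability(100, 5, 0.0, 1.0))
print(range_probability(100, 5, 0.02, 0.10))
```

The first call is the sanity check (the whole range), the second asks how confident we can be that the FPR is between 2% and 10%.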

A trivial thing to note is that P_{n,s}(0, 1) = 1, so our FPR is always between 0 and 1 (duh).
An observation of mine is that, for a given precision r, the confidence is highest for a range around s/n, roughly [s/n - r/2, s/n + r/2]. I don’t have a proof for the precise range, but both ends of it must have the same value, i.e. P(s | p_1) = P(s | p_1 + r). It’s easy to see that this range is unique and always exists.
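
One way to find that range numerically, sketched under the same assumptions (constant FPR, no prior knowledge about p): chop [0, 1] into small cells, then collect cells from most to least likely until they hold the desired confidence. The function name, the grid size, and the sample outcomes are hypothetical.

```python
def narrowest_interval(n, s, level=0.95, cells=20000):
    """Approximate the narrowest range of p that holds `level` of the
    probability, for s false positives among n sampled matches."""
    width = 1.0 / cells
    mids = [(i + 0.5) * width for i in range(cells)]
    weights = [p ** s * (1 - p) ** (n - s) for p in mids]
    total = sum(weights)
    # Take cells from most to least likely until they hold `level` of the mass.
    # The curve has a single peak, so the chosen cells form one contiguous
    # range whose two ends have (almost) the same likelihood.
    order = sorted(range(cells), key=lambda i: weights[i], reverse=True)
    mass, lo, hi = 0.0, cells, -1
    for i in order:
        mass += weights[i] / total
        lo, hi = min(lo, i), max(hi, i)
        if mass >= level:
            break
    return lo * width, (hi + 1) * width

for n, s in [(10, 1), (20, 1), (100, 5)]:  # hypothetical sample outcomes
    low, high = narrowest_interval(n, s)
    print(f"n={n:3d} s={s}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

Running it for a few sample sizes shows how quickly (or slowly) the range narrows as n grows.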

Results

As shown above, you can always achieve a 95% confidence level, but the range can be quite large. With 100 samples, you get a precision of 10%, i.e. a margin of error of about 5%. With just 10 samples, the precision is 27%, i.e. a margin of error of 13.5%.
For a precision of 1%, you might need between 7000 and 10000 (random) samples, which isn’t possible with just 1500 list elements, never mind the effort.
20 samples provide a margin of error of about 10%.

A sample is a collection of pairs which the algorithm considers to be the same object. A low FPR decreases the margin of error slightly for a given confidence level.
Not sure how useful this is to you, but 100 samples seems like a good balance.

Have you tried JOSM and the JOSM Conflation plugin?

I found it useful for the same kind of task.

  1. Open JOSM.
  2. Make a query in Overpass to select the OSM data I want and export it to JOSM.
  3. Load the other dataset from a file in a new OSM layer.
  4. Proceed with the conflate plugin usage to my needs…

Thanks for all replies.

I’m not a very technical person, but getting there slowly, I hope. So some of this I need to ask further questions on.

I take it you’re using Overpass here? Is this query doable with Overpass Turbo, which I can just about use?

I had noticed this inductively, and it needs a closer look. Sometimes it’s simply duck-tagging, as in two cases in Scotland I found recently, where purpose-built mosques did not have the place_of_worship tag. This is unsurprising: it’s really not intuitive that “building=mosque” is not a place of worship. That’s especially true if the mapper is Muslim: in Islamic law an echt mosque cannot be used for any other purpose, unlike a church.

The vast majority of places of Muslim worship in the UK are in buildings built for another purpose - warehouses, shops, dwelling houses, etc. I’m not sure how confident we can be that the tag building=mosque usually means what the wiki says it means, and not “Muslim place of worship”.

The mosque-in-a-mosque duplication you flagged is also interesting (Medina Mosque, Sheffield). The editor in your example was StreetComplete. Possibly there’s a bias in the way challenges are presented to Completers, or even a bug in StreetComplete tagging, that results in domes and minarets being tagged as mosques in their own right. I think there’s an Overpass query to search for X in Y, so I will run that at some point.

This bit is very useful! Thank you. I will ask a statistically-minded friend to talk me through the workings - and your work is appreciated even if I can’t understand it yet. :slight_smile:

Once again I find that I really need to learn JOSM.

The around query is part of the GeoDesk OpenStreetMap Toolkit, but Overpass has a similar construct.

GeoDesk runs queries locally against a Geographic Object Library (GOL) using Python or Java.

For example, this script allows you to find all cases of a “mosque inside a mosque”:

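from geodesk import *

# Open the local Geographic Object Library (GOL) covering the UK
uk = Features("uk.gol")

# Nodes and areas tagged as a mosque building or as a Muslim place of worship
mosques = uk("na[building=mosque],na[amenity=place_of_worship][religion=muslim]")

count = 0
for a in mosques:
    # Check every mosque feature that lies within feature a
    for b in mosques.within(a):
        if b != a:
            print(f"{b} is inside {a}")
            count += 1
print(f"{count} cases found.")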

These are the results (UK only):

node/9907376837 is inside way/81333651
node/35302217 is inside way/81330344
node/300299410 is inside way/1250590435
way/78514733 is inside way/474013485
node/2050786894 is inside way/655150264
way/178352103 is inside way/669190825
node/4594605802 is inside way/464339045
node/1669908794 is inside way/879170764
node/12194471590 is inside way/922284660
node/10964107570 is inside way/1180514091
node/11407818897 is inside way/1230137239
node/11006277560 is inside way/1184888096
node/9925310449 is inside way/1083009765
node/11414552081 is inside way/1230819180
node/11307554901 is inside way/1220281659
node/286960298 is inside way/1128108530
node/12134451374 is inside way/1310913290
node/286960306 is inside way/1126104205
node/12187072817 is inside way/1316797580
node/12175038570 is inside way/1315345013
way/766492678 is inside way/263969570
way/746392192 is inside way/70391069
way/168587591 is inside way/70391069
way/746392193 is inside way/70391069
node/10980642881 is inside way/387360626
node/5112598303 is inside way/524048975
node/11048584806 is inside way/358297336
node/3740606988 is inside way/459225247
way/461767650 is inside way/461767656
node/2698095414 is inside way/496623389
node/12161395011 is inside way/152752718
node/20695974 is inside way/164623269
node/12161403040 is inside way/942505220
node/2623579277 is inside way/1232863799
node/1830154145 is inside way/330359922
node/11195648179 is inside way/1209143228
node/499992689 is inside way/679248276
node/316580811 is inside way/679248291
node/364829368 is inside way/743364306
39 cases found.

Not all of these are problematic – some are situations where the mosque is inside a landuse=religious area. The node-within-building case is more frequent. If a building contains a mosque, then this is the correct form of tagging. However, if the building is the mosque, ideally the node should be deleted and its tags applied to the building.

Thanks for the query & results.

Is this now orthodoxy? I had understood the point was debated - with some in the “POI as a node” camp and some in the “POI is the whole building” camp.

It seems clear to me that building=mosque should only be on a building; and if so, then having a node amenity=place_of_worship within it is pretty nonsensical.

But if it’s a Midlands brick terraced house, the case doesn’t seem so clear cut. Has the jury come in?

I’d say that it depends very much on the situation. Some mosques are quite large and encompass several buildings.

As an aside, I had a quick look at Overture to see if it could be a fourth list of mosques to cross-reference, but it’s comically inaccurate. On the upside it has ~75% of local mosques, and in the right locations, but the other POIs are so out of date, badly placed, or badly spelled, that I don’t think it’s worth a longer look.

Now all resolved. Thank you for the info!

Great job!

For Ireland, I only found 2 instances:

node/9905082217 is inside way/767961562
way/110918141 is inside way/23997665

Keep in mind that if a building is exclusively used as a mosque, but is otherwise nondescript (i.e. it is not purpose-built or doesn’t have the architectural elements of a mosque), it may make sense to change building=mosque to a plain building=yes (or another type that reflects the architectural characteristics of the structure).

A site relation could come in handy for these? Or would it be better to create a landuse area (Similar to a university campus)?

If it’s in one location, and e.g. all has one address and shares other tags, then one (multi)polygon makes sense. That approach makes less sense where parts of one entity are spread over town. That’s a real problem now for universities but I’m guessing less so for mosques.