I just got around to consider your question mathematically.
First of all some definitions:
list_A = a1, a2, a3, ...
list_B = b1, b2, b3, ...
Eq = algorithm(list_A, list_B)
positive: (a, b) in Eq
False positive: (a, b) in Eq and a != b
False positive rate: #(false positive)/#positive
Assuming a constant false positive rate (FPR) and a sample small enough to be roughly random (probably around 100 max.), we can consider the probability that our result s
out of n
samples is because of FPR p
:
P(s | p) = p**s * (1-p)**(n-s) * binomial(n, s)
Obviously, the probability p = s/n
has the highest value, though quite small. But we want the probability that the FPR is in a certain range, so we normalise against the total, i.e. we integrate the above between 0 and 1, which happens to be 1/(binomial(n, s) * (n+1))
.
The integral from p_1 to p_2 is also useful to get the probability of a range. So we define the function P_{n,s}(p_1, p_2) = integral(p_1,p_2,n,s) / integral(0,1,n,s)
. This function gives us the probability that the FPR is between p_1
and p_2
, i.e. our confidence level.
A trivial thing to note is that P_{n,s}(0, 1) = 1
, so our FPR is always between 0 and 1 (duh).
An observation of mine is that the confidence is highest for a given precision r
around [s/n - r/2, s/n + r/2]
. I don’t have a proof for the precise range, but both ends of it must have the same value, i.e. P(s | p_1) = P(s | p_1 + r)
. It’s easy to see that this range is unique and always exists.
Results
As shown above, you can always achieve a 95% confidence level, but the range can be quite large. With 100 samples, you get a precision of 10%, i.e. a margin of error of about 5%. With just 10 samples, the precision is 27% = 13.5% margin of error.
For a precision of 1%, you might need between 7000-10000 (random) samples, which isn’t possible with just 1500 list elements, never mind the effort.
20 samples provide a margin of error of about 10%.
A sample is a collection of pairs which the algorithm considers to be the same object. A low FPR decreases the margin of error slightly for a given confidence level.
Not sure how useful this is to you, but 100 samples seems like a good balance.