Saturday, August 11, 2012

Profiling: a math experiment

In an earlier post, I summarized a paper that used a simple mathematical model of profiling to determine the best strategy.  I also argued that this paper was too idealized to apply to real-world situations, and it does not apply to the case of airplane hijackers at all.

Here I will expand on the paper just slightly.  The paper still won't apply to real-world situations, and still won't apply to airplane hijackers.  Basically this is just math for its own sake.

Here is a re-summary of the paper's model:  We have a large number of people with one unknown malfeasor.  Different people have different prior probabilities of being the malfeasor, and the government knows these probabilities.  The government must search people one by one, but has no memory of who it has searched before.  Therefore, the government must sample people randomly with predetermined sampling probabilities.

The paper chooses the strategy which minimizes the mean number of searches before the government catches the malfeasor.  However, I think it makes more sense to maximize the probability of catching the malfeasor with N searches.  This is a subtle difference, but it completely changes the math.

Math ahoy!

Let pi be the prior probability that person i is the malfeasor.  Let qi be the probability that the government will search person i in any given search.  The goal is to choose qi, given pi so that we maximize the probability of catching the malfeasor within N searches.  Let M be the total number of people.  We'll assume that M and N are very large numbers.

Consider the case where person j is the malfeasor.  The probability of not catching them in any given search is (1-qj).  The probability of not catching them at all in N searches is (1-qj)N.  This is not quite our probability of failure, because we still have to average over all possible people who could be the malfeasor.  And when we do the average, we have to weight the average by the probabilities pj.  Here is our probability of failure:

But it's not as easy as picking the qj which minimize this figure.  qj must also obey additional constraints (eg their sum total is 1).  Fortunately, there is a technique to account for these constraints.  It's called the Lagrange multiplier method, and I don't have the space here to explain it.  Here's a solution:

Oh no, I chased away all my readers with a scary equation!  But it gets worse... this is only sometimes the solution.*  In certain situations, the above result will give negative values for qj.  The physical meaning of this is that sometimes there are people who are best ignored completely, even if there is a small chance they are the malfeasor.  Naturally, the larger number of searches you can make, the fewer people who should be ignored.

A simple case

Suppose that we have two groups of people, one which composes 99% of the population, and the other which composes 1%.  And suppose that any given person in this 1% is 100 times as likely to be a malfeasor as any given person in the 99%.  Finally, let's say that the number of searches is equal to the number of people (but recall that we can't prevent redundant searching).  What should the sampling probabilities be?

The answer: You should check on people in the 1% more often by a factor of about 5.6.**  (This is proportional to the log of 100.)  Since there are far more people in the 99%, you should be performing 5.6 searches on the 1% for every 99 searches on the 99%.

This is only one particular limit of a very complicated solution, and you cannot draw more general conclusions from the results of this case.  I tried a bunch of things, and one general trend is that the larger number of searches you can make, the more equitable your searching should be.

Sweet FSM, this was useless, but it sure was fun.  There's one more post on profiling math coming.

*Details for those understand the math: Applying Lagrange multipliers is somewhat tricky because I don't know which constraints to apply.  Some qj will have values of 0, bumping up right against the constraint that all qj must be positive.  But for other qj, the constraint is irrelevant.  The form of the solution depends on which constraints apply, and which don't.  Mathematicians might have a better way of handling this, but I can only handle it by "guessing" which constraints are important.
**The general formula, when N and M are very large, is ( N + M(1-K)Ln(a/b) )/( N - M K Ln(a/b) ), where K is the fraction of the population in group A, a is the prior probability that a person in A is a malfeasor, and b is the prior probability that a person not in group A is a malfeasor.  Unless I made an error.

1 comment:

Larry Hamelin said...

Groovy! Thanks!

I'm getting ready to dive into undergrad mathematical statistics, so this is interesting.