In the latest gaffe to demonstrate the privacy perils of anonymized data, New York City officials have inadvertently revealed the detailed comings and goings of individual taxi drivers over more than 173 million trips.
City officials released the data in response to a public records request and specifically obscured the drivers’ hack license numbers and medallion numbers. Rather than including those numbers in plaintext, the 20-gigabyte file contained one-way cryptographic hashes generated with the MD5 algorithm. Instead of a record showing medallion number 9Y99 or hack number 5296319, for example, those numbers were converted to 71b9c3f3ee5efb81ca05e9b90c91c88f and 98c2b1aeb8d40ff826c6f1580a600853, respectively. Because they’re one-way hashes, they can’t be mathematically converted back into their original values. Presumably, officials used the hashes to preserve the privacy of individual drivers, since the records provide a detailed view of their locations and work performance over an extended period of time.
It turns out there’s a significant flaw in the approach. Because both the medallion and hack numbers follow predictable patterns, the set of possible values is small, and it was trivial to run every candidate through the same MD5 algorithm and compare the output to the hashes in the 20GB file. Software developer Vijay Pandurangan did just that, and in less than two hours he had completely de-anonymized all 173 million entries.
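To make the weakness concrete, here is a minimal Python sketch of the general technique: enumerate a small, structured keyspace, hash every candidate, and use the result as a reverse-lookup table. It is not Pandurangan's actual code. The pattern used (digit, letter, digit, digit) is an assumption inferred only from the example medallion 9Y99 above; real medallion and hack numbers may follow additional formats, and the exact encoding used in the released file (case, whitespace) is also assumed. The names `md5_hex` and `build_lookup` are hypothetical helpers introduced for illustration.

```python
import hashlib
import itertools
import string


def md5_hex(value):
    """Return the hex MD5 digest of an ASCII string."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup():
    """Map MD5 digest -> plaintext for every candidate in the assumed pattern.

    Assumed pattern: digit, uppercase letter, digit, digit (e.g. 9Y99).
    That is only 10 * 26 * 10 * 10 = 26,000 candidates.
    """
    digits = string.digits
    letters = string.ascii_uppercase
    table = {}
    for d1, letter, d2, d3 in itertools.product(digits, letters, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(candidate)] = candidate
    return table


lookup = build_lookup()

# Stand-in for a hashed field read from the released file.
hashed_field = md5_hex("9Y99")

# "Reversing" the one-way hash is just a dictionary lookup.
recovered = lookup.get(hashed_field)
print(recovered)
```

With a keyspace this small, the whole table can be built in well under a second on ordinary hardware. The hash function being one-way offers no protection when every possible input can simply be tried.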