Posts tagged ‘forensics’

Everything Has a Fingerprint — Don’t Forget Scanners and Printers

Previous articles in this series looked at fingerprinting of blank paper, digital cameras, and RFID chips. This article will discuss scanners and printers, rounding out the topic of physical-device fingerprinting.

To readers who’ve followed the series so far, it should come as no surprise that scanners can be fingerprinted, and that this can be used to match an image to the device that scanned it. Scanners capture images via a process similar to digital cameras, so the underlying principle used in fingerprinting is the same: characteristic ‘pattern noise’ in the sensor array, as well as idiosyncrasies of the algorithms used in the post-processing pipeline. The former is device-specific, whereas the latter is make/model-specific.

There are two important differences, however, that make scanner fingerprinting more difficult: first, scanner sensor arrays are one-dimensional (the sensor moves along the length of the device to generate the image), which means that there is much less entropy available from sensor imperfections. Second, the paper may not be placed in the same part of the scanner bed each time, which rules out a straightforward pixel-wise comparison.

A group at Purdue has been very active in this area, as well as in printer identification, which I will discuss later in this article. These two papers are particularly relevant for our purposes. The application they have in mind is forensics; in this context, it can be assumed that the investigator has physical possession of the scanner to generate a fingerprint against which a scanned image of unknown or uncertain origin can be tested.

To extract 1-dimensional noise from a 2-dimensional scanned image, the authors first extract 2-dimensional noise, in a process similar to what is used in camera fingerprinting, and then they collapse each noise pattern into a single row, which is the average of all the rows. Simple enough.
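As a quick sketch (with synthetic data standing in for real scanner noise), the row-collapsing step looks like this. Averaging many rows suppresses the random per-pixel component while preserving the fixed per-column sensor pattern:

```python
import numpy as np

def collapse_noise(noise_2d):
    """Average all rows of a 2-D noise pattern into one row."""
    return noise_2d.mean(axis=0)

rng = np.random.default_rng(0)
column_pattern = rng.normal(size=500)   # fixed per-column sensor noise
# 1000 scanned rows: the fixed pattern buried in much larger random noise
frames = column_pattern + rng.normal(scale=2.0, size=(1000, 500))
row = collapse_noise(frames)
# the collapsed row correlates strongly with the true column pattern
print(np.corrcoef(row, column_pattern)[0, 1] > 0.9)
```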

Dealing with the other problem, the lack of synchronization, is trickier. There are broadly two approaches: (1) synchronize the image by trying various alignments, or (2) extract fingerprints using statistical features of the image that are robust against desynchronization. The authors take the latter approach, mainly using moment-based features of the noise vector.
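A minimal illustration of why moment-based features sidestep the alignment problem: the moments of a noise vector don’t change when the vector is shifted. The data here is synthetic:

```python
import numpy as np

def moment_features(v):
    """Mean, std, skewness, kurtosis of a 1-D noise vector."""
    m, s = v.mean(), v.std()
    skew = ((v - m) ** 3).mean() / s ** 3
    kurt = ((v - m) ** 4).mean() / s ** 4
    return np.array([m, s, skew, kurt])

rng = np.random.default_rng(1)
noise = rng.normal(size=1000)
shifted = np.roll(noise, 137)   # same scan, page placed 137 pixels over
# the features are identical regardless of where the page sat
print(np.allclose(moment_features(noise), moment_features(shifted)))
```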

Here are the results. At the native resolution of scanners, 1200–4800 dpi, they were able to distinguish between 4 scanners with an average accuracy of 96%, including a pair with identical make and model. In subsequent work, they improved the feature extraction to be able to handle images that are reduced to 200 dpi, which is typically the resolution used for saving and emailing images. While they achieved 99.9% accuracy in classifying 10 scanners, they can no longer distinguish devices of identical make and model.

The authors claim that a correlation based approach — searching for the right alignment between two images, and then directly comparing the noise vectors — won’t work. I am skeptical about this claim. The fact that it hasn’t worked so far doesn’t mean it can’t be made to work. If it does work, it is likely to give far higher accuracies and be able to distinguish between a much larger number of devices.
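For concreteness, here is a sketch of what such a correlation-based approach might look like, using synthetic noise vectors and FFT-based circular cross-correlation. (The real problem is harder: actual scans are not circular shifts of each other, and the noise is far weaker.)

```python
import numpy as np

def best_alignment(a, b):
    """Offset maximizing the circular cross-correlation of a and b."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    # circular cross-correlation via the FFT convolution theorem
    corr = np.fft.ifft(np.fft.fft(a) * np.conj(np.fft.fft(b))).real
    return int(np.argmax(corr))

rng = np.random.default_rng(2)
fingerprint = rng.normal(size=2048)    # reference noise vector
# a "scan" from the same device: shifted by 300 samples, plus fresh noise
scan = np.roll(fingerprint, 300) + rng.normal(scale=0.5, size=2048)
print(best_alignment(scan, fingerprint))  # recovers the 300-sample shift
```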

The privacy implications of scanner fingerprinting are analogous to those of digital camera fingerprinting: a whistleblower who releases scanned documents may be deanonymized. However, I would judge the risk to be much lower: scanners usually aren’t personal devices, and a labeled corpus of images scanned by a particular device is typically not available to outsiders.

The Purdue group has also worked on printer identification, both laser and inkjet. In laser printers, one prominent observable signature arising from printer artifacts is banding: alternating light and dark horizontal bands. The bands are subtle, constituting only a 1–2% deviation from average intensity, and are not noticeable to the human eye, but they are easily detectable algorithmically.

Fourier Transform of greyscale amplitudes of a background fill (printed with an HP LaserJet)

Banding can be demonstrated by printing a constant grey background image, scanning it, measuring the row-wise average intensities and taking the Fourier Transform of the resulting 1-dimensional vector. One such plot is shown here: the two peaks (132 and 150 cycles/inch) constitute the signature of the printer. The amount of entropy here is small — the two peak frequencies — and unsurprisingly the authors believe that the technique is good enough to distinguish between printer models but not individual printers.
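The measurement can be sketched as follows; here the 132 and 150 cycles/inch banding peaks are simulated rather than taken from a real scan:

```python
import numpy as np

dpi, rows = 600, 600                    # one inch of rows at 600 dpi
y = np.arange(rows) / dpi               # vertical position in inches
# simulated grey fill with ~1-2% banding at 132 and 150 cycles/inch
intensity = (0.5
             + 0.01 * np.sin(2 * np.pi * 132 * y)
             + 0.01 * np.sin(2 * np.pi * 150 * y))
# Fourier Transform of the row-average intensities (mean removed)
spectrum = np.abs(np.fft.rfft(intensity - intensity.mean()))
freqs = np.fft.rfftfreq(rows, d=1 / dpi)  # frequency axis in cycles/inch
peaks = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(peaks)                            # the printer's signature frequencies
```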

Detecting banding in printed text is difficult because the power of the text (the signal) dominates the power of the banding (the noise we are after). Instead, the authors classify individual letters. By extracting a set of statistical features and applying an SVM classifier, they show that instances of the letter ‘e’ from 10 different printers can be correctly classified with an accuracy of over 90%.

Needless to say, by combining the classification results from all the ‘e’s in a typical document, they were able to match documents to printers 100% of the time in their tests. Presumably the same method would apply to other characters as well, but this wasn’t tested due to the additional manual effort required for different shapes.
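A toy version of the per-letter pipeline, using made-up glyph features and scikit-learn’s SVC (the actual feature set and training setup in the paper are different):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n_printers, per_printer = 10, 40
# made-up feature vectors: each printer's 'e' glyphs cluster around
# a printer-specific center in an 8-dimensional feature space
centers = rng.normal(size=(n_printers, 8))
X = np.vstack([c + 0.3 * rng.normal(size=(per_printer, 8)) for c in centers])
y = np.repeat(np.arange(n_printers), per_printer)
clf = SVC().fit(X, y)

# classify every 'e' in a "document" from printer 4, then majority-vote:
# individual letters may be misclassified, the vote is far more reliable
doc = centers[4] + 0.3 * rng.normal(size=(50, 8))
votes = clf.predict(doc)
print(int(np.bincount(votes).argmax()))
```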

Vertical lines printed by three different inkjet printers

Inkjet printers seem to be even more variable than laser printers; an example is shown in the picture taken from this paper. I found it a bit hard to discern exactly what the state of the art is, but I’m guessing that if it isn’t already possible to detect different printer models with essentially perfect accuracy, it will soon be.

The privacy implications of printer identification, in the context of a whistleblower who wishes to print and mail some documents anonymously, would seem to be minimal. If you’re printing from the office, printer logs (that record a history of print jobs along with user information) would probably be a more realistic threat. If you’re using a home printer, there is typically no known set of documents that came from your printer to compare against, unless law enforcement has physical possession of your printer.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

October 11, 2011 at 10:02 am

The Entropy of a DNA profile

I’m often asked how much entropy there is in the DNA profiles used in forensic investigations. Specifically, is it more than 33 bits, i.e., can it uniquely identify individuals? The short answer is: yes in theory, but there are many caveats in practice, and false matches are fairly common.

To explain the details, let’s start by looking at what is actually stored in a DNA profile. Your entire genome consists of billions of base pairs, but for profiling purposes only a tiny portion of it is examined: 13 locations, or loci (in the U.S. system, which I will focus on; the U.K. system uses 10 loci). Each locus yields a pair of integers that varies from person to person. You can see an example DNA profile on this page.

The degree of variation in the pairs of numbers — genotypes — at each locus has been empirically measured by many studies. Since biological laws dictate that the genotypes at different loci are uncorrelated, we can calculate entropy by simply adding up the entropy at individual loci. I analyzed (source code) the raw data on variation at each locus from a sample of U.S. Caucasians, and arrived at a figure of between 3.0 and 5.6 bits of entropy per locus and 54 bits of entropy for the whole 13-locus DNA profile. In addition, there is 1 sex-determining bit.
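The arithmetic is straightforward; here is a sketch with made-up genotype frequencies for one locus (the real per-locus figures come from the empirical studies mentioned above):

```python
import math

def entropy(freqs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

# hypothetical genotype frequencies at a single locus (they sum to 1)
locus = [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]
print(round(entropy(locus), 2))  # entropy of this one locus, in bits

# because the loci are uncorrelated, the 13-locus profile entropy is
# simply the sum of the per-locus entropies, plus the 1 sex-determining bit
```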

Since that number is well over 33 bits, with a high probability there is no one else who shares your DNA profile. However, there are many complications to this rosy picture:

Non-uniform genotype probabilities. The entropy calculation doesn’t quite tell the whole story, because some genotypes at each locus are much more common than others. If you happen to end up with a common genotype in all (or most) of the 13 loci, then there might be a significant chance that someone else in the world shares your DNA profile.
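A back-of-the-envelope illustration with made-up (not real CODIS) frequencies: the random match probability is the product of your genotype frequencies across loci, so a person who happens to carry a common genotype at every locus is far easier to collide with than average.

```python
# illustrative genotype frequencies: "common" vs. "typical"
common, typical = 0.15, 0.05
p_common = common ** 13    # match probability if all 13 loci are common
p_typical = typical ** 13  # match probability for a typical profile
world = 7e9
print(p_common * world)    # expected others sharing the all-common profile
print(p_typical * world)   # essentially zero for the typical profile
```

With these numbers the all-common profile has an appreciable expected number of coincidental matches worldwide, while the typical one does not.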

Population structure. The calculation above assumes the Hardy-Weinberg equilibrium, which is only true if mating is random, among other things. In reality, due to the non-random population structure, there is a slight deviation from the theoretical value. This manifests in two ways: first, the allele frequencies for different population groups (ethnic groups) need to be calculated separately. Second, there is a deviation from the expected genotype frequencies even within population groups, which is more difficult to account for (a correction factor called “theta” is applied in forensic calculations).
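The Hardy-Weinberg frequencies, and the direction of the theta correction, can be sketched as follows (the allele frequencies are illustrative):

```python
p, q, theta = 0.2, 0.8, 0.01   # illustrative allele frequencies and theta
hom_hwe = p * p                # homozygote frequency under HWE
het_hwe = 2 * p * q            # heterozygote frequency under HWE
# theta correction for population substructure: homozygotes become more
# common, and heterozygotes rarer, than the random-mating prediction
hom_corr = p * p + theta * p * (1 - p)
het_corr = 2 * p * q * (1 - theta)
print(round(hom_hwe, 2), round(het_hwe, 2))
print(hom_corr > hom_hwe, het_corr < het_hwe)
```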

Familial relationships. Since we share half of our DNA with each parent and sibling, there is a much higher chance of a profile match between close relatives than between unrelated individuals. Therefore DNA database matches often turn up a relative of the perpetrator even if the perpetrator is not in the database (especially with partial matches; see below).

In recent years, law enforcement has sometimes adopted the strategy of turning this problem on its head and using these familial leads as starting points of investigation as a way to get to the true perpetrator. This is a controversial practice.

Each of the above factors results in an increase in the probability of a match between different individuals. But the effect is small; even after taking them into account, as long as we’re talking about the full 13-locus profile, most individuals do in fact have a unique DNA profile, albeit fewer than would be predicted by the simple entropy calculation.

Unfortunately, crime-scene sample collection is far from perfect, and profiles are often not extracted accurately from the physical samples due to limitations of technology and the quality of the sample. These inaccuracies in a crime-scene profile introduce errors into the matching process, which are the primary reason for false matches in investigations.

Partial and mixed profiles. Sometimes only a “partial profile” can be extracted from a crime-scene DNA sample. This means that only a subset of the 13 genotypes can be measured. This could be because the quantity of DNA available is too small (interfering with the “amplification” process at some of the loci), because the DNA has degraded, or because it is contaminated with chemicals called PCR inhibitors that interfere with the decoding process.

The other type of inaccuracy occurs when the DNA sample collected is in fact a mixture from multiple individuals. If this happens, multiple values for some genotypes might be measured. There is no foolproof way of separating the genotypes of each individual in the mixture.

These are very common occurrences, particularly partial profiles. There are no standards on the quality or quantity of the profile data for the evidence to be admissible in court. Instead, an expert witness computes a “likelihood ratio” based on the specific partial or mixed profile, and presents this to the court. Juries are often left not knowing how to interpret the number they are presented and are vulnerable to the prosecutor’s fallacy.

The birthday paradox. The history of DNA testing is littered with false matches and leads; one reason is the birthday paradox. The number of pairs of profiles in a database of size N grows as N². The FBI database, for instance, has about 200,000 crime-scene profiles and 5 million offender profiles, for a total of 1 trillion crime-scene/offender pairs. Due to the use of partial profiles to find matches, the probability of a match between two random profiles is much higher than one in a trillion.
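The arithmetic above in code form, with an illustrative (made-up) per-pair match probability:

```python
crime_scene, offenders = 200_000, 5_000_000
pairs = crime_scene * offenders    # every crime-scene x offender pair
print(pairs)                       # one trillion comparisons

# even a tiny per-pair match probability, multiplied by a trillion
# pairs, yields many expected coincidental matches
p_match = 1e-10                    # illustrative per-pair match rate
print(pairs * p_match)             # expected number of spurious matches
```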

This long but fascinating paper has many hilarious stories of false DNA matches. Laboratory errors such as mixing up labels on the samples and contamination of the sample with the technician’s DNA appear to be depressingly common as well. Here is another story of lab contamination that cost $14 million.

Why only 13 loci? One question all of this raises: if the use of a small number of loci causes problems when only a partial profile is available, why not use more of the genome, or even all of it? Research on mini-STRs shows how to better utilize degraded DNA to recover genotypes from beyond the 13 CODIS loci. The cost of whole-genome genotyping has been falling dramatically, and enables even individuals contributing trace amounts of DNA to a mixture to be identified!

One stumbling block seems to be the small quantity of DNA available from crime scenes; whole genome amplification is being developed to address that. But I suspect that the main reason is inertia: forensic protocols, procedures and expertise in DNA profiling have evolved over the last two decades, and it would be costly to make any changes at all. Whatever the reasons, I’m certain that things are going to be very different in a decade or two, because there are millions of bits of entropy in the entire genome, and forensic science currently uses about 54 of them.

Further reading.

December 2, 2009 at 4:26 pm


About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
