ELEVATOR PITCH

LOAF is a simple extension to email that lets you append your entire address book to outgoing mail message without compromising your privacy. Correspondents can use this information to prioritize their mail, and learn more about their social networks. The LOAF home page is at http://loaf.cantbedone.org.


OVERVIEW

LOAF creates and maintains a database of all your correspondents, defined as people to whom you have sent email at least once. Every time you send an email message, LOAF appends this information to the email message, using a format described further below. LOAF-enabled correspondents collect and store this information in their own local databases.

When you receive an email from an address you have not previously written to, LOAF checks to see if the email address is known to any of your existing correspondents. This essentially sorts incoming email into three categories:

  1. Mail from complete strangers
  2. These are people whom you do not know, and who are also unknown to your correspondents.

  3. Mail from partial strangers
  4. These are people you have never sent email to, but who have gotten email from at least one of your own correspondents. This email may deserve more attention, since at least one of your correspondents took the time to write back to the person.

  5. Mail from people you know.
  6. This last category consists of people whom you have written to before. Presumably this is email you're most interested in, unless it's another forward from your mom.

Mail in category (2) can be further classified by counting how many correspondents you and the sender have in common. If the originating email appears in the address books of several of your correspondents, this may indicate a person with whom you have many connections. Insert standard social network theory here.


EXAMPLE

Let's say Alice sends email to Bob about an exciting business opportunity in Nigerian bridge futures. Bob has never received email from Alice before, and is very skeptical about messages from unknown sources.

Bob takes Alice's email address and checks it against a database he maintains on his machine. This database contains an entry for each of Bob's correspondents, with a specially hashed list of their own corresponents. Bob checks Alice's email address against each of his correspondents' lists. If any of them have written Alice before, the search will result in a hit.

Let's say Bob finds Alice in a correspondent's list, and decides to read her email message. He is intrigued by the business opportunity and writes back to Alice asking for more information.

Because Bob has now written to Alice, all future emails Bob writes will list Alice as one of his correspondents. If Bob now sends a message to his old friend Carol, Carol will update her database entry for Bob with the LOAF information she finds in his email, wihch now includes the reference to Alice. When Alice writes to Carol a week later, Carol will see that Alice has previously interacted with Bob (and perhaps some other people in Carol's list of correspondents).


IMPLEMENTATION

LOAF uses a special data structure called a Bloom filter, described below. To the unaided eye, the filter looks like a string of characters. It contains a specially hashed representation of the sender's correspondents list, a few thousand bytes in length.

It is computationally trivial to check whether a given address appears in the filter, but reconstructing the list of email address in the filter requires a brute-force attack. It is possible to infer the rough size of the address list from the proportion of 'on' bits in the filter, but this will give an inexact estimate.

Outgoing email is munged by an outgoing mail filter to include the LOAF filter, in the form of a MIME attachment. Every address on an outgoing message is automatically added to the database of correspondents, unless the user expressly flags the message.

Incoming email is analyzed for the attachment, and messages that fall into category (2) above are marked with the special flashing LOAF banner.

Every time an email arrives from a known correspondent, the corresponding entry in the local database is updated with the latest LOAF filter.


BLOOM FILTERS

A Bloom filter is compact way of testing to see whether you have seen something, requiring less memory than a full list of the items you have seen. The tradeoff is a risk of false positives (Bloom filters never give a false negative) that is a function of the length of the filter and the number of items stored in it. The smaller the filter, and the more items it contains, the greater the risk that it will give a false positive.

The filter consists of two parts, a series of bits and a set of hashing functions. To add a string to a Bloom filter of length m, you run it through each one of the hashing functions, each with range (m), and turn on whatever bit of the filter corresponds to the output of the hashing function.

To see if a string has been seen by the filter, you run it through the same set of hashing functions and XOR the on bits in output with the corresponding bits in the filter. If the XOR gives a non-zero result, you are guaranteed that the string has not been seen before. If the result is zero, you can be certain with a known probability that the string is contained in the filter.

The false positive rate for Bloom filters is determined by the number of hashing functions, the size of the filter, and the number of entries in the filter, given by the approximate formula:

        ( 1 - e^(kn/m) )^k

Where k is the number of hashing functions, n is the number of keys in the filter, and m is the length of the filter in bits.

For the detailed math, see http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html#SECTION00053000000000000000


ATTACKS

Ex-Girlfriend attack
While a LOAF file is hard to reverse-engineer, it's designed to answer the question ``did this person ever send email to X?''. In some cases, that's a question you don't want people to be able to ask. To avoid exposing the fact that you are corresponding with certain people, you have three options:

- Don't use LOAF.

- Create a blacklist of addresses for LOAF to pass over when generating a filter.

- Set a false positive rate high enough to give you plausible deniability: ``Oh, honey, don't be ridiculous. I certainly never wrote to X, that must be a false positive'' will work, but you must be sure to read the caveat about keeping a constant filter size in Dictionary attack below.

Marc Canter attack
The technique is similar to getting a perfect score on the SAT by filling in every oval on the SAT exam sheet - you provide a Bloom filter consisting entirely of ones, and every email address checked against it will match.

Sending an overloaded filter does not help you get accepted by new correspondents, but once you are added to their list, it will make you appear to know everyone. One possible solution to this spoofing problem is to impose a maximum density.

Dictionary attack
The configurable false positive rate can make Bloom filters resistant to dictionary attack, but it also renders them less useful. Given a false positive rate of c, and a dictionary with k elements, a dictionary attack will result in ck false hits. This rate goes down if you can collect multiple filters from the same user that are either 1) of different length, or 2) use different hash functions (salts, in our implementation). False positives in either case will be different, so for n filters the false positive rate will drop to c^n.

This implies that the truly paranoid should use a presized filter large enough to contain as many correspondents as they ever expect to have on record, and an invariant set of salts. Under those conditions, collecting multiple filters will not change the false positive rate. A mostly empty large filter might have an unacceptably low false positive rate, so you would want to pad the list of real emails out with random data, to maintain a constant ratio of on/off bits as well.

The tradeoff with a high false positive rate is that the filter will be less useful to legitimate recipients. An intriguing possibility is that of sending out very inaccurate filters that are updated on a regular basis (for example weekly) so that a user has to accumulate a certain number of the filters in order to run queries with a good degree of certitude. This spreads private information over several filters and ensures that an eavesdropper who intercepts only one file will find it of very limited value.

Of course, the truly paranoid would be crazy to use LOAF.

Me Too attack
A ``me too'' attack consists of taking someone else's filter and claiming it as your own. This does not help you get recognized by other correspondents - that determination is made by comparing your email address against their list of stored filters - but once you are 'in', it will make you appear to share many contacts with people you actually don't know well at all.

For example, let's say Alice has a copy of Bob's filter. Charlie also has Bob's filter, and he takes it and replaces Bob's name with his own, hashing in some of his own email addresses so the filters are not completely identical.

Let's say in the normal course of things Alice adds Charlie as a correspondent. Now if Alice does some social network analysis of her stored bloom filters, she will see that Charlie and Bob are very closely related, sharing almost all correspondents.

It's unclear what the goal of such an attack would be, but there are conceivable scenarios where it might be beneficial to claim close similarity to another user. The defense against this attack is to include the email of the filter owner as an input in every hashing operation, so that changing the email address of the filter owner will effectively scramble the filter contents.


ANSWERED QUESTIONS

Can LOAF files be compressed?
This depends on the density of the Bloom filter. At optimum density, the filters are essentially incompressible. See the paper on compressed Bloom filters for an extensive treatment of this topic.

Why not send a link to the LOAF file instead of the actual file?
This is a risk, a variant on web bugs. Spammers could send fake LOAF headers and determine which messages had gone to live email addresses by looking at their fetch logs. It also exposes considerable information on email patterns to a centralized server, which many people would (rightfully) find problematic.


OPEN QUESTIONS

Standard salts - bug or feature?
Is it advisable to use the same salts across all users, or hash with different salts for everybody? The main difference seems to be enabling or disabling interesting bitwise operations. For example, standard salts make it possible to calculate overlap between any two Bloom filters. With unique salts, you can only calculate overlap between other Bloom filters and your own, unless you're willing to perform a dictionary search.

Standard salts would allow people to pool their LOAF information through bitwise OR (match if there are any hits) or bitwise AND (match only on a hit against everyone).

What are the security implications of standard salts?

Why are Bloom filters so obscure?
Bloom filters have been around for ages - the paper dates to 1970. They have some neat properties to boot - why haven't they been more widely used?


REFERENCES

Burton Bloom
Space/time trade-offs in hash coding with allowable errors. Communications of ACM , pages 13(7):422-426, July 1970.

Motley Bloom Tricks
``A Little Bloom Filter Theory (and a Bag of filter Tricks)'' http://www.cap-lore.com/code/BloomTheory.html

Bloom filters for finding annotations
Neat bit-offset technique for lumping several BFs into one. http://www.cap-lore.com/code/BloomAnno.html

Bloom Filters for secure DNS
http://www.research.att.com/~smb/papers/draft-bellovin-dnsext-bloomfilt-00.txt

Compressed Bloom Filters (PDF)
http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf.pdf


AUTHORS

Maciej Cegłowski
Joshua Schachter


ACKNOWLEDGEMENTS

Dan Egnor
Peter V. Gadjokov


HISTORY

August 19, 2004
Corrected incorrect description of XORing to check for presence in filter (caught by Matt Crawford)

March 22, 2004
Clarified false p / privacy tradeoffs in dictionary attack. Added paragraph on multiple noisy filters.

March 19, 2004
Added Ex-Girlfriend attack

March 17, 2004
Added ``Me Too'' attack Further discussion on dictionary attacks

March 10, 2004
Added caveat on keeping filter length and hashes constant when seeking to avoid dictionary attacks

March 10, 2004
Split questions into 'answered' and 'unanswered' categories Added some recurring questions to the former category

March 9, 2004
Added section for references, questions. Corrected explanation of Bloom filters.

March 5, 2004
Initial draft


COPYRIGHT

(c) 2004 Joshua Schachter, Maciej Ceglowski <maciej @ceglowski.com>