Deduping the Internet: An email case study

Much of the traffic that traverses the Internet each day is redundant: some or all of its data content has been sent previously. From a technical viewpoint, this redundancy wastes network bandwidth, storage, and energy. This paper presents an initial feasibility study assessing the potential of data deduplication technologies to reduce Internet traffic. The case study focuses on electronic mail (email), using an email dataset collected over the past 8 years.

The results from this longitudinal study suggest that the size, complexity, and redundancy of email messages have all increased over this period, as has the complexity of the email delivery infrastructure. The results also indicate that bandwidth savings of 30-45% are possible by applying existing redundant traffic elimination techniques to email messages.
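
To make the notion of a bandwidth saving concrete, the following is a minimal sketch of how redundancy in a message stream could be estimated. It is not the method used in this study: the function name `estimate_savings`, the fixed-size chunking, and the 64-byte chunk size are all assumptions chosen for illustration, and practical redundant traffic elimination systems typically rely on content-defined chunk boundaries instead.

```python
import hashlib


def estimate_savings(messages, chunk_size=64):
    """Return the fraction of bytes that repeat a previously seen chunk.

    Hypothetical illustration: each message (a bytes object) is split into
    fixed-size chunks, and a chunk counts as redundant if an identical chunk
    appeared earlier in the stream.
    """
    seen = set()      # digests of chunks already "sent"
    total = 0         # total bytes in the stream
    redundant = 0     # bytes that duplicate an earlier chunk
    for msg in messages:
        for start in range(0, len(msg), chunk_size):
            chunk = msg[start:start + chunk_size]
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest in seen:
                redundant += len(chunk)
            else:
                seen.add(digest)
    return redundant / total if total else 0.0


# Two toy "messages" sharing one 64-byte block out of four.
messages = [b"A" * 64 + b"B" * 64, b"A" * 64 + b"C" * 64]
print(estimate_savings(messages))  # 0.25
```

Fixed-size chunks keep the sketch short, but they miss duplicates that are shifted by even one byte, which is one reason an estimate of this kind would understate what content-defined techniques can find.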