Source Coding
Data Compression

Data Deduplication with Random Substitutions

Hao Lou, Farzad Farnoud

Date & Time

01:00 am – 01:00 am


Data deduplication saves storage space by identifying and removing repeats in the data stream. In this paper, we provide information-theoretical analysis on the performance of deduplication algorithms with data streams where repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. Two modified versions of fixed-length deduplication are studied and proven to have performance within a constant factor of optimal with the knowledge of repeat length. We also study the variable-length scheme and show that as entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string.


Hao Lou

University of Virginia

Farzad Farnoud

University of Virginia

Session Chair

Marcelo Weinberger

Center for Science of Information