Data deduplication saves storage space by identifying and removing repeats in the data stream. In this paper, we provide information-theoretical analysis on the performance of deduplication algorithms with data streams where repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. Two modified versions of fixed-length deduplication are studied and proven to have performance within a constant factor of optimal with the knowledge of repeat length. We also study the variable-length scheme and show that as entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string.


Hao Lou

University of Virginia

Farzad Farnoud

University of Virginia
Private Conversation

Reach out to the speaker privately

Questions & Answers

Post a publicly available question

No questions have been asked.

Session Chair

Marcelo Weinberger

Center for Science of Information