Abstract
Data deduplication saves storage space by identifying and removing repeats in the data stream. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. Two modified versions of the fixed-length deduplication scheme are studied and proven to perform within a constant factor of optimal when the repeat length is known. We also study the variable-length scheme and show that as the entropy of the source becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string.
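The fixed-length deduplication the abstract builds on can be sketched as follows. This is a minimal illustration under assumed conventions (the function name `dedup_fixed` and the literal/pointer encoding are choices made here, not the paper's exact construction): the stream is cut into equal-length chunks, the first occurrence of each chunk is stored literally, and later exact repeats are replaced by a pointer to the first copy.

```python
def dedup_fixed(data: bytes, chunk_len: int):
    """Fixed-length deduplication sketch: split `data` into chunks of
    length `chunk_len`; store the first occurrence of each chunk as a
    literal and later occurrences as pointers to the first copy."""
    dictionary = {}  # chunk -> index of its first (literal) occurrence
    output = []      # sequence of ('lit', chunk) or ('ptr', index) entries
    for i in range(0, len(data), chunk_len):
        chunk = data[i:i + chunk_len]
        if chunk in dictionary:
            output.append(('ptr', dictionary[chunk]))
        else:
            dictionary[chunk] = len(dictionary)
            output.append(('lit', chunk))
    return output

# A repeat with a single substituted byte ("exxmple" vs. "example") is
# not matched exactly, so both copies are stored literally -- this is the
# inexact-repeat effect that the substitution model in the paper captures.
encoded = dedup_fixed(b"exampleexampleexxmple", 7)
```

Note how the second chunk compresses to a pointer while the third, differing in one byte, does not; the paper's modified schemes are designed to retain good compression despite such random substitutions.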
Session
Data Compression
A Universal Low Complexity Compression Algorithm for Sparse Marked Graphs
On a Redundancy of AIFV-m Codes for m = 3, 5
A Universal Data Compression Scheme based on the AIVF Coding Techniques
On D-ary Fano Codes
$O(\log \log n)$ Worst-Case Local Decoding and Update Efficiency for Data Compression
Data Deduplication with Random Substitutions