RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems
Abstract: I/O deduplication is a key technique for improving storage systems' space and I/O efficiency.
Among the various deduplication techniques, content-defined chunking (CDC) based deduplication is the most desirable for its high deduplication ratio. However, its chunking operation is slow and may become a performance bottleneck. Currently, a choice has to be made between a high deduplication ratio and high speed.
In this paper, we leverage locality in the duplicate chunks to remove almost all chunking cost for deduplicatable chunks in CDC-based deduplication systems. The proposed deduplication method, named RapidCDC, has two salient features. One is that its efficiency is positively correlated with the deduplication ratio. The other is that its high efficiency does not depend heavily on the existence of that locality. Our experimental results with synthetic and real-world datasets show that RapidCDC achieves a chunking speedup of up to 33× over regular CDC while maintaining the same deduplication ratio.
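
The abstract does not spell out the mechanism, so the Python sketch below is only an illustration of the general idea under stated assumptions, not RapidCDC's actual implementation. It assumes that "duplicate locality" means a chunk seen before tends to be followed by the same next chunk as last time, so the size of that next chunk can be remembered as a hint and checked with a single boundary test instead of a byte-by-byte scan. All names (`is_boundary`, `scan_for_boundary`, `chunk`, `hints`) and all parameters (`WINDOW`, `MASK`, `MIN_SIZE`, `MAX_SIZE`) are hypothetical, and hashing a fixed window with BLAKE2b is a simple stand-in for the rolling hash a real CDC system would use.

```python
import hashlib
import random

# Illustrative parameters (assumptions, not taken from the paper).
WINDOW = 16           # bytes examined at each candidate boundary
MASK = (1 << 6) - 1   # boundary when hash & MASK == 0 (about 64 bytes on average)
MIN_SIZE = 16         # minimum chunk size; kept >= WINDOW
MAX_SIZE = 512        # maximum chunk size


def is_boundary(data, end):
    """Return True if a chunk may end at offset `end` (exclusive)."""
    if end == len(data):
        return True
    window = data[end - WINDOW:end]
    digest = hashlib.blake2b(window, digest_size=4).digest()
    return int.from_bytes(digest, "big") & MASK == 0


def scan_for_boundary(data, start):
    """Regular CDC: test offsets one by one. Returns (end, tests_performed)."""
    limit = min(start + MAX_SIZE, len(data))
    end = start + MIN_SIZE
    tests = 0
    while end < limit:
        tests += 1
        if is_boundary(data, end):
            return end, tests
        end += 1
    return limit, tests


def chunk(data, hints=None):
    """Split `data` into chunks. Returns (chunks, boundary_tests).

    `hints` (hypothetical) maps a chunk fingerprint to the size of the chunk
    that followed it last time. With `hints=None` this is plain CDC.
    """
    chunks, tests = [], 0
    start, prev_fp = 0, None
    while start < len(data):
        end = None
        # Fast path: jump to the remembered boundary and verify it once.
        if hints is not None and prev_fp in hints:
            guess = start + hints[prev_fp]
            if guess <= len(data):
                tests += 1
                if is_boundary(data, guess):
                    end = guess
        # Slow path: fall back to the byte-by-byte scan.
        if end is None:
            end, scanned = scan_for_boundary(data, start)
            tests += scanned
        piece = data[start:end]
        if hints is not None and prev_fp is not None:
            hints[prev_fp] = len(piece)
        chunks.append(piece)
        start, prev_fp = end, hashlib.sha1(piece).digest()
    return chunks, tests


if __name__ == "__main__":
    rng = random.Random(0)
    data = bytes(rng.getrandbits(8) for _ in range(1 << 14))
    _, baseline_tests = chunk(data)
    hints = {}
    chunk(data, hints)                    # first pass fills the hint table
    _, hinted_tests = chunk(data, hints)  # second pass sees only duplicates
    print("boundary tests without hints:", baseline_tests)
    print("boundary tests with hints:   ", hinted_tests)
```

In this sketch, a hint that fails verification costs one extra boundary test before the regular scan takes over, so data without locality is chunked exactly as plain CDC would chunk it. Accepting a hint after only a boundary test is the simplest possible check; a stricter variant could also compare the fingerprint of the resulting chunk before accepting it.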