feat: add BloomFilter, CRC32C, MurmurHash, varint utilities, and DeltaVarintCompressor#37
feat: add BloomFilter, CRC32C, MurmurHash, varint utilities, and DeltaVarintCompressor#37lxy-9602 wants to merge 4 commits into
Conversation
|
Thanks @ChaomingZhangCN for contributing the BloomFilter and CRC32C implementations — migrated as part of this batch. 🎉 |
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the migration. I found one blocker: DeltaVarintCompressor currently relies on Java-style signed wraparound in the extreme-value case, but the C++ implementation performs signed int64_t subtraction/addition, which is undefined behavior and can fail under UBSan or compiler optimization. Please implement the delta/reconstruction in a well-defined modulo-2^64 domain so it remains Java-compatible without UB.
| // 1. Delta encoding | ||
| std::vector<int64_t> deltas; | ||
| deltas.reserve(data.size()); | ||
| deltas.push_back(data[0]); |
There was a problem hiding this comment.
This subtraction can overflow for the tested {INT64_MIN, INT64_MAX} case. Java long arithmetic wraps, but C++ signed overflow is undefined behavior. Please compute the delta with well-defined modulo semantics, for example via unsigned intermediates, before ZigZag encoding.
| deltas.push_back(delta); | ||
| } | ||
|
|
||
| // 2. Delta decoding |
There was a problem hiding this comment.
Same issue during reconstruction: for the extreme-value case, INT64_MIN + (-1) relies on signed overflow to become INT64_MAX. Please reconstruct using well-defined modulo semantics instead of signed overflow.
leaves12138
left a comment
There was a problem hiding this comment.
Reviewed the latest head. The previous DeltaVarintCompressor signed-overflow blocker has been fixed with unsigned delta/reconstruction, and I did not find new blockers in this pass.
Purpose
No Linked issue.
Introduce hashing, checksum, and compression utilities:
BloomFilter/BloomFilter64— probabilistic membership filters with 32-bit and 64-bit hashing (bloom_filter.h/cpp,bloom_filter64.h/cpp)CRC32C— CRC32-Castagnoli checksum with SSE4.2 hardware acceleration fallback (crc32c.h/cpp)MurmurHashUtils— MurmurHash3 implementation for byte arrays and MemorySegments (murmurhash_utils.h)VarLengthIntUtils— variable-length integer encoding/decoding (var_length_int_utils.h)DeltaVarintCompressor— delta + varint compression for integer sequences (delta_varint_compressor.h/cpp)Also adds
xxhash_test.cppanddata_define_test.cpptest coverage and completetestharness.h/.cpp.Tests
bloom_filter_test.cpp,bloom_filter64_test.cpp— filter add/contains, false-positive ratecrc32c_test.cpp— checksum correctness across input sizesmurmurhash_utils_test.cpp— hash consistency and known-vector checksvar_length_int_utils_test.cpp— varint round-trip encodingdelta_varint_compressor_test.cpp— compression/decompression round-tripxxhash_test.cpp,data_define_test.cppAPI and Format
Documentation
Generative AI tooling
Migrate-by: Aone Copilot (Claude)