Skip to content

feat: add BloomFilter, CRC32C, MurmurHash, varint utilities, and DeltaVarintCompressor#37

Open
lxy-9602 wants to merge 4 commits into
apache:mainfrom
lxy-9602:add-bf-hash
Open

feat: add BloomFilter, CRC32C, MurmurHash, varint utilities, and DeltaVarintCompressor#37
lxy-9602 wants to merge 4 commits into
apache:mainfrom
lxy-9602:add-bf-hash

Conversation

@lxy-9602
Copy link
Copy Markdown
Contributor

@lxy-9602 lxy-9602 commented Jun 2, 2026

Purpose

No Linked issue.

Introduce hashing, checksum, and compression utilities:

  • BloomFilter / BloomFilter64 — probabilistic membership filters with 32-bit and 64-bit hashing (bloom_filter.h/cpp, bloom_filter64.h/cpp)
  • CRC32C — CRC32-Castagnoli checksum with SSE4.2 hardware acceleration fallback (crc32c.h/cpp)
  • MurmurHashUtils — MurmurHash3 implementation for byte arrays and MemorySegments (murmurhash_utils.h)
  • VarLengthIntUtils — variable-length integer encoding/decoding (var_length_int_utils.h)
  • DeltaVarintCompressor — delta + varint compression for integer sequences (delta_varint_compressor.h/cpp)

Also adds xxhash_test.cpp and data_define_test.cpp test coverage and complete testharness.h/.cpp.

Tests

  • bloom_filter_test.cpp, bloom_filter64_test.cpp — filter add/contains, false-positive rate
  • crc32c_test.cpp — checksum correctness across input sizes
  • murmurhash_utils_test.cpp — hash consistency and known-vector checks
  • var_length_int_utils_test.cpp — varint round-trip encoding
  • delta_varint_compressor_test.cpp — compression/decompression round-trip
  • xxhash_test.cpp, data_define_test.cpp

API and Format

Documentation

Generative AI tooling

Migrate-by: Aone Copilot (Claude)

@lxy-9602
Copy link
Copy Markdown
Contributor Author

lxy-9602 commented Jun 2, 2026

Thanks @ChaomingZhangCN for contributing the BloomFilter and CRC32C implementations — migrated as part of this batch. 🎉

Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the migration. I found one blocker: DeltaVarintCompressor currently relies on Java-style signed wraparound in the extreme-value case, but the C++ implementation performs signed int64_t subtraction/addition, which is undefined behavior and can fail under UBSan or compiler optimization. Please implement the delta/reconstruction in a well-defined modulo-2^64 domain so it remains Java-compatible without UB.

// 1. Delta encoding
std::vector<int64_t> deltas;
deltas.reserve(data.size());
deltas.push_back(data[0]);
Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This subtraction can overflow for the tested {INT64_MIN, INT64_MAX} case. Java long arithmetic wraps, but C++ signed overflow is undefined behavior. Please compute the delta with well-defined modulo semantics, for example via unsigned intermediates, before ZigZag encoding.

deltas.push_back(delta);
}

// 2. Delta decoding
Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue during reconstruction: for the extreme-value case, INT64_MIN + (-1) relies on signed overflow to become INT64_MAX. Please reconstruct using well-defined modulo semantics instead of signed overflow.

Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the latest head. The previous DeltaVarintCompressor signed-overflow blocker has been fixed with unsigned delta/reconstruction, and I did not find new blockers in this pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants