Skip to content

[Exec](set) use bitset container to speed up hash set#62532

Open
HappenLee wants to merge 1 commit intoapache:masterfrom
HappenLee:pr-62304
Open

[Exec](set) use bitset container to speed up hash set#62532
HappenLee wants to merge 1 commit intoapache:masterfrom
HappenLee:pr-62304

Conversation

@HappenLee
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

use bitset container to speed up hash set

before:
SELECT count() from hits_100m WHERE FlashMajor IN (0, 1, 2, 3, 4, 5)

+----------+
| count()  |
+----------+
| 99997497 |
+----------+
1 row in set (0.411 sec)

after:

SELECT count() from hits_100m WHERE FlashMajor IN (0, 1, 2, 3, 4, 5)

+----------+
| count()  |
+----------+
| 99997497 |
+----------+
1 row in set (0.36 sec)

@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 15, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@HappenLee
Copy link
Copy Markdown
Contributor Author

run buildall

@HappenLee
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 blocking issue.

Critical checkpoints:

  • Goal of the task: Speed up tinyint/smallint IN-set lookups. The current change does not fully accomplish that goal, because it regresses the non-constant IN execution path for SMALLINT.
  • Small, clear, focused change: The patch is small, but the unconditional type-based switch is too broad because it overrides the existing small-N specialization everywhere.
  • Concurrency: No new concurrency or locking concerns were introduced in the reviewed code paths.
  • Lifecycle / static initialization: No special lifecycle or static initialization issues were found.
  • Configuration: No new configs.
  • Compatibility: No storage / protocol compatibility concerns were found.
  • Parallel code paths: Yes, there are parallel callers of create_set(). In particular, FunctionIn::impl_without_set() is negatively affected by this change.
  • Special conditional checks: The new TYPE_TINYINT || TYPE_SMALLINT branch lacks a size threshold, so it bypasses the existing FixedContainer fast path even for tiny RHS sets.
  • Test coverage: Existing be/test/exprs/hybrid_set_test.cpp still covers basic membership, but there is no new test covering the changed container-selection behavior or the non-constant SMALLINT IN (...) path that regresses here.
  • Observability: No additional observability seems necessary.
  • Transaction / persistence / data writes / FE-BE variable passing: Not applicable to this PR.
  • Performance: Blocking regression. FunctionIn::impl_without_set() builds a temporary set per input row via create_set(arg_type, set_datas.size(), true). After this patch, every SMALLINT row now allocates and zero-fills a 65,536-entry container even when the RHS has only 1-8 elements, replacing the previous fixed-container fast path with ~64 KiB of work per row.
  • Other issues: None beyond the blocker above.

Please preserve the existing fixed-container path for small N, and only switch to BitSetContainer once the set is large enough to amortize its initialization cost.

static BasePtr get_function(bool null_aware) {
if constexpr (is_string_type(type)) {
return new StringSet<>(null_aware);
} else if constexpr (type == TYPE_TINYINT || type == TYPE_SMALLINT) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_set() is also used by FunctionIn::impl_without_set() (be/src/exprs/function/in.h) for non-constant RHS expressions, and that path builds a temporary set per input row. With this branch ahead of the existing N <= FIXED_CONTAINER_MAX_SIZE check, every SMALLINT fallback set now becomes a BitSetContainer<int16_t> even when the RHS only has 1-8 elements. That means each row pays for allocating and zero-filling a 65,536-slot container instead of using the previous fixed-container fast path. For queries like smallint_col IN (other_smallint1, other_smallint2, ...), this turns the fallback into ~64 KiB of work per row and can easily dominate execution time. Please keep the small-N FixedContainer path and only choose BitSetContainer once the cardinality is large enough to amortize its initialization cost.`

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (46/46) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.73% (27439/37217)
Line Coverage 57.32% (296153/516645)
Region Coverage 54.34% (245772/452299)
Branch Coverage 56.12% (106763/190254)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants