Hello, The existing implementations of heap/heapsort follow the conventional textbook approach, where each heapify operation requires approximately 2*log2(n) comparisons. In this series, I introduce a bottom-up variant that reduces the number of comparisons during heapify operations to approximately log2(n), while maintaining the same number of swap operations. Thanks, Kuan-Wei Kuan-Wei Chiu (5): bcachefs: Optimize eytzinger0_sort() using bottom-up heapsort bcachefs: Introduce parent function for sort_cmp_size() bcachefs: Optimize sort_cmp_size() using bottom-up heapsort bcachefs: Optimize number of comparisons in heap_sift_down bcache: Optimize number of comparisons in heap_sift drivers/md/bcache/util.h | 23 +++++---- fs/bcachefs/util.c | 109 ++++++++++++++++++++++++++------------- fs/bcachefs/util.h | 23 +++++---- 3 files changed, 98 insertions(+), 57 deletions(-) -- 2.25.1

This optimization reduces the average number of comparisons required from 2*n*log2(n) - 3*n + o(n) to n*log2(n) + 0.37*n + o(n). When n is sufficiently large, it results in approximately 50% fewer comparisons. Currently, eytzinger0_sort employs the textbook version of heapsort, where during the heapify process, each level requires two comparisons to determine the maximum among three elements. In contrast, the bottom-up heapsort, during heapify, only compares two children at each level until reaching a leaf node. Then, it backtracks from the leaf node to find the correct position. Since heapify typically continues until very close to the leaf node, the standard heapify requires about 2*log2(n) comparisons, while the bottom-up variant only needs log2(n) comparisons. The experimental data presented below is based on an array generated by get_random_u32(). | N | comparisons(old) | comparisons(new) | time(old) | time(new) | |-------|------------------|------------------|-----------|-----------| | 10000 | 235381 | 136615 | 25545 us | 20366 us | | 20000 | 510694 | 293425 | 31336 us | 18312 us | | 30000 | 800384 | 457412 | 35042 us | 27386 us | | 40000 | 1101617 | 626831 | 48779 us | 38253 us | | 50000 | 1409762 | 799637 | 62238 us | 46950 us | | 60000 | 1721191 | 974521 | 75588 us | 58367 us | | 70000 | 2038536 | 1152171 | 90823 us | 68778 us | | 80000 | 2362958 | 1333472 | 104165 us | 78625 us | | 90000 | 2690900 | 1516065 | 116111 us | 89573 us | | 100000| 3019413 | 1699879 | 133638 us | 100998 us | Refs: BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small) Ingo Wegener Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993 https://doi.org/10.1016/0304-3975(93)90364-Y Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> --- This patch has undergone unit testing and micro benchmarking using the following code [1]. [1]: static long long int cmp_count = 0; static int mycmp(const void *a, const void *b, size_t size) { u32 _a = *(u32 *)a; u32 _b = *(u32 *)b; cmp_count++; if (_a < _b) return -1; else if (_a > _b) return 1; else return 0; } static int test(void) { size_t N, i, L, R; ktime_t start, end; s64 delta; u32 *arr; for (N = 10000; N <= 100000; N += 10000) { arr = kmalloc_array(N, sizeof(u32), GFP_KERNEL); cmp_count = 0; for (i = 0; i < N; i++) arr[i] = get_random_u32(); start = ktime_get(); eytzinger0_sort(arr, N, sizeof(u32), mycmp, NULL); end = ktime_get(); delta = ktime_us_delta(end, start); printk(KERN_INFO "time: %lld\n", delta); printk(KERN_INFO "comparisons: %lld\n", cmp_count); for (int i = 0; i < N; i++) { L = 2 * i + 1; R = 2 * i + 2; if (L < N && arr[i] < arr[L]) goto err; if (R < N && arr[i] > arr[R]) goto err; } kfree(arr); } return 0; err: kfree(arr); return -1; } fs/bcachefs/util.c | 50 +++++++++++++++++++++++++++------------------- 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c index c2ef7cddaa4f..bbc83b43162e 100644 --- a/fs/bcachefs/util.c +++ b/fs/bcachefs/util.c @@ -911,7 +911,7 @@ void eytzinger0_sort(void *base, size_t n, size_t size, int (*cmp_func)(const void *, const void *, size_t), void (*swap_func)(void *, void *, size_t)) { - int i, c, r; + int i, j, k; if (!swap_func) { if (size == 4 && alignment_ok(base, 4)) @@ -924,17 +924,22 @@ void eytzinger0_sort(void *base, size_t n, size_t size, /* heapify */ for (i = n / 2 - 1; i >= 0; --i) { - for (r = i; r * 2 + 1 < n; r = c) { - c = r * 2 + 1; - - if (c + 1 < n && - do_cmp(base, n, size, cmp_func, c, c + 1) < 0) - c++; - - if (do_cmp(base, n, size, cmp_func, r, c) >= 0) - break; - - do_swap(base, n, size, swap_func, r, c); + /* Find the sift-down path all the way to the leaves. */ + for (j = i; k = j * 2 + 1, k + 1 < n;) + j = do_cmp(base, n, size, cmp_func, k, k + 1) > 0 ? k : k + 1; + + /* Special case for the last leaf with no sibling. */ + if (j * 2 + 2 == n) + j = j * 2 + 1; + + /* Backtrack to the correct location. */ + while (j != i && do_cmp(base, n, size, cmp_func, i, j) >= 0) + j = (j - 1) / 2; + + /* Shift the element into its correct place. */ + for (k = j; j != i;) { + j = (j - 1) / 2; + do_swap(base, n, size, swap_func, j, k); } } @@ -942,17 +947,22 @@ void eytzinger0_sort(void *base, size_t n, size_t size, for (i = n - 1; i > 0; --i) { do_swap(base, n, size, swap_func, 0, i); - for (r = 0; r * 2 + 1 < i; r = c) { - c = r * 2 + 1; + /* Find the sift-down path all the way to the leaves. */ + for (j = 0; k = j * 2 + 1, k + 1 < i;) + j = do_cmp(base, n, size, cmp_func, k, k + 1) > 0 ? k : k + 1; - if (c + 1 < i && - do_cmp(base, n, size, cmp_func, c, c + 1) < 0) - c++; + /* Special case for the last leaf with no sibling. */ + if (j * 2 + 2 == i) + j = j * 2 + 1; - if (do_cmp(base, n, size, cmp_func, r, c) >= 0) - break; + /* Backtrack to the correct location. */ + while (j && do_cmp(base, n, size, cmp_func, 0, j) >= 0) + j = (j - 1) / 2; - do_swap(base, n, size, swap_func, r, c); + /* Shift the element into its correct place. */ + for (k = j; j;) { + j = (j - 1) / 2; + do_swap(base, n, size, swap_func, j, k); } } } -- 2.25.1

When dealing with array indices, the parent's index can be obtained using the formula (i - 1) / 2. However, when working with byte offsets, this approach is not straightforward. To address this, we have introduced a branch-free parent function that does not require any division operations to calculate the parent's byte offset. Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> --- This patch has undergone unit testing using the following code [1]. [1]: static int test(void) { size_t i, p, size, lsbit; for (i = 0; i < 10000; i++) { size = get_random_u32() % (1U << 10); lsbit = size & -size; i = get_random_u32() % (1U << 20) * size + size; p = parent(i, lsbit, size); if (p != (i / size - 1) / 2 * size) return -1; } return 0; } fs/bcachefs/util.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c index bbc83b43162e..f5bbf96df2ce 100644 --- a/fs/bcachefs/util.c +++ b/fs/bcachefs/util.c @@ -907,6 +907,13 @@ static inline void do_swap(void *base, size_t n, size_t size, size); } +static inline size_t parent(size_t i, size_t lsbit, size_t size) +{ + i -= size; + i -= size & -(i & lsbit); + return i >> 1; +} + void eytzinger0_sort(void *base, size_t n, size_t size, int (*cmp_func)(const void *, const void *, size_t), void (*swap_func)(void *, void *, size_t)) -- 2.25.1

This optimization reduces the average number of comparisons required from 2*n*log2(n) - 3*n + o(n) to n*log2(n) + 0.37*n + o(n). When n is sufficiently large, it results in approximately 50% fewer comparisons. Currently, sort_cmp_size employs the textbook version of heapsort, where during the heapify process, each level requires two comparisons to determine the maximum among three elements. In contrast, the bottom-up heapsort, during heapify, only compares two children at each level until reaching a leaf node. Then, it backtracks from the leaf node to find the correct position. Since heapify typically continues until very close to the leaf node, the standard heapify requires about 2*log2(n) comparisons, while the bottom-up variant only needs log2(n) comparisons. Note: This patch depends on patch "bcachefs: Introduce parent function for sort_cmp_size()". The experimental data presented below is based on an array generated by get_random_u32(). | N | comparisons(old) | comparisons(new) | time(old) | time(new) | |-------|------------------|------------------|-----------|-----------| | 10000 | 235498 | 136592 | 17954 us | 14834 us | | 20000 | 510666 | 293254 | 23549 us | 18364 us | | 30000 | 800461 | 457351 | 33361 us | 19493 us | | 40000 | 1101317 | 626661 | 33672 us | 26810 us | | 50000 | 1409743 | 799745 | 42634 us | 34694 us | | 60000 | 1721165 | 974737 | 51414 us | 41367 us | | 70000 | 2037972 | 1152111 | 63084 us | 49146 us | | 80000 | 2362590 | 1333270 | 73802 us | 54715 us | | 90000 | 2690155 | 1516148 | 82693 us | 63070 us | | 100000| 3019730 | 1699757 | 88981 us | 70367 us | Refs: BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small) Ingo Wegener Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993 https://doi.org/10.1016/0304-3975(93)90364-Y Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> --- This patch has undergone unit testing and micro benchmarking using the following code [1]. [1]: static long long int cmp_count = 0; static int mycmp(const void *a, const void *b, size_t size) { u32 _a = *(u32 *)a; u32 _b = *(u32 *)b; cmp_count++; if (_a < _b) return -1; else if (_a > _b) return 1; else return 0; } static int test(void) { size_t N, i; ktime_t start, end; s64 delta; u32 *arr; for (N = 10000; N <= 100000; N += 10000) { arr = kmalloc_array(N, sizeof(u32), GFP_KERNEL); cmp_count = 0; for (i = 0; i < N; i++) arr[i] = get_random_u32(); start = ktime_get(); sort_cmp_size(arr, N, sizeof(u32), mycmp, NULL); end = ktime_get(); delta = ktime_us_delta(end, start); printk(KERN_INFO "time: %lld\n", delta); printk(KERN_INFO "comparisons: %lld\n", cmp_count); for (i = 0; i < N - 1; i++) { if (arr[i] > arr[i+1]) { kfree(arr); return -1; } } kfree(arr); } return 0; } fs/bcachefs/util.c | 52 +++++++++++++++++++++++++++++++--------------- 1 file changed, 35 insertions(+), 17 deletions(-) diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c index f5bbf96df2ce..4dacb2b9a0a5 100644 --- a/fs/bcachefs/util.c +++ b/fs/bcachefs/util.c @@ -979,7 +979,8 @@ void sort_cmp_size(void *base, size_t num, size_t size, void (*swap_func)(void *, void *, size_t size)) { /* pre-scale counters for performance */ - int i = (num/2 - 1) * size, n = num * size, c, r; + int i = (num / 2 - 1) * size, n = num * size, j, k; + const size_t lsbit = size & -size; if (!swap_func) { if (size == 4 && alignment_ok(base, 4)) @@ -992,28 +993,45 @@ void sort_cmp_size(void *base, size_t num, size_t size, /* heapify */ for ( ; i >= 0; i -= size) { - for (r = i; r * 2 + size < n; r = c) { - c = r * 2 + size; - if (c < n - size && - cmp_func(base + c, base + c + size, size) < 0) - c += size; - if (cmp_func(base + r, base + c, size) >= 0) - break; - swap_func(base + r, base + c, size); + /* Find the sift-down path all the way to the leaves. */ + for (j = i; k = j * 2 + size, k + size < n;) + j = cmp_func(base + k, base + k + size, size) > 0 ? k : k + size; + + /* Special case for the last leaf with no sibling. */ + if (j * 2 + size * 2 == n) + j = j * 2 + size; + + /* Backtrack to the correct location. */ + while (j != i && cmp_func(base + i, base + j, size) >= 0) + j = parent(j, lsbit, size); + + /* Shift the element into its correct place. */ + for (k = j; j != i;) { + j = parent(j, lsbit, size); + swap_func(base + j, base + k, size); } } /* sort */ for (i = n - size; i > 0; i -= size) { swap_func(base, base + i, size); - for (r = 0; r * 2 + size < i; r = c) { - c = r * 2 + size; - if (c < i - size && - cmp_func(base + c, base + c + size, size) < 0) - c += size; - if (cmp_func(base + r, base + c, size) >= 0) - break; - swap_func(base + r, base + c, size); + + /* Find the sift-down path all the way to the leaves. */ + for (j = 0; k = j * 2 + size, k + size < i;) + j = cmp_func(base + k, base + k + size, size) > 0 ? k : k + size; + + /* Special case for the last leaf with no sibling. */ + if (j * 2 + size * 2 == i) + j = j * 2 + size; + + /* Backtrack to the correct location. */ + while (j && cmp_func(base, base + j, size) >= 0) + j = parent(j, lsbit, size); + + /* Shift the element into its correct place. */ + for (k = j; j;) { + j = parent(j, lsbit, size); + swap_func(base + j, base + k, size); } } } -- 2.25.1

Optimize the heap_sift_down() macro, resulting in a significant reduction of approximately 50% in the number of comparisons for large random inputs, while maintaining identical results. The current implementation performs two comparisons per level to identify the maximum among three elements. In contrast, the proposed bottom-up variation uses only one comparison per level to assess two children until reaching the leaves. Then, it sifts up until the correct position is determined. Typically, the process of sifting down proceeds to the leaf level, resulting in O(1) secondary comparisons instead of log2(n). This optimization significantly reduces the number of costly indirect function calls and improves overall performance. The experimental data below is derived from first adding N elements generated by get_random_u32() to the heap, and then performing heap_pop until the heap is empty. | N | comparisons(old) | comparisons(new) | time(old) | time(new) | |-------|------------------|------------------|-----------|-----------| | 10000 | 239233 | 142673 | 1219 us | 1000 us | | 20000 | 518498 | 305394 | 2564 us | 1904 us | | 30000 | 812864 | 476594 | 4197 us | 3203 us | | 40000 | 1117553 | 651737 | 5666 us | 4290 us | | 50000 | 1430128 | 830600 | 7156 us | 5574 us | | 60000 | 1746128 | 1012201 | 8862 us | 7036 us | | 70000 | 2066099 | 1196653 | 10454 us | 8111 us | | 80000 | 2394593 | 1383311 | 11993 us | 9602 us | | 90000 | 2727097 | 1572381 | 13501 us | 10718 us | | 100000| 3059841 | 1763083 | 15325 us | 11776 us | Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> --- This patch has undergone unit testing and micro benchmarking using the following code [1]. [1]: static long long int cmp_count = 0; typedef HEAP(u32) heap; int mycmp(heap *h, u32 a, u32 b) { cmp_count++; if (a < b) return -1; if (a > b) return 1; return 0; } int check_heap(heap *h, int (*cmp)(heap *, u32, u32)) { size_t i; for (i = 0; i < h->used / 2; i++) { if (i * 2 + 1 < h->used) if (cmp(h, h->data[i], h->data[i * 2 + 1]) > 0) return -1; if (i * 2 + 2 < h->used) if (cmp(h, h->data[i], h->data[i * 2 + 2]) > 0) return -1; } return 0; } static int test(void) { size_t N, i; u32 x; heap myheap; ktime_t start, end; s64 delta; /* Test for correctness. */ for (N = 1000; N <= 10000; N += 1000) { init_heap(&myheap, N, GFP_KERNEL); for (i = 0; i < N; i++) { heap_add(&myheap, get_random_u32(), mycmp, NULL); if (check_heap(&myheap, mycmp)) return -1; } for (i = 0; i < N; i++) { heap_pop(&myheap, x, mycmp, NULL); if (check_heap(&myheap, mycmp)) return -1; } free_heap(&myheap); } /* Micro-benchmark. */ for (N = 10000; N <= 100000; N += 10000) { cmp_count = 0; init_heap(&myheap, N, GFP_KERNEL); start = ktime_get(); for (i = 0; i < N; i++) heap_add(&myheap, get_random_u32(), mycmp, NULL); for (i = 0; i < N; i++) heap_pop(&myheap, x, mycmp, NULL); end = ktime_get(); delta = ktime_us_delta(end, start); printk(KERN_INFO "time: %lld\n", delta); printk(KERN_INFO "comparisons: %lld\n", cmp_count); free_heap(&myheap); } return 0; } fs/bcachefs/util.h | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h index c75fc31915d3..be22eb63084b 100644 --- a/fs/bcachefs/util.h +++ b/fs/bcachefs/util.h @@ -131,17 +131,20 @@ do { \ #define heap_sift_down(h, i, cmp, set_backpointer) \ do { \ - size_t _c, _j = i; \ + size_t _j, _k; \ \ - for (; _j * 2 + 1 < (h)->used; _j = _c) { \ - _c = _j * 2 + 1; \ - if (_c + 1 < (h)->used && \ - cmp(h, (h)->data[_c], (h)->data[_c + 1]) >= 0) \ - _c++; \ - \ - if (cmp(h, (h)->data[_c], (h)->data[_j]) >= 0) \ - break; \ - heap_swap(h, _c, _j, set_backpointer); \ + for (_j = i; _k = _j * 2 + 1, _k + 1 < (h)->used;) { \ + if (cmp(h, (h)->data[_k], (h)->data[_k + 1]) >= 0) \ + _k++; \ + _j = _k; \ + } \ + if (_j * 2 + 2 == (h)->used) \ + _j = _j * 2 + 1; \ + while (_j != i && cmp(h, (h)->data[i], (h)->data[_j]) <= 0) \ + _j = (_j - 1) / 2; \ + for (_k = _j; _j != i;) { \ + _j = (_j - 1) / 2; \ + heap_swap(h, _j, _k, set_backpointer); \ } \ } while (0) -- 2.25.1

Optimize the heap_sift() macro, resulting in a significant reduction of approximately 50% in the number of comparisons for large random inputs, while maintaining identical results. The current implementation performs two comparisons per level to identify the maximum among three elements. In contrast, the proposed bottom-up variation uses only one comparison per level to assess two children until reaching the leaves. Then, it sifts up until the correct position is determined. Typically, the process of sifting down proceeds to the leaf level, resulting in O(1) secondary comparisons instead of log2(n). This optimization significantly reduces the number of costly indirect function calls and improves overall performance. The experimental data below is derived from first adding N elements generated by get_random_u32() to the heap, and then performing heap_pop until the heap is empty. | N | comparisons(old) | comparisons(new) | time(old) | time(new) | |-------|------------------|------------------|-----------|-----------| | 10000 | 249215 | 164872 | 1253 us | 958 us | | 20000 | 539766 | 350113 | 2693 us | 2059 us | | 30000 | 843297 | 543037 | 4120 us | 3119 us | | 40000 | 1159127 | 739595 | 5624 us | 4221 us | | 50000 | 1482608 | 941655 | 7147 us | 5349 us | | 60000 | 1808772 | 1144716 | 8754 us | 6786 us | | 70000 | 2139443 | 1348878 | 10323 us | 8030 us | | 80000 | 2478304 | 1560892 | 11934 us | 9061 us | | 90000 | 2820532 | 1768678 | 13611 us | 10679 us | | 100000| 3163503 | 1983806 | 15244 us | 11745 us | Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> --- This patch has undergone unit testing and micro benchmarking using the following code [1]. [1]: static long long int cmp_count = 0; typedef DECLARE_HEAP(u32, heap); int mycmp(u32 a, u32 b) { cmp_count++; return a > b; } int check_heap(heap *h, int (*cmp)(u32, u32)) { size_t i; for (i = 0; i < h->used / 2; i++) { if (i * 2 + 1 < h->used) if (cmp(h->data[i], h->data[i * 2 + 1])) return -1; if (i * 2 + 2 < h->used) if (cmp(h->data[i], h->data[i * 2 + 2])) return -1; } return 0; } static int test(void) { size_t N, i; u32 x; heap myheap; ktime_t start, end; s64 delta; /* Test for correctness. */ for (N = 1000; N <= 10000; N += 1000) { init_heap(&myheap, N, GFP_KERNEL); for (i = 0; i < N; i++) { heap_add(&myheap, get_random_u32(), mycmp); if (check_heap(&myheap, mycmp)) return -1; } for (i = 0; i < N; i++) { heap_pop(&myheap, x, mycmp); if (check_heap(&myheap, mycmp)) return -1; } free_heap(&myheap); } /* Micro-benchmark. */ for(N = 10000; N <= 100000; N += 10000) { cmp_count = 0; init_heap(&myheap, N, GFP_KERNEL); start = ktime_get(); for (i = 0; i < N; i++) heap_add(&myheap, get_random_u32(), mycmp); for (i = 0; i < N; i++) heap_pop(&myheap, x, mycmp); end = ktime_get(); delta = ktime_us_delta(end, start); printk(KERN_INFO "time: %lld\n", delta); printk(KERN_INFO "comparisons: %lld\n", cmp_count); free_heap(&myheap); } return 0; } drivers/md/bcache/util.h | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index f61ab1bada6c..3aa74b0d7f0a 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -56,17 +56,20 @@ do { \ #define heap_sift(h, i, cmp) \ do { \ - size_t _r, _j = i; \ + size_t _j, _k; \ \ - for (; _j * 2 + 1 < (h)->used; _j = _r) { \ - _r = _j * 2 + 1; \ - if (_r + 1 < (h)->used && \ - cmp((h)->data[_r], (h)->data[_r + 1])) \ - _r++; \ - \ - if (cmp((h)->data[_r], (h)->data[_j])) \ - break; \ - heap_swap(h, _r, _j); \ + for (_j = i; _k = 2 * _j + 1, _k + 1 < (h)->used;) { \ + if (cmp((h)->data[_k], (h)->data[_k + 1])) \ + _k++; \ + _j = _k; \ + } \ + if (_j * 2 + 2 == (h)->used) \ + _j = _j * 2 + 1; \ + while (_j != i && cmp((h)->data[_j], (h)->data[i])) \ + _j = (_j - 1) / 2; \ + for (_k = _j; _j != i;) { \ + _j = (_j - 1) / 2; \ + heap_swap(h, _j, _k); \ } \ } while (0) -- 2.25.1

On Sun, Jan 21, 2024 at 11:36:46PM +0800, Kuan-Wei Chiu wrote: > When dealing with array indices, the parent's index can be obtained > using the formula (i - 1) / 2. However, when working with byte offsets, > this approach is not straightforward. To address this, we have > introduced a branch-free parent function that does not require any > division operations to calculate the parent's byte offset. This is a good commit message - but it would be even better if it was a function comment on parent() > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> > --- > This patch has undergone unit testing using the following code [1]. > > [1]: > static int test(void) > { > size_t i, p, size, lsbit; > > for (i = 0; i < 10000; i++) { > size = get_random_u32() % (1U << 10); > lsbit = size & -size; > i = get_random_u32() % (1U << 20) * size + size; > p = parent(i, lsbit, size); > if (p != (i / size - 1) / 2 * size) > return -1; > } > > return 0; > } > > fs/bcachefs/util.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c > index bbc83b43162e..f5bbf96df2ce 100644 > --- a/fs/bcachefs/util.c > +++ b/fs/bcachefs/util.c > @@ -907,6 +907,13 @@ static inline void do_swap(void *base, size_t n, size_t size, > size); > } > > +static inline size_t parent(size_t i, size_t lsbit, size_t size) > +{ > + i -= size; > + i -= size & -(i & lsbit); > + return i >> 1; > +} > + > void eytzinger0_sort(void *base, size_t n, size_t size, > int (*cmp_func)(const void *, const void *, size_t), > void (*swap_func)(void *, void *, size_t)) > -- > 2.25.1 >

```
On Sun, Jan 21, 2024 at 11:36:44PM +0800, Kuan-Wei Chiu wrote:
> Hello,
>
> The existing implementations of heap/heapsort follow the conventional
> textbook approach, where each heapify operation requires approximately
> 2*log2(n) comparisons. In this series, I introduce a bottom-up variant
> that reduces the number of comparisons during heapify operations to
> approximately log2(n), while maintaining the same number of swap
> operations.
>
> Thanks,
> Kuan-Wei
>
> Kuan-Wei Chiu (5):
> bcachefs: Optimize eytzinger0_sort() using bottom-up heapsort
> bcachefs: Introduce parent function for sort_cmp_size()
> bcachefs: Optimize sort_cmp_size() using bottom-up heapsort
> bcachefs: Optimize number of comparisons in heap_sift_down
> bcache: Optimize number of comparisons in heap_sift
>
> drivers/md/bcache/util.h | 23 +++++----
> fs/bcachefs/util.c | 109 ++++++++++++++++++++++++++-------------
> fs/bcachefs/util.h | 23 +++++----
> 3 files changed, 98 insertions(+), 57 deletions(-)
Good stuff
While we're looking at this code, we should be doing some cleanup too -
there's no reason for the heap code to be duplicated in bcache and
bcachefs anymore, and it'd also be nice to get fs/bcachefs/eytzinger.h
moved to include/linux and bcache converted to use it.
I also would not be surprised if there's another heap implementation in
include/linux; we'll want to check for that and if there is decide which
is worth keeping.
Would you or Coli be interested in taking that on as well?
```

On Sun, Jan 21, 2024 at 11:21:06AM -0500, Kent Overstreet wrote: > On Sun, Jan 21, 2024 at 11:36:44PM +0800, Kuan-Wei Chiu wrote: > > Hello, > > > > The existing implementations of heap/heapsort follow the conventional > > textbook approach, where each heapify operation requires approximately > > 2*log2(n) comparisons. In this series, I introduce a bottom-up variant > > that reduces the number of comparisons during heapify operations to > > approximately log2(n), while maintaining the same number of swap > > operations. > > > > Thanks, > > Kuan-Wei > > > > Kuan-Wei Chiu (5): > > bcachefs: Optimize eytzinger0_sort() using bottom-up heapsort > > bcachefs: Introduce parent function for sort_cmp_size() > > bcachefs: Optimize sort_cmp_size() using bottom-up heapsort > > bcachefs: Optimize number of comparisons in heap_sift_down > > bcache: Optimize number of comparisons in heap_sift > > > > drivers/md/bcache/util.h | 23 +++++---- > > fs/bcachefs/util.c | 109 ++++++++++++++++++++++++++------------- > > fs/bcachefs/util.h | 23 +++++---- > > 3 files changed, 98 insertions(+), 57 deletions(-) > > Good stuff > > While we're looking at this code, we should be doing some cleanup too - > there's no reason for the heap code to be duplicated in bcache and > bcachefs anymore, and it'd also be nice to get fs/bcachefs/eytzinger.h > moved to include/linux and bcache converted to use it. > > I also would not be surprised if there's another heap implementation in > include/linux; we'll want to check for that and if there is decide which > is worth keeping. > Yes, we have 'min_heap.h' in include/linux. > Would you or Coli be interested in taking that on as well?

On Sun, Jan 21, 2024 at 11:17:30AM -0500, Kent Overstreet wrote: > On Sun, Jan 21, 2024 at 11:36:46PM +0800, Kuan-Wei Chiu wrote: > > When dealing with array indices, the parent's index can be obtained > > using the formula (i - 1) / 2. However, when working with byte offsets, > > this approach is not straightforward. To address this, we have > > introduced a branch-free parent function that does not require any > > division operations to calculate the parent's byte offset. > > This is a good commit message - but it would be even better if it was a > function comment on parent() > Sure, however, it seems that sort_cmp_size() can be directly replaced with the sort function from include/linux. Once we decide on the cleanup tasks, if we still choose to retain this patch, I will make the adjustments. > > > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> > > --- > > This patch has undergone unit testing using the following code [1]. > > > > [1]: > > static int test(void) > > { > > size_t i, p, size, lsbit; > > > > for (i = 0; i < 10000; i++) { > > size = get_random_u32() % (1U << 10); > > lsbit = size & -size; > > i = get_random_u32() % (1U << 20) * size + size; > > p = parent(i, lsbit, size); > > if (p != (i / size - 1) / 2 * size) > > return -1; > > } > > > > return 0; > > } > > > > fs/bcachefs/util.c | 7 +++++++ > > 1 file changed, 7 insertions(+) > > > > diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c > > index bbc83b43162e..f5bbf96df2ce 100644 > > --- a/fs/bcachefs/util.c > > +++ b/fs/bcachefs/util.c > > @@ -907,6 +907,13 @@ static inline void do_swap(void *base, size_t n, size_t size, > > size); > > } > > > > +static inline size_t parent(size_t i, size_t lsbit, size_t size) > > +{ > > + i -= size; > > + i -= size & -(i & lsbit); > > + return i >> 1; > > +} > > + > > void eytzinger0_sort(void *base, size_t n, size_t size, > > int (*cmp_func)(const void *, const void *, size_t), > > void (*swap_func)(void *, void *, size_t)) > > -- > > 2.25.1 > >

```
On Mon, Jan 22, 2024 at 01:05:28AM +0800, Kuan-Wei Chiu wrote:
> On Sun, Jan 21, 2024 at 11:17:30AM -0500, Kent Overstreet wrote:
> > On Sun, Jan 21, 2024 at 11:36:46PM +0800, Kuan-Wei Chiu wrote:
> > > When dealing with array indices, the parent's index can be obtained
> > > using the formula (i - 1) / 2. However, when working with byte offsets,
> > > this approach is not straightforward. To address this, we have
> > > introduced a branch-free parent function that does not require any
> > > division operations to calculate the parent's byte offset.
> >
> > This is a good commit message - but it would be even better if it was a
> > function comment on parent()
> >
> Sure, however, it seems that sort_cmp_size() can be directly replaced
> with the sort function from include/linux. Once we decide on the
> cleanup tasks, if we still choose to retain this patch, I will make the
> adjustments.
nice catch - looks like sort_r() is the more recent addition, so that's
how that happened.
```

```
On Mon, Jan 22, 2024 at 12:55:51AM +0800, Kuan-Wei Chiu wrote:
> On Sun, Jan 21, 2024 at 11:21:06AM -0500, Kent Overstreet wrote:
> > On Sun, Jan 21, 2024 at 11:36:44PM +0800, Kuan-Wei Chiu wrote:
> > > Hello,
> > >
> > > The existing implementations of heap/heapsort follow the conventional
> > > textbook approach, where each heapify operation requires approximately
> > > 2*log2(n) comparisons. In this series, I introduce a bottom-up variant
> > > that reduces the number of comparisons during heapify operations to
> > > approximately log2(n), while maintaining the same number of swap
> > > operations.
> > >
> > > Thanks,
> > > Kuan-Wei
> > >
> > > Kuan-Wei Chiu (5):
> > > bcachefs: Optimize eytzinger0_sort() using bottom-up heapsort
> > > bcachefs: Introduce parent function for sort_cmp_size()
> > > bcachefs: Optimize sort_cmp_size() using bottom-up heapsort
> > > bcachefs: Optimize number of comparisons in heap_sift_down
> > > bcache: Optimize number of comparisons in heap_sift
> > >
> > > drivers/md/bcache/util.h | 23 +++++----
> > > fs/bcachefs/util.c | 109 ++++++++++++++++++++++++++-------------
> > > fs/bcachefs/util.h | 23 +++++----
> > > 3 files changed, 98 insertions(+), 57 deletions(-)
> >
> > Good stuff
> >
> > While we're looking at this code, we should be doing some cleanup too -
> > there's no reason for the heap code to be duplicated in bcache and
> > bcachefs anymore, and it'd also be nice to get fs/bcachefs/eytzinger.h
> > moved to include/linux and bcache converted to use it.
> >
> > I also would not be surprised if there's another heap implementation in
> > include/linux; we'll want to check for that and if there is decide which
> > is worth keeping.
> >
> Yes, we have 'min_heap.h' in include/linux.
So that has the advantage of more readable code - functions instead of
macros - whereas my version has the type safe interface.
We could combine the two approaches, and put a type-safe interface on
top of the min_heap.h code with some small macro wrappers - see
generic-radix-tree.h for an example of how that's done.
min_heap.h has only one user though? I don't think I can quite believe
that's the only other code in the kernel using a heap, there must be
more open coded out there...
```

On Sun, Jan 21, 2024 at 12:41:55PM -0500, Kent Overstreet wrote: > On Mon, Jan 22, 2024 at 12:55:51AM +0800, Kuan-Wei Chiu wrote: > > On Sun, Jan 21, 2024 at 11:21:06AM -0500, Kent Overstreet wrote: > > > On Sun, Jan 21, 2024 at 11:36:44PM +0800, Kuan-Wei Chiu wrote: > > > > Hello, > > > > > > > > The existing implementations of heap/heapsort follow the conventional > > > > textbook approach, where each heapify operation requires approximately > > > > 2*log2(n) comparisons. In this series, I introduce a bottom-up variant > > > > that reduces the number of comparisons during heapify operations to > > > > approximately log2(n), while maintaining the same number of swap > > > > operations. > > > > > > > > Thanks, > > > > Kuan-Wei > > > > > > > > Kuan-Wei Chiu (5): > > > > bcachefs: Optimize eytzinger0_sort() using bottom-up heapsort > > > > bcachefs: Introduce parent function for sort_cmp_size() > > > > bcachefs: Optimize sort_cmp_size() using bottom-up heapsort > > > > bcachefs: Optimize number of comparisons in heap_sift_down > > > > bcache: Optimize number of comparisons in heap_sift > > > > > > > > drivers/md/bcache/util.h | 23 +++++---- > > > > fs/bcachefs/util.c | 109 ++++++++++++++++++++++++++------------- > > > > fs/bcachefs/util.h | 23 +++++---- > > > > 3 files changed, 98 insertions(+), 57 deletions(-) > > > > > > Good stuff > > > > > > While we're looking at this code, we should be doing some cleanup too - > > > there's no reason for the heap code to be duplicated in bcache and > > > bcachefs anymore, and it'd also be nice to get fs/bcachefs/eytzinger.h > > > moved to include/linux and bcache converted to use it. > > > > > > I also would not be surprised if there's another heap implementation in > > > include/linux; we'll want to check for that and if there is decide which > > > is worth keeping. > > > > > Yes, we have 'min_heap.h' in include/linux. > > So that has the advantage of more readable code - functions instead of > macros - whereas my version has the type safe interface. > > We could combine the two approaches, and put a type-safe interface on > top of the min_heap.h code with some small macro wrappers - see > generic-radix-tree.h for an example of how that's done. Without modifying the interface provided by min_heap.h, it seems challenging to implement the functionality of heap_add due to the relationship with heap_setbackpointer. Additionally, when looking into the code in generic-radix-tree.h, should we replace type[0] with type[]? This is because zero-length arrays are deprecated language features mentioned in document [1]. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays [1] > > min_heap.h has only one user though? I don't think I can quite believe > that's the only other code in the kernel using a heap, there must be > more open coded out there... I'm not sure why, but it seems that in the kernel, other places using the heap implement their own subsystem-specific solutions rather than utilizing a generic heap interface. For instance, kernel/sched/cpudeadline.c and net/sched/sch_cake.c both have their own implementations.

On Mon, Jan 22, 2024 at 11:06:54PM +0800, Kuan-Wei Chiu wrote: > On Sun, Jan 21, 2024 at 12:41:55PM -0500, Kent Overstreet wrote: > > On Mon, Jan 22, 2024 at 12:55:51AM +0800, Kuan-Wei Chiu wrote: > > > On Sun, Jan 21, 2024 at 11:21:06AM -0500, Kent Overstreet wrote: > > > > On Sun, Jan 21, 2024 at 11:36:44PM +0800, Kuan-Wei Chiu wrote: > > > > > Hello, > > > > > > > > > > The existing implementations of heap/heapsort follow the conventional > > > > > textbook approach, where each heapify operation requires approximately > > > > > 2*log2(n) comparisons. In this series, I introduce a bottom-up variant > > > > > that reduces the number of comparisons during heapify operations to > > > > > approximately log2(n), while maintaining the same number of swap > > > > > operations. > > > > > > > > > > Thanks, > > > > > Kuan-Wei > > > > > > > > > > Kuan-Wei Chiu (5): > > > > > bcachefs: Optimize eytzinger0_sort() using bottom-up heapsort > > > > > bcachefs: Introduce parent function for sort_cmp_size() > > > > > bcachefs: Optimize sort_cmp_size() using bottom-up heapsort > > > > > bcachefs: Optimize number of comparisons in heap_sift_down > > > > > bcache: Optimize number of comparisons in heap_sift > > > > > > > > > > drivers/md/bcache/util.h | 23 +++++---- > > > > > fs/bcachefs/util.c | 109 ++++++++++++++++++++++++++------------- > > > > > fs/bcachefs/util.h | 23 +++++---- > > > > > 3 files changed, 98 insertions(+), 57 deletions(-) > > > > > > > > Good stuff > > > > > > > > While we're looking at this code, we should be doing some cleanup too - > > > > there's no reason for the heap code to be duplicated in bcache and > > > > bcachefs anymore, and it'd also be nice to get fs/bcachefs/eytzinger.h > > > > moved to include/linux and bcache converted to use it. > > > > > > > > I also would not be surprised if there's another heap implementation in > > > > include/linux; we'll want to check for that and if there is decide which > > > > is worth keeping. > > > > > > > Yes, we have 'min_heap.h' in include/linux. > > > > So that has the advantage of more readable code - functions instead of > > macros - whereas my version has the type safe interface. > > > > We could combine the two approaches, and put a type-safe interface on > > top of the min_heap.h code with some small macro wrappers - see > > generic-radix-tree.h for an example of how that's done. > > Without modifying the interface provided by min_heap.h, it seems > challenging to implement the functionality of heap_add due to the > relationship with heap_setbackpointer. min_heap.h has the same functionality, different interface - updating the callers for an interface change is fine. > > Additionally, when looking into the code in generic-radix-tree.h, > should we replace type[0] with type[]? This is because zero-length > arrays are deprecated language features mentioned in document [1]. Zero length arrays are deprecated as VLAs, but this isn't a VLA - we're not storing anything there, the variable is just so that macros have access to the type. > Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays [1] > > > > min_heap.h has only one user though? I don't think I can quite believe > > that's the only other code in the kernel using a heap, there must be > > more open coded out there... > > I'm not sure why, but it seems that in the kernel, other places using > the heap implement their own subsystem-specific solutions rather than > utilizing a generic heap interface. For instance, > kernel/sched/cpudeadline.c and net/sched/sch_cake.c both have their own > implementations. Sounds like a fun cleanup project :)

On Mon, Jan 22, 2024 at 11:06:39AM -0500, Kent Overstreet wrote: > On Mon, Jan 22, 2024 at 11:06:54PM +0800, Kuan-Wei Chiu wrote: > > On Sun, Jan 21, 2024 at 12:41:55PM -0500, Kent Overstreet wrote: > > > On Mon, Jan 22, 2024 at 12:55:51AM +0800, Kuan-Wei Chiu wrote: > > > > On Sun, Jan 21, 2024 at 11:21:06AM -0500, Kent Overstreet wrote: > > > > > On Sun, Jan 21, 2024 at 11:36:44PM +0800, Kuan-Wei Chiu wrote: > > > > > > Hello, > > > > > > > > > > > > The existing implementations of heap/heapsort follow the conventional > > > > > > textbook approach, where each heapify operation requires approximately > > > > > > 2*log2(n) comparisons. In this series, I introduce a bottom-up variant > > > > > > that reduces the number of comparisons during heapify operations to > > > > > > approximately log2(n), while maintaining the same number of swap > > > > > > operations. > > > > > > > > > > > > Thanks, > > > > > > Kuan-Wei > > > > > > > > > > > > Kuan-Wei Chiu (5): > > > > > > bcachefs: Optimize eytzinger0_sort() using bottom-up heapsort > > > > > > bcachefs: Introduce parent function for sort_cmp_size() > > > > > > bcachefs: Optimize sort_cmp_size() using bottom-up heapsort > > > > > > bcachefs: Optimize number of comparisons in heap_sift_down > > > > > > bcache: Optimize number of comparisons in heap_sift > > > > > > > > > > > > drivers/md/bcache/util.h | 23 +++++---- > > > > > > fs/bcachefs/util.c | 109 ++++++++++++++++++++++++++------------- > > > > > > fs/bcachefs/util.h | 23 +++++---- > > > > > > 3 files changed, 98 insertions(+), 57 deletions(-) > > > > > > > > > > Good stuff > > > > > > > > > > While we're looking at this code, we should be doing some cleanup too - > > > > > there's no reason for the heap code to be duplicated in bcache and > > > > > bcachefs anymore, and it'd also be nice to get fs/bcachefs/eytzinger.h > > > > > moved to include/linux and bcache converted to use it. > > > > > > > > > > I also would not be surprised if there's another heap implementation in > > > > > include/linux; we'll want to check for that and if there is decide which > > > > > is worth keeping. > > > > > > > > > Yes, we have 'min_heap.h' in include/linux. > > > > > > So that has the advantage of more readable code - functions instead of > > > macros - whereas my version has the type safe interface. > > > > > > We could combine the two approaches, and put a type-safe interface on > > > top of the min_heap.h code with some small macro wrappers - see > > > generic-radix-tree.h for an example of how that's done. > > > > Without modifying the interface provided by min_heap.h, it seems > > challenging to implement the functionality of heap_add due to the > > relationship with heap_setbackpointer. > > min_heap.h has the same functionality, different interface - updating > the callers for an interface change is fine. > OK, I'll take some time to do these cleanups. > > > > Additionally, when looking into the code in generic-radix-tree.h, > > should we replace type[0] with type[]? This is because zero-length > > arrays are deprecated language features mentioned in document [1]. > > Zero length arrays are deprecated as VLAs, but this isn't a VLA - we're > not storing anything there, the variable is just so that macros have > access to the type. > > > Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays [1] > > > > > > min_heap.h has only one user though? I don't think I can quite believe > > > that's the only other code in the kernel using a heap, there must be > > > more open coded out there... > > > > I'm not sure why, but it seems that in the kernel, other places using > > the heap implement their own subsystem-specific solutions rather than > > utilizing a generic heap interface. For instance, > > kernel/sched/cpudeadline.c and net/sched/sch_cake.c both have their own > > implementations. > > Sounds like a fun cleanup project :)