On Fri, 15 Mar 2019 at 09:20:58 +0100, Geert Uytterhoeven wrote:
> On Fri, Mar 15, 2019 at 5:33 AM George Spelvin wrote:
>> On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
>>> On Tue, Mar 05, 2019 at 03:06:44AM +0000, George Spelvin wrote:
>>>> +	for (bit = 1; count & bit; bit <<= 1) {
>>>> +		cur = merge(priv, (cmp_func)cmp, pending, cur);
>>>> +		pending = pending->prev;	/* Untouched by merge() */
>>>> 	}
>>>
>>> Wouldn't be it the same to
>>>
>>> 	bit = ffz(count);
>>> 	while (bit--) {
>>> 		...
>>> 	}
>>> ?
>>>
>>> Though I dunno which one is generating better code.
>>
>> One question I should ask everyone: should "count" be 32 or 64 bits
>> on 64-bit machines?  That would let x86 save a few REX bytes.
>> (815 vs. 813 byte code, if anyone cares.)
>>
>> Allegedly ARM can save a few pJ by gating the high 32 bits of the
>> ALU.
>>
>> Most other 64-bit processors would prefer 64-bit operations as it
>> saves masking operations.
>>
>> If we never sort a list with more than 2^32 entries, it makes no
>> difference.
>>
>> If we use a 32-bit count and we *do* sort a list with more than
>> 2^32 entries, then it still sorts, but the performance degrades to
>> O((n/2^32)^2).
>>
>> Just how often do we expect the kernel to face lists that long?
>> (Note that the old code was O((n/2^20)^2).)
>
> Using size_t sounds most logical to me (argument of least surprise).

Yes, it is the obvious solution, which is why it's my default choice.

But a bit of thought shows that a list long enough to break a 32-bit
implementation is beyond ridiculous.

The list must be at least 3 * 2^32 elements long to make the sort
merge non-optimally.  That's 1.5 * 2^37 bytes (192 GiB) of list_head
structures alone; at least double that for any practical application.

And 32 * 3 * 2^32 + (2 + 3) * 2^32 = 101 * 2^32 = 1.57 * 2^38 compares.

That seems like a lot, but it's not the bottleneck.  Each compare
reads from a new list element, and pretty soon they'll miss all
caches and go to main memory.

Since the memory locations are random, for any small subset of the
list, you'll get only one element per cache line.  A 32 MiB L3 cache
is 2^19 cache lines (assuming 64 B lines), so merge levels 20 through
33 will go to main memory.  That's (12 * 3 + 5) * 2^32 = 1.28 * 2^37
cache misses.  At 60 ns each (typical local DRAM access time on i7 &
Xeon according to Intel), that's a hard minimum of 10565 seconds =
2h 56m 05s in one list_sort call.

This is definitely the scale of problem where a multithreaded sort is
called for.

It's *so* impossible that maybe it's worth trading that capability
for a couple of bytes in the inner loop.

>> In the code, I could do something like
>>
>> #ifdef CONFIG_X86_64
>> /* Comment explaining why */
>> typedef uint32_t count_t;
>> #else
>> typedef size_t count_t;
>> #endif
>>
>> ...
>> 	count_t count = 0;
>
> Using different types makes it more complex, e.g. to print the value
> in debug code.
> And adding more typedefs is frowned upon.

It's a *local* typedef, purely for the purpose of moving #ifdef
clutter out of the function declaration.  I agree that *global*
typedefs are discouraged.  (There's a sketch of what I mean at the
end of this mail.)

As for debugging, that's a red herring; it's easy to cast to (size_t).

> Just my 0.02€.

Thank you for considering the issue!

>> I prefer the shorter _ints and _longs names, but this is just not a
>> hill I want to die on.
>
> Argument of least surprise: don't call something a duck if it's not
> guaranteed to behave like a duck.
>
> If I read "long", this triggers a warning flag in my head: be
> careful, this is 32-bit on 32-bit platforms, and 64-bit on 64-bit
> platforms.

Good point.  And "_longlong" or "_llong" is a bit ugly, too.
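
To make the "local typedef" point above concrete, this is roughly the
shape I have in mind.  (Sketch only: the CONFIG_X86_64 condition is
just a placeholder for "whatever configs benefit", the function is a
dummy, and the merge loop is elided.)

	#include <stddef.h>
	#include <stdint.h>

	struct list_head;	/* opaque here; only the typedef placement matters */

	static void sort_sketch(struct list_head *head)
	{
		/*
		 * Block-scope typedef: the #ifdef clutter stays inside the
		 * function, the prototype and callers are untouched, and
		 * debug code can simply cast the count to (size_t).
		 */
	#ifdef CONFIG_X86_64			/* placeholder condition */
		typedef uint32_t count_t;	/* 32-bit count saves REX bytes */
	#else
		typedef size_t count_t;		/* native width elsewhere */
	#endif
		count_t count = 0;

		(void)head;
		(void)count;			/* merge loop elided */
	}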
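
And going back to Andy's ffz() question at the top of this mail, here
is a quick userspace sketch showing that the two forms iterate the
same number of times.  (my_ffz() is just a stand-in built on
__builtin_ctzl(~x), which is roughly what the generic ffz() reduces
to; the real loop of course does a merge per iteration instead of
counting.)

	#include <assert.h>
	#include <stdio.h>

	/* Userspace stand-in for ffz(): index of the first zero bit. */
	static unsigned long my_ffz(unsigned long x)
	{
		return __builtin_ctzl(~x);
	}

	int main(void)
	{
		unsigned long count;

		for (count = 0; count < (1UL << 20); count++) {
			unsigned long bit, iters = 0;

			/* Same loop condition as the patch: one pass per
			 * trailing 1 bit of count. */
			for (bit = 1; count & bit; bit <<= 1)
				iters++;

			/* Andy's form runs ffz(count) times. */
			assert(iters == my_ffz(count));
		}
		printf("Both forms agree for every count tested.\n");
		return 0;
	}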