On 1/17/23 19:25, Linus Torvalds wrote: > On Tue, Jan 17, 2023 at 4:22 AM Feng Tang wrote: >> >> With the following patch to use 'O1' instead 'O2' gcc optoin for >> page_alloc.c, the list corruption issue can't be reproduced for >> commit 7118fc2906 in 1000 runs. > > Ugh. > > It would be lovely if you could just narrow it down with > > #pragma GCC optimize ("O1") > ... > #pragma GCC optimize ("O2") > > around just that prep_compound_page(), but when I tried it myself I > get some function attribute mismatch errors. > > >> As is can't be reproduced with X86_64 build, it could be i386 >> compiling related. > > Your particular config causes a huge amount of nasty 64-bit arithmetic > according to the objdump code, with sequences like > > c13b3cbb: 83 05 d0 28 6c c5 01 addl $0x1,0xc56c28d0 > c13b3cc2: 83 15 d4 28 6c c5 00 adcl $0x0,0xc56c28d4 > > which seems to be just from some coverage profiling being on > (CONFIG_GCOV?), or something. It makes it very hard to read the code. I think whatever this is, it's also responsible for the bug. Now from all the dumps I've seen so far, the struct page corruptions looked like prep_compound_page() is called more times than it should. Which didn't make sense as the code is very simple, a loop of 1 << order - 1 iterations. If I look at Feng's disassembly of 48b8d744ea84 (the "good" parent commit), I can work out that edi holds the 1 << order, ebx is initialized as 1, and there's a nice clear "inc ebx, cmp ebx, edi, jne ". Now the disassembly of the "bad" commit is much harder to follow when it comes to the iteration control. There are xors and or's involved so I didn't even try to work out the meaning. Clearly it can't be deterministically wrong or it would crash always and fast. But crucially I think it uses memory addresses 0xc56c28f8 and 0xc56c28fc in a racy way and that could explain the bug. Before the loop iteration itself, it initializes several registers from these addreses, and the values end up modifying also registers used for the loop iteration control (ebx/esi) And crucially in the loop iteration it also writes some values to the same addresses, and there seems to be no synchronization whatsoever. So it seems to me, even if these addresses are supposed to be "private" to prep_compound_page(), running two prep_compound_page() in parallel, or even due to interrupt (Hyeonggon did reproduce this with -smp 1 as well) can result in a race ultimately affecting the number of iterations taken. Attaching the disasm I annotated for more details. Everything register whose value is coming from 0xc56c28f8/0xc56c28fc in some way is annotated as "crap" there (sorry). > You also have UBSAN enabled, which - again - makes for some really > grotty asm that hides any actual logic. > > Finally, your objdump version also does some horrendous decoding, like > > c13b3e29: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi > > which is just a 7-byte 'nop' instruction, but again, it makes it > rather hard to actually read the code. > > With the i386 defconfig, gcc generates a function that is just ~30 > instructions for me, so this makes a huge difference in the legibility > of the code. > > I wonder if you can recreate the issue with a much more > straightforward config. By all means, leave DEBUG_PAGEALLOC and SLUB > debugging on, but without the things like UBSAN and GCOV. > >> I also objdumped 'prep_compound_page' for vmlinux of 7118fc2906 and >> its parent commit 48b8d744ea84, which have big difference than the >> simple 'set_page_count()' change, but I can't tell which part is >> abnormal, so attach them for further check. > > Yeah, I can't make heads or tails of them either, see above on how > illegible the objdump files are. And that's despite not even having > all of prep_compound_page() in them (it's missing > prep_compound_page.cold, which is probably just UBSAN fixup code, but > who knows..) > > That said, with the i386 defconfig, the only change from adding > set_page_count() to the loop seems to be exactly that: > > .L589: > - movl $1024, 12(%eax) > + movl $0, 28(%eax) > addl $32, %eax > + movl $1024, -20(%eax) > movl %esi, -28(%eax) > movl $0, -12(%eax) > cmpl %edx, %eax > > (don't ask me why gcc does *one* access using the pre-incremented > pointer, and then the rest to the post-incremented ones, but whatever > - it means that it's not just "add a mov $0", it's also changing how > the > > p->mapping = TAIL_MAPPING; > > instruction is done, which is that > > - movl $1024, 12(%eax) > + movl $1024, -20(%eax) > > part of the change) > > Linus