Re: [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]

From: Florian Weimer <fw@deneb.enyo.de>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>,
	 linux-mm@kvack.org,  Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]
Date: Wed, 16 Oct 2019 21:38:49 +0200
Message-ID: <87blugh452.fsf@mid.deneb.enyo.de>
In-Reply-To: <96023250-6168-3806-320a-a3468f1cd8c9@suse.cz> (Vlastimil Babka's message of "Tue, 1 Oct 2019 11:10:22 +0200")

* Vlastimil Babka:

> On 9/30/19 11:17 PM, Dave Chinner wrote:
>> On Mon, Sep 30, 2019 at 09:07:53PM +0200, Florian Weimer wrote:
>>> * Dave Chinner:
>>>
>>>> On Mon, Sep 30, 2019 at 09:28:27AM +0200, Florian Weimer wrote:
>>>>> Simply running “du -hc” on a large directory tree causes du to be
>>>>> killed because of kernel paging request failure in the XFS code.
>>>>
>>>> dmesg output? if the system was still running, then you might be
>>>> able to pull the trace from syslog. But we can't do much without
>>>> knowing what the actual failure was....
>>>
>>> Huh.  I actually have something in syslog:
>>>
>>> [ 4001.238411] BUG: kernel NULL pointer dereference, address:
>>> 0000000000000000
>>> [ 4001.238415] #PF: supervisor read access in kernel mode
>>> [ 4001.238417] #PF: error_code(0x0000) - not-present page
>>> [ 4001.238418] PGD 0 P4D 0 
>>> [ 4001.238420] Oops: 0000 [#1] SMP PTI
>>> [ 4001.238423] CPU: 3 PID: 143 Comm: kswapd0 Tainted: G I 5.2.16fw+
>>> #1
>>> [ 4001.238424] Hardware name: System manufacturer System Product
>>> Name/P6X58D-E, BIOS 0701 05/10/2011
>>> [ 4001.238430] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
>> 
>> That's memory compaction code it's crashed in.
>> 
>>> [ 4001.238432] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0
>>> 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1
>>> e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00
>>> 00 00 4c 89 f7
>
> Tried to decode it, but couldn't match it to source code, my version of
> compiled code is too different. Would it be possible to either send
> mm/compaction.o from the matching build, or output of 'objdump -d -l'
> for the __reset_isolation_pfn function?

(dropping the fs lists)

I got another crash, this time triggered by rsync (large tree with
many small files, few files changed).

Oops:

[41969.140117] BUG: kernel NULL pointer dereference, address: 0000000000000000
[41969.140121] #PF: supervisor read access in kernel mode
[41969.140122] #PF: error_code(0x0000) - not-present page
[41969.140123] PGD 0 P4D 0
[41969.140125] Oops: 0000 [#1] SMP PTI
[41969.140127] CPU: 5 PID: 144 Comm: kswapd0 Tainted: G          I       5.2.18fw+ #10
[41969.140128] Hardware name: System manufacturer System Product Name/P6X58D-E, BIOS 0701    05/10/2011
[41969.140133] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
[41969.140134] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
[41969.140135] RSP: 0018:ffffc900003ffde0 EFLAGS: 00010246
[41969.140137] RAX: 000000000004fdac RBX: 0000000000118000 RCX: 0000000000000000
[41969.140138] RDX: 0000000000000000 RSI: 0000000000000230 RDI: ffff88833fffa000
[41969.140138] RBP: ffffc900003ffe18 R08: 000000000000003c R09: ffff888335080000
[41969.140139] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000001
[41969.140140] R13: 0000000000000001 R14: ffff888338dc01c0 R15: 0000000000000001
[41969.140141] FS:  0000000000000000(0000) GS:ffff888333d40000(0000) knlGS:0000000000000000
[41969.140142] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41969.140143] CR2: 0000000000000000 CR3: 000000000200a001 CR4: 00000000000206e0
[41969.140144] Call Trace:
[41969.140147]  __reset_isolation_suitable+0x9b/0x120
[41969.140149]  reset_isolation_suitable+0x3b/0x40
[41969.140152]  kswapd+0x98/0x300
[41969.140154]  ? wait_woken+0x80/0x80
[41969.140157]  kthread+0x114/0x130
[41969.140158]  ? balance_pgdat+0x450/0x450
[41969.140159]  ? kthread_park+0x80/0x80
[41969.140162]  ret_from_fork+0x1f/0x30
[41969.140163] Modules linked in: usb_storage nfnetlink 8021q garp stp llc fuse ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter xt_state xt_conntrack iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter tun ip6_tables binfmt_misc mxm_wmi evdev snd_hda_codec_hdmi coretemp serio_raw snd_hda_intel kvm_intel snd_hda_codec kvm snd_hwdep irqbypass snd_hda_core pcspkr snd_pcm snd_timer snd soundcore sg i7core_edac asus_atk0110 wmi button loop ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 multipath linear md_mod hid_generic usbhid hid crc32c_intel psmouse sr_mod cdrom radeon e1000e xhci_pci ptp ehci_pci uhci_hcd xhci_hcd pps_core ehci_hcd sky2 usbcore ttm usb_common sd_mod
[41969.140187] CR2: 0000000000000000
[41969.140189] ---[ end trace e27ddb472a95c047 ]---

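This time, I've got a kernel with debugging information (still
5.2.18).  The crash is at offset 0x39f in the objdump output, which
is __reset_isolation_pfn+0x27f as in the oops.  The faulting
instruction is the load through %rdx, which is zero in the register
dump: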

        if (!mem_section[SECTION_NR_TO_ROOT(nr)])
     384:       48 c1 ea 35             shr    $0x35,%rdx
     388:       48 8b 14 d7             mov    (%rdi,%rdx,8),%rdx
     38c:       48 c1 e8 2d             shr    $0x2d,%rax
     390:       48 85 d2                test   %rdx,%rdx
     393:       74 0a                   je     39f <__reset_isolation_pfn+0x27f>
        return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
     395:       0f b6 c0                movzbl %al,%eax
     398:       48 c1 e0 04             shl    $0x4,%rax
     39c:       48 01 c2                add    %rax,%rdx
        unsigned long map = section->section_mem_map;
     39f:       48 8b 02                mov    (%rdx),%rax
                                clear_pageblock_skip(page);
     3a2:       4c 89 f2                mov    %r14,%rdx
     3a5:       41 b8 01 00 00 00       mov    $0x1,%r8d
     3ab:       31 f6                   xor    %esi,%esi
     3ad:       b9 03 00 00 00          mov    $0x3,%ecx
     3b2:       4c 89 f7                mov    %r14,%rdi

Hmm, -l output is likely more helpful here:

/home/fw/src/linux/linux/mm/compaction.c:293
     37a:       a8 10                   test   $0x10,%al
     37c:       74 bc                   je     33a <__reset_isolation_pfn+0x21a>
page_to_section():
/home/fw/src/linux/linux/./include/linux/mm.h:1265
     37e:       49 8b 16                mov    (%r14),%rdx
     381:       48 89 d0                mov    %rdx,%rax
__nr_to_section():
/home/fw/src/linux/linux/./include/linux/mmzone.h:1218
     384:       48 c1 ea 35             shr    $0x35,%rdx
     388:       48 8b 14 d7             mov    (%rdi,%rdx,8),%rdx
page_to_section():
/home/fw/src/linux/linux/./include/linux/mm.h:1265
     38c:       48 c1 e8 2d             shr    $0x2d,%rax
__nr_to_section():
/home/fw/src/linux/linux/./include/linux/mmzone.h:1218
     390:       48 85 d2                test   %rdx,%rdx
     393:       74 0a                   je     39f <__reset_isolation_pfn+0x27f>
/home/fw/src/linux/linux/./include/linux/mmzone.h:1220
     395:       0f b6 c0                movzbl %al,%eax
     398:       48 c1 e0 04             shl    $0x4,%rax
     39c:       48 01 c2                add    %rax,%rdx
__section_mem_map_addr():
/home/fw/src/linux/linux/./include/linux/mmzone.h:1247
     39f:       48 8b 02                mov    (%rdx),%rax
__reset_isolation_pfn():
/home/fw/src/linux/linux/mm/compaction.c:294
     3a2:       4c 89 f2                mov    %r14,%rdx
     3a5:       41 b8 01 00 00 00       mov    $0x1,%r8d
     3ab:       31 f6                   xor    %esi,%esi
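
For readers following the disassembly, the lookup can be modelled in
user space.  This is only a sketch under assumptions: the shift counts
(0x2d for the section number, 0x35 for the root index) and the 16-byte
entry size are read off the instructions above rather than taken from
the kernel configuration, and the names mem_section_model,
flags_to_section_nr, section_nr_to_root, section_nr_in_root and
nr_to_section_model are made up for illustration.  It shows how a
page->flags value selects a root in mem_section, and that a NULL root
yields a NULL section pointer, which the load at 39f then dereferences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed constants, inferred from the listing (not from .config):
 *   shr $0x2d  -> section number lives in the top bits of page->flags
 *   shl $0x4   -> each section entry is 16 bytes
 *   movzbl %al -> 256 entries per root (index is the low 8 bits)
 */
#define MODEL_SECTIONS_PGSHIFT  45
#define MODEL_SECTIONS_PER_ROOT 256UL
#define MODEL_SECTION_ROOT_MASK (MODEL_SECTIONS_PER_ROOT - 1)

/* Hypothetical stand-in for struct mem_section; only the size matters. */
struct mem_section_model {
	uint64_t section_mem_map;
	uint64_t pad;
};
static_assert(sizeof(struct mem_section_model) == 16,
	      "model must match the 16-byte stride seen in the listing");

/* Models the shr $0x2d at 38c. */
static uint64_t flags_to_section_nr(uint64_t flags)
{
	return flags >> MODEL_SECTIONS_PGSHIFT;
}

/* Models the shr $0x35 at 384 (0x35 - 0x2d = 8 bits of in-root index). */
static uint64_t section_nr_to_root(uint64_t nr)
{
	return nr / MODEL_SECTIONS_PER_ROOT;
}

/* Models the movzbl at 395. */
static uint64_t section_nr_in_root(uint64_t nr)
{
	return nr & MODEL_SECTION_ROOT_MASK;
}

/*
 * Same shape as the __nr_to_section() lines quoted in the listing:
 * a NULL root gives a NULL section.  The bounds check on nroots is an
 * addition for this model only; the quoted kernel lines do not show one.
 */
static struct mem_section_model *
nr_to_section_model(struct mem_section_model **roots, size_t nroots,
		    uint64_t nr)
{
	uint64_t root = section_nr_to_root(nr);

	if (root >= nroots || !roots[root])
		return NULL;
	return &roots[root][section_nr_in_root(nr)];
}
```

If RAX = 0x4fdac from the register dump is taken as the section number
(the shr $0x2d result appears to be still in %rax when the je at 393 is
taken), section_nr_to_root() gives 0x4fd and section_nr_in_root() gives
0xac.  A caller that reads section_mem_map through the returned pointer
without checking it for NULL faults at address 0, which is what CR2
shows.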

Lines 293 and 294 of mm/compaction.c are in this loop:

  286         /*
  287          * Only clear the hint if a sample indicates there is either a
  288          * free page or an LRU page in the block. One or other condition
  289          * is necessary for the block to be a migration source/target.
  290          */
  291         do {
  292                 if (pfn_valid_within(pfn)) {
  293                         if (check_source && PageLRU(page)) {
  294                                 clear_pageblock_skip(page);
  295                                 return true;
  296                         }
  297 
  298                         if (check_target && PageBuddy(page)) {
  299                                 clear_pageblock_skip(page);
  300                                 return true;
  301                         }
  302                 }
  303 
  304                 page += (1 << PAGE_ALLOC_COSTLY_ORDER);
  305                 pfn += (1 << PAGE_ALLOC_COSTLY_ORDER);
  306         } while (page < end_page);