linux-xfs.vger.kernel.org archive mirror
* xfs_inode not reclaimed/memory leak on 5.2.16
@ 2019-09-30  7:28 Florian Weimer
  2019-09-30  8:54 ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Weimer @ 2019-09-30  7:28 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Simply running “du -hc” on a large directory tree causes du to be
killed because of a kernel paging request failure in the XFS code.

I ran slabtop, and it showed tons of xfs_inode objects.

The system was rather unhappy after that, so I wasn't able to capture
much more information.
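In case it helps next time, the slab numbers can be grabbed non-interactively instead of via slabtop. A minimal sketch; the sample input below is made up, and reading the real /proc/slabinfo requires root:

```shell
# Summarize one slab cache from /proc/slabinfo-style columns:
#   name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
# On a live system: sudo awk '...' /proc/slabinfo. Sample input shown here.
awk '$1 == "xfs_inode" {
        printf "%s: %d/%d objs, %d B each, ~%d MiB\n",
               $1, $2, $3, $4, $3 * $4 / 1048576
}' <<'EOF'
dentry     120000 125000  192 21 1
xfs_inode 5000000 5100000 1024 16 4
EOF
```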

Is this a known issue on Linux 5.2?  I don't see it with kernel
5.0.20.  Those are plain upstream kernels built for x86-64, with no
unusual config options (that I know of).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs_inode not reclaimed/memory leak on 5.2.16
  2019-09-30  7:28 xfs_inode not reclaimed/memory leak on 5.2.16 Florian Weimer
@ 2019-09-30  8:54 ` Dave Chinner
  2019-09-30 19:07   ` Florian Weimer
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2019-09-30  8:54 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-xfs, linux-fsdevel

On Mon, Sep 30, 2019 at 09:28:27AM +0200, Florian Weimer wrote:
> Simply running “du -hc” on a large directory tree causes du to be
> killed because of kernel paging request failure in the XFS code.

dmesg output? If the system was still running, then you might be
able to pull the trace from syslog. But we can't do much without
knowing what the actual failure was...
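A sketch of fishing a saved trace out of the log, demonstrated here on a captured excerpt rather than a live system:

```shell
# On a live system:   dmesg | grep -B1 -A30 'BUG:'
# With systemd, previous boot:  journalctl -k -b -1 | grep -B1 -A30 'BUG:'
# Demonstrated on a saved syslog excerpt:
grep -A4 'BUG:' <<'EOF'
[ 4001.238411] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 4001.238415] #PF: supervisor read access in kernel mode
[ 4001.238417] #PF: error_code(0x0000) - not-present page
[ 4001.238418] PGD 0 P4D 0
[ 4001.238420] Oops: 0000 [#1] SMP PTI
EOF
```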

FWIW, one of my regular test workloads is iterating a directory tree
with 50 million inodes in several different ways to stress reclaim
algorithms in the ways users do. I haven't seen issues with that
test for a while, so whatever you came across isn't an obvious
problem.

> I ran slabtop, and it showed tons of xfs_inode objects.

Sure, because your workload is iterating inodes.

> The system was rather unhappy after that, so I wasn't able to capture
> much more information.
> 
> Is this a known issue on Linux 5.2?

Not that I know of.

> I don't see it with kernel
> 5.0.20.  Those are plain upstream kernels built for x86-64, with no
> unusual config options (that I know of).

We've had quite a few memory reclaim regressions in recent times
that have displayed similar symptoms - XFS is often just the
messenger because the inode cache is generating the memory pressure.
e.g. the shrinker infrastructure was broken in 4.16 and then broken
differently in 4.17 to try to fix it, and we didn't hear about them
until about 4.18/4.19 when users started to trip over them. I fixed
those problems in 5.0, but there's every chance that there have been
new regressions since then.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_inode not reclaimed/memory leak on 5.2.16
  2019-09-30  8:54 ` Dave Chinner
@ 2019-09-30 19:07   ` Florian Weimer
  2019-09-30 21:17     ` [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16] Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Weimer @ 2019-09-30 19:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

* Dave Chinner:

> On Mon, Sep 30, 2019 at 09:28:27AM +0200, Florian Weimer wrote:
>> Simply running “du -hc” on a large directory tree causes du to be
>> killed because of kernel paging request failure in the XFS code.
>
> dmesg output? if the system was still running, then you might be
> able to pull the trace from syslog. But we can't do much without
> knowing what the actual failure was....

Huh.  I actually have something in syslog:

[ 4001.238411] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 4001.238415] #PF: supervisor read access in kernel mode
[ 4001.238417] #PF: error_code(0x0000) - not-present page
[ 4001.238418] PGD 0 P4D 0 
[ 4001.238420] Oops: 0000 [#1] SMP PTI
[ 4001.238423] CPU: 3 PID: 143 Comm: kswapd0 Tainted: G          I       5.2.16fw+ #1
[ 4001.238424] Hardware name: System manufacturer System Product Name/P6X58D-E, BIOS 0701    05/10/2011
[ 4001.238430] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
[ 4001.238432] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
[ 4001.238433] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
[ 4001.238435] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
[ 4001.238437] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
[ 4001.238438] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
[ 4001.238439] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
[ 4001.238440] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
[ 4001.238442] FS:  0000000000000000(0000) GS:ffff888333cc0000(0000) knlGS:0000000000000000
[ 4001.238444] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4001.238445] CR2: 0000000000000000 CR3: 000000000200a003 CR4: 00000000000206e0
[ 4001.238446] Call Trace:
[ 4001.238450]  __reset_isolation_suitable+0x9b/0x120
[ 4001.238453]  reset_isolation_suitable+0x3b/0x40
[ 4001.238456]  kswapd+0x98/0x300
[ 4001.238460]  ? wait_woken+0x80/0x80
[ 4001.238463]  kthread+0x114/0x130
[ 4001.238465]  ? balance_pgdat+0x450/0x450
[ 4001.238467]  ? kthread_park+0x80/0x80
[ 4001.238470]  ret_from_fork+0x1f/0x30
[ 4001.238472] Modules linked in: nfnetlink 8021q garp stp llc fuse ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter xt_state xt_conntrack iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter tun ip6_tables binfmt_misc mxm_wmi evdev snd_hda_codec_hdmi coretemp snd_hda_intel kvm_intel snd_hda_codec serio_raw kvm snd_hwdep irqbypass snd_hda_core pcspkr snd_pcm snd_timer snd soundcore sg i7core_edac asus_atk0110 wmi button loop ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 multipath linear md_mod hid_generic usbhid hid crc32c_intel psmouse sr_mod cdrom radeon e1000e ptp xhci_pci pps_core uhci_hcd ehci_pci xhci_hcd ehci_hcd sky2 usbcore ttm usb_common sd_mod
[ 4001.238509] CR2: 0000000000000000
[ 4001.238511] ---[ end trace 3cdcc14b40255fe6 ]---
[ 4001.238514] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
[ 4001.238516] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
[ 4001.238518] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
[ 4001.238519] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
[ 4001.238521] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
[ 4001.238522] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
[ 4001.238523] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
[ 4001.238524] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
[ 4001.238526] FS:  0000000000000000(0000) GS:ffff888333cc0000(0000) knlGS:0000000000000000
[ 4001.238528] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4001.238529] CR2: 0000000000000000 CR3: 000000000200a003 CR4: 00000000000206e0
[ 4001.709169] BUG: unable to handle page fault for address: ffffc900003e7ec8
[ 4001.709172] #PF: supervisor read access in kernel mode
[ 4001.709173] #PF: error_code(0x0000) - not-present page
[ 4001.709174] PGD 33201a067 P4D 33201a067 PUD 33201b067 PMD 3322af067 PTE 0
[ 4001.709177] Oops: 0000 [#2] SMP PTI
[ 4001.709179] CPU: 1 PID: 10507 Comm: du Tainted: G      D   I       5.2.16fw+ #1
[ 4001.709180] Hardware name: System manufacturer System Product Name/P6X58D-E, BIOS 0701    05/10/2011
[ 4001.709184] RIP: 0010:__wake_up_common+0x3c/0x130
[ 4001.709186] Code: 85 c9 74 0a 41 f6 01 04 0f 85 9f 00 00 00 48 8b 47 08 48 8d 5f 08 48 83 e8 18 48 8d 78 18 48 39 fb 0f 84 ca 00 00 00 89 75 d4 <48> 8b 70 18 4d 89 cd 45 31 e4 4c 89 45 c8 89 4d d0 89 55 c4 4c 8d
[ 4001.709187] RSP: 0018:ffffc900043db5e0 EFLAGS: 00010012
[ 4001.709188] RAX: ffffc900003e7eb0 RBX: ffffffff82066f00 RCX: 0000000000000000
[ 4001.709189] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffc900003e7ec8
[ 4001.709190] RBP: ffffc900043db620 R08: 0000000000000000 R09: ffffc900043db638
[ 4001.709191] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000001
[ 4001.709192] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000000
[ 4001.709193] FS:  00007f0090d20540(0000) GS:ffff888333c40000(0000) knlGS:0000000000000000
[ 4001.709194] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4001.709195] CR2: ffffc900003e7ec8 CR3: 00000002b85dc004 CR4: 00000000000206e0
[ 4001.709195] Call Trace:
[ 4001.709198]  __wake_up_common_lock+0x6c/0x90
[ 4001.709200]  __wake_up+0xe/0x10
[ 4001.709203]  wakeup_kswapd+0xf4/0x120
[ 4001.709206]  get_page_from_freelist+0x52e/0xc80
[ 4001.709208]  __alloc_pages_nodemask+0xf0/0xcc0
[ 4001.709209]  ? get_page_from_freelist+0xa8d/0xc80
[ 4001.709212]  ? radix_tree_lookup+0xd/0x10
[ 4001.709215]  ? kmem_cache_alloc+0x80/0xa0
[ 4001.709217]  xfs_buf_allocate_memory+0x20e/0x320
[ 4001.709219]  xfs_buf_get_map+0xe8/0x190
[ 4001.709220]  xfs_buf_read_map+0x25/0x100
[ 4001.709223]  xfs_trans_read_buf_map+0xb2/0x1f0
[ 4001.709225]  xfs_imap_to_bp+0x53/0xa0
[ 4001.709226]  xfs_iread+0x76/0x1b0
[ 4001.709229]  xfs_iget+0x1e5/0x700
[ 4001.709231]  xfs_lookup+0x63/0x90
[ 4001.709232]  xfs_vn_lookup+0x47/0x80
[ 4001.709235]  __lookup_slow+0x7f/0x120
[ 4001.709236]  lookup_slow+0x35/0x50
[ 4001.709238]  walk_component+0x193/0x2a0
[ 4001.709239]  ? path_init+0x112/0x2f0
[ 4001.709240]  path_lookupat.isra.16+0x5c/0x200
[ 4001.709242]  filename_lookup.part.27+0x88/0x100
[ 4001.709243]  ? xfs_ilock+0x39/0x90
[ 4001.709245]  ? __check_object_size+0xf6/0x187
[ 4001.709248]  ? strncpy_from_user+0x56/0x1c0
[ 4001.709249]  user_path_at_empty+0x39/0x40
[ 4001.709250]  vfs_statx+0x62/0xb0
[ 4001.709252]  __se_sys_newfstatat+0x26/0x50
[ 4001.709254]  __x64_sys_newfstatat+0x19/0x20
[ 4001.709255]  do_syscall_64+0x4b/0x260
[ 4001.709257]  ? page_fault+0x8/0x30
[ 4001.709259]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 4001.709261] RIP: 0033:0x7f0090c47e49
[ 4001.709262] Code: 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 89 f0 48 89 d6 83 ff 01 77 36 89 c7 45 89 c2 48 89 ca b8 06 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 07 c3 66 0f 1f 44 00 00 48 8b 15 11 10 0d 00
[ 4001.709263] RSP: 002b:00007fffee0929e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
[ 4001.709264] RAX: ffffffffffffffda RBX: 00005635d95cfe00 RCX: 00007f0090c47e49
[ 4001.709265] RDX: 00005635d95cfe78 RSI: 00005635d95cff08 RDI: 0000000000000004
[ 4001.709266] RBP: 00005635d95cfe78 R08: 0000000000000100 R09: 0000000000000001
[ 4001.709267] R10: 0000000000000100 R11: 0000000000000246 R12: 00005635d8c1e5c0
[ 4001.709268] R13: 00005635d95cfe00 R14: 00005635d8c1e650 R15: 000000000000000b
[ 4001.709269] Modules linked in: nfnetlink 8021q garp stp llc fuse ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter xt_state xt_conntrack iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter tun ip6_tables binfmt_misc mxm_wmi evdev snd_hda_codec_hdmi coretemp snd_hda_intel kvm_intel snd_hda_codec serio_raw kvm snd_hwdep irqbypass snd_hda_core pcspkr snd_pcm snd_timer snd soundcore sg i7core_edac asus_atk0110 wmi button loop ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 multipath linear md_mod hid_generic usbhid hid crc32c_intel psmouse sr_mod cdrom radeon e1000e ptp xhci_pci pps_core uhci_hcd ehci_pci xhci_hcd ehci_hcd sky2 usbcore ttm usb_common sd_mod
[ 4001.709293] CR2: ffffc900003e7ec8
[ 4001.709295] ---[ end trace 3cdcc14b40255fe7 ]---
[ 4001.709297] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
[ 4001.709299] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
[ 4001.709299] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
[ 4001.709300] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
[ 4001.709301] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
[ 4001.709302] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
[ 4001.709303] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
[ 4001.709304] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
[ 4001.709305] FS:  00007f0090d20540(0000) GS:ffff888333c40000(0000) knlGS:0000000000000000
[ 4001.709306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4001.709307] CR2: ffffc900003e7ec8 CR3: 00000002b85dc004 CR4: 00000000000206e0

So XFS wasn't *that* unhappy if it could still write to the file
system.

> FWIW, one of my regular test workloads is iterating a directory tree
> with 50 million inodes in several different ways to stress reclaim
> algorithms in ways that users do. I haven't seen issues with that
> test for a while, so it's not an obvious problem whatever you came
> across.

Right, I should have tried to reproduce it first.  I actually can't.


* [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]
  2019-09-30 19:07   ` Florian Weimer
@ 2019-09-30 21:17     ` Dave Chinner
  2019-09-30 21:42       ` Florian Weimer
  2019-10-01  9:10       ` Vlastimil Babka
  0 siblings, 2 replies; 10+ messages in thread
From: Dave Chinner @ 2019-09-30 21:17 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Mon, Sep 30, 2019 at 09:07:53PM +0200, Florian Weimer wrote:
> * Dave Chinner:
> 
> > On Mon, Sep 30, 2019 at 09:28:27AM +0200, Florian Weimer wrote:
> >> Simply running “du -hc” on a large directory tree causes du to be
> >> killed because of kernel paging request failure in the XFS code.
> >
> > dmesg output? if the system was still running, then you might be
> > able to pull the trace from syslog. But we can't do much without
> > knowing what the actual failure was....
> 
> Huh.  I actually have something in syslog:
> 
> [ 4001.238411] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 4001.238415] #PF: supervisor read access in kernel mode
> [ 4001.238417] #PF: error_code(0x0000) - not-present page
> [ 4001.238418] PGD 0 P4D 0 
> [ 4001.238420] Oops: 0000 [#1] SMP PTI
> [ 4001.238423] CPU: 3 PID: 143 Comm: kswapd0 Tainted: G          I       5.2.16fw+ #1
> [ 4001.238424] Hardware name: System manufacturer System Product Name/P6X58D-E, BIOS 0701    05/10/2011
> [ 4001.238430] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0

That's memory compaction code it's crashed in.

> [ 4001.238432] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
> [ 4001.238433] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
> [ 4001.238435] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
> [ 4001.238437] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
> [ 4001.238438] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
> [ 4001.238439] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
> [ 4001.238440] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
> [ 4001.238442] FS:  0000000000000000(0000) GS:ffff888333cc0000(0000) knlGS:0000000000000000
> [ 4001.238444] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4001.238445] CR2: 0000000000000000 CR3: 000000000200a003 CR4: 00000000000206e0
> [ 4001.238446] Call Trace:
> [ 4001.238450]  __reset_isolation_suitable+0x9b/0x120
> [ 4001.238453]  reset_isolation_suitable+0x3b/0x40
> [ 4001.238456]  kswapd+0x98/0x300
> [ 4001.238460]  ? wait_woken+0x80/0x80
> [ 4001.238463]  kthread+0x114/0x130
> [ 4001.238465]  ? balance_pgdat+0x450/0x450
> [ 4001.238467]  ? kthread_park+0x80/0x80
> [ 4001.238470]  ret_from_fork+0x1f/0x30

Ok, so the memory compaction code has had a null pointer dereference
which has killed kswapd. memory reclaim is going to have serious
problems from this point on as kswapd does most of the reclaim.

I have no idea why this might have happened - are there any other
unexpected events or clues in the syslog that might point to
memory corruption or some sign of badness before this crash?

> [ 4001.238472] Modules linked in: nfnetlink 8021q garp stp llc fuse ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter xt_state xt_conntrack iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter tun ip6_tables binfmt_misc mxm_wmi evdev snd_hda_codec_hdmi coretemp snd_hda_intel kvm_intel snd_hda_codec serio_raw kvm snd_hwdep irqbypass snd_hda_core pcspkr snd_pcm snd_timer snd soundcore sg i7core_edac asus_atk0110 wmi button loop ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 multipath linear md_mod hid_generic usbhid hid crc32c_intel psmouse sr_mod cdrom radeon e1000e ptp xhci_pci pps_core uhci_hcd ehci_pci xhci_hcd ehci_hcd sky2 usbcore ttm usb_common sd_mod
> [ 4001.238509] CR2: 0000000000000000
> [ 4001.238511] ---[ end trace 3cdcc14b40255fe6 ]---
> [ 4001.238514] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
> [ 4001.238516] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
> [ 4001.238518] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
> [ 4001.238519] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
> [ 4001.238521] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
> [ 4001.238522] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
> [ 4001.238523] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
> [ 4001.238524] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
> [ 4001.238526] FS:  0000000000000000(0000) GS:ffff888333cc0000(0000) knlGS:0000000000000000
> [ 4001.238528] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4001.238529] CR2: 0000000000000000 CR3: 000000000200a003 CR4: 00000000000206e0
> [ 4001.709169] BUG: unable to handle page fault for address: ffffc900003e7ec8
> [ 4001.709172] #PF: supervisor read access in kernel mode
> [ 4001.709173] #PF: error_code(0x0000) - not-present page
> [ 4001.709174] PGD 33201a067 P4D 33201a067 PUD 33201b067 PMD 3322af067 PTE 0
> [ 4001.709177] Oops: 0000 [#2] SMP PTI
> [ 4001.709179] CPU: 1 PID: 10507 Comm: du Tainted: G      D   I       5.2.16fw+ #1
> [ 4001.709180] Hardware name: System manufacturer System Product Name/P6X58D-E, BIOS 0701    05/10/2011
> [ 4001.709184] RIP: 0010:__wake_up_common+0x3c/0x130

Then half a second later, the du process has crashed in
__wake_up_common (core scheduler code)....

> [ 4001.709186] Code: 85 c9 74 0a 41 f6 01 04 0f 85 9f 00 00 00 48 8b 47 08 48 8d 5f 08 48 83 e8 18 48 8d 78 18 48 39 fb 0f 84 ca 00 00 00 89 75 d4 <48> 8b 70 18 4d 89 cd 45 31 e4 4c 89 45 c8 89 4d d0 89 55 c4 4c 8d
> [ 4001.709187] RSP: 0018:ffffc900043db5e0 EFLAGS: 00010012
> [ 4001.709188] RAX: ffffc900003e7eb0 RBX: ffffffff82066f00 RCX: 0000000000000000
> [ 4001.709189] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffc900003e7ec8
> [ 4001.709190] RBP: ffffc900043db620 R08: 0000000000000000 R09: ffffc900043db638
> [ 4001.709191] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000001
> [ 4001.709192] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000000
> [ 4001.709193] FS:  00007f0090d20540(0000) GS:ffff888333c40000(0000) knlGS:0000000000000000
> [ 4001.709194] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4001.709195] CR2: ffffc900003e7ec8 CR3: 00000002b85dc004 CR4: 00000000000206e0
> [ 4001.709195] Call Trace:
> [ 4001.709198]  __wake_up_common_lock+0x6c/0x90
> [ 4001.709200]  __wake_up+0xe/0x10
> [ 4001.709203]  wakeup_kswapd+0xf4/0x120

...trying to wake up kswapd. This may have crashed because the
kswapd task was killed but never removed from the wait list, so a
dead/freed task is being woken. Regardless, this looks like a
follow-on issue, not a root cause.

> [ 4001.709206]  get_page_from_freelist+0x52e/0xc80
> [ 4001.709208]  __alloc_pages_nodemask+0xf0/0xcc0
> [ 4001.709209]  ? get_page_from_freelist+0xa8d/0xc80
> [ 4001.709212]  ? radix_tree_lookup+0xd/0x10
> [ 4001.709215]  ? kmem_cache_alloc+0x80/0xa0
> [ 4001.709217]  xfs_buf_allocate_memory+0x20e/0x320
> [ 4001.709219]  xfs_buf_get_map+0xe8/0x190
> [ 4001.709220]  xfs_buf_read_map+0x25/0x100
> [ 4001.709223]  xfs_trans_read_buf_map+0xb2/0x1f0
> [ 4001.709225]  xfs_imap_to_bp+0x53/0xa0
> [ 4001.709226]  xfs_iread+0x76/0x1b0
> [ 4001.709229]  xfs_iget+0x1e5/0x700
> [ 4001.709231]  xfs_lookup+0x63/0x90
> [ 4001.709232]  xfs_vn_lookup+0x47/0x80

The XFS part of this is that it triggers the memory allocation that
trips over the bad kswapd state, nothing else.

IOWs, this doesn't look like an XFS problem at all, but more likely
something going wrong with memory compaction or memory reclaim, so
I'd suggest linux-mm@kvack.org [cc'd] is the first port of call for
further triage.

> [ 4001.709235]  __lookup_slow+0x7f/0x120
> [ 4001.709236]  lookup_slow+0x35/0x50
> [ 4001.709238]  walk_component+0x193/0x2a0
> [ 4001.709239]  ? path_init+0x112/0x2f0
> [ 4001.709240]  path_lookupat.isra.16+0x5c/0x200
> [ 4001.709242]  filename_lookup.part.27+0x88/0x100
> [ 4001.709243]  ? xfs_ilock+0x39/0x90
> [ 4001.709245]  ? __check_object_size+0xf6/0x187
> [ 4001.709248]  ? strncpy_from_user+0x56/0x1c0
> [ 4001.709249]  user_path_at_empty+0x39/0x40
> [ 4001.709250]  vfs_statx+0x62/0xb0
> [ 4001.709252]  __se_sys_newfstatat+0x26/0x50
> [ 4001.709254]  __x64_sys_newfstatat+0x19/0x20
> [ 4001.709255]  do_syscall_64+0x4b/0x260
> [ 4001.709257]  ? page_fault+0x8/0x30
> [ 4001.709259]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 4001.709261] RIP: 0033:0x7f0090c47e49
> [ 4001.709262] Code: 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 89 f0 48 89 d6 83 ff 01 77 36 89 c7 45 89 c2 48 89 ca b8 06 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 07 c3 66 0f 1f 44 00 00 48 8b 15 11 10 0d 00
> [ 4001.709263] RSP: 002b:00007fffee0929e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
> [ 4001.709264] RAX: ffffffffffffffda RBX: 00005635d95cfe00 RCX: 00007f0090c47e49
> [ 4001.709265] RDX: 00005635d95cfe78 RSI: 00005635d95cff08 RDI: 0000000000000004
> [ 4001.709266] RBP: 00005635d95cfe78 R08: 0000000000000100 R09: 0000000000000001
> [ 4001.709267] R10: 0000000000000100 R11: 0000000000000246 R12: 00005635d8c1e5c0
> [ 4001.709268] R13: 00005635d95cfe00 R14: 00005635d8c1e650 R15: 000000000000000b
> [ 4001.709269] Modules linked in: nfnetlink 8021q garp stp llc fuse ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter xt_state xt_conntrack iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter tun ip6_tables binfmt_misc mxm_wmi evdev snd_hda_codec_hdmi coretemp snd_hda_intel kvm_intel snd_hda_codec serio_raw kvm snd_hwdep irqbypass snd_hda_core pcspkr snd_pcm snd_timer snd soundcore sg i7core_edac asus_atk0110 wmi button loop ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 multipath linear md_mod hid_generic usbhid hid crc32c_intel psmouse sr_mod cdrom radeon e1000e ptp xhci_pci pps_core uhci_hcd ehci_pci xhci_hcd ehci_hcd sky2 usbcore ttm usb_common sd_mod
> [ 4001.709293] CR2: ffffc900003e7ec8
> [ 4001.709295] ---[ end trace 3cdcc14b40255fe7 ]---
> [ 4001.709297] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
> [ 4001.709299] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
> [ 4001.709299] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
> [ 4001.709300] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
> [ 4001.709301] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
> [ 4001.709302] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
> [ 4001.709303] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
> [ 4001.709304] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
> [ 4001.709305] FS:  00007f0090d20540(0000) GS:ffff888333c40000(0000) knlGS:0000000000000000
> [ 4001.709306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4001.709307] CR2: ffffc900003e7ec8 CR3: 00000002b85dc004 CR4: 00000000000206e0
> 
> So XFS wasn't *that* unhappy if it could still write to the file
> system.

Right, as long as it doesn't trip over any of the leaked state
(e.g. locks) from the killed du process, it'll keep going as long
as direct memory reclaim can keep reclaiming memory.

> 
> > FWIW, one of my regular test workloads is iterating a directory tree
> > with 50 million inodes in several different ways to stress reclaim
> > algorithms in ways that users do. I haven't seen issues with that
> > test for a while, so it's not an obvious problem whatever you came
> > across.
> 
> Right, I should have tried to reproduce it first.  I actually can't.

Not surprising, it has the smell of "random crash" to it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]
  2019-09-30 21:17     ` [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16] Dave Chinner
@ 2019-09-30 21:42       ` Florian Weimer
  2019-10-01  9:10       ` Vlastimil Babka
  1 sibling, 0 replies; 10+ messages in thread
From: Florian Weimer @ 2019-09-30 21:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

* Dave Chinner:

>> [ 4001.238446] Call Trace:
>> [ 4001.238450]  __reset_isolation_suitable+0x9b/0x120
>> [ 4001.238453]  reset_isolation_suitable+0x3b/0x40
>> [ 4001.238456]  kswapd+0x98/0x300
>> [ 4001.238460]  ? wait_woken+0x80/0x80
>> [ 4001.238463]  kthread+0x114/0x130
>> [ 4001.238465]  ? balance_pgdat+0x450/0x450
>> [ 4001.238467]  ? kthread_park+0x80/0x80
>> [ 4001.238470]  ret_from_fork+0x1f/0x30
>
> Ok, so the memory compaction code has had a null pointer dereference
> which has killed kswapd. memory reclaim is going to have serious
> problems from this point on as kswapd does most of the reclaim.

Sorry, no.  OpenVPN opened a tun device at the same time (same
second), and udevd reacted to that, but that's it.

I also double-checked, and there haven't been any recent previous
occurrences of that crash.


* Re: [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]
  2019-09-30 21:17     ` [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16] Dave Chinner
  2019-09-30 21:42       ` Florian Weimer
@ 2019-10-01  9:10       ` Vlastimil Babka
  2019-10-01 19:40         ` Florian Weimer
  1 sibling, 1 reply; 10+ messages in thread
From: Vlastimil Babka @ 2019-10-01  9:10 UTC (permalink / raw)
  To: Dave Chinner, Florian Weimer
  Cc: linux-xfs, linux-fsdevel, linux-mm, Mel Gorman

On 9/30/19 11:17 PM, Dave Chinner wrote:
> On Mon, Sep 30, 2019 at 09:07:53PM +0200, Florian Weimer wrote:
>> * Dave Chinner:
>>
>>> On Mon, Sep 30, 2019 at 09:28:27AM +0200, Florian Weimer wrote:
>>>> Simply running “du -hc” on a large directory tree causes du to be
>>>> killed because of kernel paging request failure in the XFS code.
>>>
>>> dmesg output? if the system was still running, then you might be
>>> able to pull the trace from syslog. But we can't do much without
>>> knowing what the actual failure was....
>>
>> Huh.  I actually have something in syslog:
>>
>> [ 4001.238411] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> [ 4001.238415] #PF: supervisor read access in kernel mode
>> [ 4001.238417] #PF: error_code(0x0000) - not-present page
>> [ 4001.238418] PGD 0 P4D 0 
>> [ 4001.238420] Oops: 0000 [#1] SMP PTI
>> [ 4001.238423] CPU: 3 PID: 143 Comm: kswapd0 Tainted: G          I       5.2.16fw+ #1
>> [ 4001.238424] Hardware name: System manufacturer System Product Name/P6X58D-E, BIOS 0701    05/10/2011
>> [ 4001.238430] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
> 
> That's memory compaction code it's crashed in.
> 
>> [ 4001.238432] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7

I tried to decode it, but couldn't match it to the source; my
compiled code is too different. Would it be possible to either send
mm/compaction.o from the matching build, or the output of
'objdump -d -l' for the __reset_isolation_pfn function?

>> [ 4001.238433] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
>> [ 4001.238435] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
>> [ 4001.238437] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
>> [ 4001.238438] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
>> [ 4001.238439] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
>> [ 4001.238440] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
>> [ 4001.238442] FS:  0000000000000000(0000) GS:ffff888333cc0000(0000) knlGS:0000000000000000
>> [ 4001.238444] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 4001.238445] CR2: 0000000000000000 CR3: 000000000200a003 CR4: 00000000000206e0
>> [ 4001.238446] Call Trace:
>> [ 4001.238450]  __reset_isolation_suitable+0x9b/0x120
>> [ 4001.238453]  reset_isolation_suitable+0x3b/0x40
>> [ 4001.238456]  kswapd+0x98/0x300
>> [ 4001.238460]  ? wait_woken+0x80/0x80
>> [ 4001.238463]  kthread+0x114/0x130
>> [ 4001.238465]  ? balance_pgdat+0x450/0x450
>> [ 4001.238467]  ? kthread_park+0x80/0x80
>> [ 4001.238470]  ret_from_fork+0x1f/0x30
> 
> Ok, so the memory compaction code has had a null pointer dereference
> which has killed kswapd. memory reclaim is going to have serious
> problems from this point on as kswapd does most of the reclaim.
> 
> I have no idea why this might have happened - are they any other
> unexpected events or clues in the syslog that might point to a
> memory corruption or some sign of badness before this crash?
> 
>> [ 4001.238472] Modules linked in: nfnetlink 8021q garp stp llc fuse ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter xt_state xt_conntrack iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter tun ip6_tables binfmt_misc mxm_wmi evdev snd_hda_codec_hdmi coretemp snd_hda_intel kvm_intel snd_hda_codec serio_raw kvm snd_hwdep irqbypass snd_hda_core pcspkr snd_pcm snd_timer snd soundcore sg i7core_edac asus_atk0110 wmi button loop ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 multipath linear md_mod hid_generic usbhid hid crc32c_intel psmouse sr_mod cdrom radeon e1000e ptp xhci_pci pps_core uhci_hcd ehci_pci xhci_hcd ehci_hcd sky2 usbcore ttm usb_common sd_mod
>> [ 4001.238509] CR2: 0000000000000000
>> [ 4001.238511] ---[ end trace 3cdcc14b40255fe6 ]---
>> [ 4001.238514] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
>> [ 4001.238516] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
>> [ 4001.238518] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
>> [ 4001.238519] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
>> [ 4001.238521] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
>> [ 4001.238522] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
>> [ 4001.238523] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
>> [ 4001.238524] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
>> [ 4001.238526] FS:  0000000000000000(0000) GS:ffff888333cc0000(0000) knlGS:0000000000000000
>> [ 4001.238528] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 4001.238529] CR2: 0000000000000000 CR3: 000000000200a003 CR4: 00000000000206e0
>> [ 4001.709169] BUG: unable to handle page fault for address: ffffc900003e7ec8
>> [ 4001.709172] #PF: supervisor read access in kernel mode
>> [ 4001.709173] #PF: error_code(0x0000) - not-present page
>> [ 4001.709174] PGD 33201a067 P4D 33201a067 PUD 33201b067 PMD 3322af067 PTE 0
>> [ 4001.709177] Oops: 0000 [#2] SMP PTI
>> [ 4001.709179] CPU: 1 PID: 10507 Comm: du Tainted: G      D   I       5.2.16fw+ #1
>> [ 4001.709180] Hardware name: System manufacturer System Product Name/P6X58D-E, BIOS 0701    05/10/2011
>> [ 4001.709184] RIP: 0010:__wake_up_common+0x3c/0x130
> 
> Then half a second later, the du process has crashed in
> __wake_up_common (core scheduler code)....
> 
>> [ 4001.709186] Code: 85 c9 74 0a 41 f6 01 04 0f 85 9f 00 00 00 48 8b 47 08 48 8d 5f 08 48 83 e8 18 48 8d 78 18 48 39 fb 0f 84 ca 00 00 00 89 75 d4 <48> 8b 70 18 4d 89 cd 45 31 e4 4c 89 45 c8 89 4d d0 89 55 c4 4c 8d
>> [ 4001.709187] RSP: 0018:ffffc900043db5e0 EFLAGS: 00010012
>> [ 4001.709188] RAX: ffffc900003e7eb0 RBX: ffffffff82066f00 RCX: 0000000000000000
>> [ 4001.709189] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffc900003e7ec8
>> [ 4001.709190] RBP: ffffc900043db620 R08: 0000000000000000 R09: ffffc900043db638
>> [ 4001.709191] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000001
>> [ 4001.709192] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000000
>> [ 4001.709193] FS:  00007f0090d20540(0000) GS:ffff888333c40000(0000) knlGS:0000000000000000
>> [ 4001.709194] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 4001.709195] CR2: ffffc900003e7ec8 CR3: 00000002b85dc004 CR4: 00000000000206e0
>> [ 4001.709195] Call Trace:
>> [ 4001.709198]  __wake_up_common_lock+0x6c/0x90
>> [ 4001.709200]  __wake_up+0xe/0x10
>> [ 4001.709203]  wakeup_kswapd+0xf4/0x120
> 
> ...trying to wake up kswapd. This may have crashed because the
> kswapd task has been killed and it hasn't been removed from
> the wait list and so there's a dead/freed task being woken.
> Regardless, this looks like a follow-on issue, not a root cause.
> 
>> [ 4001.709206]  get_page_from_freelist+0x52e/0xc80
>> [ 4001.709208]  __alloc_pages_nodemask+0xf0/0xcc0
>> [ 4001.709209]  ? get_page_from_freelist+0xa8d/0xc80
>> [ 4001.709212]  ? radix_tree_lookup+0xd/0x10
>> [ 4001.709215]  ? kmem_cache_alloc+0x80/0xa0
>> [ 4001.709217]  xfs_buf_allocate_memory+0x20e/0x320
>> [ 4001.709219]  xfs_buf_get_map+0xe8/0x190
>> [ 4001.709220]  xfs_buf_read_map+0x25/0x100
>> [ 4001.709223]  xfs_trans_read_buf_map+0xb2/0x1f0
>> [ 4001.709225]  xfs_imap_to_bp+0x53/0xa0
>> [ 4001.709226]  xfs_iread+0x76/0x1b0
>> [ 4001.709229]  xfs_iget+0x1e5/0x700
>> [ 4001.709231]  xfs_lookup+0x63/0x90
>> [ 4001.709232]  xfs_vn_lookup+0x47/0x80
> 
> The XFS part of this is that it triggers the memory allocation that
> trips over the bad kswapd state, nothing else.
> 
> IOWs, this doesn't look like an XFS problem at all, but more likely
> something going wrong with memory compaction or memory reclaim, so
> I'd suggest linux-mm@kvack.org [cc'd] is the first port of call for
> further triage.
> 
>> [ 4001.709235]  __lookup_slow+0x7f/0x120
>> [ 4001.709236]  lookup_slow+0x35/0x50
>> [ 4001.709238]  walk_component+0x193/0x2a0
>> [ 4001.709239]  ? path_init+0x112/0x2f0
>> [ 4001.709240]  path_lookupat.isra.16+0x5c/0x200
>> [ 4001.709242]  filename_lookup.part.27+0x88/0x100
>> [ 4001.709243]  ? xfs_ilock+0x39/0x90
>> [ 4001.709245]  ? __check_object_size+0xf6/0x187
>> [ 4001.709248]  ? strncpy_from_user+0x56/0x1c0
>> [ 4001.709249]  user_path_at_empty+0x39/0x40
>> [ 4001.709250]  vfs_statx+0x62/0xb0
>> [ 4001.709252]  __se_sys_newfstatat+0x26/0x50
>> [ 4001.709254]  __x64_sys_newfstatat+0x19/0x20
>> [ 4001.709255]  do_syscall_64+0x4b/0x260
>> [ 4001.709257]  ? page_fault+0x8/0x30
>> [ 4001.709259]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 4001.709261] RIP: 0033:0x7f0090c47e49
>> [ 4001.709262] Code: 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 89 f0 48 89 d6 83 ff 01 77 36 89 c7 45 89 c2 48 89 ca b8 06 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 07 c3 66 0f 1f 44 00 00 48 8b 15 11 10 0d 00
>> [ 4001.709263] RSP: 002b:00007fffee0929e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
>> [ 4001.709264] RAX: ffffffffffffffda RBX: 00005635d95cfe00 RCX: 00007f0090c47e49
>> [ 4001.709265] RDX: 00005635d95cfe78 RSI: 00005635d95cff08 RDI: 0000000000000004
>> [ 4001.709266] RBP: 00005635d95cfe78 R08: 0000000000000100 R09: 0000000000000001
>> [ 4001.709267] R10: 0000000000000100 R11: 0000000000000246 R12: 00005635d8c1e5c0
>> [ 4001.709268] R13: 00005635d95cfe00 R14: 00005635d8c1e650 R15: 000000000000000b
>> [ 4001.709269] Modules linked in: nfnetlink 8021q garp stp llc fuse ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter xt_state xt_conntrack iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter tun ip6_tables binfmt_misc mxm_wmi evdev snd_hda_codec_hdmi coretemp snd_hda_intel kvm_intel snd_hda_codec serio_raw kvm snd_hwdep irqbypass snd_hda_core pcspkr snd_pcm snd_timer snd soundcore sg i7core_edac asus_atk0110 wmi button loop ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 multipath linear md_mod hid_generic usbhid hid crc32c_intel psmouse sr_mod cdrom radeon e1000e ptp xhci_pci pps_core uhci_hcd ehci_pci xhci_hcd ehci_hcd sky2 usbcore ttm usb_common sd_mod
>> [ 4001.709293] CR2: ffffc900003e7ec8
>> [ 4001.709295] ---[ end trace 3cdcc14b40255fe7 ]---
>> [ 4001.709297] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
>> [ 4001.709299] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1 e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00 00 00 4c 89 f7
>> [ 4001.709299] RSP: 0018:ffffc900003e7de0 EFLAGS: 00010246
>> [ 4001.709300] RAX: 0000000000057285 RBX: 0000000000108000 RCX: 0000000000000000
>> [ 4001.709301] RDX: 0000000000000000 RSI: 0000000000000210 RDI: ffff88833fffa000
>> [ 4001.709302] RBP: ffffc900003e7e18 R08: 0000000000000004 R09: ffff888335000000
>> [ 4001.709303] R10: ffff88833fff9000 R11: 0000000000000000 R12: 0000000000000000
>> [ 4001.709304] R13: 0000000000000000 R14: ffff8883389c01c0 R15: 0000000000000001
>> [ 4001.709305] FS:  00007f0090d20540(0000) GS:ffff888333c40000(0000) knlGS:0000000000000000
>> [ 4001.709306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 4001.709307] CR2: ffffc900003e7ec8 CR3: 00000002b85dc004 CR4: 00000000000206e0
>>
>> So XFS wasn't *that* unhappy if it could still write to the file
>> system.
> 
> Right, as long as it doesn't trip over any of the leaked state (e.g.
> locks) from the du process that was killed, it'll keep going as long
> as direct memory reclaim can keep reclaiming memory.
> 
>>
>>> FWIW, one of my regular test workloads is iterating a directory tree
>>> with 50 million inodes in several different ways to stress reclaim
>>> algorithms in ways that users do. I haven't seen issues with that
>>> test for a while, so it's not an obvious problem whatever you came
>>> across.
>>
>> Right, I should have tried to reproduce it first.  I actually can't.
> 
> Not surprising, it has the smell of "random crash" to it.
> 
> Cheers,
> 
> Dave.
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]
  2019-10-01  9:10       ` Vlastimil Babka
@ 2019-10-01 19:40         ` Florian Weimer
  2019-10-07 13:28           ` Vlastimil Babka
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Weimer @ 2019-10-01 19:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-mm, Mel Gorman

* Vlastimil Babka:

> On 9/30/19 11:17 PM, Dave Chinner wrote:
>> On Mon, Sep 30, 2019 at 09:07:53PM +0200, Florian Weimer wrote:
>>> * Dave Chinner:
>>>
>>>> On Mon, Sep 30, 2019 at 09:28:27AM +0200, Florian Weimer wrote:
>>>>> Simply running “du -hc” on a large directory tree causes du to be
>>>>> killed because of kernel paging request failure in the XFS code.
>>>>
>>>> dmesg output? if the system was still running, then you might be
>>>> able to pull the trace from syslog. But we can't do much without
>>>> knowing what the actual failure was....
>>>
>>> Huh.  I actually have something in syslog:
>>>
>>> [ 4001.238411] BUG: kernel NULL pointer dereference, address:
>>> 0000000000000000
>>> [ 4001.238415] #PF: supervisor read access in kernel mode
>>> [ 4001.238417] #PF: error_code(0x0000) - not-present page
>>> [ 4001.238418] PGD 0 P4D 0 
>>> [ 4001.238420] Oops: 0000 [#1] SMP PTI
>>> [ 4001.238423] CPU: 3 PID: 143 Comm: kswapd0 Tainted: G I 5.2.16fw+
>>> #1
>>> [ 4001.238424] Hardware name: System manufacturer System Product
>>> Name/P6X58D-E, BIOS 0701 05/10/2011
>>> [ 4001.238430] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
>> 
>> That's memory compaction code it's crashed in.
>> 
>>> [ 4001.238432] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0
>>> 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1
>>> e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00
>>> 00 00 4c 89 f7
>
> Tried to decode it, but couldn't match it to source code, my version of
> compiled code is too different. Would it be possible to either send
> mm/compaction.o from the matching build, or output of 'objdump -d -l'
> for the __reset_isolation_pfn function?

See below.  I don't have debuginfo for this build, and the binary does
not reproduce for some reason.  Due to the heavy inlining, it might be
quite hard to figure out what's going on.

I've switched to kernel builds with debuginfo from now on.  I'm
surprised that it's not the default.

0000000000000120 <__reset_isolation_pfn>:
__reset_isolation_pfn():
     120:	48 89 f0             	mov    %rsi,%rax
     123:	48 c1 e8 0f          	shr    $0xf,%rax
     127:	48 3d ff ff 07 00    	cmp    $0x7ffff,%rax
     12d:	0f 87 83 00 00 00    	ja     1b6 <__reset_isolation_pfn+0x96>
     133:	4c 8b 0d 00 00 00 00 	mov    0x0(%rip),%r9        # 13a <__reset_isolation_pfn+0x1a>
			136: R_X86_64_PC32	mem_section-0x4
     13a:	4d 85 c9             	test   %r9,%r9
     13d:	74 77                	je     1b6 <__reset_isolation_pfn+0x96>
     13f:	49 89 f2             	mov    %rsi,%r10
     142:	49 c1 ea 17          	shr    $0x17,%r10
     146:	4f 8b 0c d1          	mov    (%r9,%r10,8),%r9
     14a:	4d 85 c9             	test   %r9,%r9
     14d:	74 67                	je     1b6 <__reset_isolation_pfn+0x96>
     14f:	0f b6 c0             	movzbl %al,%eax
     152:	48 c1 e0 04          	shl    $0x4,%rax
     156:	4c 01 c8             	add    %r9,%rax
     159:	74 5b                	je     1b6 <__reset_isolation_pfn+0x96>
     15b:	4c 8b 08             	mov    (%rax),%r9
     15e:	41 f6 c1 02          	test   $0x2,%r9b
     162:	74 52                	je     1b6 <__reset_isolation_pfn+0x96>
     164:	48 6b c6 38          	imul   $0x38,%rsi,%rax
     168:	55                   	push   %rbp
     169:	49 83 e1 f8          	and    $0xfffffffffffffff8,%r9
     16d:	48 89 e5             	mov    %rsp,%rbp
     170:	41 57                	push   %r15
     172:	41 56                	push   %r14
     174:	4d 89 ce             	mov    %r9,%r14
     177:	41 55                	push   %r13
     179:	41 54                	push   %r12
     17b:	53                   	push   %rbx
     17c:	48 83 ec 10          	sub    $0x10,%rsp
     180:	49 01 c6             	add    %rax,%r14
     183:	74 1c                	je     1a1 <__reset_isolation_pfn+0x81>
     185:	49 8b 06             	mov    (%r14),%rax
     188:	48 c1 e8 2b          	shr    $0x2b,%rax
     18c:	83 e0 03             	and    $0x3,%eax
     18f:	48 69 c0 80 05 00 00 	imul   $0x580,%rax,%rax
     196:	48 05 00 00 00 00    	add    $0x0,%rax
			198: R_X86_64_32S	contig_page_data
     19c:	48 39 c7             	cmp    %rax,%rdi
     19f:	74 1c                	je     1bd <__reset_isolation_pfn+0x9d>
     1a1:	45 31 d2             	xor    %r10d,%r10d
     1a4:	48 83 c4 10          	add    $0x10,%rsp
     1a8:	44 89 d0             	mov    %r10d,%eax
     1ab:	5b                   	pop    %rbx
     1ac:	41 5c                	pop    %r12
     1ae:	41 5d                	pop    %r13
     1b0:	41 5e                	pop    %r14
     1b2:	41 5f                	pop    %r15
     1b4:	5d                   	pop    %rbp
     1b5:	c3                   	retq   
     1b6:	45 31 d2             	xor    %r10d,%r10d
     1b9:	44 89 d0             	mov    %r10d,%eax
     1bc:	c3                   	retq   
     1bd:	48 89 7d d0          	mov    %rdi,-0x30(%rbp)
     1c1:	41 89 cc             	mov    %ecx,%r12d
     1c4:	41 89 cd             	mov    %ecx,%r13d
     1c7:	4c 89 f7             	mov    %r14,%rdi
     1ca:	89 d1                	mov    %edx,%ecx
     1cc:	41 89 d7             	mov    %edx,%r15d
     1cf:	48 89 f3             	mov    %rsi,%rbx
     1d2:	89 4d cc             	mov    %ecx,-0x34(%rbp)
     1d5:	e8 46 fe ff ff       	callq  20 <pageblock_skip_persistent>
     1da:	84 c0                	test   %al,%al
     1dc:	75 c3                	jne    1a1 <__reset_isolation_pfn+0x81>
     1de:	45 89 fa             	mov    %r15d,%r10d
     1e1:	45 20 e2             	and    %r12b,%r10b
     1e4:	0f 85 f3 01 00 00    	jne    3dd <__reset_isolation_pfn+0x2bd>
     1ea:	80 7d cc 01          	cmpb   $0x1,-0x34(%rbp)
     1ee:	74 6d                	je     25d <__reset_isolation_pfn+0x13d>
     1f0:	45 84 e4             	test   %r12b,%r12b
     1f3:	74 68                	je     25d <__reset_isolation_pfn+0x13d>
     1f5:	49 8b 0e             	mov    (%r14),%rcx
     1f8:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # 1ff <__reset_isolation_pfn+0xdf>
			1fb: R_X86_64_PC32	mem_section-0x4
     1ff:	48 89 ca             	mov    %rcx,%rdx
     202:	48 c1 ea 2d          	shr    $0x2d,%rdx
     206:	48 85 c0             	test   %rax,%rax
     209:	74 17                	je     222 <__reset_isolation_pfn+0x102>
     20b:	48 c1 e9 35          	shr    $0x35,%rcx
     20f:	48 8b 04 c8          	mov    (%rax,%rcx,8),%rax
     213:	48 85 c0             	test   %rax,%rax
     216:	74 0a                	je     222 <__reset_isolation_pfn+0x102>
     218:	0f b6 d2             	movzbl %dl,%edx
     21b:	48 c1 e2 04          	shl    $0x4,%rdx
     21f:	48 01 d0             	add    %rdx,%rax
     222:	48 8b 00             	mov    (%rax),%rax
     225:	4c 89 f6             	mov    %r14,%rsi
     228:	b9 07 00 00 00       	mov    $0x7,%ecx
     22d:	ba 02 00 00 00       	mov    $0x2,%edx
     232:	4c 89 f7             	mov    %r14,%rdi
     235:	48 83 e0 f8          	and    $0xfffffffffffffff8,%rax
     239:	48 29 c6             	sub    %rax,%rsi
     23c:	48 b8 b7 6d db b6 6d 	movabs $0x6db6db6db6db6db7,%rax
     243:	db b6 6d 
     246:	48 c1 fe 03          	sar    $0x3,%rsi
     24a:	48 0f af f0          	imul   %rax,%rsi
     24e:	e8 00 00 00 00       	callq  253 <__reset_isolation_pfn+0x133>
			24f: R_X86_64_PLT32	get_pfnblock_flags_mask-0x4
     253:	48 83 f8 01          	cmp    $0x1,%rax
     257:	0f 85 44 ff ff ff    	jne    1a1 <__reset_isolation_pfn+0x81>
     25d:	48 8b 7d d0          	mov    -0x30(%rbp),%rdi
     261:	48 81 e3 00 fe ff ff 	and    $0xfffffffffffffe00,%rbx
     268:	48 8b 47 58          	mov    0x58(%rdi),%rax
     26c:	48 39 c3             	cmp    %rax,%rbx
     26f:	48 89 c1             	mov    %rax,%rcx
     272:	48 0f 43 cb          	cmovae %rbx,%rcx
     276:	48 03 47 68          	add    0x68(%rdi),%rax
     27a:	48 81 c3 00 02 00 00 	add    $0x200,%rbx
     281:	48 89 ca             	mov    %rcx,%rdx
     284:	48 c1 ea 0f          	shr    $0xf,%rdx
     288:	48 83 e8 01          	sub    $0x1,%rax
     28c:	48 39 d8             	cmp    %rbx,%rax
     28f:	48 0f 47 c3          	cmova  %rbx,%rax
     293:	48 89 c6             	mov    %rax,%rsi
     296:	48 c1 ee 0f          	shr    $0xf,%rsi
     29a:	48 81 fa ff ff 07 00 	cmp    $0x7ffff,%rdx
     2a1:	0f 87 ab 01 00 00    	ja     452 <__reset_isolation_pfn+0x332>
     2a7:	48 8b 3d 00 00 00 00 	mov    0x0(%rip),%rdi        # 2ae <__reset_isolation_pfn+0x18e>
			2aa: R_X86_64_PC32	mem_section-0x4
     2ae:	48 85 ff             	test   %rdi,%rdi
     2b1:	0f 84 ea fe ff ff    	je     1a1 <__reset_isolation_pfn+0x81>
     2b7:	49 89 ca             	mov    %rcx,%r10
     2ba:	49 c1 ea 17          	shr    $0x17,%r10
     2be:	4e 8b 14 d7          	mov    (%rdi,%r10,8),%r10
     2c2:	4d 85 d2             	test   %r10,%r10
     2c5:	74 23                	je     2ea <__reset_isolation_pfn+0x1ca>
     2c7:	0f b6 d2             	movzbl %dl,%edx
     2ca:	48 c1 e2 04          	shl    $0x4,%rdx
     2ce:	4c 01 d2             	add    %r10,%rdx
     2d1:	74 17                	je     2ea <__reset_isolation_pfn+0x1ca>
     2d3:	48 8b 12             	mov    (%rdx),%rdx
     2d6:	f6 c2 02             	test   $0x2,%dl
     2d9:	74 0f                	je     2ea <__reset_isolation_pfn+0x1ca>
     2db:	48 6b c9 38          	imul   $0x38,%rcx,%rcx
     2df:	48 83 e2 f8          	and    $0xfffffffffffffff8,%rdx
     2e3:	48 01 ca             	add    %rcx,%rdx
     2e6:	4c 0f 45 f2          	cmovne %rdx,%r14
     2ea:	48 81 fe 00 00 08 00 	cmp    $0x80000,%rsi
     2f1:	0f 84 aa fe ff ff    	je     1a1 <__reset_isolation_pfn+0x81>
     2f7:	48 89 c2             	mov    %rax,%rdx
     2fa:	48 c1 ea 17          	shr    $0x17,%rdx
     2fe:	48 8b 14 d7          	mov    (%rdi,%rdx,8),%rdx
     302:	48 85 d2             	test   %rdx,%rdx
     305:	0f 84 96 fe ff ff    	je     1a1 <__reset_isolation_pfn+0x81>
     30b:	40 0f b6 f6          	movzbl %sil,%esi
     30f:	48 c1 e6 04          	shl    $0x4,%rsi
     313:	48 01 f2             	add    %rsi,%rdx
     316:	0f 84 85 fe ff ff    	je     1a1 <__reset_isolation_pfn+0x81>
     31c:	48 8b 12             	mov    (%rdx),%rdx
     31f:	f6 c2 02             	test   $0x2,%dl
     322:	0f 84 79 fe ff ff    	je     1a1 <__reset_isolation_pfn+0x81>
     328:	48 6b c0 38          	imul   $0x38,%rax,%rax
     32c:	48 83 e2 f8          	and    $0xfffffffffffffff8,%rdx
     330:	48 01 c2             	add    %rax,%rdx
     333:	75 2e                	jne    363 <__reset_isolation_pfn+0x243>
     335:	e9 67 fe ff ff       	jmpq   1a1 <__reset_isolation_pfn+0x81>
     33a:	45 84 e4             	test   %r12b,%r12b
     33d:	74 14                	je     353 <__reset_isolation_pfn+0x233>
     33f:	41 8b 46 30          	mov    0x30(%r14),%eax
     343:	25 80 00 00 f0       	and    $0xf0000080,%eax
     348:	3d 00 00 00 f0       	cmp    $0xf0000000,%eax
     34d:	0f 84 21 01 00 00    	je     474 <__reset_isolation_pfn+0x354>
     353:	49 81 c6 c0 01 00 00 	add    $0x1c0,%r14
     35a:	4c 39 f2             	cmp    %r14,%rdx
     35d:	0f 86 3e fe ff ff    	jbe    1a1 <__reset_isolation_pfn+0x81>
     363:	45 84 ff             	test   %r15b,%r15b
     366:	74 d2                	je     33a <__reset_isolation_pfn+0x21a>
     368:	49 8b 4e 08          	mov    0x8(%r14),%rcx
     36c:	48 8d 41 ff          	lea    -0x1(%rcx),%rax
     370:	83 e1 01             	and    $0x1,%ecx
     373:	49 0f 44 c6          	cmove  %r14,%rax
     377:	48 8b 00             	mov    (%rax),%rax
     37a:	a8 10                	test   $0x10,%al
     37c:	74 bc                	je     33a <__reset_isolation_pfn+0x21a>
     37e:	49 8b 16             	mov    (%r14),%rdx
     381:	48 89 d0             	mov    %rdx,%rax
     384:	48 c1 ea 35          	shr    $0x35,%rdx
     388:	48 8b 14 d7          	mov    (%rdi,%rdx,8),%rdx
     38c:	48 c1 e8 2d          	shr    $0x2d,%rax
     390:	48 85 d2             	test   %rdx,%rdx
     393:	74 0a                	je     39f <__reset_isolation_pfn+0x27f>
     395:	0f b6 c0             	movzbl %al,%eax
     398:	48 c1 e0 04          	shl    $0x4,%rax
     39c:	48 01 c2             	add    %rax,%rdx
     39f:	48 8b 02             	mov    (%rdx),%rax
     3a2:	4c 89 f2             	mov    %r14,%rdx
     3a5:	41 b8 01 00 00 00    	mov    $0x1,%r8d
     3ab:	31 f6                	xor    %esi,%esi
     3ad:	b9 03 00 00 00       	mov    $0x3,%ecx
     3b2:	4c 89 f7             	mov    %r14,%rdi
     3b5:	48 83 e0 f8          	and    $0xfffffffffffffff8,%rax
     3b9:	48 29 c2             	sub    %rax,%rdx
     3bc:	48 b8 b7 6d db b6 6d 	movabs $0x6db6db6db6db6db7,%rax
     3c3:	db b6 6d 
     3c6:	48 c1 fa 03          	sar    $0x3,%rdx
     3ca:	48 0f af d0          	imul   %rax,%rdx
     3ce:	e8 00 00 00 00       	callq  3d3 <__reset_isolation_pfn+0x2b3>
			3cf: R_X86_64_PLT32	set_pfnblock_flags_mask-0x4
     3d3:	44 0f b6 55 cc       	movzbl -0x34(%rbp),%r10d
     3d8:	e9 c7 fd ff ff       	jmpq   1a4 <__reset_isolation_pfn+0x84>
     3dd:	49 8b 0e             	mov    (%r14),%rcx
     3e0:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # 3e7 <__reset_isolation_pfn+0x2c7>
			3e3: R_X86_64_PC32	mem_section-0x4
     3e7:	48 89 ca             	mov    %rcx,%rdx
     3ea:	48 c1 ea 2d          	shr    $0x2d,%rdx
     3ee:	48 85 c0             	test   %rax,%rax
     3f1:	74 17                	je     40a <__reset_isolation_pfn+0x2ea>
     3f3:	48 c1 e9 35          	shr    $0x35,%rcx
     3f7:	48 8b 04 c8          	mov    (%rax,%rcx,8),%rax
     3fb:	48 85 c0             	test   %rax,%rax
     3fe:	74 0a                	je     40a <__reset_isolation_pfn+0x2ea>
     400:	0f b6 d2             	movzbl %dl,%edx
     403:	48 c1 e2 04          	shl    $0x4,%rdx
     407:	48 01 d0             	add    %rdx,%rax
     40a:	48 8b 00             	mov    (%rax),%rax
     40d:	4c 89 f6             	mov    %r14,%rsi
     410:	b9 01 00 00 00       	mov    $0x1,%ecx
     415:	ba 03 00 00 00       	mov    $0x3,%edx
     41a:	4c 89 f7             	mov    %r14,%rdi
     41d:	44 88 55 cb          	mov    %r10b,-0x35(%rbp)
     421:	48 83 e0 f8          	and    $0xfffffffffffffff8,%rax
     425:	48 29 c6             	sub    %rax,%rsi
     428:	48 b8 b7 6d db b6 6d 	movabs $0x6db6db6db6db6db7,%rax
     42f:	db b6 6d 
     432:	48 c1 fe 03          	sar    $0x3,%rsi
     436:	48 0f af f0          	imul   %rax,%rsi
     43a:	e8 00 00 00 00       	callq  43f <__reset_isolation_pfn+0x31f>
			43b: R_X86_64_PLT32	get_pfnblock_flags_mask-0x4
     43f:	44 0f b6 55 cb       	movzbl -0x35(%rbp),%r10d
     444:	48 85 c0             	test   %rax,%rax
     447:	0f 85 10 fe ff ff    	jne    25d <__reset_isolation_pfn+0x13d>
     44d:	e9 52 fd ff ff       	jmpq   1a4 <__reset_isolation_pfn+0x84>
     452:	48 81 fe 00 00 08 00 	cmp    $0x80000,%rsi
     459:	0f 84 42 fd ff ff    	je     1a1 <__reset_isolation_pfn+0x81>
     45f:	48 8b 3d 00 00 00 00 	mov    0x0(%rip),%rdi        # 466 <__reset_isolation_pfn+0x346>
			462: R_X86_64_PC32	mem_section-0x4
     466:	48 85 ff             	test   %rdi,%rdi
     469:	0f 85 88 fe ff ff    	jne    2f7 <__reset_isolation_pfn+0x1d7>
     46f:	e9 2d fd ff ff       	jmpq   1a1 <__reset_isolation_pfn+0x81>
     474:	49 8b 06             	mov    (%r14),%rax
     477:	48 89 c2             	mov    %rax,%rdx
     47a:	48 c1 e8 35          	shr    $0x35,%rax
     47e:	48 8b 04 c7          	mov    (%rdi,%rax,8),%rax
     482:	48 c1 ea 2d          	shr    $0x2d,%rdx
     486:	48 85 c0             	test   %rax,%rax
     489:	74 0a                	je     495 <__reset_isolation_pfn+0x375>
     48b:	0f b6 d2             	movzbl %dl,%edx
     48e:	48 c1 e2 04          	shl    $0x4,%rdx
     492:	48 01 d0             	add    %rdx,%rax
     495:	48 8b 00             	mov    (%rax),%rax
     498:	4c 89 f2             	mov    %r14,%rdx
     49b:	41 b8 01 00 00 00    	mov    $0x1,%r8d
     4a1:	31 f6                	xor    %esi,%esi
     4a3:	b9 03 00 00 00       	mov    $0x3,%ecx
     4a8:	4c 89 f7             	mov    %r14,%rdi
     4ab:	48 83 e0 f8          	and    $0xfffffffffffffff8,%rax
     4af:	48 29 c2             	sub    %rax,%rdx
     4b2:	48 b8 b7 6d db b6 6d 	movabs $0x6db6db6db6db6db7,%rax
     4b9:	db b6 6d 
     4bc:	48 c1 fa 03          	sar    $0x3,%rdx
     4c0:	48 0f af d0          	imul   %rax,%rdx
     4c4:	e8 00 00 00 00       	callq  4c9 <__reset_isolation_pfn+0x3a9>
			4c5: R_X86_64_PLT32	set_pfnblock_flags_mask-0x4
     4c9:	45 89 ea             	mov    %r13d,%r10d
     4cc:	e9 d3 fc ff ff       	jmpq   1a4 <__reset_isolation_pfn+0x84>
     4d1:	66 66 2e 0f 1f 84 00 	data16 nopw %cs:0x0(%rax,%rax,1)
     4d8:	00 00 00 00 
     4dc:	0f 1f 40 00          	nopl   0x0(%rax)


* Re: [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]
  2019-10-01 19:40         ` Florian Weimer
@ 2019-10-07 13:28           ` Vlastimil Babka
  2019-10-07 13:56             ` Vlastimil Babka
  0 siblings, 1 reply; 10+ messages in thread
From: Vlastimil Babka @ 2019-10-07 13:28 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-mm, Mel Gorman

On 10/1/19 9:40 PM, Florian Weimer wrote:
> * Vlastimil Babka:
> 
>> On 9/30/19 11:17 PM, Dave Chinner wrote:
>>> On Mon, Sep 30, 2019 at 09:07:53PM +0200, Florian Weimer wrote:
>>>> * Dave Chinner:
>>>>
>>>>> On Mon, Sep 30, 2019 at 09:28:27AM +0200, Florian Weimer wrote:
>>>>>> Simply running “du -hc” on a large directory tree causes du to be
>>>>>> killed because of kernel paging request failure in the XFS code.
>>>>>
>>>>> dmesg output? if the system was still running, then you might be
>>>>> able to pull the trace from syslog. But we can't do much without
>>>>> knowing what the actual failure was....
>>>>
>>>> Huh.  I actually have something in syslog:
>>>>
>>>> [ 4001.238411] BUG: kernel NULL pointer dereference, address:
>>>> 0000000000000000
>>>> [ 4001.238415] #PF: supervisor read access in kernel mode
>>>> [ 4001.238417] #PF: error_code(0x0000) - not-present page
>>>> [ 4001.238418] PGD 0 P4D 0 
>>>> [ 4001.238420] Oops: 0000 [#1] SMP PTI
>>>> [ 4001.238423] CPU: 3 PID: 143 Comm: kswapd0 Tainted: G I 5.2.16fw+
>>>> #1
>>>> [ 4001.238424] Hardware name: System manufacturer System Product
>>>> Name/P6X58D-E, BIOS 0701 05/10/2011
>>>> [ 4001.238430] RIP: 0010:__reset_isolation_pfn+0x27f/0x3c0
>>>
>>> That's memory compaction code it's crashed in.
>>>
>>>> [ 4001.238432] Code: 44 c6 48 8b 00 a8 10 74 bc 49 8b 16 48 89 d0
>>>> 48 c1 ea 35 48 8b 14 d7 48 c1 e8 2d 48 85 d2 74 0a 0f b6 c0 48 c1
>>>> e0 04 48 01 c2 <48> 8b 02 4c 89 f2 41 b8 01 00 00 00 31 f6 b9 03 00
>>>> 00 00 4c 89 f7
>>
>> Tried to decode it, but couldn't match it to source code, my version of
>> compiled code is too different. Would it be possible to either send
>> mm/compaction.o from the matching build, or output of 'objdump -d -l'
>> for the __reset_isolation_pfn function?
> 
> See below.  I don't have debuginfo for this build, and the binary does
> not reproduce for some reason.  Due to the heavy inlining, it might be
> quite hard to figure out what's going on.

Thanks, but I'm still not able to "decompile" that in my head.

> I've switched to kernel builds with debuginfo from now on.  I'm
> surprised that it's not the default.

Let's see if you can reproduce it with that.

However, I've noticed at least something weird:

>      37e:	49 8b 16             	mov    (%r14),%rdx
>      381:	48 89 d0             	mov    %rdx,%rax
>      384:	48 c1 ea 35          	shr    $0x35,%rdx
>      388:	48 8b 14 d7          	mov    (%rdi,%rdx,8),%rdx
>      38c:	48 c1 e8 2d          	shr    $0x2d,%rax
>      390:	48 85 d2             	test   %rdx,%rdx
>      393:	74 0a                	je     39f <__reset_isolation_pfn+0x27f>

IIUC, this will jump to 39f when rdx is zero.

>      395:	0f b6 c0             	movzbl %al,%eax
>      398:	48 c1 e0 04          	shl    $0x4,%rax
>      39c:	48 01 c2             	add    %rax,%rdx
>      39f:	48 8b 02             	mov    (%rdx),%rax

And this is where we crash because rdx is zero. So the test+branch might
have sent us directly here to crash. Sounds like an inverted condition
somewhere? Or possibly a result of optimizations.


* Re: [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]
  2019-10-07 13:28           ` Vlastimil Babka
@ 2019-10-07 13:56             ` Vlastimil Babka
  2019-10-08  8:52               ` Mel Gorman
  0 siblings, 1 reply; 10+ messages in thread
From: Vlastimil Babka @ 2019-10-07 13:56 UTC (permalink / raw)
  To: Florian Weimer, Mel Gorman
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-mm

On 10/7/19 3:28 PM, Vlastimil Babka wrote:
> On 10/1/19 9:40 PM, Florian Weimer wrote:
>> * Vlastimil Babka:
>>
>>
>> See below.  I don't have debuginfo for this build, and the binary does
>> not reproduce for some reason.  Due to the heavy inlining, it might be
>> quite hard to figure out what's going on.
> 
> Thanks, but I'm still not able to "decompile" that in my head.

While staring at the code, I think I found two probably unrelated bugs.
One is that pfn and page might be desynced when the zone starts in the
middle of a pageblock, as the max() is only applied to page and not pfn.
But that only effectively affects the later pfn_valid_within() checks,
which should always be true on x86.

The second is that "end of pageblock online and valid" should refer to
the last pfn of the pageblock, not the first pfn of the next pageblock.
Otherwise we might return false needlessly. Mel, what do you think?

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -270,14 +270,15 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
 
        /* Ensure the start of the pageblock or zone is online and valid */
        block_pfn = pageblock_start_pfn(pfn);
-       block_page = pfn_to_online_page(max(block_pfn, zone->zone_start_pfn));
+       block_pfn = max(block_pfn, zone->zone_start_pfn);
+       block_page = pfn_to_online_page(block_pfn);
        if (block_page) {
                page = block_page;
                pfn = block_pfn;
        }
 
        /* Ensure the end of the pageblock or zone is online and valid */
-       block_pfn += pageblock_nr_pages;
+       block_pfn = pageblock_end_pfn(pfn) - 1;
        block_pfn = min(block_pfn, zone_end_pfn(zone) - 1);
        end_page = pfn_to_online_page(block_pfn);
        if (!end_page)


* Re: [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16]
  2019-10-07 13:56             ` Vlastimil Babka
@ 2019-10-08  8:52               ` Mel Gorman
  0 siblings, 0 replies; 10+ messages in thread
From: Mel Gorman @ 2019-10-08  8:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Florian Weimer, Dave Chinner, linux-xfs, linux-fsdevel, linux-mm

On Mon, Oct 07, 2019 at 03:56:41PM +0200, Vlastimil Babka wrote:
> On 10/7/19 3:28 PM, Vlastimil Babka wrote:
> > On 10/1/19 9:40 PM, Florian Weimer wrote:
> >> * Vlastimil Babka:
> >>
> >>
> >> See below.  I don't have debuginfo for this build, and the binary does
> >> not reproduce for some reason.  Due to the heavy inlining, it might be
> >> quite hard to figure out what's going on.
> > 
> > Thanks, but I'm still not able to "decompile" that in my head.
> 
> While staring at the code, I think I found two probably unrelated bugs.
> One is that pfn and page might be desynced when the zone starts in the
> middle of a pageblock, as the max() is only applied to page and not pfn.
> But that only effectively affects the later pfn_valid_within() checks,
> which should always be true on x86.
> 
> The second is that "end of pageblock online and valid" should refer to
> the last pfn of the pageblock, not the first pfn of the next pageblock.
> Otherwise we might return false needlessly. Mel, what do you think?
> 

I think you are correct in both cases. It's perfectly possible I would
not have observed a problem in testing if zones were aligned, which I
think is generally the case on my test machines.

-- 
Mel Gorman
SUSE Labs


end of thread, other threads:[~2019-10-08  9:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-30  7:28 xfs_inode not reclaimed/memory leak on 5.2.16 Florian Weimer
2019-09-30  8:54 ` Dave Chinner
2019-09-30 19:07   ` Florian Weimer
2019-09-30 21:17     ` [bug, 5.2.16] kswapd/compaction null pointer crash [was Re: xfs_inode not reclaimed/memory leak on 5.2.16] Dave Chinner
2019-09-30 21:42       ` Florian Weimer
2019-10-01  9:10       ` Vlastimil Babka
2019-10-01 19:40         ` Florian Weimer
2019-10-07 13:28           ` Vlastimil Babka
2019-10-07 13:56             ` Vlastimil Babka
2019-10-08  8:52               ` Mel Gorman
