* Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
@ 2023-10-21  3:38 Marek Marczykowski-Górecki
  2023-10-21  7:48 ` Bagas Sanjaya
  2023-10-23 20:59   ` Mikulas Patocka
  0 siblings, 2 replies; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-21  3:38 UTC (permalink / raw)
  To: stable
  Cc: regressions, Alasdair Kergon, Mike Snitzer, Mikulas Patocka, dm-devel

[-- Attachment #1: Type: text/plain, Size: 7203 bytes --]

Hi,

Since updating from 6.4.13 to 6.5.5, I occasionally hit a storage
subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
triggers the issue, but it often happens when doing some LVM operations
(lvremove, lvrename, etc.) on a dm-thin volume together with a bulk data
copy to/from another LVM thin volume with an ext4 fs.

The storage stack I use is:
  nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4

And this whole thing is running in a (PV) dom0 under Xen, on Qubes OS 4.2 to
be specific.

I can reproduce the issue on at least 3 different machines. I also tried
6.5.6 and the issue is still there. I haven't checked newer versions,
but I briefly reviewed the git log and haven't found anything suggesting
a fix for a similar issue.

I managed to bisect it down to this commit:

    commit 5054e778fcd9cd29ddaa8109077cd235527e4f94
    Author: Mikulas Patocka <mpatocka@redhat.com>
    Date:   Mon May 1 09:19:17 2023 -0400

    dm crypt: allocate compound pages if possible
    
    It was reported that allocating pages for the write buffer in dm-crypt
    causes measurable overhead [1].
    
    Change dm-crypt to allocate compound pages if they are available. If
    not, fall back to the mempool.
    
    [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html
    
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@kernel.org>

TBH, I'm not sure if the bug is in this commit, or maybe in some
functions it uses (I don't see dm-crypt functions directly involved in
the stack traces I collected). But reverting this commit on top of 6.5.6
seems to fix the issue.
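
For context, the change boils down to roughly this shape in
crypt_alloc_buffer() (a paraphrased sketch of the idea, not the literal
kernel code): try a large compound allocation first, step the order down
on failure, and fall back to the order-0 mempool path only when no
compound page can be obtained.

    unsigned int order = MAX_ORDER - 1;     /* start with the largest order */
    ...
    while (order > 0) {
            pages = alloc_pages(gfp_mask | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP,
                                order);
            if (pages)
                    break;                  /* got a compound page of 2^order pages */
            order--;                        /* shrink the request and retry */
    }
    if (!pages)
            pages = mempool_alloc(&cc->page_pool, gfp_mask);  /* fall back to the mempool as before */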

I tried also CONFIG_PROVE_LOCKING, but it didn't show any issue.

I managed to collect the "blocked tasks" dump via sysrq, included below. A
few more can be found at https://github.com/QubesOS/qubes-issues/issues/8575

    [ 4246.558313] sysrq: Show Blocked State
    [ 4246.558388] task:journal-offline state:D stack:0     pid:8098  ppid:1      flags:0x00000002
    [ 4246.558407] Call Trace:
    [ 4246.558414]  <TASK>
    [ 4246.558422]  __schedule+0x23d/0x670
    [ 4246.558440]  schedule+0x5e/0xd0
    [ 4246.558450]  io_schedule+0x46/0x70
    [ 4246.558461]  folio_wait_bit_common+0x13d/0x350
    [ 4246.558475]  ? __pfx_wake_page_function+0x10/0x10
    [ 4246.558488]  folio_wait_writeback+0x2c/0x90
    [ 4246.558498]  mpage_prepare_extent_to_map+0x15c/0x4d0
    [ 4246.558512]  ext4_do_writepages+0x25f/0x770
    [ 4246.558523]  ext4_writepages+0xad/0x180
    [ 4246.558533]  do_writepages+0xcf/0x1e0
    [ 4246.558543]  ? __seccomp_filter+0x32a/0x4f0
    [ 4246.558554]  filemap_fdatawrite_wbc+0x63/0x90
    [ 4246.558567]  __filemap_fdatawrite_range+0x5c/0x80
    [ 4246.558578]  file_write_and_wait_range+0x4a/0xb0
    [ 4246.558588]  ext4_sync_file+0x88/0x380
    [ 4246.558598]  __x64_sys_fsync+0x3b/0x70
    [ 4246.558609]  do_syscall_64+0x5c/0x90
    [ 4246.558621]  ? exit_to_user_mode_prepare+0xb2/0xd0
    [ 4246.558632]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    [ 4246.558644] RIP: 0033:0x7710cf124d0a
    [ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
    [ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a
    [ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
    [ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0
    [ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940
    [ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000
    [ 4246.558728]  </TASK>
    [ 4246.558836] task:lvm             state:D stack:0     pid:7835  ppid:5665   flags:0x00004006
    [ 4246.558852] Call Trace:
    [ 4246.558857]  <TASK>
    [ 4246.558863]  __schedule+0x23d/0x670
    [ 4246.558874]  schedule+0x5e/0xd0
    [ 4246.558884]  io_schedule+0x46/0x70
    [ 4246.558894]  dm_wait_for_bios_completion+0xfc/0x110
    [ 4246.558909]  ? __pfx_autoremove_wake_function+0x10/0x10
    [ 4246.558922]  __dm_suspend+0x7e/0x1b0
    [ 4246.558932]  dm_internal_suspend_noflush+0x5c/0x80
    [ 4246.558946]  pool_presuspend+0xcc/0x130 [dm_thin_pool]
    [ 4246.558968]  dm_table_presuspend_targets+0x3f/0x60
    [ 4246.558980]  __dm_suspend+0x41/0x1b0
    [ 4246.558991]  dm_suspend+0xc0/0xe0
    [ 4246.559001]  dev_suspend+0xa5/0xd0
    [ 4246.559011]  ctl_ioctl+0x26e/0x350
    [ 4246.559020]  ? __pfx_dev_suspend+0x10/0x10
    [ 4246.559032]  dm_ctl_ioctl+0xe/0x20
    [ 4246.559041]  __x64_sys_ioctl+0x94/0xd0
    [ 4246.559052]  do_syscall_64+0x5c/0x90
    [ 4246.559062]  ? do_syscall_64+0x6b/0x90
    [ 4246.559072]  ? do_syscall_64+0x6b/0x90
    [ 4246.559081]  ? xen_pv_evtchn_do_upcall+0x54/0xb0
    [ 4246.559093]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    [ 4246.559104] RIP: 0033:0x7f1cb77cfe0f
    [ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f
    [ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003
    [ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420
    [ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
    [ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0
    [ 4246.559199]  </TASK>
    [ 4246.559207] task:kworker/u8:3    state:D stack:0     pid:8033  ppid:2      flags:0x00004000
    [ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4)
    [ 4246.559238] Call Trace:
    [ 4246.559244]  <TASK>
    [ 4246.559249]  __schedule+0x23d/0x670
    [ 4246.559260]  schedule+0x5e/0xd0
    [ 4246.559270]  io_schedule+0x46/0x70
    [ 4246.559280]  folio_wait_bit_common+0x13d/0x350
    [ 4246.559290]  ? __pfx_wake_page_function+0x10/0x10
    [ 4246.559302]  mpage_prepare_extent_to_map+0x309/0x4d0
    [ 4246.559314]  ext4_do_writepages+0x25f/0x770
    [ 4246.559324]  ext4_writepages+0xad/0x180
    [ 4246.559334]  do_writepages+0xcf/0x1e0
    [ 4246.559344]  ? find_busiest_group+0x42/0x1a0
    [ 4246.559354]  __writeback_single_inode+0x3d/0x280
    [ 4246.559368]  writeback_sb_inodes+0x1ed/0x4a0
    [ 4246.559381]  __writeback_inodes_wb+0x4c/0xf0
    [ 4246.559393]  wb_writeback+0x298/0x310
    [ 4246.559403]  wb_do_writeback+0x230/0x2b0
    [ 4246.559414]  wb_workfn+0x5f/0x260
    [ 4246.559424]  ? _raw_spin_unlock+0xe/0x30
    [ 4246.559434]  ? finish_task_switch.isra.0+0x95/0x2b0
    [ 4246.559447]  ? __schedule+0x245/0x670
    [ 4246.559457]  process_one_work+0x1df/0x3e0
    [ 4246.559466]  worker_thread+0x51/0x390
    [ 4246.559475]  ? __pfx_worker_thread+0x10/0x10
    [ 4246.559484]  kthread+0xe5/0x120
    [ 4246.559495]  ? __pfx_kthread+0x10/0x10
    [ 4246.559504]  ret_from_fork+0x31/0x50
    [ 4246.559514]  ? __pfx_kthread+0x10/0x10
    [ 4246.559523]  ret_from_fork_asm+0x1b/0x30
    [ 4246.559536]  </TASK>

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-21  3:38 Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5 Marek Marczykowski-Górecki
@ 2023-10-21  7:48 ` Bagas Sanjaya
  2023-10-29  6:23     ` Linux regression tracking #update (Thorsten Leemhuis)
  2023-10-23 20:59   ` Mikulas Patocka
  1 sibling, 1 reply; 66+ messages in thread
From: Bagas Sanjaya @ 2023-10-21  7:48 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Linux Stable
  Cc: Linux Regressions, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Linux Devicemapper, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 7665 bytes --]

On Sat, Oct 21, 2023 at 05:38:58AM +0200, Marek Marczykowski-Górecki wrote:
> Hi,
> 
> Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage
> subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
> triggers the issue, but often it happens when doing some LVM operations
> (lvremove, lvrename etc) on a dm-thin volume together with bulk data
> copy to/from another LVM thin volume with ext4 fs.
> 
> The storage stack I use is:
>   nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4
> 
> And this whole thing running in a (PV) dom0 under Xen, on Qubes OS 4.2 to be
> specific.
> 
> I can reproduce the issue on at least 3 different machines. I did tried
> also 6.5.6 and the issue is still there. I haven't checked newer
> versions, but briefly reviewed git log and haven't found anything
> suggesting a fix to similar issue.
> 
> I managed to bisect it down to this commit:
> 
>     commit 5054e778fcd9cd29ddaa8109077cd235527e4f94
>     Author: Mikulas Patocka <mpatocka@redhat.com>
>     Date:   Mon May 1 09:19:17 2023 -0400
> 
>     dm crypt: allocate compound pages if possible
>     
>     It was reported that allocating pages for the write buffer in dm-crypt
>     causes measurable overhead [1].
>     
>     Change dm-crypt to allocate compound pages if they are available. If
>     not, fall back to the mempool.
>     
>     [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html
>     
>     Suggested-by: Matthew Wilcox <willy@infradead.org>
>     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>     Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> 
> TBH, I'm not sure if the bug is in this commit, or maybe in some
> functions it uses (I don't see dm-crypt functions directly involved in
> the stack traces I collected). But reverting this commit on top of 6.5.6
> seems to fix the issue.
> 
> I tried also CONFIG_PROVE_LOCKING, but it didn't show any issue.
> 
> I managed to collect "blocked tasks" dump via sysrq below. Few more can
> be found at https://github.com/QubesOS/qubes-issues/issues/8575
> 
>     [ 4246.558313] sysrq: Show Blocked State
>     [ 4246.558388] task:journal-offline state:D stack:0     pid:8098  ppid:1      flags:0x00000002
>     [ 4246.558407] Call Trace:
>     [ 4246.558414]  <TASK>
>     [ 4246.558422]  __schedule+0x23d/0x670
>     [ 4246.558440]  schedule+0x5e/0xd0
>     [ 4246.558450]  io_schedule+0x46/0x70
>     [ 4246.558461]  folio_wait_bit_common+0x13d/0x350
>     [ 4246.558475]  ? __pfx_wake_page_function+0x10/0x10
>     [ 4246.558488]  folio_wait_writeback+0x2c/0x90
>     [ 4246.558498]  mpage_prepare_extent_to_map+0x15c/0x4d0
>     [ 4246.558512]  ext4_do_writepages+0x25f/0x770
>     [ 4246.558523]  ext4_writepages+0xad/0x180
>     [ 4246.558533]  do_writepages+0xcf/0x1e0
>     [ 4246.558543]  ? __seccomp_filter+0x32a/0x4f0
>     [ 4246.558554]  filemap_fdatawrite_wbc+0x63/0x90
>     [ 4246.558567]  __filemap_fdatawrite_range+0x5c/0x80
>     [ 4246.558578]  file_write_and_wait_range+0x4a/0xb0
>     [ 4246.558588]  ext4_sync_file+0x88/0x380
>     [ 4246.558598]  __x64_sys_fsync+0x3b/0x70
>     [ 4246.558609]  do_syscall_64+0x5c/0x90
>     [ 4246.558621]  ? exit_to_user_mode_prepare+0xb2/0xd0
>     [ 4246.558632]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>     [ 4246.558644] RIP: 0033:0x7710cf124d0a
>     [ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
>     [ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a
>     [ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
>     [ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0
>     [ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940
>     [ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000
>     [ 4246.558728]  </TASK>
>     [ 4246.558836] task:lvm             state:D stack:0     pid:7835  ppid:5665   flags:0x00004006
>     [ 4246.558852] Call Trace:
>     [ 4246.558857]  <TASK>
>     [ 4246.558863]  __schedule+0x23d/0x670
>     [ 4246.558874]  schedule+0x5e/0xd0
>     [ 4246.558884]  io_schedule+0x46/0x70
>     [ 4246.558894]  dm_wait_for_bios_completion+0xfc/0x110
>     [ 4246.558909]  ? __pfx_autoremove_wake_function+0x10/0x10
>     [ 4246.558922]  __dm_suspend+0x7e/0x1b0
>     [ 4246.558932]  dm_internal_suspend_noflush+0x5c/0x80
>     [ 4246.558946]  pool_presuspend+0xcc/0x130 [dm_thin_pool]
>     [ 4246.558968]  dm_table_presuspend_targets+0x3f/0x60
>     [ 4246.558980]  __dm_suspend+0x41/0x1b0
>     [ 4246.558991]  dm_suspend+0xc0/0xe0
>     [ 4246.559001]  dev_suspend+0xa5/0xd0
>     [ 4246.559011]  ctl_ioctl+0x26e/0x350
>     [ 4246.559020]  ? __pfx_dev_suspend+0x10/0x10
>     [ 4246.559032]  dm_ctl_ioctl+0xe/0x20
>     [ 4246.559041]  __x64_sys_ioctl+0x94/0xd0
>     [ 4246.559052]  do_syscall_64+0x5c/0x90
>     [ 4246.559062]  ? do_syscall_64+0x6b/0x90
>     [ 4246.559072]  ? do_syscall_64+0x6b/0x90
>     [ 4246.559081]  ? xen_pv_evtchn_do_upcall+0x54/0xb0
>     [ 4246.559093]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>     [ 4246.559104] RIP: 0033:0x7f1cb77cfe0f
>     [ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>     [ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f
>     [ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003
>     [ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420
>     [ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>     [ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0
>     [ 4246.559199]  </TASK>
>     [ 4246.559207] task:kworker/u8:3    state:D stack:0     pid:8033  ppid:2      flags:0x00004000
>     [ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4)
>     [ 4246.559238] Call Trace:
>     [ 4246.559244]  <TASK>
>     [ 4246.559249]  __schedule+0x23d/0x670
>     [ 4246.559260]  schedule+0x5e/0xd0
>     [ 4246.559270]  io_schedule+0x46/0x70
>     [ 4246.559280]  folio_wait_bit_common+0x13d/0x350
>     [ 4246.559290]  ? __pfx_wake_page_function+0x10/0x10
>     [ 4246.559302]  mpage_prepare_extent_to_map+0x309/0x4d0
>     [ 4246.559314]  ext4_do_writepages+0x25f/0x770
>     [ 4246.559324]  ext4_writepages+0xad/0x180
>     [ 4246.559334]  do_writepages+0xcf/0x1e0
>     [ 4246.559344]  ? find_busiest_group+0x42/0x1a0
>     [ 4246.559354]  __writeback_single_inode+0x3d/0x280
>     [ 4246.559368]  writeback_sb_inodes+0x1ed/0x4a0
>     [ 4246.559381]  __writeback_inodes_wb+0x4c/0xf0
>     [ 4246.559393]  wb_writeback+0x298/0x310
>     [ 4246.559403]  wb_do_writeback+0x230/0x2b0
>     [ 4246.559414]  wb_workfn+0x5f/0x260
>     [ 4246.559424]  ? _raw_spin_unlock+0xe/0x30
>     [ 4246.559434]  ? finish_task_switch.isra.0+0x95/0x2b0
>     [ 4246.559447]  ? __schedule+0x245/0x670
>     [ 4246.559457]  process_one_work+0x1df/0x3e0
>     [ 4246.559466]  worker_thread+0x51/0x390
>     [ 4246.559475]  ? __pfx_worker_thread+0x10/0x10
>     [ 4246.559484]  kthread+0xe5/0x120
>     [ 4246.559495]  ? __pfx_kthread+0x10/0x10
>     [ 4246.559504]  ret_from_fork+0x31/0x50
>     [ 4246.559514]  ? __pfx_kthread+0x10/0x10
>     [ 4246.559523]  ret_from_fork_asm+0x1b/0x30
>     [ 4246.559536]  </TASK>
> 

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 5054e778fcd9cd

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-21  3:38 Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5 Marek Marczykowski-Górecki
@ 2023-10-23 20:59   ` Mikulas Patocka
  2023-10-23 20:59   ` Mikulas Patocka
  1 sibling, 0 replies; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-23 20:59 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: stable, regressions, Alasdair Kergon, Mike Snitzer, dm-devel

[-- Attachment #1: Type: text/plain, Size: 8019 bytes --]

Hi

It's hard to say what causes this. Perhaps dm-crypt exhausts all the 
higher-order pages and some subsystem stalls because of it.

In drivers/md/dm-crypt.c, in the function crypt_alloc_buffer, there is
"unsigned int order = MAX_ORDER - 1"

What happens if you set the "order" variable to some small value, like 1,
2 or 3? Does the problem go away? Could you find a threshold value
(something between 0 and MAX_ORDER - 1) where the bug starts to appear?

What happens if you replace
"pages = alloc_pages(gfp_mask"
with
"pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)"
? Does the bug go away?
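
To spell both experiments out, the edits in crypt_alloc_buffer would look
roughly like this (before/after sketch only; the elided lines stay as they
are):

    -	unsigned int order = MAX_ORDER - 1;
    +	unsigned int order = 3;		/* experiment 1: try 1, 2, 3, ... to find the threshold */
    ...
    -	pages = alloc_pages(gfp_mask
    +	pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)	/* experiment 2 */
    		/* remaining GFP flags and the order argument unchanged */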

Mikulas


On Sat, 21 Oct 2023, Marek Marczykowski-Górecki wrote:

> Hi,
> 
> Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage
> subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
> triggers the issue, but often it happens when doing some LVM operations
> (lvremove, lvrename etc) on a dm-thin volume together with bulk data
> copy to/from another LVM thin volume with ext4 fs.
> 
> The storage stack I use is:
>   nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4
> 
> And this whole thing running in a (PV) dom0 under Xen, on Qubes OS 4.2 to be
> specific.
> 
> I can reproduce the issue on at least 3 different machines. I did tried
> also 6.5.6 and the issue is still there. I haven't checked newer
> versions, but briefly reviewed git log and haven't found anything
> suggesting a fix to similar issue.
> 
> I managed to bisect it down to this commit:
> 
>     commit 5054e778fcd9cd29ddaa8109077cd235527e4f94
>     Author: Mikulas Patocka <mpatocka@redhat.com>
>     Date:   Mon May 1 09:19:17 2023 -0400
> 
>     dm crypt: allocate compound pages if possible
>     
>     It was reported that allocating pages for the write buffer in dm-crypt
>     causes measurable overhead [1].
>     
>     Change dm-crypt to allocate compound pages if they are available. If
>     not, fall back to the mempool.
>     
>     [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html
>     
>     Suggested-by: Matthew Wilcox <willy@infradead.org>
>     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>     Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> 
> TBH, I'm not sure if the bug is in this commit, or maybe in some
> functions it uses (I don't see dm-crypt functions directly involved in
> the stack traces I collected). But reverting this commit on top of 6.5.6
> seems to fix the issue.
> 
> I tried also CONFIG_PROVE_LOCKING, but it didn't show any issue.
> 
> I managed to collect "blocked tasks" dump via sysrq below. Few more can
> be found at https://github.com/QubesOS/qubes-issues/issues/8575
> 
>     [ 4246.558313] sysrq: Show Blocked State
>     [ 4246.558388] task:journal-offline state:D stack:0     pid:8098  ppid:1      flags:0x00000002
>     [ 4246.558407] Call Trace:
>     [ 4246.558414]  <TASK>
>     [ 4246.558422]  __schedule+0x23d/0x670
>     [ 4246.558440]  schedule+0x5e/0xd0
>     [ 4246.558450]  io_schedule+0x46/0x70
>     [ 4246.558461]  folio_wait_bit_common+0x13d/0x350
>     [ 4246.558475]  ? __pfx_wake_page_function+0x10/0x10
>     [ 4246.558488]  folio_wait_writeback+0x2c/0x90
>     [ 4246.558498]  mpage_prepare_extent_to_map+0x15c/0x4d0
>     [ 4246.558512]  ext4_do_writepages+0x25f/0x770
>     [ 4246.558523]  ext4_writepages+0xad/0x180
>     [ 4246.558533]  do_writepages+0xcf/0x1e0
>     [ 4246.558543]  ? __seccomp_filter+0x32a/0x4f0
>     [ 4246.558554]  filemap_fdatawrite_wbc+0x63/0x90
>     [ 4246.558567]  __filemap_fdatawrite_range+0x5c/0x80
>     [ 4246.558578]  file_write_and_wait_range+0x4a/0xb0
>     [ 4246.558588]  ext4_sync_file+0x88/0x380
>     [ 4246.558598]  __x64_sys_fsync+0x3b/0x70
>     [ 4246.558609]  do_syscall_64+0x5c/0x90
>     [ 4246.558621]  ? exit_to_user_mode_prepare+0xb2/0xd0
>     [ 4246.558632]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>     [ 4246.558644] RIP: 0033:0x7710cf124d0a
>     [ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
>     [ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a
>     [ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
>     [ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0
>     [ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940
>     [ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000
>     [ 4246.558728]  </TASK>
>     [ 4246.558836] task:lvm             state:D stack:0     pid:7835  ppid:5665   flags:0x00004006
>     [ 4246.558852] Call Trace:
>     [ 4246.558857]  <TASK>
>     [ 4246.558863]  __schedule+0x23d/0x670
>     [ 4246.558874]  schedule+0x5e/0xd0
>     [ 4246.558884]  io_schedule+0x46/0x70
>     [ 4246.558894]  dm_wait_for_bios_completion+0xfc/0x110
>     [ 4246.558909]  ? __pfx_autoremove_wake_function+0x10/0x10
>     [ 4246.558922]  __dm_suspend+0x7e/0x1b0
>     [ 4246.558932]  dm_internal_suspend_noflush+0x5c/0x80
>     [ 4246.558946]  pool_presuspend+0xcc/0x130 [dm_thin_pool]
>     [ 4246.558968]  dm_table_presuspend_targets+0x3f/0x60
>     [ 4246.558980]  __dm_suspend+0x41/0x1b0
>     [ 4246.558991]  dm_suspend+0xc0/0xe0
>     [ 4246.559001]  dev_suspend+0xa5/0xd0
>     [ 4246.559011]  ctl_ioctl+0x26e/0x350
>     [ 4246.559020]  ? __pfx_dev_suspend+0x10/0x10
>     [ 4246.559032]  dm_ctl_ioctl+0xe/0x20
>     [ 4246.559041]  __x64_sys_ioctl+0x94/0xd0
>     [ 4246.559052]  do_syscall_64+0x5c/0x90
>     [ 4246.559062]  ? do_syscall_64+0x6b/0x90
>     [ 4246.559072]  ? do_syscall_64+0x6b/0x90
>     [ 4246.559081]  ? xen_pv_evtchn_do_upcall+0x54/0xb0
>     [ 4246.559093]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>     [ 4246.559104] RIP: 0033:0x7f1cb77cfe0f
>     [ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>     [ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f
>     [ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003
>     [ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420
>     [ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>     [ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0
>     [ 4246.559199]  </TASK>
>     [ 4246.559207] task:kworker/u8:3    state:D stack:0     pid:8033  ppid:2      flags:0x00004000
>     [ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4)
>     [ 4246.559238] Call Trace:
>     [ 4246.559244]  <TASK>
>     [ 4246.559249]  __schedule+0x23d/0x670
>     [ 4246.559260]  schedule+0x5e/0xd0
>     [ 4246.559270]  io_schedule+0x46/0x70
>     [ 4246.559280]  folio_wait_bit_common+0x13d/0x350
>     [ 4246.559290]  ? __pfx_wake_page_function+0x10/0x10
>     [ 4246.559302]  mpage_prepare_extent_to_map+0x309/0x4d0
>     [ 4246.559314]  ext4_do_writepages+0x25f/0x770
>     [ 4246.559324]  ext4_writepages+0xad/0x180
>     [ 4246.559334]  do_writepages+0xcf/0x1e0
>     [ 4246.559344]  ? find_busiest_group+0x42/0x1a0
>     [ 4246.559354]  __writeback_single_inode+0x3d/0x280
>     [ 4246.559368]  writeback_sb_inodes+0x1ed/0x4a0
>     [ 4246.559381]  __writeback_inodes_wb+0x4c/0xf0
>     [ 4246.559393]  wb_writeback+0x298/0x310
>     [ 4246.559403]  wb_do_writeback+0x230/0x2b0
>     [ 4246.559414]  wb_workfn+0x5f/0x260
>     [ 4246.559424]  ? _raw_spin_unlock+0xe/0x30
>     [ 4246.559434]  ? finish_task_switch.isra.0+0x95/0x2b0
>     [ 4246.559447]  ? __schedule+0x245/0x670
>     [ 4246.559457]  process_one_work+0x1df/0x3e0
>     [ 4246.559466]  worker_thread+0x51/0x390
>     [ 4246.559475]  ? __pfx_worker_thread+0x10/0x10
>     [ 4246.559484]  kthread+0xe5/0x120
>     [ 4246.559495]  ? __pfx_kthread+0x10/0x10
>     [ 4246.559504]  ret_from_fork+0x31/0x50
>     [ 4246.559514]  ? __pfx_kthread+0x10/0x10
>     [ 4246.559523]  ret_from_fork_asm+0x1b/0x30
>     [ 4246.559536]  </TASK>
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-23 20:59   ` Mikulas Patocka
@ 2023-10-25  3:10     ` Marek Marczykowski-Górecki
  -1 siblings, 0 replies; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-25  3:10 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: stable, regressions, Alasdair Kergon, Mike Snitzer, dm-devel

[-- Attachment #1: Type: text/plain, Size: 10693 bytes --]

On Mon, Oct 23, 2023 at 10:59:40PM +0200, Mikulas Patocka wrote:
> Hi
> 
> It's hard to say what causes this. Perhaps dm-crypt exhausts all the 
> higher-order pages and some subsystem stalls because of it.
> 
> In drivers/md/dm-crypt.c in function crypt_alloc_buffer there is
> "unsigned int order = MAX_ORDER - 1"
> 
> What happens if you set the "order" variable to some small value, like 1, 
> 2 or 3. Does the problem go away? Could you find a threshold value 
> (something between 0 and MAX_ORDER-1) where the bug starts to appear?

With 3 and lower it seems to work, with 4 it freezes. My test is not
100% reproducible, but I've repeated it a few times and got rather
consistent results on this system (i3-1315U, NVMe).
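
For scale, an order-n allocation is 2^n contiguous pages, so with 4 KiB
pages (an assumption about this system) the threshold sits between 32 KiB
and 64 KiB requests. A tiny stand-alone helper just to print the sizes:

    #include <stdio.h>

    int main(void)
    {
            unsigned int page_kib = 4;      /* assumption: 4 KiB base pages */
            unsigned int order;

            /* order 3 (32 KiB) still works here; order 4 (64 KiB) is where
             * the freeze starts to reproduce. */
            for (order = 0; order <= 10; order++)
                    printf("order %2u -> %4u KiB\n", order, page_kib << order);
            return 0;
    }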

BTW, when trying on a much older system (a Thinkpad P52 with an i7-8750H) I
couldn't reproduce the issue at all. But OTOH, I hit it once on a system
with an i7-7600U and otherwise similar specs but a much different workload
(several more VMs accessing the data). I'm not sure if that info helps
at all...

> 
> What happens if you replace
> "pages = alloc_pages(gfp_mask"
> with
> "pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)"
> ? Does the bug go away?

In a limited test, with order restored to MAX_ORDER - 1, no, I still got
the issue, and got this via sysrq (just one task listed):

[  516.375254] sysrq: Show Blocked State
[  516.375461] task:dd              state:D stack:13072 pid:4385  ppid:4371   flags:0x00004002
[  516.375496] Call Trace:
[  516.375507]  <TASK>
[  516.375518]  __schedule+0x30e/0x8b0
[  516.375549]  schedule+0x59/0xb0
[  516.375566]  io_schedule+0x41/0x70
[  516.375582]  folio_wait_bit_common+0x12c/0x300
[  516.375603]  ? __pfx_wake_page_function+0x10/0x10
[  516.375623]  folio_wait_writeback+0x27/0x80
[  516.375639]  __filemap_fdatawait_range+0x7b/0xe0
[  516.375656]  file_write_and_wait_range+0x86/0xb0
[  516.375672]  blkdev_fsync+0x33/0x60
[  516.375693]  __x64_sys_fdatasync+0x4a/0x90
[  516.375713]  do_syscall_64+0x3c/0x90
[  516.375729]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  516.375751] RIP: 0033:0x7fd718562da4
[  516.375764] RSP: 002b:00007ffea3815278 EFLAGS: 00000202 ORIG_RAX: 000000000000004b
[  516.375784] RAX: ffffffffffffffda RBX: 0000000000015200 RCX: 00007fd718562da4
[  516.375801] RDX: 0000000000028000 RSI: 0000000000000000 RDI: 0000000000000001
[  516.375817] RBP: 00007fd71845e6c0 R08: 00000000ffffffff R09: 0000000000000000
[  516.375833] R10: 0000000000000022 R11: 0000000000000202 R12: 0000000000000000
[  516.375848] R13: 0000000000000000 R14: 0000000000028000 R15: 0000000000000000
[  516.375865]  </TASK>

(it is dd writing to LVM thin volume, where PV is on dm-crypt)

> 
> Mikulas
> 
> 
> On Sat, 21 Oct 2023, Marek Marczykowski-Górecki wrote:
> 
> > Hi,
> > 
> > Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage
> > subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
> > triggers the issue, but often it happens when doing some LVM operations
> > (lvremove, lvrename etc) on a dm-thin volume together with bulk data
> > copy to/from another LVM thin volume with ext4 fs.
> > 
> > The storage stack I use is:
> >   nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4
> > 
> > And this whole thing running in a (PV) dom0 under Xen, on Qubes OS 4.2 to be
> > specific.
> > 
> > I can reproduce the issue on at least 3 different machines. I did tried
> > also 6.5.6 and the issue is still there. I haven't checked newer
> > versions, but briefly reviewed git log and haven't found anything
> > suggesting a fix to similar issue.
> > 
> > I managed to bisect it down to this commit:
> > 
> >     commit 5054e778fcd9cd29ddaa8109077cd235527e4f94
> >     Author: Mikulas Patocka <mpatocka@redhat.com>
> >     Date:   Mon May 1 09:19:17 2023 -0400
> > 
> >     dm crypt: allocate compound pages if possible
> >     
> >     It was reported that allocating pages for the write buffer in dm-crypt
> >     causes measurable overhead [1].
> >     
> >     Change dm-crypt to allocate compound pages if they are available. If
> >     not, fall back to the mempool.
> >     
> >     [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html
> >     
> >     Suggested-by: Matthew Wilcox <willy@infradead.org>
> >     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >     Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > 
> > TBH, I'm not sure if the bug is in this commit, or maybe in some
> > functions it uses (I don't see dm-crypt functions directly involved in
> > the stack traces I collected). But reverting this commit on top of 6.5.6
> > seems to fix the issue.
> > 
> > I tried also CONFIG_PROVE_LOCKING, but it didn't show any issue.
> > 
> > I managed to collect "blocked tasks" dump via sysrq below. Few more can
> > be found at https://github.com/QubesOS/qubes-issues/issues/8575
> > 
> >     [ 4246.558313] sysrq: Show Blocked State
> >     [ 4246.558388] task:journal-offline state:D stack:0     pid:8098  ppid:1      flags:0x00000002
> >     [ 4246.558407] Call Trace:
> >     [ 4246.558414]  <TASK>
> >     [ 4246.558422]  __schedule+0x23d/0x670
> >     [ 4246.558440]  schedule+0x5e/0xd0
> >     [ 4246.558450]  io_schedule+0x46/0x70
> >     [ 4246.558461]  folio_wait_bit_common+0x13d/0x350
> >     [ 4246.558475]  ? __pfx_wake_page_function+0x10/0x10
> >     [ 4246.558488]  folio_wait_writeback+0x2c/0x90
> >     [ 4246.558498]  mpage_prepare_extent_to_map+0x15c/0x4d0
> >     [ 4246.558512]  ext4_do_writepages+0x25f/0x770
> >     [ 4246.558523]  ext4_writepages+0xad/0x180
> >     [ 4246.558533]  do_writepages+0xcf/0x1e0
> >     [ 4246.558543]  ? __seccomp_filter+0x32a/0x4f0
> >     [ 4246.558554]  filemap_fdatawrite_wbc+0x63/0x90
> >     [ 4246.558567]  __filemap_fdatawrite_range+0x5c/0x80
> >     [ 4246.558578]  file_write_and_wait_range+0x4a/0xb0
> >     [ 4246.558588]  ext4_sync_file+0x88/0x380
> >     [ 4246.558598]  __x64_sys_fsync+0x3b/0x70
> >     [ 4246.558609]  do_syscall_64+0x5c/0x90
> >     [ 4246.558621]  ? exit_to_user_mode_prepare+0xb2/0xd0
> >     [ 4246.558632]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> >     [ 4246.558644] RIP: 0033:0x7710cf124d0a
> >     [ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
> >     [ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a
> >     [ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
> >     [ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0
> >     [ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940
> >     [ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000
> >     [ 4246.558728]  </TASK>
> >     [ 4246.558836] task:lvm             state:D stack:0     pid:7835  ppid:5665   flags:0x00004006
> >     [ 4246.558852] Call Trace:
> >     [ 4246.558857]  <TASK>
> >     [ 4246.558863]  __schedule+0x23d/0x670
> >     [ 4246.558874]  schedule+0x5e/0xd0
> >     [ 4246.558884]  io_schedule+0x46/0x70
> >     [ 4246.558894]  dm_wait_for_bios_completion+0xfc/0x110
> >     [ 4246.558909]  ? __pfx_autoremove_wake_function+0x10/0x10
> >     [ 4246.558922]  __dm_suspend+0x7e/0x1b0
> >     [ 4246.558932]  dm_internal_suspend_noflush+0x5c/0x80
> >     [ 4246.558946]  pool_presuspend+0xcc/0x130 [dm_thin_pool]
> >     [ 4246.558968]  dm_table_presuspend_targets+0x3f/0x60
> >     [ 4246.558980]  __dm_suspend+0x41/0x1b0
> >     [ 4246.558991]  dm_suspend+0xc0/0xe0
> >     [ 4246.559001]  dev_suspend+0xa5/0xd0
> >     [ 4246.559011]  ctl_ioctl+0x26e/0x350
> >     [ 4246.559020]  ? __pfx_dev_suspend+0x10/0x10
> >     [ 4246.559032]  dm_ctl_ioctl+0xe/0x20
> >     [ 4246.559041]  __x64_sys_ioctl+0x94/0xd0
> >     [ 4246.559052]  do_syscall_64+0x5c/0x90
> >     [ 4246.559062]  ? do_syscall_64+0x6b/0x90
> >     [ 4246.559072]  ? do_syscall_64+0x6b/0x90
> >     [ 4246.559081]  ? xen_pv_evtchn_do_upcall+0x54/0xb0
> >     [ 4246.559093]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> >     [ 4246.559104] RIP: 0033:0x7f1cb77cfe0f
> >     [ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> >     [ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f
> >     [ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003
> >     [ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420
> >     [ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
> >     [ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0
> >     [ 4246.559199]  </TASK>
> >     [ 4246.559207] task:kworker/u8:3    state:D stack:0     pid:8033  ppid:2      flags:0x00004000
> >     [ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4)
> >     [ 4246.559238] Call Trace:
> >     [ 4246.559244]  <TASK>
> >     [ 4246.559249]  __schedule+0x23d/0x670
> >     [ 4246.559260]  schedule+0x5e/0xd0
> >     [ 4246.559270]  io_schedule+0x46/0x70
> >     [ 4246.559280]  folio_wait_bit_common+0x13d/0x350
> >     [ 4246.559290]  ? __pfx_wake_page_function+0x10/0x10
> >     [ 4246.559302]  mpage_prepare_extent_to_map+0x309/0x4d0
> >     [ 4246.559314]  ext4_do_writepages+0x25f/0x770
> >     [ 4246.559324]  ext4_writepages+0xad/0x180
> >     [ 4246.559334]  do_writepages+0xcf/0x1e0
> >     [ 4246.559344]  ? find_busiest_group+0x42/0x1a0
> >     [ 4246.559354]  __writeback_single_inode+0x3d/0x280
> >     [ 4246.559368]  writeback_sb_inodes+0x1ed/0x4a0
> >     [ 4246.559381]  __writeback_inodes_wb+0x4c/0xf0
> >     [ 4246.559393]  wb_writeback+0x298/0x310
> >     [ 4246.559403]  wb_do_writeback+0x230/0x2b0
> >     [ 4246.559414]  wb_workfn+0x5f/0x260
> >     [ 4246.559424]  ? _raw_spin_unlock+0xe/0x30
> >     [ 4246.559434]  ? finish_task_switch.isra.0+0x95/0x2b0
> >     [ 4246.559447]  ? __schedule+0x245/0x670
> >     [ 4246.559457]  process_one_work+0x1df/0x3e0
> >     [ 4246.559466]  worker_thread+0x51/0x390
> >     [ 4246.559475]  ? __pfx_worker_thread+0x10/0x10
> >     [ 4246.559484]  kthread+0xe5/0x120
> >     [ 4246.559495]  ? __pfx_kthread+0x10/0x10
> >     [ 4246.559504]  ret_from_fork+0x31/0x50
> >     [ 4246.559514]  ? __pfx_kthread+0x10/0x10
> >     [ 4246.559523]  ret_from_fork_asm+0x1b/0x30
> >     [ 4246.559536]  </TASK>
> > 
> > -- 
> > Best Regards,
> > Marek Marczykowski-Górecki
> > Invisible Things Lab
> > 


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-25  3:10     ` Marek Marczykowski-Górecki
@ 2023-10-25  3:22       ` Marek Marczykowski-Górecki
  -1 siblings, 0 replies; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-25  3:22 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: stable, regressions, Alasdair Kergon, Mike Snitzer, dm-devel

[-- Attachment #1: Type: text/plain, Size: 15326 bytes --]

On Wed, Oct 25, 2023 at 05:10:56AM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Oct 23, 2023 at 10:59:40PM +0200, Mikulas Patocka wrote:
> > Hi
> > 
> > It's hard to say what causes this. Perhaps dm-crypt exhausts all the 
> > higher-order pages and some subsystem stalls because of it.
> > 
> > In drivers/md/dm-crypt.c in function crypt_alloc_buffer there is
> > "unsigned int order = MAX_ORDER - 1"
> > 
> > What happens if you set the "order" variable to some small value, like 1, 
> > 2 or 3. Does the problem go away? Could you find a threshold value 
> > (something between 0 and MAX_ORDER-1) where the bug starts to appear?
> 
> With 3 and lower it seems to work, with 4 it freezes. My test is not
> 100% reproducible, but I've repeated it a few times and got rather
> consistent results on this system (i3-1315U, NVMe).
> 
> BTW, when trying on a much older system (Thinkpad P52 with i7-8750H) I
> couldn't reproduce the issue at all. But OTOH, I hit it once on a system
> with i7-7600U and otherwise similar specs but much different workload
> (several more VMs accessing the data). I'm not sure if that info helps
> at all...
> 
> > 
> > What happens if you replace
> > "pages = alloc_pages(gfp_mask"
> > with
> > "pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)"
> > ? Does the bug go away?
> 
> In a limited test, with order restored to MAX_ORDER - 1, no, still got
> the issue, and got this via sysrq (just one task listed):
> 
> [  516.375254] sysrq: Show Blocked State
> [  516.375461] task:dd              state:D stack:13072 pid:4385  ppid:4371   flags:0x00004002
> [  516.375496] Call Trace:
> [  516.375507]  <TASK>
> [  516.375518]  __schedule+0x30e/0x8b0
> [  516.375549]  schedule+0x59/0xb0
> [  516.375566]  io_schedule+0x41/0x70
> [  516.375582]  folio_wait_bit_common+0x12c/0x300
> [  516.375603]  ? __pfx_wake_page_function+0x10/0x10
> [  516.375623]  folio_wait_writeback+0x27/0x80
> [  516.375639]  __filemap_fdatawait_range+0x7b/0xe0
> [  516.375656]  file_write_and_wait_range+0x86/0xb0
> [  516.375672]  blkdev_fsync+0x33/0x60
> [  516.375693]  __x64_sys_fdatasync+0x4a/0x90
> [  516.375713]  do_syscall_64+0x3c/0x90
> [  516.375729]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [  516.375751] RIP: 0033:0x7fd718562da4
> [  516.375764] RSP: 002b:00007ffea3815278 EFLAGS: 00000202 ORIG_RAX: 000000000000004b
> [  516.375784] RAX: ffffffffffffffda RBX: 0000000000015200 RCX: 00007fd718562da4
> [  516.375801] RDX: 0000000000028000 RSI: 0000000000000000 RDI: 0000000000000001
> [  516.375817] RBP: 00007fd71845e6c0 R08: 00000000ffffffff R09: 0000000000000000
> [  516.375833] R10: 0000000000000022 R11: 0000000000000202 R12: 0000000000000000
> [  516.375848] R13: 0000000000000000 R14: 0000000000028000 R15: 0000000000000000
> [  516.375865]  </TASK>
> 
> (it is dd writing to LVM thin volume, where PV is on dm-crypt)

In this bad state, I extracted some more info on memory:

/proc/meminfo:

MemTotal:        3983088 kB
MemFree:          349264 kB
MemAvailable:    3474744 kB
Buffers:         1511732 kB
Cached:          1608132 kB
SwapCached:           12 kB
Active:           203136 kB
Inactive:        3147696 kB
Active(anon):       1288 kB
Inactive(anon):   252128 kB
Active(file):     201848 kB
Inactive(file):  2895568 kB
Unevictable:       46064 kB
Mlocked:           39920 kB
SwapTotal:       4112380 kB
SwapFree:        4112124 kB
Dirty:                 0 kB
Writeback:          8556 kB
AnonPages:        277020 kB
Mapped:           137424 kB
Shmem:             13792 kB
KReclaimable:      91728 kB
Slab:             182428 kB
SReclaimable:      91728 kB
SUnreclaim:        90700 kB
KernelStack:        5776 kB
PageTables:         7480 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6103924 kB
Committed_AS:     990924 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       11560 kB
VmallocChunk:          0 kB
Percpu:             2528 kB
DirectMap4k:     4325764 kB
DirectMap2M:           0 kB

and also:
[ 1168.537096] sysrq: Show Memory
[ 1168.537192] Mem-Info:
[ 1168.537206] active_anon:322 inactive_anon:63032 isolated_anon:0
[ 1168.537206]  active_file:50462 inactive_file:723892 isolated_file:0
[ 1168.537206]  unevictable:11516 dirty:1 writeback:2139
[ 1168.537206]  slab_reclaimable:22932 slab_unreclaimable:22675
[ 1168.537206]  mapped:34357 shmem:3448 pagetables:1870
[ 1168.537206]  sec_pagetables:0 bounce:0
[ 1168.537206]  kernel_misc_reclaimable:0
[ 1168.537206]  free:87499 free_pcp:1642 free_cma:0
[ 1168.537279] Node 0 active_anon:1288kB inactive_anon:252128kB active_file:201848kB inactive_file:2895568kB unevictable:46064kB isolated(anon):0kB isolated(file):0kB mapped:137428kB dirty:4kB writeback:8556kB shmem:13792kB writeback_tmp:0kB kernel_stack:5776kB pagetables:7480kB sec_pagetables:0kB all_unreclaimable? no
[ 1168.537332] Node 0 DMA free:15488kB boost:0kB min:32kB low:44kB high:56kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:352kB unevictable:0kB writepending:84kB present:15996kB managed:15936kB mlocked:0kB bounce:0kB free_pcp:12kB local_pcp:0kB free_cma:0kB
[ 1168.537387] lowmem_reserve[]: 0 1728 3857 3857
[ 1168.537409] Node 0 DMA32 free:155300kB boost:0kB min:3552kB low:5320kB high:7088kB reserved_highatomic:0KB active_anon:16kB inactive_anon:3368kB active_file:5660kB inactive_file:1585120kB unevictable:0kB writepending:7576kB present:1924360kB managed:1803376kB mlocked:0kB bounce:0kB free_pcp:3184kB local_pcp:864kB free_cma:0kB
[ 1168.537464] lowmem_reserve[]: 0 0 2129 2129
[ 1168.537483] Node 0 Normal free:179208kB boost:0kB min:4376kB low:6556kB high:8736kB reserved_highatomic:2048KB active_anon:1272kB inactive_anon:248760kB active_file:196104kB inactive_file:1310096kB unevictable:46064kB writepending:900kB present:2253948kB managed:2163776kB mlocked:39920kB bounce:0kB free_pcp:3372kB local_pcp:632kB free_cma:0kB
[ 1168.537540] lowmem_reserve[]: 0 0 0 0
[ 1168.537557] Node 0 DMA: 4*4kB (UM) 4*8kB (UM) 3*16kB (UM) 3*32kB (U) 3*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (M) = 15488kB
[ 1168.537624] Node 0 DMA32: 1945*4kB (UME) 1446*8kB (UME) 953*16kB (UME) 1410*32kB (UM) 297*64kB (UM) 162*128kB (UM) 60*256kB (UM) 22*512kB (UM) 7*1024kB (UM) 1*2048kB (U) 0*4096kB = 155300kB
[ 1168.537695] Node 0 Normal: 3146*4kB (UMEH) 2412*8kB (UMEH) 1444*16kB (UMEH) 1138*32kB (UMEH) 486*64kB (UME) 97*128kB (UM) 13*256kB (UM) 8*512kB (M) 2*1024kB (M) 5*2048kB (M) 6*4096kB (UM) = 179208kB
[ 1168.537767] 779973 total pagecache pages
[ 1168.537778] 3 pages in swap cache
[ 1168.537788] Free swap  = 4112124kB
[ 1168.537798] Total swap = 4112380kB
[ 1168.537807] 1048576 pages RAM
[ 1168.537817] 0 pages HighMem/MovableOnly
[ 1168.537827] 52804 pages reserved
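
As an aside, the per-order free counts above can also be watched over time
via /proc/buddyinfo while reproducing, to see whether higher-order pages
actually run out. A minimal helper along these lines (purely illustrative;
not something that was used for this report) prints them with order labels:

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];

        if (!f) {
            perror("/proc/buddyinfo");
            return 1;
        }
        /* Lines look like: "Node 0, zone   Normal   3146 2412 1444 ..." */
        while (fgets(line, sizeof(line), f)) {
            char node[32], zone[32];
            unsigned long c[11];
            int n, order;

            n = sscanf(line,
                       "Node %31[^,], zone %31s %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                       node, zone, &c[0], &c[1], &c[2], &c[3], &c[4],
                       &c[5], &c[6], &c[7], &c[8], &c[9], &c[10]);
            if (n < 3)
                continue;
            printf("Node %s zone %-8s:", node, zone);
            for (order = 0; order < n - 2; order++)
                printf(" order%d=%lu", order, c[order]);
            printf("\n");
        }
        fclose(f);
        return 0;
    }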


> 
> > 
> > Mikulas
> > 
> > 
> > On Sat, 21 Oct 2023, Marek Marczykowski-Górecki wrote:
> > 
> > > Hi,
> > > 
> > > Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage
> > > subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
> > > triggers the issue, but often it happens when doing some LVM operations
> > > (lvremove, lvrename etc) on a dm-thin volume together with bulk data
> > > copy to/from another LVM thin volume with ext4 fs.
> > > 
> > > The storage stack I use is:
> > >   nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4
> > > 
> > > And this whole thing running in a (PV) dom0 under Xen, on Qubes OS 4.2 to be
> > > specific.
> > > 
> > > I can reproduce the issue on at least 3 different machines. I did tried
> > > also 6.5.6 and the issue is still there. I haven't checked newer
> > > versions, but briefly reviewed git log and haven't found anything
> > > suggesting a fix to similar issue.
> > > 
> > > I managed to bisect it down to this commit:
> > > 
> > >     commit 5054e778fcd9cd29ddaa8109077cd235527e4f94
> > >     Author: Mikulas Patocka <mpatocka@redhat.com>
> > >     Date:   Mon May 1 09:19:17 2023 -0400
> > > 
> > >     dm crypt: allocate compound pages if possible
> > >     
> > >     It was reported that allocating pages for the write buffer in dm-crypt
> > >     causes measurable overhead [1].
> > >     
> > >     Change dm-crypt to allocate compound pages if they are available. If
> > >     not, fall back to the mempool.
> > >     
> > >     [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html
> > >     
> > >     Suggested-by: Matthew Wilcox <willy@infradead.org>
> > >     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > >     Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > 
> > > TBH, I'm not sure if the bug is in this commit, or maybe in some
> > > functions it uses (I don't see dm-crypt functions directly involved in
> > > the stack traces I collected). But reverting this commit on top of 6.5.6
> > > seems to fix the issue.
> > > 
> > > I tried also CONFIG_PROVE_LOCKING, but it didn't show any issue.
> > > 
> > > I managed to collect "blocked tasks" dump via sysrq below. Few more can
> > > be found at https://github.com/QubesOS/qubes-issues/issues/8575
> > > 
> > >     [ 4246.558313] sysrq: Show Blocked State
> > >     [ 4246.558388] task:journal-offline state:D stack:0     pid:8098  ppid:1      flags:0x00000002
> > >     [ 4246.558407] Call Trace:
> > >     [ 4246.558414]  <TASK>
> > >     [ 4246.558422]  __schedule+0x23d/0x670
> > >     [ 4246.558440]  schedule+0x5e/0xd0
> > >     [ 4246.558450]  io_schedule+0x46/0x70
> > >     [ 4246.558461]  folio_wait_bit_common+0x13d/0x350
> > >     [ 4246.558475]  ? __pfx_wake_page_function+0x10/0x10
> > >     [ 4246.558488]  folio_wait_writeback+0x2c/0x90
> > >     [ 4246.558498]  mpage_prepare_extent_to_map+0x15c/0x4d0
> > >     [ 4246.558512]  ext4_do_writepages+0x25f/0x770
> > >     [ 4246.558523]  ext4_writepages+0xad/0x180
> > >     [ 4246.558533]  do_writepages+0xcf/0x1e0
> > >     [ 4246.558543]  ? __seccomp_filter+0x32a/0x4f0
> > >     [ 4246.558554]  filemap_fdatawrite_wbc+0x63/0x90
> > >     [ 4246.558567]  __filemap_fdatawrite_range+0x5c/0x80
> > >     [ 4246.558578]  file_write_and_wait_range+0x4a/0xb0
> > >     [ 4246.558588]  ext4_sync_file+0x88/0x380
> > >     [ 4246.558598]  __x64_sys_fsync+0x3b/0x70
> > >     [ 4246.558609]  do_syscall_64+0x5c/0x90
> > >     [ 4246.558621]  ? exit_to_user_mode_prepare+0xb2/0xd0
> > >     [ 4246.558632]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > >     [ 4246.558644] RIP: 0033:0x7710cf124d0a
> > >     [ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
> > >     [ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a
> > >     [ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
> > >     [ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0
> > >     [ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940
> > >     [ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000
> > >     [ 4246.558728]  </TASK>
> > >     [ 4246.558836] task:lvm             state:D stack:0     pid:7835  ppid:5665   flags:0x00004006
> > >     [ 4246.558852] Call Trace:
> > >     [ 4246.558857]  <TASK>
> > >     [ 4246.558863]  __schedule+0x23d/0x670
> > >     [ 4246.558874]  schedule+0x5e/0xd0
> > >     [ 4246.558884]  io_schedule+0x46/0x70
> > >     [ 4246.558894]  dm_wait_for_bios_completion+0xfc/0x110
> > >     [ 4246.558909]  ? __pfx_autoremove_wake_function+0x10/0x10
> > >     [ 4246.558922]  __dm_suspend+0x7e/0x1b0
> > >     [ 4246.558932]  dm_internal_suspend_noflush+0x5c/0x80
> > >     [ 4246.558946]  pool_presuspend+0xcc/0x130 [dm_thin_pool]
> > >     [ 4246.558968]  dm_table_presuspend_targets+0x3f/0x60
> > >     [ 4246.558980]  __dm_suspend+0x41/0x1b0
> > >     [ 4246.558991]  dm_suspend+0xc0/0xe0
> > >     [ 4246.559001]  dev_suspend+0xa5/0xd0
> > >     [ 4246.559011]  ctl_ioctl+0x26e/0x350
> > >     [ 4246.559020]  ? __pfx_dev_suspend+0x10/0x10
> > >     [ 4246.559032]  dm_ctl_ioctl+0xe/0x20
> > >     [ 4246.559041]  __x64_sys_ioctl+0x94/0xd0
> > >     [ 4246.559052]  do_syscall_64+0x5c/0x90
> > >     [ 4246.559062]  ? do_syscall_64+0x6b/0x90
> > >     [ 4246.559072]  ? do_syscall_64+0x6b/0x90
> > >     [ 4246.559081]  ? xen_pv_evtchn_do_upcall+0x54/0xb0
> > >     [ 4246.559093]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > >     [ 4246.559104] RIP: 0033:0x7f1cb77cfe0f
> > >     [ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > >     [ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f
> > >     [ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003
> > >     [ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420
> > >     [ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
> > >     [ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0
> > >     [ 4246.559199]  </TASK>
> > >     [ 4246.559207] task:kworker/u8:3    state:D stack:0     pid:8033  ppid:2      flags:0x00004000
> > >     [ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4)
> > >     [ 4246.559238] Call Trace:
> > >     [ 4246.559244]  <TASK>
> > >     [ 4246.559249]  __schedule+0x23d/0x670
> > >     [ 4246.559260]  schedule+0x5e/0xd0
> > >     [ 4246.559270]  io_schedule+0x46/0x70
> > >     [ 4246.559280]  folio_wait_bit_common+0x13d/0x350
> > >     [ 4246.559290]  ? __pfx_wake_page_function+0x10/0x10
> > >     [ 4246.559302]  mpage_prepare_extent_to_map+0x309/0x4d0
> > >     [ 4246.559314]  ext4_do_writepages+0x25f/0x770
> > >     [ 4246.559324]  ext4_writepages+0xad/0x180
> > >     [ 4246.559334]  do_writepages+0xcf/0x1e0
> > >     [ 4246.559344]  ? find_busiest_group+0x42/0x1a0
> > >     [ 4246.559354]  __writeback_single_inode+0x3d/0x280
> > >     [ 4246.559368]  writeback_sb_inodes+0x1ed/0x4a0
> > >     [ 4246.559381]  __writeback_inodes_wb+0x4c/0xf0
> > >     [ 4246.559393]  wb_writeback+0x298/0x310
> > >     [ 4246.559403]  wb_do_writeback+0x230/0x2b0
> > >     [ 4246.559414]  wb_workfn+0x5f/0x260
> > >     [ 4246.559424]  ? _raw_spin_unlock+0xe/0x30
> > >     [ 4246.559434]  ? finish_task_switch.isra.0+0x95/0x2b0
> > >     [ 4246.559447]  ? __schedule+0x245/0x670
> > >     [ 4246.559457]  process_one_work+0x1df/0x3e0
> > >     [ 4246.559466]  worker_thread+0x51/0x390
> > >     [ 4246.559475]  ? __pfx_worker_thread+0x10/0x10
> > >     [ 4246.559484]  kthread+0xe5/0x120
> > >     [ 4246.559495]  ? __pfx_kthread+0x10/0x10
> > >     [ 4246.559504]  ret_from_fork+0x31/0x50
> > >     [ 4246.559514]  ? __pfx_kthread+0x10/0x10
> > >     [ 4246.559523]  ret_from_fork_asm+0x1b/0x30
> > >     [ 4246.559536]  </TASK>
> > > 
> > > -- 
> > > Best Regards,
> > > Marek Marczykowski-Górecki
> > > Invisible Things Lab
> > > 
> 
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
@ 2023-10-25  3:22       ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-25  3:22 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, Alasdair Kergon, Mike Snitzer, regressions, stable

[-- Attachment #1: Type: text/plain, Size: 15326 bytes --]

On Wed, Oct 25, 2023 at 05:10:56AM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Oct 23, 2023 at 10:59:40PM +0200, Mikulas Patocka wrote:
> > Hi
> > 
> > It's hard to say what causes this. Perhaps dm-crypt exhausts all the 
> > higher-order pages and some subsystem stalls because of it.
> > 
> > In drivers/md/dm-crypt.c in function crypt_alloc_buffer there is
> > "unsigned int order = MAX_ORDER - 1"
> > 
> > What happens if you set the "order" variable to some small value, like 1, 
> > 2 or 3. Does the problem go away? Could you find a threshold value 
> > (something between 0 and MAX_ORDER-1) where the bug starts to appear?
> 
> With 3 and lower it seems to work, with 4 it freezes. My test is not
> 100% reproducible, but I've repeated it a few times and got rather
> consistent results on this system (i3-1315U, NVMe).
> 
> BTW, when trying on a much older system (Thinkpad P52 with i7-8750H) I
> couldn't reproduce the issue at all. But OTOH, I hit it once on a system
> with i7-7600U and otherwise similar specs but much different workload
> (several more VMs accessing the data). I'm not sure if that info helps
> at all...
> 
> > 
> > What happens if you replace
> > "pages = alloc_pages(gfp_mask"
> > with
> > "pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)"
> > ? Does the bug go away?
> 
> In a limited test, with order restored to MAX_ORDER - 1, no, still got
> the issue, and got this via sysrq (just one task listed):
> 
> [  516.375254] sysrq: Show Blocked State
> [  516.375461] task:dd              state:D stack:13072 pid:4385  ppid:4371   flags:0x00004002
> [  516.375496] Call Trace:
> [  516.375507]  <TASK>
> [  516.375518]  __schedule+0x30e/0x8b0
> [  516.375549]  schedule+0x59/0xb0
> [  516.375566]  io_schedule+0x41/0x70
> [  516.375582]  folio_wait_bit_common+0x12c/0x300
> [  516.375603]  ? __pfx_wake_page_function+0x10/0x10
> [  516.375623]  folio_wait_writeback+0x27/0x80
> [  516.375639]  __filemap_fdatawait_range+0x7b/0xe0
> [  516.375656]  file_write_and_wait_range+0x86/0xb0
> [  516.375672]  blkdev_fsync+0x33/0x60
> [  516.375693]  __x64_sys_fdatasync+0x4a/0x90
> [  516.375713]  do_syscall_64+0x3c/0x90
> [  516.375729]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [  516.375751] RIP: 0033:0x7fd718562da4
> [  516.375764] RSP: 002b:00007ffea3815278 EFLAGS: 00000202 ORIG_RAX: 000000000000004b
> [  516.375784] RAX: ffffffffffffffda RBX: 0000000000015200 RCX: 00007fd718562da4
> [  516.375801] RDX: 0000000000028000 RSI: 0000000000000000 RDI: 0000000000000001
> [  516.375817] RBP: 00007fd71845e6c0 R08: 00000000ffffffff R09: 0000000000000000
> [  516.375833] R10: 0000000000000022 R11: 0000000000000202 R12: 0000000000000000
> [  516.375848] R13: 0000000000000000 R14: 0000000000028000 R15: 0000000000000000
> [  516.375865]  </TASK>
> 
> (it is dd writing to LVM thin volume, where PV is on dm-crypt)

In this bad state, I extracted some more info on memory:

/proc/meminfo:

MemTotal:        3983088 kB
MemFree:          349264 kB
MemAvailable:    3474744 kB
Buffers:         1511732 kB
Cached:          1608132 kB
SwapCached:           12 kB
Active:           203136 kB
Inactive:        3147696 kB
Active(anon):       1288 kB
Inactive(anon):   252128 kB
Active(file):     201848 kB
Inactive(file):  2895568 kB
Unevictable:       46064 kB
Mlocked:           39920 kB
SwapTotal:       4112380 kB
SwapFree:        4112124 kB
Dirty:                 0 kB
Writeback:          8556 kB
AnonPages:        277020 kB
Mapped:           137424 kB
Shmem:             13792 kB
KReclaimable:      91728 kB
Slab:             182428 kB
SReclaimable:      91728 kB
SUnreclaim:        90700 kB
KernelStack:        5776 kB
PageTables:         7480 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6103924 kB
Committed_AS:     990924 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       11560 kB
VmallocChunk:          0 kB
Percpu:             2528 kB
DirectMap4k:     4325764 kB
DirectMap2M:           0 kB

and also:
[ 1168.537096] sysrq: Show Memory
[ 1168.537192] Mem-Info:
[ 1168.537206] active_anon:322 inactive_anon:63032 isolated_anon:0
[ 1168.537206]  active_file:50462 inactive_file:723892 isolated_file:0
[ 1168.537206]  unevictable:11516 dirty:1 writeback:2139
[ 1168.537206]  slab_reclaimable:22932 slab_unreclaimable:22675
[ 1168.537206]  mapped:34357 shmem:3448 pagetables:1870
[ 1168.537206]  sec_pagetables:0 bounce:0
[ 1168.537206]  kernel_misc_reclaimable:0
[ 1168.537206]  free:87499 free_pcp:1642 free_cma:0
[ 1168.537279] Node 0 active_anon:1288kB inactive_anon:252128kB active_file:201848kB inactive_file:2895568kB unevictable:46064kB isolated(anon):0kB isolated(file):0kB mapped:137428kB dirty:4kB writeback:8556kB shmem:13792kB writeback_tmp:0kB kernel_stack:5776kB pagetables:7480kB sec_pagetables:0kB all_unreclaimable? no
[ 1168.537332] Node 0 DMA free:15488kB boost:0kB min:32kB low:44kB high:56kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:352kB unevictable:0kB writepending:84kB present:15996kB managed:15936kB mlocked:0kB bounce:0kB free_pcp:12kB local_pcp:0kB free_cma:0kB
[ 1168.537387] lowmem_reserve[]: 0 1728 3857 3857
[ 1168.537409] Node 0 DMA32 free:155300kB boost:0kB min:3552kB low:5320kB high:7088kB reserved_highatomic:0KB active_anon:16kB inactive_anon:3368kB active_file:5660kB inactive_file:1585120kB unevictable:0kB writepending:7576kB present:1924360kB managed:1803376kB mlocked:0kB bounce:0kB free_pcp:3184kB local_pcp:864kB free_cma:0kB
[ 1168.537464] lowmem_reserve[]: 0 0 2129 2129
[ 1168.537483] Node 0 Normal free:179208kB boost:0kB min:4376kB low:6556kB high:8736kB reserved_highatomic:2048KB active_anon:1272kB inactive_anon:248760kB active_file:196104kB inactive_file:1310096kB unevictable:46064kB writepending:900kB present:2253948kB managed:2163776kB mlocked:39920kB bounce:0kB free_pcp:3372kB local_pcp:632kB free_cma:0kB
[ 1168.537540] lowmem_reserve[]: 0 0 0 0
[ 1168.537557] Node 0 DMA: 4*4kB (UM) 4*8kB (UM) 3*16kB (UM) 3*32kB (U) 3*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (M) = 15488kB
[ 1168.537624] Node 0 DMA32: 1945*4kB (UME) 1446*8kB (UME) 953*16kB (UME) 1410*32kB (UM) 297*64kB (UM) 162*128kB (UM) 60*256kB (UM) 22*512kB (UM) 7*1024kB (UM) 1*2048kB (U) 0*4096kB = 155300kB
[ 1168.537695] Node 0 Normal: 3146*4kB (UMEH) 2412*8kB (UMEH) 1444*16kB (UMEH) 1138*32kB (UMEH) 486*64kB (UME) 97*128kB (UM) 13*256kB (UM) 8*512kB (M) 2*1024kB (M) 5*2048kB (M) 6*4096kB (UM) = 179208kB
[ 1168.537767] 779973 total pagecache pages
[ 1168.537778] 3 pages in swap cache
[ 1168.537788] Free swap  = 4112124kB
[ 1168.537798] Total swap = 4112380kB
[ 1168.537807] 1048576 pages RAM
[ 1168.537817] 0 pages HighMem/MovableOnly
[ 1168.537827] 52804 pages reserved


> 
> > 
> > Mikulas
> > 
> > 
> > On Sat, 21 Oct 2023, Marek Marczykowski-Górecki wrote:
> > 
> > > Hi,
> > > 
> > > Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage
> > > subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
> > > triggers the issue, but often it happens when doing some LVM operations
> > > (lvremove, lvrename etc) on a dm-thin volume together with bulk data
> > > copy to/from another LVM thin volume with ext4 fs.
> > > 
> > > The storage stack I use is:
> > >   nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4
> > > 
> > > And this whole thing running in a (PV) dom0 under Xen, on Qubes OS 4.2 to be
> > > specific.
> > > 
> > > I can reproduce the issue on at least 3 different machines. I did tried
> > > also 6.5.6 and the issue is still there. I haven't checked newer
> > > versions, but briefly reviewed git log and haven't found anything
> > > suggesting a fix to similar issue.
> > > 
> > > I managed to bisect it down to this commit:
> > > 
> > >     commit 5054e778fcd9cd29ddaa8109077cd235527e4f94
> > >     Author: Mikulas Patocka <mpatocka@redhat.com>
> > >     Date:   Mon May 1 09:19:17 2023 -0400
> > > 
> > >     dm crypt: allocate compound pages if possible
> > >     
> > >     It was reported that allocating pages for the write buffer in dm-crypt
> > >     causes measurable overhead [1].
> > >     
> > >     Change dm-crypt to allocate compound pages if they are available. If
> > >     not, fall back to the mempool.
> > >     
> > >     [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html
> > >     
> > >     Suggested-by: Matthew Wilcox <willy@infradead.org>
> > >     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > >     Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > 
> > > TBH, I'm not sure if the bug is in this commit, or maybe in some
> > > functions it uses (I don't see dm-crypt functions directly involved in
> > > the stack traces I collected). But reverting this commit on top of 6.5.6
> > > seems to fix the issue.
> > > 
> > > I tried also CONFIG_PROVE_LOCKING, but it didn't show any issue.
> > > 
> > > I managed to collect "blocked tasks" dump via sysrq below. Few more can
> > > be found at https://github.com/QubesOS/qubes-issues/issues/8575
> > > 
> > >     [ 4246.558313] sysrq: Show Blocked State
> > >     [ 4246.558388] task:journal-offline state:D stack:0     pid:8098  ppid:1      flags:0x00000002
> > >     [ 4246.558407] Call Trace:
> > >     [ 4246.558414]  <TASK>
> > >     [ 4246.558422]  __schedule+0x23d/0x670
> > >     [ 4246.558440]  schedule+0x5e/0xd0
> > >     [ 4246.558450]  io_schedule+0x46/0x70
> > >     [ 4246.558461]  folio_wait_bit_common+0x13d/0x350
> > >     [ 4246.558475]  ? __pfx_wake_page_function+0x10/0x10
> > >     [ 4246.558488]  folio_wait_writeback+0x2c/0x90
> > >     [ 4246.558498]  mpage_prepare_extent_to_map+0x15c/0x4d0
> > >     [ 4246.558512]  ext4_do_writepages+0x25f/0x770
> > >     [ 4246.558523]  ext4_writepages+0xad/0x180
> > >     [ 4246.558533]  do_writepages+0xcf/0x1e0
> > >     [ 4246.558543]  ? __seccomp_filter+0x32a/0x4f0
> > >     [ 4246.558554]  filemap_fdatawrite_wbc+0x63/0x90
> > >     [ 4246.558567]  __filemap_fdatawrite_range+0x5c/0x80
> > >     [ 4246.558578]  file_write_and_wait_range+0x4a/0xb0
> > >     [ 4246.558588]  ext4_sync_file+0x88/0x380
> > >     [ 4246.558598]  __x64_sys_fsync+0x3b/0x70
> > >     [ 4246.558609]  do_syscall_64+0x5c/0x90
> > >     [ 4246.558621]  ? exit_to_user_mode_prepare+0xb2/0xd0
> > >     [ 4246.558632]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > >     [ 4246.558644] RIP: 0033:0x7710cf124d0a
> > >     [ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
> > >     [ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a
> > >     [ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
> > >     [ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0
> > >     [ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940
> > >     [ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000
> > >     [ 4246.558728]  </TASK>
> > >     [ 4246.558836] task:lvm             state:D stack:0     pid:7835  ppid:5665   flags:0x00004006
> > >     [ 4246.558852] Call Trace:
> > >     [ 4246.558857]  <TASK>
> > >     [ 4246.558863]  __schedule+0x23d/0x670
> > >     [ 4246.558874]  schedule+0x5e/0xd0
> > >     [ 4246.558884]  io_schedule+0x46/0x70
> > >     [ 4246.558894]  dm_wait_for_bios_completion+0xfc/0x110
> > >     [ 4246.558909]  ? __pfx_autoremove_wake_function+0x10/0x10
> > >     [ 4246.558922]  __dm_suspend+0x7e/0x1b0
> > >     [ 4246.558932]  dm_internal_suspend_noflush+0x5c/0x80
> > >     [ 4246.558946]  pool_presuspend+0xcc/0x130 [dm_thin_pool]
> > >     [ 4246.558968]  dm_table_presuspend_targets+0x3f/0x60
> > >     [ 4246.558980]  __dm_suspend+0x41/0x1b0
> > >     [ 4246.558991]  dm_suspend+0xc0/0xe0
> > >     [ 4246.559001]  dev_suspend+0xa5/0xd0
> > >     [ 4246.559011]  ctl_ioctl+0x26e/0x350
> > >     [ 4246.559020]  ? __pfx_dev_suspend+0x10/0x10
> > >     [ 4246.559032]  dm_ctl_ioctl+0xe/0x20
> > >     [ 4246.559041]  __x64_sys_ioctl+0x94/0xd0
> > >     [ 4246.559052]  do_syscall_64+0x5c/0x90
> > >     [ 4246.559062]  ? do_syscall_64+0x6b/0x90
> > >     [ 4246.559072]  ? do_syscall_64+0x6b/0x90
> > >     [ 4246.559081]  ? xen_pv_evtchn_do_upcall+0x54/0xb0
> > >     [ 4246.559093]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > >     [ 4246.559104] RIP: 0033:0x7f1cb77cfe0f
> > >     [ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > >     [ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f
> > >     [ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003
> > >     [ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420
> > >     [ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
> > >     [ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0
> > >     [ 4246.559199]  </TASK>
> > >     [ 4246.559207] task:kworker/u8:3    state:D stack:0     pid:8033  ppid:2      flags:0x00004000
> > >     [ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4)
> > >     [ 4246.559238] Call Trace:
> > >     [ 4246.559244]  <TASK>
> > >     [ 4246.559249]  __schedule+0x23d/0x670
> > >     [ 4246.559260]  schedule+0x5e/0xd0
> > >     [ 4246.559270]  io_schedule+0x46/0x70
> > >     [ 4246.559280]  folio_wait_bit_common+0x13d/0x350
> > >     [ 4246.559290]  ? __pfx_wake_page_function+0x10/0x10
> > >     [ 4246.559302]  mpage_prepare_extent_to_map+0x309/0x4d0
> > >     [ 4246.559314]  ext4_do_writepages+0x25f/0x770
> > >     [ 4246.559324]  ext4_writepages+0xad/0x180
> > >     [ 4246.559334]  do_writepages+0xcf/0x1e0
> > >     [ 4246.559344]  ? find_busiest_group+0x42/0x1a0
> > >     [ 4246.559354]  __writeback_single_inode+0x3d/0x280
> > >     [ 4246.559368]  writeback_sb_inodes+0x1ed/0x4a0
> > >     [ 4246.559381]  __writeback_inodes_wb+0x4c/0xf0
> > >     [ 4246.559393]  wb_writeback+0x298/0x310
> > >     [ 4246.559403]  wb_do_writeback+0x230/0x2b0
> > >     [ 4246.559414]  wb_workfn+0x5f/0x260
> > >     [ 4246.559424]  ? _raw_spin_unlock+0xe/0x30
> > >     [ 4246.559434]  ? finish_task_switch.isra.0+0x95/0x2b0
> > >     [ 4246.559447]  ? __schedule+0x245/0x670
> > >     [ 4246.559457]  process_one_work+0x1df/0x3e0
> > >     [ 4246.559466]  worker_thread+0x51/0x390
> > >     [ 4246.559475]  ? __pfx_worker_thread+0x10/0x10
> > >     [ 4246.559484]  kthread+0xe5/0x120
> > >     [ 4246.559495]  ? __pfx_kthread+0x10/0x10
> > >     [ 4246.559504]  ret_from_fork+0x31/0x50
> > >     [ 4246.559514]  ? __pfx_kthread+0x10/0x10
> > >     [ 4246.559523]  ret_from_fork_asm+0x1b/0x30
> > >     [ 4246.559536]  </TASK>
> > > 
> > > -- 
> > > Best Regards,
> > > Marek Marczykowski-Górecki
> > > Invisible Things Lab
> > > 
> 
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-25  3:22       ` Marek Marczykowski-Górecki
  (?)
@ 2023-10-25 10:13       ` Mikulas Patocka
  2023-10-27 17:32         ` Mikulas Patocka
  2023-10-29 20:02         ` Vlastimil Babka
  -1 siblings, 2 replies; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-25 10:13 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Andrew Morton, Matthew Wilcox,
	Michal Hocko, Vlastimil Babka
  Cc: stable, regressions, Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 16457 bytes --]

So, I'm forwarding this to the memory management maintainers.

What do you think? The problem is that when dm-crypt allocates pages 
with order > 3 (PAGE_ALLOC_COSTLY_ORDER), the system occasionally freezes 
waiting on writeback.

dm-crypt allocates the pages with GFP_NOWAIT | __GFP_HIGHMEM | 
__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP, so it 
shouldn't put any pressure on the system. If the allocation fails, it 
falls back to a smaller-order allocation or to the mempool as a last resort.
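
For readers following along, the pattern just described is roughly the
following. This is a minimal sketch, not the actual crypt_alloc_buffer()
code; alloc_write_pages() is a made-up helper name used only for
illustration:

    /*
     * Sketch of the allocation strategy under discussion: try a large
     * compound allocation with flags that forbid reclaim, retries and
     * warnings, step the order down on failure, and only use the
     * reserved mempool as the last resort.
     */
    #include <linux/gfp.h>
    #include <linux/mempool.h>

    static struct page *alloc_write_pages(unsigned int max_order,
                                          unsigned int *order_out,
                                          mempool_t *page_pool)
    {
        gfp_t gfp = GFP_NOWAIT | __GFP_HIGHMEM | __GFP_NOMEMALLOC |
                    __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP;
        unsigned int order;
        struct page *pages;

        /* Try progressively smaller compound allocations first. */
        for (order = max_order; order > 0; order--) {
            pages = alloc_pages(gfp, order);
            if (pages) {
                *order_out = order;
                return pages;
            }
        }

        /* Last resort: single pages from the reserved mempool. */
        *order_out = 0;
        return mempool_alloc(page_pool, GFP_NOIO);
    }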

When the freeze happens, there is "349264kB" free memory - so the system 
is not short on memory.

Should we restrict the dm-crypt allocation size to 
PAGE_ALLOC_COSTLY_ORDER? Or is it a bug somewhere in the memory management 
system that needs to be fixed there?

Mikulas


On Wed, 25 Oct 2023, Marek Marczykowski-Górecki wrote:

> On Wed, Oct 25, 2023 at 05:10:56AM +0200, Marek Marczykowski-Górecki wrote:
> > On Mon, Oct 23, 2023 at 10:59:40PM +0200, Mikulas Patocka wrote:
> > > Hi
> > > 
> > > It's hard to say what causes this. Perhaps dm-crypt exhausts all the 
> > > higher-order pages and some subsystem stalls because of it.
> > > 
> > > In drivers/md/dm-crypt.c in function crypt_alloc_buffer there is
> > > "unsigned int order = MAX_ORDER - 1"
> > > 
> > > What happens if you set the "order" variable to some small value, like 1, 
> > > 2 or 3. Does the problem go away? Could you find a threshold value 
> > > (something between 0 and MAX_ORDER-1) where the bug starts to appear?
> > 
> > With 3 and lower it seems to work, with 4 it freezes. My test is not
> > 100% reproducible, but I've repeated it a few times and got rather
> > consistent results on this system (i3-1315U, NVMe).
> > 
> > BTW, when trying on a much older system (Thinkpad P52 with i7-8750H) I
> > couldn't reproduce the issue at all. But OTOH, I hit it once on a system
> > with i7-7600U and otherwise similar specs but much different workload
> > (several more VMs accessing the data). I'm not sure if that info helps
> > at all...
> > 
> > > 
> > > What happens if you replace
> > > "pages = alloc_pages(gfp_mask"
> > > with
> > > "pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)"
> > > ? Does the bug go away?
> > 
> > In a limited test, with order restored to MAX_ORDER - 1, no, still got
> > the issue, and got this via sysrq (just one task listed):
> > 
> > [  516.375254] sysrq: Show Blocked State
> > [  516.375461] task:dd              state:D stack:13072 pid:4385  ppid:4371   flags:0x00004002
> > [  516.375496] Call Trace:
> > [  516.375507]  <TASK>
> > [  516.375518]  __schedule+0x30e/0x8b0
> > [  516.375549]  schedule+0x59/0xb0
> > [  516.375566]  io_schedule+0x41/0x70
> > [  516.375582]  folio_wait_bit_common+0x12c/0x300
> > [  516.375603]  ? __pfx_wake_page_function+0x10/0x10
> > [  516.375623]  folio_wait_writeback+0x27/0x80
> > [  516.375639]  __filemap_fdatawait_range+0x7b/0xe0
> > [  516.375656]  file_write_and_wait_range+0x86/0xb0
> > [  516.375672]  blkdev_fsync+0x33/0x60
> > [  516.375693]  __x64_sys_fdatasync+0x4a/0x90
> > [  516.375713]  do_syscall_64+0x3c/0x90
> > [  516.375729]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > [  516.375751] RIP: 0033:0x7fd718562da4
> > [  516.375764] RSP: 002b:00007ffea3815278 EFLAGS: 00000202 ORIG_RAX: 000000000000004b
> > [  516.375784] RAX: ffffffffffffffda RBX: 0000000000015200 RCX: 00007fd718562da4
> > [  516.375801] RDX: 0000000000028000 RSI: 0000000000000000 RDI: 0000000000000001
> > [  516.375817] RBP: 00007fd71845e6c0 R08: 00000000ffffffff R09: 0000000000000000
> > [  516.375833] R10: 0000000000000022 R11: 0000000000000202 R12: 0000000000000000
> > [  516.375848] R13: 0000000000000000 R14: 0000000000028000 R15: 0000000000000000
> > [  516.375865]  </TASK>
> > 
> > (it is dd writing to LVM thin volume, where PV is on dm-crypt)
> 
> In this bad state, I extracted some more info on memory:
> 
> /proc/meminfo:
> 
> MemTotal:        3983088 kB
> MemFree:          349264 kB
> MemAvailable:    3474744 kB
> Buffers:         1511732 kB
> Cached:          1608132 kB
> SwapCached:           12 kB
> Active:           203136 kB
> Inactive:        3147696 kB
> Active(anon):       1288 kB
> Inactive(anon):   252128 kB
> Active(file):     201848 kB
> Inactive(file):  2895568 kB
> Unevictable:       46064 kB
> Mlocked:           39920 kB
> SwapTotal:       4112380 kB
> SwapFree:        4112124 kB
> Dirty:                 0 kB
> Writeback:          8556 kB
> AnonPages:        277020 kB
> Mapped:           137424 kB
> Shmem:             13792 kB
> KReclaimable:      91728 kB
> Slab:             182428 kB
> SReclaimable:      91728 kB
> SUnreclaim:        90700 kB
> KernelStack:        5776 kB
> PageTables:         7480 kB
> SecPageTables:         0 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:     6103924 kB
> Committed_AS:     990924 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:       11560 kB
> VmallocChunk:          0 kB
> Percpu:             2528 kB
> DirectMap4k:     4325764 kB
> DirectMap2M:           0 kB
> 
> and also:
> [ 1168.537096] sysrq: Show Memory
> [ 1168.537192] Mem-Info:
> [ 1168.537206] active_anon:322 inactive_anon:63032 isolated_anon:0
> [ 1168.537206]  active_file:50462 inactive_file:723892 isolated_file:0
> [ 1168.537206]  unevictable:11516 dirty:1 writeback:2139
> [ 1168.537206]  slab_reclaimable:22932 slab_unreclaimable:22675
> [ 1168.537206]  mapped:34357 shmem:3448 pagetables:1870
> [ 1168.537206]  sec_pagetables:0 bounce:0
> [ 1168.537206]  kernel_misc_reclaimable:0
> [ 1168.537206]  free:87499 free_pcp:1642 free_cma:0
> [ 1168.537279] Node 0 active_anon:1288kB inactive_anon:252128kB active_file:201848kB inactive_file:2895568kB unevictable:46064kB isolated(anon):0kB isolated(file):0kB mapped:137428kB dirty:4kB writeback:8556kB shmem:13792kB writeback_tmp:0kB kernel_stack:5776kB pagetables:7480kB sec_pagetables:0kB all_unreclaimable? no
> [ 1168.537332] Node 0 DMA free:15488kB boost:0kB min:32kB low:44kB high:56kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:352kB unevictable:0kB writepending:84kB present:15996kB managed:15936kB mlocked:0kB bounce:0kB free_pcp:12kB local_pcp:0kB free_cma:0kB
> [ 1168.537387] lowmem_reserve[]: 0 1728 3857 3857
> [ 1168.537409] Node 0 DMA32 free:155300kB boost:0kB min:3552kB low:5320kB high:7088kB reserved_highatomic:0KB active_anon:16kB inactive_anon:3368kB active_file:5660kB inactive_file:1585120kB unevictable:0kB writepending:7576kB present:1924360kB managed:1803376kB mlocked:0kB bounce:0kB free_pcp:3184kB local_pcp:864kB free_cma:0kB
> [ 1168.537464] lowmem_reserve[]: 0 0 2129 2129
> [ 1168.537483] Node 0 Normal free:179208kB boost:0kB min:4376kB low:6556kB high:8736kB reserved_highatomic:2048KB active_anon:1272kB inactive_anon:248760kB active_file:196104kB inactive_file:1310096kB unevictable:46064kB writepending:900kB present:2253948kB managed:2163776kB mlocked:39920kB bounce:0kB free_pcp:3372kB local_pcp:632kB free_cma:0kB
> [ 1168.537540] lowmem_reserve[]: 0 0 0 0
> [ 1168.537557] Node 0 DMA: 4*4kB (UM) 4*8kB (UM) 3*16kB (UM) 3*32kB (U) 3*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (M) = 15488kB
> [ 1168.537624] Node 0 DMA32: 1945*4kB (UME) 1446*8kB (UME) 953*16kB (UME) 1410*32kB (UM) 297*64kB (UM) 162*128kB (UM) 60*256kB (UM) 22*512kB (UM) 7*1024kB (UM) 1*2048kB (U) 0*4096kB = 155300kB
> [ 1168.537695] Node 0 Normal: 3146*4kB (UMEH) 2412*8kB (UMEH) 1444*16kB (UMEH) 1138*32kB (UMEH) 486*64kB (UME) 97*128kB (UM) 13*256kB (UM) 8*512kB (M) 2*1024kB (M) 5*2048kB (M) 6*4096kB (UM) = 179208kB
> [ 1168.537767] 779973 total pagecache pages
> [ 1168.537778] 3 pages in swap cache
> [ 1168.537788] Free swap  = 4112124kB
> [ 1168.537798] Total swap = 4112380kB
> [ 1168.537807] 1048576 pages RAM
> [ 1168.537817] 0 pages HighMem/MovableOnly
> [ 1168.537827] 52804 pages reserved
> 
> 
> > 
> > > 
> > > Mikulas
> > > 
> > > 
> > > On Sat, 21 Oct 2023, Marek Marczykowski-Górecki wrote:
> > > 
> > > > Hi,
> > > > 
> > > > Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage
> > > > subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
> > > > triggers the issue, but often it happens when doing some LVM operations
> > > > (lvremove, lvrename etc) on a dm-thin volume together with bulk data
> > > > copy to/from another LVM thin volume with ext4 fs.
> > > > 
> > > > The storage stack I use is:
> > > >   nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4
> > > > 
> > > > And this whole thing running in a (PV) dom0 under Xen, on Qubes OS 4.2 to be
> > > > specific.
> > > > 
> > > > I can reproduce the issue on at least 3 different machines. I did tried
> > > > also 6.5.6 and the issue is still there. I haven't checked newer
> > > > versions, but briefly reviewed git log and haven't found anything
> > > > suggesting a fix to similar issue.
> > > > 
> > > > I managed to bisect it down to this commit:
> > > > 
> > > >     commit 5054e778fcd9cd29ddaa8109077cd235527e4f94
> > > >     Author: Mikulas Patocka <mpatocka@redhat.com>
> > > >     Date:   Mon May 1 09:19:17 2023 -0400
> > > > 
> > > >     dm crypt: allocate compound pages if possible
> > > >     
> > > >     It was reported that allocating pages for the write buffer in dm-crypt
> > > >     causes measurable overhead [1].
> > > >     
> > > >     Change dm-crypt to allocate compound pages if they are available. If
> > > >     not, fall back to the mempool.
> > > >     
> > > >     [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html
> > > >     
> > > >     Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > >     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > >     Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > 
> > > > TBH, I'm not sure if the bug is in this commit, or maybe in some
> > > > functions it uses (I don't see dm-crypt functions directly involved in
> > > > the stack traces I collected). But reverting this commit on top of 6.5.6
> > > > seems to fix the issue.
> > > > 
> > > > I tried also CONFIG_PROVE_LOCKING, but it didn't show any issue.
> > > > 
> > > > I managed to collect "blocked tasks" dump via sysrq below. Few more can
> > > > be found at https://github.com/QubesOS/qubes-issues/issues/8575
> > > > 
> > > >     [ 4246.558313] sysrq: Show Blocked State
> > > >     [ 4246.558388] task:journal-offline state:D stack:0     pid:8098  ppid:1      flags:0x00000002
> > > >     [ 4246.558407] Call Trace:
> > > >     [ 4246.558414]  <TASK>
> > > >     [ 4246.558422]  __schedule+0x23d/0x670
> > > >     [ 4246.558440]  schedule+0x5e/0xd0
> > > >     [ 4246.558450]  io_schedule+0x46/0x70
> > > >     [ 4246.558461]  folio_wait_bit_common+0x13d/0x350
> > > >     [ 4246.558475]  ? __pfx_wake_page_function+0x10/0x10
> > > >     [ 4246.558488]  folio_wait_writeback+0x2c/0x90
> > > >     [ 4246.558498]  mpage_prepare_extent_to_map+0x15c/0x4d0
> > > >     [ 4246.558512]  ext4_do_writepages+0x25f/0x770
> > > >     [ 4246.558523]  ext4_writepages+0xad/0x180
> > > >     [ 4246.558533]  do_writepages+0xcf/0x1e0
> > > >     [ 4246.558543]  ? __seccomp_filter+0x32a/0x4f0
> > > >     [ 4246.558554]  filemap_fdatawrite_wbc+0x63/0x90
> > > >     [ 4246.558567]  __filemap_fdatawrite_range+0x5c/0x80
> > > >     [ 4246.558578]  file_write_and_wait_range+0x4a/0xb0
> > > >     [ 4246.558588]  ext4_sync_file+0x88/0x380
> > > >     [ 4246.558598]  __x64_sys_fsync+0x3b/0x70
> > > >     [ 4246.558609]  do_syscall_64+0x5c/0x90
> > > >     [ 4246.558621]  ? exit_to_user_mode_prepare+0xb2/0xd0
> > > >     [ 4246.558632]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > >     [ 4246.558644] RIP: 0033:0x7710cf124d0a
> > > >     [ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
> > > >     [ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a
> > > >     [ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
> > > >     [ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0
> > > >     [ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940
> > > >     [ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000
> > > >     [ 4246.558728]  </TASK>
> > > >     [ 4246.558836] task:lvm             state:D stack:0     pid:7835  ppid:5665   flags:0x00004006
> > > >     [ 4246.558852] Call Trace:
> > > >     [ 4246.558857]  <TASK>
> > > >     [ 4246.558863]  __schedule+0x23d/0x670
> > > >     [ 4246.558874]  schedule+0x5e/0xd0
> > > >     [ 4246.558884]  io_schedule+0x46/0x70
> > > >     [ 4246.558894]  dm_wait_for_bios_completion+0xfc/0x110
> > > >     [ 4246.558909]  ? __pfx_autoremove_wake_function+0x10/0x10
> > > >     [ 4246.558922]  __dm_suspend+0x7e/0x1b0
> > > >     [ 4246.558932]  dm_internal_suspend_noflush+0x5c/0x80
> > > >     [ 4246.558946]  pool_presuspend+0xcc/0x130 [dm_thin_pool]
> > > >     [ 4246.558968]  dm_table_presuspend_targets+0x3f/0x60
> > > >     [ 4246.558980]  __dm_suspend+0x41/0x1b0
> > > >     [ 4246.558991]  dm_suspend+0xc0/0xe0
> > > >     [ 4246.559001]  dev_suspend+0xa5/0xd0
> > > >     [ 4246.559011]  ctl_ioctl+0x26e/0x350
> > > >     [ 4246.559020]  ? __pfx_dev_suspend+0x10/0x10
> > > >     [ 4246.559032]  dm_ctl_ioctl+0xe/0x20
> > > >     [ 4246.559041]  __x64_sys_ioctl+0x94/0xd0
> > > >     [ 4246.559052]  do_syscall_64+0x5c/0x90
> > > >     [ 4246.559062]  ? do_syscall_64+0x6b/0x90
> > > >     [ 4246.559072]  ? do_syscall_64+0x6b/0x90
> > > >     [ 4246.559081]  ? xen_pv_evtchn_do_upcall+0x54/0xb0
> > > >     [ 4246.559093]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > >     [ 4246.559104] RIP: 0033:0x7f1cb77cfe0f
> > > >     [ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > > >     [ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f
> > > >     [ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003
> > > >     [ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420
> > > >     [ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
> > > >     [ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0
> > > >     [ 4246.559199]  </TASK>
> > > >     [ 4246.559207] task:kworker/u8:3    state:D stack:0     pid:8033  ppid:2      flags:0x00004000
> > > >     [ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4)
> > > >     [ 4246.559238] Call Trace:
> > > >     [ 4246.559244]  <TASK>
> > > >     [ 4246.559249]  __schedule+0x23d/0x670
> > > >     [ 4246.559260]  schedule+0x5e/0xd0
> > > >     [ 4246.559270]  io_schedule+0x46/0x70
> > > >     [ 4246.559280]  folio_wait_bit_common+0x13d/0x350
> > > >     [ 4246.559290]  ? __pfx_wake_page_function+0x10/0x10
> > > >     [ 4246.559302]  mpage_prepare_extent_to_map+0x309/0x4d0
> > > >     [ 4246.559314]  ext4_do_writepages+0x25f/0x770
> > > >     [ 4246.559324]  ext4_writepages+0xad/0x180
> > > >     [ 4246.559334]  do_writepages+0xcf/0x1e0
> > > >     [ 4246.559344]  ? find_busiest_group+0x42/0x1a0
> > > >     [ 4246.559354]  __writeback_single_inode+0x3d/0x280
> > > >     [ 4246.559368]  writeback_sb_inodes+0x1ed/0x4a0
> > > >     [ 4246.559381]  __writeback_inodes_wb+0x4c/0xf0
> > > >     [ 4246.559393]  wb_writeback+0x298/0x310
> > > >     [ 4246.559403]  wb_do_writeback+0x230/0x2b0
> > > >     [ 4246.559414]  wb_workfn+0x5f/0x260
> > > >     [ 4246.559424]  ? _raw_spin_unlock+0xe/0x30
> > > >     [ 4246.559434]  ? finish_task_switch.isra.0+0x95/0x2b0
> > > >     [ 4246.559447]  ? __schedule+0x245/0x670
> > > >     [ 4246.559457]  process_one_work+0x1df/0x3e0
> > > >     [ 4246.559466]  worker_thread+0x51/0x390
> > > >     [ 4246.559475]  ? __pfx_worker_thread+0x10/0x10
> > > >     [ 4246.559484]  kthread+0xe5/0x120
> > > >     [ 4246.559495]  ? __pfx_kthread+0x10/0x10
> > > >     [ 4246.559504]  ret_from_fork+0x31/0x50
> > > >     [ 4246.559514]  ? __pfx_kthread+0x10/0x10
> > > >     [ 4246.559523]  ret_from_fork_asm+0x1b/0x30
> > > >     [ 4246.559536]  </TASK>
> > > > 
> > > > -- 
> > > > Best Regards,
> > > > Marek Marczykowski-Górecki
> > > > Invisible Things Lab
> > > > 
> > 
> > 
> > -- 
> > Best Regards,
> > Marek Marczykowski-Górecki
> > Invisible Things Lab
> 
> 
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-25 10:13       ` Mikulas Patocka
@ 2023-10-27 17:32         ` Mikulas Patocka
  2023-10-28  9:23           ` Matthew Wilcox
  2023-10-29 11:15           ` Marek Marczykowski-Górecki
  2023-10-29 20:02         ` Vlastimil Babka
  1 sibling, 2 replies; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-27 17:32 UTC (permalink / raw)
  To: Mike Snitzer, Marek Marczykowski-Górecki
  Cc: stable, regressions, Alasdair Kergon, dm-devel, linux-mm,
	Andrew Morton, Matthew Wilcox, Michal Hocko, Vlastimil Babka

[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]

So, we got no response from the MM maintainers. Marek - please try this 
patch on all the machines where you hit the bug and, if you still hit the 
bug with this patch applied, report it.

Mikulas



From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH] dm-crypt: don't allocate large compound pages

It was reported that the patch 5054e778fcd9cd29ddaa8109077cd235527e4f94
("dm crypt: allocate compound pages if possible") causes intermittent
freezes [1].

So far, it is not clear what the root cause is. It was reported that with
the allocation order 3 or lower it works [1], so we restrict the order to
3 (that is PAGE_ALLOC_COSTLY_ORDER).

[1] https://www.spinics.net/lists/dm-devel/msg56048.html

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Cc: stable@vger.kernel.org	# v6.5+
Fixes: 5054e778fcd9 ("dm crypt: allocate compound pages if possible")

---
 drivers/md/dm-crypt.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c
+++ linux-2.6/drivers/md/dm-crypt.c
@@ -1679,7 +1679,7 @@ static struct bio *crypt_alloc_buffer(st
 	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
 	unsigned int remaining_size;
-	unsigned int order = MAX_ORDER - 1;
+	unsigned int order = PAGE_ALLOC_COSTLY_ORDER;
 
 retry:
 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
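
For scale, assuming 4 KiB pages: PAGE_ALLOC_COSTLY_ORDER is 3, so with this
change a single write-buffer allocation is at most 2^3 pages = 32 KiB;
larger writes simply loop and allocate more chunks (or fall back to the
mempool) rather than failing. A trivial, purely illustrative userspace
program showing the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        const unsigned long page_size = 4096;       /* assumption: 4 KiB pages */
        const unsigned int  costly_order = 3;       /* PAGE_ALLOC_COSTLY_ORDER */
        const unsigned long chunk = page_size << costly_order;
        const unsigned long write_size = 1UL << 20; /* example: a 1 MiB write */

        printf("largest single allocation: %lu KiB\n", chunk / 1024);
        printf("a 1 MiB write needs at most %lu such allocations\n",
               (write_size + chunk - 1) / chunk);
        return 0;
    }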

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-27 17:32         ` Mikulas Patocka
@ 2023-10-28  9:23           ` Matthew Wilcox
  2023-10-28 15:14             ` Mike Snitzer
  2023-10-29 11:15           ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 66+ messages in thread
From: Matthew Wilcox @ 2023-10-28  9:23 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Marek Marczykowski-Górecki, stable,
	regressions, Alasdair Kergon, dm-devel, linux-mm, Andrew Morton,
	Michal Hocko, Vlastimil Babka

On Fri, Oct 27, 2023 at 07:32:54PM +0200, Mikulas Patocka wrote:
> So, we got no reponse from the MM maintainers. Marek - please try this 

yes, how dare i go on holiday.  i'll look at this when i'm back.
probably.

> patch on all the machines where you hit the bug and if you still hit the 
> bug with this patch, report it.
> 
> Mikulas
> 
> 
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH] dm-crypt: don't allocate large compound pages
> 
> It was reported that the patch 5054e778fcd9cd29ddaa8109077cd235527e4f94
> ("dm crypt: allocate compound pages if possible") causes intermittent
> freezes [1].
> 
> So far, it is not clear what is the root cause. It was reported that with
> the allocation order 3 or lower it works [1], so we restrict the order to
> 3 (that is PAGE_ALLOC_COSTLY_ORDER).
> 
> [1] https://www.spinics.net/lists/dm-devel/msg56048.html
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Cc: stable@vger.kernel.org	# v6.5+
> Fixes: 5054e778fcd9 ("dm crypt: allocate compound pages if possible")
> 
> ---
>  drivers/md/dm-crypt.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6/drivers/md/dm-crypt.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-crypt.c
> +++ linux-2.6/drivers/md/dm-crypt.c
> @@ -1679,7 +1679,7 @@ static struct bio *crypt_alloc_buffer(st
>  	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
>  	unsigned int remaining_size;
> -	unsigned int order = MAX_ORDER - 1;
> +	unsigned int order = PAGE_ALLOC_COSTLY_ORDER;
>  
>  retry:
>  	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-28  9:23           ` Matthew Wilcox
@ 2023-10-28 15:14             ` Mike Snitzer
  0 siblings, 0 replies; 66+ messages in thread
From: Mike Snitzer @ 2023-10-28 15:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mikulas Patocka, Marek Marczykowski-Górecki, stable,
	regressions, Alasdair Kergon, dm-devel, linux-mm, Andrew Morton,
	Michal Hocko, Vlastimil Babka

On Sat, Oct 28 2023 at  5:23P -0400,
Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Oct 27, 2023 at 07:32:54PM +0200, Mikulas Patocka wrote:
> > So, we got no reponse from the MM maintainers. Marek - please try this 
> 
> yes, how dare i go on holiday.  i'll look at this when i'm back.
> probably.

There was no malice or disrespect until you emerged from holiday.
Hopefully you're at the start; if not, you should extend it. Probably.

I've staged this workaround until you or others can look closer:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.7&id=b3d87b87017d50cab1ea49fc6810c74584e00027

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-21  7:48 ` Bagas Sanjaya
@ 2023-10-29  6:23     ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 0 replies; 66+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-10-29  6:23 UTC (permalink / raw)
  To: Bagas Sanjaya, Marek Marczykowski-Górecki, Linux Stable
  Cc: Linux Regressions, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Linux Devicemapper, Matthew Wilcox

[TLDR: This mail is primarily relevant for Linux regression tracking. A
change or fix related to the regression discussed in this thread was
posted or applied, but it did not use a Closes: tag to point to the
report, as Linus and the documentation call for. Things happen, no
worries -- but now the regression tracking bot needs to be told manually
about the fix. See link in footer if these mails annoy you.]

On 21.10.23 09:48, Bagas Sanjaya wrote:
> On Sat, Oct 21, 2023 at 05:38:58AM +0200, Marek Marczykowski-Górecki wrote:
>>
>> Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage
>> subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
>> triggers the issue, but often it happens when doing some LVM operations
>> (lvremove, lvrename etc) on a dm-thin volume together with bulk data
>> copy to/from another LVM thin volume with ext4 fs.
> [...] 
> #regzbot ^introduced: 5054e778fcd9cd

#regzbot fix: dm crypt: don't allocate large compound pages
#regzbot ignore-activity

(fix currently found in
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.7&id=b3d87b87017d50cab1ea49fc6810c74584e00027
)

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
@ 2023-10-29  6:23     ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 0 replies; 66+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-10-29  6:23 UTC (permalink / raw)
  To: Bagas Sanjaya, Marek Marczykowski-Górecki, Linux Stable
  Cc: Linux Regressions, Mike Snitzer, Matthew Wilcox,
	Linux Devicemapper, Mikulas Patocka, Alasdair Kergon

[TLDR: This mail is primarily relevant for Linux regression tracking. A
change or fix related to the regression discussed in this thread was
posted or applied, but it did not use a Closes: tag to point to the
report, as Linus and the documentation call for. Things happen, no
worries -- but now the regression tracking bot needs to be told manually
about the fix. See link in footer if these mails annoy you.]

On 21.10.23 09:48, Bagas Sanjaya wrote:
> On Sat, Oct 21, 2023 at 05:38:58AM +0200, Marek Marczykowski-Górecki wrote:
>>
>> Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage
>> subsystem freeze - any I/O ends up frozen. I'm not sure what exactly
>> triggers the issue, but often it happens when doing some LVM operations
>> (lvremove, lvrename etc) on a dm-thin volume together with bulk data
>> copy to/from another LVM thin volume with ext4 fs.
> [...] 
> #regzbot ^introduced: 5054e778fcd9cd

#regzbot fix: dm crypt: don't allocate large compound pages
#regzbot ignore-activity

(fix currently found in
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.7&id=b3d87b87017d50cab1ea49fc6810c74584e00027
)

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-27 17:32         ` Mikulas Patocka
  2023-10-28  9:23           ` Matthew Wilcox
@ 2023-10-29 11:15           ` Marek Marczykowski-Górecki
  1 sibling, 0 replies; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-29 11:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, stable, regressions, Alasdair Kergon, dm-devel,
	linux-mm, Andrew Morton, Matthew Wilcox, Michal Hocko,
	Vlastimil Babka

[-- Attachment #1: Type: text/plain, Size: 1961 bytes --]

On Fri, Oct 27, 2023 at 07:32:54PM +0200, Mikulas Patocka wrote:
> So, we got no response from the MM maintainers. Marek - please try this
> patch on all the machines where you hit the bug, and if you still hit the
> bug with this patch, report it.

I've tested it for a bit and did not hit the issue again.

> 
> Mikulas
> 
> 
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH] dm-crypt: don't allocate large compound pages
> 
> It was reported that the patch 5054e778fcd9cd29ddaa8109077cd235527e4f94
> ("dm crypt: allocate compound pages if possible") causes intermittent
> freezes [1].
> 
> So far, it is not clear what the root cause is. It was reported that with
> the allocation order 3 or lower it works [1], so we restrict the order to
> 3 (that is PAGE_ALLOC_COSTLY_ORDER).
> 
> [1] https://www.spinics.net/lists/dm-devel/msg56048.html
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Cc: stable@vger.kernel.org	# v6.5+
> Fixes: 5054e778fcd9 ("dm crypt: allocate compound pages if possible")
> 
> ---
>  drivers/md/dm-crypt.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6/drivers/md/dm-crypt.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-crypt.c
> +++ linux-2.6/drivers/md/dm-crypt.c
> @@ -1679,7 +1679,7 @@ static struct bio *crypt_alloc_buffer(st
>  	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
>  	unsigned int remaining_size;
> -	unsigned int order = MAX_ORDER - 1;
> +	unsigned int order = PAGE_ALLOC_COSTLY_ORDER;
>  
>  retry:
>  	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-25 10:13       ` Mikulas Patocka
  2023-10-27 17:32         ` Mikulas Patocka
@ 2023-10-29 20:02         ` Vlastimil Babka
  2023-10-30  7:37           ` Mikulas Patocka
  1 sibling, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2023-10-29 20:02 UTC (permalink / raw)
  To: Mikulas Patocka, Marek Marczykowski-Górecki, Andrew Morton,
	Matthew Wilcox, Michal Hocko
  Cc: stable, regressions, Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On 10/25/23 12:13, Mikulas Patocka wrote:
> So, I forward this to memory management maintainers.

Hi,

> What do you think? - We have a problem that if dm-crypt allocates pages 
> with order > 3 (PAGE_ALLOC_COSTLY_ORDER), the system occasionally freezes 
> waiting in writeback.

Hmm, so I've checked the backtraces provided and none seems to show the
actual allocating task doing the crypt_alloc_buffer(). Do we know what it's
doing? Is it blocked, or perhaps spinning in some infinite loop?

> dm-crypt allocates the pages with GFP_NOWAIT | __GFP_HIGHMEM | 
> __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP, so it 
> shouldn't put any pressure on the system. If the allocations fails, it 
> falls back to smaller order allocation or to mempool as a last resort.

Right. I noticed it may also fall back to __GFP_DIRECT_RECLAIM, but then it's
only order-0.

> When the freeze happens, there is "349264kB" free memory - so the system 
> is not short on memory.

Yeah, it may still be fragmented, although in [1] the sysrq show-memory
report suggests it's not; pages do exist up to MAX_ORDER. Weird.

[1] https://lore.kernel.org/all/ZTiJ3CO8w0jauOzW@mail-itl/

> Should we restrict the dm-crypt allocation size to 
> PAGE_ALLOC_COSTLY_ORDER? Or is it a bug somewhere in memory management 
> system that needs to be fixes there?

Haven't found any. However I'd like to point out some things I noticed in
crypt_alloc_buffer(), although they are probably not related.

> static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
> {
> 	struct crypt_config *cc = io->cc;
> 	struct bio *clone;
> 	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
> 	unsigned int remaining_size;
> 	unsigned int order = MAX_ORDER - 1;
> 
> retry:
> 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
> 		mutex_lock(&cc->bio_alloc_lock);

What if we end up in "goto retry" more than once? I don't see a matching
unlock. Yeah, very unlikely to happen that order-0 in page allocator which
includes __GFP_DIRECT_RECLAIM would fail, but not impossible, and also I see
crypt_page_alloc() for the mempool can fail for another reason, due to a
counter being too high. Looks dangerous.

> 
> 	clone = bio_alloc_bioset(cc->dev->bdev, nr_iovecs, io->base_bio->bi_opf,
> 				 GFP_NOIO, &cc->bs);
> 	clone->bi_private = io;
> 	clone->bi_end_io = crypt_endio;
> 
> 	remaining_size = size;
> 
> 	while (remaining_size) {
> 		struct page *pages;
> 		unsigned size_to_add;
> 		unsigned remaining_order = __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);

Tip: you can use get_order(remaining_size) here.

> 		order = min(order, remaining_order);
> 
> 		while (order > 0) {

Is this intentionally > 0 and not >= 0? We could still succeed avoiding
mempool with order-0...

> 			pages = alloc_pages(gfp_mask
> 				| __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP,
> 				order);
> 			if (likely(pages != NULL))
> 				goto have_pages;
> 			order--;
> 		}
> 
> 		pages = mempool_alloc(&cc->page_pool, gfp_mask);
> 		if (!pages) {
> 			crypt_free_buffer_pages(cc, clone);
> 			bio_put(clone);
> 			gfp_mask |= __GFP_DIRECT_RECLAIM;
> 			order = 0;
> 			goto retry;
> 		}
> 
> have_pages:
> 		size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
> 		__bio_add_page(clone, pages, size_to_add, 0);
> 		remaining_size -= size_to_add;
> 	}
> 
> 	/* Allocate space for integrity tags */
> 	if (dm_crypt_integrity_io_alloc(io, clone)) {
> 		crypt_free_buffer_pages(cc, clone);
> 		bio_put(clone);
> 		clone = NULL;
> 	}
> 
> 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
> 		mutex_unlock(&cc->bio_alloc_lock);
> 
> 	return clone;
> }


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-29 20:02         ` Vlastimil Babka
@ 2023-10-30  7:37           ` Mikulas Patocka
  2023-10-30  8:37             ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-30  7:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Marek Marczykowski-Górecki, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm



On Sun, 29 Oct 2023, Vlastimil Babka wrote:

> Haven't found any. However I'd like to point out some things I noticed in
> crypt_alloc_buffer(), although they are probably not related.
> 
> > static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
> > {
> > 	struct crypt_config *cc = io->cc;
> > 	struct bio *clone;
> > 	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
> > 	unsigned int remaining_size;
> > 	unsigned int order = MAX_ORDER - 1;
> > 
> > retry:
> > 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
> > 		mutex_lock(&cc->bio_alloc_lock);
> 
> What if we end up in "goto retry" more than once? I don't see a matching

It is impossible. Before we jump to the retry label, we set 
__GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if 
__GFP_DIRECT_RECLAIM is present (it will just wait until some other task 
frees some objects into the mempool).
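
For reference, the relevant mempool_alloc() behaviour is roughly the
following - a simplified sketch from memory, not the actual mm/mempool.c
code (the real function takes the pool lock and uses prepare_to_wait()):

    void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
    {
            void *element;

    repeat:
            /* first try the underlying allocator, without sleeping */
            element = pool->alloc(gfp_mask & ~__GFP_DIRECT_RECLAIM,
                                  pool->pool_data);
            if (element)
                    return element;

            /* then fall back to the pre-allocated elements */
            if (pool->curr_nr > 0)
                    return pool->elements[--pool->curr_nr];

            /* the only failure mode: the caller is not allowed to sleep */
            if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
                    return NULL;

            /* otherwise wait until mempool_free() refills the pool, retry */
            wait_event(pool->wait, pool->curr_nr > 0);
            goto repeat;
    }

So once __GFP_DIRECT_RECLAIM is set, the mempool path sleeps instead of
failing, and we can reach "goto retry" at most once.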

> unlock. Yeah, very unlikely to happen that order-0 in page allocator which
> includes __GFP_DIRECT_RECLAIM would fail, but not impossible, and also I see
> crypt_page_alloc() for the mempool can fail for another reason, due to a
> counter being too high. Looks dangerous.

If crypt_page_alloc fails, mempool falls back to allocating from a 
pre-allocated list.

But now I see that there is a bug: the compound pages don't contribute to
the "cc->n_allocated_pages" counter. I'll have to fix it.

> > 
> > 	clone = bio_alloc_bioset(cc->dev->bdev, nr_iovecs, io->base_bio->bi_opf,
> > 				 GFP_NOIO, &cc->bs);
> > 	clone->bi_private = io;
> > 	clone->bi_end_io = crypt_endio;
> > 
> > 	remaining_size = size;
> > 
> > 	while (remaining_size) {
> > 		struct page *pages;
> > 		unsigned size_to_add;
> > 		unsigned remaining_order = __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
> 
> Tip: you can use get_order(remaining_size) here.

get_order rounds the size up and we need to round it down here (rounding 
it up would waste memory).
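
As a concrete illustration, here is a standalone userspace toy (assuming
4 KiB pages and re-implementing the helpers just for the example - this is
not dm-crypt code):

    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    /* 1-based "find last set", 0 for x == 0 (stand-in for the kernel's fls()) */
    static unsigned int fls_u(unsigned long x)
    {
            unsigned int r = 0;

            while (x) {
                    r++;
                    x >>= 1;
            }
            return r;
    }

    /* what get_order(size) computes: smallest order covering size (round up) */
    static unsigned int order_round_up(unsigned long size)
    {
            return fls_u((size - 1) >> PAGE_SHIFT);
    }

    /* what __fls((size + PAGE_SIZE - 1) >> PAGE_SHIFT) computes (round down) */
    static unsigned int order_round_down(unsigned long size)
    {
            return fls_u((size + PAGE_SIZE - 1) >> PAGE_SHIFT) - 1;
    }

    int main(void)
    {
            unsigned long remaining_size = 5 * PAGE_SIZE;   /* 20 KiB left */

            printf("round up   (get_order): order %u -> %lu pages\n",
                   order_round_up(remaining_size),
                   1UL << order_round_up(remaining_size));
            printf("round down (__fls)    : order %u -> %lu pages\n",
                   order_round_down(remaining_size),
                   1UL << order_round_down(remaining_size));
            return 0;
    }

For 5 remaining pages this prints order 3 (8 pages, 3 of them wasted) for
the round-up variant versus order 2 (4 pages) for the round-down one; the
loop then covers the remaining page with a follow-up smaller allocation.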

> > 		order = min(order, remaining_order);
> > 
> > 		while (order > 0) {
> 
> Is this intentionally > 0 and not >= 0? We could still succeed avoiding
> mempool with order-0...

Yes, it is intentional. mempool_alloc will try to allocate the page using 
alloc_page, so there is no need to go to the "pages = alloc_pages" branch 
before it.

> > 			pages = alloc_pages(gfp_mask
> > 				| __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP,
> > 				order);
> > 			if (likely(pages != NULL))
> > 				goto have_pages;
> > 			order--;
> > 		}
> > 
> > 		pages = mempool_alloc(&cc->page_pool, gfp_mask);
> > 		if (!pages) {
> > 			crypt_free_buffer_pages(cc, clone);
> > 			bio_put(clone);
> > 			gfp_mask |= __GFP_DIRECT_RECLAIM;
> > 			order = 0;
> > 			goto retry;
> > 		}
> > 
> > have_pages:
> > 		size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
> > 		__bio_add_page(clone, pages, size_to_add, 0);
> > 		remaining_size -= size_to_add;
> > 	}
> > 
> > 	/* Allocate space for integrity tags */
> > 	if (dm_crypt_integrity_io_alloc(io, clone)) {
> > 		crypt_free_buffer_pages(cc, clone);
> > 		bio_put(clone);
> > 		clone = NULL;
> > 	}
> > 
> > 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
> > 		mutex_unlock(&cc->bio_alloc_lock);
> > 
> > 	return clone;
> > }

Mikulas


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30  7:37           ` Mikulas Patocka
@ 2023-10-30  8:37             ` Vlastimil Babka
  2023-10-30 11:22               ` Mikulas Patocka
  2023-10-30 11:28               ` Jan Kara
  0 siblings, 2 replies; 66+ messages in thread
From: Vlastimil Babka @ 2023-10-30  8:37 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marek Marczykowski-Górecki, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm, Jan Kara

On 10/30/23 08:37, Mikulas Patocka wrote:
> 
> 
> On Sun, 29 Oct 2023, Vlastimil Babka wrote:
> 
>> Haven't found any. However I'd like to point out some things I noticed in
>> crypt_alloc_buffer(), although they are probably not related.
>> 
>> > static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
>> > {
>> > 	struct crypt_config *cc = io->cc;
>> > 	struct bio *clone;
>> > 	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> > 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
>> > 	unsigned int remaining_size;
>> > 	unsigned int order = MAX_ORDER - 1;
>> > 
>> > retry:
>> > 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
>> > 		mutex_lock(&cc->bio_alloc_lock);
>> 
>> What if we end up in "goto retry" more than once? I don't see a matching
> 
> It is impossible. Before we jump to the retry label, we set 
> __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if 
> __GFP_DIRECT_RECLAIM is present (it will just wait until some other task 
> frees some objects into the mempool).

Ah, missed that. And the traces don't show that we would be waiting for
that. I'm starting to think the allocation itself is really not the issue
here. Also I don't think it deprives something else of large order pages, as
per the sysrq listing they still existed.

What I rather suspect is what happens next to the allocated bio such that it
works well with order-0 or up to costly_order pages, but there's some
problem causing a deadlock if the bio contains larger pages than that?

Cc Honza. The thread starts here:
https://lore.kernel.org/all/ZTNH0qtmint%2FzLJZ@mail-itl/

The linked qubes reports has a number of blocked task listings that can be
expanded:
https://github.com/QubesOS/qubes-issues/issues/8575

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30  8:37             ` Vlastimil Babka
@ 2023-10-30 11:22               ` Mikulas Patocka
  2023-10-30 11:30                 ` Vlastimil Babka
  2023-10-30 11:28               ` Jan Kara
  1 sibling, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-30 11:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Marek Marczykowski-Górecki, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm, Jan Kara



On Mon, 30 Oct 2023, Vlastimil Babka wrote:

> Ah, missed that. And the traces don't show that we would be waiting for
> that. I'm starting to think the allocation itself is really not the issue
> here. Also I don't think it deprives something else of large order pages, as
> per the sysrq listing they still existed.
> 
> What I rather suspect is what happens next to the allocated bio such that it
> works well with order-0 or up to costly_order pages, but there's some
> problem causing a deadlock if the bio contains larger pages than that?

Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the 
memory allocation code and I suppose that one of them does something bad 
and triggers this bug. But I don't know which one.

> Cc Honza. The thread starts here:
> https://lore.kernel.org/all/ZTNH0qtmint%2FzLJZ@mail-itl/
> 
> The linked qubes reports has a number of blocked task listings that can be
> expanded:
> https://github.com/QubesOS/qubes-issues/issues/8575

Mikulas


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30  8:37             ` Vlastimil Babka
  2023-10-30 11:22               ` Mikulas Patocka
@ 2023-10-30 11:28               ` Jan Kara
  2023-10-30 11:49                 ` Mikulas Patocka
  1 sibling, 1 reply; 66+ messages in thread
From: Jan Kara @ 2023-10-30 11:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mikulas Patocka, Marek Marczykowski-Górecki, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm, Jan Kara

On Mon 30-10-23 09:37:10, Vlastimil Babka wrote:
> On 10/30/23 08:37, Mikulas Patocka wrote:
> > On Sun, 29 Oct 2023, Vlastimil Babka wrote:
> > 
> >> Haven't found any. However I'd like to point out some things I noticed in
> >> crypt_alloc_buffer(), although they are probably not related.
> >> 
> >> > static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
> >> > {
> >> > 	struct crypt_config *cc = io->cc;
> >> > 	struct bio *clone;
> >> > 	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> >> > 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
> >> > 	unsigned int remaining_size;
> >> > 	unsigned int order = MAX_ORDER - 1;
> >> > 
> >> > retry:
> >> > 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
> >> > 		mutex_lock(&cc->bio_alloc_lock);
> >> 
> >> What if we end up in "goto retry" more than once? I don't see a matching
> > 
> > It is impossible. Before we jump to the retry label, we set 
> > __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if 
> > __GFP_DIRECT_RECLAIM is present (it will just wait until some other task 
> > frees some objects into the mempool).
> 
> Ah, missed that. And the traces don't show that we would be waiting for
> that. I'm starting to think the allocation itself is really not the issue
> here. Also I don't think it deprives something else of large order pages, as
> per the sysrq listing they still existed.
> 
> What I rather suspect is what happens next to the allocated bio such that it
> works well with order-0 or up to costly_order pages, but there's some
> problem causing a deadlock if the bio contains larger pages than that?

Hum, so in all the backtraces presented we see that we are waiting for page
writeback to complete but I don't see anything that would be preventing the
bios from completing. Page writeback can submit quite large bios so it kind
of makes sense that it trips up some odd behavior. Looking at the code
I can see one possible problem in crypt_alloc_buffer() but it doesn't
explain why reducing initial page order would help. Anyway: Are we
guaranteed mempool has enough pages for arbitrarily large bio that can
enter crypt_alloc_buffer()? I can see crypt_page_alloc() does limit the
number of pages in the mempool to dm_crypt_pages_per_client plus I assume
the percpu counter bias in cc->n_allocated_pages can limit the really
available number of pages even further. So if a single bio is large enough
to trip percpu_counter_read_positive(&cc->n_allocated_pages) >=
dm_crypt_pages_per_client condition in crypt_page_alloc(), we can loop
forever? But maybe this cannot happen for some reason...
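
For reference, the check in question looks roughly like this (paraphrased
from drivers/md/dm-crypt.c, so treat it as a sketch rather than the exact
code):

    static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
    {
            struct crypt_config *cc = pool_data;
            struct page *page;

            /*
             * Refuse fresh allocations once the per-client budget is
             * (approximately - it is a percpu counter) exhausted, so that
             * mempool_alloc() has to fall back to its pre-allocated
             * elements or wait.
             */
            if (percpu_counter_read_positive(&cc->n_allocated_pages) >=
                dm_crypt_pages_per_client)
                    return NULL;

            page = alloc_page(gfp_mask);
            if (page)
                    percpu_counter_add(&cc->n_allocated_pages, 1);

            return page;
    }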

If this is not it, I think we need to find out why the writeback bios are
not completing. Probably I'd start with checking what kcryptd, presumably
responsible for processing these bios, is doing.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 11:22               ` Mikulas Patocka
@ 2023-10-30 11:30                 ` Vlastimil Babka
  2023-10-30 11:37                   ` Mikulas Patocka
  2023-10-30 12:25                   ` Jan Kara
  0 siblings, 2 replies; 66+ messages in thread
From: Vlastimil Babka @ 2023-10-30 11:30 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marek Marczykowski-Górecki, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm, Jan Kara

On 10/30/23 12:22, Mikulas Patocka wrote:
> 
> 
> On Mon, 30 Oct 2023, Vlastimil Babka wrote:
> 
>> Ah, missed that. And the traces don't show that we would be waiting for
>> that. I'm starting to think the allocation itself is really not the issue
>> here. Also I don't think it deprives something else of large order pages, as
>> per the sysrq listing they still existed.
>> 
>> What I rather suspect is what happens next to the allocated bio such that it
>> works well with order-0 or up to costly_order pages, but there's some
>> problem causing a deadlock if the bio contains larger pages than that?
> 
> Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the 
> memory allocation code and I suppose that one of them does something bad 
> and triggers this bug. But I don't know which one.

It's not what I meant. All the interesting branches for costly order in page
allocator/compaction only apply with __GFP_DIRECT_RECLAIM, so we can't be
hitting those here.
The traces I've seen suggest the allocation of the bio succeeded, and
problems arose only after it was submitted.

I wouldn't even be surprised if the threshold for hitting the bug was not
exactly order > PAGE_ALLOC_COSTLY_ORDER but order > PAGE_ALLOC_COSTLY_ORDER
+ 1 or + 2 (has that been tested?) or rather that there's no exact
threshold, but probability increases with order.

>> Cc Honza. The thread starts here:
>> https://lore.kernel.org/all/ZTNH0qtmint%2FzLJZ@mail-itl/
>> 
>> The linked qubes reports has a number of blocked task listings that can be
>> expanded:
>> https://github.com/QubesOS/qubes-issues/issues/8575
> 
> Mikulas
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 11:30                 ` Vlastimil Babka
@ 2023-10-30 11:37                   ` Mikulas Patocka
  2023-10-30 12:25                   ` Jan Kara
  1 sibling, 0 replies; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-30 11:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Marek Marczykowski-Górecki, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm, Jan Kara



On Mon, 30 Oct 2023, Vlastimil Babka wrote:

> On 10/30/23 12:22, Mikulas Patocka wrote:
> > 
> > 
> > On Mon, 30 Oct 2023, Vlastimil Babka wrote:
> > 
> >> Ah, missed that. And the traces don't show that we would be waiting for
> >> that. I'm starting to think the allocation itself is really not the issue
> >> here. Also I don't think it deprives something else of large order pages, as
> >> per the sysrq listing they still existed.
> >> 
> >> What I rather suspect is what happens next to the allocated bio such that it
> >> works well with order-0 or up to costly_order pages, but there's some
> >> problem causing a deadlock if the bio contains larger pages than that?
> > 
> > Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the 
> > memory allocation code and I suppose that one of them does something bad 
> > and triggers this bug. But I don't know which one.
> 
> It's not what I meant. All the interesting branches for costly order in page
> allocator/compaction only apply with __GFP_DIRECT_RECLAIM, so we can't be
> hitting those here.
> The traces I've seen suggest the allocation of the bio succeeded, and
> problems arose only after it was submitted.
> 
> I wouldn't even be surprised if the threshold for hitting the bug was not
> exactly order > PAGE_ALLOC_COSTLY_ORDER but order > PAGE_ALLOC_COSTLY_ORDER
> + 1 or + 2 (has that been tested?) or rather that there's no exact
> threshold, but probability increases with order.

It has been tested - Marek tested it with initial order == 4 and he hit the 
bug. With 3 or less he didn't.

Mikulas


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 11:28               ` Jan Kara
@ 2023-10-30 11:49                 ` Mikulas Patocka
  2023-10-30 12:11                   ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-30 11:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vlastimil Babka, Marek Marczykowski-Górecki, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm



On Mon, 30 Oct 2023, Jan Kara wrote:

> > >> What if we end up in "goto retry" more than once? I don't see a matching
> > > 
> > > It is impossible. Before we jump to the retry label, we set 
> > > __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if 
> > > __GFP_DIRECT_RECLAIM is present (it will just wait until some other task 
> > > frees some objects into the mempool).
> > 
> > Ah, missed that. And the traces don't show that we would be waiting for
> > that. I'm starting to think the allocation itself is really not the issue
> > here. Also I don't think it deprives something else of large order pages, as
> > per the sysrq listing they still existed.
> > 
> > What I rather suspect is what happens next to the allocated bio such that it
> > works well with order-0 or up to costly_order pages, but there's some
> > problem causing a deadlock if the bio contains larger pages than that?
> 
> Hum, so in all the backtraces presented we see that we are waiting for page
> writeback to complete but I don't see anything that would be preventing the
> bios from completing. Page writeback can submit quite large bios so it kind
> of makes sense that it trips up some odd behavior. Looking at the code
> I can see one possible problem in crypt_alloc_buffer() but it doesn't
> explain why reducing initial page order would help. Anyway: Are we
> guaranteed mempool has enough pages for arbitrarily large bio that can
> enter crypt_alloc_buffer()? I can see crypt_page_alloc() does limit the
> number of pages in the mempool to dm_crypt_pages_per_client plus I assume
> the percpu counter bias in cc->n_allocated_pages can limit the really
> available number of pages even further. So if a single bio is large enough
> to trip percpu_counter_read_positive(&cc->n_allocated_pages) >=
> dm_crypt_pages_per_client condition in crypt_page_alloc(), we can loop
> forever? But maybe this cannot happen for some reason...
> 
> If this is not it, I think we need to find out why the writeback bios are
> not completing. Probably I'd start with checking what kcryptd, presumably
> responsible for processing these bios, is doing.
> 
> 								Honza

cc->page_pool is initialized to hold BIO_MAX_VECS pages. crypt_map will 
restrict the bio size to BIO_MAX_VECS (see dm_accept_partial_bio being 
called from crypt_map).
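
Roughly (a sketch of the idea, not the exact code - the real check has a
couple of extra conditions):

    /* in crypt_map(): split anything larger than BIO_MAX_VECS pages */
    if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_VECS << PAGE_SHIFT)))
            dm_accept_partial_bio(bio,
                    ((BIO_MAX_VECS << PAGE_SHIFT) >> SECTOR_SHIFT));

so a single bio entering crypt_alloc_buffer() never needs more than
BIO_MAX_VECS pages, which is what the mempool is sized for.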

When we allocate a buffer in crypt_alloc_buffer, we try first allocation 
without waiting, then we grab the mutex and we try allocation with 
waiting.

The mutex should prevent a deadlock when two processes allocate 128 pages 
concurrently and wait for each other to free some pages.

The limit to dm_crypt_pages_per_client only applies to pages allocated 
from the kernel - when this limit is reached, we can still allocate from 
the mempool, so it shouldn't cause deadlocks.

Mikulas


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 11:49                 ` Mikulas Patocka
@ 2023-10-30 12:11                   ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2023-10-30 12:11 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Vlastimil Babka, Marek Marczykowski-Górecki,
	Andrew Morton, Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On Mon 30-10-23 12:49:01, Mikulas Patocka wrote:
> On Mon, 30 Oct 2023, Jan Kara wrote:
> > > >> What if we end up in "goto retry" more than once? I don't see a matching
> > > > 
> > > > It is impossible. Before we jump to the retry label, we set 
> > > > __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if 
> > > > __GFP_DIRECT_RECLAIM is present (it will just wait until some other task 
> > > > frees some objects into the mempool).
> > > 
> > > Ah, missed that. And the traces don't show that we would be waiting for
> > > that. I'm starting to think the allocation itself is really not the issue
> > > here. Also I don't think it deprives something else of large order pages, as
> > > per the sysrq listing they still existed.
> > > 
> > > What I rather suspect is what happens next to the allocated bio such that it
> > > works well with order-0 or up to costly_order pages, but there's some
> > > problem causing a deadlock if the bio contains larger pages than that?
> > 
> > Hum, so in all the backtraces presented we see that we are waiting for page
> > writeback to complete but I don't see anything that would be preventing the
> > bios from completing. Page writeback can submit quite large bios so it kind
> > of makes sense that it trips up some odd behavior. Looking at the code
> > I can see one possible problem in crypt_alloc_buffer() but it doesn't
> > explain why reducing initial page order would help. Anyway: Are we
> > guaranteed mempool has enough pages for arbitrarily large bio that can
> > enter crypt_alloc_buffer()? I can see crypt_page_alloc() does limit the
> > number of pages in the mempool to dm_crypt_pages_per_client plus I assume
> > the percpu counter bias in cc->n_allocated_pages can limit the really
> > available number of pages even further. So if a single bio is large enough
> > to trip percpu_counter_read_positive(&cc->n_allocated_pages) >=
> > dm_crypt_pages_per_client condition in crypt_page_alloc(), we can loop
> > forever? But maybe this cannot happen for some reason...
> > 
> > If this is not it, I think we need to find out why the writeback bios are
> > not completing. Probably I'd start with checking what kcryptd, presumably
> > responsible for processing these bios, is doing.
> > 
> > 								Honza
> 
> cc->page_pool is initialized to hold BIO_MAX_VECS pages. crypt_map will 
> restrict the bio size to BIO_MAX_VECS (see dm_accept_partial_bio being 
> called from crypt_map).
> 
> When we allocate a buffer in crypt_alloc_buffer, we try first allocation 
> without waiting, then we grab the mutex and we try allocation with 
> waiting.
> 
> The mutex should prevent a deadlock when two processes allocate 128 pages 
> concurrently and wait for each other to free some pages.
> 
> The limit to dm_crypt_pages_per_client only applies to pages allocated 
> from the kernel - when this limit is reached, we can still allocate from 
> the mempool, so it shouldn't cause deadlocks.

Ah, ok, I missed the limitation of the bio size in crypt_map(). Thanks for
the explanation! So really the only advice I have now is to check what
kcryptd is doing when the system is stuck, because we didn't see it in any
of the stacktrace dumps.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 11:30                 ` Vlastimil Babka
  2023-10-30 11:37                   ` Mikulas Patocka
@ 2023-10-30 12:25                   ` Jan Kara
  2023-10-30 13:30                     ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 66+ messages in thread
From: Jan Kara @ 2023-10-30 12:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mikulas Patocka, Marek Marczykowski-Górecki, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm, Jan Kara

On Mon 30-10-23 12:30:23, Vlastimil Babka wrote:
> On 10/30/23 12:22, Mikulas Patocka wrote:
> > On Mon, 30 Oct 2023, Vlastimil Babka wrote:
> > 
> >> Ah, missed that. And the traces don't show that we would be waiting for
> >> that. I'm starting to think the allocation itself is really not the issue
> >> here. Also I don't think it deprives something else of large order pages, as
> >> per the sysrq listing they still existed.
> >> 
> >> What I rather suspect is what happens next to the allocated bio such that it
> >> works well with order-0 or up to costly_order pages, but there's some
> >> problem causing a deadlock if the bio contains larger pages than that?
> > 
> > Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the 
> > memory allocation code and I suppose that one of them does something bad 
> > and triggers this bug. But I don't know which one.
> 
> It's not what I meant. All the interesting branches for costly order in page
> allocator/compaction only apply with __GFP_DIRECT_RECLAIM, so we can't be
> hitting those here.
> The traces I've seen suggest the allocation of the bio succeeded, and
> problems arose only after it was submitted.
> 
> I wouldn't even be surprised if the threshold for hitting the bug was not
> exactly order > PAGE_ALLOC_COSTLY_ORDER but order > PAGE_ALLOC_COSTLY_ORDER
> + 1 or + 2 (has that been tested?) or rather that there's no exact
> threshold, but probability increases with order.

Well, it would be possible that larger pages in a bio would trip e.g. bio
splitting due to maximum segment size the disk supports (which can be e.g.
0xffff) and that upsets something somewhere. But this is pure
speculation. We definitely need more debug data to be able to tell more.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 12:25                   ` Jan Kara
@ 2023-10-30 13:30                     ` Marek Marczykowski-Górecki
  2023-10-30 14:08                       ` Mikulas Patocka
  0 siblings, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-30 13:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vlastimil Babka, Mikulas Patocka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2361 bytes --]

On Mon, Oct 30, 2023 at 01:25:13PM +0100, Jan Kara wrote:
> On Mon 30-10-23 12:30:23, Vlastimil Babka wrote:
> > On 10/30/23 12:22, Mikulas Patocka wrote:
> > > On Mon, 30 Oct 2023, Vlastimil Babka wrote:
> > > 
> > >> Ah, missed that. And the traces don't show that we would be waiting for
> > >> that. I'm starting to think the allocation itself is really not the issue
> > >> here. Also I don't think it deprives something else of large order pages, as
> > >> per the sysrq listing they still existed.
> > >> 
> > >> What I rather suspect is what happens next to the allocated bio such that it
> > >> works well with order-0 or up to costly_order pages, but there's some
> > >> problem causing a deadlock if the bio contains larger pages than that?
> > > 
> > > Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the 
> > > memory allocation code and I suppose that one of them does something bad 
> > > and triggers this bug. But I don't know which one.
> > 
> > It's not what I meant. All the interesting branches for costly order in page
> > allocator/compaction only apply with __GFP_DIRECT_RECLAIM, so we can't be
> > hitting those here.
> > The traces I've seen suggest the allocation of the bio succeeded, and
> > problems arose only after it was submitted.
> > 
> > I wouldn't even be surprised if the threshold for hitting the bug was not
> > exactly order > PAGE_ALLOC_COSTLY_ORDER but order > PAGE_ALLOC_COSTLY_ORDER
> > + 1 or + 2 (has that been tested?) or rather that there's no exact
> > threshold, but probability increases with order.
> 
> Well, it would be possible that larger pages in a bio would trip e.g. bio
> splitting due to maximum segment size the disk supports (which can be e.g.
> 0xffff) and that upsets something somewhere. But this is pure
> speculation. We definitely need more debug data to be able to tell more.

I can collect more info, but I need some guidance how :) Some patch
adding extra debug messages?
Note I collect those via serial console (writing to disk doesn't work
when it freezes), and that has some limits in the amount of data I can
extract especially when printed quickly. For example sysrq-t is too much.
Or maybe there is some trick to it, like increasing log_buf_len?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 13:30                     ` Marek Marczykowski-Górecki
@ 2023-10-30 14:08                       ` Mikulas Patocka
  2023-10-30 15:56                         ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-30 14:08 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1207 bytes --]



On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:

> > Well, it would be possible that larger pages in a bio would trip e.g. bio
> > splitting due to maximum segment size the disk supports (which can be e.g.
> > 0xffff) and that upsets something somewhere. But this is pure
> > speculation. We definitely need more debug data to be able to tell more.
> 
> I can collect more info, but I need some guidance how :) Some patch
> adding extra debug messages?
> Note I collect those via serial console (writing to disk doesn't work
> when it freezes), and that has some limits in the amount of data I can
> extract especially when printed quickly. For example sysrq-t is too much.
> Or maybe there is some trick to it, like increasing log_buf_len?

If you can do more tests, I would suggest this:

We already know that it works with order 3 and doesn't work with order 4.

So, in the file include/linux/mmzone.h, change PAGE_ALLOC_COSTLY_ORDER 
from 3 to 4 and in the file drivers/md/dm-crypt.c leave "unsigned int 
order = PAGE_ALLOC_COSTLY_ORDER" there.
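
I.e. roughly these two hunks (a sketch against a 6.5 tree, the second one
being the change you already have from the earlier patch):

    --- a/include/linux/mmzone.h
    +++ b/include/linux/mmzone.h
    -#define PAGE_ALLOC_COSTLY_ORDER 3
    +#define PAGE_ALLOC_COSTLY_ORDER 4

    --- a/drivers/md/dm-crypt.c
    +++ b/drivers/md/dm-crypt.c
    -	unsigned int order = MAX_ORDER - 1;
    +	unsigned int order = PAGE_ALLOC_COSTLY_ORDER;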

Does it deadlock or not?

So, that we can see whether the deadlock depends on 
PAGE_ALLOC_COSTLY_ORDER or whether it is just a coincidence.

Mikulas

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 14:08                       ` Mikulas Patocka
@ 2023-10-30 15:56                         ` Jan Kara
  2023-10-30 16:51                           ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2023-10-30 15:56 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marek Marczykowski-Górecki, Jan Kara, Vlastimil Babka,
	Andrew Morton, Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On Mon 30-10-23 15:08:56, Mikulas Patocka wrote:
> On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
> 
> > > Well, it would be possible that larger pages in a bio would trip e.g. bio
> > > splitting due to maximum segment size the disk supports (which can be e.g.
> > > 0xffff) and that upsets something somewhere. But this is pure
> > > speculation. We definitely need more debug data to be able to tell more.
> > 
> > I can collect more info, but I need some guidance how :) Some patch
> > adding extra debug messages?
> > Note I collect those via serial console (writing to disk doesn't work
> > when it freezes), and that has some limits in the amount of data I can
> > extract especially when printed quickly. For example sysrq-t is too much.
> > Or maybe there is some trick to it, like increasing log_buf_len?
> 
> If you can do more tests, I would suggest this:
> 
> We already know that it works with order 3 and doesn't work with order 4.
> 
> So, in the file include/linux/mmzone.h, change PAGE_ALLOC_COSTLY_ORDER 
> from 3 to 4 and in the file drivers/md/dm-crypt.c leave "unsigned int 
> order = PAGE_ALLOC_COSTLY_ORDER" there.
> 
> Does it deadlock or not?
> 
> So, that we can see whether the deadlock depends on 
> PAGE_ALLOC_COSTLY_ORDER or whether it is just a coincidence.

Good idea. Also if the kernel hangs, please find kcryptd processes. In what
state are they? If they are sleeping, please send what's in
/proc/<kcryptd-pid>/stack. Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 15:56                         ` Jan Kara
@ 2023-10-30 16:51                           ` Marek Marczykowski-Górecki
  2023-10-30 17:50                             ` Mikulas Patocka
  0 siblings, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-30 16:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mikulas Patocka, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 20199 bytes --]

On Mon, Oct 30, 2023 at 04:56:03PM +0100, Jan Kara wrote:
> On Mon 30-10-23 15:08:56, Mikulas Patocka wrote:
> > On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
> > 
> > > > Well, it would be possible that larger pages in a bio would trip e.g. bio
> > > > splitting due to maximum segment size the disk supports (which can be e.g.
> > > > 0xffff) and that upsets something somewhere. But this is pure
> > > > speculation. We definitely need more debug data to be able to tell more.
> > > 
> > > I can collect more info, but I need some guidance how :) Some patch
> > > adding extra debug messages?
> > > Note I collect those via serial console (writing to disk doesn't work
> > > when it freezes), and that has some limits in the amount of data I can
> > > extract especially when printed quickly. For example sysrq-t is too much.
> > > Or maybe there is some trick to it, like increasing log_buf_len?
> > 
> > If you can do more tests, I would suggest this:
> > 
> > We already know that it works with order 3 and doesn't work with order 4.
> > 
> > So, in the file include/linux/mmzone.h, change PAGE_ALLOC_COSTLY_ORDER 
> > from 3 to 4 and in the file drivers/md/dm-crypt.c leave "unsigned int 
> > order = PAGE_ALLOC_COSTLY_ORDER" there.
> > 
> > Does it deadlock or not?
> > 
> > So, that we can see whether the deadlock depends on 
> > PAGE_ALLOC_COSTLY_ORDER or whether it is just a coincidence.
> 
> Good idea. Also if the kernel hangs, please find kcryptd processes. In what
> state are they? If they are sleeping, please send what's in
> /proc/<kcryptd-pid>/stack. Thanks!

Will do.

In the meantime, while testing the version with PAGE_ALLOC_COSTLY_ORDER=4
and order=PAGE_ALLOC_COSTLY_ORDER, I'm getting a crash like this (see the
important note below both traces):

    [   92.668486] BUG: unable to handle page fault for address: ffff8880c7b64098
    [   92.668558] #PF: supervisor read access in kernel mode
    [   92.668574] #PF: error_code(0x0000) - not-present page
    [   92.668590] PGD 2a32067 P4D 2a32067 PUD 12868a067 PMD 0 
    [   92.668617] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [   92.668637] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W          6.5.6-dirty #354
    [   92.668658] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023
    [   92.668675] RIP: e030:__free_one_page+0x301/0x3e0
    [   92.668704] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10
    [   92.668738] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006
    [   92.668754] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0
    [   92.668773] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000
    [   92.668789] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac
    [   92.668805] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8
    [   92.668821] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00
    [   92.668854] FS:  0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000
    [   92.668874] CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
    [   92.668889] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660
    [   92.668912] Call Trace:
    [   92.668924]  <IRQ>
    [   92.668934]  ? __die+0x1e/0x60
    [   92.668955]  ? page_fault_oops+0x178/0x4a0
    [   92.668975]  ? exc_page_fault+0x14e/0x160
    [   92.668994]  ? asm_exc_page_fault+0x26/0x30
    [   92.669014]  ? __free_one_page+0x301/0x3e0
    [   92.669027]  free_pcppages_bulk+0x11c/0x2b0
    [   92.669042]  free_unref_page+0x10d/0x170
    [   92.669058]  crypt_free_buffer_pages+0x1f4/0x250
    [   92.669079]  crypt_endio+0x48/0x70
    [   92.669094]  blk_mq_end_request_batch+0xd0/0x400
    [   92.669114]  nvme_irq+0x6d/0x80
    [   92.669132]  ? __pfx_nvme_pci_complete_batch+0x10/0x10
    [   92.669148]  __handle_irq_event_percpu+0x42/0x1a0
    [   92.669166]  handle_irq_event+0x33/0x70
    [   92.669179]  handle_edge_irq+0x9e/0x240
    [   92.669197]  handle_irq_desc+0x36/0x50
    [   92.669215]  __evtchn_fifo_handle_events+0x1af/0x1d0
    [   92.669236]  __xen_evtchn_do_upcall+0x70/0xd0
    [   92.669255]  __xen_pv_evtchn_do_upcall+0x3d/0x70
    [   92.669275]  xen_pv_evtchn_do_upcall+0xd9/0x110
    [   92.669294]  </IRQ>
    [   92.669302]  <TASK>
    [   92.669310]  exc_xen_hypervisor_callback+0x8/0x20
    [   92.669330] RIP: e030:xen_hypercall_xen_version+0xa/0x20
    [   92.669348] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
    [   92.669380] RSP: e02b:ffffc900400c7e08 EFLAGS: 00000246
    [   92.669394] RAX: 0000000000040011 RBX: 0000000000000000 RCX: ffffffff81fc222a
    [   92.669411] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000
    [   92.669426] RBP: ffffc900400c7eb0 R08: 0000000000000000 R09: 00000000000001ba
    [   92.669442] R10: 0000000000007ff0 R11: 0000000000000246 R12: ffff88818976c540
    [   92.669458] R13: ffff888101e10000 R14: 0000000000000000 R15: 0000000000000402
    [   92.669475]  ? xen_hypercall_xen_version+0xa/0x20
    [   92.669493]  ? pmu_msr_read+0x3c/0xd0
    [   92.669510]  ? xen_force_evtchn_callback+0xd/0x20
    [   92.669526]  ? check_events+0x16/0x30
    [   92.669540]  ? xen_irq_enable_direct+0x1d/0x30
    [   92.669556]  ? finish_task_switch.isra.0+0x8e/0x270
    [   92.669576]  ? __switch_to+0x165/0x3b0
    [   92.669590]  ? __schedule+0x316/0x8b0
    [   92.669609]  ? schedule_idle+0x25/0x40
    [   92.669626]  ? cpu_startup_entry+0x25/0x30
    [   92.669641]  ? cpu_bringup_and_idle+0x89/0xa0
    [   92.669660]  ? asm_cpu_bringup_and_idle+0x9/0x10
    [   92.669680]  </TASK>
    [   92.669688] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep hid_multitouch snd_hda_core snd_seq snd_seq_device idma64 snd_pcm i2c_designware_platform iwlwifi i2c_designware_core snd_timer snd soundcore i2c_i801 i2c_smbus efivarfs i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
    [   92.669901] CR2: ffff8880c7b64098
    [   92.669915] ---[ end trace 0000000000000000 ]---
    [   92.669930] RIP: e030:__free_one_page+0x301/0x3e0
    [   92.669946] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10
    [   92.669977] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006
    [   92.669992] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0
    [   92.670008] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000
    [   92.670024] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac
    [   92.670040] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8
    [   92.670055] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00
    [   92.670083] FS:  0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000
    [   92.670101] CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
    [   92.670116] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660
    [   92.670148] Kernel panic - not syncing: Fatal exception in interrupt
    [   92.670177] ------------[ cut here ]------------
    [   92.670188] WARNING: CPU: 5 PID: 0 at kernel/smp.c:766 smp_call_function_many_cond+0x4db/0x560
    [   92.670219] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep hid_multitouch snd_hda_core snd_seq snd_seq_device idma64 snd_pcm i2c_designware_platform iwlwifi i2c_designware_core snd_timer snd soundcore i2c_i801 i2c_smbus efivarfs i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
    [   92.670400] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G      D W          6.5.6-dirty #354
    [   92.670419] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023
    [   92.670434] RIP: e030:smp_call_function_many_cond+0x4db/0x560
    [   92.670455] Code: f0 52 1b 81 48 89 74 24 08 e8 d1 c8 f6 ff 48 8b 74 24 08 65 ff 0d 5d 70 e7 7e 0f 85 bb fd ff ff 0f 1f 44 00 00 e9 b1 fd ff ff <0f> 0b e9 69 fb ff ff 8b 7c 24 38 e8 d5 4f f7 ff 84 c0 0f 84 a8 fd
    [   92.670486] RSP: e02b:ffffc900401549d0 EFLAGS: 00010006
    [   92.670500] RAX: 0000000080010007 RBX: 0000000000000000 RCX: 0000000000000000
    [   92.670516] RDX: 0000000000000000 RSI: ffffffff826f3c61 RDI: ffffffff82cbc1d0
    [   92.670531] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000001
    [   92.670546] R10: 00000000ffffdfff R11: ffffffff82a5ddc0 R12: 0000000000000005
    [   92.670562] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [   92.670588] FS:  0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000
    [   92.670606] CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
    [   92.670620] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660
    [   92.670642] Call Trace:
    [   92.670650]  <IRQ>
    [   92.670659]  ? smp_call_function_many_cond+0x4db/0x560
    [   92.670677]  ? __warn+0x7c/0x130
    [   92.670693]  ? smp_call_function_many_cond+0x4db/0x560
    [   92.670711]  ? report_bug+0x191/0x1c0
    [   92.670726]  ? handle_bug+0x3c/0x80
    [   92.670742]  ? exc_invalid_op+0x17/0x70
    [   92.670758]  ? asm_exc_invalid_op+0x1a/0x20
    [   92.670775]  ? smp_call_function_many_cond+0x4db/0x560
    [   92.670793]  ? __pfx_stop_self+0x10/0x10
    [   92.670811]  ? _printk+0x5f/0x80
    [   92.670823]  ? __pfx_stop_self+0x10/0x10
    [   92.670840]  smp_call_function+0x38/0x70
    [   92.670857]  panic+0x19c/0x320
    [   92.670873]  oops_end+0xd8/0xe0
    [   92.670887]  page_fault_oops+0x19c/0x4a0
    [   92.670905]  exc_page_fault+0x14e/0x160
    [   92.670921]  asm_exc_page_fault+0x26/0x30
    [   92.670937] RIP: e030:__free_one_page+0x301/0x3e0
    [   92.670952] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10
    [   92.670983] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006
    [   92.670997] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0
    [   92.671013] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000
    [   92.671028] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac
    [   92.671044] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8
    [   92.671059] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00
    [   92.671077]  free_pcppages_bulk+0x11c/0x2b0
    [   92.671091]  free_unref_page+0x10d/0x170
    [   92.671106]  crypt_free_buffer_pages+0x1f4/0x250
    [   92.671122]  crypt_endio+0x48/0x70
    [   92.671136]  blk_mq_end_request_batch+0xd0/0x400
    [   92.671152]  nvme_irq+0x6d/0x80
    [   92.671166]  ? __pfx_nvme_pci_complete_batch+0x10/0x10
    [   92.671181]  __handle_irq_event_percpu+0x42/0x1a0
    [   92.671196]  handle_irq_event+0x33/0x70
    [   92.671208]  handle_edge_irq+0x9e/0x240
    [   92.671224]  handle_irq_desc+0x36/0x50
    [   92.671241]  __evtchn_fifo_handle_events+0x1af/0x1d0
    [   92.671258]  __xen_evtchn_do_upcall+0x70/0xd0
    [   92.671275]  __xen_pv_evtchn_do_upcall+0x3d/0x70
    [   92.671292]  xen_pv_evtchn_do_upcall+0xd9/0x110
    [   92.671308]  </IRQ>
    [   92.671316]  <TASK>
    [   92.671324]  exc_xen_hypervisor_callback+0x8/0x20
    [   92.671342] RIP: e030:xen_hypercall_xen_version+0xa/0x20
    [   92.671360] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
    [   92.671390] RSP: e02b:ffffc900400c7e08 EFLAGS: 00000246
    [   92.671403] RAX: 0000000000040011 RBX: 0000000000000000 RCX: ffffffff81fc222a
    [   92.671419] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000
    [   92.671434] RBP: ffffc900400c7eb0 R08: 0000000000000000 R09: 00000000000001ba
    [   92.671449] R10: 0000000000007ff0 R11: 0000000000000246 R12: ffff88818976c540
    [   92.671465] R13: ffff888101e10000 R14: 0000000000000000 R15: 0000000000000402
    [   92.671481]  ? xen_hypercall_xen_version+0xa/0x20
    [   92.671499]  ? pmu_msr_read+0x3c/0xd0
    [   92.671514]  ? xen_force_evtchn_callback+0xd/0x20
    [   92.671529]  ? check_events+0x16/0x30
    [   92.671543]  ? xen_irq_enable_direct+0x1d/0x30
    [   92.671560]  ? finish_task_switch.isra.0+0x8e/0x270
    [   92.671579]  ? __switch_to+0x165/0x3b0
    [   92.671593]  ? __schedule+0x316/0x8b0
    [   92.671611]  ? schedule_idle+0x25/0x40
    [   92.671631]  ? cpu_startup_entry+0x25/0x30
    [   92.671644]  ? cpu_bringup_and_idle+0x89/0xa0
    [   92.671663]  ? asm_cpu_bringup_and_idle+0x9/0x10
    [   92.671683]  </TASK>
    [   92.671691] ---[ end trace 0000000000000000 ]---
    [   92.671782] Kernel Offset: disabled
    (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.


If I change order=PAGE_ALLOC_COSTLY_ORDER+1, then this:


    [ 2205.112802] BUG: unable to handle page fault for address: ffffffff89630301
    [ 2205.112866] #PF: supervisor write access in kernel mode
    [ 2205.112882] #PF: error_code(0x0002) - not-present page
    [ 2205.112899] PGD 2a35067 P4D 2a35067 PUD 2a36067 PMD 0 
    [ 2205.112921] Oops: 0002 [#1] PREEMPT SMP NOPTI
    [ 2205.112946] CPU: 0 PID: 12609 Comm: kworker/u12:9 Tainted: G        W          6.5.6-dirty #355
    [ 2205.112979] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023
    [ 2205.112997] Workqueue: kcryptd/252:0 kcryptd_crypt
    [ 2205.113022] RIP: e030:get_page_from_freelist+0x281/0x10c0
    [ 2205.113044] Code: 6c 05 18 49 89 df 8b 5c 24 34 49 8b 47 18 48 39 c5 0f 84 85 02 00 00 49 8b 47 18 48 8b 48 08 48 8b 30 48 8d 50 f8 48 89 4e 08 <48> 89 31 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 83 c1 22 48 89
    [ 2205.113079] RSP: e02b:ffffc90041ea7c48 EFLAGS: 00010283
    [ 2205.113096] RAX: ffffea00059b9248 RBX: 0000000000000001 RCX: ffffffff89630301
    [ 2205.113117] RDX: ffffea00059b9240 RSI: ffffea00059b9201 RDI: 0000000000000001
    [ 2205.113134] RBP: ffff888189631718 R08: 000000000000c000 R09: 000000000003cb07
    [ 2205.113153] R10: 0000000000000001 R11: fefefefefefefeff R12: ffff888075f66b00
    [ 2205.113171] R13: ffff888189631700 R14: 0000000000000000 R15: ffff888189631700
    [ 2205.113208] FS:  0000000000000000(0000) GS:ffff888189600000(0000) knlGS:0000000000000000
    [ 2205.113229] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 2205.113245] CR2: ffffffff89630301 CR3: 00000001049d0000 CR4: 0000000000050660
    [ 2205.113272] Call Trace:
    [ 2205.113285]  <TASK>
    [ 2205.113295]  ? __die+0x1e/0x60
    [ 2205.113312]  ? page_fault_oops+0x178/0x4a0
    [ 2205.113330]  ? exc_page_fault+0x14e/0x160
    [ 2205.113346]  ? asm_exc_page_fault+0x26/0x30
    [ 2205.113363]  ? get_page_from_freelist+0x281/0x10c0
    [ 2205.113379]  ? get_page_from_freelist+0x227/0x10c0
    [ 2205.113396]  __alloc_pages+0x1dd/0x300
    [ 2205.113412]  crypt_page_alloc+0x29/0x60
    [ 2205.113425]  mempool_alloc+0x81/0x1b0
    [ 2205.113443]  kcryptd_crypt+0x293/0x4b0
    [ 2205.113458]  process_one_work+0x1e0/0x3e0
    [ 2205.113476]  worker_thread+0x49/0x3b0
    [ 2205.113490]  ? _raw_spin_lock_irqsave+0x22/0x50
    [ 2205.113510]  ? __pfx_worker_thread+0x10/0x10
    [ 2205.113527]  kthread+0xef/0x120
    [ 2205.113541]  ? __pfx_kthread+0x10/0x10
    [ 2205.113554]  ret_from_fork+0x2c/0x50
    [ 2205.113568]  ? __pfx_kthread+0x10/0x10
    [ 2205.113581]  ret_from_fork_asm+0x1b/0x30
    [ 2205.113598]  </TASK>
    [ 2205.113606] Modules linked in: snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_hda_codec_generic snd_sof ledtrig_audio snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device snd_pcm hid_multitouch snd_timer snd i2c_designware_platform i2c_designware_core i2c_i801 soundcore iwlwifi idma64 i2c_smbus i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput efivarfs
    [ 2205.113795] CR2: ffffffff89630301
    [ 2205.113808] ---[ end trace 0000000000000000 ]---
    [ 2205.113822] RIP: e030:get_page_from_freelist+0x281/0x10c0
    [ 2205.113839] Code: 6c 05 18 49 89 df 8b 5c 24 34 49 8b 47 18 48 39 c5 0f 84 85 02 00 00 49 8b 47 18 48 8b 48 08 48 8b 30 48 8d 50 f8 48 89 4e 08 <48> 89 31 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 83 c1 22 48 89
    [ 2205.113873] RSP: e02b:ffffc90041ea7c48 EFLAGS: 00010283
    [ 2205.113887] RAX: ffffea00059b9248 RBX: 0000000000000001 RCX: ffffffff89630301
    [ 2205.113903] RDX: ffffea00059b9240 RSI: ffffea00059b9201 RDI: 0000000000000001
    [ 2205.113919] RBP: ffff888189631718 R08: 000000000000c000 R09: 000000000003cb07
    [ 2205.113935] R10: 0000000000000001 R11: fefefefefefefeff R12: ffff888075f66b00
    [ 2205.113950] R13: ffff888189631700 R14: 0000000000000000 R15: ffff888189631700
    [ 2205.113980] FS:  0000000000000000(0000) GS:ffff888189600000(0000) knlGS:0000000000000000
    [ 2205.113998] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 2205.114012] CR2: ffffffff89630301 CR3: 00000001049d0000 CR4: 0000000000050660
    [ 2205.114038] note: kworker/u12:9[12609] exited with irqs disabled
    [ 2205.114100] note: kworker/u12:9[12609] exited with preempt_count 2

Then I retried with order=PAGE_ALLOC_COSTLY_ORDER and
PAGE_ALLOC_COSTLY_ORDER back at 3, and also got a similar crash.

Both happened only after logging into an X session (lightdm -> Xfce) and
starting xfce4-terminal. If I use a text console/ssh/serial and leave
lightdm at the login screen, then it does not crash (this is in fact how I
did most of the previous tests). I have no idea whether it's related to
some other bug somewhere else (graphics driver?), or simply to the higher
memory usage of the whole Xfce session running.
It could also be a coincidence; the sample size is rather small...
But it could also be some memory corruption that, depending on memory
layout, sometimes results in a crash and sometimes in "just" a storage
freeze.

Note this all is still on top of 6.5.6 with the changes we discussed here.
If you believe it's another issue that got fixed in the meantime, I can
switch to another version, but otherwise I'd like to limit the changes.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 16:51                           ` Marek Marczykowski-Górecki
@ 2023-10-30 17:50                             ` Mikulas Patocka
  2023-10-31  3:48                               ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-30 17:50 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 20899 bytes --]



On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:

> On Mon, Oct 30, 2023 at 04:56:03PM +0100, Jan Kara wrote:
> > On Mon 30-10-23 15:08:56, Mikulas Patocka wrote:
> > > On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
> > > 
> > > > > Well, it would be possible that larger pages in a bio would trip e.g. bio
> > > > > splitting due to maximum segment size the disk supports (which can be e.g.
> > > > > 0xffff) and that upsets something somewhere. But this is pure
> > > > > speculation. We definitely need more debug data to be able to tell more.
> > > > 
> > > > I can collect more info, but I need some guidance how :) Some patch
> > > > adding extra debug messages?
> > > > Note I collect those via serial console (writing to disk doesn't work
> > > > when it freezes), and that has some limits in the amount of data I can
> > > > extract especially when printed quickly. For example sysrq-t is too much.
> > > > Or maybe there is some trick to it, like increasing log_bug_len?
> > > 
> > > If you can do more tests, I would suggest this:
> > > 
> > > We already know that it works with order 3 and doesn't work with order 4.
> > > 
> > > So, in the file include/linux/mmzone.h, change PAGE_ALLOC_COSTLY_ORDER 
> > > from 3 to 4 and in the file drivers/md/dm-crypt.c leave "unsigned int 
> > > order = PAGE_ALLOC_COSTLY_ORDER" there.
> > > 
> > > Does it deadlock or not?
> > > 
> > > So, that we can see whether the deadlock depends on 
> > > PAGE_ALLOC_COSTLY_ORDER or whether it is just a coincidence.
> > 
> > Good idea. Also if the kernel hangs, please find kcryptd processes. In what
> > state are they? If they are sleeping, please send what's in
> > /proc/<kcryptd-pid>/stack. Thanks!
> 
> Will do.
> 
> In the meantime, while testing version with PAGE_ALLOC_COSTLY_ORDER=4,
> and order=PAGE_ALLOC_COSTLY_ORDER, I'm getting crash like this (see
> important note below both traces):

Perhaps the kernel uses a hardcoded value "3" instead of 
PAGE_ALLOC_COSTLY_ORDER somewhere...
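
Just to illustrate that class of bug: the sketch below is a purely
illustrative userspace program (not kernel code) that mirrors how the
number of per-cpu freelists in include/linux/mmzone.h is derived from
PAGE_ALLOC_COSTLY_ORDER; the helper name here is made up and the details
are version-dependent. Any leftover literal 3 in code that sizes or
indexes such an array would overrun it once the macro is raised to 4,
and that kind of overrun could corrupt the page-free path much like the
oops quoted below.

#include <stdio.h>

#define MIGRATE_PCPTYPES 3      /* unmovable, movable, reclaimable */

/* Mirrors NR_LOWORDER_PCP_LISTS =
 *     MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1) */
static unsigned int nr_loworder_pcp_lists(unsigned int costly_order)
{
        return MIGRATE_PCPTYPES * (costly_order + 1);
}

int main(void)
{
        /* Raising PAGE_ALLOC_COSTLY_ORDER changes the array size; any
         * code still using a literal 3 would index past the end. */
        printf("COSTLY_ORDER=3 -> %u low-order pcp lists\n",
               nr_loworder_pcp_lists(3));
        printf("COSTLY_ORDER=4 -> %u low-order pcp lists\n",
               nr_loworder_pcp_lists(4));
        return 0;
}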

>     [   92.668486] BUG: unable to handle page fault for address: ffff8880c7b64098
>     [   92.668558] #PF: supervisor read access in kernel mode
>     [   92.668574] #PF: error_code(0x0000) - not-present page
>     [   92.668590] PGD 2a32067 P4D 2a32067 PUD 12868a067 PMD 0 
>     [   92.668617] Oops: 0000 [#1] PREEMPT SMP NOPTI
>     [   92.668637] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W          6.5.6-dirty #354
>     [   92.668658] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023
>     [   92.668675] RIP: e030:__free_one_page+0x301/0x3e0
>     [   92.668704] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10
>     [   92.668738] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006
>     [   92.668754] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0
>     [   92.668773] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000
>     [   92.668789] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac
>     [   92.668805] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8
>     [   92.668821] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00
>     [   92.668854] FS:  0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000
>     [   92.668874] CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
>     [   92.668889] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660
>     [   92.668912] Call Trace:
>     [   92.668924]  <IRQ>
>     [   92.668934]  ? __die+0x1e/0x60
>     [   92.668955]  ? page_fault_oops+0x178/0x4a0
>     [   92.668975]  ? exc_page_fault+0x14e/0x160
>     [   92.668994]  ? asm_exc_page_fault+0x26/0x30
>     [   92.669014]  ? __free_one_page+0x301/0x3e0
>     [   92.669027]  free_pcppages_bulk+0x11c/0x2b0
>     [   92.669042]  free_unref_page+0x10d/0x170
>     [   92.669058]  crypt_free_buffer_pages+0x1f4/0x250
>     [   92.669079]  crypt_endio+0x48/0x70
>     [   92.669094]  blk_mq_end_request_batch+0xd0/0x400
>     [   92.669114]  nvme_irq+0x6d/0x80
>     [   92.669132]  ? __pfx_nvme_pci_complete_batch+0x10/0x10
>     [   92.669148]  __handle_irq_event_percpu+0x42/0x1a0
>     [   92.669166]  handle_irq_event+0x33/0x70
>     [   92.669179]  handle_edge_irq+0x9e/0x240
>     [   92.669197]  handle_irq_desc+0x36/0x50
>     [   92.669215]  __evtchn_fifo_handle_events+0x1af/0x1d0
>     [   92.669236]  __xen_evtchn_do_upcall+0x70/0xd0
>     [   92.669255]  __xen_pv_evtchn_do_upcall+0x3d/0x70
>     [   92.669275]  xen_pv_evtchn_do_upcall+0xd9/0x110
>     [   92.669294]  </IRQ>
>     [   92.669302]  <TASK>
>     [   92.669310]  exc_xen_hypervisor_callback+0x8/0x20
>     [   92.669330] RIP: e030:xen_hypercall_xen_version+0xa/0x20
>     [   92.669348] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
>     [   92.669380] RSP: e02b:ffffc900400c7e08 EFLAGS: 00000246
>     [   92.669394] RAX: 0000000000040011 RBX: 0000000000000000 RCX: ffffffff81fc222a
>     [   92.669411] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000
>     [   92.669426] RBP: ffffc900400c7eb0 R08: 0000000000000000 R09: 00000000000001ba
>     [   92.669442] R10: 0000000000007ff0 R11: 0000000000000246 R12: ffff88818976c540
>     [   92.669458] R13: ffff888101e10000 R14: 0000000000000000 R15: 0000000000000402
>     [   92.669475]  ? xen_hypercall_xen_version+0xa/0x20
>     [   92.669493]  ? pmu_msr_read+0x3c/0xd0
>     [   92.669510]  ? xen_force_evtchn_callback+0xd/0x20
>     [   92.669526]  ? check_events+0x16/0x30
>     [   92.669540]  ? xen_irq_enable_direct+0x1d/0x30
>     [   92.669556]  ? finish_task_switch.isra.0+0x8e/0x270
>     [   92.669576]  ? __switch_to+0x165/0x3b0
>     [   92.669590]  ? __schedule+0x316/0x8b0
>     [   92.669609]  ? schedule_idle+0x25/0x40
>     [   92.669626]  ? cpu_startup_entry+0x25/0x30
>     [   92.669641]  ? cpu_bringup_and_idle+0x89/0xa0
>     [   92.669660]  ? asm_cpu_bringup_and_idle+0x9/0x10
>     [   92.669680]  </TASK>
>     [   92.669688] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep hid_multitouch snd_hda_core snd_seq snd_seq_device idma64 snd_pcm i2c_designware_platform iwlwifi i2c_designware_core snd_timer snd soundcore i2c_i801 i2c_smbus efivarfs i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
>     [   92.669901] CR2: ffff8880c7b64098
>     [   92.669915] ---[ end trace 0000000000000000 ]---
>     [   92.669930] RIP: e030:__free_one_page+0x301/0x3e0
>     [   92.669946] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10
>     [   92.669977] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006
>     [   92.669992] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0
>     [   92.670008] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000
>     [   92.670024] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac
>     [   92.670040] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8
>     [   92.670055] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00
>     [   92.670083] FS:  0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000
>     [   92.670101] CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
>     [   92.670116] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660
>     [   92.670148] Kernel panic - not syncing: Fatal exception in interrupt
>     [   92.670177] ------------[ cut here ]------------
>     [   92.670188] WARNING: CPU: 5 PID: 0 at kernel/smp.c:766 smp_call_function_many_cond+0x4db/0x560
>     [   92.670219] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep hid_multitouch snd_hda_core snd_seq snd_seq_device idma64 snd_pcm i2c_designware_platform iwlwifi i2c_designware_core snd_timer snd soundcore i2c_i801 i2c_smbus efivarfs i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
>     [   92.670400] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G      D W          6.5.6-dirty #354
>     [   92.670419] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023
>     [   92.670434] RIP: e030:smp_call_function_many_cond+0x4db/0x560
>     [   92.670455] Code: f0 52 1b 81 48 89 74 24 08 e8 d1 c8 f6 ff 48 8b 74 24 08 65 ff 0d 5d 70 e7 7e 0f 85 bb fd ff ff 0f 1f 44 00 00 e9 b1 fd ff ff <0f> 0b e9 69 fb ff ff 8b 7c 24 38 e8 d5 4f f7 ff 84 c0 0f 84 a8 fd
>     [   92.670486] RSP: e02b:ffffc900401549d0 EFLAGS: 00010006
>     [   92.670500] RAX: 0000000080010007 RBX: 0000000000000000 RCX: 0000000000000000
>     [   92.670516] RDX: 0000000000000000 RSI: ffffffff826f3c61 RDI: ffffffff82cbc1d0
>     [   92.670531] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000001
>     [   92.670546] R10: 00000000ffffdfff R11: ffffffff82a5ddc0 R12: 0000000000000005
>     [   92.670562] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>     [   92.670588] FS:  0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000
>     [   92.670606] CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
>     [   92.670620] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660
>     [   92.670642] Call Trace:
>     [   92.670650]  <IRQ>
>     [   92.670659]  ? smp_call_function_many_cond+0x4db/0x560
>     [   92.670677]  ? __warn+0x7c/0x130
>     [   92.670693]  ? smp_call_function_many_cond+0x4db/0x560
>     [   92.670711]  ? report_bug+0x191/0x1c0
>     [   92.670726]  ? handle_bug+0x3c/0x80
>     [   92.670742]  ? exc_invalid_op+0x17/0x70
>     [   92.670758]  ? asm_exc_invalid_op+0x1a/0x20
>     [   92.670775]  ? smp_call_function_many_cond+0x4db/0x560
>     [   92.670793]  ? __pfx_stop_self+0x10/0x10
>     [   92.670811]  ? _printk+0x5f/0x80
>     [   92.670823]  ? __pfx_stop_self+0x10/0x10
>     [   92.670840]  smp_call_function+0x38/0x70
>     [   92.670857]  panic+0x19c/0x320
>     [   92.670873]  oops_end+0xd8/0xe0
>     [   92.670887]  page_fault_oops+0x19c/0x4a0
>     [   92.670905]  exc_page_fault+0x14e/0x160
>     [   92.670921]  asm_exc_page_fault+0x26/0x30
>     [   92.670937] RIP: e030:__free_one_page+0x301/0x3e0
>     [   92.670952] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10
>     [   92.670983] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006
>     [   92.670997] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0
>     [   92.671013] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000
>     [   92.671028] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac
>     [   92.671044] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8
>     [   92.671059] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00
>     [   92.671077]  free_pcppages_bulk+0x11c/0x2b0
>     [   92.671091]  free_unref_page+0x10d/0x170
>     [   92.671106]  crypt_free_buffer_pages+0x1f4/0x250
>     [   92.671122]  crypt_endio+0x48/0x70
>     [   92.671136]  blk_mq_end_request_batch+0xd0/0x400
>     [   92.671152]  nvme_irq+0x6d/0x80
>     [   92.671166]  ? __pfx_nvme_pci_complete_batch+0x10/0x10
>     [   92.671181]  __handle_irq_event_percpu+0x42/0x1a0
>     [   92.671196]  handle_irq_event+0x33/0x70
>     [   92.671208]  handle_edge_irq+0x9e/0x240
>     [   92.671224]  handle_irq_desc+0x36/0x50
>     [   92.671241]  __evtchn_fifo_handle_events+0x1af/0x1d0
>     [   92.671258]  __xen_evtchn_do_upcall+0x70/0xd0
>     [   92.671275]  __xen_pv_evtchn_do_upcall+0x3d/0x70
>     [   92.671292]  xen_pv_evtchn_do_upcall+0xd9/0x110
>     [   92.671308]  </IRQ>
>     [   92.671316]  <TASK>
>     [   92.671324]  exc_xen_hypervisor_callback+0x8/0x20
>     [   92.671342] RIP: e030:xen_hypercall_xen_version+0xa/0x20
>     [   92.671360] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
>     [   92.671390] RSP: e02b:ffffc900400c7e08 EFLAGS: 00000246
>     [   92.671403] RAX: 0000000000040011 RBX: 0000000000000000 RCX: ffffffff81fc222a
>     [   92.671419] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000
>     [   92.671434] RBP: ffffc900400c7eb0 R08: 0000000000000000 R09: 00000000000001ba
>     [   92.671449] R10: 0000000000007ff0 R11: 0000000000000246 R12: ffff88818976c540
>     [   92.671465] R13: ffff888101e10000 R14: 0000000000000000 R15: 0000000000000402
>     [   92.671481]  ? xen_hypercall_xen_version+0xa/0x20
>     [   92.671499]  ? pmu_msr_read+0x3c/0xd0
>     [   92.671514]  ? xen_force_evtchn_callback+0xd/0x20
>     [   92.671529]  ? check_events+0x16/0x30
>     [   92.671543]  ? xen_irq_enable_direct+0x1d/0x30
>     [   92.671560]  ? finish_task_switch.isra.0+0x8e/0x270
>     [   92.671579]  ? __switch_to+0x165/0x3b0
>     [   92.671593]  ? __schedule+0x316/0x8b0
>     [   92.671611]  ? schedule_idle+0x25/0x40
>     [   92.671631]  ? cpu_startup_entry+0x25/0x30
>     [   92.671644]  ? cpu_bringup_and_idle+0x89/0xa0
>     [   92.671663]  ? asm_cpu_bringup_and_idle+0x9/0x10
>     [   92.671683]  </TASK>
>     [   92.671691] ---[ end trace 0000000000000000 ]---
>     [   92.671782] Kernel Offset: disabled
>     (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> 
> 
> If I change order=PAGE_ALLOC_COSTLY_ORDER+1, then this:
> 
> 
>     [ 2205.112802] BUG: unable to handle page fault for address: ffffffff89630301
>     [ 2205.112866] #PF: supervisor write access in kernel mode
>     [ 2205.112882] #PF: error_code(0x0002) - not-present page
>     [ 2205.112899] PGD 2a35067 P4D 2a35067 PUD 2a36067 PMD 0 
>     [ 2205.112921] Oops: 0002 [#1] PREEMPT SMP NOPTI
>     [ 2205.112946] CPU: 0 PID: 12609 Comm: kworker/u12:9 Tainted: G        W          6.5.6-dirty #355
>     [ 2205.112979] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023
>     [ 2205.112997] Workqueue: kcryptd/252:0 kcryptd_crypt
>     [ 2205.113022] RIP: e030:get_page_from_freelist+0x281/0x10c0
>     [ 2205.113044] Code: 6c 05 18 49 89 df 8b 5c 24 34 49 8b 47 18 48 39 c5 0f 84 85 02 00 00 49 8b 47 18 48 8b 48 08 48 8b 30 48 8d 50 f8 48 89 4e 08 <48> 89 31 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 83 c1 22 48 89
>     [ 2205.113079] RSP: e02b:ffffc90041ea7c48 EFLAGS: 00010283
>     [ 2205.113096] RAX: ffffea00059b9248 RBX: 0000000000000001 RCX: ffffffff89630301
>     [ 2205.113117] RDX: ffffea00059b9240 RSI: ffffea00059b9201 RDI: 0000000000000001
>     [ 2205.113134] RBP: ffff888189631718 R08: 000000000000c000 R09: 000000000003cb07
>     [ 2205.113153] R10: 0000000000000001 R11: fefefefefefefeff R12: ffff888075f66b00
>     [ 2205.113171] R13: ffff888189631700 R14: 0000000000000000 R15: ffff888189631700
>     [ 2205.113208] FS:  0000000000000000(0000) GS:ffff888189600000(0000) knlGS:0000000000000000
>     [ 2205.113229] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>     [ 2205.113245] CR2: ffffffff89630301 CR3: 00000001049d0000 CR4: 0000000000050660
>     [ 2205.113272] Call Trace:
>     [ 2205.113285]  <TASK>
>     [ 2205.113295]  ? __die+0x1e/0x60
>     [ 2205.113312]  ? page_fault_oops+0x178/0x4a0
>     [ 2205.113330]  ? exc_page_fault+0x14e/0x160
>     [ 2205.113346]  ? asm_exc_page_fault+0x26/0x30
>     [ 2205.113363]  ? get_page_from_freelist+0x281/0x10c0
>     [ 2205.113379]  ? get_page_from_freelist+0x227/0x10c0
>     [ 2205.113396]  __alloc_pages+0x1dd/0x300
>     [ 2205.113412]  crypt_page_alloc+0x29/0x60
>     [ 2205.113425]  mempool_alloc+0x81/0x1b0
>     [ 2205.113443]  kcryptd_crypt+0x293/0x4b0
>     [ 2205.113458]  process_one_work+0x1e0/0x3e0
>     [ 2205.113476]  worker_thread+0x49/0x3b0
>     [ 2205.113490]  ? _raw_spin_lock_irqsave+0x22/0x50
>     [ 2205.113510]  ? __pfx_worker_thread+0x10/0x10
>     [ 2205.113527]  kthread+0xef/0x120
>     [ 2205.113541]  ? __pfx_kthread+0x10/0x10
>     [ 2205.113554]  ret_from_fork+0x2c/0x50
>     [ 2205.113568]  ? __pfx_kthread+0x10/0x10
>     [ 2205.113581]  ret_from_fork_asm+0x1b/0x30
>     [ 2205.113598]  </TASK>
>     [ 2205.113606] Modules linked in: snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_hda_codec_generic snd_sof ledtrig_audio snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device snd_pcm hid_multitouch snd_timer snd i2c_designware_platform i2c_designware_core i2c_i801 soundcore iwlwifi idma64 i2c_smbus i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput efivarfs
>     [ 2205.113795] CR2: ffffffff89630301
>     [ 2205.113808] ---[ end trace 0000000000000000 ]---
>     [ 2205.113822] RIP: e030:get_page_from_freelist+0x281/0x10c0
>     [ 2205.113839] Code: 6c 05 18 49 89 df 8b 5c 24 34 49 8b 47 18 48 39 c5 0f 84 85 02 00 00 49 8b 47 18 48 8b 48 08 48 8b 30 48 8d 50 f8 48 89 4e 08 <48> 89 31 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 83 c1 22 48 89
>     [ 2205.113873] RSP: e02b:ffffc90041ea7c48 EFLAGS: 00010283
>     [ 2205.113887] RAX: ffffea00059b9248 RBX: 0000000000000001 RCX: ffffffff89630301
>     [ 2205.113903] RDX: ffffea00059b9240 RSI: ffffea00059b9201 RDI: 0000000000000001
>     [ 2205.113919] RBP: ffff888189631718 R08: 000000000000c000 R09: 000000000003cb07
>     [ 2205.113935] R10: 0000000000000001 R11: fefefefefefefeff R12: ffff888075f66b00
>     [ 2205.113950] R13: ffff888189631700 R14: 0000000000000000 R15: ffff888189631700
>     [ 2205.113980] FS:  0000000000000000(0000) GS:ffff888189600000(0000) knlGS:0000000000000000
>     [ 2205.113998] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>     [ 2205.114012] CR2: ffffffff89630301 CR3: 00000001049d0000 CR4: 0000000000050660
>     [ 2205.114038] note: kworker/u12:9[12609] exited with irqs disabled
>     [ 2205.114100] note: kworker/u12:9[12609] exited with preempt_count 2
> 
> Then retried with order=PAGE_ALLOC_COSTLY_ORDER and
> PAGE_ALLOC_COSTLY_ORDER back at 3, and also got similar crash.

So, does it mean that even allocating with order=PAGE_ALLOC_COSTLY_ORDER 
isn't safe?

Try enabling CONFIG_DEBUG_VM (it also needs CONFIG_DEBUG_KERNEL) and try 
to provoke a similar crash. Let's see if it crashes on one of the 
VM_BUG_ON statements.
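
For context, CONFIG_DEBUG_VM turns the VM_BUG_ON* family into real BUG
checks instead of letting them compile away, so a bad page/folio state is
caught at the first bad reference rather than silently corrupting memory.
Roughly (paraphrasing include/linux/mmdebug.h; exact definitions vary
between versions, and the _PAGE/_FOLIO variants additionally dump the
offending page before calling BUG()):

#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond)         BUG_ON(cond)
#else
#define VM_BUG_ON(cond)         BUILD_BUG_ON_INVALID(cond)
#endif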

Mikulas

> Both happened only after logging into X session (lightdm -> Xfce) and
> starting xfce4-terminal. If I use text console/ssh/serial and leave
> lightdm at login screen, then it does not crash (this is in fact how I
> did most previous tests). I have no idea if it's related to some other
> bug somewhere else (graphics driver?), or simply higher memory usage due
> to the whole Xfce session running.
> It could be also a coincidence, the sample size is rather small...
> But could be also some memory corruption that depending on memory layout
> sometimes results in a crash and sometimes in "just" storage freeze.
> 
> Note this all is still on top of 6.5.6 with changes we discuss here. If
> you believe it's another issue that got fixed in the meantime, I can
> switch to another version, but otherwise I'd like to limit changes.
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-30 17:50                             ` Mikulas Patocka
@ 2023-10-31  3:48                               ` Marek Marczykowski-Górecki
  2023-10-31 14:01                                 ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-31  3:48 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 11179 bytes --]

On Mon, Oct 30, 2023 at 06:50:35PM +0100, Mikulas Patocka wrote:
> On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
> > Then retried with order=PAGE_ALLOC_COSTLY_ORDER and
> > PAGE_ALLOC_COSTLY_ORDER back at 3, and also got similar crash.
> 
> So, does it mean that even allocating with order=PAGE_ALLOC_COSTLY_ORDER 
> isn't safe?

That seems to be another bug; see below.

> Try enabling CONFIG_DEBUG_VM (it also needs CONFIG_DEBUG_KERNEL) and try 
> to provoke a similar crash. Let's see if it crashes on one of the 
> VM_BUG_ON statements.

This was a very interesting idea. With this, immediately after login I get
a crash like the one below. That makes sense, as this is when pulseaudio
starts and opens /dev/snd/*. I then tried with the dm-crypt commit
reverted and still got the crash! However, after blacklisting snd_pcm
there is no BUG splat, yet the storage freeze still happens on vanilla 6.5.6.

The snd_pcm BUG splat:

[   51.082877] page:00000000d8fdb7f1 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11b7d9
[   51.082919] flags: 0x200000000000000(node=0|zone=2)
[   51.082924] page_type: 0xffffffff()
[   51.082929] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
[   51.082934] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   51.082938] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
[   51.082969] ------------[ cut here ]------------
[   51.082972] kernel BUG at include/linux/mm.h:1406!
[   51.082980] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   51.082986] CPU: 5 PID: 3893 Comm: alsa-sink-Gener Tainted: G        W          6.5.6-dirty #359
[   51.082992] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023
[   51.082997] RIP: e030:snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm]
[   51.083015] Code: 48 2b 05 8e 7b 67 c2 48 01 f0 48 c1 e8 0c 48 c1 e0 06 48 03 05 6c 7b 67 c2 e9 4c ff ff ff 48 c7 c6 d8 71 1c c0 e8 93 1e 0e c1 <0f> 0b 48 83 ef 01 e9 4d ff ff ff 48 8b 05 51 47 89 c2 eb c9 66 66
[   51.083023] RSP: e02b:ffffc90041be7e00 EFLAGS: 00010246
[   51.083028] RAX: 000000000000005c RBX: ffffc90041be7e28 RCX: 0000000000000000
[   51.083033] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
[   51.083038] RBP: ffff888102e75f18 R08: 00000000ffffdfff R09: 0000000000000001
[   51.083042] R10: 00000000ffffdfff R11: ffffffff82a5ddc0 R12: ffff888102e75f18
[   51.083047] R13: 0000000000000255 R14: ffff888100955e80 R15: ffff888102e75f18
[   51.083056] FS:  00007f51d354f6c0(0000) GS:ffff888189740000(0000) knlGS:0000000000000000
[   51.083061] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.083065] CR2: 00007f51d36f6000 CR3: 000000011b53e000 CR4: 0000000000050660
[   51.083072] Call Trace:
[   51.083076]  <TASK>
[   51.083078]  ? die+0x31/0x80
[   51.083085]  ? do_trap+0xd5/0x100
[   51.083089]  ? snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm]
[   51.083103]  ? do_error_trap+0x65/0x90
[   51.083107]  ? snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm]
[   51.083120]  ? exc_invalid_op+0x50/0x70
[   51.083127]  ? snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm]
[   51.083140]  ? asm_exc_invalid_op+0x1a/0x20
[   51.083146]  ? snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm]
[   51.083159]  __do_fault+0x29/0x110
[   51.083165]  __handle_mm_fault+0x5fb/0xc40
[   51.083170]  handle_mm_fault+0x91/0x1e0
[   51.083173]  do_user_addr_fault+0x216/0x5d0
[   51.083179]  ? check_preemption_disabled+0x31/0xf0
[   51.083185]  exc_page_fault+0x71/0x160
[   51.083189]  asm_exc_page_fault+0x26/0x30
[   51.083195] RIP: 0033:0x7f51e56793ca
[   51.083198] Code: c5 fe 7f 07 c5 fe 7f 47 20 c5 fe 7f 47 40 c5 fe 7f 47 60 c5 f8 77 c3 66 0f 1f 84 00 00 00 00 00 40 0f b6 c6 48 89 d1 48 89 fa <f3> aa 48 89 d0 c5 f8 77 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90
[   51.083207] RSP: 002b:00007f51d354c528 EFLAGS: 00010202
[   51.083211] RAX: 0000000000000000 RBX: 00007f51d354ec80 RCX: 00000000000034e0
[   51.083216] RDX: 00007f51d36f5000 RSI: 0000000000000000 RDI: 00007f51d36f6000
[   51.083220] RBP: 000055fec98b2f60 R08: 00007f51cc0031c0 R09: 0000000000000000
[   51.083224] R10: 0000000000000000 R11: 0000000000000101 R12: 000055fec98b2f60
[   51.083228] R13: 00007f51d354c630 R14: 0000000000000000 R15: 000055fec78ba680
[   51.083233]  </TASK>
[   51.083235] Modules linked in: snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_hda_codec_generic snd_sof_pci ledtrig_audio snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device hid_multitouch snd_pcm i2c_i801 idma64 iwlwifi i2c_smbus i2c_designware_platform i2c_designware_core snd_timer snd soundcore efivarfs i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
[   51.083293] ---[ end trace 0000000000000000 ]---
[   51.083296] RIP: e030:snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm]
[   51.083310] Code: 48 2b 05 8e 7b 67 c2 48 01 f0 48 c1 e8 0c 48 c1 e0 06 48 03 05 6c 7b 67 c2 e9 4c ff ff ff 48 c7 c6 d8 71 1c c0 e8 93 1e 0e c1 <0f> 0b 48 83 ef 01 e9 4d ff ff ff 48 8b 05 51 47 89 c2 eb c9 66 66
[   51.083318] RSP: e02b:ffffc90041be7e00 EFLAGS: 00010246
[   51.083323] RAX: 000000000000005c RBX: ffffc90041be7e28 RCX: 0000000000000000
[   51.083327] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
[   51.083331] RBP: ffff888102e75f18 R08: 00000000ffffdfff R09: 0000000000000001
[   51.083335] R10: 00000000ffffdfff R11: ffffffff82a5ddc0 R12: ffff888102e75f18
[   51.083340] R13: 0000000000000255 R14: ffff888100955e80 R15: ffff888102e75f18
[   51.083347] FS:  00007f51d354f6c0(0000) GS:ffff888189740000(0000) knlGS:0000000000000000
[   51.083353] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.083356] CR2: 00007f51d36f6000 CR3: 000000011b53e000 CR4: 0000000000050660
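
For reference, the VM_BUG_ON_FOLIO condition above looks like the folio
refcount sanity check used on the get_page()/folio_get() path. A tiny
userspace sketch of the same arithmetic (check_trips() is made up here;
the expression is copied from the splat) shows that it trips exactly when
the refcount is already 0 or has gone negative, i.e. when something takes
a reference on a page that has already been freed:

#include <stdio.h>

/* The condition from the splat:
 * VM_BUG_ON_FOLIO((unsigned int)folio_ref_count(folio) + 127u <= 127u) */
static int check_trips(int refcount)
{
        return (unsigned int)refcount + 127u <= 127u;
}

int main(void)
{
        int samples[] = { -1, 0, 1, 2, 100 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                printf("refcount=%4d -> %s\n", samples[i],
                       check_trips(samples[i]) ? "BUG" : "ok");
        return 0;
}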

Having discovered that, I'm redoing the recent tests with snd_pcm
blacklisted. I'll get back to debugging the snd_pcm issue separately.

Plain 6.5.6 (so order = MAX_ORDER - 1, and PAGE_ALLOC_COSTLY_ORDER=3), in frozen state:

[  143.195348] sysrq: Show Blocked State
[  143.195471] task:lvm             state:D stack:13312 pid:4882  ppid:2025   flags:0x00004002
[  143.195504] Call Trace:
[  143.195514]  <TASK>
[  143.195526]  __schedule+0x30e/0x8b0
[  143.195550]  ? __pfx_dev_suspend+0x10/0x10
[  143.195569]  schedule+0x59/0xb0
[  143.195582]  io_schedule+0x41/0x70
[  143.195595]  dm_wait_for_completion+0x19d/0x1b0
[  143.195671]  ? __pfx_autoremove_wake_function+0x10/0x10
[  143.195693]  __dm_suspend+0x79/0x190
[  143.195707]  ? __pfx_dev_suspend+0x10/0x10
[  143.195723]  dm_internal_suspend_noflush+0x57/0x80
[  143.195740]  pool_presuspend+0xc7/0x130
[  143.195759]  dm_table_presuspend_targets+0x38/0x60
[  143.195774]  __dm_suspend+0x34/0x190
[  143.195788]  ? preempt_count_add+0x69/0xa0
[  143.195805]  ? __pfx_dev_suspend+0x10/0x10
[  143.195819]  dm_suspend+0xbb/0xe0
[  143.195835]  ? preempt_count_add+0x46/0xa0
[  143.195851]  dev_suspend+0x18e/0x2d0
[  143.195867]  ? __pfx_dev_suspend+0x10/0x10
[  143.195882]  ctl_ioctl+0x329/0x640
[  143.195901]  dm_ctl_ioctl+0x9/0x10
[  143.195917]  __x64_sys_ioctl+0x8f/0xd0
[  143.195938]  do_syscall_64+0x3c/0x90
[  143.195954]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  143.195975] RIP: 0033:0x7f2e0ab1fe0f
[  143.195989] RSP: 002b:00007ffd59a16e60 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  143.196011] RAX: ffffffffffffffda RBX: 000056289d130840 RCX: 00007f2e0ab1fe0f
[  143.196029] RDX: 000056289d120b80 RSI: 00000000c138fd06 RDI: 0000000000000003
[  143.196046] RBP: 000056289d120b80 R08: 000056289a7eb190 R09: 00007ffd59a16d20
[  143.196063] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
[  143.196080] R13: 000056289a7e4cf0 R14: 000056289a77e14d R15: 000056289d120bb0
[  143.196098]  </TASK>
[  143.196106] task:blkdiscard      state:D stack:13672 pid:4884  ppid:2025   flags:0x00000002
[  143.196130] Call Trace:
[  143.196139]  <TASK>
[  143.196147]  __schedule+0x30e/0x8b0
[  143.196162]  schedule+0x59/0xb0
[  143.196175]  schedule_timeout+0x14c/0x160
[  143.196193]  io_schedule_timeout+0x4b/0x70
[  143.196207]  wait_for_completion_io+0x81/0x130
[  143.196226]  submit_bio_wait+0x5c/0x90
[  143.196241]  blkdev_issue_discard+0x94/0xe0
[  143.196260]  blkdev_common_ioctl+0x79e/0x9c0
[  143.196279]  blkdev_ioctl+0xc7/0x270
[  143.196293]  __x64_sys_ioctl+0x8f/0xd0
[  143.196310]  do_syscall_64+0x3c/0x90
[  143.196324]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  143.196343] RIP: 0033:0x7fa6cebcee0f
[  143.196354] RSP: 002b:00007ffe6700fa80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  143.196374] RAX: ffffffffffffffda RBX: 0000000280000000 RCX: 00007fa6cebcee0f
[  143.196391] RDX: 00007ffe6700fb50 RSI: 0000000000001277 RDI: 0000000000000003
[  143.196408] RBP: 0000000000000003 R08: 0000000000000071 R09: 0000000000000004
[  143.196424] R10: 00007ffe67064170 R11: 0000000000000246 R12: 0000000040000000
[  143.196441] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  143.196460]  </TASK>

for f in $(grep -l crypt /proc/*/comm); do head $f ${f/comm/stack}; done
==> /proc/3761/comm <==
kworker/u12:7-kcryptd/252:0

==> /proc/3761/stack <==
[<0>] worker_thread+0xab/0x3b0
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/51/comm <==
cryptd

==> /proc/51/stack <==
[<0>] rescuer_thread+0x2d5/0x390
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/556/comm <==
kcryptd_io/252:

==> /proc/556/stack <==
[<0>] rescuer_thread+0x2d5/0x390
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/557/comm <==
kcryptd/252:0

==> /proc/557/stack <==
[<0>] rescuer_thread+0x2d5/0x390
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/558/comm <==
dmcrypt_write/252:0

==> /proc/558/stack <==
[<0>] dmcrypt_write+0x6a/0x140
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/717/comm <==
kworker/u12:6-kcryptd/252:0

==> /proc/717/stack <==
[<0>] worker_thread+0xab/0x3b0
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30


Then tried:
 - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
 - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
 - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly

I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
and I can't reproduce the issue there. I'm confused...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-31  3:48                               ` Marek Marczykowski-Górecki
@ 2023-10-31 14:01                                 ` Jan Kara
  2023-10-31 15:42                                   ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2023-10-31 14:01 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Mikulas Patocka, Jan Kara, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> On Mon, Oct 30, 2023 at 06:50:35PM +0100, Mikulas Patocka wrote:
> > On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
> > > Then retried with order=PAGE_ALLOC_COSTLY_ORDER and
> > > PAGE_ALLOC_COSTLY_ORDER back at 3, and also got similar crash.
> > 
> > So, does it mean that even allocating with order=PAGE_ALLOC_COSTLY_ORDER 
> > isn't safe?
> 
> That seems to be another bug, see below.
> 
> > Try enabling CONFIG_DEBUG_VM (it also needs CONFIG_DEBUG_KERNEL) and try 
> > to provoke a similar crash. Let's see if it crashes on one of the 
> > VM_BUG_ON statements.
> 
> This was very interesting idea. With this, immediately after login I get
> the crash like below. Which makes sense, as this is when pulseaudio
> starts and opens /dev/snd/*. I then tried with the dm-crypt commit
> reverted and still got the crash! But, after blacklisting snd_pcm,
> there is no BUG splat, but the storage freeze still happens on vanilla
> 6.5.6.

OK, great. Thanks for testing.

<snip snd_pcm bug>

> Plain 6.5.6 (so order = MAX_ORDER - 1, and PAGE_ALLOC_COSTLY_ORDER=3), in frozen state:
> [  143.196106] task:blkdiscard      state:D stack:13672 pid:4884  ppid:2025   flags:0x00000002
> [  143.196130] Call Trace:
> [  143.196139]  <TASK>
> [  143.196147]  __schedule+0x30e/0x8b0
> [  143.196162]  schedule+0x59/0xb0
> [  143.196175]  schedule_timeout+0x14c/0x160
> [  143.196193]  io_schedule_timeout+0x4b/0x70
> [  143.196207]  wait_for_completion_io+0x81/0x130
> [  143.196226]  submit_bio_wait+0x5c/0x90
> [  143.196241]  blkdev_issue_discard+0x94/0xe0
> [  143.196260]  blkdev_common_ioctl+0x79e/0x9c0
> [  143.196279]  blkdev_ioctl+0xc7/0x270
> [  143.196293]  __x64_sys_ioctl+0x8f/0xd0
> [  143.196310]  do_syscall_64+0x3c/0x90

So this shows there was a bio submitted and it never ran to completion.
 
> for f in $(grep -l crypt /proc/*/comm); do head $f ${f/comm/stack}; done
<snip some backtraces>

So this shows the dm-crypt layer isn't stuck anywhere; the allocation path
itself doesn't seem to be locking up, looping, or anything like that.

> Then tried:
>  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
>  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
>  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> 
> I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> and I can't reproduce the issue there. I'm confused...

And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
causing hangs is most likely just a coincidence. Rather, something either in
the block layer or in the storage driver has problems with handling bios
with sufficiently high order pages attached. This is going to be a bit
painful to debug, I'm afraid. How long does it take for you to trigger the
hang? I'm asking to get a rough estimate of how heavy tracing we can afford
so that we don't overwhelm the system...
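
Just to illustrate the scale involved, here is a purely illustrative
userspace sketch (assuming a hypothetical max segment size of 0xffff, as
was speculated about earlier in the thread, and 4 KiB pages): a single
bvec backed by a high-order page has to be split into several segments,
a path that order-0 allocations never push this hard.

#include <stdio.h>

#define PAGE_SIZE 4096u

int main(void)
{
        unsigned int max_seg = 0xffff;  /* hypothetical device limit */

        for (unsigned int order = 0; order <= 6; order++) {
                unsigned int bytes = PAGE_SIZE << order;
                unsigned int segs = (bytes + max_seg - 1) / max_seg;

                printf("order %u: %7u bytes -> %u segment(s)\n",
                       order, bytes, segs);
        }
        return 0;
}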

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-31 14:01                                 ` Jan Kara
@ 2023-10-31 15:42                                   ` Marek Marczykowski-Górecki
  2023-10-31 17:17                                     ` Mikulas Patocka
  2023-11-01  1:27                                     ` Ming Lei
  0 siblings, 2 replies; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-31 15:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mikulas Patocka, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1167 bytes --]

On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > Then tried:
> >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > 
> > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > and I can't reproduce the issue there. I'm confused...
> 
> And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> causing hangs is most likely just a coincidence. Rather something either in
> the block layer or in the storage driver has problems with handling bios
> with sufficiently high order pages attached. This is going to be a bit
> painful to debug I'm afraid. How long does it take for you trigger the
> hang? I'm asking to get rough estimate how heavy tracing we can afford so
> that we don't overwhelm the system...

Sometimes it freezes just after logging in, but in the worst case it takes
me about 10 minutes of more or less `tar xz` + `dd`.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-31 15:42                                   ` Marek Marczykowski-Górecki
@ 2023-10-31 17:17                                     ` Mikulas Patocka
  2023-10-31 17:24                                       ` Mikulas Patocka
  2023-11-01  1:27                                     ` Ming Lei
  1 sibling, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-31 17:17 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2858 bytes --]



On Tue, 31 Oct 2023, Marek Marczykowski-Górecki wrote:

> On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > Then tried:
> > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > 
> > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > and I can't reproduce the issue there. I'm confused...
> > 
> > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > causing hangs is most likely just a coincidence. Rather something either in
> > the block layer or in the storage driver has problems with handling bios
> > with sufficiently high order pages attached. This is going to be a bit
> > painful to debug I'm afraid. How long does it take for you trigger the
> > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > that we don't overwhelm the system...
> 
> Sometimes it freezes just after logging in, but in worst case it takes
> me about 10min of more or less `tar xz` + `dd`.

Hi

I would like to ask you to try this patch. Revert the changes to "order" 
and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a 
clean upstream kernel.

Does it deadlock?

There is a bug in dm-crypt: it doesn't account for large pages in 
cc->n_allocated_pages. This patch fixes that bug.

Mikulas



---
 drivers/md/dm-crypt.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux-stable/drivers/md/dm-crypt.c
===================================================================
--- linux-stable.orig/drivers/md/dm-crypt.c	2023-10-31 16:25:09.000000000 +0100
+++ linux-stable/drivers/md/dm-crypt.c	2023-10-31 16:53:14.000000000 +0100
@@ -1700,11 +1700,16 @@ retry:
 		order = min(order, remaining_order);
 
 		while (order > 0) {
+			if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) + (1 << order) > dm_crypt_pages_per_client))
+				goto decrease_order;
 			pages = alloc_pages(gfp_mask
 				| __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP,
 				order);
-			if (likely(pages != NULL))
+			if (likely(pages != NULL)) {
+				percpu_counter_add(&cc->n_allocated_pages, 1 << order);
 				goto have_pages;
+			}
+decrease_order:
 			order--;
 		}
 
@@ -1742,10 +1747,12 @@ static void crypt_free_buffer_pages(stru
 
 	if (clone->bi_vcnt > 0) { /* bio_for_each_folio_all crashes with an empty bio */
 		bio_for_each_folio_all(fi, clone) {
-			if (folio_test_large(fi.folio))
+			if (folio_test_large(fi.folio)) {
+				percpu_counter_sub(&cc->n_allocated_pages, 1 << folio_order(fi.folio));
 				folio_put(fi.folio);
-			else
+			} else {
 				mempool_free(&fi.folio->page, &cc->page_pool);
+			}
 		}
 	}
 }

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-31 17:17                                     ` Mikulas Patocka
@ 2023-10-31 17:24                                       ` Mikulas Patocka
  2023-11-02  0:38                                         ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-10-31 17:24 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4010 bytes --]



On Tue, 31 Oct 2023, Mikulas Patocka wrote:

> 
> 
> On Tue, 31 Oct 2023, Marek Marczykowski-Górecki wrote:
> 
> > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > Then tried:
> > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > 
> > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > and I can't reproduce the issue there. I'm confused...
> > > 
> > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > causing hangs is most likely just a coincidence. Rather something either in
> > > the block layer or in the storage driver has problems with handling bios
> > > with sufficiently high order pages attached. This is going to be a bit
> > > painful to debug I'm afraid. How long does it take for you trigger the
> > > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > > that we don't overwhelm the system...
> > 
> > Sometimes it freezes just after logging in, but in worst case it takes
> > me about 10min of more or less `tar xz` + `dd`.
> 
> Hi
> 
> I would like to ask you to try this patch. Revert the changes to "order" 
> and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a 
> clean upstream kernel.
> 
> Does it deadlock?
> 
> There is a bug in dm-crypt that it doesn't account large pages in 
> cc->n_allocated_pages, this patch fixes the bug.
> 
> Mikulas

If the previous patch didn't fix it, try this patch (on a clean upstream 
kernel).

This patch allocates large pages, but it breaks them up into single-page 
entries when adding them to the bio.

If this patch deadlocks, it is a sign that allocating large pages causes 
the problem.

If this patch doesn't deadlock, it is a sign that processing a bio with 
large pages is the problem.

Mikulas

---
 drivers/md/dm-crypt.c |   26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

Index: linux-stable/drivers/md/dm-crypt.c
===================================================================
--- linux-stable.orig/drivers/md/dm-crypt.c	2023-10-31 17:16:03.000000000 +0100
+++ linux-stable/drivers/md/dm-crypt.c	2023-10-31 17:21:14.000000000 +0100
@@ -1700,11 +1700,16 @@ retry:
 		order = min(order, remaining_order);
 
 		while (order > 0) {
+			if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) + (1 << order) > dm_crypt_pages_per_client))
+				goto decrease_order;
 			pages = alloc_pages(gfp_mask
 				| __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP,
 				order);
-			if (likely(pages != NULL))
+			if (likely(pages != NULL)) {
+				percpu_counter_add(&cc->n_allocated_pages, 1 << order);
 				goto have_pages;
+			}
+decrease_order:
 			order--;
 		}
 
@@ -1719,8 +1724,13 @@ retry:
 
 have_pages:
 		size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
-		__bio_add_page(clone, pages, size_to_add, 0);
 		remaining_size -= size_to_add;
+		while (size_to_add) {
+			unsigned this_step = min(size_to_add, (unsigned)PAGE_SIZE);
+			__bio_add_page(clone, pages, this_step, 0);
+			size_to_add -= this_step;
+			pages++;
+		}
 	}
 
 	/* Allocate space for integrity tags */
@@ -1739,13 +1749,21 @@ have_pages:
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
 	struct folio_iter fi;
+	unsigned skip = 0;
 
 	if (clone->bi_vcnt > 0) { /* bio_for_each_folio_all crashes with an empty bio */
 		bio_for_each_folio_all(fi, clone) {
-			if (folio_test_large(fi.folio))
+			if (skip) {
+				skip--;
+				continue;
+			}
+			if (folio_test_large(fi.folio)) {
+				skip = (1 << folio_order(fi.folio)) - 1;
+				percpu_counter_sub(&cc->n_allocated_pages, 1 << folio_order(fi.folio));
 				folio_put(fi.folio);
-			else
+			} else {
 				mempool_free(&fi.folio->page, &cc->page_pool);
+			}
 		}
 	}
 }

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-31 15:42                                   ` Marek Marczykowski-Górecki
  2023-10-31 17:17                                     ` Mikulas Patocka
@ 2023-11-01  1:27                                     ` Ming Lei
  2023-11-01  2:14                                       ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 66+ messages in thread
From: Ming Lei @ 2023-11-01  1:27 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:
>
> On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > Then tried:
> > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > >
> > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > and I can't reproduce the issue there. I'm confused...
> >
> > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > causing hangs is most likely just a coincidence. Rather something either in
> > the block layer or in the storage driver has problems with handling bios
> > with sufficiently high order pages attached. This is going to be a bit
> > painful to debug I'm afraid. How long does it take for you trigger the
> > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > that we don't overwhelm the system...
>
> Sometimes it freezes just after logging in, but in worst case it takes
> me about 10min of more or less `tar xz` + `dd`.

blk-mq debugfs is usually helpful for hang issues in the block layer or in
underlying drivers:

(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)

BTW, you can collect logs for just the exact disks if you know which
devices are behind dm-crypt (that can be figured out with `lsblk`); the
data has to be collected after the hang is triggered.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01  1:27                                     ` Ming Lei
@ 2023-11-01  2:14                                       ` Marek Marczykowski-Górecki
  2023-11-01  2:15                                         ` Marek Marczykowski-Górecki
  2023-11-01  3:24                                         ` Ming Lei
  0 siblings, 2 replies; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-11-01  2:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 12717 bytes --]

On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
> On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
> <marmarek@invisiblethingslab.com> wrote:
> >
> > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > Then tried:
> > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > >
> > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > and I can't reproduce the issue there. I'm confused...
> > >
> > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > causing hangs is most likely just a coincidence. Rather something either in
> > > the block layer or in the storage driver has problems with handling bios
> > > with sufficiently high order pages attached. This is going to be a bit
> > > painful to debug I'm afraid. How long does it take for you trigger the
> > > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > > that we don't overwhelm the system...
> >
> > Sometimes it freezes just after logging in, but in worst case it takes
> > me about 10min of more or less `tar xz` + `dd`.
> 
> blk-mq debugfs is usually helpful for hang issue in block layer or
> underlying drivers:
> 
> (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> 
> BTW,  you can just collect logs of the exact disks if you know what
> are behind dm-crypt,
> which can be figured out by `lsblk`, and it has to be collected after
> the hang is triggered.

dm-crypt lives on the nvme disk; this is what I collected when it
hung:

nvme0n1/hctx5/type:default
nvme0n1/hctx5/dispatch_busy:0
nvme0n1/hctx5/active:0
nvme0n1/hctx5/run:273
nvme0n1/hctx5/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx5/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx5/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx5/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx5/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx5/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx5/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx5/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx5/tags:nr_tags=1023
nvme0n1/hctx5/tags:nr_reserved_tags=0
nvme0n1/hctx5/tags:active_queues=0
nvme0n1/hctx5/tags:bitmap_tags:
nvme0n1/hctx5/tags:depth=1023
nvme0n1/hctx5/tags:busy=0
nvme0n1/hctx5/tags:cleared=7
nvme0n1/hctx5/tags:bits_per_word=64
nvme0n1/hctx5/tags:map_nr=16
nvme0n1/hctx5/tags:alloc_hint={633, 450, 354, 913, 651, 645}
nvme0n1/hctx5/tags:wake_batch=8
nvme0n1/hctx5/tags:wake_index=0
nvme0n1/hctx5/tags:ws_active=0
nvme0n1/hctx5/tags:ws={
nvme0n1/hctx5/tags:	{.wait=inactive},
nvme0n1/hctx5/tags:	{.wait=inactive},
nvme0n1/hctx5/tags:	{.wait=inactive},
nvme0n1/hctx5/tags:	{.wait=inactive},
nvme0n1/hctx5/tags:	{.wait=inactive},
nvme0n1/hctx5/tags:	{.wait=inactive},
nvme0n1/hctx5/tags:	{.wait=inactive},
nvme0n1/hctx5/tags:	{.wait=inactive},
nvme0n1/hctx5/tags:}
nvme0n1/hctx5/tags:round_robin=0
nvme0n1/hctx5/tags:min_shallow_depth=4294967295
nvme0n1/hctx5/ctx_map:00000000: 00
nvme0n1/hctx5/flags:alloc_policy=FIFO SHOULD_MERGE
nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
nvme0n1/hctx4/type:default
nvme0n1/hctx4/dispatch_busy:9
nvme0n1/hctx4/active:0
nvme0n1/hctx4/run:20290468
nvme0n1/hctx4/tags_bitmap:00000000: 0000 0000 0000 0000 4240 0000 0000 0000
nvme0n1/hctx4/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx4/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx4/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx4/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx4/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx4/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx4/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx4/tags:nr_tags=1023
nvme0n1/hctx4/tags:nr_reserved_tags=0
nvme0n1/hctx4/tags:active_queues=0
nvme0n1/hctx4/tags:bitmap_tags:
nvme0n1/hctx4/tags:depth=1023
nvme0n1/hctx4/tags:busy=3
nvme0n1/hctx4/tags:cleared=7
nvme0n1/hctx4/tags:bits_per_word=64
nvme0n1/hctx4/tags:map_nr=16
nvme0n1/hctx4/tags:alloc_hint={899, 846, 390, 472, 73, 439}
nvme0n1/hctx4/tags:wake_batch=8
nvme0n1/hctx4/tags:wake_index=0
nvme0n1/hctx4/tags:ws_active=0
nvme0n1/hctx4/tags:ws={
nvme0n1/hctx4/tags:	{.wait=inactive},
nvme0n1/hctx4/tags:	{.wait=inactive},
nvme0n1/hctx4/tags:	{.wait=inactive},
nvme0n1/hctx4/tags:	{.wait=inactive},
nvme0n1/hctx4/tags:	{.wait=inactive},
nvme0n1/hctx4/tags:	{.wait=inactive},
nvme0n1/hctx4/tags:	{.wait=inactive},
nvme0n1/hctx4/tags:	{.wait=inactive},
nvme0n1/hctx4/tags:}
nvme0n1/hctx4/tags:round_robin=0
nvme0n1/hctx4/tags:min_shallow_depth=4294967295
nvme0n1/hctx4/ctx_map:00000000: 01
nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
nvme0n1/hctx4/state:SCHED_RESTART
nvme0n1/hctx3/type:default
nvme0n1/hctx3/dispatch_busy:0
nvme0n1/hctx3/active:0
nvme0n1/hctx3/run:296
nvme0n1/hctx3/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx3/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx3/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx3/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx3/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx3/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx3/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx3/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx3/tags:nr_tags=1023
nvme0n1/hctx3/tags:nr_reserved_tags=0
nvme0n1/hctx3/tags:active_queues=0
nvme0n1/hctx3/tags:bitmap_tags:
nvme0n1/hctx3/tags:depth=1023
nvme0n1/hctx3/tags:busy=0
nvme0n1/hctx3/tags:cleared=23
nvme0n1/hctx3/tags:bits_per_word=64
nvme0n1/hctx3/tags:map_nr=16
nvme0n1/hctx3/tags:alloc_hint={862, 557, 480, 24, 841, 23}
nvme0n1/hctx3/tags:wake_batch=8
nvme0n1/hctx3/tags:wake_index=0
nvme0n1/hctx3/tags:ws_active=0
nvme0n1/hctx3/tags:ws={
nvme0n1/hctx3/tags:	{.wait=inactive},
nvme0n1/hctx3/tags:	{.wait=inactive},
nvme0n1/hctx3/tags:	{.wait=inactive},
nvme0n1/hctx3/tags:	{.wait=inactive},
nvme0n1/hctx3/tags:	{.wait=inactive},
nvme0n1/hctx3/tags:	{.wait=inactive},
nvme0n1/hctx3/tags:	{.wait=inactive},
nvme0n1/hctx3/tags:	{.wait=inactive},
nvme0n1/hctx3/tags:}
nvme0n1/hctx3/tags:round_robin=0
nvme0n1/hctx3/tags:min_shallow_depth=4294967295
nvme0n1/hctx3/ctx_map:00000000: 00
nvme0n1/hctx3/flags:alloc_policy=FIFO SHOULD_MERGE
nvme0n1/hctx2/type:default
nvme0n1/hctx2/dispatch_busy:0
nvme0n1/hctx2/active:0
nvme0n1/hctx2/run:279
nvme0n1/hctx2/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx2/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx2/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx2/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx2/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx2/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx2/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx2/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx2/tags:nr_tags=1023
nvme0n1/hctx2/tags:nr_reserved_tags=0
nvme0n1/hctx2/tags:active_queues=0
nvme0n1/hctx2/tags:bitmap_tags:
nvme0n1/hctx2/tags:depth=1023
nvme0n1/hctx2/tags:busy=0
nvme0n1/hctx2/tags:cleared=16
nvme0n1/hctx2/tags:bits_per_word=64
nvme0n1/hctx2/tags:map_nr=16
nvme0n1/hctx2/tags:alloc_hint={960, 528, 145, 730, 447, 1002}
nvme0n1/hctx2/tags:wake_batch=8
nvme0n1/hctx2/tags:wake_index=0
nvme0n1/hctx2/tags:ws_active=0
nvme0n1/hctx2/tags:ws={
nvme0n1/hctx2/tags:	{.wait=inactive},
nvme0n1/hctx2/tags:	{.wait=inactive},
nvme0n1/hctx2/tags:	{.wait=inactive},
nvme0n1/hctx2/tags:	{.wait=inactive},
nvme0n1/hctx2/tags:	{.wait=inactive},
nvme0n1/hctx2/tags:	{.wait=inactive},
nvme0n1/hctx2/tags:	{.wait=inactive},
nvme0n1/hctx2/tags:	{.wait=inactive},
nvme0n1/hctx2/tags:}
nvme0n1/hctx2/tags:round_robin=0
nvme0n1/hctx2/tags:min_shallow_depth=4294967295
nvme0n1/hctx2/ctx_map:00000000: 00
nvme0n1/hctx2/flags:alloc_policy=FIFO SHOULD_MERGE
nvme0n1/hctx1/type:default
nvme0n1/hctx1/dispatch_busy:0
nvme0n1/hctx1/active:0
nvme0n1/hctx1/run:458
nvme0n1/hctx1/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx1/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx1/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx1/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx1/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx1/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx1/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx1/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx1/tags:nr_tags=1023
nvme0n1/hctx1/tags:nr_reserved_tags=0
nvme0n1/hctx1/tags:active_queues=0
nvme0n1/hctx1/tags:bitmap_tags:
nvme0n1/hctx1/tags:depth=1023
nvme0n1/hctx1/tags:busy=0
nvme0n1/hctx1/tags:cleared=31
nvme0n1/hctx1/tags:bits_per_word=64
nvme0n1/hctx1/tags:map_nr=16
nvme0n1/hctx1/tags:alloc_hint={689, 284, 498, 188, 808, 610}
nvme0n1/hctx1/tags:wake_batch=8
nvme0n1/hctx1/tags:wake_index=0
nvme0n1/hctx1/tags:ws_active=0
nvme0n1/hctx1/tags:ws={
nvme0n1/hctx1/tags:	{.wait=inactive},
nvme0n1/hctx1/tags:	{.wait=inactive},
nvme0n1/hctx1/tags:	{.wait=inactive},
nvme0n1/hctx1/tags:	{.wait=inactive},
nvme0n1/hctx1/tags:	{.wait=inactive},
nvme0n1/hctx1/tags:	{.wait=inactive},
nvme0n1/hctx1/tags:	{.wait=inactive},
nvme0n1/hctx1/tags:	{.wait=inactive},
nvme0n1/hctx1/tags:}
nvme0n1/hctx1/tags:round_robin=0
nvme0n1/hctx1/tags:min_shallow_depth=4294967295
nvme0n1/hctx1/ctx_map:00000000: 00
nvme0n1/hctx1/flags:alloc_policy=FIFO SHOULD_MERGE
nvme0n1/hctx0/type:default
nvme0n1/hctx0/dispatch_busy:0
nvme0n1/hctx0/active:0
nvme0n1/hctx0/run:375
nvme0n1/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx0/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx0/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx0/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx0/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx0/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx0/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx0/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
nvme0n1/hctx0/tags:nr_tags=1023
nvme0n1/hctx0/tags:nr_reserved_tags=0
nvme0n1/hctx0/tags:active_queues=0
nvme0n1/hctx0/tags:bitmap_tags:
nvme0n1/hctx0/tags:depth=1023
nvme0n1/hctx0/tags:busy=0
nvme0n1/hctx0/tags:cleared=6
nvme0n1/hctx0/tags:bits_per_word=64
nvme0n1/hctx0/tags:map_nr=16
nvme0n1/hctx0/tags:alloc_hint={5, 18, 320, 448, 728, 9}
nvme0n1/hctx0/tags:wake_batch=8
nvme0n1/hctx0/tags:wake_index=0
nvme0n1/hctx0/tags:ws_active=0
nvme0n1/hctx0/tags:ws={
nvme0n1/hctx0/tags:	{.wait=inactive},
nvme0n1/hctx0/tags:	{.wait=inactive},
nvme0n1/hctx0/tags:	{.wait=inactive},
nvme0n1/hctx0/tags:	{.wait=inactive},
nvme0n1/hctx0/tags:	{.wait=inactive},
nvme0n1/hctx0/tags:	{.wait=inactive},
nvme0n1/hctx0/tags:	{.wait=inactive},
nvme0n1/hctx0/tags:	{.wait=inactive},
nvme0n1/hctx0/tags:}
nvme0n1/hctx0/tags:round_robin=0
nvme0n1/hctx0/tags:min_shallow_depth=4294967295
nvme0n1/hctx0/ctx_map:00000000: 00
nvme0n1/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE
nvme0n1/state:SAME_COMP|NONROT|IO_STAT|INIT_DONE|WC|FUA|REGISTERED|NOWAIT
nvme0n1/pm_only:0

This is without any of the patches Mikulas proposed.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01  2:14                                       ` Marek Marczykowski-Górecki
@ 2023-11-01  2:15                                         ` Marek Marczykowski-Górecki
  2023-11-01  2:35                                           ` Marek Marczykowski-Górecki
  2023-11-01  3:24                                         ` Ming Lei
  1 sibling, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-11-01  2:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 13463 bytes --]

On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
> On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
> > On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
> > <marmarek@invisiblethingslab.com> wrote:
> > >
> > > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > > Then tried:
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > >
> > > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > > and I can't reproduce the issue there. I'm confused...
> > > >
> > > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > > causing hangs is most likely just a coincidence. Rather something either in
> > > > the block layer or in the storage driver has problems with handling bios
> > > > with sufficiently high order pages attached. This is going to be a bit
> > > > painful to debug I'm afraid. How long does it take for you trigger the
> > > > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > > > that we don't overwhelm the system...
> > >
> > > Sometimes it freezes just after logging in, but in worst case it takes
> > > me about 10min of more or less `tar xz` + `dd`.
> > 
> > blk-mq debugfs is usually helpful for hang issue in block layer or
> > underlying drivers:
> > 
> > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> > 
> > BTW,  you can just collect logs of the exact disks if you know what
> > are behind dm-crypt,
> > which can be figured out by `lsblk`, and it has to be collected after
> > the hang is triggered.
> 
> dm-crypt lives on the nvme disk, this is what I collected when it
> hanged:
> 
> nvme0n1/hctx5/type:default
> nvme0n1/hctx5/dispatch_busy:0
> nvme0n1/hctx5/active:0
> nvme0n1/hctx5/run:273
> nvme0n1/hctx5/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx5/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx5/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx5/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx5/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx5/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx5/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx5/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx5/tags:nr_tags=1023
> nvme0n1/hctx5/tags:nr_reserved_tags=0
> nvme0n1/hctx5/tags:active_queues=0
> nvme0n1/hctx5/tags:bitmap_tags:
> nvme0n1/hctx5/tags:depth=1023
> nvme0n1/hctx5/tags:busy=0
> nvme0n1/hctx5/tags:cleared=7
> nvme0n1/hctx5/tags:bits_per_word=64
> nvme0n1/hctx5/tags:map_nr=16
> nvme0n1/hctx5/tags:alloc_hint={633, 450, 354, 913, 651, 645}
> nvme0n1/hctx5/tags:wake_batch=8
> nvme0n1/hctx5/tags:wake_index=0
> nvme0n1/hctx5/tags:ws_active=0
> nvme0n1/hctx5/tags:ws={
> nvme0n1/hctx5/tags:	{.wait=inactive},
> nvme0n1/hctx5/tags:	{.wait=inactive},
> nvme0n1/hctx5/tags:	{.wait=inactive},
> nvme0n1/hctx5/tags:	{.wait=inactive},
> nvme0n1/hctx5/tags:	{.wait=inactive},
> nvme0n1/hctx5/tags:	{.wait=inactive},
> nvme0n1/hctx5/tags:	{.wait=inactive},
> nvme0n1/hctx5/tags:	{.wait=inactive},
> nvme0n1/hctx5/tags:}
> nvme0n1/hctx5/tags:round_robin=0
> nvme0n1/hctx5/tags:min_shallow_depth=4294967295
> nvme0n1/hctx5/ctx_map:00000000: 00
> nvme0n1/hctx5/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
> nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
> nvme0n1/hctx4/type:default
> nvme0n1/hctx4/dispatch_busy:9
> nvme0n1/hctx4/active:0
> nvme0n1/hctx4/run:20290468
> nvme0n1/hctx4/tags_bitmap:00000000: 0000 0000 0000 0000 4240 0000 0000 0000
> nvme0n1/hctx4/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx4/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx4/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx4/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx4/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx4/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx4/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx4/tags:nr_tags=1023
> nvme0n1/hctx4/tags:nr_reserved_tags=0
> nvme0n1/hctx4/tags:active_queues=0
> nvme0n1/hctx4/tags:bitmap_tags:
> nvme0n1/hctx4/tags:depth=1023
> nvme0n1/hctx4/tags:busy=3
> nvme0n1/hctx4/tags:cleared=7
> nvme0n1/hctx4/tags:bits_per_word=64
> nvme0n1/hctx4/tags:map_nr=16
> nvme0n1/hctx4/tags:alloc_hint={899, 846, 390, 472, 73, 439}
> nvme0n1/hctx4/tags:wake_batch=8
> nvme0n1/hctx4/tags:wake_index=0
> nvme0n1/hctx4/tags:ws_active=0
> nvme0n1/hctx4/tags:ws={
> nvme0n1/hctx4/tags:	{.wait=inactive},
> nvme0n1/hctx4/tags:	{.wait=inactive},
> nvme0n1/hctx4/tags:	{.wait=inactive},
> nvme0n1/hctx4/tags:	{.wait=inactive},
> nvme0n1/hctx4/tags:	{.wait=inactive},
> nvme0n1/hctx4/tags:	{.wait=inactive},
> nvme0n1/hctx4/tags:	{.wait=inactive},
> nvme0n1/hctx4/tags:	{.wait=inactive},
> nvme0n1/hctx4/tags:}
> nvme0n1/hctx4/tags:round_robin=0
> nvme0n1/hctx4/tags:min_shallow_depth=4294967295
> nvme0n1/hctx4/ctx_map:00000000: 01
> nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/hctx4/state:SCHED_RESTART
> nvme0n1/hctx3/type:default
> nvme0n1/hctx3/dispatch_busy:0
> nvme0n1/hctx3/active:0
> nvme0n1/hctx3/run:296
> nvme0n1/hctx3/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx3/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx3/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx3/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx3/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx3/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx3/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx3/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx3/tags:nr_tags=1023
> nvme0n1/hctx3/tags:nr_reserved_tags=0
> nvme0n1/hctx3/tags:active_queues=0
> nvme0n1/hctx3/tags:bitmap_tags:
> nvme0n1/hctx3/tags:depth=1023
> nvme0n1/hctx3/tags:busy=0
> nvme0n1/hctx3/tags:cleared=23
> nvme0n1/hctx3/tags:bits_per_word=64
> nvme0n1/hctx3/tags:map_nr=16
> nvme0n1/hctx3/tags:alloc_hint={862, 557, 480, 24, 841, 23}
> nvme0n1/hctx3/tags:wake_batch=8
> nvme0n1/hctx3/tags:wake_index=0
> nvme0n1/hctx3/tags:ws_active=0
> nvme0n1/hctx3/tags:ws={
> nvme0n1/hctx3/tags:	{.wait=inactive},
> nvme0n1/hctx3/tags:	{.wait=inactive},
> nvme0n1/hctx3/tags:	{.wait=inactive},
> nvme0n1/hctx3/tags:	{.wait=inactive},
> nvme0n1/hctx3/tags:	{.wait=inactive},
> nvme0n1/hctx3/tags:	{.wait=inactive},
> nvme0n1/hctx3/tags:	{.wait=inactive},
> nvme0n1/hctx3/tags:	{.wait=inactive},
> nvme0n1/hctx3/tags:}
> nvme0n1/hctx3/tags:round_robin=0
> nvme0n1/hctx3/tags:min_shallow_depth=4294967295
> nvme0n1/hctx3/ctx_map:00000000: 00
> nvme0n1/hctx3/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/hctx2/type:default
> nvme0n1/hctx2/dispatch_busy:0
> nvme0n1/hctx2/active:0
> nvme0n1/hctx2/run:279
> nvme0n1/hctx2/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx2/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx2/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx2/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx2/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx2/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx2/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx2/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx2/tags:nr_tags=1023
> nvme0n1/hctx2/tags:nr_reserved_tags=0
> nvme0n1/hctx2/tags:active_queues=0
> nvme0n1/hctx2/tags:bitmap_tags:
> nvme0n1/hctx2/tags:depth=1023
> nvme0n1/hctx2/tags:busy=0
> nvme0n1/hctx2/tags:cleared=16
> nvme0n1/hctx2/tags:bits_per_word=64
> nvme0n1/hctx2/tags:map_nr=16
> nvme0n1/hctx2/tags:alloc_hint={960, 528, 145, 730, 447, 1002}
> nvme0n1/hctx2/tags:wake_batch=8
> nvme0n1/hctx2/tags:wake_index=0
> nvme0n1/hctx2/tags:ws_active=0
> nvme0n1/hctx2/tags:ws={
> nvme0n1/hctx2/tags:	{.wait=inactive},
> nvme0n1/hctx2/tags:	{.wait=inactive},
> nvme0n1/hctx2/tags:	{.wait=inactive},
> nvme0n1/hctx2/tags:	{.wait=inactive},
> nvme0n1/hctx2/tags:	{.wait=inactive},
> nvme0n1/hctx2/tags:	{.wait=inactive},
> nvme0n1/hctx2/tags:	{.wait=inactive},
> nvme0n1/hctx2/tags:	{.wait=inactive},
> nvme0n1/hctx2/tags:}
> nvme0n1/hctx2/tags:round_robin=0
> nvme0n1/hctx2/tags:min_shallow_depth=4294967295
> nvme0n1/hctx2/ctx_map:00000000: 00
> nvme0n1/hctx2/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/hctx1/type:default
> nvme0n1/hctx1/dispatch_busy:0
> nvme0n1/hctx1/active:0
> nvme0n1/hctx1/run:458
> nvme0n1/hctx1/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx1/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx1/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx1/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx1/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx1/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx1/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx1/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx1/tags:nr_tags=1023
> nvme0n1/hctx1/tags:nr_reserved_tags=0
> nvme0n1/hctx1/tags:active_queues=0
> nvme0n1/hctx1/tags:bitmap_tags:
> nvme0n1/hctx1/tags:depth=1023
> nvme0n1/hctx1/tags:busy=0
> nvme0n1/hctx1/tags:cleared=31
> nvme0n1/hctx1/tags:bits_per_word=64
> nvme0n1/hctx1/tags:map_nr=16
> nvme0n1/hctx1/tags:alloc_hint={689, 284, 498, 188, 808, 610}
> nvme0n1/hctx1/tags:wake_batch=8
> nvme0n1/hctx1/tags:wake_index=0
> nvme0n1/hctx1/tags:ws_active=0
> nvme0n1/hctx1/tags:ws={
> nvme0n1/hctx1/tags:	{.wait=inactive},
> nvme0n1/hctx1/tags:	{.wait=inactive},
> nvme0n1/hctx1/tags:	{.wait=inactive},
> nvme0n1/hctx1/tags:	{.wait=inactive},
> nvme0n1/hctx1/tags:	{.wait=inactive},
> nvme0n1/hctx1/tags:	{.wait=inactive},
> nvme0n1/hctx1/tags:	{.wait=inactive},
> nvme0n1/hctx1/tags:	{.wait=inactive},
> nvme0n1/hctx1/tags:}
> nvme0n1/hctx1/tags:round_robin=0
> nvme0n1/hctx1/tags:min_shallow_depth=4294967295
> nvme0n1/hctx1/ctx_map:00000000: 00
> nvme0n1/hctx1/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/hctx0/type:default
> nvme0n1/hctx0/dispatch_busy:0
> nvme0n1/hctx0/active:0
> nvme0n1/hctx0/run:375
> nvme0n1/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx0/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx0/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx0/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx0/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx0/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx0/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx0/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
> nvme0n1/hctx0/tags:nr_tags=1023
> nvme0n1/hctx0/tags:nr_reserved_tags=0
> nvme0n1/hctx0/tags:active_queues=0
> nvme0n1/hctx0/tags:bitmap_tags:
> nvme0n1/hctx0/tags:depth=1023
> nvme0n1/hctx0/tags:busy=0
> nvme0n1/hctx0/tags:cleared=6
> nvme0n1/hctx0/tags:bits_per_word=64
> nvme0n1/hctx0/tags:map_nr=16
> nvme0n1/hctx0/tags:alloc_hint={5, 18, 320, 448, 728, 9}
> nvme0n1/hctx0/tags:wake_batch=8
> nvme0n1/hctx0/tags:wake_index=0
> nvme0n1/hctx0/tags:ws_active=0
> nvme0n1/hctx0/tags:ws={
> nvme0n1/hctx0/tags:	{.wait=inactive},
> nvme0n1/hctx0/tags:	{.wait=inactive},
> nvme0n1/hctx0/tags:	{.wait=inactive},
> nvme0n1/hctx0/tags:	{.wait=inactive},
> nvme0n1/hctx0/tags:	{.wait=inactive},
> nvme0n1/hctx0/tags:	{.wait=inactive},
> nvme0n1/hctx0/tags:	{.wait=inactive},
> nvme0n1/hctx0/tags:	{.wait=inactive},
> nvme0n1/hctx0/tags:}
> nvme0n1/hctx0/tags:round_robin=0
> nvme0n1/hctx0/tags:min_shallow_depth=4294967295
> nvme0n1/hctx0/ctx_map:00000000: 00
> nvme0n1/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/state:SAME_COMP|NONROT|IO_STAT|INIT_DONE|WC|FUA|REGISTERED|NOWAIT
> nvme0n1/pm_only:0
> 
> This is without any of the patches Mikulas proposed.

And BTW, an attempt to access another partition on the same disk (not
covered with dm-crypt) hangs too.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01  2:15                                         ` Marek Marczykowski-Górecki
@ 2023-11-01  2:35                                           ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-11-01  2:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 335 bytes --]

On Wed, Nov 01, 2023 at 03:15:50AM +0100, Marek Marczykowski-Górecki wrote:
> And BTW, an attempt to access another partition on the same disk (not
> covered with dm-crypt) hangs too.

No, I'm not sure about that part after all; it was rather reading /bin/ls that was hanging...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01  2:14                                       ` Marek Marczykowski-Górecki
  2023-11-01  2:15                                         ` Marek Marczykowski-Górecki
@ 2023-11-01  3:24                                         ` Ming Lei
  2023-11-01 10:15                                           ` Hannes Reinecke
  1 sibling, 1 reply; 66+ messages in thread
From: Ming Lei @ 2023-11-01  3:24 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm, linux-block,
	linux-nvme, ming.lei

On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
> On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
> > On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
> > <marmarek@invisiblethingslab.com> wrote:
> > >
> > > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > > Then tried:
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > >
> > > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > > and I can't reproduce the issue there. I'm confused...
> > > >
> > > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > > causing hangs is most likely just a coincidence. Rather something either in
> > > > the block layer or in the storage driver has problems with handling bios
> > > > with sufficiently high order pages attached. This is going to be a bit
> > > > painful to debug I'm afraid. How long does it take for you trigger the
> > > > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > > > that we don't overwhelm the system...
> > >
> > > Sometimes it freezes just after logging in, but in worst case it takes
> > > me about 10min of more or less `tar xz` + `dd`.
> > 
> > blk-mq debugfs is usually helpful for hang issue in block layer or
> > underlying drivers:
> > 
> > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> > 
> > BTW,  you can just collect logs of the exact disks if you know what
> > are behind dm-crypt,
> > which can be figured out by `lsblk`, and it has to be collected after
> > the hang is triggered.
> 
> dm-crypt lives on the nvme disk, this is what I collected when it
> hanged:
> 
...
> nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
> nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}

Two requests stay in the sw queue, but they are not related to this issue.

> nvme0n1/hctx4/type:default
> nvme0n1/hctx4/dispatch_busy:9

A non-zero dispatch_busy means BLK_STS_RESOURCE has been returned from
nvme_queue_rq() recently and for most dispatch attempts.

> nvme0n1/hctx4/active:0
> nvme0n1/hctx4/run:20290468

...

> nvme0n1/hctx4/tags:nr_tags=1023
> nvme0n1/hctx4/tags:nr_reserved_tags=0
> nvme0n1/hctx4/tags:active_queues=0
> nvme0n1/hctx4/tags:bitmap_tags:
> nvme0n1/hctx4/tags:depth=1023
> nvme0n1/hctx4/tags:busy=3

Just three requests are in flight: two are in the sw queue, and another is in hctx->dispatch.

...

> nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/hctx4/state:SCHED_RESTART

The request staying in hctx->dispatch can't move on, and nvme_queue_rq()
keeps returning BLK_STS_RESOURCE; you can verify this with the following
bpftrace one-liner once the hang is triggered:

	bpftrace -e 'kretfunc:nvme_queue_rq  { @[retval, kstack]=count() }'

It is very likely that a memory allocation inside nvme_queue_rq() can't be
satisfied, so blk-mq just has to retry by calling nvme_queue_rq() on the
above request over and over.
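
A slightly expanded variant of the same probe - just a sketch, assuming
BTF-based kfunc/kretfunc attach to nvme_queue_rq() works on this kernel -
also counts dispatch attempts per request tag, which makes it obvious
whether one and the same request is being retried:

	bpftrace -e '
	/* sketch: requires kernel BTF for kfunc/kretfunc attach */
	kfunc:nvme_queue_rq    { @per_tag[args->bd->rq->tag] = count(); }
	kretfunc:nvme_queue_rq { @per_ret[retval, kstack] = count(); }
	interval:s:5           { print(@per_tag); print(@per_ret); }'

If a single tag keeps climbing while @per_ret is dominated by a non-OK
status, that is the stuck retry loop.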


Thanks,
Ming

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01  3:24                                         ` Ming Lei
@ 2023-11-01 10:15                                           ` Hannes Reinecke
  2023-11-01 10:26                                             ` Jan Kara
                                                               ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Hannes Reinecke @ 2023-11-01 10:15 UTC (permalink / raw)
  To: Ming Lei, Marek Marczykowski-Górecki
  Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm, linux-block,
	linux-nvme, ming.lei

On 11/1/23 04:24, Ming Lei wrote:
> On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
>> On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
>>> On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
>>> <marmarek@invisiblethingslab.com> wrote:
>>>>
>>>> On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
>>>>> On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
>>>>>> Then tried:
>>>>>>   - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
>>>>>>   - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
>>>>>>   - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
>>>>>>
>>>>>> I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
>>>>>> and I can't reproduce the issue there. I'm confused...
>>>>>
>>>>> And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
>>>>> causing hangs is most likely just a coincidence. Rather something either in
>>>>> the block layer or in the storage driver has problems with handling bios
>>>>> with sufficiently high order pages attached. This is going to be a bit
>>>>> painful to debug I'm afraid. How long does it take for you trigger the
>>>>> hang? I'm asking to get rough estimate how heavy tracing we can afford so
>>>>> that we don't overwhelm the system...
>>>>
>>>> Sometimes it freezes just after logging in, but in worst case it takes
>>>> me about 10min of more or less `tar xz` + `dd`.
>>>
>>> blk-mq debugfs is usually helpful for hang issue in block layer or
>>> underlying drivers:
>>>
>>> (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
>>>
>>> BTW,  you can just collect logs of the exact disks if you know what
>>> are behind dm-crypt,
>>> which can be figured out by `lsblk`, and it has to be collected after
>>> the hang is triggered.
>>
>> dm-crypt lives on the nvme disk, this is what I collected when it
>> hanged:
>>
> ...
>> nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
>> nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
> 
> Two requests stays in sw queue, but not related with this issue.
> 
>> nvme0n1/hctx4/type:default
>> nvme0n1/hctx4/dispatch_busy:9
> 
> non-zero dispatch_busy means BLK_STS_RESOURCE is returned from
> nvme_queue_rq() recently and mostly.
> 
>> nvme0n1/hctx4/active:0
>> nvme0n1/hctx4/run:20290468
> 
> ...
> 
>> nvme0n1/hctx4/tags:nr_tags=1023
>> nvme0n1/hctx4/tags:nr_reserved_tags=0
>> nvme0n1/hctx4/tags:active_queues=0
>> nvme0n1/hctx4/tags:bitmap_tags:
>> nvme0n1/hctx4/tags:depth=1023
>> nvme0n1/hctx4/tags:busy=3
> 
> Just three requests in-flight, two are in sw queue, another is in hctx->dispatch.
> 
> ...
> 
>> nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
>> nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
>> nvme0n1/hctx4/state:SCHED_RESTART
> 
> The request staying in hctx->dispatch can't move on, and nvme_queue_rq()
> returns -BLK_STS_RESOURCE constantly, and you can verify with
> the following bpftrace when the hang is triggered:
> 
> 	bpftrace -e 'kretfunc:nvme_queue_rq  { @[retval, kstack]=count() }'
> 
> It is very likely that memory allocation inside nvme_queue_rq()
> can't be done successfully, then blk-mq just have to retry by calling
> nvme_queue_rq() on the above request.
> 
And that is something I've been wondering (for quite some time now):
What _is_ the appropriate error handling for -ENOMEM?
At this time, we assume it to be a retryable error and re-run the queue
in the hope that things will sort themselves out.
But if they don't, we're stuck.
Can we somehow figure out whether we are making progress during submission,
and (at least) issue a warning once we detect a stall?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01 10:15                                           ` Hannes Reinecke
@ 2023-11-01 10:26                                             ` Jan Kara
  2023-11-01 11:23                                             ` Ming Lei
  2023-11-01 12:16                                             ` Mikulas Patocka
  2 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2023-11-01 10:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, Marek Marczykowski-Górecki, Jan Kara,
	Mikulas Patocka, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm, linux-block, linux-nvme, ming.lei

On Wed 01-11-23 11:15:02, Hannes Reinecke wrote:
> On 11/1/23 04:24, Ming Lei wrote:
> > On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
> > > On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
> > > > On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
> > > > <marmarek@invisiblethingslab.com> wrote:
> > > > > 
> > > > > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > > > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > > > > Then tried:
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > > > > 
> > > > > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > > > > and I can't reproduce the issue there. I'm confused...
> > > > > > 
> > > > > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > > > > causing hangs is most likely just a coincidence. Rather something either in
> > > > > > the block layer or in the storage driver has problems with handling bios
> > > > > > with sufficiently high order pages attached. This is going to be a bit
> > > > > > painful to debug I'm afraid. How long does it take for you trigger the
> > > > > > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > > > > > that we don't overwhelm the system...
> > > > > 
> > > > > Sometimes it freezes just after logging in, but in worst case it takes
> > > > > me about 10min of more or less `tar xz` + `dd`.
> > > > 
> > > > blk-mq debugfs is usually helpful for hang issue in block layer or
> > > > underlying drivers:
> > > > 
> > > > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> > > > 
> > > > BTW,  you can just collect logs of the exact disks if you know what
> > > > are behind dm-crypt,
> > > > which can be figured out by `lsblk`, and it has to be collected after
> > > > the hang is triggered.
> > > 
> > > dm-crypt lives on the nvme disk, this is what I collected when it
> > > hanged:
> > > 
> > ...
> > > nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
> > > nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
> > 
> > Two requests stays in sw queue, but not related with this issue.
> > 
> > > nvme0n1/hctx4/type:default
> > > nvme0n1/hctx4/dispatch_busy:9
> > 
> > non-zero dispatch_busy means BLK_STS_RESOURCE is returned from
> > nvme_queue_rq() recently and mostly.
> > 
> > > nvme0n1/hctx4/active:0
> > > nvme0n1/hctx4/run:20290468
> > 
> > ...
> > 
> > > nvme0n1/hctx4/tags:nr_tags=1023
> > > nvme0n1/hctx4/tags:nr_reserved_tags=0
> > > nvme0n1/hctx4/tags:active_queues=0
> > > nvme0n1/hctx4/tags:bitmap_tags:
> > > nvme0n1/hctx4/tags:depth=1023
> > > nvme0n1/hctx4/tags:busy=3
> > 
> > Just three requests in-flight, two are in sw queue, another is in hctx->dispatch.
> > 
> > ...
> > 
> > > nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> > > nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
> > > nvme0n1/hctx4/state:SCHED_RESTART
> > 
> > The request staying in hctx->dispatch can't move on, and nvme_queue_rq()
> > returns -BLK_STS_RESOURCE constantly, and you can verify with
> > the following bpftrace when the hang is triggered:
> > 
> > 	bpftrace -e 'kretfunc:nvme_queue_rq  { @[retval, kstack]=count() }'
> > 
> > It is very likely that memory allocation inside nvme_queue_rq()
> > can't be done successfully, then blk-mq just have to retry by calling
> > nvme_queue_rq() on the above request.
> > 
> And that is something I've been wondering (for quite some time now):
> What _is_ the appropriate error handling for -ENOMEM?
> At this time, we assume it to be a retryable error and re-run the queue
> in the hope that things will sort itself out.
> But if they don't we're stuck.
> Can we somehow figure out if we make progress during submission, and (at
> least) issue a warning once we detect a stall?

Well, but Marek has shown [1] that the machine is pretty far from being OOM
when it is stuck. So it doesn't seem like a simple OOM situation...

								Honza

[1] https://lore.kernel.org/all/ZTiJ3CO8w0jauOzW@mail-itl/

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01 10:15                                           ` Hannes Reinecke
  2023-11-01 10:26                                             ` Jan Kara
@ 2023-11-01 11:23                                             ` Ming Lei
  2023-11-02 14:02                                               ` Keith Busch
  2023-11-01 12:16                                             ` Mikulas Patocka
  2 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2023-11-01 11:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, Marek Marczykowski-Górecki, Jan Kara,
	Mikulas Patocka, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm, linux-block, linux-nvme

On Wed, Nov 01, 2023 at 11:15:02AM +0100, Hannes Reinecke wrote:
> On 11/1/23 04:24, Ming Lei wrote:
> > On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
> > > On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
> > > > On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
> > > > <marmarek@invisiblethingslab.com> wrote:
> > > > > 
> > > > > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > > > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > > > > Then tried:
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > > > > 
> > > > > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > > > > and I can't reproduce the issue there. I'm confused...
> > > > > > 
> > > > > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > > > > causing hangs is most likely just a coincidence. Rather something either in
> > > > > > the block layer or in the storage driver has problems with handling bios
> > > > > > with sufficiently high order pages attached. This is going to be a bit
> > > > > > painful to debug I'm afraid. How long does it take for you trigger the
> > > > > > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > > > > > that we don't overwhelm the system...
> > > > > 
> > > > > Sometimes it freezes just after logging in, but in worst case it takes
> > > > > me about 10min of more or less `tar xz` + `dd`.
> > > > 
> > > > blk-mq debugfs is usually helpful for hang issue in block layer or
> > > > underlying drivers:
> > > > 
> > > > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> > > > 
> > > > BTW,  you can just collect logs of the exact disks if you know what
> > > > are behind dm-crypt,
> > > > which can be figured out by `lsblk`, and it has to be collected after
> > > > the hang is triggered.
> > > 
> > > dm-crypt lives on the nvme disk, this is what I collected when it
> > > hanged:
> > > 
> > ...
> > > nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
> > > nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
> > 
> > Two requests stays in sw queue, but not related with this issue.
> > 
> > > nvme0n1/hctx4/type:default
> > > nvme0n1/hctx4/dispatch_busy:9
> > 
> > non-zero dispatch_busy means BLK_STS_RESOURCE is returned from
> > nvme_queue_rq() recently and mostly.
> > 
> > > nvme0n1/hctx4/active:0
> > > nvme0n1/hctx4/run:20290468
> > 
> > ...
> > 
> > > nvme0n1/hctx4/tags:nr_tags=1023
> > > nvme0n1/hctx4/tags:nr_reserved_tags=0
> > > nvme0n1/hctx4/tags:active_queues=0
> > > nvme0n1/hctx4/tags:bitmap_tags:
> > > nvme0n1/hctx4/tags:depth=1023
> > > nvme0n1/hctx4/tags:busy=3
> > 
> > Just three requests in-flight, two are in sw queue, another is in hctx->dispatch.
> > 
> > ...
> > 
> > > nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> > > nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
> > > nvme0n1/hctx4/state:SCHED_RESTART
> > 
> > The request staying in hctx->dispatch can't move on, and nvme_queue_rq()
> > returns -BLK_STS_RESOURCE constantly, and you can verify with
> > the following bpftrace when the hang is triggered:
> > 
> > 	bpftrace -e 'kretfunc:nvme_queue_rq  { @[retval, kstack]=count() }'
> > 
> > It is very likely that memory allocation inside nvme_queue_rq()
> > can't be done successfully, then blk-mq just have to retry by calling
> > nvme_queue_rq() on the above request.
> > 
> And that is something I've been wondering (for quite some time now):
> What _is_ the appropriate error handling for -ENOMEM?

It is just my guess.

Actually it shouldn't fail, since the sgl allocation is backed by a memory
pool, but there are also the dma pool allocation and the dma mapping.

> At this time, we assume it to be a retryable error and re-run the queue
> in the hope that things will sort itself out.

It should not be hard to figure out why nvme_queue_rq() can't move on.

> But if they don't we're stuck.
> Can we somehow figure out if we make progress during submission, and (at
> least) issue a warning once we detect a stall?

That would need counting request retries, and people often hate to add
anything to the request or bio in the fast path. Also, this kind of issue is
easy to see in blk-mq debugfs or with bpftrace.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01 10:15                                           ` Hannes Reinecke
  2023-11-01 10:26                                             ` Jan Kara
  2023-11-01 11:23                                             ` Ming Lei
@ 2023-11-01 12:16                                             ` Mikulas Patocka
  2 siblings, 0 replies; 66+ messages in thread
From: Mikulas Patocka @ 2023-11-01 12:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, Marek Marczykowski-Górecki, Jan Kara,
	Vlastimil Babka, Andrew Morton, Matthew Wilcox, Michal Hocko,
	stable, regressions, Alasdair Kergon, Mike Snitzer, dm-devel,
	linux-mm, linux-block, linux-nvme, ming.lei



On Wed, 1 Nov 2023, Hannes Reinecke wrote:

> And that is something I've been wondering (for quite some time now):
> What _is_ the appropriate error handling for -ENOMEM?
> At this time, we assume it to be a retryable error and re-run the queue
> in the hope that things will sort itself out.
> But if they don't we're stuck.
> Can we somehow figure out if we make progress during submission, and (at
> least) issue a warning once we detect a stall?

The appropriate way is to use mempools. mempool_alloc (with 
__GFP_DIRECT_RECLAIM) can't ever fail.

But some kernel code does GFP_NOIO allocations in the I/O path and the 
authors hope that they get away with it.
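
A minimal sketch of that pattern (illustrative only - the names, sizes and
pool depth here are hypothetical, not taken from any existing driver):

	#include <linux/mempool.h>
	#include <linux/gfp.h>

	#define MY_BUF_SIZE 256			/* hypothetical per-request buffer size */

	static mempool_t *my_pool;		/* hypothetical pool, created at init time */

	static int my_init(void)
	{
		/* Preallocate 16 elements the pool can always fall back on. */
		my_pool = mempool_create_kmalloc_pool(16, MY_BUF_SIZE);
		return my_pool ? 0 : -ENOMEM;
	}

	static void *my_alloc_for_io(void)
	{
		/*
		 * GFP_NOIO includes __GFP_DIRECT_RECLAIM, so mempool_alloc()
		 * sleeps and falls back to the preallocated elements instead
		 * of ever returning NULL.
		 */
		return mempool_alloc(my_pool, GFP_NOIO);
	}

	static void my_free_after_io(void *buf)
	{
		mempool_free(buf, my_pool);
	}

As long as every allocation in the submission path is covered by such a
pool and allocated elements are eventually freed, submission can stall
briefly under memory pressure but never fails outright.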

Mikulas


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-10-31 17:24                                       ` Mikulas Patocka
@ 2023-11-02  0:38                                         ` Marek Marczykowski-Górecki
  2023-11-02  9:28                                           ` Mikulas Patocka
  0 siblings, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-11-02  0:38 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4542 bytes --]

On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
> 
> 
> On Tue, 31 Oct 2023, Mikulas Patocka wrote:
> 
> > 
> > 
> > On Tue, 31 Oct 2023, Marek Marczykowski-Górecki wrote:
> > 
> > > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > > Then tried:
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > > 
> > > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > > and I can't reproduce the issue there. I'm confused...
> > > > 
> > > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > > causing hangs is most likely just a coincidence. Rather something either in
> > > > the block layer or in the storage driver has problems with handling bios
> > > > with sufficiently high order pages attached. This is going to be a bit
> > > > painful to debug I'm afraid. How long does it take for you trigger the
> > > > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > > > that we don't overwhelm the system...
> > > 
> > > Sometimes it freezes just after logging in, but in worst case it takes
> > > me about 10min of more or less `tar xz` + `dd`.
> > 
> > Hi
> > 
> > I would like to ask you to try this patch. Revert the changes to "order" 
> > and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a 
> > clean upstream kernel.
> > 
> > Does it deadlock?
> > 
> > There is a bug in dm-crypt that it doesn't account large pages in 
> > cc->n_allocated_pages, this patch fixes the bug.

This patch did not help.

> If the previous patch didn't fix it, try this patch (on a clean upstream 
> kernel).
>
> This patch allocates large pages, but it breaks them up into single-page 
> entries when adding them to the bio.

But this does help.

> If this patch deadlocks, it is a sign that allocating large pages causes 
> the problem.
> 
> If this patch doesn't deadlock, it is a sign that processing a bio with 
> large pages is the problem.
> 
> Mikulas
> 
> ---
>  drivers/md/dm-crypt.c |   26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> Index: linux-stable/drivers/md/dm-crypt.c
> ===================================================================
> --- linux-stable.orig/drivers/md/dm-crypt.c	2023-10-31 17:16:03.000000000 +0100
> +++ linux-stable/drivers/md/dm-crypt.c	2023-10-31 17:21:14.000000000 +0100
> @@ -1700,11 +1700,16 @@ retry:
>  		order = min(order, remaining_order);
>  
>  		while (order > 0) {
> +			if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) + (1 << order) > dm_crypt_pages_per_client))
> +				goto decrease_order;
>  			pages = alloc_pages(gfp_mask
>  				| __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP,
>  				order);
> -			if (likely(pages != NULL))
> +			if (likely(pages != NULL)) {
> +				percpu_counter_add(&cc->n_allocated_pages, 1 << order);
>  				goto have_pages;
> +			}
> +decrease_order:
>  			order--;
>  		}
>  
> @@ -1719,8 +1724,13 @@ retry:
>  
>  have_pages:
>  		size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
> -		__bio_add_page(clone, pages, size_to_add, 0);
>  		remaining_size -= size_to_add;
> +		while (size_to_add) {
> +			unsigned this_step = min(size_to_add, (unsigned)PAGE_SIZE);
> +			__bio_add_page(clone, pages, this_step, 0);
> +			size_to_add -= this_step;
> +			pages++;
> +		}
>  	}
>  
>  	/* Allocate space for integrity tags */
> @@ -1739,13 +1749,21 @@ have_pages:
>  static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
>  {
>  	struct folio_iter fi;
> +	unsigned skip = 0;
>  
>  	if (clone->bi_vcnt > 0) { /* bio_for_each_folio_all crashes with an empty bio */
>  		bio_for_each_folio_all(fi, clone) {
> -			if (folio_test_large(fi.folio))
> +			if (skip) {
> +				skip--;
> +				continue;
> +			}
> +			if (folio_test_large(fi.folio)) {
> +				skip = (1 << folio_order(fi.folio)) - 1;
> +				percpu_counter_sub(&cc->n_allocated_pages, 1 << folio_order(fi.folio));
>  				folio_put(fi.folio);
> -			else
> +			} else {
>  				mempool_free(&fi.folio->page, &cc->page_pool);
> +			}
>  		}
>  	}
>  }


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-02  0:38                                         ` Marek Marczykowski-Górecki
@ 2023-11-02  9:28                                           ` Mikulas Patocka
  2023-11-02 11:45                                             ` Marek Marczykowski-Górecki
  2023-11-02 12:21                                             ` Jan Kara
  0 siblings, 2 replies; 66+ messages in thread
From: Mikulas Patocka @ 2023-11-02  9:28 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg
  Cc: Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1945 bytes --]



On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:

> On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
> 
> > > Hi
> > > 
> > > I would like to ask you to try this patch. Revert the changes to "order" 
> > > and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a 
> > > clean upstream kernel.
> > > 
> > > Does it deadlock?
> > > 
> > > There is a bug in dm-crypt that it doesn't account large pages in 
> > > cc->n_allocated_pages, this patch fixes the bug.
> 
> This patch did not help.
> 
> > If the previous patch didn't fix it, try this patch (on a clean upstream 
> > kernel).
> >
> > This patch allocates large pages, but it breaks them up into single-page 
> > entries when adding them to the bio.
> 
> But this does help.

Thanks. So we can stop blaming the memory allocator and start blaming the 
NVMe subsystem.


I added the NVMe maintainers to this thread - the summary of the problem is: 
In dm-crypt, we allocate a large compound page and add this compound page 
to the bio as a single big vector entry. Marek reports that on his system 
this causes deadlocks; the deadlocks look like a lost bio that is never 
completed. When I chop the large compound page into individual pages in 
dm-crypt and add a bio vector for each of them, Marek reports that there are 
no longer any deadlocks. So, we have a problem (either hardware or 
software) where the NVMe subsystem doesn't like bio vectors with a large 
bv_len. This is the original bug report: 
https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/


Marek, what NVMe devices do you use? Do you use the same device on all 3 
machines where you hit this bug?

In the directory /sys/block/nvme0n1/queue: what are the values of 
dma_alignment, max_hw_sectors_kb, max_sectors_kb, max_segment_size, 
max_segments and virt_boundary_mask?

Try lowering /sys/block/nvme0n1/queue/max_sectors_kb to some small value 
(for example 64) and test whether it helps.
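
For example (assuming the device really is nvme0n1; the setting does not
survive a reboot):

	# read the current limits
	cd /sys/block/nvme0n1/queue
	grep -H . dma_alignment max_hw_sectors_kb max_sectors_kb \
	          max_segment_size max_segments virt_boundary_mask

	# cap requests at 64 KiB, then rerun the reproducer
	echo 64 > max_sectors_kb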

Mikulas

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-02  9:28                                           ` Mikulas Patocka
@ 2023-11-02 11:45                                             ` Marek Marczykowski-Górecki
  2023-11-02 17:06                                               ` Mikulas Patocka
  2023-11-02 12:21                                             ` Jan Kara
  1 sibling, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-11-02 11:45 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2664 bytes --]

On Thu, Nov 02, 2023 at 10:28:57AM +0100, Mikulas Patocka wrote:
> 
> 
> On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
> 
> > On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
> > 
> > > > Hi
> > > > 
> > > > I would like to ask you to try this patch. Revert the changes to "order" 
> > > > and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a 
> > > > clean upstream kernel.
> > > > 
> > > > Does it deadlock?
> > > > 
> > > > There is a bug in dm-crypt that it doesn't account large pages in 
> > > > cc->n_allocated_pages, this patch fixes the bug.
> > 
> > This patch did not help.
> > 
> > > If the previous patch didn't fix it, try this patch (on a clean upstream 
> > > kernel).
> > >
> > > This patch allocates large pages, but it breaks them up into single-page 
> > > entries when adding them to the bio.
> > 
> > But this does help.
> 
> Thanks. So we can stop blaming the memory allocator and start blaming the 
> NVMe subsystem.
> 
> 
> I added NVMe maintainers to this thread - the summary of the problem is: 
> In dm-crypt, we allocate a large compound page and add this compound page 
> to the bio as a single big vector entry. Marek reports that on his system 
> it causes deadlocks, the deadlocks look like a lost bio that was never 
> completed. When I chop the large compound page to individual pages in 
> dm-crypt and add bio vector for each of them, Marek reports that there are 
> no longer any deadlocks. So, we have a problem (either hardware or 
> software) that the NVMe subsystem doesn't like bio vectors with large 
> bv_len. This is the original bug report: 
> https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/
> 
> 
> Marek, what NVMe devices do you use? Do you use the same device on all 3 
> machines where you hit this bug?

This one is a "Star Drive PCIe SSD"; another one is a "Samsung SSD 970 EVO
Plus 1TB". I can't check the third one right now.

> In the directory /sys/block/nvme0n1/queue: what is the value of 
> dma_alignment, max_hw_sectors_kb, max_sectors_kb, max_segment_size, 
> max_segments, virt_boundary_mask?

/sys/block/nvme0n1/queue/dma_alignment:3
/sys/block/nvme0n1/queue/max_hw_sectors_kb:2048
/sys/block/nvme0n1/queue/max_sectors_kb:1280
/sys/block/nvme0n1/queue/max_segment_size:4294967295
/sys/block/nvme0n1/queue/max_segments:128
/sys/block/nvme0n1/queue/virt_boundary_mask:4095

> Try lowring /sys/block/nvme0n1/queue/max_sectors_kb to some small value 
> (for example 64) and test if it helps.

Yes, this helps too.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-02  9:28                                           ` Mikulas Patocka
  2023-11-02 11:45                                             ` Marek Marczykowski-Górecki
@ 2023-11-02 12:21                                             ` Jan Kara
  1 sibling, 0 replies; 66+ messages in thread
From: Jan Kara @ 2023-11-02 12:21 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marek Marczykowski-Górecki, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jan Kara, Vlastimil Babka,
	Andrew Morton, Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On Thu 02-11-23 10:28:57, Mikulas Patocka wrote:
> On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
> > On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
> > 
> > > > Hi
> > > > 
> > > > I would like to ask you to try this patch. Revert the changes to "order" 
> > > > and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a 
> > > > clean upstream kernel.
> > > > 
> > > > Does it deadlock?
> > > > 
> > > > There is a bug in dm-crypt that it doesn't account for large pages in 
> > > > cc->n_allocated_pages; this patch fixes the bug.
> > 
> > This patch did not help.
> > 
> > > If the previous patch didn't fix it, try this patch (on a clean upstream 
> > > kernel).
> > >
> > > This patch allocates large pages, but it breaks them up into single-page 
> > > entries when adding them to the bio.
> > 
> > But this does help.
> 
> Thanks. So we can stop blaming the memory allocator and start blaming the 
> NVMe subsystem.

;-)

> I added NVMe maintainers to this thread - the summary of the problem is: 
> In dm-crypt, we allocate a large compound page and add this compound page 
> to the bio as a single big vector entry. Marek reports that on his system 
> it causes deadlocks; the deadlocks look like a lost bio that was never 
> completed. When I chop the large compound page into individual pages in 
> dm-crypt and add a bio vector for each of them, Marek reports that there 
> are no longer any deadlocks. So we have a problem (either hardware or 
> software) where the NVMe subsystem doesn't like bio vectors with a large 
> bv_len. This is the original bug report: 
> https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/

Actually, Ming Lei has already identified [1] that we are apparently
stuck in an endless retry loop in nvme_queue_rq(), with every attempt
ending in BLK_STS_RESOURCE.
								Honza

[1] https://lore.kernel.org/all/ZUHE52SznRaZQxnG@fedora
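
(As an aside for readers following the thread: below is a minimal sketch of
the ->queue_rq() retry contract that makes such a loop possible. Only the
blk-mq types, helpers and status codes are real; the demo_* functions are
hypothetical stand-ins for a driver's mapping and submission paths.)

#include <linux/blk-mq.h>

/* Hypothetical driver helpers, for illustration only. */
static bool demo_map_request(struct request *rq)
{
	/* Pretend the DMA mapping always fails, as in the reported case. */
	return false;
}

static void demo_submit_to_hw(struct request *rq)
{
	/* Ring the doorbell etc.; never reached in this sketch. */
}

static blk_status_t demo_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);

	if (!demo_map_request(rq)) {
		/*
		 * BLK_STS_RESOURCE means "temporarily out of resources,
		 * retry later": blk-mq requeues the request and re-runs
		 * the hardware queue.  If the failure can never clear
		 * (the request is simply too large to map), the retry
		 * loop never terminates and the bio appears to hang.
		 */
		return BLK_STS_RESOURCE;
	}

	demo_submit_to_hw(rq);
	return BLK_STS_OK;
}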

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01 11:23                                             ` Ming Lei
@ 2023-11-02 14:02                                               ` Keith Busch
  0 siblings, 0 replies; 66+ messages in thread
From: Keith Busch @ 2023-11-02 14:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Ming Lei, Marek Marczykowski-Górecki,
	Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm, linux-block,
	linux-nvme

On Wed, Nov 01, 2023 at 07:23:05PM +0800, Ming Lei wrote:
> On Wed, Nov 01, 2023 at 11:15:02AM +0100, Hannes Reinecke wrote:
> > > nvme_queue_rq() on the above request.
> > > 
> > And that is something I've been wondering (for quite some time now):
> > What _is_ the appropriate error handling for -ENOMEM?
> 
> It is just my guess.
> 
> Actually it shouldn't fail since the sgl allocation is backed by a
> memory pool, but there is also dma pool allocation and dma mapping.
> 
> At this time, we assume it to be a retryable error and re-run the queue
> in the hope that things will sort themselves out.
> 
> It should not be hard to figure out why nvme_queue_rq() can't move on.

There are only a few reasons nvme_queue_rq would return BLK_STS_RESOURCE
for a typical read/write command:

  DMA mapping error
  Can't allocate SGL from mempool
  Can't allocate PRP from dma_pool
  Controller stuck in resetting state

We should always be able to get at least one allocation from the memory
pools, so I think the only cases where the driver has no way to
guarantee eventual forward progress are the DMA mapping error
conditions. Is there some other limit that the driver needs to consider
when configuring its largest supported transfers?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-02 11:45                                             ` Marek Marczykowski-Górecki
@ 2023-11-02 17:06                                               ` Mikulas Patocka
  2023-11-03 15:01                                                 ` Marek Marczykowski-Górecki
  2023-11-03 16:15                                                 ` Marek Marczykowski-Górecki
  0 siblings, 2 replies; 66+ messages in thread
From: Mikulas Patocka @ 2023-11-02 17:06 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5324 bytes --]



On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:

> On Thu, Nov 02, 2023 at 10:28:57AM +0100, Mikulas Patocka wrote:
> 
> > Try lowering /sys/block/nvme0n1/queue/max_sectors_kb to some small value 
> > (for example 64) and test if it helps.
> 
> Yes, this helps too.

On a plain upstream kernel with no other modifications (and with default 
max_sectors_kb), set the value /sys/module/nvme/parameters/sgl_threshold 
to "0" and test it if it deadlocks. Then, set this value to "1" and test 
it again.

Revert sgl_threshold back to the default (32768). Boot the kernel with the 
option "iommu=panic". Reproduce the deadlock and if you get a kernel 
panic, send us the panic log.

Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
tell us which one of the "printk" statements is triggered during the 
deadlock.

Mikulas

---
 drivers/nvme/host/core.c |    8 ++++++--
 drivers/nvme/host/pci.c  |   27 ++++++++++++++++++++++-----
 2 files changed, 28 insertions(+), 7 deletions(-)

Index: linux-stable/drivers/nvme/host/pci.c
===================================================================
--- linux-stable.orig/drivers/nvme/host/pci.c	2023-10-31 15:35:38.000000000 +0100
+++ linux-stable/drivers/nvme/host/pci.c	2023-11-02 17:38:20.000000000 +0100
@@ -622,6 +622,10 @@ static blk_status_t nvme_pci_setup_prps(
 	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
 	if (!prp_list) {
 		iod->nr_allocations = -1;
+		if (nprps <= (256 / 8))
+			printk("allocation failure at %d\n", __LINE__);
+		else
+			printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
 	}
 	iod->list[0].prp_list = prp_list;
@@ -631,8 +635,10 @@ static blk_status_t nvme_pci_setup_prps(
 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
 			__le64 *old_prp_list = prp_list;
 			prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
-			if (!prp_list)
+			if (!prp_list) {
+				printk("allocation failure at %d\n", __LINE__);
 				goto free_prps;
+			}
 			iod->list[iod->nr_allocations++].prp_list = prp_list;
 			prp_list[0] = old_prp_list[i - 1];
 			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
@@ -712,6 +718,7 @@ static blk_status_t nvme_pci_setup_sgls(
 	sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
 	if (!sg_list) {
 		iod->nr_allocations = -1;
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
 	}
 
@@ -736,8 +743,10 @@ static blk_status_t nvme_setup_prp_simpl
 	unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
 
 	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))
+	if (dma_mapping_error(dev->dev, iod->first_dma)) {
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
+	}
 	iod->dma_len = bv->bv_len;
 
 	cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
@@ -755,8 +764,10 @@ static blk_status_t nvme_setup_sgl_simpl
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
 	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))
+	if (dma_mapping_error(dev->dev, iod->first_dma)) {
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
+	}
 	iod->dma_len = bv->bv_len;
 
 	cmnd->flags = NVME_CMD_SGL_METABUF;
@@ -791,8 +802,10 @@ static blk_status_t nvme_map_data(struct
 
 	iod->dma_len = 0;
 	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
-	if (!iod->sgt.sgl)
+	if (!iod->sgt.sgl) {
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
+	}
 	sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
 	iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
 	if (!iod->sgt.orig_nents)
@@ -801,8 +814,12 @@ static blk_status_t nvme_map_data(struct
 	rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
 			     DMA_ATTR_NO_WARN);
 	if (rc) {
-		if (rc == -EREMOTEIO)
+		if (rc == -EREMOTEIO) {
+			printk("allocation failure at %d\n", __LINE__);
 			ret = BLK_STS_TARGET;
+		} else {
+			printk("allocation failure at %d\n", __LINE__);
+		}
 		goto out_free_sg;
 	}
 
Index: linux-stable/drivers/nvme/host/core.c
===================================================================
--- linux-stable.orig/drivers/nvme/host/core.c	2023-10-31 15:35:38.000000000 +0100
+++ linux-stable/drivers/nvme/host/core.c	2023-11-02 17:12:39.000000000 +0100
@@ -708,8 +708,10 @@ blk_status_t nvme_fail_nonready_command(
 	    ctrl->state != NVME_CTRL_DELETING &&
 	    ctrl->state != NVME_CTRL_DEAD &&
 	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
-	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
+	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH)) {
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
+	}
 	return nvme_host_path_error(rq);
 }
 EXPORT_SYMBOL_GPL(nvme_fail_nonready_command);
@@ -784,8 +786,10 @@ static blk_status_t nvme_setup_discard(s
 		 * discard page. If that's also busy, it's safe to return
 		 * busy, as we know we can make progress once that's freed.
 		 */
-		if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy))
+		if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy)) {
+			printk("allocation failure at %d\n", __LINE__);
 			return BLK_STS_RESOURCE;
+		}
 
 		range = page_address(ns->ctrl->discard_page);
 	}

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-02 17:06                                               ` Mikulas Patocka
@ 2023-11-03 15:01                                                 ` Marek Marczykowski-Górecki
  2023-11-03 15:10                                                   ` Keith Busch
  2023-11-03 16:15                                                 ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-11-03 15:01 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1264 bytes --]

On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
> 
> 
> On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
> 
> > On Thu, Nov 02, 2023 at 10:28:57AM +0100, Mikulas Patocka wrote:
> > 
> > > Try lowering /sys/block/nvme0n1/queue/max_sectors_kb to some small value 
> > > (for example 64) and test if it helps.
> > 
> > Yes, this helps too.
> 
> On a plain upstream kernel with no other modifications (and with default 
> max_sectors_kb), set the value /sys/module/nvme/parameters/sgl_threshold 
> to "0" and test it if it deadlocks. Then, set this value to "1" and test 
> it again.

Got a deadlock with both values.

> Revert sgl_threshold back to the default (32768). Boot the kernel with the 
> option "iommu=panic". Reproduce the deadlock and if you get a kernel 
> panic, send us the panic log.

This is a Xen PV domain, so Linux is not in charge of the IOMMU here. And
there is SWIOTLB involved (64MB of it); I'm not sure if it's used for every
DMA, but definitely for some.

> Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
> tell us which one of the "printk" statements is triggered during the 
> deadlock.

I'll try this next.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-03 15:01                                                 ` Marek Marczykowski-Górecki
@ 2023-11-03 15:10                                                   ` Keith Busch
  0 siblings, 0 replies; 66+ messages in thread
From: Keith Busch @ 2023-11-03 15:10 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Mikulas Patocka, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

On Fri, Nov 03, 2023 at 04:01:19PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
> > On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
> > 
> > > On Thu, Nov 02, 2023 at 10:28:57AM +0100, Mikulas Patocka wrote:
> > > 
> > > > Try lowering /sys/block/nvme0n1/queue/max_sectors_kb to some small value 
> > > > (for example 64) and test if it helps.
> > > 
> > > Yes, this helps too.
> > 
> > On a plain upstream kernel with no other modifications (and with default 
> > max_sectors_kb), set the value /sys/module/nvme/parameters/sgl_threshold 
> > to "0" and test it if it deadlocks. Then, set this value to "1" and test 
> > it again.
> 
> Got a deadlock with both values.
> 
> > Revert sgl_threshold back to the default (32768). Boot the kernel with the 
> > option "iommu=panic". Reproduce the deadlock and if you get a kernel 
> > panic, send us the panic log.
> 
> This is a Xen PV domain, so Linux is not in charge of the IOMMU here. And
> there is SWIOTLB involved (64MB of it); I'm not sure if it's used for every
> DMA, but definitely for some.

So it's using xen_swiotlb_dma_ops, right? That doesn't implement
.opt_mapping_size, and I'm guessing it should be equal to
swiotlb_max_mapping_size().
 
> > Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
> > tell us which one of the "printk" statements is triggered during the 
> > deadlock.
> 
> I'll try this next.

Placing my bet now: you'll see a DMA mapping error.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-02 17:06                                               ` Mikulas Patocka
  2023-11-03 15:01                                                 ` Marek Marczykowski-Górecki
@ 2023-11-03 16:15                                                 ` Marek Marczykowski-Górecki
  2023-11-03 16:54                                                   ` Keith Busch
  1 sibling, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-11-03 16:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5212 bytes --]

On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
> Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
> tell us which one of the "printk" statements is triggered during the 
> deadlock.

The "821" one - see below.

> Mikulas
> 
> ---
>  drivers/nvme/host/core.c |    8 ++++++--
>  drivers/nvme/host/pci.c  |   27 ++++++++++++++++++++++-----
>  2 files changed, 28 insertions(+), 7 deletions(-)
> 
> Index: linux-stable/drivers/nvme/host/pci.c
> ===================================================================
> --- linux-stable.orig/drivers/nvme/host/pci.c	2023-10-31 15:35:38.000000000 +0100
> +++ linux-stable/drivers/nvme/host/pci.c	2023-11-02 17:38:20.000000000 +0100
> @@ -622,6 +622,10 @@ static blk_status_t nvme_pci_setup_prps(
>  	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
>  	if (!prp_list) {
>  		iod->nr_allocations = -1;
> +		if (nprps <= (256 / 8))
> +			printk("allocation failure at %d\n", __LINE__);
> +		else
> +			printk("allocation failure at %d\n", __LINE__);
>  		return BLK_STS_RESOURCE;
>  	}
>  	iod->list[0].prp_list = prp_list;
> @@ -631,8 +635,10 @@ static blk_status_t nvme_pci_setup_prps(
>  		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
>  			__le64 *old_prp_list = prp_list;
>  			prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
> -			if (!prp_list)
> +			if (!prp_list) {
> +				printk("allocation failure at %d\n", __LINE__);
>  				goto free_prps;
> +			}
>  			iod->list[iod->nr_allocations++].prp_list = prp_list;
>  			prp_list[0] = old_prp_list[i - 1];
>  			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
> @@ -712,6 +718,7 @@ static blk_status_t nvme_pci_setup_sgls(
>  	sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
>  	if (!sg_list) {
>  		iod->nr_allocations = -1;
> +		printk("allocation failure at %d\n", __LINE__);
>  		return BLK_STS_RESOURCE;
>  	}
>  
> @@ -736,8 +743,10 @@ static blk_status_t nvme_setup_prp_simpl
>  	unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
>  
>  	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
> -	if (dma_mapping_error(dev->dev, iod->first_dma))
> +	if (dma_mapping_error(dev->dev, iod->first_dma)) {
> +		printk("allocation failure at %d\n", __LINE__);
>  		return BLK_STS_RESOURCE;
> +	}
>  	iod->dma_len = bv->bv_len;
>  
>  	cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
> @@ -755,8 +764,10 @@ static blk_status_t nvme_setup_sgl_simpl
>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>  
>  	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
> -	if (dma_mapping_error(dev->dev, iod->first_dma))
> +	if (dma_mapping_error(dev->dev, iod->first_dma)) {
> +		printk("allocation failure at %d\n", __LINE__);
>  		return BLK_STS_RESOURCE;
> +	}
>  	iod->dma_len = bv->bv_len;
>  
>  	cmnd->flags = NVME_CMD_SGL_METABUF;
> @@ -791,8 +802,10 @@ static blk_status_t nvme_map_data(struct
>  
>  	iod->dma_len = 0;
>  	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
> -	if (!iod->sgt.sgl)
> +	if (!iod->sgt.sgl) {
> +		printk("allocation failure at %d\n", __LINE__);
>  		return BLK_STS_RESOURCE;
> +	}
>  	sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
>  	iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
>  	if (!iod->sgt.orig_nents)
> @@ -801,8 +814,12 @@ static blk_status_t nvme_map_data(struct
>  	rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
>  			     DMA_ATTR_NO_WARN);
>  	if (rc) {
> -		if (rc == -EREMOTEIO)
> +		if (rc == -EREMOTEIO) {
> +			printk("allocation failure at %d\n", __LINE__);
>  			ret = BLK_STS_TARGET;
> +		} else {
> +			printk("allocation failure at %d\n", __LINE__);

I get a lot of this one.

> +		}
>  		goto out_free_sg;
>  	}
>  
> Index: linux-stable/drivers/nvme/host/core.c
> ===================================================================
> --- linux-stable.orig/drivers/nvme/host/core.c	2023-10-31 15:35:38.000000000 +0100
> +++ linux-stable/drivers/nvme/host/core.c	2023-11-02 17:12:39.000000000 +0100
> @@ -708,8 +708,10 @@ blk_status_t nvme_fail_nonready_command(
>  	    ctrl->state != NVME_CTRL_DELETING &&
>  	    ctrl->state != NVME_CTRL_DEAD &&
>  	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
> -	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
> +	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH)) {
> +		printk("allocation failure at %d\n", __LINE__);
>  		return BLK_STS_RESOURCE;
> +	}
>  	return nvme_host_path_error(rq);
>  }
>  EXPORT_SYMBOL_GPL(nvme_fail_nonready_command);
> @@ -784,8 +786,10 @@ static blk_status_t nvme_setup_discard(s
>  		 * discard page. If that's also busy, it's safe to return
>  		 * busy, as we know we can make progress once that's freed.
>  		 */
> -		if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy))
> +		if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy)) {
> +			printk("allocation failure at %d\n", __LINE__);
>  			return BLK_STS_RESOURCE;
> +		}
>  
>  		range = page_address(ns->ctrl->discard_page);
>  	}


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-03 16:15                                                 ` Marek Marczykowski-Górecki
@ 2023-11-03 16:54                                                   ` Keith Busch
  2023-11-03 20:30                                                     ` Marek Marczykowski-Górecki
  2023-11-06  7:08                                                     ` Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5 Christoph Hellwig
  0 siblings, 2 replies; 66+ messages in thread
From: Keith Busch @ 2023-11-03 16:54 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Mikulas Patocka, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
> > Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
> > tell us which one of the "printk" statements is triggered during the 
> > deadlock.
> 
> The "821" one - see below.

Thanks for confirming!

Could you try this patch?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 60a08dfe8d75f..348fd6c6702a5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -41,7 +41,7 @@
  * These can be higher, but we need to ensure that any command doesn't
  * require an sg allocation that needs more than a page of data.
  */
-#define NVME_MAX_KB_SZ	8192
+#define NVME_MAX_KB_SZ	8192u
 #define NVME_MAX_SEGS	128
 #define NVME_MAX_NR_ALLOCATIONS	5
 
@@ -2957,8 +2957,9 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	 * Limit the max command size to prevent iod->sg allocations going
 	 * over a single page.
 	 */
-	dev->ctrl.max_hw_sectors = min_t(u32,
-		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
+	dev->ctrl.max_hw_sectors = min3(NVME_MAX_KB_SZ << 1,
+					dma_opt_mapping_size(&pdev->dev) >> 9,
+					dma_max_mapping_size(&pdev->dev) >> 9);
 	dev->ctrl.max_segments = NVME_MAX_SEGS;
 
 	/*
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 946bd56f0ac53..0e6c6c25d154f 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
 	.get_sgtable = dma_common_get_sgtable,
 	.alloc_pages = dma_common_alloc_pages,
 	.free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
 };
--

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-03 16:54                                                   ` Keith Busch
@ 2023-11-03 20:30                                                     ` Marek Marczykowski-Górecki
  2023-11-03 22:42                                                       ` Keith Busch
  2023-11-06  7:08                                                     ` Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5 Christoph Hellwig
  1 sibling, 1 reply; 66+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-11-03 20:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mikulas Patocka, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 616 bytes --]

On Fri, Nov 03, 2023 at 10:54:00AM -0600, Keith Busch wrote:
> On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
> > On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
> > > Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
> > > tell us which one of the "printk" statements is triggered during the 
> > > deadlock.
> > 
> > The "821" one - see below.
> 
> Thanks for confirming!
> 
> Could you try this patch?

Besides min3() being unhappy about types, it works.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-03 20:30                                                     ` Marek Marczykowski-Górecki
@ 2023-11-03 22:42                                                       ` Keith Busch
  2023-11-04  9:27                                                         ` Mikulas Patocka
  0 siblings, 1 reply; 66+ messages in thread
From: Keith Busch @ 2023-11-03 22:42 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Mikulas Patocka, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm

On Fri, Nov 03, 2023 at 09:30:07PM +0100, Marek Marczykowski-Górecki wrote:
> On Fri, Nov 03, 2023 at 10:54:00AM -0600, Keith Busch wrote:
> > On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
> > > On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
> > > > Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
> > > > tell us which one of the "printk" statements is triggered during the 
> > > > deadlock.
> > > 
> > > The "821" one - see below.
> > 
> > Thanks for confirming!
> > 
> > Could you try this patch?
> 
> Besides min3() being unhappy about types, it works.

Oops, should have changed the constant to "ul" instead of just "u".

Anyway, the overall idea makes sense to me, but I don't know the swiotlb
stuff well.

Christoph, does that patch make sense? For reference:

   https://lore.kernel.org/linux-mm/ZUOr-vp0yRkLyvyi@kbusch-mbp.dhcp.thefacebook.com/T/#m8d34245e0eef43f8e9fe6cba6038d77ed2a93ad6

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-03 22:42                                                       ` Keith Busch
@ 2023-11-04  9:27                                                         ` Mikulas Patocka
  2023-11-04 13:59                                                           ` Keith Busch
  0 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-11-04  9:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Marek Marczykowski-Górecki, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Jan Kara, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm



On Fri, 3 Nov 2023, Keith Busch wrote:

> On Fri, Nov 03, 2023 at 09:30:07PM +0100, Marek Marczykowski-Górecki wrote:
> > On Fri, Nov 03, 2023 at 10:54:00AM -0600, Keith Busch wrote:
> > > On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
> > > > On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
> > > > > Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
> > > > > tell us which one of the "printk" statements is triggered during the 
> > > > > deadlock.
> > > > 
> > > > The "821" one - see below.
> > > 
> > > Thanks for confirming!
> > > 
> > > Could you try this patch?
> > 
> > Besides min3() being unhappy about types, it works.
> 
> Oops, should have changed the constant to "ul" instead of just "u".
> 
> Anyway, the overall idea makes sense to me, but I don't know the swiotlb
> stuff well.
> 
> Christoph, does that patch make sense? For reference:
> 
>    https://lore.kernel.org/linux-mm/ZUOr-vp0yRkLyvyi@kbusch-mbp.dhcp.thefacebook.com/T/#m8d34245e0eef43f8e9fe6cba6038d77ed2a93ad6

dma_opt_mapping_size returns "min(dma_max_mapping_size(dev), size)". So 
you don't have to call dma_max_mapping_size explicitly, you can leave the 
file drivers/nvme/host/pci.c as it is.
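
(For reference, the helper in kernel/dma/mapping.c is roughly the following,
paraphrased rather than quoted verbatim, which is why a dma_opt_mapping_size()
cap already implies the dma_max_mapping_size() cap:)

size_t dma_opt_mapping_size(struct device *dev)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);
	size_t size = SIZE_MAX;

	if (ops && ops->opt_mapping_size)
		size = ops->opt_mapping_size();

	/* Never report more than the hard per-mapping limit. */
	return min(dma_max_mapping_size(dev), size);
}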

What about the other block device drivers (AHCI, SCSI...)? Should there be 
some generic framework that restricts max_hw_sectors according to 
dma_opt_mapping_size?

Mikulas


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-04  9:27                                                         ` Mikulas Patocka
@ 2023-11-04 13:59                                                           ` Keith Busch
  2023-11-06  7:10                                                             ` Christoph Hellwig
  0 siblings, 1 reply; 66+ messages in thread
From: Keith Busch @ 2023-11-04 13:59 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marek Marczykowski-Górecki, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Jan Kara, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On Sat, Nov 04, 2023 at 10:27:32AM +0100, Mikulas Patocka wrote:
> On Fri, 3 Nov 2023, Keith Busch wrote:
> > On Fri, Nov 03, 2023 at 09:30:07PM +0100, Marek Marczykowski-Górecki wrote:
> > > On Fri, Nov 03, 2023 at 10:54:00AM -0600, Keith Busch wrote:
> > > > On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
> > > > > > Then, try this patch (without "iommu=panic"), reproduce the deadlock and 
> > > > > > tell us which one of the "printk" statements is triggered during the 
> > > > > > deadlock.
> > > > > 
> > > > > The "821" one - see below.
> > > > 
> > > > Thanks for confirming!
> > > > 
> > > > Could you try this patch?
> > > 
> > > Besides min3() being unhappy about types, it works.
> > 
> > Oops, should have changed the constant to "ul" instead of just "u".
> > 
> > Anyway, the overall idea makes sense to me, but I don't know the swiotlb
> > stuff well.
> > 
> > Christoph, does that patch make sense? For reference:
> > 
> >    https://lore.kernel.org/linux-mm/ZUOr-vp0yRkLyvyi@kbusch-mbp.dhcp.thefacebook.com/T/#m8d34245e0eef43f8e9fe6cba6038d77ed2a93ad6
> 
> dma_opt_mapping_size returns "min(dma_max_mapping_size(dev), size)". So 
> you don't have to call dma_max_mapping_size explicitly, you can leave the 
> file drivers/nvme/host/pci.c as it is.

Indeed.
 
> What about the other block device drivers (AHCI, SCSI...)? Should there be 
> some generic framework that restricts max_hw_sectors according to 
> dma_opt_mapping_size?

I think it's just like any other request_queue limit, and the individual
drivers have to set these up.

Thinking on this again, this feels more like a max_segment_size limit
rather than a max_hw_sectors.
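
(For concreteness, a sketch of how a driver could wire either cap through the
generic queue-limit setters; demo_apply_dma_limits() and its arguments are
hypothetical, while the blk_queue_* and dma_*_mapping_size() helpers are real.
NVMe does the equivalent through dev->ctrl.max_hw_sectors, as in the
nvme_pci_alloc_dev() hunk quoted earlier in the thread.)

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>

/* Hypothetical probe-time helper, for illustration only. */
static void demo_apply_dma_limits(struct request_queue *q, struct device *dev)
{
	/* Overall per-request cap derived from what the DMA layer can map. */
	unsigned int max_sectors = min_t(size_t,
			dma_opt_mapping_size(dev) >> SECTOR_SHIFT, UINT_MAX);

	blk_queue_max_hw_sectors(q, max_sectors);

	/* Per-segment cap: each segment must fit in a single DMA mapping. */
	blk_queue_max_segment_size(q,
			min_t(size_t, dma_max_mapping_size(dev), UINT_MAX));
}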

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-03 16:54                                                   ` Keith Busch
  2023-11-03 20:30                                                     ` Marek Marczykowski-G'orecki
@ 2023-11-06  7:08                                                     ` Christoph Hellwig
  1 sibling, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2023-11-06  7:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Marek Marczykowski-Górecki, Mikulas Patocka, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jan Kara, Vlastimil Babka,
	Andrew Morton, Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

> -	dev->ctrl.max_hw_sectors = min_t(u32,
> -		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +	dev->ctrl.max_hw_sectors = min3(NVME_MAX_KB_SZ << 1,
> +					dma_opt_mapping_size(&pdev->dev) >> 9,
> +					dma_max_mapping_size(&pdev->dev) >> 9);

dma_opt_mapping_size is already capped by dma_max_mapping_size, so no
need for this hunk.

>  	dev->ctrl.max_segments = NVME_MAX_SEGS;
>  
>  	/*
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 946bd56f0ac53..0e6c6c25d154f 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
>  	.get_sgtable = dma_common_get_sgtable,
>  	.alloc_pages = dma_common_alloc_pages,
>  	.free_pages = dma_common_free_pages,
> +	.max_mapping_size = swiotlb_max_mapping_size,
>  };
> --

And this is the right thing to do.  I'm pretty sure I wrote this
myself a while ago, but I must not have sent it out in the end.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-04 13:59                                                           ` Keith Busch
@ 2023-11-06  7:10                                                             ` Christoph Hellwig
  2023-11-06 14:59                                                               ` [PATCH] swiotlb-xen: provide the "max_mapping_size" method Mikulas Patocka
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2023-11-06  7:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mikulas Patocka, Marek Marczykowski-Górecki, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jan Kara, Vlastimil Babka,
	Andrew Morton, Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On Sat, Nov 04, 2023 at 07:59:06AM -0600, Keith Busch wrote:
> > dma_opt_mapping_size returns "min(dma_max_mapping_size(dev), size)". So 
> > you don't have to call dma_max_mapping_size explicitly, you can leave the 
> > file drivers/nvme/host/pci.c as it is.
> 
> Indeed.
>  
> > What about the other block device drivers (AHCI, SCSI...)? Should there be 
> > some generic framework that restricts max_hw_sectors according to 
> > dma_opt_mapping_size?
> 
> I think it's just like any other request_queue limit, and the individual
> drivers have to set these up.

Yes, or in case of SCSI the scsi midlayer, which already does it.
The block layer itself doesn't even know that we're using DMA to
transfer data.

> Thinking on this again, this feels more like a max_segment_size limit
> rather than a max_hw_sectors.

No, it's an overall limit.  If we had a lot of segments using the max
size, we'd still starve the swiotlb pool.
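
(To put rough numbers on that, assuming the mainline swiotlb defaults of
2 KiB slots and 128 contiguous slots per mapping, and the queue limits
reported earlier in this thread; this is a standalone back-of-the-envelope
calculation, not kernel code:)

#include <stdio.h>

/* Assumed mainline swiotlb defaults (see include/linux/swiotlb.h). */
#define IO_TLB_SHIFT	11	/* 2 KiB per slot */
#define IO_TLB_SEGSIZE	128	/* contiguous slots per mapping */

int main(void)
{
	unsigned long slot_kib = (1UL << IO_TLB_SHIFT) >> 10;
	unsigned long max_map_kib = slot_kib * IO_TLB_SEGSIZE;	/* 256 KiB */
	unsigned long max_sectors_kb = 1280;	/* reported queue limit */

	printf("largest contiguous swiotlb mapping: %lu KiB\n", max_map_kib);
	printf("request size allowed by the queue:  %lu KiB\n", max_sectors_kb);
	/* A single request (or even one large bvec) can exceed what swiotlb
	 * can ever bounce, so dma_map_sgtable() fails on every retry. */
	return 0;
}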

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH] swiotlb-xen: provide the "max_mapping_size" method
  2023-11-06  7:10                                                             ` Christoph Hellwig
@ 2023-11-06 14:59                                                               ` Mikulas Patocka
  2023-11-06 15:16                                                                 ` Keith Busch
  0 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2023-11-06 14:59 UTC (permalink / raw)
  To: Christoph Hellwig, Juergen Gross, Stefano Stabellini, xen-devel, iommu
  Cc: Keith Busch, Marek Marczykowski-Górecki, Jens Axboe,
	Sagi Grimberg, Jan Kara, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1721 bytes --]

There's a bug that when using the XEN hypervisor with dm-crypt on NVMe, 
the kernel deadlocks [1].

The deadlocks are caused by the inability to map a large bio vector -
dma_map_sgtable always returns an error; this gets propagated to the block
layer as BLK_STS_RESOURCE and the block layer retries the request
indefinitely.

XEN uses the swiotlb framework to map discontiguous pages into contiguous
runs that are submitted to the PCIe device. The swiotlb framework has a
limitation on the length of a mapping - this needs to be announced with
the max_mapping_size method to make sure that the hardware drivers do not
create larger mappings.

Without max_mapping_size, the NVMe block driver would create large
mappings that overrun the maximum mapping size.

[1] https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Suggested-by: Keith Busch <kbusch@kernel.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org

---
 drivers/xen/swiotlb-xen.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-stable/drivers/xen/swiotlb-xen.c
===================================================================
--- linux-stable.orig/drivers/xen/swiotlb-xen.c	2023-11-03 17:57:18.000000000 +0100
+++ linux-stable/drivers/xen/swiotlb-xen.c	2023-11-06 15:30:59.000000000 +0100
@@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma
 	.get_sgtable = dma_common_get_sgtable,
 	.alloc_pages = dma_common_alloc_pages,
 	.free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
 };

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] swiotlb-xen: provide the "max_mapping_size" method
  2023-11-06 14:59                                                               ` [PATCH] swiotlb-xen: provide the "max_mapping_size" method Mikulas Patocka
@ 2023-11-06 15:16                                                                 ` Keith Busch
  2023-11-06 15:30                                                                   ` Mike Snitzer
  0 siblings, 1 reply; 66+ messages in thread
From: Keith Busch @ 2023-11-06 15:16 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Christoph Hellwig, Juergen Gross, Stefano Stabellini, xen-devel,
	iommu, Marek Marczykowski-Górecki, Jens Axboe,
	Sagi Grimberg, Jan Kara, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm

On Mon, Nov 06, 2023 at 03:59:40PM +0100, Mikulas Patocka wrote:
> There's a bug that when using the XEN hypervisor with dm-crypt on NVMe, 
> the kernel deadlocks [1].
> 
> The deadlocks are caused by the inability to map a large bio vector -
> dma_map_sgtable always returns an error; this gets propagated to the block
> layer as BLK_STS_RESOURCE and the block layer retries the request
> indefinitely.
> 
> XEN uses the swiotlb framework to map discontiguous pages into contiguous
> runs that are submitted to the PCIe device. The swiotlb framework has a
> limitation on the length of a mapping - this needs to be announced with
> the max_mapping_size method to make sure that the hardware drivers do not
> create larger mappings.
> 
> Without max_mapping_size, the NVMe block driver would create large
> mappings that overrun the maximum mapping size.
> 
> [1] https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/

This should be a "Link:" tag.

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Suggested-by: Keith Busch <kbusch@kernel.org>

I was about to send the same thing. I did a little more than suggest
this: it's the very patch I wrote for testing, minus the redundant
nvme bits! But since you already have a commit message for it...

Acked-by: Keith Busch <kbusch@kernel.org>

> Suggested-by: Christoph Hellwig <hch@lst.de>
> Cc: stable@vger.kernel.org
> 
> ---
>  drivers/xen/swiotlb-xen.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> Index: linux-stable/drivers/xen/swiotlb-xen.c
> ===================================================================
> --- linux-stable.orig/drivers/xen/swiotlb-xen.c	2023-11-03 17:57:18.000000000 +0100
> +++ linux-stable/drivers/xen/swiotlb-xen.c	2023-11-06 15:30:59.000000000 +0100
> @@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma
>  	.get_sgtable = dma_common_get_sgtable,
>  	.alloc_pages = dma_common_alloc_pages,
>  	.free_pages = dma_common_free_pages,
> +	.max_mapping_size = swiotlb_max_mapping_size,
>  };


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: swiotlb-xen: provide the "max_mapping_size" method
  2023-11-06 15:16                                                                 ` Keith Busch
@ 2023-11-06 15:30                                                                   ` Mike Snitzer
  2023-11-06 17:12                                                                     ` [PATCH v2] " Mikulas Patocka
  0 siblings, 1 reply; 66+ messages in thread
From: Mike Snitzer @ 2023-11-06 15:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mikulas Patocka, Christoph Hellwig, Juergen Gross,
	Stefano Stabellini, xen-devel, iommu,
	Marek Marczykowski-Górecki, Jens Axboe, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, dm-devel,
	linux-mm

On Mon, Nov 06 2023 at 10:16P -0500,
Keith Busch <kbusch@kernel.org> wrote:

> On Mon, Nov 06, 2023 at 03:59:40PM +0100, Mikulas Patocka wrote:
> > There's a bug that when using the XEN hypervisor with dm-crypt on NVMe, 
> > the kernel deadlocks [1].
> > 
> > The deadlocks are caused by the inability to map a large bio vector -
> > dma_map_sgtable always returns an error; this gets propagated to the block
> > layer as BLK_STS_RESOURCE and the block layer retries the request
> > indefinitely.
> > 
> > XEN uses the swiotlb framework to map discontiguous pages into contiguous
> > runs that are submitted to the PCIe device. The swiotlb framework has a
> > limitation on the length of a mapping - this needs to be announced with
> > the max_mapping_size method to make sure that the hardware drivers do not
> > create larger mappings.
> > 
> > Without max_mapping_size, the NVMe block driver would create large
> > mappings that overrun the maximum mapping size.
> > 
> > [1] https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/
> 
> This should be a "Link:" tag.
> 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > Suggested-by: Keith Busch <kbusch@kernel.org>
> 
> I was about to send the same thing. I did a little more than suggest
> this: it's the very patch I wrote for testing, minus the redundant
> nvme bits! But since you already have a commit message for it...
> 
> Acked-by: Keith Busch <kbusch@kernel.org>

No, this patch should be attributed to you, Keith.

Mikulas, I like that you ran with getting a fix prepared, but please
update the patch so Keith is the author and use Link: as suggested for
the v2. Note: you'll still use your Signed-off-by since you had a role
in getting this patch together (but please move yours to the end of
the header).

Mike


> 
> > Suggested-by: Christoph Hellwig <hch@lst.de>
> > Cc: stable@vger.kernel.org
> > 
> > ---
> >  drivers/xen/swiotlb-xen.c |    1 +
> >  1 file changed, 1 insertion(+)
> > 
> > Index: linux-stable/drivers/xen/swiotlb-xen.c
> > ===================================================================
> > --- linux-stable.orig/drivers/xen/swiotlb-xen.c	2023-11-03 17:57:18.000000000 +0100
> > +++ linux-stable/drivers/xen/swiotlb-xen.c	2023-11-06 15:30:59.000000000 +0100
> > @@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma
> >  	.get_sgtable = dma_common_get_sgtable,
> >  	.alloc_pages = dma_common_alloc_pages,
> >  	.free_pages = dma_common_free_pages,
> > +	.max_mapping_size = swiotlb_max_mapping_size,
> >  };
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v2] swiotlb-xen: provide the "max_mapping_size" method
  2023-11-06 15:30                                                                   ` Mike Snitzer
@ 2023-11-06 17:12                                                                     ` Mikulas Patocka
  2023-11-07  4:18                                                                       ` Stefano Stabellini
  2023-11-08  7:31                                                                       ` Christoph Hellwig
  0 siblings, 2 replies; 66+ messages in thread
From: Mikulas Patocka @ 2023-11-06 17:12 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, Juergen Gross,
	Stefano Stabellini, xen-devel, iommu,
	Marek Marczykowski-Górecki, Jens Axboe, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, dm-devel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 1795 bytes --]

From: Keith Busch <kbusch@kernel.org>

There's a bug that when using the XEN hypervisor with bios that have large
multi-page bio vectors on NVMe, the kernel deadlocks [1].

The deadlocks are caused by the inability to map a large bio vector -
dma_map_sgtable always returns an error; this gets propagated to the block
layer as BLK_STS_RESOURCE and the block layer retries the request
indefinitely.

XEN uses the swiotlb framework to map discontiguous pages into contiguous
runs that are submitted to the PCIe device. The swiotlb framework has a
limitation on the length of a mapping - this needs to be announced with
the max_mapping_size method to make sure that the hardware drivers do not
create larger mappings.

Without max_mapping_size, the NVMe block driver would create large
mappings that overrun the maximum mapping size.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Link: https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/ [1]
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/xen/swiotlb-xen.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-stable/drivers/xen/swiotlb-xen.c
===================================================================
--- linux-stable.orig/drivers/xen/swiotlb-xen.c	2023-11-03 17:57:18.000000000 +0100
+++ linux-stable/drivers/xen/swiotlb-xen.c	2023-11-06 15:30:59.000000000 +0100
@@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma
 	.get_sgtable = dma_common_get_sgtable,
 	.alloc_pages = dma_common_alloc_pages,
 	.free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
 };

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2] swiotlb-xen: provide the "max_mapping_size" method
  2023-11-06 17:12                                                                     ` [PATCH v2] " Mikulas Patocka
@ 2023-11-07  4:18                                                                       ` Stefano Stabellini
  2023-11-08  7:31                                                                       ` Christoph Hellwig
  1 sibling, 0 replies; 66+ messages in thread
From: Stefano Stabellini @ 2023-11-07  4:18 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Keith Busch, Christoph Hellwig, Juergen Gross,
	Stefano Stabellini, xen-devel, iommu,
	Marek Marczykowski-Górecki, Jens Axboe, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, dm-devel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 1976 bytes --]

On Mon, 6 Nov 2023, Mikulas Patocka wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> There's a bug that when using the XEN hypervisor with bios that have large
> multi-page bio vectors on NVMe, the kernel deadlocks [1].
> 
> The deadlocks are caused by the inability to map a large bio vector -
> dma_map_sgtable always returns an error; this gets propagated to the block
> layer as BLK_STS_RESOURCE and the block layer retries the request
> indefinitely.
> 
> XEN uses the swiotlb framework to map discontiguous pages into contiguous
> runs that are submitted to the PCIe device. The swiotlb framework has a
> limitation on the length of a mapping - this needs to be announced with
> the max_mapping_size method to make sure that the hardware drivers do not
> create larger mappings.
> 
> Without max_mapping_size, the NVMe block driver would create large
> mappings that overrun the maximum mapping size.
> 
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Link: https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/ [1]
> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

Acked-by: Stefano Stabellini <sstabellini@kernel.org>


> ---
>  drivers/xen/swiotlb-xen.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> Index: linux-stable/drivers/xen/swiotlb-xen.c
> ===================================================================
> --- linux-stable.orig/drivers/xen/swiotlb-xen.c	2023-11-03 17:57:18.000000000 +0100
> +++ linux-stable/drivers/xen/swiotlb-xen.c	2023-11-06 15:30:59.000000000 +0100
> @@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma
>  	.get_sgtable = dma_common_get_sgtable,
>  	.alloc_pages = dma_common_alloc_pages,
>  	.free_pages = dma_common_free_pages,
> +	.max_mapping_size = swiotlb_max_mapping_size,
>  };

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2] swiotlb-xen: provide the "max_mapping_size" method
  2023-11-06 17:12                                                                     ` [PATCH v2] " Mikulas Patocka
  2023-11-07  4:18                                                                       ` Stefano Stabellini
@ 2023-11-08  7:31                                                                       ` Christoph Hellwig
  1 sibling, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2023-11-08  7:31 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Keith Busch, Christoph Hellwig, Juergen Gross,
	Stefano Stabellini, xen-devel, iommu,
	Marek Marczykowski-Górecki, Jens Axboe, Sagi Grimberg,
	Jan Kara, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, dm-devel,
	linux-mm

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2023-11-08  7:32 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-21  3:38 Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5 Marek Marczykowski-Górecki
2023-10-21  7:48 ` Bagas Sanjaya
2023-10-29  6:23   ` Linux regression tracking #update (Thorsten Leemhuis)
2023-10-29  6:23     ` Linux regression tracking #update (Thorsten Leemhuis)
2023-10-23 20:59 ` Mikulas Patocka
2023-10-23 20:59   ` Mikulas Patocka
2023-10-25  3:10   ` Marek Marczykowski-Górecki
2023-10-25  3:10     ` Marek Marczykowski-Górecki
2023-10-25  3:22     ` Marek Marczykowski-Górecki
2023-10-25  3:22       ` Marek Marczykowski-Górecki
2023-10-25 10:13       ` Mikulas Patocka
2023-10-27 17:32         ` Mikulas Patocka
2023-10-28  9:23           ` Matthew Wilcox
2023-10-28 15:14             ` Mike Snitzer
2023-10-29 11:15           ` Marek Marczykowski-Górecki
2023-10-29 20:02         ` Vlastimil Babka
2023-10-30  7:37           ` Mikulas Patocka
2023-10-30  8:37             ` Vlastimil Babka
2023-10-30 11:22               ` Mikulas Patocka
2023-10-30 11:30                 ` Vlastimil Babka
2023-10-30 11:37                   ` Mikulas Patocka
2023-10-30 12:25                   ` Jan Kara
2023-10-30 13:30                     ` Marek Marczykowski-Górecki
2023-10-30 14:08                       ` Mikulas Patocka
2023-10-30 15:56                         ` Jan Kara
2023-10-30 16:51                           ` Marek Marczykowski-Górecki
2023-10-30 17:50                             ` Mikulas Patocka
2023-10-31  3:48                               ` Marek Marczykowski-Górecki
2023-10-31 14:01                                 ` Jan Kara
2023-10-31 15:42                                   ` Marek Marczykowski-Górecki
2023-10-31 17:17                                     ` Mikulas Patocka
2023-10-31 17:24                                       ` Mikulas Patocka
2023-11-02  0:38                                         ` Marek Marczykowski-Górecki
2023-11-02  9:28                                           ` Mikulas Patocka
2023-11-02 11:45                                             ` Marek Marczykowski-Górecki
2023-11-02 17:06                                               ` Mikulas Patocka
2023-11-03 15:01                                                 ` Marek Marczykowski-Górecki
2023-11-03 15:10                                                   ` Keith Busch
2023-11-03 16:15                                                 ` Marek Marczykowski-Górecki
2023-11-03 16:54                                                   ` Keith Busch
2023-11-03 20:30                                                     ` Marek Marczykowski-Górecki
2023-11-03 22:42                                                       ` Keith Busch
2023-11-04  9:27                                                         ` Mikulas Patocka
2023-11-04 13:59                                                           ` Keith Busch
2023-11-06  7:10                                                             ` Christoph Hellwig
2023-11-06 14:59                                                               ` [PATCH] swiotlb-xen: provide the "max_mapping_size" method Mikulas Patocka
2023-11-06 15:16                                                                 ` Keith Busch
2023-11-06 15:30                                                                   ` Mike Snitzer
2023-11-06 17:12                                                                     ` [PATCH v2] " Mikulas Patocka
2023-11-07  4:18                                                                       ` Stefano Stabellini
2023-11-08  7:31                                                                       ` Christoph Hellwig
2023-11-06  7:08                                                     ` Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5 Christoph Hellwig
2023-11-02 12:21                                             ` Jan Kara
2023-11-01  1:27                                     ` Ming Lei
2023-11-01  2:14                                       ` Marek Marczykowski-Górecki
2023-11-01  2:15                                         ` Marek Marczykowski-Górecki
2023-11-01  2:35                                           ` Marek Marczykowski-Górecki
2023-11-01  3:24                                         ` Ming Lei
2023-11-01 10:15                                           ` Hannes Reinecke
2023-11-01 10:26                                             ` Jan Kara
2023-11-01 11:23                                             ` Ming Lei
2023-11-02 14:02                                               ` Keith Busch
2023-11-01 12:16                                             ` Mikulas Patocka
2023-10-30 11:28               ` Jan Kara
2023-10-30 11:49                 ` Mikulas Patocka
2023-10-30 12:11                   ` Jan Kara
