All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] [fuse] alloc_page nofs avoid deadlock
@ 2021-06-03 12:41 chenguanyou
  2021-06-03 12:56 ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: chenguanyou @ 2021-06-03 12:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, keescook, mhocko, lukas.bulwahn, vbabka, gpiccoli, chenguanyou

ABA deadlock

PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"

PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND: "kworker/u16:8"

Signed-off-by: chenguanyou <chenguanyou@xiaomi.com>
---
 fs/fuse/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index c0fee830a34e..d36125ff0405 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -721,7 +721,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 			if (cs->nr_segs >= cs->pipe->max_usage)
 				return -EIO;
 
-			page = alloc_page(GFP_HIGHUSER);
+			page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
 			if (!page)
 				return -ENOMEM;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-06-03 12:41 [PATCH] [fuse] alloc_page nofs avoid deadlock chenguanyou
@ 2021-06-03 12:56 ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2021-06-03 12:56 UTC (permalink / raw)
  To: chenguanyou
  Cc: linux-kernel, akpm, keescook, lukas.bulwahn, vbabka, gpiccoli,
	chenguanyou

On Thu 03-06-21 20:41:05, chenguanyou wrote:
> ABA deadlock
> 
> PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
> 
> PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND: "kworker/u16:8"

Please be much more specific about the underlying issue and what the
actual lock dependency during the reclaim is. The above is essentially
void of any relevant information.

> Signed-off-by: chenguanyou <chenguanyou@xiaomi.com>
> ---
>  fs/fuse/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index c0fee830a34e..d36125ff0405 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -721,7 +721,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>  			if (cs->nr_segs >= cs->pipe->max_usage)
>  				return -EIO;
>  
> -			page = alloc_page(GFP_HIGHUSER);
> +			page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
>  			if (!page)
>  				return -ENOMEM;
>  
> -- 
> 2.17.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2022-07-11  7:49                     ` Miklos Szeredi
@ 2022-07-12  1:16                       ` Ed Tsai
  0 siblings, 0 replies; 18+ messages in thread
From: Ed Tsai @ 2022-07-12  1:16 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Mon, 2022-07-11 at 15:49 +0800, Miklos Szeredi wrote:
> On Mon, 13 Jun 2022 at 11:29, Ed Tsai <ed.tsai@mediatek.com> wrote:
> > 
> > On Mon, 2022-06-13 at 16:45 +0800, Miklos Szeredi wrote:
> > > On Fri, 10 Jun 2022 at 09:48, Ed Tsai <ed.tsai@mediatek.com>
> > > wrote:
> > > 
> > > > Recently, we get this deadlock issue again.
> > > > fuse_flush_time_update()
> > > > use sync_inode_metadata() and it only write the metadata, so
> > > > the
> > > > writeback worker could still be blocked becaused of file data.
> > > > 
> > > > I try to use write_inode_now() instead of sync_inode_metadata()
> > > > and
> > > > the
> > > > writeback thread will not be blocked anymore. I don't think
> > > > this is
> > > > a
> > > > good solution, but this confirm that there is still a potential
> > > > deadlock because of file data. WDYT.
> > > 
> > > I'm not sure how that happens.  Normally writeback doesn't
> > > block.  Can
> > > you provide the stack traces of all related tasks in the
> > > deadlock?
> > > 
> > > Thanks,
> > > Miklos
> > 
> > The writeback worker
> > ppid=22915 pid=22915 S cpu=6 prio=120 wait=3614s kworker/u16:21
> > vmlinux  request_wait_answer + 64
> > vmlinux  __fuse_request_send + 328
> > vmlinux  fuse_request_send + 60
> > vmlinux  fuse_simple_request + 376
> > vmlinux  fuse_flush_times + 276
> > vmlinux  fuse_write_inode + 104 (inode=0xFFFFFFD516CC4780, ff=0)
> > vmlinux  write_inode + 384
> > vmlinux  __writeback_single_inode + 960
> > vmlinux  writeback_sb_inodes + 892
> > vmlinux  __writeback_inodes_wb + 156
> > vmlinux  wb_writeback + 512
> > vmlinux  wb_check_background_flush + 600
> > vmlinux  wb_do_writeback + 644
> > vmlinux  wb_workfn + 756
> > vmlinux  process_one_work + 628
> > vmlinux  worker_thread + 708
> > vmlinux  kthread + 376
> > vmlinux  ret_from_fork + 16
> > 
> > Thread-11
> > ppid=3961 pid=26057 D cpu=4 prio=120 wait=3614s Thread-11
> > vmlinux  __inode_wait_for_writeback + 108
> > vmlinux  inode_wait_for_writeback + 156
> > vmlinux  evict + 160
> > vmlinux  iput_final + 292
> > vmlinux  iput + 600
> > vmlinux  dentry_unlink_inode + 212
> > vmlinux  __dentry_kill + 228
> > vmlinux  shrink_dentry_list + 408
> > vmlinux  prune_dcache_sb + 80
> > vmlinux  super_cache_scan + 272
> > vmlinux  do_shrink_slab + 944
> > vmlinux  shrink_slab + 1104
> > vmlinux  shrink_node + 712
> > vmlinux  shrink_zones + 188
> > vmlinux  do_try_to_free_pages + 348
> > vmlinux  try_to_free_pages + 848
> > vmlinux  __perform_reclaim + 64
> > vmlinux  __alloc_pages_direct_reclaim + 64
> > vmlinux  __alloc_pages_slowpath + 1296
> > vmlinux  __alloc_pages_nodemask + 2004
> > vmlinux  __alloc_pages + 16
> > vmlinux  __alloc_pages_node + 16
> > vmlinux  alloc_pages_node + 16
> > vmlinux  __read_swap_cache_async + 172
> > vmlinux  read_swap_cache_async + 12
> > vmlinux  swapin_readahead + 328
> > vmlinux  do_swap_page + 844
> > vmlinux  handle_pte_fault + 268
> > vmlinux  __handle_speculative_fault + 548
> > vmlinux  handle_speculative_fault + 44
> > vmlinux  do_page_fault + 500
> > vmlinux  do_translation_fault + 64
> > vmlinux  do_mem_abort + 72
> > vmlinux  el0_sync + 1032
> > 
> > ppid=3961 is com.google.android.providers.media.module, and it is
> > the
> > android fuse daemon.
> > 
> > So, the daemon and wb worker were wait for each other.
> 
> Is commit 5c791fe1e2a4 ("fuse: make sure reclaim doesn't write the
> inode") applied to this kernel?
> 
> Thanks,
> Miklos

Yes, it has been applied to our kernel.

fuse_flush_time_update() only write metadata to disk in that patch.
Here, I tried to use write_inode_now() instead of
sync_inode_metadata(), and no more system hang is observed. So I
suppose the deadlock is caused by inode data.

Best,
Ed Tsai


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2022-06-13  9:29                   ` Ed Tsai
@ 2022-07-11  7:49                     ` Miklos Szeredi
  2022-07-12  1:16                       ` Ed Tsai
  0 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2022-07-11  7:49 UTC (permalink / raw)
  To: Ed Tsai
  Cc: linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Mon, 13 Jun 2022 at 11:29, Ed Tsai <ed.tsai@mediatek.com> wrote:
>
> On Mon, 2022-06-13 at 16:45 +0800, Miklos Szeredi wrote:
> > On Fri, 10 Jun 2022 at 09:48, Ed Tsai <ed.tsai@mediatek.com> wrote:
> >
> > > Recently, we get this deadlock issue again.
> > > fuse_flush_time_update()
> > > use sync_inode_metadata() and it only write the metadata, so the
> > > writeback worker could still be blocked becaused of file data.
> > >
> > > I try to use write_inode_now() instead of sync_inode_metadata() and
> > > the
> > > writeback thread will not be blocked anymore. I don't think this is
> > > a
> > > good solution, but this confirm that there is still a potential
> > > deadlock because of file data. WDYT.
> >
> > I'm not sure how that happens.  Normally writeback doesn't
> > block.  Can
> > you provide the stack traces of all related tasks in the deadlock?
> >
> > Thanks,
> > Miklos
>
> The writeback worker
> ppid=22915 pid=22915 S cpu=6 prio=120 wait=3614s kworker/u16:21
> vmlinux  request_wait_answer + 64
> vmlinux  __fuse_request_send + 328
> vmlinux  fuse_request_send + 60
> vmlinux  fuse_simple_request + 376
> vmlinux  fuse_flush_times + 276
> vmlinux  fuse_write_inode + 104 (inode=0xFFFFFFD516CC4780, ff=0)
> vmlinux  write_inode + 384
> vmlinux  __writeback_single_inode + 960
> vmlinux  writeback_sb_inodes + 892
> vmlinux  __writeback_inodes_wb + 156
> vmlinux  wb_writeback + 512
> vmlinux  wb_check_background_flush + 600
> vmlinux  wb_do_writeback + 644
> vmlinux  wb_workfn + 756
> vmlinux  process_one_work + 628
> vmlinux  worker_thread + 708
> vmlinux  kthread + 376
> vmlinux  ret_from_fork + 16
>
> Thread-11
> ppid=3961 pid=26057 D cpu=4 prio=120 wait=3614s Thread-11
> vmlinux  __inode_wait_for_writeback + 108
> vmlinux  inode_wait_for_writeback + 156
> vmlinux  evict + 160
> vmlinux  iput_final + 292
> vmlinux  iput + 600
> vmlinux  dentry_unlink_inode + 212
> vmlinux  __dentry_kill + 228
> vmlinux  shrink_dentry_list + 408
> vmlinux  prune_dcache_sb + 80
> vmlinux  super_cache_scan + 272
> vmlinux  do_shrink_slab + 944
> vmlinux  shrink_slab + 1104
> vmlinux  shrink_node + 712
> vmlinux  shrink_zones + 188
> vmlinux  do_try_to_free_pages + 348
> vmlinux  try_to_free_pages + 848
> vmlinux  __perform_reclaim + 64
> vmlinux  __alloc_pages_direct_reclaim + 64
> vmlinux  __alloc_pages_slowpath + 1296
> vmlinux  __alloc_pages_nodemask + 2004
> vmlinux  __alloc_pages + 16
> vmlinux  __alloc_pages_node + 16
> vmlinux  alloc_pages_node + 16
> vmlinux  __read_swap_cache_async + 172
> vmlinux  read_swap_cache_async + 12
> vmlinux  swapin_readahead + 328
> vmlinux  do_swap_page + 844
> vmlinux  handle_pte_fault + 268
> vmlinux  __handle_speculative_fault + 548
> vmlinux  handle_speculative_fault + 44
> vmlinux  do_page_fault + 500
> vmlinux  do_translation_fault + 64
> vmlinux  do_mem_abort + 72
> vmlinux  el0_sync + 1032
>
> ppid=3961 is com.google.android.providers.media.module, and it is the
> android fuse daemon.
>
> So, the daemon and wb worker were wait for each other.

Is commit 5c791fe1e2a4 ("fuse: make sure reclaim doesn't write the
inode") applied to this kernel?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2022-06-13  8:45                 ` Miklos Szeredi
@ 2022-06-13  9:29                   ` Ed Tsai
  2022-07-11  7:49                     ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: Ed Tsai @ 2022-06-13  9:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Mon, 2022-06-13 at 16:45 +0800, Miklos Szeredi wrote:
> On Fri, 10 Jun 2022 at 09:48, Ed Tsai <ed.tsai@mediatek.com> wrote:
> 
> > Recently, we get this deadlock issue again.
> > fuse_flush_time_update()
> > use sync_inode_metadata() and it only write the metadata, so the
> > writeback worker could still be blocked becaused of file data.
> > 
> > I try to use write_inode_now() instead of sync_inode_metadata() and
> > the
> > writeback thread will not be blocked anymore. I don't think this is
> > a
> > good solution, but this confirm that there is still a potential
> > deadlock because of file data. WDYT.
> 
> I'm not sure how that happens.  Normally writeback doesn't
> block.  Can
> you provide the stack traces of all related tasks in the deadlock?
> 
> Thanks,
> Miklos

The writeback worker
ppid=22915 pid=22915 S cpu=6 prio=120 wait=3614s kworker/u16:21
vmlinux  request_wait_answer + 64
vmlinux  __fuse_request_send + 328
vmlinux  fuse_request_send + 60
vmlinux  fuse_simple_request + 376
vmlinux  fuse_flush_times + 276
vmlinux  fuse_write_inode + 104 (inode=0xFFFFFFD516CC4780, ff=0)
vmlinux  write_inode + 384
vmlinux  __writeback_single_inode + 960
vmlinux  writeback_sb_inodes + 892
vmlinux  __writeback_inodes_wb + 156
vmlinux  wb_writeback + 512
vmlinux  wb_check_background_flush + 600
vmlinux  wb_do_writeback + 644
vmlinux  wb_workfn + 756
vmlinux  process_one_work + 628
vmlinux  worker_thread + 708
vmlinux  kthread + 376
vmlinux  ret_from_fork + 16

Thread-11
ppid=3961 pid=26057 D cpu=4 prio=120 wait=3614s Thread-11
vmlinux  __inode_wait_for_writeback + 108
vmlinux  inode_wait_for_writeback + 156
vmlinux  evict + 160
vmlinux  iput_final + 292
vmlinux  iput + 600
vmlinux  dentry_unlink_inode + 212
vmlinux  __dentry_kill + 228
vmlinux  shrink_dentry_list + 408
vmlinux  prune_dcache_sb + 80
vmlinux  super_cache_scan + 272
vmlinux  do_shrink_slab + 944
vmlinux  shrink_slab + 1104
vmlinux  shrink_node + 712
vmlinux  shrink_zones + 188
vmlinux  do_try_to_free_pages + 348
vmlinux  try_to_free_pages + 848
vmlinux  __perform_reclaim + 64
vmlinux  __alloc_pages_direct_reclaim + 64
vmlinux  __alloc_pages_slowpath + 1296
vmlinux  __alloc_pages_nodemask + 2004
vmlinux  __alloc_pages + 16
vmlinux  __alloc_pages_node + 16
vmlinux  alloc_pages_node + 16
vmlinux  __read_swap_cache_async + 172
vmlinux  read_swap_cache_async + 12
vmlinux  swapin_readahead + 328
vmlinux  do_swap_page + 844
vmlinux  handle_pte_fault + 268
vmlinux  __handle_speculative_fault + 548
vmlinux  handle_speculative_fault + 44
vmlinux  do_page_fault + 500
vmlinux  do_translation_fault + 64
vmlinux  do_mem_abort + 72
vmlinux  el0_sync + 1032

ppid=3961 is com.google.android.providers.media.module, and it is the
android fuse daemon.

So, the daemon and wb worker were wait for each other.

Best,
Ed Tsai


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2022-06-10  7:48               ` Ed Tsai
@ 2022-06-13  8:45                 ` Miklos Szeredi
  2022-06-13  9:29                   ` Ed Tsai
  0 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2022-06-13  8:45 UTC (permalink / raw)
  To: Ed Tsai
  Cc: linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Fri, 10 Jun 2022 at 09:48, Ed Tsai <ed.tsai@mediatek.com> wrote:

> Recently, we get this deadlock issue again. fuse_flush_time_update()
> use sync_inode_metadata() and it only write the metadata, so the
> writeback worker could still be blocked becaused of file data.
>
> I try to use write_inode_now() instead of sync_inode_metadata() and the
> writeback thread will not be blocked anymore. I don't think this is a
> good solution, but this confirm that there is still a potential
> deadlock because of file data. WDYT.

I'm not sure how that happens.  Normally writeback doesn't block.  Can
you provide the stack traces of all related tasks in the deadlock?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
       [not found]             ` <SI2PR03MB5545E0B76E54013678B9FEEC8BA99@SI2PR03MB5545.apcprd03.prod.outlook.com>
@ 2022-06-10  7:48               ` Ed Tsai
  2022-06-13  8:45                 ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: Ed Tsai @ 2022-06-10  7:48 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

> On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > wrote:
> > > 
> > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com>
> > > > wrote:
> > > > > 
> > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou < 
> > > > > > chenguanyou9338@gmail.com>
> > > > > > wrote:
> > > > > > > 
> > > > > > > ABA deadlock
> > > > > > > 
> > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > "Thread-21"
> > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > ffffff800830e1e8
> > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > ffffff80082f4c90
> > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > ffffff80082f1c34
> > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > ffffff80082d55ac
> > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at 
> > > > > > > ffffff8008268460
> > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > ffffff80082680d0
> > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > ffffff8008256514
> > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > ffffff8008437654
> > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at 
> > > > > > > ffffff8008436f40
> > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > 
> > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > "kworker/u16:8"
> > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > ffffff8008435760
> > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > ffffff8008435b14
> > > > > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > > > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > ffffff8008312740
> > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > ffffff80083117e4
> > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at 
> > > > > > > ffffff8008311d98
> > > > > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > ffffff80080e4fac
> > > > > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > 
> > > > > > The issue is real.
> > > > > > 
> > > > > > The fix, however, is not the right one.  The fundamental 
> > > > > > problem is that fuse_write_inode() blocks on a request to 
> > > > > > userspace.
> > > > > > 
> > > > > > This is the same issue that fuse_writepage/fuse_writepages 
> > > > > > face.  In that case the solution was to copy the page
> > > > > > contents 
> > > > > > to a temporary buffer and return immediately as if the 
> > > > > > writeback already completed.
> > > > > > 
> > > > > > Something similar needs to be done here: send the
> > > > > > FUSE_SETATTR 
> > > > > > request asynchronously and return immediately from 
> > > > > > fuse_write_inode().  The tricky part is to make sure that 
> > > > > > multiple time updates for the same inode aren't mixed up...
> > > > > > 
> > > > > > Thanks,
> > > > > > Miklos
> > > > > 
> > > > > Dear Szeredi,
> > > > > 
> > > > > Writeback thread calls fuse_write_inode() and wait for user 
> > > > > Daemon to complete this write inode request. The user daemon 
> > > > > will
> > > > > alloc_page()
> > > > > after taking this request, and a deadlock could happen when
> > > > > we 
> > > > > try to shrink dentry list under memory pressure.
> > > > > 
> > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > also LTS.
> > > > > So
> > > > > another problem is that we should not change the protocol or 
> > > > > feature for stable kernel.
> > > > > 
> > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip
> > > > > the 
> > > > > dentry shirnker. It works but degrade the alloc_page success 
> > > > > rate. In a more fundamental way, we could cache the contents
> > > > > and 
> > > > > return immediately.
> > > > > But how to ensure the request will be done successfully,
> > > > > e.g., 
> > > > > always retry if it fails from daemon.
> > > > 
> > > > Key is where the the dirty metadata is flushed.  To prevent 
> > > > deadlock it must not be flushed from memory reclaim, so must
> > > > make 
> > > > sure that it is flushed on close(2) and munmap(2) and not
> > > > dirtied after that.
> > > > 
> > > > I'm working on this currently and hope to get it ready for the 
> > > > next merge window.
> > > > 
> > > > Thanks,
> > > > Miklos
> > > 
> > > Hi Miklos,
> > > 
> > > I'm not sure whether it has already been resolved in mainline.
> > > If it still WIP, please cc me on future emails.
> > 
> > Hi,
> > 
> > This is taking a bit longer, unfortunately, but I already have 
> > something in testing and currently cleaning it up for review.  Hope
> > to 
> > post a series today or early next week.
> 
> 
> Here's a minimal patch.  It's been through some iterations and some
> testing, but more review and testing is definitely welcome.
> 
> Chenguanyou, can you please verify that it fixes the deadlock?
> 
> Thanks,
> Miklos
> 
> ---
> From: Miklos Szeredi <mszeredi@redhat.com>
> Subject: fuse: make sure reclaim doesn't write the inode
> 
> In writeback cache mode mtime/ctime updates are cached, and flushed
> to the server using the ->write_inode() callback.
> 
> Closing the file will result in a dirty inode being immediately
> written, but in other cases the inode can remain dirty after all
> references are dropped.  This result in the inode being written back
> from reclaim, which can deadlock on a regular allocation while the
> request is being served.
> 
> The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> because serving a request involves unrelated userspace process(es).
> 
> Instead do the same as for dirty pages: make sure the inode is
> written before the last reference is gone.
> 
>  - fuse_vma_close(): flush times in addition to the dirty pages
> 
>  - fallocate(2)/copy_file_range(2): these call file_update_time() or
>    file_modified(), so flush the inode before returning from the call
> 
>  - unlink(2), link(2) and rename(2): these call fuse_update_ctime(),
> so
>    flush the ctime directly from this helper
> 
> Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/fuse/dir.c    |    8 ++++++++
>  fs/fuse/file.c   |   24 +++++++++++++++++++++---
>  fs/fuse/fuse_i.h |    1 +
>  3 files changed, 30 insertions(+), 3 deletions(-)
> 
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
>  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);  }
>  
> +void fuse_flush_time_update(struct inode *inode) {
> +	int err = sync_inode_metadata(inode, 1);
> +
> +	mapping_set_error(inode->i_mapping, err); }
> +
>  void fuse_update_ctime(struct inode *inode)  {
>  	fuse_invalidate_attr(inode);
>  	if (!IS_NOCMTIME(inode)) {
>  		inode->i_ctime = current_time(inode);
>  		mark_inode_dirty_sync(inode);
> +		fuse_flush_time_update(inode);
>  	}
>  }
>  
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
>  	struct fuse_file *ff;
>  	int err;
>  
> +	/*
> +	 * Inode is always written before the last reference is dropped
> and
> +	 * hence this should not be reached from reclaim.
> +	 *
> +	 * Writing back the inode from reclaim can deadlock if the
> request
> +	 * processing itself needs an allocation.  Allocations
> triggering
> +	 * reclaim while serving a request can't be prevented, because
> it can
> +	 * involve any number of unrelated userspace processes.
> +	 */
> +	WARN_ON(wbc->for_reclaim);
> +
>  	ff = __fuse_write_file_get(fi);
>  	err = fuse_flush_times(inode, ff);
>  	if (ff)
> @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page  }
>  
>  /*
> - * Write back dirty pages now, because there may not be any suitable
> - * open files later
> + * Write back dirty data/metadata now (there may not be any suitable
> + * open files later for data)
>   */
>  static void fuse_vma_close(struct vm_area_struct *vma)  {
> -	filemap_write_and_wait(vma->vm_file->f_mapping);
> +	int err;
> +
> +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> +	mapping_set_error(vma->vm_file->f_mapping, err);
>  }
>  
>  /*
> @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
>  	if (lock_inode)
>  		inode_unlock(inode);
>  
> +	fuse_flush_time_update(inode);
> +
>  	return err;
>  }
>  
> @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
>  	inode_unlock(inode_out);
>  	file_accessed(file_in);
>  
> +	fuse_flush_time_update(inode_out);
> +
>  	return err;
>  }
>  
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
>  
>  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
>  
> +void fuse_flush_time_update(struct inode *inode);
>  void fuse_update_ctime(struct inode *inode);
>  
>  int fuse_update_attributes(struct inode *inode, struct file *file);

Hi Miklos,

Recently, we get this deadlock issue again. fuse_flush_time_update()
use sync_inode_metadata() and it only write the metadata, so the
writeback worker could still be blocked becaused of file data.

I try to use write_inode_now() instead of sync_inode_metadata() and the
writeback thread will not be blocked anymore. I don't think this is a
good solution, but this confirm that there is still a potential
deadlock because of file data. WDYT.

Best,
Ed Tsai


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-12-15  8:22                 ` Ed Tsai
@ 2021-12-15 13:52                   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 18+ messages in thread
From: Greg Kroah-Hartman @ 2021-12-15 13:52 UTC (permalink / raw)
  To: Ed Tsai
  Cc: Miklos Szeredi, stable, linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Wed, Dec 15, 2021 at 04:22:12PM +0800, Ed Tsai wrote:
> On Tue, 2021-12-14 at 17:38 +0800, Greg Kroah-Hartman wrote:
> > On Tue, Dec 14, 2021 at 05:25:01PM +0800, Ed Tsai wrote:
> > > On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote:
> > > > On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > > > > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > > > > wrote:
> > > > > > 
> > > > > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > > > > > chenguanyou9338@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > ABA deadlock
> > > > > > > > > > 
> > > > > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > > > > "Thread-21"
> > > > > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > > > > 4 [ffffff802d16b510] __wait_on_bit at
> > > > > > > > > > ffffff8009200a34
> > > > > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > > > > ffffff800830e1e8
> > > > > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > > > > ffffff80082f4c90
> > > > > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at
> > > > > > > > > > ffffff80082f1710
> > > > > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > > > > ffffff80082f1c34
> > > > > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at
> > > > > > > > > > ffffff80082f18a8
> > > > > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > > > > ffffff80082d55ac
> > > > > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at
> > > > > > > > > > ffffff8008268460
> > > > > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > > > > ffffff80082680d0
> > > > > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > > > > ffffff8008256514
> > > > > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at
> > > > > > > > > > ffffff8008438268
> > > > > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > > > > ffffff8008437654
> > > > > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at
> > > > > > > > > > ffffff8008436f40
> > > > > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > > > > 
> > > > > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > > > > "kworker/u16:8"
> > > > > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > > > > ffffff8008435760
> > > > > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > > > > ffffff8008435b14
> > > > > > > > > > 5 [ffffff802e793930] fuse_flush_times at
> > > > > > > > > > ffffff800843a7a0
> > > > > > > > > > 6 [ffffff802e793950] fuse_write_inode at
> > > > > > > > > > ffffff800843e4dc
> > > > > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > > > > ffffff8008312740
> > > > > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > > > > ffffff80083117e4
> > > > > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at
> > > > > > > > > > ffffff8008311d98
> > > > > > > > > > 10 [ffffff802e793c00] wb_writeback at
> > > > > > > > > > ffffff8008310cfc
> > > > > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > > > > ffffff80080e4fac
> > > > > > > > > > 13 [ffffff802e793e00] worker_thread at
> > > > > > > > > > ffffff80080e5670
> > > > > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > > > > 
> > > > > > > > > The issue is real.
> > > > > > > > > 
> > > > > > > > > The fix, however, is not the right one.  The
> > > > > > > > > fundamental
> > > > > > > > > problem
> > > > > > > > > is
> > > > > > > > > that fuse_write_inode() blocks on a request to
> > > > > > > > > userspace.
> > > > > > > > > 
> > > > > > > > > This is the same issue that
> > > > > > > > > fuse_writepage/fuse_writepages
> > > > > > > > > face.  In
> > > > > > > > > that case the solution was to copy the page contents to
> > > > > > > > > a
> > > > > > > > > temporary
> > > > > > > > > buffer and return immediately as if the writeback
> > > > > > > > > already
> > > > > > > > > completed.
> > > > > > > > > 
> > > > > > > > > Something similar needs to be done here: send the
> > > > > > > > > FUSE_SETATTR
> > > > > > > > > request
> > > > > > > > > asynchronously and return immediately from
> > > > > > > > > fuse_write_inode().  The
> > > > > > > > > tricky part is to make sure that multiple time updates
> > > > > > > > > for
> > > > > > > > > the
> > > > > > > > > same
> > > > > > > > > inode aren't mixed up...
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Miklos
> > > > > > > > 
> > > > > > > > Dear Szeredi,
> > > > > > > > 
> > > > > > > > Writeback thread calls fuse_write_inode() and wait for
> > > > > > > > user
> > > > > > > > Daemon
> > > > > > > > to
> > > > > > > > complete this write inode request. The user daemon will
> > > > > > > > alloc_page()
> > > > > > > > after taking this request, and a deadlock could happen
> > > > > > > > when
> > > > > > > > we try
> > > > > > > > to
> > > > > > > > shrink dentry list under memory pressure.
> > > > > > > > 
> > > > > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > > > > also LTS.
> > > > > > > > So
> > > > > > > > another problem is that we should not change the protocol
> > > > > > > > or
> > > > > > > > feature
> > > > > > > > for stable kernel.
> > > > > > > > 
> > > > > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by
> > > > > > > > skip
> > > > > > > > the
> > > > > > > > dentry
> > > > > > > > shirnker. It works but degrade the alloc_page success
> > > > > > > > rate.
> > > > > > > > In a
> > > > > > > > more
> > > > > > > > fundamental way, we could cache the contents and return
> > > > > > > > immediately.
> > > > > > > > But how to ensure the request will be done successfully,
> > > > > > > > e.g.,
> > > > > > > > always
> > > > > > > > retry if it fails from daemon.
> > > > > > > 
> > > > > > > Key is where the the dirty metadata is flushed.  To prevent
> > > > > > > deadlock
> > > > > > > it must not be flushed from memory reclaim, so must make
> > > > > > > sure
> > > > > > > that it
> > > > > > > is flushed on close(2) and munmap(2) and not dirtied after
> > > > > > > that.
> > > > > > > 
> > > > > > > I'm working on this currently and hope to get it ready for
> > > > > > > the
> > > > > > > next
> > > > > > > merge window.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Miklos
> > > > > > 
> > > > > > Hi Miklos,
> > > > > > 
> > > > > > I'm not sure whether it has already been resolved in
> > > > > > mainline.
> > > > > > If it still WIP, please cc me on future emails.
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > This is taking a bit longer, unfortunately, but I already have
> > > > > something in testing and currently cleaning it up for
> > > > > review.  Hope
> > > > > to
> > > > > post a series today or early next week.
> > > > 
> > > > 
> > > > Here's a minimal patch.  It's been through some iterations and
> > > > some
> > > > testing, but
> > > > more review and testing is definitely welcome.
> > > > 
> > > > Chenguanyou, can you please verify that it fixes the deadlock?
> > > > 
> > > > Thanks,
> > > > Miklos
> > > > 
> > > > ---
> > > > From: Miklos Szeredi <mszeredi@redhat.com>
> > > > Subject: fuse: make sure reclaim doesn't write the inode
> > > > 
> > > > In writeback cache mode mtime/ctime updates are cached, and
> > > > flushed
> > > > to the
> > > > server using the ->write_inode() callback.
> > > > 
> > > > Closing the file will result in a dirty inode being immediately
> > > > written,
> > > > but in other cases the inode can remain dirty after all
> > > > references
> > > > are
> > > > dropped.  This result in the inode being written back from
> > > > reclaim,
> > > > which
> > > > can deadlock on a regular allocation while the request is being
> > > > served.
> > > > 
> > > > The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> > > > because
> > > > serving a request involves unrelated userspace process(es).
> > > > 
> > > > Instead do the same as for dirty pages: make sure the inode is
> > > > written
> > > > before the last reference is gone.
> > > > 
> > > >  - fuse_vma_close(): flush times in addition to the dirty pages
> > > > 
> > > >  - fallocate(2)/copy_file_range(2): these call file_update_time()
> > > > or
> > > >    file_modified(), so flush the inode before returning from the
> > > > call
> > > > 
> > > >  - unlink(2), link(2) and rename(2): these call
> > > > fuse_update_ctime(),
> > > > so
> > > >    flush the ctime directly from this helper
> > > > 
> > > > Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> > > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > > > ---
> > > >  fs/fuse/dir.c    |    8 ++++++++
> > > >  fs/fuse/file.c   |   24 +++++++++++++++++++++---
> > > >  fs/fuse/fuse_i.h |    1 +
> > > >  3 files changed, 30 insertions(+), 3 deletions(-)
> > > > 
> > > > --- a/fs/fuse/dir.c
> > > > +++ b/fs/fuse/dir.c
> > > > @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
> > > >  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
> > > >  }
> > > >  
> > > > +void fuse_flush_time_update(struct inode *inode)
> > > > +{
> > > > +	int err = sync_inode_metadata(inode, 1);
> > > > +
> > > > +	mapping_set_error(inode->i_mapping, err);
> > > > +}
> > > > +
> > > >  void fuse_update_ctime(struct inode *inode)
> > > >  {
> > > >  	fuse_invalidate_attr(inode);
> > > >  	if (!IS_NOCMTIME(inode)) {
> > > >  		inode->i_ctime = current_time(inode);
> > > >  		mark_inode_dirty_sync(inode);
> > > > +		fuse_flush_time_update(inode);
> > > >  	}
> > > >  }
> > > >  
> > > > --- a/fs/fuse/file.c
> > > > +++ b/fs/fuse/file.c
> > > > @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
> > > >  	struct fuse_file *ff;
> > > >  	int err;
> > > >  
> > > > +	/*
> > > > +	 * Inode is always written before the last reference is dropped
> > > > and
> > > > +	 * hence this should not be reached from reclaim.
> > > > +	 *
> > > > +	 * Writing back the inode from reclaim can deadlock if the
> > > > request
> > > > +	 * processing itself needs an allocation.  Allocations
> > > > triggering
> > > > +	 * reclaim while serving a request can't be prevented, because
> > > > it can
> > > > +	 * involve any number of unrelated userspace processes.
> > > > +	 */
> > > > +	WARN_ON(wbc->for_reclaim);
> > > > +
> > > >  	ff = __fuse_write_file_get(fi);
> > > >  	err = fuse_flush_times(inode, ff);
> > > >  	if (ff)
> > > > @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
> > > >  }
> > > >  
> > > >  /*
> > > > - * Write back dirty pages now, because there may not be any
> > > > suitable
> > > > - * open files later
> > > > + * Write back dirty data/metadata now (there may not be any
> > > > suitable
> > > > + * open files later for data)
> > > >   */
> > > >  static void fuse_vma_close(struct vm_area_struct *vma)
> > > >  {
> > > > -	filemap_write_and_wait(vma->vm_file->f_mapping);
> > > > +	int err;
> > > > +
> > > > +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> > > > +	mapping_set_error(vma->vm_file->f_mapping, err);
> > > >  }
> > > >  
> > > >  /*
> > > > @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
> > > >  	if (lock_inode)
> > > >  		inode_unlock(inode);
> > > >  
> > > > +	fuse_flush_time_update(inode);
> > > > +
> > > >  	return err;
> > > >  }
> > > >  
> > > > @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
> > > >  	inode_unlock(inode_out);
> > > >  	file_accessed(file_in);
> > > >  
> > > > +	fuse_flush_time_update(inode_out);
> > > > +
> > > >  	return err;
> > > >  }
> > > >  
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
> > > >  
> > > >  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
> > > >  
> > > > +void fuse_flush_time_update(struct inode *inode);
> > > >  void fuse_update_ctime(struct inode *inode);
> > > >  
> > > >  int fuse_update_attributes(struct inode *inode, struct file
> > > > *file);
> > > 
> > > Hi Mikloz, Greg,
> > > 
> > > This deadlock issue could be raised in high memory pressure and the
> > > patch has been merged in commit 5c791fe ("fuse: make sure reclaim
> > > doesn't write the inode").
> > > 
> > > Can we take it to the LTS version?
> > 
> > What kernel tree(s) do you want this backported to?  Have you tested
> > it
> > that it will apply cleanly and work?
> > 
> > thanks,
> > 
> > greg k-h
> 
> Hi Greg, I want to take this commit to 5.10 LTS. This can be work to
> resolve the deadlock issue. Also, I have done some monkey tests on our
> ACK 5.10 phone to confirm the stability.

Queued up for 5.15.y and 5.10.y now, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-12-14  9:38               ` Greg Kroah-Hartman
@ 2021-12-15  8:22                 ` Ed Tsai
  2021-12-15 13:52                   ` Greg Kroah-Hartman
  0 siblings, 1 reply; 18+ messages in thread
From: Ed Tsai @ 2021-12-15  8:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Miklos Szeredi, stable, linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Tue, 2021-12-14 at 17:38 +0800, Greg Kroah-Hartman wrote:
> On Tue, Dec 14, 2021 at 05:25:01PM +0800, Ed Tsai wrote:
> > On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote:
> > > On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > > > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > > > wrote:
> > > > > 
> > > > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com
> > > > > > >
> > > > > > wrote:
> > > > > > > 
> > > > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > > > > chenguanyou9338@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > ABA deadlock
> > > > > > > > > 
> > > > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > > > "Thread-21"
> > > > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > > > 4 [ffffff802d16b510] __wait_on_bit at
> > > > > > > > > ffffff8009200a34
> > > > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > > > ffffff800830e1e8
> > > > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > > > ffffff80082f4c90
> > > > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at
> > > > > > > > > ffffff80082f1710
> > > > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > > > ffffff80082f1c34
> > > > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at
> > > > > > > > > ffffff80082f18a8
> > > > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > > > ffffff80082d55ac
> > > > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at
> > > > > > > > > ffffff8008268460
> > > > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > > > ffffff80082680d0
> > > > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > > > ffffff8008256514
> > > > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at
> > > > > > > > > ffffff8008438268
> > > > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > > > ffffff8008437654
> > > > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at
> > > > > > > > > ffffff8008436f40
> > > > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > > > 
> > > > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > > > "kworker/u16:8"
> > > > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > > > ffffff8008435760
> > > > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > > > ffffff8008435b14
> > > > > > > > > 5 [ffffff802e793930] fuse_flush_times at
> > > > > > > > > ffffff800843a7a0
> > > > > > > > > 6 [ffffff802e793950] fuse_write_inode at
> > > > > > > > > ffffff800843e4dc
> > > > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > > > ffffff8008312740
> > > > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > > > ffffff80083117e4
> > > > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at
> > > > > > > > > ffffff8008311d98
> > > > > > > > > 10 [ffffff802e793c00] wb_writeback at
> > > > > > > > > ffffff8008310cfc
> > > > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > > > ffffff80080e4fac
> > > > > > > > > 13 [ffffff802e793e00] worker_thread at
> > > > > > > > > ffffff80080e5670
> > > > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > > > 
> > > > > > > > The issue is real.
> > > > > > > > 
> > > > > > > > The fix, however, is not the right one.  The
> > > > > > > > fundamental
> > > > > > > > problem
> > > > > > > > is
> > > > > > > > that fuse_write_inode() blocks on a request to
> > > > > > > > userspace.
> > > > > > > > 
> > > > > > > > This is the same issue that
> > > > > > > > fuse_writepage/fuse_writepages
> > > > > > > > face.  In
> > > > > > > > that case the solution was to copy the page contents to
> > > > > > > > a
> > > > > > > > temporary
> > > > > > > > buffer and return immediately as if the writeback
> > > > > > > > already
> > > > > > > > completed.
> > > > > > > > 
> > > > > > > > Something similar needs to be done here: send the
> > > > > > > > FUSE_SETATTR
> > > > > > > > request
> > > > > > > > asynchronously and return immediately from
> > > > > > > > fuse_write_inode().  The
> > > > > > > > tricky part is to make sure that multiple time updates
> > > > > > > > for
> > > > > > > > the
> > > > > > > > same
> > > > > > > > inode aren't mixed up...
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Miklos
> > > > > > > 
> > > > > > > Dear Szeredi,
> > > > > > > 
> > > > > > > Writeback thread calls fuse_write_inode() and wait for
> > > > > > > user
> > > > > > > Daemon
> > > > > > > to
> > > > > > > complete this write inode request. The user daemon will
> > > > > > > alloc_page()
> > > > > > > after taking this request, and a deadlock could happen
> > > > > > > when
> > > > > > > we try
> > > > > > > to
> > > > > > > shrink dentry list under memory pressure.
> > > > > > > 
> > > > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > > > also LTS.
> > > > > > > So
> > > > > > > another problem is that we should not change the protocol
> > > > > > > or
> > > > > > > feature
> > > > > > > for stable kernel.
> > > > > > > 
> > > > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by
> > > > > > > skip
> > > > > > > the
> > > > > > > dentry
> > > > > > > shirnker. It works but degrade the alloc_page success
> > > > > > > rate.
> > > > > > > In a
> > > > > > > more
> > > > > > > fundamental way, we could cache the contents and return
> > > > > > > immediately.
> > > > > > > But how to ensure the request will be done successfully,
> > > > > > > e.g.,
> > > > > > > always
> > > > > > > retry if it fails from daemon.
> > > > > > 
> > > > > > Key is where the the dirty metadata is flushed.  To prevent
> > > > > > deadlock
> > > > > > it must not be flushed from memory reclaim, so must make
> > > > > > sure
> > > > > > that it
> > > > > > is flushed on close(2) and munmap(2) and not dirtied after
> > > > > > that.
> > > > > > 
> > > > > > I'm working on this currently and hope to get it ready for
> > > > > > the
> > > > > > next
> > > > > > merge window.
> > > > > > 
> > > > > > Thanks,
> > > > > > Miklos
> > > > > 
> > > > > Hi Miklos,
> > > > > 
> > > > > I'm not sure whether it has already been resolved in
> > > > > mainline.
> > > > > If it still WIP, please cc me on future emails.
> > > > 
> > > > Hi,
> > > > 
> > > > This is taking a bit longer, unfortunately, but I already have
> > > > something in testing and currently cleaning it up for
> > > > review.  Hope
> > > > to
> > > > post a series today or early next week.
> > > 
> > > 
> > > Here's a minimal patch.  It's been through some iterations and
> > > some
> > > testing, but
> > > more review and testing is definitely welcome.
> > > 
> > > Chenguanyou, can you please verify that it fixes the deadlock?
> > > 
> > > Thanks,
> > > Miklos
> > > 
> > > ---
> > > From: Miklos Szeredi <mszeredi@redhat.com>
> > > Subject: fuse: make sure reclaim doesn't write the inode
> > > 
> > > In writeback cache mode mtime/ctime updates are cached, and
> > > flushed
> > > to the
> > > server using the ->write_inode() callback.
> > > 
> > > Closing the file will result in a dirty inode being immediately
> > > written,
> > > but in other cases the inode can remain dirty after all
> > > references
> > > are
> > > dropped.  This result in the inode being written back from
> > > reclaim,
> > > which
> > > can deadlock on a regular allocation while the request is being
> > > served.
> > > 
> > > The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> > > because
> > > serving a request involves unrelated userspace process(es).
> > > 
> > > Instead do the same as for dirty pages: make sure the inode is
> > > written
> > > before the last reference is gone.
> > > 
> > >  - fuse_vma_close(): flush times in addition to the dirty pages
> > > 
> > >  - fallocate(2)/copy_file_range(2): these call file_update_time()
> > > or
> > >    file_modified(), so flush the inode before returning from the
> > > call
> > > 
> > >  - unlink(2), link(2) and rename(2): these call
> > > fuse_update_ctime(),
> > > so
> > >    flush the ctime directly from this helper
> > > 
> > > Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > > ---
> > >  fs/fuse/dir.c    |    8 ++++++++
> > >  fs/fuse/file.c   |   24 +++++++++++++++++++++---
> > >  fs/fuse/fuse_i.h |    1 +
> > >  3 files changed, 30 insertions(+), 3 deletions(-)
> > > 
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
> > >  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
> > >  }
> > >  
> > > +void fuse_flush_time_update(struct inode *inode)
> > > +{
> > > +	int err = sync_inode_metadata(inode, 1);
> > > +
> > > +	mapping_set_error(inode->i_mapping, err);
> > > +}
> > > +
> > >  void fuse_update_ctime(struct inode *inode)
> > >  {
> > >  	fuse_invalidate_attr(inode);
> > >  	if (!IS_NOCMTIME(inode)) {
> > >  		inode->i_ctime = current_time(inode);
> > >  		mark_inode_dirty_sync(inode);
> > > +		fuse_flush_time_update(inode);
> > >  	}
> > >  }
> > >  
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
> > >  	struct fuse_file *ff;
> > >  	int err;
> > >  
> > > +	/*
> > > +	 * Inode is always written before the last reference is dropped
> > > and
> > > +	 * hence this should not be reached from reclaim.
> > > +	 *
> > > +	 * Writing back the inode from reclaim can deadlock if the
> > > request
> > > +	 * processing itself needs an allocation.  Allocations
> > > triggering
> > > +	 * reclaim while serving a request can't be prevented, because
> > > it can
> > > +	 * involve any number of unrelated userspace processes.
> > > +	 */
> > > +	WARN_ON(wbc->for_reclaim);
> > > +
> > >  	ff = __fuse_write_file_get(fi);
> > >  	err = fuse_flush_times(inode, ff);
> > >  	if (ff)
> > > @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
> > >  }
> > >  
> > >  /*
> > > - * Write back dirty pages now, because there may not be any
> > > suitable
> > > - * open files later
> > > + * Write back dirty data/metadata now (there may not be any
> > > suitable
> > > + * open files later for data)
> > >   */
> > >  static void fuse_vma_close(struct vm_area_struct *vma)
> > >  {
> > > -	filemap_write_and_wait(vma->vm_file->f_mapping);
> > > +	int err;
> > > +
> > > +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> > > +	mapping_set_error(vma->vm_file->f_mapping, err);
> > >  }
> > >  
> > >  /*
> > > @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
> > >  	if (lock_inode)
> > >  		inode_unlock(inode);
> > >  
> > > +	fuse_flush_time_update(inode);
> > > +
> > >  	return err;
> > >  }
> > >  
> > > @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
> > >  	inode_unlock(inode_out);
> > >  	file_accessed(file_in);
> > >  
> > > +	fuse_flush_time_update(inode_out);
> > > +
> > >  	return err;
> > >  }
> > >  
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
> > >  
> > >  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
> > >  
> > > +void fuse_flush_time_update(struct inode *inode);
> > >  void fuse_update_ctime(struct inode *inode);
> > >  
> > >  int fuse_update_attributes(struct inode *inode, struct file
> > > *file);
> > 
> > Hi Mikloz, Greg,
> > 
> > This deadlock issue could be raised in high memory pressure and the
> > patch has been merged in commit 5c791fe ("fuse: make sure reclaim
> > doesn't write the inode").
> > 
> > Can we take it to the LTS version?
> 
> What kernel tree(s) do you want this backported to?  Have you tested
> it
> that it will apply cleanly and work?
> 
> thanks,
> 
> greg k-h

Hi Greg, I want to take this commit to 5.10 LTS. This can be work to
resolve the deadlock issue. Also, I have done some monkey tests on our
ACK 5.10 phone to confirm the stability.

Best,

Ed Tsai


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-12-14  9:25             ` Ed Tsai
@ 2021-12-14  9:38               ` Greg Kroah-Hartman
  2021-12-15  8:22                 ` Ed Tsai
  0 siblings, 1 reply; 18+ messages in thread
From: Greg Kroah-Hartman @ 2021-12-14  9:38 UTC (permalink / raw)
  To: Ed Tsai
  Cc: Miklos Szeredi, stable, linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Tue, Dec 14, 2021 at 05:25:01PM +0800, Ed Tsai wrote:
> On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote:
> > On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > > wrote:
> > > > 
> > > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com>
> > > > > wrote:
> > > > > > 
> > > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > > > chenguanyou9338@gmail.com>
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > ABA deadlock
> > > > > > > > 
> > > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > > "Thread-21"
> > > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > > ffffff800830e1e8
> > > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > > ffffff80082f4c90
> > > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > > ffffff80082f1c34
> > > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > > ffffff80082d55ac
> > > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at
> > > > > > > > ffffff8008268460
> > > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > > ffffff80082680d0
> > > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > > ffffff8008256514
> > > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > > ffffff8008437654
> > > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at
> > > > > > > > ffffff8008436f40
> > > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > > 
> > > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > > "kworker/u16:8"
> > > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > > ffffff8008435760
> > > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > > ffffff8008435b14
> > > > > > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > > > > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > > ffffff8008312740
> > > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > > ffffff80083117e4
> > > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at
> > > > > > > > ffffff8008311d98
> > > > > > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > > ffffff80080e4fac
> > > > > > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > > 
> > > > > > > The issue is real.
> > > > > > > 
> > > > > > > The fix, however, is not the right one.  The fundamental
> > > > > > > problem
> > > > > > > is
> > > > > > > that fuse_write_inode() blocks on a request to userspace.
> > > > > > > 
> > > > > > > This is the same issue that fuse_writepage/fuse_writepages
> > > > > > > face.  In
> > > > > > > that case the solution was to copy the page contents to a
> > > > > > > temporary
> > > > > > > buffer and return immediately as if the writeback already
> > > > > > > completed.
> > > > > > > 
> > > > > > > Something similar needs to be done here: send the
> > > > > > > FUSE_SETATTR
> > > > > > > request
> > > > > > > asynchronously and return immediately from
> > > > > > > fuse_write_inode().  The
> > > > > > > tricky part is to make sure that multiple time updates for
> > > > > > > the
> > > > > > > same
> > > > > > > inode aren't mixed up...
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Miklos
> > > > > > 
> > > > > > Dear Szeredi,
> > > > > > 
> > > > > > Writeback thread calls fuse_write_inode() and wait for user
> > > > > > Daemon
> > > > > > to
> > > > > > complete this write inode request. The user daemon will
> > > > > > alloc_page()
> > > > > > after taking this request, and a deadlock could happen when
> > > > > > we try
> > > > > > to
> > > > > > shrink dentry list under memory pressure.
> > > > > > 
> > > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > > also LTS.
> > > > > > So
> > > > > > another problem is that we should not change the protocol or
> > > > > > feature
> > > > > > for stable kernel.
> > > > > > 
> > > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip
> > > > > > the
> > > > > > dentry
> > > > > > shirnker. It works but degrade the alloc_page success rate.
> > > > > > In a
> > > > > > more
> > > > > > fundamental way, we could cache the contents and return
> > > > > > immediately.
> > > > > > But how to ensure the request will be done successfully,
> > > > > > e.g.,
> > > > > > always
> > > > > > retry if it fails from daemon.
> > > > > 
> > > > > Key is where the the dirty metadata is flushed.  To prevent
> > > > > deadlock
> > > > > it must not be flushed from memory reclaim, so must make sure
> > > > > that it
> > > > > is flushed on close(2) and munmap(2) and not dirtied after
> > > > > that.
> > > > > 
> > > > > I'm working on this currently and hope to get it ready for the
> > > > > next
> > > > > merge window.
> > > > > 
> > > > > Thanks,
> > > > > Miklos
> > > > 
> > > > Hi Miklos,
> > > > 
> > > > I'm not sure whether it has already been resolved in mainline.
> > > > If it still WIP, please cc me on future emails.
> > > 
> > > Hi,
> > > 
> > > This is taking a bit longer, unfortunately, but I already have
> > > something in testing and currently cleaning it up for review.  Hope
> > > to
> > > post a series today or early next week.
> > 
> > 
> > Here's a minimal patch.  It's been through some iterations and some
> > testing, but
> > more review and testing is definitely welcome.
> > 
> > Chenguanyou, can you please verify that it fixes the deadlock?
> > 
> > Thanks,
> > Miklos
> > 
> > ---
> > From: Miklos Szeredi <mszeredi@redhat.com>
> > Subject: fuse: make sure reclaim doesn't write the inode
> > 
> > In writeback cache mode mtime/ctime updates are cached, and flushed
> > to the
> > server using the ->write_inode() callback.
> > 
> > Closing the file will result in a dirty inode being immediately
> > written,
> > but in other cases the inode can remain dirty after all references
> > are
> > dropped.  This result in the inode being written back from reclaim,
> > which
> > can deadlock on a regular allocation while the request is being
> > served.
> > 
> > The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> > because
> > serving a request involves unrelated userspace process(es).
> > 
> > Instead do the same as for dirty pages: make sure the inode is
> > written
> > before the last reference is gone.
> > 
> >  - fuse_vma_close(): flush times in addition to the dirty pages
> > 
> >  - fallocate(2)/copy_file_range(2): these call file_update_time() or
> >    file_modified(), so flush the inode before returning from the call
> > 
> >  - unlink(2), link(2) and rename(2): these call fuse_update_ctime(),
> > so
> >    flush the ctime directly from this helper
> > 
> > Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > ---
> >  fs/fuse/dir.c    |    8 ++++++++
> >  fs/fuse/file.c   |   24 +++++++++++++++++++++---
> >  fs/fuse/fuse_i.h |    1 +
> >  3 files changed, 30 insertions(+), 3 deletions(-)
> > 
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
> >  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
> >  }
> >  
> > +void fuse_flush_time_update(struct inode *inode)
> > +{
> > +	int err = sync_inode_metadata(inode, 1);
> > +
> > +	mapping_set_error(inode->i_mapping, err);
> > +}
> > +
> >  void fuse_update_ctime(struct inode *inode)
> >  {
> >  	fuse_invalidate_attr(inode);
> >  	if (!IS_NOCMTIME(inode)) {
> >  		inode->i_ctime = current_time(inode);
> >  		mark_inode_dirty_sync(inode);
> > +		fuse_flush_time_update(inode);
> >  	}
> >  }
> >  
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
> >  	struct fuse_file *ff;
> >  	int err;
> >  
> > +	/*
> > +	 * Inode is always written before the last reference is dropped
> > and
> > +	 * hence this should not be reached from reclaim.
> > +	 *
> > +	 * Writing back the inode from reclaim can deadlock if the
> > request
> > +	 * processing itself needs an allocation.  Allocations
> > triggering
> > +	 * reclaim while serving a request can't be prevented, because
> > it can
> > +	 * involve any number of unrelated userspace processes.
> > +	 */
> > +	WARN_ON(wbc->for_reclaim);
> > +
> >  	ff = __fuse_write_file_get(fi);
> >  	err = fuse_flush_times(inode, ff);
> >  	if (ff)
> > @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
> >  }
> >  
> >  /*
> > - * Write back dirty pages now, because there may not be any suitable
> > - * open files later
> > + * Write back dirty data/metadata now (there may not be any suitable
> > + * open files later for data)
> >   */
> >  static void fuse_vma_close(struct vm_area_struct *vma)
> >  {
> > -	filemap_write_and_wait(vma->vm_file->f_mapping);
> > +	int err;
> > +
> > +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> > +	mapping_set_error(vma->vm_file->f_mapping, err);
> >  }
> >  
> >  /*
> > @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
> >  	if (lock_inode)
> >  		inode_unlock(inode);
> >  
> > +	fuse_flush_time_update(inode);
> > +
> >  	return err;
> >  }
> >  
> > @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
> >  	inode_unlock(inode_out);
> >  	file_accessed(file_in);
> >  
> > +	fuse_flush_time_update(inode_out);
> > +
> >  	return err;
> >  }
> >  
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
> >  
> >  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
> >  
> > +void fuse_flush_time_update(struct inode *inode);
> >  void fuse_update_ctime(struct inode *inode);
> >  
> >  int fuse_update_attributes(struct inode *inode, struct file *file);
> 
> Hi Mikloz, Greg,
> 
> This deadlock issue could be raised in high memory pressure and the
> patch has been merged in commit 5c791fe ("fuse: make sure reclaim
> doesn't write the inode").
> 
> Can we take it to the LTS version?

What kernel tree(s) do you want this backported to?  Have you tested it
that it will apply cleanly and work?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-09-28 15:25           ` Miklos Szeredi
@ 2021-12-14  9:25             ` Ed Tsai
  2021-12-14  9:38               ` Greg Kroah-Hartman
       [not found]             ` <SI2PR03MB5545E0B76E54013678B9FEEC8BA99@SI2PR03MB5545.apcprd03.prod.outlook.com>
  1 sibling, 1 reply; 18+ messages in thread
From: Ed Tsai @ 2021-12-14  9:25 UTC (permalink / raw)
  To: Miklos Szeredi, Greg Kroah-Hartman, stable
  Cc: linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote:
> On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > wrote:
> > > 
> > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com>
> > > > wrote:
> > > > > 
> > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > > chenguanyou9338@gmail.com>
> > > > > > wrote:
> > > > > > > 
> > > > > > > ABA deadlock
> > > > > > > 
> > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > "Thread-21"
> > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > ffffff800830e1e8
> > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > ffffff80082f4c90
> > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > ffffff80082f1c34
> > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > ffffff80082d55ac
> > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at
> > > > > > > ffffff8008268460
> > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > ffffff80082680d0
> > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > ffffff8008256514
> > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > ffffff8008437654
> > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at
> > > > > > > ffffff8008436f40
> > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > 
> > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > "kworker/u16:8"
> > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > ffffff8008435760
> > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > ffffff8008435b14
> > > > > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > > > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > ffffff8008312740
> > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > ffffff80083117e4
> > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at
> > > > > > > ffffff8008311d98
> > > > > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > ffffff80080e4fac
> > > > > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > 
> > > > > > The issue is real.
> > > > > > 
> > > > > > The fix, however, is not the right one.  The fundamental
> > > > > > problem
> > > > > > is
> > > > > > that fuse_write_inode() blocks on a request to userspace.
> > > > > > 
> > > > > > This is the same issue that fuse_writepage/fuse_writepages
> > > > > > face.  In
> > > > > > that case the solution was to copy the page contents to a
> > > > > > temporary
> > > > > > buffer and return immediately as if the writeback already
> > > > > > completed.
> > > > > > 
> > > > > > Something similar needs to be done here: send the
> > > > > > FUSE_SETATTR
> > > > > > request
> > > > > > asynchronously and return immediately from
> > > > > > fuse_write_inode().  The
> > > > > > tricky part is to make sure that multiple time updates for
> > > > > > the
> > > > > > same
> > > > > > inode aren't mixed up...
> > > > > > 
> > > > > > Thanks,
> > > > > > Miklos
> > > > > 
> > > > > Dear Szeredi,
> > > > > 
> > > > > Writeback thread calls fuse_write_inode() and wait for user
> > > > > Daemon
> > > > > to
> > > > > complete this write inode request. The user daemon will
> > > > > alloc_page()
> > > > > after taking this request, and a deadlock could happen when
> > > > > we try
> > > > > to
> > > > > shrink dentry list under memory pressure.
> > > > > 
> > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > also LTS.
> > > > > So
> > > > > another problem is that we should not change the protocol or
> > > > > feature
> > > > > for stable kernel.
> > > > > 
> > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip
> > > > > the
> > > > > dentry
> > > > > shirnker. It works but degrade the alloc_page success rate.
> > > > > In a
> > > > > more
> > > > > fundamental way, we could cache the contents and return
> > > > > immediately.
> > > > > But how to ensure the request will be done successfully,
> > > > > e.g.,
> > > > > always
> > > > > retry if it fails from daemon.
> > > > 
> > > > Key is where the the dirty metadata is flushed.  To prevent
> > > > deadlock
> > > > it must not be flushed from memory reclaim, so must make sure
> > > > that it
> > > > is flushed on close(2) and munmap(2) and not dirtied after
> > > > that.
> > > > 
> > > > I'm working on this currently and hope to get it ready for the
> > > > next
> > > > merge window.
> > > > 
> > > > Thanks,
> > > > Miklos
> > > 
> > > Hi Miklos,
> > > 
> > > I'm not sure whether it has already been resolved in mainline.
> > > If it still WIP, please cc me on future emails.
> > 
> > Hi,
> > 
> > This is taking a bit longer, unfortunately, but I already have
> > something in testing and currently cleaning it up for review.  Hope
> > to
> > post a series today or early next week.
> 
> 
> Here's a minimal patch.  It's been through some iterations and some
> testing, but
> more review and testing is definitely welcome.
> 
> Chenguanyou, can you please verify that it fixes the deadlock?
> 
> Thanks,
> Miklos
> 
> ---
> From: Miklos Szeredi <mszeredi@redhat.com>
> Subject: fuse: make sure reclaim doesn't write the inode
> 
> In writeback cache mode mtime/ctime updates are cached, and flushed
> to the
> server using the ->write_inode() callback.
> 
> Closing the file will result in a dirty inode being immediately
> written,
> but in other cases the inode can remain dirty after all references
> are
> dropped.  This result in the inode being written back from reclaim,
> which
> can deadlock on a regular allocation while the request is being
> served.
> 
> The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> because
> serving a request involves unrelated userspace process(es).
> 
> Instead do the same as for dirty pages: make sure the inode is
> written
> before the last reference is gone.
> 
>  - fuse_vma_close(): flush times in addition to the dirty pages
> 
>  - fallocate(2)/copy_file_range(2): these call file_update_time() or
>    file_modified(), so flush the inode before returning from the call
> 
>  - unlink(2), link(2) and rename(2): these call fuse_update_ctime(),
> so
>    flush the ctime directly from this helper
> 
> Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/fuse/dir.c    |    8 ++++++++
>  fs/fuse/file.c   |   24 +++++++++++++++++++++---
>  fs/fuse/fuse_i.h |    1 +
>  3 files changed, 30 insertions(+), 3 deletions(-)
> 
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
>  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
>  }
>  
> +void fuse_flush_time_update(struct inode *inode)
> +{
> +	int err = sync_inode_metadata(inode, 1);
> +
> +	mapping_set_error(inode->i_mapping, err);
> +}
> +
>  void fuse_update_ctime(struct inode *inode)
>  {
>  	fuse_invalidate_attr(inode);
>  	if (!IS_NOCMTIME(inode)) {
>  		inode->i_ctime = current_time(inode);
>  		mark_inode_dirty_sync(inode);
> +		fuse_flush_time_update(inode);
>  	}
>  }
>  
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
>  	struct fuse_file *ff;
>  	int err;
>  
> +	/*
> +	 * Inode is always written before the last reference is dropped
> and
> +	 * hence this should not be reached from reclaim.
> +	 *
> +	 * Writing back the inode from reclaim can deadlock if the
> request
> +	 * processing itself needs an allocation.  Allocations
> triggering
> +	 * reclaim while serving a request can't be prevented, because
> it can
> +	 * involve any number of unrelated userspace processes.
> +	 */
> +	WARN_ON(wbc->for_reclaim);
> +
>  	ff = __fuse_write_file_get(fi);
>  	err = fuse_flush_times(inode, ff);
>  	if (ff)
> @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
>  }
>  
>  /*
> - * Write back dirty pages now, because there may not be any suitable
> - * open files later
> + * Write back dirty data/metadata now (there may not be any suitable
> + * open files later for data)
>   */
>  static void fuse_vma_close(struct vm_area_struct *vma)
>  {
> -	filemap_write_and_wait(vma->vm_file->f_mapping);
> +	int err;
> +
> +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> +	mapping_set_error(vma->vm_file->f_mapping, err);
>  }
>  
>  /*
> @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
>  	if (lock_inode)
>  		inode_unlock(inode);
>  
> +	fuse_flush_time_update(inode);
> +
>  	return err;
>  }
>  
> @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
>  	inode_unlock(inode_out);
>  	file_accessed(file_in);
>  
> +	fuse_flush_time_update(inode_out);
> +
>  	return err;
>  }
>  
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
>  
>  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
>  
> +void fuse_flush_time_update(struct inode *inode);
>  void fuse_update_ctime(struct inode *inode);
>  
>  int fuse_update_attributes(struct inode *inode, struct file *file);

Hi Mikloz, Greg,

This deadlock issue could be raised in high memory pressure and the
patch has been merged in commit 5c791fe ("fuse: make sure reclaim
doesn't write the inode").

Can we take it to the LTS version?

Best,
Ed Tsai



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-09-24  7:52         ` Miklos Szeredi
@ 2021-09-28 15:25           ` Miklos Szeredi
  2021-12-14  9:25             ` Ed Tsai
       [not found]             ` <SI2PR03MB5545E0B76E54013678B9FEEC8BA99@SI2PR03MB5545.apcprd03.prod.outlook.com>
  0 siblings, 2 replies; 18+ messages in thread
From: Miklos Szeredi @ 2021-09-28 15:25 UTC (permalink / raw)
  To: chenguanyou
  Cc: linux-fsdevel, linux-kernel, Ed Tsai, chenguanyou,
	Stanley Chu (朱原陞)

On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com> wrote:
> >
> > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com> wrote:
> > > >
> > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > chenguanyou9338@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > ABA deadlock
> > > > > >
> > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
> > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > ffffff800830e1e8
> > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at ffffff80082f4c90
> > > > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at ffffff80082f1c34
> > > > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > > > > 12 [ffffff802d16b770] super_cache_scan at ffffff80082d55ac
> > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at ffffff8008268460
> > > > > > 16 [ffffff802d16ba60] try_to_free_pages at ffffff80082680d0
> > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > ffffff8008256514
> > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at ffffff8008437654
> > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at ffffff8008436f40
> > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > >
> > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > "kworker/u16:8"
> > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > 3 [ffffff802e793770] __fuse_request_send at ffffff8008435760
> > > > > > 4 [ffffff802e7937b0] fuse_simple_request at ffffff8008435b14
> > > > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > ffffff8008312740
> > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at ffffff80083117e4
> > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at ffffff8008311d98
> > > > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > 12 [ffffff802e793d90] process_one_work at ffffff80080e4fac
> > > > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > >
> > > > > The issue is real.
> > > > >
> > > > > The fix, however, is not the right one.  The fundamental problem
> > > > > is
> > > > > that fuse_write_inode() blocks on a request to userspace.
> > > > >
> > > > > This is the same issue that fuse_writepage/fuse_writepages
> > > > > face.  In
> > > > > that case the solution was to copy the page contents to a
> > > > > temporary
> > > > > buffer and return immediately as if the writeback already
> > > > > completed.
> > > > >
> > > > > Something similar needs to be done here: send the FUSE_SETATTR
> > > > > request
> > > > > asynchronously and return immediately from
> > > > > fuse_write_inode().  The
> > > > > tricky part is to make sure that multiple time updates for the
> > > > > same
> > > > > inode aren't mixed up...
> > > > >
> > > > > Thanks,
> > > > > Miklos
> > > >
> > > > Dear Szeredi,
> > > >
> > > > Writeback thread calls fuse_write_inode() and wait for user Daemon
> > > > to
> > > > complete this write inode request. The user daemon will
> > > > alloc_page()
> > > > after taking this request, and a deadlock could happen when we try
> > > > to
> > > > shrink dentry list under memory pressure.
> > > >
> > > > We (Mediatek) glad to work on this issue for mainline and also LTS.
> > > > So
> > > > another problem is that we should not change the protocol or
> > > > feature
> > > > for stable kernel.
> > > >
> > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip the
> > > > dentry
> > > > shirnker. It works but degrade the alloc_page success rate. In a
> > > > more
> > > > fundamental way, we could cache the contents and return
> > > > immediately.
> > > > But how to ensure the request will be done successfully, e.g.,
> > > > always
> > > > retry if it fails from daemon.
> > >
> > > Key is where the the dirty metadata is flushed.  To prevent deadlock
> > > it must not be flushed from memory reclaim, so must make sure that it
> > > is flushed on close(2) and munmap(2) and not dirtied after that.
> > >
> > > I'm working on this currently and hope to get it ready for the next
> > > merge window.
> > >
> > > Thanks,
> > > Miklos
> >
> > Hi Miklos,
> >
> > I'm not sure whether it has already been resolved in mainline.
> > If it still WIP, please cc me on future emails.
> 
> Hi,
> 
> This is taking a bit longer, unfortunately, but I already have
> something in testing and currently cleaning it up for review.  Hope to
> post a series today or early next week.


Here's a minimal patch.  It's been through some iterations and some testing, but
more review and testing is definitely welcome.

Chenguanyou, can you please verify that it fixes the deadlock?

Thanks,
Miklos

---
From: Miklos Szeredi <mszeredi@redhat.com>
Subject: fuse: make sure reclaim doesn't write the inode

In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped.  This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

 - fuse_vma_close(): flush times in addition to the dirty pages

 - fallocate(2)/copy_file_range(2): these call file_update_time() or
   file_modified(), so flush the inode before returning from the call

 - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
   flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fuse/dir.c    |    8 ++++++++
 fs/fuse/file.c   |   24 +++++++++++++++++++++---
 fs/fuse/fuse_i.h |    1 +
 3 files changed, 30 insertions(+), 3 deletions(-)

--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
 	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
 }
 
+void fuse_flush_time_update(struct inode *inode)
+{
+	int err = sync_inode_metadata(inode, 1);
+
+	mapping_set_error(inode->i_mapping, err);
+}
+
 void fuse_update_ctime(struct inode *inode)
 {
 	fuse_invalidate_attr(inode);
 	if (!IS_NOCMTIME(inode)) {
 		inode->i_ctime = current_time(inode);
 		mark_inode_dirty_sync(inode);
+		fuse_flush_time_update(inode);
 	}
 }
 
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
 	struct fuse_file *ff;
 	int err;
 
+	/*
+	 * Inode is always written before the last reference is dropped and
+	 * hence this should not be reached from reclaim.
+	 *
+	 * Writing back the inode from reclaim can deadlock if the request
+	 * processing itself needs an allocation.  Allocations triggering
+	 * reclaim while serving a request can't be prevented, because it can
+	 * involve any number of unrelated userspace processes.
+	 */
+	WARN_ON(wbc->for_reclaim);
+
 	ff = __fuse_write_file_get(fi);
 	err = fuse_flush_times(inode, ff);
 	if (ff)
@@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
 }
 
 /*
- * Write back dirty pages now, because there may not be any suitable
- * open files later
+ * Write back dirty data/metadata now (there may not be any suitable
+ * open files later for data)
  */
 static void fuse_vma_close(struct vm_area_struct *vma)
 {
-	filemap_write_and_wait(vma->vm_file->f_mapping);
+	int err;
+
+	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
+	mapping_set_error(vma->vm_file->f_mapping, err);
 }
 
 /*
@@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
 	if (lock_inode)
 		inode_unlock(inode);
 
+	fuse_flush_time_update(inode);
+
 	return err;
 }
 
@@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
 	inode_unlock(inode_out);
 	file_accessed(file_in);
 
+	fuse_flush_time_update(inode_out);
+
 	return err;
 }
 
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
 
 u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
 
+void fuse_flush_time_update(struct inode *inode);
 void fuse_update_ctime(struct inode *inode);
 
 int fuse_update_attributes(struct inode *inode, struct file *file);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-09-24  3:52       ` Ed Tsai
@ 2021-09-24  7:52         ` Miklos Szeredi
  2021-09-28 15:25           ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2021-09-24  7:52 UTC (permalink / raw)
  To: Ed Tsai
  Cc: linux-fsdevel, linux-kernel, chenguanyou, chenguanyou,
	Stanley Chu (朱原陞)

On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com> wrote:
>
> On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com> wrote:
> > >
> > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > chenguanyou9338@gmail.com>
> > > > wrote:
> > > > >
> > > > > ABA deadlock
> > > > >
> > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
> > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > ffffff800830e1e8
> > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > 8 [ffffff802d16b680] dentry_unlink_inode at ffffff80082f4c90
> > > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at ffffff80082f1c34
> > > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > > > 12 [ffffff802d16b770] super_cache_scan at ffffff80082d55ac
> > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > 15 [ffffff802d16b980] do_try_to_free_pages at ffffff8008268460
> > > > > 16 [ffffff802d16ba60] try_to_free_pages at ffffff80082680d0
> > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > ffffff8008256514
> > > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at ffffff8008437654
> > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at ffffff8008436f40
> > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > >
> > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > "kworker/u16:8"
> > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > 3 [ffffff802e793770] __fuse_request_send at ffffff8008435760
> > > > > 4 [ffffff802e7937b0] fuse_simple_request at ffffff8008435b14
> > > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > ffffff8008312740
> > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at ffffff80083117e4
> > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at ffffff8008311d98
> > > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > 12 [ffffff802e793d90] process_one_work at ffffff80080e4fac
> > > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > >
> > > > The issue is real.
> > > >
> > > > The fix, however, is not the right one.  The fundamental problem
> > > > is
> > > > that fuse_write_inode() blocks on a request to userspace.
> > > >
> > > > This is the same issue that fuse_writepage/fuse_writepages
> > > > face.  In
> > > > that case the solution was to copy the page contents to a
> > > > temporary
> > > > buffer and return immediately as if the writeback already
> > > > completed.
> > > >
> > > > Something similar needs to be done here: send the FUSE_SETATTR
> > > > request
> > > > asynchronously and return immediately from
> > > > fuse_write_inode().  The
> > > > tricky part is to make sure that multiple time updates for the
> > > > same
> > > > inode aren't mixed up...
> > > >
> > > > Thanks,
> > > > Miklos
> > >
> > > Dear Szeredi,
> > >
> > > Writeback thread calls fuse_write_inode() and wait for user Daemon
> > > to
> > > complete this write inode request. The user daemon will
> > > alloc_page()
> > > after taking this request, and a deadlock could happen when we try
> > > to
> > > shrink dentry list under memory pressure.
> > >
> > > We (Mediatek) glad to work on this issue for mainline and also LTS.
> > > So
> > > another problem is that we should not change the protocol or
> > > feature
> > > for stable kernel.
> > >
> > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip the
> > > dentry
> > > shirnker. It works but degrade the alloc_page success rate. In a
> > > more
> > > fundamental way, we could cache the contents and return
> > > immediately.
> > > But how to ensure the request will be done successfully, e.g.,
> > > always
> > > retry if it fails from daemon.
> >
> > Key is where the the dirty metadata is flushed.  To prevent deadlock
> > it must not be flushed from memory reclaim, so must make sure that it
> > is flushed on close(2) and munmap(2) and not dirtied after that.
> >
> > I'm working on this currently and hope to get it ready for the next
> > merge window.
> >
> > Thanks,
> > Miklos
>
> Hi Miklos,
>
> I'm not sure whether it has already been resolved in mainline.
> If it still WIP, please cc me on future emails.

Hi,

This is taking a bit longer, unfortunately, but I already have
something in testing and currently cleaning it up for review.  Hope to
post a series today or early next week.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-08-18  9:24     ` Miklos Szeredi
@ 2021-09-24  3:52       ` Ed Tsai
  2021-09-24  7:52         ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: Ed Tsai @ 2021-09-24  3:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, chenguanyou, chenguanyou,
	Stanley Chu (朱原陞)

On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com> wrote:
> > 
> > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > chenguanyou9338@gmail.com>
> > > wrote:
> > > > 
> > > > ABA deadlock
> > > > 
> > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
> > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > ffffff800830e1e8
> > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > 8 [ffffff802d16b680] dentry_unlink_inode at ffffff80082f4c90
> > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > > 10 [ffffff802d16b6d0] shrink_dentry_list at ffffff80082f1c34
> > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > > 12 [ffffff802d16b770] super_cache_scan at ffffff80082d55ac
> > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > 15 [ffffff802d16b980] do_try_to_free_pages at ffffff8008268460
> > > > 16 [ffffff802d16ba60] try_to_free_pages at ffffff80082680d0
> > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > ffffff8008256514
> > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > > 19 [ffffff802d16bd00] fuse_dev_do_read at ffffff8008437654
> > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at ffffff8008436f40
> > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > 
> > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > "kworker/u16:8"
> > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > 3 [ffffff802e793770] __fuse_request_send at ffffff8008435760
> > > > 4 [ffffff802e7937b0] fuse_simple_request at ffffff8008435b14
> > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > ffffff8008312740
> > > > 8 [ffffff802e793aa0] writeback_sb_inodes at ffffff80083117e4
> > > > 9 [ffffff802e793b00] __writeback_inodes_wb at ffffff8008311d98
> > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > 12 [ffffff802e793d90] process_one_work at ffffff80080e4fac
> > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > 
> > > The issue is real.
> > > 
> > > The fix, however, is not the right one.  The fundamental problem
> > > is
> > > that fuse_write_inode() blocks on a request to userspace.
> > > 
> > > This is the same issue that fuse_writepage/fuse_writepages
> > > face.  In
> > > that case the solution was to copy the page contents to a
> > > temporary
> > > buffer and return immediately as if the writeback already
> > > completed.
> > > 
> > > Something similar needs to be done here: send the FUSE_SETATTR
> > > request
> > > asynchronously and return immediately from
> > > fuse_write_inode().  The
> > > tricky part is to make sure that multiple time updates for the
> > > same
> > > inode aren't mixed up...
> > > 
> > > Thanks,
> > > Miklos
> > 
> > Dear Szeredi,
> > 
> > Writeback thread calls fuse_write_inode() and wait for user Daemon
> > to
> > complete this write inode request. The user daemon will
> > alloc_page()
> > after taking this request, and a deadlock could happen when we try
> > to
> > shrink dentry list under memory pressure.
> > 
> > We (Mediatek) glad to work on this issue for mainline and also LTS.
> > So
> > another problem is that we should not change the protocol or
> > feature
> > for stable kernel.
> > 
> > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip the
> > dentry
> > shirnker. It works but degrade the alloc_page success rate. In a
> > more
> > fundamental way, we could cache the contents and return
> > immediately.
> > But how to ensure the request will be done successfully, e.g.,
> > always
> > retry if it fails from daemon.
> 
> Key is where the the dirty metadata is flushed.  To prevent deadlock
> it must not be flushed from memory reclaim, so must make sure that it
> is flushed on close(2) and munmap(2) and not dirtied after that.
> 
> I'm working on this currently and hope to get it ready for the next
> merge window.
> 
> Thanks,
> Miklos

Hi Miklos,

I'm not sure whether it has already been resolved in mainline.
If it still WIP, please cc me on future emails.

Best regards,
Ed Tsai


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-07-13  2:42   ` Ed Tsai
@ 2021-08-18  9:24     ` Miklos Szeredi
  2021-09-24  3:52       ` Ed Tsai
  0 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2021-08-18  9:24 UTC (permalink / raw)
  To: Ed Tsai
  Cc: linux-fsdevel, linux-kernel, chenguanyou, chenguanyou, stanley.chu

On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com> wrote:
>
> On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > On Thu, 3 Jun 2021 at 14:52, chenguanyou <chenguanyou9338@gmail.com>
> > wrote:
> > >
> > > ABA deadlock
> > >
> > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
> > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at ffffff800830e1e8
> > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > 8 [ffffff802d16b680] dentry_unlink_inode at ffffff80082f4c90
> > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > 10 [ffffff802d16b6d0] shrink_dentry_list at ffffff80082f1c34
> > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > 12 [ffffff802d16b770] super_cache_scan at ffffff80082d55ac
> > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > 15 [ffffff802d16b980] do_try_to_free_pages at ffffff8008268460
> > > 16 [ffffff802d16ba60] try_to_free_pages at ffffff80082680d0
> > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at ffffff8008256514
> > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > 19 [ffffff802d16bd00] fuse_dev_do_read at ffffff8008437654
> > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at ffffff8008436f40
> > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > >
> > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND: "kworker/u16:8"
> > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > 3 [ffffff802e793770] __fuse_request_send at ffffff8008435760
> > > 4 [ffffff802e7937b0] fuse_simple_request at ffffff8008435b14
> > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > 7 [ffffff802e793980] __writeback_single_inode at ffffff8008312740
> > > 8 [ffffff802e793aa0] writeback_sb_inodes at ffffff80083117e4
> > > 9 [ffffff802e793b00] __writeback_inodes_wb at ffffff8008311d98
> > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > 12 [ffffff802e793d90] process_one_work at ffffff80080e4fac
> > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> >
> > The issue is real.
> >
> > The fix, however, is not the right one.  The fundamental problem is
> > that fuse_write_inode() blocks on a request to userspace.
> >
> > This is the same issue that fuse_writepage/fuse_writepages face.  In
> > that case the solution was to copy the page contents to a temporary
> > buffer and return immediately as if the writeback already completed.
> >
> > Something similar needs to be done here: send the FUSE_SETATTR
> > request
> > asynchronously and return immediately from fuse_write_inode().  The
> > tricky part is to make sure that multiple time updates for the same
> > inode aren't mixed up...
> >
> > Thanks,
> > Miklos
>
> Dear Szeredi,
>
> Writeback thread calls fuse_write_inode() and wait for user Daemon to
> complete this write inode request. The user daemon will alloc_page()
> after taking this request, and a deadlock could happen when we try to
> shrink dentry list under memory pressure.
>
> We (Mediatek) glad to work on this issue for mainline and also LTS. So
> another problem is that we should not change the protocol or feature
> for stable kernel.
>
> Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip the dentry
> shirnker. It works but degrade the alloc_page success rate. In a more
> fundamental way, we could cache the contents and return immediately.
> But how to ensure the request will be done successfully, e.g., always
> retry if it fails from daemon.

Key is where the the dirty metadata is flushed.  To prevent deadlock
it must not be flushed from memory reclaim, so must make sure that it
is flushed on close(2) and munmap(2) and not dirtied after that.

I'm working on this currently and hope to get it ready for the next
merge window.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-06-08 15:30 ` Miklos Szeredi
@ 2021-07-13  2:42   ` Ed Tsai
  2021-08-18  9:24     ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: Ed Tsai @ 2021-07-13  2:42 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, chenguanyou, chenguanyou, stanley.chu

On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> On Thu, 3 Jun 2021 at 14:52, chenguanyou <chenguanyou9338@gmail.com>
> wrote:
> > 
> > ABA deadlock
> > 
> > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
> > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > 5 [ffffff802d16b5b0] inode_wait_for_writeback at ffffff800830e1e8
> > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > 8 [ffffff802d16b680] dentry_unlink_inode at ffffff80082f4c90
> > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > 10 [ffffff802d16b6d0] shrink_dentry_list at ffffff80082f1c34
> > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > 12 [ffffff802d16b770] super_cache_scan at ffffff80082d55ac
> > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > 15 [ffffff802d16b980] do_try_to_free_pages at ffffff8008268460
> > 16 [ffffff802d16ba60] try_to_free_pages at ffffff80082680d0
> > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at ffffff8008256514
> > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > 19 [ffffff802d16bd00] fuse_dev_do_read at ffffff8008437654
> > 20 [ffffff802d16bdc0] fuse_dev_splice_read at ffffff8008436f40
> > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > 
> > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND: "kworker/u16:8"
> > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > 2 [ffffff802e793720] schedule at ffffff8009200348
> > 3 [ffffff802e793770] __fuse_request_send at ffffff8008435760
> > 4 [ffffff802e7937b0] fuse_simple_request at ffffff8008435b14
> > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > 7 [ffffff802e793980] __writeback_single_inode at ffffff8008312740
> > 8 [ffffff802e793aa0] writeback_sb_inodes at ffffff80083117e4
> > 9 [ffffff802e793b00] __writeback_inodes_wb at ffffff8008311d98
> > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > 12 [ffffff802e793d90] process_one_work at ffffff80080e4fac
> > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> 
> The issue is real.
> 
> The fix, however, is not the right one.  The fundamental problem is
> that fuse_write_inode() blocks on a request to userspace.
> 
> This is the same issue that fuse_writepage/fuse_writepages face.  In
> that case the solution was to copy the page contents to a temporary
> buffer and return immediately as if the writeback already completed.
> 
> Something similar needs to be done here: send the FUSE_SETATTR
> request
> asynchronously and return immediately from fuse_write_inode().  The
> tricky part is to make sure that multiple time updates for the same
> inode aren't mixed up...
> 
> Thanks,
> Miklos

Dear Szeredi,

Writeback thread calls fuse_write_inode() and wait for user Daemon to
complete this write inode request. The user daemon will alloc_page()
after taking this request, and a deadlock could happen when we try to
shrink dentry list under memory pressure.

We (Mediatek) glad to work on this issue for mainline and also LTS. So
another problem is that we should not change the protocol or feature
for stable kernel.

Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip the dentry
shirnker. It works but degrade the alloc_page success rate. In a more
fundamental way, we could cache the contents and return immediately.
But how to ensure the request will be done successfully, e.g., always
retry if it fails from daemon.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-06-03 12:52 chenguanyou
@ 2021-06-08 15:30 ` Miklos Szeredi
  2021-07-13  2:42   ` Ed Tsai
  0 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2021-06-08 15:30 UTC (permalink / raw)
  To: chenguanyou; +Cc: linux-fsdevel, linux-kernel, chenguanyou

On Thu, 3 Jun 2021 at 14:52, chenguanyou <chenguanyou9338@gmail.com> wrote:
>
> ABA deadlock
>
> PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
> 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> 5 [ffffff802d16b5b0] inode_wait_for_writeback at ffffff800830e1e8
> 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> 7 [ffffff802d16b620] iput at ffffff80082f9270
> 8 [ffffff802d16b680] dentry_unlink_inode at ffffff80082f4c90
> 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> 10 [ffffff802d16b6d0] shrink_dentry_list at ffffff80082f1c34
> 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> 12 [ffffff802d16b770] super_cache_scan at ffffff80082d55ac
> 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> 15 [ffffff802d16b980] do_try_to_free_pages at ffffff8008268460
> 16 [ffffff802d16ba60] try_to_free_pages at ffffff80082680d0
> 17 [ffffff802d16bbe0] __alloc_pages_nodemask at ffffff8008256514
> 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> 19 [ffffff802d16bd00] fuse_dev_do_read at ffffff8008437654
> 20 [ffffff802d16bdc0] fuse_dev_splice_read at ffffff8008436f40
> 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
>
> PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND: "kworker/u16:8"
> 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> 2 [ffffff802e793720] schedule at ffffff8009200348
> 3 [ffffff802e793770] __fuse_request_send at ffffff8008435760
> 4 [ffffff802e7937b0] fuse_simple_request at ffffff8008435b14
> 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> 7 [ffffff802e793980] __writeback_single_inode at ffffff8008312740
> 8 [ffffff802e793aa0] writeback_sb_inodes at ffffff80083117e4
> 9 [ffffff802e793b00] __writeback_inodes_wb at ffffff8008311d98
> 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> 12 [ffffff802e793d90] process_one_work at ffffff80080e4fac
> 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> 14 [ffffff802e793e60] kthread at ffffff80080eb650

The issue is real.

The fix, however, is not the right one.  The fundamental problem is
that fuse_write_inode() blocks on a request to userspace.

This is the same issue that fuse_writepage/fuse_writepages face.  In
that case the solution was to copy the page contents to a temporary
buffer and return immediately as if the writeback already completed.

Something similar needs to be done here: send the FUSE_SETATTR request
asynchronously and return immediately from fuse_write_inode().  The
tricky part is to make sure that multiple time updates for the same
inode aren't mixed up...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] [fuse] alloc_page nofs avoid deadlock
@ 2021-06-03 12:52 chenguanyou
  2021-06-08 15:30 ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: chenguanyou @ 2021-06-03 12:52 UTC (permalink / raw)
  To: miklos; +Cc: linux-fsdevel, linux-kernel, chenguanyou

ABA deadlock

PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
1 [ffffff802d16b470] __schedule at ffffff80091ffe58
2 [ffffff802d16b4d0] schedule at ffffff8009200348
3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
5 [ffffff802d16b5b0] inode_wait_for_writeback at ffffff800830e1e8
6 [ffffff802d16b5e0] evict at ffffff80082fb15c
7 [ffffff802d16b620] iput at ffffff80082f9270
8 [ffffff802d16b680] dentry_unlink_inode at ffffff80082f4c90
9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
10 [ffffff802d16b6d0] shrink_dentry_list at ffffff80082f1c34
11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
12 [ffffff802d16b770] super_cache_scan at ffffff80082d55ac
13 [ffffff802d16b860] shrink_slab at ffffff8008266170
14 [ffffff802d16b900] shrink_node at ffffff800826b420
15 [ffffff802d16b980] do_try_to_free_pages at ffffff8008268460
16 [ffffff802d16ba60] try_to_free_pages at ffffff80082680d0
17 [ffffff802d16bbe0] __alloc_pages_nodemask at ffffff8008256514
18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
19 [ffffff802d16bd00] fuse_dev_do_read at ffffff8008437654
20 [ffffff802d16bdc0] fuse_dev_splice_read at ffffff8008436f40
21 [ffffff802d16be60] sys_splice at ffffff8008315d18
22 [ffffff802d16bff0] __sys_trace at ffffff8008084014

PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND: "kworker/u16:8"
0 [ffffff802e793650] __switch_to at ffffff8008086a4c
1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
2 [ffffff802e793720] schedule at ffffff8009200348
3 [ffffff802e793770] __fuse_request_send at ffffff8008435760
4 [ffffff802e7937b0] fuse_simple_request at ffffff8008435b14
5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
7 [ffffff802e793980] __writeback_single_inode at ffffff8008312740
8 [ffffff802e793aa0] writeback_sb_inodes at ffffff80083117e4
9 [ffffff802e793b00] __writeback_inodes_wb at ffffff8008311d98
10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
12 [ffffff802e793d90] process_one_work at ffffff80080e4fac
13 [ffffff802e793e00] worker_thread at ffffff80080e5670
14 [ffffff802e793e60] kthread at ffffff80080eb650

Signed-off-by: chenguanyou <chenguanyou@xiaomi.com>
---
 fs/fuse/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index c0fee830a34e..d36125ff0405 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -721,7 +721,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 			if (cs->nr_segs >= cs->pipe->max_usage)
 				return -EIO;
 
-			page = alloc_page(GFP_HIGHUSER);
+			page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
 			if (!page)
 				return -ENOMEM;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-07-12  1:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-03 12:41 [PATCH] [fuse] alloc_page nofs avoid deadlock chenguanyou
2021-06-03 12:56 ` Michal Hocko
2021-06-03 12:52 chenguanyou
2021-06-08 15:30 ` Miklos Szeredi
2021-07-13  2:42   ` Ed Tsai
2021-08-18  9:24     ` Miklos Szeredi
2021-09-24  3:52       ` Ed Tsai
2021-09-24  7:52         ` Miklos Szeredi
2021-09-28 15:25           ` Miklos Szeredi
2021-12-14  9:25             ` Ed Tsai
2021-12-14  9:38               ` Greg Kroah-Hartman
2021-12-15  8:22                 ` Ed Tsai
2021-12-15 13:52                   ` Greg Kroah-Hartman
     [not found]             ` <SI2PR03MB5545E0B76E54013678B9FEEC8BA99@SI2PR03MB5545.apcprd03.prod.outlook.com>
2022-06-10  7:48               ` Ed Tsai
2022-06-13  8:45                 ` Miklos Szeredi
2022-06-13  9:29                   ` Ed Tsai
2022-07-11  7:49                     ` Miklos Szeredi
2022-07-12  1:16                       ` Ed Tsai

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.