stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
       [not found]           ` <YVMz8E1Lg/GZQcjw@miu.piliscsaba.redhat.com>
@ 2021-12-14  9:25             ` Ed Tsai
  2021-12-14  9:38               ` Greg Kroah-Hartman
  0 siblings, 1 reply; 4+ messages in thread
From: Ed Tsai @ 2021-12-14  9:25 UTC (permalink / raw)
  To: Miklos Szeredi, Greg Kroah-Hartman, stable
  Cc: linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote:
> On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > wrote:
> > > 
> > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com>
> > > > wrote:
> > > > > 
> > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > > chenguanyou9338@gmail.com>
> > > > > > wrote:
> > > > > > > 
> > > > > > > ABA deadlock
> > > > > > > 
> > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > "Thread-21"
> > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > ffffff800830e1e8
> > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > ffffff80082f4c90
> > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > ffffff80082f1c34
> > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > ffffff80082d55ac
> > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at
> > > > > > > ffffff8008268460
> > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > ffffff80082680d0
> > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > ffffff8008256514
> > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > ffffff8008437654
> > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at
> > > > > > > ffffff8008436f40
> > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > 
> > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > "kworker/u16:8"
> > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > ffffff8008435760
> > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > ffffff8008435b14
> > > > > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > > > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > ffffff8008312740
> > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > ffffff80083117e4
> > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at
> > > > > > > ffffff8008311d98
> > > > > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > ffffff80080e4fac
> > > > > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > 
> > > > > > The issue is real.
> > > > > > 
> > > > > > The fix, however, is not the right one.  The fundamental
> > > > > > problem
> > > > > > is
> > > > > > that fuse_write_inode() blocks on a request to userspace.
> > > > > > 
> > > > > > This is the same issue that fuse_writepage/fuse_writepages
> > > > > > face.  In
> > > > > > that case the solution was to copy the page contents to a
> > > > > > temporary
> > > > > > buffer and return immediately as if the writeback already
> > > > > > completed.
> > > > > > 
> > > > > > Something similar needs to be done here: send the
> > > > > > FUSE_SETATTR
> > > > > > request
> > > > > > asynchronously and return immediately from
> > > > > > fuse_write_inode().  The
> > > > > > tricky part is to make sure that multiple time updates for
> > > > > > the
> > > > > > same
> > > > > > inode aren't mixed up...
> > > > > > 
> > > > > > Thanks,
> > > > > > Miklos
> > > > > 
> > > > > Dear Szeredi,
> > > > > 
> > > > > Writeback thread calls fuse_write_inode() and wait for user
> > > > > Daemon
> > > > > to
> > > > > complete this write inode request. The user daemon will
> > > > > alloc_page()
> > > > > after taking this request, and a deadlock could happen when
> > > > > we try
> > > > > to
> > > > > shrink dentry list under memory pressure.
> > > > > 
> > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > also LTS.
> > > > > So
> > > > > another problem is that we should not change the protocol or
> > > > > feature
> > > > > for stable kernel.
> > > > > 
> > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip
> > > > > the
> > > > > dentry
> > > > > shirnker. It works but degrade the alloc_page success rate.
> > > > > In a
> > > > > more
> > > > > fundamental way, we could cache the contents and return
> > > > > immediately.
> > > > > But how to ensure the request will be done successfully,
> > > > > e.g.,
> > > > > always
> > > > > retry if it fails from daemon.
> > > > 
> > > > Key is where the the dirty metadata is flushed.  To prevent
> > > > deadlock
> > > > it must not be flushed from memory reclaim, so must make sure
> > > > that it
> > > > is flushed on close(2) and munmap(2) and not dirtied after
> > > > that.
> > > > 
> > > > I'm working on this currently and hope to get it ready for the
> > > > next
> > > > merge window.
> > > > 
> > > > Thanks,
> > > > Miklos
> > > 
> > > Hi Miklos,
> > > 
> > > I'm not sure whether it has already been resolved in mainline.
> > > If it still WIP, please cc me on future emails.
> > 
> > Hi,
> > 
> > This is taking a bit longer, unfortunately, but I already have
> > something in testing and currently cleaning it up for review.  Hope
> > to
> > post a series today or early next week.
> 
> 
> Here's a minimal patch.  It's been through some iterations and some
> testing, but
> more review and testing is definitely welcome.
> 
> Chenguanyou, can you please verify that it fixes the deadlock?
> 
> Thanks,
> Miklos
> 
> ---
> From: Miklos Szeredi <mszeredi@redhat.com>
> Subject: fuse: make sure reclaim doesn't write the inode
> 
> In writeback cache mode mtime/ctime updates are cached, and flushed
> to the
> server using the ->write_inode() callback.
> 
> Closing the file will result in a dirty inode being immediately
> written,
> but in other cases the inode can remain dirty after all references
> are
> dropped.  This result in the inode being written back from reclaim,
> which
> can deadlock on a regular allocation while the request is being
> served.
> 
> The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> because
> serving a request involves unrelated userspace process(es).
> 
> Instead do the same as for dirty pages: make sure the inode is
> written
> before the last reference is gone.
> 
>  - fuse_vma_close(): flush times in addition to the dirty pages
> 
>  - fallocate(2)/copy_file_range(2): these call file_update_time() or
>    file_modified(), so flush the inode before returning from the call
> 
>  - unlink(2), link(2) and rename(2): these call fuse_update_ctime(),
> so
>    flush the ctime directly from this helper
> 
> Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/fuse/dir.c    |    8 ++++++++
>  fs/fuse/file.c   |   24 +++++++++++++++++++++---
>  fs/fuse/fuse_i.h |    1 +
>  3 files changed, 30 insertions(+), 3 deletions(-)
> 
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
>  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
>  }
>  
> +void fuse_flush_time_update(struct inode *inode)
> +{
> +	int err = sync_inode_metadata(inode, 1);
> +
> +	mapping_set_error(inode->i_mapping, err);
> +}
> +
>  void fuse_update_ctime(struct inode *inode)
>  {
>  	fuse_invalidate_attr(inode);
>  	if (!IS_NOCMTIME(inode)) {
>  		inode->i_ctime = current_time(inode);
>  		mark_inode_dirty_sync(inode);
> +		fuse_flush_time_update(inode);
>  	}
>  }
>  
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
>  	struct fuse_file *ff;
>  	int err;
>  
> +	/*
> +	 * Inode is always written before the last reference is dropped
> and
> +	 * hence this should not be reached from reclaim.
> +	 *
> +	 * Writing back the inode from reclaim can deadlock if the
> request
> +	 * processing itself needs an allocation.  Allocations
> triggering
> +	 * reclaim while serving a request can't be prevented, because
> it can
> +	 * involve any number of unrelated userspace processes.
> +	 */
> +	WARN_ON(wbc->for_reclaim);
> +
>  	ff = __fuse_write_file_get(fi);
>  	err = fuse_flush_times(inode, ff);
>  	if (ff)
> @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
>  }
>  
>  /*
> - * Write back dirty pages now, because there may not be any suitable
> - * open files later
> + * Write back dirty data/metadata now (there may not be any suitable
> + * open files later for data)
>   */
>  static void fuse_vma_close(struct vm_area_struct *vma)
>  {
> -	filemap_write_and_wait(vma->vm_file->f_mapping);
> +	int err;
> +
> +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> +	mapping_set_error(vma->vm_file->f_mapping, err);
>  }
>  
>  /*
> @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
>  	if (lock_inode)
>  		inode_unlock(inode);
>  
> +	fuse_flush_time_update(inode);
> +
>  	return err;
>  }
>  
> @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
>  	inode_unlock(inode_out);
>  	file_accessed(file_in);
>  
> +	fuse_flush_time_update(inode_out);
> +
>  	return err;
>  }
>  
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
>  
>  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
>  
> +void fuse_flush_time_update(struct inode *inode);
>  void fuse_update_ctime(struct inode *inode);
>  
>  int fuse_update_attributes(struct inode *inode, struct file *file);

Hi Mikloz, Greg,

This deadlock issue could be raised in high memory pressure and the
patch has been merged in commit 5c791fe ("fuse: make sure reclaim
doesn't write the inode").

Can we take it to the LTS version?

Best,
Ed Tsai



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-12-14  9:25             ` [PATCH] [fuse] alloc_page nofs avoid deadlock Ed Tsai
@ 2021-12-14  9:38               ` Greg Kroah-Hartman
  2021-12-15  8:22                 ` Ed Tsai
  0 siblings, 1 reply; 4+ messages in thread
From: Greg Kroah-Hartman @ 2021-12-14  9:38 UTC (permalink / raw)
  To: Ed Tsai
  Cc: Miklos Szeredi, stable, linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Tue, Dec 14, 2021 at 05:25:01PM +0800, Ed Tsai wrote:
> On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote:
> > On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > > wrote:
> > > > 
> > > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com>
> > > > > wrote:
> > > > > > 
> > > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > > > chenguanyou9338@gmail.com>
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > ABA deadlock
> > > > > > > > 
> > > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > > "Thread-21"
> > > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
> > > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > > ffffff800830e1e8
> > > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > > ffffff80082f4c90
> > > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
> > > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > > ffffff80082f1c34
> > > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
> > > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > > ffffff80082d55ac
> > > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at
> > > > > > > > ffffff8008268460
> > > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > > ffffff80082680d0
> > > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > > ffffff8008256514
> > > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
> > > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > > ffffff8008437654
> > > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at
> > > > > > > > ffffff8008436f40
> > > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > > 
> > > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > > "kworker/u16:8"
> > > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > > ffffff8008435760
> > > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > > ffffff8008435b14
> > > > > > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
> > > > > > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
> > > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > > ffffff8008312740
> > > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > > ffffff80083117e4
> > > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at
> > > > > > > > ffffff8008311d98
> > > > > > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
> > > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > > ffffff80080e4fac
> > > > > > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670
> > > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > > 
> > > > > > > The issue is real.
> > > > > > > 
> > > > > > > The fix, however, is not the right one.  The fundamental
> > > > > > > problem
> > > > > > > is
> > > > > > > that fuse_write_inode() blocks on a request to userspace.
> > > > > > > 
> > > > > > > This is the same issue that fuse_writepage/fuse_writepages
> > > > > > > face.  In
> > > > > > > that case the solution was to copy the page contents to a
> > > > > > > temporary
> > > > > > > buffer and return immediately as if the writeback already
> > > > > > > completed.
> > > > > > > 
> > > > > > > Something similar needs to be done here: send the
> > > > > > > FUSE_SETATTR
> > > > > > > request
> > > > > > > asynchronously and return immediately from
> > > > > > > fuse_write_inode().  The
> > > > > > > tricky part is to make sure that multiple time updates for
> > > > > > > the
> > > > > > > same
> > > > > > > inode aren't mixed up...
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Miklos
> > > > > > 
> > > > > > Dear Szeredi,
> > > > > > 
> > > > > > Writeback thread calls fuse_write_inode() and wait for user
> > > > > > Daemon
> > > > > > to
> > > > > > complete this write inode request. The user daemon will
> > > > > > alloc_page()
> > > > > > after taking this request, and a deadlock could happen when
> > > > > > we try
> > > > > > to
> > > > > > shrink dentry list under memory pressure.
> > > > > > 
> > > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > > also LTS.
> > > > > > So
> > > > > > another problem is that we should not change the protocol or
> > > > > > feature
> > > > > > for stable kernel.
> > > > > > 
> > > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip
> > > > > > the
> > > > > > dentry
> > > > > > shirnker. It works but degrade the alloc_page success rate.
> > > > > > In a
> > > > > > more
> > > > > > fundamental way, we could cache the contents and return
> > > > > > immediately.
> > > > > > But how to ensure the request will be done successfully,
> > > > > > e.g.,
> > > > > > always
> > > > > > retry if it fails from daemon.
> > > > > 
> > > > > Key is where the the dirty metadata is flushed.  To prevent
> > > > > deadlock
> > > > > it must not be flushed from memory reclaim, so must make sure
> > > > > that it
> > > > > is flushed on close(2) and munmap(2) and not dirtied after
> > > > > that.
> > > > > 
> > > > > I'm working on this currently and hope to get it ready for the
> > > > > next
> > > > > merge window.
> > > > > 
> > > > > Thanks,
> > > > > Miklos
> > > > 
> > > > Hi Miklos,
> > > > 
> > > > I'm not sure whether it has already been resolved in mainline.
> > > > If it still WIP, please cc me on future emails.
> > > 
> > > Hi,
> > > 
> > > This is taking a bit longer, unfortunately, but I already have
> > > something in testing and currently cleaning it up for review.  Hope
> > > to
> > > post a series today or early next week.
> > 
> > 
> > Here's a minimal patch.  It's been through some iterations and some
> > testing, but
> > more review and testing is definitely welcome.
> > 
> > Chenguanyou, can you please verify that it fixes the deadlock?
> > 
> > Thanks,
> > Miklos
> > 
> > ---
> > From: Miklos Szeredi <mszeredi@redhat.com>
> > Subject: fuse: make sure reclaim doesn't write the inode
> > 
> > In writeback cache mode mtime/ctime updates are cached, and flushed
> > to the
> > server using the ->write_inode() callback.
> > 
> > Closing the file will result in a dirty inode being immediately
> > written,
> > but in other cases the inode can remain dirty after all references
> > are
> > dropped.  This result in the inode being written back from reclaim,
> > which
> > can deadlock on a regular allocation while the request is being
> > served.
> > 
> > The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> > because
> > serving a request involves unrelated userspace process(es).
> > 
> > Instead do the same as for dirty pages: make sure the inode is
> > written
> > before the last reference is gone.
> > 
> >  - fuse_vma_close(): flush times in addition to the dirty pages
> > 
> >  - fallocate(2)/copy_file_range(2): these call file_update_time() or
> >    file_modified(), so flush the inode before returning from the call
> > 
> >  - unlink(2), link(2) and rename(2): these call fuse_update_ctime(),
> > so
> >    flush the ctime directly from this helper
> > 
> > Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > ---
> >  fs/fuse/dir.c    |    8 ++++++++
> >  fs/fuse/file.c   |   24 +++++++++++++++++++++---
> >  fs/fuse/fuse_i.h |    1 +
> >  3 files changed, 30 insertions(+), 3 deletions(-)
> > 
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
> >  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
> >  }
> >  
> > +void fuse_flush_time_update(struct inode *inode)
> > +{
> > +	int err = sync_inode_metadata(inode, 1);
> > +
> > +	mapping_set_error(inode->i_mapping, err);
> > +}
> > +
> >  void fuse_update_ctime(struct inode *inode)
> >  {
> >  	fuse_invalidate_attr(inode);
> >  	if (!IS_NOCMTIME(inode)) {
> >  		inode->i_ctime = current_time(inode);
> >  		mark_inode_dirty_sync(inode);
> > +		fuse_flush_time_update(inode);
> >  	}
> >  }
> >  
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
> >  	struct fuse_file *ff;
> >  	int err;
> >  
> > +	/*
> > +	 * Inode is always written before the last reference is dropped
> > and
> > +	 * hence this should not be reached from reclaim.
> > +	 *
> > +	 * Writing back the inode from reclaim can deadlock if the
> > request
> > +	 * processing itself needs an allocation.  Allocations
> > triggering
> > +	 * reclaim while serving a request can't be prevented, because
> > it can
> > +	 * involve any number of unrelated userspace processes.
> > +	 */
> > +	WARN_ON(wbc->for_reclaim);
> > +
> >  	ff = __fuse_write_file_get(fi);
> >  	err = fuse_flush_times(inode, ff);
> >  	if (ff)
> > @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
> >  }
> >  
> >  /*
> > - * Write back dirty pages now, because there may not be any suitable
> > - * open files later
> > + * Write back dirty data/metadata now (there may not be any suitable
> > + * open files later for data)
> >   */
> >  static void fuse_vma_close(struct vm_area_struct *vma)
> >  {
> > -	filemap_write_and_wait(vma->vm_file->f_mapping);
> > +	int err;
> > +
> > +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> > +	mapping_set_error(vma->vm_file->f_mapping, err);
> >  }
> >  
> >  /*
> > @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
> >  	if (lock_inode)
> >  		inode_unlock(inode);
> >  
> > +	fuse_flush_time_update(inode);
> > +
> >  	return err;
> >  }
> >  
> > @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
> >  	inode_unlock(inode_out);
> >  	file_accessed(file_in);
> >  
> > +	fuse_flush_time_update(inode_out);
> > +
> >  	return err;
> >  }
> >  
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
> >  
> >  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
> >  
> > +void fuse_flush_time_update(struct inode *inode);
> >  void fuse_update_ctime(struct inode *inode);
> >  
> >  int fuse_update_attributes(struct inode *inode, struct file *file);
> 
> Hi Mikloz, Greg,
> 
> This deadlock issue could be raised in high memory pressure and the
> patch has been merged in commit 5c791fe ("fuse: make sure reclaim
> doesn't write the inode").
> 
> Can we take it to the LTS version?

What kernel tree(s) do you want this backported to?  Have you tested it
that it will apply cleanly and work?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-12-14  9:38               ` Greg Kroah-Hartman
@ 2021-12-15  8:22                 ` Ed Tsai
  2021-12-15 13:52                   ` Greg Kroah-Hartman
  0 siblings, 1 reply; 4+ messages in thread
From: Ed Tsai @ 2021-12-15  8:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Miklos Szeredi, stable, linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Tue, 2021-12-14 at 17:38 +0800, Greg Kroah-Hartman wrote:
> On Tue, Dec 14, 2021 at 05:25:01PM +0800, Ed Tsai wrote:
> > On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote:
> > > On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > > > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > > > wrote:
> > > > > 
> > > > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com
> > > > > > >
> > > > > > wrote:
> > > > > > > 
> > > > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > > > > chenguanyou9338@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > ABA deadlock
> > > > > > > > > 
> > > > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > > > "Thread-21"
> > > > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > > > 4 [ffffff802d16b510] __wait_on_bit at
> > > > > > > > > ffffff8009200a34
> > > > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > > > ffffff800830e1e8
> > > > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > > > ffffff80082f4c90
> > > > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at
> > > > > > > > > ffffff80082f1710
> > > > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > > > ffffff80082f1c34
> > > > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at
> > > > > > > > > ffffff80082f18a8
> > > > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > > > ffffff80082d55ac
> > > > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at
> > > > > > > > > ffffff8008268460
> > > > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > > > ffffff80082680d0
> > > > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > > > ffffff8008256514
> > > > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at
> > > > > > > > > ffffff8008438268
> > > > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > > > ffffff8008437654
> > > > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at
> > > > > > > > > ffffff8008436f40
> > > > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > > > 
> > > > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > > > "kworker/u16:8"
> > > > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > > > ffffff8008435760
> > > > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > > > ffffff8008435b14
> > > > > > > > > 5 [ffffff802e793930] fuse_flush_times at
> > > > > > > > > ffffff800843a7a0
> > > > > > > > > 6 [ffffff802e793950] fuse_write_inode at
> > > > > > > > > ffffff800843e4dc
> > > > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > > > ffffff8008312740
> > > > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > > > ffffff80083117e4
> > > > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at
> > > > > > > > > ffffff8008311d98
> > > > > > > > > 10 [ffffff802e793c00] wb_writeback at
> > > > > > > > > ffffff8008310cfc
> > > > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > > > ffffff80080e4fac
> > > > > > > > > 13 [ffffff802e793e00] worker_thread at
> > > > > > > > > ffffff80080e5670
> > > > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > > > 
> > > > > > > > The issue is real.
> > > > > > > > 
> > > > > > > > The fix, however, is not the right one.  The
> > > > > > > > fundamental
> > > > > > > > problem
> > > > > > > > is
> > > > > > > > that fuse_write_inode() blocks on a request to
> > > > > > > > userspace.
> > > > > > > > 
> > > > > > > > This is the same issue that
> > > > > > > > fuse_writepage/fuse_writepages
> > > > > > > > face.  In
> > > > > > > > that case the solution was to copy the page contents to
> > > > > > > > a
> > > > > > > > temporary
> > > > > > > > buffer and return immediately as if the writeback
> > > > > > > > already
> > > > > > > > completed.
> > > > > > > > 
> > > > > > > > Something similar needs to be done here: send the
> > > > > > > > FUSE_SETATTR
> > > > > > > > request
> > > > > > > > asynchronously and return immediately from
> > > > > > > > fuse_write_inode().  The
> > > > > > > > tricky part is to make sure that multiple time updates
> > > > > > > > for
> > > > > > > > the
> > > > > > > > same
> > > > > > > > inode aren't mixed up...
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Miklos
> > > > > > > 
> > > > > > > Dear Szeredi,
> > > > > > > 
> > > > > > > Writeback thread calls fuse_write_inode() and wait for
> > > > > > > user
> > > > > > > Daemon
> > > > > > > to
> > > > > > > complete this write inode request. The user daemon will
> > > > > > > alloc_page()
> > > > > > > after taking this request, and a deadlock could happen
> > > > > > > when
> > > > > > > we try
> > > > > > > to
> > > > > > > shrink dentry list under memory pressure.
> > > > > > > 
> > > > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > > > also LTS.
> > > > > > > So
> > > > > > > another problem is that we should not change the protocol
> > > > > > > or
> > > > > > > feature
> > > > > > > for stable kernel.
> > > > > > > 
> > > > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by
> > > > > > > skip
> > > > > > > the
> > > > > > > dentry
> > > > > > > shirnker. It works but degrade the alloc_page success
> > > > > > > rate.
> > > > > > > In a
> > > > > > > more
> > > > > > > fundamental way, we could cache the contents and return
> > > > > > > immediately.
> > > > > > > But how to ensure the request will be done successfully,
> > > > > > > e.g.,
> > > > > > > always
> > > > > > > retry if it fails from daemon.
> > > > > > 
> > > > > > Key is where the the dirty metadata is flushed.  To prevent
> > > > > > deadlock
> > > > > > it must not be flushed from memory reclaim, so must make
> > > > > > sure
> > > > > > that it
> > > > > > is flushed on close(2) and munmap(2) and not dirtied after
> > > > > > that.
> > > > > > 
> > > > > > I'm working on this currently and hope to get it ready for
> > > > > > the
> > > > > > next
> > > > > > merge window.
> > > > > > 
> > > > > > Thanks,
> > > > > > Miklos
> > > > > 
> > > > > Hi Miklos,
> > > > > 
> > > > > I'm not sure whether it has already been resolved in
> > > > > mainline.
> > > > > If it still WIP, please cc me on future emails.
> > > > 
> > > > Hi,
> > > > 
> > > > This is taking a bit longer, unfortunately, but I already have
> > > > something in testing and currently cleaning it up for
> > > > review.  Hope
> > > > to
> > > > post a series today or early next week.
> > > 
> > > 
> > > Here's a minimal patch.  It's been through some iterations and
> > > some
> > > testing, but
> > > more review and testing is definitely welcome.
> > > 
> > > Chenguanyou, can you please verify that it fixes the deadlock?
> > > 
> > > Thanks,
> > > Miklos
> > > 
> > > ---
> > > From: Miklos Szeredi <mszeredi@redhat.com>
> > > Subject: fuse: make sure reclaim doesn't write the inode
> > > 
> > > In writeback cache mode mtime/ctime updates are cached, and
> > > flushed
> > > to the
> > > server using the ->write_inode() callback.
> > > 
> > > Closing the file will result in a dirty inode being immediately
> > > written,
> > > but in other cases the inode can remain dirty after all
> > > references
> > > are
> > > dropped.  This result in the inode being written back from
> > > reclaim,
> > > which
> > > can deadlock on a regular allocation while the request is being
> > > served.
> > > 
> > > The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> > > because
> > > serving a request involves unrelated userspace process(es).
> > > 
> > > Instead do the same as for dirty pages: make sure the inode is
> > > written
> > > before the last reference is gone.
> > > 
> > >  - fuse_vma_close(): flush times in addition to the dirty pages
> > > 
> > >  - fallocate(2)/copy_file_range(2): these call file_update_time()
> > > or
> > >    file_modified(), so flush the inode before returning from the
> > > call
> > > 
> > >  - unlink(2), link(2) and rename(2): these call
> > > fuse_update_ctime(),
> > > so
> > >    flush the ctime directly from this helper
> > > 
> > > Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > > ---
> > >  fs/fuse/dir.c    |    8 ++++++++
> > >  fs/fuse/file.c   |   24 +++++++++++++++++++++---
> > >  fs/fuse/fuse_i.h |    1 +
> > >  3 files changed, 30 insertions(+), 3 deletions(-)
> > > 
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
> > >  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
> > >  }
> > >  
> > > +void fuse_flush_time_update(struct inode *inode)
> > > +{
> > > +	int err = sync_inode_metadata(inode, 1);
> > > +
> > > +	mapping_set_error(inode->i_mapping, err);
> > > +}
> > > +
> > >  void fuse_update_ctime(struct inode *inode)
> > >  {
> > >  	fuse_invalidate_attr(inode);
> > >  	if (!IS_NOCMTIME(inode)) {
> > >  		inode->i_ctime = current_time(inode);
> > >  		mark_inode_dirty_sync(inode);
> > > +		fuse_flush_time_update(inode);
> > >  	}
> > >  }
> > >  
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
> > >  	struct fuse_file *ff;
> > >  	int err;
> > >  
> > > +	/*
> > > +	 * Inode is always written before the last reference is dropped
> > > and
> > > +	 * hence this should not be reached from reclaim.
> > > +	 *
> > > +	 * Writing back the inode from reclaim can deadlock if the
> > > request
> > > +	 * processing itself needs an allocation.  Allocations
> > > triggering
> > > +	 * reclaim while serving a request can't be prevented, because
> > > it can
> > > +	 * involve any number of unrelated userspace processes.
> > > +	 */
> > > +	WARN_ON(wbc->for_reclaim);
> > > +
> > >  	ff = __fuse_write_file_get(fi);
> > >  	err = fuse_flush_times(inode, ff);
> > >  	if (ff)
> > > @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
> > >  }
> > >  
> > >  /*
> > > - * Write back dirty pages now, because there may not be any
> > > suitable
> > > - * open files later
> > > + * Write back dirty data/metadata now (there may not be any
> > > suitable
> > > + * open files later for data)
> > >   */
> > >  static void fuse_vma_close(struct vm_area_struct *vma)
> > >  {
> > > -	filemap_write_and_wait(vma->vm_file->f_mapping);
> > > +	int err;
> > > +
> > > +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> > > +	mapping_set_error(vma->vm_file->f_mapping, err);
> > >  }
> > >  
> > >  /*
> > > @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
> > >  	if (lock_inode)
> > >  		inode_unlock(inode);
> > >  
> > > +	fuse_flush_time_update(inode);
> > > +
> > >  	return err;
> > >  }
> > >  
> > > @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
> > >  	inode_unlock(inode_out);
> > >  	file_accessed(file_in);
> > >  
> > > +	fuse_flush_time_update(inode_out);
> > > +
> > >  	return err;
> > >  }
> > >  
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
> > >  
> > >  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
> > >  
> > > +void fuse_flush_time_update(struct inode *inode);
> > >  void fuse_update_ctime(struct inode *inode);
> > >  
> > >  int fuse_update_attributes(struct inode *inode, struct file
> > > *file);
> > 
> > Hi Mikloz, Greg,
> > 
> > This deadlock issue could be raised in high memory pressure and the
> > patch has been merged in commit 5c791fe ("fuse: make sure reclaim
> > doesn't write the inode").
> > 
> > Can we take it to the LTS version?
> 
> What kernel tree(s) do you want this backported to?  Have you tested
> it
> that it will apply cleanly and work?
> 
> thanks,
> 
> greg k-h

Hi Greg, I want to take this commit to 5.10 LTS. This can be work to
resolve the deadlock issue. Also, I have done some monkey tests on our
ACK 5.10 phone to confirm the stability.

Best,

Ed Tsai


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] [fuse] alloc_page nofs avoid deadlock
  2021-12-15  8:22                 ` Ed Tsai
@ 2021-12-15 13:52                   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 4+ messages in thread
From: Greg Kroah-Hartman @ 2021-12-15 13:52 UTC (permalink / raw)
  To: Ed Tsai
  Cc: Miklos Szeredi, stable, linux-fsdevel, linux-kernel, chenguanyou,
	Stanley Chu (朱原陞),
	Yong-xuan Wang (王詠萱)

On Wed, Dec 15, 2021 at 04:22:12PM +0800, Ed Tsai wrote:
> On Tue, 2021-12-14 at 17:38 +0800, Greg Kroah-Hartman wrote:
> > On Tue, Dec 14, 2021 at 05:25:01PM +0800, Ed Tsai wrote:
> > > On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote:
> > > > On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote:
> > > > > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai@mediatek.com>
> > > > > wrote:
> > > > > > 
> > > > > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:
> > > > > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai@mediatek.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:
> > > > > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou <
> > > > > > > > > chenguanyou9338@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > ABA deadlock
> > > > > > > > > > 
> > > > > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND:
> > > > > > > > > > "Thread-21"
> > > > > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
> > > > > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58
> > > > > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348
> > > > > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
> > > > > > > > > > 4 [ffffff802d16b510] __wait_on_bit at
> > > > > > > > > > ffffff8009200a34
> > > > > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at
> > > > > > > > > > ffffff800830e1e8
> > > > > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c
> > > > > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270
> > > > > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at
> > > > > > > > > > ffffff80082f4c90
> > > > > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at
> > > > > > > > > > ffffff80082f1710
> > > > > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at
> > > > > > > > > > ffffff80082f1c34
> > > > > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at
> > > > > > > > > > ffffff80082f18a8
> > > > > > > > > > 12 [ffffff802d16b770] super_cache_scan at
> > > > > > > > > > ffffff80082d55ac
> > > > > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170
> > > > > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420
> > > > > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at
> > > > > > > > > > ffffff8008268460
> > > > > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at
> > > > > > > > > > ffffff80082680d0
> > > > > > > > > > 17 [ffffff802d16bbe0] __alloc_pages_nodemask at
> > > > > > > > > > ffffff8008256514
> > > > > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at
> > > > > > > > > > ffffff8008438268
> > > > > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at
> > > > > > > > > > ffffff8008437654
> > > > > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at
> > > > > > > > > > ffffff8008436f40
> > > > > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18
> > > > > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014
> > > > > > > > > > 
> > > > > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND:
> > > > > > > > > > "kworker/u16:8"
> > > > > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c
> > > > > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
> > > > > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348
> > > > > > > > > > 3 [ffffff802e793770] __fuse_request_send at
> > > > > > > > > > ffffff8008435760
> > > > > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at
> > > > > > > > > > ffffff8008435b14
> > > > > > > > > > 5 [ffffff802e793930] fuse_flush_times at
> > > > > > > > > > ffffff800843a7a0
> > > > > > > > > > 6 [ffffff802e793950] fuse_write_inode at
> > > > > > > > > > ffffff800843e4dc
> > > > > > > > > > 7 [ffffff802e793980] __writeback_single_inode at
> > > > > > > > > > ffffff8008312740
> > > > > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at
> > > > > > > > > > ffffff80083117e4
> > > > > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at
> > > > > > > > > > ffffff8008311d98
> > > > > > > > > > 10 [ffffff802e793c00] wb_writeback at
> > > > > > > > > > ffffff8008310cfc
> > > > > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
> > > > > > > > > > 12 [ffffff802e793d90] process_one_work at
> > > > > > > > > > ffffff80080e4fac
> > > > > > > > > > 13 [ffffff802e793e00] worker_thread at
> > > > > > > > > > ffffff80080e5670
> > > > > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650
> > > > > > > > > 
> > > > > > > > > The issue is real.
> > > > > > > > > 
> > > > > > > > > The fix, however, is not the right one.  The
> > > > > > > > > fundamental
> > > > > > > > > problem
> > > > > > > > > is
> > > > > > > > > that fuse_write_inode() blocks on a request to
> > > > > > > > > userspace.
> > > > > > > > > 
> > > > > > > > > This is the same issue that
> > > > > > > > > fuse_writepage/fuse_writepages
> > > > > > > > > face.  In
> > > > > > > > > that case the solution was to copy the page contents to
> > > > > > > > > a
> > > > > > > > > temporary
> > > > > > > > > buffer and return immediately as if the writeback
> > > > > > > > > already
> > > > > > > > > completed.
> > > > > > > > > 
> > > > > > > > > Something similar needs to be done here: send the
> > > > > > > > > FUSE_SETATTR
> > > > > > > > > request
> > > > > > > > > asynchronously and return immediately from
> > > > > > > > > fuse_write_inode().  The
> > > > > > > > > tricky part is to make sure that multiple time updates
> > > > > > > > > for
> > > > > > > > > the
> > > > > > > > > same
> > > > > > > > > inode aren't mixed up...
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Miklos
> > > > > > > > 
> > > > > > > > Dear Szeredi,
> > > > > > > > 
> > > > > > > > Writeback thread calls fuse_write_inode() and wait for
> > > > > > > > user
> > > > > > > > Daemon
> > > > > > > > to
> > > > > > > > complete this write inode request. The user daemon will
> > > > > > > > alloc_page()
> > > > > > > > after taking this request, and a deadlock could happen
> > > > > > > > when
> > > > > > > > we try
> > > > > > > > to
> > > > > > > > shrink dentry list under memory pressure.
> > > > > > > > 
> > > > > > > > We (Mediatek) glad to work on this issue for mainline and
> > > > > > > > also LTS.
> > > > > > > > So
> > > > > > > > another problem is that we should not change the protocol
> > > > > > > > or
> > > > > > > > feature
> > > > > > > > for stable kernel.
> > > > > > > > 
> > > > > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by
> > > > > > > > skip
> > > > > > > > the
> > > > > > > > dentry
> > > > > > > > shirnker. It works but degrade the alloc_page success
> > > > > > > > rate.
> > > > > > > > In a
> > > > > > > > more
> > > > > > > > fundamental way, we could cache the contents and return
> > > > > > > > immediately.
> > > > > > > > But how to ensure the request will be done successfully,
> > > > > > > > e.g.,
> > > > > > > > always
> > > > > > > > retry if it fails from daemon.
> > > > > > > 
> > > > > > > Key is where the the dirty metadata is flushed.  To prevent
> > > > > > > deadlock
> > > > > > > it must not be flushed from memory reclaim, so must make
> > > > > > > sure
> > > > > > > that it
> > > > > > > is flushed on close(2) and munmap(2) and not dirtied after
> > > > > > > that.
> > > > > > > 
> > > > > > > I'm working on this currently and hope to get it ready for
> > > > > > > the
> > > > > > > next
> > > > > > > merge window.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Miklos
> > > > > > 
> > > > > > Hi Miklos,
> > > > > > 
> > > > > > I'm not sure whether it has already been resolved in
> > > > > > mainline.
> > > > > > If it still WIP, please cc me on future emails.
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > This is taking a bit longer, unfortunately, but I already have
> > > > > something in testing and currently cleaning it up for
> > > > > review.  Hope
> > > > > to
> > > > > post a series today or early next week.
> > > > 
> > > > 
> > > > Here's a minimal patch.  It's been through some iterations and
> > > > some
> > > > testing, but
> > > > more review and testing is definitely welcome.
> > > > 
> > > > Chenguanyou, can you please verify that it fixes the deadlock?
> > > > 
> > > > Thanks,
> > > > Miklos
> > > > 
> > > > ---
> > > > From: Miklos Szeredi <mszeredi@redhat.com>
> > > > Subject: fuse: make sure reclaim doesn't write the inode
> > > > 
> > > > In writeback cache mode mtime/ctime updates are cached, and
> > > > flushed
> > > > to the
> > > > server using the ->write_inode() callback.
> > > > 
> > > > Closing the file will result in a dirty inode being immediately
> > > > written,
> > > > but in other cases the inode can remain dirty after all
> > > > references
> > > > are
> > > > dropped.  This result in the inode being written back from
> > > > reclaim,
> > > > which
> > > > can deadlock on a regular allocation while the request is being
> > > > served.
> > > > 
> > > > The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE,
> > > > because
> > > > serving a request involves unrelated userspace process(es).
> > > > 
> > > > Instead do the same as for dirty pages: make sure the inode is
> > > > written
> > > > before the last reference is gone.
> > > > 
> > > >  - fuse_vma_close(): flush times in addition to the dirty pages
> > > > 
> > > >  - fallocate(2)/copy_file_range(2): these call file_update_time()
> > > > or
> > > >    file_modified(), so flush the inode before returning from the
> > > > call
> > > > 
> > > >  - unlink(2), link(2) and rename(2): these call
> > > > fuse_update_ctime(),
> > > > so
> > > >    flush the ctime directly from this helper
> > > > 
> > > > Reported-by: chenguanyou <chenguanyou@xiaomi.com>
> > > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > > > ---
> > > >  fs/fuse/dir.c    |    8 ++++++++
> > > >  fs/fuse/file.c   |   24 +++++++++++++++++++++---
> > > >  fs/fuse/fuse_i.h |    1 +
> > > >  3 files changed, 30 insertions(+), 3 deletions(-)
> > > > 
> > > > --- a/fs/fuse/dir.c
> > > > +++ b/fs/fuse/dir.c
> > > > @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name
> > > >  	return create_new_entry(fm, &args, dir, entry, S_IFLNK);
> > > >  }
> > > >  
> > > > +void fuse_flush_time_update(struct inode *inode)
> > > > +{
> > > > +	int err = sync_inode_metadata(inode, 1);
> > > > +
> > > > +	mapping_set_error(inode->i_mapping, err);
> > > > +}
> > > > +
> > > >  void fuse_update_ctime(struct inode *inode)
> > > >  {
> > > >  	fuse_invalidate_attr(inode);
> > > >  	if (!IS_NOCMTIME(inode)) {
> > > >  		inode->i_ctime = current_time(inode);
> > > >  		mark_inode_dirty_sync(inode);
> > > > +		fuse_flush_time_update(inode);
> > > >  	}
> > > >  }
> > > >  
> > > > --- a/fs/fuse/file.c
> > > > +++ b/fs/fuse/file.c
> > > > @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode
> > > >  	struct fuse_file *ff;
> > > >  	int err;
> > > >  
> > > > +	/*
> > > > +	 * Inode is always written before the last reference is dropped
> > > > and
> > > > +	 * hence this should not be reached from reclaim.
> > > > +	 *
> > > > +	 * Writing back the inode from reclaim can deadlock if the
> > > > request
> > > > +	 * processing itself needs an allocation.  Allocations
> > > > triggering
> > > > +	 * reclaim while serving a request can't be prevented, because
> > > > it can
> > > > +	 * involve any number of unrelated userspace processes.
> > > > +	 */
> > > > +	WARN_ON(wbc->for_reclaim);
> > > > +
> > > >  	ff = __fuse_write_file_get(fi);
> > > >  	err = fuse_flush_times(inode, ff);
> > > >  	if (ff)
> > > > @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
> > > >  }
> > > >  
> > > >  /*
> > > > - * Write back dirty pages now, because there may not be any
> > > > suitable
> > > > - * open files later
> > > > + * Write back dirty data/metadata now (there may not be any
> > > > suitable
> > > > + * open files later for data)
> > > >   */
> > > >  static void fuse_vma_close(struct vm_area_struct *vma)
> > > >  {
> > > > -	filemap_write_and_wait(vma->vm_file->f_mapping);
> > > > +	int err;
> > > > +
> > > > +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> > > > +	mapping_set_error(vma->vm_file->f_mapping, err);
> > > >  }
> > > >  
> > > >  /*
> > > > @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
> > > >  	if (lock_inode)
> > > >  		inode_unlock(inode);
> > > >  
> > > > +	fuse_flush_time_update(inode);
> > > > +
> > > >  	return err;
> > > >  }
> > > >  
> > > > @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
> > > >  	inode_unlock(inode_out);
> > > >  	file_accessed(file_in);
> > > >  
> > > > +	fuse_flush_time_update(inode_out);
> > > > +
> > > >  	return err;
> > > >  }
> > > >  
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
> > > >  
> > > >  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
> > > >  
> > > > +void fuse_flush_time_update(struct inode *inode);
> > > >  void fuse_update_ctime(struct inode *inode);
> > > >  
> > > >  int fuse_update_attributes(struct inode *inode, struct file
> > > > *file);
> > > 
> > > Hi Mikloz, Greg,
> > > 
> > > This deadlock issue could be raised in high memory pressure and the
> > > patch has been merged in commit 5c791fe ("fuse: make sure reclaim
> > > doesn't write the inode").
> > > 
> > > Can we take it to the LTS version?
> > 
> > What kernel tree(s) do you want this backported to?  Have you tested
> > it
> > that it will apply cleanly and work?
> > 
> > thanks,
> > 
> > greg k-h
> 
> Hi Greg, I want to take this commit to 5.10 LTS. This can be work to
> resolve the deadlock issue. Also, I have done some monkey tests on our
> ACK 5.10 phone to confirm the stability.

Queued up for 5.15.y and 5.10.y now, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-12-15 13:53 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20210603125242.31699-1-chenguanyou@xiaomi.com>
     [not found] ` <CAJfpegsEkRnU26Vvo4BTQUmx89Hahp6=RTuyEcPm=rqz8icwUQ@mail.gmail.com>
     [not found]   ` <1fabb91167a86990f4723e9036a0e006293518f4.camel@mediatek.com>
     [not found]     ` <CAJfpegsOSWZpKHqDNE_B489dGCzLr-RVAhimVOsFkxJwMYmj9A@mail.gmail.com>
     [not found]       ` <07c5f2f1e10671bc462f88717f84aae9ee1e4d2b.camel@mediatek.com>
     [not found]         ` <CAJfpegvAJS=An+hyAshkNcTS8A2TM28V2UP4SYycXUw3awOR+g@mail.gmail.com>
     [not found]           ` <YVMz8E1Lg/GZQcjw@miu.piliscsaba.redhat.com>
2021-12-14  9:25             ` [PATCH] [fuse] alloc_page nofs avoid deadlock Ed Tsai
2021-12-14  9:38               ` Greg Kroah-Hartman
2021-12-15  8:22                 ` Ed Tsai
2021-12-15 13:52                   ` Greg Kroah-Hartman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).