linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] fs-writeback: Flush plug before next iteration in wb_writeback()
@ 2022-04-15  1:37 Zhihao Cheng
  2022-04-15  6:39 ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Zhihao Cheng @ 2022-04-15  1:37 UTC (permalink / raw)
  To: viro, torvalds, hch; +Cc: linux-fsdevel, linux-kernel, chengzhihao1, yukuai3

Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during wb_writeback, which
may cause a potential ABBA dead lock:

    wb_writeback		fat_file_fsync
blk_start_plug(&plug)
for (;;) {
  iter i-1: some reqs have been added into plug->mq_list  // LOCK A
  iter i:
    progress = __writeback_inodes_wb(wb, work)
    . writeback_sb_inodes // fat's bdev
    .   __writeback_single_inode
    .   . generic_writepages
    .   .   __block_write_full_page
    .   .   . . 	    __generic_file_fsync
    .   .   . . 	      sync_inode_metadata
    .   .   . . 	        writeback_single_inode
    .   .   . . 		  __writeback_single_inode
    .   .   . . 		    fat_write_inode
    .   .   . . 		      __fat_write_inode
    .   .   . . 		        sync_dirty_buffer	// fat's bdev
    .   .   . . 			  lock_buffer(bh)	// LOCK B
    .   .   . . 			    submit_bh
    .   .   . . 			      blk_mq_get_tag	// LOCK A
    .   .   . trylock_buffer(bh)  // LOCK B
    .   .   .   redirty_page_for_writepage
    .   .   .     wbc->pages_skipped++
    .   .   --wbc->nr_to_write
    .   wrote += write_chunk - wbc.nr_to_write  // wrote > 0
    .   requeue_inode
    .     redirty_tail_locked
    if (progress)    // progress > 0
      continue;
  iter i+1:
      queue_io
      // similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug)   // flush plug won't be called

Above process triggers a hungtask like:
[  399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
[  399.046824]       Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
[  399.051539] task:bb              state:D stack:    0 pid: 2607 ppid:
2426 flags:0x00004000
[  399.051556] Call Trace:
[  399.051570]  __schedule+0x480/0x1050
[  399.051592]  schedule+0x92/0x1a0
[  399.051602]  io_schedule+0x22/0x50
[  399.051613]  blk_mq_get_tag+0x1d3/0x3c0
[  399.051640]  __blk_mq_alloc_requests+0x21d/0x3f0
[  399.051657]  blk_mq_submit_bio+0x68d/0xca0
[  399.051674]  __submit_bio+0x1b5/0x2d0
[  399.051708]  submit_bio_noacct+0x34e/0x720
[  399.051718]  submit_bio+0x3b/0x150
[  399.051725]  submit_bh_wbc+0x161/0x230
[  399.051734]  __sync_dirty_buffer+0xd1/0x420
[  399.051744]  sync_dirty_buffer+0x17/0x20
[  399.051750]  __fat_write_inode+0x289/0x310
[  399.051766]  fat_write_inode+0x2a/0xa0
[  399.051783]  __writeback_single_inode+0x53c/0x6f0
[  399.051795]  writeback_single_inode+0x145/0x200
[  399.051803]  sync_inode_metadata+0x45/0x70
[  399.051856]  __generic_file_fsync+0xa3/0x150
[  399.051880]  fat_file_fsync+0x1d/0x80
[  399.051895]  vfs_fsync_range+0x40/0xb0
[  399.051929]  __x64_sys_fsync+0x18/0x30

In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback:
unplug before cond_resched in writeback_sb_inodes") in function
'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted
from write_cache_pages().

Fix it by flush plug before next iteration in wb_writeback().

Goto Link to find a reproducer.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837
Cc: stable@vger.kernel.org # v4.3
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
---
 fs/fs-writeback.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 591fe9cf1659..e524c0a1749c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2036,8 +2036,21 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			/*
+			 * The progress may be false postive in page redirty
+			 * case (which is caused by failing to get buffer head
+			 * lock), which will requeue dirty inodes and start
+			 * next writeback iteration, and other tasks maybe
+			 * stuck for getting tags for new requests. So, flush
+			 * plug to schedule requests holding tags.
+			 *
+			 * The code can be removed after buffer head
+			 * disappering from linux.
+			 */
+			blk_flush_plug(current->plug, false);
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] fs-writeback: Flush plug before next iteration in wb_writeback()
  2022-04-15  1:37 [PATCH] fs-writeback: Flush plug before next iteration in wb_writeback() Zhihao Cheng
@ 2022-04-15  6:39 ` Christoph Hellwig
  2022-04-15  8:39   ` Zhihao Cheng
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2022-04-15  6:39 UTC (permalink / raw)
  To: Zhihao Cheng; +Cc: viro, torvalds, hch, linux-fsdevel, linux-kernel, yukuai3

On Fri, Apr 15, 2022 at 09:37:35AM +0800, Zhihao Cheng wrote:
> +		if (progress) {
> +			/*
> +			 * The progress may be false postive in page redirty
> +			 * case (which is caused by failing to get buffer head
> +			 * lock), which will requeue dirty inodes and start
> +			 * next writeback iteration, and other tasks maybe
> +			 * stuck for getting tags for new requests. So, flush
> +			 * plug to schedule requests holding tags.
> +			 *
> +			 * The code can be removed after buffer head
> +			 * disappering from linux.
> +			 */
> +			blk_flush_plug(current->plug, false);

This basically removes plugging entirely, so we might as well stop
adding the plug if we can't solve it any other way.  But it seems
like that fake progress needs to be fixed instead.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] fs-writeback: Flush plug before next iteration in wb_writeback()
  2022-04-15  6:39 ` Christoph Hellwig
@ 2022-04-15  8:39   ` Zhihao Cheng
  2022-04-16  5:42     ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Zhihao Cheng @ 2022-04-15  8:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: viro, torvalds, linux-fsdevel, linux-kernel, yukuai3

在 2022/4/15 14:39, Christoph Hellwig 写道:
Hi, Christoph
> This basically removes plugging entirely, so we might as well stop
> adding the plug if we can't solve it any other way.  But it seems
> like that fake progress needs to be fixed instead.
> 
Maybe there is a more ideal solution:
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e524c0a1749c..9723f77841f8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1855,7 +1855,7 @@ static long writeback_sb_inodes(struct super_block 
*sb,

                 wbc_detach_inode(&wbc);
                 work->nr_pages -= write_chunk - wbc.nr_to_write;
-               wrote += write_chunk - wbc.nr_to_write;
+               wrote += write_chunk - wbc.nr_to_write - wbc.pages_skipped;

                 if (need_resched()) {
                         /*

, or following is better(It looks awkward.):

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e524c0a1749c..5f310e53bf1e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1780,6 +1780,7 @@ static long writeback_sb_inodes(struct super_block 
*sb,
         while (!list_empty(&wb->b_io)) {
                 struct inode *inode = wb_inode(wb->b_io.prev);
                 struct bdi_writeback *tmp_wb;
+               long tmp_wrote;

                 if (inode->i_sb != sb) {
                         if (work->sb) {
@@ -1854,8 +1855,11 @@ static long writeback_sb_inodes(struct 
super_block *sb,
                 __writeback_single_inode(inode, &wbc);

                 wbc_detach_inode(&wbc);
-               work->nr_pages -= write_chunk - wbc.nr_to_write;
-               wrote += write_chunk - wbc.nr_to_write;
+               tmp_wrote = write_chunk - wbc.nr_to_write >= 
wbc.pages_skipped ?
+                           write_chunk - wbc.nr_to_write - 
wbc.pages_skipped :
+                           write_chunk - wbc.nr_to_write;workaround
+               work->nr_pages -= tmp_wrote;
+               wrote += tmp_wrote;

                 if (need_resched()) {
                         /*

It depends on how specific filesystem behaves after invoking 
redirty_page_for_writepage(). Most filesystems return 
AOP_WRITEPAGE_ACTIVATE or 0 after invoking redirty_page_for_writepage() 
in writepage, but there still exist accidential examples:
1. metapage_writepage() could returns -EIO after 
redirty_page_for_writepage()
2. ext4_writepage() could returns -ENOMEM after redirty_page_for_writepage()

write_cache_pages
   error = (*writepage)(page, wbc, data);
   if (unlikely(error)) {
     ...
     break;
   }
   --wbc->nr_to_write   // Skip if 'error < 0'. And if writepage invokes 
redirty_page_for_writepage(), wrote could be negative.


I think the root cause is fsync gets buffer head's lock without locking 
corresponding page, fixing 'progess' and flushing plug are both workarounds.

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] fs-writeback: Flush plug before next iteration in wb_writeback()
  2022-04-15  8:39   ` Zhihao Cheng
@ 2022-04-16  5:42     ` Christoph Hellwig
  2022-04-16  8:41       ` Zhihao Cheng
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2022-04-16  5:42 UTC (permalink / raw)
  To: Zhihao Cheng
  Cc: Christoph Hellwig, viro, torvalds, linux-fsdevel, linux-kernel, yukuai3

> I think the root cause is fsync gets buffer head's lock without locking 
> corresponding page, fixing 'progess' and flushing plug are both 
> workarounds.

So let's fix that.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] fs-writeback: Flush plug before next iteration in wb_writeback()
  2022-04-16  5:42     ` Christoph Hellwig
@ 2022-04-16  8:41       ` Zhihao Cheng
  2022-04-18  5:00         ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Zhihao Cheng @ 2022-04-16  8:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: viro, torvalds, linux-fsdevel, linux-kernel, yukuai3

在 2022/4/16 13:42, Christoph Hellwig 写道:
>> I think the root cause is fsync gets buffer head's lock without locking
>> corresponding page, fixing 'progess' and flushing plug are both
>> workarounds.
> 
> So let's fix that.
> 

I think adding page lock before locking buffer head is a little 
difficult and risky:
1. There are too many places getting buffer head before submitting bio, 
and not all filesystems behave same in readpage/writepage/write_inode. 
For example, ntfs_read_block() has locked page before locking buffer 
head and then submitting bh, ext4(no journal) and fat may lock buffer 
head without locking page while writing inode. It's a huge work to check 
all places.
2. Import page lock before locking buffer head may bring new unknown 
problem(other deadlocks about page ?). Taking page lock before locking 
buffer head(in all processes which can be concurrent with wb_writeback) 
is a dangerous thing.

So, how about applying the safe and simple method(flush plug) for the 
time being?
PS: Maybe someday buffer head is removed from all filesystems, then we 
can remove this superfluous blk_flush_plug.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] fs-writeback: Flush plug before next iteration in wb_writeback()
  2022-04-16  8:41       ` Zhihao Cheng
@ 2022-04-18  5:00         ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2022-04-18  5:00 UTC (permalink / raw)
  To: Zhihao Cheng
  Cc: Christoph Hellwig, viro, torvalds, linux-fsdevel, linux-kernel, yukuai3

On Sat, Apr 16, 2022 at 04:41:35PM +0800, Zhihao Cheng wrote:
> So, how about applying the safe and simple method(flush plug) for the time 
> being?

As said before - if you flush everytime here the plug is basically
useless and we could remove it.  But it was added for a reason.  So let's
at least improve the accounting for the skipped writeback as suggested in
your last mail.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-04-18  5:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-15  1:37 [PATCH] fs-writeback: Flush plug before next iteration in wb_writeback() Zhihao Cheng
2022-04-15  6:39 ` Christoph Hellwig
2022-04-15  8:39   ` Zhihao Cheng
2022-04-16  5:42     ` Christoph Hellwig
2022-04-16  8:41       ` Zhihao Cheng
2022-04-18  5:00         ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).