* [PATCH] ext4: start the handle later in ext4_writepages() to avoid unnecessary wait
@ 2021-09-23 12:12 xueqingwen
  2021-10-14  2:31 ` Theodore Ts'o
  0 siblings, 1 reply; 3+ messages in thread
From: xueqingwen @ 2021-09-23 12:12 UTC (permalink / raw)
  To: tytso, adilger.kernel
  Cc: linux-ext4, linux-kernel, xueqingwen, zhaojie, jimyan

ext4_writepages() calls mpage_prepare_extent_to_map() to find the pages
that need mapping. At that point the handle has already been started, so
if mpage_prepare_extent_to_map() has to wait for a page lock, the commit
of the corresponding transaction is blocked, which in turn can cause
I/O latency jitter for requests in other cgroups.
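
In outline, the pre-patch ordering looks like the simplified sketch
below (call order only; the details of the ext4_writepages() loop are
omitted):

	/* Simplified sketch of the ordering before this patch. */
	handle = ext4_journal_start_with_reserve(inode, EXT4_HT_WRITE_PAGE,
						 needed_blocks, rsv_blocks);
	/* From here on, jbd2 cannot commit the running transaction
	 * until this handle is stopped. */

	ret = mpage_prepare_extent_to_map(&mpd);
	/* May sleep in lock_page()/wait_on_page_bit() on a page held by
	 * a writer whose I/O is throttled in another cgroup, and so
	 * stall the transaction commit. */

	if (!ret && mpd.map.m_len)
		ret = mpage_map_and_submit_extent(handle, &mpd,
				&give_up_on_write);

	ext4_journal_stop(handle);	/* only now can kjournald2 proceed */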

We hit this problem in our stress test environment, set up as follows:
two equivalent database server processes serve key-value requests on
different ports of the same machine. They are attached to cgroup A and
cgroup B respectively, and they share the same NVMe SSD. In cgroup A
the read/write bandwidth of the SSD is limited to 20Mbps using io.max;
no bandwidth limit is set in cgroup B. Read/write requests with the
same load are then sent continuously and simultaneously to the two
server processes. Our expectation is that the server process in
cgroup B can still meet its request latency targets.
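
For reference, the throttling for cgroup A was configured through the
cgroup v2 io.max interface; a minimal sketch is below. The cgroup path,
the 259:0 major:minor number, and the assumption that the 20Mbps limit
means 20 MB/s (io.max takes bytes per second) are illustrative, not the
exact values from our environment:

	#include <stdio.h>
	#include <stdlib.h>

	/* Illustrative only: set a read/write bandwidth limit on the
	 * device 259:0 (hypothetical NVMe major:minor) for cgroup A. */
	int main(void)
	{
		FILE *f = fopen("/sys/fs/cgroup/A/io.max", "w");

		if (!f) {
			perror("open io.max");
			return EXIT_FAILURE;
		}
		fprintf(f, "259:0 rbps=20971520 wbps=20971520\n");
		fclose(f);
		return 0;
	}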

Following an earlier discussion in the upstream community, the
filesystem was mounted with dioread_nolock, which does alleviate the
I/O priority inversion problem to a great extent. However, in the test
scenario above, write requests of the server process in cgroup B would
still occasionally hang for several seconds, causing a sharp increase
in request latency. The stack traces captured while it was hanging are
as follows:

PID: 7602  COMMAND: "jbd2/nvme0n1-8":
 #0 __schedule
 #1 schedule
 #2 jbd2_journal_commit_transaction
 #3 kjournald2
 #4 kthread
 #5 ret_from_fork

PID: 55526 COMMAND: "kworker/u114:9":
 #0 __schedule
 #1 schedule
 #2 io_schedule
 #3 __lock_page
 #4 mpage_prepare_extent_to_map
 #5 ext4_writepages
 #6 do_writepages
 #7 __writeback_single_inode
 #8 writeback_sb_inodes
 #9 __writeback_inodes_wb

PID: 109830  COMMAND: "snnode":
 #0 __schedule
 #1 schedule
 #2 io_schedule
 #3 wait_on_page_bit
 #4 mpage_prepare_extent_to_map
 #5 ext4_writepages
 #6 do_writepages
 #7 __filemap_fdatawrite_range
 #8 file_write_and_wait_range
 #9 ext4_sync_file

The stack traces above show that the "kworker/u114:9" process was
waiting to lock a page while holding a handle. The page was locked by
the "snnode" process, which belonged to cgroup A and whose writeback to
the block device was throttled. The kjournald2 process therefore had to
wait for the handle to be stopped before it could commit the current
transaction, and this caused the I/O latency jitter seen in cgroup B.

Therefore, this patch delays starting the handle in ext4_writepages()
until the pages that need mapping have been found. With this patch, the
problem described above no longer reproduces. We have looked the patch
over carefully, but another pair of eyes would be appreciated. Please
review it for defects and consider whether it can be merged upstream.

Thanks.

Co-developed-by: zhaojie <zhaojie17@baidu.com>
Signed-off-by: zhaojie <zhaojie17@baidu.com>
Signed-off-by: xueqingwen <xueqingwen@baidu.com>
Signed-off-by: jimyan <jimyan@baidu.com>
---
 fs/ext4/inode.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d18852d6029c..29bb7ab3e01f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2785,26 +2785,32 @@ static int ext4_writepages(struct address_space *mapping,
 		BUG_ON(ext4_should_journal_data(inode));
 		needed_blocks = ext4_da_writepages_trans_blocks(inode);
 
-		/* start a new transaction */
-		handle = ext4_journal_start_with_reserve(inode,
-				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
-			       "%ld pages, ino %lu; err %d", __func__,
-				wbc->nr_to_write, inode->i_ino, ret);
-			/* Release allocated io_end */
-			ext4_put_io_end(mpd.io_submit.io_end);
-			mpd.io_submit.io_end = NULL;
-			break;
-		}
 		mpd.do_map = 1;
 
 		trace_ext4_da_write_pages(inode, mpd.first_page, mpd.wbc);
 		ret = mpage_prepare_extent_to_map(&mpd);
-		if (!ret && mpd.map.m_len)
+		if (!ret && mpd.map.m_len) {
+			/* start a new transaction */
+			handle = ext4_journal_start_with_reserve(inode,
+					EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
+			if (IS_ERR(handle)) {
+				ret = PTR_ERR(handle);
+				ext4_msg(inode->i_sb, KERN_CRIT,
+					"%s: jbd2_start: %ld pages, ino %lu; err %d",
+					__func__, wbc->nr_to_write, inode->i_ino, ret);
+				/* submit a prepared bio, if any */
+				ext4_io_submit(&mpd.io_submit);
+
+				/* Release allocated io_end */
+				ext4_put_io_end(mpd.io_submit.io_end);
+				mpd.io_submit.io_end = NULL;
+				/* unlock pages we don't use */
+				mpage_release_unused_pages(&mpd, false);
+				break;
+			}
 			ret = mpage_map_and_submit_extent(handle, &mpd,
 					&give_up_on_write);
+		}
 		/*
 		 * Caution: If the handle is synchronous,
 		 * ext4_journal_stop() can wait for transaction commit
@@ -2815,7 +2821,7 @@ static int ext4_writepages(struct address_space *mapping,
 		 * and dropped io_end reference (for extent conversion
 		 * to be able to complete) before stopping the handle.
 		 */
-		if (!ext4_handle_valid(handle) || handle->h_sync == 0) {
+		if (handle && (!ext4_handle_valid(handle) || handle->h_sync == 0)) {
 			ext4_journal_stop(handle);
 			handle = NULL;
 			mpd.do_map = 0;
-- 
2.17.1



* Re: [PATCH] ext4: start the handle later in ext4_writepages() to avoid unnecessary wait
  2021-09-23 12:12 [PATCH] ext4: start the handle later in ext4_writepages() to avoid unnecessary wait xueqingwen
@ 2021-10-14  2:31 ` Theodore Ts'o
  2021-10-18 21:20   ` Jan Kara
  0 siblings, 1 reply; 3+ messages in thread
From: Theodore Ts'o @ 2021-10-14  2:31 UTC (permalink / raw)
  To: xueqingwen; +Cc: adilger.kernel, linux-ext4, linux-kernel, zhaojie, jimyan

On Thu, Sep 23, 2021 at 08:12:04PM +0800, xueqingwen wrote:
>   ....
> Therefore, this patch delays starting the handle in ext4_writepages()
> until the pages that need mapping have been found. With this patch, the
> problem described above no longer reproduces. We have looked the patch
> over carefully, but another pair of eyes would be appreciated. Please
> review it for defects and consider whether it can be merged upstream.

Hi,

I've tried tests against this patch, and it's causing a large number
of hangs.  For most of the hangs, it's while running generic/269,
although there were a few other tests which would cause the kernel to
hang.

I don't have time to try to figure out why your patch might be
failing, at least not this week.  So if you could take a look at the
test artifacts in this xz-compressed tarfile, I'd appreciate it.
The "report" file contains a summary report, and the *.serial files
contain the output from the serial consoles of the VMs which were
hanging with your patch applied.  Perhaps you can determine what needs
to be fixed to prevent the kernel hangs?

https://drive.google.com/file/d/1sk2wx6w3D-grO0WihbIiX2LonthjpOTf/view?usp=sharing

The above link will allow you to download a 4 megabyte file:

  results.ltm-20211013191843.5.15.0-rc4-xfstests-00010-ged577e416dce.tar.xz

Cheers,

						- Ted

P.S.  What sort of testing had you run against your change before you
submitted it for upstream review?  The above set of test artifacts was
generated using gce-xfstests, and while you might not have access to
the Google Cloud Platform, the same xfstests-bld infrastructure also
provides kvm-xfstests.  For more information please see:

https://thunk.org/gce-xfstests
https://github.com/tytso/xfstests-bld/blob/master/Documentation/00-index.md

The above test artifact found in the Google Drive link was generated
via: 

gce-xfstests launch-ltm
gce-xfstests install-kconfig
gce-xfstests kbuild
gce-xfstests ltm full

If you are using kvm-xfstests, you can run the equivalent set of tests
via a set of commands like this:

kvm-xfstests install-kconfig
kvm-xfstests kbuild
kvm-xfstests full

But it might take over 24 hours to run, and the first time the kernel
hangs, it will stop the full test.  With gce-xfstests's lightweight
test manager (ltm), it runs the test VMs in parallel, and I can get
an e-mailed report in my inbox in roughly two and a half hours.

For future reference, it would save me a lot of time if ext4
developers could run "kvm-xfstests smoke" (takes about 20 minutes) or
"kvm-xfstests -c ext4/4k -g auto" (takes about 2 hours) before
sending a patch for review.


* Re: [PATCH] ext4: start the handle later in ext4_writepages() to avoid unnecessary wait
  2021-10-14  2:31 ` Theodore Ts'o
@ 2021-10-18 21:20   ` Jan Kara
  0 siblings, 0 replies; 3+ messages in thread
From: Jan Kara @ 2021-10-18 21:20 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: xueqingwen, adilger.kernel, linux-ext4, linux-kernel, zhaojie, jimyan

On Wed 13-10-21 22:31:37, Theodore Ts'o wrote:
> On Thu, Sep 23, 2021 at 08:12:04PM +0800, xueqingwen wrote:
> >   ....
> > Therefore, this patch delays starting the handle in ext4_writepages()
> > until the pages that need mapping have been found. With this patch, the
> > problem described above no longer reproduces. We have looked the patch
> > over carefully, but another pair of eyes would be appreciated. Please
> > review it for defects and consider whether it can be merged upstream.
> 
> Hi,
> 
> I've tried tests against this patch, and it's causing a large number
> of hangs.  For most of the hangs, it's while running generic/269,
> although there were a few other tests which would cause the kernel to
> hang.
> 
> I don't have time to try to figure out why your patch might be
> failing, at least not this week.  So if you could take a look at the
> test artifacts in this xz-compressed tarfile, I'd appreciate it.
> The "report" file contains a summary report, and the *.serial files
> contain the output from the serial consoles of the VMs which were
> hanging with your patch applied.  Perhaps you can determine what needs
> to be fixed to prevent the kernel hangs?

Well, I guess the problem is that proper lock ordering is transaction start
-> page lock and this patch inverts it so it creates all sorts of deadlock
possibilities. Lockdep will not catch this problem because page lock is not
tracked by it.

I do understand the problem description but this just isn't a viable
solution to it. There are possible solutions, like locking the first
page outside of the transaction, then unlocking it, starting a
transaction, and then only trylocking pages in
mpage_prepare_extent_to_map(), but that tends to result in pretty ugly
code. Also we'd need to make sure we don't call submit_bio() while a
transaction is started (as that is where the throttling happens) - any
such place may cause the described latency problems. It's going to be
rather difficult to find and address all such places.
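
Very roughly, the shape of that alternative would be something like the
pseudocode below - purely illustrative of the ordering I mean, with a
hypothetical skip label, not working kernel code:

	/* Wait for the first page with no handle held, then drop it. */
	lock_page(first_page);
	unlock_page(first_page);

	handle = ext4_journal_start_with_reserve(inode, EXT4_HT_WRITE_PAGE,
						 needed_blocks, rsv_blocks);

	/* Inside mpage_prepare_extent_to_map(): never block on a page
	 * lock while the handle is held, only trylock. */
	if (!trylock_page(page))
		goto skip_page;		/* hypothetical: redirty and move on */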

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

