Re: Kernel Benchmarking

* Re: Kernel Benchmarking
       [not found]                       ` <CAHk-=wiz=J=8mJ=zRG93nuJ9GtQAm5bSRAbWJbWZuN4Br38+EQ@mail.gmail.com>
@ 2020-09-11  0:05                         ` Linus Torvalds
  2020-09-11  0:49                           ` Michael Larabel
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-11  0:05 UTC (permalink / raw)
  To: Michael Larabel, Ted Ts'o, Andreas Dilger; +Cc: Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 3403 bytes --]

[ Ted / Andreas - Michael bisected a nasty regression to the new fair
page lock, and I think at least part of the reason is the ext4 page
locking patterns ]

On Thu, Sep 10, 2020 at 1:57 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I can already from a quick look see that one of the major
> "interesting" paths here is a "writev()" system call that takes a page
> fault when it copies data from user space.

I think the page fault is incidental and not important.

No, I think the issue is that ext4_write_begin() does some fairly crazy things.

It does

        page = grab_cache_page_write_begin(mapping, index, flags);
        if (!page)
                return -ENOMEM;
        unlock_page(page);

which is all kinds of bad, because where grab_cache_page_write_begin()
will get the page lock, and wait for it to not be under writeback any
more.

And then we unlock it right away.

Only to do the journal start, and after that immediately do

        lock_page(page);
        ... check that the mapping hasn't changed ..
        /* In case writeback began while the page was unlocked */
        wait_for_stable_page(page);

so it does that again.

And I think this is exactly the pattern where the old unfair page
locking worked very well, because the second "lock_page()" will
probably happen while the previous "unlock_page()" had kept it
unlocked. So 99% of the time, the second lock_page() was free.

But with the new fair page locking, the previous unlock_page() will
have given the page away to whoever was waiting for it, and now when
we do the second lock_page(), we'll block and wait for that user - and
every other possible one. Because that's fair - everybody gets the
page lock in order.

This may not be *the* reason, but it's exactly the kind of pessimal
pattern where the old unfair model worked very well (where "well"
means "good average performance, but then occasionally you get
watchdogs firing because there's no forward progress"), and the new
fair code will really stutter, because the lock/unlock/lock pattern is
basically *exactly* the wrong thing to do and only causes a complete
serialization in case there are other waiters, because fairness means
that the second lock will always be done after *all* other queued
waiters have been handled.

And the sad part is that the code doesn't even *want* the lock for
that initial case, and immediately drops it.

The main reason the code seems to want to use that
grab_cache_page_write_begin() that lkocks the page is that it wants to
create the page if it didn't exist, and that creation creates a locked
page.

But the code *could* use FGP_FOR_MMAP instead, which only locks that
initial page case.

So something like this might at least work around this particular
case. But it's *entirely* untested.

Ted, Andreas, comments? The old unfair lock_page() made this a
non-issue, but we really do have years of reports of odd watchdog
errors that seem to be due to that almost infinite unfairness under
bad loads..

Michael: it's entirely possible that the two cases in fs/ext4/inode.c
that I noticed are not that important. But I found them from following
your profile data down to lock_page() cases, so they seem to be at
least _part_ of the issue.

Again: the patch is ENTIRELY untested. It compiles for me, and it
looks superficially right, but that's all I'm going to say about it..

                    Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 2136 bytes --]

 fs/ext4/inode.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..65355e33eaae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1112,6 +1112,33 @@ static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
 }
 #endif
 
+
+/*
+ * This is like grab_cache_page_write_begin(), but without
+ * locking the page, because we want to grab a page first
+ * before we start a transaction handle.
+ *
+ * The page will be locked later.
+ *
+ * That FGP_FOR_MMAP is exactly that "I don't want it locked"
+ * flag, and is required to work with FGP_CREAT.
+ */
+static struct page *grab_cache_page_write_begin_unlocked(
+	struct address_space *mapping, pgoff_t index, unsigned flags)
+{
+	struct page *page;
+	int fgp_flags = FGP_FOR_MMAP|FGP_WRITE|FGP_CREAT;
+
+	if (flags & AOP_FLAG_NOFS)
+		fgp_flags |= FGP_NOFS;
+	page = pagecache_get_page(mapping, index, fgp_flags,
+			mapping_gfp_mask(mapping));
+	if (page)
+		wait_for_stable_page(page);
+
+	return page;
+}
+
 static int ext4_write_begin(struct file *file, struct address_space *mapping,
 			    loff_t pos, unsigned len, unsigned flags,
 			    struct page **pagep, void **fsdata)
@@ -1154,10 +1181,9 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 	 * the page (if needed) without using GFP_NOFS.
 	 */
 retry_grab:
-	page = grab_cache_page_write_begin(mapping, index, flags);
+	page = grab_cache_page_write_begin_unlocked(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-	unlock_page(page);
 
 retry_journal:
 	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
@@ -2962,10 +2988,9 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 	 * the page (if needed) without using GFP_NOFS.
 	 */
 retry_grab:
-	page = grab_cache_page_write_begin(mapping, index, flags);
+	page = grab_cache_page_write_begin_unlocked(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-	unlock_page(page);
 
 	/*
 	 * With delayed allocation, we don't log the i_disksize update

^ permalink raw reply related	[flat|nested] 65+ messages in thread