* Re: Kernel Benchmarking
       [not found]                       ` <CAHk-=wiz=J=8mJ=zRG93nuJ9GtQAm5bSRAbWJbWZuN4Br38+EQ@mail.gmail.com>
@ 2020-09-11  0:05                         ` Linus Torvalds
  2020-09-11  0:49                           ` Michael Larabel
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-11  0:05 UTC (permalink / raw)
  To: Michael Larabel, Ted Ts'o, Andreas Dilger; +Cc: Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 3403 bytes --]

[ Ted / Andreas - Michael bisected a nasty regression to the new fair
page lock, and I think at least part of the reason is the ext4 page
locking patterns ]

On Thu, Sep 10, 2020 at 1:57 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I can already from a quick look see that one of the major
> "interesting" paths here is a "writev()" system call that takes a page
> fault when it copies data from user space.

I think the page fault is incidental and not important.

No, I think the issue is that ext4_write_begin() does some fairly crazy things.

It does

        page = grab_cache_page_write_begin(mapping, index, flags);
        if (!page)
                return -ENOMEM;
        unlock_page(page);

which is all kinds of bad, because grab_cache_page_write_begin()
will get the page lock, and wait for it to not be under writeback any
more.

And then we unlock it right away.

Only to do the journal start, and after that immediately do

        lock_page(page);
        ... check that the mapping hasn't changed ..
        /* In case writeback began while the page was unlocked */
        wait_for_stable_page(page);

so it does that again.

And I think this is exactly the pattern where the old unfair page
locking worked very well, because the second "lock_page()" will
probably happen while the previous "unlock_page()" had kept it
unlocked. So 99% of the time, the second lock_page() was free.

But with the new fair page locking, the previous unlock_page() will
have given the page away to whoever was waiting for it, and now when
we do the second lock_page(), we'll block and wait for that user - and
every other possible one. Because that's fair - everybody gets the
page lock in order.

This may not be *the* reason, but it's exactly the kind of pessimal
pattern where the old unfair model worked very well (where "well"
means "good average performance, but then occasionally you get
watchdogs firing because there's no forward progress"), and the new
fair code will really stutter, because the lock/unlock/lock pattern is
basically *exactly* the wrong thing to do and only causes a complete
serialization in case there are other waiters, because fairness means
that the second lock will always be done after *all* other queued
waiters have been handled.

And the sad part is that the code doesn't even *want* the lock for
that initial case, and immediately drops it.

The main reason the code seems to want to use that
grab_cache_page_write_begin() that locks the page is that it wants to
create the page if it didn't exist, and that creation creates a locked
page.

But the code *could* use FGP_FOR_MMAP instead, which only locks that
initial page case.

So something like this might at least work around this particular
case. But it's *entirely* untested.

Ted, Andreas, comments? The old unfair lock_page() made this a
non-issue, but we really do have years of reports of odd watchdog
errors that seem to be due to that almost infinite unfairness under
bad loads..

Michael: it's entirely possible that the two cases in fs/ext4/inode.c
that I noticed are not that important. But I found them from following
your profile data down to lock_page() cases, so they seem to be at
least _part_ of the issue.

Again: the patch is ENTIRELY untested. It compiles for me, and it
looks superficially right, but that's all I'm going to say about it..

                    Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 2136 bytes --]

 fs/ext4/inode.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..65355e33eaae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1112,6 +1112,33 @@ static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
 }
 #endif
 
+
+/*
+ * This is like grab_cache_page_write_begin(), but without
+ * locking the page, because we want to grab a page first
+ * before we start a transaction handle.
+ *
+ * The page will be locked later.
+ *
+ * That FGP_FOR_MMAP is exactly that "I don't want it locked"
+ * flag, and is required to work with FGP_CREAT.
+ */
+static struct page *grab_cache_page_write_begin_unlocked(
+	struct address_space *mapping, pgoff_t index, unsigned flags)
+{
+	struct page *page;
+	int fgp_flags = FGP_FOR_MMAP|FGP_WRITE|FGP_CREAT;
+
+	if (flags & AOP_FLAG_NOFS)
+		fgp_flags |= FGP_NOFS;
+	page = pagecache_get_page(mapping, index, fgp_flags,
+			mapping_gfp_mask(mapping));
+	if (page)
+		wait_for_stable_page(page);
+
+	return page;
+}
+
 static int ext4_write_begin(struct file *file, struct address_space *mapping,
 			    loff_t pos, unsigned len, unsigned flags,
 			    struct page **pagep, void **fsdata)
@@ -1154,10 +1181,9 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 	 * the page (if needed) without using GFP_NOFS.
 	 */
 retry_grab:
-	page = grab_cache_page_write_begin(mapping, index, flags);
+	page = grab_cache_page_write_begin_unlocked(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-	unlock_page(page);
 
 retry_journal:
 	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
@@ -2962,10 +2988,9 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 	 * the page (if needed) without using GFP_NOFS.
 	 */
 retry_grab:
-	page = grab_cache_page_write_begin(mapping, index, flags);
+	page = grab_cache_page_write_begin_unlocked(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-	unlock_page(page);
 
 	/*
 	 * With delayed allocation, we don't log the i_disksize update

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-11  0:05                         ` Kernel Benchmarking Linus Torvalds
@ 2020-09-11  0:49                           ` Michael Larabel
  2020-09-11  2:20                             ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Michael Larabel @ 2020-09-11  0:49 UTC (permalink / raw)
  To: Linus Torvalds, Ted Ts'o, Andreas Dilger; +Cc: Ext4 Developers List

I should be able to fire up some benchmarks of the patch overnight to 
see what they show, but guessing something more might be at play. While 
it's plausible this might help the Apache and Nginx web server results 
as they do touch the disk, Hackbench for instance shouldn't really be 
interacting with the file-system. Was the Hackbench perf data useful at 
all or should I generate a longer run of that for more events?

Michael


On 9/10/20 7:05 PM, Linus Torvalds wrote:
> [ Ted / Andreas - Michael bisected a nasty regression to the new fair
> page lock, and I think at least part of the reason is the ext4 page
> locking patterns ]
>
> On Thu, Sep 10, 2020 at 1:57 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> I can already from a quick look see that one of the major
>> "interesting" paths here is a "writev()" system call that takes a page
>> fault when it copies data from user space.
> I think the page fault is incidental and not important.
>
> No, I think the issue is that ext4_write_begin() does some fairly crazy things.
>
> It does
>
>          page = grab_cache_page_write_begin(mapping, index, flags);
>          if (!page)
>                  return -ENOMEM;
>          unlock_page(page);
>
> which is all kinds of bad, because grab_cache_page_write_begin()
> will get the page lock, and wait for it to not be under writeback any
> more.
>
> And then we unlock it right away.
>
> Only to do the journal start, and after that immediately do
>
>          lock_page(page);
>          ... check that the mapping hasn't changed ..
>          /* In case writeback began while the page was unlocked */
>          wait_for_stable_page(page);
>
> so it does that again.
>
> And I think this is exactly the pattern where the old unfair page
> locking worked very well, because the second "lock_page()" will
> probably happen while the previous "unlock_page()" had kept it
> unlocked. So 99% of the time, the second lock_page() was free.
>
> But with the new fair page locking, the previous unlock_page() will
> have given the page away to whoever was waiting for it, and now when
> we do the second lock_page(), we'll block and wait for that user - and
> every other possible one. Because that's fair - everybody gets the
> page lock in order.
>
> This may not be *the* reason, but it's exactly the kind of pessimal
> pattern where the old unfair model worked very well (where "well"
> means "good average performance, but then occasionally you get
> watchdogs firing because there's no forward progress"), and the new
> fair code will really stutter, because the lock/unlock/lock pattern is
> basically *exactly* the wrong thing to do and only causes a complete
> serialization in case there are other waiters, because fairness means
> that the second lock will always be done after *all* other queued
> waiters have been handled.
>
> And the sad part is that the code doesn't even *want* the lock for
> that initial case, and immediately drops it.
>
> The main reason the code seems to want to use that
> grab_cache_page_write_begin() that locks the page is that it wants to
> create the page if it didn't exist, and that creation creates a locked
> page.
>
> But the code *could* use FGP_FOR_MMAP instead, which only locks that
> initial page case.
>
> So something like this might at least work around this particular
> case. But it's *entirely* untested.
>
> Ted, Andreas, comments? The old unfair lock_page() made this a
> non-issue, but we really do have years of reports of odd watchdog
> errors that seem to be due to that almost infinite unfairness under
> bad loads..
>
> Michael: it's entirely possible that the two cases in fs/ext4/inode.c
> that I noticed are not that important. But I found them from following
> your profile data down to lock_page() cases, so they seem to be at
> least _part_ of the issue.
>
> Again: the patch is ENTIRELY untested. It compiles for me, and it
> looks superficially right, but that's all I'm going to say about it..
>
>                      Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-11  0:49                           ` Michael Larabel
@ 2020-09-11  2:20                             ` Linus Torvalds
       [not found]                               ` <0cbc959e-1b8d-8d7e-1dc6-672cf5b3899a@MichaelLarabel.com>
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-11  2:20 UTC (permalink / raw)
  To: Michael Larabel; +Cc: Ted Ts'o, Andreas Dilger, Ext4 Developers List

On Thu, Sep 10, 2020 at 5:49 PM Michael Larabel
<Michael@michaellarabel.com> wrote:
>
> I should be able to fire up some benchmarks of the patch overnight to
> see what they show, but guessing something more might be at play. While
> it's plausible this might help the Apache and Nginx web server results
> as they do touch the disk, Hackbench for instance shouldn't really be
> interacting with the file-system. Was the Hackbench perf data useful at
> all or should I generate a longer run of that for more events?

The hackbench data actually does have some of the same patterns with
ext4_write_iter showing up there too, but the perf profile there is
fairly weak (it and nginx both have _much_ fewer profile data points
than the apache run had).

hackbench I also didn't feel was all that interesting, because the
performance impact seemed more mixed there.

NOTE! The whole fair locking issue does show up even without any
lock/unlock/lock patterns, because even if you don't have that
"immediately re-take the lock" thing going on as in the ext4 example,
it's a very easy pattern to trigger by simply having a microbenchmark
that does the same system call over and over again. So the
"lock-unlock-lock-unlock" pattern can be two separate system calls,
each of which just does a single lock-unlock, but they do so at a high
frequency. So the ext4 code I pointed at, which that trial patch
(maybe) fixes, is just the most egregious case of lock re-taking. It can
easily happen with an external loop too (although normally I'd expect
people to buffer writes enough that the next write certainly shouldn't
be to the same page).
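
As a purely illustrative sketch (not something from the actual
benchmark runs), a userspace loop like this hits the same page cache
page with a lock/unlock cycle on every system call:

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                char buf[64] = { 0 };
                int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

                /*
                 * Each pwrite() to offset 0 locks and then unlocks the
                 * same page cache page in ->write_begin()/->write_end(),
                 * so back-to-back system calls recreate the
                 * lock-unlock-lock pattern at high frequency.
                 */
                for (int i = 0; i < 10000000; i++)
                        pwrite(fd, buf, sizeof(buf), 0);
                return 0;
        }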

I do kind of wonder why that apache benchmark would have multiple
processes locking the same page, which is what makes me wonder if
there's something else going on. The profile wasn't entirely trivial
to read (the page locking itself does not show up very high on the
profile at all), so I might have missed some other clue.

There were other lock_page cases, ie jbd2_journal_get_write_access()
etc, so I think it's the writing side interacting with the flushing
side.

But the lock-unlock-lock pattern in the ext4 write code made me go
"yeah, that's _exactly_ the kind of thing that would potentialyl slow
down a lot".

Again, that's not saying that other similar patterns don't occur
elsewhere. It's only saying that that's the only really obvious one I
found.

            Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
       [not found]                               ` <0cbc959e-1b8d-8d7e-1dc6-672cf5b3899a@MichaelLarabel.com>
@ 2020-09-11 16:19                                 ` Linus Torvalds
  2020-09-11 22:07                                   ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-11 16:19 UTC (permalink / raw)
  To: Michael Larabel; +Cc: Ted Ts'o, Andreas Dilger, Ext4 Developers List

On Fri, Sep 11, 2020 at 6:42 AM Michael Larabel
<Michael@michaellarabel.com> wrote:
>
> From preliminary testing on a Threadripper box, the EXT4 locking patch did help with a small improvement at 10 concurrent users for Apache, but all the higher counts didn't end up showing any real change with the patch.

Ok, it's probably simply that fairness is really bad for performance
here in general, and that special case is just that - a special case,
not the main issue.

I'll have to think about it. We've certainly seen this before (the
queued spinlocks brought the same fairness issues), but this is much
worse because of how it affects scheduling on a big level.

Some middle ground hybrid model (unfair in the common case, but with
at least _some_ measure of fairness for the worst-case situation to
avoid the worst-case latency spikes) would be best, but I don't see
how to do it.

              Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-11 16:19                                 ` Linus Torvalds
@ 2020-09-11 22:07                                   ` Linus Torvalds
  2020-09-11 22:37                                     ` Michael Larabel
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-11 22:07 UTC (permalink / raw)
  To: Michael Larabel; +Cc: Ted Ts'o, Andreas Dilger, Ext4 Developers List

On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, it's probably simply that fairness is really bad for performance
> here in general, and that special case is just that - a special case,
> not the main issue.

Ahh. It turns out that I should have looked more at the fault path
after all. It was higher up in the profile, but I ignored it because I
found that lock-unlock-lock pattern lower down.

The main contention point is actually filemap_fault(). Your apache
test accesses the 'test.html' file that is mmap'ed into memory, and
all the threads hammer on that one single file concurrently and that
seems to be the main page lock contention.

Which is really sad - the page lock there isn't really all that
interesting, and the normal "read()" path doesn't even take it. But
faulting the page in does so because the page will have a long-term
existence in the page tables, and so there's a worry about racing with
truncate.

Interesting, but also very annoying.

Anyway, I don't have a solution for it, but thought I'd let you know
that I'm still looking at this.

                Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-11 22:07                                   ` Linus Torvalds
@ 2020-09-11 22:37                                     ` Michael Larabel
  2020-09-12  7:28                                       ` Amir Goldstein
  0 siblings, 1 reply; 65+ messages in thread
From: Michael Larabel @ 2020-09-11 22:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ted Ts'o, Andreas Dilger, Ext4 Developers List

On 9/11/20 5:07 PM, Linus Torvalds wrote:
> On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Ok, it's probably simply that fairness is really bad for performance
>> here in general, and that special case is just that - a special case,
>> not the main issue.
> Ahh. It turns out that I should have looked more at the fault path
> after all. It was higher up in the profile, but I ignored it because I
> found that lock-unlock-lock pattern lower down.
>
> The main contention point is actually filemap_fault(). Your apache
> test accesses the 'test.html' file that is mmap'ed into memory, and
> all the threads hammer on that one single file concurrently and that
> seems to be the main page lock contention.
>
> Which is really sad - the page lock there isn't really all that
> interesting, and the normal "read()" path doesn't even take it. But
> faulting the page in does so because the page will have a long-term
> existence in the page tables, and so there's a worry about racing with
> truncate.
>
> Interesting, but also very annoying.
>
> Anyway, I don't have a solution for it, but thought I'd let you know
> that I'm still looking at this.
>
>                  Linus

I've been running your EXT4 patch on more systems and with some 
additional workloads today. While not the original problem, the patch 
does seem to help a fair amount for the MariaDB database server. This 
wasn't one of the workloads regressing on 5.9 but at least with the 
systems tried so far the patch does make a meaningful improvement to the 
performance. I haven't run into any apparent issues with that patch so 
continuing to try it out on more systems and other database/server 
workloads.

Michael


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-11 22:37                                     ` Michael Larabel
@ 2020-09-12  7:28                                       ` Amir Goldstein
  2020-09-12 10:32                                         ` Michael Larabel
                                                           ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Amir Goldstein @ 2020-09-12  7:28 UTC (permalink / raw)
  To: Michael Larabel
  Cc: Linus Torvalds, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 1:40 AM Michael Larabel
<Michael@michaellarabel.com> wrote:
>
> On 9/11/20 5:07 PM, Linus Torvalds wrote:
> > On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >> Ok, it's probably simply that fairness is really bad for performance
> >> here in general, and that special case is just that - a special case,
> >> not the main issue.
> > Ahh. It turns out that I should have looked more at the fault path
> > after all. It was higher up in the profile, but I ignored it because I
> > found that lock-unlock-lock pattern lower down.
> >
> > The main contention point is actually filemap_fault(). Your apache
> > test accesses the 'test.html' file that is mmap'ed into memory, and
> > all the threads hammer on that one single file concurrently and that
> > seems to be the main page lock contention.
> >
> > Which is really sad - the page lock there isn't really all that
> > interesting, and the normal "read()" path doesn't even take it. But
> > faulting the page in does so because the page will have a long-term
> > existence in the page tables, and so there's a worry about racing with
> > truncate.
> >
> > Interesting, but also very annoying.
> >
> > Anyway, I don't have a solution for it, but thought I'd let you know
> > that I'm still looking at this.
> >
> >                  Linus
>
> I've been running your EXT4 patch on more systems and with some
> additional workloads today. While not the original problem, the patch
> does seem to help a fair amount for the MariaDB database server. This
> wasn't one of the workloads regressing on 5.9 but at least with the
> systems tried so far the patch does make a meaningful improvement to the
> performance. I haven't run into any apparent issues with that patch so
> continuing to try it out on more systems and other database/server
> workloads.
>

Michael,

Can you please add a reference to the original problem report and
to the offending commit? This conversation appeared on the list without
this information.

Are filesystems other than ext4 also affected by this performance
regression?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12  7:28                                       ` Amir Goldstein
@ 2020-09-12 10:32                                         ` Michael Larabel
  2020-09-12 14:37                                           ` Matthew Wilcox
  2020-09-12 15:53                                         ` Matthew Wilcox
  2020-09-12 17:59                                         ` Linus Torvalds
  2 siblings, 1 reply; 65+ messages in thread
From: Michael Larabel @ 2020-09-12 10:32 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Linus Torvalds, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

On 9/12/20 2:28 AM, Amir Goldstein wrote:
> On Sat, Sep 12, 2020 at 1:40 AM Michael Larabel
> <Michael@michaellarabel.com> wrote:
>> On 9/11/20 5:07 PM, Linus Torvalds wrote:
>>> On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>> Ok, it's probably simply that fairness is really bad for performance
>>>> here in general, and that special case is just that - a special case,
>>>> not the main issue.
>>> Ahh. It turns out that I should have looked more at the fault path
>>> after all. It was higher up in the profile, but I ignored it because I
>>> found that lock-unlock-lock pattern lower down.
>>>
>>> The main contention point is actually filemap_fault(). Your apache
>>> test accesses the 'test.html' file that is mmap'ed into memory, and
>>> all the threads hammer on that one single file concurrently and that
>>> seems to be the main page lock contention.
>>>
>>> Which is really sad - the page lock there isn't really all that
>>> interesting, and the normal "read()" path doesn't even take it. But
>>> faulting the page in does so because the page will have a long-term
>>> existence in the page tables, and so there's a worry about racing with
>>> truncate.
>>>
>>> Interesting, but also very annoying.
>>>
>>> Anyway, I don't have a solution for it, but thought I'd let you know
>>> that I'm still looking at this.
>>>
>>>                   Linus
>> I've been running your EXT4 patch on more systems and with some
>> additional workloads today. While not the original problem, the patch
>> does seem to help a fair amount for the MariaDB database server. This
>> wasn't one of the workloads regressing on 5.9 but at least with the
>> systems tried so far the patch does make a meaningful improvement to the
>> performance. I haven't run into any apparent issues with that patch so
>> continuing to try it out on more systems and other database/server
>> workloads.
>>
> Michael,
>
> Can you please add a reference to the original problem report and
> to the offending commit? This conversation appeared on the list without
> this information.
>
> Are filesystems other than ext4 also affected by this performance
> regression?
>
> Thanks,
> Amir.

On Linux 5.9 Git, Apache HTTPD, Redis, Nginx, and Hackbench appear to be 
the main workloads that are running measurably slower than on Linux 5.8 
and prior on multiple systems.

The issue was bisected to 2a9127fcf2296674d58024f83981f40b128fffea. The 
Kernel Test Robot was also previously triggered by the commit in 
question, with mixed Hackbench results. In looking at the problem, 
Linus had a hunch from the perf data that it may have interacted badly 
with the EXT4 locking behavior, for which he sent out that patch. That 
EXT4 patch didn't end up addressing the performance issue with the 
original workloads in question (though in testing other workloads it 
seems to benefit MariaDB at least; depending upon the system there can 
be slightly better performance).

Michael


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 10:32                                         ` Michael Larabel
@ 2020-09-12 14:37                                           ` Matthew Wilcox
  2020-09-12 14:44                                             ` Michael Larabel
       [not found]                                             ` <658ae026-32d9-0a25-5a59-9c510d6898d5@MichaelLarabel.com>
  0 siblings, 2 replies; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-12 14:37 UTC (permalink / raw)
  To: Michael Larabel
  Cc: Amir Goldstein, Linus Torvalds, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 05:32:11AM -0500, Michael Larabel wrote:
> On 9/12/20 2:28 AM, Amir Goldstein wrote:
> > On Sat, Sep 12, 2020 at 1:40 AM Michael Larabel
> > <Michael@michaellarabel.com> wrote:
> > > On 9/11/20 5:07 PM, Linus Torvalds wrote:
> > > > On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > > Ok, it's probably simply that fairness is really bad for performance
> > > > > here in general, and that special case is just that - a special case,
> > > > > not the main issue.
> > > > Ahh. It turns out that I should have looked more at the fault path
> > > > after all. It was higher up in the profile, but I ignored it because I
> > > > found that lock-unlock-lock pattern lower down.
> > > > 
> > > > The main contention point is actually filemap_fault(). Your apache
> > > > test accesses the 'test.html' file that is mmap'ed into memory, and
> > > > all the threads hammer on that one single file concurrently and that
> > > > seems to be the main page lock contention.
> > > > 
> > > > Which is really sad - the page lock there isn't really all that
> > > > interesting, and the normal "read()" path doesn't even take it. But
> > > > faulting the page in does so because the page will have a long-term
> > > > existence in the page tables, and so there's a worry about racing with
> > > > truncate.
> > > > 
> > > > Interesting, but also very annoying.
> > > > 
> > > > Anyway, I don't have a solution for it, but thought I'd let you know
> > > > that I'm still looking at this.
> > > > 
> > > >                   Linus
> > > I've been running your EXT4 patch on more systems and with some
> > > additional workloads today. While not the original problem, the patch
> > > does seem to help a fair amount for the MariaDB database server. This
> > > wasn't one of the workloads regressing on 5.9 but at least with the
> > > systems tried so far the patch does make a meaningful improvement to the
> > > performance. I haven't run into any apparent issues with that patch so
> > > continuing to try it out on more systems and other database/server
> > > workloads.
> > > 
> > Michael,
> > 
> > Can you please add a reference to the original problem report and
> > to the offending commit? This conversation appeared on the list without
> > this information.
> > 
> > Are filesystems other than ext4 also affected by this performance
> > regression?
> > 
> > Thanks,
> > Amir.
> 
> On Linux 5.9 Git, Apache HTTPD, Redis, Nginx, and Hackbench appear to be the
> main workloads that are running measurably slower than on Linux 5.8 and
> prior on multiple systems.
> 
> The issue was bisected to 2a9127fcf2296674d58024f83981f40b128fffea. The
> Kernel Test Robot was also previously triggered by the commit in
> question, with mixed Hackbench results. In looking at the problem,
> Linus had a hunch from the perf data that it may have interacted badly
> with the EXT4 locking behavior, for which he sent out that patch. That
> EXT4 patch didn't end up addressing the performance issue with the
> original workloads in question (though in testing other workloads it
> seems to benefit MariaDB at least; depending upon the system there can
> be slightly better performance).

Based on this limited amount of information, I would suspect there would
also be a problem with XFS, and that would be even _more_ sad because
XFS already excludes a truncate-vs-mmap race with the MMAPLOCK_SHARED in
__xfs_filemap_fault vs MMAPLOCK_EXCL ... somewhere in the truncate path,
I'm sure.  It's definitely there for the holepunch.

So maybe XFS should have its own implementation of filemap_fault,
or we should have a filemap_fault_locked() for filesystems which have
their own locking that excludes truncate.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 14:37                                           ` Matthew Wilcox
@ 2020-09-12 14:44                                             ` Michael Larabel
  2020-09-15  3:32                                               ` Matthew Wilcox
       [not found]                                             ` <658ae026-32d9-0a25-5a59-9c510d6898d5@MichaelLarabel.com>
  1 sibling, 1 reply; 65+ messages in thread
From: Michael Larabel @ 2020-09-12 14:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Amir Goldstein, Linus Torvalds, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel


On 9/12/20 9:37 AM, Matthew Wilcox wrote:
> On Sat, Sep 12, 2020 at 05:32:11AM -0500, Michael Larabel wrote:
>> On 9/12/20 2:28 AM, Amir Goldstein wrote:
>>> On Sat, Sep 12, 2020 at 1:40 AM Michael Larabel
>>> <Michael@michaellarabel.com> wrote:
>>>> On 9/11/20 5:07 PM, Linus Torvalds wrote:
>>>>> On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
>>>>> <torvalds@linux-foundation.org> wrote:
>>>>>> Ok, it's probably simply that fairness is really bad for performance
>>>>>> here in general, and that special case is just that - a special case,
>>>>>> not the main issue.
>>>>> Ahh. It turns out that I should have looked more at the fault path
>>>>> after all. It was higher up in the profile, but I ignored it because I
>>>>> found that lock-unlock-lock pattern lower down.
>>>>>
>>>>> The main contention point is actually filemap_fault(). Your apache
>>>>> test accesses the 'test.html' file that is mmap'ed into memory, and
>>>>> all the threads hammer on that one single file concurrently and that
>>>>> seems to be the main page lock contention.
>>>>>
>>>>> Which is really sad - the page lock there isn't really all that
>>>>> interesting, and the normal "read()" path doesn't even take it. But
>>>>> faulting the page in does so because the page will have a long-term
>>>>> existence in the page tables, and so there's a worry about racing with
>>>>> truncate.
>>>>>
>>>>> Interesting, but also very annoying.
>>>>>
>>>>> Anyway, I don't have a solution for it, but thought I'd let you know
>>>>> that I'm still looking at this.
>>>>>
>>>>>                    Linus
>>>> I've been running your EXT4 patch on more systems and with some
>>>> additional workloads today. While not the original problem, the patch
>>>> does seem to help a fair amount for the MariaDB database server. This
>>>> wasn't one of the workloads regressing on 5.9 but at least with the
>>>> systems tried so far the patch does make a meaningful improvement to the
>>>> performance. I haven't run into any apparent issues with that patch so
>>>> continuing to try it out on more systems and other database/server
>>>> workloads.
>>>>
>>> Michael,
>>>
>>> Can you please add a reference to the original problem report and
>>> to the offending commit? This conversation appeared on the list without
>>> this information.
>>>
>>> Are filesystems other than ext4 also affected by this performance
>>> regression?
>>>
>>> Thanks,
>>> Amir.
>> On Linux 5.9 Git, Apache HTTPD, Redis, Nginx, and Hackbench appear to be the
>> main workloads that are running measurably slower than on Linux 5.8 and
>> prior on multiple systems.
>>
>> The issue was bisected to 2a9127fcf2296674d58024f83981f40b128fffea. The
>> Kernel Test Robot was also previously triggered by the commit in
>> question, with mixed Hackbench results. In looking at the problem,
>> Linus had a hunch from the perf data that it may have interacted badly
>> with the EXT4 locking behavior, for which he sent out that patch. That
>> EXT4 patch didn't end up addressing the performance issue with the
>> original workloads in question (though in testing other workloads it
>> seems to benefit MariaDB at least; depending upon the system there can
>> be slightly better performance).
> Based on this limited amount of information, I would suspect there would
> also be a problem with XFS, and that would be even _more_ sad because
> XFS already excludes a truncate-vs-mmap race with the MMAPLOCK_SHARED in
> __xfs_filemap_fault vs MMAPLOCK_EXCL ... somewhere in the truncate path,
> I'm sure.  It's definitely there for the holepunch.
>
> So maybe XFS should have its own implementation of filemap_fault,
> or we should have a filemap_fault_locked() for filesystems which have
> their own locking that excludes truncate.

Interesting, I'll fire up some cross-filesystem benchmarks with those 
tests today and report back shortly with the difference.

Michael


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12  7:28                                       ` Amir Goldstein
  2020-09-12 10:32                                         ` Michael Larabel
@ 2020-09-12 15:53                                         ` Matthew Wilcox
  2020-09-12 17:59                                         ` Linus Torvalds
  2 siblings, 0 replies; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-12 15:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Michael Larabel, Linus Torvalds, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 10:28:29AM +0300, Amir Goldstein wrote:
> On Sat, Sep 12, 2020 at 1:40 AM Michael Larabel
> <Michael@michaellarabel.com> wrote:
> >
> > On 9/11/20 5:07 PM, Linus Torvalds wrote:
> > > On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > >> Ok, it's probably simply that fairness is really bad for performance
> > >> here in general, and that special case is just that - a special case,
> > >> not the main issue.
> > > Ahh. It turns out that I should have looked more at the fault path
> > > after all. It was higher up in the profile, but I ignored it because I
> > > found that lock-unlock-lock pattern lower down.
> > >
> > > The main contention point is actually filemap_fault(). Your apache
> > > test accesses the 'test.html' file that is mmap'ed into memory, and
> > > all the threads hammer on that one single file concurrently and that
> > > seems to be the main page lock contention.
> > >
> > > Which is really sad - the page lock there isn't really all that
> > > interesting, and the normal "read()" path doesn't even take it. But
> > > faulting the page in does so because the page will have a long-term
> > > existence in the page tables, and so there's a worry about racing with
> > > truncate.

Here's an idea (sorry, no patch, about to go out for the day)

What if we cleared PageUptodate in the truncate path?  And then
filemap_fault() looks a lot more like generic_file_buffered_read() where
we check PageUptodate, and only if it's clear do we take the page lock
and call ->readpage.

We'd need to recheck PageUptodate after installing the PTE and zap
it ourselves, and we wouldn't be able to check page_mapped() in the
truncate path any more, which would make me sad.  But there's something
to be said for making faults cheaper.

Even the XFS model where we take the MMAPLOCK_SHARED isn't free -- it's
just write-vs-write on a cacheline instead of a visible contention on
the page lock.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12  7:28                                       ` Amir Goldstein
  2020-09-12 10:32                                         ` Michael Larabel
  2020-09-12 15:53                                         ` Matthew Wilcox
@ 2020-09-12 17:59                                         ` Linus Torvalds
  2020-09-12 20:32                                           ` Rogério Brito
                                                             ` (4 more replies)
  2 siblings, 5 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-12 17:59 UTC (permalink / raw)
  To: Amir Goldstein, Hugh Dickins
  Cc: Michael Larabel, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

Sorry, I should have put much more background when I started cc'ing
people and lists..

On Sat, Sep 12, 2020 at 12:28 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Can you please add a reference to the original problem report and
> to the offending commit? This conversation appeared on the list without
> this information.
>
> Are filesystems other than ext4 also affected by this performance
> regression?

So let me expand on this, because it actually comes from a really old
problem that has been around for ages, and that I think I finally
understand.

And sadly, the regression comes from the fix to that old problem.

So I think the VM people (but perhaps not necessarily all filesystem
people) have been aware of a long-time problem with certain loads
causing huge latencies, up to and including watchdogs firing because
processes wouldn't make progress for over half a minute (or whatever
the default blocking watchdog timeout is - I would like to say that
it's some odd number like 22 seconds, but maybe that was RCU).

We've known it's related to long queues for the page lock, and about
three years ago now we added a "bookmark" entry to the page wakeup
queues, because those queues got so long that even just traversing the
wakeup queue was a big latency hit. But it's generally been some heavy
private load on a customer machine, and nobody ever really had a good
test-case for it.

We've actually had tons of different page lockers involved. One of the
suspects (and in fact I think it really was one of the causes, just
not the only one) was the NUMA migration, where under certain loads
with lots and lots of threads, the kernel would decide to try to
migrate a hot page, and lots of threads would come in and all
NUMA-fault on it (because it was some core page everything used), and
as part of the fault they would get the page lock to serialize, and
you'd end up with wait queues that were multiple _thousands_ of
entries long.

So the reports of watchdogs firing go back many many years, and over
the years we've had various band-aid fixes - things that really do
help the symptoms a lot, but really seem to be fixes for the symptoms
rather than something fundamental. That "let's break up the wait queue
with a bookmark so that we can at least enable interrupts" is perhaps
the best example of code that just shouldn't exist, but comes about
because there's been incredible contention on the page lock.

See commits 2554db916586 ("sched/wait: Break up long wake list walk")
and 11a19c7b099f ("sched/wait: Introduce wakeup bookmark in
wake_up_page_bit") for that bookmark thing and some of the list
numbers.

There's been a few actual fixes too - I think Hugh Dickins really
ended up fixing at least part of the NUMA balancing case by changing
some of the reference counting. So I don't think it's _all_ been
band-aids, but the page lock has been a thing that has come up
multiple times over the years.

See for example commit 9a1ea439b16b ("mm:
put_and_wait_on_page_locked() while page is migrated") for a patch
that ended up hopefully fixing at least one of the causes of the long
queues during migration. I say "hopefully", because (again) the loads
that cause these things were those "internal customer load" things
that we don't really have a lot of insight into. Hugh has been
involved over the years presumably exactly because google has been one
of those customers, although not the only one by far.

But the point here is that the page lock has been problematic for
years - with those reports of watchdogs (after tens of seconds!)
firing going back long before the fixes above. It's definitely not a
new thing, although I think it has perhaps become more common due to
"bigger machines running more complex loads becoming more common", but
who knows..

Anyway, for various reasons I was looking at this again a couple of
months ago: we had _yet_ another report of softlockups:

  https://lore.kernel.org/lkml/20200721063258.17140-1-mhocko@kernel.org/

and we had an unrelated thread about a low-level race in page wakeup
(the original report was wrong, but it led to figuring out another
race):

  https://lore.kernel.org/lkml/20200624161142.GA12184@redhat.com/

and there was something else going on too that I can't recall, that
had made me look at the page locking.

And while there, I realized that the simplest explanation for all
those years of softlockups was simply that the page waiting is very
very unfair indeed.

While somebody is patiently waiting for a page, another process can
(and will) come in and get the page lock from under it, and the
original waiter will end up just re-queueing - at the end of the list.
Which explains how you can get those half-minute latencies - not
because any page lock holder really holds the lock for very long
(almost all of them are CPU-bound, not IO bound), but because under
heavy load and contention, you end up with the poor waiters scheduling
away and *new* page lockers end up being treated very preferentially,
with absolutely nothing keeping them from taking the lock while
somebody else is waiting for it.

ANYWAY. That's a long email of background for the commit that I then
put in the tree this merge window:

  2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")

which actually makes the page locking truly fair (well, there's a
small window that allows new lockers to come in, but now it's about
CPU instructions rather than "scheduling a sleeping process to run",
so it's essentially gone as a major and recurring unfairness issue).

And that commit actually had a fair number of positive reports: even
before merging it, Hugh had tested this on yet another "company
internal load" (*cough*google*cough*) that used to show watchdog
failures, and it did seem to solve those, and some of the kernel test
robot benchmarks improved by tens of percent, in one case by 160%.

So everything looked fine. It solves a long-running problem, and for
once I think it really _solves_ it by fixing a fundamental problem,
rather than papering over the symptoms.

But there were a couple of nagging concerns. Hackbench showed wildly
fluctuating performance: some tests improved by a lot, others
regressed by a lot. Not a huge deal, I felt: hackbench isn't a great
benchmark, and the performance fluctuations really seemed to be going
both ways and be very dependent on exact test and machine. So it was a
slight concern, but on the whole not really worth worrying about.

But the reason Michael is on the Cc is because the Phoronix benchmark
suite showed a rather marked decrease in the apache test. Unlike
hackbench, I think that's much more of a "real" test, and it seemed to
be a lot more consistent too. So I asked for profiles (and eventually
just recreated the test locally), and I think I understand what's
going on.

It's the fairness.

Fairness is good, but fairness is usually bad for performance even if
it does get rid of the worst-case issues. In this case, it's _really_
bad for performance, because that page lock has always been unfair,
and we have a lot of patterns that have basically come to
(unintentionally) depend on that unfairness.

In particular, the page locking is often used for just verifying
simple things, with the most common example being "lock page, check
that the mapping is still valid, insert page into page tables, unlock
page".

The reason the apache benchmark regresses is that it basically does a
web server test with a single file ("test.html") that gets served by
just mmap'ing it, and sending it out that way. Using lots of threads,
and using lots of different mappings. So they *all* fault on the read
of that page, and they *all* do that "lock page, check that the
mapping is valid, insert page" dance.

That actually worked ok - not great, but ok - when the page lock was
unfair, and anybody runnable would basically just get it. Yes, people
would occasionally get put on the wait-queue, but those waiting
lockers wouldn't really affect the other ones that are going through
that dance since they would just take the lock anyway. VERY unfair,
but hey, very nice for that load.

It works much less well when the page lock is suddenly fair, and if
anybody starts waiting for it, gets the lock handed to it when the
page is unlocked. Now the page is owned by the next waiter in line,
and they're sleeping, and new page lockers don't magically and
unfairly get to just bypass the older waiter.

This is not a new issue. We've had exactly the same thing happen when
we made spinlocks, semaphores, and rwlocks be fair.

And like those other times, we had to make them fair because *not*
making them fair caused those unacceptable outliers under contention,
to the point of starvation and watchdogs firing.

Anyway, I don't have a great solution. I have a few options (roughly
ordered by "simplest to most complex"):

 (a) just revert
 (b) add some busy-spinning
 (c) reader-writer page lock
 (d) try to de-emphasize the page lock

but I'd love to hear comments.

Honestly, (a) is trivial to do. We've had the problem for years, the
really *bad* cases are fairly rare, and the workarounds mostly work.
Yeah, you get watchdogs firing, but it's not exactly _common_.

But equally honestly, I hate (a). I feel like this change really fixed
a fundamental issue, and after looking at the apache benchmark, in
many ways it's not a great benchmark. The reason it shows such a
(relatively) huge regression is that it hammers on just a single small
file. So my inclination is to say "we know how to fix the performance
regression, even if we may not be able to do so for 5.9, and this
benchmark behavior is very unlikely to actually hit a real load".

Option (b) is just because right now the page lock is very much a
black-and-white "try to lock once or sleep". Where most lockers (the
initial actual IO to fill the page being the main exception) are
CPU-bound, not IO bound. So spinning is the usual simplistic fix for
locking behavior like that. It doesn't really "fix" anything, but it
helps the bad contended performance case and we wouldn't get the
scheduling and sleeping behavior.

I can imagine coming up with a ten-liner patch to add some spinning
that claws back much of the performance on that benchmark. Maybe.
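
Just to make that concrete, here is a sketch of the kind of band-aid I
mean (purely illustrative and untested; the spin count is an arbitrary
number pulled out of thin air):

        /* Spin briefly on trylock before falling back to the fair sleep. */
        static void lock_page_maybe_spin(struct page *page)
        {
                int spin;

                for (spin = 0; spin < 100; spin++) {
                        if (trylock_page(page))
                                return;
                        cpu_relax();
                }
                lock_page(page);
        }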

I don't like option (b) very much, but it might be the band-aid for
5.9 if we feel that the benchmark results _might_ translate to real
loads.

Option (c) is, I feel, the best one. Reader-writer locks aren't
wonderful, but the page lock really tends to have two very distinct
uses: exclusive for the initial IO and for the (very very unlikely)
truncate and hole punching issues, and then the above kind of "lock to
check that it's still valid" use, which is very very common and
happens on every page fault and then some. And it would be very
natural to make the latter be a read-lock (or even just a sequence
counting one with retry rather than a real lock).
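
The sequence count variant would look roughly like the usual seqcount
reader pattern (illustrative only - "page_seq" is a hypothetical
sequence counter, not an existing field in struct page):

        unsigned seq;

        do {
                seq = read_seqcount_begin(&page_seq);
                /* check that the mapping is still valid, set up the PTE */
        } while (read_seqcount_retry(&page_seq, seq));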

Option (d) is "we already have locking in many filesystems that gives
us exclusion between faulting in a page and the truncate/hole punch,
so we shouldn't use the page lock at all".

I do think that the locking that filesystems do is in many ways
inferior - it's done on a per-inode basis rather than on a per-page
basis. But if the filesystems end up doing that *anyway*, what's the
advantage of the finer granularity one? And *because* the common case
is all about the reading case, the bigger granularity tends to work
very well in practice, and basically never sees contention.

So I think option (c) is potentially technically better because it has
smaller locking granularity, but in practice (d) might be easier and
we already effectively do it for several filesystems.

Also, making the page lock be a rw-lock may be "easy" in theory, but
in practice we have the usual "uhhuh, 'struct page' is very crowded,
and finding even just one more bit in the flags to use as a read bit
is not great, and finding a whole reader _count_ would likely require
us to go to that hashed queue, which we know has horrendous cache
behavior from past experience".

This turned out to be a very long email, and probably most people
didn't get this far. But if you did, comments, opinions, suggestions?

Any other suggestions than those (a)-(d) ones above?

               Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 17:59                                         ` Linus Torvalds
@ 2020-09-12 20:32                                           ` Rogério Brito
  2020-09-14  9:33                                             ` Jan Kara
  2020-09-12 20:58                                           ` Josh Triplett
                                                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 65+ messages in thread
From: Rogério Brito @ 2020-09-12 20:32 UTC (permalink / raw)
  To: Linus Torvalds, Amir Goldstein, Hugh Dickins
  Cc: Michael Larabel, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel,
	Rogério Theodoro de Brito

Hi, Linus and other people.

On 12/09/2020 14.59, Linus Torvalds wrote:
> On Sat, Sep 12, 2020 at 12:28 AM Amir Goldstein <amir73il@gmail.com> wrote:
>>
>> Can you please add a reference to the original problem report and
>> to the offending commit? This conversation appeared on the list without
>> this information.
>>
>> Are filesystems other than ext4 also affected by this performance
>> regression?
> 
> So let me expand on this, because it actually comes from a really old
> problem that has been around for ages, and that I think I finally
> understand.
> 
> And sadly, the regression comes from the fix to that old problem.
> 
> So I think the VM people (but perhaps not necessarily all filesystem
> people) have been aware of a long-time problem with certain loads
> causing huge latencies, up to and including watchdogs firing because
> processes wouldn't make progress for over half a minute (or whatever
> the default blocking watchdog timeout is - I would like to say that
> it's some odd number like 22 seconds, but maybe that was RCU).
> 
> We've known it's related to long queues for the page lock, and about
> three years ago now we added a "bookmark" entry to the page wakeup
> queues, because those queues got so long that even just traversing the
> wakeup queue was a big latency hit. But it's generally been some heavy
> private load on a customer machine, and nobody ever really had a good
> test-case for it.
> 
> We've actually had tons of different page lockers involved. One of the
> suspects (and in fact I think it really was one of the causes, just
> not the only one) was the NUMA migration, where under certain loads
> with lots and lots of threads, the kernel would decide to try to
> migrate a hot page, and lots of threads would come in and all
> NUMA-fault on it (because it was some core page everything used), and
> as part of the fault they would get the page lock to serialize, and
> you'd end up with wait queues that were multiple _thousands_ of
> entries long.
> 
> So the reports of watchdogs firing go back many many years, and over
> the years we've had various band-aid fixes - things that really do
> help the symptoms a lot, but really seem to be fixes for the symptoms
> rather than something fundamental. That "let's break up the wait queue
> with a bookmark so that we can at least enable interrupts" is perhaps
> the best example of code that just shouldn't exist, but comes about
> because there's been incredible contention on the page lock.
> 
> See commits 2554db916586 ("sched/wait: Break up long wake list walk")
> and 11a19c7b099f ("sched/wait: Introduce wakeup bookmark in
> wake_up_page_bit") for that bookmark thing and some of the list
> numbers.
> 
> There's been a few actual fixes too - I think Hugh Dickins really
> ended up fixing at least part of the NUMA balancing case by changing
> some of the reference counting. So I don't think it's _all_ been
> band-aids, but the page lock has been a thing that has come up
> multiple times over the years.
> 
> See for example commit 9a1ea439b16b ("mm:
> put_and_wait_on_page_locked() while page is migrated") for a patch
> that ended up hopefully fixing at least one of the causes of the long
> queues during migration. I say "hopefully", because (again) the loads
> that cause these things were those "internal customer load" things
> that we don't really have a lot of insight into. Hugh has been
> involved over the years presumably exactly because google has been one
> of those customers, although not the only one by far.
> 
> But the point here is that the page lock has been problematic for
> years - with those reports of watchdogs (after tens of seconds!)
> firing going back long before the fixes above. It's definitely not a
> new thing, although I think it has perhaps become more common due to
> "bigger machines running more complex loads becoming more common", but
> who knows..


First of all, please excuse my layman questions, but this conversation 
piqued my interest.

Now, to the subject: is what you describe (RCU or VFS) in some sense 
related to, say, copying a "big" file (e.g., a movie) to "slow" media 
(in my case, a USB thumb drive, so that I can watch said movie on my 
TV)?

I've seen backtraces mentioning "task xxx hung for yyy seconds" and a 
non-responsive cp process at that... I say RCU or VFS because I see this 
with the thumb drives with vfat filesystems (so, it wouldn't be quite 
related to ext4, apart from the fact that all my Linux-specific 
filesystems are ext4).

The same thing happens with my slow home network when I copy things to 
my crappy NFS server (an armel system that has only 128MB of RAM).

In both cases (a local or a remote fs), whenever I try to send SIGSTOP 
to the cp process, it stays there without being stopped for many 
minutes... I would venture a guess that this is because nothing else is 
done unless the outstanding bytes are actually committed to the 
filesystem...

In some sense, a very slow system with "moderate load" is akin to a 
high-end, loaded server with many threads competing for resources, in my 
experience...

OK, so many guesses and conjectures on my side, but the interactivity 
for the end user suffers a lot (I have not yet tested any 5.9 kernel).


Thanks,

Rogério.


> 
> Anyway, for various reasons I was looking at this again a couple of
> months ago: we had _yet_ another report of softlockups:
> 
>    https://lore.kernel.org/lkml/20200721063258.17140-1-mhocko@kernel.org/
> 
> and we had an unrelated thread about a low-level race in page wakeup
> (the original report was wrong, but it led to figuring out another
> race):
> 
>    https://lore.kernel.org/lkml/20200624161142.GA12184@redhat.com/
> 
> and there was something else going on too that I can't recall, that
> had made me look at the page locking.
> 
> And while there, I realized that the simplest explanation for all
> those years of softlockups was simply that the page waiting is very
> very unfair indeed.
> 
> While somebody is patiently waiting for a page, another process can
> (and will) come in and get the page lock from under it, and the
> original waiter will end up just re-queueing - at the end of the list.
> Which explains how you can get those half-minute latencies - not
> because any page lock holder really holds the lock for very long
> (almost all of them are CPU-bound, not IO bound), but because under
> heavy load and contention, you end up with the poor waiters scheduling
> away and *new* page lockers end up being treated very preferentially,
> with absolutely nothing keeping them from taking the lock while
> somebody else is waiting for it.
> 
> ANYWAY. That's a long email of background for the commit that I then
> put in the tree this merge window:
> 
>    2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> 
> which actually makes the page locking truly fair (well, there's a
> small window that allows new lockers to come in, but now it's about
> CPU instructions rather than "scheduling a sleeping process to run",
> so it's essentially gone as a major and recurring unfairness issue).
> 
> And that commit actually had a fair number of positive reports: even
> before merging it, Hugh had tested this on yet another "company
> internal load" (*cough*google*cough*) that used to show watchdog
> failures, and it did seem to solve those, and some of the kernel test
> robot benchmarks improved by tens of percent, in one case by 160%.
> 
> So everything looked fine. It solves a long-running problem, and for
> once I think it really _solves_ it by fixing a fundamental problem,
> rather than papering over the symptoms.
> 
> But there were a couple of nagging concerns. Hackbench showed wildly
> fluctuating performance: some tests improved by a lot, others
> regressed by a lot. Not a huge deal, I felt: hackbench isn't a great
> benchmark, and the performance fluctuations really seemed to be going
> both ways and be very dependent on exact test and machine. So it was a
> slight concern, but on the whole not really worth worrying about.
> 
> But the reason Michal is on the Cc is because the Phoronix benchmark
> suite showed a rather marked decrease in the apache test. Unlike
> hackbench, I think that's much more of a "real" test, and it seemed to
> be a lot more consistent too. So I asked for profiles (and eventually
> just recreated the test locally), and I think I understand what's
> going on.
> 
> It's the fairness.
> 
> Fairness is good, but fairness is usually bad for performance even if
> it does get rid of the worst-case issues. In this case, it's _really_
> bad for performance, because that page lock has always been unfair,
> and we have a lot of patterns that have basically come to
> (unintentionally) depend on that unfairness.
> 
> In particular, the page locking is often used for just verifying
> simple things, with the most common example being "lock page, check
> that the mapping is still valid, insert page into page tables, unlock
> page".
> 
> The reason the apache benchmark regresses is that it basically does a
> web server test with a single file ("test.html") that gets served by
> just mmap'ing it, and sending it out that way. Using lots of threads,
> and using lots of different mappings. So they *all* fault on the read
> of that page, and they *all* do that "lock page, check that the
> mapping is valid, insert page" dance.
> 
> That actually worked ok - not great, but ok - when the page lock was
> unfair, and anybody runnable would basically just get it. Yes, people
> would occasionally get put on the wait-queue, but those waiting
> lockers wouldn't really affect the other ones that are going through
> that dance since they would just take the lock anyway. VERY unfair,
> but hey, very nice for that load.
> 
> It works much less well when the page lock is suddenly fair, and if
> anybody starts waiting for it, gets the lock handed to it when the
> page is unlocked. Now the page is owned by the next waiter in line,
> and they're sleeping, and new page lockers don't magically and
> unfairly get to just bypass the older waiter.
> 
> This is not a new issue. We've had exactly the same thing happen when
> we made spinlocks, semaphores, and rwlocks be fair.
> 
> And like those other times, we had to make them fair because *not*
> making them fair caused those unacceptable outliers under contention,
> to the point of starvation and watchdogs firing.
> 
> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
> 
>   (a) just revert
>   (b) add some busy-spinning
>   (c) reader-writer page lock
>   (d) try to de-emphasize the page lock
> 
> but I'd love to hear comments.
> 
> Honestly, (a) is trivial to do. We've had the problem for years, the
> really *bad* cases are fairly rare, and the workarounds mostly work.
> Yeah, you get watchdogs firing, but it's not exactly _common_.
> 
> But equally honestly, I hate (a). I feel like this change really fixed
> a fundamental issue, and after looking at the apache benchmark, in
> many ways it's not a great benchmark. The reason it shows such a
> (relatively) huge regression is that it hammers on just a single small
> file. So my inclination is to say "we know how to fix the performance
> regression, even if we may not be able to do so for 5.9, and this
> benchmark behavior is very unlikely to actually hit a real load".
> 
> Option (b) is just because right now the page lock is very much a
> black-and-white "try to lock once or sleep". Where most lockers (the
> initial actual IO to fill the page being the main exception) are
> CPU-bound, not IO bound. So spinning is the usual simplistic fix for
> locking behavior like that. It doesn't really "fix" anything, but it
> helps the bad contended performance case and we wouldn't get the
> scheduling and sleeping behavior.
> 
> I can imagine coming up with a ten-liner patch to add some spinning
> that claws back much of the performance on that benchmark. Maybe.
> 
> I don't like option (b) very much, but it might be the band-aid for
> 5.9 if we feel that the benchmark results _might_ translate to real
> loads.
> 
> Option (c) is, I feel, the best one. Reader-writer locks aren't
> wonderful, but the page lock really tends to have two very distinct
> uses: exclusive for the initial IO and for the (very very unlikely)
> truncate and hole punching issues, and then the above kind of "lock to
> check that it's still valid" use, which is very very common and
> happens on every page fault and then some. And it would be very
> natural to make the latter be a read-lock (or even just a sequence
> counting one with retry rather than a real lock).
> 
> Option (d) is "we already have a locking in many filesystems that give
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
> 
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.
> 
> So I think option (c) is potentially technically better because it has
> smaller locking granularity, but in practice (d) might be easier and
> we already effectively do it for several filesystems.
> 
> Also, making the page lock be a rw-lock may be "easy" in theory, but
> in practice we have the usual "uhhuh, 'struct page' is very crowded,
> and finding even just one more bit in the flags to use as a read bit
> is not great, and finding a whole reader _count_ would likely require
> us to go to that hashed queue, which we know has horrendous cache
> behavior from past experience".
> 
> This turned out to be a very long email, and probably most people
> didn't get this far. But if you did, comments, opinions, suggestions?
> 
> Any other suggestions than those (a)-(d) ones above?
> 
>                 Linus
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 17:59                                         ` Linus Torvalds
  2020-09-12 20:32                                           ` Rogério Brito
@ 2020-09-12 20:58                                           ` Josh Triplett
  2020-09-12 20:59                                           ` James Bottomley
                                                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Josh Triplett @ 2020-09-12 20:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Amir Goldstein, Hugh Dickins, Michael Larabel, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 10:59:40AM -0700, Linus Torvalds wrote:
> So I think the VM people (but perhaps not necessarily all filesystem
> people) have been aware of a long-time problem with certain loads
> causing huge latencies, up to and including watchdogs firing because
> processes wouldn't make progress for over half a minute (or whatever
> the default blocking watchdog timeout is - I would like to say that
> it's some odd number like 22 seconds, but maybe that was RCU).
> 
> We've known it's related to long queues for the page lock, and about
> three years ago now we added a "bookmark" entry to the page wakeup
> queues, because those queues got so long that even just traversing the
> wakeup queue was a big latency hit. But it's generally been some heavy
> private load on a customer machine, and nobody ever really had a good
> test-case for it.

I don't *know* if this is the same bottleneck, but I have an easily
reproducible workload that rather reliably triggers softlockup
watchdogs, massive performance bottlenecks, and processes that hang for
a while without making forward progress, and it seemed worth mentioning
in case it might serve as a reproducer for those private workloads.
(Haven't tested it on a kernel with this fairness fix added; most recent
tests were on 5.7-rc6.)

On a GCP n1-highcpu-96 instance, with nested virtualization enabled,
create a QEMU/KVM VM with the same number of CPUs backed by a disk image
using either NVME or virtio, and in that VM, build a defconfig kernel
with `make -j$(nproc)`. Lots of softlockup warnings, processes that
should be very quick hanging for a long time, and the build on the guest
is up to 5x slower than the host system, with 12-15x the system time.

I've seen similar softlockups with huge VMs running on physical
hardware, not just on cloud systems that allow nested virtualization.
This is *probably* reproducible for anyone who has local hardware with
lots of CPUs, but doing it on GCP should be accessible to anyone.

(I'm not using GCP anymore, and the systems I'm using don't support
nested virtualization, so I don't have this workload readily available
anymore. It was a completely standard Debian image with the cloud kernel
installed, and zero unusual configuration.)

> Fairness is good, but fairness is usually bad for performance even if
> it does get rid of the worst-case issues. In this case, it's _really_
> bad for performance, because that page lock has always been unfair,
> and we have a lot of patterns that have basically come to
> (unintentionally) depend on that unfairness.
> 
> In particular, the page locking is often used for just verifying
> simple things, with the most common example being "lock page, check
> that the mapping is still valid, insert page into page tables, unlock
> page".
[...]
> This is not a new issue. We've had exactly the same thing happen when
> we made spinlocks, semaphores, and rwlocks be fair.
> 
> And like those other times, we had to make them fair because *not*
> making them fair caused those unacceptable outliers under contention,
> to the point of starvation and watchdogs firing.
> 
> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
> 
>  (a) just revert
>  (b) add some busy-spinning
>  (c) reader-writer page lock
>  (d) try to de-emphasize the page lock
> 
> but I'd love to hear comments.

[...]

> Honestly, (a) is trivial to do. We've had the problem for years, the
> really *bad* cases are fairly rare, and the workarounds mostly work.
> Yeah, you get watchdogs firing, but it's not exactly _common_.

I feel like every time I run a non-trivial load inside a huge VM, I end
up hitting those watchdogs; they don't *feel* rare.

> Option (c) is, I feel, the best one. Reader-writer locks aren't
> wonderful, but the page lock really tends to have two very distinct
> uses: exclusive for the initial IO and for the (very very unlikely)
> truncate and hole punching issues, and then the above kind of "lock to
> check that it's still valid" use, which is very very common and
> happens on every page fault and then some. And it would be very
> natural to make the latter be a read-lock (or even just a sequence
> counting one with retry rather than a real lock).
> 
> Option (d) is "we already have a locking in many filesystems that give
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
> 
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.
> 
> So I think option (c) is potentially technically better because it has
> smaller locking granularity, but in practice (d) might be easier and
> we already effectively do it for several filesystems.

If filesystems are going to have to have that lock *anyway*, and it
makes the page lock entirely redundant for that use case, then it
doesn't seem like there's any point to making the page lock cheaper if
we can avoid it entirely. On the other hand, that seems like it might
make locking a *lot* more complicated, if the synchronization on a
struct page is "usually the page lock, but if it's a filesystem page,
then a filesystem-specific lock instead".

So, it seems like there'd be two deciding factors between (c) and (d):
- Whether filesystems might ever be able to use the locks in (c) to
  reduce or avoid having to do their own locking for this case. (Seems
  like there might be a brlock-style approach that could work for
  truncate/hole-punch.)
- Whether (d) would make the locking story excessively complicated
  compared to (c).

> This turned out to be a very long email, and probably most people
> didn't get this far. But if you did, comments, opinions, suggestions?
> 
> Any other suggestions than those (a)-(d) ones above?
> 
>                Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 17:59                                         ` Linus Torvalds
  2020-09-12 20:32                                           ` Rogério Brito
  2020-09-12 20:58                                           ` Josh Triplett
@ 2020-09-12 20:59                                           ` James Bottomley
  2020-09-12 21:15                                             ` Linus Torvalds
  2020-09-12 22:32                                           ` Matthew Wilcox
  2020-09-13  0:40                                           ` Dave Chinner
  4 siblings, 1 reply; 65+ messages in thread
From: James Bottomley @ 2020-09-12 20:59 UTC (permalink / raw)
  To: Linus Torvalds, Amir Goldstein, Hugh Dickins
  Cc: Michael Larabel, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, 2020-09-12 at 10:59 -0700, Linus Torvalds wrote:
[...]
> Any other suggestions than those (a)-(d) ones above?

What about revert and try to fix the outliers?  Say by having a timer
set when a process gets put to sleep waiting on the page lock.  If the
timer fires, it gets woken up and put at the head of the queue.  I
suppose it would also be useful to know if this had happened, so if the
timer has to be reset because the process again fails to win and gets
put to sleep, it should perhaps be woken after a shorter interval, or
perhaps it should spin before sleeping.

I'm not advocating this as the long term solution, but it could be the
stopgap while people work on (c).

James


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 20:59                                           ` James Bottomley
@ 2020-09-12 21:15                                             ` Linus Torvalds
  0 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-12 21:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, Hugh Dickins, Michael Larabel, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 1:59 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Sat, 2020-09-12 at 10:59 -0700, Linus Torvalds wrote:
> [...]
> > Any other suggestions than those (a)-(d) ones above?
>
> What about revert and try to fix the outliers?  Say by having a timer
> set when a process gets put to sleep waiting on the page lock.

No timer needed, I suspect.

I tried to code something like this up yesterday (hmm, Thursday?) as
a "hybrid" scheme, where we'd start out with the old behavior and let
people unfairly get the lock while there were waiters, but when a
waiter woke up and noticed that it still couldn't get the lock, _then_
it would start using the new scheme.

So still be unfair for a bit, but limit the unfairness so that a
waiter won't lose the lock more than once (but obviously while the
waiter initially slept, _many_ other lockers could have come through).

I ended up with a code mess and gave up on it (it seemed to just get
all the complications from the old _and_ the new model), but maybe I
should try again now that I know what went wrong last time. I think I
tried too hard to actually mix the old and the new code.

(If I tried again, I'd not try to mix the new and the old code, I'd
make the new one start out with a non-exclusive wait - which the code
already supports for that whole "wait for PG_writeback to end" as
opposed to "wait to take PG_lock" - and then turn it into an exclusive
wait if it fails.. That might work out better and not mix entirely
different approaches).

             Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 17:59                                         ` Linus Torvalds
                                                             ` (2 preceding siblings ...)
  2020-09-12 20:59                                           ` James Bottomley
@ 2020-09-12 22:32                                           ` Matthew Wilcox
  2020-09-13  0:40                                           ` Dave Chinner
  4 siblings, 0 replies; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-12 22:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Amir Goldstein, Hugh Dickins, Michael Larabel, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 10:59:40AM -0700, Linus Torvalds wrote:
> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
> 
>  (a) just revert
>  (b) add some busy-spinning
>  (c) reader-writer page lock
>  (d) try to de-emphasize the page lock
> 
> Option (d) is "we already have a locking in many filesystems that give
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
> 
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.

I guess this is option (e).  Completely untested; not even compiled,
but it might be a design that means filesystems don't need to take
per-inode locks.  I probably screwed up the drop-mmap-lock-for-io
parts of filemap_fault.  I definitely didn't update DAX for the
new parameter for finish_fault(), and now I think about it, I didn't
update the header file either, so it definitely won't compile.

diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc..3909613f1c9c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2602,8 +2602,22 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		}
 	}
 
+	if (fpin)
+		goto out_retry;
+	if (likely(PageUptodate(page)))
+		goto uptodate;
+
 	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
 		goto out_retry;
+	VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+
+	/* Did somebody else update it for us? */
+	if (PageUptodate(page)) {
+		unlock_page(page);
+		if (fpin)
+			goto out_retry;
+		goto uptodate;
+	}
 
 	/* Did it get truncated? */
 	if (unlikely(compound_head(page)->mapping != mapping)) {
@@ -2611,14 +2625,6 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		put_page(page);
 		goto retry_find;
 	}
-	VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
-
-	/*
-	 * We have a locked page in the page cache, now we need to check
-	 * that it's up-to-date. If not, it is going to be due to an error.
-	 */
-	if (unlikely(!PageUptodate(page)))
-		goto page_not_uptodate;
 
 	/*
 	 * We've made it this far and we had to drop our mmap_lock, now is the
@@ -2641,10 +2647,6 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 	}
 
-	vmf->page = page;
-	return ret | VM_FAULT_LOCKED;
-
-page_not_uptodate:
 	/*
 	 * Umm, take care of errors if the page isn't up-to-date.
 	 * Try to re-read it _once_. We do this synchronously,
@@ -2680,6 +2682,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (fpin)
 		fput(fpin);
 	return ret | VM_FAULT_RETRY;
+
+uptodate:
+	vmf->page = page;
+	return ret | VM_FAULT_UPTODATE;
 }
 EXPORT_SYMBOL(filemap_fault);
 
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76..48fb04e75a3a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3460,6 +3460,8 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return VM_FAULT_HWPOISON;
 	}
 
+	if (ret & VM_FAULT_UPTODATE)
+		return ret;
 	if (unlikely(!(ret & VM_FAULT_LOCKED)))
 		lock_page(vmf->page);
 	else
@@ -3684,7 +3686,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
  *
  * Return: %0 on success, %VM_FAULT_ code in case of error.
  */
-vm_fault_t finish_fault(struct vm_fault *vmf)
+vm_fault_t finish_fault(struct vm_fault *vmf, vm_fault_t ret2)
 {
 	struct page *page;
 	vm_fault_t ret = 0;
@@ -3704,9 +3706,17 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		ret = check_stable_address_space(vmf->vma->vm_mm);
 	if (!ret)
 		ret = alloc_set_pte(vmf, page);
+	if (ret2 & VM_FAULT_UPTODATE) {
+		if (!PageUptodate(page)) {
+			/* probably other things to do here */
+			page_remove_rmap(page);
+			pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+			put_page(page);
+		}
+	}
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return ret;
+	return ret | ret2;
 }
 
 static unsigned long fault_around_bytes __read_mostly =
@@ -3844,8 +3854,9 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
+	ret = finish_fault(vmf, ret);
+	if (!(ret & VM_FAULT_UPTODATE))
+		unlock_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		put_page(vmf->page);
 	return ret;
@@ -3878,8 +3889,9 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
-	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
+	ret = finish_fault(vmf, ret);
+	if (!(ret & VM_FAULT_UPTODATE))
+		unlock_page(vmf->page);
 	put_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
@@ -3912,10 +3924,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 		}
 	}
 
-	ret |= finish_fault(vmf);
+	ret = finish_fault(vmf, ret);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
 					VM_FAULT_RETRY))) {
-		unlock_page(vmf->page);
+		if (!(ret & VM_FAULT_UPTODATE))
+			unlock_page(vmf->page);
 		put_page(vmf->page);
 		return ret;
 	}
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..649381703f31 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -176,6 +176,7 @@ void do_invalidatepage(struct page *page, unsigned int offset,
 static void
 truncate_cleanup_page(struct address_space *mapping, struct page *page)
 {
+	ClearPageUptodate(page);
 	if (page_mapped(page)) {
 		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
 		unmap_mapping_pages(mapping, page->index, nr, false);
@@ -738,7 +739,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 								1, false);
 				}
 			}
-			BUG_ON(page_mapped(page));
 			ret2 = do_launder_page(mapping, page);
 			if (ret2 == 0) {
 				if (!invalidate_complete_page2(mapping, page))

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 17:59                                         ` Linus Torvalds
                                                             ` (3 preceding siblings ...)
  2020-09-12 22:32                                           ` Matthew Wilcox
@ 2020-09-13  0:40                                           ` Dave Chinner
  2020-09-13  2:39                                             ` Linus Torvalds
  2020-09-13  3:18                                             ` Matthew Wilcox
  4 siblings, 2 replies; 65+ messages in thread
From: Dave Chinner @ 2020-09-13  0:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Amir Goldstein, Hugh Dickins, Michael Larabel, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 10:59:40AM -0700, Linus Torvalds wrote:

[...]

> In particular, the page locking is often used for just verifying
> simple things, with the most common example being "lock page, check
> that the mapping is still valid, insert page into page tables, unlock
> page".
> 
> The reason the apache benchmark regresses is that it basically does a
> web server test with a single file ("test.html") that gets served by
> just mmap'ing it, and sending it out that way. Using lots of threads,
> and using lots of different mappings. So they *all* fault on the read
> of that page, and they *all* do that "lock page, check that the
> mapping is valid, insert page" dance.

Hmmmm. So this is typically a truncate race check, but this isn't
sufficient to protect the fault against all page invalidation races
as the page can be re-inserted into the same mapping at a different
page->index now within EOF.
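
(For anyone not staring at mm/filemap.c right now: the check being
discussed is roughly the pattern below - a simplified sketch of the
fault path, not the literal mainline code:

        lock_page(page);
        if (page->mapping != mapping) {
                /* truncated while we waited - go back and retry */
                unlock_page(page);
                put_page(page);
                goto retry_find;
        }
        /* ... insert the page into the page tables, then unlock ... */

i.e. it only notices "this page left the mapping", which is fine for
truncate but not for every invalidation race.)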

Hence filesystems that support hole punching have to serialise the
->fault path against the page invalidations done in ->fallocate
operations, because otherwise we get data corruption from the
mm/ truncate checks failing to detect invalidated pages within EOF
correctly.

i.e. truncate/hole punch is a multi-object modification operation,
with the typical atomicity boundary of the operation defined by the
inode_lock() and/or the filesystem transaction that makes the
modification. IOWs, page_lock() based truncation/invalidation checks
aren't atomic w.r.t. the other objects being modified in the same
operation. Truncate avoids this by ordering the file size update
vs the page cache invalidation, but no such ordering protection can
be provided for ->fallocate() operations that directly manipulate
the metadata of user data in the file.

> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
> 
>  (a) just revert
>  (b) add some busy-spinning
>  (c) reader-writer page lock
>  (d) try to de-emphasize the page lock

....

> Option (d) is "we already have a locking in many filesystems that give
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
>
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.

*nod*

Given that:

1) we have been doing (d) for 5 years (see commit 653c60b633a ("xfs:
introduce mmap/truncate lock")),

2) ext4 also has this same functionality,

3) DAX requires the ability for filesystems to exclude page faults

4) it is a widely deployed and tested solution

5) filesystems will still need to be able to exclude page faults
over a file range while they directly manipulate file metadata to
change the user data in the file

> So I think option (c) is potentially technically better because it has
> smaller locking granularity, but in practice (d) might be easier and
> we already effectively do it for several filesystems.

Right.  Even if we go for (c), AFAICT we still need (d), largely
because of reason (5) above.  There are a whole
class of "page fault vs direct storage manipulation offload"
serialisation issues that filesystems have to consider (especially
if they want to support DAX), so if we can use that same mechanism
to knock a whole bunch of page locking out of the fault paths then
that seems like a win to me....

> Any other suggestions than those (a)-(d) ones above?

Not really - I've been advocating for (d) as the general mechanism
for truncate/holepunch exclusion for quite a few years now because
it largely seems to work with no obvious/apparent issues.
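
(Schematically, the XFS arrangement looks like the sketch below. This
is a from-memory paraphrase rather than the literal source, but it
shows the shape of (d): faults take the per-inode mmap lock shared,
and anything that invalidates cached data takes it exclusive:

        /* page fault path - schematic, not the exact XFS code */
        xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
        ret = filemap_fault(vmf);
        xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

        /* truncate / hole punch / other fallocate ops */
        xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
        truncate_pagecache_range(inode, start, end);
        /* ... manipulate the file metadata ... */
        xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);

The common read-fault case only ever takes the shared side, which is
why it basically never sees contention in practice.)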

Just as a FWIW: I agree that the per-inode rwsem could be an issue
here, just as it is for the IO path.  As a side project I'm working
on shared/exclusive range locks for the XFS inode to replace the
rwsems for the XFS_IOLOCK_{SHARED,EXCL} and the
XFS_MMAPLOCK_{SHARED,EXCL}.

That will largely alleviate any problems that "per-inode rwsem"
serialisation might cause us here - I've got the DIO fastpath down to
2 atomic operations per lock/unlock - it's within 10% of rwsems up to
approx. half a million concurrent DIO read/writes to the same inode.
Concurrent buffered read/write are not far behind direct IO until I
run out of CPU to copy data. None of this requires changes to
anything outside fs/xfs because everything is already correctly serialised
to "atomic" filesystem operations and range locking preserves the
atomicity of those operations including all the page cache
operations done within them.

Hence I'd much prefer to be moving the page cache in a direction
that results in the page cache not having to care at all about
serialising against racing truncates, hole punches or anything else
that runs page invalidation. That will make the page cache code
simpler, require less locking, and likely have less invalidation
related issues over time...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-13  0:40                                           ` Dave Chinner
@ 2020-09-13  2:39                                             ` Linus Torvalds
  2020-09-13  3:40                                               ` Matthew Wilcox
  2020-09-13 23:45                                               ` Dave Chinner
  2020-09-13  3:18                                             ` Matthew Wilcox
  1 sibling, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-13  2:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Hugh Dickins, Michael Larabel, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 5:41 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Hmmmm. So this is typically a truncate race check, but this isn't
> sufficient to protect the fault against all page invalidation races
> as the page can be re-inserted into the same mapping at a different
> page->index now within EOF.

Using some "move" ioctl or similar and using a "invalidate page
mapping, then move it to a different point" model?

Yeah. I think that ends up being basically an extended special case of
the truncate thing (for the invalidate), and would require the
filesystem to serialize externally to the page anyway.

Which they would presumably already do with the MMAPLOCK or similar,
so I guess that's not a huge deal.

The real worry with (d) is that we are using the page lock for other
things too, not *just* the truncate check. Things where the inode lock
wouldn't be helping, like locking against throwing pages out of the
page cache entirely, or the hugepage splitting/merging etc. It's not
being truncated, it's just the VM shrinking the cache or modifying
things in other ways.

So I do worry a bit about trying to make things per-inode (or even
some per-range thing with a smarter lock) for those reasons. We use
the page lock not just for synchronizing with filesystem operations,
but for other page state synchronization too.

In many ways I think keeping it as a page-lock, and making the
filesystem operations just act on the range of pages would be safer.

But the page locking code does have some extreme downsides, exactly
because there are so _many_ pages and we end up having to play some
extreme size games due to that (ie the whole external hashing, but
also just not being able to use any debug locks etc, because we just
don't have the resources to do debugging locks at that kind of
granularity).

That's somewhat more longer-term. I'll try to do another version of
the "hybrid fairness" page lock (and/or just try some limited
optimistic spinning) to see if I can at least avoid the nasty
regression. Admittedly it really probably only happens for these kinds
of microbenchmarks that just hammer on one page over and over again,
but it's a big enough regression for a "real enough" load that I
really don't like it.

                 Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-13  0:40                                           ` Dave Chinner
  2020-09-13  2:39                                             ` Linus Torvalds
@ 2020-09-13  3:18                                             ` Matthew Wilcox
  1 sibling, 0 replies; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-13  3:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Amir Goldstein, Hugh Dickins, Michael Larabel,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sun, Sep 13, 2020 at 10:40:57AM +1000, Dave Chinner wrote:
> > The reason the apache benchmark regresses is that it basically does a
> > web server test with a single file ("test.html") that gets served by
> > just mmap'ing it, and sending it out that way. Using lots of threads,
> > and using lots of different mappings. So they *all* fault on the read
> > of that page, and they *all* do that "lock page, check that the
> > mapping is valid, insert page" dance.
> 
> Hmmmm. So this is typically a truncate race check, but this isn't
> sufficient to protect the fault against all page invalidation races
> as the page can be re-inserted into the same mapping at a different
> page->index now within EOF.

No it can't.  find_get_page() returns the page with an elevated refcount.
The page can't be reused until we call put_page().  It can be removed
from the page cache, but can't go back to the page allocator until the
refcount hits zero.
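
Roughly, as a sketch of the lifetime rule (not a quote of the actual
fault path):

        page = find_get_page(mapping, index);   /* refcount is elevated */
        if (page) {
                /*
                 * An invalidation can remove the page from the cache
                 * here, but the struct page stays ours: it cannot go
                 * back to the allocator and be reused for a different
                 * offset until this reference is dropped.
                 */
                put_page(page);
        }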

> 5) filesystems will still need to be able to exclude page faults
> over a file range while they directly manipulate file metadata to
> change the user data in the file

Yes, but they can do that with a lock inside ->readpage (and, for that
matter in ->readahead()), so there's no need to take a lock for pages
which are stable in cache.
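
Something along these lines, say - the helper and lock names here are
made up purely for illustration:

        static int myfs_readpage(struct file *file, struct page *page)
        {
                struct inode *inode = page->mapping->host;
                int ret;

                /*
                 * Only the IO that brings a page uptodate needs to
                 * exclude hole punch; pages that are already stable
                 * in cache never get here.
                 *
                 * MYFS_I()/fault_rwsem/myfs_do_readpage() are made-up
                 * names for this sketch.
                 */
                down_read(&MYFS_I(inode)->fault_rwsem);
                ret = myfs_do_readpage(file, page);
                up_read(&MYFS_I(inode)->fault_rwsem);
                return ret;
        }

so the common "page is uptodate in cache" case never touches the
per-inode lock at all.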


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-13  2:39                                             ` Linus Torvalds
@ 2020-09-13  3:40                                               ` Matthew Wilcox
  2020-09-13 23:45                                               ` Dave Chinner
  1 sibling, 0 replies; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-13  3:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Amir Goldstein, Hugh Dickins, Michael Larabel,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sat, Sep 12, 2020 at 07:39:31PM -0700, Linus Torvalds wrote:
> The real worry with (d) is that we are using the page lock for other
> things too, not *just* the truncate check. Things where the inode lock
> wouldn't be helping, like locking against throwing pages out of the
> page cache entirely, or the hugepage splitting/merging etc. It's not
> being truncated, it's just the VM shrinking the cache or modifying
> things in other ways.

Actually, hugepage splitting is done under the protection of page freezing
where we temporarily set the refcount to zero, so pagecache lookups spin
rather than sleep on the lock.  Quite nasty, but also quite rare.
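
(The mechanism is roughly page_ref_freeze()/page_ref_unfreeze() - a
sketch from memory, not the exact split_huge_page() code:

        if (page_ref_freeze(head, expected_refcount)) {
                /* refcount is 0: concurrent lookups spin and retry */
                /* ... tear the compound page apart ... */
                page_ref_unfreeze(head, new_refcount);
        }

so the nastiness is confined to lookups that race with the rare split.)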

> But the page locking code does have some extreme downsides, exactly
> because there are so _many_ pages and we end up having to play some

The good news is that the THP patchset is making good progress.  I have
seven consecutive successful three-hour runs of xfstests, so maybe we'll
see fewer pages in the future.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-13  2:39                                             ` Linus Torvalds
  2020-09-13  3:40                                               ` Matthew Wilcox
@ 2020-09-13 23:45                                               ` Dave Chinner
  2020-09-14  3:31                                                 ` Matthew Wilcox
  2020-09-15  9:27                                                 ` Jan Kara
  1 sibling, 2 replies; 65+ messages in thread
From: Dave Chinner @ 2020-09-13 23:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Amir Goldstein, Hugh Dickins, Michael Larabel, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 07:39:31PM -0700, Linus Torvalds wrote:
> On Sat, Sep 12, 2020 at 5:41 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > Hmmmm. So this is typically a truncate race check, but this isn't
> > sufficient to protect the fault against all page invalidation races
> > as the page can be re-inserted into the same mapping at a different
> > page->index now within EOF.
> 
> Using some "move" ioctl or similar and using a "invalidate page
> mapping, then move it to a different point" model?

Right, that's the sort of optimisation we could do inside a
FALLOC_FL_{COLLAPSE,INSERT}_RANGE operation if we wanted to preserve
the page cache contents instead of invalidating it.

> Yeah. I think that ends up being basically an extended special case of
> the truncate thing (for the invalidate), and would require the
> filesystem to serialize externally to the page anyway.

*nod*

> Which they would presumably already do with the MMAPLOCK or similar,
> so I guess that's not a huge deal.
> 
> The real worry with (d) is that we are using the page lock for other
> things too, not *just* the truncate check. Things where the inode lock
> wouldn't be helping, like locking against throwing pages out of the
> page cache entirely, or the hugepage splitting/merging etc. It's not
> being truncated, it's just the VM shrinking the cache or modifying
> things in other ways.

Yes, that is a problem, and us FS people don't know/see all the
places this can occur. We generally find out about them when one of
our regression stress tests trips over a data corruption. :(

I have my doubts that complex page cache manipulation operations
like ->migrate_page that rely exclusively on page and internal mm
serialisation are really safe against ->fallocate based invalidation
races.  I think they probably also need to be wrapped in the
MMAPLOCK, but I don't understand all the locking and constraints
that ->migrate_page has and there's been no evidence yet that it's a
problem so I've kinda left that alone. I suspect that "no evidence"
thing comes from "filesystem people are largely unable to induce
page migrations in regression testing" so it has pretty much zero
test coverage....

Stuff like THP splitting hasn't been an issue for us because the
file-backed page cache does not support THP (yet!). That's
something I'll be looking closely at in Willy's upcoming patchset.

> So I do worry a bit about trying to make things per-inode (or even
> some per-range thing with a smarter lock) for those reasons. We use
> the page lock not just for synchronizing with filesystem operations,
> but for other page state synchronization too.

Right, I'm not suggesting the page lock goes away, just saying that
we actually need two levels of locking for file-backed pages - one
filesystem, one page level - and that carefully selecting where we
"aggregate" the locking for complex multi-object operations might
make the overall locking simpler.

> In many ways I think keeping it as a page-lock, and making the
> filesystem operations just act on the range of pages would be safer.

Possibly, but that "range of pages" lock still doesn't really solve
the filesystem level serialisation problem.  We have to prevent page
faults from running over a range even when there aren't pages in the
page cache over that range (i.e. after we invalidate the range).
Hence we cannot rely on anything struct page related - the
serialisation mechanism has to be external to the cached pages
themselves, but it also has to integrate cleanly into the existing
locking and transaction ordering constraints we have.

> But the page locking code does have some extreme downsides, exactly
> because there are so _many_ pages and we end up having to play some
> extreme size games due to that (ie the whole external hashing, but
> also just not being able to use any debug locks etc, because we just
> don't have the resources to do debugging locks at that kind of
> granularity).

*nod*

The other issue here is that serialisation via individual cache
object locking just doesn't scale in any way to the sizes of
operations that fallocate() can run. fallocate() has 64 bit
operands, so a user could ask us to lock down a full 8EB range of
file. Locking that page by page, even using 1GB huge page Xarray
slot entries, is just not practical... :/
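
(To put numbers on that: 8EB is 2^63 bytes, i.e. 2^51 4kB pages, and
still 2^33 - roughly 8.6 billion - slots even at 1GB per entry.)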

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-13 23:45                                               ` Dave Chinner
@ 2020-09-14  3:31                                                 ` Matthew Wilcox
  2020-09-15 14:28                                                   ` Chris Mason
  2020-09-15  9:27                                                 ` Jan Kara
  1 sibling, 1 reply; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-14  3:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Amir Goldstein, Hugh Dickins, Michael Larabel,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Mon, Sep 14, 2020 at 09:45:03AM +1000, Dave Chinner wrote:
> I have my doubts that complex page cache manipulation operations
> like ->migrate_page that rely exclusively on page and internal mm
> serialisation are really safe against ->fallocate based invalidation
> races.  I think they probably also need to be wrapped in the
> MMAPLOCK, but I don't understand all the locking and constraints
> that ->migrate_page has and there's been no evidence yet that it's a
> problem so I've kinda left that alone. I suspect that "no evidence"
> thing comes from "filesystem people are largely unable to induce
> page migrations in regression testing" so it has pretty much zero
> test coverage....

Maybe we can get someone who knows the page migration code to give
us a hack to induce pretty much constant migration?

> Stuff like THP splitting hasn't been an issue for us because the
> file-backed page cache does not support THP (yet!). That's
> something I'll be looking closely at in Willy's upcoming patchset.

One of the things I did was fail every tenth I/O to a THP.  That causes
us to split the THP when we come to try to make use of it.  Far more
effective than using dm-flakey because I know that failing a readahead
I/O should not cause any test to fail, so any newly-failing test is
caused by the THP code.

I've probably spent more time looking at the page splitting and
truncate/hole-punch/invalidate/invalidate2 paths than anything else.
It's definitely an area where more eyes are welcome, and just having
more people understand it would be good.  split_huge_page_to_list and
its various helper functions are about 400 lines of code and, IMO,
a little too complex.

> The other issue here is that serialisation via individual cache
> object locking just doesn't scale in any way to the sizes of
> operations that fallocate() can run. fallocate() has 64 bit
> operands, so a user could ask us to lock down a full 8EB range of
> file. Locking that page by page, even using 1GB huge page Xarray
> slot entries, is just not practical... :/

FWIW, there's not currently a "lock down this range" mechanism in
the page cache.  If there were, it wouldn't be restricted to 4k/2M/1G
sizes -- with the XArray today, it's fairly straightforward to
lock ranges which are m * 64^n entries in size (for 1 <= m <= 63, n >= 0).
In the next year or two, I hope to be able to offer a "lock arbitrary
page range" feature which is as cheap to lock 8EiB as it is 128KiB.

It would still be page-ranges, not byte-ranges, so I don't know how well
that fits your needs.  It doesn't solve the DIO vs page cache problems
at all, since we want DIO to ranges which happen to be within the same
pages as each other to not conflict.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 20:32                                           ` Rogério Brito
@ 2020-09-14  9:33                                             ` Jan Kara
  0 siblings, 0 replies; 65+ messages in thread
From: Jan Kara @ 2020-09-14  9:33 UTC (permalink / raw)
  To: Rogério Brito
  Cc: Linus Torvalds, Amir Goldstein, Hugh Dickins, Michael Larabel,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sat 12-09-20 17:32:41, Rogério Brito wrote:
> Now, to the subject: is what you describe (RCU or VFS) in some sense
> related to, say, copying a "big" file (e.g., a movie) to a "slow" medium (in
> my case, a USB thumb drive, so that I can watch said movie on my TV)?
> 
> I've seen backtraces mentioning "task xxx hung for yyy seconds" and a
> non-responsive cp process at that... I say RCU or VFS because I see this
> with the thumb drives with vfat filesystems (so, it wouldn't be quite
> related to ext4, apart from the fact that all my Linux-specific
> filesystems are ext4).

This is very likely completely different problem. I'd need to see exact
messages and kernel traces but usually errors like these happen when the IO
is very slow and other things (such as grabbing some locks or doing memory
allocation) get blocked waiting for that IO.

In the case Linus speaks about this is really more about CPU bound tasks
that heavily hammer the same cached contents.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
       [not found]                                             ` <658ae026-32d9-0a25-5a59-9c510d6898d5@MichaelLarabel.com>
@ 2020-09-14 17:47                                               ` Linus Torvalds
  2020-09-14 20:21                                                 ` Matthieu Baerts
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-14 17:47 UTC (permalink / raw)
  To: Michael Larabel
  Cc: Matthew Wilcox, Amir Goldstein, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2075 bytes --]

Michael et al,
 Ok, I redid my failed "hybrid mode" patch from scratch (original
patch never sent out, I never got it to a working point).

Having learnt from my mistake, this time instead of trying to mix the
old and the new code, I just extended the new code, and wrote
a _lot_ of comments about it.

I also made it configurable, using a "page_lock_unfairness" knob,
which this patch defaults to 1000 (which is basically infinite).
That's just a value that says how many times we'll try the old unfair
case, so "1000" means "we'll re-queue up to a thousand times before we
say enough is enough" and zero is the fair mode that shows the
performance problems.

I've only (lightly) tested those two extremes; I think the interesting
values are likely in the 1-5 range.

So you can do

    echo 0 > /proc/sys/vm/page_lock_unfairness
    .. run test ..

and you should get the same numbers as without this patch (within
noise, of course).

Or do

    echo 5 > /proc/sys/vm/page_lock_unfairness
    .. run test ..

and get numbers for "we accept some unfairness, but if we have to
requeue more than five times, we force the fair mode".

Again, the default is 1000, which is ludicrously high (it's not a
"this many retries per page" count, it's a "for each waiter" count). I
made it that high just because I have *not* run any numbers for that
interesting range, I just checked the extreme cases, and I wanted to
make sure that Michael sees the old performance (modulo other changes
to 5.9, of course).

Comments? The patch really has a fair amount of comments in it; in
fact the code changes are reasonably small, and most of the changes
really are new and updated comments about what is going on.

I was burnt by making a mess of this the first time, so I proceeded
more thoughtfully this time. Hopefully the end result is also better.

(Note that it's a commit and has a SHA1, but it's from my "throw-away
tree for testing", so it doesn't have my sign-off or any real commit
message yet: I'll do that once it gets actual testing and comments).

                 Linus

[-- Attachment #2: 0001-Page-lock-unfairness-sysctl.patch --]
[-- Type: text/x-patch, Size: 10020 bytes --]

From 880db10a9fea1dad0c8cf29ae04b4446d2e7170b Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 13 Sep 2020 14:05:35 -0700
Subject: [PATCH] Page lock unfairness sysctl

---
 include/linux/mm.h   |   2 +
 include/linux/wait.h |   1 +
 kernel/sysctl.c      |   8 +++
 mm/filemap.c         | 160 ++++++++++++++++++++++++++++++++++---------
 4 files changed, 140 insertions(+), 31 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca6e6a81576b..b2f370f0b420 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -41,6 +41,8 @@ struct writeback_control;
 struct bdi_writeback;
 struct pt_regs;
 
+extern int sysctl_page_lock_unfairness;
+
 void init_mm_internals(void);
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 898c890fc153..27fb99cfeb02 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -21,6 +21,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 #define WQ_FLAG_WOKEN		0x02
 #define WQ_FLAG_BOOKMARK	0x04
 #define WQ_FLAG_CUSTOM		0x08
+#define WQ_FLAG_DONE		0x10
 
 /*
  * A single wait-queue entry structure:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 09e70ee2332e..afad085960b8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2912,6 +2912,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= percpu_pagelist_fraction_sysctl_handler,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "page_lock_unfairness",
+		.data		= &sysctl_page_lock_unfairness,
+		.maxlen		= sizeof(sysctl_page_lock_unfairness),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 #ifdef CONFIG_MMU
 	{
 		.procname	= "max_map_count",
diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc..d0a76069bcb8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -988,9 +988,43 @@ void __init pagecache_init(void)
 	page_writeback_init();
 }
 
+/*
+ * The page wait code treats the "wait->flags" somewhat unusually, because
+ * we have multiple different kinds of waits, not just the usual "exclusive"
+ * one.
+ *
+ * We have:
+ *
+ *  (a) no special bits set:
+ *
+ *	We're just waiting for the bit to be released, and when a waker
+ *	calls the wakeup function, we set WQ_FLAG_WOKEN and wake it up,
+ *	and remove it from the wait queue.
+ *
+ *	Simple and straightforward.
+ *
+ *  (b) WQ_FLAG_EXCLUSIVE:
+ *
+ *	The waiter is waiting to get the lock, and only one waiter should
+ *	be woken up to avoid any thundering herd behavior. We'll set the
+ *	WQ_FLAG_WOKEN bit, wake it up, and remove it from the wait queue.
+ *
+ *	This is the traditional exclusive wait.
+ *
+ *  (c) WQ_FLAG_EXCLUSIVE | WQ_FLAG_CUSTOM:
+ *
+ *	The waiter is waiting to get the bit, and additionally wants the
+ *	lock to be transferred to it for fair lock behavior. If the lock
+ *	cannot be taken, we stop walking the wait queue without waking
+ *	the waiter.
+ *
+ *	This is the "fair lock handoff" case, and in addition to setting
+ *	WQ_FLAG_WOKEN, we set WQ_FLAG_DONE to let the waiter easily see
+ *	that it now has the lock.
+ */
 static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
-	int ret;
+	unsigned int flags;
 	struct wait_page_key *key = arg;
 	struct wait_page_queue *wait_page
 		= container_of(wait, struct wait_page_queue, wait);
@@ -999,35 +1033,44 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 		return 0;
 
 	/*
-	 * If it's an exclusive wait, we get the bit for it, and
-	 * stop walking if we can't.
-	 *
-	 * If it's a non-exclusive wait, then the fact that this
-	 * wake function was called means that the bit already
-	 * was cleared, and we don't care if somebody then
-	 * re-took it.
+	 * If it's a lock handoff wait, we get the bit for it, and
+	 * stop walking (and do not wake it up) if we can't.
 	 */
-	ret = 0;
-	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
-		if (test_and_set_bit(key->bit_nr, &key->page->flags))
+	flags = wait->flags;
+	if (flags & WQ_FLAG_EXCLUSIVE) {
+		if (test_bit(key->bit_nr, &key->page->flags))
 			return -1;
-		ret = 1;
+		if (flags & WQ_FLAG_CUSTOM) {
+			if (test_and_set_bit(key->bit_nr, &key->page->flags))
+				return -1;
+			flags |= WQ_FLAG_DONE;
+		}
 	}
-	wait->flags |= WQ_FLAG_WOKEN;
 
+	/*
+	 * We are holding the wait-queue lock, but the waiter that
+	 * is waiting for this will be checking the flags without
+	 * any locking.
+	 *
+	 * So update the flags atomically, and wake up the waiter
+	 * afterwards to avoid any races. This store-release pairs
+	 * with the load-acquire in wait_on_page_bit_common().
+	 */
+	smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
 	wake_up_state(wait->private, mode);
 
 	/*
 	 * Ok, we have successfully done what we're waiting for,
 	 * and we can unconditionally remove the wait entry.
 	 *
-	 * Note that this has to be the absolute last thing we do,
-	 * since after list_del_init(&wait->entry) the wait entry
+	 * Note that this pairs with the "finish_wait()" in the
+	 * waiter, and has to be the absolute last thing we do.
+	 * After this list_del_init(&wait->entry) the wait entry
 	 * might be de-allocated and the process might even have
 	 * exited.
 	 */
 	list_del_init_careful(&wait->entry);
-	return ret;
+	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
 }
 
 static void wake_up_page_bit(struct page *page, int bit_nr)
@@ -1107,8 +1150,8 @@ enum behavior {
 };
 
 /*
- * Attempt to check (or get) the page bit, and mark the
- * waiter woken if successful.
+ * Attempt to check (or get) the page bit, and mark us done
+ * if successful.
  */
 static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
 					struct wait_queue_entry *wait)
@@ -1119,13 +1162,17 @@ static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
 	} else if (test_bit(bit_nr, &page->flags))
 		return false;
 
-	wait->flags |= WQ_FLAG_WOKEN;
+	wait->flags |= WQ_FLAG_WOKEN | WQ_FLAG_DONE;
 	return true;
 }
 
+/* How many times do we accept lock stealing from under a waiter? */
+int sysctl_page_lock_unfairness = 1000;
+
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	struct page *page, int bit_nr, int state, enum behavior behavior)
 {
+	int unfairness = sysctl_page_lock_unfairness;
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
 	bool thrashing = false;
@@ -1143,11 +1190,18 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	}
 
 	init_wait(wait);
-	wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0;
 	wait->func = wake_page_function;
 	wait_page.page = page;
 	wait_page.bit_nr = bit_nr;
 
+repeat:
+	wait->flags = 0;
+	if (behavior == EXCLUSIVE) {
+		wait->flags = WQ_FLAG_EXCLUSIVE;
+		if (--unfairness < 0)
+			wait->flags |= WQ_FLAG_CUSTOM;
+	}
+
 	/*
 	 * Do one last check whether we can get the
 	 * page bit synchronously.
@@ -1170,27 +1224,63 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 	/*
 	 * From now on, all the logic will be based on
-	 * the WQ_FLAG_WOKEN flag, and the and the page
-	 * bit testing (and setting) will be - or has
-	 * already been - done by the wake function.
+	 * the WQ_FLAG_WOKEN and WQ_FLAG_DONE flag, to
+	 * see whether the page bit testing has already
+	 * been done by the wake function.
 	 *
 	 * We can drop our reference to the page.
 	 */
 	if (behavior == DROP)
 		put_page(page);
 
+	/*
+	 * Note that until the "finish_wait()", or until
+	 * we see the WQ_FLAG_WOKEN flag, we need to
+	 * be very careful with the 'wait->flags', because
+	 * we may race with a waker that sets them.
+	 */
 	for (;;) {
+		unsigned int flags;
+
 		set_current_state(state);
 
-		if (signal_pending_state(state, current))
+		/* Loop until we've been woken or interrupted */
+		flags = smp_load_acquire(&wait->flags);
+		if (!(flags & WQ_FLAG_WOKEN)) {
+			if (signal_pending_state(state, current))
+				break;
+
+			io_schedule();
+			continue;
+		}
+
+		/* If we were non-exclusive, we're done */
+		if (behavior != EXCLUSIVE)
 			break;
 
-		if (wait->flags & WQ_FLAG_WOKEN)
+		/* If the waker got the lock for us, we're done */
+		if (flags & WQ_FLAG_DONE)
 			break;
 
-		io_schedule();
+		/*
+		 * Otherwise, if we're getting the lock, we need to
+		 * try to get it ourselves.
+		 *
+		 * And if that fails, we'll have to retry this all.
+		 */
+		if (unlikely(test_and_set_bit(bit_nr, &page->flags)))
+			goto repeat;
+
+		wait->flags |= WQ_FLAG_DONE;
+		break;
 	}
 
+	/*
+	 * If a signal happened, this 'finish_wait()' may remove the last
+	 * waiter from the wait-queues, but the PageWaiters bit will remain
+	 * set. That's ok. The next wakeup will take care of it, and trying
+	 * to do it here would be difficult and prone to races.
+	 */
 	finish_wait(q, wait);
 
 	if (thrashing) {
@@ -1200,12 +1290,20 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	}
 
 	/*
-	 * A signal could leave PageWaiters set. Clearing it here if
-	 * !waitqueue_active would be possible (by open-coding finish_wait),
-	 * but still fail to catch it in the case of wait hash collision. We
-	 * already can fail to clear wait hash collision cases, so don't
-	 * bother with signals either.
+	 * NOTE! The wait->flags weren't stable until we've done the
+	 * 'finish_wait()', and we could have exited the loop above due
+	 * to a signal, and had a wakeup event happen after the signal
+	 * test but before the 'finish_wait()'.
+	 *
+	 * So only after the finish_wait() can we reliably determine
+	 * if we got woken up or not, so we can now figure out the final
+	 * return value based on that state without races.
+	 *
+	 * Also note that WQ_FLAG_WOKEN is sufficient for a non-exclusive
+	 * waiter, but an exclusive one requires WQ_FLAG_DONE.
 	 */
+	if (behavior == EXCLUSIVE)
+		return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR;
 
 	return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
 }
-- 
2.28.0.218.gc12ef3d349


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-14 17:47                                               ` Linus Torvalds
@ 2020-09-14 20:21                                                 ` Matthieu Baerts
  2020-09-14 20:53                                                   ` Linus Torvalds
  2020-09-15 14:21                                                 ` Michael Larabel
  2020-09-17 17:51                                                 ` Linus Torvalds
  2 siblings, 1 reply; 65+ messages in thread
From: Matthieu Baerts @ 2020-09-14 20:21 UTC (permalink / raw)
  To: Linus Torvalds, Michael Larabel
  Cc: Matthew Wilcox, Amir Goldstein, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

Hello everyone,

On 14/09/2020 19:47, Linus Torvalds wrote:
> Michael et al,
>   Ok, I redid my failed "hybrid mode" patch from scratch (original
> patch never sent out, I never got it to a working point).
> 
> Having learnt from my mistake, this time instead of trying to mix the
> old and the new code, instead I just extended the new code, and wrote
> a _lot_ of comments about it.
> 
> I also made it configurable, using a "page_lock_unfairness" knob,
> which this patch defaults to 1000 (which is basically infinite).
> That's just a value that says how many times we'll try the old unfair
> case, so "1000" means "we'll re-queue up to a thousand times before we
> say enough is enough" and zero is the fair mode that shows the
> performance problems.

Thank you for the new patch and all the work around from everybody!

Sorry to jump in this thread but I wanted to share my issue, also linked 
to the same commit:

     2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")

I have a simple test environment[1] using Docker and virtme[2], with an 
almost default kernel config, that validates some tests for the MPTCP 
Upstream project[3]. Some of these tests use a modified version of 
packetdrill[4].

Recently, some of these packetdrill tests have been failing after 2 
minutes (timeout) instead of being executed in a few seconds (~6 
seconds). No packets are even exchanged during these two minutes.

I did a git bisect and it also pointed me to 2a9127fcf229.

I can run the same test 10 times without any issue with the parent 
commit (v5.8 tag) but with 2a9127fcf229, I have a timeout most of the time.

Of course, when I try to add some debug info on the userspace or 
kernelspace side, I can no longer reproduce the timeout issue. But 
without debug, it is easy for me to validate if the issue is there or 
not. My issue doesn't seem to be linked to a small file that needs to be 
read multiple times from a filesystem. Only a few bytes should be 
transferred by packetdrill, and when there is a timeout it happens even 
before that, because no packets are exchanged at all. I don't think 
packetdrill does much IO before transferring a few packets to a "tun" 
interface, but I didn't analyse this further.

With your new patch and the default value, I no longer have the issue.

> I've only (lightly) tested those two extremes, I think the interesting
> range is likely in the 1-5 range.
> 
> So you can do
> 
>      echo 0 > /proc/sys/vm/page_lock_unfairness
>      .. run test ..
> 
> and you should get the same numbers as without this patch (within
> noise, of course).

On my side, I still have the issue with 0, which is good because that is 
what's expected!

> Or do
> 
>      echo 5 > /proc/sys/vm/page_lock_unfairness
>      .. run test ..
> 
> and get numbers for "we accept some unfairness, but if we have to
> requeue more than five times, we force the fair mode".

Already with 1, it is fine on my side: no more timeout! Same with 5. I 
am not measuring performance, only whether I can run packetdrill without 
timing out. With 1 and 5, the tests finish in a normal time, which is 
really good. I didn't have any timeout in 10 runs, each of them started 
from a fresh VM. Patch tested with success!

I would be glad to help by validating new modifications or providing new 
info. My setup is also easy to put in place: a Docker image is built 
with all required tools to start the same VM as the one I have. 
All scripts are on a public repository[1].

Please tell me if I can help!

Cheers,
Matt

[1] 
https://github.com/multipath-tcp/mptcp_net-next/blob/scripts/ci/virtme.sh and 
https://github.com/multipath-tcp/mptcp_net-next/blob/scripts/ci/Dockerfile.virtme.sh
[2] https://git.kernel.org/pub/scm/utils/kernel/virtme/virtme.git
[3] https://github.com/multipath-tcp/mptcp_net-next/wiki
[4] https://github.com/multipath-tcp/packetdrill
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-14 20:21                                                 ` Matthieu Baerts
@ 2020-09-14 20:53                                                   ` Linus Torvalds
  2020-09-15  0:42                                                     ` Linus Torvalds
  2020-09-15 15:34                                                     ` Matthieu Baerts
  0 siblings, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-14 20:53 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Mon, Sep 14, 2020 at 1:21 PM Matthieu Baerts
<matthieu.baerts@tessares.net> wrote:
>
> Recently, some of these packetdrill tests have been failing after 2
> minutes (timeout) instead of being executed in a few seconds (~6
> seconds). No packets are even exchanged during these two minutes.

Hmm.

That sounds like a deadlock to me, and sounds like it's a latent bug
waiting to happen.

One way I can see that happening (with the fair page locking) is to do
something like this

thread A does:
  lock_page()
    do something

thread B:
  lock_page - ends up blocking on the lock

thread A continue:
   unlock_page() - for the fair case this now transfers the page lock
to thread B
   .. do more work
   lock_page() - this now blocks because B already owns the lock

thread B continues:
  do something that requires A to have continued, but A is blocked on
B, and we have a classic ABBA deadlock

and the only difference here is that with the unfair locks, thread A
would get the page lock and finish whatever it did, and you'd never
see the deadlock.

And by "never" I mean "very very seldom". That's why it sounds like a
latent bug to me - the fact that it triggers with the fair locks
really makes me suspect that it *could* have triggered with the unfair
locks, it just never really did, because we didn't have that
synchronous lock transfer to the waiter.
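
To make that shape concrete, here is a minimal kernel-flavoured sketch
(illustrative only - the completion below just stands in for whatever
dependency makes B wait on A's further progress, and thread_a()/thread_b()
are not taken from any real code path):

#include <linux/pagemap.h>
#include <linux/completion.h>

static DECLARE_COMPLETION(a_made_progress);

static void thread_a(struct page *page)
{
	lock_page(page);
	/* .. do something .. */
	unlock_page(page);	/* fair lock: ownership handed straight to B */

	lock_page(page);	/* blocks: B owns the lock now */
	complete(&a_made_progress);	/* never reached */
	unlock_page(page);
}

static void thread_b(struct page *page)
{
	lock_page(page);	/* was queued behind A, woken owning the lock */
	wait_for_completion(&a_made_progress);	/* but A is blocked on us: ABBA */
	unlock_page(page);
}

With the unfair behavior, A's second lock_page() would usually just barge
in and win before B ever ran, so the inverted dependency almost never
became visible.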

One of the problems with the page lock is that it's a very very
special lock, and afaik has never worked with lockdep. So we have
absolutely _zero_ coverage of even the simplest ABBA deadlocks with
the page lock.

> I would be glad to help by validating new modifications or providing new
> info. My setup is also easy to put in place: a Docker image is built
> with all required tools to start the same VM just like the one I have.

I'm not familiar enough with packetdrill or any of that infrastructure
- does it do its own kernel modules etc for the packet latency
testing?

But it sounds like it's 100% repeatable with the fair page lock, which
is actually a good thing. It means that if you do a "sysrq-w" while
it's blocking, you should see exactly what is waiting for what.

(Except since it times out nicely eventually, probably at least part
of the waiting is interruptible, and then you need to do "sysrq-t"
instead and it's going to be _very_ verbose and much harder to
pinpoint things, and you'll probably need to have a very big printk
buffer).

There are obviously other ways to do it too - kgdb or whatever - which
you may or may not be more used to.

But sysrq is very traditional and often particularly easy if it's a
very repeatable "things are hung". Not nearly as good as lockdep, of
course. But if the machine is otherwise working, you can just do

    echo 'w' > /proc/sysrq-trigger

in another terminal (and again, maybe you need 't', but then you
really want to do it *without* having a full GUI setup or anything
like that, to at least make it somewhat less verbose).

Aside: a quick google shows that Nick Piggin did try to extend lockdep
to the page lock many many years ago. I don't think it ever went
anywhere. To quote Avril Lavigne: "It's complicated".

                 Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-14 20:53                                                   ` Linus Torvalds
@ 2020-09-15  0:42                                                     ` Linus Torvalds
  2020-09-15 15:34                                                     ` Matthieu Baerts
  1 sibling, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-15  0:42 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Mon, Sep 14, 2020 at 1:53 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> One way I can see that happening (with the fair page locking) is to do
> something like this

Note that the "lock_page()" cases in that deadlock example sequence of
mine is not necessarily at all an explicit lock_page(). It is more
likely to be something that indirectly causes it - like a user access
that faults in the page and causes lock_page() as part of that.
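
As a hedged illustration (the file name and size below are made up), a
program that never calls anything lock-related can still end up in
lock_page() via a fault on a file-backed mapping:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("some-data-file", O_RDONLY);	/* hypothetical file */
	char *p;

	if (fd < 0)
		return 1;
	p = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* First touch faults the page in: filemap_fault() -> lock_page() */
	memcpy(buf, p, sizeof(buf));

	munmap(p, sizeof(buf));
	close(fd);
	return 0;
}

The same thing happens for mlockall()/mlock(), which populate the mapping
up front and fault every page in from the kernel side.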

                    Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-12 14:44                                             ` Michael Larabel
@ 2020-09-15  3:32                                               ` Matthew Wilcox
  2020-09-15 10:39                                                 ` Jan Kara
  0 siblings, 1 reply; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-15  3:32 UTC (permalink / raw)
  To: Michael Larabel
  Cc: Amir Goldstein, Linus Torvalds, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

On Sat, Sep 12, 2020 at 09:44:15AM -0500, Michael Larabel wrote:
> Interesting, I'll fire up some cross-filesystem benchmarks with those tests
> today and report back shortly with the difference.

If you have time, perhaps you'd like to try this patch.  It tries to
handle page faults locklessly when possible, which should be the case
where you're seeing page lock contention.  I've tried to be fairly
conservative in this patch; reducing page lock acquisition should be
possible in more cases.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca6e6a81576b..a14785b7fca7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -416,6 +416,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UPTODATE_ONLY: The fault handler returned @VM_FAULT_UPTODATE.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -446,6 +447,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION  		0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_UPTODATE_ONLY		0x400
 
 /*
  * The default fault flags that should be used by most of the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..632eabcad2f7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -689,6 +689,8 @@ typedef __bitwise unsigned int vm_fault_t;
  * @VM_FAULT_NEEDDSYNC:		->fault did not modify page tables and needs
  *				fsync() to complete (for synchronous page faults
  *				in DAX)
+ * @VM_FAULT_UPTODATE:		Page is not locked; must check it is still
+ *				uptodate under the page table lock
  * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
  *
  */
@@ -706,6 +708,7 @@ enum vm_fault_reason {
 	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
 	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
 	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
+	VM_FAULT_UPTODATE       = (__force vm_fault_t)0x004000,
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc..38f87dd86312 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2602,6 +2602,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		}
 	}
 
+	if (fpin)
+		goto out_retry;
+	if (likely(PageUptodate(page))) {
+		ret |= VM_FAULT_UPTODATE;
+		goto uptodate;
+	}
+
 	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
 		goto out_retry;
 
@@ -2630,19 +2637,19 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		goto out_retry;
 	}
 
-	/*
-	 * Found the page and have a reference on it.
-	 * We must recheck i_size under page lock.
-	 */
+	ret |= VM_FAULT_LOCKED;
+	/* Must recheck i_size after getting a stable reference to the page */
+uptodate:
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
 	if (unlikely(offset >= max_off)) {
-		unlock_page(page);
+		if (ret & VM_FAULT_LOCKED)
+			unlock_page(page);
 		put_page(page);
 		return VM_FAULT_SIGBUS;
 	}
 
 	vmf->page = page;
-	return ret | VM_FAULT_LOCKED;
+	return ret;
 
 page_not_uptodate:
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76..53c8ef2bb38b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3460,11 +3460,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return VM_FAULT_HWPOISON;
 	}
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	return ret;
 }
 
@@ -3646,12 +3641,13 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 		return VM_FAULT_NOPAGE;
 	}
 
-	flush_icache_page(vma, page);
-	entry = mk_pte(page, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
-	if (write)
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	/* copy-on-write page */
+	/*
+	 * If the page isn't locked, truncate or invalidate2 may be
+	 * trying to remove it at the same time.  Both paths will check
+	 * the page's mapcount after clearing the PageUptodate bit,
+	 * so if we increment the mapcount here before checking the
+	 * Uptodate bit, the page will be unmapped by the other thread.
+	 */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
@@ -3660,6 +3656,25 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
 	}
+	smp_mb__after_atomic();
+
+	if ((vmf->flags & FAULT_FLAG_UPTODATE_ONLY) && !PageUptodate(page)) {
+		page_remove_rmap(page, false);
+		if (write && !(vma->vm_flags & VM_SHARED)) {
+			dec_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+			/* lru_cache_remove_inactive_or_unevictable? */
+		} else {
+			dec_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+		}
+		return VM_FAULT_NOPAGE;
+	}
+
+	flush_icache_page(vma, page);
+	entry = mk_pte(page, vma->vm_page_prot);
+	entry = pte_sw_mkyoung(entry);
+	if (write)
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	/* copy-on-write page */
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
 	/* no need to invalidate: a not-present page won't be cached */
@@ -3844,8 +3859,18 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (ret & VM_FAULT_UPTODATE)
+		vmf->flags |= FAULT_FLAG_UPTODATE_ONLY;
+	else if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
+	if (ret & VM_FAULT_UPTODATE)
+		vmf->flags &= ~FAULT_FLAG_UPTODATE_ONLY;
+	else
+		unlock_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		put_page(vmf->page);
 	return ret;
@@ -3875,6 +3900,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
+	if (!(ret & VM_FAULT_LOCKED))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
@@ -3898,6 +3928,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (!(ret & VM_FAULT_LOCKED))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..96a0408804a7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -176,6 +176,8 @@ void do_invalidatepage(struct page *page, unsigned int offset,
 static void
 truncate_cleanup_page(struct address_space *mapping, struct page *page)
 {
+	ClearPageUptodate(page);
+	smp_mb__before_atomic();
 	if (page_mapped(page)) {
 		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
 		unmap_mapping_pages(mapping, page->index, nr, false);
@@ -655,6 +657,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		mapping->a_ops->freepage(page);
 
 	put_page(page);	/* pagecache ref */
+
+	/* An unlocked page fault may have inserted an entry */
+	ClearPageUptodate(page);
+	smp_mb__before_atomic();
+	if (page_mapped(page))
+		unmap_mapping_pages(mapping, page->index, 1, false);
 	return 1;
 failed:
 	xa_unlock_irqrestore(&mapping->i_pages, flags);

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-13 23:45                                               ` Dave Chinner
  2020-09-14  3:31                                                 ` Matthew Wilcox
@ 2020-09-15  9:27                                                 ` Jan Kara
  1 sibling, 0 replies; 65+ messages in thread
From: Jan Kara @ 2020-09-15  9:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Amir Goldstein, Hugh Dickins, Michael Larabel,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Mon 14-09-20 09:45:03, Dave Chinner wrote:
> I have my doubts that complex page cache manipulation operations
> like ->migrate_page that rely exclusively on page and internal mm
> serialisation are really safe against ->fallocate based invalidation
> races.  I think they probably also need to be wrapped in the
> MMAPLOCK, but I don't understand all the locking and constraints
> that ->migrate_page has and there's been no evidence yet that it's a
> problem so I've kinda left that alone. I suspect that "no evidence"
> thing comes from "filesystem people are largely unable to induce
> page migrations in regression testing" so it has pretty much zero
> test coverage....

Last time I've looked, ->migrate_page seemed safe to me. Page migration
happens under page lock so truncate_inode_pages_range() will block until
page migration is done (and this covers currently pretty much anything
fallocate related). And once truncate_inode_pages_range() is done,
there are no pages to migrate :) (plus migration code checks page->mapping
!= NULL after locking the page).

But I agree testing would be nice. When I was chasing a data corruption in
block device page cache caused by page migration, I was using thpscale [1]
or thpfioscale [2] benchmarks from mmtests which create anon hugepage
mapping and bang it from several threads thus making kernel try to compact
pages (and thus migrate other pages that block compaction) really hard. And
with it in parallel I was running the filesystem stress that seemed to
cause issues for the customer... I guess something like fsx & fsstress runs
with this THP stress test in parallel might be decent fstests to have.

								Honza

[1] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/thpscale/thpscale.c
[2] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/thpfioscale/thpfioscale.c

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15  3:32                                               ` Matthew Wilcox
@ 2020-09-15 10:39                                                 ` Jan Kara
  2020-09-15 13:52                                                   ` Matthew Wilcox
  0 siblings, 1 reply; 65+ messages in thread
From: Jan Kara @ 2020-09-15 10:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael Larabel, Amir Goldstein, Linus Torvalds, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel,
	linux-mm

Hi Matthew!

On Tue 15-09-20 04:32:10, Matthew Wilcox wrote:
> On Sat, Sep 12, 2020 at 09:44:15AM -0500, Michael Larabel wrote:
> > Interesting, I'll fire up some cross-filesystem benchmarks with those tests
> > today and report back shortly with the difference.
> 
> If you have time, perhaps you'd like to try this patch.  It tries to
> handle page faults locklessly when possible, which should be the case
> where you're seeing page lock contention.  I've tried to be fairly
> conservative in this patch; reducing page lock acquisition should be
> possible in more cases.

So I'd be somewhat uneasy with this optimization. The thing is that e.g.
page migration relies on page lock protecting page from being mapped? How
does your patch handle that? I'm also not sure if the rmap code is really
ready for new page reverse mapping being added without holding page lock...

								Honza
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ca6e6a81576b..a14785b7fca7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -416,6 +416,7 @@ extern pgprot_t protection_map[16];
>   * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
>   * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
>   * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
> + * @FAULT_FLAG_UPTODATE_ONLY: The fault handler returned @VM_FAULT_UPTODATE.
>   *
>   * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>   * whether we would allow page faults to retry by specifying these two
> @@ -446,6 +447,7 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_REMOTE			0x80
>  #define FAULT_FLAG_INSTRUCTION  		0x100
>  #define FAULT_FLAG_INTERRUPTIBLE		0x200
> +#define FAULT_FLAG_UPTODATE_ONLY		0x400
>  
>  /*
>   * The default fault flags that should be used by most of the
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 496c3ff97cce..632eabcad2f7 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -689,6 +689,8 @@ typedef __bitwise unsigned int vm_fault_t;
>   * @VM_FAULT_NEEDDSYNC:		->fault did not modify page tables and needs
>   *				fsync() to complete (for synchronous page faults
>   *				in DAX)
> + * @VM_FAULT_UPTODATE:		Page is not locked; must check it is still
> + *				uptodate under the page table lock
>   * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
>   *
>   */
> @@ -706,6 +708,7 @@ enum vm_fault_reason {
>  	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
>  	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
>  	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
> +	VM_FAULT_UPTODATE       = (__force vm_fault_t)0x004000,
>  	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1aaea26556cc..38f87dd86312 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2602,6 +2602,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  		}
>  	}
>  
> +	if (fpin)
> +		goto out_retry;
> +	if (likely(PageUptodate(page))) {
> +		ret |= VM_FAULT_UPTODATE;
> +		goto uptodate;
> +	}
> +
>  	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
>  		goto out_retry;
>  
> @@ -2630,19 +2637,19 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  		goto out_retry;
>  	}
>  
> -	/*
> -	 * Found the page and have a reference on it.
> -	 * We must recheck i_size under page lock.
> -	 */
> +	ret |= VM_FAULT_LOCKED;
> +	/* Must recheck i_size after getting a stable reference to the page */
> +uptodate:
>  	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
>  	if (unlikely(offset >= max_off)) {
> -		unlock_page(page);
> +		if (ret & VM_FAULT_LOCKED)
> +			unlock_page(page);
>  		put_page(page);
>  		return VM_FAULT_SIGBUS;
>  	}
>  
>  	vmf->page = page;
> -	return ret | VM_FAULT_LOCKED;
> +	return ret;
>  
>  page_not_uptodate:
>  	/*
> diff --git a/mm/memory.c b/mm/memory.c
> index 469af373ae76..53c8ef2bb38b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3460,11 +3460,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  		return VM_FAULT_HWPOISON;
>  	}
>  
> -	if (unlikely(!(ret & VM_FAULT_LOCKED)))
> -		lock_page(vmf->page);
> -	else
> -		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
> -
>  	return ret;
>  }
>  
> @@ -3646,12 +3641,13 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
>  		return VM_FAULT_NOPAGE;
>  	}
>  
> -	flush_icache_page(vma, page);
> -	entry = mk_pte(page, vma->vm_page_prot);
> -	entry = pte_sw_mkyoung(entry);
> -	if (write)
> -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -	/* copy-on-write page */
> +	/*
> +	 * If the page isn't locked, truncate or invalidate2 may be
> +	 * trying to remove it at the same time.  Both paths will check
> +	 * the page's mapcount after clearing the PageUptodate bit,
> +	 * so if we increment the mapcount here before checking the
> +	 * Uptodate bit, the page will be unmapped by the other thread.
> +	 */
>  	if (write && !(vma->vm_flags & VM_SHARED)) {
>  		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  		page_add_new_anon_rmap(page, vma, vmf->address, false);
> @@ -3660,6 +3656,25 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
>  		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
>  		page_add_file_rmap(page, false);
>  	}
> +	smp_mb__after_atomic();
> +
> +	if ((vmf->flags & FAULT_FLAG_UPTODATE_ONLY) && !PageUptodate(page)) {
> +		page_remove_rmap(page, false);
> +		if (write && !(vma->vm_flags & VM_SHARED)) {
> +			dec_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> +			/* lru_cache_remove_inactive_or_unevictable? */
> +		} else {
> +			dec_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
> +		}
> +		return VM_FAULT_NOPAGE;
> +	}
> +
> +	flush_icache_page(vma, page);
> +	entry = mk_pte(page, vma->vm_page_prot);
> +	entry = pte_sw_mkyoung(entry);
> +	if (write)
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	/* copy-on-write page */
>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>  
>  	/* no need to invalidate: a not-present page won't be cached */
> @@ -3844,8 +3859,18 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> +	if (ret & VM_FAULT_UPTODATE)
> +		vmf->flags |= FAULT_FLAG_UPTODATE_ONLY;
> +	else if (unlikely(!(ret & VM_FAULT_LOCKED)))
> +		lock_page(vmf->page);
> +	else
> +		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
> +
>  	ret |= finish_fault(vmf);
> -	unlock_page(vmf->page);
> +	if (ret & VM_FAULT_UPTODATE)
> +		vmf->flags &= ~FAULT_FLAG_UPTODATE_ONLY;
> +	else
> +		unlock_page(vmf->page);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		put_page(vmf->page);
>  	return ret;
> @@ -3875,6 +3900,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
>  	if (ret & VM_FAULT_DONE_COW)
>  		return ret;
>  
> +	if (!(ret & VM_FAULT_LOCKED))
> +		lock_page(vmf->page);
> +	else
> +		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
> +
>  	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
>  	__SetPageUptodate(vmf->cow_page);
>  
> @@ -3898,6 +3928,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> +	if (!(ret & VM_FAULT_LOCKED))
> +		lock_page(vmf->page);
> +	else
> +		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
> +
>  	/*
>  	 * Check if the backing address space wants to know that the page is
>  	 * about to become writable
> diff --git a/mm/truncate.c b/mm/truncate.c
> index dd9ebc1da356..96a0408804a7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -176,6 +176,8 @@ void do_invalidatepage(struct page *page, unsigned int offset,
>  static void
>  truncate_cleanup_page(struct address_space *mapping, struct page *page)
>  {
> +	ClearPageUptodate(page);
> +	smp_mb__before_atomic();
>  	if (page_mapped(page)) {
>  		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
>  		unmap_mapping_pages(mapping, page->index, nr, false);
> @@ -655,6 +657,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
>  		mapping->a_ops->freepage(page);
>  
>  	put_page(page);	/* pagecache ref */
> +
> +	/* An unlocked page fault may have inserted an entry */
> +	ClearPageUptodate(page);
> +	smp_mb__before_atomic();
> +	if (page_mapped(page))
> +		unmap_mapping_pages(mapping, page->index, 1, false);
>  	return 1;
>  failed:
>  	xa_unlock_irqrestore(&mapping->i_pages, flags);
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 10:39                                                 ` Jan Kara
@ 2020-09-15 13:52                                                   ` Matthew Wilcox
  0 siblings, 0 replies; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-15 13:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michael Larabel, Amir Goldstein, Linus Torvalds, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, linux-fsdevel, linux-mm

On Tue, Sep 15, 2020 at 12:39:38PM +0200, Jan Kara wrote:
> Hi Matthew!
> 
> On Tue 15-09-20 04:32:10, Matthew Wilcox wrote:
> > On Sat, Sep 12, 2020 at 09:44:15AM -0500, Michael Larabel wrote:
> > > Interesting, I'll fire up some cross-filesystem benchmarks with those tests
> > > today and report back shortly with the difference.
> > 
> > If you have time, perhaps you'd like to try this patch.  It tries to
> > handle page faults locklessly when possible, which should be the case
> > where you're seeing page lock contention.  I've tried to be fairly
> > conservative in this patch; reducing page lock acquisition should be
> > possible in more cases.
> 
> So I'd be somewhat uneasy with this optimization. The thing is that e.g.
> page migration relies on page lock protecting page from being mapped? How
> does your patch handle that? I'm also not sure if the rmap code is really
> ready for new page reverse mapping being added without holding page lock...

I admit to not even having looked at the page migration code.  This
patch was really to demonstrate that it's _possible_ to do page faults
without taking the page lock.

It's possible to expand the ClearPageUptodate page validity protocol
beyond mm/truncate.c, of course.  We can find all necessary places to
change by grepping for 'page_mapped'.  Some places (eg the invalidate2
path) can't safely ClearPageUptodate before their existing call to
unmap_mapping_pages(), and those places will have to add a second
test-and-call.

It seems to me the page_add_file_rmap() is fine with being called
without the page lock, unless the page is compound.  So we could
make sure not to use this new protocol for THPs ...

+++ b/mm/filemap.c
@@ -2604,7 +2604,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 
        if (fpin)
                goto out_retry;
-       if (likely(PageUptodate(page))) {
+       if (likely(PageUptodate(page) && !PageTransHuge(page))) {
                ret |= VM_FAULT_UPTODATE;
                goto uptodate;
        }
diff --git a/mm/memory.c b/mm/memory.c
index 53c8ef2bb38b..6981e8738df4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3460,6 +3460,9 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
                return VM_FAULT_HWPOISON;
        }
 
+       /* rmap needs THP pages to be locked in case it's mlocked */
+       VM_BUG_ON((ret & VM_FAULT_UPTODATE) && PageTransHuge(page));
+
        return ret;
 }



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-14 17:47                                               ` Linus Torvalds
  2020-09-14 20:21                                                 ` Matthieu Baerts
@ 2020-09-15 14:21                                                 ` Michael Larabel
  2020-09-15 17:52                                                   ` Linus Torvalds
  2020-09-17 17:51                                                 ` Linus Torvalds
  2 siblings, 1 reply; 65+ messages in thread
From: Michael Larabel @ 2020-09-15 14:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ext4 Developers List, linux-fsdevel

On 9/14/20 12:47 PM, Linus Torvalds wrote:
> Michael et al,
>   Ok, I redid my failed "hybrid mode" patch from scratch (original
> patch never sent out, I never got it to a working point).
>
> Having learnt from my mistake, this time instead of trying to mix the
> old and the new code, instead I just extended the new code, and wrote
> a _lot_ of comments about it.
>
> I also made it configurable, using a "page_lock_unfairness" knob,
> which this patch defaults to 1000 (which is basically infinite).
> That's just a value that says how many times we'll try the old unfair
> case, so "1000" means "we'll re-queue up to a thousand times before we
> say enough is enough" and zero is the fair mode that shows the
> performance problems.
>
> I've only (lightly) tested those two extremes, I think the interesting
> range is likely in the 1-5 range.
>
> So you can do
>
>      echo 0 > /proc/sys/vm/page_lock_unfairness
>      .. run test ..
>
> and you should get the same numbers as without this patch (within
> noise, of course).
>
> Or do
>
>      echo 5 > /proc/sys/vm/page_lock_unfairness
>      .. run test ..
>
> and get numbers for "we accept some unfairness, but if we have to
> requeue more than five times, we force the fair mode".
>
> Again, the default is 1000, which is ludicrously high (it's not a
> "this many retries per page" count, it's a "for each waiter" count). I
> made it that high just because I have *not* run any numbers for that
> interesting range, I just checked the extreme cases, and I wanted to
> make sure that Michael sees the old performance (modulo other changes
> to 5.9, of course).
>
> Comments? The patch really has a fair amount of comments in it, in
> fact the code changes are reasonably small, most of the changes really
> are about new and updated comments about what is going on.
>
> I was burnt by making a mess of this the first time, so I proceeded
> more thoughtfully this time. Hopefullyt the end result is also better.
>
> (Note that it's a commit and has a SHA1, but it's from my "throw-away
> tree for testing", so it doesn't have my sign-off or any real commit
> message yet: I'll do that once it gets actual testing and comments).
>
>                   Linus


Still running more benchmarks and on more systems, but so far, at least 
as far as the Apache test is concerned, this patch does seem to largely 
address the issue. With the default page_lock_unfairness of 1000, the 
performance is much closer to 5.8, and in some cases tweaking the value 
helped improve it further. A PLU value of 4~5 seems to yield the best 
performance.
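
(For what it's worth, a minimal sketch of how a harness could set the 
knob between runs, assuming the patch's /proc/sys/vm/page_lock_unfairness 
interface is present - it is just the C equivalent of the echo commands 
quoted above:)

#include <stdio.h>

/*
 * Returns 0 on success, -1 if the knob is absent (unpatched kernel)
 * or we lack permission to write it.
 */
static int set_page_lock_unfairness(int value)
{
	FILE *f = fopen("/proc/sys/vm/page_lock_unfairness", "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", value);
	return fclose(f) ? -1 : 0;
}

A harness sweeping PLU 0-5 would simply call this between runs.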

The results with Hackbench and Redis, which exhibited similar drops from 
the commit in question, remain mixed. An overview of the Apache metrics 
is below; the other tests and system details are at 
https://openbenchmarking.org/result/2009154-FI-LINUX58CO57
For the other systems still being tested, so far it's looking like a 
similar relative impact to these results.

Apache Siege 2.4.29
Concurrent Users: 1
Transactions Per Second > Higher Is Better
v5.8 ............. 7684.81 |=================================================
v5.9 Git ......... 7390.86 |===============================================
Default PLU 1000 . 7579.49 |=================================================
PLU 0 ............ 7937.84 |===================================================
PLU 1 ............ 7464.61 |================================================
PLU 2 ............ 7552.61 |=================================================
PLU 3 ............ 7475.96 |================================================
PLU 4 ............ 7638.69 |=================================================
PLU 5 ............ 7735.75 |==================================================


Apache Siege 2.4.29
Concurrent Users: 50
Transactions Per Second > Higher Is Better
v5.8 ............. 39280.51 |===============================================
v5.9 Git ......... 28240.71 |==================================
Default PLU 1000 . 39708.30 |===============================================
PLU 0 ............ 26645.15 |================================
PLU 1 ............ 38709.95 |==============================================
PLU 2 ............ 39712.82 |===============================================
PLU 3 ............ 41959.67 |==================================================
PLU 4 ............ 38870.90 |==============================================
PLU 5 ............ 41301.97 |=================================================


Apache Siege 2.4.29
Concurrent Users: 100
Transactions Per Second > Higher Is Better
v5.8 ............. 51255.73 |=============================================
v5.9 Git ......... 21926.62 |===================
Default PLU 1000 . 42001.86 |=====================================
PLU 0 ............ 21528.43 |===================
PLU 1 ............ 37138.49 |=================================
PLU 2 ............ 38086.58 |==================================
PLU 3 ............ 38057.72 |==================================
PLU 4 ............ 56350.51 |==================================================
PLU 5 ............ 37868.57 |==================================


Apache Siege 2.4.29
Concurrent Users: 200
Transactions Per Second > Higher Is Better
v5.8 ............. 47825.12 |===================================
v5.9 Git ......... 20174.78 |===============
Default PLU 1000 . 48190.05 |===================================
PLU 0 ............ 20095.10 |===============
PLU 1 ............ 48524.44 |====================================
PLU 2 ............ 47823.09 |===================================
PLU 3 ............ 47751.02 |===================================
PLU 4 ............ 68286.02 |==================================================
PLU 5 ............ 47662.08 |===================================


Apache Siege 2.4.29
Concurrent Users: 250
Transactions Per Second > Higher Is Better
v5.8 ............. 55279.65 |=====================================
v5.9 Git ......... 20282.62 |==============
Default PLU 1000 . 67639.46 |=============================================
PLU 0 ............ 20181.98 |==============
PLU 1 ............ 40505.37 |===========================
PLU 2 ............ 56914.07 |======================================
PLU 3 ............ 55285.35 |=====================================
PLU 4 ............ 55499.25 |=====================================
PLU 5 ............ 74347.77 |==================================================


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-14  3:31                                                 ` Matthew Wilcox
@ 2020-09-15 14:28                                                   ` Chris Mason
  0 siblings, 0 replies; 65+ messages in thread
From: Chris Mason @ 2020-09-15 14:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Linus Torvalds, Amir Goldstein, Hugh Dickins,
	Michael Larabel, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

On 13 Sep 2020, at 23:31, Matthew Wilcox wrote:

> On Mon, Sep 14, 2020 at 09:45:03AM +1000, Dave Chinner wrote:
>> I have my doubts that complex page cache manipulation operations
>> like ->migrate_page that rely exclusively on page and internal mm
>> serialisation are really safe against ->fallocate based invalidation
>> races.  I think they probably also need to be wrapped in the
>> MMAPLOCK, but I don't understand all the locking and constraints
>> that ->migrate_page has and there's been no evidence yet that it's a
>> problem so I've kinda left that alone. I suspect that "no evidence"
>> thing comes from "filesystem people are largely unable to induce
>> page migrations in regression testing" so it has pretty much zero
>> test coverage....
>
> Maybe we can get someone who knows the page migration code to give
> us a hack to induce pretty much constant migration?

While debugging migrate page problems, I usually run dbench and

while(true) ; do echo 1 > /proc/sys/vm/compact_memory ; done

I’ll do this with a mixture of memory pressure or drop_caches or a 
memory hog depending on what I hope to trigger.

Because of hugepage allocations, we tend to bash on migration/compaction 
fairly hard in the fleet.  We do fallocate in some of these workloads as 
well, but I’m sure it doesn’t count as complete coverage for the 
races Dave is worried about.

-chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-14 20:53                                                   ` Linus Torvalds
  2020-09-15  0:42                                                     ` Linus Torvalds
@ 2020-09-15 15:34                                                     ` Matthieu Baerts
  2020-09-15 18:27                                                       ` Linus Torvalds
  2020-09-15 18:31                                                       ` Linus Torvalds
  1 sibling, 2 replies; 65+ messages in thread
From: Matthieu Baerts @ 2020-09-15 15:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 15302 bytes --]

Hi Linus,

Thank you very much for your reply, with very clear explanations and 
instructions!

On 14/09/2020 22:53, Linus Torvalds wrote:
> On Mon, Sep 14, 2020 at 1:21 PM Matthieu Baerts
> <matthieu.baerts@tessares.net> wrote:
>>
>> Recently, some of these packetdrill tests have been failing after 2
>> minutes (timeout) instead of being executed in a few seconds (~6
>> seconds). No packets are even exchanged during these two minutes.
> 
> Hmm.
> 
> That sounds like a deadlock to me, and sounds like it's a latent bug
> waiting to happen.

Yesterday evening, I wanted to get confirmation of that using 
PROVE_LOCKING, but just like today, each time I enable this kconfig I 
cannot reproduce the issue.

Anyway, I am sure you are right and this bug has been there for some 
time but is just too hard to reproduce.
>> I would be glad to help by validating new modifications or providing new
>> info. My setup is also easy to put in place: a Docker image is built
>> with all required tools to start the same VM just like the one I have.
> 
> I'm not familiar enough with packetdrill or any of that infrastructure
> - does it do its own kernel modules etc for the packet latency
> testing?

No, Packetdrill doesn't load any kernel module.

Here is a short description of the execution model of Packetdrill from a 
paper the authors wrote:

     packetdrill parses an entire test script, and then executes each
     timestamped line in real time -- at the pace described by the
     timestamps -- to replay and verify the scenario.
     - For each system call line, packetdrill executes the system call
       and verifies that it returns the expected result.
     - For each command line, packetdrill executes the shell command.
     - For each incoming packet (denoted by a leading < on the line),
       packetdrill constructs a packet and injects it into the kernel.
     - For each outgoing packet (denoted by a leading > on the line),
       packetdrill sniffs the next outgoing packet and verifies that the
       packet's timing and contents match the script.

Source: https://research.google/pubs/pub41316/

> But it sounds like it's 100% repeatable with the fair page lock, which
> is actually a good thing. It means that if you do a "sysrq-w" while
> it's blocking, you should see exactly what is waiting for what.
> 
> (Except since it times out nicely eventually, probably at least part
> of the waiting is interruptible, and then you need to do "sysrq-t"
> instead and it's going to be _very_ verbose and much harder to
> pinpoint things, and you'll probably need to have a very big printk
> buffer).

Thank you for this idea! I was focused on using lockdep and I forgot 
about this simple method. It is not (yet) a reflex for me to use it!

I think I got an interesting trace, taken 20 seconds after having 
started packetdrill:


------------------- 8< -------------------
[   25.507563] sysrq: Show Blocked State
[   25.510695] task:packetdrill     state:D stack:13848 pid:  188 ppid: 
   155 flags:0x00004000
[   25.517841] Call Trace:
[   25.520103]  __schedule+0x3eb/0x680
[   25.523197]  schedule+0x45/0xb0
[   25.526013]  io_schedule+0xd/0x30
[   25.528964]  __lock_page_killable+0x13e/0x280
[   25.532794]  ? file_fdatawait_range+0x20/0x20
[   25.536605]  filemap_fault+0x6b4/0x970
[   25.539911]  ? filemap_map_pages+0x195/0x330
[   25.543682]  __do_fault+0x32/0x90
[   25.546620]  handle_mm_fault+0x8c1/0xe50
[   25.550050]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[   25.554637]  __get_user_pages+0x25c/0x750
[   25.558101]  populate_vma_page_range+0x57/0x60
[   25.561968]  __mm_populate+0xa9/0x150
[   25.565125]  __x64_sys_mlockall+0x151/0x180
[   25.568787]  do_syscall_64+0x33/0x40
[   25.571915]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   25.576230] RIP: 0033:0x7f21bee46b3b
[   25.579357] Code: Bad RIP value.
[   25.582199] RSP: 002b:00007ffcb5f8ad38 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   25.588588] RAX: ffffffffffffffda RBX: 000055c9762f1450 RCX: 
00007f21bee46b3b
[   25.594627] RDX: 00007ffcb5f8ad28 RSI: 0000000000000002 RDI: 
0000000000000003
[   25.600637] RBP: 00007ffcb5f8ad40 R08: 0000000000000001 R09: 
0000000000000000
[   25.606701] R10: 00007f21beec9ac0 R11: 0000000000000246 R12: 
000055c9762b30a0
[   25.612738] R13: 00007ffcb5f8b180 R14: 0000000000000000 R15: 
0000000000000000
[   25.618762] task:packetdrill     state:D stack:13952 pid:  190 ppid: 
   153 flags:0x00004000
[   25.625781] Call Trace:
[   25.627987]  __schedule+0x3eb/0x680
[   25.631046]  schedule+0x45/0xb0
[   25.633796]  io_schedule+0xd/0x30
[   25.636726]  ? wake_up_page_bit+0xd1/0x100
[   25.640271]  ? file_fdatawait_range+0x20/0x20
[   25.644022]  ? filemap_fault+0x6b4/0x970
[   25.647427]  ? filemap_map_pages+0x195/0x330
[   25.651146]  ? __do_fault+0x32/0x90
[   25.654227]  ? handle_mm_fault+0x8c1/0xe50
[   25.657752]  ? __get_user_pages+0x25c/0x750
[   25.661368]  ? populate_vma_page_range+0x57/0x60
[   25.665338]  ? __mm_populate+0xa9/0x150
[   25.668707]  ? __x64_sys_mlockall+0x151/0x180
[   25.672467]  ? do_syscall_64+0x33/0x40
[   25.675751]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   25.680213] task:packetdrill     state:D stack:13952 pid:  193 ppid: 
   160 flags:0x00004000
[   25.687285] Call Trace:
[   25.689472]  __schedule+0x3eb/0x680
[   25.692547]  schedule+0x45/0xb0
[   25.695314]  io_schedule+0xd/0x30
[   25.698216]  __lock_page_killable+0x13e/0x280
[   25.702013]  ? file_fdatawait_range+0x20/0x20
[   25.705752]  filemap_fault+0x6b4/0x970
[   25.709010]  ? filemap_map_pages+0x195/0x330
[   25.712691]  __do_fault+0x32/0x90
[   25.715620]  handle_mm_fault+0x8c1/0xe50
[   25.719013]  __get_user_pages+0x25c/0x750
[   25.722485]  populate_vma_page_range+0x57/0x60
[   25.726326]  __mm_populate+0xa9/0x150
[   25.729528]  __x64_sys_mlockall+0x151/0x180
[   25.733138]  do_syscall_64+0x33/0x40
[   25.736263]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   25.740587] RIP: 0033:0x7feb59c16b3b
[   25.743716] Code: Bad RIP value.
[   25.746653] RSP: 002b:00007ffd75ef7f38 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   25.753019] RAX: ffffffffffffffda RBX: 0000562a49acc450 RCX: 
00007feb59c16b3b
[   25.759077] RDX: 00007ffd75ef7f28 RSI: 0000000000000002 RDI: 
0000000000000003
[   25.765127] RBP: 00007ffd75ef7f40 R08: 0000000000000001 R09: 
0000000000000000
[   25.771231] R10: 00007feb59c99ac0 R11: 0000000000000246 R12: 
0000562a49a8e0a0
[   25.777442] R13: 00007ffd75ef8380 R14: 0000000000000000 R15: 
0000000000000000
[   25.783496] task:packetdrill     state:D stack:13952 pid:  194 ppid: 
   157 flags:0x00004000
[   25.790536] Call Trace:
[   25.792726]  __schedule+0x3eb/0x680
[   25.795777]  schedule+0x45/0xb0
[   25.798582]  io_schedule+0xd/0x30
[   25.801473]  __lock_page_killable+0x13e/0x280
[   25.805246]  ? file_fdatawait_range+0x20/0x20
[   25.809015]  filemap_fault+0x6b4/0x970
[   25.812279]  ? filemap_map_pages+0x195/0x330
[   25.815981]  __do_fault+0x32/0x90
[   25.818909]  handle_mm_fault+0x8c1/0xe50
[   25.822458]  __get_user_pages+0x25c/0x750
[   25.825947]  populate_vma_page_range+0x57/0x60
[   25.829775]  __mm_populate+0xa9/0x150
[   25.832973]  __x64_sys_mlockall+0x151/0x180
[   25.836591]  do_syscall_64+0x33/0x40
[   25.839715]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   25.844089] RIP: 0033:0x7f1bdd340b3b
[   25.847219] Code: Bad RIP value.
[   25.850079] RSP: 002b:00007fff992f49e8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   25.856446] RAX: ffffffffffffffda RBX: 0000557ddd3b8450 RCX: 
00007f1bdd340b3b
[   25.862481] RDX: 00007fff992f49d8 RSI: 0000000000000002 RDI: 
0000000000000003
[   25.868455] RBP: 00007fff992f49f0 R08: 0000000000000001 R09: 
0000000000000000
[   25.874528] R10: 00007f1bdd3c3ac0 R11: 0000000000000246 R12: 
0000557ddd37a0a0
[   25.880541] R13: 00007fff992f4e30 R14: 0000000000000000 R15: 
0000000000000000
[   25.886556] task:packetdrill     state:D stack:13952 pid:  200 ppid: 
   162 flags:0x00004000
[   25.893568] Call Trace:
[   25.895776]  __schedule+0x3eb/0x680
[   25.898833]  schedule+0x45/0xb0
[   25.901578]  io_schedule+0xd/0x30
[   25.904495]  __lock_page_killable+0x13e/0x280
[   25.908246]  ? file_fdatawait_range+0x20/0x20
[   25.912012]  filemap_fault+0x6b4/0x970
[   25.915270]  ? filemap_map_pages+0x195/0x330
[   25.918964]  __do_fault+0x32/0x90
[   25.921853]  handle_mm_fault+0x8c1/0xe50
[   25.925245]  __get_user_pages+0x25c/0x750
[   25.928720]  populate_vma_page_range+0x57/0x60
[   25.932543]  __mm_populate+0xa9/0x150
[   25.935727]  __x64_sys_mlockall+0x151/0x180
[   25.939348]  do_syscall_64+0x33/0x40
[   25.942466]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   25.946906] RIP: 0033:0x7fb34860bb3b
[   25.950026] Code: Bad RIP value.
[   25.952846] RSP: 002b:00007ffea61b7668 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   25.959289] RAX: ffffffffffffffda RBX: 000055c6f01c2450 RCX: 
00007fb34860bb3b
[   25.965453] RDX: 00007ffea61b7658 RSI: 0000000000000002 RDI: 
0000000000000003
[   25.971504] RBP: 00007ffea61b7670 R08: 0000000000000001 R09: 
0000000000000000
[   25.977505] R10: 00007fb34868eac0 R11: 0000000000000246 R12: 
000055c6f01840a0
[   25.983596] R13: 00007ffea61b7ab0 R14: 0000000000000000 R15: 
0000000000000000
[   25.989598] task:packetdrill     state:D stack:13952 pid:  203 ppid: 
   169 flags:0x00004000
[   25.996611] Call Trace:
[   25.998823]  __schedule+0x3eb/0x680
[   26.001863]  schedule+0x45/0xb0
[   26.004645]  io_schedule+0xd/0x30
[   26.007576]  ? wake_up_page_bit+0xd1/0x100
[   26.011133]  ? file_fdatawait_range+0x20/0x20
[   26.014900]  ? filemap_fault+0x6b4/0x970
[   26.018282]  ? filemap_map_pages+0x195/0x330
[   26.021973]  ? __do_fault+0x32/0x90
[   26.025017]  ? handle_mm_fault+0x8c1/0xe50
[   26.028551]  ? __get_user_pages+0x25c/0x750
[   26.032163]  ? populate_vma_page_range+0x57/0x60
[   26.036134]  ? __mm_populate+0xa9/0x150
[   26.039487]  ? __x64_sys_mlockall+0x151/0x180
[   26.043260]  ? do_syscall_64+0x33/0x40
[   26.046528]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   26.050968] task:packetdrill     state:D stack:13904 pid:  207 ppid: 
   173 flags:0x00004000
[   26.058008] Call Trace:
[   26.060192]  __schedule+0x3eb/0x680
[   26.063248]  schedule+0x45/0xb0
[   26.066032]  io_schedule+0xd/0x30
[   26.068924]  __lock_page_killable+0x13e/0x280
[   26.072677]  ? file_fdatawait_range+0x20/0x20
[   26.076429]  filemap_fault+0x6b4/0x970
[   26.079704]  ? filemap_map_pages+0x195/0x330
[   26.083424]  __do_fault+0x32/0x90
[   26.086342]  handle_mm_fault+0x8c1/0xe50
[   26.089722]  __get_user_pages+0x25c/0x750
[   26.093209]  populate_vma_page_range+0x57/0x60
[   26.097040]  __mm_populate+0xa9/0x150
[   26.100218]  __x64_sys_mlockall+0x151/0x180
[   26.103837]  do_syscall_64+0x33/0x40
[   26.106948]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   26.111256] RIP: 0033:0x7f90fb829b3b
[   26.114383] Code: Bad RIP value.
[   26.117183] RSP: 002b:00007ffc3ae07ea8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   26.123589] RAX: ffffffffffffffda RBX: 0000560bf837d450 RCX: 
00007f90fb829b3b
[   26.129614] RDX: 00007ffc3ae07e98 RSI: 0000000000000002 RDI: 
0000000000000003
[   26.135641] RBP: 00007ffc3ae07eb0 R08: 0000000000000001 R09: 
0000000000000000
[   26.141660] R10: 00007f90fb8acac0 R11: 0000000000000246 R12: 
0000560bf833f0a0
[   26.147675] R13: 00007ffc3ae082f0 R14: 0000000000000000 R15: 
0000000000000000
[   26.153693] task:packetdrill     state:D stack:13952 pid:  210 ppid: 
   179 flags:0x00004000
[   26.160728] Call Trace:
[   26.162923]  ? sched_clock_cpu+0x95/0xa0
[   26.166326]  ? ttwu_do_wakeup.isra.0+0x34/0xe0
[   26.170172]  ? __schedule+0x3eb/0x680
[   26.173349]  ? schedule+0x45/0xb0
[   26.176271]  ? io_schedule+0xd/0x30
[   26.179320]  ? __lock_page_killable+0x13e/0x280
[   26.183216]  ? file_fdatawait_range+0x20/0x20
[   26.187031]  ? filemap_fault+0x6b4/0x970
[   26.190451]  ? filemap_map_pages+0x195/0x330
[   26.194128]  ? __do_fault+0x32/0x90
[   26.197161]  ? handle_mm_fault+0x8c1/0xe50
[   26.200692]  ? push_rt_tasks+0xc/0x20
[   26.203866]  ? __get_user_pages+0x25c/0x750
[   26.207474]  ? populate_vma_page_range+0x57/0x60
[   26.211423]  ? __mm_populate+0xa9/0x150
[   26.214763]  ? __x64_sys_mlockall+0x151/0x180
[   26.218506]  ? do_syscall_64+0x33/0x40
[   26.221757]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   26.226216] task:packetdrill     state:D stack:13952 pid:  212 ppid: 
   185 flags:0x00004000
[   26.233223] Call Trace:
[   26.235435]  __schedule+0x3eb/0x680
[   26.238487]  schedule+0x45/0xb0
[   26.241234]  io_schedule+0xd/0x30
[   26.244133]  __lock_page_killable+0x13e/0x280
[   26.247890]  ? file_fdatawait_range+0x20/0x20
[   26.251647]  filemap_fault+0x6b4/0x970
[   26.254906]  ? filemap_map_pages+0x195/0x330
[   26.258590]  __do_fault+0x32/0x90
[   26.261462]  handle_mm_fault+0x8c1/0xe50
[   26.264869]  __get_user_pages+0x25c/0x750
[   26.268327]  populate_vma_page_range+0x57/0x60
[   26.272162]  __mm_populate+0xa9/0x150
[   26.275347]  __x64_sys_mlockall+0x151/0x180
[   26.278970]  do_syscall_64+0x33/0x40
[   26.282082]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   26.286466] RIP: 0033:0x7f00e8863b3b
[   26.289574] Code: Bad RIP value.
[   26.292420] RSP: 002b:00007fff5b28f378 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   26.298797] RAX: ffffffffffffffda RBX: 000055ea3bc97450 RCX: 
00007f00e8863b3b
[   26.304787] RDX: 00007fff5b28f368 RSI: 0000000000000002 RDI: 
0000000000000003
[   26.310867] RBP: 00007fff5b28f380 R08: 0000000000000001 R09: 
0000000000000000
[   26.316697] R10: 00007f00e88e6ac0 R11: 0000000000000246 R12: 
000055ea3bc590a0
[   26.322525] R13: 00007fff5b28f7c0 R14: 0000000000000000 R15: 
0000000000000000
------------------- 8< -------------------


A version processed with "decode_stacktrace.sh" is also attached to this 
email; I was not sure it would be readable inline here.
Please tell me if anything else is needed.

One more thing: only when I have the issue do I also see this kernel 
message, which seems clearly linked:

   [    7.198259] sched: RT throttling activated

> There are obviously other ways to do it too - kgdb or whatever - which
> you may or may not be more used to.

I have never tried to use kgdb with this setup, but it is clearly a good 
occasion to start! I should be able to reproduce the issue easily and 
then generate a crash dump.

> But sysrq is very traditional and often particularly easy if it's a
> very repeatable "things are hung". Not nearly as good as lockdep, of
> course. But if the machine is otherwise working, you can just do
> 
>      echo 'w' > /proc/sysrq-trigger
> 
> in another terminal (and again, maybe you need 't', but then you
> really want to do it *without* having a full GUI setup or anything
> like that, to at least make it somewhat less verbose).

Please tell me if the trace I shared above is helpful. If not, I can 
easily share the long output from sysrq-t -- no GUI or other services 
are running in the background -- and if needed I can prioritise 
generating and analysing a crash dump.
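
For reference, capturing that state boils down to something like this 
(a minimal sketch, assuming sysrq is enabled in the running kernel):

    # allow all sysrq functions
    echo 1 > /proc/sys/kernel/sysrq
    # dump the tasks blocked in uninterruptible (D) sleep
    echo w > /proc/sysrq-trigger
    # the traces land in the kernel log
    dmesg | tail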

> Aside: a quick google shows that Nick Piggin did try to extend lockdep
> to the page lock many many years ago. I don't think it ever went
> anywhere. To quote Avril Lavigne: "It's complicated".

:-)

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

[-- Attachment #2: sysrq_w_1_analysed.txt --]
[-- Type: text/plain, Size: 13513 bytes --]

[   25.507563] sysrq: Show Blocked State
[   25.510695] task:packetdrill     state:D stack:13848 pid:  188 ppid:   155 flags:0x00004000
[   25.517841] Call Trace:
[   25.520103] __schedule (kernel/sched/core.c:3778)
[   25.523197] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   25.526013] io_schedule (kernel/sched/core.c:6271)
[   25.528964] __lock_page_killable (mm/filemap.c:1245)
[   25.532794] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   25.536605] filemap_fault (mm/filemap.c:2538)
[   25.539911] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   25.543682] __do_fault (mm/memory.c:3439)
[   25.546620] handle_mm_fault (mm/memory.c:3833)
[   25.550050] ? asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:581)
[   25.554637] __get_user_pages (mm/gup.c:879)
[   25.558101] populate_vma_page_range (mm/gup.c:1420)
[   25.561968] __mm_populate (mm/gup.c:1476)
[   25.565125] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   25.568787] do_syscall_64 (arch/x86/entry/common.c:46)
[   25.571915] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   25.576230] RIP: 0033:0x7f21bee46b3b
[ 25.579357] Code: Bad RIP value.
objdump: '/tmp/tmp.8NmKDGTy17.o': No such file

Code starting with the faulting instruction
===========================================
[   25.582199] RSP: 002b:00007ffcb5f8ad38 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   25.588588] RAX: ffffffffffffffda RBX: 000055c9762f1450 RCX: 00007f21bee46b3b
[   25.594627] RDX: 00007ffcb5f8ad28 RSI: 0000000000000002 RDI: 0000000000000003
[   25.600637] RBP: 00007ffcb5f8ad40 R08: 0000000000000001 R09: 0000000000000000
[   25.606701] R10: 00007f21beec9ac0 R11: 0000000000000246 R12: 000055c9762b30a0
[   25.612738] R13: 00007ffcb5f8b180 R14: 0000000000000000 R15: 0000000000000000
[   25.618762] task:packetdrill     state:D stack:13952 pid:  190 ppid:   153 flags:0x00004000
[   25.625781] Call Trace:
[   25.627987] __schedule (kernel/sched/core.c:3778)
[   25.631046] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   25.633796] io_schedule (kernel/sched/core.c:6271)
[   25.636726] ? wake_up_page_bit (mm/filemap.c:1128)
[   25.640271] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   25.644022] ? filemap_fault (mm/filemap.c:2538)
[   25.647427] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   25.651146] ? __do_fault (mm/memory.c:3439)
[   25.654227] ? handle_mm_fault (mm/memory.c:3833)
[   25.657752] ? __get_user_pages (mm/gup.c:879)
[   25.661368] ? populate_vma_page_range (mm/gup.c:1420)
[   25.665338] ? __mm_populate (mm/gup.c:1476)
[   25.668707] ? __x64_sys_mlockall (./include/linux/mm.h:2567)
[   25.672467] ? do_syscall_64 (arch/x86/entry/common.c:46)
[   25.675751] ? entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   25.680213] task:packetdrill     state:D stack:13952 pid:  193 ppid:   160 flags:0x00004000
[   25.687285] Call Trace:
[   25.689472] __schedule (kernel/sched/core.c:3778)
[   25.692547] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   25.695314] io_schedule (kernel/sched/core.c:6271)
[   25.698216] __lock_page_killable (mm/filemap.c:1245)
[   25.702013] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   25.705752] filemap_fault (mm/filemap.c:2538)
[   25.709010] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   25.712691] __do_fault (mm/memory.c:3439)
[   25.715620] handle_mm_fault (mm/memory.c:3833)
[   25.719013] __get_user_pages (mm/gup.c:879)
[   25.722485] populate_vma_page_range (mm/gup.c:1420)
[   25.726326] __mm_populate (mm/gup.c:1476)
[   25.729528] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   25.733138] do_syscall_64 (arch/x86/entry/common.c:46)
[   25.736263] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   25.740587] RIP: 0033:0x7feb59c16b3b
[ 25.743716] Code: Bad RIP value.
objdump: '/tmp/tmp.x3l9eJ419A.o': No such file

Code starting with the faulting instruction
===========================================
[   25.746653] RSP: 002b:00007ffd75ef7f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   25.753019] RAX: ffffffffffffffda RBX: 0000562a49acc450 RCX: 00007feb59c16b3b
[   25.759077] RDX: 00007ffd75ef7f28 RSI: 0000000000000002 RDI: 0000000000000003
[   25.765127] RBP: 00007ffd75ef7f40 R08: 0000000000000001 R09: 0000000000000000
[   25.771231] R10: 00007feb59c99ac0 R11: 0000000000000246 R12: 0000562a49a8e0a0
[   25.777442] R13: 00007ffd75ef8380 R14: 0000000000000000 R15: 0000000000000000
[   25.783496] task:packetdrill     state:D stack:13952 pid:  194 ppid:   157 flags:0x00004000
[   25.790536] Call Trace:
[   25.792726] __schedule (kernel/sched/core.c:3778)
[   25.795777] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   25.798582] io_schedule (kernel/sched/core.c:6271)
[   25.801473] __lock_page_killable (mm/filemap.c:1245)
[   25.805246] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   25.809015] filemap_fault (mm/filemap.c:2538)
[   25.812279] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   25.815981] __do_fault (mm/memory.c:3439)
[   25.818909] handle_mm_fault (mm/memory.c:3833)
[   25.822458] __get_user_pages (mm/gup.c:879)
[   25.825947] populate_vma_page_range (mm/gup.c:1420)
[   25.829775] __mm_populate (mm/gup.c:1476)
[   25.832973] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   25.836591] do_syscall_64 (arch/x86/entry/common.c:46)
[   25.839715] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   25.844089] RIP: 0033:0x7f1bdd340b3b
[ 25.847219] Code: Bad RIP value.
objdump: '/tmp/tmp.y5EAYMWY3w.o': No such file

Code starting with the faulting instruction
===========================================
[   25.850079] RSP: 002b:00007fff992f49e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   25.856446] RAX: ffffffffffffffda RBX: 0000557ddd3b8450 RCX: 00007f1bdd340b3b
[   25.862481] RDX: 00007fff992f49d8 RSI: 0000000000000002 RDI: 0000000000000003
[   25.868455] RBP: 00007fff992f49f0 R08: 0000000000000001 R09: 0000000000000000
[   25.874528] R10: 00007f1bdd3c3ac0 R11: 0000000000000246 R12: 0000557ddd37a0a0
[   25.880541] R13: 00007fff992f4e30 R14: 0000000000000000 R15: 0000000000000000
[   25.886556] task:packetdrill     state:D stack:13952 pid:  200 ppid:   162 flags:0x00004000
[   25.893568] Call Trace:
[   25.895776] __schedule (kernel/sched/core.c:3778)
[   25.898833] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   25.901578] io_schedule (kernel/sched/core.c:6271)
[   25.904495] __lock_page_killable (mm/filemap.c:1245)
[   25.908246] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   25.912012] filemap_fault (mm/filemap.c:2538)
[   25.915270] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   25.918964] __do_fault (mm/memory.c:3439)
[   25.921853] handle_mm_fault (mm/memory.c:3833)
[   25.925245] __get_user_pages (mm/gup.c:879)
[   25.928720] populate_vma_page_range (mm/gup.c:1420)
[   25.932543] __mm_populate (mm/gup.c:1476)
[   25.935727] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   25.939348] do_syscall_64 (arch/x86/entry/common.c:46)
[   25.942466] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   25.946906] RIP: 0033:0x7fb34860bb3b
[ 25.950026] Code: Bad RIP value.
objdump: '/tmp/tmp.7dTIuFV40h.o': No such file

Code starting with the faulting instruction
===========================================
[   25.952846] RSP: 002b:00007ffea61b7668 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   25.959289] RAX: ffffffffffffffda RBX: 000055c6f01c2450 RCX: 00007fb34860bb3b
[   25.965453] RDX: 00007ffea61b7658 RSI: 0000000000000002 RDI: 0000000000000003
[   25.971504] RBP: 00007ffea61b7670 R08: 0000000000000001 R09: 0000000000000000
[   25.977505] R10: 00007fb34868eac0 R11: 0000000000000246 R12: 000055c6f01840a0
[   25.983596] R13: 00007ffea61b7ab0 R14: 0000000000000000 R15: 0000000000000000
[   25.989598] task:packetdrill     state:D stack:13952 pid:  203 ppid:   169 flags:0x00004000
[   25.996611] Call Trace:
[   25.998823] __schedule (kernel/sched/core.c:3778)
[   26.001863] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   26.004645] io_schedule (kernel/sched/core.c:6271)
[   26.007576] ? wake_up_page_bit (mm/filemap.c:1128)
[   26.011133] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   26.014900] ? filemap_fault (mm/filemap.c:2538)
[   26.018282] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   26.021973] ? __do_fault (mm/memory.c:3439)
[   26.025017] ? handle_mm_fault (mm/memory.c:3833)
[   26.028551] ? __get_user_pages (mm/gup.c:879)
[   26.032163] ? populate_vma_page_range (mm/gup.c:1420)
[   26.036134] ? __mm_populate (mm/gup.c:1476)
[   26.039487] ? __x64_sys_mlockall (./include/linux/mm.h:2567)
[   26.043260] ? do_syscall_64 (arch/x86/entry/common.c:46)
[   26.046528] ? entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   26.050968] task:packetdrill     state:D stack:13904 pid:  207 ppid:   173 flags:0x00004000
[   26.058008] Call Trace:
[   26.060192] __schedule (kernel/sched/core.c:3778)
[   26.063248] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   26.066032] io_schedule (kernel/sched/core.c:6271)
[   26.068924] __lock_page_killable (mm/filemap.c:1245)
[   26.072677] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   26.076429] filemap_fault (mm/filemap.c:2538)
[   26.079704] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   26.083424] __do_fault (mm/memory.c:3439)
[   26.086342] handle_mm_fault (mm/memory.c:3833)
[   26.089722] __get_user_pages (mm/gup.c:879)
[   26.093209] populate_vma_page_range (mm/gup.c:1420)
[   26.097040] __mm_populate (mm/gup.c:1476)
[   26.100218] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   26.103837] do_syscall_64 (arch/x86/entry/common.c:46)
[   26.106948] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   26.111256] RIP: 0033:0x7f90fb829b3b
[ 26.114383] Code: Bad RIP value.
objdump: '/tmp/tmp.RyztxuBbUi.o': No such file

Code starting with the faulting instruction
===========================================
[   26.117183] RSP: 002b:00007ffc3ae07ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   26.123589] RAX: ffffffffffffffda RBX: 0000560bf837d450 RCX: 00007f90fb829b3b
[   26.129614] RDX: 00007ffc3ae07e98 RSI: 0000000000000002 RDI: 0000000000000003
[   26.135641] RBP: 00007ffc3ae07eb0 R08: 0000000000000001 R09: 0000000000000000
[   26.141660] R10: 00007f90fb8acac0 R11: 0000000000000246 R12: 0000560bf833f0a0
[   26.147675] R13: 00007ffc3ae082f0 R14: 0000000000000000 R15: 0000000000000000
[   26.153693] task:packetdrill     state:D stack:13952 pid:  210 ppid:   179 flags:0x00004000
[   26.160728] Call Trace:
[   26.162923] ? sched_clock_cpu (kernel/sched/clock.c:382)
[   26.166326] ? ttwu_do_wakeup.isra.0 (kernel/sched/core.c:2480)
[   26.170172] ? __schedule (kernel/sched/core.c:3778)
[   26.173349] ? schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   26.176271] ? io_schedule (kernel/sched/core.c:6271)
[   26.179320] ? __lock_page_killable (mm/filemap.c:1245)
[   26.183216] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   26.187031] ? filemap_fault (mm/filemap.c:2538)
[   26.190451] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   26.194128] ? __do_fault (mm/memory.c:3439)
[   26.197161] ? handle_mm_fault (mm/memory.c:3833)
[   26.200692] ? push_rt_tasks (kernel/sched/rt.c:1945 (discriminator 1))
[   26.203866] ? __get_user_pages (mm/gup.c:879)
[   26.207474] ? populate_vma_page_range (mm/gup.c:1420)
[   26.211423] ? __mm_populate (mm/gup.c:1476)
[   26.214763] ? __x64_sys_mlockall (./include/linux/mm.h:2567)
[   26.218506] ? do_syscall_64 (arch/x86/entry/common.c:46)
[   26.221757] ? entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   26.226216] task:packetdrill     state:D stack:13952 pid:  212 ppid:   185 flags:0x00004000
[   26.233223] Call Trace:
[   26.235435] __schedule (kernel/sched/core.c:3778)
[   26.238487] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   26.241234] io_schedule (kernel/sched/core.c:6271)
[   26.244133] __lock_page_killable (mm/filemap.c:1245)
[   26.247890] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   26.251647] filemap_fault (mm/filemap.c:2538)
[   26.254906] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   26.258590] __do_fault (mm/memory.c:3439)
[   26.261462] handle_mm_fault (mm/memory.c:3833)
[   26.264869] __get_user_pages (mm/gup.c:879)
[   26.268327] populate_vma_page_range (mm/gup.c:1420)
[   26.272162] __mm_populate (mm/gup.c:1476)
[   26.275347] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   26.278970] do_syscall_64 (arch/x86/entry/common.c:46)
[   26.282082] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   26.286466] RIP: 0033:0x7f00e8863b3b
[ 26.289574] Code: Bad RIP value.
objdump: '/tmp/tmp.jv0VOAKh1q.o': No such file

Code starting with the faulting instruction
===========================================
[   26.292420] RSP: 002b:00007fff5b28f378 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   26.298797] RAX: ffffffffffffffda RBX: 000055ea3bc97450 RCX: 00007f00e8863b3b
[   26.304787] RDX: 00007fff5b28f368 RSI: 0000000000000002 RDI: 0000000000000003
[   26.310867] RBP: 00007fff5b28f380 R08: 0000000000000001 R09: 0000000000000000
[   26.316697] R10: 00007f00e88e6ac0 R11: 0000000000000246 R12: 000055ea3bc590a0
[   26.322525] R13: 00007fff5b28f7c0 R14: 0000000000000000 R15: 0000000000000000

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 14:21                                                 ` Michael Larabel
@ 2020-09-15 17:52                                                   ` Linus Torvalds
  0 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-15 17:52 UTC (permalink / raw)
  To: Michael Larabel; +Cc: Ext4 Developers List, linux-fsdevel

On Tue, Sep 15, 2020 at 7:22 AM Michael Larabel
<Michael@michaellarabel.com> wrote:
>
> Still running more benchmarks and on more systems, but so far at least
> as the Apache test is concerned this patch does seem to largely address
> the issue. The performance with the default 1000 page_lock_unfairness
> was yielding results more similar to 5.8 and in some cases tweaking the
> value did help improve the performance. A PLU value of 4~5 seems to
> yield the best performance.

Yeah. Although looking at those results, they are somewhat mixed - I
think the benchmark is just not very stable performance-wise.

Looking at your 250 concurrent users numbers, it's strange how that
unfairness=5 value gets performance _so_ much better than anything
else (including 5.8), so I do think that this benchmark is just very
borderline sensitive to this all, and some of the variation may just
be almost random.

We've often seen how cache placement etc can cause big differences on
benchmarks that just do one thing over and over.

But the big picture is fairly clear: apache siege absolutely hates the
complete fairness, and clearly the hybrid thing works fine.

The fact that a "unfairness count" of 4-5 seems to be consistently
fairly good (and sometimes seems clearly better than the "completely
unfair" behavior) makes me feel better about things, though. I was
suspecting that would be a somewhat reasonable value, and I _hope_
that it's small enough that it still gets rid of the watchdog issues
that the original fairness patch was aiming to fix.
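
Side note for anybody who wants to play with this themselves: with the
hybrid patch applied the knob is just a sysctl (assuming the patch as
posted, which exposes it as vm.page_lock_unfairness), so flipping values
between benchmark runs is trivial:

    # show the current value (the test patch defaulted to 1000)
    sysctl vm.page_lock_unfairness
    # try the value that looked best in the apache numbers
    sysctl -w vm.page_lock_unfairness=5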

> The results though with Hackbench and Redis that exhibited similar drops
> from the commit in question remained mixed.

The hackbench numbers seemed fairly randomly sensitive before too.

I wouldn't worry about it as long as there's no clear regression on a
load that seems realistic. We know the page lock is important, and we
do know that the _real_ fix in many ways is to try to not get
contention in the first place. Either by trying to avoid the page lock
entirely (ie the approach that Willy is pursuing), or by maybe trying
to have a reader-writer mode or something.

So in a very real sense, all this page lock fairness stuff is still
just a symptom of a more fundamental problem, even though I think that
the fairness itself is also a fairly fundamental thing.

                  Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 15:34                                                     ` Matthieu Baerts
@ 2020-09-15 18:27                                                       ` Linus Torvalds
  2020-09-15 18:47                                                         ` Linus Torvalds
       [not found]                                                         ` <9a92bf16-02c5-ba38-33c7-f350588ac874@tessares.net>
  2020-09-15 18:31                                                       ` Linus Torvalds
  1 sibling, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-15 18:27 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Tue, Sep 15, 2020 at 8:34 AM Matthieu Baerts
<matthieu.baerts@tessares.net> wrote:
>
> > But it sounds like it's 100% repeatable with the fair page lock, which
> > is actually a good thing. It means that if you do a "sysrq-w" while
> > it's blocking, you should see exactly what is waiting for what.
> >
> > (Except since it times out nicely eventually, probably at least part
> > of the waiting is interruptible, and then you need to do "sysrq-t"
> > instead and it's going to be _very_ verbose and much harder to
> > pinpoint things, and you'll probably need to have a very big printk
> > buffer).
>
> Thank you for this idea! I was focused on using lockdep and I forgot
> about this simple method. It is not (yet) a reflex for me to use it!
>
> I think I got an interesting trace I took 20 seconds after having
> started packetdrill:

Ok, so everybody there is basically in the same identical situation:
they all seem to be doing mlockall(), which does __mm_populate() ->
populate_vma_page_range() -> __get_user_pages() -> handle_mm_fault()
and then actually tries to fault in the missing pages.

And that does do a lot of "lock_page()" (and, of course, as a result,
a lot of "unlock_page()" too).
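
In userspace terms, each of those packetdrill processes is basically
doing the equivalent of this (a trivial sketch of the syscall the traces
show, not packetdrill's actual code):

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
            /*
             * MCL_CURRENT makes the kernel fault in every page of every
             * existing mapping right away via __mm_populate(), which is
             * where all the page lock traffic in these traces comes from;
             * MCL_FUTURE extends that to mappings created later.
             */
            if (mlockall(MCL_CURRENT | MCL_FUTURE))
                    perror("mlockall");
            return 0;
    }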

Every one of them is in the "io_schedule()" in the filemap_fault()
path, although two of them seem to be in file_fdatawait_range() rather
than in the lock_page() code itself (so they are also waiting on a
page bit, but they are waiting for the writeback bit to clear).

And all of them do it under the mmap_read_lock().

I'm not seeing what else they'd be blocking on, though.

As mentioned, the thing they are blocking on might be something
interruptible that holds the lock, and might not be in 'D' state. Then
it wouldn't show up in sysrq-W; you'd have to do 'sysrq-T' to see
those.

From past experience, that tends to be a _lot_ of data, though, and it
easily overflows the printk buffers etc.

lockdep has made these kinds of sysrq hacks mostly a thing of the
past, and the few non-lockdep locks (and the page lock is definitely
the biggest of them) are an annoying blast from the past.

                    Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 15:34                                                     ` Matthieu Baerts
  2020-09-15 18:27                                                       ` Linus Torvalds
@ 2020-09-15 18:31                                                       ` Linus Torvalds
  1 sibling, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-15 18:31 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Tue, Sep 15, 2020 at 8:34 AM Matthieu Baerts
<matthieu.baerts@tessares.net> wrote:
>
> One more thing, only when I have the issue, I can also see this kernel
> message that seems clearly linked:
>
>    [    7.198259] sched: RT throttling activated

Hmm. It does seem like this might be related and a clue, but you'd
have to ask the RT folks what the likely cause is and how to debug
things.. Not my area.

                 Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 18:27                                                       ` Linus Torvalds
@ 2020-09-15 18:47                                                         ` Linus Torvalds
  2020-09-15 19:26                                                           ` Matthieu Baerts
       [not found]                                                         ` <9a92bf16-02c5-ba38-33c7-f350588ac874@tessares.net>
  1 sibling, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-15 18:47 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Tue, Sep 15, 2020 at 11:27 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Every one of them is in the "io_schedule()" in the filemap_fault()
> path, although two of them seem to be in file_fdatawait_range() rather
> than in the lock_page() code itself (so they are also waiting on a
> page bit, but they are waiting for the writeback bit to clear).

No, that seems to be just stale entries on the stack from a previous
system call, rather than a real trace. There's no way to reach
file_fdatawait_range() from mlockall() that I can see.

So I'm not entirely sure why the stack trace for two of the processes
looks a bit different, but they all look like they should be in
__lock_page_killable().

It's possible those two were woken up (by another CPU) and their stack
is in flux. They also have "wake_up_page_bit()" as a stale entry on
their stack, so that's not entirely unlikely.

So that sysrq-W state shows that yes, people are stuck waiting for a
page, but that wasn't exactly unexpected.

                 Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
       [not found]                                                         ` <9a92bf16-02c5-ba38-33c7-f350588ac874@tessares.net>
@ 2020-09-15 19:24                                                           ` Linus Torvalds
  2020-09-15 19:38                                                             ` Matthieu Baerts
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-15 19:24 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Tue, Sep 15, 2020 at 12:01 PM Matthieu Baerts
<matthieu.baerts@tessares.net> wrote:
>
> Earlier today, I got one trace with 'sysrq-T' but it is more than 1100
> lines. It is attached to this email also with a version from
> "decode_stacktrace.sh", I hope that's alright.

Yeah, there's nothing interesting there.

The only relevant tasks seem to be the packetdrill ones that are
blocked on the page lock. I don't see anything that looks even
*remotely* like it could be holding a page lock and be waiting for
anything else.

A couple of pipe readers, a number of parents waiting on their
children, one futex waiter, one select loop.. Nothing at all
unexpected or remotely suspicious.

The packetdrill ones look very similar.

> I forgot one important thing, I was on top of David Miller's net-next
> branch by reflex. I can redo the traces on top of linux-next if needed.

Not likely an issue.

I'll go stare at the page lock code again to see if I've missed
anything. I still suspect it's a latent ABBA deadlock that is just
much *much* easier to trigger with the synchronous lock handoff, but I
don't see where it is.
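
By ABBA I mean the usual pattern of two tasks each holding one lock
while waiting for the other's, so neither can ever make progress. As a
self-contained toy version, with ordinary pthread mutexes standing in
for page locks (purely to show the shape of it, not a claim about any
real kernel path):

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

    static void *task1(void *arg)
    {
            pthread_mutex_lock(&a);   /* take A ... */
            sleep(1);                 /* let task2 grab B meanwhile */
            pthread_mutex_lock(&b);   /* ... then wait for B: stuck */
            return arg;
    }

    static void *task2(void *arg)
    {
            pthread_mutex_lock(&b);   /* take B ... */
            sleep(1);
            pthread_mutex_lock(&a);   /* ... then wait for A: stuck */
            return arg;
    }

    int main(void)
    {
            pthread_t t1, t2;

            pthread_create(&t1, NULL, task1, NULL);
            pthread_create(&t2, NULL, task2, NULL);
            pthread_join(t1, NULL);   /* never returns: ABBA deadlock */
            pthread_join(t2, NULL);
            return 0;
    }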

I guess this is all fairly theoretical since we apparently need to do
that hybrid "limited fairness" patch anyway, and it fixes your issue,
but I hate not understanding the problem.

              Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 18:47                                                         ` Linus Torvalds
@ 2020-09-15 19:26                                                           ` Matthieu Baerts
  2020-09-15 19:32                                                             ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Matthieu Baerts @ 2020-09-15 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 16554 bytes --]

On 15/09/2020 20:47, Linus Torvalds wrote:
> On Tue, Sep 15, 2020 at 11:27 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Every one of them is in the "io_schedule()" in the filemap_fault()
>> path, although two of them seem to be in file_fdatawait_range() rather
>> than in the lock_page() code itself (so they are also waiting on a
>> page bit, but they are waiting for the writeback bit to clear).
> 
> No, that seems to be just stale entries on the stack from a previous
> system call, rather than a real trace. There's no way to reach
> file_fdatawait_range() from mlockall() that I can see.
> 
> So I'm not entirely sure why the stack trace for two of the processes
> looks a bit different, but they all look like they should be in
> __lock_page_killable().
> 
> It's possible those two were woken up (by another CPU) and their stack
> is in flux. They also have "wake_up_page_bit()" as a stale entry on
> their stack, so that's not entirely unlikely.

I don't know if this info is useful, but I just checked and I can 
reproduce the issue with a single CPU. The trace is very similar to 
the previous one:

------------------- 8< -------------------
[   23.884953] sysrq: Show Blocked State
[   23.888310] task:packetdrill     state:D stack:13952 pid:  177 ppid: 
   148 flags:0x00004000
[   23.895619] Call Trace:
[   23.897885]  __schedule+0x3eb/0x680
[   23.901033]  schedule+0x45/0xb0
[   23.903882]  io_schedule+0xd/0x30
[   23.906868]  __lock_page_killable+0x13e/0x280
[   23.910729]  ? file_fdatawait_range+0x20/0x20
[   23.914648]  filemap_fault+0x6b4/0x970
[   23.918061]  ? filemap_map_pages+0x195/0x330
[   23.921833]  __do_fault+0x32/0x90
[   23.924754]  handle_mm_fault+0x8c1/0xe50
[   23.928011]  __get_user_pages+0x25c/0x750
[   23.931594]  populate_vma_page_range+0x57/0x60
[   23.935518]  __mm_populate+0xa9/0x150
[   23.938467]  __x64_sys_mlockall+0x151/0x180
[   23.942228]  do_syscall_64+0x33/0x40
[   23.945408]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   23.949736] RIP: 0033:0x7f28b847bb3b
[   23.960960] Code: Bad RIP value.
[   23.963856] RSP: 002b:00007ffe48d833c8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   23.970157] RAX: ffffffffffffffda RBX: 000055a16594a450 RCX: 
00007f28b847bb3b
[   23.976474] RDX: 00007ffe48d812a2 RSI: 00007f28b84ff3c0 RDI: 
0000000000000003
[   23.982773] RBP: 00007ffe48d833d0 R08: 0000000000000000 R09: 
0000000000000000
[   23.988998] R10: 00007f28b84feac0 R11: 0000000000000246 R12: 
000055a16590c0a0
[   23.995079] R13: 00007ffe48d83810 R14: 0000000000000000 R15: 
0000000000000000
[   24.001143] task:packetdrill     state:D stack:13952 pid:  179 ppid: 
   146 flags:0x00004000
[   24.008425] Call Trace:
[   24.010495]  __schedule+0x3eb/0x680
[   24.013378]  schedule+0x45/0xb0
[   24.016053]  io_schedule+0xd/0x30
[   24.018763]  __lock_page_killable+0x13e/0x280
[   24.022663]  ? file_fdatawait_range+0x20/0x20
[   24.026564]  filemap_fault+0x6b4/0x970
[   24.029954]  ? filemap_map_pages+0x195/0x330
[   24.033865]  __do_fault+0x32/0x90
[   24.036773]  handle_mm_fault+0x8c1/0xe50
[   24.040072]  __get_user_pages+0x25c/0x750
[   24.043667]  populate_vma_page_range+0x57/0x60
[   24.047200]  __mm_populate+0xa9/0x150
[   24.050554]  __x64_sys_mlockall+0x151/0x180
[   24.054273]  do_syscall_64+0x33/0x40
[   24.057492]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.061795] RIP: 0033:0x7f3f900f2b3b
[   24.065048] Code: Bad RIP value.
[   24.067971] RSP: 002b:00007ffd682b6338 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.074233] RAX: ffffffffffffffda RBX: 0000563239032450 RCX: 
00007f3f900f2b3b
[   24.080303] RDX: 00007ffd682b4212 RSI: 00007f3f901763c0 RDI: 
0000000000000003
[   24.086263] RBP: 00007ffd682b6340 R08: 0000000000000000 R09: 
0000000000000000
[   24.092364] R10: 00007f3f90175ac0 R11: 0000000000000246 R12: 
0000563238ff40a0
[   24.098345] R13: 00007ffd682b6780 R14: 0000000000000000 R15: 
0000000000000000
[   24.104588] task:packetdrill     state:D stack:13512 pid:  185 ppid: 
   153 flags:0x00004000
[   24.111856] Call Trace:
[   24.114132]  __schedule+0x3eb/0x680
[   24.117323]  schedule+0x45/0xb0
[   24.120052]  io_schedule+0xd/0x30
[   24.123036]  __lock_page_killable+0x13e/0x280
[   24.126600]  ? file_fdatawait_range+0x20/0x20
[   24.130146]  filemap_fault+0x6b4/0x970
[   24.133264]  ? filemap_map_pages+0x195/0x330
[   24.136846]  __do_fault+0x32/0x90
[   24.139653]  handle_mm_fault+0x8c1/0xe50
[   24.143165]  __get_user_pages+0x25c/0x750
[   24.146439]  populate_vma_page_range+0x57/0x60
[   24.150050]  __mm_populate+0xa9/0x150
[   24.153325]  __x64_sys_mlockall+0x151/0x180
[   24.157089]  do_syscall_64+0x33/0x40
[   24.160181]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.164690] RIP: 0033:0x7f18e0da3b3b
[   24.167851] Code: Bad RIP value.
[   24.170516] RSP: 002b:00007ffc3a0d67f8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.177116] RAX: ffffffffffffffda RBX: 0000562d7b1b1450 RCX: 
00007f18e0da3b3b
[   24.183423] RDX: 00007ffc3a0d46d2 RSI: 00007f18e0e273c0 RDI: 
0000000000000003
[   24.189707] RBP: 00007ffc3a0d6800 R08: 0000000000000000 R09: 
0000000000000000
[   24.195977] R10: 00007f18e0e26ac0 R11: 0000000000000246 R12: 
0000562d7b1730a0
[   24.202018] R13: 00007ffc3a0d6c40 R14: 0000000000000000 R15: 
0000000000000000
[   24.208311] task:packetdrill     state:D stack:13952 pid:  188 ppid: 
   151 flags:0x00004000
[   24.215398] Call Trace:
[   24.217471]  __schedule+0x3eb/0x680
[   24.220446]  schedule+0x45/0xb0
[   24.223044]  io_schedule+0xd/0x30
[   24.225774]  __lock_page_killable+0x13e/0x280
[   24.229621]  ? file_fdatawait_range+0x20/0x20
[   24.233542]  filemap_fault+0x6b4/0x970
[   24.236868]  ? xas_start+0x69/0x90
[   24.239766]  ? filemap_map_pages+0x195/0x330
[   24.243194]  __do_fault+0x32/0x90
[   24.246252]  handle_mm_fault+0x8c1/0xe50
[   24.249759]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[   24.254378]  __get_user_pages+0x25c/0x750
[   24.257960]  populate_vma_page_range+0x57/0x60
[   24.261868]  __mm_populate+0xa9/0x150
[   24.265114]  __x64_sys_mlockall+0x151/0x180
[   24.268683]  do_syscall_64+0x33/0x40
[   24.271658]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.275772] RIP: 0033:0x7f2d0b01eb3b
[   24.278691] Code: Bad RIP value.
[   24.281400] RSP: 002b:00007ffd7b2d9fa8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.287430] RAX: ffffffffffffffda RBX: 000055785f8a4450 RCX: 
00007f2d0b01eb3b
[   24.293273] RDX: 00007ffd7b2d7e82 RSI: 00007f2d0b0a23c0 RDI: 
0000000000000003
[   24.302752] RBP: 00007ffd7b2d9fb0 R08: 0000000000000000 R09: 
0000000000000000
[   24.309115] R10: 00007f2d0b0a1ac0 R11: 0000000000000246 R12: 
000055785f8660a0
[   24.315529] R13: 00007ffd7b2da3f0 R14: 0000000000000000 R15: 
0000000000000000
[   24.321891] task:packetdrill     state:D stack:13952 pid:  190 ppid: 
   157 flags:0x00004000
[   24.329197] Call Trace:
[   24.331531]  __schedule+0x3eb/0x680
[   24.334736]  schedule+0x45/0xb0
[   24.337548]  io_schedule+0xd/0x30
[   24.340362]  __lock_page_killable+0x13e/0x280
[   24.344098]  ? file_fdatawait_range+0x20/0x20
[   24.348001]  filemap_fault+0x6b4/0x970
[   24.351427]  ? filemap_map_pages+0x195/0x330
[   24.355292]  __do_fault+0x32/0x90
[   24.358282]  handle_mm_fault+0x8c1/0xe50
[   24.361788]  __get_user_pages+0x25c/0x750
[   24.365427]  populate_vma_page_range+0x57/0x60
[   24.369408]  __mm_populate+0xa9/0x150
[   24.372726]  __x64_sys_mlockall+0x151/0x180
[   24.376496]  do_syscall_64+0x33/0x40
[   24.379522]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.383623] RIP: 0033:0x7fe749a73b3b
[   24.386534] Code: Bad RIP value.
[   24.389240] RSP: 002b:00007fff6d4ad268 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.395433] RAX: ffffffffffffffda RBX: 0000556473049450 RCX: 
00007fe749a73b3b
[   24.401307] RDX: 00007fff6d4ab142 RSI: 00007fe749af73c0 RDI: 
0000000000000003
[   24.407426] RBP: 00007fff6d4ad270 R08: 0000000000000000 R09: 
0000000000000000
[   24.413696] R10: 00007fe749af6ac0 R11: 0000000000000246 R12: 
000055647300b0a0
[   24.419919] R13: 00007fff6d4ad6b0 R14: 0000000000000000 R15: 
0000000000000000
[   24.425940] task:packetdrill     state:D stack:13952 pid:  193 ppid: 
   160 flags:0x00004000
[   24.433269] Call Trace:
[   24.435568]  __schedule+0x3eb/0x680
[   24.438397]  schedule+0x45/0xb0
[   24.441030]  io_schedule+0xd/0x30
[   24.443760]  __lock_page_killable+0x13e/0x280
[   24.447275]  ? file_fdatawait_range+0x20/0x20
[   24.450919]  filemap_fault+0x6b4/0x970
[   24.453976]  ? filemap_map_pages+0x195/0x330
[   24.457501]  __do_fault+0x32/0x90
[   24.460273]  handle_mm_fault+0x8c1/0xe50
[   24.463397]  __get_user_pages+0x25c/0x750
[   24.466743]  populate_vma_page_range+0x57/0x60
[   24.470728]  __mm_populate+0xa9/0x150
[   24.474084]  __x64_sys_mlockall+0x151/0x180
[   24.477780]  do_syscall_64+0x33/0x40
[   24.480924]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.485302] RIP: 0033:0x7faf08ad9b3b
[   24.488404] Code: Bad RIP value.
[   24.491348] RSP: 002b:00007ffec68c61d8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.497549] RAX: ffffffffffffffda RBX: 00005556df2df450 RCX: 
00007faf08ad9b3b
[   24.503686] RDX: 00007ffec68c40b2 RSI: 00007faf08b5d3c0 RDI: 
0000000000000003
[   24.509693] RBP: 00007ffec68c61e0 R08: 0000000000000000 R09: 
0000000000000000
[   24.515919] R10: 00007faf08b5cac0 R11: 0000000000000246 R12: 
00005556df2a10a0
[   24.521919] R13: 00007ffec68c6620 R14: 0000000000000000 R15: 
0000000000000000
[   24.528173] task:packetdrill     state:D stack:13952 pid:  199 ppid: 
   163 flags:0x00004000
[   24.535290] Call Trace:
[   24.537348]  __schedule+0x3eb/0x680
[   24.540324]  schedule+0x45/0xb0
[   24.542834]  io_schedule+0xd/0x30
[   24.545594]  __lock_page_killable+0x13e/0x280
[   24.549485]  ? file_fdatawait_range+0x20/0x20
[   24.553266]  filemap_fault+0x6b4/0x970
[   24.556501]  ? filemap_map_pages+0x195/0x330
[   24.560073]  __do_fault+0x32/0x90
[   24.562786]  handle_mm_fault+0x8c1/0xe50
[   24.566304]  __get_user_pages+0x25c/0x750
[   24.569874]  populate_vma_page_range+0x57/0x60
[   24.573844]  __mm_populate+0xa9/0x150
[   24.577054]  __x64_sys_mlockall+0x151/0x180
[   24.580624]  do_syscall_64+0x33/0x40
[   24.583629]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.587866] RIP: 0033:0x7efdca54bb3b
[   24.590766] Code: Bad RIP value.
[   24.593713] RSP: 002b:00007ffe160c8be8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.600288] RAX: ffffffffffffffda RBX: 000055668dc19450 RCX: 
00007efdca54bb3b
[   24.606330] RDX: 00007ffe160c6ac2 RSI: 00007efdca5cf3c0 RDI: 
0000000000000003
[   24.612650] RBP: 00007ffe160c8bf0 R08: 0000000000000000 R09: 
0000000000000000
[   24.618715] R10: 00007efdca5ceac0 R11: 0000000000000246 R12: 
000055668dbdb0a0
[   24.625055] R13: 00007ffe160c9030 R14: 0000000000000000 R15: 
0000000000000000
[   24.631336] task:packetdrill     state:D stack:13952 pid:  200 ppid: 
   167 flags:0x00004000
[   24.638212] Call Trace:
[   24.640504]  __schedule+0x3eb/0x680
[   24.643730]  schedule+0x45/0xb0
[   24.646616]  io_schedule+0xd/0x30
[   24.649646]  __lock_page_killable+0x13e/0x280
[   24.653420]  ? file_fdatawait_range+0x20/0x20
[   24.657351]  filemap_fault+0x6b4/0x970
[   24.660596]  ? filemap_map_pages+0x195/0x330
[   24.664134]  __do_fault+0x32/0x90
[   24.666936]  handle_mm_fault+0x8c1/0xe50
[   24.670502]  __get_user_pages+0x25c/0x750
[   24.674111]  populate_vma_page_range+0x57/0x60
[   24.678106]  __mm_populate+0xa9/0x150
[   24.681434]  __x64_sys_mlockall+0x151/0x180
[   24.685191]  do_syscall_64+0x33/0x40
[   24.688289]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.692754] RIP: 0033:0x7f4ac3f45b3b
[   24.696017] Code: Bad RIP value.
[   24.698963] RSP: 002b:00007ffd159771e8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.705620] RAX: ffffffffffffffda RBX: 00005620eb450450 RCX: 
00007f4ac3f45b3b
[   24.711901] RDX: 00007ffd159750c2 RSI: 00007f4ac3fc93c0 RDI: 
0000000000000003
[   24.718195] RBP: 00007ffd159771f0 R08: 0000000000000000 R09: 
0000000000000000
[   24.724468] R10: 00007f4ac3fc8ac0 R11: 0000000000000246 R12: 
00005620eb4120a0
[   24.730689] R13: 00007ffd15977630 R14: 0000000000000000 R15: 
0000000000000000
[   24.736965] task:packetdrill     state:D stack:13952 pid:  202 ppid: 
   174 flags:0x00004000
[   24.744128] Call Trace:
[   24.746188]  __schedule+0x3eb/0x680
[   24.749129]  schedule+0x45/0xb0
[   24.751715]  io_schedule+0xd/0x30
[   24.754430]  __lock_page_killable+0x13e/0x280
[   24.758317]  ? file_fdatawait_range+0x20/0x20
[   24.762160]  filemap_fault+0x6b4/0x970
[   24.765582]  ? filemap_map_pages+0x195/0x330
[   24.769418]  __do_fault+0x32/0x90
[   24.772432]  handle_mm_fault+0x8c1/0xe50
[   24.775700]  __get_user_pages+0x25c/0x750
[   24.779051]  populate_vma_page_range+0x57/0x60
[   24.783009]  __mm_populate+0xa9/0x150
[   24.786309]  __x64_sys_mlockall+0x151/0x180
[   24.790052]  do_syscall_64+0x33/0x40
[   24.793285]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.797821] RIP: 0033:0x7f177627db3b
[   24.801013] Code: Bad RIP value.
[   24.803752] RSP: 002b:00007ffcd9c784d8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.810149] RAX: ffffffffffffffda RBX: 0000556bbe193450 RCX: 
00007f177627db3b
[   24.816393] RDX: 00007ffcd9c763b2 RSI: 00007f17763013c0 RDI: 
0000000000000003
[   24.822407] RBP: 00007ffcd9c784e0 R08: 0000000000000000 R09: 
0000000000000000
[   24.828686] R10: 00007f1776300ac0 R11: 0000000000000246 R12: 
0000556bbe1550a0
[   24.834924] R13: 00007ffcd9c78920 R14: 0000000000000000 R15: 
0000000000000000
[   24.840714] task:packetdrill     state:D stack:13952 pid:  204 ppid: 
   182 flags:0x00004000
[   24.847842] Call Trace:
[   24.850112]  __schedule+0x3eb/0x680
[   24.853316]  schedule+0x45/0xb0
[   24.856185]  io_schedule+0xd/0x30
[   24.859181]  __lock_page_killable+0x13e/0x280
[   24.862675]  ? file_fdatawait_range+0x20/0x20
[   24.866212]  filemap_fault+0x6b4/0x970
[   24.869525]  ? filemap_map_pages+0x195/0x330
[   24.873240]  __do_fault+0x32/0x90
[   24.876250]  handle_mm_fault+0x8c1/0xe50
[   24.879470]  __get_user_pages+0x25c/0x750
[   24.882680]  populate_vma_page_range+0x57/0x60
[   24.894498]  __mm_populate+0xa9/0x150
[   24.897763]  __x64_sys_mlockall+0x151/0x180
[   24.901363]  do_syscall_64+0x33/0x40
[   24.904437]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.908958] RIP: 0033:0x7ff1fe4e2b3b
[   24.911951] Code: Bad RIP value.
[   24.914865] RSP: 002b:00007ffc28177598 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   24.921310] RAX: ffffffffffffffda RBX: 0000558d9bbe3450 RCX: 
00007ff1fe4e2b3b
[   24.927427] RDX: 00007ffc28175472 RSI: 00007ff1fe5663c0 RDI: 
0000000000000003
[   24.933468] RBP: 00007ffc281775a0 R08: 0000000000000000 R09: 
0000000000000000
[   24.939722] R10: 00007ff1fe565ac0 R11: 0000000000000246 R12: 
0000558d9bba50a0
[   24.945719] R13: 00007ffc281779e0 R14: 0000000000000000 R15: 
0000000000000000
[   24.951908] task:packetdrill     state:D stack:13952 pid:  205 ppid: 
   187 flags:0x00004000
[   24.958948] Call Trace:
[   24.961229]  __schedule+0x3eb/0x680
[   24.964212]  schedule+0x45/0xb0
[   24.967086]  io_schedule+0xd/0x30
[   24.970104]  __lock_page_killable+0x13e/0x280
[   24.974001]  ? file_fdatawait_range+0x20/0x20
[   24.977912]  filemap_fault+0x6b4/0x970
[   24.981228]  ? filemap_map_pages+0x195/0x330
[   24.984924]  __do_fault+0x32/0x90
[   24.987934]  handle_mm_fault+0x8c1/0xe50
[   24.991452]  __get_user_pages+0x25c/0x750
[   24.995022]  populate_vma_page_range+0x57/0x60
[   24.998990]  __mm_populate+0xa9/0x150
[   25.002109]  __x64_sys_mlockall+0x151/0x180
[   25.005873]  do_syscall_64+0x33/0x40
[   25.009084]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   25.013550] RIP: 0033:0x7fc6a6880b3b
[   25.016548] Code: Bad RIP value.
[   25.019318] RSP: 002b:00007ffc69437db8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000097
[   25.025516] RAX: ffffffffffffffda RBX: 00005572a33dd450 RCX: 
00007fc6a6880b3b
[   25.031678] RDX: 00007ffc69435c92 RSI: 00007fc6a69043c0 RDI: 
0000000000000003
[   25.037940] RBP: 00007ffc69437dc0 R08: 0000000000000000 R09: 
0000000000000000
[   25.044100] R10: 00007fc6a6903ac0 R11: 0000000000000246 R12: 
00005572a339f0a0
[   25.049950] R13: 00007ffc69438200 R14: 0000000000000000 R15: 
0000000000000000

> So that sysrq-W state shows that yes, people are stuck waiting for a
> page, but that wasn't exactly unexpected.

Is there anything else I can do to get more info? I guess a core dump 
would start to be really interesting here.

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

[-- Attachment #2: sysrq_w_1_core_analysed.txt --]
[-- Type: text/plain, Size: 18956 bytes --]

[    4.721395] ip (166) used greatest stack depth: 12264 bytes left
[    4.911863] ip (198) used greatest stack depth: 12152 bytes left
+ '[' -d /proc/138 ']'
+ echo w
[   23.884953] sysrq: Show Blocked State
[   23.888310] task:packetdrill     state:D stack:13952 pid:  177 ppid:   148 flags:0x00004000
[   23.895619] Call Trace:
[   23.897885] __schedule (kernel/sched/core.c:3778)
[   23.901033] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   23.903882] io_schedule (kernel/sched/core.c:6271)
[   23.906868] __lock_page_killable (mm/filemap.c:1245)
[   23.910729] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   23.914648] filemap_fault (mm/filemap.c:2538)
[   23.918061] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   23.921833] __do_fault (mm/memory.c:3439)
[   23.924754] handle_mm_fault (mm/memory.c:3833)
[   23.928011] __get_user_pages (mm/gup.c:879)
[   23.931594] populate_vma_page_range (mm/gup.c:1420)
[   23.935518] __mm_populate (mm/gup.c:1476)
[   23.938467] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   23.942228] do_syscall_64 (arch/x86/entry/common.c:46)
[   23.945408] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   23.949736] RIP: 0033:0x7f28b847bb3b
[ 23.960960] Code: Bad RIP value.
objdump: '/tmp/tmp.Ljq5rmExmf.o': No such file

Code starting with the faulting instruction
===========================================
[   23.963856] RSP: 002b:00007ffe48d833c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   23.970157] RAX: ffffffffffffffda RBX: 000055a16594a450 RCX: 00007f28b847bb3b
[   23.976474] RDX: 00007ffe48d812a2 RSI: 00007f28b84ff3c0 RDI: 0000000000000003
[   23.982773] RBP: 00007ffe48d833d0 R08: 0000000000000000 R09: 0000000000000000
[   23.988998] R10: 00007f28b84feac0 R11: 0000000000000246 R12: 000055a16590c0a0
[   23.995079] R13: 00007ffe48d83810 R14: 0000000000000000 R15: 0000000000000000
[   24.001143] task:packetdrill     state:D stack:13952 pid:  179 ppid:   146 flags:0x00004000
[   24.008425] Call Trace:
[   24.010495] __schedule (kernel/sched/core.c:3778)
[   24.013378] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.016053] io_schedule (kernel/sched/core.c:6271)
[   24.018763] __lock_page_killable (mm/filemap.c:1245)
[   24.022663] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.026564] filemap_fault (mm/filemap.c:2538)
[   24.029954] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.033865] __do_fault (mm/memory.c:3439)
[   24.036773] handle_mm_fault (mm/memory.c:3833)
[   24.040072] __get_user_pages (mm/gup.c:879)
[   24.043667] populate_vma_page_range (mm/gup.c:1420)
[   24.047200] __mm_populate (mm/gup.c:1476)
[   24.050554] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.054273] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.057492] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.061795] RIP: 0033:0x7f3f900f2b3b
[ 24.065048] Code: Bad RIP value.
objdump: '/tmp/tmp.WbMCRNJcaL.o': No such file

Code starting with the faulting instruction
===========================================
[   24.067971] RSP: 002b:00007ffd682b6338 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.074233] RAX: ffffffffffffffda RBX: 0000563239032450 RCX: 00007f3f900f2b3b
[   24.080303] RDX: 00007ffd682b4212 RSI: 00007f3f901763c0 RDI: 0000000000000003
[   24.086263] RBP: 00007ffd682b6340 R08: 0000000000000000 R09: 0000000000000000
[   24.092364] R10: 00007f3f90175ac0 R11: 0000000000000246 R12: 0000563238ff40a0
[   24.098345] R13: 00007ffd682b6780 R14: 0000000000000000 R15: 0000000000000000
[   24.104588] task:packetdrill     state:D stack:13512 pid:  185 ppid:   153 flags:0x00004000
[   24.111856] Call Trace:
[   24.114132] __schedule (kernel/sched/core.c:3778)
[   24.117323] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.120052] io_schedule (kernel/sched/core.c:6271)
[   24.123036] __lock_page_killable (mm/filemap.c:1245)
[   24.126600] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.130146] filemap_fault (mm/filemap.c:2538)
[   24.133264] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.136846] __do_fault (mm/memory.c:3439)
[   24.139653] handle_mm_fault (mm/memory.c:3833)
[   24.143165] __get_user_pages (mm/gup.c:879)
[   24.146439] populate_vma_page_range (mm/gup.c:1420)
[   24.150050] __mm_populate (mm/gup.c:1476)
[   24.153325] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.157089] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.160181] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.164690] RIP: 0033:0x7f18e0da3b3b
[ 24.167851] Code: Bad RIP value.
objdump: '/tmp/tmp.LKljBzrpTK.o': No such file

Code starting with the faulting instruction
===========================================
[   24.170516] RSP: 002b:00007ffc3a0d67f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.177116] RAX: ffffffffffffffda RBX: 0000562d7b1b1450 RCX: 00007f18e0da3b3b
[   24.183423] RDX: 00007ffc3a0d46d2 RSI: 00007f18e0e273c0 RDI: 0000000000000003
[   24.189707] RBP: 00007ffc3a0d6800 R08: 0000000000000000 R09: 0000000000000000
[   24.195977] R10: 00007f18e0e26ac0 R11: 0000000000000246 R12: 0000562d7b1730a0
[   24.202018] R13: 00007ffc3a0d6c40 R14: 0000000000000000 R15: 0000000000000000
[   24.208311] task:packetdrill     state:D stack:13952 pid:  188 ppid:   151 flags:0x00004000
[   24.215398] Call Trace:
[   24.217471] __schedule (kernel/sched/core.c:3778)
[   24.220446] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.223044] io_schedule (kernel/sched/core.c:6271)
[   24.225774] __lock_page_killable (mm/filemap.c:1245)
[   24.229621] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.233542] filemap_fault (mm/filemap.c:2538)
[   24.236868] ? xas_start (lib/xarray.c:193)
[   24.239766] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.243194] __do_fault (mm/memory.c:3439)
[   24.246252] handle_mm_fault (mm/memory.c:3833)
[   24.249759] ? asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:581)
[   24.254378] __get_user_pages (mm/gup.c:879)
[   24.257960] populate_vma_page_range (mm/gup.c:1420)
[   24.261868] __mm_populate (mm/gup.c:1476)
[   24.265114] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.268683] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.271658] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.275772] RIP: 0033:0x7f2d0b01eb3b
[ 24.278691] Code: Bad RIP value.
objdump: '/tmp/tmp.fE6oiFXCjI.o': No such file

Code starting with the faulting instruction
===========================================
[   24.281400] RSP: 002b:00007ffd7b2d9fa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.287430] RAX: ffffffffffffffda RBX: 000055785f8a4450 RCX: 00007f2d0b01eb3b
[   24.293273] RDX: 00007ffd7b2d7e82 RSI: 00007f2d0b0a23c0 RDI: 0000000000000003
[   24.302752] RBP: 00007ffd7b2d9fb0 R08: 0000000000000000 R09: 0000000000000000
[   24.309115] R10: 00007f2d0b0a1ac0 R11: 0000000000000246 R12: 000055785f8660a0
[   24.315529] R13: 00007ffd7b2da3f0 R14: 0000000000000000 R15: 0000000000000000
[   24.321891] task:packetdrill     state:D stack:13952 pid:  190 ppid:   157 flags:0x00004000
[   24.329197] Call Trace:
[   24.331531] __schedule (kernel/sched/core.c:3778)
[   24.334736] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.337548] io_schedule (kernel/sched/core.c:6271)
[   24.340362] __lock_page_killable (mm/filemap.c:1245)
[   24.344098] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.348001] filemap_fault (mm/filemap.c:2538)
[   24.351427] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.355292] __do_fault (mm/memory.c:3439)
[   24.358282] handle_mm_fault (mm/memory.c:3833)
[   24.361788] __get_user_pages (mm/gup.c:879)
[   24.365427] populate_vma_page_range (mm/gup.c:1420)
[   24.369408] __mm_populate (mm/gup.c:1476)
[   24.372726] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.376496] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.379522] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.383623] RIP: 0033:0x7fe749a73b3b
[ 24.386534] Code: Bad RIP value.
objdump: '/tmp/tmp.40gz4fjQGW.o': No such file

Code starting with the faulting instruction
===========================================
[   24.389240] RSP: 002b:00007fff6d4ad268 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.395433] RAX: ffffffffffffffda RBX: 0000556473049450 RCX: 00007fe749a73b3b
[   24.401307] RDX: 00007fff6d4ab142 RSI: 00007fe749af73c0 RDI: 0000000000000003
[   24.407426] RBP: 00007fff6d4ad270 R08: 0000000000000000 R09: 0000000000000000
[   24.413696] R10: 00007fe749af6ac0 R11: 0000000000000246 R12: 000055647300b0a0
[   24.419919] R13: 00007fff6d4ad6b0 R14: 0000000000000000 R15: 0000000000000000
[   24.425940] task:packetdrill     state:D stack:13952 pid:  193 ppid:   160 flags:0x00004000
[   24.433269] Call Trace:
[   24.435568] __schedule (kernel/sched/core.c:3778)
[   24.438397] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.441030] io_schedule (kernel/sched/core.c:6271)
[   24.443760] __lock_page_killable (mm/filemap.c:1245)
[   24.447275] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.450919] filemap_fault (mm/filemap.c:2538)
[   24.453976] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.457501] __do_fault (mm/memory.c:3439)
[   24.460273] handle_mm_fault (mm/memory.c:3833)
[   24.463397] __get_user_pages (mm/gup.c:879)
[   24.466743] populate_vma_page_range (mm/gup.c:1420)
[   24.470728] __mm_populate (mm/gup.c:1476)
[   24.474084] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.477780] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.480924] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.485302] RIP: 0033:0x7faf08ad9b3b
[ 24.488404] Code: Bad RIP value.
objdump: '/tmp/tmp.tfpiUEugWr.o': No such file

Code starting with the faulting instruction
===========================================
[   24.491348] RSP: 002b:00007ffec68c61d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.497549] RAX: ffffffffffffffda RBX: 00005556df2df450 RCX: 00007faf08ad9b3b
[   24.503686] RDX: 00007ffec68c40b2 RSI: 00007faf08b5d3c0 RDI: 0000000000000003
[   24.509693] RBP: 00007ffec68c61e0 R08: 0000000000000000 R09: 0000000000000000
[   24.515919] R10: 00007faf08b5cac0 R11: 0000000000000246 R12: 00005556df2a10a0
[   24.521919] R13: 00007ffec68c6620 R14: 0000000000000000 R15: 0000000000000000
[   24.528173] task:packetdrill     state:D stack:13952 pid:  199 ppid:   163 flags:0x00004000
[   24.535290] Call Trace:
[   24.537348] __schedule (kernel/sched/core.c:3778)
[   24.540324] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.542834] io_schedule (kernel/sched/core.c:6271)
[   24.545594] __lock_page_killable (mm/filemap.c:1245)
[   24.549485] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.553266] filemap_fault (mm/filemap.c:2538)
[   24.556501] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.560073] __do_fault (mm/memory.c:3439)
[   24.562786] handle_mm_fault (mm/memory.c:3833)
[   24.566304] __get_user_pages (mm/gup.c:879)
[   24.569874] populate_vma_page_range (mm/gup.c:1420)
[   24.573844] __mm_populate (mm/gup.c:1476)
[   24.577054] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.580624] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.583629] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.587866] RIP: 0033:0x7efdca54bb3b
[ 24.590766] Code: Bad RIP value.
objdump: '/tmp/tmp.FQfITSXTM8.o': No such file

Code starting with the faulting instruction
===========================================
[   24.593713] RSP: 002b:00007ffe160c8be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.600288] RAX: ffffffffffffffda RBX: 000055668dc19450 RCX: 00007efdca54bb3b
[   24.606330] RDX: 00007ffe160c6ac2 RSI: 00007efdca5cf3c0 RDI: 0000000000000003
[   24.612650] RBP: 00007ffe160c8bf0 R08: 0000000000000000 R09: 0000000000000000
[   24.618715] R10: 00007efdca5ceac0 R11: 0000000000000246 R12: 000055668dbdb0a0
[   24.625055] R13: 00007ffe160c9030 R14: 0000000000000000 R15: 0000000000000000
[   24.631336] task:packetdrill     state:D stack:13952 pid:  200 ppid:   167 flags:0x00004000
[   24.638212] Call Trace:
[   24.640504] __schedule (kernel/sched/core.c:3778)
[   24.643730] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.646616] io_schedule (kernel/sched/core.c:6271)
[   24.649646] __lock_page_killable (mm/filemap.c:1245)
[   24.653420] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.657351] filemap_fault (mm/filemap.c:2538)
[   24.660596] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.664134] __do_fault (mm/memory.c:3439)
[   24.666936] handle_mm_fault (mm/memory.c:3833)
[   24.670502] __get_user_pages (mm/gup.c:879)
[   24.674111] populate_vma_page_range (mm/gup.c:1420)
[   24.678106] __mm_populate (mm/gup.c:1476)
[   24.681434] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.685191] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.688289] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.692754] RIP: 0033:0x7f4ac3f45b3b
[ 24.696017] Code: Bad RIP value.
objdump: '/tmp/tmp.Sj77JcXCql.o': No such file

Code starting with the faulting instruction
===========================================
[   24.698963] RSP: 002b:00007ffd159771e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.705620] RAX: ffffffffffffffda RBX: 00005620eb450450 RCX: 00007f4ac3f45b3b
[   24.711901] RDX: 00007ffd159750c2 RSI: 00007f4ac3fc93c0 RDI: 0000000000000003
[   24.718195] RBP: 00007ffd159771f0 R08: 0000000000000000 R09: 0000000000000000
[   24.724468] R10: 00007f4ac3fc8ac0 R11: 0000000000000246 R12: 00005620eb4120a0
[   24.730689] R13: 00007ffd15977630 R14: 0000000000000000 R15: 0000000000000000
[   24.736965] task:packetdrill     state:D stack:13952 pid:  202 ppid:   174 flags:0x00004000
[   24.744128] Call Trace:
[   24.746188] __schedule (kernel/sched/core.c:3778)
[   24.749129] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.751715] io_schedule (kernel/sched/core.c:6271)
[   24.754430] __lock_page_killable (mm/filemap.c:1245)
[   24.758317] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.762160] filemap_fault (mm/filemap.c:2538)
[   24.765582] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.769418] __do_fault (mm/memory.c:3439)
[   24.772432] handle_mm_fault (mm/memory.c:3833)
[   24.775700] __get_user_pages (mm/gup.c:879)
[   24.779051] populate_vma_page_range (mm/gup.c:1420)
[   24.783009] __mm_populate (mm/gup.c:1476)
[   24.786309] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.790052] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.793285] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.797821] RIP: 0033:0x7f177627db3b
[ 24.801013] Code: Bad RIP value.
objdump: '/tmp/tmp.psLmCiKZMt.o': No such file

Code starting with the faulting instruction
===========================================
[   24.803752] RSP: 002b:00007ffcd9c784d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.810149] RAX: ffffffffffffffda RBX: 0000556bbe193450 RCX: 00007f177627db3b
[   24.816393] RDX: 00007ffcd9c763b2 RSI: 00007f17763013c0 RDI: 0000000000000003
[   24.822407] RBP: 00007ffcd9c784e0 R08: 0000000000000000 R09: 0000000000000000
[   24.828686] R10: 00007f1776300ac0 R11: 0000000000000246 R12: 0000556bbe1550a0
[   24.834924] R13: 00007ffcd9c78920 R14: 0000000000000000 R15: 0000000000000000
[   24.840714] task:packetdrill     state:D stack:13952 pid:  204 ppid:   182 flags:0x00004000
[   24.847842] Call Trace:
[   24.850112] __schedule (kernel/sched/core.c:3778)
[   24.853316] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.856185] io_schedule (kernel/sched/core.c:6271)
[   24.859181] __lock_page_killable (mm/filemap.c:1245)
[   24.862675] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.866212] filemap_fault (mm/filemap.c:2538)
[   24.869525] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.873240] __do_fault (mm/memory.c:3439)
[   24.876250] handle_mm_fault (mm/memory.c:3833)
[   24.879470] __get_user_pages (mm/gup.c:879)
[   24.882680] populate_vma_page_range (mm/gup.c:1420)
[   24.894498] __mm_populate (mm/gup.c:1476)
[   24.897763] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   24.901363] do_syscall_64 (arch/x86/entry/common.c:46)
[   24.904437] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   24.908958] RIP: 0033:0x7ff1fe4e2b3b
[ 24.911951] Code: Bad RIP value.
objdump: '/tmp/tmp.06fO0gVPOe.o': No such file

Code starting with the faulting instruction
===========================================
[   24.914865] RSP: 002b:00007ffc28177598 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   24.921310] RAX: ffffffffffffffda RBX: 0000558d9bbe3450 RCX: 00007ff1fe4e2b3b
[   24.927427] RDX: 00007ffc28175472 RSI: 00007ff1fe5663c0 RDI: 0000000000000003
[   24.933468] RBP: 00007ffc281775a0 R08: 0000000000000000 R09: 0000000000000000
[   24.939722] R10: 00007ff1fe565ac0 R11: 0000000000000246 R12: 0000558d9bba50a0
[   24.945719] R13: 00007ffc281779e0 R14: 0000000000000000 R15: 0000000000000000
[   24.951908] task:packetdrill     state:D stack:13952 pid:  205 ppid:   187 flags:0x00004000
[   24.958948] Call Trace:
[   24.961229] __schedule (kernel/sched/core.c:3778)
[   24.964212] schedule (./arch/x86/include/asm/bitops.h:207 (discriminator 1))
[   24.967086] io_schedule (kernel/sched/core.c:6271)
[   24.970104] __lock_page_killable (mm/filemap.c:1245)
[   24.974001] ? file_fdatawait_range (./include/linux/pagemap.h:515)
[   24.977912] filemap_fault (mm/filemap.c:2538)
[   24.981228] ? filemap_map_pages (./include/linux/xarray.h:1606)
[   24.984924] __do_fault (mm/memory.c:3439)
[   24.987934] handle_mm_fault (mm/memory.c:3833)
[   24.991452] __get_user_pages (mm/gup.c:879)
[   24.995022] populate_vma_page_range (mm/gup.c:1420)
[   24.998990] __mm_populate (mm/gup.c:1476)
[   25.002109] __x64_sys_mlockall (./include/linux/mm.h:2567)
[   25.005873] do_syscall_64 (arch/x86/entry/common.c:46)
[   25.009084] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:125)
[   25.013550] RIP: 0033:0x7fc6a6880b3b
[ 25.016548] Code: Bad RIP value.
objdump: '/tmp/tmp.7Xc3VrR5Qh.o': No such file

Code starting with the faulting instruction
===========================================
[   25.019318] RSP: 002b:00007ffc69437db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000097
[   25.025516] RAX: ffffffffffffffda RBX: 00005572a33dd450 RCX: 00007fc6a6880b3b
[   25.031678] RDX: 00007ffc69435c92 RSI: 00007fc6a69043c0 RDI: 0000000000000003
[   25.037940] RBP: 00007ffc69437dc0 R08: 0000000000000000 R09: 0000000000000000
[   25.044100] R10: 00007fc6a6903ac0 R11: 0000000000000246 R12: 00005572a339f0a0
[   25.049950] R13: 00007ffc69438200 R14: 0000000000000000 R15: 0000000000000000

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 19:26                                                           ` Matthieu Baerts
@ 2020-09-15 19:32                                                             ` Linus Torvalds
  2020-09-15 19:56                                                               ` Matthieu Baerts
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-15 19:32 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Tue, Sep 15, 2020 at 12:26 PM Matthieu Baerts
<matthieu.baerts@tessares.net> wrote:
>
> I don't know if this info is useful but I just checked and I can
> reproduce the issue with a single CPU.

Good thinking.

> And the trace is very similar to the previous one:

.. and yes, now there are no messy traces, they all have that
__lock_page_killable() unambiguously in them (and the only '?' entries
are just from stale stuff on the stack which is due to stack frames
that aren't fully initialized and old stack frame data shining
through).

So it does seem like the previous trace uncertainty was likely just a
cross-CPU issue.

Was that an actual UP kernel? It might be good to try that too, just
to see if it could be an SMP race in the page locking code.

After all, one such theoretical race was one reason I started the rewrite.

                       Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 19:24                                                           ` Linus Torvalds
@ 2020-09-15 19:38                                                             ` Matthieu Baerts
  0 siblings, 0 replies; 65+ messages in thread
From: Matthieu Baerts @ 2020-09-15 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On 15/09/2020 21:24, Linus Torvalds wrote:
> On Tue, Sep 15, 2020 at 12:01 PM Matthieu Baerts
> <matthieu.baerts@tessares.net> wrote:
 >
>> I forgot one important thing, I was on top of David Miller's net-next
>> branch by reflex. I can redo the traces on top of linux-next if needed.
> 
> Not likely an issue.
> 
> I'll go stare at the page lock code again to see if I've missed
> anything. I still suspect it's a latent ABBA deadlock that is just
> much *much* easier to trigger with the synchronous lock handoff, but I
> don't see where it is.
> 
> I guess this is all fairly theoretical since we apparently need to do
> that hybrid "limited fairness" patch anyway, and it fixes your issue,
> but I hate not understanding the problem.

I understand :)
Thank you again for looking at this issue!

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 19:32                                                             ` Linus Torvalds
@ 2020-09-15 19:56                                                               ` Matthieu Baerts
  2020-09-15 23:35                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Matthieu Baerts @ 2020-09-15 19:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On 15/09/2020 21:32, Linus Torvalds wrote:
> Was that an actual UP kernel? It might be good to try that too, just
> to see if it could be an SMP race in the page locking code.

I am sorry, I am not sure how to verify this. I guess it was one 
processor because I removed the "-smp 2" option from qemu. So I guess it 
switched to uniprocessor mode.

Also, when I did the test, to make sure I was using only one CPU, I 
printed the output of /proc/cpuinfo:


+ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7401P 24-Core Processor
stepping        : 2
microcode       : 0x1000065
cpu MHz         : 2000.000
cache size      : 512 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni
pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe
popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm
cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ssbd
ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap
clflushopt sha_ni xsaveopt xsavec xgetbv1 virt_ssbd
arat npt nrip_save arch_capabilities
bugs            : fxsave_leak sysret_ss_attrs null_seg spectre_v1
spectre_v2 spec_store_bypass
bogomips        : 4000.00
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:


Do you want me to try another qemu config?

Sorry, it is getting late for me, but I also forgot to mention earlier 
that with 1 CPU and your new sysctl set to 1, my issue did not reproduce 
in 6 executions.

> After all, one such theoretical race was one reason I started the rewrite.

And that's a good thing, thank you! :)

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 19:56                                                               ` Matthieu Baerts
@ 2020-09-15 23:35                                                                 ` Linus Torvalds
  2020-09-16 10:34                                                                   ` Jan Kara
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-15 23:35 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Michael Larabel, Matthew Wilcox, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Tue, Sep 15, 2020 at 12:56 PM Matthieu Baerts
<matthieu.baerts@tessares.net> wrote:
>
> I am sorry, I am not sure how to verify this. I guess it was one
> processor because I removed the "-smp 2" option from qemu. So I guess it
> switched to uniprocessor mode.

Ok, that all sounds fine. So yes, your problem happens even with just
one CPU, and it's not any subtle SMP race.

Which is all good - apart from the bug existing in the first place, of
course. It just reinforces the "it's probably a latent deadlock"
thing.

              Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-15 23:35                                                                 ` Linus Torvalds
@ 2020-09-16 10:34                                                                   ` Jan Kara
  2020-09-16 18:47                                                                     ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Jan Kara @ 2020-09-16 10:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthieu Baerts, Michael Larabel, Matthew Wilcox, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Tue 15-09-20 16:35:45, Linus Torvalds wrote:
> On Tue, Sep 15, 2020 at 12:56 PM Matthieu Baerts
> <matthieu.baerts@tessares.net> wrote:
> >
> > I am sorry, I am not sure how to verify this. I guess it was one
> > processor because I removed the "-smp 2" option from qemu. So I guess it
> > switched to uniprocessor mode.
> 
> Ok, that all sounds fine. So yes, your problem happens even with just
> one CPU, and it's not any subtle SMP race.
> 
> Which is all good - apart from the bug existing in the first place, of
> course. It just reinforces the "it's probably a latent deadlock"
> thing.

So from the traces, another theory that occurred to me is that it could be a
"missed wakeup" problem. Looking at the code in wait_on_page_bit_common() I
found one suspicious thing (it isn't a great match, because the problem
seems to happen on UP as well and I think it's mostly a theoretical issue,
but I'll still write it here):

wait_on_page_bit_common() has:

        spin_lock_irq(&q->lock);
        SetPageWaiters(page);
        if (!trylock_page_bit_common(page, bit_nr, wait))
	  - which expands to:
	  (
	        if (wait->flags & WQ_FLAG_EXCLUSIVE) {
        	        if (test_and_set_bit(bit_nr, &page->flags))
                	        return false;
	        } else if (test_bit(bit_nr, &page->flags))
        	        return false;
	  )

                __add_wait_queue_entry_tail(q, wait);
        spin_unlock_irq(&q->lock);

Now the suspicious thing is the ordering here. What prevents the compiler
(or the CPU, for that matter) from reordering the SetPageWaiters() call
behind the __add_wait_queue_entry_tail() call? I know SetPageWaiters() and
test_and_set_bit() operate on the same long, but is it really guaranteed
that something doesn't reorder these?

In unlock_page() we have:

        if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
                wake_up_page_bit(page, PG_locked);

So if the reordering happens, clear_bit_unlock_is_negative_byte() could
return false even though we have a waiter queued.

And this seems to be a thing commit 2a9127fcf22 ("mm: rewrite
wait_on_page_bit_common() logic") introduced because before we had
set_current_state() between SetPageWaiters() and test_bit() which implies a
memory barrier.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-16 10:34                                                                   ` Jan Kara
@ 2020-09-16 18:47                                                                     ` Linus Torvalds
  0 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-16 18:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthieu Baerts, Michael Larabel, Matthew Wilcox, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List,
	linux-fsdevel

On Wed, Sep 16, 2020 at 3:34 AM Jan Kara <jack@suse.cz> wrote:
>
> wait_on_page_bit_common() has:
>
>         spin_lock_irq(&q->lock);
>         SetPageWaiters(page);
>         if (!trylock_page_bit_common(page, bit_nr, wait))
>           - which expands to:
>           (
>                 if (wait->flags & WQ_FLAG_EXCLUSIVE) {
>                         if (test_and_set_bit(bit_nr, &page->flags))
>                                 return false;
>                 } else if (test_bit(bit_nr, &page->flags))
>                         return false;
>           )
>
>                 __add_wait_queue_entry_tail(q, wait);
>         spin_unlock_irq(&q->lock);
>
> Now the suspicious thing is the ordering here. What prevents the compiler
> (or the CPU for that matter) from reordering SetPageWaiters() call behind
> the __add_wait_queue_entry_tail() call? I know SetPageWaiters() and
> test_and_set_bit() operate on the same long but is it really guaranteed
> something doesn't reorder these?

I agree that we might want to make this much more specific, but no,
those things can't be re-ordered.

Part of it is very much that memory ordering is only about two
different locations - accessing the *same* location is always ordered,
even on weakly ordered CPUs.

And the compiler can't re-order them either: we very much make
test_and_set_bit() have the proper barriers. We'd be very screwed if a
"set_bit()" could pass a "test_and_set_bit()".

That said, the PageWaiters bit is obviously very much by design in the
same word as the bit we're testing and setting - because the whole
point is that we can then at clear time check the PageWaiters bit
atomically with the bit we're clearing.

So this code optimally shouldn't use separate operations for those
bits at all. It would be better to just have one atomic sequence with
a cmpxchg that does both at the same time.
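
Just to illustrate the idea - this is a sketch only, not what the kernel
actually does today; the helper name is made up, and the real code would
still have to do the wait-queue insertion under q->lock:

	static inline bool trylock_page_or_mark_waiter(struct page *page,
						       bool exclusive)
	{
		unsigned long old = READ_ONCE(page->flags);

		for (;;) {
			unsigned long new, prev;
			bool locked = old & (1UL << PG_locked);

			if (!locked && !exclusive)
				return true;	/* bit is clear, nothing to wait for */
			if (!locked)
				new = old | (1UL << PG_locked);	/* take the lock */
			else
				new = old | (1UL << PG_waiters); /* mark ourselves waiting */

			prev = cmpxchg(&page->flags, old, new);
			if (prev == old)
				return !locked;
			old = prev;
		}
	}

The point being that PG_waiters and PG_locked live in the same word, so a
single cmpxchg can observe and update both atomically, and the ordering
question goes away.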

So I agree that sequence isn't wonderful. But no, I don't think this is the bug.

And as you mention, Matthieu sees it on UP, so memory ordering
wouldn't have been an issue anyway (and compiler re-ordering would
cause all kinds of other problems and break our bit operations
entirely).

Plus if it was some subtle bug like that, it wouldn't trigger as
consistently as it apparently does for Matthieu.

Of course, the fact that I don't see how it can trigger at all (I
still like my ABBA deadlock scenario, but I don't see anybody holding
any crossing locks in Matthieu's list of processes) means that I'm
clearly missing something.

                  Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-14 17:47                                               ` Linus Torvalds
  2020-09-14 20:21                                                 ` Matthieu Baerts
  2020-09-15 14:21                                                 ` Michael Larabel
@ 2020-09-17 17:51                                                 ` Linus Torvalds
  2020-09-17 18:23                                                   ` Matthew Wilcox
  2 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-17 17:51 UTC (permalink / raw)
  To: Michael Larabel, Matthieu Baerts
  Cc: Matthew Wilcox, Amir Goldstein, Ted Ts'o, Andreas Dilger,
	Ext4 Developers List, Jan Kara, linux-fsdevel

On Mon, Sep 14, 2020 at 10:47 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> (Note that it's a commit and has a SHA1, but it's from my "throw-away
> tree for testing", so it doesn't have my sign-off or any real commit
> message yet: I'll do that once it gets actual testing and comments).

Just to keep the list and people who were on this thread informed:
Michael ended up doing more benchmarking, and everything seems to line
up, and yes, that patch continues to work fine with an 'unfairness'
value of 5.

So I've committed it to my git tree (not pushed out yet, I have other
pull requests etc I'm handling too), and we'll see if anybody can come
up with a better model for how to avoid the page locking being such a
pain. Or if somebody can figure out why fair locking causes problems
for that packetdrill load that Matthieu reported.

It does strike me that if the main source of contention comes from
that "we need to check that the mapping is still valid as we insert
the page into the page tables", then the page lock really isn't the
obvious lock to use.

It would be much more natural to use the mapping->i_mmap_rwsem, I feel.

Willy? Your "just check for uptodate without any lock" patch itself
feels wrong. That's what we do for plain reads, but the difference is
that a read is a one-time event and a race is fine: we get valid data,
it's just that it's only valid *concurrently* with the truncate or
hole-punching event (ie either all zeroes or old data is fine).

The reason faulting a page in is different from a read is that if you
then map in a stale page, it might have had the correct contents at
the time of the fault, but it will not have the correct contents going
forward.

So a page-in requires fundamentally stronger locking than a read()
does, because of how the page-in causes that "future lifetime" of the
page, in ways a read() event does not.

But truncation that does page cache removal already requires that
i_mmap_rwsem, and in fact the VM already very much uses that (ie when
walking the page mapping).

The other alternative might be just the mapping->private_lock. It's
not a reader-writer lock, but if we don't need to sleep (and I don't
think the final "check ->mapping" can sleep anyway since it has to be
done together with the page table lock), a spinlock would be fine.

                   Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-17 17:51                                                 ` Linus Torvalds
@ 2020-09-17 18:23                                                   ` Matthew Wilcox
  2020-09-17 18:30                                                     ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-17 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Larabel, Matthieu Baerts, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Thu, Sep 17, 2020 at 10:51:41AM -0700, Linus Torvalds wrote:
> It does strike me that if the main source of contention comes from
> that "we need to check that the mapping is still valid as we insert
> the page into the page tables", then the page lock really isn't the
> obvious lock to use.
> 
> It would be much more natural to use the mapping->i_mmap_rwsem, I feel.
> 
> Willy? Your "just check for uptodate without any lock" patch itself
> feels wrong. That's what we do for plain reads, but the difference is
> that a read is a one-time event and a race is fine: we get valid data,
> it's just that it's only valid *concurrently* with the truncate or
> hole-punching event (ie either all zeroes or old data is fine).
> 
> The reason faulting a page in is different from a read is that if you
> then map in a stale page, it might have had the correct contents at
> the time of the fault, but it will not have the correct contents going
> forward.
> 
> So a page-in requires fundamentally stronger locking than a read()
> does, because of how the page-in causes that "future lifetime" of the
> page, in ways a read() event does not.

Yes, I agree, mmap is granting future access to a page in a
way that read is not.  So we need to be sure that any concurrent
truncate/hole-punch/collapse-range/invalidate-for-directio/migration
(henceforth page-killer) doesn't allow a page that's about to be recycled
to be added to the page tables.

What I was going for was that every page-killer currently does something
like this:

        if (page_mapped(page))
                unmap_mapping_pages(mapping, page->index, nr, false);

so as long as we mark the page as being no-new-maps-allowed before
the page-killer checks page_mapped(), and the map-page path checks
that the page isn't no-new-maps-allowed after incrementing page_mapped(),
then we'll never see something we shouldn't in the page tables -- either
it will show up in the page tables right before the page-killer calls
unmap_mapping_pages() (in which case you'll get the old contents), or
it won't show up in the page tables.

I'd actually want to wrap all that into:

static inline void page_kill_mappings(struct page *page)
{
	struct address_space *mapping = page->mapping;

	ClearPageUptodate(page);
	/* make the Uptodate clear visible before we look at the map count */
	smp_mb__before_atomic();
	if (!page_mapped(page))
		return;
	unmap_mapping_pages(mapping, page->index, compound_nr(page), false);
}

but that's just syntax.

I'm pretty sure that patch I sent out doesn't handle page faults on
disappearing pages correctly; it needs to retry so it can instantiate
a new page in the page cache.  And as Jan pointed out, it didn't handle
the page migration case.  But that wasn't really the point of the patch.

> But truncation that does page cache removal already requires that
> i_mmap_rwsem, and in fact the VM already very much uses that (ie when
> walking the page mapping).
> 
> The other alternative might be just the mapping->private_lock. It's
> not a reader-writer lock, but if we don't need to sleep (and I don't
> think the final "check ->mapping" can sleep anyway since it has to be
> done together with the page table lock), a spinlock would be fine.

I'm not a huge fan of taking file-wide locks for something that has
a naturally finer granularity.  It tends to bite us when weirdos with
giant databases mmap the whole thing and then whine about contention on
the rwsem during page faults.

But you're right that unmap_mapping_range() already takes this lock for
removing pages from the page table, so it would be reasonable to take
it for read when adding pages to the page table.  Something like taking
the i_mmap_lock_read(file->f_mapping) in filemap_fault, then adding a
new VM_FAULT_I_MMAP_LOCKED bit so that do_read_fault() and friends add:

	if (ret & VM_FAULT_I_MMAP_LOCKED)
		i_mmap_unlock_read(vmf->vma->vm_file->f_mapping);
	else
		unlock_page(page);
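
For the filemap_fault() side, a very rough sketch (VM_FAULT_I_MMAP_LOCKED
and the helper name here are inventions of this discussion, not existing
code, and it glosses over the retry details):

	static vm_fault_t lock_mapping_for_fault(struct vm_fault *vmf,
						 struct page *page)
	{
		struct address_space *mapping = vmf->vma->vm_file->f_mapping;

		i_mmap_lock_read(mapping);
		if (page->mapping != mapping || !PageUptodate(page)) {
			/* truncated or invalidated under us; drop it and retry */
			i_mmap_unlock_read(mapping);
			put_page(page);
			return VM_FAULT_RETRY;
		}
		vmf->page = page;
		return VM_FAULT_I_MMAP_LOCKED;
	}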

... want me to turn that into a real patch?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-17 18:23                                                   ` Matthew Wilcox
@ 2020-09-17 18:30                                                     ` Linus Torvalds
  2020-09-17 18:50                                                       ` Matthew Wilcox
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-17 18:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael Larabel, Matthieu Baerts, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Thu, Sep 17, 2020 at 11:23 AM Matthew Wilcox <willy@infradead.org> wrote:
>
>             Something like taking
> the i_mmap_lock_read(file->f_mapping) in filemap_fault, then adding a
> new VM_FAULT_I_MMAP_LOCKED bit so that do_read_fault() and friends add:
>
>         if (ret & VM_FAULT_I_MMAP_LOCKED)
>                 i_mmap_unlock_read(vmf->vma->vm_file->f_mapping);
>         else
>                 unlock_page(page);
>
> ... want me to turn that into a real patch?

I can't guarantee it's the right model - it does worry me how many
places we might get that i_mmap_rwlock, and how long we might hold it
for writing, and what deadlocks it might cause when we take it for
reading in the page fault path.

But I think it might be very interesting as a benchmark patch and a
trial balloon. Maybe it "just works".

I would _love_ for the page lock itself to be only (or at least
_mainly_) about the actual IO synchronization on the page.

That was the origin of it, the whole "protect all the complex state of
a page" behavior kind of grew over time, since it was the only
per-page lock we had.

              Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-17 18:30                                                     ` Linus Torvalds
@ 2020-09-17 18:50                                                       ` Matthew Wilcox
  2020-09-17 19:00                                                         ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-17 18:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Larabel, Matthieu Baerts, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Thu, Sep 17, 2020 at 11:30:00AM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2020 at 11:23 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> >             Something like taking
> > the i_mmap_lock_read(file->f_mapping) in filemap_fault, then adding a
> > new VM_FAULT_I_MMAP_LOCKED bit so that do_read_fault() and friends add:
> >
> >         if (ret & VM_FAULT_I_MMAP_LOCKED)
> >                 i_mmap_unlock_read(vmf->vma->vm_file->f_mapping);
> >         else
> >                 unlock_page(page);
> >
> > ... want me to turn that into a real patch?
> 
> I can't guarantee it's the right model - it does worry me how many
> places we might get that i_mmap_rwlock, and how long we migth hold it
> for writing, and what deadlocks it might cause when we take it for
> reading in the page fault path.
> 
> But I think it might be very interesting as a benchmark patch and a
> trial balloon. Maybe it "just works".

Ahh.  Here's a race this doesn't close:

int truncate_inode_page(struct address_space *mapping, struct page *page)
{
        VM_BUG_ON_PAGE(PageTail(page), page);

        if (page->mapping != mapping)
                return -EIO;

        truncate_cleanup_page(mapping, page);
        delete_from_page_cache(page);
        return 0;
}

truncate_cleanup_page() does
        if (page_mapped(page)) {
                pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
                unmap_mapping_pages(mapping, page->index, nr, false);
        }

but ->mapping isn't cleared until delete_from_page_cache() many
instructions later.  So we can get the lock and have a page which appears
to be not-truncated, only for it to get truncated on us later.

> I would _love_ for the page lock itself to be only (or at least
> _mainly_) about the actual IO synchronization on the page.
> 
> That was the origin of it, the whole "protect all the complex state of
> a page" behavior kind of grew over time, since it was the only
> per-page lock we had.

Yes, I think that's a noble goal.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-17 18:50                                                       ` Matthew Wilcox
@ 2020-09-17 19:00                                                         ` Linus Torvalds
  2020-09-17 19:27                                                           ` Matthew Wilcox
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-17 19:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael Larabel, Matthieu Baerts, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Thu, Sep 17, 2020 at 11:50 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Ahh.  Here's a race this doesn't close:
>
> int truncate_inode_page(struct address_space *mapping, struct page *page)

I think this one currently depends on the page lock, doesn't it?

And I think the point would be to get rid of that dependency, and just
make the rule be that it's done with the i_mmap_rwsem held for
writing.

But it might be one of those cases where taking it for writing might
add way too much serialization and might not be acceptable.

Again, I do get the feeling that a spinlock would be much better here.
Generally the areas we want to protect are truly just the trivial
"check that mapping is valid". That's a spinlock kind of thing, not a
semaphore kind of thing.

Doing a blocking semaphore that might need to serialize IO with page
faulting for the whole mapping is horrible and completely
unacceptable. Truncation events might be rare, but they aren't unheard
of!

But doing a spinlock that does the same is likely a complete non-issue.
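
IOW, something along these lines - purely an illustration, with
mapping->private_lock picked as an example and the helper name made up,
and ignoring how the check couples with the page table lock:

	static bool page_mapping_still_valid(struct page *page,
					     struct address_space *mapping)
	{
		bool valid;

		/* cheap, non-sleeping check - all a spinlock needs to cover */
		spin_lock(&mapping->private_lock);
		valid = page->mapping == mapping;
		spin_unlock(&mapping->private_lock);

		return valid;
	}

In such a scheme the truncation side would clear page->mapping under that
same spinlock, so the fault path only ever holds it for a handful of
instructions.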

              Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-17 19:00                                                         ` Linus Torvalds
@ 2020-09-17 19:27                                                           ` Matthew Wilcox
  2020-09-17 19:47                                                             ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Matthew Wilcox @ 2020-09-17 19:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Larabel, Matthieu Baerts, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Thu, Sep 17, 2020 at 12:00:06PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2020 at 11:50 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Ahh.  Here's a race this doesn't close:
> >
> > int truncate_inode_page(struct address_space *mapping, struct page *page)
> 
> I think this one currently depends on the page lock, doesn't it?
> 
> And I think the point would be to get rid of that dependency, and just
> make the rule be that it's done with the i_mmap_rwsem held for
> writing.

Ah, I see what you mean.  Hold the i_mmap_rwsem for write across,
basically, the entirety of truncate_inode_pages_range().  I don't see
a problem with lock scope; according to rmap.c, i_mmap_rwsem is near
the top of the hierarchy, just under lock_page.  We do wait for I/O to
complete (both reads and writes), but I don't know a reason for that to
be a problem.

We might want to take the page lock anyway to prevent truncate() from
racing with a read() that decides to start new I/O to this page, which
would involve adjusting the locking hierarchy (although to one in which
hugetlb and the regular VM are back in sync).  My brain is starting to
hurt from thinking about ways that not taking the page lock in truncate
might go wrong.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-17 19:27                                                           ` Matthew Wilcox
@ 2020-09-17 19:47                                                             ` Linus Torvalds
  2020-09-18  0:39                                                               ` Sedat Dilek
  2020-09-20 23:23                                                               ` Dave Chinner
  0 siblings, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-17 19:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael Larabel, Matthieu Baerts, Amir Goldstein, Ted Ts'o,
	Andreas Dilger, Ext4 Developers List, Jan Kara, linux-fsdevel

On Thu, Sep 17, 2020 at 12:27 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Ah, I see what you mean.  Hold the i_mmap_rwsem for write across,
> basically, the entirety of truncate_inode_pages_range().

I really suspect that will be entirely unacceptable for latency
reasons, but who knows. In practice, nobody actually truncates a file
_while_ it's mapped, that's just crazy talk.

But almost every time I go "nobody actually does this", I tend to be
surprised by just how crazy some loads are, and it turns out that
_somebody_ does it, and has a really good reason for doing odd things,
and has been doing it for years because it worked really well and
solved some odd problem.

So the "hold it for the entirety of truncate_inode_pages_range()"
thing seems to be a really simple approach, and nice and clean, but it
makes me go "*somebody* is going to do bad things and complain about
page fault latencies".

              Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-17 19:47                                                             ` Linus Torvalds
@ 2020-09-18  0:39                                                               ` Sedat Dilek
  2020-09-18  0:40                                                                 ` Sedat Dilek
  2020-09-20 23:23                                                               ` Dave Chinner
  1 sibling, 1 reply; 65+ messages in thread
From: Sedat Dilek @ 2020-09-18  0:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Thu, Sep 17, 2020 at 10:00 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Sep 17, 2020 at 12:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Ah, I see what you mean.  Hold the i_mmap_rwsem for write across,
> > basically, the entirety of truncate_inode_pages_range().
>
> I really suspect that will be entirely unacceptable for latency
> reasons, but who knows. In practice, nobody actually truncates a file
> _while_ it's mapped, that's just crazy talk.
>
> But almost every time I go "nobody actually does this", I tend to be
> surprised by just how crazy some loads are, and it turns out that
> _somebody_ does it, and has a really good reason for doing odd things,
> and has been doing it for years because it worked really well and
> solved some odd problem.
>
> So the "hold it for the entirety of truncate_inode_pages_range()"
> thing seems to be a really simple approach, and nice and clean, but it
> makes me go "*somebody* is going to do bad things and complain about
> page fault latencies".
>

Hi,

I followed this thread a bit and see there is now a...

commit 5ef64cc8987a9211d3f3667331ba3411a94ddc79
"mm: allow a controlled amount of unfairness in the page lock"

By first reading I saw...

+ *  (a) no special bits set:
...
+ *  (b) WQ_FLAG_EXCLUSIVE:
...
+ *  (b) WQ_FLAG_EXCLUSIVE | WQ_FLAG_CUSTOM:

The last one should be (c).

There was a second typo I cannot remember when you sent your patch
without a commit message.

Will look again.

Thanks and Greetings,
- Sedat -

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-18  0:39                                                               ` Sedat Dilek
@ 2020-09-18  0:40                                                                 ` Sedat Dilek
  2020-09-18 20:25                                                                   ` Sedat Dilek
  0 siblings, 1 reply; 65+ messages in thread
From: Sedat Dilek @ 2020-09-18  0:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Fri, Sep 18, 2020 at 2:39 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> On Thu, Sep 17, 2020 at 10:00 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Sep 17, 2020 at 12:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > Ah, I see what you mean.  Hold the i_mmap_rwsem for write across,
> > > basically, the entirety of truncate_inode_pages_range().
> >
> > I really suspect that will be entirely unacceptable for latency
> > reasons, but who knows. In practice, nobody actually truncates a file
> > _while_ it's mapped, that's just crazy talk.
> >
> > But almost every time I go "nobody actually does this", I tend to be
> > surprised by just how crazy some loads are, and it turns out that
> > _somebody_ does it, and has a really good reason for doing odd things,
> > and has been doing it for years because it worked really well and
> > solved some odd problem.
> >
> > So the "hold it for the entirety of truncate_inode_pages_range()"
> > thing seems to be a really simple approach, and nice and clean, but it
> > makes me go "*somebody* is going to do bad things and complain about
> > page fault latencies".
> >
>
> Hi,
>
> I followed this thread a bit and see there is now a...
>
> commit 5ef64cc8987a9211d3f3667331ba3411a94ddc79
> "mm: allow a controlled amount of unfairness in the page lock"
>
> By first reading I saw...
>
> + *  (a) no special bits set:
> ...
> + *  (b) WQ_FLAG_EXCLUSIVE:
> ...
> + *  (b) WQ_FLAG_EXCLUSIVE | WQ_FLAG_CUSTOM:
>
> The last one should be (c).
>
> There was a second typo I cannot remember when you sent your patch
> without a commit message.
>
> Will look again.
>
> Thanks and Greetings,
> - Sedat -

Ah I see...

+ * we have multiple different kinds of waits, not just he usual "exclusive"

... *t*he usual ...

- Sedat -

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-18  0:40                                                                 ` Sedat Dilek
@ 2020-09-18 20:25                                                                   ` Sedat Dilek
  2020-09-20 17:06                                                                     ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Sedat Dilek @ 2020-09-18 20:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Fri, Sep 18, 2020 at 2:40 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> On Fri, Sep 18, 2020 at 2:39 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >
> > On Thu, Sep 17, 2020 at 10:00 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Thu, Sep 17, 2020 at 12:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > Ah, I see what you mean.  Hold the i_mmap_rwsem for write across,
> > > > basically, the entirety of truncate_inode_pages_range().
> > >
> > > I really suspect that will be entirely unacceptable for latency
> > > reasons, but who knows. In practice, nobody actually truncates a file
> > > _while_ it's mapped, that's just crazy talk.
> > >
> > > But almost every time I go "nobody actually does this", I tend to be
> > > surprised by just how crazy some loads are, and it turns out that
> > > _somebody_ does it, and has a really good reason for doing odd things,
> > > and has been doing it for years because it worked really well and
> > > solved some odd problem.
> > >
> > > So the "hold it for the entirety of truncate_inode_pages_range()"
> > > thing seems to be a really simple approach, and nice and clean, but it
> > > makes me go "*somebody* is going to do bad things and complain about
> > > page fault latencies".
> > >
> >
> > Hi,
> >
> > I followed this thread a bit and see there is now a...
> >
> > commit 5ef64cc8987a9211d3f3667331ba3411a94ddc79
> > "mm: allow a controlled amount of unfairness in the page lock"
> >
> > By first reading I saw...
> >
> > + *  (a) no special bits set:
> > ...
> > + *  (b) WQ_FLAG_EXCLUSIVE:
> > ...
> > + *  (b) WQ_FLAG_EXCLUSIVE | WQ_FLAG_CUSTOM:
> >
> > The last one should be (c).
> >
> > There was a second typo I cannot remember when you sent your patch
> > without a commit message.
> >
> > Will look again.
> >
> > Thanks and Greetings,
> > - Sedat -
>
> Ah I see...
>
> + * we have multiple different kinds of waits, not just he usual "exclusive"
>
> ... *t*he usual ...
>

Hi Linus,

do you want me to send a patch for the above typos or do you want to
do that yourself?

Thanks.

Regards,
- Sedat -

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-18 20:25                                                                   ` Sedat Dilek
@ 2020-09-20 17:06                                                                     ` Linus Torvalds
  2020-09-20 17:14                                                                       ` Sedat Dilek
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-20 17:06 UTC (permalink / raw)
  To: Sedat Dilek
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Fri, Sep 18, 2020 at 1:25 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> do you want me to send a patch for the above typos or do you want to
> do that yourself?

I was about to do it myself, and then I noticed this email of yours
and I went "heck yeah, let Sedat send me a patch and get all the glory
for it".

                    Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-20 17:06                                                                     ` Linus Torvalds
@ 2020-09-20 17:14                                                                       ` Sedat Dilek
  2020-09-20 17:40                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Sedat Dilek @ 2020-09-20 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sun, Sep 20, 2020 at 7:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Sep 18, 2020 at 1:25 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >
> > do you want me to send a patch for the above typos or do you want to
> > do that yourself?
>
> I was about to do it myself, and then I noticed this email of yours
> and I went "heck yeah, let Sedat send me a patch and get all the glory
> for it".
>

A few minutes ago I logged into my machine and read this.

You had the glory of writing the patch :-).

Of course, I can send a patch if you desire.

- Sedat -

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-20 17:14                                                                       ` Sedat Dilek
@ 2020-09-20 17:40                                                                         ` Linus Torvalds
  2020-09-20 18:00                                                                           ` Sedat Dilek
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2020-09-20 17:40 UTC (permalink / raw)
  To: Sedat Dilek
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sun, Sep 20, 2020 at 10:14 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> You had the glory of writing the patch :-).

Your loss. I know it must hurt to not get the glory of authorship.

             Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-20 17:40                                                                         ` Linus Torvalds
@ 2020-09-20 18:00                                                                           ` Sedat Dilek
  0 siblings, 0 replies; 65+ messages in thread
From: Sedat Dilek @ 2020-09-20 18:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sun, Sep 20, 2020 at 7:40 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, Sep 20, 2020 at 10:14 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >
> > You had the glory of writing the patch :-).
>
> Your loss. I know it must hurt to not get the glory of authorship.
>

No money. No bitcoins. Yelping for credits.

- Sedat -

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-17 19:47                                                             ` Linus Torvalds
  2020-09-18  0:39                                                               ` Sedat Dilek
@ 2020-09-20 23:23                                                               ` Dave Chinner
  2020-09-20 23:31                                                                 ` Linus Torvalds
  1 sibling, 1 reply; 65+ messages in thread
From: Dave Chinner @ 2020-09-20 23:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Thu, Sep 17, 2020 at 12:47:16PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2020 at 12:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Ah, I see what you mean.  Hold the i_mmap_rwsem for write across,
> > basically, the entirety of truncate_inode_pages_range().
> 
> I really suspect that will be entirely unacceptable for latency
> reasons, but who knows. In practice, nobody actually truncates a file
> _while_ it's mapped, that's just crazy talk.
> 
> But almost every time I go "nobody actually does this", I tend to be
> surprised by just how crazy some loads are, and it turns out that
> _somebody_ does it, and has a really good reason for doing odd things,
> and has been doing it for years because it worked really well and
> solved some odd problem.
> 
> So the "hold it for the entirety of truncate_inode_pages_range()"
> thing seems to be a really simple approach, and nice and clean, but it
> makes me go "*somebody* is going to do bad things and complain about
> page fault latencies".

I don't think there's a major concern here because that's what we
are already doing at the filesystem level. In this case, it is
because some filesystems need to serialise IO to the inode -before-
calling truncate_setsize(). e.g.

- we have to wait for inflight direct IO that may be beyond the new
  EOF to drain before we start changing where EOF lies.

- we have data vs metadata ordering requirements that mean we have
  to ensure dirty data is stable before we change the inode size.

Hence we've already been locking out page faults for the entire
truncate operation for a few years on both XFS and ext4. We haven't
heard of any problems resulting from truncate-related page fault
latencies....

FWIW, if the fs layer is already providing this level of IO
exclusion w.r.t. address space access, does it need to be replicated
at the address space level?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-20 23:23                                                               ` Dave Chinner
@ 2020-09-20 23:31                                                                 ` Linus Torvalds
  2020-09-20 23:40                                                                   ` Linus Torvalds
  2020-09-21  1:20                                                                   ` Dave Chinner
  0 siblings, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-20 23:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sun, Sep 20, 2020 at 4:23 PM Dave Chinner <david@fromorbit.com> wrote:
>
> FWIW, if the fs layer is already providing this level of IO
> exclusion w.r.t. address space access, does it need to be replicated
> at the address space level?

Honestly, I'd rather do it the other way, and go "if the vfs layer
were to provide the IO exclusion, maybe the filesystems can drop it?"

Because we end up having something like 60 different filesystems. It's
*really* hard to know that "Yeah, this filesystem does it right".

And if we do end up doing it at both levels, and end up having some of
the locking duplicated, that's still better than "sometimes we don't
do it at all", and have odd problems on the less usual (and often less
well maintained) filesystems..

              Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-20 23:31                                                                 ` Linus Torvalds
@ 2020-09-20 23:40                                                                   ` Linus Torvalds
  2020-09-21  1:20                                                                   ` Dave Chinner
  1 sibling, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2020-09-20 23:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sun, Sep 20, 2020 at 4:31 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And if we do end up doing it at both levels, and end up having some of
> the locking duplicated, that's still better than "sometimes we don't
> do it at all", and have odd problems on the less usual (and often less
> well maintained) filesystems..

Doing locking at a higher level also often makes it much easier to
change and improve the semantics.

For example, the only reason we have absolutely the best pathname
lookup in the industry (by a couple of orders of magnitude) is that
it's done by the VFS layer and the dentry caching has been worked on
for decades to tune it and do all the lockless lookups.

That would simply not have been possible to do at a filesystem level.
A filesystem might have some complex and cumbersome code to do multiple
sequential lookups as long as they stay inside that filesystem, but it
would be ugly and it would be strictly worse than what the VFS layer
can and does do.

That is a fairly extreme example of course - and pathname resolution
really is somewhat odd - but I do think there are advantages to having
locking and access rules that are centralized across filesystems.

               Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Kernel Benchmarking
  2020-09-20 23:31                                                                 ` Linus Torvalds
  2020-09-20 23:40                                                                   ` Linus Torvalds
@ 2020-09-21  1:20                                                                   ` Dave Chinner
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Chinner @ 2020-09-21  1:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Michael Larabel, Matthieu Baerts, Amir Goldstein,
	Ted Ts'o, Andreas Dilger, Ext4 Developers List, Jan Kara,
	linux-fsdevel

On Sun, Sep 20, 2020 at 04:31:57PM -0700, Linus Torvalds wrote:
> On Sun, Sep 20, 2020 at 4:23 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > FWIW, if the fs layer is already providing this level of IO
> > exclusion w.r.t. address space access, does it need to be replicated
> > at the address space level?
> 
> Honestly, I'd rather do it the other way, and go "if the vfs layer
> were to provide the IO exclusion, maybe the filesystems can drop it?

I'm not sure it can because of the diversity of filesystems and
their locking requirements. XFS spent many, many years ignoring the
VFS inode locking because it was a mutex and instead did all its
own locking with rwsems internally.

> Because we end up having something like 60 different filesystems. It's
> *really* hard to know that "Yeah, this filesystem does it right".

Agreed, but I don't think moving the serialisation up to the VFS
will fix that because many filesystems will still have to do their
own thing. e.g. cluster filesystems requiring cluster locks first,
not local inode rwsems....
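
As a tiny illustrative sketch (purely hypothetical names, not any real
cluster filesystem's code), the ordering such filesystems need looks
like this - the cluster-wide lock has to come first, so a VFS that
unconditionally grabbed the local lock on their behalf would get it
backwards:

#include <pthread.h>

struct cluster_lock {                   /* stands in for a DLM lock */
        pthread_mutex_t granted;
};

struct node_inode {
        pthread_rwlock_t i_rwsem;       /* node-local inode lock */
        struct cluster_lock *glock;     /* cluster-wide lock for this inode */
};

static void clusterfs_write_lock(struct node_inode *ip)
{
        /* Cluster lock first: other nodes must be excluded before
         * anything about the local copy of the inode can be trusted. */
        pthread_mutex_lock(&ip->glock->granted);
        pthread_rwlock_wrlock(&ip->i_rwsem);
}

static void clusterfs_write_unlock(struct node_inode *ip)
{
        pthread_rwlock_unlock(&ip->i_rwsem);
        pthread_mutex_unlock(&ip->glock->granted);
}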

> And if we do end up doing it at both levels, and end up having some of
> the locking duplicated, that's still better than "sometimes we don't
> do it at all", and have odd problems on the less usual (and often less
> well maintained) filesystems..

Sure.

However, the problem I'm trying to avoid arises when the filesystem
can do things concurrently that the new locking in the VFS/address
space would serialise.

e.g. the range locking I'm working on for XFS allows truncate to run
concurrently with reads, writes, fallocate(), etc., and it all just
works right now. Adding address-space-wide exclusive locking to the
page cache will likely defeat this: we can no longer run fallocate()
concurrently, and so cannot solve the performance problems we have
with qemu+qcow2, where fallocate() is used to initialise sparse
clusters and a single fallocate() on an uninitialised cluster
serialises all IO to the image file.

Hence I'd like to avoid introducing address-space-wide serialisation
for invalidation at the page cache level just as we are about to
enable concurrency for operations that do page cache invalidation.
Range locking in the page cache would be fine, but global locking
will be .... problematic.
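
To make the distinction concrete, here is a toy userspace sketch of
what a byte-range lock looks like (nothing like the actual XFS
implementation; all names are invented). Operations on non-overlapping
ranges both succeed immediately and run concurrently; only overlapping
ranges wait for each other:

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>

struct held_range {
        off_t start, end;               /* inclusive byte range */
        struct held_range *next;
};

struct range_lock {
        pthread_mutex_t lock;           /* protects only the list itself */
        pthread_cond_t  released;
        struct held_range *held;
};

/* One instance per file/address space would be the natural granularity. */
static struct range_lock file_ranges = {
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .released = PTHREAD_COND_INITIALIZER,
        .held = NULL,
};

static bool overlaps(const struct held_range *r, off_t start, off_t end)
{
        return r->start <= end && start <= r->end;
}

/* Block until [start, end] overlaps no held range, then record it as held. */
static void range_lock_acquire(struct range_lock *rl, off_t start, off_t end)
{
        struct held_range *r, *mine = malloc(sizeof(*mine));

        if (!mine)
                abort();
        mine->start = start;
        mine->end = end;

        pthread_mutex_lock(&rl->lock);
retry:
        for (r = rl->held; r; r = r->next) {
                if (overlaps(r, start, end)) {
                        pthread_cond_wait(&rl->released, &rl->lock);
                        goto retry;
                }
        }
        mine->next = rl->held;
        rl->held = mine;
        pthread_mutex_unlock(&rl->lock);
}

static void range_lock_release(struct range_lock *rl, off_t start, off_t end)
{
        struct held_range **p, *r;

        pthread_mutex_lock(&rl->lock);
        for (p = &rl->held; (r = *p) != NULL; p = &r->next) {
                if (r->start == start && r->end == end) {
                        *p = r->next;
                        free(r);
                        break;
                }
        }
        pthread_cond_broadcast(&rl->released);
        pthread_mutex_unlock(&rl->lock);
}

With something like this, an fallocate() initialising one sparse
cluster would only block IO that actually touches the same byte range,
instead of serialising all IO to the image file.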

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2020-09-21  1:20 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAHk-=wiZnE409WkTOG6fbF_eV1LgrHBvMtyKkpTqM9zT5hpf9A@mail.gmail.com>
     [not found] ` <4ced9401-de3d-b7c9-9976-2739e837fafc@MichaelLarabel.com>
     [not found]   ` <CAHk-=wj+Qj=wXByMrAx3T8jmw=soUetioRrbz6dQaECx+zjMtg@mail.gmail.com>
     [not found]     ` <CAHk-=wgOPjbJsj-LeLc-JMx9Sz9DjGF66Q+jQFJROt9X9utdBg@mail.gmail.com>
     [not found]       ` <CAHk-=wjjK7PTnDZNi039yBxSHtAqusFoRrZzgMNTiYkJYdNopw@mail.gmail.com>
     [not found]         ` <aa90f272-1186-f9e1-8fdb-eefd332fdae8@MichaelLarabel.com>
     [not found]           ` <CAHk-=wh_31_XBNHbdF7EUJceLpEpwRxVF+_1TONzyBUym6Pw4w@mail.gmail.com>
     [not found]             ` <e24ef34d-7b1d-dd99-082d-28ca285a79ff@MichaelLarabel.com>
     [not found]               ` <CAHk-=wgEE4GuNjcRaaAvaS97tW+239-+tjcPjTq2FGhEuM8HYg@mail.gmail.com>
     [not found]                 ` <6e1d8740-2594-c58b-ff02-a04df453d53c@MichaelLarabel.com>
     [not found]                   ` <CAHk-=wgJ3-cEkU-5zXFPvRCHKkCCuKxVauYWGphjePEhJJgtgQ@mail.gmail.com>
     [not found]                     ` <d2023f4c-ef14-b877-b5bb-e4f8af332abc@MichaelLarabel.com>
     [not found]                       ` <CAHk-=wiz=J=8mJ=zRG93nuJ9GtQAm5bSRAbWJbWZuN4Br38+EQ@mail.gmail.com>
2020-09-11  0:05                         ` Kernel Benchmarking Linus Torvalds
2020-09-11  0:49                           ` Michael Larabel
2020-09-11  2:20                             ` Linus Torvalds
     [not found]                               ` <0cbc959e-1b8d-8d7e-1dc6-672cf5b3899a@MichaelLarabel.com>
2020-09-11 16:19                                 ` Linus Torvalds
2020-09-11 22:07                                   ` Linus Torvalds
2020-09-11 22:37                                     ` Michael Larabel
2020-09-12  7:28                                       ` Amir Goldstein
2020-09-12 10:32                                         ` Michael Larabel
2020-09-12 14:37                                           ` Matthew Wilcox
2020-09-12 14:44                                             ` Michael Larabel
2020-09-15  3:32                                               ` Matthew Wilcox
2020-09-15 10:39                                                 ` Jan Kara
2020-09-15 13:52                                                   ` Matthew Wilcox
     [not found]                                             ` <658ae026-32d9-0a25-5a59-9c510d6898d5@MichaelLarabel.com>
2020-09-14 17:47                                               ` Linus Torvalds
2020-09-14 20:21                                                 ` Matthieu Baerts
2020-09-14 20:53                                                   ` Linus Torvalds
2020-09-15  0:42                                                     ` Linus Torvalds
2020-09-15 15:34                                                     ` Matthieu Baerts
2020-09-15 18:27                                                       ` Linus Torvalds
2020-09-15 18:47                                                         ` Linus Torvalds
2020-09-15 19:26                                                           ` Matthieu Baerts
2020-09-15 19:32                                                             ` Linus Torvalds
2020-09-15 19:56                                                               ` Matthieu Baerts
2020-09-15 23:35                                                                 ` Linus Torvalds
2020-09-16 10:34                                                                   ` Jan Kara
2020-09-16 18:47                                                                     ` Linus Torvalds
     [not found]                                                         ` <9a92bf16-02c5-ba38-33c7-f350588ac874@tessares.net>
2020-09-15 19:24                                                           ` Linus Torvalds
2020-09-15 19:38                                                             ` Matthieu Baerts
2020-09-15 18:31                                                       ` Linus Torvalds
2020-09-15 14:21                                                 ` Michael Larabel
2020-09-15 17:52                                                   ` Linus Torvalds
2020-09-17 17:51                                                 ` Linus Torvalds
2020-09-17 18:23                                                   ` Matthew Wilcox
2020-09-17 18:30                                                     ` Linus Torvalds
2020-09-17 18:50                                                       ` Matthew Wilcox
2020-09-17 19:00                                                         ` Linus Torvalds
2020-09-17 19:27                                                           ` Matthew Wilcox
2020-09-17 19:47                                                             ` Linus Torvalds
2020-09-18  0:39                                                               ` Sedat Dilek
2020-09-18  0:40                                                                 ` Sedat Dilek
2020-09-18 20:25                                                                   ` Sedat Dilek
2020-09-20 17:06                                                                     ` Linus Torvalds
2020-09-20 17:14                                                                       ` Sedat Dilek
2020-09-20 17:40                                                                         ` Linus Torvalds
2020-09-20 18:00                                                                           ` Sedat Dilek
2020-09-20 23:23                                                               ` Dave Chinner
2020-09-20 23:31                                                                 ` Linus Torvalds
2020-09-20 23:40                                                                   ` Linus Torvalds
2020-09-21  1:20                                                                   ` Dave Chinner
2020-09-12 15:53                                         ` Matthew Wilcox
2020-09-12 17:59                                         ` Linus Torvalds
2020-09-12 20:32                                           ` Rogério Brito
2020-09-14  9:33                                             ` Jan Kara
2020-09-12 20:58                                           ` Josh Triplett
2020-09-12 20:59                                           ` James Bottomley
2020-09-12 21:15                                             ` Linus Torvalds
2020-09-12 22:32                                           ` Matthew Wilcox
2020-09-13  0:40                                           ` Dave Chinner
2020-09-13  2:39                                             ` Linus Torvalds
2020-09-13  3:40                                               ` Matthew Wilcox
2020-09-13 23:45                                               ` Dave Chinner
2020-09-14  3:31                                                 ` Matthew Wilcox
2020-09-15 14:28                                                   ` Chris Mason
2020-09-15  9:27                                                 ` Jan Kara
2020-09-13  3:18                                             ` Matthew Wilcox
