Linux-Fsdevel Archive on lore.kernel.org
* [LSF/MM TOPIC] Congestion
@ 2019-12-31 12:59 Matthew Wilcox
  2020-01-04  9:09 ` Dave Chinner
  2020-01-06 11:55 ` [Lsf-pc] " Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Matthew Wilcox @ 2019-12-31 12:59 UTC (permalink / raw)
  To: lsf-pc, linux-fsdevel, linux-mm


I don't want to present this topic; I merely noticed the problem.
I nominate Jens Axboe and Michal Hocko as session leaders.  See the
thread here:

https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/

Summary: Congestion is broken and has been for years, and everybody's
system is sleeping waiting for congestion that will never clear.

A good outcome for this meeting would be:

 - MM defines what information they want from the block stack.
 - Block stack commits to giving them that information.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM TOPIC] Congestion
  2019-12-31 12:59 [LSF/MM TOPIC] Congestion Matthew Wilcox
@ 2020-01-04  9:09 ` Dave Chinner
  2020-01-06 11:55 ` [Lsf-pc] " Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2020-01-04  9:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Tue, Dec 31, 2019 at 04:59:08AM -0800, Matthew Wilcox wrote:
> 
> I don't want to present this topic; I merely noticed the problem.
> I nominate Jens Axboe and Michal Hocko as session leaders.  See the
> thread here:
> 
> https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
> 
> Summary: Congestion is broken and has been for years, and everybody's
> system is sleeping waiting for congestion that will never clear.

Another symptom: the system does not sleep because no congestion was
ever recorded, so it doesn't back off when it should (the
wait_iff_congested() backoff case).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2019-12-31 12:59 [LSF/MM TOPIC] Congestion Matthew Wilcox
  2020-01-04  9:09 ` Dave Chinner
@ 2020-01-06 11:55 ` Michal Hocko
  2020-01-06 23:21   ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2020-01-06 11:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm, Mel Gorman

On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> 
> I don't want to present this topic; I merely noticed the problem.
> I nominate Jens Axboe and Michal Hocko as session leaders.  See the
> thread here:

Thanks for bringing this up Matthew! The change in the behavior came as
a surprise to me. I can lead the session for the MM side.

> https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
> 
> Summary: Congestion is broken and has been for years, and everybody's
> system is sleeping waiting for congestion that will never clear.
> 
> A good outcome for this meeting would be:
> 
>  - MM defines what information they want from the block stack.

The history of the congestion waiting is kinda hairy, but I will try to
summarize the expectations we used to have, and we can discuss how much
of that has been real and what has lived on as cargo cult. Maybe we will
just find out that we do not need functionality like that anymore. I
believe Mel would be a great contributor to the discussion.


>  - Block stack commits to giving them that information.

-- 
Michal Hocko
SUSE Labs


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-01-06 11:55 ` [Lsf-pc] " Michal Hocko
@ 2020-01-06 23:21   ` Dave Chinner
  2020-01-07  8:23     ` Chris Murphy
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Dave Chinner @ 2020-01-06 23:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman

On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
> On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> > 
> > I don't want to present this topic; I merely noticed the problem.
> > I nominate Jens Axboe and Michal Hocko as session leaders.  See the
> > thread here:
> 
> Thanks for bringing this up Matthew! The change in the behavior came as
> a surprise to me. I can lead the session for the MM side.
> 
> > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
> > 
> > Summary: Congestion is broken and has been for years, and everybody's
> > system is sleeping waiting for congestion that will never clear.
> > 
> > A good outcome for this meeting would be:
> > 
> >  - MM defines what information they want from the block stack.
> 
> The history of the congestion waiting is kinda hairy, but I will try to
> summarize the expectations we used to have, and we can discuss how much
> of that has been real and what has lived on as cargo cult. Maybe we will
> just find out that we do not need functionality like that anymore. I
> believe Mel would be a great contributor to the discussion.

We most definitely do need some form of reclaim throttling based on
IO congestion, because it is trivial to drive the system into swap
storms and OOM killer invocation when there are large dirty slab
caches that require IO to make reclaim progress and there's little
in the way of page cache to reclaim.

This is one of the biggest issues I've come across trying to make
XFS inode reclaim non-blocking - the existing code blocks on inode
writeback IO congestion to throttle the overall reclaim rate and
so prevents swap storms and OOM killer rampages from occurring.

The moment I remove the inode writeback blocking from the reclaim
path and move the backoffs to the core reclaim congestion backoff
algorithms, I see a substantial increase in the typical reclaim scan
priority. This is because the reclaim code does not have an
integrated back-off mechanism that can balance reclaim throttling
between slab cache and page cache reclaim. This results in
insufficient page reclaim backoff under slab cache backoff
conditions, leading to excessive page cache reclaim and swapping out
all the anonymous pages in memory. Then performance goes to hell as
userspace starts to block on page faults, swap thrashing like
this:

page_fault
  swap_in
    alloc page
      direct reclaim
        swap out anon page
          submit_bio
            wbt_throttle


IOWs, page reclaim doesn't back off until userspace gets throttled
in the block layer doing swap out during swap in during page
faults. For these sorts of workloads there should be little to no
swap thrashing occurring - throttling reclaim to the rate at which
inodes are cleaned by async IO dispatcher threads is what is needed
here, not continuing to wind up reclaim priority until swap storms
and the oom killer end up killing the machine...

I also see this when the inode cache load is on a separate device to
the swap partition - both devices end up at 100% utilisation, one
doing inode writeback flat out (about 300,000 inodes/sec from an
inode cache of 5-10 million inodes), the other is swap thrashing
from a page cache of only 250-500 pages in size.

Hence the way congestion was historically dealt with as a "global
condition" still needs to exist in some manner - congestion on a
single device is sufficient to cause the high level reclaim
algorithms to misbehave badly...

Hence it seems to me that having IO load feedback to the memory
reclaim algorithms is most definitely required for memory reclaim to
be able to make the correct decisions about what to reclaim. If the
shrinker for the cache that uses 50% of RAM in the machine is saying
"backoff needed" and its underlying device is
congested and limiting object reclaim rates, then it's a pretty good
indication that reclaim should back off and wait for IO progress to
be made instead of trying to reclaim from other LRUs that hold an
insignificant amount of memory compared to the huge cache that is
backed up waiting on IO completion to make progress....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-01-06 23:21   ` Dave Chinner
@ 2020-01-07  8:23     ` Chris Murphy
  2020-01-07 11:53       ` Michal Hocko
  2020-01-07 11:53     ` Michal Hocko
  2020-01-09 11:07     ` Jan Kara
  2 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2020-01-07  8:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michal Hocko, Matthew Wilcox, lsf-pc, Linux FS Devel, linux-mm,
	Mel Gorman

On Mon, Jan 6, 2020 at 4:21 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
> > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> > >
> > > I don't want to present this topic; I merely noticed the problem.
> > > I nominate Jens Axboe and Michal Hocko as session leaders.  See the
> > > thread here:
> >
> > Thanks for bringing this up Matthew! The change in the behavior came as
> > a surprise to me. I can lead the session for the MM side.
> >
> > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
> > >
> > > Summary: Congestion is broken and has been for years, and everybody's
> > > system is sleeping waiting for congestion that will never clear.
> > >
> > > A good outcome for this meeting would be:
> > >
> > >  - MM defines what information they want from the block stack.
> >
> > The history of the congestion waiting is kinda hairy, but I will try to
> > summarize the expectations we used to have, and we can discuss how much
> > of that has been real and what has lived on as cargo cult. Maybe we will
> > just find out that we do not need functionality like that anymore. I
> > believe Mel would be a great contributor to the discussion.
>
> We most definitely do need some form of reclaim throttling based on
> IO congestion, because it is trivial to drive the system into swap
> storms and OOM killer invocation when there are large dirty slab
> caches that require IO to make reclaim progress and there's little
> in the way of page cache to reclaim.
>
> This is one of the biggest issues I've come across trying to make
> XFS inode reclaim non-blocking - the existing code blocks on inode
> writeback IO congestion to throttle the overall reclaim rate and
> so prevents swap storms and OOM killer rampages from occurring.
>
> The moment I remove the inode writeback blocking from the reclaim
> path and move the backoffs to the core reclaim congestion backoff
> algorithms, I see a substantial increase in the typical reclaim scan
> priority. This is because the reclaim code does not have an
> integrated back-off mechanism that can balance reclaim throttling
> between slab cache and page cache reclaim. This results in
> insufficient page reclaim backoff under slab cache backoff
> conditions, leading to excessive page cache reclaim and swapping out
> all the anonymous pages in memory. Then performance goes to hell as
> userspace starts to block on page faults, swap thrashing like
> this:

This really caught my attention, however unrelated it may actually be.
The gist of my question is: what are distributions doing wrong, that
it's possible for an unprivileged process to take down a system such
that an ordinary user reaches for the power button? [1] More helpful
would be, what should distributions be doing better to avoid the
problem in the first place? User space oom daemons are now popular,
and there's talk about avoiding swap thrashing and oom by strict use
of cgroupsv2 and PSI. Some people say, oh yeah duh, just don't make a
swap device at all, what are you crazy? Then there's swap on ZRAM. And
alas zswap too. So what's actually recommended to help with this
problem?

I don't have many original thoughts, but I can't find a reference for
why my brain is telling me the kernel oom-killer is mainly concerned
about kernel survival in low memory situations, and not user space.
But an approximate is "It is the job of the linux 'oom killer' to
sacrifice one or more processes in order to free up memory for the
system when all else fails." [2]

However, a) failure has happened way before oom-killer is invoked,
back when the GUI became unresponsive, and b) often it kills some
small thing, seemingly freeing up just enough memory that the kernel
is happy to stay in this state for indeterminate time. For my testing
that's 30 minutes, but I'm compelled to defend a user who asserts a
mere 15 second grace period before reaching for the power button.

This isn't a common experience across a broad user population, but
those who have experienced it once are really familiar with it (they
haven't experienced it only once). And I really want to know what can
be done to make the user experience better, but it's not clear to me
how to do that.

[1]
Fedora 30/31 default installation, 8G RAM, 8G swap (on plain SSD
partition), and compile webkitgtk. Within ~5 minutes all RAM is
consumed, and the "swap storm" begins. The GUI stutters, even the mouse
pointer starts to get choppy, and soon after the system is, for all
practical purposes, locked up. Most typically, it stays this way for
30+ minutes. Occasionally oom-killer kicks in and
clobbers something. Sometimes it's one of the compile threads. And
also occasionally it'll be something absurd like sshd, sssd, or
systemd-journald - which really makes no sense at all.

[2]
https://linux-mm.org/OOM_Killer

--
Chris Murphy


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-01-06 23:21   ` Dave Chinner
  2020-01-07  8:23     ` Chris Murphy
@ 2020-01-07 11:53     ` Michal Hocko
  2020-01-09 11:07     ` Jan Kara
  2 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2020-01-07 11:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman

On Tue 07-01-20 10:21:00, Dave Chinner wrote:
> On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
> > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> > > 
> > > I don't want to present this topic; I merely noticed the problem.
> > > I nominate Jens Axboe and Michal Hocko as session leaders.  See the
> > > thread here:
> > 
> > Thanks for bringing this up Matthew! The change in the behavior came as
> > a surprise to me. I can lead the session for the MM side.
> > 
> > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
> > > 
> > > Summary: Congestion is broken and has been for years, and everybody's
> > > system is sleeping waiting for congestion that will never clear.
> > > 
> > > A good outcome for this meeting would be:
> > > 
> > >  - MM defines what information they want from the block stack.
> > 
> > The history of the congestion waiting is kinda hairy, but I will try to
> > summarize the expectations we used to have, and we can discuss how much
> > of that has been real and what has lived on as cargo cult. Maybe we will
> > just find out that we do not need functionality like that anymore. I
> > believe Mel would be a great contributor to the discussion.
> 
> We most definitely do need some form of reclaim throttling based on
> IO congestion, because it is trivial to drive the system into swap
> storms and OOM killer invocation when there are large dirty slab
> caches that require IO to make reclaim progress and there's little
> in the way of page cache to reclaim.

Just to clarify. I do agree that we need some form of throttling. Sorry
if my wording was confusing. What I meant is that I am not sure whether
wait_iff_congested as it is implemented now is the right way. We
definitely have to slow/block the reclaim when there is a lot of dirty
(meta)data. How to do that is a good topic to discuss.

[skipping the rest of the email which has many good points]

-- 
Michal Hocko
SUSE Labs


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-01-07  8:23     ` Chris Murphy
@ 2020-01-07 11:53       ` Michal Hocko
  2020-01-07 20:12         ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2020-01-07 11:53 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Dave Chinner, Matthew Wilcox, lsf-pc, Linux FS Devel, linux-mm,
	Mel Gorman

On Tue 07-01-20 01:23:38, Chris Murphy wrote:
> On Mon, Jan 6, 2020 at 4:21 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
> > > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> > > >
> > > > I don't want to present this topic; I merely noticed the problem.
> > > > I nominate Jens Axboe and Michal Hocko as session leaders.  See the
> > > > thread here:
> > >
> > > Thanks for bringing this up Matthew! The change in the behavior came as
> > > a surprise to me. I can lead the session for the MM side.
> > >
> > > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
> > > >
> > > > Summary: Congestion is broken and has been for years, and everybody's
> > > > system is sleeping waiting for congestion that will never clear.
> > > >
> > > > A good outcome for this meeting would be:
> > > >
> > > >  - MM defines what information they want from the block stack.
> > >
> > > The history of the congestion waiting is kinda hairy, but I will try to
> > > summarize the expectations we used to have, and we can discuss how much
> > > of that has been real and what has lived on as cargo cult. Maybe we will
> > > just find out that we do not need functionality like that anymore. I
> > > believe Mel would be a great contributor to the discussion.
> >
> > We most definitely do need some form of reclaim throttling based on
> > IO congestion, because it is trivial to drive the system into swap
> > storms and OOM killer invocation when there are large dirty slab
> > caches that require IO to make reclaim progress and there's little
> > in the way of page cache to reclaim.
> >
> > This is one of the biggest issues I've come across trying to make
> > XFS inode reclaim non-blocking - the existing code blocks on inode
> > writeback IO congestion to throttle the overall reclaim rate and
> > so prevents swap storms and OOM killer rampages from occurring.
> >
> > The moment I remove the inode writeback blocking from the reclaim
> > path and move the backoffs to the core reclaim congestion backoff
> > algorithms, I see a substantial increase in the typical reclaim scan
> > priority. This is because the reclaim code does not have an
> > integrated back-off mechanism that can balance reclaim throttling
> > between slab cache and page cache reclaim. This results in
> > insufficient page reclaim backoff under slab cache backoff
> > conditions, leading to excessive page cache reclaim and swapping out
> > all the anonymous pages in memory. Then performance goes to hell as
> > userspace starts to block on page faults, swap thrashing like
> > this:
> 
> This really caught my attention, however unrelated it may actually be.
> The gist of my question is: what are distributions doing wrong, that
> it's possible for an unprivileged process to take down a system such
> that an ordinary user reaches for the power button? [1]

Well, a free ticket to all the available memory is the key here, I
believe. Memory cgroups can be of great help in reducing the amount of
memory available to untrusted users. I am not sure whether that would
help your example in the footnote, though. It seems your workload is
reaching a thrashing state. It would be interesting to get some more
data to see whether that is a result of real memory demand or of memory
reclaim misbehavior (it would be great to collect /proc/vmstat data
while the system is behaving like that, in a separate email thread).

> More helpful
> would be, what should distributions be doing better to avoid the
> problem in the first place? User space oom daemons are now popular,
> and there's talk about avoiding swap thrashing and oom by strict use
> of cgroupsv2 and PSI. Some people say, oh yeah duh, just don't make a
> swap device at all, what are you crazy? Then there's swap on ZRAM. And
> alas zswap too. So what's actually recommended to help with this
> problem?

I believe this will be workload specific and it is always appreciated to
report the behavior as mentioned above.

> I don't have many original thoughts, but I can't find a reference for
> why my brain is telling me the kernel oom-killer is mainly concerned
> about kernel survival in low memory situations, and not user space.

This is indeed the case. It is a last resort measure to survive the
memory depletion. Unfortunately the oom detection doesn't cope well with
the thrashing scenarios where the memory is still reclaimable reasonably
easily while the userspace cannot make much progress because it is
refaulting the working set constantly. PSI has been a great step forward
for those workloads. We haven't found a good way to integrate that
information into the oom detection yet, unfortunately because an
acceptable level of refaulting is very workload dependent.

[...]

> [1]
> Fedora 30/31 default installation, 8G RAM, 8G swap (on plain SSD
> partition), and compile webkitgtk. Within ~5 minutes all RAM is
> consumed, and the "swap storm" begins. The GUI stutters, even the mouse
> pointer starts to get choppy, and soon after the system is, for all
> practical purposes, locked up. Most typically, it stays this way for
> 30+ minutes. Occasionally oom-killer kicks in and
> clobbers something. Sometimes it's one of the compile threads. And
> also occasionally it'll be something absurd like sshd, sssd, or
> systemd-journald - which really makes no sense at all.
> 
> [2]
> https://linux-mm.org/OOM_Killer
> 
> --
> Chris Murphy

-- 
Michal Hocko
SUSE Labs


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-01-07 11:53       ` Michal Hocko
@ 2020-01-07 20:12         ` Chris Murphy
  0 siblings, 0 replies; 14+ messages in thread
From: Chris Murphy @ 2020-01-07 20:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chris Murphy, Matthew Wilcox, lsf-pc, Linux FS Devel, linux-mm,
	Mel Gorman

On Tue, Jan 7, 2020 at 4:53 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 07-01-20 01:23:38, Chris Murphy wrote:

> > More helpful
> > would be, what should distributions be doing better to avoid the
> > problem in the first place? User space oom daemons are now popular,
> > and there's talk about avoiding swap thrashing and oom by strict use
> > of cgroupsv2 and PSI. Some people say, oh yeah duh, just don't make a
> > swap device at all, what are you crazy? Then there's swap on ZRAM. And
> > alas zswap too. So what's actually recommended to help with this
> > problem?
>
> I believe this will be workload specific and it is always appreciated to
> report the behavior as mentioned above.

I'll do so in a separate email. But by what mechanism is workload
determined or categorized? And how is the system dynamically
reconfigured to better handle different workloads? These are general
purpose operating systems, of course a user has different workloads
from moment to moment.



--
Chris Murphy


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-01-06 23:21   ` Dave Chinner
  2020-01-07  8:23     ` Chris Murphy
  2020-01-07 11:53     ` Michal Hocko
@ 2020-01-09 11:07     ` Jan Kara
  2020-01-09 23:00       ` Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2020-01-09 11:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	Mel Gorman

On Tue 07-01-20 10:21:00, Dave Chinner wrote:
> On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
> > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> > > 
> > > I don't want to present this topic; I merely noticed the problem.
> > > I nominate Jens Axboe and Michal Hocko as session leaders.  See the
> > > thread here:
> > 
> > Thanks for bringing this up Matthew! The change in the behavior came as
> > a surprise to me. I can lead the session for the MM side.
> > 
> > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
> > > 
> > > Summary: Congestion is broken and has been for years, and everybody's
> > > system is sleeping waiting for congestion that will never clear.
> > > 
> > > A good outcome for this meeting would be:
> > > 
> > >  - MM defines what information they want from the block stack.
> > 
> > The history of the congestion waiting is kinda hairy, but I will try to
> > summarize the expectations we used to have, and we can discuss how much
> > of that has been real and what has lived on as cargo cult. Maybe we will
> > just find out that we do not need functionality like that anymore. I
> > believe Mel would be a great contributor to the discussion.
> 
> We most definitely do need some form of reclaim throttling based on
> IO congestion, because it is trivial to drive the system into swap
> storms and OOM killer invocation when there are large dirty slab
> caches that require IO to make reclaim progress and there's little
> in the way of page cache to reclaim.

Agreed, but I guess the question is how do we implement that in a reliable
fashion? More on that below...

> This is one of the biggest issues I've come across trying to make
> XFS inode reclaim non-blocking - the existing code blocks on inode
> writeback IO congestion to throttle the overall reclaim rate and
> so prevents swap storms and OOM killer rampages from occurring.
> 
> The moment I remove the inode writeback blocking from the reclaim
> path and move the backoffs to the core reclaim congestion backoff
> algorithms, I see a substantial increase in the typical reclaim scan
> priority. This is because the reclaim code does not have an
> integrated back-off mechanism that can balance reclaim throttling
> between slab cache and page cache reclaim. This results in
> insufficient page reclaim backoff under slab cache backoff
> conditions, leading to excessive page cache reclaim and swapping out
> all the anonymous pages in memory. Then performance goes to hell as
> userspace starts to block on page faults, swap thrashing like
> this:
> 
> page_fault
>   swap_in
>     alloc page
>       direct reclaim
>         swap out anon page
>           submit_bio
>             wbt_throttle
> 
> 
> IOWs, page reclaim doesn't back off until userspace gets throttled
> in the block layer doing swap out during swap in during page
> faults. For these sorts of workloads there should be little to no
> swap thrashing occurring - throttling reclaim to the rate at which
> inodes are cleaned by async IO dispatcher threads is what is needed
> here, not continuing to wind up reclaim priority until swap storms
> and the oom killer end up killing the machine...
> 
> I also see this when the inode cache load is on a separate device to
> the swap partition - both devices end up at 100% utilisation, one
> doing inode writeback flat out (about 300,000 inodes/sec from an
> inode cache of 5-10 million inodes), the other is swap thrashing
> from a page cache of only 250-500 pages in size.
> 
> Hence the way congestion was historically dealt with as a "global
> condition" still needs to exist in some manner - congestion on a
> single device is sufficient to cause the high level reclaim
> algorithms to misbehave badly...
> 
> Hence it seems to me that having IO load feedback to the memory
> reclaim algorithms is most definitely required for memory reclaim to
> be able to make the correct decisions about what to reclaim. If the
> shrinker for the cache that uses 50% of RAM in the machine is saying
> "backoff needed" and its underlying device is
> congested and limiting object reclaim rates, then it's a pretty good
> indication that reclaim should back off and wait for IO progress to
> be made instead of trying to reclaim from other LRUs that hold an
> insignificant amount of memory compared to the huge cache that is
> backed up waiting on IO completion to make progress....

Yes and I think here's the key detail: Reclaim really needs to wait for
slab object cleaning to progress so that the slab cache can be shrunk.  This
is related, but not always in a straightforward way, with IO progress and
even less with IO congestion on some device. I can easily imagine that e.g.
cleaning of inodes to reclaim inode slab may not be efficient enough to
utilize the full parallelism of fast storage so the storage will not ever
become congested - sure it's an inefficiency that could be fixed but should
it misguide reclaim? I don't think so... So I think that to solve this
problem in a robust way, we need to provide a mechanism for slab shrinkers
to say something like "hang on, I can reclaim X objects you asked for but
it will take time, I'll signal to you when they are reclaimable". This way
we avoid blocking in the shrinker and can do more efficient async batched
reclaim and on mm side we have the freedom to either wait for slab reclaim
to progress (if this slab is fundamental to memory pressure) or just go try
reclaim something else. Of course, the devil is in the details :).

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-01-09 11:07     ` Jan Kara
@ 2020-01-09 23:00       ` Dave Chinner
  2020-02-05 16:05         ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2020-01-09 23:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	Mel Gorman

On Thu, Jan 09, 2020 at 12:07:51PM +0100, Jan Kara wrote:
> On Tue 07-01-20 10:21:00, Dave Chinner wrote:
> > On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
> > > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> > > > 
> > > > I don't want to present this topic; I merely noticed the problem.
> > > > I nominate Jens Axboe and Michal Hocko as session leaders.  See the
> > > > thread here:
> > > 
> > > Thanks for bringing this up Matthew! The change in the behavior came as
> > > a surprise to me. I can lead the session for the MM side.
> > > 
> > > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
> > > > 
> > > > Summary: Congestion is broken and has been for years, and everybody's
> > > > system is sleeping waiting for congestion that will never clear.
> > > > 
> > > > A good outcome for this meeting would be:
> > > > 
> > > >  - MM defines what information they want from the block stack.
> > > 
> > > The history of the congestion waiting is kinda hairy, but I will try to
> > > summarize the expectations we used to have, and we can discuss how much
> > > of that has been real and what has lived on as cargo cult. Maybe we will
> > > just find out that we do not need functionality like that anymore. I
> > > believe Mel would be a great contributor to the discussion.
> > 
> > We most definitely do need some form of reclaim throttling based on
> > IO congestion, because it is trivial to drive the system into swap
> > storms and OOM killer invocation when there are large dirty slab
> > caches that require IO to make reclaim progress and there's little
> > in the way of page cache to reclaim.
> 
> Agreed, but I guess the question is how do we implement that in a reliable
> fashion? More on that below...
....
> > Hence it seems to me that having IO load feedback to the memory
> > reclaim algorithms is most definitely required for memory reclaim to
> > be able to make the correct decisions about what to reclaim. If the
> > shrinker for the cache that uses 50% of RAM in the machine is saying
> > "backoff needed" and its underlying device is
> > congested and limiting object reclaim rates, then it's a pretty good
> > indication that reclaim should back off and wait for IO progress to
> > be made instead of trying to reclaim from other LRUs that hold an
> > insignificant amount of memory compared to the huge cache that is
> > backed up waiting on IO completion to make progress....
> 
> Yes and I think here's the key detail: Reclaim really needs to wait for
> slab object cleaning to progress so that slab cache can be shrunk.  This
> is related, but not always in a straightforward way, with IO progress and
> even less with IO congestion on some device. I can easily imagine that e.g.
> cleaning of inodes to reclaim inode slab may not be efficient enough to
> utilize full parallelism of a fast storage so the storage will not ever
> become congested

XFS can currently write back inodes at several hundred MB/s if the
underlying storage is capable of sustaining that. i.e. it can drive
hundreds of thousands of metadata IOPS if the underlying storage can
handle that. With the non-blocking reclaim mods, it's all async
writeback, so at least for XFS we will be able to drive fast devices
into congestion.

> - sure it's an inefficiency that could be fixed but should
> it misguide reclaim?

The problem is that even cleaning inodes at this rate, I can't get
reclaim to actually do the right thing. Reclaim is already going
wrong for really fast devices.

> I don't think so... So I think that to solve this
> problem in a robust way, we need to provide a mechanism for slab shrinkers
> to say something like "hang on, I can reclaim X objects you asked for but
> it will take time, I'll signal to you when they are reclaimable". This way
> we avoid blocking in the shrinker and can do more efficient async batched
> reclaim and on mm side we have the freedom to either wait for slab reclaim
> to progress (if this slab is fundamental to memory pressure) or just go try
> reclaim something else. Of course, the devil is in the details :).

That's pretty much exactly what my non-blocking XFS inode reclaim
patches do. It tries to scan, but when it can't make progress it
sets a "need backoff" flag and defers the remaining work and expects
the high level code to make a sensible back-off decision.

The problem is that the decision the high level code makes at the
moment is not sensible - it is "back off for a bit, then increase
the reclaim priority and reclaim from the page cache again". That's
what is driving the swap storms - inode reclaim says "back-off" and
stops trying to do reclaim, and that causes the high level code to
reclaim the page cache harder.

OTOH, if we *block in the inode shrinker* as we do now, then we
don't increase reclaim priority (and hence the amount of page cache
scanning) and so the reclaim algorithms don't drive deeply into
swap-storm conditions.

That's the fundamental problem here - we need to throttle reclaim
without *needing to restart the entire high level reclaim loop*.
This is an architecture problem more than anything - node and memcg
aware shrinkers outnumber the page cache LRU zones by a large
number, but we can't throttle on individual shrinkers and wait for
them to make progress like we can individual page LRU zone lists.
Hence if we want to throttle an individual shrinker, the *only
reliable option* we currently have is for the shrinker to block
itself.

I note that we handle similar "need more individual work" conditions
in other writeback situations. e.g. the BDI has a "b_more_io" list
to park inodes that require more writeback than a single pass. This
allows writeback to *fairly* revisit inodes that require large
amounts of writeback to do more writeback without needing to start a
whole new BDI dirty inode writeback pass.

I suspect that this is the sort of thing we need for reclaim - we
need to park shrinker instances that needed backoff onto a "need
more reclaim" list that we continue to iterate and back-off on until
we've done the reclaim work that this specific reclaim priority pass
required us to do.

And, realistically, to make this all work in a consistent manner,
the zone LRU walkers really should be transitioned to run as shrinker
instances that are node and memcg aware, and so they do individual
backoff and throttling in the same manner that large slab caches do.
This way we end up with an integrated, consistent high level reclaim
management architecture that automatically balances page cache vs
slab cache reclaim balance...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-01-09 23:00       ` Dave Chinner
@ 2020-02-05 16:05         ` Mel Gorman
  2020-02-06 23:19           ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2020-02-05 16:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel,
	linux-mm, Mel Gorman

This thread is ancient but I'm only getting to it now, to express an
interest in the general discussion as much as anything else.

On Fri, Jan 10, 2020 at 10:00:43AM +1100, Dave Chinner wrote:
> > I don't think so... So I think that to solve this
> > problem in a robust way, we need to provide a mechanism for slab shrinkers
> > to say something like "hang on, I can reclaim X objects you asked for but
> > it will take time, I'll signal to you when they are reclaimable". This way
> > we avoid blocking in the shrinker and can do more efficient async batched
> > reclaim and on mm side we have the freedom to either wait for slab reclaim
> > to progress (if this slab is fundamental to memory pressure) or just go try
> > reclaim something else. Of course, the devil is in the details :).
> 
> That's pretty much exactly what my non-blocking XFS inode reclaim
> patches do. It tries to scan, but when it can't make progress it
> sets a "need backoff" flag and defers the remaining work and expects
> the high level code to make a sensible back-off decision.
> 
> The problem is that the decision the high level code makes at the
> moment is not sensible - it is "back off for a bit, then increase
> the reclaim priority and reclaim from the page cache again". That's
> what is driving the swap storms - inode reclaim says "back-off" and
> stops trying to do reclaim, and that causes the high level code to
> reclaim the page cache harder.
> 
> OTOH, if we *block in the inode shrinker* as we do now, then we
> don't increase reclaim priority (and hence the amount of page cache
> scanning) and so the reclaim algorithms don't drive deeply into
> swap-storm conditions.
> 
> That's the fundamental problem here - we need to throttle reclaim
> without *needing to restart the entire high level reclaim loop*.
> This is an architecture problem more than anything - node and memcg
> aware shrinkers outnumber the page cache LRU zones by a large
> number, but we can't throttle on individual shrinkers and wait for
> them to make progress like we can individual page LRU zone lists.
> Hence if we want to throttle an individual shrinker, the *only
> reliable option* we currently have is for the shrinker to block
> itself.
> 

Despite the topic name, I'm leaning towards thinking that this is not a
congestion issue as such. The throttling mechanism based on BDI
partially solved old problems of swap storms, direct-reclaim-issued
writeback (historical) or excessive scanning leading to premature OOM
kill. When reclaim stopped issuing and waiting on writeback it had to
rely on congestion control instead, which was always a bit fragile but
mostly worked until hardware moved on: storage got faster, memories got
larger, or someone did something crazy like buying a second disk.

The common reason that stalling would occur is that large amounts of
dirty/writeback pages were encountered at the tail of the LRU leading to
large amounts of CPU time spent on useless scanning and increasing scan
rates until OOM occurred. It never took into account any other factor
like shrinker state.

But fundamentally what gets a process into trouble is when "reclaim
efficiency" drops. Efficiency is the ratio between reclaim scan and
reclaim steal with perfect efficiency being one page scanned results in
one page reclaimed. As long as reclaim efficiency is perfect, a system
may be thrashing but it's not stalling on writeback. It may still be
stalling on read but that tends to be less harmful.

Blocking on "congestion" caught one very bad condition where efficiency
drops -- excessive dirty/writeback pages on the tail of the file LRU. It
happened to be a common condition such as if a USB stick was being written
but not the only one. When it happened, excessive clean file pages would
be taken, swap storms occur and the system thrashes while the dirty
pages are being cleaned.

Roughly in order of severity, the most relevant causes of efficiency
drops that come to mind are:

o page is unevictable due to mlock (goes to separate list)
o page is accessed and gets activated
o THP has to be split and does another lap through the LRU
o page could not be unmapped (probably heavily shared and should be
  activated anyway)
o page is dirty/writeback and goes back on the LRU
o page has associated buffers that cannot be freed

While I'm nowhere near having enough time to write a prototype, I think
we could throttle reclaim based on recent allocation rate and the
contributors to poor reclaim efficiency.

Recent allocation rate is appropriate because processes dirtying memory
should get caught in balance_dirty_pages(). It's only heavy allocators that
can drive excessive reclaim for multiple unrelated processes. So first,
try and keep a rough track of the recent allocation rate or maybe just
something like the number of consecutive allocations that entered the
slow path due to a low watermark failure.

Once a task enters direct reclaim, track the reasons for poor reclaim
efficiency (like the list above but maybe add shrinkers) and calculate a
score based on weight. An accessed page would have a light weight, a dirty
page would have a heavy weight. Shrinkers could apply some unknown weight
but I don't know what might be sensible or what the relative weighting
would be.

If direct reclaim should continue for another loop, wait on a per-node
waitqueue until kswapd frees pages above the high watermark or a
timeout. The length of the timeout would depend on how heavy an allocator
the process is and the reasons why reclaim efficiency was dropping. The
timeout costs should accumulate while a task remains in direct reclaim
to limit the chance that an unrelated process is punished.

It's all hand-waving but I think this would be enough to detect a heavy
allocator encountering lots of dirty pages at the tail of the LRU at high
frequency without relying on BDI congestion detection. The downside is that if
the system really is thrashing then a light allocator can become a heavy
allocator because it's trying to read itself from swap or fetch hot data.

> And, realistically, to make this all work in a consistent manner,
> the zone LRU walkers really should be transitioned to run as shrinker
> instances that are node and memcg aware, and so they do individual
> backoff and throttling in the same manner that large slab caches do.
> This way we end up with an integrated, consistent high level reclaim
> management architecture that automatically balances page cache vs
> slab cache reclaim balance...
> 

That'd probably make more sense but I don't think it would be mandatory
to get some basic replacement for wait_iff_congested working.

-- 
Mel Gorman
SUSE Labs


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-02-05 16:05         ` Mel Gorman
@ 2020-02-06 23:19           ` Dave Chinner
  2020-02-07  0:08             ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2020-02-06 23:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jan Kara, Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel,
	linux-mm, Mel Gorman

On Wed, Feb 05, 2020 at 04:05:51PM +0000, Mel Gorman wrote:
> This thread is ancient but I'm only getting to it now, to express an
> interest in the general discussion as much as anything else.
> 
> On Fri, Jan 10, 2020 at 10:00:43AM +1100, Dave Chinner wrote:
> > > I don't think so... So I think that to solve this
> > > problem in a robust way, we need to provide a mechanism for slab shrinkers
> > > to say something like "hang on, I can reclaim X objects you asked for but
> > > it will take time, I'll signal to you when they are reclaimable". This way
> > > we avoid blocking in the shrinker and can do more efficient async batched
> > > reclaim and on mm side we have the freedom to either wait for slab reclaim
> > > to progress (if this slab is fundamental to memory pressure) or just go try
> > > reclaim something else. Of course, the devil is in the details :).
> > 
> > That's pretty much exactly what my non-blocking XFS inode reclaim
> > patches do. It tries to scan, but when it can't make progress it
> > sets a "need backoff" flag and defers the remaining work and expects
> > the high level code to make a sensible back-off decision.
> > 
> > The problem is that the decision the high level code makes at the
> > moment is not sensible - it is "back off for a bit, then increase
> > the reclaim priority and reclaim from the page cache again". That's
> > what is driving the swap storms - inode reclaim says "back-off" and
> > stops trying to do reclaim, and that causes the high level code to
> > reclaim the page cache harder.
> > 
> > OTOH, if we *block in the inode shrinker* as we do now, then we
> > don't increase reclaim priority (and hence the amount of page cache
> > scanning) and so the reclaim algorithms don't drive deeply into
> > swap-storm conditions.
> > 
> > That's the fundamental problem here - we need to throttle reclaim
> > without *needing to restart the entire high level reclaim loop*.
> > This is an architecture problem more than anything - node and memcg
> > aware shrinkers outnumber the page cache LRU zones by a large
> > number, but we can't throttle on individual shrinkers and wait for
> > them to make progress like we can individual page LRU zone lists.
> > Hence if we want to throttle an individual shrinker, the *only
> > reliable option* we currently have is for the shrinker to block
> > itself.
> > 
> 
> Despite the topic name, I'm leaning towards thinking that this is not a
> congestion issue as such. The throttling mechanism based on BDI
> partially solved old problems of swap storms, direct-reclaim-issued
> writeback (historical) or excessive scanning leading to premature OOM
> kill. When reclaim stopped issuing and waiting on writeback it had to
> rely on congestion control instead, which was always a bit fragile but
> mostly worked until hardware moved on: storage got faster, memories got
> larger, or someone did something crazy like buying a second disk.

That's because the code didn't evolve with the changing capabilities
of the hardware. Nobody cared because it "mostly worked" and then
when it didn't they worked around it in other ways. e.g. the block
layer writeback throttle largely throttles swap storms by limiting
the amount of swap IO memory reclaim can issue. The issue is that it
doesn't prevent swap storms, just moves them "back out of sight" so
no-one cares about them much again...

> The common reason that stalling would occur is that large amounts of
> dirty/writeback pages were encountered at the tail of the LRU leading to
> large amounts of CPU time spent on useless scanning and increasing scan
> rates until OOM occurred. It never took into account any other factor
> like shrinker state.
> 
> But fundamentally what gets a process into trouble is when "reclaim
> efficiency" drops. Efficiency is the ratio between reclaim scan and
> reclaim steal with perfect efficiency being one page scanned results in
> one page reclaimed. As long as reclaim efficiency is perfect, a system
> may be thrashing but it's not stalling on writeback. It may still be
> stalling on read but that tends to be less harmful.
> 
> Blocking on "congestion" caught one very bad condition where efficiency
> drops -- excessive dirty/writeback pages on the tail of the file LRU. It
> happened to be a common condition such as if a USB stick was being written
> but not the only one. When it happened, excessive clean file pages would
> be taken, swap storms occur and the system thrashes while the dirty
> pages are being cleaned.
> 
> Roughly in order of severity, the most relevant causes of efficiency
> drops that come to mind are:
> 
> o page is unevictable due to mlock (goes to separate list)
> o page is accessed and gets activated
> o THP has to be split and does another lap through the LRU
> o page could not be unmapped (probably heavily shared and should be
>   activated anyway)
> o page is dirty/writeback and goes back on the LRU
> o page has associated buffers that cannot be freed

One of the issues I see here is the focus on the congestion problem
entirely from the point of view of page reclaim. What I tried to
point out above is that we have *all* the same issues with inode
reclaim in the shrinker.

The common reason for stalling inode reclaim is large amounts of
dirty/writeback inodes on the tail of the LRU.

Inode reclaim efficiency drops occur because:

o All LRUs are currently performing reclaim scans, so new direct
  scans only cause lock and/or IO contention rather than increase scan
  rates.
o inode is unevictable because it is currently pinned by the journal
o inode has been referenced and activated, so gets skipped
o inode is locked, so does another lap through the LRU
o inode is dirty/writeback, so does another lap of the LRU

IOWs shrinkers have exactly the same problems as the page LRU
reclaim. This is why I'm advocating for this problem to be solved in
a generic manner, not as a solution focussed entirely around the
requirements of page reclaim.

> While I'm nowhere near having enough time to write a prototype, I think
> we could throttle reclaim based on recent allocation rate and the
> contributors to poor reclaim efficiency.
> 
> Recent allocation rate is appropriate because processes dirtying memory
> should get caught in balance_dirty_pages(). It's only heavy allocators that
> can drive excessive reclaim for multiple unrelated processes. So first,
> try and keep a rough track of the recent allocation rate or maybe just
> something like the number of consecutive allocations that entered the
> slow path due to a low watermark failure.

Inode dirtying is throttled by the filesystem journal space, which
has nothing really to do with memory pressure. And inode allocation
isn't throttled in any way at all, except by memory reclaim when
there is memory pressure.

That's the underlying reason that we've traditionally had to
throttle reclaim under heavy inode allocation pressure - blocking
reclaim on "congestion" in the inode shrinker caught this very bad
condition where efficiency drops off a cliff....

Hence a solution that relies on measuring recent allocation rate
needs to first add all that infrastructure for every shrinker that
does allocation. A solution that only works for the page LRU
infrastructure is not a viable solution to the problems being raised
here.

> Once a task enters direct reclaim, track the reasons for poor reclaim
> efficiency (like the list above but maybe add shrinkers) and calculate a
> score based on weight. An accessed page would have a light weight, a dirty
> page would have a heavy weight. Shrinkers could apply some unknown weight
> but I don't know what might be sensible or what the relative weighting
> would be.

I'm not sure what a weighting might achieve given we might be
scanning millions of objects (we can have hundreds of millions of
cached inodes on the LRUs). An aggregated weight does not indicate
whether we skipped lots of referenced clean inode that would
otherwise be trivial to reclaim, or we skipped a smaller amount of
dirty inodes that will take a *long time* to reclaim because they
require IO....

> If direct reclaim should continue for another loop, wait on a per-node
> waitqueue until kswapd frees pages above the high watermark or a
> timeout. The length of the timeout would depend on how heavy an allocator
> the process is and the reasons why reclaim efficiency was dropping. The
> timeout costs should accumulate while a task remains in direct reclaim
> to limit the chance that an unrelated process is punished.
> 
> It's all hand-waving but I think this would be enough to detect a heavy
> allocator encountering lots of dirty pages at the tail of the LRU at high
> frequency without relying on BDI congestion detection. The downside is if
> the system really is thrashing then a light allocator can become a heavy
> allocator because it's trying to read itself from swap or fetch hot data.

But detecting an abundance of dirty pages/inodes on the LRU doesn't
really solve the problem of determining if and/or how long we should
wait for IO before we try to free more objects. There is no problem
with having lots of dirty pages/inodes on the LRU as long as the IO
subsystem keeps up with the rate at which reclaim is asking them to
be written back via async mechanisms (bdi writeback, metadata
writeback, etc).

The problem comes when we cannot make efficient progress cleaning
pages/inodes on the LRU because the IO subsystem is overloaded and
cannot clean pages/inodes any faster. At this point, we have to wait
for the IO subsystem to make progress and without feedback from the
IO subsystem, we have no idea how fast that progress is made. Hence
we have no idea how long we need to wait before trying to reclaim
again. i.e. the answer can be different depending on hardware
behaviour, not just the current instantaneous reclaim and IO state.

That's the fundamental problem we need to solve, and realistically
it can only be done with some level of feedback from the IO
subsystem.

I'd be quite happy to attach a BDI to the reclaim feedback
structure and tell the high level code "wait on this bdi for X
completions or X milliseconds" rather than the current global "any
BDI that completes an IO and is not congested breaks waiters out of
congestion_wait".  This would also solve the problem of
wait_iff_congested() waiting on any congested BDI (or being woken by
any uncongested BDI) rather than the actual BDI the shrinker is
operating/waiting on.

Further, we can actually refine this by connecting the memcg the
object belongs to to the blk_cgroup that the object is cleaned
through (e.g.  wb_get_lookup()) from inside reclaim.  Hence we
should be able to determine if the blkcg that backs the memcg we are
reclaiming from is slow because it is being throttled, even though
the backing device itself is not congested. IOWs, we will not block
memcg reclaim on IO congestion caused by other memcgs....

> > And, realistically, to make this all work in a consistent manner,
> > the zone LRU walkers really should be transitioned to run as shrinker
> > instances that are node and memcg aware, and so they do individual
> > backoff and throttling in the same manner that large slab caches do.
> > This way we end up with an integrated, consistent high level reclaim
> > management architecture that automatically balances page cache vs
> > slab cache reclaim balance...
> 
> That'd probably make more sense but I don't think it would be mandatory
> to get some basic replacement for wait_iff_congested working.

Sure, that's the near term issue, but long term it just makes no
sense to treat shrinker reclaim differently from page reclaim
because they have all the same requirements for IO backoff feedback
and control.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-02-06 23:19           ` Dave Chinner
@ 2020-02-07  0:08             ` Matthew Wilcox
  2020-02-13  3:18               ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2020-02-07  0:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, Jan Kara, Michal Hocko, lsf-pc, linux-fsdevel,
	linux-mm, Mel Gorman

On Fri, Feb 07, 2020 at 10:19:28AM +1100, Dave Chinner wrote:
> But detecting an abundance of dirty pages/inodes on the LRU doesn't
> really solve the problem of determining if and/or how long we should
> wait for IO before we try to free more objects. There is no problem
> with having lots of dirty pages/inodes on the LRU as long as the IO
> subsystem keeps up with the rate at which reclaim is asking them to
> be written back via async mechanisms (bdi writeback, metadata
> writeback, etc).
> 
> The problem comes when we cannot make efficient progress cleaning
> pages/inodes on the LRU because the IO subsystem is overloaded and
> cannot clean pages/inodes any faster. At this point, we have to wait
> for the IO subsystem to make progress and without feedback from the
> IO subsystem, we have no idea how fast that progress is made. Hence
> we have no idea how long we need to wait before trying to reclaim
> again. i.e. the answer can be different depending on hardware
> behaviour, not just the current instantaneous reclaim and IO state.
> 
> That's the fundamental problem we need to solve, and realistically
> it can only be done with some level of feedback from the IO
> subsystem.

That triggered a memory for me.  Jeremy Kerr presented a paper at LCA2006
on a different model where the device driver pulls dirty things from the VM
rather than having the VM push dirty things to the device driver.  It was
prototyped in K42 rather than Linux, but the idea might be useful.

http://jk.ozlabs.org/projects/k42/
http://jk.ozlabs.org/projects/k42/device-driven-IO-lca06.pdf



* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
  2020-02-07  0:08             ` Matthew Wilcox
@ 2020-02-13  3:18               ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2020-02-13  3:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Mel Gorman, Jan Kara, Michal Hocko, lsf-pc,
	linux-fsdevel, linux-mm, Mel Gorman

On Thu, 6 Feb 2020 16:08:53 -0800 Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Feb 07, 2020 at 10:19:28AM +1100, Dave Chinner wrote:
> > But detecting an abundance of dirty pages/inodes on the LRU doesn't
> > really solve the problem of determining if and/or how long we should
> > wait for IO before we try to free more objects. There is no problem
> > with having lots of dirty pages/inodes on the LRU as long as the IO
> > subsystem keeps up with the rate at which reclaim is asking them to
> > be written back via async mechanisms (bdi writeback, metadata
> > writeback, etc).
> > 
> > The problem comes when we cannot make efficient progress cleaning
> > pages/inodes on the LRU because the IO subsystem is overloaded and
> > cannot clean pages/inodes any faster. At this point, we have to wait
> > for the IO subsystem to make progress and without feedback from the
> > IO subsystem, we have no idea how fast that progress is made. Hence
> > we have no idea how long we need to wait before trying to reclaim
> > again. i.e. the answer can be different depending on hardware
> > behaviour, not just the current instantaneous reclaim and IO state.
> > 
> > That's the fundamental problem we need to solve, and realistically
> > it can only be done with some level of feedback from the IO
> > subsystem.
> 
> That triggered a memory for me.  Jeremy Kerr presented a paper at LCA2006
> on a different model where the device driver pulls dirty things from the VM
> rather than having the VM push dirty things to the device driver.  It was
> prototyped in K42 rather than Linux, but the idea might be useful.
> 
> http://jk.ozlabs.org/projects/k42/
> http://jk.ozlabs.org/projects/k42/device-driven-IO-lca06.pdf

Fun.  The device driver says "I have spare bandwidth so send me some stuff".

But if device drivers could do that, we wouldn't have broken congestion
in the first place ;)



Thread overview: 14+ messages
2019-12-31 12:59 [LSF/MM TOPIC] Congestion Matthew Wilcox
2020-01-04  9:09 ` Dave Chinner
2020-01-06 11:55 ` [Lsf-pc] " Michal Hocko
2020-01-06 23:21   ` Dave Chinner
2020-01-07  8:23     ` Chris Murphy
2020-01-07 11:53       ` Michal Hocko
2020-01-07 20:12         ` Chris Murphy
2020-01-07 11:53     ` Michal Hocko
2020-01-09 11:07     ` Jan Kara
2020-01-09 23:00       ` Dave Chinner
2020-02-05 16:05         ` Mel Gorman
2020-02-06 23:19           ` Dave Chinner
2020-02-07  0:08             ` Matthew Wilcox
2020-02-13  3:18               ` Andrew Morton
