mm, memcg: reclaim more aggressively before high allocator throttling
diff mbox series

Message ID 20200520143712.GA749486@chrisdown.name
State Superseded
Headers show
Series
  • mm, memcg: reclaim more aggressively before high allocator throttling
Related show

Commit Message

Chris Down May 20, 2020, 2:37 p.m. UTC
In Facebook production, we've seen cases where cgroups have been put
into allocator throttling even when they appear to have a lot of slack
file caches which should be trivially reclaimable.

Looking more closely, the problem is that we only try a single cgroup
reclaim walk for each return to usermode before calculating whether or
not we should throttle. This single attempt doesn't produce enough
pressure to shrink for cgroups with a rapidly growing amount of file
caches prior to entering allocator throttling.

As an example, we see that threads in an affected cgroup are stuck in
allocator throttling:

    # for i in $(cat cgroup.threads); do
    >     grep over_high "/proc/$i/stack"
    > done
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150

...however, there is no I/O pressure reported by PSI, despite a lot of
slack file pages:

    # cat memory.pressure
    some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
    full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
    # cat io.pressure
    some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
    full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
    # grep _file memory.stat
    inactive_file 1370939392
    active_file 661635072

This patch changes the behaviour to retry reclaim either until the
current task goes below the 10ms grace period, or we are making no
reclaim progress at all. In the latter case, we enter reclaim throttling
as before.

To a user, there's no intuitive reason for the reclaim behaviour to
differ from hitting memory.high as part of a new allocation, as opposed
to hitting memory.high because someone lowered its value. As such this
also brings an added benefit: it unifies the reclaim behaviour between
the two.

There's precedent for this behaviour: we already do reclaim retries when
writing to memory.{high,max}, in max reclaim, and in the page allocator
itself.

Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
---
 mm/memcontrol.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

Comments

Michal Hocko May 20, 2020, 4:07 p.m. UTC | #1
On Wed 20-05-20 15:37:12, Chris Down wrote:
> In Facebook production, we've seen cases where cgroups have been put
> into allocator throttling even when they appear to have a lot of slack
> file caches which should be trivially reclaimable.
> 
> Looking more closely, the problem is that we only try a single cgroup
> reclaim walk for each return to usermode before calculating whether or
> not we should throttle. This single attempt doesn't produce enough
> pressure to shrink for cgroups with a rapidly growing amount of file
> caches prior to entering allocator throttling.
> 
> As an example, we see that threads in an affected cgroup are stuck in
> allocator throttling:
> 
>     # for i in $(cat cgroup.threads); do
>     >     grep over_high "/proc/$i/stack"
>     > done
>     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
>     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
>     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> 
> ...however, there is no I/O pressure reported by PSI, despite a lot of
> slack file pages:
> 
>     # cat memory.pressure
>     some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
>     full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
>     # cat io.pressure
>     some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
>     full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
>     # grep _file memory.stat
>     inactive_file 1370939392
>     active_file 661635072
> 
> This patch changes the behaviour to retry reclaim either until the
> current task goes below the 10ms grace period, or we are making no
> reclaim progress at all. In the latter case, we enter reclaim throttling
> as before.

Let me try to understand the actual problem. The high memory reclaim has
a target which is proportional to the amount of charged memory. For most
requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
N is the number of memcgs in excess up the hierarchy). I can see to be
insufficient if the memcg is already in a large excess but if the
reclaim can make a forward progress this should just work fine because
each charging context should reclaim at least the contributed amount.

Do you have any insight on why this doesn't work in your situation?
Especially with such a large inactive file list I would be really
surprised if the reclaim was not able to make a forward progress.

Now to your patch. I do not like it much to be honest.
MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in
memory_high_write because the that is an interruptible context so there
shouldn't be a good reason to give up after $FOO number of failed
attempts. try_charge and memory_max_write are slightly different because
we are invoking OOM killer based on the number of failed attempts.

Also if the current high reclaim scaling is insufficient then we should
be handling that via memcg_nr_pages_over_high rather than effectivelly
unbound number of reclaim retries.

That being said, the changelog should be more specific about the
underlying problem and if the real problem is in the reclaim target then
it should be handled by an increased but still fixed size. If the
throttling is just too aggressive and puts task into sleep even when a
reclaim has been performed then the throttling should be fixed.
Johannes Weiner May 20, 2020, 4:51 p.m. UTC | #2
On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote:
> On Wed 20-05-20 15:37:12, Chris Down wrote:
> > In Facebook production, we've seen cases where cgroups have been put
> > into allocator throttling even when they appear to have a lot of slack
> > file caches which should be trivially reclaimable.
> > 
> > Looking more closely, the problem is that we only try a single cgroup
> > reclaim walk for each return to usermode before calculating whether or
> > not we should throttle. This single attempt doesn't produce enough
> > pressure to shrink for cgroups with a rapidly growing amount of file
> > caches prior to entering allocator throttling.
> > 
> > As an example, we see that threads in an affected cgroup are stuck in
> > allocator throttling:
> > 
> >     # for i in $(cat cgroup.threads); do
> >     >     grep over_high "/proc/$i/stack"
> >     > done
> >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > 
> > ...however, there is no I/O pressure reported by PSI, despite a lot of
> > slack file pages:
> > 
> >     # cat memory.pressure
> >     some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
> >     full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
> >     # cat io.pressure
> >     some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
> >     full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
> >     # grep _file memory.stat
> >     inactive_file 1370939392
> >     active_file 661635072
> > 
> > This patch changes the behaviour to retry reclaim either until the
> > current task goes below the 10ms grace period, or we are making no
> > reclaim progress at all. In the latter case, we enter reclaim throttling
> > as before.
> 
> Let me try to understand the actual problem. The high memory reclaim has
> a target which is proportional to the amount of charged memory. For most
> requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
> N is the number of memcgs in excess up the hierarchy). I can see to be
> insufficient if the memcg is already in a large excess but if the
> reclaim can make a forward progress this should just work fine because
> each charging context should reclaim at least the contributed amount.
> 
> Do you have any insight on why this doesn't work in your situation?
> Especially with such a large inactive file list I would be really
> surprised if the reclaim was not able to make a forward progress.

The workload we observed this in was downloading a large file and
writing it to disk, which means that a good chunk of that memory was
dirty. The first reclaim pass appears to make little progress because
it runs into dirty pages.

> Now to your patch. I do not like it much to be honest.
> MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in
> memory_high_write because the that is an interruptible context so there
> shouldn't be a good reason to give up after $FOO number of failed
> attempts. try_charge and memory_max_write are slightly different because
> we are invoking OOM killer based on the number of failed attempts.

The same is true for memory.high. We are invoking the userspace OOM
killer when memory.high reclaim fails and we put tasks to sleep.

The actual number of retries is arbitrary, correct. That's because OOM
is arbitrary. It's a sampled state, and this is our sampling period.

But it's not that important. The much more important thing is that we
continue reclaiming as long as there is forward progress. How many
times we retry when there is no forward progress is less critical -
but it's nice to have the same cutoff for OOM situations everywhere.

> Also if the current high reclaim scaling is insufficient then we should
> be handling that via memcg_nr_pages_over_high rather than effectivelly
> unbound number of reclaim retries.

???
Michal Hocko May 20, 2020, 5:04 p.m. UTC | #3
On Wed 20-05-20 12:51:31, Johannes Weiner wrote:
> On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote:
> > On Wed 20-05-20 15:37:12, Chris Down wrote:
> > > In Facebook production, we've seen cases where cgroups have been put
> > > into allocator throttling even when they appear to have a lot of slack
> > > file caches which should be trivially reclaimable.
> > > 
> > > Looking more closely, the problem is that we only try a single cgroup
> > > reclaim walk for each return to usermode before calculating whether or
> > > not we should throttle. This single attempt doesn't produce enough
> > > pressure to shrink for cgroups with a rapidly growing amount of file
> > > caches prior to entering allocator throttling.
> > > 
> > > As an example, we see that threads in an affected cgroup are stuck in
> > > allocator throttling:
> > > 
> > >     # for i in $(cat cgroup.threads); do
> > >     >     grep over_high "/proc/$i/stack"
> > >     > done
> > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > 
> > > ...however, there is no I/O pressure reported by PSI, despite a lot of
> > > slack file pages:
> > > 
> > >     # cat memory.pressure
> > >     some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
> > >     full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
> > >     # cat io.pressure
> > >     some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
> > >     full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
> > >     # grep _file memory.stat
> > >     inactive_file 1370939392
> > >     active_file 661635072
> > > 
> > > This patch changes the behaviour to retry reclaim either until the
> > > current task goes below the 10ms grace period, or we are making no
> > > reclaim progress at all. In the latter case, we enter reclaim throttling
> > > as before.
> > 
> > Let me try to understand the actual problem. The high memory reclaim has
> > a target which is proportional to the amount of charged memory. For most
> > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
> > N is the number of memcgs in excess up the hierarchy). I can see to be
> > insufficient if the memcg is already in a large excess but if the
> > reclaim can make a forward progress this should just work fine because
> > each charging context should reclaim at least the contributed amount.
> > 
> > Do you have any insight on why this doesn't work in your situation?
> > Especially with such a large inactive file list I would be really
> > surprised if the reclaim was not able to make a forward progress.
> 
> The workload we observed this in was downloading a large file and
> writing it to disk, which means that a good chunk of that memory was
> dirty. The first reclaim pass appears to make little progress because
> it runs into dirty pages.

OK, I see but why does the subsequent reclaim attempt makes a forward
progress? Is this just because dirty pages are flushed in the mean time?
Because if this is the case then the underlying problem seems to be that
the reclaim should be throttled on dirty data.

> > Now to your patch. I do not like it much to be honest.
> > MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in
> > memory_high_write because the that is an interruptible context so there
> > shouldn't be a good reason to give up after $FOO number of failed
> > attempts. try_charge and memory_max_write are slightly different because
> > we are invoking OOM killer based on the number of failed attempts.
> 
> The same is true for memory.high. We are invoking the userspace OOM
> killer when memory.high reclaim fails and we put tasks to sleep.

Right but there is no way to indicate that the reclaim has failed when
writing to memory.high.

> The actual number of retries is arbitrary, correct. That's because OOM
> is arbitrary. It's a sampled state, and this is our sampling period.
> 
> But it's not that important. The much more important thing is that we
> continue reclaiming as long as there is forward progress. How many
> times we retry when there is no forward progress is less critical -
> but it's nice to have the same cutoff for OOM situations everywhere.
> 
> > Also if the current high reclaim scaling is insufficient then we should
> > be handling that via memcg_nr_pages_over_high rather than effectivelly
> > unbound number of reclaim retries.
> 
> ???

I am not sure what you are asking here.
Johannes Weiner May 20, 2020, 5:51 p.m. UTC | #4
On Wed, May 20, 2020 at 07:04:30PM +0200, Michal Hocko wrote:
> On Wed 20-05-20 12:51:31, Johannes Weiner wrote:
> > On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote:
> > > On Wed 20-05-20 15:37:12, Chris Down wrote:
> > > > In Facebook production, we've seen cases where cgroups have been put
> > > > into allocator throttling even when they appear to have a lot of slack
> > > > file caches which should be trivially reclaimable.
> > > > 
> > > > Looking more closely, the problem is that we only try a single cgroup
> > > > reclaim walk for each return to usermode before calculating whether or
> > > > not we should throttle. This single attempt doesn't produce enough
> > > > pressure to shrink for cgroups with a rapidly growing amount of file
> > > > caches prior to entering allocator throttling.
> > > > 
> > > > As an example, we see that threads in an affected cgroup are stuck in
> > > > allocator throttling:
> > > > 
> > > >     # for i in $(cat cgroup.threads); do
> > > >     >     grep over_high "/proc/$i/stack"
> > > >     > done
> > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > > 
> > > > ...however, there is no I/O pressure reported by PSI, despite a lot of
> > > > slack file pages:
> > > > 
> > > >     # cat memory.pressure
> > > >     some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
> > > >     full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
> > > >     # cat io.pressure
> > > >     some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
> > > >     full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
> > > >     # grep _file memory.stat
> > > >     inactive_file 1370939392
> > > >     active_file 661635072
> > > > 
> > > > This patch changes the behaviour to retry reclaim either until the
> > > > current task goes below the 10ms grace period, or we are making no
> > > > reclaim progress at all. In the latter case, we enter reclaim throttling
> > > > as before.
> > > 
> > > Let me try to understand the actual problem. The high memory reclaim has
> > > a target which is proportional to the amount of charged memory. For most
> > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
> > > N is the number of memcgs in excess up the hierarchy). I can see to be
> > > insufficient if the memcg is already in a large excess but if the
> > > reclaim can make a forward progress this should just work fine because
> > > each charging context should reclaim at least the contributed amount.
> > > 
> > > Do you have any insight on why this doesn't work in your situation?
> > > Especially with such a large inactive file list I would be really
> > > surprised if the reclaim was not able to make a forward progress.
> > 
> > The workload we observed this in was downloading a large file and
> > writing it to disk, which means that a good chunk of that memory was
> > dirty. The first reclaim pass appears to make little progress because
> > it runs into dirty pages.
> 
> OK, I see but why does the subsequent reclaim attempt makes a forward
> progress? Is this just because dirty pages are flushed in the mean time?
> Because if this is the case then the underlying problem seems to be that
> the reclaim should be throttled on dirty data.

That's what I assume. Chris wanted to do more reclaim tracing. But is
this actually important beyond maybe curiosity?

We retry every other reclaim invocation on forward progress. There is
not a single naked call to try_to_free_pages(), and this here is the
only exception where we don't loop on try_to_free_mem_cgroup_pages().

And there are very good, widely established reason for that:

Under pressure, page reclaim can struggle to satisfy the reclaim
goal and may return with less pages reclaimed than asked to.

Under concurrency, a parallel allocation can invalidate the reclaim
progress made by a thread.

When either of these happen, the reclaiming thread should not throw
its hands up and give up. It shouldn't invoke the kernel OOM killer,
and it shouldn't go to sleep to trigger the userspace OOM killer.
Reclaim hasn't failed as long as there is forward progress to be made.

This isn't a daring concept, it's standard practice throughout the VM.

I don't quite understand what makes this situation different. It's not
*that* important which of the many known reasons for reclaim to not
succeed on first try has prompted this patch, is it?

> > > Now to your patch. I do not like it much to be honest.
> > > MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in
> > > memory_high_write because the that is an interruptible context so there
> > > shouldn't be a good reason to give up after $FOO number of failed
> > > attempts. try_charge and memory_max_write are slightly different because
> > > we are invoking OOM killer based on the number of failed attempts.
> > 
> > The same is true for memory.high. We are invoking the userspace OOM
> > killer when memory.high reclaim fails and we put tasks to sleep.
> 
> Right but there is no way to indicate that the reclaim has failed when
> writing to memory.high.

I'm less concerned about memory.high-writing than try_charge().

Although IMO it's nice to be consistent and make the same effort as we
would everywhere else to meet the limit before returning from the write().

> > The actual number of retries is arbitrary, correct. That's because OOM
> > is arbitrary. It's a sampled state, and this is our sampling period.
> > 
> > But it's not that important. The much more important thing is that we
> > continue reclaiming as long as there is forward progress. How many
> > times we retry when there is no forward progress is less critical -
> > but it's nice to have the same cutoff for OOM situations everywhere.
> > 
> > > Also if the current high reclaim scaling is insufficient then we should
> > > be handling that via memcg_nr_pages_over_high rather than effectivelly
> > > unbound number of reclaim retries.
> > 
> > ???
> 
> I am not sure what you are asking here.

You expressed that some alternate solution B would be preferable,
without any detail on why you think that is the case.

And it's certainly not obvious or self-explanatory - in particular
because Chris's proposal *is* obvious and self-explanatory, given how
everybody else is already doing loops around page reclaim.
Chris Down May 20, 2020, 8:26 p.m. UTC | #5
Michal Hocko writes:
>Let me try to understand the actual problem. The high memory reclaim has
>a target which is proportional to the amount of charged memory. For most
>requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
>N is the number of memcgs in excess up the hierarchy). I can see to be
>insufficient if the memcg is already in a large excess but if the
>reclaim can make a forward progress this should just work fine because
>each charging context should reclaim at least the contributed amount.
>
>Do you have any insight on why this doesn't work in your situation?
>Especially with such a large inactive file list I would be really
>surprised if the reclaim was not able to make a forward progress.

Reclaim can fail for any number of reasons, which is why we have retries 
sprinkled all over for it already. It doesn't seem hard to believe that it 
might just fail for transient reasons and drive us deeper into the hole as a 
result.

In this case, a.) the application is producing tons of dirty pages, and b.) we 
have really heavy systemwide I/O contention on the affected machines. This high 
load is one of the reasons that direct and kswapd reclaim cannot keep up, and 
thus nr_pages can become a number of orders of magnitude larger than 
SWAP_CLUSTER_MAX. This is trivially reproducible on these machines, it's not an 
edge case.

Putting a trace_printk("%d\n", __LINE__) at non-successful reclaim in 
shrink_page_list shows that what's happening is always (and I really mean 
always) the "dirty page and you're not kswapd" check, as expected:

	if (PageDirty(page)) {
		/*
		 * Only kswapd can writeback filesystem pages
		 * to avoid risk of stack overflow. But avoid
		 * injecting inefficient single-page IO into
		 * flusher writeback as much as possible: only
		 * write pages when we've encountered many
		 * dirty pages, and when we've already scanned
		 * the rest of the LRU for clean pages and see
		 * the same dirty pages again (PageReclaim).
		 */
		if (page_is_file_lru(page) &&
			(!current_is_kswapd() || !PageReclaim(page) ||
			!test_bit(PGDAT_DIRTY, &pgdat->flags))) {
			/*
			 * Immediately reclaim when written back.
			 * Similar in principal to deactivate_page()
			 * except we already have the page isolated
			 * and know it's dirty
			 */
			inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
			SetPageReclaim(page);

			goto activate_locked;
		}

>Now to your patch. I do not like it much to be honest.
>MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in
>memory_high_write because the that is an interruptible context so there
>shouldn't be a good reason to give up after $FOO number of failed
>attempts. try_charge and memory_max_write are slightly different because
>we are invoking OOM killer based on the number of failed attempts.

As Johannes mentioned, the very intent of memory.high is to have it managed 
using a userspace OOM killer, which monitors PSI. As such, I'm not sure this 
distinction means much.
Michal Hocko May 21, 2020, 7:19 a.m. UTC | #6
On Wed 20-05-20 21:26:50, Chris Down wrote:
> Michal Hocko writes:
> > Let me try to understand the actual problem. The high memory reclaim has
> > a target which is proportional to the amount of charged memory. For most
> > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
> > N is the number of memcgs in excess up the hierarchy). I can see to be
> > insufficient if the memcg is already in a large excess but if the
> > reclaim can make a forward progress this should just work fine because
> > each charging context should reclaim at least the contributed amount.
> > 
> > Do you have any insight on why this doesn't work in your situation?
> > Especially with such a large inactive file list I would be really
> > surprised if the reclaim was not able to make a forward progress.
> 
> Reclaim can fail for any number of reasons, which is why we have retries
> sprinkled all over for it already. It doesn't seem hard to believe that it
> might just fail for transient reasons and drive us deeper into the hole as a
> result.

Reclaim can certainly fail. It is however surprising to see it fail with
such a large inactive lru list and reasonably small reclaim target.
Having the full LRU of dirty pages sounds a bit unusual, IO throttling
for v2 and explicit throttling during the reclaim for v1 should prevent
from that. If the reclaim gives up too easily then this should be
addressed at the reclaim level.

> In this case, a.) the application is producing tons of dirty pages, and b.)
> we have really heavy systemwide I/O contention on the affected machines.
> This high load is one of the reasons that direct and kswapd reclaim cannot
> keep up, and thus nr_pages can become a number of orders of magnitude larger
> than SWAP_CLUSTER_MAX. This is trivially reproducible on these machines,
> it's not an edge case.

Please elaborate some more. memcg_nr_pages_over_high shouldn't really
depend on the system wide activity. It should scale with the requested
charges. So yes it can get large for something like a large read/write
which does a lot of allocations in a single syscall before returning to
the userspace.
 
But ok, let's say that the reclaim target is large and then a single
reclaim attempt might fail. Then I am wondering why your patch is not
really targetting to reclaim memcg_nr_pages_over_high pages and instead
push for reclaim down to the high limit.

The main problem I see with that approach is that the loop could easily
lead to reclaim unfairness when a heavy producer which doesn't leave the
kernel (e.g. a large read/write call) can keep a different task doing
all the reclaim work. The loop is effectivelly unbound when there is a
reclaim progress and so the return to the userspace is by no means
proportional to the requested memory/charge.
Michal Hocko May 21, 2020, 7:32 a.m. UTC | #7
On Wed 20-05-20 13:51:35, Johannes Weiner wrote:
> On Wed, May 20, 2020 at 07:04:30PM +0200, Michal Hocko wrote:
> > On Wed 20-05-20 12:51:31, Johannes Weiner wrote:
> > > On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote:
> > > > On Wed 20-05-20 15:37:12, Chris Down wrote:
> > > > > In Facebook production, we've seen cases where cgroups have been put
> > > > > into allocator throttling even when they appear to have a lot of slack
> > > > > file caches which should be trivially reclaimable.
> > > > > 
> > > > > Looking more closely, the problem is that we only try a single cgroup
> > > > > reclaim walk for each return to usermode before calculating whether or
> > > > > not we should throttle. This single attempt doesn't produce enough
> > > > > pressure to shrink for cgroups with a rapidly growing amount of file
> > > > > caches prior to entering allocator throttling.
> > > > > 
> > > > > As an example, we see that threads in an affected cgroup are stuck in
> > > > > allocator throttling:
> > > > > 
> > > > >     # for i in $(cat cgroup.threads); do
> > > > >     >     grep over_high "/proc/$i/stack"
> > > > >     > done
> > > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > > > 
> > > > > ...however, there is no I/O pressure reported by PSI, despite a lot of
> > > > > slack file pages:
> > > > > 
> > > > >     # cat memory.pressure
> > > > >     some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
> > > > >     full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
> > > > >     # cat io.pressure
> > > > >     some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
> > > > >     full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
> > > > >     # grep _file memory.stat
> > > > >     inactive_file 1370939392
> > > > >     active_file 661635072
> > > > > 
> > > > > This patch changes the behaviour to retry reclaim either until the
> > > > > current task goes below the 10ms grace period, or we are making no
> > > > > reclaim progress at all. In the latter case, we enter reclaim throttling
> > > > > as before.
> > > > 
> > > > Let me try to understand the actual problem. The high memory reclaim has
> > > > a target which is proportional to the amount of charged memory. For most
> > > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
> > > > N is the number of memcgs in excess up the hierarchy). I can see to be
> > > > insufficient if the memcg is already in a large excess but if the
> > > > reclaim can make a forward progress this should just work fine because
> > > > each charging context should reclaim at least the contributed amount.
> > > > 
> > > > Do you have any insight on why this doesn't work in your situation?
> > > > Especially with such a large inactive file list I would be really
> > > > surprised if the reclaim was not able to make a forward progress.
> > > 
> > > The workload we observed this in was downloading a large file and
> > > writing it to disk, which means that a good chunk of that memory was
> > > dirty. The first reclaim pass appears to make little progress because
> > > it runs into dirty pages.
> > 
> > OK, I see but why does the subsequent reclaim attempt makes a forward
> > progress? Is this just because dirty pages are flushed in the mean time?
> > Because if this is the case then the underlying problem seems to be that
> > the reclaim should be throttled on dirty data.
> 
> That's what I assume. Chris wanted to do more reclaim tracing. But is
> this actually important beyond maybe curiosity?

Yes, because it might show that there is a deeper problem. Having an
extremely large file list full of dirty data and pre-mature failure for
the reclaim sounds like a problem that is worth looking into closely.

> We retry every other reclaim invocation on forward progress. There is
> not a single naked call to try_to_free_pages(), and this here is the
> only exception where we don't loop on try_to_free_mem_cgroup_pages().

I am not saying the looping over try_to_free_pages is wrong. I do care
about the final reclaim target. That shouldn't be arbitrary. We have
established a target which is proportional to the requested amount of
memory. And there is a good reason for that. If any task tries to
reclaim down to the high limit then this might lead to a large
unfairness when heavy producers piggy back on the active reclaimer(s).

I wouldn't mind to loop over try_to_free_pages to meet the requested
memcg_nr_pages_over_high target.

[...]

> > > > Also if the current high reclaim scaling is insufficient then we should
> > > > be handling that via memcg_nr_pages_over_high rather than effectivelly
> > > > unbound number of reclaim retries.
> > > 
> > > ???
> > 
> > I am not sure what you are asking here.
> 
> You expressed that some alternate solution B would be preferable,
> without any detail on why you think that is the case.
> 
> And it's certainly not obvious or self-explanatory - in particular
> because Chris's proposal *is* obvious and self-explanatory, given how
> everybody else is already doing loops around page reclaim.

Sorry, I could have been less cryptic. I hope the above and my response
to Chris goes into more details why I do not like this proposal and what
is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high
target. If the current calculation of the target is unsufficient - e.g.
in situations where the high limit excess is very large then this should
be reflected in memcg_nr_pages_over_high.

Is it more clear?
Chris Down May 21, 2020, 11:27 a.m. UTC | #8
Michal Hocko writes:
>On Wed 20-05-20 21:26:50, Chris Down wrote:
>> Michal Hocko writes:
>> > Let me try to understand the actual problem. The high memory reclaim has
>> > a target which is proportional to the amount of charged memory. For most
>> > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
>> > N is the number of memcgs in excess up the hierarchy). I can see to be
>> > insufficient if the memcg is already in a large excess but if the
>> > reclaim can make a forward progress this should just work fine because
>> > each charging context should reclaim at least the contributed amount.
>> >
>> > Do you have any insight on why this doesn't work in your situation?
>> > Especially with such a large inactive file list I would be really
>> > surprised if the reclaim was not able to make a forward progress.
>>
>> Reclaim can fail for any number of reasons, which is why we have retries
>> sprinkled all over for it already. It doesn't seem hard to believe that it
>> might just fail for transient reasons and drive us deeper into the hole as a
>> result.
>
>Reclaim can certainly fail. It is however surprising to see it fail with
>such a large inactive lru list and reasonably small reclaim target.

Why do you think the reclaim target is small? In the case of generating tons of 
dirty pages, current->memcg_nr_pages_over_high can grow to be huge (on the 
order of several tens of megabytes or more).

>Having the full LRU of dirty pages sounds a bit unusual, IO throttling
>for v2 and explicit throttling during the reclaim for v1 should prevent
>from that. If the reclaim gives up too easily then this should be
>addressed at the reclaim level.

I'm not sure I agree. Reclaim knows what you asked it to do: reclaim N pages, 
but what to do about the situation when it fails to satisfy that is a job for 
the caller. In this case, we are willing to even tolerate a little bit of 
overage up to the 10ms throttle threshold. In other cases, we want to do other 
checks first before retrying, because the tradeoffs are different. Putting all 
of this inside the reclaim logic seems unwieldy.

>> In this case, a.) the application is producing tons of dirty pages, and b.)
>> we have really heavy systemwide I/O contention on the affected machines.
>> This high load is one of the reasons that direct and kswapd reclaim cannot
>> keep up, and thus nr_pages can become a number of orders of magnitude larger
>> than SWAP_CLUSTER_MAX. This is trivially reproducible on these machines,
>> it's not an edge case.
>
>Please elaborate some more. memcg_nr_pages_over_high shouldn't really
>depend on the system wide activity. It should scale with the requested
>charges. So yes it can get large for something like a large read/write
>which does a lot of allocations in a single syscall before returning to
>the userspace.

It can also get large if a number of subsequent reclaim attempts are making 
progress, but not satisfying demand fully, as is happening here. As a facetious 
example, even if we request N and reclaim can satisfy N-1 each time, eventually 
those single pages can grow to become a non-trivial size.

>But ok, let's say that the reclaim target is large and then a single
>reclaim attempt might fail. Then I am wondering why your patch is not
>really targetting to reclaim memcg_nr_pages_over_high pages and instead
>push for reclaim down to the high limit.
>
>The main problem I see with that approach is that the loop could easily
>lead to reclaim unfairness when a heavy producer which doesn't leave the
>kernel (e.g. a large read/write call) can keep a different task doing
>all the reclaim work. The loop is effectivelly unbound when there is a
>reclaim progress and so the return to the userspace is by no means
>proportional to the requested memory/charge.

It's not unbound when there is reclaim progress, it stops when we are within 
the memory.high throttling grace period. Right after reclaim, we check if 
penalty_jiffies is less than 10ms, and abort and further reclaim or allocator 
throttling:

	retry_reclaim:
		nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);

		/*
		 * memory.high is breached and reclaim is unable to keep up. Throttle
		 * allocators proactively to slow down excessive growth.
		 */
		penalty_jiffies = calculate_high_delay(memcg, nr_pages);

		/*
		 * Don't sleep if the amount of jiffies this memcg owes us is so low
		 * that it's not even worth doing, in an attempt to be nice to those who
		 * go only a small amount over their memory.high value and maybe haven't
		 * been aggressively reclaimed enough yet.
		 */
		if (penalty_jiffies <= HZ / 100)
			goto out;

Regardless, you're pushing for different reclaim semantics for memory.high than 
memory.max here, which requires evidence that the current approach taken for 
memory.max is wrong or causing issues. And sure, you can say that that's 
because in memory.max's case we would have a memcg OOM, but again, that's not 
really different from how memory.high is supposed to work: with a userspace OOM 
killer monitoring it and producing OOM kills as necessary.
Michal Hocko May 21, 2020, 12:04 p.m. UTC | #9
On Thu 21-05-20 12:27:11, Chris Down wrote:
> Michal Hocko writes:
> > On Wed 20-05-20 21:26:50, Chris Down wrote:
> > > Michal Hocko writes:
> > > > Let me try to understand the actual problem. The high memory reclaim has
> > > > a target which is proportional to the amount of charged memory. For most
> > > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
> > > > N is the number of memcgs in excess up the hierarchy). I can see to be
> > > > insufficient if the memcg is already in a large excess but if the
> > > > reclaim can make a forward progress this should just work fine because
> > > > each charging context should reclaim at least the contributed amount.
> > > >
> > > > Do you have any insight on why this doesn't work in your situation?
> > > > Especially with such a large inactive file list I would be really
> > > > surprised if the reclaim was not able to make a forward progress.
> > > 
> > > Reclaim can fail for any number of reasons, which is why we have retries
> > > sprinkled all over for it already. It doesn't seem hard to believe that it
> > > might just fail for transient reasons and drive us deeper into the hole as a
> > > result.
> > 
> > Reclaim can certainly fail. It is however surprising to see it fail with
> > such a large inactive lru list and reasonably small reclaim target.
> 
> Why do you think the reclaim target is small? In the case of generating tons
> of dirty pages, current->memcg_nr_pages_over_high can grow to be huge (on
> the order of several tens of megabytes or more).

Because from my experience there are not tons of charges inside one
syscall usually. Yeah, some syscalls can generate a lot of them but that
shouldn't be a majority.

> > Having the full LRU of dirty pages sounds a bit unusual, IO throttling
> > for v2 and explicit throttling during the reclaim for v1 should prevent
> > from that. If the reclaim gives up too easily then this should be
> > addressed at the reclaim level.
> 
> I'm not sure I agree. Reclaim knows what you asked it to do: reclaim N
> pages, but what to do about the situation when it fails to satisfy that is a
> job for the caller. In this case, we are willing to even tolerate a little
> bit of overage up to the 10ms throttle threshold. In other cases, we want to
> do other checks first before retrying, because the tradeoffs are different.
> Putting all of this inside the reclaim logic seems unwieldy.

That is not what I meant. We do have some throttling inside the reclaim
because failing reclaim too quickly can easily lead to pre mature OOMs.
If that doesn't work then we should have a look why. E.g. it is quite
unexpected to have large LRU full of dirty pages because this suggests
that dirty throttling doesn't work properly.

> > The main problem I see with that approach is that the loop could easily
> > lead to reclaim unfairness when a heavy producer which doesn't leave the
> > kernel (e.g. a large read/write call) can keep a different task doing
> > all the reclaim work. The loop is effectivelly unbound when there is a
> > reclaim progress and so the return to the userspace is by no means
> > proportional to the requested memory/charge.
> 
> It's not unbound when there is reclaim progress, it stops when we are within
> the memory.high throttling grace period. Right after reclaim, we check if
> penalty_jiffies is less than 10ms, and abort and further reclaim or
> allocator throttling:

Just imagine that you have parallel producers increasing the high limit
excess while somebody reclaims those. Sure in practice the loop will be
bounded but the reclaimer might perform much more work on behalf of
other tasks.
Chris Down May 21, 2020, 12:23 p.m. UTC | #10
(I'll leave the dirty throttling discussion to Johannes, because I'm not so 
familiar with that code or its history.)

Michal Hocko writes:
>> > The main problem I see with that approach is that the loop could easily
>> > lead to reclaim unfairness when a heavy producer which doesn't leave the
>> > kernel (e.g. a large read/write call) can keep a different task doing
>> > all the reclaim work. The loop is effectivelly unbound when there is a
>> > reclaim progress and so the return to the userspace is by no means
>> > proportional to the requested memory/charge.
>>
>> It's not unbound when there is reclaim progress, it stops when we are within
>> the memory.high throttling grace period. Right after reclaim, we check if
>> penalty_jiffies is less than 10ms, and abort and further reclaim or
>> allocator throttling:
>
>Just imagine that you have parallel producers increasing the high limit
>excess while somebody reclaims those. Sure in practice the loop will be
>bounded but the reclaimer might perform much more work on behalf of
>other tasks.

A cgroup is a unit and breaking it down into "reclaim fairness" for individual 
tasks like this seems suspect to me. For example, if one task in a cgroup is 
leaking unreclaimable memory like crazy, everyone in that cgroup is going to be 
penalised by allocator throttling as a result, even if they aren't 
"responsible" for that reclaim.

So the options here are as follows when a cgroup is over memory.high and a 
single reclaim isn't enough:

1. Decline further reclaim. Instead, throttle for up to 2 seconds.
2. Keep on reclaiming. Only throttle if we can't get back under memory.high.

The outcome of your suggestion to decline further reclaim is case #1, which is 
significantly more practically "unfair" to that task. Throttling is extremely 
disruptive to tasks and should be a last resort when we've exhausted all other 
practical options. It shouldn't be something you get just because you didn't 
try to reclaim hard enough.
Chris Down May 21, 2020, 12:24 p.m. UTC | #11
Chris Down writes:
>A cgroup is a unit and breaking it down into "reclaim fairness" for 
>individual tasks like this seems suspect to me. For example, if one 
>task in a cgroup is leaking unreclaimable memory like crazy, everyone 
>in that cgroup is going to be penalised by allocator throttling as a 
>result, even if they aren't "responsible" for that reclaim.

s/for that reclaim/for that overage/
Michal Hocko May 21, 2020, 12:28 p.m. UTC | #12
On Thu 21-05-20 12:27:11, Chris Down wrote:
[...]
> Regardless, you're pushing for different reclaim semantics for memory.high
> than memory.max here, which requires evidence that the current approach
> taken for memory.max is wrong or causing issues.

Sorry, I have skipped over this part. Memory high limit reclaim has
historically acted as a best effort action to throttle the
allocation/charge pace. This would work both if the implementation
simply tried to reclaim down to the high limit or if the reclaim is
proportional to the memory consumption by a specific consumer. We do the
later because it is much easier to establish fairness for. If you want
to change that you somehow have to deal with the fairness problem.

And yes, we do not guarantee any fairness for the hard limit or direct
reclaim in general but that behavior is generally problematic and there
should be really strong arguments to move high limit reclaim that
direction IMHO.
Michal Hocko May 21, 2020, 12:37 p.m. UTC | #13
On Thu 21-05-20 13:23:27, Chris Down wrote:
> (I'll leave the dirty throttling discussion to Johannes, because I'm not so
> familiar with that code or its history.)
> 
> Michal Hocko writes:
> > > > The main problem I see with that approach is that the loop could easily
> > > > lead to reclaim unfairness when a heavy producer which doesn't leave the
> > > > kernel (e.g. a large read/write call) can keep a different task doing
> > > > all the reclaim work. The loop is effectivelly unbound when there is a
> > > > reclaim progress and so the return to the userspace is by no means
> > > > proportional to the requested memory/charge.
> > > 
> > > It's not unbound when there is reclaim progress, it stops when we are within
> > > the memory.high throttling grace period. Right after reclaim, we check if
> > > penalty_jiffies is less than 10ms, and abort and further reclaim or
> > > allocator throttling:
> > 
> > Just imagine that you have parallel producers increasing the high limit
> > excess while somebody reclaims those. Sure in practice the loop will be
> > bounded but the reclaimer might perform much more work on behalf of
> > other tasks.
> 
> A cgroup is a unit and breaking it down into "reclaim fairness" for
> individual tasks like this seems suspect to me. For example, if one task in
> a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup
> is going to be penalised by allocator throttling as a result, even if they
> aren't "responsible" for that reclaim.

You are right, but that doesn't mean that it is desirable that some
tasks would be throttled unexpectedly too long because of the other's activity.
We already have that behavior for the direct reclaim and I have to say I
really hate it and had to spend a lot of time debugging latency issues.
Our excuse has been that the system is struggling at that time so any
quality of service is simply out of picture. I do not think the same
argument can be applied to memory.high which doesn't really represent a
mark when the memcg is struggling so hard to drop any signs of fairness
on the floor.

> So the options here are as follows when a cgroup is over memory.high and a
> single reclaim isn't enough:
> 
> 1. Decline further reclaim. Instead, throttle for up to 2 seconds.
> 2. Keep on reclaiming. Only throttle if we can't get back under memory.high.
> 
> The outcome of your suggestion to decline further reclaim is case #1, which
> is significantly more practically "unfair" to that task. Throttling is
> extremely disruptive to tasks and should be a last resort when we've
> exhausted all other practical options. It shouldn't be something you get
> just because you didn't try to reclaim hard enough.

I believe I have asked in other email in this thread. Could you explain
why enforcint the requested target (memcg_nr_pages_over_high) is
insufficient for the problem you are dealing with? Because that would
make sense for large targets to me while it would keep relatively
reasonable semantic of the throttling - aka proportional to the memory
demand rather than the excess.
Chris Down May 21, 2020, 12:57 p.m. UTC | #14
Michal Hocko writes:
>> A cgroup is a unit and breaking it down into "reclaim fairness" for
>> individual tasks like this seems suspect to me. For example, if one task in
>> a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup
>> is going to be penalised by allocator throttling as a result, even if they
>> aren't "responsible" for that reclaim.
>
>You are right, but that doesn't mean that it is desirable that some
>tasks would be throttled unexpectedly too long because of the other's activity.

Are you really talking about throttling, or reclaim? If throttling, tasks are 
already throttled proportionally to how much this allocation is contributing to 
the overage in calculate_high_delay.

If you're talking about reclaim, trying to reason about whether the overage is 
the result of some other task in this cgroup or the task that's allocating 
right now is something that we already know doesn't work well (eg. global OOM). 

>> So the options here are as follows when a cgroup is over memory.high and a
>> single reclaim isn't enough:
>>
>> 1. Decline further reclaim. Instead, throttle for up to 2 seconds.
>> 2. Keep on reclaiming. Only throttle if we can't get back under memory.high.
>>
>> The outcome of your suggestion to decline further reclaim is case #1, which
>> is significantly more practically "unfair" to that task. Throttling is
>> extremely disruptive to tasks and should be a last resort when we've
>> exhausted all other practical options. It shouldn't be something you get
>> just because you didn't try to reclaim hard enough.
>
>I believe I have asked in other email in this thread. Could you explain
>why enforcint the requested target (memcg_nr_pages_over_high) is
>insufficient for the problem you are dealing with? Because that would
>make sense for large targets to me while it would keep relatively
>reasonable semantic of the throttling - aka proportional to the memory
>demand rather than the excess.

memcg_nr_pages_over_high is related to the charge size. As such, if you're way 
over memory.high as a result of transient reclaim failures, but the majority of 
your charges are small, it's going to hard to make meaningful progress:

1. Most nr_pages will be MEMCG_CHARGE_BATCH, which is not enough to help;
2. Large allocations will only get a single reclaim attempt to succeed.

As such, in many cases we're either doomed to successfully reclaim a paltry 
amount of pages, or fail to reclaim a lot of pages. Asking try_to_free_pages() 
to deal with those huge allocations is generally not reasonable, regardless of 
the specifics of why it doesn't work in this case.
Chris Down May 21, 2020, 1:05 p.m. UTC | #15
Chris Down writes:
>>I believe I have asked in other email in this thread. Could you explain
>>why enforcint the requested target (memcg_nr_pages_over_high) is
>>insufficient for the problem you are dealing with? Because that would
>>make sense for large targets to me while it would keep relatively
>>reasonable semantic of the throttling - aka proportional to the memory
>>demand rather than the excess.
>
>memcg_nr_pages_over_high is related to the charge size. As such, if 
>you're way over memory.high as a result of transient reclaim failures, 
>but the majority of your charges are small, it's going to hard to make 
>meaningful progress:
>
>1. Most nr_pages will be MEMCG_CHARGE_BATCH, which is not enough to help;
>2. Large allocations will only get a single reclaim attempt to succeed.
>
>As such, in many cases we're either doomed to successfully reclaim a 
>paltry amount of pages, or fail to reclaim a lot of pages. Asking 
>try_to_free_pages() to deal with those huge allocations is generally 
>not reasonable, regardless of the specifics of why it doesn't work in 
>this case.

Oh, I somehow elided the "enforcing" part of your proposal. Still, there's no 
guarantee even if large allocations are reclaimed fully that we will end up 
going back below memory.high, because even a single other large allocation 
which fails to reclaim can knock us out of whack again.
Michal Hocko May 21, 2020, 1:21 p.m. UTC | #16
On Thu 21-05-20 13:57:59, Chris Down wrote:
> Michal Hocko writes:
> > > A cgroup is a unit and breaking it down into "reclaim fairness" for
> > > individual tasks like this seems suspect to me. For example, if one task in
> > > a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup
> > > is going to be penalised by allocator throttling as a result, even if they
> > > aren't "responsible" for that reclaim.
> > 
> > You are right, but that doesn't mean that it is desirable that some
> > tasks would be throttled unexpectedly too long because of the other's activity.
> 
> Are you really talking about throttling, or reclaim? If throttling, tasks
> are already throttled proportionally to how much this allocation is
> contributing to the overage in calculate_high_delay.

Reclaim is a part of the throttling mechanism. It is a productive side
of it actually.
 
> If you're talking about reclaim, trying to reason about whether the overage
> is the result of some other task in this cgroup or the task that's
> allocating right now is something that we already know doesn't work well
> (eg. global OOM).

I am not sure I follow you here.
Michal Hocko May 21, 2020, 1:28 p.m. UTC | #17
On Thu 21-05-20 14:05:30, Chris Down wrote:
> Chris Down writes:
> > > I believe I have asked in other email in this thread. Could you explain
> > > why enforcint the requested target (memcg_nr_pages_over_high) is
> > > insufficient for the problem you are dealing with? Because that would
> > > make sense for large targets to me while it would keep relatively
> > > reasonable semantic of the throttling - aka proportional to the memory
> > > demand rather than the excess.
> > 
> > memcg_nr_pages_over_high is related to the charge size. As such, if
> > you're way over memory.high as a result of transient reclaim failures,
> > but the majority of your charges are small, it's going to hard to make
> > meaningful progress:
> > 
> > 1. Most nr_pages will be MEMCG_CHARGE_BATCH, which is not enough to help;
> > 2. Large allocations will only get a single reclaim attempt to succeed.
> > 
> > As such, in many cases we're either doomed to successfully reclaim a
> > paltry amount of pages, or fail to reclaim a lot of pages. Asking
> > try_to_free_pages() to deal with those huge allocations is generally not
> > reasonable, regardless of the specifics of why it doesn't work in this
> > case.
> 
> Oh, I somehow elided the "enforcing" part of your proposal. Still, there's
> no guarantee even if large allocations are reclaimed fully that we will end
> up going back below memory.high, because even a single other large
> allocation which fails to reclaim can knock us out of whack again.

Yeah, there is no guarantee and that is fine. Because memory.high is not
about guarantee. It is about a best effort and slowing down the
allocation pace so that the userspace has time to do something about
that.

That being said I would be really curious about how enforcing the
memcg_nr_pages_over_high target works in your setups where you see the
problem. If that doesn't work for some reason and the reclaim should be
more pro-active then I would suggest to scale it via memcg_nr_pages_over_high
rather than essentially keep it around and ignore it. Preserving at
least some form of fairness and predictable behavior is important IMHO
but if there is no way to achieve that then there should be a very good
explanation for that.

I hope that we it is more clear what is our thinking now. I will be FTO
for upcoming days trying to get some rest from email so my response time
will be longer. Will be back on Thursday.
Chris Down May 21, 2020, 1:41 p.m. UTC | #18
Michal Hocko writes:
>On Thu 21-05-20 13:57:59, Chris Down wrote:
>> Michal Hocko writes:
>> > > A cgroup is a unit and breaking it down into "reclaim fairness" for
>> > > individual tasks like this seems suspect to me. For example, if one task in
>> > > a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup
>> > > is going to be penalised by allocator throttling as a result, even if they
>> > > aren't "responsible" for that reclaim.
>> >
>> > You are right, but that doesn't mean that it is desirable that some
>> > tasks would be throttled unexpectedly too long because of the other's activity.
>>
>> Are you really talking about throttling, or reclaim? If throttling, tasks
>> are already throttled proportionally to how much this allocation is
>> contributing to the overage in calculate_high_delay.
>
>Reclaim is a part of the throttling mechanism. It is a productive side
>of it actually.

I guess let's avoid using the term "throttling", since in this context it 
sounds like we're talking about allocator throttling :-)

>> If you're talking about reclaim, trying to reason about whether the overage
>> is the result of some other task in this cgroup or the task that's
>> allocating right now is something that we already know doesn't work well
>> (eg. global OOM).
>
>I am not sure I follow you here.

Let me rephrase: your statement is that it's not desirable "that some task 
would be throttled unexpectedly too long because of [the activity of another 
task also within that cgroup]" (let me know if that's not what you meant). But 
trying to avoid that requires knowing which activity abstractly instigates this 
entire mess in the first place, which we have nowhere near enough context to 
determine.
Johannes Weiner May 21, 2020, 1:51 p.m. UTC | #19
On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote:
> On Wed 20-05-20 13:51:35, Johannes Weiner wrote:
> > On Wed, May 20, 2020 at 07:04:30PM +0200, Michal Hocko wrote:
> > > On Wed 20-05-20 12:51:31, Johannes Weiner wrote:
> > > > On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote:
> > > > > On Wed 20-05-20 15:37:12, Chris Down wrote:
> > > > > > In Facebook production, we've seen cases where cgroups have been put
> > > > > > into allocator throttling even when they appear to have a lot of slack
> > > > > > file caches which should be trivially reclaimable.
> > > > > > 
> > > > > > Looking more closely, the problem is that we only try a single cgroup
> > > > > > reclaim walk for each return to usermode before calculating whether or
> > > > > > not we should throttle. This single attempt doesn't produce enough
> > > > > > pressure to shrink for cgroups with a rapidly growing amount of file
> > > > > > caches prior to entering allocator throttling.
> > > > > > 
> > > > > > As an example, we see that threads in an affected cgroup are stuck in
> > > > > > allocator throttling:
> > > > > > 
> > > > > >     # for i in $(cat cgroup.threads); do
> > > > > >     >     grep over_high "/proc/$i/stack"
> > > > > >     > done
> > > > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > > > >     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> > > > > > 
> > > > > > ...however, there is no I/O pressure reported by PSI, despite a lot of
> > > > > > slack file pages:
> > > > > > 
> > > > > >     # cat memory.pressure
> > > > > >     some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
> > > > > >     full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
> > > > > >     # cat io.pressure
> > > > > >     some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
> > > > > >     full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
> > > > > >     # grep _file memory.stat
> > > > > >     inactive_file 1370939392
> > > > > >     active_file 661635072
> > > > > > 
> > > > > > This patch changes the behaviour to retry reclaim either until the
> > > > > > current task goes below the 10ms grace period, or we are making no
> > > > > > reclaim progress at all. In the latter case, we enter reclaim throttling
> > > > > > as before.
> > > > > 
> > > > > Let me try to understand the actual problem. The high memory reclaim has
> > > > > a target which is proportional to the amount of charged memory. For most
> > > > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
> > > > > N is the number of memcgs in excess up the hierarchy). I can see to be
> > > > > insufficient if the memcg is already in a large excess but if the
> > > > > reclaim can make a forward progress this should just work fine because
> > > > > each charging context should reclaim at least the contributed amount.
> > > > > 
> > > > > Do you have any insight on why this doesn't work in your situation?
> > > > > Especially with such a large inactive file list I would be really
> > > > > surprised if the reclaim was not able to make a forward progress.
> > > > 
> > > > The workload we observed this in was downloading a large file and
> > > > writing it to disk, which means that a good chunk of that memory was
> > > > dirty. The first reclaim pass appears to make little progress because
> > > > it runs into dirty pages.
> > > 
> > > OK, I see but why does the subsequent reclaim attempt makes a forward
> > > progress? Is this just because dirty pages are flushed in the mean time?
> > > Because if this is the case then the underlying problem seems to be that
> > > the reclaim should be throttled on dirty data.
> > 
> > That's what I assume. Chris wanted to do more reclaim tracing. But is
> > this actually important beyond maybe curiosity?
> 
> Yes, because it might show that there is a deeper problem. Having an
> extremely large file list full of dirty data and pre-mature failure for
> the reclaim sounds like a problem that is worth looking into closely.
> 
> > We retry every other reclaim invocation on forward progress. There is
> > not a single naked call to try_to_free_pages(), and this here is the
> > only exception where we don't loop on try_to_free_mem_cgroup_pages().
> 
> I am not saying the looping over try_to_free_pages is wrong. I do care
> about the final reclaim target. That shouldn't be arbitrary. We have
> established a target which is proportional to the requested amount of
> memory. And there is a good reason for that. If any task tries to
> reclaim down to the high limit then this might lead to a large
> unfairness when heavy producers piggy back on the active reclaimer(s).

Why is that different than any other form of reclaim?

> I wouldn't mind to loop over try_to_free_pages to meet the requested
> memcg_nr_pages_over_high target.

Should we do the same for global reclaim? Move reclaim to userspace
resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and
then have everybody just reclaim exactly what they asked for, and punt
interrupts / kthread allocations to a worker/kswapd?

> > > > > Also if the current high reclaim scaling is insufficient then we should
> > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly
> > > > > unbound number of reclaim retries.
> > > > 
> > > > ???
> > > 
> > > I am not sure what you are asking here.
> > 
> > You expressed that some alternate solution B would be preferable,
> > without any detail on why you think that is the case.
> > 
> > And it's certainly not obvious or self-explanatory - in particular
> > because Chris's proposal *is* obvious and self-explanatory, given how
> > everybody else is already doing loops around page reclaim.
> 
> Sorry, I could have been less cryptic. I hope the above and my response
> to Chris goes into more details why I do not like this proposal and what
> is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high
> target. If the current calculation of the target is unsufficient - e.g.
> in situations where the high limit excess is very large then this should
> be reflected in memcg_nr_pages_over_high.
> 
> Is it more clear?

Well you haven't made a good argument why memory.high is actually
different than any other form of reclaim, and why it should be the
only implementation of page reclaim that has special-cased handling
for the inherent "unfairness" or rather raciness of that operation.

You cut these lines from the quote:

  Under pressure, page reclaim can struggle to satisfy the reclaim
  goal and may return with less pages reclaimed than asked to.

  Under concurrency, a parallel allocation can invalidate the reclaim
  progress made by a thread.

Even if we *could* invest more into trying to avoid any unfairness,
you haven't made a point why we actually should do that here
specifically, yet not everywhere else.

(And people have tried to do it for global reclaim[1], but clearly
this isn't a meaningful problem in practice.)

I have a good reason why we shouldn't: because it's special casing
memory.high from other forms of reclaim, and that is a maintainability
problem. We've recently been discussing ways to make the memory.high
implementation stand out less, not make it stand out even more. There
is no solid reason it should be different from memory.max reclaim,
except that it should sleep instead of invoke OOM at the end. It's
already a mess we're trying to get on top of and straighten out, and
you're proposing to add more kinks that will make this work harder.

I have to admit, I'm baffled by this conversation. I consider this a
fairly obvious, idiomatic change, and I cannot relate to the
objections or counter-proposals in the slightest.

[1] http://lkml.iu.edu/hypermail//linux/kernel/0810.0/0169.html
Michal Hocko May 21, 2020, 1:58 p.m. UTC | #20
On Thu 21-05-20 14:41:47, Chris Down wrote:
> Michal Hocko writes:
> > On Thu 21-05-20 13:57:59, Chris Down wrote:
[...]
> > > If you're talking about reclaim, trying to reason about whether the overage
> > > is the result of some other task in this cgroup or the task that's
> > > allocating right now is something that we already know doesn't work well
> > > (eg. global OOM).
> > 
> > I am not sure I follow you here.
> 
> Let me rephrase: your statement is that it's not desirable "that some task
> would be throttled unexpectedly too long because of [the activity of another
> task also within that cgroup]" (let me know if that's not what you meant).
> But trying to avoid that requires knowing which activity abstractly
> instigates this entire mess in the first place, which we have nowhere near
> enough context to determine.

Yeah, if we want to be really precise then you are right, nothing like
that is really feasible for the reclaim. Reclaiming 1 page might be much
more expensive than 100 pages because LRU order doesn't reflect the
cost of the reclaim at all. What, I believe, we want is a best effort,
really. If the reclaim target is somehow bound to the requested amount
of memory then we can at least say that more memory hungry consumers are
reclaiming more. Which is something people can wrap their head around
much easier than a free competition on the reclaim with some hard to
predict losers who do all the work and some lucky ones which just happen
to avoid throttling by a better timing. Really think of the direct
reclaim and how the unfairness suck there.
Chris Down May 21, 2020, 2:22 p.m. UTC | #21
Michal Hocko writes:
>On Thu 21-05-20 14:41:47, Chris Down wrote:
>> Michal Hocko writes:
>> > On Thu 21-05-20 13:57:59, Chris Down wrote:
>[...]
>> > > If you're talking about reclaim, trying to reason about whether the overage
>> > > is the result of some other task in this cgroup or the task that's
>> > > allocating right now is something that we already know doesn't work well
>> > > (eg. global OOM).
>> >
>> > I am not sure I follow you here.
>>
>> Let me rephrase: your statement is that it's not desirable "that some task
>> would be throttled unexpectedly too long because of [the activity of another
>> task also within that cgroup]" (let me know if that's not what you meant).
>> But trying to avoid that requires knowing which activity abstractly
>> instigates this entire mess in the first place, which we have nowhere near
>> enough context to determine.
>
>Yeah, if we want to be really precise then you are right, nothing like
>that is really feasible for the reclaim. Reclaiming 1 page might be much
>more expensive than 100 pages because LRU order doesn't reflect the
>cost of the reclaim at all. What, I believe, we want is a best effort,
>really. If the reclaim target is somehow bound to the requested amount
>of memory then we can at least say that more memory hungry consumers are
>reclaiming more. Which is something people can wrap their head around
>much easier than a free competition on the reclaim with some hard to
>predict losers who do all the work and some lucky ones which just happen
>to avoid throttling by a better timing. Really think of the direct
>reclaim and how the unfairness suck there.

I really don't follow this logic. You're talking about reclaim-induced latency, 
but the alternative isn't freedom from latency, it's scheduler-induced latency 
from allocator throttling (and probably of a significantly higher magnitude). 
And again, that's totally justified if you are part of a cgroup which is 
significantly above its memory.high -- that's the kind of grouping you sign up 
for when you put multiple tasks in the same cgroup.

The premise of being over memory.high is that everyone in the affected cgroup 
must do their utmost to reclaim where possible, and if they fail to get below 
it again, we're going to deschedule them. *That's* what's best-effort about it.

The losers aren't hard to predict. It's *all* the tasks in this cgroup if they 
don't each make their utmost attempt to get the cgroup's memory back under 
control. Doing more reclaim isn't even in the same magnitude of sucking as 
getting allocator throttled.
Johannes Weiner May 21, 2020, 2:22 p.m. UTC | #22
On Thu, May 21, 2020 at 09:51:55AM -0400, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote:
> > I wouldn't mind to loop over try_to_free_pages to meet the requested
> > memcg_nr_pages_over_high target.
> 
> Should we do the same for global reclaim? Move reclaim to userspace
> resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and
> then have everybody just reclaim exactly what they asked for, and punt
> interrupts / kthread allocations to a worker/kswapd?

Oof, typo: I meant limit reclaim by memory.max and
memory.limit_in_bytes. Not physical memory reclaim of course.

> > > > > > Also if the current high reclaim scaling is insufficient then we should
> > > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly
> > > > > > unbound number of reclaim retries.
> > > > > 
> > > > > ???
> > > > 
> > > > I am not sure what you are asking here.
> > > 
> > > You expressed that some alternate solution B would be preferable,
> > > without any detail on why you think that is the case.
> > > 
> > > And it's certainly not obvious or self-explanatory - in particular
> > > because Chris's proposal *is* obvious and self-explanatory, given how
> > > everybody else is already doing loops around page reclaim.
> > 
> > Sorry, I could have been less cryptic. I hope the above and my response
> > to Chris goes into more details why I do not like this proposal and what
> > is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high
> > target. If the current calculation of the target is unsufficient - e.g.
> > in situations where the high limit excess is very large then this should
> > be reflected in memcg_nr_pages_over_high.
> > 
> > Is it more clear?
> 
> Well you haven't made a good argument why memory.high is actually
> different than any other form of reclaim, and why it should be the
> only implementation of page reclaim that has special-cased handling
> for the inherent "unfairness" or rather raciness of that operation.
> 
> You cut these lines from the quote:
> 
>   Under pressure, page reclaim can struggle to satisfy the reclaim
>   goal and may return with less pages reclaimed than asked to.
> 
>   Under concurrency, a parallel allocation can invalidate the reclaim
>   progress made by a thread.
> 
> Even if we *could* invest more into trying to avoid any unfairness,
> you haven't made a point why we actually should do that here
> specifically, yet not everywhere else.
> 
> (And people have tried to do it for global reclaim[1], but clearly
> this isn't a meaningful problem in practice.)
> 
> I have a good reason why we shouldn't: because it's special casing
> memory.high from other forms of reclaim, and that is a maintainability
> problem. We've recently been discussing ways to make the memory.high
> implementation stand out less, not make it stand out even more. There
> is no solid reason it should be different from memory.max reclaim,
> except that it should sleep instead of invoke OOM at the end. It's
> already a mess we're trying to get on top of and straighten out, and
> you're proposing to add more kinks that will make this work harder.
> 
> I have to admit, I'm baffled by this conversation. I consider this a
> fairly obvious, idiomatic change, and I cannot relate to the
> objections or counter-proposals in the slightest.
> 
> [1] http://lkml.iu.edu/hypermail//linux/kernel/0810.0/0169.html
Michal Hocko May 21, 2020, 2:35 p.m. UTC | #23
On Thu 21-05-20 09:51:52, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote:
[...]
> > I am not saying the looping over try_to_free_pages is wrong. I do care
> > about the final reclaim target. That shouldn't be arbitrary. We have
> > established a target which is proportional to the requested amount of
> > memory. And there is a good reason for that. If any task tries to
> > reclaim down to the high limit then this might lead to a large
> > unfairness when heavy producers piggy back on the active reclaimer(s).
> 
> Why is that different than any other form of reclaim?

Because the high limit reclaim is a best effort rather than must to
either get over reclaim watermarks and continue allocation or meet the
hard limit requirement to continue.

In an ideal world even the global resp. hard limit reclaim should
consider fairness. They don't because that is easier but that sucks. I
have been involved in debugging countless of issues where direct reclaim
was taking too long because of the unfairness. Users simply see that as
bug and I am not surprised.

> > I wouldn't mind to loop over try_to_free_pages to meet the requested
> > memcg_nr_pages_over_high target.
> 
> Should we do the same for global reclaim? Move reclaim to userspace
> resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and
> then have everybody just reclaim exactly what they asked for, and punt
> interrupts / kthread allocations to a worker/kswapd?

This would be quite challenging considering the page allocator wouldn't
be able to make a forward progress without doing any reclaim. But maybe
you can be creative with watermarks.

> > > > > > Also if the current high reclaim scaling is insufficient then we should
> > > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly
> > > > > > unbound number of reclaim retries.
> > > > > 
> > > > > ???
> > > > 
> > > > I am not sure what you are asking here.
> > > 
> > > You expressed that some alternate solution B would be preferable,
> > > without any detail on why you think that is the case.
> > > 
> > > And it's certainly not obvious or self-explanatory - in particular
> > > because Chris's proposal *is* obvious and self-explanatory, given how
> > > everybody else is already doing loops around page reclaim.
> > 
> > Sorry, I could have been less cryptic. I hope the above and my response
> > to Chris goes into more details why I do not like this proposal and what
> > is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high
> > target. If the current calculation of the target is unsufficient - e.g.
> > in situations where the high limit excess is very large then this should
> > be reflected in memcg_nr_pages_over_high.
> > 
> > Is it more clear?
> 
> Well you haven't made a good argument why memory.high is actually
> different than any other form of reclaim, and why it should be the
> only implementation of page reclaim that has special-cased handling
> for the inherent "unfairness" or rather raciness of that operation.
> 
> You cut these lines from the quote:
> 
>   Under pressure, page reclaim can struggle to satisfy the reclaim
>   goal and may return with less pages reclaimed than asked to.
> 
>   Under concurrency, a parallel allocation can invalidate the reclaim
>   progress made by a thread.
> 
> Even if we *could* invest more into trying to avoid any unfairness,
> you haven't made a point why we actually should do that here
> specifically, yet not everywhere else.

I have tried to explain my thinking elsewhere in the thread. The bottom
line is that high limit is a way of throttling rather than meeting a
specific target. With the current implementation we scale the reclaim
activity by the consumer's demand which is something that is not
terribly complex to wrap your head around and reason about. Because the
objective is to not increase the excess much. It offers some sort of
fairness as well. I fully recognize that a full fairness is not
something we can target but working reasonably well most of the time
sounds good enough for me.

> (And people have tried to do it for global reclaim[1], but clearly
> this isn't a meaningful problem in practice.)
> 
> I have a good reason why we shouldn't: because it's special casing
> memory.high from other forms of reclaim, and that is a maintainability
> problem. We've recently been discussing ways to make the memory.high
> implementation stand out less, not make it stand out even more. There
> is no solid reason it should be different from memory.max reclaim,
> except that it should sleep instead of invoke OOM at the end. It's
> already a mess we're trying to get on top of and straighten out, and
> you're proposing to add more kinks that will make this work harder.

I do see your point of course. But I do not give the code consistency
a higher priority than the potential unfairness aspect of the user
visible behavior for something that can do better. Really the direct
reclaim unfairness is really painfull and hard to explain to users. You
can essentially only hand wave that system is struggling so fairness is
not really a priority anymore.

> I have to admit, I'm baffled by this conversation. I consider this a
> fairly obvious, idiomatic change, and I cannot relate to the
> objections or counter-proposals in the slightest.

I have to admit that I would prefer a much less aggressive tone. We are
discussing a topic which is obviously not black and white and there are
different aspects of it.

Thanks!

> [1] http://lkml.iu.edu/hypermail//linux/kernel/0810.0/0169.html
Chris Down May 21, 2020, 3:02 p.m. UTC | #24
Michal Hocko writes:
>> I have a good reason why we shouldn't: because it's special casing
>> memory.high from other forms of reclaim, and that is a maintainability
>> problem. We've recently been discussing ways to make the memory.high
>> implementation stand out less, not make it stand out even more. There
>> is no solid reason it should be different from memory.max reclaim,
>> except that it should sleep instead of invoke OOM at the end. It's
>> already a mess we're trying to get on top of and straighten out, and
>> you're proposing to add more kinks that will make this work harder.
>
>I do see your point of course. But I do not give the code consistency
>a higher priority than the potential unfairness aspect of the user
>visible behavior for something that can do better. Really the direct
>reclaim unfairness is really painfull and hard to explain to users. You
>can essentially only hand wave that system is struggling so fairness is
>not really a priority anymore.

It's not handwaving. When using cgroup features, including memory.high, the 
unit for consideration is a cgroup, not a task. That we happen to act on 
individual tasks in this case is just an implementation detail.

That one task in that cgroup is may be penalised "unfairly" is well within the 
specification: we set limits as part of a cgroup, we account as part of a 
cgroup, and we throttle and reclaim as part of a cgroup. We may make some very 
rudimentary attempts to "be fair" on a per-task basis where that's trivial, but 
that's just one-off niceties, not a statement of precedent.

When exceeding memory.high, the contract is "this cgroup must immediately 
attempt to shrink". Breaking it down per-task in terms of fairness at that 
point doesn't make sense: all the tasks in one cgroup are in it together.
Johannes Weiner May 21, 2020, 4:38 p.m. UTC | #25
On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote:
> On Thu 21-05-20 09:51:52, Johannes Weiner wrote:
> > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote:
> [...]
> > > I am not saying the looping over try_to_free_pages is wrong. I do care
> > > about the final reclaim target. That shouldn't be arbitrary. We have
> > > established a target which is proportional to the requested amount of
> > > memory. And there is a good reason for that. If any task tries to
> > > reclaim down to the high limit then this might lead to a large
> > > unfairness when heavy producers piggy back on the active reclaimer(s).
> > 
> > Why is that different than any other form of reclaim?
> 
> Because the high limit reclaim is a best effort rather than must to
> either get over reclaim watermarks and continue allocation or meet the
> hard limit requirement to continue.

It's not best effort. It's a must-meet or get put to sleep. You are
mistaken about what memory.high is.

> In an ideal world even the global resp. hard limit reclaim should
> consider fairness. They don't because that is easier but that sucks. I
> have been involved in debugging countless of issues where direct reclaim
> was taking too long because of the unfairness. Users simply see that as
> bug and I am not surprised.

Then there should be a generic fix to this problem (like the page
capturing during reclaim).

You're bringing anecdotal evidence that reclaim has a generic problem,
which nobody has seriously tried to fix in recent times, and then ask
people to hack around it in a patch that only brings the behavior for
this specific instance in line with everybody else.

I'm sorry, but this IS a black and white issue, and I think you're out
of line here. If you think reclaim fairness is a problem, it should be
on you to provide concrete data for that and propose changes on how we
do reclaim, instead of asking to hack around it in one callsite -
thereby introducing inconsistencies to userspace between different
limits, as well as inconsistencies and complications for the kernel
developers that actually work on this code (take a look at git blame).

> > > I wouldn't mind to loop over try_to_free_pages to meet the requested
> > > memcg_nr_pages_over_high target.
> > 
> > Should we do the same for global reclaim? Move reclaim to userspace
> > resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and
> > then have everybody just reclaim exactly what they asked for, and punt
> > interrupts / kthread allocations to a worker/kswapd?
> 
> This would be quite challenging considering the page allocator wouldn't
> be able to make a forward progress without doing any reclaim. But maybe
> you can be creative with watermarks.

I clarified in the follow-up email that I meant limit reclaim.

> > > > > > > Also if the current high reclaim scaling is insufficient then we should
> > > > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly
> > > > > > > unbound number of reclaim retries.
> > > > > > 
> > > > > > ???
> > > > > 
> > > > > I am not sure what you are asking here.
> > > > 
> > > > You expressed that some alternate solution B would be preferable,
> > > > without any detail on why you think that is the case.
> > > > 
> > > > And it's certainly not obvious or self-explanatory - in particular
> > > > because Chris's proposal *is* obvious and self-explanatory, given how
> > > > everybody else is already doing loops around page reclaim.
> > > 
> > > Sorry, I could have been less cryptic. I hope the above and my response
> > > to Chris goes into more details why I do not like this proposal and what
> > > is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high
> > > target. If the current calculation of the target is unsufficient - e.g.
> > > in situations where the high limit excess is very large then this should
> > > be reflected in memcg_nr_pages_over_high.
> > > 
> > > Is it more clear?
> > 
> > Well you haven't made a good argument why memory.high is actually
> > different than any other form of reclaim, and why it should be the
> > only implementation of page reclaim that has special-cased handling
> > for the inherent "unfairness" or rather raciness of that operation.
> > 
> > You cut these lines from the quote:
> > 
> >   Under pressure, page reclaim can struggle to satisfy the reclaim
> >   goal and may return with less pages reclaimed than asked to.
> > 
> >   Under concurrency, a parallel allocation can invalidate the reclaim
> >   progress made by a thread.
> > 
> > Even if we *could* invest more into trying to avoid any unfairness,
> > you haven't made a point why we actually should do that here
> > specifically, yet not everywhere else.
> 
> I have tried to explain my thinking elsewhere in the thread. The bottom
> line is that high limit is a way of throttling rather than meeting a
> specific target.

That's an incorrect assumption. Of course it should meet the specific
target that the user specified.

> > (And people have tried to do it for global reclaim[1], but clearly
> > this isn't a meaningful problem in practice.)
> > 
> > I have a good reason why we shouldn't: because it's special casing
> > memory.high from other forms of reclaim, and that is a maintainability
> > problem. We've recently been discussing ways to make the memory.high
> > implementation stand out less, not make it stand out even more. There
> > is no solid reason it should be different from memory.max reclaim,
> > except that it should sleep instead of invoke OOM at the end. It's
> > already a mess we're trying to get on top of and straighten out, and
> > you're proposing to add more kinks that will make this work harder.
> 
> I do see your point of course. But I do not give the code consistency
> a higher priority than the potential unfairness aspect of the user
> visible behavior for something that can do better.

Michal, you have almost no authorship stake in this code base. Would
it be possible to defer judgement on maintainability to people who do?
Michal Hocko May 21, 2020, 5:37 p.m. UTC | #26
On Thu 21-05-20 12:38:33, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote:
> > On Thu 21-05-20 09:51:52, Johannes Weiner wrote:
> > > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote:
> > [...]
> > > > I am not saying the looping over try_to_free_pages is wrong. I do care
> > > > about the final reclaim target. That shouldn't be arbitrary. We have
> > > > established a target which is proportional to the requested amount of
> > > > memory. And there is a good reason for that. If any task tries to
> > > > reclaim down to the high limit then this might lead to a large
> > > > unfairness when heavy producers piggy back on the active reclaimer(s).
> > > 
> > > Why is that different than any other form of reclaim?
> > 
> > Because the high limit reclaim is a best effort rather than must to
> > either get over reclaim watermarks and continue allocation or meet the
> > hard limit requirement to continue.
> 
> It's not best effort. It's a must-meet or get put to sleep. You are
> mistaken about what memory.high is.

I do not see anything like that being documented. Let me remind you what
the documentation says:
  memory.high
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Memory usage throttle limit.  This is the main mechanism to
        control memory usage of a cgroup.  If a cgroup's usage goes
        over the high boundary, the processes of the cgroup are
        throttled and put under heavy reclaim pressure.

        Going over the high limit never invokes the OOM killer and
        under extreme conditions the limit may be breached.

My understanding is that breaching the limit is acceptable if the memory
is not reclaimable after placing a heavy reclaim pressure. We can
discuss what the heavy reclaim means but the underlying fact is that the
keeping the consumption under the limit is a best effort.

Please also let me remind you that the best effort implementation has
been there since the beginning when the memory.high has been introduced.
Now you seem to be convinced that the semantic is _obviously_ different.

It is not the first time when the high limit behavior has changed.
Mostly based on "what is currently happening in your fleet". And can see
why it is reasonable to adopt to a real life usage. That is OK most of
the time. But I haven't heard why keeping the existing approach and
enforcing the reclaim target is not working properly so far. All I can
hear is a generic statement that consistency matters much more than all
potential problem it might introduce.

Anyway, I do see that you are not really willing to have a
non-confrontational discussion so I do not bother to reply to the rest
and participate in the further discussion.

As usual, let me remind you that I haven't nacked the patch. I do not
plan to do that because "this is not black&white" as already said. But
if your really want to push this through then let's do it properly at
least. memcg->memcg_nr_pages_over_high has only very vague meaning if
the reclaim target is the high limit. The changelog should be also
explicit about a potentially large stalls so that people debugging such
a problem have a clue at least.
Johannes Weiner May 21, 2020, 6:45 p.m. UTC | #27
On Thu, May 21, 2020 at 07:37:01PM +0200, Michal Hocko wrote:
> On Thu 21-05-20 12:38:33, Johannes Weiner wrote:
> > On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote:
> > > On Thu 21-05-20 09:51:52, Johannes Weiner wrote:
> > > > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I am not saying the looping over try_to_free_pages is wrong. I do care
> > > > > about the final reclaim target. That shouldn't be arbitrary. We have
> > > > > established a target which is proportional to the requested amount of
> > > > > memory. And there is a good reason for that. If any task tries to
> > > > > reclaim down to the high limit then this might lead to a large
> > > > > unfairness when heavy producers piggy back on the active reclaimer(s).
> > > > 
> > > > Why is that different than any other form of reclaim?
> > > 
> > > Because the high limit reclaim is a best effort rather than must to
> > > either get over reclaim watermarks and continue allocation or meet the
> > > hard limit requirement to continue.
> > 
> > It's not best effort. It's a must-meet or get put to sleep. You are
> > mistaken about what memory.high is.
> 
> I do not see anything like that being documented. Let me remind you what
> the documentation says:
>   memory.high
>         A read-write single value file which exists on non-root
>         cgroups.  The default is "max".
> 
>         Memory usage throttle limit.  This is the main mechanism to
>         control memory usage of a cgroup.  If a cgroup's usage goes
>         over the high boundary, the processes of the cgroup are
>         throttled and put under heavy reclaim pressure.
> 
>         Going over the high limit never invokes the OOM killer and
>         under extreme conditions the limit may be breached.
> 
> My understanding is that breaching the limit is acceptable if the memory
> is not reclaimable after placing a heavy reclaim pressure. We can
> discuss what the heavy reclaim means but the underlying fact is that the
> keeping the consumption under the limit is a best effort.

It says it's the main mechanism to control memory usage, and that
"under extreme conditions the limit may be breached". This doesn't
sound like "let's try some reclaim and see how it goes" to me.

As the person who designed and implemented this feature, it certainly
wasn't the intention.

> Please also let me remind you that the best effort implementation has
> been there since the beginning when the memory.high has been introduced.
> Now you seem to be convinced that the semantic is _obviously_ different.
>
> It is not the first time when the high limit behavior has changed.
> Mostly based on "what is currently happening in your fleet". And can see
> why it is reasonable to adopt to a real life usage. That is OK most of
> the time. But I haven't heard why keeping the existing approach and
> enforcing the reclaim target is not working properly so far. All I can
> hear is a generic statement that consistency matters much more than all
> potential problem it might introduce.

The assumption when writing the first implementation was that the full
reclaim cycle that we impose on every subsequent allocation was enough
to 1) mount a significant effort to push back allocations or 2) if it
fails, at least hold up allocations enough to curb further growth.

As it turned out when deploying this code at scale, reclaim is not
sufficient to achieve #2, because it simply may fail with not that
many pages to scan - especially on systems without swap. So after
observing a violation of the promised behavior, we added the sleeps
for situations where reclaim fails to contain the workload as stated.

After adding the sleeps, we noticed - again after deploying at scale -
that in certain situations reclaim isn't failing but simply returning
early, and we go to sleep and get OOM killed on full file LRUs.

After analyzing this problem, it's clear that we had an oversight
here: all other reclaimers are already familiar with the fact that
reclaim may not be able to complete the reclaim target in one call, or
that page reclaim is inherently racy and reclaim work can be stolen.

We send a simple bug fix: bring this instance of reclaim in line with
how everybody else is using the reclaim API, to meet the semantics as
they are intendend and documented.

And somehow this is controversial, and we're just changing around user
promises as we see fit for our particular usecase?

I don't even understand how the supposed alternate semantics you read
between the lines in the documentation would make for a useful
feature: It may fail to contain a group of offending tasks to the
configured limit, but it will be fair to those tasks while doing so?

> But if your really want to push this through then let's do it
> properly at least. memcg->memcg_nr_pages_over_high has only very
> vague meaning if the reclaim target is the high limit.

task->memcg_nr_pages_over_high is not vague, it's a best-effort
mechanism to distribute fairness. It's the current task's share of the
cgroup's overage, and it allows us in the majority of situations to
distribute reclaim work and sleeps in proportion to how much the task
is actually at fault.

However, due to the inherent raciness of reclaim, this may sometimes
fail. In that situation, it's way better to suffer some unfairness
than to give up, go to sleep, and risk OOM intervention - or give up,
let the task continue and fail memory containment of the cgroup.

Both of these are more important than the concept of "fairness"
between tasks that already share a cgroup and memory pool, and likely
have a myriad of other resource dependencies between them.

> The changelog should be also explicit about a potentially large
> stalls so that people debugging such a problem have a clue at least.

The large stalls you see from hitting your memory limit?

At what point would that be unexpected? All you see is a task inside
reclaim while it's trying to allocate and the cgroup is at its limit.
The same as you would with memory.max and global reclaim.
Michal Hocko May 28, 2020, 4:31 p.m. UTC | #28
On Thu 21-05-20 14:45:05, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 07:37:01PM +0200, Michal Hocko wrote:
> > On Thu 21-05-20 12:38:33, Johannes Weiner wrote:
> > > On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote:
> > > > On Thu 21-05-20 09:51:52, Johannes Weiner wrote:
> > > > > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > I am not saying the looping over try_to_free_pages is wrong. I do care
> > > > > > about the final reclaim target. That shouldn't be arbitrary. We have
> > > > > > established a target which is proportional to the requested amount of
> > > > > > memory. And there is a good reason for that. If any task tries to
> > > > > > reclaim down to the high limit then this might lead to a large
> > > > > > unfairness when heavy producers piggy back on the active reclaimer(s).
> > > > > 
> > > > > Why is that different than any other form of reclaim?
> > > > 
> > > > Because the high limit reclaim is a best effort rather than must to
> > > > either get over reclaim watermarks and continue allocation or meet the
> > > > hard limit requirement to continue.
> > > 
> > > It's not best effort. It's a must-meet or get put to sleep. You are
> > > mistaken about what memory.high is.
> > 
> > I do not see anything like that being documented. Let me remind you what
> > the documentation says:
> >   memory.high
> >         A read-write single value file which exists on non-root
> >         cgroups.  The default is "max".
> > 
> >         Memory usage throttle limit.  This is the main mechanism to
> >         control memory usage of a cgroup.  If a cgroup's usage goes
> >         over the high boundary, the processes of the cgroup are
> >         throttled and put under heavy reclaim pressure.
> > 
> >         Going over the high limit never invokes the OOM killer and
> >         under extreme conditions the limit may be breached.
> > 
> > My understanding is that breaching the limit is acceptable if the memory
> > is not reclaimable after placing a heavy reclaim pressure. We can
> > discuss what the heavy reclaim means but the underlying fact is that the
> > keeping the consumption under the limit is a best effort.
> 
> It says it's the main mechanism to control memory usage, and that
> "under extreme conditions the limit may be breached". This doesn't
> sound like "let's try some reclaim and see how it goes" to me.
> 
> As the person who designed and implemented this feature, it certainly
> wasn't the intention.
> 
> > Please also let me remind you that the best effort implementation has
> > been there since the beginning when the memory.high has been introduced.
> > Now you seem to be convinced that the semantic is _obviously_ different.
> >
> > It is not the first time when the high limit behavior has changed.
> > Mostly based on "what is currently happening in your fleet". And can see
> > why it is reasonable to adopt to a real life usage. That is OK most of
> > the time. But I haven't heard why keeping the existing approach and
> > enforcing the reclaim target is not working properly so far. All I can
> > hear is a generic statement that consistency matters much more than all
> > potential problem it might introduce.
> 
> The assumption when writing the first implementation was that the full
> reclaim cycle that we impose on every subsequent allocation was enough
> to 1) mount a significant effort to push back allocations or 2) if it
> fails, at least hold up allocations enough to curb further growth.
> 
> As it turned out when deploying this code at scale, reclaim is not
> sufficient to achieve #2, because it simply may fail with not that
> many pages to scan - especially on systems without swap. So after
> observing a violation of the promised behavior, we added the sleeps
> for situations where reclaim fails to contain the workload as stated.
> 
> After adding the sleeps, we noticed - again after deploying at scale -
> that in certain situations reclaim isn't failing but simply returning
> early, and we go to sleep and get OOM killed on full file LRUs.
> 
> After analyzing this problem, it's clear that we had an oversight
> here: all other reclaimers are already familiar with the fact that
> reclaim may not be able to complete the reclaim target in one call, or
> that page reclaim is inherently racy and reclaim work can be stolen.

There is no disagreement here.

> We send a simple bug fix: bring this instance of reclaim in line with
> how everybody else is using the reclaim API, to meet the semantics as
> they are intendend and documented.

Here is where we are not on the same page though. Once you have identified
that the main problem is that the reclaim fails too early to meet the
target then the fix would be to enforce that target. I have asked why
this hasn't been done and haven't got any real answer for that. Instead
what you call "a simple bug fix" has larger consequences which are not
really explained in the changelog and they are also not really trivial
to see. If the changelog explicitly stated that the proportional memory
reclaim is not sufficient because XYZ and the implementation has been
changed to instead meet the high limit target then this would be a
completely different story and I believe we could have saved some
discussion.

> And somehow this is controversial, and we're just changing around user
> promises as we see fit for our particular usecase?
> 
> I don't even understand how the supposed alternate semantics you read
> between the lines in the documentation would make for a useful
> feature: It may fail to contain a group of offending tasks to the
> configured limit, but it will be fair to those tasks while doing so?
> 
> > But if your really want to push this through then let's do it
> > properly at least. memcg->memcg_nr_pages_over_high has only very
> > vague meaning if the reclaim target is the high limit.
> 
> task->memcg_nr_pages_over_high is not vague, it's a best-effort
> mechanism to distribute fairness. It's the current task's share of the
> cgroup's overage, and it allows us in the majority of situations to
> distribute reclaim work and sleeps in proportion to how much the task
> is actually at fault.

Agreed. But this stops being the case as soon as the reclaim target has
been reached and new reclaim attempts are enforced because the memcg is
still above the high limit. Because then you have a completely different
reclaim target - get down to the limit. This would be especially visible
with a large memcg_nr_pages_over_high which could even lead to an over
reclaim.
Chris Down May 28, 2020, 4:48 p.m. UTC | #29
Michal Hocko writes:
>> We send a simple bug fix: bring this instance of reclaim in line with
>> how everybody else is using the reclaim API, to meet the semantics as
>> they are intendend and documented.
>
>Here is where we are not on the same page though. Once you have identified
>that the main problem is that the reclaim fails too early to meet the
>target then the fix would be to enforce that target. I have asked why
>this hasn't been done and haven't got any real answer for that. Instead
>what you call "a simple bug fix" has larger consequences which are not
>really explained in the changelog and they are also not really trivial
>to see. If the changelog explicitly stated that the proportional memory
>reclaim is not sufficient because XYZ and the implementation has been
>changed to instead meet the high limit target then this would be a
>completely different story and I believe we could have saved some
>discussion.

I agree that the changelog can be made more clear. Any objection if I send v2 
with changelog changes to that effect, then? :-)

>> And somehow this is controversial, and we're just changing around user
>> promises as we see fit for our particular usecase?
>>
>> I don't even understand how the supposed alternate semantics you read
>> between the lines in the documentation would make for a useful
>> feature: It may fail to contain a group of offending tasks to the
>> configured limit, but it will be fair to those tasks while doing so?
>>
>> > But if your really want to push this through then let's do it
>> > properly at least. memcg->memcg_nr_pages_over_high has only very
>> > vague meaning if the reclaim target is the high limit.
>>
>> task->memcg_nr_pages_over_high is not vague, it's a best-effort
>> mechanism to distribute fairness. It's the current task's share of the
>> cgroup's overage, and it allows us in the majority of situations to
>> distribute reclaim work and sleeps in proportion to how much the task
>> is actually at fault.
>
>Agreed. But this stops being the case as soon as the reclaim target has
>been reached and new reclaim attempts are enforced because the memcg is
>still above the high limit. Because then you have a completely different
>reclaim target - get down to the limit. This would be especially visible
>with a large memcg_nr_pages_over_high which could even lead to an over
>reclaim.

We actually over reclaim even before this patch -- this patch doesn't bring 
much new in that regard.

Tracing try_to_free_pages for a cgroup at the memory.high threshold shows that 
before this change, we sometimes even reclaim on the order of twice the number 
of pages requested. For example, I see cases where we requested 1000 pages to 
be reclaimed, but end up reclaiming 2000 in a single reclaim attempt.
Shakeel Butt May 28, 2020, 6:02 p.m. UTC | #30
I haven't gone through the whole email-chain, so, I might be asking
some repetitive questions. I will go through the email-chain later.

On Wed, May 20, 2020 at 7:37 AM Chris Down <chris@chrisdown.name> wrote:
>
> In Facebook production, we've seen cases where cgroups have been put
> into allocator throttling even when they appear to have a lot of slack
> file caches which should be trivially reclaimable.
>
> Looking more closely, the problem is that we only try a single cgroup
> reclaim walk for each return to usermode before calculating whether or
> not we should throttle. This single attempt doesn't produce enough
> pressure to shrink for cgroups with a rapidly growing amount of file
> caches prior to entering allocator throttling.

In my experience it is usually shrink_slab which requires hammering
multiple times to actually reclaim memory.

>
> As an example, we see that threads in an affected cgroup are stuck in
> allocator throttling:
>
>     # for i in $(cat cgroup.threads); do
>     >     grep over_high "/proc/$i/stack"
>     > done
>     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
>     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
>     [<0>] mem_cgroup_handle_over_high+0x10b/0x150
>
> ...however, there is no I/O pressure reported by PSI, despite a lot of
> slack file pages:
>
>     # cat memory.pressure
>     some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
>     full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
>     # cat io.pressure
>     some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
>     full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
>     # grep _file memory.stat
>     inactive_file 1370939392
>     active_file 661635072
>
> This patch changes the behaviour to retry reclaim either until the
> current task goes below the 10ms grace period, or we are making no
> reclaim progress at all. In the latter case, we enter reclaim throttling
> as before.
>
> To a user, there's no intuitive reason for the reclaim behaviour to
> differ from hitting memory.high as part of a new allocation, as opposed
> to hitting memory.high because someone lowered its value. As such this
> also brings an added benefit: it unifies the reclaim behaviour between
> the two.

What was the initial reason to have different behavior in the first place?

>
> There's precedent for this behaviour: we already do reclaim retries when
> writing to memory.{high,max}, in max reclaim, and in the page allocator
> itself.
>
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>  mm/memcontrol.c | 28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2df9510b7d64..b040951ccd6b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,6 +73,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>
>  struct mem_cgroup *root_mem_cgroup __read_mostly;
>
> +/* The number of times we should retry reclaim failures before giving up. */
>  #define MEM_CGROUP_RECLAIM_RETRIES     5
>
>  /* Socket memory accounting disabled? */
> @@ -2228,17 +2229,22 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>         return 0;
>  }
>
> -static void reclaim_high(struct mem_cgroup *memcg,
> -                        unsigned int nr_pages,
> -                        gfp_t gfp_mask)
> +static unsigned long reclaim_high(struct mem_cgroup *memcg,
> +                                 unsigned int nr_pages,
> +                                 gfp_t gfp_mask)
>  {
> +       unsigned long nr_reclaimed = 0;
> +
>         do {
>                 if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->high))
>                         continue;
>                 memcg_memory_event(memcg, MEMCG_HIGH);
> -               try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> +               nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
> +                                                            gfp_mask, true);
>         } while ((memcg = parent_mem_cgroup(memcg)) &&
>                  !mem_cgroup_is_root(memcg));
> +
> +       return nr_reclaimed;
>  }
>
>  static void high_work_func(struct work_struct *work)
> @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void)
>  {
>         unsigned long penalty_jiffies;
>         unsigned long pflags;
> +       unsigned long nr_reclaimed;
>         unsigned int nr_pages = current->memcg_nr_pages_over_high;

Is there any benefit to keep current->memcg_nr_pages_over_high after
this change? Why not just use SWAP_CLUSTER_MAX?

> +       int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>         struct mem_cgroup *memcg;
>
>         if (likely(!nr_pages))
>                 return;
>
>         memcg = get_mem_cgroup_from_mm(current->mm);
> -       reclaim_high(memcg, nr_pages, GFP_KERNEL);
>         current->memcg_nr_pages_over_high = 0;
>
> +retry_reclaim:
> +       nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
> +
>         /*
>          * memory.high is breached and reclaim is unable to keep up. Throttle
>          * allocators proactively to slow down excessive growth.
> @@ -2403,6 +2413,14 @@ void mem_cgroup_handle_over_high(void)
>         if (penalty_jiffies <= HZ / 100)
>                 goto out;
>
> +       /*
> +        * If reclaim is making forward progress but we're still over
> +        * memory.high, we want to encourage that rather than doing allocator
> +        * throttling.
> +        */
> +       if (nr_reclaimed || nr_retries--)
> +               goto retry_reclaim;
> +
>         /*
>          * If we exit early, we're guaranteed to die (since
>          * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
> --
> 2.26.2
>
Chris Down May 28, 2020, 7:48 p.m. UTC | #31
Shakeel Butt writes:
>What was the initial reason to have different behavior in the first place?

This differing behaviour is simply a mistake, it was never intended to be this 
deviate from what happens elsewhere. To that extent this patch is as much a bug 
fix as it is an improvement.

>>  static void high_work_func(struct work_struct *work)
>> @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void)
>>  {
>>         unsigned long penalty_jiffies;
>>         unsigned long pflags;
>> +       unsigned long nr_reclaimed;
>>         unsigned int nr_pages = current->memcg_nr_pages_over_high;
>
>Is there any benefit to keep current->memcg_nr_pages_over_high after
>this change? Why not just use SWAP_CLUSTER_MAX?

I don't feel strongly either way, but current->memcg_nr_pages_over_high can be 
very large for large allocations.

That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current - high)` 
for each loop? I agree that with this design it looks like perhaps we don't 
need it any more.

Johannes, what do you think?
Johannes Weiner May 28, 2020, 8:11 p.m. UTC | #32
On Thu, May 28, 2020 at 06:31:01PM +0200, Michal Hocko wrote:
> On Thu 21-05-20 14:45:05, Johannes Weiner wrote:
> > After analyzing this problem, it's clear that we had an oversight
> > here: all other reclaimers are already familiar with the fact that
> > reclaim may not be able to complete the reclaim target in one call, or
> > that page reclaim is inherently racy and reclaim work can be stolen.
> 
> There is no disagreement here.
> 
> > We send a simple bug fix: bring this instance of reclaim in line with
> > how everybody else is using the reclaim API, to meet the semantics as
> > they are intendend and documented.
> 
> Here is where we are not on the same page though. Once you have identified
> that the main problem is that the reclaim fails too early to meet the
> target then the fix would be to enforce that target. I have asked why
> this hasn't been done and haven't got any real answer for that.

Then I encourage you to re-read the thread.

I have explained that reclaim invocations can fail to meet the
requested target for a variety of reasons, including dirty state or
other states that make memory temporarily unreclaimable, race
conditions between reclaimers and so forth.

I have also pointed out that this is widely acknowledged by the fact
that all other reclaimers retry in the exact same manner. If you want
to question that VM-wide precedence, please do so in your own patches.

As to the question around fairness, I have explained that fairness is
a best effort and that if push comes to shove, preventing premature
OOM situations or failing cgroup containment and causing system-wide
OOMs is more important.

> Instead what you call "a simple bug fix" has larger consequences
> which are not really explained in the changelog and they are also
> not really trivial to see. If the changelog explicitly stated that
> the proportional memory reclaim is not sufficient because XYZ and
> the implementation has been changed to instead meet the high limit
> target then this would be a completely different story and I believe
> we could have saved some discussion.

The point of memory.high reclaim is to meet the memory.high memory
limit. That, too, has been addressed - although it's astounding that
it needed to be pointed out. The proportionality is an attempt at
fairness that doesn't override the primary purpose.

I appreciate your concerns, but your questions have been addressed.

And you're not contributing anything of value to the conversation
until you familiarize yourself with the purpose of the memory.high
interface.

Thanks
Johannes Weiner May 28, 2020, 8:29 p.m. UTC | #33
On Thu, May 28, 2020 at 08:48:31PM +0100, Chris Down wrote:
> Shakeel Butt writes:
> > What was the initial reason to have different behavior in the first place?
> 
> This differing behaviour is simply a mistake, it was never intended to be
> this deviate from what happens elsewhere. To that extent this patch is as
> much a bug fix as it is an improvement.

Yes, it was an oversight.

> > >  static void high_work_func(struct work_struct *work)
> > > @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void)
> > >  {
> > >         unsigned long penalty_jiffies;
> > >         unsigned long pflags;
> > > +       unsigned long nr_reclaimed;
> > >         unsigned int nr_pages = current->memcg_nr_pages_over_high;
> > 
> > Is there any benefit to keep current->memcg_nr_pages_over_high after
> > this change? Why not just use SWAP_CLUSTER_MAX?

It's there for the same reason why try_to_free_pages() takes a reclaim
argument in the first place: we want to make the thread allocating the
most also do the most reclaim work. Consider a thread allocating THPs
in a loop with another thread allocating regular pages.

Remember that all callers loop. They could theoretically all just ask
for SWAP_CLUSTER_MAX pages over and over again.

The idea is to have fairness in most cases, and avoid allocation
failure, premature OOM, and containment failure in the edge cases that
are caused by the inherent raciness of page reclaim.

> I don't feel strongly either way, but current->memcg_nr_pages_over_high can
> be very large for large allocations.
> 
> That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current -
> high)` for each loop? I agree that with this design it looks like perhaps we
> don't need it any more.
> 
> Johannes, what do you think?

How about this:

Reclaim memcg_nr_pages_over_high in the first iteration, then switch
to SWAP_CLUSTER_MAX in the retries.

This acknowledges that while the page allocator and memory.max reclaim
every time an allocation is made, memory.high is currently batched and
can have larger targets. We want the allocating thread to reclaim at
least the batch size, but beyond that only what's necessary to prevent
premature OOM or failing containment.

Add a comment stating as much.

Once we reclaim memory.high synchronously instead of batched, this
exceptional handling is no longer needed and can be deleted again.

Does that sound reasonable?
Shakeel Butt May 28, 2020, 9:02 p.m. UTC | #34
On Thu, May 28, 2020 at 1:30 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, May 28, 2020 at 08:48:31PM +0100, Chris Down wrote:
> > Shakeel Butt writes:
> > > What was the initial reason to have different behavior in the first place?
> >
> > This differing behaviour is simply a mistake, it was never intended to be
> > this deviate from what happens elsewhere. To that extent this patch is as
> > much a bug fix as it is an improvement.
>
> Yes, it was an oversight.
>
> > > >  static void high_work_func(struct work_struct *work)
> > > > @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void)
> > > >  {
> > > >         unsigned long penalty_jiffies;
> > > >         unsigned long pflags;
> > > > +       unsigned long nr_reclaimed;
> > > >         unsigned int nr_pages = current->memcg_nr_pages_over_high;
> > >
> > > Is there any benefit to keep current->memcg_nr_pages_over_high after
> > > this change? Why not just use SWAP_CLUSTER_MAX?
>
> It's there for the same reason why try_to_free_pages() takes a reclaim
> argument in the first place: we want to make the thread allocating the
> most also do the most reclaim work. Consider a thread allocating THPs
> in a loop with another thread allocating regular pages.
>
> Remember that all callers loop. They could theoretically all just ask
> for SWAP_CLUSTER_MAX pages over and over again.
>
> The idea is to have fairness in most cases, and avoid allocation
> failure, premature OOM, and containment failure in the edge cases that
> are caused by the inherent raciness of page reclaim.
>

Thanks for the explanation.

> > I don't feel strongly either way, but current->memcg_nr_pages_over_high can
> > be very large for large allocations.
> >
> > That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current -
> > high)` for each loop? I agree that with this design it looks like perhaps we
> > don't need it any more.
> >
> > Johannes, what do you think?
>
> How about this:
>
> Reclaim memcg_nr_pages_over_high in the first iteration, then switch
> to SWAP_CLUSTER_MAX in the retries.
>
> This acknowledges that while the page allocator and memory.max reclaim
> every time an allocation is made, memory.high is currently batched and
> can have larger targets. We want the allocating thread to reclaim at
> least the batch size, but beyond that only what's necessary to prevent
> premature OOM or failing containment.
>
> Add a comment stating as much.
>
> Once we reclaim memory.high synchronously instead of batched, this
> exceptional handling is no longer needed and can be deleted again.
>
> Does that sound reasonable?

SGTM. It does not seem controversial to me to let the task do the work
to resolve the condition for which it is being throttled.
Chris Down May 28, 2020, 9:14 p.m. UTC | #35
Johannes Weiner writes:
>> I don't feel strongly either way, but current->memcg_nr_pages_over_high can
>> be very large for large allocations.
>>
>> That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current -
>> high)` for each loop? I agree that with this design it looks like perhaps we
>> don't need it any more.
>>
>> Johannes, what do you think?
>
>How about this:
>
>Reclaim memcg_nr_pages_over_high in the first iteration, then switch
>to SWAP_CLUSTER_MAX in the retries.
>
>This acknowledges that while the page allocator and memory.max reclaim
>every time an allocation is made, memory.high is currently batched and
>can have larger targets. We want the allocating thread to reclaim at
>least the batch size, but beyond that only what's necessary to prevent
>premature OOM or failing containment.
>
>Add a comment stating as much.
>
>Once we reclaim memory.high synchronously instead of batched, this
>exceptional handling is no longer needed and can be deleted again.
>
>Does that sound reasonable?

That sounds good to me, thanks. I'll change that in v2 soonish and update the 
changelog.
Michal Hocko May 29, 2020, 7:25 a.m. UTC | #36
On Thu 28-05-20 16:29:44, Johannes Weiner wrote:
[...]
> How about this:
> 
> Reclaim memcg_nr_pages_over_high in the first iteration, then switch
> to SWAP_CLUSTER_MAX in the retries.

Yes, that makes much more sense to me.
Michal Hocko May 29, 2020, 7:31 a.m. UTC | #37
On Thu 28-05-20 17:48:48, Chris Down wrote:
> Michal Hocko writes:
> > > We send a simple bug fix: bring this instance of reclaim in line with
> > > how everybody else is using the reclaim API, to meet the semantics as
> > > they are intendend and documented.
> > 
> > Here is where we are not on the same page though. Once you have identified
> > that the main problem is that the reclaim fails too early to meet the
> > target then the fix would be to enforce that target. I have asked why
> > this hasn't been done and haven't got any real answer for that. Instead
> > what you call "a simple bug fix" has larger consequences which are not
> > really explained in the changelog and they are also not really trivial
> > to see. If the changelog explicitly stated that the proportional memory
> > reclaim is not sufficient because XYZ and the implementation has been
> > changed to instead meet the high limit target then this would be a
> > completely different story and I believe we could have saved some
> > discussion.
> 
> I agree that the changelog can be made more clear. Any objection if I send
> v2 with changelog changes to that effect, then? :-)

Yes, please. And I would highly appreciate to have the above addressed.
So that we do not have to really scratch heads why a particular design
decision has been made and argue what was the thinking behind.

> > > And somehow this is controversial, and we're just changing around user
> > > promises as we see fit for our particular usecase?
> > > 
> > > I don't even understand how the supposed alternate semantics you read
> > > between the lines in the documentation would make for a useful
> > > feature: It may fail to contain a group of offending tasks to the
> > > configured limit, but it will be fair to those tasks while doing so?
> > > 
> > > > But if your really want to push this through then let's do it
> > > > properly at least. memcg->memcg_nr_pages_over_high has only very
> > > > vague meaning if the reclaim target is the high limit.
> > > 
> > > task->memcg_nr_pages_over_high is not vague, it's a best-effort
> > > mechanism to distribute fairness. It's the current task's share of the
> > > cgroup's overage, and it allows us in the majority of situations to
> > > distribute reclaim work and sleeps in proportion to how much the task
> > > is actually at fault.
> > 
> > Agreed. But this stops being the case as soon as the reclaim target has
> > been reached and new reclaim attempts are enforced because the memcg is
> > still above the high limit. Because then you have a completely different
> > reclaim target - get down to the limit. This would be especially visible
> > with a large memcg_nr_pages_over_high which could even lead to an over
> > reclaim.
> 
> We actually over reclaim even before this patch -- this patch doesn't bring
> much new in that regard.
> 
> Tracing try_to_free_pages for a cgroup at the memory.high threshold shows
> that before this change, we sometimes even reclaim on the order of twice the
> number of pages requested. For example, I see cases where we requested 1000
> pages to be reclaimed, but end up reclaiming 2000 in a single reclaim
> attempt.

This is interesting and worth looking into. I am aware that we can
reclaim potentially much more pages during the icache reclaim and that
there was a heated discussion without any fix merged in the end IIRC.
Do you have any details?
Chris Down May 29, 2020, 10:08 a.m. UTC | #38
Michal Hocko writes:
>> > > task->memcg_nr_pages_over_high is not vague, it's a best-effort
>> > > mechanism to distribute fairness. It's the current task's share of the
>> > > cgroup's overage, and it allows us in the majority of situations to
>> > > distribute reclaim work and sleeps in proportion to how much the task
>> > > is actually at fault.
>> >
>> > Agreed. But this stops being the case as soon as the reclaim target has
>> > been reached and new reclaim attempts are enforced because the memcg is
>> > still above the high limit. Because then you have a completely different
>> > reclaim target - get down to the limit. This would be especially visible
>> > with a large memcg_nr_pages_over_high which could even lead to an over
>> > reclaim.
>>
>> We actually over reclaim even before this patch -- this patch doesn't bring
>> much new in that regard.
>>
>> Tracing try_to_free_pages for a cgroup at the memory.high threshold shows
>> that before this change, we sometimes even reclaim on the order of twice the
>> number of pages requested. For example, I see cases where we requested 1000
>> pages to be reclaimed, but end up reclaiming 2000 in a single reclaim
>> attempt.
>
>This is interesting and worth looking into. I am aware that we can
>reclaim potentially much more pages during the icache reclaim and that
>there was a heated discussion without any fix merged in the end IIRC.
>Do you have any details?

Sure, we can look into this more, but let's do it separately from this patch -- 
I don't see that its merging should be contingent on that discussion :-)
Michal Hocko May 29, 2020, 10:14 a.m. UTC | #39
On Fri 29-05-20 11:08:58, Chris Down wrote:
> Michal Hocko writes:
> > > > > task->memcg_nr_pages_over_high is not vague, it's a best-effort
> > > > > mechanism to distribute fairness. It's the current task's share of the
> > > > > cgroup's overage, and it allows us in the majority of situations to
> > > > > distribute reclaim work and sleeps in proportion to how much the task
> > > > > is actually at fault.
> > > >
> > > > Agreed. But this stops being the case as soon as the reclaim target has
> > > > been reached and new reclaim attempts are enforced because the memcg is
> > > > still above the high limit. Because then you have a completely different
> > > > reclaim target - get down to the limit. This would be especially visible
> > > > with a large memcg_nr_pages_over_high which could even lead to an over
> > > > reclaim.
> > > 
> > > We actually over reclaim even before this patch -- this patch doesn't bring
> > > much new in that regard.
> > > 
> > > Tracing try_to_free_pages for a cgroup at the memory.high threshold shows
> > > that before this change, we sometimes even reclaim on the order of twice the
> > > number of pages requested. For example, I see cases where we requested 1000
> > > pages to be reclaimed, but end up reclaiming 2000 in a single reclaim
> > > attempt.
> > 
> > This is interesting and worth looking into. I am aware that we can
> > reclaim potentially much more pages during the icache reclaim and that
> > there was a heated discussion without any fix merged in the end IIRC.
> > Do you have any details?
> 
> Sure, we can look into this more, but let's do it separately from this patch
> -- I don't see that its merging should be contingent on that discussion :-)

Yes that is a separate issue.

Patch
diff mbox series

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2df9510b7d64..b040951ccd6b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,6 +73,7 @@  EXPORT_SYMBOL(memory_cgrp_subsys);
 
 struct mem_cgroup *root_mem_cgroup __read_mostly;
 
+/* The number of times we should retry reclaim failures before giving up. */
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
 /* Socket memory accounting disabled? */
@@ -2228,17 +2229,22 @@  static int memcg_hotplug_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
-static void reclaim_high(struct mem_cgroup *memcg,
-			 unsigned int nr_pages,
-			 gfp_t gfp_mask)
+static unsigned long reclaim_high(struct mem_cgroup *memcg,
+				  unsigned int nr_pages,
+				  gfp_t gfp_mask)
 {
+	unsigned long nr_reclaimed = 0;
+
 	do {
 		if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->high))
 			continue;
 		memcg_memory_event(memcg, MEMCG_HIGH);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
+							     gfp_mask, true);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
+
+	return nr_reclaimed;
 }
 
 static void high_work_func(struct work_struct *work)
@@ -2378,16 +2384,20 @@  void mem_cgroup_handle_over_high(void)
 {
 	unsigned long penalty_jiffies;
 	unsigned long pflags;
+	unsigned long nr_reclaimed;
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *memcg;
 
 	if (likely(!nr_pages))
 		return;
 
 	memcg = get_mem_cgroup_from_mm(current->mm);
-	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	current->memcg_nr_pages_over_high = 0;
 
+retry_reclaim:
+	nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
+
 	/*
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
@@ -2403,6 +2413,14 @@  void mem_cgroup_handle_over_high(void)
 	if (penalty_jiffies <= HZ / 100)
 		goto out;
 
+	/*
+	 * If reclaim is making forward progress but we're still over
+	 * memory.high, we want to encourage that rather than doing allocator
+	 * throttling.
+	 */
+	if (nr_reclaimed || nr_retries--)
+		goto retry_reclaim;
+
 	/*
 	 * If we exit early, we're guaranteed to die (since
 	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't