linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC] memcg rstat flushing optimization
@ 2022-10-05  1:17 Yosry Ahmed
  2022-10-05 16:30 ` Tejun Heo
  2022-10-17 18:52 ` Michal Koutný
  0 siblings, 2 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-10-05  1:17 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný
  Cc: Andrew Morton, Linux-MM, Cgroups, Greg Thelen

Hey everyone,

Sorry for the long email :)

We have recently ran into a hard lockup on a machine with hundreds of
CPUs and thousands of memcgs during an rstat flush. There have also
been some discussions during LPC between myself, Michal Koutný, and
Shakeel about memcg rstat flushing optimization. This email is a
follow up on that, discussing possible ideas to optimize memcg rstat
flushing.

Currently, mem_cgroup_flush_stats() is the main interface to flush
memcg stats. It has some internal optimizations that can skip a flush
if there hasn't been significant updates in general. It always flushes
the entire memcg hierarchy, and always invokes flushing using
cgroup_rstat_flush_irqsafe(), which has interrupts disabled and does
not sleep. As you can imagine, with a sufficiently large number of
memcgs and cpus, a call to mem_cgroup_flush_stats() might be slow, or
in an extreme case like the one we ran into, cause a hard lockup
(despite periodically flushing every 4 seconds).

(a) A first step might be to introduce a non _irqsafe version of
mem_cgroup_flush_stats(), and only call the _irqsafe version in places
where we can't sleep. This will exclude some contexts from possibly
introducing a lockup, like the stats reading context and the periodic
flushing context.

(b) We can also stop flushing the entire memcg hierarchy in hopes that
flushing might happen incrementally over subtrees, but this was
introduced to reduce lock contention when there are multiple contexts
trying to flush memcgs stats concurrently, where only one of them will
flush and all the others return immediately (although there is some
inaccuracy here as we didn't actually wait for the flush to complete).
This will re-introduce the lock contention. Maybe we can mitigate this
in rstat code by having hierarchical locks instead of a global lock,
although I can imagine this can quickly get too complicated.

(c) One other thing we can do (similar to the recent blkcg patch
series [1]) is keep track of which stats have been updated. We
currently flush MEMCG_NR_STATS + MEMCG_NR_EVENTS (thanks to Shakeel) +
nodes * NR_VM_NODE_STAT_ITEMS. I didn't make the exact calculation but
I suspect this easily goes over a 100. Keeping track of updated stats
might be in the form of a percpu bitmask. It will introduce some
overhead to the update side and flush sides, but it can help us skip a
lot of up-to-date stats and cache misses. In a few sample machines I
have found that every (memcg, cpu) pair had less than 5 stats on
average that are actually updated.

(d) Instead of optimizing rstat flushing in general, we can just
mitigate the cases that can actually cause a lockup. After we do (a)
and separate call sites that actually need to disable interrupts, we
can introduce a new selective flush callback (e.g.
cgroup_rstat_flush_opts()). This callback can flush only the stats we
care about (bitmask?) and leave the rstat tree untouched (only
traverse the tree, don't pop the nodes). It might be less than optimal
in cases where the stats we choose to flush are the only ones that are
updated, and the cgroup just remains on the rstat tree for no reason.
However, it effectively addresses the cases that can cause a lockup by
only flushing a small subset of the stats.

(e) If we do both (c) and (d), we can go one step further. We can make
cgroup_rstat_flush_opts() return a boolean to indicate whether this
cgroup is completely flushed (what we asked to flush is all what was
updated). If true, we can remove the cgroup from the rstat tree.
However, to do this we will need to have separate rstat trees for each
subsystem or to keep track of which subsystems have updates for a
cgroup (so that if cgroup_rstat_flush_opts() returns true we know if
we can remove the cgroup from the tree or not).

Of course nothing is free. Most of the solutions above will either
introduce overhead somewhere, complexity, or both. We also don't have
a de facto benchmark that will tell us for sure if a change made
things generally better or not, as it will vastly differ depending on
the setup, the workloads, etc. Nothing will make everything better for
all use cases. This is just me kicking off a discussion to see what we
can/should do :)

[1] https://lore.kernel.org/lkml/20221004151748.293388-1-longman@redhat.com/


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05  1:17 [RFC] memcg rstat flushing optimization Yosry Ahmed
@ 2022-10-05 16:30 ` Tejun Heo
  2022-10-05 17:20   ` Yosry Ahmed
  2022-10-17 18:52 ` Michal Koutný
  1 sibling, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2022-10-05 16:30 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

Hello,

On Tue, Oct 04, 2022 at 06:17:40PM -0700, Yosry Ahmed wrote:
> We have recently ran into a hard lockup on a machine with hundreds of
> CPUs and thousands of memcgs during an rstat flush. There have also
> been some discussions during LPC between myself, Michal Koutný, and
> Shakeel about memcg rstat flushing optimization. This email is a
> follow up on that, discussing possible ideas to optimize memcg rstat
> flushing.
> 
> Currently, mem_cgroup_flush_stats() is the main interface to flush
> memcg stats. It has some internal optimizations that can skip a flush
> if there hasn't been significant updates in general. It always flushes
> the entire memcg hierarchy, and always invokes flushing using
> cgroup_rstat_flush_irqsafe(), which has interrupts disabled and does
> not sleep. As you can imagine, with a sufficiently large number of
> memcgs and cpus, a call to mem_cgroup_flush_stats() might be slow, or
> in an extreme case like the one we ran into, cause a hard lockup
> (despite periodically flushing every 4 seconds).

How long were the stalls? Given that rstats are usually flushed by its
consumers, flushing taking some time might be acceptable but what's really
problematic is that the whole thing is done with irq disabled. We can think
about other optimizations later too but I think the first thing to do is
making the flush code able to pause and resume. ie. flush in batches and
re-enable irq / resched between batches. We'd have to pay attention to
guaranteeing forward progress. It'd be ideal if we can structure iteration
in such a way that resuming doesn't end up nodes which got added after it
started flushing.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05 16:30 ` Tejun Heo
@ 2022-10-05 17:20   ` Yosry Ahmed
  2022-10-05 17:42     ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Yosry Ahmed @ 2022-10-05 17:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

On Wed, Oct 5, 2022 at 9:30 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Oct 04, 2022 at 06:17:40PM -0700, Yosry Ahmed wrote:
> > We have recently ran into a hard lockup on a machine with hundreds of
> > CPUs and thousands of memcgs during an rstat flush. There have also
> > been some discussions during LPC between myself, Michal Koutný, and
> > Shakeel about memcg rstat flushing optimization. This email is a
> > follow up on that, discussing possible ideas to optimize memcg rstat
> > flushing.
> >
> > Currently, mem_cgroup_flush_stats() is the main interface to flush
> > memcg stats. It has some internal optimizations that can skip a flush
> > if there hasn't been significant updates in general. It always flushes
> > the entire memcg hierarchy, and always invokes flushing using
> > cgroup_rstat_flush_irqsafe(), which has interrupts disabled and does
> > not sleep. As you can imagine, with a sufficiently large number of
> > memcgs and cpus, a call to mem_cgroup_flush_stats() might be slow, or
> > in an extreme case like the one we ran into, cause a hard lockup
> > (despite periodically flushing every 4 seconds).
>
> How long were the stalls? Given that rstats are usually flushed by its

I think 10 seconds while interrupts are disabled is what we need for a
hard lockup, right?

> consumers, flushing taking some time might be acceptable but what's really
> problematic is that the whole thing is done with irq disabled. We can think
> about other optimizations later too but I think the first thing to do is
> making the flush code able to pause and resume. ie. flush in batches and
> re-enable irq / resched between batches. We'd have to pay attention to
> guaranteeing forward progress. It'd be ideal if we can structure iteration
> in such a way that resuming doesn't end up nodes which got added after it
> started flushing.

IIUC you mean that the caller of cgroup_rstat_flush() can call a
different variant that only flushes a part of the rstat tree then
returns, and the caller makes several calls interleaved by re-enabling
irq, right? Because the flushing code seems to already do this
internally if the non irqsafe version is used.

I think this might be tricky. In this case the path that caused the
lockup was memcg_check_events()->mem_cgroup_threshold()->__mem_cgroup_threshold()->mem_cgroup_usage()->mem_cgroup_flush_stats().
Interrupts are disabled by callers of memcg_check_events(), but the
rstat flush call is made much deeper in the call stack. Whoever is
disabling interrupts doesn't have access to pause/resume flushing.

There are also other code paths that used to use
cgroup_rstat_flush_irqsafe() directly before mem_cgroup_flush_stats()
was introduced like mem_cgroup_wb_stats() [1].

This is why I suggested a selective flushing variant of
cgroup_rstat_flush_irqsafe(), so that flushers that need irq disabled
have the ability to only flush a subset of the stats to avoid long
stalls if possible.

[1] https://lore.kernel.org/lkml/20211001190040.48086-2-shakeelb@google.com/

>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05 17:20   ` Yosry Ahmed
@ 2022-10-05 17:42     ` Tejun Heo
  2022-10-05 18:02       ` Yosry Ahmed
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2022-10-05 17:42 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

Hello,

On Wed, Oct 05, 2022 at 10:20:54AM -0700, Yosry Ahmed wrote:
> > How long were the stalls? Given that rstats are usually flushed by its
> 
> I think 10 seconds while interrupts are disabled is what we need for a
> hard lockup, right?

Oh man, that's a long while. I'd really like to learn more about the
numbers. How many cgroups are being flushed across how many CPUs?

> IIUC you mean that the caller of cgroup_rstat_flush() can call a
> different variant that only flushes a part of the rstat tree then
> returns, and the caller makes several calls interleaved by re-enabling
> irq, right? Because the flushing code seems to already do this
> internally if the non irqsafe version is used.

I was thinking more that being done inside the flush function.

> I think this might be tricky. In this case the path that caused the
> lockup was memcg_check_events()->mem_cgroup_threshold()->__mem_cgroup_threshold()->mem_cgroup_usage()->mem_cgroup_flush_stats().
> Interrupts are disabled by callers of memcg_check_events(), but the
> rstat flush call is made much deeper in the call stack. Whoever is
> disabling interrupts doesn't have access to pause/resume flushing.

Hmm.... yeah I guess it's worthwhile to experiment with selective flushing
for specific paths. That said, we'd still need to address the whole flush
taking long too.

> There are also other code paths that used to use
> cgroup_rstat_flush_irqsafe() directly before mem_cgroup_flush_stats()
> was introduced like mem_cgroup_wb_stats() [1].
> 
> This is why I suggested a selective flushing variant of
> cgroup_rstat_flush_irqsafe(), so that flushers that need irq disabled
> have the ability to only flush a subset of the stats to avoid long
> stalls if possible.

I have nothing against selective flushing but it's not a free thing to do
both in terms of complexity and runtime overhead, so let's get some numbers
on how much time is spent where.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05 17:42     ` Tejun Heo
@ 2022-10-05 18:02       ` Yosry Ahmed
  2022-10-05 18:22         ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Yosry Ahmed @ 2022-10-05 18:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

On Wed, Oct 5, 2022 at 10:43 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 05, 2022 at 10:20:54AM -0700, Yosry Ahmed wrote:
> > > How long were the stalls? Given that rstats are usually flushed by its
> >
> > I think 10 seconds while interrupts are disabled is what we need for a
> > hard lockup, right?
>
> Oh man, that's a long while. I'd really like to learn more about the
> numbers. How many cgroups are being flushed across how many CPUs?

The total number of cgroups is about ~11k. Unfortunately, I wouldn't
know how many of them are on the rstat updated tree. The number of
cpus is 256.

In all honesty a chunk of those cgroups were dying, which is a
different problem, but there is nothing really preventing our users
from creating that many live cgroups. Also, we naturally don't want
the kernel to face a 10s hard lockup and panic even if we have a
zombie cgroups problem.

Interestingly, we are on cgroup v1, which means we are only flushing
memcg stats. When we move to cgroup v2 we will also flush blkcg stats
in the same irq disabled call.

>
> > IIUC you mean that the caller of cgroup_rstat_flush() can call a
> > different variant that only flushes a part of the rstat tree then
> > returns, and the caller makes several calls interleaved by re-enabling
> > irq, right? Because the flushing code seems to already do this
> > internally if the non irqsafe version is used.
>
> I was thinking more that being done inside the flush function.

I think the flush function already does that in some sense if
might_sleep is true, right? The problem here is that we are using
cgroup_rstat_flush_irqsafe() which can't sleep. Even if we modify
mem_cgroup_flush_stats() so that it doesn't always call the irqsafe
version, we still have code paths that need AFAICT. It would help to
limit the callers to the irqsafe version, but it doesn't fix the
problem.

>
> > I think this might be tricky. In this case the path that caused the
> > lockup was memcg_check_events()->mem_cgroup_threshold()->__mem_cgroup_threshold()->mem_cgroup_usage()->mem_cgroup_flush_stats().
> > Interrupts are disabled by callers of memcg_check_events(), but the
> > rstat flush call is made much deeper in the call stack. Whoever is
> > disabling interrupts doesn't have access to pause/resume flushing.
>
> Hmm.... yeah I guess it's worthwhile to experiment with selective flushing
> for specific paths. That said, we'd still need to address the whole flush
> taking long too.

Agreed. The irqsafe paths are a more severe problem but ideally we
want to optimize flushing in general (which is why I dumped a lot of
ideas in the original email, to see what makes sense to other folks).

>
> > There are also other code paths that used to use
> > cgroup_rstat_flush_irqsafe() directly before mem_cgroup_flush_stats()
> > was introduced like mem_cgroup_wb_stats() [1].
> >
> > This is why I suggested a selective flushing variant of
> > cgroup_rstat_flush_irqsafe(), so that flushers that need irq disabled
> > have the ability to only flush a subset of the stats to avoid long
> > stalls if possible.
>
> I have nothing against selective flushing but it's not a free thing to do
> both in terms of complexity and runtime overhead, so let's get some numbers
> on how much time is spent where.

The problem with acquiring numbers is that rstat flushing is very
heavily dependent on workloads. The stats represent basically
everything that memcg does. There might be some workloads that only
update a couple of stats mostly, and workloads that exercise most of
them. There might also be workloads that are limited to a few cpus or
can run on all cpus. The number of memcgs is also a huge factor. It
feels like if we use an artificial benchmark it would significantly be
non-representative.

I took a couple of crashed machines kdumps and ran a script to
traverse updated memcgs and check how many cpus have updates and how
many updates are there on each cpu. I found that on average only a
couple of stats are updated per-cpu per-cgroup, and less than 25% of
cpus (but this is on a large machine, I expect the number to go higher
on smaller machines). Which is why I suggested a bitmask. I understand
though that this depends on whatever workloads were running on those
machines, and that in case where most stats are updated the bitmask
will actually make things slightly worse.

>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05 18:02       ` Yosry Ahmed
@ 2022-10-05 18:22         ` Tejun Heo
  2022-10-05 18:38           ` Yosry Ahmed
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2022-10-05 18:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

Hello,

On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > I was thinking more that being done inside the flush function.
> 
> I think the flush function already does that in some sense if
> might_sleep is true, right? The problem here is that we are using

Oh I forgot about that. Right.

...
> I took a couple of crashed machines kdumps and ran a script to
> traverse updated memcgs and check how many cpus have updates and how
> many updates are there on each cpu. I found that on average only a
> couple of stats are updated per-cpu per-cgroup, and less than 25% of
> cpus (but this is on a large machine, I expect the number to go higher
> on smaller machines). Which is why I suggested a bitmask. I understand
> though that this depends on whatever workloads were running on those
> machines, and that in case where most stats are updated the bitmask
> will actually make things slightly worse.

One worry I have about selective flushing is that it's only gonna improve
things by some multiples while we can reasonably increase the problem size
by orders of magnitude.

The only real ways out I can think of are:

* Implement a periodic flusher which keeps the stats needed in irqsafe path
  acceptably uptodate to avoid flushing with irq disabled. We can make this
  adaptive too - no reason to do all this if the number to flush isn't huge.

* Shift some work to the updaters. e.g. in many cases, propagating per-cpu
  updates a couple levels up from update path will significantly reduce the
  fanouts and thus the number of entries which need to be flushed later. It
  does add on-going overhead, so it prolly should adaptive or configurable,
  hopefully the former.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05 18:22         ` Tejun Heo
@ 2022-10-05 18:38           ` Yosry Ahmed
  2022-10-06  2:13             ` Yosry Ahmed
  2022-10-11  0:15             ` Yosry Ahmed
  0 siblings, 2 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-10-05 18:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

On Wed, Oct 5, 2022 at 11:22 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > > I was thinking more that being done inside the flush function.
> >
> > I think the flush function already does that in some sense if
> > might_sleep is true, right? The problem here is that we are using
>
> Oh I forgot about that. Right.
>
> ...
> > I took a couple of crashed machines kdumps and ran a script to
> > traverse updated memcgs and check how many cpus have updates and how
> > many updates are there on each cpu. I found that on average only a
> > couple of stats are updated per-cpu per-cgroup, and less than 25% of
> > cpus (but this is on a large machine, I expect the number to go higher
> > on smaller machines). Which is why I suggested a bitmask. I understand
> > though that this depends on whatever workloads were running on those
> > machines, and that in case where most stats are updated the bitmask
> > will actually make things slightly worse.
>
> One worry I have about selective flushing is that it's only gonna improve
> things by some multiples while we can reasonably increase the problem size
> by orders of magnitude.

I think we would usually want to flush a few stats (< 5?) in irqsafe
contexts out of over 100, so I would say the improvement would be
good, but yeah, the problem size can reasonably increase more than
that. It also depends on which stats we selectively flush. If they are
not in the same cache line we might end up bringing in a lot of stats
anyway into the cpu cache.

>
> The only real ways out I can think of are:
>
> * Implement a periodic flusher which keeps the stats needed in irqsafe path
>   acceptably uptodate to avoid flushing with irq disabled. We can make this
>   adaptive too - no reason to do all this if the number to flush isn't huge.

We do have a periodic flusher today for memcg stats (see
flush_memcg_stats_dwork). It calls __mem_cgroup_flush_stas() which
only flushes if the total number of updates is over a certain
threshold.
mem_cgroup_flush_stas_delayed(), which is called in the page fault
path, only does a flush if the last flush was a certain while ago. We
don't use the delayed version in all irqsafe contexts though, and I am
not the right person to tell if we can.

But I think this is not what you meant. I think you meant only
flushing the specific stats needed in irqsafe contexts more frequently
and not invoking a flush at all in irqsafe contexts (or using
mem_cgroup_flush_stas_delayed()..?). Right?

I am not the right person to judge what is acceptably up-to-date to be
honest, so I would wait for other memcgs folks to chime in on this.

>
> * Shift some work to the updaters. e.g. in many cases, propagating per-cpu
>   updates a couple levels up from update path will significantly reduce the
>   fanouts and thus the number of entries which need to be flushed later. It
>   does add on-going overhead, so it prolly should adaptive or configurable,
>   hopefully the former.

If we are adding overhead to the updaters, would it be better to
maintain a bitmask of updated stats, or do you think it would be more
effective to propagate updates a couple of levels up? I think to
propagate updates up in updaters context we would need percpu versions
of the "pending" stats, which would also add memory consumption.

>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05 18:38           ` Yosry Ahmed
@ 2022-10-06  2:13             ` Yosry Ahmed
  2022-10-11  0:15             ` Yosry Ahmed
  1 sibling, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-10-06  2:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

On Wed, Oct 5, 2022 at 11:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 5, 2022 at 11:22 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > > > I was thinking more that being done inside the flush function.
> > >
> > > I think the flush function already does that in some sense if
> > > might_sleep is true, right? The problem here is that we are using
> >
> > Oh I forgot about that. Right.
> >
> > ...
> > > I took a couple of crashed machines kdumps and ran a script to
> > > traverse updated memcgs and check how many cpus have updates and how
> > > many updates are there on each cpu. I found that on average only a
> > > couple of stats are updated per-cpu per-cgroup, and less than 25% of
> > > cpus (but this is on a large machine, I expect the number to go higher
> > > on smaller machines). Which is why I suggested a bitmask. I understand
> > > though that this depends on whatever workloads were running on those
> > > machines, and that in case where most stats are updated the bitmask
> > > will actually make things slightly worse.
> >
> > One worry I have about selective flushing is that it's only gonna improve
> > things by some multiples while we can reasonably increase the problem size
> > by orders of magnitude.
>
> I think we would usually want to flush a few stats (< 5?) in irqsafe
> contexts out of over 100, so I would say the improvement would be
> good, but yeah, the problem size can reasonably increase more than
> that. It also depends on which stats we selectively flush. If they are
> not in the same cache line we might end up bringing in a lot of stats
> anyway into the cpu cache.
>
> >
> > The only real ways out I can think of are:
> >
> > * Implement a periodic flusher which keeps the stats needed in irqsafe path
> >   acceptably uptodate to avoid flushing with irq disabled. We can make this
> >   adaptive too - no reason to do all this if the number to flush isn't huge.
>
> We do have a periodic flusher today for memcg stats (see
> flush_memcg_stats_dwork). It calls __mem_cgroup_flush_stas() which
> only flushes if the total number of updates is over a certain
> threshold.
> mem_cgroup_flush_stas_delayed(), which is called in the page fault
> path, only does a flush if the last flush was a certain while ago. We
> don't use the delayed version in all irqsafe contexts though, and I am
> not the right person to tell if we can.
>
> But I think this is not what you meant. I think you meant only
> flushing the specific stats needed in irqsafe contexts more frequently
> and not invoking a flush at all in irqsafe contexts (or using
> mem_cgroup_flush_stas_delayed()..?). Right?
>
> I am not the right person to judge what is acceptably up-to-date to be
> honest, so I would wait for other memcgs folks to chime in on this.
>
> >
> > * Shift some work to the updaters. e.g. in many cases, propagating per-cpu
> >   updates a couple levels up from update path will significantly reduce the
> >   fanouts and thus the number of entries which need to be flushed later. It
> >   does add on-going overhead, so it prolly should adaptive or configurable,
> >   hopefully the former.
>
> If we are adding overhead to the updaters, would it be better to
> maintain a bitmask of updated stats, or do you think it would be more
> effective to propagate updates a couple of levels up? I think to
> propagate updates up in updaters context we would need percpu versions
> of the "pending" stats, which would also add memory consumption.
>

A potential problem that I also noticed with propagating percpu
updates up on the update path is that we will need to update
memcg->vmstats_percpu->[state/event/..]_prev. Currently these percpu
prev variables are only updated by rstat flushing code. If they can
also be updated on the update path then we might need some locking
primitive to protect them, which would add more overhead.

> >
> > Thanks.
> >
> > --
> > tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05 18:38           ` Yosry Ahmed
  2022-10-06  2:13             ` Yosry Ahmed
@ 2022-10-11  0:15             ` Yosry Ahmed
  2022-10-11  0:19               ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Yosry Ahmed @ 2022-10-11  0:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

On Wed, Oct 5, 2022 at 11:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 5, 2022 at 11:22 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > > > I was thinking more that being done inside the flush function.
> > >
> > > I think the flush function already does that in some sense if
> > > might_sleep is true, right? The problem here is that we are using
> >
> > Oh I forgot about that. Right.
> >
> > ...
> > > I took a couple of crashed machines kdumps and ran a script to
> > > traverse updated memcgs and check how many cpus have updates and how
> > > many updates are there on each cpu. I found that on average only a
> > > couple of stats are updated per-cpu per-cgroup, and less than 25% of
> > > cpus (but this is on a large machine, I expect the number to go higher
> > > on smaller machines). Which is why I suggested a bitmask. I understand
> > > though that this depends on whatever workloads were running on those
> > > machines, and that in case where most stats are updated the bitmask
> > > will actually make things slightly worse.
> >
> > One worry I have about selective flushing is that it's only gonna improve
> > things by some multiples while we can reasonably increase the problem size
> > by orders of magnitude.
>
> I think we would usually want to flush a few stats (< 5?) in irqsafe
> contexts out of over 100, so I would say the improvement would be
> good, but yeah, the problem size can reasonably increase more than
> that. It also depends on which stats we selectively flush. If they are
> not in the same cache line we might end up bringing in a lot of stats
> anyway into the cpu cache.
>
> >
> > The only real ways out I can think of are:
> >
> > * Implement a periodic flusher which keeps the stats needed in irqsafe path
> >   acceptably uptodate to avoid flushing with irq disabled. We can make this
> >   adaptive too - no reason to do all this if the number to flush isn't huge.
>
> We do have a periodic flusher today for memcg stats (see
> flush_memcg_stats_dwork). It calls __mem_cgroup_flush_stas() which
> only flushes if the total number of updates is over a certain
> threshold.
> mem_cgroup_flush_stas_delayed(), which is called in the page fault
> path, only does a flush if the last flush was a certain while ago. We
> don't use the delayed version in all irqsafe contexts though, and I am
> not the right person to tell if we can.
>
> But I think this is not what you meant. I think you meant only
> flushing the specific stats needed in irqsafe contexts more frequently
> and not invoking a flush at all in irqsafe contexts (or using
> mem_cgroup_flush_stas_delayed()..?). Right?
>
> I am not the right person to judge what is acceptably up-to-date to be
> honest, so I would wait for other memcgs folks to chime in on this.
>
> >
> > * Shift some work to the updaters. e.g. in many cases, propagating per-cpu
> >   updates a couple levels up from update path will significantly reduce the
> >   fanouts and thus the number of entries which need to be flushed later. It
> >   does add on-going overhead, so it prolly should adaptive or configurable,
> >   hopefully the former.
>
> If we are adding overhead to the updaters, would it be better to
> maintain a bitmask of updated stats, or do you think it would be more
> effective to propagate updates a couple of levels up? I think to
> propagate updates up in updaters context we would need percpu versions
> of the "pending" stats, which would also add memory consumption.
>

Any thoughts here, Tejun or anyone?

> >
> > Thanks.
> >
> > --
> > tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-11  0:15             ` Yosry Ahmed
@ 2022-10-11  0:19               ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2022-10-11  0:19 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Andrew Morton, Linux-MM, Cgroups, Greg Thelen

Hello,

On Mon, Oct 10, 2022 at 05:15:33PM -0700, Yosry Ahmed wrote:
> Any thoughts here, Tejun or anyone?

I'm having a bit of hard time imagining how things would look like without
code and I think we'd need inputs from mm folks re. the tradeoff between
information timeliness and overhead.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-05  1:17 [RFC] memcg rstat flushing optimization Yosry Ahmed
  2022-10-05 16:30 ` Tejun Heo
@ 2022-10-17 18:52 ` Michal Koutný
  2022-10-17 21:30   ` Yosry Ahmed
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Koutný @ 2022-10-17 18:52 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Andrew Morton, Linux-MM, Cgroups, Greg Thelen

[-- Attachment #1: Type: text/plain, Size: 1521 bytes --]

Hello.

On Tue, Oct 04, 2022 at 06:17:40PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> Sorry for the long email :)

(I'll get to other parts sometime in the future. Sorry for my latency :)

> We have recently ran into a hard lockup on a machine with hundreds of
> CPUs and thousands of memcgs during an rstat flush.
> [...]

I only respond with some remarks to this particular case.


> As you can imagine, with a sufficiently large number of
> memcgs and cpus, a call to mem_cgroup_flush_stats() might be slow, or
> in an extreme case like the one we ran into, cause a hard lockup
> (despite periodically flushing every 4 seconds).

Is this your modification from the upstream value of FLUSH_TIME (that's
every 2 s)?

In the mailthread, you also mention >10s for hard-lockups. That sounds
scary (even with the once per 4 seconds) since with large enough update
tree (and update activity) periodic flush couldn't keep up.
Also, it seems to be kind of bad feedback, the longer a (periodic) flush
takes, the lower is the frequency of them and the more updates may
accumulate. I.e. one spike in update activity can get the system into
a spiral of long flushes that won't recover once the activity doesn't
drop much more. 

(2nd point should have been about some memcg_check_events() optimization
or THRESHOLDS_EVENTS_TARGET justifying delayed flush but I've found none to be applicable.
Just noting that v2 fortunetly doesn't have the threshold
notifications.)

Regards,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] memcg rstat flushing optimization
  2022-10-17 18:52 ` Michal Koutný
@ 2022-10-17 21:30   ` Yosry Ahmed
  0 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-10-17 21:30 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Andrew Morton, Linux-MM, Cgroups, Greg Thelen

On Mon, Oct 17, 2022 at 11:52 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Tue, Oct 04, 2022 at 06:17:40PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > Sorry for the long email :)
>
> (I'll get to other parts sometime in the future. Sorry for my latency :)
>
> > We have recently ran into a hard lockup on a machine with hundreds of
> > CPUs and thousands of memcgs during an rstat flush.
> > [...]
>
> I only respond with some remarks to this particular case.
>
>
> > As you can imagine, with a sufficiently large number of
> > memcgs and cpus, a call to mem_cgroup_flush_stats() might be slow, or
> > in an extreme case like the one we ran into, cause a hard lockup
> > (despite periodically flushing every 4 seconds).
>
> Is this your modification from the upstream value of FLUSH_TIME (that's
> every 2 s)?

It's actually once every 4s like upstream, I got confused by
flush_next_time multiplying the flush interval by 2.

>
> In the mailthread, you also mention >10s for hard-lockups. That sounds
> scary (even with the once per 4 seconds) since with large enough update
> tree (and update activity) periodic flush couldn't keep up.
> Also, it seems to be kind of bad feedback, the longer a (periodic) flush
> takes, the lower is the frequency of them and the more updates may
> accumulate. I.e. one spike in update activity can get the system into
> a spiral of long flushes that won't recover once the activity doesn't
> drop much more.

Yeah it is scary and shouldn't be likely to happen, but it did :(

We can keep coming up with mitigations to try and make it less likely,
but I was hoping we can find something more fundamental like keeping
track of what we really need to flush or avoiding all flushing in
non-sleepable contexts if possible.

>
> (2nd point should have been about some memcg_check_events() optimization
> or THRESHOLDS_EVENTS_TARGET justifying delayed flush but I've found none to be applicable.
> Just noting that v2 fortunetly doesn't have the threshold
> notifications.)

I think even without that, we can still run into the same problem in
other non-sleepable flushing contexts.

>
> Regards,
> Michal


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-10-17 21:31 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-05  1:17 [RFC] memcg rstat flushing optimization Yosry Ahmed
2022-10-05 16:30 ` Tejun Heo
2022-10-05 17:20   ` Yosry Ahmed
2022-10-05 17:42     ` Tejun Heo
2022-10-05 18:02       ` Yosry Ahmed
2022-10-05 18:22         ` Tejun Heo
2022-10-05 18:38           ` Yosry Ahmed
2022-10-06  2:13             ` Yosry Ahmed
2022-10-11  0:15             ` Yosry Ahmed
2022-10-11  0:19               ` Tejun Heo
2022-10-17 18:52 ` Michal Koutný
2022-10-17 21:30   ` Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).