From: "Leonardo Brás" <leobras@redhat.com>
To: Michal Hocko <mhocko@suse.com>, Marcelo Tosatti <mtosatti@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining
Date: Fri, 27 Jan 2023 03:55:39 -0300	[thread overview]
Message-ID: <0122005439ffb7895efda7a1a67992cbe41392fe.camel@redhat.com> (raw)
In-Reply-To: <Y9LQ615H13RmG7wL@dhcp22.suse.cz>

On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > [...]
> > > > Remote draining reduces interruptions whether the CPU 
> > > > is marked as isolated or not:
> > > > 
> > > > - Allows isolated CPUs to benefit from pcp caching.
> > > > - Removes the interruption to non-isolated CPUs. See for example 
> > > > 
> > > > https://lkml.org/lkml/2022/6/13/2769
> > > 
> > > This is talking about page allocator per-cpu caches, right? In this patch
> > > we are talking about memcg pcp caches. Are you sure the same applies
> > > here?
> > 
> > Both can stall the users of the drain operation.
> 
> Yes. But it is important to consider who those users are. We are
> draining when
> 	- we are charging and the limit is hit so that memory reclaim
> 	  has to be triggered.
> 	- hard, high limits are set and require memory reclaim.
> 	- force_empty - full memory reclaim for a memcg
> 	- memcg offlining - cgroup removal - quite a heavy operation as
> 	  well.
> all those could be really costly kernel operations, and they affect an
> isolated cpu only if the same memcg is used by both isolated and non-isolated
> cpus. In other words, those costly operations would have to be triggered
> from non-isolated cpus, and those are expected to be stalled. It is
> the side effect of the scheduled local cpu draining that affects
> the isolated cpu as well.
> 
> Is that more clear?

I think so, please help me check:

IIUC, we can approach this by dividing the problem into two working modes:
1 - Normal, meaning no drain_all_stock() running.
2 - Draining, which groups together pre-OOM draining and userspace 'config'
operations: changing, destroying, or reconfiguring a memcg.

For (1), we will (ideally) have only the local cpu working on the percpu struct.
This mode will not have any kind of contention, because each CPU only takes its
own spinlock.
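
To keep us looking at the same thing, here is a minimal sketch of the per-cpu
stock this discussion is about, trimmed from mm/memcontrol.c: only the fields
relevant here, so take it as an illustration of the lock-type change rather
than the exact struct.

struct memcg_stock_pcp {
	/* upstream: local_lock_t stock_lock; only the owning CPU may take it */
	spinlock_t stock_lock;		/* this series: remote CPUs may take it too */
	struct mem_cgroup *cached;	/* memcg currently being cached */
	unsigned int nr_pages;		/* pre-charged pages held in the stock */
	/* ... work item, flags, etc. omitted ... */
};

In mode (1) the only contender for stock_lock is the CPU that owns the percpu
copy, so the lock is expected to be uncontended.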

For (2), we will have a lot of drain_all_stock() running. Upstream, this means
a lot of schedule_work_on() calls; with the patch proposal, it means possible
contention, i.e. local cpus having to wait for a lock to access their cache.
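
Just to illustrate the shape of the two strategies (this is not the exact
upstream code; stock_is_relevant() below stands in for the check that the
stock's cached memcg is a descendant of the one being drained):

/* upstream: ask each CPU to drain its own stock via a work item */
static void drain_all_stock_upstream(struct mem_cgroup *root_memcg)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

		if (stock_is_relevant(stock, root_memcg))
			schedule_work_on(cpu, &stock->work); /* runs drain_local_stock() on 'cpu' */
	}
}

/* this series: drain the remote stock directly, under its spinlock */
static void drain_all_stock_patched(struct mem_cgroup *root_memcg)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		unsigned long flags;

		if (!stock_is_relevant(stock, root_memcg))
			continue;

		spin_lock_irqsave(&stock->stock_lock, flags);
		drain_stock(stock);	/* return the cached charge to the page counters */
		spin_unlock_irqrestore(&stock->stock_lock, flags);
	}
}

So the work item (and the context switch it implies on the target cpu) is
traded for possible spinlock contention with that cpu's charge/uncharge fast
path.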

Ok, given the above is correct:

# Some arguments suggest that (1) becomes slower with this patch.

This is partially true: while test 2.2 showed that the running time of the
local cpu functions became slower by a few cycles, test 2.4 shows that the
userspace perception was that syscalls and pagefaults actually became faster:

During some debugging before collecting the numbers for test 2.4, I noticed
that the 'syscall + write' test calls exactly those functions that became
slower in test 2.2. Those functions were called many millions of times during
a single run, and the patched version still came out faster than upstream in
test 2.4. So even if the functions themselves became slower, their overall use
in the usual context became faster.

Isn't that a small improvement?

# Regarding (2), I understand the concern is contention

While this seems to be the harder part of the discussion, I think we have enough
data to deal with it. 

In which case would contention be a big problem here?
IIUC, it would be when a lot of drain_all_stock() calls get triggered because
the memory limit is getting near. Having the user create / modify a memcg
multiple times a second for a while is not something to be expected, IMHO.

Now, if my assumption is correct and the case where contention could be a
problem is a memcg under high memory pressure, then we have the argument that
Marcelo Tosatti brought to the discussion[P1]: using spinlocks on percpu caches
for page allocation gave better results than local_locks + schedule_work_on().

I mean, while contention would make the cpu wait for a while before getting
the lock to allocate a page from the cache, something similar happens with
schedule_work_on(), which forces the current task to wait while the draining
happens locally.

What I can see is that, for each drain_all_stock(), each cpu getting drained
either (a) (sometimes) waits for a lock to be freed, or (b) waits for a whole
context switch to happen.
And IIUC, (b) is much slower than (a) on average, which is what causes the
improved performance seen in [P1].

(I mean, waiting while drain_local_stock() runs on the local CPU vs waiting
for it to run on the remote CPU may not be that different, since on upstream
the cacheline is already written to by the remote cpu.)

Also, according to test 2.2, in the patched version drain_local_stock() has
gotten faster (much faster for 128 cpus), even though it does all the draining
instead of just scheduling it on the other cpus.
Adding that to the brief nature of the local cpu functions, we may not hit
contention as much as expected.

##

Sorry for the long text.
I may be missing some point, please let me know if that's the case.

Thanks a lot for reviewing!
Leo

[P1]: https://lkml.org/lkml/2022/6/13/2769

