From: Johannes Weiner <hannes@cmpxchg.org>
To: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.cz>, Tejun Heo <tj@kernel.org>,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Date: Wed, 28 Oct 2015 11:58:10 -0700
Message-ID: <20151028185810.GA31488@cmpxchg.org>
In-Reply-To: <20151028082003.GK13221@esperanza>

On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> Then you'd better not touch existing tcp limits at all, because they
> just work, and the logic behind them is very close to that of global tcp
> limits. I don't think one can simplify it somehow.

Uhm, no, there is a crapload of boilerplate code and complication that
seems entirely unnecessary. The only thing missing from my patch seems
to be the part where it enters memory pressure state when the limit is
hit. I'm adding this for completeness, but I doubt it even matters.
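
To make the idea concrete, here is a toy userspace model of what I mean
by "enter memory pressure state when the limit is hit": a charge that
fails against the limit flips a pressure flag that later allocations
can consult. The names (try_charge, under_pressure) and the hysteresis
are invented for the sketch; this is not the actual page_counter or
memcg code.

/* Toy model: a failed charge against the limit enters a pressure state.
 * Illustrative only; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct counter {
	long usage;          /* pages currently charged */
	long limit;          /* hard limit in pages */
	bool under_pressure; /* set when a charge bumps into the limit */
};

/* Charge nr pages; on failure, remember that we hit the limit. */
static bool try_charge(struct counter *c, long nr)
{
	if (c->usage + nr > c->limit) {
		c->under_pressure = true;  /* tell consumers to stop growing */
		return false;
	}
	c->usage += nr;
	return true;
}

static void uncharge(struct counter *c, long nr)
{
	c->usage -= nr;
	if (c->usage < c->limit / 2)       /* arbitrary hysteresis for the model */
		c->under_pressure = false;
}

int main(void)
{
	struct counter tcp = { .usage = 0, .limit = 100, .under_pressure = false };

	printf("charge 80: %s\n", try_charge(&tcp, 80) ? "ok" : "fail");
	printf("charge 40: %s\n", try_charge(&tcp, 40) ? "ok" : "fail");
	printf("under pressure: %d\n", tcp.under_pressure);
	uncharge(&tcp, 60);
	printf("under pressure after uncharge: %d\n", tcp.under_pressure);
	return 0;
}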

> Moreover, frankly I still have my reservations about this vmpressure
> propagation to skb you're proposing. It might work, but I doubt it
> will allow us to throw away the explicit tcp limit, as I explained
> previously. So, even with your approach I think we may still need a
> per-memcg tcp limit *unless* you get rid of the global tcp limit
> somehow.

Having the hard limit as a failsafe (or a minimum for other consumers)
is one thing, and certainly something I'm open to for cgroupv2, should
we have problems with load startup after a socket memory landgrab.

That being said, if the VM is struggling to reclaim pages, or is even
swapping, it makes perfect sense to let the socket memory scheduler
know it shouldn't continue to increase its footprint until the VM
recovers, regardless of any hard limits or minimum guarantees.

This is what my patch does, and it seems pretty straightforward to
me. I don't really understand why this is so controversial.
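
As a rough illustration of that signal, here is a minimal userspace
sketch of the intended behavior: when the VM reports reclaim pressure,
the socket side simply declines to enlarge its footprint until the
pressure clears. The names and the per-socket cap are invented for the
example; this is not the actual vmpressure or socket code.

/* Toy model: VM reclaim pressure stops socket buffers from growing.
 * Illustrative only; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

/* Would be fed by reclaim/vmpressure in the real thing. */
static bool vm_under_pressure;

/* May this socket grow its buffer by 'extra' bytes, given a per-socket cap? */
static bool sk_may_expand(long bufsize, long extra, long cap)
{
	if (vm_under_pressure)
		return false;        /* hold the footprint until the VM recovers */
	return bufsize + extra <= cap;
}

int main(void)
{
	long rcvbuf = 1 << 16;       /* 64K receive buffer */
	long cap = 4 << 20;          /* arbitrary 4M per-socket ceiling */

	vm_under_pressure = false;
	printf("grow allowed: %d\n", sk_may_expand(rcvbuf, 4096, cap));

	vm_under_pressure = true;    /* reclaim kicked in, maybe even swapping */
	printf("grow allowed under pressure: %d\n", sk_may_expand(rcvbuf, 4096, cap));
	return 0;
}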

The *next* step would be to figure out whether we can actually
*reclaim* memory in the network subsystem--shrink windows and steal
buffers back--and that might even be an avenue to replace tcp window
limits. But it's not necessary for *this* patch series to be useful.

> > So this seemed like a good way to prove a new mechanism before rolling
> > it out to every single Linux setup, rather than switch everybody over
> > after the limited scope testing I can do as a developer on my own.
> > 
> > Keep in mind that my patches are not committing anything in terms of
> > interface, so we retain all the freedom to fix and tune the way this
> > is implemented, including the freedom to re-add tcp window limits in
> > case the pressure balancing is not a comprehensive solution.
> 
> I really dislike this kind of proof. It looks like you're trying to
> push something you think is right covertly, w/o having a proper
> discussion with networking people and then say that it just works
> and hence should be done globally, but what if it won't? Revert it?
> We already have a lot of dubious stuff in memcg that should be
> reverted, so let's please try to avoid this kind of mistake in the
> future. Note, I say "w/o having a proper discussion with networking
> people", because I don't think they will really care *unless* you
> change the global logic, simply because most of them aren't very
> interested in memcg AFAICS.

Come on, Dave is the first To and netdev is CC'd. They might not care
about memcg, but "pushing things covertly" is a bit of a stretch.

> That effectively means you lose a chance to listen to networking
> experts, who could point you at design flaws and propose an improvement
> right away. Let's please not miss such an opportunity. You said that
> you'd seen this problem happen w/o cgroups, so you have a use case that
> might need fixing at the global level. IMO it shouldn't be difficult to
> prepare an RFC patch for the global case first and see what people think
> about it.

No, the problem we are running into is that network memory is not
tracked per cgroup. The lack of containment means that the socket
memory consumption of individual cgroups can trigger a system OOM.

We tried using the per-memcg tcp limits, and that prevents the OOMs
for sure, but it's horrendous for network performance. There is no
"stop growing" phase; it just keeps going full throttle until it hits
the wall hard.
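
To illustrate the difference, here is a toy model contrasting a hard
cutoff (full throttle until the wall, then nothing) with a
pressure-style throttle that tapers growth as usage approaches the
limit. The numbers and the tapering policy are invented for the
example.

/* Toy comparison: hard cutoff vs. pressure-moderated growth.
 * Illustrative only; not kernel code. */
#include <stdio.h>

#define LIMIT 100L

/* Hard limit: full speed until the wall, then nothing. */
static long grow_hard(long usage, long want)
{
	return (usage + want <= LIMIT) ? want : 0;
}

/* Pressure-style throttle: the grant shrinks as usage nears the limit. */
static long grow_throttled(long usage, long want)
{
	long headroom = LIMIT - usage;

	if (headroom <= 0)
		return 0;
	return want < headroom / 2 ? want : headroom / 2;
}

int main(void)
{
	for (long usage = 0; usage <= LIMIT; usage += 20)
		printf("usage=%3ld  hard=%3ld  throttled=%3ld\n",
		       usage, grow_hard(usage, 20), grow_throttled(usage, 20));
	return 0;
}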

Now, we could probably try to replicate the global knobs and add a
per-memcg soft limit. But you know better than anyone else how hard it
is to estimate the overall working set size of a workload, and the
margins on containerized loads are razor-thin. Performance is much
more sensitive to input errors, and often the parameters must be
adjusted continuously during the runtime of a workload. It'd be
disastrous to rely on yet more static, error-prone user input here.

What all this means to me is that fixing it at the cgroup level has
higher priority. But it also means that once we've figured it out in
such a high-pressure environment, it's much easier to apply it to the
global case and potentially replace the soft limit there.

This seems like a better approach to me than starting globally, only
to realize that the solution is not workable for cgroups and we need
yet something else.
