From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Shakeel Butt <shakeelb@google.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, netdev@vger.kernel.org,
	kernel-team@fb.com
Subject: Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
Date: Wed, 23 Oct 2019 11:46:18 -0400	[thread overview]
Message-ID: <20191023154618.GA366316@cmpxchg.org> (raw)
In-Reply-To: <20191023064012.GB754@dhcp22.suse.cz>

On Wed, Oct 23, 2019 at 08:40:12AM +0200, Michal Hocko wrote:
> On Tue 22-10-19 19:37:08, Johannes Weiner wrote:
> > While upgrading from 4.16 to 5.2, we noticed these allocation errors
> > in the log of the new kernel:
> > 
> > [ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
> > [ 8642.269170]   cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
> > [ 8642.293009]   node 0: slabs: 5, objs: 170, free: 0
> > 
> >         slab_out_of_memory+1
> >         ___slab_alloc+969
> >         __slab_alloc+14
> >         kmem_cache_alloc+346
> >         inet_twsk_alloc+60
> >         tcp_time_wait+46
> >         tcp_fin+206
> >         tcp_data_queue+2034
> >         tcp_rcv_state_process+784
> >         tcp_v6_do_rcv+405
> >         __release_sock+118
> >         tcp_close+385
> >         inet_release+46
> >         __sock_release+55
> >         sock_close+17
> >         __fput+170
> >         task_work_run+127
> >         exit_to_usermode_loop+191
> >         do_syscall_64+212
> >         entry_SYSCALL_64_after_hwframe+68
> > 
> > accompanied by an increase in machines going completely radio silent
> > under memory pressure.
> 
> This is really worrying because it suggests that something depends on
> GFP_ATOMIC allocations succeeding, which is fragile and broken.

I don't think that is true. You cannot rely on a *single instance* of
atomic allocations to succeed. But you have to be able to rely on
failure being temporary, with a chance of succeeding eventually.

Network is a good example. It retries transmits, but within reason. If
you aren't able to process incoming packets for minutes, you might as
well be dead.

> > One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
> > sock objects to kmemcg"), which made these slab caches subject to
> > cgroup memory accounting and control.
> > 
> > The problem with that is that cgroups, unlike the page allocator, do
> > not maintain dedicated atomic reserves. As a cgroup's usage hovers at
> > its limit, atomic allocations - such as those done during network rx - can
> > fail consistently for extended periods of time. The kernel is not able
> > to operate under these conditions.
> > 
> > We don't want to revert the culprit patch, because it indeed tracks a
> > potentially substantial amount of memory used by a cgroup.
> > 
> > We also don't want to implement dedicated atomic reserves for cgroups.
> > There is no point in keeping a fixed margin of unused bytes in the
> > cgroup's memory budget to accommodate a consumer that is impossible to
> > predict - we'd be wasting memory and getting into configuration
> > headaches, not unlike what we have going with min_free_kbytes. We do
> > this for physical memory because we have to, but cgroups are an
> > accounting game.
> > 
> > Instead, account these privileged allocations to the cgroup, but let
> > them bypass the configured limit if they have to. This way, we get the
> > benefits of accounting the consumed memory and have it exert pressure
> > on the rest of the cgroup, but like with the page allocator, we shift
> > the burden of reclaiming on behalf of atomic allocations onto the
> > regular allocations that can block.
> 
> On the other hand, this would allow the isolation to be broken by an
> unpredictable amount. Should we put a simple cap on how much we can go
> over the limit? If the memcg limit reclaim is not able to keep up with
> those overflows then even __GFP_ATOMIC allocations have to fail. What do
> you think?

I don't expect a big overrun in practice, and it appears that Google
has been letting even NOWAIT allocations pass through without
isolation issues. Likewise, we have been force-charging the skmem for
a while now, and reclaim has had no trouble keeping up.

My experience from production is that it's a whole lot easier to debug
something like a memory.max overrun than it is to debug a machine that
won't respond to networking. So that's the side I would err on.
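
To make "bypass the configured limit" concrete, here is a toy userspace
sketch of the intended charge behavior. This is not the real
mm/memcontrol.c code; all names (toy_counter, toy_try_charge,
TOY_ATOMIC) are made up for illustration only:

#include <stdbool.h>
#include <stdio.h>

struct toy_counter {
	unsigned long usage;	/* bytes currently charged to the group */
	unsigned long limit;	/* configured memory.max */
};

#define TOY_ATOMIC 0x1		/* caller cannot sleep, must not fail lightly */

/* Stand-in for reclaim: the kernel would try to free group memory here. */
static bool toy_reclaim(struct toy_counter *c, unsigned long need)
{
	(void)c;
	(void)need;
	return false;		/* assume nothing is reclaimable right now */
}

static int toy_try_charge(struct toy_counter *c, unsigned long nr, int flags)
{
	if (c->usage + nr <= c->limit || toy_reclaim(c, nr)) {
		c->usage += nr;		/* fits (possibly after reclaim) */
		return 0;
	}

	if (flags & TOY_ATOMIC) {
		/*
		 * Privileged charge: still account it, but let it pass the
		 * limit.  The overrun exerts pressure on the next regular
		 * (blockable) charge, which can reclaim on its behalf.
		 */
		c->usage += nr;
		return 0;
	}

	return -1;			/* regular charge fails at the limit */
}

int main(void)
{
	struct toy_counter grp = { .usage = 95, .limit = 100 };

	printf("regular charge: %d\n", toy_try_charge(&grp, 10, 0));
	printf("atomic charge:  %d\n", toy_try_charge(&grp, 10, TOY_ATOMIC));
	printf("usage=%lu limit=%lu\n", grp.usage, grp.limit);
	return 0;
}

In this toy model, the cap you suggest would be one extra bound check in
the bypass branch - fail the charge once the overrun exceeds some margin -
but as noted above, reclaim has been keeping up without one.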

Thread overview:
2019-10-22 23:37 [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges Johannes Weiner
2019-10-23  0:08 ` Shakeel Butt
2019-10-23  6:40 ` Michal Hocko
2019-10-23 15:46   ` Johannes Weiner [this message]
2019-10-23 17:38     ` Shakeel Butt
2019-10-24  8:14       ` Michal Hocko
