linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] net: add big honking pfmemalloc OOM warning
@ 2019-04-10 10:19 Juha-Matti Tilli
  2019-04-10 14:16 ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Juha-Matti Tilli @ 2019-04-10 10:19 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: juha-matti.tilli, Eric Dumazet, Rafael Aquini, Murphy Zhou,
	Yongcheng Yang, Jianhong Yin

A system administrator is not notified (except via an obscure SNMP
counter that most sysadmins don't know to look for) if packets are
dropped due to out-of-memory condition when SKBs use pfmemalloc
reserves. This can for example lead to NFS connections hanging on
high-volume systems.

Implement a ratelimited big honking out of memory warning that directs
the sysadmin to bump up vm.min_free_kbytes in case this problem happens.
Our experience shows that with default vm.min_free_kbytes (90112, or
about 90 megabytes), NFS connections hang approximately once per day,
whereas with 901120 (default multiplied by 10) NFS connections never
hang.

Signed-off-by: Juha-Matti Tilli <juha-matti.tilli@foreca.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Murphy Zhou <xzhou@redhat.com>
Cc: Yongcheng Yang <yoyang@redhat.com>
Cc: Jianhong Yin <jiyin@redhat.com>
---
 net/core/filter.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index fc92ebc4e200..7d8ef239b9af 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -99,6 +99,8 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 	 * helping free memory
 	 */
 	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
+		printk_ratelimited(KERN_WARNING
+				   "dropped packet due to out-of-memory condition, please bump up vm.min_free_kbytes\n");
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
 		return -ENOMEM;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: add big honking pfmemalloc OOM warning
  2019-04-10 10:19 [PATCH] net: add big honking pfmemalloc OOM warning Juha-Matti Tilli
@ 2019-04-10 14:16 ` Eric Dumazet
  2019-04-10 15:01   ` Juha-Matti Tilli
  2019-04-10 19:11   ` David Miller
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Dumazet @ 2019-04-10 14:16 UTC (permalink / raw)
  To: Juha-Matti Tilli
  Cc: LKML, netdev, Rafael Aquini, Murphy Zhou, Yongcheng Yang, Jianhong Yin

On Wed, Apr 10, 2019 at 3:20 AM Juha-Matti Tilli
<juha-matti.tilli@foreca.com> wrote:
>
> A system administrator is not notified (except via an obscure SNMP
> counter that most sysadmins don't know to look for) if packets are
> dropped due to out-of-memory condition when SKBs use pfmemalloc
> reserves. This can for example lead to NFS connections hanging on
> high-volume systems.
>
> Implement a ratelimited big honking out of memory warning that directs
> the sysadmin to bump up vm.min_free_kbytes in case this problem happens.
> Our experience shows that with default vm.min_free_kbytes (90112, or
> about 90 megabytes), NFS connections hang approximately once per day,
> whereas with 901120 (default multiplied by 10) NFS connections never
> hang.
>

If NFS sessions hang, then there is a bug to eventually root cause and fix.

Just telling the user : Increase the limit is the same thing than admitting :

Our limit system or TCP or NFS stacks are broken and unable to
recover, so lets disable the limit system and work around a more
serious bug.

Maybe the bug is in a NIC driver, please share more details before
adding yet another noisy signal in syslog

SNMP counters are per netns, and more useful in the modern computing
era,  where a host is shared by many different containers.



>
> Signed-off-by: Juha-Matti Tilli <juha-matti.tilli@foreca.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Rafael Aquini <aquini@redhat.com>
> Cc: Murphy Zhou <xzhou@redhat.com>
> Cc: Yongcheng Yang <yoyang@redhat.com>
> Cc: Jianhong Yin <jiyin@redhat.com>
> ---
>  net/core/filter.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index fc92ebc4e200..7d8ef239b9af 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -99,6 +99,8 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>          * helping free memory
>          */
>         if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
> +               printk_ratelimited(KERN_WARNING
> +                                  "dropped packet due to out-of-memory condition, please bump up vm.min_free_kbytes\n");
>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
>                 return -ENOMEM;
>         }
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: add big honking pfmemalloc OOM warning
  2019-04-10 14:16 ` Eric Dumazet
@ 2019-04-10 15:01   ` Juha-Matti Tilli
  2019-04-10 15:36     ` Eric Dumazet
  2019-04-10 19:11   ` David Miller
  1 sibling, 1 reply; 8+ messages in thread
From: Juha-Matti Tilli @ 2019-04-10 15:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, netdev, Rafael Aquini, Murphy Zhou, Yongcheng Yang, Jianhong Yin

On Wed, Apr 10, 2019 at 5:16 PM Eric Dumazet <edumazet@google.com> wrote:
> If NFS sessions hang, then there is a bug to eventually root cause and fix.
>
> Just telling the user : Increase the limit is the same thing than admitting :
>
> Our limit system or TCP or NFS stacks are broken and unable to
> recover, so lets disable the limit system and work around a more
> serious bug.
>
> Maybe the bug is in a NIC driver, please share more details before
> adding yet another noisy signal in syslog
>
> SNMP counters are per netns, and more useful in the modern computing
> era,  where a host is shared by many different containers.

Any idea where the bug might be?

It can't be in NFS, because I have observed the issue to be a TCP
level issue. NFS would be working just fine if TCP worked, but the
underlying TCP connection is not working fine, unless we bump up
vm.min_free_kbytes.

It could be in ixgbe, because the incoming SKB gets pfmemalloc pages
for some reason, and that happens repeatedly for a duration of 5-10
minutes for every single retransmit, until the condition clears. Ping
is working just fine at the time the NFS connection is stuck. I think
these 63-queue NICs use different queue for ping than they use for the
TCP NFS connection. I think there is some code in ixgbe for not
reusing pfmemalloc pages, but it seems every packet nevertheless gets
a pfmemalloc page in the queue that is used for TCP NFS. Might the
cause be that if ixgbe gets the pages in large bunches, it gets
multiple pfmemalloc pages at a time and then every packet is dropped
until all the pfmemalloc pages run out (not being reused)?

It could also be in the default value of vm.min_free_kbytes, but I'm
not experienced enough in Linux kernel internals to adjust the complex
calculations. Just saying that 90 MB sounds ridiculously low on a 256
GB NUMA machine.

Are you of the opinion that Intel as the developer of ixgbe should be informed?

Anyway, I posted more details to the mailing lists about a week ago,
search for "NFS hang, sk_drops increasing, segs_in not, pfmemalloc
suspected" in the mailing lists, or click this direct link:
https://lkml.org/lkml/2019/4/3/682

The current situation is that we've been running the production system
for 2 weeks with a bumped-up vm.min_free_kbytes, no NFS hangs, whereas
before the bump, we had approximately one hang per day, so without the
bump, the period of 2 weeks would have approximately 14 NFS hangs.

To me, this OOM condition seems to be global, so having it per-netns
offers no clear benefit in my opinion. Or is vm.min_free_kbytes per
container tunable?

BR, Juha-Matti

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: add big honking pfmemalloc OOM warning
  2019-04-10 15:01   ` Juha-Matti Tilli
@ 2019-04-10 15:36     ` Eric Dumazet
  2019-04-11  6:26       ` Juha-Matti Tilli
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2019-04-10 15:36 UTC (permalink / raw)
  To: Juha-Matti Tilli
  Cc: LKML, netdev, Rafael Aquini, Murphy Zhou, Yongcheng Yang, Jianhong Yin

On Wed, Apr 10, 2019 at 8:01 AM Juha-Matti Tilli
<juha-matti.tilli@foreca.com> wrote:
>
> On Wed, Apr 10, 2019 at 5:16 PM Eric Dumazet <edumazet@google.com> wrote:
> > If NFS sessions hang, then there is a bug to eventually root cause and fix.
> >
> > Just telling the user : Increase the limit is the same thing than admitting :
> >
> > Our limit system or TCP or NFS stacks are broken and unable to
> > recover, so lets disable the limit system and work around a more
> > serious bug.
> >
> > Maybe the bug is in a NIC driver, please share more details before
> > adding yet another noisy signal in syslog
> >
> > SNMP counters are per netns, and more useful in the modern computing
> > era,  where a host is shared by many different containers.
>
> Any idea where the bug might be?
>

Before diving into the details, can we first double check which exact
kernel version you are using ?

In the past some pfmemalloc bugs have been solved, I do not want spend
time finding them a second time.



> It can't be in NFS, because I have observed the issue to be a TCP
> level issue. NFS would be working just fine if TCP worked, but the
> underlying TCP connection is not working fine, unless we bump up
> vm.min_free_kbytes.
>
> It could be in ixgbe, because the incoming SKB gets pfmemalloc pages
> for some reason, and that happens repeatedly for a duration of 5-10
> minutes for every single retransmit, until the condition clears. Ping
> is working just fine at the time the NFS connection is stuck. I think
> these 63-queue NICs use different queue for ping than they use for the
> TCP NFS connection. I think there is some code in ixgbe for not
> reusing pfmemalloc pages, but it seems every packet nevertheless gets
> a pfmemalloc page in the queue that is used for TCP NFS. Might the
> cause be that if ixgbe gets the pages in large bunches, it gets
> multiple pfmemalloc pages at a time and then every packet is dropped
> until all the pfmemalloc pages run out (not being reused)?
>
> It could also be in the default value of vm.min_free_kbytes, but I'm
> not experienced enough in Linux kernel internals to adjust the complex
> calculations. Just saying that 90 MB sounds ridiculously low on a 256
> GB NUMA machine.
>
> Are you of the opinion that Intel as the developer of ixgbe should be informed?
>
> Anyway, I posted more details to the mailing lists about a week ago,
> search for "NFS hang, sk_drops increasing, segs_in not, pfmemalloc
> suspected" in the mailing lists, or click this direct link:
> https://lkml.org/lkml/2019/4/3/682
>
> The current situation is that we've been running the production system
> for 2 weeks with a bumped-up vm.min_free_kbytes, no NFS hangs, whereas
> before the bump, we had approximately one hang per day, so without the
> bump, the period of 2 weeks would have approximately 14 NFS hangs.
>
> To me, this OOM condition seems to be global, so having it per-netns
> offers no clear benefit in my opinion. Or is vm.min_free_kbytes per
> container tunable?
>
> BR, Juha-Matti

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: add big honking pfmemalloc OOM warning
  2019-04-10 14:16 ` Eric Dumazet
  2019-04-10 15:01   ` Juha-Matti Tilli
@ 2019-04-10 19:11   ` David Miller
  2019-04-11  6:51     ` Juha-Matti Tilli
  1 sibling, 1 reply; 8+ messages in thread
From: David Miller @ 2019-04-10 19:11 UTC (permalink / raw)
  To: edumazet
  Cc: juha-matti.tilli, linux-kernel, netdev, aquini, xzhou, yoyang, jiyin

From: Eric Dumazet <edumazet@google.com>
Date: Wed, 10 Apr 2019 07:16:21 -0700

> SNMP counters are per netns, and more useful in the modern computing
> era,  where a host is shared by many different containers.

+1  There is no way I am applying this patch.

The kernel should not "big honking" anything in the logs.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: add big honking pfmemalloc OOM warning
  2019-04-10 15:36     ` Eric Dumazet
@ 2019-04-11  6:26       ` Juha-Matti Tilli
  2019-04-11 10:26         ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Juha-Matti Tilli @ 2019-04-11  6:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, netdev, Rafael Aquini, Murphy Zhou, Yongcheng Yang, Jianhong Yin

On Wed, Apr 10, 2019 at 6:36 PM Eric Dumazet <edumazet@google.com> wrote:
> Before diving into the details, can we first double check which exact
> kernel version you are using ?
>
> In the past some pfmemalloc bugs have been solved, I do not want spend
> time finding them a second time.

Of course I can tell the kernel version.

The NFS server is 3.10.0-862.14.4.el7.foreca.x86_64 where "foreca"
means we've applied a custom Samsung SSD TRIM patch, nothing else
different from regular CentOS / RHEL kernel.

The NFS client is 3.10.0-957.5.1.el7.x86_64.

The pfmemalloc problem happens at the server, not at the client.

The exact patches in this kernel may be a bit complicated to
determine, because RHEL apparently likes to maintain their own kernel
version. They do not have the SNMP counter even, which is why I opened
a bug report to RHEL Bugzilla. I have extracted source code, so if you
can tell a list of git commit SHA1 hashes that could fix the issue, I
can check manually whether they're there, and open bug reports to RHEL
bugzilla to have them backported if not there.

I at least see this in the exact kernel version I'm using:

static inline bool ixgbe_page_is_reserved(struct page *page)
{
        return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page);
}

...so ixgbe shouldn't be reusing pfmemalloc pages.

BR, Juha-Matti

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: add big honking pfmemalloc OOM warning
  2019-04-10 19:11   ` David Miller
@ 2019-04-11  6:51     ` Juha-Matti Tilli
  0 siblings, 0 replies; 8+ messages in thread
From: Juha-Matti Tilli @ 2019-04-11  6:51 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, Juha-Matti Tilli, LKML, netdev, Rafael Aquini,
	Murphy Zhou, Yongcheng Yang, Jianhong Yin

On Wed, Apr 10, 2019 at 10:11 PM David Miller <davem@davemloft.net> wrote:
> > SNMP counters are per netns, and more useful in the modern computing
> > era,  where a host is shared by many different containers.
>
> +1  There is no way I am applying this patch.
>
> The kernel should not "big honking" anything in the logs.

Just to check, is the opposition to the patch related to the
expectation that it will log the condition too often despite the rate
limit, if many packets are dropped? Because if it is, that might be
possible to fix.

I think it might be possible to check the SNMP counter value, and if
zero, log the first instance of pfmemalloc drop, and then omit logging
afterwards. There could be race conditions, so in the absolute worst
case, you could have let's say 2 or 3 of these log lines instead of 1,
but I don't see that as an issue, because 99% of the time there would
be just one, and 2 or 3 lines won't fill the logs.

In our case, the existence of such a log message and the helpful
suggestion to bump up vm.min_free_kbytes would have saved us
approximately one month of debugging (or 2-3 weeks if the SNMP counter
was there in this kernel version). Even one such log message would be
enough. Our production systems were hanging daily during this
debugging happening.

In my opinion, the ideal count of pfmemalloc drops is exactly 0, and
the interesting event is the first instance of pfmemalloc drop
occurring.

If there's a bug in the kernel, I think the user should be notified,
so I see this as similar to some WARN_ON line -- which is even more
"big honking" log event because it's associated with a backtrace.

BR, Juha-Matti

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: add big honking pfmemalloc OOM warning
  2019-04-11  6:26       ` Juha-Matti Tilli
@ 2019-04-11 10:26         ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2019-04-11 10:26 UTC (permalink / raw)
  To: Juha-Matti Tilli
  Cc: LKML, netdev, Rafael Aquini, Murphy Zhou, Yongcheng Yang, Jianhong Yin

On Wed, Apr 10, 2019 at 11:26 PM Juha-Matti Tilli
<juha-matti.tilli@foreca.com> wrote:
>
> On Wed, Apr 10, 2019 at 6:36 PM Eric Dumazet <edumazet@google.com> wrote:
> > Before diving into the details, can we first double check which exact
> > kernel version you are using ?
> >
> > In the past some pfmemalloc bugs have been solved, I do not want spend
> > time finding them a second time.
>
> Of course I can tell the kernel version.
>
> The NFS server is 3.10.0-862.14.4.el7.foreca.x86_64 where "foreca"
> means we've applied a custom Samsung SSD TRIM patch, nothing else
> different from regular CentOS / RHEL kernel.
>
> The NFS client is 3.10.0-957.5.1.el7.x86_64.
>
> The pfmemalloc problem happens at the server, not at the client.
>
> The exact patches in this kernel may be a bit complicated to
> determine, because RHEL apparently likes to maintain their own kernel
> version. They do not have the SNMP counter even, which is why I opened
> a bug report to RHEL Bugzilla. I have extracted source code, so if you
> can tell a list of git commit SHA1 hashes that could fix the issue, I
> can check manually whether they're there, and open bug reports to RHEL
> bugzilla to have them backported if not there.
>
> I at least see this in the exact kernel version I'm using:

Here (netdev/lkml) we deal with current kernels.

Chances are very high the bug(s) you are fighting was(were) already fixed.

Please repro the bug using latest David Miller net tree, otherwise we
will _not_ accept any patch.

Thank you

>
> static inline bool ixgbe_page_is_reserved(struct page *page)
> {
>         return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page);
> }
>
> ...so ixgbe shouldn't be reusing pfmemalloc pages.
>
> BR, Juha-Matti

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-04-11 10:26 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-10 10:19 [PATCH] net: add big honking pfmemalloc OOM warning Juha-Matti Tilli
2019-04-10 14:16 ` Eric Dumazet
2019-04-10 15:01   ` Juha-Matti Tilli
2019-04-10 15:36     ` Eric Dumazet
2019-04-11  6:26       ` Juha-Matti Tilli
2019-04-11 10:26         ` Eric Dumazet
2019-04-10 19:11   ` David Miller
2019-04-11  6:51     ` Juha-Matti Tilli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).