* UDP rx packet loss in a cgroup with a memory limit
@ 2022-08-16 18:52 Gražvydas Ignotas
From: Gražvydas Ignotas @ 2022-08-16 18:52 UTC (permalink / raw)
  To: cgroups

Hello,

I'm not sure if it's supposed to be like this, but I'm seeing it on
various hardware combinations/VMs and Debian kernel versions, as well
as on a self-compiled vanilla 5.19.1 that I just tried. It looks like
this only happens on cgroup v2:

Debian11/bullseye (cgroup v2), distro kernel: yes
Debian11/bullseye (cgroup v2), vanilla 5.19.1: yes
Debian10/buster (cgroup v1), bpo kernel: no
Debian10/buster (cgroup v2)*, bpo kernel: yes
* - booted with 'systemd.unified_cgroup_hierarchy=1' to enable cgroup v2

Basically, when there is git activity in the container with a memory
limit, other processes in the same container start to suffer (very)
occasional network issues (mostly DNS lookup failures). Neither git's
nor the other processes' memory usage seems to be anywhere near the
limit. The packet drops show up as the "Udp InErrors" counter
increasing in /proc/net/snmp, as well as the "drops" counter
increasing in /proc/net/udp. Some other random details about this:
- stopping git (its disk activity?) makes the packet loss stop
- tcpdump (run in the container itself) shows the packet arriving
correctly, without errors, but the process times out waiting for a
response
- if memory limit is removed the problem disappears
- if memory limit is set to host's RAM size, the problem disappears
- reducing dirty_ratio, dirty_background_ratio doesn't help

My recipe to reproduce:
- install kubernetes on a host machine with Debian11 and 32GB RAM
- create a debian9 container with 'resources: limits: memory: "8G"'
- in the container:

# run this:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
while git checkout linux-2.6.32.y && git checkout linux-5.19.y; do true; done
# at the same time in the same container:
while sleep .1; do host <remotehost>. > /dev/null; awk '/^Udp: [0-9]/{print $4}' /proc/net/snmp; done
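For reference, the InErrors counter can also be pulled out with a small parser instead of awk. This is only an illustrative helper (the function name and the sample data are mine, not part of the recipe); it follows the /proc/net/snmp layout of a header line followed by a value line per protocol:

```python
def snmp_field(snmp_text, proto, field):
    """Return a named counter from /proc/net/snmp-style text.

    Each protocol contributes two lines: a header naming the columns
    and a line with the corresponding values, both prefixed "Proto:".
    """
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith(proto + ":")]
    header, values = rows[0][1:], rows[1][1:]
    return int(values[header.index(field)])

# Hypothetical sample in the /proc/net/snmp format:
sample = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1024 2 17 980 17 0\n"
)
print(snmp_field(sample, "Udp", "InErrors"))  # 17
```

Polling this against the real /proc/net/snmp in a loop is equivalent to the awk one-liner above.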

The packet drop counter should start increasing after some time. The
effect is much stronger if the git repository is bigger and has
different multi-gigabyte files in those branches. Can something be
done to avoid this packet loss?

Gražvydas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: UDP rx packet loss in a cgroup with a memory limit
@ 2022-08-17 16:50 Gražvydas Ignotas
From: Gražvydas Ignotas @ 2022-08-17 16:50 UTC (permalink / raw)
  To: cgroups, Johannes Weiner

On Tue, Aug 16, 2022 at 9:52 PM Gražvydas Ignotas <notasas@gmail.com> wrote:
> Basically, when there is git activity in the container with a memory
> limit, other processes in the same container start to suffer (very)
> occasional network issues (mostly DNS lookup failures).

ok I've traced this and it's failing in try_charge_memcg(), which
doesn't seem to be trying too hard because it's called from irq
context.

Here is the backtrace:
 <IRQ>
 ? fib_validate_source+0xb4/0x100
 ? ip_route_input_slow+0xa11/0xb70
 mem_cgroup_charge_skmem+0x4b/0xf0
 __sk_mem_raise_allocated+0x17f/0x3e0
 __udp_enqueue_schedule_skb+0x220/0x270
 udp_queue_rcv_one_skb+0x330/0x5e0
 udp_unicast_rcv_skb+0x75/0x90
 __udp4_lib_rcv+0x1ba/0xca0
 ? ip_rcv_finish_core.constprop.0+0x63/0x490
 ip_protocol_deliver_rcu+0xd6/0x230
 ip_local_deliver_finish+0x73/0xa0
 __netif_receive_skb_one_core+0x8b/0xa0
 process_backlog+0x8e/0x120
 __napi_poll+0x2c/0x160
 net_rx_action+0x2a2/0x360
 ? rebalance_domains+0xeb/0x3b0
 __do_softirq+0xeb/0x2eb
 __irq_exit_rcu+0xb9/0x110
 sysvec_apic_timer_interrupt+0xa2/0xd0
 </IRQ>

Calling mem_cgroup_print_oom_meminfo() in such a case reveals:

memory: usage 7812476kB, limit 7812500kB, failcnt 775198
swap: usage 0kB, limit 0kB, failcnt 0
Memory cgroup stats for
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb8f4f0e9_fb95_4f2d_8443_e6a78f235c9a.slice/docker-9e7cad93b2e0774d49148474989b41fe6d67a5985d059d08d9d64495f1539a81.scope:
anon 348016640
file 7502163968
kernel 146997248
kernel_stack 327680
pagetables 2224128
percpu 0
sock 4096
vmalloc 0
shmem 0
zswap 0
zswapped 0
file_mapped 112041984
file_dirty 1181028352
file_writeback 2686976
swapcached 0
anon_thp 44040192
file_thp 0
shmem_thp 0
inactive_anon 350756864
active_anon 36864
inactive_file 3614003200
active_file 3888070656
unevictable 0
slab_reclaimable 143692600
slab_unreclaimable 545120
slab 144237720
workingset_refault_anon 0
workingset_refault_file 2318
workingset_activate_anon 0
workingset_activate_file 2318
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 334152
pgmajfault 1238
pgrefill 3400
pgscan 819608
pgsteal 791005
pgactivate 949122
pgdeactivate 1694
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 709
thp_collapse_alloc 0

So it basically renders UDP inoperable because of disk cache. I hope
this is not the intended behavior. Naturally booting with
cgroup.memory=nosocket solves this issue.
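To make the traced failure concrete, here is a toy model (my own sketch, not kernel code; the numbers are only shaped like the meminfo dump above): with usage pinned at the limit by reclaimable page cache, a charge attempt that may not block fails outright, while a blocking caller would simply reclaim cache and succeed:

```python
class Memcg:
    """Toy memcg: usage = reclaimable page cache + everything else."""

    def __init__(self, limit, file_cache, other):
        self.limit = limit
        self.file_cache = file_cache  # reclaimable disk cache
        self.other = other            # anon, slab, ... (not reclaimed here)

    @property
    def usage(self):
        return self.file_cache + self.other

    def try_charge(self, pages, may_reclaim):
        """Rough model of try_charge_memcg(); may_reclaim=False ~ GFP_NOWAIT."""
        if self.usage + pages > self.limit:
            if not may_reclaim:
                return False  # irq context: cannot wait for reclaim, give up
            # blocking context: drop enough clean cache to make room
            need = self.usage + pages - self.limit
            self.file_cache -= min(need, self.file_cache)
            if self.usage + pages > self.limit:
                return False
        self.other += pages  # account the charge
        return True

# Shaped like the dump above: ~7.8G limit, usage pinned there by cache
memcg = Memcg(limit=7812500, file_cache=7326332, other=486168)
print(memcg.try_charge(16, may_reclaim=False))  # False: the skb charge fails
print(memcg.try_charge(16, may_reclaim=True))   # True: reclaim makes room
```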

Gražvydas


* Re: UDP rx packet loss in a cgroup with a memory limit
@ 2022-08-17 17:13 Johannes Weiner
From: Johannes Weiner @ 2022-08-17 17:13 UTC (permalink / raw)
  To: Gražvydas Ignotas
  Cc: Wei Wang, Shakeel Butt, Michal Hocko, Roman Gushchin, linux-mm, cgroups

On Wed, Aug 17, 2022 at 07:50:13PM +0300, Gražvydas Ignotas wrote:
> On Tue, Aug 16, 2022 at 9:52 PM Gražvydas Ignotas <notasas@gmail.com> wrote:
> > Basically, when there is git activity in the container with a memory
> > limit, other processes in the same container start to suffer (very)
> > occasional network issues (mostly DNS lookup failures).
> 
> ok I've traced this and it's failing in try_charge_memcg(), which
> doesn't seem to be trying too hard because it's called from irq
> context.
> 
> Here is the backtrace:
>  <IRQ>
>  ? fib_validate_source+0xb4/0x100
>  ? ip_route_input_slow+0xa11/0xb70
>  mem_cgroup_charge_skmem+0x4b/0xf0
>  __sk_mem_raise_allocated+0x17f/0x3e0
>  __udp_enqueue_schedule_skb+0x220/0x270
>  udp_queue_rcv_one_skb+0x330/0x5e0
>  udp_unicast_rcv_skb+0x75/0x90
>  __udp4_lib_rcv+0x1ba/0xca0
>  ? ip_rcv_finish_core.constprop.0+0x63/0x490
>  ip_protocol_deliver_rcu+0xd6/0x230
>  ip_local_deliver_finish+0x73/0xa0
>  __netif_receive_skb_one_core+0x8b/0xa0
>  process_backlog+0x8e/0x120
>  __napi_poll+0x2c/0x160
>  net_rx_action+0x2a2/0x360
>  ? rebalance_domains+0xeb/0x3b0
>  __do_softirq+0xeb/0x2eb
>  __irq_exit_rcu+0xb9/0x110
>  sysvec_apic_timer_interrupt+0xa2/0xd0
>  </IRQ>
> 
> Calling mem_cgroup_print_oom_meminfo() in such a case reveals:
> 
> memory: usage 7812476kB, limit 7812500kB, failcnt 775198
> swap: usage 0kB, limit 0kB, failcnt 0
> Memory cgroup stats for
> /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb8f4f0e9_fb95_4f2d_8443_e6a78f235c9a.slice/docker-9e7cad93b2e0774d49148474989b41fe6d67a5985d059d08d9d64495f1539a81.scope:
> anon 348016640
> file 7502163968
> kernel 146997248
> kernel_stack 327680
> pagetables 2224128
> percpu 0
> sock 4096
> vmalloc 0
> shmem 0
> zswap 0
> zswapped 0
> file_mapped 112041984
> file_dirty 1181028352
> file_writeback 2686976
> swapcached 0
> anon_thp 44040192
> file_thp 0
> shmem_thp 0
> inactive_anon 350756864
> active_anon 36864
> inactive_file 3614003200
> active_file 3888070656
> unevictable 0
> slab_reclaimable 143692600
> slab_unreclaimable 545120
> slab 144237720
> workingset_refault_anon 0
> workingset_refault_file 2318
> workingset_activate_anon 0
> workingset_activate_file 2318
> workingset_restore_anon 0
> workingset_restore_file 0
> workingset_nodereclaim 0
> pgfault 334152
> pgmajfault 1238
> pgrefill 3400
> pgscan 819608
> pgsteal 791005
> pgactivate 949122
> pgdeactivate 1694
> pglazyfree 0
> pglazyfreed 0
> zswpin 0
> zswpout 0
> thp_fault_alloc 709
> thp_collapse_alloc 0
> 
> So it basically renders UDP inoperable because of disk cache. I hope
> this is not the intended behavior. Naturally booting with
> cgroup.memory=nosocket solves this issue.

This is most likely a regression caused by this patch:

commit 4b1327be9fe57443295ae86fe0fcf24a18469e9f
Author: Wei Wang <weiwan@google.com>
Date:   Tue Aug 17 12:40:03 2021 -0700

    net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
    
    Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
    to give more control to the networking stack and enable it to change
    memcg charging behavior. In the future, the networking stack may decide
    to avoid oom-kills when fallbacks are more appropriate.
    
    One behavior change in mem_cgroup_charge_skmem() by this patch is to
    avoid force charging by default and let the caller decide when and if
    force charging is needed through the presence or absence of
    __GFP_NOFAIL.
    
    Signed-off-by: Wei Wang <weiwan@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

We never used to fail these allocations. Cgroups don't have a
kswapd-style watermark reclaimer, so the network relied on
force-charging and leaving reclaim to allocations that can block.
Now it seems network packets could just fail indefinitely.

The changelog is a bit terse given how drastic the behavior change
is. Wei, Shakeel, can you fill in why this was changed? Can we revert
this for the time being?
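A sketch of the suspected behavior change (a toy model of the commit message above, not the actual kernel code; flag values and function shapes are illustrative): before the patch, mem_cgroup_charge_skmem() force-charged internally when the plain charge failed, so the memory was always accounted; after it, a plain GFP_NOWAIT charge near the limit just fails:

```python
GFP_NOWAIT   = 0x1  # illustrative flag bits, not the kernel's values
__GFP_NOFAIL = 0x2

def try_charge_memcg(state, pages, gfp):
    # charge succeeds within the limit, or unconditionally with NOFAIL
    if state["usage"] + pages <= state["limit"] or gfp & __GFP_NOFAIL:
        state["usage"] += pages
        return True
    return False  # GFP_NOWAIT at the limit: give up

def charge_skmem_before(state, pages):
    # pre-4b1327be: fall back to force charging, report failure
    if try_charge_memcg(state, pages, GFP_NOWAIT):
        return True
    try_charge_memcg(state, pages, GFP_NOWAIT | __GFP_NOFAIL)
    return False

def charge_skmem_after(state, pages, gfp=GFP_NOWAIT):
    # post-4b1327be: the caller decides; plain GFP_NOWAIT just fails
    return try_charge_memcg(state, pages, gfp)

s1 = {"usage": 1000, "limit": 1000}
print(charge_skmem_before(s1, 4), s1["usage"])  # False 1004: charged anyway
s2 = {"usage": 1000, "limit": 1000}
print(charge_skmem_after(s2, 4), s2["usage"])   # False 1000: nothing charged
```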



* Re: UDP rx packet loss in a cgroup with a memory limit
@ 2022-08-17 17:37 Shakeel Butt
From: Shakeel Butt @ 2022-08-17 17:37 UTC (permalink / raw)
  To: Johannes Weiner, Eric Dumazet, netdev
  Cc: Gražvydas Ignotas, Wei Wang, Michal Hocko, Roman Gushchin,
	Linux MM, Cgroups

+ Eric and netdev

On Wed, Aug 17, 2022 at 10:13 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Aug 17, 2022 at 07:50:13PM +0300, Gražvydas Ignotas wrote:
> > On Tue, Aug 16, 2022 at 9:52 PM Gražvydas Ignotas <notasas@gmail.com> wrote:
> > > Basically, when there is git activity in the container with a memory
> > > limit, other processes in the same container start to suffer (very)
> > > occasional network issues (mostly DNS lookup failures).
> >
> > ok I've traced this and it's failing in try_charge_memcg(), which
> > doesn't seem to be trying too hard because it's called from irq
> > context.
> >
> > Here is the backtrace:
> >  <IRQ>
> >  ? fib_validate_source+0xb4/0x100
> >  ? ip_route_input_slow+0xa11/0xb70
> >  mem_cgroup_charge_skmem+0x4b/0xf0
> >  __sk_mem_raise_allocated+0x17f/0x3e0
> >  __udp_enqueue_schedule_skb+0x220/0x270
> >  udp_queue_rcv_one_skb+0x330/0x5e0
> >  udp_unicast_rcv_skb+0x75/0x90
> >  __udp4_lib_rcv+0x1ba/0xca0
> >  ? ip_rcv_finish_core.constprop.0+0x63/0x490
> >  ip_protocol_deliver_rcu+0xd6/0x230
> >  ip_local_deliver_finish+0x73/0xa0
> >  __netif_receive_skb_one_core+0x8b/0xa0
> >  process_backlog+0x8e/0x120
> >  __napi_poll+0x2c/0x160
> >  net_rx_action+0x2a2/0x360
> >  ? rebalance_domains+0xeb/0x3b0
> >  __do_softirq+0xeb/0x2eb
> >  __irq_exit_rcu+0xb9/0x110
> >  sysvec_apic_timer_interrupt+0xa2/0xd0
> >  </IRQ>
> >
> > Calling mem_cgroup_print_oom_meminfo() in such a case reveals:
> >
> > memory: usage 7812476kB, limit 7812500kB, failcnt 775198
> > swap: usage 0kB, limit 0kB, failcnt 0
> > Memory cgroup stats for
> > /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb8f4f0e9_fb95_4f2d_8443_e6a78f235c9a.slice/docker-9e7cad93b2e0774d49148474989b41fe6d67a5985d059d08d9d64495f1539a81.scope:
> > anon 348016640
> > file 7502163968
> > kernel 146997248
> > kernel_stack 327680
> > pagetables 2224128
> > percpu 0
> > sock 4096
> > vmalloc 0
> > shmem 0
> > zswap 0
> > zswapped 0
> > file_mapped 112041984
> > file_dirty 1181028352
> > file_writeback 2686976
> > swapcached 0
> > anon_thp 44040192
> > file_thp 0
> > shmem_thp 0
> > inactive_anon 350756864
> > active_anon 36864
> > inactive_file 3614003200
> > active_file 3888070656
> > unevictable 0
> > slab_reclaimable 143692600
> > slab_unreclaimable 545120
> > slab 144237720
> > workingset_refault_anon 0
> > workingset_refault_file 2318
> > workingset_activate_anon 0
> > workingset_activate_file 2318
> > workingset_restore_anon 0
> > workingset_restore_file 0
> > workingset_nodereclaim 0
> > pgfault 334152
> > pgmajfault 1238
> > pgrefill 3400
> > pgscan 819608
> > pgsteal 791005
> > pgactivate 949122
> > pgdeactivate 1694
> > pglazyfree 0
> > pglazyfreed 0
> > zswpin 0
> > zswpout 0
> > thp_fault_alloc 709
> > thp_collapse_alloc 0
> >
> > So it basically renders UDP inoperable because of disk cache. I hope
> > this is not the intended behavior. Naturally booting with
> > cgroup.memory=nosocket solves this issue.
>
> This is most likely a regression caused by this patch:
>
> commit 4b1327be9fe57443295ae86fe0fcf24a18469e9f
> Author: Wei Wang <weiwan@google.com>
> Date:   Tue Aug 17 12:40:03 2021 -0700
>
>     net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
>
>     Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
>     to give more control to the networking stack and enable it to change
>     memcg charging behavior. In the future, the networking stack may decide
>     to avoid oom-kills when fallbacks are more appropriate.
>
>     One behavior change in mem_cgroup_charge_skmem() by this patch is to
>     avoid force charging by default and let the caller decide when and if
>     force charging is needed through the presence or absence of
>     __GFP_NOFAIL.
>
>     Signed-off-by: Wei Wang <weiwan@google.com>
>     Reviewed-by: Shakeel Butt <shakeelb@google.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>
> We never used to fail these allocations. Cgroups don't have a
> kswapd-style watermark reclaimer, so the network relied on
> force-charging and leaving reclaim to allocations that can block.
> Now it seems network packets could just fail indefinitely.
>
> The changelog is a bit terse given how drastic the behavior change
> is. Wei, Shakeel, can you fill in why this was changed? Can we revert
> this for the time being?

Does reverting the patch fix the issue? I don't think it will, however.

Please note that we still have the force charging as before this
patch. Previously, when mem_cgroup_charge_skmem() force-charged, it
returned false and __sk_mem_raise_allocated() took the
suppress_allocation code path. Based on some heuristics, it would
either allow the allocation or uncharge and return failure.

The given patch has not changed any heuristic. It has only changed
when forced charging happens. After the patch, the initial call to
mem_cgroup_charge_skmem() can fail; we then take the
suppress_allocation code path and, if the heuristics allow, we force
charge with __GFP_NOFAIL.


* Re: UDP rx packet loss in a cgroup with a memory limit
@ 2022-08-17 18:16 Wei Wang
From: Wei Wang @ 2022-08-17 18:16 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Eric Dumazet, netdev, Gražvydas Ignotas,
	Michal Hocko, Roman Gushchin, Linux MM, Cgroups

On Wed, Aug 17, 2022 at 10:37 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> + Eric and netdev
>
> On Wed, Aug 17, 2022 at 10:13 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Aug 17, 2022 at 07:50:13PM +0300, Gražvydas Ignotas wrote:
> > > On Tue, Aug 16, 2022 at 9:52 PM Gražvydas Ignotas <notasas@gmail.com> wrote:
> > > > Basically, when there is git activity in the container with a memory
> > > > limit, other processes in the same container start to suffer (very)
> > > > occasional network issues (mostly DNS lookup failures).
> > >
> > > ok I've traced this and it's failing in try_charge_memcg(), which
> > > doesn't seem to be trying too hard because it's called from irq
> > > context.
> > >
> > > Here is the backtrace:
> > >  <IRQ>
> > >  ? fib_validate_source+0xb4/0x100
> > >  ? ip_route_input_slow+0xa11/0xb70
> > >  mem_cgroup_charge_skmem+0x4b/0xf0
> > >  __sk_mem_raise_allocated+0x17f/0x3e0
> > >  __udp_enqueue_schedule_skb+0x220/0x270
> > >  udp_queue_rcv_one_skb+0x330/0x5e0
> > >  udp_unicast_rcv_skb+0x75/0x90
> > >  __udp4_lib_rcv+0x1ba/0xca0
> > >  ? ip_rcv_finish_core.constprop.0+0x63/0x490
> > >  ip_protocol_deliver_rcu+0xd6/0x230
> > >  ip_local_deliver_finish+0x73/0xa0
> > >  __netif_receive_skb_one_core+0x8b/0xa0
> > >  process_backlog+0x8e/0x120
> > >  __napi_poll+0x2c/0x160
> > >  net_rx_action+0x2a2/0x360
> > >  ? rebalance_domains+0xeb/0x3b0
> > >  __do_softirq+0xeb/0x2eb
> > >  __irq_exit_rcu+0xb9/0x110
> > >  sysvec_apic_timer_interrupt+0xa2/0xd0
> > >  </IRQ>
> > >
> > > Calling mem_cgroup_print_oom_meminfo() in such a case reveals:
> > >
> > > memory: usage 7812476kB, limit 7812500kB, failcnt 775198
> > > swap: usage 0kB, limit 0kB, failcnt 0
> > > Memory cgroup stats for
> > > /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb8f4f0e9_fb95_4f2d_8443_e6a78f235c9a.slice/docker-9e7cad93b2e0774d49148474989b41fe6d67a5985d059d08d9d64495f1539a81.scope:
> > > anon 348016640
> > > file 7502163968
> > > kernel 146997248
> > > kernel_stack 327680
> > > pagetables 2224128
> > > percpu 0
> > > sock 4096
> > > vmalloc 0
> > > shmem 0
> > > zswap 0
> > > zswapped 0
> > > file_mapped 112041984
> > > file_dirty 1181028352
> > > file_writeback 2686976
> > > swapcached 0
> > > anon_thp 44040192
> > > file_thp 0
> > > shmem_thp 0
> > > inactive_anon 350756864
> > > active_anon 36864
> > > inactive_file 3614003200
> > > active_file 3888070656
> > > unevictable 0
> > > slab_reclaimable 143692600
> > > slab_unreclaimable 545120
> > > slab 144237720
> > > workingset_refault_anon 0
> > > workingset_refault_file 2318
> > > workingset_activate_anon 0
> > > workingset_activate_file 2318
> > > workingset_restore_anon 0
> > > workingset_restore_file 0
> > > workingset_nodereclaim 0
> > > pgfault 334152
> > > pgmajfault 1238
> > > pgrefill 3400
> > > pgscan 819608
> > > pgsteal 791005
> > > pgactivate 949122
> > > pgdeactivate 1694
> > > pglazyfree 0
> > > pglazyfreed 0
> > > zswpin 0
> > > zswpout 0
> > > thp_fault_alloc 709
> > > thp_collapse_alloc 0
> > >
> > > So it basically renders UDP inoperable because of disk cache. I hope
> > > this is not the intended behavior. Naturally booting with
> > > cgroup.memory=nosocket solves this issue.
> >
> > This is most likely a regression caused by this patch:
> >
> > commit 4b1327be9fe57443295ae86fe0fcf24a18469e9f
> > Author: Wei Wang <weiwan@google.com>
> > Date:   Tue Aug 17 12:40:03 2021 -0700
> >
> >     net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
> >
> >     Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> >     to give more control to the networking stack and enable it to change
> >     memcg charging behavior. In the future, the networking stack may decide
> >     to avoid oom-kills when fallbacks are more appropriate.
> >
> >     One behavior change in mem_cgroup_charge_skmem() by this patch is to
> >     avoid force charging by default and let the caller decide when and if
> >     force charging is needed through the presence or absence of
> >     __GFP_NOFAIL.
> >
> >     Signed-off-by: Wei Wang <weiwan@google.com>
> >     Reviewed-by: Shakeel Butt <shakeelb@google.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > We never used to fail these allocations. Cgroups don't have a
> > kswapd-style watermark reclaimer, so the network relied on
> > force-charging and leaving reclaim to allocations that can block.
> > Now it seems network packets could just fail indefinitely.
> >
> > The changelog is a bit terse given how drastic the behavior change
> > is. Wei, Shakeel, can you fill in why this was changed? Can we revert
> > this for the time being?
>
> Does reverting the patch fix the issue? However I don't think it will.
>
> Please note that we still have the force charging as before this
> patch. Previously when mem_cgroup_charge_skmem() force charges, it
> returns false and __sk_mem_raise_allocated takes suppress_allocation
> code path. Based on some heuristics, it may allow it or it may
> uncharge and return failure.

The force charging logic in __sk_mem_raise_allocated() only gets
considered on the tx path for STREAM sockets, so it does not take
effect on the UDP receive path. And that logic is NOT altered by the
above patch.

So, specifically for the UDP receive path, what happens in
__sk_mem_raise_allocated() BEFORE the above patch is:
- mem_cgroup_charge_skmem() gets called:
    - try_charge() with GFP_NOWAIT gets called and fails
    - try_charge() gets called again with __GFP_NOFAIL (force charge)
    - return false
- goto suppress_allocation:
    - mem_cgroup_uncharge_skmem() gets called
- return 0 (which means failure)

AFTER the above patch, what happens in __sk_mem_raise_allocated() is:
- mem_cgroup_charge_skmem() gets called:
    - try_charge() with GFP_NOWAIT gets called and failed
    - return false
- goto suppress_allocation:
    - we no longer call mem_cgroup_uncharge_skmem()
- return 0

So I agree with Shakeel that this change shouldn't alter the behavior
of the above call path in such a situation.
But do let us know if reverting this change has any effect on your test.
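The two step lists above can be condensed into a toy model (mine, not kernel code) showing that, for a UDP rx charge that fails, the end result is a failed allocation both before and after the patch; only the intermediate force charge/uncharge pair disappears:

```python
def raise_allocated(charge_ok, patched, kind="RECV", sock_type="DGRAM"):
    """Toy model of __sk_mem_raise_allocated() for a memcg skmem charge.

    Returns (result, events): result 1 means success, 0 means failure;
    events records the intermediate accounting steps.
    """
    events = []
    if charge_ok:
        return 1, events
    if not patched:
        # old mem_cgroup_charge_skmem(): already force-charged internally
        events.append("force_charge")
    if kind == "SEND" and sock_type == "STREAM":
        # the suppress_allocation rescue heuristics only apply here
        # (details elided in this sketch)
        return 1, events
    if not patched:
        events.append("uncharge")  # undo the force charge before failing
    return 0, events

# With the memcg at its limit, UDP rx fails either way:
print(raise_allocated(False, patched=False))  # (0, ['force_charge', 'uncharge'])
print(raise_allocated(False, patched=True))   # (0, [])
```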

>
> The given patch has not changed any heuristic. It has only changed
> when forced charging happens. After the patch, the initial call
> mem_cgroup_charge_skmem() can fail and we take suppress_allocation
> code path and if heuristics allow, we force charge with __GFP_NOFAIL.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: UDP rx packet loss in a cgroup with a memory limit
@ 2022-08-17 18:16           ` Wei Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Wei Wang @ 2022-08-17 18:16 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Eric Dumazet, netdev, Gražvydas Ignotas,
	Michal Hocko, Roman Gushchin, Linux MM, Cgroups

On Wed, Aug 17, 2022 at 10:37 AM Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
> + Eric and netdev
>
> On Wed, Aug 17, 2022 at 10:13 AM Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
> >
> > On Wed, Aug 17, 2022 at 07:50:13PM +0300, Gražvydas Ignotas wrote:
> > > On Tue, Aug 16, 2022 at 9:52 PM Gražvydas Ignotas <notasas@gmail.com> wrote:
> > > > Basically, when there is git activity in the container with a memory
> > > > limit, other processes in the same container start to suffer (very)
> > > > occasional network issues (mostly DNS lookup failures).
> > >
> > > ok I've traced this and it's failing in try_charge_memcg(), which
> > > doesn't seem to be trying too hard because it's called from irq
> > > context.
> > >
> > > Here is the backtrace:
> > >  <IRQ>
> > >  ? fib_validate_source+0xb4/0x100
> > >  ? ip_route_input_slow+0xa11/0xb70
> > >  mem_cgroup_charge_skmem+0x4b/0xf0
> > >  __sk_mem_raise_allocated+0x17f/0x3e0
> > >  __udp_enqueue_schedule_skb+0x220/0x270
> > >  udp_queue_rcv_one_skb+0x330/0x5e0
> > >  udp_unicast_rcv_skb+0x75/0x90
> > >  __udp4_lib_rcv+0x1ba/0xca0
> > >  ? ip_rcv_finish_core.constprop.0+0x63/0x490
> > >  ip_protocol_deliver_rcu+0xd6/0x230
> > >  ip_local_deliver_finish+0x73/0xa0
> > >  __netif_receive_skb_one_core+0x8b/0xa0
> > >  process_backlog+0x8e/0x120
> > >  __napi_poll+0x2c/0x160
> > >  net_rx_action+0x2a2/0x360
> > >  ? rebalance_domains+0xeb/0x3b0
> > >  __do_softirq+0xeb/0x2eb
> > >  __irq_exit_rcu+0xb9/0x110
> > >  sysvec_apic_timer_interrupt+0xa2/0xd0
> > >  </IRQ>
> > >
> > > Calling mem_cgroup_print_oom_meminfo() in such a case reveals:
> > >
> > > memory: usage 7812476kB, limit 7812500kB, failcnt 775198
> > > swap: usage 0kB, limit 0kB, failcnt 0
> > > Memory cgroup stats for
> > > /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb8f4f0e9_fb95_4f2d_8443_e6a78f235c9a.slice/docker-9e7cad93b2e0774d49148474989b41fe6d67a5985d059d08d9d64495f1539a81.scope:
> > > anon 348016640
> > > file 7502163968
> > > kernel 146997248
> > > kernel_stack 327680
> > > pagetables 2224128
> > > percpu 0
> > > sock 4096
> > > vmalloc 0
> > > shmem 0
> > > zswap 0
> > > zswapped 0
> > > file_mapped 112041984
> > > file_dirty 1181028352
> > > file_writeback 2686976
> > > swapcached 0
> > > anon_thp 44040192
> > > file_thp 0
> > > shmem_thp 0
> > > inactive_anon 350756864
> > > active_anon 36864
> > > inactive_file 3614003200
> > > active_file 3888070656
> > > unevictable 0
> > > slab_reclaimable 143692600
> > > slab_unreclaimable 545120
> > > slab 144237720
> > > workingset_refault_anon 0
> > > workingset_refault_file 2318
> > > workingset_activate_anon 0
> > > workingset_activate_file 2318
> > > workingset_restore_anon 0
> > > workingset_restore_file 0
> > > workingset_nodereclaim 0
> > > pgfault 334152
> > > pgmajfault 1238
> > > pgrefill 3400
> > > pgscan 819608
> > > pgsteal 791005
> > > pgactivate 949122
> > > pgdeactivate 1694
> > > pglazyfree 0
> > > pglazyfreed 0
> > > zswpin 0
> > > zswpout 0
> > > thp_fault_alloc 709
> > > thp_collapse_alloc 0
> > >
> > > So it basically renders UDP inoperable because of disk cache. I hope
> > > this is not the intended behavior. Naturally booting with
> > > cgroup.memory=nosocket solves this issue.
> >
> > This is most likely a regression caused by this patch:
> >
> > commit 4b1327be9fe57443295ae86fe0fcf24a18469e9f
> > Author: Wei Wang <weiwan-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > Date:   Tue Aug 17 12:40:03 2021 -0700
> >
> >     net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
> >
> >     Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> >     to give more control to the networking stack and enable it to change
> >     memcg charging behavior. In the future, the networking stack may decide
> >     to avoid oom-kills when fallbacks are more appropriate.
> >
> >     One behavior change in mem_cgroup_charge_skmem() by this patch is to
> >     avoid force charging by default and let the caller decide when and if
> >     force charging is needed through the presence or absence of
> >     __GFP_NOFAIL.
> >
> >     Signed-off-by: Wei Wang <weiwan-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> >     Reviewed-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> >     Signed-off-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> >
> > We never used to fail these allocations. Cgroups don't have a
> > kswapd-style watermark reclaimer, so the network relied on
> > force-charging and leaving reclaim to allocations that can block.
> > Now it seems network packets could just fail indefinitely.
> >
> > The changelog is a bit terse given how drastic the behavior change
> > is. Wei, Shakeel, can you fill in why this was changed? Can we revert
> > this for the time being?
>
> Does reverting the patch fix the issue? However I don't think it will.
>
> Please note that we still have the force charging as before this
> patch. Previously when mem_cgroup_charge_skmem() force charges, it
> returns false and __sk_mem_raise_allocated takes suppress_allocation
> code path. Based on some heuristics, it may allow it or it may
> uncharge and return failure.

The force-charging logic in __sk_mem_raise_allocated() is only
considered on the tx path for STREAM sockets, so it probably does not
take effect on the UDP path. And that logic is NOT altered by the
above patch.
So specifically for the UDP receive path, what happens in
__sk_mem_raise_allocated() BEFORE the above patch is:
- mem_cgroup_charge_skmem() gets called:
    - try_charge() with GFP_NOWAIT is called and fails
    - try_charge() with __GFP_NOFAIL force-charges and succeeds
    - return false
- goto suppress_allocation:
    - mem_cgroup_uncharge_skmem() gets called
- return 0 (which means failure)

AFTER the above patch, what happens in __sk_mem_raise_allocated() is:
- mem_cgroup_charge_skmem() gets called:
    - try_charge() with GFP_NOWAIT is called and fails
    - return false
- goto suppress_allocation:
    - mem_cgroup_uncharge_skmem() is no longer called
- return 0

So I agree with Shakeel that this change shouldn't alter the behavior
of this call path in such a situation.
But do let us know if reverting this change has any effect on your test.

>
> The given patch has not changed any heuristic. It has only changed
> when forced charging happens. After the path the initial call
> mem_cgroup_charge_skmem() can fail and we take suppress_allocation
> code path and if heuristics allow, we force charge with __GFP_NOFAIL.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: UDP rx packet loss in a cgroup with a memory limit
  2022-08-17 18:16           ` Wei Wang
  (?)
@ 2022-08-17 20:12           ` Gražvydas Ignotas
  2022-10-13  4:36               ` Shakeel Butt
  -1 siblings, 1 reply; 13+ messages in thread
From: Gražvydas Ignotas @ 2022-08-17 20:12 UTC (permalink / raw)
  To: Wei Wang
  Cc: Shakeel Butt, Johannes Weiner, Eric Dumazet, netdev,
	Michal Hocko, Roman Gushchin, Linux MM, Cgroups

On Wed, Aug 17, 2022 at 9:16 PM Wei Wang <weiwan@google.com> wrote:
>
> On Wed, Aug 17, 2022 at 10:37 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > + Eric and netdev
> >
> > On Wed, Aug 17, 2022 at 10:13 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > This is most likely a regression caused by this patch:
> > >
> > > commit 4b1327be9fe57443295ae86fe0fcf24a18469e9f
> > > Author: Wei Wang <weiwan@google.com>
> > > Date:   Tue Aug 17 12:40:03 2021 -0700
> > >
> > >     net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
> > >
> > >     Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> > >     to give more control to the networking stack and enable it to change
> > >     memcg charging behavior. In the future, the networking stack may decide
> > >     to avoid oom-kills when fallbacks are more appropriate.
> > >
> > >     One behavior change in mem_cgroup_charge_skmem() by this patch is to
> > >     avoid force charging by default and let the caller decide when and if
> > >     force charging is needed through the presence or absence of
> > >     __GFP_NOFAIL.
> > >
> > >     Signed-off-by: Wei Wang <weiwan@google.com>
> > >     Reviewed-by: Shakeel Butt <shakeelb@google.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > We never used to fail these allocations. Cgroups don't have a
> > > kswapd-style watermark reclaimer, so the network relied on
> > > force-charging and leaving reclaim to allocations that can block.
> > > Now it seems network packets could just fail indefinitely.
> > >
> > > The changelog is a bit terse given how drastic the behavior change
> > > is. Wei, Shakeel, can you fill in why this was changed? Can we revert
> > > this for the time being?
> >
> > Does reverting the patch fix the issue? However I don't think it will.
> >
> > Please note that we still have the force charging as before this
> > patch. Previously when mem_cgroup_charge_skmem() force charges, it
> > returns false and __sk_mem_raise_allocated takes suppress_allocation
> > code path. Based on some heuristics, it may allow it or it may
> > uncharge and return failure.
>
> The force charging logic in __sk_mem_raise_allocated only gets
> considered on tx path for STREAM socket. So it probably does not take
> effect on UDP path. And, that logic is NOT being altered in the above
> patch.
> So specifically for UDP receive path, what happens in
> __sk_mem_raise_allocated() BEFORE the above patch is:
> - mem_cgroup_charge_skmem() gets called:
>     - try_charge() with GFP_NOWAIT gets called and  failed
>     - try_charge() with __GFP_NOFAIL
>     - return false
> - goto suppress_allocation:
>     - mem_cgroup_uncharge_skmem() gets called
> - return 0 (which means failure)
>
> AFTER the above patch, what happens in __sk_mem_raise_allocated() is:
> - mem_cgroup_charge_skmem() gets called:
>     - try_charge() with GFP_NOWAIT gets called and failed
>     - return false
> - goto suppress_allocation:
>     - We no longer calls mem_cgroup_uncharge_skmem()
> - return 0
>
> So I agree with Shakeel, that this change shouldn't alter the behavior
> of the above call path in such a situation.
> But do let us know if reverting this change has any effect on your test.

The problem is still there (the kernel didn't compile after the
revert; I had to adjust another seemingly unrelated call site). It's
hard to tell whether it's better or worse since it happens so
randomly.

>
> >
> > The given patch has not changed any heuristic. It has only changed
> > when forced charging happens. After the path the initial call
> > mem_cgroup_charge_skmem() can fail and we take suppress_allocation
> > code path and if heuristics allow, we force charge with __GFP_NOFAIL.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: UDP rx packet loss in a cgroup with a memory limit
  2022-08-17 20:12           ` Gražvydas Ignotas
@ 2022-10-13  4:36               ` Shakeel Butt
  0 siblings, 0 replies; 13+ messages in thread
From: Shakeel Butt @ 2022-10-13  4:36 UTC (permalink / raw)
  To: Gražvydas Ignotas
  Cc: Wei Wang, Johannes Weiner, Eric Dumazet, netdev, Michal Hocko,
	Roman Gushchin, Linux MM, Cgroups

On Wed, Aug 17, 2022 at 1:12 PM Gražvydas Ignotas <notasas@gmail.com> wrote:
>
> On Wed, Aug 17, 2022 at 9:16 PM Wei Wang <weiwan@google.com> wrote:
> >
> > On Wed, Aug 17, 2022 at 10:37 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > + Eric and netdev
> > >
> > > On Wed, Aug 17, 2022 at 10:13 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > This is most likely a regression caused by this patch:
> > > >
> > > > commit 4b1327be9fe57443295ae86fe0fcf24a18469e9f
> > > > Author: Wei Wang <weiwan@google.com>
> > > > Date:   Tue Aug 17 12:40:03 2021 -0700
> > > >
> > > >     net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
> > > >
> > > >     Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> > > >     to give more control to the networking stack and enable it to change
> > > >     memcg charging behavior. In the future, the networking stack may decide
> > > >     to avoid oom-kills when fallbacks are more appropriate.
> > > >
> > > >     One behavior change in mem_cgroup_charge_skmem() by this patch is to
> > > >     avoid force charging by default and let the caller decide when and if
> > > >     force charging is needed through the presence or absence of
> > > >     __GFP_NOFAIL.
> > > >
> > > >     Signed-off-by: Wei Wang <weiwan@google.com>
> > > >     Reviewed-by: Shakeel Butt <shakeelb@google.com>
> > > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > >
> > > > We never used to fail these allocations. Cgroups don't have a
> > > > kswapd-style watermark reclaimer, so the network relied on
> > > > force-charging and leaving reclaim to allocations that can block.
> > > > Now it seems network packets could just fail indefinitely.
> > > >
> > > > The changelog is a bit terse given how drastic the behavior change
> > > > is. Wei, Shakeel, can you fill in why this was changed? Can we revert
> > > > this for the time being?
> > >
> > > Does reverting the patch fix the issue? However I don't think it will.
> > >
> > > Please note that we still have the force charging as before this
> > > patch. Previously when mem_cgroup_charge_skmem() force charges, it
> > > returns false and __sk_mem_raise_allocated takes suppress_allocation
> > > code path. Based on some heuristics, it may allow it or it may
> > > uncharge and return failure.
> >
> > The force charging logic in __sk_mem_raise_allocated only gets
> > considered on tx path for STREAM socket. So it probably does not take
> > effect on UDP path. And, that logic is NOT being altered in the above
> > patch.
> > So specifically for UDP receive path, what happens in
> > __sk_mem_raise_allocated() BEFORE the above patch is:
> > - mem_cgroup_charge_skmem() gets called:
> >     - try_charge() with GFP_NOWAIT gets called and  failed
> >     - try_charge() with __GFP_NOFAIL
> >     - return false
> > - goto suppress_allocation:
> >     - mem_cgroup_uncharge_skmem() gets called
> > - return 0 (which means failure)
> >
> > AFTER the above patch, what happens in __sk_mem_raise_allocated() is:
> > - mem_cgroup_charge_skmem() gets called:
> >     - try_charge() with GFP_NOWAIT gets called and failed
> >     - return false
> > - goto suppress_allocation:
> >     - We no longer calls mem_cgroup_uncharge_skmem()
> > - return 0
> >
> > So I agree with Shakeel, that this change shouldn't alter the behavior
> > of the above call path in such a situation.
> > But do let us know if reverting this change has any effect on your test.
>
> The problem is still there (the kernel wasn't compiling after revert,
> had to adjust another seemingly unrelated callsite). It's hard to tell
> if it's better or worse since it happens so randomly.
>

Hello everyone, we now have a better understanding of why the patch
pointed out by Johannes may have exposed this issue. See
https://lore.kernel.org/all/20221013041833.rhifxw4gqwk4ofi2@google.com/.

To summarize, the old code depended on a subtle interaction between
force-charging and the percpu charge caches, which this patch removed.
The fix I am proposing is for the network stack to be explicit about
its needs (i.e. use GFP_ATOMIC) instead of depending on subtle
behavior.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: UDP rx packet loss in a cgroup with a memory limit
@ 2022-10-13 14:22                 ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2022-10-13 14:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Gražvydas Ignotas, Wei Wang, Eric Dumazet, netdev,
	Michal Hocko, Roman Gushchin, Linux MM, Cgroups

On Wed, Oct 12, 2022 at 09:36:34PM -0700, Shakeel Butt wrote:
> On Wed, Aug 17, 2022 at 1:12 PM Gražvydas Ignotas <notasas@gmail.com> wrote:
> >
> > On Wed, Aug 17, 2022 at 9:16 PM Wei Wang <weiwan@google.com> wrote:
> > >
> > > On Wed, Aug 17, 2022 at 10:37 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > + Eric and netdev
> > > >
> > > > On Wed, Aug 17, 2022 at 10:13 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > This is most likely a regression caused by this patch:
> > > > >
> > > > > commit 4b1327be9fe57443295ae86fe0fcf24a18469e9f
> > > > > Author: Wei Wang <weiwan@google.com>
> > > > > Date:   Tue Aug 17 12:40:03 2021 -0700
> > > > >
> > > > >     net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
> > > > >
> > > > >     Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> > > > >     to give more control to the networking stack and enable it to change
> > > > >     memcg charging behavior. In the future, the networking stack may decide
> > > > >     to avoid oom-kills when fallbacks are more appropriate.
> > > > >
> > > > >     One behavior change in mem_cgroup_charge_skmem() by this patch is to
> > > > >     avoid force charging by default and let the caller decide when and if
> > > > >     force charging is needed through the presence or absence of
> > > > >     __GFP_NOFAIL.
> > > > >
> > > > >     Signed-off-by: Wei Wang <weiwan@google.com>
> > > > >     Reviewed-by: Shakeel Butt <shakeelb@google.com>
> > > > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > > >
> > > > > We never used to fail these allocations. Cgroups don't have a
> > > > > kswapd-style watermark reclaimer, so the network relied on
> > > > > force-charging and leaving reclaim to allocations that can block.
> > > > > Now it seems network packets could just fail indefinitely.
> > > > >
> > > > > The changelog is a bit terse given how drastic the behavior change
> > > > > is. Wei, Shakeel, can you fill in why this was changed? Can we revert
> > > > > this for the time being?
> > > >
> > > > Does reverting the patch fix the issue? However I don't think it will.
> > > >
> > > > Please note that we still have the force charging as before this
> > > > patch. Previously when mem_cgroup_charge_skmem() force charges, it
> > > > returns false and __sk_mem_raise_allocated takes suppress_allocation
> > > > code path. Based on some heuristics, it may allow it or it may
> > > > uncharge and return failure.
> > >
> > > The force charging logic in __sk_mem_raise_allocated only gets
> > > considered on tx path for STREAM socket. So it probably does not take
> > > effect on UDP path. And, that logic is NOT being altered in the above
> > > patch.
> > > So specifically for UDP receive path, what happens in
> > > __sk_mem_raise_allocated() BEFORE the above patch is:
> > > - mem_cgroup_charge_skmem() gets called:
> > >     - try_charge() with GFP_NOWAIT gets called and  failed
> > >     - try_charge() with __GFP_NOFAIL
> > >     - return false
> > > - goto suppress_allocation:
> > >     - mem_cgroup_uncharge_skmem() gets called
> > > - return 0 (which means failure)
> > >
> > > AFTER the above patch, what happens in __sk_mem_raise_allocated() is:
> > > - mem_cgroup_charge_skmem() gets called:
> > >     - try_charge() with GFP_NOWAIT gets called and failed
> > >     - return false
> > > - goto suppress_allocation:
> > >     - We no longer calls mem_cgroup_uncharge_skmem()
> > > - return 0
> > >
> > > So I agree with Shakeel, that this change shouldn't alter the behavior
> > > of the above call path in such a situation.
> > > But do let us know if reverting this change has any effect on your test.
> >
> > The problem is still there (the kernel wasn't compiling after revert,
> > had to adjust another seemingly unrelated callsite). It's hard to tell
> > if it's better or worse since it happens so randomly.
> >
> 
> Hello everyone, we have a better understanding why the patch pointed
> out by Johannes might have exposed this issue. See
> https://lore.kernel.org/all/20221013041833.rhifxw4gqwk4ofi2@google.com/.

Wow, that's super subtle! Nice sleuthing.

> To summarize, the old code was depending on a subtle interaction of
> force-charge and percpu charge caches which this patch removed. The
> fix I am proposing is for the network stack to be explicit of its need
> (i.e. use GFP_ATOMIC) instead of depending on a subtle behavior.

That sounds good to me.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-10-13 14:22 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-16 18:52 UDP rx packet loss in a cgroup with a memory limit Gražvydas Ignotas
     [not found] ` <CANOLnON11vzvVdyJfW+QJ36siWR4-s=HJ2aRKpRy7CP=aRPoSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2022-08-17 16:50   ` Gražvydas Ignotas
2022-08-17 17:13     ` Johannes Weiner
2022-08-17 17:13       ` Johannes Weiner
2022-08-17 17:37       ` Shakeel Butt
2022-08-17 17:37         ` Shakeel Butt
2022-08-17 18:16         ` Wei Wang
2022-08-17 18:16           ` Wei Wang
2022-08-17 20:12           ` Gražvydas Ignotas
2022-10-13  4:36             ` Shakeel Butt
2022-10-13  4:36               ` Shakeel Butt
2022-10-13 14:22               ` Johannes Weiner
2022-10-13 14:22                 ` Johannes Weiner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.