From: Roman Gushchin <guro@fb.com>
To: <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>, <netdev@vger.kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Shakeel Butt <shakeelb@google.com>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>, <kernel-team@fb.com>,
Roman Gushchin <guro@fb.com>
Subject: [PATCH bpf-next v5 00/34] bpf: switch to memcg-based memory accounting
Date: Thu, 12 Nov 2020 14:15:09 -0800 [thread overview]
Message-ID: <20201112221543.3621014-1-guro@fb.com> (raw)
Currently bpf is using the memlock rlimit for the memory accounting.
This approach has its downsides and over time has created a significant
amount of problems:
1) The limit is per-user, but because most bpf operations are performed
as root, the limit has a little value.
2) It's hard to come up with a specific maximum value. Especially because
the counter is shared with non-bpf users (e.g. memlock() users).
Any specific value is either too low and creates false failures
or too high and useless.
3) Charging is not connected to the actual memory allocation. Bpf code
should manually calculate the estimated cost and precharge the counter,
and then take care of uncharging, including all fail paths.
It adds to the code complexity and makes it easy to leak a charge.
4) There is no simple way of getting the current value of the counter.
We've used drgn for it, but it's far from being convenient.
5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
a function to "explain" this case for users.
In order to overcome these problems let's switch to the memcg-based
memory accounting of bpf objects. With the recent addition of the percpu
memory accounting, now it's possible to provide a comprehensive accounting
of the memory used by bpf programs and maps.
This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
a better control over memory usage by different workloads. Of course, it
requires enabled cgroups and kernel memory accounting and properly configured
cgroup tree, but it's a default configuration for a modern Linux system.
2) The actual memory consumption is taken into account. It happens automatically
on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also
performed automatically on releasing the memory. So the code on the bpf side
becomes simpler and safer.
3) There is a simple way to get the current value and statistics.
In general, if a process performs a bpf operation (e.g. creates or updates
a map), it's memory cgroup is charged. However map updates performed from
an interrupt context are charged to the memory cgroup which contained
the process, which created the map.
Providing a 1:1 replacement for the rlimit-based memory accounting is
a non-goal of this patchset. Users and memory cgroups are completely
orthogonal, so it's not possible even in theory.
Memcg-based memory accounting requires a properly configured cgroup tree
to be actually useful. However, it's the way how the memory is managed
on a modern Linux system.
The patchset consists of the following parts:
1) 4 mm patches, which are already in the mm tree, but are required
to avoid a regression (otherwise vmallocs cannot be mapped to userspace).
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples
First 4 patches are not supposed to be merged via the bpf tree. I'm including
them to make sure bpf tests will pass.
v5:
- rebased to the latest version of the remote charging API
- implemented kmem accounting from an interrupt context, by Shakeel
- rebased to latest changes in mm allowed to map vmallocs to userspace
- fixed a build issue in kselftests, by Alexei
- fixed a use-after-free bug in bpf_map_free_deferred()
- added bpf line info coverage, by Shakeel
- split bpf map charging preparations into a separate patch
v4:
- covered allocations made from an interrupt context, by Daniel
- added some clarifications to the cover letter
v3:
- droped the userspace part for further discussions/refinements,
by Andrii and Song
v2:
- fixed build issue, caused by the remaining rlimit-based accounting
for sockhash maps
Roman Gushchin (34):
mm: memcontrol: use helpers to read page's memcg data
mm: memcontrol/slab: use helpers to access slab page's memcg_data
mm: introduce page memcg flags
mm: convert page kmemcg type to a page memcg flag
bpf: memcg-based memory accounting for bpf progs
bpf: prepare for memcg-based memory accounting for bpf maps
bpf: memcg-based memory accounting for bpf maps
bpf: refine memcg-based memory accounting for arraymap maps
bpf: refine memcg-based memory accounting for cpumap maps
bpf: memcg-based memory accounting for cgroup storage maps
bpf: refine memcg-based memory accounting for devmap maps
bpf: refine memcg-based memory accounting for hashtab maps
bpf: memcg-based memory accounting for lpm_trie maps
bpf: memcg-based memory accounting for bpf ringbuffer
bpf: memcg-based memory accounting for bpf local storage maps
bpf: refine memcg-based memory accounting for sockmap and sockhash
maps
bpf: refine memcg-based memory accounting for xskmap maps
bpf: eliminate rlimit-based memory accounting for arraymap maps
bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps
bpf: eliminate rlimit-based memory accounting for cpumap maps
bpf: eliminate rlimit-based memory accounting for cgroup storage maps
bpf: eliminate rlimit-based memory accounting for devmap maps
bpf: eliminate rlimit-based memory accounting for hashtab maps
bpf: eliminate rlimit-based memory accounting for lpm_trie maps
bpf: eliminate rlimit-based memory accounting for queue_stack_maps
maps
bpf: eliminate rlimit-based memory accounting for reuseport_array maps
bpf: eliminate rlimit-based memory accounting for bpf ringbuffer
bpf: eliminate rlimit-based memory accounting for sockmap and sockhash
maps
bpf: eliminate rlimit-based memory accounting for stackmap maps
bpf: eliminate rlimit-based memory accounting for xskmap maps
bpf: eliminate rlimit-based memory accounting for bpf local storage
maps
bpf: eliminate rlimit-based memory accounting infra for bpf maps
bpf: eliminate rlimit-based memory accounting for bpf progs
bpf: samples: do not touch RLIMIT_MEMLOCK
fs/buffer.c | 2 +-
fs/iomap/buffered-io.c | 2 +-
include/linux/bpf.h | 27 +--
include/linux/memcontrol.h | 215 +++++++++++++++++-
include/linux/mm.h | 22 --
include/linux/mm_types.h | 5 +-
include/linux/page-flags.h | 11 +-
include/trace/events/writeback.h | 2 +-
kernel/bpf/arraymap.c | 30 +--
kernel/bpf/bpf_local_storage.c | 18 +-
kernel/bpf/bpf_struct_ops.c | 19 +-
kernel/bpf/core.c | 22 +-
kernel/bpf/cpumap.c | 20 +-
kernel/bpf/devmap.c | 23 +-
kernel/bpf/hashtab.c | 33 +--
kernel/bpf/helpers.c | 37 ++-
kernel/bpf/local_storage.c | 38 +---
kernel/bpf/lpm_trie.c | 17 +-
kernel/bpf/queue_stack_maps.c | 16 +-
kernel/bpf/reuseport_array.c | 12 +-
kernel/bpf/ringbuf.c | 33 +--
kernel/bpf/stackmap.c | 16 +-
kernel/bpf/syscall.c | 177 ++++----------
kernel/fork.c | 7 +-
mm/debug.c | 4 +-
mm/huge_memory.c | 4 +-
mm/memcontrol.c | 139 +++++------
mm/page_alloc.c | 8 +-
mm/page_io.c | 6 +-
mm/slab.h | 38 +---
mm/workingset.c | 2 +-
net/core/bpf_sk_storage.c | 2 +-
net/core/sock_map.c | 40 +---
net/xdp/xskmap.c | 15 +-
samples/bpf/map_perf_test_user.c | 6 -
samples/bpf/offwaketime_user.c | 6 -
samples/bpf/sockex2_user.c | 2 -
samples/bpf/sockex3_user.c | 2 -
samples/bpf/spintest_user.c | 6 -
samples/bpf/syscall_tp_user.c | 2 -
samples/bpf/task_fd_query_user.c | 5 -
samples/bpf/test_lru_dist.c | 3 -
samples/bpf/test_map_in_map_user.c | 6 -
samples/bpf/test_overhead_user.c | 2 -
samples/bpf/trace_event_user.c | 2 -
samples/bpf/tracex2_user.c | 6 -
samples/bpf/tracex3_user.c | 6 -
samples/bpf/tracex4_user.c | 6 -
samples/bpf/tracex5_user.c | 3 -
samples/bpf/tracex6_user.c | 3 -
samples/bpf/xdp1_user.c | 6 -
samples/bpf/xdp_adjust_tail_user.c | 6 -
samples/bpf/xdp_monitor_user.c | 5 -
samples/bpf/xdp_redirect_cpu_user.c | 6 -
samples/bpf/xdp_redirect_map_user.c | 6 -
samples/bpf/xdp_redirect_user.c | 6 -
samples/bpf/xdp_router_ipv4_user.c | 6 -
samples/bpf/xdp_rxq_info_user.c | 6 -
samples/bpf/xdp_sample_pkts_user.c | 6 -
samples/bpf/xdp_tx_iptunnel_user.c | 6 -
samples/bpf/xdpsock_user.c | 7 -
.../selftests/bpf/progs/bpf_iter_bpf_map.c | 2 +-
.../selftests/bpf/progs/map_ptr_kern.c | 7 -
63 files changed, 460 insertions(+), 743 deletions(-)
--
2.26.2
next reply other threads:[~2020-11-12 22:16 UTC|newest]
Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-12 22:15 Roman Gushchin [this message]
2020-11-12 22:15 ` [PATCH bpf-next v5 05/34] bpf: memcg-based memory accounting for bpf progs Roman Gushchin
2020-11-13 17:31 ` Song Liu
2020-11-12 22:15 ` [PATCH bpf-next v5 06/34] bpf: prepare for memcg-based memory accounting for bpf maps Roman Gushchin
2020-11-13 17:46 ` Song Liu
2020-11-13 19:40 ` Roman Gushchin
2020-11-13 20:48 ` Song Liu
2020-11-12 22:15 ` [PATCH bpf-next v5 07/34] bpf: " Roman Gushchin
2020-11-13 18:04 ` Song Liu
2020-11-12 22:15 ` [PATCH bpf-next v5 08/34] bpf: refine memcg-based memory accounting for arraymap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 09/34] bpf: refine memcg-based memory accounting for cpumap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 10/34] bpf: memcg-based memory accounting for cgroup storage maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 11/34] bpf: refine memcg-based memory accounting for devmap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 12/34] bpf: refine memcg-based memory accounting for hashtab maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 13/34] bpf: memcg-based memory accounting for lpm_trie maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 14/34] bpf: memcg-based memory accounting for bpf ringbuffer Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 15/34] bpf: memcg-based memory accounting for bpf local storage maps Roman Gushchin
2020-11-13 18:07 ` Song Liu
2020-11-12 22:15 ` [PATCH bpf-next v5 16/34] bpf: refine memcg-based memory accounting for sockmap and sockhash maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 17/34] bpf: refine memcg-based memory accounting for xskmap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 18/34] bpf: eliminate rlimit-based memory accounting for arraymap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 19/34] bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 20/34] bpf: eliminate rlimit-based memory accounting for cpumap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 21/34] bpf: eliminate rlimit-based memory accounting for cgroup storage maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 22/34] bpf: eliminate rlimit-based memory accounting for devmap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 23/34] bpf: eliminate rlimit-based memory accounting for hashtab maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 24/34] bpf: eliminate rlimit-based memory accounting for lpm_trie maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 25/34] bpf: eliminate rlimit-based memory accounting for queue_stack_maps maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 26/34] bpf: eliminate rlimit-based memory accounting for reuseport_array maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 27/34] bpf: eliminate rlimit-based memory accounting for bpf ringbuffer Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 28/34] bpf: eliminate rlimit-based memory accounting for sockmap and sockhash maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 29/34] bpf: eliminate rlimit-based memory accounting for stackmap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 30/34] bpf: eliminate rlimit-based memory accounting for xskmap maps Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 31/34] bpf: eliminate rlimit-based memory accounting for bpf local storage maps Roman Gushchin
2020-11-13 18:14 ` Song Liu
2020-11-13 19:33 ` Roman Gushchin
2020-11-13 20:53 ` Song Liu
2020-11-12 22:15 ` [PATCH bpf-next v5 32/34] bpf: eliminate rlimit-based memory accounting infra for bpf maps Roman Gushchin
2020-11-13 18:17 ` Song Liu
2020-11-12 22:15 ` [PATCH bpf-next v5 33/34] bpf: eliminate rlimit-based memory accounting for bpf progs Roman Gushchin
2020-11-12 22:15 ` [PATCH bpf-next v5 34/34] bpf: samples: do not touch RLIMIT_MEMLOCK Roman Gushchin
[not found] ` <20201112221543.3621014-2-guro@fb.com>
2020-11-12 22:56 ` [PATCH bpf-next v5 01/34] mm: memcontrol: use helpers to read page's memcg data Stephen Rothwell
2020-11-13 0:26 ` Roman Gushchin
2020-11-13 3:04 ` Alexei Starovoitov
2020-11-13 3:18 ` Andrew Morton
2020-11-13 3:25 ` Alexei Starovoitov
2020-11-13 3:40 ` Andrew Morton
2020-11-13 4:08 ` Alexei Starovoitov
2020-11-13 4:01 ` Roman Gushchin
2020-11-13 14:25 ` Shakeel Butt
2020-11-13 17:18 ` Roman Gushchin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201112221543.3621014-1-guro@fb.com \
--to=guro@fb.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=netdev@vger.kernel.org \
--cc=shakeelb@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).