* [PATCH 00/14] bpf: Allow not to charge bpf memory
@ 2022-03-19 17:30 Yafang Shao
  2022-03-19 17:30 ` [PATCH 01/14] bpf: Introduce no charge flag for bpf map Yafang Shao
                   ` (14 more replies)
  0 siblings, 15 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

After switching to memcg-based bpf memory accounting, the bpf memory is
charged to the loader's memcg by default, which causes unexpected issues for
us. For instance, the container of the loader (which loads the bpf programs
and pins them on bpffs) may restart after pinning the progs and maps. After
the restart, the pinned progs and maps no longer belong to the new container;
they actually belong to an offline memcg left behind by the previous
generation. That inconsistent behavior causes trouble for the memory resource
management of this container.

The reason why these progs and maps have to persist across multiple
generations is that they are also used by other processes which are not in
this container. IOW, they can't be removed when this container is restarted.
To take a specific example, the bpf program for a clsact qdisc is loaded by
an agent running in a container, which not only loads the bpf program but
also processes the data generated by this program and does some other
maintenance work.

In order to keep the charging behavior consistent, we initially considered a
way to recharge these pinned maps and progs after the container is
restarted, but after the discussion[1] with Roman, we decided to go in
another direction and not charge them to the container in the first place.
TL;DR of the mentioned discussion: recharging is not a generic solution and
it carries too much risk.

This patchset implements the no-charge solution. Two flags are introduced in
union bpf_attr, one for bpf maps and another for bpf progs. Users who don't
want to charge memory to the current memcg can set these flags. Both flags
are restricted to the sys admin (CAP_SYS_ADMIN), as the memory will then be
accounted to the root memcg only.

Patches #1~#8 are for bpf maps. Patches #9~#12 are for bpf progs. Patches #13
and #14 add selftests, which also serve as examples of how to use the flags.
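
For reference, below is a minimal userspace sketch of creating a map with the
new flag; the selftests in patches #13 and #14 are the authoritative
examples. The sketch assumes a uapi header updated by this series (for
BPF_F_NO_CHARGE), uses the raw bpf(2) syscall, and the function name and map
sizes are only illustrative:

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    static int create_map_no_charge(void)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.map_type    = BPF_MAP_TYPE_HASH;
            attr.key_size    = 4;
            attr.value_size  = 8;
            attr.max_entries = 1024;
            /* New flag from this series; needs CAP_SYS_ADMIN (patch #2).
             * The map memory is accounted to the root memcg instead of
             * the caller's memcg.
             */
            attr.map_flags   = BPF_F_NO_CHARGE;

            return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    }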

[1]. https://lwn.net/Articles/887180/ 

Yafang Shao (14):
  bpf: Introduce no charge flag for bpf map
  bpf: Only sys admin can set no charge flag
  bpf: Enable no charge in map _CREATE_FLAG_MASK
  bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc
  bpf: Allow no charge in bpf_map_area_alloc
  bpf: Allow no charge for allocation not at map creation time
  bpf: Allow no charge in map specific allocation
  bpf: Aggregate flags for BPF_PROG_LOAD command
  bpf: Add no charge flag for bpf prog
  bpf: Only sys admin can set no charge flag for bpf prog
  bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc
  bpf: Allow no charge for bpf prog
  bpf: selftests: Add test case for BPF_F_NO_CHARGE
  bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE

 include/linux/bpf.h                           | 27 ++++++-
 include/uapi/linux/bpf.h                      | 21 +++--
 kernel/bpf/arraymap.c                         |  9 +--
 kernel/bpf/bloom_filter.c                     |  7 +-
 kernel/bpf/bpf_local_storage.c                |  8 +-
 kernel/bpf/bpf_struct_ops.c                   | 13 +--
 kernel/bpf/core.c                             | 20 +++--
 kernel/bpf/cpumap.c                           | 10 ++-
 kernel/bpf/devmap.c                           | 14 ++--
 kernel/bpf/hashtab.c                          | 14 ++--
 kernel/bpf/local_storage.c                    |  4 +-
 kernel/bpf/lpm_trie.c                         |  4 +-
 kernel/bpf/queue_stack_maps.c                 |  5 +-
 kernel/bpf/reuseport_array.c                  |  3 +-
 kernel/bpf/ringbuf.c                          | 19 ++---
 kernel/bpf/stackmap.c                         | 13 +--
 kernel/bpf/syscall.c                          | 40 +++++++---
 kernel/bpf/verifier.c                         |  2 +-
 net/core/filter.c                             |  6 +-
 net/core/sock_map.c                           |  8 +-
 net/xdp/xskmap.c                              |  9 ++-
 tools/include/uapi/linux/bpf.h                | 21 +++--
 .../selftests/bpf/map_tests/no_charg.c        | 79 +++++++++++++++++++
 .../selftests/bpf/prog_tests/no_charge.c      | 49 ++++++++++++
 24 files changed, 297 insertions(+), 108 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/map_tests/no_charg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/no_charge.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 01/14] bpf: Introduce no charge flag for bpf map
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 02/14] bpf: Only sys admin can set no charge flag Yafang Shao
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

A new map flag BPF_F_NO_CHARGE is introduced in bpf_attr, with which we
can choose not to charge the map memory to the memcg and instead account
it to the root memcg only. At map creation time we get the no-charge flag
from struct bpf_attr directly, while for other paths we get it from struct
bpf_map.

The usecase of this flag is that we may create bpf maps from a process
running in a container (with memcg) while these maps are targeted at the
whole system, so we don't want to charge this memory to the container.
That is good for the container's memory resource management, as these
shared bpf maps are always pinned and should belong to the system rather
than to the container. It also helps keep the charging behavior
consistent: for example, if we charge the pinned memory to this container,
the memory will no longer belong to it after the container restarts.

Two helpers are introduced for followup usage.
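
For illustration, the call pattern intended for the followup patches looks
roughly like this (hypothetical snippets; the actual conversions are done in
the later patches of this series, and the variable names are placeholders):

    /* At map creation time: take the flag from bpf_attr */
    smap = kzalloc(sizeof(*smap),
                   map_flags_no_charge(GFP_USER | __GFP_NOWARN, attr));

    /* After map creation: use the flag cached in struct bpf_map */
    ptr = kmalloc_node(size,
                       bpf_flags_no_charge(flags, map->no_charge), node);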

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            | 15 ++++++++++++++-
 include/uapi/linux/bpf.h       |  3 +++
 kernel/bpf/syscall.c           |  1 +
 tools/include/uapi/linux/bpf.h |  3 +++
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 88449fbbe063..07c6603a6c81 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -184,7 +184,8 @@ struct bpf_map {
 	char name[BPF_OBJ_NAME_LEN];
 	bool bypass_spec_v1;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
-	/* 14 bytes hole */
+	bool no_charge; /* Don't charge to memcg */
+	/* 13 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
@@ -207,6 +208,18 @@ struct bpf_map {
 	} owner;
 };
 
+static inline gfp_t
+map_flags_no_charge(gfp_t flags, union bpf_attr *attr)
+{
+	return flags |= (attr->map_flags & BPF_F_NO_CHARGE) ? 0 : __GFP_ACCOUNT;
+}
+
+static inline gfp_t
+bpf_flags_no_charge(gfp_t flags, bool no_charge)
+{
+	return flags |= no_charge ? 0 : __GFP_ACCOUNT;
+}
+
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
 {
 	return map->spin_lock_off >= 0;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7604e7d5438f..e2dba6cdd88d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1225,6 +1225,9 @@ enum {
 
 /* Create a map that is suitable to be an inner map with dynamic max entries */
 	BPF_F_INNER_MAP		= (1U << 12),
+
+/* Don't charge memory to memcg */
+	BPF_F_NO_CHARGE		= (1U << 13),
 };
 
 /* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cdaa1152436a..029f04588b1a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -368,6 +368,7 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 	map->map_flags = bpf_map_flags_retain_permanent(attr->map_flags);
 	map->numa_node = bpf_map_attr_numa_node(attr);
 	map->map_extra = attr->map_extra;
+	map->no_charge = attr->map_flags & BPF_F_NO_CHARGE;
 }
 
 static int bpf_map_alloc_id(struct bpf_map *map)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7604e7d5438f..e2dba6cdd88d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1225,6 +1225,9 @@ enum {
 
 /* Create a map that is suitable to be an inner map with dynamic max entries */
 	BPF_F_INNER_MAP		= (1U << 12),
+
+/* Don't charge memory to memcg */
+	BPF_F_NO_CHARGE		= (1U << 13),
 };
 
 /* Flags for BPF_PROG_QUERY. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 02/14] bpf: Only sys admin can set no charge flag
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
  2022-03-19 17:30 ` [PATCH 01/14] bpf: Introduce no charge flag for bpf map Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 03/14] bpf: Enable no charge in map _CREATE_FLAG_MASK Yafang Shao
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Only the sys admin has the privilege to set the no charge flag, which
accounts the bpf map memory to the root memcg.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 029f04588b1a..0cca3d7d0d84 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -852,6 +852,9 @@ static int map_create(union bpf_attr *attr)
 	    attr->map_extra != 0)
 		return -EINVAL;
 
+	if (attr->map_flags & BPF_F_NO_CHARGE && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	f_flags = bpf_get_file_flag(attr->map_flags);
 	if (f_flags < 0)
 		return f_flags;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 03/14] bpf: Enable no charge in map _CREATE_FLAG_MASK
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
  2022-03-19 17:30 ` [PATCH 01/14] bpf: Introduce no charge flag for bpf map Yafang Shao
  2022-03-19 17:30 ` [PATCH 02/14] bpf: Only sys admin can set no charge flag Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 04/14] bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc Yafang Shao
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Many maps have their own create flag masks to reject invalid map flags. Add
BPF_F_NO_CHARGE to all these masks to permit it.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/arraymap.c          | 2 +-
 kernel/bpf/bloom_filter.c      | 4 ++--
 kernel/bpf/bpf_local_storage.c | 3 ++-
 kernel/bpf/bpf_struct_ops.c    | 5 ++++-
 kernel/bpf/cpumap.c            | 5 ++++-
 kernel/bpf/devmap.c            | 2 +-
 kernel/bpf/hashtab.c           | 2 +-
 kernel/bpf/local_storage.c     | 2 +-
 kernel/bpf/lpm_trie.c          | 2 +-
 kernel/bpf/queue_stack_maps.c  | 2 +-
 kernel/bpf/ringbuf.c           | 2 +-
 kernel/bpf/stackmap.c          | 2 +-
 net/core/sock_map.c            | 2 +-
 net/xdp/xskmap.c               | 5 ++++-
 14 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7f145aefbff8..ac123747303c 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -16,7 +16,7 @@
 
 #define ARRAY_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK | \
-	 BPF_F_PRESERVE_ELEMS | BPF_F_INNER_MAP)
+	 BPF_F_PRESERVE_ELEMS | BPF_F_INNER_MAP | BPF_F_NO_CHARGE)
 
 static void bpf_array_free_percpu(struct bpf_array *array)
 {
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index b141a1346f72..f8ebfb4831e5 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -8,8 +8,8 @@
 #include <linux/jhash.h>
 #include <linux/random.h>
 
-#define BLOOM_CREATE_FLAG_MASK \
-	(BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK)
+#define BLOOM_CREATE_FLAG_MASK	(BPF_F_NUMA_NODE |	\
+	BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)
 
 struct bpf_bloom_filter {
 	struct bpf_map map;
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 092a1ac772d7..a92d3032fcde 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -15,7 +15,8 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/rcupdate_wait.h>
 
-#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
+#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK	\
+	(BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_NO_CHARGE)
 
 static struct bpf_local_storage_map_bucket *
 select_bucket(struct bpf_local_storage_map *smap,
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 21069dbe9138..09eb848e6d12 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -21,6 +21,8 @@ enum bpf_struct_ops_state {
 	refcount_t refcnt;				\
 	enum bpf_struct_ops_state state
 
+#define STRUCT_OPS_CREATE_FLAG_MASK (BPF_F_NO_CHARGE)
+
 struct bpf_struct_ops_value {
 	BPF_STRUCT_OPS_COMMON_VALUE;
 	char data[] ____cacheline_aligned_in_smp;
@@ -556,7 +558,8 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
 {
 	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
-	    attr->map_flags || !attr->btf_vmlinux_value_type_id)
+		attr->map_flags & ~STRUCT_OPS_CREATE_FLAG_MASK ||
+		!attr->btf_vmlinux_value_type_id)
 		return -EINVAL;
 	return 0;
 }
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 650e5d21f90d..201226fc652b 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -39,6 +39,9 @@
  */
 
 #define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
+
+#define CPU_MAP_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_NO_CHARGE)
+
 struct bpf_cpu_map_entry;
 struct bpf_cpu_map;
 
@@ -93,7 +96,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
 	    (value_size != offsetofend(struct bpf_cpumap_val, qsize) &&
 	     value_size != offsetofend(struct bpf_cpumap_val, bpf_prog.fd)) ||
-	    attr->map_flags & ~BPF_F_NUMA_NODE)
+	    attr->map_flags & ~CPU_MAP_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
 	cmap = kzalloc(sizeof(*cmap), GFP_USER | __GFP_ACCOUNT);
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 038f6d7a83e4..39bf8b521f27 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -50,7 +50,7 @@
 #include <trace/events/xdp.h>
 
 #define DEV_CREATE_FLAG_MASK \
-	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
+	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_NO_CHARGE)
 
 struct xdp_dev_bulk_queue {
 	struct xdp_frame *q[DEV_MAP_BULK_SIZE];
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 65877967f414..5c6ec8780b09 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -16,7 +16,7 @@
 
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
-	 BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
+	 BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED | BPF_F_NO_CHARGE)
 
 #define BATCH_OPS(_name)			\
 	.map_lookup_batch =			\
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 497916060ac7..865766c240d6 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -15,7 +15,7 @@
 #include "../cgroup/cgroup-internal.h"
 
 #define LOCAL_STORAGE_CREATE_FLAG_MASK					\
-	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
+	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)
 
 struct bpf_cgroup_storage_map {
 	struct bpf_map map;
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 5763cc7ac4f1..f42edf613624 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -537,7 +537,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 #define LPM_KEY_SIZE_MIN	LPM_KEY_SIZE(LPM_DATA_SIZE_MIN)
 
 #define LPM_CREATE_FLAG_MASK	(BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE |	\
-				 BPF_F_ACCESS_MASK)
+				 BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)
 
 static struct bpf_map *trie_alloc(union bpf_attr *attr)
 {
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index f9c734aaa990..c78eed4659ce 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -11,7 +11,7 @@
 #include "percpu_freelist.h"
 
 #define QUEUE_STACK_CREATE_FLAG_MASK \
-	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
+	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)
 
 struct bpf_queue_stack {
 	struct bpf_map map;
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 710ba9de12ce..88779f688679 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -11,7 +11,7 @@
 #include <linux/kmemleak.h>
 #include <uapi/linux/btf.h>
 
-#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
+#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_NO_CHARGE)
 
 /* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */
 #define RINGBUF_PGOFF \
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 38bdfcd06f55..b2e7dc1d9f5a 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -14,7 +14,7 @@
 
 #define STACK_CREATE_FLAG_MASK					\
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY |	\
-	 BPF_F_STACK_BUILD_ID)
+	 BPF_F_STACK_BUILD_ID | BPF_F_NO_CHARGE)
 
 struct stack_map_bucket {
 	struct pcpu_freelist_node fnode;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 2d213c4011db..7b0215bea413 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -22,7 +22,7 @@ struct bpf_stab {
 };
 
 #define SOCK_CREATE_FLAG_MASK				\
-	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
+	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_NO_CHARGE)
 
 static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 				struct bpf_prog *old, u32 which);
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 65b53fb3de13..10a5ae727bd5 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -12,6 +12,9 @@
 
 #include "xsk.h"
 
+#define XSK_MAP_CREATE_FLAG_MASK	\
+		(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_NO_CHARGE)
+
 static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
 					       struct xdp_sock __rcu **map_entry)
 {
@@ -68,7 +71,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
 	    attr->value_size != 4 ||
-	    attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY))
+	    attr->map_flags & ~XSK_MAP_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
 	numa_node = bpf_map_attr_numa_node(attr);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 04/14] bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (2 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 03/14] bpf: Enable no charge in map _CREATE_FLAG_MASK Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 05/14] bpf: Allow no charge " Yafang Shao
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Add a new parameter bpf_attr to bpf_map_area_alloc(), from which we can get
the no-charge flag. Of the two current parameters, the numa node is also
derived from bpf_attr, so it can be removed after this change.

No functional change.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h           |  4 ++--
 kernel/bpf/arraymap.c         |  5 ++---
 kernel/bpf/bloom_filter.c     |  3 +--
 kernel/bpf/bpf_struct_ops.c   |  8 ++++----
 kernel/bpf/cpumap.c           |  3 +--
 kernel/bpf/devmap.c           | 10 ++++------
 kernel/bpf/hashtab.c          | 10 ++++------
 kernel/bpf/queue_stack_maps.c |  3 +--
 kernel/bpf/reuseport_array.c  |  3 +--
 kernel/bpf/ringbuf.c          | 11 ++++++-----
 kernel/bpf/stackmap.c         | 11 ++++++-----
 kernel/bpf/syscall.c          | 13 +++++++------
 net/core/sock_map.c           |  6 ++----
 net/xdp/xskmap.c              |  4 +---
 14 files changed, 42 insertions(+), 52 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 07c6603a6c81..90a542d5a411 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1518,8 +1518,8 @@ void bpf_map_inc_with_uref(struct bpf_map *map);
 struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
-void *bpf_map_area_alloc(u64 size, int numa_node);
-void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
+void *bpf_map_area_alloc(u64 size, union bpf_attr *attr);
+void *bpf_map_area_mmapable_alloc(u64 size, union bpf_attr *attr);
 void bpf_map_area_free(void *base);
 bool bpf_map_write_active(const struct bpf_map *map);
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index ac123747303c..e26aef906392 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -81,7 +81,6 @@ int array_map_alloc_check(union bpf_attr *attr)
 static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 {
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
-	int numa_node = bpf_map_attr_numa_node(attr);
 	u32 elem_size, index_mask, max_entries;
 	bool bypass_spec_v1 = bpf_bypass_spec_v1();
 	u64 array_size, mask64;
@@ -130,13 +129,13 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 		void *data;
 
 		/* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */
-		data = bpf_map_area_mmapable_alloc(array_size, numa_node);
+		data = bpf_map_area_mmapable_alloc(array_size, attr);
 		if (!data)
 			return ERR_PTR(-ENOMEM);
 		array = data + PAGE_ALIGN(sizeof(struct bpf_array))
 			- offsetof(struct bpf_array, value);
 	} else {
-		array = bpf_map_area_alloc(array_size, numa_node);
+		array = bpf_map_area_alloc(array_size, attr);
 	}
 	if (!array)
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index f8ebfb4831e5..a35c664b4a02 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -90,7 +90,6 @@ static int bloom_map_get_next_key(struct bpf_map *map, void *key, void *next_key
 static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 {
 	u32 bitset_bytes, bitset_mask, nr_hash_funcs, nr_bits;
-	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_bloom_filter *bloom;
 
 	if (!bpf_capable())
@@ -141,7 +140,7 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 	}
 
 	bitset_bytes = roundup(bitset_bytes, sizeof(unsigned long));
-	bloom = bpf_map_area_alloc(sizeof(*bloom) + bitset_bytes, numa_node);
+	bloom = bpf_map_area_alloc(sizeof(*bloom) + bitset_bytes, attr);
 
 	if (!bloom)
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 09eb848e6d12..1ca1407ae5e6 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -591,17 +591,17 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 		 */
 		(vt->size - sizeof(struct bpf_struct_ops_value));
 
-	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
+	attr->map_flags &= ~BPF_F_NUMA_NODE;
+	st_map = bpf_map_area_alloc(st_map_size, attr);
 	if (!st_map)
 		return ERR_PTR(-ENOMEM);
 
 	st_map->st_ops = st_ops;
 	map = &st_map->map;
 
-	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
+	st_map->uvalue = bpf_map_area_alloc(vt->size, attr);
 	st_map->progs =
-		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
-				   NUMA_NO_NODE);
+		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *), attr);
 	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
 	if (!st_map->uvalue || !st_map->progs || !st_map->image) {
 		bpf_struct_ops_map_free(map);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 201226fc652b..5a5b40e986ff 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -113,8 +113,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 
 	/* Alloc array for possible remote "destination" CPUs */
 	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
-					   sizeof(struct bpf_cpu_map_entry *),
-					   cmap->map.numa_node);
+					   sizeof(struct bpf_cpu_map_entry *), attr);
 	if (!cmap->cpu_map)
 		goto free_cmap;
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 39bf8b521f27..2857176c82bb 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -88,12 +88,12 @@ static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);
 
 static struct hlist_head *dev_map_create_hash(unsigned int entries,
-					      int numa_node)
+							union bpf_attr *attr)
 {
 	int i;
 	struct hlist_head *hash;
 
-	hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node);
+	hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), attr);
 	if (hash != NULL)
 		for (i = 0; i < entries; i++)
 			INIT_HLIST_HEAD(&hash[i]);
@@ -137,16 +137,14 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
 	}
 
 	if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets,
-							   dtab->map.numa_node);
+		dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets, attr);
 		if (!dtab->dev_index_head)
 			return -ENOMEM;
 
 		spin_lock_init(&dtab->index_lock);
 	} else {
 		dtab->netdev_map = bpf_map_area_alloc((u64) dtab->map.max_entries *
-						      sizeof(struct bpf_dtab_netdev *),
-						      dtab->map.numa_node);
+						      sizeof(struct bpf_dtab_netdev *), attr);
 		if (!dtab->netdev_map)
 			return -ENOMEM;
 	}
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 5c6ec8780b09..2c84045ff8e1 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -303,7 +303,7 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
 	return NULL;
 }
 
-static int prealloc_init(struct bpf_htab *htab)
+static int prealloc_init(union bpf_attr *attr, struct bpf_htab *htab)
 {
 	u32 num_entries = htab->map.max_entries;
 	int err = -ENOMEM, i;
@@ -311,8 +311,7 @@ static int prealloc_init(struct bpf_htab *htab)
 	if (htab_has_extra_elems(htab))
 		num_entries += num_possible_cpus();
 
-	htab->elems = bpf_map_area_alloc((u64)htab->elem_size * num_entries,
-					 htab->map.numa_node);
+	htab->elems = bpf_map_area_alloc((u64)htab->elem_size * num_entries, attr);
 	if (!htab->elems)
 		return -ENOMEM;
 
@@ -513,8 +512,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	err = -ENOMEM;
 	htab->buckets = bpf_map_area_alloc(htab->n_buckets *
-					   sizeof(struct bucket),
-					   htab->map.numa_node);
+					   sizeof(struct bucket), attr);
 	if (!htab->buckets)
 		goto free_htab;
 
@@ -535,7 +533,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	htab_init_buckets(htab);
 
 	if (prealloc) {
-		err = prealloc_init(htab);
+		err = prealloc_init(attr, htab);
 		if (err)
 			goto free_map_locked;
 
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index c78eed4659ce..0ff93c5bc184 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -66,14 +66,13 @@ static int queue_stack_map_alloc_check(union bpf_attr *attr)
 
 static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
 {
-	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_queue_stack *qs;
 	u64 size, queue_size;
 
 	size = (u64) attr->max_entries + 1;
 	queue_size = sizeof(*qs) + size * attr->value_size;
 
-	qs = bpf_map_area_alloc(queue_size, numa_node);
+	qs = bpf_map_area_alloc(queue_size, attr);
 	if (!qs)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 8251243022a2..b19fb70118a4 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -150,14 +150,13 @@ static void reuseport_array_free(struct bpf_map *map)
 
 static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 {
-	int numa_node = bpf_map_attr_numa_node(attr);
 	struct reuseport_array *array;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	/* allocate all map elements and zero-initialize them */
-	array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), numa_node);
+	array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), attr);
 	if (!array)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 88779f688679..a3b4d2a0a2c7 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -58,13 +58,14 @@ struct bpf_ringbuf_hdr {
 	u32 pg_off;
 };
 
-static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
+static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, union bpf_attr *attr)
 {
 	const gfp_t flags = GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL |
 			    __GFP_NOWARN | __GFP_ZERO;
 	int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES;
 	int nr_data_pages = data_sz >> PAGE_SHIFT;
 	int nr_pages = nr_meta_pages + nr_data_pages;
+	int numa_node = bpf_map_attr_numa_node(attr);
 	struct page **pages, *page;
 	struct bpf_ringbuf *rb;
 	size_t array_size;
@@ -88,7 +89,7 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 	 * user-space implementations significantly.
 	 */
 	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
-	pages = bpf_map_area_alloc(array_size, numa_node);
+	pages = bpf_map_area_alloc(array_size, attr);
 	if (!pages)
 		return NULL;
 
@@ -126,11 +127,11 @@ static void bpf_ringbuf_notify(struct irq_work *work)
 	wake_up_all(&rb->waitq);
 }
 
-static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
+static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, union bpf_attr *attr)
 {
 	struct bpf_ringbuf *rb;
 
-	rb = bpf_ringbuf_area_alloc(data_sz, numa_node);
+	rb = bpf_ringbuf_area_alloc(data_sz, attr);
 	if (!rb)
 		return NULL;
 
@@ -169,7 +170,7 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 
 	bpf_map_init_from_attr(&rb_map->map, attr);
 
-	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
+	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, attr);
 	if (!rb_map->rb) {
 		kfree(rb_map);
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index b2e7dc1d9f5a..ed6bebef0132 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -42,14 +42,15 @@ static inline int stack_map_data_size(struct bpf_map *map)
 		sizeof(struct bpf_stack_build_id) : sizeof(u64);
 }
 
-static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
+static int prealloc_elems_and_freelist(union bpf_attr *attr,
+				struct bpf_stack_map *smap)
 {
 	u64 elem_size = sizeof(struct stack_map_bucket) +
 			(u64)smap->map.value_size;
 	int err;
 
-	smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries,
-					 smap->map.numa_node);
+	smap->elems = bpf_map_area_alloc(elem_size *
+					smap->map.max_entries, attr);
 	if (!smap->elems)
 		return -ENOMEM;
 
@@ -101,7 +102,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 
 	cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap);
 	cost += n_buckets * (value_size + sizeof(struct stack_map_bucket));
-	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr));
+	smap = bpf_map_area_alloc(cost, attr);
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
 
@@ -113,7 +114,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	if (err)
 		goto free_smap;
 
-	err = prealloc_elems_and_freelist(smap);
+	err = prealloc_elems_and_freelist(attr, smap);
 	if (err)
 		goto put_buffers;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0cca3d7d0d84..f70a7067ef4a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -295,7 +295,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
  * (e.g. in map update path) without taking care of setting the active
  * memory cgroup (see at bpf_map_kmalloc_node() for example).
  */
-static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
+static void *__bpf_map_area_alloc(u64 size, union bpf_attr *attr, bool mmapable)
 {
 	/* We really just want to fail instead of triggering OOM killer
 	 * under memory pressure, therefore we set __GFP_NORETRY to kmalloc,
@@ -308,8 +308,9 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 	 */
 
 	const gfp_t gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT;
-	unsigned int flags = 0;
+	int numa_node = bpf_map_attr_numa_node(attr);
 	unsigned long align = 1;
+	unsigned int flags = 0;
 	void *area;
 
 	if (size >= SIZE_MAX)
@@ -332,14 +333,14 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 			flags, numa_node, __builtin_return_address(0));
 }
 
-void *bpf_map_area_alloc(u64 size, int numa_node)
+void *bpf_map_area_alloc(u64 size, union bpf_attr *attr)
 {
-	return __bpf_map_area_alloc(size, numa_node, false);
+	return __bpf_map_area_alloc(size, attr, false);
 }
 
-void *bpf_map_area_mmapable_alloc(u64 size, int numa_node)
+void *bpf_map_area_mmapable_alloc(u64 size, union bpf_attr *attr)
 {
-	return __bpf_map_area_alloc(size, numa_node, true);
+	return __bpf_map_area_alloc(size, attr, true);
 }
 
 void bpf_map_area_free(void *area)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 7b0215bea413..26b89d37944d 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -49,8 +49,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	raw_spin_lock_init(&stab->lock);
 
 	stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries *
-				       sizeof(struct sock *),
-				       stab->map.numa_node);
+					sizeof(struct sock *), attr);
 	if (!stab->sks) {
 		kfree(stab);
 		return ERR_PTR(-ENOMEM);
@@ -1093,8 +1092,7 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 	}
 
 	htab->buckets = bpf_map_area_alloc(htab->buckets_num *
-					   sizeof(struct bpf_shtab_bucket),
-					   htab->map.numa_node);
+					   sizeof(struct bpf_shtab_bucket), attr);
 	if (!htab->buckets) {
 		err = -ENOMEM;
 		goto free_htab;
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 10a5ae727bd5..50795c0c9b81 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -63,7 +63,6 @@ static void xsk_map_sock_delete(struct xdp_sock *xs,
 static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 {
 	struct xsk_map *m;
-	int numa_node;
 	u64 size;
 
 	if (!capable(CAP_NET_ADMIN))
@@ -74,10 +73,9 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 	    attr->map_flags & ~XSK_MAP_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
-	numa_node = bpf_map_attr_numa_node(attr);
 	size = struct_size(m, xsk_map, attr->max_entries);
 
-	m = bpf_map_area_alloc(size, numa_node);
+	m = bpf_map_area_alloc(size, attr);
 	if (!m)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 05/14] bpf: Allow no charge in bpf_map_area_alloc
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (3 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 04/14] bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 06/14] bpf: Allow no charge for allocation not at map creation time Yafang Shao
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Use the helper introduced earlier to decide whether to set __GFP_ACCOUNT.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f70a7067ef4a..add3b4045b4d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -307,7 +307,7 @@ static void *__bpf_map_area_alloc(u64 size, union bpf_attr *attr, bool mmapable)
 	 * __GFP_RETRY_MAYFAIL to avoid such situations.
 	 */
 
-	const gfp_t gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT;
+	const gfp_t gfp = map_flags_no_charge(__GFP_NOWARN | __GFP_ZERO, attr);
 	int numa_node = bpf_map_attr_numa_node(attr);
 	unsigned long align = 1;
 	unsigned int flags = 0;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 06/14] bpf: Allow no charge for allocation not at map creation time
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (4 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 05/14] bpf: Allow no charge " Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 07/14] bpf: Allow no charge in map specific allocation Yafang Shao
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

The three functions below are used for memory allocations that do not happen
at map creation time:
  - bpf_map_kmalloc_node()
  - bpf_map_kzalloc()
  - bpf_map_alloc_percpu()

For these paths, we get the no charge flag from the bpf_map struct, where it
was set earlier.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index add3b4045b4d..e84aeefa05f4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -434,7 +434,8 @@ void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 	void *ptr;
 
 	old_memcg = set_active_memcg(map->memcg);
-	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
+	ptr = kmalloc_node(size, bpf_flags_no_charge(flags, map->no_charge),
+			node);
 	set_active_memcg(old_memcg);
 
 	return ptr;
@@ -446,7 +447,7 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 	void *ptr;
 
 	old_memcg = set_active_memcg(map->memcg);
-	ptr = kzalloc(size, flags | __GFP_ACCOUNT);
+	ptr = kzalloc(size, bpf_flags_no_charge(flags, map->no_charge));
 	set_active_memcg(old_memcg);
 
 	return ptr;
@@ -459,7 +460,8 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 	void __percpu *ptr;
 
 	old_memcg = set_active_memcg(map->memcg);
-	ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT);
+	ptr = __alloc_percpu_gfp(size, align,
+			bpf_flags_no_charge(flags, map->no_charge));
 	set_active_memcg(old_memcg);
 
 	return ptr;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 07/14] bpf: Allow no charge in map specific allocation
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (5 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 06/14] bpf: Allow no charge for allocation not at map creation time Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 08/14] bpf: Aggregate flags for BPF_PROG_LOAD command Yafang Shao
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Some maps have their own ->map_alloc callback, in which no charge should also
be allowed.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/arraymap.c          | 2 +-
 kernel/bpf/bpf_local_storage.c | 5 +++--
 kernel/bpf/cpumap.c            | 2 +-
 kernel/bpf/devmap.c            | 2 +-
 kernel/bpf/hashtab.c           | 2 +-
 kernel/bpf/local_storage.c     | 2 +-
 kernel/bpf/lpm_trie.c          | 2 +-
 kernel/bpf/ringbuf.c           | 6 +++---
 8 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e26aef906392..9df425ad769c 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1062,7 +1062,7 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
 	struct bpf_array_aux *aux;
 	struct bpf_map *map;
 
-	aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT);
+	aux = kzalloc(sizeof(*aux), map_flags_no_charge(GFP_KERNEL, attr));
 	if (!aux)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index a92d3032fcde..b626546d384d 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -582,11 +582,12 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
 
 struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 {
+	gfp_t gfp_flags = map_flags_no_charge(GFP_USER | __GFP_NOWARN, attr);
 	struct bpf_local_storage_map *smap;
 	unsigned int i;
 	u32 nbuckets;
 
-	smap = kzalloc(sizeof(*smap), GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
+	smap = kzalloc(sizeof(*smap), gfp_flags);
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
 	bpf_map_init_from_attr(&smap->map, attr);
@@ -597,7 +598,7 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	smap->bucket_log = ilog2(nbuckets);
 
 	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
-				 GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
+				 map_flags_no_charge(GFP_USER | __GFP_NOWARN, attr));
 	if (!smap->buckets) {
 		kfree(smap);
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 5a5b40e986ff..fd3b3f05e76a 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -99,7 +99,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	    attr->map_flags & ~CPU_MAP_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
-	cmap = kzalloc(sizeof(*cmap), GFP_USER | __GFP_ACCOUNT);
+	cmap = kzalloc(sizeof(*cmap), map_flags_no_charge(GFP_USER, attr));
 	if (!cmap)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 2857176c82bb..6aaa2e3ce795 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -160,7 +160,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
 
-	dtab = kzalloc(sizeof(*dtab), GFP_USER | __GFP_ACCOUNT);
+	dtab = kzalloc(sizeof(*dtab), map_flags_no_charge(GFP_USER, attr));
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 2c84045ff8e1..74696d8196a5 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -474,7 +474,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	struct bpf_htab *htab;
 	int err, i;
 
-	htab = kzalloc(sizeof(*htab), GFP_USER | __GFP_ACCOUNT);
+	htab = kzalloc(sizeof(*htab), map_flags_no_charge(GFP_USER, attr));
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 865766c240d6..741ab7cf3626 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -313,7 +313,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-EINVAL);
 
 	map = kmalloc_node(sizeof(struct bpf_cgroup_storage_map),
-			   __GFP_ZERO | GFP_USER | __GFP_ACCOUNT, numa_node);
+			   map_flags_no_charge(__GFP_ZERO | GFP_USER, attr), numa_node);
 	if (!map)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index f42edf613624..9673f8e98c58 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -557,7 +557,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	    attr->value_size > LPM_VAL_SIZE_MAX)
 		return ERR_PTR(-EINVAL);
 
-	trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
+	trie = kzalloc(sizeof(*trie), map_flags_no_charge(GFP_USER | __GFP_NOWARN, attr));
 	if (!trie)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index a3b4d2a0a2c7..3db07cd0ab60 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -60,8 +60,8 @@ struct bpf_ringbuf_hdr {
 
 static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, union bpf_attr *attr)
 {
-	const gfp_t flags = GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL |
-			    __GFP_NOWARN | __GFP_ZERO;
+	const gfp_t flags = map_flags_no_charge(__GFP_RETRY_MAYFAIL |
+			__GFP_NOWARN | __GFP_ZERO, attr);
 	int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES;
 	int nr_data_pages = data_sz >> PAGE_SHIFT;
 	int nr_pages = nr_meta_pages + nr_data_pages;
@@ -164,7 +164,7 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-E2BIG);
 #endif
 
-	rb_map = kzalloc(sizeof(*rb_map), GFP_USER | __GFP_ACCOUNT);
+	rb_map = kzalloc(sizeof(*rb_map), map_flags_no_charge(GFP_USER, attr));
 	if (!rb_map)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 08/14] bpf: Aggregate flags for BPF_PROG_LOAD command
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (6 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 07/14] bpf: Allow no charge in map specific allocation Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 09/14] bpf: Add no charge flag for bpf prog Yafang Shao
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

It will be easier to read if we aggregate the flags for the BPF_PROG_LOAD
command into an enum.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/uapi/linux/bpf.h       | 15 +++++++++------
 tools/include/uapi/linux/bpf.h | 15 +++++++++------
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2dba6cdd88d..93ee04fb8c62 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1065,12 +1065,14 @@ enum bpf_link_type {
 #define BPF_F_ALLOW_MULTI	(1U << 1)
 #define BPF_F_REPLACE		(1U << 2)
 
+/* flags for BPF_PROG_LOAD command */
+enum {
 /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
  * verifier will perform strict alignment checking as if the kernel
  * has been built with CONFIG_EFFICIENT_UNALIGNED_ACCESS not set,
  * and NET_IP_ALIGN defined to 2.
  */
-#define BPF_F_STRICT_ALIGNMENT	(1U << 0)
+	BPF_F_STRICT_ALIGNMENT	= (1U << 0),
 
 /* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
  * verifier will allow any alignment whatsoever.  On platforms
@@ -1084,7 +1086,7 @@ enum bpf_link_type {
  * of an unaligned access the alignment check would trigger before
  * the one we are interested in.
  */
-#define BPF_F_ANY_ALIGNMENT	(1U << 1)
+	BPF_F_ANY_ALIGNMENT		= (1U << 1),
 
 /* BPF_F_TEST_RND_HI32 is used in BPF_PROG_LOAD command for testing purpose.
  * Verifier does sub-register def/use analysis and identifies instructions whose
@@ -1102,10 +1104,10 @@ enum bpf_link_type {
  * Then, if verifier is not doing correct analysis, such randomization will
  * regress tests to expose bugs.
  */
-#define BPF_F_TEST_RND_HI32	(1U << 2)
+	BPF_F_TEST_RND_HI32		= (1U << 2),
 
 /* The verifier internal test flag. Behavior is undefined */
-#define BPF_F_TEST_STATE_FREQ	(1U << 3)
+	BPF_F_TEST_STATE_FREQ	= (1U << 3),
 
 /* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
  * restrict map and helper usage for such programs. Sleepable BPF programs can
@@ -1113,12 +1115,13 @@ enum bpf_link_type {
  * Such programs are allowed to use helpers that may sleep like
  * bpf_copy_from_user().
  */
-#define BPF_F_SLEEPABLE		(1U << 4)
+	BPF_F_SLEEPABLE			= (1U << 4),
 
 /* If BPF_F_XDP_HAS_FRAGS is used in BPF_PROG_LOAD command, the loaded program
  * fully support xdp frags.
  */
-#define BPF_F_XDP_HAS_FRAGS	(1U << 5)
+	BPF_F_XDP_HAS_FRAGS		= (1U << 5),
+};
 
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e2dba6cdd88d..71a4d8fdc880 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1065,12 +1065,14 @@ enum bpf_link_type {
 #define BPF_F_ALLOW_MULTI	(1U << 1)
 #define BPF_F_REPLACE		(1U << 2)
 
+/* flags for BPF_PROG_LOAD */
+enum {
 /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
  * verifier will perform strict alignment checking as if the kernel
  * has been built with CONFIG_EFFICIENT_UNALIGNED_ACCESS not set,
  * and NET_IP_ALIGN defined to 2.
  */
-#define BPF_F_STRICT_ALIGNMENT	(1U << 0)
+	BPF_F_STRICT_ALIGNMENT	= (1U << 0),
 
 /* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
  * verifier will allow any alignment whatsoever.  On platforms
@@ -1084,7 +1086,7 @@ enum bpf_link_type {
  * of an unaligned access the alignment check would trigger before
  * the one we are interested in.
  */
-#define BPF_F_ANY_ALIGNMENT	(1U << 1)
+	BPF_F_ANY_ALIGNMENT		= (1U << 1),
 
 /* BPF_F_TEST_RND_HI32 is used in BPF_PROG_LOAD command for testing purpose.
  * Verifier does sub-register def/use analysis and identifies instructions whose
@@ -1102,10 +1104,10 @@ enum bpf_link_type {
  * Then, if verifier is not doing correct analysis, such randomization will
  * regress tests to expose bugs.
  */
-#define BPF_F_TEST_RND_HI32	(1U << 2)
+	BPF_F_TEST_RND_HI32		= (1U << 2),
 
 /* The verifier internal test flag. Behavior is undefined */
-#define BPF_F_TEST_STATE_FREQ	(1U << 3)
+	BPF_F_TEST_STATE_FREQ	= (1U << 3),
 
 /* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
  * restrict map and helper usage for such programs. Sleepable BPF programs can
@@ -1113,12 +1115,13 @@ enum bpf_link_type {
  * Such programs are allowed to use helpers that may sleep like
  * bpf_copy_from_user().
  */
-#define BPF_F_SLEEPABLE		(1U << 4)
+	BPF_F_SLEEPABLE			= (1U << 4),
 
 /* If BPF_F_XDP_HAS_FRAGS is used in BPF_PROG_LOAD command, the loaded program
  * fully support xdp frags.
  */
-#define BPF_F_XDP_HAS_FRAGS	(1U << 5)
+	BPF_F_XDP_HAS_FRAGS		= (1U << 5),
+};
 
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 09/14] bpf: Add no charge flag for bpf prog
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (7 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 08/14] bpf: Aggregate flags for BPF_PROG_LOAD command Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 10/14] bpf: Only sys admin can set " Yafang Shao
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

A no charge flag is also introduced for bpf progs, similar to the no charge
flag introduced for bpf maps, and with the same usecase.

It is added in bpf_attr for the BPF_PROG_LOAD command, and then stored in
bpf_prog_aux for memory allocations that happen outside the loading path.
There is a 4-byte hole after the member max_rdwr_access in struct
bpf_prog_aux, so we can place the new member there.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            | 8 ++++++++
 include/uapi/linux/bpf.h       | 3 +++
 kernel/bpf/syscall.c           | 4 +++-
 tools/include/uapi/linux/bpf.h | 3 +++
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 90a542d5a411..69ff3e35b8f2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -214,6 +214,13 @@ map_flags_no_charge(gfp_t flags, union bpf_attr *attr)
 	return flags |= (attr->map_flags & BPF_F_NO_CHARGE) ? 0 : __GFP_ACCOUNT;
 }
 
+static inline gfp_t
+prog_flags_no_charge(gfp_t flags, union bpf_attr *attr)
+{
+	return flags |= (attr->prog_flags & BPF_F_PROG_NO_CHARGE) ?
+					0 : __GFP_ACCOUNT;
+}
+
 static inline gfp_t
 bpf_flags_no_charge(gfp_t flags, bool no_charge)
 {
@@ -958,6 +965,7 @@ struct bpf_prog_aux {
 	u32 ctx_arg_info_size;
 	u32 max_rdonly_access;
 	u32 max_rdwr_access;
+	bool no_charge; /* don't charge memory to memcg */
 	struct btf *attach_btf;
 	const struct bpf_ctx_arg_aux *ctx_arg_info;
 	struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 93ee04fb8c62..3c98b1b77db6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1121,6 +1121,9 @@ enum {
  * fully support xdp frags.
  */
 	BPF_F_XDP_HAS_FRAGS		= (1U << 5),
+
+/* Don't charge memory to memcg */
+	BPF_F_PROG_NO_CHARGE	= (1U << 6),
 };
 
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e84aeefa05f4..346f3df9fa1d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2230,7 +2230,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 				 BPF_F_TEST_STATE_FREQ |
 				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32 |
-				 BPF_F_XDP_HAS_FRAGS))
+				 BPF_F_XDP_HAS_FRAGS |
+				 BPF_F_PROG_NO_CHARGE))
 		return -EINVAL;
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -2317,6 +2318,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	prog->aux->offload_requested = !!attr->prog_ifindex;
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
+	prog->aux->no_charge = attr->prog_flags & BPF_F_PROG_NO_CHARGE;
 
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 71a4d8fdc880..89752e5c11c0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1121,6 +1121,9 @@ enum {
  * fully support xdp frags.
  */
 	BPF_F_XDP_HAS_FRAGS		= (1U << 5),
+
+/* Don't charge memory to memcg */
+	BPF_F_PROG_NO_CHARGE	= (1U << 6),
 };
 
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 10/14] bpf: Only sys admin can set no charge flag for bpf prog
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (8 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 09/14] bpf: Add no charge flag for bpf prog Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 11/14] bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc Yafang Shao
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

When a bpf prog is loaded by a process running in a container (with memcg),
only the sys admin has the privilege to skip charging the bpf prog memory to
this container and account it to the root memcg instead.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 346f3df9fa1d..ecc5de216f50 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2234,6 +2234,9 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 				 BPF_F_PROG_NO_CHARGE))
 		return -EINVAL;
 
+	if (attr->prog_flags & BPF_F_PROG_NO_CHARGE && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
 	    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
 	    !bpf_capable())
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 11/14] bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (9 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 10/14] bpf: Only sys admin can set " Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 12/14] bpf: Allow no charge for bpf prog Yafang Shao
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Instead of setting __GFP_ACCOUNT inside bpf_prog_alloc(), let's set it
at the callsite. No functional change.

It is a preparation for the followup patch.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/core.c     | 8 ++++----
 kernel/bpf/syscall.c  | 2 +-
 kernel/bpf/verifier.c | 2 +-
 net/core/filter.c     | 6 +++---
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1324f9523e7c..0f68b8203c18 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -80,7 +80,7 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns
 
 struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags)
 {
-	gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_ZERO | gfp_extra_flags;
+	gfp_t gfp_flags = __GFP_ZERO | gfp_extra_flags;
 	struct bpf_prog_aux *aux;
 	struct bpf_prog *fp;
 
@@ -89,12 +89,12 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 	if (fp == NULL)
 		return NULL;
 
-	aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT | gfp_extra_flags);
+	aux = kzalloc(sizeof(*aux), gfp_extra_flags);
 	if (aux == NULL) {
 		vfree(fp);
 		return NULL;
 	}
-	fp->active = alloc_percpu_gfp(int, GFP_KERNEL_ACCOUNT | gfp_extra_flags);
+	fp->active = alloc_percpu_gfp(int, gfp_flags);
 	if (!fp->active) {
 		vfree(fp);
 		kfree(aux);
@@ -116,7 +116,7 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 
 struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
 {
-	gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_ZERO | gfp_extra_flags;
+	gfp_t gfp_flags = __GFP_ZERO | gfp_extra_flags;
 	struct bpf_prog *prog;
 	int cpu;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ecc5de216f50..fdfbb4d0d5e0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2305,7 +2305,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	}
 
 	/* plain bpf_prog allocation */
-	prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
+	prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER | __GFP_ACCOUNT);
 	if (!prog) {
 		if (dst_prog)
 			bpf_prog_put(dst_prog);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0287176bfe9a..fe989cc08391 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12991,7 +12991,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		 * subprogs don't have IDs and not reachable via prog_get_next_id
 		 * func[i]->stats will never be accessed and stays NULL
 		 */
-		func[i] = bpf_prog_alloc_no_stats(bpf_prog_size(len), GFP_USER);
+		func[i] = bpf_prog_alloc_no_stats(bpf_prog_size(len), GFP_USER | __GFP_ACCOUNT);
 		if (!func[i])
 			goto out_free;
 		memcpy(func[i]->insnsi, &prog->insnsi[subprog_start],
diff --git a/net/core/filter.c b/net/core/filter.c
index 03655f2074ae..6466a1e0ed4d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1363,7 +1363,7 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
 	if (!bpf_check_basics_ok(fprog->filter, fprog->len))
 		return -EINVAL;
 
-	fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0);
+	fp = bpf_prog_alloc(bpf_prog_size(fprog->len), __GFP_ACCOUNT);
 	if (!fp)
 		return -ENOMEM;
 
@@ -1410,7 +1410,7 @@ int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
 	if (!bpf_check_basics_ok(fprog->filter, fprog->len))
 		return -EINVAL;
 
-	fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0);
+	fp = bpf_prog_alloc(bpf_prog_size(fprog->len), __GFP_ACCOUNT);
 	if (!fp)
 		return -ENOMEM;
 
@@ -1488,7 +1488,7 @@ struct bpf_prog *__get_filter(struct sock_fprog *fprog, struct sock *sk)
 	if (!bpf_check_basics_ok(fprog->filter, fprog->len))
 		return ERR_PTR(-EINVAL);
 
-	prog = bpf_prog_alloc(bpf_prog_size(fprog->len), 0);
+	prog = bpf_prog_alloc(bpf_prog_size(fprog->len), __GFP_ACCOUNT);
 	if (!prog)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 12/14] bpf: Allow no charge for bpf prog
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (10 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 11/14] bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 13/14] bpf: selftests: Add test case for BPF_F_NO_CHARTE Yafang Shao
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Allow not charging the memory used by bpf progs. This includes the memory
used by the bpf prog itself, its auxiliary data, statistics and bpf line info.
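
A condensed sketch of the whole path, putting the load-time and the
allocation-time ends side by side (the no_charge bit itself was introduced
in patch 9; nothing below is new code, it is just the two hunks summarized):

	/* at load time, kernel/bpf/syscall.c */
	prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt),
			      prog_flags_no_charge(GFP_USER, attr));

	/* at any later allocation, e.g. bpf_prog_alloc_jited_linfo() */
	gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;

	if (!prog->aux->no_charge)
		gfp_flags |= __GFP_ACCOUNT;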

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/core.c    | 12 ++++++++++--
 kernel/bpf/syscall.c |  6 ++++--
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 0f68b8203c18..7aa750e8bd7d 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -144,12 +144,17 @@ EXPORT_SYMBOL_GPL(bpf_prog_alloc);
 
 int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog)
 {
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;
+
 	if (!prog->aux->nr_linfo || !prog->jit_requested)
 		return 0;
 
+	if (!prog->aux->no_charge)
+		gfp_flags |= __GFP_ACCOUNT;
+
 	prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
 					  sizeof(*prog->aux->jited_linfo),
-					  GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+					  gfp_flags);
 	if (!prog->aux->jited_linfo)
 		return -ENOMEM;
 
@@ -224,7 +229,7 @@ void bpf_prog_fill_jited_linfo(struct bpf_prog *prog,
 struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
 				  gfp_t gfp_extra_flags)
 {
-	gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_ZERO | gfp_extra_flags;
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
 	struct bpf_prog *fp;
 	u32 pages;
 
@@ -233,6 +238,9 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
 	if (pages <= fp_old->pages)
 		return fp_old;
 
+	if (!fp_old->aux->no_charge)
+		gfp_flags |= __GFP_ACCOUNT;
+
 	fp = __vmalloc(size, gfp_flags);
 	if (fp) {
 		memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index fdfbb4d0d5e0..e386b549fafc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2218,9 +2218,10 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	enum bpf_prog_type type = attr->prog_type;
 	struct bpf_prog *prog, *dst_prog = NULL;
 	struct btf *attach_btf = NULL;
-	int err;
+	gfp_t gfp_flags = GFP_USER;
 	char license[128];
 	bool is_gpl;
+	int err;
 
 	if (CHECK_ATTR(BPF_PROG_LOAD))
 		return -EINVAL;
@@ -2305,7 +2306,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	}
 
 	/* plain bpf_prog allocation */
-	prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER | __GFP_ACCOUNT);
+	prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt),
+				prog_flags_no_charge(gfp_flags, attr));
 	if (!prog) {
 		if (dst_prog)
 			bpf_prog_put(dst_prog);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 13/14] bpf: selftests: Add test case for BPF_F_NO_CHARTE
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (11 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 12/14] bpf: Allow no charge for bpf prog Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-19 17:30 ` [PATCH 14/14] bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE Yafang Shao
  2022-03-21 22:52 ` [PATCH 00/14] bpf: Allow not to charge bpf memory Roman Gushchin
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Add a BPF_F_NO_CHARGE test case for various map types. Below is the result:
$ ./test_maps
...
test_no_charge:PASS
...

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/map_tests/no_charg.c        | 79 +++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/map_tests/no_charg.c

diff --git a/tools/testing/selftests/bpf/map_tests/no_charg.c b/tools/testing/selftests/bpf/map_tests/no_charg.c
new file mode 100644
index 000000000000..db18685a53f7
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/no_charg.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <sys/syscall.h>
+#include <linux/bpf.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <errno.h>
+
+#include <test_maps.h>
+
+struct map_attr {
+	__u32 map_type;
+	char *map_name;
+	__u32 key_size;
+	__u32 value_size;
+	__u32 max_entries;
+	__u32 map_flags;
+};
+
+static struct map_attr attrs[] = {
+	{BPF_MAP_TYPE_HASH, "BPF_MAP_TYPE_HASH", 4, 4, 10},
+	{BPF_MAP_TYPE_ARRAY, "BPF_MAP_TYPE_ARRAY", 4, 4, 10},
+	{BPF_MAP_TYPE_PROG_ARRAY, "BPF_MAP_TYPE_PROG_ARRAY", 4, 4, 10},
+	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, "BPF_MAP_TYPE_PERF_EVENT_ARRAY", 4, 4, 10},
+	{BPF_MAP_TYPE_PERCPU_HASH, "BPF_MAP_TYPE_PERCPU_HASH", 4, 4, 10},
+	{BPF_MAP_TYPE_PERCPU_ARRAY, "BPF_MAP_TYPE_PERCPU_ARRAY", 4, 4, 10},
+	{BPF_MAP_TYPE_STACK_TRACE, "BPF_MAP_TYPE_STACK_TRACE", 4, 8, 10},
+	{BPF_MAP_TYPE_CGROUP_ARRAY, "BPF_MAP_TYPE_CGROUP_ARRAY", 4, 4, 10},
+	{BPF_MAP_TYPE_LRU_HASH, "BPF_MAP_TYPE_LRU_HASH", 4, 4, 10},
+	{BPF_MAP_TYPE_LRU_PERCPU_HASH, "BPF_MAP_TYPE_LRU_PERCPU_HASH", 4, 4, 10},
+	{BPF_MAP_TYPE_LPM_TRIE, "BPF_MAP_TYPE_LPM_TRIE", 32, 4, 10, BPF_F_NO_PREALLOC},
+	{BPF_MAP_TYPE_DEVMAP, "BPF_MAP_TYPE_DEVMAP", 4, 4, 10},
+	{BPF_MAP_TYPE_SOCKMAP, "BPF_MAP_TYPE_SOCKMAP", 4, 4, 10},
+	{BPF_MAP_TYPE_CPUMAP, "BPF_MAP_TYPE_CPUMAP", 4, 4, 10},
+	{BPF_MAP_TYPE_XSKMAP, "BPF_MAP_TYPE_XSKMAP", 4, 4, 10},
+	{BPF_MAP_TYPE_SOCKHASH, "BPF_MAP_TYPE_SOCKHASH", 4, 4, 10},
+	{BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, "BPF_MAP_TYPE_REUSEPORT_SOCKARRAY", 4, 4, 10},
+	{BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, "BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE", 8, 4, 0},
+	{BPF_MAP_TYPE_QUEUE, "BPF_MAP_TYPE_QUEUE", 0, 4, 10},
+	{BPF_MAP_TYPE_DEVMAP_HASH, "BPF_MAP_TYPE_DEVMAP_HASH", 4, 4, 10},
+	{BPF_MAP_TYPE_RINGBUF, "BPF_MAP_TYPE_RINGBUF", 0, 0, 4096},
+	{BPF_MAP_TYPE_BLOOM_FILTER, "BPF_MAP_TYPE_BLOOM_FILTER", 0, 4, 10},
+};
+
+static __u32 flags[] = {
+	BPF_F_NO_CHARGE,
+};
+
+void test_map_flags(union bpf_attr *attr, char *name)
+{
+	int mfd;
+
+	mfd = syscall(SYS_bpf, BPF_MAP_CREATE, attr, sizeof(*attr));
+	CHECK(mfd <= 0 && mfd != -EPERM, "no_charge", "%s error: %s\n",
+		name, strerror(errno));
+
+	if (mfd > 0)
+		close(mfd);
+}
+
+void test_no_charge(void)
+{
+	union bpf_attr attr;
+	int i, j;
+
+	memset(&attr, 0, sizeof(attr));
+	for (i = 0; i < sizeof(flags) / sizeof(__u32); i++) {
+		for (j = 0; j < sizeof(attrs) / sizeof(struct map_attr); j++) {
+			attr.map_type = attrs[j].map_type;
+			attr.key_size = attrs[j].key_size;
+			attr.value_size = attrs[j].value_size;
+			attr.max_entries = attrs[j].max_entries;
+			attr.map_flags = attrs[j].map_flags | flags[i];
+			test_map_flags(&attr, attrs[j].map_name);
+		}
+	}
+
+	printf("%s:PASS\n", __func__);
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 14/14] bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (12 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 13/14] bpf: selftests: Add test case for BPF_F_NO_CHARTE Yafang Shao
@ 2022-03-19 17:30 ` Yafang Shao
  2022-03-21 22:52 ` [PATCH 00/14] bpf: Allow not to charge bpf memory Roman Gushchin
  14 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-19 17:30 UTC (permalink / raw)
  To: roman.gushchin, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, shuah
  Cc: netdev, bpf, linux-kselftest, Yafang Shao

Add a test case to check whether BPF_F_PROG_NO_CHARGE is valid.
The result is as follows:
 $ ./test_progs
 ...
 #103 no_charge:OK
 ...

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/no_charge.c      | 49 +++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/no_charge.c

diff --git a/tools/testing/selftests/bpf/prog_tests/no_charge.c b/tools/testing/selftests/bpf/prog_tests/no_charge.c
new file mode 100644
index 000000000000..85fbd2ccbc71
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/no_charge.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <sys/syscall.h>
+
+#include "test_progs.h"
+
+#define BPF_ALU64_IMM(OP, DST, IMM)					\
+	((struct bpf_insn) {							\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
+		.dst_reg = DST,								\
+		.src_reg = 0,								\
+		.off   = 0,									\
+		.imm   = IMM })
+
+#define BPF_EXIT_INSN()					\
+	((struct bpf_insn) {				\
+		.code  = BPF_JMP | BPF_EXIT,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,						\
+		.imm   = 0 })
+
+void test_no_charge(void)
+{
+	struct bpf_insn prog[] = {
+		BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+	union bpf_attr attr;
+	int duration = 0;
+	int fd;
+
+	bzero(&attr, sizeof(attr));
+	attr.prog_type = BPF_PROG_TYPE_SCHED_CLS;
+	attr.insn_cnt = 2;
+	attr.insns = (__u64)prog;
+	attr.license = (__u64)("GPL");
+	attr.prog_flags |= BPF_F_PROG_NO_CHARGE;
+
+	fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
+	CHECK(fd < 0 && fd != -EPERM, "no_charge", "error: %s\n",
+			strerror(errno));
+
+	if (fd > 0)
+		close(fd);
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/14] bpf: Allow not to charge bpf memory
  2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
                   ` (13 preceding siblings ...)
  2022-03-19 17:30 ` [PATCH 14/14] bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE Yafang Shao
@ 2022-03-21 22:52 ` Roman Gushchin
  2022-03-22 16:10   ` Yafang Shao
  14 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2022-03-21 22:52 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, shuah, netdev, bpf, linux-kselftest

Hello, Yafang!

Thank you for continuing working on this!

On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting, the bpf memory is
> charged to the loader's memcg by defaut, that causes unexpected issues for
> us. For instance, the container of the loader-which loads the bpf programs
> and pins them on bpffs-may restart after pinning the progs and maps. After
> the restart, the pinned progs and maps won't belong to the new container
> any more, while they actually belong to an offline memcg left by the
> previous generation. That inconsistent behavior will make trouble for the
> memory resource management for this container.

I'm looking at this text and increasingly feeling that it's not a bpf-specific
problem and it shouldn't be solved as one.

Is there any significant reason why the loader can't temporarily enter the
root cgroup before creating bpf maps/progs? I know it will require some changes
in the userspace code, but so do new bpf flags.
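
Something like the following rough sketch, assuming cgroup v2 is mounted at
/sys/fs/cgroup, the loader is privileged enough to write to the relevant
cgroup.procs files, and "my-original-cgroup" stands for wherever the agent
normally runs (none of this is in the series, it's just an illustration):

	char pid[16];
	int fd;

	snprintf(pid, sizeof(pid), "%d", getpid());

	fd = open("/sys/fs/cgroup/cgroup.procs", O_WRONLY);
	write(fd, pid, strlen(pid));	/* enter the root cgroup */
	close(fd);

	/* ... create bpf maps / load and pin bpf progs here ... */

	fd = open("/sys/fs/cgroup/my-original-cgroup/cgroup.procs", O_WRONLY);
	write(fd, pid, strlen(pid));	/* return to the original cgroup */
	close(fd);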

> 
> The reason why these progs and maps have to be persistent across multiple
> generations is that these progs and maps are also used by other processes
> which are not in this container. IOW, they can't be removed when this
> container is restarted. Take a specific example, bpf program for clsact
> qdisc is loaded by a agent running in a container, which not only loads
> bpf program but also processes the data generated by this program and do
> some other maintainace things.
> 
> In order to keep the charging behavior consistent, we used to consider a
> way to recharge these pinned maps and progs again after the container is
> restarted, but after the discussion[1] with Roman, we decided to go
> another direction that don't charge them to the container in the first
> place. TL;DR about the mentioned disccussion: recharging is not a generic
> solution and it may take too much risk.
> 
> This patchset is the solution of no charge. Two flags are introduced in
> union bpf_attr, one for bpf map and another for bpf prog. The user who
> doesn't want to charge to current memcg can use these two flags. These two
> flags are only permitted for sys admin as these memory will be accounted to
> the root memcg only.

If we're going with bpf-specific flags (which I'd prefer not to), let's
define them as the way to create system-wide bpf objects which are expected
to outlive the original cgroup. With expectations that they will be treated
as belonging to the root cgroup by any sort of existing or future resource
accounting (e.g. if we'll start accounting CPU used by bpf programs).

But then again: why just not create them in the root cgroup?

Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/14] bpf: Allow not to charge bpf memory
  2022-03-21 22:52 ` [PATCH 00/14] bpf: Allow not to charge bpf memory Roman Gushchin
@ 2022-03-22 16:10   ` Yafang Shao
  2022-03-22 19:10     ` Roman Gushchin
  0 siblings, 1 reply; 19+ messages in thread
From: Yafang Shao @ 2022-03-22 16:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, shuah, netdev,
	bpf, linux-kselftest

On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Hello, Yafang!
>
> Thank you for continuing working on this!
>
> On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting, the bpf memory is
> > charged to the loader's memcg by defaut, that causes unexpected issues for
> > us. For instance, the container of the loader-which loads the bpf programs
> > and pins them on bpffs-may restart after pinning the progs and maps. After
> > the restart, the pinned progs and maps won't belong to the new container
> > any more, while they actually belong to an offline memcg left by the
> > previous generation. That inconsistent behavior will make trouble for the
> > memory resource management for this container.
>
> I'm looking at this text and increasingly feeling that it's not a bpf-specific
> problem and it shouldn't be solved as one.
>

I'm not sure whether it is a common issue or not, but I'm sure bpf has
its own special characteristics that we should handle specifically.  I
can show you an example of why bpf is a special one.

The pinned bpf is similar to a kernel module, right?
But that issue can't happen with a kernel module, while it can happen
with bpf only.  The reason is that a kernel module can choose whether
or not to account the memory it allocates, e.g.
    - Account
      kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
    - Not account
      kmalloc(size, GFP_KERNEL);

Bpf has no such choice, because GFP_KERNEL is a kernel API which is
not exposed to the user.

Then the issue is exposed when memcg-based accounting is forcefully
enabled for all bpf programs. That is a behavior change, and
unfortunately we don't give the user a chance to keep the old behavior
unless they stop using memcg altogether...

But that is not to say memcg-based accounting is bad; it is really
useful, but it needs to be improved. We can't expose __GFP_ACCOUNT to
the user directly, but we can expose a wrapper around it, which is
what this patchset does, as we have always done in bpf.
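
Concretely, the "wrapper" is just a set of tiny helpers; the prog-side one
added in patch 9 of this series is:

	static inline gfp_t
	prog_flags_no_charge(gfp_t flags, union bpf_attr *attr)
	{
		return flags |= (attr->prog_flags & BPF_F_PROG_NO_CHARGE) ?
						0 : __GFP_ACCOUNT;
	}

So the user only ever sees a bpf flag, never a gfp flag.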

> Is there any significant reason why the loader can't temporarily enter the
> root cgroup before creating bpf maps/progs? I know it will require some changes
> in the userspace code, but so do new bpf flags.
>

In our k8s environment, the user agents have to be deployed in a
DaemonSet[1].  Temporarily entering the root cgroup before creating
bpf maps/progs would cause more trouble; for example, we would have to
change the way we deploy user agents, which would be a big project.

[1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/

> >
> > The reason why these progs and maps have to be persistent across multiple
> > generations is that these progs and maps are also used by other processes
> > which are not in this container. IOW, they can't be removed when this
> > container is restarted. Take a specific example, bpf program for clsact
> > qdisc is loaded by a agent running in a container, which not only loads
> > bpf program but also processes the data generated by this program and do
> > some other maintainace things.
> >
> > In order to keep the charging behavior consistent, we used to consider a
> > way to recharge these pinned maps and progs again after the container is
> > restarted, but after the discussion[1] with Roman, we decided to go
> > another direction that don't charge them to the container in the first
> > place. TL;DR about the mentioned disccussion: recharging is not a generic
> > solution and it may take too much risk.
> >
> > This patchset is the solution of no charge. Two flags are introduced in
> > union bpf_attr, one for bpf map and another for bpf prog. The user who
> > doesn't want to charge to current memcg can use these two flags. These two
> > flags are only permitted for sys admin as these memory will be accounted to
> > the root memcg only.
>
> If we're going with bpf-specific flags (which I'd prefer not to), let's
> define them as the way to create system-wide bpf objects which are expected
> to outlive the original cgroup. With expectations that they will be treated
> as belonging to the root cgroup by any sort of existing or future resource
> accounting (e.g. if we'll start accounting CPU used by bpf prgrams).
>

Now that we are talking about the cpu resource, I have some more
complaints: the cpu cgroup really does better than the memory cgroup.
Below is a detailed look at why the cpu cgroup does a good job,

   - CPU
                        Task Cgroup
      Code          CPU time is accounted to the one who is executING this code.

   - Memory
                        Memory Cgroup
      Data          Memory usage is accounted to the one who allocatED this data.

Do you see the difference?
The difference is that cpu time is accounted to the one who is using
it (that is reasonable), while memory usage is accounted to the one
who allocated it (that is unreasonable). If we split the Data/Code
into private and shared, we can see why it is unreasonable.

                                Memory Cgroup
Private Data           Private and thus accounted to one single memcg, good.
Shared Data           Shared but accounted to one single memcg, bad.

                                Task Cgroup
Private Code          Private and thus accounted to one single task group, good.
Shared Code          Shared and thus accounted to all the task groups, good.

Pages are accounted when they are allocated rather than when they are
used; that is why so many ridiculous things happen.  But it is common
sense that we can't dynamically charge a page to the process that is
accessing it, right?  So we have to handle the issues caused by shared
pages case by case.

> But then again: why just not create them in the root cgroup?
>

As I explained above, that is limited by the deployment.

-- 
Thanks
Yafang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/14] bpf: Allow not to charge bpf memory
  2022-03-22 16:10   ` Yafang Shao
@ 2022-03-22 19:10     ` Roman Gushchin
  2022-03-23  1:37       ` Yafang Shao
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2022-03-22 19:10 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, shuah, netdev,
	bpf, linux-kselftest

On Wed, Mar 23, 2022 at 12:10:34AM +0800, Yafang Shao wrote:
> On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > Hello, Yafang!
> >
> > Thank you for continuing working on this!
> >
> > On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > charged to the loader's memcg by defaut, that causes unexpected issues for
> > > us. For instance, the container of the loader-which loads the bpf programs
> > > and pins them on bpffs-may restart after pinning the progs and maps. After
> > > the restart, the pinned progs and maps won't belong to the new container
> > > any more, while they actually belong to an offline memcg left by the
> > > previous generation. That inconsistent behavior will make trouble for the
> > > memory resource management for this container.
> >
> > I'm looking at this text and increasingly feeling that it's not a bpf-specific
> > problem and it shouldn't be solved as one.
> >
> 
> I'm not sure whether it is a common issue or not, but I'm sure bpf has
> its special attribute that we should handle it specifically.  I can
> show you an example on why bpf is a special one.
> 
> The pinned bpf is similar to a kernel module, right?
> But that issue can't happen in a kernel module, while it can happen in
> bpf only.  The reason is that the kernel module has the choice on
> whether account the allocated memory or not, e.g.
>     - Account
>       kmalloc(size,  GFP_KERNEL | __GFP_ACCOUNT);
>    - Not Account
>       kmalloc(size, GFP_KERNEL);
> 
> While the bpf has no choice because the GFP_KERNEL is a KAPI which is
> not exposed to the user.

But if your process opens a file, creates a pipe etc there are also
kernel allocations happening and the process has no control over whether
these allocations are accounted or not. The same applies for the anonymous
memory and pagecache as well, so it's not even kmem-specific.

> 
> Then the issue is exposed when the memcg-based accounting is
> forcefully enabled to all bpf programs. That is a behavior change,
> while unfortunately we don't give the user a chance to keep the old
> behavior unless they don't use memcg....
> 
> But that is not to say the memcg-based accounting is bad, while it is
> really useful, but it needs to be improved. We can't expose
> GFP_ACCOUNT to the user, but we can expose a wrapper of GFP_ACCOUNT to
> the user, that's what this patchset did, like what we always have done
> in bpf.
> 
> > Is there any significant reason why the loader can't temporarily enter the
> > root cgroup before creating bpf maps/progs? I know it will require some changes
> > in the userspace code, but so do new bpf flags.
> >
> 
> On our k8s environment, the user agents should be deployed in a
> Daemonset[1].  It will make more trouble to temporarily enter the root
> group before creating bpf maps/progs, for example we must change the
> way we used to deploy user agents, that will be a big project.

I understand; however, introducing new kernel interfaces to overcome such
things has its own downside: every introduced interface will stay pretty
much forever and we'll _have_ to support it. Kernel interfaces have a very long
life cycle, we have to admit it.

The problem you're describing - inconsistencies on accounting of shared regions
of memory - is a generic cgroup problem, which has a configuration solution:
the resource accounting and control should be performed on a stable level and
actual workloads can be (re)started in sub-cgroups with optionally disabled
physical controllers.
E.g.:
    /
        workload1.slice                 <- accounting should be performed here
            workload_gen1.scope
            workload_gen2.scope
            ...
        workload2.slice
            ...
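
Purely as an illustration with raw cgroupfs (made-up paths; in practice
systemd creates and manages the .slice/.scope hierarchy itself):

	int fd;

	mkdir("/sys/fs/cgroup/workload1.slice", 0755);
	mkdir("/sys/fs/cgroup/workload1.slice/workload_gen1.scope", 0755);

	/* limits and accounting live on the stable level */
	fd = open("/sys/fs/cgroup/workload1.slice/memory.max", O_WRONLY);
	write(fd, "1073741824", 10);
	close(fd);

	/* Each restarted generation gets a fresh *_genN.scope child; whatever
	 * it charges there, including pinned bpf objects, still rolls up to
	 * workload1.slice, so the slice-level numbers stay consistent across
	 * restarts.
	 */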


> 
> [1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
> 
> > >
> > > The reason why these progs and maps have to be persistent across multiple
> > > generations is that these progs and maps are also used by other processes
> > > which are not in this container. IOW, they can't be removed when this
> > > container is restarted. Take a specific example, bpf program for clsact
> > > qdisc is loaded by a agent running in a container, which not only loads
> > > bpf program but also processes the data generated by this program and do
> > > some other maintainace things.
> > >
> > > In order to keep the charging behavior consistent, we used to consider a
> > > way to recharge these pinned maps and progs again after the container is
> > > restarted, but after the discussion[1] with Roman, we decided to go
> > > another direction that don't charge them to the container in the first
> > > place. TL;DR about the mentioned disccussion: recharging is not a generic
> > > solution and it may take too much risk.
> > >
> > > This patchset is the solution of no charge. Two flags are introduced in
> > > union bpf_attr, one for bpf map and another for bpf prog. The user who
> > > doesn't want to charge to current memcg can use these two flags. These two
> > > flags are only permitted for sys admin as these memory will be accounted to
> > > the root memcg only.
> >
> > If we're going with bpf-specific flags (which I'd prefer not to), let's
> > define them as the way to create system-wide bpf objects which are expected
> > to outlive the original cgroup. With expectations that they will be treated
> > as belonging to the root cgroup by any sort of existing or future resource
> > accounting (e.g. if we'll start accounting CPU used by bpf prgrams).
> >
> 
> Now that talking about the cpu resource, I have some more complaints
> that cpu cgroup does really better than memory cgroup. Below is the
> detailed information why cpu cgroup does a good job,
> 
>    - CPU
>                         Task Cgroup
>       Code          CPU time is accounted to the one who is executeING
>  this code
> 
>    - Memory
>                          Memory Cgroup
>       Data           Memory usage is accounted to the one who
> allocatED this data.
> 
> Have you found the difference?

Well, RAM is a vastly different thing than CPU :)
They have different physical properties and corresponding accounting limitations.

> The difference is that, cpu time is accounted to the one who is using
> it (that is reasonable), while memory usage is accounted to the one
> who allocated it (that is unreasonable). If we split the Data/Code
> into private and shared, we can find why it is unreasonable.
> 
>                                 Memory Cgroup
> Private Data           Private and thus accounted to one single memcg, good.
> Shared Data           Shared but accounted to one single memcg, bad.
> 
>                                 Task Cgroup
> Private Code          Private and thus accounted to one single task group, good.
> Shared Code          Shared and thus accounted to all the task groups, good.
> 
> The pages are accounted when they are allocated rather than when they
> are used, that is why so many ridiculous things happen.   But we have
> a common sense that we can’t dynamically charge the page to the
> process who is accessing it, right?  So we have to handle the issues
> caused by shared pages case by case.

The accounting of shared regions of memory is complex because of two
physical limitations:

1) The amount of (meta)data which can be used to track ownership. We expect
the memory overhead to be small in comparison to the accounted data. If a page
is used by many cgroups, even saving a single pointer to each cgroup can take
a lot of space. Even worse for slab objects. At some point it stops making
sense: if the accounting takes more memory than the accounted memory, it's
better to not account at all.

2) CPU overhead: tracking memory usage beyond the initial allocation adds
an overhead to some very hot paths. Imagine two processes mapping the same file,
the first process faults in the whole file and the second just uses the pagecache.
Currently it's very efficient. Causing the second process to change the ownership
information on each minor page fault will lead to a performance regression.
Think of libc binary as this file.

That said, I'm not saying it can't be done better than now. But it's a complex
problem.

Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/14] bpf: Allow not to charge bpf memory
  2022-03-22 19:10     ` Roman Gushchin
@ 2022-03-23  1:37       ` Yafang Shao
  0 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-03-23  1:37 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, shuah, netdev,
	bpf, linux-kselftest

On Wed, Mar 23, 2022 at 3:10 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Mar 23, 2022 at 12:10:34AM +0800, Yafang Shao wrote:
> > On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > Hello, Yafang!
> > >
> > > Thank you for continuing working on this!
> > >
> > > On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> > > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > > charged to the loader's memcg by defaut, that causes unexpected issues for
> > > > us. For instance, the container of the loader-which loads the bpf programs
> > > > and pins them on bpffs-may restart after pinning the progs and maps. After
> > > > the restart, the pinned progs and maps won't belong to the new container
> > > > any more, while they actually belong to an offline memcg left by the
> > > > previous generation. That inconsistent behavior will make trouble for the
> > > > memory resource management for this container.
> > >
> > > I'm looking at this text and increasingly feeling that it's not a bpf-specific
> > > problem and it shouldn't be solved as one.
> > >
> >
> > I'm not sure whether it is a common issue or not, but I'm sure bpf has
> > its special attribute that we should handle it specifically.  I can
> > show you an example on why bpf is a special one.
> >
> > The pinned bpf is similar to a kernel module, right?
> > But that issue can't happen in a kernel module, while it can happen in
> > bpf only.  The reason is that the kernel module has the choice on
> > whether account the allocated memory or not, e.g.
> >     - Account
> >       kmalloc(size,  GFP_KERNEL | __GFP_ACCOUNT);
> >    - Not Account
> >       kmalloc(size, GFP_KERNEL);
> >
> > While the bpf has no choice because the GFP_KERNEL is a KAPI which is
> > not exposed to the user.
>
> But if your process opens a file, creates a pipe etc there are also
> kernel allocations happening and the process has no control over whether
> these allocations are accounted or not. The same applies for the anonymous
> memory and pagecache as well, so it's not even kmem-specific.
>

So what is the real problem in practice?
Does anyone complain about it?

[At least, there's no behavior change in these areas.]

> >
> > Then the issue is exposed when the memcg-based accounting is
> > forcefully enabled to all bpf programs. That is a behavior change,
> > while unfortunately we don't give the user a chance to keep the old
> > behavior unless they don't use memcg....
> >
> > But that is not to say the memcg-based accounting is bad, while it is
> > really useful, but it needs to be improved. We can't expose
> > GFP_ACCOUNT to the user, but we can expose a wrapper of GFP_ACCOUNT to
> > the user, that's what this patchset did, like what we always have done
> > in bpf.
> >
> > > Is there any significant reason why the loader can't temporarily enter the
> > > root cgroup before creating bpf maps/progs? I know it will require some changes
> > > in the userspace code, but so do new bpf flags.
> > >
> >
> > On our k8s environment, the user agents should be deployed in a
> > Daemonset[1].  It will make more trouble to temporarily enter the root
> > group before creating bpf maps/progs, for example we must change the
> > way we used to deploy user agents, that will be a big project.
>
> I understand, however introducing new kernel interfaces to overcome such
> things has its own downside: every introduced interface will stay pretty
> much forever and we'll _have_ to support it. Kernel interfaces have a very long
> life cycle, we have to admit it.
>
> The problem you're describing - inconsistencies on accounting of shared regions
> of memory - is a generic cgroup problem, which has a configuration solution:
> the resource accounting and control should be performed on a stable level and
> actual workloads can be (re)started in sub-cgroups with optionally disabled
> physical controllers.
> E.g.:
>                         /
>         workload2.slice   workload1.slice     <- accounting should be performed here
> workload_gen1.scope workload_gen2.scope ...
>

I think we talked about it several days earlier.

>
> >
> > [1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
> >
> > > >
> > > > The reason why these progs and maps have to be persistent across multiple
> > > > generations is that these progs and maps are also used by other processes
> > > > which are not in this container. IOW, they can't be removed when this
> > > > container is restarted. Take a specific example, bpf program for clsact
> > > > qdisc is loaded by a agent running in a container, which not only loads
> > > > bpf program but also processes the data generated by this program and do
> > > > some other maintainace things.
> > > >
> > > > In order to keep the charging behavior consistent, we used to consider a
> > > > way to recharge these pinned maps and progs again after the container is
> > > > restarted, but after the discussion[1] with Roman, we decided to go
> > > > another direction that don't charge them to the container in the first
> > > > place. TL;DR about the mentioned disccussion: recharging is not a generic
> > > > solution and it may take too much risk.
> > > >
> > > > This patchset is the solution of no charge. Two flags are introduced in
> > > > union bpf_attr, one for bpf map and another for bpf prog. The user who
> > > > doesn't want to charge to current memcg can use these two flags. These two
> > > > flags are only permitted for sys admin as these memory will be accounted to
> > > > the root memcg only.
> > >
> > > If we're going with bpf-specific flags (which I'd prefer not to), let's
> > > define them as the way to create system-wide bpf objects which are expected
> > > to outlive the original cgroup. With expectations that they will be treated
> > > as belonging to the root cgroup by any sort of existing or future resource
> > > accounting (e.g. if we'll start accounting CPU used by bpf prgrams).
> > >
> >
> > Now that talking about the cpu resource, I have some more complaints
> > that cpu cgroup does really better than memory cgroup. Below is the
> > detailed information why cpu cgroup does a good job,
> >
> >    - CPU
> >                         Task Cgroup
> >       Code          CPU time is accounted to the one who is executeING
> >  this code
> >
> >    - Memory
> >                          Memory Cgroup
> >       Data           Memory usage is accounted to the one who
> > allocatED this data.
> >
> > Have you found the difference?
>
> Well, RAM is a vastly different thing than CPU :)
> They have different physical properties and corresponding accounting limitations.
>
> > The difference is that, cpu time is accounted to the one who is using
> > it (that is reasonable), while memory usage is accounted to the one
> > who allocated it (that is unreasonable). If we split the Data/Code
> > into private and shared, we can find why it is unreasonable.
> >
> >                                 Memory Cgroup
> > Private Data           Private and thus accounted to one single memcg, good.
> > Shared Data           Shared but accounted to one single memcg, bad.
> >
> >                                 Task Cgroup
> > Private Code          Private and thus accounted to one single task group, good.
> > Shared Code          Shared and thus accounted to all the task groups, good.
> >
> > The pages are accounted when they are allocated rather than when they
> > are used, that is why so many ridiculous things happen.   But we have
> > a common sense that we can’t dynamically charge the page to the
> > process who is accessing it, right?  So we have to handle the issues
> > caused by shared pages case by case.
>
> The accounting of shared regions of memory is complex because of two
> physical limitations:
>
> 1) Amount of (meta)data which we can be used to track ownership. We expect
> the memory overhead be small in comparison to the accounted data. If a page
> is used by many cgroups, even saving a single pointer to each cgroup can take
> a lot of space. Even worse for slab objects. At some point it stops making
> sense: if the accounting takes more memory than the accounted memory, it's
> better to not account at all.
>
> 2) CPU overhead: tracking memory usage beyond the initial allocation adds
> an overhead to some very hot paths. Imagine two processes mapping the same file,
> first processes faults in the whole file and the second just uses the pagecache.
> Currently it's very efficient. Causing the second process to change the ownership
> information on each minor page fault will lead to a performance regression.
> Think of libc binary as this file.
>
> That said, I'm not saying it can't be done better that now. But it's a complex
> problem.
>

Memcg-based accounting is also complex, but it introduces a user-visible
behavior change now :(

-- 
Thanks
Yafang

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-03-23  1:38 UTC | newest]

Thread overview: 19+ messages
2022-03-19 17:30 [PATCH 00/14] bpf: Allow not to charge bpf memory Yafang Shao
2022-03-19 17:30 ` [PATCH 01/14] bpf: Introduce no charge flag for bpf map Yafang Shao
2022-03-19 17:30 ` [PATCH 02/14] bpf: Only sys admin can set no charge flag Yafang Shao
2022-03-19 17:30 ` [PATCH 03/14] bpf: Enable no charge in map _CREATE_FLAG_MASK Yafang Shao
2022-03-19 17:30 ` [PATCH 04/14] bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc Yafang Shao
2022-03-19 17:30 ` [PATCH 05/14] bpf: Allow no charge " Yafang Shao
2022-03-19 17:30 ` [PATCH 06/14] bpf: Allow no charge for allocation not at map creation time Yafang Shao
2022-03-19 17:30 ` [PATCH 07/14] bpf: Allow no charge in map specific allocation Yafang Shao
2022-03-19 17:30 ` [PATCH 08/14] bpf: Aggregate flags for BPF_PROG_LOAD command Yafang Shao
2022-03-19 17:30 ` [PATCH 09/14] bpf: Add no charge flag for bpf prog Yafang Shao
2022-03-19 17:30 ` [PATCH 10/14] bpf: Only sys admin can set " Yafang Shao
2022-03-19 17:30 ` [PATCH 11/14] bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc Yafang Shao
2022-03-19 17:30 ` [PATCH 12/14] bpf: Allow no charge for bpf prog Yafang Shao
2022-03-19 17:30 ` [PATCH 13/14] bpf: selftests: Add test case for BPF_F_NO_CHARTE Yafang Shao
2022-03-19 17:30 ` [PATCH 14/14] bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE Yafang Shao
2022-03-21 22:52 ` [PATCH 00/14] bpf: Allow not to charge bpf memory Roman Gushchin
2022-03-22 16:10   ` Yafang Shao
2022-03-22 19:10     ` Roman Gushchin
2022-03-23  1:37       ` Yafang Shao
