* [PATCH v4 bpf-next 0/4] Add support for memory-mapping BPF array maps
@ 2019-11-15  4:02 Andrii Nakryiko
  2019-11-15  4:02 ` [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails Andrii Nakryiko
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15  4:02 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

This patch set adds the ability to memory-map BPF array maps (both single-
and multi-element). The primary use case is memory-mapping the BPF array maps
that libbpf implicitly creates to back global data variables. This greatly
improves usability and completely avoids syscalls for reading or updating
that data.

Due to memory-mapping requirements, a BPF array map that is supposed to be
memory-mapped has to be created with the special BPF_F_MMAPABLE flag, which
triggers a slightly different memory allocation strategy internally. See
patch 2 for details.

Libbpf is extended to detect kernel support for this flag and, if it is
supported, to specify it automatically for all global data maps.
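
As a minimal sketch of the intended userspace usage (assuming libbpf's
bpf_create_map_xattr() helper; the map name and sizes are made up for the
example), a BPF_F_MMAPABLE array can be accessed directly through mmap()
instead of per-element syscalls:

#include <sys/mman.h>
#include <unistd.h>
#include <bpf/bpf.h>

int create_and_mmap_array(void)
{
	struct bpf_create_map_attr attr = {
		.name = "mmapable",
		.map_type = BPF_MAP_TYPE_ARRAY,
		.map_flags = BPF_F_MMAPABLE,
		.key_size = sizeof(int),
		.value_size = sizeof(__u64),
		.max_entries = 512,
	};
	long page_sz = sysconf(_SC_PAGE_SIZE);
	size_t sz = (512 * sizeof(__u64) + page_sz - 1) / page_sz * page_sz;
	__u64 *vals;
	int fd;

	fd = bpf_create_map_xattr(&attr);
	if (fd < 0)
		return -1;
	vals = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (vals == MAP_FAILED) {
		close(fd);
		return -1;
	}
	vals[0] = 42; /* no bpf_map_update_elem() syscall needed */
	munmap(vals, sz);
	close(fd);
	return 0;
}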

Patch #1 refactors bpf_map_inc() and converts bpf_map's refcnt to atomic64_t
to make refcounting never fail.

v3->v4:
- add mmap's open() callback to fix refcounting (Johannes);
- switch to remap_vmalloc_pages() instead of custom fault handler (Johannes);
- converted bpf_map's refcnt/usercnt into atomic64_t;
- provide default bpf_map_default_vmops handling open/close properly;

v2->v3:
- change allocation strategy to avoid extra pointer dereference (Jakub);

v1->v2:
- fix map lookup code generation for BPF_F_MMAPABLE case;
- prevent BPF_F_MMAPABLE flag for all but plain array map type;
- centralize ref-counting in generic bpf_map_mmap();
- don't use uref counting (Alexei);
- use vfree() directly;
- print flags with %x (Song);
- extend tests to verify bpf_map_{lookup,update}_elem() logic as well.


Andrii Nakryiko (4):
  bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails
  bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  libbpf: make global data internal arrays mmap()-able, if possible
  selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests

 .../net/ethernet/netronome/nfp/bpf/offload.c  |   4 +-
 include/linux/bpf.h                           |  21 +-
 include/linux/vmalloc.h                       |   1 +
 include/uapi/linux/bpf.h                      |   3 +
 kernel/bpf/arraymap.c                         |  59 ++++-
 kernel/bpf/inode.c                            |   2 +-
 kernel/bpf/map_in_map.c                       |   2 +-
 kernel/bpf/syscall.c                          | 150 +++++++++---
 kernel/bpf/verifier.c                         |   6 +-
 kernel/bpf/xskmap.c                           |   6 +-
 mm/vmalloc.c                                  |  20 ++
 net/core/bpf_sk_storage.c                     |   2 +-
 tools/include/uapi/linux/bpf.h                |   3 +
 tools/lib/bpf/libbpf.c                        |  32 ++-
 .../selftests/bpf/prog_tests/core_reloc.c     |  45 ++--
 tools/testing/selftests/bpf/prog_tests/mmap.c | 220 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_mmap.c |  45 ++++
 17 files changed, 540 insertions(+), 81 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/mmap.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_mmap.c

-- 
2.17.1



* [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails
  2019-11-15  4:02 [PATCH v4 bpf-next 0/4] Add support for memory-mapping BPF array maps Andrii Nakryiko
@ 2019-11-15  4:02 ` Andrii Nakryiko
  2019-11-15 21:47   ` Song Liu
  2019-11-15 23:23   ` Daniel Borkmann
  2019-11-15  4:02 ` [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15  4:02 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Commit 92117d8443bc ("bpf: fix refcnt overflow") turned refcounting of
bpf_map into a potentially failing operation once the refcount reaches the
BPF_MAX_REFCNT limit (32k). With a 32-bit counter it is possible in practice
to overflow the refcounter and make it wrap around to 0, causing an erroneous
map free while there are still references to it, leading to use-after-free
problems.

But having failing refcounting operations is problematic in some cases. One
example is the mmap() interface. After establishing the initial memory
mapping, the user is allowed to arbitrarily map/remap/unmap parts of the
mapped memory, arbitrarily splitting it into multiple non-contiguous regions.
All of this happens without any control from the users of the mmap subsystem.
Instead, the mmap subsystem sends notifications to the original creator of
the memory mapping through the open()/close() callbacks, which can optionally
be specified when the initial memory mapping is created. These callbacks are
used to maintain an accurate refcount for bpf_map (see the next patch in this
series). The problem is that the open() callback is not supposed to fail,
because at that point the memory-mapped resource is already set up and
properly referenced. This poses a problem for using memory-mapping with BPF
maps.

One solution would be to maintain a separate refcount just for memory
mappings and do a single bpf_map_inc/bpf_map_put when it goes from/to zero,
respectively. There are similar use cases in the current work on tcp-bpf that
would necessitate an extra counter as well. This seems like a rather
unfortunate and ugly solution that doesn't scale well to various new use
cases.

Another approach is to use the non-failing refcount_t type, which uses a
32-bit counter internally but, once it reaches the overflow state at
UINT_MAX, stays there. This ultimately causes a memory leak, but prevents
use-after-free.

But given that refcounting is not the most performance-critical operation
for BPF maps (it's not used from running BPF program code), we can also just
switch to a 64-bit counter that can't overflow in practice, at the cost of a
tiny disadvantage on 32-bit platforms. This simplifies the semantics and
means the scenarios described above don't have to worry about a failing
refcount increment.
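
As a rough sanity check of that claim: even at 10^9 increments per second, a
64-bit counter would take 2^64 / 10^9 seconds, i.e. roughly 585 years, to
overflow.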

In terms of struct bpf_map size, we are still good and use the same amount of
space:

BEFORE (3 cache lines, 8 bytes of padding at the end):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
	atomic_t                   usercnt;              /*   132     4 */
	struct work_struct work;                         /*   136    32 */
	char                       name[16];             /*   168    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 146, holes: 1, sum holes: 38 */
	/* padding: 8 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

AFTER (same 3 cache lines, no extra padding now):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
	atomic64_t                 usercnt;              /*   136     8 */
	struct work_struct work;                         /*   144    32 */
	char                       name[16];             /*   176    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 154, holes: 1, sum holes: 38 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

This patch, while modifying all users of bpf_map_inc, also cleans up its
interface to match bpf_map_put, with separate operations bpf_map_inc and
bpf_map_inc_with_uref (matching bpf_map_put and bpf_map_put_with_uref,
respectively). Also, given that there are no users of bpf_map_inc_not_zero
specifying uref=true, remove the uref flag and default to uref=false
internally.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 .../net/ethernet/netronome/nfp/bpf/offload.c  |  4 +-
 include/linux/bpf.h                           | 10 ++--
 kernel/bpf/inode.c                            |  2 +-
 kernel/bpf/map_in_map.c                       |  2 +-
 kernel/bpf/syscall.c                          | 51 ++++++++-----------
 kernel/bpf/verifier.c                         |  6 +--
 kernel/bpf/xskmap.c                           |  6 +--
 net/core/bpf_sk_storage.c                     |  2 +-
 8 files changed, 34 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 88fab6a82acf..06927ba5a3ae 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -46,9 +46,7 @@ nfp_map_ptr_record(struct nfp_app_bpf *bpf, struct nfp_prog *nfp_prog,
 	/* Grab a single ref to the map for our record.  The prog destroy ndo
 	 * happens after free_used_maps().
 	 */
-	map = bpf_map_inc(map, false);
-	if (IS_ERR(map))
-		return PTR_ERR(map);
+	bpf_map_inc(map);
 
 	record = kmalloc(sizeof(*record), GFP_KERNEL);
 	if (!record) {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7c7f518811a6..6fbe599fb977 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -101,8 +101,8 @@ struct bpf_map {
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
 	 */
-	atomic_t refcnt ____cacheline_aligned;
-	atomic_t usercnt;
+	atomic64_t refcnt ____cacheline_aligned;
+	atomic64_t usercnt;
 	struct work_struct work;
 	char name[BPF_OBJ_NAME_LEN];
 };
@@ -677,9 +677,9 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
 
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
-struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref);
-struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map,
-						   bool uref);
+void bpf_map_inc(struct bpf_map *map);
+void bpf_map_inc_with_uref(struct bpf_map *map);
+struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
 int bpf_map_charge_memlock(struct bpf_map *map, u32 pages);
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index a70f7209cda3..2f17f24258dc 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -34,7 +34,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
 		raw = bpf_prog_inc(raw);
 		break;
 	case BPF_TYPE_MAP:
-		raw = bpf_map_inc(raw, true);
+		bpf_map_inc_with_uref(raw);
 		break;
 	default:
 		WARN_ON_ONCE(1);
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index fab4fb134547..4cbe987be35b 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -98,7 +98,7 @@ void *bpf_map_fd_get_ptr(struct bpf_map *map,
 		return inner_map;
 
 	if (bpf_map_meta_equal(map->inner_map_meta, inner_map))
-		inner_map = bpf_map_inc(inner_map, false);
+		bpf_map_inc(inner_map);
 	else
 		inner_map = ERR_PTR(-EINVAL);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6d9ce95e5a8d..9d8b56d660d7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -311,7 +311,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 
 static void bpf_map_put_uref(struct bpf_map *map)
 {
-	if (atomic_dec_and_test(&map->usercnt)) {
+	if (atomic64_dec_and_test(&map->usercnt)) {
 		if (map->ops->map_release_uref)
 			map->ops->map_release_uref(map);
 	}
@@ -322,7 +322,7 @@ static void bpf_map_put_uref(struct bpf_map *map)
  */
 static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
 {
-	if (atomic_dec_and_test(&map->refcnt)) {
+	if (atomic64_dec_and_test(&map->refcnt)) {
 		/* bpf_map_free_id() must be called first */
 		bpf_map_free_id(map, do_idr_lock);
 		btf_put(map->btf);
@@ -575,8 +575,8 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		goto free_map;
 
-	atomic_set(&map->refcnt, 1);
-	atomic_set(&map->usercnt, 1);
+	atomic64_set(&map->refcnt, 1);
+	atomic64_set(&map->usercnt, 1);
 
 	if (attr->btf_key_type_id || attr->btf_value_type_id) {
 		struct btf *btf;
@@ -653,21 +653,19 @@ struct bpf_map *__bpf_map_get(struct fd f)
 	return f.file->private_data;
 }
 
-/* prog's and map's refcnt limit */
-#define BPF_MAX_REFCNT 32768
-
-struct bpf_map *bpf_map_inc(struct bpf_map *map, bool uref)
+void bpf_map_inc(struct bpf_map *map)
 {
-	if (atomic_inc_return(&map->refcnt) > BPF_MAX_REFCNT) {
-		atomic_dec(&map->refcnt);
-		return ERR_PTR(-EBUSY);
-	}
-	if (uref)
-		atomic_inc(&map->usercnt);
-	return map;
+	atomic64_inc(&map->refcnt);
 }
 EXPORT_SYMBOL_GPL(bpf_map_inc);
 
+void bpf_map_inc_with_uref(struct bpf_map *map)
+{
+	atomic64_inc(&map->refcnt);
+	atomic64_inc(&map->usercnt);
+}
+EXPORT_SYMBOL_GPL(bpf_map_inc_with_uref);
+
 struct bpf_map *bpf_map_get_with_uref(u32 ufd)
 {
 	struct fd f = fdget(ufd);
@@ -677,38 +675,30 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd)
 	if (IS_ERR(map))
 		return map;
 
-	map = bpf_map_inc(map, true);
+	bpf_map_inc_with_uref(map);
 	fdput(f);
 
 	return map;
 }
 
 /* map_idr_lock should have been held */
-static struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map,
-					      bool uref)
+static struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
 {
 	int refold;
 
-	refold = atomic_fetch_add_unless(&map->refcnt, 1, 0);
-
-	if (refold >= BPF_MAX_REFCNT) {
-		__bpf_map_put(map, false);
-		return ERR_PTR(-EBUSY);
-	}
-
+	refold = atomic64_fetch_add_unless(&map->refcnt, 1, 0);
 	if (!refold)
 		return ERR_PTR(-ENOENT);
-
 	if (uref)
-		atomic_inc(&map->usercnt);
+		atomic64_inc(&map->usercnt);
 
 	return map;
 }
 
-struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
+struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map)
 {
 	spin_lock_bh(&map_idr_lock);
-	map = __bpf_map_inc_not_zero(map, uref);
+	map = __bpf_map_inc_not_zero(map, false);
 	spin_unlock_bh(&map_idr_lock);
 
 	return map;
@@ -1454,6 +1444,9 @@ static struct bpf_prog *____bpf_prog_get(struct fd f)
 	return f.file->private_data;
 }
 
+/* prog's refcnt limit */
+#define BPF_MAX_REFCNT 32768
+
 struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
 {
 	if (atomic_add_return(i, &prog->aux->refcnt) > BPF_MAX_REFCNT) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2f2374967b36..38d34afe979c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8169,11 +8169,7 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
 			 * will be used by the valid program until it's unloaded
 			 * and all maps are released in free_used_maps()
 			 */
-			map = bpf_map_inc(map, false);
-			if (IS_ERR(map)) {
-				fdput(f);
-				return PTR_ERR(map);
-			}
+			bpf_map_inc(map);
 
 			aux->map_index = env->used_map_cnt;
 			env->used_maps[env->used_map_cnt++] = map;
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index da16c30868f3..90c4fce1c981 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -11,10 +11,8 @@
 
 int xsk_map_inc(struct xsk_map *map)
 {
-	struct bpf_map *m = &map->map;
-
-	m = bpf_map_inc(m, false);
-	return PTR_ERR_OR_ZERO(m);
+	bpf_map_inc(&map->map);
+	return 0;
 }
 
 void xsk_map_put(struct xsk_map *map)
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index da5639a5bd3b..458be6b3eda9 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -798,7 +798,7 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
 		 * Try to grab map refcnt to make sure that it's still
 		 * alive and prevent concurrent removal.
 		 */
-		map = bpf_map_inc_not_zero(&smap->map, false);
+		map = bpf_map_inc_not_zero(&smap->map);
 		if (IS_ERR(map))
 			continue;
 
-- 
2.17.1



* [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15  4:02 [PATCH v4 bpf-next 0/4] Add support for memory-mapping BPF array maps Andrii Nakryiko
  2019-11-15  4:02 ` [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails Andrii Nakryiko
@ 2019-11-15  4:02 ` Andrii Nakryiko
  2019-11-15  4:45   ` Alexei Starovoitov
                     ` (2 more replies)
  2019-11-15  4:02 ` [PATCH v4 bpf-next 3/4] libbpf: make global data internal arrays mmap()-able, if possible Andrii Nakryiko
  2019-11-15  4:02 ` [PATCH v4 bpf-next 4/4] selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests Andrii Nakryiko
  3 siblings, 3 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15  4:02 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Johannes Weiner,
	Rik van Riel

Add the ability to memory-map the contents of a BPF array map. This is
extremely useful for working with BPF global data from userspace programs.
It allows avoiding the typical bpf_map_{lookup,update}_elem operations,
improving both performance and usability.

There had to be special considerations for map freezing, to avoid having a
writable memory view into a frozen map. To solve this issue, map freezing and
mmap-ing now happen under a mutex:
  - if the map is already frozen, no writable mapping is allowed;
  - if the map has writable memory mappings active (accounted in
    map->writecnt), map freezing will keep failing with -EBUSY;
  - once the number of writable memory mappings drops to zero, map freezing
    can be performed again (see the sketch of the resulting behavior right
    after this list).
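
A minimal sketch of that userspace-visible behavior (assuming map_fd refers
to a BPF_F_MMAPABLE array whose page-aligned data size is map_sz):

void *p;
int err;

p = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
err = bpf_map_freeze(map_fd);	/* fails, errno == EBUSY */
munmap(p, map_sz);		/* writecnt drops back to zero */
err = bpf_map_freeze(map_fd);	/* now succeeds */
/* once frozen, a new writable mapping is rejected */
p = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
/* p == MAP_FAILED, errno == EPERM */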

Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
can't be memory mapped either.

For a BPF_F_MMAPABLE array, memory allocation has to be done through
vmalloc() to be mmap()'able. We also need to make sure that the array data
memory is page-sized and page-aligned, so we over-allocate memory in such a
way that struct bpf_array sits at the end of a single page of memory, with
array->value aligned to the start of the second page. On deallocation we need
to accommodate this memory arrangement to free the vmalloc()'ed memory
correctly.
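
Concretely, the pointer adjustment looks roughly like this (a sketch; the
256-byte offset is only an illustrative value, the real offsetof(struct
bpf_array, value) depends on the kernel config):

/* data is the page-aligned start of the vmalloc()'ed region */
data = bpf_map_area_mmapable_alloc(array_size, numa_node);

/* e.g. with PAGE_SIZE == 4096 and offsetof(struct bpf_array, value) == 256:
 * array = data + 4096 - 256, so array->value == data + 4096, i.e. the
 * mmap()'able element data starts exactly at the second page
 */
array = data + PAGE_ALIGN(sizeof(struct bpf_array))
	- offsetof(struct bpf_array, value);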

One important consideration is how the memory-mapping subsystem functions.
It provides a few optional callbacks, among them open() and close(). close()
is called for each memory region that is unmapped, so that users can decrease
their reference counters and free up resources, if necessary. open() is
*almost* symmetrical: it's called for each memory region that is being
mapped, **except** the very first one. So bpf_map_mmap() does the initial
refcnt bump, while open() will do any extra ones after that. Thus the number
of close() calls is equal to the number of open() calls plus one.
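
A rough example of that accounting (the exact VMA splitting details are up to
the mm subsystem): mmap() one region, munmap() just its tail, then munmap()
the rest:

  mmap(whole region)  -> bpf_map_mmap(): refcnt +1, open() is not called
  munmap(tail part)   -> VMA is split: open() for the new tail VMA (+1),
                         then close() for it as it is unmapped (-1)
  munmap(the rest)    -> close() (-1)

One open() call, two close() calls, and the refcount is back to its original
value.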

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@surriel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf.h            | 11 ++--
 include/linux/vmalloc.h        |  1 +
 include/uapi/linux/bpf.h       |  3 ++
 kernel/bpf/arraymap.c          | 59 +++++++++++++++++---
 kernel/bpf/syscall.c           | 99 ++++++++++++++++++++++++++++++++--
 mm/vmalloc.c                   | 20 +++++++
 tools/include/uapi/linux/bpf.h |  3 ++
 7 files changed, 184 insertions(+), 12 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6fbe599fb977..8021fce98868 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -12,6 +12,7 @@
 #include <linux/err.h>
 #include <linux/rbtree_latch.h>
 #include <linux/numa.h>
+#include <linux/mm_types.h>
 #include <linux/wait.h>
 #include <linux/u64_stats_sync.h>
 
@@ -66,6 +67,7 @@ struct bpf_map_ops {
 				     u64 *imm, u32 off);
 	int (*map_direct_value_meta)(const struct bpf_map *map,
 				     u64 imm, u32 *off);
+	int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
 };
 
 struct bpf_map_memory {
@@ -94,9 +96,10 @@ struct bpf_map {
 	u32 btf_value_type_id;
 	struct btf *btf;
 	struct bpf_map_memory memory;
+	char name[BPF_OBJ_NAME_LEN];
 	bool unpriv_array;
-	bool frozen; /* write-once */
-	/* 48 bytes hole */
+	bool frozen; /* write-once; write-protected by freeze_mutex */
+	/* 22 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
@@ -104,7 +107,8 @@ struct bpf_map {
 	atomic64_t refcnt ____cacheline_aligned;
 	atomic64_t usercnt;
 	struct work_struct work;
-	char name[BPF_OBJ_NAME_LEN];
+	struct mutex freeze_mutex;
+	u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
 };
 
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
@@ -689,6 +693,7 @@ void bpf_map_charge_finish(struct bpf_map_memory *mem);
 void bpf_map_charge_move(struct bpf_map_memory *dst,
 			 struct bpf_map_memory *src);
 void *bpf_map_area_alloc(size_t size, int numa_node);
+void *bpf_map_area_mmapable_alloc(size_t size, int numa_node);
 void bpf_map_area_free(void *base);
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 4e7809408073..b4c58a191eb1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -93,6 +93,7 @@ extern void *vzalloc(unsigned long size);
 extern void *vmalloc_user(unsigned long size);
 extern void *vmalloc_node(unsigned long size, int node);
 extern void *vzalloc_node(unsigned long size, int node);
+extern void *vmalloc_user_node_flags(unsigned long size, int node, gfp_t flags);
 extern void *vmalloc_exec(unsigned long size);
 extern void *vmalloc_32(unsigned long size);
 extern void *vmalloc_32_user(unsigned long size);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index df6809a76404..bb39b53622d9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -346,6 +346,9 @@ enum bpf_attach_type {
 /* Clone map from listener for newly accepted socket */
 #define BPF_F_CLONE		(1U << 9)
 
+/* Enable memory-mapping BPF map */
+#define BPF_F_MMAPABLE		(1U << 10)
+
 /* flags for BPF_PROG_QUERY */
 #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1c65ce0098a9..baff45aed6ed 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -14,7 +14,7 @@
 #include "map_in_map.h"
 
 #define ARRAY_CREATE_FLAG_MASK \
-	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
+	(BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK)
 
 static void bpf_array_free_percpu(struct bpf_array *array)
 {
@@ -59,6 +59,10 @@ int array_map_alloc_check(union bpf_attr *attr)
 	    (percpu && numa_node != NUMA_NO_NODE))
 		return -EINVAL;
 
+	if (attr->map_type != BPF_MAP_TYPE_ARRAY &&
+	    attr->map_flags & BPF_F_MMAPABLE)
+		return -EINVAL;
+
 	if (attr->value_size > KMALLOC_MAX_SIZE)
 		/* if value_size is bigger, the user space won't be able to
 		 * access the elements.
@@ -102,10 +106,19 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	}
 
 	array_size = sizeof(*array);
-	if (percpu)
+	if (percpu) {
 		array_size += (u64) max_entries * sizeof(void *);
-	else
-		array_size += (u64) max_entries * elem_size;
+	} else {
+		/* rely on vmalloc() to return page-aligned memory and
+		 * ensure array->value is exactly page-aligned
+		 */
+		if (attr->map_flags & BPF_F_MMAPABLE) {
+			array_size = PAGE_ALIGN(array_size);
+			array_size += PAGE_ALIGN((u64) max_entries * elem_size);
+		} else {
+			array_size += (u64) max_entries * elem_size;
+		}
+	}
 
 	/* make sure there is no u32 overflow later in round_up() */
 	cost = array_size;
@@ -117,7 +130,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(ret);
 
 	/* allocate all map elements and zero-initialize them */
-	array = bpf_map_area_alloc(array_size, numa_node);
+	if (attr->map_flags & BPF_F_MMAPABLE) {
+		void *data;
+
+		/* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */
+		data = bpf_map_area_mmapable_alloc(array_size, numa_node);
+		if (!data) {
+			bpf_map_charge_finish(&mem);
+			return ERR_PTR(-ENOMEM);
+		}
+		array = data + PAGE_ALIGN(sizeof(struct bpf_array))
+			- offsetof(struct bpf_array, value);
+	} else {
+		array = bpf_map_area_alloc(array_size, numa_node);
+	}
 	if (!array) {
 		bpf_map_charge_finish(&mem);
 		return ERR_PTR(-ENOMEM);
@@ -350,6 +376,11 @@ static int array_map_delete_elem(struct bpf_map *map, void *key)
 	return -EINVAL;
 }
 
+static void *array_map_vmalloc_addr(struct bpf_array *array)
+{
+	return (void *)round_down((unsigned long)array, PAGE_SIZE);
+}
+
 /* Called when map->refcnt goes to zero, either from workqueue or from syscall */
 static void array_map_free(struct bpf_map *map)
 {
@@ -365,7 +396,10 @@ static void array_map_free(struct bpf_map *map)
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
 
-	bpf_map_area_free(array);
+	if (array->map.map_flags & BPF_F_MMAPABLE)
+		bpf_map_area_free(array_map_vmalloc_addr(array));
+	else
+		bpf_map_area_free(array);
 }
 
 static void array_map_seq_show_elem(struct bpf_map *map, void *key,
@@ -444,6 +478,18 @@ static int array_map_check_btf(const struct bpf_map *map,
 	return 0;
 }
 
+int array_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	pgoff_t pgoff = PAGE_ALIGN(sizeof(*array)) >> PAGE_SHIFT;
+	int err;
+
+	if (!(map->map_flags & BPF_F_MMAPABLE))
+		return -EINVAL;
+
+	return remap_vmalloc_range(vma, array_map_vmalloc_addr(array), pgoff);
+}
+
 const struct bpf_map_ops array_map_ops = {
 	.map_alloc_check = array_map_alloc_check,
 	.map_alloc = array_map_alloc,
@@ -455,6 +501,7 @@ const struct bpf_map_ops array_map_ops = {
 	.map_gen_lookup = array_map_gen_lookup,
 	.map_direct_value_addr = array_map_direct_value_addr,
 	.map_direct_value_meta = array_map_direct_value_meta,
+	.map_mmap = array_map_mmap,
 	.map_seq_show_elem = array_map_seq_show_elem,
 	.map_check_btf = array_map_check_btf,
 };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9d8b56d660d7..810e53990641 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -127,7 +127,7 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 	return map;
 }
 
-void *bpf_map_area_alloc(size_t size, int numa_node)
+static void *__bpf_map_area_alloc(size_t size, int numa_node, bool mmapable)
 {
 	/* We really just want to fail instead of triggering OOM killer
 	 * under memory pressure, therefore we set __GFP_NORETRY to kmalloc,
@@ -142,18 +142,33 @@ void *bpf_map_area_alloc(size_t size, int numa_node)
 	const gfp_t flags = __GFP_NOWARN | __GFP_ZERO;
 	void *area;
 
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
+	/* kmalloc()'ed memory can't be mmap()'ed */
+	if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
 		area = kmalloc_node(size, GFP_USER | __GFP_NORETRY | flags,
 				    numa_node);
 		if (area != NULL)
 			return area;
 	}
-
+	if (mmapable) {
+		BUG_ON(!PAGE_ALIGNED(size));
+		return vmalloc_user_node_flags(size, numa_node, GFP_KERNEL |
+					       __GFP_RETRY_MAYFAIL | flags);
+	}
 	return __vmalloc_node_flags_caller(size, numa_node,
 					   GFP_KERNEL | __GFP_RETRY_MAYFAIL |
 					   flags, __builtin_return_address(0));
 }
 
+void *bpf_map_area_alloc(size_t size, int numa_node)
+{
+	return __bpf_map_area_alloc(size, numa_node, false);
+}
+
+void *bpf_map_area_mmapable_alloc(size_t size, int numa_node)
+{
+	return __bpf_map_area_alloc(size, numa_node, true);
+}
+
 void bpf_map_area_free(void *area)
 {
 	kvfree(area);
@@ -425,6 +440,74 @@ static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf,
 	return -EINVAL;
 }
 
+/* called for any extra memory-mapped regions (except initial) */
+static void bpf_map_mmap_open(struct vm_area_struct *vma)
+{
+	struct bpf_map *map = vma->vm_file->private_data;
+
+	bpf_map_inc(map);
+
+	if (vma->vm_flags & VM_WRITE) {
+		mutex_lock(&map->freeze_mutex);
+		map->writecnt++;
+		mutex_unlock(&map->freeze_mutex);
+	}
+}
+
+/* called for all unmapped memory region (including initial) */
+static void bpf_map_mmap_close(struct vm_area_struct *vma)
+{
+	struct bpf_map *map = vma->vm_file->private_data;
+
+	if (vma->vm_flags & VM_WRITE) {
+		mutex_lock(&map->freeze_mutex);
+		map->writecnt--;
+		mutex_unlock(&map->freeze_mutex);
+	}
+
+	bpf_map_put(map);
+}
+
+static const struct vm_operations_struct bpf_map_default_vmops = {
+	.open		= bpf_map_mmap_open,
+	.close		= bpf_map_mmap_close,
+};
+
+static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct bpf_map *map = filp->private_data;
+	int err;
+
+	if (!map->ops->map_mmap || map_value_has_spin_lock(map))
+		return -ENOTSUPP;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	mutex_lock(&map->freeze_mutex);
+
+	if ((vma->vm_flags & VM_WRITE) && map->frozen) {
+		err = -EPERM;
+		goto out;
+	}
+
+	/* set default open/close callbacks */
+	vma->vm_ops = &bpf_map_default_vmops;
+	vma->vm_private_data = map;
+
+	err = map->ops->map_mmap(map, vma);
+	if (err)
+		goto out;
+
+	bpf_map_inc(map);
+
+	if (vma->vm_flags & VM_WRITE)
+		map->writecnt++;
+out:
+	mutex_unlock(&map->freeze_mutex);
+	return err;
+}
+
 const struct file_operations bpf_map_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= bpf_map_show_fdinfo,
@@ -432,6 +515,7 @@ const struct file_operations bpf_map_fops = {
 	.release	= bpf_map_release,
 	.read		= bpf_dummy_read,
 	.write		= bpf_dummy_write,
+	.mmap		= bpf_map_mmap,
 };
 
 int bpf_map_new_fd(struct bpf_map *map, int flags)
@@ -577,6 +661,7 @@ static int map_create(union bpf_attr *attr)
 
 	atomic64_set(&map->refcnt, 1);
 	atomic64_set(&map->usercnt, 1);
+	mutex_init(&map->freeze_mutex);
 
 	if (attr->btf_key_type_id || attr->btf_value_type_id) {
 		struct btf *btf;
@@ -1163,6 +1248,13 @@ static int map_freeze(const union bpf_attr *attr)
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
+
+	mutex_lock(&map->freeze_mutex);
+
+	if (map->writecnt) {
+		err = -EBUSY;
+		goto err_put;
+	}
 	if (READ_ONCE(map->frozen)) {
 		err = -EBUSY;
 		goto err_put;
@@ -1174,6 +1266,7 @@ static int map_freeze(const union bpf_attr *attr)
 
 	WRITE_ONCE(map->frozen, true);
 err_put:
+	mutex_unlock(&map->freeze_mutex);
 	fdput(f);
 	return err;
 }
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a3c70e275f4e..4a7d7459c4f9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2671,6 +2671,26 @@ void *vzalloc_node(unsigned long size, int node)
 }
 EXPORT_SYMBOL(vzalloc_node);
 
+/**
+ * vmalloc_user_node_flags - allocate memory for userspace on a specific node
+ * @size: allocation size
+ * @node: numa node
+ * @flags: flags for the page level allocator
+ *
+ * The resulting memory area is zeroed so it can be mapped to userspace
+ * without leaking data.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_user_node_flags(unsigned long size, int node, gfp_t flags)
+{
+	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
+				    flags | __GFP_ZERO, PAGE_KERNEL,
+				    VM_USERMAP, node,
+				    __builtin_return_address(0));
+}
+EXPORT_SYMBOL(vmalloc_user_node_flags);
+
 /**
  * vmalloc_exec - allocate virtually contiguous, executable memory
  * @size:	  allocation size
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index df6809a76404..bb39b53622d9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -346,6 +346,9 @@ enum bpf_attach_type {
 /* Clone map from listener for newly accepted socket */
 #define BPF_F_CLONE		(1U << 9)
 
+/* Enable memory-mapping BPF map */
+#define BPF_F_MMAPABLE		(1U << 10)
+
 /* flags for BPF_PROG_QUERY */
 #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
 
-- 
2.17.1



* [PATCH v4 bpf-next 3/4] libbpf: make global data internal arrays mmap()-able, if possible
  2019-11-15  4:02 [PATCH v4 bpf-next 0/4] Add support for memory-mapping BPF array maps Andrii Nakryiko
  2019-11-15  4:02 ` [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails Andrii Nakryiko
  2019-11-15  4:02 ` [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
@ 2019-11-15  4:02 ` Andrii Nakryiko
  2019-11-15  4:02 ` [PATCH v4 bpf-next 4/4] selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests Andrii Nakryiko
  3 siblings, 0 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15  4:02 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add detection of BPF_F_MMAPABLE flag support for arrays and add it as an
extra flag to internal global data maps, if it is supported by the kernel.
This allows users to memory-map global data and use it without BPF map
operations, greatly simplifying the user experience.
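
A minimal sketch of what this enables for userspace (the object file name,
map name, and one-page mapping size are assumptions for the example; internal
map names are the object name, truncated to fit BPF_OBJ_NAME_LEN, plus a
".bss"/".data"/".rodata" suffix):

struct bpf_object *obj;
struct bpf_map *bss;
long page_sz = sysconf(_SC_PAGE_SIZE);
void *globals;

obj = bpf_object__open_file("test.o", NULL);
bpf_object__load(obj);
bss = bpf_object__find_map_by_name(obj, "test.bss");
/* read/write global variables directly, no bpf_map_*_elem() syscalls */
globals = mmap(NULL, page_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
	       bpf_map__fd(bss), 0);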

Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/libbpf.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 96ef18cfeffb..7e2be1288fa1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -142,6 +142,8 @@ struct bpf_capabilities {
 	__u32 btf_func:1;
 	/* BTF_KIND_VAR and BTF_KIND_DATASEC support */
 	__u32 btf_datasec:1;
+	/* BPF_F_MMAPABLE is supported for arrays */
+	__u32 array_mmap:1;
 };
 
 /*
@@ -856,8 +858,6 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 		pr_warn("failed to alloc map name\n");
 		return -ENOMEM;
 	}
-	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu.\n",
-		 map_name, map->sec_idx, map->sec_offset);
 
 	def = &map->def;
 	def->type = BPF_MAP_TYPE_ARRAY;
@@ -865,6 +865,12 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 	def->value_size = data->d_size;
 	def->max_entries = 1;
 	def->map_flags = type == LIBBPF_MAP_RODATA ? BPF_F_RDONLY_PROG : 0;
+	if (obj->caps.array_mmap)
+		def->map_flags |= BPF_F_MMAPABLE;
+
+	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
+		 map_name, map->sec_idx, map->sec_offset, def->map_flags);
+
 	if (data_buff) {
 		*data_buff = malloc(data->d_size);
 		if (!*data_buff) {
@@ -2160,6 +2166,27 @@ static int bpf_object__probe_btf_datasec(struct bpf_object *obj)
 	return 0;
 }
 
+static int bpf_object__probe_array_mmap(struct bpf_object *obj)
+{
+	struct bpf_create_map_attr attr = {
+		.map_type = BPF_MAP_TYPE_ARRAY,
+		.map_flags = BPF_F_MMAPABLE,
+		.key_size = sizeof(int),
+		.value_size = sizeof(int),
+		.max_entries = 1,
+	};
+	int fd;
+
+	fd = bpf_create_map_xattr(&attr);
+	if (fd >= 0) {
+		obj->caps.array_mmap = 1;
+		close(fd);
+		return 1;
+	}
+
+	return 0;
+}
+
 static int
 bpf_object__probe_caps(struct bpf_object *obj)
 {
@@ -2168,6 +2195,7 @@ bpf_object__probe_caps(struct bpf_object *obj)
 		bpf_object__probe_global_data,
 		bpf_object__probe_btf_func,
 		bpf_object__probe_btf_datasec,
+		bpf_object__probe_array_mmap,
 	};
 	int i, ret;
 
-- 
2.17.1



* [PATCH v4 bpf-next 4/4] selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests
  2019-11-15  4:02 [PATCH v4 bpf-next 0/4] Add support for memory-mapping BPF array maps Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2019-11-15  4:02 ` [PATCH v4 bpf-next 3/4] libbpf: make global data internal arrays mmap()-able, if possible Andrii Nakryiko
@ 2019-11-15  4:02 ` Andrii Nakryiko
  3 siblings, 0 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15  4:02 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add selftests validating mmap()-ing of BPF array maps: both single-element
and multi-element ones. Check that plain bpf_map_update_elem() and
bpf_map_lookup_elem() work correctly with a memory-mapped array. Also convert
the CO-RE relocation tests to use memory-mapped views of global data.

Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 .../selftests/bpf/prog_tests/core_reloc.c     |  45 ++--
 tools/testing/selftests/bpf/prog_tests/mmap.c | 220 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_mmap.c |  45 ++++
 3 files changed, 292 insertions(+), 18 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/mmap.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_mmap.c

diff --git a/tools/testing/selftests/bpf/prog_tests/core_reloc.c b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
index f94bd071536b..ec9e2fdd6b89 100644
--- a/tools/testing/selftests/bpf/prog_tests/core_reloc.c
+++ b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
 #include "progs/core_reloc_types.h"
+#include <sys/mman.h>
 
 #define STRUCT_TO_CHAR_PTR(struct_name) (const char *)&(struct struct_name)
 
@@ -453,8 +454,15 @@ struct data {
 	char out[256];
 };
 
+static size_t roundup_page(size_t sz)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	return (sz + page_size - 1) / page_size * page_size;
+}
+
 void test_core_reloc(void)
 {
+	const size_t mmap_sz = roundup_page(sizeof(struct data));
 	struct bpf_object_load_attr load_attr = {};
 	struct core_reloc_test_case *test_case;
 	const char *tp_name, *probe_name;
@@ -463,8 +471,8 @@ void test_core_reloc(void)
 	struct bpf_map *data_map;
 	struct bpf_program *prog;
 	struct bpf_object *obj;
-	const int zero = 0;
-	struct data data;
+	struct data *data;
+	void *mmap_data = NULL;
 
 	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
 		test_case = &test_cases[i];
@@ -476,8 +484,7 @@ void test_core_reloc(void)
 		);
 
 		obj = bpf_object__open_file(test_case->bpf_obj_file, &opts);
-		if (CHECK(IS_ERR_OR_NULL(obj), "obj_open",
-			  "failed to open '%s': %ld\n",
+		if (CHECK(IS_ERR(obj), "obj_open", "failed to open '%s': %ld\n",
 			  test_case->bpf_obj_file, PTR_ERR(obj)))
 			continue;
 
@@ -519,24 +526,22 @@ void test_core_reloc(void)
 		if (CHECK(!data_map, "find_data_map", "data map not found\n"))
 			goto cleanup;
 
-		memset(&data, 0, sizeof(data));
-		memcpy(data.in, test_case->input, test_case->input_len);
-
-		err = bpf_map_update_elem(bpf_map__fd(data_map),
-					  &zero, &data, 0);
-		if (CHECK(err, "update_data_map",
-			  "failed to update .data map: %d\n", err))
+		mmap_data = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
+				 MAP_SHARED, bpf_map__fd(data_map), 0);
+		if (CHECK(mmap_data == MAP_FAILED, "mmap",
+			  ".bss mmap failed: %d", errno)) {
+			mmap_data = NULL;
 			goto cleanup;
+		}
+		data = mmap_data;
+
+		memset(mmap_data, 0, sizeof(*data));
+		memcpy(data->in, test_case->input, test_case->input_len);
 
 		/* trigger test run */
 		usleep(1);
 
-		err = bpf_map_lookup_elem(bpf_map__fd(data_map), &zero, &data);
-		if (CHECK(err, "get_result",
-			  "failed to get output data: %d\n", err))
-			goto cleanup;
-
-		equal = memcmp(data.out, test_case->output,
+		equal = memcmp(data->out, test_case->output,
 			       test_case->output_len) == 0;
 		if (CHECK(!equal, "check_result",
 			  "input/output data don't match\n")) {
@@ -548,12 +553,16 @@ void test_core_reloc(void)
 			}
 			for (j = 0; j < test_case->output_len; j++) {
 				printf("output byte #%d: EXP 0x%02hhx GOT 0x%02hhx\n",
-				       j, test_case->output[j], data.out[j]);
+				       j, test_case->output[j], data->out[j]);
 			}
 			goto cleanup;
 		}
 
 cleanup:
+		if (mmap_data) {
+			CHECK_FAIL(munmap(mmap_data, mmap_sz));
+			mmap_data = NULL;
+		}
 		if (!IS_ERR_OR_NULL(link)) {
 			bpf_link__destroy(link);
 			link = NULL;
diff --git a/tools/testing/selftests/bpf/prog_tests/mmap.c b/tools/testing/selftests/bpf/prog_tests/mmap.c
new file mode 100644
index 000000000000..051a6d48762c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/mmap.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <sys/mman.h>
+
+struct map_data {
+	__u64 val[512 * 4];
+};
+
+struct bss_data {
+	__u64 in_val;
+	__u64 out_val;
+};
+
+static size_t roundup_page(size_t sz)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	return (sz + page_size - 1) / page_size * page_size;
+}
+
+void test_mmap(void)
+{
+	const char *file = "test_mmap.o";
+	const char *probe_name = "raw_tracepoint/sys_enter";
+	const char *tp_name = "sys_enter";
+	const size_t bss_sz = roundup_page(sizeof(struct bss_data));
+	const size_t map_sz = roundup_page(sizeof(struct map_data));
+	const int zero = 0, one = 1, two = 2, far = 1500;
+	const long page_size = sysconf(_SC_PAGE_SIZE);
+	int err, duration = 0, i, data_map_fd;
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	struct bpf_link *link = NULL;
+	struct bpf_map *data_map, *bss_map;
+	void *bss_mmaped = NULL, *map_mmaped = NULL, *tmp1, *tmp2;
+	volatile struct bss_data *bss_data;
+	volatile struct map_data *map_data;
+	__u64 val = 0;
+
+	obj = bpf_object__open_file("test_mmap.o", NULL);
+	if (CHECK(IS_ERR(obj), "obj_open", "failed to open '%s': %ld\n",
+		  file, PTR_ERR(obj)))
+		return;
+	prog = bpf_object__find_program_by_title(obj, probe_name);
+	if (CHECK(!prog, "find_probe", "prog '%s' not found\n", probe_name))
+		goto cleanup;
+	err = bpf_object__load(obj);
+	if (CHECK(err, "obj_load", "failed to load prog '%s': %d\n",
+		  probe_name, err))
+		goto cleanup;
+
+	bss_map = bpf_object__find_map_by_name(obj, "test_mma.bss");
+	if (CHECK(!bss_map, "find_bss_map", ".bss map not found\n"))
+		goto cleanup;
+	data_map = bpf_object__find_map_by_name(obj, "data_map");
+	if (CHECK(!data_map, "find_data_map", "data_map map not found\n"))
+		goto cleanup;
+	data_map_fd = bpf_map__fd(data_map);
+
+	bss_mmaped = mmap(NULL, bss_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
+			  bpf_map__fd(bss_map), 0);
+	if (CHECK(bss_mmaped == MAP_FAILED, "bss_mmap",
+		  ".bss mmap failed: %d\n", errno)) {
+		bss_mmaped = NULL;
+		goto cleanup;
+	}
+	/* map as R/W first */
+	map_mmaped = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
+			  data_map_fd, 0);
+	if (CHECK(map_mmaped == MAP_FAILED, "data_mmap",
+		  "data_map mmap failed: %d\n", errno)) {
+		map_mmaped = NULL;
+		goto cleanup;
+	}
+
+	bss_data = bss_mmaped;
+	map_data = map_mmaped;
+
+	CHECK_FAIL(bss_data->in_val);
+	CHECK_FAIL(bss_data->out_val);
+	CHECK_FAIL(map_data->val[0]);
+	CHECK_FAIL(map_data->val[1]);
+	CHECK_FAIL(map_data->val[2]);
+	CHECK_FAIL(map_data->val[far]);
+
+	link = bpf_program__attach_raw_tracepoint(prog, tp_name);
+	if (CHECK(IS_ERR(link), "attach_raw_tp", "err %ld\n", PTR_ERR(link)))
+		goto cleanup;
+
+	bss_data->in_val = 123;
+	val = 111;
+	CHECK_FAIL(bpf_map_update_elem(data_map_fd, &zero, &val, 0));
+
+	usleep(1);
+
+	CHECK_FAIL(bss_data->in_val != 123);
+	CHECK_FAIL(bss_data->out_val != 123);
+	CHECK_FAIL(map_data->val[0] != 111);
+	CHECK_FAIL(map_data->val[1] != 222);
+	CHECK_FAIL(map_data->val[2] != 123);
+	CHECK_FAIL(map_data->val[far] != 3 * 123);
+
+	CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &zero, &val));
+	CHECK_FAIL(val != 111);
+	CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &one, &val));
+	CHECK_FAIL(val != 222);
+	CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &two, &val));
+	CHECK_FAIL(val != 123);
+	CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &far, &val));
+	CHECK_FAIL(val != 3 * 123);
+
+	/* data_map freeze should fail due to R/W mmap() */
+	err = bpf_map_freeze(data_map_fd);
+	if (CHECK(!err || errno != EBUSY, "no_freeze",
+		  "data_map freeze succeeded: err=%d, errno=%d\n", err, errno))
+		goto cleanup;
+
+	/* unmap R/W mapping */
+	err = munmap(map_mmaped, map_sz);
+	map_mmaped = NULL;
+	if (CHECK(err, "data_map_munmap", "data_map munmap failed: %d\n", errno))
+		goto cleanup;
+
+	/* re-map as R/O now */
+	map_mmaped = mmap(NULL, map_sz, PROT_READ, MAP_SHARED, data_map_fd, 0);
+	if (CHECK(map_mmaped == MAP_FAILED, "data_mmap",
+		  "data_map R/O mmap failed: %d\n", errno)) {
+		map_mmaped = NULL;
+		goto cleanup;
+	}
+	map_data = map_mmaped;
+
+	/* map/unmap in a loop to test ref counting */
+	for (i = 0; i < 10; i++) {
+		int flags = i % 2 ? PROT_READ : PROT_WRITE;
+		void *p;
+
+		p = mmap(NULL, map_sz, flags, MAP_SHARED, data_map_fd, 0);
+		if (CHECK_FAIL(p == MAP_FAILED))
+			goto cleanup;
+		err = munmap(p, map_sz);
+		if (CHECK_FAIL(err))
+			goto cleanup;
+	}
+
+	/* data_map freeze should now succeed due to no R/W mapping */
+	err = bpf_map_freeze(data_map_fd);
+	if (CHECK(err, "freeze", "data_map freeze failed: err=%d, errno=%d\n",
+		  err, errno))
+		goto cleanup;
+
+	/* mapping as R/W now should fail */
+	tmp1 = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
+		    data_map_fd, 0);
+	if (CHECK(tmp1 != MAP_FAILED, "data_mmap", "mmap succeeded\n")) {
+		munmap(tmp1, map_sz);
+		goto cleanup;
+	}
+
+	bss_data->in_val = 321;
+	usleep(1);
+	CHECK_FAIL(bss_data->in_val != 321);
+	CHECK_FAIL(bss_data->out_val != 321);
+	CHECK_FAIL(map_data->val[0] != 111);
+	CHECK_FAIL(map_data->val[1] != 222);
+	CHECK_FAIL(map_data->val[2] != 321);
+	CHECK_FAIL(map_data->val[far] != 3 * 321);
+
+	/* check some more advanced mmap() manipulations */
+
+	/* map all but last page: pages 1-3 mapped */
+	tmp1 = mmap(NULL, 3 * page_size, PROT_READ, MAP_SHARED,
+			  data_map_fd, 0);
+	if (CHECK(tmp1 == MAP_FAILED, "adv_mmap1", "errno %d\n", errno))
+		goto cleanup;
+
+	/* unmap second page: pages 1, 3 mapped */
+	err = munmap(tmp1 + page_size, page_size);
+	if (CHECK(err, "adv_mmap2", "errno %d\n", errno)) {
+		munmap(tmp1, map_sz);
+		goto cleanup;
+	}
+
+	/* map page 2 back */
+	tmp2 = mmap(tmp1 + page_size, page_size, PROT_READ,
+		    MAP_SHARED | MAP_FIXED, data_map_fd, 0);
+	if (CHECK(tmp2 == MAP_FAILED, "adv_mmap3", "errno %d\n", errno)) {
+		munmap(tmp1, page_size);
+		munmap(tmp1 + 2*page_size, page_size);
+		goto cleanup;
+	}
+	CHECK(tmp1 + page_size != tmp2, "adv_mmap4",
+	      "tmp1: %p, tmp2: %p\n", tmp1, tmp2);
+
+	/* re-map all 4 pages */
+	tmp2 = mmap(tmp1, 4 * page_size, PROT_READ, MAP_SHARED | MAP_FIXED,
+		    data_map_fd, 0);
+	if (CHECK(tmp2 == MAP_FAILED, "adv_mmap5", "errno %d\n", errno)) {
+		munmap(tmp1, 3 * page_size); /* unmap page 1 */
+		goto cleanup;
+	}
+	CHECK(tmp1 != tmp2, "adv_mmap6", "tmp1: %p, tmp2: %p\n", tmp1, tmp2);
+
+	map_data = tmp2;
+	CHECK_FAIL(bss_data->in_val != 321);
+	CHECK_FAIL(bss_data->out_val != 321);
+	CHECK_FAIL(map_data->val[0] != 111);
+	CHECK_FAIL(map_data->val[1] != 222);
+	CHECK_FAIL(map_data->val[2] != 321);
+	CHECK_FAIL(map_data->val[far] != 3 * 321);
+
+	munmap(tmp2, 4 * page_size);
+cleanup:
+	if (bss_mmaped)
+		CHECK_FAIL(munmap(bss_mmaped, bss_sz));
+	if (map_mmaped)
+		CHECK_FAIL(munmap(map_mmaped, map_sz));
+	if (!IS_ERR_OR_NULL(link))
+		bpf_link__destroy(link);
+	bpf_object__close(obj);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_mmap.c b/tools/testing/selftests/bpf/progs/test_mmap.c
new file mode 100644
index 000000000000..0d2ec9fbcf61
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_mmap.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+
+#include <linux/bpf.h>
+#include <stdint.h>
+#include "bpf_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 512 * 4); /* at least 4 pages of data */
+	__uint(map_flags, BPF_F_MMAPABLE);
+	__type(key, __u32);
+	__type(value, __u64);
+} data_map SEC(".maps");
+
+static volatile __u64 in_val;
+static volatile __u64 out_val;
+
+SEC("raw_tracepoint/sys_enter")
+int test_mmap(void *ctx)
+{
+	int zero = 0, one = 1, two = 2, far = 1500;
+	__u64 val, *p;
+
+	out_val = in_val;
+
+	/* data_map[2] = in_val; */
+	bpf_map_update_elem(&data_map, &two, (const void *)&in_val, 0);
+
+	/* data_map[1] = data_map[0] * 2; */
+	p = bpf_map_lookup_elem(&data_map, &zero);
+	if (p) {
+		val = (*p) * 2;
+		bpf_map_update_elem(&data_map, &one, &val, 0);
+	}
+
+	/* data_map[far] = in_val * 3; */
+	val = in_val * 3;
+	bpf_map_update_elem(&data_map, &far, &val, 0);
+
+	return 0;
+}
+
-- 
2.17.1



* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15  4:02 ` [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
@ 2019-11-15  4:45   ` Alexei Starovoitov
  2019-11-15  5:05     ` Andrii Nakryiko
  2019-11-15 13:57   ` Johannes Weiner
  2019-11-15 23:31   ` Daniel Borkmann
  2 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-11-15  4:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team,
	Johannes Weiner, Rik van Riel

On Thu, Nov 14, 2019 at 08:02:23PM -0800, Andrii Nakryiko wrote:
> Add ability to memory-map contents of BPF array map. This is extremely useful
> for working with BPF global data from userspace programs. It allows to avoid
> typical bpf_map_{lookup,update}_elem operations, improving both performance
> and usability.
> 
> There had to be special considerations for map freezing, to avoid having
> writable memory view into a frozen map. To solve this issue, map freezing and
> mmap-ing is happening under mutex now:
>   - if map is already frozen, no writable mapping is allowed;
>   - if map has writable memory mappings active (accounted in map->writecnt),
>     map freezing will keep failing with -EBUSY;
>   - once number of writable memory mappings drops to zero, map freezing can be
>     performed again.
> 
> Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> can't be memory mapped either.
> 
> For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> to be mmap()'able. We also need to make sure that array data memory is
> page-sized and page-aligned, so we over-allocate memory in such a way that
> struct bpf_array is at the end of a single page of memory with array->value
> being aligned with the start of the second page. On deallocation we need to
> accomodate this memory arrangement to free vmalloc()'ed memory correctly.
> 
> One important consideration regarding how memory-mapping subsystem functions.
> Memory-mapping subsystem provides few optional callbacks, among them open()
> and close().  close() is called for each memory region that is unmapped, so
> that users can decrease their reference counters and free up resources, if
> necessary. open() is *almost* symmetrical: it's called for each memory region
> that is being mapped, **except** the very first one. So bpf_map_mmap does
> initial refcnt bump, while open() will do any extra ones after that. Thus
> number of close() calls is equal to number of open() calls plus one more.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@surriel.com>
> Acked-by: Song Liu <songliubraving@fb.com>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>  include/linux/bpf.h            | 11 ++--
>  include/linux/vmalloc.h        |  1 +
>  include/uapi/linux/bpf.h       |  3 ++
>  kernel/bpf/arraymap.c          | 59 +++++++++++++++++---
>  kernel/bpf/syscall.c           | 99 ++++++++++++++++++++++++++++++++--
>  mm/vmalloc.c                   | 20 +++++++
>  tools/include/uapi/linux/bpf.h |  3 ++
>  7 files changed, 184 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 6fbe599fb977..8021fce98868 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -12,6 +12,7 @@
>  #include <linux/err.h>
>  #include <linux/rbtree_latch.h>
>  #include <linux/numa.h>
> +#include <linux/mm_types.h>
>  #include <linux/wait.h>
>  #include <linux/u64_stats_sync.h>
>  
> @@ -66,6 +67,7 @@ struct bpf_map_ops {
>  				     u64 *imm, u32 off);
>  	int (*map_direct_value_meta)(const struct bpf_map *map,
>  				     u64 imm, u32 *off);
> +	int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
>  };
>  
>  struct bpf_map_memory {
> @@ -94,9 +96,10 @@ struct bpf_map {
>  	u32 btf_value_type_id;
>  	struct btf *btf;
>  	struct bpf_map_memory memory;
> +	char name[BPF_OBJ_NAME_LEN];
>  	bool unpriv_array;
> -	bool frozen; /* write-once */
> -	/* 48 bytes hole */
> +	bool frozen; /* write-once; write-protected by freeze_mutex */
> +	/* 22 bytes hole */
>  
>  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
>  	 * particularly with refcounting.
> @@ -104,7 +107,8 @@ struct bpf_map {
>  	atomic64_t refcnt ____cacheline_aligned;
>  	atomic64_t usercnt;
>  	struct work_struct work;
> -	char name[BPF_OBJ_NAME_LEN];
> +	struct mutex freeze_mutex;
> +	u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
>  };

Can the mutex be moved into bpf_array instead of being in bpf_map, which is
shared across all map types?
If so, can you reuse the mutex that Daniel is adding in patch 6/8 of his
tail_call series? Not sure what the right name for such a mutex would be.
It will be used for your map_freeze logic and for Daniel's text_poke.



* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15  4:45   ` Alexei Starovoitov
@ 2019-11-15  5:05     ` Andrii Nakryiko
  2019-11-15  5:08       ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15  5:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Johannes Weiner, Rik van Riel

On Thu, Nov 14, 2019 at 8:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Nov 14, 2019 at 08:02:23PM -0800, Andrii Nakryiko wrote:
> > Add ability to memory-map contents of BPF array map. This is extremely useful
> > for working with BPF global data from userspace programs. It allows to avoid
> > typical bpf_map_{lookup,update}_elem operations, improving both performance
> > and usability.
> >
> > There had to be special considerations for map freezing, to avoid having
> > writable memory view into a frozen map. To solve this issue, map freezing and
> > mmap-ing is happening under mutex now:
> >   - if map is already frozen, no writable mapping is allowed;
> >   - if map has writable memory mappings active (accounted in map->writecnt),
> >     map freezing will keep failing with -EBUSY;
> >   - once number of writable memory mappings drops to zero, map freezing can be
> >     performed again.
> >
> > Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> > can't be memory mapped either.
> >
> > For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> > to be mmap()'able. We also need to make sure that array data memory is
> > page-sized and page-aligned, so we over-allocate memory in such a way that
> > struct bpf_array is at the end of a single page of memory with array->value
> > being aligned with the start of the second page. On deallocation we need to
> > accommodate this memory arrangement to free vmalloc()'ed memory correctly.
> >
> > One important consideration regarding how memory-mapping subsystem functions.
> > Memory-mapping subsystem provides few optional callbacks, among them open()
> > and close().  close() is called for each memory region that is unmapped, so
> > that users can decrease their reference counters and free up resources, if
> > necessary. open() is *almost* symmetrical: it's called for each memory region
> > that is being mapped, **except** the very first one. So bpf_map_mmap does
> > initial refcnt bump, while open() will do any extra ones after that. Thus
> > number of close() calls is equal to number of open() calls plus one more.
> >
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Acked-by: Song Liu <songliubraving@fb.com>
> > Acked-by: John Fastabend <john.fastabend@gmail.com>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >  include/linux/bpf.h            | 11 ++--
> >  include/linux/vmalloc.h        |  1 +
> >  include/uapi/linux/bpf.h       |  3 ++
> >  kernel/bpf/arraymap.c          | 59 +++++++++++++++++---
> >  kernel/bpf/syscall.c           | 99 ++++++++++++++++++++++++++++++++--
> >  mm/vmalloc.c                   | 20 +++++++
> >  tools/include/uapi/linux/bpf.h |  3 ++
> >  7 files changed, 184 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 6fbe599fb977..8021fce98868 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -12,6 +12,7 @@
> >  #include <linux/err.h>
> >  #include <linux/rbtree_latch.h>
> >  #include <linux/numa.h>
> > +#include <linux/mm_types.h>
> >  #include <linux/wait.h>
> >  #include <linux/u64_stats_sync.h>
> >
> > @@ -66,6 +67,7 @@ struct bpf_map_ops {
> >                                    u64 *imm, u32 off);
> >       int (*map_direct_value_meta)(const struct bpf_map *map,
> >                                    u64 imm, u32 *off);
> > +     int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
> >  };
> >
> >  struct bpf_map_memory {
> > @@ -94,9 +96,10 @@ struct bpf_map {
> >       u32 btf_value_type_id;
> >       struct btf *btf;
> >       struct bpf_map_memory memory;
> > +     char name[BPF_OBJ_NAME_LEN];
> >       bool unpriv_array;
> > -     bool frozen; /* write-once */
> > -     /* 48 bytes hole */
> > +     bool frozen; /* write-once; write-protected by freeze_mutex */
> > +     /* 22 bytes hole */
> >
> >       /* The 3rd and 4th cacheline with misc members to avoid false sharing
> >        * particularly with refcounting.
> > @@ -104,7 +107,8 @@ struct bpf_map {
> >       atomic64_t refcnt ____cacheline_aligned;
> >       atomic64_t usercnt;
> >       struct work_struct work;
> > -     char name[BPF_OBJ_NAME_LEN];
> > +     struct mutex freeze_mutex;
> > +     u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
> >  };
>
> Can the mutex be moved into bpf_array instead of being in bpf_map that is
> shared across all map types?

No, freezing logic is common to all maps. Same for writecnt and
mmap()-ing overall.

> If so then can you reuse the mutex that Daniel is adding in his patch 6/8
> of the tail_call series? Not sure what the right name for such a mutex would be.
> It will be used for your map_freeze logic and for Daniel's text_poke.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15  5:05     ` Andrii Nakryiko
@ 2019-11-15  5:08       ` Alexei Starovoitov
  2019-11-15  5:43         ` Andrii Nakryiko
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-11-15  5:08 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Johannes Weiner, Rik van Riel

On Thu, Nov 14, 2019 at 09:05:26PM -0800, Andrii Nakryiko wrote:
> On Thu, Nov 14, 2019 at 8:45 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Nov 14, 2019 at 08:02:23PM -0800, Andrii Nakryiko wrote:
> > > Add ability to memory-map contents of BPF array map. This is extremely useful
> > > for working with BPF global data from userspace programs. It allows to avoid
> > > typical bpf_map_{lookup,update}_elem operations, improving both performance
> > > and usability.
> > >
> > > There had to be special considerations for map freezing, to avoid having
> > > writable memory view into a frozen map. To solve this issue, map freezing and
> > > mmap-ing is happening under mutex now:
> > >   - if map is already frozen, no writable mapping is allowed;
> > >   - if map has writable memory mappings active (accounted in map->writecnt),
> > >     map freezing will keep failing with -EBUSY;
> > >   - once number of writable memory mappings drops to zero, map freezing can be
> > >     performed again.
> > >
> > > Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> > > can't be memory mapped either.
> > >
> > > For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> > > to be mmap()'able. We also need to make sure that array data memory is
> > > page-sized and page-aligned, so we over-allocate memory in such a way that
> > > struct bpf_array is at the end of a single page of memory with array->value
> > > being aligned with the start of the second page. On deallocation we need to
> > > accommodate this memory arrangement to free vmalloc()'ed memory correctly.
> > >
> > > One important consideration regarding how memory-mapping subsystem functions.
> > > Memory-mapping subsystem provides few optional callbacks, among them open()
> > > and close().  close() is called for each memory region that is unmapped, so
> > > that users can decrease their reference counters and free up resources, if
> > > necessary. open() is *almost* symmetrical: it's called for each memory region
> > > that is being mapped, **except** the very first one. So bpf_map_mmap does
> > > initial refcnt bump, while open() will do any extra ones after that. Thus
> > > number of close() calls is equal to number of open() calls plus one more.
> > >
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Rik van Riel <riel@surriel.com>
> > > Acked-by: Song Liu <songliubraving@fb.com>
> > > Acked-by: John Fastabend <john.fastabend@gmail.com>
> > > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > > ---
> > >  include/linux/bpf.h            | 11 ++--
> > >  include/linux/vmalloc.h        |  1 +
> > >  include/uapi/linux/bpf.h       |  3 ++
> > >  kernel/bpf/arraymap.c          | 59 +++++++++++++++++---
> > >  kernel/bpf/syscall.c           | 99 ++++++++++++++++++++++++++++++++--
> > >  mm/vmalloc.c                   | 20 +++++++
> > >  tools/include/uapi/linux/bpf.h |  3 ++
> > >  7 files changed, 184 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 6fbe599fb977..8021fce98868 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -12,6 +12,7 @@
> > >  #include <linux/err.h>
> > >  #include <linux/rbtree_latch.h>
> > >  #include <linux/numa.h>
> > > +#include <linux/mm_types.h>
> > >  #include <linux/wait.h>
> > >  #include <linux/u64_stats_sync.h>
> > >
> > > @@ -66,6 +67,7 @@ struct bpf_map_ops {
> > >                                    u64 *imm, u32 off);
> > >       int (*map_direct_value_meta)(const struct bpf_map *map,
> > >                                    u64 imm, u32 *off);
> > > +     int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
> > >  };
> > >
> > >  struct bpf_map_memory {
> > > @@ -94,9 +96,10 @@ struct bpf_map {
> > >       u32 btf_value_type_id;
> > >       struct btf *btf;
> > >       struct bpf_map_memory memory;
> > > +     char name[BPF_OBJ_NAME_LEN];
> > >       bool unpriv_array;
> > > -     bool frozen; /* write-once */
> > > -     /* 48 bytes hole */
> > > +     bool frozen; /* write-once; write-protected by freeze_mutex */
> > > +     /* 22 bytes hole */
> > >
> > >       /* The 3rd and 4th cacheline with misc members to avoid false sharing
> > >        * particularly with refcounting.
> > > @@ -104,7 +107,8 @@ struct bpf_map {
> > >       atomic64_t refcnt ____cacheline_aligned;
> > >       atomic64_t usercnt;
> > >       struct work_struct work;
> > > -     char name[BPF_OBJ_NAME_LEN];
> > > +     struct mutex freeze_mutex;
> > > +     u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
> > >  };
> >
> > Can the mutex be moved into bpf_array instead of being in bpf_map that is
> > shared across all map types?
> 
> No, freezing logic is common to all maps. Same for writecnt and
> mmap()-ing overall.

How is mmap going to work for a hash map? And for prog_array?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15  5:08       ` Alexei Starovoitov
@ 2019-11-15  5:43         ` Andrii Nakryiko
  2019-11-15 16:36           ` Andrii Nakryiko
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15  5:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Johannes Weiner, Rik van Riel

On Thu, Nov 14, 2019 at 9:08 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Nov 14, 2019 at 09:05:26PM -0800, Andrii Nakryiko wrote:
> > On Thu, Nov 14, 2019 at 8:45 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Nov 14, 2019 at 08:02:23PM -0800, Andrii Nakryiko wrote:
> > > > Add ability to memory-map contents of BPF array map. This is extremely useful
> > > > for working with BPF global data from userspace programs. It allows to avoid
> > > > typical bpf_map_{lookup,update}_elem operations, improving both performance
> > > > and usability.
> > > >
> > > > There had to be special considerations for map freezing, to avoid having
> > > > writable memory view into a frozen map. To solve this issue, map freezing and
> > > > mmap-ing is happening under mutex now:
> > > >   - if map is already frozen, no writable mapping is allowed;
> > > >   - if map has writable memory mappings active (accounted in map->writecnt),
> > > >     map freezing will keep failing with -EBUSY;
> > > >   - once number of writable memory mappings drops to zero, map freezing can be
> > > >     performed again.
> > > >
> > > > Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> > > > can't be memory mapped either.
> > > >
> > > > For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> > > > to be mmap()'able. We also need to make sure that array data memory is
> > > > page-sized and page-aligned, so we over-allocate memory in such a way that
> > > > struct bpf_array is at the end of a single page of memory with array->value
> > > > being aligned with the start of the second page. On deallocation we need to
> > > > accommodate this memory arrangement to free vmalloc()'ed memory correctly.
> > > >
> > > > One important consideration regarding how memory-mapping subsystem functions.
> > > > Memory-mapping subsystem provides few optional callbacks, among them open()
> > > > and close().  close() is called for each memory region that is unmapped, so
> > > > that users can decrease their reference counters and free up resources, if
> > > > necessary. open() is *almost* symmetrical: it's called for each memory region
> > > > that is being mapped, **except** the very first one. So bpf_map_mmap does
> > > > initial refcnt bump, while open() will do any extra ones after that. Thus
> > > > number of close() calls is equal to number of open() calls plus one more.
> > > >
> > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > Cc: Rik van Riel <riel@surriel.com>
> > > > Acked-by: Song Liu <songliubraving@fb.com>
> > > > Acked-by: John Fastabend <john.fastabend@gmail.com>
> > > > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > > > ---
> > > >  include/linux/bpf.h            | 11 ++--
> > > >  include/linux/vmalloc.h        |  1 +
> > > >  include/uapi/linux/bpf.h       |  3 ++
> > > >  kernel/bpf/arraymap.c          | 59 +++++++++++++++++---
> > > >  kernel/bpf/syscall.c           | 99 ++++++++++++++++++++++++++++++++--
> > > >  mm/vmalloc.c                   | 20 +++++++
> > > >  tools/include/uapi/linux/bpf.h |  3 ++
> > > >  7 files changed, 184 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 6fbe599fb977..8021fce98868 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -12,6 +12,7 @@
> > > >  #include <linux/err.h>
> > > >  #include <linux/rbtree_latch.h>
> > > >  #include <linux/numa.h>
> > > > +#include <linux/mm_types.h>
> > > >  #include <linux/wait.h>
> > > >  #include <linux/u64_stats_sync.h>
> > > >
> > > > @@ -66,6 +67,7 @@ struct bpf_map_ops {
> > > >                                    u64 *imm, u32 off);
> > > >       int (*map_direct_value_meta)(const struct bpf_map *map,
> > > >                                    u64 imm, u32 *off);
> > > > +     int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
> > > >  };
> > > >
> > > >  struct bpf_map_memory {
> > > > @@ -94,9 +96,10 @@ struct bpf_map {
> > > >       u32 btf_value_type_id;
> > > >       struct btf *btf;
> > > >       struct bpf_map_memory memory;
> > > > +     char name[BPF_OBJ_NAME_LEN];
> > > >       bool unpriv_array;
> > > > -     bool frozen; /* write-once */
> > > > -     /* 48 bytes hole */
> > > > +     bool frozen; /* write-once; write-protected by freeze_mutex */
> > > > +     /* 22 bytes hole */
> > > >
> > > >       /* The 3rd and 4th cacheline with misc members to avoid false sharing
> > > >        * particularly with refcounting.
> > > > @@ -104,7 +107,8 @@ struct bpf_map {
> > > >       atomic64_t refcnt ____cacheline_aligned;
> > > >       atomic64_t usercnt;
> > > >       struct work_struct work;
> > > > -     char name[BPF_OBJ_NAME_LEN];
> > > > +     struct mutex freeze_mutex;
> > > > +     u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
> > > >  };
> > >
> > > Can the mutex be moved into bpf_array instead of being in bpf_map that is
> > > shared across all map types?
> >
> > No, freezing logic is common to all maps. Same for writecnt and
> > mmap()-ing overall.
>
> How is mmap going to work for a hash map? And for prog_array?
>

It probably won't. But one day we might have hash map using open
addressing, which will be more prone to memory-mapping. Or, say, some
sort of non-per-CPU ring buffer would be a good candidate as well. It
doesn't seem like a good idea to restrict mmap()-ability to just array
for no good reason.

But speaking about freeze_mutex specifically. It's to coordinate
writable memory-mapping and frozen flag. Even if we make
memory-mapping bpf_array specific, map->frozen is generic flag and
handled in syscall.c generically, so I just can't protect it only from
bpf_array side.
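
To make that concrete, here is a condensed sketch of the generic coordination
in syscall.c as described above (simplified: refcounting, vm_ops setup and fd
handling are dropped, and the map_freeze side is collapsed into a hypothetical
helper; see the actual patch for the full version):

/* condensed sketch of the freeze_mutex coordination, not the full patch */

static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct bpf_map *map = filp->private_data;
	int err;

	mutex_lock(&map->freeze_mutex);
	if ((vma->vm_flags & VM_WRITE) && map->frozen) {
		err = -EPERM;		/* no writable view of a frozen map */
	} else {
		err = map->ops->map_mmap(map, vma);
		if (!err && (vma->vm_flags & VM_WRITE))
			map->writecnt++;	/* blocks map_freeze below */
	}
	mutex_unlock(&map->freeze_mutex);
	return err;
}

static int map_freeze_locked(struct bpf_map *map)
{
	int err = 0;

	mutex_lock(&map->freeze_mutex);
	if (map->writecnt || map->frozen)
		err = -EBUSY;		/* active writable mappings (or a repeat freeze) */
	else
		WRITE_ONCE(map->frozen, true);
	mutex_unlock(&map->freeze_mutex);
	return err;
}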

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15  4:02 ` [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
  2019-11-15  4:45   ` Alexei Starovoitov
@ 2019-11-15 13:57   ` Johannes Weiner
  2019-11-15 23:31   ` Daniel Borkmann
  2 siblings, 0 replies; 25+ messages in thread
From: Johannes Weiner @ 2019-11-15 13:57 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team, Rik van Riel

On Thu, Nov 14, 2019 at 08:02:23PM -0800, Andrii Nakryiko wrote:
> Add ability to memory-map contents of BPF array map. This is extremely useful
> for working with BPF global data from userspace programs. It allows to avoid
> typical bpf_map_{lookup,update}_elem operations, improving both performance
> and usability.
> 
> There had to be special considerations for map freezing, to avoid having
> writable memory view into a frozen map. To solve this issue, map freezing and
> mmap-ing is happening under mutex now:
>   - if map is already frozen, no writable mapping is allowed;
>   - if map has writable memory mappings active (accounted in map->writecnt),
>     map freezing will keep failing with -EBUSY;
>   - once number of writable memory mappings drops to zero, map freezing can be
>     performed again.
> 
> Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> can't be memory mapped either.
> 
> For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> to be mmap()'able. We also need to make sure that array data memory is
> page-sized and page-aligned, so we over-allocate memory in such a way that
> struct bpf_array is at the end of a single page of memory with array->value
> being aligned with the start of the second page. On deallocation we need to
> accommodate this memory arrangement to free vmalloc()'ed memory correctly.
> 
> One important consideration regarding how memory-mapping subsystem functions.
> Memory-mapping subsystem provides few optional callbacks, among them open()
> and close().  close() is called for each memory region that is unmapped, so
> that users can decrease their reference counters and free up resources, if
> necessary. open() is *almost* symmetrical: it's called for each memory region
> that is being mapped, **except** the very first one. So bpf_map_mmap does
> initial refcnt bump, while open() will do any extra ones after that. Thus
> number of close() calls is equal to number of open() calls plus one more.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@surriel.com>
> Acked-by: Song Liu <songliubraving@fb.com>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15  5:43         ` Andrii Nakryiko
@ 2019-11-15 16:36           ` Andrii Nakryiko
  2019-11-15 20:42             ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15 16:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Johannes Weiner, Rik van Riel

On Thu, Nov 14, 2019 at 9:43 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Nov 14, 2019 at 9:08 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Nov 14, 2019 at 09:05:26PM -0800, Andrii Nakryiko wrote:
> > > On Thu, Nov 14, 2019 at 8:45 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Nov 14, 2019 at 08:02:23PM -0800, Andrii Nakryiko wrote:
> > > > > Add ability to memory-map contents of BPF array map. This is extremely useful
> > > > > for working with BPF global data from userspace programs. It allows to avoid
> > > > > typical bpf_map_{lookup,update}_elem operations, improving both performance
> > > > > and usability.
> > > > >
> > > > > There had to be special considerations for map freezing, to avoid having
> > > > > writable memory view into a frozen map. To solve this issue, map freezing and
> > > > > mmap-ing is happening under mutex now:
> > > > >   - if map is already frozen, no writable mapping is allowed;
> > > > >   - if map has writable memory mappings active (accounted in map->writecnt),
> > > > >     map freezing will keep failing with -EBUSY;
> > > > >   - once number of writable memory mappings drops to zero, map freezing can be
> > > > >     performed again.
> > > > >
> > > > > Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> > > > > can't be memory mapped either.
> > > > >
> > > > > For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> > > > > to be mmap()'able. We also need to make sure that array data memory is
> > > > > page-sized and page-aligned, so we over-allocate memory in such a way that
> > > > > struct bpf_array is at the end of a single page of memory with array->value
> > > > > being aligned with the start of the second page. On deallocation we need to
> > > > > accommodate this memory arrangement to free vmalloc()'ed memory correctly.
> > > > >
> > > > > One important consideration regarding how memory-mapping subsystem functions.
> > > > > Memory-mapping subsystem provides few optional callbacks, among them open()
> > > > > and close().  close() is called for each memory region that is unmapped, so
> > > > > that users can decrease their reference counters and free up resources, if
> > > > > necessary. open() is *almost* symmetrical: it's called for each memory region
> > > > > that is being mapped, **except** the very first one. So bpf_map_mmap does
> > > > > initial refcnt bump, while open() will do any extra ones after that. Thus
> > > > > number of close() calls is equal to number of open() calls plus one more.
> > > > >
> > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > > Cc: Rik van Riel <riel@surriel.com>
> > > > > Acked-by: Song Liu <songliubraving@fb.com>
> > > > > Acked-by: John Fastabend <john.fastabend@gmail.com>
> > > > > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > > > > ---
> > > > >  include/linux/bpf.h            | 11 ++--
> > > > >  include/linux/vmalloc.h        |  1 +
> > > > >  include/uapi/linux/bpf.h       |  3 ++
> > > > >  kernel/bpf/arraymap.c          | 59 +++++++++++++++++---
> > > > >  kernel/bpf/syscall.c           | 99 ++++++++++++++++++++++++++++++++--
> > > > >  mm/vmalloc.c                   | 20 +++++++
> > > > >  tools/include/uapi/linux/bpf.h |  3 ++
> > > > >  7 files changed, 184 insertions(+), 12 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > index 6fbe599fb977..8021fce98868 100644
> > > > > --- a/include/linux/bpf.h
> > > > > +++ b/include/linux/bpf.h
> > > > > @@ -12,6 +12,7 @@
> > > > >  #include <linux/err.h>
> > > > >  #include <linux/rbtree_latch.h>
> > > > >  #include <linux/numa.h>
> > > > > +#include <linux/mm_types.h>
> > > > >  #include <linux/wait.h>
> > > > >  #include <linux/u64_stats_sync.h>
> > > > >
> > > > > @@ -66,6 +67,7 @@ struct bpf_map_ops {
> > > > >                                    u64 *imm, u32 off);
> > > > >       int (*map_direct_value_meta)(const struct bpf_map *map,
> > > > >                                    u64 imm, u32 *off);
> > > > > +     int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
> > > > >  };
> > > > >
> > > > >  struct bpf_map_memory {
> > > > > @@ -94,9 +96,10 @@ struct bpf_map {
> > > > >       u32 btf_value_type_id;
> > > > >       struct btf *btf;
> > > > >       struct bpf_map_memory memory;
> > > > > +     char name[BPF_OBJ_NAME_LEN];
> > > > >       bool unpriv_array;
> > > > > -     bool frozen; /* write-once */
> > > > > -     /* 48 bytes hole */
> > > > > +     bool frozen; /* write-once; write-protected by freeze_mutex */
> > > > > +     /* 22 bytes hole */
> > > > >
> > > > >       /* The 3rd and 4th cacheline with misc members to avoid false sharing
> > > > >        * particularly with refcounting.
> > > > > @@ -104,7 +107,8 @@ struct bpf_map {
> > > > >       atomic64_t refcnt ____cacheline_aligned;
> > > > >       atomic64_t usercnt;
> > > > >       struct work_struct work;
> > > > > -     char name[BPF_OBJ_NAME_LEN];
> > > > > +     struct mutex freeze_mutex;
> > > > > +     u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
> > > > >  };
> > > >
> > > > Can the mutex be moved into bpf_array instead of being in bpf_map that is
> > > > shared across all map types?
> > >
> > > No, freezing logic is common to all maps. Same for writecnt and
> > > mmap()-ing overall.
> >
> > How is mmap going to work for a hash map? And for prog_array?
> >
>
> It probably won't. But one day we might have hash map using open
> addressing, which will be more prone to memory-mapping. Or, say, some
> sort of non-per-CPU ring buffer would be a good candidate as well. It
> doesn't seem like a good idea to restrict mmap()-ability to just array
> for no good reason.
>
> But speaking about freeze_mutex specifically. It's to coordinate
> writable memory-mapping and frozen flag. Even if we make
> memory-mapping bpf_array specific, map->frozen is generic flag and
> handled in syscall.c generically, so I just can't protect it only from
> bpf_array side.

Alternatively we can use spinlock. I don't think it's too ugly, tbh. See below.

From 0da495b911adad495857f1c0fc3596f1d06a705f Mon Sep 17 00:00:00 2001
From: Andrii Nakryiko <andriin@fb.com>
Date: Fri, 15 Nov 2019 08:32:43 -0800
Subject: [PATCH bpf-next] bpf: switch freeze locking to use spin_lock and save
 space

Switch to spin_lock in favor of mutex. Due to mmap-ing itself happening not
under spinlock, there needs to be an extra "correction" step for writecnt, if
mapping fails.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf.h  |  6 ++---
 kernel/bpf/syscall.c | 52 ++++++++++++++++++++++++--------------------
 2 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8021fce98868..ceaefd69c6bb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -98,7 +98,7 @@ struct bpf_map {
     struct bpf_map_memory memory;
     char name[BPF_OBJ_NAME_LEN];
     bool unpriv_array;
-    bool frozen; /* write-once; write-protected by freeze_mutex */
+    bool frozen; /* write-once; write-protected by freeze_lock */
     /* 22 bytes hole */

     /* The 3rd and 4th cacheline with misc members to avoid false sharing
@@ -107,8 +107,8 @@ struct bpf_map {
     atomic64_t refcnt ____cacheline_aligned;
     atomic64_t usercnt;
     struct work_struct work;
-    struct mutex freeze_mutex;
-    u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
+    u64 writecnt; /* writable mmap cnt; protected by freeze_lock */
+    spinlock_t freeze_lock;
 };

 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 810e53990641..fc782ab06d57 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -448,9 +448,9 @@ static void bpf_map_mmap_open(struct vm_area_struct *vma)
     bpf_map_inc(map);

     if (vma->vm_flags & VM_WRITE) {
-        mutex_lock(&map->freeze_mutex);
+        spin_lock(&map->freeze_lock);
         map->writecnt++;
-        mutex_unlock(&map->freeze_mutex);
+        spin_unlock(&map->freeze_lock);
     }
 }

@@ -460,9 +460,9 @@ static void bpf_map_mmap_close(struct vm_area_struct *vma)
     struct bpf_map *map = vma->vm_file->private_data;

     if (vma->vm_flags & VM_WRITE) {
-        mutex_lock(&map->freeze_mutex);
+        spin_lock(&map->freeze_lock);
         map->writecnt--;
-        mutex_unlock(&map->freeze_mutex);
+        spin_unlock(&map->freeze_lock);
     }

     bpf_map_put(map);
@@ -484,28 +484,34 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
     if (!(vma->vm_flags & VM_SHARED))
         return -EINVAL;

-    mutex_lock(&map->freeze_mutex);
-
+    spin_lock(&map->freeze_lock);
     if ((vma->vm_flags & VM_WRITE) && map->frozen) {
-        err = -EPERM;
-        goto out;
+        spin_unlock(&map->freeze_lock);
+        return -EPERM;
     }
-
+    /* speculatively increase writecnt assuming mmap will succeed */
+    if (vma->vm_flags & VM_WRITE)
+        map->writecnt++;
+    spin_unlock(&map->freeze_lock);
+
     /* set default open/close callbacks */
     vma->vm_ops = &bpf_map_default_vmops;
     vma->vm_private_data = map;

     err = map->ops->map_mmap(map, vma);
-    if (err)
-        goto out;
+    if (err) {
+        /* mmap failed, undo speculative writecnt increment */
+        if (vma->vm_flags & VM_WRITE) {
+            spin_lock(&map->freeze_lock);
+            map->writecnt--;
+            spin_unlock(&map->freeze_lock);
+        }
+        return err;
+    }

     bpf_map_inc(map);

-    if (vma->vm_flags & VM_WRITE)
-        map->writecnt++;
-out:
-    mutex_unlock(&map->freeze_mutex);
-    return err;
+    return 0;
 }

 const struct file_operations bpf_map_fops = {
@@ -661,7 +667,7 @@ static int map_create(union bpf_attr *attr)

     atomic64_set(&map->refcnt, 1);
     atomic64_set(&map->usercnt, 1);
-    mutex_init(&map->freeze_mutex);
+    spin_lock_init(&map->freeze_lock);

     if (attr->btf_key_type_id || attr->btf_value_type_id) {
         struct btf *btf;
@@ -1244,12 +1250,15 @@ static int map_freeze(const union bpf_attr *attr)
     if (CHECK_ATTR(BPF_MAP_FREEZE))
         return -EINVAL;

+    if (!capable(CAP_SYS_ADMIN))
+        return -EPERM;
+
     f = fdget(ufd);
     map = __bpf_map_get(f);
     if (IS_ERR(map))
         return PTR_ERR(map);

-    mutex_lock(&map->freeze_mutex);
+    spin_lock(&map->freeze_lock);

     if (map->writecnt) {
         err = -EBUSY;
@@ -1259,14 +1268,9 @@ static int map_freeze(const union bpf_attr *attr)
         err = -EBUSY;
         goto err_put;
     }
-    if (!capable(CAP_SYS_ADMIN)) {
-        err = -EPERM;
-        goto err_put;
-    }
-
     WRITE_ONCE(map->frozen, true);
 err_put:
-    mutex_unlock(&map->freeze_mutex);
+    spin_unlock(&map->freeze_lock);
     fdput(f);
     return err;
 }
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15 16:36           ` Andrii Nakryiko
@ 2019-11-15 20:42             ` Jakub Kicinski
  0 siblings, 0 replies; 25+ messages in thread
From: Jakub Kicinski @ 2019-11-15 20:42 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Johannes Weiner, Rik van Riel

On Fri, 15 Nov 2019 08:36:56 -0800, Andrii Nakryiko wrote:
> Alternatively we can use spinlock. I don't think it's too ugly, tbh. See below.
> 
> From 0da495b911adad495857f1c0fc3596f1d06a705f Mon Sep 17 00:00:00 2001
> From: Andrii Nakryiko <andriin@fb.com>
> Date: Fri, 15 Nov 2019 08:32:43 -0800
> Subject: [PATCH bpf-next] bpf: switch freeze locking to use spin_lock and save
>  space
> 
> Switch to spin_lock in favor of mutex. Due to mmap-ing itself happening not
> under spinlock, there needs to be an extra "correction" step for writecnt, if
> mapping fails.

FWIW I was pondering that too, and thought your initial design was
nicer; the transient errors sometimes become a major PITA.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails
  2019-11-15  4:02 ` [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails Andrii Nakryiko
@ 2019-11-15 21:47   ` Song Liu
  2019-11-15 23:23   ` Daniel Borkmann
  1 sibling, 0 replies; 25+ messages in thread
From: Song Liu @ 2019-11-15 21:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, Alexei Starovoitov, daniel, andrii.nakryiko, Kernel Team



> On Nov 14, 2019, at 8:02 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> 

[...] 

> 
> This patch, while modifying all users of bpf_map_inc, also cleans up its
> interface to match bpf_map_put with separate operations for bpf_map_inc and
> bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
> respectively). Also, given there are no users of bpf_map_inc_not_zero
> specifying uref=true, remove uref flag and default to uref=false internally.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails
  2019-11-15  4:02 ` [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails Andrii Nakryiko
  2019-11-15 21:47   ` Song Liu
@ 2019-11-15 23:23   ` Daniel Borkmann
  2019-11-15 23:27     ` Andrii Nakryiko
  1 sibling, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2019-11-15 23:23 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast; +Cc: andrii.nakryiko, kernel-team

On 11/15/19 5:02 AM, Andrii Nakryiko wrote:
> 92117d8443bc ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
> potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
> (32k). Due to using 32-bit counter, it's possible in practice to overflow
> refcounter and make it wrap around to 0, causing erroneous map free, while
> there are still references to it, causing use-after-free problems.

You mention 'it's possible in practice' to overflow in relation to bpf map
refcount, do we need any fixing for bpf tree here?

> But having failing refcounting operations is problematic in some cases. One
> example is mmap() interface. After establishing initial memory-mapping, user
> is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
> splitting it into multiple non-contiguous regions. All this happening without
> any control from the users of mmap subsystem. Rather mmap subsystem sends
> notifications to original creator of memory mapping through open/close
> callbacks, which are optionally specified during initial memory mapping
> creation. These callbacks are used to maintain accurate refcount for bpf_map
> (see next patch in this series). The problem is that open() callback is not
> supposed to fail, because memory-mapped resource is set up and properly
> referenced. This is posing a problem for using memory-mapping with BPF maps.
> 
> One solution to this is to maintain separate refcount for just memory-mappings
> and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
> There are similar use cases in current work on tcp-bpf, necessitating extra
> counter as well. This seems like a rather unfortunate and ugly solution that
> doesn't scale well to various new use cases.
> 
> Another approach to solve this is to use non-failing refcount_t type, which
> uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
> stays there. This ultimately causes a memory leak, but prevents use after free.
> 
> But given refcounting is not the most performance-critical operation with BPF
> maps (it's not used from running BPF program code), we can also just switch to
> 64-bit counter that can't overflow in practice, potentially disadvantaging
> 32-bit platforms a tiny bit. This simplifies semantics and allows above
> described scenarios to not worry about failing refcount increment operation.
> 
> In terms of struct bpf_map size, we are still good and use the same amount of
> space:
> 
> BEFORE (3 cache lines, 8 bytes of padding at the end):
> struct bpf_map {
> 	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
> 	struct bpf_map *           inner_map_meta;       /*     8     8 */
> 	void *                     security;             /*    16     8 */
> 	enum bpf_map_type  map_type;                     /*    24     4 */
> 	u32                        key_size;             /*    28     4 */
> 	u32                        value_size;           /*    32     4 */
> 	u32                        max_entries;          /*    36     4 */
> 	u32                        map_flags;            /*    40     4 */
> 	int                        spin_lock_off;        /*    44     4 */
> 	u32                        id;                   /*    48     4 */
> 	int                        numa_node;            /*    52     4 */
> 	u32                        btf_key_type_id;      /*    56     4 */
> 	u32                        btf_value_type_id;    /*    60     4 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct btf *               btf;                  /*    64     8 */
> 	struct bpf_map_memory memory;                    /*    72    16 */
> 	bool                       unpriv_array;         /*    88     1 */
> 	bool                       frozen;               /*    89     1 */
> 
> 	/* XXX 38 bytes hole, try to pack */
> 
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
> 	atomic_t                   usercnt;              /*   132     4 */
> 	struct work_struct work;                         /*   136    32 */
> 	char                       name[16];             /*   168    16 */
> 
> 	/* size: 192, cachelines: 3, members: 21 */
> 	/* sum members: 146, holes: 1, sum holes: 38 */
> 	/* padding: 8 */
> 	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
> } __attribute__((__aligned__(64)));
> 
> AFTER (same 3 cache lines, no extra padding now):
> struct bpf_map {
> 	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
> 	struct bpf_map *           inner_map_meta;       /*     8     8 */
> 	void *                     security;             /*    16     8 */
> 	enum bpf_map_type  map_type;                     /*    24     4 */
> 	u32                        key_size;             /*    28     4 */
> 	u32                        value_size;           /*    32     4 */
> 	u32                        max_entries;          /*    36     4 */
> 	u32                        map_flags;            /*    40     4 */
> 	int                        spin_lock_off;        /*    44     4 */
> 	u32                        id;                   /*    48     4 */
> 	int                        numa_node;            /*    52     4 */
> 	u32                        btf_key_type_id;      /*    56     4 */
> 	u32                        btf_value_type_id;    /*    60     4 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct btf *               btf;                  /*    64     8 */
> 	struct bpf_map_memory memory;                    /*    72    16 */
> 	bool                       unpriv_array;         /*    88     1 */
> 	bool                       frozen;               /*    89     1 */
> 
> 	/* XXX 38 bytes hole, try to pack */
> 
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
> 	atomic64_t                 usercnt;              /*   136     8 */
> 	struct work_struct work;                         /*   144    32 */
> 	char                       name[16];             /*   176    16 */
> 
> 	/* size: 192, cachelines: 3, members: 21 */
> 	/* sum members: 154, holes: 1, sum holes: 38 */
> 	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
> } __attribute__((__aligned__(64)));
> 
> This patch, while modifying all users of bpf_map_inc, also cleans up its
> interface to match bpf_map_put with separate operations for bpf_map_inc and
> bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
> respectively). Also, given there are no users of bpf_map_inc_not_zero
> specifying uref=true, remove uref flag and default to uref=false internally.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>   .../net/ethernet/netronome/nfp/bpf/offload.c  |  4 +-
>   include/linux/bpf.h                           | 10 ++--
>   kernel/bpf/inode.c                            |  2 +-
>   kernel/bpf/map_in_map.c                       |  2 +-
>   kernel/bpf/syscall.c                          | 51 ++++++++-----------
>   kernel/bpf/verifier.c                         |  6 +--
>   kernel/bpf/xskmap.c                           |  6 +--
>   net/core/bpf_sk_storage.c                     |  2 +-
>   8 files changed, 34 insertions(+), 49 deletions(-)
> 
[...]
> +	refold = atomic64_fetch_add_unless(&map->refcnt, 1, 0);
>   	if (!refold)
>   		return ERR_PTR(-ENOENT);
> -
>   	if (uref)
> -		atomic_inc(&map->usercnt);
> +		atomic64_inc(&map->usercnt);
>   
>   	return map;
>   }
>   
> -struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
> +struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map)
>   {
>   	spin_lock_bh(&map_idr_lock);
> -	map = __bpf_map_inc_not_zero(map, uref);
> +	map = __bpf_map_inc_not_zero(map, false);
>   	spin_unlock_bh(&map_idr_lock);
>   
>   	return map;
> @@ -1454,6 +1444,9 @@ static struct bpf_prog *____bpf_prog_get(struct fd f)
>   	return f.file->private_data;
>   }
>   
> +/* prog's refcnt limit */
> +#define BPF_MAX_REFCNT 32768
> +

Would it make sense in this context to also convert prog refcount over to atomic64
so we have both in one go in order to align them again? This could likely simplify
even more wrt error paths.

>   struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
>   {
>   	if (atomic_add_return(i, &prog->aux->refcnt) > BPF_MAX_REFCNT) {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails
  2019-11-15 23:23   ` Daniel Borkmann
@ 2019-11-15 23:27     ` Andrii Nakryiko
  0 siblings, 0 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-15 23:27 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov, Kernel Team

On Fri, Nov 15, 2019 at 3:24 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/15/19 5:02 AM, Andrii Nakryiko wrote:
> > 92117d8443bc ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
> > potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
> > (32k). Due to using 32-bit counter, it's possible in practice to overflow
> > refcounter and make it wrap around to 0, causing erroneous map free, while
> > there are still references to it, causing use-after-free problems.
>
> You mention 'it's possible in practice' to overflow in relation to bpf map
> refcount, do we need any fixing for bpf tree here?

I meant without the existing 32k limit. With the limit we currently have, no,
we'll just fail bpf_map_inc instead. So no need to do anything about the
bpf tree.

>
> > But having failing refcounting operations is problematic in some cases. One
> > example is mmap() interface. After establishing initial memory-mapping, user
> > is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
> > splitting it into multiple non-contiguous regions. All this happening without
> > any control from the users of mmap subsystem. Rather mmap subsystem sends
> > notifications to original creator of memory mapping through open/close
> > callbacks, which are optionally specified during initial memory mapping
> > creation. These callbacks are used to maintain accurate refcount for bpf_map
> > (see next patch in this series). The problem is that open() callback is not
> > supposed to fail, because memory-mapped resource is set up and properly
> > referenced. This is posing a problem for using memory-mapping with BPF maps.
> >
> > One solution to this is to maintain separate refcount for just memory-mappings
> > and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
> > There are similar use cases in current work on tcp-bpf, necessitating extra
> > counter as well. This seems like a rather unfortunate and ugly solution that
> > doesn't scale well to various new use cases.
> >
> > Another approach to solve this is to use non-failing refcount_t type, which
> > uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
> > stays there. This ultimately causes a memory leak, but prevents use after free.
> >

[...]

> >
> > +/* prog's refcnt limit */
> > +#define BPF_MAX_REFCNT 32768
> > +
>
> Would it make sense in this context to also convert prog refcount over to atomic64
> so we have both in one go in order to align them again? This could likely simplify
> even more wrt error paths.
>

Yes, of course, I was just trying to minimize the amount of changes, but I
can go and do the same for prog refcnt.

> >   struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
> >   {
> >       if (atomic_add_return(i, &prog->aux->refcnt) > BPF_MAX_REFCNT) {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15  4:02 ` [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
  2019-11-15  4:45   ` Alexei Starovoitov
  2019-11-15 13:57   ` Johannes Weiner
@ 2019-11-15 23:31   ` Daniel Borkmann
  2019-11-15 23:37     ` Alexei Starovoitov
  2 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2019-11-15 23:31 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast
  Cc: andrii.nakryiko, kernel-team, Johannes Weiner, Rik van Riel

On 11/15/19 5:02 AM, Andrii Nakryiko wrote:
> Add ability to memory-map contents of BPF array map. This is extremely useful
> for working with BPF global data from userspace programs. It allows to avoid
> typical bpf_map_{lookup,update}_elem operations, improving both performance
> and usability.
> 
> There had to be special considerations for map freezing, to avoid having
> writable memory view into a frozen map. To solve this issue, map freezing and
> mmap-ing is happening under mutex now:
>    - if map is already frozen, no writable mapping is allowed;
>    - if map has writable memory mappings active (accounted in map->writecnt),
>      map freezing will keep failing with -EBUSY;
>    - once number of writable memory mappings drops to zero, map freezing can be
>      performed again.
> 
> Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> can't be memory mapped either.
> 
> For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> to be mmap()'able. We also need to make sure that array data memory is
> page-sized and page-aligned, so we over-allocate memory in such a way that
> struct bpf_array is at the end of a single page of memory with array->value
> being aligned with the start of the second page. On deallocation we need to
> accommodate this memory arrangement to free vmalloc()'ed memory correctly.
> 
> One important consideration regarding how memory-mapping subsystem functions.
> Memory-mapping subsystem provides few optional callbacks, among them open()
> and close().  close() is called for each memory region that is unmapped, so
> that users can decrease their reference counters and free up resources, if
> necessary. open() is *almost* symmetrical: it's called for each memory region
> that is being mapped, **except** the very first one. So bpf_map_mmap does
> initial refcnt bump, while open() will do any extra ones after that. Thus
> number of close() calls is equal to number of open() calls plus one more.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@surriel.com>
> Acked-by: Song Liu <songliubraving@fb.com>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

[...]
> +/* called for any extra memory-mapped regions (except initial) */
> +static void bpf_map_mmap_open(struct vm_area_struct *vma)
> +{
> +	struct bpf_map *map = vma->vm_file->private_data;
> +
> +	bpf_map_inc(map);

This would also need to inc uref counter since it's technically a reference
of this map into user space as otherwise if map->ops->map_release_uref would
be used for maps supporting mmap, then the callback would trigger even if user
space still has a reference to it.

> +	if (vma->vm_flags & VM_WRITE) {
> +		mutex_lock(&map->freeze_mutex);
> +		map->writecnt++;
> +		mutex_unlock(&map->freeze_mutex);
> +	}
> +}
> +
> +/* called for all unmapped memory region (including initial) */
> +static void bpf_map_mmap_close(struct vm_area_struct *vma)
> +{
> +	struct bpf_map *map = vma->vm_file->private_data;
> +
> +	if (vma->vm_flags & VM_WRITE) {
> +		mutex_lock(&map->freeze_mutex);
> +		map->writecnt--;
> +		mutex_unlock(&map->freeze_mutex);
> +	}
> +
> +	bpf_map_put(map);

Ditto.

> +}
> +
> +static const struct vm_operations_struct bpf_map_default_vmops = {
> +	.open		= bpf_map_mmap_open,
> +	.close		= bpf_map_mmap_close,
> +};
> +
> +static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +	struct bpf_map *map = filp->private_data;
> +	int err;
> +
> +	if (!map->ops->map_mmap || map_value_has_spin_lock(map))
> +		return -ENOTSUPP;
> +
> +	if (!(vma->vm_flags & VM_SHARED))
> +		return -EINVAL;
> +
> +	mutex_lock(&map->freeze_mutex);
> +
> +	if ((vma->vm_flags & VM_WRITE) && map->frozen) {
> +		err = -EPERM;
> +		goto out;
> +	}
> +
> +	/* set default open/close callbacks */
> +	vma->vm_ops = &bpf_map_default_vmops;
> +	vma->vm_private_data = map;
> +
> +	err = map->ops->map_mmap(map, vma);
> +	if (err)
> +		goto out;
> +
> +	bpf_map_inc(map);

Ditto.

> +	if (vma->vm_flags & VM_WRITE)
> +		map->writecnt++;
> +out:
> +	mutex_unlock(&map->freeze_mutex);
> +	return err;
> +}
> +
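
For reference, a minimal sketch of the uref-aware variant being suggested for
the two callbacks above (relying on the bpf_map_inc_with_uref()/
bpf_map_put_with_uref() pair from patch 1 of this series; whether an mmap
region should count as a user-space reference at all is exactly the open
question here):

/* sketch: treat each mapped region as a user-space reference too */
static void bpf_map_mmap_open(struct vm_area_struct *vma)
{
	struct bpf_map *map = vma->vm_file->private_data;

	bpf_map_inc_with_uref(map);		/* refcnt++ and usercnt++ */
	if (vma->vm_flags & VM_WRITE) {
		mutex_lock(&map->freeze_mutex);
		map->writecnt++;
		mutex_unlock(&map->freeze_mutex);
	}
}

static void bpf_map_mmap_close(struct vm_area_struct *vma)
{
	struct bpf_map *map = vma->vm_file->private_data;

	if (vma->vm_flags & VM_WRITE) {
		mutex_lock(&map->freeze_mutex);
		map->writecnt--;
		mutex_unlock(&map->freeze_mutex);
	}
	bpf_map_put_with_uref(map);		/* drops both counters */
}

The initial bump in bpf_map_mmap() would need the same treatment, as the
"Ditto" notes indicate.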

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15 23:31   ` Daniel Borkmann
@ 2019-11-15 23:37     ` Alexei Starovoitov
  2019-11-15 23:44       ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-11-15 23:37 UTC (permalink / raw)
  To: Daniel Borkmann, Andrii Nakryiko, bpf, netdev
  Cc: andrii.nakryiko, Kernel Team, Johannes Weiner, Rik van Riel

On 11/15/19 3:31 PM, Daniel Borkmann wrote:
> On 11/15/19 5:02 AM, Andrii Nakryiko wrote:
>> Add ability to memory-map contents of BPF array map. This is extremely 
>> useful
>> for working with BPF global data from userspace programs. It allows to 
>> avoid
>> typical bpf_map_{lookup,update}_elem operations, improving both 
>> performance
>> and usability.
>>
>> There had to be special considerations for map freezing, to avoid having
>> writable memory view into a frozen map. To solve this issue, map 
>> freezing and
>> mmap-ing is happening under mutex now:
>>    - if map is already frozen, no writable mapping is allowed;
>>    - if map has writable memory mappings active (accounted in 
>> map->writecnt),
>>      map freezing will keep failing with -EBUSY;
>>    - once number of writable memory mappings drops to zero, map 
>> freezing can be
>>      performed again.
>>
>> Only non-per-CPU plain arrays are supported right now. Maps with 
>> spinlocks
>> can't be memory mapped either.
>>
>> For BPF_F_MMAPABLE array, memory allocation has to be done through 
>> vmalloc()
>> to be mmap()'able. We also need to make sure that array data memory is
>> page-sized and page-aligned, so we over-allocate memory in such a way 
>> that
>> struct bpf_array is at the end of a single page of memory with 
>> array->value
>> being aligned with the start of the second page. On deallocation we 
>> need to
>> accommodate this memory arrangement to free vmalloc()'ed memory correctly.
>>
>> One important consideration regarding how memory-mapping subsystem 
>> functions.
>> Memory-mapping subsystem provides few optional callbacks, among them 
>> open()
>> and close().  close() is called for each memory region that is 
>> unmapped, so
>> that users can decrease their reference counters and free up 
>> resources, if
>> necessary. open() is *almost* symmetrical: it's called for each memory 
>> region
>> that is being mapped, **except** the very first one. So bpf_map_mmap does
>> initial refcnt bump, while open() will do any extra ones after that. Thus
>> number of close() calls is equal to number of open() calls plus one more.
>>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Rik van Riel <riel@surriel.com>
>> Acked-by: Song Liu <songliubraving@fb.com>
>> Acked-by: John Fastabend <john.fastabend@gmail.com>
>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> 
> [...]
>> +/* called for any extra memory-mapped regions (except initial) */
>> +static void bpf_map_mmap_open(struct vm_area_struct *vma)
>> +{
>> +    struct bpf_map *map = vma->vm_file->private_data;
>> +
>> +    bpf_map_inc(map);
> 
> This would also need to inc uref counter since it's technically a reference
> of this map into user space as otherwise if map->ops->map_release_uref 
> would
> be used for maps supporting mmap, then the callback would trigger even 
> if user
> space still has a reference to it.

I thought we use uref only for arrays that can hold FDs?
That's why I suggested earlier that Andrii drop the uref++.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15 23:37     ` Alexei Starovoitov
@ 2019-11-15 23:44       ` Daniel Borkmann
  2019-11-15 23:47         ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2019-11-15 23:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, bpf, netdev
  Cc: andrii.nakryiko, Kernel Team, Johannes Weiner, Rik van Riel

On 11/16/19 12:37 AM, Alexei Starovoitov wrote:
> On 11/15/19 3:31 PM, Daniel Borkmann wrote:
>> On 11/15/19 5:02 AM, Andrii Nakryiko wrote:
>>> Add ability to memory-map contents of BPF array map. This is extremely
>>> useful
>>> for working with BPF global data from userspace programs. It allows to
>>> avoid
>>> typical bpf_map_{lookup,update}_elem operations, improving both
>>> performance
>>> and usability.
>>>
>>> There had to be special considerations for map freezing, to avoid having
>>> writable memory view into a frozen map. To solve this issue, map
>>> freezing and
>>> mmap-ing is happening under mutex now:
>>>     - if map is already frozen, no writable mapping is allowed;
>>>     - if map has writable memory mappings active (accounted in
>>> map->writecnt),
>>>       map freezing will keep failing with -EBUSY;
>>>     - once number of writable memory mappings drops to zero, map
>>> freezing can be
>>>       performed again.
>>>
>>> Only non-per-CPU plain arrays are supported right now. Maps with
>>> spinlocks
>>> can't be memory mapped either.
>>>
>>> For BPF_F_MMAPABLE array, memory allocation has to be done through
>>> vmalloc()
>>> to be mmap()'able. We also need to make sure that array data memory is
>>> page-sized and page-aligned, so we over-allocate memory in such a way
>>> that
>>> struct bpf_array is at the end of a single page of memory with
>>> array->value
>>> being aligned with the start of the second page. On deallocation we
>>> need to
>>> accommodate this memory arrangement to free vmalloc()'ed memory correctly.
>>>
>>> One important consideration regarding how memory-mapping subsystem
>>> functions.
>>> Memory-mapping subsystem provides few optional callbacks, among them
>>> open()
>>> and close().  close() is called for each memory region that is
>>> unmapped, so
>>> that users can decrease their reference counters and free up
>>> resources, if
>>> necessary. open() is *almost* symmetrical: it's called for each memory
>>> region
>>> that is being mapped, **except** the very first one. So bpf_map_mmap does
>>> initial refcnt bump, while open() will do any extra ones after that. Thus
>>> number of close() calls is equal to number of open() calls plus one more.
>>>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Cc: Rik van Riel <riel@surriel.com>
>>> Acked-by: Song Liu <songliubraving@fb.com>
>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>
>> [...]
>>> +/* called for any extra memory-mapped regions (except initial) */
>>> +static void bpf_map_mmap_open(struct vm_area_struct *vma)
>>> +{
>>> +    struct bpf_map *map = vma->vm_file->private_data;
>>> +
>>> +    bpf_map_inc(map);
>>
>> This would also need to inc uref counter since it's technically a reference
>> of this map into user space as otherwise if map->ops->map_release_uref
>> would
>> be used for maps supporting mmap, then the callback would trigger even
>> if user
>> space still has a reference to it.
> 
> I thought we use uref only for arrays that can hold FDs?
> That's why I suggested earlier that Andrii drop the uref++.

Yeah, only for the fd array currently. Question is, if we ever reuse that map_release_uref
callback for something else in the future, will we remember that we missed adding
it here? :/

^ permalink raw reply	[flat|nested] 25+ messages in thread
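
For reference, a minimal sketch of the vm_operations wiring the quoted commit
message describes. The open() body mirrors the hunk above; the close() side
and the ops-struct name are inferred from the commit text rather than taken
from the patch, and the writecnt/freeze handling is left out:

static void bpf_map_mmap_open(struct vm_area_struct *vma)
{
    struct bpf_map *map = vma->vm_file->private_data;

    /* called for every additional mapping of the region, not the first */
    bpf_map_inc(map);
}

static void bpf_map_mmap_close(struct vm_area_struct *vma)
{
    struct bpf_map *map = vma->vm_file->private_data;

    /* called once per region being unmapped, including the first one,
     * so this put pairs with either bpf_map_mmap() or open() above
     */
    bpf_map_put(map);
}

/* hypothetical name; installed as vma->vm_ops from bpf_map_mmap() */
static const struct vm_operations_struct bpf_map_mmap_vmops = {
    .open  = bpf_map_mmap_open,
    .close = bpf_map_mmap_close,
};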

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15 23:44       ` Daniel Borkmann
@ 2019-11-15 23:47         ` Alexei Starovoitov
  2019-11-16  0:13           ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-11-15 23:47 UTC (permalink / raw)
  To: Daniel Borkmann, Andrii Nakryiko, bpf, netdev
  Cc: andrii.nakryiko, Kernel Team, Johannes Weiner, Rik van Riel

On 11/15/19 3:44 PM, Daniel Borkmann wrote:
> On 11/16/19 12:37 AM, Alexei Starovoitov wrote:
>> On 11/15/19 3:31 PM, Daniel Borkmann wrote:
>>> On 11/15/19 5:02 AM, Andrii Nakryiko wrote:
>>>> Add ability to memory-map contents of BPF array map. This is extremely
>>>> useful
>>>> for working with BPF global data from userspace programs. It allows one to
>>>> avoid
>>>> typical bpf_map_{lookup,update}_elem operations, improving both
>>>> performance
>>>> and usability.
>>>>
>>>> There had to be special considerations for map freezing, to avoid 
>>>> having
>>>> writable memory view into a frozen map. To solve this issue, map
>>>> freezing and
>>>> mmap-ing is happening under mutex now:
>>>>     - if map is already frozen, no writable mapping is allowed;
>>>>     - if map has writable memory mappings active (accounted in
>>>> map->writecnt),
>>>>       map freezing will keep failing with -EBUSY;
>>>>     - once number of writable memory mappings drops to zero, map
>>>> freezing can be
>>>>       performed again.
>>>>
>>>> Only non-per-CPU plain arrays are supported right now. Maps with
>>>> spinlocks
>>>> can't be memory mapped either.
>>>>
>>>> For BPF_F_MMAPABLE array, memory allocation has to be done through
>>>> vmalloc()
>>>> to be mmap()'able. We also need to make sure that array data memory is
>>>> page-sized and page-aligned, so we over-allocate memory in such a way
>>>> that
>>>> struct bpf_array is at the end of a single page of memory with
>>>> array->value
>>>> being aligned with the start of the second page. On deallocation we
>>>> need to
>>>> accommodate this memory arrangement to free vmalloc()'ed memory
>>>> correctly.
>>>>
>>>> One important consideration regarding how memory-mapping subsystem
>>>> functions.
>>>> The memory-mapping subsystem provides a few optional callbacks, among them
>>>> open()
>>>> and close().  close() is called for each memory region that is
>>>> unmapped, so
>>>> that users can decrease their reference counters and free up
>>>> resources, if
>>>> necessary. open() is *almost* symmetrical: it's called for each memory
>>>> region
>>>> that is being mapped, **except** the very first one. So bpf_map_mmap 
>>>> does
>>>> initial refcnt bump, while open() will do any extra ones after that. 
>>>> Thus
>>>> number of close() calls is equal to number of open() calls plus one 
>>>> more.
>>>>
>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>> Cc: Rik van Riel <riel@surriel.com>
>>>> Acked-by: Song Liu <songliubraving@fb.com>
>>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
>>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>>
>>> [...]
>>>> +/* called for any extra memory-mapped regions (except initial) */
>>>> +static void bpf_map_mmap_open(struct vm_area_struct *vma)
>>>> +{
>>>> +    struct bpf_map *map = vma->vm_file->private_data;
>>>> +
>>>> +    bpf_map_inc(map);
>>>
>>> This would also need to inc uref counter since it's technically a 
>>> reference
>>> of this map into user space as otherwise if map->ops->map_release_uref
>>> would
>>> be used for maps supporting mmap, then the callback would trigger even
>>> if user
>>> space still has a reference to it.
>>
>> I thought we use uref only for array that can hold FDs ?
>> That's why I suggested Andrii earlier to drop uref++.
> 
> Yeah, only for fd array currently. Question is, if we ever reuse that 
> map_release_uref
> callback in future for something else, will we remember that we earlier 
> missed to add
> it here? :/

What do you mean 'missed to add' ?
This is mmap path. Anything that needs releasing (like FDs for 
prog_array or progs for sockmap) cannot be mmap-able.



^ permalink raw reply	[flat|nested] 25+ messages in thread
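
To make the "only plain data can be mmap-able" point concrete, here is a
rough sketch of the kind of gating the commit message describes. The
function names carry a _sketch suffix because they are made up for
illustration; the exact checks and error codes may differ from the patch:

/* at map creation: BPF_F_MMAPABLE is only accepted for plain arrays */
static int array_mmap_flag_check_sketch(union bpf_attr *attr)
{
    if ((attr->map_flags & BPF_F_MMAPABLE) &&
        attr->map_type != BPF_MAP_TYPE_ARRAY)
        return -EINVAL;
    return 0;
}

/* at mmap() time: no spin locks, no writable view into a frozen map */
static int bpf_map_mmap_check_sketch(struct bpf_map *map, bool writable)
{
    if (!(map->map_flags & BPF_F_MMAPABLE))
        return -EINVAL;
    if (map_value_has_spin_lock(map))
        return -ENOTSUPP;
    if (writable && map->frozen)
        return -EPERM;
    return 0;
}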

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-15 23:47         ` Alexei Starovoitov
@ 2019-11-16  0:13           ` Daniel Borkmann
  2019-11-16  1:18             ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2019-11-16  0:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, bpf, netdev
  Cc: andrii.nakryiko, Kernel Team, Johannes Weiner, Rik van Riel

On 11/16/19 12:47 AM, Alexei Starovoitov wrote:
> On 11/15/19 3:44 PM, Daniel Borkmann wrote:
>> On 11/16/19 12:37 AM, Alexei Starovoitov wrote:
>>> On 11/15/19 3:31 PM, Daniel Borkmann wrote:
>>>> On 11/15/19 5:02 AM, Andrii Nakryiko wrote:
>>>>> Add ability to memory-map contents of BPF array map. This is extremely
>>>>> useful
>>>>> for working with BPF global data from userspace programs. It allows one to
>>>>> avoid
>>>>> typical bpf_map_{lookup,update}_elem operations, improving both
>>>>> performance
>>>>> and usability.
>>>>>
>>>>> There had to be special considerations for map freezing, to avoid
>>>>> having
>>>>> writable memory view into a frozen map. To solve this issue, map
>>>>> freezing and
>>>>> mmap-ing is happening under mutex now:
>>>>>      - if map is already frozen, no writable mapping is allowed;
>>>>>      - if map has writable memory mappings active (accounted in
>>>>> map->writecnt),
>>>>>        map freezing will keep failing with -EBUSY;
>>>>>      - once number of writable memory mappings drops to zero, map
>>>>> freezing can be
>>>>>        performed again.
>>>>>
>>>>> Only non-per-CPU plain arrays are supported right now. Maps with
>>>>> spinlocks
>>>>> can't be memory mapped either.
>>>>>
>>>>> For BPF_F_MMAPABLE array, memory allocation has to be done through
>>>>> vmalloc()
>>>>> to be mmap()'able. We also need to make sure that array data memory is
>>>>> page-sized and page-aligned, so we over-allocate memory in such a way
>>>>> that
>>>>> struct bpf_array is at the end of a single page of memory with
>>>>> array->value
>>>>> being aligned with the start of the second page. On deallocation we
>>>>> need to
>>>>> accommodate this memory arrangement to free vmalloc()'ed memory
>>>>> correctly.
>>>>>
>>>>> One important consideration regarding how memory-mapping subsystem
>>>>> functions.
>>>>> The memory-mapping subsystem provides a few optional callbacks, among them
>>>>> open()
>>>>> and close().  close() is called for each memory region that is
>>>>> unmapped, so
>>>>> that users can decrease their reference counters and free up
>>>>> resources, if
>>>>> necessary. open() is *almost* symmetrical: it's called for each memory
>>>>> region
>>>>> that is being mapped, **except** the very first one. So bpf_map_mmap
>>>>> does
>>>>> initial refcnt bump, while open() will do any extra ones after that.
>>>>> Thus
>>>>> number of close() calls is equal to number of open() calls plus one
>>>>> more.
>>>>>
>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>> Cc: Rik van Riel <riel@surriel.com>
>>>>> Acked-by: Song Liu <songliubraving@fb.com>
>>>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
>>>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>>>
>>>> [...]
>>>>> +/* called for any extra memory-mapped regions (except initial) */
>>>>> +static void bpf_map_mmap_open(struct vm_area_struct *vma)
>>>>> +{
>>>>> +    struct bpf_map *map = vma->vm_file->private_data;
>>>>> +
>>>>> +    bpf_map_inc(map);
>>>>
>>>> This would also need to inc uref counter since it's technically a
>>>> reference
>>>> of this map into user space as otherwise if map->ops->map_release_uref
>>>> would
>>>> be used for maps supporting mmap, then the callback would trigger even
>>>> if user
>>>> space still has a reference to it.
>>>
>>> I thought we use uref only for array that can hold FDs ?
>>> That's why I suggested Andrii earlier to drop uref++.
>>
>> Yeah, only for fd array currently. Question is, if we ever reuse that
>> map_release_uref
>> callback in future for something else, will we remember that we earlier
>> missed to add
>> it here? :/
> 
> What do you mean 'missed to add' ?

Was saying missed to add the inc/put for the uref counter.

> This is mmap path. Anything that needs releasing (like FDs for
> prog_array or progs for sockmap) cannot be mmap-able.

Right, I meant if in future we ever have another use case outside of it
for some reason (unrelated to those maps you mention above). Can we
guarantee this is never going to happen? Seemed less fragile at least to
maintain proper count here.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-16  0:13           ` Daniel Borkmann
@ 2019-11-16  1:18             ` Alexei Starovoitov
  2019-11-17  5:57               ` Andrii Nakryiko
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Daniel Borkmann, Andrii Nakryiko, bpf, netdev
  Cc: andrii.nakryiko, Kernel Team, Johannes Weiner, Rik van Riel

On 11/15/19 4:13 PM, Daniel Borkmann wrote:
>>> Yeah, only for fd array currently. Question is, if we ever reuse that
>>> map_release_uref
>>> callback in future for something else, will we remember that we earlier
>>> missed to add
>>> it here? :/
>>
>> What do you mean 'missed to add' ?
> 
> Was saying missed to add the inc/put for the uref counter.
> 
>> This is mmap path. Anything that needs releasing (like FDs for
>> prog_array or progs for sockmap) cannot be mmap-able.
> 
> Right, I meant if in future we ever have another use case outside of it
> for some reason (unrelated to those maps you mention above). Can we
> guarantee this is never going to happen? Seemed less fragile at least to
> maintain proper count here.

I'm struggling to understand the concern.
map-in-map, xskmap, socket local storage are doing bpf_map_inc(, false)
when they need to hold the map. Why is this case any different?

^ permalink raw reply	[flat|nested] 25+ messages in thread
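
The two reference-counting flavours being debated here, using the
post-refactor names from patch 1 of this series; the bodies below are a
sketch meant only to make the distinction concrete, not a quote of the
kernel source:

/* kernel-internal reference, as taken by map-in-map, xskmap and socket
 * local storage: the map must stay allocated, but user space is not
 * considered to hold a handle to it
 */
void bpf_map_inc(struct bpf_map *map)
{
    atomic64_inc(&map->refcnt);
}

/* user-visible reference (fd, bpffs inode, ...): also bumps usercnt,
 * whose drop to zero triggers map->ops->map_release_uref()
 */
void bpf_map_inc_with_uref(struct bpf_map *map)
{
    atomic64_inc(&map->refcnt);
    atomic64_inc(&map->usercnt);
}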

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-16  1:18             ` Alexei Starovoitov
@ 2019-11-17  5:57               ` Andrii Nakryiko
  2019-11-17 12:07                 ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-17  5:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, netdev, Kernel Team,
	Johannes Weiner, Rik van Riel

On Fri, Nov 15, 2019 at 5:18 PM Alexei Starovoitov <ast@fb.com> wrote:
>
> On 11/15/19 4:13 PM, Daniel Borkmann wrote:
> >>> Yeah, only for fd array currently. Question is, if we ever reuse that
> >>> map_release_uref
> >>> callback in future for something else, will we remember that we earlier
> >>> missed to add
> >>> it here? :/
> >>
> >> What do you mean 'missed to add' ?
> >
> > Was saying missed to add the inc/put for the uref counter.
> >
> >> This is mmap path. Anything that needs releasing (like FDs for
> >> prog_array or progs for sockmap) cannot be mmap-able.
> >
> > Right, I meant if in future we ever have another use case outside of it
> > for some reason (unrelated to those maps you mention above). Can we
> > guarantee this is never going to happen? Seemed less fragile at least to
> > maintain proper count here.

I don't think we're ever going to allow mmaping anything that contains
more than pure data. E.g., we disallow mmaping an array that contains a
spin lock for that reason. So I think it's safe to assume that this is not
going to happen even for future maps. At least not without some
serious considerations before that. So I'm going to keep it as just
plain bpf_map_inc for now.

I'm going to convert bpf_prog_add/bpf_prog_inc, though, and will do it
as a separate patch, on top of the bpf_map_inc refactor. It touches quite
a lot of drivers, so it would benefit from being separate.

>
> I'm struggling to understand the concern.
> map-in-map, xskmap, socket local storage are doing bpf_map_inc(, false)
> when they need to hold the map. Why is this case any different?

^ permalink raw reply	[flat|nested] 25+ messages in thread
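
The bpf_prog_add()/bpf_prog_inc() conversion mentioned above would follow
the same pattern as the bpf_map refactor; roughly this shape (a sketch,
not the later patch itself):

/* with a 64-bit counter, prog references can no longer overflow,
 * so these helpers stop returning errors
 */
void bpf_prog_add(struct bpf_prog *prog, int i)
{
    atomic64_add(i, &prog->aux->refcnt);
}

void bpf_prog_inc(struct bpf_prog *prog)
{
    atomic64_inc(&prog->aux->refcnt);
}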

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-17  5:57               ` Andrii Nakryiko
@ 2019-11-17 12:07                 ` Daniel Borkmann
  2019-11-17 17:17                   ` Andrii Nakryiko
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2019-11-17 12:07 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, netdev, Kernel Team, Johannes Weiner, Rik van Riel

On 11/17/19 6:57 AM, Andrii Nakryiko wrote:
> On Fri, Nov 15, 2019 at 5:18 PM Alexei Starovoitov <ast@fb.com> wrote:
>> On 11/15/19 4:13 PM, Daniel Borkmann wrote:
>>>>> Yeah, only for fd array currently. Question is, if we ever reuse that
>>>>> map_release_uref
>>>>> callback in future for something else, will we remember that we earlier
>>>>> missed to add
>>>>> it here? :/
>>>>
>>>> What do you mean 'missed to add' ?
>>>
>>> Was saying missed to add the inc/put for the uref counter.
>>>
>>>> This is mmap path. Anything that needs releasing (like FDs for
>>>> prog_array or progs for sockmap) cannot be mmap-able.
>>>
>>> Right, I meant if in future we ever have another use case outside of it
>>> for some reason (unrelated to those maps you mention above). Can we
>>> guarantee this is never going to happen? Seemed less fragile at least to
>>> maintain proper count here.
> 
> I don't think we're ever going to allow mmaping anything that contains
> more than pure data. E.g., we disallow mmaping an array that contains a
> spin lock for that reason. So I think it's safe to assume that this is not
> going to happen even for future maps. At least not without some
> serious considerations before that. So I'm going to keep it as just
> plain bpf_map_inc for now.

Fair enough, then keep it as it is. The purpose of the uref counter is to
track whatever holds a reference to the map, either in the form of an fd or
an inode in bpf fs, which so far are the only two ways user space can refer
to the map; once it hits 0, we perform the map's map_release_uref() callback.

The fact that some maps make use of it and some others not is an implementation
detail in my opinion, but the concept itself is generic and can be used by
whatever map implementation would need it in future. From my perspective,
not breaking this semantic would mean one less issue to worry about once
this callback gets reused for whatever reason.

As I understand, from your PoV, you think that this uref counter is and will
be exactly only tied to the fd based maps that currently use it and given
they understandably won't ever need a mmap interface we don't need to inc/dec
it there.

Fair enough, but could we then add an assertion that bails out if a map ever
uses both, so that we don't forget about this detail a few weeks from now?
Given the complexity of our BPF codebase these days, I'm mainly worried about
exactly that, and such details are easy to catch with a trivial check. For
example, it would be trivial to test for the existence of the
map_release_uref callback inside bpf_map_mmap() and bail out in order to
guarantee this, similar to what you do with the spinlock.

> I'm going to convert bpf_prog_add/bpf_prog_inc, though, and will do it
> as a separate patch, on top of the bpf_map_inc refactor. It touches quite
> a lot of drivers, so it would benefit from being separate.

Yeah, sounds good to me. Thanks for converting!

>> I'm struggling to understand the concern.
>> map-in-map, xskmap, socket local storage are doing bpf_map_inc(, false)
>> when they need to hold the map. Why is this case any different?

(See above.)

^ permalink raw reply	[flat|nested] 25+ messages in thread
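
The assertion being asked for could be as small as the following check at
the top of bpf_map_mmap(). This is only a sketch of the proposal, not
something the series ended up carrying (it went with uref counting instead,
see below):

static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct bpf_map *map = filp->private_data;

    /* any map that needs a user-reference release hook must not be
     * mmap()'ed without proper uref accounting
     */
    if (map->ops->map_release_uref)
        return -ENOTSUPP;

    /* ... the rest of the mmap setup ... */
    return 0;
}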

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-17 12:07                 ` Daniel Borkmann
@ 2019-11-17 17:17                   ` Andrii Nakryiko
  2019-11-18 13:50                     ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2019-11-17 17:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, netdev, Kernel Team,
	Johannes Weiner, Rik van Riel

On Sun, Nov 17, 2019 at 4:07 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/17/19 6:57 AM, Andrii Nakryiko wrote:
> > On Fri, Nov 15, 2019 at 5:18 PM Alexei Starovoitov <ast@fb.com> wrote:
> >> On 11/15/19 4:13 PM, Daniel Borkmann wrote:
> >>>>> Yeah, only for fd array currently. Question is, if we ever reuse that
> >>>>> map_release_uref
> >>>>> callback in future for something else, will we remember that we earlier
> >>>>> missed to add
> >>>>> it here? :/
> >>>>
> >>>> What do you mean 'missed to add' ?
> >>>
> >>> Was saying missed to add the inc/put for the uref counter.
> >>>
> >>>> This is mmap path. Anything that needs releasing (like FDs for
> >>>> prog_array or progs for sockmap) cannot be mmap-able.
> >>>
> >>> Right, I meant if in future we ever have another use case outside of it
> >>> for some reason (unrelated to those maps you mention above). Can we
> >>> guarantee this is never going to happen? Seemed less fragile at least to
> >>> maintain proper count here.
> >
> > I don't think we're ever going to allow mmaping anything that contains
> > more than pure data. E.g., we disallow mmaping an array that contains a
> > spin lock for that reason. So I think it's safe to assume that this is not
> > going to happen even for future maps. At least not without some
> > serious considerations before that. So I'm going to keep it as just
> > plain bpf_map_inc for now.
>
> Fair enough, then keep it as it is. The purpose of the uref counter is to
> track whatever holds a reference to the map, either in the form of an fd or
> an inode in bpf fs, which so far are the only two ways user space can refer
> to the map; once it hits 0, we perform the map's map_release_uref() callback.

To be honest, I don't exactly understand why we need both refcnt and
usercnt. Does it have anything to do with some circular dependencies
for those maps containing other FDs? And once userspace doesn't have
any more references, we release FDs, which might decrement refcnt,
thus breaking circular refcnt between map FD and special FDs inside a
map? Or that's not the case and there is a different reason?

Either way, I looked at map creation and bpf_map_release()
implementation again. map_create() does set usercnt to 1, and
bpf_map_release(), which I assume is called when map FD is closed,
does decrement usercnt, so yeah, we do with_uref() manipulations for
cases when userspace maintains some sort of handle to map. In that
regard, mmap() does fall into the same category, so I'm going to
convert everything mmap-related back to
bpf_map_inc_with_uref()/bpf_map_put_with_uref(), to stay consistent.

>
> The fact that some maps make use of it and some others not is an implementation
> detail in my opinion, but the concept itself is generic and can be used by
> whatever map implementation would need it in future. From my perspective,
> not breaking this semantic would mean one less issue to worry about once
> this callback gets reused for whatever reason.
>
> As I understand, from your PoV, you think that this uref counter is and will
> be exactly only tied to the fd based maps that currently use it and given
> they understandably won't ever need a mmap interface we don't need to inc/dec
> it there.
>
> Fair enough, but could we then add an assertion that bails out if a map ever
> uses both, so that we don't forget about this detail a few weeks from now?
> Given the complexity of our BPF codebase these days, I'm mainly worried about
> exactly that, and such details are easy to catch with a trivial check. For
> example, it would be trivial to test for the existence of the
> map_release_uref callback inside bpf_map_mmap() and bail out in order to
> guarantee this, similar to what you do with the spinlock.
>
> > I'm going to convert bpf_prog_add/bpf_prog_inc, though, and will do it
> > as a separate patch, on top of the bpf_map_inc refactor. It touches quite
> > a lot of drivers, so it would benefit from being separate.
>
> Yeah, sounds good to me. Thanks for converting!
>
> >> I'm struggling to understand the concern.
> >> map-in-map, xskmap, socket local storage are doing bpf_map_inc(, false)
> >> when they need to hold the map. Why is this case any different?
>
> (See above.)

^ permalink raw reply	[flat|nested] 25+ messages in thread
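
In code terms, the switch described above just replaces the plain helpers in
the mmap path with their _with_uref counterparts; sketched below with the
same caveats as before, this is not the actual v5 hunk:

static void bpf_map_mmap_open(struct vm_area_struct *vma)
{
    struct bpf_map *map = vma->vm_file->private_data;

    bpf_map_inc_with_uref(map);
}

static void bpf_map_mmap_close(struct vm_area_struct *vma)
{
    struct bpf_map *map = vma->vm_file->private_data;

    bpf_map_put_with_uref(map);
}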

* Re: [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-17 17:17                   ` Andrii Nakryiko
@ 2019-11-18 13:50                     ` Daniel Borkmann
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Borkmann @ 2019-11-18 13:50 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, netdev, Kernel Team,
	Johannes Weiner, Rik van Riel

On 11/17/19 6:17 PM, Andrii Nakryiko wrote:
> On Sun, Nov 17, 2019 at 4:07 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 11/17/19 6:57 AM, Andrii Nakryiko wrote:
>>> On Fri, Nov 15, 2019 at 5:18 PM Alexei Starovoitov <ast@fb.com> wrote:
>>>> On 11/15/19 4:13 PM, Daniel Borkmann wrote:
>>>>>>> Yeah, only for fd array currently. Question is, if we ever reuse that
>>>>>>> map_release_uref
>>>>>>> callback in future for something else, will we remember that we earlier
>>>>>>> missed to add
>>>>>>> it here? :/
>>>>>>
>>>>>> What do you mean 'missed to add' ?
>>>>>
>>>>> Was saying missed to add the inc/put for the uref counter.
>>>>>
>>>>>> This is mmap path. Anything that needs releasing (like FDs for
>>>>>> prog_array or progs for sockmap) cannot be mmap-able.
>>>>>
>>>>> Right, I meant if in future we ever have another use case outside of it
>>>>> for some reason (unrelated to those maps you mention above). Can we
>>>>> guarantee this is never going to happen? Seemed less fragile at least to
>>>>> maintain proper count here.
>>>
>>> I don't think we're ever going to allow mmaping anything that contains
>>> more than pure data. E.g., we disallow mmaping an array that contains a
>>> spin lock for that reason. So I think it's safe to assume that this is not
>>> going to happen even for future maps. At least not without some
>>> serious considerations before that. So I'm going to keep it as just
>>> plain bpf_map_inc for now.
>>
>> Fair enough, then keep it as it is. The purpose of the uref counter is to
>> track whatever holds a reference to the map, either in the form of an fd or
>> an inode in bpf fs, which so far are the only two ways user space can refer
>> to the map; once it hits 0, we perform the map's map_release_uref() callback.
> 
> To be honest, I don't exactly understand why we need both refcnt and
> usercnt. Does it have anything to do with some circular dependencies
> for those maps containing other FDs? And once userspace doesn't have
> any more references, we release FDs, which might decrement refcnt,
> thus breaking circular refcnt between map FD and special FDs inside a
> map? Or that's not the case and there is a different reason?

Yes, back then we added it to break up circular dependencies in relation to
tail calls, which is why these maps are pinned before the loader process finishes
(e.g. as in Cilium's case).

> Either way, I looked at map creation and bpf_map_release()
> implementation again. map_create() does set usercnt to 1, and
> bpf_map_release(), which I assume is called when map FD is closed,
> does decrement usercnt, so yeah, we do with_uref() manipulations for
> cases when userspace maintains some sort of handle to map. In that
> regard, mmap() does fall into the same category, so I'm going to
> convert everything mmap-related back to
> bpf_map_inc_with_uref()/bpf_map_put_with_uref(), to stay consistent.

Ok, fair enough, either way would have been fine.

Thanks a lot,
Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread
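
A sketch of the put side helps show why usercnt is what breaks the cycle
described above: when the last user-space handle disappears,
map_release_uref() drops whatever the map holds (e.g. program FDs), even if
kernel-internal references keep the map itself alive. Illustrative only, not
the exact kernel source:

void bpf_map_put_with_uref(struct bpf_map *map)
{
    /* last user-space reference gone: release held objects */
    if (atomic64_dec_and_test(&map->usercnt) &&
        map->ops->map_release_uref)
        map->ops->map_release_uref(map);

    /* then drop the regular reference as usual */
    bpf_map_put(map);
}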

end of thread, other threads:[~2019-11-18 13:50 UTC | newest]

Thread overview: 25+ messages
2019-11-15  4:02 [PATCH v4 bpf-next 0/4] Add support for memory-mapping BPF array maps Andrii Nakryiko
2019-11-15  4:02 ` [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit so bpf_map_inc never fails Andrii Nakryiko
2019-11-15 21:47   ` Song Liu
2019-11-15 23:23   ` Daniel Borkmann
2019-11-15 23:27     ` Andrii Nakryiko
2019-11-15  4:02 ` [PATCH v4 bpf-next 2/4] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
2019-11-15  4:45   ` Alexei Starovoitov
2019-11-15  5:05     ` Andrii Nakryiko
2019-11-15  5:08       ` Alexei Starovoitov
2019-11-15  5:43         ` Andrii Nakryiko
2019-11-15 16:36           ` Andrii Nakryiko
2019-11-15 20:42             ` Jakub Kicinski
2019-11-15 13:57   ` Johannes Weiner
2019-11-15 23:31   ` Daniel Borkmann
2019-11-15 23:37     ` Alexei Starovoitov
2019-11-15 23:44       ` Daniel Borkmann
2019-11-15 23:47         ` Alexei Starovoitov
2019-11-16  0:13           ` Daniel Borkmann
2019-11-16  1:18             ` Alexei Starovoitov
2019-11-17  5:57               ` Andrii Nakryiko
2019-11-17 12:07                 ` Daniel Borkmann
2019-11-17 17:17                   ` Andrii Nakryiko
2019-11-18 13:50                     ` Daniel Borkmann
2019-11-15  4:02 ` [PATCH v4 bpf-next 3/4] libbpf: make global data internal arrays mmap()-able, if possible Andrii Nakryiko
2019-11-15  4:02 ` [PATCH v4 bpf-next 4/4] selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests Andrii Nakryiko
