BPF Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH bpf-next v4 0/2] net: xdp: XSKMAP improvements
@ 2019-08-02  8:11 Björn Töpel
  2019-08-02  8:11 ` [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released Björn Töpel
  2019-08-02  8:11 ` [PATCH bpf-next v4 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP Björn Töpel
  0 siblings, 2 replies; 9+ messages in thread
From: Björn Töpel @ 2019-08-02  8:11 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, bjorn.topel,
	bruce.richardson, songliubraving, bpf

This series (v4 and counting) add two improvements for the XSKMAP,
used by AF_XDP sockets.

1. Automatic cleanup when an AF_XDP socket goes out of scope/is
   released. Instead of require that the user manually clears the
   "released" state socket from the map, this is done
   automatically. Each socket tracks which maps it resides in, and
   remove itself from those maps at relase. A notable implementation
   change, is that the sockets references the map, instead of the map
   referencing the sockets. Which implies that when the XSKMAP is
   freed, it is by definition cleared of sockets.

2. The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flag on insert,
   which this patch addresses.

Daniel, I (hopefully...) addressed the issues you found in
[1]. Instead of popping the tracked map, it's simply read. Then, the
socket is removed iff it's the same socket, i.e. no updates has
occurred. There are some code comments in the xsk_delete_from_maps()
function as well.


Thanks,
Björn

[1] https://lore.kernel.org/bpf/2417e1ab-16fa-d3ed-564e-1a50c4cb6717@iogearbox.net/

v1->v2: Fixed deadlock and broken cleanup. (Daniel)
v2->v3: Rebased onto bpf-next
v3->v4: {READ, WRITE}_ONCE consistency. (Daniel)
        Socket release/map update race. (Daniel)

Björn Töpel (2):
  xsk: remove AF_XDP socket from map when the socket is released
  xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP

 include/net/xdp_sock.h |  18 ++++++
 kernel/bpf/xskmap.c    | 130 ++++++++++++++++++++++++++++++++++-------
 net/xdp/xsk.c          |  48 +++++++++++++++
 3 files changed, 175 insertions(+), 21 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
  2019-08-02  8:11 [PATCH bpf-next v4 0/2] net: xdp: XSKMAP improvements Björn Töpel
@ 2019-08-02  8:11 ` Björn Töpel
  2019-08-02 18:28   ` Jonathan Lemon
  2019-08-12 12:28   ` Daniel Borkmann
  2019-08-02  8:11 ` [PATCH bpf-next v4 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP Björn Töpel
  1 sibling, 2 replies; 9+ messages in thread
From: Björn Töpel @ 2019-08-02  8:11 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, bruce.richardson,
	songliubraving, bpf

From: Björn Töpel <bjorn.topel@intel.com>

When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.

In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.

Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  18 +++++++
 kernel/bpf/xskmap.c    | 113 ++++++++++++++++++++++++++++++++++-------
 net/xdp/xsk.c          |  48 +++++++++++++++++
 3 files changed, 160 insertions(+), 19 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..066e3ae446a8 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -50,6 +50,16 @@ struct xdp_umem {
 	struct list_head xsk_list;
 };
 
+/* Nodes are linked in the struct xdp_sock map_list field, and used to
+ * track which maps a certain socket reside in.
+ */
+struct xsk_map;
+struct xsk_map_node {
+	struct list_head node;
+	struct xsk_map *map;
+	struct xdp_sock **map_entry;
+};
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
@@ -75,6 +85,9 @@ struct xdp_sock {
 	/* Protects generic receive. */
 	spinlock_t rx_lock;
 	u64 rx_dropped;
+	struct list_head map_list;
+	/* Protects map_list */
+	spinlock_t map_list_lock;
 };
 
 struct xdp_buff;
@@ -96,6 +109,11 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
 void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
 struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
 
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+			     struct xdp_sock **map_entry);
+int xsk_map_inc(struct xsk_map *map);
+void xsk_map_put(struct xsk_map *map);
+
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
 	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9bb96ace9fa1..780639309f6b 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -13,8 +13,71 @@ struct xsk_map {
 	struct bpf_map map;
 	struct xdp_sock **xsk_map;
 	struct list_head __percpu *flush_list;
+	spinlock_t lock; /* Synchronize map updates */
 };
 
+int xsk_map_inc(struct xsk_map *map)
+{
+	struct bpf_map *m = &map->map;
+
+	m = bpf_map_inc(m, false);
+	return IS_ERR(m) ? PTR_ERR(m) : 0;
+}
+
+void xsk_map_put(struct xsk_map *map)
+{
+	bpf_map_put(&map->map);
+}
+
+static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
+					       struct xdp_sock **map_entry)
+{
+	struct xsk_map_node *node;
+	int err;
+
+	node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
+	if (!node)
+		return NULL;
+
+	err = xsk_map_inc(map);
+	if (err) {
+		kfree(node);
+		return ERR_PTR(err);
+	}
+
+	node->map = map;
+	node->map_entry = map_entry;
+	return node;
+}
+
+static void xsk_map_node_free(struct xsk_map_node *node)
+{
+	xsk_map_put(node->map);
+	kfree(node);
+}
+
+static void xsk_map_sock_add(struct xdp_sock *xs, struct xsk_map_node *node)
+{
+	spin_lock_bh(&xs->map_list_lock);
+	list_add_tail(&node->node, &xs->map_list);
+	spin_unlock_bh(&xs->map_list_lock);
+}
+
+static void xsk_map_sock_delete(struct xdp_sock *xs,
+				struct xdp_sock **map_entry)
+{
+	struct xsk_map_node *n, *tmp;
+
+	spin_lock_bh(&xs->map_list_lock);
+	list_for_each_entry_safe(n, tmp, &xs->map_list, node) {
+		if (map_entry == n->map_entry) {
+			list_del(&n->node);
+			xsk_map_node_free(n);
+		}
+	}
+	spin_unlock_bh(&xs->map_list_lock);
+}
+
 static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 {
 	struct xsk_map *m;
@@ -34,6 +97,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	bpf_map_init_from_attr(&m->map, attr);
+	spin_lock_init(&m->lock);
 
 	cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
 	cost += sizeof(struct list_head) * num_possible_cpus();
@@ -71,21 +135,9 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 static void xsk_map_free(struct bpf_map *map)
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	int i;
 
 	bpf_clear_redirect_map(map);
 	synchronize_net();
-
-	for (i = 0; i < map->max_entries; i++) {
-		struct xdp_sock *xs;
-
-		xs = m->xsk_map[i];
-		if (!xs)
-			continue;
-
-		sock_put((struct sock *)xs);
-	}
-
 	free_percpu(m->flush_list);
 	bpf_map_area_free(m->xsk_map);
 	kfree(m);
@@ -165,7 +217,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
 	u32 i = *(u32 *)key, fd = *(u32 *)value;
-	struct xdp_sock *xs, *old_xs;
+	struct xdp_sock *xs, *old_xs, **entry;
+	struct xsk_map_node *node;
 	struct socket *sock;
 	int err;
 
@@ -192,11 +245,19 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 		return -EOPNOTSUPP;
 	}
 
-	sock_hold(sock->sk);
+	entry = &m->xsk_map[i];
+	node = xsk_map_node_alloc(m, entry);
+	if (IS_ERR(node)) {
+		sockfd_put(sock);
+		return PTR_ERR(node);
+	}
 
-	old_xs = xchg(&m->xsk_map[i], xs);
+	spin_lock_bh(&m->lock);
+	xsk_map_sock_add(xs, node);
+	old_xs = xchg(entry, xs);
 	if (old_xs)
-		sock_put((struct sock *)old_xs);
+		xsk_map_sock_delete(old_xs, entry);
+	spin_unlock_bh(&m->lock);
 
 	sockfd_put(sock);
 	return 0;
@@ -205,19 +266,33 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 static int xsk_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	struct xdp_sock *old_xs;
+	struct xdp_sock *old_xs, **map_entry;
 	int k = *(u32 *)key;
 
 	if (k >= map->max_entries)
 		return -EINVAL;
 
-	old_xs = xchg(&m->xsk_map[k], NULL);
+	spin_lock_bh(&m->lock);
+	map_entry = &m->xsk_map[k];
+	old_xs = xchg(map_entry, NULL);
 	if (old_xs)
-		sock_put((struct sock *)old_xs);
+		xsk_map_sock_delete(old_xs, map_entry);
+	spin_unlock_bh(&m->lock);
 
 	return 0;
 }
 
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+			     struct xdp_sock **map_entry)
+{
+	spin_lock_bh(&map->lock);
+	if (READ_ONCE(*map_entry) == xs) {
+		WRITE_ONCE(*map_entry, NULL);
+		xsk_map_sock_delete(xs, map_entry);
+	}
+	spin_unlock_bh(&map->lock);
+}
+
 const struct bpf_map_ops xsk_map_ops = {
 	.map_alloc = xsk_map_alloc,
 	.map_free = xsk_map_free,
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 59b57d708697..c3447bad608a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -362,6 +362,50 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
 	dev_put(dev);
 }
 
+static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
+					      struct xdp_sock ***map_entry)
+{
+	struct xsk_map *map = NULL;
+	struct xsk_map_node *node;
+
+	*map_entry = NULL;
+
+	spin_lock_bh(&xs->map_list_lock);
+	node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
+					node);
+	if (node) {
+		WARN_ON(xsk_map_inc(node->map));
+		map = node->map;
+		*map_entry = node->map_entry;
+	}
+	spin_unlock_bh(&xs->map_list_lock);
+	return map;
+}
+
+static void xsk_delete_from_maps(struct xdp_sock *xs)
+{
+	/* This function removes the current XDP socket from all the
+	 * maps it resides in. We need to take extra care here, due to
+	 * the two locks involved. Each map has a lock synchronizing
+	 * updates to the entries, and each socket has a lock that
+	 * synchronizes access to the list of maps (map_list). For
+	 * deadlock avoidance the locks need to be taken in the order
+	 * "map lock"->"socket map list lock". We start off by
+	 * accessing the socket map list, and take a reference to the
+	 * map to guarantee existence. Then we ask the map to remove
+	 * the socket, which tries to remove the socket from the
+	 * map. Note that there might be updates to the map between
+	 * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
+	 */
+	struct xdp_sock **map_entry = NULL;
+	struct xsk_map *map;
+
+	while ((map = xsk_get_map_list_entry(xs, &map_entry))) {
+		xsk_map_try_sock_delete(map, xs, map_entry);
+		xsk_map_put(map);
+	}
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -381,6 +425,7 @@ static int xsk_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	local_bh_enable();
 
+	xsk_delete_from_maps(xs);
 	xsk_unbind_dev(xs);
 
 	xskq_destroy(xs->rx);
@@ -855,6 +900,9 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 	spin_lock_init(&xs->rx_lock);
 	spin_lock_init(&xs->tx_completion_lock);
 
+	INIT_LIST_HEAD(&xs->map_list);
+	spin_lock_init(&xs->map_list_lock);
+
 	mutex_lock(&net->xdp.lock);
 	sk_add_node_rcu(sk, &net->xdp.list);
 	mutex_unlock(&net->xdp.lock);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH bpf-next v4 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP
  2019-08-02  8:11 [PATCH bpf-next v4 0/2] net: xdp: XSKMAP improvements Björn Töpel
  2019-08-02  8:11 ` [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released Björn Töpel
@ 2019-08-02  8:11 ` Björn Töpel
  2019-08-02 18:28   ` Jonathan Lemon
  1 sibling, 1 reply; 9+ messages in thread
From: Björn Töpel @ 2019-08-02  8:11 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, bruce.richardson,
	songliubraving, bpf

From: Björn Töpel <bjorn.topel@intel.com>

The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
an entry. This patch addresses that.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 kernel/bpf/xskmap.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 780639309f6b..8864dfe1d9ef 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -226,8 +226,6 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 		return -EINVAL;
 	if (unlikely(i >= m->map.max_entries))
 		return -E2BIG;
-	if (unlikely(map_flags == BPF_NOEXIST))
-		return -EEXIST;
 
 	sock = sockfd_lookup(fd, &err);
 	if (!sock)
@@ -253,14 +251,29 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 	}
 
 	spin_lock_bh(&m->lock);
+	entry = &m->xsk_map[i];
+	old_xs = READ_ONCE(*entry);
+	if (old_xs && map_flags == BPF_NOEXIST) {
+		err = -EEXIST;
+		goto out;
+	} else if (!old_xs && map_flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto out;
+	}
 	xsk_map_sock_add(xs, node);
-	old_xs = xchg(entry, xs);
+	WRITE_ONCE(*entry, xs);
 	if (old_xs)
 		xsk_map_sock_delete(old_xs, entry);
 	spin_unlock_bh(&m->lock);
 
 	sockfd_put(sock);
 	return 0;
+
+out:
+	spin_unlock_bh(&m->lock);
+	sockfd_put(sock);
+	xsk_map_node_free(node);
+	return err;
 }
 
 static int xsk_map_delete_elem(struct bpf_map *map, void *key)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
  2019-08-02  8:11 ` [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released Björn Töpel
@ 2019-08-02 18:28   ` Jonathan Lemon
  2019-08-12 12:28   ` Daniel Borkmann
  1 sibling, 0 replies; 9+ messages in thread
From: Jonathan Lemon @ 2019-08-02 18:28 UTC (permalink / raw)
  To: Björn Töpel
  Cc: ast, daniel, netdev, Björn Töpel, magnus.karlsson,
	bruce.richardson, songliubraving, bpf



On 2 Aug 2019, at 1:11, Björn Töpel wrote:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> When an AF_XDP socket is released/closed the XSKMAP still holds a
> reference to the socket in a "released" state. The socket will still
> use the netdev queue resource, and block newly created sockets from
> attaching to that queue, but no user application can access the
> fill/complete/rx/tx queues. This results in that all applications need
> to explicitly clear the map entry from the old "zombie state"
> socket. This should be done automatically.
>
> In this patch, the sockets tracks, and have a reference to, which maps
> it resides in. When the socket is released, it will remove itself from
> all maps.
>
> Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

Reviewed-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH bpf-next v4 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP
  2019-08-02  8:11 ` [PATCH bpf-next v4 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP Björn Töpel
@ 2019-08-02 18:28   ` Jonathan Lemon
  0 siblings, 0 replies; 9+ messages in thread
From: Jonathan Lemon @ 2019-08-02 18:28 UTC (permalink / raw)
  To: Björn Töpel
  Cc: ast, daniel, netdev, Björn Töpel, magnus.karlsson,
	bruce.richardson, songliubraving, bpf



On 2 Aug 2019, at 1:11, Björn Töpel wrote:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
> an entry. This patch addresses that.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

Reviewed-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
  2019-08-02  8:11 ` [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released Björn Töpel
  2019-08-02 18:28   ` Jonathan Lemon
@ 2019-08-12 12:28   ` Daniel Borkmann
  2019-08-12 17:25     ` Björn Töpel
  1 sibling, 1 reply; 9+ messages in thread
From: Daniel Borkmann @ 2019-08-12 12:28 UTC (permalink / raw)
  To: Björn Töpel, ast, netdev
  Cc: Björn Töpel, magnus.karlsson, bruce.richardson,
	songliubraving, bpf

On 8/2/19 10:11 AM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> When an AF_XDP socket is released/closed the XSKMAP still holds a
> reference to the socket in a "released" state. The socket will still
> use the netdev queue resource, and block newly created sockets from
> attaching to that queue, but no user application can access the
> fill/complete/rx/tx queues. This results in that all applications need
> to explicitly clear the map entry from the old "zombie state"
> socket. This should be done automatically.
> 
> In this patch, the sockets tracks, and have a reference to, which maps
> it resides in. When the socket is released, it will remove itself from
> all maps.
> 
> Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

[ Sorry for the review delay, was on PTO and catching up with things. ]

Overall looks good to me, I think better than previous versions. One question /
clarification for below:

> ---
>   include/net/xdp_sock.h |  18 +++++++
>   kernel/bpf/xskmap.c    | 113 ++++++++++++++++++++++++++++++++++-------
>   net/xdp/xsk.c          |  48 +++++++++++++++++
>   3 files changed, 160 insertions(+), 19 deletions(-)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 69796d264f06..066e3ae446a8 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -50,6 +50,16 @@ struct xdp_umem {
>   	struct list_head xsk_list;
>   };
>   
> +/* Nodes are linked in the struct xdp_sock map_list field, and used to
> + * track which maps a certain socket reside in.
> + */
> +struct xsk_map;
> +struct xsk_map_node {
> +	struct list_head node;
> +	struct xsk_map *map;
> +	struct xdp_sock **map_entry;
> +};
> +
>   struct xdp_sock {
>   	/* struct sock must be the first member of struct xdp_sock */
>   	struct sock sk;
> @@ -75,6 +85,9 @@ struct xdp_sock {
>   	/* Protects generic receive. */
>   	spinlock_t rx_lock;
>   	u64 rx_dropped;
> +	struct list_head map_list;
> +	/* Protects map_list */
> +	spinlock_t map_list_lock;
>   };
>   
>   struct xdp_buff;
> @@ -96,6 +109,11 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
>   void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
>   struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
>   
> +void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
> +			     struct xdp_sock **map_entry);
> +int xsk_map_inc(struct xsk_map *map);
> +void xsk_map_put(struct xsk_map *map);
> +
>   static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
>   {
>   	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
> index 9bb96ace9fa1..780639309f6b 100644
> --- a/kernel/bpf/xskmap.c
> +++ b/kernel/bpf/xskmap.c
> @@ -13,8 +13,71 @@ struct xsk_map {
>   	struct bpf_map map;
>   	struct xdp_sock **xsk_map;
>   	struct list_head __percpu *flush_list;
> +	spinlock_t lock; /* Synchronize map updates */
>   };
>   
> +int xsk_map_inc(struct xsk_map *map)
> +{
> +	struct bpf_map *m = &map->map;
> +
> +	m = bpf_map_inc(m, false);
> +	return IS_ERR(m) ? PTR_ERR(m) : 0;
> +}
> +
> +void xsk_map_put(struct xsk_map *map)
> +{
> +	bpf_map_put(&map->map);
> +}
> +
> +static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
> +					       struct xdp_sock **map_entry)
> +{
> +	struct xsk_map_node *node;
> +	int err;
> +
> +	node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
> +	if (!node)
> +		return NULL;
> +
> +	err = xsk_map_inc(map);
> +	if (err) {
> +		kfree(node);
> +		return ERR_PTR(err);
> +	}
> +
> +	node->map = map;
> +	node->map_entry = map_entry;
> +	return node;
> +}
> +
> +static void xsk_map_node_free(struct xsk_map_node *node)
> +{
> +	xsk_map_put(node->map);
> +	kfree(node);
> +}
> +
> +static void xsk_map_sock_add(struct xdp_sock *xs, struct xsk_map_node *node)
> +{
> +	spin_lock_bh(&xs->map_list_lock);
> +	list_add_tail(&node->node, &xs->map_list);
> +	spin_unlock_bh(&xs->map_list_lock);
> +}
> +
> +static void xsk_map_sock_delete(struct xdp_sock *xs,
> +				struct xdp_sock **map_entry)
> +{
> +	struct xsk_map_node *n, *tmp;
> +
> +	spin_lock_bh(&xs->map_list_lock);
> +	list_for_each_entry_safe(n, tmp, &xs->map_list, node) {
> +		if (map_entry == n->map_entry) {
> +			list_del(&n->node);
> +			xsk_map_node_free(n);
> +		}
> +	}
> +	spin_unlock_bh(&xs->map_list_lock);
> +}
> +
>   static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
>   {
>   	struct xsk_map *m;
> @@ -34,6 +97,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
>   		return ERR_PTR(-ENOMEM);
>   
>   	bpf_map_init_from_attr(&m->map, attr);
> +	spin_lock_init(&m->lock);
>   
>   	cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
>   	cost += sizeof(struct list_head) * num_possible_cpus();
> @@ -71,21 +135,9 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
>   static void xsk_map_free(struct bpf_map *map)
>   {
>   	struct xsk_map *m = container_of(map, struct xsk_map, map);
> -	int i;
>   
>   	bpf_clear_redirect_map(map);
>   	synchronize_net();
> -
> -	for (i = 0; i < map->max_entries; i++) {
> -		struct xdp_sock *xs;
> -
> -		xs = m->xsk_map[i];
> -		if (!xs)
> -			continue;
> -
> -		sock_put((struct sock *)xs);
> -	}
> -
>   	free_percpu(m->flush_list);
>   	bpf_map_area_free(m->xsk_map);
>   	kfree(m);
> @@ -165,7 +217,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>   {
>   	struct xsk_map *m = container_of(map, struct xsk_map, map);
>   	u32 i = *(u32 *)key, fd = *(u32 *)value;
> -	struct xdp_sock *xs, *old_xs;
> +	struct xdp_sock *xs, *old_xs, **entry;
> +	struct xsk_map_node *node;
>   	struct socket *sock;
>   	int err;
>   
> @@ -192,11 +245,19 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>   		return -EOPNOTSUPP;
>   	}
>   
> -	sock_hold(sock->sk);
> +	entry = &m->xsk_map[i];
> +	node = xsk_map_node_alloc(m, entry);
> +	if (IS_ERR(node)) {
> +		sockfd_put(sock);
> +		return PTR_ERR(node);
> +	}
>   
> -	old_xs = xchg(&m->xsk_map[i], xs);
> +	spin_lock_bh(&m->lock);
> +	xsk_map_sock_add(xs, node);
> +	old_xs = xchg(entry, xs);
>   	if (old_xs)
> -		sock_put((struct sock *)old_xs);
> +		xsk_map_sock_delete(old_xs, entry);
> +	spin_unlock_bh(&m->lock);
>   
>   	sockfd_put(sock);
>   	return 0;
> @@ -205,19 +266,33 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>   static int xsk_map_delete_elem(struct bpf_map *map, void *key)
>   {
>   	struct xsk_map *m = container_of(map, struct xsk_map, map);
> -	struct xdp_sock *old_xs;
> +	struct xdp_sock *old_xs, **map_entry;
>   	int k = *(u32 *)key;
>   
>   	if (k >= map->max_entries)
>   		return -EINVAL;
>   
> -	old_xs = xchg(&m->xsk_map[k], NULL);
> +	spin_lock_bh(&m->lock);
> +	map_entry = &m->xsk_map[k];
> +	old_xs = xchg(map_entry, NULL);
>   	if (old_xs)
> -		sock_put((struct sock *)old_xs);
> +		xsk_map_sock_delete(old_xs, map_entry);
> +	spin_unlock_bh(&m->lock);
>   
>   	return 0;
>   }
>   
> +void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
> +			     struct xdp_sock **map_entry)
> +{
> +	spin_lock_bh(&map->lock);
> +	if (READ_ONCE(*map_entry) == xs) {
> +		WRITE_ONCE(*map_entry, NULL);
> +		xsk_map_sock_delete(xs, map_entry);
> +	}
> +	spin_unlock_bh(&map->lock);
> +}
> +
>   const struct bpf_map_ops xsk_map_ops = {
>   	.map_alloc = xsk_map_alloc,
>   	.map_free = xsk_map_free,
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 59b57d708697..c3447bad608a 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -362,6 +362,50 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
>   	dev_put(dev);
>   }
>   
> +static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
> +					      struct xdp_sock ***map_entry)
> +{
> +	struct xsk_map *map = NULL;
> +	struct xsk_map_node *node;
> +
> +	*map_entry = NULL;
> +
> +	spin_lock_bh(&xs->map_list_lock);
> +	node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
> +					node);
> +	if (node) {
> +		WARN_ON(xsk_map_inc(node->map));

Can you elaborate on the refcount usage here and against what scenario it is protecting?

Do we pretend it never fails on the bpf_map_inc() wrt the WARN_ON(), why that (what
makes it different from the xsk_map_node_alloc() inc above where we do error out)?

> +		map = node->map;
> +		*map_entry = node->map_entry;
> +	}
> +	spin_unlock_bh(&xs->map_list_lock);
> +	return map;
> +}
> +
> +static void xsk_delete_from_maps(struct xdp_sock *xs)
> +{
> +	/* This function removes the current XDP socket from all the
> +	 * maps it resides in. We need to take extra care here, due to
> +	 * the two locks involved. Each map has a lock synchronizing
> +	 * updates to the entries, and each socket has a lock that
> +	 * synchronizes access to the list of maps (map_list). For
> +	 * deadlock avoidance the locks need to be taken in the order
> +	 * "map lock"->"socket map list lock". We start off by
> +	 * accessing the socket map list, and take a reference to the
> +	 * map to guarantee existence. Then we ask the map to remove
> +	 * the socket, which tries to remove the socket from the
> +	 * map. Note that there might be updates to the map between
> +	 * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
> +	 */
> +	struct xdp_sock **map_entry = NULL;
> +	struct xsk_map *map;
> +
> +	while ((map = xsk_get_map_list_entry(xs, &map_entry))) {
> +		xsk_map_try_sock_delete(map, xs, map_entry);
> +		xsk_map_put(map);
> +	}
> +}
> +
>   static int xsk_release(struct socket *sock)
>   {
>   	struct sock *sk = sock->sk;
> @@ -381,6 +425,7 @@ static int xsk_release(struct socket *sock)
>   	sock_prot_inuse_add(net, sk->sk_prot, -1);
>   	local_bh_enable();
>   
> +	xsk_delete_from_maps(xs);
>   	xsk_unbind_dev(xs);
>   
>   	xskq_destroy(xs->rx);
> @@ -855,6 +900,9 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
>   	spin_lock_init(&xs->rx_lock);
>   	spin_lock_init(&xs->tx_completion_lock);
>   
> +	INIT_LIST_HEAD(&xs->map_list);
> +	spin_lock_init(&xs->map_list_lock);
> +
>   	mutex_lock(&net->xdp.lock);
>   	sk_add_node_rcu(sk, &net->xdp.list);
>   	mutex_unlock(&net->xdp.lock);
> 

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
  2019-08-12 12:28   ` Daniel Borkmann
@ 2019-08-12 17:25     ` Björn Töpel
  2019-08-14 23:17       ` Daniel Borkmann
  0 siblings, 1 reply; 9+ messages in thread
From: Björn Töpel @ 2019-08-12 17:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Netdev, Björn Töpel, Karlsson,
	Magnus, Bruce Richardson, Song Liu, bpf

On Mon, 12 Aug 2019 at 14:28, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
[...]
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 59b57d708697..c3447bad608a 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -362,6 +362,50 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
> >       dev_put(dev);
> >   }
> >
> > +static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
> > +                                           struct xdp_sock ***map_entry)
> > +{
> > +     struct xsk_map *map = NULL;
> > +     struct xsk_map_node *node;
> > +
> > +     *map_entry = NULL;
> > +
> > +     spin_lock_bh(&xs->map_list_lock);
> > +     node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
> > +                                     node);
> > +     if (node) {
> > +             WARN_ON(xsk_map_inc(node->map));
>
> Can you elaborate on the refcount usage here and against what scenario it is protecting?
>

Thanks for having a look!

First we access the map_list (under the lock) and pull out the map
which we intend to clean. In order to clear the map entry, we need to
a reference to the map. However, when the map_list_lock is released,
there's a window where the map entry can be cleared and the map can be
destroyed, and making the "map", which is used in
xsk_delete_from_maps, stale. To guarantee existence the additional
refinc is required. Makes sense?

> Do we pretend it never fails on the bpf_map_inc() wrt the WARN_ON(),
> why that (what makes it different from the xsk_map_node_alloc() inc
> above where we do error out)?

Hmm, given that we're in a cleanup (socket release), we can't really
return any error. What would be a more robust way? Retrying? AFAIK the
release ops return an int, but it's not checked/used.

> > +             map = node->map;
> > +             *map_entry = node->map_entry;
> > +     }
> > +     spin_unlock_bh(&xs->map_list_lock);
> > +     return map;
> > +}
> > +
> > +static void xsk_delete_from_maps(struct xdp_sock *xs)
> > +{
> > +     /* This function removes the current XDP socket from all the
> > +      * maps it resides in. We need to take extra care here, due to
> > +      * the two locks involved. Each map has a lock synchronizing
> > +      * updates to the entries, and each socket has a lock that
> > +      * synchronizes access to the list of maps (map_list). For
> > +      * deadlock avoidance the locks need to be taken in the order
> > +      * "map lock"->"socket map list lock". We start off by
> > +      * accessing the socket map list, and take a reference to the
> > +      * map to guarantee existence. Then we ask the map to remove
> > +      * the socket, which tries to remove the socket from the
> > +      * map. Note that there might be updates to the map between
> > +      * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
> > +      */

I tried to clarify here, but I obviously need to do a better job. :-)


Björn

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
  2019-08-12 17:25     ` Björn Töpel
@ 2019-08-14 23:17       ` Daniel Borkmann
  2019-08-15  8:28         ` Björn Töpel
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Borkmann @ 2019-08-14 23:17 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexei Starovoitov, Netdev, Björn Töpel, Karlsson,
	Magnus, Bruce Richardson, Song Liu, bpf

On 8/12/19 7:25 PM, Björn Töpel wrote:
> On Mon, 12 Aug 2019 at 14:28, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
> [...]
>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>> index 59b57d708697..c3447bad608a 100644
>>> --- a/net/xdp/xsk.c
>>> +++ b/net/xdp/xsk.c
>>> @@ -362,6 +362,50 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
>>>        dev_put(dev);
>>>    }
>>>
>>> +static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
>>> +                                           struct xdp_sock ***map_entry)
>>> +{
>>> +     struct xsk_map *map = NULL;
>>> +     struct xsk_map_node *node;
>>> +
>>> +     *map_entry = NULL;
>>> +
>>> +     spin_lock_bh(&xs->map_list_lock);
>>> +     node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
>>> +                                     node);
>>> +     if (node) {
>>> +             WARN_ON(xsk_map_inc(node->map));
>>
>> Can you elaborate on the refcount usage here and against what scenario it is protecting?
> 
> Thanks for having a look!
> 
> First we access the map_list (under the lock) and pull out the map
> which we intend to clean. In order to clear the map entry, we need to
> a reference to the map. However, when the map_list_lock is released,
> there's a window where the map entry can be cleared and the map can be
> destroyed, and making the "map", which is used in
> xsk_delete_from_maps, stale. To guarantee existence the additional
> refinc is required. Makes sense?

Seems reasonable to me, and inc as opposed to inc_not_zero is also fine
here since at this point in time we're still holding one reference to
the map. But I think there's a catch with the current code that still
needs fixing:

Imagine you do a xsk_map_update_elem() where we have a situation where
xs == old_xs. There, we first do the xsk_map_sock_add() to add the new
xsk map node at the tail of the socket's xs->map_list. We do the xchg()
and then xsk_map_sock_delete() for old_xs which then walks xs->map_list
again and purges all entries including the just newly created one. This
means we'll end up with an xs socket at the given map slot, but the xs
socket has empty xs->map_list. This means we could release the xs sock
and the xsk_delete_from_maps() won't need to clean up anything anymore
but yet the xs is still in the map slot, so if you redirect to that
socket, it would be use-after-free, no?

>> Do we pretend it never fails on the bpf_map_inc() wrt the WARN_ON(),
>> why that (what makes it different from the xsk_map_node_alloc() inc
>> above where we do error out)?
> 
> Hmm, given that we're in a cleanup (socket release), we can't really
> return any error. What would be a more robust way? Retrying? AFAIK the
> release ops return an int, but it's not checked/used.
> 
>>> +             map = node->map;
>>> +             *map_entry = node->map_entry;
>>> +     }
>>> +     spin_unlock_bh(&xs->map_list_lock);
>>> +     return map;
>>> +}
>>> +
>>> +static void xsk_delete_from_maps(struct xdp_sock *xs)
>>> +{
>>> +     /* This function removes the current XDP socket from all the
>>> +      * maps it resides in. We need to take extra care here, due to
>>> +      * the two locks involved. Each map has a lock synchronizing
>>> +      * updates to the entries, and each socket has a lock that
>>> +      * synchronizes access to the list of maps (map_list). For
>>> +      * deadlock avoidance the locks need to be taken in the order
>>> +      * "map lock"->"socket map list lock". We start off by
>>> +      * accessing the socket map list, and take a reference to the
>>> +      * map to guarantee existence. Then we ask the map to remove
>>> +      * the socket, which tries to remove the socket from the
>>> +      * map. Note that there might be updates to the map between
>>> +      * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
>>> +      */
> 
> I tried to clarify here, but I obviously need to do a better job. :-)
> 
> 
> Björn
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
  2019-08-14 23:17       ` Daniel Borkmann
@ 2019-08-15  8:28         ` Björn Töpel
  0 siblings, 0 replies; 9+ messages in thread
From: Björn Töpel @ 2019-08-15  8:28 UTC (permalink / raw)
  To: Daniel Borkmann, Björn Töpel
  Cc: Alexei Starovoitov, Netdev, Karlsson, Magnus, Bruce Richardson,
	Song Liu, bpf

On 2019-08-15 01:17, Daniel Borkmann wrote:
> Seems reasonable to me, and inc as opposed to inc_not_zero is also fine
> here since at this point in time we're still holding one reference to
> the map.

Ok, good.

> But I think there's a catch with the current code that still
> needs fixing:
> 
> Imagine you do a xsk_map_update_elem() where we have a situation where
> xs == old_xs. There, we first do the xsk_map_sock_add() to add the new
> xsk map node at the tail of the socket's xs->map_list. We do the xchg()
> and then xsk_map_sock_delete() for old_xs which then walks xs->map_list
> again and purges all entries including the just newly created one. This
> means we'll end up with an xs socket at the given map slot, but the xs
> socket has empty xs->map_list. This means we could release the xs sock
> and the xsk_delete_from_maps() won't need to clean up anything anymore
> but yet the xs is still in the map slot, so if you redirect to that
> socket, it would be use-after-free, no?

Ah, correct. Checking against self-assignment, or doing the delete prior 
add. I'll spin a v5.

...and again, thanks for detailed review, Daniel!


Björn

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-02  8:11 [PATCH bpf-next v4 0/2] net: xdp: XSKMAP improvements Björn Töpel
2019-08-02  8:11 ` [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released Björn Töpel
2019-08-02 18:28   ` Jonathan Lemon
2019-08-12 12:28   ` Daniel Borkmann
2019-08-12 17:25     ` Björn Töpel
2019-08-14 23:17       ` Daniel Borkmann
2019-08-15  8:28         ` Björn Töpel
2019-08-02  8:11 ` [PATCH bpf-next v4 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP Björn Töpel
2019-08-02 18:28   ` Jonathan Lemon

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org bpf@archiver.kernel.org
	public-inbox-index bpf


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/ public-inbox