Netdev Archive on lore.kernel.org
* [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
@ 2019-10-08  6:16 Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions Sridhar Samudrala
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

This is a rework of the following patch series
https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r
that tried to enable direct receive by bypassing the XDP program attached
to the device.

Based on community feedback and suggestions from Bjorn, the semantics of
the implementation have been changed: direct receive is now enabled on
AF_XDP sockets that are bound to a queue only when there is no normal XDP
program attached to the device.

This is accomplished by introducing a special BPF prog pointer (DIRECT_XSK)
that is attached at the time of binding an AF_XDP socket to a queue of a
device. This is done only if there is no other XDP program attached to
the device. The normal XDP program has precedence and will replace the
DIRECT_XSK prog if it is attached later. The main reason to introduce a
special BPF prog pointer is to minimize driver changes. The only driver
change is to use the bpf_get_prog_id() helper when querying the prog id.

Attaching a normal XDP program always takes precedence and removes the
direct xsk program. The direct XSK program is re-attached automatically
when the normal XDP program is removed, provided there are still AF_XDP
direct sockets associated with that device.

A static key is used to control this feature in order to avoid any overhead
in the normal XDP datapath when there are no AF_XDP sockets in direct-xsk
mode.
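
The userspace side needs nothing beyond the new bind flag. Below is a
minimal sketch for illustration only (not part of the patches): UMEM
registration and ring setup are omitted, so bind() will not succeed as-is,
and the interface name and queue id are just the ones used in the tests
further down.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif
#ifndef XDP_DIRECT
#define XDP_DIRECT (1 << 4)	/* new bind flag added by this series */
#endif

int main(void)
{
	struct sockaddr_xdp sxdp = {};
	int fd;

	fd = socket(AF_XDP, SOCK_RAW, 0);
	if (fd < 0) {
		perror("socket(AF_XDP)");
		return 1;
	}

	/* ... XDP_UMEM_REG and fill/completion/rx/tx ring setup go here ... */

	sxdp.sxdp_family = AF_XDP;
	sxdp.sxdp_ifindex = if_nametoindex("enp66s0f0");
	sxdp.sxdp_queue_id = 1;
	sxdp.sxdp_flags = XDP_DIRECT;	/* direct receive, no XDP prog attached */

	if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0) {
		fprintf(stderr, "bind: %s\n", strerror(errno));
		close(fd);
		return 1;
	}

	/* rx/tx proceeds via the mapped rings as with any AF_XDP socket */
	close(fd);
	return 0;
}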

Here is some performance data I collected on my Intel Ivy Bridge based
development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz).
NIC: Intel 40Gb ethernet (i40e)

xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
   default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
   direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -d -q 1
6.1x improvement in drop rate

xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
   default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
   direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
6x improvement in drop rate

xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
   default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
   direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
3.5x improvement in l2fwd rate

xdpsock l2fwd 2 core (app and queue's irq pinned to different cores)
   default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
   direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
4.5x improvement in l2fwd rate

dpdk-pktgen is used to send 64-byte UDP packets from a link partner, and an
ethtool ntuple flow rule is used to steer those packets to queue 1 on the
system under test.
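
For reference, the flow rule was an ntuple filter along these lines (the
actual match fields are setup specific; dst-port here is only a placeholder):

   ethtool -N enp66s0f0 flow-type udp4 dst-port <port> action 1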

Sridhar Samudrala (4):
  bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions.
  xsk: allow AF_XDP sockets to receive packets directly from a queue
  libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag.
  xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode

 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c     |  2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c  |  2 +-
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c       |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  3 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 drivers/net/ethernet/qlogic/qede/qede_filter.c    |  2 +-
 drivers/net/ethernet/socionext/netsec.c           |  2 +-
 drivers/net/netdevsim/bpf.c                       |  6 ++-
 drivers/net/tun.c                                 |  4 +-
 drivers/net/veth.c                                |  4 +-
 drivers/net/virtio_net.c                          |  3 +-
 include/linux/bpf.h                               |  3 ++
 include/linux/filter.h                            | 18 +++++++
 include/linux/netdevice.h                         | 10 ++++
 include/net/xdp_sock.h                            |  5 ++
 include/trace/events/xdp.h                        |  4 +-
 include/uapi/linux/if_xdp.h                       |  5 ++
 kernel/bpf/arraymap.c                             |  2 +-
 kernel/bpf/cgroup.c                               |  2 +-
 kernel/bpf/core.c                                 |  2 +-
 kernel/bpf/syscall.c                              | 33 +++++++++----
 kernel/events/core.c                              |  2 +-
 kernel/trace/bpf_trace.c                          |  2 +-
 net/core/dev.c                                    | 54 ++++++++++++++++++++-
 net/core/filter.c                                 | 58 +++++++++++++++++++++++
 net/core/flow_dissector.c                         |  2 +-
 net/core/rtnetlink.c                              |  2 +-
 net/core/xdp.c                                    |  2 +-
 net/ipv6/seg6_local.c                             |  2 +-
 net/sched/act_bpf.c                               |  2 +-
 net/sched/cls_bpf.c                               |  2 +-
 net/xdp/xsk.c                                     | 51 +++++++++++++++++++-
 samples/bpf/xdpsock_user.c                        | 17 +++++--
 tools/include/uapi/linux/if_xdp.h                 |  5 ++
 tools/lib/bpf/xsk.c                               |  6 +++
 38 files changed, 279 insertions(+), 53 deletions(-)

-- 
2.14.5



* [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions.
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
@ 2019-10-08  6:16 ` Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

Currently, users of the bpf prog id access it directly via prog->aux->id.
The next patch in this series introduces a special bpf prog pointer to let
AF_XDP sockets bound to a queue receive packets from that queue directly.
As this special bpf prog pointer is not associated with any struct bpf_prog,
the prog id is not accessible via prog->aux. To abstract this from the
users, two helper functions to get and set the prog id are introduced and
all users are updated to use them.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c     |  2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c  |  2 +-
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c       |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  3 +--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |  3 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +--
 drivers/net/ethernet/qlogic/qede/qede_filter.c    |  2 +-
 drivers/net/ethernet/socionext/netsec.c           |  2 +-
 drivers/net/netdevsim/bpf.c                       |  6 +++--
 drivers/net/tun.c                                 |  4 +--
 drivers/net/veth.c                                |  4 +--
 drivers/net/virtio_net.c                          |  3 +--
 include/linux/bpf.h                               |  3 +++
 include/trace/events/xdp.h                        |  4 +--
 kernel/bpf/arraymap.c                             |  2 +-
 kernel/bpf/cgroup.c                               |  2 +-
 kernel/bpf/core.c                                 |  2 +-
 kernel/bpf/syscall.c                              | 30 +++++++++++++++++------
 kernel/events/core.c                              |  2 +-
 kernel/trace/bpf_trace.c                          |  2 +-
 net/core/dev.c                                    |  4 +--
 net/core/flow_dissector.c                         |  2 +-
 net/core/rtnetlink.c                              |  2 +-
 net/core/xdp.c                                    |  2 +-
 net/ipv6/seg6_local.c                             |  2 +-
 net/sched/act_bpf.c                               |  2 +-
 net/sched/cls_bpf.c                               |  2 +-
 29 files changed, 58 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index c6f6f2033880..ef6dd2881264 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -330,7 +330,7 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 		rc = bnxt_xdp_set(bp, xdp->prog);
 		break;
 	case XDP_QUERY_PROG:
-		xdp->prog_id = bp->xdp_prog ? bp->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(bp->xdp_prog);
 		rc = 0;
 		break;
 	default:
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 40a44dcb3d9b..5c6c680252c1 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1912,7 +1912,7 @@ static int nicvf_xdp(struct net_device *netdev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return nicvf_xdp_setup(nic, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = nic->xdp_prog ? nic->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(nic->xdp_prog);
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 162d7d8fb295..8aef671d0731 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -1831,7 +1831,7 @@ static int dpaa2_eth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return setup_xdp(dev, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = priv->xdp_prog ? priv->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(priv->xdp_prog);
 		break;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..0a59937b376e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12820,7 +12820,7 @@ static int i40e_xdp(struct net_device *dev,
 	case XDP_SETUP_PROG:
 		return i40e_xdp_setup(vsi, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(vsi->xdp_prog);
 		return 0;
 	case XDP_SETUP_XSK_UMEM:
 		return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ce2397306b9..c51ad24be037 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -10283,8 +10283,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return ixgbe_xdp_setup(dev, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = adapter->xdp_prog ?
-			adapter->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(adapter->xdp_prog);
 		return 0;
 	case XDP_SETUP_XSK_UMEM:
 		return ixgbe_xsk_umem_setup(adapter, xdp->xsk.umem,
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 076f2da36f27..dcf32a32cde8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -4496,8 +4496,7 @@ static int ixgbevf_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return ixgbevf_xdp_setup(dev, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = adapter->xdp_prog ?
-			       adapter->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(adapter->xdp_prog);
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 40ec5acf79c0..cf086748306f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2881,8 +2881,7 @@ static u32 mlx4_xdp_query(struct net_device *dev)
 	xdp_prog = rcu_dereference_protected(
 		priv->rx_ring[0]->xdp_prog,
 		lockdep_is_held(&mdev->state_lock));
-	if (xdp_prog)
-		prog_id = xdp_prog->aux->id;
+	prog_id = bpf_get_prog_id(xdp_prog);
 	mutex_unlock(&mdev->state_lock);
 
 	return prog_id;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 7569287f8f3c..45c247ff05b2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4478,8 +4478,7 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 
 	mutex_lock(&priv->state_lock);
 	xdp_prog = priv->channels.params.xdp_prog;
-	if (xdp_prog)
-		prog_id = xdp_prog->aux->id;
+	prog_id = bpf_get_prog_id(xdp_prog);
 	mutex_unlock(&priv->state_lock);
 
 	return prog_id;
diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index 9a6a9a008714..75376bb85875 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -1119,7 +1119,7 @@ int qede_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return qede_xdp_set(edev, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = edev->xdp_prog ? edev->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(edev->xdp_prog);
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 55db7fbd43cc..19f5a1d850de 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -1820,7 +1820,7 @@ static int netsec_xdp(struct net_device *ndev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return netsec_xdp_setup(priv, xdp->prog, xdp->extack);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = priv->xdp_prog ? priv->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(priv->xdp_prog);
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index 2b74425822ab..2a24a7a41985 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -104,7 +104,7 @@ nsim_bpf_offload(struct netdevsim *ns, struct bpf_prog *prog, bool oldprog)
 	     "bad offload state, expected offload %sto be active",
 	     oldprog ? "" : "not ");
 	ns->bpf_offloaded = prog;
-	ns->bpf_offloaded_id = prog ? prog->aux->id : 0;
+	ns->bpf_offloaded_id = bpf_get_prog_id(prog);
 	nsim_prog_set_loaded(prog, true);
 
 	return 0;
@@ -218,6 +218,7 @@ static int nsim_bpf_create_prog(struct nsim_dev *nsim_dev,
 {
 	struct nsim_bpf_bound_prog *state;
 	char name[16];
+	u32 prog_id;
 
 	state = kzalloc(sizeof(*state), GFP_KERNEL);
 	if (!state)
@@ -235,7 +236,8 @@ static int nsim_bpf_create_prog(struct nsim_dev *nsim_dev,
 		return -ENOMEM;
 	}
 
-	debugfs_create_u32("id", 0400, state->ddir, &prog->aux->id);
+	prog_id = bpf_get_prog_id(prog);
+	debugfs_create_u32("id", 0400, state->ddir, &prog_id);
 	debugfs_create_file("state", 0400, state->ddir,
 			    &state->state, &nsim_bpf_string_fops);
 	debugfs_create_bool("loaded", 0400, state->ddir, &state->is_loaded);
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index aab0be40d443..396905d5c59a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1224,10 +1224,8 @@ static u32 tun_xdp_query(struct net_device *dev)
 	const struct bpf_prog *xdp_prog;
 
 	xdp_prog = rtnl_dereference(tun->xdp_prog);
-	if (xdp_prog)
-		return xdp_prog->aux->id;
 
-	return 0;
+	return bpf_get_prog_id(xdp_prog);
 }
 
 static int tun_xdp(struct net_device *dev, struct netdev_bpf *xdp)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9f3c839f9e5f..261b0df8dc41 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1140,10 +1140,8 @@ static u32 veth_xdp_query(struct net_device *dev)
 	const struct bpf_prog *xdp_prog;
 
 	xdp_prog = priv->_xdp_prog;
-	if (xdp_prog)
-		return xdp_prog->aux->id;
 
-	return 0;
+	return bpf_get_prog_id(xdp_prog);
 }
 
 static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba98e0971b84..5aa7f95b6c99 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2521,8 +2521,7 @@ static u32 virtnet_xdp_query(struct net_device *dev)
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		xdp_prog = rtnl_dereference(vi->rq[i].xdp_prog);
-		if (xdp_prog)
-			return xdp_prog->aux->id;
+		return bpf_get_prog_id(xdp_prog);
 	}
 	return 0;
 }
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5b9d22338606..e5b023cda42a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -692,6 +692,9 @@ int bpf_get_file_flag(int flags);
 int bpf_check_uarg_tail_zero(void __user *uaddr, size_t expected_size,
 			     size_t actual_size);
 
+u32 bpf_get_prog_id(const struct bpf_prog *prog);
+void bpf_set_prog_id(struct bpf_prog *prog, u32 id);
+
 /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
  * forced to use 'long' read/writes to try to atomically copy long counters.
  * Best-effort only.  No barriers here, since it _will_ race with concurrent
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 8c8420230a10..3369a73c27e1 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -39,7 +39,7 @@ TRACE_EVENT(xdp_exception,
 	),
 
 	TP_fast_assign(
-		__entry->prog_id	= xdp->aux->id;
+		__entry->prog_id	= bpf_get_prog_id(xdp);
 		__entry->act		= act;
 		__entry->ifindex	= dev->ifindex;
 	),
@@ -99,7 +99,7 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 	),
 
 	TP_fast_assign(
-		__entry->prog_id	= xdp->aux->id;
+		__entry->prog_id	= bpf_get_prog_id(xdp);
 		__entry->act		= XDP_REDIRECT;
 		__entry->ifindex	= dev->ifindex;
 		__entry->err		= err;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1c65ce0098a9..30037d72c176 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -589,7 +589,7 @@ static void prog_fd_array_put_ptr(void *ptr)
 
 static u32 prog_fd_array_sys_lookup_elem(void *ptr)
 {
-	return ((struct bpf_prog *)ptr)->aux->id;
+	return bpf_get_prog_id((struct bpf_prog *)ptr);
 }
 
 /* decrement refcnt of all bpf_progs that are stored in this map */
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index ddd8addcdb5c..8db882606d54 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -526,7 +526,7 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 
 		i = 0;
 		list_for_each_entry(pl, progs, node) {
-			id = pl->prog->aux->id;
+			id = bpf_get_prog_id(pl->prog);
 			if (copy_to_user(prog_ids + i, &id, sizeof(id)))
 				return -EFAULT;
 			if (++i == cnt)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 66088a9e9b9e..60ccf0e552fc 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1832,7 +1832,7 @@ static bool bpf_prog_array_copy_core(struct bpf_prog_array *array,
 	for (item = array->items; item->prog; item++) {
 		if (item->prog == &dummy_bpf_prog.prog)
 			continue;
-		prog_ids[i] = item->prog->aux->id;
+		prog_ids[i] = bpf_get_prog_id(item->prog);
 		if (++i == request_cnt) {
 			item++;
 			break;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 82eabd4e38ad..205f95af67d2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1287,7 +1287,7 @@ static int bpf_prog_alloc_id(struct bpf_prog *prog)
 	spin_lock_bh(&prog_idr_lock);
 	id = idr_alloc_cyclic(&prog_idr, prog, 1, INT_MAX, GFP_ATOMIC);
 	if (id > 0)
-		prog->aux->id = id;
+		bpf_set_prog_id(prog, id);
 	spin_unlock_bh(&prog_idr_lock);
 	idr_preload_end();
 
@@ -1305,7 +1305,7 @@ void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock)
 	 * disappears - even if someone grabs an fd to them they are unusable,
 	 * simply waiting for refcnt to drop to be freed.
 	 */
-	if (!prog->aux->id)
+	if (!bpf_get_prog_id(prog))
 		return;
 
 	if (do_idr_lock)
@@ -1313,8 +1313,8 @@ void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock)
 	else
 		__acquire(&prog_idr_lock);
 
-	idr_remove(&prog_idr, prog->aux->id);
-	prog->aux->id = 0;
+	idr_remove(&prog_idr, bpf_get_prog_id(prog));
+	bpf_set_prog_id(prog, 0);
 
 	if (do_idr_lock)
 		spin_unlock_bh(&prog_idr_lock);
@@ -1353,6 +1353,22 @@ void bpf_prog_put(struct bpf_prog *prog)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_put);
 
+u32 bpf_get_prog_id(const struct bpf_prog *prog)
+{
+	if (prog)
+		return prog->aux->id;
+
+	return 0;
+}
+EXPORT_SYMBOL(bpf_get_prog_id);
+
+void bpf_set_prog_id(struct bpf_prog *prog, u32 id)
+{
+	if (prog)
+		prog->aux->id = id;
+}
+EXPORT_SYMBOL(bpf_set_prog_id);
+
 static int bpf_prog_release(struct inode *inode, struct file *filp)
 {
 	struct bpf_prog *prog = filp->private_data;
@@ -1406,7 +1422,7 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp)
 		   prog->jited,
 		   prog_tag,
 		   prog->pages * 1ULL << PAGE_SHIFT,
-		   prog->aux->id,
+		   bpf_get_prog_id(prog),
 		   stats.nsecs,
 		   stats.cnt);
 }
@@ -2329,7 +2345,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 		return -EFAULT;
 
 	info.type = prog->type;
-	info.id = prog->aux->id;
+	info.id = bpf_get_prog_id(prog);
 	info.load_time = prog->aux->load_time;
 	info.created_by_uid = from_kuid_munged(current_user_ns(),
 					       prog->aux->user->uid);
@@ -2792,7 +2808,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 		struct bpf_raw_event_map *btp = raw_tp->btp;
 
 		err = bpf_task_fd_query_copy(attr, uattr,
-					     raw_tp->prog->aux->id,
+					     bpf_get_prog_id(raw_tp->prog),
 					     BPF_FD_TYPE_RAW_TRACEPOINT,
 					     btp->tp->name, 0, 0);
 		goto put_file;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4655adbbae10..1410951ca904 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8035,7 +8035,7 @@ void perf_event_bpf_event(struct bpf_prog *prog,
 			},
 			.type = type,
 			.flags = flags,
-			.id = prog->aux->id,
+			.id = bpf_get_prog_id(prog),
 		},
 	};
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 44bd08f2443b..35e8cd2b6b54 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1426,7 +1426,7 @@ int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
 	if (prog->type == BPF_PROG_TYPE_PERF_EVENT)
 		return -EOPNOTSUPP;
 
-	*prog_id = prog->aux->id;
+	*prog_id = bpf_get_prog_id(prog);
 	flags = event->tp_event->flags;
 	is_tracepoint = flags & TRACE_EVENT_FL_TRACEPOINT;
 	is_syscall_tp = is_syscall_trace_event(event->tp_event);
diff --git a/net/core/dev.c b/net/core/dev.c
index 7a456c6a7ad8..866d0ad936a5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5286,7 +5286,7 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 		break;
 
 	case XDP_QUERY_PROG:
-		xdp->prog_id = old ? old->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(old);
 		break;
 
 	default:
@@ -8262,7 +8262,7 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 			return -EINVAL;
 		}
 
-		if (prog->aux->id == prog_id) {
+		if (bpf_get_prog_id(prog) == prog_id) {
 			bpf_prog_put(prog);
 			return 0;
 		}
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 6b4b88d1599d..fa8b8e88bfaa 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -89,7 +89,7 @@ int skb_flow_dissector_prog_query(const union bpf_attr *attr,
 	attached = rcu_dereference(net->flow_dissector_prog);
 	if (attached) {
 		prog_cnt = 1;
-		prog_id = attached->aux->id;
+		prog_id = bpf_get_prog_id(attached);
 	}
 	rcu_read_unlock();
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 49fa910b58af..86fd505b4111 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1387,7 +1387,7 @@ static u32 rtnl_xdp_prog_skb(struct net_device *dev)
 	generic_xdp_prog = rtnl_dereference(dev->xdp_prog);
 	if (!generic_xdp_prog)
 		return 0;
-	return generic_xdp_prog->aux->id;
+	return bpf_get_prog_id(generic_xdp_prog);
 }
 
 static u32 rtnl_xdp_prog_drv(struct net_device *dev)
diff --git a/net/core/xdp.c b/net/core/xdp.c
index d7bf62ffbb5e..0bc9a50eb318 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -469,7 +469,7 @@ EXPORT_SYMBOL_GPL(__xdp_release_frame);
 int xdp_attachment_query(struct xdp_attachment_info *info,
 			 struct netdev_bpf *bpf)
 {
-	bpf->prog_id = info->prog ? info->prog->aux->id : 0;
+	bpf->prog_id = bpf_get_prog_id(info->prog);
 	bpf->prog_flags = info->prog ? info->flags : 0;
 	return 0;
 }
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 9d4f75e0d33a..e49987655567 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -853,7 +853,7 @@ static int put_nla_bpf(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 	if (!nest)
 		return -EMSGSIZE;
 
-	if (nla_put_u32(skb, SEG6_LOCAL_BPF_PROG, slwt->bpf.prog->aux->id))
+	if (nla_put_u32(skb, SEG6_LOCAL_BPF_PROG, bpf_get_prog_id(slwt->bpf.prog)))
 		return -EMSGSIZE;
 
 	if (slwt->bpf.name &&
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 04b7bd4ec751..d55a28d0adf6 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -119,7 +119,7 @@ static int tcf_bpf_dump_ebpf_info(const struct tcf_bpf *prog,
 	    nla_put_string(skb, TCA_ACT_BPF_NAME, prog->bpf_name))
 		return -EMSGSIZE;
 
-	if (nla_put_u32(skb, TCA_ACT_BPF_ID, prog->filter->aux->id))
+	if (nla_put_u32(skb, TCA_ACT_BPF_ID, bpf_get_prog_id(prog->filter)))
 		return -EMSGSIZE;
 
 	nla = nla_reserve(skb, TCA_ACT_BPF_TAG, sizeof(prog->filter->tag));
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index bf10bdaf5012..d35aedb4aee5 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -562,7 +562,7 @@ static int cls_bpf_dump_ebpf_info(const struct cls_bpf_prog *prog,
 	    nla_put_string(skb, TCA_BPF_NAME, prog->bpf_name))
 		return -EMSGSIZE;
 
-	if (nla_put_u32(skb, TCA_BPF_ID, prog->filter->aux->id))
+	if (nla_put_u32(skb, TCA_BPF_ID, bpf_get_prog_id(prog->filter)))
 		return -EMSGSIZE;
 
 	nla = nla_reserve(skb, TCA_BPF_TAG, sizeof(prog->filter->tag));
-- 
2.14.5



* [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions Sridhar Samudrala
@ 2019-10-08  6:16 ` Sridhar Samudrala
  2019-10-08  6:58   ` Toke Høiland-Jørgensen
                     ` (2 more replies)
  2019-10-08  6:16 ` [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag Sridhar Samudrala
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 23+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

Introduce a flag that can be specified during the bind() call
of an AF_XDP socket to receive packets directly from a queue when there is
no XDP program attached to the device.

This is enabled by introducing a special BPF prog pointer called
BPF_PROG_DIRECT_XSK and a new bind flag XDP_DIRECT that can be specified
when binding an AF_XDP socket to a queue. At the time of bind(), an AF_XDP
socket in XDP_DIRECT mode will attach BPF_PROG_DIRECT_XSK as a bpf program
if there is no other XDP program attached to the device. The device receive
queue is also associated with the AF_XDP socket.

In the XDP receive path, if the bpf program is a DIRECT_XSK program, the
XDP buffer is passed to the AF_XDP socket that is associated with the
receive queue on which the packet is received.

To avoid any overhead for normal XDP programs, a static key is used to keep
a count of AF_XDP direct xsk sockets, and static_branch_unlikely() is used
to handle receives for direct XDP sockets.

Attaching a normal XDP program always takes precedence and removes the
direct xsk program. The direct XSK program is re-attached automatically
when the normal XDP program is removed, provided there are still AF_XDP
direct sockets associated with that device.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/linux/filter.h            | 18 ++++++++++++
 include/linux/netdevice.h         | 10 +++++++
 include/net/xdp_sock.h            |  5 ++++
 include/uapi/linux/if_xdp.h       |  5 ++++
 kernel/bpf/syscall.c              |  7 +++--
 net/core/dev.c                    | 50 +++++++++++++++++++++++++++++++++
 net/core/filter.c                 | 58 +++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk.c                     | 51 ++++++++++++++++++++++++++++++++--
 tools/include/uapi/linux/if_xdp.h |  5 ++++
 9 files changed, 204 insertions(+), 5 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2ce57645f3cd..db4ad85d8321 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -585,6 +585,9 @@ struct bpf_redirect_info {
 	struct bpf_map *map;
 	struct bpf_map *map_to_flush;
 	u32 kern_flags;
+#ifdef CONFIG_XDP_SOCKETS
+	struct xdp_sock *xsk;
+#endif
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
@@ -693,6 +696,16 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return res;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+#define BPF_PROG_DIRECT_XSK	0x1
+#define bpf_is_direct_xsk_prog(prog) \
+	((unsigned long)prog == BPF_PROG_DIRECT_XSK)
+DECLARE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
+u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp);
+#else
+#define bpf_is_direct_xsk_prog(prog) (false)
+#endif
+
 static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 					    struct xdp_buff *xdp)
 {
@@ -702,6 +715,11 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 	 * already takes rcu_read_lock() when fetching the program, so
 	 * it's not necessary here anymore.
 	 */
+#ifdef CONFIG_XDP_SOCKETS
+	if (static_branch_unlikely(&xdp_direct_xsk_needed) &&
+	    bpf_is_direct_xsk_prog(prog))
+		return bpf_direct_xsk(prog, xdp);
+#endif
 	return BPF_PROG_RUN(prog, xdp);
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 48cc71aae466..f4d0f70aa718 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -743,6 +743,7 @@ struct netdev_rx_queue {
 	struct xdp_rxq_info		xdp_rxq;
 #ifdef CONFIG_XDP_SOCKETS
 	struct xdp_umem                 *umem;
+	struct xdp_sock			*xsk;
 #endif
 } ____cacheline_aligned_in_smp;
 
@@ -1836,6 +1837,10 @@ struct net_device {
 	atomic_t		carrier_up_count;
 	atomic_t		carrier_down_count;
 
+#ifdef CONFIG_XDP_SOCKETS
+	u16			direct_xsk_count;
+#endif
+
 #ifdef CONFIG_WIRELESS_EXT
 	const struct iw_handler_def *wireless_handlers;
 	struct iw_public_data	*wireless_data;
@@ -3712,6 +3717,11 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
+#ifdef CONFIG_XDP_SOCKETS
+int dev_set_direct_xsk_prog(struct net_device *dev);
+int dev_clear_direct_xsk_prog(struct net_device *dev);
+#endif
+
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
 					       struct sk_buff *skb)
 {
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index c9398ce7960f..9158233d34e1 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -76,6 +76,9 @@ struct xsk_map_node {
 	struct xdp_sock **map_entry;
 };
 
+/* Flags for the xdp_sock flags field. */
+#define XDP_SOCK_DIRECT (1 << 0)
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
@@ -104,6 +107,7 @@ struct xdp_sock {
 	struct list_head map_list;
 	/* Protects map_list */
 	spinlock_t map_list_lock;
+	u16 flags;
 };
 
 struct xdp_buff;
@@ -129,6 +133,7 @@ void xsk_set_tx_need_wakeup(struct xdp_umem *umem);
 void xsk_clear_rx_need_wakeup(struct xdp_umem *umem);
 void xsk_clear_tx_need_wakeup(struct xdp_umem *umem);
 bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem);
+struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 queue_id);
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
 			     struct xdp_sock **map_entry);
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index be328c59389d..d202b5d40859 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -25,6 +25,11 @@
  * application.
  */
 #define XDP_USE_NEED_WAKEUP (1 << 3)
+/* This option allows an AF_XDP socket bound to a queue to receive all
+ * the packets directly from that queue when there is no XDP program
+ * attached to the device.
+ */
+#define XDP_DIRECT	(1 << 4)
 
 /* Flags for xsk_umem_config flags */
 #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 205f95af67d2..871d738a78d2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1349,13 +1349,14 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
 
 void bpf_prog_put(struct bpf_prog *prog)
 {
-	__bpf_prog_put(prog, true);
+	if (!bpf_is_direct_xsk_prog(prog))
+		__bpf_prog_put(prog, true);
 }
 EXPORT_SYMBOL_GPL(bpf_prog_put);
 
 u32 bpf_get_prog_id(const struct bpf_prog *prog)
 {
-	if (prog)
+	if (prog && !bpf_is_direct_xsk_prog(prog))
 		return prog->aux->id;
 
 	return 0;
@@ -1364,7 +1365,7 @@ EXPORT_SYMBOL(bpf_get_prog_id);
 
 void bpf_set_prog_id(struct bpf_prog *prog, u32 id)
 {
-	if (prog)
+	if (prog && !bpf_is_direct_xsk_prog(prog))
 		prog->aux->id = id;
 }
 EXPORT_SYMBOL(bpf_set_prog_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index 866d0ad936a5..eb3cd718e580 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8269,6 +8269,10 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 	} else {
 		if (!__dev_xdp_query(dev, bpf_op, query))
 			return 0;
+#ifdef CONFIG_XDP_SOCKETS
+		if (dev->direct_xsk_count)
+			prog = (void *)BPF_PROG_DIRECT_XSK;
+#endif
 	}
 
 	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
@@ -8278,6 +8282,52 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 	return err;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+int dev_set_direct_xsk_prog(struct net_device *dev)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct bpf_prog *prog;
+	bpf_op_t bpf_op;
+
+	ASSERT_RTNL();
+
+	dev->direct_xsk_count++;
+
+	bpf_op = ops->ndo_bpf;
+	if (!bpf_op)
+		return -EOPNOTSUPP;
+
+	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
+		return 0;
+
+	prog = (void *)BPF_PROG_DIRECT_XSK;
+
+	return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, prog);
+}
+
+int dev_clear_direct_xsk_prog(struct net_device *dev)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	bpf_op_t bpf_op;
+
+	ASSERT_RTNL();
+
+	dev->direct_xsk_count--;
+
+	if (dev->direct_xsk_count)
+		return 0;
+
+	bpf_op = ops->ndo_bpf;
+	if (!bpf_op)
+		return -EOPNOTSUPP;
+
+	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
+		return 0;
+
+	return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, NULL);
+}
+#endif
+
 /**
  *	dev_new_index	-	allocate an ifindex
  *	@net: the applicable net namespace
diff --git a/net/core/filter.c b/net/core/filter.c
index ed6563622ce3..391d7d600284 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -73,6 +73,7 @@
 #include <net/lwtunnel.h>
 #include <net/ipv6_stubs.h>
 #include <net/bpf_sk_storage.h>
+#include <linux/static_key.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -3546,6 +3547,22 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 	return 0;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+static void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
+{
+	struct xdp_sock *xsk = READ_ONCE(ri->xsk);
+
+	if (xsk) {
+		ri->xsk = NULL;
+		xsk_flush(xsk);
+	}
+}
+#else
+static inline void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
+{
+}
+#endif
+
 void xdp_do_flush_map(void)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
@@ -3568,6 +3585,8 @@ void xdp_do_flush_map(void)
 			break;
 		}
 	}
+
+	xdp_do_flush_xsk(ri);
 }
 EXPORT_SYMBOL_GPL(xdp_do_flush_map);
 
@@ -3631,11 +3650,28 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 	return err;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+static inline struct xdp_sock *xdp_get_direct_xsk(struct bpf_redirect_info *ri)
+{
+	return READ_ONCE(ri->xsk);
+}
+#else
+static inline struct xdp_sock *xdp_get_direct_xsk(struct bpf_redirect_info *ri)
+{
+	return NULL;
+}
+#endif
+
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct bpf_map *map = READ_ONCE(ri->map);
+	struct xdp_sock *xsk;
+
+	xsk = xdp_get_direct_xsk(ri);
+	if (xsk)
+		return xsk_rcv(xsk, xdp);
 
 	if (likely(map))
 		return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
@@ -8934,4 +8970,26 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
 
 const struct bpf_prog_ops sk_reuseport_prog_ops = {
 };
+
+#ifdef CONFIG_XDP_SOCKETS
+DEFINE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
+EXPORT_SYMBOL_GPL(xdp_direct_xsk_needed);
+
+u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
+{
+	struct xdp_sock *xsk;
+
+	xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
+	if (xsk) {
+		struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+		ri->xsk = xsk;
+		return XDP_REDIRECT;
+	}
+
+	return XDP_PASS;
+}
+EXPORT_SYMBOL(bpf_direct_xsk);
+#endif
+
 #endif /* CONFIG_INET */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index fa8fbb8fa3c8..8a29939bac7e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -24,6 +24,7 @@
 #include <linux/rculist.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
+#include <linux/if_link.h>
 
 #include "xsk_queue.h"
 #include "xdp_umem.h"
@@ -264,6 +265,45 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+static void xdp_clear_direct_xsk(struct xdp_sock *xsk)
+{
+	struct net_device *dev = xsk->dev;
+	u32 qid = xsk->queue_id;
+
+	if (!dev->_rx[qid].xsk)
+		return;
+
+	dev_clear_direct_xsk_prog(dev);
+	dev->_rx[qid].xsk = NULL;
+	static_branch_dec(&xdp_direct_xsk_needed);
+	xsk->flags &= ~XDP_SOCK_DIRECT;
+}
+
+static int xdp_set_direct_xsk(struct xdp_sock *xsk)
+{
+	struct net_device *dev = xsk->dev;
+	u32 qid = xsk->queue_id;
+	int err = 0;
+
+	if (dev->_rx[qid].xsk)
+		return -EEXIST;
+
+	xsk->flags |= XDP_SOCK_DIRECT;
+	static_branch_inc(&xdp_direct_xsk_needed);
+	dev->_rx[qid].xsk = xsk;
+	err = dev_set_direct_xsk_prog(dev);
+	if (err)
+		xdp_clear_direct_xsk(xsk);
+
+	return err;
+}
+
+struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 queue_id)
+{
+	return dev->_rx[queue_id].xsk;
+}
+EXPORT_SYMBOL(xdp_get_xsk_from_qid);
+
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
 {
 	xskq_produce_flush_addr_n(umem->cq, nb_entries);
@@ -464,6 +504,11 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
 		return;
 	WRITE_ONCE(xs->state, XSK_UNBOUND);
 
+	if (xs->flags & XDP_SOCK_DIRECT) {
+		rtnl_lock();
+		xdp_clear_direct_xsk(xs);
+		rtnl_unlock();
+	}
 	/* Wait for driver to stop using the xdp socket. */
 	xdp_del_sk_umem(xs->umem, xs);
 	xs->dev = NULL;
@@ -604,7 +649,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 
 	flags = sxdp->sxdp_flags;
 	if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY |
-		      XDP_USE_NEED_WAKEUP))
+		      XDP_USE_NEED_WAKEUP | XDP_DIRECT))
 		return -EINVAL;
 
 	rtnl_lock();
@@ -632,7 +677,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		struct socket *sock;
 
 		if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY) ||
-		    (flags & XDP_USE_NEED_WAKEUP)) {
+		    (flags & XDP_USE_NEED_WAKEUP) || (flags & XDP_DIRECT)) {
 			/* Cannot specify flags for shared sockets. */
 			err = -EINVAL;
 			goto out_unlock;
@@ -688,6 +733,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
 	xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);
 	xdp_add_sk_umem(xs->umem, xs);
+	if (flags & XDP_DIRECT)
+		err = xdp_set_direct_xsk(xs);
 
 out_unlock:
 	if (err) {
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index be328c59389d..d202b5d40859 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -25,6 +25,11 @@
  * application.
  */
 #define XDP_USE_NEED_WAKEUP (1 << 3)
+/* This option allows an AF_XDP socket bound to a queue to receive all
+ * the packets directly from that queue when there is no XDP program
+ * attached to the device.
+ */
+#define XDP_DIRECT	(1 << 4)
 
 /* Flags for xsk_umem_config flags */
 #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
-- 
2.14.5



* [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag.
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
@ 2019-10-08  6:16 ` Sridhar Samudrala
  2019-10-08  8:05   ` Björn Töpel
  2019-10-08  6:16 ` [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode Sridhar Samudrala
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

Don't allow an AF_XDP socket to bind with the XDP_DIRECT bind flag
when a normal XDP program is already attached to the device.

Don't attach the default XDP program when an AF_XDP socket is created
with the XDP_DIRECT bind flag.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 tools/lib/bpf/xsk.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index d5f4900e5c54..953b479040cd 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -454,6 +454,9 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
 		return err;
 
 	if (!prog_id) {
+		if (xsk->config.bind_flags & XDP_DIRECT)
+			return 0;
+
 		err = xsk_create_bpf_maps(xsk);
 		if (err)
 			return err;
@@ -464,6 +467,9 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
 			return err;
 		}
 	} else {
+		if (xsk->config.bind_flags & XDP_DIRECT)
+			return -EEXIST;
+
 		xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
 		err = xsk_lookup_bpf_maps(xsk);
 		if (err) {
-- 
2.14.5



* [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
                   ` (2 preceding siblings ...)
  2019-10-08  6:16 ` [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag Sridhar Samudrala
@ 2019-10-08  6:16 ` Sridhar Samudrala
  2019-10-08  8:05   ` Björn Töpel
  2019-10-08  8:05 ` [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Björn Töpel
  2019-10-09  0:49 ` Jakub Kicinski
  5 siblings, 1 reply; 23+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

This option enables an AF_XDP socket to bind with the XDP_DIRECT flag,
which allows packets received on the associated queue to be delivered
directly to the socket when no XDP program is attached to the device.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 samples/bpf/xdpsock_user.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 405c4e091f8b..6f4633769968 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -129,6 +129,9 @@ static void print_benchmark(bool running)
 	if (opt_poll)
 		printf("poll() ");
 
+	if (opt_xdp_bind_flags & XDP_DIRECT)
+		printf("direct-xsk ");
+
 	if (running) {
 		printf("running...");
 		fflush(stdout);
@@ -202,7 +205,8 @@ static void int_exit(int sig)
 	dump_stats();
 	xsk_socket__delete(xsks[0]->xsk);
 	(void)xsk_umem__delete(umem);
-	remove_xdp_program();
+	if (!(opt_xdp_bind_flags & XDP_DIRECT))
+		remove_xdp_program();
 
 	exit(EXIT_SUCCESS);
 }
@@ -213,7 +217,8 @@ static void __exit_with_error(int error, const char *file, const char *func,
 	fprintf(stderr, "%s:%s:%i: errno: %d/\"%s\"\n", file, func,
 		line, error, strerror(error));
 	dump_stats();
-	remove_xdp_program();
+	if (!(opt_xdp_bind_flags & XDP_DIRECT))
+		remove_xdp_program();
 	exit(EXIT_FAILURE);
 }
 
@@ -363,6 +368,7 @@ static struct option long_options[] = {
 	{"frame-size", required_argument, 0, 'f'},
 	{"no-need-wakeup", no_argument, 0, 'm'},
 	{"unaligned", no_argument, 0, 'u'},
+	{"direct-xsk", no_argument, 0, 'd'},
 	{0, 0, 0, 0}
 };
 
@@ -386,6 +392,7 @@ static void usage(const char *prog)
 		"  -m, --no-need-wakeup Turn off use of driver need wakeup flag.\n"
 		"  -f, --frame-size=n   Set the frame size (must be a power of two in aligned mode, default is %d).\n"
 		"  -u, --unaligned	Enable unaligned chunk placement\n"
+		"  -d, --direct-xsk	Direct packets to XDP socket.\n"
 		"\n";
 	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
 	exit(EXIT_FAILURE);
@@ -398,7 +405,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mu",
+		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mud",
 				long_options, &option_index);
 		if (c == -1)
 			break;
@@ -452,7 +459,9 @@ static void parse_command_line(int argc, char **argv)
 			opt_need_wakeup = false;
 			opt_xdp_bind_flags &= ~XDP_USE_NEED_WAKEUP;
 			break;
-
+		case 'd':
+			opt_xdp_bind_flags |= XDP_DIRECT;
+			break;
 		default:
 			usage(basename(argv[0]));
 		}
-- 
2.14.5



* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
@ 2019-10-08  6:58   ` Toke Høiland-Jørgensen
  2019-10-08  8:47     ` Björn Töpel
  2019-10-08  8:05   ` Björn Töpel
  2019-10-09  1:20   ` Alexei Starovoitov
  2 siblings, 1 reply; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-08  6:58 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, bjorn.topel, netdev, bpf,
	sridhar.samudrala, intel-wired-lan, maciej.fijalkowski,
	tom.herbert

Sridhar Samudrala <sridhar.samudrala@intel.com> writes:

>  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>  	struct bpf_map *map = READ_ONCE(ri->map);
> +	struct xdp_sock *xsk;
> +
> +	xsk = xdp_get_direct_xsk(ri);
> +	if (xsk)
> +		return xsk_rcv(xsk, xdp);

This is a new branch and a read barrier in the XDP_REDIRECT fast path.
What's the performance impact of that for non-XSK redirect?

-Toke



* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
                   ` (3 preceding siblings ...)
  2019-10-08  6:16 ` [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode Sridhar Samudrala
@ 2019-10-08  8:05 ` Björn Töpel
  2019-10-09 16:19   ` Samudrala, Sridhar
  2019-10-09  0:49 ` Jakub Kicinski
  5 siblings, 1 reply; 23+ messages in thread
From: Björn Töpel @ 2019-10-08  8:05 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert
  Cc: Jakub Kicinski

On 2019-10-08 08:16, Sridhar Samudrala wrote:
> This is a rework of the following patch series
> https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r
> that tried to enable direct receive by bypassing XDP program attached
> to the device.
> 
> Based on the community feedback and some suggestions from Bjorn, changed
> the semantics of the implementation to enable direct receive on AF_XDP
> sockets that are bound to a queue only when there is no normal XDP program
> attached to the device.
> 
> This is accomplished by introducing a special BPF prog pointer (DIRECT_XSK)
> that is attached at the time of binding an AF_XDP socket to a queue of a
> device. This is done only if there is no other XDP program attached to
> the device. The normal XDP program has precedence and will replace the
> DIRECT_XSK prog if it is attached later. The main reason to introduce a
> special BPF prog pointer is to minimize the driver changes. The only change
> is to use the bpf_get_prog_id() helper when QUERYING the prog id.
> 
> Any attach of a normal XDP program will take precedence and the direct xsk
> program will be removed. The direct XSK program will be attached
> automatically when the normal XDP program is removed when there are any
> AF_XDP direct sockets associated with that device.
> 
> A static key is used to control this feature in order to avoid any overhead
> for normal XDP datapath when there are no AF_XDP sockets in direct-xsk mode.
> 
> Here is some performance data i collected on my Intel Ivybridge based
> development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
> NIC: Intel 40Gb ethernet (i40e)
> 
> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
> 6.1x improvement in drop rate
> 
> xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
> 6x improvement in drop rate
> 
> xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
> 3.5x improvement in l2fwd rate
> 
> xdpsock rxdrop 2 core (app and queue'sirq pinned to different cores)
>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
> 4.5x improvement in l2fwd rate
> 
> dpdk-pktgen is used to send 64byte UDP packets from a link partner and
> ethtool ntuple flow rule is used to redirect packets to queue 1 on the
> system under test.
>

Thanks for working on this Sridhar! I like this approach! Apart from
the bpf_get_prog_id() changes, no driver changes are needed.

It's also a cleaner (IMO) approach than my previous attempts [1,2,3].

Would be interesting to see NFP support AF_XDP offloading with this
option. (nudge, nudge).

A thought: from userland, a direct AF_XDP socket will not appear to have
an XDP program attached to the device (id == 0). Maybe show in ss(8)
(via xsk_diag.c) that the socket is direct?
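
Something like the below is what I have in mind (rough sketch only;
XDP_DIAG_FLAGS is a made-up attribute, so the uapi and the xsk_diag
fill path would still need to be sorted out):

static int xsk_diag_put_flags(const struct xdp_sock *xs,
			      struct sk_buff *nlskb)
{
	u32 flags = 0;

	/* XDP_SOCK_DIRECT is from patch 2; XDP_DIAG_FLAGS is hypothetical */
	if (xs->flags & XDP_SOCK_DIRECT)
		flags |= XDP_SOCK_DIRECT;

	return nla_put_u32(nlskb, XDP_DIAG_FLAGS, flags);
}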

[1] 
https://lore.kernel.org/netdev/CAJ+HfNj63QcLY8=y1fF93PZd3XcfiGSrbbWdiGByjTzZQydSSg@mail.gmail.com/
[2] 
https://lore.kernel.org/netdev/cd952f99-6bad-e0c8-5bcd-f0010218238c@intel.com/
[3] 
https://lore.kernel.org/netdev/20181207114431.18038-1-bjorn.topel@gmail.com/

> Sridhar Samudrala (4):
>    bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions.
>    xsk: allow AF_XDP sockets to receive packets directly from a queue
>    libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag.
>    xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode
> 
>   drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c     |  2 +-
>   drivers/net/ethernet/cavium/thunder/nicvf_main.c  |  2 +-
>   drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 +-
>   drivers/net/ethernet/intel/i40e/i40e_main.c       |  2 +-
>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  3 +-
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +-
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |  3 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
>   drivers/net/ethernet/qlogic/qede/qede_filter.c    |  2 +-
>   drivers/net/ethernet/socionext/netsec.c           |  2 +-
>   drivers/net/netdevsim/bpf.c                       |  6 ++-
>   drivers/net/tun.c                                 |  4 +-
>   drivers/net/veth.c                                |  4 +-
>   drivers/net/virtio_net.c                          |  3 +-
>   include/linux/bpf.h                               |  3 ++
>   include/linux/filter.h                            | 18 +++++++
>   include/linux/netdevice.h                         | 10 ++++
>   include/net/xdp_sock.h                            |  5 ++
>   include/trace/events/xdp.h                        |  4 +-
>   include/uapi/linux/if_xdp.h                       |  5 ++
>   kernel/bpf/arraymap.c                             |  2 +-
>   kernel/bpf/cgroup.c                               |  2 +-
>   kernel/bpf/core.c                                 |  2 +-
>   kernel/bpf/syscall.c                              | 33 +++++++++----
>   kernel/events/core.c                              |  2 +-
>   kernel/trace/bpf_trace.c                          |  2 +-
>   net/core/dev.c                                    | 54 ++++++++++++++++++++-
>   net/core/filter.c                                 | 58 +++++++++++++++++++++++
>   net/core/flow_dissector.c                         |  2 +-
>   net/core/rtnetlink.c                              |  2 +-
>   net/core/xdp.c                                    |  2 +-
>   net/ipv6/seg6_local.c                             |  2 +-
>   net/sched/act_bpf.c                               |  2 +-
>   net/sched/cls_bpf.c                               |  2 +-
>   net/xdp/xsk.c                                     | 51 +++++++++++++++++++-
>   samples/bpf/xdpsock_user.c                        | 17 +++++--
>   tools/include/uapi/linux/if_xdp.h                 |  5 ++
>   tools/lib/bpf/xsk.c                               |  6 +++
>   38 files changed, 279 insertions(+), 53 deletions(-)
> 


* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
  2019-10-08  6:58   ` Toke Høiland-Jørgensen
@ 2019-10-08  8:05   ` Björn Töpel
  2019-10-09 16:32     ` Samudrala, Sridhar
  2019-10-09  1:20   ` Alexei Starovoitov
  2 siblings, 1 reply; 23+ messages in thread
From: Björn Töpel @ 2019-10-08  8:05 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On 2019-10-08 08:16, Sridhar Samudrala wrote:
> Introduce a flag that can be specified during the bind() call
> of an AF_XDP socket to receive packets directly from a queue when there is
> no XDP program attached to the device.
> 
> This is enabled by introducing a special BPF prog pointer called
> BPF_PROG_DIRECT_XSK and a new bind flag XDP_DIRECT that can be specified
> when binding an AF_XDP socket to a queue. At the time of bind(), an AF_XDP
> socket in XDP_DIRECT mode, will attach BPF_PROG_DIRECT_XSK as a bpf program
> if there is no other XDP program attached to the device. The device receive
> queue is also associated with the AF_XDP socket.
> 
> In the XDP receive path, if the bpf program is a DIRECT_XSK program, the
> XDP buffer is passed to the AF_XDP socket that is associated with the
> receive queue on which the packet is received.
> 
> To avoid any overhead for normal XDP programs, a static key is used to keep
> a count of AF_XDP direct xsk sockets and static_branch_unlikely() is used
> to handle receives for direct XDP sockets.
> 
> Any attach of a normal XDP program will take precedence and the direct xsk
> program will be removed. The direct XSK program will be attached
> automatically when the normal XDP program is removed when there are any
> AF_XDP direct sockets associated with that device.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>   include/linux/filter.h            | 18 ++++++++++++
>   include/linux/netdevice.h         | 10 +++++++
>   include/net/xdp_sock.h            |  5 ++++
>   include/uapi/linux/if_xdp.h       |  5 ++++
>   kernel/bpf/syscall.c              |  7 +++--
>   net/core/dev.c                    | 50 +++++++++++++++++++++++++++++++++
>   net/core/filter.c                 | 58 +++++++++++++++++++++++++++++++++++++++
>   net/xdp/xsk.c                     | 51 ++++++++++++++++++++++++++++++++--
>   tools/include/uapi/linux/if_xdp.h |  5 ++++
>   9 files changed, 204 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 2ce57645f3cd..db4ad85d8321 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -585,6 +585,9 @@ struct bpf_redirect_info {
>   	struct bpf_map *map;
>   	struct bpf_map *map_to_flush;
>   	u32 kern_flags;
> +#ifdef CONFIG_XDP_SOCKETS
> +	struct xdp_sock *xsk;
> +#endif
>   };
>   
>   DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
> @@ -693,6 +696,16 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
>   	return res;
>   }
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +#define BPF_PROG_DIRECT_XSK	0x1
> +#define bpf_is_direct_xsk_prog(prog) \
> +	((unsigned long)prog == BPF_PROG_DIRECT_XSK)
> +DECLARE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp);
> +#else
> +#define bpf_is_direct_xsk_prog(prog) (false)
> +#endif
> +
>   static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
>   					    struct xdp_buff *xdp)
>   {
> @@ -702,6 +715,11 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
>   	 * already takes rcu_read_lock() when fetching the program, so
>   	 * it's not necessary here anymore.
>   	 */
> +#ifdef CONFIG_XDP_SOCKETS
> +	if (static_branch_unlikely(&xdp_direct_xsk_needed) &&
> +	    bpf_is_direct_xsk_prog(prog))
> +		return bpf_direct_xsk(prog, xdp);
> +#endif
>   	return BPF_PROG_RUN(prog, xdp);
>   }
>   
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 48cc71aae466..f4d0f70aa718 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -743,6 +743,7 @@ struct netdev_rx_queue {
>   	struct xdp_rxq_info		xdp_rxq;
>   #ifdef CONFIG_XDP_SOCKETS
>   	struct xdp_umem                 *umem;
> +	struct xdp_sock			*xsk;
>   #endif
>   } ____cacheline_aligned_in_smp;
>   
> @@ -1836,6 +1837,10 @@ struct net_device {
>   	atomic_t		carrier_up_count;
>   	atomic_t		carrier_down_count;
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +	u16			direct_xsk_count;

Why u16? num_rx/tx_queues are unsigned ints.

> +#endif
> +
>   #ifdef CONFIG_WIRELESS_EXT
>   	const struct iw_handler_def *wireless_handlers;
>   	struct iw_public_data	*wireless_data;
> @@ -3712,6 +3717,11 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
>   bool is_skb_forwardable(const struct net_device *dev,
>   			const struct sk_buff *skb);
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +int dev_set_direct_xsk_prog(struct net_device *dev);
> +int dev_clear_direct_xsk_prog(struct net_device *dev);
> +#endif
> +
>   static __always_inline int ____dev_forward_skb(struct net_device *dev,
>   					       struct sk_buff *skb)
>   {
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index c9398ce7960f..9158233d34e1 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -76,6 +76,9 @@ struct xsk_map_node {
>   	struct xdp_sock **map_entry;
>   };
>   
> +/* Flags for the xdp_sock flags field. */
> +#define XDP_SOCK_DIRECT (1 << 0)
> +
>   struct xdp_sock {
>   	/* struct sock must be the first member of struct xdp_sock */
>   	struct sock sk;
> @@ -104,6 +107,7 @@ struct xdp_sock {
>   	struct list_head map_list;
>   	/* Protects map_list */
>   	spinlock_t map_list_lock;
> +	u16 flags;

Right now only the XDP_DIRECT is tracked here. Maybe track all flags, 
and show them in xsk_diag.

>   };
>   
>   struct xdp_buff;
> @@ -129,6 +133,7 @@ void xsk_set_tx_need_wakeup(struct xdp_umem *umem);
>   void xsk_clear_rx_need_wakeup(struct xdp_umem *umem);
>   void xsk_clear_tx_need_wakeup(struct xdp_umem *umem);
>   bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem);
> +struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 queue_id);
>   
>   void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>   			     struct xdp_sock **map_entry);
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index be328c59389d..d202b5d40859 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -25,6 +25,11 @@
>    * application.
>    */
>   #define XDP_USE_NEED_WAKEUP (1 << 3)
> +/* This option allows an AF_XDP socket bound to a queue to receive all
> + * the packets directly from that queue when there is no XDP program
> + * attached to the device.
> + */
> +#define XDP_DIRECT	(1 << 4)
>   
>   /* Flags for xsk_umem_config flags */
>   #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 205f95af67d2..871d738a78d2 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1349,13 +1349,14 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
>   
>   void bpf_prog_put(struct bpf_prog *prog)
>   {
> -	__bpf_prog_put(prog, true);
> +	if (!bpf_is_direct_xsk_prog(prog))
> +		__bpf_prog_put(prog, true);
>   }
>   EXPORT_SYMBOL_GPL(bpf_prog_put);
>   
>   u32 bpf_get_prog_id(const struct bpf_prog *prog)
>   {
> -	if (prog)
> +	if (prog && !bpf_is_direct_xsk_prog(prog))
>   		return prog->aux->id;
>   
>   	return 0;
> @@ -1364,7 +1365,7 @@ EXPORT_SYMBOL(bpf_get_prog_id);
>   
>   void bpf_set_prog_id(struct bpf_prog *prog, u32 id)
>   {
> -	if (prog)
> +	if (prog && !bpf_is_direct_xsk_prog(prog))
>   		prog->aux->id = id;
>   }
>   EXPORT_SYMBOL(bpf_set_prog_id);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 866d0ad936a5..eb3cd718e580 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -8269,6 +8269,10 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
>   	} else {
>   		if (!__dev_xdp_query(dev, bpf_op, query))
>   			return 0;
> +#ifdef CONFIG_XDP_SOCKETS
> +		if (dev->direct_xsk_count)
> +			prog = (void *)BPF_PROG_DIRECT_XSK;
> +#endif

Nit, but maybe hide this weirdness in a function?
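
Something along these lines (untested sketch; the helper name is made up)
would keep the special pointer out of dev_change_xdp_fd() itself:

static struct bpf_prog *dev_direct_xsk_prog(struct net_device *dev)
{
#ifdef CONFIG_XDP_SOCKETS
	/* Fall back to the special direct-xsk prog pointer when direct
	 * sockets are still bound to this device.
	 */
	if (dev->direct_xsk_count)
		return (void *)BPF_PROG_DIRECT_XSK;
#endif
	return NULL;
}

and the hunk above becomes a plain "prog = dev_direct_xsk_prog(dev);".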

>   	}
>   
>   	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
> @@ -8278,6 +8282,52 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
>   	return err;
>   }
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +int dev_set_direct_xsk_prog(struct net_device *dev)
> +{
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +	struct bpf_prog *prog;
> +	bpf_op_t bpf_op;
> +
> +	ASSERT_RTNL();
> +
> +	dev->direct_xsk_count++;
> +
> +	bpf_op = ops->ndo_bpf;
> +	if (!bpf_op)
> +		return -EOPNOTSUPP;
> +
> +	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
> +		return 0;
> +
> +	prog = (void *)BPF_PROG_DIRECT_XSK;
> +
> +	return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, prog);
> +}
> +
> +int dev_clear_direct_xsk_prog(struct net_device *dev)
> +{
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +	bpf_op_t bpf_op;
> +
> +	ASSERT_RTNL();
> +
> +	dev->direct_xsk_count--;
> +
> +	if (dev->direct_xsk_count)
> +		return 0;
> +
> +	bpf_op = ops->ndo_bpf;
> +	if (!bpf_op)
> +		return -EOPNOTSUPP;
> +
> +	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
> +		return 0;
> +
> +	return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, NULL);
> +}
> +#endif
> +
>   /**
>    *	dev_new_index	-	allocate an ifindex
>    *	@net: the applicable net namespace
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ed6563622ce3..391d7d600284 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -73,6 +73,7 @@
>   #include <net/lwtunnel.h>
>   #include <net/ipv6_stubs.h>
>   #include <net/bpf_sk_storage.h>
> +#include <linux/static_key.h>
>   
>   /**
>    *	sk_filter_trim_cap - run a packet through a socket filter
> @@ -3546,6 +3547,22 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
>   	return 0;
>   }
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +static void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
> +{
> +	struct xdp_sock *xsk = READ_ONCE(ri->xsk);

Why READ_ONCE here?

> +
> +	if (xsk) {
> +		ri->xsk = NULL;
> +		xsk_flush(xsk);
> +	}
> +}
> +#else
> +static inline void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
> +{
> +}
> +#endif
> +

Move CONFIG_XDP_SOCKETS into the function, and remove the empty/bottom one.
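
I.e. fold it into a single definition, something like:

static void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
{
#ifdef CONFIG_XDP_SOCKETS
	struct xdp_sock *xsk = ri->xsk;

	/* Flush and clear the socket queued up by the direct-xsk path. */
	if (xsk) {
		ri->xsk = NULL;
		xsk_flush(xsk);
	}
#endif
}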

>   void xdp_do_flush_map(void)
>   {
>   	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> @@ -3568,6 +3585,8 @@ void xdp_do_flush_map(void)
>   			break;
>   		}
>   	}
> +
> +	xdp_do_flush_xsk(ri);
>   }
>   EXPORT_SYMBOL_GPL(xdp_do_flush_map);
>   
> @@ -3631,11 +3650,28 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
>   	return err;
>   }
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +static inline struct xdp_sock *xdp_get_direct_xsk(struct bpf_redirect_info *ri)
> +{
> +	return READ_ONCE(ri->xsk);

Again, why READ_ONCE? Please leave the inlining to the compiler in .c files.

> +}
> +#else
> +static inline struct xdp_sock *xdp_get_direct_xsk(struct bpf_redirect_info *ri)
> +{
> +	return NULL;
> +}
> +#endif
> +
>   int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>   		    struct bpf_prog *xdp_prog)
>   {
>   	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>   	struct bpf_map *map = READ_ONCE(ri->map);
> +	struct xdp_sock *xsk;
> +
> +	xsk = xdp_get_direct_xsk(ri);
> +	if (xsk)
> +		return xsk_rcv(xsk, xdp);

Hmm, maybe you need an xsk_to_flush as well. Say that a user swaps in a
regular XDP program, then xsk_rcv() will be called until the flush
occurs, right? IOW, all packets processed (napi budget) in the napi_poll
will end up in the socket.
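
Roughly (untested, and xsk_to_flush would be a new field next to ri->xsk),
mirroring the existing map/map_to_flush handling:

	/* xdp_do_redirect() side: consume ri->xsk per packet, but latch
	 * the socket for the deferred flush so a program swap in between
	 * cannot strand frames that were already queued.
	 */
	xsk = ri->xsk;
	if (xsk) {
		ri->xsk = NULL;		/* only valid for this packet */
		ri->xsk_to_flush = xsk;
		return xsk_rcv(xsk, xdp);
	}

	/* xdp_do_flush_map() side: */
	xsk = ri->xsk_to_flush;
	ri->xsk_to_flush = NULL;
	if (xsk)
		xsk_flush(xsk);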

>   
>   	if (likely(map))
>   		return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
> @@ -8934,4 +8970,26 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
>   
>   const struct bpf_prog_ops sk_reuseport_prog_ops = {
>   };
> +
> +#ifdef CONFIG_XDP_SOCKETS
> +DEFINE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
> +EXPORT_SYMBOL_GPL(xdp_direct_xsk_needed);
> +
> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
> +{
> +	struct xdp_sock *xsk;
> +
> +	xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
> +	if (xsk) {
> +		struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +		ri->xsk = xsk;
> +		return XDP_REDIRECT;

From the comment above. I *think* you need an ri->xsk_to_flush here. Can the
direct socket (swap socket) change before the flush?

> +	}
> +
> +	return XDP_PASS;
> +}
> +EXPORT_SYMBOL(bpf_direct_xsk);
> +#endif
> +
>   #endif /* CONFIG_INET */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index fa8fbb8fa3c8..8a29939bac7e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -24,6 +24,7 @@
>   #include <linux/rculist.h>
>   #include <net/xdp_sock.h>
>   #include <net/xdp.h>
> +#include <linux/if_link.h>
>   
>   #include "xsk_queue.h"
>   #include "xdp_umem.h"
> @@ -264,6 +265,45 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>   	return err;
>   }
>   
> +static void xdp_clear_direct_xsk(struct xdp_sock *xsk)

Use xs, and not xsk for consistency.

> +{
> +	struct net_device *dev = xsk->dev;
> +	u32 qid = xsk->queue_id;
> +
> +	if (!dev->_rx[qid].xsk)
> +		return;
> +
> +	dev_clear_direct_xsk_prog(dev);
> +	dev->_rx[qid].xsk = NULL;
> +	static_branch_dec(&xdp_direct_xsk_needed);
> +	xsk->flags &= ~XDP_SOCK_DIRECT;
> +}
> +
> +static int xdp_set_direct_xsk(struct xdp_sock *xsk)

Same here.

> +{
> +	struct net_device *dev = xsk->dev;
> +	u32 qid = xsk->queue_id;
> +	int err = 0;
> +
> +	if (dev->_rx[qid].xsk)
> +		return -EEXIST;
> +
> +	xsk->flags |= XDP_SOCK_DIRECT;
> +	static_branch_inc(&xdp_direct_xsk_needed);
> +	dev->_rx[qid].xsk = xsk;
> +	err = dev_set_direct_xsk_prog(dev);
> +	if (err)
> +		xdp_clear_direct_xsk(xsk);
> +
> +	return err;
> +}
> +
> +struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 queue_id)
> +{
> +	return dev->_rx[queue_id].xsk;
> +}
> +EXPORT_SYMBOL(xdp_get_xsk_from_qid);
> +
>   void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
>   {
>   	xskq_produce_flush_addr_n(umem->cq, nb_entries);
> @@ -464,6 +504,11 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
>   		return;
>   	WRITE_ONCE(xs->state, XSK_UNBOUND);
>   
> +	if (xs->flags & XDP_SOCK_DIRECT) {
> +		rtnl_lock();
> +		xdp_clear_direct_xsk(xs);
> +		rtnl_unlock();
> +	}
>   	/* Wait for driver to stop using the xdp socket. */
>   	xdp_del_sk_umem(xs->umem, xs);
>   	xs->dev = NULL;
> @@ -604,7 +649,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   
>   	flags = sxdp->sxdp_flags;
>   	if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY |
> -		      XDP_USE_NEED_WAKEUP))
> +		      XDP_USE_NEED_WAKEUP | XDP_DIRECT))
>   		return -EINVAL;
>   
>   	rtnl_lock();
> @@ -632,7 +677,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   		struct socket *sock;
>   
>   		if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY) ||
> -		    (flags & XDP_USE_NEED_WAKEUP)) {
> +		    (flags & XDP_USE_NEED_WAKEUP) || (flags & XDP_DIRECT)) {
>   			/* Cannot specify flags for shared sockets. */
>   			err = -EINVAL;
>   			goto out_unlock;
> @@ -688,6 +733,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   	xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
>   	xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);
>   	xdp_add_sk_umem(xs->umem, xs);
> +	if (flags & XDP_DIRECT)
> +		err = xdp_set_direct_xsk(xs);
>   
>   out_unlock:
>   	if (err) {
> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> index be328c59389d..d202b5d40859 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -25,6 +25,11 @@
>    * application.
>    */
>   #define XDP_USE_NEED_WAKEUP (1 << 3)
> +/* This option allows an AF_XDP socket bound to a queue to receive all
> + * the packets directly from that queue when there is no XDP program
> + * attached to the device.
> + */
> +#define XDP_DIRECT	(1 << 4)
>   
>   /* Flags for xsk_umem_config flags */
>   #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
> 


* Re: [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag.
  2019-10-08  6:16 ` [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag Sridhar Samudrala
@ 2019-10-08  8:05   ` Björn Töpel
  0 siblings, 0 replies; 23+ messages in thread
From: Björn Töpel @ 2019-10-08  8:05 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On 2019-10-08 08:16, Sridhar Samudrala wrote:
> Don't allow an AF_XDP socket to bind with the XDP_DIRECT bind flag
> when a normal XDP program is already attached to the device.
> 
> Don't attach the default XDP program when an AF_XDP socket is created
> with the XDP_DIRECT bind flag.
>

I'd like this to be the default for xsk.c, and if not supported, fall back
to the old code.
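
E.g. in xsk_socket__create(), try XDP_DIRECT first and only fall back to
attaching the default program when the kernel rejects the flag (untested
sketch; assumes older kernels fail the bind() with EINVAL for unknown bind
flags, which is what xsk_bind() does):

	sxdp.sxdp_flags = xsk->config.bind_flags | XDP_DIRECT;
	err = bind(xsk->fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
	if (err && errno == EINVAL) {
		/* No XDP_DIRECT support: bind the old way and let
		 * xsk_setup_xdp_prog() attach the default program.
		 */
		sxdp.sxdp_flags = xsk->config.bind_flags;
		err = bind(xsk->fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
	}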


> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>   tools/lib/bpf/xsk.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index d5f4900e5c54..953b479040cd 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -454,6 +454,9 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>   		return err;
>   
>   	if (!prog_id) {
> +		if (xsk->config.bind_flags & XDP_DIRECT)
> +			return 0;
> +
>   		err = xsk_create_bpf_maps(xsk);
>   		if (err)
>   			return err;
> @@ -464,6 +467,9 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>   			return err;
>   		}
>   	} else {
> +		if (xsk->config.bind_flags & XDP_DIRECT)
> +			return -EEXIST;
> +
>   		xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
>   		err = xsk_lookup_bpf_maps(xsk);
>   		if (err) {
> 


* Re: [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode
  2019-10-08  6:16 ` [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode Sridhar Samudrala
@ 2019-10-08  8:05   ` Björn Töpel
  0 siblings, 0 replies; 23+ messages in thread
From: Björn Töpel @ 2019-10-08  8:05 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On 2019-10-08 08:16, Sridhar Samudrala wrote:
> This option enables an AF_XDP socket to bind with the XDP_DIRECT flag
> so that packets received on the associated queue are delivered directly
> to the socket when no XDP program is attached.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

Acked-by: Björn Töpel <bjorn.topel@intel.com>

> ---
>   samples/bpf/xdpsock_user.c | 17 +++++++++++++----
>   1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index 405c4e091f8b..6f4633769968 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -129,6 +129,9 @@ static void print_benchmark(bool running)
>   	if (opt_poll)
>   		printf("poll() ");
>   
> +	if (opt_xdp_bind_flags & XDP_DIRECT)
> +		printf("direct-xsk ");
> +
>   	if (running) {
>   		printf("running...");
>   		fflush(stdout);
> @@ -202,7 +205,8 @@ static void int_exit(int sig)
>   	dump_stats();
>   	xsk_socket__delete(xsks[0]->xsk);
>   	(void)xsk_umem__delete(umem);
> -	remove_xdp_program();
> +	if (!(opt_xdp_bind_flags & XDP_DIRECT))
> +		remove_xdp_program();
>   
>   	exit(EXIT_SUCCESS);
>   }
> @@ -213,7 +217,8 @@ static void __exit_with_error(int error, const char *file, const char *func,
>   	fprintf(stderr, "%s:%s:%i: errno: %d/\"%s\"\n", file, func,
>   		line, error, strerror(error));
>   	dump_stats();
> -	remove_xdp_program();
> +	if (!(opt_xdp_bind_flags & XDP_DIRECT))
> +		remove_xdp_program();
>   	exit(EXIT_FAILURE);
>   }
>   
> @@ -363,6 +368,7 @@ static struct option long_options[] = {
>   	{"frame-size", required_argument, 0, 'f'},
>   	{"no-need-wakeup", no_argument, 0, 'm'},
>   	{"unaligned", no_argument, 0, 'u'},
> +	{"direct-xsk", no_argument, 0, 'd'},
>   	{0, 0, 0, 0}
>   };
>   
> @@ -386,6 +392,7 @@ static void usage(const char *prog)
>   		"  -m, --no-need-wakeup Turn off use of driver need wakeup flag.\n"
>   		"  -f, --frame-size=n   Set the frame size (must be a power of two in aligned mode, default is %d).\n"
>   		"  -u, --unaligned	Enable unaligned chunk placement\n"
> +		"  -d, --direct-xsk	Direct packets to XDP socket.\n"
>   		"\n";
>   	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
>   	exit(EXIT_FAILURE);
> @@ -398,7 +405,7 @@ static void parse_command_line(int argc, char **argv)
>   	opterr = 0;
>   
>   	for (;;) {
> -		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mu",
> +		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mud",
>   				long_options, &option_index);
>   		if (c == -1)
>   			break;
> @@ -452,7 +459,9 @@ static void parse_command_line(int argc, char **argv)
>   			opt_need_wakeup = false;
>   			opt_xdp_bind_flags &= ~XDP_USE_NEED_WAKEUP;
>   			break;
> -
> +		case 'd':
> +			opt_xdp_bind_flags |= XDP_DIRECT;
> +			break;
>   		default:
>   			usage(basename(argv[0]));
>   		}
> 


* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:58   ` Toke Høiland-Jørgensen
@ 2019-10-08  8:47     ` Björn Töpel
  2019-10-08  8:48       ` Björn Töpel
  2019-10-08  9:04       ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 23+ messages in thread
From: Björn Töpel @ 2019-10-08  8:47 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sridhar Samudrala, Karlsson, Magnus, Björn Töpel,
	Netdev, bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert

On Tue, 8 Oct 2019 at 08:59, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
>
> >  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> >                   struct bpf_prog *xdp_prog)
> >  {
> >       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> >       struct bpf_map *map = READ_ONCE(ri->map);
> > +     struct xdp_sock *xsk;
> > +
> > +     xsk = xdp_get_direct_xsk(ri);
> > +     if (xsk)
> > +             return xsk_rcv(xsk, xdp);
>
> This is a new branch and a read barrier in the XDP_REDIRECT fast path.
> What's the performance impact of that for non-XSK redirect?
>

The dependent-read-barrier in READ_ONCE? Another branch -- leave that
to the branch-predictor already! ;-) No, you're right, performance
impact here is interesting. I guess the same static_branch could be
used here as well...
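
E.g. (sketch) reuse xdp_direct_xsk_needed in xdp_do_redirect(), so the extra
check is patched out completely while no direct sockets exist:

#ifdef CONFIG_XDP_SOCKETS
	if (static_branch_unlikely(&xdp_direct_xsk_needed)) {
		struct xdp_sock *xsk = ri->xsk;	/* plain load should do */

		if (xsk)
			return xsk_rcv(xsk, xdp);
	}
#endif
	if (likely(map))
		return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);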


> -Toke
>


* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  8:47     ` Björn Töpel
@ 2019-10-08  8:48       ` Björn Töpel
  2019-10-08  9:04       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 23+ messages in thread
From: Björn Töpel @ 2019-10-08  8:48 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sridhar Samudrala, Karlsson, Magnus, Björn Töpel,
	Netdev, bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert

On Tue, 8 Oct 2019 at 10:47, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
[...]
>
> The dependent-read-barrier in READ_ONCE? Another branch -- leave that
> to the branch-predictor already! ;-) No, you're right, performance
> impact here is interesting. I guess the same static_branch could be
> used here as well...
>

...and I think the READ_ONCE can be omitted.


* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  8:47     ` Björn Töpel
  2019-10-08  8:48       ` Björn Töpel
@ 2019-10-08  9:04       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-08  9:04 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Sridhar Samudrala, Karlsson, Magnus, Björn Töpel,
	Netdev, bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Tue, 8 Oct 2019 at 08:59, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
>>
>> >  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>> >                   struct bpf_prog *xdp_prog)
>> >  {
>> >       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> >       struct bpf_map *map = READ_ONCE(ri->map);
>> > +     struct xdp_sock *xsk;
>> > +
>> > +     xsk = xdp_get_direct_xsk(ri);
>> > +     if (xsk)
>> > +             return xsk_rcv(xsk, xdp);
>>
>> This is a new branch and a read barrier in the XDP_REDIRECT fast path.
>> What's the performance impact of that for non-XSK redirect?
>>
>
> The dependent-read-barrier in READ_ONCE? Another branch -- leave that
> to the branch-predictor already! ;-) No, you're right, performance
> impact here is interesting. I guess the same static_branch could be
> used here as well...

In any case, some measurements to see the impact (or lack thereof) would
be useful :)

-Toke



* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
                   ` (4 preceding siblings ...)
  2019-10-08  8:05 ` [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Björn Töpel
@ 2019-10-09  0:49 ` Jakub Kicinski
  2019-10-09  6:29   ` Samudrala, Sridhar
  5 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2019-10-09  0:49 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: magnus.karlsson, bjorn.topel, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On Mon,  7 Oct 2019 23:16:51 -0700, Sridhar Samudrala wrote:
> This is a rework of the following patch series 
> https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r
> that tried to enable direct receive by bypassing XDP program attached
> to the device.
> 
> Based on the community feedback and some suggestions from Bjorn, changed
> the semantics of the implementation to enable direct receive on AF_XDP
> sockets that are bound to a queue only when there is no normal XDP program
> attached to the device.
> 
> This is accomplished by introducing a special BPF prog pointer (DIRECT_XSK)
> that is attached at the time of binding an AF_XDP socket to a queue of a
> device. This is done only if there is no other XDP program attached to
> the device. The normal XDP program has precedence and will replace the
> DIRECT_XSK prog if it is attached later. The main reason to introduce a
> special BPF prog pointer is to minimize the driver changes. The only change
> is to use the bpf_get_prog_id() helper when QUERYING the prog id.
> 
> Any attach of a normal XDP program will take precedence and the direct xsk
> program will be removed. The direct XSK program will be attached
> automatically when the normal XDP program is removed when there are any
> AF_XDP direct sockets associated with that device.
> 
> A static key is used to control this feature in order to avoid any overhead
> for normal XDP datapath when there are no AF_XDP sockets in direct-xsk mode.

Don't say that static branches have no overhead. That's dishonest.

> Here is some performance data i collected on my Intel Ivybridge based
> development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
> NIC: Intel 40Gb ethernet (i40e)
> 
> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>    default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>    direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -d -q 1
> 6.1x improvement in drop rate
> 
> xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
>    default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
>    direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
> 6x improvement in drop rate
> 
> xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
>    default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
>    direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
> 3.5x improvement in l2fwd rate
> 
> xdpsock l2fwd 2 core (app and queue's irq pinned to different cores)
>    default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
>    direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
> 4.5x improvement in l2fwd rate

I asked you to add numbers for handling those use cases in the kernel
directly.

> dpdk-pktgen is used to send 64byte UDP packets from a link partner and 
> ethtool ntuple flow rule is used to redirect packets to queue 1 on the 
> system under test.

Obviously still nack from me.


* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
  2019-10-08  6:58   ` Toke Høiland-Jørgensen
  2019-10-08  8:05   ` Björn Töpel
@ 2019-10-09  1:20   ` Alexei Starovoitov
       [not found]     ` <3ED8E928C4210A4289A677D2FEB48235140134CE@fmsmsx111.amr.corp.intel.com>
  2 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2019-10-09  1:20 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Karlsson, Magnus, Björn Töpel, Network Development,
	bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert

On Mon, Oct 7, 2019 at 11:18 PM Sridhar Samudrala
<sridhar.samudrala@intel.com> wrote:
> +
> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
> +{
> +       struct xdp_sock *xsk;
> +
> +       xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
> +       if (xsk) {
> +               struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +               ri->xsk = xsk;
> +               return XDP_REDIRECT;
> +       }
> +
> +       return XDP_PASS;
> +}
> +EXPORT_SYMBOL(bpf_direct_xsk);

So you're saying there is a:
"""
xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
   default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>    direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -d -q 1
6.1x improvement in drop rate
"""

6.1x gain running above C code vs exactly equivalent BPF code?
How is that possible?


* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-09  0:49 ` Jakub Kicinski
@ 2019-10-09  6:29   ` Samudrala, Sridhar
  2019-10-09 16:53     ` Jakub Kicinski
  0 siblings, 1 reply; 23+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09  6:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: magnus.karlsson, bjorn.topel, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert



On 10/8/2019 5:49 PM, Jakub Kicinski wrote:
> On Mon,  7 Oct 2019 23:16:51 -0700, Sridhar Samudrala wrote:
>> This is a rework of the following patch series
>> https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r
>> that tried to enable direct receive by bypassing XDP program attached
>> to the device.
>>
>> Based on the community feedback and some suggestions from Bjorn, changed
>> the semantics of the implementation to enable direct receive on AF_XDP
>> sockets that are bound to a queue only when there is no normal XDP program
>> attached to the device.
>>
>> This is accomplished by introducing a special BPF prog pointer (DIRECT_XSK)
>> that is attached at the time of binding an AF_XDP socket to a queue of a
>> device. This is done only if there is no other XDP program attached to
>> the device. The normal XDP program has precedence and will replace the
>> DIRECT_XSK prog if it is attached later. The main reason to introduce a
>> special BPF prog pointer is to minimize the driver changes. The only change
>> is to use the bpf_get_prog_id() helper when QUERYING the prog id.
>>
>> Any attach of a normal XDP program will take precedence and the direct xsk
>> program will be removed. The direct XSK program will be attached
>> automatically when the normal XDP program is removed when there are any
>> AF_XDP direct sockets associated with that device.
>>
>> A static key is used to control this feature in order to avoid any overhead
>> for normal XDP datapath when there are no AF_XDP sockets in direct-xsk mode.
> 
> Don't say that static branches have no overhead. That's dishonest.

I didn't mean to say no overhead, but the overhead is minimized using 
static_branch_unlikely()

> 
>> Here is some performance data i collected on my Intel Ivybridge based
>> development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
>> NIC: Intel 40Gb ethernet (i40e)
>>
>> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -d -q 1
>> 6.1x improvement in drop rate
>>
>> xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
>>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
>>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
>> 6x improvement in drop rate
>>
>> xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
>>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
>>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
>> 3.5x improvement in l2fwd rate
>>
>> xdpsock l2fwd 2 core (app and queue's irq pinned to different cores)
>>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
>>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
>> 4.5x improvement in l2fwd rate
> 
> I asked you to add numbers for handling those use cases in the kernel
> directly.

Forgot to explicitly mention that I didn't see any regressions with 
xdp1, xdp2 or xdpsock in default mode with these patches. Performance 
remained the same.

> 
>> dpdk-pktgen is used to send 64byte UDP packets from a link partner and
>> ethtool ntuple flow rule is used to redirect packets to queue 1 on the
>> system under test.
> 
> Obviously still nack from me.
> 


* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-08  8:05 ` [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Björn Töpel
@ 2019-10-09 16:19   ` Samudrala, Sridhar
  0 siblings, 0 replies; 23+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09 16:19 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, netdev, bpf,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
  Cc: Jakub Kicinski

On 10/8/2019 1:05 AM, Björn Töpel wrote:
> On 2019-10-08 08:16, Sridhar Samudrala wrote:
>> This is a rework of the following patch series
>> https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r 
>>
>> that tried to enable direct receive by bypassing XDP program attached
>> to the device.
>>
>> Based on the community feedback and some suggestions from Bjorn, changed
>> the semantics of the implementation to enable direct receive on AF_XDP
>> sockets that are bound to a queue only when there is no normal XDP 
>> program
>> attached to the device.
>>
>> This is accomplished by introducing a special BPF prog pointer 
>> (DIRECT_XSK)
>> that is attached at the time of binding an AF_XDP socket to a queue of a
>> device. This is done only if there is no other XDP program attached to
>> the device. The normal XDP program has precedence and will replace the
>> DIRECT_XSK prog if it is attached later. The main reason to introduce a
>> special BPF prog pointer is to minimize the driver changes. The only 
>> change
>> is to use the bpf_get_prog_id() helper when QUERYING the prog id.
>>
>> Any attach of a normal XDP program will take precedence and the direct 
>> xsk
>> program will be removed. The direct XSK program will be attached
>> automatically when the normal XDP program is removed when there are any
>> AF_XDP direct sockets associated with that device.
>>
>> A static key is used to control this feature in order to avoid any 
>> overhead
>> for normal XDP datapath when there are no AF_XDP sockets in direct-xsk 
>> mode.
>>
>> Here is some performance data i collected on my Intel Ivybridge based
>> development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
>> NIC: Intel 40Gb ethernet (i40e)
>>
>> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -d -q 1
>> 6.1x improvement in drop rate
>>
>> xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
>>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
>>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
>> 6x improvement in drop rate
>>
>> xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
>>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
>>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
>> 3.5x improvement in l2fwd rate
>>
>> xdpsock l2fwd 2 core (app and queue's irq pinned to different cores)
>>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
>>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
>> 4.5x improvement in l2fwd rate
>>
>> dpdk-pktgen is used to send 64byte UDP packets from a link partner and
>> ethtool ntuple flow rule is used to redirect packets to queue 1 on the
>> system under test.
>>
> 
> Thanks for working on this Sridhar! I like this approach! Except from
> the bpf_get_prog_id() changes, no driver changes are needed.
> 
> It's also a cleaner (IMO) approach than my previous attempts [1,2,3]
> 
> Would be interesting to see NFP support AF_XDP offloading with this
> option. (nudge, nudge).
> 
> A thought: From userland, a direct AF_XDP socket will not appear as an
> XDP program is attached to the device (id == 0). Maybe show in ss(8)
> (via xsk_diag.c) that the socket is direct?

Sure, will add this in the next revision.

> 
> [1] 
> https://lore.kernel.org/netdev/CAJ+HfNj63QcLY8=y1fF93PZd3XcfiGSrbbWdiGByjTzZQydSSg@mail.gmail.com/ 
> 
> [2] 
> https://lore.kernel.org/netdev/cd952f99-6bad-e0c8-5bcd-f0010218238c@intel.com/ 
> 
> [3] 
> https://lore.kernel.org/netdev/20181207114431.18038-1-bjorn.topel@gmail.com/ 
> 


* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  8:05   ` Björn Töpel
@ 2019-10-09 16:32     ` Samudrala, Sridhar
  0 siblings, 0 replies; 23+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09 16:32 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, netdev, bpf,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

On 10/8/2019 1:05 AM, Björn Töpel wrote:
> On 2019-10-08 08:16, Sridhar Samudrala wrote:
>> Introduce a flag that can be specified during the bind() call
>> of an AF_XDP socket to receive packets directly from a queue when 
>> there is
>> no XDP program attached to the device.
>>
>> This is enabled by introducing a special BPF prog pointer called
>> BPF_PROG_DIRECT_XSK and a new bind flag XDP_DIRECT that can be specified
>> when binding an AF_XDP socket to a queue. At the time of bind(), an 
>> AF_XDP
>> socket in XDP_DIRECT mode, will attach BPF_PROG_DIRECT_XSK as a bpf 
>> program
>> if there is no other XDP program attached to the device. The device 
>> receive
>> queue is also associated with the AF_XDP socket.
>>
>> In the XDP receive path, if the bpf program is a DIRECT_XSK program, the
>> XDP buffer is passed to the AF_XDP socket that is associated with the
>> receive queue on which the packet is received.
>>
>> To avoid any overhead for normal XDP programs, a static key is used to
>> keep
>> a count of AF_XDP direct xsk sockets and static_branch_unlikely() is used
>> to handle receives for direct XDP sockets.
>>
>> Any attach of a normal XDP program will take precedence and the direct 
>> xsk
>> program will be removed. The direct XSK program will be attached
>> automatically when the normal XDP program is removed when there are any
>> AF_XDP direct sockets associated with that device.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>>   include/linux/filter.h            | 18 ++++++++++++
>>   include/linux/netdevice.h         | 10 +++++++
>>   include/net/xdp_sock.h            |  5 ++++
>>   include/uapi/linux/if_xdp.h       |  5 ++++
>>   kernel/bpf/syscall.c              |  7 +++--
>>   net/core/dev.c                    | 50 
>> +++++++++++++++++++++++++++++++++
>>   net/core/filter.c                 | 58 
>> +++++++++++++++++++++++++++++++++++++++
>>   net/xdp/xsk.c                     | 51 
>> ++++++++++++++++++++++++++++++++--
>>   tools/include/uapi/linux/if_xdp.h |  5 ++++
>>   9 files changed, 204 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 2ce57645f3cd..db4ad85d8321 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -585,6 +585,9 @@ struct bpf_redirect_info {
>>       struct bpf_map *map;
>>       struct bpf_map *map_to_flush;
>>       u32 kern_flags;
>> +#ifdef CONFIG_XDP_SOCKETS
>> +    struct xdp_sock *xsk;
>> +#endif
>>   };
>>   DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
>> @@ -693,6 +696,16 @@ static inline u32 bpf_prog_run_clear_cb(const 
>> struct bpf_prog *prog,
>>       return res;
>>   }
>> +#ifdef CONFIG_XDP_SOCKETS
>> +#define BPF_PROG_DIRECT_XSK    0x1
>> +#define bpf_is_direct_xsk_prog(prog) \
>> +    ((unsigned long)prog == BPF_PROG_DIRECT_XSK)
>> +DECLARE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
>> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp);
>> +#else
>> +#define bpf_is_direct_xsk_prog(prog) (false)
>> +#endif
>> +
>>   static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog 
>> *prog,
>>                           struct xdp_buff *xdp)
>>   {
>> @@ -702,6 +715,11 @@ static __always_inline u32 bpf_prog_run_xdp(const 
>> struct bpf_prog *prog,
>>        * already takes rcu_read_lock() when fetching the program, so
>>        * it's not necessary here anymore.
>>        */
>> +#ifdef CONFIG_XDP_SOCKETS
>> +    if (static_branch_unlikely(&xdp_direct_xsk_needed) &&
>> +        bpf_is_direct_xsk_prog(prog))
>> +        return bpf_direct_xsk(prog, xdp);
>> +#endif
>>       return BPF_PROG_RUN(prog, xdp);
>>   }
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 48cc71aae466..f4d0f70aa718 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -743,6 +743,7 @@ struct netdev_rx_queue {
>>       struct xdp_rxq_info        xdp_rxq;
>>   #ifdef CONFIG_XDP_SOCKETS
>>       struct xdp_umem                 *umem;
>> +    struct xdp_sock            *xsk;
>>   #endif
>>   } ____cacheline_aligned_in_smp;
>> @@ -1836,6 +1837,10 @@ struct net_device {
>>       atomic_t        carrier_up_count;
>>       atomic_t        carrier_down_count;
>> +#ifdef CONFIG_XDP_SOCKETS
>> +    u16            direct_xsk_count;
> 
> Why u16? num_rx/tx_queues are unsigned ints.

OK. Will change it to unsigned int.

> 
>> +#endif
>> +
>>   #ifdef CONFIG_WIRELESS_EXT
>>       const struct iw_handler_def *wireless_handlers;
>>       struct iw_public_data    *wireless_data;
>> @@ -3712,6 +3717,11 @@ int dev_forward_skb(struct net_device *dev, 
>> struct sk_buff *skb);
>>   bool is_skb_forwardable(const struct net_device *dev,
>>               const struct sk_buff *skb);
>> +#ifdef CONFIG_XDP_SOCKETS
>> +int dev_set_direct_xsk_prog(struct net_device *dev);
>> +int dev_clear_direct_xsk_prog(struct net_device *dev);
>> +#endif
>> +
>>   static __always_inline int ____dev_forward_skb(struct net_device *dev,
>>                              struct sk_buff *skb)
>>   {
>> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>> index c9398ce7960f..9158233d34e1 100644
>> --- a/include/net/xdp_sock.h
>> +++ b/include/net/xdp_sock.h
>> @@ -76,6 +76,9 @@ struct xsk_map_node {
>>       struct xdp_sock **map_entry;
>>   };
>> +/* Flags for the xdp_sock flags field. */
>> +#define XDP_SOCK_DIRECT (1 << 0)
>> +
>>   struct xdp_sock {
>>       /* struct sock must be the first member of struct xdp_sock */
>>       struct sock sk;
>> @@ -104,6 +107,7 @@ struct xdp_sock {
>>       struct list_head map_list;
>>       /* Protects map_list */
>>       spinlock_t map_list_lock;
>> +    u16 flags;
> 
> Right now only the XDP_DIRECT is tracked here. Maybe track all flags, 
> and show them in xsk_diag.

I see zc as the other field that can be converted into a flag.
Do you want to include that as part of this series or can it be done later?
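
If zc is converted too, the flag block could simply grow (the new bit name
below is hypothetical):

/* Flags for the xdp_sock flags field. */
#define XDP_SOCK_DIRECT   (1 << 0)
#define XDP_SOCK_ZEROCOPY (1 << 1)	/* would replace the zc bool */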

> 
>>   };
>>   struct xdp_buff;
>> @@ -129,6 +133,7 @@ void xsk_set_tx_need_wakeup(struct xdp_umem *umem);
>>   void xsk_clear_rx_need_wakeup(struct xdp_umem *umem);
>>   void xsk_clear_tx_need_wakeup(struct xdp_umem *umem);
>>   bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem);
>> +struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 
>> queue_id);
>>   void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>>                    struct xdp_sock **map_entry);
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index be328c59389d..d202b5d40859 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -25,6 +25,11 @@
>>    * application.
>>    */
>>   #define XDP_USE_NEED_WAKEUP (1 << 3)
>> +/* This option allows an AF_XDP socket bound to a queue to receive all
>> + * the packets directly from that queue when there is no XDP program
>> + * attached to the device.
>> + */
>> +#define XDP_DIRECT    (1 << 4)
>>   /* Flags for xsk_umem_config flags */
>>   #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 205f95af67d2..871d738a78d2 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -1349,13 +1349,14 @@ static void __bpf_prog_put(struct bpf_prog 
>> *prog, bool do_idr_lock)
>>   void bpf_prog_put(struct bpf_prog *prog)
>>   {
>> -    __bpf_prog_put(prog, true);
>> +    if (!bpf_is_direct_xsk_prog(prog))
>> +        __bpf_prog_put(prog, true);
>>   }
>>   EXPORT_SYMBOL_GPL(bpf_prog_put);
>>   u32 bpf_get_prog_id(const struct bpf_prog *prog)
>>   {
>> -    if (prog)
>> +    if (prog && !bpf_is_direct_xsk_prog(prog))
>>           return prog->aux->id;
>>       return 0;
>> @@ -1364,7 +1365,7 @@ EXPORT_SYMBOL(bpf_get_prog_id);
>>   void bpf_set_prog_id(struct bpf_prog *prog, u32 id)
>>   {
>> -    if (prog)
>> +    if (prog && !bpf_is_direct_xsk_prog(prog))
>>           prog->aux->id = id;
>>   }
>>   EXPORT_SYMBOL(bpf_set_prog_id);
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 866d0ad936a5..eb3cd718e580 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -8269,6 +8269,10 @@ int dev_change_xdp_fd(struct net_device *dev, 
>> struct netlink_ext_ack *extack,
>>       } else {
>>           if (!__dev_xdp_query(dev, bpf_op, query))
>>               return 0;
>> +#ifdef CONFIG_XDP_SOCKETS
>> +        if (dev->direct_xsk_count)
>> +            prog = (void *)BPF_PROG_DIRECT_XSK;
>> +#endif
> 
> Nit, but maybe hide this weirdness in a function?

OK.

> 
>>       }
>>       err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
>> @@ -8278,6 +8282,52 @@ int dev_change_xdp_fd(struct net_device *dev, 
>> struct netlink_ext_ack *extack,
>>       return err;
>>   }
>> +#ifdef CONFIG_XDP_SOCKETS
>> +int dev_set_direct_xsk_prog(struct net_device *dev)
>> +{
>> +    const struct net_device_ops *ops = dev->netdev_ops;
>> +    struct bpf_prog *prog;
>> +    bpf_op_t bpf_op;
>> +
>> +    ASSERT_RTNL();
>> +
>> +    dev->direct_xsk_count++;
>> +
>> +    bpf_op = ops->ndo_bpf;
>> +    if (!bpf_op)
>> +        return -EOPNOTSUPP;
>> +
>> +    if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
>> +        return 0;
>> +
>> +    prog = (void *)BPF_PROG_DIRECT_XSK;
>> +
>> +    return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, prog);
>> +}
>> +
>> +int dev_clear_direct_xsk_prog(struct net_device *dev)
>> +{
>> +    const struct net_device_ops *ops = dev->netdev_ops;
>> +    bpf_op_t bpf_op;
>> +
>> +    ASSERT_RTNL();
>> +
>> +    dev->direct_xsk_count--;
>> +
>> +    if (dev->direct_xsk_count)
>> +        return 0;
>> +
>> +    bpf_op = ops->ndo_bpf;
>> +    if (!bpf_op)
>> +        return -EOPNOTSUPP;
>> +
>> +    if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
>> +        return 0;
>> +
>> +    return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, NULL);
>> +}
>> +#endif
>> +
>>   /**
>>    *    dev_new_index    -    allocate an ifindex
>>    *    @net: the applicable net namespace
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index ed6563622ce3..391d7d600284 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -73,6 +73,7 @@
>>   #include <net/lwtunnel.h>
>>   #include <net/ipv6_stubs.h>
>>   #include <net/bpf_sk_storage.h>
>> +#include <linux/static_key.h>
>>   /**
>>    *    sk_filter_trim_cap - run a packet through a socket filter
>> @@ -3546,6 +3547,22 @@ static int __bpf_tx_xdp_map(struct net_device 
>> *dev_rx, void *fwd,
>>       return 0;
>>   }
>> +#ifdef CONFIG_XDP_SOCKETS
>> +static void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
>> +{
>> +    struct xdp_sock *xsk = READ_ONCE(ri->xsk);
> 
> Why READ_ONCE here?
> 
>> +
>> +    if (xsk) {
>> +        ri->xsk = NULL;
>> +        xsk_flush(xsk);
>> +    }
>> +}
>> +#else
>> +static inline void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
>> +{
>> +}
>> +#endif
>> +
> 
> Move CONFIG_XDP_SOCKETS into the function, and remove the empty/bottom one.
> 
>>   void xdp_do_flush_map(void)
>>   {
>>       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> @@ -3568,6 +3585,8 @@ void xdp_do_flush_map(void)
>>               break;
>>           }
>>       }
>> +
>> +    xdp_do_flush_xsk(ri);
>>   }
>>   EXPORT_SYMBOL_GPL(xdp_do_flush_map);
>> @@ -3631,11 +3650,28 @@ static int xdp_do_redirect_map(struct 
>> net_device *dev, struct xdp_buff *xdp,
>>       return err;
>>   }
>> +#ifdef CONFIG_XDP_SOCKETS
>> +static inline struct xdp_sock *xdp_get_direct_xsk(struct 
>> bpf_redirect_info *ri)
>> +{
>> +    return READ_ONCE(ri->xsk);
> 
> Again, why READ_ONCE? Please leave the inlining to the compiler in .c 
> files.
> 
>> +}
>> +#else
>> +static inline struct xdp_sock *xdp_get_direct_xsk(struct 
>> bpf_redirect_info *ri)
>> +{
>> +    return NULL;
>> +}
>> +#endif
>> +
>>   int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>>               struct bpf_prog *xdp_prog)
>>   {
>>       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>>       struct bpf_map *map = READ_ONCE(ri->map);
>> +    struct xdp_sock *xsk;
>> +
>> +    xsk = xdp_get_direct_xsk(ri);
>> +    if (xsk)
>> +        return xsk_rcv(xsk, xdp);
> 
> Hmm, maybe you need a xsk_to_flush as well. Say that a user swaps in a
> regular XDP program, then xsk_rcv() will be called until the flush
> occurs, right? IOW, all packets processed (napi budget) in the napi_poll
> will end up in the socket.

I originally had xsk_to_flush to handle exactly this possibility of the XDP
program changing before the call to flush.
Will re-introduce it in the next revision.
Should I create a separate structure, bpf_direct_xsk_info, rather than
adding these fields to bpf_redirect_info?
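
For reference, a separate per-cpu structure could be as small as this
(name and layout hypothetical), which would keep bpf_redirect_info
untouched for the non-XSK paths:

struct bpf_direct_xsk_info {
	struct xdp_sock *xsk;		/* socket for the current packet */
	struct xdp_sock *xsk_to_flush;	/* latched until the napi flush */
};

DECLARE_PER_CPU(struct bpf_direct_xsk_info, bpf_direct_xsk_info);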


> 
>>       if (likely(map))
>>           return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
>> @@ -8934,4 +8970,26 @@ const struct bpf_verifier_ops 
>> sk_reuseport_verifier_ops = {
>>   const struct bpf_prog_ops sk_reuseport_prog_ops = {
>>   };
>> +
>> +#ifdef CONFIG_XDP_SOCKETS
>> +DEFINE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
>> +EXPORT_SYMBOL_GPL(xdp_direct_xsk_needed);
>> +
>> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
>> +{
>> +    struct xdp_sock *xsk;
>> +
>> +    xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
>> +    if (xsk) {
>> +        struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> +
>> +        ri->xsk = xsk;
>> +        return XDP_REDIRECT;
> 
> From the comment above. I *think* you need an ri->xsk_to_flush here. Can the
> direct socket (swap socket) change before the flush?

Yes.

> 
>> +    }
>> +
>> +    return XDP_PASS;
>> +}
>> +EXPORT_SYMBOL(bpf_direct_xsk);
>> +#endif
>> +
>>   #endif /* CONFIG_INET */
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index fa8fbb8fa3c8..8a29939bac7e 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -24,6 +24,7 @@
>>   #include <linux/rculist.h>
>>   #include <net/xdp_sock.h>
>>   #include <net/xdp.h>
>> +#include <linux/if_link.h>
>>   #include "xsk_queue.h"
>>   #include "xdp_umem.h"
>> @@ -264,6 +265,45 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
>> xdp_buff *xdp)
>>       return err;
>>   }
>> +static void xdp_clear_direct_xsk(struct xdp_sock *xsk)
> 
> Use xs, and not xsk for consistency.

OK.

> 
>> +{
>> +    struct net_device *dev = xsk->dev;
>> +    u32 qid = xsk->queue_id;
>> +
>> +    if (!dev->_rx[qid].xsk)
>> +        return;
>> +
>> +    dev_clear_direct_xsk_prog(dev);
>> +    dev->_rx[qid].xsk = NULL;
>> +    static_branch_dec(&xdp_direct_xsk_needed);
>> +    xsk->flags &= ~XDP_SOCK_DIRECT;
>> +}
>> +
>> +static int xdp_set_direct_xsk(struct xdp_sock *xsk)
> 
> Same here.
> 
>> +{
>> +    struct net_device *dev = xsk->dev;
>> +    u32 qid = xsk->queue_id;
>> +    int err = 0;
>> +
>> +    if (dev->_rx[qid].xsk)
>> +        return -EEXIST;
>> +
>> +    xsk->flags |= XDP_SOCK_DIRECT;
>> +    static_branch_inc(&xdp_direct_xsk_needed);
>> +    dev->_rx[qid].xsk = xsk;
>> +    err = dev_set_direct_xsk_prog(dev);
>> +    if (err)
>> +        xdp_clear_direct_xsk(xsk);
>> +
>> +    return err;
>> +}
>> +
>> +struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 
>> queue_id)
>> +{
>> +    return dev->_rx[queue_id].xsk;
>> +}
>> +EXPORT_SYMBOL(xdp_get_xsk_from_qid);
>> +
>>   void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
>>   {
>>       xskq_produce_flush_addr_n(umem->cq, nb_entries);
>> @@ -464,6 +504,11 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
>>           return;
>>       WRITE_ONCE(xs->state, XSK_UNBOUND);
>> +    if (xs->flags & XDP_SOCK_DIRECT) {
>> +        rtnl_lock();
>> +        xdp_clear_direct_xsk(xs);
>> +        rtnl_unlock();
>> +    }
>>       /* Wait for driver to stop using the xdp socket. */
>>       xdp_del_sk_umem(xs->umem, xs);
>>       xs->dev = NULL;
>> @@ -604,7 +649,7 @@ static int xsk_bind(struct socket *sock, struct 
>> sockaddr *addr, int addr_len)
>>       flags = sxdp->sxdp_flags;
>>       if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY |
>> -              XDP_USE_NEED_WAKEUP))
>> +              XDP_USE_NEED_WAKEUP | XDP_DIRECT))
>>           return -EINVAL;
>>       rtnl_lock();
>> @@ -632,7 +677,7 @@ static int xsk_bind(struct socket *sock, struct 
>> sockaddr *addr, int addr_len)
>>           struct socket *sock;
>>           if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY) ||
>> -            (flags & XDP_USE_NEED_WAKEUP)) {
>> +            (flags & XDP_USE_NEED_WAKEUP) || (flags & XDP_DIRECT)) {
>>               /* Cannot specify flags for shared sockets. */
>>               err = -EINVAL;
>>               goto out_unlock;
>> @@ -688,6 +733,8 @@ static int xsk_bind(struct socket *sock, struct 
>> sockaddr *addr, int addr_len)
>>       xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
>>       xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);
>>       xdp_add_sk_umem(xs->umem, xs);
>> +    if (flags & XDP_DIRECT)
>> +        err = xdp_set_direct_xsk(xs);
>>   out_unlock:
>>       if (err) {
>> diff --git a/tools/include/uapi/linux/if_xdp.h 
>> b/tools/include/uapi/linux/if_xdp.h
>> index be328c59389d..d202b5d40859 100644
>> --- a/tools/include/uapi/linux/if_xdp.h
>> +++ b/tools/include/uapi/linux/if_xdp.h
>> @@ -25,6 +25,11 @@
>>    * application.
>>    */
>>   #define XDP_USE_NEED_WAKEUP (1 << 3)
>> +/* This option allows an AF_XDP socket bound to a queue to receive all
>> + * the packets directly from that queue when there is no XDP program
>> + * attached to the device.
>> + */
>> +#define XDP_DIRECT    (1 << 4)
>>   /* Flags for xsk_umem_config flags */
>>   #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
>>


* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
       [not found]     ` <3ED8E928C4210A4289A677D2FEB48235140134CE@fmsmsx111.amr.corp.intel.com>
@ 2019-10-09 16:53       ` " Samudrala, Sridhar
  2019-10-09 17:17         ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09 16:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Karlsson, Magnus, Björn Töpel,
	Netdev, bpf, intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom


>> +
>> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
>> +{
>> +       struct xdp_sock *xsk;
>> +
>> +       xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
>> +       if (xsk) {
>> +               struct bpf_redirect_info *ri =
>> + this_cpu_ptr(&bpf_redirect_info);
>> +
>> +               ri->xsk = xsk;
>> +               return XDP_REDIRECT;
>> +       }
>> +
>> +       return XDP_PASS;
>> +}
>> +EXPORT_SYMBOL(bpf_direct_xsk);
> 
> So you're saying there is a:
> """
> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1 6.1x improvement in drop rate """
> 
> 6.1x gain running above C code vs exactly equivalent BPF code?
> How is that possible?

It seems to be due to the overhead of __bpf_prog_run on older processors 
(Ivy Bridge). The overhead is smaller on newer processors, but even on 
Skylake I see around a 1.5x improvement.
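
For reference, the "exactly equivalent BPF code" here is essentially a
trivial redirect-to-XSKMAP program like the one in the xdpsock sample. A
minimal sketch, assuming an xsks_map of type BPF_MAP_TYPE_XSKMAP indexed by
rx queue id and populated from userspace; the header and map-definition
style follow the samples/bpf convention of this era, so details may differ
from the actual xdpsock_kern.c:

    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    struct bpf_map_def SEC("maps") xsks_map = {
            .type = BPF_MAP_TYPE_XSKMAP,
            .key_size = sizeof(int),
            .value_size = sizeof(int),
            .max_entries = 64,
    };

    SEC("xdp_sock")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
            /* Redirect to the AF_XDP socket bound to this rx queue; if the
             * map has no entry for the queue, the redirect fails and the
             * packet is dropped. */
            return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
    }

With the JIT enabled this compiles down to a short native sequence, which is
the comparison requested later in the thread.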

perf report with default xdpsock
================================
Samples: 2K of event 'cycles:ppp', Event count (approx.): 8437658090
Overhead  Command          Shared Object     Symbol
   34.57%  xdpsock          xdpsock           [.] main
   17.19%  ksoftirqd/1      [kernel.vmlinux]  [k] ___bpf_prog_run
   13.12%  xdpsock          [kernel.vmlinux]  [k] ___bpf_prog_run
    4.09%  ksoftirqd/1      [kernel.vmlinux]  [k] __x86_indirect_thunk_rax
    3.08%  xdpsock          [kernel.vmlinux]  [k] nmi
    2.76%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_map_lookup_elem
    2.33%  xdpsock          [kernel.vmlinux]  [k] __x86_indirect_thunk_rax
    2.33%  ksoftirqd/1      [i40e]            [k] i40e_clean_rx_irq_zc
    2.16%  xdpsock          [kernel.vmlinux]  [k] bpf_map_lookup_elem
    1.82%  ksoftirqd/1      [kernel.vmlinux]  [k] xdp_do_redirect
    1.41%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_rcv
    1.39%  ksoftirqd/1      [kernel.vmlinux]  [k] update_curr
    1.09%  ksoftirqd/1      [kernel.vmlinux]  [k] bpf_xdp_redirect_map
    1.09%  xdpsock          [i40e]            [k] i40e_clean_rx_irq_zc
    1.08%  ksoftirqd/1      [kernel.vmlinux]  [k] __xsk_map_redirect
    1.07%  swapper          [kernel.vmlinux]  [k] xsk_umem_peek_addr
    1.05%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_umem_peek_addr
    0.89%  swapper          [kernel.vmlinux]  [k] __xsk_map_redirect
    0.87%  ksoftirqd/1      [kernel.vmlinux]  [k] __bpf_prog_run32
    0.87%  swapper          [kernel.vmlinux]  [k] intel_idle
    0.67%  xdpsock          [kernel.vmlinux]  [k] bpf_xdp_redirect_map
    0.57%  xdpsock          [kernel.vmlinux]  [k] xdp_do_redirect

perf report with direct xdpsock
===============================
Samples: 2K of event 'cycles:ppp', Event count (approx.): 17996091975
Overhead  Command          Shared Object     Symbol
   18.44%  xdpsock          [i40e]            [k] i40e_clean_rx_irq_zc
   15.14%  ksoftirqd/1      [i40e]            [k] i40e_clean_rx_irq_zc
    6.87%  xdpsock          [kernel.vmlinux]  [k] xsk_umem_peek_addr
    5.03%  ksoftirqd/1      [kernel.vmlinux]  [k] xdp_do_redirect
    4.21%  xdpsock          xdpsock           [.] main
    4.13%  ksoftirqd/1      [i40e]            [k] i40e_clean_programming_status
    3.71%  xdpsock          [kernel.vmlinux]  [k] xsk_rcv
    3.44%  ksoftirqd/1      [kernel.vmlinux]  [k] nmi
    3.41%  xdpsock          [kernel.vmlinux]  [k] nmi
    3.20%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_rcv
    2.45%  xdpsock          [kernel.vmlinux]  [k] xdp_get_xsk_from_qid
    2.35%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_umem_peek_addr
    2.33%  ksoftirqd/1      [kernel.vmlinux]  [k] net_rx_action
    2.16%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_umem_consume_tx
    2.10%  swapper          [kernel.vmlinux]  [k] __softirqentry_text_start
    2.06%  xdpsock          [kernel.vmlinux]  [k] native_irq_return_iret
    1.43%  xdpsock          [kernel.vmlinux]  [k] check_preempt_wakeup
    1.42%  xdpsock          [kernel.vmlinux]  [k] xsk_umem_consume_tx
    1.22%  xdpsock          [kernel.vmlinux]  [k] xdp_do_redirect
    1.21%  xdpsock          [kernel.vmlinux]  [k] dma_direct_sync_single_for_device
    1.16%  ksoftirqd/1      [kernel.vmlinux]  [k] irqtime_account_irq
    1.09%  xdpsock          [kernel.vmlinux]  [k] sock_def_readable
    0.99%  swapper          [kernel.vmlinux]  [k] intel_idle
    0.88%  xdpsock          [i40e]            [k] i40e_clean_programming_status
    0.74%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_umem_discard_addr
    0.71%  ksoftirqd/1      [kernel.vmlinux]  [k] __switch_to
    0.50%  ksoftirqd/1      [kernel.vmlinux]  [k] dma_direct_sync_single_for_device

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-09  6:29   ` Samudrala, Sridhar
@ 2019-10-09 16:53     ` Jakub Kicinski
  0 siblings, 0 replies; 23+ messages in thread
From: Jakub Kicinski @ 2019-10-09 16:53 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: magnus.karlsson, bjorn.topel, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On Tue, 8 Oct 2019 23:29:59 -0700, Samudrala, Sridhar wrote:
> On 10/8/2019 5:49 PM, Jakub Kicinski wrote:
> > I asked you to add numbers for handling those use cases in the kernel
> > directly.  
> 
> I forgot to explicitly mention that I didn't see any regressions with 
> xdp1, xdp2 or xdpsock in default mode with these patches. Performance 
> remained the same.

I'm not looking for regressions. The in-kernel path is faster, and
should be used for speeding things up rather than a "direct path to
user space". Your comparison should have 3 numbers - current AF_XDP,
patched AF_XDP, in-kernel handling.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-09 16:53       ` FW: " Samudrala, Sridhar
@ 2019-10-09 17:17         ` Alexei Starovoitov
  2019-10-09 19:12           ` Samudrala, Sridhar
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2019-10-09 17:17 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Karlsson, Magnus, Björn Töpel, Netdev, bpf,
	intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

On Wed, Oct 9, 2019 at 9:53 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
> >> +
> >> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
> >> +{
> >> +       struct xdp_sock *xsk;
> >> +
> >> +       xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
> >> +       if (xsk) {
> >> +               struct bpf_redirect_info *ri =
> >> +                       this_cpu_ptr(&bpf_redirect_info);
> >> +
> >> +               ri->xsk = xsk;
> >> +               return XDP_REDIRECT;
> >> +       }
> >> +
> >> +       return XDP_PASS;
> >> +}
> >> +EXPORT_SYMBOL(bpf_direct_xsk);
> >
> > So you're saying there is a:
> > """
> > xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
> >     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
> >     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1 6.1x improvement in drop rate """
> >
> > 6.1x gain running above C code vs exactly equivalent BPF code?
> > How is that possible?
>
> It seems to be due to the overhead of __bpf_prog_run on older processors
> (Ivy Bridge). The overhead is smaller on newer processors, but even on
> Skylake I see around a 1.5x improvement.
>
> perf report with default xdpsock
> ================================
> Samples: 2K of event 'cycles:ppp', Event count (approx.): 8437658090
> Overhead  Command          Shared Object     Symbol
>    34.57%  xdpsock          xdpsock           [.] main
>    17.19%  ksoftirqd/1      [kernel.vmlinux]  [k] ___bpf_prog_run
>    13.12%  xdpsock          [kernel.vmlinux]  [k] ___bpf_prog_run

That must be a bad joke.
The whole patch set is based on comparing native code to interpreter?!
It's pretty awesome that interpreter is only 1.5x slower than native x86.
Just turn the JIT on.

Obvious Nack to the patch set.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-09 17:17         ` Alexei Starovoitov
@ 2019-10-09 19:12           ` Samudrala, Sridhar
  2019-10-10  1:06             ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09 19:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Karlsson, Magnus, Björn Töpel, Netdev, bpf,
	intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

On 10/9/2019 10:17 AM, Alexei Starovoitov wrote:
> On Wed, Oct 9, 2019 at 9:53 AM Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>>
>>
>>>> +
>>>> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
>>>> +{
>>>> +       struct xdp_sock *xsk;
>>>> +
>>>> +       xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
>>>> +       if (xsk) {
>>>> +               struct bpf_redirect_info *ri =
>>>> +                       this_cpu_ptr(&bpf_redirect_info);
>>>> +
>>>> +               ri->xsk = xsk;
>>>> +               return XDP_REDIRECT;
>>>> +       }
>>>> +
>>>> +       return XDP_PASS;
>>>> +}
>>>> +EXPORT_SYMBOL(bpf_direct_xsk);
>>>
>>> So you're saying there is a:
>>> """
>>> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>>>      default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>>>      direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1 6.1x improvement in drop rate """
>>>
>>> 6.1x gain running above C code vs exactly equivalent BPF code?
>>> How is that possible?
>>
>> It seems to be due to the overhead of __bpf_prog_run on older processors
>> (Ivy Bridge). The overhead is smaller on newer processors, but even on
>> Skylake I see around a 1.5x improvement.
>>
>> perf report with default xdpsock
>> ================================
>> Samples: 2K of event 'cycles:ppp', Event count (approx.): 8437658090
>> Overhead  Command          Shared Object     Symbol
>>     34.57%  xdpsock          xdpsock           [.] main
>>     17.19%  ksoftirqd/1      [kernel.vmlinux]  [k] ___bpf_prog_run
>>     13.12%  xdpsock          [kernel.vmlinux]  [k] ___bpf_prog_run
> 
> That must be a bad joke.
> The whole patch set is based on comparing native code to interpreter?!
> It's pretty awesome that interpreter is only 1.5x slower than native x86.
> Just turn the JIT on.

Thanks, Alexei, for pointing out that I didn't have the JIT on.
When I turn it on, the performance improvement is a more modest 1.5x 
with rxdrop and 1.2x with l2fwd.

> 
> Obvious Nack to the patch set.
> 

I will update the patchset with the right performance data and address 
the feedback from Bjorn.
I hope you are not totally against the direct XDP approach, as it does 
provide value when an AF_XDP socket is bound to a queue and a HW filter 
can direct packets targeted for that queue.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-09 19:12           ` Samudrala, Sridhar
@ 2019-10-10  1:06             ` Alexei Starovoitov
  0 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2019-10-10  1:06 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Karlsson, Magnus, Björn Töpel, Netdev, bpf,
	intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

On Wed, Oct 9, 2019 at 12:12 PM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> >>     34.57%  xdpsock          xdpsock           [.] main
> >>     17.19%  ksoftirqd/1      [kernel.vmlinux]  [k] ___bpf_prog_run
> >>     13.12%  xdpsock          [kernel.vmlinux]  [k] ___bpf_prog_run
> >
> > That must be a bad joke.
> > The whole patch set is based on comparing native code to interpreter?!
> > It's pretty awesome that interpreter is only 1.5x slower than native x86.
> > Just turn the JIT on.
>
> Thanks, Alexei, for pointing out that I didn't have the JIT on.
> When I turn it on, the performance improvement is a more modest 1.5x
> with rxdrop and 1.2x with l2fwd.

I want to see perf reports back to back with proper performance analysis.

> >
> > Obvious Nack to the patch set.
> >
>
> I will update the patchset with the right performance data and address
> the feedback from Bjorn.
> I hope you are not totally against the direct XDP approach, as it does
> provide value when an AF_XDP socket is bound to a queue and a HW filter
> can direct packets targeted for that queue.

I used to be in favor of providing a "prog bypass" for AF_XDP,
because of anecdotal evidence that it would make AF_XDP faster.
Now, seeing the numbers and the way they were collected,
I'm against such a bypass.
I want to see hard proof that a trivial BPF prog is actually slowing
things down before reviewing any new patch sets.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, back to index

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
2019-10-08  6:16 ` [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions Sridhar Samudrala
2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
2019-10-08  6:58   ` Toke Høiland-Jørgensen
2019-10-08  8:47     ` Björn Töpel
2019-10-08  8:48       ` Björn Töpel
2019-10-08  9:04       ` Toke Høiland-Jørgensen
2019-10-08  8:05   ` Björn Töpel
2019-10-09 16:32     ` Samudrala, Sridhar
2019-10-09  1:20   ` Alexei Starovoitov
     [not found]     ` <3ED8E928C4210A4289A677D2FEB48235140134CE@fmsmsx111.amr.corp.intel.com>
2019-10-09 16:53       ` FW: " Samudrala, Sridhar
2019-10-09 17:17         ` Alexei Starovoitov
2019-10-09 19:12           ` Samudrala, Sridhar
2019-10-10  1:06             ` Alexei Starovoitov
2019-10-08  6:16 ` [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag Sridhar Samudrala
2019-10-08  8:05   ` Björn Töpel
2019-10-08  6:16 ` [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode Sridhar Samudrala
2019-10-08  8:05   ` Björn Töpel
2019-10-08  8:05 ` [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Björn Töpel
2019-10-09 16:19   ` Samudrala, Sridhar
2019-10-09  0:49 ` Jakub Kicinski
2019-10-09  6:29   ` Samudrala, Sridhar
2019-10-09 16:53     ` Jakub Kicinski
