BPF Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
@ 2019-10-08  6:16 Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions Sridhar Samudrala
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

This is a rework of the following patch series 
https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r
that tried to enable direct receive by bypassing XDP program attached
to the device.

Based on the community feedback and some suggestions from Bjorn, changed
the semantics of the implementation to enable direct receive on AF_XDP
sockets that are bound to a queue only when there is no normal XDP program
attached to the device.

This is accomplished by introducing a special BPF prog pointer (DIRECT_XSK)
that is attached at the time of binding an AF_XDP socket to a queue of a
device. This is done only if there is no other XDP program attached to
the device. The normal XDP program has precedence and will replace the
DIRECT_XSK prog if it is attached later. The main reason to introduce a
special BPF prog pointer is to minimize the driver changes. The only change
is to use the bpf_get_prog_id() helper when QUERYING the prog id.

Any attach of a normal XDP program will take precedence and the direct xsk
program will be removed. The direct XSK program will be attached
automatically when the normal XDP program is removed when there are any
AF_XDP direct sockets associated with that device.

A static key is used to control this feature in order to avoid any overhead
for normal XDP datapath when there are no AF_XDP sockets in direct-xsk mode.

Here is some performance data i collected on my Intel Ivybridge based
development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
NIC: Intel 40Gb ethernet (i40e)

xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
   default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
   direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
6.1x improvement in drop rate

xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
   default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
   direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
6x improvement in drop rate

xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
   default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
   direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
3.5x improvement in l2fwd rate

xdpsock rxdrop 2 core (app and queue'sirq pinned to different cores)
   default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
   direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
4.5x improvement in l2fwd rate

dpdk-pktgen is used to send 64byte UDP packets from a link partner and 
ethtool ntuple flow rule is used to redirect packets to queue 1 on the 
system under test.

Sridhar Samudrala (4):
  bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions.
  xsk: allow AF_XDP sockets to receive packets directly from a queue
  libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag.
  xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode

 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c     |  2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c  |  2 +-
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c       |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  3 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 drivers/net/ethernet/qlogic/qede/qede_filter.c    |  2 +-
 drivers/net/ethernet/socionext/netsec.c           |  2 +-
 drivers/net/netdevsim/bpf.c                       |  6 ++-
 drivers/net/tun.c                                 |  4 +-
 drivers/net/veth.c                                |  4 +-
 drivers/net/virtio_net.c                          |  3 +-
 include/linux/bpf.h                               |  3 ++
 include/linux/filter.h                            | 18 +++++++
 include/linux/netdevice.h                         | 10 ++++
 include/net/xdp_sock.h                            |  5 ++
 include/trace/events/xdp.h                        |  4 +-
 include/uapi/linux/if_xdp.h                       |  5 ++
 kernel/bpf/arraymap.c                             |  2 +-
 kernel/bpf/cgroup.c                               |  2 +-
 kernel/bpf/core.c                                 |  2 +-
 kernel/bpf/syscall.c                              | 33 +++++++++----
 kernel/events/core.c                              |  2 +-
 kernel/trace/bpf_trace.c                          |  2 +-
 net/core/dev.c                                    | 54 ++++++++++++++++++++-
 net/core/filter.c                                 | 58 +++++++++++++++++++++++
 net/core/flow_dissector.c                         |  2 +-
 net/core/rtnetlink.c                              |  2 +-
 net/core/xdp.c                                    |  2 +-
 net/ipv6/seg6_local.c                             |  2 +-
 net/sched/act_bpf.c                               |  2 +-
 net/sched/cls_bpf.c                               |  2 +-
 net/xdp/xsk.c                                     | 51 +++++++++++++++++++-
 samples/bpf/xdpsock_user.c                        | 17 +++++--
 tools/include/uapi/linux/if_xdp.h                 |  5 ++
 tools/lib/bpf/xsk.c                               |  6 +++
 38 files changed, 279 insertions(+), 53 deletions(-)

-- 
2.14.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions.
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
@ 2019-10-08  6:16 ` Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

Currently the users of bpf prog id access it directly via prog->aux->id.
Next patch in this series introduces a special bpf prog pointer to support
AF_XDP sockets bound to a queue to receive packets from that queue directly.
As the special bpf prog pointer is not associated with any struct bpf_prog
prog id is not accessible via prog->aux. To abstract this from the users,
2 helper functions to get and set prog id are introduced and all the users
are updated to use these functions.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c     |  2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c  |  2 +-
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c       |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  3 +--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |  3 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +--
 drivers/net/ethernet/qlogic/qede/qede_filter.c    |  2 +-
 drivers/net/ethernet/socionext/netsec.c           |  2 +-
 drivers/net/netdevsim/bpf.c                       |  6 +++--
 drivers/net/tun.c                                 |  4 +--
 drivers/net/veth.c                                |  4 +--
 drivers/net/virtio_net.c                          |  3 +--
 include/linux/bpf.h                               |  3 +++
 include/trace/events/xdp.h                        |  4 +--
 kernel/bpf/arraymap.c                             |  2 +-
 kernel/bpf/cgroup.c                               |  2 +-
 kernel/bpf/core.c                                 |  2 +-
 kernel/bpf/syscall.c                              | 30 +++++++++++++++++------
 kernel/events/core.c                              |  2 +-
 kernel/trace/bpf_trace.c                          |  2 +-
 net/core/dev.c                                    |  4 +--
 net/core/flow_dissector.c                         |  2 +-
 net/core/rtnetlink.c                              |  2 +-
 net/core/xdp.c                                    |  2 +-
 net/ipv6/seg6_local.c                             |  2 +-
 net/sched/act_bpf.c                               |  2 +-
 net/sched/cls_bpf.c                               |  2 +-
 29 files changed, 58 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index c6f6f2033880..ef6dd2881264 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -330,7 +330,7 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 		rc = bnxt_xdp_set(bp, xdp->prog);
 		break;
 	case XDP_QUERY_PROG:
-		xdp->prog_id = bp->xdp_prog ? bp->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(bp->xdp_prog);
 		rc = 0;
 		break;
 	default:
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 40a44dcb3d9b..5c6c680252c1 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1912,7 +1912,7 @@ static int nicvf_xdp(struct net_device *netdev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return nicvf_xdp_setup(nic, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = nic->xdp_prog ? nic->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(nic->xdp_prog);
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 162d7d8fb295..8aef671d0731 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -1831,7 +1831,7 @@ static int dpaa2_eth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return setup_xdp(dev, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = priv->xdp_prog ? priv->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(priv->xdp_prog);
 		break;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..0a59937b376e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12820,7 +12820,7 @@ static int i40e_xdp(struct net_device *dev,
 	case XDP_SETUP_PROG:
 		return i40e_xdp_setup(vsi, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(vsi->xdp_prog);
 		return 0;
 	case XDP_SETUP_XSK_UMEM:
 		return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ce2397306b9..c51ad24be037 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -10283,8 +10283,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return ixgbe_xdp_setup(dev, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = adapter->xdp_prog ?
-			adapter->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(adapter->xdp_prog);
 		return 0;
 	case XDP_SETUP_XSK_UMEM:
 		return ixgbe_xsk_umem_setup(adapter, xdp->xsk.umem,
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 076f2da36f27..dcf32a32cde8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -4496,8 +4496,7 @@ static int ixgbevf_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return ixgbevf_xdp_setup(dev, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = adapter->xdp_prog ?
-			       adapter->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(adapter->xdp_prog);
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 40ec5acf79c0..cf086748306f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2881,8 +2881,7 @@ static u32 mlx4_xdp_query(struct net_device *dev)
 	xdp_prog = rcu_dereference_protected(
 		priv->rx_ring[0]->xdp_prog,
 		lockdep_is_held(&mdev->state_lock));
-	if (xdp_prog)
-		prog_id = xdp_prog->aux->id;
+	prog_id = bpf_get_prog_id(xdp_prog);
 	mutex_unlock(&mdev->state_lock);
 
 	return prog_id;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 7569287f8f3c..45c247ff05b2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4478,8 +4478,7 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 
 	mutex_lock(&priv->state_lock);
 	xdp_prog = priv->channels.params.xdp_prog;
-	if (xdp_prog)
-		prog_id = xdp_prog->aux->id;
+	prog_id = bpf_get_prog_id(xdp_prog);
 	mutex_unlock(&priv->state_lock);
 
 	return prog_id;
diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index 9a6a9a008714..75376bb85875 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -1119,7 +1119,7 @@ int qede_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return qede_xdp_set(edev, xdp->prog);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = edev->xdp_prog ? edev->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(edev->xdp_prog);
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 55db7fbd43cc..19f5a1d850de 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -1820,7 +1820,7 @@ static int netsec_xdp(struct net_device *ndev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		return netsec_xdp_setup(priv, xdp->prog, xdp->extack);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = priv->xdp_prog ? priv->xdp_prog->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(priv->xdp_prog);
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index 2b74425822ab..2a24a7a41985 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -104,7 +104,7 @@ nsim_bpf_offload(struct netdevsim *ns, struct bpf_prog *prog, bool oldprog)
 	     "bad offload state, expected offload %sto be active",
 	     oldprog ? "" : "not ");
 	ns->bpf_offloaded = prog;
-	ns->bpf_offloaded_id = prog ? prog->aux->id : 0;
+	ns->bpf_offloaded_id = bpf_get_prog_id(prog);
 	nsim_prog_set_loaded(prog, true);
 
 	return 0;
@@ -218,6 +218,7 @@ static int nsim_bpf_create_prog(struct nsim_dev *nsim_dev,
 {
 	struct nsim_bpf_bound_prog *state;
 	char name[16];
+	u32 prog_id;
 
 	state = kzalloc(sizeof(*state), GFP_KERNEL);
 	if (!state)
@@ -235,7 +236,8 @@ static int nsim_bpf_create_prog(struct nsim_dev *nsim_dev,
 		return -ENOMEM;
 	}
 
-	debugfs_create_u32("id", 0400, state->ddir, &prog->aux->id);
+	prog_id = bpf_get_prog_id(prog);
+	debugfs_create_u32("id", 0400, state->ddir, &prog_id);
 	debugfs_create_file("state", 0400, state->ddir,
 			    &state->state, &nsim_bpf_string_fops);
 	debugfs_create_bool("loaded", 0400, state->ddir, &state->is_loaded);
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index aab0be40d443..396905d5c59a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1224,10 +1224,8 @@ static u32 tun_xdp_query(struct net_device *dev)
 	const struct bpf_prog *xdp_prog;
 
 	xdp_prog = rtnl_dereference(tun->xdp_prog);
-	if (xdp_prog)
-		return xdp_prog->aux->id;
 
-	return 0;
+	return bpf_get_prog_id(xdp_prog);
 }
 
 static int tun_xdp(struct net_device *dev, struct netdev_bpf *xdp)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9f3c839f9e5f..261b0df8dc41 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1140,10 +1140,8 @@ static u32 veth_xdp_query(struct net_device *dev)
 	const struct bpf_prog *xdp_prog;
 
 	xdp_prog = priv->_xdp_prog;
-	if (xdp_prog)
-		return xdp_prog->aux->id;
 
-	return 0;
+	return bpf_get_prog_id(xdp_prog);
 }
 
 static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba98e0971b84..5aa7f95b6c99 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2521,8 +2521,7 @@ static u32 virtnet_xdp_query(struct net_device *dev)
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		xdp_prog = rtnl_dereference(vi->rq[i].xdp_prog);
-		if (xdp_prog)
-			return xdp_prog->aux->id;
+		return bpf_get_prog_id(xdp_prog);
 	}
 	return 0;
 }
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5b9d22338606..e5b023cda42a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -692,6 +692,9 @@ int bpf_get_file_flag(int flags);
 int bpf_check_uarg_tail_zero(void __user *uaddr, size_t expected_size,
 			     size_t actual_size);
 
+u32 bpf_get_prog_id(const struct bpf_prog *prog);
+void bpf_set_prog_id(struct bpf_prog *prog, u32 id);
+
 /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
  * forced to use 'long' read/writes to try to atomically copy long counters.
  * Best-effort only.  No barriers here, since it _will_ race with concurrent
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 8c8420230a10..3369a73c27e1 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -39,7 +39,7 @@ TRACE_EVENT(xdp_exception,
 	),
 
 	TP_fast_assign(
-		__entry->prog_id	= xdp->aux->id;
+		__entry->prog_id	= bpf_get_prog_id(xdp);
 		__entry->act		= act;
 		__entry->ifindex	= dev->ifindex;
 	),
@@ -99,7 +99,7 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 	),
 
 	TP_fast_assign(
-		__entry->prog_id	= xdp->aux->id;
+		__entry->prog_id	= bpf_get_prog_id(xdp);
 		__entry->act		= XDP_REDIRECT;
 		__entry->ifindex	= dev->ifindex;
 		__entry->err		= err;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1c65ce0098a9..30037d72c176 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -589,7 +589,7 @@ static void prog_fd_array_put_ptr(void *ptr)
 
 static u32 prog_fd_array_sys_lookup_elem(void *ptr)
 {
-	return ((struct bpf_prog *)ptr)->aux->id;
+	return bpf_get_prog_id((struct bpf_prog *)ptr);
 }
 
 /* decrement refcnt of all bpf_progs that are stored in this map */
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index ddd8addcdb5c..8db882606d54 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -526,7 +526,7 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 
 		i = 0;
 		list_for_each_entry(pl, progs, node) {
-			id = pl->prog->aux->id;
+			id = bpf_get_prog_id(pl->prog);
 			if (copy_to_user(prog_ids + i, &id, sizeof(id)))
 				return -EFAULT;
 			if (++i == cnt)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 66088a9e9b9e..60ccf0e552fc 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1832,7 +1832,7 @@ static bool bpf_prog_array_copy_core(struct bpf_prog_array *array,
 	for (item = array->items; item->prog; item++) {
 		if (item->prog == &dummy_bpf_prog.prog)
 			continue;
-		prog_ids[i] = item->prog->aux->id;
+		prog_ids[i] = bpf_get_prog_id(item->prog);
 		if (++i == request_cnt) {
 			item++;
 			break;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 82eabd4e38ad..205f95af67d2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1287,7 +1287,7 @@ static int bpf_prog_alloc_id(struct bpf_prog *prog)
 	spin_lock_bh(&prog_idr_lock);
 	id = idr_alloc_cyclic(&prog_idr, prog, 1, INT_MAX, GFP_ATOMIC);
 	if (id > 0)
-		prog->aux->id = id;
+		bpf_set_prog_id(prog, id);
 	spin_unlock_bh(&prog_idr_lock);
 	idr_preload_end();
 
@@ -1305,7 +1305,7 @@ void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock)
 	 * disappears - even if someone grabs an fd to them they are unusable,
 	 * simply waiting for refcnt to drop to be freed.
 	 */
-	if (!prog->aux->id)
+	if (!bpf_get_prog_id(prog))
 		return;
 
 	if (do_idr_lock)
@@ -1313,8 +1313,8 @@ void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock)
 	else
 		__acquire(&prog_idr_lock);
 
-	idr_remove(&prog_idr, prog->aux->id);
-	prog->aux->id = 0;
+	idr_remove(&prog_idr, bpf_get_prog_id(prog));
+	bpf_set_prog_id(prog, 0);
 
 	if (do_idr_lock)
 		spin_unlock_bh(&prog_idr_lock);
@@ -1353,6 +1353,22 @@ void bpf_prog_put(struct bpf_prog *prog)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_put);
 
+u32 bpf_get_prog_id(const struct bpf_prog *prog)
+{
+	if (prog)
+		return prog->aux->id;
+
+	return 0;
+}
+EXPORT_SYMBOL(bpf_get_prog_id);
+
+void bpf_set_prog_id(struct bpf_prog *prog, u32 id)
+{
+	if (prog)
+		prog->aux->id = id;
+}
+EXPORT_SYMBOL(bpf_set_prog_id);
+
 static int bpf_prog_release(struct inode *inode, struct file *filp)
 {
 	struct bpf_prog *prog = filp->private_data;
@@ -1406,7 +1422,7 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp)
 		   prog->jited,
 		   prog_tag,
 		   prog->pages * 1ULL << PAGE_SHIFT,
-		   prog->aux->id,
+		   bpf_get_prog_id(prog),
 		   stats.nsecs,
 		   stats.cnt);
 }
@@ -2329,7 +2345,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 		return -EFAULT;
 
 	info.type = prog->type;
-	info.id = prog->aux->id;
+	info.id = bpf_get_prog_id(prog);
 	info.load_time = prog->aux->load_time;
 	info.created_by_uid = from_kuid_munged(current_user_ns(),
 					       prog->aux->user->uid);
@@ -2792,7 +2808,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 		struct bpf_raw_event_map *btp = raw_tp->btp;
 
 		err = bpf_task_fd_query_copy(attr, uattr,
-					     raw_tp->prog->aux->id,
+					     bpf_get_prog_id(raw_tp->prog),
 					     BPF_FD_TYPE_RAW_TRACEPOINT,
 					     btp->tp->name, 0, 0);
 		goto put_file;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4655adbbae10..1410951ca904 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8035,7 +8035,7 @@ void perf_event_bpf_event(struct bpf_prog *prog,
 			},
 			.type = type,
 			.flags = flags,
-			.id = prog->aux->id,
+			.id = bpf_get_prog_id(prog),
 		},
 	};
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 44bd08f2443b..35e8cd2b6b54 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1426,7 +1426,7 @@ int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
 	if (prog->type == BPF_PROG_TYPE_PERF_EVENT)
 		return -EOPNOTSUPP;
 
-	*prog_id = prog->aux->id;
+	*prog_id = bpf_get_prog_id(prog);
 	flags = event->tp_event->flags;
 	is_tracepoint = flags & TRACE_EVENT_FL_TRACEPOINT;
 	is_syscall_tp = is_syscall_trace_event(event->tp_event);
diff --git a/net/core/dev.c b/net/core/dev.c
index 7a456c6a7ad8..866d0ad936a5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5286,7 +5286,7 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 		break;
 
 	case XDP_QUERY_PROG:
-		xdp->prog_id = old ? old->aux->id : 0;
+		xdp->prog_id = bpf_get_prog_id(old);
 		break;
 
 	default:
@@ -8262,7 +8262,7 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 			return -EINVAL;
 		}
 
-		if (prog->aux->id == prog_id) {
+		if (bpf_get_prog_id(prog) == prog_id) {
 			bpf_prog_put(prog);
 			return 0;
 		}
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 6b4b88d1599d..fa8b8e88bfaa 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -89,7 +89,7 @@ int skb_flow_dissector_prog_query(const union bpf_attr *attr,
 	attached = rcu_dereference(net->flow_dissector_prog);
 	if (attached) {
 		prog_cnt = 1;
-		prog_id = attached->aux->id;
+		prog_id = bpf_get_prog_id(attached);
 	}
 	rcu_read_unlock();
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 49fa910b58af..86fd505b4111 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1387,7 +1387,7 @@ static u32 rtnl_xdp_prog_skb(struct net_device *dev)
 	generic_xdp_prog = rtnl_dereference(dev->xdp_prog);
 	if (!generic_xdp_prog)
 		return 0;
-	return generic_xdp_prog->aux->id;
+	return bpf_get_prog_id(generic_xdp_prog);
 }
 
 static u32 rtnl_xdp_prog_drv(struct net_device *dev)
diff --git a/net/core/xdp.c b/net/core/xdp.c
index d7bf62ffbb5e..0bc9a50eb318 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -469,7 +469,7 @@ EXPORT_SYMBOL_GPL(__xdp_release_frame);
 int xdp_attachment_query(struct xdp_attachment_info *info,
 			 struct netdev_bpf *bpf)
 {
-	bpf->prog_id = info->prog ? info->prog->aux->id : 0;
+	bpf->prog_id = bpf_get_prog_id(info->prog);
 	bpf->prog_flags = info->prog ? info->flags : 0;
 	return 0;
 }
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 9d4f75e0d33a..e49987655567 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -853,7 +853,7 @@ static int put_nla_bpf(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 	if (!nest)
 		return -EMSGSIZE;
 
-	if (nla_put_u32(skb, SEG6_LOCAL_BPF_PROG, slwt->bpf.prog->aux->id))
+	if (nla_put_u32(skb, SEG6_LOCAL_BPF_PROG, bpf_get_prog_id(slwt->bpf.prog)))
 		return -EMSGSIZE;
 
 	if (slwt->bpf.name &&
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 04b7bd4ec751..d55a28d0adf6 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -119,7 +119,7 @@ static int tcf_bpf_dump_ebpf_info(const struct tcf_bpf *prog,
 	    nla_put_string(skb, TCA_ACT_BPF_NAME, prog->bpf_name))
 		return -EMSGSIZE;
 
-	if (nla_put_u32(skb, TCA_ACT_BPF_ID, prog->filter->aux->id))
+	if (nla_put_u32(skb, TCA_ACT_BPF_ID, bpf_get_prog_id(prog->filter))
 		return -EMSGSIZE;
 
 	nla = nla_reserve(skb, TCA_ACT_BPF_TAG, sizeof(prog->filter->tag));
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index bf10bdaf5012..d35aedb4aee5 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -562,7 +562,7 @@ static int cls_bpf_dump_ebpf_info(const struct cls_bpf_prog *prog,
 	    nla_put_string(skb, TCA_BPF_NAME, prog->bpf_name))
 		return -EMSGSIZE;
 
-	if (nla_put_u32(skb, TCA_BPF_ID, prog->filter->aux->id))
+	if (nla_put_u32(skb, TCA_BPF_ID, bpf_get_prog_id(prog->filter))
 		return -EMSGSIZE;
 
 	nla = nla_reserve(skb, TCA_BPF_TAG, sizeof(prog->filter->tag));
-- 
2.14.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions Sridhar Samudrala
@ 2019-10-08  6:16 ` Sridhar Samudrala
  2019-10-08  6:58   ` Toke Høiland-Jørgensen
                     ` (2 more replies)
  2019-10-08  6:16 ` [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag Sridhar Samudrala
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 42+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

Introduce a flag that can be specified during the bind() call
of an AF_XDP socket to receive packets directly from a queue when there is
no XDP program attached to the device.

This is enabled by introducing a special BPF prog pointer called
BPF_PROG_DIRECT_XSK and a new bind flag XDP_DIRECT that can be specified
when binding an AF_XDP socket to a queue. At the time of bind(), an AF_XDP
socket in XDP_DIRECT mode, will attach BPF_PROG_DIRECT_XSK as a bpf program
if there is no other XDP program attached to the device. The device receive
queue is also associated with the AF_XDP socket.

In the XDP receive path, if the bpf program is a DIRECT_XSK program, the
XDP buffer is passed to the AF_XDP socket that is associated with the
receive queue on which the packet is received.

To avoid any overhead for nomal XDP programs, a static key is used to keep
a count of AF_XDP direct xsk sockets and static_branch_unlikely() is used
to handle receives for direct XDP sockets.

Any attach of a normal XDP program will take precedence and the direct xsk
program will be removed. The direct XSK program will be attached
automatically when the normal XDP program is removed when there are any
AF_XDP direct sockets associated with that device.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/linux/filter.h            | 18 ++++++++++++
 include/linux/netdevice.h         | 10 +++++++
 include/net/xdp_sock.h            |  5 ++++
 include/uapi/linux/if_xdp.h       |  5 ++++
 kernel/bpf/syscall.c              |  7 +++--
 net/core/dev.c                    | 50 +++++++++++++++++++++++++++++++++
 net/core/filter.c                 | 58 +++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk.c                     | 51 ++++++++++++++++++++++++++++++++--
 tools/include/uapi/linux/if_xdp.h |  5 ++++
 9 files changed, 204 insertions(+), 5 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2ce57645f3cd..db4ad85d8321 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -585,6 +585,9 @@ struct bpf_redirect_info {
 	struct bpf_map *map;
 	struct bpf_map *map_to_flush;
 	u32 kern_flags;
+#ifdef CONFIG_XDP_SOCKETS
+	struct xdp_sock *xsk;
+#endif
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
@@ -693,6 +696,16 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return res;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+#define BPF_PROG_DIRECT_XSK	0x1
+#define bpf_is_direct_xsk_prog(prog) \
+	((unsigned long)prog == BPF_PROG_DIRECT_XSK)
+DECLARE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
+u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp);
+#else
+#define bpf_is_direct_xsk_prog(prog) (false)
+#endif
+
 static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 					    struct xdp_buff *xdp)
 {
@@ -702,6 +715,11 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 	 * already takes rcu_read_lock() when fetching the program, so
 	 * it's not necessary here anymore.
 	 */
+#ifdef CONFIG_XDP_SOCKETS
+	if (static_branch_unlikely(&xdp_direct_xsk_needed) &&
+	    bpf_is_direct_xsk_prog(prog))
+		return bpf_direct_xsk(prog, xdp);
+#endif
 	return BPF_PROG_RUN(prog, xdp);
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 48cc71aae466..f4d0f70aa718 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -743,6 +743,7 @@ struct netdev_rx_queue {
 	struct xdp_rxq_info		xdp_rxq;
 #ifdef CONFIG_XDP_SOCKETS
 	struct xdp_umem                 *umem;
+	struct xdp_sock			*xsk;
 #endif
 } ____cacheline_aligned_in_smp;
 
@@ -1836,6 +1837,10 @@ struct net_device {
 	atomic_t		carrier_up_count;
 	atomic_t		carrier_down_count;
 
+#ifdef CONFIG_XDP_SOCKETS
+	u16			direct_xsk_count;
+#endif
+
 #ifdef CONFIG_WIRELESS_EXT
 	const struct iw_handler_def *wireless_handlers;
 	struct iw_public_data	*wireless_data;
@@ -3712,6 +3717,11 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
+#ifdef CONFIG_XDP_SOCKETS
+int dev_set_direct_xsk_prog(struct net_device *dev);
+int dev_clear_direct_xsk_prog(struct net_device *dev);
+#endif
+
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
 					       struct sk_buff *skb)
 {
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index c9398ce7960f..9158233d34e1 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -76,6 +76,9 @@ struct xsk_map_node {
 	struct xdp_sock **map_entry;
 };
 
+/* Flags for the xdp_sock flags field. */
+#define XDP_SOCK_DIRECT (1 << 0)
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
@@ -104,6 +107,7 @@ struct xdp_sock {
 	struct list_head map_list;
 	/* Protects map_list */
 	spinlock_t map_list_lock;
+	u16 flags;
 };
 
 struct xdp_buff;
@@ -129,6 +133,7 @@ void xsk_set_tx_need_wakeup(struct xdp_umem *umem);
 void xsk_clear_rx_need_wakeup(struct xdp_umem *umem);
 void xsk_clear_tx_need_wakeup(struct xdp_umem *umem);
 bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem);
+struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 queue_id);
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
 			     struct xdp_sock **map_entry);
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index be328c59389d..d202b5d40859 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -25,6 +25,11 @@
  * application.
  */
 #define XDP_USE_NEED_WAKEUP (1 << 3)
+/* This option allows an AF_XDP socket bound to a queue to receive all
+ * the packets directly from that queue when there is no XDP program
+ * attached to the device.
+ */
+#define XDP_DIRECT	(1 << 4)
 
 /* Flags for xsk_umem_config flags */
 #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 205f95af67d2..871d738a78d2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1349,13 +1349,14 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
 
 void bpf_prog_put(struct bpf_prog *prog)
 {
-	__bpf_prog_put(prog, true);
+	if (!bpf_is_direct_xsk_prog(prog))
+		__bpf_prog_put(prog, true);
 }
 EXPORT_SYMBOL_GPL(bpf_prog_put);
 
 u32 bpf_get_prog_id(const struct bpf_prog *prog)
 {
-	if (prog)
+	if (prog && !bpf_is_direct_xsk_prog(prog))
 		return prog->aux->id;
 
 	return 0;
@@ -1364,7 +1365,7 @@ EXPORT_SYMBOL(bpf_get_prog_id);
 
 void bpf_set_prog_id(struct bpf_prog *prog, u32 id)
 {
-	if (prog)
+	if (prog && !bpf_is_direct_xsk_prog(prog))
 		prog->aux->id = id;
 }
 EXPORT_SYMBOL(bpf_set_prog_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index 866d0ad936a5..eb3cd718e580 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8269,6 +8269,10 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 	} else {
 		if (!__dev_xdp_query(dev, bpf_op, query))
 			return 0;
+#ifdef CONFIG_XDP_SOCKETS
+		if (dev->direct_xsk_count)
+			prog = (void *)BPF_PROG_DIRECT_XSK;
+#endif
 	}
 
 	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
@@ -8278,6 +8282,52 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 	return err;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+int dev_set_direct_xsk_prog(struct net_device *dev)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct bpf_prog *prog;
+	bpf_op_t bpf_op;
+
+	ASSERT_RTNL();
+
+	dev->direct_xsk_count++;
+
+	bpf_op = ops->ndo_bpf;
+	if (!bpf_op)
+		return -EOPNOTSUPP;
+
+	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
+		return 0;
+
+	prog = (void *)BPF_PROG_DIRECT_XSK;
+
+	return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, prog);
+}
+
+int dev_clear_direct_xsk_prog(struct net_device *dev)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	bpf_op_t bpf_op;
+
+	ASSERT_RTNL();
+
+	dev->direct_xsk_count--;
+
+	if (dev->direct_xsk_count)
+		return 0;
+
+	bpf_op = ops->ndo_bpf;
+	if (!bpf_op)
+		return -EOPNOTSUPP;
+
+	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
+		return 0;
+
+	return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, NULL);
+}
+#endif
+
 /**
  *	dev_new_index	-	allocate an ifindex
  *	@net: the applicable net namespace
diff --git a/net/core/filter.c b/net/core/filter.c
index ed6563622ce3..391d7d600284 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -73,6 +73,7 @@
 #include <net/lwtunnel.h>
 #include <net/ipv6_stubs.h>
 #include <net/bpf_sk_storage.h>
+#include <linux/static_key.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -3546,6 +3547,22 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 	return 0;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+static void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
+{
+	struct xdp_sock *xsk = READ_ONCE(ri->xsk);
+
+	if (xsk) {
+		ri->xsk = NULL;
+		xsk_flush(xsk);
+	}
+}
+#else
+static inline void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
+{
+}
+#endif
+
 void xdp_do_flush_map(void)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
@@ -3568,6 +3585,8 @@ void xdp_do_flush_map(void)
 			break;
 		}
 	}
+
+	xdp_do_flush_xsk(ri);
 }
 EXPORT_SYMBOL_GPL(xdp_do_flush_map);
 
@@ -3631,11 +3650,28 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 	return err;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+static inline struct xdp_sock *xdp_get_direct_xsk(struct bpf_redirect_info *ri)
+{
+	return READ_ONCE(ri->xsk);
+}
+#else
+static inline struct xdp_sock *xdp_get_direct_xsk(struct bpf_redirect_info *ri)
+{
+	return NULL;
+}
+#endif
+
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct bpf_map *map = READ_ONCE(ri->map);
+	struct xdp_sock *xsk;
+
+	xsk = xdp_get_direct_xsk(ri);
+	if (xsk)
+		return xsk_rcv(xsk, xdp);
 
 	if (likely(map))
 		return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
@@ -8934,4 +8970,26 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
 
 const struct bpf_prog_ops sk_reuseport_prog_ops = {
 };
+
+#ifdef CONFIG_XDP_SOCKETS
+DEFINE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
+EXPORT_SYMBOL_GPL(xdp_direct_xsk_needed);
+
+u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
+{
+	struct xdp_sock *xsk;
+
+	xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
+	if (xsk) {
+		struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+		ri->xsk = xsk;
+		return XDP_REDIRECT;
+	}
+
+	return XDP_PASS;
+}
+EXPORT_SYMBOL(bpf_direct_xsk);
+#endif
+
 #endif /* CONFIG_INET */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index fa8fbb8fa3c8..8a29939bac7e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -24,6 +24,7 @@
 #include <linux/rculist.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
+#include <linux/if_link.h>
 
 #include "xsk_queue.h"
 #include "xdp_umem.h"
@@ -264,6 +265,45 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+static void xdp_clear_direct_xsk(struct xdp_sock *xsk)
+{
+	struct net_device *dev = xsk->dev;
+	u32 qid = xsk->queue_id;
+
+	if (!dev->_rx[qid].xsk)
+		return;
+
+	dev_clear_direct_xsk_prog(dev);
+	dev->_rx[qid].xsk = NULL;
+	static_branch_dec(&xdp_direct_xsk_needed);
+	xsk->flags &= ~XDP_SOCK_DIRECT;
+}
+
+static int xdp_set_direct_xsk(struct xdp_sock *xsk)
+{
+	struct net_device *dev = xsk->dev;
+	u32 qid = xsk->queue_id;
+	int err = 0;
+
+	if (dev->_rx[qid].xsk)
+		return -EEXIST;
+
+	xsk->flags |= XDP_SOCK_DIRECT;
+	static_branch_inc(&xdp_direct_xsk_needed);
+	dev->_rx[qid].xsk = xsk;
+	err = dev_set_direct_xsk_prog(dev);
+	if (err)
+		xdp_clear_direct_xsk(xsk);
+
+	return err;
+}
+
+struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 queue_id)
+{
+	return dev->_rx[queue_id].xsk;
+}
+EXPORT_SYMBOL(xdp_get_xsk_from_qid);
+
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
 {
 	xskq_produce_flush_addr_n(umem->cq, nb_entries);
@@ -464,6 +504,11 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
 		return;
 	WRITE_ONCE(xs->state, XSK_UNBOUND);
 
+	if (xs->flags & XDP_SOCK_DIRECT) {
+		rtnl_lock();
+		xdp_clear_direct_xsk(xs);
+		rtnl_unlock();
+	}
 	/* Wait for driver to stop using the xdp socket. */
 	xdp_del_sk_umem(xs->umem, xs);
 	xs->dev = NULL;
@@ -604,7 +649,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 
 	flags = sxdp->sxdp_flags;
 	if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY |
-		      XDP_USE_NEED_WAKEUP))
+		      XDP_USE_NEED_WAKEUP | XDP_DIRECT))
 		return -EINVAL;
 
 	rtnl_lock();
@@ -632,7 +677,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		struct socket *sock;
 
 		if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY) ||
-		    (flags & XDP_USE_NEED_WAKEUP)) {
+		    (flags & XDP_USE_NEED_WAKEUP) || (flags & XDP_DIRECT)) {
 			/* Cannot specify flags for shared sockets. */
 			err = -EINVAL;
 			goto out_unlock;
@@ -688,6 +733,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
 	xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);
 	xdp_add_sk_umem(xs->umem, xs);
+	if (flags & XDP_DIRECT)
+		err = xdp_set_direct_xsk(xs);
 
 out_unlock:
 	if (err) {
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index be328c59389d..d202b5d40859 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -25,6 +25,11 @@
  * application.
  */
 #define XDP_USE_NEED_WAKEUP (1 << 3)
+/* This option allows an AF_XDP socket bound to a queue to receive all
+ * the packets directly from that queue when there is no XDP program
+ * attached to the device.
+ */
+#define XDP_DIRECT	(1 << 4)
 
 /* Flags for xsk_umem_config flags */
 #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
-- 
2.14.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag.
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions Sridhar Samudrala
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
@ 2019-10-08  6:16 ` Sridhar Samudrala
  2019-10-08  8:05   ` Björn Töpel
  2019-10-08  6:16 ` [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode Sridhar Samudrala
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

Don't allow an AF_XDP socket trying to bind with XDP_DIRECT bind
flag when a normal XDP program is already attached to the device,

Don't attach the default XDP program when AF_XDP socket is created
with XDP_DIRECT bind flag.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 tools/lib/bpf/xsk.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index d5f4900e5c54..953b479040cd 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -454,6 +454,9 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
 		return err;
 
 	if (!prog_id) {
+		if (xsk->config.bind_flags & XDP_DIRECT)
+			return 0;
+
 		err = xsk_create_bpf_maps(xsk);
 		if (err)
 			return err;
@@ -464,6 +467,9 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
 			return err;
 		}
 	} else {
+		if (xsk->config.bind_flags & XDP_DIRECT)
+			return -EEXIST;
+
 		xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
 		err = xsk_lookup_bpf_maps(xsk);
 		if (err) {
-- 
2.14.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
                   ` (2 preceding siblings ...)
  2019-10-08  6:16 ` [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag Sridhar Samudrala
@ 2019-10-08  6:16 ` Sridhar Samudrala
  2019-10-08  8:05   ` Björn Töpel
  2019-10-08  8:05 ` [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Björn Töpel
  2019-10-09  0:49 ` Jakub Kicinski
  5 siblings, 1 reply; 42+ messages in thread
From: Sridhar Samudrala @ 2019-10-08  6:16 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

This option enables an AF_XDP socket to bind with a XDP_DIRECT flag
that allows packets received on the associated queue to be received
directly when an XDP program is not attached.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 samples/bpf/xdpsock_user.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 405c4e091f8b..6f4633769968 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -129,6 +129,9 @@ static void print_benchmark(bool running)
 	if (opt_poll)
 		printf("poll() ");
 
+	if (opt_xdp_bind_flags & XDP_DIRECT)
+		printf("direct-xsk ");
+
 	if (running) {
 		printf("running...");
 		fflush(stdout);
@@ -202,7 +205,8 @@ static void int_exit(int sig)
 	dump_stats();
 	xsk_socket__delete(xsks[0]->xsk);
 	(void)xsk_umem__delete(umem);
-	remove_xdp_program();
+	if (!(opt_xdp_bind_flags & XDP_DIRECT))
+		remove_xdp_program();
 
 	exit(EXIT_SUCCESS);
 }
@@ -213,7 +217,8 @@ static void __exit_with_error(int error, const char *file, const char *func,
 	fprintf(stderr, "%s:%s:%i: errno: %d/\"%s\"\n", file, func,
 		line, error, strerror(error));
 	dump_stats();
-	remove_xdp_program();
+	if (!(opt_xdp_bind_flags & XDP_DIRECT))
+		remove_xdp_program();
 	exit(EXIT_FAILURE);
 }
 
@@ -363,6 +368,7 @@ static struct option long_options[] = {
 	{"frame-size", required_argument, 0, 'f'},
 	{"no-need-wakeup", no_argument, 0, 'm'},
 	{"unaligned", no_argument, 0, 'u'},
+	{"direct-xsk", no_argument, 0, 'd'},
 	{0, 0, 0, 0}
 };
 
@@ -386,6 +392,7 @@ static void usage(const char *prog)
 		"  -m, --no-need-wakeup Turn off use of driver need wakeup flag.\n"
 		"  -f, --frame-size=n   Set the frame size (must be a power of two in aligned mode, default is %d).\n"
 		"  -u, --unaligned	Enable unaligned chunk placement\n"
+		"  -d, --direct-xsk	Direct packets to XDP socket.\n"
 		"\n";
 	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
 	exit(EXIT_FAILURE);
@@ -398,7 +405,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mu",
+		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mud",
 				long_options, &option_index);
 		if (c == -1)
 			break;
@@ -452,7 +459,9 @@ static void parse_command_line(int argc, char **argv)
 			opt_need_wakeup = false;
 			opt_xdp_bind_flags &= ~XDP_USE_NEED_WAKEUP;
 			break;
-
+		case 'd':
+			opt_xdp_bind_flags |= XDP_DIRECT;
+			break;
 		default:
 			usage(basename(argv[0]));
 		}
-- 
2.14.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
@ 2019-10-08  6:58   ` Toke Høiland-Jørgensen
  2019-10-08  8:47     ` Björn Töpel
  2019-10-08  8:05   ` Björn Töpel
  2019-10-09  1:20   ` Alexei Starovoitov
  2 siblings, 1 reply; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-08  6:58 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, bjorn.topel, netdev, bpf,
	sridhar.samudrala, intel-wired-lan, maciej.fijalkowski,
	tom.herbert

Sridhar Samudrala <sridhar.samudrala@intel.com> writes:

>  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>  	struct bpf_map *map = READ_ONCE(ri->map);
> +	struct xdp_sock *xsk;
> +
> +	xsk = xdp_get_direct_xsk(ri);
> +	if (xsk)
> +		return xsk_rcv(xsk, xdp);

This is a new branch and a read barrier in the XDP_REDIRECT fast path.
What's the performance impact of that for non-XSK redirect?

-Toke


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
                   ` (3 preceding siblings ...)
  2019-10-08  6:16 ` [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode Sridhar Samudrala
@ 2019-10-08  8:05 ` Björn Töpel
  2019-10-09 16:19   ` Samudrala, Sridhar
  2019-10-09  0:49 ` Jakub Kicinski
  5 siblings, 1 reply; 42+ messages in thread
From: Björn Töpel @ 2019-10-08  8:05 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert
  Cc: Jakub Kicinski

On 2019-10-08 08:16, Sridhar Samudrala wrote:
> This is a rework of the following patch series
> https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r
> that tried to enable direct receive by bypassing XDP program attached
> to the device.
> 
> Based on the community feedback and some suggestions from Bjorn, changed
> the semantics of the implementation to enable direct receive on AF_XDP
> sockets that are bound to a queue only when there is no normal XDP program
> attached to the device.
> 
> This is accomplished by introducing a special BPF prog pointer (DIRECT_XSK)
> that is attached at the time of binding an AF_XDP socket to a queue of a
> device. This is done only if there is no other XDP program attached to
> the device. The normal XDP program has precedence and will replace the
> DIRECT_XSK prog if it is attached later. The main reason to introduce a
> special BPF prog pointer is to minimize the driver changes. The only change
> is to use the bpf_get_prog_id() helper when QUERYING the prog id.
> 
> Any attach of a normal XDP program will take precedence and the direct xsk
> program will be removed. The direct XSK program will be attached
> automatically when the normal XDP program is removed when there are any
> AF_XDP direct sockets associated with that device.
> 
> A static key is used to control this feature in order to avoid any overhead
> for normal XDP datapath when there are no AF_XDP sockets in direct-xsk mode.
> 
> Here is some performance data i collected on my Intel Ivybridge based
> development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
> NIC: Intel 40Gb ethernet (i40e)
> 
> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
> 6.1x improvement in drop rate
> 
> xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
> 6x improvement in drop rate
> 
> xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
> 3.5x improvement in l2fwd rate
> 
> xdpsock rxdrop 2 core (app and queue'sirq pinned to different cores)
>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
> 4.5x improvement in l2fwd rate
> 
> dpdk-pktgen is used to send 64byte UDP packets from a link partner and
> ethtool ntuple flow rule is used to redirect packets to queue 1 on the
> system under test.
>

Thanks for working on this Sridhar! I like this approach! Except from
the bpf_get_prog_id() changes, no driver changes are needed.

It's also a cleaner (IMO) approach than my previous attempts [1,2,3]

Would be interesting to see NFP support AF_XDP offloading with this
option. (nudge, nudge).

A thought: From userland, a direct AF_XDP socket will not appear as an
XDP program is attached to the device (id == 0). Maybe show in ss(8)
(via xsk_diag.c) that the socket is direct?

[1] 
https://lore.kernel.org/netdev/CAJ+HfNj63QcLY8=y1fF93PZd3XcfiGSrbbWdiGByjTzZQydSSg@mail.gmail.com/
[2] 
https://lore.kernel.org/netdev/cd952f99-6bad-e0c8-5bcd-f0010218238c@intel.com/
[3] 
https://lore.kernel.org/netdev/20181207114431.18038-1-bjorn.topel@gmail.com/

> Sridhar Samudrala (4):
>    bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions.
>    xsk: allow AF_XDP sockets to receive packets directly from a queue
>    libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag.
>    xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode
> 
>   drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c     |  2 +-
>   drivers/net/ethernet/cavium/thunder/nicvf_main.c  |  2 +-
>   drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 +-
>   drivers/net/ethernet/intel/i40e/i40e_main.c       |  2 +-
>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  3 +-
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +-
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |  3 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
>   drivers/net/ethernet/qlogic/qede/qede_filter.c    |  2 +-
>   drivers/net/ethernet/socionext/netsec.c           |  2 +-
>   drivers/net/netdevsim/bpf.c                       |  6 ++-
>   drivers/net/tun.c                                 |  4 +-
>   drivers/net/veth.c                                |  4 +-
>   drivers/net/virtio_net.c                          |  3 +-
>   include/linux/bpf.h                               |  3 ++
>   include/linux/filter.h                            | 18 +++++++
>   include/linux/netdevice.h                         | 10 ++++
>   include/net/xdp_sock.h                            |  5 ++
>   include/trace/events/xdp.h                        |  4 +-
>   include/uapi/linux/if_xdp.h                       |  5 ++
>   kernel/bpf/arraymap.c                             |  2 +-
>   kernel/bpf/cgroup.c                               |  2 +-
>   kernel/bpf/core.c                                 |  2 +-
>   kernel/bpf/syscall.c                              | 33 +++++++++----
>   kernel/events/core.c                              |  2 +-
>   kernel/trace/bpf_trace.c                          |  2 +-
>   net/core/dev.c                                    | 54 ++++++++++++++++++++-
>   net/core/filter.c                                 | 58 +++++++++++++++++++++++
>   net/core/flow_dissector.c                         |  2 +-
>   net/core/rtnetlink.c                              |  2 +-
>   net/core/xdp.c                                    |  2 +-
>   net/ipv6/seg6_local.c                             |  2 +-
>   net/sched/act_bpf.c                               |  2 +-
>   net/sched/cls_bpf.c                               |  2 +-
>   net/xdp/xsk.c                                     | 51 +++++++++++++++++++-
>   samples/bpf/xdpsock_user.c                        | 17 +++++--
>   tools/include/uapi/linux/if_xdp.h                 |  5 ++
>   tools/lib/bpf/xsk.c                               |  6 +++
>   38 files changed, 279 insertions(+), 53 deletions(-)
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
  2019-10-08  6:58   ` Toke Høiland-Jørgensen
@ 2019-10-08  8:05   ` Björn Töpel
  2019-10-09 16:32     ` Samudrala, Sridhar
  2019-10-09  1:20   ` Alexei Starovoitov
  2 siblings, 1 reply; 42+ messages in thread
From: Björn Töpel @ 2019-10-08  8:05 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On 2019-10-08 08:16, Sridhar Samudrala wrote:
> Introduce a flag that can be specified during the bind() call
> of an AF_XDP socket to receive packets directly from a queue when there is
> no XDP program attached to the device.
> 
> This is enabled by introducing a special BPF prog pointer called
> BPF_PROG_DIRECT_XSK and a new bind flag XDP_DIRECT that can be specified
> when binding an AF_XDP socket to a queue. At the time of bind(), an AF_XDP
> socket in XDP_DIRECT mode, will attach BPF_PROG_DIRECT_XSK as a bpf program
> if there is no other XDP program attached to the device. The device receive
> queue is also associated with the AF_XDP socket.
> 
> In the XDP receive path, if the bpf program is a DIRECT_XSK program, the
> XDP buffer is passed to the AF_XDP socket that is associated with the
> receive queue on which the packet is received.
> 
> To avoid any overhead for nomal XDP programs, a static key is used to keep
> a count of AF_XDP direct xsk sockets and static_branch_unlikely() is used
> to handle receives for direct XDP sockets.
> 
> Any attach of a normal XDP program will take precedence and the direct xsk
> program will be removed. The direct XSK program will be attached
> automatically when the normal XDP program is removed when there are any
> AF_XDP direct sockets associated with that device.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>   include/linux/filter.h            | 18 ++++++++++++
>   include/linux/netdevice.h         | 10 +++++++
>   include/net/xdp_sock.h            |  5 ++++
>   include/uapi/linux/if_xdp.h       |  5 ++++
>   kernel/bpf/syscall.c              |  7 +++--
>   net/core/dev.c                    | 50 +++++++++++++++++++++++++++++++++
>   net/core/filter.c                 | 58 +++++++++++++++++++++++++++++++++++++++
>   net/xdp/xsk.c                     | 51 ++++++++++++++++++++++++++++++++--
>   tools/include/uapi/linux/if_xdp.h |  5 ++++
>   9 files changed, 204 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 2ce57645f3cd..db4ad85d8321 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -585,6 +585,9 @@ struct bpf_redirect_info {
>   	struct bpf_map *map;
>   	struct bpf_map *map_to_flush;
>   	u32 kern_flags;
> +#ifdef CONFIG_XDP_SOCKETS
> +	struct xdp_sock *xsk;
> +#endif
>   };
>   
>   DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
> @@ -693,6 +696,16 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
>   	return res;
>   }
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +#define BPF_PROG_DIRECT_XSK	0x1
> +#define bpf_is_direct_xsk_prog(prog) \
> +	((unsigned long)prog == BPF_PROG_DIRECT_XSK)
> +DECLARE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp);
> +#else
> +#define bpf_is_direct_xsk_prog(prog) (false)
> +#endif
> +
>   static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
>   					    struct xdp_buff *xdp)
>   {
> @@ -702,6 +715,11 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
>   	 * already takes rcu_read_lock() when fetching the program, so
>   	 * it's not necessary here anymore.
>   	 */
> +#ifdef CONFIG_XDP_SOCKETS
> +	if (static_branch_unlikely(&xdp_direct_xsk_needed) &&
> +	    bpf_is_direct_xsk_prog(prog))
> +		return bpf_direct_xsk(prog, xdp);
> +#endif
>   	return BPF_PROG_RUN(prog, xdp);
>   }
>   
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 48cc71aae466..f4d0f70aa718 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -743,6 +743,7 @@ struct netdev_rx_queue {
>   	struct xdp_rxq_info		xdp_rxq;
>   #ifdef CONFIG_XDP_SOCKETS
>   	struct xdp_umem                 *umem;
> +	struct xdp_sock			*xsk;
>   #endif
>   } ____cacheline_aligned_in_smp;
>   
> @@ -1836,6 +1837,10 @@ struct net_device {
>   	atomic_t		carrier_up_count;
>   	atomic_t		carrier_down_count;
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +	u16			direct_xsk_count;

Why u16? num_rx/tx_queues are unsigned ints.

> +#endif
> +
>   #ifdef CONFIG_WIRELESS_EXT
>   	const struct iw_handler_def *wireless_handlers;
>   	struct iw_public_data	*wireless_data;
> @@ -3712,6 +3717,11 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
>   bool is_skb_forwardable(const struct net_device *dev,
>   			const struct sk_buff *skb);
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +int dev_set_direct_xsk_prog(struct net_device *dev);
> +int dev_clear_direct_xsk_prog(struct net_device *dev);
> +#endif
> +
>   static __always_inline int ____dev_forward_skb(struct net_device *dev,
>   					       struct sk_buff *skb)
>   {
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index c9398ce7960f..9158233d34e1 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -76,6 +76,9 @@ struct xsk_map_node {
>   	struct xdp_sock **map_entry;
>   };
>   
> +/* Flags for the xdp_sock flags field. */
> +#define XDP_SOCK_DIRECT (1 << 0)
> +
>   struct xdp_sock {
>   	/* struct sock must be the first member of struct xdp_sock */
>   	struct sock sk;
> @@ -104,6 +107,7 @@ struct xdp_sock {
>   	struct list_head map_list;
>   	/* Protects map_list */
>   	spinlock_t map_list_lock;
> +	u16 flags;

Right now only the XDP_DIRECT is tracked here. Maybe track all flags, 
and show them in xsk_diag.

>   };
>   
>   struct xdp_buff;
> @@ -129,6 +133,7 @@ void xsk_set_tx_need_wakeup(struct xdp_umem *umem);
>   void xsk_clear_rx_need_wakeup(struct xdp_umem *umem);
>   void xsk_clear_tx_need_wakeup(struct xdp_umem *umem);
>   bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem);
> +struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 queue_id);
>   
>   void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>   			     struct xdp_sock **map_entry);
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index be328c59389d..d202b5d40859 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -25,6 +25,11 @@
>    * application.
>    */
>   #define XDP_USE_NEED_WAKEUP (1 << 3)
> +/* This option allows an AF_XDP socket bound to a queue to receive all
> + * the packets directly from that queue when there is no XDP program
> + * attached to the device.
> + */
> +#define XDP_DIRECT	(1 << 4)
>   
>   /* Flags for xsk_umem_config flags */
>   #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 205f95af67d2..871d738a78d2 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1349,13 +1349,14 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
>   
>   void bpf_prog_put(struct bpf_prog *prog)
>   {
> -	__bpf_prog_put(prog, true);
> +	if (!bpf_is_direct_xsk_prog(prog))
> +		__bpf_prog_put(prog, true);
>   }
>   EXPORT_SYMBOL_GPL(bpf_prog_put);
>   
>   u32 bpf_get_prog_id(const struct bpf_prog *prog)
>   {
> -	if (prog)
> +	if (prog && !bpf_is_direct_xsk_prog(prog))
>   		return prog->aux->id;
>   
>   	return 0;
> @@ -1364,7 +1365,7 @@ EXPORT_SYMBOL(bpf_get_prog_id);
>   
>   void bpf_set_prog_id(struct bpf_prog *prog, u32 id)
>   {
> -	if (prog)
> +	if (prog && !bpf_is_direct_xsk_prog(prog))
>   		prog->aux->id = id;
>   }
>   EXPORT_SYMBOL(bpf_set_prog_id);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 866d0ad936a5..eb3cd718e580 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -8269,6 +8269,10 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
>   	} else {
>   		if (!__dev_xdp_query(dev, bpf_op, query))
>   			return 0;
> +#ifdef CONFIG_XDP_SOCKETS
> +		if (dev->direct_xsk_count)
> +			prog = (void *)BPF_PROG_DIRECT_XSK;
> +#endif

Nit, but maybe hide this weirdness in a function?

>   	}
>   
>   	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
> @@ -8278,6 +8282,52 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
>   	return err;
>   }
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +int dev_set_direct_xsk_prog(struct net_device *dev)
> +{
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +	struct bpf_prog *prog;
> +	bpf_op_t bpf_op;
> +
> +	ASSERT_RTNL();
> +
> +	dev->direct_xsk_count++;
> +
> +	bpf_op = ops->ndo_bpf;
> +	if (!bpf_op)
> +		return -EOPNOTSUPP;
> +
> +	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
> +		return 0;
> +
> +	prog = (void *)BPF_PROG_DIRECT_XSK;
> +
> +	return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, prog);
> +}
> +
> +int dev_clear_direct_xsk_prog(struct net_device *dev)
> +{
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +	bpf_op_t bpf_op;
> +
> +	ASSERT_RTNL();
> +
> +	dev->direct_xsk_count--;
> +
> +	if (dev->direct_xsk_count)
> +		return 0;
> +
> +	bpf_op = ops->ndo_bpf;
> +	if (!bpf_op)
> +		return -EOPNOTSUPP;
> +
> +	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
> +		return 0;
> +
> +	return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, NULL);
> +}
> +#endif
> +
>   /**
>    *	dev_new_index	-	allocate an ifindex
>    *	@net: the applicable net namespace
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ed6563622ce3..391d7d600284 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -73,6 +73,7 @@
>   #include <net/lwtunnel.h>
>   #include <net/ipv6_stubs.h>
>   #include <net/bpf_sk_storage.h>
> +#include <linux/static_key.h>
>   
>   /**
>    *	sk_filter_trim_cap - run a packet through a socket filter
> @@ -3546,6 +3547,22 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
>   	return 0;
>   }
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +static void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
> +{
> +	struct xdp_sock *xsk = READ_ONCE(ri->xsk);

Why READ_ONCE here?

> +
> +	if (xsk) {
> +		ri->xsk = NULL;
> +		xsk_flush(xsk);
> +	}
> +}
> +#else
> +static inline void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
> +{
> +}
> +#endif
> +

Move CONFIG_XDP_SOCKETS into the function, and remove the empty/bottom one.

>   void xdp_do_flush_map(void)
>   {
>   	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> @@ -3568,6 +3585,8 @@ void xdp_do_flush_map(void)
>   			break;
>   		}
>   	}
> +
> +	xdp_do_flush_xsk(ri);
>   }
>   EXPORT_SYMBOL_GPL(xdp_do_flush_map);
>   
> @@ -3631,11 +3650,28 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
>   	return err;
>   }
>   
> +#ifdef CONFIG_XDP_SOCKETS
> +static inline struct xdp_sock *xdp_get_direct_xsk(struct bpf_redirect_info *ri)
> +{
> +	return READ_ONCE(ri->xsk);

Again, why READ_ONCE? Please leave the inlining to the compiler in .c files.

> +}
> +#else
> +static inline struct xdp_sock *xdp_get_direct_xsk(struct bpf_redirect_info *ri)
> +{
> +	return NULL;
> +}
> +#endif
> +
>   int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>   		    struct bpf_prog *xdp_prog)
>   {
>   	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>   	struct bpf_map *map = READ_ONCE(ri->map);
> +	struct xdp_sock *xsk;
> +
> +	xsk = xdp_get_direct_xsk(ri);
> +	if (xsk)
> +		return xsk_rcv(xsk, xdp);

Hmm, maybe you need a xsk_to_flush as well. Say that a user swaps in a
regular XDP program, then xsk_rcv() will be called until the flush
occurs, right? IOW, all packets processed (napi budget) in the napi_poll
will end up in the socket.

>   
>   	if (likely(map))
>   		return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
> @@ -8934,4 +8970,26 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
>   
>   const struct bpf_prog_ops sk_reuseport_prog_ops = {
>   };
> +
> +#ifdef CONFIG_XDP_SOCKETS
> +DEFINE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
> +EXPORT_SYMBOL_GPL(xdp_direct_xsk_needed);
> +
> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
> +{
> +	struct xdp_sock *xsk;
> +
> +	xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
> +	if (xsk) {
> +		struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +		ri->xsk = xsk;
> +		return XDP_REDIRECT;

 From the comment above. I *think* you need to ri->xsk_to_flush. Can the
direct socket (swap socket) change before flush?

> +	}
> +
> +	return XDP_PASS;
> +}
> +EXPORT_SYMBOL(bpf_direct_xsk);
> +#endif
> +
>   #endif /* CONFIG_INET */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index fa8fbb8fa3c8..8a29939bac7e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -24,6 +24,7 @@
>   #include <linux/rculist.h>
>   #include <net/xdp_sock.h>
>   #include <net/xdp.h>
> +#include <linux/if_link.h>
>   
>   #include "xsk_queue.h"
>   #include "xdp_umem.h"
> @@ -264,6 +265,45 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>   	return err;
>   }
>   
> +static void xdp_clear_direct_xsk(struct xdp_sock *xsk)

Use xs, and not xsk for consistency.

> +{
> +	struct net_device *dev = xsk->dev;
> +	u32 qid = xsk->queue_id;
> +
> +	if (!dev->_rx[qid].xsk)
> +		return;
> +
> +	dev_clear_direct_xsk_prog(dev);
> +	dev->_rx[qid].xsk = NULL;
> +	static_branch_dec(&xdp_direct_xsk_needed);
> +	xsk->flags &= ~XDP_SOCK_DIRECT;
> +}
> +
> +static int xdp_set_direct_xsk(struct xdp_sock *xsk)

Same here.

> +{
> +	struct net_device *dev = xsk->dev;
> +	u32 qid = xsk->queue_id;
> +	int err = 0;
> +
> +	if (dev->_rx[qid].xsk)
> +		return -EEXIST;
> +
> +	xsk->flags |= XDP_SOCK_DIRECT;
> +	static_branch_inc(&xdp_direct_xsk_needed);
> +	dev->_rx[qid].xsk = xsk;
> +	err = dev_set_direct_xsk_prog(dev);
> +	if (err)
> +		xdp_clear_direct_xsk(xsk);
> +
> +	return err;
> +}
> +
> +struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 queue_id)
> +{
> +	return dev->_rx[queue_id].xsk;
> +}
> +EXPORT_SYMBOL(xdp_get_xsk_from_qid);
> +
>   void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
>   {
>   	xskq_produce_flush_addr_n(umem->cq, nb_entries);
> @@ -464,6 +504,11 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
>   		return;
>   	WRITE_ONCE(xs->state, XSK_UNBOUND);
>   
> +	if (xs->flags & XDP_SOCK_DIRECT) {
> +		rtnl_lock();
> +		xdp_clear_direct_xsk(xs);
> +		rtnl_unlock();
> +	}
>   	/* Wait for driver to stop using the xdp socket. */
>   	xdp_del_sk_umem(xs->umem, xs);
>   	xs->dev = NULL;
> @@ -604,7 +649,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   
>   	flags = sxdp->sxdp_flags;
>   	if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY |
> -		      XDP_USE_NEED_WAKEUP))
> +		      XDP_USE_NEED_WAKEUP | XDP_DIRECT))
>   		return -EINVAL;
>   
>   	rtnl_lock();
> @@ -632,7 +677,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   		struct socket *sock;
>   
>   		if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY) ||
> -		    (flags & XDP_USE_NEED_WAKEUP)) {
> +		    (flags & XDP_USE_NEED_WAKEUP) || (flags & XDP_DIRECT)) {
>   			/* Cannot specify flags for shared sockets. */
>   			err = -EINVAL;
>   			goto out_unlock;
> @@ -688,6 +733,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   	xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
>   	xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);
>   	xdp_add_sk_umem(xs->umem, xs);
> +	if (flags & XDP_DIRECT)
> +		err = xdp_set_direct_xsk(xs);
>   
>   out_unlock:
>   	if (err) {
> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> index be328c59389d..d202b5d40859 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -25,6 +25,11 @@
>    * application.
>    */
>   #define XDP_USE_NEED_WAKEUP (1 << 3)
> +/* This option allows an AF_XDP socket bound to a queue to receive all
> + * the packets directly from that queue when there is no XDP program
> + * attached to the device.
> + */
> +#define XDP_DIRECT	(1 << 4)
>   
>   /* Flags for xsk_umem_config flags */
>   #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag.
  2019-10-08  6:16 ` [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag Sridhar Samudrala
@ 2019-10-08  8:05   ` Björn Töpel
  0 siblings, 0 replies; 42+ messages in thread
From: Björn Töpel @ 2019-10-08  8:05 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On 2019-10-08 08:16, Sridhar Samudrala wrote:
> Don't allow an AF_XDP socket trying to bind with XDP_DIRECT bind
> flag when a normal XDP program is already attached to the device,
> 
> Don't attach the default XDP program when AF_XDP socket is created
> with XDP_DIRECT bind flag.
>

I'd like this to be default for xsk.c, and if not supported fall back to 
old code.


> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>   tools/lib/bpf/xsk.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index d5f4900e5c54..953b479040cd 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -454,6 +454,9 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>   		return err;
>   
>   	if (!prog_id) {
> +		if (xsk->config.bind_flags & XDP_DIRECT)
> +			return 0;
> +
>   		err = xsk_create_bpf_maps(xsk);
>   		if (err)
>   			return err;
> @@ -464,6 +467,9 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>   			return err;
>   		}
>   	} else {
> +		if (xsk->config.bind_flags & XDP_DIRECT)
> +			return -EEXIST;
> +
>   		xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
>   		err = xsk_lookup_bpf_maps(xsk);
>   		if (err) {
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode
  2019-10-08  6:16 ` [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode Sridhar Samudrala
@ 2019-10-08  8:05   ` Björn Töpel
  0 siblings, 0 replies; 42+ messages in thread
From: Björn Töpel @ 2019-10-08  8:05 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On 2019-10-08 08:16, Sridhar Samudrala wrote:
> This option enables an AF_XDP socket to bind with a XDP_DIRECT flag
> that allows packets received on the associated queue to be received
> directly when an XDP program is not attached.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

Acked-by: Björn Töpel <bjorn.topel@intel.com>

> ---
>   samples/bpf/xdpsock_user.c | 17 +++++++++++++----
>   1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index 405c4e091f8b..6f4633769968 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -129,6 +129,9 @@ static void print_benchmark(bool running)
>   	if (opt_poll)
>   		printf("poll() ");
>   
> +	if (opt_xdp_bind_flags & XDP_DIRECT)
> +		printf("direct-xsk ");
> +
>   	if (running) {
>   		printf("running...");
>   		fflush(stdout);
> @@ -202,7 +205,8 @@ static void int_exit(int sig)
>   	dump_stats();
>   	xsk_socket__delete(xsks[0]->xsk);
>   	(void)xsk_umem__delete(umem);
> -	remove_xdp_program();
> +	if (!(opt_xdp_bind_flags & XDP_DIRECT))
> +		remove_xdp_program();
>   
>   	exit(EXIT_SUCCESS);
>   }
> @@ -213,7 +217,8 @@ static void __exit_with_error(int error, const char *file, const char *func,
>   	fprintf(stderr, "%s:%s:%i: errno: %d/\"%s\"\n", file, func,
>   		line, error, strerror(error));
>   	dump_stats();
> -	remove_xdp_program();
> +	if (!(opt_xdp_bind_flags & XDP_DIRECT))
> +		remove_xdp_program();
>   	exit(EXIT_FAILURE);
>   }
>   
> @@ -363,6 +368,7 @@ static struct option long_options[] = {
>   	{"frame-size", required_argument, 0, 'f'},
>   	{"no-need-wakeup", no_argument, 0, 'm'},
>   	{"unaligned", no_argument, 0, 'u'},
> +	{"direct-xsk", no_argument, 0, 'd'},
>   	{0, 0, 0, 0}
>   };
>   
> @@ -386,6 +392,7 @@ static void usage(const char *prog)
>   		"  -m, --no-need-wakeup Turn off use of driver need wakeup flag.\n"
>   		"  -f, --frame-size=n   Set the frame size (must be a power of two in aligned mode, default is %d).\n"
>   		"  -u, --unaligned	Enable unaligned chunk placement\n"
> +		"  -d, --direct-xsk	Direct packets to XDP socket.\n"
>   		"\n";
>   	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
>   	exit(EXIT_FAILURE);
> @@ -398,7 +405,7 @@ static void parse_command_line(int argc, char **argv)
>   	opterr = 0;
>   
>   	for (;;) {
> -		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mu",
> +		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mud",
>   				long_options, &option_index);
>   		if (c == -1)
>   			break;
> @@ -452,7 +459,9 @@ static void parse_command_line(int argc, char **argv)
>   			opt_need_wakeup = false;
>   			opt_xdp_bind_flags &= ~XDP_USE_NEED_WAKEUP;
>   			break;
> -
> +		case 'd':
> +			opt_xdp_bind_flags |= XDP_DIRECT;
> +			break;
>   		default:
>   			usage(basename(argv[0]));
>   		}
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:58   ` Toke Høiland-Jørgensen
@ 2019-10-08  8:47     ` Björn Töpel
  2019-10-08  8:48       ` Björn Töpel
  2019-10-08  9:04       ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 42+ messages in thread
From: Björn Töpel @ 2019-10-08  8:47 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sridhar Samudrala, Karlsson, Magnus, Björn Töpel,
	Netdev, bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert

On Tue, 8 Oct 2019 at 08:59, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
>
> >  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> >                   struct bpf_prog *xdp_prog)
> >  {
> >       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> >       struct bpf_map *map = READ_ONCE(ri->map);
> > +     struct xdp_sock *xsk;
> > +
> > +     xsk = xdp_get_direct_xsk(ri);
> > +     if (xsk)
> > +             return xsk_rcv(xsk, xdp);
>
> This is a new branch and a read barrier in the XDP_REDIRECT fast path.
> What's the performance impact of that for non-XSK redirect?
>

The dependent-read-barrier in READ_ONCE? Another branch -- leave that
to the branch-predictor already! ;-) No, you're right, performance
impact here is interesting. I guess the same static_branch could be
used here as well...


> -Toke
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  8:47     ` Björn Töpel
@ 2019-10-08  8:48       ` Björn Töpel
  2019-10-08  9:04       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 42+ messages in thread
From: Björn Töpel @ 2019-10-08  8:48 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sridhar Samudrala, Karlsson, Magnus, Björn Töpel,
	Netdev, bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert

On Tue, 8 Oct 2019 at 10:47, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
[...]
>
> The dependent-read-barrier in READ_ONCE? Another branch -- leave that
> to the branch-predictor already! ;-) No, you're right, performance
> impact here is interesting. I guess the same static_branch could be
> used here as well...
>

...and I think the READ_ONCE can be omitted.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  8:47     ` Björn Töpel
  2019-10-08  8:48       ` Björn Töpel
@ 2019-10-08  9:04       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-08  9:04 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Sridhar Samudrala, Karlsson, Magnus, Björn Töpel,
	Netdev, bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Tue, 8 Oct 2019 at 08:59, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
>>
>> >  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>> >                   struct bpf_prog *xdp_prog)
>> >  {
>> >       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> >       struct bpf_map *map = READ_ONCE(ri->map);
>> > +     struct xdp_sock *xsk;
>> > +
>> > +     xsk = xdp_get_direct_xsk(ri);
>> > +     if (xsk)
>> > +             return xsk_rcv(xsk, xdp);
>>
>> This is a new branch and a read barrier in the XDP_REDIRECT fast path.
>> What's the performance impact of that for non-XSK redirect?
>>
>
> The dependent-read-barrier in READ_ONCE? Another branch -- leave that
> to the branch-predictor already! ;-) No, you're right, performance
> impact here is interesting. I guess the same static_branch could be
> used here as well...

In any case, some measurements to see the impact (or lack thereof) would
be useful :)

-Toke


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
                   ` (4 preceding siblings ...)
  2019-10-08  8:05 ` [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Björn Töpel
@ 2019-10-09  0:49 ` Jakub Kicinski
  2019-10-09  6:29   ` Samudrala, Sridhar
  5 siblings, 1 reply; 42+ messages in thread
From: Jakub Kicinski @ 2019-10-09  0:49 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: magnus.karlsson, bjorn.topel, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On Mon,  7 Oct 2019 23:16:51 -0700, Sridhar Samudrala wrote:
> This is a rework of the following patch series 
> https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r
> that tried to enable direct receive by bypassing XDP program attached
> to the device.
> 
> Based on the community feedback and some suggestions from Bjorn, changed
> the semantics of the implementation to enable direct receive on AF_XDP
> sockets that are bound to a queue only when there is no normal XDP program
> attached to the device.
> 
> This is accomplished by introducing a special BPF prog pointer (DIRECT_XSK)
> that is attached at the time of binding an AF_XDP socket to a queue of a
> device. This is done only if there is no other XDP program attached to
> the device. The normal XDP program has precedence and will replace the
> DIRECT_XSK prog if it is attached later. The main reason to introduce a
> special BPF prog pointer is to minimize the driver changes. The only change
> is to use the bpf_get_prog_id() helper when QUERYING the prog id.
> 
> Any attach of a normal XDP program will take precedence and the direct xsk
> program will be removed. The direct XSK program will be attached
> automatically when the normal XDP program is removed when there are any
> AF_XDP direct sockets associated with that device.
> 
> A static key is used to control this feature in order to avoid any overhead
> for normal XDP datapath when there are no AF_XDP sockets in direct-xsk mode.

Don't say that static branches have no overhead. That's dishonest.

> Here is some performance data i collected on my Intel Ivybridge based
> development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
> NIC: Intel 40Gb ethernet (i40e)
> 
> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>    default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>    direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
> 6.1x improvement in drop rate
> 
> xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
>    default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
>    direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
> 6x improvement in drop rate
> 
> xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
>    default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
>    direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
> 3.5x improvement in l2fwd rate
> 
> xdpsock rxdrop 2 core (app and queue'sirq pinned to different cores)
>    default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
>    direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
> 4.5x improvement in l2fwd rate

I asked you to add numbers for handling those use cases in the kernel
directly.

> dpdk-pktgen is used to send 64byte UDP packets from a link partner and 
> ethtool ntuple flow rule is used to redirect packets to queue 1 on the 
> system under test.

Obviously still nack from me.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
  2019-10-08  6:58   ` Toke Høiland-Jørgensen
  2019-10-08  8:05   ` Björn Töpel
@ 2019-10-09  1:20   ` Alexei Starovoitov
       [not found]     ` <3ED8E928C4210A4289A677D2FEB48235140134CE@fmsmsx111.amr.corp.intel.com>
  2 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2019-10-09  1:20 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Karlsson, Magnus, Björn Töpel, Network Development,
	bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert

On Mon, Oct 7, 2019 at 11:18 PM Sridhar Samudrala
<sridhar.samudrala@intel.com> wrote:
> +
> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
> +{
> +       struct xdp_sock *xsk;
> +
> +       xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
> +       if (xsk) {
> +               struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +               ri->xsk = xsk;
> +               return XDP_REDIRECT;
> +       }
> +
> +       return XDP_PASS;
> +}
> +EXPORT_SYMBOL(bpf_direct_xsk);

So you're saying there is a:
"""
xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
   default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
   direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
6.1x improvement in drop rate
"""

6.1x gain running above C code vs exactly equivalent BPF code?
How is that possible?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-09  0:49 ` Jakub Kicinski
@ 2019-10-09  6:29   ` Samudrala, Sridhar
  2019-10-09 16:53     ` Jakub Kicinski
  0 siblings, 1 reply; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09  6:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: magnus.karlsson, bjorn.topel, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert



On 10/8/2019 5:49 PM, Jakub Kicinski wrote:
> On Mon,  7 Oct 2019 23:16:51 -0700, Sridhar Samudrala wrote:
>> This is a rework of the following patch series
>> https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r
>> that tried to enable direct receive by bypassing XDP program attached
>> to the device.
>>
>> Based on the community feedback and some suggestions from Bjorn, changed
>> the semantics of the implementation to enable direct receive on AF_XDP
>> sockets that are bound to a queue only when there is no normal XDP program
>> attached to the device.
>>
>> This is accomplished by introducing a special BPF prog pointer (DIRECT_XSK)
>> that is attached at the time of binding an AF_XDP socket to a queue of a
>> device. This is done only if there is no other XDP program attached to
>> the device. The normal XDP program has precedence and will replace the
>> DIRECT_XSK prog if it is attached later. The main reason to introduce a
>> special BPF prog pointer is to minimize the driver changes. The only change
>> is to use the bpf_get_prog_id() helper when QUERYING the prog id.
>>
>> Any attach of a normal XDP program will take precedence and the direct xsk
>> program will be removed. The direct XSK program will be attached
>> automatically when the normal XDP program is removed when there are any
>> AF_XDP direct sockets associated with that device.
>>
>> A static key is used to control this feature in order to avoid any overhead
>> for normal XDP datapath when there are no AF_XDP sockets in direct-xsk mode.
> 
> Don't say that static branches have no overhead. That's dishonest.

I didn't mean to say no overhead, but the overhead is minimized using 
static_branch_unlikely()

> 
>> Here is some performance data i collected on my Intel Ivybridge based
>> development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
>> NIC: Intel 40Gb ethernet (i40e)
>>
>> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>> 6.1x improvement in drop rate
>>
>> xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
>>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
>>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
>> 6x improvement in drop rate
>>
>> xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
>>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
>>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
>> 3.5x improvement in l2fwd rate
>>
>> xdpsock rxdrop 2 core (app and queue'sirq pinned to different cores)
>>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
>>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
>> 4.5x improvement in l2fwd rate
> 
> I asked you to add numbers for handling those use cases in the kernel
> directly.

Forgot to explicitly mention that I didn't see any regressions with 
xdp1, xdp2 or xdpsock in default mode with these patches. Performance 
remained the same.

> 
>> dpdk-pktgen is used to send 64byte UDP packets from a link partner and
>> ethtool ntuple flow rule is used to redirect packets to queue 1 on the
>> system under test.
> 
> Obviously still nack from me.
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-08  8:05 ` [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Björn Töpel
@ 2019-10-09 16:19   ` Samudrala, Sridhar
  0 siblings, 0 replies; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09 16:19 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, netdev, bpf,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
  Cc: Jakub Kicinski

On 10/8/2019 1:05 AM, Björn Töpel wrote:
> On 2019-10-08 08:16, Sridhar Samudrala wrote:
>> This is a rework of the following patch series
>> https://lore.kernel.org/netdev/1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com/#r 
>>
>> that tried to enable direct receive by bypassing XDP program attached
>> to the device.
>>
>> Based on the community feedback and some suggestions from Bjorn, changed
>> the semantics of the implementation to enable direct receive on AF_XDP
>> sockets that are bound to a queue only when there is no normal XDP 
>> program
>> attached to the device.
>>
>> This is accomplished by introducing a special BPF prog pointer 
>> (DIRECT_XSK)
>> that is attached at the time of binding an AF_XDP socket to a queue of a
>> device. This is done only if there is no other XDP program attached to
>> the device. The normal XDP program has precedence and will replace the
>> DIRECT_XSK prog if it is attached later. The main reason to introduce a
>> special BPF prog pointer is to minimize the driver changes. The only 
>> change
>> is to use the bpf_get_prog_id() helper when QUERYING the prog id.
>>
>> Any attach of a normal XDP program will take precedence and the direct 
>> xsk
>> program will be removed. The direct XSK program will be attached
>> automatically when the normal XDP program is removed when there are any
>> AF_XDP direct sockets associated with that device.
>>
>> A static key is used to control this feature in order to avoid any 
>> overhead
>> for normal XDP datapath when there are no AF_XDP sockets in direct-xsk 
>> mode.
>>
>> Here is some performance data i collected on my Intel Ivybridge based
>> development system (Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
>> NIC: Intel 40Gb ethernet (i40e)
>>
>> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>> 6.1x improvement in drop rate
>>
>> xdpsock rxdrop 2 core (app and queue's irq pinned to different cores)
>>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -r -q 1
>>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -r -d -q 1
>> 6x improvement in drop rate
>>
>> xdpsock l2fwd 1 core (both app and queue's irq pinned to the same core)
>>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -l -q 1
>>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -l -d -q 1
>> 3.5x improvement in l2fwd rate
>>
>> xdpsock rxdrop 2 core (app and queue'sirq pinned to different cores)
>>     default : taskset -c 3 ./xdpsock -i enp66s0f0 -l -q 1
>>     direct-xsk :taskset -c 3 ./xdpsock -i enp66s0f0 -l -d -q 1
>> 4.5x improvement in l2fwd rate
>>
>> dpdk-pktgen is used to send 64byte UDP packets from a link partner and
>> ethtool ntuple flow rule is used to redirect packets to queue 1 on the
>> system under test.
>>
> 
> Thanks for working on this Sridhar! I like this approach! Except from
> the bpf_get_prog_id() changes, no driver changes are needed.
> 
> It's also a cleaner (IMO) approach than my previous attempts [1,2,3]
> 
> Would be interesting to see NFP support AF_XDP offloading with this
> option. (nudge, nudge).
> 
> A thought: From userland, a direct AF_XDP socket will not appear as an
> XDP program is attached to the device (id == 0). Maybe show in ss(8)
> (via xsk_diag.c) that the socket is direct?

Sure. will add this in the next revision.

> 
> [1] 
> https://lore.kernel.org/netdev/CAJ+HfNj63QcLY8=y1fF93PZd3XcfiGSrbbWdiGByjTzZQydSSg@mail.gmail.com/ 
> 
> [2] 
> https://lore.kernel.org/netdev/cd952f99-6bad-e0c8-5bcd-f0010218238c@intel.com/ 
> 
> [3] 
> https://lore.kernel.org/netdev/20181207114431.18038-1-bjorn.topel@gmail.com/ 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-08  8:05   ` Björn Töpel
@ 2019-10-09 16:32     ` Samudrala, Sridhar
  0 siblings, 0 replies; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09 16:32 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, netdev, bpf,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

On 10/8/2019 1:05 AM, Björn Töpel wrote:
> On 2019-10-08 08:16, Sridhar Samudrala wrote:
>> Introduce a flag that can be specified during the bind() call
>> of an AF_XDP socket to receive packets directly from a queue when 
>> there is
>> no XDP program attached to the device.
>>
>> This is enabled by introducing a special BPF prog pointer called
>> BPF_PROG_DIRECT_XSK and a new bind flag XDP_DIRECT that can be specified
>> when binding an AF_XDP socket to a queue. At the time of bind(), an 
>> AF_XDP
>> socket in XDP_DIRECT mode, will attach BPF_PROG_DIRECT_XSK as a bpf 
>> program
>> if there is no other XDP program attached to the device. The device 
>> receive
>> queue is also associated with the AF_XDP socket.
>>
>> In the XDP receive path, if the bpf program is a DIRECT_XSK program, the
>> XDP buffer is passed to the AF_XDP socket that is associated with the
>> receive queue on which the packet is received.
>>
>> To avoid any overhead for nomal XDP programs, a static key is used to 
>> keep
>> a count of AF_XDP direct xsk sockets and static_branch_unlikely() is used
>> to handle receives for direct XDP sockets.
>>
>> Any attach of a normal XDP program will take precedence and the direct 
>> xsk
>> program will be removed. The direct XSK program will be attached
>> automatically when the normal XDP program is removed when there are any
>> AF_XDP direct sockets associated with that device.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>>   include/linux/filter.h            | 18 ++++++++++++
>>   include/linux/netdevice.h         | 10 +++++++
>>   include/net/xdp_sock.h            |  5 ++++
>>   include/uapi/linux/if_xdp.h       |  5 ++++
>>   kernel/bpf/syscall.c              |  7 +++--
>>   net/core/dev.c                    | 50 
>> +++++++++++++++++++++++++++++++++
>>   net/core/filter.c                 | 58 
>> +++++++++++++++++++++++++++++++++++++++
>>   net/xdp/xsk.c                     | 51 
>> ++++++++++++++++++++++++++++++++--
>>   tools/include/uapi/linux/if_xdp.h |  5 ++++
>>   9 files changed, 204 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 2ce57645f3cd..db4ad85d8321 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -585,6 +585,9 @@ struct bpf_redirect_info {
>>       struct bpf_map *map;
>>       struct bpf_map *map_to_flush;
>>       u32 kern_flags;
>> +#ifdef CONFIG_XDP_SOCKETS
>> +    struct xdp_sock *xsk;
>> +#endif
>>   };
>>   DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
>> @@ -693,6 +696,16 @@ static inline u32 bpf_prog_run_clear_cb(const 
>> struct bpf_prog *prog,
>>       return res;
>>   }
>> +#ifdef CONFIG_XDP_SOCKETS
>> +#define BPF_PROG_DIRECT_XSK    0x1
>> +#define bpf_is_direct_xsk_prog(prog) \
>> +    ((unsigned long)prog == BPF_PROG_DIRECT_XSK)
>> +DECLARE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
>> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp);
>> +#else
>> +#define bpf_is_direct_xsk_prog(prog) (false)
>> +#endif
>> +
>>   static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog 
>> *prog,
>>                           struct xdp_buff *xdp)
>>   {
>> @@ -702,6 +715,11 @@ static __always_inline u32 bpf_prog_run_xdp(const 
>> struct bpf_prog *prog,
>>        * already takes rcu_read_lock() when fetching the program, so
>>        * it's not necessary here anymore.
>>        */
>> +#ifdef CONFIG_XDP_SOCKETS
>> +    if (static_branch_unlikely(&xdp_direct_xsk_needed) &&
>> +        bpf_is_direct_xsk_prog(prog))
>> +        return bpf_direct_xsk(prog, xdp);
>> +#endif
>>       return BPF_PROG_RUN(prog, xdp);
>>   }
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 48cc71aae466..f4d0f70aa718 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -743,6 +743,7 @@ struct netdev_rx_queue {
>>       struct xdp_rxq_info        xdp_rxq;
>>   #ifdef CONFIG_XDP_SOCKETS
>>       struct xdp_umem                 *umem;
>> +    struct xdp_sock            *xsk;
>>   #endif
>>   } ____cacheline_aligned_in_smp;
>> @@ -1836,6 +1837,10 @@ struct net_device {
>>       atomic_t        carrier_up_count;
>>       atomic_t        carrier_down_count;
>> +#ifdef CONFIG_XDP_SOCKETS
>> +    u16            direct_xsk_count;
> 
> Why u16? num_rx/tx_queues are unsigned ints.

OK. Will changes to unsigned int

> 
>> +#endif
>> +
>>   #ifdef CONFIG_WIRELESS_EXT
>>       const struct iw_handler_def *wireless_handlers;
>>       struct iw_public_data    *wireless_data;
>> @@ -3712,6 +3717,11 @@ int dev_forward_skb(struct net_device *dev, 
>> struct sk_buff *skb);
>>   bool is_skb_forwardable(const struct net_device *dev,
>>               const struct sk_buff *skb);
>> +#ifdef CONFIG_XDP_SOCKETS
>> +int dev_set_direct_xsk_prog(struct net_device *dev);
>> +int dev_clear_direct_xsk_prog(struct net_device *dev);
>> +#endif
>> +
>>   static __always_inline int ____dev_forward_skb(struct net_device *dev,
>>                              struct sk_buff *skb)
>>   {
>> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>> index c9398ce7960f..9158233d34e1 100644
>> --- a/include/net/xdp_sock.h
>> +++ b/include/net/xdp_sock.h
>> @@ -76,6 +76,9 @@ struct xsk_map_node {
>>       struct xdp_sock **map_entry;
>>   };
>> +/* Flags for the xdp_sock flags field. */
>> +#define XDP_SOCK_DIRECT (1 << 0)
>> +
>>   struct xdp_sock {
>>       /* struct sock must be the first member of struct xdp_sock */
>>       struct sock sk;
>> @@ -104,6 +107,7 @@ struct xdp_sock {
>>       struct list_head map_list;
>>       /* Protects map_list */
>>       spinlock_t map_list_lock;
>> +    u16 flags;
> 
> Right now only the XDP_DIRECT is tracked here. Maybe track all flags, 
> and show them in xsk_diag.

I see zc as the other field that can be converted into a flag.
Do you want to include that as part of this series or can it be done later?

> 
>>   };
>>   struct xdp_buff;
>> @@ -129,6 +133,7 @@ void xsk_set_tx_need_wakeup(struct xdp_umem *umem);
>>   void xsk_clear_rx_need_wakeup(struct xdp_umem *umem);
>>   void xsk_clear_tx_need_wakeup(struct xdp_umem *umem);
>>   bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem);
>> +struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 
>> queue_id);
>>   void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>>                    struct xdp_sock **map_entry);
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index be328c59389d..d202b5d40859 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -25,6 +25,11 @@
>>    * application.
>>    */
>>   #define XDP_USE_NEED_WAKEUP (1 << 3)
>> +/* This option allows an AF_XDP socket bound to a queue to receive all
>> + * the packets directly from that queue when there is no XDP program
>> + * attached to the device.
>> + */
>> +#define XDP_DIRECT    (1 << 4)
>>   /* Flags for xsk_umem_config flags */
>>   #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 205f95af67d2..871d738a78d2 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -1349,13 +1349,14 @@ static void __bpf_prog_put(struct bpf_prog 
>> *prog, bool do_idr_lock)
>>   void bpf_prog_put(struct bpf_prog *prog)
>>   {
>> -    __bpf_prog_put(prog, true);
>> +    if (!bpf_is_direct_xsk_prog(prog))
>> +        __bpf_prog_put(prog, true);
>>   }
>>   EXPORT_SYMBOL_GPL(bpf_prog_put);
>>   u32 bpf_get_prog_id(const struct bpf_prog *prog)
>>   {
>> -    if (prog)
>> +    if (prog && !bpf_is_direct_xsk_prog(prog))
>>           return prog->aux->id;
>>       return 0;
>> @@ -1364,7 +1365,7 @@ EXPORT_SYMBOL(bpf_get_prog_id);
>>   void bpf_set_prog_id(struct bpf_prog *prog, u32 id)
>>   {
>> -    if (prog)
>> +    if (prog && !bpf_is_direct_xsk_prog(prog))
>>           prog->aux->id = id;
>>   }
>>   EXPORT_SYMBOL(bpf_set_prog_id);
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 866d0ad936a5..eb3cd718e580 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -8269,6 +8269,10 @@ int dev_change_xdp_fd(struct net_device *dev, 
>> struct netlink_ext_ack *extack,
>>       } else {
>>           if (!__dev_xdp_query(dev, bpf_op, query))
>>               return 0;
>> +#ifdef CONFIG_XDP_SOCKETS
>> +        if (dev->direct_xsk_count)
>> +            prog = (void *)BPF_PROG_DIRECT_XSK;
>> +#endif
> 
> Nit, but maybe hide this weirdness in a function?

OK.

> 
>>       }
>>       err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
>> @@ -8278,6 +8282,52 @@ int dev_change_xdp_fd(struct net_device *dev, 
>> struct netlink_ext_ack *extack,
>>       return err;
>>   }
>> +#ifdef CONFIG_XDP_SOCKETS
>> +int dev_set_direct_xsk_prog(struct net_device *dev)
>> +{
>> +    const struct net_device_ops *ops = dev->netdev_ops;
>> +    struct bpf_prog *prog;
>> +    bpf_op_t bpf_op;
>> +
>> +    ASSERT_RTNL();
>> +
>> +    dev->direct_xsk_count++;
>> +
>> +    bpf_op = ops->ndo_bpf;
>> +    if (!bpf_op)
>> +        return -EOPNOTSUPP;
>> +
>> +    if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
>> +        return 0;
>> +
>> +    prog = (void *)BPF_PROG_DIRECT_XSK;
>> +
>> +    return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, prog);
>> +}
>> +
>> +int dev_clear_direct_xsk_prog(struct net_device *dev)
>> +{
>> +    const struct net_device_ops *ops = dev->netdev_ops;
>> +    bpf_op_t bpf_op;
>> +
>> +    ASSERT_RTNL();
>> +
>> +    dev->direct_xsk_count--;
>> +
>> +    if (dev->direct_xsk_count)
>> +        return 0;
>> +
>> +    bpf_op = ops->ndo_bpf;
>> +    if (!bpf_op)
>> +        return -EOPNOTSUPP;
>> +
>> +    if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
>> +        return 0;
>> +
>> +    return dev_xdp_install(dev, bpf_op, NULL, XDP_FLAGS_DRV_MODE, NULL);
>> +}
>> +#endif
>> +
>>   /**
>>    *    dev_new_index    -    allocate an ifindex
>>    *    @net: the applicable net namespace
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index ed6563622ce3..391d7d600284 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -73,6 +73,7 @@
>>   #include <net/lwtunnel.h>
>>   #include <net/ipv6_stubs.h>
>>   #include <net/bpf_sk_storage.h>
>> +#include <linux/static_key.h>
>>   /**
>>    *    sk_filter_trim_cap - run a packet through a socket filter
>> @@ -3546,6 +3547,22 @@ static int __bpf_tx_xdp_map(struct net_device 
>> *dev_rx, void *fwd,
>>       return 0;
>>   }
>> +#ifdef CONFIG_XDP_SOCKETS
>> +static void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
>> +{
>> +    struct xdp_sock *xsk = READ_ONCE(ri->xsk);
> 
> Why READ_ONCE here?
> 
>> +
>> +    if (xsk) {
>> +        ri->xsk = NULL;
>> +        xsk_flush(xsk);
>> +    }
>> +}
>> +#else
>> +static inline void xdp_do_flush_xsk(struct bpf_redirect_info *ri)
>> +{
>> +}
>> +#endif
>> +
> 
> Move CONFIG_XDP_SOCKETS into the function, and remove the empty/bottom one.
> 
>>   void xdp_do_flush_map(void)
>>   {
>>       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> @@ -3568,6 +3585,8 @@ void xdp_do_flush_map(void)
>>               break;
>>           }
>>       }
>> +
>> +    xdp_do_flush_xsk(ri);
>>   }
>>   EXPORT_SYMBOL_GPL(xdp_do_flush_map);
>> @@ -3631,11 +3650,28 @@ static int xdp_do_redirect_map(struct 
>> net_device *dev, struct xdp_buff *xdp,
>>       return err;
>>   }
>> +#ifdef CONFIG_XDP_SOCKETS
>> +static inline struct xdp_sock *xdp_get_direct_xsk(struct 
>> bpf_redirect_info *ri)
>> +{
>> +    return READ_ONCE(ri->xsk);
> 
> Again, why READ_ONCE? Please leave the inlining to the compiler in .c 
> files.
> 
>> +}
>> +#else
>> +static inline struct xdp_sock *xdp_get_direct_xsk(struct 
>> bpf_redirect_info *ri)
>> +{
>> +    return NULL;
>> +}
>> +#endif
>> +
>>   int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>>               struct bpf_prog *xdp_prog)
>>   {
>>       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>>       struct bpf_map *map = READ_ONCE(ri->map);
>> +    struct xdp_sock *xsk;
>> +
>> +    xsk = xdp_get_direct_xsk(ri);
>> +    if (xsk)
>> +        return xsk_rcv(xsk, xdp);
> 
> Hmm, maybe you need a xsk_to_flush as well. Say that a user swaps in a
> regular XDP program, then xsk_rcv() will be called until the flush
> occurs, right? IOW, all packets processed (napi budget) in the napi_poll
> will end up in the socket.

I originally had xsk_to_flush considering this possibility of the XDP 
program changing before call to flush.
Will re-introduce it in the next revision.
Should i create a separate structure bpf_direct_xsk_info rather than 
adding these fields to bpf_redirect_info?


> 
>>       if (likely(map))
>>           return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
>> @@ -8934,4 +8970,26 @@ const struct bpf_verifier_ops 
>> sk_reuseport_verifier_ops = {
>>   const struct bpf_prog_ops sk_reuseport_prog_ops = {
>>   };
>> +
>> +#ifdef CONFIG_XDP_SOCKETS
>> +DEFINE_STATIC_KEY_FALSE(xdp_direct_xsk_needed);
>> +EXPORT_SYMBOL_GPL(xdp_direct_xsk_needed);
>> +
>> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
>> +{
>> +    struct xdp_sock *xsk;
>> +
>> +    xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
>> +    if (xsk) {
>> +        struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> +
>> +        ri->xsk = xsk;
>> +        return XDP_REDIRECT;
> 
>  From the comment above. I *think* you need to ri->xsk_to_flush. Can the
> direct socket (swap socket) change before flush?

Yes.

> 
>> +    }
>> +
>> +    return XDP_PASS;
>> +}
>> +EXPORT_SYMBOL(bpf_direct_xsk);
>> +#endif
>> +
>>   #endif /* CONFIG_INET */
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index fa8fbb8fa3c8..8a29939bac7e 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -24,6 +24,7 @@
>>   #include <linux/rculist.h>
>>   #include <net/xdp_sock.h>
>>   #include <net/xdp.h>
>> +#include <linux/if_link.h>
>>   #include "xsk_queue.h"
>>   #include "xdp_umem.h"
>> @@ -264,6 +265,45 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
>> xdp_buff *xdp)
>>       return err;
>>   }
>> +static void xdp_clear_direct_xsk(struct xdp_sock *xsk)
> 
> Use xs, and not xsk for consistency.

OK.

> 
>> +{
>> +    struct net_device *dev = xsk->dev;
>> +    u32 qid = xsk->queue_id;
>> +
>> +    if (!dev->_rx[qid].xsk)
>> +        return;
>> +
>> +    dev_clear_direct_xsk_prog(dev);
>> +    dev->_rx[qid].xsk = NULL;
>> +    static_branch_dec(&xdp_direct_xsk_needed);
>> +    xsk->flags &= ~XDP_SOCK_DIRECT;
>> +}
>> +
>> +static int xdp_set_direct_xsk(struct xdp_sock *xsk)
> 
> Same here.
> 
>> +{
>> +    struct net_device *dev = xsk->dev;
>> +    u32 qid = xsk->queue_id;
>> +    int err = 0;
>> +
>> +    if (dev->_rx[qid].xsk)
>> +        return -EEXIST;
>> +
>> +    xsk->flags |= XDP_SOCK_DIRECT;
>> +    static_branch_inc(&xdp_direct_xsk_needed);
>> +    dev->_rx[qid].xsk = xsk;
>> +    err = dev_set_direct_xsk_prog(dev);
>> +    if (err)
>> +        xdp_clear_direct_xsk(xsk);
>> +
>> +    return err;
>> +}
>> +
>> +struct xdp_sock *xdp_get_xsk_from_qid(struct net_device *dev, u16 
>> queue_id)
>> +{
>> +    return dev->_rx[queue_id].xsk;
>> +}
>> +EXPORT_SYMBOL(xdp_get_xsk_from_qid);
>> +
>>   void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
>>   {
>>       xskq_produce_flush_addr_n(umem->cq, nb_entries);
>> @@ -464,6 +504,11 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
>>           return;
>>       WRITE_ONCE(xs->state, XSK_UNBOUND);
>> +    if (xs->flags & XDP_SOCK_DIRECT) {
>> +        rtnl_lock();
>> +        xdp_clear_direct_xsk(xs);
>> +        rtnl_unlock();
>> +    }
>>       /* Wait for driver to stop using the xdp socket. */
>>       xdp_del_sk_umem(xs->umem, xs);
>>       xs->dev = NULL;
>> @@ -604,7 +649,7 @@ static int xsk_bind(struct socket *sock, struct 
>> sockaddr *addr, int addr_len)
>>       flags = sxdp->sxdp_flags;
>>       if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY |
>> -              XDP_USE_NEED_WAKEUP))
>> +              XDP_USE_NEED_WAKEUP | XDP_DIRECT))
>>           return -EINVAL;
>>       rtnl_lock();
>> @@ -632,7 +677,7 @@ static int xsk_bind(struct socket *sock, struct 
>> sockaddr *addr, int addr_len)
>>           struct socket *sock;
>>           if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY) ||
>> -            (flags & XDP_USE_NEED_WAKEUP)) {
>> +            (flags & XDP_USE_NEED_WAKEUP) || (flags & XDP_DIRECT)) {
>>               /* Cannot specify flags for shared sockets. */
>>               err = -EINVAL;
>>               goto out_unlock;
>> @@ -688,6 +733,8 @@ static int xsk_bind(struct socket *sock, struct 
>> sockaddr *addr, int addr_len)
>>       xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
>>       xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);
>>       xdp_add_sk_umem(xs->umem, xs);
>> +    if (flags & XDP_DIRECT)
>> +        err = xdp_set_direct_xsk(xs);
>>   out_unlock:
>>       if (err) {
>> diff --git a/tools/include/uapi/linux/if_xdp.h 
>> b/tools/include/uapi/linux/if_xdp.h
>> index be328c59389d..d202b5d40859 100644
>> --- a/tools/include/uapi/linux/if_xdp.h
>> +++ b/tools/include/uapi/linux/if_xdp.h
>> @@ -25,6 +25,11 @@
>>    * application.
>>    */
>>   #define XDP_USE_NEED_WAKEUP (1 << 3)
>> +/* This option allows an AF_XDP socket bound to a queue to receive all
>> + * the packets directly from that queue when there is no XDP program
>> + * attached to the device.
>> + */
>> +#define XDP_DIRECT    (1 << 4)
>>   /* Flags for xsk_umem_config flags */
>>   #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
       [not found]     ` <3ED8E928C4210A4289A677D2FEB48235140134CE@fmsmsx111.amr.corp.intel.com>
@ 2019-10-09 16:53       ` " Samudrala, Sridhar
  2019-10-09 17:17         ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09 16:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Karlsson, Magnus, Björn Töpel,
	Netdev, bpf, intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom


>> +
>> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
>> +{
>> +       struct xdp_sock *xsk;
>> +
>> +       xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
>> +       if (xsk) {
>> +               struct bpf_redirect_info *ri =
>> + this_cpu_ptr(&bpf_redirect_info);
>> +
>> +               ri->xsk = xsk;
>> +               return XDP_REDIRECT;
>> +       }
>> +
>> +       return XDP_PASS;
>> +}
>> +EXPORT_SYMBOL(bpf_direct_xsk);
> 
> So you're saying there is a:
> """
> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1 6.1x improvement in drop rate """
> 
> 6.1x gain running above C code vs exactly equivalent BPF code?
> How is that possible?

It seems to be due to the overhead of __bpf_prog_run on older processors 
(Ivybridge). The overhead is smaller on newer processors, but even on 
skylake i see around 1.5x improvement.

perf report with default xdpsock
================================
Samples: 2K of event 'cycles:ppp', Event count (approx.): 8437658090
Overhead  Command          Shared Object     Symbol
   34.57%  xdpsock          xdpsock           [.] main
   17.19%  ksoftirqd/1      [kernel.vmlinux]  [k] ___bpf_prog_run
   13.12%  xdpsock          [kernel.vmlinux]  [k] ___bpf_prog_run
    4.09%  ksoftirqd/1      [kernel.vmlinux]  [k] __x86_indirect_thunk_rax
    3.08%  xdpsock          [kernel.vmlinux]  [k] nmi
    2.76%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_map_lookup_elem
    2.33%  xdpsock          [kernel.vmlinux]  [k] __x86_indirect_thunk_rax
    2.33%  ksoftirqd/1      [i40e]            [k] i40e_clean_rx_irq_zc
    2.16%  xdpsock          [kernel.vmlinux]  [k] bpf_map_lookup_elem
    1.82%  ksoftirqd/1      [kernel.vmlinux]  [k] xdp_do_redirect
    1.41%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_rcv
    1.39%  ksoftirqd/1      [kernel.vmlinux]  [k] update_curr
    1.09%  ksoftirqd/1      [kernel.vmlinux]  [k] bpf_xdp_redirect_map
    1.09%  xdpsock          [i40e]            [k] i40e_clean_rx_irq_zc
    1.08%  ksoftirqd/1      [kernel.vmlinux]  [k] __xsk_map_redirect
    1.07%  swapper          [kernel.vmlinux]  [k] xsk_umem_peek_addr
    1.05%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_umem_peek_addr
    0.89%  swapper          [kernel.vmlinux]  [k] __xsk_map_redirect
    0.87%  ksoftirqd/1      [kernel.vmlinux]  [k] __bpf_prog_run32
    0.87%  swapper          [kernel.vmlinux]  [k] intel_idle
    0.67%  xdpsock          [kernel.vmlinux]  [k] bpf_xdp_redirect_map
    0.57%  xdpsock          [kernel.vmlinux]  [k] xdp_do_redirect

perf report with direct xdpsock
===============================
Samples: 2K of event 'cycles:ppp', Event count (approx.): 17996091975
Overhead  Command          Shared Object     Symbol
   18.44%  xdpsock          [i40e]            [k] i40e_clean_rx_irq_zc
   15.14%  ksoftirqd/1      [i40e]            [k] i40e_clean_rx_irq_zc
    6.87%  xdpsock          [kernel.vmlinux]  [k] xsk_umem_peek_addr
    5.03%  ksoftirqd/1      [kernel.vmlinux]  [k] xdp_do_redirect
    4.21%  xdpsock          xdpsock           [.] main
    4.13%  ksoftirqd/1      [i40e]            [k] 
i40e_clean_programming_status
    3.71%  xdpsock          [kernel.vmlinux]  [k] xsk_rcv
    3.44%  ksoftirqd/1      [kernel.vmlinux]  [k] nmi
    3.41%  xdpsock          [kernel.vmlinux]  [k] nmi
    3.20%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_rcv
    2.45%  xdpsock          [kernel.vmlinux]  [k] xdp_get_xsk_from_qid
    2.35%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_umem_peek_addr
    2.33%  ksoftirqd/1      [kernel.vmlinux]  [k] net_rx_action
    2.16%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_umem_consume_tx
    2.10%  swapper          [kernel.vmlinux]  [k] __softirqentry_text_start
    2.06%  xdpsock          [kernel.vmlinux]  [k] native_irq_return_iret
    1.43%  xdpsock          [kernel.vmlinux]  [k] check_preempt_wakeup
    1.42%  xdpsock          [kernel.vmlinux]  [k] xsk_umem_consume_tx
    1.22%  xdpsock          [kernel.vmlinux]  [k] xdp_do_redirect
    1.21%  xdpsock          [kernel.vmlinux]  [k] 
dma_direct_sync_single_for_device
    1.16%  ksoftirqd/1      [kernel.vmlinux]  [k] irqtime_account_irq
    1.09%  xdpsock          [kernel.vmlinux]  [k] sock_def_readable
    0.99%  swapper          [kernel.vmlinux]  [k] intel_idle
    0.88%  xdpsock          [i40e]            [k] 
i40e_clean_programming_status
    0.74%  ksoftirqd/1      [kernel.vmlinux]  [k] xsk_umem_discard_addr
    0.71%  ksoftirqd/1      [kernel.vmlinux]  [k] __switch_to
    0.50%  ksoftirqd/1      [kernel.vmlinux]  [k] 
dma_direct_sync_single_for_device

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets
  2019-10-09  6:29   ` Samudrala, Sridhar
@ 2019-10-09 16:53     ` Jakub Kicinski
  0 siblings, 0 replies; 42+ messages in thread
From: Jakub Kicinski @ 2019-10-09 16:53 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: magnus.karlsson, bjorn.topel, netdev, bpf, intel-wired-lan,
	maciej.fijalkowski, tom.herbert

On Tue, 8 Oct 2019 23:29:59 -0700, Samudrala, Sridhar wrote:
> On 10/8/2019 5:49 PM, Jakub Kicinski wrote:
> > I asked you to add numbers for handling those use cases in the kernel
> > directly.  
> 
> Forgot to explicitly mention that I didn't see any regressions with 
> xdp1, xdp2 or xdpsock in default mode with these patches. Performance 
> remained the same.

I'm not looking for regressions. The in-kernel path is faster, and
should be used for speeding things up rather than a "direct path to
user space". Your comparison should have 3 numbers - current AF_XDP,
patched AF_XDP, in-kernel handling.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-09 16:53       ` FW: " Samudrala, Sridhar
@ 2019-10-09 17:17         ` Alexei Starovoitov
  2019-10-09 19:12           ` Samudrala, Sridhar
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2019-10-09 17:17 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Karlsson, Magnus, Björn Töpel, Netdev, bpf,
	intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

On Wed, Oct 9, 2019 at 9:53 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
> >> +
> >> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
> >> +{
> >> +       struct xdp_sock *xsk;
> >> +
> >> +       xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
> >> +       if (xsk) {
> >> +               struct bpf_redirect_info *ri =
> >> + this_cpu_ptr(&bpf_redirect_info);
> >> +
> >> +               ri->xsk = xsk;
> >> +               return XDP_REDIRECT;
> >> +       }
> >> +
> >> +       return XDP_PASS;
> >> +}
> >> +EXPORT_SYMBOL(bpf_direct_xsk);
> >
> > So you're saying there is a:
> > """
> > xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
> >     default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
> >     direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1 6.1x improvement in drop rate """
> >
> > 6.1x gain running above C code vs exactly equivalent BPF code?
> > How is that possible?
>
> It seems to be due to the overhead of __bpf_prog_run on older processors
> (Ivybridge). The overhead is smaller on newer processors, but even on
> skylake i see around 1.5x improvement.
>
> perf report with default xdpsock
> ================================
> Samples: 2K of event 'cycles:ppp', Event count (approx.): 8437658090
> Overhead  Command          Shared Object     Symbol
>    34.57%  xdpsock          xdpsock           [.] main
>    17.19%  ksoftirqd/1      [kernel.vmlinux]  [k] ___bpf_prog_run
>    13.12%  xdpsock          [kernel.vmlinux]  [k] ___bpf_prog_run

That must be a bad joke.
The whole patch set is based on comparing native code to interpreter?!
It's pretty awesome that interpreter is only 1.5x slower than native x86.
Just turn the JIT on.

Obvious Nack to the patch set.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-09 17:17         ` Alexei Starovoitov
@ 2019-10-09 19:12           ` Samudrala, Sridhar
  2019-10-10  1:06             ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-09 19:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Karlsson, Magnus, Björn Töpel, Netdev, bpf,
	intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

On 10/9/2019 10:17 AM, Alexei Starovoitov wrote:
> On Wed, Oct 9, 2019 at 9:53 AM Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>>
>>
>>>> +
>>>> +u32 bpf_direct_xsk(const struct bpf_prog *prog, struct xdp_buff *xdp)
>>>> +{
>>>> +       struct xdp_sock *xsk;
>>>> +
>>>> +       xsk = xdp_get_xsk_from_qid(xdp->rxq->dev, xdp->rxq->queue_index);
>>>> +       if (xsk) {
>>>> +               struct bpf_redirect_info *ri =
>>>> + this_cpu_ptr(&bpf_redirect_info);
>>>> +
>>>> +               ri->xsk = xsk;
>>>> +               return XDP_REDIRECT;
>>>> +       }
>>>> +
>>>> +       return XDP_PASS;
>>>> +}
>>>> +EXPORT_SYMBOL(bpf_direct_xsk);
>>>
>>> So you're saying there is a:
>>> """
>>> xdpsock rxdrop 1 core (both app and queue's irq pinned to the same core)
>>>      default : taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1
>>>      direct-xsk :taskset -c 1 ./xdpsock -i enp66s0f0 -r -q 1 6.1x improvement in drop rate """
>>>
>>> 6.1x gain running above C code vs exactly equivalent BPF code?
>>> How is that possible?
>>
>> It seems to be due to the overhead of __bpf_prog_run on older processors
>> (Ivybridge). The overhead is smaller on newer processors, but even on
>> skylake i see around 1.5x improvement.
>>
>> perf report with default xdpsock
>> ================================
>> Samples: 2K of event 'cycles:ppp', Event count (approx.): 8437658090
>> Overhead  Command          Shared Object     Symbol
>>     34.57%  xdpsock          xdpsock           [.] main
>>     17.19%  ksoftirqd/1      [kernel.vmlinux]  [k] ___bpf_prog_run
>>     13.12%  xdpsock          [kernel.vmlinux]  [k] ___bpf_prog_run
> 
> That must be a bad joke.
> The whole patch set is based on comparing native code to interpreter?!
> It's pretty awesome that interpreter is only 1.5x slower than native x86.
> Just turn the JIT on.

Thanks Alexei for pointing out that i didn't have JIT on.
When i turn it on, the performance improvement is a more modest 1.5x 
with rxdrop and 1.2x with l2fwd.

> 
> Obvious Nack to the patch set.
> 

Will update the patchset with the right performance data and address 
feedback from Bjorn.
Hope you are not totally against direct XDP approach as it does provide 
value when an AF_XDP socket is bound to a queue and a HW filter can 
direct packets targeted for that queue.




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-09 19:12           ` Samudrala, Sridhar
@ 2019-10-10  1:06             ` Alexei Starovoitov
  2019-10-18 18:40               ` Samudrala, Sridhar
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2019-10-10  1:06 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Karlsson, Magnus, Björn Töpel, Netdev, bpf,
	intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

On Wed, Oct 9, 2019 at 12:12 PM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> >>     34.57%  xdpsock          xdpsock           [.] main
> >>     17.19%  ksoftirqd/1      [kernel.vmlinux]  [k] ___bpf_prog_run
> >>     13.12%  xdpsock          [kernel.vmlinux]  [k] ___bpf_prog_run
> >
> > That must be a bad joke.
> > The whole patch set is based on comparing native code to interpreter?!
> > It's pretty awesome that interpreter is only 1.5x slower than native x86.
> > Just turn the JIT on.
>
> Thanks Alexei for pointing out that i didn't have JIT on.
> When i turn it on, the performance improvement is a more modest 1.5x
> with rxdrop and 1.2x with l2fwd.

I want to see perf reports back to back with proper performance analysis.

> >
> > Obvious Nack to the patch set.
> >
>
> Will update the patchset with the right performance data and address
> feedback from Bjorn.
> Hope you are not totally against direct XDP approach as it does provide
> value when an AF_XDP socket is bound to a queue and a HW filter can
> direct packets targeted for that queue.

I used to be in favor of providing "prog bypass" for af_xdp,
because of anecdotal evidence that it will make af_xdp faster.
Now seeing the numbers and the way they were collected
I'm against such bypass.
I want to see hard proof that trivial bpf prog is actually slowing things down
before reviewing any new patch sets.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-10  1:06             ` Alexei Starovoitov
@ 2019-10-18 18:40               ` Samudrala, Sridhar
  2019-10-18 19:22                 ` Toke Høiland-Jørgensen
  2019-10-19  0:14                 ` Alexei Starovoitov
  0 siblings, 2 replies; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-18 18:40 UTC (permalink / raw)
  To: Alexei Starovoitov, Jakub Kicinski
  Cc: Karlsson, Magnus, Björn Töpel, Netdev, bpf,
	intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom



On 10/9/2019 6:06 PM, Alexei Starovoitov wrote:
>>
>> Will update the patchset with the right performance data and address
>> feedback from Bjorn.
>> Hope you are not totally against direct XDP approach as it does provide
>> value when an AF_XDP socket is bound to a queue and a HW filter can
>> direct packets targeted for that queue.
> 
> I used to be in favor of providing "prog bypass" for af_xdp,
> because of anecdotal evidence that it will make af_xdp faster.
> Now seeing the numbers and the way they were collected
> I'm against such bypass.
> I want to see hard proof that trivial bpf prog is actually slowing things down
> before reviewing any new patch sets.
> 

Here is a more detailed performance report that compares the current AF_XDP rx_drop
with the patches that enable direct receive without any XDP program. I also collected
and included kernel rxdrop data too as Jakub requested and also perf reports.
Hope it addresses the concerns you raised with the earlier data I posted.

Test Setup
==========
2 Skylake servers with Intel 40Gbe NICs connected via 100Gb Switch

Server Configuration
====================
Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz (skylake)
CPU(s):              112
On-line CPU(s) list: 0-111
Thread(s) per core:  2
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
NUMA node0 CPU(s):   0-27,56-83
NUMA node1 CPU(s):   28-55,84-111

Memory: 96GB

NIC: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)

Distro
======
Fedora 29 (Server Edition)

Kernel Configuration
====================
AF_XDP direct socket patches applied on top of
bpf-next git repo HEAD: 05949f63055fcf53947886ddb8e23c8a5d41bd80

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.3.0-bpf-next-dxk+ root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap LANG=en_IN.UTF-8

For ‘mitigations OFF’ scenarios, the kernel command line parameter is changed to add ‘mitigations=off’

Packet Generator on the link partner
====================================
   pktgen - sending 64 byte UDP packets at 43mpps

HW filter to redirect packet to a queue
=======================================
ethtool -N ens260f1 flow-type udp4 dst-ip 192.168.128.41 dst-port 9 action 28

Test Cases
==========
kernel rxdrop
   taskset -c 28 samples/bpf/xdp_rxq_info -d ens260f1 -a XDP_DROP

AF_XDP default rxdrop
   taskset -c 28 samples/bpf/xdpsock -i ens260f1 -r -q 28

AD_XDP direct rxdrop
   taskset -c 28 samples/bpf/xdpsock -i ens260f1 -r -d -q 28

Performance Results
===================
Only 1 core is used in all these testcases as the app and the queue irq are pinned to the same core.

----------------------------------------------------------------------------------
                                mitigations ON                mitigations OFF
   Testcase              ----------------------------------------------------------
                         no patches    with patches       no patches   with patches
----------------------------------------------------------------------------------
AF_XDP default rxdrop        X             X                   Y            Y
AF_XDP direct rxdrop        N/A          X+46%                N/A         Y+25%
Kernel rxdrop              X+61%         X+61%               Y+53%        Y+53%
----------------------------------------------------------------------------------

Here Y is pps with CPU security mitigations turned OFF and it is 26% better than X.
So there is 46% improvement in AF_XDP rxdrop performance with direct receive when
mitigations are ON (default configuration) and 25% improvement when mitigations are
turned OFF.
As expected, the in-kernel rxdrop performance is higher even with direct receive in
both scenarios.

Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
==========================================================================
Samples: 44K of event 'cycles', Event count (approx.): 38532389541
Overhead  Command          Shared Object              Symbol
   15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
   10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
    9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
    8.62%  xdpsock          xdpsock                    [.] main
    7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
    5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
    4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
    3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
    2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
    2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
    2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
    2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
    1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
    1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
    1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
    1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
    1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
    1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
    1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
    0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
    0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
    0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
    0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
    0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
    0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
    0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu

Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
==========================================================================
Samples: 46K of event 'cycles', Event count (approx.): 38387018585
Overhead  Command          Shared Object             Symbol
   21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
   14.36%  xdpsock          xdpsock                   [.] main
   11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
   11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
    4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
    2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
    2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
    2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
    2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
    2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
    1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
    1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
    1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
    1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
    0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
    0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
    0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
    0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
    0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
    0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
    0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
    0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr

Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
	bpf_prog_xxx
         bpf_xdp_redirect_map
	xsk_map_lookup_elem
         __xsk_map_redirect
With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()

The above test results document performance of components on a particular test, in specific systems.
Differences in hardware, software, or configuration will affect actual performance.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-18 18:40               ` Samudrala, Sridhar
@ 2019-10-18 19:22                 ` Toke Høiland-Jørgensen
  2019-10-19  0:14                 ` Alexei Starovoitov
  1 sibling, 0 replies; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-18 19:22 UTC (permalink / raw)
  To: Samudrala\, Sridhar, Alexei Starovoitov, Jakub Kicinski
  Cc: Karlsson, Magnus, Björn Töpel, Netdev, bpf,
	intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

"Samudrala, Sridhar" <sridhar.samudrala@intel.com> writes:

> Performance Results
> ===================
> Only 1 core is used in all these testcases as the app and the queue irq are pinned to the same core.
>
> ----------------------------------------------------------------------------------
>                                 mitigations ON                mitigations OFF
>    Testcase              ----------------------------------------------------------
>                          no patches    with patches       no patches   with patches
> ----------------------------------------------------------------------------------
> AF_XDP default rxdrop        X             X                   Y            Y

Is this really exactly the same with and without patches? You're adding
an extra check to xdp_do_redirect(); are you really saying that the
impact of that is zero?

> AF_XDP direct rxdrop        N/A          X+46%                N/A         Y+25%
> Kernel rxdrop              X+61%         X+61%               Y+53%        Y+53%
> ----------------------------------------------------------------------------------
>
> Here Y is pps with CPU security mitigations turned OFF and it is 26%
> better than X.

Any particular reason you're not sharing the values of X and Y?

-Toke


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-18 18:40               ` Samudrala, Sridhar
  2019-10-18 19:22                 ` Toke Høiland-Jørgensen
@ 2019-10-19  0:14                 ` Alexei Starovoitov
  2019-10-19  0:45                   ` Samudrala, Sridhar
  1 sibling, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2019-10-19  0:14 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jakub Kicinski, Karlsson, Magnus, Björn Töpel, Netdev,
	bpf, intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

On Fri, Oct 18, 2019 at 11:40:07AM -0700, Samudrala, Sridhar wrote:
> 
> Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
> ==========================================================================
> Samples: 44K of event 'cycles', Event count (approx.): 38532389541
> Overhead  Command          Shared Object              Symbol
>   15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
>   10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>    9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
>    8.62%  xdpsock          xdpsock                    [.] main
>    7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
>    5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
>    4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>    3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv

why everything is duplicated?
Same code runs in different tasks ?

>    2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>    2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
>    2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
>    2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
>    1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
>    1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
>    1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>    1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>    1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>    1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
>    1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
>    0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>    0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
>    0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
>    0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
>    0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
>    0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
>    0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
> 
> Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
> ==========================================================================
> Samples: 46K of event 'cycles', Event count (approx.): 38387018585
> Overhead  Command          Shared Object             Symbol
>   21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
>   14.36%  xdpsock          xdpsock                   [.] main
>   11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
>   11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
>    4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
>    2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
>    2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
>    2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
>    2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
>    2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>    1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>    1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>    1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
>    1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
>    0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
>    0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
>    0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
>    0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
>    0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
>    0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>    0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
>    0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr
> 
> Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
> AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
> 	bpf_prog_xxx
>         bpf_xdp_redirect_map
> 	xsk_map_lookup_elem
>         __xsk_map_redirect
> With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()

I don't think you're identifying the overhead correctly.
xsk_map_lookup_elem is 1%
but bpf_xdp_redirect_map() suppose to call __xsk_map_lookup_elem()
which is a different function:
ffffffff81493fe0 T __xsk_map_lookup_elem
ffffffff81492e80 t xsk_map_lookup_elem

10% for bpf_prog_80b55d8a76303785 is huge.
It's the actual code of the program _without_ any helpers.
How does the program actually look?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-19  0:14                 ` Alexei Starovoitov
@ 2019-10-19  0:45                   ` Samudrala, Sridhar
  2019-10-19  2:25                     ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-19  0:45 UTC (permalink / raw)
  To: Alexei Starovoitov, Toke Høiland-Jørgensen
  Cc: Jakub Kicinski, Karlsson, Magnus, Björn Töpel, Netdev,
	bpf, intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

On 10/18/2019 5:14 PM, Alexei Starovoitov wrote:
> On Fri, Oct 18, 2019 at 11:40:07AM -0700, Samudrala, Sridhar wrote:
>>
>> Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
>> ==========================================================================
>> Samples: 44K of event 'cycles', Event count (approx.): 38532389541
>> Overhead  Command          Shared Object              Symbol
>>    15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
>>    10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>>     9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
>>     8.62%  xdpsock          xdpsock                    [.] main
>>     7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
>>     5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
>>     4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>>     3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
> 
> why everything is duplicated?
> Same code runs in different tasks ?

Yes. looks like these functions run from both the app(xdpsock) context and ksoftirqd context.

> 
>>     2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>>     2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
>>     2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
>>     2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
>>     1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
>>     1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
>>     1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>>     1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>>     1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>>     1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
>>     1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
>>     0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>>     0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
>>     0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
>>     0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
>>     0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
>>     0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
>>     0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>>
>> Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
>> ==========================================================================
>> Samples: 46K of event 'cycles', Event count (approx.): 38387018585
>> Overhead  Command          Shared Object             Symbol
>>    21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
>>    14.36%  xdpsock          xdpsock                   [.] main
>>    11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
>>    11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
>>     4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
>>     2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
>>     2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
>>     2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
>>     2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
>>     2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>>     1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>>     1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>>     1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
>>     1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
>>     0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
>>     0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
>>     0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
>>     0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
>>     0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
>>     0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>>     0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
>>     0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr
>>
>> Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
>> AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
>> 	bpf_prog_xxx
>>          bpf_xdp_redirect_map
>> 	xsk_map_lookup_elem
>>          __xsk_map_redirect
>> With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()
> 
> I don't think you're identifying the overhead correctly.
> xsk_map_lookup_elem is 1%
> but bpf_xdp_redirect_map() suppose to call __xsk_map_lookup_elem()
> which is a different function:
> ffffffff81493fe0 T __xsk_map_lookup_elem
> ffffffff81492e80 t xsk_map_lookup_elem
> 
> 10% for bpf_prog_80b55d8a76303785 is huge.
> It's the actual code of the program _without_ any helpers.
> How does the program actually look?

It is the xdp program that is loaded via xsk_load_xdp_prog() in tools/lib/bpf/xsk.c
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/lib/bpf/xsk.c#n268

In the archives, i see that Toke had some comments, but somehow i didn't get his email in
my inbox.

> Performance Results
> ===================
> Only 1 core is used in all these testcases as the app and the queue irq are pinned to the same core.
>
> ----------------------------------------------------------------------------------
>                                 mitigations ON                mitigations OFF
>    Testcase              ----------------------------------------------------------
>                          no patches    with patches       no patches   with patches
> ----------------------------------------------------------------------------------
> AF_XDP default rxdrop        X             X                   Y            Y

> Is this really exactly the same with and without patches? You're adding
> an extra check to xdp_do_redirect(); are you really saying that the
> impact of that is zero?

Yes. I didn't see any impact. The variation is within +/- < 1%
I could use static_key even for that check in xdp_do_redirect() if required.

-Sridhar

  

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-19  0:45                   ` Samudrala, Sridhar
@ 2019-10-19  2:25                     ` Alexei Starovoitov
  2019-10-20 10:14                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2019-10-19  2:25 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Toke Høiland-Jørgensen, Jakub Kicinski, Karlsson,
	Magnus, Björn Töpel, Netdev, bpf, intel-wired-lan,
	Fijalkowski, Maciej, Herbert, Tom

On Fri, Oct 18, 2019 at 05:45:26PM -0700, Samudrala, Sridhar wrote:
> On 10/18/2019 5:14 PM, Alexei Starovoitov wrote:
> > On Fri, Oct 18, 2019 at 11:40:07AM -0700, Samudrala, Sridhar wrote:
> > > 
> > > Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
> > > ==========================================================================
> > > Samples: 44K of event 'cycles', Event count (approx.): 38532389541
> > > Overhead  Command          Shared Object              Symbol
> > >    15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
> > >    10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
> > >     9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
> > >     8.62%  xdpsock          xdpsock                    [.] main
> > >     7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
> > >     5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
> > >     4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
> > >     3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
> > 
> > why everything is duplicated?
> > Same code runs in different tasks ?
> 
> Yes. looks like these functions run from both the app(xdpsock) context and ksoftirqd context.
> 
> > 
> > >     2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
> > >     2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
> > >     2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
> > >     2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
> > >     1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
> > >     1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
> > >     1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
> > >     1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
> > >     1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
> > >     1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
> > >     1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
> > >     0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
> > >     0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
> > >     0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
> > >     0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
> > >     0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
> > >     0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
> > >     0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
> > > 
> > > Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
> > > ==========================================================================
> > > Samples: 46K of event 'cycles', Event count (approx.): 38387018585
> > > Overhead  Command          Shared Object             Symbol
> > >    21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
> > >    14.36%  xdpsock          xdpsock                   [.] main
> > >    11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
> > >    11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
> > >     4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
> > >     2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
> > >     2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
> > >     2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
> > >     2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
> > >     2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
> > >     1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
> > >     1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
> > >     1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
> > >     1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
> > >     0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
> > >     0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
> > >     0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
> > >     0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
> > >     0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
> > >     0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
> > >     0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
> > >     0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr
> > > 
> > > Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
> > > AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
> > > 	bpf_prog_xxx
> > >          bpf_xdp_redirect_map
> > > 	xsk_map_lookup_elem
> > >          __xsk_map_redirect
> > > With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()
> > 
> > I don't think you're identifying the overhead correctly.
> > xsk_map_lookup_elem is 1%
> > but bpf_xdp_redirect_map() suppose to call __xsk_map_lookup_elem()
> > which is a different function:
> > ffffffff81493fe0 T __xsk_map_lookup_elem
> > ffffffff81492e80 t xsk_map_lookup_elem
> > 
> > 10% for bpf_prog_80b55d8a76303785 is huge.
> > It's the actual code of the program _without_ any helpers.
> > How does the program actually look?
> 
> It is the xdp program that is loaded via xsk_load_xdp_prog() in tools/lib/bpf/xsk.c
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/lib/bpf/xsk.c#n268

I see. Looks like map_gen_lookup was never implemented for xskmap.
How about adding it first the way array_map_gen_lookup() is implemented?
This will easily give 2x perf gain.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-19  2:25                     ` Alexei Starovoitov
@ 2019-10-20 10:14                       ` Toke Høiland-Jørgensen
  2019-10-20 17:12                         ` [Intel-wired-lan] " Björn Töpel
  0 siblings, 1 reply; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-20 10:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Samudrala\, Sridhar
  Cc: Jakub Kicinski, Karlsson, Magnus, Björn Töpel, Netdev,
	bpf, intel-wired-lan, Fijalkowski, Maciej, Herbert, Tom

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Oct 18, 2019 at 05:45:26PM -0700, Samudrala, Sridhar wrote:
>> On 10/18/2019 5:14 PM, Alexei Starovoitov wrote:
>> > On Fri, Oct 18, 2019 at 11:40:07AM -0700, Samudrala, Sridhar wrote:
>> > > 
>> > > Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
>> > > ==========================================================================
>> > > Samples: 44K of event 'cycles', Event count (approx.): 38532389541
>> > > Overhead  Command          Shared Object              Symbol
>> > >    15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
>> > >    10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>> > >     9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
>> > >     8.62%  xdpsock          xdpsock                    [.] main
>> > >     7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
>> > >     5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
>> > >     4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>> > >     3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
>> > 
>> > why everything is duplicated?
>> > Same code runs in different tasks ?
>> 
>> Yes. looks like these functions run from both the app(xdpsock) context and ksoftirqd context.
>> 
>> > 
>> > >     2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>> > >     2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
>> > >     2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
>> > >     2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
>> > >     1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
>> > >     1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
>> > >     1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>> > >     1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>> > >     1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>> > >     1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
>> > >     1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
>> > >     0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>> > >     0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
>> > >     0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
>> > >     0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
>> > >     0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
>> > >     0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
>> > >     0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>> > > 
>> > > Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
>> > > ==========================================================================
>> > > Samples: 46K of event 'cycles', Event count (approx.): 38387018585
>> > > Overhead  Command          Shared Object             Symbol
>> > >    21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
>> > >    14.36%  xdpsock          xdpsock                   [.] main
>> > >    11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
>> > >    11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
>> > >     4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
>> > >     2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
>> > >     2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
>> > >     2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
>> > >     2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
>> > >     2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>> > >     1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>> > >     1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>> > >     1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
>> > >     1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
>> > >     0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
>> > >     0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
>> > >     0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
>> > >     0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
>> > >     0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
>> > >     0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>> > >     0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
>> > >     0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr
>> > > 
>> > > Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
>> > > AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
>> > > 	bpf_prog_xxx
>> > >          bpf_xdp_redirect_map
>> > > 	xsk_map_lookup_elem
>> > >          __xsk_map_redirect
>> > > With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()
>> > 
>> > I don't think you're identifying the overhead correctly.
>> > xsk_map_lookup_elem is 1%
>> > but bpf_xdp_redirect_map() suppose to call __xsk_map_lookup_elem()
>> > which is a different function:
>> > ffffffff81493fe0 T __xsk_map_lookup_elem
>> > ffffffff81492e80 t xsk_map_lookup_elem
>> > 
>> > 10% for bpf_prog_80b55d8a76303785 is huge.
>> > It's the actual code of the program _without_ any helpers.
>> > How does the program actually look?
>> 
>> It is the xdp program that is loaded via xsk_load_xdp_prog() in tools/lib/bpf/xsk.c
>> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/lib/bpf/xsk.c#n268
>
> I see. Looks like map_gen_lookup was never implemented for xskmap.
> How about adding it first the way array_map_gen_lookup() is implemented?
> This will easily give 2x perf gain.

I guess we should implement this for devmaps as well now that we allow
lookups into those.

However, in this particular example, the lookup from BPF is not actually
needed, since bpf_redirect_map() will return a configurable error value
when the map lookup fails (for exactly this use case).

So replacing:

if (bpf_map_lookup_elem(&xsks_map, &index))
    return bpf_redirect_map(&xsks_map, index, 0);

with simply

return bpf_redirect_map(&xsks_map, index, XDP_PASS);

would save the call to xsk_map_lookup_elem().

-Toke


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-20 10:14                       ` Toke Høiland-Jørgensen
@ 2019-10-20 17:12                         ` " Björn Töpel
  2019-10-21 20:10                           ` Samudrala, Sridhar
  0 siblings, 1 reply; 42+ messages in thread
From: Björn Töpel @ 2019-10-20 17:12 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Samudrala, Sridhar, Jakub Kicinski, Netdev,
	intel-wired-lan, Herbert, Tom, Fijalkowski, Maciej, bpf,
	Björn Töpel, Karlsson, Magnus

On Sun, 20 Oct 2019 at 12:15, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Fri, Oct 18, 2019 at 05:45:26PM -0700, Samudrala, Sridhar wrote:
> >> On 10/18/2019 5:14 PM, Alexei Starovoitov wrote:
> >> > On Fri, Oct 18, 2019 at 11:40:07AM -0700, Samudrala, Sridhar wrote:
> >> > >
> >> > > Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
> >> > > ==========================================================================
> >> > > Samples: 44K of event 'cycles', Event count (approx.): 38532389541
> >> > > Overhead  Command          Shared Object              Symbol
> >> > >    15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
> >> > >    10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
> >> > >     9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
> >> > >     8.62%  xdpsock          xdpsock                    [.] main
> >> > >     7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
> >> > >     5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
> >> > >     4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
> >> > >     3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
> >> >
> >> > why everything is duplicated?
> >> > Same code runs in different tasks ?
> >>
> >> Yes. looks like these functions run from both the app(xdpsock) context and ksoftirqd context.
> >>
> >> >
> >> > >     2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
> >> > >     2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
> >> > >     2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
> >> > >     2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
> >> > >     1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
> >> > >     1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
> >> > >     1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
> >> > >     1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
> >> > >     1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
> >> > >     1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
> >> > >     1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
> >> > >     0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
> >> > >     0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
> >> > >     0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
> >> > >     0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
> >> > >     0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
> >> > >     0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
> >> > >     0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
> >> > >
> >> > > Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
> >> > > ==========================================================================
> >> > > Samples: 46K of event 'cycles', Event count (approx.): 38387018585
> >> > > Overhead  Command          Shared Object             Symbol
> >> > >    21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
> >> > >    14.36%  xdpsock          xdpsock                   [.] main
> >> > >    11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
> >> > >    11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
> >> > >     4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
> >> > >     2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
> >> > >     2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
> >> > >     2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
> >> > >     2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
> >> > >     2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
> >> > >     1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
> >> > >     1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
> >> > >     1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
> >> > >     1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
> >> > >     0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
> >> > >     0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
> >> > >     0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
> >> > >     0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
> >> > >     0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
> >> > >     0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
> >> > >     0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
> >> > >     0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr
> >> > >
> >> > > Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
> >> > > AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
> >> > >  bpf_prog_xxx
> >> > >          bpf_xdp_redirect_map
> >> > >  xsk_map_lookup_elem
> >> > >          __xsk_map_redirect
> >> > > With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()
> >> >
> >> > I don't think you're identifying the overhead correctly.
> >> > xsk_map_lookup_elem is 1%
> >> > but bpf_xdp_redirect_map() suppose to call __xsk_map_lookup_elem()
> >> > which is a different function:
> >> > ffffffff81493fe0 T __xsk_map_lookup_elem
> >> > ffffffff81492e80 t xsk_map_lookup_elem
> >> >
> >> > 10% for bpf_prog_80b55d8a76303785 is huge.
> >> > It's the actual code of the program _without_ any helpers.
> >> > How does the program actually look?
> >>
> >> It is the xdp program that is loaded via xsk_load_xdp_prog() in tools/lib/bpf/xsk.c
> >> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/lib/bpf/xsk.c#n268
> >
> > I see. Looks like map_gen_lookup was never implemented for xskmap.
> > How about adding it first the way array_map_gen_lookup() is implemented?
> > This will easily give 2x perf gain.
>
> I guess we should implement this for devmaps as well now that we allow
> lookups into those.
>
> However, in this particular example, the lookup from BPF is not actually
> needed, since bpf_redirect_map() will return a configurable error value
> when the map lookup fails (for exactly this use case).
>
> So replacing:
>
> if (bpf_map_lookup_elem(&xsks_map, &index))
>     return bpf_redirect_map(&xsks_map, index, 0);
>
> with simply
>
> return bpf_redirect_map(&xsks_map, index, XDP_PASS);
>
> would save the call to xsk_map_lookup_elem().
>

Thanks for the reminder! I just submitted a patch. Still, doing the
map_gen_lookup()  for xsk/devmaps still makes sense!


Björn


> -Toke
>
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-20 17:12                         ` [Intel-wired-lan] " Björn Töpel
@ 2019-10-21 20:10                           ` Samudrala, Sridhar
  2019-10-21 22:34                             ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-21 20:10 UTC (permalink / raw)
  To: Björn Töpel, Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Jakub Kicinski, Netdev, intel-wired-lan,
	Herbert, Tom, Fijalkowski, Maciej, bpf, Björn Töpel,
	Karlsson, Magnus

On 10/20/2019 10:12 AM, Björn Töpel wrote:
> On Sun, 20 Oct 2019 at 12:15, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>
>>> On Fri, Oct 18, 2019 at 05:45:26PM -0700, Samudrala, Sridhar wrote:
>>>> On 10/18/2019 5:14 PM, Alexei Starovoitov wrote:
>>>>> On Fri, Oct 18, 2019 at 11:40:07AM -0700, Samudrala, Sridhar wrote:
>>>>>>
>>>>>> Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
>>>>>> ==========================================================================
>>>>>> Samples: 44K of event 'cycles', Event count (approx.): 38532389541
>>>>>> Overhead  Command          Shared Object              Symbol
>>>>>>     15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
>>>>>>     10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>>>>>>      9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
>>>>>>      8.62%  xdpsock          xdpsock                    [.] main
>>>>>>      7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
>>>>>>      5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
>>>>>>      4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>>>>>>      3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
>>>>>
>>>>> why everything is duplicated?
>>>>> Same code runs in different tasks ?
>>>>
>>>> Yes. looks like these functions run from both the app(xdpsock) context and ksoftirqd context.
>>>>
>>>>>
>>>>>>      2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>>>>>>      2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
>>>>>>      2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
>>>>>>      2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
>>>>>>      1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
>>>>>>      1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
>>>>>>      1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>>>>>>      1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>>>>>>      1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>>>>>>      1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
>>>>>>      1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
>>>>>>      0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>>>>>>      0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
>>>>>>      0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
>>>>>>      0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
>>>>>>      0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
>>>>>>      0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
>>>>>>      0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>>>>>>
>>>>>> Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
>>>>>> ==========================================================================
>>>>>> Samples: 46K of event 'cycles', Event count (approx.): 38387018585
>>>>>> Overhead  Command          Shared Object             Symbol
>>>>>>     21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
>>>>>>     14.36%  xdpsock          xdpsock                   [.] main
>>>>>>     11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
>>>>>>     11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
>>>>>>      4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
>>>>>>      2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
>>>>>>      2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
>>>>>>      2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
>>>>>>      2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
>>>>>>      2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>>>>>>      1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>>>>>>      1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>>>>>>      1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
>>>>>>      1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
>>>>>>      0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
>>>>>>      0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
>>>>>>      0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
>>>>>>      0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
>>>>>>      0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
>>>>>>      0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>>>>>>      0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
>>>>>>      0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr
>>>>>>
>>>>>> Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
>>>>>> AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
>>>>>>   bpf_prog_xxx
>>>>>>           bpf_xdp_redirect_map
>>>>>>   xsk_map_lookup_elem
>>>>>>           __xsk_map_redirect
>>>>>> With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()
>>>>>
>>>>> I don't think you're identifying the overhead correctly.
>>>>> xsk_map_lookup_elem is 1%
>>>>> but bpf_xdp_redirect_map() suppose to call __xsk_map_lookup_elem()
>>>>> which is a different function:
>>>>> ffffffff81493fe0 T __xsk_map_lookup_elem
>>>>> ffffffff81492e80 t xsk_map_lookup_elem
>>>>>
>>>>> 10% for bpf_prog_80b55d8a76303785 is huge.
>>>>> It's the actual code of the program _without_ any helpers.
>>>>> How does the program actually look?
>>>>
>>>> It is the xdp program that is loaded via xsk_load_xdp_prog() in tools/lib/bpf/xsk.c
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/lib/bpf/xsk.c#n268
>>>
>>> I see. Looks like map_gen_lookup was never implemented for xskmap.
>>> How about adding it first the way array_map_gen_lookup() is implemented?
>>> This will easily give 2x perf gain.
>>
>> I guess we should implement this for devmaps as well now that we allow
>> lookups into those.
>>
>> However, in this particular example, the lookup from BPF is not actually
>> needed, since bpf_redirect_map() will return a configurable error value
>> when the map lookup fails (for exactly this use case).
>>
>> So replacing:
>>
>> if (bpf_map_lookup_elem(&xsks_map, &index))
>>      return bpf_redirect_map(&xsks_map, index, 0);
>>
>> with simply
>>
>> return bpf_redirect_map(&xsks_map, index, XDP_PASS);
>>
>> would save the call to xsk_map_lookup_elem().
>>
> 
> Thanks for the reminder! I just submitted a patch. Still, doing the
> map_gen_lookup()  for xsk/devmaps still makes sense!
> 

I tried Bjorn's patch that avoids the lookups in the BPF prog.
https://lore.kernel.org/netdev/20191021105938.11820-1-bjorn.topel@gmail.com/

With this patch I am also seeing around 3-4% increase in xdpsock rxdrop performance and
the perf report looks like this.

Samples: 44K of event 'cycles', Event count (approx.): 38749965204
Overhead  Command          Shared Object              Symbol
   16.06%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
   10.18%  ksoftirqd/28     bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
   10.15%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
   10.06%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
    7.45%  xdpsock          xdpsock                    [.] main
    5.76%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
    4.51%  xdpsock          bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
    3.67%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
    3.06%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
    2.34%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
    2.33%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
    1.69%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
    1.69%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
    1.42%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
    1.19%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
    1.13%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
    0.95%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
    0.92%  swapper          [kernel.vmlinux]           [k] intel_idle
    0.92%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
    0.80%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
    0.73%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
    0.71%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_lookup_elem
    0.63%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
    0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
    0.58%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu

So with this patch applied, direct receive performance improvement comes down from 46% to 42%.
I think it is still substantial enough to provide an option to allow direct receive for
certain use cases. If it is OK, i can re-spin and submit the patches on top of the latest bpf-next

Thanks
Sridhar





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-21 20:10                           ` Samudrala, Sridhar
@ 2019-10-21 22:34                             ` Alexei Starovoitov
  2019-10-22 19:06                               ` Samudrala, Sridhar
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2019-10-21 22:34 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Björn Töpel, Toke Høiland-Jørgensen,
	Jakub Kicinski, Netdev, intel-wired-lan, Herbert, Tom,
	Fijalkowski, Maciej, bpf, Björn Töpel, Karlsson,
	Magnus

On Mon, Oct 21, 2019 at 1:10 PM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 10/20/2019 10:12 AM, Björn Töpel wrote:
> > On Sun, 20 Oct 2019 at 12:15, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >>
> >>> On Fri, Oct 18, 2019 at 05:45:26PM -0700, Samudrala, Sridhar wrote:
> >>>> On 10/18/2019 5:14 PM, Alexei Starovoitov wrote:
> >>>>> On Fri, Oct 18, 2019 at 11:40:07AM -0700, Samudrala, Sridhar wrote:
> >>>>>>
> >>>>>> Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
> >>>>>> ==========================================================================
> >>>>>> Samples: 44K of event 'cycles', Event count (approx.): 38532389541
> >>>>>> Overhead  Command          Shared Object              Symbol
> >>>>>>     15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
> >>>>>>     10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
> >>>>>>      9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
> >>>>>>      8.62%  xdpsock          xdpsock                    [.] main
> >>>>>>      7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
> >>>>>>      5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
> >>>>>>      4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
> >>>>>>      3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
> >>>>>
> >>>>> why everything is duplicated?
> >>>>> Same code runs in different tasks ?
> >>>>
> >>>> Yes. looks like these functions run from both the app(xdpsock) context and ksoftirqd context.
> >>>>
> >>>>>
> >>>>>>      2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
> >>>>>>      2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
> >>>>>>      2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
> >>>>>>      2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
> >>>>>>      1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
> >>>>>>      1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
> >>>>>>      1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
> >>>>>>      1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
> >>>>>>      1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
> >>>>>>      1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
> >>>>>>      1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
> >>>>>>      0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
> >>>>>>      0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
> >>>>>>      0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
> >>>>>>      0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
> >>>>>>      0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
> >>>>>>      0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
> >>>>>>      0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
> >>>>>>
> >>>>>> Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
> >>>>>> ==========================================================================
> >>>>>> Samples: 46K of event 'cycles', Event count (approx.): 38387018585
> >>>>>> Overhead  Command          Shared Object             Symbol
> >>>>>>     21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
> >>>>>>     14.36%  xdpsock          xdpsock                   [.] main
> >>>>>>     11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
> >>>>>>     11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
> >>>>>>      4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
> >>>>>>      2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
> >>>>>>      2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
> >>>>>>      2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
> >>>>>>      2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
> >>>>>>      2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
> >>>>>>      1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
> >>>>>>      1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
> >>>>>>      1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
> >>>>>>      1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
> >>>>>>      0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
> >>>>>>      0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
> >>>>>>      0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
> >>>>>>      0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
> >>>>>>      0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
> >>>>>>      0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
> >>>>>>      0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
> >>>>>>      0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr
> >>>>>>
> >>>>>> Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
> >>>>>> AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
> >>>>>>   bpf_prog_xxx
> >>>>>>           bpf_xdp_redirect_map
> >>>>>>   xsk_map_lookup_elem
> >>>>>>           __xsk_map_redirect
> >>>>>> With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()
> >>>>>
> >>>>> I don't think you're identifying the overhead correctly.
> >>>>> xsk_map_lookup_elem is 1%
> >>>>> but bpf_xdp_redirect_map() suppose to call __xsk_map_lookup_elem()
> >>>>> which is a different function:
> >>>>> ffffffff81493fe0 T __xsk_map_lookup_elem
> >>>>> ffffffff81492e80 t xsk_map_lookup_elem
> >>>>>
> >>>>> 10% for bpf_prog_80b55d8a76303785 is huge.
> >>>>> It's the actual code of the program _without_ any helpers.
> >>>>> How does the program actually look?
> >>>>
> >>>> It is the xdp program that is loaded via xsk_load_xdp_prog() in tools/lib/bpf/xsk.c
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/lib/bpf/xsk.c#n268
> >>>
> >>> I see. Looks like map_gen_lookup was never implemented for xskmap.
> >>> How about adding it first the way array_map_gen_lookup() is implemented?
> >>> This will easily give 2x perf gain.
> >>
> >> I guess we should implement this for devmaps as well now that we allow
> >> lookups into those.
> >>
> >> However, in this particular example, the lookup from BPF is not actually
> >> needed, since bpf_redirect_map() will return a configurable error value
> >> when the map lookup fails (for exactly this use case).
> >>
> >> So replacing:
> >>
> >> if (bpf_map_lookup_elem(&xsks_map, &index))
> >>      return bpf_redirect_map(&xsks_map, index, 0);
> >>
> >> with simply
> >>
> >> return bpf_redirect_map(&xsks_map, index, XDP_PASS);
> >>
> >> would save the call to xsk_map_lookup_elem().
> >>
> >
> > Thanks for the reminder! I just submitted a patch. Still, doing the
> > map_gen_lookup()  for xsk/devmaps still makes sense!
> >
>
> I tried Bjorn's patch that avoids the lookups in the BPF prog.
> https://lore.kernel.org/netdev/20191021105938.11820-1-bjorn.topel@gmail.com/
>
> With this patch I am also seeing around 3-4% increase in xdpsock rxdrop performance and
> the perf report looks like this.
>
> Samples: 44K of event 'cycles', Event count (approx.): 38749965204
> Overhead  Command          Shared Object              Symbol
>    16.06%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
>    10.18%  ksoftirqd/28     bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
>    10.15%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
>    10.06%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
>     7.45%  xdpsock          xdpsock                    [.] main
>     5.76%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
>     4.51%  xdpsock          bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
>     3.67%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
>     3.06%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>     2.34%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
>     2.33%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
>     1.69%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
>     1.69%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
>     1.42%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>     1.19%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>     1.13%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>     0.95%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>     0.92%  swapper          [kernel.vmlinux]           [k] intel_idle
>     0.92%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
>     0.80%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
>     0.73%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
>     0.71%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_lookup_elem
>     0.63%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
>     0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
>     0.58%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>
> So with this patch applied, direct receive performance improvement comes down from 46% to 42%.
> I think it is still substantial enough to provide an option to allow direct receive for
> certain use cases. If it is OK, i can re-spin and submit the patches on top of the latest bpf-next

I think it's too early to consider such drastic approach.
The run-time performance of XDP program should be the same as C code.
Something fishy in these numbers, since spending 10% cpu in few loads
and single call to bpf_xdp_redirect_map() just not right.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-21 22:34                             ` Alexei Starovoitov
@ 2019-10-22 19:06                               ` Samudrala, Sridhar
  2019-10-23 17:42                                 ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-22 19:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Björn Töpel, Toke Høiland-Jørgensen,
	Jakub Kicinski, Netdev, intel-wired-lan, Herbert, Tom,
	Fijalkowski, Maciej, bpf, Björn Töpel, Karlsson,
	Magnus

On 10/21/2019 3:34 PM, Alexei Starovoitov wrote:
> On Mon, Oct 21, 2019 at 1:10 PM Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>>
>> On 10/20/2019 10:12 AM, Björn Töpel wrote:
>>> On Sun, 20 Oct 2019 at 12:15, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>
>>>>> On Fri, Oct 18, 2019 at 05:45:26PM -0700, Samudrala, Sridhar wrote:
>>>>>> On 10/18/2019 5:14 PM, Alexei Starovoitov wrote:
>>>>>>> On Fri, Oct 18, 2019 at 11:40:07AM -0700, Samudrala, Sridhar wrote:
>>>>>>>>
>>>>>>>> Perf report for "AF_XDP default rxdrop" with patched kernel - mitigations ON
>>>>>>>> ==========================================================================
>>>>>>>> Samples: 44K of event 'cycles', Event count (approx.): 38532389541
>>>>>>>> Overhead  Command          Shared Object              Symbol
>>>>>>>>      15.31%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
>>>>>>>>      10.50%  ksoftirqd/28     bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>>>>>>>>       9.48%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
>>>>>>>>       8.62%  xdpsock          xdpsock                    [.] main
>>>>>>>>       7.11%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
>>>>>>>>       5.81%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
>>>>>>>>       4.46%  xdpsock          bpf_prog_80b55d8a76303785  [k] bpf_prog_80b55d8a76303785
>>>>>>>>       3.83%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
>>>>>>>
>>>>>>> why everything is duplicated?
>>>>>>> Same code runs in different tasks ?
>>>>>>
>>>>>> Yes. looks like these functions run from both the app(xdpsock) context and ksoftirqd context.
>>>>>>
>>>>>>>
>>>>>>>>       2.81%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>>>>>>>>       2.78%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_map_lookup_elem
>>>>>>>>       2.44%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
>>>>>>>>       2.19%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
>>>>>>>>       1.62%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
>>>>>>>>       1.57%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
>>>>>>>>       1.32%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>>>>>>>>       1.28%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>>>>>>>>       1.15%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>>>>>>>>       1.12%  xdpsock          [kernel.vmlinux]           [k] xsk_map_lookup_elem
>>>>>>>>       1.06%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
>>>>>>>>       0.94%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>>>>>>>>       0.75%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
>>>>>>>>       0.66%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
>>>>>>>>       0.64%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
>>>>>>>>       0.64%  swapper          [kernel.vmlinux]           [k] intel_idle
>>>>>>>>       0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
>>>>>>>>       0.57%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>>>>>>>>
>>>>>>>> Perf report for "AF_XDP direct rxdrop" with patched kernel - mitigations ON
>>>>>>>> ==========================================================================
>>>>>>>> Samples: 46K of event 'cycles', Event count (approx.): 38387018585
>>>>>>>> Overhead  Command          Shared Object             Symbol
>>>>>>>>      21.94%  ksoftirqd/28     [i40e]                    [k] i40e_clean_rx_irq_zc
>>>>>>>>      14.36%  xdpsock          xdpsock                   [.] main
>>>>>>>>      11.53%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_rcv
>>>>>>>>      11.32%  xdpsock          [i40e]                    [k] i40e_clean_rx_irq_zc
>>>>>>>>       4.02%  xdpsock          [kernel.vmlinux]          [k] xsk_rcv
>>>>>>>>       2.91%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_do_redirect
>>>>>>>>       2.45%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_peek_addr
>>>>>>>>       2.19%  xdpsock          [kernel.vmlinux]          [k] xsk_umem_peek_addr
>>>>>>>>       2.08%  ksoftirqd/28     [kernel.vmlinux]          [k] bpf_direct_xsk
>>>>>>>>       2.07%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>>>>>>>>       1.53%  ksoftirqd/28     [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>>>>>>>>       1.39%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_device
>>>>>>>>       1.22%  ksoftirqd/28     [kernel.vmlinux]          [k] xdp_get_xsk_from_qid
>>>>>>>>       1.12%  ksoftirqd/28     [i40e]                    [k] i40e_clean_programming_status
>>>>>>>>       0.96%  ksoftirqd/28     [i40e]                    [k] i40e_napi_poll
>>>>>>>>       0.95%  ksoftirqd/28     [kernel.vmlinux]          [k] net_rx_action
>>>>>>>>       0.89%  xdpsock          [kernel.vmlinux]          [k] xdp_do_redirect
>>>>>>>>       0.83%  swapper          [i40e]                    [k] i40e_clean_rx_irq_zc
>>>>>>>>       0.70%  swapper          [kernel.vmlinux]          [k] intel_idle
>>>>>>>>       0.66%  xdpsock          [kernel.vmlinux]          [k] dma_direct_sync_single_for_cpu
>>>>>>>>       0.60%  xdpsock          [kernel.vmlinux]          [k] bpf_direct_xsk
>>>>>>>>       0.50%  ksoftirqd/28     [kernel.vmlinux]          [k] xsk_umem_discard_addr
>>>>>>>>
>>>>>>>> Based on the perf reports comparing AF_XDP default and direct rxdrop, we can say that
>>>>>>>> AF_XDP direct rxdrop codepath is avoiding the overhead of going through these functions
>>>>>>>>    bpf_prog_xxx
>>>>>>>>            bpf_xdp_redirect_map
>>>>>>>>    xsk_map_lookup_elem
>>>>>>>>            __xsk_map_redirect
>>>>>>>> With AF_XDP direct, xsk_rcv() is directly called via bpf_direct_xsk() in xdp_do_redirect()
>>>>>>>
>>>>>>> I don't think you're identifying the overhead correctly.
>>>>>>> xsk_map_lookup_elem is 1%
>>>>>>> but bpf_xdp_redirect_map() suppose to call __xsk_map_lookup_elem()
>>>>>>> which is a different function:
>>>>>>> ffffffff81493fe0 T __xsk_map_lookup_elem
>>>>>>> ffffffff81492e80 t xsk_map_lookup_elem
>>>>>>>
>>>>>>> 10% for bpf_prog_80b55d8a76303785 is huge.
>>>>>>> It's the actual code of the program _without_ any helpers.
>>>>>>> How does the program actually look?
>>>>>>
>>>>>> It is the xdp program that is loaded via xsk_load_xdp_prog() in tools/lib/bpf/xsk.c
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/lib/bpf/xsk.c#n268
>>>>>
>>>>> I see. Looks like map_gen_lookup was never implemented for xskmap.
>>>>> How about adding it first the way array_map_gen_lookup() is implemented?
>>>>> This will easily give 2x perf gain.
>>>>
>>>> I guess we should implement this for devmaps as well now that we allow
>>>> lookups into those.
>>>>
>>>> However, in this particular example, the lookup from BPF is not actually
>>>> needed, since bpf_redirect_map() will return a configurable error value
>>>> when the map lookup fails (for exactly this use case).
>>>>
>>>> So replacing:
>>>>
>>>> if (bpf_map_lookup_elem(&xsks_map, &index))
>>>>       return bpf_redirect_map(&xsks_map, index, 0);
>>>>
>>>> with simply
>>>>
>>>> return bpf_redirect_map(&xsks_map, index, XDP_PASS);
>>>>
>>>> would save the call to xsk_map_lookup_elem().
>>>>
>>>
>>> Thanks for the reminder! I just submitted a patch. Still, doing the
>>> map_gen_lookup()  for xsk/devmaps still makes sense!
>>>
>>
>> I tried Bjorn's patch that avoids the lookups in the BPF prog.
>> https://lore.kernel.org/netdev/20191021105938.11820-1-bjorn.topel@gmail.com/
>>
>> With this patch I am also seeing around 3-4% increase in xdpsock rxdrop performance and
>> the perf report looks like this.
>>
>> Samples: 44K of event 'cycles', Event count (approx.): 38749965204
>> Overhead  Command          Shared Object              Symbol
>>     16.06%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
>>     10.18%  ksoftirqd/28     bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
>>     10.15%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
>>     10.06%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
>>      7.45%  xdpsock          xdpsock                    [.] main
>>      5.76%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
>>      4.51%  xdpsock          bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
>>      3.67%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
>>      3.06%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>>      2.34%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
>>      2.33%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
>>      1.69%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
>>      1.69%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
>>      1.42%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>>      1.19%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>>      1.13%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>>      0.95%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
>>      0.92%  swapper          [kernel.vmlinux]           [k] intel_idle
>>      0.92%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
>>      0.80%  ksoftirqd/28     [kernel.vmlinux]           [k] __x86_indirect_thunk_rax
>>      0.73%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
>>      0.71%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_lookup_elem
>>      0.63%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action
>>      0.62%  ksoftirqd/28     [i40e]                     [k] i40e_napi_poll
>>      0.58%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
>>
>> So with this patch applied, direct receive performance improvement comes down from 46% to 42%.
>> I think it is still substantial enough to provide an option to allow direct receive for
>> certain use cases. If it is OK, i can re-spin and submit the patches on top of the latest bpf-next
> 
> I think it's too early to consider such drastic approach.
> The run-time performance of XDP program should be the same as C code.
> Something fishy in these numbers, since spending 10% cpu in few loads
> and single call to bpf_xdp_redirect_map() just not right.

OK. Here is another data point that shows the perf report with the same test but CPU mitigations
turned OFF. Here bpf_prog overhead goes down from almost (10.18 + 4.51)% to (3.23 + 1.44%).

   21.40%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
   14.13%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
    8.33%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
    6.09%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
    5.19%  xdpsock          xdpsock                    [.] main
    3.48%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
    3.23%  ksoftirqd/28     bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
    3.06%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_redirect
    2.72%  xdpsock          [kernel.vmlinux]           [k] xdp_do_redirect
    2.27%  xdpsock          [kernel.vmlinux]           [k] xsk_rcv
    2.10%  xdpsock          [kernel.vmlinux]           [k] xsk_umem_peek_addr
    2.09%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_umem_peek_addr
    1.89%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
    1.44%  xdpsock          bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
    1.36%  xdpsock          [kernel.vmlinux]           [k] bpf_xdp_redirect_map
    1.31%  ksoftirqd/28     [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
    1.30%  xdpsock          [kernel.vmlinux]           [k] __xsk_map_redirect
    1.23%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_device
    0.97%  ksoftirqd/28     [kernel.vmlinux]           [k] __xsk_map_lookup_elem
    0.90%  ksoftirqd/28     [i40e]                     [k] i40e_clean_programming_status
    0.81%  xdpsock          [kernel.vmlinux]           [k] dma_direct_sync_single_for_cpu
    0.76%  swapper          [i40e]                     [k] i40e_clean_rx_irq_zc
    0.75%  swapper          [kernel.vmlinux]           [k] intel_idle
    0.59%  ksoftirqd/28     [kernel.vmlinux]           [k] net_rx_action

So a major component of the bpf_prog overhead seems to be due to the CPU vulnerability mitigations.
The other component is the bpf_xdp_redirect_map() codepath.

Let me know if it helps to collect any other data that should further help with the perf analysis.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-22 19:06                               ` Samudrala, Sridhar
@ 2019-10-23 17:42                                 ` Alexei Starovoitov
  2019-10-24 18:12                                   ` Samudrala, Sridhar
  2019-10-25  9:07                                   ` Björn Töpel
  0 siblings, 2 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2019-10-23 17:42 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Björn Töpel, Toke Høiland-Jørgensen,
	Jakub Kicinski, Netdev, intel-wired-lan, Herbert, Tom,
	Fijalkowski, Maciej, bpf, Björn Töpel, Karlsson,
	Magnus

On Tue, Oct 22, 2019 at 12:06 PM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> OK. Here is another data point that shows the perf report with the same test but CPU mitigations
> turned OFF. Here bpf_prog overhead goes down from almost (10.18 + 4.51)% to (3.23 + 1.44%).
>
>    21.40%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
>    14.13%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
>     8.33%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
>     6.09%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
>     5.19%  xdpsock          xdpsock                    [.] main
>     3.48%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
>     3.23%  ksoftirqd/28     bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
>
> So a major component of the bpf_prog overhead seems to be due to the CPU vulnerability mitigations.

I feel that it's an incorrect conclusion because JIT is not doing
any retpolines (because there are no indirect calls in bpf).
There should be no difference in bpf_prog runtime with or without mitigations.
Also you're running root, so no spectre mitigations either.

This 3% seems like a lot for a function that does few loads that should
hit d-cache and one direct call.
Please investigate why you're seeing this 10% cpu cost when mitigations are on.
perf report/annotate is the best.
Also please double check that you're using the latest perf.
Since bpf performance analysis was greatly improved several versions ago.
I don't think old perf will be showing bogus numbers, but better to
run the latest.

> The other component is the bpf_xdp_redirect_map() codepath.
>
> Let me know if it helps to collect any other data that should further help with the perf analysis.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-23 17:42                                 ` Alexei Starovoitov
@ 2019-10-24 18:12                                   ` Samudrala, Sridhar
  2019-10-25  7:42                                     ` Björn Töpel
  2019-10-25  9:07                                   ` Björn Töpel
  1 sibling, 1 reply; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-24 18:12 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: bjorn.topel, bjorn.topel, bpf, intel-wired-lan, jakub.kicinski,
	maciej.fijalkowski, magnus.karlsson, netdev, sridhar.samudrala,
	toke, tom.herbert


> > OK. Here is another data point that shows the perf report with the same test but CPU mitigations
> > turned OFF. Here bpf_prog overhead goes down from almost (10.18 + 4.51)% to (3.23 + 1.44%).
> >
> >    21.40%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
> >    14.13%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
> >     8.33%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
> >     6.09%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
> >     5.19%  xdpsock          xdpsock                    [.] main
> >     3.48%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
> >     3.23%  ksoftirqd/28     bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
> >
> > So a major component of the bpf_prog overhead seems to be due to the CPU vulnerability mitigations.

> I feel that it's an incorrect conclusion because JIT is not doing
> any retpolines (because there are no indirect calls in bpf).
> There should be no difference in bpf_prog runtime with or without mitigations.
> Also you're running root, so no spectre mitigations either.

> This 3% seems like a lot for a function that does few loads that should
> hit d-cache and one direct call.
> Please investigate why you're seeing this 10% cpu cost when mitigations are on.
> perf report/annotate is the best.
> Also please double check that you're using the latest perf.
> Since bpf performance analysis was greatly improved several versions ago.
> I don't think old perf will be showing bogus numbers, but better to
> run the latest.

Here is perf annotate output for bpf_prog_ with and without mitigations turned ON
Using the perf built from the bpf-next tree.
   perf version 5.3.g4071324a76c1

With mitigations ON
-------------------
Samples: 6K of event 'cycles', 4000 Hz, Event count (approx.): 5646512726
bpf_prog_3c8251c7e0fef8db  bpf_prog_3c8251c7e0fef8db [Percent: local period]
  45.05      push   %rbp
   0.02      mov    %rsp,%rbp
   0.03      sub    $0x8,%rsp
  22.09      push   %rbx
   7.66      push   %r13
   1.08      push   %r14
   1.85      push   %r15
   0.63      pushq  $0x0
   1.13      mov    0x28(%rdi),%rsi
   0.47      mov    0x8(%rsi),%esi
   3.47      mov    %esi,-0x4(%rbp)
   0.02      movabs $0xffff8ab414a83e00,%rdi
   0.90      mov    $0x2,%edx
   2.85      callq  *ffffffffd149fc5f
   1.55      and    $0x6,%rax
             test   %rax,%rax
   1.48      jne    72
             mov    %rbp,%rsi
             add    $0xfffffffffffffffc,%rsi
             movabs $0xffff8ab414a83e00,%rdi
             callq  *ffffffffd0e5fd5f
             mov    %rax,%rdi
             mov    $0x2,%eax
             test   %rdi,%rdi
             je     72
             mov    -0x4(%rbp),%esi
             movabs $0xffff8ab414a83e00,%rdi
             xor    %edx,%edx
             callq  *ffffffffd149fc5f
        72:  pop    %rbx
             pop    %r15
   1.90      pop    %r14
   1.93      pop    %r13
             pop    %rbx
   3.63      leaveq
   2.27      retq

With mitigations OFF
--------------------
Samples: 2K of event 'cycles', 4000 Hz, Event count (approx.): 1872116166
bpf_prog_3c8251c7e0fef8db  bpf_prog_3c8251c7e0fef8db [Percent: local period]
   0.15      push   %rbp
             mov    %rsp,%rbp
  13.79      sub    $0x8,%rsp
   0.30      push   %rbx
   0.15      push   %r13
   0.20      push   %r14
  14.50      push   %r15
   0.20      pushq  $0x0
             mov    0x28(%rdi),%rsi
   0.25      mov    0x8(%rsi),%esi
  14.37      mov    %esi,-0x4(%rbp)
   0.25      movabs $0xffff8ea2c673b800,%rdi
             mov    $0x2,%edx
  13.60      callq  *ffffffffe50c2f38
  14.33      and    $0x6,%rax
             test   %rax,%rax
             jne    72
             mov    %rbp,%rsi
             add    $0xfffffffffffffffc,%rsi
             movabs $0xffff8ea2c673b800,%rdi
             callq  *ffffffffe4a83038
             mov    %rax,%rdi
             mov    $0x2,%eax
             test   %rdi,%rdi
             je     72
             mov    -0x4(%rbp),%esi
             movabs $0xffff8ea2c673b800,%rdi
             xor    %edx,%edx
             callq  *ffffffffe50c2f38
        72:  pop    %rbx
             pop    %r15
  13.97      pop    %r14
   0.10      pop    %r13
             pop    %rbx
  13.71      leaveq
   0.15      retq

Do you see any issues with this data? With mitigations ON push %rbp and push %rbx overhead seems to
be pretty high.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-24 18:12                                   ` Samudrala, Sridhar
@ 2019-10-25  7:42                                     ` Björn Töpel
  2019-10-31 22:38                                       ` Samudrala, Sridhar
  0 siblings, 1 reply; 42+ messages in thread
From: Björn Töpel @ 2019-10-25  7:42 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Alexei Starovoitov, Björn Töpel, bpf, intel-wired-lan,
	Jakub Kicinski, Fijalkowski, Maciej, Karlsson, Magnus, Netdev,
	Toke Høiland-Jørgensen, Herbert, Tom

On Thu, 24 Oct 2019 at 20:12, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
[...]
>
> With mitigations ON
> -------------------
> Samples: 6K of event 'cycles', 4000 Hz, Event count (approx.): 5646512726
> bpf_prog_3c8251c7e0fef8db  bpf_prog_3c8251c7e0fef8db [Percent: local period]
>   45.05      push   %rbp
>    0.02      mov    %rsp,%rbp
>    0.03      sub    $0x8,%rsp
>   22.09      push   %rbx

[...]

>
> Do you see any issues with this data? With mitigations ON push %rbp and push %rbx overhead seems to
> be pretty high.

That's sample skid from the retpoline thunk when entring the XDP
program. Pretty expensive push otherwise! :-)

Another thought; Disclaimer: I'm no spectrev2 expert, and probably not
getting the mitigations well enough. So this is me trying to swim at
the deep end! Would it be possible to avoid the retpoline when
entering the XDP program. At least for some XDP program that can be
proved "safe"? If so, PeterZ's upcoming static_call could be used from
the driver side.


Björn

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-23 17:42                                 ` Alexei Starovoitov
  2019-10-24 18:12                                   ` Samudrala, Sridhar
@ 2019-10-25  9:07                                   ` Björn Töpel
  1 sibling, 0 replies; 42+ messages in thread
From: Björn Töpel @ 2019-10-25  9:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Samudrala, Sridhar, Toke Høiland-Jørgensen,
	Jakub Kicinski, Netdev, intel-wired-lan, Herbert, Tom,
	Fijalkowski, Maciej, bpf, Björn Töpel, Karlsson,
	Magnus

On Wed, 23 Oct 2019 at 19:42, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 22, 2019 at 12:06 PM Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
> >
> > OK. Here is another data point that shows the perf report with the same test but CPU mitigations
> > turned OFF. Here bpf_prog overhead goes down from almost (10.18 + 4.51)% to (3.23 + 1.44%).
> >
> >    21.40%  ksoftirqd/28     [i40e]                     [k] i40e_clean_rx_irq_zc
> >    14.13%  xdpsock          [i40e]                     [k] i40e_clean_rx_irq_zc
> >     8.33%  ksoftirqd/28     [kernel.vmlinux]           [k] xsk_rcv
> >     6.09%  ksoftirqd/28     [kernel.vmlinux]           [k] xdp_do_redirect
> >     5.19%  xdpsock          xdpsock                    [.] main
> >     3.48%  ksoftirqd/28     [kernel.vmlinux]           [k] bpf_xdp_redirect_map
> >     3.23%  ksoftirqd/28     bpf_prog_3c8251c7e0fef8db  [k] bpf_prog_3c8251c7e0fef8db
> >
> > So a major component of the bpf_prog overhead seems to be due to the CPU vulnerability mitigations.
>
> I feel that it's an incorrect conclusion because JIT is not doing
> any retpolines (because there are no indirect calls in bpf).
> There should be no difference in bpf_prog runtime with or without mitigations.
> Also you're running root, so no spectre mitigations either.
>
> This 3% seems like a lot for a function that does few loads that should
> hit d-cache and one direct call.
> Please investigate why you're seeing this 10% cpu cost when mitigations are on.
> perf report/annotate is the best.
> Also please double check that you're using the latest perf.
> Since bpf performance analysis was greatly improved several versions ago.
> I don't think old perf will be showing bogus numbers, but better to
> run the latest.
>

For comparison, on my Skylake 3GHz with mitigations off. (I have one
internal patch that inlines xsk_rcv() into __xsk_map_redirect, so
that's why it's not showing xsk_rcv(). I'll upstream that...)

  41.79%  [kernel]                   [k] i40e_clean_rx_irq_zc
  15.55%  [kernel]                   [k] __xsk_map_redirect
   9.87%  [kernel]                   [k] xdp_do_redirect
   6.89%  [kernel]                   [k] xsk_umem_peek_addr
   6.37%  [kernel]                   [k] bpf_xdp_redirect_map
   5.02%  bpf_prog_992d9ddc835e5629  [k] bpf_prog_992d9ddc835e5629

Again, it might look weird that simple functions like
bpf_xdp_redirect_map and the XDP program is 6% and 5%.

Let's dig into that. I let the xdpsock program (rxdrop) run on one
core 22, and the ksoftirqd on core 20. Core 20 is only processing
packets, plus the regular kernel householding. I did a processor trace
for core 20 for 207 936 packets.

In total it's 84,407,427 instructions, and bpf_xdp_redirect_map() is
8,109,504 instructions, which is 9.6%. bpf_xdp_redirect_map() executes
39 instructions for AF_XDP. As perf is reporting less than 9.6% means
that the IPC count of that function is more than the average which
perf-stat reports as IPC of 2.88.

The BPF program executes fewer instructions than
bpf_xdp_redirect_map(), so given that perf shows 5%, means that the
IPC count is better than average here as well.

So, it's roughly 405 instructions per packet, and with an IPC of 2.88
that'll give ~140 cycles per packet, which on this machine
(3,000,000,000/140) is ~21.4 Mpps. The xdpsock application reports
21. This is sane.

The TL;DR version is: 6% and 5% for bpf_xdp_redirect_map and
bpf_prog_992d9ddc835e5629 might seem high, but it's just that the
total number of instruction executing is fairly low. So, even though
the functions are small in size, it will show up as non-negligible
percentage in perf.

At these speeds, really small things have an impact on
performance. DPDK has ~50 cycles per packet.


Björn



> > The other component is the bpf_xdp_redirect_map() codepath.
> >
> > Let me know if it helps to collect any other data that should further help with the perf analysis.
> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-25  7:42                                     ` Björn Töpel
@ 2019-10-31 22:38                                       ` Samudrala, Sridhar
  2019-10-31 23:15                                         ` Alexei Starovoitov
  2019-11-01  0:21                                         ` Jakub Kicinski
  0 siblings, 2 replies; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-10-31 22:38 UTC (permalink / raw)
  To: bjorn.topel
  Cc: alexei.starovoitov, bjorn.topel, bpf, intel-wired-lan,
	jakub.kicinski, maciej.fijalkowski, magnus.karlsson, netdev,
	sridhar.samudrala, toke, tom.herbert


[...]
> >
> > With mitigations ON
> > -------------------
> > Samples: 6K of event 'cycles', 4000 Hz, Event count (approx.): 5646512726
> > bpf_prog_3c8251c7e0fef8db  bpf_prog_3c8251c7e0fef8db [Percent: local period]
> >   45.05      push   %rbp
> >    0.02      mov    %rsp,%rbp
> >    0.03      sub    $0x8,%rsp
> >   22.09      push   %rbx

> [...]
> >
> > Do you see any issues with this data? With mitigations ON push %rbp and push %rbx overhead seems to
> > be pretty high.

> That's sample skid from the retpoline thunk when entring the XDP
> program. Pretty expensive push otherwise! :-)

> Another thought; Disclaimer: I'm no spectrev2 expert, and probably not
> getting the mitigations well enough. So this is me trying to swim at
> the deep end! Would it be possible to avoid the retpoline when
> entering the XDP program. At least for some XDP program that can be
> proved "safe"? If so, PeterZ's upcoming static_call could be used from
> the driver side.

Alexei, Jakub

Do you think it will be possible to avoid this overhead when mitigations are turned ON?
The other part of the overhead is going through the redirect path.

Can i assume that your silence as an indication that you are now okay with optional bypass
flag as long as it doesn't effect the normal XDP datapath. If so, i will respin and submit
the patches against the latest bpf-next

-Sridhar

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-31 22:38                                       ` Samudrala, Sridhar
@ 2019-10-31 23:15                                         ` Alexei Starovoitov
  2019-11-01  0:21                                         ` Jakub Kicinski
  1 sibling, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2019-10-31 23:15 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Björn Töpel, Björn Töpel, bpf,
	intel-wired-lan, Jakub Kicinski, Fijalkowski, Maciej, Karlsson,
	Magnus, Network Development, Toke Høiland-Jørgensen,
	Herbert, Tom

On Thu, Oct 31, 2019 at 3:38 PM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> Alexei, Jakub
>
> Do you think it will be possible to avoid this overhead when mitigations are turned ON?

yes

> The other part of the overhead is going through the redirect path.
>
> Can i assume that your silence as an indication that you are now okay with optional bypass
> flag as long as it doesn't effect the normal XDP datapath. If so, i will respin and submit
> the patches against the latest bpf-next

I'm still against it.
Looks like the only motivation for it is extra overhead due to retpolines.
imo it's not a good reason to introduce a bunch of extra code helping
single kernel feature.
We will have proper solution for indirect calls.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-10-31 22:38                                       ` Samudrala, Sridhar
  2019-10-31 23:15                                         ` Alexei Starovoitov
@ 2019-11-01  0:21                                         ` Jakub Kicinski
  2019-11-01 18:31                                           ` Samudrala, Sridhar
  2019-11-04  2:08                                           ` dan
  1 sibling, 2 replies; 42+ messages in thread
From: Jakub Kicinski @ 2019-11-01  0:21 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: bjorn.topel, alexei.starovoitov, bjorn.topel, bpf,
	intel-wired-lan, maciej.fijalkowski, magnus.karlsson, netdev,
	toke, tom.herbert, David Miller

On Thu, 31 Oct 2019 15:38:42 -0700, Samudrala, Sridhar wrote:
> Do you think it will be possible to avoid this overhead when mitigations are turned ON?
> The other part of the overhead is going through the redirect path.

Yes, you should help Maciej with the XDP bulking.

> Can i assume that your silence as an indication that you are now okay with optional bypass
> flag as long as it doesn't effect the normal XDP datapath. If so, i will respin and submit
> the patches against the latest bpf-next

This logic baffles me. I absolutely hate when people repost patches
after I nack them without even as much as mentioning my objections in
the cover letter.

My concern was that we want the applications to encode fast path logic
in BPF and load that into the kernel. So your patch works fundamentally
against that goal:

I worry that with the volume of patches that get posted each day
objections of a measly contributor like myself will get forgotten 
if I don't reply to each posting, yet replying each time makes me look
bullish or whatnot (apart from being an utter waste of time).

Ugh.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-11-01  0:21                                         ` Jakub Kicinski
@ 2019-11-01 18:31                                           ` Samudrala, Sridhar
  2019-11-04  2:08                                           ` dan
  1 sibling, 0 replies; 42+ messages in thread
From: Samudrala, Sridhar @ 2019-11-01 18:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bjorn.topel, alexei.starovoitov, bjorn.topel, bpf,
	intel-wired-lan, maciej.fijalkowski, magnus.karlsson, netdev,
	toke, tom.herbert, David Miller

On 10/31/2019 5:21 PM, Jakub Kicinski wrote:
> On Thu, 31 Oct 2019 15:38:42 -0700, Samudrala, Sridhar wrote:
>> Do you think it will be possible to avoid this overhead when mitigations are turned ON?
>> The other part of the overhead is going through the redirect path.
> 
> Yes, you should help Maciej with the XDP bulking.
> 
>> Can i assume that your silence as an indication that you are now okay with optional bypass
>> flag as long as it doesn't effect the normal XDP datapath. If so, i will respin and submit
>> the patches against the latest bpf-next
> 
> This logic baffles me. I absolutely hate when people repost patches
> after I nack them without even as much as mentioning my objections in
> the cover letter.

Sorry if you got the impression that i didn't take your feedback. I CCed you
and also included the kernel rxdrop data that you requested in the original
series.

> 
> My concern was that we want the applications to encode fast path logic
> in BPF and load that into the kernel. So your patch works fundamentally
> against that goal:

So looks like you are saying that the fundamental requirement is that all AF_XDP
packets need to go via a BPF program.

The reason i proposed direct receive is because of the overhead we are seeing
with going via BPF program for apps that want to receive all the packets on a
specific queue.

I agree that there is work going on to reduce this overhead with bulking and avoiding
the retpoline.
We can revisit after these optimizations get in and then see if it is still useful
to provide a direct receive option.

-Sridhar

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-wired-lan] FW: [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue
  2019-11-01  0:21                                         ` Jakub Kicinski
  2019-11-01 18:31                                           ` Samudrala, Sridhar
@ 2019-11-04  2:08                                           ` dan
  1 sibling, 0 replies; 42+ messages in thread
From: dan @ 2019-11-04  2:08 UTC (permalink / raw)
  To: Jakub Kicinski, Samudrala, Sridhar
  Cc: bjorn.topel, alexei.starovoitov, bjorn.topel, bpf,
	intel-wired-lan, maciej.fijalkowski, magnus.karlsson, netdev,
	toke, tom.herbert, David Miller

On Thu, 2019-10-31 at 17:21 -0700, Jakub Kicinski wrote:
> 
> My concern was that we want the applications to encode fast path
> logic
> in BPF and load that into the kernel. So your patch works
> fundamentally
> against that goal:

I'm only one AF_XDP user but my main goal is to get packets to
userspace as fast as possible. I don't (currently) need a BPF program
in the path at all. I suspect that many other people that look at
AF_XDP as a DPDK replacement have a similar view.


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, back to index

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08  6:16 [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Sridhar Samudrala
2019-10-08  6:16 ` [PATCH bpf-next 1/4] bpf: introduce bpf_get_prog_id and bpf_set_prog_id helper functions Sridhar Samudrala
2019-10-08  6:16 ` [PATCH bpf-next 2/4] xsk: allow AF_XDP sockets to receive packets directly from a queue Sridhar Samudrala
2019-10-08  6:58   ` Toke Høiland-Jørgensen
2019-10-08  8:47     ` Björn Töpel
2019-10-08  8:48       ` Björn Töpel
2019-10-08  9:04       ` Toke Høiland-Jørgensen
2019-10-08  8:05   ` Björn Töpel
2019-10-09 16:32     ` Samudrala, Sridhar
2019-10-09  1:20   ` Alexei Starovoitov
     [not found]     ` <3ED8E928C4210A4289A677D2FEB48235140134CE@fmsmsx111.amr.corp.intel.com>
2019-10-09 16:53       ` FW: " Samudrala, Sridhar
2019-10-09 17:17         ` Alexei Starovoitov
2019-10-09 19:12           ` Samudrala, Sridhar
2019-10-10  1:06             ` Alexei Starovoitov
2019-10-18 18:40               ` Samudrala, Sridhar
2019-10-18 19:22                 ` Toke Høiland-Jørgensen
2019-10-19  0:14                 ` Alexei Starovoitov
2019-10-19  0:45                   ` Samudrala, Sridhar
2019-10-19  2:25                     ` Alexei Starovoitov
2019-10-20 10:14                       ` Toke Høiland-Jørgensen
2019-10-20 17:12                         ` [Intel-wired-lan] " Björn Töpel
2019-10-21 20:10                           ` Samudrala, Sridhar
2019-10-21 22:34                             ` Alexei Starovoitov
2019-10-22 19:06                               ` Samudrala, Sridhar
2019-10-23 17:42                                 ` Alexei Starovoitov
2019-10-24 18:12                                   ` Samudrala, Sridhar
2019-10-25  7:42                                     ` Björn Töpel
2019-10-31 22:38                                       ` Samudrala, Sridhar
2019-10-31 23:15                                         ` Alexei Starovoitov
2019-11-01  0:21                                         ` Jakub Kicinski
2019-11-01 18:31                                           ` Samudrala, Sridhar
2019-11-04  2:08                                           ` dan
2019-10-25  9:07                                   ` Björn Töpel
2019-10-08  6:16 ` [PATCH bpf-next 3/4] libbpf: handle AF_XDP sockets created with XDP_DIRECT bind flag Sridhar Samudrala
2019-10-08  8:05   ` Björn Töpel
2019-10-08  6:16 ` [PATCH bpf-next 4/4] xdpsock: add an option to create AF_XDP sockets in XDP_DIRECT mode Sridhar Samudrala
2019-10-08  8:05   ` Björn Töpel
2019-10-08  8:05 ` [PATCH bpf-next 0/4] Enable direct receive on AF_XDP sockets Björn Töpel
2019-10-09 16:19   ` Samudrala, Sridhar
2019-10-09  0:49 ` Jakub Kicinski
2019-10-09  6:29   ` Samudrala, Sridhar
2019-10-09 16:53     ` Jakub Kicinski

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git