* [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes
@ 2018-06-04 11:57 Björn Töpel
  2018-06-04 11:57 ` [PATCH bpf-next 1/5] xsk: proper fill queue descriptor validation Björn Töpel
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Björn Töpel @ 2018-06-04 11:57 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan

From: Björn Töpel <bjorn.topel@intel.com>

An issue with the current AF_XDP uapi, raised by Mykyta Iziumtsev (see
https://www.spinics.net/lists/netdev/msg503664.html), is that it does
not support NICs that have a "type-writer" model in an efficient
way. In this model, a memory window is passed to the hardware and
multiple frames might be filled into that window, instead of just one
as in the current fixed frame-size model.

This patch set fixes two bugs in the current implementation and then
changes the uapi so that the type-writer model can be supported
efficiently by a possible future extension of AF_XDP.

These are the uapi changes in this patch set:

* Change the "u32 idx" in the descriptors to "u64 addr". The current
  idx-based format does NOT work for the type-writer model (as packets
  can start anywhere within a frame), whereas a relative address
  pointer (the u64 addr) works well for both models in the prototype
  code we have that supports both. We increased it from u32 to u64 to
  support umems larger than 4G. We have also removed the u16 offset,
  since with a u64 addr that information is already carried in the
  least significant bits of the address (see the conversion sketch
  after this list).

* We want to be able to use the "u8 padding[5]" field for something
  useful in the future (and once it is uapi we are not allowed to
  change its name), so we now simply call it "options" so it can be
  extended for various purposes later. It is a u32, as that is what
  is left of the 16-byte descriptor.

* We changed the name of frame_size in the UMEM_REG setsockopt to
  chunk_size since this naming also makes sense for the type-writer
  model.
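
A minimal C sketch of the first change: how an old-style (idx, offset)
reference maps onto the new u64 addr, and the chunk masking the kernel
applies to fill-ring entries. The helper names are ours, for
illustration only:

	#include <stdint.h>

	/* Old scheme: frame index plus offset within the frame. */
	static inline uint64_t idx_to_addr(uint32_t idx, uint16_t offset,
					   uint32_t frame_size)
	{
		return (uint64_t)idx * frame_size + offset;
	}

	/* New scheme: the kernel masks off the intra-chunk bits of a
	 * fill-ring addr, so for 2k chunks 2048, 2050 and 3000 all
	 * name the same chunk. chunk_size must be a power of two.
	 */
	static inline uint64_t addr_to_chunk(uint64_t addr,
					     uint32_t chunk_size)
	{
		return addr & ~((uint64_t)chunk_size - 1);
	}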

With these changes to the uapi, we believe the type-writer model can
be supported without having to resort to a new descriptor format. From
the uapi point of view, it could then be enabled by setting a flag at
bind time and providing a new flag bit in the options field of the
descriptor that signals to user space that all packets have been
written in a chunk. Or with a new chunk completion queue, as suggested
by Mykyta in his latest feedback mail on the list.
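
As a sketch only, user space could consume such a flag along the lines
of the snippet below. The flag name is purely hypothetical; no such
bit is defined by this set:

	/* XDP_DESC_OPT_CHUNK_DONE is an invented name for a possible
	 * future bit in xdp_desc.options, not part of this uapi.
	 */
	#define XDP_DESC_OPT_CHUNK_DONE (1U << 0)

	if (desc->options & XDP_DESC_OPT_CHUNK_DONE) {
		/* All packets for this chunk have been delivered, so
		 * the whole chunk can be recycled via the fill-ring.
		 */
	}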

We based this patch set on bpf-next commit bd3a08aaa9a3 ("bpf:
flowlabel in bpf_fib_lookup should be flowinfo").

The structure of the patch set is as follows:

Patches 1-2: Fixes two bugs in the current implementation.
Patches 3-4: Prepares the uapi for a "type-writer" model and modifies
             the sample application so that it works with the new
	     uapi.
Patch 5: Small performance improvement patch for the sample application.

Cheers: Magnus and Björn

Björn Töpel (4):
  xsk: proper fill queue descriptor validation
  xsk: proper Rx drop statistics update
  xsk: new descriptor addressing scheme
  samples/bpf: adapted to new uapi

Magnus Karlsson (1):
  samples/bpf: minor *_nb_free performance fix

 Documentation/networking/af_xdp.rst | 101 +++++++++++++++++++++---------------
 include/uapi/linux/if_xdp.h         |  12 ++---
 net/xdp/xdp_umem.c                  |  33 ++++++------
 net/xdp/xdp_umem.h                  |  27 +++-------
 net/xdp/xdp_umem_props.h            |   4 +-
 net/xdp/xsk.c                       |  41 ++++++++-------
 net/xdp/xsk_queue.c                 |   2 +-
 net/xdp/xsk_queue.h                 |  63 ++++++++--------------
 samples/bpf/xdpsock_user.c          |  92 +++++++++++++++-----------------
 9 files changed, 172 insertions(+), 203 deletions(-)

-- 
2.14.1

* [PATCH bpf-next 1/5] xsk: proper fill queue descriptor validation
  2018-06-04 11:57 [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Björn Töpel
@ 2018-06-04 11:57 ` Björn Töpel
  2018-06-04 11:57 ` [PATCH bpf-next 2/5] xsk: proper Rx drop statistics update Björn Töpel
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Björn Töpel @ 2018-06-04 11:57 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan

From: Björn Töpel <bjorn.topel@intel.com>

Previously the fill queue descriptor was not copied to kernel space
prior to validating it, making it possible for userland to change the
descriptor after the kernel had validated it.
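
Condensed, the difference between the racy and the fixed pattern looks
like this (simplified from the xskq_validate_id() change below):

	/* Before: validate the entry in place in the shared ring.
	 * Userland can rewrite ring->desc[idx] between this check and
	 * any later dereference of the returned pointer.
	 */
	if (xskq_is_valid_id(q, ring->desc[idx]))
		return &ring->desc[idx];

	/* After: snapshot the descriptor first, then validate and use
	 * only the private copy.
	 */
	*id = READ_ONCE(ring->desc[idx]);
	if (xskq_is_valid_id(q, *id))
		return id;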

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c       | 11 +++++------
 net/xdp/xsk_queue.h | 32 +++++++++-----------------------
 2 files changed, 14 insertions(+), 29 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cce0e4f8a536..43554eb56fe6 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,20 +41,19 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	u32 *id, len = xdp->data_end - xdp->data;
+	u32 id, len = xdp->data_end - xdp->data;
 	void *buffer;
-	int err = 0;
+	int err;
 
 	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
 		return -EINVAL;
 
-	id = xskq_peek_id(xs->umem->fq);
-	if (!id)
+	if (!xskq_peek_id(xs->umem->fq, &id))
 		return -ENOSPC;
 
-	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	buffer = xdp_umem_get_data_with_headroom(xs->umem, id);
 	memcpy(buffer, xdp->data, len);
-	err = xskq_produce_batch_desc(xs->rx, *id, len,
+	err = xskq_produce_batch_desc(xs->rx, id, len,
 				      xs->umem->frame_headroom);
 	if (!err)
 		xskq_discard_id(xs->umem->fq);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index cb8e5be35110..b5924e7aeb2b 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -85,14 +85,15 @@ static inline bool xskq_is_valid_id(struct xsk_queue *q, u32 idx)
 	return true;
 }
 
-static inline u32 *xskq_validate_id(struct xsk_queue *q)
+static inline u32 *xskq_validate_id(struct xsk_queue *q, u32 *id)
 {
 	while (q->cons_tail != q->cons_head) {
 		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 		unsigned int idx = q->cons_tail & q->ring_mask;
 
-		if (xskq_is_valid_id(q, ring->desc[idx]))
-			return &ring->desc[idx];
+		*id = READ_ONCE(ring->desc[idx]);
+		if (xskq_is_valid_id(q, *id))
+			return id;
 
 		q->cons_tail++;
 	}
@@ -100,28 +101,22 @@ static inline u32 *xskq_validate_id(struct xsk_queue *q)
 	return NULL;
 }
 
-static inline u32 *xskq_peek_id(struct xsk_queue *q)
+static inline u32 *xskq_peek_id(struct xsk_queue *q, u32 *id)
 {
-	struct xdp_umem_ring *ring;
-
 	if (q->cons_tail == q->cons_head) {
 		WRITE_ONCE(q->ring->consumer, q->cons_tail);
 		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
 
 		/* Order consumer and data */
 		smp_rmb();
-
-		return xskq_validate_id(q);
 	}
 
-	ring = (struct xdp_umem_ring *)q->ring;
-	return &ring->desc[q->cons_tail & q->ring_mask];
+	return xskq_validate_id(q, id);
 }
 
 static inline void xskq_discard_id(struct xsk_queue *q)
 {
 	q->cons_tail++;
-	(void)xskq_validate_id(q);
 }
 
 static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
@@ -174,11 +169,9 @@ static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
 		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
 		unsigned int idx = q->cons_tail & q->ring_mask;
 
-		if (xskq_is_valid_desc(q, &ring->desc[idx])) {
-			if (desc)
-				*desc = ring->desc[idx];
+		*desc = READ_ONCE(ring->desc[idx]);
+		if (xskq_is_valid_desc(q, desc))
 			return desc;
-		}
 
 		q->cons_tail++;
 	}
@@ -189,27 +182,20 @@ static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
 static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
 					      struct xdp_desc *desc)
 {
-	struct xdp_rxtx_ring *ring;
-
 	if (q->cons_tail == q->cons_head) {
 		WRITE_ONCE(q->ring->consumer, q->cons_tail);
 		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
 
 		/* Order consumer and data */
 		smp_rmb();
-
-		return xskq_validate_desc(q, desc);
 	}
 
-	ring = (struct xdp_rxtx_ring *)q->ring;
-	*desc = ring->desc[q->cons_tail & q->ring_mask];
-	return desc;
+	return xskq_validate_desc(q, desc);
 }
 
 static inline void xskq_discard_desc(struct xsk_queue *q)
 {
 	q->cons_tail++;
-	(void)xskq_validate_desc(q, NULL);
 }
 
 static inline int xskq_produce_batch_desc(struct xsk_queue *q,
-- 
2.14.1

* [PATCH bpf-next 2/5] xsk: proper Rx drop statistics update
  2018-06-04 11:57 [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Björn Töpel
  2018-06-04 11:57 ` [PATCH bpf-next 1/5] xsk: proper fill queue descriptor validation Björn Töpel
@ 2018-06-04 11:57 ` Björn Töpel
  2018-06-04 11:57 ` [PATCH bpf-next 3/5] xsk: new descriptor addressing scheme Björn Töpel
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Björn Töpel @ 2018-06-04 11:57 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan

From: Björn Töpel <bjorn.topel@intel.com>

Previously, rx_dropped could be updated incorrectly, e.g. if the XDP
program redirected the frame to a socket bound to a different queue
than the one the XDP program was executing on.
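
Concretely, the counter moves from the generic error paths of the
callers into __xsk_rcv(), after the device/queue check, so that only
frames actually destined for this socket are counted (condensed from
the diff below):

	/* Before: callers counted every error as a drop. */
	err = __xsk_rcv(xs, xdp);
	if (err)
		xs->rx_dropped++;	/* even for the -EINVAL case */

	/* After: __xsk_rcv() itself counts a drop, and only when the
	 * fill queue is empty or the Rx ring is full.
	 */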

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 43554eb56fe6..966307ce4b8e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -48,8 +48,10 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
 		return -EINVAL;
 
-	if (!xskq_peek_id(xs->umem->fq, &id))
+	if (!xskq_peek_id(xs->umem->fq, &id)) {
+		xs->rx_dropped++;
 		return -ENOSPC;
+	}
 
 	buffer = xdp_umem_get_data_with_headroom(xs->umem, id);
 	memcpy(buffer, xdp->data, len);
@@ -57,6 +59,8 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 				      xs->umem->frame_headroom);
 	if (!err)
 		xskq_discard_id(xs->umem->fq);
+	else
+		xs->rx_dropped++;
 
 	return err;
 }
@@ -68,8 +72,6 @@ int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	err = __xsk_rcv(xs, xdp);
 	if (likely(!err))
 		xdp_return_buff(xdp);
-	else
-		xs->rx_dropped++;
 
 	return err;
 }
@@ -87,8 +89,6 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	err = __xsk_rcv(xs, xdp);
 	if (!err)
 		xsk_flush(xs);
-	else
-		xs->rx_dropped++;
 
 	return err;
 }
-- 
2.14.1

* [PATCH bpf-next 3/5] xsk: new descriptor addressing scheme
  2018-06-04 11:57 [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Björn Töpel
  2018-06-04 11:57 ` [PATCH bpf-next 1/5] xsk: proper fill queue descriptor validation Björn Töpel
  2018-06-04 11:57 ` [PATCH bpf-next 2/5] xsk: proper Rx drop statistics update Björn Töpel
@ 2018-06-04 11:57 ` Björn Töpel
  2018-06-04 11:57 ` [PATCH bpf-next 4/5] samples/bpf: adapted to new uapi Björn Töpel
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Björn Töpel @ 2018-06-04 11:57 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan

From: Björn Töpel <bjorn.topel@intel.com>

Currently, AF_XDP only supports a fixed frame-size memory scheme where
each frame is referenced via an index (idx). A user passes the frame
index to the kernel, and the kernel acts upon the data. Some NICs,
however, do not have a fixed frame-size model; instead, they have a
model where a memory window is passed to the hardware and multiple
frames are filled into that window (referred to as the "type-writer"
model).

By changing the descriptor format from the current frame index
addressing scheme, AF_XDP can in the future be extended to support
these kinds of NICs.

In the index-based model, an idx refers to a frame of size
frame_size. Addressing a frame in the UMEM is done by offsetting the
UMEM starting address by a global offset, idx * frame_size + offset.
Communication via the fill- and completion-rings is done by means of
idx.

In this commit, the idx is removed in favor of an address (addr),
which is a relative address ranging over the UMEM. Converting an
idx-based address to the new addr is simply: addr = idx * frame_size +
offset.

We also stop referring to the UMEM unit as a "frame"; instead it is
simply called a chunk.

To transfer ownership of a chunk to the kernel, the addr of the chunk
is passed in the fill-ring. Note that the kernel will mask the addr to
make it chunk-aligned, so there is no need for userspace to do
that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or
3000 to the fill-ring will refer to the same chunk.
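
Worked example for a 2k chunk size: the chunk mask is ~(2048 - 1) =
~0x7ff, so 2048 (0x800), 2050 (0x802) and 3000 (0xbb8) all mask down
to 0x800, i.e. the same chunk. The same mask is used to reject Tx
descriptors whose payload would cross a chunk boundary (condensed from
the xsk_queue.h change below):

	/* q->umem_props.chunk_mask == ~((u64)chunk_size - 1) */
	if (((d->addr + d->len) & q->umem_props.chunk_mask) !=
	    (d->addr & q->umem_props.chunk_mask))
		q->invalid_descs++;	/* reject: crosses a chunk */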

On the completion-ring, the addr will match that of the Tx descriptor,
passed to the kernel.

Changing the descriptor format to use chunks/addr will allow for
future changes to move to a type-writer-based model, where multiple
frames can reside in one chunk. In this model, passing one single
chunk into the fill-ring would potentially result in multiple Rx
descriptors.

This commit changes the uapi of AF_XDP sockets, and updates the
documentation.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 Documentation/networking/af_xdp.rst | 101 +++++++++++++++++++++---------------
 include/uapi/linux/if_xdp.h         |  12 ++---
 net/xdp/xdp_umem.c                  |  33 ++++++------
 net/xdp/xdp_umem.h                  |  27 +++-------
 net/xdp/xdp_umem_props.h            |   4 +-
 net/xdp/xsk.c                       |  30 ++++++-----
 net/xdp/xsk_queue.c                 |   2 +-
 net/xdp/xsk_queue.h                 |  43 +++++++--------
 8 files changed, 123 insertions(+), 129 deletions(-)

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 91928d9ee4bf..ff929cfab4f4 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -12,7 +12,7 @@ packet processing.
 
 This document assumes that the reader is familiar with BPF and XDP. If
 not, the Cilium project has an excellent reference guide at
-http://cilium.readthedocs.io/en/doc-1.0/bpf/.
+http://cilium.readthedocs.io/en/latest/bpf/.
 
 Using the XDP_REDIRECT action from an XDP program, the program can
 redirect ingress frames to other XDP enabled netdevs, using the
@@ -33,22 +33,22 @@ for a while due to a possible retransmit, the descriptor that points
 to that packet can be changed to point to another and reused right
 away. This again avoids copying data.
 
-The UMEM consists of a number of equally size frames and each frame
-has a unique frame id. A descriptor in one of the rings references a
-frame by referencing its frame id. The user space allocates memory for
-this UMEM using whatever means it feels is most appropriate (malloc,
-mmap, huge pages, etc). This memory area is then registered with the
-kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two
-rings: the FILL ring and the COMPLETION ring. The fill ring is used by
-the application to send down frame ids for the kernel to fill in with
-RX packet data. References to these frames will then appear in the RX
-ring once each packet has been received. The completion ring, on the
-other hand, contains frame ids that the kernel has transmitted
-completely and can now be used again by user space, for either TX or
-RX. Thus, the frame ids appearing in the completion ring are ids that
-were previously transmitted using the TX ring. In summary, the RX and
-FILL rings are used for the RX path and the TX and COMPLETION rings
-are used for the TX path.
+The UMEM consists of a number of equally sized chunks. A descriptor in
+one of the rings references a frame by referencing its addr. The addr
+is simply an offset within the entire UMEM region. The user space
+allocates memory for this UMEM using whatever means it feels is most
+appropriate (malloc, mmap, huge pages, etc). This memory area is then
+registered with the kernel using the new setsockopt XDP_UMEM_REG. The
+UMEM also has two rings: the FILL ring and the COMPLETION ring. The
+fill ring is used by the application to send down addr for the kernel
+to fill in with RX packet data. References to these frames will then
+appear in the RX ring once each packet has been received. The
+completion ring, on the other hand, contains frame addr that the
+kernel has transmitted completely and can now be used again by user
+space, for either TX or RX. Thus, the frame addrs appearing in the
+completion ring are addrs that were previously transmitted using the
+TX ring. In summary, the RX and FILL rings are used for the RX path
+and the TX and COMPLETION rings are used for the TX path.
 
 The socket is then finally bound with a bind() call to a device and a
 specific queue id on that device, and it is not until bind is
@@ -59,13 +59,13 @@ wants to do this, it simply skips the registration of the UMEM and its
 corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
 call and submits the XSK of the process it would like to share UMEM
 with as well as its own newly created XSK socket. The new process will
-then receive frame id references in its own RX ring that point to this
-shared UMEM. Note that since the ring structures are single-consumer /
-single-producer (for performance reasons), the new process has to
-create its own socket with associated RX and TX rings, since it cannot
-share this with the other process. This is also the reason that there
-is only one set of FILL and COMPLETION rings per UMEM. It is the
-responsibility of a single process to handle the UMEM.
+then receive frame addr references in its own RX ring that point to
+this shared UMEM. Note that since the ring structures are
+single-consumer / single-producer (for performance reasons), the new
+process has to create its own socket with associated RX and TX rings,
+since it cannot share this with the other process. This is also the
+reason that there is only one set of FILL and COMPLETION rings per
+UMEM. It is the responsibility of a single process to handle the UMEM.
 
 How is then packets distributed from an XDP program to the XSKs? There
 is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
@@ -102,10 +102,10 @@ UMEM
 
 UMEM is a region of virtual contiguous memory, divided into
 equal-sized frames. An UMEM is associated to a netdev and a specific
-queue id of that netdev. It is created and configured (frame size,
-frame headroom, start address and size) by using the XDP_UMEM_REG
-setsockopt system call. A UMEM is bound to a netdev and queue id, via
-the bind() system call.
+queue id of that netdev. It is created and configured (chunk size,
+headroom, start address and size) by using the XDP_UMEM_REG setsockopt
+system call. A UMEM is bound to a netdev and queue id, via the bind()
+system call.
 
 An AF_XDP is socket linked to a single UMEM, but one UMEM can have
 multiple AF_XDP sockets. To share an UMEM created via one socket A,
@@ -147,13 +147,17 @@ UMEM Fill Ring
 ~~~~~~~~~~~~~~
 
 The Fill ring is used to transfer ownership of UMEM frames from
-user-space to kernel-space. The UMEM indicies are passed in the
-ring. As an example, if the UMEM is 64k and each frame is 4k, then the
-UMEM has 16 frames and can pass indicies between 0 and 15.
+user-space to kernel-space. The UMEM addrs are passed in the ring. As
+an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
+16 chunks and can pass addrs between 0 and 64k.
 
 Frames passed to the kernel are used for the ingress path (RX rings).
 
-The user application produces UMEM indicies to this ring.
+The user application produces UMEM addrs to this ring. Note that the
+kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
+log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
+and 3000 refers to the same chunk.
+
 
 UMEM Completetion Ring
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -165,16 +169,15 @@ used.
 Frames passed from the kernel to user-space are frames that has been
 sent (TX ring) and can be used by user-space again.
 
-The user application consumes UMEM indicies from this ring.
+The user application consumes UMEM addrs from this ring.
 
 
 RX Ring
 ~~~~~~~
 
 The RX ring is the receiving side of a socket. Each entry in the ring
-is a struct xdp_desc descriptor. The descriptor contains UMEM index
-(idx), the length of the data (len), the offset into the frame
-(offset).
+is a struct xdp_desc descriptor. The descriptor contains UMEM offset
+(addr) and the length of the data (len).
 
 If no frames have been passed to kernel via the Fill ring, no
 descriptors will (or can) appear on the RX ring.
@@ -221,38 +224,50 @@ side is xdpsock_user.c and the XDP side xdpsock_kern.c.
 
 Naive ring dequeue and enqueue could look like this::
 
+    // struct xdp_rxtx_ring {
+    // 	__u32 *producer;
+    // 	__u32 *consumer;
+    // 	struct xdp_desc *desc;
+    // };
+
+    // struct xdp_umem_ring {
+    // 	__u32 *producer;
+    // 	__u32 *consumer;
+    // 	__u64 *desc;
+    // };
+
     // typedef struct xdp_rxtx_ring RING;
     // typedef struct xdp_umem_ring RING;
 
     // typedef struct xdp_desc RING_TYPE;
-    // typedef __u32 RING_TYPE;
+    // typedef __u64 RING_TYPE;
 
     int dequeue_one(RING *ring, RING_TYPE *item)
     {
-        __u32 entries = ring->ptrs.producer - ring->ptrs.consumer;
+        __u32 entries = *ring->producer - *ring->consumer;
 
         if (entries == 0)
             return -1;
 
         // read-barrier!
 
-        *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)];
-        ring->ptrs.consumer++;
+        *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
+        (*ring->consumer)++;
         return 0;
     }
 
     int enqueue_one(RING *ring, const RING_TYPE *item)
     {
-        u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer);
+        u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);
 
         if (free_entries == 0)
             return -1;
 
-        ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item;
+        ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;
 
         // write-barrier!
 
-        ring->ptrs.producer++;
+        (*ring->producer)++;
         return 0;
     }
 
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 4737cfe222f5..e411d6f9ac65 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -48,8 +48,8 @@ struct xdp_mmap_offsets {
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
 	__u64 len; /* Length of packet data area */
-	__u32 frame_size; /* Frame size */
-	__u32 frame_headroom; /* Frame head room */
+	__u32 chunk_size;
+	__u32 headroom;
 };
 
 struct xdp_statistics {
@@ -66,13 +66,11 @@ struct xdp_statistics {
 
 /* Rx/Tx descriptor */
 struct xdp_desc {
-	__u32 idx;
+	__u64 addr;
 	__u32 len;
-	__u16 offset;
-	__u8 flags;
-	__u8 padding[5];
+	__u32 options;
 };
 
-/* UMEM descriptor is __u32 */
+/* UMEM descriptor is __u64 */
 
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 87998818116f..9ad791ff4739 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -14,7 +14,7 @@
 
 #include "xdp_umem.h"
 
-#define XDP_UMEM_MIN_FRAME_SIZE 2048
+#define XDP_UMEM_MIN_CHUNK_SIZE 2048
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
@@ -151,12 +151,12 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)
 
 static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 {
-	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
+	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
+	unsigned int chunks, chunks_per_page;
 	u64 addr = mr->addr, size = mr->len;
-	unsigned int nframes, nfpp;
 	int size_chk, err;
 
-	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+	if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
 		/* Strictly speaking we could support this, if:
 		 * - huge pages, or*
 		 * - using an IOMMU, or
@@ -166,7 +166,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 		return -EINVAL;
 	}
 
-	if (!is_power_of_2(frame_size))
+	if (!is_power_of_2(chunk_size))
 		return -EINVAL;
 
 	if (!PAGE_ALIGNED(addr)) {
@@ -179,33 +179,30 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	if ((addr + size) < addr)
 		return -EINVAL;
 
-	nframes = (unsigned int)div_u64(size, frame_size);
-	if (nframes == 0 || nframes > UINT_MAX)
+	chunks = (unsigned int)div_u64(size, chunk_size);
+	if (chunks == 0)
 		return -EINVAL;
 
-	nfpp = PAGE_SIZE / frame_size;
-	if (nframes < nfpp || nframes % nfpp)
+	chunks_per_page = PAGE_SIZE / chunk_size;
+	if (chunks < chunks_per_page || chunks % chunks_per_page)
 		return -EINVAL;
 
-	frame_headroom = ALIGN(frame_headroom, 64);
+	headroom = ALIGN(headroom, 64);
 
-	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
+	size_chk = chunk_size - headroom - XDP_PACKET_HEADROOM;
 	if (size_chk < 0)
 		return -EINVAL;
 
 	umem->pid = get_task_pid(current, PIDTYPE_PID);
-	umem->size = (size_t)size;
 	umem->address = (unsigned long)addr;
-	umem->props.frame_size = frame_size;
-	umem->props.nframes = nframes;
-	umem->frame_headroom = frame_headroom;
+	umem->props.chunk_mask = ~((u64)chunk_size - 1);
+	umem->props.size = size;
+	umem->headroom = headroom;
+	umem->chunk_size_nohr = chunk_size - headroom;
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;
 
-	umem->frame_size_log2 = ilog2(frame_size);
-	umem->nfpp_mask = nfpp - 1;
-	umem->nfpplog2 = ilog2(nfpp);
 	refcount_set(&umem->users, 1);
 
 	err = xdp_umem_account_pages(umem);
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 0881cf456230..aeadd1bcb72d 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -18,35 +18,20 @@ struct xdp_umem {
 	struct xsk_queue *cq;
 	struct page **pgs;
 	struct xdp_umem_props props;
-	u32 npgs;
-	u32 frame_headroom;
-	u32 nfpp_mask;
-	u32 nfpplog2;
-	u32 frame_size_log2;
+	u32 headroom;
+	u32 chunk_size_nohr;
 	struct user_struct *user;
 	struct pid *pid;
 	unsigned long address;
-	size_t size;
 	refcount_t users;
 	struct work_struct work;
+	u32 npgs;
 };
 
-static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
-{
-	u64 pg, off;
-	char *data;
-
-	pg = idx >> umem->nfpplog2;
-	off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
-
-	data = page_address(umem->pgs[pg]);
-	return data + off;
-}
-
-static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
-						    u32 idx)
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
-	return xdp_umem_get_data(umem, idx) + umem->frame_headroom;
+	return page_address(umem->pgs[addr >> PAGE_SHIFT]) +
+		(addr & (PAGE_SIZE - 1));
 }
 
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
index 2cf8ec485fd2..40eab10dfc49 100644
--- a/net/xdp/xdp_umem_props.h
+++ b/net/xdp/xdp_umem_props.h
@@ -7,8 +7,8 @@
 #define XDP_UMEM_PROPS_H_
 
 struct xdp_umem_props {
-	u32 frame_size;
-	u32 nframes;
+	u64 chunk_mask;
+	u64 size;
 };
 
 #endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 966307ce4b8e..4688c750df1d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,24 +41,27 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	u32 id, len = xdp->data_end - xdp->data;
+	u32 len = xdp->data_end - xdp->data;
 	void *buffer;
+	u64 addr;
 	int err;
 
 	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
 		return -EINVAL;
 
-	if (!xskq_peek_id(xs->umem->fq, &id)) {
+	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
+	    len > xs->umem->chunk_size_nohr) {
 		xs->rx_dropped++;
 		return -ENOSPC;
 	}
 
-	buffer = xdp_umem_get_data_with_headroom(xs->umem, id);
+	addr += xs->umem->headroom;
+
+	buffer = xdp_umem_get_data(xs->umem, addr);
 	memcpy(buffer, xdp->data, len);
-	err = xskq_produce_batch_desc(xs->rx, id, len,
-				      xs->umem->frame_headroom);
+	err = xskq_produce_batch_desc(xs->rx, addr, len);
 	if (!err)
-		xskq_discard_id(xs->umem->fq);
+		xskq_discard_addr(xs->umem->fq);
 	else
 		xs->rx_dropped++;
 
@@ -95,10 +98,10 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
-	u32 id = (u32)(long)skb_shinfo(skb)->destructor_arg;
+	u64 addr = (u64)(long)skb_shinfo(skb)->destructor_arg;
 	struct xdp_sock *xs = xdp_sk(skb->sk);
 
-	WARN_ON_ONCE(xskq_produce_id(xs->umem->cq, id));
+	WARN_ON_ONCE(xskq_produce_addr(xs->umem->cq, addr));
 
 	sock_wfree(skb);
 }
@@ -123,14 +126,15 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 	while (xskq_peek_desc(xs->tx, &desc)) {
 		char *buffer;
-		u32 id, len;
+		u64 addr;
+		u32 len;
 
 		if (max_batch-- == 0) {
 			err = -EAGAIN;
 			goto out;
 		}
 
-		if (xskq_reserve_id(xs->umem->cq)) {
+		if (xskq_reserve_addr(xs->umem->cq)) {
 			err = -EAGAIN;
 			goto out;
 		}
@@ -153,8 +157,8 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 		}
 
 		skb_put(skb, len);
-		id = desc.idx;
-		buffer = xdp_umem_get_data(xs->umem, id) + desc.offset;
+		addr = desc.addr;
+		buffer = xdp_umem_get_data(xs->umem, addr);
 		err = skb_store_bits(skb, 0, buffer, len);
 		if (unlikely(err)) {
 			kfree_skb(skb);
@@ -164,7 +168,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 		skb->dev = xs->dev;
 		skb->priority = sk->sk_priority;
 		skb->mark = sk->sk_mark;
-		skb_shinfo(skb)->destructor_arg = (void *)(long)id;
+		skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
 		skb->destructor = xsk_destruct_skb;
 
 		err = dev_direct_xmit(skb, xs->queue_id);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index ebe85e59507e..6c32e92e98fc 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -17,7 +17,7 @@ void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props)
 
 static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 {
-	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
+	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u64);
 }
 
 static u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index b5924e7aeb2b..337e5ad3b10e 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -27,7 +27,7 @@ struct xdp_rxtx_ring {
 /* Used for the fill and completion queues for buffers */
 struct xdp_umem_ring {
 	struct xdp_ring ptrs;
-	u32 desc[0] ____cacheline_aligned_in_smp;
+	u64 desc[0] ____cacheline_aligned_in_smp;
 };
 
 struct xsk_queue {
@@ -76,24 +76,25 @@ static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
 
 /* UMEM queue */
 
-static inline bool xskq_is_valid_id(struct xsk_queue *q, u32 idx)
+static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
 {
-	if (unlikely(idx >= q->umem_props.nframes)) {
+	if (addr >= q->umem_props.size) {
 		q->invalid_descs++;
 		return false;
 	}
+
 	return true;
 }
 
-static inline u32 *xskq_validate_id(struct xsk_queue *q, u32 *id)
+static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
 {
 	while (q->cons_tail != q->cons_head) {
 		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 		unsigned int idx = q->cons_tail & q->ring_mask;
 
-		*id = READ_ONCE(ring->desc[idx]);
-		if (xskq_is_valid_id(q, *id))
-			return id;
+		*addr = READ_ONCE(ring->desc[idx]) & q->umem_props.chunk_mask;
+		if (xskq_is_valid_addr(q, *addr))
+			return addr;
 
 		q->cons_tail++;
 	}
@@ -101,7 +102,7 @@ static inline u32 *xskq_validate_id(struct xsk_queue *q, u32 *id)
 	return NULL;
 }
 
-static inline u32 *xskq_peek_id(struct xsk_queue *q, u32 *id)
+static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
 {
 	if (q->cons_tail == q->cons_head) {
 		WRITE_ONCE(q->ring->consumer, q->cons_tail);
@@ -111,19 +112,19 @@ static inline u32 *xskq_peek_id(struct xsk_queue *q, u32 *id)
 		smp_rmb();
 	}
 
-	return xskq_validate_id(q, id);
+	return xskq_validate_addr(q, addr);
 }
 
-static inline void xskq_discard_id(struct xsk_queue *q)
+static inline void xskq_discard_addr(struct xsk_queue *q)
 {
 	q->cons_tail++;
 }
 
-static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
+static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
 {
 	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 
-	ring->desc[q->prod_tail++ & q->ring_mask] = id;
+	ring->desc[q->prod_tail++ & q->ring_mask] = addr;
 
 	/* Order producer and data */
 	smp_wmb();
@@ -132,7 +133,7 @@ static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
 	return 0;
 }
 
-static inline int xskq_reserve_id(struct xsk_queue *q)
+static inline int xskq_reserve_addr(struct xsk_queue *q)
 {
 	if (xskq_nb_free(q, q->prod_head, 1) == 0)
 		return -ENOSPC;
@@ -145,16 +146,11 @@ static inline int xskq_reserve_id(struct xsk_queue *q)
 
 static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
 {
-	u32 buff_len;
-
-	if (unlikely(d->idx >= q->umem_props.nframes)) {
-		q->invalid_descs++;
+	if (!xskq_is_valid_addr(q, d->addr))
 		return false;
-	}
 
-	buff_len = q->umem_props.frame_size;
-	if (unlikely(d->len > buff_len || d->len == 0 ||
-		     d->offset > buff_len || d->offset + d->len > buff_len)) {
+	if (((d->addr + d->len) & q->umem_props.chunk_mask) !=
+	    (d->addr & q->umem_props.chunk_mask)) {
 		q->invalid_descs++;
 		return false;
 	}
@@ -199,7 +195,7 @@ static inline void xskq_discard_desc(struct xsk_queue *q)
 }
 
 static inline int xskq_produce_batch_desc(struct xsk_queue *q,
-					  u32 id, u32 len, u16 offset)
+					  u64 addr, u32 len)
 {
 	struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
 	unsigned int idx;
@@ -208,9 +204,8 @@ static inline int xskq_produce_batch_desc(struct xsk_queue *q,
 		return -ENOSPC;
 
 	idx = (q->prod_head++) & q->ring_mask;
-	ring->desc[idx].idx = id;
+	ring->desc[idx].addr = addr;
 	ring->desc[idx].len = len;
-	ring->desc[idx].offset = offset;
 
 	return 0;
 }
-- 
2.14.1

* [PATCH bpf-next 4/5] samples/bpf: adapted to new uapi
  2018-06-04 11:57 [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Björn Töpel
                   ` (2 preceding siblings ...)
  2018-06-04 11:57 ` [PATCH bpf-next 3/5] xsk: new descriptor addressing scheme Björn Töpel
@ 2018-06-04 11:57 ` Björn Töpel
  2018-06-04 11:57 ` [PATCH bpf-next 5/5] samples/bpf: minor *_nb_free performance fix Björn Töpel
  2018-06-04 16:24 ` [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Alexei Starovoitov
  5 siblings, 0 replies; 8+ messages in thread
From: Björn Töpel @ 2018-06-04 11:57 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan

From: Björn Töpel <bjorn.topel@intel.com>

Here, the xdpsock sample application is adjusted to the new descriptor
format.
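
The core of the adjustment, condensed from the diff below, is that the
UMEM is no longer treated as a two-dimensional array of fixed-size
frames, but as a flat byte area indexed directly by the descriptor
addr:

	/* Before: frames array indexed by (idx, offset). */
	char (*frames)[FRAME_SIZE];
	pkt = &umem->frames[desc.idx][desc.offset];

	/* After: flat buffer; addr already encodes the offset. */
	char *frames;
	pkt = &umem->frames[desc.addr];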

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 samples/bpf/xdpsock_user.c | 84 ++++++++++++++++++++--------------------------
 1 file changed, 36 insertions(+), 48 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index e379eac034ac..b71a342b9082 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -46,6 +46,7 @@
 
 #define NUM_FRAMES 131072
 #define FRAME_HEADROOM 0
+#define FRAME_SHIFT 11
 #define FRAME_SIZE 2048
 #define NUM_DESCS 1024
 #define BATCH_SIZE 16
@@ -55,6 +56,7 @@
 
 #define DEBUG_HEXDUMP 0
 
+typedef __u64 u64;
 typedef __u32 u32;
 
 static unsigned long prev_time;
@@ -81,12 +83,12 @@ struct xdp_umem_uqueue {
 	u32 size;
 	u32 *producer;
 	u32 *consumer;
-	u32 *ring;
+	u64 *ring;
 	void *map;
 };
 
 struct xdp_umem {
-	char (*frames)[FRAME_SIZE];
+	char *frames;
 	struct xdp_umem_uqueue fq;
 	struct xdp_umem_uqueue cq;
 	int fd;
@@ -214,7 +216,7 @@ static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
 	for (i = 0; i < nb; i++) {
 		u32 idx = fq->cached_prod++ & fq->mask;
 
-		fq->ring[idx] = d[i].idx;
+		fq->ring[idx] = d[i].addr;
 	}
 
 	u_smp_wmb();
@@ -224,7 +226,7 @@ static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
 	return 0;
 }
 
-static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
+static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u64 *d,
 				      size_t nb)
 {
 	u32 i;
@@ -246,7 +248,7 @@ static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
 }
 
 static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
-					       u32 *d, size_t nb)
+					       u64 *d, size_t nb)
 {
 	u32 idx, i, entries = umem_nb_avail(cq, nb);
 
@@ -266,10 +268,9 @@ static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
 	return entries;
 }
 
-static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
+static inline void *xq_get_data(struct xdpsock *xsk, u64 addr)
 {
-	lassert(idx < NUM_FRAMES);
-	return &xsk->umem->frames[idx][off];
+	return &xsk->umem->frames[addr];
 }
 
 static inline int xq_enq(struct xdp_uqueue *uq,
@@ -285,9 +286,8 @@ static inline int xq_enq(struct xdp_uqueue *uq,
 	for (i = 0; i < ndescs; i++) {
 		u32 idx = uq->cached_prod++ & uq->mask;
 
-		r[idx].idx = descs[i].idx;
+		r[idx].addr = descs[i].addr;
 		r[idx].len = descs[i].len;
-		r[idx].offset = descs[i].offset;
 	}
 
 	u_smp_wmb();
@@ -297,7 +297,7 @@ static inline int xq_enq(struct xdp_uqueue *uq,
 }
 
 static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
-				 __u32 idx, unsigned int ndescs)
+				 unsigned int id, unsigned int ndescs)
 {
 	struct xdp_desc *r = uq->ring;
 	unsigned int i;
@@ -308,9 +308,8 @@ static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
 	for (i = 0; i < ndescs; i++) {
 		u32 idx = uq->cached_prod++ & uq->mask;
 
-		r[idx].idx	= idx + i;
+		r[idx].addr	= (id + i) << FRAME_SHIFT;
 		r[idx].len	= sizeof(pkt_data) - 1;
-		r[idx].offset	= 0;
 	}
 
 	u_smp_wmb();
@@ -357,17 +356,21 @@ static void swap_mac_addresses(void *data)
 	*dst_addr = tmp;
 }
 
-#if DEBUG_HEXDUMP
-static void hex_dump(void *pkt, size_t length, const char *prefix)
+static void hex_dump(void *pkt, size_t length, u64 addr)
 {
-	int i = 0;
 	const unsigned char *address = (unsigned char *)pkt;
 	const unsigned char *line = address;
 	size_t line_size = 32;
 	unsigned char c;
+	char buf[32];
+	int i = 0;
 
+	if (!DEBUG_HEXDUMP)
+		return;
+
+	sprintf(buf, "addr=%llu", addr);
 	printf("length = %zu\n", length);
-	printf("%s | ", prefix);
+	printf("%s | ", buf);
 	while (length-- > 0) {
 		printf("%02X ", *address++);
 		if (!(++i % line_size) || (length == 0 && i % line_size)) {
@@ -382,12 +385,11 @@ static void hex_dump(void *pkt, size_t length, const char *prefix)
 			}
 			printf("\n");
 			if (length > 0)
-				printf("%s | ", prefix);
+				printf("%s | ", buf);
 		}
 	}
 	printf("\n");
 }
-#endif
 
 static size_t gen_eth_frame(char *frame)
 {
@@ -412,8 +414,8 @@ static struct xdp_umem *xdp_umem_configure(int sfd)
 
 	mr.addr = (__u64)bufs;
 	mr.len = NUM_FRAMES * FRAME_SIZE;
-	mr.frame_size = FRAME_SIZE;
-	mr.frame_headroom = FRAME_HEADROOM;
+	mr.chunk_size = FRAME_SIZE;
+	mr.headroom = FRAME_HEADROOM;
 
 	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
 	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
@@ -426,7 +428,7 @@ static struct xdp_umem *xdp_umem_configure(int sfd)
 			   &optlen) == 0);
 
 	umem->fq.map = mmap(0, off.fr.desc +
-			    FQ_NUM_DESCS * sizeof(u32),
+			    FQ_NUM_DESCS * sizeof(u64),
 			    PROT_READ | PROT_WRITE,
 			    MAP_SHARED | MAP_POPULATE, sfd,
 			    XDP_UMEM_PGOFF_FILL_RING);
@@ -439,7 +441,7 @@ static struct xdp_umem *xdp_umem_configure(int sfd)
 	umem->fq.ring = umem->fq.map + off.fr.desc;
 
 	umem->cq.map = mmap(0, off.cr.desc +
-			     CQ_NUM_DESCS * sizeof(u32),
+			     CQ_NUM_DESCS * sizeof(u64),
 			     PROT_READ | PROT_WRITE,
 			     MAP_SHARED | MAP_POPULATE, sfd,
 			     XDP_UMEM_PGOFF_COMPLETION_RING);
@@ -451,14 +453,14 @@ static struct xdp_umem *xdp_umem_configure(int sfd)
 	umem->cq.consumer = umem->cq.map + off.cr.consumer;
 	umem->cq.ring = umem->cq.map + off.cr.desc;
 
-	umem->frames = (char (*)[FRAME_SIZE])bufs;
+	umem->frames = bufs;
 	umem->fd = sfd;
 
 	if (opt_bench == BENCH_TXONLY) {
 		int i;
 
-		for (i = 0; i < NUM_FRAMES; i++)
-			(void)gen_eth_frame(&umem->frames[i][0]);
+		for (i = 0; i < NUM_FRAMES * FRAME_SIZE; i += FRAME_SIZE)
+			(void)gen_eth_frame(&umem->frames[i]);
 	}
 
 	return umem;
@@ -472,7 +474,7 @@ static struct xdpsock *xsk_configure(struct xdp_umem *umem)
 	struct xdpsock *xsk;
 	bool shared = true;
 	socklen_t optlen;
-	u32 i;
+	u64 i;
 
 	sfd = socket(PF_XDP, SOCK_RAW, 0);
 	lassert(sfd >= 0);
@@ -508,7 +510,7 @@ static struct xdpsock *xsk_configure(struct xdp_umem *umem)
 	lassert(xsk->rx.map != MAP_FAILED);
 
 	if (!shared) {
-		for (i = 0; i < NUM_DESCS / 2; i++)
+		for (i = 0; i < NUM_DESCS * FRAME_SIZE; i += FRAME_SIZE)
 			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
 				== 0);
 	}
@@ -727,7 +729,7 @@ static void kick_tx(int fd)
 
 static inline void complete_tx_l2fwd(struct xdpsock *xsk)
 {
-	u32 descs[BATCH_SIZE];
+	u64 descs[BATCH_SIZE];
 	unsigned int rcvd;
 	size_t ndescs;
 
@@ -749,7 +751,7 @@ static inline void complete_tx_l2fwd(struct xdpsock *xsk)
 
 static inline void complete_tx_only(struct xdpsock *xsk)
 {
-	u32 descs[BATCH_SIZE];
+	u64 descs[BATCH_SIZE];
 	unsigned int rcvd;
 
 	if (!xsk->outstanding_tx)
@@ -774,17 +776,9 @@ static void rx_drop(struct xdpsock *xsk)
 		return;
 
 	for (i = 0; i < rcvd; i++) {
-		u32 idx = descs[i].idx;
-
-		lassert(idx < NUM_FRAMES);
-#if DEBUG_HEXDUMP
-		char *pkt;
-		char buf[32];
+		char *pkt = xq_get_data(xsk, descs[i].addr);
 
-		pkt = xq_get_data(xsk, idx, descs[i].offset);
-		sprintf(buf, "idx=%d", idx);
-		hex_dump(pkt, descs[i].len, buf);
-#endif
+		hex_dump(pkt, descs[i].len, descs[i].addr);
 	}
 
 	xsk->rx_npkts += rcvd;
@@ -867,17 +861,11 @@ static void l2fwd(struct xdpsock *xsk)
 		}
 
 		for (i = 0; i < rcvd; i++) {
-			char *pkt = xq_get_data(xsk, descs[i].idx,
-						descs[i].offset);
+			char *pkt = xq_get_data(xsk, descs[i].addr);
 
 			swap_mac_addresses(pkt);
-#if DEBUG_HEXDUMP
-			char buf[32];
-			u32 idx = descs[i].idx;
 
-			sprintf(buf, "idx=%d", idx);
-			hex_dump(pkt, descs[i].len, buf);
-#endif
+			hex_dump(pkt, descs[i].len, descs[i].addr);
 		}
 
 		xsk->rx_npkts += rcvd;
-- 
2.14.1

* [PATCH bpf-next 5/5] samples/bpf: minor *_nb_free performance fix
  2018-06-04 11:57 [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Björn Töpel
                   ` (3 preceding siblings ...)
  2018-06-04 11:57 ` [PATCH bpf-next 4/5] samples/bpf: adapted to new uapi Björn Töpel
@ 2018-06-04 11:57 ` Björn Töpel
  2018-06-04 16:24 ` [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Alexei Starovoitov
  5 siblings, 0 replies; 8+ messages in thread
From: Björn Töpel @ 2018-06-04 11:57 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: john.fastabend, willemdebruijn.kernel, mst, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks, andy, michael.chan

From: Magnus Karlsson <magnus.karlsson@intel.com>
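
Fold the queue size into the locally cached consumer pointer, so that
the number of free entries can be computed with a single subtraction.
E.g., for a ring of size 1024 with producer at 2000 and consumer at
1990, the old code computes 1024 - (2000 - 1990) = 1014 on every call,
while caching cons + size = 3014 gives 3014 - 2000 = 1014 directly;
unsigned wraparound keeps the arithmetic correct. This is also why the
cached consumer pointers below are now initialized to the ring size
instead of zero.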

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/xdpsock_user.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index b71a342b9082..7494f60fbff8 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -157,15 +157,15 @@ static const char pkt_data[] =
 
 static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
 {
-	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
+	u32 free_entries = q->cached_cons - q->cached_prod;
 
 	if (free_entries >= nb)
 		return free_entries;
 
 	/* Refresh the local tail pointer */
-	q->cached_cons = *q->consumer;
+	q->cached_cons = *q->consumer + q->size;
 
-	return q->size - (q->cached_prod - q->cached_cons);
+	return q->cached_cons - q->cached_prod;
 }
 
 static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
@@ -439,6 +439,7 @@ static struct xdp_umem *xdp_umem_configure(int sfd)
 	umem->fq.producer = umem->fq.map + off.fr.producer;
 	umem->fq.consumer = umem->fq.map + off.fr.consumer;
 	umem->fq.ring = umem->fq.map + off.fr.desc;
+	umem->fq.cached_cons = FQ_NUM_DESCS;
 
 	umem->cq.map = mmap(0, off.cr.desc +
 			     CQ_NUM_DESCS * sizeof(u64),
@@ -535,6 +536,7 @@ static struct xdpsock *xsk_configure(struct xdp_umem *umem)
 	xsk->tx.producer = xsk->tx.map + off.tx.producer;
 	xsk->tx.consumer = xsk->tx.map + off.tx.consumer;
 	xsk->tx.ring = xsk->tx.map + off.tx.desc;
+	xsk->tx.cached_cons = NUM_DESCS;
 
 	sxdp.sxdp_family = PF_XDP;
 	sxdp.sxdp_ifindex = opt_ifindex;
-- 
2.14.1

* Re: [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes
  2018-06-04 11:57 [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Björn Töpel
                   ` (4 preceding siblings ...)
  2018-06-04 11:57 ` [PATCH bpf-next 5/5] samples/bpf: minor *_nb_free performance fix Björn Töpel
@ 2018-06-04 16:24 ` Alexei Starovoitov
  2018-06-04 19:51   ` Daniel Borkmann
  5 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2018-06-04 16:24 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev,
	Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan

On Mon, Jun 04, 2018 at 01:57:10PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> An issue with the current AF_XDP uapi, raised by Mykyta Iziumtsev (see
> https://www.spinics.net/lists/netdev/msg503664.html), is that it does
> not support NICs that have a "type-writer" model in an efficient
> way. In this model, a memory window is passed to the hardware and
> multiple frames might be filled into that window, instead of just one
> as in the current fixed frame-size model.
> 
> This patch set fixes two bugs in the current implementation and then
> changes the uapi so that the type-writer model can be supported
> efficiently by a possible future extension of AF_XDP.
> 
> These are the uapi changes in this patch set:
> 
> * Change the "u32 idx" in the descriptors to "u64 addr". The current
>   idx-based format does NOT work for the type-writer model (as packets
>   can start anywhere within a frame), whereas a relative address
>   pointer (the u64 addr) works well for both models in the prototype
>   code we have that supports both. We increased it from u32 to u64 to
>   support umems larger than 4G. We have also removed the u16 offset,
>   since with a u64 addr that information is already carried in the
>   least significant bits of the address.
> 
> * We want to be able to use the "u8 padding[5]" field for something
>   useful in the future (and once it is uapi we are not allowed to
>   change its name), so we now simply call it "options" so it can be
>   extended for various purposes later. It is a u32, as that is what
>   is left of the 16-byte descriptor.
> 
> * We changed the name of frame_size in the UMEM_REG setsockopt to
>   chunk_size since this naming also makes sense for the type-writer
>   model.
> 
> With these changes to the uapi, we believe the type-writer model can
> be supported without having to resort to a new descriptor format. From
> the uapi point of view, it could then be enabled by setting a flag at
> bind time and providing a new flag bit in the options field of the
> descriptor that signals to user space that all packets have been
> written in a chunk. Or with a new chunk completion queue, as suggested
> by Mykyta in his latest feedback mail on the list.

for the set:
Acked-by: Alexei Starovoitov <ast@kernel.org>
Thank you for these fixes.
According to unofficial feedback from brcm and netronome folks,
the descriptor format should work for these nics too.
At some point we may consider a second format, but I think SW
should drive HW requirements and not the other way around.

* Re: [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes
  2018-06-04 16:24 ` [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes Alexei Starovoitov
@ 2018-06-04 19:51   ` Daniel Borkmann
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Borkmann @ 2018-06-04 19:51 UTC (permalink / raw)
  To: Alexei Starovoitov, Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, netdev, mykyta.iziumtsev,
	Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan

On 06/04/2018 06:24 PM, Alexei Starovoitov wrote:
> On Mon, Jun 04, 2018 at 01:57:10PM +0200, Björn Töpel wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> An issue with the current AF_XDP uapi, raised by Mykyta Iziumtsev (see
>> https://www.spinics.net/lists/netdev/msg503664.html), is that it does
>> not support NICs that have a "type-writer" model in an efficient
>> way. In this model, a memory window is passed to the hardware and
>> multiple frames might be filled into that window, instead of just one
>> as in the current fixed frame-size model.
>>
>> This patch set fixes two bugs in the current implementation and then
>> changes the uapi so that the type-writer model can be supported
>> efficiently by a possible future extension of AF_XDP.
>>
>> These are the uapi changes in this patch set:
>>
>> * Change the "u32 idx" in the descriptors to "u64 addr". The current
>>   idx-based format does NOT work for the type-writer model (as packets
>>   can start anywhere within a frame), whereas a relative address
>>   pointer (the u64 addr) works well for both models in the prototype
>>   code we have that supports both. We increased it from u32 to u64 to
>>   support umems larger than 4G. We have also removed the u16 offset,
>>   since with a u64 addr that information is already carried in the
>>   least significant bits of the address.
>>
>> * We want to be able to use the "u8 padding[5]" field for something
>>   useful in the future (and once it is uapi we are not allowed to
>>   change its name), so we now simply call it "options" so it can be
>>   extended for various purposes later. It is a u32, as that is what
>>   is left of the 16-byte descriptor.
>>
>> * We changed the name of frame_size in the UMEM_REG setsockopt to
>>   chunk_size since this naming also makes sense for the type-writer
>>   model.
>>
>> With these changes to the uapi, we believe the type-writer model can
>> be supported without having to resort to a new descriptor format. From
>> the uapi point of view, it could then be enabled by setting a flag at
>> bind time and providing a new flag bit in the options field of the
>> descriptor that signals to user space that all packets have been
>> written in a chunk. Or with a new chunk completion queue, as suggested
>> by Mykyta in his latest feedback mail on the list.
> 
> for the set:
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Thank you for these fixes.
> According to unofficial feedback from brcm and netronome folks
> the descriptor format should work for these nics too.
> At some point we may consider second format, but I think SW
> should drive HW requirements and not the other way around.

LGTM as well, applied to bpf-next, thanks!
