All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf v2 0/2] libbpf: remove two dependencies on Linux kernel headers and improve performance as a bonus
@ 2019-04-10  7:09 Magnus Karlsson
  2019-04-10  7:09 ` [PATCH bpf v2 1/2] libbpf: remove likely/unlikely in xsk.h Magnus Karlsson
  2019-04-10  7:09 ` [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h " Magnus Karlsson
  0 siblings, 2 replies; 10+ messages in thread
From: Magnus Karlsson @ 2019-04-10  7:09 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev
  Cc: bpf, bruce.richardson, ciara.loftus, ilias.apalodimas,
	xiaolong.ye, ferruh.yigit, qi.z.zhang, georgmueller

This patch set removes two dependencies on Linux kernel headers
from the XDP socket code in libbpf. A number of people have pointed
out that these two dependencies make it hard to build the XDP socket
part of libbpf without any kernel header dependencies. The two removed
dependecies are:

* Remove the usage of likely and unlikely (compiler.h) in xsk.h. It
  has been reported that the use of these actually decreases the
  performance of the ring access code due to an increase in
  instruction cache misses, so let us just remove these.

* Remove the dependency on barrier.h as it brings in a lot of kernel
  headers. As the XDP socket code only uses two simple functions from
  it, we can reimplement these. As a bonus, the new implementation is
  faster as it uses the same barrier primitives as the kernel does
  when the same code is compiled there. Without this patch, the user
  land code uses lfence and sfence on x86, which are unecessarily
  harsh/thorough.

v1 -> v2: Added comment about validity of ARM 32-bit barriers.
          Only armv7 and above.

/Magnus

Magnus Karlsson (2):
  libbpf: remove likely/unlikely in xsk.h
  libbpf: remove dependency on barrier.h in xsk.h

 tools/lib/bpf/xsk.h | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

--
2.7.4

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH bpf v2 1/2] libbpf: remove likely/unlikely in xsk.h
  2019-04-10  7:09 [PATCH bpf v2 0/2] libbpf: remove two dependencies on Linux kernel headers and improve performance as a bonus Magnus Karlsson
@ 2019-04-10  7:09 ` Magnus Karlsson
  2019-04-10 17:56   ` Y Song
  2019-04-10  7:09 ` [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h " Magnus Karlsson
  1 sibling, 1 reply; 10+ messages in thread
From: Magnus Karlsson @ 2019-04-10  7:09 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev
  Cc: bpf, bruce.richardson, ciara.loftus, ilias.apalodimas,
	xiaolong.ye, ferruh.yigit, qi.z.zhang, georgmueller

This patch removes the use of likely and unlikely in xsk.h since they
create a dependency on Linux headers as reported by several
users. There have also been reports that the use of these decreases
performance as the compiler puts the code on two different cache lines
instead of on a single one. All in all, I think we are better off
without them.

Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 tools/lib/bpf/xsk.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index a497f00..3638147 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -105,7 +105,7 @@ static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
 static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
 					    size_t nb, __u32 *idx)
 {
-	if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
+	if (xsk_prod_nb_free(prod, nb) < nb)
 		return 0;
 
 	*idx = prod->cached_prod;
@@ -129,7 +129,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
 {
 	size_t entries = xsk_cons_nb_avail(cons, nb);
 
-	if (likely(entries > 0)) {
+	if (entries > 0) {
 		/* Make sure we do not speculatively read the data before
 		 * we have received the packet buffers from the ring.
 		 */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h in xsk.h
  2019-04-10  7:09 [PATCH bpf v2 0/2] libbpf: remove two dependencies on Linux kernel headers and improve performance as a bonus Magnus Karlsson
  2019-04-10  7:09 ` [PATCH bpf v2 1/2] libbpf: remove likely/unlikely in xsk.h Magnus Karlsson
@ 2019-04-10  7:09 ` Magnus Karlsson
  2019-04-10 18:13   ` Y Song
  1 sibling, 1 reply; 10+ messages in thread
From: Magnus Karlsson @ 2019-04-10  7:09 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev
  Cc: bpf, bruce.richardson, ciara.loftus, ilias.apalodimas,
	xiaolong.ye, ferruh.yigit, qi.z.zhang, georgmueller

The use of smp_rmb() and smp_wmb() creates a Linux header dependency
on barrier.h that is uneccessary in most parts. This patch implements
the two small defines that are needed from barrier.h. As a bonus, the
new implementations are faster than the default ones as they default
to sfence and lfence for x86, while we only need a compiler barrier in
our case. Just as it is when the same ring access code is compiled in
the kernel.

Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 tools/lib/bpf/xsk.h | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index 3638147..69136d9 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -39,6 +39,22 @@ DEFINE_XSK_RING(xsk_ring_cons);
 struct xsk_umem;
 struct xsk_socket;
 
+#if !defined bpf_smp_rmb && !defined bpf_smp_wmb
+# if defined(__i386__) || defined(__x86_64__)
+#  define bpf_smp_rmb() asm volatile("" : : : "memory")
+#  define bpf_smp_wmb() asm volatile("" : : : "memory")
+# elif defined(__aarch64__)
+#  define bpf_smp_rmb() asm volatile("dmb ishld" : : : "memory")
+#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
+# elif defined(__arm__)
+/* These are only valid for armv7 and above */
+#  define bpf_smp_rmb() asm volatile("dmb ish" : : : "memory")
+#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
+# else
+#  error Architecture not supported by the XDP socket code in libbpf.
+# endif
+#endif
+
 static inline __u64 *xsk_ring_prod__fill_addr(struct xsk_ring_prod *fill,
 					      __u32 idx)
 {
@@ -119,7 +135,7 @@ static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
 	/* Make sure everything has been written to the ring before signalling
 	 * this to the kernel.
 	 */
-	smp_wmb();
+	bpf_smp_wmb();
 
 	*prod->producer += nb;
 }
@@ -133,7 +149,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
 		/* Make sure we do not speculatively read the data before
 		 * we have received the packet buffers from the ring.
 		 */
-		smp_rmb();
+		bpf_smp_rmb();
 
 		*idx = cons->cached_cons;
 		cons->cached_cons += entries;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf v2 1/2] libbpf: remove likely/unlikely in xsk.h
  2019-04-10  7:09 ` [PATCH bpf v2 1/2] libbpf: remove likely/unlikely in xsk.h Magnus Karlsson
@ 2019-04-10 17:56   ` Y Song
  2019-04-11  6:20     ` Magnus Karlsson
  0 siblings, 1 reply; 10+ messages in thread
From: Y Song @ 2019-04-10 17:56 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	netdev, bpf, bruce.richardson, ciara.loftus, ilias.apalodimas,
	xiaolong.ye, ferruh.yigit, qi.z.zhang, georgmueller

On Wed, Apr 10, 2019 at 12:21 AM Magnus Karlsson
<magnus.karlsson@intel.com> wrote:
>
> This patch removes the use of likely and unlikely in xsk.h since they
> create a dependency on Linux headers as reported by several
> users. There have also been reports that the use of these decreases
> performance as the compiler puts the code on two different cache lines
> instead of on a single one. All in all, I think we are better off
> without them.

The change looks good to me.
  Acked-by: Yonghong Song <yhs@fb.com>

libbpf repo (https://github.com/libbpf/libbpf/) solved this issue by
providing some customer
implementation just to satisfying compilatioins. I guess users here do
not use libbpf repo and they
directly extract kernel source and try to build?

Just curious. do you have detailed info about which code in two
different cache lines instead
of one cache line and how much performance degradation?

>
> Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  tools/lib/bpf/xsk.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> index a497f00..3638147 100644
> --- a/tools/lib/bpf/xsk.h
> +++ b/tools/lib/bpf/xsk.h
> @@ -105,7 +105,7 @@ static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
>  static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
>                                             size_t nb, __u32 *idx)
>  {
> -       if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
> +       if (xsk_prod_nb_free(prod, nb) < nb)
>                 return 0;
>
>         *idx = prod->cached_prod;
> @@ -129,7 +129,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
>  {
>         size_t entries = xsk_cons_nb_avail(cons, nb);
>
> -       if (likely(entries > 0)) {
> +       if (entries > 0) {
>                 /* Make sure we do not speculatively read the data before
>                  * we have received the packet buffers from the ring.
>                  */
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h in xsk.h
  2019-04-10  7:09 ` [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h " Magnus Karlsson
@ 2019-04-10 18:13   ` Y Song
  2019-04-11  7:54     ` Magnus Karlsson
  0 siblings, 1 reply; 10+ messages in thread
From: Y Song @ 2019-04-10 18:13 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	netdev, bpf, bruce.richardson, ciara.loftus, ilias.apalodimas,
	xiaolong.ye, ferruh.yigit, qi.z.zhang, georgmueller

On Wed, Apr 10, 2019 at 12:21 AM Magnus Karlsson
<magnus.karlsson@intel.com> wrote:
>
> The use of smp_rmb() and smp_wmb() creates a Linux header dependency
> on barrier.h that is uneccessary in most parts. This patch implements
> the two small defines that are needed from barrier.h. As a bonus, the
> new implementations are faster than the default ones as they default
> to sfence and lfence for x86, while we only need a compiler barrier in
> our case. Just as it is when the same ring access code is compiled in
> the kernel.
>
> Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  tools/lib/bpf/xsk.h | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> index 3638147..69136d9 100644
> --- a/tools/lib/bpf/xsk.h
> +++ b/tools/lib/bpf/xsk.h
> @@ -39,6 +39,22 @@ DEFINE_XSK_RING(xsk_ring_cons);
>  struct xsk_umem;
>  struct xsk_socket;
>
> +#if !defined bpf_smp_rmb && !defined bpf_smp_wmb

Maybe add some comments to explain the different between bpf_smp_{r,w}mb
and smp_{r,w}mb so later users will have a better idea which to pick?

> +# if defined(__i386__) || defined(__x86_64__)
> +#  define bpf_smp_rmb() asm volatile("" : : : "memory")
> +#  define bpf_smp_wmb() asm volatile("" : : : "memory")
> +# elif defined(__aarch64__)
> +#  define bpf_smp_rmb() asm volatile("dmb ishld" : : : "memory")
> +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
> +# elif defined(__arm__)
> +/* These are only valid for armv7 and above */
> +#  define bpf_smp_rmb() asm volatile("dmb ish" : : : "memory")
> +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
> +# else
> +#  error Architecture not supported by the XDP socket code in libbpf.
> +# endif
> +#endif

Since this is generic enough and could be used by other files as well,
maybe put it into libbpf_util.h?

> +
>  static inline __u64 *xsk_ring_prod__fill_addr(struct xsk_ring_prod *fill,
>                                               __u32 idx)
>  {
> @@ -119,7 +135,7 @@ static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
>         /* Make sure everything has been written to the ring before signalling
>          * this to the kernel.
>          */
> -       smp_wmb();
> +       bpf_smp_wmb();
>
>         *prod->producer += nb;
>  }
> @@ -133,7 +149,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
>                 /* Make sure we do not speculatively read the data before
>                  * we have received the packet buffers from the ring.
>                  */
> -               smp_rmb();
> +               bpf_smp_rmb();

Could you explain why a compiler barrier is good enough here on x86? Note that
the load cons->cached_cons could be reordered with earlier
non-overlapping stores
at runtime.

>
>                 *idx = cons->cached_cons;
>                 cons->cached_cons += entries;
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf v2 1/2] libbpf: remove likely/unlikely in xsk.h
  2019-04-10 17:56   ` Y Song
@ 2019-04-11  6:20     ` Magnus Karlsson
  0 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2019-04-11  6:20 UTC (permalink / raw)
  To: Y Song
  Cc: Magnus Karlsson, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, netdev, bpf, bruce.richardson, ciara.loftus,
	ilias.apalodimas, Ye Xiaolong, ferruh.yigit, Zhang, Qi Z,
	georgmueller

On Wed, Apr 10, 2019 at 9:08 PM Y Song <ys114321@gmail.com> wrote:
>
> On Wed, Apr 10, 2019 at 12:21 AM Magnus Karlsson
> <magnus.karlsson@intel.com> wrote:
> >
> > This patch removes the use of likely and unlikely in xsk.h since they
> > create a dependency on Linux headers as reported by several
> > users. There have also been reports that the use of these decreases
> > performance as the compiler puts the code on two different cache lines
> > instead of on a single one. All in all, I think we are better off
> > without them.
>
> The change looks good to me.
>   Acked-by: Yonghong Song <yhs@fb.com>
>
> libbpf repo (https://github.com/libbpf/libbpf/) solved this issue by
> providing some customer
> implementation just to satisfying compilatioins. I guess users here do
> not use libbpf repo and they
> directly extract kernel source and try to build?

That is correct. Quite a number of people did not even know it existed
in the first place. Maybe we need more pointers to the repo.

> Just curious. do you have detailed info about which code in two
> different cache lines instead
> of one cache line and how much performance degradation?

Sorry, I have no detailed info. It is very dependent on the exact
access rates to and from the queues and the way the instructions line
up in the cache in the application. But the more often the unlikely
paths are taken, the more performance can be degraded. The
unlikely/likely paths in the ring access code are just not
unlikely/likely enough to warrant these annotations. They will occur
every time the locally cached values have to be refreshed from the
shared ones.

/Magnus

> >
> > Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> >  tools/lib/bpf/xsk.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> > index a497f00..3638147 100644
> > --- a/tools/lib/bpf/xsk.h
> > +++ b/tools/lib/bpf/xsk.h
> > @@ -105,7 +105,7 @@ static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
> >  static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
> >                                             size_t nb, __u32 *idx)
> >  {
> > -       if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
> > +       if (xsk_prod_nb_free(prod, nb) < nb)
> >                 return 0;
> >
> >         *idx = prod->cached_prod;
> > @@ -129,7 +129,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
> >  {
> >         size_t entries = xsk_cons_nb_avail(cons, nb);
> >
> > -       if (likely(entries > 0)) {
> > +       if (entries > 0) {
> >                 /* Make sure we do not speculatively read the data before
> >                  * we have received the packet buffers from the ring.
> >                  */
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h in xsk.h
  2019-04-10 18:13   ` Y Song
@ 2019-04-11  7:54     ` Magnus Karlsson
  2019-04-11 20:23       ` Daniel Borkmann
  0 siblings, 1 reply; 10+ messages in thread
From: Magnus Karlsson @ 2019-04-11  7:54 UTC (permalink / raw)
  To: Y Song
  Cc: Magnus Karlsson, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, netdev, bpf, bruce.richardson, ciara.loftus,
	ilias.apalodimas, Ye Xiaolong, ferruh.yigit, Zhang, Qi Z,
	georgmueller

On Wed, Apr 10, 2019 at 9:08 PM Y Song <ys114321@gmail.com> wrote:
>
> On Wed, Apr 10, 2019 at 12:21 AM Magnus Karlsson
> <magnus.karlsson@intel.com> wrote:
> >
> > The use of smp_rmb() and smp_wmb() creates a Linux header dependency
> > on barrier.h that is uneccessary in most parts. This patch implements
> > the two small defines that are needed from barrier.h. As a bonus, the
> > new implementations are faster than the default ones as they default
> > to sfence and lfence for x86, while we only need a compiler barrier in
> > our case. Just as it is when the same ring access code is compiled in
> > the kernel.
> >
> > Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> >  tools/lib/bpf/xsk.h | 20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> > index 3638147..69136d9 100644
> > --- a/tools/lib/bpf/xsk.h
> > +++ b/tools/lib/bpf/xsk.h
> > @@ -39,6 +39,22 @@ DEFINE_XSK_RING(xsk_ring_cons);
> >  struct xsk_umem;
> >  struct xsk_socket;
> >
> > +#if !defined bpf_smp_rmb && !defined bpf_smp_wmb
>
> Maybe add some comments to explain the different between bpf_smp_{r,w}mb
> and smp_{r,w}mb so later users will have a better idea which to pick?

Ouch, that is a hard one. I would just recommend people to read
Documentation/memory-barriers.txt. My attempt at explaining all this
would not be pretty and likely sprinkled with errors ;-).

> > +# if defined(__i386__) || defined(__x86_64__)
> > +#  define bpf_smp_rmb() asm volatile("" : : : "memory")
> > +#  define bpf_smp_wmb() asm volatile("" : : : "memory")
> > +# elif defined(__aarch64__)
> > +#  define bpf_smp_rmb() asm volatile("dmb ishld" : : : "memory")
> > +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
> > +# elif defined(__arm__)
> > +/* These are only valid for armv7 and above */
> > +#  define bpf_smp_rmb() asm volatile("dmb ish" : : : "memory")
> > +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
> > +# else
> > +#  error Architecture not supported by the XDP socket code in libbpf.
> > +# endif
> > +#endif
>
> Since this is generic enough and could be used by other files as well,
> maybe put it into libbpf_util.h?

Good question. Do not know. Daniel suggested introducing [0] and
perhaps that can be used by the broader libbpf code base? The
important part for this patch set is that these operations match the
ones in the kernel on the other end of the ring.

[0] https://lore.kernel.org/netdev/20181017144156.16639-2-daniel@iogearbox.net/

> > +
> >  static inline __u64 *xsk_ring_prod__fill_addr(struct xsk_ring_prod *fill,
> >                                               __u32 idx)
> >  {
> > @@ -119,7 +135,7 @@ static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
> >         /* Make sure everything has been written to the ring before signalling
> >          * this to the kernel.
> >          */
> > -       smp_wmb();
> > +       bpf_smp_wmb();
> >
> >         *prod->producer += nb;
> >  }
> > @@ -133,7 +149,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
> >                 /* Make sure we do not speculatively read the data before
> >                  * we have received the packet buffers from the ring.
> >                  */
> > -               smp_rmb();
> > +               bpf_smp_rmb();
>
> Could you explain why a compiler barrier is good enough here on x86? Note that
> the load cons->cached_cons could be reordered with earlier
> non-overlapping stores
> at runtime.

The bpf_smp_rmb() is there to protect the data in the ring itself to
be read by the consumer before the producer has signaled that it has
finished “producing” them by updating the producer (head) pointer. As
stores are not reordered with other stores on x86 (nor loads with
other loads), the update of the producer pointer will always be
observed after the writing of the data in the ring, as that is done
before the update of the producer pointer in xsk_ring_prod__submit().
One side only updates and the other side only reads. cached_cons is a
local variable and only for operations done by another core can we
observe loads being reordered with older stores to different
locations. Since no one else is touching cached_cons, this will not
happen.

/Magnus

> >
> >                 *idx = cons->cached_cons;
> >                 cons->cached_cons += entries;
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h in xsk.h
  2019-04-11  7:54     ` Magnus Karlsson
@ 2019-04-11 20:23       ` Daniel Borkmann
  2019-04-11 20:43         ` Daniel Borkmann
  2019-04-12  9:01         ` Magnus Karlsson
  0 siblings, 2 replies; 10+ messages in thread
From: Daniel Borkmann @ 2019-04-11 20:23 UTC (permalink / raw)
  To: Magnus Karlsson, Y Song
  Cc: Magnus Karlsson, Björn Töpel, Alexei Starovoitov,
	netdev, bpf, bruce.richardson, ciara.loftus, ilias.apalodimas,
	Ye Xiaolong, ferruh.yigit, Zhang, Qi Z, georgmueller

On 04/11/2019 09:54 AM, Magnus Karlsson wrote:
> On Wed, Apr 10, 2019 at 9:08 PM Y Song <ys114321@gmail.com> wrote:
>> On Wed, Apr 10, 2019 at 12:21 AM Magnus Karlsson
>> <magnus.karlsson@intel.com> wrote:
>>>
>>> The use of smp_rmb() and smp_wmb() creates a Linux header dependency
>>> on barrier.h that is uneccessary in most parts. This patch implements
>>> the two small defines that are needed from barrier.h. As a bonus, the
>>> new implementations are faster than the default ones as they default
>>> to sfence and lfence for x86, while we only need a compiler barrier in
>>> our case. Just as it is when the same ring access code is compiled in
>>> the kernel.
>>>
>>> Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
>>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>>> ---
>>>  tools/lib/bpf/xsk.h | 20 ++++++++++++++++++--
>>>  1 file changed, 18 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
>>> index 3638147..69136d9 100644
>>> --- a/tools/lib/bpf/xsk.h
>>> +++ b/tools/lib/bpf/xsk.h
>>> @@ -39,6 +39,22 @@ DEFINE_XSK_RING(xsk_ring_cons);
>>>  struct xsk_umem;
>>>  struct xsk_socket;
>>>
>>> +#if !defined bpf_smp_rmb && !defined bpf_smp_wmb
>>
>> Maybe add some comments to explain the different between bpf_smp_{r,w}mb
>> and smp_{r,w}mb so later users will have a better idea which to pick?
> 
> Ouch, that is a hard one. I would just recommend people to read
> Documentation/memory-barriers.txt. My attempt at explaining all this
> would not be pretty and likely sprinkled with errors ;-).

I think Yonghong meant here place a comment wrt when to use the below versus
when to use smp_{r,w}mb(). Both are essentially the same just that the main
difference here would be that this header needs to be installed in the system
so users need to have it. I think it indeed makes sense to add a comment about
this specific fact otherwise we might forget about it in few months.

>>> +# if defined(__i386__) || defined(__x86_64__)
>>> +#  define bpf_smp_rmb() asm volatile("" : : : "memory")
>>> +#  define bpf_smp_wmb() asm volatile("" : : : "memory")
>>> +# elif defined(__aarch64__)
>>> +#  define bpf_smp_rmb() asm volatile("dmb ishld" : : : "memory")
>>> +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
>>> +# elif defined(__arm__)
>>> +/* These are only valid for armv7 and above */
>>> +#  define bpf_smp_rmb() asm volatile("dmb ish" : : : "memory")
>>> +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
>>> +# else
>>> +#  error Architecture not supported by the XDP socket code in libbpf.
>>> +# endif
>>> +#endif
>>
>> Since this is generic enough and could be used by other files as well,
>> maybe put it into libbpf_util.h?

Hmm, maybe a good point. We could place it into libbpf.h as there is already
various misc helpers and xsk.h includes it anyway. But: if we do that, then
the above 'else' part would need some generic fallback (__sync_synchronize()
plus a warning?) as otherwise compilation would break for everyone with 'error'.
Ideally this should then cover as much as possible from mainstream archs though.

> Good question. Do not know. Daniel suggested introducing [0] and
> perhaps that can be used by the broader libbpf code base? The
> important part for this patch set is that these operations match the
> ones in the kernel on the other end of the ring.

Yeah, it can be used generally except for headers that are going to be
installed where these are present in inline helper functions.

> [0] https://lore.kernel.org/netdev/20181017144156.16639-2-daniel@iogearbox.net/
> 
>>> +
>>>  static inline __u64 *xsk_ring_prod__fill_addr(struct xsk_ring_prod *fill,
>>>                                               __u32 idx)
>>>  {
>>> @@ -119,7 +135,7 @@ static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
>>>         /* Make sure everything has been written to the ring before signalling
>>>          * this to the kernel.
>>>          */
>>> -       smp_wmb();
>>> +       bpf_smp_wmb();
>>>
>>>         *prod->producer += nb;
>>>  }
>>> @@ -133,7 +149,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
>>>                 /* Make sure we do not speculatively read the data before
>>>                  * we have received the packet buffers from the ring.
>>>                  */
>>> -               smp_rmb();
>>> +               bpf_smp_rmb();
>>
>> Could you explain why a compiler barrier is good enough here on x86? Note that
>> the load cons->cached_cons could be reordered with earlier
>> non-overlapping stores
>> at runtime.
> 
> The bpf_smp_rmb() is there to protect the data in the ring itself to
> be read by the consumer before the producer has signaled that it has
> finished “producing” them by updating the producer (head) pointer. As
> stores are not reordered with other stores on x86 (nor loads with
> other loads), the update of the producer pointer will always be
> observed after the writing of the data in the ring, as that is done
> before the update of the producer pointer in xsk_ring_prod__submit().
> One side only updates and the other side only reads. cached_cons is a
> local variable and only for operations done by another core can we
> observe loads being reordered with older stores to different
> locations. Since no one else is touching cached_cons, this will not
> happen.

From perf RB side, I found this one kernel/events/ring_buffer.c +72 to
be very helpful. It's independent of this series, but I would appreciate
if you could make similar scheme / comment somewhere in the AF_XDP code
such that all barriers in there can be more easily followed wrt how they
pair to user space.

Thanks,
Daniel

> /Magnus
> 
>>>
>>>                 *idx = cons->cached_cons;
>>>                 cons->cached_cons += entries;
>>> --
>>> 2.7.4
>>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h in xsk.h
  2019-04-11 20:23       ` Daniel Borkmann
@ 2019-04-11 20:43         ` Daniel Borkmann
  2019-04-12  9:01         ` Magnus Karlsson
  1 sibling, 0 replies; 10+ messages in thread
From: Daniel Borkmann @ 2019-04-11 20:43 UTC (permalink / raw)
  To: Magnus Karlsson, Y Song
  Cc: Magnus Karlsson, Björn Töpel, Alexei Starovoitov,
	netdev, bpf, bruce.richardson, ciara.loftus, ilias.apalodimas,
	Ye Xiaolong, ferruh.yigit, Zhang, Qi Z, georgmueller

On 04/11/2019 10:23 PM, Daniel Borkmann wrote:
> On 04/11/2019 09:54 AM, Magnus Karlsson wrote:
>> On Wed, Apr 10, 2019 at 9:08 PM Y Song <ys114321@gmail.com> wrote:
>>> On Wed, Apr 10, 2019 at 12:21 AM Magnus Karlsson
>>> <magnus.karlsson@intel.com> wrote:
>>>>
>>>> The use of smp_rmb() and smp_wmb() creates a Linux header dependency
>>>> on barrier.h that is uneccessary in most parts. This patch implements
>>>> the two small defines that are needed from barrier.h. As a bonus, the
>>>> new implementations are faster than the default ones as they default
>>>> to sfence and lfence for x86, while we only need a compiler barrier in
>>>> our case. Just as it is when the same ring access code is compiled in
>>>> the kernel.
>>>>
>>>> Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
>>>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>>>> ---
>>>>  tools/lib/bpf/xsk.h | 20 ++++++++++++++++++--
>>>>  1 file changed, 18 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
>>>> index 3638147..69136d9 100644
>>>> --- a/tools/lib/bpf/xsk.h
>>>> +++ b/tools/lib/bpf/xsk.h
>>>> @@ -39,6 +39,22 @@ DEFINE_XSK_RING(xsk_ring_cons);
>>>>  struct xsk_umem;
>>>>  struct xsk_socket;
>>>>
>>>> +#if !defined bpf_smp_rmb && !defined bpf_smp_wmb
>>>
>>> Maybe add some comments to explain the different between bpf_smp_{r,w}mb
>>> and smp_{r,w}mb so later users will have a better idea which to pick?
>>
>> Ouch, that is a hard one. I would just recommend people to read
>> Documentation/memory-barriers.txt. My attempt at explaining all this
>> would not be pretty and likely sprinkled with errors ;-).
> 
> I think Yonghong meant here place a comment wrt when to use the below versus
> when to use smp_{r,w}mb(). Both are essentially the same just that the main
> difference here would be that this header needs to be installed in the system
> so users need to have it. I think it indeed makes sense to add a comment about
> this specific fact otherwise we might forget about it in few months.
> 
>>>> +# if defined(__i386__) || defined(__x86_64__)
>>>> +#  define bpf_smp_rmb() asm volatile("" : : : "memory")
>>>> +#  define bpf_smp_wmb() asm volatile("" : : : "memory")
>>>> +# elif defined(__aarch64__)
>>>> +#  define bpf_smp_rmb() asm volatile("dmb ishld" : : : "memory")
>>>> +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
>>>> +# elif defined(__arm__)
>>>> +/* These are only valid for armv7 and above */
>>>> +#  define bpf_smp_rmb() asm volatile("dmb ish" : : : "memory")
>>>> +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
>>>> +# else
>>>> +#  error Architecture not supported by the XDP socket code in libbpf.
>>>> +# endif
>>>> +#endif
>>>
>>> Since this is generic enough and could be used by other files as well,
>>> maybe put it into libbpf_util.h?
> 
> Hmm, maybe a good point. We could place it into libbpf.h as there is already
> various misc helpers and xsk.h includes it anyway. But: if we do that, then
> the above 'else' part would need some generic fallback (__sync_synchronize()
> plus a warning?) as otherwise compilation would break for everyone with 'error'.
> Ideally this should then cover as much as possible from mainstream archs though.

(And if so then prefixed with libbpf_smp_{r,w}mb() to denote it's misc libbpf
 internal function.)

>> Good question. Do not know. Daniel suggested introducing [0] and
>> perhaps that can be used by the broader libbpf code base? The
>> important part for this patch set is that these operations match the
>> ones in the kernel on the other end of the ring.
> 
> Yeah, it can be used generally except for headers that are going to be
> installed where these are present in inline helper functions.
> 
>> [0] https://lore.kernel.org/netdev/20181017144156.16639-2-daniel@iogearbox.net/
>>
>>>> +
>>>>  static inline __u64 *xsk_ring_prod__fill_addr(struct xsk_ring_prod *fill,
>>>>                                               __u32 idx)
>>>>  {
>>>> @@ -119,7 +135,7 @@ static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
>>>>         /* Make sure everything has been written to the ring before signalling
>>>>          * this to the kernel.
>>>>          */
>>>> -       smp_wmb();
>>>> +       bpf_smp_wmb();
>>>>
>>>>         *prod->producer += nb;
>>>>  }
>>>> @@ -133,7 +149,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
>>>>                 /* Make sure we do not speculatively read the data before
>>>>                  * we have received the packet buffers from the ring.
>>>>                  */
>>>> -               smp_rmb();
>>>> +               bpf_smp_rmb();
>>>
>>> Could you explain why a compiler barrier is good enough here on x86? Note that
>>> the load cons->cached_cons could be reordered with earlier
>>> non-overlapping stores
>>> at runtime.
>>
>> The bpf_smp_rmb() is there to protect the data in the ring itself to
>> be read by the consumer before the producer has signaled that it has
>> finished “producing” them by updating the producer (head) pointer. As
>> stores are not reordered with other stores on x86 (nor loads with
>> other loads), the update of the producer pointer will always be
>> observed after the writing of the data in the ring, as that is done
>> before the update of the producer pointer in xsk_ring_prod__submit().
>> One side only updates and the other side only reads. cached_cons is a
>> local variable and only for operations done by another core can we
>> observe loads being reordered with older stores to different
>> locations. Since no one else is touching cached_cons, this will not
>> happen.
> 
> From perf RB side, I found this one kernel/events/ring_buffer.c +72 to
> be very helpful. It's independent of this series, but I would appreciate
> if you could make similar scheme / comment somewhere in the AF_XDP code
> such that all barriers in there can be more easily followed wrt how they
> pair to user space.
> 
> Thanks,
> Daniel
> 
>> /Magnus
>>
>>>>
>>>>                 *idx = cons->cached_cons;
>>>>                 cons->cached_cons += entries;
>>>> --
>>>> 2.7.4
>>>>
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h in xsk.h
  2019-04-11 20:23       ` Daniel Borkmann
  2019-04-11 20:43         ` Daniel Borkmann
@ 2019-04-12  9:01         ` Magnus Karlsson
  1 sibling, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2019-04-12  9:01 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Y Song, Magnus Karlsson, Björn Töpel,
	Alexei Starovoitov, netdev, bpf, bruce.richardson, ciara.loftus,
	ilias.apalodimas, Ye Xiaolong, ferruh.yigit, Zhang, Qi Z,
	georgmueller

On Thu, Apr 11, 2019 at 10:23 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 04/11/2019 09:54 AM, Magnus Karlsson wrote:
> > On Wed, Apr 10, 2019 at 9:08 PM Y Song <ys114321@gmail.com> wrote:
> >> On Wed, Apr 10, 2019 at 12:21 AM Magnus Karlsson
> >> <magnus.karlsson@intel.com> wrote:
> >>>
> >>> The use of smp_rmb() and smp_wmb() creates a Linux header dependency
> >>> on barrier.h that is uneccessary in most parts. This patch implements
> >>> the two small defines that are needed from barrier.h. As a bonus, the
> >>> new implementations are faster than the default ones as they default
> >>> to sfence and lfence for x86, while we only need a compiler barrier in
> >>> our case. Just as it is when the same ring access code is compiled in
> >>> the kernel.
> >>>
> >>> Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
> >>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> >>> ---
> >>>  tools/lib/bpf/xsk.h | 20 ++++++++++++++++++--
> >>>  1 file changed, 18 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> >>> index 3638147..69136d9 100644
> >>> --- a/tools/lib/bpf/xsk.h
> >>> +++ b/tools/lib/bpf/xsk.h
> >>> @@ -39,6 +39,22 @@ DEFINE_XSK_RING(xsk_ring_cons);
> >>>  struct xsk_umem;
> >>>  struct xsk_socket;
> >>>
> >>> +#if !defined bpf_smp_rmb && !defined bpf_smp_wmb
> >>
> >> Maybe add some comments to explain the different between bpf_smp_{r,w}mb
> >> and smp_{r,w}mb so later users will have a better idea which to pick?
> >
> > Ouch, that is a hard one. I would just recommend people to read
> > Documentation/memory-barriers.txt. My attempt at explaining all this
> > would not be pretty and likely sprinkled with errors ;-).
>
> I think Yonghong meant here place a comment wrt when to use the below versus
> when to use smp_{r,w}mb(). Both are essentially the same just that the main
> difference here would be that this header needs to be installed in the system
> so users need to have it. I think it indeed makes sense to add a comment about
> this specific fact otherwise we might forget about it in few months.

Will do.

> >>> +# if defined(__i386__) || defined(__x86_64__)
> >>> +#  define bpf_smp_rmb() asm volatile("" : : : "memory")
> >>> +#  define bpf_smp_wmb() asm volatile("" : : : "memory")
> >>> +# elif defined(__aarch64__)
> >>> +#  define bpf_smp_rmb() asm volatile("dmb ishld" : : : "memory")
> >>> +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
> >>> +# elif defined(__arm__)
> >>> +/* These are only valid for armv7 and above */
> >>> +#  define bpf_smp_rmb() asm volatile("dmb ish" : : : "memory")
> >>> +#  define bpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
> >>> +# else
> >>> +#  error Architecture not supported by the XDP socket code in libbpf.
> >>> +# endif
> >>> +#endif
> >>
> >> Since this is generic enough and could be used by other files as well,
> >> maybe put it into libbpf_util.h?
>
> Hmm, maybe a good point. We could place it into libbpf.h as there is already
> various misc helpers and xsk.h includes it anyway. But: if we do that, then
> the above 'else' part would need some generic fallback (__sync_synchronize()
> plus a warning?) as otherwise compilation would break for everyone with 'error'.
> Ideally this should then cover as much as possible from mainstream archs though.

I will give this a stab for the mainstream archs then fall back and
warn as you suggests. Will use the libbpf_ prefix.

> > Good question. Do not know. Daniel suggested introducing [0] and
> > perhaps that can be used by the broader libbpf code base? The
> > important part for this patch set is that these operations match the
> > ones in the kernel on the other end of the ring.
>
> Yeah, it can be used generally except for headers that are going to be
> installed where these are present in inline helper functions.
>
> > [0] https://lore.kernel.org/netdev/20181017144156.16639-2-daniel@iogearbox.net/
> >
> >>> +
> >>>  static inline __u64 *xsk_ring_prod__fill_addr(struct xsk_ring_prod *fill,
> >>>                                               __u32 idx)
> >>>  {
> >>> @@ -119,7 +135,7 @@ static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
> >>>         /* Make sure everything has been written to the ring before signalling
> >>>          * this to the kernel.
> >>>          */
> >>> -       smp_wmb();
> >>> +       bpf_smp_wmb();
> >>>
> >>>         *prod->producer += nb;
> >>>  }
> >>> @@ -133,7 +149,7 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
> >>>                 /* Make sure we do not speculatively read the data before
> >>>                  * we have received the packet buffers from the ring.
> >>>                  */
> >>> -               smp_rmb();
> >>> +               bpf_smp_rmb();
> >>
> >> Could you explain why a compiler barrier is good enough here on x86? Note that
> >> the load cons->cached_cons could be reordered with earlier
> >> non-overlapping stores
> >> at runtime.
> >
> > The bpf_smp_rmb() is there to protect the data in the ring itself to
> > be read by the consumer before the producer has signaled that it has
> > finished “producing” them by updating the producer (head) pointer. As
> > stores are not reordered with other stores on x86 (nor loads with
> > other loads), the update of the producer pointer will always be
> > observed after the writing of the data in the ring, as that is done
> > before the update of the producer pointer in xsk_ring_prod__submit().
> > One side only updates and the other side only reads. cached_cons is a
> > local variable and only for operations done by another core can we
> > observe loads being reordered with older stores to different
> > locations. Since no one else is touching cached_cons, this will not
> > happen.
>
> From perf RB side, I found this one kernel/events/ring_buffer.c +72 to
> be very helpful. It's independent of this series, but I would appreciate
> if you could make similar scheme / comment somewhere in the AF_XDP code
> such that all barriers in there can be more easily followed wrt how they
> pair to user space.

That was very useful indeed and as we have the same structure this
also uncovered that we are missing an smp_mb in the (D) location below
for the ring to work on PPC64 and likely ARM too. On x86, it will work
without one since stores are not reordered with older loads (as
observed from another core), but there is only rmb, wmb and mb in
Linux (no rwmb or wrmb), so we are going to take a hit on x86 here.
But correctness for everyone always goes first. Will submit a new
patch set with this fix in the kernel also. The question is, can I
just refer to the perf_ring comments in the xsk.h code, or should I
largely copy it into the xsk code so it is there too?

As an optimization for the future, maybe we could do something along
the lines of Documentation/core-api/circular-buffers.rst, where
load_aqcuire and store_release are used instead of expensive memory
barriers. Do you guys have any experience using these primitives
instead and their performance on various platforms?

from kernel/events/ring_buffer.c:
*   kernel                             user
*
*   if (LOAD ->data_tail) {            LOAD ->data_head
*                      (A)             smp_rmb()       (C)
*      STORE $data                     LOAD $data
*      smp_wmb()       (B)             smp_mb()        (D)
*      STORE ->data_head               STORE ->data_tail
*   }
*
* Where A pairs with D, and B pairs with C.


> Thanks,
> Daniel
>
> > /Magnus
> >
> >>>
> >>>                 *idx = cons->cached_cons;
> >>>                 cons->cached_cons += entries;
> >>> --
> >>> 2.7.4
> >>>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-04-12  9:01 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-10  7:09 [PATCH bpf v2 0/2] libbpf: remove two dependencies on Linux kernel headers and improve performance as a bonus Magnus Karlsson
2019-04-10  7:09 ` [PATCH bpf v2 1/2] libbpf: remove likely/unlikely in xsk.h Magnus Karlsson
2019-04-10 17:56   ` Y Song
2019-04-11  6:20     ` Magnus Karlsson
2019-04-10  7:09 ` [PATCH bpf v2 2/2] libbpf: remove dependency on barrier.h " Magnus Karlsson
2019-04-10 18:13   ` Y Song
2019-04-11  7:54     ` Magnus Karlsson
2019-04-11 20:23       ` Daniel Borkmann
2019-04-11 20:43         ` Daniel Borkmann
2019-04-12  9:01         ` Magnus Karlsson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.