xdp-newbies.vger.kernel.org archive mirror
* AF_XDP sockets across multiple NIC queues
@ 2021-03-25  6:24 Konstantinos Kaffes
  2021-03-25  7:24 ` Magnus Karlsson
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantinos Kaffes @ 2021-03-25  6:24 UTC (permalink / raw)
  To: xdp-newbies

Hello everyone,

I want to write a multi-threaded AF_XDP server where all N threads can
read from all N NIC queues. In my design, each thread creates N AF_XDP
sockets, each associated with a different queue. I have the following
questions:

1. Do sockets associated with the same queue need to share their UMEM
area and fill and completion rings?
2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
happens if my XDP program redirects a packet to a socket that is
associated with a different NIC queue than the one in which the packet
arrived?

I must mention that I am using the XDP skb mode with copies.

Thank you in advance,
Kostis


* Re: AF_XDP sockets across multiple NIC queues
  2021-03-25  6:24 AF_XDP sockets across multiple NIC queues Konstantinos Kaffes
@ 2021-03-25  7:24 ` Magnus Karlsson
  2021-03-25 18:51   ` Konstantinos Kaffes
  0 siblings, 1 reply; 9+ messages in thread
From: Magnus Karlsson @ 2021-03-25  7:24 UTC (permalink / raw)
  To: Konstantinos Kaffes; +Cc: Xdp

On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
>
> Hello everyone,
>
> I want to write a multi-threaded AF_XDP server where all N threads can
> read from all N NIC queues. In my design, each thread creates N AF_XDP
> sockets, each associated with a different queue. I have the following
> questions:
>
> 1. Do sockets associated with the same queue need to share their UMEM
> area and fill and completion rings?

Yes. In zero-copy mode this is natural, since the NIC HW will DMA the
packet into a umem that was decided long before the packet was even
received, and of course before we even get to pick which socket it
should go to. This restriction is currently carried over to copy
mode; however, in theory there is nothing preventing support for
multiple umems on the same netdev and queue id in copy mode. It is
just that nobody has implemented it.
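
For reference, sharing one umem and one fill/completion ring pair
between sockets looks roughly like this with libbpf's xsk.h helpers.
This is only an untested sketch: the interface name, queue id and
sizes are placeholders, and error handling is omitted.

/* Two AF_XDP sockets on the same netdev/queue sharing one umem and
 * one fill/completion ring pair (sketch, placeholder values). */
#include <stdlib.h>
#include <unistd.h>
#include <bpf/xsk.h>

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

static struct xsk_umem *umem;
static struct xsk_ring_prod fq;   /* fill ring, shared */
static struct xsk_ring_cons cq;   /* completion ring, shared */

int setup_shared_sockets(void)
{
	struct xsk_ring_cons rx0, rx1;
	struct xsk_ring_prod tx0, tx1;
	struct xsk_socket *xsk0, *xsk1;
	void *bufs;

	if (posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE))
		return -1;

	/* One umem with its fill and completion rings. */
	if (xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE,
			     &fq, &cq, NULL))
		return -1;

	/* First socket on eth0, queue 0, uses the umem's fq/cq. */
	if (xsk_socket__create(&xsk0, "eth0", 0, umem, &rx0, &tx0, NULL))
		return -1;

	/* Second socket on the same netdev and queue id reuses the same
	 * umem and the same fq/cq (XDP_SHARED_UMEM under the hood). */
	if (xsk_socket__create_shared(&xsk1, "eth0", 0, umem,
				      &rx1, &tx1, &fq, &cq, NULL))
		return -1;

	return 0;
}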

> 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> happens if my XDP program redirects a packet to a socket that is
> associated with a different NIC queue than the one in which the packet
> arrived?

You can have multiple XSKMAPs, but you would in any case need N^2
sockets in total to cover all cases. Sockets are tied to a specific
netdev and queue id. If you try to redirect to a socket with a queue
id or netdev that the packet was not received on, it will be dropped.
Again, for copy mode it would from a theoretical perspective be
perfectly fine to redirect to another queue id and/or netdev since
the packet is copied anyway. Maybe you want to add support for it
:-).
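
The XDP program side of this is usually a single XSKMAP indexed by
queue id, so every packet is redirected to a socket bound to the
queue it arrived on. A sketch (map size and program name are
arbitrary, and the fallback to XDP_PASS is just one possible choice):

/* Redirect each packet to the socket registered for its Rx queue. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);           /* >= number of NIC queues */
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_redirect_xsk(struct xdp_md *ctx)
{
	/* If no socket is registered for this queue id, fall back to
	 * the regular network stack instead of dropping the packet. */
	return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";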

> I must mention that I am using the XDP skb mode with copies.
>
> Thank you in advance,
> Kostis


* Re: AF_XDP sockets across multiple NIC queues
  2021-03-25  7:24 ` Magnus Karlsson
@ 2021-03-25 18:51   ` Konstantinos Kaffes
  2021-03-26  7:36     ` Magnus Karlsson
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantinos Kaffes @ 2021-03-25 18:51 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: Xdp

Great, thanks for the info! I will look into implementing this.

For the time being, I implemented a version of my design with N^2
sockets. I observed that when all traffic is directed to a single NIC
queue, the throughput is higher than when I use all N NIC queues. I
am using spinlocks to guard concurrent access to the UMEM and the
fill/completion rings. With a single NIC queue I achieve ~1Mpps; with
multiple queues, only ~550Kpps. Are these numbers reasonable, and is
this poor scaling behavior expected?


On Thu, 25 Mar 2021 at 00:24, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
> On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> >
> > Hello everyone,
> >
> > I want to write a multi-threaded AF_XDP server where all N threads can
> > read from all N NIC queues. In my design, each thread creates N AF_XDP
> > sockets, each associated with a different queue. I have the following
> > questions:
> >
> > 1. Do sockets associated with the same queue need to share their UMEM
> > area and fill and completion rings?
>
> Yes. In zero-copy mode this is natural since the NIC HW will DMA the
> packet into a umem that was decided long before the packet was even
> received. And this is of course before we even get to pick what socket
> it should go to. This restriction is currently carried over to
> copy-mode, however, in theory there is nothing preventing supporting
> multiple umems on the same netdev and queue id in copy-mode. It is
> just that nobody has implemented support for it.
>
> > 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> > happens if my XDP program redirects a packet to a socket that is
> > associated with a different NIC queue than the one in which the packet
> > arrived?
>
> You can have multiple XSKMAPs but you would in any case have to have
> N^2 sockets in total to be able to cover all cases. Sockets are tied
> to a specific netdev and queue id. If you try to redirect to socket
> with a queue id or netdev that the packet was not received on, it will
> be dropped. Again, for copy-mode, it would from a theoretical
> perspective be perfectly fine to redirect to another queue id and/or
> netdev since the packet is copied anyway. Maybe you want to add
> support for it :-).
>
> > I must mention that I am using the XDP skb mode with copies.
> >
> > Thank you in advance,
> > Kostis



-- 
Kostis Kaffes
PhD Student in Electrical Engineering
Stanford University


* Re: AF_XDP sockets across multiple NIC queues
  2021-03-25 18:51   ` Konstantinos Kaffes
@ 2021-03-26  7:36     ` Magnus Karlsson
  2021-03-30  5:32       ` Konstantinos Kaffes
  0 siblings, 1 reply; 9+ messages in thread
From: Magnus Karlsson @ 2021-03-26  7:36 UTC (permalink / raw)
  To: Konstantinos Kaffes; +Cc: Xdp

On Thu, Mar 25, 2021 at 7:51 PM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
>
> Great, thanks for the info! I will look into implementing this.
>
> For the time being, I implemented a version of my design with N^2
> sockets. I observed that when all traffic is directed to a single NIC
> queue, the throughput is higher than when I use all N NIC queues. I am
> using spinlocks to guard concurrent access to UMEM and the
> fill/completion rings. When I use a single NIC queue, I achieve
> ~1Mpps; when I use multiple ~550Kpps. Are these numbers reasonable,
> and this bad scaling behavior expected?

1Mpps sounds reasonable for SKB mode. If you use something simple
like the spinlock scheme you describe, it will not scale. Check the
sample xsk_fwd.c in samples/bpf in the Linux kernel repo. It has a
mempool implementation that should scale better than the one you
implemented. For anything remotely complicated, something that
manages the buffers in the umem plus the fill and completion queues
is usually required; most of the time this is called a mempool.
User-space networking libraries such as DPDK and VPP provide fast and
scalable mempool implementations. It would be nice to add a simple
one to libbpf, or rather to libxdp since the AF_XDP functionality is
moving over there. Several people have asked for it, but
unfortunately I have not had the time.
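
To give a rough idea of the concept (this is only a sketch, not the
xsk_fwd.c code): keep a shared stack of free umem frame addresses
plus a small per-thread cache, so the lock is taken once per batch of
frames rather than once per frame.

/* Sketch of a batched per-thread frame cache on top of a shared,
 * lock-protected pool of free umem frame addresses. */
#include <pthread.h>
#include <stdint.h>

#define BATCH 64

struct frame_pool {                /* shared between threads */
	pthread_spinlock_t lock;
	uint64_t *addrs;           /* free frame addresses in the umem */
	uint32_t n_free;
};

struct frame_cache {               /* one per thread, no locking */
	uint64_t addrs[BATCH * 2];
	uint32_t n;
};

/* Take one frame; refill the whole cache under a single lock
 * acquisition when it runs dry. Returns -1 if the pool is empty. */
static int frame_alloc(struct frame_pool *p, struct frame_cache *c,
		       uint64_t *addr)
{
	if (c->n == 0) {
		pthread_spin_lock(&p->lock);
		while (c->n < BATCH && p->n_free > 0)
			c->addrs[c->n++] = p->addrs[--p->n_free];
		pthread_spin_unlock(&p->lock);
		if (c->n == 0)
			return -1;
	}
	*addr = c->addrs[--c->n];
	return 0;
}

/* Give one frame back; flush half of the cache in one go when full. */
static void frame_free(struct frame_pool *p, struct frame_cache *c,
		       uint64_t addr)
{
	c->addrs[c->n++] = addr;
	if (c->n == BATCH * 2) {
		pthread_spin_lock(&p->lock);
		while (c->n > BATCH)
			p->addrs[p->n_free++] = c->addrs[--c->n];
		pthread_spin_unlock(&p->lock);
	}
}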

>
> On Thu, 25 Mar 2021 at 00:24, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> >
> > On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > >
> > > Hello everyone,
> > >
> > > I want to write a multi-threaded AF_XDP server where all N threads can
> > > read from all N NIC queues. In my design, each thread creates N AF_XDP
> > > sockets, each associated with a different queue. I have the following
> > > questions:
> > >
> > > 1. Do sockets associated with the same queue need to share their UMEM
> > > area and fill and completion rings?
> >
> > Yes. In zero-copy mode this is natural since the NIC HW will DMA the
> > packet into a umem that was decided long before the packet was even
> > received. And this is of course before we even get to pick what socket
> > it should go to. This restriction is currently carried over to
> > copy-mode, however, in theory there is nothing preventing supporting
> > multiple umems on the same netdev and queue id in copy-mode. It is
> > just that nobody has implemented support for it.
> >
> > > 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> > > happens if my XDP program redirects a packet to a socket that is
> > > associated with a different NIC queue than the one in which the packet
> > > arrived?
> >
> > You can have multiple XSKMAPs but you would in any case have to have
> > N^2 sockets in total to be able to cover all cases. Sockets are tied
> > to a specific netdev and queue id. If you try to redirect to socket
> > with a queue id or netdev that the packet was not received on, it will
> > be dropped. Again, for copy-mode, it would from a theoretical
> > perspective be perfectly fine to redirect to another queue id and/or
> > netdev since the packet is copied anyway. Maybe you want to add
> > support for it :-).
> >
> > > I must mention that I am using the XDP skb mode with copies.
> > >
> > > Thank you in advance,
> > > Kostis
>
>
>
> --
> Kostis Kaffes
> PhD Student in Electrical Engineering
> Stanford University


* Re: AF_XDP sockets across multiple NIC queues
  2021-03-26  7:36     ` Magnus Karlsson
@ 2021-03-30  5:32       ` Konstantinos Kaffes
  2021-03-30  6:17         ` Magnus Karlsson
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantinos Kaffes @ 2021-03-30  5:32 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: Xdp

On Fri, 26 Mar 2021 at 00:36, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
> On Thu, Mar 25, 2021 at 7:51 PM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> >
> > Great, thanks for the info! I will look into implementing this.
> >
> > For the time being, I implemented a version of my design with N^2
> > sockets. I observed that when all traffic is directed to a single NIC
> > queue, the throughput is higher than when I use all N NIC queues. I am
> > using spinlocks to guard concurrent access to UMEM and the
> > fill/completion rings. When I use a single NIC queue, I achieve
> > ~1Mpps; when I use multiple ~550Kpps. Are these numbers reasonable,
> > and this bad scaling behavior expected?
>
> 1Mpps sounds reasonable with SKB mode. If you use something simple
> like the spinlock scheme you describe, then it will not scale. Check
> the sample xsk_fwd.c in samples/bpf in the Linux kernel repo. It has a
> mempool implementation that should scale better than the one you
> implemented. For anything remotely complicated, something that manages
> the buffers in the umem plus the fill and completion queues is usually
> required. This is called a mempool most of the time. User-space
> network libraries such as DPDK and VPP provide fast and scalable
> mempool implementations. It would be nice to add a simple one to
> libbpf, or rather libxdp as the AF_XDP functionality is moving over
> there. Several people have asked for it, but unfortunately I have not
> had the time.
>

Thanks for the tip! I have also started trying zero-copy DRV mode and
came across some weird behavior. When I am using multiple sockets,
one for each NIC queue, I observe very low throughput and a lot of
time spent in the following loop:

uint32_t idx_cq;
while (ret < buf_count) {
  ret += xsk_ring_cons__peek(&xsk->umem->cq, buf_count, &idx_cq);
}

This does not happen when I have only one XDP socket bound to a single queue.

Any idea on why this might be happening?

> >
> > On Thu, 25 Mar 2021 at 00:24, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > >
> > > On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > I want to write a multi-threaded AF_XDP server where all N threads can
> > > > read from all N NIC queues. In my design, each thread creates N AF_XDP
> > > > sockets, each associated with a different queue. I have the following
> > > > questions:
> > > >
> > > > 1. Do sockets associated with the same queue need to share their UMEM
> > > > area and fill and completion rings?
> > >
> > > Yes. In zero-copy mode this is natural since the NIC HW will DMA the
> > > packet into a umem that was decided long before the packet was even
> > > received. And this is of course before we even get to pick what socket
> > > it should go to. This restriction is currently carried over to
> > > copy-mode, however, in theory there is nothing preventing supporting
> > > multiple umems on the same netdev and queue id in copy-mode. It is
> > > just that nobody has implemented support for it.
> > >
> > > > 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> > > > happens if my XDP program redirects a packet to a socket that is
> > > > associated with a different NIC queue than the one in which the packet
> > > > arrived?
> > >
> > > You can have multiple XSKMAPs but you would in any case have to have
> > > N^2 sockets in total to be able to cover all cases. Sockets are tied
> > > to a specific netdev and queue id. If you try to redirect to socket
> > > with a queue id or netdev that the packet was not received on, it will
> > > be dropped. Again, for copy-mode, it would from a theoretical
> > > perspective be perfectly fine to redirect to another queue id and/or
> > > netdev since the packet is copied anyway. Maybe you want to add
> > > support for it :-).
> > >
> > > > I must mention that I am using the XDP skb mode with copies.
> > > >
> > > > Thank you in advance,
> > > > Kostis
> >
> >
> >
> > --
> > Kostis Kaffes
> > PhD Student in Electrical Engineering
> > Stanford University



-- 
Kostis Kaffes
PhD Student in Electrical Engineering
Stanford University


* Re: AF_XDP sockets across multiple NIC queues
  2021-03-30  5:32       ` Konstantinos Kaffes
@ 2021-03-30  6:17         ` Magnus Karlsson
  2021-03-30  6:21           ` Magnus Karlsson
  0 siblings, 1 reply; 9+ messages in thread
From: Magnus Karlsson @ 2021-03-30  6:17 UTC (permalink / raw)
  To: Konstantinos Kaffes; +Cc: Xdp

On Tue, Mar 30, 2021 at 7:32 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
>
> On Fri, 26 Mar 2021 at 00:36, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> >
> > On Thu, Mar 25, 2021 at 7:51 PM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > >
> > > Great, thanks for the info! I will look into implementing this.
> > >
> > > For the time being, I implemented a version of my design with N^2
> > > sockets. I observed that when all traffic is directed to a single NIC
> > > queue, the throughput is higher than when I use all N NIC queues. I am
> > > using spinlocks to guard concurrent access to UMEM and the
> > > fill/completion rings. When I use a single NIC queue, I achieve
> > > ~1Mpps; when I use multiple ~550Kpps. Are these numbers reasonable,
> > > and this bad scaling behavior expected?
> >
> > 1Mpps sounds reasonable with SKB mode. If you use something simple
> > like the spinlock scheme you describe, then it will not scale. Check
> > the sample xsk_fwd.c in samples/bpf in the Linux kernel repo. It has a
> > mempool implementation that should scale better than the one you
> > implemented. For anything remotely complicated, something that manages
> > the buffers in the umem plus the fill and completion queues is usually
> > required. This is called a mempool most of the time. User-space
> > network libraries such as DPDK and VPP provide fast and scalable
> > mempool implementations. It would be nice to add a simple one to
> > libbpf, or rather libxdp as the AF_XDP functionality is moving over
> > there. Several people have asked for it, but unfortunately I have not
> > had the time.
> >
>
> Thanks for the tip! I have also started trying zero-copy DRV mode and
> came across a weird behavior. When I am using multiple sockets, one
> for each NIC queue, I observe very low throughput and a lot of time
> spent on the following loop:
>
> uint32_t idx_cq;
> while (ret < buf_count) {
>   ret += xsk_ring_cons__peek(&xsk->umem->cq, buf_count, &idx_cq);
> }

This is very likely a naïve and unscalable implementation from my
side, or maybe from you or someone else, since I do not know where it
comes from :-). Here you are waiting for the completion ring to have
a certain number of entries (buf_count) before moving on. Work with
what you get instead of trying to get a certain amount. Also check
where the driver code for each queue id is running: are the queues
evenly spread out or all on the same core? htop is an easy way to
find out. It seems that your completion rate is bounded and does not
scale with the number of queue ids. It might be the case that Tx
driver processing is occurring on one core. That is at least worth
examining; I would do it first before changing the logic above.
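
Roughly, the loop could be turned into something like the sketch
below (placeholder struct fields and batch size, assuming the libbpf
xsk.h helpers): release however many completions are there right now
and move on, instead of spinning until buf_count entries have
accumulated.

#include <stdint.h>
#include <bpf/xsk.h>

#define CQ_BATCH 64

/* Mirrors only the fields the snippet above touches; the real layout
 * in the application is likely different. */
struct xsk_umem_info {
	struct xsk_umem *umem;
	struct xsk_ring_cons cq;
};

struct xsk_socket_info {
	struct xsk_umem_info *umem;
	uint32_t outstanding_tx;   /* hypothetical bookkeeping field */
};

static void drain_completions(struct xsk_socket_info *xsk)
{
	uint32_t idx_cq;
	uint32_t done;

	done = xsk_ring_cons__peek(&xsk->umem->cq, CQ_BATCH, &idx_cq);
	if (!done)
		return;            /* nothing yet; go do other work */

	/* The completed frame addresses at idx_cq .. idx_cq + done - 1
	 * can be recycled here, e.g. handed back to the fill ring or a
	 * mempool, via xsk_ring_cons__comp_addr(&xsk->umem->cq, ...). */

	xsk_ring_cons__release(&xsk->umem->cq, done);
	xsk->outstanding_tx -= done;
}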

/Magnus

> This does not happen when I have only one XDP socket bound to a single queue.
>
> Any idea on why this might be happening?
>
> > >
> > > On Thu, 25 Mar 2021 at 00:24, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > > >
> > > > > Hello everyone,
> > > > >
> > > > > I want to write a multi-threaded AF_XDP server where all N threads can
> > > > > read from all N NIC queues. In my design, each thread creates N AF_XDP
> > > > > sockets, each associated with a different queue. I have the following
> > > > > questions:
> > > > >
> > > > > 1. Do sockets associated with the same queue need to share their UMEM
> > > > > area and fill and completion rings?
> > > >
> > > > Yes. In zero-copy mode this is natural since the NIC HW will DMA the
> > > > packet into a umem that was decided long before the packet was even
> > > > received. And this is of course before we even get to pick what socket
> > > > it should go to. This restriction is currently carried over to
> > > > copy-mode, however, in theory there is nothing preventing supporting
> > > > multiple umems on the same netdev and queue id in copy-mode. It is
> > > > just that nobody has implemented support for it.
> > > >
> > > > > 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> > > > > happens if my XDP program redirects a packet to a socket that is
> > > > > associated with a different NIC queue than the one in which the packet
> > > > > arrived?
> > > >
> > > > You can have multiple XSKMAPs but you would in any case have to have
> > > > N^2 sockets in total to be able to cover all cases. Sockets are tied
> > > > to a specific netdev and queue id. If you try to redirect to socket
> > > > with a queue id or netdev that the packet was not received on, it will
> > > > be dropped. Again, for copy-mode, it would from a theoretical
> > > > perspective be perfectly fine to redirect to another queue id and/or
> > > > netdev since the packet is copied anyway. Maybe you want to add
> > > > support for it :-).
> > > >
> > > > > I must mention that I am using the XDP skb mode with copies.
> > > > >
> > > > > Thank you in advance,
> > > > > Kostis
> > >
> > >
> > >
> > > --
> > > Kostis Kaffes
> > > PhD Student in Electrical Engineering
> > > Stanford University
>
>
>
> --
> Kostis Kaffes
> PhD Student in Electrical Engineering
> Stanford University


* Re: AF_XDP sockets across multiple NIC queues
  2021-03-30  6:17         ` Magnus Karlsson
@ 2021-03-30  6:21           ` Magnus Karlsson
  2021-03-30  6:29             ` Konstantinos Kaffes
  0 siblings, 1 reply; 9+ messages in thread
From: Magnus Karlsson @ 2021-03-30  6:21 UTC (permalink / raw)
  To: Konstantinos Kaffes; +Cc: Xdp

On Tue, Mar 30, 2021 at 8:17 AM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Tue, Mar 30, 2021 at 7:32 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> >
> > On Fri, 26 Mar 2021 at 00:36, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > >
> > > On Thu, Mar 25, 2021 at 7:51 PM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > >
> > > > Great, thanks for the info! I will look into implementing this.
> > > >
> > > > For the time being, I implemented a version of my design with N^2
> > > > sockets. I observed that when all traffic is directed to a single NIC
> > > > queue, the throughput is higher than when I use all N NIC queues. I am
> > > > using spinlocks to guard concurrent access to UMEM and the
> > > > fill/completion rings. When I use a single NIC queue, I achieve
> > > > ~1Mpps; when I use multiple ~550Kpps. Are these numbers reasonable,
> > > > and this bad scaling behavior expected?
> > >
> > > 1Mpps sounds reasonable with SKB mode. If you use something simple
> > > like the spinlock scheme you describe, then it will not scale. Check
> > > the sample xsk_fwd.c in samples/bpf in the Linux kernel repo. It has a
> > > mempool implementation that should scale better than the one you
> > > implemented. For anything remotely complicated, something that manages
> > > the buffers in the umem plus the fill and completion queues is usually
> > > required. This is called a mempool most of the time. User-space
> > > network libraries such as DPDK and VPP provide fast and scalable
> > > mempool implementations. It would be nice to add a simple one to
> > > libbpf, or rather libxdp as the AF_XDP functionality is moving over
> > > there. Several people have asked for it, but unfortunately I have not
> > > had the time.
> > >
> >
> > Thanks for the tip! I have also started trying zero-copy DRV mode and
> > came across a weird behavior. When I am using multiple sockets, one
> > for each NIC queue, I observe very low throughput and a lot of time
> > spent on the following loop:
> >
> > uint32_t idx_cq;
> > while (ret < buf_count) {
> >   ret += xsk_ring_cons__peek(&xsk->umem->cq, buf_count, &idx_cq);
> > }
>
> This is very likely a naïve and unscalable implementation from my
> side, or maybe from you or someone else since I do not know where it
> comes from :-). Here you are waiting for the completing ring to have a
> certain amount of entries (buf_count) to move on. Work with what you
> get instead of trying to get a certain amount.

Another good tactic is to just go and do something else if you do not
get buf_count, then come back later and try again. Do not waste your
cycles doing nothing.

> Also check where your
> driver code for each queue id is running. Are they evenly spread out
> or on the same core? htop is an easy way to find out. It seems that
> your completion rate is bounded and does not scale with number of
> queue ids. Might be the case that Tx driver processing is occurring on
> one core. At least worth examining. I would do that first before
> changing the logic above.
>
> /Magnus
>
> > This does not happen when I have only one XDP socket bound to a single queue.
> >
> > Any idea on why this might be happening?
> >
> > > >
> > > > On Thu, 25 Mar 2021 at 00:24, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > >
> > > > > On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > I want to write a multi-threaded AF_XDP server where all N threads can
> > > > > > read from all N NIC queues. In my design, each thread creates N AF_XDP
> > > > > > sockets, each associated with a different queue. I have the following
> > > > > > questions:
> > > > > >
> > > > > > 1. Do sockets associated with the same queue need to share their UMEM
> > > > > > area and fill and completion rings?
> > > > >
> > > > > Yes. In zero-copy mode this is natural since the NIC HW will DMA the
> > > > > packet into a umem that was decided long before the packet was even
> > > > > received. And this is of course before we even get to pick what socket
> > > > > it should go to. This restriction is currently carried over to
> > > > > copy-mode, however, in theory there is nothing preventing supporting
> > > > > multiple umems on the same netdev and queue id in copy-mode. It is
> > > > > just that nobody has implemented support for it.
> > > > >
> > > > > > 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> > > > > > happens if my XDP program redirects a packet to a socket that is
> > > > > > associated with a different NIC queue than the one in which the packet
> > > > > > arrived?
> > > > >
> > > > > You can have multiple XSKMAPs but you would in any case have to have
> > > > > N^2 sockets in total to be able to cover all cases. Sockets are tied
> > > > > to a specific netdev and queue id. If you try to redirect to socket
> > > > > with a queue id or netdev that the packet was not received on, it will
> > > > > be dropped. Again, for copy-mode, it would from a theoretical
> > > > > perspective be perfectly fine to redirect to another queue id and/or
> > > > > netdev since the packet is copied anyway. Maybe you want to add
> > > > > support for it :-).
> > > > >
> > > > > > I must mention that I am using the XDP skb mode with copies.
> > > > > >
> > > > > > Thank you in advance,
> > > > > > Kostis
> > > >
> > > >
> > > >
> > > > --
> > > > Kostis Kaffes
> > > > PhD Student in Electrical Engineering
> > > > Stanford University
> >
> >
> >
> > --
> > Kostis Kaffes
> > PhD Student in Electrical Engineering
> > Stanford University


* Re: AF_XDP sockets across multiple NIC queues
  2021-03-30  6:21           ` Magnus Karlsson
@ 2021-03-30  6:29             ` Konstantinos Kaffes
  2021-03-30  6:44               ` Magnus Karlsson
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantinos Kaffes @ 2021-03-30  6:29 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: Xdp

On Mon, 29 Mar 2021 at 23:21, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
> On Tue, Mar 30, 2021 at 8:17 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 7:32 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > >
> > > On Fri, 26 Mar 2021 at 00:36, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 25, 2021 at 7:51 PM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > > >
> > > > > Great, thanks for the info! I will look into implementing this.
> > > > >
> > > > > For the time being, I implemented a version of my design with N^2
> > > > > sockets. I observed that when all traffic is directed to a single NIC
> > > > > queue, the throughput is higher than when I use all N NIC queues. I am
> > > > > using spinlocks to guard concurrent access to UMEM and the
> > > > > fill/completion rings. When I use a single NIC queue, I achieve
> > > > > ~1Mpps; when I use multiple ~550Kpps. Are these numbers reasonable,
> > > > > and this bad scaling behavior expected?
> > > >
> > > > 1Mpps sounds reasonable with SKB mode. If you use something simple
> > > > like the spinlock scheme you describe, then it will not scale. Check
> > > > the sample xsk_fwd.c in samples/bpf in the Linux kernel repo. It has a
> > > > mempool implementation that should scale better than the one you
> > > > implemented. For anything remotely complicated, something that manages
> > > > the buffers in the umem plus the fill and completion queues is usually
> > > > required. This is called a mempool most of the time. User-space
> > > > network libraries such as DPDK and VPP provide fast and scalable
> > > > mempool implementations. It would be nice to add a simple one to
> > > > libbpf, or rather libxdp as the AF_XDP functionality is moving over
> > > > there. Several people have asked for it, but unfortunately I have not
> > > > had the time.
> > > >
> > >
> > > Thanks for the tip! I have also started trying zero-copy DRV mode and
> > > came across a weird behavior. When I am using multiple sockets, one
> > > for each NIC queue, I observe very low throughput and a lot of time
> > > spent on the following loop:
> > >
> > > uint32_t idx_cq;
> > > while (ret < buf_count) {
> > >   ret += xsk_ring_cons__peek(&xsk->umem->cq, buf_count, &idx_cq);
> > > }
> >
> > This is very likely a naïve and unscalable implementation from my
> > side, or maybe from you or someone else since I do not know where it
> > comes from :-). Here you are waiting for the completing ring to have a
> > certain amount of entries (buf_count) to move on. Work with what you
> > get instead of trying to get a certain amount.
>
> Another good tactic is to just go and do something else if you do not
> get buf_count, then come back later and try again. Do not waste your
> cycles doing nothing.
>

The way my application is designed - which is obviously not optimal -
I need to confirm that the packets have been sent and the buffers
released before doing other work. The problem is not so much that the
performance does not scale; it is that with 2 sockets/queues the
performance is 300x worse than in the single-socket/queue case.

> > Also check where your
> > driver code for each queue id is running. Are they evenly spread out
> > or on the same core? htop is an easy way to find out. It seems that
> > your completion rate is bounded and does not scale with number of
> > queue ids. Might be the case that Tx driver processing is occurring on
> > one core. At least worth examining. I would do that first before
> > changing the logic above.
> >

I have checked that the TX driver for each queue is running on a
different core. In any case, is it possible to adjust where the TX
driver runs?


> > /Magnus
> >
> > > This does not happen when I have only one XDP socket bound to a single queue.
> > >
> > > Any idea on why this might be happening?
> > >
> > > > >
> > > > > On Thu, 25 Mar 2021 at 00:24, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > > > > >
> > > > > > > Hello everyone,
> > > > > > >
> > > > > > > I want to write a multi-threaded AF_XDP server where all N threads can
> > > > > > > read from all N NIC queues. In my design, each thread creates N AF_XDP
> > > > > > > sockets, each associated with a different queue. I have the following
> > > > > > > questions:
> > > > > > >
> > > > > > > 1. Do sockets associated with the same queue need to share their UMEM
> > > > > > > area and fill and completion rings?
> > > > > >
> > > > > > Yes. In zero-copy mode this is natural since the NIC HW will DMA the
> > > > > > packet into a umem that was decided long before the packet was even
> > > > > > received. And this is of course before we even get to pick what socket
> > > > > > it should go to. This restriction is currently carried over to
> > > > > > copy-mode, however, in theory there is nothing preventing supporting
> > > > > > multiple umems on the same netdev and queue id in copy-mode. It is
> > > > > > just that nobody has implemented support for it.
> > > > > >
> > > > > > > 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> > > > > > > happens if my XDP program redirects a packet to a socket that is
> > > > > > > associated with a different NIC queue than the one in which the packet
> > > > > > > arrived?
> > > > > >
> > > > > > You can have multiple XSKMAPs but you would in any case have to have
> > > > > > N^2 sockets in total to be able to cover all cases. Sockets are tied
> > > > > > to a specific netdev and queue id. If you try to redirect to socket
> > > > > > with a queue id or netdev that the packet was not received on, it will
> > > > > > be dropped. Again, for copy-mode, it would from a theoretical
> > > > > > perspective be perfectly fine to redirect to another queue id and/or
> > > > > > netdev since the packet is copied anyway. Maybe you want to add
> > > > > > support for it :-).
> > > > > >
> > > > > > > I must mention that I am using the XDP skb mode with copies.
> > > > > > >
> > > > > > > Thank you in advance,
> > > > > > > Kostis
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Kostis Kaffes
> > > > > PhD Student in Electrical Engineering
> > > > > Stanford University
> > >
> > >
> > >
> > > --
> > > Kostis Kaffes
> > > PhD Student in Electrical Engineering
> > > Stanford University



--
Kostis Kaffes
PhD Student in Electrical Engineering
Stanford University


* Re: AF_XDP sockets across multiple NIC queues
  2021-03-30  6:29             ` Konstantinos Kaffes
@ 2021-03-30  6:44               ` Magnus Karlsson
  0 siblings, 0 replies; 9+ messages in thread
From: Magnus Karlsson @ 2021-03-30  6:44 UTC (permalink / raw)
  To: Konstantinos Kaffes; +Cc: Xdp

On Tue, Mar 30, 2021 at 8:29 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
>
> On Mon, 29 Mar 2021 at 23:21, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 8:17 AM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> > >
> > > On Tue, Mar 30, 2021 at 7:32 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > >
> > > > On Fri, 26 Mar 2021 at 00:36, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > >
> > > > > On Thu, Mar 25, 2021 at 7:51 PM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > > > >
> > > > > > Great, thanks for the info! I will look into implementing this.
> > > > > >
> > > > > > For the time being, I implemented a version of my design with N^2
> > > > > > sockets. I observed that when all traffic is directed to a single NIC
> > > > > > queue, the throughput is higher than when I use all N NIC queues. I am
> > > > > > using spinlocks to guard concurrent access to UMEM and the
> > > > > > fill/completion rings. When I use a single NIC queue, I achieve
> > > > > > ~1Mpps; when I use multiple ~550Kpps. Are these numbers reasonable,
> > > > > > and this bad scaling behavior expected?
> > > > >
> > > > > 1Mpps sounds reasonable with SKB mode. If you use something simple
> > > > > like the spinlock scheme you describe, then it will not scale. Check
> > > > > the sample xsk_fwd.c in samples/bpf in the Linux kernel repo. It has a
> > > > > mempool implementation that should scale better than the one you
> > > > > implemented. For anything remotely complicated, something that manages
> > > > > the buffers in the umem plus the fill and completion queues is usually
> > > > > required. This is called a mempool most of the time. User-space
> > > > > network libraries such as DPDK and VPP provide fast and scalable
> > > > > mempool implementations. It would be nice to add a simple one to
> > > > > libbpf, or rather libxdp as the AF_XDP functionality is moving over
> > > > > there. Several people have asked for it, but unfortunately I have not
> > > > > had the time.
> > > > >
> > > >
> > > > Thanks for the tip! I have also started trying zero-copy DRV mode and
> > > > came across a weird behavior. When I am using multiple sockets, one
> > > > for each NIC queue, I observe very low throughput and a lot of time
> > > > spent on the following loop:
> > > >
> > > > uint32_t idx_cq;
> > > > while (ret < buf_count) {
> > > >   ret += xsk_ring_cons__peek(&xsk->umem->cq, buf_count, &idx_cq);
> > > > }
> > >
> > > This is very likely a naïve and unscalable implementation from my
> > > side, or maybe from you or someone else since I do not know where it
> > > comes from :-). Here you are waiting for the completing ring to have a
> > > certain amount of entries (buf_count) to move on. Work with what you
> > > get instead of trying to get a certain amount.
> >
> > Another good tactic is to just go and do something else if you do not
> > get buf_count, then come back later and try again. Do not waste your
> > cycles doing nothing.
> >
>
> The way my application is designed - which is obviously not optimal -
> I need to confirm that the packets are sent and the buffers are
> released before doing other stuff. The problem is not so much that the
> performance does not scale. It is more like that with 2 sockets/queues
> the performance is 300x worse than in the single-socket/queue case.

Ouch, that sounds awful. What driver are you running?

> > > Also check where your
> > > driver code for each queue id is running. Are they evenly spread out
> > > or on the same core? htop is an easy way to find out. It seems that
> > > your completion rate is bounded and does not scale with number of
> > > queue ids. Might be the case that Tx driver processing is occurring on
> > > one core. At least worth examining. I would do that first before
> > > changing the logic above.
> > >
>
> I have checked that the TX driver for each queue is running on a
> different core. In any case, is it possible to adjust where the TX
> driver runs?

Usually, Rx and Tx processing for a specific queue is done on the same
core since they are bundled in the same napi context. You can set irq
affinity so that the interrupt associated with the queue id in
question is only allowed to fire on the core you want. This way it
will get pegged on one core. Another more scalable way would be to use
busy-poll [1]. In this mode, the Rx and Tx processing is done on the
same core as the application that uses the socket. The driver
processing is usually not triggered by interrupts; instead it is
triggered by the application doing recvmsg, sendto, and/or poll
syscalls. Check out the xdpsock_user.c sample in kernel 5.11 or later
for an example.

[1] https://lwn.net/Articles/837010/
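
In code, enabling preferred busy polling looks roughly like the
sketch below (based on what xdpsock_user.c does; the timeout and
budget values are only examples, and a 5.11+ kernel is required).
Pinning interrupts instead would be done outside the application,
e.g. by writing a CPU mask to /proc/irq/<irq>/smp_affinity.

/* Enable preferred busy polling on an AF_XDP socket so driver Rx/Tx
 * processing runs on the application core (sketch, example values).
 * SO_PREFER_BUSY_POLL and SO_BUSY_POLL_BUDGET may not be in older
 * libc headers, hence the fallback definitions from the uapi. */
#include <sys/socket.h>
#include <bpf/xsk.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

static int enable_busy_poll(struct xsk_socket *xsk)
{
	int fd = xsk_socket__fd(xsk);
	int on = 1;
	int usecs = 20;    /* how long recvmsg/sendto may busy poll */
	int budget = 64;   /* packets processed per busy-poll call */

	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
		       &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
		       &usecs, sizeof(usecs)) < 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
			  &budget, sizeof(budget));
}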

/Magnus

>
> > > /Magnus
> > >
> > > > This does not happen when I have only one XDP socket bound to a single queue.
> > > >
> > > > Any idea on why this might be happening?
> > > >
> > > > > >
> > > > > > On Thu, 25 Mar 2021 at 00:24, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello everyone,
> > > > > > > >
> > > > > > > > I want to write a multi-threaded AF_XDP server where all N threads can
> > > > > > > > read from all N NIC queues. In my design, each thread creates N AF_XDP
> > > > > > > > sockets, each associated with a different queue. I have the following
> > > > > > > > questions:
> > > > > > > >
> > > > > > > > 1. Do sockets associated with the same queue need to share their UMEM
> > > > > > > > area and fill and completion rings?
> > > > > > >
> > > > > > > Yes. In zero-copy mode this is natural since the NIC HW will DMA the
> > > > > > > packet into a umem that was decided long before the packet was even
> > > > > > > received. And this is of course before we even get to pick what socket
> > > > > > > it should go to. This restriction is currently carried over to
> > > > > > > copy-mode, however, in theory there is nothing preventing supporting
> > > > > > > multiple umems on the same netdev and queue id in copy-mode. It is
> > > > > > > just that nobody has implemented support for it.
> > > > > > >
> > > > > > > > 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> > > > > > > > happens if my XDP program redirects a packet to a socket that is
> > > > > > > > associated with a different NIC queue than the one in which the packet
> > > > > > > > arrived?
> > > > > > >
> > > > > > > You can have multiple XSKMAPs but you would in any case have to have
> > > > > > > N^2 sockets in total to be able to cover all cases. Sockets are tied
> > > > > > > to a specific netdev and queue id. If you try to redirect to socket
> > > > > > > with a queue id or netdev that the packet was not received on, it will
> > > > > > > be dropped. Again, for copy-mode, it would from a theoretical
> > > > > > > perspective be perfectly fine to redirect to another queue id and/or
> > > > > > > netdev since the packet is copied anyway. Maybe you want to add
> > > > > > > support for it :-).
> > > > > > >
> > > > > > > > I must mention that I am using the XDP skb mode with copies.
> > > > > > > >
> > > > > > > > Thank you in advance,
> > > > > > > > Kostis
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Kostis Kaffes
> > > > > > PhD Student in Electrical Engineering
> > > > > > Stanford University
> > > >
> > > >
> > > >
> > > > --
> > > > Kostis Kaffes
> > > > PhD Student in Electrical Engineering
> > > > Stanford University
>
>
>
> --
> Kostis Kaffes
> PhD Student in Electrical Engineering
> Stanford University


Thread overview: 9 messages
2021-03-25  6:24 AF_XDP sockets across multiple NIC queues Konstantinos Kaffes
2021-03-25  7:24 ` Magnus Karlsson
2021-03-25 18:51   ` Konstantinos Kaffes
2021-03-26  7:36     ` Magnus Karlsson
2021-03-30  5:32       ` Konstantinos Kaffes
2021-03-30  6:17         ` Magnus Karlsson
2021-03-30  6:21           ` Magnus Karlsson
2021-03-30  6:29             ` Konstantinos Kaffes
2021-03-30  6:44               ` Magnus Karlsson
