From: Andy Gospodarek <andy@greyhouse.net>
To: Michael Chan <michael.chan@broadcom.com>
Cc: John Fastabend <john.fastabend@gmail.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	"Duyck, Alexander H" <alexander.h.duyck@intel.com>,
	"pstaszewski@itcare.pl" <pstaszewski@itcare.pl>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"xdp-newbies@vger.kernel.org" <xdp-newbies@vger.kernel.org>,
	"borkmann@iogearbox.net" <borkmann@iogearbox.net>
Subject: Re: XDP redirect measurements, gotchas and tracepoints
Date: Mon, 28 Aug 2017 12:02:37 -0400
Message-ID: <20170828160237.GA70910@C02RW35GFVH8.dhcp.broadcom.net>
In-Reply-To: <CACKFLin-sZEkkCxE2iZRHjgG=K36OJ65B-xrEYVCCSQqhYsH-g@mail.gmail.com>

On Fri, Aug 25, 2017 at 08:28:55AM -0700, Michael Chan wrote:
> On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
> <john.fastabend@gmail.com> wrote:
> > On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
> >> On Thu, 24 Aug 2017 20:36:28 -0700
> >> Michael Chan <michael.chan@broadcom.com> wrote:
> >>
> >>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
> >>> <brouer@redhat.com> wrote:
> >>>> On Tue, 22 Aug 2017 23:59:05 -0700
> >>>> Michael Chan <michael.chan@broadcom.com> wrote:
> >>>>
> >>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
> >>>>> <alexander.duyck@gmail.com> wrote:
> >>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
> >>>>>>>
> >>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
> >>>>>>> the input device, right?
> >>>>
> >>>> Yes, I would really like to see an API like this.
> >>>>
> >>>>>>
> >>>>>> You could, it is just added complexity. "just free the buffer" in
> >>>>>> ixgbe usually just amounts to one atomic operation to decrement the
> >>>>>> total page count since page recycling is already implemented in the
> >>>>>> driver. You still would have to unmap the buffer regardless of if you
> >>>>>> were recycling it or not so all you would save is 1.000015259 atomic
> >>>>>> operations per packet. The fraction is because once every 64K uses we
> >>>>>> have to bulk update the count on the page.
> >>>>>>
> >>>>>
> >>>>> If the buffer is returned to the input device, the input device can
> >>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
> >>>>> the input device when the buffer is returned.
> >>>>
> >>>> Yes, exactly, return to the input device. I really think we should
> >>>> work on a solution where we can keep the DMA mapping around.  We have
> >>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
> >>>> page return call, to achieve this. (I imagine other arch's have a high
> >>>> DMA overhead than Intel)
> >>>>
> >>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
> >>>> splitting the page (into two packets) actually complicates things, and
> >>>> tie us into a page-refcnt based model.  We could get around this by
> >>>> each driver implementing a page-return-callback, that allow us to
> >>>> return the page to the input device?  Then, drivers implementing the
> >>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
> >>>> "1" DMA-sync and reuse it in the RX queue.
> >>>>
> >>>
> >>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
> >>> redirecting to a non-intel NIC or vice versa will actually work.  It
> >>> sounds like the output device has to make some assumptions about how
> >>> the page was allocated by the input device.
> >>
> >> Yes, exactly. We are tied into a page refcnt based scheme.
> >>
> >> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
> >> is also tied to the RX queue size, plus how fast the pages are returned.
> >> This makes it very hard to tune.  As I demonstrated, default ixgbe
> >> settings does not work well with XDP_REDIRECT.  I needed to increase
> >> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
> >> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
> >> RX-ring size is smaller, thus two contradicting tuning needed.
> >>
> >
> > The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
> > split into two halves being the default) from the number of descriptors
> > doesn't look too bad IMO. It seems like it could be done by having some
> > extra pages allocated upfront and pulling those in when we need another
> > page.
> >
> > This would be a nice iterative step we could take on the existing API.
> >
> >>
> >>> With buffer return API,
> >>> each driver can cleanly recycle or free its own buffers properly.
> >>
> >> Yes, exactly. And RX-driver can implement a special memory model for
> >> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
> >> which is never used for SKBs, thus opening for new RX memory models.
> >>
> >> Another advantage of a return API.  There is also an opportunity for
> >> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
> >> we can add a DMA API, where we can query if the two devices uses the
> >> same DMA engine, and can reuse the same DMA address the RX-side already
> >> knows.
> >>
> >>
> >>> Let me discuss this further with Andy to see if we can come up with a
> >>> good scheme.
> >>
> >> Sound good, looking forward to hear what you come-up with :-)
> >>
> >
> > I guess by this thread we will see a broadcom nic with redirect support
> > soon ;)
> 
> Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
> buffer recycling scheme has some problems.  We can make it work for
> Broadcom to Broadcom only, but we want a better solution.

(Sorry for the radio silence, I was AFK last week...)

I finished it a little while ago, but Michael and I both have concerns
that in a heterogeneous hardware setup one can quickly run into issues,
and we haven't had time to work up a few solutions before bringing this
up formally.  It also isn't a major problem until the second
optimized/native XDP driver appears on the scene.

I can run a test where XDP redirects between ixgbe <-> bnxt_en based
devices, and I get OOM kills after only a few seconds.  This is due to
the lack of feedback between the different drivers that the buffer
pointed to by xdp->data can be freed/reused/etc, and to the different
buffer allocation schemes used.

Initially I did not think this was an issue and that xdp_do_flush_map()
would handle this, but I think there is still a need to be able to
signal back to the receiving device that the buffer it allocated has
been xmitted by the transmitter and can be freed.  There is really no
guarantee that completion of an XDP_REDIRECT action means that it is
safe to free the area pointed to by xdp->data that contains the packet
to be xmitted.  Since the packet-done interrupt handler in a driver
cannot signal back to the receiving driver that the buffer is now safe
to reuse/free, there is a chance for trouble.

I was hoping to spend some time this week cooking up a patch that just
did not allow use of XDP_REDIRECT when the ifindex of the outgoing
device did not match that of the device to which the XDP prog was
attached, but that probably is not worth the trouble when we would just
fix it for real.  (It would also require some really terrible hacks to
enforce this in the kernel when all that is being done is setting up a
map that contains the redirect table, so it is probably not useful.)

The basic prototype would be something like this:

(rx packet interrupt on eth0, leads to napi_poll)
napi_poll (eth0)
  call xdp_prog (eth0)
    xdp_do_redirect (eth0)
      ndo_xdp_xmit (eth1)
      mark buffer with information netdev/ring/etc
      place buffer on tx ring for eth1

(tx done interrupt on eth1, leads to napi_poll)
napi_poll (eth1)
  process tx interrupt (eth1)
    look up information about netdev/ring/etc
    ndo_xdp_data_free (eth0, ring, etc)
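
To make the flow above a bit more concrete, here is a rough userspace
sketch in C.  It is not kernel code and not the bnxt_en implementation:
every name in it (sim_netdev, sim_buf, sim_xdp_xmit, sim_tx_done, the
xdp_data_free callback standing in for the proposed ndo_xdp_data_free)
is made up for illustration, and "freeing" a buffer is reduced to
bumping a counter on the owning device.  The only point is that the
buffer is tagged with its origin at redirect time, and that TX
completion hands it back to that origin rather than freeing it itself.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_RING_SIZE 4

struct sim_netdev;

/* Hypothetical tag: "mark buffer with information netdev/ring/etc". */
struct sim_buf {
	void *data;                /* stands in for xdp->data */
	struct sim_netdev *rx_dev; /* device that allocated the buffer */
	int rx_ring;               /* RX ring the buffer came from */
};

struct sim_netdev {
	const char *name;
	int free_bufs;             /* buffers this device may reuse for RX */
	struct sim_buf *tx_ring[SIM_RING_SIZE];
	int tx_count;
	/* Hypothetical stand-in for the proposed ndo_xdp_data_free. */
	void (*xdp_data_free)(struct sim_netdev *dev, int ring, void *data);
};

/* Owner-side return hook: a real driver would recycle or free here. */
static void sim_data_free(struct sim_netdev *dev, int ring, void *data)
{
	(void)ring;
	(void)data;
	dev->free_bufs++;
}

/* RX path: consume an RX buffer, tag it, place it on the TX ring. */
static int sim_xdp_xmit(struct sim_netdev *rx, int ring,
			struct sim_netdev *tx, struct sim_buf *buf,
			void *data)
{
	if (rx->free_bufs == 0 || tx->tx_count == SIM_RING_SIZE)
		return -1;

	rx->free_bufs--;
	buf->data = data;
	buf->rx_dev = rx;
	buf->rx_ring = ring;
	tx->tx_ring[tx->tx_count++] = buf;
	return 0;
}

/* TX-done path: look up the tag and return each buffer to its owner. */
static int sim_tx_done(struct sim_netdev *tx)
{
	int returned = 0;

	while (tx->tx_count > 0) {
		struct sim_buf *buf = tx->tx_ring[--tx->tx_count];

		buf->rx_dev->xdp_data_free(buf->rx_dev, buf->rx_ring,
					   buf->data);
		returned++;
	}
	return returned;
}

/* Redirect three packets eth0 -> eth1, then complete them on eth1. */
int sim_demo(void)
{
	static char pkt[3][64];
	struct sim_buf bufs[3];
	struct sim_netdev eth0 = { .name = "eth0", .free_bufs = 3,
				   .xdp_data_free = sim_data_free };
	struct sim_netdev eth1 = { .name = "eth1", .free_bufs = 0,
				   .xdp_data_free = sim_data_free };
	int i;

	for (i = 0; i < 3; i++)
		if (sim_xdp_xmit(&eth0, 0, &eth1, &bufs[i], pkt[i]) != 0)
			return -1;

	/* eth0 has no buffers left until eth1 reports completion. */
	assert(eth0.free_bufs == 0);

	sim_tx_done(&eth1);
	return eth0.free_bufs;
}
```

In this toy model eth0 runs out of buffers as soon as all three are in
flight and only gets them back once eth1 completes them, which is the
feedback that is missing between drivers today.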

Thoughts?
