From: Jesper Dangaard Brouer <brouer@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: bjorn.topel@gmail.com, jasowang@redhat.com, ast@fb.com,
	alexander.duyck@gmail.com, john.r.fastabend@intel.com,
	netdev@vger.kernel.org, brouer@redhat.com
Subject: Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
Date: Tue, 31 Jan 2017 13:20:15 +0100	[thread overview]
Message-ID: <20170131132015.68e7d631@redhat.com> (raw)
In-Reply-To: <588FB57B.1060507@gmail.com>


(next submission fix subj. ineterface -> interface)

On Mon, 30 Jan 2017 13:51:55 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 17-01-30 10:16 AM, Jesper Dangaard Brouer wrote:
> > On Fri, 27 Jan 2017 13:33:44 -0800 John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> >> This adds ndo ops for upper layer objects to request direct DMA from
> >> the network interface into memory "slots". The slots must be DMA'able
> >> memory given by a page/offset/size vector in a packet_ring_buffer
> >> structure.
> >>
> >> The PF_PACKET socket interface can use these ndo_ops to do zerocopy
> >> RX from the network device into memory mapped userspace memory. For
> >> this to work drivers encode the correct descriptor blocks and headers
> >> so that existing PF_PACKET applications work without any modification.
> >> This only supports the V2 header formats for now, and works by mapping
> >> a ring of the network device to these slots. Originally I used the V3
> >> header formats but this does complicate the driver a bit.
> >>
> >> V3 header formats added bulk polling via socket calls and timers
> >> used in the polling interface to return every n milliseconds. Currently,
> >> I don't see any way to support this in hardware because we can't
> >> know if the hardware is in the middle of a DMA operation or not
> >> on a slot. So when a timer fires I don't know how to advance the
> >> descriptor ring leaving empty descriptors similar to how the software
> >> ring works. The easiest (best?) route is to simply not support this.  
> > 
> > From a performance pov bulking is essential. Systems like netmap, which
> > also depend on transferring control between kernel and userspace,
> > report[1] that they need a bulking size of at least 8 to amortize the
> > overhead.
> >   
> 
> Bulking in general is not a problem; it can be done on top of the V2
> implementation without issue. 

Good.

Notice that the type of bulking I'm looking for can indirectly be
achieved via a queue, as long as there isn't a syscall per dequeue
involved. Looking at some af_packet example code, and your description
below, it looks like af_packet is doing exactly what is needed, as it is
sync/spinning on a block_status bit.  (Lessons learned from ptr_ring
indicate this might actually be more efficient.)
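
To make it concrete, below is roughly the userspace consume loop I have
in mind on top of a mapped V2 ring, spinning on the per-frame status
word.  (Untested sketch; the ring struct and the handle() callback are
just placeholders, and a real spin loop would want a proper
volatile/acquire load of tp_status.)

#include <linux/if_packet.h>
#include <stdint.h>
#include <stddef.h>

struct ring {
	uint8_t *map;            /* mmap()'ed PACKET_RX_RING area        */
	unsigned int frame_size; /* tp_frame_size used at setup time     */
	unsigned int frame_nr;   /* tp_frame_nr used at setup time       */
	unsigned int head;       /* next frame we expect from the kernel */
};

static unsigned int drain_ring(struct ring *r,
			       void (*handle)(const uint8_t *pkt,
					      unsigned int len))
{
	unsigned int bulk = 0;

	for (;;) {
		struct tpacket2_hdr *hdr = (struct tpacket2_hdr *)
			(r->map + (size_t)r->head * r->frame_size);

		if (!(hdr->tp_status & TP_STATUS_USER))
			break;             /* slot still owned by kernel/NIC */

		handle((const uint8_t *)hdr + hdr->tp_mac, hdr->tp_snaplen);

		__sync_synchronize();      /* finish reads before releasing  */
		hdr->tp_status = TP_STATUS_KERNEL;  /* hand the slot back    */

		r->head = (r->head + 1) % r->frame_nr;
		bulk++;
	}
	return bulk;
}

As long as frames keep arriving, this drains them back-to-back with no
syscall in between, which is exactly the indirect bulking effect I'm
after.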


> It's the poll timer that seemed a bit clumsy to implement.
> Looking at it again though I think we could do something if we cared to.
> I'm not convinced we would gain much though.

I actually think this would slow down performance.

> Also v2 uses static buffer sizes so that every buffer is 2k or whatever
> size the user configures. V3 allows the buffer size to be updated at
> runtime which could be done in the drivers I suppose but would require
> some ixgbe restructuring.

I think we should stay with the V2 fixed-size buffers, for performance
reasons.
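
For reference, the fixed-size V2 setup I would benchmark against first
(untested sketch; standard PACKET_VERSION + PACKET_RX_RING + mmap with
2k frames, error handling omitted):

#include <stddef.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

static void *setup_v2_ring(int *out_fd, struct tpacket_req *req)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	int ver = TPACKET_V2;

	setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

	req->tp_frame_size = 2048;             /* fixed slot size           */
	req->tp_block_size = 4096;             /* multiple of the page size */
	req->tp_block_nr   = 1024;
	req->tp_frame_nr   = req->tp_block_nr *
			     (req->tp_block_size / req->tp_frame_size);

	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));

	*out_fd = fd;
	return mmap(NULL, (size_t)req->tp_block_size * req->tp_block_nr,
		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

Every slot has the same 2k size, so the driver side never has to deal
with variable buffer geometry on the fast path.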


> > [1] Figure 7, page 10, http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf
> > 
> >   
> >> It might be worth creating a new v4 header that is simple for drivers
> >> to support direct DMA ops with. I can imagine using the xdp_buff
> >> structure as a header for example. Thoughts?  
> > 
> > Likely, but I would like us to take a measurement-based approach.  Let's
> > benchmark with this V2 header format, and see how far we are from
> > target, and see what lights up in perf report and if it is something we
> > can address.  
> 
> Yep, I'm hoping to get to this sometime this week.
> 
> > 
> >   
> >> The ndo operations and new socket option PACKET_RX_DIRECT work by
> >> giving a queue_index to run the direct DMA operations over. Once
> >> setsockopt returns successfully, the indicated queue is mapped
> >> directly to the requesting application and cannot be used for
> >> other purposes. Also any kernel layers such as tc will be bypassed
> >> and need to be implemented in the hardware via some other mechanism
> >> such as tc offload or other offload interfaces.  
> > 
> > Will this also need to bypass XDP too?  
> 
> There is nothing stopping this from working with XDP but why would
> you want to run XDP here?
> 
> Dropping a packet, for example, is not really useful because it's
> already in memory user space can read. Modifying the packet seems
> pointless since user space can modify it anyway. 
>
> Maybe pushing a copy of the packet
> to the kernel stack is useful in some cases? But I can't see where I
> would want this.

Wouldn't it be useful to pass ARP packets to the kernel stack?
(E.g. if your HW filter is based on MAC addr match)


> > E.g. how will you support XDP_TX?  AFAIK you cannot remove/detach a
> > packet with this solution (and place it on a TX queue and wait for DMA
> > TX completion).
> >   
> 
> This is something worth exploring. tpacket_v2 uses a fixed ring with
> slots so all the pages are allocated and assigned to the ring at init
> time. To xmit a packet in this case the user space application would
> be required to leave the packet descriptor on the rx side pinned
> until the tx side DMA has completed. Then it can unpin the rx side
> and return it to the driver. This works if the TX/RX processing is
> fast enough to keep up. For many things this is good enough.

Sounds tricky.
 
> For some workloads, though, this may not be sufficient. In that
> case a tpacket_v4 would be useful that can push down a new set
> of "slots" every n packets, where n is sufficiently large to keep
> the workload running. This is similar in many ways to virtio/vhost
> interaction.

This starts to sound like we need a page-pool-like facility, with
pages premapped for DMA and premapped to userspace...
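
Roughly the per-slot state I imagine such a facility tracking (purely
hypothetical structures, all names invented, nothing like this exists
today): the page gets DMA-mapped for the NIC once and mapped into the
consuming process once, so the fast path only flips ownership.

#include <stdint.h>

enum slot_owner {
	SLOT_OWNED_BY_NIC,	/* device may DMA into the slot      */
	SLOT_OWNED_BY_USER,	/* userspace may read the slot       */
};

struct zc_slot {
	uint64_t dma_addr;	/* bus address handed to the NIC     */
	uint64_t user_off;	/* offset inside the mmap()'ed area  */
	uint32_t len;		/* filled in on RX completion        */
	uint32_t owner;		/* enum slot_owner, the handoff flag */
};

struct zc_pool {
	struct zc_slot *slots;
	uint32_t nr_slots;
	uint32_t nic_head;	/* next slot to post to the device   */
	uint32_t user_head;	/* next slot userspace will consume  */
};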

> >   
[...]
> > 
> > I guess I don't understand the details of the af_packet versions well
> > enough, but can you explain to me how userspace knows what slots it
> > can read/fetch, and how it marks when it is complete/finished so the
> > kernel knows it can reuse this slot?
> >   
> 
> At init time user space allocates a ring of buffers. Each buffer has
> space to hold the packet descriptor + packet payload. The API gives this
> to the driver to initialize DMA engine and assign addresses. At init
> time all buffers are "owned" by the driver which is indicated by a status bit
> in the descriptor header.
> 
> Userspace can spin on the status bit to know when the driver has handed
> it to userspace. The driver will check the status bit before returning
> the buffer to the hardware. Then a series of kicks are used to wake up
> userspace (if it's not spinning) and to wake up the driver if it is overrun
> and needs to return buffers into its pool (not implemented yet). The
> kick to wake up the driver could in a future v4 be used to push new
> buffers to the driver if needed.

As I wrote above, this status-bit spinning approach is good and actually
achieves a bulking effect indirectly.
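
Spelled out as a toy model (all names are mine, not the driver's), the
ownership rule I read from your description is: publish the descriptor
first, flip the status last, and never repost a slot the user has not
flipped back.

#include <stdint.h>

#define STATUS_KERNEL 0u	/* slot owned by driver/NIC */
#define STATUS_USER   1u	/* slot owned by userspace  */

struct slot {
	volatile uint32_t status;
	uint32_t len;
	/* packet payload follows in the real layout */
};

/* RX completion path: NIC has finished DMA into slot i. */
static void complete_rx(struct slot *ring, uint32_t i, uint32_t len)
{
	ring[i].len = len;
	__sync_synchronize();		/* descriptor visible before handoff */
	ring[i].status = STATUS_USER;	/* userspace may now read the slot   */
}

/* Refill path: only repost slots userspace has returned. */
static int can_repost(const struct slot *ring, uint32_t i)
{
	return ring[i].status == STATUS_KERNEL;
}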


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Thread overview: 18+ messages
2017-01-27 21:33 [RFC PATCH 0/2] rx zero copy interface for af_packet John Fastabend
2017-01-27 21:33 ` [RFC PATCH 1/2] af_packet: direct dma for packet ineterface John Fastabend
2017-01-30 18:16   ` Jesper Dangaard Brouer
2017-01-30 21:51     ` John Fastabend
2017-01-31  1:31       ` Willem de Bruijn
2017-02-01  5:09         ` John Fastabend
2017-03-06 21:28           ` chetan loke
2017-01-31 12:20       ` Jesper Dangaard Brouer [this message]
2017-02-01  5:01         ` John Fastabend
2017-02-04  3:10   ` Jason Wang
2017-01-27 21:34 ` [RFC PATCH 2/2] ixgbe: add af_packet direct copy support John Fastabend
2017-01-31  2:53   ` Alexei Starovoitov
2017-02-01  4:58     ` John Fastabend
2017-01-30 22:02 ` [RFC PATCH 0/2] rx zero copy interface for af_packet David Miller
2017-01-31 16:30 ` Sowmini Varadhan
2017-02-01  4:23   ` John Fastabend
2017-01-31 19:39 ` tndave
2017-02-01  5:09   ` John Fastabend
