All of lore.kernel.org
 help / color / mirror / Atom feed
* Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-05 14:31 Jesper Dangaard Brouer
  2016-12-12  8:38 ` Mike Rapoport
  0 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-05 14:31 UTC (permalink / raw)
  To: netdev
  Cc: brouer, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg

Hi all,

This is my design for how to safely handle RX zero-copy in the network
stack, by using page_pool[1] and modifying NIC drivers.  Safely means
not leaking kernel info in pages mapped to userspace and resilience
so a malicious userspace app cannot crash the kernel.

It is only a design, and thus the purpose is for you to find any holes
in this design ;-)  Below text is also available as html see[2].

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/design.html
[2] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

===========================
Memory Model for Networking
===========================

This design describes how the page_pool change the memory model for
networking in the NIC (Network Interface Card) drivers.

.. Note:: The catch for driver developers is that, once an application
          request zero-copy RX, then the driver must use a specific
          SKB allocation mode and might have to reconfigure the
          RX-ring.

Design target
=============

Allow the NIC to function as a normal Linux NIC and be shared in a
safe manor, between the kernel network stack and an accelerated
userspace application using RX zero-copy delivery.

Target is to provide the basis for building RX zero-copy solutions in
a memory safe manor.  An efficient communication channel for userspace
delivery is out of scope for this document, but OOM considerations are
discussed below (`Userspace delivery and OOM`_).

Background
==========

The SKB or ``struct sk_buff`` is the fundamental meta-data structure
for network packets in the Linux Kernel network stack.  It is a fairly
complex object and can be constructed in several ways.

From a memory perspective there are two ways depending on
RX-buffer/page state:

1) Writable packet page
2) Read-only packet page

To take full potential of the page_pool, the drivers must actually
support handling both options depending on the configuration state of
the page_pool.

Writable packet page
--------------------

When the RX packet page is writable, the SKB setup is fairly straight
forward.  The SKB->data (and skb->head) can point directly to the page
data, adjusting the offset according to drivers headroom (for adding
headers) and setting the length according to the DMA descriptor info.

The page/data need to be writable, because the network stack need to
adjust headers (like TimeToLive and checksum) or even add or remove
headers for encapsulation purposes.

A subtle catch, which also requires a writable page, is that the SKB
also have an accompanying "shared info" data-structure ``struct
skb_shared_info``.  This "skb_shared_info" is written into the
skb->data memory area at the end (skb->end) of the (header) data.  The
skb_shared_info contains semi-sensitive information, like kernel
memory pointers to other pages (which might be pointers to more packet
data).  This would be bad from a zero-copy point of view to leak this
kind of information.

Read-only packet page
---------------------

When the RX packet page is read-only, the construction of the SKB is
significantly more complicated and even involves one more memory
allocation.

1) Allocate a new separate writable memory area, and point skb->data
   here.  This is needed due to (above described) skb_shared_info.

2) Memcpy packet headers into this (skb->data) area.

3) Clear part of skb_shared_info struct in writable-area.

4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
   and adjust the page_offset to be past the headers just copied.

It is useful (later) that the network stack have this notion that part
of the packet and a page can be read-only.  This implies that the
kernel will not "pollute" this memory with any sensitive information.
This is good from a zero-copy point of view, but bad from a
performance perspective.


NIC RX Zero-Copy
================

Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
involves costly mapping and unmapping operations in the address space
of the userspace process.  Plus for doing this safely, the page memory
need to be cleared before using it, to avoid leaking kernel
information to userspace, also a costly operation.  The page_pool base
"class" of optimization is moving these kind of operations out of the
fastpath, by recycling and lifetime control.

Once a NIC RX-queue's page_pool have been configured for zero-copy
into userspace, then can packets still be allowed to travel the normal
stack?

Yes, this should be possible, because the driver can use the
SKB-read-only mode, which avoids polluting the page data with
kernel-side sensitive data.  This implies, when a driver RX-queue
switch page_pool to RX-zero-copy mode it MUST also switch to
SKB-read-only mode (for normal stack delivery for this RXq).

XDP can be used for controlling which pages that gets RX zero-copied
to userspace.  The page is still writable for the XDP program, but
read-only for normal stack delivery.


Kernel safety
-------------

For the paranoid, how do we protect the kernel from a malicious
userspace program.  Sure there will be a communication interface
between kernel and userspace, that synchronize ownership of pages.
But a userspace program can violate this interface, given pages are
kept VMA mapped, the program can in principle access all the memory
pages in the given page_pool.  This opens up for a malicious (or
defect) program modifying memory pages concurrently with the kernel
and DMA engine using them.

An easy way to get around userspace modifying page data contents is
simply to map pages read-only into userspace.

.. Note:: The first implementation target is read-only zero-copy RX
          page to userspace and require driver to use SKB-read-only
          mode.

Advanced: Allowing userspace write access?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What if userspace need write access? Flipping the page permissions per
transfer will likely kill performance (as this likely affects the
TLB-cache).

I will argue that giving userspace write access is still possible,
without risking a kernel crash.  This is related to the SKB-read-only
mode that copies the packet headers (in to another memory area,
inaccessible to userspace).  The attack angle is to modify packet
headers after they passed some kernel network stack validation step
(as once headers are copied they are out of "reach").

Situation classes where memory page can be modified concurrently:

1) When DMA engine owns the page.  Not a problem, as DMA engine will
   simply overwrite data.

2) Just after DMA engine finish writing.  Not a problem, the packet
   will go through netstack validation and be rejected.

3) While XDP reads data. This can lead to XDP/eBPF program goes into a
   wrong code branch, but the eBPF virtual machine should not be able
   to crash the kernel. The worst outcome is a wrong or invalid XDP
   return code.

4) Before SKB with read-only page is constructed. Not a problem, the
   packet will go through netstack validation and be rejected.

5) After SKB with read-only page has been constructed.  Remember the
   packet headers were copied into a separate memory area, and the
   page data is pointed to with an offset passed the copied headers.
   Thus, userspace cannot modify the headers used for netstack
   validation.  It can only modify packet data contents, which is less
   critical as it cannot crash the kernel, and eventually this will be
   caught by packet checksum validation.

6) After netstack delivered packet to another userspace process. Not a
   problem, as it cannot crash the kernel.  It might corrupt
   packet-data being read by another userspace process, which one
   argument for requiring elevated privileges to get write access
   (like NET_CAP_ADMIN).


Userspace delivery and OOM
--------------------------

These RX pages are likely mapped to userspace via mmap(), so-far so
good.  It is key to performance to get an efficient way of signaling
between kernel and userspace, e.g what page are ready for consumption,
and when userspace are done with the page.

It is outside the scope of page_pool to provide such a queuing
structure, but the page_pool can offer some means of protecting the
system resource usage.  It is a classical problem that resources
(e.g. the page) must be returned in a timely manor, else the system,
in this case, will run out of memory.  Any system/design with
unbounded memory allocation can lead to Out-Of-Memory (OOM)
situations.

Communication between kernel and userspace is likely going to be some
kind of queue.  Given transferring packets individually will have too
much scheduling overhead.  A queue can implicitly function as a
bulking interface, and offers a natural way to split the workload
across CPU cores.

This essentially boils down-to a two queue system, with the RX-ring
queue and the userspace delivery queue.

Two bad situations exists for the userspace queue:

1) Userspace is not consuming objects fast-enough. This should simply
   result in packets getting dropped when enqueueing to a full
   userspace queue (as queue *must* implement some limit). Open
   question is; should this be reported or communicated to userspace.

2) Userspace is consuming objects fast, but not returning them in a
   timely manor.  This is a bad situation, because it threatens the
   system stability as it can lead to OOM.

The page_pool should somehow protect the system in case 2.  The
page_pool can detect the situation as it is able to track the number
of outstanding pages, due to the recycle feedback loop.  Thus, the
page_pool can have some configurable limit of allowed outstanding
pages, which can protect the system against OOM.

Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
pages to be "pageable", i.e. swap-able, but that is not an option for
the page_pool as these pages are DMA mapped.

.. _`Fbufs paper`:
   http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688

Effect of blocking allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The effect of page_pool, in case 2, that denies more allocations
essentially result-in the RX-ring queue cannot be refilled and HW
starts dropping packets due to "out-of-buffers".  For NICs with
several HW RX-queues, this can be limited to a subset of queues (and
admin can control which RX queue with HW filters).

The question is if the page_pool can do something smarter in this
case, to signal the consumers of these pages, before the maximum limit
is hit (of allowed outstanding packets).  The MM-subsystem already
have a concept of emergency PFMEMALLOC reserves and associate
page-flags (e.g. page_is_pfmemalloc).  And the network stack already
handle and react to this.  Could the same PFMEMALLOC system be used
for marking pages when limit is close?

This requires further analysis. One can imagine; this could be used at
RX by XDP to mitigate the situation by dropping less-important frames.
Given XDP choose which pages are being send to userspace it might have
appropriate knowledge of what it relevant to drop(?).

.. Note:: An alternative idea is using a data-structure that blocks
          userspace from getting new pages before returning some.
          (out of scope for the page_pool)

--
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Above document is taken at GitHub commit 47fa7c844f48fab8b
 https://github.com/netoptimizer/prototype-kernel/commit/47fa7c844f48fab8b

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-05 14:31 Designing a safe RX-zero-copy Memory Model for Networking Jesper Dangaard Brouer
@ 2016-12-12  8:38 ` Mike Rapoport
  2016-12-12  9:40   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Rapoport @ 2016-12-12  8:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth

Hello Jesper,

On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
> Hi all,
> 
> This is my design for how to safely handle RX zero-copy in the network
> stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> not leaking kernel info in pages mapped to userspace and resilience
> so a malicious userspace app cannot crash the kernel.
> 
> Design target
> =============
> 
> Allow the NIC to function as a normal Linux NIC and be shared in a
> safe manor, between the kernel network stack and an accelerated
> userspace application using RX zero-copy delivery.
> 
> Target is to provide the basis for building RX zero-copy solutions in
> a memory safe manor.  An efficient communication channel for userspace
> delivery is out of scope for this document, but OOM considerations are
> discussed below (`Userspace delivery and OOM`_).

Sorry, if this reply is a bit off-topic.

I'm working on implementation of RX zero-copy for virtio and I've dedicated
some thought about making guest memory available for physical NIC DMAs.
I believe this is quite related to your page_pool proposal, at least from
the NIC driver perspective, so I'd like to share some thoughts here.
The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
using macvtap, and then propagate guest RX memory allocations to the NIC
using something like new .ndo_set_rx_buffers method.

What is your view about interface between the page_pool and the NIC
drivers?
Have you considered using "push" model for setting the NIC's RX memory?

> 
> --
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> Above document is taken at GitHub commit 47fa7c844f48fab8b
>  https://github.com/netoptimizer/prototype-kernel/commit/47fa7c844f48fab8b
> 

--
Sincerely yours,
Mike.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12  8:38 ` Mike Rapoport
@ 2016-12-12  9:40   ` Jesper Dangaard Brouer
  2016-12-12 14:14       ` Mike Rapoport
  0 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-12  9:40 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: netdev, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, brouer


On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:

> Hello Jesper,
> 
> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
> > Hi all,
> > 
> > This is my design for how to safely handle RX zero-copy in the network
> > stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> > not leaking kernel info in pages mapped to userspace and resilience
> > so a malicious userspace app cannot crash the kernel.
> > 
> > Design target
> > =============
> > 
> > Allow the NIC to function as a normal Linux NIC and be shared in a
> > safe manor, between the kernel network stack and an accelerated
> > userspace application using RX zero-copy delivery.
> > 
> > Target is to provide the basis for building RX zero-copy solutions in
> > a memory safe manor.  An efficient communication channel for userspace
> > delivery is out of scope for this document, but OOM considerations are
> > discussed below (`Userspace delivery and OOM`_).  
> 
> Sorry, if this reply is a bit off-topic.

It is very much on topic IMHO :-)

> I'm working on implementation of RX zero-copy for virtio and I've dedicated
> some thought about making guest memory available for physical NIC DMAs.
> I believe this is quite related to your page_pool proposal, at least from
> the NIC driver perspective, so I'd like to share some thoughts here.

Seems quite related. I'm very interested in cooperating with you! I'm
not very familiar with virtio, and how packets/pages gets channeled
into virtio.

> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> using macvtap, and then propagate guest RX memory allocations to the NIC
> using something like new .ndo_set_rx_buffers method.

I believe the page_pool API/design aligns with this idea/use-case.

> What is your view about interface between the page_pool and the NIC
> drivers?

In my Prove-of-Concept implementation, the NIC driver (mlx5) register
a page_pool per RX queue.  This is done for two reasons (1) performance
and (2) for supporting use-cases where only one single RX-ring queue is
(re)configured to support RX-zero-copy.  There are some associated
extra cost of enabling this mode, thus it makes sense to only enable it
when needed.

I've not decided how this gets enabled, maybe some new driver NDO.  It
could also happen when a XDP program gets loaded, which request this
feature.

The macvtap solution is nice and we should support it, but it requires
VM to have their MAC-addr registered on the physical switch.  This
design is about adding flexibility. Registering an XDP eBPF filter
provides the maximum flexibility for matching the destination VM.


> Have you considered using "push" model for setting the NIC's RX memory?

I don't understand what you mean by a "push" model?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12  9:40   ` Jesper Dangaard Brouer
@ 2016-12-12 14:14       ` Mike Rapoport
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2016-12-12 14:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth

On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:
> 
> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> 
> > Hello Jesper,
> > 
> > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
> > > Hi all,
> > > 
> > > This is my design for how to safely handle RX zero-copy in the network
> > > stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> > > not leaking kernel info in pages mapped to userspace and resilience
> > > so a malicious userspace app cannot crash the kernel.
> > > 
> > > Design target
> > > =============
> > > 
> > > Allow the NIC to function as a normal Linux NIC and be shared in a
> > > safe manor, between the kernel network stack and an accelerated
> > > userspace application using RX zero-copy delivery.
> > > 
> > > Target is to provide the basis for building RX zero-copy solutions in
> > > a memory safe manor.  An efficient communication channel for userspace
> > > delivery is out of scope for this document, but OOM considerations are
> > > discussed below (`Userspace delivery and OOM`_).  
> > 
> > Sorry, if this reply is a bit off-topic.
> 
> It is very much on topic IMHO :-)
> 
> > I'm working on implementation of RX zero-copy for virtio and I've dedicated
> > some thought about making guest memory available for physical NIC DMAs.
> > I believe this is quite related to your page_pool proposal, at least from
> > the NIC driver perspective, so I'd like to share some thoughts here.
> 
> Seems quite related. I'm very interested in cooperating with you! I'm
> not very familiar with virtio, and how packets/pages gets channeled
> into virtio.

They are copied :-)
Presuming we are dealing only with vhost backend, the received skb
eventually gets converted to IOVs, which in turn are copied to the guest
memory. The IOVs point to the guest memory that is allocated by virtio-net
running in the guest.

> > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> > using macvtap, and then propagate guest RX memory allocations to the NIC
> > using something like new .ndo_set_rx_buffers method.
> 
> I believe the page_pool API/design aligns with this idea/use-case.
> 
> > What is your view about interface between the page_pool and the NIC
> > drivers?
> 
> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> a page_pool per RX queue.  This is done for two reasons (1) performance
> and (2) for supporting use-cases where only one single RX-ring queue is
> (re)configured to support RX-zero-copy.  There are some associated
> extra cost of enabling this mode, thus it makes sense to only enable it
> when needed.
> 
> I've not decided how this gets enabled, maybe some new driver NDO.  It
> could also happen when a XDP program gets loaded, which request this
> feature.
> 
> The macvtap solution is nice and we should support it, but it requires
> VM to have their MAC-addr registered on the physical switch.  This
> design is about adding flexibility. Registering an XDP eBPF filter
> provides the maximum flexibility for matching the destination VM.

I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
what needs to be done in BPF program to do proper conversion of skb to the
virtio descriptors.

We were not considered using XDP yet, so we've decided to limit the initial
implementation to macvtap because we can ensure correspondence between a
NIC queue and virtual NIC, which is not the case with more generic tap
device. It could be that use of XDP will allow for a generic solution for
virtio case as well.
 
> 
> > Have you considered using "push" model for setting the NIC's RX memory?
> 
> I don't understand what you mean by a "push" model?

Currently, memory allocation in NIC drivers boils down to alloc_page with
some wrapping code. I see two possible ways to make NIC use of some
preallocated pages: either NIC driver will call an API (probably different
from alloc_page) to obtain that memory, or there will be NDO API that
allows to set the NIC's RX buffers. I named the later case "push".
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-12 14:14       ` Mike Rapoport
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2016-12-12 14:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth

On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:
> 
> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> 
> > Hello Jesper,
> > 
> > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
> > > Hi all,
> > > 
> > > This is my design for how to safely handle RX zero-copy in the network
> > > stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> > > not leaking kernel info in pages mapped to userspace and resilience
> > > so a malicious userspace app cannot crash the kernel.
> > > 
> > > Design target
> > > =============
> > > 
> > > Allow the NIC to function as a normal Linux NIC and be shared in a
> > > safe manor, between the kernel network stack and an accelerated
> > > userspace application using RX zero-copy delivery.
> > > 
> > > Target is to provide the basis for building RX zero-copy solutions in
> > > a memory safe manor.  An efficient communication channel for userspace
> > > delivery is out of scope for this document, but OOM considerations are
> > > discussed below (`Userspace delivery and OOM`_).  
> > 
> > Sorry, if this reply is a bit off-topic.
> 
> It is very much on topic IMHO :-)
> 
> > I'm working on implementation of RX zero-copy for virtio and I've dedicated
> > some thought about making guest memory available for physical NIC DMAs.
> > I believe this is quite related to your page_pool proposal, at least from
> > the NIC driver perspective, so I'd like to share some thoughts here.
> 
> Seems quite related. I'm very interested in cooperating with you! I'm
> not very familiar with virtio, and how packets/pages gets channeled
> into virtio.

They are copied :-)
Presuming we are dealing only with vhost backend, the received skb
eventually gets converted to IOVs, which in turn are copied to the guest
memory. The IOVs point to the guest memory that is allocated by virtio-net
running in the guest.

> > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> > using macvtap, and then propagate guest RX memory allocations to the NIC
> > using something like new .ndo_set_rx_buffers method.
> 
> I believe the page_pool API/design aligns with this idea/use-case.
> 
> > What is your view about interface between the page_pool and the NIC
> > drivers?
> 
> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> a page_pool per RX queue.  This is done for two reasons (1) performance
> and (2) for supporting use-cases where only one single RX-ring queue is
> (re)configured to support RX-zero-copy.  There are some associated
> extra cost of enabling this mode, thus it makes sense to only enable it
> when needed.
> 
> I've not decided how this gets enabled, maybe some new driver NDO.  It
> could also happen when a XDP program gets loaded, which request this
> feature.
> 
> The macvtap solution is nice and we should support it, but it requires
> VM to have their MAC-addr registered on the physical switch.  This
> design is about adding flexibility. Registering an XDP eBPF filter
> provides the maximum flexibility for matching the destination VM.

I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
what needs to be done in BPF program to do proper conversion of skb to the
virtio descriptors.

We were not considered using XDP yet, so we've decided to limit the initial
implementation to macvtap because we can ensure correspondence between a
NIC queue and virtual NIC, which is not the case with more generic tap
device. It could be that use of XDP will allow for a generic solution for
virtio case as well.
 
> 
> > Have you considered using "push" model for setting the NIC's RX memory?
> 
> I don't understand what you mean by a "push" model?

Currently, memory allocation in NIC drivers boils down to alloc_page with
some wrapping code. I see two possible ways to make NIC use of some
preallocated pages: either NIC driver will call an API (probably different
from alloc_page) to obtain that memory, or there will be NDO API that
allows to set the NIC's RX buffers. I named the later case "push".
 
--
Sincerely yours,
Mike.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12 14:14       ` Mike Rapoport
  (?)
@ 2016-12-12 14:49       ` John Fastabend
  2016-12-12 17:13         ` Jesper Dangaard Brouer
  2016-12-13  9:42         ` Mike Rapoport
  -1 siblings, 2 replies; 39+ messages in thread
From: John Fastabend @ 2016-12-12 14:49 UTC (permalink / raw)
  To: Mike Rapoport, Jesper Dangaard Brouer
  Cc: netdev, linux-mm, Willem de Bruijn, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, Mel Gorman, Tom Herbert,
	Brenden Blanco, Tariq Toukan, Saeed Mahameed, Jesse Brandeburg,
	Kalman Meth, Vladislav Yasevich

On 16-12-12 06:14 AM, Mike Rapoport wrote:
> On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:
>>
>> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
>>
>>> Hello Jesper,
>>>
>>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
>>>> Hi all,
>>>>
>>>> This is my design for how to safely handle RX zero-copy in the network
>>>> stack, by using page_pool[1] and modifying NIC drivers.  Safely means
>>>> not leaking kernel info in pages mapped to userspace and resilience
>>>> so a malicious userspace app cannot crash the kernel.
>>>>
>>>> Design target
>>>> =============
>>>>
>>>> Allow the NIC to function as a normal Linux NIC and be shared in a
>>>> safe manor, between the kernel network stack and an accelerated
>>>> userspace application using RX zero-copy delivery.
>>>>
>>>> Target is to provide the basis for building RX zero-copy solutions in
>>>> a memory safe manor.  An efficient communication channel for userspace
>>>> delivery is out of scope for this document, but OOM considerations are
>>>> discussed below (`Userspace delivery and OOM`_).  
>>>
>>> Sorry, if this reply is a bit off-topic.
>>
>> It is very much on topic IMHO :-)
>>
>>> I'm working on implementation of RX zero-copy for virtio and I've dedicated
>>> some thought about making guest memory available for physical NIC DMAs.
>>> I believe this is quite related to your page_pool proposal, at least from
>>> the NIC driver perspective, so I'd like to share some thoughts here.
>>
>> Seems quite related. I'm very interested in cooperating with you! I'm
>> not very familiar with virtio, and how packets/pages gets channeled
>> into virtio.
> 
> They are copied :-)
> Presuming we are dealing only with vhost backend, the received skb
> eventually gets converted to IOVs, which in turn are copied to the guest
> memory. The IOVs point to the guest memory that is allocated by virtio-net
> running in the guest.
> 

Great I'm also doing something similar.

My plan was to embed the zero copy as an AF_PACKET mode and then push
a AF_PACKET backend into vhost. I'll post a patch later this week.

>>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
>>> using macvtap, and then propagate guest RX memory allocations to the NIC
>>> using something like new .ndo_set_rx_buffers method.
>>
>> I believe the page_pool API/design aligns with this idea/use-case.
>>
>>> What is your view about interface between the page_pool and the NIC
>>> drivers?
>>
>> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
>> a page_pool per RX queue.  This is done for two reasons (1) performance
>> and (2) for supporting use-cases where only one single RX-ring queue is
>> (re)configured to support RX-zero-copy.  There are some associated
>> extra cost of enabling this mode, thus it makes sense to only enable it
>> when needed.
>>
>> I've not decided how this gets enabled, maybe some new driver NDO.  It
>> could also happen when a XDP program gets loaded, which request this
>> feature.
>>
>> The macvtap solution is nice and we should support it, but it requires
>> VM to have their MAC-addr registered on the physical switch.  This
>> design is about adding flexibility. Registering an XDP eBPF filter
>> provides the maximum flexibility for matching the destination VM.
> 
> I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> what needs to be done in BPF program to do proper conversion of skb to the
> virtio descriptors.

I don't think XDP has much to do with this code and they should be done
separately. XDP runs eBPF code on received packets after the DMA engine
has already placed the packet in memory so its too late in the process.

The other piece here is enabling XDP in vhost but that is again separate
IMO.

Notice that ixgbe supports pushing packets into a macvlan via 'tc'
traffic steering commands so even though macvlan gets an L2 address it
doesn't mean it can't use other criteria to steer traffic to it.

> 
> We were not considered using XDP yet, so we've decided to limit the initial
> implementation to macvtap because we can ensure correspondence between a
> NIC queue and virtual NIC, which is not the case with more generic tap
> device. It could be that use of XDP will allow for a generic solution for
> virtio case as well.

Interesting this was one of the original ideas behind the macvlan
offload mode. iirc Vlad also was interested in this.

I'm guessing this was used because of the ability to push macvlan onto
its own queue?

>  
>>
>>> Have you considered using "push" model for setting the NIC's RX memory?
>>
>> I don't understand what you mean by a "push" model?
> 
> Currently, memory allocation in NIC drivers boils down to alloc_page with
> some wrapping code. I see two possible ways to make NIC use of some
> preallocated pages: either NIC driver will call an API (probably different
> from alloc_page) to obtain that memory, or there will be NDO API that
> allows to set the NIC's RX buffers. I named the later case "push".

I prefer the ndo op. This matches up well with AF_PACKET model where we
have "slots" and offload is just a transparent "push" of these "slots"
to the driver. Below we have a snippet of our proposed API,

(https://patchwork.ozlabs.org/patch/396714/ note the descriptor mapping
bits will be dropped)

+ * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
+ *				     struct net_device *dev)
+ *	Called to map queue pair range from split_queue_pairs into
+ *	mmap region.
+

> +
> +static int
> +ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> +	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
> +	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +	unsigned long dummy_page_phy;
> +	pgprot_t pre_vm_page_prot;
> +	unsigned long start;
> +	unsigned int i;
> +	int err;
> +
> +	if (!dummy_page_buf) {
> +		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
> +		if (!dummy_page_buf)
> +			return -ENOMEM;
> +
> +		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
> +			dummy_page_buf[i] = 0xdeadbeef;
> +	}
> +
> +	dummy_page_phy = virt_to_phys(dummy_page_buf);
> +	pre_vm_page_prot = vma->vm_page_prot;
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> +	/* assume the vm_start is 4K aligned address */
> +	for (start = vma->vm_start;
> +	     start < vma->vm_end;
> +	     start += PAGE_SIZE_4K) {
> +		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
> +			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
> +					      vma->vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
> +			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
> +					      vma->vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		} else {
> +			unsigned long addr = dummy_page_phy > PAGE_SHIFT;
> +
> +			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
> +					      pre_vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		}
> +	}
> +	return 0;
> +}
> +

Any thoughts on something like the above? We could push it when net-next
opens. One piece that fits naturally into vhost/macvtap is the kicks and
queue splicing are already there so no need to implement this making the
above patch much simpler.

.John

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12 14:14       ` Mike Rapoport
@ 2016-12-12 15:10         ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-12 15:10 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: netdev, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, brouer

On Mon, 12 Dec 2016 16:14:33 +0200
Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:

> On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:
> > 
> > On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> >   
> > > Hello Jesper,
> > > 
> > > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:  
> > > > Hi all,
> > > > 
> > > > This is my design for how to safely handle RX zero-copy in the network
> > > > stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> > > > not leaking kernel info in pages mapped to userspace and resilience
> > > > so a malicious userspace app cannot crash the kernel.
> > > > 
> > > > Design target
> > > > =============
> > > > 
> > > > Allow the NIC to function as a normal Linux NIC and be shared in a
> > > > safe manor, between the kernel network stack and an accelerated
> > > > userspace application using RX zero-copy delivery.
> > > > 
> > > > Target is to provide the basis for building RX zero-copy solutions in
> > > > a memory safe manor.  An efficient communication channel for userspace
> > > > delivery is out of scope for this document, but OOM considerations are
> > > > discussed below (`Userspace delivery and OOM`_).    
> > > 
> > > Sorry, if this reply is a bit off-topic.  
> > 
> > It is very much on topic IMHO :-)
> >   
> > > I'm working on implementation of RX zero-copy for virtio and I've dedicated
> > > some thought about making guest memory available for physical NIC DMAs.
> > > I believe this is quite related to your page_pool proposal, at least from
> > > the NIC driver perspective, so I'd like to share some thoughts here.  
> > 
> > Seems quite related. I'm very interested in cooperating with you! I'm
> > not very familiar with virtio, and how packets/pages gets channeled
> > into virtio.  
> 
> They are copied :-)
> Presuming we are dealing only with vhost backend, the received skb
> eventually gets converted to IOVs, which in turn are copied to the guest
> memory. The IOVs point to the guest memory that is allocated by virtio-net
> running in the guest.

Thanks for explaining that. It seems like a lot of overhead. I have to
wrap my head around this... so, the hardware NIC is receiving the
packet/page, in the RX ring, and after converting it to IOVs, it is
conceptually transmitted into the guest, and then the guest-side have a
RX-function to handle this packet. Correctly understood?

 
> > > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> > > using macvtap, and then propagate guest RX memory allocations to the NIC
> > > using something like new .ndo_set_rx_buffers method.  
> > 
> > I believe the page_pool API/design aligns with this idea/use-case.
> >   
> > > What is your view about interface between the page_pool and the NIC
> > > drivers?  
> > 
> > In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> > a page_pool per RX queue.  This is done for two reasons (1) performance
> > and (2) for supporting use-cases where only one single RX-ring queue is
> > (re)configured to support RX-zero-copy.  There are some associated
> > extra cost of enabling this mode, thus it makes sense to only enable it
> > when needed.
> > 
> > I've not decided how this gets enabled, maybe some new driver NDO.  It
> > could also happen when a XDP program gets loaded, which request this
> > feature.
> > 
> > The macvtap solution is nice and we should support it, but it requires
> > VM to have their MAC-addr registered on the physical switch.  This
> > design is about adding flexibility. Registering an XDP eBPF filter
> > provides the maximum flexibility for matching the destination VM.  
> 
> I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> what needs to be done in BPF program to do proper conversion of skb to the
> virtio descriptors.

XDP is a step _before_ the SKB is allocated.  The XDP eBPF program can
modify the packet-page data, but I don't think it is needed for your
use-case.  View XDP (primarily) as an early (demux) filter.

XDP is missing a feature your need, which is TX packet into another
net_device (I actually imagine a port mapping table, that point to a
net_device).  This require a new "TX-raw" NDO that takes a page (+
offset and length). 

I imagine, the virtio driver (virtio_net or a new driver?) getting
extended with this new "TX-raw" NDO, that takes "raw" packet-pages.
 Whether zero-copy is possible is determined by checking if page
originates from a page_pool that have enabled zero-copy (and likely
matching against a "protection domain" id number).


> We were not considered using XDP yet, so we've decided to limit the initial
> implementation to macvtap because we can ensure correspondence between a
> NIC queue and virtual NIC, which is not the case with more generic tap
> device. It could be that use of XDP will allow for a generic solution for
> virtio case as well.

You don't need an XDP filter, if you can make the HW do the early demux
binding into a queue.  The check for if memory is zero-copy enabled
would be the same.

> >   
> > > Have you considered using "push" model for setting the NIC's RX memory?  
> > 
> > I don't understand what you mean by a "push" model?  
> 
> Currently, memory allocation in NIC drivers boils down to alloc_page with
> some wrapping code. I see two possible ways to make NIC use of some
> preallocated pages: either NIC driver will call an API (probably different
> from alloc_page) to obtain that memory, or there will be NDO API that
> allows to set the NIC's RX buffers. I named the later case "push".

As you might have guessed, I'm not into the "push" model, because this
means I cannot share the queue with the normal network stack.  Which I
believe is possible as outlined (in email and [2]) and can be done with
out HW filter features (like macvlan).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
[2] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-12 15:10         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-12 15:10 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: netdev, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, brouer

On Mon, 12 Dec 2016 16:14:33 +0200
Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:

> On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:
> > 
> > On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> >   
> > > Hello Jesper,
> > > 
> > > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:  
> > > > Hi all,
> > > > 
> > > > This is my design for how to safely handle RX zero-copy in the network
> > > > stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> > > > not leaking kernel info in pages mapped to userspace and resilience
> > > > so a malicious userspace app cannot crash the kernel.
> > > > 
> > > > Design target
> > > > =============
> > > > 
> > > > Allow the NIC to function as a normal Linux NIC and be shared in a
> > > > safe manor, between the kernel network stack and an accelerated
> > > > userspace application using RX zero-copy delivery.
> > > > 
> > > > Target is to provide the basis for building RX zero-copy solutions in
> > > > a memory safe manor.  An efficient communication channel for userspace
> > > > delivery is out of scope for this document, but OOM considerations are
> > > > discussed below (`Userspace delivery and OOM`_).    
> > > 
> > > Sorry, if this reply is a bit off-topic.  
> > 
> > It is very much on topic IMHO :-)
> >   
> > > I'm working on implementation of RX zero-copy for virtio and I've dedicated
> > > some thought about making guest memory available for physical NIC DMAs.
> > > I believe this is quite related to your page_pool proposal, at least from
> > > the NIC driver perspective, so I'd like to share some thoughts here.  
> > 
> > Seems quite related. I'm very interested in cooperating with you! I'm
> > not very familiar with virtio, and how packets/pages gets channeled
> > into virtio.  
> 
> They are copied :-)
> Presuming we are dealing only with vhost backend, the received skb
> eventually gets converted to IOVs, which in turn are copied to the guest
> memory. The IOVs point to the guest memory that is allocated by virtio-net
> running in the guest.

Thanks for explaining that. It seems like a lot of overhead. I have to
wrap my head around this... so, the hardware NIC is receiving the
packet/page, in the RX ring, and after converting it to IOVs, it is
conceptually transmitted into the guest, and then the guest-side have a
RX-function to handle this packet. Correctly understood?

 
> > > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> > > using macvtap, and then propagate guest RX memory allocations to the NIC
> > > using something like new .ndo_set_rx_buffers method.  
> > 
> > I believe the page_pool API/design aligns with this idea/use-case.
> >   
> > > What is your view about interface between the page_pool and the NIC
> > > drivers?  
> > 
> > In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> > a page_pool per RX queue.  This is done for two reasons (1) performance
> > and (2) for supporting use-cases where only one single RX-ring queue is
> > (re)configured to support RX-zero-copy.  There are some associated
> > extra cost of enabling this mode, thus it makes sense to only enable it
> > when needed.
> > 
> > I've not decided how this gets enabled, maybe some new driver NDO.  It
> > could also happen when a XDP program gets loaded, which request this
> > feature.
> > 
> > The macvtap solution is nice and we should support it, but it requires
> > VM to have their MAC-addr registered on the physical switch.  This
> > design is about adding flexibility. Registering an XDP eBPF filter
> > provides the maximum flexibility for matching the destination VM.  
> 
> I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> what needs to be done in BPF program to do proper conversion of skb to the
> virtio descriptors.

XDP is a step _before_ the SKB is allocated.  The XDP eBPF program can
modify the packet-page data, but I don't think it is needed for your
use-case.  View XDP (primarily) as an early (demux) filter.

XDP is missing a feature your need, which is TX packet into another
net_device (I actually imagine a port mapping table, that point to a
net_device).  This require a new "TX-raw" NDO that takes a page (+
offset and length). 

I imagine, the virtio driver (virtio_net or a new driver?) getting
extended with this new "TX-raw" NDO, that takes "raw" packet-pages.
 Whether zero-copy is possible is determined by checking if page
originates from a page_pool that have enabled zero-copy (and likely
matching against a "protection domain" id number).


> We were not considered using XDP yet, so we've decided to limit the initial
> implementation to macvtap because we can ensure correspondence between a
> NIC queue and virtual NIC, which is not the case with more generic tap
> device. It could be that use of XDP will allow for a generic solution for
> virtio case as well.

You don't need an XDP filter, if you can make the HW do the early demux
binding into a queue.  The check for if memory is zero-copy enabled
would be the same.

> >   
> > > Have you considered using "push" model for setting the NIC's RX memory?  
> > 
> > I don't understand what you mean by a "push" model?  
> 
> Currently, memory allocation in NIC drivers boils down to alloc_page with
> some wrapping code. I see two possible ways to make NIC use of some
> preallocated pages: either NIC driver will call an API (probably different
> from alloc_page) to obtain that memory, or there will be NDO API that
> allows to set the NIC's RX buffers. I named the later case "push".

As you might have guessed, I'm not into the "push" model, because this
means I cannot share the queue with the normal network stack.  Which I
believe is possible as outlined (in email and [2]) and can be done with
out HW filter features (like macvlan).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
[2] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12 14:49       ` John Fastabend
@ 2016-12-12 17:13         ` Jesper Dangaard Brouer
  2016-12-12 18:06             ` Christoph Lameter
  2016-12-13  9:42         ` Mike Rapoport
  1 sibling, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-12 17:13 UTC (permalink / raw)
  To: John Fastabend
  Cc: Mike Rapoport, netdev, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich, brouer

On Mon, 12 Dec 2016 06:49:03 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-12-12 06:14 AM, Mike Rapoport wrote:
> > On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:  
> >>
> >> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> >>  
> >>> Hello Jesper,
> >>>
> >>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:  
> >>>> Hi all,
> >>>>
> >>>> This is my design for how to safely handle RX zero-copy in the network
> >>>> stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> >>>> not leaking kernel info in pages mapped to userspace and resilience
> >>>> so a malicious userspace app cannot crash the kernel.
> >>>>
> >>>> Design target
> >>>> =============
> >>>>
> >>>> Allow the NIC to function as a normal Linux NIC and be shared in a
> >>>> safe manor, between the kernel network stack and an accelerated
> >>>> userspace application using RX zero-copy delivery.
> >>>>
> >>>> Target is to provide the basis for building RX zero-copy solutions in
> >>>> a memory safe manor.  An efficient communication channel for userspace
> >>>> delivery is out of scope for this document, but OOM considerations are
> >>>> discussed below (`Userspace delivery and OOM`_).    
> >>>
> >>> Sorry, if this reply is a bit off-topic.  
> >>
> >> It is very much on topic IMHO :-)
> >>  
> >>> I'm working on implementation of RX zero-copy for virtio and I've dedicated
> >>> some thought about making guest memory available for physical NIC DMAs.
> >>> I believe this is quite related to your page_pool proposal, at least from
> >>> the NIC driver perspective, so I'd like to share some thoughts here.  
> >>
> >> Seems quite related. I'm very interested in cooperating with you! I'm
> >> not very familiar with virtio, and how packets/pages gets channeled
> >> into virtio.  
> > 
> > They are copied :-)
> > Presuming we are dealing only with vhost backend, the received skb
> > eventually gets converted to IOVs, which in turn are copied to the guest
> > memory. The IOVs point to the guest memory that is allocated by virtio-net
> > running in the guest.
> >   
> 
> Great I'm also doing something similar.
> 
> My plan was to embed the zero copy as an AF_PACKET mode and then push
> a AF_PACKET backend into vhost. I'll post a patch later this week.
> 
> >>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> >>> using macvtap, and then propagate guest RX memory allocations to the NIC
> >>> using something like new .ndo_set_rx_buffers method.  
> >>
> >> I believe the page_pool API/design aligns with this idea/use-case.
> >>  
> >>> What is your view about interface between the page_pool and the NIC
> >>> drivers?  
> >>
> >> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> >> a page_pool per RX queue.  This is done for two reasons (1) performance
> >> and (2) for supporting use-cases where only one single RX-ring queue is
> >> (re)configured to support RX-zero-copy.  There are some associated
> >> extra cost of enabling this mode, thus it makes sense to only enable it
> >> when needed.
> >>
> >> I've not decided how this gets enabled, maybe some new driver NDO.  It
> >> could also happen when a XDP program gets loaded, which request this
> >> feature.
> >>
> >> The macvtap solution is nice and we should support it, but it requires
> >> VM to have their MAC-addr registered on the physical switch.  This
> >> design is about adding flexibility. Registering an XDP eBPF filter
> >> provides the maximum flexibility for matching the destination VM.  
> > 
> > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> > what needs to be done in BPF program to do proper conversion of skb to the
> > virtio descriptors.  
> 
> I don't think XDP has much to do with this code and they should be done
> separately. XDP runs eBPF code on received packets after the DMA engine
> has already placed the packet in memory so its too late in the process.

It does not have to be connected to XDP.  My idea should support RX
zero-copy into normal sockets, without XDP.

My idea was to pre-VMA map the RX ring, when zero-copy is requested,
thus it is not too late in the process.  When frame travel the normal
network stack, then require the SKB-read-only-page mode (skb-frags).
If the SKB reach a socket that support zero-copy, then we can do RX
zero-copy on normal sockets.

 
> The other piece here is enabling XDP in vhost but that is again separate
> IMO.
> 
> Notice that ixgbe supports pushing packets into a macvlan via 'tc'
> traffic steering commands so even though macvlan gets an L2 address it
> doesn't mean it can't use other criteria to steer traffic to it.

This sounds interesting. As this allow much more flexibility macvlan
matching, which I like, but still depending on HW support. 

 
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and virtual NIC, which is not the case with more generic tap
> > device. It could be that use of XDP will allow for a generic solution for
> > virtio case as well.  
> 
> Interesting this was one of the original ideas behind the macvlan
> offload mode. iirc Vlad also was interested in this.
> 
> I'm guessing this was used because of the ability to push macvlan onto
> its own queue?
> 
> >    
> >>  
> >>> Have you considered using "push" model for setting the NIC's RX memory?  
> >>
> >> I don't understand what you mean by a "push" model?  
> > 
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make NIC use of some
> > preallocated pages: either NIC driver will call an API (probably different
> > from alloc_page) to obtain that memory, or there will be NDO API that
> > allows to set the NIC's RX buffers. I named the later case "push".  
> 
> I prefer the ndo op. This matches up well with AF_PACKET model where we
> have "slots" and offload is just a transparent "push" of these "slots"
> to the driver. Below we have a snippet of our proposed API,

Hmmm. If you can rely on hardware setup to give you steering and
dedicated access to the RX rings.  In those cases, I guess, the "push"
model could be a more direct API approach.

I was shooting for a model that worked without hardware support.  And
then transparently benefit from HW support by configuring a HW filter
into a specific RX queue and attaching/using to that queue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12 17:13         ` Jesper Dangaard Brouer
@ 2016-12-12 18:06             ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-12 18:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Mike Rapoport, netdev, linux-mm,
	Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:

> Hmmm. If you can rely on hardware setup to give you steering and
> dedicated access to the RX rings.  In those cases, I guess, the "push"
> model could be a more direct API approach.

If the hardware does not support steering then one should be able to
provide those services in software.

> I was shooting for a model that worked without hardware support.  And
> then transparently benefit from HW support by configuring a HW filter
> into a specific RX queue and attaching/using to that queue.

The discussion here is a bit amusing since these issues have been resolved
a long time ago with the design of the RDMA subsystem. Zero copy is
already in wide use. Memory registration is used to pin down memory areas.
Work requests can be filed with the RDMA subsystem that then send and
receive packets from the registered memory regions. This is not strictly
remote memory access but this is a basic mode of operations supported  by
the RDMA subsystem. The mlx5 driver quoted here supports all of that.

What is bad about RDMA is that it is a separate kernel subsystem. What I
would like to see is a deeper integration with the network stack so that
memory regions can be registred with a network socket and work requests
then can be submitted and processed that directly read and write in these
regions. The network stack should provide the services that the hardware
of the NIC does not suppport as usual.

The RX/TX ring in user space should be an additional mode of operation of
the socket layer. Once that is in place the "Remote memory acces" can be
trivially implemented on top of that and the ugly RDMA sidecar subsystem
can go away.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-12 18:06             ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-12 18:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Mike Rapoport, netdev, linux-mm,
	Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:

> Hmmm. If you can rely on hardware setup to give you steering and
> dedicated access to the RX rings.  In those cases, I guess, the "push"
> model could be a more direct API approach.

If the hardware does not support steering then one should be able to
provide those services in software.

> I was shooting for a model that worked without hardware support.  And
> then transparently benefit from HW support by configuring a HW filter
> into a specific RX queue and attaching/using to that queue.

The discussion here is a bit amusing since these issues have been resolved
a long time ago with the design of the RDMA subsystem. Zero copy is
already in wide use. Memory registration is used to pin down memory areas.
Work requests can be filed with the RDMA subsystem that then send and
receive packets from the registered memory regions. This is not strictly
remote memory access but this is a basic mode of operations supported  by
the RDMA subsystem. The mlx5 driver quoted here supports all of that.

What is bad about RDMA is that it is a separate kernel subsystem. What I
would like to see is a deeper integration with the network stack so that
memory regions can be registred with a network socket and work requests
then can be submitted and processed that directly read and write in these
regions. The network stack should provide the services that the hardware
of the NIC does not suppport as usual.

The RX/TX ring in user space should be an additional mode of operation of
the socket layer. Once that is in place the "Remote memory acces" can be
trivially implemented on top of that and the ugly RDMA sidecar subsystem
can go away.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12 15:10         ` Jesper Dangaard Brouer
@ 2016-12-13  8:43           ` Mike Rapoport
  -1 siblings, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2016-12-13  8:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth

On Mon, Dec 12, 2016 at 04:10:26PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 12 Dec 2016 16:14:33 +0200
> Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> > 
> > They are copied :-)
> > Presuming we are dealing only with vhost backend, the received skb
> > eventually gets converted to IOVs, which in turn are copied to the guest
> > memory. The IOVs point to the guest memory that is allocated by virtio-net
> > running in the guest.
> 
> Thanks for explaining that. It seems like a lot of overhead. I have to
> wrap my head around this... so, the hardware NIC is receiving the
> packet/page, in the RX ring, and after converting it to IOVs, it is
> conceptually transmitted into the guest, and then the guest-side have a
> RX-function to handle this packet. Correctly understood?

Almost :)
For the hardware NIC driver, the receive just follows the "normal" path.
It creates an skb for the packet and passes it to the net core RX. Then the
skb is delivered to tap/macvtap. The later converts the skb to IOVs and
IOVs are pushed to the guest address space.

On the guest side, virtio-net sees these IOVs as a part of its RX ring, it
creates an skb for the packet and passes the skb to the net core of the
guest.

> > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> > what needs to be done in BPF program to do proper conversion of skb to the
> > virtio descriptors.
> 
> XDP is a step _before_ the SKB is allocated.  The XDP eBPF program can
> modify the packet-page data, but I don't think it is needed for your
> use-case.  View XDP (primarily) as an early (demux) filter.
> 
> XDP is missing a feature your need, which is TX packet into another
> net_device (I actually imagine a port mapping table, that point to a
> net_device).  This require a new "TX-raw" NDO that takes a page (+
> offset and length). 
> 
> I imagine, the virtio driver (virtio_net or a new driver?) getting
> extended with this new "TX-raw" NDO, that takes "raw" packet-pages.
>  Whether zero-copy is possible is determined by checking if page
> originates from a page_pool that have enabled zero-copy (and likely
> matching against a "protection domain" id number).
 
That could be quite a few drivers that will need to implement "TX-raw" then
:)
In general case, the virtual NIC may be connected to the physical network
via long chain of virtual devices such as bridge, veth and ovs.
Actually, because of that we wanted to concentrate on macvtap...
 
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and virtual NIC, which is not the case with more generic tap
> > device. It could be that use of XDP will allow for a generic solution for
> > virtio case as well.
> 
> You don't need an XDP filter, if you can make the HW do the early demux
> binding into a queue.  The check for if memory is zero-copy enabled
> would be the same.
> 
> > >   
> > > > Have you considered using "push" model for setting the NIC's RX memory?  
> > > 
> > > I don't understand what you mean by a "push" model?  
> > 
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make NIC use of some
> > preallocated pages: either NIC driver will call an API (probably different
> > from alloc_page) to obtain that memory, or there will be NDO API that
> > allows to set the NIC's RX buffers. I named the later case "push".
> 
> As you might have guessed, I'm not into the "push" model, because this
> means I cannot share the queue with the normal network stack.  Which I
> believe is possible as outlined (in email and [2]) and can be done with
> out HW filter features (like macvlan).

I think I should sleep on it a bit more :)
Probably we can add page_pool "backend" implementation to vhost...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-13  8:43           ` Mike Rapoport
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2016-12-13  8:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth

On Mon, Dec 12, 2016 at 04:10:26PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 12 Dec 2016 16:14:33 +0200
> Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> > 
> > They are copied :-)
> > Presuming we are dealing only with vhost backend, the received skb
> > eventually gets converted to IOVs, which in turn are copied to the guest
> > memory. The IOVs point to the guest memory that is allocated by virtio-net
> > running in the guest.
> 
> Thanks for explaining that. It seems like a lot of overhead. I have to
> wrap my head around this... so, the hardware NIC is receiving the
> packet/page, in the RX ring, and after converting it to IOVs, it is
> conceptually transmitted into the guest, and then the guest-side have a
> RX-function to handle this packet. Correctly understood?

Almost :)
For the hardware NIC driver, the receive just follows the "normal" path.
It creates an skb for the packet and passes it to the net core RX. Then the
skb is delivered to tap/macvtap. The later converts the skb to IOVs and
IOVs are pushed to the guest address space.

On the guest side, virtio-net sees these IOVs as a part of its RX ring, it
creates an skb for the packet and passes the skb to the net core of the
guest.

> > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> > what needs to be done in BPF program to do proper conversion of skb to the
> > virtio descriptors.
> 
> XDP is a step _before_ the SKB is allocated.  The XDP eBPF program can
> modify the packet-page data, but I don't think it is needed for your
> use-case.  View XDP (primarily) as an early (demux) filter.
> 
> XDP is missing a feature your need, which is TX packet into another
> net_device (I actually imagine a port mapping table, that point to a
> net_device).  This require a new "TX-raw" NDO that takes a page (+
> offset and length). 
> 
> I imagine, the virtio driver (virtio_net or a new driver?) getting
> extended with this new "TX-raw" NDO, that takes "raw" packet-pages.
>  Whether zero-copy is possible is determined by checking if page
> originates from a page_pool that have enabled zero-copy (and likely
> matching against a "protection domain" id number).
 
That could be quite a few drivers that will need to implement "TX-raw" then
:)
In general case, the virtual NIC may be connected to the physical network
via long chain of virtual devices such as bridge, veth and ovs.
Actually, because of that we wanted to concentrate on macvtap...
 
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and virtual NIC, which is not the case with more generic tap
> > device. It could be that use of XDP will allow for a generic solution for
> > virtio case as well.
> 
> You don't need an XDP filter, if you can make the HW do the early demux
> binding into a queue.  The check for if memory is zero-copy enabled
> would be the same.
> 
> > >   
> > > > Have you considered using "push" model for setting the NIC's RX memory?  
> > > 
> > > I don't understand what you mean by a "push" model?  
> > 
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make NIC use of some
> > preallocated pages: either NIC driver will call an API (probably different
> > from alloc_page) to obtain that memory, or there will be NDO API that
> > allows to set the NIC's RX buffers. I named the later case "push".
> 
> As you might have guessed, I'm not into the "push" model, because this
> means I cannot share the queue with the normal network stack.  Which I
> believe is possible as outlined (in email and [2]) and can be done with
> out HW filter features (like macvlan).

I think I should sleep on it a bit more :)
Probably we can add page_pool "backend" implementation to vhost...

--
Sincerely yours,
Mike. 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12 14:49       ` John Fastabend
  2016-12-12 17:13         ` Jesper Dangaard Brouer
@ 2016-12-13  9:42         ` Mike Rapoport
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2016-12-13  9:42 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, netdev, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On Mon, Dec 12, 2016 at 06:49:03AM -0800, John Fastabend wrote:
> On 16-12-12 06:14 AM, Mike Rapoport wrote:
> >>
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and virtual NIC, which is not the case with more generic tap
> > device. It could be that use of XDP will allow for a generic solution for
> > virtio case as well.
> 
> Interesting this was one of the original ideas behind the macvlan
> offload mode. iirc Vlad also was interested in this.
> 
> I'm guessing this was used because of the ability to push macvlan onto
> its own queue?

Yes, with a queue dedicated to a virtual NIC we only need to ensure that
guest memory is used for RX buffers. 
 
> >>
> >>> Have you considered using "push" model for setting the NIC's RX memory?
> >>
> >> I don't understand what you mean by a "push" model?
> > 
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make NIC use of some
> > preallocated pages: either NIC driver will call an API (probably different
> > from alloc_page) to obtain that memory, or there will be NDO API that
> > allows to set the NIC's RX buffers. I named the later case "push".
> 
> I prefer the ndo op. This matches up well with AF_PACKET model where we
> have "slots" and offload is just a transparent "push" of these "slots"
> to the driver. Below we have a snippet of our proposed API,
> 
> (https://patchwork.ozlabs.org/patch/396714/ note the descriptor mapping
> bits will be dropped)
> 
> + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
> + *				     struct net_device *dev)
> + *	Called to map queue pair range from split_queue_pairs into
> + *	mmap region.
> +
> 
> > +
> > +static int
> > +ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
> > +{
> > +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> > +	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
> > +	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> > +	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> > +	unsigned long dummy_page_phy;
> > +	pgprot_t pre_vm_page_prot;
> > +	unsigned long start;
> > +	unsigned int i;
> > +	int err;
> > +
> > +	if (!dummy_page_buf) {
> > +		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
> > +		if (!dummy_page_buf)
> > +			return -ENOMEM;
> > +
> > +		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
> > +			dummy_page_buf[i] = 0xdeadbeef;
> > +	}
> > +
> > +	dummy_page_phy = virt_to_phys(dummy_page_buf);
> > +	pre_vm_page_prot = vma->vm_page_prot;
> > +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > +
> > +	/* assume the vm_start is 4K aligned address */
> > +	for (start = vma->vm_start;
> > +	     start < vma->vm_end;
> > +	     start += PAGE_SIZE_4K) {
> > +		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
> > +			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
> > +					      vma->vm_page_prot);
> > +			if (err)
> > +				return -EAGAIN;
> > +		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
> > +			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
> > +					      vma->vm_page_prot);
> > +			if (err)
> > +				return -EAGAIN;
> > +		} else {
> > +			unsigned long addr = dummy_page_phy > PAGE_SHIFT;
> > +
> > +			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
> > +					      pre_vm_page_prot);
> > +			if (err)
> > +				return -EAGAIN;
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> 
> Any thoughts on something like the above? We could push it when net-next
> opens. One piece that fits naturally into vhost/macvtap is the kicks and
> queue splicing are already there so no need to implement this making the
> above patch much simpler.

Sorry, but I don't quite follow you here. The vhost does not use vma
mappings, it just sees a bunch of pages pointed by the vring descriptors...
 
> .John
 
--
Sincerely yours,
Mike.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-12 18:06             ` Christoph Lameter
  (?)
@ 2016-12-13 16:10             ` Jesper Dangaard Brouer
  2016-12-13 16:36                 ` Christoph Lameter
                                 ` (2 more replies)
  -1 siblings, 3 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-13 16:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: John Fastabend, Mike Rapoport, netdev, linux-mm,
	Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich, brouer


On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
> 
> > Hmmm. If you can rely on hardware setup to give you steering and
> > dedicated access to the RX rings.  In those cases, I guess, the "push"
> > model could be a more direct API approach.  
> 
> If the hardware does not support steering then one should be able to
> provide those services in software.

This is the early demux problem.  With the push-mode of registering
memory, you need hardware steering support, for zero-copy support, as
the software step happens after DMA engine have written into the memory.

My model pre-VMA map all the pages in the RX ring (if zero-copy gets
enabled, by a single user).  The software step can filter and zero-copy
send packet-pages to the application/socket that requested this. The
disadvantage is all zero-copy application need to share this VMA
mapping.  This is solved by configuring HW filters into a RX-queue, and
then only attach your zero-copy application to that queue.


> > I was shooting for a model that worked without hardware support.
> > And then transparently benefit from HW support by configuring a HW
> > filter into a specific RX queue and attaching/using to that queue.  
> 
> The discussion here is a bit amusing since these issues have been
> resolved a long time ago with the design of the RDMA subsystem. Zero
> copy is already in wide use. Memory registration is used to pin down
> memory areas. Work requests can be filed with the RDMA subsystem that
> then send and receive packets from the registered memory regions.
> This is not strictly remote memory access but this is a basic mode of
> operations supported  by the RDMA subsystem. The mlx5 driver quoted
> here supports all of that.

I hear what you are saying.  I will look into a push-model, as it might
be a better solution.
 I will read up on RDMA + verbs and learn more about their API model.  I
even plan to write a small sample program to get a feeling for the API,
and maybe we can use that as a baseline for the performance target we
can obtain on the same HW. (Thanks to Björn for already giving me some
pointer here)


> What is bad about RDMA is that it is a separate kernel subsystem.
> What I would like to see is a deeper integration with the network
> stack so that memory regions can be registred with a network socket
> and work requests then can be submitted and processed that directly
> read and write in these regions. The network stack should provide the
> services that the hardware of the NIC does not suppport as usual.

Interesting.  So you even imagine sockets registering memory regions
with the NIC.  If we had a proper NIC HW filter API across the drivers,
to register the steering rule (like ibv_create_flow), this would be
doable, but we don't (DPDK actually have an interesting proposal[1])

 
> The RX/TX ring in user space should be an additional mode of
> operation of the socket layer. Once that is in place the "Remote
> memory acces" can be trivially implemented on top of that and the
> ugly RDMA sidecar subsystem can go away.
 
I cannot follow that 100%, but I guess you are saying we also need a
more efficient mode of handing over pages/packet to userspace (than
going through the normal socket API calls).


Appreciate your input, it challenged my thinking.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://rawgit.com/6WIND/rte_flow/master/rte_flow.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-13 16:10             ` Jesper Dangaard Brouer
@ 2016-12-13 16:36                 ` Christoph Lameter
  2016-12-13 17:43                 ` John Fastabend
  2016-12-13 18:39               ` Hannes Frederic Sowa
  2 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-13 16:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Mike Rapoport, netdev, linux-mm,
	Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

[-- Attachment #1: Type: text/plain, Size: 3223 bytes --]

On Tue, 13 Dec 2016, Jesper Dangaard Brouer wrote:

> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.

Right. But we could fall back to software. Transfer to a kernel buffer and
then move stuff over. Not much of an improvment but it will make things
work.

> > The discussion here is a bit amusing since these issues have been
> > resolved a long time ago with the design of the RDMA subsystem. Zero
> > copy is already in wide use. Memory registration is used to pin down
> > memory areas. Work requests can be filed with the RDMA subsystem that
> > then send and receive packets from the registered memory regions.
> > This is not strictly remote memory access but this is a basic mode of
> > operations supported  by the RDMA subsystem. The mlx5 driver quoted
> > here supports all of that.
>
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)

Great.

> > What is bad about RDMA is that it is a separate kernel subsystem.
> > What I would like to see is a deeper integration with the network
> > stack so that memory regions can be registred with a network socket
> > and work requests then can be submitted and processed that directly
> > read and write in these regions. The network stack should provide the
> > services that the hardware of the NIC does not suppport as usual.
>
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])

Well doing this would mean adding some features and that also would at
best allow general support for zero copy direct to user space with a
fallback to software if the hardware is missing some feature.

> > The RX/TX ring in user space should be an additional mode of
> > operation of the socket layer. Once that is in place the "Remote
> > memory acces" can be trivially implemented on top of that and the
> > ugly RDMA sidecar subsystem can go away.
>
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).

A work request contains the user space address of the data to be sent
and/or received. The address must be in a registered memory region. This
is different from copying the packet into kernel data structures.

I think this can easily be generalized. We need support for registering
memory regions, submissions of work request and the processing of
completion requets. QP (queue-pair) processing is probably the basis for
the whole scheme that is used in multiple context these days.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-13 16:36                 ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-13 16:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Mike Rapoport, netdev, linux-mm,
	Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

[-- Attachment #1: Type: text/plain, Size: 3223 bytes --]

On Tue, 13 Dec 2016, Jesper Dangaard Brouer wrote:

> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.

Right. But we could fall back to software. Transfer to a kernel buffer and
then move stuff over. Not much of an improvment but it will make things
work.

> > The discussion here is a bit amusing since these issues have been
> > resolved a long time ago with the design of the RDMA subsystem. Zero
> > copy is already in wide use. Memory registration is used to pin down
> > memory areas. Work requests can be filed with the RDMA subsystem that
> > then send and receive packets from the registered memory regions.
> > This is not strictly remote memory access but this is a basic mode of
> > operations supported  by the RDMA subsystem. The mlx5 driver quoted
> > here supports all of that.
>
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to BjA?rn for already giving me some
> pointer here)

Great.

> > What is bad about RDMA is that it is a separate kernel subsystem.
> > What I would like to see is a deeper integration with the network
> > stack so that memory regions can be registred with a network socket
> > and work requests then can be submitted and processed that directly
> > read and write in these regions. The network stack should provide the
> > services that the hardware of the NIC does not suppport as usual.
>
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])

Well doing this would mean adding some features and that also would at
best allow general support for zero copy direct to user space with a
fallback to software if the hardware is missing some feature.

> > The RX/TX ring in user space should be an additional mode of
> > operation of the socket layer. Once that is in place the "Remote
> > memory acces" can be trivially implemented on top of that and the
> > ugly RDMA sidecar subsystem can go away.
>
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).

A work request contains the user space address of the data to be sent
and/or received. The address must be in a registered memory region. This
is different from copying the packet into kernel data structures.

I think this can easily be generalized. We need support for registering
memory regions, submissions of work request and the processing of
completion requets. QP (queue-pair) processing is probably the basis for
the whole scheme that is used in multiple context these days.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-13 16:10             ` Jesper Dangaard Brouer
@ 2016-12-13 17:43                 ` John Fastabend
  2016-12-13 17:43                 ` John Fastabend
  2016-12-13 18:39               ` Hannes Frederic Sowa
  2 siblings, 0 replies; 39+ messages in thread
From: John Fastabend @ 2016-12-13 17:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Christoph Lameter
  Cc: Mike Rapoport, netdev, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On 16-12-13 08:10 AM, Jesper Dangaard Brouer wrote:
> 
> On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
>> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
>>
>>> Hmmm. If you can rely on hardware setup to give you steering and
>>> dedicated access to the RX rings.  In those cases, I guess, the "push"
>>> model could be a more direct API approach.  
>>
>> If the hardware does not support steering then one should be able to
>> provide those services in software.
> 
> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.
> 
> My model pre-VMA map all the pages in the RX ring (if zero-copy gets
> enabled, by a single user).  The software step can filter and zero-copy
> send packet-pages to the application/socket that requested this. The

What does "zero-copy send packet-pages to the application/socket that
requested this" mean? At the moment on x86 page-flipping appears to be
more expensive than memcpy (I can post some data shortly) and shared
memory was proposed and rejected for security reasons when we were
working on bifurcated driver.

> disadvantage is all zero-copy application need to share this VMA
> mapping.  This is solved by configuring HW filters into a RX-queue, and
> then only attach your zero-copy application to that queue.
> 
> 
>>> I was shooting for a model that worked without hardware support.
>>> And then transparently benefit from HW support by configuring a HW
>>> filter into a specific RX queue and attaching/using to that queue.  
>>
>> The discussion here is a bit amusing since these issues have been
>> resolved a long time ago with the design of the RDMA subsystem. Zero
>> copy is already in wide use. Memory registration is used to pin down
>> memory areas. Work requests can be filed with the RDMA subsystem that
>> then send and receive packets from the registered memory regions.
>> This is not strictly remote memory access but this is a basic mode of
>> operations supported  by the RDMA subsystem. The mlx5 driver quoted
>> here supports all of that.
> 
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)
> 
> 
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registred with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not suppport as usual.
> 
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])
> 

Note rte_flow is in the same family of APIs as the proposed Flow API
that was rejected as well.  The features in Flow API that are not
included in the rte_flow proposal have logical extensions to support
them. In kernel we have 'tc' and multiple vendors support cls_flower
and cls_tc which offer a subset of the functionality in the DPDK
implementation.

Are you suggesting 'tc' is not a proper NIC HW filter API?

>  
>> The RX/TX ring in user space should be an additional mode of
>> operation of the socket layer. Once that is in place the "Remote
>> memory acces" can be trivially implemented on top of that and the
>> ugly RDMA sidecar subsystem can go away.
>  
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).
> 
> 
> Appreciate your input, it challenged my thinking.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-13 17:43                 ` John Fastabend
  0 siblings, 0 replies; 39+ messages in thread
From: John Fastabend @ 2016-12-13 17:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Christoph Lameter
  Cc: Mike Rapoport, netdev, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On 16-12-13 08:10 AM, Jesper Dangaard Brouer wrote:
> 
> On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
>> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
>>
>>> Hmmm. If you can rely on hardware setup to give you steering and
>>> dedicated access to the RX rings.  In those cases, I guess, the "push"
>>> model could be a more direct API approach.  
>>
>> If the hardware does not support steering then one should be able to
>> provide those services in software.
> 
> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.
> 
> My model pre-VMA map all the pages in the RX ring (if zero-copy gets
> enabled, by a single user).  The software step can filter and zero-copy
> send packet-pages to the application/socket that requested this. The

What does "zero-copy send packet-pages to the application/socket that
requested this" mean? At the moment on x86 page-flipping appears to be
more expensive than memcpy (I can post some data shortly) and shared
memory was proposed and rejected for security reasons when we were
working on bifurcated driver.

> disadvantage is all zero-copy application need to share this VMA
> mapping.  This is solved by configuring HW filters into a RX-queue, and
> then only attach your zero-copy application to that queue.
> 
> 
>>> I was shooting for a model that worked without hardware support.
>>> And then transparently benefit from HW support by configuring a HW
>>> filter into a specific RX queue and attaching/using to that queue.  
>>
>> The discussion here is a bit amusing since these issues have been
>> resolved a long time ago with the design of the RDMA subsystem. Zero
>> copy is already in wide use. Memory registration is used to pin down
>> memory areas. Work requests can be filed with the RDMA subsystem that
>> then send and receive packets from the registered memory regions.
>> This is not strictly remote memory access but this is a basic mode of
>> operations supported  by the RDMA subsystem. The mlx5 driver quoted
>> here supports all of that.
> 
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to BjA?rn for already giving me some
> pointer here)
> 
> 
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registred with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not suppport as usual.
> 
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])
> 

Note rte_flow is in the same family of APIs as the proposed Flow API
that was rejected as well.  The features in Flow API that are not
included in the rte_flow proposal have logical extensions to support
them. In kernel we have 'tc' and multiple vendors support cls_flower
and cls_tc which offer a subset of the functionality in the DPDK
implementation.

Are you suggesting 'tc' is not a proper NIC HW filter API?

>  
>> The RX/TX ring in user space should be an additional mode of
>> operation of the socket layer. Once that is in place the "Remote
>> memory acces" can be trivially implemented on top of that and the
>> ugly RDMA sidecar subsystem can go away.
>  
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).
> 
> 
> Appreciate your input, it challenged my thinking.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-13 16:10             ` Jesper Dangaard Brouer
  2016-12-13 16:36                 ` Christoph Lameter
  2016-12-13 17:43                 ` John Fastabend
@ 2016-12-13 18:39               ` Hannes Frederic Sowa
  2016-12-14 17:00                   ` Christoph Lameter
  2 siblings, 1 reply; 39+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-13 18:39 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Christoph Lameter
  Cc: John Fastabend, Mike Rapoport, netdev, linux-mm,
	Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On 13.12.2016 17:10, Jesper Dangaard Brouer wrote:
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registred with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not suppport as usual.
> 
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])

On a side note, this is what windows does with RIO ("registered I/O").
Maybe you want to look at the API to get some ideas: allocating and
pinning down memory in user space and registering that with sockets to
get zero-copy IO.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-13 17:43                 ` John Fastabend
  (?)
@ 2016-12-13 19:53                 ` David Miller
  2016-12-13 20:08                   ` John Fastabend
  -1 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2016-12-13 19:53 UTC (permalink / raw)
  To: john.fastabend
  Cc: brouer, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
	bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
	bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich

From: John Fastabend <john.fastabend@gmail.com>
Date: Tue, 13 Dec 2016 09:43:59 -0800

> What does "zero-copy send packet-pages to the application/socket that
> requested this" mean? At the moment on x86 page-flipping appears to be
> more expensive than memcpy (I can post some data shortly) and shared
> memory was proposed and rejected for security reasons when we were
> working on bifurcated driver.

The whole idea is that we map all the active RX ring pages into
userspace from the start.

And just how Jesper's page pool work will avoid DMA map/unmap,
it will also avoid changing the userspace mapping of the pages
as well.

Thus avoiding the TLB/VM overhead altogether.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-13 19:53                 ` David Miller
@ 2016-12-13 20:08                   ` John Fastabend
  2016-12-14  9:39                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 39+ messages in thread
From: John Fastabend @ 2016-12-13 20:08 UTC (permalink / raw)
  To: David Miller
  Cc: brouer, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
	bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
	bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich

On 16-12-13 11:53 AM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Tue, 13 Dec 2016 09:43:59 -0800
> 
>> What does "zero-copy send packet-pages to the application/socket that
>> requested this" mean? At the moment on x86 page-flipping appears to be
>> more expensive than memcpy (I can post some data shortly) and shared
>> memory was proposed and rejected for security reasons when we were
>> working on bifurcated driver.
> 
> The whole idea is that we map all the active RX ring pages into
> userspace from the start.
> 
> And just how Jesper's page pool work will avoid DMA map/unmap,
> it will also avoid changing the userspace mapping of the pages
> as well.
> 
> Thus avoiding the TLB/VM overhead altogether.
> 

I get this but it requires applications to be isolated. The pages from
a queue can not be shared between multiple applications in different
trust domains. And the application has to be cooperative meaning it
can't "look" at data that has not been marked by the stack as OK. In
these schemes we tend to end up with something like virtio/vhost or
af_packet.

Any ACLs/filtering/switching/headers need to be done in hardware or
the application trust boundaries are broken.

If the above can not be met then a copy is needed. What I am trying
to tease out is the above comment along with other statements like
this "can be done with out HW filter features".

.John

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-13 20:08                   ` John Fastabend
@ 2016-12-14  9:39                     ` Jesper Dangaard Brouer
  2016-12-14 16:32                       ` John Fastabend
  0 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-14  9:39 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
	bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
	bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich,
	brouer

On Tue, 13 Dec 2016 12:08:21 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-12-13 11:53 AM, David Miller wrote:
> > From: John Fastabend <john.fastabend@gmail.com>
> > Date: Tue, 13 Dec 2016 09:43:59 -0800
> >   
> >> What does "zero-copy send packet-pages to the application/socket that
> >> requested this" mean? At the moment on x86 page-flipping appears to be
> >> more expensive than memcpy (I can post some data shortly) and shared
> >> memory was proposed and rejected for security reasons when we were
> >> working on bifurcated driver.  
> > 
> > The whole idea is that we map all the active RX ring pages into
> > userspace from the start.
> > 
> > And just how Jesper's page pool work will avoid DMA map/unmap,
> > it will also avoid changing the userspace mapping of the pages
> > as well.
> > 
> > Thus avoiding the TLB/VM overhead altogether.
> >   

Exactly.  It is worth mentioning that pages entering the page pool need
to be cleared (measured cost 143 cycles), in order to not leak any
kernel info.  The primary focus of this design is to make sure not to
leak kernel info to userspace, but with an "exclusive" mode also
support isolation between applications.


> I get this but it requires applications to be isolated. The pages from
> a queue can not be shared between multiple applications in different
> trust domains. And the application has to be cooperative meaning it
> can't "look" at data that has not been marked by the stack as OK. In
> these schemes we tend to end up with something like virtio/vhost or
> af_packet.

I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
two would require CAP_NET_ADMIN privileges.  All modes have a trust
domain id, that need to match e.g. when page reach the socket.

Mode-1 "Shared": Application choose lowest isolation level, allowing
 multiple application to mmap VMA area.

Mode-2 "Single-user": Application request it want to be the only user
 of the RX queue.  This blocks other application to mmap VMA area.

Mode-3 "Exclusive": Application request to own RX queue.  Packets are
 no longer allowed for normal netstack delivery.

Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
still allowed to travel netstack and thus can contain packet data from
other normal applications.  This is part of the design, to share the
NIC between netstack and an accelerated userspace application using RX
zero-copy delivery.


> Any ACLs/filtering/switching/headers need to be done in hardware or
> the application trust boundaries are broken.

The software solution outlined allow the application to make the choice
of what trust boundary it wants.

The "exclusive" mode-3 make most sense together with HW filters.
Already today, we support creating a new RX queue based on ethtool
ntuple HW filter and then you simply attach your application that queue
in mode-3, and have full isolation.

 
> If the above can not be met then a copy is needed. What I am trying
> to tease out is the above comment along with other statements like
> this "can be done with out HW filter features".

Does this address your concerns?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14  9:39                     ` Jesper Dangaard Brouer
@ 2016-12-14 16:32                       ` John Fastabend
  2016-12-14 16:45                         ` Alexander Duyck
  2016-12-14 21:04                         ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 39+ messages in thread
From: John Fastabend @ 2016-12-14 16:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
	bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
	bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich

On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> On Tue, 13 Dec 2016 12:08:21 -0800
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 16-12-13 11:53 AM, David Miller wrote:
>>> From: John Fastabend <john.fastabend@gmail.com>
>>> Date: Tue, 13 Dec 2016 09:43:59 -0800
>>>   
>>>> What does "zero-copy send packet-pages to the application/socket that
>>>> requested this" mean? At the moment on x86 page-flipping appears to be
>>>> more expensive than memcpy (I can post some data shortly) and shared
>>>> memory was proposed and rejected for security reasons when we were
>>>> working on bifurcated driver.  
>>>
>>> The whole idea is that we map all the active RX ring pages into
>>> userspace from the start.
>>>
>>> And just how Jesper's page pool work will avoid DMA map/unmap,
>>> it will also avoid changing the userspace mapping of the pages
>>> as well.
>>>
>>> Thus avoiding the TLB/VM overhead altogether.
>>>   
> 
> Exactly.  It is worth mentioning that pages entering the page pool need
> to be cleared (measured cost 143 cycles), in order to not leak any
> kernel info.  The primary focus of this design is to make sure not to
> leak kernel info to userspace, but with an "exclusive" mode also
> support isolation between applications.
> 
> 
>> I get this but it requires applications to be isolated. The pages from
>> a queue can not be shared between multiple applications in different
>> trust domains. And the application has to be cooperative meaning it
>> can't "look" at data that has not been marked by the stack as OK. In
>> these schemes we tend to end up with something like virtio/vhost or
>> af_packet.
> 
> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
> two would require CAP_NET_ADMIN privileges.  All modes have a trust
> domain id, that need to match e.g. when page reach the socket.

Even mode 3 should required cap_net_admin we don't want userspace to
grab queues off the nic without it IMO.

> 
> Mode-1 "Shared": Application choose lowest isolation level, allowing
>  multiple application to mmap VMA area.

My only point here is applications can read each others data and all
applications need to cooperate for example one app could try to write
continuously to read only pages causing faults and what not. This is
all non standard and doesn't play well with cgroups and "normal"
applications. It requires a new orchestration model.

I'm a bit skeptical of the use case but I know of a handful of reasons
to use this model. Maybe take a look at the ivshmem implementation in
DPDK.

Also this still requires a hardware filter to push "application" traffic
onto reserved queues/pages as far as I can tell.

> 
> Mode-2 "Single-user": Application request it want to be the only user
>  of the RX queue.  This blocks other application to mmap VMA area.
> 

Assuming data is read-only sharing with the stack is possibly OK :/. I
guess you would need to pools of memory for data and skb so you don't
leak skb into user space.

The devils in the details here. There are lots of hooks in the kernel
that can for example push the packet with a 'redirect' tc action for
example. And letting an app "read" data or impact performance of an
unrelated application is wrong IMO. Stacked devices also provide another
set of details that are a bit difficult to track down see all the
hardware offload efforts.

I assume all these concerns are shared between mode-1 and mode-2

> Mode-3 "Exclusive": Application request to own RX queue.  Packets are
>  no longer allowed for normal netstack delivery.
> 

I have patches for this mode already but haven't pushed them due to
an alternative solution using VFIO.

> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> still allowed to travel netstack and thus can contain packet data from
> other normal applications.  This is part of the design, to share the
> NIC between netstack and an accelerated userspace application using RX
> zero-copy delivery.
> 

I don't think this is acceptable to be honest. Letting an application
potentially read/impact other arbitrary applications on the system
seems like a non-starter even with CAP_NET_ADMIN. At least this was
the conclusion from bifurcated driver work some time ago.

> 
>> Any ACLs/filtering/switching/headers need to be done in hardware or
>> the application trust boundaries are broken.
> 
> The software solution outlined allow the application to make the choice
> of what trust boundary it wants.
> 
> The "exclusive" mode-3 make most sense together with HW filters.
> Already today, we support creating a new RX queue based on ethtool
> ntuple HW filter and then you simply attach your application that queue
> in mode-3, and have full isolation.
> 

Still pretty fuzzy on why mode-1 and mode-2 do not need hw filters?
Without hardware filters we have no way of knowing who/what data is
put in the page.

>  
>> If the above can not be met then a copy is needed. What I am trying
>> to tease out is the above comment along with other statements like
>> this "can be done with out HW filter features".
> 
> Does this address your concerns?
> 

I think we need to enforce strong isolation. An application should not
be able to read data or impact other applications. I gather this is
the case per comment about normal applications in mode-2. A slightly
weaker statement would be to say applications can only impace/read data
of other applications in their domain. This might be OK as well.

.John

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 16:32                       ` John Fastabend
@ 2016-12-14 16:45                         ` Alexander Duyck
  2016-12-14 21:29                           ` Jesper Dangaard Brouer
  2016-12-14 21:04                         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2016-12-14 16:45 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, David Miller, Christoph Lameter, rppt,
	Netdev, linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich

On Wed, Dec 14, 2016 at 8:32 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
>> On Tue, 13 Dec 2016 12:08:21 -0800
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>
>>> On 16-12-13 11:53 AM, David Miller wrote:
>>>> From: John Fastabend <john.fastabend@gmail.com>
>>>> Date: Tue, 13 Dec 2016 09:43:59 -0800
>>>>
>>>>> What does "zero-copy send packet-pages to the application/socket that
>>>>> requested this" mean? At the moment on x86 page-flipping appears to be
>>>>> more expensive than memcpy (I can post some data shortly) and shared
>>>>> memory was proposed and rejected for security reasons when we were
>>>>> working on bifurcated driver.
>>>>
>>>> The whole idea is that we map all the active RX ring pages into
>>>> userspace from the start.
>>>>
>>>> And just how Jesper's page pool work will avoid DMA map/unmap,
>>>> it will also avoid changing the userspace mapping of the pages
>>>> as well.
>>>>
>>>> Thus avoiding the TLB/VM overhead altogether.
>>>>
>>
>> Exactly.  It is worth mentioning that pages entering the page pool need
>> to be cleared (measured cost 143 cycles), in order to not leak any
>> kernel info.  The primary focus of this design is to make sure not to
>> leak kernel info to userspace, but with an "exclusive" mode also
>> support isolation between applications.
>>
>>
>>> I get this but it requires applications to be isolated. The pages from
>>> a queue can not be shared between multiple applications in different
>>> trust domains. And the application has to be cooperative meaning it
>>> can't "look" at data that has not been marked by the stack as OK. In
>>> these schemes we tend to end up with something like virtio/vhost or
>>> af_packet.
>>
>> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
>> two would require CAP_NET_ADMIN privileges.  All modes have a trust
>> domain id, that need to match e.g. when page reach the socket.
>
> Even mode 3 should required cap_net_admin we don't want userspace to
> grab queues off the nic without it IMO.
>
>>
>> Mode-1 "Shared": Application choose lowest isolation level, allowing
>>  multiple application to mmap VMA area.
>
> My only point here is applications can read each others data and all
> applications need to cooperate for example one app could try to write
> continuously to read only pages causing faults and what not. This is
> all non standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
>
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
>
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
>
>>
>> Mode-2 "Single-user": Application request it want to be the only user
>>  of the RX queue.  This blocks other application to mmap VMA area.
>>
>
> Assuming data is read-only sharing with the stack is possibly OK :/. I
> guess you would need to pools of memory for data and skb so you don't
> leak skb into user space.
>
> The devils in the details here. There are lots of hooks in the kernel
> that can for example push the packet with a 'redirect' tc action for
> example. And letting an app "read" data or impact performance of an
> unrelated application is wrong IMO. Stacked devices also provide another
> set of details that are a bit difficult to track down see all the
> hardware offload efforts.
>
> I assume all these concerns are shared between mode-1 and mode-2
>
>> Mode-3 "Exclusive": Application request to own RX queue.  Packets are
>>  no longer allowed for normal netstack delivery.
>>
>
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.
>
>> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
>> still allowed to travel netstack and thus can contain packet data from
>> other normal applications.  This is part of the design, to share the
>> NIC between netstack and an accelerated userspace application using RX
>> zero-copy delivery.
>>
>
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.

I agree.  This is a no-go from the performance perspective as well.
At a minimum you would have to be zeroing out the page between uses to
avoid leaking data, and that assumes that the program we are sending
the pages to is slightly well behaved.  If we think zeroing out an
sk_buff is expensive wait until we are trying to do an entire 4K page.

I think we are stuck with having to use a HW filter to split off
application traffic to a specific ring, and then having to share the
memory between the application and the kernel on that ring only.  Any
other approach just opens us up to all sorts of security concerns
since it would be possible for the application to try to read and
possibly write any data it wants into the buffers.

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-13 18:39               ` Hannes Frederic Sowa
@ 2016-12-14 17:00                   ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-14 17:00 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, John Fastabend, Mike Rapoport, netdev,
	linux-mm, Willem de Bruijn, Björn Töpel, Karlsson,
	Magnus, Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:

> > Interesting.  So you even imagine sockets registering memory regions
> > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > to register the steering rule (like ibv_create_flow), this would be
> > doable, but we don't (DPDK actually have an interesting proposal[1])
>
> On a side note, this is what windows does with RIO ("registered I/O").
> Maybe you want to look at the API to get some ideas: allocating and
> pinning down memory in user space and registering that with sockets to
> get zero-copy IO.

Yup that is also what I think. Regarding the memory registration and flow
steering for user space RX/TX ring please look at the qpair model
implemented by the RDMA subsystem in the kernel. The memory semantics are
clearly established there and have been in use for more than a decade.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-14 17:00                   ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-14 17:00 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, John Fastabend, Mike Rapoport, netdev,
	linux-mm, Willem de Bruijn, Björn Töpel, Karlsson,
	Magnus, Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:

> > Interesting.  So you even imagine sockets registering memory regions
> > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > to register the steering rule (like ibv_create_flow), this would be
> > doable, but we don't (DPDK actually have an interesting proposal[1])
>
> On a side note, this is what windows does with RIO ("registered I/O").
> Maybe you want to look at the API to get some ideas: allocating and
> pinning down memory in user space and registering that with sockets to
> get zero-copy IO.

Yup that is also what I think. Regarding the memory registration and flow
steering for user space RX/TX ring please look at the qpair model
implemented by the RDMA subsystem in the kernel. The memory semantics are
clearly established there and have been in use for more than a decade.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 17:00                   ` Christoph Lameter
  (?)
@ 2016-12-14 17:37                   ` David Laight
  2016-12-14 19:43                       ` Christoph Lameter
  -1 siblings, 1 reply; 39+ messages in thread
From: David Laight @ 2016-12-14 17:37 UTC (permalink / raw)
  To: 'Christoph Lameter', Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, John Fastabend, Mike Rapoport, netdev,
	linux-mm, Willem de Bruijn, Björn Töpel, Karlsson,
	Magnus, Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

From: Christoph Lameter
> Sent: 14 December 2016 17:00
> On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:
> 
> > > Interesting.  So you even imagine sockets registering memory regions
> > > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > > to register the steering rule (like ibv_create_flow), this would be
> > > doable, but we don't (DPDK actually have an interesting proposal[1])
> >
> > On a side note, this is what windows does with RIO ("registered I/O").
> > Maybe you want to look at the API to get some ideas: allocating and
> > pinning down memory in user space and registering that with sockets to
> > get zero-copy IO.
> 
> Yup that is also what I think. Regarding the memory registration and flow
> steering for user space RX/TX ring please look at the qpair model
> implemented by the RDMA subsystem in the kernel. The memory semantics are
> clearly established there and have been in use for more than a decade.

Isn't there a bigger problem for transmit?
If the kernel is doing ANY validation on the frames it must copy the
data to memory the application cannot modify before doing the validation.
Otherwise the application could change the data afterwards.

	David


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 17:37                   ` David Laight
@ 2016-12-14 19:43                       ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-14 19:43 UTC (permalink / raw)
  To: David Laight
  Cc: Hannes Frederic Sowa, Jesper Dangaard Brouer, John Fastabend,
	Mike Rapoport, netdev, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth

On Wed, 14 Dec 2016, David Laight wrote:

> If the kernel is doing ANY validation on the frames it must copy the
> data to memory the application cannot modify before doing the validation.
> Otherwise the application could change the data afterwards.

The application is not allowed to change the data after a work request has
been submitted to send the frame. Changes are possible after the
completion request has been received.

The kernel can enforce that by making the frame(s) readonly and thus
getting a page fault if the app would do such a thing.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-14 19:43                       ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-14 19:43 UTC (permalink / raw)
  To: David Laight
  Cc: Hannes Frederic Sowa, Jesper Dangaard Brouer, John Fastabend,
	Mike Rapoport, netdev, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On Wed, 14 Dec 2016, David Laight wrote:

> If the kernel is doing ANY validation on the frames it must copy the
> data to memory the application cannot modify before doing the validation.
> Otherwise the application could change the data afterwards.

The application is not allowed to change the data after a work request has
been submitted to send the frame. Changes are possible after the
completion request has been received.

The kernel can enforce that by making the frame(s) readonly and thus
getting a page fault if the app would do such a thing.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 19:43                       ` Christoph Lameter
@ 2016-12-14 20:37                         ` Hannes Frederic Sowa
  -1 siblings, 0 replies; 39+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-14 20:37 UTC (permalink / raw)
  To: Christoph Lameter, David Laight
  Cc: Jesper Dangaard Brouer, John Fastabend, Mike Rapoport, netdev,
	linux-mm, Willem de Bruijn, Björn Töpel, Karlsson,
	Magnus, Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On 14.12.2016 20:43, Christoph Lameter wrote:
> On Wed, 14 Dec 2016, David Laight wrote:
> 
>> If the kernel is doing ANY validation on the frames it must copy the
>> data to memory the application cannot modify before doing the validation.
>> Otherwise the application could change the data afterwards.
> 
> The application is not allowed to change the data after a work request has
> been submitted to send the frame. Changes are possible after the
> completion request has been received.
> 
> The kernel can enforce that by making the frame(s) readonly and thus
> getting a page fault if the app would do such a thing.

As far as I remember right now, if you gift with vmsplice the memory
over a pipe to a tcp socket, you can in fact change the user data while
the data is in transmit. So you should not touch the memory region until
you received a SOF_TIMESTAMPING_TX_ACK error message in your sockets
error queue or stuff might break horribly. I don't think we have a
proper event for UDP that fires after we know the data left the hardware.

In my opinion this is still fine within the kernel protection limits.
E.g. due to scatter gather I/O you don't get access to the TCP header
nor UDP header and thus can't e.g. spoof or modify the header or
administration policies, albeit TOCTTOU races with netfilter which
matches inside the TCP/UDP packets are very well possible on transmit.

Wouldn't changing of the pages cause expensive TLB flushes?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
@ 2016-12-14 20:37                         ` Hannes Frederic Sowa
  0 siblings, 0 replies; 39+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-14 20:37 UTC (permalink / raw)
  To: Christoph Lameter, David Laight
  Cc: Jesper Dangaard Brouer, John Fastabend, Mike Rapoport, netdev,
	linux-mm, Willem de Bruijn, Björn Töpel, Karlsson,
	Magnus, Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On 14.12.2016 20:43, Christoph Lameter wrote:
> On Wed, 14 Dec 2016, David Laight wrote:
> 
>> If the kernel is doing ANY validation on the frames it must copy the
>> data to memory the application cannot modify before doing the validation.
>> Otherwise the application could change the data afterwards.
> 
> The application is not allowed to change the data after a work request has
> been submitted to send the frame. Changes are possible after the
> completion request has been received.
> 
> The kernel can enforce that by making the frame(s) readonly and thus
> getting a page fault if the app would do such a thing.

As far as I remember right now, if you gift with vmsplice the memory
over a pipe to a tcp socket, you can in fact change the user data while
the data is in transmit. So you should not touch the memory region until
you received a SOF_TIMESTAMPING_TX_ACK error message in your sockets
error queue or stuff might break horribly. I don't think we have a
proper event for UDP that fires after we know the data left the hardware.

In my opinion this is still fine within the kernel protection limits.
E.g. due to scatter gather I/O you don't get access to the TCP header
nor UDP header and thus can't e.g. spoof or modify the header or
administration policies, albeit TOCTTOU races with netfilter which
matches inside the TCP/UDP packets are very well possible on transmit.

Wouldn't changing of the pages cause expensive TLB flushes?

Bye,
Hannes

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 16:32                       ` John Fastabend
  2016-12-14 16:45                         ` Alexander Duyck
@ 2016-12-14 21:04                         ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-14 21:04 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
	bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
	bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich,
	brouer

On Wed, 14 Dec 2016 08:32:10 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 13 Dec 2016 12:08:21 -0800
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> >> On 16-12-13 11:53 AM, David Miller wrote:  
> >>> From: John Fastabend <john.fastabend@gmail.com>
> >>> Date: Tue, 13 Dec 2016 09:43:59 -0800
> >>>     
> >>>> What does "zero-copy send packet-pages to the application/socket that
> >>>> requested this" mean? At the moment on x86 page-flipping appears to be
> >>>> more expensive than memcpy (I can post some data shortly) and shared
> >>>> memory was proposed and rejected for security reasons when we were
> >>>> working on bifurcated driver.    
> >>>
> >>> The whole idea is that we map all the active RX ring pages into
> >>> userspace from the start.
> >>>
> >>> And just how Jesper's page pool work will avoid DMA map/unmap,
> >>> it will also avoid changing the userspace mapping of the pages
> >>> as well.
> >>>
> >>> Thus avoiding the TLB/VM overhead altogether.
> >>>     
> > 
> > Exactly.  It is worth mentioning that pages entering the page pool need
> > to be cleared (measured cost 143 cycles), in order to not leak any
> > kernel info.  The primary focus of this design is to make sure not to
> > leak kernel info to userspace, but with an "exclusive" mode also
> > support isolation between applications.
> > 
> >   
> >> I get this but it requires applications to be isolated. The pages from
> >> a queue can not be shared between multiple applications in different
> >> trust domains. And the application has to be cooperative meaning it
> >> can't "look" at data that has not been marked by the stack as OK. In
> >> these schemes we tend to end up with something like virtio/vhost or
> >> af_packet.  
> > 
> > I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
> > two would require CAP_NET_ADMIN privileges.  All modes have a trust
> > domain id, that need to match e.g. when page reach the socket.  
> 
> Even mode 3 should required cap_net_admin we don't want userspace to
> grab queues off the nic without it IMO.

Good point.

> > 
> > Mode-1 "Shared": Application choose lowest isolation level, allowing
> >  multiple application to mmap VMA area.  
> 
> My only point here is applications can read each others data and all
> applications need to cooperate for example one app could try to write
> continuously to read only pages causing faults and what not. This is
> all non standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
> 
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
> 
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
> 
> > 
> > Mode-2 "Single-user": Application request it want to be the only user
> >  of the RX queue.  This blocks other application to mmap VMA area.
> >   
> 
> Assuming data is read-only sharing with the stack is possibly OK :/. I
> guess you would need to pools of memory for data and skb so you don't
> leak skb into user space.

Yes, as describe in orig email and here[1]: "once an application
request zero-copy RX, then the driver must use a specific SKB
allocation mode and might have to reconfigure the RX-ring."

The SKB allocation mode is "read-only packet page", which is the
current default mode (also desc in document[1]) of using skb-frags.

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
 
> The devils in the details here. There are lots of hooks in the kernel
> that can for example push the packet with a 'redirect' tc action for
> example. And letting an app "read" data or impact performance of an
> unrelated application is wrong IMO. Stacked devices also provide another
> set of details that are a bit difficult to track down see all the
> hardware offload efforts.
> 
> I assume all these concerns are shared between mode-1 and mode-2
> 
> > Mode-3 "Exclusive": Application request to own RX queue.  Packets are
> >  no longer allowed for normal netstack delivery.
> >   
> 
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.

Interesting.

> > Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> > still allowed to travel netstack and thus can contain packet data from
> > other normal applications.  This is part of the design, to share the
> > NIC between netstack and an accelerated userspace application using RX
> > zero-copy delivery.
> >   
> 
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.

I though the bifurcated driver work was rejected because it could leak
kernel info in the pages. This approach cannot.

   
> >> Any ACLs/filtering/switching/headers need to be done in hardware or
> >> the application trust boundaries are broken.  
> > 
> > The software solution outlined allow the application to make the
> > choice of what trust boundary it wants.
> > 
> > The "exclusive" mode-3 make most sense together with HW filters.
> > Already today, we support creating a new RX queue based on ethtool
> > ntuple HW filter and then you simply attach your application that
> > queue in mode-3, and have full isolation.
> >   
> 
> Still pretty fuzzy on why mode-1 and mode-2 do not need hw filters?
> Without hardware filters we have no way of knowing who/what data is
> put in the page.

For sockets, an SKB carrying a RX zero-copy-able page can be steered
(as normal) into a given socket. Then we check if socket requested
zero-copy, and verify if the domain-id match between the page_pool and
socket.

You can also use XDP to filter and steer the packet (which will be
faster and using normal steering code). 

> >    
> >> If the above can not be met then a copy is needed. What I am trying
> >> to tease out is the above comment along with other statements like
> >> this "can be done with out HW filter features".  
> > 
> > Does this address your concerns?
> >   
> 
> I think we need to enforce strong isolation. An application should not
> be able to read data or impact other applications. I gather this is
> the case per comment about normal applications in mode-2. A slightly
> weaker statement would be to say applications can only impace/read
> data of other applications in their domain. This might be OK as well.

I think this approach covers the "weaker statement".  Because only page
within the pool are "exposed".  Thus, the domain is the NIC (possibly
restricted to a single RX queue).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 20:37                         ` Hannes Frederic Sowa
  (?)
@ 2016-12-14 21:22                         ` Christoph Lameter
  -1 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-14 21:22 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Laight, Jesper Dangaard Brouer, John Fastabend,
	Mike Rapoport, netdev, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich

On Wed, 14 Dec 2016, Hannes Frederic Sowa wrote:

> Wouldn't changing of the pages cause expensive TLB flushes?

Yes so you would only want that feature if its realized at the page
table level for debugging issues.

Once you have memory registered with the hardware device then also the
device could itself could perform snooping to realize that data was
changed and thus abort the operation.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 16:45                         ` Alexander Duyck
@ 2016-12-14 21:29                           ` Jesper Dangaard Brouer
  2016-12-14 22:45                             ` Alexander Duyck
  0 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-14 21:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: John Fastabend, David Miller, Christoph Lameter, rppt, Netdev,
	linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich, brouer

On Wed, 14 Dec 2016 08:45:08 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> I agree.  This is a no-go from the performance perspective as well.
> At a minimum you would have to be zeroing out the page between uses to
> avoid leaking data, and that assumes that the program we are sending
> the pages to is slightly well behaved.  If we think zeroing out an
> sk_buff is expensive wait until we are trying to do an entire 4K page.

Again, yes the page will be zero'ed out, but only when entering the
page_pool. Because they are recycled they are not cleared on every use.
Thus, performance does not suffer.

Besides clearing large mem area is not as bad as clearing small.
Clearing an entire page does cost something, as mentioned before 143
cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
cost 36 cycles which is only 7 bytes-per-cycle (256/36).


> I think we are stuck with having to use a HW filter to split off
> application traffic to a specific ring, and then having to share the
> memory between the application and the kernel on that ring only.  Any
> other approach just opens us up to all sorts of security concerns
> since it would be possible for the application to try to read and
> possibly write any data it wants into the buffers.

This is why I wrote a document[1], trying to outline how this is possible,
going through all the combinations, and asking the community to find
faults in my idea.  Inlining it again, as nobody really replied on the
content of the doc.

- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

===========================
Memory Model for Networking
===========================

This design describes how the page_pool change the memory model for
networking in the NIC (Network Interface Card) drivers.

.. Note:: The catch for driver developers is that, once an application
          request zero-copy RX, then the driver must use a specific
          SKB allocation mode and might have to reconfigure the
          RX-ring.


Design target
=============

Allow the NIC to function as a normal Linux NIC and be shared in a
safe manor, between the kernel network stack and an accelerated
userspace application using RX zero-copy delivery.

Target is to provide the basis for building RX zero-copy solutions in
a memory safe manor.  An efficient communication channel for userspace
delivery is out of scope for this document, but OOM considerations are
discussed below (`Userspace delivery and OOM`_).

Background
==========

The SKB or ``struct sk_buff`` is the fundamental meta-data structure
for network packets in the Linux Kernel network stack.  It is a fairly
complex object and can be constructed in several ways.

From a memory perspective there are two ways depending on
RX-buffer/page state:

1) Writable packet page
2) Read-only packet page

To take full potential of the page_pool, the drivers must actually
support handling both options depending on the configuration state of
the page_pool.

Writable packet page
--------------------

When the RX packet page is writable, the SKB setup is fairly straight
forward.  The SKB->data (and skb->head) can point directly to the page
data, adjusting the offset according to drivers headroom (for adding
headers) and setting the length according to the DMA descriptor info.

The page/data need to be writable, because the network stack need to
adjust headers (like TimeToLive and checksum) or even add or remove
headers for encapsulation purposes.

A subtle catch, which also requires a writable page, is that the SKB
also have an accompanying "shared info" data-structure ``struct
skb_shared_info``.  This "skb_shared_info" is written into the
skb->data memory area at the end (skb->end) of the (header) data.  The
skb_shared_info contains semi-sensitive information, like kernel
memory pointers to other pages (which might be pointers to more packet
data).  This would be bad from a zero-copy point of view to leak this
kind of information.

Read-only packet page
---------------------

When the RX packet page is read-only, the construction of the SKB is
significantly more complicated and even involves one more memory
allocation.

1) Allocate a new separate writable memory area, and point skb->data
   here.  This is needed due to (above described) skb_shared_info.

2) Memcpy packet headers into this (skb->data) area.

3) Clear part of skb_shared_info struct in writable-area.

4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
   and adjust the page_offset to be past the headers just copied.

It is useful (later) that the network stack have this notion that part
of the packet and a page can be read-only.  This implies that the
kernel will not "pollute" this memory with any sensitive information.
This is good from a zero-copy point of view, but bad from a
performance perspective.


NIC RX Zero-Copy
================

Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
involves costly mapping and unmapping operations in the address space
of the userspace process.  Plus for doing this safely, the page memory
need to be cleared before using it, to avoid leaking kernel
information to userspace, also a costly operation.  The page_pool base
"class" of optimization is moving these kind of operations out of the
fastpath, by recycling and lifetime control.

Once a NIC RX-queue's page_pool have been configured for zero-copy
into userspace, then can packets still be allowed to travel the normal
stack?

Yes, this should be possible, because the driver can use the
SKB-read-only mode, which avoids polluting the page data with
kernel-side sensitive data.  This implies, when a driver RX-queue
switch page_pool to RX-zero-copy mode it MUST also switch to
SKB-read-only mode (for normal stack delivery for this RXq).

XDP can be used for controlling which pages that gets RX zero-copied
to userspace.  The page is still writable for the XDP program, but
read-only for normal stack delivery.


Kernel safety
-------------

For the paranoid, how do we protect the kernel from a malicious
userspace program.  Sure there will be a communication interface
between kernel and userspace, that synchronize ownership of pages.
But a userspace program can violate this interface, given pages are
kept VMA mapped, the program can in principle access all the memory
pages in the given page_pool.  This opens up for a malicious (or
defect) program modifying memory pages concurrently with the kernel
and DMA engine using them.

An easy way to get around userspace modifying page data contents is
simply to map pages read-only into userspace.

.. Note:: The first implementation target is read-only zero-copy RX
          page to userspace and require driver to use SKB-read-only
          mode.

Advanced: Allowing userspace write access?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What if userspace need write access? Flipping the page permissions per
transfer will likely kill performance (as this likely affects the
TLB-cache).

I will argue that giving userspace write access is still possible,
without risking a kernel crash.  This is related to the SKB-read-only
mode that copies the packet headers (in to another memory area,
inaccessible to userspace).  The attack angle is to modify packet
headers after they passed some kernel network stack validation step
(as once headers are copied they are out of "reach").

Situation classes where memory page can be modified concurrently:

1) When DMA engine owns the page.  Not a problem, as DMA engine will
   simply overwrite data.

2) Just after DMA engine finish writing.  Not a problem, the packet
   will go through netstack validation and be rejected.

3) While XDP reads data. This can lead to XDP/eBPF program goes into a
   wrong code branch, but the eBPF virtual machine should not be able
   to crash the kernel. The worst outcome is a wrong or invalid XDP
   return code.

4) Before SKB with read-only page is constructed. Not a problem, the
   packet will go through netstack validation and be rejected.

5) After SKB with read-only page has been constructed.  Remember the
   packet headers were copied into a separate memory area, and the
   page data is pointed to with an offset passed the copied headers.
   Thus, userspace cannot modify the headers used for netstack
   validation.  It can only modify packet data contents, which is less
   critical as it cannot crash the kernel, and eventually this will be
   caught by packet checksum validation.

6) After netstack delivered packet to another userspace process. Not a
   problem, as it cannot crash the kernel.  It might corrupt
   packet-data being read by another userspace process, which one
   argument for requiring elevated privileges to get write access
   (like NET_CAP_ADMIN).


Userspace delivery and OOM
--------------------------

These RX pages are likely mapped to userspace via mmap(), so-far so
good.  It is key to performance to get an efficient way of signaling
between kernel and userspace, e.g what page are ready for consumption,
and when userspace are done with the page.

It is outside the scope of page_pool to provide such a queuing
structure, but the page_pool can offer some means of protecting the
system resource usage.  It is a classical problem that resources
(e.g. the page) must be returned in a timely manor, else the system,
in this case, will run out of memory.  Any system/design with
unbounded memory allocation can lead to Out-Of-Memory (OOM)
situations.

Communication between kernel and userspace is likely going to be some
kind of queue.  Given transferring packets individually will have too
much scheduling overhead.  A queue can implicitly function as a
bulking interface, and offers a natural way to split the workload
across CPU cores.

This essentially boils down-to a two queue system, with the RX-ring
queue and the userspace delivery queue.

Two bad situations exists for the userspace queue:

1) Userspace is not consuming objects fast-enough. This should simply
   result in packets getting dropped when enqueueing to a full
   userspace queue (as queue *must* implement some limit). Open
   question is; should this be reported or communicated to userspace.

2) Userspace is consuming objects fast, but not returning them in a
   timely manor.  This is a bad situation, because it threatens the
   system stability as it can lead to OOM.

The page_pool should somehow protect the system in case 2.  The
page_pool can detect the situation as it is able to track the number
of outstanding pages, due to the recycle feedback loop.  Thus, the
page_pool can have some configurable limit of allowed outstanding
pages, which can protect the system against OOM.

Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
pages to be "pageable", i.e. swap-able, but that is not an option for
the page_pool as these pages are DMA mapped.

.. _`Fbufs paper`:
   http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688

Effect of blocking allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The effect of page_pool, in case 2, that denies more allocations
essentially result-in the RX-ring queue cannot be refilled and HW
starts dropping packets due to "out-of-buffers".  For NICs with
several HW RX-queues, this can be limited to a subset of queues (and
admin can control which RX queue with HW filters).

The question is if the page_pool can do something smarter in this
case, to signal the consumers of these pages, before the maximum limit
is hit (of allowed outstanding packets).  The MM-subsystem already
have a concept of emergency PFMEMALLOC reserves and associate
page-flags (e.g. page_is_pfmemalloc).  And the network stack already
handle and react to this.  Could the same PFMEMALLOC system be used
for marking pages when limit is close?

This requires further analysis. One can imagine; this could be used at
RX by XDP to mitigate the situation by dropping less-important frames.
Given XDP choose which pages are being send to userspace it might have
appropriate knowledge of what it relevant to drop(?).

.. Note:: An alternative idea is using a data-structure that blocks
          userspace from getting new pages before returning some.
          (out of scope for the page_pool)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 21:29                           ` Jesper Dangaard Brouer
@ 2016-12-14 22:45                             ` Alexander Duyck
  2016-12-15  8:28                               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2016-12-14 22:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, David Miller, Christoph Lameter, rppt, Netdev,
	linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich

On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 14 Dec 2016 08:45:08 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> I agree.  This is a no-go from the performance perspective as well.
>> At a minimum you would have to be zeroing out the page between uses to
>> avoid leaking data, and that assumes that the program we are sending
>> the pages to is slightly well behaved.  If we think zeroing out an
>> sk_buff is expensive wait until we are trying to do an entire 4K page.
>
> Again, yes the page will be zero'ed out, but only when entering the
> page_pool. Because they are recycled they are not cleared on every use.
> Thus, performance does not suffer.

So you are talking about recycling, but not clearing the page when it
is recycled.  That right there is my problem with this.  It is fine if
you assume the pages are used by the application only, but you are
talking about using them for both the application and for the regular
network path.  You can't do that.  If you are recycling you will have
to clear the page every time you put it back onto the Rx ring,
otherwise you can leak the recycled memory into user space and end up
with a user space program being able to snoop data out of the skb.

> Besides clearing large mem area is not as bad as clearing small.
> Clearing an entire page does cost something, as mentioned before 143
> cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
> cost 36 cycles which is only 7 bytes-per-cycle (256/36).

What I am saying is that you are going to be clearing the 4K blocks
each time they are recycled.  You can't have the pages shared between
user-space and the network stack unless you have true isolation.  If
you are allowing network stack pages to be recycled back into the
user-space application you open up all sorts of leaks where the
application can snoop into data it shouldn't have access to.

>> I think we are stuck with having to use a HW filter to split off
>> application traffic to a specific ring, and then having to share the
>> memory between the application and the kernel on that ring only.  Any
>> other approach just opens us up to all sorts of security concerns
>> since it would be possible for the application to try to read and
>> possibly write any data it wants into the buffers.
>
> This is why I wrote a document[1], trying to outline how this is possible,
> going through all the combinations, and asking the community to find
> faults in my idea.  Inlining it again, as nobody really replied on the
> content of the doc.
>
> -
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>
> ===========================
> Memory Model for Networking
> ===========================
>
> This design describes how the page_pool change the memory model for
> networking in the NIC (Network Interface Card) drivers.
>
> .. Note:: The catch for driver developers is that, once an application
>           request zero-copy RX, then the driver must use a specific
>           SKB allocation mode and might have to reconfigure the
>           RX-ring.
>
>
> Design target
> =============
>
> Allow the NIC to function as a normal Linux NIC and be shared in a
> safe manor, between the kernel network stack and an accelerated
> userspace application using RX zero-copy delivery.
>
> Target is to provide the basis for building RX zero-copy solutions in
> a memory safe manor.  An efficient communication channel for userspace
> delivery is out of scope for this document, but OOM considerations are
> discussed below (`Userspace delivery and OOM`_).
>
> Background
> ==========
>
> The SKB or ``struct sk_buff`` is the fundamental meta-data structure
> for network packets in the Linux Kernel network stack.  It is a fairly
> complex object and can be constructed in several ways.
>
> From a memory perspective there are two ways depending on
> RX-buffer/page state:
>
> 1) Writable packet page
> 2) Read-only packet page
>
> To take full potential of the page_pool, the drivers must actually
> support handling both options depending on the configuration state of
> the page_pool.
>
> Writable packet page
> --------------------
>
> When the RX packet page is writable, the SKB setup is fairly straight
> forward.  The SKB->data (and skb->head) can point directly to the page
> data, adjusting the offset according to drivers headroom (for adding
> headers) and setting the length according to the DMA descriptor info.
>
> The page/data need to be writable, because the network stack need to
> adjust headers (like TimeToLive and checksum) or even add or remove
> headers for encapsulation purposes.
>
> A subtle catch, which also requires a writable page, is that the SKB
> also have an accompanying "shared info" data-structure ``struct
> skb_shared_info``.  This "skb_shared_info" is written into the
> skb->data memory area at the end (skb->end) of the (header) data.  The
> skb_shared_info contains semi-sensitive information, like kernel
> memory pointers to other pages (which might be pointers to more packet
> data).  This would be bad from a zero-copy point of view to leak this
> kind of information.

This should be the default once we get things moved over to using the
DMA_ATTR_SKIP_CPU_SYNC DMA attribute.  It will be a little while more
before it gets fully into Linus's tree.  It looks like the swiotlb
bits have been accepted, just waiting on the ability to map a page w/
attributes and the remainder of the patches that are floating around
in mmotm and linux-next.

BTW, any ETA on when we might expect to start seeing code related to
the page_pool?  It is much easier to review code versus these kind of
blueprints.

> Read-only packet page
> ---------------------
>
> When the RX packet page is read-only, the construction of the SKB is
> significantly more complicated and even involves one more memory
> allocation.
>
> 1) Allocate a new separate writable memory area, and point skb->data
>    here.  This is needed due to (above described) skb_shared_info.
>
> 2) Memcpy packet headers into this (skb->data) area.
>
> 3) Clear part of skb_shared_info struct in writable-area.
>
> 4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
>    and adjust the page_offset to be past the headers just copied.
>
> It is useful (later) that the network stack have this notion that part
> of the packet and a page can be read-only.  This implies that the
> kernel will not "pollute" this memory with any sensitive information.
> This is good from a zero-copy point of view, but bad from a
> performance perspective.

This will hopefully become a legacy approach.

>
> NIC RX Zero-Copy
> ================
>
> Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
> involves costly mapping and unmapping operations in the address space
> of the userspace process.  Plus for doing this safely, the page memory
> need to be cleared before using it, to avoid leaking kernel
> information to userspace, also a costly operation.  The page_pool base
> "class" of optimization is moving these kind of operations out of the
> fastpath, by recycling and lifetime control.
>
> Once a NIC RX-queue's page_pool have been configured for zero-copy
> into userspace, then can packets still be allowed to travel the normal
> stack?
>
> Yes, this should be possible, because the driver can use the
> SKB-read-only mode, which avoids polluting the page data with
> kernel-side sensitive data.  This implies, when a driver RX-queue
> switch page_pool to RX-zero-copy mode it MUST also switch to
> SKB-read-only mode (for normal stack delivery for this RXq).

This is the part that is wrong.  Once userspace has access to the
pages in an Rx ring that ring cannot be used for regular kernel-side
networking.  If it is, then sensitive kernel data may be leaked
because the application has full access to any page on the ring so it
could read the data at any time regardless of where the data is meant
to be delivered.

> XDP can be used for controlling which pages that gets RX zero-copied
> to userspace.  The page is still writable for the XDP program, but
> read-only for normal stack delivery.

Making the page read-only doesn't get you anything.  You still have a
conflict since user-space can read any packet directly out of the
page.

> Kernel safety
> -------------
>
> For the paranoid, how do we protect the kernel from a malicious
> userspace program.  Sure there will be a communication interface
> between kernel and userspace, that synchronize ownership of pages.
> But a userspace program can violate this interface, given pages are
> kept VMA mapped, the program can in principle access all the memory
> pages in the given page_pool.  This opens up for a malicious (or
> defect) program modifying memory pages concurrently with the kernel
> and DMA engine using them.
>
> An easy way to get around userspace modifying page data contents is
> simply to map pages read-only into userspace.
>
> .. Note:: The first implementation target is read-only zero-copy RX
>           page to userspace and require driver to use SKB-read-only
>           mode.

This allows for Rx but what do we do about Tx?  It sounds like
Christoph's RDMA approach might be the way to go.

> Advanced: Allowing userspace write access?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> What if userspace need write access? Flipping the page permissions per
> transfer will likely kill performance (as this likely affects the
> TLB-cache).
>
> I will argue that giving userspace write access is still possible,
> without risking a kernel crash.  This is related to the SKB-read-only
> mode that copies the packet headers (in to another memory area,
> inaccessible to userspace).  The attack angle is to modify packet
> headers after they passed some kernel network stack validation step
> (as once headers are copied they are out of "reach").
>
> Situation classes where memory page can be modified concurrently:
>
> 1) When DMA engine owns the page.  Not a problem, as DMA engine will
>    simply overwrite data.
>
> 2) Just after DMA engine finish writing.  Not a problem, the packet
>    will go through netstack validation and be rejected.
>
> 3) While XDP reads data. This can lead to XDP/eBPF program goes into a
>    wrong code branch, but the eBPF virtual machine should not be able
>    to crash the kernel. The worst outcome is a wrong or invalid XDP
>    return code.
>
> 4) Before SKB with read-only page is constructed. Not a problem, the
>    packet will go through netstack validation and be rejected.
>
> 5) After SKB with read-only page has been constructed.  Remember the
>    packet headers were copied into a separate memory area, and the
>    page data is pointed to with an offset passed the copied headers.
>    Thus, userspace cannot modify the headers used for netstack
>    validation.  It can only modify packet data contents, which is less
>    critical as it cannot crash the kernel, and eventually this will be
>    caught by packet checksum validation.
>
> 6) After netstack delivered packet to another userspace process. Not a
>    problem, as it cannot crash the kernel.  It might corrupt
>    packet-data being read by another userspace process, which one
>    argument for requiring elevated privileges to get write access
>    (like NET_CAP_ADMIN).

If userspace has access to a ring we shouldn't be using SKBs on it
really anyway.  We should probably expect XDP to be handling all the
packaging so items 4-6 can probably be dropped.

>
> Userspace delivery and OOM
> --------------------------
>
> These RX pages are likely mapped to userspace via mmap(), so-far so
> good.  It is key to performance to get an efficient way of signaling
> between kernel and userspace, e.g what page are ready for consumption,
> and when userspace are done with the page.
>
> It is outside the scope of page_pool to provide such a queuing
> structure, but the page_pool can offer some means of protecting the
> system resource usage.  It is a classical problem that resources
> (e.g. the page) must be returned in a timely manor, else the system,
> in this case, will run out of memory.  Any system/design with
> unbounded memory allocation can lead to Out-Of-Memory (OOM)
> situations.
>
> Communication between kernel and userspace is likely going to be some
> kind of queue.  Given transferring packets individually will have too
> much scheduling overhead.  A queue can implicitly function as a
> bulking interface, and offers a natural way to split the workload
> across CPU cores.
>
> This essentially boils down-to a two queue system, with the RX-ring
> queue and the userspace delivery queue.
>
> Two bad situations exists for the userspace queue:
>
> 1) Userspace is not consuming objects fast-enough. This should simply
>    result in packets getting dropped when enqueueing to a full
>    userspace queue (as queue *must* implement some limit). Open
>    question is; should this be reported or communicated to userspace.
>
> 2) Userspace is consuming objects fast, but not returning them in a
>    timely manor.  This is a bad situation, because it threatens the
>    system stability as it can lead to OOM.
>
> The page_pool should somehow protect the system in case 2.  The
> page_pool can detect the situation as it is able to track the number
> of outstanding pages, due to the recycle feedback loop.  Thus, the
> page_pool can have some configurable limit of allowed outstanding
> pages, which can protect the system against OOM.
>
> Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
> pages to be "pageable", i.e. swap-able, but that is not an option for
> the page_pool as these pages are DMA mapped.
>
> .. _`Fbufs paper`:
>    http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688
>
> Effect of blocking allocation
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The effect of page_pool, in case 2, that denies more allocations
> essentially result-in the RX-ring queue cannot be refilled and HW
> starts dropping packets due to "out-of-buffers".  For NICs with
> several HW RX-queues, this can be limited to a subset of queues (and
> admin can control which RX queue with HW filters).
>
> The question is if the page_pool can do something smarter in this
> case, to signal the consumers of these pages, before the maximum limit
> is hit (of allowed outstanding packets).  The MM-subsystem already
> have a concept of emergency PFMEMALLOC reserves and associate
> page-flags (e.g. page_is_pfmemalloc).  And the network stack already
> handle and react to this.  Could the same PFMEMALLOC system be used
> for marking pages when limit is close?
>
> This requires further analysis. One can imagine; this could be used at
> RX by XDP to mitigate the situation by dropping less-important frames.
> Given XDP choose which pages are being send to userspace it might have
> appropriate knowledge of what it relevant to drop(?).
>
> .. Note:: An alternative idea is using a data-structure that blocks
>           userspace from getting new pages before returning some.
>           (out of scope for the page_pool)
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-14 22:45                             ` Alexander Duyck
@ 2016-12-15  8:28                               ` Jesper Dangaard Brouer
  2016-12-15 15:59                                 ` Alexander Duyck
  2016-12-15 16:38                                 ` Christoph Lameter
  0 siblings, 2 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-15  8:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: John Fastabend, David Miller, Christoph Lameter, rppt, Netdev,
	linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich, brouer

On Wed, 14 Dec 2016 14:45:00 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Wed, 14 Dec 2016 08:45:08 -0800
> > Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >  
> >> I agree.  This is a no-go from the performance perspective as well.
> >> At a minimum you would have to be zeroing out the page between uses to
> >> avoid leaking data, and that assumes that the program we are sending
> >> the pages to is slightly well behaved.  If we think zeroing out an
> >> sk_buff is expensive wait until we are trying to do an entire 4K page.  
> >
> > Again, yes the page will be zero'ed out, but only when entering the
> > page_pool. Because they are recycled they are not cleared on every use.
> > Thus, performance does not suffer.  
> 
> So you are talking about recycling, but not clearing the page when it
> is recycled.  That right there is my problem with this.  It is fine if
> you assume the pages are used by the application only, but you are
> talking about using them for both the application and for the regular
> network path.  You can't do that.  If you are recycling you will have
> to clear the page every time you put it back onto the Rx ring,
> otherwise you can leak the recycled memory into user space and end up
> with a user space program being able to snoop data out of the skb.
> 
> > Besides clearing large mem area is not as bad as clearing small.
> > Clearing an entire page does cost something, as mentioned before 143
> > cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
> > cost 36 cycles which is only 7 bytes-per-cycle (256/36).  
> 
> What I am saying is that you are going to be clearing the 4K blocks
> each time they are recycled.  You can't have the pages shared between
> user-space and the network stack unless you have true isolation.  If
> you are allowing network stack pages to be recycled back into the
> user-space application you open up all sorts of leaks where the
> application can snoop into data it shouldn't have access to.

See later, the "Read-only packet page" mode should provide a mode where
the netstack doesn't write into the page, and thus cannot leak kernel
data. (CAP_NET_ADMIN already give it access to other applications data.)


> >> I think we are stuck with having to use a HW filter to split off
> >> application traffic to a specific ring, and then having to share the
> >> memory between the application and the kernel on that ring only.  Any
> >> other approach just opens us up to all sorts of security concerns
> >> since it would be possible for the application to try to read and
> >> possibly write any data it wants into the buffers.  
> >
> > This is why I wrote a document[1], trying to outline how this is possible,
> > going through all the combinations, and asking the community to find
> > faults in my idea.  Inlining it again, as nobody really replied on the
> > content of the doc.
> >
> > -
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >
> > [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
> >
> > ===========================
> > Memory Model for Networking
> > ===========================
> >
> > This design describes how the page_pool change the memory model for
> > networking in the NIC (Network Interface Card) drivers.
> >
> > .. Note:: The catch for driver developers is that, once an application
> >           request zero-copy RX, then the driver must use a specific
> >           SKB allocation mode and might have to reconfigure the
> >           RX-ring.
> >
> >
> > Design target
> > =============
> >
> > Allow the NIC to function as a normal Linux NIC and be shared in a
> > safe manor, between the kernel network stack and an accelerated
> > userspace application using RX zero-copy delivery.
> >
> > Target is to provide the basis for building RX zero-copy solutions in
> > a memory safe manor.  An efficient communication channel for userspace
> > delivery is out of scope for this document, but OOM considerations are
> > discussed below (`Userspace delivery and OOM`_).
> >
> > Background
> > ==========
> >
> > The SKB or ``struct sk_buff`` is the fundamental meta-data structure
> > for network packets in the Linux Kernel network stack.  It is a fairly
> > complex object and can be constructed in several ways.
> >
> > From a memory perspective there are two ways depending on
> > RX-buffer/page state:
> >
> > 1) Writable packet page
> > 2) Read-only packet page
> >
> > To take full potential of the page_pool, the drivers must actually
> > support handling both options depending on the configuration state of
> > the page_pool.
> >
> > Writable packet page
> > --------------------
> >
> > When the RX packet page is writable, the SKB setup is fairly straight
> > forward.  The SKB->data (and skb->head) can point directly to the page
> > data, adjusting the offset according to drivers headroom (for adding
> > headers) and setting the length according to the DMA descriptor info.
> >
> > The page/data need to be writable, because the network stack need to
> > adjust headers (like TimeToLive and checksum) or even add or remove
> > headers for encapsulation purposes.
> >
> > A subtle catch, which also requires a writable page, is that the SKB
> > also have an accompanying "shared info" data-structure ``struct
> > skb_shared_info``.  This "skb_shared_info" is written into the
> > skb->data memory area at the end (skb->end) of the (header) data.  The
> > skb_shared_info contains semi-sensitive information, like kernel
> > memory pointers to other pages (which might be pointers to more packet
> > data).  This would be bad from a zero-copy point of view to leak this
> > kind of information.  
> 
> This should be the default once we get things moved over to using the
> DMA_ATTR_SKIP_CPU_SYNC DMA attribute.  It will be a little while more
> before it gets fully into Linus's tree.  It looks like the swiotlb
> bits have been accepted, just waiting on the ability to map a page w/
> attributes and the remainder of the patches that are floating around
> in mmotm and linux-next.

I'm very happy that you are working on this.
 
> BTW, any ETA on when we might expect to start seeing code related to
> the page_pool?  It is much easier to review code versus these kind of
> blueprints.

I've implemented a prove-of-concept of page_pool, but only the first
stage, which is the ability to replace driver specific page-caches.  It
works, but is not upstream ready, as e.g. it assumes it can get a page
flag and cleanup-on-driver-unload code is missing.  Mel Gorman have
reviewed it, but with the changes he requested I lost quite some
performance, I'm still trying to figure out a way to regain that
performance lost.  The zero-copy part is not implemented.  


> > Read-only packet page
> > ---------------------
> >
> > When the RX packet page is read-only, the construction of the SKB is
> > significantly more complicated and even involves one more memory
> > allocation.
> >
> > 1) Allocate a new separate writable memory area, and point skb->data
> >    here.  This is needed due to (above described) skb_shared_info.
> >
> > 2) Memcpy packet headers into this (skb->data) area.
> >
> > 3) Clear part of skb_shared_info struct in writable-area.
> >
> > 4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
> >    and adjust the page_offset to be past the headers just copied.
> >
> > It is useful (later) that the network stack have this notion that part
> > of the packet and a page can be read-only.  This implies that the
> > kernel will not "pollute" this memory with any sensitive information.
> > This is good from a zero-copy point of view, but bad from a
> > performance perspective.  
> 
> This will hopefully become a legacy approach.

Hopefully, but this mode will have to be supported forever, and is the
current default.
 

> > NIC RX Zero-Copy
> > ================
> >
> > Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
> > involves costly mapping and unmapping operations in the address space
> > of the userspace process.  Plus for doing this safely, the page memory
> > need to be cleared before using it, to avoid leaking kernel
> > information to userspace, also a costly operation.  The page_pool base
> > "class" of optimization is moving these kind of operations out of the
> > fastpath, by recycling and lifetime control.
> >
> > Once a NIC RX-queue's page_pool have been configured for zero-copy
> > into userspace, then can packets still be allowed to travel the normal
> > stack?
> >
> > Yes, this should be possible, because the driver can use the
> > SKB-read-only mode, which avoids polluting the page data with
> > kernel-side sensitive data.  This implies, when a driver RX-queue
> > switch page_pool to RX-zero-copy mode it MUST also switch to
> > SKB-read-only mode (for normal stack delivery for this RXq).  
> 
> This is the part that is wrong.  Once userspace has access to the
> pages in an Rx ring that ring cannot be used for regular kernel-side
> networking.  If it is, then sensitive kernel data may be leaked
> because the application has full access to any page on the ring so it
> could read the data at any time regardless of where the data is meant
> to be delivered.

Are you sure. Can you give me an example of kernel code that writes
into the page when it is attached as a read-only page to the SKB? 

That would violate how we/drivers use the DMA API today (calling DMA
unmap when packets are in-flight).

 
> > XDP can be used for controlling which pages that gets RX zero-copied
> > to userspace.  The page is still writable for the XDP program, but
> > read-only for normal stack delivery.  
> 
> Making the page read-only doesn't get you anything.  You still have a
> conflict since user-space can read any packet directly out of the
> page.

Giving the application CAP_NAT_ADMIN already gave it "tcpdump" read access
to all other applications packet content from that NIC.


> > Kernel safety
> > -------------
> >
> > For the paranoid, how do we protect the kernel from a malicious
> > userspace program.  Sure there will be a communication interface
> > between kernel and userspace, that synchronize ownership of pages.
> > But a userspace program can violate this interface, given pages are
> > kept VMA mapped, the program can in principle access all the memory
> > pages in the given page_pool.  This opens up for a malicious (or
> > defect) program modifying memory pages concurrently with the kernel
> > and DMA engine using them.
> >
> > An easy way to get around userspace modifying page data contents is
> > simply to map pages read-only into userspace.
> >
> > .. Note:: The first implementation target is read-only zero-copy RX
> >           page to userspace and require driver to use SKB-read-only
> >           mode.  
> 
> This allows for Rx but what do we do about Tx?  

True, I've not covered Tx.  But I believe Tx is easier from a sharing
PoV, as we don't have the early demux sharing problem, because an
application/socket will be the starting point, and simply have
associated a page_pool for TX, solving the VMA mapping overhead.
Using the skb-read-only-page mode, this would in principle allow normal
socket zero-copy TX and packet steering.

For performance reasons, when you already know what NIC you want to TX
on, you could extend this to allocate a separate queue for TX.  Which
makes it look a lot like RDMA.


> It sounds like Christoph's RDMA approach might be the way to go.

I'm getting more and more fond of Christoph's RDMA approach.  I do
think we will end-up with something close to that approach.  I just
wanted to get review on my idea first.

IMHO the major blocker for the RDMA approach is not HW filters
themselves, but a common API that applications can call to register
what goes into the HW queues in the driver.  I suspect it will be a
long project agreeing between vendors.  And agreeing on semantics.


> > Advanced: Allowing userspace write access?
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > What if userspace need write access? Flipping the page permissions per
> > transfer will likely kill performance (as this likely affects the
> > TLB-cache).
> >
> > I will argue that giving userspace write access is still possible,
> > without risking a kernel crash.  This is related to the SKB-read-only
> > mode that copies the packet headers (in to another memory area,
> > inaccessible to userspace).  The attack angle is to modify packet
> > headers after they passed some kernel network stack validation step
> > (as once headers are copied they are out of "reach").
> >
> > Situation classes where memory page can be modified concurrently:
> >
> > 1) When DMA engine owns the page.  Not a problem, as DMA engine will
> >    simply overwrite data.
> >
> > 2) Just after DMA engine finish writing.  Not a problem, the packet
> >    will go through netstack validation and be rejected.
> >
> > 3) While XDP reads data. This can lead to XDP/eBPF program goes into a
> >    wrong code branch, but the eBPF virtual machine should not be able
> >    to crash the kernel. The worst outcome is a wrong or invalid XDP
> >    return code.
> >
> > 4) Before SKB with read-only page is constructed. Not a problem, the
> >    packet will go through netstack validation and be rejected.
> >
> > 5) After SKB with read-only page has been constructed.  Remember the
> >    packet headers were copied into a separate memory area, and the
> >    page data is pointed to with an offset passed the copied headers.
> >    Thus, userspace cannot modify the headers used for netstack
> >    validation.  It can only modify packet data contents, which is less
> >    critical as it cannot crash the kernel, and eventually this will be
> >    caught by packet checksum validation.
> >
> > 6) After netstack delivered packet to another userspace process. Not a
> >    problem, as it cannot crash the kernel.  It might corrupt
> >    packet-data being read by another userspace process, which one
> >    argument for requiring elevated privileges to get write access
> >    (like NET_CAP_ADMIN).  
> 
> If userspace has access to a ring we shouldn't be using SKBs on it
> really anyway.  We should probably expect XDP to be handling all the
> packaging so items 4-6 can probably be dropped.
> 
> >
> > Userspace delivery and OOM
> > --------------------------
> >
> > These RX pages are likely mapped to userspace via mmap(), so-far so
> > good.  It is key to performance to get an efficient way of signaling
> > between kernel and userspace, e.g what page are ready for consumption,
> > and when userspace are done with the page.
> >
> > It is outside the scope of page_pool to provide such a queuing
> > structure, but the page_pool can offer some means of protecting the
> > system resource usage.  It is a classical problem that resources
> > (e.g. the page) must be returned in a timely manor, else the system,
> > in this case, will run out of memory.  Any system/design with
> > unbounded memory allocation can lead to Out-Of-Memory (OOM)
> > situations.
> >
> > Communication between kernel and userspace is likely going to be some
> > kind of queue.  Given transferring packets individually will have too
> > much scheduling overhead.  A queue can implicitly function as a
> > bulking interface, and offers a natural way to split the workload
> > across CPU cores.
> >
> > This essentially boils down-to a two queue system, with the RX-ring
> > queue and the userspace delivery queue.
> >
> > Two bad situations exists for the userspace queue:
> >
> > 1) Userspace is not consuming objects fast-enough. This should simply
> >    result in packets getting dropped when enqueueing to a full
> >    userspace queue (as queue *must* implement some limit). Open
> >    question is; should this be reported or communicated to userspace.
> >
> > 2) Userspace is consuming objects fast, but not returning them in a
> >    timely manor.  This is a bad situation, because it threatens the
> >    system stability as it can lead to OOM.
> >
> > The page_pool should somehow protect the system in case 2.  The
> > page_pool can detect the situation as it is able to track the number
> > of outstanding pages, due to the recycle feedback loop.  Thus, the
> > page_pool can have some configurable limit of allowed outstanding
> > pages, which can protect the system against OOM.
> >
> > Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
> > pages to be "pageable", i.e. swap-able, but that is not an option for
> > the page_pool as these pages are DMA mapped.
> >
> > .. _`Fbufs paper`:
> >    http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688
> >
> > Effect of blocking allocation
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > The effect of page_pool, in case 2, that denies more allocations
> > essentially result-in the RX-ring queue cannot be refilled and HW
> > starts dropping packets due to "out-of-buffers".  For NICs with
> > several HW RX-queues, this can be limited to a subset of queues (and
> > admin can control which RX queue with HW filters).
> >
> > The question is if the page_pool can do something smarter in this
> > case, to signal the consumers of these pages, before the maximum limit
> > is hit (of allowed outstanding packets).  The MM-subsystem already
> > have a concept of emergency PFMEMALLOC reserves and associate
> > page-flags (e.g. page_is_pfmemalloc).  And the network stack already
> > handle and react to this.  Could the same PFMEMALLOC system be used
> > for marking pages when limit is close?
> >
> > This requires further analysis. One can imagine; this could be used at
> > RX by XDP to mitigate the situation by dropping less-important frames.
> > Given XDP choose which pages are being send to userspace it might have
> > appropriate knowledge of what it relevant to drop(?).
> >
> > .. Note:: An alternative idea is using a data-structure that blocks
> >           userspace from getting new pages before returning some.
> >           (out of scope for the page_pool)
> >  



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-15  8:28                               ` Jesper Dangaard Brouer
@ 2016-12-15 15:59                                 ` Alexander Duyck
  2016-12-15 16:38                                 ` Christoph Lameter
  1 sibling, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2016-12-15 15:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, David Miller, Christoph Lameter, rppt, Netdev,
	linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich

On Thu, Dec 15, 2016 at 12:28 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 14 Dec 2016 14:45:00 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> > On Wed, 14 Dec 2016 08:45:08 -0800
>> > Alexander Duyck <alexander.duyck@gmail.com> wrote:
>> >
>> >> I agree.  This is a no-go from the performance perspective as well.
>> >> At a minimum you would have to be zeroing out the page between uses to
>> >> avoid leaking data, and that assumes that the program we are sending
>> >> the pages to is slightly well behaved.  If we think zeroing out an
>> >> sk_buff is expensive wait until we are trying to do an entire 4K page.
>> >
>> > Again, yes the page will be zero'ed out, but only when entering the
>> > page_pool. Because they are recycled they are not cleared on every use.
>> > Thus, performance does not suffer.
>>
>> So you are talking about recycling, but not clearing the page when it
>> is recycled.  That right there is my problem with this.  It is fine if
>> you assume the pages are used by the application only, but you are
>> talking about using them for both the application and for the regular
>> network path.  You can't do that.  If you are recycling you will have
>> to clear the page every time you put it back onto the Rx ring,
>> otherwise you can leak the recycled memory into user space and end up
>> with a user space program being able to snoop data out of the skb.
>>
>> > Besides clearing large mem area is not as bad as clearing small.
>> > Clearing an entire page does cost something, as mentioned before 143
>> > cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
>> > cost 36 cycles which is only 7 bytes-per-cycle (256/36).
>>
>> What I am saying is that you are going to be clearing the 4K blocks
>> each time they are recycled.  You can't have the pages shared between
>> user-space and the network stack unless you have true isolation.  If
>> you are allowing network stack pages to be recycled back into the
>> user-space application you open up all sorts of leaks where the
>> application can snoop into data it shouldn't have access to.
>
> See later, the "Read-only packet page" mode should provide a mode where
> the netstack doesn't write into the page, and thus cannot leak kernel
> data. (CAP_NET_ADMIN already give it access to other applications data.)

I think you are kind of missing the point.  The device is writing to
the page on the kernel's behalf.  Therefore the page isn't "Read-only"
and you have an issue since you are talking about sharing a ring
between kernel and userspace.

>> >> I think we are stuck with having to use a HW filter to split off
>> >> application traffic to a specific ring, and then having to share the
>> >> memory between the application and the kernel on that ring only.  Any
>> >> other approach just opens us up to all sorts of security concerns
>> >> since it would be possible for the application to try to read and
>> >> possibly write any data it wants into the buffers.
>> >
>> > This is why I wrote a document[1], trying to outline how this is possible,
>> > going through all the combinations, and asking the community to find
>> > faults in my idea.  Inlining it again, as nobody really replied on the
>> > content of the doc.
>> >
>> > -
>> > Best regards,
>> >   Jesper Dangaard Brouer
>> >   MSc.CS, Principal Kernel Engineer at Red Hat
>> >   LinkedIn: http://www.linkedin.com/in/brouer
>> >
>> > [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>> >
>> > ===========================
>> > Memory Model for Networking
>> > ===========================
>> >
>> > This design describes how the page_pool change the memory model for
>> > networking in the NIC (Network Interface Card) drivers.
>> >
>> > .. Note:: The catch for driver developers is that, once an application
>> >           request zero-copy RX, then the driver must use a specific
>> >           SKB allocation mode and might have to reconfigure the
>> >           RX-ring.
>> >
>> >
>> > Design target
>> > =============
>> >
>> > Allow the NIC to function as a normal Linux NIC and be shared in a
>> > safe manor, between the kernel network stack and an accelerated
>> > userspace application using RX zero-copy delivery.
>> >
>> > Target is to provide the basis for building RX zero-copy solutions in
>> > a memory safe manor.  An efficient communication channel for userspace
>> > delivery is out of scope for this document, but OOM considerations are
>> > discussed below (`Userspace delivery and OOM`_).
>> >
>> > Background
>> > ==========
>> >
>> > The SKB or ``struct sk_buff`` is the fundamental meta-data structure
>> > for network packets in the Linux Kernel network stack.  It is a fairly
>> > complex object and can be constructed in several ways.
>> >
>> > From a memory perspective there are two ways depending on
>> > RX-buffer/page state:
>> >
>> > 1) Writable packet page
>> > 2) Read-only packet page
>> >
>> > To take full potential of the page_pool, the drivers must actually
>> > support handling both options depending on the configuration state of
>> > the page_pool.
>> >
>> > Writable packet page
>> > --------------------
>> >
>> > When the RX packet page is writable, the SKB setup is fairly straight
>> > forward.  The SKB->data (and skb->head) can point directly to the page
>> > data, adjusting the offset according to drivers headroom (for adding
>> > headers) and setting the length according to the DMA descriptor info.
>> >
>> > The page/data need to be writable, because the network stack need to
>> > adjust headers (like TimeToLive and checksum) or even add or remove
>> > headers for encapsulation purposes.
>> >
>> > A subtle catch, which also requires a writable page, is that the SKB
>> > also have an accompanying "shared info" data-structure ``struct
>> > skb_shared_info``.  This "skb_shared_info" is written into the
>> > skb->data memory area at the end (skb->end) of the (header) data.  The
>> > skb_shared_info contains semi-sensitive information, like kernel
>> > memory pointers to other pages (which might be pointers to more packet
>> > data).  This would be bad from a zero-copy point of view to leak this
>> > kind of information.
>>
>> This should be the default once we get things moved over to using the
>> DMA_ATTR_SKIP_CPU_SYNC DMA attribute.  It will be a little while more
>> before it gets fully into Linus's tree.  It looks like the swiotlb
>> bits have been accepted, just waiting on the ability to map a page w/
>> attributes and the remainder of the patches that are floating around
>> in mmotm and linux-next.
>
> I'm very happy that you are working on this.

Well it looks like the rest just got accepted into Linus's tree
yesterday.  There are still some documentation and rename patches
outstanding but I will probably start submitting driver updates for
enabling build_skb and the like in net-next in the next several weeks.

>> BTW, any ETA on when we might expect to start seeing code related to
>> the page_pool?  It is much easier to review code versus these kind of
>> blueprints.
>
> I've implemented a prove-of-concept of page_pool, but only the first
> stage, which is the ability to replace driver specific page-caches.  It
> works, but is not upstream ready, as e.g. it assumes it can get a page
> flag and cleanup-on-driver-unload code is missing.  Mel Gorman have
> reviewed it, but with the changes he requested I lost quite some
> performance, I'm still trying to figure out a way to regain that
> performance lost.  The zero-copy part is not implemented.

Well RFCs are always welcome.  It is just really hard to review things
when all you have is documentation that may or may not match up with
what ends up being implemented.

>
>> > Read-only packet page
>> > ---------------------
>> >
>> > When the RX packet page is read-only, the construction of the SKB is
>> > significantly more complicated and even involves one more memory
>> > allocation.
>> >
>> > 1) Allocate a new separate writable memory area, and point skb->data
>> >    here.  This is needed due to (above described) skb_shared_info.
>> >
>> > 2) Memcpy packet headers into this (skb->data) area.
>> >
>> > 3) Clear part of skb_shared_info struct in writable-area.
>> >
>> > 4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
>> >    and adjust the page_offset to be past the headers just copied.
>> >
>> > It is useful (later) that the network stack have this notion that part
>> > of the packet and a page can be read-only.  This implies that the
>> > kernel will not "pollute" this memory with any sensitive information.
>> > This is good from a zero-copy point of view, but bad from a
>> > performance perspective.
>>
>> This will hopefully become a legacy approach.
>
> Hopefully, but this mode will have to be supported forever, and is the
> current default.

Maybe you need to rename this approach since it is clear there is some
confusion about what is going on here.  The page is not Read-only.  It
is left device mapped and is not writable by the CPU.  That doesn't
mean it isn't written to though.

>
>> > NIC RX Zero-Copy
>> > ================
>> >
>> > Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
>> > involves costly mapping and unmapping operations in the address space
>> > of the userspace process.  Plus for doing this safely, the page memory
>> > need to be cleared before using it, to avoid leaking kernel
>> > information to userspace, also a costly operation.  The page_pool base
>> > "class" of optimization is moving these kind of operations out of the
>> > fastpath, by recycling and lifetime control.
>> >
>> > Once a NIC RX-queue's page_pool have been configured for zero-copy
>> > into userspace, then can packets still be allowed to travel the normal
>> > stack?
>> >
>> > Yes, this should be possible, because the driver can use the
>> > SKB-read-only mode, which avoids polluting the page data with
>> > kernel-side sensitive data.  This implies, when a driver RX-queue
>> > switch page_pool to RX-zero-copy mode it MUST also switch to
>> > SKB-read-only mode (for normal stack delivery for this RXq).
>>
>> This is the part that is wrong.  Once userspace has access to the
>> pages in an Rx ring that ring cannot be used for regular kernel-side
>> networking.  If it is, then sensitive kernel data may be leaked
>> because the application has full access to any page on the ring so it
>> could read the data at any time regardless of where the data is meant
>> to be delivered.
>
> Are you sure. Can you give me an example of kernel code that writes
> into the page when it is attached as a read-only page to the SKB?

You are completely overlooking the writes by the device.  The device
is writing to the page.  Therefore it is not a true "read-only" page.

> That would violate how we/drivers use the DMA API today (calling DMA
> unmap when packets are in-flight).

What I am talking about is the DMA so it doesn't violate things.

>> > XDP can be used for controlling which pages that gets RX zero-copied
>> > to userspace.  The page is still writable for the XDP program, but
>> > read-only for normal stack delivery.
>>
>> Making the page read-only doesn't get you anything.  You still have a
>> conflict since user-space can read any packet directly out of the
>> page.
>
> Giving the application CAP_NAT_ADMIN already gave it "tcpdump" read access
> to all other applications packet content from that NIC.

Now we are getting somewhere.  So in this scenario we are okay with
the application being able to read anything that is written to the
kernel.  That is actually the data I was concerned about.  So as long
as we are fine with the application reading any data that is going by
in the packets then we should be fine sharing the data this way.

It does lead to questions though on why there was the 1 page per
packet requirement.  As long as we are using pages in this "read-only"
format you could share as many pages as you wanted and have either
multiple packets per page or multiple pages per packet as long as you
honor the read-only aspect of things.

>> > Kernel safety
>> > -------------
>> >
>> > For the paranoid, how do we protect the kernel from a malicious
>> > userspace program.  Sure there will be a communication interface
>> > between kernel and userspace, that synchronize ownership of pages.
>> > But a userspace program can violate this interface, given pages are
>> > kept VMA mapped, the program can in principle access all the memory
>> > pages in the given page_pool.  This opens up for a malicious (or
>> > defect) program modifying memory pages concurrently with the kernel
>> > and DMA engine using them.
>> >
>> > An easy way to get around userspace modifying page data contents is
>> > simply to map pages read-only into userspace.
>> >
>> > .. Note:: The first implementation target is read-only zero-copy RX
>> >           page to userspace and require driver to use SKB-read-only
>> >           mode.
>>
>> This allows for Rx but what do we do about Tx?
>
> True, I've not covered Tx.  But I believe Tx is easier from a sharing
> PoV, as we don't have the early demux sharing problem, because an
> application/socket will be the starting point, and simply have
> associated a page_pool for TX, solving the VMA mapping overhead.
> Using the skb-read-only-page mode, this would in principle allow normal
> socket zero-copy TX and packet steering.
>
> For performance reasons, when you already know what NIC you want to TX
> on, you could extend this to allocate a separate queue for TX.  Which
> makes it look a lot like RDMA.
>
>
>> It sounds like Christoph's RDMA approach might be the way to go.
>
> I'm getting more and more fond of Christoph's RDMA approach.  I do
> think we will end-up with something close to that approach.  I just
> wanted to get review on my idea first.
>
> IMHO the major blocker for the RDMA approach is not HW filters
> themselves, but a common API that applications can call to register
> what goes into the HW queues in the driver.  I suspect it will be a
> long project agreeing between vendors.  And agreeing on semantics.

We really should end up doing a HW filtering approach for any
application most likely anyway.  I know the Intel parts have their
Flow Director which should allow for directing a flow to the correct
queue.  Really it sort of makes sense to look at going that route as
you can focus your software efforts on a queue that should mostly
contain the traffic you are looking for rather than one that will be
processing unrelated traffic.

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Designing a safe RX-zero-copy Memory Model for Networking
  2016-12-15  8:28                               ` Jesper Dangaard Brouer
  2016-12-15 15:59                                 ` Alexander Duyck
@ 2016-12-15 16:38                                 ` Christoph Lameter
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2016-12-15 16:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, John Fastabend, David Miller, rppt, Netdev,
	linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich

On Thu, 15 Dec 2016, Jesper Dangaard Brouer wrote:

> > It sounds like Christoph's RDMA approach might be the way to go.
>
> I'm getting more and more fond of Christoph's RDMA approach.  I do
> think we will end-up with something close to that approach.  I just
> wanted to get review on my idea first.
>
> IMHO the major blocker for the RDMA approach is not HW filters
> themselves, but a common API that applications can call to register
> what goes into the HW queues in the driver.  I suspect it will be a
> long project agreeing between vendors.  And agreeing on semantics.

Some of the methods from the RDMA subsystem (like queue pairs, the various
queues etc) could be extracted and used here. Multiple vendors already
support these features and some devices operate both in an RDMA and a
network stack mode. Having that all supported by the networks stack would
reduce overhead for those vendors.

Multiple new vendors are coming up in the RDMA subsystem because the
regular network stack does not have the right performance for high speed
networking. I would rather see them have a way to get that functionality
from the regular network stack. Please add some extensions so that the
RDMA style I/O can be made to work. Even the hardware of the new NICs is
already prepared to work with the data structures of the RDMA subsystem.
That provides an area of standardization where we could hook into but do
that properly and in a nice way in the context of main stream network
support.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2016-12-15 16:38 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-05 14:31 Designing a safe RX-zero-copy Memory Model for Networking Jesper Dangaard Brouer
2016-12-12  8:38 ` Mike Rapoport
2016-12-12  9:40   ` Jesper Dangaard Brouer
2016-12-12 14:14     ` Mike Rapoport
2016-12-12 14:14       ` Mike Rapoport
2016-12-12 14:49       ` John Fastabend
2016-12-12 17:13         ` Jesper Dangaard Brouer
2016-12-12 18:06           ` Christoph Lameter
2016-12-12 18:06             ` Christoph Lameter
2016-12-13 16:10             ` Jesper Dangaard Brouer
2016-12-13 16:36               ` Christoph Lameter
2016-12-13 16:36                 ` Christoph Lameter
2016-12-13 17:43               ` John Fastabend
2016-12-13 17:43                 ` John Fastabend
2016-12-13 19:53                 ` David Miller
2016-12-13 20:08                   ` John Fastabend
2016-12-14  9:39                     ` Jesper Dangaard Brouer
2016-12-14 16:32                       ` John Fastabend
2016-12-14 16:45                         ` Alexander Duyck
2016-12-14 21:29                           ` Jesper Dangaard Brouer
2016-12-14 22:45                             ` Alexander Duyck
2016-12-15  8:28                               ` Jesper Dangaard Brouer
2016-12-15 15:59                                 ` Alexander Duyck
2016-12-15 16:38                                 ` Christoph Lameter
2016-12-14 21:04                         ` Jesper Dangaard Brouer
2016-12-13 18:39               ` Hannes Frederic Sowa
2016-12-14 17:00                 ` Christoph Lameter
2016-12-14 17:00                   ` Christoph Lameter
2016-12-14 17:37                   ` David Laight
2016-12-14 19:43                     ` Christoph Lameter
2016-12-14 19:43                       ` Christoph Lameter
2016-12-14 20:37                       ` Hannes Frederic Sowa
2016-12-14 20:37                         ` Hannes Frederic Sowa
2016-12-14 21:22                         ` Christoph Lameter
2016-12-13  9:42         ` Mike Rapoport
2016-12-12 15:10       ` Jesper Dangaard Brouer
2016-12-12 15:10         ` Jesper Dangaard Brouer
2016-12-13  8:43         ` Mike Rapoport
2016-12-13  8:43           ` Mike Rapoport

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.