From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Karlsson, Magnus" <magnus.karlsson@intel.com>
Subject: RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
Date: Mon, 23 Apr 2018 10:26:18 +0000
Message-ID: <AFED4FBCE79F3548A8F74434195ACE39588D8054@IRSMSX107.ger.corp.intel.com>
References: <20180327165919.17933-1-bjorn.topel@gmail.com>
 <20180327165919.17933-4-bjorn.topel@gmail.com>
 <20180412050542-mutt-send-email-mst@kernel.org>
 <AFED4FBCE79F3548A8F74434195ACE39588D1AA4@IRSMSX107.ger.corp.intel.com>
 <20180412170110-mutt-send-email-mst@kernel.org>
 <AFED4FBCE79F3548A8F74434195ACE39588D2083@IRSMSX107.ger.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8BIT
Cc: =?iso-8859-1?Q?=27Bj=F6rn_T=F6pel=27?= <bjorn.topel@gmail.com>,
        "Duyck, Alexander H" <alexander.h.duyck@intel.com>,
        "'alexander.duyck@gmail.com'" <alexander.duyck@gmail.com>,
        "'john.fastabend@gmail.com'" <john.fastabend@gmail.com>,
        "'ast@fb.com'" <ast@fb.com>,
        "'brouer@redhat.com'" <brouer@redhat.com>,
        "'willemdebruijn.kernel@gmail.com'" <willemdebruijn.kernel@gmail.com>,
        "'daniel@iogearbox.net'" <daniel@iogearbox.net>,
        "'netdev@vger.kernel.org'" <netdev@vger.kernel.org>,
        "'michael.lundkvist@ericsson.com'" <michael.lundkvist@ericsson.com>,
        "Brandeburg, Jesse" <jesse.brandeburg@intel.com>,
        "Singhai, Anjali" <anjali.singhai@intel.com>,
        "Zhang, Qi Z" <qi.z.zhang@intel.com>,
        "'ravineet.singh@ericsson.com'" <ravineet.singh@ericsson.com>
To: "'Michael S. Tsirkin'" <mst@redhat.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mga04.intel.com ([192.55.52.120]:39790 "EHLO mga04.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1754607AbeDWK0X (ORCPT <rfc822;netdev@vger.kernel.org>);
        Mon, 23 Apr 2018 06:26:23 -0400
In-Reply-To: <AFED4FBCE79F3548A8F74434195ACE39588D2083@IRSMSX107.ger.corp.intel.com>
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>


> -----Original Message-----
> From: Karlsson, Magnus
> Sent: Thursday, April 12, 2018 5:20 PM
> To: Michael S. Tsirkin <mst@redhat.com>
> Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ravineet.singh@ericsson.com
> Subject: RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Thursday, April 12, 2018 4:05 PM
> > To: Karlsson, Magnus <magnus.karlsson@intel.com>
> > Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> > <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> > john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> > willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> > netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> > Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> > <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > ravineet.singh@ericsson.com
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> > mmap
> >
> > On Thu, Apr 12, 2018 at 07:38:25AM +0000, Karlsson, Magnus wrote:
> > > I think you are definitely right in that there are ways in which we
> > > can improve performance here. That said, the current queue performs
> > > slightly better than the previous one we had that was more or less a
> > > copy of one of your first virtio 1.1 proposals from little over a
> > > year ago. It had bidirectional queues and a valid flag in the
> > > descriptor itself. The reason we abandoned this was not poor
> > > performance (it was good), but a need to go to unidirectional
> > > > queues. Maybe I should have only changed that aspect and kept the valid
> > > > flag.
> >
> > Is there a summary about unidirectional queues anywhere?  I'm curious
> > to know whether there are any lessons here to be learned for virtio or
> > ptr_ring.
> 
> I did a quick hack in which I used your ptr_ring for the fill queue instead of
> our head/tail based one. In the corner cases (usually empty or usually full),
> there is basically no difference. But for the case when the queue is always
> half full, the ptr_ring implementation boosts the performance from 5.6 to 5.7
> Mpps (as there is no cache line bouncing in this case) on my system (slower
> than Björn's that was used for the numbers in the RFC).
> 
> So I think this should be implemented properly so we can get some real
> numbers.
> Especially since 0.1 Mpps with copies will likely become much more with
> zero-copy as we are really chasing cycles there. We will get back a better
> evaluation in a few days.
> 
> Thanks: Magnus
> 
> > --
> > MST

Hi Michael,

Sorry for the late reply. Been travelling. Björn and I have now
made a real implementation of the ptr_ring principles in the
af_xdp code. We just added a switch in bind (only for the purpose
of this test) to be able to pick what ring implementation to use
from the user space test program. The main difference between our
version of ptr_ring and our head/tail ring is that the ptr_ring
version uses the idx field to signal if the entry is available or
not (idx == 0 indicating empty descriptor) and that it does not
use the head and tail pointers at all. Even though it is not
a "ring of pointers" in our implementation, we will still call it
ptr_ring for the purpose of this mail.
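To make the idx-as-valid-flag idea concrete, here is a minimal user
space sketch using C11 atomics. It is not the code from our test
patch: the descriptor layout, the ring size and the names flag_desc,
flag_ring, flag_ring_produce and flag_ring_consume are all made up
for illustration. Since the descriptor carries more than the index,
the sketch publishes idx with release semantics and reads it with
acquire semantics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical size, must be a power of two */

/* Hypothetical descriptor layout: idx == 0 marks an empty slot. */
struct flag_desc {
	_Atomic uint32_t idx; /* frame index, doubles as the valid flag */
	uint32_t len;
};

struct flag_ring {
	struct flag_desc desc[RING_SIZE];
	uint32_t prod_pos; /* private to the producer, never shared */
	uint32_t cons_pos; /* private to the consumer, never shared */
};

/* Single producer: returns false if the next slot is still in use. */
static bool flag_ring_produce(struct flag_ring *r, uint32_t idx, uint32_t len)
{
	struct flag_desc *d = &r->desc[r->prod_pos & (RING_SIZE - 1)];

	assert(idx != 0); /* 0 is reserved for "empty" */
	if (atomic_load_explicit(&d->idx, memory_order_acquire) != 0)
		return false; /* ring full */
	d->len = len;
	/* Release: len must be visible before the slot becomes valid. */
	atomic_store_explicit(&d->idx, idx, memory_order_release);
	r->prod_pos++;
	return true;
}

/* Single consumer: returns false if the next slot is empty. */
static bool flag_ring_consume(struct flag_ring *r, uint32_t *idx, uint32_t *len)
{
	struct flag_desc *d = &r->desc[r->cons_pos & (RING_SIZE - 1)];
	uint32_t v = atomic_load_explicit(&d->idx, memory_order_acquire);

	if (v == 0)
		return false; /* ring empty */
	*idx = v;
	*len = d->len;
	/* One write per consumed entry hands the slot back. */
	atomic_store_explicit(&d->idx, 0, memory_order_release);
	r->cons_pos++;
	return true;
}
```

Note that neither side reads a shared head or tail here; the only
shared state is the descriptor itself.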

In summary, our experiments show that the two rings perform the
same in our micro benchmarks when the queues are balanced and
rarely full or empty, but the head/tail version performs better
for RX when the queues are not perfectly balanced. Why is that?
We do not know exactly, but there are a number of differences
between a ptr_ring used inside the kernel and one used between
user space and kernel space as in af_xdp.

* The user space descriptors have to be validated as we are
  communicating between user space and kernel space. This is done
  slightly differently for the two rings due to the batching below.

* The RX and TX rings have descriptors that are larger than one
  pointer, so we need barriers here even with ptr_ring. We cannot
  rely on an address dependency because the entry is not a pointer.

* Batching is performed slightly differently in the two versions.
  In the head/tail version we avoid touching head and tail for as
  long as possible. At worst it is once per batch, but it might be
  much less than that on the consumer side. The drawback of the
  accesses to the head/tail pointers is that they usually end up
  as a cache line bounce. With ptr_ring, the drawback is that it
  is always N writes (setting idx = 0) for a batch size of N. The
  good thing, though, is that these will not incur any cache
  line bouncing if the rings are balanced (well, they will be
  read by the producer at some point, but only once per traversal
  of the ring).
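
The head/tail batching described in the last bullet can be sketched
roughly as below. Again this is only a user space illustration with
C11 atomics, not our kernel code: the names ht_ring, ht_consumer,
ht_produce_batch and ht_consume_batch and the ring size are
assumptions. The consumer keeps a private copy of head and only
reads the shared one when the cached value cannot satisfy the
request, and each side writes its shared index at most once per
batch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define HT_RING_SIZE 8u /* hypothetical size, must be a power of two */

struct ht_ring {
	_Atomic uint32_t head; /* shared, written by the producer */
	_Atomic uint32_t tail; /* shared, written by the consumer */
	uint32_t desc[HT_RING_SIZE];
};

/* Consumer-private state, never seen by the producer. */
struct ht_consumer {
	struct ht_ring *ring;
	uint32_t cached_head; /* last value read from ring->head */
	uint32_t tail;        /* local tail, published once per batch */
};

/* Producer: writes up to n entries, then updates head once. */
static uint32_t ht_produce_batch(struct ht_ring *r, const uint32_t *src,
				 uint32_t n)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	uint32_t free_slots = HT_RING_SIZE - (head - tail);
	uint32_t i;

	assert(n <= HT_RING_SIZE);
	if (n > free_slots)
		n = free_slots;
	for (i = 0; i < n; i++)
		r->desc[(head + i) & (HT_RING_SIZE - 1)] = src[i];
	if (n) /* a single shared write for the whole batch */
		atomic_store_explicit(&r->head, head + n, memory_order_release);
	return n;
}

/* Consumer: reads the shared head only when the cached copy runs out. */
static uint32_t ht_consume_batch(struct ht_consumer *c, uint32_t *dst,
				 uint32_t n)
{
	uint32_t avail = c->cached_head - c->tail;
	uint32_t i;

	if (avail < n) { /* this is where the cache line may bounce */
		c->cached_head = atomic_load_explicit(&c->ring->head,
						      memory_order_acquire);
		avail = c->cached_head - c->tail;
	}
	if (n > avail)
		n = avail;
	for (i = 0; i < n; i++)
		dst[i] = c->ring->desc[(c->tail + i) & (HT_RING_SIZE - 1)];
	c->tail += n;
	if (n) /* a single shared write for the whole batch */
		atomic_store_explicit(&c->ring->tail, c->tail,
				      memory_order_release);
	return n;
}
```

In this sketch a second consume call that fits within the cached
head does not touch the producer's cache line at all, which is the
"much less than once per batch" case.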

Something to note is that we think that the head/tail version
provides an easier-to-use user space interface since the indexes start
from 0 instead of 1 as in the ptr_ring case. With ptr_ring you
have to teach the user space application writer not to use index
0. With the head/tail version no such restriction is needed.
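
One conceivable way for an application to live with the reserved
index 0 in the ptr_ring case is to offset its frame numbers by one
before putting them on the ring. The helpers below are purely
hypothetical (the names frame_to_ring_idx and ring_idx_to_frame are
made up, and nothing like this exists in our patches); they only
illustrate the extra step the user would have to remember.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a 0-based frame number to a ring index. */
static inline uint32_t frame_to_ring_idx(uint32_t frame)
{
	assert(frame != UINT32_MAX); /* frame + 1 must not wrap to 0 */
	return frame + 1;
}

/* Hypothetical helper: map a ring index back to a frame number. */
static inline uint32_t ring_idx_to_frame(uint32_t idx)
{
	assert(idx != 0); /* 0 means "empty descriptor" on the ring */
	return idx - 1;
}
```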

Here are some of the results for a workload in which user space
is faster than the kernel, that is, the user space program has no
problem keeping up with the kernel.

head/tail 16-batch

  sock0@p3p2:16 rxdrop
                 pps        
rx              9,782,718   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,504,235   
tx              2,504,232   


ptr_ring 16-batch

  sock0@p3p2:16 rxdrop
                 pps        
rx              9,519,373   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,519,265   
tx              2,519,265   


ptr_ring with batch sizes calculated as in ptr_ring.h

  sock0@p3p2:16 rxdrop
                 pps        
rx              7,470,658   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,431,701   
tx              2,431,701   

/Magnus