From: Bruce Richardson
Subject: Re: [PATCH v1 01/14] ring: remove split cacheline build setting
Date: Wed, 1 Mar 2017 10:42:58 +0000
Message-ID: <20170301104257.GB25032@bricha3-MOBL3.ger.corp.intel.com>
To: Olivier Matz
Cc: Jerin Jacob, dev@dpdk.org
In-Reply-To: <20170301111753.1223a01e@platinum>
References: <20170223172407.27664-1-bruce.richardson@intel.com>
 <20170223172407.27664-2-bruce.richardson@intel.com>
 <20170228113511.GA28584@localhost.localdomain>
 <20170228115703.GA4656@bricha3-MOBL3.ger.corp.intel.com>
 <20170228120833.GA30817@localhost.localdomain>
 <20170228135226.GA9784@bricha3-MOBL3.ger.corp.intel.com>
 <20170228175423.GA23591@localhost.localdomain>
 <20170301094702.GA15176@bricha3-MOBL3.ger.corp.intel.com>
 <20170301111753.1223a01e@platinum>

On Wed, Mar 01, 2017 at 11:17:53AM +0100, Olivier Matz wrote:
> Hi Bruce,
>
> On Wed, 1 Mar 2017 09:47:03 +0000, Bruce Richardson wrote:
> > On Tue, Feb 28, 2017 at 11:24:25PM +0530, Jerin Jacob wrote:
> > > On Tue, Feb 28, 2017 at 01:52:26PM +0000, Bruce Richardson wrote:
> > > > On Tue, Feb 28, 2017 at 05:38:34PM +0530, Jerin Jacob wrote:
> > > > > On Tue, Feb 28, 2017 at 11:57:03AM +0000, Bruce Richardson wrote:
> > > > > > On Tue, Feb 28, 2017 at 05:05:13PM +0530, Jerin Jacob wrote:
> > > > > > > On Thu, Feb 23, 2017 at 05:23:54PM +0000, Bruce Richardson wrote:
> > > > > > > > Users compiling DPDK should not need to know or care
> > > > > > > > about the arrangement of cachelines in the rte_ring
> > > > > > > > structure. Therefore just remove the build option and set
> > > > > > > > the structures to be always split. For improved
> > > > > > > > performance use 128B rather than 64B alignment since it
> > > > > > > > stops the producer and consumer data being on adjacent
> > > > > > > > cachelines.
>
> You say you see improved performance on Intel by having an extra
> blank cacheline between the producer and consumer data. Do you have
> an idea why it behaves like this? Do you think it is related to the
> hardware adjacent-cacheline prefetcher?
>

That is a likely candidate, but I haven't done a full analysis of the
details to know for sure what all the factors are. We see a similar
effect with the packet distributor library, which uses similar padding.
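To make the layout concrete, here is a minimal C sketch of the padding
being discussed, assuming 64B cachelines. All names (headtail,
ring_hdr, prod, cons) are invented for illustration and are not the
actual rte_ring fields; only the 64B/128B arithmetic comes from the
thread.

#include <stddef.h>
#include <stdint.h>

/* One side of the ring: a producer or consumer head/tail pair
 * (illustrative fields only, not the real rte_ring layout). */
struct headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

/* Aligning each block to 128B (two 64B lines) leaves a full blank
 * cacheline after prod, so a hardware prefetch of the line adjacent
 * to prod's can never pull in, and contend on, cons's line. */
struct ring_hdr {
	struct headtail prod __attribute__((aligned(128)));
	struct headtail cons __attribute__((aligned(128)));
};

/* Compile-time check: the two blocks are at least two lines apart. */
_Static_assert(offsetof(struct ring_hdr, cons) -
	       offsetof(struct ring_hdr, prod) >= 128,
	       "prod and cons must not sit on adjacent cachelines");

With plain 64B alignment the two blocks would land at offsets 0 and 64,
exactly the adjacent pair of lines that the prefetcher speculatively
fetches together.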
> > > > [...]
> > > # base code
> > > RTE>>ring_perf_autotest
> > > ### Testing single element and burst enq/deq ###
> > > SP/SC single enq/dequeue: 84
> > > MP/MC single enq/dequeue: 301
> > > SP/SC burst enq/dequeue (size: 8): 20
> > > MP/MC burst enq/dequeue (size: 8): 46
> > > SP/SC burst enq/dequeue (size: 32): 12
> > > MP/MC burst enq/dequeue (size: 32): 18
> > >
> > > ### Testing empty dequeue ###
> > > SC empty dequeue: 7.11
> > > MC empty dequeue: 12.15
> > >
> > > ### Testing using a single lcore ###
> > > SP/SC bulk enq/dequeue (size: 8): 19.08
> > > MP/MC bulk enq/dequeue (size: 8): 46.28
> > > SP/SC bulk enq/dequeue (size: 32): 11.89
> > > MP/MC bulk enq/dequeue (size: 32): 18.84
> > >
> > > ### Testing using two physical cores ###
> > > SP/SC bulk enq/dequeue (size: 8): 37.42
> > > MP/MC bulk enq/dequeue (size: 8): 73.32
> > > SP/SC bulk enq/dequeue (size: 32): 18.69
> > > MP/MC bulk enq/dequeue (size: 32): 24.59
> > > Test OK
> > >
> > > # with ring rework patch
> > > RTE>>ring_perf_autotest
> > > ### Testing single element and burst enq/deq ###
> > > SP/SC single enq/dequeue: 84
> > > MP/MC single enq/dequeue: 301
> > > SP/SC burst enq/dequeue (size: 8): 19
> > > MP/MC burst enq/dequeue (size: 8): 45
> > > SP/SC burst enq/dequeue (size: 32): 11
> > > MP/MC burst enq/dequeue (size: 32): 18
> > >
> > > ### Testing empty dequeue ###
> > > SC empty dequeue: 7.10
> > > MC empty dequeue: 12.15
> > >
> > > ### Testing using a single lcore ###
> > > SP/SC bulk enq/dequeue (size: 8): 18.59
> > > MP/MC bulk enq/dequeue (size: 8): 45.49
> > > SP/SC bulk enq/dequeue (size: 32): 11.67
> > > MP/MC bulk enq/dequeue (size: 32): 18.65
> > >
> > > ### Testing using two physical cores ###
> > > SP/SC bulk enq/dequeue (size: 8): 37.41
> > > MP/MC bulk enq/dequeue (size: 8): 72.98
> > > SP/SC bulk enq/dequeue (size: 32): 18.69
> > > MP/MC bulk enq/dequeue (size: 32): 24.59
> > > Test OK
> > > RTE>>
> > >
> > > # with ring rework patch + cache-line size change to one on 128BCL target
> > > RTE>>ring_perf_autotest
> > > ### Testing single element and burst enq/deq ###
> > > SP/SC single enq/dequeue: 90
> > > MP/MC single enq/dequeue: 317
> > > SP/SC burst enq/dequeue (size: 8): 20
> > > MP/MC burst enq/dequeue (size: 8): 48
> > > SP/SC burst enq/dequeue (size: 32): 11
> > > MP/MC burst enq/dequeue (size: 32): 18
> > >
> > > ### Testing empty dequeue ###
> > > SC empty dequeue: 8.10
> > > MC empty dequeue: 11.15
> > >
> > > ### Testing using a single lcore ###
> > > SP/SC bulk enq/dequeue (size: 8): 20.24
> > > MP/MC bulk enq/dequeue (size: 8): 48.43
> > > SP/SC bulk enq/dequeue (size: 32): 11.01
> > > MP/MC bulk enq/dequeue (size: 32): 18.43
> > >
> > > ### Testing using two physical cores ###
> > > SP/SC bulk enq/dequeue (size: 8): 25.92
> > > MP/MC bulk enq/dequeue (size: 8): 69.76
> > > SP/SC bulk enq/dequeue (size: 32): 14.27
> > > MP/MC bulk enq/dequeue (size: 32): 22.94
> > > Test OK
> > > RTE>>
> >
> > So given that there is not much difference here, is the MIN_SIZE, i.e.
> > forced 64B, your preference, rather than the actual cacheline size?
> >
>
> I don't quite like this macro CACHE_LINE_MIN_SIZE. For me, it does not
> mean anything. The reasons for aligning on a cacheline size are
> straightforward, but when would we need to align on the minimum
> cacheline size supported by DPDK? For instance, in the mbuf structure,
> aligning on 64 would make more sense to me.
>
> So, I would prefer using (RTE_CACHE_LINE_SIZE * 2) here.
> If we don't want it on some architectures, or if this optimization is
> only for Intel (or for all archs that need this optimization), I think
> we could have something like:
>
> /* use two-cacheline alignment only where the adjacent-line
>  * prefetcher makes the extra padding worthwhile */
> #ifdef RTE_ARCH_X86
> #define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE * 2)
> #else
> #define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
> #endif
>

I would agree that CACHE_LINE_MIN_SIZE probably doesn't make any sense
here, but I'm happy to put in any suitable scheme that others agree on.
The options are:

* Keep as-is: adv: simplest option; disadv: wastes 128B * 2 on some
  platforms
* Change to MIN_SIZE: adv: no ifdefs; disadv: doesn't make much sense
  logically here
* Use ifdefs: adv: each platform gets what's best for it; disadv: untidy
  code that may be harder to maintain
* Use a hard-coded 128: adv: short and simple; disadv: misses any logical
  reason why 128 is used, i.e. a magic number (a commented variant is
  sketched below)

I'm ok with any option above.

/Bruce
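For concreteness, a minimal sketch of that last option with the
rationale attached to the constant. __rte_aligned() and
RTE_CACHE_LINE_SIZE are existing DPDK macros, but RING_SPLIT_ALIGN and
struct ring_headtail are names invented here for illustration, not
taken from the patch:

#include <stdint.h>
#include <rte_common.h>	/* __rte_aligned() */

/*
 * 128 = two 64B cachelines. The spare line keeps producer and consumer
 * data off adjacent cachelines, which the hardware adjacent-line
 * prefetcher would otherwise fetch as a pair, creating needless
 * cross-core traffic between enqueuing and dequeuing lcores.
 */
#define RING_SPLIT_ALIGN 128

/* Illustrative head/tail block carrying the documented alignment. */
struct ring_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
} __rte_aligned(RING_SPLIT_ALIGN);

Attaching the reasoning to the constant at its definition arguably
answers the magic-number objection while staying as short as the
hard-coded option promises.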