From: Bruce Richardson
Subject: Re: [PATCH v1 01/14] ring: remove split cacheline build setting
Date: Wed, 1 Mar 2017 10:42:58 +0000
Message-ID: <20170301104257.GB25032@bricha3-MOBL3.ger.corp.intel.com>
To: Olivier Matz
Cc: Jerin Jacob, dev@dpdk.org
In-Reply-To: <20170301111753.1223a01e@platinum>
References: <20170223172407.27664-1-bruce.richardson@intel.com>
 <20170223172407.27664-2-bruce.richardson@intel.com>
 <20170228113511.GA28584@localhost.localdomain>
 <20170228115703.GA4656@bricha3-MOBL3.ger.corp.intel.com>
 <20170228120833.GA30817@localhost.localdomain>
 <20170228135226.GA9784@bricha3-MOBL3.ger.corp.intel.com>
 <20170228175423.GA23591@localhost.localdomain>
 <20170301094702.GA15176@bricha3-MOBL3.ger.corp.intel.com>
 <20170301111753.1223a01e@platinum>

On Wed, Mar 01, 2017 at 11:17:53AM +0100, Olivier Matz wrote:
> Hi Bruce,
>
> On Wed, 1 Mar 2017 09:47:03 +0000, Bruce Richardson wrote:
> > On Tue, Feb 28, 2017 at 11:24:25PM +0530, Jerin Jacob wrote:
> > > On Tue, Feb 28, 2017 at 01:52:26PM +0000, Bruce Richardson wrote:
> > > > On Tue, Feb 28, 2017 at 05:38:34PM +0530, Jerin Jacob wrote:
> > > > > On Tue, Feb 28, 2017 at 11:57:03AM +0000, Bruce Richardson wrote:
> > > > > > On Tue, Feb 28, 2017 at 05:05:13PM +0530, Jerin Jacob wrote:
> > > > > > > On Thu, Feb 23, 2017 at 05:23:54PM +0000, Bruce Richardson wrote:
> > > > > > > > Users compiling DPDK should not need to know or care
> > > > > > > > about the arrangement of cachelines in the rte_ring
> > > > > > > > structure. Therefore just remove the build option and set
> > > > > > > > the structures to be always split. For improved
> > > > > > > > performance use 128B rather than 64B alignment since it
> > > > > > > > stops the producer and consumer data being on adjacent
> > > > > > > > cachelines.
>
> You say you see improved performance on Intel by having an extra
> blank cacheline between the producer and consumer data. Do you have
> an idea why it behaves like this? Do you think it is related to the
> hardware adjacent-cacheline prefetcher?
>

That is a likely candidate, but I haven't done a full analysis of the
details to know for sure what all the factors are. We see a similar
effect with the packet distributor library, which uses similar padding.
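To make the layout concrete, here is a minimal C sketch of the padding
being discussed, assuming 64B cachelines. All names (headtail,
ring_hdr, prod, cons) are invented for illustration and are not the
actual rte_ring fields; only the 64B/128B arithmetic comes from the
thread.

#include <stddef.h>
#include <stdint.h>

/* One side of the ring: a producer or consumer head/tail pair
 * (illustrative fields only, not the real rte_ring layout). */
struct headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

/* Aligning each block to 128B (two 64B lines) leaves a full blank
 * cacheline after prod, so a hardware prefetch of the line adjacent
 * to prod's can never pull in, and contend on, cons's line. */
struct ring_hdr {
	struct headtail prod __attribute__((aligned(128)));
	struct headtail cons __attribute__((aligned(128)));
};

/* Compile-time check: the two blocks are at least two lines apart. */
_Static_assert(offsetof(struct ring_hdr, cons) -
	       offsetof(struct ring_hdr, prod) >= 128,
	       "prod and cons must not sit on adjacent cachelines");

With plain 64B alignment the two blocks would land at offsets 0 and 64,
exactly the adjacent pair of lines that the prefetcher speculatively
fetches together.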
> > > > [...]
> > > # base code
> > > RTE>>ring_perf_autotest
> > > ### Testing single element and burst enq/deq ###
> > > SP/SC single enq/dequeue: 84
> > > MP/MC single enq/dequeue: 301
> > > SP/SC burst enq/dequeue (size: 8): 20
> > > MP/MC burst enq/dequeue (size: 8): 46
> > > SP/SC burst enq/dequeue (size: 32): 12
> > > MP/MC burst enq/dequeue (size: 32): 18
> > >
> > > ### Testing empty dequeue ###
> > > SC empty dequeue: 7.11
> > > MC empty dequeue: 12.15
> > >
> > > ### Testing using a single lcore ###
> > > SP/SC bulk enq/dequeue (size: 8): 19.08
> > > MP/MC bulk enq/dequeue (size: 8): 46.28
> > > SP/SC bulk enq/dequeue (size: 32): 11.89
> > > MP/MC bulk enq/dequeue (size: 32): 18.84
> > >
> > > ### Testing using two physical cores ###
> > > SP/SC bulk enq/dequeue (size: 8): 37.42
> > > MP/MC bulk enq/dequeue (size: 8): 73.32
> > > SP/SC bulk enq/dequeue (size: 32): 18.69
> > > MP/MC bulk enq/dequeue (size: 32): 24.59
> > > Test OK
> > >
> > > # with ring rework patch
> > > RTE>>ring_perf_autotest
> > > ### Testing single element and burst enq/deq ###
> > > SP/SC single enq/dequeue: 84
> > > MP/MC single enq/dequeue: 301
> > > SP/SC burst enq/dequeue (size: 8): 19
> > > MP/MC burst enq/dequeue (size: 8): 45
> > > SP/SC burst enq/dequeue (size: 32): 11
> > > MP/MC burst enq/dequeue (size: 32): 18
> > >
> > > ### Testing empty dequeue ###
> > > SC empty dequeue: 7.10
> > > MC empty dequeue: 12.15
> > >
> > > ### Testing using a single lcore ###
> > > SP/SC bulk enq/dequeue (size: 8): 18.59
> > > MP/MC bulk enq/dequeue (size: 8): 45.49
> > > SP/SC bulk enq/dequeue (size: 32): 11.67
> > > MP/MC bulk enq/dequeue (size: 32): 18.65
> > >
> > > ### Testing using two physical cores ###
> > > SP/SC bulk enq/dequeue (size: 8): 37.41
> > > MP/MC bulk enq/dequeue (size: 8): 72.98
> > > SP/SC bulk enq/dequeue (size: 32): 18.69
> > > MP/MC bulk enq/dequeue (size: 32): 24.59
> > > Test OK
> > > RTE>>
> > >
> > > # with ring rework patch + cache-line size change to one on 128BCL target
> > > RTE>>ring_perf_autotest
> > > ### Testing single element and burst enq/deq ###
> > > SP/SC single enq/dequeue: 90
> > > MP/MC single enq/dequeue: 317
> > > SP/SC burst enq/dequeue (size: 8): 20
> > > MP/MC burst enq/dequeue (size: 8): 48
> > > SP/SC burst enq/dequeue (size: 32): 11
> > > MP/MC burst enq/dequeue (size: 32): 18
> > >
> > > ### Testing empty dequeue ###
> > > SC empty dequeue: 8.10
> > > MC empty dequeue: 11.15
> > >
> > > ### Testing using a single lcore ###
> > > SP/SC bulk enq/dequeue (size: 8): 20.24
> > > MP/MC bulk enq/dequeue (size: 8): 48.43
> > > SP/SC bulk enq/dequeue (size: 32): 11.01
> > > MP/MC bulk enq/dequeue (size: 32): 18.43
> > >
> > > ### Testing using two physical cores ###
> > > SP/SC bulk enq/dequeue (size: 8): 25.92
> > > MP/MC bulk enq/dequeue (size: 8): 69.76
> > > SP/SC bulk enq/dequeue (size: 32): 14.27
> > > MP/MC bulk enq/dequeue (size: 32): 22.94
> > > Test OK
> > > RTE>>
> >
> > So given that there is not much difference here, is the MIN_SIZE, i.e.
> > forced 64B, your preference, rather than the actual cacheline size?
> >
>
> I don't quite like this macro CACHE_LINE_MIN_SIZE. For me, it does not
> mean anything. The reasons for aligning on a cacheline size are
> straightforward, but when would we need to align on the minimum
> cacheline size supported by DPDK? For instance, in the mbuf structure,
> aligning on 64 would make more sense to me.
>
> So, I would prefer using (RTE_CACHE_LINE_SIZE * 2) here.
> If we don't want it on some architectures, or if this optimization is
> only for Intel (or for all archs that need this optimization), I think
> we could have something like:
>
> /* use two-cacheline alignment only where the adjacent-line
>  * prefetcher makes the extra padding worthwhile */
> #ifdef RTE_ARCH_X86
> #define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE * 2)
> #else
> #define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
> #endif
>

I would agree that CACHE_LINE_MIN_SIZE probably doesn't make any sense
here, but I'm happy to put in any suitable scheme that others agree on.
The options are:

* Keep as-is: adv: simplest option; disadv: wastes 128B * 2 on some
  platforms
* Change to MIN_SIZE: adv: no ifdefs; disadv: doesn't make much sense
  logically here
* Use ifdefs: adv: each platform gets what's best for it; disadv: untidy
  code that may be harder to maintain
* Use a hard-coded 128: adv: short and simple; disadv: misses any logical
  reason why 128 is used, i.e. a magic number (a commented variant is
  sketched below)

I'm ok with any option above.

/Bruce
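For concreteness, a minimal sketch of that last option with the
rationale attached to the constant. __rte_aligned() and
RTE_CACHE_LINE_SIZE are existing DPDK macros, but RING_SPLIT_ALIGN and
struct ring_headtail are names invented here for illustration, not
taken from the patch:

#include <stdint.h>
#include <rte_common.h>	/* __rte_aligned() */

/*
 * 128 = two 64B cachelines. The spare line keeps producer and consumer
 * data off adjacent cachelines, which the hardware adjacent-line
 * prefetcher would otherwise fetch as a pair, creating needless
 * cross-core traffic between enqueuing and dequeuing lcores.
 */
#define RING_SPLIT_ALIGN 128

/* Illustrative head/tail block carrying the documented alignment. */
struct ring_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
} __rte_aligned(RING_SPLIT_ALIGN);

Attaching the reasoning to the constant at its definition arguably
answers the magic-number objection while staying as short as the
hard-coded option promises.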