From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jerin Jacob
Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Date: Mon, 23 Oct 2017 15:36:18 +0530
Message-ID: <20171023100617.GA17957@jerin>
References: <20171012172311.GA8524@jerin>
 <2601191342CEEE43887BDE71AB9772585FAAB171@IRSMSX103.ger.corp.intel.com>
 <8806e2bd-c57b-03ff-a315-0a311690f1d9@163.com>
 <2601191342CEEE43887BDE71AB9772585FAAB404@IRSMSX103.ger.corp.intel.com>
 <2601191342CEEE43887BDE71AB9772585FAAB570@IRSMSX103.ger.corp.intel.com>
 <3e580cd7-2854-d855-be9c-7c4ce06e3ed5@gmail.com>
 <20171020054319.GA4249@jerin>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: "Ananyev, Konstantin", "Zhao, Bing", Olivier MATZ, "dev@dpdk.org",
 "jia.he@hxt-semitech.com", "jie2.liu@hxt-semitech.com",
 "bing.zhao@hxt-semitech.com", "Richardson, Bruce"
To: Jia He
List-Id: DPDK patches and discussions

-----Original Message-----
> Date: Mon, 23 Oct 2017 16:49:01 +0800
> From: Jia He
> To: Jerin Jacob
> Cc: "Ananyev, Konstantin", "Zhao, Bing", Olivier MATZ, "dev@dpdk.org",
>  "jia.he@hxt-semitech.com", "jie2.liu@hxt-semitech.com",
>  "bing.zhao@hxt-semitech.com", "Richardson, Bruce"
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
>
> Hi Jerin
>
>
> On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
> > -----Original Message-----
> >
> > [...]
> > > dependent on each other.
> > > Thus a memory barrier is necessary.
> > Yes. The barrier is necessary.
> > In fact, upstream FreeBSD fixed this issue for arm64. The DPDK ring
> > implementation is derived from FreeBSD's buf_ring.h.
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> >
> > I think the only outstanding issue is how to reduce the performance
> > impact on arm64. I believe using acquire/release semantics instead
> > of rte_smp_rmb() will reduce the performance overhead, like the similar
> > ring implementations below:
> > freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> >
> > Jia,
> > 1) Can you verify that the use of acquire/release semantics fixes the
> > problem on your platform, e.g. with atomic_load_acq* as in the reference code?
> > 2) If so, what is the overhead of acquire/release versus a plain rte_smp_rmb()
> > barrier? Based on that we can decide which path to take.
> I've tested 3 cases. The new 3rd case uses the load_acquire barrier
> (half barrier) you mentioned at the above link.
> The patch looks like:
>
> @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>                 /* Reset n to the initial burst count */
>                 n = max;
>
> -               *old_head = r->prod.head;
> -               const uint32_t cons_tail = r->cons.tail;
> +               *old_head = atomic_load_acq_32(&r->prod.head);
> +               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
>
> @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
>                 /* Restore n as it may change every loop */
>                 n = max;
>
> -               *old_head = r->cons.head;
> -               const uint32_t prod_tail = r->prod.tail;
> +               *old_head = atomic_load_acq_32(&r->cons.head);
> +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
>                 /* The subtraction is done between two unsigned 32-bit values
>                  * (the result is always modulo 32 bits even if we have
>                  * cons_head > prod_tail). So 'entries' is always between 0
>                  * and size(ring)-1. */
>
> The half-barrier patch passed the functional test.
>
> As for the performance comparison on *arm64* (the debug patch is at
> http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the
> test results below:
>
> [case 1] old code, no barrier
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>      689275.001200      task-clock (msec)         #    9.771 CPUs utilized
>               6223      context-switches          #    0.009 K/sec
>                 10      cpu-migrations            #    0.000 K/sec
>                653      page-faults               #    0.001 K/sec
>      1721190914583      cycles                    #    2.497 GHz
>      3363238266430      instructions              #    1.95  insn per cycle
>    branches
>           27804740      branch-misses             #    0.00% of all branches
>
>       70.540618825 seconds time elapsed
>
> [case 2] full barrier with rte_smp_rmb()
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>      582557.895850      task-clock (msec)         #    9.752 CPUs utilized
>               5242      context-switches          #    0.009 K/sec
>                 10      cpu-migrations            #    0.000 K/sec
>                665      page-faults               #    0.001 K/sec
>      1454360730055      cycles                    #    2.497 GHz
>       587197839907      instructions              #    0.40  insn per cycle
>    branches
>           27799687      branch-misses             #    0.00% of all branches
>
>       59.735582356 seconds time elapsed
>
> [case 3] half barrier with load_acquire
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>      660758.877050      task-clock (msec)         #    9.764 CPUs utilized
>               5982      context-switches          #    0.009 K/sec
>                 11      cpu-migrations            #    0.000 K/sec
>                657      page-faults               #    0.001 K/sec
>      1649875318044      cycles                    #    2.497 GHz
>       591583257765      instructions              #    0.36  insn per cycle
>    branches
>           27994903      branch-misses             #    0.00% of all branches
>
>       67.672855107 seconds time elapsed
>
> Please see the context-switches in the
> perf results.
> The test results sorted by running time are:
> full barrier < half barrier < no barrier
>
> AFAICT, in this case the CPU reordering adds the possibility of
> context switching and increases the running time.
> Any ideas?

Regarding the performance test, it is better to run the ring perf test case
on _isolated_ cores to measure the impact on the number of enqueue/dequeue
operations.

example:
./build/app/test -c 0xff -n 4
RTE>>ring_perf_autotest

By default, arm64 + DPDK uses the EL0 generic timer counter (cntvct_el0) to
measure cycles. I think on your SoC it runs at 50 MHz or 100 MHz, so you can
follow the scheme below to get accurate cycle measurements:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter
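
For illustration, here is a minimal sketch of the half-barrier idea using the
GCC/Clang __atomic built-ins instead of the FreeBSD-style atomic_load_acq_32()
from the quoted diff. The struct layout and the helper name are simplified
placeholders, not the actual rte_ring code:

#include <stdint.h>

/* Simplified stand-ins for the rte_ring head/tail structures. */
struct ht   { volatile uint32_t head; volatile uint32_t tail; };
struct ring { struct ht prod; struct ht cons; uint32_t mask; };

/*
 * Producer side: load prod.head and cons.tail with acquire semantics so
 * that later memory reads cannot be reordered before these two loads,
 * which is the ordering the full rte_smp_rmb() was providing.
 */
static inline uint32_t
prod_free_entries_acq(const struct ring *r, uint32_t *old_head)
{
	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
	const uint32_t cons_tail =
		__atomic_load_n(&r->cons.tail, __ATOMIC_ACQUIRE);

	/* Unsigned subtraction is modulo 2^32, so the result stays in
	 * [0, ring size - 1] even while head/tail wrap around. */
	return r->mask + cons_tail - *old_head;
}

The consumer side would mirror this with acquire loads of cons.head and
prod.tail, so the reads of the ring entries cannot be observed ahead of the
prod.tail load.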
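
As a rough sketch of the two cycle sources mentioned above (helper names are
made up; pmccntr_el0 access from EL0 must be enabled as described in the
profiling guide, otherwise it traps):

#include <stdint.h>

/* Generic timer counter: always readable from EL0, but typically ticks
 * at the SoC timer frequency (e.g. 50/100 MHz), so it is coarse. */
static inline uint64_t read_cntvct(void)
{
	uint64_t v;
	asm volatile("mrs %0, cntvct_el0" : "=r"(v));
	return v;
}

/* PMU cycle counter: ticks at CPU frequency, giving high-resolution
 * measurements, but needs kernel-side enablement for EL0 access. */
static inline uint64_t read_pmccntr(void)
{
	uint64_t v;
	asm volatile("mrs %0, pmccntr_el0" : "=r"(v));
	return v;
}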