From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jerin Jacob
Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
Date: Mon, 8 Oct 2018 16:19:45 +0530
Message-ID: <20181008104944.GD11081@jerin>
References: <60055965-A7C8-4E9F-8668-0AE1DCE57515@arm.com>
 <20181006074126.GA16715@jerin>
 <20181007040243.GA1850@jerin>
 <7A156041-23EC-4CCB-B129-3607AF34A992@arm.com>
 <20181008060629.GA5228@jerin>
 <063A95EC-CFC1-42F7-B864-DFB9C6718AC8@arm.com>
 <20181008100004.GB11081@jerin>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: Ola Liljedahl, "dev@dpdk.org", Honnappa Nagarahalli,
 "Ananyev, Konstantin", Steve Capper, nd, "stable@dpdk.org"
To: "Gavin Hu (Arm Technology China)"
Content-Disposition: inline
List-Id: DPDK patches and discussions
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

-----Original Message-----
> Date: Mon, 8 Oct 2018 10:33:43 +0000
> From: "Gavin Hu (Arm Technology China)"
> To: Ola Liljedahl, Jerin Jacob
> CC: "dev@dpdk.org", Honnappa Nagarahalli, "Ananyev, Konstantin",
>  Steve Capper, nd, "stable@dpdk.org"
> Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>
>
> I did benchmarking w/o and w/ the patch; it did not show any noticeable
> difference in terms of latency.
> Here is the full log (3 runs w/o the patch and 2 runs w/ the patch).
>
> sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024 -- -i

These counters are running at 100 MHz. Use PMU counters to get more
accurate results.

https://doc.dpdk.org/guides/prog_guide/profile_app.html
See: 55.2. Profiling on ARM64
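For example, with CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU enabled, rte_rdtsc()
reads the PMU cycle counter instead of the 100 MHz generic timer. A minimal
sketch of reading that counter directly, assuming user-space access to
PMCCNTR_EL0 has already been enabled from the kernel side as the guide
describes:

#include <stdint.h>

/* Read the ARMv8 PMU cycle counter (cycle accurate) rather than the
 * generic timer (cntvct_el0), which ticks at a fixed low rate such as
 * 100 MHz. This traps unless user-space PMU access has been enabled. */
static inline uint64_t
pmu_cycles(void)
{
	uint64_t cycles;

	asm volatile("mrs %0, pmccntr_el0" : "=r" (cycles));
	return cycles;
}
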
> > -----Original Message-----
> > From: Ola Liljedahl
> > Sent: Monday, October 8, 2018 6:26 PM
> > To: Jerin Jacob
> > Cc: dev@dpdk.org; Honnappa Nagarahalli; Ananyev, Konstantin;
> >  Gavin Hu (Arm Technology China); Steve Capper; nd; stable@dpdk.org
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> >
> > On 08/10/2018, 12:00, "Jerin Jacob" wrote:
> >
> > > -----Original Message-----
> > > Date: Mon, 8 Oct 2018 09:22:05 +0000
> > > From: Ola Liljedahl
> > > To: Jerin Jacob
> > > CC: "dev@dpdk.org", Honnappa Nagarahalli, "Ananyev, Konstantin",
> > >  "Gavin Hu (Arm Technology China)", Steve Capper, nd, "stable@dpdk.org"
> > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > > user-agent: Microsoft-MacOutlook/10.11.0.180909
> > >
> > > On 08/10/2018, 08:06, "Jerin Jacob" wrote:
> > >
> > > > -----Original Message-----
> > > > Date: Sun, 7 Oct 2018 20:44:54 +0000
> > > > From: Ola Liljedahl
> > > > To: Jerin Jacob
> > > > CC: "dev@dpdk.org", Honnappa Nagarahalli, "Ananyev, Konstantin",
> > > >  "Gavin Hu (Arm Technology China)", Steve Capper, nd, "stable@dpdk.org"
> > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > > > user-agent: Microsoft-MacOutlook/10.11.0.180909
> > > >
> > > > > Could you please fix the email client for inline replies.
> > > Sorry, that doesn't seem to be possible with Outlook for Mac 16 or
> > > Office365. The official Office365/Outlook documentation doesn't match
> > > the actual user interface...
> > >
> > > > > https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
> > >
> > > > On 07/10/2018, 06:03, "Jerin Jacob" wrote:
> > > >
> > > >     In the arm64 case, it will have ATOMIC_RELAXED followed by the
> > > >     asm volatile ("":::"memory") of rte_pause().
> > > >     I wouldn't have any issue if the generated code were the same as
> > > >     or better than the existing case, but that is not the case, right?
> > > > The existing case is actually not interesting (IMO), as it exposes
> > > > undefined behaviour, which allows the compiler to do anything. But you
> > > > seem to be satisfied with "works for me, right here right now". I
> > > > think the cost of avoiding undefined behaviour is acceptable (actually
> > > > I don't think it will even be noticeable).
> > >
> > > I am not convinced, because of the use of volatile in the head and tail
> > > indexes. For me that brings the defined behaviour.
> > > As long as you don't mix in C11 atomic accesses (just use "plain"
> > > accesses to volatile objects), it is AFAIK defined behaviour (but not
> > > necessarily using atomic loads and stores). But I quoted the C11 spec
> > > where it explicitly mentions that mixing atomic and non-atomic accesses
> > > to the same object is undefined behaviour. Don't argue with me, argue
> > > with the C11 spec. If you want to disobey the spec, this should at
> > > least be called out in the code with a comment.
> >
> > That boils down to only one question: should we follow the C11 spec? Why
> > not take only load-acquire and store-release semantics, just like the
> > Linux kernel and FreeBSD?
> > And introduce even more undefined behaviour?
> >
> > It does not look like the C11 memory model is super efficient in terms
> > of the gcc implementation.
> > You are making a mountain out of a molehill.
> >
> > I think this "problem" with one additional ADD instruction will only
> > concern __atomic_load_n(__ATOMIC_RELAXED) and
> > __atomic_store_n(__ATOMIC_RELAXED), because the compiler separates the
> > address generation (adding the offset of the struct member) from the
> > load or store itself. For other atomic operations and memory orderings
> > (e.g. __atomic_load_n(__ATOMIC_ACQUIRE)), the extra ADD instruction will
> > be included anyway (as long as we access a non-first struct member),
> > because e.g. LDAR only accepts a base register with no offset.
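To make the codegen point concrete, a minimal sketch of the three flavours
being compared (the struct is illustrative, and the exact instruction
sequences are of course compiler- and version-dependent):

#include <stdint.h>

struct headtail {
	uint32_t head;
	uint32_t tail; /* non-first member: needs base + offset addressing */
};

/* Plain volatile read: typically a single "ldr w0, [x0, #4]". */
static inline uint32_t read_tail_volatile(const volatile struct headtail *ht)
{
	return ht->tail;
}

/* Relaxed atomic load: same LDR, but some compiler versions emit a
 * separate "add x0, x0, #4" before "ldr w0, [x0]". */
static inline uint32_t read_tail_relaxed(const struct headtail *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
}

/* Acquire load: LDAR accepts only a plain base register, so the ADD is
 * unavoidable for a non-first member: "add x0, x0, #4; ldar w0, [x0]". */
static inline uint32_t read_tail_acquire(const struct headtail *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_ACQUIRE);
}
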
> > I suggest that minimising the imposed memory orderings can have a much
> > larger (positive) effect on performance than avoiding one ADD
> > instruction (memory accesses are much slower than CPU ALU instructions).
> > Using the C11 memory model and identifying exactly which objects are
> > used for synchronisation, and whether (any) updates to shared memory are
> > acquired or released (no updates to shared memory means relaxed order
> > can be used), will provide maximum freedom to the compiler and hardware
> > to get the best result.
> >
> > The FreeBSD and DPDK ring buffers show some fundamental
> > misunderstandings here. Instead, excessive orderings and explicit
> > barriers have been used as band-aids, with unknown effects on
> > performance.
> >
> > > That is the reason why I shared the generated assembly code. If you
> > > think otherwise, pick any compiler and see the generated output.
> > This is what one compiler for one architecture generates today. These
> > things change. Other things that used to work, or worked for some
> > specific architecture, have stopped working in newer versions of the
> > compiler.
> >
> > > And the FreeBSD implementation of the ring buffer (which DPDK derived
> > > from) doesn't have such logic; see
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
> > It looks like FreeBSD uses some kind of C11 atomic memory model-inspired
> > API, although I don't see exactly how e.g. atomic_store_rel_int() is
> > implemented. The code also mixes in explicit barriers, so it is
> > definitely not pure C11 memory model usage. And finally, it doesn't
> > establish the proper load-acquire/store-release relationships (e.g. a
> > store-release of cons_tail requires a load-acquire of cons_tail; same
> > for prod_tail).
> >
> > "* multi-producer safe lock-free ring buffer enqueue"
> > The comment is also wrong. This design is not lock-free; how could it be
> > when there is spinning (waiting) for other threads in the code? If a
> > thread must wait for other threads, then by definition the design is
> > blocking.
> >
> > So you are saying that because FreeBSD is doing it wrong, DPDK can also
> > do it wrong?
> >
> > > See below too.
> >
> > > > Skipping the compiler memory barrier in rte_pause() potentially
> > > > allows for optimisations that provide much more benefit, e.g. hiding
> > > > some cache miss latency for later loads. The DPDK ring buffer
> > > > implementation is defined so as to enable inlining of the
> > > > enqueue/dequeue functions into the caller; any code could
> > > > immediately follow these calls.
> > > >
> > > > From INTERNATIONAL STANDARD © ISO/IEC 9899:201x,
> > > > Programming languages — C:
> > > >
> > > > 5.1.2.4
> > > > 4 Two expression evaluations conflict if one of them modifies a
> > > > memory location and the other one reads or modifies the same memory
> > > > location.
> > > >
> > > > 25 The execution of a program contains a data race if it contains
> > > > two conflicting actions in different threads, at least one of which
> > > > is not atomic, and neither happens before the other. Any such data
> > > > race results in undefined behavior.
> > >
> > > IMO, both conditions are satisfied if the variable is volatile and a
> > > 32-bit read is atomic on 32-bit and 64-bit machines. If not, the
> > > problem persists for the generic case as well
> > > (lib/librte_ring/rte_ring_generic.h).
> > The read from a volatile object is not an atomic access per the C11
> > spec. It just happens to be translated to an instruction (on x86-64 and
> > AArch64/A64) that implements an atomic load. I don't think any compiler
> > would change this code generation and suddenly generate some non-atomic
> > load instruction for a program that *only* uses volatile to do "atomic"
> > accesses. But a future compiler could detect the mix of atomic and
> > non-atomic accesses and mark this expression as causing undefined
> > behaviour, and that would have consequences for code generation.
> >
> > > I agree with you on the C11 memory model semantics usage. The reason
> > > why I proposed the file name rte_ring_c11_mem.h is that DPDK itself
> > > did not have definitions for load-acquire and store-release semantics.
> > > I was looking at taking the load-acquire and store-release semantics
> > > from C11 instead of creating a new API like the Linux kernel or
> > > FreeBSD (APIs like atomic_load_acq_32(), atomic_store_rel_32()). If
> > > the file name is your concern, then we could create new abstractions
> > > as well. That would help the existing KNI problem as well.
> > I appreciate your embrace of the C11 memory model. I think it is better
> > for describing (both to the compiler and to humans) which and how
> > objects are used for synchronisation.
> >
> > However, I don't think an API as you suggest (and others have suggested
> > before, e.g. as done in ODP) is a good idea. There is an infinite number
> > of possible base types, an increasing number of operations, and a bunch
> > of different memory orderings; a "complete" API would be very large and
> > difficult to test, and most members of the API would never be used.
> > GCC and Clang both support the __atomic intrinsics. This API avoids the
> > problems I described above. Or we could use the official C11 syntax
> > (stdatomic.h). But then we have the problem of using pre-C11
> > compilers...
> >
> > I have no objection if everyone agrees to move to the C11 memory model
> > with the __atomic intrinsics. But if we need to keep both, then an
> > atomic_load_acq_32() kind of API makes sense.
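Something along these lines, for example (the names are purely
illustrative, not an existing DPDK API; a real version would cover the
other sizes too):

#include <stdint.h>

/* Hypothetical wrappers in the spirit of FreeBSD's atomic_load_acq_32()/
 * atomic_store_rel_32(), implemented with the __atomic intrinsics. With
 * these, the ring index declarations would not need volatile in C11 mode. */
static inline uint32_t rte_atomic_load_acq_32(const uint32_t *addr)
{
	return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
}

static inline void rte_atomic_store_rel_32(uint32_t *addr, uint32_t val)
{
	__atomic_store_n(addr, val, __ATOMIC_RELEASE);
}
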
> > > I think currently it is mixed usage because the same variable
> > > declaration is used for the C11 and non-C11 cases. Ideally we won't
> > > need "volatile" for the C11 case. Either we need to change only to
> > > C11 mode OR have APIs for atomic_load_acq_() and atomic_store_rel_()
> > > to allow both models, like the Linux kernel and FreeBSD.
> > >
> > > >
> > > > -- Ola