From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jerin Jacob
Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
Date: Mon, 8 Oct 2018 16:19:45 +0530
Message-ID: <20181008104944.GD11081@jerin>
References: <60055965-A7C8-4E9F-8668-0AE1DCE57515@arm.com>
 <20181006074126.GA16715@jerin>
 <20181007040243.GA1850@jerin>
 <7A156041-23EC-4CCB-B129-3607AF34A992@arm.com>
 <20181008060629.GA5228@jerin>
 <063A95EC-CFC1-42F7-B864-DFB9C6718AC8@arm.com>
 <20181008100004.GB11081@jerin>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: Ola Liljedahl, "dev@dpdk.org", Honnappa Nagarahalli,
 "Ananyev, Konstantin", Steve Capper, nd, "stable@dpdk.org"
To: "Gavin Hu (Arm Technology China)"
Content-Disposition: inline
List-Id: DPDK patches and discussions
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

-----Original Message-----
> Date: Mon, 8 Oct 2018 10:33:43 +0000
> From: "Gavin Hu (Arm Technology China)"
> To: Ola Liljedahl, Jerin Jacob
> CC: "dev@dpdk.org", Honnappa Nagarahalli, "Ananyev, Konstantin",
>  Steve Capper, nd, "stable@dpdk.org"
> Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>
>
> I did benchmarking w/o and w/ the patch; it did not show any noticeable
> difference in terms of latency.
> Here is the full log (3 runs w/o the patch and 2 runs w/ the patch).
>
> sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024 -- -i

These counters are running at 100 MHz. Use PMU counters to get more
accurate results.

https://doc.dpdk.org/guides/prog_guide/profile_app.html
See: 55.2. Profiling on ARM64
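For example, with CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU enabled, rte_rdtsc()
reads the PMU cycle counter instead of the 100 MHz generic timer. A minimal
sketch of reading that counter directly, assuming user-space access to
PMCCNTR_EL0 has already been enabled from the kernel side as the guide
describes:

#include <stdint.h>

/* Read the ARMv8 PMU cycle counter (cycle accurate) rather than the
 * generic timer (cntvct_el0), which ticks at a fixed low rate such as
 * 100 MHz. This traps unless user-space PMU access has been enabled. */
static inline uint64_t
pmu_cycles(void)
{
	uint64_t cycles;

	asm volatile("mrs %0, pmccntr_el0" : "=r" (cycles));
	return cycles;
}
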
> > -----Original Message-----
> > From: Ola Liljedahl
> > Sent: Monday, October 8, 2018 6:26 PM
> > To: Jerin Jacob
> > Cc: dev@dpdk.org; Honnappa Nagarahalli; Ananyev, Konstantin;
> >  Gavin Hu (Arm Technology China); Steve Capper; nd; stable@dpdk.org
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> >
> > On 08/10/2018, 12:00, "Jerin Jacob" wrote:
> >
> > > -----Original Message-----
> > > Date: Mon, 8 Oct 2018 09:22:05 +0000
> > > From: Ola Liljedahl
> > > To: Jerin Jacob
> > > CC: "dev@dpdk.org", Honnappa Nagarahalli, "Ananyev, Konstantin",
> > >  "Gavin Hu (Arm Technology China)", Steve Capper, nd, "stable@dpdk.org"
> > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > > user-agent: Microsoft-MacOutlook/10.11.0.180909
> > >
> > > On 08/10/2018, 08:06, "Jerin Jacob" wrote:
> > >
> > > > -----Original Message-----
> > > > Date: Sun, 7 Oct 2018 20:44:54 +0000
> > > > From: Ola Liljedahl
> > > > To: Jerin Jacob
> > > > CC: "dev@dpdk.org", Honnappa Nagarahalli, "Ananyev, Konstantin",
> > > >  "Gavin Hu (Arm Technology China)", Steve Capper, nd, "stable@dpdk.org"
> > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > > > user-agent: Microsoft-MacOutlook/10.11.0.180909
> > > >
> > > > > Could you please fix the email client for inline replies.
> > > Sorry, that doesn't seem to be possible with Outlook for Mac 16 or
> > > Office365. The official Office365/Outlook documentation doesn't match
> > > the actual user interface...
> > >
> > > > > https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
> > >
> > > > On 07/10/2018, 06:03, "Jerin Jacob" wrote:
> > > >
> > > >     In the arm64 case, it will have ATOMIC_RELAXED followed by the
> > > >     asm volatile ("":::"memory") of rte_pause().
> > > >     I wouldn't have any issue if the generated code were the same as
> > > >     or better than the existing case, but that is not the case, right?
> > > > The existing case is actually not interesting (IMO), as it exposes
> > > > undefined behaviour, which allows the compiler to do anything. But you
> > > > seem to be satisfied with "works for me, right here right now". I
> > > > think the cost of avoiding undefined behaviour is acceptable (actually
> > > > I don't think it will even be noticeable).
> > >
> > > I am not convinced, because of the use of volatile in the head and tail
> > > indexes. For me that brings the defined behaviour.
> > > As long as you don't mix in C11 atomic accesses (just use "plain"
> > > accesses to volatile objects), it is AFAIK defined behaviour (but not
> > > necessarily using atomic loads and stores). But I quoted the C11 spec
> > > where it explicitly mentions that mixing atomic and non-atomic accesses
> > > to the same object is undefined behaviour. Don't argue with me, argue
> > > with the C11 spec. If you want to disobey the spec, this should at
> > > least be called out in the code with a comment.
> >
> > That boils down to only one question: should we follow the C11 spec? Why
> > not take only load-acquire and store-release semantics, just like the
> > Linux kernel and FreeBSD?
> > And introduce even more undefined behaviour?
> >
> > It does not look like the C11 memory model is super efficient in terms
> > of the gcc implementation.
> > You are making a mountain out of a molehill.
> >
> > I think this "problem" with one additional ADD instruction will only
> > concern __atomic_load_n(__ATOMIC_RELAXED) and
> > __atomic_store_n(__ATOMIC_RELAXED), because the compiler separates the
> > address generation (adding the offset of the struct member) from the
> > load or store itself. For other atomic operations and memory orderings
> > (e.g. __atomic_load_n(__ATOMIC_ACQUIRE)), the extra ADD instruction will
> > be included anyway (as long as we access a non-first struct member),
> > because e.g. LDAR only accepts a base register with no offset.
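To make the codegen point concrete, a minimal sketch of the three flavours
being compared (the struct is illustrative, and the exact instruction
sequences are of course compiler- and version-dependent):

#include <stdint.h>

struct headtail {
	uint32_t head;
	uint32_t tail; /* non-first member: needs base + offset addressing */
};

/* Plain volatile read: typically a single "ldr w0, [x0, #4]". */
static inline uint32_t read_tail_volatile(const volatile struct headtail *ht)
{
	return ht->tail;
}

/* Relaxed atomic load: same LDR, but some compiler versions emit a
 * separate "add x0, x0, #4" before "ldr w0, [x0]". */
static inline uint32_t read_tail_relaxed(const struct headtail *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
}

/* Acquire load: LDAR accepts only a plain base register, so the ADD is
 * unavoidable for a non-first member: "add x0, x0, #4; ldar w0, [x0]". */
static inline uint32_t read_tail_acquire(const struct headtail *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_ACQUIRE);
}
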
> > I suggest that minimising the imposed memory orderings can have a much
> > larger (positive) effect on performance than avoiding one ADD
> > instruction (memory accesses are much slower than CPU ALU instructions).
> > Using the C11 memory model and identifying exactly which objects are
> > used for synchronisation, and whether (any) updates to shared memory are
> > acquired or released (no updates to shared memory means relaxed order
> > can be used), will provide maximum freedom to the compiler and hardware
> > to get the best result.
> >
> > The FreeBSD and DPDK ring buffers show some fundamental
> > misunderstandings here. Instead, excessive orderings and explicit
> > barriers have been used as band-aids, with unknown effects on
> > performance.
> >
> > > That is the reason why I shared the generated assembly code. If you
> > > think otherwise, pick any compiler and see the generated output.
> > This is what one compiler for one architecture generates today. These
> > things change. Other things that used to work, or worked for some
> > specific architecture, have stopped working in newer versions of the
> > compiler.
> >
> > > And the FreeBSD implementation of the ring buffer (which DPDK derived
> > > from) doesn't have such logic; see
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
> > It looks like FreeBSD uses some kind of C11 atomic memory model-inspired
> > API, although I don't see exactly how e.g. atomic_store_rel_int() is
> > implemented. The code also mixes in explicit barriers, so it is
> > definitely not pure C11 memory model usage. And finally, it doesn't
> > establish the proper load-acquire/store-release relationships (e.g. a
> > store-release of cons_tail requires a load-acquire of cons_tail; same
> > for prod_tail).
> >
> > "* multi-producer safe lock-free ring buffer enqueue"
> > The comment is also wrong. This design is not lock-free; how could it be
> > when there is spinning (waiting) for other threads in the code? If a
> > thread must wait for other threads, then by definition the design is
> > blocking.
> >
> > So you are saying that because FreeBSD is doing it wrong, DPDK can also
> > do it wrong?
> >
> > > See below too.
> >
> > > > Skipping the compiler memory barrier in rte_pause() potentially
> > > > allows for optimisations that provide much more benefit, e.g. hiding
> > > > some cache miss latency for later loads. The DPDK ring buffer
> > > > implementation is defined so as to enable inlining of the
> > > > enqueue/dequeue functions into the caller; any code could
> > > > immediately follow these calls.
> > > >
> > > > From INTERNATIONAL STANDARD © ISO/IEC 9899:201x,
> > > > Programming languages — C:
> > > >
> > > > 5.1.2.4
> > > > 4 Two expression evaluations conflict if one of them modifies a
> > > > memory location and the other one reads or modifies the same memory
> > > > location.
> > > >
> > > > 25 The execution of a program contains a data race if it contains
> > > > two conflicting actions in different threads, at least one of which
> > > > is not atomic, and neither happens before the other. Any such data
> > > > race results in undefined behavior.
> > >
> > > IMO, both conditions are satisfied if the variable is volatile and a
> > > 32-bit read is atomic on 32-bit and 64-bit machines. If not, the
> > > problem persists for the generic case as well
> > > (lib/librte_ring/rte_ring_generic.h).
> > The read from a volatile object is not an atomic access per the C11
> > spec. It just happens to be translated to an instruction (on x86-64 and
> > AArch64/A64) that implements an atomic load. I don't think any compiler
> > would change this code generation and suddenly generate some non-atomic
> > load instruction for a program that *only* uses volatile to do "atomic"
> > accesses. But a future compiler could detect the mix of atomic and
> > non-atomic accesses and mark this expression as causing undefined
> > behaviour, and that would have consequences for code generation.
> >
> > > I agree with you on the C11 memory model semantics usage. The reason
> > > why I proposed the file name rte_ring_c11_mem.h is that DPDK itself
> > > did not have definitions for load-acquire and store-release semantics.
> > > I was looking at taking the load-acquire and store-release semantics
> > > from C11 instead of creating a new API like the Linux kernel or
> > > FreeBSD (APIs like atomic_load_acq_32(), atomic_store_rel_32()). If
> > > the file name is your concern, then we could create new abstractions
> > > as well. That would help the existing KNI problem as well.
> > I appreciate your embrace of the C11 memory model. I think it is better
> > for describing (both to the compiler and to humans) which and how
> > objects are used for synchronisation.
> >
> > However, I don't think an API as you suggest (and others have suggested
> > before, e.g. as done in ODP) is a good idea. There is an infinite number
> > of possible base types, an increasing number of operations, and a bunch
> > of different memory orderings; a "complete" API would be very large and
> > difficult to test, and most members of the API would never be used.
> > GCC and Clang both support the __atomic intrinsics. This API avoids the
> > problems I described above. Or we could use the official C11 syntax
> > (stdatomic.h). But then we have the problem of using pre-C11
> > compilers...
> >
> > I have no objection if everyone agrees to move to the C11 memory model
> > with the __atomic intrinsics. But if we need to keep both, then an
> > atomic_load_acq_32() kind of API makes sense.
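Something along these lines, for example (the names are purely
illustrative, not an existing DPDK API; a real version would cover the
other sizes too):

#include <stdint.h>

/* Hypothetical wrappers in the spirit of FreeBSD's atomic_load_acq_32()/
 * atomic_store_rel_32(), implemented with the __atomic intrinsics. With
 * these, the ring index declarations would not need volatile in C11 mode. */
static inline uint32_t rte_atomic_load_acq_32(const uint32_t *addr)
{
	return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
}

static inline void rte_atomic_store_rel_32(uint32_t *addr, uint32_t val)
{
	__atomic_store_n(addr, val, __ATOMIC_RELEASE);
}
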
> > > I think currently it is mixed usage because the same variable
> > > declaration is used for the C11 and non-C11 cases. Ideally we won't
> > > need "volatile" for the C11 case. Either we need to change only to
> > > C11 mode OR have APIs for atomic_load_acq_() and atomic_store_rel_()
> > > to allow both models, like the Linux kernel and FreeBSD.
> > >
> > > >
> > > > -- Ola