Re: [lttng-dev] liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ? - Mathieu Desnoyers via lttng-dev

From: Mathieu Desnoyers via lttng-dev <lttng-dev@lists.lttng.org>
To: Duncan Sands <baldrick@free.fr>
Cc: lttng-dev <lttng-dev@lists.lttng.org>, paulmck <paulmck@kernel.org>
Subject: Re: [lttng-dev] liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?
Date: Fri, 16 Apr 2021 16:39:38 -0400 (EDT)
Message-ID: <612661965.84539.1618605578871.JavaMail.zimbra@efficios.com>
In-Reply-To: <0b613c40-24b4-6836-d47b-705ac0e46386@free.fr>

----- On Apr 16, 2021, at 11:22 AM, lttng-dev lttng-dev@lists.lttng.org wrote:

> Hi Mathieu,
> 

Hi Duncan,

> On 4/16/21 4:52 PM, Mathieu Desnoyers via lttng-dev wrote:
>> Hi Paul, Will, Peter,
>> 
>> I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO
>> is able to break rcu_dereference. This seems to be taken care of by
>> arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree.
>> 
>> In the liburcu user-space library, we have this comment near rcu_dereference()
>> in include/urcu/static/pointer.h:
>> 
>>   * The compiler memory barrier in CMM_LOAD_SHARED() ensures that value-speculative
>>   * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform the
>>   * data read before the pointer read by speculating the value of the pointer.
>>   * Correct ordering is ensured because the pointer is read as a volatile access.
>>   * This acts as a global side-effect operation, which forbids reordering of
>>   * dependent memory operations. Note that such concern about dependency-breaking
>>   * optimizations will eventually be taken care of by the "memory_order_consume"
>>   * addition to forthcoming C++ standard.
>> 
>> (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was introduced in
>> liburcu as a public API before READ_ONCE() existed in the Linux kernel)
> 
> this is not directly on topic, but what do you think of porting userspace RCU to
> use the C++ memory model and GCC/LLVM atomic builtins (__atomic_store etc)
> rather than rolling your own?  Tools like thread sanitizer would then understand
> what userspace RCU is doing.  Not to mention the compiler.  More developers
> would understand it too!

Yes, that sounds like a clear win.

> From a code organization viewpoint, going down this path would presumably mean
> directly using GCC/LLVM atomic support when available, and falling back on
> something like the current uatomic to emulate them for older compilers.

Yes, I think this approach would be good. One caveat though: the GCC atomic
operations were known to be broken with some older compilers for specific architectures,
so we may have to keep track of a list of known buggy compilers to use our own
implementation instead in those situations. It's been a while since I've looked at
this though, so we may not even be supporting those old compilers in liburcu anymore.
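
To make the fallback idea concrete, here is a minimal sketch of how the
selection could look. It is not liburcu code: every SKETCH_* name is
hypothetical, SKETCH_COMPILER_ATOMICS_BROKEN stands in for a known-buggy
compiler list that would still have to be established by testing, and clang
is simply assumed to be recent enough to provide the builtins. The only
version fact relied upon is that GCC gained the __atomic builtins in 4.7.

```c
/* The __atomic builtins appeared in GCC 4.7; clang is assumed recent enough. */
#if defined(__clang__) || \
	(defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)))
# define SKETCH_HAVE_ATOMIC_BUILTINS 1
#endif

/*
 * SKETCH_COMPILER_ATOMICS_BROKEN is hypothetical: a configure step could
 * define it for compilers found on a known-buggy list.
 */
#if defined(SKETCH_HAVE_ATOMIC_BUILTINS) && !defined(SKETCH_COMPILER_ATOMICS_BROKEN)
# define SKETCH_LOAD_SHARED(p)						\
	__extension__							\
	({								\
		__typeof__(p) _sketch_v;				\
		__atomic_load(&(p), &_sketch_v, __ATOMIC_RELAXED);	\
		_sketch_v;						\
	})
# define SKETCH_STORE_SHARED(p, v)					\
	__extension__							\
	({								\
		__typeof__(p) _sketch_w = (v);				\
		__atomic_store(&(p), &_sketch_w, __ATOMIC_RELAXED);	\
		_sketch_w;						\
	})
#else
/* Fallback: plain volatile access, in the style of CMM_ACCESS_ONCE(). */
# define SKETCH_LOAD_SHARED(p)		(*(volatile __typeof__(p) *)&(p))
# define SKETCH_STORE_SHARED(p, v)	(*(volatile __typeof__(p) *)&(p) = (v))
#endif

static int sketch_shared_counter;

/* Store a value, then read it back through whichever variant was selected. */
static int sketch_roundtrip(int value)
{
	(void) SKETCH_STORE_SHARED(sketch_shared_counter, value);
	return SKETCH_LOAD_SHARED(sketch_shared_counter);
}
```

Both variants keep the same macro interface, so callers would not need to
know which implementation was picked at build time.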

> 
> Some parts of uatomic have pretty clear equivalents (see below), but not all, so
> the conversion could be quite tricky.

We'd have to see on a case by case basis, but it cannot hurt to start the effort
by integrating the easy ones.

> 
>> Peter tells me the "memory_order_consume" is not something which can be used
>> today.
> 
> This is a pity, because it seems to have been invented with rcu_dereference in
> mind.

Actually, (see other leg of this email thread) memory_order_consume works for
rcu_dereference, but it appears to be implemented as a slightly heavier than
required memory_order_acquire on weakly-ordered architectures. So we're just
moving the issue into compiler-land. Oh well.
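
For illustration, here is roughly what an rcu_dereference-style read looks
like when written with C11 memory_order_consume. This is only a sketch under
the assumption of a C11 compiler with <stdatomic.h>; the sketch_* names and
the node type are made up for the example and are not liburcu API.

```c
#include <stdatomic.h>
#include <stddef.h>

struct sketch_node {
	int payload;
};

/* Pointer published by an updater and followed by readers. */
static _Atomic(struct sketch_node *) sketch_head;

/* Publish: release ordering makes the initialized payload visible first. */
static void sketch_publish(struct sketch_node *node)
{
	atomic_store_explicit(&sketch_head, node, memory_order_release);
}

/*
 * Read side: consume expresses the dependency ordering that
 * rcu_dereference needs. Compilers may treat it as acquire.
 */
static struct sketch_node *sketch_dereference(void)
{
	return atomic_load_explicit(&sketch_head, memory_order_consume);
}

/* Follow the pointer; the payload read depends on the pointer load. */
static int sketch_read_payload(void)
{
	struct sketch_node *node = sketch_dereference();

	return node ? node->payload : -1;
}
```

The code is correct either way; the cost of the promotion to acquire only
shows up as an extra barrier on weakly-ordered architectures.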

> 
>> Any information on its status at C/C++ standard levels and implementation-wise ?
>> 
>> Pragmatically speaking, what should we change in liburcu to ensure we don't
>> generate broken code when LTO is enabled ? I suspect there are a few options here:
>> 
>> 1) Fail to build if LTO is enabled,
>> 2) Generate slower code for rcu_dereference, either on all architectures or only
>>     on weakly-ordered architectures,
>> 3) Generate different code depending on whether LTO is enabled or not. AFAIU
>>     this would only work if every compile unit is aware that it will end up
>>     being optimized with LTO. Not sure how this could be done in the context
>>     of user-space.
>> 4) [ Insert better idea here. ]
>> 
>> Thoughts ?
> 
> Best wishes, Duncan.
> 
> PS: We are experimentally running with the following patch, as it already makes
> thread sanitizer a lot happier:

Quick question: should we use __atomic_load() or atomic_load_explicit() (C) and
((std::atomic<__typeof__(x)>)(x)).load() (C++) ?

We'd have to make this dependent on C11/C++11 though, and keep volatile for older
compilers.
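
As a sketch of the C side of that question, the C11 route could be gated on
the standard feature macros, with volatile kept as the fallback. The
SKETCH_* names are hypothetical, and the cast relies on the assumption that
_Atomic T and T share size and representation, which ISO C does not
guarantee (it holds for the usual scalar types with GCC and clang).

```c
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
	!defined(__STDC_NO_ATOMICS__)
# include <stdatomic.h>
/*
 * ASSUMPTION: _Atomic T and T have the same size and representation,
 * so a plain object can be read through an _Atomic-qualified pointer.
 */
# define SKETCH_READ_ONCE(x)						\
	atomic_load_explicit((_Atomic(__typeof__(x)) *)&(x),		\
			     memory_order_relaxed)
#else
/* Pre-C11 compilers: keep the volatile access. */
# define SKETCH_READ_ONCE(x)	(*(volatile __typeof__(x) *)&(x))
#endif

static long sketch_flag = 5;

/* Read the flag once, through whichever variant was selected. */
static long sketch_read_flag(void)
{
	return SKETCH_READ_ONCE(sketch_flag);
}
```

The __atomic_load() builtin avoids that cast because it accepts a pointer to
a plain object, at the price of being a GCC/clang extension rather than
standard C.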

Last thing: I have limited time to work on this, so if you have well-tested patches
you wish to submit, I'll do my best to review them!

Thanks,

Mathieu

> 
> --- a/External/UserspaceRCU/userspace-rcu/include/urcu/system.h
> +++ b/External/UserspaceRCU/userspace-rcu/include/urcu/system.h
> @@ -26,34 +26,45 @@
>   * Identify a shared load. A cmm_smp_rmc() or cmm_smp_mc() should come
>   * before the load.
>   */
> -#define _CMM_LOAD_SHARED(p)	       CMM_ACCESS_ONCE(p)
> +#define _CMM_LOAD_SHARED(p)					\
> +	__extension__						\
> +	({							\
> +		__typeof__(p) v;				\
> +		__atomic_load(&p, &v, __ATOMIC_RELAXED);	\
> +		v;						\
> +	})
> 
>  /*
>   * Load a data from shared memory, doing a cache flush if required.
>   */
> -#define CMM_LOAD_SHARED(p)			\
> -	__extension__			\
> -	({				\
> -		cmm_smp_rmc();		\
> -		_CMM_LOAD_SHARED(p);	\
> +#define CMM_LOAD_SHARED(p)					\
> +	__extension__						\
> +	({							\
> +		__typeof__(p) v;				\
> +		__atomic_load(&p, &v, __ATOMIC_ACQUIRE);	\
> +		v;						\
>  	})
> 
>  /*
>   * Identify a shared store. A cmm_smp_wmc() or cmm_smp_mc() should
>   * follow the store.
>   */
> -#define _CMM_STORE_SHARED(x, v)	__extension__ ({ CMM_ACCESS_ONCE(x) = (v); })
> +#define _CMM_STORE_SHARED(x, v)					\
> +	__extension__						\
> +	({							\
> +		__typeof__(x) w = v;				\
> +		__atomic_store(&x, &w, __ATOMIC_RELAXED);	\
> +	})
> 
>  /*
>   * Store v into x, where x is located in shared memory. Performs the
>   * required cache flush after writing. Returns v.
>   */
> -#define CMM_STORE_SHARED(x, v)						\
> -	__extension__							\
> -	({								\
> -		__typeof__(x) _v = _CMM_STORE_SHARED(x, v);		\
> -		cmm_smp_wmc();						\
> -		_v = _v;	/* Work around clang "unused result" */	\
> +#define CMM_STORE_SHARED(x, v)					\
> +	__extension__						\
> +	({							\
> +		__typeof__(x) w = v;				\
> +		__atomic_store(&x, &w, __ATOMIC_RELEASE);	\
>  	})
> 
>  #endif /* _URCU_SYSTEM_H */
> 
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com