* [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
@ 2019-01-14 12:12 Thomas Huth
2019-01-14 12:16 ` Philippe Mathieu-Daudé
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: Thomas Huth @ 2019-01-14 12:12 UTC (permalink / raw)
To: Richard Henderson, Aurelien Jarno, Peter Maydell,
Alex Bennée, Cornelia Huck, qemu-devel
Cc: qemu-s390x
Clang v7.0.1 does not like the __int128 variable type for inline
assembly on s390x:
In file included from fpu/softfloat.c:97:
include/fpu/softfloat-macros.h:647:9: error: inline asm error:
This value type register class is not natively supported!
asm("dlgr %0, %1" : "+r"(n) : "r"(d));
^
Disable this code part there now when compiling with Clang, so that
the generic code gets used instead.
Signed-off-by: Thomas Huth <thuth@redhat.com>
---
include/fpu/softfloat-macros.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
index b1d772e..bd5b641 100644
--- a/include/fpu/softfloat-macros.h
+++ b/include/fpu/softfloat-macros.h
@@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
uint64_t q;
asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
return q;
-#elif defined(__s390x__)
+#elif defined(__s390x__) && !defined(__clang__)
/* Need to use a TImode type to get an even register pair for DLGR. */
unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
asm("dlgr %0, %1" : "+r"(n) : "r"(d));
--
1.8.3.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
@ 2019-01-14 12:16 ` Philippe Mathieu-Daudé
2019-01-14 16:37 ` Alex Bennée
2019-01-14 21:40 ` Richard Henderson
` (2 subsequent siblings)
3 siblings, 1 reply; 23+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-01-14 12:16 UTC (permalink / raw)
To: Thomas Huth, Richard Henderson, Aurelien Jarno, Peter Maydell,
Alex Bennée, Cornelia Huck, qemu-devel
Cc: qemu-s390x
On 1/14/19 1:12 PM, Thomas Huth wrote:
> Clang v7.0.1 does not like the __int128 variable type for inline
> assembly on s390x:
>
> In file included from fpu/softfloat.c:97:
> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
> This value type register class is not natively supported!
> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
> ^
>
> Disable this code part there now when compiling with Clang, so that
> the generic code gets used instead.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
> include/fpu/softfloat-macros.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
> index b1d772e..bd5b641 100644
> --- a/include/fpu/softfloat-macros.h
> +++ b/include/fpu/softfloat-macros.h
> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
> uint64_t q;
> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
> return q;
> -#elif defined(__s390x__)
> +#elif defined(__s390x__) && !defined(__clang__)
Can we rather check if __int128 is natively supported? So this part get
compiled once Clang do support it, else we'll never use it...
> /* Need to use a TImode type to get an even register pair for DLGR. */
> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 12:16 ` Philippe Mathieu-Daudé
@ 2019-01-14 16:37 ` Alex Bennée
2019-01-14 17:03 ` Thomas Huth
0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-14 16:37 UTC (permalink / raw)
To: Philippe Mathieu-Daudé
Cc: Thomas Huth, Richard Henderson, Aurelien Jarno, Peter Maydell,
Cornelia Huck, qemu-devel, qemu-s390x
Philippe Mathieu-Daudé <philmd@redhat.com> writes:
> On 1/14/19 1:12 PM, Thomas Huth wrote:
>> Clang v7.0.1 does not like the __int128 variable type for inline
>> assembly on s390x:
>>
>> In file included from fpu/softfloat.c:97:
>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>> This value type register class is not natively supported!
>> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>> ^
>>
>> Disable this code part there now when compiling with Clang, so that
>> the generic code gets used instead.
>>
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>> include/fpu/softfloat-macros.h | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>> index b1d772e..bd5b641 100644
>> --- a/include/fpu/softfloat-macros.h
>> +++ b/include/fpu/softfloat-macros.h
>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>> uint64_t q;
>> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>> return q;
>> -#elif defined(__s390x__)
>> +#elif defined(__s390x__) && !defined(__clang__)
>
> Can we rather check if __int128 is natively supported? So this part get
> compiled once Clang do support it, else we'll never use it...
We already define CONFIG_INT128 so you could just use that.
Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?
>
>> /* Need to use a TImode type to get an even register pair for DLGR. */
>> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>
--
Alex Bennée
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 16:37 ` Alex Bennée
@ 2019-01-14 17:03 ` Thomas Huth
2019-01-14 18:58 ` Alex Bennée
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Huth @ 2019-01-14 17:03 UTC (permalink / raw)
To: Alex Bennée, Philippe Mathieu-Daudé
Cc: Richard Henderson, Aurelien Jarno, Peter Maydell, Cornelia Huck,
qemu-devel, qemu-s390x
On 2019-01-14 17:37, Alex Bennée wrote:
>
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>
>> On 1/14/19 1:12 PM, Thomas Huth wrote:
>>> Clang v7.0.1 does not like the __int128 variable type for inline
>>> assembly on s390x:
>>>
>>> In file included from fpu/softfloat.c:97:
>>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>> This value type register class is not natively supported!
>>> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>> ^
>>>
>>> Disable this code part there now when compiling with Clang, so that
>>> the generic code gets used instead.
>>>
>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>> ---
>>> include/fpu/softfloat-macros.h | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>> index b1d772e..bd5b641 100644
>>> --- a/include/fpu/softfloat-macros.h
>>> +++ b/include/fpu/softfloat-macros.h
>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>> uint64_t q;
>>> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>> return q;
>>> -#elif defined(__s390x__)
>>> +#elif defined(__s390x__) && !defined(__clang__)
>>
>> Can we rather check if __int128 is natively supported? So this part get
>> compiled once Clang do support it, else we'll never use it...
>
> We already define CONFIG_INT128 so you could just use that.
>
> Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?
Yes, CONFIG_INT128=y is also set with Clang on s390x. It's really just
that it does not like __int128 to be passed as parameters for inline
assembly...
Thomas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 17:03 ` Thomas Huth
@ 2019-01-14 18:58 ` Alex Bennée
2019-01-14 21:36 ` Richard Henderson
0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-14 18:58 UTC (permalink / raw)
To: Thomas Huth
Cc: Philippe Mathieu-Daudé,
Richard Henderson, Aurelien Jarno, Peter Maydell, Cornelia Huck,
qemu-devel, qemu-s390x
Thomas Huth <thuth@redhat.com> writes:
> On 2019-01-14 17:37, Alex Bennée wrote:
>>
>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>
>>> On 1/14/19 1:12 PM, Thomas Huth wrote:
>>>> Clang v7.0.1 does not like the __int128 variable type for inline
>>>> assembly on s390x:
>>>>
>>>> In file included from fpu/softfloat.c:97:
>>>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>>> This value type register class is not natively supported!
>>>> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>>> ^
>>>>
>>>> Disable this code part there now when compiling with Clang, so that
>>>> the generic code gets used instead.
>>>>
>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>> ---
>>>> include/fpu/softfloat-macros.h | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>>> index b1d772e..bd5b641 100644
>>>> --- a/include/fpu/softfloat-macros.h
>>>> +++ b/include/fpu/softfloat-macros.h
>>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>>> uint64_t q;
>>>> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>>> return q;
>>>> -#elif defined(__s390x__)
>>>> +#elif defined(__s390x__) && !defined(__clang__)
>>>
>>> Can we rather check if __int128 is natively supported? So this part get
>>> compiled once Clang do support it, else we'll never use it...
>>
>> We already define CONFIG_INT128 so you could just use that.
>>
>> Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?
>
> Yes, CONFIG_INT128=y is also set with Clang on s390x. It's really just
> that it does not like __int128 to be passed as parameters for inline
> assembly...
What about something like this:
modified include/fpu/softfloat-macros.h
@@ -641,12 +641,6 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
uint64_t q;
asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
return q;
-#elif defined(__s390x__)
- /* Need to use a TImode type to get an even register pair for DLGR. */
- unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
- asm("dlgr %0, %1" : "+r"(n) : "r"(d));
- *r = n >> 64;
- return n;
#elif defined(_ARCH_PPC64) && defined(_ARCH_PWR7)
/* From Power ISA 2.06, programming note for divdeu. */
uint64_t q1, q2, Q, r1, r2, R;
@@ -663,6 +657,11 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
}
*r = R;
return Q;
+#elif defined(CONFIG_INT128)
+ unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
+ unsigned __int128 q = n / d;
+ *r = q >> 64;
+ return q;
#else
uint64_t d0, d1, q0, q1, r1, r0, m;
--
Alex Bennée
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 18:58 ` Alex Bennée
@ 2019-01-14 21:36 ` Richard Henderson
2019-01-14 22:48 ` Alex Bennée
0 siblings, 1 reply; 23+ messages in thread
From: Richard Henderson @ 2019-01-14 21:36 UTC (permalink / raw)
To: Alex Bennée, Thomas Huth
Cc: Philippe Mathieu-Daudé,
Aurelien Jarno, Peter Maydell, Cornelia Huck, qemu-devel,
qemu-s390x
On 1/15/19 5:58 AM, Alex Bennée wrote:
>
> Thomas Huth <thuth@redhat.com> writes:
>
>> On 2019-01-14 17:37, Alex Bennée wrote:
>>>
>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>
>>>> On 1/14/19 1:12 PM, Thomas Huth wrote:
>>>>> Clang v7.0.1 does not like the __int128 variable type for inline
>>>>> assembly on s390x:
>>>>>
>>>>> In file included from fpu/softfloat.c:97:
>>>>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>>>> This value type register class is not natively supported!
>>>>> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>>>> ^
>>>>>
>>>>> Disable this code part there now when compiling with Clang, so that
>>>>> the generic code gets used instead.
>>>>>
>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>>> ---
>>>>> include/fpu/softfloat-macros.h | 2 +-
>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>>>> index b1d772e..bd5b641 100644
>>>>> --- a/include/fpu/softfloat-macros.h
>>>>> +++ b/include/fpu/softfloat-macros.h
>>>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>>>> uint64_t q;
>>>>> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>>>> return q;
>>>>> -#elif defined(__s390x__)
>>>>> +#elif defined(__s390x__) && !defined(__clang__)
>>>>
>>>> Can we rather check if __int128 is natively supported? So this part get
>>>> compiled once Clang do support it, else we'll never use it...
>>>
>>> We already define CONFIG_INT128 so you could just use that.
>>>
>>> Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?
>>
>> Yes, CONFIG_INT128=y is also set with Clang on s390x. It's really just
>> that it does not like __int128 to be passed as parameters for inline
>> assembly...
>
> What about something like this:
>
> modified include/fpu/softfloat-macros.h
> @@ -641,12 +641,6 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
> uint64_t q;
> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
> return q;
> -#elif defined(__s390x__)
> - /* Need to use a TImode type to get an even register pair for DLGR. */
> - unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> - asm("dlgr %0, %1" : "+r"(n) : "r"(d));
> - *r = n >> 64;
> - return n;
> #elif defined(_ARCH_PPC64) && defined(_ARCH_PWR7)
> /* From Power ISA 2.06, programming note for divdeu. */
> uint64_t q1, q2, Q, r1, r2, R;
> @@ -663,6 +657,11 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
> }
> *r = R;
> return Q;
> +#elif defined(CONFIG_INT128)
> + unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> + unsigned __int128 q = n / d;
> + *r = q >> 64;
> + return q;
Because that is not what the assembly does, for one.
But perhaps
unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
*r = n % d;
return n / d;
will allow the compiler to do what the assembly does for some 64-bit hosts.
r~
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
2019-01-14 12:16 ` Philippe Mathieu-Daudé
@ 2019-01-14 21:40 ` Richard Henderson
2019-01-16 16:50 ` Cornelia Huck
2019-01-17 8:30 ` Cornelia Huck
3 siblings, 0 replies; 23+ messages in thread
From: Richard Henderson @ 2019-01-14 21:40 UTC (permalink / raw)
To: Thomas Huth, Richard Henderson, Aurelien Jarno, Peter Maydell,
Alex Bennée, Cornelia Huck, qemu-devel
Cc: qemu-s390x
On 1/14/19 11:12 PM, Thomas Huth wrote:
> Clang v7.0.1 does not like the __int128 variable type for inline
> assembly on s390x:
>
> In file included from fpu/softfloat.c:97:
> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
> This value type register class is not natively supported!
> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
> ^
>
> Disable this code part there now when compiling with Clang, so that
> the generic code gets used instead.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
> include/fpu/softfloat-macros.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Regardless of whatever follow-up we might do wrt CONFIG_INT128.
r~
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 21:36 ` Richard Henderson
@ 2019-01-14 22:48 ` Alex Bennée
2019-01-15 10:14 ` Peter Maydell
0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-14 22:48 UTC (permalink / raw)
To: Richard Henderson
Cc: Thomas Huth, Philippe Mathieu-Daudé,
Aurelien Jarno, Peter Maydell, Cornelia Huck, qemu-devel,
qemu-s390x
Richard Henderson <rth@twiddle.net> writes:
> On 1/15/19 5:58 AM, Alex Bennée wrote:
>>
>> Thomas Huth <thuth@redhat.com> writes:
>>
>>> On 2019-01-14 17:37, Alex Bennée wrote:
>>>>
>>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>>
>>>>> On 1/14/19 1:12 PM, Thomas Huth wrote:
>>>>>> Clang v7.0.1 does not like the __int128 variable type for inline
>>>>>> assembly on s390x:
>>>>>>
>>>>>> In file included from fpu/softfloat.c:97:
>>>>>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>>>>> This value type register class is not natively supported!
>>>>>> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>>>>> ^
>>>>>>
>>>>>> Disable this code part there now when compiling with Clang, so that
>>>>>> the generic code gets used instead.
>>>>>>
>>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>>>> ---
>>>>>> include/fpu/softfloat-macros.h | 2 +-
>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>>>>> index b1d772e..bd5b641 100644
>>>>>> --- a/include/fpu/softfloat-macros.h
>>>>>> +++ b/include/fpu/softfloat-macros.h
>>>>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>>>>> uint64_t q;
>>>>>> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>>>>> return q;
>>>>>> -#elif defined(__s390x__)
>>>>>> +#elif defined(__s390x__) && !defined(__clang__)
>>>>>
>>>>> Can we rather check if __int128 is natively supported? So this part get
>>>>> compiled once Clang do support it, else we'll never use it...
>>>>
>>>> We already define CONFIG_INT128 so you could just use that.
>>>>
>>>> Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?
>>>
>>> Yes, CONFIG_INT128=y is also set with Clang on s390x. It's really just
>>> that it does not like __int128 to be passed as parameters for inline
>>> assembly...
>>
>> What about something like this:
>>
>> modified include/fpu/softfloat-macros.h
>> @@ -641,12 +641,6 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>> uint64_t q;
>> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>> return q;
>> -#elif defined(__s390x__)
>> - /* Need to use a TImode type to get an even register pair for DLGR. */
>> - unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>> - asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>> - *r = n >> 64;
>> - return n;
>> #elif defined(_ARCH_PPC64) && defined(_ARCH_PWR7)
>> /* From Power ISA 2.06, programming note for divdeu. */
>> uint64_t q1, q2, Q, r1, r2, R;
>> @@ -663,6 +657,11 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>> }
>> *r = R;
>> return Q;
>> +#elif defined(CONFIG_INT128)
>> + unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>> + unsigned __int128 q = n / d;
>> + *r = q >> 64;
>> + return q;
>
> Because that is not what the assembly does, for one.
Doh...
> But perhaps
>
> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> *r = n % d;
> return n / d;
>
> will allow the compiler to do what the assembly does for some 64-bit
> hosts.
I wonder how much cost is incurred by the jumping to the (libgcc?) div
helper? Anyone got an s390x about so we can benchmark the two
approaches?
If it's in the noise then it would be nice to avoid getting too #ifdef
happy.
--
Alex Bennée
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 22:48 ` Alex Bennée
@ 2019-01-15 10:14 ` Peter Maydell
2019-01-15 14:46 ` Alex Bennée
0 siblings, 1 reply; 23+ messages in thread
From: Peter Maydell @ 2019-01-15 10:14 UTC (permalink / raw)
To: Alex Bennée
Cc: Richard Henderson, Thomas Huth, Philippe Mathieu-Daudé,
Aurelien Jarno, Cornelia Huck, QEMU Developers, qemu-s390x
On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Richard Henderson <rth@twiddle.net> writes:
> > But perhaps
> >
> > unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> > *r = n % d;
> > return n / d;
> >
> > will allow the compiler to do what the assembly does for some 64-bit
> > hosts.
>
> I wonder how much cost is incurred by the jumping to the (libgcc?) div
> helper? Anyone got an s390x about so we can benchmark the two
> approaches?
The project has an s390x system available; however it's usually
running merge build tests so not so useful for benchmarking.
(I can set up accounts on it but that requires me to faff about
figuring out how to create new accounts :-))
thanks
-- PMM
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-15 10:14 ` Peter Maydell
@ 2019-01-15 14:46 ` Alex Bennée
2019-01-15 15:29 ` Thomas Huth
0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-15 14:46 UTC (permalink / raw)
To: Peter Maydell
Cc: Richard Henderson, Thomas Huth, Philippe Mathieu-Daudé,
Aurelien Jarno, Cornelia Huck, QEMU Developers, qemu-s390x
Peter Maydell <peter.maydell@linaro.org> writes:
> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Richard Henderson <rth@twiddle.net> writes:
>> > But perhaps
>> >
>> > unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>> > *r = n % d;
>> > return n / d;
>> >
>> > will allow the compiler to do what the assembly does for some 64-bit
>> > hosts.
>>
>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>> helper? Anyone got an s390x about so we can benchmark the two
>> approaches?
>
> The project has an s390x system available; however it's usually
> running merge build tests so not so useful for benchmarking.
> (I can set up accounts on it but that requires me to faff about
> figuring out how to create new accounts :-))
I'm happy to leave this up to those who care about s390x host
performance (Thomas, Cornelia?). I'm just keen to avoid the divide
helper getting too #ifdefy.
I'll include a CONFIG_INT128 patch in my next patch queue review once
I've double checked and tested under linux-user ;-)
--
Alex Bennée
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-15 14:46 ` Alex Bennée
@ 2019-01-15 15:29 ` Thomas Huth
2019-01-15 16:01 ` Alex Bennée
2019-01-15 22:05 ` Richard Henderson
0 siblings, 2 replies; 23+ messages in thread
From: Thomas Huth @ 2019-01-15 15:29 UTC (permalink / raw)
To: Alex Bennée, Peter Maydell
Cc: Richard Henderson, Philippe Mathieu-Daudé,
Aurelien Jarno, Cornelia Huck, QEMU Developers, qemu-s390x
On 2019-01-15 15:46, Alex Bennée wrote:
>
> Peter Maydell <peter.maydell@linaro.org> writes:
>
>> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>>>
>>>
>>> Richard Henderson <rth@twiddle.net> writes:
>>>> But perhaps
>>>>
>>>> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>>> *r = n % d;
>>>> return n / d;
>>>>
>>>> will allow the compiler to do what the assembly does for some 64-bit
>>>> hosts.
>>>
>>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>>> helper? Anyone got an s390x about so we can benchmark the two
>>> approaches?
>>
>> The project has an s390x system available; however it's usually
>> running merge build tests so not so useful for benchmarking.
>> (I can set up accounts on it but that requires me to faff about
>> figuring out how to create new accounts :-))
>
> I'm happy to leave this up to those who care about s390x host
> performance (Thomas, Cornelia?). I'm just keen to avoid the divide
> helper getting too #ifdefy.
Ok, I just did a quick'n'dirty "benchmark" on the s390x that I've got available:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
uint64_t udiv_qrnnd1(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
{
unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
asm("dlgr %0, %1" : "+r"(n) : "r"(d));
*r = n >> 64;
return n;
}
uint64_t udiv_qrnnd2(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
{
unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
*r = n % d;
return n / d;
}
uint64_t udiv_qrnnd3(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
{
uint64_t d0, d1, q0, q1, r1, r0, m;
d0 = (uint32_t)d;
d1 = d >> 32;
r1 = n1 % d1;
q1 = n1 / d1;
m = q1 * d0;
r1 = (r1 << 32) | (n0 >> 32);
if (r1 < m) {
q1 -= 1;
r1 += d;
if (r1 >= d) {
if (r1 < m) {
q1 -= 1;
r1 += d;
}
}
}
r1 -= m;
r0 = r1 % d1;
q0 = r1 / d1;
m = q0 * d0;
r0 = (r0 << 32) | (uint32_t)n0;
if (r0 < m) {
q0 -= 1;
r0 += d;
if (r0 >= d) {
if (r0 < m) {
q0 -= 1;
r0 += d;
}
}
}
r0 -= m;
*r = r0;
return (q1 << 32) | q0;
}
int main()
{
uint64_t r = 0, n1 = 0, n0 = 0, d = 0;
uint64_t rs = 0, rn = 0;
clock_t start, end;
long i;
start = clock();
for (i=0; i<200000000L; i++) {
n1 += 3;
n0 += 987654321;
d += 0x123456789;
rs += udiv_qrnnd1(&r, n1, n0, d);
rn += r;
}
end = clock();
printf("test 1: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
r = n1 = n0 = d = rs = rn = 0;
start = clock();
for (i=0; i<200000000L; i++) {
n1 += 3;
n0 += 987654321;
d += 0x123456789;
rs += udiv_qrnnd2(&r, n1, n0, d);
rn += r;
}
end = clock();
printf("test 2: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
r = n1 = n0 = d = rs = rn = 0;
start = clock();
for (i=0; i<200000000L; i++) {
n1 += 3;
n0 += 987654321;
d += 0x123456789;
rs += udiv_qrnnd3(&r, n1, n0, d);
rn += r;
}
end = clock();
printf("test 3: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
return 0;
}
... and results with GCC v8.2.1 are (using -O2):
test 1: time=609 , rs=2264924160200000000 , rn = 6136218997527160832
test 2: time=10127 , rs=2264924160200000000 , rn = 6136218997527160832
test 3: time=2350 , rs=2264924183048928865 , rn = 4842822048162311089
Thus the int128 version is the slowest!
... but at least it gives the same results as the DLGR instruction. The 64-bit
version gives different results - do we have a bug here?
Results with Clang v7.0.1 (using -O2, too) are these:
test 2: time=5035 , rs=2264924160200000000 , rn = 6136218997527160832
test 3: time=1970 , rs=2264924183048928865 , rn = 4842822048162311089
Thomas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-15 15:29 ` Thomas Huth
@ 2019-01-15 16:01 ` Alex Bennée
2019-01-15 20:05 ` Emilio G. Cota
2019-01-15 22:05 ` Richard Henderson
1 sibling, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-15 16:01 UTC (permalink / raw)
To: Thomas Huth
Cc: Peter Maydell, Richard Henderson, Philippe Mathieu-Daudé,
Aurelien Jarno, Cornelia Huck, QEMU Developers, qemu-s390x
Thomas Huth <thuth@redhat.com> writes:
> On 2019-01-15 15:46, Alex Bennée wrote:
>>
>> Peter Maydell <peter.maydell@linaro.org> writes:
>>
>>> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>>>>
>>>>
>>>> Richard Henderson <rth@twiddle.net> writes:
>>>>> But perhaps
>>>>>
>>>>> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>>>> *r = n % d;
>>>>> return n / d;
>>>>>
>>>>> will allow the compiler to do what the assembly does for some 64-bit
>>>>> hosts.
>>>>
>>>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>>>> helper? Anyone got an s390x about so we can benchmark the two
>>>> approaches?
>>>
>>> The project has an s390x system available; however it's usually
>>> running merge build tests so not so useful for benchmarking.
>>> (I can set up accounts on it but that requires me to faff about
>>> figuring out how to create new accounts :-))
>>
>> I'm happy to leave this up to those who care about s390x host
>> performance (Thomas, Cornelia?). I'm just keen to avoid the divide
>> helper getting too #ifdefy.
>
> Ok, I just did a quick'n'dirty "benchmark" on the s390x that I've got
> available:
Ahh I should have mentioned we already have the technology for this ;-)
If you build the fpu/next tree on a s390x you can then run:
./tests/fp/fp-bench f64_div
with and without the CONFIG_128 path. To get an idea of the real world
impact you can compile a foreign binary and run it on a s390x system
with:
$QEMU ./tests/fp/fp-bench f64_div -t host
And that will give you the peak performance assuming your program is
doing nothing but f64_div operations. If the two QEMU's are basically in
the same ballpark then it doesn't make enough difference. That said:
> #include <stdio.h>
> #include <time.h>
> #include <stdint.h>
>
> uint64_t udiv_qrnnd1(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
> *r = n >> 64;
> return n;
> }
>
> uint64_t udiv_qrnnd2(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> *r = n % d;
> return n / d;
> }
>
<snip>
>
> int main()
> {
> uint64_t r = 0, n1 = 0, n0 = 0, d = 0;
> uint64_t rs = 0, rn = 0;
> clock_t start, end;
> long i;
>
> start = clock();
> for (i=0; i<200000000L; i++) {
> n1 += 3;
> n0 += 987654321;
> d += 0x123456789;
> rs += udiv_qrnnd1(&r, n1, n0, d);
> rn += r;
> }
> end = clock();
> printf("test 1: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> r = n1 = n0 = d = rs = rn = 0;
>
> start = clock();
> for (i=0; i<200000000L; i++) {
> n1 += 3;
> n0 += 987654321;
> d += 0x123456789;
> rs += udiv_qrnnd2(&r, n1, n0, d);
> rn += r;
> }
> end = clock();
> printf("test 2: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> r = n1 = n0 = d = rs = rn = 0;
>
> start = clock();
> for (i=0; i<200000000L; i++) {
> n1 += 3;
> n0 += 987654321;
> d += 0x123456789;
> rs += udiv_qrnnd3(&r, n1, n0, d);
> rn += r;
> }
> end = clock();
> printf("test 3: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> return 0;
> }
>
> ... and results with GCC v8.2.1 are (using -O2):
>
> test 1: time=609 , rs=2264924160200000000 , rn = 6136218997527160832
> test 2: time=10127 , rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=2350 , rs=2264924183048928865 , rn = 4842822048162311089
>
> Thus the int128 version is the slowest!
I'd expect a little slow down due to the indirection into libgcc.. but
that seems pretty high.
>
> ... but at least it gives the same results as the DLGR instruction. The 64-bit
> version gives different results - do we have a bug here?
>
> Results with Clang v7.0.1 (using -O2, too) are these:
>
> test 2: time=5035 , rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=1970 , rs=2264924183048928865 , rn =
> 4842822048162311089
You can run:
./tests/fp/fp-test f64_div -l 2 -r all
For a proper comprehensive test.
--
Alex Bennée
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-15 16:01 ` Alex Bennée
@ 2019-01-15 20:05 ` Emilio G. Cota
2019-01-16 6:33 ` Thomas Huth
0 siblings, 1 reply; 23+ messages in thread
From: Emilio G. Cota @ 2019-01-15 20:05 UTC (permalink / raw)
To: Alex Bennée
Cc: Thomas Huth, Peter Maydell, Cornelia Huck, QEMU Developers,
qemu-s390x, Philippe Mathieu-Daudé,
Aurelien Jarno, Richard Henderson
On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
> Ahh I should have mentioned we already have the technology for this ;-)
>
> If you build the fpu/next tree on a s390x you can then run:
>
> ./tests/fp/fp-bench f64_div
>
> with and without the CONFIG_128 path. To get an idea of the real world
> impact you can compile a foreign binary and run it on a s390x system
> with:
>
> $QEMU ./tests/fp/fp-bench f64_div -t host
>
> And that will give you the peak performance assuming your program is
> doing nothing but f64_div operations. If the two QEMU's are basically in
> the same ballpark then it doesn't make enough difference. That said:
I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
you'll get the default op (-o add).
Thanks,
E.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-15 15:29 ` Thomas Huth
2019-01-15 16:01 ` Alex Bennée
@ 2019-01-15 22:05 ` Richard Henderson
1 sibling, 0 replies; 23+ messages in thread
From: Richard Henderson @ 2019-01-15 22:05 UTC (permalink / raw)
To: Thomas Huth, Alex Bennée, Peter Maydell
Cc: Cornelia Huck, QEMU Developers, qemu-s390x,
Philippe Mathieu-Daudé,
Aurelien Jarno, Richard Henderson
On 1/16/19 2:29 AM, Thomas Huth wrote:
> ... but at least it gives the same results as the DLGR instruction. The 64-bit
> version gives different results - do we have a bug here?
Yes, on your inputs. udiv_qrnnd3 requires that D be "normalized", i.e. have
the most siginificant bit set. (And thus N1:N0 shifted left to match, and the
returned remainder shifted right to match.)
r~
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-15 20:05 ` Emilio G. Cota
@ 2019-01-16 6:33 ` Thomas Huth
2019-01-16 17:08 ` Alex Bennée
2019-01-16 18:21 ` Emilio G. Cota
0 siblings, 2 replies; 23+ messages in thread
From: Thomas Huth @ 2019-01-16 6:33 UTC (permalink / raw)
To: Emilio G. Cota, Alex Bennée
Cc: Peter Maydell, Cornelia Huck, QEMU Developers, qemu-s390x,
Philippe Mathieu-Daudé,
Aurelien Jarno, Richard Henderson
On 2019-01-15 21:05, Emilio G. Cota wrote:
> On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
>> Ahh I should have mentioned we already have the technology for this ;-)
>>
>> If you build the fpu/next tree on a s390x you can then run:
>>
>> ./tests/fp/fp-bench f64_div
>>
>> with and without the CONFIG_128 path. To get an idea of the real world
>> impact you can compile a foreign binary and run it on a s390x system
>> with:
>>
>> $QEMU ./tests/fp/fp-bench f64_div -t host
>>
>> And that will give you the peak performance assuming your program is
>> doing nothing but f64_div operations. If the two QEMU's are basically in
>> the same ballpark then it doesn't make enough difference. That said:
>
> I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
> you'll get the default op (-o add).
I tried that now, too, and -o div -p double does not really seem to
exercise this function at all.
Here are my results (disclaimer: that system is likely not really usable
for benchmarks since it's CPUs are shared with other LPARs, but I ran
all the tests at least twice and got similar results):
With the DGLR inline assembly:
time ./fp-test f64_div -l 2 -r all
real 6m43,648s
user 6m43,362s
sys 0m0,160s
time ./fp-bench -o div -p double
204.98 MFlops
real 0m1,002s
user 0m1,001s
sys 0m0,001s
With the "#else" default 64-bit code:
time ./fp-test f64_div -l 2 -r all
real 6m44,910s
user 6m44,616s
sys 0m0,165s
time ./fp-bench -o div -p double
205.41 MFlops
real 0m1,002s
user 0m1,001s
sys 0m0,001s
With the new CONFIG_INT128 code:
time ./fp-test f64_div -l 2 -r all
real 6m58,371s
user 6m58,078s
sys 0m0,164s
time ./fp-bench -o div -p double
205.17 MFlops
real 0m1,002s
user 0m1,000s
sys 0m0,001s
==> The new CONFIG_INT128 code is really worse than the 64-bit code, so
I don't think we should include this yet (unless we know a system where
the compiler can create optimized assembly code without libgcc here).
Thomas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
2019-01-14 12:16 ` Philippe Mathieu-Daudé
2019-01-14 21:40 ` Richard Henderson
@ 2019-01-16 16:50 ` Cornelia Huck
2019-01-16 17:16 ` Alex Bennée
2019-01-17 8:30 ` Cornelia Huck
3 siblings, 1 reply; 23+ messages in thread
From: Cornelia Huck @ 2019-01-16 16:50 UTC (permalink / raw)
To: Thomas Huth
Cc: Richard Henderson, Aurelien Jarno, Peter Maydell,
Alex Bennée, qemu-devel, qemu-s390x
On Mon, 14 Jan 2019 13:12:35 +0100
Thomas Huth <thuth@redhat.com> wrote:
> Clang v7.0.1 does not like the __int128 variable type for inline
> assembly on s390x:
>
> In file included from fpu/softfloat.c:97:
> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
> This value type register class is not natively supported!
> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
> ^
>
> Disable this code part there now when compiling with Clang, so that
> the generic code gets used instead.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
> include/fpu/softfloat-macros.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
> index b1d772e..bd5b641 100644
> --- a/include/fpu/softfloat-macros.h
> +++ b/include/fpu/softfloat-macros.h
> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
> uint64_t q;
> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
> return q;
> -#elif defined(__s390x__)
> +#elif defined(__s390x__) && !defined(__clang__)
> /* Need to use a TImode type to get an even register pair for DLGR. */
> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
Ok, so what's the deal with this patch now? Fix compilation now,
optimize later?
If yes, should I pick it as an s390x build fix (I plan to send a pull
request later this week), or will the fpu maintainers pick it?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-16 6:33 ` Thomas Huth
@ 2019-01-16 17:08 ` Alex Bennée
2019-01-17 6:06 ` Thomas Huth
2019-01-16 18:21 ` Emilio G. Cota
1 sibling, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-16 17:08 UTC (permalink / raw)
To: Thomas Huth
Cc: Emilio G. Cota, Peter Maydell, Cornelia Huck, QEMU Developers,
qemu-s390x, Philippe Mathieu-Daudé,
Aurelien Jarno, Richard Henderson
Thomas Huth <thuth@redhat.com> writes:
> On 2019-01-15 21:05, Emilio G. Cota wrote:
>> On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
>>> Ahh I should have mentioned we already have the technology for this ;-)
>>>
>>> If you build the fpu/next tree on a s390x you can then run:
>>>
>>> ./tests/fp/fp-bench f64_div
>>>
>>> with and without the CONFIG_128 path. To get an idea of the real world
>>> impact you can compile a foreign binary and run it on a s390x system
>>> with:
>>>
>>> $QEMU ./tests/fp/fp-bench f64_div -t host
>>>
>>> And that will give you the peak performance assuming your program is
>>> doing nothing but f64_div operations. If the two QEMU's are basically in
>>> the same ballpark then it doesn't make enough difference. That said:
>>
>> I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
>> you'll get the default op (-o add).
>
> I tried that now, too, and -o div -p double does not really seem to
> exercise this function at all.
How do you mean? It should do because by default it should be calling
the softfloat implementations.
> Here are my results (disclaimer: that system is likely not really usable
> for benchmarks since it's CPUs are shared with other LPARs, but I ran
> all the tests at least twice and got similar results):
>
>
> With the DGLR inline assembly:
>
<snip>
> time ./fp-bench -o div -p double
> 204.98 MFlops
<snip>
> With the "#else" default 64-bit code:
>
<snip>
> time ./fp-bench -o div -p double
> 205.41 MFlops
<snip>
> With the new CONFIG_INT128 code:
>
<snip>
> time ./fp-bench -o div -p double
> 205.17 MFlops
<snip>
>
>
> ==> The new CONFIG_INT128 code is really worse than the 64-bit code, so
> I don't think we should include this yet (unless we know a system where
> the compiler can create optimized assembly code without libgcc here).
I mean to me that looks like it is easily in the noise range and that
the dglr instruction didn't actually beat the unrolled 64 bit code -
which is just weird.
--
Alex Bennée
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-16 16:50 ` Cornelia Huck
@ 2019-01-16 17:16 ` Alex Bennée
2019-01-17 5:57 ` Thomas Huth
0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-16 17:16 UTC (permalink / raw)
To: Cornelia Huck
Cc: Thomas Huth, Richard Henderson, Aurelien Jarno, Peter Maydell,
qemu-devel, qemu-s390x
Cornelia Huck <cohuck@redhat.com> writes:
> On Mon, 14 Jan 2019 13:12:35 +0100
> Thomas Huth <thuth@redhat.com> wrote:
>
<snip>
>>
>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>> index b1d772e..bd5b641 100644
>> --- a/include/fpu/softfloat-macros.h
>> +++ b/include/fpu/softfloat-macros.h
>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>> uint64_t q;
>> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>> return q;
>> -#elif defined(__s390x__)
>> +#elif defined(__s390x__) && !defined(__clang__)
>> /* Need to use a TImode type to get an even register pair for DLGR. */
>> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>
> Ok, so what's the deal with this patch now? Fix compilation now,
> optimize later?
>
> If yes, should I pick it as an s390x build fix (I plan to send a pull
> request later this week), or will the fpu maintainers pick it?
I'm planning to send a FPU PR tomorrow and I'll happily include either
version.
I'm personally minded to go with the patch that makes s390 (and others)
fall back to the generic CONFIG_INT128 code. The numbers Thomas gathered
didn't look like it was much difference either way.
Unless you *really* care about milking that last bit of performance out of
the s390 TCG back-end?
--
Alex Bennée
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-16 6:33 ` Thomas Huth
2019-01-16 17:08 ` Alex Bennée
@ 2019-01-16 18:21 ` Emilio G. Cota
1 sibling, 0 replies; 23+ messages in thread
From: Emilio G. Cota @ 2019-01-16 18:21 UTC (permalink / raw)
To: Thomas Huth
Cc: Alex Bennée, Peter Maydell, Cornelia Huck, QEMU Developers,
qemu-s390x, Philippe Mathieu-Daudé,
Aurelien Jarno, Richard Henderson
On Wed, Jan 16, 2019 at 07:33:28 +0100, Thomas Huth wrote:
> On 2019-01-15 21:05, Emilio G. Cota wrote:
> > On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
> >> Ahh I should have mentioned we already have the technology for this ;-)
> >>
> >> If you build the fpu/next tree on a s390x you can then run:
> >>
> >> ./tests/fp/fp-bench f64_div
> >>
> >> with and without the CONFIG_128 path. To get an idea of the real world
> >> impact you can compile a foreign binary and run it on a s390x system
> >> with:
> >>
> >> $QEMU ./tests/fp/fp-bench f64_div -t host
> >>
> >> And that will give you the peak performance assuming your program is
> >> doing nothing but f64_div operations. If the two QEMU's are basically in
> >> the same ballpark then it doesn't make enough difference. That said:
> >
> > I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
> > you'll get the default op (-o add).
>
> I tried that now, too, and -o div -p double does not really seem to
> exercise this function at all.
You can check what is being called then with perf record/report.
E.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-16 17:16 ` Alex Bennée
@ 2019-01-17 5:57 ` Thomas Huth
0 siblings, 0 replies; 23+ messages in thread
From: Thomas Huth @ 2019-01-17 5:57 UTC (permalink / raw)
To: Alex Bennée, Cornelia Huck
Cc: Richard Henderson, Aurelien Jarno, Peter Maydell, qemu-devel, qemu-s390x
On 2019-01-16 18:16, Alex Bennée wrote:
>
> Cornelia Huck <cohuck@redhat.com> writes:
>
>> On Mon, 14 Jan 2019 13:12:35 +0100
>> Thomas Huth <thuth@redhat.com> wrote:
>>
> <snip>
>>>
>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>> index b1d772e..bd5b641 100644
>>> --- a/include/fpu/softfloat-macros.h
>>> +++ b/include/fpu/softfloat-macros.h
>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>> uint64_t q;
>>> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>> return q;
>>> -#elif defined(__s390x__)
>>> +#elif defined(__s390x__) && !defined(__clang__)
>>> /* Need to use a TImode type to get an even register pair for DLGR. */
>>> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>
>> Ok, so what's the deal with this patch now? Fix compilation now,
>> optimize later?
>>
>> If yes, should I pick it as an s390x build fix (I plan to send a pull
>> request later this week), or will the fpu maintainers pick it?
>
> I'm planning to send a FPU PR tomorrow and I'll happily include either
> version.
>
> I'm personally minded to go with the patch that makes s390 (and others)
> fall back to the generic CONFIG_INT128 code. The numbers Thomas gathered
> didn't look like it was much difference either way.
>
> Unless you *really* care about milking that last bit of performance out of
> the s390 TCG back-end?
I don't think that anybody buys a mainframe for this special case ;-)
I'd go with the !defined(__clang__) patch above.
Thomas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-16 17:08 ` Alex Bennée
@ 2019-01-17 6:06 ` Thomas Huth
2019-01-17 7:42 ` Alex Bennée
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Huth @ 2019-01-17 6:06 UTC (permalink / raw)
To: Alex Bennée
Cc: Emilio G. Cota, Peter Maydell, Cornelia Huck, QEMU Developers,
qemu-s390x, Philippe Mathieu-Daudé,
Aurelien Jarno, Richard Henderson
On 2019-01-16 18:08, Alex Bennée wrote:
>
> Thomas Huth <thuth@redhat.com> writes:
>
>> On 2019-01-15 21:05, Emilio G. Cota wrote:
>>> On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
>>>> Ahh I should have mentioned we already have the technology for this ;-)
>>>>
>>>> If you build the fpu/next tree on a s390x you can then run:
>>>>
>>>> ./tests/fp/fp-bench f64_div
>>>>
>>>> with and without the CONFIG_128 path. To get an idea of the real world
>>>> impact you can compile a foreign binary and run it on a s390x system
>>>> with:
>>>>
>>>> $QEMU ./tests/fp/fp-bench f64_div -t host
>>>>
>>>> And that will give you the peak performance assuming your program is
>>>> doing nothing but f64_div operations. If the two QEMU's are basically in
>>>> the same ballpark then it doesn't make enough difference. That said:
>>>
>>> I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
>>> you'll get the default op (-o add).
>>
>> I tried that now, too, and -o div -p double does not really seem to
>> exercise this function at all.
>
> How do you mean? It should do because by default it should be calling
> the softfloat implementations.
I've added a puts("hello") into the udiv_qrnd() function. When I then
run "fp-bench -o div -p double", it only prints out "hello" a single
time, so the function is only called once during the whole test.
Thomas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-17 6:06 ` Thomas Huth
@ 2019-01-17 7:42 ` Alex Bennée
0 siblings, 0 replies; 23+ messages in thread
From: Alex Bennée @ 2019-01-17 7:42 UTC (permalink / raw)
To: Thomas Huth
Cc: Emilio G. Cota, Peter Maydell, Cornelia Huck, QEMU Developers,
qemu-s390x, Philippe Mathieu-Daudé,
Aurelien Jarno, Richard Henderson
Thomas Huth <thuth@redhat.com> writes:
> On 2019-01-16 18:08, Alex Bennée wrote:
>>
>> Thomas Huth <thuth@redhat.com> writes:
>>
>>> On 2019-01-15 21:05, Emilio G. Cota wrote:
>>>> On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
>>>>> Ahh I should have mentioned we already have the technology for this ;-)
>>>>>
>>>>> If you build the fpu/next tree on a s390x you can then run:
>>>>>
>>>>> ./tests/fp/fp-bench f64_div
>>>>>
>>>>> with and without the CONFIG_128 path. To get an idea of the real world
>>>>> impact you can compile a foreign binary and run it on a s390x system
>>>>> with:
>>>>>
>>>>> $QEMU ./tests/fp/fp-bench f64_div -t host
>>>>>
>>>>> And that will give you the peak performance assuming your program is
>>>>> doing nothing but f64_div operations. If the two QEMU's are basically in
>>>>> the same ballpark then it doesn't make enough difference. That said:
>>>>
>>>> I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
>>>> you'll get the default op (-o add).
>>>
>>> I tried that now, too, and -o div -p double does not really seem to
>>> exercise this function at all.
>>
>> How do you mean? It should do because by default it should be calling
>> the softfloat implementations.
>
> I've added a puts("hello") into the udiv_qrnd() function. When I then
> run "fp-bench -o div -p double", it only prints out "hello" a single
> time, so the function is only called once during the whole test.
That's very odd. With the following on my aarch64 box:
modified include/fpu/softfloat-macros.h
@@ -637,6 +637,8 @@ static inline uint64_t estimateDiv128To64(uint64_t a0, uint64_t a1, uint64_t b)
static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
uint64_t n0, uint64_t d)
{
+ static int iter = 0;
+ fprintf(stderr, "%s: %d\n", __func__, iter++);
#if defined(__x86_64__)
uint64_t q;
asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
I get:
udiv_qrnnd: 0
udiv_qrnnd: 1
..
..<all the way to>
udiv_qrnnd: 99998
udiv_qrnnd: 99999
So I'm not sure what is different on s390
--
Alex Bennée
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
` (2 preceding siblings ...)
2019-01-16 16:50 ` Cornelia Huck
@ 2019-01-17 8:30 ` Cornelia Huck
3 siblings, 0 replies; 23+ messages in thread
From: Cornelia Huck @ 2019-01-17 8:30 UTC (permalink / raw)
To: Thomas Huth
Cc: Richard Henderson, Aurelien Jarno, Peter Maydell,
Alex Bennée, qemu-devel, qemu-s390x
On Mon, 14 Jan 2019 13:12:35 +0100
Thomas Huth <thuth@redhat.com> wrote:
> Clang v7.0.1 does not like the __int128 variable type for inline
> assembly on s390x:
>
> In file included from fpu/softfloat.c:97:
> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
> This value type register class is not natively supported!
> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
> ^
>
> Disable this code part there now when compiling with Clang, so that
> the generic code gets used instead.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
> include/fpu/softfloat-macros.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
> index b1d772e..bd5b641 100644
> --- a/include/fpu/softfloat-macros.h
> +++ b/include/fpu/softfloat-macros.h
> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
> uint64_t q;
> asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
> return q;
> -#elif defined(__s390x__)
> +#elif defined(__s390x__) && !defined(__clang__)
> /* Need to use a TImode type to get an even register pair for DLGR. */
> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> asm("dlgr %0, %1" : "+r"(n) : "r"(d));
Acked-by: Cornelia Huck <cohuck@redhat.com>
...as this fixed that s390x clang problem for me.
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2019-01-17 8:30 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
2019-01-14 12:16 ` Philippe Mathieu-Daudé
2019-01-14 16:37 ` Alex Bennée
2019-01-14 17:03 ` Thomas Huth
2019-01-14 18:58 ` Alex Bennée
2019-01-14 21:36 ` Richard Henderson
2019-01-14 22:48 ` Alex Bennée
2019-01-15 10:14 ` Peter Maydell
2019-01-15 14:46 ` Alex Bennée
2019-01-15 15:29 ` Thomas Huth
2019-01-15 16:01 ` Alex Bennée
2019-01-15 20:05 ` Emilio G. Cota
2019-01-16 6:33 ` Thomas Huth
2019-01-16 17:08 ` Alex Bennée
2019-01-17 6:06 ` Thomas Huth
2019-01-17 7:42 ` Alex Bennée
2019-01-16 18:21 ` Emilio G. Cota
2019-01-15 22:05 ` Richard Henderson
2019-01-14 21:40 ` Richard Henderson
2019-01-16 16:50 ` Cornelia Huck
2019-01-16 17:16 ` Alex Bennée
2019-01-17 5:57 ` Thomas Huth
2019-01-17 8:30 ` Cornelia Huck
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.