All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
@ 2019-01-14 12:12 Thomas Huth
  2019-01-14 12:16 ` Philippe Mathieu-Daudé
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Thomas Huth @ 2019-01-14 12:12 UTC (permalink / raw)
  To: Richard Henderson, Aurelien Jarno, Peter Maydell,
	Alex Bennée, Cornelia Huck, qemu-devel
  Cc: qemu-s390x

Clang v7.0.1 does not like the __int128 variable type for inline
assembly on s390x:

In file included from fpu/softfloat.c:97:
include/fpu/softfloat-macros.h:647:9: error: inline asm error:
 This value type register class is not natively supported!
    asm("dlgr %0, %1" : "+r"(n) : "r"(d));
        ^

Disable this code part there now when compiling with Clang, so that
the generic code gets used instead.

Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 include/fpu/softfloat-macros.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
index b1d772e..bd5b641 100644
--- a/include/fpu/softfloat-macros.h
+++ b/include/fpu/softfloat-macros.h
@@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
     uint64_t q;
     asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
     return q;
-#elif defined(__s390x__)
+#elif defined(__s390x__) && !defined(__clang__)
     /* Need to use a TImode type to get an even register pair for DLGR.  */
     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
@ 2019-01-14 12:16 ` Philippe Mathieu-Daudé
  2019-01-14 16:37   ` Alex Bennée
  2019-01-14 21:40 ` Richard Henderson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-01-14 12:16 UTC (permalink / raw)
  To: Thomas Huth, Richard Henderson, Aurelien Jarno, Peter Maydell,
	Alex Bennée, Cornelia Huck, qemu-devel
  Cc: qemu-s390x

On 1/14/19 1:12 PM, Thomas Huth wrote:
> Clang v7.0.1 does not like the __int128 variable type for inline
> assembly on s390x:
> 
> In file included from fpu/softfloat.c:97:
> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>  This value type register class is not natively supported!
>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>         ^
> 
> Disable this code part there now when compiling with Clang, so that
> the generic code gets used instead.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  include/fpu/softfloat-macros.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
> index b1d772e..bd5b641 100644
> --- a/include/fpu/softfloat-macros.h
> +++ b/include/fpu/softfloat-macros.h
> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>      uint64_t q;
>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>      return q;
> -#elif defined(__s390x__)
> +#elif defined(__s390x__) && !defined(__clang__)

Can we rather check if __int128 is natively supported? So this part get
compiled once Clang do support it, else we'll never use it...

>      /* Need to use a TImode type to get an even register pair for DLGR.  */
>      unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>      asm("dlgr %0, %1" : "+r"(n) : "r"(d));
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 12:16 ` Philippe Mathieu-Daudé
@ 2019-01-14 16:37   ` Alex Bennée
  2019-01-14 17:03     ` Thomas Huth
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-14 16:37 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Thomas Huth, Richard Henderson, Aurelien Jarno, Peter Maydell,
	Cornelia Huck, qemu-devel, qemu-s390x


Philippe Mathieu-Daudé <philmd@redhat.com> writes:

> On 1/14/19 1:12 PM, Thomas Huth wrote:
>> Clang v7.0.1 does not like the __int128 variable type for inline
>> assembly on s390x:
>>
>> In file included from fpu/softfloat.c:97:
>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>  This value type register class is not natively supported!
>>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>         ^
>>
>> Disable this code part there now when compiling with Clang, so that
>> the generic code gets used instead.
>>
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>>  include/fpu/softfloat-macros.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>> index b1d772e..bd5b641 100644
>> --- a/include/fpu/softfloat-macros.h
>> +++ b/include/fpu/softfloat-macros.h
>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>      uint64_t q;
>>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>      return q;
>> -#elif defined(__s390x__)
>> +#elif defined(__s390x__) && !defined(__clang__)
>
> Can we rather check if __int128 is natively supported? So this part get
> compiled once Clang do support it, else we'll never use it...

We already define CONFIG_INT128 so you could just use that.

Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?

>
>>      /* Need to use a TImode type to get an even register pair for DLGR.  */
>>      unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>      asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>


--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 16:37   ` Alex Bennée
@ 2019-01-14 17:03     ` Thomas Huth
  2019-01-14 18:58       ` Alex Bennée
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Huth @ 2019-01-14 17:03 UTC (permalink / raw)
  To: Alex Bennée, Philippe Mathieu-Daudé
  Cc: Richard Henderson, Aurelien Jarno, Peter Maydell, Cornelia Huck,
	qemu-devel, qemu-s390x

On 2019-01-14 17:37, Alex Bennée wrote:
> 
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
> 
>> On 1/14/19 1:12 PM, Thomas Huth wrote:
>>> Clang v7.0.1 does not like the __int128 variable type for inline
>>> assembly on s390x:
>>>
>>> In file included from fpu/softfloat.c:97:
>>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>>  This value type register class is not natively supported!
>>>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>>         ^
>>>
>>> Disable this code part there now when compiling with Clang, so that
>>> the generic code gets used instead.
>>>
>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>> ---
>>>  include/fpu/softfloat-macros.h | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>> index b1d772e..bd5b641 100644
>>> --- a/include/fpu/softfloat-macros.h
>>> +++ b/include/fpu/softfloat-macros.h
>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>>      uint64_t q;
>>>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>>      return q;
>>> -#elif defined(__s390x__)
>>> +#elif defined(__s390x__) && !defined(__clang__)
>>
>> Can we rather check if __int128 is natively supported? So this part get
>> compiled once Clang do support it, else we'll never use it...
> 
> We already define CONFIG_INT128 so you could just use that.
> 
> Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?

Yes, CONFIG_INT128=y is also set with Clang on s390x. It's really just
that it does not like __int128  to be passed as parameters for inline
assembly...

 Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 17:03     ` Thomas Huth
@ 2019-01-14 18:58       ` Alex Bennée
  2019-01-14 21:36         ` Richard Henderson
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-14 18:58 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Philippe Mathieu-Daudé,
	Richard Henderson, Aurelien Jarno, Peter Maydell, Cornelia Huck,
	qemu-devel, qemu-s390x


Thomas Huth <thuth@redhat.com> writes:

> On 2019-01-14 17:37, Alex Bennée wrote:
>>
>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>
>>> On 1/14/19 1:12 PM, Thomas Huth wrote:
>>>> Clang v7.0.1 does not like the __int128 variable type for inline
>>>> assembly on s390x:
>>>>
>>>> In file included from fpu/softfloat.c:97:
>>>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>>>  This value type register class is not natively supported!
>>>>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>>>         ^
>>>>
>>>> Disable this code part there now when compiling with Clang, so that
>>>> the generic code gets used instead.
>>>>
>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>> ---
>>>>  include/fpu/softfloat-macros.h | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>>> index b1d772e..bd5b641 100644
>>>> --- a/include/fpu/softfloat-macros.h
>>>> +++ b/include/fpu/softfloat-macros.h
>>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>>>      uint64_t q;
>>>>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>>>      return q;
>>>> -#elif defined(__s390x__)
>>>> +#elif defined(__s390x__) && !defined(__clang__)
>>>
>>> Can we rather check if __int128 is natively supported? So this part get
>>> compiled once Clang do support it, else we'll never use it...
>>
>> We already define CONFIG_INT128 so you could just use that.
>>
>> Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?
>
> Yes, CONFIG_INT128=y is also set with Clang on s390x. It's really just
> that it does not like __int128  to be passed as parameters for inline
> assembly...

What about something like this:

modified   include/fpu/softfloat-macros.h
@@ -641,12 +641,6 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
     uint64_t q;
     asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
     return q;
-#elif defined(__s390x__)
-    /* Need to use a TImode type to get an even register pair for DLGR.  */
-    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
-    asm("dlgr %0, %1" : "+r"(n) : "r"(d));
-    *r = n >> 64;
-    return n;
 #elif defined(_ARCH_PPC64) && defined(_ARCH_PWR7)
     /* From Power ISA 2.06, programming note for divdeu.  */
     uint64_t q1, q2, Q, r1, r2, R;
@@ -663,6 +657,11 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
     }
     *r = R;
     return Q;
+#elif defined(CONFIG_INT128)
+    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
+    unsigned __int128 q = n / d;
+    *r = q >> 64;
+    return q;
 #else
     uint64_t d0, d1, q0, q1, r1, r0, m;

--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 18:58       ` Alex Bennée
@ 2019-01-14 21:36         ` Richard Henderson
  2019-01-14 22:48           ` Alex Bennée
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Henderson @ 2019-01-14 21:36 UTC (permalink / raw)
  To: Alex Bennée, Thomas Huth
  Cc: Philippe Mathieu-Daudé,
	Aurelien Jarno, Peter Maydell, Cornelia Huck, qemu-devel,
	qemu-s390x

On 1/15/19 5:58 AM, Alex Bennée wrote:
> 
> Thomas Huth <thuth@redhat.com> writes:
> 
>> On 2019-01-14 17:37, Alex Bennée wrote:
>>>
>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>
>>>> On 1/14/19 1:12 PM, Thomas Huth wrote:
>>>>> Clang v7.0.1 does not like the __int128 variable type for inline
>>>>> assembly on s390x:
>>>>>
>>>>> In file included from fpu/softfloat.c:97:
>>>>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>>>>  This value type register class is not natively supported!
>>>>>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>>>>         ^
>>>>>
>>>>> Disable this code part there now when compiling with Clang, so that
>>>>> the generic code gets used instead.
>>>>>
>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>>> ---
>>>>>  include/fpu/softfloat-macros.h | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>>>> index b1d772e..bd5b641 100644
>>>>> --- a/include/fpu/softfloat-macros.h
>>>>> +++ b/include/fpu/softfloat-macros.h
>>>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>>>>      uint64_t q;
>>>>>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>>>>      return q;
>>>>> -#elif defined(__s390x__)
>>>>> +#elif defined(__s390x__) && !defined(__clang__)
>>>>
>>>> Can we rather check if __int128 is natively supported? So this part get
>>>> compiled once Clang do support it, else we'll never use it...
>>>
>>> We already define CONFIG_INT128 so you could just use that.
>>>
>>> Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?
>>
>> Yes, CONFIG_INT128=y is also set with Clang on s390x. It's really just
>> that it does not like __int128  to be passed as parameters for inline
>> assembly...
> 
> What about something like this:
> 
> modified   include/fpu/softfloat-macros.h
> @@ -641,12 +641,6 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>      uint64_t q;
>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>      return q;
> -#elif defined(__s390x__)
> -    /* Need to use a TImode type to get an even register pair for DLGR.  */
> -    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> -    asm("dlgr %0, %1" : "+r"(n) : "r"(d));
> -    *r = n >> 64;
> -    return n;
>  #elif defined(_ARCH_PPC64) && defined(_ARCH_PWR7)
>      /* From Power ISA 2.06, programming note for divdeu.  */
>      uint64_t q1, q2, Q, r1, r2, R;
> @@ -663,6 +657,11 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>      }
>      *r = R;
>      return Q;
> +#elif defined(CONFIG_INT128)
> +    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> +    unsigned __int128 q = n / d;
> +    *r = q >> 64;
> +    return q;

Because that is not what the assembly does, for one.

But perhaps

    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
    *r = n % d;
    return n / d;

will allow the compiler to do what the assembly does for some 64-bit hosts.


r~

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
  2019-01-14 12:16 ` Philippe Mathieu-Daudé
@ 2019-01-14 21:40 ` Richard Henderson
  2019-01-16 16:50 ` Cornelia Huck
  2019-01-17  8:30 ` Cornelia Huck
  3 siblings, 0 replies; 23+ messages in thread
From: Richard Henderson @ 2019-01-14 21:40 UTC (permalink / raw)
  To: Thomas Huth, Richard Henderson, Aurelien Jarno, Peter Maydell,
	Alex Bennée, Cornelia Huck, qemu-devel
  Cc: qemu-s390x

On 1/14/19 11:12 PM, Thomas Huth wrote:
> Clang v7.0.1 does not like the __int128 variable type for inline
> assembly on s390x:
> 
> In file included from fpu/softfloat.c:97:
> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>  This value type register class is not natively supported!
>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>         ^
> 
> Disable this code part there now when compiling with Clang, so that
> the generic code gets used instead.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  include/fpu/softfloat-macros.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

Regardless of whatever follow-up we might do wrt CONFIG_INT128.


r~

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 21:36         ` Richard Henderson
@ 2019-01-14 22:48           ` Alex Bennée
  2019-01-15 10:14             ` Peter Maydell
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-14 22:48 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Thomas Huth, Philippe Mathieu-Daudé,
	Aurelien Jarno, Peter Maydell, Cornelia Huck, qemu-devel,
	qemu-s390x


Richard Henderson <rth@twiddle.net> writes:

> On 1/15/19 5:58 AM, Alex Bennée wrote:
>>
>> Thomas Huth <thuth@redhat.com> writes:
>>
>>> On 2019-01-14 17:37, Alex Bennée wrote:
>>>>
>>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>>
>>>>> On 1/14/19 1:12 PM, Thomas Huth wrote:
>>>>>> Clang v7.0.1 does not like the __int128 variable type for inline
>>>>>> assembly on s390x:
>>>>>>
>>>>>> In file included from fpu/softfloat.c:97:
>>>>>> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>>>>>>  This value type register class is not natively supported!
>>>>>>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>>>>>         ^
>>>>>>
>>>>>> Disable this code part there now when compiling with Clang, so that
>>>>>> the generic code gets used instead.
>>>>>>
>>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>>>> ---
>>>>>>  include/fpu/softfloat-macros.h | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>>>>> index b1d772e..bd5b641 100644
>>>>>> --- a/include/fpu/softfloat-macros.h
>>>>>> +++ b/include/fpu/softfloat-macros.h
>>>>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>>>>>      uint64_t q;
>>>>>>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>>>>>      return q;
>>>>>> -#elif defined(__s390x__)
>>>>>> +#elif defined(__s390x__) && !defined(__clang__)
>>>>>
>>>>> Can we rather check if __int128 is natively supported? So this part get
>>>>> compiled once Clang do support it, else we'll never use it...
>>>>
>>>> We already define CONFIG_INT128 so you could just use that.
>>>>
>>>> Thomas does the s390 clang leave CONFIG_INT128=y in config-host.mak?
>>>
>>> Yes, CONFIG_INT128=y is also set with Clang on s390x. It's really just
>>> that it does not like __int128  to be passed as parameters for inline
>>> assembly...
>>
>> What about something like this:
>>
>> modified   include/fpu/softfloat-macros.h
>> @@ -641,12 +641,6 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>      uint64_t q;
>>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>      return q;
>> -#elif defined(__s390x__)
>> -    /* Need to use a TImode type to get an even register pair for DLGR.  */
>> -    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>> -    asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>> -    *r = n >> 64;
>> -    return n;
>>  #elif defined(_ARCH_PPC64) && defined(_ARCH_PWR7)
>>      /* From Power ISA 2.06, programming note for divdeu.  */
>>      uint64_t q1, q2, Q, r1, r2, R;
>> @@ -663,6 +657,11 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>      }
>>      *r = R;
>>      return Q;
>> +#elif defined(CONFIG_INT128)
>> +    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>> +    unsigned __int128 q = n / d;
>> +    *r = q >> 64;
>> +    return q;
>
> Because that is not what the assembly does, for one.

Doh...

> But perhaps
>
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     *r = n % d;
>     return n / d;
>
> will allow the compiler to do what the assembly does for some 64-bit
> hosts.

I wonder how much cost is incurred by the jumping to the (libgcc?) div
helper? Anyone got an s390x about so we can benchmark the two
approaches?

If it's in the noise then it would be nice to avoid getting too #ifdef
happy.

--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 22:48           ` Alex Bennée
@ 2019-01-15 10:14             ` Peter Maydell
  2019-01-15 14:46               ` Alex Bennée
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Maydell @ 2019-01-15 10:14 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Richard Henderson, Thomas Huth, Philippe Mathieu-Daudé,
	Aurelien Jarno, Cornelia Huck, QEMU Developers, qemu-s390x

On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Richard Henderson <rth@twiddle.net> writes:
> > But perhaps
> >
> >     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
> >     *r = n % d;
> >     return n / d;
> >
> > will allow the compiler to do what the assembly does for some 64-bit
> > hosts.
>
> I wonder how much cost is incurred by the jumping to the (libgcc?) div
> helper? Anyone got an s390x about so we can benchmark the two
> approaches?

The project has an s390x system available; however it's usually
running merge build tests so not so useful for benchmarking.
(I can set up accounts on it but that requires me to faff about
figuring out how to create new accounts :-))

thanks
-- PMM

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-15 10:14             ` Peter Maydell
@ 2019-01-15 14:46               ` Alex Bennée
  2019-01-15 15:29                 ` Thomas Huth
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-15 14:46 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, Thomas Huth, Philippe Mathieu-Daudé,
	Aurelien Jarno, Cornelia Huck, QEMU Developers, qemu-s390x


Peter Maydell <peter.maydell@linaro.org> writes:

> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Richard Henderson <rth@twiddle.net> writes:
>> > But perhaps
>> >
>> >     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>> >     *r = n % d;
>> >     return n / d;
>> >
>> > will allow the compiler to do what the assembly does for some 64-bit
>> > hosts.
>>
>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>> helper? Anyone got an s390x about so we can benchmark the two
>> approaches?
>
> The project has an s390x system available; however it's usually
> running merge build tests so not so useful for benchmarking.
> (I can set up accounts on it but that requires me to faff about
> figuring out how to create new accounts :-))

I'm happy to leave this up to those who care about s390x host
performance (Thomas, Cornelia?). I'm just keen to avoid the divide
helper getting too #ifdefy.

I'll include a CONFIG_INT128 patch in my next patch queue review once
I've double checked and tested under linux-user ;-)

--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-15 14:46               ` Alex Bennée
@ 2019-01-15 15:29                 ` Thomas Huth
  2019-01-15 16:01                   ` Alex Bennée
  2019-01-15 22:05                   ` Richard Henderson
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Huth @ 2019-01-15 15:29 UTC (permalink / raw)
  To: Alex Bennée, Peter Maydell
  Cc: Richard Henderson, Philippe Mathieu-Daudé,
	Aurelien Jarno, Cornelia Huck, QEMU Developers, qemu-s390x

On 2019-01-15 15:46, Alex Bennée wrote:
> 
> Peter Maydell <peter.maydell@linaro.org> writes:
> 
>> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>>>
>>>
>>> Richard Henderson <rth@twiddle.net> writes:
>>>> But perhaps
>>>>
>>>>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>>>     *r = n % d;
>>>>     return n / d;
>>>>
>>>> will allow the compiler to do what the assembly does for some 64-bit
>>>> hosts.
>>>
>>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>>> helper? Anyone got an s390x about so we can benchmark the two
>>> approaches?
>>
>> The project has an s390x system available; however it's usually
>> running merge build tests so not so useful for benchmarking.
>> (I can set up accounts on it but that requires me to faff about
>> figuring out how to create new accounts :-))
> 
> I'm happy to leave this up to those who care about s390x host
> performance (Thomas, Cornelia?). I'm just keen to avoid the divide
> helper getting too #ifdefy.

Ok, I just did a quick'n'dirty "benchmark" on the s390x that I've got available:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

uint64_t udiv_qrnnd1(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
{
    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
    asm("dlgr %0, %1" : "+r"(n) : "r"(d));
    *r = n >> 64;
    return n;
}

uint64_t udiv_qrnnd2(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
{
    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
    *r = n % d;
    return n / d;
}

uint64_t udiv_qrnnd3(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
{
    uint64_t d0, d1, q0, q1, r1, r0, m;

    d0 = (uint32_t)d;
    d1 = d >> 32;

    r1 = n1 % d1;
    q1 = n1 / d1;
    m = q1 * d0;
    r1 = (r1 << 32) | (n0 >> 32);
    if (r1 < m) {
        q1 -= 1;
        r1 += d;
        if (r1 >= d) {
            if (r1 < m) {
                q1 -= 1;
                r1 += d;
            }
        }
    }
    r1 -= m;

    r0 = r1 % d1;
    q0 = r1 / d1;
    m = q0 * d0;
    r0 = (r0 << 32) | (uint32_t)n0;
    if (r0 < m) {
        q0 -= 1;
        r0 += d;
        if (r0 >= d) {
            if (r0 < m) {
                q0 -= 1;
                r0 += d;
            }
        }
    }
    r0 -= m;

    *r = r0;
    return (q1 << 32) | q0;

}

int main()
{
	uint64_t r = 0, n1 = 0, n0 = 0, d = 0;
	uint64_t rs = 0, rn = 0;
	clock_t start, end;
	long i;

	start = clock();
	for (i=0; i<200000000L; i++) {
		n1 += 3;
		n0 += 987654321;
		d += 0x123456789;
		rs += udiv_qrnnd1(&r, n1, n0, d);
		rn += r;
	}
	end = clock();
	printf("test 1: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);

	r = n1 = n0 = d = rs = rn = 0;

	start = clock();
	for (i=0; i<200000000L; i++) {
		n1 += 3;
		n0 += 987654321;
		d += 0x123456789;
		rs += udiv_qrnnd2(&r, n1, n0, d);
		rn += r;
	}
	end = clock();
	printf("test 2: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);

	r = n1 = n0 = d = rs = rn = 0;

	start = clock();
	for (i=0; i<200000000L; i++) {
		n1 += 3;
		n0 += 987654321;
		d += 0x123456789;
		rs += udiv_qrnnd3(&r, n1, n0, d);
		rn += r;
	}
	end = clock();
	printf("test 3: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);

	return 0;
}

... and results with GCC v8.2.1 are (using -O2):

test 1: time=609	, rs=2264924160200000000 , rn = 6136218997527160832
test 2: time=10127	, rs=2264924160200000000 , rn = 6136218997527160832
test 3: time=2350	, rs=2264924183048928865 , rn = 4842822048162311089

Thus the int128 version is the slowest!

... but at least it gives the same results as the DLGR instruction. The 64-bit
version gives different results - do we have a bug here?

Results with Clang v7.0.1 (using -O2, too) are these:

test 2: time=5035	, rs=2264924160200000000 , rn = 6136218997527160832
test 3: time=1970	, rs=2264924183048928865 , rn = 4842822048162311089

 Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-15 15:29                 ` Thomas Huth
@ 2019-01-15 16:01                   ` Alex Bennée
  2019-01-15 20:05                     ` Emilio G. Cota
  2019-01-15 22:05                   ` Richard Henderson
  1 sibling, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-15 16:01 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Peter Maydell, Richard Henderson, Philippe Mathieu-Daudé,
	Aurelien Jarno, Cornelia Huck, QEMU Developers, qemu-s390x


Thomas Huth <thuth@redhat.com> writes:

> On 2019-01-15 15:46, Alex Bennée wrote:
>>
>> Peter Maydell <peter.maydell@linaro.org> writes:
>>
>>> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>>>>
>>>>
>>>> Richard Henderson <rth@twiddle.net> writes:
>>>>> But perhaps
>>>>>
>>>>>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>>>>     *r = n % d;
>>>>>     return n / d;
>>>>>
>>>>> will allow the compiler to do what the assembly does for some 64-bit
>>>>> hosts.
>>>>
>>>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>>>> helper? Anyone got an s390x about so we can benchmark the two
>>>> approaches?
>>>
>>> The project has an s390x system available; however it's usually
>>> running merge build tests so not so useful for benchmarking.
>>> (I can set up accounts on it but that requires me to faff about
>>> figuring out how to create new accounts :-))
>>
>> I'm happy to leave this up to those who care about s390x host
>> performance (Thomas, Cornelia?). I'm just keen to avoid the divide
>> helper getting too #ifdefy.
>
> Ok, I just did a quick'n'dirty "benchmark" on the s390x that I've got
> available:

Ahh I should have mentioned we already have the technology for this ;-)

If you build the fpu/next tree on a s390x you can then run:

  ./tests/fp/fp-bench f64_div

with and without the CONFIG_128 path. To get an idea of the real world
impact you can compile a foreign binary and run it on a s390x system
with:

  $QEMU ./tests/fp/fp-bench f64_div -t host

And that will give you the peak performance assuming your program is
doing nothing but f64_div operations. If the two QEMU's are basically in
the same ballpark then it doesn't make enough difference. That said:

> #include <stdio.h>
> #include <time.h>
> #include <stdint.h>
>
> uint64_t udiv_qrnnd1(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>     *r = n >> 64;
>     return n;
> }
>
> uint64_t udiv_qrnnd2(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     *r = n % d;
>     return n / d;
> }
>
<snip>
>
> int main()
> {
> 	uint64_t r = 0, n1 = 0, n0 = 0, d = 0;
> 	uint64_t rs = 0, rn = 0;
> 	clock_t start, end;
> 	long i;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd1(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 1: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	r = n1 = n0 = d = rs = rn = 0;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd2(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 2: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	r = n1 = n0 = d = rs = rn = 0;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd3(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 3: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	return 0;
> }
>
> ... and results with GCC v8.2.1 are (using -O2):
>
> test 1: time=609	, rs=2264924160200000000 , rn = 6136218997527160832
> test 2: time=10127	, rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=2350	, rs=2264924183048928865 , rn = 4842822048162311089
>
> Thus the int128 version is the slowest!

I'd expect a little slow down due to the indirection into libgcc.. but
that seems pretty high.

>
> ... but at least it gives the same results as the DLGR instruction. The 64-bit
> version gives different results - do we have a bug here?
>
> Results with Clang v7.0.1 (using -O2, too) are these:
>
> test 2: time=5035	, rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=1970	, rs=2264924183048928865 , rn =
> 4842822048162311089

You can run:

  ./tests/fp/fp-test f64_div -l 2 -r all

For a proper comprehensive test.

--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-15 16:01                   ` Alex Bennée
@ 2019-01-15 20:05                     ` Emilio G. Cota
  2019-01-16  6:33                       ` Thomas Huth
  0 siblings, 1 reply; 23+ messages in thread
From: Emilio G. Cota @ 2019-01-15 20:05 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Thomas Huth, Peter Maydell, Cornelia Huck, QEMU Developers,
	qemu-s390x, Philippe Mathieu-Daudé,
	Aurelien Jarno, Richard Henderson

On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
> Ahh I should have mentioned we already have the technology for this ;-)
> 
> If you build the fpu/next tree on a s390x you can then run:
> 
>   ./tests/fp/fp-bench f64_div
> 
> with and without the CONFIG_128 path. To get an idea of the real world
> impact you can compile a foreign binary and run it on a s390x system
> with:
> 
>   $QEMU ./tests/fp/fp-bench f64_div -t host
> 
> And that will give you the peak performance assuming your program is
> doing nothing but f64_div operations. If the two QEMU's are basically in
> the same ballpark then it doesn't make enough difference. That said:

I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
you'll get the default op (-o add).

Thanks,

		E.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-15 15:29                 ` Thomas Huth
  2019-01-15 16:01                   ` Alex Bennée
@ 2019-01-15 22:05                   ` Richard Henderson
  1 sibling, 0 replies; 23+ messages in thread
From: Richard Henderson @ 2019-01-15 22:05 UTC (permalink / raw)
  To: Thomas Huth, Alex Bennée, Peter Maydell
  Cc: Cornelia Huck, QEMU Developers, qemu-s390x,
	Philippe Mathieu-Daudé,
	Aurelien Jarno, Richard Henderson

On 1/16/19 2:29 AM, Thomas Huth wrote:
> ... but at least it gives the same results as the DLGR instruction. The 64-bit
> version gives different results - do we have a bug here?

Yes, on your inputs.  udiv_qrnnd3 requires that D be "normalized", i.e. have
the most siginificant bit set.  (And thus N1:N0 shifted left to match, and the
returned remainder shifted right to match.)


r~

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-15 20:05                     ` Emilio G. Cota
@ 2019-01-16  6:33                       ` Thomas Huth
  2019-01-16 17:08                         ` Alex Bennée
  2019-01-16 18:21                         ` Emilio G. Cota
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Huth @ 2019-01-16  6:33 UTC (permalink / raw)
  To: Emilio G. Cota, Alex Bennée
  Cc: Peter Maydell, Cornelia Huck, QEMU Developers, qemu-s390x,
	Philippe Mathieu-Daudé,
	Aurelien Jarno, Richard Henderson

On 2019-01-15 21:05, Emilio G. Cota wrote:
> On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
>> Ahh I should have mentioned we already have the technology for this ;-)
>>
>> If you build the fpu/next tree on a s390x you can then run:
>>
>>   ./tests/fp/fp-bench f64_div
>>
>> with and without the CONFIG_128 path. To get an idea of the real world
>> impact you can compile a foreign binary and run it on a s390x system
>> with:
>>
>>   $QEMU ./tests/fp/fp-bench f64_div -t host
>>
>> And that will give you the peak performance assuming your program is
>> doing nothing but f64_div operations. If the two QEMU's are basically in
>> the same ballpark then it doesn't make enough difference. That said:
> 
> I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
> you'll get the default op (-o add).

I tried that now, too, and -o div -p double does not really seem to
exercise this function at all.

Here are my results (disclaimer: that system is likely not really usable
for benchmarks since it's CPUs are shared with other LPARs, but I ran
all the tests at least twice and got similar results):


With the DGLR inline assembly:

 time ./fp-test f64_div -l 2 -r all
 real	6m43,648s
 user	6m43,362s
 sys	0m0,160s

 time ./fp-bench -o div -p double
 204.98 MFlops
 real	0m1,002s
 user	0m1,001s
 sys	0m0,001s


With the "#else" default 64-bit code:

 time ./fp-test f64_div -l 2 -r all
 real	6m44,910s
 user	6m44,616s
 sys	0m0,165s

 time ./fp-bench -o div -p double
 205.41 MFlops
 real	0m1,002s
 user	0m1,001s
 sys	0m0,001s


With the new CONFIG_INT128 code:

 time ./fp-test f64_div -l 2 -r all
 real	6m58,371s
 user	6m58,078s
 sys	0m0,164s

 time ./fp-bench -o div -p double
 205.17 MFlops
 real	0m1,002s
 user	0m1,000s
 sys	0m0,001s


==> The new CONFIG_INT128 code is really worse than the 64-bit code, so
I don't think we should include this yet (unless we know a system where
the compiler can create optimized assembly code without libgcc here).

 Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
  2019-01-14 12:16 ` Philippe Mathieu-Daudé
  2019-01-14 21:40 ` Richard Henderson
@ 2019-01-16 16:50 ` Cornelia Huck
  2019-01-16 17:16   ` Alex Bennée
  2019-01-17  8:30 ` Cornelia Huck
  3 siblings, 1 reply; 23+ messages in thread
From: Cornelia Huck @ 2019-01-16 16:50 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Richard Henderson, Aurelien Jarno, Peter Maydell,
	Alex Bennée, qemu-devel, qemu-s390x

On Mon, 14 Jan 2019 13:12:35 +0100
Thomas Huth <thuth@redhat.com> wrote:

> Clang v7.0.1 does not like the __int128 variable type for inline
> assembly on s390x:
> 
> In file included from fpu/softfloat.c:97:
> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>  This value type register class is not natively supported!
>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>         ^
> 
> Disable this code part there now when compiling with Clang, so that
> the generic code gets used instead.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  include/fpu/softfloat-macros.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
> index b1d772e..bd5b641 100644
> --- a/include/fpu/softfloat-macros.h
> +++ b/include/fpu/softfloat-macros.h
> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>      uint64_t q;
>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>      return q;
> -#elif defined(__s390x__)
> +#elif defined(__s390x__) && !defined(__clang__)
>      /* Need to use a TImode type to get an even register pair for DLGR.  */
>      unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>      asm("dlgr %0, %1" : "+r"(n) : "r"(d));

Ok, so what's the deal with this patch now? Fix compilation now,
optimize later?

If yes, should I pick it as an s390x build fix (I plan to send a pull
request later this week), or will the fpu maintainers pick it?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-16  6:33                       ` Thomas Huth
@ 2019-01-16 17:08                         ` Alex Bennée
  2019-01-17  6:06                           ` Thomas Huth
  2019-01-16 18:21                         ` Emilio G. Cota
  1 sibling, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-16 17:08 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Emilio G. Cota, Peter Maydell, Cornelia Huck, QEMU Developers,
	qemu-s390x, Philippe Mathieu-Daudé,
	Aurelien Jarno, Richard Henderson


Thomas Huth <thuth@redhat.com> writes:

> On 2019-01-15 21:05, Emilio G. Cota wrote:
>> On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
>>> Ahh I should have mentioned we already have the technology for this ;-)
>>>
>>> If you build the fpu/next tree on a s390x you can then run:
>>>
>>>   ./tests/fp/fp-bench f64_div
>>>
>>> with and without the CONFIG_128 path. To get an idea of the real world
>>> impact you can compile a foreign binary and run it on a s390x system
>>> with:
>>>
>>>   $QEMU ./tests/fp/fp-bench f64_div -t host
>>>
>>> And that will give you the peak performance assuming your program is
>>> doing nothing but f64_div operations. If the two QEMU's are basically in
>>> the same ballpark then it doesn't make enough difference. That said:
>>
>> I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
>> you'll get the default op (-o add).
>
> I tried that now, too, and -o div -p double does not really seem to
> exercise this function at all.

How do you mean? It should do because by default it should be calling
the softfloat implementations.

> Here are my results (disclaimer: that system is likely not really usable
> for benchmarks since it's CPUs are shared with other LPARs, but I ran
> all the tests at least twice and got similar results):
>
>
> With the DGLR inline assembly:
>
<snip>
>  time ./fp-bench -o div -p double
>  204.98 MFlops
<snip>
> With the "#else" default 64-bit code:
>
<snip>
>  time ./fp-bench -o div -p double
>  205.41 MFlops
<snip>
> With the new CONFIG_INT128 code:
>
<snip>
>  time ./fp-bench -o div -p double
>  205.17 MFlops
<snip>
>
>
> ==> The new CONFIG_INT128 code is really worse than the 64-bit code, so
> I don't think we should include this yet (unless we know a system where
> the compiler can create optimized assembly code without libgcc here).

I mean to me that looks like it is easily in the noise range and that
the dglr instruction didn't actually beat the unrolled 64 bit code -
which is just weird.

--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-16 16:50 ` Cornelia Huck
@ 2019-01-16 17:16   ` Alex Bennée
  2019-01-17  5:57     ` Thomas Huth
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Bennée @ 2019-01-16 17:16 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Thomas Huth, Richard Henderson, Aurelien Jarno, Peter Maydell,
	qemu-devel, qemu-s390x


Cornelia Huck <cohuck@redhat.com> writes:

> On Mon, 14 Jan 2019 13:12:35 +0100
> Thomas Huth <thuth@redhat.com> wrote:
>
<snip>
>>
>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>> index b1d772e..bd5b641 100644
>> --- a/include/fpu/softfloat-macros.h
>> +++ b/include/fpu/softfloat-macros.h
>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>      uint64_t q;
>>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>      return q;
>> -#elif defined(__s390x__)
>> +#elif defined(__s390x__) && !defined(__clang__)
>>      /* Need to use a TImode type to get an even register pair for DLGR.  */
>>      unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>      asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>
> Ok, so what's the deal with this patch now? Fix compilation now,
> optimize later?
>
> If yes, should I pick it as an s390x build fix (I plan to send a pull
> request later this week), or will the fpu maintainers pick it?

I'm planning to send a FPU PR tomorrow and I'll happily include either
version.

I'm personally minded to go with the patch that makes s390 (and others)
fall back to the generic CONFIG_INT128 code. The numbers Thomas gathered
didn't look like it was much difference either way.

Unless you *really* care about milking that last bit of performance out of
the s390 TCG back-end?

--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-16  6:33                       ` Thomas Huth
  2019-01-16 17:08                         ` Alex Bennée
@ 2019-01-16 18:21                         ` Emilio G. Cota
  1 sibling, 0 replies; 23+ messages in thread
From: Emilio G. Cota @ 2019-01-16 18:21 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Alex Bennée, Peter Maydell, Cornelia Huck, QEMU Developers,
	qemu-s390x, Philippe Mathieu-Daudé,
	Aurelien Jarno, Richard Henderson

On Wed, Jan 16, 2019 at 07:33:28 +0100, Thomas Huth wrote:
> On 2019-01-15 21:05, Emilio G. Cota wrote:
> > On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
> >> Ahh I should have mentioned we already have the technology for this ;-)
> >>
> >> If you build the fpu/next tree on a s390x you can then run:
> >>
> >>   ./tests/fp/fp-bench f64_div
> >>
> >> with and without the CONFIG_128 path. To get an idea of the real world
> >> impact you can compile a foreign binary and run it on a s390x system
> >> with:
> >>
> >>   $QEMU ./tests/fp/fp-bench f64_div -t host
> >>
> >> And that will give you the peak performance assuming your program is
> >> doing nothing but f64_div operations. If the two QEMU's are basically in
> >> the same ballpark then it doesn't make enough difference. That said:
> > 
> > I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
> > you'll get the default op (-o add).
> 
> I tried that now, too, and -o div -p double does not really seem to
> exercise this function at all.

You can check what is being called then with perf record/report.

		E.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-16 17:16   ` Alex Bennée
@ 2019-01-17  5:57     ` Thomas Huth
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Huth @ 2019-01-17  5:57 UTC (permalink / raw)
  To: Alex Bennée, Cornelia Huck
  Cc: Richard Henderson, Aurelien Jarno, Peter Maydell, qemu-devel, qemu-s390x

On 2019-01-16 18:16, Alex Bennée wrote:
> 
> Cornelia Huck <cohuck@redhat.com> writes:
> 
>> On Mon, 14 Jan 2019 13:12:35 +0100
>> Thomas Huth <thuth@redhat.com> wrote:
>>
> <snip>
>>>
>>> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
>>> index b1d772e..bd5b641 100644
>>> --- a/include/fpu/softfloat-macros.h
>>> +++ b/include/fpu/softfloat-macros.h
>>> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>>>      uint64_t q;
>>>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>>>      return q;
>>> -#elif defined(__s390x__)
>>> +#elif defined(__s390x__) && !defined(__clang__)
>>>      /* Need to use a TImode type to get an even register pair for DLGR.  */
>>>      unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>>      asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>>
>> Ok, so what's the deal with this patch now? Fix compilation now,
>> optimize later?
>>
>> If yes, should I pick it as an s390x build fix (I plan to send a pull
>> request later this week), or will the fpu maintainers pick it?
> 
> I'm planning to send a FPU PR tomorrow and I'll happily include either
> version.
> 
> I'm personally minded to go with the patch that makes s390 (and others)
> fall back to the generic CONFIG_INT128 code. The numbers Thomas gathered
> didn't look like it was much difference either way.
> 
> Unless you *really* care about milking that last bit of performance out of
> the s390 TCG back-end?

I don't think that anybody buys a mainframe for this special case ;-)
I'd go with the !defined(__clang__) patch above.

 Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-16 17:08                         ` Alex Bennée
@ 2019-01-17  6:06                           ` Thomas Huth
  2019-01-17  7:42                             ` Alex Bennée
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Huth @ 2019-01-17  6:06 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Emilio G. Cota, Peter Maydell, Cornelia Huck, QEMU Developers,
	qemu-s390x, Philippe Mathieu-Daudé,
	Aurelien Jarno, Richard Henderson

On 2019-01-16 18:08, Alex Bennée wrote:
> 
> Thomas Huth <thuth@redhat.com> writes:
> 
>> On 2019-01-15 21:05, Emilio G. Cota wrote:
>>> On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
>>>> Ahh I should have mentioned we already have the technology for this ;-)
>>>>
>>>> If you build the fpu/next tree on a s390x you can then run:
>>>>
>>>>   ./tests/fp/fp-bench f64_div
>>>>
>>>> with and without the CONFIG_128 path. To get an idea of the real world
>>>> impact you can compile a foreign binary and run it on a s390x system
>>>> with:
>>>>
>>>>   $QEMU ./tests/fp/fp-bench f64_div -t host
>>>>
>>>> And that will give you the peak performance assuming your program is
>>>> doing nothing but f64_div operations. If the two QEMU's are basically in
>>>> the same ballpark then it doesn't make enough difference. That said:
>>>
>>> I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
>>> you'll get the default op (-o add).
>>
>> I tried that now, too, and -o div -p double does not really seem to
>> exercise this function at all.
> 
> How do you mean? It should do because by default it should be calling
> the softfloat implementations.

I've added a puts("hello") into the udiv_qrnd() function. When I then
run "fp-bench -o div -p double", it only prints out "hello" a single
time, so the function is only called once during the whole test.

 Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-17  6:06                           ` Thomas Huth
@ 2019-01-17  7:42                             ` Alex Bennée
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Bennée @ 2019-01-17  7:42 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Emilio G. Cota, Peter Maydell, Cornelia Huck, QEMU Developers,
	qemu-s390x, Philippe Mathieu-Daudé,
	Aurelien Jarno, Richard Henderson


Thomas Huth <thuth@redhat.com> writes:

> On 2019-01-16 18:08, Alex Bennée wrote:
>>
>> Thomas Huth <thuth@redhat.com> writes:
>>
>>> On 2019-01-15 21:05, Emilio G. Cota wrote:
>>>> On Tue, Jan 15, 2019 at 16:01:32 +0000, Alex Bennée wrote:
>>>>> Ahh I should have mentioned we already have the technology for this ;-)
>>>>>
>>>>> If you build the fpu/next tree on a s390x you can then run:
>>>>>
>>>>>   ./tests/fp/fp-bench f64_div
>>>>>
>>>>> with and without the CONFIG_128 path. To get an idea of the real world
>>>>> impact you can compile a foreign binary and run it on a s390x system
>>>>> with:
>>>>>
>>>>>   $QEMU ./tests/fp/fp-bench f64_div -t host
>>>>>
>>>>> And that will give you the peak performance assuming your program is
>>>>> doing nothing but f64_div operations. If the two QEMU's are basically in
>>>>> the same ballpark then it doesn't make enough difference. That said:
>>>>
>>>> I think you mean here `tests/fp/fp-bench -o div -p double', otherwise
>>>> you'll get the default op (-o add).
>>>
>>> I tried that now, too, and -o div -p double does not really seem to
>>> exercise this function at all.
>>
>> How do you mean? It should do because by default it should be calling
>> the softfloat implementations.
>
> I've added a puts("hello") into the udiv_qrnd() function. When I then
> run "fp-bench -o div -p double", it only prints out "hello" a single
> time, so the function is only called once during the whole test.

That's very odd. With the following on my aarch64 box:

modified   include/fpu/softfloat-macros.h
@@ -637,6 +637,8 @@ static inline uint64_t estimateDiv128To64(uint64_t a0, uint64_t a1, uint64_t b)
 static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
                                   uint64_t n0, uint64_t d)
 {
+    static int iter = 0;
+    fprintf(stderr, "%s: %d\n", __func__, iter++);
 #if defined(__x86_64__)
     uint64_t q;
     asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));

I get:

udiv_qrnnd: 0
udiv_qrnnd: 1
..
..<all the way to>
udiv_qrnnd: 99998
udiv_qrnnd: 99999

So I'm not sure what is different on s390

--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
  2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
                   ` (2 preceding siblings ...)
  2019-01-16 16:50 ` Cornelia Huck
@ 2019-01-17  8:30 ` Cornelia Huck
  3 siblings, 0 replies; 23+ messages in thread
From: Cornelia Huck @ 2019-01-17  8:30 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Richard Henderson, Aurelien Jarno, Peter Maydell,
	Alex Bennée, qemu-devel, qemu-s390x

On Mon, 14 Jan 2019 13:12:35 +0100
Thomas Huth <thuth@redhat.com> wrote:

> Clang v7.0.1 does not like the __int128 variable type for inline
> assembly on s390x:
> 
> In file included from fpu/softfloat.c:97:
> include/fpu/softfloat-macros.h:647:9: error: inline asm error:
>  This value type register class is not natively supported!
>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>         ^
> 
> Disable this code part there now when compiling with Clang, so that
> the generic code gets used instead.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  include/fpu/softfloat-macros.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
> index b1d772e..bd5b641 100644
> --- a/include/fpu/softfloat-macros.h
> +++ b/include/fpu/softfloat-macros.h
> @@ -641,7 +641,7 @@ static inline uint64_t udiv_qrnnd(uint64_t *r, uint64_t n1,
>      uint64_t q;
>      asm("divq %4" : "=a"(q), "=d"(*r) : "0"(n0), "1"(n1), "rm"(d));
>      return q;
> -#elif defined(__s390x__)
> +#elif defined(__s390x__) && !defined(__clang__)
>      /* Need to use a TImode type to get an even register pair for DLGR.  */
>      unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>      asm("dlgr %0, %1" : "+r"(n) : "r"(d));

Acked-by: Cornelia Huck <cohuck@redhat.com>

...as this fixed that s390x clang problem for me.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-01-17  8:30 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-14 12:12 [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x Thomas Huth
2019-01-14 12:16 ` Philippe Mathieu-Daudé
2019-01-14 16:37   ` Alex Bennée
2019-01-14 17:03     ` Thomas Huth
2019-01-14 18:58       ` Alex Bennée
2019-01-14 21:36         ` Richard Henderson
2019-01-14 22:48           ` Alex Bennée
2019-01-15 10:14             ` Peter Maydell
2019-01-15 14:46               ` Alex Bennée
2019-01-15 15:29                 ` Thomas Huth
2019-01-15 16:01                   ` Alex Bennée
2019-01-15 20:05                     ` Emilio G. Cota
2019-01-16  6:33                       ` Thomas Huth
2019-01-16 17:08                         ` Alex Bennée
2019-01-17  6:06                           ` Thomas Huth
2019-01-17  7:42                             ` Alex Bennée
2019-01-16 18:21                         ` Emilio G. Cota
2019-01-15 22:05                   ` Richard Henderson
2019-01-14 21:40 ` Richard Henderson
2019-01-16 16:50 ` Cornelia Huck
2019-01-16 17:16   ` Alex Bennée
2019-01-17  5:57     ` Thomas Huth
2019-01-17  8:30 ` Cornelia Huck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.