* [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture
@ 2015-01-09 10:27 Frediano Ziglio
2015-01-09 10:35 ` Paolo Bonzini
0 siblings, 1 reply; 8+ messages in thread
From: Frediano Ziglio @ 2015-01-09 10:27 UTC (permalink / raw)
To: Paolo Bonzini, Anthony Liguori, Stefan Hajnoczi
Cc: Frediano Ziglio, qemu-devel
As this platform can do multiply/divide with 128-bit precision, use
these instructions to implement it.
Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
---
include/qemu-common.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/include/qemu-common.h b/include/qemu-common.h
index f862214..5366220 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
}
/* compute with 96 bit intermediate result: (a*b)/c */
+#ifndef __x86_64__
static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
{
union {
@@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
return res.ll;
}
+#else
+static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
+{
+ uint64_t res;
+
+ asm ("mulq %2\n\tdivq %3"
+ : "=a"(res)
+ : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
+ : "rdx", "cc");
+ return res;
+}
+#endif
/* Round number down to multiple */
#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
--
1.9.1
* Re: [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture
2015-01-09 10:27 [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture Frediano Ziglio
@ 2015-01-09 10:35 ` Paolo Bonzini
2015-01-09 11:04 ` Frediano Ziglio
0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2015-01-09 10:35 UTC (permalink / raw)
To: Frediano Ziglio, Anthony Liguori, Stefan Hajnoczi
Cc: Frediano Ziglio, qemu-devel
On 09/01/2015 11:27, Frediano Ziglio wrote:
>
> Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
> ---
> include/qemu-common.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index f862214..5366220 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
> }
>
> /* compute with 96 bit intermediate result: (a*b)/c */
> +#ifndef __x86_64__
> static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
> {
> union {
> @@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
> res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
> return res.ll;
> }
> +#else
> +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
> +{
> + uint64_t res;
> +
> + asm ("mulq %2\n\tdivq %3"
> + : "=a"(res)
> + : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
> + : "rdx", "cc");
> + return res;
> +}
> +#endif
>
Good idea. However, if you have __int128, you can just do
return (__int128)a * b / c
and the compiler should generate the right code. Conveniently, there is
already CONFIG_INT128 that you can use.
Paolo
* Re: [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture
2015-01-09 10:35 ` Paolo Bonzini
@ 2015-01-09 11:04 ` Frediano Ziglio
2015-01-09 11:24 ` Paolo Bonzini
0 siblings, 1 reply; 8+ messages in thread
From: Frediano Ziglio @ 2015-01-09 11:04 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Frediano Ziglio, Stefan Hajnoczi, Anthony Liguori, qemu-devel
2015-01-09 10:35 GMT+00:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 09/01/2015 11:27, Frediano Ziglio wrote:
>>
>> Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
>> ---
>> include/qemu-common.h | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>> index f862214..5366220 100644
>> --- a/include/qemu-common.h
>> +++ b/include/qemu-common.h
>> @@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
>> }
>>
>> /* compute with 96 bit intermediate result: (a*b)/c */
>> +#ifndef __x86_64__
>> static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>> {
>> union {
>> @@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>> res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
>> return res.ll;
>> }
>> +#else
>> +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>> +{
>> + uint64_t res;
>> +
>> + asm ("mulq %2\n\tdivq %3"
>> + : "=a"(res)
>> + : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
>> + : "rdx", "cc");
>> + return res;
>> +}
>> +#endif
>>
>
> Good idea. However, if you have __int128, you can just do
>
> return (__int128)a * b / c
>
> and the compiler should generate the right code. Conveniently, there is
> already CONFIG_INT128 that you can use.
>
> Paolo
Well, it works, but in our case b <= c, so a * b / c is always <
2^64. This means the final division cannot overflow. However,
the compiler does not know this, so it performs the full (a*b) / c
division, which mainly consists of two integer divisions instead of
one (not counting that it is implemented through a helper
function).
I think I'll write two patches: one implementing it using __int128
as you suggested (which is much easier to read than both the current
version and the assembly one), and another for the x86_64 optimisation.
Frediano
* Re: [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture
2015-01-09 11:04 ` Frediano Ziglio
@ 2015-01-09 11:24 ` Paolo Bonzini
2015-01-09 11:38 ` Peter Maydell
2015-01-09 12:07 ` Frediano Ziglio
0 siblings, 2 replies; 8+ messages in thread
From: Paolo Bonzini @ 2015-01-09 11:24 UTC (permalink / raw)
To: Frediano Ziglio
Cc: Frediano Ziglio, Stefan Hajnoczi, Anthony Liguori, qemu-devel
On 09/01/2015 12:04, Frediano Ziglio wrote:
> 2015-01-09 10:35 GMT+00:00 Paolo Bonzini <pbonzini@redhat.com>:
>>
>>
>> On 09/01/2015 11:27, Frediano Ziglio wrote:
>>>
>>> Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
>>> ---
>>> include/qemu-common.h | 13 +++++++++++++
>>> 1 file changed, 13 insertions(+)
>>>
>>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>>> index f862214..5366220 100644
>>> --- a/include/qemu-common.h
>>> +++ b/include/qemu-common.h
>>> @@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
>>> }
>>>
>>> /* compute with 96 bit intermediate result: (a*b)/c */
>>> +#ifndef __x86_64__
>>> static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>>> {
>>> union {
>>> @@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>>> res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
>>> return res.ll;
>>> }
>>> +#else
>>> +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>>> +{
>>> + uint64_t res;
>>> +
>>> + asm ("mulq %2\n\tdivq %3"
>>> + : "=a"(res)
>>> + : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
>>> + : "rdx", "cc");
>>> + return res;
>>> +}
>>> +#endif
>>>
>>
>> Good idea. However, if you have __int128, you can just do
>>
>> return (__int128)a * b / c
>>
>> and the compiler should generate the right code. Conveniently, there is
>> already CONFIG_INT128 that you can use.
>
> Well, it works, but in our case b <= c, so a * b / c is always <
> 2^64.
This is not necessarily the case. Quick grep:
hw/timer/hpet.c: return (muldiv64(value, HPET_CLK_PERIOD, FS_PER_NS));
hw/timer/hpet.c: return (muldiv64(value, FS_PER_NS, HPET_CLK_PERIOD));
One of the two must disprove your assertion. :)
But it's true that we expect no overflow.
> This means the final division cannot overflow. However,
> the compiler does not know this, so it performs the full (a*b) / c
> division, which mainly consists of two integer divisions instead of
> one (not counting that it is implemented through a helper
> function).
>
> I think I'll write two patches: one implementing it using __int128
> as you suggested (which is much easier to read than both the current
> version and the assembly one), and another for the x86_64 optimisation.
Right, that's even better.
Out of curiosity, have you seen it in some profiles?
Paolo
* Re: [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture
2015-01-09 11:24 ` Paolo Bonzini
@ 2015-01-09 11:38 ` Peter Maydell
2015-01-09 12:07 ` Frediano Ziglio
1 sibling, 0 replies; 8+ messages in thread
From: Peter Maydell @ 2015-01-09 11:38 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Frediano Ziglio, qemu-devel, Frediano Ziglio, Stefan Hajnoczi,
Anthony Liguori
On 9 January 2015 at 11:24, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 09/01/2015 12:04, Frediano Ziglio wrote:
>> I think I'll write two patches: one implementing it using __int128
>> as you suggested (which is much easier to read than both the current
>> version and the assembly one), and another for the x86_64 optimisation.
>
> Right, that's even better.
Personally I would prefer we didn't write inline assembly
functions if we can avoid them. So I'd rather see an int128
version, and if the compiler doesn't do a good enough job
then go talk to the compiler folks to improve things.
> Out of curiosity, have you seen it in some profiles?
I would absolutely want to see significant perf uplift on
a real workload before we start putting inline asm into
qemu-common.h...
-- PMM
* Re: [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture
2015-01-09 11:24 ` Paolo Bonzini
2015-01-09 11:38 ` Peter Maydell
@ 2015-01-09 12:07 ` Frediano Ziglio
1 sibling, 0 replies; 8+ messages in thread
From: Frediano Ziglio @ 2015-01-09 12:07 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Frediano Ziglio, Stefan Hajnoczi, Anthony Liguori, qemu-devel
2015-01-09 11:24 GMT+00:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 09/01/2015 12:04, Frediano Ziglio wrote:
>> 2015-01-09 10:35 GMT+00:00 Paolo Bonzini <pbonzini@redhat.com>:
>>>
>>>
>>> On 09/01/2015 11:27, Frediano Ziglio wrote:
>>>>
>>>> Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
>>>> ---
>>>> include/qemu-common.h | 13 +++++++++++++
>>>> 1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>>>> index f862214..5366220 100644
>>>> --- a/include/qemu-common.h
>>>> +++ b/include/qemu-common.h
>>>> @@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
>>>> }
>>>>
>>>> /* compute with 96 bit intermediate result: (a*b)/c */
>>>> +#ifndef __x86_64__
>>>> static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>>>> {
>>>> union {
>>>> @@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>>>> res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
>>>> return res.ll;
>>>> }
>>>> +#else
>>>> +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>>>> +{
>>>> + uint64_t res;
>>>> +
>>>> + asm ("mulq %2\n\tdivq %3"
>>>> + : "=a"(res)
>>>> + : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
>>>> + : "rdx", "cc");
>>>> + return res;
>>>> +}
>>>> +#endif
>>>>
>>>
>>> Good idea. However, if you have __int128, you can just do
>>>
>>> return (__int128)a * b / c
>>>
>>> and the compiler should generate the right code. Conveniently, there is
>>> already CONFIG_INT128 that you can use.
>>
>> Well, it works, but in our case b <= c, so a * b / c is always <
>> 2^64.
>
> This is not necessarily the case. Quick grep:
>
> hw/timer/hpet.c: return (muldiv64(value, HPET_CLK_PERIOD, FS_PER_NS));
> hw/timer/hpet.c: return (muldiv64(value, FS_PER_NS, HPET_CLK_PERIOD));
>
> One of the two must disprove your assertion. :)
>
Unless FS_PER_NS == HPET_CLK_PERIOD :)
> But it's true that we expect no overflow.
>
This is enough!
>> This means the final division cannot overflow. However,
>> the compiler does not know this, so it performs the full (a*b) / c
>> division, which mainly consists of two integer divisions instead of
>> one (not counting that it is implemented through a helper
>> function).
>>
>> I think I'll write two patches: one implementing it using __int128
>> as you suggested (which is much easier to read than both the current
>> version and the assembly one), and another for the x86_64 optimisation.
>
> Right, that's even better.
>
> Out of curiosity, have you seen it in some profiles?
>
> Paolo
No, I just looked at the complicated code and the generated code and
thought "why use a dozen instructions when two are enough?" :)
Frediano
* Re: [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture
2015-01-09 9:53 Frediano Ziglio
@ 2015-01-09 20:00 ` Richard Henderson
0 siblings, 0 replies; 8+ messages in thread
From: Richard Henderson @ 2015-01-09 20:00 UTC (permalink / raw)
To: Frediano Ziglio, Paolo Bonzini, Anthony Liguori, Stefan Hajnoczi
Cc: qemu-devel
On 01/09/2015 01:53 AM, Frediano Ziglio wrote:
> As this platform can do multiply/divide with 128-bit precision, use
> these instructions to implement it.
>
> Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
> ---
> include/qemu-common.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index f862214..5366220 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
> }
>
> /* compute with 96 bit intermediate result: (a*b)/c */
> +#ifndef __x86_64__
> static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
> {
> union {
> @@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
> res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
> return res.ll;
> }
> +#else
> +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
> +{
> + uint64_t res;
> +
> + asm ("mulq %2\n\tdivq %3"
> + : "=a"(res)
> + : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
> + : "rdx", "cc");
> + return res;
> +}
> +#endif
Honestly, this ought to move into qemu/host-utils.h, and it should
use __int128 for targets that support it. Which includes x86_64,
but also other 64-bit hosts.
r~
* [Qemu-devel] [PATCH] x86_64: optimise muldiv64 for x86_64 architecture
@ 2015-01-09 9:53 Frediano Ziglio
2015-01-09 20:00 ` Richard Henderson
0 siblings, 1 reply; 8+ messages in thread
From: Frediano Ziglio @ 2015-01-09 9:53 UTC (permalink / raw)
To: Paolo Bonzini, Anthony Liguori, Stefan Hajnoczi
Cc: Frediano Ziglio, qemu-devel
As this platform can do multiply/divide with 128-bit precision, use
these instructions to implement it.
Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
---
include/qemu-common.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/include/qemu-common.h b/include/qemu-common.h
index f862214..5366220 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
}
/* compute with 96 bit intermediate result: (a*b)/c */
+#ifndef __x86_64__
static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
{
union {
@@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
return res.ll;
}
+#else
+static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
+{
+ uint64_t res;
+
+ asm ("mulq %2\n\tdivq %3"
+ : "=a"(res)
+ : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
+ : "rdx", "cc");
+ return res;
+}
+#endif
/* Round number down to multiple */
#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
--
1.9.1