All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
To: "Aleksandar Markovic" <amarkovic@wavecomp.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Cc: Aleksandar Rikalo <arikalo@wavecomp.com>,
	"aurelien@aurel32.net" <aurelien@aurel32.net>
Subject: Re: [Qemu-devel] [PATCH 1/2] target/mips: Improve performance for MSA binary operations
Date: Mon, 3 Jun 2019 15:29:11 +0200	[thread overview]
Message-ID: <76f6a6d8-d44b-c6b8-f3ec-42bee138e2ea@rt-rk.com> (raw)
In-Reply-To: <BN6PR2201MB12512BA7B7D00602F78C1331C6140@BN6PR2201MB1251.namprd22.prod.outlook.com>


On 3.6.19. 15:10, Aleksandar Markovic wrote:
>> From: Alex Bennée <alex.bennee@linaro.org>
>> Sent: Sunday, June 2, 2019 3:22 PM
>> To: qemu-devel@nongnu.org
>> Cc: Aleksandar Rikalo; Aleksandar Markovic; aurelien@aurel32.net
>> Subject: Re: [Qemu-devel] [PATCH 1/2] target/mips: Improve performance for MSA binary operations
>
>> Mateja Marjanovic <mateja.marjanovic@rt-rk.com> writes:
>>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>>
>>> Eliminate loops for better performance.
>> Have you done any measurements of the bellow loop unrolling? Because
>> this is something that maybe we can achieve and let the compiler make
>> the choice.
> I know that Mateja did extensive performance measurements, and I am
> asking him to give us some samples.
Yes, here are some of the binop instructions with a loop and without
it:

|| intruction   ||       with-loop     ||    without-loop   ||
===============================
||  addv.b      ||    147.356 ms     ||    84.249 ms     ||
||  addv.h      ||     83.612 ms      ||    44.808 ms     ||
||  addv.w      ||     49.952 ms     ||    34.128 ms     ||
||  addv.d      ||     29.834 ms      ||    34.103 ms     ||
||  asub_s.b   ||    187.939 ms     ||    143.658 ms   ||
||  asub_s.h   ||    105.266 ms     ||     82.424 ms    ||
||  asub_s.w  ||     68.941 ms      ||     57.923 ms    ||
||  asub_s.d   ||     41.536 ms      ||     42.092 ms    ||
||  max_s.b    ||    156.036 ms    ||    100.628 ms    ||
||  max_s.h    ||     87.100 ms     ||     64.905 ms     ||
||  max_s.w   ||     52.339 ms     ||     40.632 ms     ||
||  max_s.d    ||     34.873 ms     ||     35.562 ms     ||
===============================

>
> As for code generation, here are disassemblies of function
> helper_msa_add_a_df() before and after this patch:
>
> (it is visible the compiler did not perform unrolling loops by itself)
>
> BEFORE:
>
> Dump of assembler code for function helper_msa_add_a_df:
>     0x00000000001500b0 <+0>:	cmp    $0x1,%esi
>     0x00000000001500b3 <+3>:	je     0x150258 <helper_msa_add_a_df+424>
>     0x00000000001500b9 <+9>:	jb     0x1501e8 <helper_msa_add_a_df+312>
>     0x00000000001500bf <+15>:	cmp    $0x2,%esi
>     0x00000000001500c2 <+18>:	je     0x150180 <helper_msa_add_a_df+208>
>     0x00000000001500c8 <+24>:	cmp    $0x3,%esi
>     0x00000000001500cb <+27>:	jne    0x1502c2 <helper_msa_add_a_df+530>
>     0x00000000001500d1 <+33>:	mov    %ecx,%ecx
>     0x00000000001500d3 <+35>:	mov    %edx,%edx
>     0x00000000001500d5 <+37>:	lea    0x22(%rcx),%rax
>     0x00000000001500d9 <+41>:	lea    0x22(%rdx),%r10
>     0x00000000001500dd <+45>:	shl    $0x4,%rcx
>     0x00000000001500e1 <+49>:	add    %rdi,%rcx
>     0x00000000001500e4 <+52>:	shl    $0x4,%rdx
>     0x00000000001500e8 <+56>:	shl    $0x4,%rax
>     0x00000000001500ec <+60>:	shl    $0x4,%r10
>     0x00000000001500f0 <+64>:	add    %rdi,%rax
>     0x00000000001500f3 <+67>:	mov    0x8(%rax),%r9
>     0x00000000001500f7 <+71>:	mov    0x8(%rax),%rsi
>     0x00000000001500fb <+75>:	mov    %r8d,%eax
>     0x00000000001500fe <+78>:	sar    $0x3f,%r9
>     0x0000000000150102 <+82>:	xor    %r9,%rsi
>     0x0000000000150105 <+85>:	sub    %r9,%rsi
>     0x0000000000150108 <+88>:	mov    %rsi,%r9
>     0x000000000015010b <+91>:	lea    0x22(%rax),%rsi
>     0x000000000015010f <+95>:	shl    $0x4,%rax
>     0x0000000000150113 <+99>:	lea    (%rdi,%rax,1),%rax
>     0x0000000000150117 <+103>:	shl    $0x4,%rsi
>     0x000000000015011b <+107>:	add    %rdi,%rsi
>     0x000000000015011e <+110>:	mov    0x8(%rsi),%r8
>     0x0000000000150122 <+114>:	mov    0x8(%rsi),%r11
>     0x0000000000150126 <+118>:	sar    $0x3f,%r8
>     0x000000000015012a <+122>:	xor    %r8,%r11
>     0x000000000015012d <+125>:	mov    %r11,%rsi
>     0x0000000000150130 <+128>:	sub    %r8,%rsi
>     0x0000000000150133 <+131>:	add    %r9,%rsi
>     0x0000000000150136 <+134>:	mov    %rsi,0x8(%rdi,%r10,1)
>     0x000000000015013b <+139>:	mov    0x230(%rcx),%rsi
>     0x0000000000150142 <+146>:	mov    0x230(%rcx),%r8
>     0x0000000000150149 <+153>:	sar    $0x3f,%rsi
>     0x000000000015014d <+157>:	xor    %rsi,%r8
>     0x0000000000150150 <+160>:	mov    %r8,%rcx
>     0x0000000000150153 <+163>:	mov    0x230(%rax),%r8
>     0x000000000015015a <+170>:	sub    %rsi,%rcx
>     0x000000000015015d <+173>:	mov    0x230(%rax),%rsi
>     0x0000000000150164 <+180>:	sar    $0x3f,%rsi
>     0x0000000000150168 <+184>:	xor    %rsi,%r8
>     0x000000000015016b <+187>:	mov    %r8,%rax
>     0x000000000015016e <+190>:	sub    %rsi,%rax
>     0x0000000000150171 <+193>:	add    %rcx,%rax
>     0x0000000000150174 <+196>:	mov    %rax,0x230(%rdi,%rdx,1)
>     0x000000000015017c <+204>:	retq
>     0x000000000015017d <+205>:	nopl   (%rax)
>     0x0000000000150180 <+208>:	mov    %r8d,%r8d
>     0x0000000000150183 <+211>:	mov    %ecx,%ecx
>     0x0000000000150185 <+213>:	mov    %edx,%edx
>     0x0000000000150187 <+215>:	mov    %r8,%rax
>     0x000000000015018a <+218>:	neg    %r8
>     0x000000000015018d <+221>:	shl    $0x4,%rcx
>     0x0000000000150191 <+225>:	shl    $0x4,%rax
>     0x0000000000150195 <+229>:	shl    $0x4,%r8
>     0x0000000000150199 <+233>:	shl    $0x4,%rdx
>     0x000000000015019d <+237>:	lea    0x228(%rdi,%rax,1),%r9
>     0x00000000001501a5 <+245>:	lea    0x238(%rdi,%rax,1),%rdi
>     0x00000000001501ad <+253>:	lea    (%r9,%r8,1),%r10
>     0x00000000001501b1 <+257>:	add    $0x4,%r9
>     0x00000000001501b5 <+261>:	movslq (%r10,%rcx,1),%rax
>     0x00000000001501b9 <+265>:	mov    %rax,%rsi
>     0x00000000001501bc <+268>:	sar    $0x3f,%rsi
>     0x00000000001501c0 <+272>:	xor    %rsi,%rax
>     0x00000000001501c3 <+275>:	sub    %rsi,%rax
>     0x00000000001501c6 <+278>:	movslq -0x4(%r9),%rsi
>     0x00000000001501ca <+282>:	mov    %rsi,%r11
>     0x00000000001501cd <+285>:	sar    $0x3f,%r11
>     0x00000000001501d1 <+289>:	xor    %r11,%rsi
>     0x00000000001501d4 <+292>:	sub    %r11,%rsi
>     0x00000000001501d7 <+295>:	add    %rsi,%rax
>     0x00000000001501da <+298>:	cmp    %rdi,%r9
>     0x00000000001501dd <+301>:	mov    %eax,(%r10,%rdx,1)
>     0x00000000001501e1 <+305>:	jne    0x1501ad <helper_msa_add_a_df+253>
>     0x00000000001501e3 <+307>:	repz retq
>     0x00000000001501e5 <+309>:	nopl   (%rax)
>     0x00000000001501e8 <+312>:	mov    %r8d,%r8d
>     0x00000000001501eb <+315>:	mov    %ecx,%ecx
>     0x00000000001501ed <+317>:	mov    %edx,%edx
>     0x00000000001501ef <+319>:	mov    %r8,%rax
>     0x00000000001501f2 <+322>:	neg    %r8
>     0x00000000001501f5 <+325>:	shl    $0x4,%rcx
>     0x00000000001501f9 <+329>:	shl    $0x4,%rax
>     0x00000000001501fd <+333>:	shl    $0x4,%r8
>     0x0000000000150201 <+337>:	shl    $0x4,%rdx
>     0x0000000000150205 <+341>:	lea    0x228(%rdi,%rax,1),%r9
>     0x000000000015020d <+349>:	lea    0x238(%rdi,%rax,1),%r11
>     0x0000000000150215 <+357>:	nopl   (%rax)
>     0x0000000000150218 <+360>:	lea    (%r8,%r9,1),%rdi
>     0x000000000015021c <+364>:	add    $0x1,%r9
>     0x0000000000150220 <+368>:	movsbq (%rdi,%rcx,1),%rax
>     0x0000000000150225 <+373>:	mov    %rax,%rsi
>     0x0000000000150228 <+376>:	sar    $0x3f,%rsi
>     0x000000000015022c <+380>:	xor    %rsi,%rax
>     0x000000000015022f <+383>:	sub    %rsi,%rax
>     0x0000000000150232 <+386>:	movsbq -0x1(%r9),%rsi
>     0x0000000000150237 <+391>:	mov    %rsi,%r10
>     0x000000000015023a <+394>:	sar    $0x3f,%r10
>     0x000000000015023e <+398>:	xor    %r10,%rsi
>     0x0000000000150241 <+401>:	sub    %r10,%rsi
>     0x0000000000150244 <+404>:	add    %rsi,%rax
>     0x0000000000150247 <+407>:	cmp    %r9,%r11
>     0x000000000015024a <+410>:	mov    %al,(%rdi,%rdx,1)
>     0x000000000015024d <+413>:	jne    0x150218 <helper_msa_add_a_df+360>
>     0x000000000015024f <+415>:	repz retq
>     0x0000000000150251 <+417>:	nopl   0x0(%rax)
>     0x0000000000150258 <+424>:	mov    %r8d,%r8d
>     0x000000000015025b <+427>:	mov    %ecx,%ecx
>     0x000000000015025d <+429>:	mov    %edx,%edx
>     0x000000000015025f <+431>:	mov    %r8,%rax
>     0x0000000000150262 <+434>:	neg    %r8
>     0x0000000000150265 <+437>:	shl    $0x4,%rcx
>     0x0000000000150269 <+441>:	shl    $0x4,%rax
>     0x000000000015026d <+445>:	shl    $0x4,%r8
>     0x0000000000150271 <+449>:	shl    $0x4,%rdx
>     0x0000000000150275 <+453>:	lea    0x228(%rdi,%rax,1),%r9
>     0x000000000015027d <+461>:	lea    0x238(%rdi,%rax,1),%r10
>     0x0000000000150285 <+469>:	nopl   (%rax)
>     0x0000000000150288 <+472>:	lea    (%r8,%r9,1),%rdi
>     0x000000000015028c <+476>:	add    $0x2,%r9
>     0x0000000000150290 <+480>:	movswq (%rdi,%rcx,1),%rax
>     0x0000000000150295 <+485>:	mov    %rax,%rsi
>     0x0000000000150298 <+488>:	sar    $0x3f,%rsi
>     0x000000000015029c <+492>:	xor    %rsi,%rax
>     0x000000000015029f <+495>:	sub    %rsi,%rax
>     0x00000000001502a2 <+498>:	movswq -0x2(%r9),%rsi
>     0x00000000001502a7 <+503>:	mov    %rsi,%r11
>     0x00000000001502aa <+506>:	sar    $0x3f,%r11
>     0x00000000001502ae <+510>:	xor    %r11,%rsi
>     0x00000000001502b1 <+513>:	sub    %r11,%rsi
>     0x00000000001502b4 <+516>:	add    %rsi,%rax
>     0x00000000001502b7 <+519>:	cmp    %r10,%r9
>     0x00000000001502ba <+522>:	mov    %ax,(%rdi,%rdx,1)
>     0x00000000001502be <+526>:	jne    0x150288 <helper_msa_add_a_df+472>
>     0x00000000001502c0 <+528>:	repz retq
>     0x00000000001502c2 <+530>:	lea    0x13c3b7(%rip),%rcx        # 0x28c680 <__PRETTY_FUNCTION__.26062>
>     0x00000000001502c9 <+537>:	lea    0x13b830(%rip),%rsi        # 0x28bb00
>     0x00000000001502d0 <+544>:	lea    0x1c7204(%rip),%rdi        # 0x3174db
>     0x00000000001502d7 <+551>:	sub    $0x8,%rsp
>     0x00000000001502db <+555>:	mov    $0x357,%edx
>     0x00000000001502e0 <+560>:	callq  0x8eeb8
> End of assembler dump.
>
>
> AFTER:
>
>     0x00000000001548d0 <+0>:	cmp    $0x1,%esi
>     0x00000000001548d3 <+3>:	je     0x154e00 <helper_msa_add_a_df+1328>
>     0x00000000001548d9 <+9>:	jb     0x154a98 <helper_msa_add_a_df+456>
>     0x00000000001548df <+15>:	cmp    $0x2,%esi
>     0x00000000001548e2 <+18>:	je     0x1549a0 <helper_msa_add_a_df+208>
>     0x00000000001548e8 <+24>:	cmp    $0x3,%esi
>     0x00000000001548eb <+27>:	jne    0x154fd1 <helper_msa_add_a_df+1793>
>     0x00000000001548f1 <+33>:	mov    %ecx,%eax
>     0x00000000001548f3 <+35>:	mov    %r8d,%r8d
>     0x00000000001548f6 <+38>:	mov    %edx,%edx
>     0x00000000001548f8 <+40>:	lea    0x22(%rax),%rcx
>     0x00000000001548fc <+44>:	lea    0x22(%rdx),%r9
>     0x0000000000154900 <+48>:	shl    $0x4,%rax
>     0x0000000000154904 <+52>:	add    %rdi,%rax
>     0x0000000000154907 <+55>:	shl    $0x4,%rdx
>     0x000000000015490b <+59>:	shl    $0x4,%rcx
>     0x000000000015490f <+63>:	shl    $0x4,%r9
>     0x0000000000154913 <+67>:	add    %rdi,%rcx
>     0x0000000000154916 <+70>:	mov    0x8(%rcx),%rsi
>     0x000000000015491a <+74>:	mov    0x8(%rcx),%r11
>     0x000000000015491e <+78>:	sar    $0x3f,%rsi
>     0x0000000000154922 <+82>:	xor    %rsi,%r11
>     0x0000000000154925 <+85>:	mov    %r11,%rcx
>     0x0000000000154928 <+88>:	sub    %rsi,%rcx
>     0x000000000015492b <+91>:	mov    %rcx,%rsi
>     0x000000000015492e <+94>:	lea    0x22(%r8),%rcx
>     0x0000000000154932 <+98>:	shl    $0x4,%r8
>     0x0000000000154936 <+102>:	add    %rdi,%r8
>     0x0000000000154939 <+105>:	shl    $0x4,%rcx
>     0x000000000015493d <+109>:	add    %rdi,%rcx
>     0x0000000000154940 <+112>:	mov    0x8(%rcx),%r10
>     0x0000000000154944 <+116>:	mov    0x8(%rcx),%r11
>     0x0000000000154948 <+120>:	sar    $0x3f,%r10
>     0x000000000015494c <+124>:	xor    %r10,%r11
>     0x000000000015494f <+127>:	mov    %r11,%rcx
>     0x0000000000154952 <+130>:	sub    %r10,%rcx
>     0x0000000000154955 <+133>:	add    %rsi,%rcx
>     0x0000000000154958 <+136>:	mov    %rcx,0x8(%rdi,%r9,1)
>     0x000000000015495d <+141>:	mov    0x230(%rax),%rcx
>     0x0000000000154964 <+148>:	mov    0x230(%rax),%rsi
>     0x000000000015496b <+155>:	sar    $0x3f,%rcx
>     0x000000000015496f <+159>:	xor    %rcx,%rsi
>     0x0000000000154972 <+162>:	mov    %rsi,%rax
>     0x0000000000154975 <+165>:	mov    0x230(%r8),%rsi
>     0x000000000015497c <+172>:	sub    %rcx,%rax
>     0x000000000015497f <+175>:	mov    %rax,%rcx
>     0x0000000000154982 <+178>:	mov    0x230(%r8),%rax
>     0x0000000000154989 <+185>:	sar    $0x3f,%rsi
>     0x000000000015498d <+189>:	xor    %rsi,%rax
>     0x0000000000154990 <+192>:	sub    %rsi,%rax
>     0x0000000000154993 <+195>:	add    %rcx,%rax
>     0x0000000000154996 <+198>:	mov    %rax,0x230(%rdi,%rdx,1)
>     0x000000000015499e <+206>:	retq
>     0x000000000015499f <+207>:	nop
>     0x00000000001549a0 <+208>:	mov    %ecx,%ecx
>     0x00000000001549a2 <+210>:	mov    %r8d,%r8d
>     0x00000000001549a5 <+213>:	mov    %edx,%edx
>     0x00000000001549a7 <+215>:	lea    0x22(%rcx),%rax
>     0x00000000001549ab <+219>:	lea    0x22(%rdx),%r9
>     0x00000000001549af <+223>:	shl    $0x4,%rcx
>     0x00000000001549b3 <+227>:	add    %rdi,%rcx
>     0x00000000001549b6 <+230>:	shl    $0x4,%rdx
>     0x00000000001549ba <+234>:	shl    $0x4,%rax
>     0x00000000001549be <+238>:	shl    $0x4,%r9
>     0x00000000001549c2 <+242>:	add    %rdi,%rdx
>     0x00000000001549c5 <+245>:	movslq 0x8(%rdi,%rax,1),%rax
>     0x00000000001549ca <+250>:	mov    %rax,%rsi
>     0x00000000001549cd <+253>:	sar    $0x3f,%rsi
>     0x00000000001549d1 <+257>:	xor    %rsi,%rax
>     0x00000000001549d4 <+260>:	sub    %rsi,%rax
>     0x00000000001549d7 <+263>:	lea    0x22(%r8),%rsi
>     0x00000000001549db <+267>:	shl    $0x4,%r8
>     0x00000000001549df <+271>:	shl    $0x4,%rsi
>     0x00000000001549e3 <+275>:	movslq 0x8(%rdi,%rsi,1),%rsi
>     0x00000000001549e8 <+280>:	mov    %rsi,%r10
>     0x00000000001549eb <+283>:	sar    $0x3f,%r10
>     0x00000000001549ef <+287>:	xor    %r10,%rsi
>     0x00000000001549f2 <+290>:	sub    %r10,%rsi
>     0x00000000001549f5 <+293>:	add    %rsi,%rax
>     0x00000000001549f8 <+296>:	mov    %eax,0x8(%rdi,%r9,1)
>     0x00000000001549fd <+301>:	movslq 0x22c(%rcx),%rax
>     0x0000000000154a04 <+308>:	add    %r8,%rdi
>     0x0000000000154a07 <+311>:	mov    %rax,%rsi
>     0x0000000000154a0a <+314>:	sar    $0x3f,%rsi
>     0x0000000000154a0e <+318>:	xor    %rsi,%rax
>     0x0000000000154a11 <+321>:	sub    %rsi,%rax
>     0x0000000000154a14 <+324>:	movslq 0x22c(%rdi),%rsi
>     0x0000000000154a1b <+331>:	mov    %rsi,%r8
>     0x0000000000154a1e <+334>:	sar    $0x3f,%r8
>     0x0000000000154a22 <+338>:	xor    %r8,%rsi
>     0x0000000000154a25 <+341>:	sub    %r8,%rsi
>     0x0000000000154a28 <+344>:	add    %rsi,%rax
>     0x0000000000154a2b <+347>:	mov    %eax,0x22c(%rdx)
>     0x0000000000154a31 <+353>:	movslq 0x230(%rcx),%rax
>     0x0000000000154a38 <+360>:	mov    %rax,%rsi
>     0x0000000000154a3b <+363>:	sar    $0x3f,%rsi
>     0x0000000000154a3f <+367>:	xor    %rsi,%rax
>     0x0000000000154a42 <+370>:	sub    %rsi,%rax
>     0x0000000000154a45 <+373>:	movslq 0x230(%rdi),%rsi
>     0x0000000000154a4c <+380>:	mov    %rsi,%r8
>     0x0000000000154a4f <+383>:	sar    $0x3f,%r8
>     0x0000000000154a53 <+387>:	xor    %r8,%rsi
>     0x0000000000154a56 <+390>:	sub    %r8,%rsi
>     0x0000000000154a59 <+393>:	add    %rsi,%rax
>     0x0000000000154a5c <+396>:	mov    %eax,0x230(%rdx)
>     0x0000000000154a62 <+402>:	movslq 0x234(%rcx),%rax
>     0x0000000000154a69 <+409>:	mov    %rax,%rcx
>     0x0000000000154a6c <+412>:	sar    $0x3f,%rcx
>     0x0000000000154a70 <+416>:	xor    %rcx,%rax
>     0x0000000000154a73 <+419>:	sub    %rcx,%rax
>     0x0000000000154a76 <+422>:	movslq 0x234(%rdi),%rcx
>     0x0000000000154a7d <+429>:	mov    %rcx,%rsi
>     0x0000000000154a80 <+432>:	sar    $0x3f,%rsi
>     0x0000000000154a84 <+436>:	xor    %rsi,%rcx
>     0x0000000000154a87 <+439>:	sub    %rsi,%rcx
>     0x0000000000154a8a <+442>:	add    %rcx,%rax
>     0x0000000000154a8d <+445>:	mov    %eax,0x234(%rdx)
>     0x0000000000154a93 <+451>:	retq
>     0x0000000000154a94 <+452>:	nopl   0x0(%rax)
>     0x0000000000154a98 <+456>:	mov    %ecx,%eax
>     0x0000000000154a9a <+458>:	mov    %r8d,%r8d
>     0x0000000000154a9d <+461>:	mov    %edx,%edx
>     0x0000000000154a9f <+463>:	lea    0x22(%rax),%rcx
>     0x0000000000154aa3 <+467>:	lea    0x22(%rdx),%r9
>     0x0000000000154aa7 <+471>:	shl    $0x4,%rax
>     0x0000000000154aab <+475>:	lea    (%rdi,%rax,1),%rax
>     0x0000000000154aaf <+479>:	shl    $0x4,%rdx
>     0x0000000000154ab3 <+483>:	shl    $0x4,%rcx
>     0x0000000000154ab7 <+487>:	shl    $0x4,%r9
>     0x0000000000154abb <+491>:	add    %rdi,%rdx
>     0x0000000000154abe <+494>:	movsbq 0x8(%rdi,%rcx,1),%rsi
>     0x0000000000154ac4 <+500>:	mov    %rsi,%rcx
>     0x0000000000154ac7 <+503>:	sar    $0x3f,%rcx
>     0x0000000000154acb <+507>:	xor    %rcx,%rsi
>     0x0000000000154ace <+510>:	sub    %rcx,%rsi
>     0x0000000000154ad1 <+513>:	lea    0x22(%r8),%rcx
>     0x0000000000154ad5 <+517>:	shl    $0x4,%r8
>     0x0000000000154ad9 <+521>:	shl    $0x4,%rcx
>     0x0000000000154add <+525>:	movsbq 0x8(%rdi,%rcx,1),%rcx
>     0x0000000000154ae3 <+531>:	mov    %rcx,%r10
>     0x0000000000154ae6 <+534>:	sar    $0x3f,%r10
>     0x0000000000154aea <+538>:	xor    %r10,%rcx
>     0x0000000000154aed <+541>:	sub    %r10,%rcx
>     0x0000000000154af0 <+544>:	add    %rcx,%rsi
>     0x0000000000154af3 <+547>:	mov    %sil,0x8(%rdi,%r9,1)
>     0x0000000000154af8 <+552>:	movsbq 0x229(%rax),%rcx
>     0x0000000000154b00 <+560>:	add    %r8,%rdi
>     0x0000000000154b03 <+563>:	mov    %rcx,%rsi
>     0x0000000000154b06 <+566>:	sar    $0x3f,%rsi
>     0x0000000000154b0a <+570>:	xor    %rsi,%rcx
>     0x0000000000154b0d <+573>:	sub    %rsi,%rcx
>     0x0000000000154b10 <+576>:	movsbq 0x229(%rdi),%rsi
>     0x0000000000154b18 <+584>:	mov    %rsi,%r8
>     0x0000000000154b1b <+587>:	sar    $0x3f,%r8
>     0x0000000000154b1f <+591>:	xor    %r8,%rsi
>     0x0000000000154b22 <+594>:	sub    %r8,%rsi
>     0x0000000000154b25 <+597>:	add    %rsi,%rcx
>     0x0000000000154b28 <+600>:	mov    %cl,0x229(%rdx)
>     0x0000000000154b2e <+606>:	movsbq 0x22a(%rax),%rcx
>     0x0000000000154b36 <+614>:	mov    %rcx,%rsi
>     0x0000000000154b39 <+617>:	sar    $0x3f,%rsi
>     0x0000000000154b3d <+621>:	xor    %rsi,%rcx
>     0x0000000000154b40 <+624>:	sub    %rsi,%rcx
>     0x0000000000154b43 <+627>:	movsbq 0x22a(%rdi),%rsi
>     0x0000000000154b4b <+635>:	mov    %rsi,%r8
>     0x0000000000154b4e <+638>:	sar    $0x3f,%r8
>     0x0000000000154b52 <+642>:	xor    %r8,%rsi
>     0x0000000000154b55 <+645>:	sub    %r8,%rsi
>     0x0000000000154b58 <+648>:	add    %rsi,%rcx
>     0x0000000000154b5b <+651>:	mov    %cl,0x22a(%rdx)
>     0x0000000000154b61 <+657>:	movsbq 0x22b(%rax),%rcx
>     0x0000000000154b69 <+665>:	mov    %rcx,%rsi
>     0x0000000000154b6c <+668>:	sar    $0x3f,%rsi
>     0x0000000000154b70 <+672>:	xor    %rsi,%rcx
>     0x0000000000154b73 <+675>:	sub    %rsi,%rcx
>     0x0000000000154b76 <+678>:	movsbq 0x22b(%rdi),%rsi
>     0x0000000000154b7e <+686>:	mov    %rsi,%r8
>     0x0000000000154b81 <+689>:	sar    $0x3f,%r8
>     0x0000000000154b85 <+693>:	xor    %r8,%rsi
>     0x0000000000154b88 <+696>:	sub    %r8,%rsi
>     0x0000000000154b8b <+699>:	add    %rsi,%rcx
>     0x0000000000154b8e <+702>:	mov    %cl,0x22b(%rdx)
>     0x0000000000154b94 <+708>:	movsbq 0x22c(%rax),%rcx
>     0x0000000000154b9c <+716>:	mov    %rcx,%rsi
>     0x0000000000154b9f <+719>:	sar    $0x3f,%rsi
>     0x0000000000154ba3 <+723>:	xor    %rsi,%rcx
>     0x0000000000154ba6 <+726>:	sub    %rsi,%rcx
>     0x0000000000154ba9 <+729>:	movsbq 0x22c(%rdi),%rsi
>     0x0000000000154bb1 <+737>:	mov    %rsi,%r8
>     0x0000000000154bb4 <+740>:	sar    $0x3f,%r8
>     0x0000000000154bb8 <+744>:	xor    %r8,%rsi
>     0x0000000000154bbb <+747>:	sub    %r8,%rsi
>     0x0000000000154bbe <+750>:	add    %rsi,%rcx
>     0x0000000000154bc1 <+753>:	mov    %cl,0x22c(%rdx)
>     0x0000000000154bc7 <+759>:	movsbq 0x22d(%rax),%rcx
>     0x0000000000154bcf <+767>:	mov    %rcx,%rsi
>     0x0000000000154bd2 <+770>:	sar    $0x3f,%rsi
>     0x0000000000154bd6 <+774>:	xor    %rsi,%rcx
>     0x0000000000154bd9 <+777>:	sub    %rsi,%rcx
>     0x0000000000154bdc <+780>:	movsbq 0x22d(%rdi),%rsi
>     0x0000000000154be4 <+788>:	mov    %rsi,%r8
>     0x0000000000154be7 <+791>:	sar    $0x3f,%r8
>     0x0000000000154beb <+795>:	xor    %r8,%rsi
>     0x0000000000154bee <+798>:	sub    %r8,%rsi
>     0x0000000000154bf1 <+801>:	add    %rsi,%rcx
>     0x0000000000154bf4 <+804>:	mov    %cl,0x22d(%rdx)
>     0x0000000000154bfa <+810>:	movsbq 0x22e(%rax),%rcx
>     0x0000000000154c02 <+818>:	mov    %rcx,%rsi
>     0x0000000000154c05 <+821>:	sar    $0x3f,%rsi
>     0x0000000000154c09 <+825>:	xor    %rsi,%rcx
>     0x0000000000154c0c <+828>:	sub    %rsi,%rcx
>     0x0000000000154c0f <+831>:	movsbq 0x22e(%rdi),%rsi
>     0x0000000000154c17 <+839>:	mov    %rsi,%r8
>     0x0000000000154c1a <+842>:	sar    $0x3f,%r8
>     0x0000000000154c1e <+846>:	xor    %r8,%rsi
>     0x0000000000154c21 <+849>:	sub    %r8,%rsi
>     0x0000000000154c24 <+852>:	add    %rsi,%rcx
>     0x0000000000154c27 <+855>:	mov    %cl,0x22e(%rdx)
>     0x0000000000154c2d <+861>:	movsbq 0x22f(%rax),%rcx
>     0x0000000000154c35 <+869>:	mov    %rcx,%rsi
>     0x0000000000154c38 <+872>:	sar    $0x3f,%rsi
>     0x0000000000154c3c <+876>:	xor    %rsi,%rcx
>     0x0000000000154c3f <+879>:	sub    %rsi,%rcx
>     0x0000000000154c42 <+882>:	movsbq 0x22f(%rdi),%rsi
>     0x0000000000154c4a <+890>:	mov    %rsi,%r8
>     0x0000000000154c4d <+893>:	sar    $0x3f,%r8
>     0x0000000000154c51 <+897>:	xor    %r8,%rsi
>     0x0000000000154c54 <+900>:	sub    %r8,%rsi
>     0x0000000000154c57 <+903>:	add    %rsi,%rcx
>     0x0000000000154c5a <+906>:	mov    %cl,0x22f(%rdx)
>     0x0000000000154c60 <+912>:	movsbq 0x230(%rax),%rcx
>     0x0000000000154c68 <+920>:	mov    %rcx,%rsi
>     0x0000000000154c6b <+923>:	sar    $0x3f,%rsi
>     0x0000000000154c6f <+927>:	xor    %rsi,%rcx
>     0x0000000000154c72 <+930>:	sub    %rsi,%rcx
>     0x0000000000154c75 <+933>:	movsbq 0x230(%rdi),%rsi
>     0x0000000000154c7d <+941>:	mov    %rsi,%r8
>     0x0000000000154c80 <+944>:	sar    $0x3f,%r8
>     0x0000000000154c84 <+948>:	xor    %r8,%rsi
>     0x0000000000154c87 <+951>:	sub    %r8,%rsi
>     0x0000000000154c8a <+954>:	add    %rsi,%rcx
>     0x0000000000154c8d <+957>:	mov    %cl,0x230(%rdx)
>     0x0000000000154c93 <+963>:	movsbq 0x231(%rax),%rcx
>     0x0000000000154c9b <+971>:	mov    %rcx,%rsi
>     0x0000000000154c9e <+974>:	sar    $0x3f,%rsi
>     0x0000000000154ca2 <+978>:	xor    %rsi,%rcx
>     0x0000000000154ca5 <+981>:	sub    %rsi,%rcx
>     0x0000000000154ca8 <+984>:	movsbq 0x231(%rdi),%rsi
>     0x0000000000154cb0 <+992>:	mov    %rsi,%r8
>     0x0000000000154cb3 <+995>:	sar    $0x3f,%r8
>     0x0000000000154cb7 <+999>:	xor    %r8,%rsi
>     0x0000000000154cba <+1002>:	sub    %r8,%rsi
>     0x0000000000154cbd <+1005>:	add    %rsi,%rcx
>     0x0000000000154cc0 <+1008>:	mov    %cl,0x231(%rdx)
>     0x0000000000154cc6 <+1014>:	movsbq 0x232(%rax),%rcx
>     0x0000000000154cce <+1022>:	mov    %rcx,%rsi
>     0x0000000000154cd1 <+1025>:	sar    $0x3f,%rsi
>     0x0000000000154cd5 <+1029>:	xor    %rsi,%rcx
>     0x0000000000154cd8 <+1032>:	sub    %rsi,%rcx
>     0x0000000000154cdb <+1035>:	movsbq 0x232(%rdi),%rsi
>     0x0000000000154ce3 <+1043>:	mov    %rsi,%r8
>     0x0000000000154ce6 <+1046>:	sar    $0x3f,%r8
>     0x0000000000154cea <+1050>:	xor    %r8,%rsi
>     0x0000000000154ced <+1053>:	sub    %r8,%rsi
>     0x0000000000154cf0 <+1056>:	add    %rsi,%rcx
>     0x0000000000154cf3 <+1059>:	mov    %cl,0x232(%rdx)
>     0x0000000000154cf9 <+1065>:	movsbq 0x233(%rdi),%rcx
>     0x0000000000154d01 <+1073>:	mov    %rcx,%rsi
>     0x0000000000154d04 <+1076>:	sar    $0x3f,%rsi
>     0x0000000000154d08 <+1080>:	xor    %rsi,%rcx
>     0x0000000000154d0b <+1083>:	sub    %rsi,%rcx
>     0x0000000000154d0e <+1086>:	movsbq 0x233(%rax),%rsi
>     0x0000000000154d16 <+1094>:	mov    %rsi,%r8
>     0x0000000000154d19 <+1097>:	sar    $0x3f,%r8
>     0x0000000000154d1d <+1101>:	xor    %r8,%rsi
>     0x0000000000154d20 <+1104>:	sub    %r8,%rsi
>     0x0000000000154d23 <+1107>:	add    %rsi,%rcx
>     0x0000000000154d26 <+1110>:	mov    %cl,0x233(%rdx)
>     0x0000000000154d2c <+1116>:	movsbq 0x234(%rdi),%rcx
>     0x0000000000154d34 <+1124>:	mov    %rcx,%rsi
>     0x0000000000154d37 <+1127>:	sar    $0x3f,%rsi
>     0x0000000000154d3b <+1131>:	xor    %rsi,%rcx
>     0x0000000000154d3e <+1134>:	sub    %rsi,%rcx
>     0x0000000000154d41 <+1137>:	movsbq 0x234(%rax),%rsi
>     0x0000000000154d49 <+1145>:	mov    %rsi,%r8
>     0x0000000000154d4c <+1148>:	sar    $0x3f,%r8
>     0x0000000000154d50 <+1152>:	xor    %r8,%rsi
>     0x0000000000154d53 <+1155>:	sub    %r8,%rsi
>     0x0000000000154d56 <+1158>:	add    %rsi,%rcx
>     0x0000000000154d59 <+1161>:	mov    %cl,0x234(%rdx)
>     0x0000000000154d5f <+1167>:	movsbq 0x235(%rax),%rcx
>     0x0000000000154d67 <+1175>:	mov    %rcx,%rsi
>     0x0000000000154d6a <+1178>:	sar    $0x3f,%rsi
>     0x0000000000154d6e <+1182>:	xor    %rsi,%rcx
>     0x0000000000154d71 <+1185>:	sub    %rsi,%rcx
>     0x0000000000154d74 <+1188>:	movsbq 0x235(%rdi),%rsi
>     0x0000000000154d7c <+1196>:	mov    %rsi,%r8
>     0x0000000000154d7f <+1199>:	sar    $0x3f,%r8
>     0x0000000000154d83 <+1203>:	xor    %r8,%rsi
>     0x0000000000154d86 <+1206>:	sub    %r8,%rsi
>     0x0000000000154d89 <+1209>:	add    %rsi,%rcx
>     0x0000000000154d8c <+1212>:	mov    %cl,0x235(%rdx)
>     0x0000000000154d92 <+1218>:	movsbq 0x236(%rdi),%rcx
>     0x0000000000154d9a <+1226>:	mov    %rcx,%rsi
>     0x0000000000154d9d <+1229>:	sar    $0x3f,%rsi
>     0x0000000000154da1 <+1233>:	xor    %rsi,%rcx
>     0x0000000000154da4 <+1236>:	sub    %rsi,%rcx
>     0x0000000000154da7 <+1239>:	movsbq 0x236(%rax),%rsi
>     0x0000000000154daf <+1247>:	mov    %rsi,%r8
>     0x0000000000154db2 <+1250>:	sar    $0x3f,%r8
>     0x0000000000154db6 <+1254>:	xor    %r8,%rsi
>     0x0000000000154db9 <+1257>:	sub    %r8,%rsi
>     0x0000000000154dbc <+1260>:	add    %rsi,%rcx
>     0x0000000000154dbf <+1263>:	mov    %cl,0x236(%rdx)
>     0x0000000000154dc5 <+1269>:	movsbq 0x237(%rax),%rax
>     0x0000000000154dcd <+1277>:	mov    %rax,%rcx
>     0x0000000000154dd0 <+1280>:	sar    $0x3f,%rcx
>     0x0000000000154dd4 <+1284>:	xor    %rcx,%rax
>     0x0000000000154dd7 <+1287>:	sub    %rcx,%rax
>     0x0000000000154dda <+1290>:	movsbq 0x237(%rdi),%rcx
>     0x0000000000154de2 <+1298>:	mov    %rcx,%rsi
>     0x0000000000154de5 <+1301>:	sar    $0x3f,%rsi
>     0x0000000000154de9 <+1305>:	xor    %rsi,%rcx
>     0x0000000000154dec <+1308>:	sub    %rsi,%rcx
>     0x0000000000154def <+1311>:	add    %rcx,%rax
>     0x0000000000154df2 <+1314>:	mov    %al,0x237(%rdx)
>     0x0000000000154df8 <+1320>:	retq
>     0x0000000000154df9 <+1321>:	nopl   0x0(%rax)
>     0x0000000000154e00 <+1328>:	mov    %ecx,%eax
>     0x0000000000154e02 <+1330>:	mov    %r8d,%r8d
>     0x0000000000154e05 <+1333>:	mov    %edx,%edx
>     0x0000000000154e07 <+1335>:	lea    0x22(%rax),%rcx
>     0x0000000000154e0b <+1339>:	lea    0x22(%rdx),%r9
>     0x0000000000154e0f <+1343>:	shl    $0x4,%rax
>     0x0000000000154e13 <+1347>:	lea    (%rdi,%rax,1),%rax
>     0x0000000000154e17 <+1351>:	shl    $0x4,%rdx
>     0x0000000000154e1b <+1355>:	shl    $0x4,%rcx
>     0x0000000000154e1f <+1359>:	shl    $0x4,%r9
>     0x0000000000154e23 <+1363>:	add    %rdi,%rdx
>     0x0000000000154e26 <+1366>:	movswq 0x8(%rdi,%rcx,1),%rsi
>     0x0000000000154e2c <+1372>:	mov    %rsi,%rcx
>     0x0000000000154e2f <+1375>:	sar    $0x3f,%rcx
>     0x0000000000154e33 <+1379>:	xor    %rcx,%rsi
>     0x0000000000154e36 <+1382>:	sub    %rcx,%rsi
>     0x0000000000154e39 <+1385>:	lea    0x22(%r8),%rcx
>     0x0000000000154e3d <+1389>:	shl    $0x4,%r8
>     0x0000000000154e41 <+1393>:	shl    $0x4,%rcx
>     0x0000000000154e45 <+1397>:	movswq 0x8(%rdi,%rcx,1),%rcx
>     0x0000000000154e4b <+1403>:	mov    %rcx,%r10
>     0x0000000000154e4e <+1406>:	sar    $0x3f,%r10
>     0x0000000000154e52 <+1410>:	xor    %r10,%rcx
>     0x0000000000154e55 <+1413>:	sub    %r10,%rcx
>     0x0000000000154e58 <+1416>:	add    %rcx,%rsi
>     0x0000000000154e5b <+1419>:	mov    %si,0x8(%rdi,%r9,1)
>     0x0000000000154e61 <+1425>:	movswq 0x22a(%rax),%rcx
>     0x0000000000154e69 <+1433>:	add    %r8,%rdi
>     0x0000000000154e6c <+1436>:	mov    %rcx,%rsi
>     0x0000000000154e6f <+1439>:	sar    $0x3f,%rsi
>     0x0000000000154e73 <+1443>:	xor    %rsi,%rcx
>     0x0000000000154e76 <+1446>:	sub    %rsi,%rcx
>     0x0000000000154e79 <+1449>:	movswq 0x22a(%rdi),%rsi
>     0x0000000000154e81 <+1457>:	mov    %rsi,%r8
>     0x0000000000154e84 <+1460>:	sar    $0x3f,%r8
>     0x0000000000154e88 <+1464>:	xor    %r8,%rsi
>     0x0000000000154e8b <+1467>:	sub    %r8,%rsi
>     0x0000000000154e8e <+1470>:	add    %rsi,%rcx
>     0x0000000000154e91 <+1473>:	mov    %cx,0x22a(%rdx)
>     0x0000000000154e98 <+1480>:	movswq 0x22c(%rax),%rcx
>     0x0000000000154ea0 <+1488>:	mov    %rcx,%rsi
>     0x0000000000154ea3 <+1491>:	sar    $0x3f,%rsi
>     0x0000000000154ea7 <+1495>:	xor    %rsi,%rcx
>     0x0000000000154eaa <+1498>:	sub    %rsi,%rcx
>     0x0000000000154ead <+1501>:	movswq 0x22c(%rdi),%rsi
>     0x0000000000154eb5 <+1509>:	mov    %rsi,%r8
>     0x0000000000154eb8 <+1512>:	sar    $0x3f,%r8
>     0x0000000000154ebc <+1516>:	xor    %r8,%rsi
>     0x0000000000154ebf <+1519>:	sub    %r8,%rsi
>     0x0000000000154ec2 <+1522>:	add    %rsi,%rcx
>     0x0000000000154ec5 <+1525>:	mov    %cx,0x22c(%rdx)
>     0x0000000000154ecc <+1532>:	movswq 0x22e(%rax),%rcx
>     0x0000000000154ed4 <+1540>:	mov    %rcx,%rsi
>     0x0000000000154ed7 <+1543>:	sar    $0x3f,%rsi
>     0x0000000000154edb <+1547>:	xor    %rsi,%rcx
>     0x0000000000154ede <+1550>:	sub    %rsi,%rcx
>     0x0000000000154ee1 <+1553>:	movswq 0x22e(%rdi),%rsi
>     0x0000000000154ee9 <+1561>:	mov    %rsi,%r8
>     0x0000000000154eec <+1564>:	sar    $0x3f,%r8
>     0x0000000000154ef0 <+1568>:	xor    %r8,%rsi
>     0x0000000000154ef3 <+1571>:	sub    %r8,%rsi
>     0x0000000000154ef6 <+1574>:	add    %rsi,%rcx
>     0x0000000000154ef9 <+1577>:	mov    %cx,0x22e(%rdx)
>     0x0000000000154f00 <+1584>:	movswq 0x230(%rax),%rcx
>     0x0000000000154f08 <+1592>:	mov    %rcx,%rsi
>     0x0000000000154f0b <+1595>:	sar    $0x3f,%rsi
>     0x0000000000154f0f <+1599>:	xor    %rsi,%rcx
>     0x0000000000154f12 <+1602>:	sub    %rsi,%rcx
>     0x0000000000154f15 <+1605>:	movswq 0x230(%rdi),%rsi
>     0x0000000000154f1d <+1613>:	mov    %rsi,%r8
>     0x0000000000154f20 <+1616>:	sar    $0x3f,%r8
>     0x0000000000154f24 <+1620>:	xor    %r8,%rsi
>     0x0000000000154f27 <+1623>:	sub    %r8,%rsi
>     0x0000000000154f2a <+1626>:	add    %rsi,%rcx
>     0x0000000000154f2d <+1629>:	mov    %cx,0x230(%rdx)
>     0x0000000000154f34 <+1636>:	movswq 0x232(%rax),%rcx
>     0x0000000000154f3c <+1644>:	mov    %rcx,%rsi
>     0x0000000000154f3f <+1647>:	sar    $0x3f,%rsi
>     0x0000000000154f43 <+1651>:	xor    %rsi,%rcx
>     0x0000000000154f46 <+1654>:	sub    %rsi,%rcx
>     0x0000000000154f49 <+1657>:	movswq 0x232(%rdi),%rsi
>     0x0000000000154f51 <+1665>:	mov    %rsi,%r8
>     0x0000000000154f54 <+1668>:	sar    $0x3f,%r8
>     0x0000000000154f58 <+1672>:	xor    %r8,%rsi
>     0x0000000000154f5b <+1675>:	sub    %r8,%rsi
>     0x0000000000154f5e <+1678>:	add    %rsi,%rcx
>     0x0000000000154f61 <+1681>:	mov    %cx,0x232(%rdx)
>     0x0000000000154f68 <+1688>:	movswq 0x234(%rax),%rcx
>     0x0000000000154f70 <+1696>:	mov    %rcx,%rsi
>     0x0000000000154f73 <+1699>:	sar    $0x3f,%rsi
>     0x0000000000154f77 <+1703>:	xor    %rsi,%rcx
>     0x0000000000154f7a <+1706>:	sub    %rsi,%rcx
>     0x0000000000154f7d <+1709>:	movswq 0x234(%rdi),%rsi
>     0x0000000000154f85 <+1717>:	mov    %rsi,%r8
>     0x0000000000154f88 <+1720>:	sar    $0x3f,%r8
>     0x0000000000154f8c <+1724>:	xor    %r8,%rsi
>     0x0000000000154f8f <+1727>:	sub    %r8,%rsi
>     0x0000000000154f92 <+1730>:	add    %rsi,%rcx
>     0x0000000000154f95 <+1733>:	mov    %cx,0x234(%rdx)
>     0x0000000000154f9c <+1740>:	movswq 0x236(%rax),%rax
>     0x0000000000154fa4 <+1748>:	mov    %rax,%rcx
>     0x0000000000154fa7 <+1751>:	sar    $0x3f,%rcx
>     0x0000000000154fab <+1755>:	xor    %rcx,%rax
>     0x0000000000154fae <+1758>:	sub    %rcx,%rax
>     0x0000000000154fb1 <+1761>:	movswq 0x236(%rdi),%rcx
>     0x0000000000154fb9 <+1769>:	mov    %rcx,%rsi
>     0x0000000000154fbc <+1772>:	sar    $0x3f,%rsi
>     0x0000000000154fc0 <+1776>:	xor    %rsi,%rcx
>     0x0000000000154fc3 <+1779>:	sub    %rsi,%rcx
>     0x0000000000154fc6 <+1782>:	add    %rcx,%rax
>     0x0000000000154fc9 <+1785>:	mov    %ax,0x236(%rdx)
>     0x0000000000154fd0 <+1792>:	retq
>     0x0000000000154fd1 <+1793>:	lea    0x14faa8(%rip),%rcx        # 0x2a4a80 <__PRETTY_FUNCTION__.25843>
>     0x0000000000154fd8 <+1800>:	lea    0x14ef81(%rip),%rsi        # 0x2a3f60
>     0x0000000000154fdf <+1807>:	lea    0x1da975(%rip),%rdi        # 0x32f95b
>     0x0000000000154fe6 <+1814>:	sub    $0x8,%rsp
>     0x0000000000154fea <+1818>:	mov    $0x368,%edx
>     0x0000000000154fef <+1823>:	callq  0x8f170
>
>
>>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>>> ---
>>>   target/mips/msa_helper.c | 43 ++++++++++++++++++++++++++++++-------------
>>>   1 file changed, 30 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>>> index 4c7ec05..1152fda 100644
>>> --- a/target/mips/msa_helper.c
>>> +++ b/target/mips/msa_helper.c
>>> @@ -804,28 +804,45 @@ void helper_msa_ ## func ## _df(CPUMIPSState *env, uint32_t df,         \
>>>       wr_t *pwd = &(env->active_fpu.fpr[wd].wr);                          \
>>>       wr_t *pws = &(env->active_fpu.fpr[ws].wr);                          \
>>>       wr_t *pwt = &(env->active_fpu.fpr[wt].wr);
>>> \
>> If we can ensure alignment for the various vector registers then the
>> compiler always has the option of using host vectors (certainly for int
>> and logic operations).
Very interesting, could you please tell me more about that, so I can
understand better.

Thanks,
Mateja
>>> -    uint32_t i;                                                         \
>>>                                                                           \
>>>       switch (df) {                                                       \
>>>       case DF_BYTE:                                                       \
>>> -        for (i = 0; i < DF_ELEMENTS(DF_BYTE); i++) {                    \
>>> -            pwd->b[i] = msa_ ## func ## _df(df, pws->b[i], pwt->b[i]);  \
>>> -        }                                                               \
>>> +        pwd->b[0]  = msa_ ## func ## _df(df, pws->b[0], pwt->b[0]);     \
>>> +        pwd->b[1]  = msa_ ## func ## _df(df, pws->b[1], pwt->b[1]);     \
>>> +        pwd->b[2]  = msa_ ## func ## _df(df, pws->b[2], pwt->b[2]);     \
>>> +        pwd->b[3]  = msa_ ## func ## _df(df, pws->b[3], pwt->b[3]);     \
>>> +        pwd->b[4]  = msa_ ## func ## _df(df, pws->b[4], pwt->b[4]);     \
>>> +        pwd->b[5]  = msa_ ## func ## _df(df, pws->b[5], pwt->b[5]);     \
>>> +        pwd->b[6]  = msa_ ## func ## _df(df, pws->b[6], pwt->b[6]);     \
>>> +        pwd->b[7]  = msa_ ## func ## _df(df, pws->b[7], pwt->b[7]);     \
>>> +        pwd->b[8]  = msa_ ## func ## _df(df, pws->b[8], pwt->b[8]);     \
>>> +        pwd->b[9]  = msa_ ## func ## _df(df, pws->b[9], pwt->b[9]);     \
>>> +        pwd->b[10] = msa_ ## func ## _df(df, pws->b[10], pwt->b[10]);   \
>>> +        pwd->b[11] = msa_ ## func ## _df(df, pws->b[11], pwt->b[11]);   \
>>> +        pwd->b[12] = msa_ ## func ## _df(df, pws->b[12], pwt->b[12]);   \
>>> +        pwd->b[13] = msa_ ## func ## _df(df, pws->b[13], pwt->b[13]);   \
>>> +        pwd->b[14] = msa_ ## func ## _df(df, pws->b[14], pwt->b[14]);   \
>>> +        pwd->b[15] = msa_ ## func ## _df(df, pws->b[15], pwt->b[15]);   \
>>>           break;                                                          \
>>>       case DF_HALF:                                                       \
>>> -        for (i = 0; i < DF_ELEMENTS(DF_HALF); i++) {                    \
>>> -            pwd->h[i] = msa_ ## func ## _df(df, pws->h[i], pwt->h[i]);  \
>>> -        }                                                               \
>>> +        pwd->h[0] = msa_ ## func ## _df(df, pws->h[0], pwt->h[0]);      \
>>> +        pwd->h[1] = msa_ ## func ## _df(df, pws->h[1], pwt->h[1]);      \
>>> +        pwd->h[2] = msa_ ## func ## _df(df, pws->h[2], pwt->h[2]);      \
>>> +        pwd->h[3] = msa_ ## func ## _df(df, pws->h[3], pwt->h[3]);      \
>>> +        pwd->h[4] = msa_ ## func ## _df(df, pws->h[4], pwt->h[4]);      \
>>> +        pwd->h[5] = msa_ ## func ## _df(df, pws->h[5], pwt->h[5]);      \
>>> +        pwd->h[6] = msa_ ## func ## _df(df, pws->h[6], pwt->h[6]);      \
>>> +        pwd->h[7] = msa_ ## func ## _df(df, pws->h[7], pwt->h[7]);      \
>>>           break;                                                          \
>>>       case DF_WORD:                                                       \
>>> -        for (i = 0; i < DF_ELEMENTS(DF_WORD); i++) {                    \
>>> -            pwd->w[i] = msa_ ## func ## _df(df, pws->w[i], pwt->w[i]);  \
>>> -        }                                                               \
>>> +        pwd->w[0] = msa_ ## func ## _df(df, pws->w[0], pwt->w[0]);      \
>>> +        pwd->w[1] = msa_ ## func ## _df(df, pws->w[1], pwt->w[1]);      \
>>> +        pwd->w[2] = msa_ ## func ## _df(df, pws->w[2], pwt->w[2]);      \
>>> +        pwd->w[3] = msa_ ## func ## _df(df, pws->w[3], pwt->w[3]);      \
>>>           break;                                                          \
>>>       case DF_DOUBLE:                                                     \
>>> -        for (i = 0; i < DF_ELEMENTS(DF_DOUBLE); i++) {                  \
>>> -            pwd->d[i] = msa_ ## func ## _df(df, pws->d[i], pwt->d[i]);  \
>>> -        }                                                               \
>>> +        pwd->d[0] = msa_ ## func ## _df(df, pws->d[0], pwt->d[0]);      \
>>> +        pwd->d[1] = msa_ ## func ## _df(df, pws->d[1], pwt->d[1]);      \
>>>           break;                                                          \
>>>       default:                                                            \
>>>           assert(0);                                                      \
>
> --
> Alex Bennée
>



  reply	other threads:[~2019-06-03 13:30 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-04 16:51 [Qemu-devel] [PATCH 0/2] target/mips: Improve performance for MSA binary operations Mateja Marjanovic
2019-03-04 16:51 ` [Qemu-devel] [PATCH 1/2] " Mateja Marjanovic
2019-06-01 14:16   ` Aleksandar Markovic
2019-06-02  7:06     ` Aleksandar Markovic
2019-06-03  9:46       ` Mateja Marjanovic
2019-06-02 13:22   ` Alex Bennée
2019-06-03 13:10     ` Aleksandar Markovic
2019-06-03 13:29       ` Mateja Marjanovic [this message]
2019-03-04 16:51 ` [Qemu-devel] [PATCH 2/2] target/mips: Tests for binary integer MSA instruction (add, adds, hadd...) Mateja Marjanovic
2019-03-04 18:43   ` Aleksandar Markovic

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=76f6a6d8-d44b-c6b8-f3ec-42bee138e2ea@rt-rk.com \
    --to=mateja.marjanovic@rt-rk.com \
    --cc=alex.bennee@linaro.org \
    --cc=amarkovic@wavecomp.com \
    --cc=arikalo@wavecomp.com \
    --cc=aurelien@aurel32.net \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.