* [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions @ 2019-08-27 9:37 Stefan Brankovic 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Stefan Brankovic @ 2019-08-27 9:37 UTC (permalink / raw) To: qemu-devel; +Cc: richard.henderson, david

Optimize emulation of ten Altivec instructions: lvsl, lvsr, vsl, vsr, vpkpx, vgbbd, vclzb, vclzh, vclzw and vclzd.

This series builds on and complements the recent work of Thomas Murta, Mark Cave-Ayland and Richard Henderson in the same area. It is based on devising TCG translations for the selected instructions rather than using helpers. The selected instructions are mostly idiosyncratic to the ppc platform, so a direct mapping to a host instruction is not possible; a relatively complex TCG translation therefore seems to be the best option, and that approach is presented in this series. The performance improvements are significant in all cases.

V6:
Rebased series on the latest qemu code.
Excluded all patches that are already accepted.

V5:
Fixed vpkpx bug and added it back into the patch.
Fixed graphical distortions on OSX 10.3 and 10.4.
Removed conversion of vmrgh and vmrgl instructions to vector operations for further investigation.

V4:
Addressed Richard Henderson's suggestions.
Removed vpkpx's optimization for further investigation of the graphical distortions it caused on OSX 10.2-4 guests.
Added opcodes for vector vmrgh(b|h|w) and vmrgl(b|h|w) in tcg.
Implemented vector vmrgh and vmrgl instructions for i386.
Converted vmrgh and vmrgl instructions to vector operations.

V3:
Fixed a problem during build.

V2:
Addressed Richard Henderson's suggestions.
Fixed a problem during build in patch 2/8.
Rebased series on the latest qemu code.
Stefan Brankovic (3): target/ppc: Optimize emulation of vpkpx instruction target/ppc: Optimize emulation of vclzh and vclzb instructions target/ppc: Refactor emulation of vmrgew and vmrgow instructions target/ppc/helper.h | 3 - target/ppc/int_helper.c | 30 ---- target/ppc/translate/vmx-impl.inc.c | 301 ++++++++++++++++++++++++++++++++---- 3 files changed, 269 insertions(+), 65 deletions(-) -- 2.7.4 ^ permalink raw reply [flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction 2019-08-27 9:37 [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic @ 2019-08-27 9:37 ` Stefan Brankovic 2019-08-27 18:52 ` Richard Henderson 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic 2 siblings, 1 reply; 12+ messages in thread From: Stefan Brankovic @ 2019-08-27 9:37 UTC (permalink / raw) To: qemu-devel; +Cc: richard.henderson, david

Optimize the Altivec instruction vpkpx (Vector Pack Pixel). It rearranges 8 pixels coded in the 6-5-5 pattern (4 from each source register) into a contiguous array of bits in the destination register.

The outer loop has 4 iterations. Each iteration performs the 6-5-5 pack for 2 pixels of one doubleword element of one source register. The first step of the outer loop is choosing which doubleword element of which register is used in the current iteration; it is placed in the avr variable. The next step performs the 6-5-5 pack of the pixels in avr in an inner for loop (2 iterations, 1 for each pixel) and saves the result in the tmp variable. At the end of the outer loop, tmp is merged into the result variable, which is saved in the appropriate doubleword element of vD whenever a whole doubleword is finished (every second iteration).
Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com> --- target/ppc/helper.h | 1 - target/ppc/int_helper.c | 21 -------- target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++++++++- 3 files changed, 98 insertions(+), 23 deletions(-) diff --git a/target/ppc/helper.h b/target/ppc/helper.h index 54ea9b9..940a115 100644 --- a/target/ppc/helper.h +++ b/target/ppc/helper.h @@ -258,7 +258,6 @@ DEF_HELPER_4(vpkudus, void, env, avr, avr, avr) DEF_HELPER_4(vpkuhum, void, env, avr, avr, avr) DEF_HELPER_4(vpkuwum, void, env, avr, avr, avr) DEF_HELPER_4(vpkudum, void, env, avr, avr, avr) -DEF_HELPER_3(vpkpx, void, avr, avr, avr) DEF_HELPER_5(vmhaddshs, void, env, avr, avr, avr, avr) DEF_HELPER_5(vmhraddshs, void, env, avr, avr, avr, avr) DEF_HELPER_5(vmsumuhm, void, env, avr, avr, avr, avr) diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c index 46deb57..9ff3b03 100644 --- a/target/ppc/int_helper.c +++ b/target/ppc/int_helper.c @@ -1262,27 +1262,6 @@ void helper_vpmsumd(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b) #else #define PKBIG 0 #endif -void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b) -{ - int i, j; - ppc_avr_t result; -#if defined(HOST_WORDS_BIGENDIAN) - const ppc_avr_t *x[2] = { a, b }; -#else - const ppc_avr_t *x[2] = { b, a }; -#endif - - VECTOR_FOR_INORDER_I(i, u64) { - VECTOR_FOR_INORDER_I(j, u32) { - uint32_t e = x[i]->u32[j]; - - result.u16[4 * i + j] = (((e >> 9) & 0xfc00) | - ((e >> 6) & 0x3e0) | - ((e >> 3) & 0x1f)); - } - } - *r = result; -} #define VPK(suffix, from, to, cvt, dosat) \ void helper_vpk##suffix(CPUPPCState *env, ppc_avr_t *r, \ diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c index 0d71c10..456666a 100644 --- a/target/ppc/translate/vmx-impl.inc.c +++ b/target/ppc/translate/vmx-impl.inc.c @@ -571,6 +571,103 @@ static void trans_lvsr(DisasContext *ctx) } /* + * vpkpx VRT,VRA,VRB - Vector Pack Pixel + * + * Rearranges 8 pixels coded in 6-5-5 pattern (4 from each 
source register) + * into contigous array of bits in the destination register. + */ +static void trans_vpkpx(DisasContext *ctx) +{ + int VT = rD(ctx->opcode); + int VA = rA(ctx->opcode); + int VB = rB(ctx->opcode); + TCGv_i64 tmp = tcg_temp_new_i64(); + TCGv_i64 shifted = tcg_temp_new_i64(); + TCGv_i64 avr = tcg_temp_new_i64(); + TCGv_i64 result = tcg_temp_new_i64(); + TCGv_i64 result1 = tcg_temp_new_i64(); + TCGv_i64 result2 = tcg_temp_new_i64(); + int64_t mask1 = 0x1fULL; + int64_t mask2 = 0x1fULL << 5; + int64_t mask3 = 0x3fULL << 10; + int i, j; + /* + * In each iteration do the 6-5-5 pack for 2 pixels of each doubleword + * element of each source register. + */ + for (i = 0; i < 4; i++) { + switch (i) { + case 0: + /* + * Get high doubleword of vA to perfrom 6-5-5 pack of pixels + * 1 and 2. + */ + get_avr64(avr, VA, true); + tcg_gen_movi_i64(result, 0x0ULL); + break; + case 1: + /* + * Get low doubleword of vA to perfrom 6-5-5 pack of pixels + * 3 and 4. + */ + get_avr64(avr, VA, false); + break; + case 2: + /* + * Get high doubleword of vB to perfrom 6-5-5 pack of pixels + * 5 and 6. + */ + get_avr64(avr, VB, true); + tcg_gen_movi_i64(result, 0x0ULL); + break; + case 3: + /* + * Get low doubleword of vB to perfrom 6-5-5 pack of pixels + * 7 and 8. + */ + get_avr64(avr, VB, false); + break; + } + /* Perform the packing for 2 pixels(each iteration for 1). 
*/ + tcg_gen_movi_i64(tmp, 0x0ULL); + for (j = 0; j < 2; j++) { + tcg_gen_shri_i64(shifted, avr, (j * 16 + 3)); + tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16)); + tcg_gen_or_i64(tmp, tmp, shifted); + + tcg_gen_shri_i64(shifted, avr, (j * 16 + 6)); + tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16)); + tcg_gen_or_i64(tmp, tmp, shifted); + + tcg_gen_shri_i64(shifted, avr, (j * 16 + 9)); + tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16)); + tcg_gen_or_i64(tmp, tmp, shifted); + } + if ((i == 0) || (i == 2)) { + tcg_gen_shli_i64(tmp, tmp, 32); + } + tcg_gen_or_i64(result, result, tmp); + if (i == 1) { + /* Place packed pixels 1:4 to high doubleword of vD. */ + tcg_gen_mov_i64(result1, result); + } + if (i == 3) { + /* Place packed pixels 5:8 to low doubleword of vD. */ + tcg_gen_mov_i64(result2, result); + } + } + set_avr64(VT, result1, true); + set_avr64(VT, result2, false); + + tcg_temp_free_i64(tmp); + tcg_temp_free_i64(shifted); + tcg_temp_free_i64(avr); + tcg_temp_free_i64(result); + tcg_temp_free_i64(result1); + tcg_temp_free_i64(result2); +} + +/* * vsl VRT,VRA,VRB - Vector Shift Left * * Shifting left 128 bit value of vA by value specified in bits 125-127 of vB. @@ -923,7 +1020,7 @@ GEN_VXFORM_ENV(vpksdus, 7, 21); GEN_VXFORM_ENV(vpkshss, 7, 6); GEN_VXFORM_ENV(vpkswss, 7, 7); GEN_VXFORM_ENV(vpksdss, 7, 23); -GEN_VXFORM(vpkpx, 7, 12); +GEN_VXFORM_TRANS(vpkpx, 7, 12); GEN_VXFORM_ENV(vsum4ubs, 4, 24); GEN_VXFORM_ENV(vsum4sbs, 4, 28); GEN_VXFORM_ENV(vsum4shs, 4, 25); -- 2.7.4 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic @ 2019-08-27 18:52 ` Richard Henderson 2019-08-27 19:04 ` BALATON Zoltan 2019-08-29 13:34 ` Stefan Brankovic 0 siblings, 2 replies; 12+ messages in thread From: Richard Henderson @ 2019-08-27 18:52 UTC (permalink / raw) To: Stefan Brankovic, qemu-devel; +Cc: david On 8/27/19 2:37 AM, Stefan Brankovic wrote: > + for (i = 0; i < 4; i++) { > + switch (i) { > + case 0: > + /* > + * Get high doubleword of vA to perfrom 6-5-5 pack of pixels > + * 1 and 2. > + */ > + get_avr64(avr, VA, true); > + tcg_gen_movi_i64(result, 0x0ULL); > + break; > + case 1: > + /* > + * Get low doubleword of vA to perfrom 6-5-5 pack of pixels > + * 3 and 4. > + */ > + get_avr64(avr, VA, false); > + break; > + case 2: > + /* > + * Get high doubleword of vB to perfrom 6-5-5 pack of pixels > + * 5 and 6. > + */ > + get_avr64(avr, VB, true); > + tcg_gen_movi_i64(result, 0x0ULL); > + break; > + case 3: > + /* > + * Get low doubleword of vB to perfrom 6-5-5 pack of pixels > + * 7 and 8. > + */ > + get_avr64(avr, VB, false); > + break; > + } > + /* Perform the packing for 2 pixels(each iteration for 1). 
*/ > + tcg_gen_movi_i64(tmp, 0x0ULL); > + for (j = 0; j < 2; j++) { > + tcg_gen_shri_i64(shifted, avr, (j * 16 + 3)); > + tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16)); > + tcg_gen_or_i64(tmp, tmp, shifted); > + > + tcg_gen_shri_i64(shifted, avr, (j * 16 + 6)); > + tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16)); > + tcg_gen_or_i64(tmp, tmp, shifted); > + > + tcg_gen_shri_i64(shifted, avr, (j * 16 + 9)); > + tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16)); > + tcg_gen_or_i64(tmp, tmp, shifted); > + } > + if ((i == 0) || (i == 2)) { > + tcg_gen_shli_i64(tmp, tmp, 32); > + } > + tcg_gen_or_i64(result, result, tmp); > + if (i == 1) { > + /* Place packed pixels 1:4 to high doubleword of vD. */ > + tcg_gen_mov_i64(result1, result); > + } > + if (i == 3) { > + /* Place packed pixels 5:8 to low doubleword of vD. */ > + tcg_gen_mov_i64(result2, result); > + } > + } > + set_avr64(VT, result1, true); > + set_avr64(VT, result2, false); I really have a hard time believing that it is worthwhile to inline all of this code. By my count this is 82 non-move opcodes. That is a *lot* of inline expansion. However, I can well imagine that the existing out-of-line helper is less than optimal. > -void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b) > -{ > - int i, j; > - ppc_avr_t result; > -#if defined(HOST_WORDS_BIGENDIAN) > - const ppc_avr_t *x[2] = { a, b }; > -#else > - const ppc_avr_t *x[2] = { b, a }; > -#endif > - > - VECTOR_FOR_INORDER_I(i, u64) { > - VECTOR_FOR_INORDER_I(j, u32) { > - uint32_t e = x[i]->u32[j]; Double indirect loads? > - > - result.u16[4 * i + j] = (((e >> 9) & 0xfc00) | > - ((e >> 6) & 0x3e0) | > - ((e >> 3) & 0x1f)); Store to temporary ... > - } > - } > - *r = result; ... and then copy? Try replacing the existing helper with something like the following. 
r~ static inline uint64_t pkpx_1(uint64_t a, int shr, int shl) { uint64_t r; r = ((a >> (shr + 9)) & 0x3f) << shl; r |= ((a >> (shr + 6)) & 0x1f) << shl; r |= ((a >> (shr + 3)) & 0x1f) << shl; return r; } static inline uint64_t pkpx_2(uint64_t ah, uint64_t al) { return pkpx_1(ah, 32, 48) | pkpx_1(ah, 0, 32) | pkpx_1(al, 32, 16) | pkpx_1(al, 0, 0); } void helper_vpkpx(uint64_t *r, uint64_t *a, uint64_t *b) { uint64_t rh = pkpx_2(a->VsrD(0), a->VsrD(1)); uint64_t rl = pkpx_2(b->VsrD(0), b->VsrD(1)); r->VsrD(0) = rh; r->VsrD(1) = rl; } ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction 2019-08-27 18:52 ` Richard Henderson @ 2019-08-27 19:04 ` BALATON Zoltan 2019-08-29 13:34 ` Stefan Brankovic 1 sibling, 0 replies; 12+ messages in thread From: BALATON Zoltan @ 2019-08-27 19:04 UTC (permalink / raw) To: Richard Henderson; +Cc: Stefan Brankovic, qemu-devel, david On Tue, 27 Aug 2019, Richard Henderson wrote: > On 8/27/19 2:37 AM, Stefan Brankovic wrote: >> + for (i = 0; i < 4; i++) { >> + switch (i) { >> + case 0: >> + /* >> + * Get high doubleword of vA to perfrom 6-5-5 pack of pixels >> + * 1 and 2. >> + */ >> + get_avr64(avr, VA, true); >> + tcg_gen_movi_i64(result, 0x0ULL); >> + break; >> + case 1: >> + /* >> + * Get low doubleword of vA to perfrom 6-5-5 pack of pixels >> + * 3 and 4. >> + */ >> + get_avr64(avr, VA, false); >> + break; >> + case 2: >> + /* >> + * Get high doubleword of vB to perfrom 6-5-5 pack of pixels >> + * 5 and 6. >> + */ >> + get_avr64(avr, VB, true); >> + tcg_gen_movi_i64(result, 0x0ULL); >> + break; >> + case 3: >> + /* >> + * Get low doubleword of vB to perfrom 6-5-5 pack of pixels If this is replaced by Richard's suggested version it does not matter but there's a typo in above comments. Probably you've meant perfrom -> perform Regards, BALATON Zoltan ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction 2019-08-27 18:52 ` Richard Henderson 2019-08-27 19:04 ` BALATON Zoltan @ 2019-08-29 13:34 ` Stefan Brankovic 2019-08-29 15:31 ` Richard Henderson 1 sibling, 1 reply; 12+ messages in thread From: Stefan Brankovic @ 2019-08-29 13:34 UTC (permalink / raw) To: Richard Henderson, qemu-devel; +Cc: david [-- Attachment #1: Type: text/plain, Size: 5764 bytes --] On 27.8.19. 20:52, Richard Henderson wrote: > On 8/27/19 2:37 AM, Stefan Brankovic wrote: >> + for (i = 0; i < 4; i++) { >> + switch (i) { >> + case 0: >> + /* >> + * Get high doubleword of vA to perfrom 6-5-5 pack of pixels >> + * 1 and 2. >> + */ >> + get_avr64(avr, VA, true); >> + tcg_gen_movi_i64(result, 0x0ULL); >> + break; >> + case 1: >> + /* >> + * Get low doubleword of vA to perfrom 6-5-5 pack of pixels >> + * 3 and 4. >> + */ >> + get_avr64(avr, VA, false); >> + break; >> + case 2: >> + /* >> + * Get high doubleword of vB to perfrom 6-5-5 pack of pixels >> + * 5 and 6. >> + */ >> + get_avr64(avr, VB, true); >> + tcg_gen_movi_i64(result, 0x0ULL); >> + break; >> + case 3: >> + /* >> + * Get low doubleword of vB to perfrom 6-5-5 pack of pixels >> + * 7 and 8. >> + */ >> + get_avr64(avr, VB, false); >> + break; >> + } >> + /* Perform the packing for 2 pixels(each iteration for 1). 
*/ >> + tcg_gen_movi_i64(tmp, 0x0ULL); >> + for (j = 0; j < 2; j++) { >> + tcg_gen_shri_i64(shifted, avr, (j * 16 + 3)); >> + tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16)); >> + tcg_gen_or_i64(tmp, tmp, shifted); >> + >> + tcg_gen_shri_i64(shifted, avr, (j * 16 + 6)); >> + tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16)); >> + tcg_gen_or_i64(tmp, tmp, shifted); >> + >> + tcg_gen_shri_i64(shifted, avr, (j * 16 + 9)); >> + tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16)); >> + tcg_gen_or_i64(tmp, tmp, shifted); >> + } >> + if ((i == 0) || (i == 2)) { >> + tcg_gen_shli_i64(tmp, tmp, 32); >> + } >> + tcg_gen_or_i64(result, result, tmp); >> + if (i == 1) { >> + /* Place packed pixels 1:4 to high doubleword of vD. */ >> + tcg_gen_mov_i64(result1, result); >> + } >> + if (i == 3) { >> + /* Place packed pixels 5:8 to low doubleword of vD. */ >> + tcg_gen_mov_i64(result2, result); >> + } >> + } >> + set_avr64(VT, result1, true); >> + set_avr64(VT, result2, false); > I really have a hard time believing that it is worthwhile to inline all of this > code. By my count this is 82 non-move opcodes. That is a *lot* of inline > expansion. > > However, I can well imagine that the existing out-of-line helper is less than > optimal. > >> -void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b) >> -{ >> - int i, j; >> - ppc_avr_t result; >> -#if defined(HOST_WORDS_BIGENDIAN) >> - const ppc_avr_t *x[2] = { a, b }; >> -#else >> - const ppc_avr_t *x[2] = { b, a }; >> -#endif >> - >> - VECTOR_FOR_INORDER_I(i, u64) { >> - VECTOR_FOR_INORDER_I(j, u32) { >> - uint32_t e = x[i]->u32[j]; > Double indirect loads? > >> - >> - result.u16[4 * i + j] = (((e >> 9) & 0xfc00) | >> - ((e >> 6) & 0x3e0) | >> - ((e >> 3) & 0x1f)); > Store to temporary ... > >> - } >> - } >> - *r = result; > ... and then copy? > > Try replacing the existing helper with something like the following. 
>
> r~
>
> static inline uint64_t pkpx_1(uint64_t a, int shr, int shl)
> {
>     uint64_t r;
>
>     r = ((a >> (shr + 9)) & 0x3f) << shl;
>     r |= ((a >> (shr + 6)) & 0x1f) << shl;
>     r |= ((a >> (shr + 3)) & 0x1f) << shl;
>
>     return r;
> }
>
> static inline uint64_t pkpx_2(uint64_t ah, uint64_t al)
> {
>     return pkpx_1(ah, 32, 48)
>          | pkpx_1(ah, 0, 32)
>          | pkpx_1(al, 32, 16)
>          | pkpx_1(al, 0, 0);
> }
>
> void helper_vpkpx(uint64_t *r, uint64_t *a, uint64_t *b)
> {
>     uint64_t rh = pkpx_2(a->VsrD(0), a->VsrD(1));
>     uint64_t rl = pkpx_2(b->VsrD(0), b->VsrD(1));
>     r->VsrD(0) = rh;
>     r->VsrD(1) = rl;
> }

I implemented vpkpx as you suggested above, with small modifications (so it builds and gives correct results). It looks like this:

static inline uint64_t pkpx_1(uint64_t a, int shr, int shl)
{
    uint64_t r;

    r = ((a >> (shr + 9)) & 0xfc00) << shl;
    r |= ((a >> (shr + 6)) & 0x3e0) << shl;
    r |= ((a >> (shr + 3)) & 0x1f) << shl;

    return r;
}

static inline uint64_t pkpx_2(uint64_t ah, uint64_t al)
{
    return pkpx_1(ah, 32, 48)
         | pkpx_1(ah, 0, 32)
         | pkpx_1(al, 32, 16)
         | pkpx_1(al, 0, 0);
}

void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
    uint64_t rh = pkpx_2(a->u64[1], a->u64[0]);
    uint64_t rl = pkpx_2(b->u64[1], b->u64[0]);
    r->u64[1] = rh;
    r->u64[0] = rl;
}

I also noticed that this works only on little-endian hosts, so we would need to modify it in order to support big-endian hosts (this shouldn't affect the performance results).

Then I ran my performance tests and got the following results (the test calls vpkpx 100000 times):

1) Current helper implementation: ~157 ms
2) Helper implementation you suggested: ~94 ms
3) tcg implementation: ~75 ms

The attached file contains the assembly code for both the current implementation and the implementation you suggested, so please take a look at that as well.
Kind Regards, Stefan [-- Attachment #2: vpkpx_assembly.txt --] [-- Type: text/plain, Size: 23654 bytes --] Current vpkpx implementation: 1)Both c and assembly code: Dump of assembler code for function helper_vpkpx: 1267 { 0x0000000000195fe0 <+0>: 48 83 ec 38 sub $0x38,%rsp 1268 int i, j; 1269 ppc_avr_t result; 1270 #if defined(HOST_WORDS_BIGENDIAN) 1271 const ppc_avr_t *x[2] = { a, b }; 1272 #else 1273 const ppc_avr_t *x[2] = { b, a }; 0x0000000000195fe4 <+4>: b9 07 00 00 00 mov $0x7,%ecx 1267 { 0x0000000000195fe9 <+9>: 64 48 8b 04 25 28 00 00 00 mov %fs:0x28,%rax 0x0000000000195ff2 <+18>: 48 89 44 24 28 mov %rax,0x28(%rsp) 0x0000000000195ff7 <+23>: 31 c0 xor %eax,%eax 0x0000000000195ff9 <+25>: 4c 8d 4c 24 10 lea 0x10(%rsp),%r9 1268 int i, j; 1269 ppc_avr_t result; 1270 #if defined(HOST_WORDS_BIGENDIAN) 1271 const ppc_avr_t *x[2] = { a, b }; 1272 #else 1273 const ppc_avr_t *x[2] = { b, a }; 0x0000000000195ffe <+30>: 48 89 54 24 10 mov %rdx,0x10(%rsp) 0x0000000000196003 <+35>: 48 89 74 24 18 mov %rsi,0x18(%rsp) 0x0000000000196008 <+40>: 44 8d 51 fc lea -0x4(%rcx),%r10d 0x000000000019600c <+44>: 48 83 c6 0c add $0xc,%rsi 1278 uint32_t e = x[i]->u32[j]; 0x0000000000196010 <+48>: 8b 06 mov (%rsi),%eax 1279 1280 result.u16[4 * i + j] = (((e >> 9) & 0xfc00) | 0x0000000000196012 <+50>: 4c 63 d9 movslq %ecx,%r11 0x0000000000196015 <+53>: 83 e9 01 sub $0x1,%ecx 0x0000000000196018 <+56>: 48 83 ee 04 sub $0x4,%rsi 0x000000000019601c <+60>: 89 c2 mov %eax,%edx 0x000000000019601e <+62>: c1 ea 09 shr $0x9,%edx 0x0000000000196021 <+65>: 41 89 d0 mov %edx,%r8d 0x0000000000196024 <+68>: 89 c2 mov %eax,%edx 0x0000000000196026 <+70>: c1 e8 03 shr $0x3,%eax 0x0000000000196029 <+73>: c1 ea 06 shr $0x6,%edx 0x000000000019602c <+76>: 66 41 81 e0 00 fc and $0xfc00,%r8w 0x0000000000196032 <+82>: 83 e0 1f and $0x1f,%eax 0x0000000000196035 <+85>: 66 81 e2 e0 03 and $0x3e0,%dx 0x000000000019603a <+90>: 44 09 c2 or %r8d,%edx 0x000000000019603d <+93>: 09 d0 or %edx,%eax 1277 
VECTOR_FOR_INORDER_I(j, u32) { 0x000000000019603f <+95>: 41 39 ca cmp %ecx,%r10d 1279 1280 result.u16[4 * i + j] = (((e >> 9) & 0xfc00) | 0x0000000000196042 <+98>: 66 42 89 04 5c mov %ax,(%rsp,%r11,2) 1277 VECTOR_FOR_INORDER_I(j, u32) { 0x0000000000196047 <+103>: 75 c7 jne 0x196010 <helper_vpkpx+48> 1276 VECTOR_FOR_INORDER_I(i, u64) { 0x0000000000196049 <+105>: 41 83 fa ff cmp $0xffffffff,%r10d 0x000000000019604d <+109>: 44 89 d1 mov %r10d,%ecx 0x0000000000196050 <+112>: 74 0e je 0x196060 <helper_vpkpx+128> 0x0000000000196052 <+114>: 49 8b 31 mov (%r9),%rsi 0x0000000000196055 <+117>: 49 83 e9 08 sub $0x8,%r9 0x0000000000196059 <+121>: eb ad jmp 0x196008 <helper_vpkpx+40> 0x000000000019605b <+123>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 1281 ((e >> 6) & 0x3e0) | 1282 ((e >> 3) & 0x1f)); 1283 // printf("%x\n",result.u16[4 * i + j]); 1284 } 1285 } 1286 // printf("%lx\n",result.u64[0]); 1287 // printf("%lx\n",result.u64[1]); 1288 *r = result; 0x0000000000196060 <+128>: 48 8b 04 24 mov (%rsp),%rax 0x0000000000196064 <+132>: 48 8b 54 24 08 mov 0x8(%rsp),%rdx 0x0000000000196069 <+137>: 48 89 07 mov %rax,(%rdi) 0x000000000019606c <+140>: 48 89 57 08 mov %rdx,0x8(%rdi) 1289 } 0x0000000000196070 <+144>: 48 8b 44 24 28 mov 0x28(%rsp),%rax 0x0000000000196075 <+149>: 64 48 33 04 25 28 00 00 00 xor %fs:0x28,%rax 0x000000000019607e <+158>: 75 05 jne 0x196085 <helper_vpkpx+165> 0x0000000000196080 <+160>: 48 83 c4 38 add $0x38,%rsp 0x0000000000196084 <+164>: c3 retq 0x0000000000196085 <+165>: e8 2e 66 f0 ff callq 0x9c6b8 End of assembler dump. 
2) Only assembly code: Dump of assembler code for function helper_vpkpx: 0x0000000000195fe0 <+0>: 48 83 ec 38 sub $0x38,%rsp 0x0000000000195fe4 <+4>: b9 07 00 00 00 mov $0x7,%ecx 0x0000000000195fe9 <+9>: 64 48 8b 04 25 28 00 00 00 mov %fs:0x28,%rax 0x0000000000195ff2 <+18>: 48 89 44 24 28 mov %rax,0x28(%rsp) 0x0000000000195ff7 <+23>: 31 c0 xor %eax,%eax 0x0000000000195ff9 <+25>: 4c 8d 4c 24 10 lea 0x10(%rsp),%r9 0x0000000000195ffe <+30>: 48 89 54 24 10 mov %rdx,0x10(%rsp) 0x0000000000196003 <+35>: 48 89 74 24 18 mov %rsi,0x18(%rsp) 0x0000000000196008 <+40>: 44 8d 51 fc lea -0x4(%rcx),%r10d 0x000000000019600c <+44>: 48 83 c6 0c add $0xc,%rsi 0x0000000000196010 <+48>: 8b 06 mov (%rsi),%eax 0x0000000000196012 <+50>: 4c 63 d9 movslq %ecx,%r11 0x0000000000196015 <+53>: 83 e9 01 sub $0x1,%ecx 0x0000000000196018 <+56>: 48 83 ee 04 sub $0x4,%rsi 0x000000000019601c <+60>: 89 c2 mov %eax,%edx 0x000000000019601e <+62>: c1 ea 09 shr $0x9,%edx 0x0000000000196021 <+65>: 41 89 d0 mov %edx,%r8d 0x0000000000196024 <+68>: 89 c2 mov %eax,%edx 0x0000000000196026 <+70>: c1 e8 03 shr $0x3,%eax 0x0000000000196029 <+73>: c1 ea 06 shr $0x6,%edx 0x000000000019602c <+76>: 66 41 81 e0 00 fc and $0xfc00,%r8w 0x0000000000196032 <+82>: 83 e0 1f and $0x1f,%eax 0x0000000000196035 <+85>: 66 81 e2 e0 03 and $0x3e0,%dx 0x000000000019603a <+90>: 44 09 c2 or %r8d,%edx 0x000000000019603d <+93>: 09 d0 or %edx,%eax 0x000000000019603f <+95>: 41 39 ca cmp %ecx,%r10d 0x0000000000196042 <+98>: 66 42 89 04 5c mov %ax,(%rsp,%r11,2) 0x0000000000196047 <+103>: 75 c7 jne 0x196010 <helper_vpkpx+48> 0x0000000000196049 <+105>: 41 83 fa ff cmp $0xffffffff,%r10d 0x000000000019604d <+109>: 44 89 d1 mov %r10d,%ecx 0x0000000000196050 <+112>: 74 0e je 0x196060 <helper_vpkpx+128> 0x0000000000196052 <+114>: 49 8b 31 mov (%r9),%rsi 0x0000000000196055 <+117>: 49 83 e9 08 sub $0x8,%r9 0x0000000000196059 <+121>: eb ad jmp 0x196008 <helper_vpkpx+40> 0x000000000019605b <+123>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 
0x0000000000196060 <+128>: 48 8b 04 24 mov (%rsp),%rax 0x0000000000196064 <+132>: 48 8b 54 24 08 mov 0x8(%rsp),%rdx 0x0000000000196069 <+137>: 48 89 07 mov %rax,(%rdi) 0x000000000019606c <+140>: 48 89 57 08 mov %rdx,0x8(%rdi) 0x0000000000196070 <+144>: 48 8b 44 24 28 mov 0x28(%rsp),%rax 0x0000000000196075 <+149>: 64 48 33 04 25 28 00 00 00 xor %fs:0x28,%rax 0x000000000019607e <+158>: 75 05 jne 0x196085 <helper_vpkpx+165> 0x0000000000196080 <+160>: 48 83 c4 38 add $0x38,%rsp 0x0000000000196084 <+164>: c3 retq 0x0000000000196085 <+165>: e8 2e 66 f0 ff callq 0x9c6b8 End of assembler dump. Implementation you suggested: 1)Both c and assembly code: Dump of assembler code for function helper_vpkpx: 1313 { 0x0000000000195fe0 <+0>: 55 push %rbp 0x0000000000195fe1 <+1>: 53 push %rbx 1314 uint64_t rh = pkpx_2(a->u64[1], a->u64[0]); 0x0000000000195fe2 <+2>: 48 8b 46 08 mov 0x8(%rsi),%rax 0x0000000000195fe6 <+6>: 48 8b 0e mov (%rsi),%rcx 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000195fe9 <+9>: 49 89 c1 mov %rax,%r9 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x0000000000195fec <+12>: 48 89 c6 mov %rax,%rsi 0x0000000000195fef <+15>: 49 89 c3 mov %rax,%r11 0x0000000000195ff2 <+18>: 48 c1 ee 29 shr $0x29,%rsi 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000195ff6 <+22>: 49 c1 e9 26 shr $0x26,%r9 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x0000000000195ffa <+26>: 49 c1 eb 09 shr $0x9,%r11 0x0000000000195ffe <+30>: 81 e6 00 fc 00 00 and $0xfc00,%esi 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196004 <+36>: 41 81 e1 e0 03 00 00 and $0x3e0,%r9d 0x000000000019600b <+43>: 49 89 ca mov %rcx,%r10 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x000000000019600e <+46>: 49 89 f0 mov %rsi,%r8 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196011 <+49>: 4c 89 ce mov %r9,%rsi 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196014 <+52>: 49 89 c1 mov %rax,%r9 0x0000000000196017 <+55>: 49 c1 e9 23 shr $0x23,%r9 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x000000000019601b <+59>: 4c 09 c6 or %r8,%rsi 
0x000000000019601e <+62>: 49 c1 ea 26 shr $0x26,%r10 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196022 <+66>: 41 83 e1 1f and $0x1f,%r9d 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196026 <+70>: 41 81 e2 e0 03 00 00 and $0x3e0,%r10d 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x000000000019602d <+77>: 49 09 f1 or %rsi,%r9 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x0000000000196030 <+80>: 4c 89 de mov %r11,%rsi 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196033 <+83>: 49 89 c3 mov %rax,%r11 0x0000000000196036 <+86>: 49 c1 eb 06 shr $0x6,%r11 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x000000000019603a <+90>: 81 e6 00 fc 00 00 and $0xfc00,%esi 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196040 <+96>: 48 c1 e8 03 shr $0x3,%rax 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196044 <+100>: 41 81 e3 e0 03 00 00 and $0x3e0,%r11d 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x000000000019604b <+107>: 83 e0 1f and $0x1f,%eax 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x000000000019604e <+110>: 49 09 f3 or %rsi,%r11 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196051 <+113>: 49 09 c3 or %rax,%r11 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x0000000000196054 <+116>: 48 89 c8 mov %rcx,%rax 0x0000000000196057 <+119>: 48 c1 e8 29 shr $0x29,%rax 0x000000000019605b <+123>: 25 00 fc 00 00 and $0xfc00,%eax 0x0000000000196060 <+128>: 48 89 c6 mov %rax,%rsi 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196063 <+131>: 4c 89 d0 mov %r10,%rax 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196066 <+134>: 49 89 ca mov %rcx,%r10 0x0000000000196069 <+137>: 49 c1 ea 23 shr $0x23,%r10 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x000000000019606d <+141>: 48 09 f0 or %rsi,%rax 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196070 <+144>: 41 83 e2 1f and $0x1f,%r10d 0x0000000000196074 <+148>: 49 09 c2 or %rax,%r10 1315 uint64_t rl = pkpx_2(b->u64[1], b->u64[0]); 0x0000000000196077 <+151>: 48 8b 02 mov (%rdx),%rax 0x000000000019607a <+154>: 48 8b 52 08 mov 0x8(%rdx),%rdx 1297 r |= ((a >> (shr + 
6)) & 0x3e0); 0x000000000019607e <+158>: 49 89 d0 mov %rdx,%r8 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x0000000000196081 <+161>: 48 89 d6 mov %rdx,%rsi 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196084 <+164>: 49 c1 e8 26 shr $0x26,%r8 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x0000000000196088 <+168>: 48 c1 ee 29 shr $0x29,%rsi 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x000000000019608c <+172>: 41 81 e0 e0 03 00 00 and $0x3e0,%r8d 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x0000000000196093 <+179>: 48 89 f3 mov %rsi,%rbx 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196096 <+182>: 4c 89 c6 mov %r8,%rsi 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196099 <+185>: 49 89 d0 mov %rdx,%r8 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x000000000019609c <+188>: 81 e3 00 fc 00 00 and $0xfc00,%ebx 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x00000000001960a2 <+194>: 49 c1 e8 23 shr $0x23,%r8 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x00000000001960a6 <+198>: 48 09 de or %rbx,%rsi 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x00000000001960a9 <+201>: 41 83 e0 1f and $0x1f,%r8d 0x00000000001960ad <+205>: 49 09 f0 or %rsi,%r8 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x00000000001960b0 <+208>: 48 89 d6 mov %rdx,%rsi 0x00000000001960b3 <+211>: 48 c1 ee 09 shr $0x9,%rsi 1316 r->u64[1] = rh; 0x00000000001960b7 <+215>: 49 c1 e1 30 shl $0x30,%r9 0x00000000001960bb <+219>: 49 c1 e3 20 shl $0x20,%r11 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x00000000001960bf <+223>: 48 89 f3 mov %rsi,%rbx 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x00000000001960c2 <+226>: 48 89 d6 mov %rdx,%rsi 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x00000000001960c5 <+229>: 48 c1 ea 03 shr $0x3,%rdx 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x00000000001960c9 <+233>: 48 c1 ee 06 shr $0x6,%rsi 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x00000000001960cd <+237>: 81 e3 00 fc 00 00 and $0xfc00,%ebx 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x00000000001960d3 <+243>: 83 e2 1f and $0x1f,%edx 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x00000000001960d6 <+246>: 81 e6 
e0 03 00 00 and $0x3e0,%esi 1316 r->u64[1] = rh; 0x00000000001960dc <+252>: 49 c1 e2 10 shl $0x10,%r10 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x00000000001960e0 <+256>: 48 09 de or %rbx,%rsi 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x00000000001960e3 <+259>: 48 89 c3 mov %rax,%rbx 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x00000000001960e6 <+262>: 48 09 f2 or %rsi,%rdx 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x00000000001960e9 <+265>: 48 89 c6 mov %rax,%rsi 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x00000000001960ec <+268>: 48 c1 eb 29 shr $0x29,%rbx 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x00000000001960f0 <+272>: 48 c1 ee 26 shr $0x26,%rsi 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x00000000001960f4 <+276>: 81 e3 00 fc 00 00 and $0xfc00,%ebx 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x00000000001960fa <+282>: 81 e6 e0 03 00 00 and $0x3e0,%esi 1296 r = ((a >> (shr + 9)) & 0xfc00); 0x0000000000196100 <+288>: 48 89 dd mov %rbx,%rbp 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x0000000000196103 <+291>: 48 89 f3 mov %rsi,%rbx 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196106 <+294>: 48 89 c6 mov %rax,%rsi 0x0000000000196109 <+297>: 48 c1 ee 23 shr $0x23,%rsi 1297 r |= ((a >> (shr + 6)) & 0x3e0); 0x000000000019610d <+301>: 48 09 eb or %rbp,%rbx 1298 r |= ((a >> (shr + 3)) & 0x1f); 0x0000000000196110 <+304>: 83 e6 1f and $0x1f,%esi 0x0000000000196113 <+307>: 48 09 de or %rbx,%rsi 1316 r->u64[1] = rh; 0x0000000000196116 <+310>: 48 89 cb mov %rcx,%rbx 0x0000000000196119 <+313>: 48 c1 eb 09 shr $0x9,%rbx 0x000000000019611d <+317>: 81 e3 00 fc 00 00 and $0xfc00,%ebx 0x0000000000196123 <+323>: 48 89 dd mov %rbx,%rbp 0x0000000000196126 <+326>: 48 89 cb mov %rcx,%rbx 0x0000000000196129 <+329>: 48 c1 e9 03 shr $0x3,%rcx 0x000000000019612d <+333>: 48 c1 eb 06 shr $0x6,%rbx 0x0000000000196131 <+337>: 83 e1 1f and $0x1f,%ecx 0x0000000000196134 <+340>: 81 e3 e0 03 00 00 and $0x3e0,%ebx 0x000000000019613a <+346>: 48 09 eb or %rbp,%rbx 0x000000000019613d <+349>: 48 09 d9 or %rbx,%rcx 0x0000000000196140 
<+352>: 4c 09 c9 or %r9,%rcx 0x0000000000196143 <+355>: 4c 09 d9 or %r11,%rcx 0x0000000000196146 <+358>: 4c 09 d1 or %r10,%rcx 0x0000000000196149 <+361>: 48 89 4f 08 mov %rcx,0x8(%rdi) 1317 r->u64[0] = rl; 0x000000000019614d <+365>: 48 89 c1 mov %rax,%rcx 0x0000000000196150 <+368>: 48 c1 e9 09 shr $0x9,%rcx 0x0000000000196154 <+372>: 81 e1 00 fc 00 00 and $0xfc00,%ecx 0x000000000019615a <+378>: 49 89 c9 mov %rcx,%r9 0x000000000019615d <+381>: 48 89 c1 mov %rax,%rcx 0x0000000000196160 <+384>: 48 c1 e9 06 shr $0x6,%rcx 0x0000000000196164 <+388>: 48 c1 e8 03 shr $0x3,%rax 0x0000000000196168 <+392>: 49 c1 e0 30 shl $0x30,%r8 0x000000000019616c <+396>: 81 e1 e0 03 00 00 and $0x3e0,%ecx 0x0000000000196172 <+402>: 83 e0 1f and $0x1f,%eax 0x0000000000196175 <+405>: 48 c1 e2 20 shl $0x20,%rdx 0x0000000000196179 <+409>: 4c 09 c9 or %r9,%rcx 0x000000000019617c <+412>: 48 09 c8 or %rcx,%rax 0x000000000019617f <+415>: 48 89 f1 mov %rsi,%rcx 0x0000000000196182 <+418>: 4c 09 c0 or %r8,%rax 0x0000000000196185 <+421>: 48 c1 e1 10 shl $0x10,%rcx 0x0000000000196189 <+425>: 48 09 d0 or %rdx,%rax 0x000000000019618c <+428>: 48 09 c8 or %rcx,%rax 1318 } 0x000000000019618f <+431>: 5b pop %rbx 1317 r->u64[0] = rl; 0x0000000000196190 <+432>: 48 89 07 mov %rax,(%rdi) 1318 } 0x0000000000196193 <+435>: 5d pop %rbp 0x0000000000196194 <+436>: c3 retq End of assembler dump. 
2) Only assembly code: Dump of assembler code for function helper_vpkpx: 0x0000000000195fe0 <+0>: 55 push %rbp 0x0000000000195fe1 <+1>: 53 push %rbx 0x0000000000195fe2 <+2>: 48 8b 46 08 mov 0x8(%rsi),%rax 0x0000000000195fe6 <+6>: 48 8b 0e mov (%rsi),%rcx 0x0000000000195fe9 <+9>: 49 89 c1 mov %rax,%r9 0x0000000000195fec <+12>: 48 89 c6 mov %rax,%rsi 0x0000000000195fef <+15>: 49 89 c3 mov %rax,%r11 0x0000000000195ff2 <+18>: 48 c1 ee 29 shr $0x29,%rsi 0x0000000000195ff6 <+22>: 49 c1 e9 26 shr $0x26,%r9 0x0000000000195ffa <+26>: 49 c1 eb 09 shr $0x9,%r11 0x0000000000195ffe <+30>: 81 e6 00 fc 00 00 and $0xfc00,%esi 0x0000000000196004 <+36>: 41 81 e1 e0 03 00 00 and $0x3e0,%r9d 0x000000000019600b <+43>: 49 89 ca mov %rcx,%r10 0x000000000019600e <+46>: 49 89 f0 mov %rsi,%r8 0x0000000000196011 <+49>: 4c 89 ce mov %r9,%rsi 0x0000000000196014 <+52>: 49 89 c1 mov %rax,%r9 0x0000000000196017 <+55>: 49 c1 e9 23 shr $0x23,%r9 0x000000000019601b <+59>: 4c 09 c6 or %r8,%rsi 0x000000000019601e <+62>: 49 c1 ea 26 shr $0x26,%r10 0x0000000000196022 <+66>: 41 83 e1 1f and $0x1f,%r9d 0x0000000000196026 <+70>: 41 81 e2 e0 03 00 00 and $0x3e0,%r10d 0x000000000019602d <+77>: 49 09 f1 or %rsi,%r9 0x0000000000196030 <+80>: 4c 89 de mov %r11,%rsi 0x0000000000196033 <+83>: 49 89 c3 mov %rax,%r11 0x0000000000196036 <+86>: 49 c1 eb 06 shr $0x6,%r11 0x000000000019603a <+90>: 81 e6 00 fc 00 00 and $0xfc00,%esi 0x0000000000196040 <+96>: 48 c1 e8 03 shr $0x3,%rax 0x0000000000196044 <+100>: 41 81 e3 e0 03 00 00 and $0x3e0,%r11d 0x000000000019604b <+107>: 83 e0 1f and $0x1f,%eax 0x000000000019604e <+110>: 49 09 f3 or %rsi,%r11 0x0000000000196051 <+113>: 49 09 c3 or %rax,%r11 0x0000000000196054 <+116>: 48 89 c8 mov %rcx,%rax 0x0000000000196057 <+119>: 48 c1 e8 29 shr $0x29,%rax 0x000000000019605b <+123>: 25 00 fc 00 00 and $0xfc00,%eax 0x0000000000196060 <+128>: 48 89 c6 mov %rax,%rsi 0x0000000000196063 <+131>: 4c 89 d0 mov %r10,%rax 0x0000000000196066 <+134>: 49 89 ca mov %rcx,%r10 0x0000000000196069 
<+137>: 49 c1 ea 23 shr $0x23,%r10 0x000000000019606d <+141>: 48 09 f0 or %rsi,%rax 0x0000000000196070 <+144>: 41 83 e2 1f and $0x1f,%r10d 0x0000000000196074 <+148>: 49 09 c2 or %rax,%r10 0x0000000000196077 <+151>: 48 8b 02 mov (%rdx),%rax 0x000000000019607a <+154>: 48 8b 52 08 mov 0x8(%rdx),%rdx 0x000000000019607e <+158>: 49 89 d0 mov %rdx,%r8 0x0000000000196081 <+161>: 48 89 d6 mov %rdx,%rsi 0x0000000000196084 <+164>: 49 c1 e8 26 shr $0x26,%r8 0x0000000000196088 <+168>: 48 c1 ee 29 shr $0x29,%rsi 0x000000000019608c <+172>: 41 81 e0 e0 03 00 00 and $0x3e0,%r8d 0x0000000000196093 <+179>: 48 89 f3 mov %rsi,%rbx 0x0000000000196096 <+182>: 4c 89 c6 mov %r8,%rsi 0x0000000000196099 <+185>: 49 89 d0 mov %rdx,%r8 0x000000000019609c <+188>: 81 e3 00 fc 00 00 and $0xfc00,%ebx 0x00000000001960a2 <+194>: 49 c1 e8 23 shr $0x23,%r8 0x00000000001960a6 <+198>: 48 09 de or %rbx,%rsi 0x00000000001960a9 <+201>: 41 83 e0 1f and $0x1f,%r8d 0x00000000001960ad <+205>: 49 09 f0 or %rsi,%r8 0x00000000001960b0 <+208>: 48 89 d6 mov %rdx,%rsi 0x00000000001960b3 <+211>: 48 c1 ee 09 shr $0x9,%rsi 0x00000000001960b7 <+215>: 49 c1 e1 30 shl $0x30,%r9 0x00000000001960bb <+219>: 49 c1 e3 20 shl $0x20,%r11 0x00000000001960bf <+223>: 48 89 f3 mov %rsi,%rbx 0x00000000001960c2 <+226>: 48 89 d6 mov %rdx,%rsi 0x00000000001960c5 <+229>: 48 c1 ea 03 shr $0x3,%rdx 0x00000000001960c9 <+233>: 48 c1 ee 06 shr $0x6,%rsi 0x00000000001960cd <+237>: 81 e3 00 fc 00 00 and $0xfc00,%ebx 0x00000000001960d3 <+243>: 83 e2 1f and $0x1f,%edx 0x00000000001960d6 <+246>: 81 e6 e0 03 00 00 and $0x3e0,%esi 0x00000000001960dc <+252>: 49 c1 e2 10 shl $0x10,%r10 0x00000000001960e0 <+256>: 48 09 de or %rbx,%rsi 0x00000000001960e3 <+259>: 48 89 c3 mov %rax,%rbx 0x00000000001960e6 <+262>: 48 09 f2 or %rsi,%rdx 0x00000000001960e9 <+265>: 48 89 c6 mov %rax,%rsi 0x00000000001960ec <+268>: 48 c1 eb 29 shr $0x29,%rbx 0x00000000001960f0 <+272>: 48 c1 ee 26 shr $0x26,%rsi 0x00000000001960f4 <+276>: 81 e3 00 fc 00 00 and $0xfc00,%ebx 
0x00000000001960fa <+282>: 81 e6 e0 03 00 00 and $0x3e0,%esi 0x0000000000196100 <+288>: 48 89 dd mov %rbx,%rbp 0x0000000000196103 <+291>: 48 89 f3 mov %rsi,%rbx 0x0000000000196106 <+294>: 48 89 c6 mov %rax,%rsi 0x0000000000196109 <+297>: 48 c1 ee 23 shr $0x23,%rsi 0x000000000019610d <+301>: 48 09 eb or %rbp,%rbx 0x0000000000196110 <+304>: 83 e6 1f and $0x1f,%esi 0x0000000000196113 <+307>: 48 09 de or %rbx,%rsi 0x0000000000196116 <+310>: 48 89 cb mov %rcx,%rbx 0x0000000000196119 <+313>: 48 c1 eb 09 shr $0x9,%rbx 0x000000000019611d <+317>: 81 e3 00 fc 00 00 and $0xfc00,%ebx 0x0000000000196123 <+323>: 48 89 dd mov %rbx,%rbp 0x0000000000196126 <+326>: 48 89 cb mov %rcx,%rbx 0x0000000000196129 <+329>: 48 c1 e9 03 shr $0x3,%rcx 0x000000000019612d <+333>: 48 c1 eb 06 shr $0x6,%rbx 0x0000000000196131 <+337>: 83 e1 1f and $0x1f,%ecx 0x0000000000196134 <+340>: 81 e3 e0 03 00 00 and $0x3e0,%ebx 0x000000000019613a <+346>: 48 09 eb or %rbp,%rbx 0x000000000019613d <+349>: 48 09 d9 or %rbx,%rcx 0x0000000000196140 <+352>: 4c 09 c9 or %r9,%rcx 0x0000000000196143 <+355>: 4c 09 d9 or %r11,%rcx 0x0000000000196146 <+358>: 4c 09 d1 or %r10,%rcx 0x0000000000196149 <+361>: 48 89 4f 08 mov %rcx,0x8(%rdi) 0x000000000019614d <+365>: 48 89 c1 mov %rax,%rcx 0x0000000000196150 <+368>: 48 c1 e9 09 shr $0x9,%rcx 0x0000000000196154 <+372>: 81 e1 00 fc 00 00 and $0xfc00,%ecx 0x000000000019615a <+378>: 49 89 c9 mov %rcx,%r9 0x000000000019615d <+381>: 48 89 c1 mov %rax,%rcx 0x0000000000196160 <+384>: 48 c1 e9 06 shr $0x6,%rcx 0x0000000000196164 <+388>: 48 c1 e8 03 shr $0x3,%rax 0x0000000000196168 <+392>: 49 c1 e0 30 shl $0x30,%r8 0x000000000019616c <+396>: 81 e1 e0 03 00 00 and $0x3e0,%ecx 0x0000000000196172 <+402>: 83 e0 1f and $0x1f,%eax 0x0000000000196175 <+405>: 48 c1 e2 20 shl $0x20,%rdx 0x0000000000196179 <+409>: 4c 09 c9 or %r9,%rcx 0x000000000019617c <+412>: 48 09 c8 or %rcx,%rax 0x000000000019617f <+415>: 48 89 f1 mov %rsi,%rcx 0x0000000000196182 <+418>: 4c 09 c0 or %r8,%rax 
0x0000000000196185 <+421>: 48 c1 e1 10 shl $0x10,%rcx 0x0000000000196189 <+425>: 48 09 d0 or %rdx,%rax 0x000000000019618c <+428>: 48 09 c8 or %rcx,%rax 0x000000000019618f <+431>: 5b pop %rbx 0x0000000000196190 <+432>: 48 89 07 mov %rax,(%rdi) 0x0000000000196193 <+435>: 5d pop %rbp 0x0000000000196194 <+436>: c3 retq End of assembler dump. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction 2019-08-29 13:34 ` Stefan Brankovic @ 2019-08-29 15:31 ` Richard Henderson 2019-10-16 13:53 ` Stefan Brankovic 0 siblings, 1 reply; 12+ messages in thread From: Richard Henderson @ 2019-08-29 15:31 UTC (permalink / raw) To: Stefan Brankovic, qemu-devel; +Cc: david On 8/29/19 6:34 AM, Stefan Brankovic wrote: > Then I run my performance tests and I got following results(test is calling > vpkpx 100000 times): > > 1) Current helper implementation: ~ 157 ms > > 2) helper implementation you suggested: ~94 ms > > 3) tcg implementation: ~75 ms I assume you tested in a loop. If you have just the one expansion, you'll not see the penalty for the icache expansion. To show the other extreme, you'd want to test as separate sequential invocations. That said, I'd be more interested in a real test case that isn't just calling one instruction over and over. Is there a real test case that shows vpkpx in the top 25 of the profile? With more than 0.5% of runtime? r~ ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction 2019-08-29 15:31 ` Richard Henderson @ 2019-10-16 13:53 ` Stefan Brankovic 0 siblings, 0 replies; 12+ messages in thread From: Stefan Brankovic @ 2019-10-16 13:53 UTC (permalink / raw) To: Richard Henderson, qemu-devel; +Cc: david On 29.8.19. 17:31, Richard Henderson wrote: > On 8/29/19 6:34 AM, Stefan Brankovic wrote: >> Then I run my performance tests and I got following results(test is calling >> vpkpx 100000 times): >> >> 1) Current helper implementation: ~ 157 ms >> >> 2) helper implementation you suggested: ~94 ms >> >> 3) tcg implementation: ~75 ms > I assume you tested in a loop. If you have just the one expansion, you'll not > see the penalty for the icache expansion. To show the other extreme, you'd > want to test as separate sequential invocations. Yes, testing is done in a loop. > > That said, I'd be more interested in a real test case that isn't just calling > one instruction over and over. Is there a real test case that shows vpkpx in > the top 25 of the profile? With more than 0.5% of runtime? > > > r~ I ran an experiment in which I booted Mac OS X 10.4 in QEMU system mode and found that the vpkpx instruction is widely used to display various graphical elements. With that in mind, this performance improvement matters in practice. Also, vpkpx is typically used in a loop to process a large number of pixels at once, which is why measuring the performance of this instruction in a loop should give a good picture of how it performs overall. Kind regards, Stefan ^ permalink raw reply [flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions 2019-08-27 9:37 [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic @ 2019-08-27 9:37 ` Stefan Brankovic 2019-08-27 19:14 ` Richard Henderson 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic 2 siblings, 1 reply; 12+ messages in thread From: Stefan Brankovic @ 2019-08-27 9:37 UTC (permalink / raw) To: qemu-devel; +Cc: richard.henderson, david Optimize the Altivec instruction vclzh (Vector Count Leading Zeros Halfword). This instruction counts the number of leading zero bits of each halfword element in the source register and places the result in the corresponding halfword element of the destination register. Each iteration of the outer for loop performs the count on one doubleword element of the source register vB. In the first iteration, the higher doubleword element of vB is placed in the variable avr, and the count for every halfword element is then obtained with tcg_gen_clzi_i64. Since that operation counts leading zeros over a 64-bit length, the ith halfword element has to be moved to the highest 16 bits of tmp and OR-ed with a mask (to get all ones in the lowest 48 bits) before tcg_gen_clzi_i64 is applied; its result is then moved into the appropriate halfword element of the result. This is done in the inner for loop. After the operation is finished, the result is saved in the appropriate doubleword element of the destination register vD. The same sequence of operations is then applied to the lower doubleword element of vB. Optimize the Altivec instruction vclzb (Vector Count Leading Zeros Byte). This instruction counts the number of leading zero bits of each byte element in the source register and places the result in the corresponding byte element of the destination register.
Each iteration of the outer for loop performs the count on one doubleword element of the source register vB. In the first iteration, the higher doubleword element of vB is placed in the variable avr, and the count for every byte element is then obtained with tcg_gen_clzi_i64. Since that operation counts leading zeros over a 64-bit length, the ith byte element has to be moved to the highest 8 bits of the variable tmp and OR-ed with a mask (to get all ones in the lowest 56 bits) before tcg_gen_clzi_i64 is applied; its result is then moved into the appropriate byte element of the result. This is done in the inner for loop. After the operation is finished, the result is saved in the appropriate doubleword element of the destination register vD. The same sequence of operations is then applied to the lower doubleword element of vB. Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com> --- target/ppc/helper.h | 2 - target/ppc/int_helper.c | 9 --- target/ppc/translate/vmx-impl.inc.c | 136 +++++++++++++++++++++++++++++++++++- 3 files changed, 134 insertions(+), 13 deletions(-) diff --git a/target/ppc/helper.h b/target/ppc/helper.h index 940a115..39c202f 100644 --- a/target/ppc/helper.h +++ b/target/ppc/helper.h @@ -307,8 +307,6 @@ DEF_HELPER_4(vcfsx, void, env, avr, avr, i32) DEF_HELPER_4(vctuxs, void, env, avr, avr, i32) DEF_HELPER_4(vctsxs, void, env, avr, avr, i32) -DEF_HELPER_2(vclzb, void, avr, avr) -DEF_HELPER_2(vclzh, void, avr, avr) DEF_HELPER_2(vctzb, void, avr, avr) DEF_HELPER_2(vctzh, void, avr, avr) DEF_HELPER_2(vctzw, void, avr, avr) diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c index 9ff3b03..65a9387 100644 --- a/target/ppc/int_helper.c +++ b/target/ppc/int_helper.c @@ -1796,15 +1796,6 @@ VUPK(lsw, s64, s32, UPKLO) } \ } -#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8) -#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16) - -VGENERIC_DO(clzb, u8) -VGENERIC_DO(clzh, u16) - -#undef clzb -#undef clzh - #define ctzb(v) ((v) ? ctz32(v) : 8) #define ctzh(v) ((v) ?
ctz32(v) : 16) #define ctzw(v) ctz32((v)) diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c index 456666a..e8a0fb6 100644 --- a/target/ppc/translate/vmx-impl.inc.c +++ b/target/ppc/translate/vmx-impl.inc.c @@ -840,6 +840,138 @@ static void trans_vgbbd(DisasContext *ctx) } /* + * vclzb VRT,VRB - Vector Count Leading Zeros Byte + * + * Counting the number of leading zero bits of each byte element in source + * register and placing result in appropriate byte element of destination + * register. + */ +static void trans_vclzb(DisasContext *ctx) +{ + int VT = rD(ctx->opcode); + int VB = rB(ctx->opcode); + TCGv_i64 avr = tcg_temp_new_i64(); + TCGv_i64 result = tcg_temp_new_i64(); + TCGv_i64 result1 = tcg_temp_new_i64(); + TCGv_i64 result2 = tcg_temp_new_i64(); + TCGv_i64 tmp = tcg_temp_new_i64(); + TCGv_i64 mask = tcg_const_i64(0xffffffffffffffULL); + int i, j; + + for (i = 0; i < 2; i++) { + if (i == 0) { + /* Get high doubleword of vB in avr. */ + get_avr64(avr, VB, true); + } else { + /* Get low doubleword of vB in avr. */ + get_avr64(avr, VB, false); + } + /* + * Perform count for every byte element using tcg_gen_clzi_i64. + * Since it counts leading zeros on 64-bit length, we have to move the + * ith byte element to the highest 8 bits of tmp, or it with a mask (so + * we get all ones in the lowest 56 bits), then perform tcg_gen_clzi_i64 + * and move its result into the appropriate byte element of result. + */ + tcg_gen_shli_i64(tmp, avr, 56); + tcg_gen_or_i64(tmp, tmp, mask); + tcg_gen_clzi_i64(result, tmp, 64); + for (j = 1; j < 7; j++) { + tcg_gen_shli_i64(tmp, avr, (7 - j) * 8); + tcg_gen_or_i64(tmp, tmp, mask); + tcg_gen_clzi_i64(tmp, tmp, 64); + tcg_gen_deposit_i64(result, result, tmp, j * 8, 8); + } + tcg_gen_or_i64(tmp, avr, mask); + tcg_gen_clzi_i64(tmp, tmp, 64); + tcg_gen_deposit_i64(result, result, tmp, 56, 8); + if (i == 0) { + /* Place result in high doubleword element of vD.
*/ + tcg_gen_mov_i64(result1, result); + } else { + /* Place result in low doubleword element of vD. */ + tcg_gen_mov_i64(result2, result); + } + } + + set_avr64(VT, result1, true); + set_avr64(VT, result2, false); + + tcg_temp_free_i64(avr); + tcg_temp_free_i64(result); + tcg_temp_free_i64(result1); + tcg_temp_free_i64(result2); + tcg_temp_free_i64(tmp); + tcg_temp_free_i64(mask); +} + +/* + * vclzh VRT,VRB - Vector Count Leading Zeros Halfword + * + * Counting the number of leading zero bits of each halfword element in source + * register and placing result in appropriate halfword element of destination + * register. + */ +static void trans_vclzh(DisasContext *ctx) +{ + int VT = rD(ctx->opcode); + int VB = rB(ctx->opcode); + TCGv_i64 avr = tcg_temp_new_i64(); + TCGv_i64 result = tcg_temp_new_i64(); + TCGv_i64 result1 = tcg_temp_new_i64(); + TCGv_i64 result2 = tcg_temp_new_i64(); + TCGv_i64 tmp = tcg_temp_new_i64(); + TCGv_i64 mask = tcg_const_i64(0xffffffffffffULL); + int i, j; + + for (i = 0; i < 2; i++) { + if (i == 0) { + /* Get high doubleword element of vB in avr. */ + get_avr64(avr, VB, true); + } else { + /* Get low doubleword element of vB in avr. */ + get_avr64(avr, VB, false); + } + /* + * Perform count for every halfword element using tcg_gen_clzi_i64. + * Since it counts leading zeros on 64-bit length, we have to move the + * ith halfword element to the highest 16 bits of tmp, or it with a mask + * (so we get all ones in the lowest 48 bits), then perform + * tcg_gen_clzi_i64 and move its result into the appropriate halfword + * element of result.
+ */ + tcg_gen_shli_i64(tmp, avr, 48); + tcg_gen_or_i64(tmp, tmp, mask); + tcg_gen_clzi_i64(result, tmp, 64); + for (j = 1; j < 3; j++) { + tcg_gen_shli_i64(tmp, avr, (3 - j) * 16); + tcg_gen_or_i64(tmp, tmp, mask); + tcg_gen_clzi_i64(tmp, tmp, 64); + tcg_gen_deposit_i64(result, result, tmp, j * 16, 16); + } + tcg_gen_or_i64(tmp, avr, mask); + tcg_gen_clzi_i64(tmp, tmp, 64); + tcg_gen_deposit_i64(result, result, tmp, 48, 16); + if (i == 0) { + /* Place result in high doubleword element of vD. */ + tcg_gen_mov_i64(result1, result); + } else { + /* Place result in low doubleword element of vD. */ + tcg_gen_mov_i64(result2, result); + } + } + + set_avr64(VT, result1, true); + set_avr64(VT, result2, false); + + tcg_temp_free_i64(avr); + tcg_temp_free_i64(result); + tcg_temp_free_i64(result1); + tcg_temp_free_i64(result2); + tcg_temp_free_i64(tmp); + tcg_temp_free_i64(mask); +} + +/* * vclzw VRT,VRB - Vector Count Leading Zeros Word * * Counting the number of leading zero bits of each word element in source @@ -1404,8 +1536,8 @@ GEN_VAFORM_PAIRED(vmsumshm, vmsumshs, 20) GEN_VAFORM_PAIRED(vsel, vperm, 21) GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23) -GEN_VXFORM_NOA(vclzb, 1, 28) -GEN_VXFORM_NOA(vclzh, 1, 29) +GEN_VXFORM_TRANS(vclzb, 1, 28) +GEN_VXFORM_TRANS(vclzh, 1, 29) GEN_VXFORM_TRANS(vclzw, 1, 30) GEN_VXFORM_TRANS(vclzd, 1, 31) GEN_VXFORM_NOA_2(vnegw, 1, 24, 6) -- 2.7.4 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic @ 2019-08-27 19:14 ` Richard Henderson 0 siblings, 0 replies; 12+ messages in thread From: Richard Henderson @ 2019-08-27 19:14 UTC (permalink / raw) To: Stefan Brankovic, qemu-devel; +Cc: david On 8/27/19 2:37 AM, Stefan Brankovic wrote: > + for (i = 0; i < 2; i++) { > + if (i == 0) { > + /* Get high doubleword of vB in avr. */ > + get_avr64(avr, VB, true); > + } else { > + /* Get low doubleword of vB in avr. */ > + get_avr64(avr, VB, false); > + } > + /* > + * Perform count for every byte element using tcg_gen_clzi_i64. > + * Since it counts leading zeros on 64 bit lenght, we have to move > + * ith byte element to highest 8 bits of tmp, or it with mask(so we get > + * all ones in lowest 56 bits), then perform tcg_gen_clzi_i64 and move > + * it's result in appropriate byte element of result. > + */ > + tcg_gen_shli_i64(tmp, avr, 56); > + tcg_gen_or_i64(tmp, tmp, mask); > + tcg_gen_clzi_i64(result, tmp, 64); > + for (j = 1; j < 7; j++) { > + tcg_gen_shli_i64(tmp, avr, (7 - j) * 8); > + tcg_gen_or_i64(tmp, tmp, mask); > + tcg_gen_clzi_i64(tmp, tmp, 64); > + tcg_gen_deposit_i64(result, result, tmp, j * 8, 8); > + } > + tcg_gen_or_i64(tmp, avr, mask); > + tcg_gen_clzi_i64(tmp, tmp, 64); > + tcg_gen_deposit_i64(result, result, tmp, 56, 8); > + if (i == 0) { > + /* Place result in high doubleword element of vD. */ > + tcg_gen_mov_i64(result1, result); > + } else { > + /* Place result in low doubleword element of vD. */ > + tcg_gen_mov_i64(result2, result); > + } > + } By my count, 60 non-move operations. This is too many to inline. Moreover, unlike vpkpx, which I can see being used for graphics format conversion in old operating systems (who else uses 16-bit graphics formats now?), I would be very surprised to see vclzb or vclzh being used frequently. 
How did you determine that these instructions needed optimization? I can see wanting to apply --- a/target/ppc/int_helper.c +++ b/target/ppc/int_helper.c @@ -1817,8 +1817,8 @@ VUPK(lsw, s64, s32, UPKLO) } \ } -#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8) -#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16) +#define clzb(v) clz32(((uint32_t)(v) << 24) | 0x00ffffffu) +#define clzh(v) clz32(((uint32_t)(v) << 16) | 0x0000ffffu) VGENERIC_DO(clzb, u8) VGENERIC_DO(clzh, u16) as the cmov instruction required by the current implementation is going to be quite a bit slower than the OR instruction. And similarly for ctzb() and ctzh(). r~ ^ permalink raw reply [flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions 2019-08-27 9:37 [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic @ 2019-08-27 9:37 ` Stefan Brankovic 2019-08-27 19:19 ` Richard Henderson 2 siblings, 1 reply; 12+ messages in thread From: Stefan Brankovic @ 2019-08-27 9:37 UTC (permalink / raw) To: qemu-devel; +Cc: richard.henderson, david Since I found these two instructions already implemented with TCG, I refactored them so that they are consistent with the other similar implementations introduced in this series. Also, a new dual macro, GEN_VXFORM_TRANS_DUAL, is added. This macro is used when one instruction is realized with direct translation and the other with a helper. Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com> --- target/ppc/translate/vmx-impl.inc.c | 66 +++++++++++++++++++++---------------- 1 file changed, 37 insertions(+), 29 deletions(-) diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c index e8a0fb6..6af9c73 100644 --- a/target/ppc/translate/vmx-impl.inc.c +++ b/target/ppc/translate/vmx-impl.inc.c @@ -350,6 +350,28 @@ static void glue(gen_, name0##_##name1)(DisasContext *ctx) \ } \ } +/* + * We use this macro if one instruction is realized with direct + * translation, and the other with a helper.
+ */ +#define GEN_VXFORM_TRANS_DUAL(name0, flg0, flg2_0, name1, flg1, flg2_1)\ +static void glue(gen_, name0##_##name1)(DisasContext *ctx) \ +{ \ + if ((Rc(ctx->opcode) == 0) && \ + ((ctx->insns_flags & flg0) || (ctx->insns_flags2 & flg2_0))) { \ + if (unlikely(!ctx->altivec_enabled)) { \ + gen_exception(ctx, POWERPC_EXCP_VPU); \ + return; \ + } \ + trans_##name0(ctx); \ + } else if ((Rc(ctx->opcode) == 1) && \ + ((ctx->insns_flags & flg1) || (ctx->insns_flags2 & flg2_1))) { \ + gen_##name1(ctx); \ + } else { \ + gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL); \ + } \ +} + /* Adds support to provide invalid mask */ #define GEN_VXFORM_DUAL_EXT(name0, flg0, flg2_0, inval0, \ name1, flg1, flg2_1, inval1) \ @@ -431,20 +453,13 @@ GEN_VXFORM(vmrglb, 6, 4); GEN_VXFORM(vmrglh, 6, 5); GEN_VXFORM(vmrglw, 6, 6); -static void gen_vmrgew(DisasContext *ctx) +static void trans_vmrgew(DisasContext *ctx) { - TCGv_i64 tmp; - TCGv_i64 avr; - int VT, VA, VB; - if (unlikely(!ctx->altivec_enabled)) { - gen_exception(ctx, POWERPC_EXCP_VPU); - return; - } - VT = rD(ctx->opcode); - VA = rA(ctx->opcode); - VB = rB(ctx->opcode); - tmp = tcg_temp_new_i64(); - avr = tcg_temp_new_i64(); + int VT = rD(ctx->opcode); + int VA = rA(ctx->opcode); + int VB = rB(ctx->opcode); + TCGv_i64 tmp = tcg_temp_new_i64(); + TCGv_i64 avr = tcg_temp_new_i64(); get_avr64(avr, VB, true); tcg_gen_shri_i64(tmp, avr, 32); @@ -462,21 +477,14 @@ static void gen_vmrgew(DisasContext *ctx) tcg_temp_free_i64(avr); } -static void gen_vmrgow(DisasContext *ctx) +static void trans_vmrgow(DisasContext *ctx) { - TCGv_i64 t0, t1; - TCGv_i64 avr; - int VT, VA, VB; - if (unlikely(!ctx->altivec_enabled)) { - gen_exception(ctx, POWERPC_EXCP_VPU); - return; - } - VT = rD(ctx->opcode); - VA = rA(ctx->opcode); - VB = rB(ctx->opcode); - t0 = tcg_temp_new_i64(); - t1 = tcg_temp_new_i64(); - avr = tcg_temp_new_i64(); + int VT = rD(ctx->opcode); + int VA = rA(ctx->opcode); + int VB = rB(ctx->opcode); + TCGv_i64 t0 = tcg_temp_new_i64(); + 
TCGv_i64 t1 = tcg_temp_new_i64(); + TCGv_i64 avr = tcg_temp_new_i64(); get_avr64(t0, VB, true); get_avr64(t1, VA, true); @@ -1165,14 +1173,14 @@ GEN_VXFORM_ENV(vminfp, 5, 17); GEN_VXFORM_HETRO(vextublx, 6, 24) GEN_VXFORM_HETRO(vextuhlx, 6, 25) GEN_VXFORM_HETRO(vextuwlx, 6, 26) -GEN_VXFORM_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207, +GEN_VXFORM_TRANS_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207, vextuwlx, PPC_NONE, PPC2_ISA300) GEN_VXFORM_HETRO(vextubrx, 6, 28) GEN_VXFORM_HETRO(vextuhrx, 6, 29) GEN_VXFORM_HETRO(vextuwrx, 6, 30) GEN_VXFORM_TRANS(lvsl, 6, 31) GEN_VXFORM_TRANS(lvsr, 6, 32) -GEN_VXFORM_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, \ +GEN_VXFORM_TRANS_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, vextuwrx, PPC_NONE, PPC2_ISA300) #define GEN_VXRFORM1(opname, name, str, opc2, opc3) \ -- 2.7.4 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic @ 2019-08-27 19:19 ` Richard Henderson 2019-08-28 0:42 ` David Gibson 0 siblings, 1 reply; 12+ messages in thread From: Richard Henderson @ 2019-08-27 19:19 UTC (permalink / raw) To: Stefan Brankovic, qemu-devel; +Cc: david On 8/27/19 2:37 AM, Stefan Brankovic wrote: > Since I found this two instructions implemented with tcg, I refactored > them so they are consistent with other similar implementations that > I introduced in this patch. > > Also, a new dual macro GEN_VXFORM_TRANS_DUAL is added. This macro is > used if one instruction is realized with direct translation, and second > one with a helper. > > Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com> > --- > target/ppc/translate/vmx-impl.inc.c | 66 +++++++++++++++++++++---------------- > 1 file changed, 37 insertions(+), 29 deletions(-) Reviewed-by: Richard Henderson <richard.henderson@linaro.org> r~ ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions 2019-08-27 19:19 ` Richard Henderson @ 2019-08-28 0:42 ` David Gibson 0 siblings, 0 replies; 12+ messages in thread From: David Gibson @ 2019-08-28 0:42 UTC (permalink / raw) To: Richard Henderson; +Cc: Stefan Brankovic, qemu-devel [-- Attachment #1: Type: text/plain, Size: 1020 bytes --] On Tue, Aug 27, 2019 at 12:19:27PM -0700, Richard Henderson wrote: > On 8/27/19 2:37 AM, Stefan Brankovic wrote: > > Since I found this two instructions implemented with tcg, I refactored > > them so they are consistent with other similar implementations that > > I introduced in this patch. > > > > Also, a new dual macro GEN_VXFORM_TRANS_DUAL is added. This macro is > > used if one instruction is realized with direct translation, and second > > one with a helper. > > > > Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com> > > --- > > target/ppc/translate/vmx-impl.inc.c | 66 +++++++++++++++++++++---------------- > > 1 file changed, 37 insertions(+), 29 deletions(-) > > Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Applied to ppc-for-4.2, thanks. > > > r~ > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2019-10-16 13:56 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-08-27 9:37 [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic 2019-08-27 18:52 ` Richard Henderson 2019-08-27 19:04 ` BALATON Zoltan 2019-08-29 13:34 ` Stefan Brankovic 2019-08-29 15:31 ` Richard Henderson 2019-10-16 13:53 ` Stefan Brankovic 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic 2019-08-27 19:14 ` Richard Henderson 2019-08-27 9:37 ` [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic 2019-08-27 19:19 ` Richard Henderson 2019-08-28 0:42 ` David Gibson