From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4BCC611C.3020202@twiddle.net>
Date: Mon, 19 Apr 2010 08:56:44 -0500
From: Richard Henderson
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [PATCH 05/21] tcg-i386: Tidy bswap operations.
References: <20100418221302.GA26784@volta.aurel32.net>
In-Reply-To: <20100418221302.GA26784@volta.aurel32.net>
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Aurelien Jarno
Cc: qemu-devel@nongnu.org

On 04/18/2010 05:13 PM, Aurelien Jarno wrote:
> On Tue, Apr 13, 2010 at 04:33:59PM -0700, Richard Henderson wrote:
>> Define OPC_BSWAP.  Factor opcode emission to separate functions.
>> Use bswap+shift to implement 16-bit swap instead of a rolw; this
>> gets the proper zero-extension required by INDEX_op_bswap16_i32.
>
> This is not required by INDEX_op_bswap16_i32. What is need is that the
> value in the input register has the 16 upper bits set to 0.

Ah.

> Considering
> that, the rolw instruction is faster than bswap + shift.

Well, no, it isn't.
static inline int test_rolw(unsigned short *s)
{
    int i, start, end;
    asm volatile("rdtsc\n\t"
                 "movl %%eax, %1\n\t"
                 "movzwl %3,%2\n\t"
                 "rolw $8, %w2\n\t"
                 "addl $1,%2\n\t"
                 "rdtsc"
                 : "=&a"(end), "=r"(start), "=r"(i)
                 : "m"(*s)
                 : "edx");
    return end - start;
}

static inline int test_bswap(unsigned short *s)
{
    int i, start, end;
    asm volatile("rdtsc\n\t"
                 "movl %%eax, %1\n\t"
                 "movzwl %3,%2\n\t"
                 "bswap %2\n\t"
                 "shl $16,%2\n\t"
                 "addl $1,%2\n\t"
                 "rdtsc"
                 : "=&a"(end), "=r"(start), "=r"(i)
                 : "m"(*s)
                 : "edx");
    return end - start;
}

model name : Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
rolw   60 60 72 60 60 72 60 60 72 60
bswap  60 60 60 60 60 60 60 60 60 60

model name : Dual-Core AMD Opteron(tm) Processor 1210
rolw   9 10  9  9  8  8  8  8  8  8
bswap  9  9  8  8  8  8  8  8  8  8

The rolw sequence isn't ever faster, and it's more unstable, likely
due to the partial register stall I mentioned.

I will grant that the rolw sequence is smaller, and I can adjust this
patch to use that sequence if you wish.

r~