Date: Mon, 5 Sep 2016 16:08:06 +0100
From: "Dr. David Alan Gilbert"
Message-ID: <20160905150806.GD22496@work-vm>
References: <1472496380-19706-1-git-send-email-rth@twiddle.net>
In-Reply-To: <1472496380-19706-1-git-send-email-rth@twiddle.net>
Subject: Re: [Qemu-devel] [PATCH v3 0/9] Improve buffer_is_zero
To: Richard Henderson
Cc: qemu-devel@nongnu.org, pbonzini@redhat.com, vijay.kilari@gmail.com

* Richard Henderson (rth@twiddle.net) wrote:

Have you considered contributing something similar to this to glibc?
I filed https://sourceware.org/bugzilla/show_bug.cgi?id=19920 a while back
suggesting it would be useful to have it in libc to be used by things other
than just qemu.

Dave

> Changes from v2 to v3:
>
> * Unit testing. This includes having x86 attempt all versions of
>   the accelerator that will run on the hardware. Thus an avx2 host
>   will run the basic test 5 times (1.5sec on my laptop).
>
> * Drop the ppc and aarch64 specializations. I have improved the
>   basic integer version to the point that those vectorized versions
>   are not a win.
>
>   In the case of my aarch64 mustang, the integer version is 4 times
>   faster than the neon version that I delete. With effort I was
>   able to rewrite the neon version to come to within a factor of 1.1,
>   but it remained slower than the integer. To be fair, gcc6 makes
>   very good use of ldp, so the integer path is *also* loading 16 bytes
>   per insn.
>
>   I can forward my standalone aarch64 benchmark if anyone is interested.
>
>   Note however that at least the avx2 acceleration is still very much
>   a win, being about 3 times faster on my laptop. Of course, it's
>   handling 4 times as much data per loop as the integer version, so
>   one can still see the overhead caused by using vector insns.
>
>   For grins I wrote an avx512 version, if someone has a skylake upon
>   which to test and benchmark. That requires additional configure
>   checks, so I didn't bother to include it here.
>
>
> r~
>
>
> Richard Henderson (9):
>   cutils: Move buffer_is_zero and subroutines to a new file
>   cutils: Remove SPLAT macro
>   cutils: Export only buffer_is_zero
>   cutils: Rearrange buffer_is_zero acceleration
>   cutils: Add test for buffer_is_zero
>   cutils: Add generic prefetch
>   cutils: Rewrite x86 buffer zero checking
>   cutils: Remove aarch64 buffer zero checking
>   cutils: Remove ppc buffer zero checking
>
>  configure                 |  21 +--
>  include/qemu/cutils.h     |   3 +-
>  migration/ram.c           |   2 +-
>  migration/rdma.c          |   5 +-
>  tests/Makefile.include    |   3 +
>  tests/test-bufferiszero.c |  78 +++++++++++
>  util/Makefile.objs        |   1 +
>  util/bufferiszero.c       | 332 ++++++++++++++++++++++++++++++++++++++++++++++
>  util/cutils.c             | 244 ----------------------------------
>  9 files changed, 423 insertions(+), 266 deletions(-)
>  create mode 100644 tests/test-bufferiszero.c
>  create mode 100644 util/bufferiszero.c
>
> --
> 2.7.4
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK