From: Peter Lieven
Date: Tue, 26 Mar 2013 09:14:51 +0100
Subject: Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
To: Paolo Bonzini
Cc: Stefan Hajnoczi, Orit Wasserman, qemu-devel@nongnu.org, quintela@redhat.com
In-Reply-To: <51506068.5080103@redhat.com>

On 25.03.2013 at 15:34, Paolo Bonzini wrote:

>
> Hmm, right. What about just processing the first few longs twice, i.e.
> the above followed by "for (i = 0; i < len / sizeof(sizeof(VECTYPE); i
> += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?

I tested this version as v3:

    size_t buffer_find_nonzero_offset_v3(const void *buf, size_t len)
    {
        VECTYPE *p = (VECTYPE *)buf;
        unsigned long *tmp = (unsigned long *)buf;
        VECTYPE zero = ZERO_SPLAT;
        size_t i;

        assert(can_use_buffer_find_nonzero_offset(buf, len));

        if (!len) {
            return 0;
        }

        if (tmp[0]) {
            return 0;
        }
        if (tmp[1]) {
            return 1 * sizeof(unsigned long);
        }
        if (tmp[2]) {
            return 2 * sizeof(unsigned long);
        }
        if (tmp[3]) {
            return 3 * sizeof(unsigned long);
        }

        for (i = 0; i < len / sizeof(VECTYPE);
             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
            VECTYPE tmp0 = p[i + 0] | p[i + 1];
            VECTYPE tmp1 = p[i + 2] | p[i + 3];
            VECTYPE tmp2 = p[i + 4] | p[i + 5];
            VECTYPE tmp3 = p[i + 6] | p[i + 7];
            VECTYPE tmp01 = tmp0 | tmp1;
            VECTYPE tmp23 = tmp2 | tmp3;
            if (!ALL_EQ(tmp01 | tmp23, zero)) {
                break;
            }
        }

        return i * sizeof(VECTYPE);
    }

For reference, this is v2:

    size_t buffer_find_nonzero_offset_v2(const void *buf, size_t len)
    {
        VECTYPE *p = (VECTYPE *)buf;
        VECTYPE zero = ZERO_SPLAT;
        size_t i;

        assert(can_use_buffer_find_nonzero_offset(buf, len));

        if (!len) {
            return 0;
        }

        for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
            if (!ALL_EQ(p[i], zero)) {
                return i * sizeof(VECTYPE);
            }
        }

        for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
             i < len / sizeof(VECTYPE);
             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
            VECTYPE tmp0 = p[i + 0] | p[i + 1];
            VECTYPE tmp1 = p[i + 2] | p[i + 3];
            VECTYPE tmp2 = p[i + 4] | p[i + 5];
            VECTYPE tmp3 = p[i + 6] | p[i + 7];
            VECTYPE tmp01 = tmp0 | tmp1;
            VECTYPE tmp23 = tmp2 | tmp3;
            if (!ALL_EQ(tmp01 | tmp23, zero)) {
                break;
            }
        }

        return i * sizeof(VECTYPE);
    }

I ran 3*2 tests, each on 1GB of memory with 256 iterations of checking
every 4k page for zero.
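
For anyone who wants to reproduce the setup, a test loop of this kind could look roughly like the sketch below. This is not the harness actually used: page_is_zero is a byte-wise stand-in for the function under test, fill_pages, count_zero_pages, run_case and BENCH_PAGE_SIZE are hypothetical names, and using times() for the user/system ticks is a guess based on the output format. With 1GB of memory there are 2^30 / 4096 = 262144 pages, and 262144 pages * 256 iterations = 67108864, which matches the res value reported for the all-zero case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/times.h>

#define BENCH_PAGE_SIZE 4096u   /* assumed page size, as in the mail */

/* The three page layouts used in the tests. */
enum page_pattern { ALL_ZERO, FIRST_64_BITS_ZERO, FIRST_256_BITS_ZERO };

/* Zero the memory, then place one non-zero byte per page right after the
 * zero prefix (byte 8 = after 64 bits, byte 32 = after 256 bits). */
static void fill_pages(unsigned char *mem, size_t npages, enum page_pattern pat)
{
    size_t n, off;

    memset(mem, 0, npages * BENCH_PAGE_SIZE);
    if (pat == ALL_ZERO) {
        return;
    }
    off = (pat == FIRST_64_BITS_ZERO) ? 8 : 32;
    for (n = 0; n < npages; n++) {
        mem[n * BENCH_PAGE_SIZE + off] = 1;
    }
}

/* Hypothetical stand-in for is_zero_page / is_zero_page_v2 / _v3. */
static int page_is_zero(const unsigned char *page)
{
    size_t i;

    for (i = 0; i < BENCH_PAGE_SIZE; i++) {
        if (page[i]) {
            return 0;
        }
    }
    return 1;
}

/* Count zero pages over all iterations; this sum is the "res" value. */
static unsigned long count_zero_pages(const unsigned char *mem, size_t npages,
                                      unsigned iterations)
{
    unsigned long res = 0;
    unsigned it;
    size_t n;

    for (it = 0; it < iterations; it++) {
        for (n = 0; n < npages; n++) {
            res += page_is_zero(mem + n * BENCH_PAGE_SIZE);
        }
    }
    return res;
}

/* Time one case with times() and print it in the format used in the mail. */
static void run_case(const char *name, const unsigned char *mem,
                     size_t npages, unsigned iterations)
{
    struct tms t0, t1;
    unsigned long res;

    times(&t0);
    res = count_zero_pages(mem, npages, iterations);
    times(&t1);
    printf("%s: res=%lu (ticks %ld user %ld system)\n", name, res,
           (long)(t1.tms_utime - t0.tms_utime),
           (long)(t1.tms_stime - t0.tms_stime));
}
```

To approximate the runs described here, allocate 262144 pages, call fill_pages with one of the three patterns, and call run_case with 256 iterations for each function under test.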

1) all pages zero

a) SSE2

is_zero_page: res=67108864 (ticks 3289 user 1 system)
is_zero_page_v2: res=67108864 (ticks 3326 user 0 system)
is_zero_page_v3: res=67108864 (ticks 3305 user 3 system)
is_dup_page: res=67108864 (ticks 3648 user 1 system)

b) unsigned long arithmetic

is_zero_page: res=67108864 (ticks 3474 user 3 system)
is_zero_page_2: res=67108864 (ticks 3516 user 1 system)
is_zero_page_3: res=67108864 (ticks 3525 user 3 system)
is_dup_page: res=67108864 (ticks 3826 user 4 system)

2) all pages non-zero, but first 64-bit of each page zero

a) SSE2

is_zero_page: res=0 (ticks 251 user 0 system)
is_zero_page_v2: res=0 (ticks 87 user 0 system)
is_zero_page_v3: res=0 (ticks 91 user 0 system)
is_dup_page: res=0 (ticks 82 user 0 system)

b) unsigned long arithmetic

is_zero_page: res=0 (ticks 209 user 0 system)
is_zero_page_v2: res=0 (ticks 89 user 0 system)
is_zero_page_v3: res=0 (ticks 88 user 0 system)
is_dup_page: res=0 (ticks 88 user 0 system)

3) all pages non-zero, but first 256-bit of each page zero

a)

is_zero_pages: res=0 (ticks 260 user 0 system)
is_zero_pages_2: res=0 (ticks 199 user 0 system)
is_zero_pages_3: res=0 (ticks 342 user 0 system)
is_dup_pages: res=0 (ticks 223 user 0 system)

b) unsigned long arithmetic

is_zero_pages: res=0 (ticks 230 user 0 system)
is_zero_pages_2: res=0 (ticks 194 user 0 system)
is_zero_pages_3: res=0 (ticks 280 user 0 system)
is_dup_pages: res=0 (ticks 191 user 0 system)

---

is_zero_page is the version from patch set v4.
is_zero_page_2 checks the first 8 VECTYPE chunks (8 * sizeof(VECTYPE)
bytes) one by one and then continues 8 chunks at a time without
double-checks.
is_zero_page_3 is the v3 version above.
is_dup_page is the old implementation.

All were compiled with gcc -O3.

If no one objects, I would use is_zero_page_2 and continue with v5 of the
patch set. I am out of office for the next 8 days, starting tomorrow.
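
To make the description of is_zero_page_2 concrete outside the QEMU tree, here is a self-contained sketch of the "unsigned long arithmetic" flavour of v2. The macro definitions and the body of can_use_buffer_find_nonzero_offset are assumptions for illustration, not the ones from the patch set; only the control flow follows the v2 function in this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed scalar stand-ins for the macros used in the patch set. */
#define VECTYPE      unsigned long
#define ZERO_SPLAT   0UL
#define ALL_EQ(a, b) ((a) == (b))
#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8

/* Assumed precondition: the buffer is VECTYPE-aligned and its length is a
 * multiple of one unrolled step (8 * sizeof(VECTYPE) bytes). */
static int can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
{
    return len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
                  * sizeof(VECTYPE)) == 0
           && (uintptr_t)buf % sizeof(VECTYPE) == 0;
}

static size_t buffer_find_nonzero_offset_v2(const void *buf, size_t len)
{
    const VECTYPE *p = buf;
    const VECTYPE zero = ZERO_SPLAT;
    size_t i;

    assert(can_use_buffer_find_nonzero_offset(buf, len));

    if (!len) {
        return 0;
    }

    /* Setup phase: test the first 8 chunks one by one so that an early
     * non-zero chunk is reported with its exact chunk offset. */
    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    /* Bulk phase: OR 8 chunks together and test once per group. The
     * result is the start of the first group containing a non-zero chunk. */
    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
         i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE acc = p[i + 0] | p[i + 1] | p[i + 2] | p[i + 3]
                    | p[i + 4] | p[i + 5] | p[i + 6] | p[i + 7];
        if (!ALL_EQ(acc, zero)) {
            break;
        }
    }

    return i * sizeof(VECTYPE);
}
```

In this sketch a fully zero buffer returns len, so a zero-page check reduces to comparing the result against the page size. Note that after the first 8 chunks the returned offset only has group granularity.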

I prefer v3, as it has better performance if the non-zeroness is within
the 8*sizeof(VECTYPE) bytes and not in the first 256 bits.

Paolo, with the version that has lower setup costs in mind, shall I use
the vectorized or the unrolled version of patch 4 (find_next_bit
optimization)?

Peter