From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:52424)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1UK8Tg-0006Mn-7b
	for qemu-devel@nongnu.org; Mon, 25 Mar 2013 10:34:26 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1UK8Te-0006uj-Mu
	for qemu-devel@nongnu.org; Mon, 25 Mar 2013 10:34:24 -0400
Received: from mail-ve0-f170.google.com ([209.85.128.170]:47213)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1UK8Te-0006uV-J7
	for qemu-devel@nongnu.org; Mon, 25 Mar 2013 10:34:22 -0400
Received: by mail-ve0-f170.google.com with SMTP id 14so5011013vea.15
	for <qemu-devel@nongnu.org>; Mon, 25 Mar 2013 07:34:22 -0700 (PDT)
Sender: Paolo Bonzini <paolo.bonzini@gmail.com>
Message-ID: <51506068.5080103@redhat.com>
Date: Mon, 25 Mar 2013 15:34:16 +0100
From: Paolo Bonzini <pbonzini@redhat.com>
MIME-Version: 1.0
References: <972929461.13095041.1364216522903.JavaMail.root@redhat.com>
	<4E89AD05-F328-493A-9C31-E52A033420B1@kamp.de>
	<806A8BFB-FF1F-482C-B679-2B1B10D06D7C@kamp.de>
In-Reply-To: <806A8BFB-FF1F-482C-B679-2B1B10D06D7C@kamp.de>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
Subject: Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration
	optimizations
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Peter Lieven <pl@kamp.de>
Cc: Stefan Hajnoczi <stefanha@gmail.com>, Orit Wasserman <owasserm@redhat.com>, qemu-devel@nongnu.org, quintela@redhat.com

On 25/03/2013 14:32, Peter Lieven wrote:
> 
> On 25.03.2013 at 14:23, Peter Lieven <pl@kamp.de> wrote:
> 
>>
>> On 25.03.2013 at 14:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>>> Maybe I should have explained the output in more detail. The
>>>> percentages are cumulative. 35.8% in the second-to-last column means
>>>> that 35.8% have a return value that is less than TARGET_PAGE_SIZE.
>>>> This was meant to illustrate how many 64-bit chunks you have
>>>> to look at to catch a certain percentage of non-zero pages.
>>>
>>> Ok, I wrongly understood that many pages had 4088 zero bytes but
>>> the last 8 were not zero.  Now it's clearer, and more logical too. :)
>>>
>>>> For example, the third value means that looking at the first
>>>> three 64-bit chunks catches 34.0% of all pages.
>>>> It turns out that the non-zeroness of a page can usually be detected
>>>> by looking at the first 256 or so bits, and only a low
>>>> percentage turns out to be non-zero at a later position. So after
>>>> having checked the first chunks one by one,
>>>> there is no big penalty for looking at the remaining chunks with the
>>>> vectorized loop.
>>>
>>> I think it makes most sense to unroll the first four non-vectorized
>>> iterations, i.e. not use SSE and use three or four ifs.  Either:
>>>
>>>  if (foo[0]) return 0;
>>>  if (foo[1]) return 8;
>>>  if (foo[2]) return 16;
>>>  if (foo[3]) return 24;
>>>
>>> or
>>>
>>>  if (foo[0]) return 0;
>>>  if (foo[1] | foo[2] | foo[3]) return 8;
>>>
>>> and then proceed on the remaining 4096-4*sizeof(long) bytes with
>>> the vectorized loop.  foo+4 is aligned for SIMD operations on both
>>> 32- and 64-bit machines, which makes this a nice choice.
>>
>> I can't start at foo+4 since the remaining X-4*sizeof(long) bytes
>> are not divisible by 8*sizeof(VECTYPE).


Hmm, right.  What about just processing the first few longs twice, i.e.
the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
+= BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?
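
Roughly something like the sketch below.  It is only an illustration, not
the actual patch: find_nonzero_offset_sketch and UNROLL_FACTOR are made-up
names, and a plain unsigned long stands in for the SIMD VECTYPE (with a
simple OR instead of ALL_EQ) so that it compiles anywhere.  It assumes buf
is suitably aligned and len is a multiple of UNROLL_FACTOR *
sizeof(VECTYPE).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a plain integer type stands in for the SIMD VECTYPE. */
typedef unsigned long VECTYPE;

/* Hypothetical stand-in for BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR. */
#define UNROLL_FACTOR 8

/*
 * Returns the offset of the chunk (or unrolled block) in which the first
 * non-zero data was seen, or len if the whole buffer is zero.
 */
static size_t find_nonzero_offset_sketch(const void *buf, size_t len)
{
    const unsigned long *l = buf;
    const VECTYPE *p = buf;
    size_t i;

    assert(len % (UNROLL_FACTOR * sizeof(VECTYPE)) == 0);
    if (len == 0) {
        return 0;
    }

    /* Check the first four longs one by one; most non-zero pages end here. */
    for (i = 0; i < 4; i++) {
        if (l[i]) {
            return i * sizeof(unsigned long);
        }
    }

    /*
     * Start again at 0, so the first few longs are looked at twice, but
     * the loop still covers a whole number of unrolled blocks.
     */
    for (i = 0; i < len / sizeof(VECTYPE); i += UNROLL_FACTOR) {
        VECTYPE acc = p[i] | p[i + 1] | p[i + 2] | p[i + 3] |
                      p[i + 4] | p[i + 5] | p[i + 6] | p[i + 7];
        if (acc) {
            return i * sizeof(VECTYPE);
        }
    }

    return len;
}
```

Re-reading the first four longs costs a few loads that are already in
cache, and in exchange the main loop needs no special start offset.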

Paolo

>>
>>    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; 
>>         i < len / sizeof(VECTYPE); 
>>         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>        …
>>    }
> 
> The performance of the above is bad compared to:
> 
>     for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>         if (!ALL_EQ(p[i], zero)) {
>             return i * sizeof(VECTYPE);
>         }
>     }
> 
> …
> 
> The above is basically what the old is_dup_page does, but after the first
> 8 iterations the optimized version kicks in.
> 
> Peter
> 
> 
>