From: Peter Lieven <pl@kamp.de>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>,
	Orit Wasserman <owasserm@redhat.com>,
	qemu-devel@nongnu.org, quintela@redhat.com
Subject: Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
Date: Tue, 26 Mar 2013 09:14:51 +0100
Message-ID: <F01A8B40-2F39-4FB7-976D-1E385F0AE0A2@kamp.de>
In-Reply-To: <51506068.5080103@redhat.com>


On 25.03.2013 at 15:34, Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> Hmm, right.  What about just processing the first few longs twice, i.e.
> the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
> += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?

I tested this version as v3:

size_t buffer_find_nonzero_offset_v3(const void *buf, size_t len)
{
    VECTYPE *p = (VECTYPE *)buf;
    unsigned long *tmp = (unsigned long *)buf;
    VECTYPE zero = ZERO_SPLAT;
    size_t i;
    
    assert(can_use_buffer_find_nonzero_offset(buf, len));
    
    if (!len) {
        return 0;
    }
    
    /* Probe the first four longs one by one (256 bits with 64-bit longs). */
    if (tmp[0]) {
        return 0;
    }

    if (tmp[1]) {
        return 1 * sizeof(unsigned long);
    }

    if (tmp[2]) {
        return 2 * sizeof(unsigned long);
    }

    if (tmp[3]) {
        return 3 * sizeof(unsigned long);
    }

    /* Start again at chunk 0, so the first few longs are processed twice. */
    for (i = 0; i < len / sizeof(VECTYPE);
            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = p[i + 0] | p[i + 1];
        VECTYPE tmp1 = p[i + 2] | p[i + 3];
        VECTYPE tmp2 = p[i + 4] | p[i + 5];
        VECTYPE tmp3 = p[i + 6] | p[i + 7];
        VECTYPE tmp01 = tmp0 | tmp1;
        VECTYPE tmp23 = tmp2 | tmp3;
        if (!ALL_EQ(tmp01 | tmp23, zero)) {
            break;
        }
    }
    
    return i * sizeof(VECTYPE);
}

For reference this is v2:

size_t buffer_find_nonzero_offset_v2(const void *buf, size_t len)
{
    VECTYPE *p = (VECTYPE *)buf;
    VECTYPE zero = ZERO_SPLAT;
    size_t i;
    
    assert(can_use_buffer_find_nonzero_offset(buf, len));
    
    if (!len) {
        return 0;
    }
    
    /* Check the first 8 chunks one by one for an early exit. */
    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    /* Continue after the first block, 8 chunks at a time, no double-checks. */
    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
            i < len / sizeof(VECTYPE); 
            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = p[i + 0] | p[i + 1];
        VECTYPE tmp1 = p[i + 2] | p[i + 3];
        VECTYPE tmp2 = p[i + 4] | p[i + 5];
        VECTYPE tmp3 = p[i + 6] | p[i + 7];
        VECTYPE tmp01 = tmp0 | tmp1;
        VECTYPE tmp23 = tmp2 | tmp3;
        if (!ALL_EQ(tmp01 | tmp23, zero)) {
            break;
        }
    }
    
    return i * sizeof(VECTYPE);
}
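
Both functions above depend on macros from the patch set that are not shown in this mail. To experiment with the v2 control flow in isolation, a minimal sketch along the following lines may help. It assumes the plain unsigned long fallback; the macro bodies, `UNROLL_FACTOR`, `can_use_sketch` and `find_nonzero_offset_sketch` are illustrative stand-ins, not the definitions from the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed scalar fallback; the real definitions come from the patch set. */
#define VECTYPE        unsigned long
#define ZERO_SPLAT     0UL
#define ALL_EQ(v1, v2) ((v1) == (v2))
#define UNROLL_FACTOR  8 /* stands in for BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR */

/* Hypothetical precondition: aligned buffer, length a multiple of one block. */
static int can_use_sketch(const void *buf, size_t len)
{
    return (uintptr_t)buf % sizeof(VECTYPE) == 0 &&
           len % (UNROLL_FACTOR * sizeof(VECTYPE)) == 0;
}

/* Same control flow as v2: probe the first block chunk by chunk,
 * then OR together 8 chunks per iteration. */
static size_t find_nonzero_offset_sketch(const void *buf, size_t len)
{
    const VECTYPE *p = buf;
    const VECTYPE zero = ZERO_SPLAT;
    size_t i;

    assert(can_use_sketch(buf, len));

    if (!len) {
        return 0;
    }

    for (i = 0; i < UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    for (i = UNROLL_FACTOR; i < len / sizeof(VECTYPE); i += UNROLL_FACTOR) {
        VECTYPE acc = p[i + 0] | p[i + 1] | p[i + 2] | p[i + 3] |
                      p[i + 4] | p[i + 5] | p[i + 6] | p[i + 7];
        if (!ALL_EQ(acc, zero)) {
            break;
        }
    }

    return i * sizeof(VECTYPE);
}
```

With this control flow the return value equals len when the buffer is all zero. A hit inside the first block is reported at chunk granularity, and a later hit is reported as the start of the 8-chunk block that contains it.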

I ran 3 x 2 tests, each on 1 GB of memory with 256 iterations of checking every 4k page for zero.

1) all pages zero

a) SSE2
is_zero_page: res=67108864 (ticks 3289 user 1 system)
is_zero_page_v2: res=67108864 (ticks 3326 user 0 system)
is_zero_page_v3: res=67108864 (ticks 3305 user 3 system)
is_dup_page: res=67108864 (ticks 3648 user 1 system)

b) unsigned long arithmetic
is_zero_page: res=67108864 (ticks 3474 user 3 system)
is_zero_page_v2: res=67108864 (ticks 3516 user 1 system)
is_zero_page_v3: res=67108864 (ticks 3525 user 3 system)
is_dup_page: res=67108864 (ticks 3826 user 4 system)
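
As a sanity check on the res value, assuming res counts every page reported as zero across all iterations (the test program is not part of this mail, so this is an interpretation): 1 GB / 4k = 262144 pages, and 262144 * 256 iterations = 67108864. The helper name below is hypothetical.

```c
/* Expected number of zero-page hits when every page is zero.
 * Assumes res is incremented once per zero page per iteration. */
static unsigned long long expected_zero_hits(unsigned long long mem_bytes,
                                             unsigned long long page_bytes,
                                             unsigned long long iterations)
{
    return mem_bytes / page_bytes * iterations;
}
```

Calling `expected_zero_hits(1ULL << 30, 4096, 256)` yields 67108864, which matches the res value reported for the all-zero case.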

2) all pages non-zero, but first 64 bits of each page zero

a) SSE2
is_zero_page: res=0 (ticks 251 user 0 system)
is_zero_page_v2: res=0 (ticks 87 user 0 system)
is_zero_page_v3: res=0 (ticks 91 user 0 system)
is_dup_page: res=0 (ticks 82 user 0 system)

b) unsigned long arithmetic
is_zero_page: res=0 (ticks 209 user 0 system)
is_zero_page_v2: res=0 (ticks 89 user 0 system)
is_zero_page_v3: res=0 (ticks 88 user 0 system)
is_dup_page: res=0 (ticks 88 user 0 system)

3) all pages non-zero, but first 256 bits of each page zero

a) SSE2
is_zero_page: res=0 (ticks 260 user 0 system)
is_zero_page_v2: res=0 (ticks 199 user 0 system)
is_zero_page_v3: res=0 (ticks 342 user 0 system)
is_dup_page: res=0 (ticks 223 user 0 system)

b) unsigned long arithmetic
is_zero_page: res=0 (ticks 230 user 0 system)
is_zero_page_v2: res=0 (ticks 194 user 0 system)
is_zero_page_v3: res=0 (ticks 280 user 0 system)
is_dup_page: res=0 (ticks 191 user 0 system)


---

is_zero_page is the version from patch set v4.
is_zero_page_v2 checks the first 8 chunks of sizeof(VECTYPE) bytes one by one and then continues 8 chunks at a time without double-checking.
is_zero_page_v3 is the v3 version shown above.
is_dup_page is the old implementation.

All compiled with gcc -O3
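
The test program itself is not included in this mail. A measurement loop of the kind described could look roughly like the sketch below. It assumes POSIX times() for the user/system tick counts (suggested by the output format, but not confirmed), and all names (`page_check_fn`, `is_zero_page_ref`, `fill_pages`, `count_zero_pages`) are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <sys/times.h>

typedef int (*page_check_fn)(const unsigned char *page, size_t page_size);

/* Hypothetical reference check: plain byte-wise scan. */
static int is_zero_page_ref(const unsigned char *page, size_t page_size)
{
    size_t i;

    for (i = 0; i < page_size; i++) {
        if (page[i]) {
            return 0;
        }
    }
    return 1;
}

/* Zero the first zero_prefix bytes of each page and fill the rest with 0xff.
 * zero_prefix must not exceed page_size; zero_prefix == page_size gives
 * all-zero pages. How the original test filled the non-zero part is unknown. */
static void fill_pages(unsigned char *buf, size_t len, size_t page_size,
                       size_t zero_prefix)
{
    size_t off;

    for (off = 0; off + page_size <= len; off += page_size) {
        memset(buf + off, 0, zero_prefix);
        memset(buf + off + zero_prefix, 0xff, page_size - zero_prefix);
    }
}

/* Run check() over every page, iterations times; return the number of
 * zero-page hits and report the consumed user/system clock ticks. */
static unsigned long count_zero_pages(const unsigned char *buf, size_t len,
                                      size_t page_size, unsigned iterations,
                                      page_check_fn check,
                                      clock_t *user_ticks,
                                      clock_t *system_ticks)
{
    struct tms start, end;
    unsigned long res = 0;
    unsigned it;
    size_t off;

    times(&start);
    for (it = 0; it < iterations; it++) {
        for (off = 0; off + page_size <= len; off += page_size) {
            res += check(buf + off, page_size) ? 1 : 0;
        }
    }
    times(&end);

    *user_ticks = end.tms_utime - start.tms_utime;
    *system_ticks = end.tms_stime - start.tms_stime;
    return res;
}
```

Under these assumptions, test 1 corresponds to fill_pages() with zero_prefix equal to the page size, test 2 to a prefix of 8 bytes and test 3 to a prefix of 32 bytes.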

If no one objects, I would use is_zero_page_v2 and continue with v5 of the patch set, as I will be
out of office for the next 8 days starting tomorrow. I prefer v2 over v3 because it performs better when
the non-zero data lies within the first 8 * sizeof(VECTYPE) bytes but not in the first 256 bits.

Paolo, with the version that has lower setup costs in mind, should I use the vectorized or the unrolled version for patch 4 (the find_next_bit optimization)?

Peter

