* [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
@ 2016-10-25 15:35 Pranith Kumar
  2016-10-25 15:41 ` Paolo Bonzini
  2016-10-25 20:55 ` Emilio G. Cota
  0 siblings, 2 replies; 10+ messages in thread
From: Pranith Kumar @ 2016-10-25 15:35 UTC (permalink / raw)
  To: Richard Henderson, Alex Bennée, Markus Armbruster,
	Sergey Fedorov, Paolo Bonzini, Emilio G. Cota,
	open list:All patches CC here

Using perf, I see that sequence lock is being a bottleneck since it is
being read by everyone. Giving it its own cache-line seems to help
things quite a bit.

Using qht-bench, I measured the following for:

$ ./tests/qht-bench -d 10 -n 24 -u <x>
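
(Here -d is the run length in seconds, -n the number of threads and -u the
update rate in percent; the throughput figures below are per thread, in
millions of transactions per second.)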

throughput base   patch  %change
update
0          8.07   13.33  +65%
10         7.10   8.90   +25%
20         6.34   7.02	 +10%
30         5.48   6.11   +9.6%
40         4.90   5.46   +11.42%

I am not able to see any significant increases for lower thread counts though.
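
Note that aligning QemuSeqLock to 64 bytes also pads its size to 64 bytes,
so the bucket grows from exactly one cache line (4-byte lock + 4-byte
sequence + 16 bytes of hashes + 32 bytes of pointers + 8-byte next, with
QHT_BUCKET_ENTRIES == 4 on 64-bit hosts) to two. That is why the
QEMU_BUILD_BUG_ON below is relaxed to 2 * QHT_BUCKET_ALIGN.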

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 include/qemu/seqlock.h | 2 +-
 util/qht.c             | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 8dee11d..954abe8 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -21,7 +21,7 @@ typedef struct QemuSeqLock QemuSeqLock;
 
 struct QemuSeqLock {
     unsigned sequence;
-};
+} QEMU_ALIGNED(64);
 
 static inline void seqlock_init(QemuSeqLock *sl)
 {
diff --git a/util/qht.c b/util/qht.c
index ff4d2e6..4d82609 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -101,14 +101,14 @@
  * be grabbed first.
  */
 struct qht_bucket {
-    QemuSpin lock;
     QemuSeqLock sequence;
+    QemuSpin lock;
     uint32_t hashes[QHT_BUCKET_ENTRIES];
     void *pointers[QHT_BUCKET_ENTRIES];
     struct qht_bucket *next;
 } QEMU_ALIGNED(QHT_BUCKET_ALIGN);
 
-QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
+QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > 2 * QHT_BUCKET_ALIGN);
 
 /**
  * struct qht_map - structure to track an array of buckets
-- 
2.10.1

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 15:35 [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line Pranith Kumar
@ 2016-10-25 15:41 ` Paolo Bonzini
  2016-10-25 15:49   ` Pranith Kumar
  2016-10-25 20:55 ` Emilio G. Cota
  1 sibling, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2016-10-25 15:41 UTC (permalink / raw)
  To: Pranith Kumar, Richard Henderson, Alex Bennée,
	Markus Armbruster, Sergey Fedorov, Emilio G. Cota,
	open list:All patches CC here



On 25/10/2016 17:35, Pranith Kumar wrote:
> Using perf, I see that sequence lock is being a bottleneck since it is
> being read by everyone. Giving it its own cache-line seems to help
> things quite a bit.
> 
> Using qht-bench, I measured the following for:
> 
> $ ./tests/qht-bench -d 10 -n 24 -u <x>
> 
> throughput base   patch  %change
> update
> 0          8.07   13.33  +65%
> 10         7.10   8.90   +25%
> 20         6.34   7.02	 +10%
> 30         5.48   6.11   +9.6%
> 40         4.90   5.46   +11.42%
> 
> I am not able to see any significant increases for lower thread counts though.

Whoa, that's a bit of a heavy hammer.  The idea is that whoever modifies
the seqlock must take the spinlock first, and whoever reads the seqlock
will read one of the members of hashes[]/pointers[].  So having
everything in the same cacheline should be better.
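
Condensed, the write side today is roughly this (the spinlock is taken by
the caller of qht_insert__locked):

    qemu_spin_lock(&head->lock);            /* writers serialize per bucket */
    seqlock_write_begin(&head->sequence);   /* smp_wmb() implied            */
    atomic_set(&b->hashes[i], hash);
    atomic_set(&b->pointers[i], p);
    seqlock_write_end(&head->sequence);
    qemu_spin_unlock(&head->lock);

so a writer dirties the lock, the sequence and the entry together, while a
reader touches the sequence plus hashes[]/pointers[].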

What happens if you just change QHT_BUCKET_ALIGN to 128?

Paolo

> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
>  include/qemu/seqlock.h | 2 +-
>  util/qht.c             | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index 8dee11d..954abe8 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -21,7 +21,7 @@ typedef struct QemuSeqLock QemuSeqLock;
>  
>  struct QemuSeqLock {
>      unsigned sequence;
> -};
> +} QEMU_ALIGNED(64);
>  
>  static inline void seqlock_init(QemuSeqLock *sl)
>  {
> diff --git a/util/qht.c b/util/qht.c
> index ff4d2e6..4d82609 100644
> --- a/util/qht.c
> +++ b/util/qht.c
> @@ -101,14 +101,14 @@
>   * be grabbed first.
>   */
>  struct qht_bucket {
> -    QemuSpin lock;
>      QemuSeqLock sequence;
> +    QemuSpin lock;
>      uint32_t hashes[QHT_BUCKET_ENTRIES];
>      void *pointers[QHT_BUCKET_ENTRIES];
>      struct qht_bucket *next;
>  } QEMU_ALIGNED(QHT_BUCKET_ALIGN);
>  
> -QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
> +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > 2 * QHT_BUCKET_ALIGN);
>  
>  /**
>   * struct qht_map - structure to track an array of buckets
> 

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 15:41 ` Paolo Bonzini
@ 2016-10-25 15:49   ` Pranith Kumar
  2016-10-25 15:54     ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Pranith Kumar @ 2016-10-25 15:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Richard Henderson, Alex Bennée, Markus Armbruster,
	Emilio G. Cota, open list:All patches CC here

On Tue, Oct 25, 2016 at 11:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 25/10/2016 17:35, Pranith Kumar wrote:
>> Using perf, I see that sequence lock is being a bottleneck since it is
>> being read by everyone. Giving it its own cache-line seems to help
>> things quite a bit.
>>
>> Using qht-bench, I measured the following for:
>>
>> $ ./tests/qht-bench -d 10 -n 24 -u <x>
>>
>> throughput base   patch  %change
>> update
>> 0          8.07   13.33  +65%
>> 10         7.10   8.90   +25%
>> 20         6.34   7.02         +10%
>> 30         5.48   6.11   +9.6%
>> 40         4.90   5.46   +11.42%
>>
>> I am not able to see any significant increases for lower thread counts though.
>
> Whoa, that's a bit of a heavy hammer.  The idea is that whoever modifies
> the seqlock must take the spinlock first, and whoever reads the seqlock
> will read one of the members of hashes[]/pointers[].  So having
> everything in the same cacheline should be better.

But we are taking the seqlock of only the head bucket, while the
readers are reading hashes/pointers of the chained buckets.

>
> What happens if you just change QHT_BUCKET_ALIGN to 128?
>

update  throughput
0       4.47
10      9.82 (weird!)
20      8.13
30      7.13
40      6.45

Thanks,
-- 
Pranith

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 15:49   ` Pranith Kumar
@ 2016-10-25 15:54     ` Paolo Bonzini
  2016-10-25 19:12       ` Pranith Kumar
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2016-10-25 15:54 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: Richard Henderson, Alex Bennée, Markus Armbruster,
	Emilio G. Cota, open list:All patches CC here



On 25/10/2016 17:49, Pranith Kumar wrote:
> But we are taking the seqlock of only the head bucket, while the
> readers are reading hashes/pointers of the chained buckets.

No, we aren't.  See qht_lookup__slowpath.
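
For reference, that path currently reads:

    do {
        version = seqlock_read_begin(&b->sequence);
        ret = qht_do_lookup(b, func, userp, hash);
    } while (seqlock_read_retry(&b->sequence, version));
    return ret;

so the whole chain walk (qht_do_lookup) runs inside a single read-side
critical section on the head bucket's sequence.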

This patch:

throughput base   patch  %change
update
0          8.07   13.33  +65%
10         7.10   8.90   +25%
20         6.34   7.02   +10%
30         5.48   6.11   +9.6%
40         4.90   5.46   +11.42%


Just doubling the cache-line size (QHT_BUCKET_ALIGN to 128):

throughput base   patch  %change
update
0          8.07   4.47   -45% ?!?
10         7.10   9.82   +38%
20         6.34   8.13   +28%
30         5.48   7.13   +30%
40         4.90   6.45   +30%

It seems to me that your machine has 128-byte cachelines.

Paolo

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 15:54     ` Paolo Bonzini
@ 2016-10-25 19:12       ` Pranith Kumar
  2016-10-25 20:02         ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Pranith Kumar @ 2016-10-25 19:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Richard Henderson, Alex Bennée, Markus Armbruster,
	Emilio G. Cota, open list:All patches CC here


Paolo Bonzini writes:

> On 25/10/2016 17:49, Pranith Kumar wrote:
>> But we are taking the seqlock of only the head bucket, while the
>> readers are reading hashes/pointers of the chained buckets.
>
> No, we aren't.  See qht_lookup__slowpath.


I don't see it. The reader is taking the head bucket lock in
qht_lookup__slowpath() and then iterating over the chained buckets in
qht_do_lookup(). The writer is doing the same. It is taking the head bucket
lock in qht_insert__locked().

I've written a patch (see below) to take the per-bucket sequence locks.

>
> This patch:
>
> throughput base   patch  %change
> update
> 0          8.07   13.33  +65%
> 10         7.10   8.90   +25%
> 20         6.34   7.02   +10%
> 30         5.48   6.11   +9.6%
> 40         4.90   5.46   +11.42%
>
>
> Just doubling the cache-line size (QHT_BUCKET_ALIGN to 128):
>
> throughput base   patch  %change
> update
> 0          8.07   4.47   -45% ?!?
> 10         7.10   9.82   +38%
> 20         6.34   8.13   +28%
> 30         5.48   7.13   +30%
> 40         4.90   6.45   +30%
>
> It seems to me that your machine has 128-byte cachelines.
>

Nope. It is just the regular 64 byte cache line.

$ getconf LEVEL1_DCACHE_LINESIZE
64

(The machine model is Xeon CPU E5-2620).


Take the per-bucket sequence locks instead of the head bucket lock.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 util/qht.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/util/qht.c b/util/qht.c
index 4d82609..cfce5fc 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -374,19 +374,19 @@ static void qht_bucket_reset__locked(struct qht_bucket *head)
     struct qht_bucket *b = head;
     int i;
 
-    seqlock_write_begin(&head->sequence);
     do {
+        seqlock_write_begin(&b->sequence);
         for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
             if (b->pointers[i] == NULL) {
-                goto done;
+                seqlock_write_end(&b->sequence);
+                return;
             }
             atomic_set(&b->hashes[i], 0);
             atomic_set(&b->pointers[i], NULL);
         }
+        seqlock_write_end(&b->sequence);
         b = b->next;
     } while (b);
- done:
-    seqlock_write_end(&head->sequence);
 }
 
 /* call with all bucket locks held */
@@ -446,6 +446,8 @@ void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
     int i;
 
     do {
+        void *q = NULL;
+        unsigned int version = seqlock_read_begin(&b->sequence);
         for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
             if (atomic_read(&b->hashes[i]) == hash) {
                 /* The pointer is dereferenced before seqlock_read_retry,
@@ -455,11 +457,16 @@ void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
                 void *p = atomic_rcu_read(&b->pointers[i]);
 
                 if (likely(p) && likely(func(p, userp))) {
-                    return p;
+                    q = p;
+                    break;
                 }
             }
         }
-        b = atomic_rcu_read(&b->next);
+        if (!q) {
+            b = atomic_rcu_read(&b->next);
+        } else if (!seqlock_read_retry(&b->sequence, version)) {
+            return q;
+        }
     } while (b);
 
     return NULL;
@@ -469,14 +476,7 @@ static __attribute__((noinline))
 void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
                            const void *userp, uint32_t hash)
 {
-    unsigned int version;
-    void *ret;
-
-    do {
-        version = seqlock_read_begin(&b->sequence);
-        ret = qht_do_lookup(b, func, userp, hash);
-    } while (seqlock_read_retry(&b->sequence, version));
-    return ret;
+    return qht_do_lookup(b, func, userp, hash);
 }
 
 void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
@@ -537,14 +537,14 @@ static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
 
  found:
     /* found an empty key: acquire the seqlock and write */
-    seqlock_write_begin(&head->sequence);
+    seqlock_write_begin(&b->sequence);
     if (new) {
         atomic_rcu_set(&prev->next, b);
     }
     /* smp_wmb() implicit in seqlock_write_begin.  */
     atomic_set(&b->hashes[i], hash);
     atomic_set(&b->pointers[i], p);
-    seqlock_write_end(&head->sequence);
+    seqlock_write_end(&b->sequence);
     return true;
 }
 
@@ -665,9 +665,9 @@ bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
             }
             if (q == p) {
                 qht_debug_assert(b->hashes[i] == hash);
-                seqlock_write_begin(&head->sequence);
+                seqlock_write_begin(&b->sequence);
                 qht_bucket_remove_entry(b, i);
-                seqlock_write_end(&head->sequence);
+                seqlock_write_end(&b->sequence);
                 return true;
             }
         }
-- 
2.10.1

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 19:12       ` Pranith Kumar
@ 2016-10-25 20:02         ` Paolo Bonzini
  2016-10-25 20:35           ` Pranith Kumar
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2016-10-25 20:02 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: open list:All patches CC here, Emilio G. Cota, Alex Bennée,
	Markus Armbruster, Richard Henderson



On 25/10/2016 21:12, Pranith Kumar wrote:
> 
> Paolo Bonzini writes:
> 
>> On 25/10/2016 17:49, Pranith Kumar wrote:
>>> But we are taking the seqlock of only the head bucket, while the
>>> readers are reading hashes/pointers of the chained buckets.
>>
>> No, we aren't.  See qht_lookup__slowpath.
> 
> 
> I don't see it. The reader is taking the head bucket lock in
> qht_lookup__slowpath() and then iterating over the chained buckets in
> qht_do_lookup(). The writer is doing the same. It is taking the head bucket
> lock in qht_insert__locked().

Uh, you're right.  Sorry I wasn't reading it correctly.

> I've written a patch (see below) to take the per-bucket sequence locks.

What's the performance like?

Paolo

>>
>> This patch:
>>
>> throughput base   patch  %change
>> update
>> 0          8.07   13.33  +65%
>> 10         7.10   8.90   +25%
>> 20         6.34   7.02   +10%
>> 30         5.48   6.11   +9.6%
>> 40         4.90   5.46   +11.42%
>>
>>
>> Just doubling the cache-line size (QHT_BUCKET_ALIGN to 128):
>>
>> throughput base   patch  %change
>> update
>> 0          8.07   4.47   -45% ?!?
>> 10         7.10   9.82   +38%
>> 20         6.34   8.13   +28%
>> 30         5.48   7.13   +30%
>> 40         4.90   6.45   +30%
>>
>> It seems to me that your machine has 128-byte cachelines.
>>
> 
> Nope. It is just the regular 64 byte cache line.
> 
> $ getconf LEVEL1_DCACHE_LINESIZE
> 64
> 
> (The machine model is Xeon CPU E5-2620).
> 
> 
> Take the per-bucket sequence locks instead of the head bucket lock.
> 
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
>  util/qht.c | 36 ++++++++++++++++++------------------
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/util/qht.c b/util/qht.c
> index 4d82609..cfce5fc 100644
> --- a/util/qht.c
> +++ b/util/qht.c
> @@ -374,19 +374,19 @@ static void qht_bucket_reset__locked(struct qht_bucket *head)
>      struct qht_bucket *b = head;
>      int i;
>  
> -    seqlock_write_begin(&head->sequence);
>      do {
> +        seqlock_write_begin(&b->sequence);
>          for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>              if (b->pointers[i] == NULL) {
> -                goto done;
> +                seqlock_write_end(&b->sequence);
> +                return;
>              }
>              atomic_set(&b->hashes[i], 0);
>              atomic_set(&b->pointers[i], NULL);
>          }
> +        seqlock_write_end(&b->sequence);
>          b = b->next;
>      } while (b);
> - done:
> -    seqlock_write_end(&head->sequence);
>  }
>  
>  /* call with all bucket locks held */
> @@ -446,6 +446,8 @@ void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
>      int i;
>  
>      do {
> +        void *q = NULL;
> +        unsigned int version = seqlock_read_begin(&b->sequence);
>          for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>              if (atomic_read(&b->hashes[i]) == hash) {
>                  /* The pointer is dereferenced before seqlock_read_retry,
> @@ -455,11 +457,16 @@ void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
>                  void *p = atomic_rcu_read(&b->pointers[i]);
>  
>                  if (likely(p) && likely(func(p, userp))) {
> -                    return p;
> +                    q = p;
> +                    break;
>                  }
>              }
>          }
> -        b = atomic_rcu_read(&b->next);
> +        if (!q) {
> +            b = atomic_rcu_read(&b->next);
> +        } else if (!seqlock_read_retry(&b->sequence, version)) {
> +            return q;
> +        }
>      } while (b);
>  
>      return NULL;
> @@ -469,14 +476,7 @@ static __attribute__((noinline))
>  void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
>                             const void *userp, uint32_t hash)
>  {
> -    unsigned int version;
> -    void *ret;
> -
> -    do {
> -        version = seqlock_read_begin(&b->sequence);
> -        ret = qht_do_lookup(b, func, userp, hash);
> -    } while (seqlock_read_retry(&b->sequence, version));
> -    return ret;
> +    return qht_do_lookup(b, func, userp, hash);
>  }
>  
>  void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> @@ -537,14 +537,14 @@ static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
>  
>   found:
>      /* found an empty key: acquire the seqlock and write */
> -    seqlock_write_begin(&head->sequence);
> +    seqlock_write_begin(&b->sequence);
>      if (new) {
>          atomic_rcu_set(&prev->next, b);
>      }
>      /* smp_wmb() implicit in seqlock_write_begin.  */
>      atomic_set(&b->hashes[i], hash);
>      atomic_set(&b->pointers[i], p);
> -    seqlock_write_end(&head->sequence);
> +    seqlock_write_end(&b->sequence);
>      return true;
>  }
>  
> @@ -665,9 +665,9 @@ bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
>              }
>              if (q == p) {
>                  qht_debug_assert(b->hashes[i] == hash);
> -                seqlock_write_begin(&head->sequence);
> +                seqlock_write_begin(&b->sequence);
>                  qht_bucket_remove_entry(b, i);
> -                seqlock_write_end(&head->sequence);
> +                seqlock_write_end(&b->sequence);
>                  return true;
>              }
>          }
> 

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 20:02         ` Paolo Bonzini
@ 2016-10-25 20:35           ` Pranith Kumar
  2016-10-25 20:45             ` Emilio G. Cota
  0 siblings, 1 reply; 10+ messages in thread
From: Pranith Kumar @ 2016-10-25 20:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: open list:All patches CC here, Emilio G. Cota, Alex Bennée,
	Markus Armbruster, Richard Henderson

On Tue, Oct 25, 2016 at 4:02 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
>> I've written a patch (see below) to take the per-bucket sequence locks.
>
> What's the performance like?
>

Applying only this patch, the perf numbers are similar to those with the
128-byte cache-line alignment you suggested.

update  throughput
0       4
10      9.70
20      8.09
30      7.13
40      6.49

I am not sure why only the 100% reader case is so low. Applying the
sequence lock cache alignment patch brings it back up to 13
MT/s/thread.

-- 
Pranith

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 20:35           ` Pranith Kumar
@ 2016-10-25 20:45             ` Emilio G. Cota
  2016-10-25 20:58               ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Emilio G. Cota @ 2016-10-25 20:45 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: Paolo Bonzini, open list:All patches CC here, Alex Bennée,
	Markus Armbruster, Richard Henderson

On Tue, Oct 25, 2016 at 16:35:48 -0400, Pranith Kumar wrote:
> On Tue, Oct 25, 2016 at 4:02 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> >
> >> I've written a patch (see below) to take the per-bucket sequence locks.
> >
> > What's the performance like?
> >
> 
> Applying only this patch, the perf numbers are similar to those with the
> 128-byte cache-line alignment you suggested.

That makes sense. Having a single seqlock per bucket is simple and fast;
note that bucket chains should be very short (we use good hashing and
automatic resize for this purpose).

		E.

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 15:35 [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line Pranith Kumar
  2016-10-25 15:41 ` Paolo Bonzini
@ 2016-10-25 20:55 ` Emilio G. Cota
  1 sibling, 0 replies; 10+ messages in thread
From: Emilio G. Cota @ 2016-10-25 20:55 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: Richard Henderson, Alex Bennée, Markus Armbruster,
	Sergey Fedorov, Paolo Bonzini, open list:All patches CC here

On Tue, Oct 25, 2016 at 11:35:06 -0400, Pranith Kumar wrote:
> Using perf, I see that sequence lock is being a bottleneck since it is
> being read by everyone. Giving it its own cache-line seems to help
> things quite a bit.
> 
> Using qht-bench, I measured the following for:
> 
> $ ./tests/qht-bench -d 10 -n 24 -u <x>
> 
> throughput base   patch  %change
> update
> 0          8.07   13.33  +65%
> 10         7.10   8.90   +25%
> 20         6.34   7.02	 +10%
> 30         5.48   6.11   +9.6%
> 40         4.90   5.46   +11.42%
> 
> I am not able to see any significant increases for lower thread counts though.

Honestly I don't know what you're measuring here.

Your results are low (I assume you're showing here throughput per-thread,
but still they're low), and it makes no sense that increasing the cacheline
footprint will make a *read-only* workload (0% updates) run faster.

More below.

> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
>  include/qemu/seqlock.h | 2 +-
>  util/qht.c             | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index 8dee11d..954abe8 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -21,7 +21,7 @@ typedef struct QemuSeqLock QemuSeqLock;
>  
>  struct QemuSeqLock {
>      unsigned sequence;
> -};
> +} QEMU_ALIGNED(64);
>  
>  static inline void seqlock_init(QemuSeqLock *sl)
>  {
> diff --git a/util/qht.c b/util/qht.c
> index ff4d2e6..4d82609 100644
> --- a/util/qht.c
> +++ b/util/qht.c
> @@ -101,14 +101,14 @@
>   * be grabbed first.
>   */
>  struct qht_bucket {
> -    QemuSpin lock;
>      QemuSeqLock sequence;
> +    QemuSpin lock;
>      uint32_t hashes[QHT_BUCKET_ENTRIES];
>      void *pointers[QHT_BUCKET_ENTRIES];
>      struct qht_bucket *next;
>  } QEMU_ALIGNED(QHT_BUCKET_ALIGN);

I understand this is a hack but this would have been more localized:

diff --git a/util/qht.c b/util/qht.c
index ff4d2e6..55db907 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -101,14 +101,16 @@
  * be grabbed first.
  */
 struct qht_bucket {
+    struct {
+        QemuSeqLock sequence;
+    } QEMU_ALIGNED(QHT_BUCKET_ALIGN);
     QemuSpin lock;
-    QemuSeqLock sequence;
     uint32_t hashes[QHT_BUCKET_ENTRIES];
     void *pointers[QHT_BUCKET_ENTRIES];
     struct qht_bucket *next;
 } QEMU_ALIGNED(QHT_BUCKET_ALIGN);
 
So I tested my change above vs. master on a 16-core (32-way) Intel machine
(Xeon E5-2690 @ 2.90GHz with turbo-boost disabled), making sure threads are
scheduled on separate cores, favouring same-socket ones.
Results: http://imgur.com/a/c4dTB

So really I don't know what you're measuring.
The idea of decoupling the seqlock from the spinlock's cache line doesn't
make sense to me, because:
- Bucket lock holders are very likely to update the seqlock, so it makes sense
  to have them in the same cache line (exceptions to this are resizes or
  traversals, but those are very rare and we're not measuring those in qht-bench)
- Thanks to resizing + good hashing, bucket chains are very short, so
  a single seqlock per bucket is all we need.
- We can have *many* buckets (200K is not crazy for the TB htable), so
  anything that increases their size needs very good justification (see
  200K results above).
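
To put rough numbers on that last point: 200K head buckets at 64 bytes each
are already ~12.8 MB of bucket array; letting the bucket grow to 128 bytes
doubles that to ~25.6 MB before even counting chained buckets.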

		E.

* Re: [Qemu-devel] [RFC PATCH] qht: Align sequence lock to cache line
  2016-10-25 20:45             ` Emilio G. Cota
@ 2016-10-25 20:58               ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2016-10-25 20:58 UTC (permalink / raw)
  To: Emilio G. Cota, Pranith Kumar
  Cc: Richard Henderson, Alex Bennée,
	open list:All patches CC here, Markus Armbruster



On 25/10/2016 22:45, Emilio G. Cota wrote:
> On Tue, Oct 25, 2016 at 16:35:48 -0400, Pranith Kumar wrote:
>> On Tue, Oct 25, 2016 at 4:02 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>>> I've written a patch (see below) to take the per-bucket sequence locks.
>>>
>>> What's the performance like?
>>>
>>
>> Applying only this patch, the perf numbers are similar to the 128
>> cache line alignment you suggested.
> 
> That makes sense. Having a single seqlock per bucket is simple and fast;
> note that bucket chains should be very short (we use good hashing and
> automatic resize for this purpose).

But why do we get so much worse performance in the 100% reader case?  (And
even more puzzling, why does Pranith's original patch improve
performance instead of causing more cache misses?)

Thanks,

Paolo
