From: Ingo Molnar <mingo@kernel.org>
To: Waiman Long <waiman.long@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org, x86@kernel.org,
	linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Richard Weinberger <richard@nod.at>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Matt Fleming <matt.fleming@intel.com>,
	Herbert Xu <herbert@gondor.hengli.com.au>,
	Akinobu Mita <akinobu.mita@gmail.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Michel Lespinasse <walken@google.com>,
	Andi Kleen <andi@firstfloor.org>, Rik van Riel <riel@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	"Norton, Scott J" <scott.norton@hp.com>
Subject: Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
Date: Fri, 19 Jul 2013 10:40:23 +0200	[thread overview]
Message-ID: <20130719084023.GB25784@gmail.com> (raw)
In-Reply-To: <51E7F03A.4090305@hp.com>


* Waiman Long <waiman.long@hp.com> wrote:

> On 07/18/2013 03:42 AM, Ingo Molnar wrote:
> >* Waiman Long<waiman.long@hp.com>  wrote:
> >
> >>>>+ *    stealing the lock if they come at the right moment, the granting of
> >>>>+ *    the lock is mostly in FIFO order.
> >>>>+ * 2. It is faster in high contention situations.
> >>>Again, why is it faster?
> >>The current rwlock implementation suffers from a thundering herd
> >>problem. When many readers are waiting for a lock held by a writer,
> >>they will all jump in more or less at the same time when the writer
> >>releases the lock. That is not the case with qrwlock. It has been shown
> >>in many cases that avoiding this thundering herd problem can lead to
> >>better performance.
> >Btw., it's possible to further optimize this "writer releases the lock to
> >multiple readers spinning" thundering herd scenario in the classic
> >read_lock() case, without changing the queueing model.
> >
> >Right now the read_lock() fast path is a single atomic instruction. When a
> >writer releases the lock it becomes available to all readers, and
> >each reader will execute a LOCK DEC instruction, which will succeed.
> >
> >This is the relevant code in arch/x86/lib/rwlock.S [edited for
> >readability]:
> >
> >__read_lock_failed():
> >
> >0:      LOCK_PREFIX
> >         READ_LOCK_SIZE(inc) (%__lock_ptr)
> >
> >1:      rep; nop
> >         READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
> >         js      1b
> >
> >         LOCK_PREFIX READ_LOCK_SIZE(dec) (%__lock_ptr)
> >         js      0b
> >
> >         ret
> >
> >This is where we could optimize: instead of signalling to each reader that
> >it's fine to decrease the count and letting dozens of readers do that on
> >the same cache-line, which ping-pongs around the NUMA cross-connect,
> >touching every other CPU as they execute the LOCK DEC instruction, we
> >could, in essence, let the _writer_ modify the count on unlock to the exact
> >value that the readers expect.
> >
> >Since read_lock() can never abort, this should be relatively
> >straightforward: the INC above could be left out, and the writer side
> >only needs to detect that there are no other writers waiting before it
> >sets the count to the 'reader locked' value - which the readers will
> >detect without modifying the cache line:
> >
> >__read_lock_failed():
> >
> >0:      rep; nop
> >         READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
> >         js      0b
> >
> >         ret
> >
> >(Unless I'm missing something that is.)
> >
> >That way the current write_unlock() followed by a 'thundering herd' of
> >__read_lock_failed() atomic accesses is transformed into an efficient
> >read-only broadcast of information with only a single update to the
> >cacheline: the writer-updated cacheline propagates in parallel to every
> >CPU and is cached there.
> >
> >On typical hardware this will be broadcast to all CPUs as part of regular
> >MESI invalidation bus traffic.
> >
> >reader unlock will still have to modify the cacheline, so rwlocks will
> >still have a fundamental scalability limit even in the read-only usecase.
> 
> I think that will work. The only drawback that I can see is the fairness 
> argument. The current read/write lock implementation is unfair to the 
> writer. That change will make it even more unfair to the writer and 
> there is no easy way to detect a waiting writer unless we change the 
> structure to add such a field. As a result, a steady stream of readers 
> will have a higher chance of blocking out a writer indefinitely.

The effect will have to be measured - but I don't think it's particularly 
hard to tune the fairness balance between readers and writers: the change 
I suggested only affects the case where a writer that already holds the 
lock unlocks it.
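
For concreteness, here is a minimal user-space model of that change, in 
C11 atomics. It is a sketch of the idea only, not the kernel code: the 
names are made up and the interaction with newly arriving writers is 
left out.

#include <stdatomic.h>

#define RW_LOCK_BIAS 0x01000000          /* count value of a free lock */

struct model_rwlock {
        atomic_int count;                /* RW_LOCK_BIAS when free */
};

static void model_read_lock(struct model_rwlock *l)
{
        /* Fast path: a single atomic decrement, as today. */
        if (atomic_fetch_sub_explicit(&l->count, 1,
                                      memory_order_acquire) > 0)
                return;

        /*
         * Slow path per the suggestion above: do NOT undo the decrement.
         * Spin read-only until the writer's unlock restores the bias;
         * our decrement already accounts for our read-side hold.
         */
        while (atomic_load_explicit(&l->count, memory_order_acquire) < 1)
                ;                        /* cpu_relax() / "rep; nop" */
}

static void model_write_unlock(struct model_rwlock *l)
{
        /*
         * Adding the bias back is the single cache-line update; it leaves
         * the count at exactly the value the spinning readers wait for.
         */
        atomic_fetch_add_explicit(&l->count, RW_LOCK_BIAS,
                                  memory_order_release);
}

The spinning readers never write the lock word while they wait, so the 
writer's unlock reaches them as ordinary read-shared cache traffic.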

But when a writer already holds the lock it can decide to pass the lock 
to another spinning writer instead of unlocking it to all readers. This 
too should be relatively straightforward to implement, because neither 
read_lock() nor write_lock() can abort and race.

Instead of doing:

static inline void arch_write_unlock(arch_rwlock_t *rw)
{
        asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
                     : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
}

the current owner could check whether there are other writers waiting and 
could drop into a slowpath that passes ownership to one of those writers by 
toggling bit 30 or so. This halves the maximum number of writers the count 
can represent.
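
As a rough sketch of such a slowpath (reusing the model_rwlock type and 
RW_LOCK_BIAS from the model above; the waiting-writer counter and the 
handoff bit are hypothetical - nothing like them exists in arch_rwlock_t 
today - and the count is assumed to be exactly 0 at unlock time, i.e. no 
reader decrements are pending):

#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_HANDOFF (1 << 30)   /* hypothetical "passed to a writer" bit */

static void sketch_write_unlock(struct model_rwlock *l,
                                atomic_int *writers_waiting)
{
        if (atomic_load_explicit(writers_waiting, memory_order_relaxed) > 0) {
                /*
                 * Slowpath: instead of releasing the bias to the spinning
                 * readers, mark the lock as handed straight to the next
                 * spinning writer.
                 */
                atomic_fetch_or_explicit(&l->count, WRITER_HANDOFF,
                                         memory_order_release);
                return;
        }
        /* No writer waiting: normal unlock, the readers may proceed. */
        atomic_fetch_add_explicit(&l->count, RW_LOCK_BIAS,
                                  memory_order_release);
}

/* A spinning writer claims the handoff by clearing the bit again. */
static bool sketch_write_lock_try_handoff(struct model_rwlock *l)
{
        int v = atomic_load_explicit(&l->count, memory_order_relaxed);

        return (v & WRITER_HANDOFF) &&
               atomic_compare_exchange_strong_explicit(&l->count, &v,
                                v & ~WRITER_HANDOFF,
                                memory_order_acquire, memory_order_relaxed);
}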

But I'd implement this only if it proves to be a problem in practice.

I'd strongly suggest first addressing the thundering herd problem of 
write_unlock() and seeing how that affects scalability - before totally 
replacing it all with a new, fundamentally heavier locking primitive!

Thanks,

	Ingo
