From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Andi Kleen <ak@suse.de>
Cc: Arjan van de Ven <arjan@infradead.org>,
Ingo Molnar <mingo@elte.hu>,
Arjan van de Ven <arjan@intel.linux.com>,
linux-kernel@vger.kernel.org, akpm@osdl.org, andrea@suse.de
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages
Date: Sun, 26 Feb 2006 03:48:51 +1100
Message-ID: <44008A73.4030804@yahoo.com.au>
In-Reply-To: <200602241327.27390.ak@suse.de>
[-- Attachment #1: Type: text/plain, Size: 1595 bytes --]
Andi Kleen wrote:
> On Friday 24 February 2006 10:26, Nick Piggin wrote:
>
>
>>[aside]
>>Actually I have a scalability improvement for rwsems, that moves the
>>actual task wakeups out from underneath the rwsem spinlock in the up()
>>paths. This was useful exactly on a mixed read+write workload on mmap_sem.
>>
>>The difference was quite large for the "generic rwsem" algorithm because
>>it uses the spinlock in fastpaths a lot more than the xadd algorithm. I
>>think x86-64 uses the former, which is what I presume you're testing with?
>
>
> I used the generic algorithm because Andrea originally expressed some doubts
> about the correctness of the xadd algorithms, and after trying to understand them
> myself I wasn't sure either. Generic was the safer choice.
>
> But if someone can show convincing numbers that XADD rwsems are faster
> for some workload we can switch. I guess they are tested well enough now
> on i386.
>
> Or would your scalability improvement remove that difference (if it really exists)?
>
Here is a forward port of my scalability improvement, untested.
Unfortunately I didn't include absolute performance results, but the
changelog indicates that there was some noticeable delta between the
rwsem implementations (and that's what I vaguely remember).
Note: this was with volanomark, which IIRC can be quite variable at
the best of times.
Note 2: this was on a 16-way NUMAQ, whose interconnect had the
performance of a string between two cans compared to a modern
(if smaller) x86-64 system, so an improvement may be harder to notice.
--
SUSE Labs, Novell Inc.
[-- Attachment #2: rwsem-scale.patch --]
[-- Type: text/plain, Size: 12243 bytes --]
Move wakeups out from under the rwsem's wait_lock spinlock.
This reduces that lock's contention by a factor of around
10 on a 16-way NUMAQ running volanomark; however, cacheline
contention on the rwsem's "activity" word drowns out these
small improvements when using the i386 "optimised" rwsem:
unpatched:
55802519 total 32.3097
23325323 default_idle 364458.1719
22084349 .text.lock.futex 82404.2873
2369107 queue_me 24678.1979
1875296 unqueue_me 9767.1667
1202258 .text.lock.rwsem 46240.6923
941801 finish_task_switch 7357.8203
787101 __wake_up 12298.4531
645252 drop_key_refs 13442.7500
362789 futex_wait 839.7894
333294 futex_wake 1487.9196
146797 rwsem_down_read_failed 436.8958
82788 .text.lock.dev 221.3583
81221 try_to_wake_up 133.5872
+rwsem-scale:
58120260 total 33.6458
25482132 default_idle 398158.3125
22774675 .text.lock.futex 84980.1306
2517797 queue_me 26227.0521
1953424 unqueue_me 10174.0833
1063068 finish_task_switch 8305.2188
834793 __wake_up 13043.6406
674570 drop_key_refs 14053.5417
371811 futex_wait 860.6736
343398 futex_wake 1533.0268
155419 try_to_wake_up 255.6234
114704 .text.lock.rwsem 4411.6923
The rwsem-spinlock implementation, however, is improved
significantly more, and now gets volanomark performance
similar to the xadd rwsem:
unpatched:
30850964 total 18.1787
18986006 default_idle 296656.3438
3989183 .text.lock.rwsem_spinlock 40294.7778
2990161 .text.lock.futex 32501.7500
549707 finish_task_switch 4294.5859
535327 __down_read 3717.5486
452721 queue_me 4715.8438
439725 __up_read 9160.9375
396273 __wake_up 6191.7656
326595 unqueue_me 1701.0156
+rwsem-scale:
25378268 total 14.9537
13325514 default_idle 208211.1562
3675634 .text.lock.futex 39952.5435
2908629 .text.lock.rwsem_spinlock 28239.1165
628115 __down_read 4361.9097
607417 finish_task_switch 4745.4453
588031 queue_me 6125.3229
571169 __up_read 11899.3542
436795 __wake_up 6824.9219
416788 unqueue_me 2170.7708
Index: linux-2.6/lib/rwsem.c
===================================================================
--- linux-2.6.orig/lib/rwsem.c 2006-02-26 03:05:01.000000000 +1100
+++ linux-2.6/lib/rwsem.c 2006-02-26 03:37:04.000000000 +1100
@@ -36,14 +36,15 @@ void rwsemtrace(struct rw_semaphore *sem
* - the spinlock must be held by the caller
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only woken if downgrading is false
+ *
+ * The spinlock will be dropped by this function.
*/
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int downgrading, unsigned long flags)
{
+ LIST_HEAD(wake_list);
struct rwsem_waiter *waiter;
- struct task_struct *tsk;
- struct list_head *next;
- signed long oldcount, woken, loop;
+ signed long oldcount, woken;
rwsemtrace(sem, "Entering __rwsem_do_wake");
@@ -72,12 +73,8 @@ __rwsem_do_wake(struct rw_semaphore *sem
* It is an allocated on the waiter's stack and may become invalid at
* any time after that point (due to a wakeup from another source).
*/
- list_del(&waiter->list);
- tsk = waiter->task;
- smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
+ list_move_tail(&waiter->list, &wake_list);
+ waiter->flags = 0;
goto out;
/* don't want to wake any writers */
@@ -94,41 +91,37 @@ __rwsem_do_wake(struct rw_semaphore *sem
readers_only:
woken = 0;
do {
- woken++;
-
- if (waiter->list.next == &sem->wait_list)
+ list_move_tail(&waiter->list, &wake_list);
+ waiter->flags = 0;
+ woken += RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+ if (list_empty(&sem->wait_list))
break;
- waiter = list_entry(waiter->list.next,
+ waiter = list_entry(sem->wait_list.next,
struct rwsem_waiter, list);
-
} while (waiter->flags & RWSEM_WAITING_FOR_READ);
- loop = woken;
- woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
if (!downgrading)
/* we'd already done one increment earlier */
woken -= RWSEM_ACTIVE_BIAS;
rwsem_atomic_add(woken, sem);
- next = sem->wait_list.next;
- for (; loop > 0; loop--) {
- waiter = list_entry(next, struct rwsem_waiter, list);
- next = waiter->list.next;
+out:
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
+ while (!list_empty(&wake_list)) {
+ struct task_struct *tsk;
+ waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
+ list_del(&waiter->list);
tsk = waiter->task;
- smp_mb();
waiter->task = NULL;
+ smp_mb();
wake_up_process(tsk);
put_task_struct(tsk);
}
- sem->wait_list.next = next;
- next->prev = &sem->wait_list;
-
- out:
rwsemtrace(sem, "Leaving __rwsem_do_wake");
- return sem;
+ return;
/* undo the change to count, but check for a transition 1->0 */
undo:
@@ -145,12 +138,13 @@ rwsem_down_failed_common(struct rw_semap
struct rwsem_waiter *waiter, signed long adjustment)
{
struct task_struct *tsk = current;
+ unsigned long flags;
signed long count;
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
/* set up my own style of waitqueue */
- spin_lock_irq(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);
waiter->task = tsk;
get_task_struct(tsk);
@@ -161,14 +155,12 @@ rwsem_down_failed_common(struct rw_semap
/* if there are no active locks, wake the front queued process(es) up */
if (!(count & RWSEM_ACTIVE_MASK))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irq(&sem->wait_lock);
+ __rwsem_do_wake(sem, 0, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
/* wait to be given the lock */
- for (;;) {
- if (!waiter->task)
- break;
+ while (waiter->task) {
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
@@ -227,9 +219,9 @@ struct rw_semaphore fastcall *rwsem_wake
/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 0, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
rwsemtrace(sem, "Leaving rwsem_wake");
@@ -251,9 +243,9 @@ struct rw_semaphore fastcall *rwsem_down
/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 1);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 1, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
rwsemtrace(sem, "Leaving rwsem_downgrade_wake");
return sem;
Index: linux-2.6/lib/rwsem-spinlock.c
===================================================================
--- linux-2.6.orig/lib/rwsem-spinlock.c 2006-02-26 03:05:01.000000000 +1100
+++ linux-2.6/lib/rwsem-spinlock.c 2006-02-26 03:37:42.000000000 +1100
@@ -49,11 +49,11 @@ void fastcall init_rwsem(struct rw_semap
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only woken if wakewrite is non-zero
*/
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite, unsigned long flags)
{
+ LIST_HEAD(wake_list);
struct rwsem_waiter *waiter;
- struct task_struct *tsk;
int woken;
rwsemtrace(sem, "Entering __rwsem_do_wake");
@@ -73,46 +73,46 @@ __rwsem_do_wake(struct rw_semaphore *sem
*/
if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
sem->activity = -1;
- list_del(&waiter->list);
- tsk = waiter->task;
- /* Don't touch waiter after ->task has been NULLed */
- smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
+ list_move_tail(&waiter->list, &wake_list);
goto out;
}
/* grant an infinite number of read locks to the front of the queue */
dont_wake_writers:
woken = 0;
- while (waiter->flags & RWSEM_WAITING_FOR_READ) {
- struct list_head *next = waiter->list.next;
+ do {
+ list_move_tail(&waiter->list, &wake_list);
+ woken++;
+ if (list_empty(&sem->wait_list))
+ break;
+ waiter = list_entry(sem->wait_list.next,
+ struct rwsem_waiter, list);
+ } while (waiter->flags & RWSEM_WAITING_FOR_READ);
+
+ sem->activity += woken;
+out:
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
+ while (!list_empty(&wake_list)) {
+ struct task_struct *tsk;
+ waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
list_del(&waiter->list);
tsk = waiter->task;
- smp_mb();
waiter->task = NULL;
+ smp_mb();
wake_up_process(tsk);
put_task_struct(tsk);
- woken++;
- if (list_empty(&sem->wait_list))
- break;
- waiter = list_entry(next, struct rwsem_waiter, list);
}
- sem->activity += woken;
-
- out:
rwsemtrace(sem, "Leaving __rwsem_do_wake");
- return sem;
}
/*
* wake a single writer
+ * called with wait_lock locked and returns with it unlocked.
*/
-static inline struct rw_semaphore *
-__rwsem_wake_one_writer(struct rw_semaphore *sem)
+static inline void
+__rwsem_wake_one_writer(struct rw_semaphore *sem, unsigned long flags)
{
struct rwsem_waiter *waiter;
struct task_struct *tsk;
@@ -121,13 +121,13 @@ __rwsem_wake_one_writer(struct rw_semaph
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
list_del(&waiter->list);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
tsk = waiter->task;
- smp_mb();
waiter->task = NULL;
+ smp_mb();
wake_up_process(tsk);
put_task_struct(tsk);
- return sem;
}
/*
@@ -163,9 +163,7 @@ void fastcall __sched __down_read(struct
spin_unlock_irq(&sem->wait_lock);
/* wait to be given the lock */
- for (;;) {
- if (!waiter.task)
- break;
+ while (waiter.task) {
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
@@ -234,9 +232,7 @@ void fastcall __sched __down_write(struc
spin_unlock_irq(&sem->wait_lock);
/* wait to be given the lock */
- for (;;) {
- if (!waiter.task)
- break;
+ while (waiter.task) {
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
@@ -283,9 +279,9 @@ void fastcall __up_read(struct rw_semaph
spin_lock_irqsave(&sem->wait_lock, flags);
if (--sem->activity == 0 && !list_empty(&sem->wait_list))
- sem = __rwsem_wake_one_writer(sem);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_wake_one_writer(sem, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
rwsemtrace(sem, "Leaving __up_read");
}
@@ -303,9 +299,9 @@ void fastcall __up_write(struct rw_semap
sem->activity = 0;
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 1);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 1, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
rwsemtrace(sem, "Leaving __up_write");
}
@@ -324,9 +320,9 @@ void fastcall __downgrade_write(struct r
sem->activity = 1;
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 0, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
rwsemtrace(sem, "Leaving __downgrade_write");
}