* Re: blk_congestion_wait racy?
@ 2004-03-09 17:54 Martin Schwidefsky
  2004-03-10  5:23 ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Schwidefsky @ 2004-03-09 17:54 UTC
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm

Hi Nick,

> Another problem is that if there are no requests anywhere in the system,
> sleepers in blk_congestion_wait will not get kicked. blk_congestion_wait
> could probably have blk_run_queues moved after prepare_to_wait, which
> might help.
I tried putting blk_run_queues after prepare_to_wait; it worked, but it
didn't help. The test still needs close to a minute.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com




* Re: blk_congestion_wait racy?
  2004-03-09 17:54 blk_congestion_wait racy? Martin Schwidefsky
@ 2004-03-10  5:23 ` Nick Piggin
  2004-03-10  5:35   ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2004-03-10  5:23 UTC
  To: Martin Schwidefsky; +Cc: Andrew Morton, linux-kernel, linux-mm



Martin Schwidefsky wrote:

>
>
>
>Hi Nick,
>
>
>>Another problem is that if there are no requests anywhere in the system,
>>sleepers in blk_congestion_wait will not get kicked. blk_congestion_wait
>>could probably have blk_run_queues moved after prepare_to_wait, which
>>might help.
>>
>I tried putting blk_run_queues after prepare_to_wait; it worked, but it
>didn't help. The test still needs close to a minute.
>
>

OK. This was *with* the memory barrier changes too, was it? Not that
they should make that much difference. The test is still racy, but
the window just gets smaller.

But I'm guessing that you have no requests in flight by the time
blk_congestion_wait gets called, so nothing ever gets kicked.

I prefer something more like this model: if 'current' submits a request
to a congested queue then it gets put on the congestion waitqueue.
You can then run blk_congestion_wait afterwards and it won't block if
the queue you've written to has come out of congestion at any time.

This also means that you can (should, in fact) stop uncongested queues
from waking up the waiters every time they complete a request. Hmm, I
like it.
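
A rough sketch of that model (hypothetical code: the congested_bdi
task field and the bdi_congested() helper are made up for
illustration, this is not real -mm code):

        /* At request submission: note that 'current' hit congestion. */
        static void note_congestion(request_queue_t *q, int rw)
        {
                int bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;

                if (test_bit(bit, &q->backing_dev_info.state))
                        current->congested_bdi = &q->backing_dev_info;
        }

        long blk_congestion_wait(int rw, long timeout)
        {
                struct backing_dev_info *bdi = current->congested_bdi;
                long ret = timeout;
                DEFINE_WAIT(wait);

                if (!bdi)
                        return ret;     /* never hit a congested queue */

                prepare_to_wait(&congestion_wqh[rw], &wait, TASK_UNINTERRUPTIBLE);
                /* Re-check after prepare_to_wait() so a wakeup between the
                 * congestion test and the sleep cannot be lost. */
                if (bdi_congested(bdi, rw))
                        ret = io_schedule_timeout(timeout);
                finish_wait(&congestion_wqh[rw], &wait);
                current->congested_bdi = NULL;
                return ret;
        }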



* Re: blk_congestion_wait racy?
  2004-03-10  5:23 ` Nick Piggin
@ 2004-03-10  5:35   ` Andrew Morton
  2004-03-10  5:47     ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2004-03-10  5:35 UTC
  To: Nick Piggin; +Cc: schwidefsky, linux-kernel, linux-mm

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> But I'm guessing that you have no requests in flight by the time
>  blk_congestion_wait gets called, so nothing ever gets kicked.

That's why blk_congestion_wait() in -mm propagates the schedule_timeout()
return value.   You can do:

	if (blk_congestion_wait(...))
		printk("ouch\n");

If your kernel says ouch much, we have a problem.
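
(io_schedule_timeout(), like schedule_timeout(), returns 0 when the
full timeout elapses and the remaining jiffies when the task is woken
early, so the timed-out "nobody kicked us" case is the zero return;
the check above is presumably shorthand for something like:)

        if (blk_congestion_wait(WRITE, HZ/10) == 0)
                printk("ouch\n");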


* Re: blk_congestion_wait racy?
  2004-03-10  5:35   ` Andrew Morton
@ 2004-03-10  5:47     ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2004-03-10  5:47 UTC
  To: Andrew Morton; +Cc: schwidefsky, linux-kernel, linux-mm


Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>>But I'm guessing that you have no requests in flight by the time
>> blk_congestion_wait gets called, so nothing ever gets kicked.
>>
>
>That's why blk_congestion_wait() in -mm propagates the schedule_timeout()
>return value.   You can do:
>
>	if (blk_congestion_wait(...))
>		printk("ouch\n");
>
>If your kernel says ouch much, we have a problem.
>
>

Martin, have you tried adding this printk?

Andrew, could you take the following patch (even though it didn't fix
the problem)?

I think the smp_mb isn't needed because the rl waitqueue stuff is
serialised by the queue spinlocks.

The addition of the smp_mb and the other change are meant to narrow
the race window a bit. Obviously races can still happen; it's a racy
interface, and that doesn't matter much.
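
Spelled out, the waker side with the barrier added looks like this
(the hunk from the patch below written out, with the reasoning as
comments; the declarations are reconstructed rather than copied from
the tree):

        static void clear_queue_congested(request_queue_t *q, int rw)
        {
                int bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
                wait_queue_head_t *wqh = &congestion_wqh[rw];

                clear_bit(bit, &q->backing_dev_info.state);
                /*
                 * Order the clear_bit() store against the waitqueue_active()
                 * load, so a sleeper that is already on the waitqueue cannot
                 * be missed.  A sleeper that has not yet reached
                 * prepare_to_wait() can still be missed, because
                 * blk_congestion_wait() never re-checks a condition after
                 * queueing itself - the window shrinks but does not close.
                 */
                smp_mb();
                if (waitqueue_active(wqh))
                        wake_up(wqh);
        }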


[-- Attachment #2: blk-congestion-races.patch --]
[-- Type: text/x-patch, Size: 1341 bytes --]

 linux-2.6-npiggin/drivers/block/ll_rw_blk.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff -puN drivers/block/ll_rw_blk.c~blk-congestion-races drivers/block/ll_rw_blk.c
--- linux-2.6/drivers/block/ll_rw_blk.c~blk-congestion-races	2004-03-10 16:38:33.000000000 +1100
+++ linux-2.6-npiggin/drivers/block/ll_rw_blk.c	2004-03-10 16:41:29.000000000 +1100
@@ -110,6 +110,9 @@ static void clear_queue_congested(reques
 
 	bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
 	clear_bit(bit, &q->backing_dev_info.state);
+
+	smp_mb(); /* congestion_wqh is not synchronised. This is still racy,
+		   * but better. It isn't a big deal */
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
 }
@@ -1543,7 +1546,6 @@ static void freed_request(request_queue_
 	if (rl->count[rw] < queue_congestion_off_threshold(q))
 		clear_queue_congested(q, rw);
 	if (rl->count[rw]+1 <= q->nr_requests) {
-		smp_mb();
 		if (waitqueue_active(&rl->wait[rw]))
 			wake_up(&rl->wait[rw]);
 		if (!waitqueue_active(&rl->wait[rw]))
@@ -2036,8 +2038,8 @@ long blk_congestion_wait(int rw, long ti
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[rw];
 
-	blk_run_queues();
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	blk_run_queues();
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 	return ret;

_


* Re: blk_congestion_wait racy?
  2004-03-11 19:04 Martin Schwidefsky
  2004-03-11 23:25 ` Andrew Morton
@ 2004-03-12  2:31 ` Nick Piggin
  1 sibling, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2004-03-12  2:31 UTC
  To: Martin Schwidefsky; +Cc: Andrew Morton, linux-kernel, linux-mm



Martin Schwidefsky wrote:

>
>
>
>>Yes, sorry, all the world's an x86 :( Could you please send me whatever
>>diffs were needed to get it all going?
>>
>
>I am just preparing that mail :-)
>
>
>>I thought you were running a 256MB machine?  Two seconds for 400 megs of
>>swapout?  What's up?
>>
>
>Roughly 400 MB of swapout. And two seconds isn't that bad ;-)
>
>
>>An ouch-per-second sounds reasonable.  It could simply be that the CPUs
>>were off running other tasks - those timeouts are less than scheduling
>>quanta.
>>
>
>I don't understand why an ouch-per-second is reasonable. The mempig is
>the only process that runs on the machine and the blk_congestion_wait
>uses HZ/10 as the timeout value. I'd expect about 100 ouches for the 10
>seconds the test runs.
>
>The 4x performance difference is still not understood.
>
>

It would still be blk_congestion_wait slowing things down, wouldn't
it? Performance was good when you took that out, wasn't it?

And it would not be unusual for you to be waiting needlessly without
seeing the ouch.

I think I will try doing a non-racy blk_congestion_wait after Jens'
unplugging patch gets put into -mm. That should solve your problem.



* Re: blk_congestion_wait racy?
  2004-03-11 19:04 Martin Schwidefsky
@ 2004-03-11 23:25 ` Andrew Morton
  2004-03-12  2:31 ` Nick Piggin
  1 sibling, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2004-03-11 23:25 UTC
  To: Martin Schwidefsky; +Cc: linux-kernel, linux-mm, piggin

Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
>
> > An ouch-per-second sounds reasonable.  It could simply be that the CPUs
> > were off running other tasks - those timeouts are less than scheduling
> > quanta.
> 
> I don't understand why an ouch-per-second is reasonable. The mempig is
> the only process that runs on the machine and the blk_congestion_wait
> uses HZ/10 as the timeout value. I'd expect about 100 ouches for the 10
> seconds the test runs.

blk_congestion_wait() is supposed to be terminated by someone releasing a
disk write request.  If no write requests are freed in 100 milliseconds
then either Something Is Up or that process simply was not scheduled for
some time after the wakeup was delivered.
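
The wakeup comes from the request-freeing path, roughly like this
(simplified from drivers/block/ll_rw_blk.c of that era; the request
starvation handling is omitted):

        static void freed_request(request_queue_t *q, int rw)
        {
                struct request_list *rl = &q->rq;

                rl->count[rw]--;
                /* Dropping below the congestion-off threshold wakes
                 * congestion_wqh[rw] via clear_queue_congested(). */
                if (rl->count[rw] < queue_congestion_off_threshold(q))
                        clear_queue_congested(q, rw);
        }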




* Re: blk_congestion_wait racy?
@ 2004-03-11 19:04 Martin Schwidefsky
  2004-03-11 23:25 ` Andrew Morton
  2004-03-12  2:31 ` Nick Piggin
  0 siblings, 2 replies; 13+ messages in thread
From: Martin Schwidefsky @ 2004-03-11 19:04 UTC
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, piggin

> Yes, sorry, all the world's an x86 :( Could you please send me whatever
> diffs were needed to get it all going?

I am just preparing that mail :-)

> I thought you were running a 256MB machine?  Two seconds for 400 megs of
> swapout?  What's up?

Roughly 400 MB of swapout. And two seconds isn't that bad ;-)

> An ouch-per-second sounds reasonable.  It could simply be that the CPUs
> were off running other tasks - those timeouts are less than scheduling
> quanta.

I don't understand why an ouch-per-second is reasonable. The mempig is
the only process that runs on the machine and the blk_congestion_wait
uses HZ/10 as the timeout value. I'd expect about 100 ouches for the 10
seconds the test runs.

The 4x performance difference is still not understood.


blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com




* Re: blk_congestion_wait racy?
  2004-03-11 18:24 Martin Schwidefsky
@ 2004-03-11 18:55 ` Andrew Morton
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2004-03-11 18:55 UTC
  To: Martin Schwidefsky; +Cc: piggin, linux-kernel, linux-mm

Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
>
> > Martin, have you tried adding this printk?
> 
> Sorry for the delay. I had to get 2.6.4-mm1 working before doing the
> "ouch" test. The new pte_to_pgprot/pgoff_prot_to_pte stuff wasn't easy.

Yes, sorry, all the world's an x86 :( Could you please send me whatever
diffs were needed to get it all going?

There are porting instructions in
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.4/2.6.4-mm1/broken-out/remap-file-pages-prot-2.6.4-rc1-mm1-A1.patch
but maybe it's a bit late for that.

> I tested 2.6.4-mm1 with the blk_run_queues move and the ouch printk.
> The first interesting observation is that 2.6.4-mm1 behaves MUCH better
> than 2.6.4:
> 
> 2.6.4-mm1 with 1 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m2.587s
> user    0m0.100s
> sys     0m0.730s
> #

I thought you were running a 256MB machine?  Two seconds for 400 megs of
swapout?  What's up?

> 2.6.4-mm1 with 2 cpus
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m10.313s
> user    0m0.160s
> sys     0m0.780s
> #
> 
> 2.6.4 takes > 1min for the test with 2 cpus.
> 
> The second observation is that I get only a few "ouch" messages. They
> all come from the blk_congestion_wait in try_to_free_pages, as expected.
> What I did not expect is that I only got 9 "ouches" for the run with
> 2 cpus.

An ouch-per-second sounds reasonable.  It could simply be that the CPUs
were off running other tasks - those timeouts are less than scheduling
quanta.

The 4x performance difference is still not understood.


* Re: blk_congestion_wait racy?
@ 2004-03-11 18:24 Martin Schwidefsky
  2004-03-11 18:55 ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Schwidefsky @ 2004-03-11 18:24 UTC
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm

> Martin, have you tried adding this printk?

Sorry for the delay. I had to get 2.6.4-mm1 working before doing the
"ouch" test. The new pte_to_pgprot/pgoff_prot_to_pte stuff wasn't easy.
I tested 2.6.4-mm1 with the blk_run_queues move and the ouch printk.
The first interesting observation is that 2.6.4-mm1 behaves MUCH better
than 2.6.4:

2.6.4-mm1 with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m2.587s
user    0m0.100s
sys     0m0.730s
#

2.6.4-mm1 with 2 cpus
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m10.313s
user    0m0.160s
sys     0m0.780s
#

2.6.4 takes > 1min for the test with 2 cpus.

The second observation is that I get only a few "ouch" messages. They
all come from the blk_congestion_wait in try_to_free_pages, as expected.
What I did not expect is that I only got 9 "ouches" for the run with
2 cpus.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com




* Re: blk_congestion_wait racy?
  2004-03-08 13:38 Martin Schwidefsky
@ 2004-03-08 23:50 ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2004-03-08 23:50 UTC
  To: Martin Schwidefsky; +Cc: Andrew Morton, linux-kernel, linux-mm

Martin Schwidefsky wrote:

>
>
>
>>Gad, that'll make the VM scan its guts out.
>>
>Yes, I expected something like this.
>
>
>>>2.6.4-rc2 + "fix" with 1 cpu
>>>sys     0m0.880s
>>>
>>>2.6.4-rc2 + "fix" with 2 cpu
>>>sys     0m1.560s
>>>
>>system time was doubled though.
>>
>That would be the additional cost for not waiting.
>
>

I'd say it's more like cacheline contention or something: reclaim
won't simply be spinning with nothing to do because you're dirtying
plenty of memory. And if any queues were full it will mostly just be
blocking in the block layer.

>>Nope, something is obviously broken.   I'll take a look.
>>
>That would be very much appreciated.
>

I'm looking at 2.6.1 source, so apologies if I'm wrong, but
drivers/block/ll_rw_blk.c:
freed_request does not need the memory barrier because the queue is
protected by the per-queue spinlock. And I think clear_queue_congested
should have a memory barrier right before if (waitqueue_active(wqh)).

Another problem is that if there are no requests anywhere in the system,
sleepers in blk_congestion_wait will not get kicked. blk_congestion_wait
could probably have blk_run_queues moved after prepare_to_wait, which
might help.
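
That is, instead of

        blk_run_queues();
        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        io_schedule_timeout(timeout);
        finish_wait(wqh, &wait);

do

        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        blk_run_queues();
        io_schedule_timeout(timeout);
        finish_wait(wqh, &wait);

so that any completions triggered by blk_run_queues() find the task
already on the waitqueue instead of racing with it.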

Just some ideas.




* Re: blk_congestion_wait racy?
@ 2004-03-08 13:38 Martin Schwidefsky
  2004-03-08 23:50 ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Schwidefsky @ 2004-03-08 13:38 UTC
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

> Gad, that'll make the VM scan its guts out.
Yes, I expected something like this.

> > 2.6.4-rc2 + "fix" with 1 cpu
> > sys     0m0.880s
> >
> > 2.6.4-rc2 + "fix" with 2 cpu
> > sys     0m1.560s
>
> system time was doubled though.
That would be the additional cost for not waiting.

> Nope, something is obviously broken.   I'll take a look.
That would be very much appreciated.

> Perhaps with two CPUs you are able to get kswapd and mempig running page
> reclaim at the same time, which causes seekier swap I/O patterns than with
> one CPU, where we only run one app or the other at any time.
>
> Serialising balance_pgdat() and try_to_free_pages() with a global semaphore
> would be a way of testing that theory.

Just tried the following patch:

Index: mm/vmscan.c
===================================================================
RCS file: /home/cvs/linux-2.5/mm/vmscan.c,v
retrieving revision 1.45
diff -u -r1.45 vmscan.c
--- mm/vmscan.c   18 Feb 2004 17:45:28 -0000    1.45
+++ mm/vmscan.c   8 Mar 2004 13:30:56 -0000
@@ -848,6 +848,7 @@
  * excessive rotation of the inactive list, which is _supposed_ to be an LRU,
  * yes?
  */
+static DECLARE_MUTEX(reclaim_sem);
 int try_to_free_pages(struct zone **zones,
            unsigned int gfp_mask, unsigned int order)
 {
@@ -858,6 +859,8 @@
      struct reclaim_state *reclaim_state = current->reclaim_state;
      int i;

+     down(&reclaim_sem);
+
      inc_page_state(allocstall);

      for (i = 0; zones[i] != 0; i++)
@@ -884,7 +887,10 @@
            wakeup_bdflush(total_scanned);

            /* Take a nap, wait for some writeback to complete */
+           up(&reclaim_sem);
            blk_congestion_wait(WRITE, HZ/10);
+           down(&reclaim_sem);
+
            if (zones[0] - zones[0]->zone_pgdat->node_zones < ZONE_HIGHMEM) {
                  shrink_slab(total_scanned, gfp_mask);
                  if (reclaim_state) {
@@ -898,6 +904,9 @@
 out:
      for (i = 0; zones[i] != 0; i++)
            zones[i]->prev_priority = zones[i]->temp_priority;
+
+     up(&reclaim_sem);
+
      return ret;
 }

@@ -926,6 +935,8 @@
      int i;
      struct reclaim_state *reclaim_state = current->reclaim_state;

+     down(&reclaim_sem);
+
      inc_page_state(pageoutrun);

      for (i = 0; i < pgdat->nr_zones; i++) {
@@ -974,8 +985,11 @@
            }
            if (all_zones_ok)
                  break;
-           if (to_free > 0)
+           if (to_free > 0) {
+                 up(&reclaim_sem);
                  blk_congestion_wait(WRITE, HZ/10);
+                 down(&reclaim_sem);
+           }
      }

      for (i = 0; i < pgdat->nr_zones; i++) {
@@ -983,6 +997,9 @@

            zone->prev_priority = zone->temp_priority;
      }
+
+     up(&reclaim_sem);
+
      return nr_pages - to_free;
 }


It didn't help. Still needs almost a minute.

blue skies,
   Martin



* Re: blk_congestion_wait racy?
  2004-03-08  9:59 Martin Schwidefsky
@ 2004-03-08 12:24 ` Andrew Morton
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2004-03-08 12:24 UTC
  To: Martin Schwidefsky; +Cc: linux-kernel, linux-mm

Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
>
> Hi,
> we have a stupid little program that linearly allocates and touches
> memory. We use this to see how fast s390 can swap. If this is combined
> with the fastest block device we have (xpram) we see a very strange
> effect:
> 
> 2.6.4-rc2 with 1 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m2.516s
> user    0m0.150s
> sys     0m0.570s
> #
> 
> 2.6.4-rc2 with 2 cpus
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m56.086s
> user    0m0.110s
> sys     0m0.630s
> #

Interesting.

> I have the suspicion that the call to blk_congestion_wait in
> try_to_free_pages is part of the problem. It initiates a wait for
> a queue to exit congestion but this could already have happened
> on another cpu before blk_congestion_wait has set up the wait
> queue. In this case the process sleeps for 0.1 seconds.

The comment may be a bit stale.  The idea is that the VM needs to take a
nap while the disk system retires some writes.  So we go to sleep until a
write request gets put back.  We do this regardless of the queue's
congestion state - the queue could have thousands of request slots and may
never even become congested.
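
The call site is the reclaim loop; roughly, from mm/vmscan.c of that
era:

        wakeup_bdflush(total_scanned);

        /* Take a nap, wait for some writeback to complete */
        blk_congestion_wait(WRITE, HZ/10);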

> With the swap test setup this happens all the time. If I "fix"
> blk_congestion_wait not to wait:
> 
> diff -urN linux-2.6/drivers/block/ll_rw_blk.c linux-2.6-fix/drivers/block/ll_rw_blk.c
> --- linux-2.6/drivers/block/ll_rw_blk.c	Fri Mar  5 14:50:28 2004
> +++ linux-2.6-fix/drivers/block/ll_rw_blk.c	Fri Mar  5 14:51:05 2004
> @@ -1892,7 +1892,9 @@
>  
>  	blk_run_queues();
>  	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +#if 0
>  	io_schedule_timeout(timeout);
> +#endif
>  	finish_wait(wqh, &wait);
>  }

Gad, that'll make the VM scan its guts out.

> then the system reacts normally again:
> 
> 2.6.4-rc2 + "fix" with 1 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m2.523s
> user    0m0.200s
> sys     0m0.880s
> #
> 
> 2.6.4-rc2 + "fix" with 2 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600  of 600
> Done.
> 
> real    0m2.029s
> user    0m0.250s
> sys     0m1.560s
> #

system time was doubled though.

> Since removing the call to io_schedule_timeout isn't a solution,
> I tried to understand what the event is that blk_congestion_wait
> is waiting for. The comment says it waits for a queue to exit congestion.

It's just waiting for a write request to complete.  It's a pretty crude way
of throttling page reclaim to the I/O system's speed.

> That is, starting from prepare_to_wait it waits for a call to
> clear_queue_congested. In my test scenario NO queue is congested on
> entry to blk_congestion_wait. I'd like to see a proper wait_event
> there but it is non-trivial to define the event to wait for.
> Any useful hints?

Nope, something is obviously broken.   I'll take a look.

Perhaps with two CPUs you are able to get kswapd and mempig running page
reclaim at the same time, which causes seekier swap I/O patterns than with
one CPU, where we only run one app or the other at any time.

Serialising balance_pgdat() and try_to_free_pages() with a global semaphore
would be a way of testing that theory.



* blk_congestion_wait racy?
@ 2004-03-08  9:59 Martin Schwidefsky
  2004-03-08 12:24 ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Schwidefsky @ 2004-03-08  9:59 UTC
  To: linux-kernel, linux-mm

Hi,
we have a stupid little program that linearly allocates and touches
memory. We use this to see how fast s390 can swap. If this is combined
with the fastest block device we have (xpram) we see a very strange
effect:

2.6.4-rc2 with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m2.516s
user    0m0.150s
sys     0m0.570s
#

2.6.4-rc2 with 2 cpus
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m56.086s
user    0m0.110s
sys     0m0.630s
#

I have the suspicion that the call to blk_congestion_wait in
try_to_free_pages is part of the problem. It initiates a wait for
a queue to exit congestion but this could already have happened
on another cpu before blk_congestion_wait has set up the wait
queue. In this case the process sleeps for 0.1 seconds. With
the swap test setup this happens all the time. If I "fix"
blk_congestion_wait not to wait:

diff -urN linux-2.6/drivers/block/ll_rw_blk.c linux-2.6-fix/drivers/block/ll_rw_blk.c
--- linux-2.6/drivers/block/ll_rw_blk.c	Fri Mar  5 14:50:28 2004
+++ linux-2.6-fix/drivers/block/ll_rw_blk.c	Fri Mar  5 14:51:05 2004
@@ -1892,7 +1892,9 @@
 
 	blk_run_queues();
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+#if 0
 	io_schedule_timeout(timeout);
+#endif
 	finish_wait(wqh, &wait);
 }
 
then the system reacts normally again:

2.6.4-rc2 + "fix" with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m2.523s
user    0m0.200s
sys     0m0.880s
#

2.6.4-rc2 + "fix" with 2 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600  of 600
Done.

real    0m2.029s
user    0m0.250s
sys     0m1.560s
#

Since removing the call to io_schedule_timeout isn't a solution,
I tried to understand what the event is that blk_congestion_wait
is waiting for. The comment says it waits for a queue to exit congestion.
That is, starting from prepare_to_wait it waits for a call to
clear_queue_congested. In my test scenario NO queue is congested on
entry to blk_congestion_wait. I'd like to see a proper wait_event
there but it is non-trivial to define the event to wait for.
Any useful hints?
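
A condition-based wait would need a re-check between prepare_to_wait
and the schedule, something like the sketch below. But
queue_is_still_congested is made up, and that is exactly the problem:
nothing tells blk_congestion_wait which queue to check.

        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        if (queue_is_still_congested(/* which queue? */))
                io_schedule_timeout(timeout);
        finish_wait(wqh, &wait);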

blue skies,
   Martin


