* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
[not found] <Pine.Linu.4.10.10011091452270.747-100000@mikeg.weiden.de>
@ 2000-11-09 18:31 ` Linus Torvalds
2000-11-10 7:34 ` Mike Galbraith
2000-11-10 21:42 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-09 18:31 UTC
To: Mike Galbraith
Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox
As to the real reason for stalls on /proc/<pid>/stat, I bet it has nothing
to do with IO except indirectly (the IO is necessary to trigger the
problem, but the _reason_ for the problem lies elsewhere).
And it has everything to do with the fact that the way Linux semaphores
are implemented, a non-blocking process has a HUGE advantage over a
blocking one. Linux kernel semaphores are extremely unfair in that way.
What happens is that some process is getting a lot of VM faults and gets
its VM semaphore. No contention yet. It holds the semaphore over the
IO, and now another process does a "ps".
The "ps" process goes to sleep on the semaphore. So far so good.
The original process releases the semaphore, which increments the count,
and wakes up the process waiting for it. Note that it _wakes_ it, it does
not give the semaphore to it. Big difference.
The process that got woken up will run eventually. Probably not all that
immediately, because the process that woke it (and held the semaphore)
just slept on a page fault too, so it's not likely to immediately
relinquish the CPU.
The original running process comes back faulting again, finds the
semaphore still unlocked (the "ps" process is awake but has not gotten to
run yet), gets the semaphore, and falls asleep on the IO for the next
page.
The "ps" process actually gets to run now, but it's a bit late. The
semaphore is locked again.
Repeat until luck breaks the bad circle.
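The cycle described above can be replayed in a small user-space toy. This is only a sketch under stated assumptions: it is single-threaded, it hard-codes the uniprocessor ordering from the description (the faulting process always gets back to the semaphore before "ps" is scheduled), and every name in it (release_policy, steals_before_waiter_wins, and so on) is made up for illustration and is not kernel code. It contrasts "wake the waiter" with "hand the semaphore to the waiter".

```c
#include <assert.h>

/* Hypothetical names: what the releasing process does for the waiter. */
enum release_policy { WAKE_ONLY, HAND_OFF };
enum owner { NOBODY, FAULTER, WAITER };

/*
 * Count how often the faulting process re-takes the semaphore before the
 * waiter ("ps") owns it. The run is cut off after max_faults page faults,
 * because in this deterministic model luck never breaks the cycle.
 */
int steals_before_waiter_wins(enum release_policy policy, int max_faults)
{
    enum owner holder = FAULTER;  /* faulter holds the semaphore over IO */
    int steals = 0;

    assert(max_faults >= 0);

    while (steals < max_faults) {
        /* IO completes and the faulter releases the semaphore. */
        holder = (policy == HAND_OFF) ? WAITER : NOBODY;
        if (holder == WAITER)
            break;                /* ownership was passed on directly */

        /*
         * WAKE_ONLY: "ps" is merely runnable. The faulter still has the
         * CPU, faults again, finds the semaphore free and takes it.
         */
        holder = FAULTER;
        steals++;

        /* The faulter sleeps on IO; "ps" runs, finds it locked, sleeps. */
    }
    return steals;
}
```

Calling steals_before_waiter_wins(WAKE_ONLY, 5) returns 5 (the waiter never wins within the run), while steals_before_waiter_wins(HAND_OFF, 5) returns 0.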
(This scenario, btw, is much harder to trigger on SMP than on UP. And
it's completely separate from the issue of simple disk bandwidth issues
which can obviously cause no end of stalls on anything that needs the
disk, and which can also happen on SMP).
NOTE! If somebody wants to fix this, the fix should be reasonably simple
but needs to be quite exhaustively checked and double-checked. It's just
too easy to break the semaphores by mistake.
The way to make semaphores more fair is to NOT allow a new process to just
come in immediately and steal the semaphore in __down() if there are other
sleepers. This is most easily accomplished by something along the lines of
the following in __down() in arch/i386/kernel/semaphore.c
 	spin_lock_irq(&semaphore_lock);
 	sem->sleepers++;
+
+	/*
+	 * Are there other people waiting for this?
+	 * They get to go first.
+	 */
+	if (sem->sleepers > 1)
+		goto inside;
 	for (;;) {
 		int sleepers = sem->sleepers;
 		/*
 		 * Add "everybody else" into it. They aren't
 		 * playing, because we own the spinlock.
 		 */
 		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
 			sem->sleepers = 0;
 			break;
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
+inside:
 		spin_unlock_irq(&semaphore_lock);
 		schedule();
 		tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
 		spin_lock_irq(&semaphore_lock);
 	}
 	spin_unlock_irq(&semaphore_lock);
But note that the above is UNTESTED and also note that from a throughput
(as opposed to latency) standpoint being unfair tends to be nice.
Anybody want to try out something like the above? (And no, I'm not
applying it to my tree yet. It needs about a hundred pairs of eyes to
verify that there isn't some subtle "lost wakeup" race somewhere).
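As a starting point for that kind of walk-through, the counter arithmetic of the pseudo-patch can be replayed in user space. What follows is a toy model and not the kernel code: it is single-threaded, it ignores the spinlock, the wait queue and the wake_up() at the end of __down(), and all toy_* names are hypothetical. It assumes only what the quoted code shows, namely that atomic_add_negative() adds to the count and reports whether the result is negative. It cannot show the absence of a lost-wakeup race; it only replays the single interleaving from the "ps" scenario.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct semaphore: only the two counters matter here. */
struct toy_sem {
    int count;     /* 1 = free, 0 = held, negative = held with waiters */
    int sleepers;
};

/* Fast path of down(): decrement, succeed if the result is not negative. */
bool toy_down_fast(struct toy_sem *s)
{
    return --s->count >= 0;
}

/* One pass of the for (;;) body in __down(); true means "got it". */
bool toy_down_retry(struct toy_sem *s)
{
    int sleepers = s->sleepers;

    s->count += sleepers - 1;   /* models atomic_add_negative() */
    if (s->count >= 0) {
        s->sleepers = 0;
        return true;
    }
    s->sleepers = 1;            /* us - see -1 above */
    return false;               /* caller goes to sleep */
}

/* Entry into __down(), with or without the proposed fairness check. */
bool toy_down_slow(struct toy_sem *s, bool fair)
{
    s->sleepers++;
    if (fair && s->sleepers > 1)
        return false;           /* "goto inside": sleep without trying */
    return toy_down_retry(s);
}

/* up(): increment; a result <= 0 means a sleeper has to be woken. */
bool toy_up(struct toy_sem *s)
{
    return ++s->count <= 0;
}

/* Replay the "ps" scenario; true if ps ends up holding the semaphore. */
bool ps_wins_after_release(bool fair)
{
    struct toy_sem sem = { .count = 1, .sleepers = 0 };
    bool ok;

    ok = toy_down_fast(&sem);           /* faulter takes it uncontended */
    assert(ok);
    ok = toy_down_fast(&sem) || toy_down_slow(&sem, fair);
    assert(!ok);                        /* ps has to sleep */
    ok = toy_up(&sem);                  /* faulter releases, ps is woken */
    assert(ok);

    /* ps is runnable but has not run; the faulter faults again first. */
    if (!toy_down_fast(&sem))
        (void)toy_down_slow(&sem, fair);

    /* Now ps finally runs and retries inside the loop. */
    return toy_down_retry(&sem);
}
```

In this model ps_wins_after_release(false) is false (the faulter steals the semaphore back), and ps_wins_after_release(true) is true (the newcomer sees sleepers > 1 and queues up behind ps).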
Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-09 18:31 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 Linus Torvalds
@ 2000-11-10 7:34 ` Mike Galbraith
2000-11-10 10:47 ` Mike Galbraith
2000-11-10 17:07 ` Linus Torvalds
2000-11-10 21:42 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
1 sibling, 2 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-10 7:34 UTC
To: Linus Torvalds
Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox
On Thu, 9 Nov 2000, Linus Torvalds wrote:
>
>
> As to the real reason for stalls on /proc/<pid>/stat, I bet it has nothing
> to do with IO except indirectly (the IO is necessary to trigger the
> problem, but the _reason_ for the problem lies elsewhere).
>
> And it has everything to do with the fact that the way Linux semaphores
> are implemented, a non-blocking process has a HUGE advantage over a
> blocking one. Linux kernel semaphores are extremely unfair in that way.
>
> What happens is that some process is getting a lot of VM faults and gets
> its VM semaphore. No contention yet. It holds the semaphore over the
> IO, and now another process does a "ps".
>
> The "ps" process goes to sleep on the semaphore. So far so good.
>
> The original process releases the semaphore, which increments the count,
> and wakes up the process waiting for it. Note that it _wakes_ it, it does
> not give the semaphore to it. Big difference.
>
> The process that got woken up will run eventually. Probably not all that
> immediately, because the process that woke it (and held the semaphore)
> just slept on a page fault too, so it's not likely to immediately
> relinquish the CPU.
>
> The original running process comes back faulting again, finds the
> semaphore still unlocked (the "ps" process is awake but has not gotten to
> run yet), gets the semaphore, and falls asleep on the IO for the next
> page.
>
> The "ps" process actually gets to run now, but it's a bit late. The
> semaphore is locked again.
>
> Repeat until luck breaks the bad circle.
>
> (This scenario, btw, is much harder to trigger on SMP than on UP. And
> it's completely separate from the issue of simple disk bandwidth issues
> which can obviously cause no end of stalls on anything that needs the
> disk, and which can also happen on SMP).
Unfortunately, it didn't help in the scenario I'm running.
time make -j30 bzImage:
real 14m19.987s (within stock variance)
user 6m24.480s
sys 1m12.970s
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
31 2 1 12 1432 4440 12660 0 12 27 151 202 848 89 11 0
34 4 1 1908 2584 536 5376 248 1904 602 763 785 4094 63 32 5
13 19 1 64140 67728 604 33784 106500 84612 43625 21683 19080 52168 28 22 50
I understood the above well enough to be very interested in seeing what
happens with flush IO restricted.
-Mike
[try_to_free_pages()->swap_out()/shm_swap().. can fight over who gets
to shrink the best candidate's footprint?]
Thanks!
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-10 7:34 ` Mike Galbraith
@ 2000-11-10 10:47 ` Mike Galbraith
2000-11-10 17:07 ` Linus Torvalds
1 sibling, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-10 10:47 UTC
To: Linus Torvalds
Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox
> I understood the above well enough to be very interested in seeing what
> happens with flush IO restricted.
>
> -Mike
>
> [try_to_free_pages()->swap_out()/shm_swap().. can fight over who gets
> to shrink the best candidate's footprint?]
>
> Thanks!
The results:
pre2+semaphore
real 14m19.987s
user 6m24.480s
sys 1m12.970s
pre2+semaphore+throttle_IO
real 10m13.953s
user 6m19.980s
sys 0m28.960s
pre2+semaphore+throttle_IO extended to refill_inactive()
real 9m46.395s
user 6m23.510s
sys 0m29.420s
pre2+semaphore+throttle_IO + above + tiny little tweak to page_launder()
real 8m56.808s
user 6m23.420s
sys 0m29.430s
Unfortunately, when I try to get past this point I burn trees :-)
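To put the wall-clock numbers above on one scale, the "real" times can be converted to seconds and compared against the pre2+semaphore baseline. This is just arithmetic on the figures quoted in this message; the helper names to_seconds and percent_saved are made up for the example.

```c
#include <assert.h>

/* Convert a "real XmY.YYYs" reading from time(1) into seconds. */
double to_seconds(int minutes, double seconds)
{
    assert(minutes >= 0 && seconds >= 0.0);
    return minutes * 60.0 + seconds;
}

/* Percentage of wall-clock time saved relative to the baseline. */
double percent_saved(double baseline_s, double patched_s)
{
    assert(baseline_s > 0.0);
    return 100.0 * (baseline_s - patched_s) / baseline_s;
}
```

For example, percent_saved(to_seconds(14, 19.987), to_seconds(8, 56.808)) compares the baseline with the last run (859.987 s against 536.808 s) and comes out at roughly 37.6 percent.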
-Mike
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-10 7:34 ` Mike Galbraith
2000-11-10 10:47 ` Mike Galbraith
@ 2000-11-10 17:07 ` Linus Torvalds
1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-10 17:07 UTC
To: linux-kernel
In article <Pine.Linu.4.10.10011100732250.601-100000@mikeg.weiden.de>,
Mike Galbraith <mikeg@wen-online.de> wrote:
>>
>> (This scenario, btw, is much harder to trigger on SMP than on UP. And
>> it's completely separate from the issue of simple disk bandwidth issues
>> which can obviously cause no end of stalls on anything that needs the
>> disk, and which can also happen on SMP).
>
>Unfortunately, it didn't help in the scenario I'm running.
>
>time make -j30 bzImage:
>
>real 14m19.987s (within stock variance)
>user 6m24.480s
>sys 1m12.970s
Note that the above kind of "throughput performance" should not have been
affected, and was not what I was worried about.
>procs memory swap io system cpu
> r b w swpd free buff cache si so bi bo in cs us sy id
>31 2 1 12 1432 4440 12660 0 12 27 151 202 848 89 11 0
>34 4 1 1908 2584 536 5376 248 1904 602 763 785 4094 63 32 5
>13 19 1 64140 67728 604 33784 106500 84612 43625 21683 19080 52168 28 22 50
Looks like there was a big delay in vmstat there - that could easily be
due to simple disk throughput issues..
Does it feel any different under the original load that got the original
complaint? The patch may have just been buggy and ineffective, for all I
know.
Linus
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10
2000-11-09 18:31 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 Linus Torvalds
2000-11-10 7:34 ` Mike Galbraith
@ 2000-11-10 21:42 ` David Mansfield
2000-11-11 6:20 ` Linus Torvalds
1 sibling, 1 reply; 16+ messages in thread
From: David Mansfield @ 2000-11-10 21:42 UTC
To: Linus Torvalds
Cc: Mike Galbraith, Jens Axboe, MOLNAR Ingo, Rik van Riel,
Kernel Mailing List, Alan Cox
Linus Torvalds wrote:
...
>
> And it has everything to do with the fact that the way Linux semaphores
> are implemented, a non-blocking process has a HUGE advantage over a
> blocking one. Linux kernel semaphores are extremely unfair in that way.
>
...
> The original running process comes back faulting again, finds the
> semaphore still unlocked (the "ps" process is awake but has not gotten to
> run yet), gets the semaphore, and falls asleep on the IO for the next
> page.
>
> The "ps" process actually gets to run now, but it's a bit late. The
> semaphore is locked again.
>
> Repeat until luck breaks the bad circle.
>
But doesn't __down have a fast path coded in assembly? In other words,
it only hits your patched code if there is already contention, which
there isn't in this case, and therefore the bug...?
David Mansfield
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10
2000-11-10 21:42 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
@ 2000-11-11 6:20 ` Linus Torvalds
0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-11 6:20 UTC
To: linux-kernel
In article <3A0C6BD6.A8F73950@dm.ultramaster.com>,
David Mansfield <lkml@dm.ultramaster.com> wrote:
>Linus Torvalds wrote:
>...
>>
>> And it has everything to do with the fact that the way Linux semaphores
>> are implemented, a non-blocking process has a HUGE advantage over a
>> blocking one. Linux kernel semaphores are extremely unfair in that way.
>>
>...
>> The original running process comes back faulting again, finds the
>> semaphore still unlocked (the "ps" process is awake but has not gotten to
>> run yet), gets the semaphore, and falls asleep on the IO for the next
>> page.
>>
>> The "ps" process actually gets to run now, but it's a bit late. The
>> semaphore is locked again.
>>
>> Repeat until luck breaks the bad circle.
>>
>
>But doesn't __down have a fast path coded in assembly? In other words,
>it only hits your patched code if there is already contention, which
>there isn't in this case, and therefore the bug...?
The __down() case should be hit if there's a waiter, even if that waiter
has not yet been able to pick up the lock (the waiter _will_ have
decremented the count to negative in order to trigger the proper logic
at release time).
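The point about the count can be checked with a few lines of user-space C. This is a sketch using C11 atomics rather than the i386 assembly fast path, and the function names are hypothetical; it models only the decrement in down() and the increment in up(), not the sleepers bookkeeping done in __down().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Fast path of down(): true means acquired, false means "take __down()". */
bool toy_down_fast_path(atomic_int *count)
{
    /* atomic_fetch_sub() returns the old value; the new one is old - 1. */
    return atomic_fetch_sub(count, 1) - 1 >= 0;
}

/* Fast path of up(): true means a sleeper has to be woken. */
bool toy_up_needs_wakeup(atomic_int *count)
{
    return atomic_fetch_add(count, 1) + 1 <= 0;
}

/* Does the faulter's second down() reach the slow path? */
bool second_down_hits_slow_path(void)
{
    atomic_int count = 1;               /* semaphore starts out free */
    bool ok;

    ok = toy_down_fast_path(&count);    /* faulter: 1 -> 0, acquired */
    assert(ok);
    ok = toy_down_fast_path(&count);    /* ps: 0 -> -1, must sleep */
    assert(!ok);
    ok = toy_up_needs_wakeup(&count);   /* faulter: -1 -> 0, wake ps */
    assert(ok);

    /* ps has not run yet, so its decrement is still in the count. */
    return !toy_down_fast_path(&count); /* faulter again: 0 -> -1 */
}
```

Because the sleeping waiter's decrement is still accounted for, the release only brings the count back to 0, and the next down() sees a negative result and falls through to the contended path, where the fairness check would live.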
But as I mentioned, the pseudo-patch was certainly untested, so
somebody should probably walk through the cases to check that I didn't
miss something.
Linus
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-03 19:38 ` Jens Axboe
@ 2000-11-04 5:43 ` Mike Galbraith
0 siblings, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-04 5:43 UTC
To: Jens Axboe; +Cc: linux-kernel, Rik van Riel, Tigran Aivazian
On Fri, 3 Nov 2000, Jens Axboe wrote:
> On Fri, Nov 03 2000, Mike Galbraith wrote:
> > > I very much agree. Kflushd is still hungry for free write
> > > bandwidth here.
> >
> > In the LKML tradition of code talks and silly opinions walk...
> >
> > Attached is a diagnostic patch which gets kflushd under control,
> > and takes make -j30 bzImage build times down from 12 minutes to
> > 9 here. I have no more massive context switching on write, and
> > copies seem to go a lot quicker to boot. (that may be because
> > some of my failures were really _really_ horrible)
> >
> > Comments are very welcome. I haven't had problems with this yet,
> > but it's early so... This patch isn't supposed to be pretty either
> > (hw techs don't do pretty;) it's only supposed to say 'Houston...'
> > so be sure to grab a barfbag before you take a look.
>
> Super, looks pretty good from here. I'll give it a go when I get back.
> In addition, here's a small patch that disables the read stealing
> of requests from the write list -- does that improve behaviour
> when we are busy flushing?
Yes. I've done this a bit differently here, and have had good
results. I only disable stealing when I need flush throughput.
Now that the box isn't biting off more than it can chew quite
as often, I'll try this again. I'm pretty darn sure that I can
get more throughput, but :> I've learned that getting too much
can do really OOGLY things. (turns box into single user single
tasking streaming IO monster from hell)
-Mike
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-03 15:45 ` Mike Galbraith
@ 2000-11-03 19:38 ` Jens Axboe
2000-11-04 5:43 ` Mike Galbraith
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2000-11-03 19:38 UTC
To: Mike Galbraith; +Cc: linux-kernel, Rik van Riel, Tigran Aivazian
[-- Attachment #1: Type: text/plain, Size: 1053 bytes --]
On Fri, Nov 03 2000, Mike Galbraith wrote:
> > I very much agree. Kflushd is still hungry for free write
> > bandwidth here.
>
> In the LKML tradition of code talks and silly opinions walk...
>
> Attached is a diagnostic patch which gets kflushd under control,
> and takes make -j30 bzImage build times down from 12 minutes to
> 9 here. I have no more massive context switching on write, and
> copies seem to go a lot quicker to boot. (that may be because
> some of my failures were really _really_ horrible)
>
> Comments are very welcome. I haven't had problems with this yet,
> but it's early so... This patch isn't supposed to be pretty either
> (hw techs don't do pretty;) it's only supposed to say 'Houston...'
> so be sure to grab a barfbag before you take a look.
Super, looks pretty good from here. I'll give it a go when I get back.
In addition, here's a small patch that disables the read stealing
of requests from the write list -- does that improve behaviour
when we are busy flushing?
--
* Jens Axboe <axboe@suse.de>
* SuSE Labs
[-- Attachment #2: read_steal.diff --]
[-- Type: text/plain, Size: 998 bytes --]
--- drivers/block/ll_rw_blk.c~ Fri Nov 3 03:22:25 2000
+++ drivers/block/ll_rw_blk.c Fri Nov 3 03:23:24 2000
@@ -455,35 +455,17 @@
struct list_head *list = &q->request_freelist[rw];
struct request *rq;
- /*
- * Reads get preferential treatment and are allowed to steal
- * from the write free list if necessary.
- */
if (!list_empty(list)) {
rq = blkdev_free_rq(list);
- goto got_rq;
- }
-
- /*
- * if the WRITE list is non-empty, we know that rw is READ
- * and that the READ list is empty. allow reads to 'steal'
- * from the WRITE list.
- */
- if (!list_empty(&q->request_freelist[WRITE])) {
- list = &q->request_freelist[WRITE];
- rq = blkdev_free_rq(list);
- goto got_rq;
+ list_del(&rq->table);
+ rq->free_list = list;
+ rq->rq_status = RQ_ACTIVE;
+ rq->special = NULL;
+ rq->q = q;
+ return rq;
}
return NULL;
-
-got_rq:
- list_del(&rq->table);
- rq->free_list = list;
- rq->rq_status = RQ_ACTIVE;
- rq->special = NULL;
- rq->q = q;
- return rq;
}
/*
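The behavioural change in read_steal.diff can be illustrated with a toy allocator. This is a sketch, not the block layer code: the free lists are reduced to two counters, and toy_queue, toy_get_request and the reads_may_steal switch are names invented for the example. With stealing enabled it follows the removed branch (a READ may fall back to the WRITE free list); with stealing disabled it follows the patched code (each direction only uses its own list).

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_READ = 0, TOY_WRITE = 1 };

/* Stand-in for q->request_freelist[]: just the number of free slots. */
struct toy_queue {
    int free_slots[2];
};

/*
 * Take one free request for direction rw.
 * Returns the list it came from (TOY_READ or TOY_WRITE), or -1 if none.
 */
int toy_get_request(struct toy_queue *q, int rw, bool reads_may_steal)
{
    assert(rw == TOY_READ || rw == TOY_WRITE);

    if (q->free_slots[rw] > 0) {
        q->free_slots[rw]--;
        return rw;
    }

    /* Old behaviour: a READ with an empty list may raid the WRITE list. */
    if (reads_may_steal && rw == TOY_READ && q->free_slots[TOY_WRITE] > 0) {
        q->free_slots[TOY_WRITE]--;
        return TOY_WRITE;
    }

    return -1;  /* caller has to wait for a request to be freed */
}
```

With an empty READ list and two free WRITE slots, a read request succeeds only when reads_may_steal is true, and it does so by shrinking the pool that the flushing writers depend on.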
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-03 5:56 ` Mike Galbraith
@ 2000-11-03 15:45 ` Mike Galbraith
2000-11-03 19:38 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-03 15:45 UTC
To: linux-kernel; +Cc: Jens Axboe, Rik van Riel, Tigran Aivazian
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1680 bytes --]
On Fri, 3 Nov 2000, Mike Galbraith wrote:
> On Thu, 2 Nov 2000, Jens Axboe wrote:
>
> > On Thu, Nov 02 2000, Val Henson wrote:
> > > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > > > Axboe for blk-7 to alleviate this issue) and you have a
> > > > > scenario where processes using /proc/<pid>/stat have the
> > > > > possibility to block on multiple processes that are in the
> > > > > process of handling a page fault (but are being starved)
> > > >
> > > > I'm experimenting with blk.[67] in test10 right now. The stalls
> > > > are not helped at all. It doesn't seem to become request bound
> > > > (haven't instrumented that yet to be sure) but the stalls persist.
> > > >
> > > > -Mike
> > >
> > > This is not an elevator starvation problem.
> >
> > True, but the blk-xx patches help work-around (what I believe) is
> > bad flushing behaviour by the vm.
>
> I very much agree. Kflushd is still hungry for free write
> bandwidth here.
In the LKML tradition of code talks and silly opinions walk...
Attached is a diagnostic patch which gets kflushd under control,
and takes make -j30 bzImage build times down from 12 minutes to
9 here. I have no more massive context switching on write, and
copies seem to go a lot quicker to boot. (that may be because
some of my failures were really _really_ horrible)
Comments are very welcome. I haven't had problems with this yet,
but it's early so... This patch isn't supposed to be pretty either
(hw techs don't do pretty;) it's only supposed to say 'Houston...'
so be sure to grab a barfbag before you take a look.
-Mike
P.S. almost forgot. vmstat freezes were shortened too :-)
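For readers who want the gist before opening the attachment: the central idea is that the flusher stops submitting as soon as the WRITE request free list is exhausted, and raises a flag that page_launder() then honours. A rough user-space sketch of that idea follows. It is an assumption-laden toy, not the patch: toy_disk, toy_flush_dirty and toy_can_start_io are invented names, the free list is a plain counter, and locking and the three flush modes are left out.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_disk {
    int free_write_requests;  /* stand-in for q->request_freelist[WRITE] */
    bool throttle_io;         /* stand-in for the patch's throttle_IO flag */
};

/* Submit up to `limit` dirty buffers; stop early once the queue is full. */
int toy_flush_dirty(struct toy_disk *d, int dirty, int limit)
{
    int flushed = 0;

    assert(dirty >= 0 && limit >= 0);

    while (flushed < dirty && flushed < limit) {
        if (d->free_write_requests == 0) {
            d->throttle_io = true;   /* IO is saturated: back off */
            break;
        }
        d->throttle_io = false;
        d->free_write_requests--;    /* one request slot consumed */
        flushed++;
    }
    return flushed;
}

/* Mirrors the patched page_launder() test: IO only if not throttled. */
bool toy_can_start_io(const struct toy_disk *d, bool gfp_allows_io)
{
    return gfp_allows_io && !d->throttle_io;
}
```

With three free write slots, toy_flush_dirty(&d, 10, 8) submits three buffers, sets throttle_io, and toy_can_start_io() then refuses further IO until a later flush pass finds free slots again.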
[-- Attachment #2: Type: TEXT/PLAIN, Size: 13303 bytes --]
diff -urN linux-2.4.0-test10.virgin/fs/buffer.c linux-2.4.0-test10.mike/fs/buffer.c
--- linux-2.4.0-test10.virgin/fs/buffer.c Wed Nov 1 06:42:40 2000
+++ linux-2.4.0-test10.mike/fs/buffer.c Fri Nov 3 14:59:10 2000
@@ -38,6 +38,7 @@
#include <linux/swapctl.h>
#include <linux/smp_lock.h>
#include <linux/vmalloc.h>
+#include <linux/blk.h>
#include <linux/blkdev.h>
#include <linux/sysrq.h>
#include <linux/file.h>
@@ -705,13 +706,12 @@
/*
* We used to try various strange things. Let's not.
*/
+static int flush_dirty_buffers(int mode);
+
static void refill_freelist(int size)
{
- if (!grow_buffers(size)) {
- wakeup_bdflush(1); /* Sets task->state to TASK_RUNNING */
- current->policy |= SCHED_YIELD;
- schedule();
- }
+ if (!grow_buffers(size))
+ flush_dirty_buffers(2);
}
void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -859,7 +859,9 @@
/* -1 -> no need to flush
0 -> async flush
- 1 -> sync flush (wait for I/O completation) */
+ 1 -> sync flush (wait for I/O completation)
+ throttle_IO will be set by kflushd to indicate IO saturation. */
+int throttle_IO;
int balance_dirty_state(kdev_t dev)
{
unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
@@ -2469,6 +2471,7 @@
* response to dirty buffers. Once this process is activated, we write back
* a limited number of buffers to the disks and then go back to sleep again.
*/
+static DECLARE_WAIT_QUEUE_HEAD(bdflush_wait);
static DECLARE_WAIT_QUEUE_HEAD(bdflush_done);
struct task_struct *bdflush_tsk = 0;
@@ -2476,11 +2479,12 @@
{
DECLARE_WAITQUEUE(wait, current);
- if (current == bdflush_tsk)
+ if (current->flags & PF_MEMALLOC)
return;
if (!block) {
- wake_up_process(bdflush_tsk);
+ if (waitqueue_active(&bdflush_wait))
+ wake_up(&bdflush_wait);
return;
}
@@ -2491,7 +2495,9 @@
__set_current_state(TASK_UNINTERRUPTIBLE);
add_wait_queue(&bdflush_done, &wait);
- wake_up_process(bdflush_tsk);
+ if (waitqueue_active(&bdflush_wait))
+ wake_up(&bdflush_wait);
+ current->policy |= SCHED_YIELD;
schedule();
remove_wait_queue(&bdflush_done, &wait);
@@ -2503,11 +2509,19 @@
NOTENOTENOTENOTE: we _only_ need to browse the DIRTY lru list
as all dirty buffers lives _only_ in the DIRTY lru list.
As we never browse the LOCKED and CLEAN lru lists they are infact
- completly useless. */
-static int flush_dirty_buffers(int check_flushtime)
+ completly useless.
+ modes: 0 = check bdf_prm.b_un.ndirty [kflushd]
+ 1 = check flushtime [kupdate]
+ 2 = check bdf_prm.b_un.nrefill [refill_freelist()] */
+#define MODE_KFLUSHD 0
+#define MODE_KUPDATE 1
+#define MODE_REFILL 2
+static int flush_dirty_buffers(int mode)
{
struct buffer_head * bh, *next;
+ request_queue_t *q;
int flushed = 0, i;
+ unsigned long flags;
restart:
spin_lock(&lru_list_lock);
@@ -2524,31 +2538,52 @@
if (buffer_locked(bh))
continue;
- if (check_flushtime) {
+ if (mode == MODE_KUPDATE) {
/* The dirty lru list is chronologically ordered so
if the current bh is not yet timed out,
then also all the following bhs
will be too young. */
if (time_before(jiffies, bh->b_flushtime))
goto out_unlock;
+ } else if (mode == MODE_KFLUSHD) {
+ if (flushed >= bdf_prm.b_un.ndirty)
+ goto out_unlock;
} else {
- if (++flushed > bdf_prm.b_un.ndirty)
+ if (flushed >= bdf_prm.b_un.nrefill)
goto out_unlock;
}
- /* OK, now we are committed to write it out. */
+ /* We are almost committed to write it out. */
atomic_inc(&bh->b_count);
+ q = blk_get_queue(bh->b_rdev);
+ spin_lock_irqsave(&q->request_lock, flags);
spin_unlock(&lru_list_lock);
+ if (list_empty(&q->request_freelist[WRITE])) {
+ throttle_IO = 1;
+ atomic_dec(&bh->b_count);
+ spin_unlock_irqrestore(&q->request_lock, flags);
+ run_task_queue(&tq_disk);
+ break;
+ } else
+ throttle_IO = 0;
+ spin_unlock_irqrestore(&q->request_lock, flags);
+ /* OK, now we are really committed. */
+
ll_rw_block(WRITE, 1, &bh);
atomic_dec(&bh->b_count);
+ flushed++;
- if (current->need_resched)
- schedule();
+ if (current->need_resched) {
+ if (!(mode == MODE_KFLUSHD))
+ schedule();
+ else
+ goto out;
+ }
goto restart;
}
- out_unlock:
+out_unlock:
spin_unlock(&lru_list_lock);
-
+out:
return flushed;
}
@@ -2640,7 +2675,7 @@
int bdflush(void *sem)
{
struct task_struct *tsk = current;
- int flushed;
+ int flushed, dirty, pdirty=0;
/*
* We have a bare-bones task_struct, and really should fill
* in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2649,6 +2684,7 @@
tsk->session = 1;
tsk->pgrp = 1;
+ tsk->flags |= PF_MEMALLOC;
strcpy(tsk->comm, "kflushd");
bdflush_tsk = tsk;
@@ -2664,32 +2700,39 @@
for (;;) {
CHECK_EMERGENCY_SYNC
+ if (balance_dirty_state(NODEV) < 0)
+ goto sleep;
+
flushed = flush_dirty_buffers(0);
if (free_shortage())
flushed += page_launder(GFP_BUFFER, 0);
- /* If wakeup_bdflush will wakeup us
- after our bdflush_done wakeup, then
- we must make sure to not sleep
- in schedule_timeout otherwise
- wakeup_bdflush may wait for our
- bdflush_done wakeup that would never arrive
- (as we would be sleeping) and so it would
- deadlock in SMP. */
- __set_current_state(TASK_INTERRUPTIBLE);
- wake_up_all(&bdflush_done);
+#if 0
+ /*
+ * Did someone create lots of dirty buffers while we slept?
+ */
+ dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
+ if (dirty - pdirty > flushed && throttle_IO) {
+ printk(KERN_WARNING
+ "kflushd: pdirty(%d) dirtied (%d) flushed (%d)\n",
+ pdirty, pdirty - dirty, flushed);
+ }
+ pdirty = dirty;
+#endif
+
+ run_task_queue(&tq_disk);
+ if (flushed)
+ wake_up_all(&bdflush_done);
/*
* If there are still a lot of dirty buffers around,
- * skip the sleep and flush some more. Otherwise, we
+ * we sleep and then flush some more. Otherwise, we
* go to sleep waiting a wakeup.
*/
- if (!flushed || balance_dirty_state(NODEV) < 0) {
- run_task_queue(&tq_disk);
- schedule();
+ if (balance_dirty_state(NODEV) < 0) {
+sleep:
+ wake_up_all(&bdflush_done);
+ interruptible_sleep_on_timeout(&bdflush_wait, HZ/10);
}
- /* Remember to mark us as running otherwise
- the next schedule will block. */
- __set_current_state(TASK_RUNNING);
}
}
@@ -2706,6 +2749,7 @@
tsk->session = 1;
tsk->pgrp = 1;
+ tsk->flags |= PF_MEMALLOC;
strcpy(tsk->comm, "kupdate");
/* sigstop and sigcont will stop and wakeup kupdate */
diff -urN linux-2.4.0-test10.virgin/mm/page_alloc.c linux-2.4.0-test10.mike/mm/page_alloc.c
--- linux-2.4.0-test10.virgin/mm/page_alloc.c Wed Nov 1 06:42:45 2000
+++ linux-2.4.0-test10.mike/mm/page_alloc.c Fri Nov 3 15:22:55 2000
@@ -285,7 +285,7 @@
struct page * __alloc_pages(zonelist_t *zonelist, unsigned long order)
{
zone_t **zone;
- int direct_reclaim = 0;
+ int direct_reclaim = 0, strikes = 0;
unsigned int gfp_mask = zonelist->gfp_mask;
struct page * page;
@@ -310,22 +310,30 @@
!(current->flags & PF_MEMALLOC))
direct_reclaim = 1;
- /*
- * If we are about to get low on free pages and we also have
- * an inactive page shortage, wake up kswapd.
- */
- if (inactive_shortage() > inactive_target / 2 && free_shortage())
- wakeup_kswapd(0);
+#define STRIKE_ONE \
+ (strikes++ && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_TWO \
+ (strikes++ < 2 && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_THREE \
+ (strikes++ < 3 && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_THREE_NOIO \
+ (strikes++ < 3 && (gfp_mask & GFP_USER) == __GFP_WAIT)
/*
* If we are about to get low on free pages and cleaning
* the inactive_dirty pages would fix the situation,
* wake up bdflush.
*/
- else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
+try_again:
+ if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
&& nr_inactive_dirty_pages >= freepages.high)
- wakeup_bdflush(0);
+ wakeup_bdflush(STRIKE_ONE);
+ /*
+ * If we are about to get low on free pages and we also have
+ * an inactive page shortage, wake up kswapd.
+ */
+ else if (inactive_shortage() || free_shortage())
+ wakeup_kswapd(STRIKE_ONE);
-try_again:
/*
* First, see if we have any zones with lots of free memory.
*
@@ -374,35 +382,16 @@
if (page)
return page;
- /*
- * OK, none of the zones on our zonelist has lots
- * of pages free.
- *
- * We wake up kswapd, in the hope that kswapd will
- * resolve this situation before memory gets tight.
- *
- * We also yield the CPU, because that:
- * - gives kswapd a chance to do something
- * - slows down allocations, in particular the
- * allocations from the fast allocator that's
- * causing the problems ...
- * - ... which minimises the impact the "bad guys"
- * have on the rest of the system
- * - if we don't have __GFP_IO set, kswapd may be
- * able to free some memory we can't free ourselves
- */
- wakeup_kswapd(0);
- if (gfp_mask & __GFP_WAIT) {
- __set_current_state(TASK_RUNNING);
+ if (STRIKE_TWO) {
current->policy |= SCHED_YIELD;
- schedule();
+ goto try_again;
}
/*
- * After waking up kswapd, we try to allocate a page
+ * After waking up daemons, we try to allocate a page
* from any zone which isn't critical yet.
*
- * Kswapd should, in most situations, bring the situation
+ * Kswapd/kflushd should, in most situations, bring the situation
* back to normal in no time.
*/
page = __alloc_pages_limit(zonelist, order, PAGES_MIN, direct_reclaim);
@@ -426,7 +415,7 @@
* in the hope of creating a large, physically contiguous
* piece of free memory.
*/
- if (order > 0 && (gfp_mask & __GFP_WAIT)) {
+ if (gfp_mask & __GFP_WAIT) {
zone = zonelist->zones;
/* First, clean some dirty pages. */
page_launder(gfp_mask, 1);
@@ -463,26 +452,23 @@
* simply cannot free a large enough contiguous area
* of memory *ever*.
*/
- if ((gfp_mask & (__GFP_WAIT|__GFP_IO)) == (__GFP_WAIT|__GFP_IO)) {
+ if (~gfp_mask & __GFP_HIGH && STRIKE_THREE) {
+ memory_pressure++;
wakeup_kswapd(1);
+ goto try_again;
+ } else if (~gfp_mask & __GFP_HIGH && STRIKE_THREE_NOIO) {
+ /*
+ * If __GFP_IO isn't set, we can't wait on kswapd because
+ * daemons just might need some IO locks /we/ are holding ...
+ *
+ * SUBTLE: The scheduling point above makes sure that
+ * kswapd does get the chance to free memory we can't
+ * free ourselves...
+ */
memory_pressure++;
- if (!order)
- goto try_again;
- /*
- * If __GFP_IO isn't set, we can't wait on kswapd because
- * kswapd just might need some IO locks /we/ are holding ...
- *
- * SUBTLE: The scheduling point above makes sure that
- * kswapd does get the chance to free memory we can't
- * free ourselves...
- */
- } else if (gfp_mask & __GFP_WAIT) {
try_to_free_pages(gfp_mask);
- memory_pressure++;
- if (!order)
- goto try_again;
+ goto try_again;
}
-
}
/*
diff -urN linux-2.4.0-test10.virgin/mm/vmscan.c linux-2.4.0-test10.mike/mm/vmscan.c
--- linux-2.4.0-test10.virgin/mm/vmscan.c Wed Nov 1 06:42:45 2000
+++ linux-2.4.0-test10.mike/mm/vmscan.c Fri Nov 3 15:20:32 2000
@@ -562,6 +562,7 @@
* go out to Matthew Dillon.
*/
#define MAX_LAUNDER (4 * (1 << page_cluster))
+extern int throttle_IO;
int page_launder(int gfp_mask, int sync)
{
int launder_loop, maxscan, cleaned_pages, maxlaunder;
@@ -573,7 +574,7 @@
* We can only grab the IO locks (eg. for flushing dirty
* buffers to disk) if __GFP_IO is set.
*/
- can_get_io_locks = gfp_mask & __GFP_IO;
+ can_get_io_locks = gfp_mask & __GFP_IO && !throttle_IO;
launder_loop = 0;
maxlaunder = 0;
@@ -1050,13 +1051,23 @@
for (;;) {
static int recalc = 0;
+ /* Once a second, recalculate some VM stats. */
+ if (time_after(jiffies, recalc + HZ)) {
+ recalc = jiffies;
+ recalculate_vm_stats();
+ }
+
/* If needed, try to free some memory. */
if (inactive_shortage() || free_shortage()) {
int wait = 0;
/* Do we need to do some synchronous flushing? */
if (waitqueue_active(&kswapd_done))
wait = 1;
+#if 0 /* Undo this and watch allocations fail under heavy stress */
do_try_to_free_pages(GFP_KSWAPD, wait);
+#else
+ do_try_to_free_pages(GFP_KSWAPD, 0);
+#endif
}
/*
@@ -1067,12 +1078,6 @@
*/
refill_inactive_scan(6, 0);
- /* Once a second, recalculate some VM stats. */
- if (time_after(jiffies, recalc + HZ)) {
- recalc = jiffies;
- recalculate_vm_stats();
- }
-
/*
* Wake up everybody waiting for free memory
* and unplug the disk queue.
@@ -1112,7 +1117,7 @@
{
DECLARE_WAITQUEUE(wait, current);
- if (current == kswapd_task)
+ if (current->flags & PF_MEMALLOC)
return;
if (!block) {
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-03 1:37 ` Jens Axboe
@ 2000-11-03 5:56 ` Mike Galbraith
2000-11-03 15:45 ` Mike Galbraith
0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-03 5:56 UTC (permalink / raw)
To: Jens Axboe; +Cc: Val Henson, Rik van Riel, linux-kernel
On Thu, 2 Nov 2000, Jens Axboe wrote:
> On Thu, Nov 02 2000, Val Henson wrote:
> > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > > Axboe for blk-7 to alleviate this issue) and you have a
> > > > scenario where processes using /proc/<pid>/stat have the
> > > > possibility to block on multiple processes that are in the
> > > > process of handling a page fault (but are being starved)
> > >
> > > I'm experimenting with blk.[67] in test10 right now. The stalls
> > > are not helped at all. It doesn't seem to become request bound
> > > (haven't instrumented that yet to be sure) but the stalls persist.
> > >
> > > -Mike
> >
> > This is not an elevator starvation problem.
>
> True, but the blk-xx patches help work around what I believe is
> bad flushing behaviour by the vm.
I very much agree. Kflushd is still hungry for free write
bandwidth here.
Of course it's _going_ to have to wait if you're doing max IO
throughput, but when you're flushing, you need to let kflushd
have the bandwidth it needs to do its job. I don't think it's
getting what it needs, and am trying two things.
1. Revoke read's ability to steal requests while we're in a
heavy flushing situation. Flushing must proceed, and it must
go at full speed. (Actually, reversing the favoritism when
you need flush bandwidth makes sense to me, and does help if
limited.. if not limited, it hurts like hell)
2. Use the information that we are starving (or going full bore)
to tell the VM to keep its fingers off dirty buffers. If we're
flushing at disk speed, page_launder() can't do anything useful
with dirty buffers, it can only do harm IMHO.
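
Point 2 can be sketched in userspace C. This is only an illustration of the gating idea: the flag values are placeholders, not the kernel's definitions, and `throttle_io` is a plain parameter standing in for the `throttle_IO` variable that the patch in this thread consults in page_launder(). How that variable gets set is not shown here, so the sketch does not try to model it.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration; not the kernel's definitions. */
#define SKETCH_GFP_WAIT 0x01u
#define SKETCH_GFP_IO   0x04u

/*
 * May the page launderer take IO locks and write out dirty buffers?
 * Only when the caller permits IO and the flusher is not already
 * driving the disk flat out (throttle_io is a hypothetical input).
 */
bool can_get_io_locks(unsigned int gfp_mask, bool throttle_io)
{
    return (gfp_mask & SKETCH_GFP_IO) != 0 && !throttle_io;
}
```

With the throttle set, an allocation that would normally be allowed to start writeback is treated like one without __GFP_IO, so it leaves dirty buffers to the flushing daemon.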
-Mike
P.S. Before I revert to Luke Codecrawler mode, I have a wild
problem theory I'd appreciate comments on.. preferably the kind
where I become extremely busy thinking about their content ;-)
If one __alloc_pages() is waiting for kswapd, kswapd tries to do
synchronous flushing.. if the write queue is nearly (or) exhausted
and page_launder() couldn't clean any buffers on its first pass,
it blasts the queue some more and stalls. If kswapd, kflushd and
kupdate are all waiting for a request, and then say a GFP_BUFFER
allocation comes along.. (we're low on memory) we do SCHED_YIELD
schedule(). If we're holding IO locks, nobody can do IO. OK, if
there's nobody else running, we come right back and either finish
the allocation or fail. But, if you have other allocations trying
to flush buffers (GFP_KERNEL eg), they are not only in danger of
stacking up due to a request shortage, but they can't get whatever
IO locks the GFP_BUFFER allocation is holding anyway so are doomed
until we do schedule back to the GFP_BUFFER allocating task.
Isn't scheduling while holding IO locks the wrong thing to do? It's
protected from neither GFP_BUFFER nor PF_MEMALLOC.
I must be missing something.. but what?
<ears at maximum gain>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-02 21:59 ` Val Henson
@ 2000-11-03 1:37 ` Jens Axboe
2000-11-03 5:56 ` Mike Galbraith
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2000-11-03 1:37 UTC (permalink / raw)
To: Val Henson; +Cc: Mike Galbraith, Rik van Riel, linux-kernel
On Thu, Nov 02 2000, Val Henson wrote:
> > > 3) combine this with the elevator starvation stuff (ask Jens
> > > Axboe for blk-7 to alleviate this issue) and you have a
> > > scenario where processes using /proc/<pid>/stat have the
> > > possibility to block on multiple processes that are in the
> > > process of handling a page fault (but are being starved)
> >
> > I'm experimenting with blk.[67] in test10 right now. The stalls
> > are not helped at all. It doesn't seem to become request bound
> > (haven't instrumented that yet to be sure) but the stalls persist.
> >
> > -Mike
>
> This is not an elevator starvation problem.
True, but the blk-xx patches help work around what I believe is
bad flushing behaviour by the vm.
> I also experienced these stalls with my IDE-only system. Unless I'm
> badly mistaken, the elevator is only used on SCSI disks, therefore
> elevator starvation cannot be blamed for this problem. These stalls
> are particularly annoying since I want to find the pid of the process
> hogging memory in order to kill it, but the read from /proc stalls for
> 45 seconds or more.
You are badly mistaken.
--
* Jens Axboe <axboe@suse.de>
* SuSE Labs
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-02 7:19 ` Mike Galbraith
@ 2000-11-02 21:59 ` Val Henson
2000-11-03 1:37 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Val Henson @ 2000-11-02 21:59 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Rik van Riel, linux-kernel
On Thu, Nov 02, 2000 at 08:19:06AM +0100, Mike Galbraith wrote:
> On Wed, 1 Nov 2000, Rik van Riel wrote:
>
> > I have one possible reason for this ....
> >
> > 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> > down(&mm->mmap_sem);
> >
> > 2) but, in order to do that, it has to wait until the process
> > it is trying to stat has /finished/ its page fault, and is
> > not into its next one ...
> >
> > 3) combine this with the elevator starvation stuff (ask Jens
> > Axboe for blk-7 to alleviate this issue) and you have a
> > scenario where processes using /proc/<pid>/stat have the
> > possibility to block on multiple processes that are in the
> > process of handling a page fault (but are being starved)
>
> I'm experimenting with blk.[67] in test10 right now. The stalls
> are not helped at all. It doesn't seem to become request bound
> (haven't instrumented that yet to be sure) but the stalls persist.
>
> -Mike
This is not an elevator starvation problem.
I also experienced these stalls with my IDE-only system. Unless I'm
badly mistaken, the elevator is only used on SCSI disks, therefore
elevator starvation cannot be blamed for this problem. These stalls
are particularly annoying since I want to find the pid of the process
hogging memory in order to kill it, but the read from /proc stalls for
45 seconds or more.
-VAL
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-01 18:48 ` Rik van Riel
2000-11-02 7:19 ` Mike Galbraith
@ 2000-11-02 8:40 ` Christoph Rohland
1 sibling, 0 replies; 16+ messages in thread
From: Christoph Rohland @ 2000-11-02 8:40 UTC (permalink / raw)
To: Rik van Riel; +Cc: David Mansfield, lkml
Hi Rik,
I can probably give some more datapoints. Here is the console output
of my test machine (there is a 'vmstat 5' running in background):
[root@ls3016 /root]# killall shmtst
[root@ls3016 /root]#
1 12 2 0 1607668 18932 2110496 0 0 67154 1115842 1050063 2029389 0 2 98
0 10 2 0 1607564 18932 2110496 0 0 0 300 317 426 0 0 100
0 10 2 0 1607408 18932 2110496 0 0 0 301 336 473 0 0 100
0 10 2 0 1607560 18932 2110508 0 0 0 307 318 430 0 0 100
0 10 2 0 1607556 18932 2110512 0 0 0 304 324 433 0 0 100
0 10 2 0 1607528 18932 2110512 0 0 0 272 308 410 0 1 99
0 10 2 0 1607440 18932 2110516 0 0 0 315 323 438 0 1 99
0 10 2 0 1607528 18932 2110516 0 0 0 323 316 424 0 0 100
0 10 2 0 1607556 18932 2110516 0 0 0 304 309 410 0 0 100
0 10 2 0 1607600 18932 2110528 0 0 0 298 314 418 0 0 100
0 10 2 0 1607384 18932 2110528 0 0 0 296 307 406 0 1 99
0 10 2 0 1607284 18932 2110528 0 0 0 304 315 421 0 0 100
0 10 2 0 1607668 18932 2110528 0 0 0 298 304 402 0 0 100
0 10 2 0 1607576 18932 2110528 0 0 0 285 307 405 0 0 100
0 10 2 0 1607656 18932 2110528 0 0 0 292 303 399 0 1 99
0 10 2 0 1607928 18932 2110528 0 0 0 313 310 408 0 0 100
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 10 2 0 1608440 18932 2110528 0 0 0 340 313 417 0 1 99
0 10 2 0 1608260 18932 2110528 0 0 0 298 318 426 0 0 100
0 10 2 0 1608208 18932 2110528 0 0 0 314 334 448 0 1 99
0 10 2 0 1608396 18932 2110528 0 0 0 323 316 421 0 1 99
0 10 2 0 1608204 18932 2110548 0 0 0 334 333 458 0 0 100
0 10 2 0 1607888 18932 2110580 0 0 0 336 329 448 0 1 99
0 10 2 0 1608040 18932 2110584 0 0 0 317 321 435 0 0 100
0 10 2 0 1608032 18932 2110588 0 0 0 241 318 425 0 0 100
0 10 2 0 1608028 18932 2110592 0 0 0 257 325 443 0 1 99
0 10 3 0 1608028 18932 2110592 0 0 0 258 323 435 0 0 99
0 10 2 0 1608032 18932 2110592 0 0 0 241 316 425 0 0 100
0 10 2 0 1608024 18932 2110592 0 0 0 261 337 460 0 0 100
0 10 2 0 1608016 18932 2110592 0 0 0 253 328 444 0 0 100
0 10 2 0 1608024 18932 2110592 0 0 0 252 320 435 0 0 100
0 10 2 0 1608012 18932 2110592 0 0 0 255 326 446 0 0 100
0 10 2 0 1608020 18932 2110592 0 0 0 255 326 444 0 1 99
0 10 2 0 1608012 18932 2110600 0 0 0 261 341 469 0 0 100
0 10 2 0 1607992 18932 2110608 0 0 0 261 344 479 0 0 100
0 10 2 0 1607992 18932 2110612 0 0 0 264 342 471 0 0 100
0 10 2 0 1607984 18932 2110612 0 0 0 266 334 462 0 0 100
0 10 2 0 1607980 18932 2110620 0 0 0 273 340 468 0 0 99
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 10 2 0 1607972 18932 2110624 0 0 0 266 345 474 0 1 99
0 10 2 0 1607940 18932 2110640 0 0 0 256 341 462 0 0 100
0 10 2 0 1607936 18932 2110644 0 0 0 262 339 462 0 1 99
0 10 2 0 1607940 18932 2110644 0 0 0 261 333 450 0 1 99
0 10 2 0 1607944 18932 2110644 0 0 0 253 335 454 0 0 100
0 10 2 0 1607944 18932 2110644 0 0 0 272 352 479 0 1 99
[root@ls3016 /root]# ps l
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
100 0 820 1 9 0 2200 1168 wait4 S ttyS0 0:00 login -- ro
100 0 862 820 14 0 1756 976 wait4 S ttyS0 0:00 -bash
000 0 878 862 9 0 1080 360 down D ttyS0 11:27 ./shmtst 10
000 0 879 862 9 0 1080 360 down D ttyS0 15:21 ./shmtst 15
040 0 880 878 9 0 1092 416 wait_o D ttyS0 8:55 ./shmtst 10
040 0 881 878 9 0 1080 360 down D ttyS0 10:22 ./shmtst 10
444 0 882 878 9 0 0 0 do_exi Z ttyS0 10:00 [shmtst <de
040 0 883 878 9 0 1092 416 wait_o D ttyS0 9:30 ./shmtst 10
040 0 884 878 9 0 1092 416 down D ttyS0 8:44 ./shmtst 10
040 0 885 878 9 0 1092 416 down D ttyS0 9:01 ./shmtst 10
444 0 886 878 9 0 0 0 do_exi Z ttyS0 7:59 [shmtst <de
444 0 887 879 9 0 0 0 do_exi Z ttyS0 17:11 [shmtst <de
040 0 888 878 9 0 1080 360 down D ttyS0 10:21 ./shmtst 10
040 0 889 878 9 0 1092 416 down D ttyS0 9:06 ./shmtst 10
000 0 891 862 9 0 1136 488 nanosl S ttyS0 0:23 vmstat 5
000 0 1226 862 19 0 2756 1084 - R ttyS0 0:00 ps l
[root@ls3016 /root]# 0 10 2 0 1607936 18932 2110652 0 0 0 275 368 488 0 0 99
0 10 2 0 1607912 18932 2110660 0 0 0 266 334 457 0 0 100
0 10 2 0 1607848 18932 2110672 0 0 0 302 354 498 0 0 100
0 10 2 0 1607892 18932 2110688 0 0 0 287 352 496 0 0 100
0 11 2 0 1607868 18932 2110704 0 0 1 282 338 472 0 1 99
So the processes have not finished exiting after at least 47*5 sec
(roughly four minutes). They have shared-mmapped a plain file some
666000000 bytes long on an 8GB machine.
The rest of the machine behaves nicely.
Greetings
Christoph
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-01 18:48 ` Rik van Riel
@ 2000-11-02 7:19 ` Mike Galbraith
2000-11-02 21:59 ` Val Henson
2000-11-02 8:40 ` Christoph Rohland
1 sibling, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-02 7:19 UTC (permalink / raw)
To: Rik van Riel; +Cc: David Mansfield, lkml
On Wed, 1 Nov 2000, Rik van Riel wrote:
> On Wed, 1 Nov 2000, David Mansfield wrote:
>
> > I'd like to report what seems like a performance problem in the latest
> > kernels. Actually, all recent kernels have exhibited this problem, but
> > I was waiting for the new VM stuff to stabilize before reporting it.
> >
> > My test is: run 7 processes that each allocate and randomly
> > access 32mb of ram (on a 256mb machine). Even though 7*32MB =
> > 224MB, this still sends the machine lightly into swap. The
> > machine continues to function fairly smoothly for the most part.
> > I can do filesystem operations, run new programs, move desktops
> > in X etc.
> >
> > Except: programs which access /proc/<pid>/stat stall for an
> > indeterminate amount of time. For example, 'ps' and 'vmstat'
> > stall BADLY in these scenarios. I have had the stalls last over
> > a minute in higher VM pressure situations.
>
> I have one possible reason for this ....
>
> 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> down(&mm->mmap_sem);
>
> 2) but, in order to do that, it has to wait until the process
> it is trying to stat has /finished/ its page fault, and is
> not into its next one ...
>
> 3) combine this with the elevator starvation stuff (ask Jens
> Axboe for blk-7 to alleviate this issue) and you have a
> scenario where processes using /proc/<pid>/stat have the
> possibility to block on multiple processes that are in the
> process of handling a page fault (but are being starved)
I'm experimenting with blk.[67] in test10 right now. The stalls
are not helped at all. It doesn't seem to become request bound
(haven't instrumented that yet to be sure) but the stalls persist.
-Mike
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
2000-11-01 18:38 [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 David Mansfield
@ 2000-11-01 18:48 ` Rik van Riel
2000-11-02 7:19 ` Mike Galbraith
2000-11-02 8:40 ` Christoph Rohland
0 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2000-11-01 18:48 UTC (permalink / raw)
To: David Mansfield; +Cc: lkml
On Wed, 1 Nov 2000, David Mansfield wrote:
> I'd like to report what seems like a performance problem in the latest
> kernels. Actually, all recent kernels have exhibited this problem, but
> I was waiting for the new VM stuff to stabilize before reporting it.
>
> My test is: run 7 processes that each allocate and randomly
> access 32mb of ram (on a 256mb machine). Even though 7*32MB =
> 224MB, this still sends the machine lightly into swap. The
> machine continues to function fairly smoothly for the most part.
> I can do filesystem operations, run new programs, move desktops
> in X etc.
>
> Except: programs which access /proc/<pid>/stat stall for an
> indeterminate amount of time. For example, 'ps' and 'vmstat'
> stall BADLY in these scenarios. I have had the stalls last over
> a minute in higher VM pressure situations.
I have one possible reason for this ....
1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
down(&mm->mmap_sem);
2) but, in order to do that, it has to wait until the process
it is trying to stat has /finished/ its page fault, and is
not into its next one ...
3) combine this with the elevator starvation stuff (ask Jens
Axboe for blk-7 to alleviate this issue) and you have a
scenario where processes using /proc/<pid>/stat have the
possibility to block on multiple processes that are in the
process of handling a page fault (but are being starved)
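
The waiting described in point 2 can be put into a toy model. The sketch below is not kernel code. It assumes the reader goes to sleep on the semaphore just as a fault's IO starts, that each fault holds the semaphore for `hold` ticks, that the faulting process runs for `gaps[k]` ticks before its next fault, and that a woken reader needs `sched_delay` ticks before it actually runs. All of these names and the tick model are invented for illustration. The `handoff` flag contrasts passing ownership directly to the sleeper with merely waking it, which is the distinction drawn in the follow-up analysis in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the number of ticks the reader waits before it owns the
 * semaphore, or -1 if it never gets in during the n modelled faults.
 */
long reader_wait_ticks(const int *gaps, size_t n, int hold,
                       int sched_delay, int handoff)
{
    long t = 0;

    assert(hold >= 0 && sched_delay >= 0);
    for (size_t k = 0; k < n; k++) {
        t += hold;                    /* fault k finishes, semaphore released */
        if (handoff)
            return t;                 /* ownership passed straight to the sleeper */
        if (sched_delay < gaps[k])
            return t + sched_delay;   /* reader ran before the next fault began */
        t += gaps[k];                 /* faulter re-took the semaphore first */
    }
    return -1;
}
```

With gaps of {1, 1, 1, 5}, hold = 10 and sched_delay = 3, the wake-only case returns 46 ticks while the handoff case returns 10. If every gap is shorter than the scheduling delay, the wake-only reader never gets in, which is the shape of the stall being reported.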
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
^ permalink raw reply [flat|nested] 16+ messages in thread
* [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
@ 2000-11-01 18:38 David Mansfield
2000-11-01 18:48 ` Rik van Riel
0 siblings, 1 reply; 16+ messages in thread
From: David Mansfield @ 2000-11-01 18:38 UTC (permalink / raw)
To: lkml
Hi VM/procfs hackers,
System is UP Athlon 700mhz with 256mb ram running vanilla 2.4.0-test10.
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
I'd like to report what seems like a performance problem in the latest
kernels. Actually, all recent kernels have exhibited this problem, but
I was waiting for the new VM stuff to stabilize before reporting it.
My test is: run 7 processes that each allocate and randomly access 32mb
of ram (on a 256mb machine). Even though 7*32MB = 224MB, this still
sends the machine lightly into swap. The machine continues to function
fairly smoothly for the most part. I can do filesystem operations, run
new programs, move desktops in X etc.
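
For anyone wanting to reproduce the load, a minimal sketch of one such memory hog follows. The original test program is not included in this report, so the function name, parameters and use of rand() are all assumptions. It allocates a buffer and writes to pseudo-randomly chosen pages so that each touched page has to be resident.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate `bytes` of memory and write one byte into `touches`
 * pseudo-randomly chosen pages. Returns the number of touches
 * performed, or 0 if the arguments are unusable or malloc fails.
 */
size_t hog_touch(size_t bytes, size_t page_size, size_t touches,
                 unsigned int seed)
{
    if (page_size == 0 || bytes < page_size)
        return 0;

    unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return 0;

    /* volatile keeps the compiler from discarding the stores */
    volatile unsigned char *p = buf;
    size_t pages = bytes / page_size;
    size_t done = 0;

    srand(seed);
    for (size_t i = 0; i < touches; i++) {
        size_t page = (size_t)rand() % pages;
        p[page * page_size] = (unsigned char)i;
        done++;
    }

    free(buf);
    return done;
}
```

Running seven processes that each call this in a loop with bytes set to 32 MB would approximate the workload described above, though the access pattern of the real test may differ.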
Except: programs which access /proc/<pid>/stat stall for an indeterminate
amount of time. For example, 'ps' and 'vmstat' stall BADLY in these
scenarios. I have had the stalls last over a minute in higher VM
pressure situations.
Unfortunately, when the system is thrashing, it's nice to be able to run
'ps' in order to get the PID to kill, and run a reliable vmstat to
monitor it.
Here's a segment of an strace of 'ps' showing a 12 second stall (this
isn't the worst I've seen by any means, but a 12 second stall trying to
get process info for 1 swapping task can easily snowball into a DOS).
0.000119 open("/proc/4746/stat", O_RDONLY) = 7
0.000072 read(7, "4746 (hog) D 4739 4739 827 34817"..., 511) = 181
12.237161 close(7) = 0
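
If the left-hand column holds relative timestamps of the kind strace -r prints (time since the previous call began), the 12 seconds were spent inside the read(). The same measurement can be taken without strace. The helper below is a sketch with a made-up name, using only standard POSIX calls, that times one open/read/close of a given path.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns the seconds spent opening, reading and closing `path`,
 * or -1.0 if any step fails.
 */
double time_stat_read(const char *path)
{
    char buf[512];
    struct timespec t0, t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n < 0)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it in a loop on /proc/<pid>/stat for one of the swapping processes and logging any result above a second or so would show how often and how long the stalls occur.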
The wchan of the stalled 'ps' is in __down_interruptible, which probably
doesn't help much.
This worked absolutely fine in 2.2. Even under extreme swap pressure,
vmstat continues to function fine, spitting out messages every second as
it should.
David Mansfield
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2000-11-11 6:21 UTC | newest]
Thread overview: 16+ messages
[not found] <Pine.Linu.4.10.10011091452270.747-100000@mikeg.weiden.de>
2000-11-09 18:31 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 Linus Torvalds
2000-11-10 7:34 ` Mike Galbraith
2000-11-10 10:47 ` Mike Galbraith
2000-11-10 17:07 ` Linus Torvalds
2000-11-10 21:42 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
2000-11-11 6:20 ` Linus Torvalds
2000-11-01 18:38 [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 David Mansfield
2000-11-01 18:48 ` Rik van Riel
2000-11-02 7:19 ` Mike Galbraith
2000-11-02 21:59 ` Val Henson
2000-11-03 1:37 ` Jens Axboe
2000-11-03 5:56 ` Mike Galbraith
2000-11-03 15:45 ` Mike Galbraith
2000-11-03 19:38 ` Jens Axboe
2000-11-04 5:43 ` Mike Galbraith
2000-11-02 8:40 ` Christoph Rohland