[BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10


* [BUG] /proc/<pid>/stat access stalls badly for swapping process,  2.4.0-test10
@ 2000-11-01 18:38 David Mansfield
  2000-11-01 18:48 ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: David Mansfield @ 2000-11-01 18:38 UTC (permalink / raw)
  To: lkml

Hi VM/procfs hackers,

System is a UP Athlon 700MHz with 256MB RAM running vanilla 2.4.0-test10.
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)

I'd like to report what seems like a performance problem in the latest
kernels.  Actually, all recent kernels have exhibited this problem, but
I was waiting for the new VM stuff to stabilize before reporting it. 

My test is: run 7 processes that each allocate and randomly access 32mb
of ram (on a 256mb machine).  Even though 7*32MB = 224MB, this still
sends the machine lightly into swap.  The machine continues to function
fairly smoothly for the most part.  I can do filesystem operations, run
new programs, move desktops in X etc.
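
For reference, a minimal sketch of what one such hog might look like. The original test program was not posted, so the function name, the 4096-byte page size, and the use of rand() are all assumptions:

```c
#include <assert.h>
#include <stdlib.h>

#define HOG_PAGE_SIZE 4096u	/* assumed page size, not from the report */

/*
 * Hypothetical helper: allocate `bytes` of memory, write one byte into
 * each of `touches` pseudo-randomly chosen pages, then free the buffer.
 * Returns the number of writes performed, or 0 if allocation failed.
 */
size_t hog_touch_random(size_t bytes, size_t touches, unsigned int seed)
{
	size_t npages = bytes / HOG_PAGE_SIZE;
	volatile unsigned char *buf;	/* volatile so the writes are kept */
	size_t i;

	assert(npages > 0);		/* caller must ask for at least a page */
	buf = malloc(npages * HOG_PAGE_SIZE);
	if (buf == NULL)
		return 0;

	srand(seed);
	for (i = 0; i < touches; i++) {
		size_t page = (size_t)rand() % npages;

		buf[page * HOG_PAGE_SIZE] = (unsigned char)i;
	}
	free((void *)buf);
	return touches;
}
```

To approximate the test described above, one would fork 7 processes and have each call this in a loop with bytes set to 32MB.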

Except: programs which access /proc/<pid>/stat stall for an indeterminate
amount of time.  For example, 'ps' and 'vmstat' stall BADLY in these
scenarios.  I have had the stalls last over a minute in higher VM
pressure situations. 

Unfortunately, when the system is thrashing, it's nice to be able to run
'ps' in order to get the PID to kill, and run a reliable vmstat to
monitor it.

Here's a segment of an strace of 'ps' showing a 12 second stall (this
isn't the worst I've seen by any means, but a 12 second stall trying to
get process info for 1 swapping task can easily snowball into a DoS).

     0.000119 open("/proc/4746/stat", O_RDONLY) = 7
     0.000072 read(7, "4746 (hog) D 4739 4739 827 34817"..., 511) = 181
    12.237161 close(7)                  = 0

The wchan of the stalled 'ps' is in __down_interruptible, which probably
doesn't help much.
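
The same measurement can be taken without strace. The sketch below times one open/read/close cycle on a stat file. It is only an illustration: the function name is hypothetical, and the 511-byte read size is copied from the trace above. If the left-hand column in the trace is strace's relative timestamp, the 12 seconds were spent inside the read, which this helper would also capture.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path`, read up to 511 bytes, close it, and
 * store the elapsed wall-clock seconds in *elapsed.
 * Returns the number of bytes read, or -1 on error (in which case
 * *elapsed is left untouched).
 */
long timed_stat_read(const char *path, double *elapsed)
{
	char buf[512];
	struct timespec t0, t1;
	ssize_t n;
	int fd;

	assert(path != NULL && elapsed != NULL);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (n < 0)
		return -1;
	*elapsed = (double)(t1.tv_sec - t0.tv_sec) +
		   (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	return (long)n;
}
```

Calling it once a second on /proc/<pid>/stat for each hog and logging any result above, say, one second would give a rough stall histogram.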

This worked absolutely fine in 2.2.  Even under extreme swap pressure,
vmstat continues to function fine, spitting out messages every second as
it should.

David Mansfield
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-01 18:38 [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 David Mansfield
@ 2000-11-01 18:48 ` Rik van Riel
  2000-11-02  7:19   ` Mike Galbraith
  2000-11-02  8:40   ` Christoph Rohland
  0 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2000-11-01 18:48 UTC (permalink / raw)
  To: David Mansfield; +Cc: lkml

On Wed, 1 Nov 2000, David Mansfield wrote:

> I'd like to report what seems like a performance problem in the latest
> kernels.  Actually, all recent kernels have exhibited this problem, but
> I was waiting for the new VM stuff to stabilize before reporting it. 
> 
> My test is: run 7 processes that each allocate and randomly
> access 32mb of ram (on a 256mb machine).  Even though 7*32MB =
> 224MB, this still sends the machine lightly into swap.  The
> machine continues to function fairly smoothly for the most part.  
> I can do filesystem operations, run new programs, move desktops
> in X etc.
> 
> Except: programs which access /proc/<pid>/stat stall for an
> indeterminate amount of time.  For example, 'ps' and 'vmstat'
> stall BADLY in these scenarios.  I have had the stalls last over
> a minute in higher VM pressure situations.

I have one possible reason for this ....

1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
	down(&mm->mmap_sem);

2) but, in order to do that, it has to wait until the process
   it is trying to stat has /finished/ its page fault, and is
   not into its next one ...

3) combine this with the elevator starvation stuff (ask Jens
   Axboe for blk-7 to alleviate this issue) and you have a
   scenario where processes using /proc/<pid>/stat have the
   possibility to block on multiple processes that are in the
   process of handling a page fault (but are being starved)
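
Points 1 and 2 can be illustrated with a toy model. Assume the process being examined holds mmap_sem for the whole of each page fault and can retake it immediately when faults run back to back. The function below then computes when a reader that arrives at a given time first finds the semaphore free. It is a deliberate simplification: the function name and the interval representation are hypothetical, and real semaphore wake-up ordering is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model: fault i holds the lock during [start[i], end[i]).
 * Intervals must be sorted by start time and must not overlap.
 * Returns the earliest time >= arrival at which the lock is free.
 */
double reader_acquire_time(const double *start, const double *end,
			   size_t n, double arrival)
{
	double t = arrival;
	size_t i;

	assert(n == 0 || (start != NULL && end != NULL));

	for (i = 0; i < n; i++) {
		/* Lock is held at time t: wait for this fault to finish. */
		if (t >= start[i] && t < end[i])
			t = end[i];
	}
	return t;
}
```

With faults at [0,5), [5,9) and [9,12) a reader arriving at t=1 is not served until t=12, while a one-second gap between faults would have let it in at t=5. The slower each fault (for instance because its I/O is being starved), the longer the reader waits.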

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-01 18:48 ` Rik van Riel
@ 2000-11-02  7:19   ` Mike Galbraith
  2000-11-02 21:59     ` Val Henson
  2000-11-02  8:40   ` Christoph Rohland
  1 sibling, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-02  7:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: David Mansfield, lkml

On Wed, 1 Nov 2000, Rik van Riel wrote:

> On Wed, 1 Nov 2000, David Mansfield wrote:
> 
> > I'd like to report what seems like a performance problem in the latest
> > kernels.  Actually, all recent kernels have exhibited this problem, but
> > I was waiting for the new VM stuff to stabilize before reporting it. 
> > 
> > My test is: run 7 processes that each allocate and randomly
> > access 32mb of ram (on a 256mb machine).  Even though 7*32MB =
> > 224MB, this still sends the machine lightly into swap.  The
> > machine continues to function fairly smoothly for the most part.  
> > I can do filesystem operations, run new programs, move desktops
> > in X etc.
> > 
> > Except: programs which access /proc/<pid>/stat stall for an
> > indeterminate amount of time.  For example, 'ps' and 'vmstat'
> > stall BADLY in these scenarios.  I have had the stalls last over
> > a minute in higher VM pressure situations.
> 
> I have one possible reason for this ....
> 
> 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> 	down(&mm->mmap_sem);
> 
> 2) but, in order to do that, it has to wait until the process
>    it is trying to stat has /finished/ its page fault, and is
>    not into its next one ...
> 
> 3) combine this with the elevator starvation stuff (ask Jens
>    Axboe for blk-7 to alleviate this issue) and you have a
>    scenario where processes using /proc/<pid>/stat have the
>    possibility to block on multiple processes that are in the
>    process of handling a page fault (but are being starved)

I'm experimenting with blk.[67] in test10 right now.  The stalls
are not helped at all.  It doesn't seem to become request bound
(haven't instrumented that yet to be sure) but the stalls persist.

	-Mike


* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-02  7:19   ` Mike Galbraith
@ 2000-11-02 21:59     ` Val Henson
  2000-11-03  1:37       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Val Henson @ 2000-11-02 21:59 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Rik van Riel, linux-kernel

On Thu, Nov 02, 2000 at 08:19:06AM +0100, Mike Galbraith wrote:
> On Wed, 1 Nov 2000, Rik van Riel wrote:
> 
> > I have one possible reason for this ....
> > 
> > 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> > 	down(&mm->mmap_sem);
> > 
> > 2) but, in order to do that, it has to wait until the process
> >    it is trying to stat has /finished/ its page fault, and is
> >    not into its next one ...
> > 
> > 3) combine this with the elevator starvation stuff (ask Jens
> >    Axboe for blk-7 to alleviate this issue) and you have a
> >    scenario where processes using /proc/<pid>/stat have the
> >    possibility to block on multiple processes that are in the
> >    process of handling a page fault (but are being starved)
> 
> I'm experimenting with blk.[67] in test10 right now.  The stalls
> are not helped at all.  It doesn't seem to become request bound
> (haven't instrumented that yet to be sure) but the stalls persist.
> 
> 	-Mike

This is not an elevator starvation problem.

I also experienced these stalls with my IDE-only system.  Unless I'm
badly mistaken, the elevator is only used on SCSI disks, therefore
elevator starvation cannot be blamed for this problem.  These stalls
are particularly annoying since I want to find the pid of the process
hogging memory in order to kill it, but the read from /proc stalls for
45 seconds or more.

-VAL

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-02 21:59     ` Val Henson
@ 2000-11-03  1:37       ` Jens Axboe
  2000-11-03  5:56         ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2000-11-03  1:37 UTC (permalink / raw)
  To: Val Henson; +Cc: Mike Galbraith, Rik van Riel, linux-kernel

On Thu, Nov 02 2000, Val Henson wrote:
> > > 3) combine this with the elevator starvation stuff (ask Jens
> > >    Axboe for blk-7 to alleviate this issue) and you have a
> > >    scenario where processes using /proc/<pid>/stat have the
> > >    possibility to block on multiple processes that are in the
> > >    process of handling a page fault (but are being starved)
> > 
> > I'm experimenting with blk.[67] in test10 right now.  The stalls
> > are not helped at all.  It doesn't seem to become request bound
> > (haven't instrumented that yet to be sure) but the stalls persist.
> > 
> > 	-Mike
> 
> This is not an elevator starvation problem.

True, but the blk-xx patches help work around what I believe is
bad flushing behaviour by the VM.

> I also experienced these stalls with my IDE-only system.  Unless I'm
> badly mistaken, the elevator is only used on SCSI disks, therefore
> elevator starvation cannot be blamed for this problem.  These stalls
> are particularly annoying since I want to find the pid of the process
> hogging memory in order to kill it, but the read from /proc stalls for
> 45 seconds or more.

You are badly mistaken.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-03  1:37       ` Jens Axboe
@ 2000-11-03  5:56         ` Mike Galbraith
  2000-11-03 15:45           ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-03  5:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Val Henson, Rik van Riel, linux-kernel

On Thu, 2 Nov 2000, Jens Axboe wrote:

> On Thu, Nov 02 2000, Val Henson wrote:
> > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > >    Axboe for blk-7 to alleviate this issue) and you have a
> > > >    scenario where processes using /proc/<pid>/stat have the
> > > >    possibility to block on multiple processes that are in the
> > > >    process of handling a page fault (but are being starved)
> > > 
> > > I'm experimenting with blk.[67] in test10 right now.  The stalls
> > > are not helped at all.  It doesn't seem to become request bound
> > > (haven't instrumented that yet to be sure) but the stalls persist.
> > > 
> > > 	-Mike
> > 
> > This is not an elevator starvation problem.
> 
> True, but the blk-xx patches help work around what I believe is
> bad flushing behaviour by the VM.

I very much agree.  Kflushd is still hungry for free write
bandwidth here.

Of course it's _going_ to have to wait if you're doing max IO
throughput, but when you're flushing, you need to let kflushd
have the bandwidth it needs to do its job.  I don't think it's
getting what it needs, and am trying two things.

1.  Revoke read's ability to steal requests while we're in a
heavy flushing situation.  Flushing must proceed, and it must
go at full speed.  (Actually, reversing the favoritism when
you need flush bandwidth makes sense to me, and does help if
limited.. if not limited, it hurts like hell)

2.  Use the information that we are starving (or going full bore)
to tell the VM to keep its fingers off dirty buffers.  If we're
flushing at disk speed, page_launder() can't do anything useful
with dirty buffers, it can only do harm IMHO.

	-Mike

P.S.  Before I revert to Luke Codecrawler mode, I have a wild
problem theory I'd appreciate comments on.. preferably the kind
where I become extremely busy thinking about their content ;-)

If one __alloc_pages() is waiting for kswapd, kswapd tries to do
synchronous flushing.. if the write queue is nearly (or) exhausted
and page_launder() couldn't clean any buffers on its first pass,
it blasts the queue some more and stalls.  If kswapd, kflushd and
kupdate are all waiting for a request, and then say a GFP_BUFFER
allocation comes along.. (we're low on memory) we do SCHED_YIELD
schedule().  If we're holding IO locks, nobody can do IO.  OK, if
there's nobody else running, we come right back and either finish
the allocation or fail.  But, if you have other allocations trying
to flush buffers (GFP_KERNEL eg), they are not only in danger of
stacking up due to a request shortage, but they can't get whatever
IO locks the GFP_BUFFER allocation is holding anyway so are doomed
until we do schedule back to the GFP_BUFFER allocating task.

Isn't scheduling while holding IO locks the wrong thing to do?  It's
protected from neither GFP_BUFFER nor PF_MEMALLOC.

I must be missing something.. but what?

<ears at maximum gain>


* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-03  5:56         ` Mike Galbraith
@ 2000-11-03 15:45           ` Mike Galbraith
  2000-11-03 19:38             ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-03 15:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jens Axboe, Rik van Riel, Tigran Aivazian

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1680 bytes --]

On Fri, 3 Nov 2000, Mike Galbraith wrote:

> On Thu, 2 Nov 2000, Jens Axboe wrote:
> 
> > On Thu, Nov 02 2000, Val Henson wrote:
> > > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > > >    Axboe for blk-7 to alleviate this issue) and you have a
> > > > >    scenario where processes using /proc/<pid>/stat have the
> > > > >    possibility to block on multiple processes that are in the
> > > > >    process of handling a page fault (but are being starved)
> > > > 
> > > > I'm experimenting with blk.[67] in test10 right now.  The stalls
> > > > are not helped at all.  It doesn't seem to become request bound
> > > > (haven't instrumented that yet to be sure) but the stalls persist.
> > > > 
> > > > 	-Mike
> > > 
> > > This is not an elevator starvation problem.
> > 
> > > True, but the blk-xx patches help work around what I believe is
> > > bad flushing behaviour by the VM.
> 
> I very much agree.  Kflushd is still hungry for free write
> bandwidth here.

In the LKML tradition of code talks and silly opinions walk...

Attached is a diagnostic patch which gets kflushd under control,
and takes make -j30 bzImage build times down from 12 minutes to
9 here.  I have no more massive context switching on write, and
copies seem to go a lot quicker to boot.  (that may be because
some of my failures were really _really_ horrible)

Comments are very welcome.  I haven't had problems with this yet,
but it's early so...  This patch isn't supposed to be pretty either
(hw techs don't do pretty;) it's only supposed to say 'Houston...'
so be sure to grab a barfbag before you take a look. 

	-Mike

P.S. almost forgot. vmstat freezes were shortened too :-)
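
The central idea of the attached patch is to stop flushing as soon as the device has no free write requests, and to record that state so that page_launder() keeps away from I/O. It can be modelled in userspace roughly as follows. This is a simplified sketch, not kernel code: struct flush_model and both function names are hypothetical, and a plain counter stands in for q->request_freelist[WRITE].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the patch looks at. */
struct flush_model {
	int free_write_requests;	/* models q->request_freelist[WRITE] */
	int throttle_io;		/* models the patch's throttle_IO flag */
};

/*
 * Try to write out up to `limit` of `ndirty` dirty buffers. Each write
 * consumes one free write request. When none are left, set throttle_io
 * and stop. Returns the number of buffers flushed.
 */
int model_flush_dirty(struct flush_model *m, int ndirty, int limit)
{
	int flushed = 0;

	assert(m != NULL);

	while (ndirty > 0 && flushed < limit) {
		if (m->free_write_requests == 0) {
			m->throttle_io = 1;	/* queue saturated: back off */
			break;
		}
		m->throttle_io = 0;
		m->free_write_requests--;
		ndirty--;
		flushed++;
	}
	return flushed;
}

/* Models the patched test in page_launder(): I/O only if not throttled. */
int model_can_get_io_locks(const struct flush_model *m, int gfp_io)
{
	assert(m != NULL);
	return gfp_io && !m->throttle_io;
}
```

With 3 free write requests and 10 dirty buffers, the model flushes 3, sets the throttle flag, and model_can_get_io_locks() then refuses I/O until a later flush pass finds a free request again.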

[-- Attachment #2: Type: TEXT/PLAIN, Size: 13303 bytes --]

diff -urN linux-2.4.0-test10.virgin/fs/buffer.c linux-2.4.0-test10.mike/fs/buffer.c
--- linux-2.4.0-test10.virgin/fs/buffer.c	Wed Nov  1 06:42:40 2000
+++ linux-2.4.0-test10.mike/fs/buffer.c	Fri Nov  3 14:59:10 2000
@@ -38,6 +38,7 @@
 #include <linux/swapctl.h>
 #include <linux/smp_lock.h>
 #include <linux/vmalloc.h>
+#include <linux/blk.h>
 #include <linux/blkdev.h>
 #include <linux/sysrq.h>
 #include <linux/file.h>
@@ -705,13 +706,12 @@
 /*
  * We used to try various strange things. Let's not.
  */
+static int flush_dirty_buffers(int mode);
+
 static void refill_freelist(int size)
 {
-	if (!grow_buffers(size)) {
-		wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
-		current->policy |= SCHED_YIELD;
-		schedule();
-	}
+	if (!grow_buffers(size))
+		flush_dirty_buffers(2);
 }
 
 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -859,7 +859,9 @@
 
 /* -1 -> no need to flush
     0 -> async flush
-    1 -> sync flush (wait for I/O completation) */
+    1 -> sync flush (wait for I/O completation)
+    throttle_IO will be set by kflushd to indicate IO saturation. */
+int throttle_IO;
 int balance_dirty_state(kdev_t dev)
 {
 	unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
@@ -2469,6 +2471,7 @@
  * response to dirty buffers.  Once this process is activated, we write back
  * a limited number of buffers to the disks and then go back to sleep again.
  */
+static DECLARE_WAIT_QUEUE_HEAD(bdflush_wait);
 static DECLARE_WAIT_QUEUE_HEAD(bdflush_done);
 struct task_struct *bdflush_tsk = 0;
 
@@ -2476,11 +2479,12 @@
 {
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (current == bdflush_tsk)
+	if (current->flags & PF_MEMALLOC)
 		return;
 
 	if (!block) {
-		wake_up_process(bdflush_tsk);
+		if (waitqueue_active(&bdflush_wait))
+			wake_up(&bdflush_wait);
 		return;
 	}
 
@@ -2491,7 +2495,9 @@
 	__set_current_state(TASK_UNINTERRUPTIBLE);
 	add_wait_queue(&bdflush_done, &wait);
 
-	wake_up_process(bdflush_tsk);
+	if (waitqueue_active(&bdflush_wait))
+		wake_up(&bdflush_wait);
+	current->policy |= SCHED_YIELD;
 	schedule();
 
 	remove_wait_queue(&bdflush_done, &wait);
@@ -2503,11 +2509,19 @@
    NOTENOTENOTENOTE: we _only_ need to browse the DIRTY lru list
    as all dirty buffers lives _only_ in the DIRTY lru list.
    As we never browse the LOCKED and CLEAN lru lists they are infact
-   completly useless. */
-static int flush_dirty_buffers(int check_flushtime)
+   completly useless.
+   modes: 0 = check bdf_prm.b_un.ndirty [kflushd]
+          1 = check flushtime [kupdate]
+          2 = check bdf_prm.b_un.nrefill [refill_freelist()] */
+#define  MODE_KFLUSHD 0
+#define  MODE_KUPDATE 1
+#define  MODE_REFILL 2
+static int flush_dirty_buffers(int mode)
 {
 	struct buffer_head * bh, *next;
+	request_queue_t *q;
 	int flushed = 0, i;
+	unsigned long flags;
 
  restart:
 	spin_lock(&lru_list_lock);
@@ -2524,31 +2538,52 @@
 		if (buffer_locked(bh))
 			continue;
 
-		if (check_flushtime) {
+		if (mode == MODE_KUPDATE) {
 			/* The dirty lru list is chronologically ordered so
 			   if the current bh is not yet timed out,
 			   then also all the following bhs
 			   will be too young. */
 			if (time_before(jiffies, bh->b_flushtime))
 				goto out_unlock;
+		} else if (mode == MODE_KFLUSHD) {
+			if (flushed >= bdf_prm.b_un.ndirty)
+				goto out_unlock;
 		} else {
-			if (++flushed > bdf_prm.b_un.ndirty)
+			if (flushed >= bdf_prm.b_un.nrefill)
 				goto out_unlock;
 		}
 
-		/* OK, now we are committed to write it out. */
+		/* We are almost committed to write it out. */
 		atomic_inc(&bh->b_count);
+		q = blk_get_queue(bh->b_rdev);
+		spin_lock_irqsave(&q->request_lock, flags);
 		spin_unlock(&lru_list_lock);
+		if (list_empty(&q->request_freelist[WRITE])) {
+			throttle_IO = 1;
+			atomic_dec(&bh->b_count);
+			spin_unlock_irqrestore(&q->request_lock, flags);
+			run_task_queue(&tq_disk);
+			break;
+		} else
+			throttle_IO = 0;
+		spin_unlock_irqrestore(&q->request_lock, flags);
+		/* OK, now we are really committed. */
+
 		ll_rw_block(WRITE, 1, &bh);
 		atomic_dec(&bh->b_count);
+		flushed++;
 
-		if (current->need_resched)
-			schedule();
+		if (current->need_resched) {
+			if (!(mode == MODE_KFLUSHD))
+				schedule();
+			else
+				goto out;
+		}
 		goto restart;
 	}
- out_unlock:
+out_unlock:
 	spin_unlock(&lru_list_lock);
-
+out:
 	return flushed;
 }
 
@@ -2640,7 +2675,7 @@
 int bdflush(void *sem)
 {
 	struct task_struct *tsk = current;
-	int flushed;
+	int flushed, dirty, pdirty=0;
 	/*
 	 *	We have a bare-bones task_struct, and really should fill
 	 *	in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2649,6 +2684,7 @@
 
 	tsk->session = 1;
 	tsk->pgrp = 1;
+	tsk->flags |= PF_MEMALLOC;
 	strcpy(tsk->comm, "kflushd");
 	bdflush_tsk = tsk;
 
@@ -2664,32 +2700,39 @@
 	for (;;) {
 		CHECK_EMERGENCY_SYNC
 
+		if (balance_dirty_state(NODEV) < 0)
+			goto sleep;
+
 		flushed = flush_dirty_buffers(0);
 		if (free_shortage())
 			flushed += page_launder(GFP_BUFFER, 0);
 
-		/* If wakeup_bdflush will wakeup us
-		   after our bdflush_done wakeup, then
-		   we must make sure to not sleep
-		   in schedule_timeout otherwise
-		   wakeup_bdflush may wait for our
-		   bdflush_done wakeup that would never arrive
-		   (as we would be sleeping) and so it would
-		   deadlock in SMP. */
-		__set_current_state(TASK_INTERRUPTIBLE);
-		wake_up_all(&bdflush_done);
+#if 0
+		/*
+		 * Did someone create lots of dirty buffers while we slept?
+		 */
+		dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
+		if (dirty - pdirty > flushed && throttle_IO) {
+			printk(KERN_WARNING 
+			"kflushd: pdirty(%d) dirtied (%d) flushed (%d)\n",
+			pdirty, pdirty - dirty, flushed);
+		}
+		pdirty = dirty;
+#endif
+
+		run_task_queue(&tq_disk);
+		if (flushed)
+			wake_up_all(&bdflush_done);
 		/*
 		 * If there are still a lot of dirty buffers around,
-		 * skip the sleep and flush some more. Otherwise, we
+		 * we sleep and then flush some more. Otherwise, we
 		 * go to sleep waiting a wakeup.
 		 */
-		if (!flushed || balance_dirty_state(NODEV) < 0) {
-			run_task_queue(&tq_disk);
-			schedule();
+		if (balance_dirty_state(NODEV) < 0) {
+sleep:
+			wake_up_all(&bdflush_done);
+			interruptible_sleep_on_timeout(&bdflush_wait, HZ/10);
 		}
-		/* Remember to mark us as running otherwise
-		   the next schedule will block. */
-		__set_current_state(TASK_RUNNING);
 	}
 }
 
@@ -2706,6 +2749,7 @@
 
 	tsk->session = 1;
 	tsk->pgrp = 1;
+	tsk->flags |= PF_MEMALLOC;
 	strcpy(tsk->comm, "kupdate");
 
 	/* sigstop and sigcont will stop and wakeup kupdate */
diff -urN linux-2.4.0-test10.virgin/mm/page_alloc.c linux-2.4.0-test10.mike/mm/page_alloc.c
--- linux-2.4.0-test10.virgin/mm/page_alloc.c	Wed Nov  1 06:42:45 2000
+++ linux-2.4.0-test10.mike/mm/page_alloc.c	Fri Nov  3 15:22:55 2000
@@ -285,7 +285,7 @@
 struct page * __alloc_pages(zonelist_t *zonelist, unsigned long order)
 {
 	zone_t **zone;
-	int direct_reclaim = 0;
+	int direct_reclaim = 0, strikes = 0;
 	unsigned int gfp_mask = zonelist->gfp_mask;
 	struct page * page;
 
@@ -310,22 +310,30 @@
 			!(current->flags & PF_MEMALLOC))
 		direct_reclaim = 1;
 
-	/*
-	 * If we are about to get low on free pages and we also have
-	 * an inactive page shortage, wake up kswapd.
-	 */
-	if (inactive_shortage() > inactive_target / 2 && free_shortage())
-		wakeup_kswapd(0);
+#define STRIKE_ONE \
+	(strikes++ && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_TWO \
+	(strikes++ < 2 && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_THREE \
+	(strikes++ < 3 && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_THREE_NOIO \
+	(strikes++ < 3 && (gfp_mask & GFP_USER) == __GFP_WAIT)
 	/*
 	 * If we are about to get low on free pages and cleaning
 	 * the inactive_dirty pages would fix the situation,
 	 * wake up bdflush.
 	 */
-	else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
+try_again:
+	if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
 			&& nr_inactive_dirty_pages >= freepages.high)
-		wakeup_bdflush(0);
+		wakeup_bdflush(STRIKE_ONE);
+	/*
+	 * If we are about to get low on free pages and we also have
+	 * an inactive page shortage, wake up kswapd.
+	 */
+	else if (inactive_shortage() || free_shortage())
+		wakeup_kswapd(STRIKE_ONE);
 
-try_again:
 	/*
 	 * First, see if we have any zones with lots of free memory.
 	 *
@@ -374,35 +382,16 @@
 	if (page)
 		return page;
 
-	/*
-	 * OK, none of the zones on our zonelist has lots
-	 * of pages free.
-	 *
-	 * We wake up kswapd, in the hope that kswapd will
-	 * resolve this situation before memory gets tight.
-	 *
-	 * We also yield the CPU, because that:
-	 * - gives kswapd a chance to do something
-	 * - slows down allocations, in particular the
-	 *   allocations from the fast allocator that's
-	 *   causing the problems ...
-	 * - ... which minimises the impact the "bad guys"
-	 *   have on the rest of the system
-	 * - if we don't have __GFP_IO set, kswapd may be
-	 *   able to free some memory we can't free ourselves
-	 */
-	wakeup_kswapd(0);
-	if (gfp_mask & __GFP_WAIT) {
-		__set_current_state(TASK_RUNNING);
+	if (STRIKE_TWO) {
 		current->policy |= SCHED_YIELD;
-		schedule();
+		goto try_again;
 	}
 
 	/*
-	 * After waking up kswapd, we try to allocate a page
+	 * After waking up daemons, we try to allocate a page
 	 * from any zone which isn't critical yet.
 	 *
-	 * Kswapd should, in most situations, bring the situation
+	 * Kswapd/kflushd should, in most situations, bring the situation
 	 * back to normal in no time.
 	 */
 	page = __alloc_pages_limit(zonelist, order, PAGES_MIN, direct_reclaim);
@@ -426,7 +415,7 @@
 		 * in the hope of creating a large, physically contiguous
 		 * piece of free memory.
 		 */
-		if (order > 0 && (gfp_mask & __GFP_WAIT)) {
+		if (gfp_mask & __GFP_WAIT) {
 			zone = zonelist->zones;
 			/* First, clean some dirty pages. */
 			page_launder(gfp_mask, 1);
@@ -463,26 +452,23 @@
 		 * simply cannot free a large enough contiguous area
 		 * of memory *ever*.
 		 */
-		if ((gfp_mask & (__GFP_WAIT|__GFP_IO)) == (__GFP_WAIT|__GFP_IO)) {
+		if (~gfp_mask & __GFP_HIGH && STRIKE_THREE) {
+			memory_pressure++;
 			wakeup_kswapd(1);
+			goto try_again;
+		} else if (~gfp_mask & __GFP_HIGH && STRIKE_THREE_NOIO) {
+			/*
+		 	* If __GFP_IO isn't set, we can't wait on kswapd because
+		 	* daemons just might need some IO locks /we/ are holding ...
+		 	*
+		 	* SUBTLE: The scheduling point above makes sure that
+		 	* kswapd does get the chance to free memory we can't
+		 	* free ourselves...
+		 	*/
 			memory_pressure++;
-			if (!order)
-				goto try_again;
-		/*
-		 * If __GFP_IO isn't set, we can't wait on kswapd because
-		 * kswapd just might need some IO locks /we/ are holding ...
-		 *
-		 * SUBTLE: The scheduling point above makes sure that
-		 * kswapd does get the chance to free memory we can't
-		 * free ourselves...
-		 */
-		} else if (gfp_mask & __GFP_WAIT) {
 			try_to_free_pages(gfp_mask);
-			memory_pressure++;
-			if (!order)
-				goto try_again;
+			goto try_again;
 		}
-
 	}
 
 	/*
diff -urN linux-2.4.0-test10.virgin/mm/vmscan.c linux-2.4.0-test10.mike/mm/vmscan.c
--- linux-2.4.0-test10.virgin/mm/vmscan.c	Wed Nov  1 06:42:45 2000
+++ linux-2.4.0-test10.mike/mm/vmscan.c	Fri Nov  3 15:20:32 2000
@@ -562,6 +562,7 @@
  * go out to Matthew Dillon.
  */
 #define MAX_LAUNDER 		(4 * (1 << page_cluster))
+extern int throttle_IO;
 int page_launder(int gfp_mask, int sync)
 {
 	int launder_loop, maxscan, cleaned_pages, maxlaunder;
@@ -573,7 +574,7 @@
 	 * We can only grab the IO locks (eg. for flushing dirty
 	 * buffers to disk) if __GFP_IO is set.
 	 */
-	can_get_io_locks = gfp_mask & __GFP_IO;
+	can_get_io_locks = gfp_mask & __GFP_IO && !throttle_IO;
 
 	launder_loop = 0;
 	maxlaunder = 0;
@@ -1050,13 +1051,23 @@
 	for (;;) {
 		static int recalc = 0;
 
+		/* Once a second, recalculate some VM stats. */
+		if (time_after(jiffies, recalc + HZ)) {
+			recalc = jiffies;
+			recalculate_vm_stats();
+		}
+
 		/* If needed, try to free some memory. */
 		if (inactive_shortage() || free_shortage()) {
 			int wait = 0;
 			/* Do we need to do some synchronous flushing? */
 			if (waitqueue_active(&kswapd_done))
 				wait = 1;
+#if 0 /* Undo this and watch allocations fail under heavy stress */
 			do_try_to_free_pages(GFP_KSWAPD, wait);
+#else
+			do_try_to_free_pages(GFP_KSWAPD, 0);
+#endif
 		}
 
 		/*
@@ -1067,12 +1078,6 @@
 		 */
 		refill_inactive_scan(6, 0);
 
-		/* Once a second, recalculate some VM stats. */
-		if (time_after(jiffies, recalc + HZ)) {
-			recalc = jiffies;
-			recalculate_vm_stats();
-		}
-
 		/*
 		 * Wake up everybody waiting for free memory
 		 * and unplug the disk queue.
@@ -1112,7 +1117,7 @@
 {
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (current == kswapd_task)
+	if (current->flags & PF_MEMALLOC)
 		return;
 
 	if (!block) {


* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-03 15:45           ` Mike Galbraith
@ 2000-11-03 19:38             ` Jens Axboe
  2000-11-04  5:43               ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2000-11-03 19:38 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Rik van Riel, Tigran Aivazian

[-- Attachment #1: Type: text/plain, Size: 1053 bytes --]

On Fri, Nov 03 2000, Mike Galbraith wrote:
> > I very much agree.  Kflushd is still hungry for free write
> > bandwidth here.
> 
> In the LKML tradition of code talks and silly opinions walk...
> 
> Attached is a diagnostic patch which gets kflushd under control,
> and takes make -j30 bzImage build times down from 12 minutes to
> 9 here.  I have no more massive context switching on write, and
> copies seem to go a lot quicker to boot.  (that may be because
> some of my failures were really _really_ horrible)
> 
> Comments are very welcome.  I haven't had problems with this yet,
> but it's early so...  This patch isn't supposed to be pretty either
> (hw techs don't do pretty;) it's only supposed to say 'Houston...'
> so be sure to grab a barfbag before you take a look. 

Super, looks pretty good from here. I'll give it a go when I get back.
In addition, here's a small patch that disables the read stealing
of requests from the write list -- does that improve behaviour
when we are busy flushing?
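
As a rough illustration of what the attached patch changes, the request allocator can be modelled with two counters. The names below are hypothetical and the model ignores locking and the request structures themselves. With allow_steal set it behaves like the existing code, where a read may take a request from the write free list; with allow_steal clear it behaves like the patched code.

```c
#include <assert.h>
#include <stddef.h>

enum { MODEL_READ = 0, MODEL_WRITE = 1 };

/* Hypothetical stand-in for the two per-queue free lists. */
struct request_pool {
	int free[2];	/* free requests on the READ and WRITE lists */
};

/*
 * Take one request for direction `rw`.
 * Returns the list it came from (MODEL_READ or MODEL_WRITE),
 * or -1 if no request is available.
 */
int model_get_request(struct request_pool *p, int rw, int allow_steal)
{
	assert(p != NULL);
	assert(rw == MODEL_READ || rw == MODEL_WRITE);

	if (p->free[rw] > 0) {
		p->free[rw]--;
		return rw;
	}
	/* Own list is empty: a read may borrow from the write list. */
	if (allow_steal && rw == MODEL_READ && p->free[MODEL_WRITE] > 0) {
		p->free[MODEL_WRITE]--;
		return MODEL_WRITE;
	}
	return -1;
}
```

In this model, a burst of reads with stealing enabled can drain the write list completely, which is the situation in which a flusher would find no free write request.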

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

[-- Attachment #2: read_steal.diff --]
[-- Type: text/plain, Size: 998 bytes --]

--- drivers/block/ll_rw_blk.c~	Fri Nov  3 03:22:25 2000
+++ drivers/block/ll_rw_blk.c	Fri Nov  3 03:23:24 2000
@@ -455,35 +455,17 @@
 	struct list_head *list = &q->request_freelist[rw];
 	struct request *rq;
 
-	/*
-	 * Reads get preferential treatment and are allowed to steal
-	 * from the write free list if necessary.
-	 */
 	if (!list_empty(list)) {
 		rq = blkdev_free_rq(list);
-		goto got_rq;
-	}
-
-	/*
-	 * if the WRITE list is non-empty, we know that rw is READ
-	 * and that the READ list is empty. allow reads to 'steal'
-	 * from the WRITE list.
-	 */
-	if (!list_empty(&q->request_freelist[WRITE])) {
-		list = &q->request_freelist[WRITE];
-		rq = blkdev_free_rq(list);
-		goto got_rq;
+		list_del(&rq->table);
+		rq->free_list = list;
+		rq->rq_status = RQ_ACTIVE;
+		rq->special = NULL;
+		rq->q = q;
+		return rq;
 	}
 
 	return NULL;
-
-got_rq:
-	list_del(&rq->table);
-	rq->free_list = list;
-	rq->rq_status = RQ_ACTIVE;
-	rq->special = NULL;
-	rq->q = q;
-	return rq;
 }
 
 /*

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-03 19:38             ` Jens Axboe
@ 2000-11-04  5:43               ` Mike Galbraith
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-04  5:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Rik van Riel, Tigran Aivazian

On Fri, 3 Nov 2000, Jens Axboe wrote:

> On Fri, Nov 03 2000, Mike Galbraith wrote:
> > > I very much agree.  Kflushd is still hungry for free write
> > > bandwidth here.
> > 
> > In the LKML tradition of code talks and silly opinions walk...
> > 
> > Attached is a diagnostic patch which gets kflushd under control,
> > and takes make -j30 bzImage build times down from 12 minutes to
> > 9 here.  I have no more massive context switching on write, and
> > copies seem to go a lot quicker to boot.  (that may be because
> > some of my failures were really _really_ horrible)
> > 
> > Comments are very welcome.  I haven't had problems with this yet,
> > but it's early so...  This patch isn't supposed to be pretty either
> > > (hw techs don't do pretty;) it's only supposed to say 'Houston...'
> > so be sure to grab a barfbag before you take a look. 
> 
> Super, looks pretty good from here. I'll give it a go when I get back.
> In addition, here's a small patch that disables the read stealing
> of requests from the write list -- does that improve behaviour
> when we are busy flushing?

Yes.  I've done this a bit differently here, and have had good
results.  I only disable stealing when I need flush throughput.
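A minimal sketch of the idea of stealing only when not flushing, written as a stand-alone toy rather than against ll_rw_blk.c. The actual change is not shown in this thread, so every name here (sketch_queue, sketch_get_request, the flushing flag) is hypothetical, and the free lists are reduced to simple counters:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's READ/WRITE list indices. */
enum { SKETCH_READ = 0, SKETCH_WRITE = 1 };

/* Toy request queue: only tracks how many free requests each list has. */
struct sketch_queue {
	int free[2];
};

/*
 * Take a free request for direction rw.
 * Returns the index of the list the request came from, or -1 if none
 * was available. A READ may steal from the WRITE list, but only while
 * the (assumed) flushing flag is clear.
 */
int sketch_get_request(struct sketch_queue *q, int rw, bool flushing)
{
	if (q->free[rw] > 0) {
		q->free[rw]--;
		return rw;
	}

	/* Own list is empty: consider stealing for reads only. */
	if (rw == SKETCH_READ && !flushing && q->free[SKETCH_WRITE] > 0) {
		q->free[SKETCH_WRITE]--;
		return SKETCH_WRITE;
	}

	return -1;
}
```

With the READ list empty and two free WRITE requests, a read succeeds by stealing when flushing is false and fails when it is true, which leaves the write requests for the flush.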

Now that the box isn't biting off more than it can chew quite
as often, I'll try this again.  I'm pretty darn sure that I can
get more throughput, but :> I've learned that getting too much
can do really OOGLY things. (turns box into single user single
tasking streaming IO monster from hell)

	-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-01 18:48 ` Rik van Riel
  2000-11-02  7:19   ` Mike Galbraith
@ 2000-11-02  8:40   ` Christoph Rohland
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Rohland @ 2000-11-02  8:40 UTC (permalink / raw)
  To: Rik van Riel; +Cc: David Mansfield, lkml

Hi Rik,

I can probably give some more datapoints. Here is the console output
of my test machine (there is a 'vmstat 5' running in background):

[root@ls3016 /root]# killall shmtst
[root@ls3016 /root]#
 1 12  2      0 1607668  18932 2110496   0   0 67154 1115842 1050063 2029389   0   2  98
 0 10  2      0 1607564  18932 2110496   0   0     0   300  317   426   0   0 100
 0 10  2      0 1607408  18932 2110496   0   0     0   301  336   473   0   0 100
 0 10  2      0 1607560  18932 2110508   0   0     0   307  318   430   0   0 100
 0 10  2      0 1607556  18932 2110512   0   0     0   304  324   433   0   0 100
 0 10  2      0 1607528  18932 2110512   0   0     0   272  308   410   0   1  99
 0 10  2      0 1607440  18932 2110516   0   0     0   315  323   438   0   1  99
 0 10  2      0 1607528  18932 2110516   0   0     0   323  316   424   0   0 100
 0 10  2      0 1607556  18932 2110516   0   0     0   304  309   410   0   0 100
 0 10  2      0 1607600  18932 2110528   0   0     0   298  314   418   0   0 100
 0 10  2      0 1607384  18932 2110528   0   0     0   296  307   406   0   1  99
 0 10  2      0 1607284  18932 2110528   0   0     0   304  315   421   0   0 100
 0 10  2      0 1607668  18932 2110528   0   0     0   298  304   402   0   0 100
 0 10  2      0 1607576  18932 2110528   0   0     0   285  307   405   0   0 100
 0 10  2      0 1607656  18932 2110528   0   0     0   292  303   399   0   1  99
 0 10  2      0 1607928  18932 2110528   0   0     0   313  310   408   0   0 100
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0 10  2      0 1608440  18932 2110528   0   0     0   340  313   417   0   1  99
 0 10  2      0 1608260  18932 2110528   0   0     0   298  318   426   0   0 100
 0 10  2      0 1608208  18932 2110528   0   0     0   314  334   448   0   1  99
 0 10  2      0 1608396  18932 2110528   0   0     0   323  316   421   0   1  99
 0 10  2      0 1608204  18932 2110548   0   0     0   334  333   458   0   0 100
 0 10  2      0 1607888  18932 2110580   0   0     0   336  329   448   0   1  99
 0 10  2      0 1608040  18932 2110584   0   0     0   317  321   435   0   0 100
 0 10  2      0 1608032  18932 2110588   0   0     0   241  318   425   0   0 100
 0 10  2      0 1608028  18932 2110592   0   0     0   257  325   443   0   1  99
 0 10  3      0 1608028  18932 2110592   0   0     0   258  323   435   0   0  99
 0 10  2      0 1608032  18932 2110592   0   0     0   241  316   425   0   0 100
 0 10  2      0 1608024  18932 2110592   0   0     0   261  337   460   0   0 100
 0 10  2      0 1608016  18932 2110592   0   0     0   253  328   444   0   0 100
 0 10  2      0 1608024  18932 2110592   0   0     0   252  320   435   0   0 100
 0 10  2      0 1608012  18932 2110592   0   0     0   255  326   446   0   0 100
 0 10  2      0 1608020  18932 2110592   0   0     0   255  326   444   0   1  99
 0 10  2      0 1608012  18932 2110600   0   0     0   261  341   469   0   0 100
 0 10  2      0 1607992  18932 2110608   0   0     0   261  344   479   0   0 100
 0 10  2      0 1607992  18932 2110612   0   0     0   264  342   471   0   0 100
 0 10  2      0 1607984  18932 2110612   0   0     0   266  334   462   0   0 100
 0 10  2      0 1607980  18932 2110620   0   0     0   273  340   468   0   0  99
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0 10  2      0 1607972  18932 2110624   0   0     0   266  345   474   0   1  99
 0 10  2      0 1607940  18932 2110640   0   0     0   256  341   462   0   0 100
 0 10  2      0 1607936  18932 2110644   0   0     0   262  339   462   0   1  99
 0 10  2      0 1607940  18932 2110644   0   0     0   261  333   450   0   1  99
 0 10  2      0 1607944  18932 2110644   0   0     0   253  335   454   0   0 100
 0 10  2      0 1607944  18932 2110644   0   0     0   272  352   479   0   1  99

[root@ls3016 /root]# ps l
  F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
100     0   820     1   9   0  2200 1168 wait4  S    ttyS0      0:00 login -- ro
100     0   862   820  14   0  1756  976 wait4  S    ttyS0      0:00 -bash
000     0   878   862   9   0  1080  360 down   D    ttyS0     11:27 ./shmtst 10
000     0   879   862   9   0  1080  360 down   D    ttyS0     15:21 ./shmtst 15
040     0   880   878   9   0  1092  416 wait_o D    ttyS0      8:55 ./shmtst 10
040     0   881   878   9   0  1080  360 down   D    ttyS0     10:22 ./shmtst 10
444     0   882   878   9   0     0    0 do_exi Z    ttyS0     10:00 [shmtst <de
040     0   883   878   9   0  1092  416 wait_o D    ttyS0      9:30 ./shmtst 10
040     0   884   878   9   0  1092  416 down   D    ttyS0      8:44 ./shmtst 10
040     0   885   878   9   0  1092  416 down   D    ttyS0      9:01 ./shmtst 10
444     0   886   878   9   0     0    0 do_exi Z    ttyS0      7:59 [shmtst <de
444     0   887   879   9   0     0    0 do_exi Z    ttyS0     17:11 [shmtst <de
040     0   888   878   9   0  1080  360 down   D    ttyS0     10:21 ./shmtst 10
040     0   889   878   9   0  1092  416 down   D    ttyS0      9:06 ./shmtst 10
000     0   891   862   9   0  1136  488 nanosl S    ttyS0      0:23 vmstat 5
000     0  1226   862  19   0  2756 1084 -      R    ttyS0      0:00 ps l
[root@ls3016 /root]#  0 10  2      0 1607936  18932 2110652   0   0     0   275  368   488   0   0  99
 0 10  2      0 1607912  18932 2110660   0   0     0   266  334   457   0   0 100
 0 10  2      0 1607848  18932 2110672   0   0     0   302  354   498   0   0 100
 0 10  2      0 1607892  18932 2110688   0   0     0   287  352   496   0   0 100
 0 11  2      0 1607868  18932 2110704   0   0     1   282  338   472   0   1  99

So the processes don't finish exiting for at least 47*5 sec. They have
a shared mmap of a plain file some 666000000 bytes long on an 8GB machine.

The rest of the machine behaves nicely.

Greetings
		Christoph


^ permalink raw reply	[flat|nested] 16+ messages in thread

[parent not found: <Pine.Linu.4.10.10011091452270.747-100000@mikeg.weiden.de>]

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
       [not found] <Pine.Linu.4.10.10011091452270.747-100000@mikeg.weiden.de>
@ 2000-11-09 18:31 ` Linus Torvalds
  2000-11-10  7:34   ` Mike Galbraith
  2000-11-10 21:42   ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-09 18:31 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox

As to the real reason for stalls on /proc/<pid>/stat, I bet it has nothing
to do with IO except indirectly (the IO is necessary to trigger the
problem, but the _reason_ for the problem lies elsewhere).

And it has everything to do with the fact that the way Linux semaphores
are implemented, a non-blocking process has a HUGE advantage over a
blocking one. Linux kernel semaphores are extremely unfair in that way.

What happens is that some process is getting a lot of VM faults and gets
its VM semaphore. No contention yet. It holds the semaphore over the
IO, and now another process does a "ps".

The "ps" process goes to sleep on the semaphore. So far so good.

The original process releases the semaphore, which increments the count,
and wakes up the process waiting for it. Note that it _wakes_ it, it does
not give the semaphore to it. Big difference.

The process that got woken up will run eventually. Probably not all that
immediately, because the process that woke it (and held the semaphore)
just slept on a page fault too, so it's not likely to immediately
relinquish the CPU.

The original running process comes back faulting again, finds the
semaphore still unlocked (the "ps" process is awake but has not gotten to
run yet), gets the semaphore, and falls asleep on the IO for the next
page.

The "ps" process actually gets to run now, but it's a bit late. The
semaphore is locked again. 

Repeat until luck breaks the bad circle.

(This scenario, btw, is much harder to trigger on SMP than on UP. And
it's completely separate from the issue of simple disk bandwidth issues
which can obviously cause no end of stalls on anything that needs the
disk, and which can also happen on SMP).
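The cycle described above can be sketched as a deterministic user-space toy. This is only an illustration under stated assumptions: one CPU, one faulting "hog", one "ps" already asleep on the semaphore, and the hog always re-faulting before ps is scheduled. All names (toy_sem, wake_policy, toy_try_down, toy_releases_until_ps_runs) are hypothetical and nothing here is kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; a toy model, not kernel code. */
enum wake_policy { WAKE_ONLY, WAITERS_FIRST };

struct toy_sem {
	bool locked;	/* true while some task holds the semaphore */
	int waiters;	/* tasks queued on (or just woken from) it */
};

/*
 * Try to take the semaphore. A task that was already queued may always
 * take a free semaphore; a newcomer may do so under WAITERS_FIRST only
 * if nobody is queued ahead of it.
 */
bool toy_try_down(struct toy_sem *s, enum wake_policy p, bool was_waiting)
{
	if (s->locked)
		return false;
	if (p == WAITERS_FIRST && !was_waiting && s->waiters > 0)
		return false;
	s->locked = true;
	if (was_waiting)
		s->waiters--;
	return true;
}

/*
 * Returns how many times the hog released the semaphore before ps got
 * it, or -1 if ps was still waiting after max_rounds releases.
 */
int toy_releases_until_ps_runs(enum wake_policy p, int max_rounds)
{
	/* Start state: hog holds the semaphore, ps sleeps on it. */
	struct toy_sem sem = { .locked = true, .waiters = 1 };
	int round;

	assert(max_rounds > 0);
	for (round = 1; round <= max_rounds; round++) {
		/* Hog's IO completes: up() frees the semaphore and wakes
		 * ps, but the hog keeps the CPU. */
		sem.locked = false;

		/* Hog faults again immediately and calls down(). */
		if (!toy_try_down(&sem, p, false))
			sem.waiters++;	/* hog queues behind ps */

		/* Hog now sleeps (on IO or on the semaphore); ps runs. */
		if (toy_try_down(&sem, p, true))
			return round;
	}
	return -1;
}
```

In this model toy_releases_until_ps_runs(WAKE_ONLY, n) returns -1 for any n, because nothing in the toy plays the role of "luck", while WAITERS_FIRST lets ps in after the first release.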

NOTE! If somebody wants to fix this, the fix should be reasonably simple
but needs to be quite exhaustively checked and double-checked. It's just
too easy to break the semaphores by mistake.

The way to make semaphores more fair is to NOT allow a new process to just
come in immediately and steal the semaphore in __down() if there are other
sleepers. This is most easily accomplished by something along the lines of
the following in __down() in arch/i386/kernel/semaphore.c 

	spin_lock_irq(&semaphore_lock);
	sem->sleepers++;
+
+	/*
+	 * Are there other people waiting for this?
+	 * They get to go first.
+	 */
+	if (sem->sleepers > 1)
+		goto inside;
	for (;;) {
                int sleepers = sem->sleepers;

                /*
                 * Add "everybody else" into it. They aren't
                 * playing, because we own the spinlock.
                 */
                if (!atomic_add_negative(sleepers - 1, &sem->count)) {
                        sem->sleepers = 0;
                        break;
                }
                sem->sleepers = 1;      /* us - see -1 above */
+inside:
                spin_unlock_irq(&semaphore_lock);
                schedule();
                tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
                spin_lock_irq(&semaphore_lock);
        }
        spin_unlock_irq(&semaphore_lock);

But note that the above is UNTESTED and also note that from a throughput
(as opposed to latency) standpoint being unfair tends to be nice.

Anybody want to try out something like the above? (And no, I'm not
applying it to my tree yet. It needs about a hundred pairs of eyes to
verify that there isn't some subtle "lost wakeup" race somewhere).

			Linus


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-09 18:31 ` Linus Torvalds
@ 2000-11-10  7:34   ` Mike Galbraith
  2000-11-10 10:47     ` Mike Galbraith
  2000-11-10 17:07     ` Linus Torvalds
  2000-11-10 21:42   ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
  1 sibling, 2 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-10  7:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox

On Thu, 9 Nov 2000, Linus Torvalds wrote:

> 
> 
> As to the real reason for stalls on /proc/<pid>/stat, I bet it has nothing
> to do with IO except indirectly (the IO is necessary to trigger the
> problem, but the _reason_ for the problem lies elsewhere).
> 
> And it has everything to do with the fact that the way Linux semaphores
> are implemented, a non-blocking process has a HUGE advantage over a
> blocking one. Linux kernel semaphores are extremely unfair in that way.
> 
> What happens is that some process is getting a lot of VM faults and gets
> its VM semaphore. No contention yet. It holds the semaphore over the
> IO, and now another process does a "ps".
> 
> The "ps" process goes to sleep on the semaphore. So far so good.
> 
> The original process releases the semaphore, which increments the count,
> and wakes up the process waiting for it. Note that it _wakes_ it, it does
> not give the semaphore to it. Big difference.
> 
> The process that got woken up will run eventually. Probably not all that
> immediately, because the process that woke it (and held the semaphore)
> just slept on a page fault too, so it's not likely to immediately
> relinquish the CPU.
> 
> The original running process comes back faulting again, finds the
> semaphore still unlocked (the "ps" process is awake but has not gotten to
> run yet), gets the semaphore, and falls asleep on the IO for the next
> page.
> 
> The "ps" process actually gets to run now, but it's a bit late. The
> semaphore is locked again. 
> 
> Repeat until luck breaks the bad circle.
> 
> (This scenario, btw, is much harder to trigger on SMP than on UP. And
> it's completely separate from the issue of simple disk bandwidth issues
> which can obviously cause no end of stalls on anything that needs the
> disk, and which can also happen on SMP).

Unfortunately, it didn't help in the scenario I'm running.

time make -j30 bzImage:

real    14m19.987s  (within stock variance)
user    6m24.480s
sys     1m12.970s

procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
31  2  1     12   1432   4440  12660   0  12    27   151  202   848  89  11   0
34  4  1   1908   2584    536   5376 248 1904   602   763  785  4094  63  32  5
13 19  1  64140  67728    604  33784 106500 84612 43625 21683 19080 52168  28  22  50

I understood the above well enough to be very interested in seeing what
happens with flush IO restricted.

	-Mike

[try_to_free_pages()->swap_out()/shm_swap().. can fight over who gets
to shrink the best candidate's footprint?]

Thanks!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-10  7:34   ` Mike Galbraith
@ 2000-11-10 10:47     ` Mike Galbraith
  2000-11-10 17:07     ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-10 10:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox

> I understood the above well enough to be very interested in seeing what
> happens with flush IO restricted.
> 
> 	-Mike
> 
> [try_to_free_pages()->swap_out()/shm_swap().. can fight over who gets
> to shrink the best candidate's footprint?]
> 
> Thanks!

The results:

pre2+semaphore
real    14m19.987s
user    6m24.480s
sys     1m12.970s

pre2+semaphore+throttle_IO
real    10m13.953s
user    6m19.980s
sys     0m28.960s

pre2+semaphore+throttle_IO extended to refill_inactive()
real    9m46.395s
user    6m23.510s
sys     0m29.420s

pre2+semaphore+throttle_IO + above + tiny little tweak to page_launder()
real    8m56.808s
user    6m23.420s
sys     0m29.430s

Unfortunately, when I try to get past this point I burn trees :-)

	-Mike


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-10  7:34   ` Mike Galbraith
  2000-11-10 10:47     ` Mike Galbraith
@ 2000-11-10 17:07     ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-10 17:07 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.Linu.4.10.10011100732250.601-100000@mikeg.weiden.de>,
Mike Galbraith  <mikeg@wen-online.de> wrote:
>> 
>> (This scenario, btw, is much harder to trigger on SMP than on UP. And
>> it's completely separate from the issue of simple disk bandwidth issues
>> which can obviously cause no end of stalls on anything that needs the
>> disk, and which can also happen on SMP).
>
>Unfortunately, it didn't help in the scenario I'm running.
>
>time make -j30 bzImage:
>
>real    14m19.987s  (within stock variance)
>user    6m24.480s
>sys     1m12.970s

Note that the above kind of "throughput performance" should not have been
affected, and was not what I was worried about. 

>procs                      memory    swap          io     system         cpu
> r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
>31  2  1     12   1432   4440  12660   0  12    27   151  202   848  89  11   0
>34  4  1   1908   2584    536   5376 248 1904   602   763  785  4094  63  32  5
>13 19  1  64140  67728    604  33784 106500 84612 43625 21683 19080 52168  28  22  50

Looks like there was a big delay in vmstat there - that could easily be
due to simple disk throughput issues..

Does it feel any different under the original load that got the original
complaint? The patch may have just been buggy and ineffective, for all I
know. 

		Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping  process,2.4.0-test10
  2000-11-09 18:31 ` Linus Torvalds
  2000-11-10  7:34   ` Mike Galbraith
@ 2000-11-10 21:42   ` David Mansfield
  2000-11-11  6:20     ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: David Mansfield @ 2000-11-10 21:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Jens Axboe, MOLNAR Ingo, Rik van Riel,
	Kernel Mailing List, Alan Cox

Linus Torvalds wrote:
...
> 
> And it has everything to do with the fact that the way Linux semaphores
> are implemented, a non-blocking process has a HUGE advantage over a
> blocking one. Linux kernel semaphores are extremely unfair in that way.
>
...
> The original running process comes back faulting again, finds the
> semaphore still unlocked (the "ps" process is awake but has not gotten to
> run yet), gets the semaphore, and falls asleep on the IO for the next
> page.
> 
> The "ps" process actually gets to run now, but it's a bit late. The
> semaphore is locked again.
> 
> Repeat until luck breaks the bad circle.
> 

But doesn't __down have a fast path coded in assembly?  In other words,
it only hits your patched code if there is already contention, which
there isn't in this case, and therefore the bug...?

David Mansfield

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping  process,2.4.0-test10
  2000-11-10 21:42   ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
@ 2000-11-11  6:20     ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-11  6:20 UTC (permalink / raw)
  To: linux-kernel

In article <3A0C6BD6.A8F73950@dm.ultramaster.com>,
David Mansfield  <lkml@dm.ultramaster.com> wrote:
>Linus Torvalds wrote:
>...
>> 
>> And it has everything to do with the fact that the way Linux semaphores
>> are implemented, a non-blocking process has a HUGE advantage over a
>> blocking one. Linux kernel semaphores are extremely unfair in that way.
>>
>...
>> The original running process comes back faulting again, finds the
>> semaphore still unlocked (the "ps" process is awake but has not gotten to
>> run yet), gets the semaphore, and falls asleep on the IO for the next
>> page.
>> 
>> The "ps" process actually gets to run now, but it's a bit late. The
>> semaphore is locked again.
>> 
>> Repeat until luck breaks the bad circle.
>> 
>
>But doesn't __down have a fast path coded in assembly?  In other words,
>it only hits your patched code if there is already contention, which
>there isn't in this case, and therefore the bug...?

The __down() case should be hit if there's a waiter, even if that waiter
has not yet been able to pick up the lock (the waiter _will_ have
decremented the count to negative in order to trigger the proper logic
at release time).
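The count bookkeeping behind that statement can be walked through with a single-threaded model. This is a sketch under assumptions: it takes the fast paths to be "decrement, slow path if negative" for down() and "increment, wake if the result is <= 0" for up(), and it mirrors the sleepers arithmetic from the unpatched loop quoted earlier in the thread. The names (model_sem, model_down_fast, model_up_needs_wake, model_down_slow_pass) are hypothetical, and locking and scheduling are left out entirely:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of the count/sleepers arithmetic. */
struct model_sem {
	int count;	/* 1 = free, 0 = held, negative = contended */
	int sleepers;
};

/* Assumed fast path of down(): decrement; negative means slow path. */
bool model_down_fast(struct model_sem *s)
{
	return --s->count >= 0;
}

/* Assumed fast path of up(): increment; <= 0 means wake a waiter. */
bool model_up_needs_wake(struct model_sem *s)
{
	return ++s->count <= 0;
}

/*
 * One pass through the slow-path loop body of the unpatched code.
 * first_entry is true the first time a task enters the slow path.
 * Returns true if the caller now owns the semaphore.
 */
bool model_down_slow_pass(struct model_sem *s, bool first_entry)
{
	if (first_entry)
		s->sleepers++;
	s->count += s->sleepers - 1;	/* models atomic_add_negative() */
	if (s->count >= 0) {
		s->sleepers = 0;
		return true;
	}
	s->sleepers = 1;
	return false;
}

/*
 * Walks the hog/ps sequence. Returns true if the hog's second down()
 * misses the fast path even though ps has not run since being woken.
 */
bool model_second_down_is_contended(void)
{
	struct model_sem s = { .count = 1, .sleepers = 0 };
	bool ok;

	ok = model_down_fast(&s);		/* hog: 1 -> 0, acquired */
	assert(ok);
	ok = model_down_fast(&s);		/* ps: 0 -> -1, slow path */
	assert(!ok);
	ok = model_down_slow_pass(&s, true);	/* ps sleeps, count -1 */
	assert(!ok);
	ok = model_up_needs_wake(&s);		/* hog: -1 -> 0, wake ps */
	assert(ok);
	return !model_down_fast(&s);		/* hog: 0 -> -1 */
}
```

In this model the hog's second down() does land in the slow path, because up() only brings the count back to 0, not 1. A first slow-path pass by the hog then succeeds (sleepers goes to 2, the count returns to 0), which is the steal the proposed check is meant to prevent.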

But as I mentioned, the pseudo-patch was certainly untested, so
somebody should probably walk through the cases to check that I didn't
miss something.

		Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2000-11-11  6:21 UTC | newest]

Thread overview: 16+ messages
2000-11-01 18:38 [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 David Mansfield
2000-11-01 18:48 ` Rik van Riel
2000-11-02  7:19   ` Mike Galbraith
2000-11-02 21:59     ` Val Henson
2000-11-03  1:37       ` Jens Axboe
2000-11-03  5:56         ` Mike Galbraith
2000-11-03 15:45           ` Mike Galbraith
2000-11-03 19:38             ` Jens Axboe
2000-11-04  5:43               ` Mike Galbraith
2000-11-02  8:40   ` Christoph Rohland
     [not found] <Pine.Linu.4.10.10011091452270.747-100000@mikeg.weiden.de>
2000-11-09 18:31 ` Linus Torvalds
2000-11-10  7:34   ` Mike Galbraith
2000-11-10 10:47     ` Mike Galbraith
2000-11-10 17:07     ` Linus Torvalds
2000-11-10 21:42   ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
2000-11-11  6:20     ` Linus Torvalds
