* [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
@ 2000-11-01 18:38 David Mansfield
  2000-11-01 18:48 ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: David Mansfield @ 2000-11-01 18:38 UTC
To: lkml

Hi VM/procfs hackers,

System is UP Athlon 700MHz with 256MB RAM running vanilla 2.4.0-test10.
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)

I'd like to report what seems like a performance problem in the latest
kernels.  Actually, all recent kernels have exhibited this problem, but
I was waiting for the new VM stuff to stabilize before reporting it.

My test is: run 7 processes that each allocate and randomly access 32MB
of RAM (on a 256MB machine).  Even though 7*32MB = 224MB, this still
sends the machine lightly into swap.  The machine continues to function
fairly smoothly for the most part.  I can do filesystem operations, run
new programs, move desktops in X etc.

Except: programs which access /proc/<pid>/stat stall for an
indeterminate amount of time.  For example, 'ps' and 'vmstat' stall
BADLY in these scenarios.  I have had the stalls last over a minute in
higher VM pressure situations.

Unfortunately, when the system is thrashing, it's nice to be able to
run 'ps' in order to get the PID to kill, and run a reliable vmstat to
monitor it.

Here's a segment of an strace of 'ps' showing a 12 second stall (this
isn't the worst I've seen by any means, but a 12 second stall trying to
get process info for 1 swapping task can easily snowball into a DoS).

     0.000119 open("/proc/4746/stat", O_RDONLY) = 7
     0.000072 read(7, "4746 (hog) D 4739 4739 827 34817"..., 511) = 181
    12.237161 close(7)                = 0

The wchan of the stalled 'ps' is in __down_interruptible, which probably
doesn't help much.

This worked absolutely fine in 2.2.  There, even under extreme swap
pressure, vmstat continues to function fine, spitting out messages
every second as it should.

David Mansfield
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
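The test load described in the report (each process allocates 32MB and touches it at random) could look roughly like the sketch below. The actual 'hog' program was not posted, so the function name `hog_alloc_and_touch`, the 4096-byte page size and the use of `rand()` are all assumptions made for illustration.

```c
#include <stdlib.h>

#define HOG_BYTES (32UL * 1024 * 1024) /* 32MB per process, as in the report */
#define HOG_PAGE  4096UL               /* assumed page size */

/*
 * Hypothetical stand-in for the reporter's 'hog' program.
 * Allocates `bytes` of memory and writes one byte into `touches`
 * randomly chosen pages, so the pages cannot stay untouched.
 * Returns the buffer (caller frees it), or NULL if `bytes` is smaller
 * than one page or the allocation fails.
 */
static unsigned char *hog_alloc_and_touch(size_t bytes, unsigned long touches,
                                          unsigned int seed)
{
    size_t npages = bytes / HOG_PAGE;
    unsigned char *buf;
    unsigned long i;

    if (npages == 0)
        return NULL;

    buf = malloc(bytes);
    if (buf == NULL)
        return NULL;

    srand(seed);
    for (i = 0; i < touches; i++) {
        /* 32MB is 8192 pages, which fits even a 15-bit RAND_MAX */
        size_t page = (size_t)rand() % npages;
        buf[page * HOG_PAGE] = (unsigned char)(i & 0xff);
    }
    return buf;
}
```

To approximate the reported load, seven processes would each call `hog_alloc_and_touch(HOG_BYTES, ...)` in a loop on a 256MB machine while 'ps' or 'vmstat' runs alongside.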
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-01 18:38 [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 David Mansfield
@ 2000-11-01 18:48 ` Rik van Riel
  2000-11-02  7:19   ` Mike Galbraith
  2000-11-02  8:40   ` Christoph Rohland
  0 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2000-11-01 18:48 UTC
To: David Mansfield; +Cc: lkml

On Wed, 1 Nov 2000, David Mansfield wrote:

> I'd like to report what seems like a performance problem in the latest
> kernels.  Actually, all recent kernels have exhibited this problem, but
> I was waiting for the new VM stuff to stabilize before reporting it.
>
> My test is: run 7 processes that each allocate and randomly
> access 32MB of RAM (on a 256MB machine).  Even though 7*32MB =
> 224MB, this still sends the machine lightly into swap.  The
> machine continues to function fairly smoothly for the most part.
> I can do filesystem operations, run new programs, move desktops
> in X etc.
>
> Except: programs which access /proc/<pid>/stat stall for an
> indeterminate amount of time.  For example, 'ps' and 'vmstat'
> stall BADLY in these scenarios.  I have had the stalls last over
> a minute in higher VM pressure situations.

I have one possible reason for this ....

1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
	down(&mm->mmap_sem);

2) but, in order to do that, it has to wait until the process
   it is trying to stat has /finished/ its page fault, and is
   not into its next one ...

3) combine this with the elevator starvation stuff (ask Jens
   Axboe for blk-7 to alleviate this issue) and you have a
   scenario where processes using /proc/<pid>/stat have the
   possibility to block on multiple processes that are in the
   process of handling a page fault (but are being starved)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
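Points 1) and 2) above can be illustrated with a user-space analogy. This is only a sketch under stated assumptions: a pthread mutex stands in for mm->mmap_sem, a sleep stands in for the swap-in I/O of a page fault, and every name here (`fake_mmap_sem`, `faulting_task`, `model_stat_wait_ms`) is hypothetical. It is not how the kernel implements either path.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

static pthread_mutex_t fake_mmap_sem = PTHREAD_MUTEX_INITIALIZER;
static sem_t holder_ready;

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Models a task that holds the semaphore for the duration of a slow fault. */
static void *faulting_task(void *arg)
{
    long hold_ms = *(long *)arg;

    pthread_mutex_lock(&fake_mmap_sem);
    sem_post(&holder_ready);        /* tell the reader the lock is taken */
    sleep_ms(hold_ms);              /* stands in for waiting on swap I/O */
    pthread_mutex_unlock(&fake_mmap_sem);
    return NULL;
}

/*
 * Models the /proc/<pid>/stat reader: it needs the same lock, so it
 * waits for as long as the "fault" takes.  Returns the wait in ms.
 */
static double model_stat_wait_ms(long hold_ms)
{
    pthread_t holder;
    struct timespec before, after;

    sem_init(&holder_ready, 0, 0);
    pthread_create(&holder, NULL, faulting_task, &hold_ms);
    sem_wait(&holder_ready);

    clock_gettime(CLOCK_MONOTONIC, &before);
    pthread_mutex_lock(&fake_mmap_sem);     /* blocks behind the fault */
    clock_gettime(CLOCK_MONOTONIC, &after);
    pthread_mutex_unlock(&fake_mmap_sem);

    pthread_join(holder, NULL);
    sem_destroy(&holder_ready);
    return (after.tv_sec - before.tv_sec) * 1000.0 +
           (after.tv_nsec - before.tv_nsec) / 1e6;
}
```

In this model the reader's latency tracks the holder's I/O time, which is the shape of the stall in the strace above: the longer a fault is starved of I/O, the longer the stat reader waits.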
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-01 18:48 ` Rik van Riel
@ 2000-11-02  7:19   ` Mike Galbraith
  2000-11-02 21:59     ` Val Henson
  2000-11-02  8:40   ` Christoph Rohland
  1 sibling, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-02  7:19 UTC
To: Rik van Riel; +Cc: David Mansfield, lkml

On Wed, 1 Nov 2000, Rik van Riel wrote:

> On Wed, 1 Nov 2000, David Mansfield wrote:
>
> > I'd like to report what seems like a performance problem in the latest
> > kernels.  Actually, all recent kernels have exhibited this problem, but
> > I was waiting for the new VM stuff to stabilize before reporting it.
> >
> > My test is: run 7 processes that each allocate and randomly
> > access 32MB of RAM (on a 256MB machine).  Even though 7*32MB =
> > 224MB, this still sends the machine lightly into swap.  The
> > machine continues to function fairly smoothly for the most part.
> > I can do filesystem operations, run new programs, move desktops
> > in X etc.
> >
> > Except: programs which access /proc/<pid>/stat stall for an
> > indeterminate amount of time.  For example, 'ps' and 'vmstat'
> > stall BADLY in these scenarios.  I have had the stalls last over
> > a minute in higher VM pressure situations.
>
> I have one possible reason for this ....
>
> 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> 	down(&mm->mmap_sem);
>
> 2) but, in order to do that, it has to wait until the process
>    it is trying to stat has /finished/ its page fault, and is
>    not into its next one ...
>
> 3) combine this with the elevator starvation stuff (ask Jens
>    Axboe for blk-7 to alleviate this issue) and you have a
>    scenario where processes using /proc/<pid>/stat have the
>    possibility to block on multiple processes that are in the
>    process of handling a page fault (but are being starved)

I'm experimenting with blk.[67] in test10 right now.  The stalls
are not helped at all.  It doesn't seem to become request bound
(haven't instrumented that yet to be sure) but the stalls persist.

	-Mike
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-02  7:19   ` Mike Galbraith
@ 2000-11-02 21:59     ` Val Henson
  2000-11-03  1:37       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Val Henson @ 2000-11-02 21:59 UTC
To: Mike Galbraith; +Cc: Rik van Riel, linux-kernel

On Thu, Nov 02, 2000 at 08:19:06AM +0100, Mike Galbraith wrote:
> On Wed, 1 Nov 2000, Rik van Riel wrote:
>
> > I have one possible reason for this ....
> >
> > 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> > 	down(&mm->mmap_sem);
> >
> > 2) but, in order to do that, it has to wait until the process
> >    it is trying to stat has /finished/ its page fault, and is
> >    not into its next one ...
> >
> > 3) combine this with the elevator starvation stuff (ask Jens
> >    Axboe for blk-7 to alleviate this issue) and you have a
> >    scenario where processes using /proc/<pid>/stat have the
> >    possibility to block on multiple processes that are in the
> >    process of handling a page fault (but are being starved)
>
> I'm experimenting with blk.[67] in test10 right now.  The stalls
> are not helped at all.  It doesn't seem to become request bound
> (haven't instrumented that yet to be sure) but the stalls persist.
>
> 	-Mike

This is not an elevator starvation problem.

I also experienced these stalls with my IDE-only system.  Unless I'm
badly mistaken, the elevator is only used on SCSI disks, therefore
elevator starvation cannot be blamed for this problem.  These stalls
are particularly annoying since I want to find the pid of the process
hogging memory in order to kill it, but the read from /proc stalls for
45 seconds or more.

-VAL
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-02 21:59     ` Val Henson
@ 2000-11-03  1:37       ` Jens Axboe
  2000-11-03  5:56         ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2000-11-03  1:37 UTC
To: Val Henson; +Cc: Mike Galbraith, Rik van Riel, linux-kernel

On Thu, Nov 02 2000, Val Henson wrote:
> > > 3) combine this with the elevator starvation stuff (ask Jens
> > >    Axboe for blk-7 to alleviate this issue) and you have a
> > >    scenario where processes using /proc/<pid>/stat have the
> > >    possibility to block on multiple processes that are in the
> > >    process of handling a page fault (but are being starved)
> >
> > I'm experimenting with blk.[67] in test10 right now.  The stalls
> > are not helped at all.  It doesn't seem to become request bound
> > (haven't instrumented that yet to be sure) but the stalls persist.
> >
> > 	-Mike
>
> This is not an elevator starvation problem.

True, but the blk-xx patches help work around what I believe is
bad flushing behaviour by the VM.

> I also experienced these stalls with my IDE-only system.  Unless I'm
> badly mistaken, the elevator is only used on SCSI disks, therefore
> elevator starvation cannot be blamed for this problem.  These stalls
> are particularly annoying since I want to find the pid of the process
> hogging memory in order to kill it, but the read from /proc stalls for
> 45 seconds or more.

You are badly mistaken.

--
* Jens Axboe <axboe@suse.de>
* SuSE Labs
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-03  1:37       ` Jens Axboe
@ 2000-11-03  5:56         ` Mike Galbraith
  2000-11-03 15:45           ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-03  5:56 UTC
To: Jens Axboe; +Cc: Val Henson, Rik van Riel, linux-kernel

On Thu, 2 Nov 2000, Jens Axboe wrote:

> On Thu, Nov 02 2000, Val Henson wrote:
> > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > >    Axboe for blk-7 to alleviate this issue) and you have a
> > > >    scenario where processes using /proc/<pid>/stat have the
> > > >    possibility to block on multiple processes that are in the
> > > >    process of handling a page fault (but are being starved)
> > >
> > > I'm experimenting with blk.[67] in test10 right now.  The stalls
> > > are not helped at all.  It doesn't seem to become request bound
> > > (haven't instrumented that yet to be sure) but the stalls persist.
> > >
> > > 	-Mike
> >
> > This is not an elevator starvation problem.
>
> True, but the blk-xx patches help work around what I believe is
> bad flushing behaviour by the VM.

I very much agree.  Kflushd is still hungry for free write
bandwidth here.  Of course it's _going_ to have to wait if you're
doing max IO throughput, but when you're flushing, you need to
let kflushd have the bandwidth it needs to do its job.  I don't
think it's getting what it needs, and am trying two things.

1. Revoke read's ability to steal requests while we're in a
heavy flushing situation.  Flushing must proceed, and it must
go at full speed.  (Actually, reversing the favoritism when
you need flush bandwidth makes sense to me, and does help if
limited.. if not limited, it hurts like hell)

2. Use the information that we are starving (or going full bore)
to tell the VM to keep its fingers off dirty buffers.  If we're
flushing at disk speed, page_launder() can't do anything useful
with dirty buffers, it can only do harm IMHO.

	-Mike

P.S.  Before I revert to Luke Codecrawler mode, I have a wild
problem theory I'd appreciate comments on.. preferably the kind
where I become extremely busy thinking about their content ;-)

If one __alloc_pages() is waiting for kswapd, kswapd tries to
do synchronous flushing.. if the write queue is nearly (or)
exhausted and page_launder() couldn't clean any buffers on its
first pass, it blasts the queue some more and stalls.  If kswapd,
kflushd and kupdate are all waiting for a request, and then say
a GFP_BUFFER allocation comes along.. (we're low on memory)
we do SCHED_YIELD schedule().  If we're holding IO locks, nobody
can do IO.  OK, if there's nobody else running, we come right
back and either finish the allocation or fail.  But, if you have
other allocations trying to flush buffers (GFP_KERNEL eg), they
are not only in danger of stacking up due to a request shortage,
but they can't get whatever IO locks the GFP_BUFFER allocation
is holding anyway so are doomed until we do schedule back to the
GFP_BUFFER allocating task.

Isn't scheduling while holding IO locks the wrong thing to do?  It's
protected from neither GFP_BUFFER nor PF_MEMALLOC.

I must be missing something.. but what?  <ears at maximum gain>
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-03  5:56         ` Mike Galbraith
@ 2000-11-03 15:45           ` Mike Galbraith
  2000-11-03 19:38             ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2000-11-03 15:45 UTC
To: linux-kernel; +Cc: Jens Axboe, Rik van Riel, Tigran Aivazian

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1680 bytes --]

On Fri, 3 Nov 2000, Mike Galbraith wrote:

> On Thu, 2 Nov 2000, Jens Axboe wrote:
>
> > On Thu, Nov 02 2000, Val Henson wrote:
> > > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > > >    Axboe for blk-7 to alleviate this issue) and you have a
> > > > >    scenario where processes using /proc/<pid>/stat have the
> > > > >    possibility to block on multiple processes that are in the
> > > > >    process of handling a page fault (but are being starved)
> > > >
> > > > I'm experimenting with blk.[67] in test10 right now.  The stalls
> > > > are not helped at all.  It doesn't seem to become request bound
> > > > (haven't instrumented that yet to be sure) but the stalls persist.
> > > >
> > > > 	-Mike
> > >
> > > This is not an elevator starvation problem.
> >
> > True, but the blk-xx patches help work around what I believe is
> > bad flushing behaviour by the VM.
>
> I very much agree.  Kflushd is still hungry for free write
> bandwidth here.

In the LKML tradition of code talks and silly opinions walk...

Attached is a diagnostic patch which gets kflushd under control,
and takes make -j30 bzImage build times down from 12 minutes to
9 here.  I have no more massive context switching on write, and
copies seem to go a lot quicker to boot.  (that may be because
some of my failures were really _really_ horrible)

Comments are very welcome.  I haven't had problems with this yet,
but it's early so...  This patch isn't supposed to be pretty either
(hw techs don't do pretty;) it's only supposed to say 'Houston...'
so be sure to grab a barfbag before you take a look.

	-Mike

P.S. almost forgot.  vmstat freezes were shortened too :-)

[-- Attachment #2: Type: TEXT/PLAIN, Size: 13303 bytes --]

diff -urN linux-2.4.0-test10.virgin/fs/buffer.c linux-2.4.0-test10.mike/fs/buffer.c
--- linux-2.4.0-test10.virgin/fs/buffer.c	Wed Nov  1 06:42:40 2000
+++ linux-2.4.0-test10.mike/fs/buffer.c	Fri Nov  3 14:59:10 2000
@@ -38,6 +38,7 @@
 #include <linux/swapctl.h>
 #include <linux/smp_lock.h>
 #include <linux/vmalloc.h>
+#include <linux/blk.h>
 #include <linux/blkdev.h>
 #include <linux/sysrq.h>
 #include <linux/file.h>
@@ -705,13 +706,12 @@
 /*
  * We used to try various strange things. Let's not.
  */
+static int flush_dirty_buffers(int mode);
+
 static void refill_freelist(int size)
 {
-	if (!grow_buffers(size)) {
-		wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
-		current->policy |= SCHED_YIELD;
-		schedule();
-	}
+	if (!grow_buffers(size))
+		flush_dirty_buffers(2);
 }
 
 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -859,7 +859,9 @@
 
 /* -1 -> no need to flush
     0 -> async flush
-    1 -> sync flush (wait for I/O completation) */
+    1 -> sync flush (wait for I/O completation)
+    throttle_IO will be set by kflushd to indicate IO saturation. */
+int throttle_IO;
 int balance_dirty_state(kdev_t dev)
 {
 	unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
@@ -2469,6 +2471,7 @@
  * response to dirty buffers.  Once this process is activated, we write back
  * a limited number of buffers to the disks and then go back to sleep again.
  */
+static DECLARE_WAIT_QUEUE_HEAD(bdflush_wait);
 static DECLARE_WAIT_QUEUE_HEAD(bdflush_done);
 struct task_struct *bdflush_tsk = 0;
 
@@ -2476,11 +2479,12 @@
 {
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (current == bdflush_tsk)
+	if (current->flags & PF_MEMALLOC)
 		return;
 
 	if (!block) {
-		wake_up_process(bdflush_tsk);
+		if (waitqueue_active(&bdflush_wait))
+			wake_up(&bdflush_wait);
 		return;
 	}
 
@@ -2491,7 +2495,9 @@
 
 	__set_current_state(TASK_UNINTERRUPTIBLE);
 	add_wait_queue(&bdflush_done, &wait);
-	wake_up_process(bdflush_tsk);
+	if (waitqueue_active(&bdflush_wait))
+		wake_up(&bdflush_wait);
+	current->policy |= SCHED_YIELD;
 	schedule();
 
 	remove_wait_queue(&bdflush_done, &wait);
@@ -2503,11 +2509,19 @@
    NOTENOTENOTENOTE: we _only_ need to browse the DIRTY lru list
    as all dirty buffers lives _only_ in the DIRTY lru list.
    As we never browse the LOCKED and CLEAN lru lists they are infact
-   completly useless. */
-static int flush_dirty_buffers(int check_flushtime)
+   completly useless.
+   modes: 0 = check bdf_prm.b_un.ndirty [kflushd]
+          1 = check flushtime [kupdate]
+          2 = check bdf_prm.b_un.nrefill [refill_freelist()] */
+#define  MODE_KFLUSHD 0
+#define  MODE_KUPDATE 1
+#define  MODE_REFILL 2
+static int flush_dirty_buffers(int mode)
 {
 	struct buffer_head * bh, *next;
+	request_queue_t *q;
 	int flushed = 0, i;
+	unsigned long flags;
 
  restart:
 	spin_lock(&lru_list_lock);
@@ -2524,31 +2538,52 @@
 			if (buffer_locked(bh))
 				continue;
 
-			if (check_flushtime) {
+			if (mode == MODE_KUPDATE) {
 				/* The dirty lru list is chronologically ordered so
 				   if the current bh is not yet timed out,
 				   then also all the following bhs
 				   will be too young. */
 				if (time_before(jiffies, bh->b_flushtime))
 					goto out_unlock;
+			} else if (MODE_KFLUSHD) {
+				if (flushed >= bdf_prm.b_un.ndirty)
+					goto out_unlock;
 			} else {
-				if (++flushed > bdf_prm.b_un.ndirty)
+				if (flushed >= bdf_prm.b_un.nrefill)
 					goto out_unlock;
 			}
 
-			/* OK, now we are committed to write it out. */
+			/* We are almost committed to write it out. */
 			atomic_inc(&bh->b_count);
+			q = blk_get_queue(bh->b_rdev);
+			spin_lock_irqsave(&q->request_lock, flags);
 			spin_unlock(&lru_list_lock);
+			if (list_empty(&q->request_freelist[WRITE])) {
+				throttle_IO = 1;
+				atomic_dec(&bh->b_count);
+				spin_unlock_irqrestore(&q->request_lock, flags);
+				run_task_queue(&tq_disk);
+				break;
+			} else
+				throttle_IO = 0;
+			spin_unlock_irqrestore(&q->request_lock, flags);
+			/* OK, now we are really committed. */
+
 			ll_rw_block(WRITE, 1, &bh);
 			atomic_dec(&bh->b_count);
+			flushed++;
 
-			if (current->need_resched)
-				schedule();
+			if (current->need_resched) {
+				if (!(mode == MODE_KFLUSHD))
+					schedule();
+				else
+					goto out;
+			}
 			goto restart;
 		}
-	out_unlock:
+out_unlock:
 	spin_unlock(&lru_list_lock);
-
+out:
 	return flushed;
 }
 
@@ -2640,7 +2675,7 @@
 int bdflush(void *sem)
 {
 	struct task_struct *tsk = current;
-	int flushed;
+	int flushed, dirty, pdirty=0;
 	/*
 	 *	We have a bare-bones task_struct, and really should fill
 	 *	in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2649,6 +2684,7 @@
 
 	tsk->session = 1;
 	tsk->pgrp = 1;
+	tsk->flags |= PF_MEMALLOC;
 	strcpy(tsk->comm, "kflushd");
 	bdflush_tsk = tsk;
 
@@ -2664,32 +2700,39 @@
 	for (;;) {
 		CHECK_EMERGENCY_SYNC
 
+		if (balance_dirty_state(NODEV) < 0)
+			goto sleep;
+
 		flushed = flush_dirty_buffers(0);
 		if (free_shortage())
 			flushed += page_launder(GFP_BUFFER, 0);
 
-		/* If wakeup_bdflush will wakeup us
-		   after our bdflush_done wakeup, then
-		   we must make sure to not sleep
-		   in schedule_timeout otherwise
-		   wakeup_bdflush may wait for our
-		   bdflush_done wakeup that would never arrive
-		   (as we would be sleeping) and so it would
-		   deadlock in SMP. */
-		__set_current_state(TASK_INTERRUPTIBLE);
-		wake_up_all(&bdflush_done);
+#if 0
+		/*
+		 * Did someone create lots of dirty buffers while we slept?
+		 */
+		dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
+		if (dirty - pdirty > flushed && throttle_IO) {
+			printk(KERN_WARNING
+				"kflushd: pdirty(%d) dirtied (%d) flushed (%d)\n",
+				pdirty, pdirty - dirty, flushed);
+		}
+		pdirty = dirty;
+#endif
+
+		run_task_queue(&tq_disk);
+		if (flushed)
+			wake_up_all(&bdflush_done);
 		/*
 		 * If there are still a lot of dirty buffers around,
-		 * skip the sleep and flush some more. Otherwise, we
+		 * we sleep and the flush some more. Otherwise, we
 		 * go to sleep waiting a wakeup.
 		 */
-		if (!flushed || balance_dirty_state(NODEV) < 0) {
-			run_task_queue(&tq_disk);
-			schedule();
+		if (balance_dirty_state(NODEV) < 0) {
+sleep:
+			wake_up_all(&bdflush_done);
+			interruptible_sleep_on_timeout(&bdflush_wait, HZ/10);
 		}
-		/* Remember to mark us as running otherwise
-		   the next schedule will block. */
-		__set_current_state(TASK_RUNNING);
 	}
 }
 
@@ -2706,6 +2749,7 @@
 
 	tsk->session = 1;
 	tsk->pgrp = 1;
+	tsk->flags |= PF_MEMALLOC;
 	strcpy(tsk->comm, "kupdate");
 
 	/* sigstop and sigcont will stop and wakeup kupdate */
diff -urN linux-2.4.0-test10.virgin/mm/page_alloc.c linux-2.4.0-test10.mike/mm/page_alloc.c
--- linux-2.4.0-test10.virgin/mm/page_alloc.c	Wed Nov  1 06:42:45 2000
+++ linux-2.4.0-test10.mike/mm/page_alloc.c	Fri Nov  3 15:22:55 2000
@@ -285,7 +285,7 @@
 struct page * __alloc_pages(zonelist_t *zonelist, unsigned long order)
 {
 	zone_t **zone;
-	int direct_reclaim = 0;
+	int direct_reclaim = 0, strikes = 0;
 	unsigned int gfp_mask = zonelist->gfp_mask;
 	struct page * page;
 
@@ -310,22 +310,30 @@
 			!(current->flags & PF_MEMALLOC))
 		direct_reclaim = 1;
 
-	/*
-	 * If we are about to get low on free pages and we also have
-	 * an inactive page shortage, wake up kswapd.
-	 */
-	if (inactive_shortage() > inactive_target / 2 && free_shortage())
-		wakeup_kswapd(0);
+#define STRIKE_ONE \
+	(strikes++ && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_TWO \
+	(strikes++ < 2 && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_THREE \
+	(strikes++ < 3 && (gfp_mask & GFP_USER) == GFP_USER)
+#define STRIKE_THREE_NOIO \
+	(strikes++ < 3 && (gfp_mask & GFP_USER) == __GFP_WAIT)
 	/*
 	 * If we are about to get low on free pages and cleaning
 	 * the inactive_dirty pages would fix the situation,
 	 * wake up bdflush.
 	 */
-	else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
+try_again:
+	if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
 			&& nr_inactive_dirty_pages >= freepages.high)
-		wakeup_bdflush(0);
+		wakeup_bdflush(STRIKE_ONE);
+	/*
+	 * If we are about to get low on free pages and we also have
+	 * an inactive page shortage, wake up kswapd.
+	 */
+	else if (inactive_shortage() || free_shortage())
+		wakeup_kswapd(STRIKE_ONE);
 
-try_again:
 	/*
 	 * First, see if we have any zones with lots of free memory.
 	 *
@@ -374,35 +382,16 @@
 	if (page)
 		return page;
 
-	/*
-	 * OK, none of the zones on our zonelist has lots
-	 * of pages free.
-	 *
-	 * We wake up kswapd, in the hope that kswapd will
-	 * resolve this situation before memory gets tight.
-	 *
-	 * We also yield the CPU, because that:
-	 * - gives kswapd a chance to do something
-	 * - slows down allocations, in particular the
-	 *   allocations from the fast allocator that's
-	 *   causing the problems ...
-	 * - ... which minimises the impact the "bad guys"
-	 *   have on the rest of the system
-	 * - if we don't have __GFP_IO set, kswapd may be
-	 *   able to free some memory we can't free ourselves
-	 */
-	wakeup_kswapd(0);
-	if (gfp_mask & __GFP_WAIT) {
-		__set_current_state(TASK_RUNNING);
+	if (STRIKE_TWO) {
 		current->policy |= SCHED_YIELD;
-		schedule();
+		goto try_again;
 	}
 
 	/*
-	 * After waking up kswapd, we try to allocate a page
+	 * After waking up daemons, we try to allocate a page
 	 * from any zone which isn't critical yet.
 	 *
-	 * Kswapd should, in most situations, bring the situation
+	 * Kswapd/kflushd should, in most situations, bring the situation
 	 * back to normal in no time.
 	 */
 	page = __alloc_pages_limit(zonelist, order, PAGES_MIN, direct_reclaim);
@@ -426,7 +415,7 @@
 	 * in the hope of creating a large, physically contiguous
 	 * piece of free memory.
 	 */
-	if (order > 0 && (gfp_mask & __GFP_WAIT)) {
+	if (gfp_mask & __GFP_WAIT) {
 		zone = zonelist->zones;
 		/* First, clean some dirty pages. */
 		page_launder(gfp_mask, 1);
@@ -463,26 +452,23 @@
 	 * simply cannot free a large enough contiguous area
 	 * of memory *ever*.
 	 */
-	if ((gfp_mask & (__GFP_WAIT|__GFP_IO)) == (__GFP_WAIT|__GFP_IO)) {
+	if (~gfp_mask & __GFP_HIGH && STRIKE_THREE) {
+		memory_pressure++;
 		wakeup_kswapd(1);
+		goto try_again;
+	} else if (~gfp_mask & __GFP_HIGH && STRIKE_THREE_NOIO) {
+		/*
+		 * If __GFP_IO isn't set, we can't wait on kswapd because
+		 * daemons just might need some IO locks /we/ are holding ...
+		 *
+		 * SUBTLE: The scheduling point above makes sure that
+		 * kswapd does get the chance to free memory we can't
+		 * free ourselves...
+		 */
 		memory_pressure++;
-		if (!order)
-			goto try_again;
-	/*
-	 * If __GFP_IO isn't set, we can't wait on kswapd because
-	 * kswapd just might need some IO locks /we/ are holding ...
-	 *
-	 * SUBTLE: The scheduling point above makes sure that
-	 * kswapd does get the chance to free memory we can't
-	 * free ourselves...
-	 */
-	} else if (gfp_mask & __GFP_WAIT) {
 		try_to_free_pages(gfp_mask);
-		memory_pressure++;
-		if (!order)
-			goto try_again;
+		goto try_again;
 	}
-
 	}
 
 	/*
diff -urN linux-2.4.0-test10.virgin/mm/vmscan.c linux-2.4.0-test10.mike/mm/vmscan.c
--- linux-2.4.0-test10.virgin/mm/vmscan.c	Wed Nov  1 06:42:45 2000
+++ linux-2.4.0-test10.mike/mm/vmscan.c	Fri Nov  3 15:20:32 2000
@@ -562,6 +562,7 @@
  * go out to Matthew Dillon.
  */
 #define MAX_LAUNDER 		(4 * (1 << page_cluster))
+extern int throttle_IO;
 int page_launder(int gfp_mask, int sync)
 {
 	int launder_loop, maxscan, cleaned_pages, maxlaunder;
@@ -573,7 +574,7 @@
 	 * We can only grab the IO locks (eg. for flushing dirty
 	 * buffers to disk) if __GFP_IO is set.
 	 */
-	can_get_io_locks = gfp_mask & __GFP_IO;
+	can_get_io_locks = gfp_mask & __GFP_IO && !throttle_IO;
 
 	launder_loop = 0;
 	maxlaunder = 0;
@@ -1050,13 +1051,23 @@
 	for (;;) {
 		static int recalc = 0;
 
+		/* Once a second, recalculate some VM stats. */
+		if (time_after(jiffies, recalc + HZ)) {
+			recalc = jiffies;
+			recalculate_vm_stats();
+		}
+
 		/* If needed, try to free some memory. */
 		if (inactive_shortage() || free_shortage()) {
 			int wait = 0;
 			/* Do we need to do some synchronous flushing? */
 			if (waitqueue_active(&kswapd_done))
 				wait = 1;
+#if 0 /* Undo this and watch allocations fail under heavy stress */
 			do_try_to_free_pages(GFP_KSWAPD, wait);
+#else
+			do_try_to_free_pages(GFP_KSWAPD, 0);
+#endif
 		}
 
 		/*
@@ -1067,12 +1078,6 @@
 		 */
 		refill_inactive_scan(6, 0);
 
-		/* Once a second, recalculate some VM stats. */
-		if (time_after(jiffies, recalc + HZ)) {
-			recalc = jiffies;
-			recalculate_vm_stats();
-		}
-
 		/*
 		 * Wake up everybody waiting for free memory
 		 * and unplug the disk queue.
@@ -1112,7 +1117,7 @@
 {
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (current == kswapd_task)
+	if (current->flags & PF_MEMALLOC)
 		return;
 
 	if (!block) {
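The throttling idea in the patch (stop flushing when the WRITE free list is empty, record that in throttle_IO, and have page_launder() refuse IO locks while it is set) can be modelled in user space. The sketch below is a simplified model under stated assumptions, not the kernel code: `flush_model_queue`, `model_flush_dirty`, `model_throttle_io` and `model_can_get_io_locks` are hypothetical names, and the free list is reduced to a counter.

```c
/* User-space model only; the names below do not exist in the kernel. */
struct flush_model_queue {
    int free_write_requests;    /* stands in for q->request_freelist[WRITE] */
};

static int model_throttle_io;   /* stands in for the patch's throttle_IO */

/*
 * Flush up to `limit` of `dirty` buffers.  Each flushed buffer consumes
 * one free write request.  When none are left, mark the queue as
 * saturated and stop, as the patched flush_dirty_buffers() does.
 */
static int model_flush_dirty(struct flush_model_queue *q, int dirty, int limit)
{
    int flushed = 0;

    while (flushed < dirty && flushed < limit) {
        if (q->free_write_requests == 0) {
            model_throttle_io = 1;
            break;
        }
        model_throttle_io = 0;
        q->free_write_requests--;
        flushed++;
    }
    return flushed;
}

/* Mirrors: can_get_io_locks = gfp_mask & __GFP_IO && !throttle_IO; */
static int model_can_get_io_locks(int gfp_io)
{
    return gfp_io && !model_throttle_io;
}
```

With three free write requests and ten dirty buffers, the model flushes three, sets the throttle flag, and from then on denies IO locks to the laundering path until a later flush finds a free request again.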
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-03 15:45           ` Mike Galbraith
@ 2000-11-03 19:38             ` Jens Axboe
  2000-11-04  5:43               ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2000-11-03 19:38 UTC
To: Mike Galbraith; +Cc: linux-kernel, Rik van Riel, Tigran Aivazian

[-- Attachment #1: Type: text/plain, Size: 1053 bytes --]

On Fri, Nov 03 2000, Mike Galbraith wrote:
> > I very much agree.  Kflushd is still hungry for free write
> > bandwidth here.
>
> In the LKML tradition of code talks and silly opinions walk...
>
> Attached is a diagnostic patch which gets kflushd under control,
> and takes make -j30 bzImage build times down from 12 minutes to
> 9 here.  I have no more massive context switching on write, and
> copies seem to go a lot quicker to boot.  (that may be because
> some of my failures were really _really_ horrible)
>
> Comments are very welcome.  I haven't had problems with this yet,
> but it's early so...  This patch isn't supposed to be pretty either
> (hw techs don't do pretty;) it's only supposed to say 'Houston...'
> so be sure to grab a barfbag before you take a look.

Super, looks pretty good from here. I'll give it a go when I get back.
In addition, here's a small patch that disables the read stealing
of requests from the write list -- does that improve behaviour
when we are busy flushing?

--
* Jens Axboe <axboe@suse.de>
* SuSE Labs

[-- Attachment #2: read_steal.diff --]
[-- Type: text/plain, Size: 998 bytes --]

--- drivers/block/ll_rw_blk.c~	Fri Nov  3 03:22:25 2000
+++ drivers/block/ll_rw_blk.c	Fri Nov  3 03:23:24 2000
@@ -455,35 +455,17 @@
 	struct list_head *list = &q->request_freelist[rw];
 	struct request *rq;
 
-	/*
-	 * Reads get preferential treatment and are allowed to steal
-	 * from the write free list if necessary.
-	 */
 	if (!list_empty(list)) {
 		rq = blkdev_free_rq(list);
-		goto got_rq;
-	}
-
-	/*
-	 * if the WRITE list is non-empty, we know that rw is READ
-	 * and that the READ list is empty. allow reads to 'steal'
-	 * from the WRITE list.
-	 */
-	if (!list_empty(&q->request_freelist[WRITE])) {
-		list = &q->request_freelist[WRITE];
-		rq = blkdev_free_rq(list);
-		goto got_rq;
+		list_del(&rq->table);
+		rq->free_list = list;
+		rq->rq_status = RQ_ACTIVE;
+		rq->special = NULL;
+		rq->q = q;
+		return rq;
 	}
 
 	return NULL;
-
-got_rq:
-	list_del(&rq->table);
-	rq->free_list = list;
-	rq->rq_status = RQ_ACTIVE;
-	rq->special = NULL;
-	rq->q = q;
-	return rq;
 }
 
 /*
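What the read_steal.diff changes can be seen in a small user-space model of the two free lists. This is a sketch under stated assumptions, with the lists reduced to counters and all names (`steal_model_queue`, `model_get_request`, `allow_read_steal`) hypothetical. With `allow_read_steal` set it behaves like the code the patch removes; with it cleared it behaves like the patched function.

```c
/* User-space model only; the names below do not exist in the kernel. */
enum model_rw { MODEL_READ = 0, MODEL_WRITE = 1 };

struct steal_model_queue {
    int free_requests[2];   /* free request count per direction */
};

/*
 * Take a free request for direction `rw`.
 * Returns the list it came from (MODEL_READ or MODEL_WRITE),
 * or -1 if no request could be obtained.
 */
static int model_get_request(struct steal_model_queue *q, enum model_rw rw,
                             int allow_read_steal)
{
    if (q->free_requests[rw] > 0) {
        q->free_requests[rw]--;
        return rw;
    }

    /* Old behaviour: a READ with an empty READ list steals from WRITE. */
    if (allow_read_steal && rw == MODEL_READ &&
        q->free_requests[MODEL_WRITE] > 0) {
        q->free_requests[MODEL_WRITE]--;
        return MODEL_WRITE;
    }

    return -1;
}
```

In the model, disabling the steal means a burst of reads can no longer drain the write requests a flusher is waiting for, at the cost of reads failing to get a request sooner.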
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-03 19:38             ` Jens Axboe
@ 2000-11-04  5:43               ` Mike Galbraith
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-04  5:43 UTC
To: Jens Axboe; +Cc: linux-kernel, Rik van Riel, Tigran Aivazian

On Fri, 3 Nov 2000, Jens Axboe wrote:

> On Fri, Nov 03 2000, Mike Galbraith wrote:
> > > I very much agree.  Kflushd is still hungry for free write
> > > bandwidth here.
> >
> > In the LKML tradition of code talks and silly opinions walk...
> >
> > Attached is a diagnostic patch which gets kflushd under control,
> > and takes make -j30 bzImage build times down from 12 minutes to
> > 9 here.  I have no more massive context switching on write, and
> > copies seem to go a lot quicker to boot.  (that may be because
> > some of my failures were really _really_ horrible)
> >
> > Comments are very welcome.  I haven't had problems with this yet,
> > but it's early so...  This patch isn't supposed to be pretty either
> > (hw techs don't do pretty;) it's only supposed to say 'Houston...'
> > so be sure to grab a barfbag before you take a look.
>
> Super, looks pretty good from here. I'll give it a go when I get back.
> In addition, here's a small patch that disables the read stealing
> of requests from the write list -- does that improve behaviour
> when we are busy flushing?

Yes.  I've done this a bit differently here, and have had good
results.  I only disable stealing when I need flush throughput.

Now that the box isn't biting off more than it can chew quite
as often, I'll try this again.  I'm pretty darn sure that I can
get more throughput, but :> I've learned that getting too much
can do really OOGLY things.  (turns box into single user single
tasking streaming IO monster from hell)

	-Mike
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-01 18:48 ` Rik van Riel
  2000-11-02 7:19 ` Mike Galbraith
@ 2000-11-02 8:40 ` Christoph Rohland
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Rohland @ 2000-11-02 8:40 UTC
To: Rik van Riel; +Cc: David Mansfield, lkml

Hi Rik,

I can probably give some more datapoints. Here is the console output
of my test machine (there is a 'vmstat 5' running in background):

[root@ls3016 /root]# killall shmtst
[root@ls3016 /root]#
 1 12  2    0 1607668 18932 2110496  0  0 67154 1115842 1050063 2029389  0  2  98
 0 10  2    0 1607564 18932 2110496  0  0     0     300     317     426  0  0 100
 0 10  2    0 1607408 18932 2110496  0  0     0     301     336     473  0  0 100
 0 10  2    0 1607560 18932 2110508  0  0     0     307     318     430  0  0 100
 0 10  2    0 1607556 18932 2110512  0  0     0     304     324     433  0  0 100
 0 10  2    0 1607528 18932 2110512  0  0     0     272     308     410  0  1  99
 0 10  2    0 1607440 18932 2110516  0  0     0     315     323     438  0  1  99
 0 10  2    0 1607528 18932 2110516  0  0     0     323     316     424  0  0 100
 0 10  2    0 1607556 18932 2110516  0  0     0     304     309     410  0  0 100
 0 10  2    0 1607600 18932 2110528  0  0     0     298     314     418  0  0 100
 0 10  2    0 1607384 18932 2110528  0  0     0     296     307     406  0  1  99
 0 10  2    0 1607284 18932 2110528  0  0     0     304     315     421  0  0 100
 0 10  2    0 1607668 18932 2110528  0  0     0     298     304     402  0  0 100
 0 10  2    0 1607576 18932 2110528  0  0     0     285     307     405  0  0 100
 0 10  2    0 1607656 18932 2110528  0  0     0     292     303     399  0  1  99
 0 10  2    0 1607928 18932 2110528  0  0     0     313     310     408  0  0 100
   procs                  memory  swap            io          system      cpu
 r  b  w swpd    free  buff   cache si so    bi      bo      in      cs us sy  id
 0 10  2    0 1608440 18932 2110528  0  0     0     340     313     417  0  1  99
 0 10  2    0 1608260 18932 2110528  0  0     0     298     318     426  0  0 100
 0 10  2    0 1608208 18932 2110528  0  0     0     314     334     448  0  1  99
 0 10  2    0 1608396 18932 2110528  0  0     0     323     316     421  0  1  99
 0 10  2    0 1608204 18932 2110548  0  0     0     334     333     458  0  0 100
 0 10  2    0 1607888 18932 2110580  0  0     0     336     329     448  0  1  99
 0 10  2    0 1608040 18932 2110584  0  0     0     317     321     435  0  0 100
 0 10  2    0 1608032 18932 2110588  0  0     0     241     318     425  0  0 100
 0 10  2    0 1608028 18932 2110592  0  0     0     257     325     443  0  1  99
 0 10  3    0 1608028 18932 2110592  0  0     0     258     323     435  0  0  99
 0 10  2    0 1608032 18932 2110592  0  0     0     241     316     425  0  0 100
 0 10  2    0 1608024 18932 2110592  0  0     0     261     337     460  0  0 100
 0 10  2    0 1608016 18932 2110592  0  0     0     253     328     444  0  0 100
 0 10  2    0 1608024 18932 2110592  0  0     0     252     320     435  0  0 100
 0 10  2    0 1608012 18932 2110592  0  0     0     255     326     446  0  0 100
 0 10  2    0 1608020 18932 2110592  0  0     0     255     326     444  0  1  99
 0 10  2    0 1608012 18932 2110600  0  0     0     261     341     469  0  0 100
 0 10  2    0 1607992 18932 2110608  0  0     0     261     344     479  0  0 100
 0 10  2    0 1607992 18932 2110612  0  0     0     264     342     471  0  0 100
 0 10  2    0 1607984 18932 2110612  0  0     0     266     334     462  0  0 100
 0 10  2    0 1607980 18932 2110620  0  0     0     273     340     468  0  0  99
   procs                  memory  swap            io          system      cpu
 r  b  w swpd    free  buff   cache si so    bi      bo      in      cs us sy  id
 0 10  2    0 1607972 18932 2110624  0  0     0     266     345     474  0  1  99
 0 10  2    0 1607940 18932 2110640  0  0     0     256     341     462  0  0 100
 0 10  2    0 1607936 18932 2110644  0  0     0     262     339     462  0  1  99
 0 10  2    0 1607940 18932 2110644  0  0     0     261     333     450  0  1  99
 0 10  2    0 1607944 18932 2110644  0  0     0     253     335     454  0  0 100
 0 10  2    0 1607944 18932 2110644  0  0     0     272     352     479  0  1  99

[root@ls3016 /root]# ps l
  F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY      TIME COMMAND
100     0   820     1   9   0  2200 1168 wait4  S    ttyS0    0:00 login -- ro
100     0   862   820  14   0  1756  976 wait4  S    ttyS0    0:00 -bash
000     0   878   862   9   0  1080  360 down   D    ttyS0   11:27 ./shmtst 10
000     0   879   862   9   0  1080  360 down   D    ttyS0   15:21 ./shmtst 15
040     0   880   878   9   0  1092  416 wait_o D    ttyS0    8:55 ./shmtst 10
040     0   881   878   9   0  1080  360 down   D    ttyS0   10:22 ./shmtst 10
444     0   882   878   9   0     0    0 do_exi Z    ttyS0   10:00 [shmtst <de
040     0   883   878   9   0  1092  416 wait_o D    ttyS0    9:30 ./shmtst 10
040     0   884   878   9   0  1092  416 down   D    ttyS0    8:44 ./shmtst 10
040     0   885   878   9   0  1092  416 down   D    ttyS0    9:01 ./shmtst 10
444     0   886   878   9   0     0    0 do_exi Z    ttyS0    7:59 [shmtst <de
444     0   887   879   9   0     0    0 do_exi Z    ttyS0   17:11 [shmtst <de
040     0   888   878   9   0  1080  360 down   D    ttyS0   10:21 ./shmtst 10
040     0   889   878   9   0  1092  416 down   D    ttyS0    9:06 ./shmtst 10
000     0   891   862   9   0  1136  488 nanosl S    ttyS0    0:23 vmstat 5
000     0  1226   862  19   0  2756 1084 -      R    ttyS0    0:00 ps l
[root@ls3016 /root]#
 0 10  2    0 1607936 18932 2110652  0  0     0     275     368     488  0  0  99
 0 10  2    0 1607912 18932 2110660  0  0     0     266     334     457  0  0 100
 0 10  2    0 1607848 18932 2110672  0  0     0     302     354     498  0  0 100
 0 10  2    0 1607892 18932 2110688  0  0     0     287     352     496  0  0 100
 0 11  2    0 1607868 18932 2110704  0  0     1     282     338     472  0  1  99

So the processes don't finish exiting for at least 47*5 sec. They have
shared-mmapped a plain file some 666000000 bytes long on an 8GB machine.

The rest of the machine behaves nicely.

Greetings
		Christoph
[parent not found: <Pine.Linu.4.10.10011091452270.747-100000@mikeg.weiden.de>]
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  [not found] <Pine.Linu.4.10.10011091452270.747-100000@mikeg.weiden.de>
@ 2000-11-09 18:31 ` Linus Torvalds
  2000-11-10 7:34 ` Mike Galbraith
  2000-11-10 21:42 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-09 18:31 UTC
To: Mike Galbraith
Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox

As to the real reason for stalls on /proc/<pid>/stat, I bet it has nothing
to do with IO except indirectly (the IO is necessary to trigger the
problem, but the _reason_ for the problem lies elsewhere).

And it has everything to do with the fact that the way Linux semaphores
are implemented, a non-blocking process has a HUGE advantage over a
blocking one. Linux kernel semaphores are extremely unfair in that way.

What happens is that some process is getting a lot of VM faults and gets
its VM semaphore. No contention yet. It holds the semaphore over the
IO, and now another process does a "ps".

The "ps" process goes to sleep on the semaphore. So far so good.

The original process releases the semaphore, which increments the count,
and wakes up the process waiting for it. Note that it _wakes_ it, it does
not give the semaphore to it. Big difference.

The process that got woken up will run eventually. Probably not all that
immediately, because the process that woke it (and held the semaphore)
just slept on a page fault too, so it's not likely to immediately
relinquish the CPU.

The original running process comes back faulting again, finds the
semaphore still unlocked (the "ps" process is awake but has not gotten to
run yet), gets the semaphore, and falls asleep on the IO for the next
page.

The "ps" process actually gets to run now, but it's a bit late. The
semaphore is locked again.

Repeat until luck breaks the bad circle.

(This scenario, btw, is much harder to trigger on SMP than on UP. And
it's completely separate from the issue of simple disk bandwidth issues
which can obviously cause no end of stalls on anything that needs the
disk, and which can also happen on SMP).

NOTE! If somebody wants to fix this, the fix should be reasonably simple
but needs to be quite exhaustively checked and double-checked. It's just
too easy to break the semaphores by mistake.

The way to make semaphores more fair is to NOT allow a new process to just
come in immediately and steal the semaphore in __down() if there are other
sleepers. This is most easily accomplished by something along the lines of
the following in __down() in arch/i386/kernel/semaphore.c

 	spin_lock_irq(&semaphore_lock);
 	sem->sleepers++;
+
+	/*
+	 * Are there other people waiting for this?
+	 * They get to go first.
+	 */
+	if (sem->sleepers > 1)
+		goto inside;
 	for (;;) {
 		int sleepers = sem->sleepers;

 		/*
 		 * Add "everybody else" into it. They aren't
 		 * playing, because we own the spinlock.
 		 */
 		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
 			sem->sleepers = 0;
 			break;
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
+inside:
 		spin_unlock_irq(&semaphore_lock);
 		schedule();
 		tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
 		spin_lock_irq(&semaphore_lock);
 	}
 	spin_unlock_irq(&semaphore_lock);

But note that the above is UNTESTED and also note that from a throughput
(as opposed to latency) standpoint being unfair tends to be nice.

Anybody want to try out something like the above? (And no, I'm not
applying it to my tree yet. It needs about a hundred pairs of eyes to
verify that there isn't some subtle "lost wakeup" race somewhere).

		Linus
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-09 18:31 ` Linus Torvalds
@ 2000-11-10 7:34 ` Mike Galbraith
  2000-11-10 10:47 ` Mike Galbraith
  2000-11-10 17:07 ` Linus Torvalds
  2000-11-10 21:42 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
  1 sibling, 2 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-10 7:34 UTC
To: Linus Torvalds
Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox

On Thu, 9 Nov 2000, Linus Torvalds wrote:

>
>
> As to the real reason for stalls on /proc/<pid>/stat, I bet it has nothing
> to do with IO except indirectly (the IO is necessary to trigger the
> problem, but the _reason_ for the problem lies elsewhere).
>
> And it has everything to do with the fact that the way Linux semaphores
> are implemented, a non-blocking process has a HUGE advantage over a
> blocking one. Linux kernel semaphores are extremely unfair in that way.
>
> What happens is that some process is getting a lot of VM faults and gets
> its VM semaphore. No contention yet. It holds the semaphore over the
> IO, and now another process does a "ps".
>
> The "ps" process goes to sleep on the semaphore. So far so good.
>
> The original process releases the semaphore, which increments the count,
> and wakes up the process waiting for it. Note that it _wakes_ it, it does
> not give the semaphore to it. Big difference.
>
> The process that got woken up will run eventually. Probably not all that
> immediately, because the process that woke it (and held the semaphore)
> just slept on a page fault too, so it's not likely to immediately
> relinquish the CPU.
>
> The original running process comes back faulting again, finds the
> semaphore still unlocked (the "ps" process is awake but has not gotten to
> run yet), gets the semaphore, and falls asleep on the IO for the next
> page.
>
> The "ps" process actually gets to run now, but it's a bit late. The
> semaphore is locked again.
>
> Repeat until luck breaks the bad circle.
>
> (This scenario, btw, is much harder to trigger on SMP than on UP. And
> it's completely separate from the issue of simple disk bandwidth issues
> which can obviously cause no end of stalls on anything that needs the
> disk, and which can also happen on SMP).

Unfortunately, it didn't help in the scenario I'm running.

time make -j30 bzImage:

real    14m19.987s  (within stock variance)
user    6m24.480s
sys     1m12.970s

   procs                      memory          swap            io        system         cpu
 r  b  w   swpd   free   buff  cache     si     so     bi     bo     in     cs  us  sy  id
31  2  1     12   1432   4440  12660      0     12     27    151    202    848  89  11   0
34  4  1   1908   2584    536   5376    248   1904    602    763    785   4094  63  32   5
13 19  1  64140  67728    604  33784 106500  84612  43625  21683  19080  52168  28  22  50

I understood the above well enough to be very interested in seeing what
happens with flush IO restricted.

	-Mike

[try_to_free_pages()->swap_out()/shm_swap().. can fight over who gets
to shrink the best candidate's footprint?]

Thanks!
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-10 7:34 ` Mike Galbraith
@ 2000-11-10 10:47 ` Mike Galbraith
  2000-11-10 17:07 ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2000-11-10 10:47 UTC
To: Linus Torvalds
Cc: Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox

> I understood the above well enough to be very interested in seeing what
> happens with flush IO restricted.
>
> -Mike
>
> [try_to_free_pages()->swap_out()/shm_swap().. can fight over who gets
> to shrink the best candidate's footprint?]
>
> Thanks!

The results:

pre2+semaphore
real    14m19.987s
user    6m24.480s
sys     1m12.970s

pre2+semaphore+throttle_IO
real    10m13.953s
user    6m19.980s
sys     0m28.960s

pre2+semaphore+throttle_IO extended to refill_inactive()
real    9m46.395s
user    6m23.510s
sys     0m29.420s

pre2+semaphore+throttle_IO + above + tiny little tweak to page_launder()
real    8m56.808s
user    6m23.420s
sys     0m29.430s

Unfortunately, when I try to get past this point I burn trees :-)

	-Mike
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10
  2000-11-10 7:34 ` Mike Galbraith
  2000-11-10 10:47 ` Mike Galbraith
@ 2000-11-10 17:07 ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-10 17:07 UTC
To: linux-kernel

In article <Pine.Linu.4.10.10011100732250.601-100000@mikeg.weiden.de>,
Mike Galbraith <mikeg@wen-online.de> wrote:
>>
>> (This scenario, btw, is much harder to trigger on SMP than on UP. And
>> it's completely separate from the issue of simple disk bandwidth issues
>> which can obviously cause no end of stalls on anything that needs the
>> disk, and which can also happen on SMP).
>
>Unfortunately, it didn't help in the scenario I'm running.
>
>time make -j30 bzImage:
>
>real    14m19.987s  (within stock variance)
>user    6m24.480s
>sys     1m12.970s

Note that the above kind of "throughput performance" should not have been
affected, and was not what I was worried about.

>   procs                      memory          swap            io        system         cpu
> r  b  w   swpd   free   buff  cache     si     so     bi     bo     in     cs  us  sy  id
>31  2  1     12   1432   4440  12660      0     12     27    151    202    848  89  11   0
>34  4  1   1908   2584    536   5376    248   1904    602    763    785   4094  63  32   5
>13 19  1  64140  67728    604  33784 106500  84612  43625  21683  19080  52168  28  22  50

Looks like there was a big delay in vmstat there - that could easily be
due to simple disk throughput issues..

Does it feel any different under the original load that got the original
complaint? The patch may have just been buggy and ineffective, for all I
know.

		Linus
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10
  2000-11-09 18:31 ` Linus Torvalds
  2000-11-10 7:34 ` Mike Galbraith
@ 2000-11-10 21:42 ` David Mansfield
  2000-11-11 6:20 ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: David Mansfield @ 2000-11-10 21:42 UTC
To: Linus Torvalds
Cc: Mike Galbraith, Jens Axboe, MOLNAR Ingo, Rik van Riel, Kernel Mailing List, Alan Cox

Linus Torvalds wrote:
...
>
> And it has everything to do with the fact that the way Linux semaphores
> are implemented, a non-blocking process has a HUGE advantage over a
> blocking one. Linux kernel semaphores are extremely unfair in that way.
>
...
> The original running process comes back faulting again, finds the
> semaphore still unlocked (the "ps" process is awake but has not gotten to
> run yet), gets the semaphore, and falls asleep on the IO for the next
> page.
>
> The "ps" process actually gets to run now, but it's a bit late. The
> semaphore is locked again.
>
> Repeat until luck breaks the bad circle.
>

But doesn't __down have a fast path coded in assembly? In other words,
it only hits your patched code if there is already contention, which
there isn't in this case, and therefore the bug...?

David Mansfield
* Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10
  2000-11-10 21:42 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
@ 2000-11-11 6:20 ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-11 6:20 UTC
To: linux-kernel

In article <3A0C6BD6.A8F73950@dm.ultramaster.com>,
David Mansfield <lkml@dm.ultramaster.com> wrote:
>Linus Torvalds wrote:
>...
>>
>> And it has everything to do with the fact that the way Linux semaphores
>> are implemented, a non-blocking process has a HUGE advantage over a
>> blocking one. Linux kernel semaphores are extremely unfair in that way.
>>
>...
>> The original running process comes back faulting again, finds the
>> semaphore still unlocked (the "ps" process is awake but has not gotten to
>> run yet), gets the semaphore, and falls asleep on the IO for the next
>> page.
>>
>> The "ps" process actually gets to run now, but it's a bit late. The
>> semaphore is locked again.
>>
>> Repeat until luck breaks the bad circle.
>>
>
>But doesn't __down have a fast path coded in assembly? In other words,
>it only hits your patched code if there is already contention, which
>there isn't in this case, and therefore the bug...?

The __down() case should be hit if there's a waiter, even if that waiter
has not yet been able to pick up the lock (the waiter _will_ have
decremented the count to negative in order to trigger the proper logic at
release time).

But as I mentioned, the pseudo-patch was certainly untested, so somebody
should probably walk through the cases to check that I didn't miss
something.

		Linus
end of thread, other threads:[~2000-11-11 6:21 UTC | newest]

Thread overview: 16+ messages
2000-11-01 18:38 [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10 David Mansfield
2000-11-01 18:48 ` Rik van Riel
2000-11-02 7:19 ` Mike Galbraith
2000-11-02 21:59 ` Val Henson
2000-11-03 1:37 ` Jens Axboe
2000-11-03 5:56 ` Mike Galbraith
2000-11-03 15:45 ` Mike Galbraith
2000-11-03 19:38 ` Jens Axboe
2000-11-04 5:43 ` Mike Galbraith
2000-11-02 8:40 ` Christoph Rohland
[not found] <Pine.Linu.4.10.10011091452270.747-100000@mikeg.weiden.de>
2000-11-09 18:31 ` Linus Torvalds
2000-11-10 7:34 ` Mike Galbraith
2000-11-10 10:47 ` Mike Galbraith
2000-11-10 17:07 ` Linus Torvalds
2000-11-10 21:42 ` [BUG] /proc/<pid>/stat access stalls badly for swapping process,2.4.0-test10 David Mansfield
2000-11-11 6:20 ` Linus Torvalds