* Re: 2.4.16 & OOM killer screw up (fwd)
@ 2001-12-10 19:08 Marcelo Tosatti
2001-12-10 20:47 ` Andrew Morton
2001-12-11 0:43 ` Andrea Arcangeli
0 siblings, 2 replies; 43+ messages in thread
From: Marcelo Tosatti @ 2001-12-10 19:08 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: lkml
Andrea,
Could you please start looking at any 2.4 VM issues which show up?
Just please make sure that when you send a fix for something, you send me
_one_ problem and a patch which fixes _that_ problem.
I'm tempted to look at the VM myself, but I think I'll spend my limited
time better if I review other people's work instead.
---------- Forwarded message ----------
Date: Mon, 10 Dec 2001 16:46:02 -0200 (BRST)
From: Marcelo Tosatti <marcelo@conectiva.com.br>
To: Abraham vd Merwe <abraham@2d3d.co.za>
Cc: Linux Kernel Development <linux-kernel@vger.kernel.org>
Subject: Re: 2.4.16 & OOM killer screw up
On Mon, 10 Dec 2001, Abraham vd Merwe wrote:
> Hi!
>
> If I leave my machine on for a day or two without doing anything on it
> (e.g. my machine at work over a weekend) and I come back, then 1) all my
> memory is used for buffers/caches, and when I try running an application,
> the OOM killer kicks in, tries to allocate swap space (which I don't
> have) and kills whatever I try to start (that's with 300M+ of memory in
> buffers/caches).
Abraham,
I'll take a look at this issue as soon as pre8 is released.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-10 19:08 2.4.16 & OOM killer screw up (fwd) Marcelo Tosatti
@ 2001-12-10 20:47 ` Andrew Morton
  2001-12-10 19:42   ` Marcelo Tosatti
  2001-12-11  0:11   ` Andrea Arcangeli
  2001-12-11  0:43 ` Andrea Arcangeli
  1 sibling, 2 replies; 43+ messages in thread
From: Andrew Morton @ 2001-12-10 20:47 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrea Arcangeli, lkml

Marcelo Tosatti wrote:
>
> Andrea,
>
> Could you please start looking at any 2.4 VM issues which show up?
>

Just fwiw, I did some testing on this yesterday.

Buffers and cache data are sitting on the active list, and shrink_caches()
is *not* getting them off the active list and onto the inactive list
where they can be freed.

So we end up with enormous amounts of anon memory on the inactive
list, so this code:

	/* try to keep the active list 2/3 of the size of the cache */
	ratio = (unsigned long) nr_pages * nr_active_pages /
				((nr_inactive_pages + 1) * 2);
	refill_inactive(ratio);

just calls refill_inactive(0) all the time.  Nothing gets moved
onto the inactive list - it remains full of unfreeable anon
allocations.  And with no swap, there's nowhere to go.

I think a little fix is to add

	if (ratio < nr_pages)
		ratio = nr_pages;

so we at least move *something* onto the inactive list.

Also, refill_inactive needs to be changed so that it counts the number
of pages which it actually moved, rather than the number of pages which
it inspected.

In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
So we're madly trying to swap pages out and finding that there's no swap
space.  I believe that when we find there's no swap left we should move
the page onto the active list so we don't keep rescanning it pointlessly.

A fix may be to just remove the use-once stuff.  It is one of the
sources of this problem, because it's overpopulating the inactive list.

In my testing last night, I tried to allocate 650 megs on a 768 meg
swapless box.  Got oom-killed when there was almost 100 megs of freeable
memory: half buffercache, half filecache.  Presumably, all of it was
stuck on the active list with no way to get off.

We also need to do something about shrink_[di]cache_memory(), which
seem to be called in the wrong place.

There's also the report concerning modify_ldt() failure in a similar
situation.  I'm not sure why this one occurred.  It vmallocs 64k of
memory and that seems to fail.

I did some similar testing a week or so ago, and also tested the -aa
patches.  They seemed to maybe help a tiny bit, but not significantly.

-

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-10 20:47 ` Andrew Morton
@ 2001-12-10 19:42   ` Marcelo Tosatti
  2001-12-11  0:11   ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Marcelo Tosatti @ 2001-12-10 19:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, lkml

On Mon, 10 Dec 2001, Andrew Morton wrote:

> Marcelo Tosatti wrote:
> >
> > Andrea,
> >
> > Could you please start looking at any 2.4 VM issues which show up?
> >
>
> Just fwiw, I did some testing on this yesterday.
>
> Buffers and cache data are sitting on the active list, and shrink_caches()
> is *not* getting them off the active list and onto the inactive list
> where they can be freed.
>
> So we end up with enormous amounts of anon memory on the inactive
> list, so this code:
>
>	/* try to keep the active list 2/3 of the size of the cache */
>	ratio = (unsigned long) nr_pages * nr_active_pages /
>				((nr_inactive_pages + 1) * 2);
>	refill_inactive(ratio);
>
> just calls refill_inactive(0) all the time.  Nothing gets moved
> onto the inactive list - it remains full of unfreeable anon
> allocations.  And with no swap, there's nowhere to go.
>
> I think a little fix is to add
>
>	if (ratio < nr_pages)
>		ratio = nr_pages;
>
> so we at least move *something* onto the inactive list.
>
> Also refill_inactive needs to be changed so that it counts
> the number of pages which it actually moved, rather than
> the number of pages which it inspected.
>
> In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> So we're madly trying to swap pages out and finding that there's no swap
> space.  I believe that when we find there's no swap left we should move
> the page onto the active list so we don't keep rescanning it pointlessly.
>
> A fix may be to just remove the use-once stuff.  It is one of the
> sources of this problem, because it's overpopulating the inactive list.
>
> In my testing last night, I tried to allocate 650 megs on a 768 meg
> swapless box.  Got oom-killed when there was almost 100 megs of freeable
> memory: half buffercache, half filecache.  Presumably, all of it was
> stuck on the active list with no way to get off.
>
> We also need to do something about shrink_[di]cache_memory(),
> which seem to be called in the wrong place.
>
> There's also the report concerning modify_ldt() failure in a
> similar situation.  I'm not sure why this one occurred.  It
> vmallocs 64k of memory and that seems to fail.

I haven't applied the modify_ldt() patch because I want to make sure it's
needed: it may just be a bad effect of this one bug.

> I did some similar testing a week or so ago, also tested
> the -aa patches.  They seemed to maybe help a tiny bit,
> but not significantly.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-10 20:47 ` Andrew Morton
  2001-12-10 19:42   ` Marcelo Tosatti
@ 2001-12-11  0:11   ` Andrea Arcangeli
  2001-12-11  7:07     ` Andrew Morton
  1 sibling, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11 0:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Marcelo Tosatti, lkml

On Mon, Dec 10, 2001 at 12:47:55PM -0800, Andrew Morton wrote:
> Marcelo Tosatti wrote:
> >
> > Andrea,
> >
> > Could you please start looking at any 2.4 VM issues which show up?
> >
>
> Just fwiw, I did some testing on this yesterday.
>
> Buffers and cache data are sitting on the active list, and shrink_caches()
> is *not* getting them off the active list and onto the inactive list
> where they can be freed.

please check 2.4.17pre4aa1, see the per-classzone info; it will prevent
all the problems with refill_inactive with highmem.

> So we end up with enormous amounts of anon memory on the inactive
> list, so this code:
>
>	/* try to keep the active list 2/3 of the size of the cache */
>	ratio = (unsigned long) nr_pages * nr_active_pages /
>				((nr_inactive_pages + 1) * 2);
>	refill_inactive(ratio);
>
> just calls refill_inactive(0) all the time.  Nothing gets moved
> onto the inactive list - it remains full of unfreeable anon
> allocations.  And with no swap, there's nowhere to go.
>
> I think a little fix is to add
>
>	if (ratio < nr_pages)
>		ratio = nr_pages;
>
> so we at least move *something* onto the inactive list.
>
> Also refill_inactive needs to be changed so that it counts
> the number of pages which it actually moved, rather than
> the number of pages which it inspected.

done ages ago here.

> In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> So we're madly trying to swap pages out and finding that there's no swap
> space.  I believe that when we find there's no swap left we should move
> the page onto the active list so we don't keep rescanning it pointlessly.

yes, however I think the swap-flood with no swap isn't a very
interesting case to optimize.

> A fix may be to just remove the use-once stuff.  It is one of the
> sources of this problem, because it's overpopulating the inactive list.
>
> In my testing last night, I tried to allocate 650 megs on a 768 meg
> swapless box.  Got oom-killed when there was almost 100 megs of freeable
> memory: half buffercache, half filecache.  Presumably, all of it was
> stuck on the active list with no way to get off.
>
> We also need to do something about shrink_[di]cache_memory(),
> which seem to be called in the wrong place.
>
> There's also the report concerning modify_ldt() failure in a
> similar situation.  I'm not sure why this one occurred.  It
> vmallocs 64k of memory and that seems to fail.

dunno about this modify_ldt failure.

> I did some similar testing a week or so ago, also tested
> the -aa patches.  They seemed to maybe help a tiny bit,
> but not significantly.

I don't have any pending bug reports.  AFAIK those bugs are only in
mainline.  If you can reproduce with -aa please send me a bug report.

thanks,

Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  0:11   ` Andrea Arcangeli
@ 2001-12-11  7:07     ` Andrew Morton
  2001-12-11 13:32       ` Rik van Riel
  2001-12-11 13:42       ` Andrea Arcangeli
  0 siblings, 2 replies; 43+ messages in thread
From: Andrew Morton @ 2001-12-11 7:07 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Marcelo Tosatti, lkml

Andrea Arcangeli wrote:
>
> On Mon, Dec 10, 2001 at 12:47:55PM -0800, Andrew Morton wrote:
> > Marcelo Tosatti wrote:
> > >
> > > Andrea,
> > >
> > > Could you please start looking at any 2.4 VM issues which show up?
> > >
> >
> > Just fwiw, I did some testing on this yesterday.
> >
> > Buffers and cache data are sitting on the active list, and shrink_caches()
> > is *not* getting them off the active list and onto the inactive list
> > where they can be freed.
>
> please check 2.4.17pre4aa1, see the per-classzone info; it will prevent
> all the problems with refill_inactive with highmem.

This is not highmem-related.

But the latest -aa patch does appear to have fixed this bug.  Stale
memory is no longer being left on the active list, and all buffercache
memory is being reclaimed before the oom-killer kicks in (swapless
case).

Also (and this is in fact the same problem), the patched kernel has
less tendency to push in-use memory out to swap while leaving tens of
megs of old memory on the active list.

This is all good.  Which of your changes has caused this?  Could you
please separate this out into one or more specific patches for the
2.4.17 series?

Why does this code exist at the end of refill_inactive()?

	if (entry != &active_list) {
		list_del(&active_list);
		list_add(&active_list, entry);
	}

This test on a 64 megabyte machine, on ext2:

	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)

On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36
seconds.  This is probably due to the write scheduling changes in
fs/buffer.c.  This chunk especially will, under some conditions, cause
bdflush to madly spin in a loop unplugging all the disk queues:

	@@ -2787,7 +2795,7 @@
	 		spin_lock(&lru_list_lock);
	 		if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
	-			wait_for_some_buffers(NODEV);
	+			run_task_queue(&tq_disk);
	 			interruptible_sleep_on(&bdflush_wait);
	 		}
	 	}

Why did you make this change?

Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap dual
x86:

	-aa:					4 minutes 20 seconds
	2.4.17-pre8				4 minutes 8 seconds
	2.4.17-pre8 plus the below patch:	3 minutes 55 seconds

Now it could be that this performance regression is due to the write
merging mistake in fs/buffer.c.  But with so much unrelated material in
the same patch it's hard to pinpoint the source.

--- linux-2.4.17-pre8/mm/vmscan.c	Thu Nov 22 23:02:59 2001
+++ linux-akpm/mm/vmscan.c	Mon Dec 10 22:34:18 2001
@@ -537,7 +537,7 @@ static void refill_inactive(int nr_pages
 	spin_lock(&pagemap_lru_lock);
 	entry = active_list.prev;
-	while (nr_pages-- && entry != &active_list) {
+	while (nr_pages && entry != &active_list) {
 		struct page * page;

 		page = list_entry(entry, struct page, lru);
@@ -551,6 +551,7 @@
 		del_page_from_active_list(page);
 		add_page_to_inactive_list(page);
 		SetPageReferenced(page);
+		nr_pages--;
 	}
 	spin_unlock(&pagemap_lru_lock);
 }
@@ -561,6 +562,12 @@ static int shrink_caches(zone_t * classz
 	int chunk_size = nr_pages;
 	unsigned long ratio;

+	shrink_dcache_memory(priority, gfp_mask);
+	shrink_icache_memory(priority, gfp_mask);
+#ifdef CONFIG_QUOTA
+	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
+#endif
+
 	nr_pages -= kmem_cache_reap(gfp_mask);
 	if (nr_pages <= 0)
 		return 0;
@@ -568,17 +575,13 @@ static int shrink_caches(zone_t * classz
 	nr_pages = chunk_size;
 	/* try to keep the active list 2/3 of the size of the cache */
 	ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
+	if (ratio == 0)
+		ratio = nr_pages;
 	refill_inactive(ratio);

 	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
 	if (nr_pages <= 0)
 		return 0;
-
-	shrink_dcache_memory(priority, gfp_mask);
-	shrink_icache_memory(priority, gfp_mask);
-#ifdef CONFIG_QUOTA
-	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
-#endif

 	return nr_pages;
 }

> ...
> >
> > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > So we're madly trying to swap pages out and finding that there's no swap
> > space.  I believe that when we find there's no swap left we should move
> > the page onto the active list so we don't keep rescanning it pointlessly.
>
> yes, however I think the swap-flood with no swap isn't a very
> interesting case to optimize.

Running swapless is a valid configuration, and the kernel is doing
great amounts of pointless work.  I would expect a diskless workstation
to suffer from this.  The problem remains in latest -aa.  It would be
useful to find a fix.

> I don't have any pending bug reports.  AFAIK those bugs are only in
> mainline.  If you can reproduce with -aa please send me a bug report.

Bugs which are only fixed in -aa aren't much use to anyone.

The VM code lacks comments, and nobody except yourself understands what
it is supposed to be doing.  That's a bug, don't you think?

-

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  7:07     ` Andrew Morton
@ 2001-12-11 13:32       ` Rik van Riel
  2001-12-11 13:46         ` Andrea Arcangeli
  2001-12-11 13:42       ` Andrea Arcangeli
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2001-12-11 13:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, Marcelo Tosatti, lkml

On Mon, 10 Dec 2001, Andrew Morton wrote:

> This test on a 64 megabyte machine, on ext2:
>
>	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
>
> On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.

> Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> dual x86:
>
>	-aa:					4 minutes 20 seconds
>	2.4.17-pre8				4 minutes 8 seconds
>	2.4.17-pre8 plus the below patch:	3 minutes 55 seconds

Andrea, it seems -aa is not the holy grail VM-wise.  If you want to
merge your good stuff with Marcelo, please do it in the "one patch with
explanation per problem" style Marcelo asked for.

If nothing happens I'll take my chainsaw and remove the whole use-once
stuff just so 2.4 will avoid the worst cases, even if that happens to
remove some of the nice stuff you've been working on.

regards,

Rik
--
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:32       ` Rik van Riel
@ 2001-12-11 13:46         ` Andrea Arcangeli
  2001-12-12  8:44           ` Andrew Morton
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11 13:46 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, Marcelo Tosatti, lkml

On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> On Mon, 10 Dec 2001, Andrew Morton wrote:
>
> > This test on a 64 megabyte machine, on ext2:
> >
> >	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> >
> > On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.
>
> > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > dual x86:
> >
> >	-aa:					4 minutes 20 seconds
> >	2.4.17-pre8				4 minutes 8 seconds
> >	2.4.17-pre8 plus the below patch:	3 minutes 55 seconds
>
> Andrea, it seems -aa is not the holy grail VM-wise.  If you want

It may not be a holy grail in swap benchmarks and floods of writes to
disk, those are minor performance regressions, but I have not a single
bug report related to "stability".  The only thing I got back from
Andrew has been "it runs a little slower" in those two tests, and of
course he didn't even attempt to benchmark the interactive feel that
was the _whole_ point of my buffer.c and elevator changes.

So as far as I'm concerned, 2.4.15aa1 and 2.4.17pre?aa? are just rock
solid and usable in production.  We'll keep doing background
benchmarking and changes that cannot affect stability, but the core
design is finished as far as I can tell.

Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:46         ` Andrea Arcangeli
@ 2001-12-12  8:44           ` Andrew Morton
  2001-12-12  9:21             ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2001-12-12 8:44 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Rik van Riel, Marcelo Tosatti, lkml

Andrea Arcangeli wrote:
>
> On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> > On Mon, 10 Dec 2001, Andrew Morton wrote:
> >
> > > This test on a 64 megabyte machine, on ext2:
> > >
> > >	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> > >
> > > On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.
> >
> > > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > > dual x86:
> > >
> > >	-aa:					4 minutes 20 seconds
> > >	2.4.17-pre8				4 minutes 8 seconds
> > >	2.4.17-pre8 plus the below patch:	3 minutes 55 seconds
> >
> > Andrea, it seems -aa is not the holy grail VM-wise.  If you want
>
> It may not be a holy grail in swap benchmarks and floods of writes to
> disk, those are minor performance regressions, but I have not a single
> bug report related to "stability".

Your patch increases the time to untar a kernel tree by seventy-five
percent.  That's a fairly major minor regression.

> The only thing I got back from Andrew has been "it runs a little
> slower" in those two tests.

The swapstorm I agree is uninteresting.  The slowdown with a heavy
write load impacts a very common usage, and I've told you how to mostly
fix it.  You need to back out the change to bdflush.

> and of course he didn't even attempt to benchmark the interactive
> feel that was the _whole_ point of my buffer.c and elevator changes.

As far as I know, at no point in time have you told anyone that this
was an objective of your latest patch.  So of course I didn't test for
it.

Interactivity is indeed improved.  It has gone from catastrophic to
horrid.

There are four basic tests I use to quantify this, all with 64 megs of
memory:

1: Start a continuous write, and on a different partition, time how
   long it takes to read a 16 megabyte file.

   Here, -aa takes 40 seconds.  Stock 2.4.17-pre8 takes 71 seconds.
   2.4.17-pre8 with the same elevator settings as in -aa takes 40
   seconds.

   Large writes are slowing reads by a factor of 100.

2: Start a continuous write and, from another machine, run

	time ssh -X otherhost xterm -e true

   On -aa this takes 68 seconds.  On 2.4.17-pre8 it takes over three
   minutes.  I got bored and killed it.  The problem can't be fixed on
   2.4.17-pre8 with tuning - it's probably due to the poor page
   replacement - stuff is getting swapped out.  This is a significant
   problem in 2.4.17-pre and we need a fix for it.

3: Run `cp -a linux/ junk'.  Time how long it takes to read a 16 meg
   file.

   There's no appreciable difference between any of the kernels here.
   It varies from 2 seconds to 10, and is generally OK.

4: Run `cp -a linux/ junk'.  time ssh -X otherhost xterm -e true

   Varies between three and five seconds, depending on elvtune
   settings.  No noticeable difference between any kernels.

It's tests 1 and 2 which are interesting, because we perform so very
badly.  And no amount of fiddling with buffer.c or elvtune settings is
going to fix it, because they don't address the core problem.  Which
is: when the elevator can't merge a read it sticks it at the end of the
request queue, behind all the writes.

I'll be submitting a little patch for 2.4.18-pre which allows the user
to tunably promote reads ahead of most of the writes.  It improves
tests 1 and 2 by a factor of eight to twelve.

> So as far as I'm concerned, 2.4.15aa1 and 2.4.17pre?aa? are just rock
> solid and usable in production.

I haven't done much stability testing - without a description of what
the changes are trying to do, I can't test them.  All I could do is
blindly run stress tests, and I'm sure your QA team can do that as well
as I, on bigger boxes.  But I don't doubt that it's stable.  However
Red Hat's QA guys are pretty good at knocking kernels over...

gargh.  Ninety seconds of bash-shared-mapping and I get "end-request:
buffer-list destroyed" against the swap device.  Borked IDE driver.
Seems stable on SCSI.

The -aa VM is still a little prone to tossing out 0-order allocation
failures when there's tons of swap available and when much memory is
freeable by dropping or writing back to shared mappings.  But this
doesn't seem to cause any problems, as long as there's some memory
available for atomic allocations, and I never saw free memory go below
800 kbytes...

> We'll keep doing background benchmarking and changes that cannot
> affect stability, but the core design is finished as far as I can tell.

We'll know when it gets wider testing in the runup to 2.4.18.  The fact
that I found a major (although easily fixed) performance problem in the
first ten minutes indicates that caution is needed, yes?

What's the thinking with the changes to dcache/icache flushing?  A
single d/icache entry can save three seeks, which is _enormous_ value
for just a few hundred bytes of memory.  You appear to be shrinking the
i/dcache by 12% each time you try to swap out or evict 32 pages.  What
this means is that as soon as we start to get a bit short on memory,
the i/dcache vanishes.  And it takes ages to read that stuff back in.
How did you test this?  Without having done (or even devised) any
quantitative testing myself, I have a gut feeling that we need to
preserve the i/dcache (versus file data) much more than this.

Oh.  Maybe the core design (whatever it is :)) is not finished, because
it retains the bone-headed, dumb-to-the-point-of-astonishing misfeature
which Linux VM has always had:

If someone is linearly writing (or reading) a gigabyte file on a 64
megabyte box they *don't* want the VM to evict every last little scrap
of cache on behalf of data which they *obviously* do not want cached.

It's good that -aa VM doesn't summarily dump the i/dcache and plonk
everything you want into swap when this happens.  Progress.

So.  To summarise.

- Your attempt to address read latencies didn't work out, and should
  be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))

- We urgently need a fix for 2.4.17's page replacement problems.

- -aa is good.  Believe it or not, I like it.  The mm/* portions fix
  significant performance problems in our current VM.  I guess we
  should bite the bullet and merge it all in 2.4.18-pre

-

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12  8:44           ` Andrew Morton
@ 2001-12-12  9:21             ` Andrea Arcangeli
  2001-12-12  9:45               ` Rik van Riel
  2001-12-12  9:59               ` Andrew Morton
  0 siblings, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12 9:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, Marcelo Tosatti, lkml

On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> > > On Mon, 10 Dec 2001, Andrew Morton wrote:
> > >
> > > > This test on a 64 megabyte machine, on ext2:
> > > >
> > > >	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> > > >
> > > > On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.
> > >
> > > > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > > > dual x86:
> > > >
> > > >	-aa:					4 minutes 20 seconds
> > > >	2.4.17-pre8				4 minutes 8 seconds
> > > >	2.4.17-pre8 plus the below patch:	3 minutes 55 seconds
> > >
> > > Andrea, it seems -aa is not the holy grail VM-wise.  If you want
> >
> > It may not be a holy grail in swap benchmarks and floods of writes to
> > disk, those are minor performance regressions, but I have not a single
> > bug report related to "stability".
>
> Your patch increases the time to untar a kernel tree by seventy-five
> percent.  That's a fairly major minor regression.
>
> > The only thing I got back from Andrew has been "it runs a little
> > slower" in those two tests.
>
> The swapstorm I agree is uninteresting.  The slowdown with a heavy
> write load impacts a very common usage, and I've told you how to mostly
> fix it.  You need to back out the change to bdflush.

I guess I should drop the run_task_queue(&tq_disk) instead of replacing
it back with a wait_for_some_buffers().

> > and of course he didn't even attempt to benchmark the interactive
> > feel that was the _whole_ point of my buffer.c and elevator changes.
>
> As far as I know, at no point in time have you told anyone that this
> was an objective of your latest patch.  So of course I didn't test for
> it.
>
> Interactivity is indeed improved.  It has gone from catastrophic to
> horrid.

:)

> There are four basic tests I use to quantify this, all with 64 megs of
> memory:
>
> 1: Start a continuous write, and on a different partition, time how
>    long it takes to read a 16 megabyte file.
>
>    Here, -aa takes 40 seconds.  Stock 2.4.17-pre8 takes 71 seconds.
>    2.4.17-pre8 with the same elevator settings as in -aa takes 40
>    seconds.
>
>    Large writes are slowing reads by a factor of 100.
>
> 2: Start a continuous write and, from another machine, run
>
>	time ssh -X otherhost xterm -e true
>
>    On -aa this takes 68 seconds.  On 2.4.17-pre8 it takes over three
>    minutes.  I got bored and killed it.  The problem can't be fixed on
>    2.4.17-pre8 with tuning - it's probably due to the poor page
>    replacement - stuff is getting swapped out.  This is a significant
>    problem in 2.4.17-pre and we need a fix for it.
>
> 3: Run `cp -a linux/ junk'.  Time how long it takes to read a 16 meg
>    file.
>
>    There's no appreciable difference between any of the kernels here.
>    It varies from 2 seconds to 10, and is generally OK.
>
> 4: Run `cp -a linux/ junk'.  time ssh -X otherhost xterm -e true
>
>    Varies between three and five seconds, depending on elvtune
>    settings.  No noticeable difference between any kernels.
>
> It's tests 1 and 2 which are interesting, because we perform so very
> badly.  And no amount of fiddling with buffer.c or elvtune settings is
> going to fix it, because they don't address the core problem.  Which
> is: when the elevator can't merge a read it sticks it at the end of the
> request queue, behind all the writes.
>
> I'll be submitting a little patch for 2.4.18-pre which allows the user
> to tunably promote reads ahead of most of the writes.  It improves
> tests 1 and 2 by a factor of eight to twelve.

Note that the first elevator (not elevator_linus) could handle this
case; however it was too complicated, and I've been told it was hurting
the performance of things like dbench etc. too much.  But it did allow
your test number 2 to take only a few seconds, for example.  Quite
frankly, all my benchmarks were latency-oriented, and I couldn't notice
a huge drop in performance; but OTOH at that time my test box had a
10mbyte/sec HD, and I know from experience that on such an HD the
numbers tend to be very different than on fast SCSI (my current test HD
is 33mbyte/sec IDE), so I think they were right.

> > So as far as I'm concerned, 2.4.15aa1 and 2.4.17pre?aa? are just rock
> > solid and usable in production.
>
> I haven't done much stability testing - without a description of what
> the changes are trying to do, I can't test them.  All I could do is
> blindly run stress tests, and I'm sure your QA team can do that as well
> as I, on bigger boxes.  But I don't doubt that it's stable.  However
> Red Hat's QA guys are pretty good at knocking kernels over...
>
> gargh.  Ninety seconds of bash-shared-mapping and I get "end-request:
> buffer-list destroyed" against the swap device.  Borked IDE driver.
> Seems stable on SCSI.
>
> The -aa VM is still a little prone to tossing out 0-order allocation
> failures when there's tons of swap available and when much memory is
> freeable by dropping or writing back to shared mappings.  But this
> doesn't seem to cause any problems, as long as there's some memory
> available for atomic allocations, and I never saw free memory go below
> 800 kbytes...

It mostly tends to fail on GFP_NOIO and friends, where it cannot block,
and I believe that's correct: looping forever inside the allocator can
only lead to deadlocks.  Those GFP_NOIO users have loops outside the
allocator if required.  A failure means that unless somebody else does
something for us, we couldn't allocate anything.  Thus SCHED_YIELD and
try again.
> > We'll keep doing background benchmarking and changes that cannot > > affect stability, but the core design is finished as far I can tell. > > We'll know when it gets wider testing in the runup to 2.4.18. The > fact that I found a major (although easily fixed) performance problem > in the first ten minutes indicates that caution is needed, yes? I consider that minor tuning (as you said removing the run_task_queue() in bdflush may be enough to cure the tar xzf, I will make some test). > What's the thinking with the changes to dcache/icache flushing? > A single d/icache entry can save three seeks, which is _enormous_ value for > just a few hundred bytes of memory. You appear to be shrinking the i/dcache > by 12% each time you try to swap out or evict 32 pages. What this means yes. > is that as soon we start to get a bit short on memory, the i/dcache vanishes. > And it takes ages to read that stuff back in. How did you test this? Without > having done (or even devised) any quantitative testing myself, I have a gut > feel that we need to preserve the i/dcache (versus file data) much more than > this. The problem is the zone-normal, if we fail to shrink the cache we _must_ shrink the dcache/icache as well to be correct (at the very least if the classzone is < ZONE_HIGHMEM). Otherwise zone normal/dma allocations can fail forever and you won't be able to fork a new task any longer. I tested this with a ZONE_NORMAL of 1/2 mbytes with highmem emulation. Of course this makes the problem reproducible trivially but it could happen on larger boxes as well at least in theory, and I want to cover all the cases as best as I can. > Oh. 
Maybe the core design (whatever it is :)) is not finished, > because it retains the bone-headed, dumb-to-the-point-of-astonishing > misfeature which Linux VM has always had: > > If someone is linearly writing (or reading) a gigabyte file on a 64 > megabyte box they *don't* want the VM to evict every last little scrap > of cache on behalf of data which they *obviously* do not want > cached. The current design tries to detect this, at least much better than 2.2 did. This is why I disagree with Rik's patch of yesterday: detecting cache pollution is good on lowmem boxes too (not only for DB). > It's good that -aa VM doesn't summarily dump the i/dcache and plonk > everything you want into swap when this happens. Progress. > > > So. To summarise. > > - Your attempt to address read latencies didn't work out, and should > be dropped (hopefully Marcelo and Jens are OK with an elevator hack :)) It should not be dropped. And it's not a hack; I only enabled code that was basically disabled due to the huge numbers. It will work as 2.2.20 does. Now, what you want to add is a hack that moves reads to the top of the request_queue, and if you go back to 2.3.5x you'll see I was doing exactly that; it's the first thing I did while playing with the elevator, and latency-wise it worked great. I'm sure somebody remembers the kind of latency you could get with such an elevator. Then I got flames from Linus and Ingo claiming that I had screwed up the elevator and was the source of the bad 2.3.x I/O performance, so they demanded a near-rewrite of the elevator in a way that obviously couldn't hurt the benchmarks. Jens dropped part of my latency-capable elevator and did elevator_linus, which of course cannot hurt benchmark performance, but which has the usual problem that you need to wait a minute for an xterm to be started under a write flood. 
However my objective was to avoid nearly infinite starvation, and elevator_linus avoids it (you can start the xterm in a minute; previously, in early 2.3 and 2.2, you'd have had to wait for the disk to fill up, and that could take days with a terabyte of data). So I was pretty much fine with elevator_linus too, but we knew very well that reads would again be starved significantly (even if not indefinitely). Many thanks for the help!! Andrea ^ permalink raw reply [flat|nested] 43+ messages in thread
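The bounded starvation Andrea describes can be modelled with a toy sorted queue. This is a hedged sketch, not the real elevator_linus: the `budget` field, all names, and the insertion scan are invented for illustration; the real elevator also merges requests and operates on struct request.

```c
#define MAX_RQ 64

/* Toy model of a starvation bound: every queued request carries a
 * "budget" saying how many later requests may still be inserted
 * ahead of it.  Once the budget is spent, nothing may pass it, so
 * no request waits forever. */
struct toy_rq {
    long sector;
    int budget;
};

struct toy_rq toy_q[MAX_RQ];
int toy_q_len;

void toy_elevator_add(long sector, int initial_budget)
{
    int pos = 0;
    int i;

    /* Scan from the tail: we may hop over a queued request only if
     * it sorts after us AND its budget is not yet exhausted. */
    for (i = toy_q_len - 1; i >= 0; i--) {
        if (toy_q[i].budget <= 0 || toy_q[i].sector <= sector) {
            pos = i + 1;
            break;
        }
        toy_q[i].budget--;      /* we passed it: charge its budget */
    }
    for (i = toy_q_len; i > pos; i--)
        toy_q[i] = toy_q[i - 1];
    toy_q[pos].sector = sector;
    toy_q[pos].budget = initial_budget;
    toy_q_len++;
}
```

With a budget of 2, a request for sector 50 ends up *behind* sector 100 once 100 has already been passed twice, even though plain sector ordering would put it first: that is the bounded starvation, as opposed to the unbounded starvation of a purely sorted queue.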
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-12 9:21 ` Andrea Arcangeli @ 2001-12-12 9:45 ` Rik van Riel 2001-12-12 10:09 ` Andrea Arcangeli 2001-12-12 9:59 ` Andrew Morton 1 sibling, 1 reply; 43+ messages in thread From: Rik van Riel @ 2001-12-12 9:45 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Marcelo Tosatti, lkml On Wed, 12 Dec 2001, Andrea Arcangeli wrote: > On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote: > > Oh. Maybe the core design (whatever it is :)) is not finished, > > because it retains the bone-headed, dumb-to-the-point-of-astonishing > > misfeature which Linux VM has always had: > > > > If someone is linearly writing (or reading) a gigabyte file on a 64 > > megabyte box they *don't* want the VM to evict every last little scrap > > of cache on behalf of data which they *obviously* do not want > > cached. > > The current design tries to detect this, at least much much better than > 2.2. This is why I disagree with Rik's patch of yesterday. detecting > cache pollution is good also on the lowmem boxes (not only for DB). Oh, absolutely. The problem just is that the current design has even worse problems where it doesn't put any pressure on pages which were touched twice an hour ago. This leads to the situation that applications get OOM-killed to preserve buffer cache memory which hasn't been touched since bootup time. There are ways to both have good behaviour on bulk IO and flush out old data which was in active use but no longer is. I believe these are called page aging and drop-behind. I've been thinking about achieving the wanted behaviour without these two, but haven't been able to come up with any algorithm which doesn't have some very bad side effects. If you know a way of doing bulk IO properly and flushing out an old working set correctly, please let us know. 
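The page-aging scheme Rik refers to can be sketched as follows. This is an illustrative toy only (constants and names are invented, and it is not the 2.4 code): pages gain age when referenced and decay while idle, so once-read bulk-I/O pages drain out quickly instead of pinning memory.

```c
#define TOY_AGE_START 5
#define TOY_AGE_HIT   3
#define TOY_AGE_MAX   20

/* Toy page with an age and a referenced bit (which the hardware
 * accessed bit would set in real life). */
struct toy_page {
    int age;
    int referenced;
};

/* One aging pass; returns 1 when the page has aged out and is a
 * candidate for eviction. */
int toy_page_age(struct toy_page *p)
{
    if (p->referenced) {
        p->age += TOY_AGE_HIT;
        if (p->age > TOY_AGE_MAX)
            p->age = TOY_AGE_MAX;
        p->referenced = 0;
    } else {
        p->age /= 2;              /* exponential decay while idle */
    }
    return p->age == 0;
}
```

An untouched page ages out after a few passes, while a single touch pulls it back; buffer-cache pages untouched since boot therefore lose to the working set rather than surviving it.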
regards, Rik -- Shortwave goes a long way: irc.starchat.net #swl http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-12 9:45 ` Rik van Riel @ 2001-12-12 10:09 ` Andrea Arcangeli 0 siblings, 0 replies; 43+ messages in thread From: Andrea Arcangeli @ 2001-12-12 10:09 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrew Morton, Marcelo Tosatti, lkml On Wed, Dec 12, 2001 at 07:45:45AM -0200, Rik van Riel wrote: > On Wed, 12 Dec 2001, Andrea Arcangeli wrote: > > On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote: > > > Oh. Maybe the core design (whatever it is :)) is not finished, > > > because it retains the bone-headed, dumb-to-the-point-of-astonishing > > > misfeature which Linux VM has always had: > > > > > > If someone is linearly writing (or reading) a gigabyte file on a 64 > > > megabyte box they *don't* want the VM to evict every last little scrap > > > of cache on behalf of data which they *obviously* do not want > > > cached. > > > > The current design tries to detect this, at least much much better than > > 2.2. This is why I disagree with Rik's patch of yesterday. detecting > > cache pollution is good also on the lowmem boxes (not only for DB). > > Oh, absolutely. The problem just is that the current design > has even worse problems where it doesn't put any pressure on > pages which were touched twice an hour ago. it does. See the refill_inactive pass. > This leads to the situation that applications get OOM-killed > to preserve buffer cache memory which hasn't been touched > since bootup time. It doesn't happen here. 
At the very least, the fix is the two-liner from Andrew that forces a refile of nr_pages from the active list, which guarantees that whatever happens we always roll the active list too. But the OOM killing you are experiencing is a mainline problem; it definitely doesn't happen here, and refill_inactive(0) cannot be the culprit, because the active list always grows to a relevant size, and if during OOM a few pages stay untouched on the active list that's fine: those few pages couldn't have saved us anyway, so they'd better stay there so we don't thrash. > > There are ways to both have good behaviour on bulk IO and > flush out old data which was in active use but no longer is. > I believe these are called page aging and drop-behind. > I've been thinking about achieving the wanted behaviour > without these two, but haven't been able to come up with > any algorithm which doesn't have some very bad side effects. > > If you know a way of doing bulk IO properly and flushing out > an old working set correctly, please let us know. > > regards, > > Rik > -- > Shortwave goes a long way: irc.starchat.net #swl > > http://www.surriel.com/ http://distro.conectiva.com/ Andrea ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-12 9:21 ` Andrea Arcangeli 2001-12-12 9:45 ` Rik van Riel @ 2001-12-12 9:59 ` Andrew Morton 2001-12-12 10:15 ` Andrea Arcangeli 1 sibling, 1 reply; 43+ messages in thread From: Andrew Morton @ 2001-12-12 9:59 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rik van Riel, Marcelo Tosatti, lkml Andrea Arcangeli wrote: > > ... > > The swapstorm I agree is uninteresting. The slowdown with a heavy write > > load impacts a very common usage, and I've told you how to mostly fix > > it. You need to back out the change to bdflush. > > I guess i should drop the run_task_queue(&tq_disk) instead of replacing > it back with a wait_for_some_buffers(). hum. Nope, it definitely wants the wait_for_locked_buffers() in there. 36 seconds versus 25. (21 on stock kernel) My theory is that balance_dirty() is directing heaps of wakeups to bdflush, so bdflush just keeps on running. I'll take a look tomorrow. (If we're sending that many wakeups, we should do a waitqueue_active test in wakeup_bdflush...) > ... > > Note that the first elevator (not elevator_linus) could handle this > case, however it was too complicated and I'm been told it was hurting > too much the performance of things like dbench etc.. But it was allowing > you to take a few seconds for your test number 2 for example. Quite > frankly all my benchmark were latency oriented, but I couldn't notice > an huge drop of performance, but OTOH at that time my test box had a > 10mbyte/sec HD, and I know for experience that on such HD numbers tends > to be very different than on fast SCSI and my current test hd IDE > 33mbyte/sec so I think they were right. OK, well I think I'll make it so the feature defaults to "off" - no change in behaviour. People need to run `elvtune -b non-zero-value' to turn it on. So what is then needed is testing to determine the latency-versus-throughput tradeoff. Andries takes manpage patches :) > ... 
> > - Your attempt to address read latencies didn't work out, and should > > be dropped (hopefully Marcelo and Jens are OK with an elevator hack :)) > > It should not be dropped. And it's not an hack, I only enabled the code > that was basically disabled due the huge numbers. It will work as 2.2.20. Sorry, I was referring to the elevator-bypass patch. Jens called it a hack ;) > Now what you want to add is an hack to move the read at the top of the > request_queue and if you go back to 2.3.5x you'll see I was doing this, > that's the first thing I did while playing with the elevator. And > latency-wise it was working great. I'm sure somebody remebers the kind > of latency you could get with such an elevator. > > Then I got flames from Linus and Ingo claiming that I screwedup the > elevator and that I was the source of the 2.3.x bad I/O performance and > so they required to nearly rewrite the elevator in a way that was > obvious that couldn't hurt the benchmarks and so Jens dropped part of my > latency-capable elevator and he did the elevator_linus that of course > cannot hurt performance of benchmarks, but that has the usual problem > you need to wait 1 minute for xterm to be stared under a write flood. > > However my object was to avoid nearly infinite starvation and the > elevator_linus avoids it (you can start the xterm it in 1 minute, > previously in early 2.3 and 2.2 you'd need to wait for the disk to be > full, and that could take some day with some terabyte of data). So I was > pretty much fine with elevator_linus too but we very well known reads > would be starved again significantly (even if not indefinitely). > OK, thanks. As long as the elevator-bypass tunable gives a good range of latency-versus-throughput tuning then I'll be happy. It's a bit sad that in even the best case, reads are penalised by a factor of ten when there are writes happening. 
But fixing that would require major readahead surgery, and perhaps implementation of anticipatory scheduling, as described in http://www.cse.ucsc.edu/~sbrandt/290S/anticipatoryscheduling.pdf which is out of scope. - ^ permalink raw reply [flat|nested] 43+ messages in thread
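The elevator-bypass tunable under discussion can be modelled as bounded read promotion. This is a hedged sketch, not Jens' patch: the queue layout and the `toy_bypass_limit` knob are invented stand-ins for whatever `elvtune -b` would actually control.

```c
#define TOY_MAX_RQ 64

enum { TOY_READ, TOY_WRITE };

/* Toy request queue where a READ may jump ahead of at most
 * toy_bypass_limit queued WRITEs, trading write throughput for
 * read latency.  0 means no change in behaviour. */
struct toy_req {
    int rw;
    long sector;
};

struct toy_req toy_bq[TOY_MAX_RQ];
int toy_bq_len;
int toy_bypass_limit;

void toy_bq_add(int rw, long sector)
{
    int pos = toy_bq_len;
    int i;

    if (rw == TOY_READ) {
        int passed = 0;
        /* promote the read past trailing writes, up to the limit */
        while (pos > 0 && toy_bq[pos - 1].rw == TOY_WRITE &&
               passed < toy_bypass_limit) {
            pos--;
            passed++;
        }
    }
    for (i = toy_bq_len; i > pos; i--)
        toy_bq[i] = toy_bq[i - 1];
    toy_bq[pos].rw = rw;
    toy_bq[pos].sector = sector;
    toy_bq_len++;
}
```

With a limit of 2 and three writes queued, an incoming read lands in slot 1 rather than at the tail; with the limit at 0 it queues strictly FIFO behind the writes, which is the default-off behaviour Andrew proposes.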
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-12 9:59 ` Andrew Morton @ 2001-12-12 10:15 ` Andrea Arcangeli 0 siblings, 0 replies; 43+ messages in thread From: Andrea Arcangeli @ 2001-12-12 10:15 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, Marcelo Tosatti, lkml On Wed, Dec 12, 2001 at 01:59:38AM -0800, Andrew Morton wrote: > Andrea Arcangeli wrote: > > > > ... > > > The swapstorm I agree is uninteresting. The slowdown with a heavy write > > > load impacts a very common usage, and I've told you how to mostly fix > > > it. You need to back out the change to bdflush. > > > > I guess i should drop the run_task_queue(&tq_disk) instead of replacing > > it back with a wait_for_some_buffers(). > > hum. Nope, it definitely wants the wait_for_locked_buffers() in there. > 36 seconds versus 25. (21 on stock kernel) please try without the wait_for_locked_buffers and without the run_task_queue, just delete that line. > > My theory is that balance_dirty() is directing heaps of wakeups > to bdflush, so bdflush just keeps on running. I'll take a look > tomorrow. Please delete the wait_on_buffers from balance_dirty() too, it's totally broken there as well. wait_on_something _does_ wakeup the queue just like a run_task_queue() otherwise it's a noop. However I need to check better the refile of clean buffers from locked to clean lists, we should make sure not to spend too much time there, the first time a wait_on_buffers is recalled... > (If we're sending that many wakeups, we should do a waitqueue_active > test in wakeup_bdflush...) > > > ... > > > > Note that the first elevator (not elevator_linus) could handle this > > case, however it was too complicated and I'm been told it was hurting > > too much the performance of things like dbench etc.. But it was allowing > > you to take a few seconds for your test number 2 for example. 
Quite > > frankly all my benchmark were latency oriented, but I couldn't notice > > an huge drop of performance, but OTOH at that time my test box had a > > 10mbyte/sec HD, and I know for experience that on such HD numbers tends > > to be very different than on fast SCSI and my current test hd IDE > > 33mbyte/sec so I think they were right. > > OK, well I think I'll make it so the feature defaults to "off" - no > change in behaviour. People need to run `elvtune -b non-zero-value' > to turn it on. Ok. BTW, I guess on this side it worth to work only on 2.5. We know latency isn't very good in 2.4 and in 2.2, we're more throughput oriented. Ah and of course to make the latency better we could as well reduce the size of the I/O queue, I bet the queues are way oversized for a normal desktop. > > So what is then needed is testing to determine the latency-versus-throughput > tradeoff. Andries takes manpage patches :) > > > ... > > > - Your attempt to address read latencies didn't work out, and should > > > be dropped (hopefully Marcelo and Jens are OK with an elevator hack :)) > > > > It should not be dropped. And it's not an hack, I only enabled the code > > that was basically disabled due the huge numbers. It will work as 2.2.20. > > Sorry, I was referring to the elevator-bypass patch. Jens called > it a hack ;) Oh yes, that's an "hack" :), and it definitely works well for the latency. > > > Now what you want to add is an hack to move the read at the top of the > > request_queue and if you go back to 2.3.5x you'll see I was doing this, > > that's the first thing I did while playing with the elevator. And > > latency-wise it was working great. I'm sure somebody remebers the kind > > of latency you could get with such an elevator. 
> > > > Then I got flames from Linus and Ingo claiming that I screwedup the > > elevator and that I was the source of the 2.3.x bad I/O performance and > > so they required to nearly rewrite the elevator in a way that was > > obvious that couldn't hurt the benchmarks and so Jens dropped part of my > > latency-capable elevator and he did the elevator_linus that of course > > cannot hurt performance of benchmarks, but that has the usual problem > > you need to wait 1 minute for xterm to be stared under a write flood. > > > > However my object was to avoid nearly infinite starvation and the > > elevator_linus avoids it (you can start the xterm it in 1 minute, > > previously in early 2.3 and 2.2 you'd need to wait for the disk to be > > full, and that could take some day with some terabyte of data). So I was > > pretty much fine with elevator_linus too but we very well known reads > > would be starved again significantly (even if not indefinitely). > > > > OK, thanks. > > As long as the elevator-bypass tunable gives a good range of > latency-versus-throughput tuning then I'll be happy. It's a > bit sad that in even the best case, reads are penalised by a > factor of ten when there are writes happening. > > But fixing that would require major readahead surgery, and perhaps > implementation of anticipatory scheduling, as described in > http://www.cse.ucsc.edu/~sbrandt/290S/anticipatoryscheduling.pdf > which is out of scope. > > - Andrea ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 7:07 ` Andrew Morton 2001-12-11 13:32 ` Rik van Riel @ 2001-12-11 13:42 ` Andrea Arcangeli 2001-12-11 13:59 ` Rik van Riel ` (3 more replies) 1 sibling, 4 replies; 43+ messages in thread From: Andrea Arcangeli @ 2001-12-11 13:42 UTC (permalink / raw) To: Andrew Morton; +Cc: Marcelo Tosatti, lkml On Mon, Dec 10, 2001 at 11:07:31PM -0800, Andrew Morton wrote: > Why does this code exist at the end of refill_inactive()? > > if (entry != &active_list) { > list_del(&active_list); > list_add(&active_list, entry); > } So that we restart next time at the point where we stopped browsing the active list. > This test on a 64 megabyte machine, on ext2: > > time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync) > > On 2.4.17-pre7 it takes 21 seconds. On -aa it is much slower: 36 seconds. > > This is probably due to the write scheduling changes in fs/buffer.c. Yes; I also lowered the default percentage of dirty memory in the system, so that a write flood is less likely to stall the system. Plus I made the elevator more latency oriented, rather than throughput oriented. Did you also test how responsive the system was during the test? Do you remember the thread about a 'tar xzf' hanging the machine? It doesn't hang with -aa, but of course you'll run slower if it has to do seeks. > This chunk especially will, under some conditions, cause bdflush > to madly spin in a loop unplugging all the disk queues: > > @@ -2787,7 +2795,7 @@ > > spin_lock(&lru_list_lock); > if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) { > - wait_for_some_buffers(NODEV); > + run_task_queue(&tq_disk); > interruptible_sleep_on(&bdflush_wait); > } > } > > Why did you make this change? To make bdflush spin less badly in a loop unplugging all the disk queues. We need to unplug only once, to submit the I/O; we don't need to wait on every single buffer we previously wrote. 
Note that run_task_queue() has nothing to do with wait_on_buffer; the above should be much better in terms of "spinning in a loop unplugging all the disk queues", since it will do it only once at least. In fact all the wait_for_some_buffers calls are broken (particularly the one in balance_dirty()); they're not necessary and can only slow down the machine. The only reason for them would be to refile the buffers into the clean list, nothing else. That's a total waste of I/O pipelining. And yes, that's something to fix too. > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap > dual x86: > > -aa: 4 minutes 20 seconds > 2.4.7-pre8 4 minutes 8 seconds > 2.4.7-pre8 plus the below patch: 3 minutes 55 seconds > > Now it could be that this performance regression is due to the > write merging mistake in fs/buffer.c. But with so much unrelated > material in the same patch it's hard to pinpoint the source. > > > > --- linux-2.4.17-pre8/mm/vmscan.c Thu Nov 22 23:02:59 2001 > +++ linux-akpm/mm/vmscan.c Mon Dec 10 22:34:18 2001 > @@ -537,7 +537,7 @@ static void refill_inactive(int nr_pages > > spin_lock(&pagemap_lru_lock); > entry = active_list.prev; > - while (nr_pages-- && entry != &active_list) { > + while (nr_pages && entry != &active_list) { > struct page * page; > > page = list_entry(entry, struct page, lru); > @@ -551,6 +551,7 @@ static void refill_inactive(int nr_pages > del_page_from_active_list(page); > add_page_to_inactive_list(page); > SetPageReferenced(page); > + nr_pages--; > } > spin_unlock(&pagemap_lru_lock); > } > @@ -561,6 +562,12 @@ static int shrink_caches(zone_t * classz > int chunk_size = nr_pages; > unsigned long ratio; > > + shrink_dcache_memory(priority, gfp_mask); > + shrink_icache_memory(priority, gfp_mask); > +#ifdef CONFIG_QUOTA > + shrink_dqcache_memory(DEF_PRIORITY, gfp_mask); > +#endif > + > nr_pages -= kmem_cache_reap(gfp_mask); > if (nr_pages <= 0) > return 0; > @@ -568,17 +575,13 @@ static int shrink_caches(zone_t * classz > nr_pages = 
chunk_size; > /* try to keep the active list 2/3 of the size of the cache */ > ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2); > + if (ratio == 0) > + ratio = nr_pages; > refill_inactive(ratio); > > nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority); > if (nr_pages <= 0) > return 0; > - > - shrink_dcache_memory(priority, gfp_mask); > - shrink_icache_memory(priority, gfp_mask); > -#ifdef CONFIG_QUOTA > - shrink_dqcache_memory(DEF_PRIORITY, gfp_mask); > -#endif > > return nr_pages; > } It should be simple: mainline swaps out more, so it's less likely to throw away useful cache. Just try -aa after a: echo 10 >/proc/sys/vm/vm_mapped_ratio It should swap out more and better preserve the cache. > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others(). > > > So we're madly trying to swap pages out and finding that there's no swap > > > space. I believe that when we find there's no swap left we should move > > > the page onto the active list so we don't keep rescanning it pointlessly. > > > > yes, however I think the swap-flood with no swap isn't a very > > interesting case to optimize. > > Running swapless is a valid configuration, and the kernel is doing I'm not saying it's invalid or uninteresting. It's the combination "I'm running out of memory and I'm swapless" that isn't an interesting case to optimize. If you're swapless it means you have enough memory and you're not running out of it; otherwise _you_ (not the kernel) are wrong not to have swap. > great amounts of pointless work. I would expect a diskless workstation > to suffer from this. The problem remains in latest -aa. It would be > useful to find a fix. It can only be optimized by making the other cases slower. I believe that if swap_out is called heavily in a swapless configuration, either some other part of the kernel or the user is wrong, not swap_out. So it's at least not obvious to me that it would be useful to fix this inside swap_out. 
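The arithmetic behind the `ratio == 0` clamp in the patch above is easy to demonstrate standalone: when a large inactive list fills up with unfreeable anon pages, the integer division truncates to zero and refill_inactive() is asked to move nothing. The helper below reproduces the calculation with the clamp made optional so both behaviours can be compared.

```c
/* The inactive-list refill target from shrink_caches(), with the
 * clamp from the patch above made switchable.  All inputs are page
 * counts. */
unsigned long refill_ratio(unsigned long nr_pages,
                           unsigned long nr_active_pages,
                           unsigned long nr_inactive_pages,
                           int clamp)
{
    /* try to keep the active list 2/3 of the size of the cache */
    unsigned long ratio = nr_pages * nr_active_pages /
                          ((nr_inactive_pages + 1) * 2);

    if (clamp && ratio == 0)
        ratio = nr_pages;   /* always move *something* to inactive */
    return ratio;
}
```

For a 32-page request with 1000 active and 100000 inactive pages, the unclamped ratio is 32000/200002 = 0, so nothing gets refiled; the clamp restores a minimum of 32. In the balanced case (active list about twice the inactive list) the formula already yields nr_pages, so the clamp changes nothing there.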
> > I don't have any pending bug report. AFAIK those bugs are only in > > mainline. If you can reproduce with -aa please send me a bug report. > > thanks, > > Bugs which are only fixed in -aa aren't much use to anyone. Then there are no other bugs; that's fine, and this is why I said I'm finished (except for the minor performance work, like the buffer flushing in buffer.c, which certainly cannot affect stability, or the swap-triggering etc.: all minor things that don't affect stability and for which there's no perfect solution anyway). > The VM code lacks comments, and nobody except yourself understands > what it is supposed to be doing. That's a bug, don't you think? Lack of documentation is not a bug, period. Also it's not true that I'm the only one who understands it. For instance Linus understands it completely, I am 100% sure. Anyway, I wrote a dozen slides on the VM, with some graphs showing its design, for anybody who learns better from slides than from the code. I believe the slides are useful for understanding the design, but if you want to change one line of code, slides or not, you have to read the code. Everybody is complaining about documentation. This is a red herring. There was no documentation that let you hack the previous VM code either. I'd ask how many of the people happy with the previous documentation were actually VM developers. Except for some possibly misleading comments in the current code that we may not have updated yet, I don't think there's been a regression in documentation. Andrea ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 13:42 ` Andrea Arcangeli @ 2001-12-11 13:59 ` Rik van Riel 2001-12-11 14:23 ` Andrea Arcangeli 2001-12-11 13:59 ` Abraham vd Merwe ` (2 subsequent siblings) 3 siblings, 1 reply; 43+ messages in thread From: Rik van Riel @ 2001-12-11 13:59 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Marcelo Tosatti, lkml On Tue, 11 Dec 2001, Andrea Arcangeli wrote: > > The VM code lacks comments, and nobody except yourself understands > > what it is supposed to be doing. That's a bug, don't you think? > > Lack of documentation is not a bug, period. Also it's not true that > I'm the only one who understands it. Without documentation, you can only know what the code does, never what it is supposed to do or why it does it. This makes fixing problems a lot harder, especially since people will never agree on what a piece of code is supposed to do. regards, Rik -- Shortwave goes a long way: irc.starchat.net #swl http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 13:59 ` Rik van Riel @ 2001-12-11 14:23 ` Andrea Arcangeli 2001-12-11 15:27 ` Daniel Phillips 0 siblings, 1 reply; 43+ messages in thread From: Andrea Arcangeli @ 2001-12-11 14:23 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrew Morton, Marcelo Tosatti, lkml On Tue, Dec 11, 2001 at 11:59:06AM -0200, Rik van Riel wrote: > On Tue, 11 Dec 2001, Andrea Arcangeli wrote: > > > > The VM code lacks comments, and nobody except yourself understands > > > what it is supposed to be doing. That's a bug, don't you think? > > > > Lack of documentation is not a bug, period. Also it's not true that > > I'm the only one who understands it. > > Without documentation, you can only know what the code > does, never what it is supposed to do or why it does it. I only care about "what the code does" and "what the results and the bug reports are". Anything else is vapourware and I don't care about it. As said, I wrote some documentation on the VM for my last speech at one of the most important Italian Linux events; it explains the basic design. It should be published on their website as soon as I find the time to send them the slides. I can post a link once it is online. It should allow non-VM-developers to understand the logic behind the VM algorithm, but understanding those slides is far from enough to let anyone hack the VM. I _totally_ agree with Linus when he said "real world is totally dominated by the implementation details". I was thinking this way before reading his recent email to l-k (however I totally disagree about evolution being random and the other kernel-offtopic parts of that thread :). For developers the real freedom is the code, not the documentation, and the code is there. And I think it's much easier to understand the current code (OK, I'm biased, but I still believe it's simpler for outsiders). Andrea ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 14:23 ` Andrea Arcangeli @ 2001-12-11 15:27 ` Daniel Phillips 2001-12-12 11:16 ` Andrea Arcangeli 0 siblings, 1 reply; 43+ messages in thread From: Daniel Phillips @ 2001-12-11 15:27 UTC (permalink / raw) To: Andrea Arcangeli, Rik van Riel; +Cc: Andrew Morton, Marcelo Tosatti, lkml On December 11, 2001 03:23 pm, Andrea Arcangeli wrote: > As said I wrote some documentation on the VM for my last speech at the > one of the most important italian linux events, it explains the basic > design. It should be published on their webside as soon as I find the > time to send them the slides. I can post a link once it will be online. Why not also post the whole thing as an email, right here? > It shoud allow non VM-developers to understand the logic behind the VM > algorithm, but understanding those slides it's far from allowing anyone > to hack the VM. It's a start. > I _totally_ agree with Linus when he said "real world is totally > dominated by the implementation details". Linus didn't say anything about not documenting the implementation details, nor did he say anything about not documenting in general. > For developers the real freedom is the code, not the documentation and > the code is there. And I think it's much easier to understand the > current code (ok I'm biased, but still I believe for outsiders it's > simpler). Judging by the number of complaints, it's not easy enough. I know that, personally, decoding your vm is something that's always on my 'things I could do if I didn't have a lot of other things to do' list. So far, only Linus, Marcelo, Andrew and maybe Rik seem to have made the investment. You'd have a lot more helpers by now if you gave just a little higher priority to documentation -- Daniel ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 15:27 ` Daniel Phillips @ 2001-12-12 11:16 ` Andrea Arcangeli 2001-12-12 20:03 ` Daniel Phillips 0 siblings, 1 reply; 43+ messages in thread From: Andrea Arcangeli @ 2001-12-12 11:16 UTC (permalink / raw) To: Daniel Phillips; +Cc: Rik van Riel, Andrew Morton, Marcelo Tosatti, lkml On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote: > On December 11, 2001 03:23 pm, Andrea Arcangeli wrote: > > As said I wrote some documentation on the VM for my last speech at the > > one of the most important italian linux events, it explains the basic > > design. It should be published on their webside as soon as I find the > > time to send them the slides. I can post a link once it will be online. > > Why not also post the whole thing as an email, right here? I uploaded it here: ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz Hopefully it's understandable standalone. > > It shoud allow non VM-developers to understand the logic behind the VM > > algorithm, but understanding those slides it's far from allowing anyone > > to hack the VM. > > It's a start. > > > I _totally_ agree with Linus when he said "real world is totally > > dominated by the implementation details". > > Linus didn't say anything about not documenting the implementation details, > nor did he say anything about not documenting in general. yes, my only point was that "documentation" isn't nearly enough, and that it's not mandatory (given all the changes don't affect any user API), but I certainly agree documentation helps. Andrea ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-12 11:16 ` Andrea Arcangeli @ 2001-12-12 20:03 ` Daniel Phillips 2001-12-12 21:25 ` Andrea Arcangeli 0 siblings, 1 reply; 43+ messages in thread From: Daniel Phillips @ 2001-12-12 20:03 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, Marcelo Tosatti, lkml On December 12, 2001 12:16 pm, Andrea Arcangeli wrote: > On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote: > > On December 11, 2001 03:23 pm, Andrea Arcangeli wrote: > > > As said I wrote some documentation on the VM for my last speech at the > > > one of the most important italian linux events, it explains the basic > > > design. It should be published on their webside as soon as I find the > > > time to send them the slides. I can post a link once it will be online. > > > > Why not also post the whole thing as an email, right here? > > I uploaded it here: ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz This is really, really useful. Helpful hint: to run the slideshow, get magicpoint (debian users: apt-get install mgp) and do: mv pluto.mpg pluto.mgp # ;) mgp pluto.mgp -x vflib Helpful hint #2: Actually, just gv pluto.ps gets all the content. Helpful hint #3: Actually, less pluto.mgp will do the trick (after the rename) and lets you cut and paste the text, as I'm about to do... Nit: "vm shrinking is not serialized with any other subsystem, it is also only---^^^^ threaded against itself." The big thing I see missing from this presentation is a discussion of how icache, dcache etc fit into the picture, i.e., shrink_caches. -- Daniel ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 20:03 ` Daniel Phillips
@ 2001-12-12 21:25   ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12 21:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Andrew Morton, Marcelo Tosatti, lkml

On Wed, Dec 12, 2001 at 09:03:20PM +0100, Daniel Phillips wrote:
> On December 12, 2001 12:16 pm, Andrea Arcangeli wrote:
> > On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote:
> > > On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> > > > As said I wrote some documentation on the VM for my last speech at the
> > > > one of the most important italian linux events, it explains the basic
> > > > design. It should be published on their webside as soon as I find the
> > > > time to send them the slides. I can post a link once it will be online.
> > >
> > > Why not also post the whole thing as an email, right here?
> >
> > I uploaded it here:
> > ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz
>
> This is really, really useful.
>
> Helpful hint: to run the slideshow, get magicpoint (debian users: apt-get
> install mgp) and do:
>
> 	mv pluto.mpg pluto.mgp # ;)

8)

> 	mgp pluto.mgp -x vflib
>
> Helpful hint #2: Actually, just gv pluto.ps is gets all the content.
>
> Helpful hint #3: Actually, less pluto.mgp will do the trick (after the
> rename) and lets you cut and paste the text, as I'm about to do...
>
> Nit: "vm shrinking is not serialized with any other subsystem, it is also
> only---^^^^
> threaded against itself."

correct.

> The big thing I see missing from this presentation is a discussion of how
> icache, dcache etc fit into the picture, i.e., shrink_caches.

Going into the differences between icache/dcache and pagecache would have
been too low level (and I should have spent some time explaining what
icache and dcache are first ;), so as you noticed I intentionally ignored
those highlevel vfs caches in the slides.

The concept of "pages of cache" is instead usually well known to most
people, so I only considered the pagecache, which incidentally is also the
most interesting case for the VM. For seasoned kernel developers it would
have been interesting to integrate more stuff, of course, but as you said
this is a start at least :).

About the icache/dcache shrinking, that's probably the roughest thing we
have in the vm at the moment. It just works.

Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:42 ` Andrea Arcangeli
  2001-12-11 13:59   ` Rik van Riel
@ 2001-12-11 13:59   ` Abraham vd Merwe
  2001-12-11 14:01     ` Andrea Arcangeli
  2001-12-11 15:47   ` Henning P. Schmiedehausen
  2001-12-12  8:39   ` Andrew Morton
  3 siblings, 1 reply; 43+ messages in thread
From: Abraham vd Merwe @ 2001-12-11 13:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux Kernel Development

[-- Attachment #1: Type: text/plain, Size: 1632 bytes --]

Hi Andrea!

> > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > > So we're madly trying to swap pages out and finding that there's no swap
> > > > space. I beleive that when we find there's no swap left we should move
> > > > the page onto the active list so we don't keep rescanning it pointlessly.
> > >
> > > yes, however I think the swap-flood with no swap isn't a very
> > > interesting case to optimize.
> >
> > Running swapless is a valid configuration, and the kernel is doing
>
> I'm not saying it's not valid or non interesting.
>
> It's the mix "I'm running out of memory and I'm swapless" that is the
> case not interesting to optimize.
>
> If you're swapless it means you've enough memory and that you're not
> running out of swap. Otherwise _you_ (not the kernel) are wrong not
> having swap.

The problem is that your VM is unnecessarily eating up memory and then
wants swap. That is unacceptable. Having 90% of your memory in
buffers/cache and then having the OOM killer kick in because nothing is
free is what we're moaning about.

--
Regards
 Abraham

Did you hear about the model who sat on a broken bottle and cut a nice
figure?

__________________________________________________________
 Abraham vd Merwe - 2d3D, Inc.
Device Driver Development, Outsourcing, Embedded Systems Cell: +27 82 565 4451 Snailmail: Tel: +27 21 761 7549 Block C, Antree Park Fax: +27 21 761 7648 Doncaster Road Email: abraham@2d3d.co.za Kenilworth, 7700 Http: http://www.2d3d.com South Africa [-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
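Andrew's suggestion quoted above (stop rescanning pages once swap is
exhausted) boils down to a one-branch change in the 2.4 swap-out path. The
following is a hypothetical pseudocode sketch of that idea, not the fix that
actually went in: the placement inside try_to_swap_out() and the label name
are assumptions, though activate_page() and get_swap_page() are real 2.4
helpers.

```
/* sketch: in try_to_swap_out(), at the point where a swap slot
 * is allocated for an anonymous page */
entry = get_swap_page();
if (!entry.val) {
	/*
	 * No swap left: move the page back to the active list
	 * instead of leaving it on the inactive list, where it
	 * would be rescanned (and its TLBs flushed) over and over.
	 */
	activate_page(page);
	goto out_unlock;	/* hypothetical label */
}
```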
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 13:59 ` Abraham vd Merwe @ 2001-12-11 14:01 ` Andrea Arcangeli 2001-12-11 17:30 ` Leigh Orf 0 siblings, 1 reply; 43+ messages in thread From: Andrea Arcangeli @ 2001-12-11 14:01 UTC (permalink / raw) To: Abraham vd Merwe, Linux Kernel Development On Tue, Dec 11, 2001 at 03:59:22PM +0200, Abraham vd Merwe wrote: > Hi Andrea! > > > > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others(). > > > > > So we're madly trying to swap pages out and finding that there's no swap > > > > > space. I beleive that when we find there's no swap left we should move > > > > > the page onto the active list so we don't keep rescanning it pointlessly. > > > > > > > > yes, however I think the swap-flood with no swap isn't a very > > > > interesting case to optimize. > > > > > > Running swapless is a valid configuration, and the kernel is doing > > > > I'm not saying it's not valid or non interesting. > > > > It's the mix "I'm running out of memory and I'm swapless" that is the > > case not interesting to optimize. > > > > If you're swapless it means you've enough memory and that you're not > > running out of swap. Otherwise _you_ (not the kernel) are wrong not > > having swap. > > The problem is that your VM is unnecesarily eating up memory and then wants > swap. That is unacceptable. Having 90% of your memory in buffers/cache and > then the OOM killer kicks in because nothing is free is what we're moaning > about. Dear, Abraham please apply this patch: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17pre4aa1.bz2 on top of a 2.4.17pre4 and then recompile, try again and send me a bugreport if you can reproduce. thanks, Andrea ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 14:01 ` Andrea Arcangeli @ 2001-12-11 17:30 ` Leigh Orf 0 siblings, 0 replies; 43+ messages in thread From: Leigh Orf @ 2001-12-11 17:30 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linux Kernel Development Andrea Arcangeli wrote: | > The problem is that your VM is unnecesarily eating up | > memory and then wants swap. That is unacceptable. Having | > 90% of your memory in buffers/cache and then the OOM killer | > kicks in because nothing is free is what we're moaning | > about. | | Dear, Abraham please apply this patch: | | ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17pre4aa1.bz2 | | on top of a 2.4.17pre4 and then recompile, try again and send me a | bugreport if you can reproduce. thanks, Andrea, I applied your patch and it didn't fix the problem. I reported this earlier to the kernel list but I'm not sure if you got it. See http://groups.google.com/groups?hl=en&rnum=1&selm=linux.kernel.200112081539.fB8FdFj03048%40orp.orf.cx or see the recent thread "2.4.16 memory badness (reproducible)". The behavior I cite with 2.4.16 is identical to what happens with 2.4.17pre4aa1, but here it is again. It is reproducible. Machine is 1.4GHZ Athlon with 1 GB memory, 2 GB swap, RH 7.2 with updates. 
home[1001]:/home/orf% uname -a
Linux orp.orf.cx 2.4.17-pre4 #1 Mon Dec 10 22:09:16 EST 2001 i686 unknown

(it's been patched with 2.4.17pre4aa1.bz2)

(updatedb updates RedHat's file database, does lots of file I/O)

home[1005]:/home/orf% free
             total       used       free     shared    buffers     cached
Mem:       1029780     207976     821804          0      49468      71856
-/+ buffers/cache:      86652     943128
Swap:      2064344       6324    2058020

home[1006]:/home/orf% sudo updatedb
Password:

home[1007]:/home/orf% free
             total       used       free     shared    buffers     cached
Mem:       1029780    1017576      12204          0     471548      70924
-/+ buffers/cache:     475104     554676
Swap:      2064344       6312    2058032

home[1008]:/home/orf% xmms
Memory fault

home[1009]:/home/orf% strace xmms 2>&1 | tail
old_mmap(NULL, 1291080, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40316000
mprotect(0x40448000, 37704, PROT_NONE)  = 0
old_mmap(0x40448000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x131000) = 0x40448000
old_mmap(0x4044e000, 13128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4044e000
close(3)                                = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40452000
munmap(0x40018000, 72492)               = 0
modify_ldt(0x1, 0xbffff33c, 0x10)       = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

Note that some applications don't mem fault this way, but all the ones
that do die at modify_ldt (see my previous post).
home[1010]:/home/orf% cat /proc/meminfo total: used: free: shared: buffers: cached: Mem: 1054494720 1041756160 12738560 0 481837056 77209600 Swap: 2113888256 6463488 2107424768 MemTotal: 1029780 kB MemFree: 12440 kB MemShared: 0 kB Buffers: 470544 kB Cached: 71388 kB SwapCached: 4012 kB Active: 367796 kB Inactive: 232088 kB HighTotal: 130992 kB HighFree: 2044 kB LowTotal: 898788 kB LowFree: 10396 kB SwapTotal: 2064344 kB SwapFree: 2058032 kB home[1011]:/home/orf% cat /proc/slabinfo slabinfo - version: 1.1 kmem_cache 65 68 112 2 2 1 ip_conntrack 22 50 384 5 5 1 nfs_write_data 0 0 384 0 0 1 nfs_read_data 0 0 384 0 0 1 nfs_page 0 0 128 0 0 1 ip_fib_hash 10 112 32 1 1 1 urb_priv 0 0 64 0 0 1 clip_arp_cache 0 0 128 0 0 1 ip_mrt_cache 0 0 128 0 0 1 tcp_tw_bucket 0 0 128 0 0 1 tcp_bind_bucket 17 112 32 1 1 1 tcp_open_request 0 0 128 0 0 1 inet_peer_cache 2 59 64 1 1 1 ip_dst_cache 56 80 192 4 4 1 arp_cache 3 30 128 1 1 1 blkdev_requests 640 660 128 22 22 1 journal_head 0 0 48 0 0 1 revoke_table 0 0 12 0 0 1 revoke_record 0 0 32 0 0 1 dnotify cache 0 0 20 0 0 1 file lock cache 2 42 92 1 1 1 fasync cache 2 202 16 1 1 1 uid_cache 7 112 32 1 1 1 skbuff_head_cache 293 320 192 16 16 1 sock 131 132 1280 44 44 1 sigqueue 4 29 132 1 1 1 cdev_cache 2313 2360 64 40 40 1 bdev_cache 8 59 64 1 1 1 mnt_cache 19 59 64 1 1 1 inode_cache 452259 452263 512 64609 64609 1 dentry_cache 469963 469980 128 15666 15666 1 dquot 0 0 128 0 0 1 filp 1633 1650 128 55 55 1 names_cache 0 2 4096 0 2 1 buffer_head 136268 164880 128 5496 5496 1 mm_struct 54 60 192 3 3 1 vm_area_struct 2186 2250 128 73 75 1 fs_cache 53 59 64 1 1 1 files_cache 53 63 448 6 7 1 signal_act 61 63 1344 21 21 1 size-131072(DMA) 0 0 131072 0 0 32 size-131072 0 0 131072 0 0 32 size-65536(DMA) 0 0 65536 0 0 16 size-65536 1 1 65536 1 1 16 size-32768(DMA) 0 0 32768 0 0 8 size-32768 1 1 32768 1 1 8 size-16384(DMA) 0 0 16384 0 0 4 size-16384 1 3 16384 1 3 4 size-8192(DMA) 0 0 8192 0 0 2 size-8192 5 7 8192 5 7 2 size-4096(DMA) 0 0 4096 0 0 
1 size-4096 70 73 4096 70 73 1 size-2048(DMA) 0 0 2048 0 0 1 size-2048 64 68 2048 34 34 1 size-1024(DMA) 0 0 1024 0 0 1 size-1024 11028 11032 1024 2757 2758 1 size-512(DMA) 0 0 512 0 0 1 size-512 12029 12032 512 1504 1504 1 size-256(DMA) 0 0 256 0 0 1 size-256 1609 1635 256 109 109 1 size-128(DMA) 2 30 128 1 1 1 size-128 29383 29430 128 980 981 1 size-64(DMA) 0 0 64 0 0 1 size-64 9105 9145 64 155 155 1 size-32(DMA) 34 59 64 1 1 1 size-32 70942 70977 64 1203 1203 1 Leigh Orf ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:42 ` Andrea Arcangeli
  2001-12-11 13:59   ` Rik van Riel
  2001-12-11 13:59   ` Abraham vd Merwe
@ 2001-12-11 15:47 ` Henning P. Schmiedehausen
  2001-12-11 16:01   ` Alan Cox
                      ` (2 more replies)
  2001-12-12  8:39 ` Andrew Morton
  3 siblings, 3 replies; 43+ messages in thread
From: Henning P. Schmiedehausen @ 2001-12-11 15:47 UTC (permalink / raw)
  To: linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

>Lack of documentation is not a bug, period. Also it's not true that I'm

It scares me shitless that you, as the one responsible for something as
crucial as MM in the Linux kernel, have such an attitude towards software
development, especially when people like RvR ask for docs.

Sorry, but to me this sounds like something from M$ (MAPI? You don't need
MAPI documentation. We know what we're doing. You don't need to know how
Windows XX works. It's enough that we know).

Actually, you _do_ get documentation from M$. Something one can't say
about the Linux MM-sprinkled-with-holy-penguin-pee subsystem.

I'm not happy about your usage of magic numbers, either. So it is still
running on solid 2.2.19 until further notice (or until Rik loses his
patience. ;-) )

Regards
	Henning

--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de
Am Schwabachgrund 22  Fon.: 09131 / 50654-0           info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 15:47 ` Henning P. Schmiedehausen @ 2001-12-11 16:01 ` Alan Cox 2001-12-11 16:37 ` Hubert Mantel 2001-12-11 17:09 ` Rik van Riel 2 siblings, 0 replies; 43+ messages in thread From: Alan Cox @ 2001-12-11 16:01 UTC (permalink / raw) To: hps; +Cc: linux-kernel > I'm not happy about your usage of magic numbers, either. So it is > still running on solid 2.2.19 until further notice (or until Rik loses > his patience. ;-) ) Andrea did the 2.2.19 VM as well, but that one is somewhat better documented, and doesn't have the use-once funnies. Alan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 15:47 ` Henning P. Schmiedehausen 2001-12-11 16:01 ` Alan Cox @ 2001-12-11 16:37 ` Hubert Mantel 2001-12-11 17:09 ` Rik van Riel 2 siblings, 0 replies; 43+ messages in thread From: Hubert Mantel @ 2001-12-11 16:37 UTC (permalink / raw) To: linux-kernel Hi, On Tue, Dec 11, Henning P. Schmiedehausen wrote: > Andrea Arcangeli <andrea@suse.de> writes: > > >Lack of documentation is not a bug, period. Also it's not true that I'm > > I scare myself shitless that you as the one responsible for something > as crucial as MM in the Linux kernel, has such an attitude towards > software development especially when people like RvR as for docs. > > Sorry, but to me this sounds like something from M$ (MAPI? You don't > need MAPI documentation. We know what we're doing. You don't need to > know how Windows XX works. It's enough that we know). > > Actually, you _do_ get documentation from M$. Something, one can't say How do you know the documentation matches the actual code? > about the Linux MM-sprikled-with holy-penguin-pee subsystem. In Linux, you get even more: You can look at the code itself. > I'm not happy about your usage of magic numbers, either. So it is > still running on solid 2.2.19 until further notice (or until Rik loses > his patience. ;-) ) Oh, the 2.2.19 VM is from Andrea ;) > Regards > Henning -o) Hubert Mantel Goodbye, dots... /\\ _\_v ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 15:47 ` Henning P. Schmiedehausen 2001-12-11 16:01 ` Alan Cox 2001-12-11 16:37 ` Hubert Mantel @ 2001-12-11 17:09 ` Rik van Riel 2001-12-11 17:28 ` Alan Cox 2 siblings, 1 reply; 43+ messages in thread From: Rik van Riel @ 2001-12-11 17:09 UTC (permalink / raw) To: hps; +Cc: linux-kernel On Tue, 11 Dec 2001, Henning P. Schmiedehausen wrote: > I'm not happy about your usage of magic numbers, either. So it is > still running on solid 2.2.19 until further notice (or until Rik loses > his patience. ;-) ) I've lost patience and have decided to move development away from the main tree. http://linuxvm.bkbits.net/ ;) cheers, Rik -- DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 17:09 ` Rik van Riel @ 2001-12-11 17:28 ` Alan Cox 2001-12-11 17:22 ` Rik van Riel 2001-12-11 17:23 ` Christoph Hellwig 0 siblings, 2 replies; 43+ messages in thread From: Alan Cox @ 2001-12-11 17:28 UTC (permalink / raw) To: Rik van Riel; +Cc: hps, linux-kernel > > I'm not happy about your usage of magic numbers, either. So it is > > still running on solid 2.2.19 until further notice (or until Rik loses > > his patience. ;-) ) > > I've lost patience and have decided to move development away > from the main tree. http://linuxvm.bkbits.net/ ;) Are your patches available in a format that is accessible using free software ? (Now where did I put the troll sign 8)) ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 17:28 ` Alan Cox @ 2001-12-11 17:22 ` Rik van Riel 2001-12-11 17:23 ` Christoph Hellwig 1 sibling, 0 replies; 43+ messages in thread From: Rik van Riel @ 2001-12-11 17:22 UTC (permalink / raw) To: Alan Cox; +Cc: hps, linux-kernel On Tue, 11 Dec 2001, Alan Cox wrote: > > > I'm not happy about your usage of magic numbers, either. So it is > > > still running on solid 2.2.19 until further notice (or until Rik loses > > > his patience. ;-) ) > > > > I've lost patience and have decided to move development away > > from the main tree. http://linuxvm.bkbits.net/ ;) > > Are your patches available in a format that is accessible using free > software ? Yes, I'm making patches available on my home page: http://surriel.com/patches/ Note that development isn't too fast due to the fact that I try to clean up all code I touch instead of just making the changes needed for the functionality. kind regards, Rik -- DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 17:28 ` Alan Cox 2001-12-11 17:22 ` Rik van Riel @ 2001-12-11 17:23 ` Christoph Hellwig 2001-12-12 22:20 ` Rob Landley 1 sibling, 1 reply; 43+ messages in thread From: Christoph Hellwig @ 2001-12-11 17:23 UTC (permalink / raw) To: Alan Cox; +Cc: linux-kernel In article <E16DqhI-0005vG-00@the-village.bc.nu> you wrote: >> > I'm not happy about your usage of magic numbers, either. So it is >> > still running on solid 2.2.19 until further notice (or until Rik loses >> > his patience. ;-) ) >> >> I've lost patience and have decided to move development away >> from the main tree. http://linuxvm.bkbits.net/ ;) > > Are your patches available in a format that is accessible using free > software ? As bitkeeper-ignorant I've found nice snapshots on http://www.surriel.com/patches/. For BSD advocates it might be a problem that these are unified diffs that are only applyable with GPL-licensed patch(1) version.. Christoph -- Of course it doesn't work. We've performed a software upgrade. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 17:23 ` Christoph Hellwig
@ 2001-12-12 22:20   ` Rob Landley
  2001-12-13  8:47     ` David S. Miller
  2001-12-13  8:48     ` Alan Cox
  0 siblings, 2 replies; 43+ messages in thread
From: Rob Landley @ 2001-12-12 22:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel

On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote:

> For BSD advocates it might be a problem that these are unified diffs
> that are only applyable with GPL-licensed patch(1) version..

Why would BSD advocates be applying patches to the linux kernel? (You
don't need the tool to read a patch for ideas, do you?) Why would BSD
advocates apply a GPL-licensed patch to the GPL-licensed Linux kernel, and
then complain that the tool they're using to do so is GPL-licensed?

I'm confused. (Not SURPRISED, mind you. Just easily confused.)

Rob

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-12 22:20 ` Rob Landley @ 2001-12-13 8:47 ` David S. Miller 2001-12-13 18:41 ` Matthias Andree 2001-12-13 8:48 ` Alan Cox 1 sibling, 1 reply; 43+ messages in thread From: David S. Miller @ 2001-12-13 8:47 UTC (permalink / raw) To: alan; +Cc: landley, hch, linux-kernel > > For BSD advocates it might be a problem that these are unified diffs > > that are only applyable with GPL-licensed patch(1) version.. I'm back quoting twice, sorry I've lost the original attribution. But anyways didn't the original Larry Wall patch do unified diffs? I thought it did, and I recall that wasn't GPL licensed. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-13 8:47 ` David S. Miller @ 2001-12-13 18:41 ` Matthias Andree 0 siblings, 0 replies; 43+ messages in thread From: Matthias Andree @ 2001-12-13 18:41 UTC (permalink / raw) To: linux-kernel On Thu, 13 Dec 2001, David S. Miller wrote: > But anyways didn't the original Larry Wall patch do unified diffs? > I thought it did, and I recall that wasn't GPL licensed. Nope, it did context diffs however. -- Matthias Andree "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." Benjamin Franklin ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-12 22:20 ` Rob Landley 2001-12-13 8:47 ` David S. Miller @ 2001-12-13 8:48 ` Alan Cox 2001-12-13 10:22 ` [OT] " Rob Landley 1 sibling, 1 reply; 43+ messages in thread From: Alan Cox @ 2001-12-13 8:48 UTC (permalink / raw) To: Rob Landley; +Cc: Christoph Hellwig, linux-kernel > On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote: > > > For BSD advocates it might be a problem that these are unified diffs > > that are only applyable with GPL-licensed patch(1) version.. > > Why would BSD advocates be applying patches to the linux kernel? (You don't > need the tool to read a patch for ideas, do you?) Why would BSD advocates > apply a GPL-licensed patch to the GPL-licensed Linux kernel, and then > complain that the tool they're using to do so is GPL-licensed? > > I'm confused. (Not SUPRISED, mind you. Just easily confused.) Christoph, please remember that irony is not available between the Canadian and Mexican border.... you are confusing them again 8) Alan ^ permalink raw reply [flat|nested] 43+ messages in thread
* [OT] Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-13 8:48 ` Alan Cox @ 2001-12-13 10:22 ` Rob Landley 0 siblings, 0 replies; 43+ messages in thread From: Rob Landley @ 2001-12-13 10:22 UTC (permalink / raw) To: Alan Cox; +Cc: Christoph Hellwig, linux-kernel On Thursday 13 December 2001 03:48 am, Alan Cox wrote: > > On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote: > > > For BSD advocates it might be a problem that these are unified diffs > > > that are only applyable with GPL-licensed patch(1) version.. > > > > Why would BSD advocates be applying patches to the linux kernel? (You > > don't need the tool to read a patch for ideas, do you?) Why would BSD > > advocates apply a GPL-licensed patch to the GPL-licensed Linux kernel, > > and then complain that the tool they're using to do so is GPL-licensed? > > > > I'm confused. (Not SUPRISED, mind you. Just easily confused.) > > Christoph, please remember that irony is not available between the Canadian > and Mexican border.... you are confusing them again 8) We'll get it back when the whole "everything has changed" fad dies down. Average together how long the OJ simpson trial lasted, the monica lewinsky thing, elian gonzalez down in miami, the press coverage of hurricane andrew, the original gulf war, nancy kerrigan, john wayne bobbit, joey buttafuoco, the military interventions in somalia and bosnia, the outcry over alar and malathion in california back in the 80's, dan quayle attacking murphy brown, the anti-nuke sentiment following chernobyl and three mile island... That's our national attention span. A year, maybe a year and change. Anybody who thinks some nut with a beard can keep this country permanently nervous obviously doesn't remember the cuban missile crisis. (And of course there are a lot of people who don't, again because of our short attention span...) Our military may be rather impressive, but our sarcastic self-centered indifference is legendary. 
We're STILL bombing Iraq, and most of the US has forgotten that country even exists... </off topic thread> > Alan Rob ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd) 2001-12-11 13:42 ` Andrea Arcangeli ` (2 preceding siblings ...) 2001-12-11 15:47 ` Henning P. Schmiedehausen @ 2001-12-12 8:39 ` Andrew Morton 3 siblings, 0 replies; 43+ messages in thread From: Andrew Morton @ 2001-12-12 8:39 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Marcelo Tosatti, lkml Andrea Arcangeli wrote: > > > [ big snip. Addressed in other email ] > > it should be simple, mainline swapouts more, so it's less likely to > trash away some useful cache. > > just try -aa after a: > > echo 10 >/proc/sys/vm/vm_mapped_ratio > > it should swapout more and better preserve the cache. -aa swapout balancing seems very good indeed to me. > > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others(). > > > > So we're madly trying to swap pages out and finding that there's no swap > > > > space. I beleive that when we find there's no swap left we should move > > > > the page onto the active list so we don't keep rescanning it pointlessly. > > > > > > yes, however I think the swap-flood with no swap isn't a very > > > interesting case to optimize. > > > > Running swapless is a valid configuration, and the kernel is doing > > I'm not saying it's not valid or non interesting. > > It's the mix "I'm running out of memory and I'm swapless" that is the > case not interesting to optimize. > > If you're swapless it means you've enough memory and that you're not > running out of swap. Otherwise _you_ (not the kernel) are wrong not > having swap. um. Spose so. > ... > > > The VM code lacks comments, and nobody except yourself understands > > what it is supposed to be doing. That's a bug, don't you think? > > Lack of documentation is not a bug, period. Also it's not true that I'm > the only one who understands it. For istance Linus understand it > completly, I am 100% sure. 
> Anyways I wrote a dozen of slides on the VM with some graph showing the
> design of the VM if anybody can better learn from a slide than from the
> code.

That's good. Your elevator design slides were very helpful. However,
offline documentation tends to go stale. A nice big block comment
maintained by a programmer who cares goes a loooong way.

> I believe the slides are useful to understand the design, but if you
> want to change one line of code slides or not you've to read the code.
> Everybody is complaining about documentation. This is a red-herring.
> There's no documentation that allows you to hack the previous VM code.
> I'd ask how many of the people happy with the previous documentation
> were effectively VM developers. Except for some possible misleading
> comment in the current code that we may have not updated yet, I don't
> think there's been a regression in documentation.

Sigh. Just because the current core kernel looks like it was scrawled in
crayon by an infant doesn't mean that everyone has to eschew literate,
mature, competent and maintainable programming practices.

-

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-10 19:08 2.4.16 & OOM killer screw up (fwd) Marcelo Tosatti
  2001-12-10 20:47 ` Andrew Morton
@ 2001-12-11  0:43 ` Andrea Arcangeli
  2001-12-11 15:46   ` Luigi Genoni
  2001-12-12 22:05   ` Ken Brownfield
  1 sibling, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11 0:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: lkml

On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
>
> Andrea,
>
> Could you please start looking at any 2.4 VM issues which show up ?

well, as far as I can tell no VM bugs should be present in my latest -aa,
so I think I'm finished. At the very least I know people are using
2.4.15aa1 and 2.4.17pre1aa1 in production on multigigabyte boxes under
heavy VM load and I haven't gotten any bug reports back yet.

>
> Just please make sure that when sending a fix for something, send me _one_
> problem and a patch which fixes _that_ problem.

I will split something out for you soon; at the moment I was doing some
further benchmarks.

>
> I'm tempted to look at VM, but I think I'll spend my limited time in a
> better way if I review's others people work instead.

until I split something out, you can see all the vm related changes in
the 10_vm-* patches in my ftp area.

Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  0:43 ` Andrea Arcangeli
@ 2001-12-11 15:46   ` Luigi Genoni
  0 siblings, 0 replies; 43+ messages in thread
From: Luigi Genoni @ 2001-12-11 15:46 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Marcelo Tosatti, lkml

On Tue, 11 Dec 2001, Andrea Arcangeli wrote:

> On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
> >
> > Andrea,
> >
> > Could you please start looking at any 2.4 VM issues which show up ?
>
> well, as far I can tell no VM bug should be present in my latest -aa, so
> I think I'm finished. At the very least I know people is using 2.4.15aa1
> and 2.4.17pre1aa1 in production on multigigabyte boxes under heavy VM
> load and I didn't got any bugreport back yet.

2.4.17pre1aa1 is quite rock solid on all my 2 and 4 GB machines. But I
have to admit that I have not really stressed the VM on my servers,
since, guys, we are going to Christmas :)

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  0:43 ` Andrea Arcangeli
  2001-12-11 15:46   ` Luigi Genoni
@ 2001-12-12 22:05   ` Ken Brownfield
  2001-12-12 22:30     ` Andrea Arcangeli
  2001-12-12 23:23     ` Rik van Riel
  1 sibling, 2 replies; 43+ messages in thread
From: Ken Brownfield @ 2001-12-12 22:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: lkml

On Tue, Dec 11, 2001 at 01:43:46AM +0100, Andrea Arcangeli wrote:
| On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
| > Andrea,
| > Could you please start looking at any 2.4 VM issues which show up ?
|
| well, as far I can tell no VM bug should be present in my latest -aa, so
| I think I'm finished. At the very least I know people is using 2.4.15aa1
| and 2.4.17pre1aa1 in production on multigigabyte boxes under heavy VM
| load and I didn't got any bugreport back yet.
[...]

I look forward to this stuff. 2.4 mainline falls down reliably and
completely when running updatedb on systems with a large number of used
inodes. Linus' VM/mmap patch helped a ton, but between general VM issues
and the i/dcache bloat I'm hoping that I won't have to redirect my
irritated users' ire into a karma pool to get these changes merged into
mainline, where all of the knowledgeable folks here can beat out the
details.

I do think that the vast majority of users don't see this issue on
small-ish UP desktops. But I'm about to buy >100 SMP systems for
production expansion which will most likely be affected by this issue.
For me that emphasizes that these so-called corner cases really are
show-stoppers for Linux-as-more-than-toy.

Gimme the /proc interface (bdflush?) and let's bang on this stuff in
mainline. I need to stick with the latest -pre so I can track progress,
so 2.4.17pre4aa1 (or 10_vm-19) hasn't been a possibility for me... :-(

Cheers, just venting,
--
Ken.
brownfld@irridia.com

PS: Nice catch on the NTFS vmalloc() issue.
| > Just please make sure that when sending a fix for something, send me _one_ | > problem and a patch which fixes _that_ problem. | | I will split something for you soon, at the moment I was doing some | further benchmark. | | > | > I'm tempted to look at VM, but I think I'll spend my limited time in a | > better way if I review's others people work instead. | | until I split something out, you can see all the vm related changes in | the 10_vm-* patches in my ftp area. | | Andrea | - | To unsubscribe from this list: send the line "unsubscribe linux-kernel" in | the body of a message to majordomo@vger.kernel.org | More majordomo info at http://vger.kernel.org/majordomo-info.html | Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 22:05   ` Ken Brownfield
@ 2001-12-12 22:30     ` Andrea Arcangeli
  2001-12-12 23:23     ` Rik van Riel
  1 sibling, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12 22:30 UTC (permalink / raw)
To: Ken Brownfield; +Cc: lkml, Andrew Morton

On Wed, Dec 12, 2001 at 04:05:51PM -0600, Ken Brownfield wrote:
> On Tue, Dec 11, 2001 at 01:43:46AM +0100, Andrea Arcangeli wrote:
> | On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
> | > Andrea,
> | > Could you please start looking at any 2.4 VM issues which show up ?
> |
> | well, as far I can tell no VM bug should be present in my latest -aa, so
> | I think I'm finished. At the very least I know people is using 2.4.15aa1
> | and 2.4.17pre1aa1 in production on multigigabyte boxes under heavy VM
> | load and I didn't got any bugreport back yet.
> [...]
>
> I look forward to this stuff.  2.4 mainline falls down reliably and
> completely when running updatedb on systems with a large number of used
> inodes.  Linus' VM/mmap patch helped a ton, but between general VM
> issues and the i/dcache bloat I'm hoping that I won't have to redirect
> my irritated users' ire into a karma pool to get these changes merged
> into mainline where all of the knowledgeable folks here can beat out the
> details.
>
> I do think that the vast majority of users don't see this issue on
> small-ish UP desktops.  But I'm about to buy >100 SMP systems for
> production expansion which will most likely be affected by this issue.
> For me that emphasizes that these so-called corner cases really are
> show-stoppers for Linux-as-more-than-toy.
>
> Gimme the /proc interface (bdflush?) and let's bang on this stuff in
> mainline.  I need to stick with the latest -pre so I can track progress,
> so 2.4.17pre4aa1 (or 10_vm-19) hasn't been a possibility for me... :-(

I finished fixing the bdflush stuff that Andrew kindly pointed out.
Async writes are as fast as possible again now, and I also introduced
some hysteresis for bdflush to reduce the wakeup rate, plus I'm forcing
bdflush to do some significant work rather than just NRSYNC buffers.
But I'm doing some other swapout benchmarking before releasing a new
-aa; I hope to finish tomorrow.  Once I feel finished I'll split out
something.

Anyway, here is a preview of the bdflush fixes for Andrew.  It
definitely cures the performance for me; previously there were too many
reschedules.  I also wonder whether balance_dirty() should write nfract
of the buffers, instead of only NRSYNC (or maybe something less than
ndirty but more than NRSYNC).  Comments?  (Then BUF_LOCKED will contain
all the clean buffers too, and so it cannot be accounted into
balance_dirty() anymore; the VM will throttle on those locked buffers,
so that is not a problem.)

--- 2.4.17pre7aa1/fs/buffer.c.~1~	Mon Dec 10 16:10:40 2001
+++ 2.4.17pre7aa1/fs/buffer.c	Wed Dec 12 19:16:23 2001
@@ -105,22 +105,23 @@
 struct {
 	int nfract;	/* Percentage of buffer cache dirty to activate bdflush */
-	int dummy1;	/* old "ndirty" */
+	int ndirty;	/* Maximum number of dirty blocks to write out per
+			   wake-cycle */
 	int dummy2;	/* old "nrefill" */
 	int dummy3;	/* unused */
 	int interval;	/* jiffies delay between kupdate flushes */
 	int age_buffer;	/* Time for normal buffer to age before we flush it */
 	int nfract_sync;/* Percentage of buffer cache dirty to activate bdflush synchronously */
-	int dummy4;	/* unused */
+	int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */
 	int dummy5;	/* unused */
 } b_un;
 unsigned int data[N_PARAM];
-} bdf_prm = {{20, 0, 0, 0, 5*HZ, 30*HZ, 40, 0, 0}};
+} bdf_prm = {{30, 500, 0, 0, 5*HZ, 30*HZ, 60, 20, 0}};

 /* These are the min and max parameter values that we will allow to be assigned */
-int bdflush_min[N_PARAM] = {  0,  0,    0,   0,  0,   1*HZ,   0, 0, 0};
-int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,10000*HZ, 10000*HZ, 100, 0, 0};
+int bdflush_min[N_PARAM] = {  0,  1,    0,   0,  0,   1*HZ,   0, 0, 0};
+int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,10000*HZ, 10000*HZ, 100, 100, 0};

 void unlock_buffer(struct buffer_head *bh)
 {
@@ -181,7 +182,6 @@
 		bh->b_end_io = end_buffer_io_sync;
 		clear_bit(BH_Pending_IO, &bh->b_state);
 		submit_bh(WRITE, bh);
-		conditional_schedule();
 	} while (--count);
 }
@@ -217,11 +217,10 @@
 			array[count++] = bh;
 			if (count < NRSYNC)
 				continue;
-			spin_unlock(&lru_list_lock);
-			conditional_schedule();
 			write_locked_buffers(array, count);
+			conditional_schedule();
 			return -EAGAIN;
 		}
 		unlock_buffer(bh);
@@ -282,12 +281,6 @@
 	return 0;
 }

-static inline void wait_for_some_buffers(kdev_t dev)
-{
-	spin_lock(&lru_list_lock);
-	wait_for_buffers(dev, BUF_LOCKED, 1);
-}
-
 static int wait_for_locked_buffers(kdev_t dev, int index, int refile)
 {
 	do
@@ -1043,7 +1036,6 @@
 	unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;

 	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
-	dirty += size_buffers_type[BUF_LOCKED] >> PAGE_SHIFT;
 	tot = nr_free_buffer_pages();

 	dirty *= 100;
@@ -1060,6 +1052,21 @@
 	return -1;
 }

+static int bdflush_stop(void)
+{
+	unsigned long dirty, tot, dirty_limit;
+
+	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
+	tot = nr_free_buffer_pages();
+
+	dirty *= 100;
+	dirty_limit = tot * bdf_prm.b_un.nfract_stop_bdflush;
+
+	if (dirty > dirty_limit)
+		return 0;
+	return 1;
+}
+
 /*
  * if a new dirty buffer is created we need to balance bdflush.
  *
@@ -1084,7 +1091,6 @@
 	if (state > 0) {
 		spin_lock(&lru_list_lock);
 		write_some_buffers(NODEV);
-		wait_for_some_buffers(NODEV);
 	}
 }
@@ -2789,13 +2795,18 @@
 		complete((struct completion *)startup);

 	for (;;) {
+		int ndirty = bdf_prm.b_un.ndirty;
+
 		CHECK_EMERGENCY_SYNC

-		spin_lock(&lru_list_lock);
-		if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
-			run_task_queue(&tq_disk);
-			interruptible_sleep_on(&bdflush_wait);
+		while (ndirty > 0) {
+			spin_lock(&lru_list_lock);
+			if (!write_some_buffers(NODEV))
+				break;
+			ndirty -= NRSYNC;
 		}
+		if (ndirty > 0 || bdflush_stop())
+			interruptible_sleep_on(&bdflush_wait);
 	}
 }

Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread
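[Editor's note: the start/stop pair above (nfract = 30, nfract_stop_bdflush = 20) is a classic hysteresis band: bdflush wakes only above the high-water mark and keeps writing until dirty buffers fall below the low-water mark, so it does not oscillate around a single threshold. A minimal user-space sketch of that idea, with hypothetical names — not the kernel code:]

```c
#include <assert.h>

/* Stand-ins for bdf_prm.b_un.nfract and nfract_stop_bdflush. */
#define NFRACT_START 30		/* wake bdflush above 30% dirty */
#define NFRACT_STOP  20		/* let it sleep again below 20% dirty */

/* Returns the new running state of the flusher given the current one.
 * dirty and total are page counts, as in balance_dirty_state().
 * The gap between the two thresholds guarantees that once woken,
 * the flusher does a significant batch of work before sleeping. */
static int bdflush_running(int running, unsigned long dirty, unsigned long total)
{
	unsigned long pct = dirty * 100 / total;

	if (!running)
		return pct > NFRACT_START;	/* only wake above the high mark */
	return pct > NFRACT_STOP;		/* once awake, run down to the low mark */
}
```

At 25% dirty the sketch stays asleep if it was asleep but keeps running if it was already running — exactly the reduced wakeup rate the patch is after.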
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 22:05   ` Ken Brownfield
  2001-12-12 22:30     ` Andrea Arcangeli
@ 2001-12-12 23:23     ` Rik van Riel
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2001-12-12 23:23 UTC (permalink / raw)
To: Ken Brownfield; +Cc: Andrea Arcangeli, lkml

On Wed, 12 Dec 2001, Ken Brownfield wrote:

> I'm hoping that I won't have to redirect my irritated users' ire into
> a karma pool to get these changes merged into mainline

Actually, Marcelo has already indicated that he's willing to take
VM code from Andrea, as long as the parts are merged one by one
and come with proper argumentation.

This means you'll either have to split out Andrea's patch yourself
or you'll have to convince Andrea to play by the rules ;))

regards,

Rik
--
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 43+ messages in thread
[parent not found: <Pine.LNX.4.33L.0112102004490.1352-100000@duckman.distro.conectiva>]
* Re: 2.4.16 & OOM killer screw up (fwd)
       [not found] <Pine.LNX.4.33L.0112102004490.1352-100000@duckman.distro.conectiva>
@ 2001-12-11 16:45 ` Marcelo Tosatti
  2001-12-11 18:51   ` Rik van Riel
  0 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2001-12-11 16:45 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, Andrea Arcangeli, lkml

On Mon, 10 Dec 2001, Rik van Riel wrote:

> On Mon, 10 Dec 2001, Andrew Morton wrote:
>
> > A fix may be to just remove the use-once stuff.  It is one of the
> > sources of this problem, because it's overpopulating the inactive list.
>
> Absolutely.  Use-once is an inherently unstable system, suitable
> for things like a database load (where you know you want to spend
> a certain percentage of your RAM on caching the index), but not
> suitable for a general-purpose VM, where you have no idea how
> large the working set will be.
>
> I'll take a stab at completely removing the use-once stuff as an
> emergency measure.

Rik,

Could you please make a patch without use-once and post the patch to
lkml?

This way people can test it and report performance results.

I really would prefer to remove use-once, as I also think it's an
optimization which breaks some workloads, but I want to know what
happens in practice if we do that.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 16:45 ` Marcelo Tosatti
@ 2001-12-11 18:51   ` Rik van Riel
  0 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2001-12-11 18:51 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Morton, Andrea Arcangeli, lkml

On Tue, 11 Dec 2001, Marcelo Tosatti wrote:

> > I'll take a stab at completely removing the use-once stuff as an
> > emergency measure.
>
> Could you please make a patch without use-once and post the patch to
> lkml ?
>
> This way people can test it and report performance results.

OK, here's a quick hack to migrate 2.4 to second-chance replacement.
In this implementation that means:

1) for pages in the working set of processes, we keep the pages
   resident whenever we find a referenced bit in the page table

2) for pages which are not mapped, we unconditionally move the
   page to the inactive list; the page only gets reactivated if
   it is referenced while on the inactive list

This should give us some small protection against use-once data,
since the referenced bit doesn't count, while allowing us to
protect the working set of processes.  It also makes shrinking
of the slab-based filesystem caches unconditional, to prevent
bad effects there.

Note that I'm still compiling and haven't tested it yet, please
give it a spin.

regards,

Rik
--
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/

--- linux-2.4.17-pre8/mm/filemap.c.orig	Tue Dec 11 16:11:16 2001
+++ linux-2.4.17-pre8/mm/filemap.c	Tue Dec 11 16:27:44 2001
@@ -1249,23 +1249,20 @@
 }

 /*
- * Mark a page as having seen activity.
- *
- * If it was already so marked, move it
- * to the active queue and drop the referenced
- * bit.  Otherwise, just mark it for future
- * action..
+ * Simple second-chance replacement.
+ * As long as a page is on the active list, further references
+ * are ignored so used-once pages get replaced quickly.
+ * If a page on the inactive list gets referenced or has a
+ * referenced bit in the page table page, it gets moved back
+ * to the far end of the active list.
  */
 void mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page)) {
+	if (PageLRU(page) && !PageActive(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
 		return;
 	}
-
-	/* Mark the page referenced, AFTER checking for previous usage.. */
-	SetPageReferenced(page);
 }

 /*
--- linux-2.4.17-pre8/mm/swap.c.orig	Tue Dec 11 16:11:16 2001
+++ linux-2.4.17-pre8/mm/swap.c	Tue Dec 11 16:13:11 2001
@@ -59,7 +59,7 @@
 {
 	if (!TestSetPageLRU(page)) {
 		spin_lock(&pagemap_lru_lock);
-		add_page_to_inactive_list(page);
+		add_page_to_active_list(page);
 		spin_unlock(&pagemap_lru_lock);
 	}
 }
--- linux-2.4.17-pre8/mm/vmscan.c.orig	Tue Dec 11 16:11:16 2001
+++ linux-2.4.17-pre8/mm/vmscan.c	Tue Dec 11 16:43:10 2001
@@ -526,10 +526,14 @@
 /*
  * This moves pages from the active list to
- * the inactive list.
+ * the inactive list.  If they get referenced
+ * while on the inactive list, they will be
+ * activated again.
  *
- * We move them the other way when we see the
- * reference bit on the page.
+ * Note that we cannot (and don't want to)
+ * clear the referenced bits in the page tables
+ * of pages, so the working sets of processes
+ * have an edge on cache pages.
  */
 static void refill_inactive(int nr_pages)
 {
@@ -542,15 +546,10 @@
 		page = list_entry(entry, struct page, lru);
 		entry = entry->prev;
-		if (PageTestandClearReferenced(page)) {
-			list_del(&page->lru);
-			list_add(&page->lru, &active_list);
-			continue;
-		}
 		del_page_from_active_list(page);
 		add_page_to_inactive_list(page);
-		SetPageReferenced(page);
+		ClearPageReferenced(page);
 	}
 	spin_unlock(&pagemap_lru_lock);
 }
@@ -570,16 +569,16 @@
 	ratio = (unsigned long) nr_pages * nr_active_pages /
 		((nr_inactive_pages + 1) * 2);
 	refill_inactive(ratio);
-	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
-	if (nr_pages <= 0)
-		return 0;
-
 	shrink_dcache_memory(priority, gfp_mask);
 	shrink_icache_memory(priority, gfp_mask);
 #ifdef CONFIG_QUOTA
 	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
 #endif
+	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
+
+	if (nr_pages <= 0)
+		return 0;

 	return nr_pages;
 }

^ permalink raw reply	[flat|nested] 43+ messages in thread
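[Editor's note: the essence of Rik's patch — references to active pages are ignored, demotion is unconditional, and only a reference received *while inactive* saves a page — can be modeled in a few lines of user-space C. This is an illustrative toy with hypothetical names, not the kernel's data structures:]

```c
#include <assert.h>

/* Toy second-chance model: a page is ACTIVE, INACTIVE, or RECLAIMED.
 * A page touched only once drifts down the inactive list and is freed;
 * a page touched again while inactive earns its way back up. */
enum pstate { ACTIVE, INACTIVE, RECLAIMED };

struct tpage { enum pstate state; };

/* Analogue of mark_page_accessed(): only inactive pages are promoted;
 * references to already-active pages are deliberately ignored. */
static void touch(struct tpage *p)
{
	if (p->state == INACTIVE)
		p->state = ACTIVE;
}

/* Analogue of refill_inactive(): unconditional demotion. */
static void refill(struct tpage *p)
{
	if (p->state == ACTIVE)
		p->state = INACTIVE;
}

/* Analogue of shrink_cache(): an inactive page can be reclaimed. */
static void shrink(struct tpage *p)
{
	if (p->state == INACTIVE)
		p->state = RECLAIMED;
}
```

A used-once page goes ACTIVE → INACTIVE → RECLAIMED across one refill/shrink cycle, while a page re-referenced during its stay on the inactive list bounces back to ACTIVE and survives — the "small protection against use-once data" the patch describes.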
end of thread, other threads:[~2001-12-13 22:43 UTC | newest]

Thread overview: 43+ messages
2001-12-10 19:08 2.4.16 & OOM killer screw up (fwd) Marcelo Tosatti
2001-12-10 20:47 ` Andrew Morton
2001-12-10 19:42   ` Marcelo Tosatti
2001-12-11  0:11   ` Andrea Arcangeli
2001-12-11  7:07     ` Andrew Morton
2001-12-11 13:32       ` Rik van Riel
2001-12-11 13:46         ` Andrea Arcangeli
2001-12-12  8:44           ` Andrew Morton
2001-12-12  9:21             ` Andrea Arcangeli
2001-12-12  9:45               ` Rik van Riel
2001-12-12 10:09                 ` Andrea Arcangeli
2001-12-12  9:59               ` Andrew Morton
2001-12-12 10:15                 ` Andrea Arcangeli
2001-12-11 13:42       ` Andrea Arcangeli
2001-12-11 13:59         ` Rik van Riel
2001-12-11 14:23           ` Andrea Arcangeli
2001-12-11 15:27             ` Daniel Phillips
2001-12-12 11:16               ` Andrea Arcangeli
2001-12-12 20:03                 ` Daniel Phillips
2001-12-12 21:25                   ` Andrea Arcangeli
2001-12-11 13:59       ` Abraham vd Merwe
2001-12-11 14:01         ` Andrea Arcangeli
2001-12-11 17:30           ` Leigh Orf
2001-12-11 15:47         ` Henning P. Schmiedehausen
2001-12-11 16:01           ` Alan Cox
2001-12-11 16:37             ` Hubert Mantel
2001-12-11 17:09               ` Rik van Riel
2001-12-11 17:28               ` Alan Cox
2001-12-11 17:22             ` Rik van Riel
2001-12-11 17:23             ` Christoph Hellwig
2001-12-12 22:20             ` Rob Landley
2001-12-13  8:47               ` David S. Miller
2001-12-13 18:41                 ` Matthias Andree
2001-12-13  8:48               ` Alan Cox
2001-12-13 10:22                 ` [OT] " Rob Landley
2001-12-12  8:39         ` Andrew Morton
2001-12-11  0:43   ` Andrea Arcangeli
2001-12-11 15:46     ` Luigi Genoni
2001-12-12 22:05     ` Ken Brownfield
2001-12-12 22:30       ` Andrea Arcangeli
2001-12-12 23:23       ` Rik van Riel
     [not found] <Pine.LNX.4.33L.0112102004490.1352-100000@duckman.distro.conectiva>
2001-12-11 16:45 ` Marcelo Tosatti
2001-12-11 18:51   ` Rik van Riel