* 2.6.0-test9 - poor swap performance on low end machines @ 2003-10-29 22:30 Chris Vine 2003-10-31 3:57 ` Rik van Riel 0 siblings, 1 reply; 63+ messages in thread
From: Chris Vine @ 2003-10-29 22:30 UTC (permalink / raw) To: linux-kernel

Hi,

I have been testing the 2.6.0-test9 kernel on a couple of desktop machines. On the first, a 1.8GHz Pentium 4 uniprocessor with 512MB of RAM, it seems to perform fine, and on various compilation tests, compile times for the test programs I have compiled are pretty much the same as those obtained with a stock 2.4.22 kernel, and the 2.6 kernel seems to be slightly more responsive on the desktop. Nothing I use it for knocks it substantially into swap.

However, on a low end machine (200MHz Pentium MMX uniprocessor with only 32MB of RAM and 70MB of swap) I get poor performance once extensive use is made of the swap space. On a test compile of a C++ program involving quite a lot of templates, and which is therefore quite memory intensive, it chugs along with the stock 2.4.22 kernel and completes the compile in about 10 minutes, going (at its worst) into about 34MB of swap. However, doing the same compile on the 2.6.0-test9 kernel, it reaches about 22MB into swap and then goes into some kind of swap frenzy, continuously swapping and unswapping. Even after leaving it for an hour it continuously swaps and unswaps and fails to compile even the first file (which takes about 2 minutes using the 2.4.22 kernel) and sticks at about 24MB of swap.

The kernel is compiled with gcc-2.95.3.

Chris.

PS Please copy any replies to my e-mail address.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-29 22:30 2.6.0-test9 - poor swap performance on low end machines Chris Vine @ 2003-10-31 3:57 ` Rik van Riel 2003-10-31 11:26 ` Roger Luethi ` (2 more replies) 0 siblings, 3 replies; 63+ messages in thread From: Rik van Riel @ 2003-10-31 3:57 UTC (permalink / raw) To: Chris Vine; +Cc: linux-kernel, Con Kolivas On Wed, 29 Oct 2003, Chris Vine wrote: > However, on a low end machine (200MHz Pentium MMX uniprocessor with only 32MB > of RAM and 70MB of swap) I get poor performance once extensive use is made of > the swap space. Could you try the patch Con Kolivas posted on the 25th ? Subject: [PATCH] Autoregulate vm swappiness cleanup -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-31 3:57 ` Rik van Riel @ 2003-10-31 11:26 ` Roger Luethi 2003-10-31 12:37 ` Con Kolivas 2003-10-31 12:55 ` Ed Tomlinson 2003-10-31 21:52 ` Chris Vine 2003-11-02 23:06 ` Chris Vine 2 siblings, 2 replies; 63+ messages in thread From: Roger Luethi @ 2003-10-31 11:26 UTC (permalink / raw) To: Rik van Riel; +Cc: Chris Vine, linux-kernel, Con Kolivas On Thu, 30 Oct 2003 22:57:23 -0500, Rik van Riel wrote: > On Wed, 29 Oct 2003, Chris Vine wrote: > > > However, on a low end machine (200MHz Pentium MMX uniprocessor with only 32MB > > of RAM and 70MB of swap) I get poor performance once extensive use is made of > > the swap space. > > Could you try the patch Con Kolivas posted on the 25th ? > > Subject: [PATCH] Autoregulate vm swappiness cleanup I suppose it will show some improvement but fail to get performance anywhere near 2.4 -- at least that's what my own tests found. I've been working on a break-down of where we're losing it. Bottom line: It's not simply a price we pay for feature X or Y. It's all over the map, and thus no single patch can possibly fix it. Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-31 11:26 ` Roger Luethi @ 2003-10-31 12:37 ` Con Kolivas 2003-10-31 12:59 ` Roger Luethi 2003-10-31 12:55 ` Ed Tomlinson 1 sibling, 1 reply; 63+ messages in thread
From: Con Kolivas @ 2003-10-31 12:37 UTC (permalink / raw) To: Roger Luethi, Rik van Riel; +Cc: Chris Vine, linux-kernel

On Fri, 31 Oct 2003 22:26, Roger Luethi wrote:
> On Thu, 30 Oct 2003 22:57:23 -0500, Rik van Riel wrote:
> > On Wed, 29 Oct 2003, Chris Vine wrote:
> > > However, on a low end machine (200MHz Pentium MMX uniprocessor with
> > > only 32MB of RAM and 70MB of swap) I get poor performance once
> > > extensive use is made of the swap space.
> >
> > Could you try the patch Con Kolivas posted on the 25th ?
> >
> > Subject: [PATCH] Autoregulate vm swappiness cleanup
>
> I suppose it will show some improvement but fail to get performance
> anywhere near 2.4 -- at least that's what my own tests found. I've been
> working on a break-down of where we're losing it.
> Bottom line: It's not simply a price we pay for feature X or Y. It's
> all over the map, and thus no single patch can possibly fix it.

Yes it will show improvement, and I would like to hear how much given how simple it is, but I agree with you. There is an intrinsic difference in the vm in 2.6 that makes it too hard for multiple running applications to each have a small piece of the action; instead it gives out big pieces of the action. While it is better in most circumstances, I believe you describe the problem under vm overload well. I guess encoding a vm scheduler would help (clearly 2.8 territory), but at what overhead cost? I have no idea myself, as now I'm pulling catch-phrases out of my arse that I hate hearing others use (see any lkml thread about scheduling from people who don't code).

Cheers,
Con

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-31 12:37 ` Con Kolivas @ 2003-10-31 12:59 ` Roger Luethi 0 siblings, 0 replies; 63+ messages in thread From: Roger Luethi @ 2003-10-31 12:59 UTC (permalink / raw) To: Con Kolivas; +Cc: Rik van Riel, Chris Vine, linux-kernel On Fri, 31 Oct 2003 23:37:34 +1100, Con Kolivas wrote: > Yes it will show improvement, and I would like to hear how much given how I've been sitting on my data because I was waiting for the missing pieces from my test box, but here's a data point: For my test case, your patch improves run time from 500 to 440 seconds. > simple it is, but I agree with you. There is an intrinsic difference in the > vm in 2.6 that makes it too hard for multiple running applications to have a My (probably surprising to many) finding is that there _isn't_ an intrinsic difference which makes 2.6 suck. There are a number of _separate_ issues, and they are only related in their contribution to making 2.6 thrashing behavior abysmal. What I'm trying to find out is whether the issues are intrinsic to a change in some mechanisms (which typically means it's a price we have to pay for other benefits) or if they are just problems with the implementation. I had tracked down vm_swappiness as one problem, and your solution shows that the implementation could indeed be improved without touching the fundamental VM workings at all. Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-31 11:26 ` Roger Luethi 2003-10-31 12:37 ` Con Kolivas @ 2003-10-31 12:55 ` Ed Tomlinson 2003-11-01 18:34 ` Pasi Savolainen 2003-11-06 18:40 ` bill davidsen 1 sibling, 2 replies; 63+ messages in thread
From: Ed Tomlinson @ 2003-10-31 12:55 UTC (permalink / raw) To: linux-kernel

On October 31, 2003 06:26 am, Roger Luethi wrote:
> On Thu, 30 Oct 2003 22:57:23 -0500, Rik van Riel wrote:
> > On Wed, 29 Oct 2003, Chris Vine wrote:
> > > However, on a low end machine (200MHz Pentium MMX uniprocessor with
> > > only 32MB of RAM and 70MB of swap) I get poor performance once
> > > extensive use is made of the swap space.
> >
> > Could you try the patch Con Kolivas posted on the 25th ?
> >
> > Subject: [PATCH] Autoregulate vm swappiness cleanup
>
> I suppose it will show some improvement but fail to get performance
> anywhere near 2.4 -- at least that's what my own tests found. I've been
> working on a break-down of where we're losing it.
> Bottom line: It's not simply a price we pay for feature X or Y. It's
> all over the map, and thus no single patch can possibly fix it.

With 2.6 it's possible to tell the kernel how much to swap. Con's patch tries to keep applications in memory. You can also play with /proc/sys/vm/swappiness, which is what Con's patch tries to replace.

Ed Tomlinson

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-31 12:55 ` Ed Tomlinson @ 2003-11-01 18:34 ` Pasi Savolainen 2003-11-06 18:40 ` bill davidsen 1 sibling, 0 replies; 63+ messages in thread
From: Pasi Savolainen @ 2003-11-01 18:34 UTC (permalink / raw) To: linux-kernel

* Ed Tomlinson <edt@aei.ca>:
> On October 31, 2003 06:26 am, Roger Luethi wrote:
>> On Thu, 30 Oct 2003 22:57:23 -0500, Rik van Riel wrote:
>> > On Wed, 29 Oct 2003, Chris Vine wrote:
>> > > However, on a low end machine (200MHz Pentium MMX uniprocessor with
>> > > only 32MB of RAM and 70MB of swap) I get poor performance once
>> > > extensive use is made of the swap space.
>> >
>> > Could you try the patch Con Kolivas posted on the 25th ?
>> >
>> > Subject: [PATCH] Autoregulate vm swappiness cleanup
>>
>> I suppose it will show some improvement but fail to get performance
>> anywhere near 2.4 -- at least that's what my own tests found. I've been
>> working on a break-down of where we're losing it.
>> Bottom line: It's not simply a price we pay for feature X or Y. It's
>> all over the map, and thus no single patch can possibly fix it.
>
> With 2.6 its possible to tell the kernel how much to swap. Con's patch
> tries to keep applications in memory. You can also play with
> /proc/sys/vm/swappiness which is what Con's patch tries to replace.

FWIW, I've been getting horrible performance when FREEING swap space. The UI would just hang for several seconds. It's been that way through the whole -test timeframe. test9 has been much better, but seemingly only because it doesn't seem to like swap as much as before; it doesn't use it unless in dire need. It's not exactly a low-end machine either: 2x1800, 512M + 1G swap. DMA is on for IDE.

Below is 'vmstat 1'. I loaded some 30MB images into GIMP and made tight operations (undo level at 100), loaded some more heavy apps, then moved to GIMP's workspace and closed it.
There was 'no-response' -time at marked places:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si    so    bi    bo   in    cs us sy id wa
...
 1  1 318784   6788   9500 134092  964     0  1652   220 1262  1185  9  2 44 46
 0  1 318784   5188   9512 134164 1548     0  1620     4 1162   861  8  1 32 59
 0  1 318784   3140   9268 132052 8000     0  8000     0 1299  1130  4  2 48 46
 0  2 331280   2812   8860 127072 7360 13336  7360 13336 1502  1700  1  4 44 51
 0  1 331280   3800   8644 125816 4648     0  4648     4 1368  1330  1  3 29 67
 0  2 333596   3104   8632 125312 3136  2340  3136  2372 1466  1809  2  4 44 51
 1  2 336548   3620   8588 124924 5736  2972  5736  2972 1658  2163  3  5 23 69
 0  2 336468   3752   8668 124836  736    92   816   112 1454  1697  3  4 23 71
 0  1 336468   5232   8760 123288 5468     0  5536   136 1694  2265  3  4 28 65
- click GIMP 'Quit' -
 1  0 334848   4680   8808 124088 2200     0  3000    56 1339  1404  3  2 34 61
 0  1 291692 150596   8812 124496  876     0  1248     0 1232  1116 13  3 48 36
- GUI freeze -
 0  1 291692 148432   8812 124164 2576     0  2576     0 1369   561  1  1 50 49
 0  1 291692 146128   8812 124132 2344     0  2344     0 1284   488  1  1 50 49
 0  2 291692 143828   8812 124048 2396     0  2396    28 1254   373  0  2 49 49
 0  2 291692 141588   8828 124092 2320     0  2320    52 1270   583  0  1  6 92
 0  2 291692 139228   8848 124044 2272     0  2272    32 1257   524  0  1 13 86
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si    so    bi    bo   in    cs us sy id wa
 1  1 291692 136988   8848 124064 2224     0  2224     0 1213   401  1  1 50 49
 0  1 291692 134688   8848 124048 2328     0  2328     0 1349   362  1  1 49 49
 0  1 291692 132128   8848 124040 2524     0  2524     0 1408   371  0  2 49 49
 0  1 291692 129824   8848 124028 2324     0  2324     0 1210   361  0  1 49 50
 0  1 291692 127520   8852 124044 2292     0  2292   164 1266   474  0  1 29 69
 0  2 291692 125348   8868 124060 2144     0  2144    28 1225   473  0  2  8 90
 0  2 291692 122788   8888 124100 2524     0  2524    32 1283   557  1  1 24 75
 0  1 291692 120612   8888 124092 2184     0  2184     0 1181   367  0  1 43 56
 0  1 291692 118052   8888 124092 2516     0  2516     0 1176   346  1  1 50 49
 0  1 291692 115684   8888 124068 2404     0  2404     0 1178   337  1  1 50 49
 0  1 291692 113124   8888 124112 2540     0  2540     0 1183   349  1  1 50 49
 0  1 291692 111080   8888 123688 2464     0  2464     0 1186   336  1  1 49 50
 0  0 291692 109920   8924 123712 1164     0  1164    60 1165   869  3  1 57 39
- GUI available -
 0  0 291692 109924   8924 123712    0     0     0     0 1079   632  0  0 100  0
 0  0 291692 109924   8924 123712    0     0     0     0 1079   687  0  0 100  0
 0  0 291692 109924   8924 123712    0     0     0     0 1074   645  0  0 100  0
 0  0 291692 109928   8924 123712    0     0     0     0 1122   797  1  0 100  0

--
Psi -- <http://www.iki.fi/pasi.savolainen>
Vivake -- Virtuaalinen valokuvauskerho <http://members.lycos.co.uk/vivake/>

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-31 12:55 ` Ed Tomlinson 2003-11-01 18:34 ` Pasi Savolainen @ 2003-11-06 18:40 ` bill davidsen 1 sibling, 0 replies; 63+ messages in thread From: bill davidsen @ 2003-11-06 18:40 UTC (permalink / raw) To: linux-kernel In article <200310310755.36224.edt@aei.ca>, Ed Tomlinson <edt@aei.ca> wrote: | With 2.6 its possible to tell the kernel how much to swap. Con's patch | tries to keep applications in memory. You can also play with | /proc/sys/vm/swappiness which is what Con's patch tries to replace. I added Nick's sched and io patches to Con's patch on test9, and it looked stable under load. But I'm (mostly) on vacation this week, so it isn't being tested any more. My responsiveness test didn't show it to be as good as 2.4, unfortunately. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-31 3:57 ` Rik van Riel 2003-10-31 11:26 ` Roger Luethi @ 2003-10-31 21:52 ` Chris Vine 2003-11-02 23:06 ` Chris Vine 2 siblings, 0 replies; 63+ messages in thread From: Chris Vine @ 2003-10-31 21:52 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel, Con Kolivas On Friday 31 October 2003 3:57 am, Rik van Riel wrote: > On Wed, 29 Oct 2003, Chris Vine wrote: > > However, on a low end machine (200MHz Pentium MMX uniprocessor with only > > 32MB of RAM and 70MB of swap) I get poor performance once extensive use > > is made of the swap space. > > Could you try the patch Con Kolivas posted on the 25th ? > > Subject: [PATCH] Autoregulate vm swappiness cleanup I will do that over the weekend and report back. Chris. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-10-31 3:57 ` Rik van Riel 2003-10-31 11:26 ` Roger Luethi 2003-10-31 21:52 ` Chris Vine @ 2003-11-02 23:06 ` Chris Vine 2003-11-03 0:48 ` Con Kolivas 2 siblings, 1 reply; 63+ messages in thread From: Chris Vine @ 2003-11-02 23:06 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel, Con Kolivas On Friday 31 October 2003 3:57 am, Rik van Riel wrote: > On Wed, 29 Oct 2003, Chris Vine wrote: > > However, on a low end machine (200MHz Pentium MMX uniprocessor with only > > 32MB of RAM and 70MB of swap) I get poor performance once extensive use > > is made of the swap space. > > Could you try the patch Con Kolivas posted on the 25th ? > > Subject: [PATCH] Autoregulate vm swappiness cleanup OK. I have now done some testing. The default swappiness in the kernel (without Con's patch) is 60. This gives hopeless swapping results on a 200MHz Pentium with 32MB of RAM once the amount of memory swapped out exceeds about 15 to 20MB. A static swappiness of 10 gives results which work under load, with up to 40MB swapped out (I haven't tested beyond that). Compile times with a test file requiring about 35MB of swap and with everything else the same are: 2.4.22 - average of 1 minute 35 seconds 2.6.0-test9 (swappiness 10) - average of 5 minutes 56 seconds A swappiness of 5 on the test compile causes the machine to hang in some kind of "won't swap/can't continue without more memory" stand-off, and a swappiness of 20 starts the machine thrashing to the point where I stopped the compile. A swappiness of 10 would complete anything I threw at it and without excessive thrashing, but more slowly (and using a little more swap space) than 2.4.22. With Con's dynamic swappiness patch things were worse, rather than better. With no load, the swappiness (now read only) was around 37. 
Under load with the test compile, swappiness went up to around 62, thrashing began, and after 30 minutes the compile still had not completed, swappiness had reached 70, and I abandoned it. Chris. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-11-02 23:06 ` Chris Vine @ 2003-11-03 0:48 ` Con Kolivas 2003-11-03 21:13 ` Chris Vine 0 siblings, 1 reply; 63+ messages in thread From: Con Kolivas @ 2003-11-03 0:48 UTC (permalink / raw) To: Chris Vine, Rik van Riel; +Cc: linux-kernel, Martin J. Bligh [-- Attachment #1: Type: text/plain, Size: 2273 bytes --] On Mon, 3 Nov 2003 10:06, Chris Vine wrote: > On Friday 31 October 2003 3:57 am, Rik van Riel wrote: > > On Wed, 29 Oct 2003, Chris Vine wrote: > > > However, on a low end machine (200MHz Pentium MMX uniprocessor with > > > only 32MB of RAM and 70MB of swap) I get poor performance once > > > extensive use is made of the swap space. > > > > Could you try the patch Con Kolivas posted on the 25th ? > > > > Subject: [PATCH] Autoregulate vm swappiness cleanup > > OK. I have now done some testing. > > The default swappiness in the kernel (without Con's patch) is 60. This > gives hopeless swapping results on a 200MHz Pentium with 32MB of RAM once > the amount of memory swapped out exceeds about 15 to 20MB. A static > swappiness of 10 gives results which work under load, with up to 40MB > swapped out (I haven't tested beyond that). Compile times with a test file > requiring about 35MB of swap and with everything else the same are: > > 2.4.22 - average of 1 minute 35 seconds > 2.6.0-test9 (swappiness 10) - average of 5 minutes 56 seconds > > A swappiness of 5 on the test compile causes the machine to hang in some > kind of "won't swap/can't continue without more memory" stand-off, and a > swappiness of 20 starts the machine thrashing to the point where I stopped > the compile. A swappiness of 10 would complete anything I threw at it and > without excessive thrashing, but more slowly (and using a little more swap > space) than 2.4.22. > > With Con's dynamic swappiness patch things were worse, rather than better. > With no load, the swappiness (now read only) was around 37. 
> Under load with the test compile, swappiness went up to around 62,
> thrashing began, and after 30 minutes the compile still had not completed,
> swappiness had reached 70, and I abandoned it.

Well, I was considering adding the swap pressure to this algorithm, but I had hoped 2.6 behaved better than this under swap overload, which is what appears to be happening to yours. Can you try this patch? It takes into account swap pressure as well. It won't be as aggressive as setting the swappiness manually to 10, but unlike a swappiness of 10 it will be more useful over a wide range of hardware and circumstances.

Con

P.S. patches available here: http://ck.kolivas.org/patches

[-- Attachment #2: patch-test9-am-5 --]
[-- Type: text/x-diff, Size: 2320 bytes --]

--- linux-2.6.0-test8-base/kernel/sysctl.c	2003-10-20 14:16:54.000000000 +1000
+++ linux-2.6.0-test8/kernel/sysctl.c	2003-11-03 10:49:15.000000000 +1100
@@ -664,11 +664,8 @@ static ctl_table vm_table[] = {
 		.procname	= "swappiness",
 		.data		= &vm_swappiness,
 		.maxlen		= sizeof(vm_swappiness),
-		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_minmax,
-		.strategy	= &sysctl_intvec,
-		.extra1		= &zero,
-		.extra2		= &one_hundred,
+		.mode		= 0444 /* read-only*/,
+		.proc_handler	= &proc_dointvec,
 	},
 #ifdef CONFIG_HUGETLB_PAGE
 	{
--- linux-2.6.0-test8-base/mm/vmscan.c	2003-10-20 14:16:54.000000000 +1000
+++ linux-2.6.0-test8/mm/vmscan.c	2003-11-03 11:38:08.542960408 +1100
@@ -47,7 +47,7 @@
 /*
  * From 0 .. 100.  Higher means more swappy.
  */
-int vm_swappiness = 60;
+int vm_swappiness = 0;
 static long total_memory;

 #ifdef ARCH_HAS_PREFETCH
@@ -600,6 +600,7 @@ refill_inactive_zone(struct zone *zone,
 	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
 	struct page *page;
 	struct pagevec pvec;
+	struct sysinfo i;
 	int reclaim_mapped = 0;
 	long mapped_ratio;
 	long distress;
@@ -641,14 +642,38 @@ refill_inactive_zone(struct zone *zone,
 	 */
 	mapped_ratio = (ps->nr_mapped * 100) / total_memory;

+	si_swapinfo(&i);
+	if (unlikely(!i.totalswap))
+		vm_swappiness = 0;
+	else {
+		int app_centile, swap_centile;
+
+		/*
+		 * app_centile is the percentage of physical ram used
+		 * by application pages.
+		 */
+		si_meminfo(&i);
+		app_centile = 100 - (((i.freeram + get_page_cache_size() -
+			swapper_space.nrpages) * 100) / i.totalram);
+
+		/*
+		 * swap_centile is the percentage of free swap.
+		 */
+		swap_centile = i.freeswap * 100 / i.totalswap;
+
+		/*
+		 * Autoregulate vm_swappiness to be equal to the lowest of
+		 * app_centile and swap_centile. -ck
+		 */
+		vm_swappiness = min(app_centile, swap_centile);
+	}
+
 	/*
 	 * Now decide how much we really want to unmap some pages.  The mapped
 	 * ratio is downgraded - just because there's a lot of mapped memory
 	 * doesn't necessarily mean that page reclaim isn't succeeding.
 	 *
 	 * The distress ratio is important - we don't want to start going oom.
-	 *
-	 * A 100% value of vm_swappiness overrides this algorithm altogether.
 	 */
 	swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-11-03 0:48 ` Con Kolivas @ 2003-11-03 21:13 ` Chris Vine 2003-11-04 2:55 ` Con Kolivas 0 siblings, 1 reply; 63+ messages in thread From: Chris Vine @ 2003-11-03 21:13 UTC (permalink / raw) To: Con Kolivas, Rik van Riel; +Cc: linux-kernel, Martin J. Bligh On Monday 03 November 2003 12:48 am, Con Kolivas wrote: > Well I was considering adding the swap pressure to this algorithm but I had > hoped 2.6 behaved better than this under swap overload which is what > appears to happen to yours. Can you try this patch? It takes into account > swap pressure as well. It wont be as aggressive as setting the swappiness > manually to 10, but unlike a swappiness of 10 it will be more useful over a > wide range of hardware and circumstances. Hi, I applied the patch. The test compile started in a similar way to the compile when using your first patch. swappiness under no load was 37. At the beginning of the compile it went up to 67, but when thrashing was well established it started to come down slowly. After 40 minutes of thrashing it came down to 53. At that point I stopped the compile attempt (which did not complete). So, there is a slight move in the right direction, but given that a swappiness of 20 generates thrashing with 32 MB of RAM when more than about 20MB of memory is swapped out, it is a drop in the ocean. The conclusion appears to be that for low end systems, once memory swapped out reaches about 60% of installed RAM the swap ceases to work effectively unless swappiness is much more aggressively low than your patch achieves. The ability manually to tune it therefore seems to be required (and even then, 2.4.22 is considerably better, compiling the test file in about 1 minute 35 seconds). I suppose one question is whether I would get the same thrashiness with my other machine (which has 512MB of RAM) once more than about 300MB is swapped out. 
However, I cannot answer that question as I do not have anything here which makes memory demands of that kind. Chris. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-11-03 21:13 ` Chris Vine @ 2003-11-04 2:55 ` Con Kolivas 2003-11-04 22:08 ` Chris Vine 2003-12-08 13:52 ` William Lee Irwin III 0 siblings, 2 replies; 63+ messages in thread From: Con Kolivas @ 2003-11-04 2:55 UTC (permalink / raw) To: Chris Vine, Rik van Riel Cc: linux-kernel, Martin J. Bligh, William Lee Irwin III On Tue, 4 Nov 2003 08:13, Chris Vine wrote: > On Monday 03 November 2003 12:48 am, Con Kolivas wrote: > > Well I was considering adding the swap pressure to this algorithm but I > > had hoped 2.6 behaved better than this under swap overload which is what > > appears to happen to yours. Can you try this patch? It takes into account > > swap pressure as well. It wont be as aggressive as setting the swappiness > > manually to 10, but unlike a swappiness of 10 it will be more useful over > > a wide range of hardware and circumstances. > > The test compile started in a similar way to the compile when using your > first patch. swappiness under no load was 37. At the beginning of the > compile it went up to 67, but when thrashing was well established it > started to come down slowly. After 40 minutes of thrashing it came down to > 53. At that point I stopped the compile attempt (which did not complete). > > So, there is a slight move in the right direction, but given that a > swappiness of 20 generates thrashing with 32 MB of RAM when more than about > 20MB of memory is swapped out, it is a drop in the ocean. > > The conclusion appears to be that for low end systems, once memory swapped > out reaches about 60% of installed RAM the swap ceases to work effectively > unless swappiness is much more aggressively low than your patch achieves. > The ability manually to tune it therefore seems to be required (and even > then, 2.4.22 is considerably better, compiling the test file in about 1 > minute 35 seconds). 
> I suppose one question is whether I would get the same thrashiness with my
> other machine (which has 512MB of RAM) once more than about 300MB is
> swapped out. However, I cannot answer that question as I do not have
> anything here which makes memory demands of that kind.

That's pretty much what I expected. Overall I'm happier with this later version as it doesn't impact on the noticeable improvement on systems that are not overloaded, yet keeps performance at least that of the untuned version. I can tune it to be better for this workload, but it would be to the detriment of the rest.

Ultimately this is the problem I see with 2.6; there is no way for the vm to know that "all the pages belonging to the currently running tasks should try their best to fit into the available space by getting an equal share". It seems the 2.6 vm gives nice emphasis to the most current task, but to the detriment of other tasks that are on the runqueue and still need ram. The original design of the 2.6 vm didn't even include this last ditch effort at taming swappiness with the "knob", and behaved as though the swappiness was always set at 100. Trying to tune this further with just the swappiness value will prove futile, as can be seen by the "best" setting of 20 in your test case still taking 4 times longer to compile the kernel.

This is now a balance tradeoff of trying to set a value that works for your combination of the required ram of the applications you run concurrently, the physical ram and the swap ram. As you can see from your example, in your workload it seems there would be no point having more swap than your physical ram, since even if it tries to use say 40Mb it just drowns in a swapstorm. Clearly this is not the case in a machine with more ram in different circumstances, as swapping out say openoffice and mozilla while they're not being used will not cause any harm to a kernel compile that takes up all the available physical ram (it would actually be beneficial).
Fortunately most modern machines' ram vs application sizes are of the latter balance. There's always so much more you can do... wli, riel care to comment? Cheers, Con ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-11-04 2:55 ` Con Kolivas @ 2003-11-04 22:08 ` Chris Vine 2003-11-04 22:30 ` Con Kolivas 1 sibling, 1 reply; 63+ messages in thread
From: Chris Vine @ 2003-11-04 22:08 UTC (permalink / raw) To: Con Kolivas, Rik van Riel Cc: linux-kernel, Martin J. Bligh, William Lee Irwin III

On Tuesday 04 November 2003 2:55 am, Con Kolivas wrote:
> That's pretty much what I expected. Overall I'm happier with this later
> version as it doesn't impact on the noticeable improvement on systems that
> are not overloaded, yet keeps performance at least that of the untuned
> version. I can tune it to be better for this workload but it would be to
> the detriment of the rest.
>
> Ultimately this is the problem I see with 2.6; there is no way for the vm
> to know that "all the pages belonging to the currently running tasks should
> try their best to fit into the available space by getting an equal share".
> It seems the 2.6 vm gives nice emphasis to the most current task, but to
> the detriment of other tasks that are on the runqueue and still need ram.
> The original design of the 2.6 vm didn't even include this last ditch
> effort at taming swappiness with the "knob", and behaved as though the
> swappiness was always set at 100. Trying to tune this further with just
> the swappiness value will prove futile as can be seen by the "best" setting
> of 20 in your test case still taking 4 times longer to compile the kernel.
>
> This is now a balance tradeoff of trying to set a value that works for your
> combination of the required ram of the applications you run concurrently,
> the physical ram and the swap ram. As you can see from your example, in
> your workload it seems there would be no point having more swap than your
> physical ram since even if it tries to use say 40Mb it just drowns in a
> swapstorm.
> Clearly this is not the case in a machine with more ram in different
> circumstances, as swapping out say openoffice and mozilla while it's not
> being used will not cause any harm to a kernel compile that takes up all
> the available physical ram (it would actually be beneficial).
> Fortunately most modern machines' ram vs application sizes are of the
> latter balance.

Your diagnosis looks right, but two points -

1. The test compile was not of the kernel but of a file in a C++ program using quite a lot of templates, and which is therefore quite memory intensive (for the sake of choosing something, it was a compile of src/main.o in http://www.cvine.freeserve.co.uk/efax-gtk/efax-gtk-2.2.2.src.tgz). It would be a sad day if the kernel could not be compiled under 2.6 in 32MB of memory, and I am glad to say that it does compile - my 2.6.0-test9 kernel compiles on the 32MB machine in on average 45 minutes 13 seconds under kernel 2.4.22, and in 54 minutes 11 seconds under 2.6.0-test9 with your latest patch, which is not an enormous difference. (As a digression, in the 2.0 days the kernel would compile in 6 minutes on the machine in question, and at the time I was very impressed.)

2. Being able to choose a manual setting for swappiness is not "futile". As I mentioned in an earlier post, a swappiness of 10 will enable 2.6.0-test9 to compile the things I threw at it on a low end machine, albeit slowly, whereas with dynamic swappiness it would not compile at all. So the difference is between being able to do something and not being able to do it.

Chris.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-11-04 22:08 ` Chris Vine @ 2003-11-04 22:30 ` Con Kolivas 0 siblings, 0 replies; 63+ messages in thread From: Con Kolivas @ 2003-11-04 22:30 UTC (permalink / raw) To: Chris Vine, Rik van Riel Cc: linux-kernel, Martin J. Bligh, William Lee Irwin III On Wed, 5 Nov 2003 09:08, Chris Vine wrote: > Your diagnosis looks right, but two points - > > 1. The test compile was not of the kernel but of a file in a C++ program > using quite a lot of templates and therefore which is quite memory > intensive (for the sake of choosing something, it was a compile of > src/main.o in > http://www.cvine.freeserve.co.uk/efax-gtk/efax-gtk-2.2.2.src.tgz). It > would be a sad day if the kernel could not be compiled under 2.6 in 32MB of > memory, and I am glad to say that it does compile - my 2.6.0-test9 kernel > compiles on the 32MB machine in on average 45 minutes 13 seconds under > kernel 2.4.22, and in 54 minutes 11 seconds under 2.6.0-test9 with your > latest patch, which is not an enormous difference. (As a digression, in > the 2.0 days the kernel would compile in 6 minutes on the machine in > question, and at the time I was very impressed.) Phew. It would be sad if it couldn't compile a kernel indeed. > > 2. Being able to choose a manual setting for swappiness is not "futile". > As I mentioned in an earlier post, a swappiness of 10 will enable > 2.6.0-test9 to compile the things I threw at it on a low end machine, > albeit slowly, whereas with dynamic swappiness it would not compile at all. > So the difference is between being able to do something and not being able > to do it. I agree with you on that; I meant it would be futile trying to get the compile times back to 2.4 levels with just this tunable modified alone (statically or dynamically)... which means we should look elsewhere for ways to tackle this. Con ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-11-04 2:55 ` Con Kolivas 2003-11-04 22:08 ` Chris Vine @ 2003-12-08 13:52 ` William Lee Irwin III 2003-12-08 14:23 ` Con Kolivas 2003-12-08 19:49 ` Roger Luethi 1 sibling, 2 replies; 63+ messages in thread From: William Lee Irwin III @ 2003-12-08 13:52 UTC (permalink / raw) To: Con Kolivas; +Cc: Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Tue, Nov 04, 2003 at 01:55:08PM +1100, Con Kolivas wrote: > This is now a balance tradeoff of trying to set a value that works for your > combination of the required ram of the applications you run concurrently, the > physical ram and the swap ram. As you can see from your example, in your > workload it seems there would be no point having more swap than your physical > ram since even if it tries to use say 40Mb it just drowns in a swapstorm. > Clearly this is not the case in a machine with more ram in different > circumstances, as swapping out say openoffice and mozilla while it's not > being used will not cause any harm to a kernel compile that takes up all the > available physical ram (it would actually be beneficial). Fortunately most > modern machines' ram vs application sizes are of the latter balance. > There's always so much more you can do... > wli, riel care to comment? Explicit load control is in order. 2.4 appears to work better in these instances because it victimizes one process at a time. It vaguely resembles load control with a random demotion policy (mmlist order is effectively random), but is not the only method of page reclamation, which disturbs its two-stage LRU, and basically livelocks in various situations because having "demoted" a process address space to whatever extent it does fails to eliminate it from consideration during further attempts to reclaim memory to satisfy allocations. 
On smaller machines or workloads with high levels of overcommitment (in a sense different from non-overcommit; here it means that if all tasks were executing simultaneously over some period of time they would require more RAM than the machine has), the effect of load control dominates replacement by several orders of magnitude, so the mere presence of anything like a load control mechanism does them wonders. According to a study from the 80's (Carr's thesis), the best load control policies are demoting the smallest task, demoting the "most recently activated task", and demoting the "task with the largest remaining quantum". The latter two no longer make sense in the presence of threads, or at least have to be revised not to assume a unique execution context associated with a process address space. These three were said to be largely equivalent and performed 15% better than random. Other important aspects of load control beyond the demotion policy are explicit suspension of the execution contexts of the process address spaces chosen as its victims, complete eviction of the process address space, load-time bonuses for process address spaces promoted from that demoted status, and, of course, scheduling fair enough that starvation or repetitive demotions of the same tasks (I think demoting the faulting task runs into this) without forward progress don't occur. 2.4 does not do any of this. The effect of not suspending the execution contexts of the demoted process address spaces is that the victimized execution contexts thrash while trying to reload the memory they need to execute. The effect of incomplete demotion is essentially livelock under sufficient stress. Its memory scheduling, to what extent it has it, is RR and hence fair, but the various caveats above justify "does not do any of this", particularly incomplete demotion. So I predict that a true load control mechanism and policy would be both an improvement over 2.4 and would correct 2.6 regressions vs. 
2.4 on underprovisioned machines. For now, we lack an implementation. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-08 13:52 ` William Lee Irwin III @ 2003-12-08 14:23 ` Con Kolivas 2003-12-08 14:30 ` William Lee Irwin III ` (2 more replies) 2003-12-08 19:49 ` Roger Luethi 1 sibling, 3 replies; 63+ messages in thread From: Con Kolivas @ 2003-12-08 14:23 UTC (permalink / raw) To: William Lee Irwin III Cc: Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 363 bytes --]

[snip original discussion thrashing swap on 2.6test with 32mb ram]

Chris

By an unusual coincidence I was looking into the patches that were supposed
to speed up application startup and noticed this one was merged. A brief
discussion with wli suggests this could cause thrashing problems on low
memory boxes so can you try this patch? Applies to test11.

Con

[-- Attachment #2: patch-backout-readahead --]
[-- Type: text/x-diff, Size: 495 bytes --]

--- linux-2.6.0-test11-base/mm/filemap.c	2003-11-24 22:18:56.000000000 +1100
+++ linux-2.6.0-test11-fremap/mm/filemap.c	2003-12-09 01:17:47.793384425 +1100
@@ -1285,10 +1285,6 @@ static int filemap_populate(struct vm_ar
 	struct page *page;
 	int err;
 
-	if (!nonblock)
-		force_page_cache_readahead(mapping, vma->vm_file,
-					pgoff, len >> PAGE_CACHE_SHIFT);
-
 repeat:
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (pgoff + (len >> PAGE_CACHE_SHIFT) > size)

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-08 14:23 ` Con Kolivas @ 2003-12-08 14:30 ` William Lee Irwin III 2003-12-09 21:03 ` Chris Vine 2003-12-13 14:08 ` Chris Vine 2 siblings, 0 replies; 63+ messages in thread From: William Lee Irwin III @ 2003-12-08 14:30 UTC (permalink / raw) To: Con Kolivas; +Cc: Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Tue, Dec 09, 2003 at 01:23:31AM +1100, Con Kolivas wrote: > [snip original discussion thrashing swap on 2.6test with 32mb ram] > Chris > By an unusual coincidence I was looking into the patches that were supposed to > speed up application startup and noticed this one was merged. A brief > discussion with wli suggests this could cause thrashing problems on low > memory boxes so can you try this patch? Applies to test11. This is effectively only called when faulting on paged-out ptes whose file offsets were disturbed by remap_file_pages() and when calling remap_file_pages() itself. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-08 14:23 ` Con Kolivas 2003-12-08 14:30 ` William Lee Irwin III @ 2003-12-09 21:03 ` Chris Vine 2003-12-13 14:08 ` Chris Vine 2 siblings, 0 replies; 63+ messages in thread From: Chris Vine @ 2003-12-09 21:03 UTC (permalink / raw) To: Con Kolivas, William Lee Irwin III Cc: Rik van Riel, linux-kernel, Martin J. Bligh On Monday 08 December 2003 2:23 pm, Con Kolivas wrote: > [snip original discussion thrashing swap on 2.6test with 32mb ram] > > Chris > > By an unusual coincidence I was looking into the patches that were supposed > to speed up application startup and noticed this one was merged. A brief > discussion with wli suggests this could cause thrashing problems on low > memory boxes so can you try this patch? Applies to test11. > > Con Con, I have just got back from a trip away. I will try out the patch tomorrow, I hope, and see what difference it makes. Chris. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-08 14:23 ` Con Kolivas 2003-12-08 14:30 ` William Lee Irwin III 2003-12-09 21:03 ` Chris Vine @ 2003-12-13 14:08 ` Chris Vine 2 siblings, 0 replies; 63+ messages in thread From: Chris Vine @ 2003-12-13 14:08 UTC (permalink / raw) To: Con Kolivas, William Lee Irwin III Cc: Rik van Riel, linux-kernel, Martin J. Bligh On Monday 08 December 2003 2:23 pm, Con Kolivas wrote: > [snip original discussion thrashing swap on 2.6test with 32mb ram] > > Chris > > By an unusual coincidence I was looking into the patches that were supposed > to speed up application startup and noticed this one was merged. A brief > discussion with wli suggests this could cause thrashing problems on low > memory boxes so can you try this patch? Applies to test11. Con, I have applied the patch, and performance is nearly indistinguishable from that with the kernel without it. Chris. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-08 13:52 ` William Lee Irwin III 2003-12-08 14:23 ` Con Kolivas @ 2003-12-08 19:49 ` Roger Luethi 2003-12-08 20:48 ` William Lee Irwin III 1 sibling, 1 reply; 63+ messages in thread From: Roger Luethi @ 2003-12-08 19:49 UTC (permalink / raw) To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh [-- Attachment #1: Type: text/plain, Size: 5296 bytes --] I've been looking at this during the past few months. I will sketch out a few of my findings below. I can follow up with some details and actual data if necessary. On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote: > Explicit load control is in order. 2.4 appears to work better in these > instances because it victimizes one process at a time. It vaguely > resembles load control with a random demotion policy (mmlist order is Everybody I talked to seemed to assume that 2.4 does better due to the way mapped pages are freed (i.e. swap_out in 2.4). While it is true that the new VM as merged in 2.5.27 didn't exactly help with thrashing performance, the main factors slowing 2.6 down were merged much later. Have a look at the graph attached to this message to get an idea of what I am talking about (x axis is kernel releases after 2.5.0, y axis is time to complete each benchmark). It is important to note that different work loads show different thrashing behavior. Some changes in 2.5 improved one thrashing benchmark and made another worse. However, 2.4 seems to do better than 2.6 across the board, which suggests that some elements are in fact better for any types of thrashing. > Other important aspects of load control beyond the demotion policy are > explicit suspension the execution contexts of the process address > spaces chosen as its victims, complete eviction of the process address I implemented suspension during memory shortage for 2.6 and I had some code for complete eviction as well. 
It definitely helped for some benchmarks. There's one problem, though: Latency. If a machine is thrashing, a sys admin won't appreciate that her shell is suspended when she tries to log in to correct the problem. I have some simple criteria for selecting a process to suspend, but it's hard to get it right every time (kind of like the OOM killer, although with smaller damage for bad decisions). For workstations and most servers latency is so important compared to throughput that I began to wonder whether implementing suspension was actually worth it. After benchmarking 2.4 vs 2.6, though, I suspected that there must be plenty of room for improvement _before_ such drastic measures are necessary. It makes little sense to add suspension to 2.6 if performance can be improved _without_ hurting latency. That's why I shelved my work on suspension to find out and document when exactly performance went down during 2.5. > 2.4 does not do any of this. > > The effect of not suspending the execution contexts of the demoted > process address spaces is that the victimized execution contexts thrash > while trying to reload the memory they need to execute. The effect of > incomplete demotion is essentially livelock under sufficient stress. > Its memory scheduling to what extent it has it is RR and hence fair, > but the various caveats above justify "does not do any of this", > particularly incomplete demotion. One thing you can observe with 2.4 is that one process may force another process out. Say you have several instances of the same program which all have the same working set size (i.e. requirements, not RSS) and a constant rate of memory references in the code. If their current RSS differ then some take more major faults and spend more time blocked than others. 
In a thrashing situation, you can see the small RSSs shrink to virtually zero, while the largest RSS will grow even further -- the thrashing processes are stealing each other's pages while the one which hardly ever faults keeps its complete working set in RAM. Bad for fairness, but can help throughput quite a bit. This effect is harder to trigger in 2.6. > So I predict that a true load control mechanism and policy would be > both an improvement over 2.4 and would correct 2.6 regressions vs. 2.4 > on underprovisioned machines. For now, we lack an implementation. I doubt that you can get performance anywhere near 2.4 just by adding load control to 2.6 unless you measure throughput and nothing else -- otherwise latency will kill you. I am convinced the key is not in _adding_ stuff, but _fixing_ what we have. IMO the question is: How much do we care? Machines with tight memory are not necessarily very concerned about paging (e.g. PDAs), and serious servers rarely operate under such conditions: Admins tend to add RAM when the paging load is significant. If you don't care _that_ much about thrashing in Linux, just tell people to buy more RAM. Computers are cheap, RAM even more so, 64 bit becomes affordable, and heavy paging sucks no matter how good a paging mechanism is. If you care enough to spend resources to address the problem, look at the major regressions in 2.5 and find out where they were a consequence of a deliberate trade-off decision and where it was an oversight which can be fixed or mitigated without sacrificing what was gained through the respective changes in 2.5. Obviously, performing regular testing with thrashing benchmarks would make lasting major regressions like those in the 2.5 development series much less likely in the future. Additional load control mechanisms create new problems (latency, increased complexity), so I think they should be a last resort, not some method to paper over deficiencies elsewhere in the kernel. 
Roger [-- Attachment #2: plot.png --] [-- Type: image/png, Size: 10196 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-08 19:49 ` Roger Luethi @ 2003-12-08 20:48 ` William Lee Irwin III 2003-12-09 0:27 ` Roger Luethi 2003-12-10 21:52 ` Andrea Arcangeli 0 siblings, 2 replies; 63+ messages in thread From: William Lee Irwin III @ 2003-12-08 20:48 UTC (permalink / raw) To: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote: >> Explicit load control is in order. 2.4 appears to work better in these >> instances because it victimizes one process at a time. It vaguely >> resembles load control with a random demotion policy (mmlist order is On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > Everybody I talked to seemed to assume that 2.4 does better due to the > way mapped pages are freed (i.e. swap_out in 2.4). While it is true > that the new VM as merged in 2.5.27 didn't exactly help with thrashing > performance, the main factors slowing 2.6 down were merged much later. What kinds of factors are these? How did you find these factors? When were these factors introduced? On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > Have a look at the graph attached to this message to get an idea of > what I am talking about (x axis is kernel releases after 2.5.0, y axis > is time to complete each benchmark). On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > It is important to note that different work loads show different > thrashing behavior. Some changes in 2.5 improved one thrashing benchmark > and made another worse. However, 2.4 seems to do better than 2.6 across > the board, which suggests that some elements are in fact better for > any types of thrashing. qsbench I'd pretty much ignore except as a control case, since there's nothing to do with a single process but let it thrash. 
On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote: >> Other important aspects of load control beyond the demotion policy are >> explicit suspension the execution contexts of the process address >> spaces chosen as its victims, complete eviction of the process address On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > I implemented suspension during memory shortage for 2.6 and I had some > code for complete eviction as well. It definitely helped for some > benchmarks. There's one problem, though: Latency. If a machine is > thrashing, a sys admin won't appreciate that her shell is suspended > when she tries to log in to correct the problem. I have some simple > criteria for selecting a process to suspend, but it's hard to get it > right every time (kind of like the OOM killer, although with smaller > damage for bad decisions). I'd be interested in seeing the specific criteria used, since the policy can strongly influence performance. Some of the most obvious policies do worse than random. On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > For workstations and most servers latency is so important compared to > throughput that I began to wonder whether implementing suspension was > actually worth it. After benchmarking 2.4 vs 2.6, though, I suspected > that there must be plenty of room for improvement _before_ such drastic > measures are necessary. It makes little sense to add suspension to 2.6 > if performance can be improved _without_ hurting latency. That's why > I shelved my work on suspension to find out and document when exactly > performance went down during 2.5. Ideally, the targets for suspension and complete eviction would be background tasks that aren't going to demand memory in the near future. Unfortunately that algorithm appears to require an oracle to implement. Also, the best criteria as I know of them are somewhat counterintuitive, so I'd like to be sure they were tried. 
On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote: >> 2.4 does not do any of this. >> The effect of not suspending the execution contexts of the demoted >> process address spaces is that the victimized execution contexts thrash >> while trying to reload the memory they need to execute. The effect of >> incomplete demotion is essentially livelock under sufficient stress. >> Its memory scheduling to what extent it has it is RR and hence fair, >> but the various caveats above justify "does not do any of this", >> particularly incomplete demotion. On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > One thing you can observe with 2.4 is that one process may force another > process out. Say you have several instances of the same program which > all have the same working set size (i.e. requirements, not RSS) and > a constant rate of memory references in the code. If their current RSS > differ then some take more major faults and spend more time blocked than > others. In a thrashing situation, you can see the small RSSs shrink > to virtually zero, while the largest RSS will grow even further -- > the thrashing processes are stealing each other's pages while the one > which hardly ever faults keeps its complete working set in RAM. Bad for > fairness, but can help throughput quite a bit. This effect is harder > to trigger in 2.6. There was a study conducted by someone involved with CKRM (included in some joint paper with the rest of the team) that actually charted out this property of 2.6 in terms of either faults taken over time or RSS over time, but compared it to a modified page replacement policy that actually had it to a greater degree than stock 2.6 instead of 2.4. On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote: >> So I predict that a true load control mechanism and policy would be >> both an improvement over 2.4 and would correct 2.6 regressions vs. 2.4 >> on underprovisioned machines. For now, we lack an implementation. 
On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > I doubt that you can get performance anywhere near 2.4 just by adding > load control to 2.6 unless you measure throughput and nothing else -- > otherwise latency will kill you. I am convinced the key is not in > _adding_ stuff, but _fixing_ what we have. A small problem with that kind of argument is that it's assuming the existence of some accumulation of small regressions that haven't been proven to exist (or have they?), where the kind of a priori argument I've made only needs to rely on the properties of the algorithms. But neither can actually provide a guarantee of results without testing. I suppose one point in favor of my "grab this tool off the shelf" approach is that there is quite a bit of history behind the methods and that they are well-understood. On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > IMO the question is: How much do we care? Machines with tight memory are > not necessarily very concerned about paging (e.g. PDAs), and serious > servers rarely operate under such conditions: Admins tend to add RAM > when the paging load is significant. The question is not if we care, but if we care about others. Economies aren't as kind to all users as they are to us. On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > If you don't care _that_ much about thrashing in Linux, just tell > people to buy more RAM. Computers are cheap, RAM even more so, 64 bit > becomes affordable, and heavy paging sucks no matter how good a paging > mechanism is. If I took this kind of argument seriously I'd be telling people to go shopping for new devices every time they run into a driver problem. I'm actually rather annoyed with hearing this line of reasoning repeated so many times over, and I'd appreciate not hearing it ever again (offenders, you know who you are). 
The issue at hand is improving how the kernel behaves on specific hardware configurations; the fact other hardware configurations exist is irrelevant. On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > If you care enough to spend resources to address the problem, look at > the major regressions in 2.5 and find out where they were a consequence > of a deliberate trade-off decision and where it was an oversight which > can be fixed or mitigated without sacrificing what was gained through > the respective changes in 2.5. Obviously, performing regular testing > with thrashing benchmarks would make lasting major regressions like > those in the 2.5 development series much less likely in the future. Yes, this does need to be done more regularly. c.f. the min_free_kb tuning problem Matt Mackall and I identified. On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > Additional load control mechanisms create new problems (latency, > increased complexity), so I think they should be a last resort, not > some method to paper over deficiencies elsewhere in the kernel. Who could disagree with this without looking ridiculous? Methods of last resort are not necessarily unavoidable; the OOM killer is an example of one that isn't avoidable. The issue is less clear cut here, since the effect is limited to degraded performance on a limited range of machines. But I would prefer not to send an "FOAD" message to the users of older hardware or users who can't afford fast hardware. The assumption methods of last resort create more problems than they solve appears to be based on the notion that they'll be used for more than methods of last resort. They're meant to handle the specific cases where they are beneficial, not infect the common case with behavior that's only appropriate for underpowered machines or other bogosity. 
That is, it should teach the kernel how to behave in the new situation where we want it to behave well, not change its behavior where it already behaves well. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-08 20:48 ` William Lee Irwin III @ 2003-12-09 0:27 ` Roger Luethi 2003-12-09 4:05 ` William Lee Irwin III 2003-12-10 21:52 ` Andrea Arcangeli 1 sibling, 1 reply; 63+ messages in thread From: Roger Luethi @ 2003-12-09 0:27 UTC (permalink / raw) To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: > On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > > Everybody I talked to seemed to assume that 2.4 does better due to the > > way mapped pages are freed (i.e. swap_out in 2.4). While it is true > > that the new VM as merged in 2.5.27 didn't exactly help with thrashing > > performance, the main factors slowing 2.6 down were merged much later. > > What kinds of factors are these? How did you find these factors? When > were these factors introduced? The relevant changes are all over the place, factors other than the pageout mechanism affect thrashing. I haven't identified all of them, though. I work on it occasionally. When I realized what had happened in 2.5 (took a while), I went for a tedious, systematic approach. It started with benchmarks: 3 benchmarks x some 85 kernels x 10 runs each. The graph you saw in my previous message represents a few hundred hours worth of benchmarking (required because variance in thrashing benchmarks is pretty bad). The real stuff is quite detailed but too large to post on the list. I scanned the resulting data for significant performance changes. For some of them, I used the Changelog and -- if necessary -- a binary search to nail down the patch set that caused the regression. The next step would be to find out whether the regression was "necessary" or not. Problem is, ten or twenty kernel releases later, you can't easily revert a patch and it's not always obvious which regression was fixed by the occasional performance improvement in a graph. 
So what it boils down to quite often is this: Figure out what the patch
intended to do, find out if it's still slowing down recent test kernels,
then try to achieve the same without causing the regression in
2.6.0-test11. I didn't have much time to spend on this so far, and the
original patch authors would be much more qualified to do this anyway.

> qsbench I'd pretty much ignore except as a control case, since there's
> nothing to do with a single process but let it thrash.

I like to keep qsbench around for a number of reasons: It's the benchmark
where 2.6 looks best (i.e. less bad). I can't rule out that somewhere
somebody has a real work load of that type. And it is an interesting
contrast to the real world compile benchmarks I care about.

> > right every time (kind of like the OOM killer, although with smaller
> > damage for bad decisions).
>
> I'd be interested in seeing the specific criteria used, since the
> policy can strongly influence performance. Some of the most obvious
> policies do worse than random.

Define "performance". My goal was to improve both responsiveness and
throughput of the system under extreme memory pressure. That also meant
that I wasn't interested in better throughput if latency got even worse.

I used a modified version of badness in oom_kill. I didn't put too much
effort into it, but I could explain the reasoning behind the changes. I
had a bunch of batch processes thrashing and I wanted to see them
selected and not the sshd or the login shell. It worked reasonably well
for me.

	/*
	 * Resident memory size of the process is the basis for the badness.
	 */
	points = p->mm->rss;

	/*
	 * CPU time is in seconds and run time is in minutes. There is no
	 * particular reason for this other than that it turned out to work
	 * very well in practice.
	 */
	cpu_time = (p->utime + p->stime) >> (SHIFT_HZ + 3);
	run_time = (get_jiffies_64() - p->start_time) >> (SHIFT_HZ + 10);

	points *= int_sqrt(cpu_time);
	points *= int_sqrt(int_sqrt(run_time));

	/*
	 * Niced processes are most likely less important.
	 */
	if (task_nice(p) > 0)
		points *= 4;

	/*
	 * Keep interactive processes around.
	 */
	if (task_interactive(p))
		points /= 4;

	/*
	 * Superuser processes are usually more important, so we make it
	 * less likely that we kill those.
	 */
	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
	    p->uid == 0 || p->euid == 0)
		points /= 2;

	/*
	 * We don't want to kill a process with direct hardware access.
	 * Not only could that mess up the hardware, but usually users
	 * tend to only have this flag set on applications they think
	 * of as important.
	 */
	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
		points /= 2;

> Ideally, the targets for suspension and complete eviction would be
> background tasks that aren't going to demand memory in the near future.

You lost me there. If I knew of a background task that was about to
demand more memory in the near future when memory is very tight anyway,
that would be the first process I'd suspend and evict before it gets a
chance to make matters worse.

So what are some potential criteria?

- process owner: sshd often runs as root. You don't want to stun that.
  OTOH, a sys admin will usually log in as a normal user before su'ing
  to root. So stunning non-root processes isn't a clear winner, either.
- process size: I favored stunning processes with large RSS because for
  my scenario that described the culprits quite well and left the
  interactive stuff alone.
- interactivity: Avoiding stunning tasks the scheduler considers
  interactive was a no-brainer.
- nice value: A niced process tends to be a batch process. Stun.
- time: OOM kill doesn't want to take down long running processes
  because of the work that is lost. For stunning, I don't care. In
  fact, they are probably batch processes, so stun them.
- fault frequency, I/O requests: When the paging disk is the bottleneck, it might be sensible to stun a process that produces lots of faults or does a lot of disk I/O. If there is an easy way to get that data then I missed it. There are certainly more, but that's what I can think of off the top of my head. I did note your reference to Carr's thesis (which I'm not familiar with), but like most papers I've seen on the subject it seems to focus on throughput. That's special-casing for batch processing or transaction systems, however; on a general-purpose computer, throughput means nothing if latency goes down the tube. > Unfortunately that algorithm appears to require an oracle to implement. Ah, we've all seen these optimal solutions for classic CS problems where the only gotcha is that you need omniscience. > Also, the best criteria as I know of them are somewhat counterintuitive, > so I'd like to be sure they were tried. Again, best for what? > On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > > I doubt that you can get performance anywhere near 2.4 just by adding > > load control to 2.6 unless you measure throughput and nothing else -- > > otherwise latency will kill you. I am convinced the key is not in > > _adding_ stuff, but _fixing_ what we have. > > A small problem with that kind of argument is that it's assuming the > existence of some accumulation of small regressions that haven't proven > to exist (or have they?), where the kind of a priori argument I've made Heh. Did you look at the graph in my previous message? Yes, there are several, independent regressions. What we don't know is which ones were unavoidable. For instance, the regression in 2.5.27 is quite possibly a necessary consequence of the new pageout mechanism and the benefits in upward scalability may well outweigh the costs for the low-end user. If we accept the notion that we don't care about what we can't measure (remember the interactivity debates?) 
and since nobody tested regularly for thrashing behavior, it seems quite likely that at least some of the regressions can be fixed, maybe at a slight cost in performance elsewhere, maybe not even that. There should be plenty of room for improvement: We are not talking 10% or 20%, but factors of 3 and more. > actually provide a guarantee of results without testing. I suppose one > point in favor of my "grab this tool off the shelf" approach is that > there is quite a bit of history behind the methods and that they are > well-understood. I know I sound like a broken record, but I have one problem with the off-the-shelf solutions I've found so far: They try to maximize throughput. They don't care about latency. I do. > On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > > IMO the question is: How much do we care? Machines with tight memory are > > not necessarily very concerned about paging (e.g. PDAs), and serious > > servers rarely operate under such conditions: Admins tend to add RAM > > when the paging load is significant. > > The question is not if we care, but if we care about others. Economies > aren't as kind to all users as they are to us. Right. But kernel hackers tend to work for companies that don't make their money by helping those who don't have any. And before you call me a cynic, look at the resources that go into making Linux capable of running on the top 0.something percent of machines and compare that to the interest with which this and similar threads have been received. I made my observation based on experience, not personal preference. That said, it is a fact that thrashing is not the hot issue it was 35 years ago, although the hardware (growing access gap RAM/disk) and usage patterns (latency matters a lot more, load is unpredictable and exogenous for the kernel) should have made the problem worse. 
The classic solutions are pretty much unworkable today and in most cases there is one economic solution which is indeed to throw more RAM at it. > > If you don't care _that_ much about thrashing in Linux, just tell > > people to buy more RAM. Computers are cheap, RAM even more so, 64 bit > > becomes affordable, and heavy paging sucks no matter how good a paging > > mechanism is. > > If I took this kind of argument seriously I'd be telling people to go > shopping for new devices every time they run into a driver problem. I'm No. Bad example. For starters, new devices are more likely to have driver problems, so your advice would be dubious even if they had the money :-P. The argument I hear for the regressions is that 2.6 is more scalable on high-end machines now and we just made a trade-off. It has happened before. Linux 1.x didn't have the hardware requirements of 2.4. The point I was trying to make with regard to thrashing was that I suspect it was written off as an inevitable trade-off too early. I believe that some of the regressions can be fixed without losing the gains in upward scalability _if_ we find the resources to do it. Quite frankly, playing with the suspension code was a lot more fun than investigating regressions in other people's work. But I hated the idea that Linux fails so miserably now where it used to do so well. At the very least I wanted to be sure that it was forced collateral damage and not just an oversight or bad tuning. Clearly, I do care. > The issue at hand is improving how the kernel behaves on specific > hardware configurations; the fact other hardware configurations exist > is irrelevant. Why do you make me remind you that we live in a world with resource constraints? What _is_ relevant is where the resources to do the work come from, which is a non-trivial problem if the work is to benefit people who don't have the money to buy more RAM. 
Just saying that it's unacceptable to screw over those with low-end hardware won't help anybody :-). If you are volunteering to help out, though, more power to you. > > the respective changes in 2.5. Obviously, performing regular testing > > with thrashing benchmarks would make lasting major regressions like > > those in the 2.5 development series much less likely in the future. > > Yes, this does need to be done more regularly. cf. the min_free_kbytes > tuning problem Matt Mackall and I identified. Well, tuning problems always make me want to try genetic algorithms. Regression testing would be much easier. Just run all benchmarks for every new kernel. Update chart. Done. ... It's scriptable even. > On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote: > > Additional load control mechanisms create new problems (latency, > > increased complexity), so I think they should be a last resort, not > > some method to paper over deficiencies elsewhere in the kernel. > > Who could disagree with this without looking ridiculous? Heh. It was carefully worded that way <g>. Seriously, though, it's not as ridiculous as it may seem. The problems we need to address are not even on the map for the classic papers I have seen on the subject. They suggest working sets or some sort of abstract load control, but 2.6 has problems that are very specific to that kernel and its mechanisms. There's no elegant, proven standard algorithm to solve those problems for us. > Methods of last resort are not necessarily unavoidable; the OOM killer > is an example of one that isn't avoidable. The issue is less clear cut That's debatable. Funny that you should take that example. > range of machines. But I would prefer not to send an "FOAD" message to > the users of older hardware or users who can't afford fast hardware. Agreed. > The assumption methods of last resort create more problems than they > solve appears to be based on the notion that they'll be used for more > than methods of last resort. 
They're meant to handle the specific cases No. My beef with load control is that once it's there people will say "See? Performance is back!" and whatever incentive there was to fix the real problems is gone. Which I could accept if it wasn't for the fact that a load control solution is always inferior to other improvements because of the massive latency increase. > where they are beneficial, not infect the common case with behavior > that's only appropriate for underpowered machines or other bogosity. > That is, it should teach the kernel how to behave in the new situation > where we want it to behave well, not change its behavior where it > already behaves well. Alright. One more thing: Thrashing is not a clear cut system state. You don't want to change behavior when it was doing well, so you need to be cautious about your trigger. Which means it will often not fire for border cases, that is light thrashing. I didn't do a survey, but I suspect that light thrashing (where there's just not quite enough memory) is much more common than the heavy variant. Now guess what? The 2.6 performance for light thrashing is absolutely abysmal. In fact 2.6 will happily spend a lot of time in I/O wait in a situation where 2.4 will cruise through the task without a hitch. I'm all for adding load control to deal with heavy thrashing that can't be handled any other way. But I am firmly opposed to pretending that it is a solution for the common case. Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
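The stunning criteria Roger brainstorms in the message above (process owner, RSS, interactivity, nice value, age, fault frequency) could be folded into a single ranking. The sketch below is purely illustrative: the field names, the weights, and the `stun_score`/`pick_victim` helpers are invented for this example and correspond to no actual kernel code, 2.4, 2.6 or otherwise.

```python
def stun_score(proc):
    """Rank a process as a candidate for suspension ("stunning") under
    memory pressure. All fields and weights are made up for illustration."""
    score = float(proc["rss_pages"])  # large RSS: suspending frees the most memory
    if proc["nice"] > 0:
        score *= 1.5                  # niced tasks tend to be batch jobs: prefer stunning
    if proc["interactive"]:
        score *= 0.1                  # leave tasks the scheduler considers interactive alone
    if proc["uid"] == 0:
        score *= 0.5                  # be gentler with root-owned processes (sshd etc.)
    return score

def pick_victim(procs):
    """Stun the highest-scoring process."""
    return max(procs, key=stun_score)
```

As Roger says, none of these criteria is a clear winner on its own; the weights here are exactly the kind of tuning knob the thread is arguing about.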
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-09 0:27 ` Roger Luethi @ 2003-12-09 4:05 ` William Lee Irwin III 2003-12-09 15:11 ` Roger Luethi 0 siblings, 1 reply; 63+ messages in thread From: William Lee Irwin III @ 2003-12-09 4:05 UTC (permalink / raw) To: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> What kinds of factors are these? How did you find these factors? When >> were these factors introduced? On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > The relevant changes are all over the place, factors other than the > pageout mechanism affect thrashing. I haven't identified all of them, > though. I work on it occasionally. > When I realized what had happened in 2.5 (took a while), I went for a > tedious, systematic approach. It started with benchmarks: 3 benchmarks > x some 85 kernels x 10 runs each. The graph you saw in my previous > message represents a few hundred hours worth of benchmarking (required > because variance in thrashing benchmarks is pretty bad). The real stuff > is quite detailed but too large to post on the list. Okay, I'm interested in getting my hands on this however you can get it to me. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> qsbench I'd pretty much ignore except as a control case, since there's >> nothing to do with a single process but let it thrash. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > I like to keep qsbench around for a number of reasons: It's the benchmark > where 2.6 looks best (i.e. less bad). I can't rule out that somewhere > somebody has a real work load of that type. And it is an interesting > contrast to the real world compile benchmarks I care about. I won't debate that; however, as far as load control goes, there's nothing to do. 
On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> I'd be interested in seeing the specific criteria used, since the >> policy can strongly influence performance. Some of the most obvious >> policies do worse than random. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > Define "performance". My goal was to improve both responsiveness and > throughput of the system under extreme memory pressure. That also > meant that I wasn't interested in better throughput if latency got > even worse. It was defined in two different ways: cpu utilization (inverse of iowait) and multiprogramming level (how many tasks it could avoid suspending). On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > So what are some potential criteria? > - process owner: sshd often runs as root. You don't want to stun that. > OTOH, a sys admin will usually log in as a normal user before su'ing > to root. So stunning non-root processes isn't a clear winner, either. That's a method I haven't even heard of. I'd be wary of this; there are a lot of daemons that might as well be swapped as they almost never run and aren't latency-sensitive when they do. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > - process size: I favored stunning processes with large RSS because > for my scenario that described the culprits quite well and left the > interactive stuff alone. Demoting the largest task is one that does worse than random. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > - interactivity: Avoiding stunning tasks the scheduler considers > interactive was a no-brainer. An odd result was that since the ancient kernels didn't have threads, their mm's had unique timeslices etc. Largest remaining quantum is one of the three equivalent "empirically best" policies. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > - nice value: A niced process tends to be a batch process. Stun. 
> - time: OOM kill doesn't want to take down long running processes > because of the work that is lost. For stunning, I don't care. > In fact, they are probably batch processes, so stun them. These sound unusual, but innocuous. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > - fault frequency, I/O requests: When the paging disk is the bottleneck, > it might be sensible to stun a process that produces lots of faults > or does a lot of disk I/O. If there is an easy way to get that data > then I missed it. Like the PFF bits for WS? On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > There are certainly more, but that's what I can think of off the top > of my head. I did note your reference to Carr's thesis (which I'm not > familiar with), but like most papers I've seen on the subject it seems > to focus on throughput. That's special-casing for batch processing or > transaction systems, however; on a general-purpose computer, throughput > means nothing if latency goes down the tube. Multiprogramming level and cpu utilization seemed to be more oriented toward concurrency than either throughput or latency. What exactly that's worth I'm not entirely sure. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> Also, the best criteria as I know of them are somewhat counterintuitive, >> so I'd like to be sure they were tried. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > Again, best for what? Cpu utilization, a.k.a. minimizing iowait. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> A small problem with that kind of argument is that it's assuming the >> existence of some accumulation of small regressions that haven't been proven >> to exist (or have they?), where the kind of a priori argument I've made On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > Heh. Did you look at the graph in my previous message? Yes, there are > several, independent regressions. 
What we don't know is which ones were > unavoidable. For instance, the regression in 2.5.27 is quite possibly a > necessary consequence of the new pageout mechanism and the benefits in > upward scalability may well outweigh the costs for the low-end user. I don't see other kernels to compare it to. I guess you can use earlier versions of itself. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > If we accept the notion that we don't care about what we can't measure > (remember the interactivity debates?) and since nobody tested regularly > for thrashing behavior, it seems quite likely that at least some of > the regressions can be fixed, maybe at a slight cost in performance > elsewhere, maybe not even that. > There should be plenty of room for improvement: We are not talking 10% > or 20%, but factors of 3 and more. They should probably get cleaned up. Where are your benchmarks? On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> ... I suppose one >> point in favor of my "grab this tool off the shelf" approach is that >> there is quite a bit of history behind the methods and that they are >> well-understood. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > I know I sound like a broken record, but I have one problem with > the off-the-shelf solutions I've found so far: They try to maximize > throughput. They don't care about latency. I do. I have a vague notion we're thinking of different cases. What kinds of overcommitment levels are you thinking of? On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> The question is not if we care, but if we care about others. Economies >> aren't as kind to all users as they are to us On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > Right. But kernel hackers tend to work for companies that don't make > their money by helping those who don't have any. 
And before you call > me a cynic, look at the resources that go into making Linux capable of > running on the top 0.something percent of machines and compare that to > the interest with which this and similar threads have been received. I > made my observation based on experience, not personal preference. > That said, it is a fact that thrashing is not the hot issue it was > 35 years ago, although the hardware (growing access gap RAM/disk) > and usage patterns (latency matters a lot more, load is unpredictable > and exogenous for the kernel) should have made the problem worse. The > classic solutions are pretty much unworkable today and in most cases > there is one economic solution which is indeed to throw more RAM at it. Top 0.001%? For expanded range, both endpoints matter. My notion of scalability is running like greased lightning (compared to other OS's) on everything from some ancient toaster with a 0.1MHz cpu and 256KB RAM to a 16384x/16PB superdupercomputer. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> If I took this kind of argument seriously I'd be telling people to go >> shopping for new devices every time they run into a driver problem. I'm On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > No. Bad example. For starters, new devices are more likely to have driver > problems, so your advice would be dubious even if they had the money :-P. > The argument I hear for the regressions is that 2.6 is more scalable > on high-end machines now and we just made a trade-off. It has happened > before. Linux 1.x didn't have the hardware requirements of 2.4. No! No! NO!!! (a) buying a G3 will not make my sun3 boot Linux (b) buying an SS1 will not make my Decstation 5000/200's PMAD-AA driver work (c) buying another multia will not make my multia stop deadlocking while swapping over NFS No matter how many pieces of hardware you buy, the original one still isn't driven correctly by the software. Hardware *NEVER* fixes software. 
On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > The point I was trying to make with regard to thrashing was that > I suspect it was written off as an inevitable trade-off too early. > I believe that some of the regressions can be fixed without losing the > gains in upward scalability _if_ we find the resources to do it. > Quite frankly, playing with the suspension code was a lot more fun than > investigating regressions in other people's work. But I hated the idea > that Linux fails so miserably now where it used to do so well. At the > very least I wanted to be sure that it was forced collateral damage > and not just an oversight or bad tuning. Clearly, I do care. They should get cleaned up; start sending data over. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> The issue at hand is improving how the kernel behaves on specific >> hardware configurations; the fact other hardware configurations exist >> is irrelevant. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > Why do you make me remind you that we live in a world with resource > constraints? What _is_ relevant is where the resources to do the work > come from, which is a non-trivial problem if the work is to benefit > people who don't have the money to buy more RAM. Just saying that it's > unacceptable to screw over those with low-end hardware won't help anybody > :-). If you are volunteering to help out, though, more power to you. It's unlikely I'll have any trouble coughing up code. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> Yes, this does need to be done more regularly. cf. the min_free_kbytes >> tuning problem Matt Mackall and I identified. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > Well, tuning problems always make me want to try genetic algorithms. > Regression testing would be much easier. Just run all benchmarks for > every new kernel. Update chart. Done. ... It's scriptable even. 
This was a bit easier than that; the boot-time default was 1MB regardless of the size of RAM; akpm picked some scaling algorithm out of a hat and it pretty much got solved as it shrank with memory. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> Who could disagree with this without looking ridiculous? On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > Heh. It was carefully worded that way <g>. Seriously, though, it's not > as ridiculous as it may seem. The problems we need to address are not > even on the map for the classic papers I have seen on the subject. They > suggest working sets or some sort of abstract load control, but 2.6 > has problems that are very specific to that kernel and its mechanisms. > There's no elegant, proven standard algorithm to solve those problems > for us. WS had its own load control bundled with it. AIUI most replacement algorithms need a load control tailored to them (some seem to do okay with an arbitrary choice), so there's some synthesis involved. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> Methods of last resort are not necessarily unavoidable; the OOM killer >> is an example of one that isn't avoidable. The issue is less clear cut On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > That's debatable. Funny that you should take that example. Not really. e.g. -aa's OOM killer just uses a trivial policy that shoots the requesting task. Eliminating it entirely is theoretically possible with ridiculous amounts of accounting, but I'm relatively certain it's infeasible to implement. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> The assumption methods of last resort create more problems than they >> solve appears to be based on the notion that they'll be used for more >> than methods of last resort. They're meant to handle the specific cases On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > No. 
My beef with load control is that once it's there people will say > "See? Performance is back!" and whatever incentive there was to fix the > real problems is gone. Which I could accept if it wasn't for the fact > that a load control solution is always inferior to other improvements > because of the massive latency increase. I think we have vastly different levels of overcommitment in mind. On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote: >> where they are beneficial, not infect the common case with behavior >> that's only appropriate for underpowered machines or other bogosity. >> That is, it should teach the kernel how to behave in the new situation >> where we want it to behave well, not change its behavior where it >> already behaves well. On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > Alright. One more thing: Thrashing is not a clear cut system state. You > don't want to change behavior when it was doing well, so you need to > be cautious about your trigger. Which means it will often not fire > for border cases, that is light thrashing. I didn't do a survey, but > I suspect that light thrashing (where there's just not quite enough > memory) is much more common than the heavy variant. Now guess what? > The 2.6 performance for light thrashing is absolutely abysmal. In fact > 2.6 will happily spend a lot of time in I/O wait in a situation where > 2.4 will cruise through the task without a hitch. > I'm all for adding load control to deal with heavy thrashing that can't > be handled any other way. But I am firmly opposed to pretending that > it is a solution for the common case. The common case is pretty much zero or slim overcommitment these days. The case I have in mind is pretty much 10x RAM committed. (Sum of WSS's, not non-overcommit-related.) -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
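Roger's earlier remark that regression testing "is scriptable even" can be made concrete. Below is a minimal sketch, assuming a hypothetical `./run-bench` helper (or any `runner` callable) that executes one benchmark under one kernel and reports elapsed seconds; in reality each kernel would have to be booted first. The benchmark names and run count are taken from the thread.

```python
import statistics
import subprocess

BENCHMARKS = ["kbuild", "efax", "qsbench"]  # the three workloads used in this thread
RUNS = 10  # variance in thrashing benchmarks is bad, so average many runs

def run_benchmark(kernel, bench):
    """Assumed external runner: executes one benchmark under one kernel
    and prints elapsed seconds. './run-bench' is hypothetical."""
    out = subprocess.run(["./run-bench", kernel, bench],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def chart_row(kernel, runner=run_benchmark):
    """One chart row per kernel: benchmark -> (mean, stdev) over RUNS runs."""
    row = {}
    for bench in BENCHMARKS:
        times = [runner(kernel, bench) for _ in range(RUNS)]
        row[bench] = (statistics.mean(times), statistics.stdev(times))
    return row
```

Averaging many runs matters here precisely because, as Roger notes, variance in thrashing benchmarks is bad.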
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-09 4:05 ` William Lee Irwin III @ 2003-12-09 15:11 ` Roger Luethi 2003-12-09 16:04 ` Rik van Riel ` (2 more replies) 0 siblings, 3 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-09 15:11 UTC (permalink / raw) To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote: > > because variance in thrashing benchmarks is pretty bad). The real stuff > > is quite detailed but too large to post on the list. > > Okay, I'm interested in getting my hands on this however you can get it > to me. http://hellgate.ch/bench/thrash.tar.gz The tar contains one PostScript plot file and three data files: {kbuild,efax,qsbench}.dat. Numbers are execution times in seconds. The first column is the average of each row, and the values of each row are sorted in ascending order. Other than that, it's the raw data. The kernels were not tested in order, so it's definitely not the hardware that's been deteriorating over time. I repeated some tests that looked like mistakes: In 2.5.32, for instance, it seems odd that both kbuild and qsbench are slower but efax isn't. I believe the data is accurate, but I can do reruns upon request. A fourth file, plot.ps, contains the graphs I use right now: You can see how both average execution time and variance have grown from 2.5.0 to 2.6.0-test11. The graph is precise enough to determine the kernel release that caused a regression. The more fine-grained work is not complete and I'm not sure it ever will be. Some _preliminary_ results (i.e. take with a grain of salt): The regression for kbuild in 2.5.48 was caused by a patch titled "better inode reclaim balancing". In 2.5.49, "strengthen the `incremental min' logic in the page". 
In 2.6.0-test3 (aka 2.5.78), it's a subtle interaction between "fix kswapd throttling" and "decaying average of zone pressure" -- IIRC reverting the former gains nothing unless you also revert the latter. I'd have to dig through my notes. > On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote: > > Define "performance". My goal was to improve both responsiveness and > > throughput of the system under extreme memory pressure. That also > > meant that I wasn't interested in better throughput if latency got > > even worse. > > It was defined in two different ways: cpu utilization (inverse of iowait) > and multiprogramming level (how many tasks it could avoid suspending). Yeah, that's the classic. It _is_ throughput. Unless you have task priorities (i.e. eye candy or SETI@home competing for cycles), CPU utilization is an excellent approximation for throughput. And the benefit of maintaining a high level of multiprogramming is that you have a better chance to have a runnable process at any time, meaning better CPU utilization meaning higher throughput. The classic strategies based on these criteria work for transaction and batch systems. They are all but useless, though, for a workstation and even most modern servers, due to assumptions that are incorrect today (remember all the degrees of freedom a scheduler had 30 years ago) and additional factors that only became crucial in the past few decades (latency again). > > - process size: I favored stunning processes with large RSS because > > for my scenario that described the culprits quite well and left the > > interactive stuff alone. > > Demoting the largest task is one that does worse than random. We only know that to be true for irrelevant optimization criteria. > > - fault frequency, I/O requests: When the paging disk is the bottleneck, > > it might be sensible to stun a process that produces lots of faults > > or does a lot of disk I/O. If there is an easy way to get that data > > then I missed it. 
> > Like the PFF bits for WS? Yup. PFF doesn't cover all disk I/O, though. Suspending a process that is I/O bound even with a low PFF improves thrashing performance as well, because disk I/O is the bottleneck. > > Heh. Did you look at the graph in my previous message? Yes, there are > > several, independent regressions. What we don't know is which ones were > > unavoidable. For instance, the regression in 2.5.27 is quite possibly a > > necessary consequence of the new pageout mechanism and the benefits in > > upward scalability may well outweigh the costs for the low-end user. > > I don't see other kernels to compare it to. I guess you can use earlier > versions of itself. You don't need anything to compare it to. You can investigate the performance regression and determine whether it was a logical consequence of the intended change in behavior. Suppose you found that the problem in 2.5.27 is that shared pages are unmapped too quickly -- that would be easy to fix without affecting the benefits of the new VM. I think the more likely candidates for improvements are later in 2.5, though. > Top 0.001%? For expanded range, both endpoints matter. My notion of > scalability is running like greased lightning (compared to other OS's) > on everything from some ancient toaster with a 0.1MHz cpu and 256KB RAM > to a 16384x/16PB superdupercomputer. Well, that's nice. I agree. IIRC, though, each major release had more demanding minimum requirements (in terms of RAM). The range covered has been growing only because upward scalability grew faster. I can't help but notice that some of your statements sound a lot like wishful thinking. > > No. Bad example. For starters, new devices are more likely to have driver > > problems, so your advice would be dubious even if they had the money :-P. > > The argument I hear for the regressions is that 2.6 is more scalable > > on high-end machines now and we just made a trade-off. It has happened > > before. 
Linux 1.x didn't have the hardware requirements of 2.4. > > No! No! NO!!! > > (a) buying a G3 will not make my sun3 boot Linux > (b) buying an SS1 will not make my Decstation 5000/200's PMAD-AA > driver work > (c) buying another multia will not make my multia stop deadlocking > while swapping over NFS > > No matter how many pieces of hardware you buy, the original one still > isn't driven correctly by the software. Hardware *NEVER* fixes software. Look, I became the maintainer of via-rhine because nobody else wanted to fix the driver for a very common, but barely documented piece of cheap hardware. People were just told to buy another cheap card. That's the reality of Linux. Don't forget what we are talking about, though. Once you are seriously tight on memory, you can only mitigate the damage in software; the only solution is to add more RAM. Thrashing is not a bug like a broken driver. I am currently writing a paper on the subject, and the gist of it will likely be that we should try to prevent thrashing from happening as long as possible (with good page replacement, I/O scheduling, etc.), but when it's inevitable we're pretty much done for. Load control may or may not be worth adding, but it only helps in some special cases and does not seem clearly beneficial in general-purpose systems. > This was a bit easier than that; the boot-time default was 1MB regardless > of the size of RAM; akpm picked some scaling algorithm out of a hat and > it pretty much got solved as it shrank with memory. With all due respect for akpm's hat, sometimes I wish we had some good heuristics for this stuff. > > 2.6 will happily spend a lot of time in I/O wait in a situation where > > 2.4 will cruise through the task without a hitch. > > > > I'm all for adding load control to deal with heavy thrashing that can't > > be handled any other way. But I am firmly opposed to pretending that > > it is a solution for the common case. 
> > The common case is pretty much zero or slim overcommitment these days. > The case I have in mind is pretty much 10x RAM committed. (Sum of WSS's, > not non-overcommit-related.) So you want to help people who for some reason _have_ to run several _batch_ jobs _concurrently_ (otherwise load control is ineffective) on a low-end machine, resulting in a 10x overcommit system? Why don't we buy those two or three guys a DIMM each? I'm afraid you have a solution in search of a problem. Nobody runs a 10x overcommit system. And if they did, they would find it doesn't work well with 2.4, either, so no one will complain about a regression. What does happen, though, is that people go close to the limit of what their low-end hardware supports, which will work perfectly with 2.4 and collapse with 2.6. The real problem, the one many people will hit, the one the very complaint that started this thread was about, is light and medium overcommit. And load control is not the answer to that. Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
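For reference, the .dat layout Roger describes above (first column the row average, remaining columns the individual run times sorted in ascending order) could be read and sanity-checked with something like the sketch below; the leading kernel-name column is an assumption about the exact file layout, not something stated in the thread.

```python
def parse_dat_line(line):
    """Parse one row of a .dat file: 'kernel avg t1 t2 ... tn'.
    The kernel-name column is an assumed part of the layout."""
    fields = line.split()
    kernel = fields[0]
    avg = float(fields[1])
    times = [float(f) for f in fields[2:]]
    # Sanity checks matching the format described in the thread.
    assert times == sorted(times), "run times should be sorted ascending"
    assert abs(avg - sum(times) / len(times)) < 0.05, "first column is the row average"
    return kernel, avg, times
```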
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-09 15:11 ` Roger Luethi @ 2003-12-09 16:04 ` Rik van Riel 2003-12-09 16:31 ` Roger Luethi 2003-12-09 18:31 ` William Lee Irwin III 2003-12-09 19:38 ` William Lee Irwin III 2 siblings, 1 reply; 63+ messages in thread From: Rik van Riel @ 2003-12-09 16:04 UTC (permalink / raw) To: Roger Luethi Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel, Martin J. Bligh On Tue, 9 Dec 2003, Roger Luethi wrote: > The classic strategies based on these criteria work for transaction and > batch systems. They are all but useless, though, for a workstation and > even most modern servers, due to assumptions that are incorrect today > (remember all the degrees of freedom a scheduler had 30 years ago) > and additional factors that only became crucial in the past few decades > (latency again). Don't forget that computers have gotten a lot slower over the years ;) Swapping out a 64kB process to a disk that does 180kB/s is a lot faster than swapping out a 100MB process to a disk that does 50MB/s ... Once you figure in seek times, the picture looks even worse. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 63+ messages in thread
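Rik's back-of-the-envelope figures are easy to check: even ignoring seeks, swapping out a process that fills a typical machine's RAM takes several times longer today than it did then.

```python
# Sequential-transfer time to swap out one process, using the figures
# from Rik's message (64 kB process on a 180 kB/s disk vs. a 100 MB
# process on a 50 MB/s disk). Seek time would make the modern case worse.

def swapout_seconds(process_kb, disk_kb_per_s):
    """Time to write a whole process image out to swap, ignoring seeks."""
    return process_kb / disk_kb_per_s

old = swapout_seconds(64, 180)                # ~0.36 s
new = swapout_seconds(100 * 1024, 50 * 1024)  # 2.0 s
```

So the whole-process swap-out got more than five times slower in absolute terms, despite a disk that is nearly 300 times faster.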
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-09 16:04 ` Rik van Riel @ 2003-12-09 16:31 ` Roger Luethi 0 siblings, 0 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-09 16:31 UTC (permalink / raw) To: Rik van Riel Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel, Martin J. Bligh On Tue, 09 Dec 2003 11:04:49 -0500, Rik van Riel wrote: > > The classic strategies based on these criteria work for transaction and > > batch systems. They are all but useless, though, for a workstation and > > even most modern servers, due to assumptions that are incorrect today > > (remember all the degrees of freedom a scheduler had 30 years ago) > > and additional factors that only became crucial in the past few decades > > (latency again). > > Don't forget that computers have gotten a lot slower > over the years ;) > > Swapping out a 64kB process to a disk that does 180kB/s > is a lot faster than swapping out a 100MB process to a > disk that does 50MB/s ... > > Once you figure in seek times, the picture looks even > worse. Exactly -- I did mention the growing access time gap between RAM and disks in an earlier message. Yes, there are quite a few developments in hardware and in the way we use computers (interactive, Client/Server, dedicated machines, etc.) that made thrashing pretty much unsolvable at an OS level. Fortunately, fixing it in hardware by adding RAM works for most. What we _can_ do in software, though, is prevent thrashing as long as possible. Comparing 2.4 and 2.6 shows that a kernel can still make a significant difference with smart pageout algorithms, I/O scheduling etc. But you won't get much help with that from ancient papers. Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-09 15:11 ` Roger Luethi 2003-12-09 16:04 ` Rik van Riel @ 2003-12-09 18:31 ` William Lee Irwin III 2003-12-09 19:38 ` William Lee Irwin III 2 siblings, 0 replies; 63+ messages in thread From: William Lee Irwin III @ 2003-12-09 18:31 UTC (permalink / raw) To: Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > I'm afraid you have a solution in search of a problem. Nobody runs a > 10x overcommit system. And if they did, they would find it doesn't work > well with 2.4, either, so no one will complain about a regression. What > does happen, though, is that people go close to the limit of what > their low-end hardware supports, which will work perfectly with 2.4 > and collapse with 2.6. No, I've got a guy in Russia complaining about 2.6 not doing well on one of his boxen. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-09 15:11 ` Roger Luethi 2003-12-09 16:04 ` Rik van Riel 2003-12-09 18:31 ` William Lee Irwin III @ 2003-12-09 19:38 ` William Lee Irwin III 2003-12-10 13:58 ` Roger Luethi 2 siblings, 1 reply; 63+ messages in thread From: William Lee Irwin III @ 2003-12-09 19:38 UTC (permalink / raw) To: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > The more fine-grained work is not complete and I'm not sure it ever > will be. Some _preliminary_ results (i.e. take with a grain of salt): > The regression for kbuild in 2.5.48 was caused by a patch titled "better > inode reclaim balancing". In 2.5.49, "strengthen the `incremental > min' logic in the page". In 2.6.0-test3 (aka 2.6.78), it's a subtle > interaction between "fix kswapd throttling" and "decaying average of > zone pressure" -- IIRC reverting the former gains nothing unless you > also revert the latter. I'd have to dig through my notes. Okay, it sounds like you're well on our way to cleaning things up. Not too hard to chime in as needed. On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote: >> It was defined in two different ways: cpu utilization (inverse of iowait) >> and multiprogramming level (how many tasks it could avoid suspending). On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > Yeah, that's the classic. It _is_ throughput. Unless you have task > priorities (i.e. eye candy or SETI@home competing for cycles), CPU > utilization is an excellent approximation for throughput. And the > benefit of maintaining a high level of multiprogramming is that you > have a better chance to have a runnable process at any time, meaning > better CPU utilization meaning higher throughput. > The classic strategies based on these criteria work for transaction and > batch systems. 
They are all but useless, though, for a workstation and > even most modern servers, due to assumptions that are incorrect today > (remember all the degrees of freedom a scheduler had 30 years ago) > and additional factors that only became crucial in the past few decades > (latency again). This assessment is inaccurate. The performance metrics are not entirely useless, and it's rather trivial to recover data useful for modern scenarios based on them. The driving notion from the iron age (I guess the stone age was when the only way to support virtual memory was swapping) was that getting stuck on io was the thing preventing the cpu from getting used. Nowadays, we burn the cpu nicely enough with GNOME etc. but have to worry about what happens to some task or other. So: (a) Multiprogramming level is obviously trying to minimize the amount of swapping out going on. i.e. this is trying to limit the worst case handling to the worst case. Minimal adjustments are required. e.g. consider number of process address spaces swapped out directly as something to be minimized instead of total minus that. (b) CPU utilization is essentially trying to minimize how much the system gets stuck on io. This one needs more adjustment for modern systems, which tend to utilize the cpu regardless of io being in flight. Number of blocked tasks is a close approximation, directly using iowait, and so on are various other possible substitutes. On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote: >> Demoting the largest task is one that does worse than random. On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > We only know that to be true for irrelevant optimization criteria. The above explains how and why they are relevant. It's also not difficult to understand why it goes wrong: the operation is too expensive. On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote: >> Like the PFF bits for WS? On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > Yup. 
PFF doesn't cover all disk I/O, though. Suspending a process that > is I/O bound even with a low PFF improves thrashing performance as well, > because disk I/O is the bottleneck. That's a significantly different use for it; AIUI it was an heuristic to estimate the WSS without periodic catastrophes like WSinterval, though ISTR bits about "extremely high" rates contracting the estimated size. On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote: >> Top 0.001%? For expanded range, both endpoints matter. My notion of >> scalability is running like greased lightning (compared to other OS's) >> on everything from some ancient toaster with a 0.1MHz cpu and 256KB RAM >> to a 16384x/16PB superdupercomputer. On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > Well, that's nice. I agree. IIRC, though, each major release had more > demanding minimum requirements (in terms of RAM). The range covered > has been growing only because upward scalability grew faster. I can't > help but notice that some of your statements sound a lot like wishful > thinking. This is not wishful thinking; it's an example that tries to illustrate the goal. It's rather clear to me that neither end of the spectrum mentioned above is even functional in current source (well, 4096x might boot with deep enough stacks, though I'd expect it to perform too poorly to be considered functional). On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote: >> No matter how many pieces of hardware you buy, the original one still >> isn't driven correctly by the software. Hardware *NEVER* fixes software. On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > Look, I became the maintainer of via-rhine because nobody else wanted to > fix the driver for a very common, but barely documented piece of > cheap hardware. People were just told to buy another cheap card. That's > the reality of Linux. That's an _unfortunate_ reality. 
And you changed it in a similar way to how we want to support the lower end, though I'm going a bit lower end than you are. On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > Don't forget what we are talking about, though. Once you are seriously > tight on memory, you can only mitigate the damage in software, the only > solution is to add more RAM. Thrashing is not a bug like a broken driver. Covering for low quality hardware is generally a kernel's job. c.f. the "how many address lines did this device snip off" games that have even infected the VM. Of course, it's not as clear cut as an oops, no. On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > I am currently writing a paper on the subject, and the gist of it will > likely be that we should try to prevent thrashing from happening as > long as possible (with good page replacement, I/O scheduling, etc.), > but when it's inevitable we're pretty much done for. Load control may > or may not be worth adding, but it only helps in some special cases > and does not seem clearly beneficial in general-purpose systems. Figures. On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote: >> The common case is pretty much zero or slim overcommitment these days. >> The case I have in mind is pretty much 10x RAM committed. (Sum of WSS's, >> not non-overcommit-related.) On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > So you want to help people who for some reason _have_ to run several > _batch_ jobs _concurrently_ (otherwise load control is ineffective) > on a low-end machine to result in a 10x overcommit system? Why don't > we buy those two or three guys a DIMM each? > I'm afraid you have a solution in search of a problem. Nobody runs a > 10x overcommit system. And if they did, they would find it doesn't work > well with 2.4, either, so no one will complain about a regression. 
What > does happen, though, is that people go close to the limit of what > their low-end hardware supports, which will work perfectly with 2.4 > and collapse with 2.6. > The real problem, the one many people will hit, the one the very > complaint that started this thread was about, is light and medium > overcommit. And load control is not the answer to that. No, I've got some guy in Russia complaining about 2.6 sucking on his box who has a 10x overcommit ratio (approximate sum of WSS's). (Also, whatever this thread was, the In-Reply-To: chain was broken somewhere and the first thing I saw was the post I replied to.) Hmm. I was trying to avoid duplicating effort and/or preempt someone's code they were working on because I'd heard either you and/or Nick Piggin were working on the stuff. You're doing something useful and relevant... Well, I guess I might as well help with your paper. If the demotion criteria you're using are anything like what you posted, they risk invalidating the results, since they're apparently based on something worse than random. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-09 19:38 ` William Lee Irwin III @ 2003-12-10 13:58 ` Roger Luethi 2003-12-10 17:47 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-10 13:58 UTC (permalink / raw) To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Tue, 09 Dec 2003 11:38:01 -0800, William Lee Irwin III wrote: > On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote: > > The more fine-grained work is not complete and I'm not sure it ever > > will be. Some _preliminary_ results (i.e. take with a grain of salt): > > Okay, it sounds like you're well on our way to cleaning things up. Actually, I'm rather well on my way wrapping things up. I documented in detail how much 2.6 sucks in this area and where the potential for improvements would have likely been, but now I've got a deadline to meet and other things on my plate. For me this discussion just confirmed that my approach fails to draw much interest, either because there are better alternatives or because heavy paging and medium thrashing are generally not considered interesting problems. > > The classic strategies based on these criteria work for transaction and > > batch systems. They are all but useless, though, for a workstation and > > even most modern servers, due to assumptions that are incorrect today > > (remember all the degrees of freedom a scheduler had 30 years ago) > > and additional factors that only became crucial in the past few decades > > (latency again). > > This assessment is inaccurate. The performance metrics are not entirely > useless, and it's rather trivial to recover data useful for modern > scenarios based on them. The driving notion from the iron age (I guess I said _strategies_ rather than papers or research because I realize that the metrics can be an important part of the modern picture. 
It's just the ancient recipes that once solved the problem that are useless for typical modern usage patterns. > >> Demoting the largest task is one that does worse than random. > > > We only know that to be true for irrelevant optimization criteria. > > The above explains how and why they are relevant. > > It's also not difficult to understand why it goes wrong: the operation > is too expensive. What goes wrong is that once you start suspending tasks, you have a hard time telling the interactive tasks apart from the batch load. This may not be much of a problem on a 10x overcommit system, because that's presumably quite unresponsive anyway, but it does matter a lot if you have an interactive system that just crossed the border to thrashing. Our apparent differences come from the fact that we try to solve different problems as you correctly noted: You are concerned with extreme overcommit, while I am concerned that 2.6 takes several times longer than 2.4 to complete a task under slight overcommit. I have no reason to doubt that load control will help you solve your problem. It may help with medium thrashing and it might even keep latency within reasonable bounds. I do think, however, that we should investigate _first_ how we lost over 50% of the performance we had in 2.5.40 for both compile benchmarks. > (Also, whatever this thread was, the In-Reply-To: chain was broken > somewhere and the first thing I saw was the post I replied to.) You can read the whole thread starting from here: http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=M794.3OE.7%40gated-at.bofh.it > Well, I guess I might as well help with your paper. If the demotion > criteria you're using are anything like what you posted, they risk > invalidating the results, since they're apparently based on something > worse than random. Worse than random may still improve throughput, though, compared to doing nothing, right? And I did measure improvements. 
There are variables other than the demotion criteria that I found can be important, to name a few:
- Trigger: Under which circumstances is suspending any processes considered? How often?
- Eviction: Does regular pageout take care of the memory of a suspended process, or are pages marked old or even unmapped upon stunning?
- Release: Is the stunning queue a simple FIFO? How long do the processes stay there? Does a process get a bonus after it's woken up again -- bigger quantum, chunk of free memory, prepaged working set before stunning?
There's quite a bit of complexity involved and many variables will depend on the scenario. Sort of like interactivity, except lots of people were affected by the interactivity tuning and only few will notice and test load control. The key question with regards to load control remains: How do you keep a load controlled system responsive? Cleverly detect interactive processes and spare them, or wake them up again quickly enough? How? Or is the plan to use load control where responsiveness doesn't matter anyway? Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 13:58 ` Roger Luethi @ 2003-12-10 17:47 ` William Lee Irwin III 2003-12-10 22:23 ` Roger Luethi 2003-12-10 21:04 ` Rik van Riel 2003-12-10 23:30 ` Helge Hafting 2 siblings, 1 reply; 63+ messages in thread From: William Lee Irwin III @ 2003-12-10 17:47 UTC (permalink / raw) To: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > Actually, I'm rather well on my way wrapping things up. I documented > in detail how much 2.6 sucks in this area and where the potential for > improvements would have likely been, but now I've got a deadline to > meet and other things on my plate. Well, it'd be nice to see the code, then. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > For me this discussion just confirmed that my approach fails to draw much > interest, either because there are better alternatives or because heavy > paging and medium thrashing are generally not considered interesting > problems. They're worthwhile; I didn't even realize there were such problems until you pointed them out. I had presumed it was due to physical scanning. On Tue, 09 Dec 2003 11:38:01 -0800, William Lee Irwin III wrote: >> This assessment is inaccurate. The performance metrics are not entirely >> useless, and it's rather trivial to recover data useful for modern >> scenarios based on them. The driving notion from the iron age (I guess On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > I said _strategies_ rather than papers or research because I realize > that the metrics can be an important part of the modern picture. It's > just the ancient recipes that once solved the problem that are useless > for typical modern usage patterns. Hmm. There were a wide variety of algorithms. On Tue, 09 Dec 2003 11:38:01 -0800, William Lee Irwin III wrote: >> The above explains how and why they are relevant. 
>> It's also not difficult to understand why it goes wrong: the operation >> is too expensive. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > What goes wrong is that once you start suspending tasks, you have a > hard time telling the interactive tasks apart from the batch load. > This may not be much of a problem on a 10x overcommit system, because > that's presumably quite unresponsive anyway, but it does matter a lot if > you have an interactive system that just crossed the border to thrashing. It's effectively a form of longer-term process scheduling. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > Our apparent differences come from the fact that we try to solve > different problems as you correctly noted: You are concerned with > extreme overcommit, while I am concerned that 2.6 takes several times > longer than 2.4 to complete a task under slight overcommit. Yes, my focus is pushing back the point of true thrashing as opposed to the interior points of the range. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > I have no reason to doubt that load control will help you solve your > problem. It may help with medium thrashing and it might even keep > latency within reasonable bounds. I do think, however, that we should > investigate _first_ how we lost over 50% of the performance we had in > 2.5.40 for both compile benchmarks. Perfectly reasonable. On Tue, 09 Dec 2003 11:38:01 -0800, William Lee Irwin III wrote: >> Well, I guess I might as well help with your paper. If the demotion >> criteria you're using are anything like what you posted, they risk >> invalidating the results, since they're apparently based on something >> worse than random. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > Worse than random may still improve throughput, though, compared to > doing nothing, right? And I did measure improvements. 
I didn't see any of the methods compared to no load control, so I don't have any information on that. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > There are variables other than the demotion criteria that I found can > be important, to name a few: > - Trigger: Under which circumstances is suspending any processes > considered? How often? This is generally part of the load control algorithm, but it essentially just tries to detect levels of overcommitment that would degrade performance so it can resolve them. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > - Eviction: Does regular pageout take care of the memory of a suspended > process, or are pages marked old or even unmapped upon stunning? This is generally unmapping and evicting upon suspension. The effect isn't immediate anyway, since io is required, and batching the work for io contiguity etc. is a fair amount of savings, so there's little or no incentive to delay this apart from keeping io rates down to where user io and VM io aren't in competition. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > - Release: Is the stunning queue a simple FIFO? How long do the > processes stay there? Does a process get a bonus after it's woken up > again -- bigger quantum, chunk of free memory, prepaged working set > before stunning? It's a form of process scheduling. Memory scheduling policies are not discussed very much in the sources I can get at, so some synthesis may be required unless material can be found on that, but in general this isn't a very interesting problem (at least not since the 70's or earlier). FreeBSD has an implementation of some of this we can all look at, though it doesn't illustrate a number of the concepts. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > There's quite a bit of complexity involved and many variables will depend > on the scenario. 
Sort of like interactivity, except lots of people were > > affected by the interactivity tuning and only few will notice and test > > load control. It's basically just process scheduling, so I don't see an issue there. On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > The key question with regards to load control remains: How do you keep a > load controlled system responsive? Cleverly detect interactive processes > and spare them, or wake them up again quickly enough? How? Or is the > plan to use load control where responsiveness doesn't matter anyway? It's more in the interest of graceful degradation and relative improvement than meeting absolute response time requirements. i.e. making the best of a bad situation. Interactivity heuristics would presumably be part of a memory scheduling policy as they are for a cpu scheduling policy, but there is some missing information there. I suppose that's where synthesis is required. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 17:47 ` William Lee Irwin III @ 2003-12-10 22:23 ` Roger Luethi 2003-12-11 0:12 ` William Lee Irwin III 0 siblings, 1 reply; 63+ messages in thread From: Roger Luethi @ 2003-12-10 22:23 UTC (permalink / raw) To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh [-- Attachment #1: Type: text/plain, Size: 4079 bytes --] On Wed, 10 Dec 2003 09:47:57 -0800, William Lee Irwin III wrote: > On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > > Actually, I'm rather well on my way wrapping things up. I documented > > in detail how much 2.6 sucks in this area and where the potential for > > improvements would have likely been, but now I've got a deadline to > > meet and other things on my plate. > > Well, it'd be nice to see the code, then. I attached the stunning code I wrote a few months ago, rediffed against test11, seems to compile. It does not include the eviction code (although you can tell where it plugs in) -- that's a bit messy and I'm not too confident that I got all the locking right. The trigger in the page allocator worked pretty well in test4 to test6, but it is sensitive to VM changes. Earlier 2.5 kernels went through the slow path much more frequently (IIRC before akpm limited use of blk_congestion_wait), for instance. That would require a different trigger. The time processes spend in the stunning queue (defined in stun_time()) is too short to gain much in terms of throughput -- that's because back then I tried to put a cap on worst case latency. > you pointed them out. I had presumed it was due to physical scanning. Everybody did, including me. Only after doing some of the benchmarks did I realize I had been wrong. It's quite clear that physical scanning accounts for a 50% higher execution time at most, which is a mere fifth of the overall slow down in compile benchmarks. 
> > There are variables other than the demotion criteria that I found can > > be important, to name a few: > > - Trigger: Under which circumstances is suspending any processes > > considered? How often? > > This is generally part of the load control algorithm, but it > essentially just tries to detect levels of overcommitment that would > degrade performance so it can resolve them. Level of overcommitment? What kind of criteria is that supposed to be? You can have 10x overcommit and not thrash at all, if most of the memory is allocated and filled but never referenced again. IOW, I can't derive an algorithm from your handwaving <g>. > > - Eviction: Does regular pageout take care of the memory of a suspended > > process, or are pages marked old or even unmapped upon stunning? > > This is generally unmapping and evicting upon suspension. The effect > isn't immediate anyway, since io is required, and batching the work for > io contiguity etc. is a fair amount of savings, so there's little or no > incentive to delay this apart from keeping io rates down to where user > io and VM io aren't in competition. I agree with that part. > On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote: > > - Release: Is the stunning queue a simple FIFO? How long do the > > processes stay there? Does a process get a bonus after it's woken up > > again -- bigger quantum, chunk of free memory, prepaged working set > > before stunning? > > It's a form of process scheduling. Memory scheduling policies are not > discussed very much in the sources I can get at, so some synthesis may > be required unless material can be found on that, but in general this > isn't a very interesting problem (at least not since the 70's or earlier). Not interesting, yes. And I realize that it's not even important once you accept the very real possibility of extreme latencies. > > There's quite a bit of complexity involved and many variables will depend > > on the scenario. 
Sort of like interactivity, except lots of people were > > affected by the interactivity tuning and only few will notice and test > > load control. > > It's basically just process scheduling, so I don't see an issue there. The issue is that there are tons of knobs and dials that affect the behavior, and it's hard to get good heuristics with a tiny test field. Admittedly, things get easier once you want load control only for the heavy thrashing case, and that's been my plan, too, since I realized that it doesn't work well for the light and medium type I'd been working on. Roger

[-- Attachment #2: linux-2.6.0-test11-stun.patch --]
[-- Type: text/plain, Size: 10228 bytes --]

diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/include/linux/loadcontrol.h ./include/linux/loadcontrol.h
--- ../../18_binsearch/linux-2.6.0-test11/include/linux/loadcontrol.h	1970-01-01 01:00:00.000000000 +0100
+++ ./include/linux/loadcontrol.h	2003-12-10 22:11:15.999792424 +0100
@@ -0,0 +1,14 @@
+#ifndef _LINUX_LOADCONTROL_H
+#define _LINUX_LOADCONTROL_H
+
+#include <asm/atomic.h>
+
+extern wait_queue_head_t loadctrl_wq;
+extern struct semaphore stun_ser;
+extern struct semaphore unstun_token;
+
+extern void loadcontrol(void);
+extern void thrashing(unsigned long);
+extern atomic_t lctrl_waiting;
+
+#endif /* _LINUX_LOADCONTROL_H */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/include/linux/sched.h ./include/linux/sched.h
--- ../../18_binsearch/linux-2.6.0-test11/include/linux/sched.h	2003-11-24 10:28:54.000000000 +0100
+++ ./include/linux/sched.h	2003-12-10 22:11:16.002791985 +0100
@@ -500,6 +500,8 @@ do { if (atomic_dec_and_test(&(tsk)->usa
 #define PF_SWAPOFF	0x00080000	/* I am in swapoff */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
+#define PF_STUN		0x00400000
+#define PF_YIELD	0x00800000
 
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
@@ -581,6 +583,7 @@ extern int FASTCALL(wake_up_process(stru
 #endif
 extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
 extern void FASTCALL(sched_exit(task_t * p));
+extern int task_interactive(task_t * p);
 
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/include/linux/swap.h ./include/linux/swap.h
--- ../../18_binsearch/linux-2.6.0-test11/include/linux/swap.h	2003-10-15 15:03:46.000000000 +0200
+++ ./include/linux/swap.h	2003-12-10 22:11:16.003791839 +0100
@@ -174,6 +174,7 @@ extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
+extern int shrink_list(struct list_head *, unsigned int, int *, int *);
 extern int shrink_all_memory(int);
 extern int vm_swappiness;
 
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/kernel/loadcontrol.c ./kernel/loadcontrol.c
--- ../../18_binsearch/linux-2.6.0-test11/kernel/loadcontrol.c	1970-01-01 01:00:00.000000000 +0100
+++ ./kernel/loadcontrol.c	2003-12-10 22:11:16.005791546 +0100
@@ -0,0 +1,169 @@
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/loadcontrol.h>
+
+void loadcontrol(void);
+void thrashing(unsigned long);
+
+DECLARE_MUTEX(stun_ser);
+DECLARE_MUTEX_LOCKED(unstun_token);
+DECLARE_WAIT_QUEUE_HEAD(loadctrl_wq);
+
+atomic_t lctrl_waiting;
+
+static inline void stun_me(void)
+{
+	DEFINE_WAIT(wait);
+
+	up(&stun_ser);		/* Allow next */
+	atomic_inc(&lctrl_waiting);
+
+	for (;;) {
+		prepare_to_wait_exclusive(&loadctrl_wq, &wait,
+					  TASK_UNINTERRUPTIBLE);
+		schedule();
+		if (!down_trylock(&unstun_token)) {
+			/* Yay. Got unstun token, wake up */
+			break;
+		}
+	}
+	finish_wait(&loadctrl_wq, &wait);
+
+	atomic_dec(&lctrl_waiting);
+}
+
+void loadcontrol()
+{
+	unsigned long flags = current->flags;
+
+	spin_lock_irq(&current->sighand->siglock);
+	recalc_sigpending();	/* We sent fake signal, clean it up */
+	spin_unlock_irq(&current->sighand->siglock);
+
+	if (flags & PF_STUN)
+		stun_me();
+
+	current->flags &= ~(PF_STUN|PF_YIELD|PF_MEMALLOC);
+}
+
+/*
+ * int_sqrt - oom_kill.c internal function, rough approximation to sqrt
+ * @x: integer of which to calculate the sqrt
+ *
+ * A very rough approximation to the sqrt() function.
+ */
+static unsigned int int_sqrt(unsigned int x)
+{
+	unsigned int out = x;
+	while (x & ~(unsigned int)1) x >>=2, out >>=1;
+	if (x) out -= out >> 2;
+	return (out ? out : 1);
+}
+
+static int badness(struct task_struct *p, int flags)
+{
+	int points, cpu_time, run_time;
+
+	if (!p->mm)
+		return 0;
+
+	if (p->flags & (PF_MEMDIE | flags))
+		return 0;
+
+	/*
+	 * Resident memory size of the process is the basis for the badness.
+	 */
+	points = p->mm->rss;
+
+	/*
+	 * CPU time is in seconds and run time is in minutes. There is no
+	 * particular reason for this other than that it turned out to work
+	 * very well in practice.
+	 */
+	cpu_time = (p->utime + p->stime) >> (SHIFT_HZ + 3);
+	run_time = (get_jiffies_64() - p->start_time) >> (SHIFT_HZ + 10);
+
+	points *= int_sqrt(cpu_time);
+	points *= int_sqrt(int_sqrt(run_time));
+
+	/*
+	 * Niced processes are most likely less important.
+	 */
+	if (task_nice(p) > 0)
+		points *= 4;
+
+	/*
+	 * Keep interactive processes around.
+	 */
+	if (task_interactive(p))
+		points /= 4;
+
+	/*
+	 * Superuser processes are usually more important, so we make it
+	 * less likely that we kill those.
+	 */
+	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
+	    p->uid == 0 || p->euid == 0)
+		points /= 2;
+
+	/*
+	 * We don't want to kill a process with direct hardware access.
+	 * Not only could that mess up the hardware, but usually users
+	 * tend to only have this flag set on applications they think
+	 * of as important.
+	 */
+	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
+		points /= 2;
+
+	return points;
+}
+
+/*
+ * Simple selection loop. We chose the process with the highest
+ * number of 'points'. We expect the caller will lock the tasklist.
+ */
+static struct task_struct * pick_bad_process(int flags)
+{
+	int maxpoints = 0;
+	struct task_struct *g, *p;
+	struct task_struct *chosen = NULL;
+
+	do_each_thread(g, p)
+		if (p->pid) {
+			int points = badness(p, flags);
+			if (points > maxpoints) {
+				chosen = p;
+				maxpoints = points;
+			}
+		}
+	while_each_thread(g, p);
+	return chosen;
+}
+
+
+void thrashing(unsigned long action)
+{
+	struct task_struct *p;
+	unsigned long flags;
+
+	if (down_trylock(&stun_ser))
+		return;
+
+	read_lock(&tasklist_lock);
+
+	p = pick_bad_process(PF_STUN|action);
+	if (!p) {
+		up(&stun_ser);
+		goto out_unlock;
+	}
+
+	p->flags |= action|PF_MEMALLOC;
+	p->time_slice = HZ;
+
+	spin_lock_irqsave(&p->sighand->siglock, flags);
+	signal_wake_up(p, 0);
+	spin_unlock_irqrestore(&p->sighand->siglock, flags);
+
+out_unlock:
+	read_unlock(&tasklist_lock);
+}
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/kernel/Makefile ./kernel/Makefile
--- ../../18_binsearch/linux-2.6.0-test11/kernel/Makefile	2003-10-15 15:03:46.000000000 +0200
+++ ./kernel/Makefile	2003-12-10 22:11:16.006791400 +0100
@@ -6,7 +6,8 @@ obj-y = sched.o fork.o exec_domain.o
 	    exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
-	    rcupdate.o intermodule.o extable.o params.o posix-timers.o
+	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
+	    loadcontrol.o
 
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/kernel/sched.c ./kernel/sched.c
--- ../../18_binsearch/linux-2.6.0-test11/kernel/sched.c	2003-11-24 10:28:54.000000000 +0100
+++ ./kernel/sched.c	2003-12-10 22:11:16.015790083 +0100
@@ -37,6 +37,7 @@
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
 #include <linux/percpu.h>
+#include <linux/loadcontrol.h>
 
 #ifdef CONFIG_NUMA
 #define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu))
@@ -1465,6 +1466,22 @@ out:
 
 void scheduling_functions_start_here(void) { }
 
+static unsigned long stun_time(void) {
+	unsigned long ret;
+	int ql = atomic_read(&lctrl_waiting);
+	if (ql == 1)
+		ret = 5*HZ;
+	else if (ql == 2)
+		ret = 3*HZ;
+	else if (ql < 5)
+		ret = 2*HZ;
+	else if (ql < 10)
+		ret = 1*HZ;
+	else
+		ret = HZ/2;
+	return ret;
+}
+
 /*
  * schedule() is the main scheduler function.
  */
@@ -1495,6 +1512,22 @@ need_resched:
 	prev = current;
 	rq = this_rq();
 
+	if (unlikely(waitqueue_active(&loadctrl_wq))) {
+		static unsigned long prev_unstun;
+		unsigned long wait = stun_time();
+		if (time_before(jiffies, prev_unstun + wait) && prev_unstun)
+			goto loadctrl_done;
+		if (!atomic_read(&stun_ser.count))
+			goto loadctrl_done;
+		if (!prev_unstun) {
+			prev_unstun = jiffies;
+			goto loadctrl_done;
+		}
+		prev_unstun = jiffies;
+		up(&unstun_token);
+		wake_up(&loadctrl_wq);
+	}
+loadctrl_done:
 	release_kernel_lock(prev);
 	now = sched_clock();
 	if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
@@ -1935,6 +1968,11 @@ asmlinkage long sys_nice(int increment)
 
 #endif
 
+int task_interactive(task_t *p)
+{
+	return TASK_INTERACTIVE(p);
+}
+
 /**
  * task_prio - return the priority value of a given task.
  * @p: the task in question.
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/arch/i386/kernel/signal.c ./arch/i386/kernel/signal.c
--- ../../18_binsearch/linux-2.6.0-test11/arch/i386/kernel/signal.c	2003-11-24 10:28:51.000000000 +0100
+++ ./arch/i386/kernel/signal.c	2003-12-10 22:11:16.018789645 +0100
@@ -24,6 +24,7 @@
 #include <asm/uaccess.h>
 #include <asm/i387.h>
 #include "sigframe.h"
+#include <linux/loadcontrol.h>
 
 #define DEBUG_SIG 0
 
@@ -569,6 +570,10 @@ int do_signal(struct pt_regs *regs, sigs
 			refrigerator(0);
 			goto no_signal;
 		}
+		if (current->flags & (PF_STUN|PF_YIELD)) {
+			loadcontrol();
+			goto no_signal;
+		}
 
 		if (!oldset)
 			oldset = &current->blocked;
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/mm/page_alloc.c ./mm/page_alloc.c
--- ../../18_binsearch/linux-2.6.0-test11/mm/page_alloc.c	2003-10-15 15:03:46.000000000 +0200
+++ ./mm/page_alloc.c	2003-12-10 22:11:16.021789206 +0100
@@ -31,6 +31,7 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/loadcontrol.h>
 
 #include <asm/tlbflush.h>
 
@@ -606,6 +607,7 @@ __alloc_pages(unsigned int gfp_mask, uns
 	}
 
 	/* here we're in the low on memory slow path */
+	thrashing(PF_STUN);
 rebalance:
 	if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 22:23 ` Roger Luethi @ 2003-12-11 0:12 ` William Lee Irwin III 0 siblings, 0 replies; 63+ messages in thread From: William Lee Irwin III @ 2003-12-11 0:12 UTC (permalink / raw) To: Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 11:23:55PM +0100, Roger Luethi wrote:

> Level of overcommitment? What kind of criterion is that supposed to be?
> You can have 10x overcommit and not thrash at all, if most of the memory
> is allocated and filled but never referenced again. IOW, I can't derive
> an algorithm from your handwaving <g>.

There is no handwaving; the answer is necessarily ambiguous.

-- wli

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 13:58 ` Roger Luethi 2003-12-10 17:47 ` William Lee Irwin III @ 2003-12-10 21:04 ` Rik van Riel 2003-12-10 23:17 ` Roger Luethi 2003-12-10 23:30 ` Helge Hafting 2 siblings, 1 reply; 63+ messages in thread From: Rik van Riel @ 2003-12-10 21:04 UTC (permalink / raw) To: Roger Luethi Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel, Martin J. Bligh

On Wed, 10 Dec 2003, Roger Luethi wrote:

> For me this discussion just confirmed that my approach fails to draw
> much interest, either because there are better alternatives or because
> heavy paging and medium thrashing are generally not considered
> interesting problems.

I'm willing to take over this work if you really want to throw in the
towel. It has to be done, simply to make Linux better able to deal
with load spikes.

> Our apparent differences come from the fact that we try to solve
> different problems as you correctly noted: You are concerned with
> extreme overcommit, while I am concerned that 2.6 takes several times
> longer than 2.4 to complete a task under slight overcommit.

Agreed, the slight to medium overcommit needs to be addressed well.
This is way more important than very highly overcommitted systems,
because computers are powerful enough for their workloads anyway. The
thing Linux needs to deal with is unexpected load spikes.

The thing that needs to be done is making sure that such a load spike
doesn't send Linux into a death spiral. If such a load control
mechanism also solves the highly overloaded scenario, that's just a
nice bonus.

> The key question with regard to load control remains: How do you keep
> a load controlled system responsive? Cleverly detect interactive
> processes and spare them, or wake them up again quickly enough? How?
> Or is the plan to use load control where responsiveness doesn't matter
> anyway?
Under light to moderate overload, a load controlled system will be
more responsive than a thrashing system.

Heavy overload is probably a "doctor, it hurts ..." case.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 21:04 ` Rik van Riel @ 2003-12-10 23:17 ` Roger Luethi 2003-12-11 1:31 ` Rik van Riel 0 siblings, 1 reply; 63+ messages in thread From: Roger Luethi @ 2003-12-10 23:17 UTC (permalink / raw) To: Rik van Riel Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel, Martin J. Bligh

On Wed, 10 Dec 2003 16:04:16 -0500, Rik van Riel wrote:
> > For me this discussion just confirmed that my approach fails to draw
> > much interest, either because there are better alternatives or because
> > heavy paging and medium thrashing are generally not considered
> > interesting problems.
>
> I'm willing to take over this work if you really want
> to throw in the towel. It has to be done, simply to
> make Linux better able to deal with load spikes.

I am willing to keep up my work if I don't have to pull this alone.

As far as thrashing is concerned, the VM changed significantly even
during the -test series and I expect that to continue once 2.6.0 is
released. It would be good to get help from the people who made those
changes -- they should know their stuff best, after all. For one, we
could look at the regression in test3, which might be easier to fix
than others because the changes haven't been buried under dozens of
later kernels.

Some time ago, I took some notes about how the two patches I mentioned
in an earlier message worked together to change the pageout patterns.
Is that something we could start with?

Setting up some regular regression testing for new kernels might be a
good idea, too. Otherwise it's going to be Sisyphus work. For the time
being I can continue the testing, provided the hard disk that
miraculously survived hundreds of hours of thrashing tests keeps going.

> Under light to moderate overload, a load controlled system
> will be more responsive than a thrashing system.

That I doubt.
2.4 is very responsive under light overload -- every process is mostly
in memory and ready to grab a few missing pages at any time. Once you
add load control, you have processes that are completely evicted and
stunned when they are needed.

Of course it's a matter of definition, too, so I'd even go as far as
saying:

- It is light thrashing when load control has no advantage.
- It is medium thrashing when using load control is a toss-up. Probably
  better throughput, but somewhat higher latency.
- It is heavy thrashing when load control is a winner in both regards.

I just made this up. It neatly resolves all arguments about when load
control is appropriate. Yeah, so it's a circular definition. Sue me.

> Heavy overload is probably a "doctor, it hurts ..." case.

That's pretty much my thinking, too. Might still be worthwhile adding
some load control if there are more people like wli's Russian guy.

Roger

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 23:17 ` Roger Luethi @ 2003-12-11 1:31 ` Rik van Riel 2003-12-11 10:16 ` Roger Luethi 0 siblings, 1 reply; 63+ messages in thread From: Rik van Riel @ 2003-12-11 1:31 UTC (permalink / raw) To: Roger Luethi Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel, Martin J. Bligh

On Thu, 11 Dec 2003, Roger Luethi wrote:

Hmmm, those definitions have changed a little from the OS books I read ;))

> - It is light thrashing when load control has no advantage.

This used to be called "no thrashing" ;)

> - It is medium thrashing when using load control is a toss-up. Probably
>   better throughput, but somewhat higher latency.

This would be when the system load is so high that decreasing the
multiprocessing level would increase system load, but performance
would still be within acceptable limits (say, 30% of top performance).

> - It is heavy thrashing when load control is a winner in both regards.

Heavy thrashing would be "no work gets done by the processes in the
system, nobody makes good progress". In that case load control is
needed to make the system survive in a useful way.

> I just made this up. It neatly resolves all arguments about when load
> control is appropriate. Yeah, so it's a circular definition. Sue me.

Knowing what your definitions are has definitely made it easier for me
to understand your previous mails. Still, sticking to the textbook
definitions might make it even easier to talk about things, and
compare the plans for Linux with what's been done for other OSes.

Also, it would make the job of a load control mechanism really easy to
define:

	"Prevent the system from thrashing"

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-11 1:31 ` Rik van Riel @ 2003-12-11 10:16 ` Roger Luethi 0 siblings, 0 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-11 10:16 UTC (permalink / raw) To: Rik van Riel Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel, Martin J. Bligh

On Wed, 10 Dec 2003 20:31:40 -0500, Rik van Riel wrote:
> Hmmm, those definitions have changed a little from the
> OS books I read ;))
>
> > - It is light thrashing when load control has no advantage.
>
> This used to be called "no thrashing" ;)

Fair enough, but that was before Linux 2.6 <g>.

kbuild benchmark, execution time in seconds (median over ten runs):

 74	2.6.0-test11, 256 MB RAM
115	2.4.21, 64 MB RAM
539	2.6.0-test11, 64 MB RAM

We can call it lousy paging, that'll be fine with me.

> Also, it would make the job of a load control mechanism
> really easy to define:
>
> 	"Prevent the system from thrashing"

"... once all other means are exhausted". Then I'll buy it.

Roger

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 13:58 ` Roger Luethi 2003-12-10 17:47 ` William Lee Irwin III 2003-12-10 21:04 ` Rik van Riel @ 2003-12-10 23:30 ` Helge Hafting 2 siblings, 0 replies; 63+ messages in thread From: Helge Hafting @ 2003-12-10 23:30 UTC (permalink / raw) To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
>
> What goes wrong is that once you start suspending tasks, you have a
> hard time telling the interactive tasks apart from the batch load.
> This may not be much of a problem on a 10x overcommit system, because
> that's presumably quite unresponsive anyway, but it does matter a lot if
> you have an interactive system that just crossed the border to thrashing.
>
This isn't too bad. Let's say I use the system interactively and the
"wrong" app is suddenly swapped out. I notice this, and simply close
down some responsive apps that are less needed. The system will then
notice that there's "enough" memory and allow the app to page in again.

Helge Hafting

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-08 20:48 ` William Lee Irwin III 2003-12-09 0:27 ` Roger Luethi @ 2003-12-10 21:52 ` Andrea Arcangeli 2003-12-10 22:05 ` Roger Luethi 1 sibling, 1 reply; 63+ messages in thread From: Andrea Arcangeli @ 2003-12-10 21:52 UTC (permalink / raw) To: William Lee Irwin III, rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Mon, Dec 08, 2003 at 12:48:17PM -0800, William Lee Irwin III wrote:
> qsbench I'd pretty much ignore except as a control case, since there's
> nothing to do with a single process but let it thrash.

This is not the point. If a single process like qsbench thrashes twice
as fast in 2.4, it means 2.6 has a serious problem in the core VM; the
whole point of swap is not to thrash but to give the task more virtual
memory than physical. I doubt you can solve it with anything returned
by si_swapinfo.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 21:52 ` Andrea Arcangeli @ 2003-12-10 22:05 ` Roger Luethi 2003-12-10 22:44 ` Andrea Arcangeli 0 siblings, 1 reply; 63+ messages in thread From: Roger Luethi @ 2003-12-10 22:05 UTC (permalink / raw) To: Andrea Arcangeli Cc: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, 10 Dec 2003 22:52:35 +0100, Andrea Arcangeli wrote:
> On Mon, Dec 08, 2003 at 12:48:17PM -0800, William Lee Irwin III wrote:
> > qsbench I'd pretty much ignore except as a control case, since there's
> > nothing to do with a single process but let it thrash.
>
> This is not the point. If a single process like qsbench thrashes twice
> as fast in 2.4, it means 2.6 has a serious problem in the core VM; the
> whole point of swap is not to thrash but to give the task more virtual
> memory than physical. I doubt you can solve it with anything returned
> by si_swapinfo.

Uhm.. guys? I forgot to mention that earlier: qsbench as I used it was
not about one single process. There were four worker processes (-p 4),
and my load control stuff did make it run faster, so the point is moot.

Also, the 2.6 core VM doesn't seem all that bad since it was introduced
in 2.5.27 but most of the problems I measured were introduced after
2.5.40. Check out the graph I posted.

Thank you, we now return to our regularly scheduled programming.

Roger

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 22:05 ` Roger Luethi @ 2003-12-10 22:44 ` Andrea Arcangeli 2003-12-11 1:28 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 63+ messages in thread From: Andrea Arcangeli @ 2003-12-10 22:44 UTC (permalink / raw) To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 11:05:25PM +0100, Roger Luethi wrote:
> On Wed, 10 Dec 2003 22:52:35 +0100, Andrea Arcangeli wrote:
> > On Mon, Dec 08, 2003 at 12:48:17PM -0800, William Lee Irwin III wrote:
> > > qsbench I'd pretty much ignore except as a control case, since there's
> > > nothing to do with a single process but let it thrash.
> >
> > This is not the point. If a single process like qsbench thrashes twice
> > as fast in 2.4, it means 2.6 has a serious problem in the core VM; the
> > whole point of swap is not to thrash but to give the task more virtual
> > memory than physical. I doubt you can solve it with anything returned
> > by si_swapinfo.
>
> Uhm.. guys? I forgot to mention that earlier: qsbench as I used it was not
> about one single process. There were four worker processes (-p 4), and my
> load control stuff did make it run faster, so the point is moot.

More processes can be optimized even better by adding unfairness.
Either way, a significant slowdown of qsbench probably means a worse
core VM, at least compared with 2.4, which isn't adding huge
unfairness just to optimize qsbench.

> Also, the 2.6 core VM doesn't seem all that bad since it was introduced in
> 2.5.27 but most of the problems I measured were introduced after 2.5.40.
> Check out the graph I posted.

You're confusing rmap with the core VM. rmap can in no way be defined
as the core VM; rmap is just a method used by the core VM to find some
information more efficiently, at the expense of all the fast paths
that now have to do the rmap bookkeeping.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 22:44 ` Andrea Arcangeli @ 2003-12-11 1:28 ` William Lee Irwin III 2003-12-11 1:32 ` Rik van Riel 0 siblings, 1 reply; 63+ messages in thread From: William Lee Irwin III @ 2003-12-11 1:28 UTC (permalink / raw) To: Andrea Arcangeli Cc: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 11:05:25PM +0100, Roger Luethi wrote:
>> Also, the 2.6 core VM doesn't seem all that bad since it was introduced in
>> 2.5.27 but most of the problems I measured were introduced after 2.5.40.
>> Check out the graph I posted.

On Wed, Dec 10, 2003 at 11:44:46PM +0100, Andrea Arcangeli wrote:
> You're confusing rmap with the core VM. rmap can in no way be defined
> as the core VM; rmap is just a method used by the core VM to find some
> information more efficiently, at the expense of all the fast paths
> that now have to do the rmap bookkeeping.

I've been maintaining one of the answers to this (anobjrmap,
originally from hugh). I still haven't removed page->mapcount because
keeping nr_mapped straight requires some care, though doing so should
be feasible. I could probably use some helpers to untangle it from the
highpmd, compile-time mapping->page_lock rwlock/spinlock switching,
RCU mapping->i_shared_lock, and O(1) proc_pid_statm() bits.

-- wli

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-11 1:28 ` William Lee Irwin III @ 2003-12-11 1:32 ` Rik van Riel 0 siblings, 0 replies; 63+ messages in thread From: Rik van Riel @ 2003-12-11 1:32 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrea Arcangeli, rl, Con Kolivas, Chris Vine, linux-kernel, Martin J. Bligh

On Wed, 10 Dec 2003, William Lee Irwin III wrote:

> I could probably use some helpers to untangle it from the highpmd,
> compile-time mapping->page_lock rwlock/spinlock switching, RCU
> mapping->i_shared_lock, and O(1) proc_pid_statm() bits.

Looking into it. Your patch certainly has a lot of stuff folded into
one piece ;)

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 22:44 ` Andrea Arcangeli 2003-12-11 1:28 ` William Lee Irwin III @ 2003-12-11 10:16 ` Roger Luethi 2003-12-15 23:31 ` Andrew Morton 2 siblings, 0 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-11 10:16 UTC (permalink / raw) To: Andrea Arcangeli Cc: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, 10 Dec 2003 23:44:46 +0100, Andrea Arcangeli wrote:
> More processes can be optimized even better by adding unfairness.
> Either way, a significant slowdown of qsbench probably means a worse
> core VM, at least compared with 2.4, which isn't adding huge
> unfairness just to optimize qsbench.

Can you be a bit more specific about the type of unfairness? The only
instance I clearly noticed is that one process can grow its RSS at the
expense of others if they already have a high PFF. That happens more
often in 2.4 and helps a lot with some benchmarks.

I did notice, though, that after an initial slowdown, qsbench improved
during 2.5, while the compile benchmarks got even worse.

> > Also, the 2.6 core VM doesn't seem all that bad since it was introduced in
> > 2.5.27 but most of the problems I measured were introduced after 2.5.40.
> > Check out the graph I posted.
>
> You're confusing rmap with the core VM. rmap can in no way be defined
> as the core VM; rmap is just a method used by the core VM to find some

Incidentally, all these places where rmap is used by the core VM were
introduced in 2.5.27 as well. In particular vmscan.c was completely
overhauled. But apparently you suspect subsequent changes to the core
to be a problem. I am curious what they are if that can help fix the
slowdowns I'm seeing.

Roger

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-10 22:44 ` Andrea Arcangeli 2003-12-11 1:28 ` William Lee Irwin III 2003-12-11 10:16 ` Roger Luethi @ 2003-12-15 23:31 ` Andrew Morton 2003-12-15 23:37 ` Andrea Arcangeli 2 siblings, 1 reply; 63+ messages in thread From: Andrew Morton @ 2003-12-15 23:31 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: wli, kernel, chris, riel, linux-kernel, mbligh

Andrea Arcangeli <andrea@suse.de> wrote:
>
> > Uhm.. guys? I forgot to mention that earlier: qsbench as I used it was not
> > about one single process. There were four worker processes (-p 4), and my
> > load control stuff did make it run faster, so the point is moot.
>
> More processes can be optimized even better by adding unfairness.
> Either way, a significant slowdown of qsbench probably means a worse
> core VM, at least compared with 2.4, which isn't adding huge
> unfairness just to optimize qsbench.

Single-threaded qsbench is OK on 2.6. Last time I looked it was a
little quicker than 2.4. It's when you go to multiple qsbench
instances that everything goes to crap.

It's interesting to watch the `top' output during the run. In 2.4 you
see three qsbench instances have consumed 0.1 seconds CPU and the
fourth has consumed 45 seconds and then exits.

In 2.6 all four processes consume CPU at the same rate. Really, really
slowly.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-15 23:31 ` Andrew Morton @ 2003-12-15 23:37 ` Andrea Arcangeli 2003-12-15 23:54 ` Andrew Morton 0 siblings, 1 reply; 63+ messages in thread From: Andrea Arcangeli @ 2003-12-15 23:37 UTC (permalink / raw) To: Andrew Morton; +Cc: wli, kernel, chris, riel, linux-kernel, mbligh

On Mon, Dec 15, 2003 at 03:31:22PM -0800, Andrew Morton wrote:
> Single-threaded qsbench is OK on 2.6. Last time I looked it was a little
> quicker than 2.4. It's when you go to multiple qsbench instances that
> everything goes to crap.
>
> It's interesting to watch the `top' output during the run. In 2.4 you see
> three qsbench instances have consumed 0.1 seconds CPU and the fourth has
> consumed 45 seconds and then exits.
>
> In 2.6 all four processes consume CPU at the same rate. Really, really
> slowly.

Sounds good, so this seems to be only a fairness issue. 2.6 is more
fair, but fairness in this case means much worse performance.

The reason 2.4 runs faster could be a more aggressive "young"
pagetable heuristic via the swap_out clock algorithm. As soon as one
program grows its RSS a bit, it will run for longer; the longer it
runs, the more pages it marks "young" during a clock scan, and the
more pages it marks young, the bigger it will grow. This keeps going
until it is by far the biggest task and takes almost all available
CPU. This is optimal for performance, but not optimal for fairness. So
2.6 may be better or worse depending on whether fairness pays off or
not; obviously in qsbench it doesn't, since it's not even measured.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-15 23:37 ` Andrea Arcangeli @ 2003-12-15 23:54 ` Andrew Morton 2003-12-16 0:17 ` Rik van Riel 2003-12-16 11:23 ` Roger Luethi 0 siblings, 2 replies; 63+ messages in thread From: Andrew Morton @ 2003-12-15 23:54 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: wli, kernel, chris, riel, linux-kernel, mbligh

Andrea Arcangeli <andrea@suse.de> wrote:
>
> The reason 2.4 runs faster could be a more aggressive "young"
> pagetable heuristic via the swap_out clock algorithm. As soon as one
> program grows its RSS a bit, it will run for longer; the longer it
> runs, the more pages it marks "young" during a clock scan, and the
> more pages it marks young, the bigger it will grow. This keeps going
> until it is by far the biggest task and takes almost all available
> CPU. This is optimal for performance, but not optimal for fairness.

Sounds right.

One thing to be cautious of here is an interaction with the "human
factor". One tends to adjust the test case so that it takes a
reasonable amount of time. So the process is:

Run 1: took five seconds.

	"hmm, it didn't swap at all. I'll use some more threads"

Run 2: takes 4 hours.

	"man, that sucked. I'll use a few less threads"

Run 3: takes ten minutes.

	"ah, that's nice. I'll use that many threads from now on".

Problem is, you have now carefully placed your test point right on the
point of a sharp knee in a big curve. So small changes in input
conditions cause large changes in runtime.

At least, that's what I do ;)

> So 2.6 may be better or worse depending on whether fairness pays off
> or not; obviously in qsbench it doesn't, since it's not even measured.

It would be nice, but I've yet to find a workload in which 2.6 pageout
decisively wins. It could well be that something is simply misbehaving
in there and that we can pull back significant benefits with some
inspired tweaking rather than with radical changes.
Certainly some of Roger's measurements indicate that this is the case,
although I worry that he may have tuned himself onto the knee of the
curve.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-15 23:54 ` Andrew Morton @ 2003-12-16 0:17 ` Rik van Riel 2003-12-16 11:23 ` Roger Luethi 1 sibling, 0 replies; 63+ messages in thread From: Rik van Riel @ 2003-12-16 0:17 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh

On Mon, 15 Dec 2003, Andrew Morton wrote:

> It could well be that something is simply misbehaving in there

I have my suspicions about inter-zone balancing in 2.6. Something
seems wrong, but I can't quite put my finger on it yet. This should
have quite some impact in the 1 - 4 GB range and a test (done by
somebody else, I can't give you the details yet unfortunately) has
shown there is a problem.

I'm working on it and should come up with a patch soon.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-15 23:54 ` Andrew Morton @ 2003-12-16 11:23 ` Roger Luethi 2003-12-16 16:29 ` Rik van Riel 2003-12-17 18:53 ` Rik van Riel 1 sibling, 2 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-16 11:23 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, wli, kernel, chris, riel, linux-kernel, mbligh

On Mon, 15 Dec 2003 15:54:27 -0800, Andrew Morton wrote:
> One tends to adjust the test case so that it takes a reasonable amount of
> time. So the process is:
>
> Run 1: took five seconds.
>
> 	"hmm, it didn't swap at all. I'll use some more threads"
>
> Run 2: takes 4 hours.
>
> 	"man, that sucked. I'll use a few less threads"
>
> Run 3: takes ten minutes.
>
> 	"ah, that's nice. I'll use that many threads from now on".
>
> [...]
>
> It could well be that something is simply misbehaving in there and that we
> can pull back significant benefits with some inspired tweaking rather than
> with radical changes. Certainly some of Roger's measurements indicate that
> this is the case, although I worry that he may have tuned himself onto the
> knee of the curve.

No worries, mate :-). The efax benchmark I run is a replica of the
case that started this thread: "make main.o" for efax with 32 MB. The
kbuild benchmark is very different as far as compile benchmarks go:
"make -j 24" for the Linux kernel with 64 MB -- the time was adjusted
not by using fewer processes but by only building a small part of the
kernel, which does not change the character of the test. As
benchmarks, efax and kbuild seem different enough to warrant the
conclusion that compiling under tight memory conditions is slow on
2.6.

The qsbench benchmark is clearly a different type from the other two.
Improvements in qsbench coincided several times with losses for
efax/kbuild and vice versa.
Exceptions exist, like 2.5.65, which brought no change for efax but
big improvements for kbuild and qsbench (which was back on par with
2.5.0 for two releases). It is at least conceivable, though, that the
damage for one type of benchmark (qsbench) was mitigated at the
expense of others.

One potential problem with the benchmarks is that my test box has just
one bar with 256 MB RAM. The kbuild and efax tests were run with
mem=64M and mem=32M, respectively. If the difference between mem=32M
and a real 32 MB machine is significant for the benchmark, the results
will be less than perfect. I plan to do some testing on a machine with
more than one memory module to get an idea of the impact, provided I
can dig up some usable hardware.

Roger

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-16 11:23 ` Roger Luethi @ 2003-12-16 16:29 ` Rik van Riel 2003-12-17 11:03 ` Roger Luethi 2003-12-17 18:53 ` Rik van Riel 1 sibling, 1 reply; 63+ messages in thread From: Rik van Riel @ 2003-12-16 16:29 UTC (permalink / raw) To: Roger Luethi Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh

On Tue, 16 Dec 2003, Roger Luethi wrote:

> One potential problem with the benchmarks is that my test box has
> just one bar with 256 MB RAM. The kbuild and efax tests were run with
> mem=64M and mem=32M, respectively. If the difference between mem=32M
> and a real 32 MB machine is significant for the benchmark,

Could you try "echo 0 > /proc/sys/vm/lower_zone_protection" ?

I have a feeling that the lower zone protection logic could be badly
messing up systems in the 24-48 MB range, as well as systems in the
1.5-3 GB range. This would be because the allocation threshold for the
lower zone would be 30% higher than the high threshold of the pageout
code, meaning that the memory in the lower zone would be just sitting
there, without old pages being recycled by the pageout code.

In effect, your 32 MB test would have old memory in the lower 16 MB,
without the pageout code reusing that memory for something more
useful, reducing the amount of memory the system really uses well to
something way lower.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-16 16:29 ` Rik van Riel @ 2003-12-17 11:03 ` Roger Luethi 2003-12-17 11:06 ` William Lee Irwin III 2003-12-17 11:33 ` Rik van Riel 0 siblings, 2 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-17 11:03 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh On Tue, 16 Dec 2003 11:29:50 -0500, Rik van Riel wrote: > On Tue, 16 Dec 2003, Roger Luethi wrote: > > > One potential problem with the benchmarks is that my test box has > > just one bar with 256 MB RAM. The kbuild and efax tests were run with > > mem=64M and mem=32M, respectively. If the difference between mem=32M > > and a real 32 MB machine is significant for the benchmark, > > Could you try "echo 0 > /proc/sys/vm/lower_zone_protection" ? Defaults to 0 anyway, doesn't it? Turning it _on_ seems to slow benchmarks down somewhat (< 5%). In one of ten runs, though, the efax test stopped doing anything for ten minutes -- no disk activity, no progress whatsoever. Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 11:03 ` Roger Luethi @ 2003-12-17 11:06 ` William Lee Irwin III 2003-12-17 16:50 ` Roger Luethi 2003-12-17 11:33 ` Rik van Riel 1 sibling, 1 reply; 63+ messages in thread From: William Lee Irwin III @ 2003-12-17 11:06 UTC (permalink / raw) To: Rik van Riel, Andrew Morton, Andrea Arcangeli, kernel, chris, linux-kernel, mbligh On Tue, 16 Dec 2003 11:29:50 -0500, Rik van Riel wrote: >> Could you try "echo 0 > /proc/sys/vm/lower_zone_protection" ? On Wed, Dec 17, 2003 at 12:03:37PM +0100, Roger Luethi wrote: > Defaults to 0 anyway, doesn't it? Turning it _on_ seems to slow > benchmarks down somewhat (< 5%). In one of ten runs, though, the efax > test stopped doing anything for ten minutes -- no disk activity, no > progress whatsoever. Sorry about that, that got brought up elsewhere but not propagated out to lkml. Hearing more about the various degradations you've identified would be helpful. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 11:06 ` William Lee Irwin III @ 2003-12-17 16:50 ` Roger Luethi 0 siblings, 0 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-17 16:50 UTC (permalink / raw) To: William Lee Irwin III, Rik van Riel, Andrew Morton, Andrea Arcangeli, kernel, chris, linux-kernel, mbligh On Wed, 17 Dec 2003 03:06:48 -0800, William Lee Irwin III wrote: > to lkml. Hearing more about the various degradations you've identified > would be helpful. I'll use 2.6.0-test3 again as the example. That release brought a slight improvement for qsbench and big slowdowns for kbuild and efax (check the numbers I posted for details), due to two patches: "fix kswapd throttling" (patch 1) and "decaying average of zone pressure/use zone_pressure for page unmapping" (patch 2). Even as late as test9 I found that reverting patches 1 and 2 changed performance numbers for all benchmarks pretty much back to test2 level. Reverting only patch 1 brought a partial improvement, reverting only patch 2 none at all. Patch 1 prevented those frequent calls to blk_congestion_wait in balance_pgdat when enough pages were freed:

diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c	Thu Jul 17 06:09:38 2003
+++ b/mm/vmscan.c	Fri Aug 1 03:02:09 2003
@@ -930,7 +930,8 @@
 		}
 		if (all_zones_ok)
 			break;
-		blk_congestion_wait(WRITE, HZ/10);
+		if (to_free > 0)
+			blk_congestion_wait(WRITE, HZ/10);
 	}
 	return nr_pages - to_free;
 }

Unconditional blk_congestion_wait breaks (as they occurred in test2 and earlier) reduce the speed at which kswapd can free pages, making it much more likely that memory is reclaimed by the allocator (try_to_free_pages) because kswapd fails to keep up with demand. Patch 2 changed distress and thus reclaim_mapped in refill_inactive_zone. distress became less volatile -- kernels before test3 tended to consider mapped pages only after a few iterations in balance_pgdat (i.e. with rising priority).
To get the benefits of reverting patch 2 in test3/test9 this small patch should suffice:

diff -u ./mm/vmscan.c ./mm/vmscan.c
--- ./mm/vmscan.c	Wed Nov 19 11:02:51 2003
+++ ./mm/vmscan.c	Wed Nov 19 23:53:06 2003
@@ -632,7 +632,7 @@
 	 * `distress' is a measure of how much trouble we're having reclaiming
 	 * pages. 0 -> no problems. 100 -> great trouble.
 	 */
-	distress = 100 >> zone->prev_priority;
+	distress = 100 >> priority;
 
 	/*
 	 * The point of this algorithm is to decide when to start reclaiming

Without patch 2 (kernel test2 and earlier), kswapd freeing is frequently interrupted by the allocator satisfying immediate needs. With the patch, refill is dominated by long, undisturbed sequences driven by kswapd. All this has little impact on qsbench because unlike the other two benchmarks qsbench hardly ever fails to convince refill_inactive_zone to consider mapped pages as well (thanks to an extremely high mapped_ratio). Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 11:03 ` Roger Luethi 2003-12-17 11:06 ` William Lee Irwin III @ 2003-12-17 11:33 ` Rik van Riel 1 sibling, 0 replies; 63+ messages in thread From: Rik van Riel @ 2003-12-17 11:33 UTC (permalink / raw) To: Roger Luethi Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh On Wed, 17 Dec 2003, Roger Luethi wrote: > Defaults to 0 anyway, doesn't it? Duh, right... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-16 11:23 ` Roger Luethi 2003-12-16 16:29 ` Rik van Riel @ 2003-12-17 18:53 ` Rik van Riel 2003-12-17 19:27 ` William Lee Irwin III 2003-12-17 19:49 ` Roger Luethi 1 sibling, 2 replies; 63+ messages in thread From: Rik van Riel @ 2003-12-17 18:53 UTC (permalink / raw) To: Roger Luethi Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh On Tue, 16 Dec 2003, Roger Luethi wrote: > One potential problem with the benchmarks is that my test box has > just one bar with 256 MB RAM. The kbuild and efax tests were run with > mem=64M and mem=32M, respectively. If the difference between mem=32M OK, I found another difference with 2.4. Try "echo 256 > /proc/sys/vm/min_free_kbytes", I think that should give the same free watermarks that 2.4 has. Using 1MB as the min free watermark for lowmem is bound to result in more free (and less used) memory on systems with less than 128 MB RAM ... significantly so on smaller systems. The fact that ZONE_HIGHMEM and ZONE_NORMAL are recycled at very different rates could also be of influence on some performance tests... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 18:53 ` Rik van Riel @ 2003-12-17 19:27 ` William Lee Irwin III 2003-12-17 19:51 ` Rik van Riel 2003-12-17 19:49 ` Roger Luethi 1 sibling, 1 reply; 63+ messages in thread From: William Lee Irwin III @ 2003-12-17 19:27 UTC (permalink / raw) To: Rik van Riel Cc: Roger Luethi, Andrew Morton, Andrea Arcangeli, kernel, chris, linux-kernel, mbligh On Tue, 16 Dec 2003, Roger Luethi wrote: >> One potential problem with the benchmarks is that my test box has >> just one bar with 256 MB RAM. The kbuild and efax tests were run with >> mem=64M and mem=32M, respectively. If the difference between mem=32M On Wed, Dec 17, 2003 at 01:53:28PM -0500, Rik van Riel wrote: > OK, I found another difference with 2.4. > Try "echo 256 > /proc/sys/vm/min_free_kbytes", I think > that should give the same free watermarks that 2.4 has. > Using 1MB as the min free watermark for lowmem is bound > to result in more free (and less used) memory on systems > with less than 128 MB RAM ... significantly so on smaller > systems. > The fact that ZONE_HIGHMEM and ZONE_NORMAL are recycled > at very different rates could also be of influence on > some performance tests... Limited sets of configurations may have left holes in the testing. Upper zones much larger than lower zones basically want the things to be unequal. It probably wants the replacement load spread proportionally in general or some such nonsense. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 19:27 ` William Lee Irwin III @ 2003-12-17 19:51 ` Rik van Riel 0 siblings, 0 replies; 63+ messages in thread From: Rik van Riel @ 2003-12-17 19:51 UTC (permalink / raw) To: William Lee Irwin III Cc: Roger Luethi, Andrew Morton, Andrea Arcangeli, kernel, chris, linux-kernel, mbligh On Wed, 17 Dec 2003, William Lee Irwin III wrote: > Limited sets of configurations may have left holes in the testing. > Upper zones much larger than lower zones basically want the things > to be unequal. It probably wants the replacement load spread > proportionally in general or some such nonsense. Yeah. In some configurations 2.4-rmap takes care of this automagically since part of the replacement isn't as pressure driven as in 2.4 mainline and 2.6, ie. some of the aging is done independently of allocation pressure. Still, inter-zone balancing is HARD to get right. I'm currently trying to absorb all of the 2.6 VM balancing into my mind (*sound effects of brain turning to slush*) to find any possible imbalances. Some of the test results I have seen make me very suspicious... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 18:53 ` Rik van Riel 2003-12-17 19:27 ` William Lee Irwin III @ 2003-12-17 19:49 ` Roger Luethi 2003-12-17 21:41 ` Andrew Morton 2003-12-17 21:41 ` Roger Luethi 1 sibling, 2 replies; 63+ messages in thread From: Roger Luethi @ 2003-12-17 19:49 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh On Wed, 17 Dec 2003 13:53:28 -0500, Rik van Riel wrote: > On Tue, 16 Dec 2003, Roger Luethi wrote: > > > One potential problem with the benchmarks is that my test box has > > just one bar with 256 MB RAM. The kbuild and efax tests were run with > > mem=64M and mem=32M, respectively. If the difference between mem=32M > > OK, I found another difference with 2.4. > > Try "echo 256 > /proc/sys/vm/min_free_kbytes", I think > that should give the same free watermarks that 2.4 has. I played around with that knob after wli posted his findings in the "mem=16MB laptop testing" thread. IIRC tweaking min_free_kbytes didn't help nearly as much as I had hoped. I'm running the efax benchmark right now just to make sure. It's going to take a couple of hours, I'll follow up with results. FWIW akpm posted a patch to initialize min_free_kbytes depending on available RAM which seemed to make sense but it hasn't made it into mainline yet. > Using 1MB as the min free watermark for lowmem is bound > to result in more free (and less used) memory on systems > with less than 128 MB RAM ... significantly so on smaller > systems. Possibly. If memory pressure is high enough, though, the allocator ignores the watermarks. And on the other end kswapd seems to be pretty busy anyway during the benchmarks. Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 19:49 ` Roger Luethi @ 2003-12-17 21:41 ` Andrew Morton 2003-12-17 21:41 ` Roger Luethi 1 sibling, 0 replies; 63+ messages in thread From: Andrew Morton @ 2003-12-17 21:41 UTC (permalink / raw) To: Roger Luethi; +Cc: riel, andrea, wli, kernel, chris, linux-kernel, mbligh Roger Luethi <rl@hellgate.ch> wrote: > > FWIW akpm posted a patch to initialize min_free_kbytes depending on > available RAM which seemed to make sense but it hasn't made it into > mainline yet. Yup. ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test11/2.6.0-test11-mm1/broken-out/scale-min_free_kbytes.patch Also, note that setup_per_zone_pages_min() plays games to ensure that the highmem zone's free pages limit is small: there's not a lot of point in keeping lots of highmem pages free. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 19:49 ` Roger Luethi 2003-12-17 21:41 ` Andrew Morton @ 2003-12-17 21:41 ` Roger Luethi 2003-12-18 0:21 ` Rik van Riel 1 sibling, 1 reply; 63+ messages in thread From: Roger Luethi @ 2003-12-17 21:41 UTC (permalink / raw) To: Rik van Riel, Andrew Morton, Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh On Wed, 17 Dec 2003 20:49:51 +0100, Roger Luethi wrote: > right now just to make sure. It's going to take a couple of hours, > I'll follow up with results. For efax, a benchmark run with mem=32M, the difference in run time between values 256 and 1024 for /proc/sys/vm/min_free_kbytes is noise (< 1%). Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-17 21:41 ` Roger Luethi @ 2003-12-18 0:21 ` Rik van Riel 2003-12-18 22:53 ` Roger Luethi 0 siblings, 1 reply; 63+ messages in thread From: Rik van Riel @ 2003-12-18 0:21 UTC (permalink / raw) To: Roger Luethi Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh On Wed, 17 Dec 2003, Roger Luethi wrote: > On Wed, 17 Dec 2003 20:49:51 +0100, Roger Luethi wrote: > > right now just to make sure. It's going to take a couple of hours, > > I'll follow up with results. > > For efax, a benchmark run with mem=32M, the difference in run time > between values 256 and 1024 for /proc/sys/vm/min_free_kbytes is noise > (< 1%). OK, so I guess you're not as close to the knee of the curve as this kind of tests tend to be ;) -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-18 0:21 ` Rik van Riel @ 2003-12-18 22:53 ` Roger Luethi 2003-12-18 23:38 ` William Lee Irwin III 0 siblings, 1 reply; 63+ messages in thread From: Roger Luethi @ 2003-12-18 22:53 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh On Wed, 17 Dec 2003 19:21:52 -0500, Rik van Riel wrote: > > For efax, a benchmark run with mem=32M, the difference in run time > > between values 256 and 1024 for /proc/sys/vm/min_free_kbytes is noise > > (< 1%). > > OK, so I guess you're not as close to the knee > of the curve as this kind of tests tend to be ;) Depends on the axis in your graph. The benchmarks I am using are not balancing on the verge of going bad, if that's what you mean. They cut deep (30 to 100 MB) into swap through most of their run time, and there's quite a bit of swap turnover with compiling stuff. I also completed a best effort attempt at determining the impact of any differences between mem= and actual RAM removal. I had to adapt the kbuild benchmark somewhat to the available hardware. I benchmarked with 48 MB RAM at mem=16M and again after removing 32MB of RAM. If there was a difference in performance, it was very small for both 2.4.23 and 2.6.0-test11, with the latter taking over 2.5 times as long to complete the benchmark. Roger ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: 2.6.0-test9 - poor swap performance on low end machines 2003-12-18 22:53 ` Roger Luethi @ 2003-12-18 23:38 ` William Lee Irwin III 0 siblings, 0 replies; 63+ messages in thread From: William Lee Irwin III @ 2003-12-18 23:38 UTC (permalink / raw) To: rl, Rik van Riel, Andrew Morton, Andrea Arcangeli, kernel, chris, linux-kernel, mbligh On Thu, Dec 18, 2003 at 11:53:25PM +0100, Roger Luethi wrote: > Depends on the axis in your graph. The benchmarks I am using are not > balancing on the verge of going bad, if that's what you mean. They > cut deep (30 to 100 MB) into swap through most of their run time, > and there's quite a bit of swap turnover with compiling stuff. > I also completed a best effort attempt at determining the impact of > any differences between mem= and actual RAM removal. I had to adapt > the kbuild benchmark somewhat to the available hardware. I benchmarked > with 48 MB RAM at mem=16M and again after removing 32MB of RAM. If there > was a difference in performance, it was very small for both 2.4.23 and > 2.6.0-test11, with the latter taking over 2.5 times as long to complete > the benchmark. A bogon was recently fixed in 2.6 that caused the results to differ. They should not differ. -- wli ^ permalink raw reply [flat|nested] 63+ messages in thread