linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* 2.6.0-test9 - poor swap performance on low end machines
@ 2003-10-29 22:30 Chris Vine
  2003-10-31  3:57 ` Rik van Riel
  0 siblings, 1 reply; 63+ messages in thread
From: Chris Vine @ 2003-10-29 22:30 UTC (permalink / raw)
  To: linux-kernel

Hi,

I have been testing the 2.6.0-test9 kernel on a couple of desktop machines.  
On the first, a 1.8GHz Pentium 4 uniprocessor with 512MHz of RAM, it seems to 
perform fine, and on various compilation tests, compile times for the test 
programs I have compiled are pretty much the same as those obtained with a 
stock 2.4.22 kernel, and the 2.6 kernel seems to be slightly more responsive 
on the desktop.  Nothing I use it for knocks it substantially into swap.

However, on a low end machine (200MHz Pentium MMX uniprocessor with only 32MB 
of RAM and 70MB of swap) I get poor performance once extensive use is made of 
the swap space.  On a test compile of a C++ program involving quite a lot of 
templates and therefore which is quite memory intensive, it chugs along with 
the stock 2.4.22 kernel and completes the compile in about 10 minutes going 
(at its worst) into about 34MB of swap.  However, doing the same compile on 
the 2.6.0-test9 kernel, it reaches about 22MB into swap and then goes into 
some kind of swap frenzy, continuously swaping and unswapping.  Even after 
leaving it for an hour it continuously swaps and unswaps and fails to compile 
even the first file (which takes about 2 minutes using the 2.4.22 kernel) and 
sticks at about 24MB of swap.  The kernel is compiled with gcc-2.95.3.

Chris.

PS Please copy any replies to my e-mail address.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-29 22:30 2.6.0-test9 - poor swap performance on low end machines Chris Vine
@ 2003-10-31  3:57 ` Rik van Riel
  2003-10-31 11:26   ` Roger Luethi
                     ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Rik van Riel @ 2003-10-31  3:57 UTC (permalink / raw)
  To: Chris Vine; +Cc: linux-kernel, Con Kolivas

On Wed, 29 Oct 2003, Chris Vine wrote:

> However, on a low end machine (200MHz Pentium MMX uniprocessor with only 32MB 
> of RAM and 70MB of swap) I get poor performance once extensive use is made of 
> the swap space.

Could you try the patch Con Kolivas posted on the 25th ?

Subject: [PATCH] Autoregulate vm swappiness cleanup


-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-31  3:57 ` Rik van Riel
@ 2003-10-31 11:26   ` Roger Luethi
  2003-10-31 12:37     ` Con Kolivas
  2003-10-31 12:55     ` Ed Tomlinson
  2003-10-31 21:52   ` Chris Vine
  2003-11-02 23:06   ` Chris Vine
  2 siblings, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2003-10-31 11:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Chris Vine, linux-kernel, Con Kolivas

On Thu, 30 Oct 2003 22:57:23 -0500, Rik van Riel wrote:
> On Wed, 29 Oct 2003, Chris Vine wrote:
> 
> > However, on a low end machine (200MHz Pentium MMX uniprocessor with only 32MB 
> > of RAM and 70MB of swap) I get poor performance once extensive use is made of 
> > the swap space.
> 
> Could you try the patch Con Kolivas posted on the 25th ?
> 
> Subject: [PATCH] Autoregulate vm swappiness cleanup

I suppose it will show some improvement but fail to get performance
anywhere near 2.4 -- at least that's what my own tests found. I've been
working on a break-down of where we're losing it.
Bottom line: It's not simply a price we pay for feature X or Y. It's
all over the map, and thus no single patch can possibly fix it.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-31 11:26   ` Roger Luethi
@ 2003-10-31 12:37     ` Con Kolivas
  2003-10-31 12:59       ` Roger Luethi
  2003-10-31 12:55     ` Ed Tomlinson
  1 sibling, 1 reply; 63+ messages in thread
From: Con Kolivas @ 2003-10-31 12:37 UTC (permalink / raw)
  To: Roger Luethi, Rik van Riel; +Cc: Chris Vine, linux-kernel

On Fri, 31 Oct 2003 22:26, Roger Luethi wrote:
> On Thu, 30 Oct 2003 22:57:23 -0500, Rik van Riel wrote:
> > On Wed, 29 Oct 2003, Chris Vine wrote:
> > > However, on a low end machine (200MHz Pentium MMX uniprocessor with
> > > only 32MB of RAM and 70MB of swap) I get poor performance once
> > > extensive use is made of the swap space.
> >
> > Could you try the patch Con Kolivas posted on the 25th ?
> >
> > Subject: [PATCH] Autoregulate vm swappiness cleanup
>
> I suppose it will show some improvement but fail to get performance
> anywhere near 2.4 -- at least that's what my own tests found. I've been
> working on a break-down of where we're losing it.
> Bottom line: It's not simply a price we pay for feature X or Y. It's
> all over the map, and thus no single patch can possibly fix it.

Yes it will show improvement, and I would like to hear how much given how 
simple it is, but I agree with you. There is an intrinsic difference in the 
vm in 2.6 that makes it too hard for multiple running applications to have a 
small piece of the action instead of giving out big pieces of the action. 
While it is better in most circumstances I believe you describe well the 
problem under vm overload. I guess encoding a vm scheduler will help (and 
clearly 2.8 territory) but at what overhead cost? I have no idea myself, as 
now I'm pulling catch-phrases out of my arse that I hate hearing others use 
(see any lkml thread about scheduling from people who don't code).

Cheers,
Con


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-31 11:26   ` Roger Luethi
  2003-10-31 12:37     ` Con Kolivas
@ 2003-10-31 12:55     ` Ed Tomlinson
  2003-11-01 18:34       ` Pasi Savolainen
  2003-11-06 18:40       ` bill davidsen
  1 sibling, 2 replies; 63+ messages in thread
From: Ed Tomlinson @ 2003-10-31 12:55 UTC (permalink / raw)
  To: linux-kernel

On October 31, 2003 06:26 am, Roger Luethi wrote:
> On Thu, 30 Oct 2003 22:57:23 -0500, Rik van Riel wrote:
>
> > On Wed, 29 Oct 2003, Chris Vine wrote:
> > 
> >
> > > However, on a low end machine (200MHz Pentium MMX uniprocessor with
> > > only 32MB \r of RAM and 70MB of swap) I get poor performance once
> > > extensive use is made of the swap space.
> >
> > 
> > Could you try the patch Con Kolivas posted on the 25th ?
> > 
> > Subject: [PATCH] Autoregulate vm swappiness cleanup
>
> 
> I suppose it will show some improvement but fail to get performance
> anywhere near 2.4 -- at least that's what my own tests found. I've been
> working on a break-down of where we're losing it.
> Bottom line: It's not simply a price we pay for feature X or Y. It's
> all over the map, and thus no single patch can possibly fix it.

With 2.6 its possible to tell the kernel how much to swap.  Con's patch
tries to keep applications in memory.  You can also play with 
/proc/sys/vm/swappiness which is what Con's patch tries to replace.

Ed Tomlinson

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-31 12:37     ` Con Kolivas
@ 2003-10-31 12:59       ` Roger Luethi
  0 siblings, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2003-10-31 12:59 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Rik van Riel, Chris Vine, linux-kernel

On Fri, 31 Oct 2003 23:37:34 +1100, Con Kolivas wrote:
> Yes it will show improvement, and I would like to hear how much given how 

I've been sitting on my data because I was waiting for the missing
pieces from my test box, but here's a data point: For my test case,
your patch improves run time from 500 to 440 seconds.

> simple it is, but I agree with you. There is an intrinsic difference in the 
> vm in 2.6 that makes it too hard for multiple running applications to have a 

My (probably surprising to many) finding is that there _isn't_
an intrinsic difference which makes 2.6 suck. There are a number of
_separate_ issues, and they are only related in their contribution to
making 2.6 thrashing behavior abysmal.

What I'm trying to find out is whether the issues are intrinsic to
a change in some mechanisms (which typically means it's a price we
have to pay for other benefits) or if they are just problems with the
implementation. I had tracked down vm_swappiness as one problem, and
your solution shows that the implementation could indeed be improved
without touching the fundamental VM workings at all.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-31  3:57 ` Rik van Riel
  2003-10-31 11:26   ` Roger Luethi
@ 2003-10-31 21:52   ` Chris Vine
  2003-11-02 23:06   ` Chris Vine
  2 siblings, 0 replies; 63+ messages in thread
From: Chris Vine @ 2003-10-31 21:52 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Con Kolivas

On Friday 31 October 2003 3:57 am, Rik van Riel wrote:
> On Wed, 29 Oct 2003, Chris Vine wrote:
> > However, on a low end machine (200MHz Pentium MMX uniprocessor with only
> > 32MB of RAM and 70MB of swap) I get poor performance once extensive use
> > is made of the swap space.
>
> Could you try the patch Con Kolivas posted on the 25th ?
>
> Subject: [PATCH] Autoregulate vm swappiness cleanup

I will do that over the weekend and report back.

Chris.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-31 12:55     ` Ed Tomlinson
@ 2003-11-01 18:34       ` Pasi Savolainen
  2003-11-06 18:40       ` bill davidsen
  1 sibling, 0 replies; 63+ messages in thread
From: Pasi Savolainen @ 2003-11-01 18:34 UTC (permalink / raw)
  To: linux-kernel

* Ed Tomlinson <edt@aei.ca>:
> On October 31, 2003 06:26 am, Roger Luethi wrote:
>> On Thu, 30 Oct 2003 22:57:23 -0500, Rik van Riel wrote:
>>
>> > On Wed, 29 Oct 2003, Chris Vine wrote:
>> > 
>> >
>> > > However, on a low end machine (200MHz Pentium MMX uniprocessor with
>> > > only 32MB \r of RAM and 70MB of swap) I get poor performance once
>> > > extensive use is made of the swap space.
>> >
>> > 
>> > Could you try the patch Con Kolivas posted on the 25th ?
>> > 
>> > Subject: [PATCH] Autoregulate vm swappiness cleanup
>>
>> 
>> I suppose it will show some improvement but fail to get performance
>> anywhere near 2.4 -- at least that's what my own tests found. I've been
>> working on a break-down of where we're losing it.
>> Bottom line: It's not simply a price we pay for feature X or Y. It's
>> all over the map, and thus no single patch can possibly fix it.
>
> With 2.6 its possible to tell the kernel how much to swap.  Con's patch
> tries to keep applications in memory.  You can also play with 
> /proc/sys/vm/swappiness which is what Con's patch tries to replace.

FWIW, I've been getting horrible performance when FREEING swap space. UI
would just hang for several seconds. It've been that way all the -test
timeframe. 
test9 has been much better, but seemingly only because it doesn't seem
to like swap as good as before, it doesn't use it unless in dire need.

It's not exactly low-end machine either: 2x1800, 512M + 1G swap.
DMA is on for IDE.

below is 'vmstat 1', I loaded some 30MB images into GIMP and made tight
operations (undo level at 100). Loaded some more heavy apps, then moved
to GIMP's workspace and closed it. There was 'no-response' -time at
marked places:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
...
 1  1 318784   6788   9500 134092  964    0  1652   220 1262  1185  9  2 44 46
 0  1 318784   5188   9512 134164 1548    0  1620     4 1162   861  8  1 32 59
 0  1 318784   3140   9268 132052 8000    0  8000     0 1299  1130  4  2 48 46
 0  2 331280   2812   8860 127072 7360 13336  7360 13336 1502  1700  1  4 44 51
 0  1 331280   3800   8644 125816 4648    0  4648     4 1368  1330  1  3 29 67
 0  2 333596   3104   8632 125312 3136 2340  3136  2372 1466  1809  2  4 44 51
 1  2 336548   3620   8588 124924 5736 2972  5736  2972 1658  2163  3  5 23 69
 0  2 336468   3752   8668 124836  736   92   816   112 1454  1697  3  4 23 71
 0  1 336468   5232   8760 123288 5468    0  5536   136 1694  2265  3  4 28 65
- click GIMP 'Quit' -
 1  0 334848   4680   8808 124088 2200    0  3000    56 1339  1404  3  2 34 61
 0  1 291692 150596   8812 124496  876    0  1248     0 1232  1116 13  3 48 36
- GUI freeze -
 0  1 291692 148432   8812 124164 2576    0  2576     0 1369   561  1  1 50 49
 0  1 291692 146128   8812 124132 2344    0  2344     0 1284   488  1  1 50 49
 0  2 291692 143828   8812 124048 2396    0  2396    28 1254   373  0  2 49 49
 0  2 291692 141588   8828 124092 2320    0  2320    52 1270   583  0  1  6 92
 0  2 291692 139228   8848 124044 2272    0  2272    32 1257   524  0  1 13 86
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1 291692 136988   8848 124064 2224    0  2224     0 1213   401  1  1 50 49
 0  1 291692 134688   8848 124048 2328    0  2328     0 1349   362  1  1 49 49
 0  1 291692 132128   8848 124040 2524    0  2524     0 1408   371  0  2 49 49
 0  1 291692 129824   8848 124028 2324    0  2324     0 1210   361  0  1 49 50
 0  1 291692 127520   8852 124044 2292    0  2292   164 1266   474  0  1 29 69
 0  2 291692 125348   8868 124060 2144    0  2144    28 1225   473  0  2  8 90
 0  2 291692 122788   8888 124100 2524    0  2524    32 1283   557  1  1 24 75
 0  1 291692 120612   8888 124092 2184    0  2184     0 1181   367  0  1 43 56
 0  1 291692 118052   8888 124092 2516    0  2516     0 1176   346  1  1 50 49
 0  1 291692 115684   8888 124068 2404    0  2404     0 1178   337  1  1 50 49
 0  1 291692 113124   8888 124112 2540    0  2540     0 1183   349  1  1 50 49
 0  1 291692 111080   8888 123688 2464    0  2464     0 1186   336  1  1 49 50
 0  0 291692 109920   8924 123712 1164    0  1164    60 1165   869  3  1 57 39
- GUI available -
 0  0 291692 109924   8924 123712    0    0     0     0 1079   632  0  0 100  0
 0  0 291692 109924   8924 123712    0    0     0     0 1079   687  0  0 100  0
 0  0 291692 109924   8924 123712    0    0     0     0 1074   645  0  0 100  0
 0  0 291692 109928   8924 123712    0    0     0     0 1122   797  1  0 100  0


-- 
   Psi -- <http://www.iki.fi/pasi.savolainen>
Vivake -- Virtuaalinen valokuvauskerho <http://members.lycos.co.uk/vivake/>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-31  3:57 ` Rik van Riel
  2003-10-31 11:26   ` Roger Luethi
  2003-10-31 21:52   ` Chris Vine
@ 2003-11-02 23:06   ` Chris Vine
  2003-11-03  0:48     ` Con Kolivas
  2 siblings, 1 reply; 63+ messages in thread
From: Chris Vine @ 2003-11-02 23:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Con Kolivas

On Friday 31 October 2003 3:57 am, Rik van Riel wrote:
> On Wed, 29 Oct 2003, Chris Vine wrote:
> > However, on a low end machine (200MHz Pentium MMX uniprocessor with only
> > 32MB of RAM and 70MB of swap) I get poor performance once extensive use
> > is made of the swap space.
>
> Could you try the patch Con Kolivas posted on the 25th ?
>
> Subject: [PATCH] Autoregulate vm swappiness cleanup

OK.  I have now done some testing.

The default swappiness in the kernel (without Con's patch) is 60.  This gives 
hopeless swapping results on a 200MHz Pentium with 32MB of RAM once the 
amount of memory swapped out exceeds about 15 to 20MB.  A static swappiness 
of 10 gives results which work under load, with up to 40MB swapped out (I 
haven't tested beyond that).  Compile times with a test file requiring about 
35MB of swap and with everything else the same are:

2.4.22 - average of 1 minute 35 seconds
2.6.0-test9 (swappiness 10) - average of 5 minutes 56 seconds

A swappiness of 5 on the test compile causes the machine to hang in some kind 
of "won't swap/can't continue without more memory" stand-off, and a 
swappiness of 20 starts the machine thrashing to the point where I stopped 
the compile.  A swappiness of 10 would complete anything I threw at it and 
without excessive thrashing, but more slowly (and using a little more swap 
space) than 2.4.22.

With Con's dynamic swappiness patch things were worse, rather than better.  
With no load, the swappiness (now read only) was around 37.  Under load with 
the test compile, swappiness went up to around 62, thrashing began, and after 
30 minutes the compile still had not completed, swappiness had reached 70, 
and I abandoned it.

Chris.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-11-02 23:06   ` Chris Vine
@ 2003-11-03  0:48     ` Con Kolivas
  2003-11-03 21:13       ` Chris Vine
  0 siblings, 1 reply; 63+ messages in thread
From: Con Kolivas @ 2003-11-03  0:48 UTC (permalink / raw)
  To: Chris Vine, Rik van Riel; +Cc: linux-kernel, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 2273 bytes --]

On Mon, 3 Nov 2003 10:06, Chris Vine wrote:
> On Friday 31 October 2003 3:57 am, Rik van Riel wrote:
> > On Wed, 29 Oct 2003, Chris Vine wrote:
> > > However, on a low end machine (200MHz Pentium MMX uniprocessor with
> > > only 32MB of RAM and 70MB of swap) I get poor performance once
> > > extensive use is made of the swap space.
> >
> > Could you try the patch Con Kolivas posted on the 25th ?
> >
> > Subject: [PATCH] Autoregulate vm swappiness cleanup
>
> OK.  I have now done some testing.
>
> The default swappiness in the kernel (without Con's patch) is 60.  This
> gives hopeless swapping results on a 200MHz Pentium with 32MB of RAM once
> the amount of memory swapped out exceeds about 15 to 20MB.  A static
> swappiness of 10 gives results which work under load, with up to 40MB
> swapped out (I haven't tested beyond that).  Compile times with a test file
> requiring about 35MB of swap and with everything else the same are:
>
> 2.4.22 - average of 1 minute 35 seconds
> 2.6.0-test9 (swappiness 10) - average of 5 minutes 56 seconds
>
> A swappiness of 5 on the test compile causes the machine to hang in some
> kind of "won't swap/can't continue without more memory" stand-off, and a
> swappiness of 20 starts the machine thrashing to the point where I stopped
> the compile.  A swappiness of 10 would complete anything I threw at it and
> without excessive thrashing, but more slowly (and using a little more swap
> space) than 2.4.22.
>
> With Con's dynamic swappiness patch things were worse, rather than better.
> With no load, the swappiness (now read only) was around 37.  Under load
> with the test compile, swappiness went up to around 62, thrashing began,
> and after 30 minutes the compile still had not completed, swappiness had
> reached 70, and I abandoned it.

Well I was considering adding the swap pressure to this algorithm but I had 
hoped 2.6 behaved better than this under swap overload which is what appears 
to happen to yours. Can you try this patch? It takes into account swap 
pressure as well. It wont be as aggressive as setting the swappiness manually 
to 10, but unlike a swappiness of 10 it will be more useful over a wide range 
of hardware and circumstances.

Con

P.S. patches available here: http://ck.kolivas.org/patches

[-- Attachment #2: patch-test9-am-5 --]
[-- Type: text/x-diff, Size: 2320 bytes --]

--- linux-2.6.0-test8-base/kernel/sysctl.c	2003-10-20 14:16:54.000000000 +1000
+++ linux-2.6.0-test8/kernel/sysctl.c	2003-11-03 10:49:15.000000000 +1100
@@ -664,11 +664,8 @@ static ctl_table vm_table[] = {
 		.procname	= "swappiness",
 		.data		= &vm_swappiness,
 		.maxlen		= sizeof(vm_swappiness),
-		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_minmax,
-		.strategy	= &sysctl_intvec,
-		.extra1		= &zero,
-		.extra2		= &one_hundred,
+		.mode		= 0444 /* read-only*/,
+		.proc_handler	= &proc_dointvec,
 	},
 #ifdef CONFIG_HUGETLB_PAGE
 	 {
--- linux-2.6.0-test8-base/mm/vmscan.c	2003-10-20 14:16:54.000000000 +1000
+++ linux-2.6.0-test8/mm/vmscan.c	2003-11-03 11:38:08.542960408 +1100
@@ -47,7 +47,7 @@
 /*
  * From 0 .. 100.  Higher means more swappy.
  */
-int vm_swappiness = 60;
+int vm_swappiness = 0;
 static long total_memory;
 
 #ifdef ARCH_HAS_PREFETCH
@@ -600,6 +600,7 @@ refill_inactive_zone(struct zone *zone, 
 	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
 	struct page *page;
 	struct pagevec pvec;
+	struct sysinfo i;
 	int reclaim_mapped = 0;
 	long mapped_ratio;
 	long distress;
@@ -641,14 +642,38 @@ refill_inactive_zone(struct zone *zone, 
 	 */
 	mapped_ratio = (ps->nr_mapped * 100) / total_memory;
 
+	si_swapinfo(&i);
+	if (unlikely(!i.totalswap))
+		vm_swappiness = 0;
+	else {
+		int app_centile, swap_centile;
+
+		/*
+		 * app_centile is the percentage of physical ram used
+		 * by application pages.
+		 */
+		si_meminfo(&i);
+		app_centile = 100 - (((i.freeram + get_page_cache_size() -
+			swapper_space.nrpages) * 100) / i.totalram);
+
+		/*
+		 * swap_centile is the percentage of free swap.
+		 */
+		swap_centile = i.freeswap * 100 / i.totalswap;
+
+		/*
+		 * Autoregulate vm_swappiness to be equal to the lowest of
+		 * app_centile and swap_centile. -ck
+		 */
+		vm_swappiness = min(app_centile, swap_centile);
+	}
+
 	/*
 	 * Now decide how much we really want to unmap some pages.  The mapped
 	 * ratio is downgraded - just because there's a lot of mapped memory
 	 * doesn't necessarily mean that page reclaim isn't succeeding.
 	 *
 	 * The distress ratio is important - we don't want to start going oom.
-	 *
-	 * A 100% value of vm_swappiness overrides this algorithm altogether.
 	 */
 	swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-11-03  0:48     ` Con Kolivas
@ 2003-11-03 21:13       ` Chris Vine
  2003-11-04  2:55         ` Con Kolivas
  0 siblings, 1 reply; 63+ messages in thread
From: Chris Vine @ 2003-11-03 21:13 UTC (permalink / raw)
  To: Con Kolivas, Rik van Riel; +Cc: linux-kernel, Martin J. Bligh

On Monday 03 November 2003 12:48 am, Con Kolivas wrote:

> Well I was considering adding the swap pressure to this algorithm but I had
> hoped 2.6 behaved better than this under swap overload which is what
> appears to happen to yours. Can you try this patch? It takes into account
> swap pressure as well. It wont be as aggressive as setting the swappiness
> manually to 10, but unlike a swappiness of 10 it will be more useful over a
> wide range of hardware and circumstances.

Hi,

I applied the patch.

The test compile started in a similar way to the compile when using your first 
patch.  swappiness under no load was 37.  At the beginning of the compile it 
went up to 67, but when thrashing was well established it started to come 
down slowly.  After 40 minutes of thrashing it came down to 53.  At that 
point I stopped the compile attempt (which did not complete).

So, there is a slight move in the right direction, but given that a swappiness 
of 20 generates thrashing with 32 MB of RAM when more than about 20MB of 
memory is swapped out, it is a drop in the ocean.

The conclusion appears to be that for low end systems, once memory swapped out 
reaches about 60% of installed RAM the swap ceases to work effectively unless 
swappiness is much more aggressively low than your patch achieves.  The 
ability manually to tune it therefore seems to be required (and even then, 
2.4.22 is considerably better, compiling the test file in about 1 minute 35 
seconds).

I suppose one question is whether I would get the same thrashiness with my 
other machine (which has 512MB of RAM) once more than about 300MB is swapped 
out.  However, I cannot answer that question as I do not have anything here 
which makes memory demands of that kind.

Chris.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-11-03 21:13       ` Chris Vine
@ 2003-11-04  2:55         ` Con Kolivas
  2003-11-04 22:08           ` Chris Vine
  2003-12-08 13:52           ` William Lee Irwin III
  0 siblings, 2 replies; 63+ messages in thread
From: Con Kolivas @ 2003-11-04  2:55 UTC (permalink / raw)
  To: Chris Vine, Rik van Riel
  Cc: linux-kernel, Martin J. Bligh, William Lee Irwin III

On Tue, 4 Nov 2003 08:13, Chris Vine wrote:
> On Monday 03 November 2003 12:48 am, Con Kolivas wrote:
> > Well I was considering adding the swap pressure to this algorithm but I
> > had hoped 2.6 behaved better than this under swap overload which is what
> > appears to happen to yours. Can you try this patch? It takes into account
> > swap pressure as well. It wont be as aggressive as setting the swappiness
> > manually to 10, but unlike a swappiness of 10 it will be more useful over
> > a wide range of hardware and circumstances.

>
> The test compile started in a similar way to the compile when using your
> first patch.  swappiness under no load was 37.  At the beginning of the
> compile it went up to 67, but when thrashing was well established it
> started to come down slowly.  After 40 minutes of thrashing it came down to
> 53.  At that point I stopped the compile attempt (which did not complete).
>
> So, there is a slight move in the right direction, but given that a
> swappiness of 20 generates thrashing with 32 MB of RAM when more than about
> 20MB of memory is swapped out, it is a drop in the ocean.
>
> The conclusion appears to be that for low end systems, once memory swapped
> out reaches about 60% of installed RAM the swap ceases to work effectively
> unless swappiness is much more aggressively low than your patch achieves. 
> The ability manually to tune it therefore seems to be required (and even
> then, 2.4.22 is considerably better, compiling the test file in about 1
> minute 35 seconds).
>
> I suppose one question is whether I would get the same thrashiness with my
> other machine (which has 512MB of RAM) once more than about 300MB is
> swapped out.  However, I cannot answer that question as I do not have
> anything here which makes memory demands of that kind.

That's pretty much what I expected. Overall I'm happier with this later 
version as it doesn't impact on the noticable improvement on systems that are 
not overloaded, yet keeps performance at least that of the untuned version. I 
can tune it to be better for this work load but it would be to the detriment 
of the rest. 

Ultimately this is the problem I see with 2.6 ; there is no way for the vm to 
know that "all the pages belonging to the currently running tasks should try 
their best to fit into the available space by getting an equal share". It 
seems the 2.6 vm gives nice emphasis to the most current task, but at the 
detriment of other tasks that are on the runqueue and still need ram. The 
original design of the 2.6 vm didn't even include this last ditch effort at 
taming swappiness with the "knob", and behaved as though the swapppiness was 
always set at 100. Trying to tune this further with just the swappiness value 
will prove futile as can be seen by the "best" setting of 20 in your test 
case still taking 4 times longer to compile the kernel. 

This is now a balance tradeoff of trying to set a value that works for your 
combination of the required ram of the applications you run concurrently, the 
physical ram and the swap ram. As you can see from your example, in your 
workload it seems there would be no point having more swap than your physical 
ram since even if it tries to use say 40Mb it just drowns in a swapstorm. 
Clearly this is not the case in a machine with more ram in different 
circumstances, as swapping out say openoffice and mozilla while it's not 
being used will not cause any harm to a kernel compile that takes up all the 
available physical ram (it would actually be beneficial). Fortunately most 
modern machines' ram vs application sizes are of the latter balance.

There's always so much more you can do...

wli, riel care to comment?

Cheers,
Con


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-11-04  2:55         ` Con Kolivas
@ 2003-11-04 22:08           ` Chris Vine
  2003-11-04 22:30             ` Con Kolivas
  2003-12-08 13:52           ` William Lee Irwin III
  1 sibling, 1 reply; 63+ messages in thread
From: Chris Vine @ 2003-11-04 22:08 UTC (permalink / raw)
  To: Con Kolivas, Rik van Riel
  Cc: linux-kernel, Martin J. Bligh, William Lee Irwin III

On Tuesday 04 November 2003 2:55 am, Con Kolivas wrote:
> That's pretty much what I expected. Overall I'm happier with this later
> version as it doesn't impact on the noticable improvement on systems that
> are not overloaded, yet keeps performance at least that of the untuned
> version. I can tune it to be better for this work load but it would be to
> the detriment of the rest.
>
> Ultimately this is the problem I see with 2.6 ; there is no way for the vm
> to know that "all the pages belonging to the currently running tasks should
> try their best to fit into the available space by getting an equal share".
> It seems the 2.6 vm gives nice emphasis to the most current task, but at
> the detriment of other tasks that are on the runqueue and still need ram.
> The original design of the 2.6 vm didn't even include this last ditch
> effort at taming swappiness with the "knob", and behaved as though the
> swapppiness was always set at 100. Trying to tune this further with just
> the swappiness value will prove futile as can be seen by the "best" setting
> of 20 in your test case still taking 4 times longer to compile the kernel.
>
> This is now a balance tradeoff of trying to set a value that works for your
> combination of the required ram of the applications you run concurrently,
> the physical ram and the swap ram. As you can see from your example, in
> your workload it seems there would be no point having more swap than your
> physical ram since even if it tries to use say 40Mb it just drowns in a
> swapstorm. Clearly this is not the case in a machine with more ram in
> different circumstances, as swapping out say openoffice and mozilla while
> it's not being used will not cause any harm to a kernel compile that takes
> up all the available physical ram (it would actually be beneficial).
> Fortunately most modern machines' ram vs application sizes are of the
> latter balance.

Your diagnosis looks right, but two points -

1.  The test compile was not of the kernel but of a file in a C++ program 
using quite a lot of templates and therefore which is quite memory intensive 
(for the sake of choosing something, it was a compile of src/main.o in 
http://www.cvine.freeserve.co.uk/efax-gtk/efax-gtk-2.2.2.src.tgz).  It would 
be a sad day if the kernel could not be compiled under 2.6 in 32MB of memory, 
and I am glad to say that it does compile - my 2.6.0-test9 kernel compiles on 
the 32MB machine in on average 45 minutes 13 seconds under kernel 2.4.22, and 
in 54 minutes 11 seconds under 2.6.0-test9 with your latest patch, which is 
not an enormous difference.  (As a digression, in the 2.0 days the kernel 
would compile in 6 minutes on the machine in question, and at the time I was 
very impressed.)

2.  Being able to choose a manual setting for swappiness is not "futile".  As 
I mentioned in an earlier post, a swappiness of 10 will enable 2.6.0-test9 to 
compile the things I threw at it on a low end machine, albeit slowly, whereas 
with dynamic swappiness it would not compile at all.  So the difference is 
between being able to do something and not being able to do it.

Chris.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-11-04 22:08           ` Chris Vine
@ 2003-11-04 22:30             ` Con Kolivas
  0 siblings, 0 replies; 63+ messages in thread
From: Con Kolivas @ 2003-11-04 22:30 UTC (permalink / raw)
  To: Chris Vine, Rik van Riel
  Cc: linux-kernel, Martin J. Bligh, William Lee Irwin III

On Wed, 5 Nov 2003 09:08, Chris Vine wrote:
> Your diagnosis looks right, but two points -
>
> 1.  The test compile was not of the kernel but of a file in a C++ program
> using quite a lot of templates and therefore which is quite memory
> intensive (for the sake of choosing something, it was a compile of
> src/main.o in
> http://www.cvine.freeserve.co.uk/efax-gtk/efax-gtk-2.2.2.src.tgz).  It
> would be a sad day if the kernel could not be compiled under 2.6 in 32MB of
> memory, and I am glad to say that it does compile - my 2.6.0-test9 kernel
> compiles on the 32MB machine in on average 45 minutes 13 seconds under
> kernel 2.4.22, and in 54 minutes 11 seconds under 2.6.0-test9 with your
> latest patch, which is not an enormous difference.  (As a digression, in
> the 2.0 days the kernel would compile in 6 minutes on the machine in
> question, and at the time I was very impressed.)

Phew. It would be sad if it couldn't compile a kernel indeed.
>
> 2.  Being able to choose a manual setting for swappiness is not "futile". 
> As I mentioned in an earlier post, a swappiness of 10 will enable
> 2.6.0-test9 to compile the things I threw at it on a low end machine,
> albeit slowly, whereas with dynamic swappiness it would not compile at all.
>  So the difference is between being able to do something and not being able
> to do it.

I agree with you on that; I meant it would be futile trying to get the compile 
times back to 2.4 levels with just this tunable modified alone (statically or 
dynamically)... which means we should look elsewhere for ways to tackle this.

Con


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-10-31 12:55     ` Ed Tomlinson
  2003-11-01 18:34       ` Pasi Savolainen
@ 2003-11-06 18:40       ` bill davidsen
  1 sibling, 0 replies; 63+ messages in thread
From: bill davidsen @ 2003-11-06 18:40 UTC (permalink / raw)
  To: linux-kernel

In article <200310310755.36224.edt@aei.ca>, Ed Tomlinson  <edt@aei.ca> wrote:

| With 2.6 its possible to tell the kernel how much to swap.  Con's patch
| tries to keep applications in memory.  You can also play with 
| /proc/sys/vm/swappiness which is what Con's patch tries to replace.

I added Nick's sched and io patches to Con's patch on test9, and it
looked stable under load. But I'm (mostly) on vacation this week, so it
isn't being tested any more. My responsiveness test didn't show it to be
as good as 2.4, unfortunately.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-11-04  2:55         ` Con Kolivas
  2003-11-04 22:08           ` Chris Vine
@ 2003-12-08 13:52           ` William Lee Irwin III
  2003-12-08 14:23             ` Con Kolivas
  2003-12-08 19:49             ` Roger Luethi
  1 sibling, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-08 13:52 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Tue, Nov 04, 2003 at 01:55:08PM +1100, Con Kolivas wrote:
> This is now a balance tradeoff of trying to set a value that works for your 
> combination of the required ram of the applications you run concurrently, the 
> physical ram and the swap ram. As you can see from your example, in your 
> workload it seems there would be no point having more swap than your physical 
> ram since even if it tries to use say 40Mb it just drowns in a swapstorm. 
> Clearly this is not the case in a machine with more ram in different 
> circumstances, as swapping out say openoffice and mozilla while it's not 
> being used will not cause any harm to a kernel compile that takes up all the 
> available physical ram (it would actually be beneficial). Fortunately most 
> modern machines' ram vs application sizes are of the latter balance.
> There's always so much more you can do...
> wli, riel care to comment?

Explicit load control is in order. 2.4 appears to work better in these
instances because it victimizes one process at a time. It vaguely
resembles load control with a random demotion policy (mmlist order is
effectively random), but is the only method of page reclamation, which
disturbs its two-stage LRU, and basically livelocks in various situations
because having "demoted" a process address space to whatever extent it
does fails to eliminate it from consideration during further attempts
to reclaim memory to satisfy allocations.

On smaller machines or workloads with high levels of overcommitment
(in a sense different from non-overcommit; here it means that if all
tasks were executing simultaneously over some period of time they
would require more RAM than the machine has), the effect of load control
dominates replacement by several orders of magnitude, so the mere
presence of anything like a load control mechanism does them wonders.

According to a study from the 80's (Carr's thesis), the best load
control policies are demoting the smallest task, demoting the "most
recently activated task", and demoting the "task with the largest
remaining quantum". The latter two no longer make sense in the presence
of threads, or at least have to be revised not to assume a unique
execution context associated with a process address space. These three
Were said to be largely equivalent and performed 15% better than random.

Other important aspects of load control beyond the demotion policy are
explicit suspension the execution contexts of the process address
spaces chosen as its victims, complete eviction of the process address
space, load-time bonuses for process address spaces promoted from that
demoted status, and, of course, fair enough scheduling that starvation
or repetitive demotions of the same tasks (I think demoting the faulting
task runs into this) without forward progress don't occur.

2.4 does not do any of this.

The effect of not suspending the execution contexts of the demoted
process address spaces is that the victimized execution contexts thrash
while trying to reload the memory they need to execute. The effect of
incomplete demotion is essentially livelock under sufficient stress.
Its memory scheduling to what extent it has it is RR and hence fair,
but the various caveats above justify "does not do any of this",
particularly incomplete demotion.

So I predict that a true load control mechanism and policy would be
both an improvement over 2.4 and would correct 2.6 regressions vs. 2.4
on underprovisioned machines. For now, we lack an implementation.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-08 13:52           ` William Lee Irwin III
@ 2003-12-08 14:23             ` Con Kolivas
  2003-12-08 14:30               ` William Lee Irwin III
                                 ` (2 more replies)
  2003-12-08 19:49             ` Roger Luethi
  1 sibling, 3 replies; 63+ messages in thread
From: Con Kolivas @ 2003-12-08 14:23 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 363 bytes --]

[snip original discussion thrashing swap on 2.6test with 32mb ram]

Chris

By an unusual coincidence I was looking into the patches that were supposed to 
speed up application startup and noticed this one was merged. A brief 
discussion with wli suggests this could cause thrashing problems on low 
memory boxes so can you try this patch? Applies to test11.

Con

[-- Attachment #2: patch-backout-readahead --]
[-- Type: text/x-diff, Size: 495 bytes --]

--- linux-2.6.0-test11-base/mm/filemap.c	2003-11-24 22:18:56.000000000 +1100
+++ linux-2.6.0-test11-fremap/mm/filemap.c	2003-12-09 01:17:47.793384425 +1100
@@ -1285,10 +1285,6 @@ static int filemap_populate(struct vm_ar
 	struct page *page;
 	int err;
 
-	if (!nonblock)
-		force_page_cache_readahead(mapping, vma->vm_file,
-					pgoff, len >> PAGE_CACHE_SHIFT);
-
 repeat:
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (pgoff + (len >> PAGE_CACHE_SHIFT) > size)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-08 14:23             ` Con Kolivas
@ 2003-12-08 14:30               ` William Lee Irwin III
  2003-12-09 21:03               ` Chris Vine
  2003-12-13 14:08               ` Chris Vine
  2 siblings, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-08 14:30 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Tue, Dec 09, 2003 at 01:23:31AM +1100, Con Kolivas wrote:
> [snip original discussion thrashing swap on 2.6test with 32mb ram]
> Chris
> By an unusual coincidence I was looking into the patches that were supposed to 
> speed up application startup and noticed this one was merged. A brief 
> discussion with wli suggests this could cause thrashing problems on low 
> memory boxes so can you try this patch? Applies to test11.

This is effectively only called when faulting on paged-out ptes whose
file offsets were disturbed by remap_file_pages() and when calling
remap_file_pages() itself.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-08 13:52           ` William Lee Irwin III
  2003-12-08 14:23             ` Con Kolivas
@ 2003-12-08 19:49             ` Roger Luethi
  2003-12-08 20:48               ` William Lee Irwin III
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2003-12-08 19:49 UTC (permalink / raw)
  To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 5296 bytes --]

I've been looking at this during the past few months. I will sketch
out a few of my findinds below. I can follow up with some details and
actual data if necessary.

On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote:
> Explicit load control is in order. 2.4 appears to work better in these
> instances because it victimizes one process at a time. It vaguely
> resembles load control with a random demotion policy (mmlist order is

Everybody I talked to seemed to assume that 2.4 does better due to the
way mapped pages are freed (i.e. swap_out in 2.4). While it is true
that the new VM as merged in 2.5.27 didn't exactly help with thrashing
performance, the main factors slowing 2.6 down were merged much later.

Have a look at the graph attached to this message to get an idea of
what I am talking about (x axis is kernel releases after 2.5.0, y axis
is time to complete each benchmark).

It is important to note that different work loads show different
thrashing behavior. Some changes in 2.5 improved one thrashing benchmark
and made another worse. However, 2.4 seems to do better than 2.6 across
the board, which suggests that some elements are in fact better for
any types of thrashing.

> Other important aspects of load control beyond the demotion policy are
> explicit suspension the execution contexts of the process address
> spaces chosen as its victims, complete eviction of the process address

I implemented suspension during memory shortage for 2.6 and I had some
code for complete eviction as well. It definitely helped for some
benchmarks. There's one problem, though: Latency. If a machine is
thrashing, a sys admin won't appreciate that her shell is suspended
when she tries to log in to correct the problem. I have some simple
criteria for selecting a process to suspend, but it's hard to get it
right every time (kind of like the OOM killer, although with smaller
damage for bad decisions).

For workstations and most servers latency is so important compared to
throughput that I began to wonder whether implementing suspension was
actually worth it. After benchmarking 2.4 vs 2.6, though, I suspected
that there must be plenty of room for improvement _before_ such drastic
measures are necessary. It makes little sense to add suspension to 2.6
if performance can be improved _without_ hurting latency. That's why
I shelved my work on suspension to find out and document when exactly
performance went down during 2.5.

> 2.4 does not do any of this.
> 
> The effect of not suspending the execution contexts of the demoted
> process address spaces is that the victimized execution contexts thrash
> while trying to reload the memory they need to execute. The effect of
> incomplete demotion is essentially livelock under sufficient stress.
> Its memory scheduling to what extent it has it is RR and hence fair,
> but the various caveats above justify "does not do any of this",
> particularly incomplete demotion.

One thing you can observe with 2.4 is that one process may force another
process out. Say you have several instances of the same program which
all have the same working set size (i.e.  requirements, not RSS) and
a constant rate of memory references in the code. If their current RSS
differ then some take more major faults and spend more time blocked than
others. In a thrashing situation, you can see the small RSSs shrink
to virtually zero, while the largest RSS will grow even further --
the thrashing processes are stealing each other's pages while the one
which hardly ever faults keeps its complete working set in RAM. Bad for
fairness, but can help throughput quite a bit. This effect is harder
to trigger in 2.6.

> So I predict that a true load control mechanism and policy would be
> both an improvement over 2.4 and would correct 2.6 regressions vs. 2.4
> on underprovisioned machines. For now, we lack an implementation.

I doubt that you can get performance anywhere near 2.4 just by adding
load control to 2.6 unless you measure throughput and nothing else --
otherwise latency will kill you. I am convinced the key is not in
_adding_ stuff, but _fixing_ what we have.

IMO the question is: How much do we care? Machines with tight memory are
not necessarily very concerned about paging (e.g. PDAs), and serious
servers rarely operate under such conditions: Admins tend to add RAM
when the paging load is significant.

If you don't care _that_ much about thrashing in Linux, just tell
people to buy more RAM. Computers are cheap, RAM even more so, 64 bit
becomes affordable, and heavy paging sucks no matter how good a paging
mechanism is.

If you care enough to spend resources to address the problem, look at
the major regressions in 2.5 and find out where they were a consequence
of a deliberate trade-off decision and where it was an oversight which
can be fixed or mitigated without sacrificing what was gained through
the respective changes in 2.5. Obviously, performing regular testing
with thrashing benchmarks would make lasting major regressions like
those in the 2.5 development series much less likely in the future.

Additional load control mechanisms create new problems (latency,
increased complexity), so I think they should be a last resort, not
some method to paper over deficiencies elsewhere in the kernel.

Roger

[-- Attachment #2: plot.png --]
[-- Type: image/png, Size: 10196 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-08 19:49             ` Roger Luethi
@ 2003-12-08 20:48               ` William Lee Irwin III
  2003-12-09  0:27                 ` Roger Luethi
  2003-12-10 21:52                 ` Andrea Arcangeli
  0 siblings, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-08 20:48 UTC (permalink / raw)
  To: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote:
>> Explicit load control is in order. 2.4 appears to work better in these
>> instances because it victimizes one process at a time. It vaguely
>> resembles load control with a random demotion policy (mmlist order is

On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> Everybody I talked to seemed to assume that 2.4 does better due to the
> way mapped pages are freed (i.e. swap_out in 2.4). While it is true
> that the new VM as merged in 2.5.27 didn't exactly help with thrashing
> performance, the main factors slowing 2.6 down were merged much later.

What kinds of factors are these? How did you find these factors? When
were these factors introduced?


On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> Have a look at the graph attached to this message to get an idea of
> what I am talking about (x axis is kernel releases after 2.5.0, y axis
> is time to complete each benchmark).

On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> It is important to note that different work loads show different
> thrashing behavior. Some changes in 2.5 improved one thrashing benchmark
> and made another worse. However, 2.4 seems to do better than 2.6 across
> the board, which suggests that some elements are in fact better for
> any types of thrashing.

qsbench I'd pretty much ignore except as a control case, since there's
nothing to do with a single process but let it thrash.


On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote:
>> Other important aspects of load control beyond the demotion policy are
>> explicit suspension the execution contexts of the process address
>> spaces chosen as its victims, complete eviction of the process address

On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> I implemented suspension during memory shortage for 2.6 and I had some
> code for complete eviction as well. It definitely helped for some
> benchmarks. There's one problem, though: Latency. If a machine is
> thrashing, a sys admin won't appreciate that her shell is suspended
> when she tries to log in to correct the problem. I have some simple
> criteria for selecting a process to suspend, but it's hard to get it
> right every time (kind of like the OOM killer, although with smaller
> damage for bad decisions).

I'd be interested in seeing the specific criteria used, since the
policy can strongly influence performance. Some of the most obvious
policies do worse than random.


On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> For workstations and most servers latency is so important compared to
> throughput that I began to wonder whether implementing suspension was
> actually worth it. After benchmarking 2.4 vs 2.6, though, I suspected
> that there must be plenty of room for improvement _before_ such drastic
> measures are necessary. It makes little sense to add suspension to 2.6
> if performance can be improved _without_ hurting latency. That's why
> I shelved my work on suspension to find out and document when exactly
> performance went down during 2.5.

Ideally, the targets for suspension and complete eviction would be
background tasks that aren't going to demand memory in the near future.
Unfortunately that algorithm appears to require an oracle to implement.
Also, the best criteria as I know of them are somewhat counterintuitive,
so I'd like to be sure they were tried.


On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote:
>> 2.4 does not do any of this.
>> The effect of not suspending the execution contexts of the demoted
>> process address spaces is that the victimized execution contexts thrash
>> while trying to reload the memory they need to execute. The effect of
>> incomplete demotion is essentially livelock under sufficient stress.
>> Its memory scheduling to what extent it has it is RR and hence fair,
>> but the various caveats above justify "does not do any of this",
>> particularly incomplete demotion.

On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> One thing you can observe with 2.4 is that one process may force another
> process out. Say you have several instances of the same program which
> all have the same working set size (i.e.  requirements, not RSS) and
> a constant rate of memory references in the code. If their current RSS
> differ then some take more major faults and spend more time blocked than
> others. In a thrashing situation, you can see the small RSSs shrink
> to virtually zero, while the largest RSS will grow even further --
> the thrashing processes are stealing each other's pages while the one
> which hardly ever faults keeps its complete working set in RAM. Bad for
> fairness, but can help throughput quite a bit. This effect is harder
> to trigger in 2.6.

There was a study conducted by someone involved with CKRM (included in
some joint paper with the rest of the team) that actually charted out
this property of 2.6 in terms of either faults taken over time or RSS
over time, but compared it to a modified page replacement policy that
actually had it to a greater degree than stock 2.6 instead of 2.4.


On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote:
>> So I predict that a true load control mechanism and policy would be
>> both an improvement over 2.4 and would correct 2.6 regressions vs. 2.4
>> on underprovisioned machines. For now, we lack an implementation.

On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> I doubt that you can get performance anywhere near 2.4 just by adding
> load control to 2.6 unless you measure throughput and nothing else --
> otherwise latency will kill you. I am convinced the key is not in
> _adding_ stuff, but _fixing_ what we have.

A small problem with that kind of argument is that it's assuming the
existence of some accumulation of small regressions that haven't proven
to exist (or have they?), where the kind of a priori argument I've made
only needs to rely on the properties of the algorithms. But neither can
actually provide a guarantee of results without testing. I suppose one
point in favor of my "grab this tool off the shelf" approach is that
there is quite a bit of history behind the methods and that they are
well-understood.


On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> IMO the question is: How much do we care? Machines with tight memory are
> not necessarily very concerned about paging (e.g. PDAs), and serious
> servers rarely operate under such conditions: Admins tend to add RAM
> when the paging load is significant.

The question is not if we care, but if we care about others. Economies
aren't as kind to all users as they are to us


On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> If you don't care _that_ much about thrashing in Linux, just tell
> people to buy more RAM. Computers are cheap, RAM even more so, 64 bit
> becomes affordable, and heavy paging sucks no matter how good a paging
> mechanism is.

If I took this kind of argument seriously I'd be telling people to go
shopping for new devices every time they run into a driver problem. I'm
actually rather annoyed with hearing this line of reasoning repeated to
many so many times over, and I'd appreciate not hearing it ever again
(offenders, you know who you are).

The issue at hand is improving how the kernel behaves on specific
hardware configurations; the fact other hardware configurations exist
is irrelevant.


On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> If you care enough to spend resources to address the problem, look at
> the major regressions in 2.5 and find out where they were a consequence
> of a deliberate trade-off decision and where it was an oversight which
> can be fixed or mitigated without sacrificing what was gained through
> the respective changes in 2.5. Obviously, performing regular testing
> with thrashing benchmarks would make lasting major regressions like
> those in the 2.5 development series much less likely in the future.

Yes, this does need to be done more regularly. c.f. the min_free_kb
tuning problem Matt Mackall and I identified.


On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> Additional load control mechanisms create new problems (latency,
> increased complexity), so I think they should be a last resort, not
> some method to paper over deficiencies elsewhere in the kernel.

Who could disagree with this without looking ridiculous?

Methods of last resort are not necessarily unavoidable; the OOM killer
is an example of one that isn't avoidable. The issue is less clear cut
here, since the effect is limited to degraded performance on a limited
range of machines. But I would prefer not to send an "FOAD" message to
the users of older hardware or users who can't afford fast hardware.

The assumption methods of last resort create more problems than they
solve appears to be based on the notion that they'll be used for more
than methods of last resort. They're meant to handle the specific cases
where they are beneficial, not infect the common case with behavior
that's only appropriate for underpowered machines or other bogosity.
That is, it should teach the kernel how to behave in the new situation
where we want it to behave well, not change its behavior where it
already behaves well.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-08 20:48               ` William Lee Irwin III
@ 2003-12-09  0:27                 ` Roger Luethi
  2003-12-09  4:05                   ` William Lee Irwin III
  2003-12-10 21:52                 ` Andrea Arcangeli
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2003-12-09  0:27 UTC (permalink / raw)
  To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
> On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> > Everybody I talked to seemed to assume that 2.4 does better due to the
> > way mapped pages are freed (i.e. swap_out in 2.4). While it is true
> > that the new VM as merged in 2.5.27 didn't exactly help with thrashing
> > performance, the main factors slowing 2.6 down were merged much later.
> 
> What kinds of factors are these? How did you find these factors? When
> were these factors introduced?

The relevant changes are all over the place, factors other than the
pageout mechanism affect thrashing. I haven't identified all of them,
though. I work on it occasionally.

When I realized what had happened in 2.5 (took a while), I went for a
tedious, systematic approach. It started with benchmarks: 3 benchmarks
x some 85 kernels x 10 runs each. The graph you saw in my previous
message represents a few hundred hours worth of benchmarking (required
because variance in thrashing benchmarks is pretty bad). The real stuff
is quite detailed but too large to post on the list.

I scanned the resulting data for significant performance changes. For
some of them, I used the Changelog and -- if necessary -- a binary
search to nail down the patch set that caused the regression.

The next step would be to find out whether the regression was "necessary"
or not. Problem is, ten or twenty kernel releases later, you can't easily
revert a patch and it's not always obvious which regression was fixed by
the occasional performance improvement in a graph.

So what it boils down to quite often is this: Figure out what the
patch intended to do, find out if it's still slowing down recent test
kernels, then try to achieve the same without causing the regression in
2.6.0-test11. I didn't have much time to spend on this so far, and the
original patch authors would be much more qualified to do this anyway.

> qsbench I'd pretty much ignore except as a control case, since there's
> nothing to do with a single process but let it thrash.

I like to keep qsbench around for a number of reasons: It's the benchmark
where 2.6 looks best (i.e. less bad). I can't rule out that somewhere
somebody has a real work load of that type. And it is an interesting
contrast to the real world compile benchmarks I care about.

> > right every time (kind of like the OOM killer, although with smaller
> > damage for bad decisions).
> 
> I'd be interested in seeing the specific criteria used, since the
> policy can strongly influence performance. Some of the most obvious
> policies do worse than random.

Define "performance". My goal was to improve both responsiveness and
throughput of the system under extreme memory pressure. That also
meant that I wasn't interested in better throughput if latency got
even worse.

I used a modified version of badness in oom_kill. I didn't put too
much effort into it, but I could explain the reasoning behind the
changes. I had a bunch of batch processes thrashing and I wanted to see
them selected and not the sshd or the login shell. It worked reasonably
well for me.

/*
 * Resident memory size of the process is the basis for the badness.
 */
points = p->mm->rss;

/*
 * CPU time is in seconds and run time is in minutes. There is no
 * particular reason for this other than that it turned out to work
 * very well in practice.
 */
cpu_time = (p->utime + p->stime) >> (SHIFT_HZ + 3);
run_time = (get_jiffies_64() - p->start_time) >> (SHIFT_HZ + 10);

points *= int_sqrt(cpu_time);
points *= int_sqrt(int_sqrt(run_time));

/*
 * Niced processes are most likely less important.
 */
if (task_nice(p) > 0)
        points *= 4;

/*
 * Keep interactive processes around.
 */
if (task_interactive(p))
        points /= 4;

/*
 * Superuser processes are usually more important, so we make it
 * less likely that we kill those.
 */
if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
                        p->uid == 0 || p->euid == 0)
        points /= 2;

/*
 * We don't want to kill a process with direct hardware access.
 * Not only could that mess up the hardware, but usually users
 * tend to only have this flag set on applications they think
 * of as important.
 */
if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
        points /= 2;

> Ideally, the targets for suspension and complete eviction would be
> background tasks that aren't going to demand memory in the near future.

You lost me there. If I knew of a background task that it was about to
demand more memory in the near future when memory is very tight anyway,
that would be the first process I'd suspend and evict before it gets
a chance to make matters worse.

So what are some potential criteria?

- process owner: sshd runs often as root. You don't want to stun that.
  OTOH, a sys admin will usually log in as a normal user before su'ing
  to root. So stunning non-root processes isn't a clear winner, either.

- process size: I favored stunning processes with large RSS because
  for my scenario that described the culprits quite well and left the
  interactive stuff alone.

- interactivity: Avoiding to stun tasks the scheduler considers
  interactive was a no-brainer.

- nice value: A niced process tends to be a batch process. Stun.

- time: OOM kill doesn't want to take down long running processes
  because of the work that is lost. For stunning, I don't care.
  In fact, they are probably batch processes, so stun them.

- fault frequency, I/O requests: When the paging disk is the bottleneck,
  it might be sensible to stun a process that produces lots of faults
  or does a lot of disk I/O. If there is an easy way to get that data
  then I missed it.

There are certainly more, but that's what I can think of off the top
of my head. I did note your reference to Carr's thesis (which I'm not
familiar with), but like most papers I've seen on the subject it seems
to focus on throughput. That's special-casing for batch processing or
transaction systems, however; on a general-purpose computer, throughput
means nothing if latency goes down the tube.

> Unfortunately that algorithm appears to require an oracle to implement.

Ah, we've all seen these optimal solutions for classic CS problems
where the only gotcha is that you need omniscience.

> Also, the best criteria as I know of them are somewhat counterintuitive,
> so I'd like to be sure they were tried.

Again, best for what?

> On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> > I doubt that you can get performance anywhere near 2.4 just by adding
> > load control to 2.6 unless you measure throughput and nothing else --
> > otherwise latency will kill you. I am convinced the key is not in
> > _adding_ stuff, but _fixing_ what we have.
> 
> A small problem with that kind of argument is that it's assuming the
> existence of some accumulation of small regressions that haven't proven
> to exist (or have they?), where the kind of a priori argument I've made

Heh. Did you look at the graph in my previous message? Yes, there are
several, independant regressions. What we don't know is which ones were
unavoidable. For instance, the regression in 2.5.27 is quite possibly a
necessary consequence of the new pageout mechanism and the benefits in
upward scalability may well outweigh the costs for the low-end user.

If we accept the notion that we don't care about what we can't measure
(remember the interactivity debates?) and since nobody tested regularly
for thrashing behavior, it seems quite likely that at least some of
the regressions can be fixed, maybe at a slight cost in performance
elsewhere, maybe not even that.

There should be plenty of room for improvement: We are not talking 10%
or 20%, but factors of 3 and more.

> actually provide a guarantee of results without testing. I suppose one
> point in favor of my "grab this tool off the shelf" approach is that
> there is quite a bit of history behind the methods and that they are
> well-understood.

I know I sound like a broken record, but I have one problem with
the off-the-shelf solutions I've found so far: They try to maximize
throughput. They don't care about latency. I do.

> On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> > IMO the question is: How much do we care? Machines with tight memory are
> > not necessarily very concerned about paging (e.g. PDAs), and serious
> > servers rarely operate under such conditions: Admins tend to add RAM
> > when the paging load is significant.
> 
> The question is not if we care, but if we care about others. Economies
> aren't as kind to all users as they are to us

Right. But kernel hackers tend to work for companies that don't make
their money by helping those who don't have any. And before you call
me a cynic, look at the resources that go into making Linux capable of
running on the top 0.something percent of machines and compare that to
the interest with which this and similar threads have been received. I
made my observation based on experience, not personal preference.

That said, it is a fact that thrashing is not the hot issue it was
35 years ago, although the hardware (growing access gap RAM/disk)
and usage patterns (latency matters a lot more, load is unpredictable
and exogenous for the kernel) should have made the problem worse. The
classic solutions are pretty much unworkable today and in most cases
there is one economic solution which is indeed to throw more RAM at it.

> > If you don't care _that_ much about thrashing in Linux, just tell
> > people to buy more RAM. Computers are cheap, RAM even more so, 64 bit
> > becomes affordable, and heavy paging sucks no matter how good a paging
> > mechanism is.
> 
> If I took this kind of argument seriously I'd be telling people to go
> shopping for new devices every time they run into a driver problem. I'm

No. Bad example. For starters, new devices are more likely to have driver
problems, so your advice would be dubious even if they had the money :-P.

The argument I hear for the regressions is that 2.6 is more scalable
on high-end machines now and we just made a trade-off. It has happened
before. Linux 1.x didn't have the hardware requirements of 2.4.

The point I was trying to make with regard to thrashing was that
I suspect it was written off as an inevitable trade-off too early.
I believe that some of the regressions can be fixed without losing the
gains in upward scalability _if_ we find the resources to do it.

Quite frankly, playing with the suspension code was a lot more fun than
investigating regressions in other people's work. But I hated the idea
that Linux fails so miserably now where it used to do so well. At the
very least I wanted to be sure that it was forced collateral damage
and not just an oversight or bad tuning. Clearly, I do care.

> The issue at hand is improving how the kernel behaves on specific
> hardware configurations; the fact other hardware configurations exist
> is irrelevant.

Why do you make me remind you that we live in a world with resource
constraints? What _is_ relevant is where the resources to do the work
come from, which is a non-trivial problem if the work is to benefit
people who don't have the money to buy more RAM. Just saying that it's
unacceptable to screw over those with low-end hardware won't help anybody
:-). If you are volunteering to help out, though, more power to you.

> > the respective changes in 2.5. Obviously, performing regular testing
> > with thrashing benchmarks would make lasting major regressions like
> > those in the 2.5 development series much less likely in the future.
> 
> Yes, this does need to be done more regularly. c.f. the min_free_kb
> tuning problem Matt Mackall and I identified.

Well, tuning problems always make me want to try genetic algorithms.
Regression testing would be much easier. Just run all benchmarks for
every new kernel. Update chart. Done. ... It's scriptable even.

> On Mon, Dec 08, 2003 at 08:49:30PM +0100, Roger Luethi wrote:
> > Additional load control mechanisms create new problems (latency,
> > increased complexity), so I think they should be a last resort, not
> > some method to paper over deficiencies elsewhere in the kernel.
> 
> Who could disagree with this without looking ridiculous?

Heh. It was carefully worded that way <g>. Seriously, though, it's not
as ridiculous as it may seem. The problems we need to address are not
even on the map for the classic papers I have seen on the subject. They
suggest working sets or some sort of abstract load control, but 2.6
has problems that are very specific to that kernel and its mechanisms.
There's no elegant, proven standard algorithm to solve those problems
for us.

> Methods of last resort are not necessarily unavoidable; the OOM killer
> is an example of one that isn't avoidable. The issue is less clear cut

That's debatable. Funny that you should take that example.

> range of machines. But I would prefer not to send an "FOAD" message to
> the users of older hardware or users who can't afford fast hardware.

Agreed.

> The assumption methods of last resort create more problems than they
> solve appears to be based on the notion that they'll be used for more
> than methods of last resort. They're meant to handle the specific cases

No. My beef with load control is that once it's there people will say
"See? Performance is back!" and whatever incentive there was to fix the
real problems is gone. Which I could accept if it wasn't for the fact
that a load control solution is always inferior to other improvements
because of the massive latency increase.

> where they are beneficial, not infect the common case with behavior
> that's only appropriate for underpowered machines or other bogosity.
> That is, it should teach the kernel how to behave in the new situation
> where we want it to behave well, not change its behavior where it
> already behaves well.

Alright. One more thing: Thrashing is not a clear cut system state. You
don't want to change behavior when it was doing well, so you need to
be cautious about your trigger. Which means it will often not fire
for border cases, that is light thrashing. I didn't do a survey, but
I suspect that light thrashing (where there's just not quite enough
memory) is much more common than the heavy variant. Now guess what?
The 2.6 performance for light thrashing is absolutely abysmal. In fact
2.6 will happily spend a lot of time in I/O wait in a situation where
2.4 will cruise through the task without a hitch.

I'm all for adding load control to deal with heavy thrashing that can't
be handled any other way. But I am firmly opposed to pretending that
it is a solution for the common case.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-09  0:27                 ` Roger Luethi
@ 2003-12-09  4:05                   ` William Lee Irwin III
  2003-12-09 15:11                     ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-09  4:05 UTC (permalink / raw)
  To: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> What kinds of factors are these? How did you find these factors? When
>> were these factors introduced?

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> The relevant changes are all over the place, factors other than the
> pageout mechanism affect thrashing. I haven't identified all of them,
> though. I work on it occasionally.
> When I realized what had happened in 2.5 (took a while), I went for a
> tedious, systematic approach. It started with benchmarks: 3 benchmarks
> x some 85 kernels x 10 runs each. The graph you saw in my previous
> message represents a few hundred hours worth of benchmarking (required
> because variance in thrashing benchmarks is pretty bad). The real stuff
> is quite detailed but too large to post on the list.

Okay, I'm interested in getting my hands on this however you can get it
to me.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> qsbench I'd pretty much ignore except as a control case, since there's
>> nothing to do with a single process but let it thrash.

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> I like to keep qsbench around for a number of reasons: It's the benchmark
> where 2.6 looks best (i.e. less bad). I can't rule out that somewhere
> somebody has a real work load of that type. And it is an interesting
> contrast to the real world compile benchmarks I care about.

I won't debate that; however, as far as load control goes, there's
nothing to do.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> I'd be interested in seeing the specific criteria used, since the
>> policy can strongly influence performance. Some of the most obvious
>> policies do worse than random.

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> Define "performance". My goal was to improve both responsiveness and
> throughput of the system under extreme memory pressure. That also
> meant that I wasn't interested in better throughput if latency got
> even worse.

It was defined in two different ways: cpu utilization (inverse of iowait)
and multiprogramming level (how many tasks it could avoid suspending).


On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> So what are some potential criteria?
> - process owner: sshd runs often as root. You don't want to stun that.
>   OTOH, a sys admin will usually log in as a normal user before su'ing
>   to root. So stunning non-root processes isn't a clear winner, either.

That's a method I haven't even heard of. I'd be wary of this; there are
a lot of daemons that might as well be swapped as they almost never run
and aren't latency-sensitive when they do.


On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> - process size: I favored stunning processes with large RSS because
>   for my scenario that described the culprits quite well and left the
>   interactive stuff alone.

Demoting the largest task is one that does worse than random.


On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> - interactivity: Avoiding to stun tasks the scheduler considers
>   interactive was a no-brainer.

An odd result was that since the ancient kernels didn't have threads,
their mm's had unique timeslices etc. Largest remaining quantum is one
of the three equivalent "empirically best" policies.


On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> - nice value: A niced process tends to be a batch process. Stun.
> - time: OOM kill doesn't want to take down long running processes
>   because of the work that is lost. For stunning, I don't care.
>   In fact, they are probably batch processes, so stun them.

These sound unusual, but innocuous.


On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> - fault frequency, I/O requests: When the paging disk is the bottleneck,
>   it might be sensible to stun a process that produces lots of faults
>   or does a lot of disk I/O. If there is an easy way to get that data
>   then I missed it.

Like the PFF bits for WS?


On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> There are certainly more, but that's what I can think of off the top
> of my head. I did note your reference to Carr's thesis (which I'm not
> familiar with), but like most papers I've seen on the subject it seems
> to focus on throughput. That's special-casing for batch processing or
> transaction systems, however; on a general-purpose computer, throughput
> means nothing if latency goes down the tube.

Multiprogramming level and cpu utilization seemed to be more oriented
toward concurrency than either throughput or latency. What exactly that's
worth I'm not entirely sure.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> Also, the best criteria as I know of them are somewhat counterintuitive,
>> so I'd like to be sure they were tried.

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> Again, best for what?

Cpu utilization, a.k.a. minimizing iowait.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> A small problem with that kind of argument is that it's assuming the
>> existence of some accumulation of small regressions that haven't proven
>> to exist (or have they?), where the kind of a priori argument I've made

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> Heh. Did you look at the graph in my previous message? Yes, there are
> several, independant regressions. What we don't know is which ones were
> unavoidable. For instance, the regression in 2.5.27 is quite possibly a
> necessary consequence of the new pageout mechanism and the benefits in
> upward scalability may well outweigh the costs for the low-end user.

I don't see other kernels to compare it to. I guess you can use earlier
versions of itself.


On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> If we accept the notion that we don't care about what we can't measure
> (remember the interactivity debates?) and since nobody tested regularly
> for thrashing behavior, it seems quite likely that at least some of
> the regressions can be fixed, maybe at a slight cost in performance
> elsewhere, maybe not even that.
> There should be plenty of room for improvement: We are not talking 10%
> or 20%, but factors of 3 and more.

They should probably get cleaned up. Where are your benchmarks?


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>>                          ...                             I suppose one
>> point in favor of my "grab this tool off the shelf" approach is that
>> there is quite a bit of history behind the methods and that they are
>> well-understood.

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> I know I sound like a broken record, but I have one problem with
> the off-the-shelf solutions I've found so far: They try to maximize
> throughput. They don't care about latency. I do.

I have a vague notion we're thinking of different cases. What kinds of
overcommitment levels are you thinking of?


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> The question is not if we care, but if we care about others. Economies
>> aren't as kind to all users as they are to us

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> Right. But kernel hackers tend to work for companies that don't make
> their money by helping those who don't have any. And before you call
> me a cynic, look at the resources that go into making Linux capable of
> running on the top 0.something percent of machines and compare that to
> the interest with which this and similar threads have been received. I
> made my observation based on experience, not personal preference.
> That said, it is a fact that thrashing is not the hot issue it was
> 35 years ago, although the hardware (growing access gap RAM/disk)
> and usage patterns (latency matters a lot more, load is unpredictable
> and exogenous for the kernel) should have made the problem worse. The
> classic solutions are pretty much unworkable today and in most cases
> there is one economic solution which is indeed to throw more RAM at it.

Top 0.001%? For expanded range, both endpoints matter. My notion of
scalability is running like greased lightning (compared to other OS's)
on everything from some ancient toaster with a 0.1MHz cpu and 256KB RAM
to a 16384x/16PB superdupercomputer.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> If I took this kind of argument seriously I'd be telling people to go
>> shopping for new devices every time they run into a driver problem. I'm

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> No. Bad example. For starters, new devices are more likely to have driver
> problems, so your advice would be dubious even if they had the money :-P.
> The argument I hear for the regressions is that 2.6 is more scalable
> on high-end machines now and we just made a trade-off. It has happened
> before. Linux 1.x didn't have the hardware requirements of 2.4.

No! No! NO!!!

(a) buying a G3 will not make my sun3 boot Linux
(b) buying an SS1 will not make my Decstation 5000/200's PMAD-AA
	driver work
(c) buying another multia will not make my multia stop deadlocking
	while swapping over NFS

No matter how many pieces of hardware you buy, the original one still
isn't driven correctly by the software. Hardware *NEVER* fixes software.


On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> The point I was trying to make with regard to thrashing was that
> I suspect it was written off as an inevitable trade-off too early.
> I believe that some of the regressions can be fixed without losing the
> gains in upward scalability _if_ we find the resources to do it.
> Quite frankly, playing with the suspension code was a lot more fun than
> investigating regressions in other people's work. But I hated the idea
> that Linux fails so miserably now where it used to do so well. At the
> very least I wanted to be sure that it was forced collateral damage
> and not just an oversight or bad tuning. Clearly, I do care.

They should get cleaned up; start sending data over.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> The issue at hand is improving how the kernel behaves on specific
>> hardware configurations; the fact other hardware configurations exist
>> is irrelevant.

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> Why do you make me remind you that we live in a world with resource
> constraints? What _is_ relevant is where the resources to do the work
> come from, which is a non-trivial problem if the work is to benefit
> people who don't have the money to buy more RAM. Just saying that it's
> unacceptable to screw over those with low-end hardware won't help anybody
> :-). If you are volunteering to help out, though, more power to you.

It's unlikely I'll have any trouble coughing up code.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> Yes, this does need to be done more regularly. c.f. the min_free_kb
>> tuning problem Matt Mackall and I identified.

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> Well, tuning problems always make me want to try genetic algorithms.
> Regression testing would be much easier. Just run all benchmarks for
> every new kernel. Update chart. Done. ... It's scriptable even.

This was a bit easier than that; the boot-time default was 1MB regardless
of the size of RAM; akpm picked some scaling algorithm out of a hat and
it pretty much got solved as it shrank with memory.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> Who could disagree with this without looking ridiculous?

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> Heh. It was carefully worded that way <g>. Seriously, though, it's not
> as ridiculous as it may seem. The problems we need to address are not
> even on the map for the classic papers I have seen on the subject. They
> suggest working sets or some sort of abstract load control, but 2.6
> has problems that are very specific to that kernel and its mechanisms.
> There's no elegant, proven standard algorithm to solve those problems
> for us.

WS had its own load control bundled with it. AIUI most replacement
algorithms need a load control tailored to them (some seem to do okay
with an arbitrary choice), so there's some synthesis involved.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> Methods of last resort are not necessarily unavoidable; the OOM killer
>> is an example of one that isn't avoidable. The issue is less clear cut

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> That's debatable. Funny that you should take that example.

Not really. e.g. -aa's OOM killer just uses a trivial policy that shoots
the requesting task. Eliminating it entirely is theoretically possible
with ridiculous amounts of accounting, but I'm relatively certain it's
infeasible to implement.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> The assumption methods of last resort create more problems than they
>> solve appears to be based on the notion that they'll be used for more
>> than methods of last resort. They're meant to handle the specific cases

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> No. My beef with load control is that once it's there people will say
> "See? Performance is back!" and whatever incentive there was to fix the
> real problems is gone. Which I could accept if it wasn't for the fact
> that a load control solution is always inferior to other improvements
> because of the massive latency increase.

I think we have vastly different levels of overcommitment in mind.


On Mon, 08 Dec 2003 12:48:17 -0800, William Lee Irwin III wrote:
>> where they are beneficial, not infect the common case with behavior
>> that's only appropriate for underpowered machines or other bogosity.
>> That is, it should teach the kernel how to behave in the new situation
>> where we want it to behave well, not change its behavior where it
>> already behaves well.

On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> Alright. One more thing: Thrashing is not a clear cut system state. You
> don't want to change behavior when it was doing well, so you need to
> be cautious about your trigger. Which means it will often not fire
> for border cases, that is light thrashing. I didn't do a survey, but
> I suspect that light thrashing (where there's just not quite enough
> memory) is much more common than the heavy variant. Now guess what?
> The 2.6 performance for light thrashing is absolutely abysmal. In fact
> 2.6 will happily spend a lot of time in I/O wait in a situation where
> 2.4 will cruise through the task without a hitch.
> I'm all for adding load control to deal with heavy thrashing that can't
> be handled any other way. But I am firmly opposed to pretending that
> it is a solution for the common case.

The common case is pretty much zero or slim overcommitment these days.
The case I have in mind is pretty much 10x RAM committed. (Sum of WSS's,
not non-overcommit-related.)


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-09  4:05                   ` William Lee Irwin III
@ 2003-12-09 15:11                     ` Roger Luethi
  2003-12-09 16:04                       ` Rik van Riel
                                         ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-09 15:11 UTC (permalink / raw)
  To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote:
> > because variance in thrashing benchmarks is pretty bad). The real stuff
> > is quite detailed but too large to post on the list.
> 
> Okay, I'm interested in getting my hands on this however you can get it
> to me.

http://hellgate.ch/bench/thrash.tar.gz

The tar contains one postscript plot file and three data files:
{kbuild,efax,qsbench}.dat. Numbers are execution times in seconds. The
first column is the average of each row, and the values of each row
are sorted in ascending order. Other than that, it's the raw data.

The kernels were not tested in order, so it's definitely not the hardware
that's been deteriorating over time.

I repeated some tests that looked like mistakes: In 2.5.32, for instance,
it seems odd that both kbuild and qsbench are slower but efax isn't. I
believe the data is accurate, but I can do reruns upon request.

A fourth file, plot.ps, contains the graphs I use right now: You can
see how both average execution time and variance have grown from 2.5.0
and 2.6.0-test11. The graph is precise enough to determine the kernel
release that caused a regression.

The more fine-grained work is not complete and I'm not sure it ever
will be. Some _preliminary_ results (i.e. take with a grain of salt):

The regression for kbuild in 2.5.48 was caused by a patch titled "better
inode reclaim balancing". In 2.5.49, "strengthen the `incremental
min' logic in the page". In 2.6.0-test3 (aka 2.6.78), it's a subtle
interaction between "fix kswapd throttling" and "decaying average of
zone pressure" -- IIRC reverting the former gains nothing unless you
also revert the latter. I'd have to dig through my notes.

> On Tue, Dec 09, 2003 at 01:27:45AM +0100, Roger Luethi wrote:
> > Define "performance". My goal was to improve both responsiveness and
> > throughput of the system under extreme memory pressure. That also
> > meant that I wasn't interested in better throughput if latency got
> > even worse.
> 
> It was defined in two different ways: cpu utilization (inverse of iowait)
> and multiprogramming level (how many tasks it could avoid suspending).

Yeah, that's the classic. It _is_ throughput. Unless you have task
priorities (i.e. eye candy or SETI@home competing for cycles), CPU
utilization is an excellent approximation for throughput. And the
benefit of maintaining a high level of multiprogramming is that you
have a better chance to have a runnable process at any time, meaning
better CPU utilization meaning higher throughput.

The classic strategies based on these criteria work for transaction and
batch systems. They are all but useless, though, for a workstation and
even most modern servers, due to assumptions that are incorrect today
(remember all the degrees of freedom a scheduler had 30 years ago)
and additional factors that only became crucial in the past few decades
(latency again).

> > - process size: I favored stunning processes with large RSS because
> >   for my scenario that described the culprits quite well and left the
> >   interactive stuff alone.
> 
> Demoting the largest task is one that does worse than random.

We only know that to be true for irrelevant optimization criteria.

> > - fault frequency, I/O requests: When the paging disk is the bottleneck,
> >   it might be sensible to stun a process that produces lots of faults
> >   or does a lot of disk I/O. If there is an easy way to get that data
> >   then I missed it.
> 
> Like the PFF bits for WS?

Yup. PFF doesn't cover all disk I/O, though. Suspending a process that
is I/O bound even with a low PFF improves thrashing performance as well,
because disk I/O is the bottleneck.

> > Heh. Did you look at the graph in my previous message? Yes, there are
> > several, independant regressions. What we don't know is which ones were
> > unavoidable. For instance, the regression in 2.5.27 is quite possibly a
> > necessary consequence of the new pageout mechanism and the benefits in
> > upward scalability may well outweigh the costs for the low-end user.
> 
> I don't see other kernels to compare it to. I guess you can use earlier
> versions of itself.

You don't need anything to compare it to. You can investigate the
performance regression and determine whether it was a logical consequence
of the intended change in behavior.

Suppose you found that the problem in 2.5.27 is that shared pages are
unmapped too quickly -- that would be easy to fix without affecting
the benefits of the new VM. I think the more likely candidates for
improvements are later in 2.5, though.

> Top 0.001%? For expanded range, both endpoints matter. My notion of
> scalability is running like greased lightning (compared to other OS's)
> on everything from some ancient toaster with a 0.1MHz cpu and 256KB RAM
> to a 16384x/16PB superdupercomputer.

Well, that's nice. I agree. IIRC, though, each major release had more
demanding minimum requirements (in terms of RAM). The range covered
has been growing only because upward scalability grew faster. I can't
help but notice that some of your statements sound a lot like wishful
thinking.

> > No. Bad example. For starters, new devices are more likely to have driver
> > problems, so your advice would be dubious even if they had the money :-P.
> > The argument I hear for the regressions is that 2.6 is more scalable
> > on high-end machines now and we just made a trade-off. It has happened
> > before. Linux 1.x didn't have the hardware requirements of 2.4.
> 
> No! No! NO!!!
> 
> (a) buying a G3 will not make my sun3 boot Linux
> (b) buying an SS1 will not make my Decstation 5000/200's PMAD-AA
> 	driver work
> (c) buying another multia will not make my multia stop deadlocking
> 	while swapping over NFS
> 
> No matter how many pieces of hardware you buy, the original one still
> isn't driven correctly by the software. Hardware *NEVER* fixes software.

Look, I became the maintainer of via-rhine because nobody else wanted to
fix the driver for a very common, but barely documented piece of
cheap hardware. People were just told to buy another cheap card. That's
the reality of Linux.

Don't forget what we are talking about, though. Once you are seriously
tight on memory, you can only mitigate the damage in software, the only
solution is to add more RAM. Thrashing is not a bug like a broken driver.
I am currently writing a paper on the subject, and the gist of it will
likely be that we should try to prevent thrashing from happening as
long as possible (with good page replacement, I/O scheduling, etc.),
but when it's inevitable we're pretty much done for. Load control may
or may not be worth adding, but it only helps in some special cases
and does not seem clearly beneficial in general-purpose systems.

> This was a bit easier than that; the boot-time default was 1MB regardless
> of the size of RAM; akpm picked some scaling algorithm out of a hat and
> it pretty much got solved as it shrank with memory.

With all due respect for akpm's hat, sometimes I wish we had some good
heuristics for this stuff.

> > 2.6 will happily spend a lot of time in I/O wait in a situation where
> > 2.4 will cruise through the task without a hitch.
> >
> > I'm all for adding load control to deal with heavy thrashing that can't
> > be handled any other way. But I am firmly opposed to pretending that
> > it is a solution for the common case.
> 
> The common case is pretty much zero or slim overcommitment these days.
> The case I have in mind is pretty much 10x RAM committed. (Sum of WSS's,
> not non-overcommit-related.)

So you want to help people who for some reason _have_ to run several
_batch_ jobs _concurrently_ (otherwise load control is ineffective)
on a low-end machine to result in a 10x overcommit system? Why don't
we buy those two or three guys a DIMM each?

I'm afraid you have a solution in search of a problem. Nobody runs a
10x overcommit system. And if they did, they would find it doesn't work
well with 2.4, either, so no one will complain about a regression. What
does happen, though, is that people go close to the limit of what
their low-end hardware supports, which will work perfectly with 2.4
and collapse with 2.6.

The real problem, the one many people will hit, the one the very
complaint that started this thread was about, is light and medium
overcommit. And load control is not the answer to that.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-09 15:11                     ` Roger Luethi
@ 2003-12-09 16:04                       ` Rik van Riel
  2003-12-09 16:31                         ` Roger Luethi
  2003-12-09 18:31                       ` William Lee Irwin III
  2003-12-09 19:38                       ` William Lee Irwin III
  2 siblings, 1 reply; 63+ messages in thread
From: Rik van Riel @ 2003-12-09 16:04 UTC (permalink / raw)
  To: Roger Luethi
  Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel,
	Martin J. Bligh

On Tue, 9 Dec 2003, Roger Luethi wrote:

> The classic strategies based on these criteria work for transaction and
> batch systems. They are all but useless, though, for a workstation and
> even most modern servers, due to assumptions that are incorrect today
> (remember all the degrees of freedom a scheduler had 30 years ago)
> and additional factors that only became crucial in the past few decades
> (latency again).

Don't forget that computers have gotten a lot slower
over the years ;)

Swapping out a 64kB process to a disk that does 180kB/s
is a lot faster than swapping out a 100MB process to a
disk that does 50MB/s ...

Once you figure in seek times, the picture looks even
worse.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-09 16:04                       ` Rik van Riel
@ 2003-12-09 16:31                         ` Roger Luethi
  0 siblings, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-09 16:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel,
	Martin J. Bligh

On Tue, 09 Dec 2003 11:04:49 -0500, Rik van Riel wrote:
> > The classic strategies based on these criteria work for transaction and
> > batch systems. They are all but useless, though, for a workstation and
> > even most modern servers, due to assumptions that are incorrect today
> > (remember all the degrees of freedom a scheduler had 30 years ago)
> > and additional factors that only became crucial in the past few decades
> > (latency again).
> 
> Don't forget that computers have gotten a lot slower
> over the years ;)
> 
> Swapping out a 64kB process to a disk that does 180kB/s
> is a lot faster than swapping out a 100MB process to a
> disk that does 50MB/s ...
> 
> Once you figure in seek times, the picture looks even
> worse.

Exactly -- I did mention the growing access time gap between RAM and
disks in an earlier message. Yes, there are quite a few developments in
hardware and in the way we use computers (interactive, Client/Server,
dedicated machines, etc.) that made thrashing pretty much unsolvable
at an OS level. Fortunately, fixing it in hardware by adding RAM works
for most.

What we _can_ do in software, though, is prevent thrashing as long as
possible. Comparing 2.4 and 2.6 shows that a kernel can still make a
significant difference with smart pageout algorithms, I/O scheduling etc.
But you won't get much help with that from ancient papers.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-09 15:11                     ` Roger Luethi
  2003-12-09 16:04                       ` Rik van Riel
@ 2003-12-09 18:31                       ` William Lee Irwin III
  2003-12-09 19:38                       ` William Lee Irwin III
  2 siblings, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-09 18:31 UTC (permalink / raw)
  To: Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> I'm afraid you have a solution in search of a problem. Nobody runs a
> 10x overcommit system. And if they did, they would find it doesn't work
> well with 2.4, either, so no one will complain about a regression. What
> does happen, though, is that people go close to the limit of what
> their low-end hardware supports, which will work perfectly with 2.4
> and collapse with 2.6.

No, I've got a guy in Russia complaining about 2.6 not doing well on
one of his boxen.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-09 15:11                     ` Roger Luethi
  2003-12-09 16:04                       ` Rik van Riel
  2003-12-09 18:31                       ` William Lee Irwin III
@ 2003-12-09 19:38                       ` William Lee Irwin III
  2003-12-10 13:58                         ` Roger Luethi
  2 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-09 19:38 UTC (permalink / raw)
  To: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> The more fine-grained work is not complete and I'm not sure it ever
> will be. Some _preliminary_ results (i.e. take with a grain of salt):
> The regression for kbuild in 2.5.48 was caused by a patch titled "better
> inode reclaim balancing". In 2.5.49, "strengthen the `incremental
> min' logic in the page". In 2.6.0-test3 (aka 2.6.78), it's a subtle
> interaction between "fix kswapd throttling" and "decaying average of
> zone pressure" -- IIRC reverting the former gains nothing unless you
> also revert the latter. I'd have to dig through my notes.

Okay, it sounds like you're well on our way to cleaning things up.
Not too hard to chime in as needed.


On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote:
>> It was defined in two different ways: cpu utilization (inverse of iowait)
>> and multiprogramming level (how many tasks it could avoid suspending).

On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> Yeah, that's the classic. It _is_ throughput. Unless you have task
> priorities (i.e. eye candy or SETI@home competing for cycles), CPU
> utilization is an excellent approximation for throughput. And the
> benefit of maintaining a high level of multiprogramming is that you
> have a better chance to have a runnable process at any time, meaning
> better CPU utilization meaning higher throughput.
> The classic strategies based on these criteria work for transaction and
> batch systems. They are all but useless, though, for a workstation and
> even most modern servers, due to assumptions that are incorrect today
> (remember all the degrees of freedom a scheduler had 30 years ago)
> and additional factors that only became crucial in the past few decades
> (latency again).

This assessment is inaccurate. The performance metrics are not entirely
useless, and it's rather trivial to recover data useful for modern
scenarios based on them. The driving notion from the iron age (I guess
the stone age was when the only way to support virtual memory was
swapping) was that getting stuck on io was the thing preventing the cpu
from getting used. Nowadays, we burn the cpu nicely enough with GNOME
etc. but have to worry about what happens to some task or other. So:

(a) Multiprogramming level is obviously trying to minimize the amount of
	swapping out going on. i.e. this is trying to limit the worst case
	handling to the worst case. Minimal adjustments are required. e.g.
	consider number of process address spaces swapped out directly
	as something to be minimized instead of total minus that.
(b) CPU utilization is essentially trying to minimize how much the system
	gets stuck on io. This one needs more adjustment for modern systems,
	which tend to utilize the cpu regardless of io being in flight.
	Number of blocked tasks is a close approximation, directly using
	iowait, and so on are various other possible substitutes.


On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote:
>> Demoting the largest task is one that does worse than random.

On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> We only know that to be true for irrelevant optimization criteria.

The above explains how and why they are relevant.

It's also not difficult to understand why it goes wrong: the operation
is too expensive.


On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote:
>> Like the PFF bits for WS?

On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> Yup. PFF doesn't cover all disk I/O, though. Suspending a process that
> is I/O bound even with a low PFF improves thrashing performance as well,
> because disk I/O is the bottleneck.

That's a significantly different use for it; AIUI it was an heuristic
to estimate the WSS without periodic catastrophes like WSinterval,
though ISTR bits about "extremely high" rates contracting the estimated
size.


On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote:
>> Top 0.001%? For expanded range, both endpoints matter. My notion of
>> scalability is running like greased lightning (compared to other OS's)
>> on everything from some ancient toaster with a 0.1MHz cpu and 256KB RAM
>> to a 16384x/16PB superdupercomputer.

On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> Well, that's nice. I agree. IIRC, though, each major release had more
> demanding minimum requirements (in terms of RAM). The range covered
> has been growing only because upward scalability grew faster. I can't
> help but notice that some of your statements sound a lot like wishful
> thinking.

This is not wishful thinking; it's an example that tries to illustrate
the goal. It's rather clear to me that neither end of the spectrum
mentioned above is even functional in current source (well, 4096x might
boot with deep enough stacks, though I'd expect it to perform too poorly
to be considered functional).


On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote:
>> No matter how many pieces of hardware you buy, the original one still
>> isn't driven correctly by the software. Hardware *NEVER* fixes software.

On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> Look, I became the maintainer of via-rhine because nobody else wanted to
> fix the driver for a very common, but barely documented piece of
> cheap hardware. People were just told to buy another cheap card. That's
> the reality of Linux.

That's an _unfortunate_ reality. And you changed it in a similar way to
how we want to support the lower end, though I'm going a bit lower end
than you are.


On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> Don't forget what we are talking about, though. Once you are seriously
> tight on memory, you can only mitigate the damage in software, the only
> solution is to add more RAM. Thrashing is not a bug like a broken driver.

Covering for low quality hardware is generally a kernel's job. c.f. the
"how many address lines did this device snip off" games that have even
infected the VM. Of course, it's not as clear cut as an oops, no.


On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> I am currently writing a paper on the subject, and the gist of it will
> likely be that we should try to prevent thrashing from happening as
> long as possible (with good page replacement, I/O scheduling, etc.),
> but when it's inevitable we're pretty much done for. Load control may
> or may not be worth adding, but it only helps in some special cases
> and does not seem clearly beneficial in general-purpose systems.

Figures.


On Mon, 08 Dec 2003 20:05:01 -0800, William Lee Irwin III wrote:
>> The common case is pretty much zero or slim overcommitment these days.
>> The case I have in mind is pretty much 10x RAM committed. (Sum of WSS's,
>> not non-overcommit-related.)

On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> So you want to help people who for some reason _have_ to run several
> _batch_ jobs _concurrently_ (otherwise load control is ineffective)
> on a low-end machine to result in a 10x overcommit system? Why don't
> we buy those two or three guys a DIMM each?
> I'm afraid you have a solution in search of a problem. Nobody runs a
> 10x overcommit system. And if they did, they would find it doesn't work
> well with 2.4, either, so no one will complain about a regression. What
> does happen, though, is that people go close to the limit of what
> their low-end hardware supports, which will work perfectly with 2.4
> and collapse with 2.6.
> The real problem, the one many people will hit, the one the very
> complaint that started this thread was about, is light and medium
> overcommit. And load control is not the answer to that.

No, I've got some guy in Russia complaining about 2.6 sucking on his
box who has a 10x overcommit ratio (approximate sum of WSS's).

(Also, whatever this thread was, the In-Reply-To: chain was broken
somewhere and the first thing I saw was the post I replied to.)

Hmm.

I was trying to avoid duplicating effort and/or preempt someone's code
they were working on because I'd heard either you and/or Nick Piggin
were working on the stuff. You're doing something useful and relevant...

Well, I guess I might as well help with your paper. If the demotion
criteria you're using are anything like what you posted, they risk
invalidating the results, since they're apparently based on something
worse than random.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-08 14:23             ` Con Kolivas
  2003-12-08 14:30               ` William Lee Irwin III
@ 2003-12-09 21:03               ` Chris Vine
  2003-12-13 14:08               ` Chris Vine
  2 siblings, 0 replies; 63+ messages in thread
From: Chris Vine @ 2003-12-09 21:03 UTC (permalink / raw)
  To: Con Kolivas, William Lee Irwin III
  Cc: Rik van Riel, linux-kernel, Martin J. Bligh

On Monday 08 December 2003 2:23 pm, Con Kolivas wrote:
> [snip original discussion thrashing swap on 2.6test with 32mb ram]
>
> Chris
>
> By an unusual coincidence I was looking into the patches that were supposed
> to speed up application startup and noticed this one was merged. A brief
> discussion with wli suggests this could cause thrashing problems on low
> memory boxes so can you try this patch? Applies to test11.
>
> Con

Con,

I have just got back from a trip away.  I will try out the patch tomorrow, I 
hope, and see what difference it makes.

Chris.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-09 19:38                       ` William Lee Irwin III
@ 2003-12-10 13:58                         ` Roger Luethi
  2003-12-10 17:47                           ` William Lee Irwin III
                                             ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-10 13:58 UTC (permalink / raw)
  To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

On Tue, 09 Dec 2003 11:38:01 -0800, William Lee Irwin III wrote:
> On Tue, Dec 09, 2003 at 04:11:03PM +0100, Roger Luethi wrote:
> > The more fine-grained work is not complete and I'm not sure it ever
> > will be. Some _preliminary_ results (i.e. take with a grain of salt):
> 
> Okay, it sounds like you're well on our way to cleaning things up.

Actually, I'm rather well on my way wrapping things up. I documented
in detail how much 2.6 sucks in this area and where the potential for
improvements would have likely been, but now I've got a deadline to
meet and other things on my plate.

For me this discussion just confirmed that my approach fails to draw much
interest, either because there are better alternatives or because heavy
paging and medium thrashing are generally not considered interesting
problems.

> > The classic strategies based on these criteria work for transaction and
> > batch systems. They are all but useless, though, for a workstation and
> > even most modern servers, due to assumptions that are incorrect today
> > (remember all the degrees of freedom a scheduler had 30 years ago)
> > and additional factors that only became crucial in the past few decades
> > (latency again).
> 
> This assessment is inaccurate. The performance metrics are not entirely
> useless, and it's rather trivial to recover data useful for modern
> scenarios based on them. The driving notion from the iron age (I guess

I said _strategies_ rather than papers or research because I realize
that the metrics can be an important part of the modern picture. It's
just the ancient recipes that once solved the problem that are useless
for typical modern usage patterns.

> >> Demoting the largest task is one that does worse than random.
> 
> > We only know that to be true for irrelevant optimization criteria.
> 
> The above explains how and why they are relevant.
> 
> It's also not difficult to understand why it goes wrong: the operation
> is too expensive.

What goes wrong is that once you start suspending tasks, you have a
hard time telling the interactive tasks apart from the batch load.
This may not be much of a problem on a 10x overcommit system, because
that's presumably quite unresponsive anyway, but it does matter a lot if
you have an interactive system that just crossed the border to thrashing.

Our apparent differences come from the fact that we try to solve
different problems as you correctly noted: You are concerned with
extreme overcommit, while I am concerned that 2.6 takes several times
longer than 2.4 to complete a task under slight overcommit.

I have no reason to doubt that load control will help you solve your
problem. It may help with medium thrashing and it might even keep
latency within reasonable bounds. I do think, however, that we should
investigate _first_ how we lost over 50% of the performance we had in
2.5.40 for both compile benchmarks.

> (Also, whatever this thread was, the In-Reply-To: chain was broken
> somewhere and the first thing I saw was the post I replied to.)

You can read the whole thread starting from here:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=M794.3OE.7%40gated-at.bofh.it

> Well, I guess I might as well help with your paper. If the demotion
> criteria you're using are anything like what you posted, they risk
> invalidating the results, since they're apparently based on something
> worse than random.

Worse than random may still improve throughput, though, compared to
doing nothing, right? And I did measure improvements.

There are variables other than the demotion criteria that I found can
be important, to name a few:

- Trigger: Under which circumstances is suspending any processes
  considered? How often?

- Eviction: Does regular pageout take care of the memory of a suspended
  process, or are pages marked old or even unmapped upon stunning?

- Release: Is the stunning queue a simple FIFO? How long do the
  processes stay there? Does a process get a bonus after it's woken up
  again -- bigger quantum, chunk of free memory, prepaged working set
  before stunning?

There's quite a bit of complexity involved and many variables will depend
on the scenario. Sort of like interactivity, except lots of people were
affected by the interactivity tuning and only few will notice and test
load control.

The key question with regards to load control remains: How do you keep a
load controled system responsive? Cleverly detect interactive processes
and spare them, or wake them up again quickly enough? How? Or is the
plan to use load control where responsiveness doesn't matter anyway?

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 13:58                         ` Roger Luethi
@ 2003-12-10 17:47                           ` William Lee Irwin III
  2003-12-10 22:23                             ` Roger Luethi
  2003-12-10 21:04                           ` Rik van Riel
  2003-12-10 23:30                           ` Helge Hafting
  2 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-10 17:47 UTC (permalink / raw)
  To: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> Actually, I'm rather well on my way wrapping things up. I documented
> in detail how much 2.6 sucks in this area and where the potential for
> improvements would have likely been, but now I've got a deadline to
> meet and other things on my plate.

Well, it'd be nice to see the code, then.


On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> For me this discussion just confirmed that my approach fails to draw much
> interest, either because there are better alternatives or because heavy
> paging and medium thrashing are generally not considered interesting
> problems.

They're worthwhile; I didn't even realize there were such problems until
you pointed them out. I had presumed it was due to physical scanning.


On Tue, 09 Dec 2003 11:38:01 -0800, William Lee Irwin III wrote:
>> This assessment is inaccurate. The performance metrics are not entirely
>> useless, and it's rather trivial to recover data useful for modern
>> scenarios based on them. The driving notion from the iron age (I guess

On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> I said _strategies_ rather than papers or research because I realize
> that the metrics can be an important part of the modern picture. It's
> just the ancient recipes that once solved the problem that are useless
> for typical modern usage patterns.

Hmm. There were a wide variety of algorithms.


On Tue, 09 Dec 2003 11:38:01 -0800, William Lee Irwin III wrote:
>> The above explains how and why they are relevant.
>> It's also not difficult to understand why it goes wrong: the operation
>> is too expensive.

On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> What goes wrong is that once you start suspending tasks, you have a
> hard time telling the interactive tasks apart from the batch load.
> This may not be much of a problem on a 10x overcommit system, because
> that's presumably quite unresponsive anyway, but it does matter a lot if
> you have an interactive system that just crossed the border to thrashing.

It's effectively a form of longer-term process scheduling.


On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> Our apparent differences come from the fact that we try to solve
> different problems as you correctly noted: You are concerned with
> extreme overcommit, while I am concerned that 2.6 takes several times
> longer than 2.4 to complete a task under slight overcommit.

Yes, my focus is pushing back the point of true thrashing as opposed to
the interior points of the range.


On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> I have no reason to doubt that load control will help you solve your
> problem. It may help with medium thrashing and it might even keep
> latency within reasonable bounds. I do think, however, that we should
> investigate _first_ how we lost over 50% of the performance we had in
> 2.5.40 for both compile benchmarks.

Perfectly reasonable.


On Tue, 09 Dec 2003 11:38:01 -0800, William Lee Irwin III wrote:
>> Well, I guess I might as well help with your paper. If the demotion
>> criteria you're using are anything like what you posted, they risk
>> invalidating the results, since they're apparently based on something
>> worse than random.

On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> Worse than random may still improve throughput, though, compared to
> doing nothing, right? And I did measure improvements.

I didn't see any of the methods compared to no load control, so I don't
have any information on that.


On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> There are variables other than the demotion criteria that I found can
> be important, to name a few:
> - Trigger: Under which circumstances is suspending any processes
>   considered? How often?

This is generally part of the load control algorithm, but it
essentially just tries to detect levels of overcommitment that would
degrade performance so it can resolve them.


On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> - Eviction: Does regular pageout take care of the memory of a suspended
>   process, or are pages marked old or even unmapped upon stunning?

This is generally unmapping and evicting upon suspension. The effect
isn't immediate anyway, since io is required, and batching the work for
io contiguity etc. is a fair amount of savings, so there's little or no
incentive to delay this apart from keeping io rates down to where user
io and VM io aren't in competition.


On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> - Release: Is the stunning queue a simple FIFO? How long do the
>   processes stay there? Does a process get a bonus after it's woken up
>   again -- bigger quantum, chunk of free memory, prepaged working set
>   before stunning?

It's a form of process scheduling. Memory scheduling policies are not
discussed very much in the sources I can get at, so some synthesis may
be required unless material can be found on that, but in general this
isn't a very interesting problem (at least not since the 70's or earlier).

FreeBSD has an implementation of some of this we can all look at,
though it doesn't illustrate a number of the concepts.


On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> There's quite a bit of complexity involved and many variables will depend
> on the scenario. Sort of like interactivity, except lots of people were
> affected by the interactivity tuning and only few will notice and test
> load control.

It's basically just process scheduling, so I don't see an issue there.


On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> The key question with regards to load control remains: How do you keep a
> load controled system responsive? Cleverly detect interactive processes
> and spare them, or wake them up again quickly enough? How? Or is the
> plan to use load control where responsiveness doesn't matter anyway?

It's more in the interest of graceful degradation and relative
improvement than meeting absolute response time requirements. i.e.
making the best of a bad situation. Interactivity heuristics would
presumably be part of a memory scheduling policy as they are for a cpu
scheduling policy, but there is some missing information there. I
suppose that's where synthesis is required.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 13:58                         ` Roger Luethi
  2003-12-10 17:47                           ` William Lee Irwin III
@ 2003-12-10 21:04                           ` Rik van Riel
  2003-12-10 23:17                             ` Roger Luethi
  2003-12-10 23:30                           ` Helge Hafting
  2 siblings, 1 reply; 63+ messages in thread
From: Rik van Riel @ 2003-12-10 21:04 UTC (permalink / raw)
  To: Roger Luethi
  Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel,
	Martin J. Bligh

On Wed, 10 Dec 2003, Roger Luethi wrote:

> For me this discussion just confirmed that my approach fails to draw
> much interest, either because there are better alternatives or because
> heavy paging and medium thrashing are generally not considered
> interesting problems.

I'm willing to take over this work if you really want
to throw in the towel.  It has to be done, simply to
make Linux better able to deal with load spikes.

> Our apparent differences come from the fact that we try to solve
> different problems as you correctly noted: You are concerned with
> extreme overcommit, while I am concerned that 2.6 takes several times
> longer than 2.4 to complete a task under slight overcommit.

Agreed, the slight to medium overcommit needs to be
addressed well.  This is way more important than very
highly overcommitted systems, because computers are
powerful enough for their workloads anyway.

The thing Linux needs to deal with are unexpected
load spikes.  The thing that needs to be done is
making sure that such a load spike doesn't send
Linux into a death spiral.

If such a load control mechanism also solves the
highly overloaded scenario, that's just a nice
bonus.

> The key question with regards to load control remains: How do you keep a
> load controled system responsive? Cleverly detect interactive processes
> and spare them, or wake them up again quickly enough? How? Or is the
> plan to use load control where responsiveness doesn't matter anyway?

Under light to moderate overload, a load controlled system
will be more responsive than a thrashing system.

Heavy overload is probably a "docter, it hurts ..." case.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-08 20:48               ` William Lee Irwin III
  2003-12-09  0:27                 ` Roger Luethi
@ 2003-12-10 21:52                 ` Andrea Arcangeli
  2003-12-10 22:05                   ` Roger Luethi
  1 sibling, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2003-12-10 21:52 UTC (permalink / raw)
  To: William Lee Irwin III, rl, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

On Mon, Dec 08, 2003 at 12:48:17PM -0800, William Lee Irwin III wrote:
> qsbench I'd pretty much ignore except as a control case, since there's
> nothing to do with a single process but let it thrash.

this is not the point. If a single process like qsbench trashes twice as
fast in 2.4, it means 2.6 has some great problem in the core vm, the
whole point of swap is to trash but to give the task more physical
virtual memory. I doubt you can solve it with anything returned by
si_swapinfo.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 21:52                 ` Andrea Arcangeli
@ 2003-12-10 22:05                   ` Roger Luethi
  2003-12-10 22:44                     ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2003-12-10 22:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

On Wed, 10 Dec 2003 22:52:35 +0100, Andrea Arcangeli wrote:
> On Mon, Dec 08, 2003 at 12:48:17PM -0800, William Lee Irwin III wrote:
> > qsbench I'd pretty much ignore except as a control case, since there's
> > nothing to do with a single process but let it thrash.
> 
> this is not the point. If a single process like qsbench trashes twice as
> fast in 2.4, it means 2.6 has some great problem in the core vm, the
> whole point of swap is to trash but to give the task more physical
> virtual memory. I doubt you can solve it with anything returned by
> si_swapinfo.

Uhm.. guys? I forgot to mention that earlier: qsbench as I used it was not
about one single process. There were four worker processes (-p 4), and my
load control stuff did make it run faster, so the point is moot.

Also, the 2.6 core VM doesn't seem all that bad since it was introduced in
2.5.27 but most of the problems I measured were introduced after 2.5.40.
Check out the graph I posted.

Thank you, we now return to our regularly scheduled programming.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 17:47                           ` William Lee Irwin III
@ 2003-12-10 22:23                             ` Roger Luethi
  2003-12-11  0:12                               ` William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2003-12-10 22:23 UTC (permalink / raw)
  To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 4079 bytes --]

On Wed, 10 Dec 2003 09:47:57 -0800, William Lee Irwin III wrote:
> On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> > Actually, I'm rather well on my way wrapping things up. I documented
> > in detail how much 2.6 sucks in this area and where the potential for
> > improvements would have likely been, but now I've got a deadline to
> > meet and other things on my plate.
> 
> Well, it'd be nice to see the code, then.

I attached the stunning code I wrote a few months ago, rediffed
against test11, seems to compile. It does not include the eviction code
(although you can tell where it plugs in) -- that's a bit messy and
I'm not too confident that I got all the locking right.

The trigger in the page allocator worked pretty well in test4 to test6,
but it is sensitive to VM changes. Earlier 2.5 kernels went through
the slow path much more frequently (IIRC before akpm limited use of
blk_congestion_wait), for instance. That would require a different
trigger.

The time processes spend in the stunning queue (defined in stun_time())
is too short to gain much in terms of throughput -- that's because back
then I tried to put a cap on worst case latency.

> you pointed them out. I had presumed it was due to physical scanning.

Everybody did, including me. Only after doing some of the benchmarks
did I realize I had been wrong. It's quite clear that physical scanning
accounts for a 50% higher execution time at most, which is a mere fifth
of the overall slow down in compile benchmarks.

> > There are variables other than the demotion criteria that I found can
> > be important, to name a few:
> > - Trigger: Under which circumstances is suspending any processes
> >   considered? How often?
> 
> This is generally part of the load control algorithm, but it
> essentially just tries to detect levels of overcommitment that would
> degrade performance so it can resolve them.

Level of overcommitment? What kind of criteria is that supposed to be?
You can have 10x overcommit and not thrash at all, if most of the memory
is allocated and filled but never referenced again. IOW, I can't derive
an algorithm from your handwaving <g>.

> > - Eviction: Does regular pageout take care of the memory of a suspended
> >   process, or are pages marked old or even unmapped upon stunning?
> 
> This is generally unmapping and evicting upon suspension. The effect
> isn't immediate anyway, since io is required, and batching the work for
> io contiguity etc. is a fair amount of savings, so there's little or no
> incentive to delay this apart from keeping io rates down to where user
> io and VM io aren't in competition.

I agree with that part.

> On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> > - Release: Is the stunning queue a simple FIFO? How long do the
> >   processes stay there? Does a process get a bonus after it's woken up
> >   again -- bigger quantum, chunk of free memory, prepaged working set
> >   before stunning?
> 
> It's a form of process scheduling. Memory scheduling policies are not
> discussed very much in the sources I can get at, so some synthesis may
> be required unless material can be found on that, but in general this
> isn't a very interesting problem (at least not since the 70's or earlier).

Not interesting, yes. And I realize that it's not even important once
you accept the very real possibility of extreme latencies.

> > There's quite a bit of complexity involved and many variables will depend
> > on the scenario. Sort of like interactivity, except lots of people were
> > affected by the interactivity tuning and only few will notice and test
> > load control.
> 
> It's basically just process scheduling, so I don't see an issue there.

The issue is that there are tons of knobs and dials that affect the
behavior, and it's hard to get good heuristics with a tiny test field.
Admittedly, things get easier once you want load control only for the
heavy thrashing case, and that's been my plan, too, since I realized that
it doesn't work well for the light and medium type I'd been working on.

Roger

[-- Attachment #2: linux-2.6.0-test11-stun.patch --]
[-- Type: text/plain, Size: 10228 bytes --]

diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/include/linux/loadcontrol.h ./include/linux/loadcontrol.h
--- ../../18_binsearch/linux-2.6.0-test11/include/linux/loadcontrol.h	1970-01-01 01:00:00.000000000 +0100
+++ ./include/linux/loadcontrol.h	2003-12-10 22:11:15.999792424 +0100
@@ -0,0 +1,14 @@
+#ifndef _LINUX_LOADCONTROL_H
+#define _LINUX_LOADCONTROL_H
+
+#include <asm/atomic.h>
+
+extern wait_queue_head_t loadctrl_wq;
+extern struct semaphore stun_ser;
+extern struct semaphore unstun_token;
+
+extern void loadcontrol(void);
+extern void thrashing(unsigned long);
+extern atomic_t lctrl_waiting;
+
+#endif /* _LINUX_LOADCONTROL_H */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/include/linux/sched.h ./include/linux/sched.h
--- ../../18_binsearch/linux-2.6.0-test11/include/linux/sched.h	2003-11-24 10:28:54.000000000 +0100
+++ ./include/linux/sched.h	2003-12-10 22:11:16.002791985 +0100
@@ -500,6 +500,8 @@ do { if (atomic_dec_and_test(&(tsk)->usa
 #define PF_SWAPOFF	0x00080000	/* I am in swapoff */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
+#define PF_STUN		0x00400000
+#define PF_YIELD	0x00800000
 
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
@@ -581,6 +583,7 @@ extern int FASTCALL(wake_up_process(stru
 #endif
 extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
 extern void FASTCALL(sched_exit(task_t * p));
+extern int task_interactive(task_t * p);
 
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
 
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/include/linux/swap.h ./include/linux/swap.h
--- ../../18_binsearch/linux-2.6.0-test11/include/linux/swap.h	2003-10-15 15:03:46.000000000 +0200
+++ ./include/linux/swap.h	2003-12-10 22:11:16.003791839 +0100
@@ -174,6 +174,7 @@ extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
+extern int shrink_list(struct list_head *, unsigned int, int *, int *);
 extern int shrink_all_memory(int);
 extern int vm_swappiness;
 
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/kernel/loadcontrol.c ./kernel/loadcontrol.c
--- ../../18_binsearch/linux-2.6.0-test11/kernel/loadcontrol.c	1970-01-01 01:00:00.000000000 +0100
+++ ./kernel/loadcontrol.c	2003-12-10 22:11:16.005791546 +0100
@@ -0,0 +1,169 @@
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/loadcontrol.h>
+
+void loadcontrol(void);
+void thrashing(unsigned long);
+
+DECLARE_MUTEX(stun_ser);
+DECLARE_MUTEX_LOCKED(unstun_token);
+DECLARE_WAIT_QUEUE_HEAD(loadctrl_wq);
+
+atomic_t lctrl_waiting;
+
+static inline void stun_me(void)
+{
+	DEFINE_WAIT(wait);
+
+	up(&stun_ser);		/* Allow next */
+	atomic_inc(&lctrl_waiting);
+
+	for (;;) {
+		prepare_to_wait_exclusive(&loadctrl_wq, &wait,
+				TASK_UNINTERRUPTIBLE);
+		schedule();
+		if (!down_trylock(&unstun_token)) {
+			/* Yay. Got unstun token, wake up */
+			break;
+		}
+	}
+	finish_wait(&loadctrl_wq, &wait);
+
+	atomic_dec(&lctrl_waiting);
+}
+
+void loadcontrol()
+{
+	unsigned long flags = current->flags;
+
+	spin_lock_irq(&current->sighand->siglock);
+	recalc_sigpending();	/* We sent fake signal, clean it up */
+	spin_unlock_irq(&current->sighand->siglock);
+
+	if (flags & PF_STUN)
+		stun_me();
+
+	current->flags &= ~(PF_STUN|PF_YIELD|PF_MEMALLOC);
+}
+
+/*
+ * int_sqrt - oom_kill.c internal function, rough approximation to sqrt
+ * @x: integer of which to calculate the sqrt
+ *
+ * A very rough approximation to the sqrt() function.
+ */
+static unsigned int int_sqrt(unsigned int x)
+{
+	unsigned int out = x;
+	while (x & ~(unsigned int)1) x >>=2, out >>=1;
+	if (x) out -= out >> 2;
+	return (out ? out : 1);
+}
+
+static int badness(struct task_struct *p, int flags)
+{
+	int points, cpu_time, run_time;
+
+	if (!p->mm)
+		return 0;
+
+	if (p->flags & (PF_MEMDIE | flags))
+		return 0;
+
+	/*
+	 * Resident memory size of the process is the basis for the badness.
+	 */
+	points = p->mm->rss;
+
+	/*
+	 * CPU time is in seconds and run time is in minutes. There is no
+	 * particular reason for this other than that it turned out to work
+	 * very well in practice.
+	 */
+	cpu_time = (p->utime + p->stime) >> (SHIFT_HZ + 3);
+	run_time = (get_jiffies_64() - p->start_time) >> (SHIFT_HZ + 10);
+
+	points *= int_sqrt(cpu_time);
+	points *= int_sqrt(int_sqrt(run_time));
+
+	/*
+	 * Niced processes are most likely less important.
+	 */
+	if (task_nice(p) > 0)
+		points *= 4;
+
+	/*
+	 * Keep interactive processes around.
+	 */
+	if (task_interactive(p))
+		points /= 4;
+
+	/*
+	 * Superuser processes are usually more important, so we make it
+	 * less likely that we kill those.
+	 */
+	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
+				p->uid == 0 || p->euid == 0)
+		points /= 2;
+
+	/*
+	 * We don't want to kill a process with direct hardware access.
+	 * Not only could that mess up the hardware, but usually users
+	 * tend to only have this flag set on applications they think
+	 * of as important.
+	 */
+	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
+		points /= 2;
+
+	return points;
+}
+
+/*
+ * Simple selection loop. We chose the process with the highest
+ * number of 'points'. We expect the caller will lock the tasklist.
+ */
+static struct task_struct * pick_bad_process(int flags)
+{
+	int maxpoints = 0;
+	struct task_struct *g, *p;
+	struct task_struct *chosen = NULL;
+
+	do_each_thread(g, p)
+		if (p->pid) {
+			int points = badness(p, flags);
+			if (points > maxpoints) {
+				chosen = p;
+				maxpoints = points;
+			}
+		}
+	while_each_thread(g, p);
+	return chosen;
+}
+
+
+void thrashing(unsigned long action)
+{
+	struct task_struct *p;
+	unsigned long flags;
+
+	if (down_trylock(&stun_ser))
+		return;
+
+	read_lock(&tasklist_lock);
+
+	p = pick_bad_process(PF_STUN|action);
+	if (!p) {
+		up(&stun_ser);
+		goto out_unlock;
+	}
+
+	p->flags |= action|PF_MEMALLOC;
+	p->time_slice = HZ;
+
+	spin_lock_irqsave(&p->sighand->siglock, flags);
+	signal_wake_up(p, 0);
+	spin_unlock_irqrestore(&p->sighand->siglock, flags);
+
+out_unlock:
+	read_unlock(&tasklist_lock);
+}
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/kernel/Makefile ./kernel/Makefile
--- ../../18_binsearch/linux-2.6.0-test11/kernel/Makefile	2003-10-15 15:03:46.000000000 +0200
+++ ./kernel/Makefile	2003-12-10 22:11:16.006791400 +0100
@@ -6,7 +6,8 @@ obj-y     = sched.o fork.o exec_domain.o
 	    exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
-	    rcupdate.o intermodule.o extable.o params.o posix-timers.o
+	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
+	    loadcontrol.o
 
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/kernel/sched.c ./kernel/sched.c
--- ../../18_binsearch/linux-2.6.0-test11/kernel/sched.c	2003-11-24 10:28:54.000000000 +0100
+++ ./kernel/sched.c	2003-12-10 22:11:16.015790083 +0100
@@ -37,6 +37,7 @@
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
 #include <linux/percpu.h>
+#include <linux/loadcontrol.h>
 
 #ifdef CONFIG_NUMA
 #define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu))
@@ -1465,6 +1466,22 @@ out:
 
 void scheduling_functions_start_here(void) { }
 
+static unsigned long stun_time(void) {
+	unsigned long ret;
+	int ql = atomic_read(&lctrl_waiting);
+	if (ql == 1)
+		ret = 5*HZ;
+	else if (ql == 2)
+		ret = 3*HZ;
+	else if (ql < 5)
+		ret = 2*HZ;
+	else if (ql < 10)
+		ret = 1*HZ;
+	else
+		ret = HZ/2;
+	return ret;
+}
+
 /*
  * schedule() is the main scheduler function.
  */
@@ -1495,6 +1512,22 @@ need_resched:
 	prev = current;
 	rq = this_rq();
 
+	if (unlikely(waitqueue_active(&loadctrl_wq))) {
+		static unsigned long prev_unstun;
+		unsigned long wait = stun_time();
+		if (time_before(jiffies, prev_unstun + wait) && prev_unstun)
+			goto loadctrl_done;
+		if (!atomic_read(&stun_ser.count))
+			goto loadctrl_done;
+		if (!prev_unstun) {
+			prev_unstun = jiffies;
+			goto loadctrl_done;
+		}
+		prev_unstun = jiffies;
+		up(&unstun_token);
+		wake_up(&loadctrl_wq);
+	}
+loadctrl_done:
 	release_kernel_lock(prev);
 	now = sched_clock();
 	if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
@@ -1935,6 +1968,11 @@ asmlinkage long sys_nice(int increment)
 
 #endif
 
+int task_interactive(task_t *p)
+{
+	return TASK_INTERACTIVE(p);
+}
+
 /**
  * task_prio - return the priority value of a given task.
  * @p: the task in question.
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/arch/i386/kernel/signal.c ./arch/i386/kernel/signal.c
--- ../../18_binsearch/linux-2.6.0-test11/arch/i386/kernel/signal.c	2003-11-24 10:28:51.000000000 +0100
+++ ./arch/i386/kernel/signal.c	2003-12-10 22:11:16.018789645 +0100
@@ -24,6 +24,7 @@
 #include <asm/uaccess.h>
 #include <asm/i387.h>
 #include "sigframe.h"
+#include <linux/loadcontrol.h>
 
 #define DEBUG_SIG 0
 
@@ -569,6 +570,10 @@ int do_signal(struct pt_regs *regs, sigs
 		refrigerator(0);
 		goto no_signal;
 	}
+	if (current->flags & (PF_STUN|PF_YIELD)) {
+		loadcontrol();
+		goto no_signal;
+	}
 
 	if (!oldset)
 		oldset = &current->blocked;
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 ../../18_binsearch/linux-2.6.0-test11/mm/page_alloc.c ./mm/page_alloc.c
--- ../../18_binsearch/linux-2.6.0-test11/mm/page_alloc.c	2003-10-15 15:03:46.000000000 +0200
+++ ./mm/page_alloc.c	2003-12-10 22:11:16.021789206 +0100
@@ -31,6 +31,7 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/loadcontrol.h>
 
 #include <asm/tlbflush.h>
 
@@ -606,6 +607,7 @@ __alloc_pages(unsigned int gfp_mask, uns
 	}
 
 	/* here we're in the low on memory slow path */
+	thrashing(PF_STUN);
 
 rebalance:
 	if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 22:05                   ` Roger Luethi
@ 2003-12-10 22:44                     ` Andrea Arcangeli
  2003-12-11  1:28                       ` William Lee Irwin III
                                         ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2003-12-10 22:44 UTC (permalink / raw)
  To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 11:05:25PM +0100, Roger Luethi wrote:
> On Wed, 10 Dec 2003 22:52:35 +0100, Andrea Arcangeli wrote:
> > On Mon, Dec 08, 2003 at 12:48:17PM -0800, William Lee Irwin III wrote:
> > > qsbench I'd pretty much ignore except as a control case, since there's
> > > nothing to do with a single process but let it thrash.
> > 
> > this is not the point. If a single process like qsbench trashes twice as
> > fast in 2.4, it means 2.6 has some great problem in the core vm, the
> > whole point of swap is to trash but to give the task more physical
> > virtual memory. I doubt you can solve it with anything returned by
> > si_swapinfo.
> 
> Uhm.. guys? I forgot to mention that earlier: qsbench as I used it was not
> about one single process. There were four worker processes (-p 4), and my
> load control stuff did make it run faster, so the point is moot.

more processes can be optimized even better by adding unfariness.
Either ways a significant slowdown of qsbench probably means worse core
VM, at least if compared with 2.4 that isn't adding huge unfariness just
to optimize qsbench.

> Also, the 2.6 core VM doesn't seem all that bad since it was introduced in
> 2.5.27 but most of the problems I measured were introduced after 2.5.40.
> Check out the graph I posted.

you're confusing rmap with core vm. rmap in no way can be defined as the
core vm, rmap is just a method used by the core vm to find some
information more efficiently at the expenses of all the fast paths
that now have to do the rmap bookkeeping.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 21:04                           ` Rik van Riel
@ 2003-12-10 23:17                             ` Roger Luethi
  2003-12-11  1:31                               ` Rik van Riel
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2003-12-10 23:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel,
	Martin J. Bligh

On Wed, 10 Dec 2003 16:04:16 -0500, Rik van Riel wrote:
> > For me this discussion just confirmed that my approach fails to draw
> > much interest, either because there are better alternatives or because
> > heavy paging and medium thrashing are generally not considered
> > interesting problems.
> 
> I'm willing to take over this work if you really want
> to throw in the towel.  It has to be done, simply to
> make Linux better able to deal with load spikes.

I am willing to keep my work up if I don't have to pull this alone. As
far as thrashing is concerned, the VM changed significanly even during
the -test series and I expect that to continue once 2.6.0 is released.
It would be good to get help from the people who made those changes --
they should know their stuff best, after all.

For one, we could look at the regression in test3 which might be easier
to fix than others because the changes haven't been buried under dozens
of later kernels. Some time ago, I took some notes about how the two
patches I mentioned in an earlier message worked together to change
the pageout patterns. Is that something we could start with?

Setting up some regular regression testing for new kernels might be a
good idea, too. Otherwise it's going to be Sisyphus work. For the time
being I can continue the testing, provided the harddisk that miraculously
survived hundreds of hours of thrashing tests keeps going.

> Under light to moderate overload, a load controlled system
> will be more responsive than a thrashing system.

That I doubt. 2.4 is very responsive under light overload -- every
process is mostly in memory and ready to grab a few missing pages at any
time. Once you add load control, you have processes that are completely
evicted and stunned when they are needed. Of course it's a matter of
definition, too, so I'd go even as far as saying:

- It is light thrashing when load control has no advantage.

- It is medium thrashing when using load control is a toss-up. Probably
  better throughput, but somewhat higher latency.

- It is heavy thrashing when load control is a winner in both regards.

I just made this up. It neatly resolves all arguments about when load
control is appropriate. Yeah, so it's a circular definition. Sue me.

> Heavy overload is probably a "docter, it hurts ..." case.

That's pretty much my thinking, too. Might still be worthwhile adding
some load control if there are more people like wli's Russian guy.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 13:58                         ` Roger Luethi
  2003-12-10 17:47                           ` William Lee Irwin III
  2003-12-10 21:04                           ` Rik van Riel
@ 2003-12-10 23:30                           ` Helge Hafting
  2 siblings, 0 replies; 63+ messages in thread
From: Helge Hafting @ 2003-12-10 23:30 UTC (permalink / raw)
  To: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 02:58:29PM +0100, Roger Luethi wrote:
> 
> What goes wrong is that once you start suspending tasks, you have a
> hard time telling the interactive tasks apart from the batch load.
> This may not be much of a problem on a 10x overcommit system, because
> that's presumably quite unresponsive anyway, but it does matter a lot if
> you have an interactive system that just crossed the border to thrashing.
> 
This isn't too bad.  Lets say I use the system interavtively and the "wrong"
app suddenly is swapped out.  I notice this, and simply close
down some responsive apps that are less needed.  The system
will then notice that there's "enough" memory and allow
the app to page in again.

Helge Hafting

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 22:23                             ` Roger Luethi
@ 2003-12-11  0:12                               ` William Lee Irwin III
  0 siblings, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-11  0:12 UTC (permalink / raw)
  To: Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 11:23:55PM +0100, Roger Luethi wrote:
> Level of overcommitment? What kind of criteria is that supposed to be?
> You can have 10x overcommit and not thrash at all, if most of the memory
> is allocated and filled but never referenced again. IOW, I can't derive
> an algorithm from your handwaving <g>.

There is no handwaving; the answer is necessarily ambiguous.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 22:44                     ` Andrea Arcangeli
@ 2003-12-11  1:28                       ` William Lee Irwin III
  2003-12-11  1:32                         ` Rik van Riel
  2003-12-11 10:16                       ` Roger Luethi
  2003-12-15 23:31                       ` Andrew Morton
  2 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-11  1:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: rl, Con Kolivas, Chris Vine, Rik van Riel, linux-kernel, Martin J. Bligh

On Wed, Dec 10, 2003 at 11:05:25PM +0100, Roger Luethi wrote:
>> Also, the 2.6 core VM doesn't seem all that bad since it was introduced in
>> 2.5.27 but most of the problems I measured were introduced after 2.5.40.
>> Check out the graph I posted.

On Wed, Dec 10, 2003 at 11:44:46PM +0100, Andrea Arcangeli wrote:
> you're confusing rmap with core vm. rmap in no way can be defined as the
> core vm, rmap is just a method used by the core vm to find some
> information more efficiently at the expenses of all the fast paths
> that now have to do the rmap bookkeeping.

I've been maintaining one of the answers to this (anobjrmap, originally
from hugh). I still haven't removed page->mapcount because keeping
nr_mapped straight requires some care, though doing so should be feasible.

I could probably use some helpers to untangle it from the highpmd,
compile-time mapping->page_lock rwlock/spinlock switching, RCU
mapping->i_shared_lock, and O(1) proc_pid_statm() bits.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 23:17                             ` Roger Luethi
@ 2003-12-11  1:31                               ` Rik van Riel
  2003-12-11 10:16                                 ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: Rik van Riel @ 2003-12-11  1:31 UTC (permalink / raw)
  To: Roger Luethi
  Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel,
	Martin J. Bligh

On Thu, 11 Dec 2003, Roger Luethi wrote:

Hmmm, those definitions have changed a little from the
OS books I read ;))

> - It is light thrashing when load control has no advantage.

This used to be called "no thrashing" ;)

> - It is medium thrashing when using load control is a toss-up. Probably
>   better throughput, but somewhat higher latency.

This would be when the system load is so high that
decreasing the multiprocessing level would increase
system load, but performance would still be within
acceptable limits (say, 30% of top performance).

> - It is heavy thrashing when load control is a winner in both regards.

Heavy thrashing would be "no work gets done by the
processes in the system, nobody makes good progress".

In that case load control is needed to make the system
survive in a useful way.

> I just made this up. It neatly resolves all arguments about when load
> control is appropriate. Yeah, so it's a circular definition. Sue me.

Knowing what your definitions are has definately made it
easier for me to understand your previous mails.

Still, sticking to the textbook definitions might make it
even easier to talk about things, and compare the plans
for Linux with what's been done for other OSes.

Also, it would make the job of a load control mechanism
really easy to define:

	"Prevent the system from thrashing"

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-11  1:28                       ` William Lee Irwin III
@ 2003-12-11  1:32                         ` Rik van Riel
  0 siblings, 0 replies; 63+ messages in thread
From: Rik van Riel @ 2003-12-11  1:32 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrea Arcangeli, rl, Con Kolivas, Chris Vine, linux-kernel,
	Martin J. Bligh

On Wed, 10 Dec 2003, William Lee Irwin III wrote:

> I could probably use some helpers to untangle it from the highpmd,
> compile-time mapping->page_lock rwlock/spinlock switching, RCU
> mapping->i_shared_lock, and O(1) proc_pid_statm() bits.

Looking into it.  Your patch certainly has a lot of stuff
folded into one piece ;)

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 22:44                     ` Andrea Arcangeli
  2003-12-11  1:28                       ` William Lee Irwin III
@ 2003-12-11 10:16                       ` Roger Luethi
  2003-12-15 23:31                       ` Andrew Morton
  2 siblings, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-11 10:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Con Kolivas, Chris Vine, Rik van Riel,
	linux-kernel, Martin J. Bligh

On Wed, 10 Dec 2003 23:44:46 +0100, Andrea Arcangeli wrote:
> more processes can be optimized even better by adding unfariness.
> Either ways a significant slowdown of qsbench probably means worse core
> VM, at least if compared with 2.4 that isn't adding huge unfariness just
> to optimize qsbench.

Can you be a bit more specific about the type of unfairness? The only
instance I clearly noticed is that one process can grow its RSS at the
expense of others if they already have a high PFF. That happens more
often in 2.4 and helps a lot with some benchmarks.

I did notice, though, that after an initial slowdown, qsbench improved
during 2.5, while the compile benchmarks got even worse.

> > Also, the 2.6 core VM doesn't seem all that bad since it was introduced in
> > 2.5.27 but most of the problems I measured were introduced after 2.5.40.
> > Check out the graph I posted.
> 
> you're confusing rmap with core vm. rmap in no way can be defined as the
> core vm, rmap is just a method used by the core vm to find some

Incidentally, all these places where rmap is used by the core VM were
introduced in 2.5.27 as well. In particular vmscan.c was completely
overhauled. But apparently you suspect subsequent changes to the core
to be a problem. I am curious what they are if that can help fixing
the slowdowns I'm seeing.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-11  1:31                               ` Rik van Riel
@ 2003-12-11 10:16                                 ` Roger Luethi
  0 siblings, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-11 10:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: William Lee Irwin III, Con Kolivas, Chris Vine, linux-kernel,
	Martin J. Bligh

On Wed, 10 Dec 2003 20:31:40 -0500, Rik van Riel wrote:
> Hmmm, those definitions have changed a little from the
> OS books I read ;))
> 
> > - It is light thrashing when load control has no advantage.
> 
> This used to be called "no thrashing" ;)

Fair enough, but that was before Linux 2.6 <g>.

kbuild benchmark, execution time in seconds (median over ten runs):
 74	2.6.0-test11, 256 MB RAM
115	2.4.21, 64 MB RAM
539	2.6.0-test11, 64 MB RAM

We can call it lousy paging, that'll be fine with me.

> Also, it would make the job of a load control mechanism
> really easy to define:
> 
> 	"Prevent the system from thrashing"

"... once all other means are exhausted". Then I'll buy it.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-08 14:23             ` Con Kolivas
  2003-12-08 14:30               ` William Lee Irwin III
  2003-12-09 21:03               ` Chris Vine
@ 2003-12-13 14:08               ` Chris Vine
  2 siblings, 0 replies; 63+ messages in thread
From: Chris Vine @ 2003-12-13 14:08 UTC (permalink / raw)
  To: Con Kolivas, William Lee Irwin III
  Cc: Rik van Riel, linux-kernel, Martin J. Bligh

On Monday 08 December 2003 2:23 pm, Con Kolivas wrote:
> [snip original discussion thrashing swap on 2.6test with 32mb ram]
>
> Chris
>
> By an unusual coincidence I was looking into the patches that were supposed
> to speed up application startup and noticed this one was merged. A brief
> discussion with wli suggests this could cause thrashing problems on low
> memory boxes so can you try this patch? Applies to test11.

Con,

I have applied the patch, and performance is nearly indistinguishable from 
that with the kernel without it.

Chris.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-10 22:44                     ` Andrea Arcangeli
  2003-12-11  1:28                       ` William Lee Irwin III
  2003-12-11 10:16                       ` Roger Luethi
@ 2003-12-15 23:31                       ` Andrew Morton
  2003-12-15 23:37                         ` Andrea Arcangeli
  2 siblings, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2003-12-15 23:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: wli, kernel, chris, riel, linux-kernel, mbligh

Andrea Arcangeli <andrea@suse.de> wrote:
>
> > Uhm.. guys? I forgot to mention that earlier: qsbench as I used it was not
> > about one single process. There were four worker processes (-p 4), and my
> > load control stuff did make it run faster, so the point is moot.
> 
> more processes can be optimized even better by adding unfariness.
> Either ways a significant slowdown of qsbench probably means worse core
> VM, at least if compared with 2.4 that isn't adding huge unfariness just
> to optimize qsbench.

Single-threaded qsbench is OK on 2.6.  Last time I looked it was a little
quicker than 2.4.  It's when you go to multiple qsbench instances that
everything goes to crap.

It's interesting to watch the `top' output during the run.  In 2.4 you see
three qsbench instances have consumed 0.1 seconds CPU and the fourth has
consumed 45 seconds and then exits.

In 2.6 all four processes consume CPU at the same rate.  Really, really
slowly.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-15 23:31                       ` Andrew Morton
@ 2003-12-15 23:37                         ` Andrea Arcangeli
  2003-12-15 23:54                           ` Andrew Morton
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2003-12-15 23:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: wli, kernel, chris, riel, linux-kernel, mbligh

On Mon, Dec 15, 2003 at 03:31:22PM -0800, Andrew Morton wrote:
> Single-threaded qsbench is OK on 2.6.  Last time I looked it was a little
> quicker than 2.4.  It's when you go to multiple qsbench instances that
> everything goes to crap.
> 
> It's interesting to watch the `top' output during the run.  In 2.4 you see
> three qsbench instances have consumed 0.1 seconds CPU and the fourth has
> consumed 45 seconds and then exits.
> 
> In 2.6 all four processes consume CPU at the same rate.  Really, really
> slowly.

sounds good, so this seems only a fariness issue. 2.6 is more fair but
fariness in this case means much inferior performance.

The reason 2.4 runs faster could be a more aggressive "young" pagetable
heuristic via the swap_out clock algorithm. as soon as one program grows
a bit its rss, it will run for longer, and the longer it runs the more
pages it marks "young" during a clock scan, and the more pages it marks
young the bigger it will grow. This keeps going until it's the by far
biggest task and takes almost all available cpu. This is optimal for
performance, but not optimal for fariness. So 2.6 may be better or worse
depending if fariness payoffs or not, obviously in qsbench it doesn't
since it's not even measured.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-15 23:37                         ` Andrea Arcangeli
@ 2003-12-15 23:54                           ` Andrew Morton
  2003-12-16  0:17                             ` Rik van Riel
  2003-12-16 11:23                             ` Roger Luethi
  0 siblings, 2 replies; 63+ messages in thread
From: Andrew Morton @ 2003-12-15 23:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: wli, kernel, chris, riel, linux-kernel, mbligh

Andrea Arcangeli <andrea@suse.de> wrote:
>
> The reason 2.4 runs faster could be a more aggressive "young" pagetable
> heuristic via the swap_out clock algorithm. as soon as one program grows
> a bit its rss, it will run for longer, and the longer it runs the more
> pages it marks "young" during a clock scan, and the more pages it marks
> young the bigger it will grow. This keeps going until it's the by far
> biggest task and takes almost all available cpu. This is optimal for
> performance, but not optimal for fariness.

Sounds right.

One thing to be cautious of here is an interaction with the "human factor".
 One tends to adjust the test case so that it takes a reasonable amount of
time.  So the process is:

Run 1: took five seconds.

       "hmm, it didn't swap at all.  I'll use some more threads"

Run 2: takes 4 hours.

       "man, that sucked.  I'll use a few less threads"

Run 3: takes ten minutes.

       "ah, that's nice.  I'll use that many threads from now on".

Problem is, you have now carefully placed your test point right on the
point of a sharp knee in a big curve.  So small changes in input conditions
cause large changes in runtime.   At least, that's what I do ;)

> So 2.6 may be better or worse
> depending if fariness payoffs or not, obviously in qsbench it doesn't
> since it's not even measured.

It would be nice, but I've yet to find a workload in which 2.6 pageout
decisively wins.

It could well be that something is simply misbehaving in there and that we
can pull back significant benefits with some inspired tweaking rather than
with radical changes.  Certainly some of Roger's measurements indicate that
this is the case, although I worry that he may have tuned himself onto the
knee of the curve.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-15 23:54                           ` Andrew Morton
@ 2003-12-16  0:17                             ` Rik van Riel
  2003-12-16 11:23                             ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: Rik van Riel @ 2003-12-16  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, wli, kernel, chris, linux-kernel, mbligh

On Mon, 15 Dec 2003, Andrew Morton wrote:

> It could well be that something is simply misbehaving in there

I have my suspicions about inter-zone balancing in 2.6.

Something seems wrong, but I can't quite put my finger
on it yet.  This should have quite some impact in the
1 - 4 GB range and a test (done by somebody else, I can't
give you the details yet unfortunately) has shown there
is a problem.

I'm working on it and should come up with a patch soon.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-15 23:54                           ` Andrew Morton
  2003-12-16  0:17                             ` Rik van Riel
@ 2003-12-16 11:23                             ` Roger Luethi
  2003-12-16 16:29                               ` Rik van Riel
  2003-12-17 18:53                               ` Rik van Riel
  1 sibling, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-16 11:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, wli, kernel, chris, riel, linux-kernel, mbligh

On Mon, 15 Dec 2003 15:54:27 -0800, Andrew Morton wrote:
>  One tends to adjust the test case so that it takes a reasonable amount of
> time.  So the process is:
> 
> Run 1: took five seconds.
> 
>        "hmm, it didn't swap at all.  I'll use some more threads"
> 
> Run 2: takes 4 hours.
> 
>        "man, that sucked.  I'll use a few less threads"
> 
> Run 3: takes ten minutes.
> 
>        "ah, that's nice.  I'll use that many threads from now on".
> 
> [...]
> 
> It could well be that something is simply misbehaving in there and that we
> can pull back significant benefits with some inspired tweaking rather than
> with radical changes.  Certainly some of Roger's measurements indicate that
> this is the case, although I worry that he may have tuned himself onto the
> knee of the curve.

No worries, mate :-). The efax benchmark I run is a replica of the
case that started this thread. "make main.o" for efax with 32 MB. The
kbuild benchmark is very different as far as compile benchmarks go:
"make -j 24" for the Linux kernel with 64 MB -- the time was adjusted
not by using fewer processes but by only building a small part of the
kernel, which does not change the character of the test. As benchmarks,
efax and kbuild seem different enough to warrant the conclusion that
compiling under tight memory conditions is slow on 2.6.

The qsbench benchmarks is clearly a different type from the other two.
Improvements in qsbench coincided several times with losses for efax/kbuild
and vice versa. Exceptions exist like 2.5.65 which brought no change for
efax but big improvements for kbuild and qsbench (which was back on par
with 2.5.0 for two releases). It is at least conceivable, though, that the
damage for one type of benchmark (qsbench) was mitigated at the expense of
others.

One potential problem with the benchmarks is that my test box has
just one bar with 256 MB RAM. The kbuild and efax tests were run with
mem=64M and mem=32M, respectively. If the difference between mem=32M
and a real 32 MB machine is significant for the benchmark, the results
will be less than perfect. I plan to do some testing on a machine with
more than one memory module to get an idea of the impact, provided I
can dig up some usable hardware.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-16 11:23                             ` Roger Luethi
@ 2003-12-16 16:29                               ` Rik van Riel
  2003-12-17 11:03                                 ` Roger Luethi
  2003-12-17 18:53                               ` Rik van Riel
  1 sibling, 1 reply; 63+ messages in thread
From: Rik van Riel @ 2003-12-16 16:29 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris,
	linux-kernel, mbligh

On Tue, 16 Dec 2003, Roger Luethi wrote:

> One potential problem with the benchmarks is that my test box has
> just one bar with 256 MB RAM. The kbuild and efax tests were run with
> mem=64M and mem=32M, respectively. If the difference between mem=32M
> and a real 32 MB machine is significant for the benchmark,

Could you try "echo 0 > /proc/sys/vm/lower_zone_protection" ?

I have a feeling that the lower zone protection logic could
be badly messing up systems in the 24-48 MB range, as well
as systems in the 1.5-3 GB range.

This would be because the allocation threshold for the
lower zone would be 30% higher than the high threshold
of the pageout code, meaning that the memory in the lower
zone would be just sitting there, without old pages being
recycled by the pageout code.

In effect, your 32 MB test would have old memory in the
lower 16 MB, without the pageout code reusing that memory
for something more useful, reducing the amount of memory
the system really uses well to something way lower.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-16 16:29                               ` Rik van Riel
@ 2003-12-17 11:03                                 ` Roger Luethi
  2003-12-17 11:06                                   ` William Lee Irwin III
  2003-12-17 11:33                                   ` Rik van Riel
  0 siblings, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-17 11:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris,
	linux-kernel, mbligh

On Tue, 16 Dec 2003 11:29:50 -0500, Rik van Riel wrote:
> On Tue, 16 Dec 2003, Roger Luethi wrote:
> 
> > One potential problem with the benchmarks is that my test box has
> > just one bar with 256 MB RAM. The kbuild and efax tests were run with
> > mem=64M and mem=32M, respectively. If the difference between mem=32M
> > and a real 32 MB machine is significant for the benchmark,
> 
> Could you try "echo 0 > /proc/sys/vm/lower_zone_protection" ?

Defaults to 0 anyway, doesn't it? Turning it _on_ seems to slow
benchmarks down somewhat (< 5%). In one of ten runs, though, the efax
test stopped doing anything for ten minutes -- no disk activity, no
progress whatsoever.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 11:03                                 ` Roger Luethi
@ 2003-12-17 11:06                                   ` William Lee Irwin III
  2003-12-17 16:50                                     ` Roger Luethi
  2003-12-17 11:33                                   ` Rik van Riel
  1 sibling, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-17 11:06 UTC (permalink / raw)
  To: Rik van Riel, Andrew Morton, Andrea Arcangeli, kernel, chris,
	linux-kernel, mbligh

On Tue, 16 Dec 2003 11:29:50 -0500, Rik van Riel wrote:
>> Could you try "echo 0 > /proc/sys/vm/lower_zone_protection" ?

On Wed, Dec 17, 2003 at 12:03:37PM +0100, Roger Luethi wrote:
> Defaults to 0 anyway, doesn't it? Turning it _on_ seems to slow
> benchmarks down somewhat (< 5%). In one of ten runs, though, the efax
> test stopped doing anything for ten minutes -- no disk activity, no
> progress whatsoever.

Sorry about that, that got brought up elsewhere but not propagated out
to lkml. Hearing more about the various degradations you've identified
would be helpful.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 11:03                                 ` Roger Luethi
  2003-12-17 11:06                                   ` William Lee Irwin III
@ 2003-12-17 11:33                                   ` Rik van Riel
  1 sibling, 0 replies; 63+ messages in thread
From: Rik van Riel @ 2003-12-17 11:33 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris,
	linux-kernel, mbligh

On Wed, 17 Dec 2003, Roger Luethi wrote:

> Defaults to 0 anyway, doesn't it?

Duh, right...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 11:06                                   ` William Lee Irwin III
@ 2003-12-17 16:50                                     ` Roger Luethi
  0 siblings, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-17 16:50 UTC (permalink / raw)
  To: William Lee Irwin III, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, kernel, chris, linux-kernel, mbligh

On Wed, 17 Dec 2003 03:06:48 -0800, William Lee Irwin III wrote:
> to lkml. Hearing more about the various degradations you've identified
> would be helpful.

I'll use 2.6.0-test3 again as the example. That release brought a
slight improvement for qsbench and big slowdowns for kbuild and efax
(check the numbers I posted for details), due to two patches: "fix
kswapd throttling" (patch 1) and "decaying average of zone pressure/use
zone_pressure for page unmapping" (patch 2).

Even as late as test9 I found that reverting patches 1 and 2 changed
performance numbers for all benchmarks pretty much back to test2 level.
Reverting only patch 1 brought a partial improvement, reverting only
patch 2 none at all.

Patch 1 prevented those frequent calls to blk_congestion_wait in
balance_pgdat when enough pages were freed:

diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c	Thu Jul 17 06:09:38 2003
+++ b/mm/vmscan.c	Fri Aug  1 03:02:09 2003
@@ -930,7 +930,8 @@
 		}
 		if (all_zones_ok)
 			break;
-		blk_congestion_wait(WRITE, HZ/10);
+		if (to_free > 0)
+			blk_congestion_wait(WRITE, HZ/10);
 	}
 	return nr_pages - to_free;
 }

Unconditional blk_congestion_wait breaks (as they have been in test2 and
earlier) reduce the speed at which kswapd can free pages, making it much
more likely that memory is reclaimed by the allocator (try_to_free_pages)
because kswapd fails to keep up with demand.

Patch 2 changed distress and thus reclaim_mapped in refill_inactive_zone.
distress became less volatile -- kernels before test3 tended to consider
mapped pages only after a few iterations in balance_pgdat (i.e. with
raising priority).

To get the benefits of reverting patch 2 in test3/test9 this small
patch should suffice:

diff -u ./mm/vmscan.c ./mm/vmscan.c
--- ./mm/vmscan.c	Wed Nov 19 11:02:51 2003
+++ ./mm/vmscan.c	Wed Nov 19 23:53:06 2003
@@ -632,7 +632,7 @@
 	 * `distress' is a measure of how much trouble we're having reclaiming
 	 * pages.  0 -> no problems.  100 -> great trouble.
 	 */
-	distress = 100 >> zone->prev_priority;
+	distress = 100 >> priority;
 
 	/*
 	 * The point of this algorithm is to decide when to start reclaiming

Without patch 2 (kernel test2 and earlier), kswapd freeing is frequently
interrupted by the allocator satisfying immediate needs. With the
patch, refill is dominated by long, undisturbed sequences driven by
kswapd.

All this has little impact on qsbench because unlike the other two
benchmarks qsbench hardly ever fails to convince refill_inactive_zone to
consider mapped pages as well (thanks to an extremely high mapped_ratio).

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-16 11:23                             ` Roger Luethi
  2003-12-16 16:29                               ` Rik van Riel
@ 2003-12-17 18:53                               ` Rik van Riel
  2003-12-17 19:27                                 ` William Lee Irwin III
  2003-12-17 19:49                                 ` Roger Luethi
  1 sibling, 2 replies; 63+ messages in thread
From: Rik van Riel @ 2003-12-17 18:53 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris,
	linux-kernel, mbligh

On Tue, 16 Dec 2003, Roger Luethi wrote:

> One potential problem with the benchmarks is that my test box has
> just one bar with 256 MB RAM. The kbuild and efax tests were run with
> mem=64M and mem=32M, respectively. If the difference between mem=32M

OK, I found another difference with 2.4.

Try "echo 256 > /proc/sys/vm/min_free_kbytes", I think
that should give the same free watermarks that 2.4 has.

Using 1MB as the min free watermark for lowmem is bound
to result in more free (and less used) memory on systems
with less than 128 MB RAM ... significantly so on smaller
systems.

The fact that ZONE_HIGHMEM and ZONE_NORMAL are recycled
at very different rates could also be of influence on
some performance tests...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 18:53                               ` Rik van Riel
@ 2003-12-17 19:27                                 ` William Lee Irwin III
  2003-12-17 19:51                                   ` Rik van Riel
  2003-12-17 19:49                                 ` Roger Luethi
  1 sibling, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-17 19:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Roger Luethi, Andrew Morton, Andrea Arcangeli, kernel, chris,
	linux-kernel, mbligh

On Tue, 16 Dec 2003, Roger Luethi wrote:
>> One potential problem with the benchmarks is that my test box has
>> just one bar with 256 MB RAM. The kbuild and efax tests were run with
>> mem=64M and mem=32M, respectively. If the difference between mem=32M

On Wed, Dec 17, 2003 at 01:53:28PM -0500, Rik van Riel wrote:
> OK, I found another difference with 2.4.
> Try "echo 256 > /proc/sys/vm/min_free_kbytes", I think
> that should give the same free watermarks that 2.4 has.
> Using 1MB as the min free watermark for lowmem is bound
> to result in more free (and less used) memory on systems
> with less than 128 MB RAM ... significantly so on smaller
> systems.
> The fact that ZONE_HIGHMEM and ZONE_NORMAL are recycled
> at very different rates could also be of influence on
> some performance tests...

Limited sets of configurations may have left holes in the testing.
Upper zones much larger than lower zones basically want the things
to be unequal. It probably wants the replacement load spread
proportionally in general or some such nonsense.

-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 18:53                               ` Rik van Riel
  2003-12-17 19:27                                 ` William Lee Irwin III
@ 2003-12-17 19:49                                 ` Roger Luethi
  2003-12-17 21:41                                   ` Andrew Morton
  2003-12-17 21:41                                   ` Roger Luethi
  1 sibling, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2003-12-17 19:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris,
	linux-kernel, mbligh

On Wed, 17 Dec 2003 13:53:28 -0500, Rik van Riel wrote:
> On Tue, 16 Dec 2003, Roger Luethi wrote:
> 
> > One potential problem with the benchmarks is that my test box has
> > just one bar with 256 MB RAM. The kbuild and efax tests were run with
> > mem=64M and mem=32M, respectively. If the difference between mem=32M
> 
> OK, I found another difference with 2.4.
> 
> Try "echo 256 > /proc/sys/vm/min_free_kbytes", I think
> that should give the same free watermarks that 2.4 has.

I played around with that knob after wli posted his findings in the
"mem=16MB laptop testing" thread. IIRC tweaking min_free_kbytes didn't
help nearly as much as I had hoped. I'm running the efax benchmark
right now just to make sure. It's going to take a couple of hours,
I'll follow up with results.

FWIW akpm posted a patch to initialize min_free_kbytes depending on
available RAM which seemed to make sense but it hasn't made it into
mainline yet.

> Using 1MB as the min free watermark for lowmem is bound
> to result in more free (and less used) memory on systems
> with less than 128 MB RAM ... significantly so on smaller
> systems.

Possibly. If memory pressure is high enough, though, the allocator
ignores the watermarks. And on the other end kswapd seems to be pretty
busy anyway during the benchmarks.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 19:27                                 ` William Lee Irwin III
@ 2003-12-17 19:51                                   ` Rik van Riel
  0 siblings, 0 replies; 63+ messages in thread
From: Rik van Riel @ 2003-12-17 19:51 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Roger Luethi, Andrew Morton, Andrea Arcangeli, kernel, chris,
	linux-kernel, mbligh

On Wed, 17 Dec 2003, William Lee Irwin III wrote:

> Limited sets of configurations may have left holes in the testing.
> Upper zones much larger than lower zones basically want the things
> to be unequal. It probably wants the replacement load spread
> proportionally in general or some such nonsense.

Yeah. In some configurations 2.4-rmap takes care of this
automagically since part of the replacement isn't as
pressure driven as in 2.4 mainline and 2.6, ie. some of
the aging is done independantly of allocation pressure.

Still, inter-zone balancing is HARD to get right. I'm
currently trying to absorb all of the 2.6 VM balancing
into my mind (*sound effects of brain turning to slush*)
to find any possible imbalances.

Some of the test results I have seen make me very
suspicious...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 19:49                                 ` Roger Luethi
@ 2003-12-17 21:41                                   ` Andrew Morton
  2003-12-17 21:41                                   ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: Andrew Morton @ 2003-12-17 21:41 UTC (permalink / raw)
  To: Roger Luethi; +Cc: riel, andrea, wli, kernel, chris, linux-kernel, mbligh

Roger Luethi <rl@hellgate.ch> wrote:
>
> FWIW akpm posted a patch to initialize min_free_kbytes depending on
> available RAM which seemed to make sense but it hasn't made it into
> mainline yet.

Yup.  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test11/2.6.0-test11-mm1/broken-out/scale-min_free_kbytes.patch

Also, note that setup_per_zone_pages_min() plays games to ensure that the
highmem zone's free pages limit is small: there's not a lot of point in
keeping lots of highmem pages free.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 19:49                                 ` Roger Luethi
  2003-12-17 21:41                                   ` Andrew Morton
@ 2003-12-17 21:41                                   ` Roger Luethi
  2003-12-18  0:21                                     ` Rik van Riel
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2003-12-17 21:41 UTC (permalink / raw)
  To: Rik van Riel, Andrew Morton, Andrea Arcangeli, wli, kernel,
	chris, linux-kernel, mbligh

On Wed, 17 Dec 2003 20:49:51 +0100, Roger Luethi wrote:
> right now just to make sure. It's going to take a couple of hours,
> I'll follow up with results.

For efax, a benchmark run with mem=32M, the difference in run time
between values 256 and 1024 for /proc/sys/vm/min_free_kbytes is noise
(< 1%).

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-17 21:41                                   ` Roger Luethi
@ 2003-12-18  0:21                                     ` Rik van Riel
  2003-12-18 22:53                                       ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: Rik van Riel @ 2003-12-18  0:21 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris,
	linux-kernel, mbligh

On Wed, 17 Dec 2003, Roger Luethi wrote:
> On Wed, 17 Dec 2003 20:49:51 +0100, Roger Luethi wrote:
> > right now just to make sure. It's going to take a couple of hours,
> > I'll follow up with results.
> 
> For efax, a benchmark run with mem=32M, the difference in run time
> between values 256 and 1024 for /proc/sys/vm/min_free_kbytes is noise
> (< 1%).

OK, so I guess you're not as close to the knee
of the curve as this kind of tests tend to be ;)

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-18  0:21                                     ` Rik van Riel
@ 2003-12-18 22:53                                       ` Roger Luethi
  2003-12-18 23:38                                         ` William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2003-12-18 22:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Andrea Arcangeli, wli, kernel, chris,
	linux-kernel, mbligh

On Wed, 17 Dec 2003 19:21:52 -0500, Rik van Riel wrote:
> > For efax, a benchmark run with mem=32M, the difference in run time
> > between values 256 and 1024 for /proc/sys/vm/min_free_kbytes is noise
> > (< 1%).
> 
> OK, so I guess you're not as close to the knee
> of the curve as this kind of tests tend to be ;)

Depends on the axis in your graph. The benchmarks I am using are not
balancing on the verge of going bad, if that's what you mean. They
cut deep (30 to 100 MB) into swap through most of their run time,
and there's quite a bit of swap turnover with compiling stuff.

I also completed a best effort attempt at determining the impact of
any differences between mem= and actual RAM removal. I had to adapt
the kbuild benchmark somewhat to the available hardware. I benchmarked
with 48 MB RAM at mem=16M and again after removing 32MB of RAM. If there
was a difference in performance, it was very small for both 2.4.23 and
2.6.0-test11, with the latter taking over 2.5 times as long to complete
the benchmark.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: 2.6.0-test9 - poor swap performance on low end machines
  2003-12-18 22:53                                       ` Roger Luethi
@ 2003-12-18 23:38                                         ` William Lee Irwin III
  0 siblings, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2003-12-18 23:38 UTC (permalink / raw)
  To: rl, Rik van Riel, Andrew Morton, Andrea Arcangeli, kernel, chris,
	linux-kernel, mbligh

On Thu, Dec 18, 2003 at 11:53:25PM +0100, Roger Luethi wrote:
> Depends on the axis in your graph. The benchmarks I am using are not
> balancing on the verge of going bad, if that's what you mean. They
> cut deep (30 to 100 MB) into swap through most of their run time,
> and there's quite a bit of swap turnover with compiling stuff.
> I also completed a best effort attempt at determining the impact of
> any differences between mem= and actual RAM removal. I had to adapt
> the kbuild benchmark somewhat to the available hardware. I benchmarked
> with 48 MB RAM at mem=16M and again after removing 32MB of RAM. If there
> was a difference in performance, it was very small for both 2.4.23 and
> 2.6.0-test11, with the latter taking over 2.5 times as long to complete
> the benchmark.

A bogon was recently fixed in 2.6 that caused the results to differ.
They should not differ.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2003-12-18 23:39 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-10-29 22:30 2.6.0-test9 - poor swap performance on low end machines Chris Vine
2003-10-31  3:57 ` Rik van Riel
2003-10-31 11:26   ` Roger Luethi
2003-10-31 12:37     ` Con Kolivas
2003-10-31 12:59       ` Roger Luethi
2003-10-31 12:55     ` Ed Tomlinson
2003-11-01 18:34       ` Pasi Savolainen
2003-11-06 18:40       ` bill davidsen
2003-10-31 21:52   ` Chris Vine
2003-11-02 23:06   ` Chris Vine
2003-11-03  0:48     ` Con Kolivas
2003-11-03 21:13       ` Chris Vine
2003-11-04  2:55         ` Con Kolivas
2003-11-04 22:08           ` Chris Vine
2003-11-04 22:30             ` Con Kolivas
2003-12-08 13:52           ` William Lee Irwin III
2003-12-08 14:23             ` Con Kolivas
2003-12-08 14:30               ` William Lee Irwin III
2003-12-09 21:03               ` Chris Vine
2003-12-13 14:08               ` Chris Vine
2003-12-08 19:49             ` Roger Luethi
2003-12-08 20:48               ` William Lee Irwin III
2003-12-09  0:27                 ` Roger Luethi
2003-12-09  4:05                   ` William Lee Irwin III
2003-12-09 15:11                     ` Roger Luethi
2003-12-09 16:04                       ` Rik van Riel
2003-12-09 16:31                         ` Roger Luethi
2003-12-09 18:31                       ` William Lee Irwin III
2003-12-09 19:38                       ` William Lee Irwin III
2003-12-10 13:58                         ` Roger Luethi
2003-12-10 17:47                           ` William Lee Irwin III
2003-12-10 22:23                             ` Roger Luethi
2003-12-11  0:12                               ` William Lee Irwin III
2003-12-10 21:04                           ` Rik van Riel
2003-12-10 23:17                             ` Roger Luethi
2003-12-11  1:31                               ` Rik van Riel
2003-12-11 10:16                                 ` Roger Luethi
2003-12-10 23:30                           ` Helge Hafting
2003-12-10 21:52                 ` Andrea Arcangeli
2003-12-10 22:05                   ` Roger Luethi
2003-12-10 22:44                     ` Andrea Arcangeli
2003-12-11  1:28                       ` William Lee Irwin III
2003-12-11  1:32                         ` Rik van Riel
2003-12-11 10:16                       ` Roger Luethi
2003-12-15 23:31                       ` Andrew Morton
2003-12-15 23:37                         ` Andrea Arcangeli
2003-12-15 23:54                           ` Andrew Morton
2003-12-16  0:17                             ` Rik van Riel
2003-12-16 11:23                             ` Roger Luethi
2003-12-16 16:29                               ` Rik van Riel
2003-12-17 11:03                                 ` Roger Luethi
2003-12-17 11:06                                   ` William Lee Irwin III
2003-12-17 16:50                                     ` Roger Luethi
2003-12-17 11:33                                   ` Rik van Riel
2003-12-17 18:53                               ` Rik van Riel
2003-12-17 19:27                                 ` William Lee Irwin III
2003-12-17 19:51                                   ` Rik van Riel
2003-12-17 19:49                                 ` Roger Luethi
2003-12-17 21:41                                   ` Andrew Morton
2003-12-17 21:41                                   ` Roger Luethi
2003-12-18  0:21                                     ` Rik van Riel
2003-12-18 22:53                                       ` Roger Luethi
2003-12-18 23:38                                         ` William Lee Irwin III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).