linux-kernel.vger.kernel.org archive mirror
* Re: 2.4.16 & OOM killer screw up (fwd)
@ 2001-12-10 19:08 Marcelo Tosatti
  2001-12-10 20:47 ` Andrew Morton
  2001-12-11  0:43 ` Andrea Arcangeli
  0 siblings, 2 replies; 43+ messages in thread
From: Marcelo Tosatti @ 2001-12-10 19:08 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: lkml


Andrea, 

Could you please start looking at any 2.4 VM issues which show up ?

Just please make sure that when sending a fix for something, send me _one_
problem and a patch which fixes _that_ problem.

I'm tempted to look at VM, but I think I'll spend my limited time in a
better way if I review other people's work instead.

---------- Forwarded message ----------
Date: Mon, 10 Dec 2001 16:46:02 -0200 (BRST)
From: Marcelo Tosatti <marcelo@conectiva.com.br>
To: Abraham vd Merwe <abraham@2d3d.co.za>
Cc: Linux Kernel Development <linux-kernel@vger.kernel.org>
Subject: Re: 2.4.16 & OOM killer screw up



On Mon, 10 Dec 2001, Abraham vd Merwe wrote:

> Hi!
> 
> If I leave my machine on for a day or two without doing anything on it (e.g.
> my machine at work over a weekend) and I come back then 1) all my memory is
> used for buffers/caches and when I try running an application, the OOM killer
> kicks in, tries to allocate swap space (which I don't have) and kills
> whatever I try to start (that's with 300M+ memory in buffers/caches).

Abraham, 

I'll take a look at this issue as soon as pre8 is released. 




* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-10 20:47 ` Andrew Morton
@ 2001-12-10 19:42   ` Marcelo Tosatti
  2001-12-11  0:11   ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Marcelo Tosatti @ 2001-12-10 19:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, lkml



On Mon, 10 Dec 2001, Andrew Morton wrote:

> Marcelo Tosatti wrote:
> > 
> > Andrea,
> > 
> > Could you please start looking at any 2.4 VM issues which show up ?
> > 
> 
> Just fwiw, I did some testing on this yesterday.
> 
> Buffers and cache data are sitting on the active list, and shrink_caches()
> is *not* getting them off the active list, and onto the inactive list
> where they can be freed.
> 
> So we end up with enormous amounts of anon memory on the inactive
> list, so this code:
> 
>         /* try to keep the active list 2/3 of the size of the cache */
>         ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
>         refill_inactive(ratio);
> 
> just calls refill_inactive(0) all the time.  Nothing gets moved
> onto the inactive list - it remains full of unfreeable anon
> allocations.  And with no swap, there's nowhere to go.
> 
> I think a little fix is to add
> 
>         if (ratio < nr_pages)
>                 ratio = nr_pages;
> 
> so we at least move *something* onto the inactive list.
> 
> Also refill_inactive needs to be changed so that it counts
> the number of pages which it actually moved, rather than
> the number of pages which it inspected.
> 
> In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> So we're madly trying to swap pages out and finding that there's no swap
> space.  I believe that when we find there's no swap left we should move
> the page onto the active list so we don't keep rescanning it pointlessly.
> 
> A fix may be to just remove the use-once stuff.  It is one of the
> sources of this problem, because it's overpopulating the inactive list.
> 
> In my testing last night, I tried to allocate 650 megs on a 768 meg
> swapless box.  Got oom-killed when there was almost 100 megs of freeable
> memory: half buffercache, half filecache.  Presumably, all of it was
> stuck on the active list with no way to get off.
> 
> We also need to do something about shrink_[di]cache_memory(),
> which seem to be called in the wrong place.
> 
> There's also the report concerning modify_ldt() failure in a
> similar situation.  I'm not sure why this one occurred.  It
> vmallocs 64k of memory and that seems to fail.

I haven't applied the modify_ldt() patch because I want to make sure it's
needed: it may just be a bad effect of this one bug. 

> I did some similar testing a week or so ago, also tested
> the -aa patches.  They seemed to maybe help a tiny bit,
> but not significantly.




* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-10 19:08 2.4.16 & OOM killer screw up (fwd) Marcelo Tosatti
@ 2001-12-10 20:47 ` Andrew Morton
  2001-12-10 19:42   ` Marcelo Tosatti
  2001-12-11  0:11   ` Andrea Arcangeli
  2001-12-11  0:43 ` Andrea Arcangeli
  1 sibling, 2 replies; 43+ messages in thread
From: Andrew Morton @ 2001-12-10 20:47 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrea Arcangeli, lkml

Marcelo Tosatti wrote:
> 
> Andrea,
> 
> Could you please start looking at any 2.4 VM issues which show up ?
> 

Just fwiw, I did some testing on this yesterday.

Buffers and cache data are sitting on the active list, and shrink_caches()
is *not* getting them off the active list, and onto the inactive list
where they can be freed.

So we end up with enormous amounts of anon memory on the inactive
list, so this code:

        /* try to keep the active list 2/3 of the size of the cache */
        ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
        refill_inactive(ratio);

just calls refill_inactive(0) all the time.  Nothing gets moved
onto the inactive list - it remains full of unfreeable anon
allocations.  And with no swap, there's nowhere to go.
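
To illustrate with made-up but representative numbers: once the
inactive list has grown far past the active list, the integer division
truncates to zero:

        /* illustrative values, not from a real trace */
        nr_pages = 32;
        nr_active_pages = 5000;
        nr_inactive_pages = 150000;
        ratio = 32UL * 5000 / ((150000 + 1) * 2);   /* 160000/300002 = 0 */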

I think a little fix is to add

        if (ratio < nr_pages)
                ratio = nr_pages;

so we at least move *something* onto the inactive list.

Also refill_inactive needs to be changed so that it counts
the number of pages which it actually moved, rather than
the number of pages which it inspected.

In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
So we're madly trying to swap pages out and finding that there's no swap
space.  I believe that when we find there's no swap left we should move
the page onto the active list so we don't keep rescanning it pointlessly.
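
Something like this, perhaps (an untested sketch against 2.4.16's
try_to_swap_out(); activate_page() is the existing helper in mm/swap.c):

        entry = get_swap_page();
        if (!entry.val) {
                /* out of swap: put the page back on the active list so
                 * we don't keep rescanning it on every pass */
                activate_page(page);
                goto out_unlock_restore;
        }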

A fix may be to just remove the use-once stuff.  It is one of the
sources of this problem, because it's overpopulating the inactive list.

In my testing last night, I tried to allocate 650 megs on a 768 meg
swapless box.  Got oom-killed when there was almost 100 megs of freeable
memory: half buffercache, half filecache.  Presumably, all of it was
stuck on the active list with no way to get off.

We also need to do something about shrink_[di]cache_memory(),
which seem to be called in the wrong place.

There's also the report concerning modify_ldt() failure in a
similar situation.  I'm not sure why this one occurred.  It
vmallocs 64k of memory and that seems to fail.
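
From memory, the failing allocation is the lazy LDT setup in 2.4's
write_ldt() (arch/i386/kernel/ldt.c), roughly:

        /* LDT_ENTRIES * LDT_ENTRY_SIZE = 8192 * 8 = 64k, vmalloc'ed on
         * first modify_ldt(); when it fails userspace sees -ENOMEM */
        if (!mm->context.segments) {
                void *segments = vmalloc(LDT_ENTRIES*LDT_ENTRY_SIZE);
                if (!segments)
                        return -ENOMEM;
                /* ... initialise and install the new LDT ... */
        }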

I did some similar testing a week or so ago, also tested
the -aa patches.  They seemed to maybe help a tiny bit,
but not significantly.



* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-10 20:47 ` Andrew Morton
  2001-12-10 19:42   ` Marcelo Tosatti
@ 2001-12-11  0:11   ` Andrea Arcangeli
  2001-12-11  7:07     ` Andrew Morton
  1 sibling, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11  0:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marcelo Tosatti, lkml

On Mon, Dec 10, 2001 at 12:47:55PM -0800, Andrew Morton wrote:
> Marcelo Tosatti wrote:
> > 
> > Andrea,
> > 
> > Could you please start looking at any 2.4 VM issues which show up ?
> > 
> 
> Just fwiw, I did some testing on this yesterday.
> 
> Buffers and cache data are sitting on the active list, and shrink_caches()
> is *not* getting them off the active list, and onto the inactive list
> where they can be freed.

please check 2.4.17pre4aa1, see the per-classzone info; it will
prevent all the problems with refill_inactive on highmem.

> 
> So we end up with enormous amounts of anon memory on the inactive
> list, so this code:
> 
>         /* try to keep the active list 2/3 of the size of the cache */
>         ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
>         refill_inactive(ratio);
> 
> just calls refill_inactive(0) all the time.  Nothing gets moved
> onto the inactive list - it remains full of unfreeable anon
> allocations.  And with no swap, there's nowhere to go.
> 
> I think a little fix is to add
> 
>         if (ratio < nr_pages)
>                 ratio = nr_pages;
> 
> so we at least move *something* onto the inactive list.
> 
> Also refill_inactive needs to be changed so that it counts
> the number of pages which it actually moved, rather than
> the number of pages which it inspected.

done ages ago here.

> 
> In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> So we're madly trying to swap pages out and finding that there's no swap
> space.  I believe that when we find there's no swap left we should move
> the page onto the active list so we don't keep rescanning it pointlessly.

yes, however I think the swap-flood with no swap isn't a very
interesting case to optimize.

> 
> A fix may be to just remove the use-once stuff.  It is one of the
> sources of this problem, because it's overpopulating the inactive list.
> 
> In my testing last night, I tried to allocate 650 megs on a 768 meg
> swapless box.  Got oom-killed when there was almost 100 megs of freeable
> memory: half buffercache, half filecache.  Presumably, all of it was
> stuck on the active list with no way to get off.
> 
> We also need to do something about shrink_[di]cache_memory(),
> which seem to be called in the wrong place.
> 
> There's also the report concerning modify_ldt() failure in a
> similar situation.  I'm not sure why this one occurred.  It
> vmallocs 64k of memory and that seems to fail.

dunno about this modify_ldt failure.

> 
> I did some similar testing a week or so ago, also tested
> the -aa patches.  They seemed to maybe help a tiny bit,
> but not significantly.

I don't have any pending bug reports. AFAIK those bugs are only in
mainline. If you can reproduce with -aa please send me a bug report.
thanks,

Andrea


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-10 19:08 2.4.16 & OOM killer screw up (fwd) Marcelo Tosatti
  2001-12-10 20:47 ` Andrew Morton
@ 2001-12-11  0:43 ` Andrea Arcangeli
  2001-12-11 15:46   ` Luigi Genoni
  2001-12-12 22:05   ` Ken Brownfield
  1 sibling, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11  0:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: lkml

On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
> 
> Andrea, 
> 
> Could you please start looking at any 2.4 VM issues which show up ?

well, as far as I can tell no VM bugs should be present in my latest -aa, so
I think I'm finished. At the very least I know people are using 2.4.15aa1
and 2.4.17pre1aa1 in production on multigigabyte boxes under heavy VM
load and I haven't got any bug reports back yet.

> 
> Just please make sure that when sending a fix for something, send me _one_
> problem and a patch which fixes _that_ problem.

I will split something out for you soon; at the moment I am doing some
further benchmarking.

> 
> I'm tempted to look at VM, but I think I'll spend my limited time in a
> better way if I review other people's work instead.

until I split something out, you can see all the VM-related changes in
the 10_vm-* patches in my ftp area.

Andrea


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  0:11   ` Andrea Arcangeli
@ 2001-12-11  7:07     ` Andrew Morton
  2001-12-11 13:32       ` Rik van Riel
  2001-12-11 13:42       ` Andrea Arcangeli
  0 siblings, 2 replies; 43+ messages in thread
From: Andrew Morton @ 2001-12-11  7:07 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Marcelo Tosatti, lkml

Andrea Arcangeli wrote:
> 
> On Mon, Dec 10, 2001 at 12:47:55PM -0800, Andrew Morton wrote:
> > Marcelo Tosatti wrote:
> > >
> > > Andrea,
> > >
> > > Could you please start looking at any 2.4 VM issues which show up ?
> > >
> >
> > Just fwiw, I did some testing on this yesterday.
> >
> > Buffers and cache data are sitting on the active list, and shrink_caches()
> > is *not* getting them off the active list, and onto the inactive list
> > where they can be freed.
> 
> please check 2.4.17pre4aa1, see the per-classzone info; it will
> prevent all the problems with refill_inactive on highmem.

This is not highmem-related.  But the latest -aa patch does
appear to have fixed this bug.  Stale memory is no longer being
left on the active list, and all buffercache memory is being reclaimed
before the oom-killer kicks in (swapless case).

Also (and this is in fact the same problem), the patched kernel
has less tendency to push in-use memory out to swap while leaving
tens of megs of old memory on the active list.  This is all good.

Which of your changes has caused this?

Could you please separate this out into one or more specific patches for
the 2.4.17 series?





Why does this code exist at the end of refill_inactive()?

        if (entry != &active_list) {
                list_del(&active_list);
                list_add(&active_list, entry);
        }





This test on a 64 megabyte machine, on ext2:

	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)

On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.

This is probably due to the write scheduling changes in fs/buffer.c.
This chunk especially will, under some conditions, cause bdflush
to madly spin in a loop unplugging all the disk queues:

@@ -2787,7 +2795,7 @@
 
                spin_lock(&lru_list_lock);
                if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
-                       wait_for_some_buffers(NODEV);
+                       run_task_queue(&tq_disk);
                        interruptible_sleep_on(&bdflush_wait);
                }
        }

Why did you make this change?





Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
dual x86:

-aa:					4 minutes 20 seconds
2.4.17-pre8				4 minutes 8 seconds
2.4.17-pre8 plus the below patch:	3 minutes 55 seconds

Now it could be that this performance regression is due to the
write merging mistake in fs/buffer.c.  But with so much unrelated
material in the same patch it's hard to pinpoint the source.



--- linux-2.4.17-pre8/mm/vmscan.c	Thu Nov 22 23:02:59 2001
+++ linux-akpm/mm/vmscan.c	Mon Dec 10 22:34:18 2001
@@ -537,7 +537,7 @@ static void refill_inactive(int nr_pages
 
 	spin_lock(&pagemap_lru_lock);
 	entry = active_list.prev;
-	while (nr_pages-- && entry != &active_list) {
+	while (nr_pages && entry != &active_list) {
 		struct page * page;
 
 		page = list_entry(entry, struct page, lru);
@@ -551,6 +551,7 @@ static void refill_inactive(int nr_pages
 		del_page_from_active_list(page);
 		add_page_to_inactive_list(page);
 		SetPageReferenced(page);
+		nr_pages--;
 	}
 	spin_unlock(&pagemap_lru_lock);
 }
@@ -561,6 +562,12 @@ static int shrink_caches(zone_t * classz
 	int chunk_size = nr_pages;
 	unsigned long ratio;
 
+	shrink_dcache_memory(priority, gfp_mask);
+	shrink_icache_memory(priority, gfp_mask);
+#ifdef CONFIG_QUOTA
+	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
+#endif
+
 	nr_pages -= kmem_cache_reap(gfp_mask);
 	if (nr_pages <= 0)
 		return 0;
@@ -568,17 +575,13 @@ static int shrink_caches(zone_t * classz
 	nr_pages = chunk_size;
 	/* try to keep the active list 2/3 of the size of the cache */
 	ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
+	if (ratio == 0)
+		ratio = nr_pages;
 	refill_inactive(ratio);
 
 	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
 	if (nr_pages <= 0)
 		return 0;
-
-	shrink_dcache_memory(priority, gfp_mask);
-	shrink_icache_memory(priority, gfp_mask);
-#ifdef CONFIG_QUOTA
-	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
-#endif
 
 	return nr_pages;
 }

> ...
> 
> >
> > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > So we're madly trying to swap pages out and finding that there's no swap
> > space.  I believe that when we find there's no swap left we should move
> > the page onto the active list so we don't keep rescanning it pointlessly.
> 
> yes, however I think the swap-flood with no swap isn't a very
> interesting case to optimize.

Running swapless is a valid configuration, and the kernel is doing
great amounts of pointless work.  I would expect a diskless workstation
to suffer from this.  The problem remains in latest -aa.  It would be
useful to find a fix.
 
> 
> I don't have any pending bug report. AFIK those bugs are only in
> mainline. If you can reproduce with -aa please send me a bug report.
> thanks,

Bugs which are only fixed in -aa aren't much use to anyone.

The VM code lacks comments, and nobody except yourself understands
what it is supposed to be doing.  That's a bug, don't you think?



* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  7:07     ` Andrew Morton
@ 2001-12-11 13:32       ` Rik van Riel
  2001-12-11 13:46         ` Andrea Arcangeli
  2001-12-11 13:42       ` Andrea Arcangeli
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2001-12-11 13:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Marcelo Tosatti, lkml

On Mon, 10 Dec 2001, Andrew Morton wrote:

> This test on a 64 megabyte machine, on ext2:
>
> 	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
>
> On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.

> Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> dual x86:
>
> -aa:					4 minutes 20 seconds
> 2.4.17-pre8				4 minutes 8 seconds
> 2.4.17-pre8 plus the below patch:	3 minutes 55 seconds


Andrea, it seems -aa is not the holy grail VM-wise. If you want
to merge your good stuff with Marcelo, please do it in the
"one patch with explanation per problem" style Marcelo asked for.

If nothing happens I'll take my chainsaw and remove the whole
use-once stuff just so 2.4 will avoid the worst cases, even if
it happens to remove some of the nice stuff you've been working
on.

regards,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/



* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  7:07     ` Andrew Morton
  2001-12-11 13:32       ` Rik van Riel
@ 2001-12-11 13:42       ` Andrea Arcangeli
  2001-12-11 13:59         ` Rik van Riel
                           ` (3 more replies)
  1 sibling, 4 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11 13:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marcelo Tosatti, lkml

On Mon, Dec 10, 2001 at 11:07:31PM -0800, Andrew Morton wrote:
> Why does this code exist at the end of refill_inactive()?
> 
>         if (entry != &active_list) {
>                 list_del(&active_list);
>                 list_add(&active_list, entry);
>         }

so that we restart next time at the point where we stopped browsing the
active list.
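
In other words (the same code, annotated):

        /* splice the list head back in just before the last entry we
         * examined, so the next scan of active_list resumes where this
         * one stopped instead of re-examining the same tail pages */
        if (entry != &active_list) {
                list_del(&active_list);
                list_add(&active_list, entry);
        }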

> This test on a 64 megabyte machine, on ext2:
> 
> 	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> 
> On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.
> 
> This is probably due to the write scheduling changes in fs/buffer.c.

yes, I also lowered the percentage of dirty memory in the system by
default, so that a write flood is less likely to stall the system.

Plus I made the elevator more latency-oriented rather than
throughput-oriented. Did you also test how responsive the system was
during the test?
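
For reference, the 2.4 elevator's latency bounds are runtime-tunable
with elvtune(8), so the two setups can be compared directly; something
like (values illustrative):

        # show the current read/write latency bounds
        elvtune /dev/hda
        # lower the bounds to favour latency over throughput
        elvtune -r 1024 -w 2048 /dev/hda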

Do you remember the thread about a 'tar xzf' hanging the machine? It
doesn't hang with -aa, but of course you'll run slower if it has to do
seeks.

> This chunk especially will, under some conditions, cause bdflush
> to madly spin in a loop unplugging all the disk queues:
> 
> @@ -2787,7 +2795,7 @@
>  
>                 spin_lock(&lru_list_lock);
>                 if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
> -                       wait_for_some_buffers(NODEV);
> +                       run_task_queue(&tq_disk);
>                         interruptible_sleep_on(&bdflush_wait);
>                 }
>         }
> 
> Why did you make this change?

to make bdflush spin less badly in a loop unplugging all the disk
queues.

We need to unplug only once, to submit the I/O, but we don't need to
wait on every single buffer that we previously wrote. Note that
run_task_queue() has nothing to do with wait_on_buffer(); the above
should be much better in terms of "spinning in a loop unplugging all
the disk queues". It will do it only once at least.

In fact all the wait_for_some_buffers() calls are broken (particularly
the one in balance_dirty()); they're not necessary and can only slow
down the machine.

The only reason would be to refile the buffers into the clean list, but
nothing else. That's a total waste of I/O pipelining. And yes, that's
something to fix too.
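
For clarity, the patched bdflush loop then reads roughly like this
(loop body from memory, comments added):

        for (;;) {
                CHECK_EMERGENCY_SYNC

                spin_lock(&lru_list_lock);
                if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
                        /* unplug once to start the I/O just queued ... */
                        run_task_queue(&tq_disk);
                        /* ... then sleep until woken, instead of doing a
                         * wait_on_buffer() on each buffer just written */
                        interruptible_sleep_on(&bdflush_wait);
                }
        }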

> Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> dual x86:
> 
> -aa:					4 minutes 20 seconds
> 2.4.17-pre8				4 minutes 8 seconds
> 2.4.17-pre8 plus the below patch:	3 minutes 55 seconds
> 
> Now it could be that this performance regression is due to the
> write merging mistake in fs/buffer.c.  But with so much unrelated
> material in the same patch it's hard to pinpoint the source.
> 
> 
> 
> --- linux-2.4.17-pre8/mm/vmscan.c	Thu Nov 22 23:02:59 2001
> +++ linux-akpm/mm/vmscan.c	Mon Dec 10 22:34:18 2001
> @@ -537,7 +537,7 @@ static void refill_inactive(int nr_pages
>  
>  	spin_lock(&pagemap_lru_lock);
>  	entry = active_list.prev;
> -	while (nr_pages-- && entry != &active_list) {
> +	while (nr_pages && entry != &active_list) {
>  		struct page * page;
>  
>  		page = list_entry(entry, struct page, lru);
> @@ -551,6 +551,7 @@ static void refill_inactive(int nr_pages
>  		del_page_from_active_list(page);
>  		add_page_to_inactive_list(page);
>  		SetPageReferenced(page);
> +		nr_pages--;
>  	}
>  	spin_unlock(&pagemap_lru_lock);
>  }
> @@ -561,6 +562,12 @@ static int shrink_caches(zone_t * classz
>  	int chunk_size = nr_pages;
>  	unsigned long ratio;
>  
> +	shrink_dcache_memory(priority, gfp_mask);
> +	shrink_icache_memory(priority, gfp_mask);
> +#ifdef CONFIG_QUOTA
> +	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
> +#endif
> +
>  	nr_pages -= kmem_cache_reap(gfp_mask);
>  	if (nr_pages <= 0)
>  		return 0;
> @@ -568,17 +575,13 @@ static int shrink_caches(zone_t * classz
>  	nr_pages = chunk_size;
>  	/* try to keep the active list 2/3 of the size of the cache */
>  	ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
> +	if (ratio == 0)
> +		ratio = nr_pages;
>  	refill_inactive(ratio);
>  
>  	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
>  	if (nr_pages <= 0)
>  		return 0;
> -
> -	shrink_dcache_memory(priority, gfp_mask);
> -	shrink_icache_memory(priority, gfp_mask);
> -#ifdef CONFIG_QUOTA
> -	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
> -#endif
>  
>  	return nr_pages;
>  }

it should be simple: mainline swaps out more, so it's less likely to
throw away some useful cache.

just try -aa after a:

	echo 10 >/proc/sys/vm/vm_mapped_ratio

it should swap out more and better preserve the cache.

> > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > So we're madly trying to swap pages out and finding that there's no swap
> > > space.  I believe that when we find there's no swap left we should move
> > > the page onto the active list so we don't keep rescanning it pointlessly.
> > 
> > yes, however I think the swap-flood with no swap isn't a very
> > interesting case to optimize.
> 
> Running swapless is a valid configuration, and the kernel is doing

I'm not saying it's not valid or not interesting.

It's the mix "I'm running out of memory and I'm swapless" that is the
case that's not interesting to optimize.

If you're swapless it means you have enough memory and that you're not
running out of swap. Otherwise _you_ (not the kernel) are wrong in not
having swap.

> great amounts of pointless work.  I would expect a diskless workstation
> to suffer from this.  The problem remains in latest -aa.  It would be
> useful to find a fix.

It can be optimized by making the other cases slower. I believe that if
swap_out is called heavily in a swapless configuration, either some
other part of the kernel or the user is wrong, not swap_out. So it's at
least not obvious to me that it would be useful to fix it inside
swap_out.

> I don't have any pending bug reports. AFAIK those bugs are only in
> mainline. If you can reproduce with -aa please send me a bug report.
> > thanks,
> 
> Bugs which are only fixed in -aa aren't much use to anyone.

Then there are no other bugs, that's fine; this is why I said I'm
finished (except for the minor performance work, like the buffer
flushing in buffer.c that certainly cannot affect stability, or the
swap-triggering etc., all minor things that don't affect stability and
where there's no perfect solution anyway).

> The VM code lacks comments, and nobody except yourself understands
> what it is supposed to be doing.  That's a bug, don't you think?

Lack of documentation is not a bug, period. Also it's not true that I'm
the only one who understands it. For instance Linus understands it
completely, I am 100% sure.

Anyway, I wrote a dozen slides on the VM with some graphs showing the
design of the VM, in case anybody can learn better from a slide than
from the code.

I believe the slides are useful to understand the design, but if you
want to change one line of code, slides or not, you have to read the code.
Everybody is complaining about documentation. This is a red herring.
There's no documentation that allows you to hack the previous VM code.
I'd ask how many of the people happy with the previous documentation
were actually VM developers. Except for some possibly misleading
comments in the current code that we may not have updated yet, I don't
think there's been a regression in documentation.

Andrea


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:32       ` Rik van Riel
@ 2001-12-11 13:46         ` Andrea Arcangeli
  2001-12-12  8:44           ` Andrew Morton
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11 13:46 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Marcelo Tosatti, lkml

On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> On Mon, 10 Dec 2001, Andrew Morton wrote:
> 
> > This test on a 64 megabyte machine, on ext2:
> >
> > 	time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> >
> > On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.
> 
> > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > dual x86:
> >
> > -aa:					4 minutes 20 seconds
> > 2.4.17-pre8				4 minutes 8 seconds
> > 2.4.17-pre8 plus the below patch:	3 minutes 55 seconds
> 
> 
> Andrea, it seems -aa is not the holy grail VM-wise. If you want

it may not be a holy grail in swap benchmarks and floods of writes to
disk, those are minor performance regressions, but I have not one single
bug report related to "stability".

The only thing I got back from Andrew has been "it runs a little slower"
in those two tests.

and of course he didn't even attempt to benchmark the interactive
feel, which was the _whole_ point of my buffer.c and elevator changes.

So as far as I'm concerned 2.4.15aa1 and 2.4.17pre?aa? are just rock
solid and usable in production.

We'll keep doing background benchmarking and changes that cannot
affect stability, but the core design is finished as far as I can tell.

Andrea


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:42       ` Andrea Arcangeli
@ 2001-12-11 13:59         ` Rik van Riel
  2001-12-11 14:23           ` Andrea Arcangeli
  2001-12-11 13:59         ` Abraham vd Merwe
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2001-12-11 13:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Marcelo Tosatti, lkml

On Tue, 11 Dec 2001, Andrea Arcangeli wrote:

> > The VM code lacks comments, and nobody except yourself understands
> > what it is supposed to be doing.  That's a bug, don't you think?
>
> Lack of documentation is not a bug, period. Also it's not true that
> I'm the only one who understands it.

Without documentation, you can only know what the code
does, never what it is supposed to do or why it does it.

This makes fixing problems a lot harder, especially since
people will never agree on what a piece of code is supposed
to do.

regards,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/



* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:42       ` Andrea Arcangeli
  2001-12-11 13:59         ` Rik van Riel
@ 2001-12-11 13:59         ` Abraham vd Merwe
  2001-12-11 14:01           ` Andrea Arcangeli
  2001-12-11 15:47         ` Henning P. Schmiedehausen
  2001-12-12  8:39         ` Andrew Morton
  3 siblings, 1 reply; 43+ messages in thread
From: Abraham vd Merwe @ 2001-12-11 13:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux Kernel Development


Hi Andrea!

> > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > > So we're madly trying to swap pages out and finding that there's no swap
> > > > space.  I believe that when we find there's no swap left we should move
> > > > the page onto the active list so we don't keep rescanning it pointlessly.
> > > 
> > > yes, however I think the swap-flood with no swap isn't a very
> > > interesting case to optimize.
> > 
> > Running swapless is a valid configuration, and the kernel is doing
> 
> I'm not saying it's not valid or not interesting.
> 
> It's the mix "I'm running out of memory and I'm swapless" that is the
> case that's not interesting to optimize.
> 
> If you're swapless it means you have enough memory and that you're not
> running out of swap. Otherwise _you_ (not the kernel) are wrong in not
> having swap.

The problem is that your VM is unnecessarily eating up memory and then wants
swap. That is unacceptable. Having 90% of your memory in buffers/cache and
then having the OOM killer kick in because nothing is free is what we're
moaning about.

-- 

Regards
 Abraham

Did you hear about the model who sat on a broken bottle and cut a nice figure?

__________________________________________________________
 Abraham vd Merwe - 2d3D, Inc.

 Device Driver Development, Outsourcing, Embedded Systems

  Cell: +27 82 565 4451         Snailmail:
   Tel: +27 21 761 7549            Block C, Antree Park
   Fax: +27 21 761 7648            Doncaster Road
 Email: abraham@2d3d.co.za         Kenilworth, 7700
  Http: http://www.2d3d.com        South Africa




* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:59         ` Abraham vd Merwe
@ 2001-12-11 14:01           ` Andrea Arcangeli
  2001-12-11 17:30             ` Leigh Orf
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11 14:01 UTC (permalink / raw)
  To: Abraham vd Merwe, Linux Kernel Development

On Tue, Dec 11, 2001 at 03:59:22PM +0200, Abraham vd Merwe wrote:
> Hi Andrea!
> 
> > > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > > > So we're madly trying to swap pages out and finding that there's no swap
> > > > > space.  I believe that when we find there's no swap left we should move
> > > > > the page onto the active list so we don't keep rescanning it pointlessly.
> > > > 
> > > > yes, however I think the swap-flood with no swap isn't a very
> > > > interesting case to optimize.
> > > 
> > > Running swapless is a valid configuration, and the kernel is doing
> > 
> > I'm not saying it's not valid or not interesting.
> > 
> > It's the mix "I'm running out of memory and I'm swapless" that is the
> > case that's not interesting to optimize.
> > 
> > If you're swapless it means you have enough memory and that you're not
> > running out of swap. Otherwise _you_ (not the kernel) are wrong in not
> > having swap.
> 
> The problem is that your VM is unnecessarily eating up memory and then wants
> swap. That is unacceptable. Having 90% of your memory in buffers/cache and
> then having the OOM killer kick in because nothing is free is what we're
> moaning about.

Dear Abraham, please apply this patch:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17pre4aa1.bz2

on top of a 2.4.17pre4 and then recompile, try again and send me a
bug report if you can reproduce. thanks,

Andrea


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:59         ` Rik van Riel
@ 2001-12-11 14:23           ` Andrea Arcangeli
  2001-12-11 15:27             ` Daniel Phillips
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-11 14:23 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Marcelo Tosatti, lkml

On Tue, Dec 11, 2001 at 11:59:06AM -0200, Rik van Riel wrote:
> On Tue, 11 Dec 2001, Andrea Arcangeli wrote:
> 
> > > The VM code lacks comments, and nobody except yourself understands
> > > what it is supposed to be doing.  That's a bug, don't you think?
> >
> > Lack of documentation is not a bug, period. Also it's not true that
> > I'm the only one who understands it.
> 
> Without documentation, you can only know what the code
> does, never what it is supposed to do or why it does it.

I only care about "what the code does" and "what are the results and the
bug reports". Anything else is vapourware and I don't care about that.

As said, I wrote some documentation on the VM for my last speech at
one of the most important Italian Linux events; it explains the basic
design. It should be published on their website as soon as I find the
time to send them the slides. I can post a link once it is online.
It should allow non-VM-developers to understand the logic behind the VM
algorithm, but understanding those slides is far from enough to allow
anyone to hack the VM.

I _totally_ agree with Linus when he said "real world is totally
dominated by the implementation details". I was thinking this way before
reading his recent email to l-k (however I totally disagree about
evolution being random and the other kernel-offtopic parts of that thread :).

For developers the real freedom is the code, not the documentation, and
the code is there. And I think it's much easier to understand the
current code (OK, I'm biased, but still I believe for outsiders it's
simpler).

Andrea


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 14:23           ` Andrea Arcangeli
@ 2001-12-11 15:27             ` Daniel Phillips
  2001-12-12 11:16               ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Daniel Phillips @ 2001-12-11 15:27 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel; +Cc: Andrew Morton, Marcelo Tosatti, lkml

On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> As said, I wrote some documentation on the VM for my last speech at
> one of the most important Italian Linux events; it explains the basic
> design. It should be published on their website as soon as I find the
> time to send them the slides. I can post a link once it is online.

Why not also post the whole thing as an email, right here?

> It should allow non-VM-developers to understand the logic behind the VM
> algorithm, but understanding those slides is far from enough to allow
> anyone to hack the VM.

It's a start.

> I _totally_ agree with Linus when he said "real world is totally
> dominated by the implementation details".

Linus didn't say anything about not documenting the implementation details, 
nor did he say anything about not documenting in general.

> For developers the real freedom is the code, not the documentation, and
> the code is there. And I think it's much easier to understand the
> current code (OK, I'm biased, but still I believe for outsiders it's
> simpler).

Judging by the number of complaints, it's not easy enough.  I know that,
personally, decoding your VM is something that's always on my 'things I could
do if I didn't have a lot of other things to do' list.  So far, only Linus,
Marcelo, Andrew and maybe Rik seem to have made the investment.  You'd have a
lot more helpers by now if you gave just a little higher priority to
documentation.

--
Daniel


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  0:43 ` Andrea Arcangeli
@ 2001-12-11 15:46   ` Luigi Genoni
  2001-12-12 22:05   ` Ken Brownfield
  1 sibling, 0 replies; 43+ messages in thread
From: Luigi Genoni @ 2001-12-11 15:46 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Marcelo Tosatti, lkml



On Tue, 11 Dec 2001, Andrea Arcangeli wrote:

> On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
> >
> > Andrea,
> >
> > Could you please start looking at any 2.4 VM issues which show up ?
>
> well, as far as I can tell no VM bugs should be present in my latest -aa, so
> I think I'm finished. At the very least I know people are using 2.4.15aa1
> and 2.4.17pre1aa1 in production on multigigabyte boxes under heavy VM
> load and I haven't got any bug reports back yet.
2.4.17pre1aa1 is quite rock solid on all my 2 and 4 GB machines.
But I have to admit that I did not really stress the VM on my
servers, since, guys, we are heading into Christmas :)





* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:42       ` Andrea Arcangeli
  2001-12-11 13:59         ` Rik van Riel
  2001-12-11 13:59         ` Abraham vd Merwe
@ 2001-12-11 15:47         ` Henning P. Schmiedehausen
  2001-12-11 16:01           ` Alan Cox
                             ` (2 more replies)
  2001-12-12  8:39         ` Andrew Morton
  3 siblings, 3 replies; 43+ messages in thread
From: Henning P. Schmiedehausen @ 2001-12-11 15:47 UTC (permalink / raw)
  To: linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

>Lack of documentation is not a bug, period. Also it's not true that I'm

It scares me shitless that you, as the one responsible for something
as crucial as MM in the Linux kernel, have such an attitude towards
software development, especially when people like RvR ask for docs.

Sorry, but to me this sounds like something from M$ (MAPI? You don't
need MAPI documentation. We know what we're doing. You don't need to
know how Windows XX works. It's enough that we know).

Actually, you _do_ get documentation from M$. Something one can't say
about the holy-penguin-pee-sprinkled Linux MM subsystem.

I'm not happy about your usage of magic numbers, either. So it is
still running on solid 2.2.19 until further notice (or until Rik loses
his patience. ;-) )

	Regards
		Henning

-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 15:47         ` Henning P. Schmiedehausen
@ 2001-12-11 16:01           ` Alan Cox
  2001-12-11 16:37           ` Hubert Mantel
  2001-12-11 17:09           ` Rik van Riel
  2 siblings, 0 replies; 43+ messages in thread
From: Alan Cox @ 2001-12-11 16:01 UTC (permalink / raw)
  To: hps; +Cc: linux-kernel

> I'm not happy about your usage of magic numbers, either. So it is
> still running on solid 2.2.19 until further notice (or until Rik loses
> his patience. ;-) )

Andrea did the 2.2.19 VM as well, but that one is somewhat better
documented, and doesn't have the use-once funnies.

Alan


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 15:47         ` Henning P. Schmiedehausen
  2001-12-11 16:01           ` Alan Cox
@ 2001-12-11 16:37           ` Hubert Mantel
  2001-12-11 17:09           ` Rik van Riel
  2 siblings, 0 replies; 43+ messages in thread
From: Hubert Mantel @ 2001-12-11 16:37 UTC (permalink / raw)
  To: linux-kernel

Hi,

On Tue, Dec 11, Henning P. Schmiedehausen wrote:

> Andrea Arcangeli <andrea@suse.de> writes:
> 
> >Lack of documentation is not a bug, period. Also it's not true that I'm
> 
> It scares me shitless that you, as the one responsible for something
> as crucial as MM in the Linux kernel, have such an attitude towards
> software development, especially when people like RvR ask for docs.
> 
> Sorry, but to me this sounds like something from M$ (MAPI? You don't
> need MAPI documentation. We know what we're doing. You don't need to
> know how Windows XX works. It's enough that we know).
> 
> Actually, you _do_ get documentation from M$. Something one can't say

How do you know the documentation matches the actual code?

> about the holy-penguin-pee-sprinkled Linux MM subsystem.

In Linux, you get even more: You can look at the code itself.

> I'm not happy about your usage of magic numbers, either. So it is
> still running on solid 2.2.19 until further notice (or until Rik loses
> his patience. ;-) )

Oh, the 2.2.19 VM is from Andrea ;)

> 	Regards
> 		Henning
                                                                  -o)
    Hubert Mantel              Goodbye, dots...                   /\\
                                                                 _\_v


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 15:47         ` Henning P. Schmiedehausen
  2001-12-11 16:01           ` Alan Cox
  2001-12-11 16:37           ` Hubert Mantel
@ 2001-12-11 17:09           ` Rik van Riel
  2001-12-11 17:28             ` Alan Cox
  2 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2001-12-11 17:09 UTC (permalink / raw)
  To: hps; +Cc: linux-kernel

On Tue, 11 Dec 2001, Henning P. Schmiedehausen wrote:

> I'm not happy about your usage of magic numbers, either. So it is
> still running on solid 2.2.19 until further notice (or until Rik loses
> his patience. ;-) )

I've lost patience and have decided to move development away
from the main tree.  http://linuxvm.bkbits.net/   ;)

cheers,

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/



* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 17:28             ` Alan Cox
@ 2001-12-11 17:22               ` Rik van Riel
  2001-12-11 17:23               ` Christoph Hellwig
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2001-12-11 17:22 UTC (permalink / raw)
  To: Alan Cox; +Cc: hps, linux-kernel

On Tue, 11 Dec 2001, Alan Cox wrote:

> > > I'm not happy about your usage of magic numbers, either. So it is
> > > still running on solid 2.2.19 until further notice (or until Rik loses
> > > his patience. ;-) )
> >
> > I've lost patience and have decided to move development away
> > from the main tree.  http://linuxvm.bkbits.net/   ;)
>
> Are your patches available in a format that is accessible using free
> software ?

Yes, I'm making patches available on my home page:

	http://surriel.com/patches/

Note that development isn't too fast due to the fact
that I try to clean up all code I touch instead of
just making the changes needed for the functionality.

kind regards,

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/



* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 17:28             ` Alan Cox
  2001-12-11 17:22               ` Rik van Riel
@ 2001-12-11 17:23               ` Christoph Hellwig
  2001-12-12 22:20                 ` Rob Landley
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2001-12-11 17:23 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

In article <E16DqhI-0005vG-00@the-village.bc.nu> you wrote:
>> > I'm not happy about your usage of magic numbers, either. So it is
>> > still running on solid 2.2.19 until further notice (or until Rik loses
>> > his patience. ;-) )
>> 
>> I've lost patience and have decided to move development away
>> from the main tree.  http://linuxvm.bkbits.net/   ;)
>
> Are your patches available in a format that is accessible using free
> software ?

Being bitkeeper-ignorant, I've found nice snapshots on
http://www.surriel.com/patches/.

For BSD advocates it might be a problem that these are unified diffs
that can only be applied with the GPL-licensed patch(1) version.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.


* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 17:09           ` Rik van Riel
@ 2001-12-11 17:28             ` Alan Cox
  2001-12-11 17:22               ` Rik van Riel
  2001-12-11 17:23               ` Christoph Hellwig
  0 siblings, 2 replies; 43+ messages in thread
From: Alan Cox @ 2001-12-11 17:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: hps, linux-kernel

> > I'm not happy about your usage of magic numbers, either. So it is
> > still running on solid 2.2.19 until further notice (or until Rik loses
> > his patience. ;-) )
> 
> I've lost patience and have decided to move development away
> from the main tree.  http://linuxvm.bkbits.net/   ;)

Are your patches available in a format that is accessible using free
software ?

(Now where did I put the troll sign 8))



* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 14:01           ` Andrea Arcangeli
@ 2001-12-11 17:30             ` Leigh Orf
  0 siblings, 0 replies; 43+ messages in thread
From: Leigh Orf @ 2001-12-11 17:30 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux Kernel Development



Andrea Arcangeli wrote:

|   > The problem is that your VM is unnecessarily eating up
|   > memory and then wants swap. That is unacceptable. Having
|   > 90% of your memory in buffers/cache and then having the
|   > OOM killer kick in because nothing is free is what we're
|   > moaning about.
|

|   Dear Abraham, please apply this patch:
|   
|   	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17pre4aa1.bz2
|   
|   on top of a 2.4.17pre4 and then recompile, try again and send me a
|   bug report if you can reproduce. thanks,

Andrea,

I applied your patch and it didn't fix the problem.
I reported this earlier to the kernel list but I'm not sure if you got
it. See http://groups.google.com/groups?hl=en&rnum=1&selm=linux.kernel.200112081539.fB8FdFj03048%40orp.orf.cx
or see the recent thread "2.4.16 memory badness (reproducible)". The
behavior I cite with 2.4.16 is identical to what happens with
2.4.17pre4aa1, but here it is again. It is reproducible.
Machine is a 1.4GHz Athlon with 1 GB memory, 2 GB swap, RH 7.2 with
updates.

home[1001]:/home/orf% uname -a
Linux orp.orf.cx 2.4.17-pre4 #1 Mon Dec 10 22:09:16 EST 2001 i686 unknown
(it's been patched with 2.4.17pre4aa1.bz2)
(updatedb updates RedHat's file database, does lots of file I/O)

home[1005]:/home/orf% free         
             total       used       free     shared    buffers     cached
Mem:       1029780     207976     821804          0      49468      71856
-/+ buffers/cache:      86652     943128
Swap:      2064344       6324    2058020

home[1006]:/home/orf% sudo updatedb
Password:

home[1007]:/home/orf% free
             total       used       free     shared    buffers     cached
Mem:       1029780    1017576      12204          0     471548      70924
-/+ buffers/cache:     475104     554676
Swap:      2064344       6312    2058032

home[1008]:/home/orf% xmms
Memory fault 

home[1009]:/home/orf% strace xmms 2>&1 | tail
old_mmap(NULL, 1291080, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40316000
mprotect(0x40448000, 37704, PROT_NONE)  = 0
old_mmap(0x40448000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x131000) = 0x40448000
old_mmap(0x4044e000, 13128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4044e000
close(3)                                = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40452000
munmap(0x40018000, 72492)               = 0
modify_ldt(0x1, 0xbffff33c, 0x10)       = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

Note that some applications don't mem fault this way, but all the ones
that do die at modify_ldt (see my previous post).

home[1010]:/home/orf% cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  1054494720 1041756160 12738560        0 481837056 77209600
Swap: 2113888256  6463488 2107424768
MemTotal:      1029780 kB
MemFree:         12440 kB
MemShared:           0 kB
Buffers:        470544 kB
Cached:          71388 kB
SwapCached:       4012 kB
Active:         367796 kB
Inactive:       232088 kB
HighTotal:      130992 kB
HighFree:         2044 kB
LowTotal:       898788 kB
LowFree:         10396 kB
SwapTotal:     2064344 kB
SwapFree:      2058032 kB


home[1011]:/home/orf% cat /proc/slabinfo
slabinfo - version: 1.1
kmem_cache            65     68    112    2    2    1
ip_conntrack          22     50    384    5    5    1
nfs_write_data         0      0    384    0    0    1
nfs_read_data          0      0    384    0    0    1
nfs_page               0      0    128    0    0    1
ip_fib_hash           10    112     32    1    1    1
urb_priv               0      0     64    0    0    1
clip_arp_cache         0      0    128    0    0    1
ip_mrt_cache           0      0    128    0    0    1
tcp_tw_bucket          0      0    128    0    0    1
tcp_bind_bucket       17    112     32    1    1    1
tcp_open_request       0      0    128    0    0    1
inet_peer_cache        2     59     64    1    1    1
ip_dst_cache          56     80    192    4    4    1
arp_cache              3     30    128    1    1    1
blkdev_requests      640    660    128   22   22    1
journal_head           0      0     48    0    0    1
revoke_table           0      0     12    0    0    1
revoke_record          0      0     32    0    0    1
dnotify cache          0      0     20    0    0    1
file lock cache        2     42     92    1    1    1
fasync cache           2    202     16    1    1    1
uid_cache              7    112     32    1    1    1
skbuff_head_cache    293    320    192   16   16    1
sock                 131    132   1280   44   44    1
sigqueue               4     29    132    1    1    1
cdev_cache          2313   2360     64   40   40    1
bdev_cache             8     59     64    1    1    1
mnt_cache             19     59     64    1    1    1
inode_cache       452259 452263    512 64609 64609    1
dentry_cache      469963 469980    128 15666 15666    1
dquot                  0      0    128    0    0    1
filp                1633   1650    128   55   55    1
names_cache            0      2   4096    0    2    1
buffer_head       136268 164880    128 5496 5496    1
mm_struct             54     60    192    3    3    1
vm_area_struct      2186   2250    128   73   75    1
fs_cache              53     59     64    1    1    1
files_cache           53     63    448    6    7    1
signal_act            61     63   1344   21   21    1
size-131072(DMA)       0      0 131072    0    0   32
size-131072            0      0 131072    0    0   32
size-65536(DMA)        0      0  65536    0    0   16
size-65536             1      1  65536    1    1   16
size-32768(DMA)        0      0  32768    0    0    8
size-32768             1      1  32768    1    1    8
size-16384(DMA)        0      0  16384    0    0    4
size-16384             1      3  16384    1    3    4
size-8192(DMA)         0      0   8192    0    0    2
size-8192              5      7   8192    5    7    2
size-4096(DMA)         0      0   4096    0    0    1
size-4096             70     73   4096   70   73    1
size-2048(DMA)         0      0   2048    0    0    1
size-2048             64     68   2048   34   34    1
size-1024(DMA)         0      0   1024    0    0    1
size-1024          11028  11032   1024 2757 2758    1
size-512(DMA)          0      0    512    0    0    1
size-512           12029  12032    512 1504 1504    1
size-256(DMA)          0      0    256    0    0    1
size-256            1609   1635    256  109  109    1
size-128(DMA)          2     30    128    1    1    1
size-128           29383  29430    128  980  981    1
size-64(DMA)           0      0     64    0    0    1
size-64             9105   9145     64  155  155    1
size-32(DMA)          34     59     64    1    1    1
size-32            70942  70977     64 1203 1203    1

Leigh Orf



* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:42       ` Andrea Arcangeli
                           ` (2 preceding siblings ...)
  2001-12-11 15:47         ` Henning P. Schmiedehausen
@ 2001-12-12  8:39         ` Andrew Morton
  3 siblings, 0 replies; 43+ messages in thread
From: Andrew Morton @ 2001-12-12  8:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Marcelo Tosatti, lkml

Andrea Arcangeli wrote:
> 
> 
> [ big snip.  Addressed in other email ]
> 
> it should be simple, mainline swapouts more, so it's less likely to
> trash away some useful cache.
> 
> just try -aa after a:
> 
>         echo 10 >/proc/sys/vm/vm_mapped_ratio
> 
> it should swapout more and better preserve the cache.

-aa swapout balancing seems very good indeed to me.

> > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > > So we're madly trying to swap pages out and finding that there's no swap
> > > > space.  I believe that when we find there's no swap left we should move
> > > > the page onto the active list so we don't keep rescanning it pointlessly.
> > >
> > > yes, however I think the swap-flood with no swap isn't a very
> > > interesting case to optimize.
> >
> > Running swapless is a valid configuration, and the kernel is doing
> 
> I'm not saying it's not valid or not interesting.
> 
> It's the mix "I'm running out of memory and I'm swapless" that is the
> case that's not interesting to optimize.
> 
> If you're swapless it means you have enough memory and that you're not
> running out of swap. Otherwise _you_ (not the kernel) are wrong in not
> having swap.

um.  Spose so.
 
> ...
> 
> > The VM code lacks comments, and nobody except yourself understands
> > what it is supposed to be doing.  That's a bug, don't you think?
> 
> Lack of documentation is not a bug, period. Also it's not true that I'm
> the only one who understands it. For instance Linus understands it
> completely, I am 100% sure.
> 
> Anyway, I wrote a dozen slides on the VM with some graphs showing the
> design of the VM, in case anybody can learn better from a slide than
> from the code.

That's good.  Your elevator design slides were very helpful.  However
offline documentation tends to go stale.   A nice big block comment
maintained by a programmer who cares goes a loooong way.

> I believe the slides are useful to understand the design, but if you
> want to change one line of code, slides or not, you have to read the code.
> Everybody is complaining about documentation. This is a red herring.
> There's no documentation that allows you to hack the previous VM code.
> I'd ask how many of the people happy with the previous documentation
> were actually VM developers. Except for some possibly misleading
> comments in the current code that we may not have updated yet, I don't
> think there's been a regression in documentation.
> 

Sigh.  Just because the current core kernel looks like it was
scrawled in crayon by an infant doesn't mean that everyone has
to eschew literate, mature, competent and maintainable programming
practices.

-

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 13:46         ` Andrea Arcangeli
@ 2001-12-12  8:44           ` Andrew Morton
  2001-12-12  9:21             ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2001-12-12  8:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, Marcelo Tosatti, lkml

Andrea Arcangeli wrote:
> 
> On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> > On Mon, 10 Dec 2001, Andrew Morton wrote:
> >
> > > This test on a 64 megabyte machine, on ext2:
> > >
> > >     time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> > >
> > > On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.
> >
> > > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > > dual x86:
> > >
> > > -aa:                                4 minutes 20 seconds
> > > 2.4.17-pre8:                        4 minutes 8 seconds
> > > 2.4.17-pre8 plus the below patch:   3 minutes 55 seconds
> >
> >
> > Andrea, it seems -aa is not the holy grail VM-wise. If you want
> 
> it may not be a holy grail in swap benchmarks and floods of writes to
> disk (those are minor performance regressions), but I have not one
> single bug report related to "stability".

Your patch increases the time to untar a kernel tree by seventy-five
percent.  That's a fairly major minor regression.

> The only thing I got back from Andrew has been "it runs a little slower"
> in those two tests.

The swapstorm I agree is uninteresting.  The slowdown with a heavy write
load impacts a very common usage, and I've told you how to mostly fix
it.  You need to back out the change to bdflush.
 
> and of course he didn't even attempt to benchmark the interactive
> feel that was the _whole_ point of my buffer.c and elevator changes.

As far as I know, at no point in time have you told anyone that
this was an objective of your latest patch.  So of course I
didn't test for it.

Interactivity is indeed improved.  It has gone from catastrophic to
horrid.

There are four basic tests I use to quantify this, all with 64 megs of
memory:

1: Start a continuous write, and on a different partition, time how
   long it takes to read a 16 megabyte file.

   Here, -aa takes 40 seconds.  Stock 2.4.17-pre8 takes 71 seconds.
   2.4.17-pre8 with the same elevator settings as in -aa takes
   40 seconds.

   Large writes are slowing reads by a factor of 100.

2: Start a continuous write and, from another machine, run

	time ssh -X otherhost xterm -e true

   On -aa this takes 68 seconds.  On 2.4.17-pre8 it takes over
   three minutes.  I got bored and killed it.  The problem can't
   be fixed on 2.4.17-pre8 with tuning - it's probably due to the
   poor page replacement - stuff is getting swapped out.  This is
   a significant problem in 2.4.17-pre and we need a fix for it.

3: Run `cp -a linux/ junk'.  Time how long it takes to read a 16 meg file.

   There's no appreciable difference between any of the kernels here.
   It varies from 2 seconds to 10, and is generally OK.

4:  Run `cp -a linux/ junk'.  time ssh -X otherhost xterm -e true

   Varies between three and five seconds, depending on elvtune settings.
   No noticeable difference between any kernels.

It's tests 1 and 2 which are interesting, because we perform so
very badly.  And no amount of fiddling buffer.c or elvtune settings
is going to fix it, because they don't address the core problem.

Which is: when the elevator can't merge a read it sticks it at the
end of the request queue, behind all the writes.

I'll be submitting a little patch for 2.4.18-pre which allows the user
to tunably promote reads ahead of most of the writes.  It improves
tests 1 and 2 by a factor of eight to twelve.
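
The core of it is just a different insertion rule.  A minimal sketch of
the idea (this is not the actual patch; the function name and the
max_bypass parameter are illustrative):

    #include <linux/blkdev.h>
    #include <linux/list.h>

    /* Let an unmergeable READ jump at most max_bypass queued WRITEs,
     * walking back from the tail; max_bypass == 0 keeps today's
     * insert-at-tail behaviour. */
    static void insert_read_before_writes(struct list_head *queue_head,
                                          struct request *rq, int max_bypass)
    {
        struct list_head *insert = queue_head;  /* default: the tail */
        struct list_head *entry;
        int jumped = 0;

        for (entry = queue_head->prev; entry != queue_head; entry = entry->prev) {
            struct request *tmp = list_entry(entry, struct request, queue);
            /* stop at the first READ, or once the bypass budget is spent */
            if (tmp->cmd != WRITE || ++jumped > max_bypass)
                break;
            insert = entry;
        }
        /* list_add_tail() links rq immediately in front of 'insert' */
        list_add_tail(&rq->queue, insert);
    }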

> So as far as I'm concerned 2.4.15aa1 and 2.4.17pre?aa? are just rock
> solid and usable in production.

I haven't done much stability testing - without a description of what the
changes are trying to do, I can't test them - all I could do is blindly
run stress tests and I'm sure your QA team can do that as well as I,
on bigger boxes.

But I don't doubt that it's stable.   However Red Hat's QA guys are
pretty good at knocking kernels over...

gargh.  Ninety seconds of bash-shared-mapping and I get "end-request:
buffer-list destroyed" against the swap device.  Borked IDE driver.
Seems stable on SCSI.

The -aa VM is still a little prone to tossing out "0-order allocation
failures" when there's tons of swap available and when much memory
is freeable by dropping or writing back to shared mappings.  But
this doesn't seem to cause any problems, as long as there's some
memory available for atomic allocations, and I never saw free
memory go below 800 kbytes...

> We'll keep doing background benchmarking and changes that cannot
> affect stability, but the core design is finished as far as I can tell.

We'll know when it gets wider testing in the runup to 2.4.18.  The
fact that I found a major (although easily fixed) performance problem
in the first ten minutes indicates that caution is needed, yes?

What's the thinking with the changes to dcache/icache flushing?
A single d/icache entry can save three seeks, which is _enormous_ value for
just a few hundred bytes of memory.  You appear to be shrinking the i/dcache
by 12% each time you try to swap out or evict 32 pages.   What this means
is that as soon as we start to get a bit short on memory, the i/dcache vanishes.
And it takes ages to read that stuff back in.  How did you test this?  Without
having done (or even devised) any quantitative testing myself, I have a gut
feel that we need to preserve the i/dcache (versus file data) much more than
this.



Oh.  Maybe the core design (whatever it is :)) is not finished,
because it retains the bone-headed, dumb-to-the-point-of-astonishing
misfeature which Linux VM has always had:

If someone is linearly writing (or reading) a gigabyte file on a 64
megabyte box they *don't* want the VM to evict every last little scrap
of cache on behalf of data which they *obviously* do not want
cached.

It's good that -aa VM doesn't summarily dump the i/dcache and plonk
everything you want into swap when this happens.  Progress.


So. To summarise.

- Your attempt to address read latencies didn't work out, and should
  be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))

- We urgently need a fix for 2.4.17's page replacement problems.  
 
- aa is good.  Believe it or not, I like it.  The mm/* portions fix
  significant performance problems in our current VM.  I guess we
  should bite the bullet and merge it all in 2.4.18-pre.

-

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12  8:44           ` Andrew Morton
@ 2001-12-12  9:21             ` Andrea Arcangeli
  2001-12-12  9:45               ` Rik van Riel
  2001-12-12  9:59               ` Andrew Morton
  0 siblings, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12  9:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, Marcelo Tosatti, lkml

On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> > > On Mon, 10 Dec 2001, Andrew Morton wrote:
> > >
> > > > This test on a 64 megabyte machine, on ext2:
> > > >
> > > >     time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> > > >
> > > > On 2.4.17-pre7 it takes 21 seconds.  On -aa it is much slower: 36 seconds.
> > >
> > > > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > > > dual x86:
> > > >
> > > > -aa:                                4 minutes 20 seconds
> > > > 2.4.17-pre8:                        4 minutes 8 seconds
> > > > 2.4.17-pre8 plus the below patch:   3 minutes 55 seconds
> > >
> > >
> > > Andrea, it seems -aa is not the holy grail VM-wise. If you want
> > 
> > it may not be a holy grail in swap benchmarks and floods of writes to
> > disk (those are minor performance regressions), but I have not one
> > single bug report related to "stability".
> 
> Your patch increases the time to untar a kernel tree by seventy-five
> percent.  That's a fairly major minor regression.
> 
> > The only thing I got back from Andrew has been "it runs a little slower"
> > in those two tests.
> 
> The swapstorm I agree is uninteresting.  The slowdown with a heavy write
> load impacts a very common usage, and I've told you how to mostly fix
> it.  You need to back out the change to bdflush.

I guess I should drop the run_task_queue(&tq_disk) instead of replacing
it back with a wait_for_some_buffers().

> > and of course he didn't even attempt to benchmark the interactive
> > feel that was the _whole_ point of my buffer.c and elevator changes.
> 
> As far as I know, at no point in time have you told anyone that
> this was an objective of your latest patch.  So of course I
> didn't test for it.
> 
> Interactivity is indeed improved.  It has gone from catastrophic to
> horrid.

:)

> 
> There are four basic tests I use to quantify this, all with 64 megs of
> memory:
> 
> 1: Start a continuous write, and on a different partition, time how
>    long it takes to read a 16 megabyte file.
> 
>    Here, -aa takes 40 seconds.  Stock 2.4.17-pre8 takes 71 seconds.
>    2.4.17-pre8 with the same elevator settings as in -aa takes
>    40 seconds.
> 
>    Large writes are slowing reads by a factor of 100.
> 
> 2: Start a continuous write and, from another machine, run
> 
> 	time ssh -X otherhost xterm -e true
> 
>    On -aa this takes 68 seconds.  On 2.4.17-pre8 it takes over
>    three minutes.  I got bored and killed it.  The problem can't
>    be fixed on 2.4.17-pre8 with tuning - it's probably due to the
>    poor page replacement - stuff is getting swapped out.  This is
>    a significant problem in 2.4.17-pre and we need a fix for it.
> 
> 3: Run `cp -a linux/ junk'.  Time how long it takes to read a 16 meg file.
> 
>    There's no appreciable difference between any of the kernels here.
>    It varies from 2 seconds to 10, and is generally OK.
> 
> 4:  Run `cp -a linux/ junk'.  time ssh -X otherhost xterm -e true
> 
>    Varies between three and five seconds, depending on elvtune settings.
>    No noticeable difference between any kernels.
> 
> It's tests 1 and 2 which are interesting, because we perform so
> very badly.  And no amount of fiddling buffer.c or elvtune settings
> is going to fix it, because they don't address the core problem.
> 
> Which is: when the elevator can't merge a read it sticks it at the
> end of the request queue, behind all the writes.
> 
> I'll be submitting a little patch for 2.4.18-pre which allows the user
> to tunably promote reads ahead of most of the writes.  It improves
> tests 1 and 2 by a factor of eight to twelve.

Note that the first elevator (not elevator_linus) could handle this
case, however it was too complicated and I've been told it was hurting
the performance of things like dbench etc. too much. But it allowed
your test number 2 to take only a few seconds, for example. Quite
frankly all my benchmarks were latency oriented, and I couldn't notice
a huge drop in performance, but OTOH at that time my test box had a
10 Mbyte/sec HD, and I know from experience that on such an HD numbers
tend to be very different than on fast SCSI and my current 33 Mbyte/sec
IDE test HD, so I think they were right.

> > So as far as I'm concerned 2.4.15aa1 and 2.4.17pre?aa? are just rock
> > solid and usable in production.
> 
> I haven't done much stability testing - without a description of what the
> changes are trying to do, I can't test them - all I could do is blindly
> run stress tests and I'm sure your QA team can do that as well as I,
> on bigger boxes.
> 
> But I don't doubt that it's stable.   However Red Hat's QA guys are
> pretty good at knocking kernels over...
> 
> gargh.  Ninety seconds of bash-shared-mapping and I get "end-request:
> buffer-list destroyed" against the swap device.  Borked IDE driver.
> Seems stable on SCSI.
> 
> The -aa VM is still a little prone to tossing out "0-order allocation
> failures" when there's tons of swap available and when much memory
> is freeable by dropping or writing back to shared mappings.  But
> this doesn't seem to cause any problems, as long as there's some
> memory available for atomic allocations, and I never saw free
> memory go below 800 kbytes...

It mostly tends to fail on GFP_NOIO and friends, where it cannot block,
and I believe that's correct: looping forever inside the allocator can
only lead to deadlocks. Those GFP_NOIO users have loops outside the
allocator if required.

A failure means that unless somebody else does something for us, we
couldn't allocate anything. Thus SCHED_YIELD and try again.
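
Concretely, the caller-side idiom is something like this (a sketch, not
a quote of any specific call site; bh_cachep is the buffer_head slab
cache in fs/buffer.c):

    #include <linux/fs.h>
    #include <linux/slab.h>
    #include <linux/sched.h>

    static struct buffer_head *alloc_bh_noio(void)
    {
        struct buffer_head *bh;

        /* GFP_NOIO: the allocator may not start I/O on our behalf */
        while (!(bh = kmem_cache_alloc(bh_cachep, GFP_NOIO))) {
            /* failure: yield so somebody else can free memory */
            current->policy |= SCHED_YIELD;
            schedule();
        }
        return bh;
    }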

> > We'll keep doing background benchmarking and changes that cannot
> > affect stability, but the core design is finished as far as I can tell.
> 
> We'll know when it gets wider testing in the runup to 2.4.18.  The
> fact that I found a major (although easily fixed) performance problem
> in the first ten minutes indicates that caution is needed, yes?

I consider that minor tuning (as you said, removing the run_task_queue()
in bdflush may be enough to cure the tar xzf; I will run some tests).

> What's the thinking with the changes to dcache/icache flushing?
> A single d/icache entry can save three seeks, which is _enormous_ value for
> just a few hundred bytes of memory.  You appear to be shrinking the i/dcache
> by 12% each time you try to swap out or evict 32 pages.   What this means

yes.

> is that as soon as we start to get a bit short on memory, the i/dcache vanishes.
> And it takes ages to read that stuff back in.  How did you test this?  Without
> having done (or even devised) any quantitative testing myself, I have a gut
> feel that we need to preserve the i/dcache (versus file data) much more than
> this.

The problem is ZONE_NORMAL: if we fail to shrink the cache we _must_
shrink the dcache/icache as well to be correct (at the very least if the
classzone is < ZONE_HIGHMEM). Otherwise ZONE_NORMAL/DMA allocations can
fail forever and you won't be able to fork a new task any longer. I
tested this with a ZONE_NORMAL of 1/2 Mbytes with highmem emulation. Of
course this makes the problem trivially reproducible, but it could
happen on larger boxes as well, at least in theory, and I want to cover
all the cases as best I can.
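
In pseudo-form the constraint is the following (a sketch;
classzone_is_lowmem() is an illustrative stand-in for the real
classzone check):

    nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
    if (nr_pages > 0 && classzone_is_lowmem(classzone)) {
        /* The page cache couldn't give back enough ZONE_NORMAL/DMA
         * pages, so the dcache/icache must shed entries too, or
         * lowmem allocations (fork!) can fail forever. */
        shrink_dcache_memory(priority, gfp_mask);
        shrink_icache_memory(priority, gfp_mask);
    }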

> Oh.  Maybe the core design (whatever it is :)) is not finished,
> because it retains the bone-headed, dumb-to-the-point-of-astonishing
> misfeature which Linux VM has always had:
> 
> If someone is linearly writing (or reading) a gigabyte file on a 64
> megabyte box they *don't* want the VM to evict every last little scrap
> of cache on behalf of data which they *obviously* do not want
> cached.

The current design tries to detect this, at least much much better than
2.2. This is why I disagree with Rik's patch of yesterday. Detecting
cache pollution is good also on lowmem boxes (not only for DB).

> It's good that -aa VM doesn't summarily dump the i/dcache and plonk
> everything you want into swap when this happens.  Progress.
> 
> 
> So. To summarise.
> 
> - Your attempt to address read latencies didn't work out, and should
>   be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))

It should not be dropped. And it's not a hack; I only enabled the code
that was basically disabled due to the huge numbers. It will work like
2.2.20.

Now what you want to add is a hack to move reads to the top of the
request_queue, and if you go back to 2.3.5x you'll see I was doing this;
it's the first thing I did while playing with the elevator. And
latency-wise it was working great. I'm sure somebody remembers the kind
of latency you could get with such an elevator.

Then I got flames from Linus and Ingo claiming that I screwed up the
elevator and that I was the source of the bad 2.3.x I/O performance, so
they required a near rewrite of the elevator in a way that obviously
couldn't hurt the benchmarks. So Jens dropped part of my latency-capable
elevator and did the elevator_linus that of course cannot hurt benchmark
performance, but that has the usual problem that you need to wait 1
minute for an xterm to be started under a write flood.

However my objective was to avoid nearly infinite starvation, and the
elevator_linus avoids it (you can start the xterm in 1 minute;
previously, in early 2.3 and 2.2, you'd need to wait for the disk to be
full, and that could take some days with some terabytes of data). So I
was pretty much fine with elevator_linus too, but we knew very well
reads would be starved again significantly (even if not indefinitely).

Many thanks for the help!!

Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12  9:21             ` Andrea Arcangeli
@ 2001-12-12  9:45               ` Rik van Riel
  2001-12-12 10:09                 ` Andrea Arcangeli
  2001-12-12  9:59               ` Andrew Morton
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2001-12-12  9:45 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Marcelo Tosatti, lkml

On Wed, 12 Dec 2001, Andrea Arcangeli wrote:
> On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote:
> > Oh.  Maybe the core design (whatever it is :)) is not finished,
> > because it retains the bone-headed, dumb-to-the-point-of-astonishing
> > misfeature which Linux VM has always had:
> >
> > If someone is linearly writing (or reading) a gigabyte file on a 64
> > megabyte box they *don't* want the VM to evict every last little scrap
> > of cache on behalf of data which they *obviously* do not want
> > cached.
>
> The current design tries to detect this, at least much much better than
> 2.2. This is why I disagree with Rik's patch of yesterday. Detecting
> cache pollution is good also on lowmem boxes (not only for DB).

Oh, absolutely. The problem is just that the current design
has even worse problems: it doesn't put any pressure on
pages which were touched twice an hour ago.

This leads to the situation that applications get OOM-killed
to preserve buffer cache memory which hasn't been touched
since bootup time.

There are ways to both have good behaviour on bulk IO and
flush out old data which was in active use but no longer is.
I believe these are called page aging and drop-behind.
I've been thinking about achieving the wanted behaviour
without these two, but haven't been able to come up with
any algorithm which doesn't have some very bad side effects.
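
For concreteness, drop-behind amounts to something like this (a sketch
using the 2.4 LRU helpers; a sequential reader would call it on each
page it has finished with):

    static void drop_behind_page(struct page *page)
    {
        spin_lock(&pagemap_lru_lock);
        if (PageLRU(page) && PageActive(page)) {
            /* make bulk I/O recycle its own pages first */
            del_page_from_active_list(page);
            add_page_to_inactive_list(page);
            ClearPageReferenced(page);
        }
        spin_unlock(&pagemap_lru_lock);
    }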

If you know a way of doing bulk IO properly and flushing out
an old working set correctly, please let us know.

regards,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12  9:21             ` Andrea Arcangeli
  2001-12-12  9:45               ` Rik van Riel
@ 2001-12-12  9:59               ` Andrew Morton
  2001-12-12 10:15                 ` Andrea Arcangeli
  1 sibling, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2001-12-12  9:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, Marcelo Tosatti, lkml

Andrea Arcangeli wrote:
> 
> ...
> > The swapstorm I agree is uninteresting.  The slowdown with a heavy write
> > load impacts a very common usage, and I've told you how to mostly fix
> > it.  You need to back out the change to bdflush.
> 
> > I guess I should drop the run_task_queue(&tq_disk) instead of replacing
> > it back with a wait_for_some_buffers().

hum.  Nope, it definitely wants the wait_for_locked_buffers() in there.
36 seconds versus 25.  (21 on stock kernel)

My theory is that balance_dirty() is directing heaps of wakeups
to bdflush, so bdflush just keeps on running.  I'll take a look
tomorrow.

(If we're sending that many wakeups, we should do a waitqueue_active
test in wakeup_bdflush...)
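
I.e. something along these lines (sketch against fs/buffer.c):

    static inline void wakeup_bdflush_sketch(void)
    {
        /* skip the wakeup path entirely if bdflush isn't asleep */
        if (waitqueue_active(&bdflush_wait))
            wake_up_interruptible(&bdflush_wait);
    }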

> ...
>
> > Note that the first elevator (not elevator_linus) could handle this
> > case, however it was too complicated and I've been told it was hurting
> > the performance of things like dbench etc. too much. But it allowed
> > your test number 2 to take only a few seconds, for example. Quite
> > frankly all my benchmarks were latency oriented, and I couldn't notice
> > a huge drop in performance, but OTOH at that time my test box had a
> > 10 Mbyte/sec HD, and I know from experience that on such an HD numbers
> > tend to be very different than on fast SCSI and my current 33 Mbyte/sec
> > IDE test HD, so I think they were right.

OK, well I think I'll make it so the feature defaults to "off" - no
change in behaviour.  People need to run `elvtune -b non-zero-value'
to turn it on.
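
For example (device name illustrative, semantics as sketched above):

	elvtune -b 2 /dev/hda

would let a read bypass at most two queued writes, while the default of
0 leaves today's behaviour unchanged.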

So what is then needed is testing to determine the latency-versus-throughput
tradeoff.  Andries takes manpage patches :)

> ...
> > - Your attempt to address read latencies didn't work out, and should
> >   be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))
> 
> > It should not be dropped. And it's not a hack; I only enabled the code
> > that was basically disabled due to the huge numbers. It will work like
> > 2.2.20.

Sorry, I was referring to the elevator-bypass patch.  Jens called
it a hack ;)

> Now what you want to add is a hack to move reads to the top of the
> request_queue, and if you go back to 2.3.5x you'll see I was doing this;
> it's the first thing I did while playing with the elevator. And
> latency-wise it was working great. I'm sure somebody remembers the kind
> of latency you could get with such an elevator.
> 
> Then I got flames from Linus and Ingo claiming that I screwed up the
> elevator and that I was the source of the bad 2.3.x I/O performance, so
> they required a near rewrite of the elevator in a way that obviously
> couldn't hurt the benchmarks. So Jens dropped part of my latency-capable
> elevator and did the elevator_linus that of course cannot hurt benchmark
> performance, but that has the usual problem that you need to wait 1
> minute for an xterm to be started under a write flood.
> 
> However my objective was to avoid nearly infinite starvation, and the
> elevator_linus avoids it (you can start the xterm in 1 minute;
> previously, in early 2.3 and 2.2, you'd need to wait for the disk to be
> full, and that could take some days with some terabytes of data). So I
> was pretty much fine with elevator_linus too, but we knew very well
> reads would be starved again significantly (even if not indefinitely).
> 

OK, thanks.

As long as the elevator-bypass tunable gives a good range of
latency-versus-throughput tuning then I'll be happy.  It's a
bit sad that in even the best case, reads are penalised by a
factor of ten when there are writes happening.

But fixing that would require major readahead surgery, and perhaps
implementation of anticipatory scheduling, as described in
http://www.cse.ucsc.edu/~sbrandt/290S/anticipatoryscheduling.pdf
which is out of scope.

-

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12  9:45               ` Rik van Riel
@ 2001-12-12 10:09                 ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12 10:09 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Marcelo Tosatti, lkml

On Wed, Dec 12, 2001 at 07:45:45AM -0200, Rik van Riel wrote:
> On Wed, 12 Dec 2001, Andrea Arcangeli wrote:
> > On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote:
> > > Oh.  Maybe the core design (whatever it is :)) is not finished,
> > > because it retains the bone-headed, dumb-to-the-point-of-astonishing
> > > misfeature which Linux VM has always had:
> > >
> > > If someone is linearly writing (or reading) a gigabyte file on a 64
> > > megabyte box they *don't* want the VM to evict every last little scrap
> > > of cache on behalf of data which they *obviously* do not want
> > > cached.
> >
> > The current design tries to detect this, at least much much better than
> > 2.2. This is why I disagree with Rik's patch of yesterday. Detecting
> > cache pollution is good also on lowmem boxes (not only for DB).
> 
> Oh, absolutely. The problem is just that the current design
> has even worse problems: it doesn't put any pressure on
> pages which were touched twice an hour ago.

It does. See the refill_inactive pass.

> This leads to the situation that applications get OOM-killed
> to preserve buffer cache memory which hasn't been touched
> since bootup time.

It doesn't happen here.

At the very least the fix is the two-liner from Andrew that forces a
nr_pages refile from the active list; that will guarantee that whatever
happens we always roll the active list too. But the OOM killing you are
experiencing is a problem of mainline; it definitely doesn't happen
here, and the refill_inactive(0) cannot be the culprit, because the
active list always grows to a relevant size, and if during OOM a few
pages stay untouched in the active list that's fine: those few pages
couldn't save us anyway, so they'd better stay there so we don't thrash.
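
For reference, a sketch of the sort of two-liner in question, applied
where shrink_caches() computes the refill ratio:

    ratio = (unsigned long) nr_pages * nr_active_pages /
                ((nr_inactive_pages + 1) * 2);
    if (ratio < nr_pages)   /* always refile at least nr_pages */
        ratio = nr_pages;
    refill_inactive(ratio);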

> 
> There are ways to both have good behaviour on bulk IO and
> flush out old data which was in active use but no longer is.
> I believe these are called page aging and drop-behind.
> I've been thinking about achieving the wanted behaviour
> without these two, but haven't been able to come up with
> any algorithm which doesn't have some very bad side effects.
> 
> If you know a way of doing bulk IO properly and flushing out
> an old working set correctly, please let us know.
> 
> regards,
> 
> Rik
> -- 
> Shortwave goes a long way:  irc.starchat.net  #swl
> 
> http://www.surriel.com/		http://distro.conectiva.com/


Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12  9:59               ` Andrew Morton
@ 2001-12-12 10:15                 ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12 10:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, Marcelo Tosatti, lkml

On Wed, Dec 12, 2001 at 01:59:38AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > ...
> > > The swapstorm I agree is uninteresting.  The slowdown with a heavy write
> > > load impacts a very common usage, and I've told you how to mostly fix
> > > it.  You need to back out the change to bdflush.
> > 
> > I guess I should drop the run_task_queue(&tq_disk) instead of replacing
> > it back with a wait_for_some_buffers().
> 
> hum.  Nope, it definitely wants the wait_for_locked_buffers() in there.
> 36 seconds versus 25.  (21 on stock kernel)

please try without the wait_for_locked_buffers and without the
run_task_queue, just delete that line.

> 
> My theory is that balance_dirty() is directing heaps of wakeups
> to bdflush, so bdflush just keeps on running.  I'll take a look
> tomorrow.

Please delete the wait_on_buffers from balance_dirty() too, it's totally
broken there as well.

wait_on_something _does_ wake up the queue just like a run_task_queue();
otherwise it would be a noop.

However I need to take a closer look at the refiling of clean buffers
from the locked to the clean lists; we should make sure not to spend too
much time there the first time a wait_on_buffers is called again...

> (If we're sending that many wakeups, we should do a waitqueue_active
> test in wakeup_bdflush...)
> 
> > ...
> >
> > Note that the first elevator (not elevator_linus) could handle this
> > case, however it was too complicated and I've been told it was hurting
> > the performance of things like dbench etc. too much. But it allowed
> > your test number 2 to take only a few seconds, for example. Quite
> > frankly all my benchmarks were latency oriented, and I couldn't notice
> > a huge drop in performance, but OTOH at that time my test box had a
> > 10 Mbyte/sec HD, and I know from experience that on such an HD numbers
> > tend to be very different than on fast SCSI and my current 33 Mbyte/sec
> > IDE test HD, so I think they were right.
> 
> OK, well I think I'll make it so the feature defaults to "off" - no
> change in behaviour.  People need to run `elvtune -b non-zero-value'
> to turn it on.

OK. BTW, I guess on this side it's worth working only on 2.5. We know
latency isn't very good in 2.4 and in 2.2; we're more throughput oriented.

And of course, to make the latency better, we could also reduce the
size of the I/O queue; I bet the queues are way oversized for a normal
desktop.

> 
> So what is then needed is testing to determine the latency-versus-throughput
> tradeoff.  Andries takes manpage patches :)
> 
> > ...
> > > - Your attempt to address read latencies didn't work out, and should
> > >   be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))
> > 
> > It should not be dropped. And it's not a hack; I only enabled the code
> > that was basically disabled due to the huge numbers. It will work like
> > 2.2.20.
> 
> Sorry, I was referring to the elevator-bypass patch.  Jens called
> it a hack ;)

Oh yes, that's a "hack" :), and it definitely works well for the latency.

> 
> > Now what you want to add is a hack to move reads to the top of the
> > request_queue, and if you go back to 2.3.5x you'll see I was doing this;
> > it's the first thing I did while playing with the elevator. And
> > latency-wise it was working great. I'm sure somebody remembers the kind
> > of latency you could get with such an elevator.
> > 
> > Then I got flames from Linus and Ingo claiming that I screwed up the
> > elevator and that I was the source of the bad 2.3.x I/O performance, so
> > they required a near rewrite of the elevator in a way that obviously
> > couldn't hurt the benchmarks. So Jens dropped part of my latency-capable
> > elevator and did the elevator_linus that of course cannot hurt benchmark
> > performance, but that has the usual problem that you need to wait 1
> > minute for an xterm to be started under a write flood.
> > 
> > However my objective was to avoid nearly infinite starvation, and the
> > elevator_linus avoids it (you can start the xterm in 1 minute;
> > previously, in early 2.3 and 2.2, you'd need to wait for the disk to be
> > full, and that could take some days with some terabytes of data). So I
> > was pretty much fine with elevator_linus too, but we knew very well
> > reads would be starved again significantly (even if not indefinitely).
> > 
> 
> OK, thanks.
> 
> As long as the elevator-bypass tunable gives a good range of
> latency-versus-throughput tuning then I'll be happy.  It's a
> bit sad that in even the best case, reads are penalised by a
> factor of ten when there are writes happening.
> 
> But fixing that would require major readahead surgery, and perhaps
> implementation of anticipatory scheduling, as described in
> http://www.cse.ucsc.edu/~sbrandt/290S/anticipatoryscheduling.pdf
> which is out of scope.
> 
> -


Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 15:27             ` Daniel Phillips
@ 2001-12-12 11:16               ` Andrea Arcangeli
  2001-12-12 20:03                 ` Daniel Phillips
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12 11:16 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Andrew Morton, Marcelo Tosatti, lkml

On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote:
> On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> > As said, I wrote some documentation on the VM for my last speech at
> > one of the most important Italian Linux events; it explains the basic
> > design. It should be published on their website as soon as I find the
> > time to send them the slides. I can post a link once it's online.
> 
> Why not also post the whole thing as an email, right here?

I uploaded it here:

	ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz

Hopefully it's understandable standalone.

> > It should allow non-VM-developers to understand the logic behind the VM
> > algorithm, but understanding those slides is far from enough to allow
> > anyone to hack the VM.
> 
> It's a start.
> 
> > I _totally_ agree with Linus when he said "real world is totally
> > dominated by the implementation details".
> 
> Linus didn't say anything about not documenting the implementation details, 
> nor did he say anything about not documenting in general.

Yes, my only point was that "documentation" isn't nearly enough, and
that it's not mandatory (given that the changes don't affect any user
API), but I certainly agree documentation helps.

Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 11:16               ` Andrea Arcangeli
@ 2001-12-12 20:03                 ` Daniel Phillips
  2001-12-12 21:25                   ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Daniel Phillips @ 2001-12-12 20:03 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, Marcelo Tosatti, lkml

On December 12, 2001 12:16 pm, Andrea Arcangeli wrote:
> On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote:
> > On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> > > As said, I wrote some documentation on the VM for my last speech at
> > > one of the most important Italian Linux events; it explains the basic
> > > design. It should be published on their website as soon as I find the
> > > time to send them the slides. I can post a link once it's online.
> > 
> > Why not also post the whole thing as an email, right here?
> 
> I uploaded it here:

ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz

This is really, really useful.

Helpful hint: to run the slideshow, get magicpoint (debian users: apt-get 
install mgp) and do:

   mv pluto.mpg pluto.mgp # ;)
   mgp pluto.mgp -x vflib

Helpful hint #2: Actually, just gv pluto.ps gets all the content.

Helpful hint #3: Actually, less pluto.mgp will do the trick (after the 
rename) and lets you cut and paste the text, as I'm about to do...

Nit: "vm shrinking is not serialized with any other subsystem, it is also 
                                                              only---^^^^
threaded against itself."

The big thing I see missing from this presentation is a discussion of how 
icache, dcache etc fit into the picture, i.e., shrink_caches.

--
Daniel


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 20:03                 ` Daniel Phillips
@ 2001-12-12 21:25                   ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12 21:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Andrew Morton, Marcelo Tosatti, lkml

On Wed, Dec 12, 2001 at 09:03:20PM +0100, Daniel Phillips wrote:
> On December 12, 2001 12:16 pm, Andrea Arcangeli wrote:
> > On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote:
> > > On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> > > > As said, I wrote some documentation on the VM for my last speech at
> > > > one of the most important Italian Linux events; it explains the basic
> > > > design. It should be published on their website as soon as I find the
> > > > time to send them the slides. I can post a link once it's online.
> > > 
> > > Why not also post the whole thing as an email, right here?
> > 
> > I uploaded it here:
> 
> ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz
> 
> This is really, really useful.
> 
> Helpful hint: to run the slideshow, get magicpoint (debian users: apt-get 
> install mgp) and do:
> 
>    mv pluto.mpg pluto.mgp # ;)

8)

>    mgp pluto.mgp -x vflib
> 
> Helpful hint #2: Actually, just gv pluto.ps gets all the content.
> 
> Helpful hint #3: Actually, less pluto.mgp will do the trick (after the 
> rename) and lets you cut and paste the text, as I'm about to do...
> 
> Nit: "vm shrinking is not serialized with any other subsystem, it is also 
>                                                               only---^^^^
> threaded against itself."

correct.

> The big thing I see missing from this presentation is a discussion of how 
> icache, dcache etc fit into the picture, i.e., shrink_caches.

Going into the differences between icache/dcache and pagecache would
have been too low level (and I should have spent some time explaining
what icache and dcache are first ;), so as you noticed I intentionally
ignored those high-level VFS caches in the slides. The concept of "pages
of cache" is usually well known by most people instead, so I only
considered the pagecache, which incidentally is also the most
interesting case for the VM.  For seasoned kernel developers it would
have been interesting to integrate more stuff, of course, but as you
said this is a start at least :).

About the icache/dcache shrinking: that's probably the roughest thing
we have in the VM at the moment. It just works.

Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11  0:43 ` Andrea Arcangeli
  2001-12-11 15:46   ` Luigi Genoni
@ 2001-12-12 22:05   ` Ken Brownfield
  2001-12-12 22:30     ` Andrea Arcangeli
  2001-12-12 23:23     ` Rik van Riel
  1 sibling, 2 replies; 43+ messages in thread
From: Ken Brownfield @ 2001-12-12 22:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: lkml

On Tue, Dec 11, 2001 at 01:43:46AM +0100, Andrea Arcangeli wrote:
| On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
| > Andrea, 
| > Could you please start looking at any 2.4 VM issues which show up ?
| 
| well, as far as I can tell no VM bug should be present in my latest
| -aa, so I think I'm finished. At the very least I know people are using
| 2.4.15aa1 and 2.4.17pre1aa1 in production on multigigabyte boxes under
| heavy VM load and I haven't got any bug reports back yet.
[...]

I look forward to this stuff.  2.4 mainline falls down reliably and
completely when running updatedb on systems with a large number of used
inodes.  Linus' VM/mmap patch helped a ton, but between general VM
issues and the i/dcache bloat I'm hoping that I won't have to redirect
my irritated users' ire into a karma pool to get these changes merged
into mainline where all of the knowledgeable folks here can beat out the
details.

I do think that the vast majority of users don't see this issue on
small-ish UP desktops.  But I'm about to buy >100 SMP systems for
production expansion which will most likely be affected by this issue.
For me that emphasizes that these so-called corner cases really are
show-stoppers for Linux-as-more-than-toy.

Gimme the /proc interface (bdflush?) and let's bang on this stuff in
mainline.  I need to stick with the latest -pre so I can track progress,
so 2.4.17pre4aa1 (or 10_vm-19) hasn't been a possibility for me... :-(

Cheers, just venting,
-- 
Ken.
brownfld@irridia.com

PS: Nice catch on the NTFS vmalloc() issue.

| > Just please make sure that when sending a fix for something, send me _one_
| > problem and a patch which fixes _that_ problem.
| 
| I will split something for you soon; at the moment I am doing some
| further benchmarking.
| 
| > 
| > I'm tempted to look at VM, but I think I'll spend my limited time in a
| > better way if I review's others people work instead.
| 
| until I split something out, you can see all the vm related changes in
| the 10_vm-* patches in my ftp area.
| 
| Andrea
| -
| To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
| the body of a message to majordomo@vger.kernel.org
| More majordomo info at  http://vger.kernel.org/majordomo-info.html
| Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 17:23               ` Christoph Hellwig
@ 2001-12-12 22:20                 ` Rob Landley
  2001-12-13  8:47                   ` David S. Miller
  2001-12-13  8:48                   ` Alan Cox
  0 siblings, 2 replies; 43+ messages in thread
From: Rob Landley @ 2001-12-12 22:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel

On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote:

> For BSD advocates it might be a problem that these are unified diffs
> that can only be applied with the GPL-licensed patch(1) version..

Why would BSD advocates be applying patches to the linux kernel?  (You don't 
need the tool to read a patch for ideas, do you?)  Why would BSD advocates 
apply a GPL-licensed patch to the GPL-licensed Linux kernel, and then 
complain that the tool they're using to do so is GPL-licensed?

I'm confused.  (Not SURPRISED, mind you.  Just easily confused.)

Rob

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 22:05   ` Ken Brownfield
@ 2001-12-12 22:30     ` Andrea Arcangeli
  2001-12-12 23:23     ` Rik van Riel
  1 sibling, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2001-12-12 22:30 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: lkml, Andrew Morton

On Wed, Dec 12, 2001 at 04:05:51PM -0600, Ken Brownfield wrote:
> On Tue, Dec 11, 2001 at 01:43:46AM +0100, Andrea Arcangeli wrote:
> | On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
> | > Andrea, 
> | > Could you please start looking at any 2.4 VM issues which show up ?
> | 
> | well, as far as I can tell no VM bug should be present in my latest
> | -aa, so I think I'm finished. At the very least I know people are using
> | 2.4.15aa1 and 2.4.17pre1aa1 in production on multigigabyte boxes under
> | heavy VM load and I haven't got any bug reports back yet.
> [...]
> 
> I look forward to this stuff.  2.4 mainline falls down reliably and
> completely when running updatedb on systems with a large number of used
> inodes.  Linus' VM/mmap patch helped a ton, but between general VM
> issues and the i/dcache bloat I'm hoping that I won't have to redirect
> my irritated users' ire into a karma pool to get these changes merged
> into mainline where all of the knowledgeable folks here can beat out the
> details.
> 
> I do think that the vast majority of users don't see this issue on
> small-ish UP desktops.  But I'm about to buy >100 SMP systems for
> production expansion which will most likely be affected by this issue.
> For me that emphasizes that these so-called corner cases really are
> show-stoppers for Linux-as-more-than-toy.
> 
> Gimme the /proc interface (bdflush?) and let's bang on this stuff in
> mainline.  I need to stick with the latest -pre so I can track progress,
> so 2.4.17pre4aa1 (or 10_vm-19) hasn't been a possibility for me... :-(

I finished fixing the bdflush stuff that Andrew kindly pointed out.
Async writes are as fast as possible again now, and I also introduced
some hysteresis for bdflush to reduce the wakeup rate, plus I'm forcing
bdflush to do some significant work rather than just NRSYNC buffers. But
I'm doing some other swapout benchmarking before releasing a new -aa; I
hope to finish tomorrow. Once I feel finished I'll split out
something.

Anyway, here is a preview of the bdflush fixes for Andrew. It
definitely cures the performance problem for me; previously there were
too many reschedules. I also wonder whether balance_dirty() should write
nfract of the buffers, instead of only NRSYNC (or maybe something less
than ndirty but more than NRSYNC). Comments?

(Then BUF_LOCKED will contain all the clean buffers too, and so it
cannot be accounted in balance_dirty() anymore; the VM will throttle on
those locked buffers, so it's not a problem.)
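
(With this applied, the new knob is tunable at runtime through the
usual bdflush sysctl vector; assuming HZ=100, so that 5*HZ and 30*HZ
become 500 and 3000, the defaults below correspond to:

	echo "30 500 0 0 500 3000 60 20 0" >/proc/sys/vm/bdflush

where the 8th field is the new nfract_stop_bdflush.)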

--- 2.4.17pre7aa1/fs/buffer.c.~1~	Mon Dec 10 16:10:40 2001
+++ 2.4.17pre7aa1/fs/buffer.c	Wed Dec 12 19:16:23 2001
@@ -105,22 +105,23 @@
 	struct {
 		int nfract;	/* Percentage of buffer cache dirty to 
 				   activate bdflush */
-		int dummy1;	/* old "ndirty" */
+		int ndirty;	/* Maximum number of dirty blocks to write out per
+				   wake-cycle */
 		int dummy2;	/* old "nrefill" */
 		int dummy3;	/* unused */
 		int interval;	/* jiffies delay between kupdate flushes */
 		int age_buffer;	/* Time for normal buffer to age before we flush it */
 		int nfract_sync;/* Percentage of buffer cache dirty to 
 				   activate bdflush synchronously */
-		int dummy4;	/* unused */
+		int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */
 		int dummy5;	/* unused */
 	} b_un;
 	unsigned int data[N_PARAM];
-} bdf_prm = {{20, 0, 0, 0, 5*HZ, 30*HZ, 40, 0, 0}};
+} bdf_prm = {{30, 500, 0, 0, 5*HZ, 30*HZ, 60, 20, 0}};
 
 /* These are the min and max parameter values that we will allow to be assigned */
-int bdflush_min[N_PARAM] = {  0,  0,    0,   0,  0,   1*HZ,   0, 0, 0};
-int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,10000*HZ, 10000*HZ, 100, 0, 0};
+int bdflush_min[N_PARAM] = {  0,  1,    0,   0,  0,   1*HZ,   0, 0, 0};
+int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,10000*HZ, 10000*HZ, 100, 100, 0};
 
 void unlock_buffer(struct buffer_head *bh)
 {
@@ -181,7 +182,6 @@
 		bh->b_end_io = end_buffer_io_sync;
 		clear_bit(BH_Pending_IO, &bh->b_state);
 		submit_bh(WRITE, bh);
-		conditional_schedule();
 	} while (--count);
 }
 
@@ -217,11 +217,10 @@
 			array[count++] = bh;
 			if (count < NRSYNC)
 				continue;
-
 			spin_unlock(&lru_list_lock);
-			conditional_schedule();
 
 			write_locked_buffers(array, count);
+			conditional_schedule();
 			return -EAGAIN;
 		}
 		unlock_buffer(bh);
@@ -282,12 +281,6 @@
 	return 0;
 }
 
-static inline void wait_for_some_buffers(kdev_t dev)
-{
-	spin_lock(&lru_list_lock);
-	wait_for_buffers(dev, BUF_LOCKED, 1);
-}
-
 static int wait_for_locked_buffers(kdev_t dev, int index, int refile)
 {
 	do
@@ -1043,7 +1036,6 @@
 	unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
 
 	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
-	dirty += size_buffers_type[BUF_LOCKED] >> PAGE_SHIFT;
 	tot = nr_free_buffer_pages();
 
 	dirty *= 100;
@@ -1060,6 +1052,21 @@
 	return -1;
 }
 
+static int bdflush_stop(void)
+{
+	unsigned long dirty, tot, dirty_limit;
+
+	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
+	tot = nr_free_buffer_pages();
+
+	dirty *= 100;
+	dirty_limit = tot * bdf_prm.b_un.nfract_stop_bdflush;
+
+	if (dirty > dirty_limit)
+		return 0;
+	return 1;
+}
+
 /*
  * if a new dirty buffer is created we need to balance bdflush.
  *
@@ -1084,7 +1091,6 @@
 	if (state > 0) {
 		spin_lock(&lru_list_lock);
 		write_some_buffers(NODEV);
-		wait_for_some_buffers(NODEV);
 	}
 }
 
@@ -2789,13 +2795,18 @@
 	complete((struct completion *)startup);
 
 	for (;;) {
+		int ndirty = bdf_prm.b_un.ndirty;
+
 		CHECK_EMERGENCY_SYNC
 
-		spin_lock(&lru_list_lock);
-		if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
-			run_task_queue(&tq_disk);
-			interruptible_sleep_on(&bdflush_wait);
+		while (ndirty > 0) {
+			spin_lock(&lru_list_lock);
+			if (!write_some_buffers(NODEV))
+				break;
+			ndirty -= NRSYNC;
 		}
+		if (ndirty > 0 || bdflush_stop())
+			interruptible_sleep_on(&bdflush_wait);
 	}
 }
 



Andrea

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 22:05   ` Ken Brownfield
  2001-12-12 22:30     ` Andrea Arcangeli
@ 2001-12-12 23:23     ` Rik van Riel
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2001-12-12 23:23 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: Andrea Arcangeli, lkml

On Wed, 12 Dec 2001, Ken Brownfield wrote:

> I'm hoping that I won't have to redirect my irritated users' ire into
> a karma pool to get these changes merged into mainline

Actually, Marcelo has already indicated that he's willing to
take VM code from Andrea, as long as the parts are merged one
by one and come with proper argumentation.

This means you'll either have to split out Andrea's patch
yourself or you'll have to convince Andrea to play by the
rules ;))

regards,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 22:20                 ` Rob Landley
@ 2001-12-13  8:47                   ` David S. Miller
  2001-12-13 18:41                     ` Matthias Andree
  2001-12-13  8:48                   ` Alan Cox
  1 sibling, 1 reply; 43+ messages in thread
From: David S. Miller @ 2001-12-13  8:47 UTC (permalink / raw)
  To: alan; +Cc: landley, hch, linux-kernel


   > > For BSD advocates it might be a problem that these are unified diffs
   > > that can only be applied with the GPL-licensed patch(1) version..
   
I'm quoting back two levels; sorry, I've lost the original attribution.

But anyway, didn't the original Larry Wall patch do unified diffs?
I thought it did, and I recall that it wasn't GPL licensed.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-12 22:20                 ` Rob Landley
  2001-12-13  8:47                   ` David S. Miller
@ 2001-12-13  8:48                   ` Alan Cox
  2001-12-13 10:22                     ` [OT] " Rob Landley
  1 sibling, 1 reply; 43+ messages in thread
From: Alan Cox @ 2001-12-13  8:48 UTC (permalink / raw)
  To: Rob Landley; +Cc: Christoph Hellwig, linux-kernel

> On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote:
> 
> > For BSD advocates it might be a problem that these are unified diffs
> > that can only be applied with the GPL-licensed patch(1) version..
> 
> Why would BSD advocates be applying patches to the linux kernel?  (You don't 
> need the tool to read a patch for ideas, do you?)  Why would BSD advocates 
> apply a GPL-licensed patch to the GPL-licensed Linux kernel, and then 
> complain that the tool they're using to do so is GPL-licensed?
> 
> I'm confused.  (Not SURPRISED, mind you.  Just easily confused.)

Christoph, please remember that irony is not available between the Canadian
and Mexican border.... you are confusing them again 8)

Alan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [OT] Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-13  8:48                   ` Alan Cox
@ 2001-12-13 10:22                     ` Rob Landley
  0 siblings, 0 replies; 43+ messages in thread
From: Rob Landley @ 2001-12-13 10:22 UTC (permalink / raw)
  To: Alan Cox; +Cc: Christoph Hellwig, linux-kernel

On Thursday 13 December 2001 03:48 am, Alan Cox wrote:
> > On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote:
> > > For BSD advocates it might be a problem that these are unified diffs
> > > that can only be applied with the GPL-licensed patch(1) version..
> >
> > Why would BSD advocates be applying patches to the linux kernel?  (You
> > don't need the tool to read a patch for ideas, do you?)  Why would BSD
> > advocates apply a GPL-licensed patch to the GPL-licensed Linux kernel,
> > and then complain that the tool they're using to do so is GPL-licensed?
> >
> > I'm confused.  (Not SURPRISED, mind you.  Just easily confused.)
>
> Christoph, please remember that irony is not available between the Canadian
> and Mexican border.... you are confusing them again 8)

We'll get it back when the whole "everything has changed" fad dies down.  
Average together how long the O.J. Simpson trial lasted, the Monica Lewinsky
thing, Elian Gonzalez down in Miami, the press coverage of Hurricane Andrew,
the original Gulf War, Nancy Kerrigan, John Wayne Bobbitt, Joey Buttafuoco,
the military interventions in Somalia and Bosnia, the outcry over Alar and
malathion in California back in the 80's, Dan Quayle attacking Murphy Brown,
the anti-nuke sentiment following Chernobyl and Three Mile Island...

That's our national attention span.  A year, maybe a year and change.  
Anybody who thinks some nut with a beard can keep this country permanently 
nervous obviously doesn't remember the Cuban missile crisis.  (And of course
there are a lot of people who don't, again because of our short attention 
span...)  Our military may be rather impressive, but our sarcastic 
self-centered indifference is legendary.  We're STILL bombing Iraq, and most 
of the US has forgotten that country even exists...

</off topic thread>

> Alan

Rob

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-13  8:47                   ` David S. Miller
@ 2001-12-13 18:41                     ` Matthias Andree
  0 siblings, 0 replies; 43+ messages in thread
From: Matthias Andree @ 2001-12-13 18:41 UTC (permalink / raw)
  To: linux-kernel

On Thu, 13 Dec 2001, David S. Miller wrote:

> But anyways didn't the original Larry Wall patch do unified diffs?
> I thought it did, and I recall that wasn't GPL licensed.

Nope, it did context diffs however.

-- 
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety."         Benjamin Franklin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
  2001-12-11 16:45 ` Marcelo Tosatti
@ 2001-12-11 18:51   ` Rik van Riel
  0 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2001-12-11 18:51 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, Andrea Arcangeli, lkml

On Tue, 11 Dec 2001, Marcelo Tosatti wrote:

> > I'll take a stab at completely removing the use-once stuff as an
> > emergency measure.
>
> Could you please make a patch without use-once and post the patch to
> lkml?
>
> This way people can test it and report performance results.

OK, here's a quick hack to migrate 2.4 to second-chance
replacement. In this implementation that means:

1) for pages in the working set of processes, we keep
   the pages resident whenever we find a referenced
   bit in the page table

2) for pages which are not mapped, we unconditionally
   move the page to the inactive list; the page only
   gets reactivated if it is referenced while on the
   inactive list

This should give us some small protection against use-once
data, since the referenced bit doesn't count, while allowing
us to protect the working set of processes.

It also makes shrinking of the slab-based filesystem caches
unconditional, to prevent bad effects there.

Note that I'm still compiling and haven't tested it yet,
please give it a spin.

regards,

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/



--- linux-2.4.17-pre8/mm/filemap.c.orig	Tue Dec 11 16:11:16 2001
+++ linux-2.4.17-pre8/mm/filemap.c	Tue Dec 11 16:27:44 2001
@@ -1249,23 +1249,20 @@
 }

 /*
- * Mark a page as having seen activity.
- *
- * If it was already so marked, move it
- * to the active queue and drop the referenced
- * bit. Otherwise, just mark it for future
- * action..
+ * Simple second-chance replacement.
+ * As long as a page is on the active list, further references
+ * are ignored so used-once pages get replaced quickly.
+ * If a page on the inactive list gets referenced or has a
+ * referenced bit in the page table page, it gets moved back
+ * to the far end of the active list.
  */
 void mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page)) {
+	if (PageLRU(page) && !PageActive(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
 		return;
 	}
-
-	/* Mark the page referenced, AFTER checking for previous usage.. */
-	SetPageReferenced(page);
 }

 /*
--- linux-2.4.17-pre8/mm/swap.c.orig	Tue Dec 11 16:11:16 2001
+++ linux-2.4.17-pre8/mm/swap.c	Tue Dec 11 16:13:11 2001
@@ -59,7 +59,7 @@
 {
 	if (!TestSetPageLRU(page)) {
 		spin_lock(&pagemap_lru_lock);
-		add_page_to_inactive_list(page);
+		add_page_to_active_list(page);
 		spin_unlock(&pagemap_lru_lock);
 	}
 }
--- linux-2.4.17-pre8/mm/vmscan.c.orig	Tue Dec 11 16:11:16 2001
+++ linux-2.4.17-pre8/mm/vmscan.c	Tue Dec 11 16:43:10 2001
@@ -526,10 +526,14 @@

 /*
  * This moves pages from the active list to
- * the inactive list.
+ * the inactive list. If they get referenced
+ * while on the inactive list, they will be
+ * activated again.
  *
- * We move them the other way when we see the
- * reference bit on the page.
+ * Note that we cannot (and don't want to)
+ * clear the referenced bits in the page tables
+ * of pages, so the working sets of processes
+ * have an edge on cache pages.
  */
 static void refill_inactive(int nr_pages)
 {
@@ -542,15 +546,10 @@

 		page = list_entry(entry, struct page, lru);
 		entry = entry->prev;
-		if (PageTestandClearReferenced(page)) {
-			list_del(&page->lru);
-			list_add(&page->lru, &active_list);
-			continue;
-		}

 		del_page_from_active_list(page);
 		add_page_to_inactive_list(page);
-		SetPageReferenced(page);
+		ClearPageReferenced(page);
 	}
 	spin_unlock(&pagemap_lru_lock);
 }
@@ -570,16 +569,16 @@
 	ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
 	refill_inactive(ratio);

-	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
-	if (nr_pages <= 0)
-		return 0;
-
 	shrink_dcache_memory(priority, gfp_mask);
 	shrink_icache_memory(priority, gfp_mask);
 #ifdef CONFIG_QUOTA
 	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
 #endif

+	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
+
+	if (nr_pages <= 0)
+		return 0;
 	return nr_pages;
 }
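
A side note on the ratio arithmetic that survives in shrink_caches()
above: the following small standalone example (with made-up numbers)
shows how it scales the deactivation work to the relative size of the
active list.

/* Standalone illustration of the refill_inactive() ratio in
 * shrink_caches(). With the active list at twice the size of
 * the inactive list (the "2/3 of the cache" target), the ratio
 * comes out close to nr_pages; a proportionally larger active
 * list scales it up. Numbers here are hypothetical.
 */
#include <stdio.h>

int main(void)
{
	unsigned long nr_pages = 32;
	unsigned long nr_active_pages = 6000;
	unsigned long nr_inactive_pages = 1000;

	unsigned long ratio = nr_pages * nr_active_pages /
			      ((nr_inactive_pages + 1) * 2);

	/* 32 * 6000 / 2002 = 95: the active list is oversized
	 * relative to the 2:1 target, so deactivate more pages. */
	printf("deactivate %lu pages\n", ratio);
	return 0;
}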



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: 2.4.16 & OOM killer screw up (fwd)
       [not found] <Pine.LNX.4.33L.0112102004490.1352-100000@duckman.distro.conectiva>
@ 2001-12-11 16:45 ` Marcelo Tosatti
  2001-12-11 18:51   ` Rik van Riel
  0 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2001-12-11 16:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Andrea Arcangeli, lkml



On Mon, 10 Dec 2001, Rik van Riel wrote:

> On Mon, 10 Dec 2001, Andrew Morton wrote:
> 
> > A fix may be to just remove the use-once stuff.  It is one of the
> > sources of this problem, because it's overpopulating the inactive list.
> 
> Absolutely. Use-once is an inherently unstable system, suitable
> for things like a database load (where you know you want to spend
> a certain percentage of your RAM on caching the index), but not
> suitable for a general-purpose VM, where you have no idea how
> large the working set will be.
> 
> I'll take a stab at completely removing the use-once stuff as an
> emergency measure.

Rik,

Could you please make a patch without use-once and post the patch to lkml?

This way people can test it and report performance results.

I really would prefer to remove use-once, as I also think it's an
optimization which breaks some workloads, but I want to know what happens
in practice if we do that.
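
For reference, the use-once heuristic in question is the two-touch
test that Rik's patch above removes from mark_page_accessed(). In
rough, simplified outline (not the exact kernel code):

/* Sketch of the use-once ("two touch") heuristic being removed.
 * A page must be referenced twice while inactive before it is
 * promoted, so data read exactly once is never activated and is
 * reclaimed early. Simplified userspace model, not kernel code.
 */
#include <stdio.h>
#include <stdbool.h>

struct page {
	bool active;
	bool referenced;
};

static void mark_page_accessed(struct page *page)
{
	if (!page->active && page->referenced) {
		/* second touch: promote to the active list */
		page->active = true;
		page->referenced = false;
		return;
	}
	/* first touch: only remember that it happened */
	page->referenced = true;
}

int main(void)
{
	struct page p = { false, false };

	mark_page_accessed(&p);  /* first touch: marked only */
	mark_page_accessed(&p);  /* second touch: activated  */
	printf("active after two touches: %d\n", p.active);
	return 0;
}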


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2001-12-13 22:43 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-12-10 19:08 2.4.16 & OOM killer screw up (fwd) Marcelo Tosatti
2001-12-10 20:47 ` Andrew Morton
2001-12-10 19:42   ` Marcelo Tosatti
2001-12-11  0:11   ` Andrea Arcangeli
2001-12-11  7:07     ` Andrew Morton
2001-12-11 13:32       ` Rik van Riel
2001-12-11 13:46         ` Andrea Arcangeli
2001-12-12  8:44           ` Andrew Morton
2001-12-12  9:21             ` Andrea Arcangeli
2001-12-12  9:45               ` Rik van Riel
2001-12-12 10:09                 ` Andrea Arcangeli
2001-12-12  9:59               ` Andrew Morton
2001-12-12 10:15                 ` Andrea Arcangeli
2001-12-11 13:42       ` Andrea Arcangeli
2001-12-11 13:59         ` Rik van Riel
2001-12-11 14:23           ` Andrea Arcangeli
2001-12-11 15:27             ` Daniel Phillips
2001-12-12 11:16               ` Andrea Arcangeli
2001-12-12 20:03                 ` Daniel Phillips
2001-12-12 21:25                   ` Andrea Arcangeli
2001-12-11 13:59         ` Abraham vd Merwe
2001-12-11 14:01           ` Andrea Arcangeli
2001-12-11 17:30             ` Leigh Orf
2001-12-11 15:47         ` Henning P. Schmiedehausen
2001-12-11 16:01           ` Alan Cox
2001-12-11 16:37           ` Hubert Mantel
2001-12-11 17:09           ` Rik van Riel
2001-12-11 17:28             ` Alan Cox
2001-12-11 17:22               ` Rik van Riel
2001-12-11 17:23               ` Christoph Hellwig
2001-12-12 22:20                 ` Rob Landley
2001-12-13  8:47                   ` David S. Miller
2001-12-13 18:41                     ` Matthias Andree
2001-12-13  8:48                   ` Alan Cox
2001-12-13 10:22                     ` [OT] " Rob Landley
2001-12-12  8:39         ` Andrew Morton
2001-12-11  0:43 ` Andrea Arcangeli
2001-12-11 15:46   ` Luigi Genoni
2001-12-12 22:05   ` Ken Brownfield
2001-12-12 22:30     ` Andrea Arcangeli
2001-12-12 23:23     ` Rik van Riel
     [not found] <Pine.LNX.4.33L.0112102004490.1352-100000@duckman.distro.conectiva>
2001-12-11 16:45 ` Marcelo Tosatti
2001-12-11 18:51   ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).