linux-kernel.vger.kernel.org archive mirror
* [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
@ 2004-12-20 15:15 Rik van Riel
  2004-12-20 15:23 ` Rik van Riel
  2004-12-20 20:54 ` Andrew Morton
  0 siblings, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2004-12-20 15:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Robert_Hentosh

Simply running "dd if=/dev/zero of=/dev/hd<one you can miss>" will
result in OOM kills, with the dirty pagecache completely filling up
lowmem.  This patch is part 1 to fixing that problem.

This patch effectively lowers the dirty limit for mappings which cannot
be cached in highmem, counting the dirty limit as a percentage of lowmem
instead.  This should prevent heavy block device writers from pushing
the VM over the edge and triggering OOM kills.

Signed-off-by: Rik van Riel <riel@redhat.com>


--- linux-2.6.9/mm/page-writeback.c.highmem	2004-12-16 11:22:48.193641312 -0500
+++ linux-2.6.9/mm/page-writeback.c	2004-12-16 11:30:00.565676290 -0500
@@ -133,18 +133,28 @@
   * clamping level.
   */
  static void
-get_dirty_limits(struct writeback_state *wbs, long *pbackground, long *pdirty)
+get_dirty_limits(struct writeback_state *wbs, long *pbackground, long *pdirty, struct address_space *mapping)
  {
  	int background_ratio;		/* Percentages */
  	int dirty_ratio;
  	int unmapped_ratio;
  	long background;
  	long dirty;
+	unsigned long available_memory = total_pages;
  	struct task_struct *tsk;

  	get_writeback_state(wbs);

-	unmapped_ratio = 100 - (wbs->nr_mapped * 100) / total_pages;
+#ifdef CONFIG_HIGHMEM
+	/*
+	 * If this mapping can only allocate from low memory,
+	 * we exclude high memory from our count.
+	 */
+	if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM))
+		available_memory -= totalhigh_pages;
+#endif
+
+	unmapped_ratio = 100 - (wbs->nr_mapped * 100) / available_memory;

  	dirty_ratio = vm_dirty_ratio;
  	if (dirty_ratio > unmapped_ratio / 2)
@@ -194,7 +204,8 @@
  			.nr_to_write	= write_chunk,
  		};

-		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh);
+		get_dirty_limits(&wbs, &background_thresh,
+					&dirty_thresh, mapping);
  		nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
  		if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
  			break;
@@ -210,7 +221,7 @@
  		if (nr_reclaimable) {
  			writeback_inodes(&wbc);
  			get_dirty_limits(&wbs, &background_thresh,
-					&dirty_thresh);
+					&dirty_thresh, mapping);
  			nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
  			if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
  				break;
@@ -283,7 +294,7 @@
  	long dirty_thresh;

          for ( ; ; ) {
-		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh);
+		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);

                  /*
                   * Boost the allowable dirty threshold a bit for page
@@ -318,7 +329,7 @@
  		long background_thresh;
  		long dirty_thresh;

-		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh);
+		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
  		if (wbs.nr_dirty + wbs.nr_unstable < background_thresh
  				&& min_pages <= 0)
  			break;
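
For readers following along, the available_memory selection in the patch
above can be modelled in user space. This is only a sketch: the function
names available_pages() and unmapped_ratio() and the
mapping_allows_highmem flag are made up for illustration (the flag stands
in for the mapping_gfp_mask(mapping) & __GFP_HIGHMEM test), and the page
counts are supplied by the caller rather than read from the kernel.

```c
#include <assert.h>

/*
 * User-space model of the available_memory selection in the patch.
 * All parameters are plain page counts passed in by the caller;
 * nothing here reads real kernel state.
 */
unsigned long available_pages(unsigned long total_pages,
			      unsigned long totalhigh_pages,
			      int mapping_allows_highmem)
{
	assert(totalhigh_pages <= total_pages);
	/* A lowmem-only mapping cannot use highmem, so exclude it. */
	if (!mapping_allows_highmem)
		return total_pages - totalhigh_pages;
	return total_pages;
}

/* Percentage of the available pages that is not mapped into processes. */
long unmapped_ratio(unsigned long nr_mapped, unsigned long available)
{
	assert(available > 0);
	return 100 - (long)((nr_mapped * 100) / available);
}
```

With 4KB pages, a 4GB box with roughly 3.1GB of highmem would go from
1048576 available pages to 229376 for a lowmem-only mapping in this model.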


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-20 15:15 [PATCH][1/2] adjust dirty threshold for lowmem-only mappings Rik van Riel
@ 2004-12-20 15:23 ` Rik van Riel
  2004-12-20 20:54 ` Andrew Morton
  1 sibling, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2004-12-20 15:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Robert_Hentosh

On Mon, 20 Dec 2004, Rik van Riel wrote:

> Simply running "dd if=/dev/zero of=/dev/hd<one you can miss>" will
> result in OOM kills, with the dirty pagecache completely filling up
> lowmem.  This patch is part 1 to fixing that problem.

What I forgot to say is that in order to trigger this OOM kill
the dirty_limit of 40% needs to be more memory than what fits
in low memory.  So this will work on x86 with 4GB RAM, since
the dirty_limit is 1.6GB, but the block device cache cannot
grow that big because it is restricted to low memory.

This has the effect of all low memory being tied up in dirty
page cache and userspace try_to_free_pages() skipping the
writeout of these pages because the block device is congested.
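
The arithmetic behind this can be checked with a few lines of C. This is
a rough sketch, not kernel code: the helper names are hypothetical, and
the 896MB lowmem size is an assumption about a typical x86 highmem
configuration rather than a number taken from the test box.

```c
#include <assert.h>

/* Dirty limit in MB for a given memory size and dirty ratio (percent). */
unsigned long dirty_limit_mb(unsigned long mem_mb, unsigned int ratio_pct)
{
	assert(ratio_pct <= 100);
	return mem_mb * ratio_pct / 100;
}

/*
 * A writer is throttled before the zone fills up only if the dirty
 * limit is smaller than the zone it is allowed to allocate from.
 */
int throttles_before_full(unsigned long limit_mb, unsigned long zone_mb)
{
	return limit_mb < zone_mb;
}
```

In this model, 40% of 4096MB is 1638MB, which is larger than an 896MB
lowmem zone, so the writer is never throttled before lowmem is full.
Computing the limit against lowmem alone gives 358MB, which does fit.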



-- 
He did not think of himself as a tourist; he was a traveler. The difference is
partly one of time, he would explain. Where as the tourist generally hurries
back home at the end of a few weeks or months, the traveler belonging no more
to one place than to the next, moves slowly, over periods of years, from one
part of the earth to another. Indeed, he would have found it difficult to tell,
among the many places he had lived, precisely where it was he had felt most at
home.  -- Paul Bowles


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-20 15:15 [PATCH][1/2] adjust dirty threshold for lowmem-only mappings Rik van Riel
  2004-12-20 15:23 ` Rik van Riel
@ 2004-12-20 20:54 ` Andrew Morton
  2004-12-20 21:27   ` Rik van Riel
  2004-12-23 19:21   ` Rik van Riel
  1 sibling, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2004-12-20 20:54 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Robert_Hentosh

Rik van Riel <riel@redhat.com> wrote:
>
> Simply running "dd if=/dev/zero of=/dev/hd<one you can miss>" will
>  result in OOM kills, with the dirty pagecache completely filling up
>  lowmem.

That surely used to work - I have a feeling that it got broken somehow. 
The below might fix it, but probably not.

The intended behaviour is that the page-allocating process will throttle
and will then pick up those pages from the tail of the LRU which
rotate_reclaimable_page() put there.



We haven't been incrementing local variable total_scanned since the
scan_control stuff went in.  That broke kswapd throttling.

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/mm/vmscan.c |    1 +
 1 files changed, 1 insertion(+)

diff -puN mm/vmscan.c~vmscan-total_scanned-fix mm/vmscan.c
--- 25/mm/vmscan.c~vmscan-total_scanned-fix	2004-12-20 12:47:25.855643408 -0800
+++ 25-akpm/mm/vmscan.c	2004-12-20 12:47:25.860642648 -0800
@@ -1063,6 +1063,7 @@ scan:
 			shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_reclaimed += sc.nr_reclaimed;
+			total_scanned += sc.nr_scanned;
 			if (zone->all_unreclaimable)
 				continue;
 			if (zone->pages_scanned >= (zone->nr_active +
_
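
To see why the missing increment matters, here is a toy user-space model
of a reclaim loop. It is a sketch under assumptions: the function name
count_throttled_passes() is invented, and the throttle condition
(some pages scanned and priority below DEF_PRIORITY - 2) is my reading
of how the loop decides to wait, not a copy of the kernel source.

```c
#include <assert.h>

#define DEF_PRIORITY 12	/* starting scan priority used in this model */

/*
 * Count how many passes of the loop would throttle. per_pass_scanned is
 * the number of pages each pass scans; accumulate selects the fixed (1)
 * or the broken (0) behaviour.
 */
int count_throttled_passes(unsigned long per_pass_scanned, int accumulate)
{
	unsigned long total_scanned = 0;
	int throttled = 0;

	for (int priority = DEF_PRIORITY; priority >= 0; priority--) {
		if (accumulate)
			total_scanned += per_pass_scanned; /* the one-line fix */
		/* Assumed throttle condition, see the note above. */
		if (total_scanned && priority < DEF_PRIORITY - 2)
			throttled++;
	}
	return throttled;
}
```

Without the accumulation total_scanned stays at zero, so the throttle
condition can never become true no matter how much scanning was done.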



* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-20 20:54 ` Andrew Morton
@ 2004-12-20 21:27   ` Rik van Riel
  2004-12-23 19:21   ` Rik van Riel
  1 sibling, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2004-12-20 21:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Robert_Hentosh

On Mon, 20 Dec 2004, Andrew Morton wrote:

> We haven't been incrementing local variable total_scanned since the
> scan_control stuff went in.  That broke kswapd throttling.

That would explain the "kswapd uses heaps of CPU time when
starting a memory hungry task", too ...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-20 20:54 ` Andrew Morton
  2004-12-20 21:27   ` Rik van Riel
@ 2004-12-23 19:21   ` Rik van Riel
  2004-12-24 16:01     ` Andrea Arcangeli
  1 sibling, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2004-12-23 19:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Robert_Hentosh

On Mon, 20 Dec 2004, Andrew Morton wrote:
> Rik van Riel <riel@redhat.com> wrote:
>>
>> Simply running "dd if=/dev/zero of=/dev/hd<one you can miss>" will
>>  result in OOM kills, with the dirty pagecache completely filling up
>>  lowmem.
>
> That surely used to work - I have a feeling that it got broken somehow.
> The below might fix it, but probably not.

Even all 3 patches together don't seem to have fixed
the bug completely.  The time needed to trigger the
bug has gone up though, from 5 minutes to a day ...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-23 19:21   ` Rik van Riel
@ 2004-12-24 16:01     ` Andrea Arcangeli
  2004-12-24 16:22       ` Rik van Riel
  0 siblings, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2004-12-24 16:01 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Thu, Dec 23, 2004 at 02:21:15PM -0500, Rik van Riel wrote:
> the bug completely.  The time needed to trigger the
> bug has gone up though, from 5 minutes to a day ...

I have been running it in a loop for a day with no oom yet. All of this
is also applied to the current suse kernel and it's behaving very well so
far (except for the oom killer with oracle: to fix that, badness must be
rewritten with a completely different algorithm). Let's limit
the workload to normal desktops for now so we don't have to change
everything at once.

Patches applied are Andrew's ignore-swap-token, Andrew's write
throttling total_scanned, Con's disable-swap-token, my oom
killer fixes (but that should not influence the write throttling), my
lowmem-reserve (that can definitely influence it on big boxes and it's a
must have for any computer with more than 1G), and a few more certainly
unrelated bits.

So I recommend you to try again with at least "Andrew's
ignore-swap-token, Andrew's total_scanned, Con's disable-swap-token and
my lowmem_reserve". Effectively disable-swap-token obsoletes
ignore-swap-token, but both make sense together since just in case
somebody enables the feature, ignore-swap-token will give it a chance
not to generate spurious oom kills.


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-24 16:01     ` Andrea Arcangeli
@ 2004-12-24 16:22       ` Rik van Riel
  2004-12-24 16:40         ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2004-12-24 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Fri, 24 Dec 2004, Andrea Arcangeli wrote:

> So I recommend you to try again with at least "Andrew's
> ignore-swap-token, Andrew's total_scanned, Con's disable-swap-token and
> my lowmem_reserve". Effectively disable-swap-token obsoletes
> ignore-swap-token, but both make sense together since just in case
> somebody enables the feature, ignore-swap-token will give it a chance
> not to generate spurious oom kills.

That makes little sense, since 99% of lowmem is in the page
cache and not mapped into any process, so the swap token
won't get involved at all.  Same for the lowmem_reserve patch,
since the pagecache allocations for dding to a block device
do not use __GFP_HIGHMEM, so the lowmem_reserve protection of
low memory won't be activated.

I am already running with akpm's total_scanned, my lowering of
the dirty limit for non-highmem capable mappings and my "do not
OOM kill if we had to skip writes due to congestion" patch.

The system can still be made to OOM kill, it just takes a day
instead of a few minutes.  And no, the process text, data and
libraries all live in highmem, which isn't scanned by the VM
because there's still 2.7GB free...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-24 16:22       ` Rik van Riel
@ 2004-12-24 16:40         ` Andrea Arcangeli
  2004-12-24 22:12           ` Rik van Riel
  0 siblings, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2004-12-24 16:40 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Fri, Dec 24, 2004 at 11:22:54AM -0500, Rik van Riel wrote:
> On Fri, 24 Dec 2004, Andrea Arcangeli wrote:
> 
> >So I recommend you to try again with at least "Andrew's
> >ignore-swap-token, Andrew's total_scanned, Con's disable-swap-token and
> >my lowmem_reserve". Effectively disable-swap-token obsoletes
> >ignore-swap-token, but both make sense together since just in case
> >somebody enables the feature, ignore-swap-token will give it a chance
> >not to generate spurious oom kills.
> 
> That makes little sense, since 99% of lowmem is in the page
> cache and not mapped into any process, so the swap token
> won't get involved at all.  Same for the lowmem_reserve patch,
> since the pagecache allocations for dding to a block device
> do not use __GFP_HIGHMEM, so the lowmem_reserve protection of
> low memory won't be activated.

Since you provided no debugging output I had to give you the full
recommendation. There was no sign that you didn't run out of low memory,
and I don't know what else is running on the box with the cp.

> I am already running with akpm's total_scanned, my lowering of
> the dirty limit for non-highmem capable mappings and my "do not
> OOM kill if we had to skip writes due to congestion" patch.
> 
> The system can still be made to OOM kill, it just takes a day

Did you apply Con's disable-swap-token leaving the sysctl to the default
value after applying that patch?

Of course I know if you don't apply Con's fix it will run oom, you don't
need a cp for that.


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-24 16:40         ` Andrea Arcangeli
@ 2004-12-24 22:12           ` Rik van Riel
  2004-12-25  2:07             ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2004-12-24 22:12 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Fri, 24 Dec 2004, Andrea Arcangeli wrote:

>> I am already running with akpm's total_scanned, my lowering of
>> the dirty limit for non-highmem capable mappings and my "do not
>> OOM kill if we had to skip writes due to congestion" patch.
>
> Did you apply Con's disable-swap-token leaving the sysctl to the default
> value after applying that patch?
>
> Of course I know if you don't apply Con's fix it will run oom, you don't
> need a cp for that.

The process 'dd', and all the other processes, live in
the highmem zone, which has 2.5GB of memory free. Now
tell me again why you think the swap token has any
relevance to those 950MB of pagecache that is filling
up lowmem ?


-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-24 22:12           ` Rik van Riel
@ 2004-12-25  2:07             ` Andrea Arcangeli
  2004-12-25 17:59               ` Rik van Riel
  0 siblings, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2004-12-25  2:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Fri, Dec 24, 2004 at 05:12:32PM -0500, Rik van Riel wrote:
> The process 'dd', and all the other processes, live in
> the highmem zone, which has 2.5GB of memory free. Now
> tell me again why you think the swap token has any
> relevance to those 950MB of pagecache that is filling
> up lowmem ?

If 2.5G of ram is really free, then how can the oom killer be invoked in
the first place? If that happens it means you're under a lowmem
shortage, something you apparently ruled out when you said
lowmem_reserve couldn't help your workload.

If you would post a vmstat before and after the oom killing plus the
exact oom killer syslog dump, it would help to see what's going on.

I sure can't reproduce your problem here with 2.6.10-rc3 + the 4 patches
I posted (so with swap-token disabled).


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-25  2:07             ` Andrea Arcangeli
@ 2004-12-25 17:59               ` Rik van Riel
  2004-12-25 18:36                 ` Andrea Arcangeli
  2004-12-25 19:07                 ` William Lee Irwin III
  0 siblings, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2004-12-25 17:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Sat, 25 Dec 2004, Andrea Arcangeli wrote:
> On Fri, Dec 24, 2004 at 05:12:32PM -0500, Rik van Riel wrote:
>> The process 'dd', and all the other processes, live in
>> the highmem zone, which has 2.5GB of memory free. Now
>> tell me again why you think the swap token has any
>> relevance to those 950MB of pagecache that is filling
>> up lowmem ?
>
> If 2.5G of ram is really free, then how can the oom killer be invoked in
> the first place? If that happens it means you're under a lowmem
> shortage, something you apparently ruled out when you said
> lowmem_reserve couldn't help your workload.

Let me explain a 3rd time:

1) run dd if=/dev/zero of=/dev/hdaN on a system with 4GB RAM

2) the pagecache mapping for /dev/hdaN can only come from
    lowmem, of which we have roughly 900MB

3) the dirty_limit is 40% of 4GB, or roughly 1.6GB - the dd
    from (1) will not throttle itself at all, but will just
    fill up lowmem without limitation

4) any memory that could be affected by the swap token (process
    text, data, stack, ...) is allocated with __GFP_HIGHMEM, so
    that all lives in the highmem zone with 2.5GB free

5) since dd is not being paged out at all, and can dirty memory
    without limit, the VM gets backed into a corner and will
    trigger an OOM kill - even though most of lowmem is simply
    dirty page cache

6) an unpatched 2.6.10-rc kernel will OOM kill in minutes on
    a test system here

7) Andrew's total_scanned patch, marcelo's vm-writeout-throttle patch
    and my two patches improve the situation a lot, and the OOM kill
    takes a day or so to be triggered

If you have any more questions as to why the bug happens, don't
hesitate to ask and I'll explain why this problem happens.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-25 17:59               ` Rik van Riel
@ 2004-12-25 18:36                 ` Andrea Arcangeli
  2004-12-25 19:07                 ` William Lee Irwin III
  1 sibling, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2004-12-25 18:36 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Sat, Dec 25, 2004 at 12:59:10PM -0500, Rik van Riel wrote:
> 4) any memory that could be affected by the swap token (process
>    text, data, stack, ...) is allocated with __GFP_HIGHMEM, so
>    that all lives in the highmem zone with 2.5GB free
> 5) since dd is not being paged out at all, and can dirty memory
>    without limit, the VM gets backed into a corner and will
>    trigger an OOM kill - even though most of lowmem is simply
>    dirty page cache

This shouldn't happen of course, and it's a bit hard to see how can it
work fine for 23 hours and break at the 24th hour since it's quite a
repetitive algorithm. (sure it could be a race or the algorithm being
very fragile, but I can't reproduce problems here)

Plus doing cp /dev/zero . should be even worse since it also fills up
the highmem.

Are you sure cron isn't spawning something big?

Anyway my point is that swap-token is _proven_ to trigger spurious oom
kills, so if you could just reproduce once with Con's patch applied and
default sysctl value, then you would provide the proof it's unrelated.

I agree with your reasoning, I think you're right, but I'd like to be
sure we're not missing something. There are definitely other reports
where the ignore-token patch wasn't enough and Con's patch fixed it.

I also recommend you to keep vmstat in the background, in my experience
swap token was filling all swap with freeable swapcache (but it wasn't
freeable due to the referenced ++ that swap-token does), and then the oom
killer was invoked despite all that freeable swapcache.

So on a computer that had plenty of lowmem and highmem free, in seconds
it would run out of memory with all swap allocated.

I agree dd shouldn't be enough, but the 1 day variable may be just some
big cron task that we didn't put into the equation.

So I still would like to see a `vmstat 1` before/after the killing, and
to hear the confirmation that Con's patch doesn't help.

The only thing I can imagine going wrong with `cp /dev/zero /dev/sd?`
while `cp /dev/zero .` works fine is the write throttling levels:
they might be taking highmem into account when they really cannot.
I mean nr_free_buffer_pages must be used by the write throttling and
not nr_free_pages, but I'd be surprised if this wasn't correct. You may
want to check this bit just in case. If this is correct then doing
cp /dev/zero . should fail too, no? I for sure can't reproduce here, and
by your same arguments about the highmem levels, it shouldn't matter how
much ram I have (I have 1G). The less ram I have, the worse it should
behave.

Without more data and without being able to reproduce I can't be more
helpful than this.

Thanks.


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-25 17:59               ` Rik van Riel
  2004-12-25 18:36                 ` Andrea Arcangeli
@ 2004-12-25 19:07                 ` William Lee Irwin III
  2004-12-25 20:03                   ` Andrea Arcangeli
                                     ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-12-25 19:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Andrew Morton, linux-kernel, Robert_Hentosh,
	Con Kolivas

On Sat, 25 Dec 2004, Andrea Arcangeli wrote:
>> the first place? If that happens it means you're under a lowmem
>> shortage, something you apparently ruled out when you said
>> lowmem_reserve couldn't help your workload.

On Sat, Dec 25, 2004 at 12:59:10PM -0500, Rik van Riel wrote:
> Let me explain a 3rd time:
[...]
> If you have any more questions as to why the bug happens, don't
> hesitate to ask and I'll explain why this problem happens.

This is an old and well-known problem.

Lifting the artificial lowmem restrictions on blockdev mappings
(thereby nuking mapping->gfp_mask altogether) would resolve a number of
problems, not that anything making that much sense could ever happen.


-- wli


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-25 19:07                 ` William Lee Irwin III
@ 2004-12-25 20:03                   ` Andrea Arcangeli
  2004-12-26  3:07                     ` William Lee Irwin III
  2004-12-25 22:03                   ` Nikita Danilov
  2005-01-02 15:11                   ` Jens Axboe
  2 siblings, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2004-12-25 20:03 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rik van Riel, Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Sat, Dec 25, 2004 at 11:07:10AM -0800, William Lee Irwin III wrote:
> Lifting the artificial lowmem restrictions on blockdev mappings
> (thereby nuking mapping->gfp_mask altogether) would resolve a number of
> problems, not that anything making that much sense could ever happen.

I recall that such restriction is needed only for the buffercache, or
you'd need to change _all_ the fs to kmap before accessing metadata
(this is partly already happening for the dir in pagecache, but not for
everything else).

Whatever the problem is (assuming there's really a problem in the write
throttling) it isn't going away by eliminating that restriction. Just
think of booting with mem=800M: it would run into the same issue that
happens right now with the artificial limitation and >=1G of ram.

2.4 has the same limitation and it has no problem with write throttling
(and for my part 2.6 is working fine too with the 4 patches I posted;
I am not able to reproduce it).


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-25 19:07                 ` William Lee Irwin III
  2004-12-25 20:03                   ` Andrea Arcangeli
@ 2004-12-25 22:03                   ` Nikita Danilov
  2004-12-26  3:16                     ` William Lee Irwin III
  2005-01-02 15:11                   ` Jens Axboe
  2 siblings, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2004-12-25 22:03 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrea Arcangeli, Andrew Morton, linux-kernel, Robert_Hentosh,
	Con Kolivas

William Lee Irwin III <wli@holomorphy.com> writes:

> On Sat, 25 Dec 2004, Andrea Arcangeli wrote:
>>> the first place? If that happens it means you're under a lowmem
>>> shortage, something you apparently ruled out when you said
>>> lowmem_reserve couldn't help your workload.
>
> On Sat, Dec 25, 2004 at 12:59:10PM -0500, Rik van Riel wrote:
>> Let me explain a 3rd time:
> [...]
>> If you have any more questions as to why the bug happens, don't
>> hesitate to ask and I'll explain why this problem happens.
>
> This is an old and well-known problem.
>
> Lifting the artificial lowmem restrictions on blockdev mappings
> (thereby nuking mapping->gfp_mask altogether) would resolve a number of
> problems, not that anything making that much sense could ever happen.

mapping->gfp_mask is used for other things beyond specifying a
zonelist. For example, file systems want all allocations inside a
transaction to be done with GFP_NOFS, which forces GFP_NOFS in
mapping->gfp_mask of meta-data address_spaces.
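
As an illustration of why the mask carries more than zone placement,
consider this small sketch. The X_GFP_* names and bit values are invented
for the example and are NOT the kernel's real flag encodings; the point
is only that one word holds both a "where may this come from" bit and
"what may the allocator do" bits.

```c
#include <assert.h>

/* Illustrative bit values only; not the kernel's real flag encodings. */
#define X_GFP_HIGHMEM	0x1u	/* allocation may come from highmem */
#define X_GFP_FS	0x2u	/* allocator may call back into the fs */
#define X_GFP_IO	0x4u	/* allocator may start I/O */

/* Zone placement question: may pages for this mapping live in highmem? */
int mask_allows_highmem(unsigned int mask)
{
	return (mask & X_GFP_HIGHMEM) != 0;
}

/* Behaviour question: may reclaim re-enter the filesystem? */
int mask_allows_fs_reentry(unsigned int mask)
{
	return (mask & X_GFP_FS) != 0;
}
```

In this model a meta-data mapping would use a mask without X_GFP_FS, so
dropping the per-mapping mask would lose that information as well as the
zone restriction.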

>
>
> -- wli

Nikita.


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-25 20:03                   ` Andrea Arcangeli
@ 2004-12-26  3:07                     ` William Lee Irwin III
  2005-01-02 16:10                       ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2004-12-26  3:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Sat, Dec 25, 2004 at 11:07:10AM -0800, William Lee Irwin III wrote:
>> Lifting the artificial lowmem restrictions on blockdev mappings
>> (thereby nuking mapping->gfp_mask altogether) would resolve a number of
>> problems, not that anything making that much sense could ever happen.

On Sat, Dec 25, 2004 at 09:03:49PM +0100, Andrea Arcangeli wrote:
> I recall that such restriction is needed only for the buffercache, or
> you'd need to change _all_ the fs to kmap before accessing metadata
> (this is partly already happening for the dir in pagecache, but not for
> everything else).
> Whatever the problem is (assuming there's really a problem in the write
> throttling) it isn't going away by eliminating that restriction. Just
> think of booting with mem=800M: it would run into the same issue that
> happens right now with the artificial limitation and >=1G of ram.
> 2.4 has the same limitation and it has no problem with write throttling
> (and for my part 2.6 is working fine too with the 4 patches I posted;
> I am not able to reproduce it).

The problem as posed is that the dirty memory limits are global, but
ZONE_NORMAL can be overwhelmed by dirty memory. bdev pagecache is as
surely subject to the zone limits as all others, but overwhelms them
and is not pressured because globally the thresholds are not tripped.

The sheer idiocy of physical placement restrictions imposed on behalf
of software is merely what's being exploited to artificially create
such a situation for a testcase and what users are tripping over daily.
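
The global-versus-zone mismatch can be shown with a small model. This is
a sketch only: struct zone_model and both check functions are made up for
illustration and do not correspond to kernel structures, and the page
counts in the note below are assumed numbers for a 4GB x86 box.

```c
#include <assert.h>

struct zone_model {
	unsigned long pages;	/* zone size in pages */
	unsigned long dirty;	/* dirty pages currently in the zone */
};

/* Global check: total dirty pages against a percentage of all memory. */
int over_global_limit(const struct zone_model *z, int n, unsigned int pct)
{
	unsigned long pages = 0, dirty = 0;

	for (int i = 0; i < n; i++) {
		pages += z[i].pages;
		dirty += z[i].dirty;
	}
	return dirty > pages * pct / 100;
}

/* Per-zone check: one zone's dirty pages against its own size. */
int over_zone_limit(const struct zone_model *z, unsigned int pct)
{
	return z->dirty > z->pages * pct / 100;
}
```

With a 229376-page normal zone that is entirely dirty and an 819200-page
highmem zone with nothing dirty, the global 40% check does not trip while
the per-zone check does.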


-- wli


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-25 22:03                   ` Nikita Danilov
@ 2004-12-26  3:16                     ` William Lee Irwin III
  0 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-12-26  3:16 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Andrea Arcangeli, Andrew Morton, linux-kernel, Robert_Hentosh,
	Con Kolivas

William Lee Irwin III <wli@holomorphy.com> writes:
[...]
>> Lifting the artificial lowmem restrictions on blockdev mappings
>> (thereby nuking mapping->gfp_mask altogether) would resolve a number of
>> problems, not that anything making that much sense could ever happen.

On Sun, Dec 26, 2004 at 01:03:14AM +0300, Nikita Danilov wrote:
> mapping->gfp_mask is used for other things beyond specifying a
> zonelist. For example, file systems want all allocations inside a
> transaction to be done with GFP_NOFS, which forces GFP_NOFS in
> mapping->gfp_mask of meta-data address_spaces.

It's news to me, but benign. ->gfp_mask appears to be folded into
some bitflag word now so there wouldn't be an inode size reduction
anyway. Per-mapping gfp masks sound like a poor fit from the above.


-- wli


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-25 19:07                 ` William Lee Irwin III
  2004-12-25 20:03                   ` Andrea Arcangeli
  2004-12-25 22:03                   ` Nikita Danilov
@ 2005-01-02 15:11                   ` Jens Axboe
  2005-01-02 16:18                     ` Andrea Arcangeli
  2005-01-02 20:03                     ` Andrew Morton
  2 siblings, 2 replies; 27+ messages in thread
From: Jens Axboe @ 2005-01-02 15:11 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rik van Riel, Andrea Arcangeli, Andrew Morton, linux-kernel,
	Robert_Hentosh, Con Kolivas

On Sat, Dec 25 2004, William Lee Irwin III wrote:
> On Sat, 25 Dec 2004, Andrea Arcangeli wrote:
> >> the first place? If that happens it means you're under a lowmem
> >> shortage, something you apparently ruled out when you said
> >> lowmem_reserve couldn't help your workload.
> 
> On Sat, Dec 25, 2004 at 12:59:10PM -0500, Rik van Riel wrote:
> > Let me explain a 3rd time:
> [...]
> > If you have any more questions as to why the bug happens, don't
> > hesitate to ask and I'll explain why this problem happens.
> 
> This is an old and well-known problem.
> 
> Lifting the artificial lowmem restrictions on blockdev mappings
> (thereby nuking mapping->gfp_mask altogether) would resolve a number of
> problems, not that anything making that much sense could ever happen.

It should be lifted for block devices, it doesn't make any sense.
mapping->gfp_mask is still needed for things like loop though, so it
cannot be nuked.

-- 
Jens Axboe



* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-26  3:07                     ` William Lee Irwin III
@ 2005-01-02 16:10                       ` Andrea Arcangeli
  2005-01-02 16:36                         ` William Lee Irwin III
  2005-01-02 16:53                         ` Rik van Riel
  0 siblings, 2 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2005-01-02 16:10 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rik van Riel, Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Sat, Dec 25, 2004 at 07:07:21PM -0800, William Lee Irwin III wrote:
> The problem as posed is that the dirty memory limits are global, but

What do you mean by global? Global is one thing, but taking highmem
into account for calculating the limit is another thing. The
nr_free_buffer_pages exists exactly to avoid taking highmem into account
for the dirty memory limits. 2.6 must also ignore highmem in the dirty
memory limits like 2.4 does. I'd be surprised if somebody broke this in
2.6. As far as I can tell, while writing to a blkdev it cannot make any
difference if you've 4G or 1G of ram because of that (I mean on x86 of
course).


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2005-01-02 15:11                   ` Jens Axboe
@ 2005-01-02 16:18                     ` Andrea Arcangeli
  2005-01-02 20:03                     ` Andrew Morton
  1 sibling, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2005-01-02 16:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: William Lee Irwin III, Rik van Riel, Andrew Morton, linux-kernel,
	Robert_Hentosh, Con Kolivas

On Sun, Jan 02, 2005 at 04:11:47PM +0100, Jens Axboe wrote:
> It should be lifted for block devices, it doesn't make any sense.

It cannot be lifted without:

1) creating aliasing between buffercache and blkdev pagecache
2) changing all fs to kmap around all buffercache accesses

2 would be a huge change (certainly not a good idea during 2.6; 2.7 if
anything). 1 would break lilo, tunefs and other tools that write to a
superblock while the fs is mounted.

I effectively wrote it like 2, but I learned the hard way that it broke
lilo in some weird configurations, and IIRC Linus and Al fixed it very
nicely with the current design.

There's no highmem and in turn no limit on 64bit in the first place, so
both efforts are worthless in the long term.


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2005-01-02 16:10                       ` Andrea Arcangeli
@ 2005-01-02 16:36                         ` William Lee Irwin III
  2005-01-02 16:53                         ` Rik van Riel
  1 sibling, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2005-01-02 16:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Andrew Morton, linux-kernel, Robert_Hentosh, Con Kolivas

On Sat, Dec 25, 2004 at 07:07:21PM -0800, William Lee Irwin III wrote:
>> The problem as posed is that the dirty memory limits are global, but

On Sun, Jan 02, 2005 at 05:10:08PM +0100, Andrea Arcangeli wrote:
> What do you mean by global? Global is one thing, but taking highmem
> into account for calculating the limit is another thing. The
> nr_free_buffer_pages exists exactly to avoid taking highmem into account
> for the dirty memory limits. 2.6 must also ignore highmem in the dirty
> memory limits like 2.4 does. I'd be surprised if somebody broke this in
> 2.6. As far as I can tell, while writing to a blkdev it cannot make any
> difference if you've 4G or 1G of ram because of that (I mean on x86 of
> course).

It's not used for any of these purposes in 2.6.x.


-- wli


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2005-01-02 16:10                       ` Andrea Arcangeli
  2005-01-02 16:36                         ` William Lee Irwin III
@ 2005-01-02 16:53                         ` Rik van Riel
  2005-01-02 17:21                           ` Andrea Arcangeli
  1 sibling, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2005-01-02 16:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Andrew Morton, linux-kernel,
	Robert_Hentosh, Con Kolivas

On Sun, 2 Jan 2005, Andrea Arcangeli wrote:

> nr_free_buffer_pages exists exactly to avoid taking highmem into account
> for the dirty memory limits. 2.6 must also ignore highmem in the dirty
> memory limits like 2.4 does. I'd be surprised if somebody broke this in
> 2.6.

2.6 does not ignore highmem when calculating the dirty memory
limits, which is causing problems.  That's why I sent in the
patch in the first place ;)
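
A minimal user-space sketch of the idea, not the patch itself: the
dirty limit is a percentage of some base amount of memory, and for a
mapping that cannot use highmem the base should exclude highmem. The
struct mem_totals type, the lowmem_only flag and both function names
are hypothetical; in the patch the test is
!(mapping_gfp_mask(mapping) & __GFP_HIGHMEM) and the percentage comes
from vm_dirty_ratio.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's memory totals, counted in pages. */
struct mem_totals {
	unsigned long total_pages;	/* all RAM */
	unsigned long totalhigh_pages;	/* the highmem part of it */
};

/*
 * Pages the dirty limit is computed against.  lowmem_only stands in for
 * a mapping whose gfp mask lacks __GFP_HIGHMEM (e.g. a block device).
 */
static unsigned long dirty_base_pages(const struct mem_totals *m,
				      bool lowmem_only)
{
	assert(m->totalhigh_pages <= m->total_pages);
	if (lowmem_only)
		return m->total_pages - m->totalhigh_pages;
	return m->total_pages;
}

/* Dirty limit in pages for a given percentage of the base. */
static unsigned long dirty_limit_pages(const struct mem_totals *m,
				       bool lowmem_only,
				       unsigned int ratio_pct)
{
	assert(ratio_pct <= 100);
	return dirty_base_pages(m, lowmem_only) * ratio_pct / 100;
}
```

With 1048576 total pages and 819200 highmem pages (4GB of RAM in 4K
pages, assuming 896MB of lowmem), a 40% ratio gives 419430 pages for a
highmem-capable mapping but only 91750 pages for a lowmem-only one.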

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2005-01-02 16:53                         ` Rik van Riel
@ 2005-01-02 17:21                           ` Andrea Arcangeli
  0 siblings, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2005-01-02 17:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: William Lee Irwin III, Andrew Morton, linux-kernel,
	Robert_Hentosh, Con Kolivas

On Sun, Jan 02, 2005 at 11:53:09AM -0500, Rik van Riel wrote:
> 2.6 does not ignore highmem when calculating the dirty memory
> limits, which is causing problems.  That's why I sent in the
> patch in the first place ;)

OK great, things are clear now. I apparently missed your original patch
in the noise the last time I checked this thread and focused only on
Andrew's proposed fix. The two patches are both good and orthogonal to
each other. I agree your patch is needed, and it should fix blkdev
writes on machines with more than 1G of RAM. Without it the VM is
guaranteed to run OOM in your setup, since whatever we write back can
be made dirty again before we can attempt to free it.


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2005-01-02 15:11                   ` Jens Axboe
  2005-01-02 16:18                     ` Andrea Arcangeli
@ 2005-01-02 20:03                     ` Andrew Morton
  2005-01-02 20:25                       ` William Lee Irwin III
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2005-01-02 20:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: wli, riel, andrea, linux-kernel, Robert_Hentosh, kernel

Jens Axboe <axboe@suse.de> wrote:
>
> > Lifting the artificial lowmem restrictions on blockdev mappings
>  > (thereby nuking mapping->gfp_mask altogether) would resolve a number of
>  > problems, not that anything making that much sense could ever happen.
> 
>  It should be lifted for block devices, it doesn't make any sense.

Before we can permit blockdev pagecache to use highmem we must convert
every piece of code which accesses the cache to use kmap/kmap_atomic.  If
you grep around for b_data you'll see there are a lot of such places.

Probably the migration could be done on a per-fs basis.
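
To illustrate the kind of conversion meant here, a user-space sketch
of the access pattern only. The kernel interfaces are kmap() and
kunmap() on a struct page; sketch_page, sketch_kmap(), sketch_kunmap()
and read_block() below are hypothetical stand-ins so the example
compiles outside the kernel. The point is that code may no longer
dereference a long-lived pointer such as b_data and must instead map
the page around each access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical page: a highmem page has no permanent kernel address. */
struct sketch_page {
	unsigned char backing[SKETCH_PAGE_SIZE];
	int map_count;	/* live mappings, used to check map/unmap balance */
};

/* Stand-in for kmap(): the address is valid until sketch_kunmap(). */
static void *sketch_kmap(struct sketch_page *page)
{
	page->map_count++;
	return page->backing;
}

/* Stand-in for kunmap(): drops the mapping taken above. */
static void sketch_kunmap(struct sketch_page *page)
{
	assert(page->map_count > 0);
	page->map_count--;
}

/* Copy len bytes out of the page, mapping it only around the access. */
static void read_block(struct sketch_page *page, size_t offset,
		       void *dst, size_t len)
{
	unsigned char *addr;

	assert(offset <= SKETCH_PAGE_SIZE);
	assert(len <= SKETCH_PAGE_SIZE - offset);

	addr = sketch_kmap(page);
	memcpy(dst, addr + offset, len);
	sketch_kunmap(page);
}
```

Every direct b_data user would need an equivalent map/unmap pair,
which is why a per-fs migration looks more tractable than a single
sweep.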


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2005-01-02 20:03                     ` Andrew Morton
@ 2005-01-02 20:25                       ` William Lee Irwin III
  0 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2005-01-02 20:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, riel, andrea, linux-kernel, Robert_Hentosh, kernel

At some point in the past, I wrote:
>>> Lifting the artificial lowmem restrictions on blockdev mappings
>>> (thereby nuking mapping->gfp_mask altogether) would resolve a number of
>>> problems, not that anything making that much sense could ever happen.

Jens Axboe <axboe@suse.de> wrote:
>>  It should be lifted for block devices, it doesn't make any sense.

On Sun, Jan 02, 2005 at 12:03:24PM -0800, Andrew Morton wrote:
> Before we can permit blockdev pagecache to use highmem we must convert
> every piece of code which accesses the cache to use kmap/kmap_atomic.  If
> you grep around for b_data you'll see there are a lot of such places.
> Probably the migration could be done on a per-fs basis.

I'd regard such an incremental conversion strategy as a prerequisite, and
would have no trouble working within such constraints.


-- wli


* Re: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
  2004-12-20 16:46 Robert_Hentosh
@ 2004-12-20 17:56 ` Sami Farin
  0 siblings, 0 replies; 27+ messages in thread
From: Sami Farin @ 2004-12-20 17:56 UTC (permalink / raw)
  To: linux-kernel

On Mon, Dec 20, 2004 at 10:46:48AM -0600, Robert_Hentosh@Dell.com wrote:
> 
> 
> > On Mon, 20 Dec 2004, Rik van Riel wrote:
> >
> >> Simply running "dd if=/dev/zero of=/dev/hd<one you can miss>"
> >> will result in OOM kills, with the dirty pagecache
> >> completely filling up lowmem.  This patch is part 1 to
> >> fixing that problem.
> >
> > What I forgot to say is that in order to trigger this OOM
> > Kill the dirty_limit of 40% needs to be more memory than
> > what fits in low memory.  So this will work on x86 with 
> > 4GB RAM, since the dirty_limit is 1.6GB, but the block 
> > device cache cannot grow that big because it is restricted
> > to low memory.
> >
> > This has the effect of all low memory being tied up in
> > Dirty page cache and userspace try_to_free_pages() skipping
> > the writeout of these pages because the block device is
> > congested.
> 
> I am just confirming that this is a real problem.  The problem 
> more frequently shows up with block sizes above 4k on the
> dd and also showed up on some platforms with just a mke2fs
> on a slower device such as a USB hard drive.
> 
> Rik's patch has solved the issue and has been running under
> stress (via ctcs) over the weekend without failure.  

Rik's patch was broken (word-wrapped by pine), but I applied it
manually.  However, I also have the tglx-oom-final patch, which moved
the out_of_memory() call from vmscan.c:try_to_free_pages()
to page_alloc.c:__alloc_pages().

Basically, the (sc.nr_congested < SWAP_CLUSTER_MAX) check is missing.
So, what's the best way to combine these two patches?

If you use mutt, the patch can be found with command
/~i 1102697553.3306.91.camel@tglx.tec.linutronix.de
from your LKML mailbox.

-- 


* RE: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
@ 2004-12-20 16:46 Robert_Hentosh
  2004-12-20 17:56 ` Sami Farin
  0 siblings, 1 reply; 27+ messages in thread
From: Robert_Hentosh @ 2004-12-20 16:46 UTC (permalink / raw)
  To: riel, akpm; +Cc: linux-kernel



> On Mon, 20 Dec 2004, Rik van Riel wrote:
>
>> Simply running "dd if=/dev/zero of=/dev/hd<one you can miss>"
>> will result in OOM kills, with the dirty pagecache
>> completely filling up lowmem.  This patch is part 1 to
>> fixing that problem.
>
> What I forgot to say is that in order to trigger this OOM
> Kill the dirty_limit of 40% needs to be more memory than
> what fits in low memory.  So this will work on x86 with 
> 4GB RAM, since the dirty_limit is 1.6GB, but the block 
> device cache cannot grow that big because it is restricted
> to low memory.
>
> This has the effect of all low memory being tied up in
> Dirty page cache and userspace try_to_free_pages() skipping
> the writeout of these pages because the block device is
> congested.

I am just confirming that this is a real problem.  The problem 
more frequently shows up with block sizes above 4k on the
dd and also showed up on some platforms with just a mke2fs
on a slower device such as a USB hard drive.

Rik's patch has solved the issue and has been running under
stress (via ctcs) over the weekend without failure.  

Regards,
Robert
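
The arithmetic behind the quoted explanation can be checked with a
short sketch. The helper names and the MIB macro are hypothetical, and
the 896 MiB lowmem figure used in the note below is an assumption
about a typical x86 split, not a number taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define MIB (1024ULL * 1024ULL)

/* Dirty limit in bytes: ratio_pct percent of the base it is computed on. */
static unsigned long long dirty_limit_bytes(unsigned long long base_bytes,
					    unsigned int ratio_pct)
{
	assert(ratio_pct <= 100);
	return base_bytes * ratio_pct / 100;
}

/*
 * True when throttling can never start: the cache subject to the limit
 * fills completely before the amount of dirty data reaches the limit.
 */
static bool limit_unreachable(unsigned long long limit_bytes,
			      unsigned long long cache_capacity_bytes)
{
	return limit_bytes > cache_capacity_bytes;
}
```

For 4096 MiB of RAM at 40% the limit is about 1638 MiB, which is more
than 896 MiB of lowmem, so limit_unreachable() is true and the writer
is never throttled.  Computed against lowmem alone the limit drops to
about 358 MiB and becomes reachable again.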



Thread overview: 27+ messages
2004-12-20 15:15 [PATCH][1/2] adjust dirty threshold for lowmem-only mappings Rik van Riel
2004-12-20 15:23 ` Rik van Riel
2004-12-20 20:54 ` Andrew Morton
2004-12-20 21:27   ` Rik van Riel
2004-12-23 19:21   ` Rik van Riel
2004-12-24 16:01     ` Andrea Arcangeli
2004-12-24 16:22       ` Rik van Riel
2004-12-24 16:40         ` Andrea Arcangeli
2004-12-24 22:12           ` Rik van Riel
2004-12-25  2:07             ` Andrea Arcangeli
2004-12-25 17:59               ` Rik van Riel
2004-12-25 18:36                 ` Andrea Arcangeli
2004-12-25 19:07                 ` William Lee Irwin III
2004-12-25 20:03                   ` Andrea Arcangeli
2004-12-26  3:07                     ` William Lee Irwin III
2005-01-02 16:10                       ` Andrea Arcangeli
2005-01-02 16:36                         ` William Lee Irwin III
2005-01-02 16:53                         ` Rik van Riel
2005-01-02 17:21                           ` Andrea Arcangeli
2004-12-25 22:03                   ` Nikita Danilov
2004-12-26  3:16                     ` William Lee Irwin III
2005-01-02 15:11                   ` Jens Axboe
2005-01-02 16:18                     ` Andrea Arcangeli
2005-01-02 20:03                     ` Andrew Morton
2005-01-02 20:25                       ` William Lee Irwin III
2004-12-20 16:46 Robert_Hentosh
2004-12-20 17:56 ` Sami Farin
