* [patch] adjustments to dirty memory thresholds
@ 2002-08-28  4:39 Andrew Morton
  2002-08-28 20:08 ` William Lee Irwin III
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2002-08-28  4:39 UTC (permalink / raw)
  To: lkml



Writeback parameter tuning.  Somewhat experimental, but heading in the
right direction, I hope.

- Allowing 40% of physical memory to be dirtied on massive ia32 boxes
  is unreasonable.  It pins too many buffer_heads and contributes to
  page reclaim latency.

The patch changes the initial value of
/proc/sys/vm/dirty_background_ratio, dirty_async_ratio and (the
presently non-functional) dirty_sync_ratio so that they are reduced
when the highmem:lowmem ratio exceeds 4:1.

These ratios are scaled so that as the highmem:lowmem ratio goes
beyond 4:1, the maximum amount of allowed dirty memory ceases to
increase.  It is clamped at the amount of memory which a 4:1 machine
is allowed to use.
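
To make the arithmetic concrete, here is roughly how the scaling in
page_writeback_init() below works out.  The 16G/896M figures are a
hypothetical example, not measurements:

/* hypothetical ia32 box: 16GB of RAM, ~896MB of lowmem, 4K pages */
total_pages  = 16UL << (30 - PAGE_SHIFT);               /* ~4.2M pages */
buffer_pages = 896UL << (20 - PAGE_SHIFT);              /* ~229K pages */

correction = (100 * 4 * buffer_pages) / total_pages;    /* ~21 */

dirty_background_ratio = (10 * correction) / 100;       /* 10% -> 2%  */
dirty_async_ratio      = (40 * correction) / 100;       /* 40% -> 8%  */
dirty_sync_ratio       = (50 * correction) / 100;       /* 50% -> 10% */

So the 16G box ends up allowed roughly 1.3G of dirty memory rather
than the ~6.5G which an unscaled 40% would permit.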

- Aggressive reduction in the dirty memory threshold at which
  background writeback cuts in.  2.4 uses 30% of ZONE_NORMAL.  2.5 uses
  40% of total memory.  This patch changes it to 10% of total memory
  (if total memory <= 4G; even less otherwise - see above).

This means that:

- Much more writeback is performed by pdflush.

- When the application is generating dirty data at a moderate
  rate, background writeback cuts in much earlier, so memory is
  cleaned more promptly.

- Reduces the risk of user applications getting stalled by writeback.

- Will damage dbench numbers.  So bite me.

  (It turns out that the damage is fairly small)

- Moderate reduction in the dirty level at which the write(2) caller
  is forced to perform writeback (throttling).  Was 50% of total
  memory.  Is now 40% of total memory (if total memory <= 4G, less
  otherwise).

  This is to reduce page reclaim latency, and generally because
  allowing processes to flood the machine with dirty data is a bad
  thing in mixed workloads.
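
  As a rough worked example for the common case (total memory <= 4G,
  so no highmem correction; the 1G figure is hypothetical): on a 1G
  box the new defaults start pdflush at ~10% of memory (about 100M of
  dirty and writeback pages), throttle write(2) callers at ~40% (about
  400M) and hit the sync level at ~50% (about 500M), where the old
  40/50/60 defaults allowed roughly 400M, 500M and 600M respectively.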




 page-writeback.c |   50 ++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 38 insertions(+), 12 deletions(-)

--- 2.5.32/mm/page-writeback.c~writeback-thresholds	Tue Aug 27 21:35:27 2002
+++ 2.5.32-akpm/mm/page-writeback.c	Tue Aug 27 21:35:27 2002
@@ -38,7 +38,12 @@
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */
-static int ratelimit_pages = 32;
+static long ratelimit_pages = 32;
+
+/*
+ * The total number of pages in the machine.
+ */
+static long total_pages;
 
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
@@ -60,17 +65,17 @@ static inline int sync_writeback_pages(v
 /*
  * Start background writeback (via pdflush) at this level
  */
-int dirty_background_ratio = 40;
+int dirty_background_ratio = 10;
 
 /*
  * The generator of dirty data starts async writeback at this level
  */
-int dirty_async_ratio = 50;
+int dirty_async_ratio = 40;
 
 /*
  * The generator of dirty data performs sync writeout at this level
  */
-int dirty_sync_ratio = 60;
+int dirty_sync_ratio = 50;
 
 /*
  * The interval between `kupdate'-style writebacks, in centiseconds
@@ -107,18 +112,17 @@ static void background_writeout(unsigned
  */
 void balance_dirty_pages(struct address_space *mapping)
 {
-	const int tot = nr_free_pagecache_pages();
 	struct page_state ps;
-	int background_thresh, async_thresh, sync_thresh;
+	long background_thresh, async_thresh, sync_thresh;
 	unsigned long dirty_and_writeback;
 	struct backing_dev_info *bdi;
 
 	get_page_state(&ps);
 	dirty_and_writeback = ps.nr_dirty + ps.nr_writeback;
 
-	background_thresh = (dirty_background_ratio * tot) / 100;
-	async_thresh = (dirty_async_ratio * tot) / 100;
-	sync_thresh = (dirty_sync_ratio * tot) / 100;
+	background_thresh = (dirty_background_ratio * total_pages) / 100;
+	async_thresh = (dirty_async_ratio * total_pages) / 100;
+	sync_thresh = (dirty_sync_ratio * total_pages) / 100;
 	bdi = mapping->backing_dev_info;
 
 	if (dirty_and_writeback > sync_thresh) {
@@ -171,13 +175,14 @@ void balance_dirty_pages_ratelimited(str
  */
 static void background_writeout(unsigned long _min_pages)
 {
-	const int tot = nr_free_pagecache_pages();
-	const int background_thresh = (dirty_background_ratio * tot) / 100;
 	long min_pages = _min_pages;
+	long background_thresh;
 	int nr_to_write;
 
 	CHECK_EMERGENCY_SYNC
 
+	background_thresh = (dirty_background_ratio * total_pages) / 100;
+
 	do {
 		struct page_state ps;
 
@@ -269,7 +274,7 @@ static void wb_timer_fn(unsigned long un
 
 static void set_ratelimit(void)
 {
-	ratelimit_pages = nr_free_pagecache_pages() / (num_online_cpus() * 32);
+	ratelimit_pages = total_pages / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
 	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
@@ -288,8 +293,29 @@ static struct notifier_block ratelimit_n
 	.next		= NULL,
 };
 
+/*
+ * If the machine has a large highmem:lowmem ratio then scale back the default
+ * dirty memory thresholds: allowing too much dirty highmem pins an excessive
+ * number of buffer_heads.
+ */
 static int __init page_writeback_init(void)
 {
+	long buffer_pages = nr_free_buffer_pages();
+	long correction;
+
+	total_pages = nr_free_pagecache_pages();
+
+	correction = (100 * 4 * buffer_pages) / total_pages;
+
+	if (correction < 100) {
+		dirty_background_ratio *= correction;
+		dirty_background_ratio /= 100;
+		dirty_async_ratio *= correction;
+		dirty_async_ratio /= 100;
+		dirty_sync_ratio *= correction;
+		dirty_sync_ratio /= 100;
+	}
+
 	init_timer(&wb_timer);
 	wb_timer.expires = jiffies + (dirty_writeback_centisecs * HZ) / 100;
 	wb_timer.data = 0;

.


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28  4:39 [patch] adjustments to dirty memory thresholds Andrew Morton
@ 2002-08-28 20:08 ` William Lee Irwin III
  2002-08-28 20:27   ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2002-08-28 20:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml

On Tue, Aug 27, 2002 at 09:39:09PM -0700, Andrew Morton wrote:
> These ratios are scaled so that as the highmem:lowmem ratio goes
> beyond 4:1, the maximum amount of allowed dirty memory ceases to
> increase.  It is clamped at the amount of memory which a 4:1 machine
> is allowed to use.

This is disturbing. I suspect this is only going to raise poor memory
utilization issues on highmem boxen. Of course, "f**k highmem" is such
a common refrain these days so that's probably falling on deaf ears.
AFAICT the OOM issues are largely a by-product of mempool allocations
entering out_of_memory() when they have the perfectly reasonable
alternative strategy of simply waiting for the mempool to refill.


Cheers,
Bill


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 20:08 ` William Lee Irwin III
@ 2002-08-28 20:27   ` Andrew Morton
  2002-08-28 21:42     ` William Lee Irwin III
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2002-08-28 20:27 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: lkml

William Lee Irwin III wrote:
> 
> On Tue, Aug 27, 2002 at 09:39:09PM -0700, Andrew Morton wrote:
> > These ratios are scaled so that as the highmem:lowmem ratio goes
> > beyond 4:1, the maximum amount of allowed dirty memory ceases to
> > increase.  It is clamped at the amount of memory which a 4:1 machine
> > is allowed to use.
> 
> This is disturbing. I suspect this is only going to raise poor memory
> utilization issues on highmem boxen.

The intent is to fix them.  Allowing more than 2G of dirty data to
float about seems unreasonable, and it pins buffer_heads.

But hey.  The patch merely sets the initial value of /proc/sys/vm/dirty*,
and those things are writeable.

> Of course, "f**k highmem" is such
> a common refrain these days so that's probably falling on deaf ears.

On the contrary.

> AFAICT the OOM issues are largely a by-product of mempool allocations
> entering out_of_memory() when they have the perfectly reasonable
> alternative strategy of simply waiting for the mempool to refill.

I don't have enough RAM to reproduce this.  Please send
call traces up from out_of_memory().


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 20:27   ` Andrew Morton
@ 2002-08-28 21:42     ` William Lee Irwin III
  2002-08-28 21:58       ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2002-08-28 21:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml

William Lee Irwin III wrote:
>> This is disturbing. I suspect this is only going to raise poor memory
>> utilization issues on highmem boxen.

On Wed, Aug 28, 2002 at 01:27:02PM -0700, Andrew Morton wrote:
> The intent is to fix them.  Allowing more than 2G of dirty data to
> float about seems unreasonable, and it pins buffer_heads.
> But hey.  The patch merely sets the initial value of /proc/sys/vm/dirty*,
> and those things are writeable.

Hmm. Then I've actually tested this... I can at least say it's stable,
even if I'm not wild about the approach.


William Lee Irwin III wrote:
>> AFAICT the OOM issues are largely a by-product of mempool allocations
>> entering out_of_memory() when they have the perfectly reasonable
>> alternative strategy of simply waiting for the mempool to refill.

On Wed, Aug 28, 2002 at 01:27:02PM -0700, Andrew Morton wrote:
> I don't have enough RAM to reproduce this.  Please send
> call traces up from out_of_memory().

I've already written the patch to address it, though of course, I can
post those traces along with the patch once it's rediffed. (It's trivial
though -- just a fresh GFP flag and a check for it before calling
out_of_memory(), setting it in mempool_alloc(), and ignoring it in
slab.c.) It requires several rounds of "un-throttling" to reproduce
the OOM's, the nature of which I've outlined elsewhere.

One such trace is below, some of the others might require repeating the
runs. It's actually a relatively deep call chain, I'd be worried about
blowing the stack at this point as well.


Cheers,
Bill

2.5.31-akpm + request queue size of 16384 + inode table size of 1024       
+ zone->wait_table max size of 65536 + MIN_PDFLUSH_THREADS == NR_CPUS
+ MAX_PDFLUSH_THREADS == 16*NR_CPUS on 16x/16GB x86 running 4
simultaneous tiobench --size $((4*1024)) --threads 256 on 4 disks.

They also pile up on ->i_sem of the dir they create files in, not sure
what to do about that aside from working around it in userspace. It
basically takes this kind of stuff so the things don't all fall asleep
on some resource or other, though the box is still pretty much idle.


#1  0xc013ba01 in oom_kill () at oom_kill.c:181
#2  0xc013ba7c in out_of_memory () at oom_kill.c:248
#3  0xc0137628 in try_to_free_pages (classzone=0xc039f300, gfp_mask=80,
    order=0) at vmscan.c:585
#4  0xc013831b in balance_classzone (classzone=0xc039f300, gfp_mask=80,
    order=0, freed=0xf7b0dc5c) at page_alloc.c:278
#5  0xc01385f7 in __alloc_pages (gfp_mask=80, order=0, zonelist=0xc02b4064)
    at page_alloc.c:401
#6  0xc013b777 in alloc_pages_pgdat (pgdat=0xc039f000, gfp_mask=80, order=0)
    at numa.c:77
#7  0xc013b7c3 in _alloc_pages (gfp_mask=80, order=0) at numa.c:105
#8  0xc013e440 in page_pool_alloc (gfp_mask=80, data=0x0) at highmem.c:33
#9  0xc013f395 in mempool_alloc (pool=0xf7b78d20, gfp_mask=80) at mempool.c:203
#10 0xc013ed85 in blk_queue_bounce (q=0xf76a941c, bio_orig=0xf7b0dd60)
    at highmem.c:397
#11 0xc01da088 in __make_request (q=0xf76a941c, bio=0xec0324a0)
    at ll_rw_blk.c:1481
#12 0xc01da5bf in generic_make_request (bio=0xec0324a0) at ll_rw_blk.c:1714
#13 0xc01da63c in submit_bio (rw=1, bio=0xec0324a0) at ll_rw_blk.c:1760
#14 0xc0161701 in mpage_bio_submit (rw=1, bio=0xec0324a0) at mpage.c:93
#15 0xc0162094 in mpage_writepages (mapping=0xed953d7c,
#16 0xc01722e0 in ext2_writepages (mapping=0xed953d7c, nr_to_write=0xf7b0df8c)
    at inode.c:636
#17 0xc0140a1a in do_writepages (mapping=0xed953d7c, nr_to_write=0xf7b0df8c)
    at page-writeback.c:372
#18 0xc0160b74 in __sync_single_inode (inode=0xed953cf4, wait=0,
    nr_to_write=0xf7b0df8c) at fs-writeback.c:147
#19 0xc0160d50 in __writeback_single_inode (inode=0xed953cf4, sync=0,
    nr_to_write=0xf7b0df8c) at fs-writeback.c:196
#20 0xc0160ec1 in sync_sb_inodes (single_bdi=0x0, sb=0xf6049c00, sync_mode=0,
    nr_to_write=0xf7b0df8c, older_than_this=0x0) at fs-writeback.c:270
#21 0xc016104d in __writeback_unlocked_inodes (bdi=0x0,
    nr_to_write=0xf7b0df8c, sync_mode=WB_SYNC_NONE, older_than_this=0x0)
    at fs-writeback.c:310
#22 0xc01610f6 in writeback_unlocked_inodes (nr_to_write=0xf7b0df8c,
    sync_mode=WB_SYNC_NONE, older_than_this=0x0) at fs-writeback.c:340
#23 0xc01407e9 in background_writeout (_min_pages=0) at page-writeback.c:188
#24 0xc0140408 in __pdflush (my_work=0xf7b0dfd4) at pdflush.c:120
#25 0xc01404f7 in pdflush (dummy=0x0) at pdflush.c:168


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 21:42     ` William Lee Irwin III
@ 2002-08-28 21:58       ` Andrew Morton
  2002-08-28 22:15         ` Andrew Morton
                           ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andrew Morton @ 2002-08-28 21:58 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: lkml

William Lee Irwin III wrote:
> 
> ...
> I've already written the patch to address it, though of course, I can
> post those traces along with the patch once it's rediffed. (It's trivial
> though -- just a fresh GFP flag and a check for it before calling
> out_of_memory(), setting it in mempool_alloc(), and ignoring it in
> slab.c.) It requires several rounds of "un-throttling" to reproduce
> the OOM's, the nature of which I've outlined elsewhere.

That's a sane approach.  mempool_alloc() is designed for allocations
which "must" succeed if you wait long enough.

In fact it might make sense to only perform a single scan of the
LRU if __GFP_WLI is set, rather than the increasing priority thing.

But sigh.  Pointlessly scanning zillions of dirty pages and doing nothing
with them is dumb.  So much better to go for a FIFO snooze on a per-zone
waitqueue, be woken when some memory has been cleansed.  (That's effectively
what mempool does, but it's all private and different).
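
Something like this, as a very rough sketch (the zone->cleaned_wait
field and the wakeup hook are invented here - nothing in 2.5.32
provides them):

/* hypothetical: a wait_queue_head_t cleaned_wait would live in zone_t */

static void zone_snooze(zone_t *zone)
{
        DECLARE_WAITQUEUE(wait, current);

        set_current_state(TASK_UNINTERRUPTIBLE);
        add_wait_queue_exclusive(&zone->cleaned_wait, &wait);
        schedule();     /* exclusive waiters are woken in FIFO order */
        remove_wait_queue(&zone->cleaned_wait, &wait);
}

/* hypothetical hook, called wherever writeback or reclaim cleans pages */
static void zone_cleaned(zone_t *zone)
{
        if (waitqueue_active(&zone->cleaned_wait))
                wake_up(&zone->cleaned_wait);
}

The caller would loop back and recheck the dirty counts after waking,
much as mempool_alloc() loops on its pool->wait queue.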

> One such trace is below, some of the others might require repeating the
> runs. It's actually a relatively deep call chain, I'd be worried about
> blowing the stack at this point as well.

Well it's presumably the GFP_NOIO which has killed it - we can't wait
on PG_writeback pages and we can't write out dirty pages.  Taking a
nap in mempool_alloc is appropriate.


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 21:58       ` Andrew Morton
@ 2002-08-28 22:15         ` Andrew Morton
  2002-08-29  0:26         ` Rik van Riel
  2002-08-29  3:49         ` William Lee Irwin III
  2 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2002-08-28 22:15 UTC (permalink / raw)
  To: William Lee Irwin III, lkml

Andrew Morton wrote:
> 
> ...
> Well it's presumably the GFP_NOIO which has killed it - we can't wait
> on PG_writeback pages and we can't write out dirty pages.  Taking a
> nap in mempool_alloc is appropriate.

Actually, it might be better to teach mempool_alloc to not call page reclaim
at all if __GFP_FS is not set.  Just kick bdflush and go to sleep.

I really, really, really dislike the VM's tendency to go and scan hundreds
of thousands of pages.  It's a clear sign of an inappropriate algorithm.

Test something like this, please?


--- 2.5.32/mm/mempool.c~wli	Wed Aug 28 15:07:31 2002
+++ 2.5.32-akpm/mm/mempool.c	Wed Aug 28 15:12:53 2002
@@ -196,10 +196,11 @@ repeat_alloc:
 		return element;
 
 	/*
-	 * If the pool is less than 50% full then try harder
-	 * to allocate an element:
+	 * If the pool is less than 50% full and we can perform effective
+	 * page reclaim then try harder to allocate an element:
 	 */
-	if ((gfp_mask != gfp_nowait) && (pool->curr_nr <= pool->min_nr/2)) {
+	if ((gfp_mask & __GFP_FS) && (gfp_mask != gfp_nowait) &&
+			(pool->curr_nr <= pool->min_nr/2)) {
 		element = pool->alloc(gfp_mask, pool->pool_data);
 		if (likely(element != NULL))
 			return element;

.


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 21:58       ` Andrew Morton
  2002-08-28 22:15         ` Andrew Morton
@ 2002-08-29  0:26         ` Rik van Riel
  2002-08-29  2:10           ` Andrew Morton
  2002-08-29  3:49         ` William Lee Irwin III
  2 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2002-08-29  0:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, lkml

On Wed, 28 Aug 2002, Andrew Morton wrote:

> But sigh.  Pointlessly scanning zillions of dirty pages and doing
> nothing with them is dumb.  So much better to go for a FIFO snooze on a
> per-zone waitqueue, be woken when some memory has been cleansed.

But not per-zone, since many (most?) allocations can be satisfied
from multiple zones.  Guess what 2.4-rmap has had for ages ?

Interested in a port for 2.5 on top of 2.5.32-mm2 ? ;)

[I'll mercilessly increase your patch queue since it doesn't show
any sign of ever shrinking anyway]

cheers,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  2:10           ` Andrew Morton
@ 2002-08-29  2:10             ` Rik van Riel
  2002-08-29  2:52               ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2002-08-29  2:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, lkml

On Wed, 28 Aug 2002, Andrew Morton wrote:
> Rik van Riel wrote:
> >
> > On Wed, 28 Aug 2002, Andrew Morton wrote:
> >
> > > But sigh.  Pointlessly scanning zillions of dirty pages and doing
> > > nothing with them is dumb.  So much better to go for a FIFO snooze on a
> > > per-zone waitqueue, be woken when some memory has been cleansed.
> >
> > But not per-zone, since many (most?) allocations can be satisfied
> > from multiple zones.  Guess what 2.4-rmap has had for ages ?
>
> Per-classzone ;)

I pull the NUMA-fallback card ;)

But serious, having one waitqueue for this case should be
fine. If the system is not under lots of VM pressure with
tons of dirty pages, kswapd will free pages as fast as
they get allocated.

If the system can't keep up and we have to wait for dirty
page writeout to finish before we can allocate more, it
shouldn't really matter how many waitqueues we have.
Except for the fact that having a more complex system can
introduce more opportunities for unfairness and starvation.

> > Interested in a port for 2.5 on top of 2.5.32-mm2 ? ;)
> >
> > [I'll mercilessly increase your patch queue since it doesn't show
> > any sign of ever shrinking anyway]
>
> Lack of patches is not a huge problem at present ;).  It's getting them
> tested for performance, stability and general does-good-thingsness
> which is the rate limiting step.

Yup, but if I were to wait for your queue to shrink I'd never get
any patches merged ;)

> But yes, I'm interested in a port of the code, and in the description
> of the problems which it solves, and how it solves them.

I'll introduce this stuff in 2 or 3 steps, with descriptions.

> But what is even more valuable than the code is a report of its
> before-and-after effectiveness under a broad range of loads on a broad
> range of hardware.  That's the most time-consuming part...

Eeeks ;)   I don't even have a broad range of hardware...

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  0:26         ` Rik van Riel
@ 2002-08-29  2:10           ` Andrew Morton
  2002-08-29  2:10             ` Rik van Riel
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2002-08-29  2:10 UTC (permalink / raw)
  To: Rik van Riel; +Cc: William Lee Irwin III, lkml

Rik van Riel wrote:
> 
> On Wed, 28 Aug 2002, Andrew Morton wrote:
> 
> > But sigh.  Pointlessly scanning zillions of dirty pages and doing
> > nothing with them is dumb.  So much better to go for a FIFO snooze on a
> > per-zone waitqueue, be woken when some memory has been cleansed.
> 
> But not per-zone, since many (most?) allocations can be satisfied
> from multiple zones.  Guess what 2.4-rmap has had for ages ?

Per-classzone ;)

> Interested in a port for 2.5 on top of 2.5.32-mm2 ? ;)
> 
> [I'll mercilessly increase your patch queue since it doesn't show
> any sign of ever shrinking anyway]

Lack of patches is not a huge problem at present ;).  It's getting them
tested for performance, stability and general does-good-thingsness
which is the rate limiting step.

The next really significant design change in the queue is slablru,
and we'll need to let that sit in partial isolation for a while to
make sure that it's doing what we want it to do.

But yes, I'm interested in a port of the code, and in the description
of the problems which it solves, and how it solves them.  But what is
even more valuable than the code is a report of its before-and-after
effectiveness under a broad range of loads on a broad range of 
hardware.  That's the most time-consuming part...


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  2:10             ` Rik van Riel
@ 2002-08-29  2:52               ` Andrew Morton
  2002-09-01  1:37                 ` William Lee Irwin III
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2002-08-29  2:52 UTC (permalink / raw)
  To: Rik van Riel; +Cc: William Lee Irwin III, lkml

Rik van Riel wrote:
> 
> On Wed, 28 Aug 2002, Andrew Morton wrote:
> > Rik van Riel wrote:
> > >
> > > On Wed, 28 Aug 2002, Andrew Morton wrote:
> > >
> > > > But sigh.  Pointlessly scanning zillions of dirty pages and doing
> > > > nothing with them is dumb.  So much better to go for a FIFO snooze on a
> > > > per-zone waitqueue, be woken when some memory has been cleansed.
> > >
> > > But not per-zone, since many (most?) allocations can be satisfied
> > > from multiple zones.  Guess what 2.4-rmap has had for ages ?
> >
> > Per-classzone ;)
> 
> I pull the NUMA-fallback card ;)

Ah, but you can never satisfy a NUMA person.

> But seriously, having one waitqueue for this case should be
> fine. If the system is not under lots of VM pressure with
> tons of dirty pages, kswapd will free pages as fast as
> they get allocated.
> 
> If the system can't keep up and we have to wait for dirty
> page writeout to finish before we can allocate more, it
> shouldn't really matter how many waitqueues we have.
> Except for the fact that having a more complex system can
> introduce more opportunities for unfairness and starvation.

Sure.  We have this lovely fast wakeup/context switch time.  Blowing
some cycles in this situation surely is not a problem.

But I do think we want to perform the wakeups from interrupt context;
there are just too many opportunities for kswapd to take an
extended vacation on a request queue.

Non-blocking writeout infrastructure would be nice, too.  And for
simple cases, that's just a matter of getting the block layer
to manage a flag in q->backing_dev_info.  But even that would result
in scanning past pages.  And every time we do that, there are
whacko corner cases which chew tons of CPU or cause oom failures.
Lists, lists, we need more lists!

hmm.  But mapping->backing_dev_info is trivially available in the
superblock scan, and in that case we can scan past entire congested
filesystems, rather than single congested pages.  hmm.

I suspect q->backing_dev_info gets inaccurate once we get into
stacking and striping at the block layer, but that's just an
efficiency failing, not an oops.
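
As a purely illustrative fragment of what I mean (bdi->state and
BDI_write_congested are invented for this sketch - nothing in 2.5.32
maintains them):

/*
 * Hypothetical check for the per-superblock inode walk in
 * sync_sb_inodes(): skip filesystems whose request queue is backed
 * up rather than letting pdflush block on them.
 */
struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;

if (bdi->state & BDI_write_congested)
        continue;       /* come back to this sb on a later pass */

__writeback_single_inode(inode, 0, nr_to_write);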

> ...
> 
> > But what is even more valuable than the code is a report of its
> > before-and-after effectiveness under a broad range of loads on a broad
> > range of hardware.  That's the most time-consuming part...
> 
> Eeeks ;)   I don't even have a broad range of hardware...
> 

Eeeks indeed.  But the main variables really are memory size,
IO bandwidth and workload.  That's manageable.

The traditional toss-it-in-and-see-who-complains approach will
catch the weird corner cases but it's slow turnaround.  I guess
as long as we know what the code is trying to do then it should be
fairly straightforward to verify that it's doing it.


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 21:58       ` Andrew Morton
  2002-08-28 22:15         ` Andrew Morton
  2002-08-29  0:26         ` Rik van Riel
@ 2002-08-29  3:49         ` William Lee Irwin III
  2002-08-29 12:37           ` Rik van Riel
  2 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2002-08-29  3:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml

William Lee Irwin III wrote:
>> I've already written the patch to address it, though of course, I can
>> post those traces along with the patch once it's rediffed. (It's trivial
>> though -- just a fresh GFP flag and a check for it before calling
>> out_of_memory(), setting it in mempool_alloc(), and ignoring it in
>> slab.c.) It requires several rounds of "un-throttling" to reproduce
>> the OOM's, the nature of which I've outlined elsewhere.

On Wed, Aug 28, 2002 at 02:58:20PM -0700, Andrew Morton wrote:
> That's a sane approach.  mempool_alloc() is designed for allocations
> which "must" succeed if you wait long enough.
> In fact it might make sense to only perform a single scan of the
> LRU if __GFP_WLI is set, rather than the increasing priority thing.
> But sigh.  Pointlessly scanning zillions of dirty pages and doing nothing
> with them is dumb.  So much better to go for a FIFO snooze on a per-zone
> waitqueue, be woken when some memory has been cleansed.  (That's effectively
> what mempool does, but it's all private and different).

Here's a stab in that direction, against 2.5.31. A trivially different
patch was tested and verified to solve the problems in practice. A
theoretical deadlock remains where a mempool allocator sleeps on general
purpose memory and is not woken when the mempool is replenished.


Cheers,
Bill


diff -urN linux-2.5.31-virgin/include/linux/gfp.h linux-2.5.31-nokill/include/linux/gfp.h
--- linux-2.5.31-virgin/include/linux/gfp.h	2002-08-10 18:41:24.000000000 -0700
+++ linux-2.5.31-nokill/include/linux/gfp.h	2002-08-28 02:22:55.000000000 -0700
@@ -17,6 +17,7 @@
 #define __GFP_IO	0x40	/* Can start low memory physical IO? */
 #define __GFP_HIGHIO	0x80	/* Can start high mem physical IO? */
 #define __GFP_FS	0x100	/* Can call down to low-level FS? */
+#define __GFP_NOKILL	0x200	/* Should not OOM kill */
 
 #define GFP_NOHIGHIO	(             __GFP_WAIT | __GFP_IO)
 #define GFP_NOIO	(             __GFP_WAIT)
diff -urN linux-2.5.31-virgin/include/linux/slab.h linux-2.5.31-nokill/include/linux/slab.h
--- linux-2.5.31-virgin/include/linux/slab.h	2002-08-10 18:41:28.000000000 -0700
+++ linux-2.5.31-nokill/include/linux/slab.h	2002-08-28 02:22:55.000000000 -0700
@@ -24,7 +24,7 @@
 #define	SLAB_NFS		GFP_NFS
 #define	SLAB_DMA		GFP_DMA
 
-#define SLAB_LEVEL_MASK		(__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_HIGHIO|__GFP_FS)
+#define SLAB_LEVEL_MASK		(__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_HIGHIO|__GFP_FS|__GFP_NOKILL)
 #define	SLAB_NO_GROW		0x00001000UL	/* don't grow a cache */
 
 /* flags to pass to kmem_cache_create().
diff -urN linux-2.5.31-virgin/mm/mempool.c linux-2.5.31-nokill/mm/mempool.c
--- linux-2.5.31-virgin/mm/mempool.c	2002-08-10 18:41:19.000000000 -0700
+++ linux-2.5.31-nokill/mm/mempool.c	2002-08-28 02:22:55.000000000 -0700
@@ -186,7 +186,11 @@
 	unsigned long flags;
 	int curr_nr;
 	DECLARE_WAITQUEUE(wait, current);
-	int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+	int gfp_nowait;
+
+	gfp_mask |= __GFP_NOKILL;
+
+	gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO | __GFP_NOKILL);
 
 repeat_alloc:
 	element = pool->alloc(gfp_nowait, pool->pool_data);
diff -urN linux-2.5.31-virgin/mm/vmscan.c linux-2.5.31-nokill/mm/vmscan.c
--- linux-2.5.31-virgin/mm/vmscan.c	2002-08-10 18:41:21.000000000 -0700
+++ linux-2.5.31-nokill/mm/vmscan.c	2002-08-28 03:17:15.000000000 -0700
@@ -401,23 +401,24 @@
 
 int try_to_free_pages(zone_t *classzone, unsigned int gfp_mask, unsigned int order)
 {
-	int priority = DEF_PRIORITY;
-	int nr_pages = SWAP_CLUSTER_MAX;
+	int priority, status, nr_pages = SWAP_CLUSTER_MAX;
 
 	KERNEL_STAT_INC(pageoutrun);
 
-	do {
+	for (priority = DEF_PRIORITY; priority; --priority) {
 		nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
-		if (nr_pages <= 0)
-			return 1;
-	} while (--priority);
+		status = (nr_pages <= 0) ? 1 : 0;
+		if (status || (gfp_mask & __GFP_NOKILL))
+			goto out;
+	}
 
 	/*
 	 * Hmm.. Cache shrink failed - time to kill something?
 	 * Mhwahahhaha! This is the part I really like. Giggle.
 	 */
 	out_of_memory();
-	return 0;
+out:
+	return status;
 }
 
 DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);


* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  3:49         ` William Lee Irwin III
@ 2002-08-29 12:37           ` Rik van Riel
  0 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2002-08-29 12:37 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrew Morton, lkml

On Wed, 28 Aug 2002, William Lee Irwin III wrote:

> +	gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO | __GFP_NOKILL);


I suspect what you want is (in vmscan.c):

-	out_of_memory();
+	if (gfp_mask & __GFP_FS)
+		out_of_memory();

This means we'll just never call out_of_memory() if we haven't
used all possibilities for freeing pages.

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  2:52               ` Andrew Morton
@ 2002-09-01  1:37                 ` William Lee Irwin III
  0 siblings, 0 replies; 13+ messages in thread
From: William Lee Irwin III @ 2002-09-01  1:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, lkml

On Wed, Aug 28, 2002 at 07:52:56PM -0700, Andrew Morton wrote:
> Eeeks indeed.  But the main variables really are memory size,
> IO bandwidth and workload.  That's manageable.
> The traditional toss-it-in-and-see-who-complains approach will
> catch the weird corner cases but it's slow turnaround.  I guess
> as long as we know what the code is trying to do then it should be
> fairly straightforward to verify that it's doing it.

Okay, I'm not sure which message in the thread to respond to, but
since I can't find a public statement to this effect: in my testing,
all 3 OOM patches behave identically.


Cheers,
Bill


