* [patch] adjustments to dirty memory thresholds
@ 2002-08-28  4:39 Andrew Morton
  2002-08-28 20:08 ` William Lee Irwin III
  0 siblings, 1 reply; 13+ messages in thread

From: Andrew Morton @ 2002-08-28 4:39 UTC (permalink / raw)
To: lkml

Writeback parameter tuning.  Somewhat experimental, but heading in the
right direction, I hope.

- Allowing 40% of physical memory to be dirtied on massive ia32 boxes
  is unreasonable.  It pins too many buffer_heads and contributes to
  page reclaim latency.

  The patch changes the initial value of
  /proc/sys/vm/dirty_background_ratio, dirty_async_ratio and (the
  presently non-functional) dirty_sync_ratio so that they are reduced
  when the highmem:lowmem ratio exceeds 4:1.

  These ratios are scaled so that as the highmem:lowmem ratio goes
  beyond 4:1, the maximum amount of allowed dirty memory ceases to
  increase.  It is clamped at the amount of memory which a 4:1 machine
  is allowed to use.

- Aggressive reduction in the dirty memory threshold at which
  background writeback cuts in.  2.4 uses 30% of ZONE_NORMAL.  2.5
  uses 40% of total memory.  This patch changes it to 10% of total
  memory (if total memory <= 4G; even less otherwise - see above).

  This means that:

  - Much more writeback is performed by pdflush.

  - When the application is generating dirty data at a moderate rate,
    background writeback cuts in much earlier, so memory is cleaned
    more promptly.

  - Reduces the risk of user applications getting stalled by writeback.

  - Will damage dbench numbers.  So bite me.  (It turns out that the
    damage is fairly small)

- Moderate reduction in the dirty level at which the write(2) caller
  is forced to perform writeback (throttling).  Was 40% of total
  memory.  Is now 30% of total memory (if total memory <= 4G, less
  otherwise).

  This is to reduce page reclaim latency, and generally because
  allowing processes to flood the machine with dirty data is a bad
  thing in mixed workloads.
 page-writeback.c |   50 ++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 38 insertions(+), 12 deletions(-)

--- 2.5.32/mm/page-writeback.c~writeback-thresholds	Tue Aug 27 21:35:27 2002
+++ 2.5.32-akpm/mm/page-writeback.c	Tue Aug 27 21:35:27 2002
@@ -38,7 +38,12 @@
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */
-static int ratelimit_pages = 32;
+static long ratelimit_pages = 32;
+
+/*
+ * The total number of pages in the machine.
+ */
+static long total_pages;
 
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
@@ -60,17 +65,17 @@ static inline int sync_writeback_pages(v
 /*
  * Start background writeback (via pdflush) at this level
  */
-int dirty_background_ratio = 40;
+int dirty_background_ratio = 10;
 
 /*
  * The generator of dirty data starts async writeback at this level
  */
-int dirty_async_ratio = 50;
+int dirty_async_ratio = 40;
 
 /*
  * The generator of dirty data performs sync writeout at this level
  */
-int dirty_sync_ratio = 60;
+int dirty_sync_ratio = 50;
 
 /*
  * The interval between `kupdate'-style writebacks, in centiseconds
@@ -107,18 +112,17 @@ static void background_writeout(unsigned
  */
 void balance_dirty_pages(struct address_space *mapping)
 {
-	const int tot = nr_free_pagecache_pages();
 	struct page_state ps;
-	int background_thresh, async_thresh, sync_thresh;
+	long background_thresh, async_thresh, sync_thresh;
 	unsigned long dirty_and_writeback;
 	struct backing_dev_info *bdi;
 
 	get_page_state(&ps);
 	dirty_and_writeback = ps.nr_dirty + ps.nr_writeback;
 
-	background_thresh = (dirty_background_ratio * tot) / 100;
-	async_thresh = (dirty_async_ratio * tot) / 100;
-	sync_thresh = (dirty_sync_ratio * tot) / 100;
+	background_thresh = (dirty_background_ratio * total_pages) / 100;
+	async_thresh = (dirty_async_ratio * total_pages) / 100;
+	sync_thresh = (dirty_sync_ratio * total_pages) / 100;
 	bdi = mapping->backing_dev_info;
 
 	if (dirty_and_writeback > sync_thresh) {
@@ -171,13 +175,14 @@ void balance_dirty_pages_ratelimited(str
  */
 static void background_writeout(unsigned long _min_pages)
 {
-	const int tot = nr_free_pagecache_pages();
-	const int background_thresh = (dirty_background_ratio * tot) / 100;
 	long min_pages = _min_pages;
+	long background_thresh;
 	int nr_to_write;
 
 	CHECK_EMERGENCY_SYNC
 
+	background_thresh = (dirty_background_ratio * total_pages) / 100;
+
 	do {
 		struct page_state ps;
@@ -269,7 +274,7 @@ static void wb_timer_fn(unsigned long un
 static void set_ratelimit(void)
 {
-	ratelimit_pages = nr_free_pagecache_pages() / (num_online_cpus() * 32);
+	ratelimit_pages = total_pages / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
 	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
@@ -288,8 +293,29 @@ static struct notifier_block ratelimit_n
 	.next = NULL,
 };
 
+/*
+ * If the machine has a large highmem:lowmem ratio then scale back the default
+ * dirty memory thresholds: allowing too much dirty highmem pins an excessive
+ * number of buffer_heads.
+ */
 static int __init page_writeback_init(void)
 {
+	long buffer_pages = nr_free_buffer_pages();
+	long correction;
+
+	total_pages = nr_free_pagecache_pages();
+
+	correction = (100 * 4 * buffer_pages) / total_pages;
+
+	if (correction < 100) {
+		dirty_background_ratio *= correction;
+		dirty_background_ratio /= 100;
+		dirty_async_ratio *= correction;
+		dirty_async_ratio /= 100;
+		dirty_sync_ratio *= correction;
+		dirty_sync_ratio /= 100;
+	}
+
 	init_timer(&wb_timer);
 	wb_timer.expires = jiffies + (dirty_writeback_centisecs * HZ) / 100;
 	wb_timer.data = 0;
.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28  4:39 [patch] adjustments to dirty memory thresholds Andrew Morton
@ 2002-08-28 20:08 ` William Lee Irwin III
  2002-08-28 20:27   ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread

From: William Lee Irwin III @ 2002-08-28 20:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml

On Tue, Aug 27, 2002 at 09:39:09PM -0700, Andrew Morton wrote:
> These ratios are scaled so that as the highmem:lowmem ratio goes
> beyond 4:1, the maximum amount of allowed dirty memory ceases to
> increase.  It is clamped at the amount of memory which a 4:1 machine
> is allowed to use.

This is disturbing.  I suspect this is only going to raise poor memory
utilization issues on highmem boxen.  Of course, "f**k highmem" is such
a common refrain these days so that's probably falling on deaf ears.

AFAICT the OOM issues are largely a by-product of mempool allocations
entering out_of_memory() when they have the perfectly reasonable
alternative strategy of simply waiting for the mempool to refill.

Cheers,
Bill

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 20:08 ` William Lee Irwin III
@ 2002-08-28 20:27   ` Andrew Morton
  2002-08-28 21:42     ` William Lee Irwin III
  0 siblings, 1 reply; 13+ messages in thread

From: Andrew Morton @ 2002-08-28 20:27 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: lkml

William Lee Irwin III wrote:
>
> On Tue, Aug 27, 2002 at 09:39:09PM -0700, Andrew Morton wrote:
> > These ratios are scaled so that as the highmem:lowmem ratio goes
> > beyond 4:1, the maximum amount of allowed dirty memory ceases to
> > increase.  It is clamped at the amount of memory which a 4:1 machine
> > is allowed to use.
>
> This is disturbing.  I suspect this is only going to raise poor memory
> utilization issues on highmem boxen.

The intent is to fix them.  Allowing more than 2G of dirty data to
float about seems unreasonable, and it pins buffer_heads.

But hey.  The patch merely sets the initial value of /proc/sys/vm/dirty*,
and those things are writeable.

> Of course, "f**k highmem" is such
> a common refrain these days so that's probably falling on deaf ears.

On the contrary.

> AFAICT the OOM issues are largely a by-product of mempool allocations
> entering out_of_memory() when they have the perfectly reasonable
> alternative strategy of simply waiting for the mempool to refill.

I don't have enough RAM to reproduce this.  Please send call traces up
from out_of_memory().

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 20:27 ` Andrew Morton
@ 2002-08-28 21:42   ` William Lee Irwin III
  2002-08-28 21:58     ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread

From: William Lee Irwin III @ 2002-08-28 21:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml

William Lee Irwin III wrote:
>> This is disturbing.  I suspect this is only going to raise poor memory
>> utilization issues on highmem boxen.

On Wed, Aug 28, 2002 at 01:27:02PM -0700, Andrew Morton wrote:
> The intent is to fix them.  Allowing more than 2G of dirty data to
> float about seems unreasonable, and it pins buffer_heads.
> But hey.  The patch merely sets the initial value of /proc/sys/vm/dirty*,
> and those things are writeable.

Hmm.  Then I've actually tested this...  I can at least say it's stable,
even if I'm not wild about the approach.

William Lee Irwin III wrote:
>> AFAICT the OOM issues are largely a by-product of mempool allocations
>> entering out_of_memory() when they have the perfectly reasonable
>> alternative strategy of simply waiting for the mempool to refill.

On Wed, Aug 28, 2002 at 01:27:02PM -0700, Andrew Morton wrote:
> I don't have enough RAM to reproduce this.  Please send
> call traces up from out_of_memory().

I've already written the patch to address it, though of course, I can
post those traces along with the patch once it's rediffed.  (It's
trivial though -- just a fresh GFP flag and a check for it before
calling out_of_memory(), setting it in mempool_alloc(), and ignoring
it in slab.c.)  It requires several rounds of "un-throttling" to
reproduce the OOMs, the nature of which I've outlined elsewhere.

One such trace is below; some of the others might require repeating
the runs.  It's actually a relatively deep call chain, I'd be worried
about blowing the stack at this point as well.
Cheers,
Bill

2.5.31-akpm + request queue size of 16384 + inode table size of 1024
+ zone->wait_table max size of 65536 + MIN_PDFLUSH_THREADS == NR_CPUS
+ MAX_PDFLUSH_THREADS == 16*NR_CPUS on 16x/16GB x86 running 4
simultaneous tiobench --size $((4*1024)) --threads 256 on 4 disks.

They also pile up on ->i_sem of the dir they create files in, not sure
what to do about that aside from working around it in userspace.  It
basically takes this kind of stuff so the things don't all fall asleep
on some resource or other, though the box is still pretty much idle.

#1  0xc013ba01 in oom_kill () at oom_kill.c:181
#2  0xc013ba7c in out_of_memory () at oom_kill.c:248
#3  0xc0137628 in try_to_free_pages (classzone=0xc039f300, gfp_mask=80, order=0) at vmscan.c:585
#4  0xc013831b in balance_classzone (classzone=0xc039f300, gfp_mask=80, order=0, freed=0xf7b0dc5c) at page_alloc.c:278
#5  0xc01385f7 in __alloc_pages (gfp_mask=80, order=0, zonelist=0xc02b4064) at page_alloc.c:401
#6  0xc013b777 in alloc_pages_pgdat (pgdat=0xc039f000, gfp_mask=80, order=0) at numa.c:77
#7  0xc013b7c3 in _alloc_pages (gfp_mask=80, order=0) at numa.c:105
#8  0xc013e440 in page_pool_alloc (gfp_mask=80, data=0x0) at highmem.c:33
#9  0xc013f395 in mempool_alloc (pool=0xf7b78d20, gfp_mask=80) at mempool.c:203
#10 0xc013ed85 in blk_queue_bounce (q=0xf76a941c, bio_orig=0xf7b0dd60) at highmem.c:397
#11 0xc01da088 in __make_request (q=0xf76a941c, bio=0xec0324a0) at ll_rw_blk.c:1481
#12 0xc01da5bf in generic_make_request (bio=0xec0324a0) at ll_rw_blk.c:1714
#13 0xc01da63c in submit_bio (rw=1, bio=0xec0324a0) at ll_rw_blk.c:1760
#14 0xc0161701 in mpage_bio_submit (rw=1, bio=0xec0324a0) at mpage.c:93
#15 0xc0162094 in mpage_writepages (mapping=0xed953d7c,
#16 0xc01722e0 in ext2_writepages (mapping=0xed953d7c, nr_to_write=0xf7b0df8c) at inode.c:636
#17 0xc0140a1a in do_writepages (mapping=0xed953d7c, nr_to_write=0xf7b0df8c) at page-writeback.c:372
#18 0xc0160b74 in __sync_single_inode (inode=0xed953cf4, wait=0, nr_to_write=0xf7b0df8c) at fs-writeback.c:147
#19 0xc0160d50 in __writeback_single_inode (inode=0xed953cf4, sync=0, nr_to_write=0xf7b0df8c) at fs-writeback.c:196
#20 0xc0160ec1 in sync_sb_inodes (single_bdi=0x0, sb=0xf6049c00, sync_mode=0, nr_to_write=0xf7b0df8c, older_than_this=0x0) at fs-writeback.c:270
#21 0xc016104d in __writeback_unlocked_inodes (bdi=0x0, nr_to_write=0xf7b0df8c, sync_mode=WB_SYNC_NONE, older_than_this=0x0) at fs-writeback.c:310
#22 0xc01610f6 in writeback_unlocked_inodes (nr_to_write=0xf7b0df8c, sync_mode=WB_SYNC_NONE, older_than_this=0x0) at fs-writeback.c:340
#23 0xc01407e9 in background_writeout (_min_pages=0) at page-writeback.c:188
#24 0xc0140408 in __pdflush (my_work=0xf7b0dfd4) at pdflush.c:120
#25 0xc01404f7 in pdflush (dummy=0x0) at pdflush.c:168

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 21:42 ` William Lee Irwin III
@ 2002-08-28 21:58   ` Andrew Morton
  2002-08-28 22:15     ` Andrew Morton
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread

From: Andrew Morton @ 2002-08-28 21:58 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: lkml

William Lee Irwin III wrote:
>
> ...
> I've already written the patch to address it, though of course, I can
> post those traces along with the patch once it's rediffed.  (It's trivial
> though -- just a fresh GFP flag and a check for it before calling
> out_of_memory(), setting it in mempool_alloc(), and ignoring it in
> slab.c.)  It requires several rounds of "un-throttling" to reproduce
> the OOM's, the nature of which I've outlined elsewhere.

That's a sane approach.  mempool_alloc() is designed for allocations
which "must" succeed if you wait long enough.

In fact it might make sense to only perform a single scan of the
LRU if __GFP_WLI is set, rather than the increasing priority thing.

But sigh.  Pointlessly scanning zillions of dirty pages and doing
nothing with them is dumb.  So much better to go for a FIFO snooze on
a per-zone waitqueue, be woken when some memory has been cleansed.
(That's effectively what mempool does, but it's all private and
different).

> One such trace is below, some of the others might require repeating the
> runs.  It's actually a relatively deep call chain, I'd be worried about
> blowing the stack at this point as well.

Well it's presumably the GFP_NOIO which has killed it - we can't wait
on PG_writeback pages and we can't write out dirty pages.  Taking a
nap in mempool_alloc is appropriate.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 21:58 ` Andrew Morton
@ 2002-08-28 22:15   ` Andrew Morton
  2002-08-29  0:26   ` Rik van Riel
  2002-08-29  3:49   ` William Lee Irwin III
  2 siblings, 0 replies; 13+ messages in thread

From: Andrew Morton @ 2002-08-28 22:15 UTC (permalink / raw)
To: William Lee Irwin III, lkml

Andrew Morton wrote:
>
> ...
> Well it's presumably the GFP_NOIO which has killed it - we can't wait
> on PG_writeback pages and we can't write out dirty pages.  Taking a
> nap in mempool_alloc is appropriate.

Actually, it might be better to teach mempool_alloc to not call page
reclaim at all if __GFP_FS is not set.  Just kick bdflush and go to
sleep.

I really, really, really dislike the VM's tendency to go and scan
hundreds of thousands of pages.  It's a clear sign of an inappropriate
algorithm.

Test something like this, please?

--- 2.5.32/mm/mempool.c~wli	Wed Aug 28 15:07:31 2002
+++ 2.5.32-akpm/mm/mempool.c	Wed Aug 28 15:12:53 2002
@@ -196,10 +196,11 @@ repeat_alloc:
 		return element;
 
 	/*
-	 * If the pool is less than 50% full then try harder
-	 * to allocate an element:
+	 * If the pool is less than 50% full and we can perform effective
+	 * page reclaim then try harder to allocate an element:
 	 */
-	if ((gfp_mask != gfp_nowait) && (pool->curr_nr <= pool->min_nr/2)) {
+	if ((gfp_mask & __GFP_FS) && (gfp_mask != gfp_nowait) &&
+			(pool->curr_nr <= pool->min_nr/2)) {
 		element = pool->alloc(gfp_mask, pool->pool_data);
 		if (likely(element != NULL))
 			return element;
.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 21:58 ` Andrew Morton
  2002-08-28 22:15   ` Andrew Morton
@ 2002-08-29  0:26   ` Rik van Riel
  2002-08-29  2:10     ` Andrew Morton
  2002-08-29  3:49   ` William Lee Irwin III
  2 siblings, 1 reply; 13+ messages in thread

From: Rik van Riel @ 2002-08-29 0:26 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, lkml

On Wed, 28 Aug 2002, Andrew Morton wrote:

> But sigh.  Pointlessly scanning zillions of dirty pages and doing
> nothing with them is dumb.  So much better to go for a FIFO snooze on a
> per-zone waitqueue, be woken when some memory has been cleansed.

But not per-zone, since many (most?) allocations can be satisfied
from multiple zones.  Guess what 2.4-rmap has had for ages?

Interested in a port for 2.5 on top of 2.5.32-mm2? ;)

[I'll mercilessly increase your patch queue since it doesn't show
any sign of ever shrinking anyway]

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  0:26 ` Rik van Riel
@ 2002-08-29  2:10   ` Andrew Morton
  2002-08-29  2:10     ` Rik van Riel
  0 siblings, 1 reply; 13+ messages in thread

From: Andrew Morton @ 2002-08-29 2:10 UTC (permalink / raw)
To: Rik van Riel; +Cc: William Lee Irwin III, lkml

Rik van Riel wrote:
>
> On Wed, 28 Aug 2002, Andrew Morton wrote:
>
> > But sigh.  Pointlessly scanning zillions of dirty pages and doing
> > nothing with them is dumb.  So much better to go for a FIFO snooze on a
> > per-zone waitqueue, be woken when some memory has been cleansed.
>
> But not per-zone, since many (most?) allocations can be satisfied
> from multiple zones.  Guess what 2.4-rmap has had for ages?

Per-classzone ;)

> Interested in a port for 2.5 on top of 2.5.32-mm2? ;)
>
> [I'll mercilessly increase your patch queue since it doesn't show
> any sign of ever shrinking anyway]

Lack of patches is not a huge problem at present ;).  It's getting them
tested for performance, stability and general does-good-thingsness
which is the rate-limiting step.

The next really significant design change in the queue is slablru, and
we'll need to let that sit in partial isolation for a while to make
sure that it's doing what we want it to do.

But yes, I'm interested in a port of the code, and in the description
of the problems which it solves, and how it solves them.

But what is even more valuable than the code is a report of its
before-and-after effectiveness under a broad range of loads on a broad
range of hardware.  That's the most time-consuming part...

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  2:10 ` Andrew Morton
@ 2002-08-29  2:10   ` Rik van Riel
  2002-08-29  2:52     ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread

From: Rik van Riel @ 2002-08-29 2:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, lkml

On Wed, 28 Aug 2002, Andrew Morton wrote:
> Rik van Riel wrote:
> >
> > On Wed, 28 Aug 2002, Andrew Morton wrote:
> >
> > > But sigh.  Pointlessly scanning zillions of dirty pages and doing
> > > nothing with them is dumb.  So much better to go for a FIFO snooze on a
> > > per-zone waitqueue, be woken when some memory has been cleansed.
> >
> > But not per-zone, since many (most?) allocations can be satisfied
> > from multiple zones.  Guess what 2.4-rmap has had for ages?
>
> Per-classzone ;)

I pull the NUMA-fallback card ;)

But seriously, having one waitqueue for this case should be fine.  If
the system is not under lots of VM pressure with tons of dirty pages,
kswapd will free pages as fast as they get allocated.

If the system can't keep up and we have to wait for dirty page writeout
to finish before we can allocate more, it shouldn't really matter how
many waitqueues we have.  Except for the fact that having a more
complex system can introduce more opportunities for unfairness and
starvation.

> > Interested in a port for 2.5 on top of 2.5.32-mm2? ;)
> >
> > [I'll mercilessly increase your patch queue since it doesn't show
> > any sign of ever shrinking anyway]
>
> Lack of patches is not a huge problem at present ;).  It's getting them
> tested for performance, stability and general does-good-thingsness
> which is the rate limiting step.

Yup, but if I were to wait for your queue to shrink I'd never get any
patches merged ;)

> But yes, I'm interested in a port of the code, and in the description
> of the problems which it solves, and how it solves them.

I'll introduce this stuff in 2 or 3 steps, with descriptions.
> But what is even more valuable than the code is a report of its
> before-and-after effectiveness under a broad range of loads on a broad
> range of hardware.  That's the most time-consuming part...

Eeeks ;)  I don't even have a broad range of hardware...

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  2:10 ` Rik van Riel
@ 2002-08-29  2:52   ` Andrew Morton
  2002-09-01  1:37     ` William Lee Irwin III
  0 siblings, 1 reply; 13+ messages in thread

From: Andrew Morton @ 2002-08-29 2:52 UTC (permalink / raw)
To: Rik van Riel; +Cc: William Lee Irwin III, lkml

Rik van Riel wrote:
>
> On Wed, 28 Aug 2002, Andrew Morton wrote:
> > Rik van Riel wrote:
> > >
> > > On Wed, 28 Aug 2002, Andrew Morton wrote:
> > >
> > > > But sigh.  Pointlessly scanning zillions of dirty pages and doing
> > > > nothing with them is dumb.  So much better to go for a FIFO snooze on a
> > > > per-zone waitqueue, be woken when some memory has been cleansed.
> > >
> > > But not per-zone, since many (most?) allocations can be satisfied
> > > from multiple zones.  Guess what 2.4-rmap has had for ages ?
> >
> > Per-classzone ;)
>
> I pull the NUMA-fallback card ;)

Ah, but you can never satisfy a NUMA person.

> But serious, having one waitqueue for this case should be
> fine.  If the system is not under lots of VM pressure with
> tons of dirty pages, kswapd will free pages as fast as
> they get allocated.
>
> If the system can't keep up and we have to wait for dirty
> page writeout to finish before we can allocate more, it
> shouldn't really matter how many waitqueues we have.
> Except for the fact that having a more complex system can
> introduce more opportunities for unfairness and starvation.

Sure.  We have this lovely fast wakeup/context switch time.  Blowing
some cycles in this situation surely is not a problem.

But I do think we want to perform the wakeups from interrupt context;
there are just too many opportunities for kswapd to take an extended
vacation on a request queue.

Non-blocking writeout infrastructure would be nice, too.  And for
simple cases, that's just a matter of getting the block layer to
manage a flag in q->backing_dev_info.  But even that would result in
scanning past pages.
And every time we do that, there are whacko corner cases which chew
tons of CPU or cause oom failures.  Lists, lists, we need more lists!

hmm.  But mapping->backing_dev_info is trivially available in the
superblock scan, and in that case we can scan past entire congested
filesystems, rather than single congested pages.

hmm.  I suspect q->backing_dev_info gets inaccurate once we get into
stacking and striping at the block layer, but that's just an
efficiency failing, not an oops.

> ...
> > But what is even more valuable than the code is a report of its
> > before-and-after effectiveness under a broad range of loads on a broad
> > range of hardware.  That's the most time-consuming part...
>
> Eeeks ;)  I don't even have a broad range of hardware...

Eeeks indeed.  But the main variables really are memory size, IO
bandwidth and workload.  That's manageable.

The traditional toss-it-in-and-see-who-complains approach will catch
the weird corner cases but it's slow turnaround.  I guess as long as
we know what the code is trying to do then it should be fairly
straightforward to verify that it's doing it.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  2:52 ` Andrew Morton
@ 2002-09-01  1:37   ` William Lee Irwin III
  0 siblings, 0 replies; 13+ messages in thread

From: William Lee Irwin III @ 2002-09-01 1:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, lkml

On Wed, Aug 28, 2002 at 07:52:56PM -0700, Andrew Morton wrote:
> Eeeks indeed.  But the main variables really are memory size,
> IO bandwidth and workload.  That's manageable.
> The traditional toss-it-in-and-see-who-complains approach will
> catch the weird corner cases but it's slow turnaround.  I guess
> as long as we know what the code is trying to do then it should be
> fairly straightforward to verify that it's doing it.

Okay, not sure which in the thread to respond to, but since I can't
find a public statement to this effect: in my testing, all 3 OOM
patches behave identically.

Cheers,
Bill

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-28 21:58 ` Andrew Morton
  2002-08-28 22:15   ` Andrew Morton
  2002-08-29  0:26   ` Rik van Riel
@ 2002-08-29  3:49   ` William Lee Irwin III
  2002-08-29 12:37     ` Rik van Riel
  2 siblings, 1 reply; 13+ messages in thread

From: William Lee Irwin III @ 2002-08-29 3:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml

William Lee Irwin III wrote:
>> I've already written the patch to address it, though of course, I can
>> post those traces along with the patch once it's rediffed.  (It's trivial
>> though -- just a fresh GFP flag and a check for it before calling
>> out_of_memory(), setting it in mempool_alloc(), and ignoring it in
>> slab.c.)  It requires several rounds of "un-throttling" to reproduce
>> the OOM's, the nature of which I've outlined elsewhere.

On Wed, Aug 28, 2002 at 02:58:20PM -0700, Andrew Morton wrote:
> That's a sane approach.  mempool_alloc() is designed for allocations
> which "must" succeed if you wait long enough.
> In fact it might make sense to only perform a single scan of the
> LRU if __GFP_WLI is set, rather than the increasing priority thing.
> But sigh.  Pointlessly scanning zillions of dirty pages and doing nothing
> with them is dumb.  So much better to go for a FIFO snooze on a per-zone
> waitqueue, be woken when some memory has been cleansed.  (That's effectively
> what mempool does, but it's all private and different).

Here's a stab in that direction, against 2.5.31.  A trivially different
patch was tested and verified to solve the problems in practice.  A
theoretical deadlock remains where a mempool allocator sleeps on
general purpose memory and is not woken when the mempool is replenished.
Cheers,
Bill

diff -urN linux-2.5.31-virgin/include/linux/gfp.h linux-2.5.31-nokill/include/linux/gfp.h
--- linux-2.5.31-virgin/include/linux/gfp.h	2002-08-10 18:41:24.000000000 -0700
+++ linux-2.5.31-nokill/include/linux/gfp.h	2002-08-28 02:22:55.000000000 -0700
@@ -17,6 +17,7 @@
 #define __GFP_IO	0x40	/* Can start low memory physical IO? */
 #define __GFP_HIGHIO	0x80	/* Can start high mem physical IO? */
 #define __GFP_FS	0x100	/* Can call down to low-level FS? */
+#define __GFP_NOKILL	0x200	/* Should not OOM kill */
 
 #define GFP_NOHIGHIO	( __GFP_WAIT | __GFP_IO)
 #define GFP_NOIO	( __GFP_WAIT)
diff -urN linux-2.5.31-virgin/include/linux/slab.h linux-2.5.31-nokill/include/linux/slab.h
--- linux-2.5.31-virgin/include/linux/slab.h	2002-08-10 18:41:28.000000000 -0700
+++ linux-2.5.31-nokill/include/linux/slab.h	2002-08-28 02:22:55.000000000 -0700
@@ -24,7 +24,7 @@
 #define SLAB_NFS	GFP_NFS
 #define SLAB_DMA	GFP_DMA
 
-#define SLAB_LEVEL_MASK	(__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_HIGHIO|__GFP_FS)
+#define SLAB_LEVEL_MASK	(__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_HIGHIO|__GFP_FS|__GFP_NOKILL)
 
 #define SLAB_NO_GROW	0x00001000UL	/* don't grow a cache */
 
 /* flags to pass to kmem_cache_create().
diff -urN linux-2.5.31-virgin/mm/mempool.c linux-2.5.31-nokill/mm/mempool.c
--- linux-2.5.31-virgin/mm/mempool.c	2002-08-10 18:41:19.000000000 -0700
+++ linux-2.5.31-nokill/mm/mempool.c	2002-08-28 02:22:55.000000000 -0700
@@ -186,7 +186,11 @@
 	unsigned long flags;
 	int curr_nr;
 	DECLARE_WAITQUEUE(wait, current);
-	int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+	int gfp_nowait;
+
+	gfp_mask |= __GFP_NOKILL;
+
+	gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO | __GFP_NOKILL);
 
 repeat_alloc:
 	element = pool->alloc(gfp_nowait, pool->pool_data);
diff -urN linux-2.5.31-virgin/mm/vmscan.c linux-2.5.31-nokill/mm/vmscan.c
--- linux-2.5.31-virgin/mm/vmscan.c	2002-08-10 18:41:21.000000000 -0700
+++ linux-2.5.31-nokill/mm/vmscan.c	2002-08-28 03:17:15.000000000 -0700
@@ -401,23 +401,24 @@
 int try_to_free_pages(zone_t *classzone, unsigned int gfp_mask, unsigned int order)
 {
-	int priority = DEF_PRIORITY;
-	int nr_pages = SWAP_CLUSTER_MAX;
+	int priority, status, nr_pages = SWAP_CLUSTER_MAX;
 
 	KERNEL_STAT_INC(pageoutrun);
 
-	do {
+	for (priority = DEF_PRIORITY; priority; --priority) {
 		nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
-		if (nr_pages <= 0)
-			return 1;
-	} while (--priority);
+		status = (nr_pages <= 0) ? 1 : 0;
+		if (status || (gfp_mask & __GFP_NOKILL))
+			goto out;
+	}
 
 	/*
	 * Hmm.. Cache shrink failed - time to kill something?
	 * Mhwahahhaha! This is the part I really like. Giggle.
	 */
 	out_of_memory();
-	return 0;
+out:
+	return status;
 }
 
 DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [patch] adjustments to dirty memory thresholds
  2002-08-29  3:49 ` William Lee Irwin III
@ 2002-08-29 12:37   ` Rik van Riel
  0 siblings, 0 replies; 13+ messages in thread

From: Rik van Riel @ 2002-08-29 12:37 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Andrew Morton, lkml

On Wed, 28 Aug 2002, William Lee Irwin III wrote:

> +	gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO | __GFP_NOKILL);

I suspect what you want is (in vmscan.c):

-	out_of_memory();
+	if (gfp_mask & __GFP_FS)
+		out_of_memory();

This means we'll just never call out_of_memory() if we haven't used
all possibilities for freeing pages.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 13+ messages in thread
end of thread, other threads:[~2002-09-01  1:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-28  4:39 [patch] adjustments to dirty memory thresholds Andrew Morton
2002-08-28 20:08 ` William Lee Irwin III
2002-08-28 20:27   ` Andrew Morton
2002-08-28 21:42     ` William Lee Irwin III
2002-08-28 21:58       ` Andrew Morton
2002-08-28 22:15         ` Andrew Morton
2002-08-29  0:26         ` Rik van Riel
2002-08-29  2:10           ` Andrew Morton
2002-08-29  2:10             ` Rik van Riel
2002-08-29  2:52               ` Andrew Morton
2002-09-01  1:37                 ` William Lee Irwin III
2002-08-29  3:49         ` William Lee Irwin III
2002-08-29 12:37           ` Rik van Riel