Date: Tue, 26 Jul 2016 09:07:56 +0200
From: Michal Hocko
Subject: Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
Message-ID: <20160726070755.GB32462@dhcp22.suse.cz>
References: <1468831164-26621-1-git-send-email-mhocko@kernel.org>
 <1468831285-27242-1-git-send-email-mhocko@kernel.org>
 <1468831285-27242-2-git-send-email-mhocko@kernel.org>
 <87oa5q5abi.fsf@notabene.neil.brown.name>
 <20160722091558.GF794@dhcp22.suse.cz>
 <878twt5i1j.fsf@notabene.neil.brown.name>
 <20160725083247.GD9401@dhcp22.suse.cz>
 <20160725192344.GD2166@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160725192344.GD2166@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
To: NeilBrown
Cc: linux-mm@kvack.org, Mikulas Patocka, Ondrej Kozina, David Rientjes,
 Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel@redhat.com,
 Marcelo Tosatti

On Mon 25-07-16 21:23:44, Michal Hocko wrote:
> [CC Marcelo who might remember other details for the loads which made
> him add this code - see the patch changelog for more context]
>
> On Mon 25-07-16 10:32:47, Michal Hocko wrote:
[...]
> From 0d950d64e3c59061f7cca71fe5877d4e430499c9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko
> Date: Mon, 25 Jul 2016 14:18:54 +0200
> Subject: [PATCH] mm, vmscan: get rid of throttle_vm_writeout
>
> throttle_vm_writeout was introduced back in 2005 to fix OOMs caused by
> excessive pageout activity during reclaim. Too many pages could be put
> under writeback, so the LRUs would be full of unreclaimable pages until
> the IO completed, and in turn the OOM killer could be invoked.
>
> There have been some important changes introduced in the reclaim path
> since then, though. Writers are throttled by balance_dirty_pages when
> initiating buffered IO and later, under memory pressure, the direct
> reclaim is throttled by wait_iff_congested if the node is considered
> congested by dirty pages on the LRUs and the underlying bdi is congested
> by the queued IO. Kswapd is throttled as well if it encounters pages
> marked for immediate reclaim or under writeback, which signals that
> there are too many pages under writeback already. Another important
> aspect is that we do not issue any IO from the direct reclaim context
> anymore. In a heavily parallel load this could queue a lot of IO which
> would be very scattered and thus inefficient, and which would just make
> the problem worse.

And I forgot another throttling point: should_reclaim_retry, which is the
main logic to decide whether we go OOM or not, has a congestion_wait if
there are too many dirty/writeback pages. That should give the IO
subsystem some time to finish the IO.
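
To illustrate, the shape of that check is roughly the following (only a
simplified sketch, not the exact mm/page_alloc.c code - the helper name
and the exact heuristic are made up for illustration):

	/*
	 * Sketch: before declaring OOM, wait for IO if most of what is
	 * still reclaimable is dirty or under writeback.
	 */
	static bool throttle_before_oom_sketch(unsigned long reclaimable,
					       unsigned long write_pending)
	{
		/* write_pending ~ dirty + writeback pages */
		if (2 * write_pending > reclaimable) {
			/* give the IO subsystem some time to finish the IO */
			congestion_wait(BLK_RW_ASYNC, HZ/10);
			return true;	/* retry the reclaim rather than go OOM */
		}
		return false;
	}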
> These three mechanisms should throttle and keep the amount of IO in a
> steady state even under heavy IO and memory pressure, so yet another
> throttling point doesn't really seem helpful. Quite the contrary, Mikulas
> Patocka has reported that swap backed by dm-crypt doesn't work properly
> because the swapout IO cannot make sufficient progress as the writeout
> path depends on the dm_crypt worker which has to allocate memory to
> perform the encryption. In order to guarantee forward progress it relies
> on the mempool allocator. mempool_alloc(), however, prefers to use the
> underlying (usually page) allocator before it grabs objects from the
> pool. Such an allocation can dive into the memory reclaim and
> consequently into throttle_vm_writeout. If there are too many dirty pages
> or pages under writeback it will get throttled even though it is in fact
> a flusher trying to clear the pending pages.
>
> [ 345.352536] kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000
> [ 345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [ 345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [ 345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [ 345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [ 345.352536] Call Trace:
> [ 345.352536]  [] schedule+0x3c/0x90
> [ 345.352536]  [] schedule_timeout+0x1d8/0x360
> [ 345.352536]  [] ? detach_if_pending+0x1c0/0x1c0
> [ 345.352536]  [] ? ktime_get+0xb3/0x150
> [ 345.352536]  [] ? __delayacct_blkio_start+0x1f/0x30
> [ 345.352536]  [] io_schedule_timeout+0xa4/0x110
> [ 345.352536]  [] congestion_wait+0x86/0x1f0
> [ 345.352536]  [] ? prepare_to_wait_event+0xf0/0xf0
> [ 345.352536]  [] throttle_vm_writeout+0x44/0xd0
> [ 345.352536]  [] shrink_zone_memcg+0x613/0x720
> [ 345.352536]  [] shrink_zone+0xe0/0x300
> [ 345.352536]  [] do_try_to_free_pages+0x1ad/0x450
> [ 345.352536]  [] try_to_free_pages+0xef/0x300
> [ 345.352536]  [] __alloc_pages_nodemask+0x879/0x1210
> [ 345.352536]  [] ? sched_clock_cpu+0x90/0xc0
> [ 345.352536]  [] alloc_pages_current+0xa1/0x1f0
> [ 345.352536]  [] ? new_slab+0x3f5/0x6a0
> [ 345.352536]  [] new_slab+0x2d7/0x6a0
> [ 345.352536]  [] ? sched_clock_local+0x17/0x80
> [ 345.352536]  [] ___slab_alloc+0x3fb/0x5c0
> [ 345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [ 345.352536]  [] ? sched_clock_local+0x17/0x80
> [ 345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [ 345.352536]  [] __slab_alloc+0x51/0x90
> [ 345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [ 345.352536]  [] kmem_cache_alloc+0x27b/0x310
> [ 345.352536]  [] mempool_alloc_slab+0x1d/0x30
> [ 345.352536]  [] mempool_alloc+0x91/0x230
> [ 345.352536]  [] bio_alloc_bioset+0xbd/0x260
> [ 345.352536]  [] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
>
> Let's just drop throttle_vm_writeout altogether. It is not very helpful
> anymore.
>
> I have tried to test a potential writeback IO runaway similar to the one
> described in the original patch which introduced it [1]. A small virtual
> machine (512MB RAM, 4 CPUs, 2G of swap space and the disk image on a
> rather slow NFS in sync mode on the host) runs 8 parallel writers, each
> writing 1G worth of data. As soon as the pagecache fills up and the
> direct reclaim kicks in, I start an anon memory consumer in a loop
> (allocating 300M and exiting after populating it) in the background to
> make the memory pressure even stronger as well as to disrupt the steady
> state of the IO.
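
For completeness, the anon memory consumer is essentially just the
following (a minimal sketch along the lines described above, not the
exact program used in the test):

	/*
	 * Allocate 300M of anonymous memory, populate it so the pages are
	 * really instantiated (and can be pushed to swap), then exit.
	 * Restarted in a loop, e.g.: while :; do ./consumer; done
	 */
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		size_t size = 300UL << 20;	/* 300M */
		char *buf = malloc(size);

		if (!buf)
			return 1;

		/* touch every byte so the allocation is actually populated */
		memset(buf, 1, size);

		return 0;	/* exiting releases the anon memory again */
	}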
> The direct reclaim is throttled because of the congestion, as well as
> kswapd hitting congestion_wait due to nr_immediate, but
> throttle_vm_writeout never triggers its sleep throughout the test.
> Dirty+writeback stay close to nr_dirty_threshold, with some fluctuations
> caused by the anon consumer.
>
> [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
>
> Cc: Marcelo Tosatti
> Reported-by: Mikulas Patocka
> Signed-off-by: Michal Hocko
> ---
>  include/linux/writeback.h |  1 -
>  mm/page-writeback.c       | 30 ------------------------------
>  mm/vmscan.c               |  2 --
>  3 files changed, 33 deletions(-)
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 44b4422ae57f..f67a992cdf89 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -319,7 +319,6 @@ void laptop_mode_timer_fn(unsigned long data);
>  #else
>  static inline void laptop_sync_completion(void) { }
>  #endif
> -void throttle_vm_writeout(gfp_t gfp_mask);
>  bool node_dirty_ok(struct pglist_data *pgdat);
>  int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
>  #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b82303a9e67d..2828d6ca1451 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,36 +1962,6 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>  	return false;
>  }
>
> -void throttle_vm_writeout(gfp_t gfp_mask)
> -{
> -	unsigned long background_thresh;
> -	unsigned long dirty_thresh;
> -
> -	for ( ; ; ) {
> -		global_dirty_limits(&background_thresh, &dirty_thresh);
> -		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
> -
> -		/*
> -		 * Boost the allowable dirty threshold a bit for page
> -		 * allocators so they don't get DoS'ed by heavy writers
> -		 */
> -		dirty_thresh += dirty_thresh / 10;	/* wheeee... */
> -
> -		if (global_node_page_state(NR_UNSTABLE_NFS) +
> -			global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
> -				break;
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> -		/*
> -		 * The caller might hold locks which can prevent IO completion
> -		 * or progress in the filesystem. So we cannot just sit here
> -		 * waiting for IO to complete.
> -		 */
> -		if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
> -			break;
> -	}
> -}
> -
>  /*
>   * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
>   */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0294ab34f475..0f35ed30e35b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2410,8 +2410,6 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	if (inactive_list_is_low(lruvec, false, sc))
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
> -
> -	throttle_vm_writeout(sc->gfp_mask);
>  }
>
>  /* Use reclaim/compaction for costly allocs or under memory pressure */
> --
> 2.8.1
>
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs