From: Tejun Heo <tj@kernel.org>
To: axboe@kernel.dk
Cc: linux-kernel@vger.kernel.org, jack@suse.cz, hch@infradead.org,
    hannes@cmpxchg.org, linux-fsdevel@vger.kernel.org, vgoyal@redhat.com,
    lizefan@huawei.com, cgroups@vger.kernel.org, linux-mm@kvack.org,
    mhocko@suse.cz, clm@fb.com, fengguang.wu@intel.com,
    david@fromorbit.com, gthelen@google.com, Tejun Heo <tj@kernel.org>,
    Vladimir Davydov <vdavydov@parallels.com>
Subject: [PATCH 18/18] mm: vmscan: remove memcg stalling on writeback pages during direct reclaim
Date: Mon, 23 Mar 2015 01:07:47 -0400
Message-ID: <1427087267-16592-19-git-send-email-tj@kernel.org>
In-Reply-To: <1427087267-16592-1-git-send-email-tj@kernel.org>

Because writeback wasn't cgroup aware before, the usual dirty
throttling mechanism in balance_dirty_pages() didn't work for
processes under memcg limit. The writeback path didn't know how much
memory is available or how fast the dirty pages are being written out
for a given memcg and balance_dirty_pages() didn't have any measure of
IO back pressure for the memcg.

To work around the issue, memcg implemented an ad-hoc dirty throttling
mechanism in the direct reclaim path by stalling on pages under
writeback which are encountered during direct reclaim scan. This is
rather ugly and crude - none of the configurability, fairness, or
bandwidth-proportional distribution of the normal path.

The previous patches implemented proper memcg aware dirty throttling
and the ad-hoc mechanism is no longer necessary. Remove it.

Note: I removed the parts which seemed obvious and it behaves fine
while testing but my understanding of this code path is rudimentary
and it's quite possible that I got something wrong. Please let me know
if I got something wrong or more global_reclaim() sites should be
updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/vmscan.c | 109 ++++++++++++++++++------------------------------------------
 1 file changed, 33 insertions(+), 76 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f8d3c0..d084c95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -929,53 +929,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			nr_congested++;
 
 		/*
-		 * If a page at the tail of the LRU is under writeback, there
-		 * are three cases to consider.
-		 *
-		 * 1) If reclaim is encountering an excessive number of pages
-		 *    under writeback and this page is both under writeback and
-		 *    PageReclaim then it indicates that pages are being queued
-		 *    for IO but are being recycled through the LRU before the
-		 *    IO can complete. Waiting on the page itself risks an
-		 *    indefinite stall if it is impossible to writeback the
-		 *    page due to IO error or disconnected storage so instead
-		 *    note that the LRU is being scanned too quickly and the
-		 *    caller can stall after page list has been processed.
-		 *
-		 * 2) Global reclaim encounters a page, memcg encounters a
-		 *    page that is not marked for immediate reclaim or
-		 *    the caller does not have __GFP_IO. In this case mark
-		 *    the page for immediate reclaim and continue scanning.
-		 *
-		 *    __GFP_IO is checked because a loop driver thread might
-		 *    enter reclaim, and deadlock if it waits on a page for
-		 *    which it is needed to do the write (loop masks off
-		 *    __GFP_IO|__GFP_FS for this reason); but more thought
-		 *    would probably show more reasons.
-		 *
-		 *    Don't require __GFP_FS, since we're not going into the
-		 *    FS, just waiting on its writeback completion. Worryingly,
-		 *    ext4 gfs2 and xfs allocate pages with
-		 *    grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
-		 *    may_enter_fs here is liable to OOM on them.
-		 *
-		 * 3) memcg encounters a page that is not already marked
-		 *    PageReclaim. memcg does not have any dirty pages
-		 *    throttling so we could easily OOM just because too many
-		 *    pages are in writeback and there is nothing else to
-		 *    reclaim. Wait for the writeback to complete.
+		 * A page at the tail of the LRU is under writeback. If
+		 * reclaim is encountering an excessive number of pages
+		 * under writeback and this page is both under writeback
+		 * and PageReclaim then it indicates that pages are being
+		 * queued for IO but are being recycled through the LRU
+		 * before the IO can complete. Waiting on the page itself
+		 * risks an indefinite stall if it is impossible to
+		 * writeback the page due to IO error or disconnected
+		 * storage so instead note that the LRU is being scanned
+		 * too quickly and the caller can stall after page list has
+		 * been processed.
 		 */
 		if (PageWriteback(page)) {
-			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
 			    test_bit(ZONE_WRITEBACK, &zone->flags)) {
 				nr_immediate++;
-				goto keep_locked;
-
-			/* Case 2 above */
-			} else if (global_reclaim(sc) ||
-			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
+			} else {
 				/*
 				 * This is slightly racy - end_page_writeback()
 				 * might have just cleared PageReclaim, then
@@ -989,13 +960,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
-
-				goto keep_locked;
-
-			/* Case 3 above */
-			} else {
-				wait_on_page_writeback(page);
 			}
+			goto keep_locked;
 		}
 
 		if (!force_reclaim)
@@ -1423,9 +1389,6 @@ static int too_many_isolated(struct zone *zone, int file,
 	if (current_is_kswapd())
 		return 0;
 
-	if (!global_reclaim(sc))
-		return 0;
-
 	if (file) {
 		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
 		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
@@ -1615,35 +1578,29 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		set_bit(ZONE_WRITEBACK, &zone->flags);
 
 	/*
-	 * memcg will stall in page writeback so only consider forcibly
-	 * stalling for global reclaim
+	 * Tag a zone as congested if all the dirty pages scanned were
+	 * backed by a congested BDI and wait_iff_congested will stall.
 	 */
-	if (global_reclaim(sc)) {
-		/*
-		 * Tag a zone as congested if all the dirty pages scanned were
-		 * backed by a congested BDI and wait_iff_congested will stall.
-		 */
-		if (nr_dirty && nr_dirty == nr_congested)
-			set_bit(ZONE_CONGESTED, &zone->flags);
+	if (nr_dirty && nr_dirty == nr_congested)
+		set_bit(ZONE_CONGESTED, &zone->flags);
 
-		/*
-		 * If dirty pages are scanned that are not queued for IO, it
-		 * implies that flushers are not keeping up. In this case, flag
-		 * the zone ZONE_DIRTY and kswapd will start writing pages from
-		 * reclaim context.
-		 */
-		if (nr_unqueued_dirty == nr_taken)
-			set_bit(ZONE_DIRTY, &zone->flags);
+	/*
+	 * If dirty pages are scanned that are not queued for IO, it
+	 * implies that flushers are not keeping up. In this case, flag the
+	 * zone ZONE_DIRTY and kswapd will start writing pages from reclaim
+	 * context.
+	 */
+	if (nr_unqueued_dirty == nr_taken)
+		set_bit(ZONE_DIRTY, &zone->flags);
 
-		/*
-		 * If kswapd scans pages marked marked for immediate
-		 * reclaim and under writeback (nr_immediate), it implies
-		 * that pages are cycling through the LRU faster than
-		 * they are written so also forcibly stall.
-		 */
-		if (nr_immediate && current_may_throttle())
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-	}
+	/*
+	 * If kswapd scans pages marked marked for immediate reclaim and
+	 * under writeback (nr_immediate), it implies that pages are
+	 * cycling through the LRU faster than they are written so also
+	 * forcibly stall.
+	 */
+	if (nr_immediate && current_may_throttle())
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 	/*
 	 * Stall direct reclaim for IO completions if underlying BDIs or zone
-- 
2.1.0
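For readers who want the PageWriteback() branch change in shrink_page_list() at a glance, here is a rough userspace model of the decision before and after this patch. It is only a sketch and not kernel code: the names wb_ctx, wb_action, wb_action_before(), wb_action_after() and wb_model_selfcheck() are hypothetical, and each boolean stands in for the kernel predicate named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what reclaim does with a page under writeback. */
enum wb_action {
	WB_COUNT_IMMEDIATE,	/* nr_immediate++, keep the page locked */
	WB_MARK_RECLAIM,	/* SetPageReclaim(), nr_writeback++, keep it */
	WB_WAIT,		/* wait_on_page_writeback(), the memcg stall */
};

/* Hypothetical stand-ins for the predicates tested in the diff. */
struct wb_ctx {
	bool is_kswapd;		/* current_is_kswapd() */
	bool page_reclaim;	/* PageReclaim(page) */
	bool zone_writeback;	/* test_bit(ZONE_WRITEBACK, &zone->flags) */
	bool global_reclaim;	/* global_reclaim(sc) */
	bool gfp_io;		/* sc->gfp_mask & __GFP_IO */
};

/* Branch structure on the '-' side of the diff: three outcomes. */
static enum wb_action wb_action_before(const struct wb_ctx *c)
{
	if (c->is_kswapd && c->page_reclaim && c->zone_writeback)
		return WB_COUNT_IMMEDIATE;
	if (c->global_reclaim || !c->page_reclaim || !c->gfp_io)
		return WB_MARK_RECLAIM;
	return WB_WAIT;
}

/* Branch structure on the '+' side: the waiting outcome is gone. */
static enum wb_action wb_action_after(const struct wb_ctx *c)
{
	if (c->is_kswapd && c->page_reclaim && c->zone_writeback)
		return WB_COUNT_IMMEDIATE;
	return WB_MARK_RECLAIM;
}

/* memcg direct reclaim, page already PageReclaim, __GFP_IO allowed. */
static void wb_model_selfcheck(void)
{
	struct wb_ctx memcg = {
		.is_kswapd = false,
		.page_reclaim = true,
		.zone_writeback = false,
		.global_reclaim = false,
		.gfp_io = true,
	};

	assert(wb_action_before(&memcg) == WB_WAIT);
	assert(wb_action_after(&memcg) == WB_MARK_RECLAIM);
}
```

In this model the only input combination whose outcome changes is memcg (non-global) reclaim of a page that already has PageReclaim set with __GFP_IO allowed: it used to wait on the page and now is marked and skipped like every other case.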