From: Wu Fengguang <fengguang.wu@intel.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Greg Thelen <gthelen@google.com>, Jan Kara <jack@suse.cz>,
	"bsingharora@gmail.com" <bsingharora@gmail.com>,
	Hugh Dickins <hughd@google.com>, Michal Hocko <mhocko@suse.cz>,
	linux-mm@kvack.org, Ying Han <yinghan@google.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Rik van Riel <riel@redhat.com>,
	Minchan Kim <minchan.kim@gmail.com>
Subject: Re: reclaim the LRU lists full of dirty/writeback pages
Date: Tue, 14 Feb 2012 21:18:12 +0800
Message-ID: <20120214131812.GA17625@localhost>
In-Reply-To: <20120214101931.GB5938@suse.de>

On Tue, Feb 14, 2012 at 10:19:31AM +0000, Mel Gorman wrote:
> On Sat, Feb 11, 2012 at 08:44:45PM +0800, Wu Fengguang wrote:
> > <SNIP>
> > --- linux.orig/mm/vmscan.c	2012-02-03 21:42:21.000000000 +0800
> > +++ linux/mm/vmscan.c	2012-02-11 17:28:54.000000000 +0800
> > @@ -813,6 +813,8 @@ static unsigned long shrink_page_list(st
> >  
> >  		if (PageWriteback(page)) {
> >  			nr_writeback++;
> > +			if (PageReclaim(page))
> > +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  			/*
> >  			 * Synchronous reclaim cannot queue pages for
> >  			 * writeback due to the possibility of stack overflow
> 
> I didn't look closely at the rest of the patch, I'm just focusing on the
> congestion_wait part. You called this out yourself but this is in fact
> really really bad. If this is in place and a user copies a large amount of
> data to slow storage like a USB stick, the system will stall severely. A
> parallel streaming reader will certainly have major issues as it will enter
> page reclaim, find a bunch of dirty USB-backed pages at the end of the LRU
> (20% of memory potentially) and stall for HZ/10 on each one of them. How
> badly each process is affected will vary.

I cannot agree more on the principle... I just wanted to demonstrate
the idea first :-)
 
> For the OOM problem, a more reasonable stopgap might be to identify when
> a process is scanning a memcg at high priority and encountered all
> PageReclaim with no forward progress and to congestion_wait() if that
> situation occurs. A preferable way would be to wait until the flusher
> wakes up a waiter on PageReclaim pages to be written out because we want
> to keep moving way from congestion_wait() if at all possible.

Good points! Below are the more serious page reclaim changes.

Dirty/writeback pages often sit close to each other in the LRU list,
so the purely local test inside a 32-page scan window may still
trigger reclaim waits unnecessarily. Some global information on the
percentage of dirty/writeback pages in the LRU list may help; a
sketch of such a test follows below. Anyway, the added tests should
still be much better than no protection at all.
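
For illustration, such a global test could look roughly like this (a
sketch only, not part of the patch below; the 1/2 threshold is made
up):

	/*
	 * Sketch: estimate globally how much of the file LRU is dirty
	 * or under writeback, from the existing zone counters. If the
	 * ratio is high, waiting on rotation is likely justified; the
	 * 1/2 threshold is arbitrary.
	 */
	static bool lru_dirty_heavy(struct zone *zone)
	{
		unsigned long dirty = zone_page_state(zone, NR_FILE_DIRTY) +
				      zone_page_state(zone, NR_WRITEBACK);
		unsigned long file  = zone_page_state(zone, NR_ACTIVE_FILE) +
				      zone_page_state(zone, NR_INACTIVE_FILE);

		return dirty > file / 2;
	}

Such a check could then gate the reclaim_wait() call in
shrink_inactive_list() in addition to the local nr_pgreclaim test.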

A global wait queue and reclaim_wait() are introduced. Waiters are
woken up when pages are rotated to the LRU tail by end_page_writeback()
or an LRU drain.
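
For reference, the wake side rides on the existing rotation machinery;
only reclaim_rotated() is new in the chain below:

	end_page_writeback()
	    rotate_reclaimable_page()	/* PG_reclaim is set */
	        pagevec_move_tail()
	            reclaim_rotated()	/* wake_up(&reclaim_wqh) */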

I have to say its effectiveness depends on the filesystem... ext4
and btrfs deliver IO completions in a smooth, steady stream, so
reclaim_wait() works pretty well:
              dd-14560 [017] ....  1360.894605: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=10000
              dd-14560 [017] ....  1360.904456: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=8000
              dd-14560 [017] ....  1360.908293: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=2000
              dd-14560 [017] ....  1360.923960: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=15000
              dd-14560 [017] ....  1360.927810: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=2000
              dd-14560 [017] ....  1360.931656: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=2000
              dd-14560 [017] ....  1360.943503: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=10000
              dd-14560 [017] ....  1360.953289: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=7000
              dd-14560 [017] ....  1360.957177: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=2000
              dd-14560 [017] ....  1360.972949: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=15000
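
(The writeback_reclaim_wait tracepoint itself is not defined in the
patch below; presumably it lives in include/trace/events/writeback.h
elsewhere in the series. Assuming the usual ftrace layout, the events
can be captured with something like:

	echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_reclaim_wait/enable
	cat /sys/kernel/debug/tracing/trace_pipe
)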

XFS, however, delivers IO completions in very large batches (there may
be only a handful of big IO completions per second). Since the wakeups
then arrive hundreds of milliseconds apart, far longer than the HZ/10
(100ms) timeout, reclaim_wait() mostly ends up sleeping for the full
timeout:

              dd-4177  [008] ....   866.367661: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [010] ....   866.567583: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [012] ....   866.767458: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [013] ....   866.867419: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [008] ....   867.167266: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [010] ....   867.367168: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [012] ....   867.818950: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [013] ....   867.918905: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [013] ....   867.971657: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=52000
              dd-4177  [013] ....   867.971812: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=0
              dd-4177  [008] ....   868.355700: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000
              dd-4177  [010] ....   868.700515: writeback_reclaim_wait: usec_timeout=100000 usec_delayed=100000

> Another possibility would be to relook at LRU_IMMEDIATE but right now it
> requires a page flag and I haven't devised a way around that. Besides,
> it would only address the problem of PageREclaim pages being encountered,
> it would not handle the case where a memcg was filled with PageReclaim pages.

I also considered things like LRU_IMMEDIATE, but have no clear idea yet.
Since the simple "wait on PG_reclaim" approach appears to work for this
memcg dd case, it effectively keeps me from thinking any further ;-)

For the single dd inside a memcg, ext4 is now working pretty well,
with the least CPU overhead:

(running on another test box, so not directly comparable with the old tests)

        avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                   0.03    0.00    0.85    5.35    0.00   93.77

        Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
        sda               0.00     0.00    0.00  112.00     0.00 57348.00  1024.07    81.66 1045.21   8.93 100.00

        avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                   0.00    0.00    0.69    4.07    0.00   95.24

        Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
        sda               0.00   142.00    0.00  112.00     0.00 56832.00  1014.86   127.94  790.04   8.93 100.00

And xfs is a bit less fluent:

        avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                   0.00    0.00    3.79    2.54    0.00   93.68

        Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
        sda               0.00     0.00    0.00  108.00     0.00 54644.00  1011.93    48.13 1044.83   8.44  91.20

        avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                   0.00    0.00    3.38    3.88    0.00   92.74

        Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
        sda               0.00     0.00    0.00  105.00     0.00 53156.00  1012.50   128.50  451.90   9.25  97.10

btrfs also looks good:

        avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                   0.00    0.00    8.05    3.85    0.00   88.10

        Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
        sda               0.00     0.00    0.00  108.00     0.00 53248.00   986.07    88.11  643.99   9.26 100.00

        avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                   0.00    0.00    4.04    2.51    0.00   93.45

        Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
        sda               0.00     0.00    0.00  112.00     0.00 57344.00  1024.00    91.58  998.41   8.93 100.00


Thanks,
Fengguang
---

--- linux.orig/include/linux/backing-dev.h	2012-02-14 19:43:06.000000000 +0800
+++ linux/include/linux/backing-dev.h	2012-02-14 19:49:26.000000000 +0800
@@ -304,6 +304,8 @@ void clear_bdi_congested(struct backing_
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
 long wait_iff_congested(struct zone *zone, int sync, long timeout);
+long reclaim_wait(long timeout);
+void reclaim_rotated(void);
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
--- linux.orig/mm/backing-dev.c	2012-02-14 19:26:15.000000000 +0800
+++ linux/mm/backing-dev.c	2012-02-14 20:09:45.000000000 +0800
@@ -873,3 +873,38 @@ out:
 	return ret;
 }
 EXPORT_SYMBOL(wait_iff_congested);
+
+static DECLARE_WAIT_QUEUE_HEAD(reclaim_wqh);
+
+/**
+ * reclaim_wait - wait for pages to be rotated to the LRU tail
+ * @timeout: timeout in jiffies
+ *
+ * Sleep until @timeout expires, or until some pages (typically PG_reclaim
+ * pages under writeback) are rotated to the LRU tail and reclaim can progress.
+ */
+long reclaim_wait(long timeout)
+{
+	long ret;
+	unsigned long start = jiffies;
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&reclaim_wqh, &wait, TASK_KILLABLE);
+	ret = io_schedule_timeout(timeout);
+	finish_wait(&reclaim_wqh, &wait);
+
+	trace_writeback_reclaim_wait(jiffies_to_usecs(timeout),
+				     jiffies_to_usecs(jiffies - start));
+
+	return ret;
+}
+EXPORT_SYMBOL(reclaim_wait);
+
+void reclaim_rotated(void)
+{
+	wait_queue_head_t *wqh = &reclaim_wqh;
+
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
--- linux.orig/mm/swap.c	2012-02-14 19:40:10.000000000 +0800
+++ linux/mm/swap.c	2012-02-14 19:45:13.000000000 +0800
@@ -253,6 +253,7 @@ static void pagevec_move_tail(struct pag
 
 	pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
 	__count_vm_events(PGROTATED, pgmoved);
+	reclaim_rotated();
 }
 
 /*
--- linux.orig/mm/vmscan.c	2012-02-14 17:53:27.000000000 +0800
+++ linux/mm/vmscan.c	2012-02-14 19:44:11.000000000 +0800
@@ -767,7 +767,8 @@ static unsigned long shrink_page_list(st
 				      struct scan_control *sc,
 				      int priority,
 				      unsigned long *ret_nr_dirty,
-				      unsigned long *ret_nr_writeback)
+				      unsigned long *ret_nr_writeback,
+				      unsigned long *ret_nr_pgreclaim)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -776,6 +777,7 @@ static unsigned long shrink_page_list(st
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_writeback = 0;
+	unsigned long nr_pgreclaim = 0;
 
 	cond_resched();
 
@@ -813,6 +815,10 @@ static unsigned long shrink_page_list(st
 
 		if (PageWriteback(page)) {
 			nr_writeback++;
+			if (PageReclaim(page))
+				nr_pgreclaim++;
+			else
+				SetPageReclaim(page);
 			/*
 			 * Synchronous reclaim cannot queue pages for
 			 * writeback due to the possibility of stack overflow
@@ -874,12 +880,15 @@ static unsigned long shrink_page_list(st
 			nr_dirty++;
 
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
+			 * We ran into the visited page again: we are scanning
+			 * faster than the flusher can write out dirty pages.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
+			if (page_is_file_cache(page) && PageReclaim(page)) {
+				nr_pgreclaim++;
+				goto keep_locked;
+			}
+			if (page_is_file_cache(page) && mapping &&
+			    flush_inode_page(mapping, page, false) >= 0) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -1028,6 +1037,7 @@ keep_lumpy:
 	count_vm_events(PGACTIVATE, pgactivate);
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_writeback += nr_writeback;
+	*ret_nr_pgreclaim += nr_pgreclaim;
 	return nr_reclaimed;
 }
 
@@ -1087,8 +1097,10 @@ int __isolate_lru_page(struct page *page
 	 */
 	if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
 		/* All the caller can do on PageWriteback is block */
-		if (PageWriteback(page))
+		if (PageWriteback(page)) {
+			SetPageReclaim(page);
 			return ret;
+		}
 
 		if (PageDirty(page)) {
 			struct address_space *mapping;
@@ -1509,6 +1521,7 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_file;
 	unsigned long nr_dirty = 0;
 	unsigned long nr_writeback = 0;
+	unsigned long nr_pgreclaim = 0;
 	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
 	struct zone *zone = mz->zone;
 
@@ -1559,13 +1572,13 @@ shrink_inactive_list(unsigned long nr_to
 	spin_unlock_irq(&zone->lru_lock);
 
 	nr_reclaimed = shrink_page_list(&page_list, mz, sc, priority,
-						&nr_dirty, &nr_writeback);
+				&nr_dirty, &nr_writeback, &nr_pgreclaim);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
 		nr_reclaimed += shrink_page_list(&page_list, mz, sc,
-					priority, &nr_dirty, &nr_writeback);
+			priority, &nr_dirty, &nr_writeback, &nr_pgreclaim);
 	}
 
 	spin_lock_irq(&zone->lru_lock);
@@ -1608,6 +1621,8 @@ shrink_inactive_list(unsigned long nr_to
 	 */
 	if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+	if (nr_pgreclaim && nr_pgreclaim >= (nr_taken >> (DEF_PRIORITY-priority)))
+		reclaim_wait(HZ/10);
 
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
@@ -2382,8 +2397,6 @@ static unsigned long do_try_to_free_page
 		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_TRY_TO_FREE_PAGES);
 			sc->may_writepage = 1;
 		}
 
