All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Greg Thelen <gthelen@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"containers@lists.osdl.org" <containers@lists.osdl.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Balbir Singh <bsingharora@gmail.com>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	Minchan Kim <minchan.kim@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Dave Chinner <david@fromorbit.com>,
	Vivek Goyal <vgoyal@redhat.com>,
	Andrea Righi <andrea@betterlinux.com>,
	Ciju Rajan K <ciju@linux.vnet.ibm.com>,
	David Rientjes <rientjes@google.com>,
	"Li, Shaohua" <shaohua.li@intel.com>,
	"Shi, Alex" <alex.shi@intel.com>,
	"Chen, Tim C" <tim.c.chen@intel.com>
Subject: Re: [PATCH v9 12/13] memcg: create support routines for page writeback
Date: Thu, 18 Aug 2011 22:08:56 +0200	[thread overview]
Message-ID: <20110818200856.GD12426@quack.suse.cz> (raw)
In-Reply-To: <20110818121714.GA1883@localhost>

On Thu 18-08-11 20:17:14, Wu Fengguang wrote:
> On Thu, Aug 18, 2011 at 06:12:48PM +0800, Jan Kara wrote:
> > On Thu 18-08-11 10:36:10, Wu Fengguang wrote:
> > > Subject: squeeze max-pause area and drop pass-good area
> > > Date: Tue Aug 16 13:37:14 CST 2011
> > > 
> > > Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
> > > introduce max-pause and pass-good dirty limits") and make the
> > > max-pause area smaller and safe.
> > > 
> > > This fixes ~30% performance regression in the ext3 data=writeback
> > > fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
> > > 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
> > > 
> > > Using deadline scheduler also has a regression, but not that big as
> > > CFQ, so this suggests we have some write starvation.
> > > 
> > > The test logs show that
> > > 
> > > - the disks are sometimes under utilized
> > > 
> > > - global dirty pages sometimes rush high to the pass-good area for
> > >   several hundred seconds, while in the mean time some bdi dirty pages
> > >   drop to very low value (bdi_dirty << bdi_thresh).
> > >   Then suddenly the global dirty pages dropped under global dirty
> > >   threshold and bdi_dirty rush very high (for example, 2 times higher
> > >   than bdi_thresh). During which time balance_dirty_pages() is not
> > >   called at all.
> > > 
> > > So the problems are
> > > 
> > > 1) The random writes progress so slow that they break the assumption of
> > > the max-pause logic that "8 pages per 200ms is typically more than
> > > enough to curb heavy dirtiers".
> > > 
> > > 2) The max-pause logic ignored task_bdi_thresh and thus opens the
> > >    possibility for some bdi's to over dirty pages, leading to
> > >    (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others.
> > > 
> > > 3) The higher max-pause/pass-good thresholds somehow leads to some bad
> > >    swing of dirty pages.
> > > 
> > > The fix is to allow the task to slightly dirty over task_bdi_thresh, but
> > > no way to exceed bdi_dirty and/or global dirty_thresh.
> > > 
> > > Tests show that it fixed the JBOD regression completely (both behavior
> > > and performance), while still being able to cut down large pause times
> > > in balance_dirty_pages() for single-disk cases.
> > > 
> > > Reported-by: Li Shaohua <shaohua.li@intel.com>
> > > Tested-by: Li Shaohua <shaohua.li@intel.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  include/linux/writeback.h |   11 -----------
> > >  mm/page-writeback.c       |   15 ++-------------
> > >  2 files changed, 2 insertions(+), 24 deletions(-)
> > > 
> > > --- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
> > > +++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
> > > @@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
> > >  		 * 200ms is typically more than enough to curb heavy dirtiers;
> > >  		 * (b) the pause time limit makes the dirtiers more responsive.
> > >  		 */
> > > -		if (nr_dirty < dirty_thresh +
> > > -			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
> > > +		if (nr_dirty < dirty_thresh &&
> > > +		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
> > >  		    time_after(jiffies, start_time + MAX_PAUSE))
> > >  			break;
> >   This looks definitely much safer than the original patch since we now
> > always observe global dirty limit.
> 
> Yeah.
> 
> > I just wonder: We have throttled the
> > task because bdi_nr_reclaimable > task_bdi_thresh.
> 
> Not necessarily. It's possible (bdi_nr_reclaimable < task_bdi_thresh)
> for the whole loop. And the 200ms pause that trigger the above test
> may totally come from the io_schedule_timeout() calls.
> 
> > Now in practice there
> > should be some pages under writeback and this task should have submitted
> > even more just a while ago. So the condition
> >   bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2
> 
> I guess the writeback_inodes_wb() call is irrelevant for the above
> test, because writeback_inodes_wb() transfers reclaimable pages to
> writeback pages, with the total bdi_dirty value staying the same.
> Not to mention the fact that both the bdi_dirty and bdi_nr_reclaimable
> variables have not been updated between writeback_inodes_wb() and the
> max-pause test.
  Right, that comment was a bit off.

> > looks still relatively weak. Shouldn't there be
> >   bdi_nr_reclaimable < (task_bdi_thresh + bdi_thresh) / 2?
> 
> That's much easier condition to satisfy..
  Argh, sorry. I was mistaken by the name of the variable - I though it
contains only dirty pages on the bdi but it also contains pages under
writeback and bdi_nr_reclaimable is the one that contains only dirty pages.
So your patch does exactly what I had in mind. You can add:
  Acked-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Greg Thelen <gthelen@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"containers@lists.osdl.org" <containers@lists.osdl.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Balbir Singh <bsingharora@gmail.com>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	Minchan Kim <minchan.kim@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Dave Chinner <david@fromorbit.com>,
	Vivek Goyal <vgoyal@redhat.com>,
	Andrea Righi <andrea@betterlinux.com>,
	Ciju Rajan K <ciju@linux.vnet.ibm.com>,
	David Rientjes <rientjes@google.com>,
	"Li, Shaohua" <shaohua.li@intel.com>,
	"Shi, Alex" <alex.shi@intel.com>,
	"Chen, Tim C" <tim.c.chen@intel.com>
Subject: Re: [PATCH v9 12/13] memcg: create support routines for page writeback
Date: Thu, 18 Aug 2011 22:08:56 +0200	[thread overview]
Message-ID: <20110818200856.GD12426@quack.suse.cz> (raw)
In-Reply-To: <20110818121714.GA1883@localhost>

On Thu 18-08-11 20:17:14, Wu Fengguang wrote:
> On Thu, Aug 18, 2011 at 06:12:48PM +0800, Jan Kara wrote:
> > On Thu 18-08-11 10:36:10, Wu Fengguang wrote:
> > > Subject: squeeze max-pause area and drop pass-good area
> > > Date: Tue Aug 16 13:37:14 CST 2011
> > > 
> > > Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
> > > introduce max-pause and pass-good dirty limits") and make the
> > > max-pause area smaller and safe.
> > > 
> > > This fixes ~30% performance regression in the ext3 data=writeback
> > > fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
> > > 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
> > > 
> > > Using deadline scheduler also has a regression, but not that big as
> > > CFQ, so this suggests we have some write starvation.
> > > 
> > > The test logs show that
> > > 
> > > - the disks are sometimes under utilized
> > > 
> > > - global dirty pages sometimes rush high to the pass-good area for
> > >   several hundred seconds, while in the mean time some bdi dirty pages
> > >   drop to very low value (bdi_dirty << bdi_thresh).
> > >   Then suddenly the global dirty pages dropped under global dirty
> > >   threshold and bdi_dirty rush very high (for example, 2 times higher
> > >   than bdi_thresh). During which time balance_dirty_pages() is not
> > >   called at all.
> > > 
> > > So the problems are
> > > 
> > > 1) The random writes progress so slow that they break the assumption of
> > > the max-pause logic that "8 pages per 200ms is typically more than
> > > enough to curb heavy dirtiers".
> > > 
> > > 2) The max-pause logic ignored task_bdi_thresh and thus opens the
> > >    possibility for some bdi's to over dirty pages, leading to
> > >    (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others.
> > > 
> > > 3) The higher max-pause/pass-good thresholds somehow leads to some bad
> > >    swing of dirty pages.
> > > 
> > > The fix is to allow the task to slightly dirty over task_bdi_thresh, but
> > > no way to exceed bdi_dirty and/or global dirty_thresh.
> > > 
> > > Tests show that it fixed the JBOD regression completely (both behavior
> > > and performance), while still being able to cut down large pause times
> > > in balance_dirty_pages() for single-disk cases.
> > > 
> > > Reported-by: Li Shaohua <shaohua.li@intel.com>
> > > Tested-by: Li Shaohua <shaohua.li@intel.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  include/linux/writeback.h |   11 -----------
> > >  mm/page-writeback.c       |   15 ++-------------
> > >  2 files changed, 2 insertions(+), 24 deletions(-)
> > > 
> > > --- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
> > > +++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
> > > @@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
> > >  		 * 200ms is typically more than enough to curb heavy dirtiers;
> > >  		 * (b) the pause time limit makes the dirtiers more responsive.
> > >  		 */
> > > -		if (nr_dirty < dirty_thresh +
> > > -			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
> > > +		if (nr_dirty < dirty_thresh &&
> > > +		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
> > >  		    time_after(jiffies, start_time + MAX_PAUSE))
> > >  			break;
> >   This looks definitely much safer than the original patch since we now
> > always observe global dirty limit.
> 
> Yeah.
> 
> > I just wonder: We have throttled the
> > task because bdi_nr_reclaimable > task_bdi_thresh.
> 
> Not necessarily. It's possible (bdi_nr_reclaimable < task_bdi_thresh)
> for the whole loop. And the 200ms pause that trigger the above test
> may totally come from the io_schedule_timeout() calls.
> 
> > Now in practice there
> > should be some pages under writeback and this task should have submitted
> > even more just a while ago. So the condition
> >   bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2
> 
> I guess the writeback_inodes_wb() call is irrelevant for the above
> test, because writeback_inodes_wb() transfers reclaimable pages to
> writeback pages, with the total bdi_dirty value staying the same.
> Not to mention the fact that both the bdi_dirty and bdi_nr_reclaimable
> variables have not been updated between writeback_inodes_wb() and the
> max-pause test.
  Right, that comment was a bit off.

> > looks still relatively weak. Shouldn't there be
> >   bdi_nr_reclaimable < (task_bdi_thresh + bdi_thresh) / 2?
> 
> That's much easier condition to satisfy..
  Argh, sorry. I was mistaken by the name of the variable - I though it
contains only dirty pages on the bdi but it also contains pages under
writeback and bdi_nr_reclaimable is the one that contains only dirty pages.
So your patch does exactly what I had in mind. You can add:
  Acked-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2011-08-18 20:08 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-17 16:14 [PATCH v9 00/13] memcg: per cgroup dirty page limiting Greg Thelen
2011-08-17 16:14 ` Greg Thelen
2011-08-17 16:14 ` [PATCH v9 01/13] memcg: document cgroup dirty memory interfaces Greg Thelen
2011-08-17 16:14   ` Greg Thelen
2011-08-17 16:14 ` [PATCH v9 02/13] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
2011-08-17 16:14   ` Greg Thelen
2011-08-17 16:14 ` [PATCH v9 03/13] memcg: add dirty page accounting infrastructure Greg Thelen
2011-08-17 16:14   ` Greg Thelen
2011-08-18  0:39   ` KAMEZAWA Hiroyuki
2011-08-18  0:39     ` KAMEZAWA Hiroyuki
2011-08-18  6:07     ` Greg Thelen
2011-08-18  6:07       ` Greg Thelen
2011-08-17 16:14 ` [PATCH v9 04/13] memcg: add kernel calls for memcg dirty page stats Greg Thelen
2011-08-17 16:14   ` Greg Thelen
2011-08-17 16:14 ` [PATCH v9 05/13] memcg: add mem_cgroup_mark_inode_dirty() Greg Thelen
2011-08-17 16:14   ` Greg Thelen
2011-08-18  0:51   ` KAMEZAWA Hiroyuki
2011-08-18  0:51     ` KAMEZAWA Hiroyuki
2011-08-17 16:14 ` [PATCH v9 06/13] memcg: add dirty limits to mem_cgroup Greg Thelen
2011-08-17 16:14   ` Greg Thelen
2011-08-18  0:53   ` KAMEZAWA Hiroyuki
2011-08-18  0:53     ` KAMEZAWA Hiroyuki
2011-08-17 16:14 ` [PATCH v9 07/13] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
2011-08-17 16:14   ` Greg Thelen
2011-08-18  0:55   ` KAMEZAWA Hiroyuki
2011-08-18  0:55     ` KAMEZAWA Hiroyuki
2011-08-17 16:15 ` [PATCH v9 08/13] memcg: dirty page accounting support routines Greg Thelen
2011-08-17 16:15   ` Greg Thelen
2011-08-18  1:05   ` KAMEZAWA Hiroyuki
2011-08-18  1:05     ` KAMEZAWA Hiroyuki
2011-08-18  7:04     ` Greg Thelen
2011-08-18  7:04       ` Greg Thelen
2011-08-17 16:15 ` [PATCH v9 09/13] memcg: create support routines for writeback Greg Thelen
2011-08-17 16:15   ` Greg Thelen
2011-08-18  1:13   ` KAMEZAWA Hiroyuki
2011-08-18  1:13     ` KAMEZAWA Hiroyuki
2011-08-17 16:15 ` [PATCH v9 10/13] writeback: pass wb_writeback_work into move_expired_inodes() Greg Thelen
2011-08-17 16:15   ` Greg Thelen
2011-08-18  1:15   ` KAMEZAWA Hiroyuki
2011-08-18  1:15     ` KAMEZAWA Hiroyuki
2011-08-17 16:15 ` [PATCH v9 11/13] writeback: make background writeback cgroup aware Greg Thelen
2011-08-17 16:15   ` Greg Thelen
2011-08-18  1:23   ` KAMEZAWA Hiroyuki
2011-08-18  1:23     ` KAMEZAWA Hiroyuki
2011-08-18  7:10     ` Greg Thelen
2011-08-18  7:10       ` Greg Thelen
2011-08-18  7:17       ` KAMEZAWA Hiroyuki
2011-08-18  7:17         ` KAMEZAWA Hiroyuki
2011-08-18  7:38         ` Greg Thelen
2011-08-18  7:38           ` Greg Thelen
2011-08-18  7:35           ` KAMEZAWA Hiroyuki
2011-08-18  7:35             ` KAMEZAWA Hiroyuki
2011-08-17 16:15 ` [PATCH v9 12/13] memcg: create support routines for page writeback Greg Thelen
2011-08-17 16:15   ` Greg Thelen
2011-08-18  1:38   ` KAMEZAWA Hiroyuki
2011-08-18  1:38     ` KAMEZAWA Hiroyuki
2011-08-18  2:36     ` Wu Fengguang
2011-08-18  2:36       ` Wu Fengguang
2011-08-18 10:12       ` Jan Kara
2011-08-18 10:12         ` Jan Kara
2011-08-18 12:17         ` Wu Fengguang
2011-08-18 12:17           ` Wu Fengguang
2011-08-18 20:08           ` Jan Kara [this message]
2011-08-18 20:08             ` Jan Kara
2011-08-19  1:36             ` Wu Fengguang
2011-08-19  1:36               ` Wu Fengguang
2011-08-17 16:15 ` [PATCH v9 13/13] memcg: check memcg dirty limits in " Greg Thelen
2011-08-17 16:15   ` Greg Thelen
2011-08-18  1:40   ` KAMEZAWA Hiroyuki
2011-08-18  1:40     ` KAMEZAWA Hiroyuki
2011-08-18  0:35 ` [PATCH v9 00/13] memcg: per cgroup dirty page limiting KAMEZAWA Hiroyuki
2011-08-18  0:35   ` KAMEZAWA Hiroyuki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110818200856.GD12426@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@intel.com \
    --cc=andrea@betterlinux.com \
    --cc=bsingharora@gmail.com \
    --cc=ciju@linux.vnet.ibm.com \
    --cc=containers@lists.osdl.org \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan.kim@gmail.com \
    --cc=nishimura@mxp.nes.nec.co.jp \
    --cc=rientjes@google.com \
    --cc=shaohua.li@intel.com \
    --cc=tim.c.chen@intel.com \
    --cc=vgoyal@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.