From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>, Rik van Riel <riel@redhat.com>,
	Greg Thelen <gthelen@google.com>,
	"bsingharora@gmail.com" <bsingharora@gmail.com>,
	Hugh Dickins <hughd@google.com>, Michal Hocko <mhocko@suse.cz>,
	linux-mm@kvack.org, Mel Gorman <mgorman@suse.de>,
	Ying Han <yinghan@google.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Minchan Kim <minchan.kim@gmail.com>
Subject: Re: reclaim the LRU lists full of dirty/writeback pages
Date: Thu, 16 Feb 2012 13:44:45 +0100	[thread overview]
Message-ID: <20120216124445.GB18613@quack.suse.cz> (raw)
In-Reply-To: <20120216040019.GB17597@localhost>

On Thu 16-02-12 12:00:19, Wu Fengguang wrote:
> On Tue, Feb 14, 2012 at 02:29:50PM +0100, Jan Kara wrote:
> > > >   I wonder what happens if you run:
> > > >        mkdir /cgroup/x
> > > >        echo 100M > /cgroup/x/memory.limit_in_bytes
> > > >        echo $$ > /cgroup/x/tasks
> > > > 
> > > >        for (( i = 0; i < 2; i++ )); do
> > > >          mkdir /fs/d$i
> > > >          for (( j = 0; j < 5000; j++ )); do 
> > > >            dd if=/dev/zero of=/fs/d$i/f$j bs=1k count=50
> > > >          done &
> > > >        done
> > > 
> > > That's a very good case, thanks!
> > >  
> > > >   Because for small files the writearound logic won't help much...
> > > 
> > > Right, it also means the native background work cannot be more I/O
> > > efficient than the pageout work items, except for the overhead of
> > > more work items.
> >   Yes, that's true.
> > 
> > > >   Also the number of work items queued might become interesting.
> > > 
> > > It turns out that the 1024 mempool reservations are not exhausted at
> > > all (the below patch adds a trace_printk on alloc failure and it never
> > > triggered).
> > > 
> > > Here are the representative iostat lines on XFS (full "iostat -kx 1 20" log attached):
> > > 
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle                                                                     
> > >            0.80    0.00    6.03    0.03    0.00   93.14                                                                     
> > >                                                                                                                             
> > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util                   
> > > sda               0.00   205.00    0.00  163.00     0.00 16900.00   207.36     4.09   21.63   1.88  30.70                   
> > > 
> > > The attached dirtied/written progress graph looks interesting.
> > > Although the iostat disk utilization is low, the "dirtied" progress
> > > line is pretty straight and there is not a single congestion_wait
> > > event in the trace log, which makes me wonder whether there are some
> > > unknown blocking issues in the way.
> >   Interesting. I'd also expect us to block in the reclaim path. How fast
> > can the dd threads progress when there is no cgroup involved?
> 
> I tried running the dd tasks in the global context with
> 
>         echo $((100<<20)) > /proc/sys/vm/dirty_bytes
> 
> and got mostly the same results on XFS:
> 
>         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                    0.85    0.00    8.88    0.00    0.00   90.26
> 
>         Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>         sda               0.00     0.00    0.00   50.00     0.00 23036.00   921.44     9.59  738.02   7.38  36.90
> 
>         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                    0.95    0.00    8.95    0.00    0.00   90.11
> 
>         Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>         sda               0.00   854.00    0.00   99.00     0.00 19552.00   394.99    34.14   87.98   3.82  37.80
  OK, so it seems that reclaiming pages in memcg reclaim acted as natural
throttling, similar to what balance_dirty_pages() does in the global case.
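Just to spell out the analogy, here is an illustrative, completely made-up
sketch (not code from any tree - "struct dirty_domain" and the helpers are
hypothetical stand-ins for the global dirty accounting on one hand and the
memcg counters on the other):

static void throttle_dirtier(struct dirty_domain *dom)
{
	/*
	 * Global case: balance_dirty_pages() sleeps while the system is
	 * over vm.dirty_bytes.  Memcg case: the charge path falls into
	 * memcg reclaim and stalls on the dirty/writeback pages at the
	 * tail of the LRU.  Either way the dirtier is paced by how fast
	 * writeback makes pages cleanable again.
	 */
	while (nr_dirty(dom) + nr_writeback(dom) > dirty_limit(dom)) {
		start_background_writeback(dom);	/* kick the flusher */
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(HZ / 10);		/* wait for progress */
	}
}

In both cases the dirtier only makes progress as fast as the flusher can
clean pages.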


> Interestingly, ext4 shows comparable throughput, but reports nearly
> 100% disk utilization:
> 
>         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                    0.76    0.00    9.02    0.00    0.00   90.23
> 
>         Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>         sda               0.00     0.00    0.00  317.00     0.00 20956.00   132.21    28.57   82.71   3.16 100.10
> 
>         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                    0.82    0.00    8.95    0.00    0.00   90.23
> 
>         Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>         sda               0.00     0.00    0.00  402.00     0.00 24388.00   121.33    21.09   58.55   2.42  97.40
> 
>         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                    0.82    0.00    8.99    0.00    0.00   90.19
> 
>         Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>         sda               0.00     0.00    0.00  409.00     0.00 21996.00   107.56    15.25   36.74   2.30  94.10
  The average request size is smaller, so maybe ext4 does more seeking.

> > > > Another common case to test - run the 'slapadd' command in each cgroup
> > > > to create a big LDAP database. That does pretty much random IO on a big
> > > > mmapped DB file.
> > > 
> > > I've not used this. Will it need some configuration and data feed?
> > > fio looks more handy to me for emulating mmap random IO.
> >   Yes, fio can generate random mmap IO. It's just that this is a real-life
> > workload, so it is not completely random: it happens on several files and
> > is also interleaved with other memory allocations from the DB. I can send
> > you the config files and the data feed if you are interested.
> 
> I'm very interested, thank you!
  OK, I'll send it in a private email...

> > > > > +/*
> > > > > + * schedule writeback on a range of inode pages.
> > > > > + */
> > > > > +static struct wb_writeback_work *
> > > > > +bdi_flush_inode_range(struct backing_dev_info *bdi,
> > > > > +		      struct inode *inode,
> > > > > +		      pgoff_t offset,
> > > > > +		      pgoff_t len,
> > > > > +		      bool wait)
> > > > > +{
> > > > > +	struct wb_writeback_work *work;
> > > > > +
> > > > > +	if (!igrab(inode))
> > > > > +		return ERR_PTR(-ENOENT);
> > > >   One technical note here: If the inode is deleted while it is queued, this
> > > > reference will keep it alive until the flusher thread gets to it. Then, when
> > > > the flusher thread puts its reference, the inode will get deleted in the
> > > > flusher thread's context. I don't see an immediate problem with that, but it
> > > > might be surprising sometimes. Another problem I see is that if you try to
> > > > unmount the filesystem while the work item is queued, you'll get EBUSY for
> > > > no apparent reason (from userspace's point of view).
> > > 
> > > Yeah, we need to make umount work.
> >   The positive thing is that if the inode is reaped while the work item is
> > queued, we know that everything that needed to be done is done. So we don't
> > really need to pin the inode.
> 
> But I do need to make sure the *inode pointer does not point to some
> invalid memory at work exec time. Is this possible without raising
> ->i_count?
  I was thinking about it, and what should work is that we keep the inode
reference in the work item, but in generic_shutdown_super() we go through
the work list and drop all work items for the superblock before calling
evict_inodes()...
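Roughly something like the following, completely untested sketch
(bdi_forget_sb_work() is a made-up name, work->inode is the field your
bdi_flush_inode_range() work items carry, and the freeing would have to
match however the work is actually allocated):

static void bdi_forget_sb_work(struct backing_dev_info *bdi,
			       struct super_block *sb)
{
	struct wb_writeback_work *work, *tmp;
	LIST_HEAD(dropped);

	/* Steal every queued work item that references an inode of @sb. */
	spin_lock_bh(&bdi->wb_lock);
	list_for_each_entry_safe(work, tmp, &bdi->work_list, list)
		if (work->inode && work->inode->i_sb == sb)
			list_move(&work->list, &dropped);
	spin_unlock_bh(&bdi->wb_lock);

	/*
	 * Drop the inode references the work items were holding so that
	 * evict_inodes() does not see busy inodes and umount does not
	 * fail with EBUSY.
	 */
	list_for_each_entry_safe(work, tmp, &dropped, list) {
		list_del(&work->list);
		iput(work->inode);
		kfree(work);	/* or mempool_free(), matching the allocator */
	}
}

generic_shutdown_super() would call it for sb->s_bdi just before
evict_inodes(sb).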

								Honza

Thread overview: 33+ messages
2012-02-08  7:55 memcg writeback (was Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.) Greg Thelen
2012-02-08  9:31 ` Wu Fengguang
2012-02-08 20:54   ` Ying Han
2012-02-09 13:50     ` Wu Fengguang
2012-02-13 18:40       ` Ying Han
2012-02-10  5:51   ` Greg Thelen
2012-02-10  5:52     ` Greg Thelen
2012-02-10  9:20       ` Wu Fengguang
2012-02-10 11:47     ` Wu Fengguang
2012-02-11 12:44       ` reclaim the LRU lists full of dirty/writeback pages Wu Fengguang
2012-02-11 14:55         ` Rik van Riel
2012-02-12  3:10           ` Wu Fengguang
2012-02-12  6:45             ` Wu Fengguang
2012-02-13 15:43             ` Jan Kara
2012-02-14 10:03               ` Wu Fengguang
2012-02-14 13:29                 ` Jan Kara
2012-02-16  4:00                   ` Wu Fengguang
2012-02-16 12:44                     ` Jan Kara [this message]
2012-02-16 13:32                       ` Wu Fengguang
2012-02-16 14:06                         ` Wu Fengguang
2012-02-17 16:41                     ` Wu Fengguang
2012-02-20 14:00                       ` Jan Kara
2012-02-14 10:19         ` Mel Gorman
2012-02-14 13:18           ` Wu Fengguang
2012-02-14 13:35             ` Wu Fengguang
2012-02-14 15:51             ` Mel Gorman
2012-02-16  9:50               ` Wu Fengguang
2012-02-16 17:31                 ` Mel Gorman
2012-02-27 14:24                   ` Fengguang Wu
2012-02-16  0:00             ` KAMEZAWA Hiroyuki
2012-02-16  3:04               ` Wu Fengguang
2012-02-16  3:52                 ` KAMEZAWA Hiroyuki
2012-02-16  4:05                   ` Wu Fengguang
