From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	Wu Fengguang <fengguang.wu@intel.com>
Subject: Re: [PATCH 14/14] writeback: Per-sb dirty tracking
Date: Fri, 8 Aug 2014 12:46:19 +0200
Message-ID: <20140808104619.GD2398@quack.suse.cz>
In-Reply-To: <20140806211305.GK20518@dastard>

On Thu 07-08-14 07:13:05, Dave Chinner wrote:
> On Wed, Aug 06, 2014 at 10:46:39AM +0200, Jan Kara wrote:
> > On Wed 06-08-14 09:44:16, Dave Chinner wrote:
> > > On Fri, Aug 01, 2014 at 12:00:53AM +0200, Jan Kara wrote:
> > > > Switch inode dirty tracking lists to be per superblock instead of per
> > > > bdi. This is a major step towards filesystems being able to do their
> > > > own dirty tracking and selection of inodes for writeback if they so
> > > > desire (e.g. because they journal or COW data and need to write back
> > > > inodes & pages in a specific order unknown to the generic writeback code).
> > > > 
> > > > Per superblock dirty lists also make selecting inodes for writeback
> > > > somewhat simpler because we don't have to search for inodes from a
> > > > particular superblock for some kinds of writeback (OTOH we pay for this
> > > > by having to iterate through superblocks for the all-bdi type of writeback),
> > > > and this simplification will allow for an easier switch to a better
> > > > scaling data structure for dirty inodes.
> ....
> > > I'm not sure I really like this code very much - it seems to be
> > > much more complex than it needs to be because writeback is still
> > > managed on a per-bdi basis and the sb iteration is pretty clunky.
> > > If we are moving to per-sb inode tracking, we should also move all
> > > the writeback management to per-sb as well.
> > > 
> > > IMO, there's no good reason for keeping flusher threads per-bdi and
> > > then having to iterate per-sb just to do background/periodic
> > > writeback, and then have special cases for sb specific writeback
> > > that avoids the bdi per-sb looping. i.e. per-sb flush work executed
> > > by a bdi flusher thread makes a lot more sense than per-bdi
> > > flush work that iterates superblocks.
> > > 
> > > So for the moment, I think this patch makes things worse rather than
> > > better. I'd much prefer to see a single series that moves from per-bdi
> > > tracking/writeback to per-sb tracking/writeback than to split the
> > > tracking/writeback changes and then have to support a weird,
> > > temporary, intermediate code base like this...
> >   So when writing this series I was thinking about both possibilities -
> > i.e., keeping per-bdi threads and changing to per-sb threads. In the end
> > I've decided to start with keeping per-bdi threads and seeing how things
> > work out.
> 
> > Regarding per-sb threads I have two unresolved questions:
> > 
> > 1) How to handle block device inodes - when a filesystem is mounted on the
> > block device, it would be natural to write back such an inode together with
> > the other inodes of that filesystem. When no filesystem is mounted on the
> > block device we have no superblock to attach such an inode to, so we would
> > have to have some flusher for the virtual blkdev superblock that deals with
> > the specifics of the block device superblock.
> 
> You mean for the blockdev_superblock? That's probably a good idea;
> it would isolate the special-case blockdev inode writeback in its
> own little function. i.e. this is crying out for a superblock
> specific writeback method (just like tux3 is asking for) that simply
> does:
> 
> 	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> 		if (work->sync == WB_SYNC_ALL)
> 			__sync_blockdev(I_BDEV(inode), 1);
> 		else
> 			__sync_blockdev(I_BDEV(inode), 0);
> 	}
  Sadly, it's not that easy - you have to pass in a struct writeback_control
so that writing obeys nr_to_write - otherwise you can livelock doing
writeback, and preemption of background work won't work either. So you have
to do something along the lines of what writeback_inodes() does...
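
Roughly, I'd expect such a callout to end up looking like the sketch below
(the function name is made up, locking and inode refcounting are omitted,
and the wbc setup just mirrors the generic writeback path):

	static long writeback_blockdev_inodes(struct super_block *sb,
					      struct wb_writeback_work *work)
	{
		struct writeback_control wbc = {
			.sync_mode	= work->sync_mode,
			.nr_to_write	= work->nr_pages,
			.range_start	= 0,
			.range_end	= LLONG_MAX,
		};
		struct inode *inode;
		long wrote = 0;

		/* inode_sb_list_lock / inode refcount handling omitted for brevity */
		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
			long budget = wbc.nr_to_write;

			do_writepages(inode->i_mapping, &wbc);
			wrote += budget - wbc.nr_to_write;

			/* stop once the budget is used up so background work
			 * can be preempted instead of livelocking */
			if (wbc.nr_to_write <= 0)
				break;
		}
		return wrote;
	}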

Also I don't really see how to get rid of special-casing block devices.
Normal superblocks are attached to a bdi - the bdi flusher iterates all
superblocks on that bdi to do background writeback of the bdi. However, the
block device superblock is special - inodes of that superblock belong to
different bdis, so you cannot just attach that superblock to any particular
bdi. That's why I created a special "writeback queue" for each bdi, embedded
in struct backing_dev_info, and use that for tracking the writeback state of
block device inodes belonging to the bdi.
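
Picking the dirty list then becomes something like the sketch below
(b_dirty_blkdev and s_dirty are only placeholder names here, not the fields
the patches actually add):

	static struct list_head *inode_dirty_list(struct inode *inode)
	{
		struct super_block *sb = inode->i_sb;

		/* blockdev inodes: use the queue embedded in their bdi,
		 * because blockdev_superblock spans many bdis */
		if (sb_is_blkdev_sb(sb))
			return &inode->i_mapping->backing_dev_info->b_dirty_blkdev;

		/* everything else: the per-sb dirty list */
		return &sb->s_dirty;
	}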

> For data integrity operations, the higher layers call
> sync_blockdev() appropriately so we don't actually have to care
> about blockdev inodes during per-sb writeback at all.
  Agreed.

> With that, we get rid of another longstanding, ugly wart in the
> writeback code.
> 
> > 2) How to deal with multiple superblocks per device - and here I'm
> > convinced we should not regress writeback performance in the case where a
> > disk is partitioned into several partitions. And I think this will require
> > some kind of synchronization between per-sb threads on the same device.
> 
> Right, that's why I suggested this:
> 
> > > i.e. per-sb flush work executed
> > > by a bdi flusher thread makes a lot more sense than per-bdi
> > > flush work that iterates superblocks.
  Aha, sorry, I misunderstood this sentence.

> What I mean is that every work that is executed on a BDI has
> work->sb set. Functions like wb_check_background_flush() no longer
> take just the bdi, they also take the sb they are going to operate
> on so it can be stuffed into
> the work structure that is passed to bdi_writeback(). At this point,
> bdi_writeback() only needs to operate on a single sb.
> 
> We still have all the same "periodic/background writeback is aborted
> if new work is queued" logic so that data integrity operations
> preempt background work, so there is no difference in behaviour
> there.  i.e. the bdi still manages the interactions between all the
> superblocks, just from a slightly different perspective: we
> explicitly schedule writeback by superblock rather than by bdi.
  OK, this is an interesting idea. However, suppose you have two active
filesystems on a bdi. Now you have to somehow switch between these two
filesystems - I don't think we can wait until the first filesystem is clean
before switching to the second one, because that may starve the second
filesystem of background writeback for a long time, and in the extreme case
it may never happen. But I suppose if we set nr_to_write sensibly we could
avoid such problems... I'll try this and see how it looks in the code.
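
A rough sketch of that kind of scheduling (bdi_background_round() and the
chunk size are made up, sb lifetime handling is ignored, and mainline does
background writeback in-line via wb_check_background_flush() rather than as
queued work items; iterate_supers() and bdi_queue_work() are the existing
helpers):

	static void queue_sb_background(struct super_block *sb, void *arg)
	{
		struct backing_dev_info *bdi = arg;
		struct wb_writeback_work *work;

		if (sb->s_bdi != bdi)
			return;

		work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
		if (!work)
			return;

		work->sb = sb;
		work->sync_mode = WB_SYNC_NONE;
		work->nr_pages = 1024;		/* bounded chunk, then move on */
		work->for_background = 1;
		work->reason = WB_REASON_BACKGROUND;

		/* no completion attached, so the flusher frees the work item */
		bdi_queue_work(bdi, work);
	}

	/* one round of background writeback: a bounded chunk per sb on the bdi */
	static void bdi_background_round(struct backing_dev_info *bdi)
	{
		iterate_supers(queue_sb_background, bdi);
	}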

> IOWs, I'm not suggesting per-sb flusher threads, just scheduling of
> the work to the existing bdi flusher thread slightly differently.
> 
> I suspect that the above two things will also make the sb-specific
> writeback callout much more obvious and, in general, make the
> writeback logic easier to read as there's no extraneous juggling of
> special cases needed. The sb callout will handle the blockdev inode
> writeback case, and the change of work scheduling gets rid of the
> need to iterate the sbs on the bdi....

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

Thread overview: 25+ messages
2014-07-31 22:00 [RFC PATCH 00/14] Per-sb tracking of dirty inodes Jan Kara
2014-07-31 22:00 ` [PATCH 01/14] writeback: Get rid of superblock pinning Jan Kara
2014-07-31 22:00 ` [PATCH 02/14] writeback: Remove writeback_inodes_wb() Jan Kara
2014-07-31 22:00 ` [PATCH 03/14] writeback: Remove useless argument of writeback_single_inode() Jan Kara
2014-07-31 22:00 ` [PATCH 04/14] writeback: Don't put inodes which cannot be written to b_more_io Jan Kara
2014-07-31 22:00 ` [PATCH 05/14] writeback: Move dwork and last_old_flush into backing_dev_info Jan Kara
2014-07-31 22:00 ` [PATCH 06/14] writeback: Switch locking of bandwidth fields to wb_lock Jan Kara
2014-07-31 22:00 ` [PATCH 07/14] writeback: Provide a function to get bdi from bdi_writeback Jan Kara
2014-07-31 22:00 ` [PATCH 08/14] writeback: Schedule future writeback if bdi (not wb) has dirty inodes Jan Kara
2014-07-31 22:00 ` [PATCH 09/14] writeback: Switch some function arguments from bdi_writeback to bdi Jan Kara
2014-07-31 22:00 ` [PATCH 10/14] writeback: Move rechecking of work list into bdi_process_work_items() Jan Kara
2014-07-31 22:00 ` [PATCH 11/14] writeback: Shorten list_lock hold times in bdi_writeback() Jan Kara
2014-07-31 22:00 ` [PATCH 12/14] writeback: Move refill of b_io list into writeback_inodes() Jan Kara
2014-07-31 22:00 ` [PATCH 13/14] writeback: Comment update Jan Kara
2014-07-31 22:00 ` [PATCH 14/14] writeback: Per-sb dirty tracking Jan Kara
2014-08-01  5:14   ` Daniel Phillips
2014-08-05 23:44   ` Dave Chinner
2014-08-06  8:46     ` Jan Kara
2014-08-06 21:13       ` Dave Chinner
2014-08-08 10:46         ` Jan Kara [this message]
2014-08-10 23:16           ` Dave Chinner
2014-08-01  5:32 ` [RFC PATCH 00/14] Per-sb tracking of dirty inodes Daniel Phillips
2014-08-05  5:22 ` Dave Chinner
2014-08-05 10:31   ` Jan Kara
2014-08-05  8:20 ` Dave Chinner
