Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED

From: Jan Kara <jack@suse.cz>
To: Andreas Gruenbacher <agruenba@redhat.com>
Cc: "Christoph Hellwig" <hch@lst.de>,
	cluster-devel <cluster-devel@redhat.com>,
	"Dave Chinner" <david@fromorbit.com>,
	"Ross Lagerwall" <ross.lagerwall@citrix.com>,
	"Mark Syms" <Mark.Syms@citrix.com>,
	"Edwin Török" <edvin.torok@citrix.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"Jan Kara" <jack@suse.cz>,
	linux-mm@kvack.org
Subject: Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
Date: Mon, 8 Apr 2019 15:44:05 +0200
Message-ID: <20190408134405.GA15023@quack2.suse.cz>
In-Reply-To: <CAHc6FU7kgm4OyrY-KRb8H2w6LDrWDSJ2p=UgZeeJ8YrHynKU2w@mail.gmail.com>

On Mon 08-04-19 10:53:34, Andreas Gruenbacher wrote:
> On Sun, 7 Apr 2019 at 09:32, Christoph Hellwig <hch@lst.de> wrote:
> >
> > [adding Jan and linux-mm]
> >
> > On Fri, Mar 29, 2019 at 11:13:00PM +0100, Andreas Gruenbacher wrote:
> > > > But what is the requirement to do this in writeback context?  Can't
> > > > we move it out into another context instead?
> > >
> > > Indeed, this isn't for data integrity in this case but because the
> > > dirty limit is exceeded. What other context would you suggest to move
> > > this to?
> > >
> > > (The iomap flag I've proposed would save us from getting into this
> > > situation in the first place.)
> >
> > Your patch does two things:
> >
> >  - it only calls balance_dirty_pages_ratelimited once per write
> >    operation instead of once per page.  In the past btrfs did
> >    hacks like that, but IIRC they caused VM balancing issues.
> >    That is why everyone now calls balance_dirty_pages_ratelimited
> > >    once per page.  If calling it at a coarse granularity would
> >    be fine we should do it everywhere instead of just in gfs2
> >    in journaled mode
> > >  - it artificially reduces the size of writes to a low value,
> > >    which I suspect is going to break real-life applications
> 
> Not quite, balance_dirty_pages_ratelimited is called from iomap_end,
> so once per iomap mapping returned, not per write. (The first version
> of this patch got that wrong by accident, but not the second.)
> 
> We can limit the size of the mappings returned just in that case. I'm
> aware that there is a risk of balancing problems, I just don't have
> any better ideas.
> 
> > This is a problem all filesystems with data journaling will have with
> > iomap; it's not that gfs2 is doing anything particularly stupid.

I agree that if ext4 were using iomap, it would have similar issues.
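
To put rough numbers on the granularity question above, here is a small
user-space model, not kernel code. All names in it are made up for
illustration, and it only counts how often a throttle check would run
per page versus per size-capped mapping; it says nothing about how the
real balance_dirty_pages_ratelimited() behaves internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel API. */
#define MODEL_PAGE_SIZE 4096u

/* Number of pages touched by a write of len bytes starting at pos. */
static size_t pages_spanned(size_t pos, size_t len)
{
	if (len == 0)
		return 0;
	return (pos + len - 1) / MODEL_PAGE_SIZE - pos / MODEL_PAGE_SIZE + 1;
}

/* Throttle checks if the check runs once per page written. */
static size_t checks_per_page(size_t pos, size_t len)
{
	return pages_spanned(pos, len);
}

/* Throttle checks if it runs once per mapping of at most max_mapping bytes. */
static size_t checks_per_mapping(size_t len, size_t max_mapping)
{
	if (len == 0 || max_mapping == 0)
		return 0;
	return (len + max_mapping - 1) / max_mapping;
}

/* Worst-case number of pages dirtied between two consecutive checks. */
static size_t max_pages_between_checks(size_t max_mapping)
{
	/* A mapping starting on the last byte of a page straddles one extra page. */
	return pages_spanned(MODEL_PAGE_SIZE - 1, max_mapping);
}
```

In this model a 1 MiB write at offset 0 gets 256 checks per page, but
only 16 checks with mappings capped at 64 KiB, with up to 17 pages
dirtied between two checks.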

> > So I really think we need to fix this properly.  And if that means
> > that you can't make use of the iomap batching for gfs2 in journaled
> > mode that is still a better option.
> 
> That would mean using the old-style, page-size allocations, and a
> completely separate write path in that case. That would be quite a
> nightmare.
> 
> > But I really think you need
> > to look into the scope of your flush_log and figure out a good way
> > > to reduce that and solve the root cause.
> 
> We won't be able to do a log flush while another transaction is
> active, but that's what's needed to clean dirty pages. iomap doesn't
> allow us to put the block allocation into a separate transaction from
> the page writes; for that, the opposite to the page_done hook would
> probably be needed.

I agree that a ->page_prepare() hook would probably be the cleanest
solution for this.
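
A rough user-space sketch of the idea, under the assumption that the
hook pair would bracket each page so that the filesystem can open and
close a transaction per page. Every name below is hypothetical and none
of it is the actual iomap interface; the "flush" is just a counter
standing in for the log flush that dirty throttling may need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel API. */
struct model_fs {
	int trans_active;	/* 1 while a transaction is open */
	int flushes;		/* log flushes that were able to run */
	int blocked;		/* flush attempts made inside a transaction */
};

struct model_page_ops {
	void (*page_prepare)(struct model_fs *fs);	/* before each page */
	void (*page_done)(struct model_fs *fs);		/* after each page */
};

static void model_trans_begin(struct model_fs *fs) { fs->trans_active = 1; }
static void model_trans_end(struct model_fs *fs)   { fs->trans_active = 0; }

/* Stand-in for dirty throttling that needs a log flush to clean pages. */
static void model_balance_dirty(struct model_fs *fs)
{
	if (fs->trans_active)
		fs->blocked++;	/* the real thing could deadlock here */
	else
		fs->flushes++;
}

/* Write npages pages, throttling after each page. */
static void model_write(struct model_fs *fs,
			const struct model_page_ops *ops, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		if (ops && ops->page_prepare)
			ops->page_prepare(fs);
		/* ... copy data into the page and mark it dirty ... */
		if (ops && ops->page_done)
			ops->page_done(fs);
		model_balance_dirty(fs);
	}
}

/* Per-page transactions: no transaction is open when throttling runs. */
static const struct model_page_ops model_per_page_trans = {
	.page_prepare	= model_trans_begin,
	.page_done	= model_trans_end,
};

/* One transaction around the whole mapping, with no per-page hooks. */
static void model_write_one_big_trans(struct model_fs *fs, size_t npages)
{
	model_trans_begin(fs);
	model_write(fs, NULL, npages);
	model_trans_end(fs);
}
```

With model_per_page_trans every throttle call finds no transaction
open, while model_write_one_big_trans() hits the blocked case once per
page.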

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR