Re: gfs2 iomap dealock, IOMAP_F_UNBALANCED

From: Dave Chinner <david@fromorbit.com>
To: Andreas Gruenbacher <agruenba@redhat.com>
Cc: "Christoph Hellwig" <hch@lst.de>,
	cluster-devel@redhat.com,
	"Ross Lagerwall" <ross.lagerwall@citrix.com>,
	"Mark Syms" <Mark.Syms@citrix.com>,
	"Edwin Török" <edvin.torok@citrix.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: gfs2 iomap dealock, IOMAP_F_UNBALANCED
Date: Fri, 22 Mar 2019 08:43:45 +1100	[thread overview]
Message-ID: <20190321214345.GE26298@dastard> (raw)
In-Reply-To: <20190321131304.21618-1-agruenba@redhat.com>

On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
> 
> we need your help fixing a gfs2 deadlock involving iomap.  What's going
> on is the following:
> 
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end.  It currently always does that
>   even though there is no point other than for journaled data writes.
> 
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.
> 
> We can stop gfs2_iomap_begin from keeping the log flush lock held for
> non-journaled data writes, but that still leaves us with the deadlock
> for journaled data writes: for them, we need to add the data pages to
> the running transaction, so dropping the log flush lock wouldn't work.
> 
> I've tried to work around the unwanted balance_dirty_pages_ratelimited
> in gfs2_write_inode as originally suggested by Ross.  That works to a
> certain degree, but only if we always skip inodes in the WB_SYNC_NONE
> case, and that means that gfs2_write_inode becomes quite ineffective.
> 
> So it seems that it would be more reasonable to avoid the dirty page
> balancing under the log flush lock in the first place.
> 
> The attached patch changes gfs2_iomap_begin to only hold on to the log
> flush lock for journaled data writes.  In that case, we also make sure
> to limit the write size to not overrun dirty limits using a similar
> logic as in balance_dirty_pages_ratelimited; there is precedent for that
> approach in btrfs_buffered_write.  Then, we prevent iomap from balancing
> dirty pages via the new IOMAP_F_UNBALANCED flag.

....
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 58a768e59712e..542bd149c4aa3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			iocb->ki_pos += ret;
>  	}
>  
> +	balance_dirty_pages_ratelimited(file->f_mapping);
> +
>  out2:
>  	current->backing_dev_info = NULL;
>  out:

The problem is calling balance_dirty_pages() inside the
->iomap_begin/->iomap_end calls and not that it is called by the
iomap infrastructure itself, right?

Is so, I'd prefer to see this in iomap_apply() after the call to
ops->iomap_end because iomap_file_buffered_write() can iterate and
call iomap_apply() multiple times. This would keep the balancing to
a per-iomap granularity, rather than a per-syscall granularity.

i.e. if we do write(2GB), we want more than one balancing call
during that syscall, so it woul dbe up to the filesystem to a) limit
the size of write mappings to something smaller (e.g. 1024 pages)
so that there are still frequent balancing calls for large writes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com