All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Andreas Gruenbacher <agruenba@redhat.com>
Cc: "Christoph Hellwig" <hch@lst.de>,
	cluster-devel@redhat.com,
	"Ross Lagerwall" <ross.lagerwall@citrix.com>,
	"Mark Syms" <Mark.Syms@citrix.com>,
	"Edwin Török" <edvin.torok@citrix.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: gfs2 iomap dealock, IOMAP_F_UNBALANCED
Date: Fri, 22 Mar 2019 08:43:45 +1100	[thread overview]
Message-ID: <20190321214345.GE26298@dastard> (raw)
In-Reply-To: <20190321131304.21618-1-agruenba@redhat.com>

On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
> 
> we need your help fixing a gfs2 deadlock involving iomap.  What's going
> on is the following:
> 
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end.  It currently always does that
>   even though there is no point other than for journaled data writes.
> 
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.
> 
> We can stop gfs2_iomap_begin from keeping the log flush lock held for
> non-journaled data writes, but that still leaves us with the deadlock
> for journaled data writes: for them, we need to add the data pages to
> the running transaction, so dropping the log flush lock wouldn't work.
> 
> I've tried to work around the unwanted balance_dirty_pages_ratelimited
> in gfs2_write_inode as originally suggested by Ross.  That works to a
> certain degree, but only if we always skip inodes in the WB_SYNC_NONE
> case, and that means that gfs2_write_inode becomes quite ineffective.
> 
> So it seems that it would be more reasonable to avoid the dirty page
> balancing under the log flush lock in the first place.
> 
> The attached patch changes gfs2_iomap_begin to only hold on to the log
> flush lock for journaled data writes.  In that case, we also make sure
> to limit the write size to not overrun dirty limits using a similar
> logic as in balance_dirty_pages_ratelimited; there is precedent for that
> approach in btrfs_buffered_write.  Then, we prevent iomap from balancing
> dirty pages via the new IOMAP_F_UNBALANCED flag.

....
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 58a768e59712e..542bd149c4aa3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			iocb->ki_pos += ret;
>  	}
>  
> +	balance_dirty_pages_ratelimited(file->f_mapping);
> +
>  out2:
>  	current->backing_dev_info = NULL;
>  out:

The problem is calling balance_dirty_pages() inside the
->iomap_begin/->iomap_end calls and not that it is called by the
iomap infrastructure itself, right?

Is so, I'd prefer to see this in iomap_apply() after the call to
ops->iomap_end because iomap_file_buffered_write() can iterate and
call iomap_apply() multiple times. This would keep the balancing to
a per-iomap granularity, rather than a per-syscall granularity.

i.e. if we do write(2GB), we want more than one balancing call
during that syscall, so it woul dbe up to the filesystem to a) limit
the size of write mappings to something smaller (e.g. 1024 pages)
so that there are still frequent balancing calls for large writes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] gfs2 iomap dealock, IOMAP_F_UNBALANCED
Date: Fri, 22 Mar 2019 08:43:45 +1100	[thread overview]
Message-ID: <20190321214345.GE26298@dastard> (raw)
In-Reply-To: <20190321131304.21618-1-agruenba@redhat.com>

On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
> 
> we need your help fixing a gfs2 deadlock involving iomap.  What's going
> on is the following:
> 
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end.  It currently always does that
>   even though there is no point other than for journaled data writes.
> 
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.
> 
> We can stop gfs2_iomap_begin from keeping the log flush lock held for
> non-journaled data writes, but that still leaves us with the deadlock
> for journaled data writes: for them, we need to add the data pages to
> the running transaction, so dropping the log flush lock wouldn't work.
> 
> I've tried to work around the unwanted balance_dirty_pages_ratelimited
> in gfs2_write_inode as originally suggested by Ross.  That works to a
> certain degree, but only if we always skip inodes in the WB_SYNC_NONE
> case, and that means that gfs2_write_inode becomes quite ineffective.
> 
> So it seems that it would be more reasonable to avoid the dirty page
> balancing under the log flush lock in the first place.
> 
> The attached patch changes gfs2_iomap_begin to only hold on to the log
> flush lock for journaled data writes.  In that case, we also make sure
> to limit the write size to not overrun dirty limits using a similar
> logic as in balance_dirty_pages_ratelimited; there is precedent for that
> approach in btrfs_buffered_write.  Then, we prevent iomap from balancing
> dirty pages via the new IOMAP_F_UNBALANCED flag.

....
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 58a768e59712e..542bd149c4aa3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			iocb->ki_pos += ret;
>  	}
>  
> +	balance_dirty_pages_ratelimited(file->f_mapping);
> +
>  out2:
>  	current->backing_dev_info = NULL;
>  out:

The problem is calling balance_dirty_pages() inside the
->iomap_begin/->iomap_end calls and not that it is called by the
iomap infrastructure itself, right?

Is so, I'd prefer to see this in iomap_apply() after the call to
ops->iomap_end because iomap_file_buffered_write() can iterate and
call iomap_apply() multiple times. This would keep the balancing to
a per-iomap granularity, rather than a per-syscall granularity.

i.e. if we do write(2GB), we want more than one balancing call
during that syscall, so it woul dbe up to the filesystem to a) limit
the size of write mappings to something smaller (e.g. 1024 pages)
so that there are still frequent balancing calls for large writes.

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com



  reply	other threads:[~2019-03-21 21:43 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-21 13:13 gfs2 iomap dealock, IOMAP_F_UNBALANCED Andreas Gruenbacher
2019-03-21 13:13 ` [Cluster-devel] " Andreas Gruenbacher
2019-03-21 21:43 ` Dave Chinner [this message]
2019-03-21 21:43   ` Dave Chinner
2019-03-21 23:01   ` Andreas Gruenbacher
2019-03-21 23:01     ` [Cluster-devel] " Andreas Gruenbacher
2019-03-22  0:21   ` Andreas Gruenbacher
2019-03-22  0:21     ` [Cluster-devel] " Andreas Gruenbacher
2019-03-27 16:49     ` Ross Lagerwall
2019-03-27 16:49       ` [Cluster-devel] " Ross Lagerwall
2019-03-28 16:51 ` Christoph Hellwig
2019-03-28 16:51   ` [Cluster-devel] " Christoph Hellwig
2019-03-29 22:13   ` Andreas Gruenbacher
2019-03-29 22:13     ` [Cluster-devel] " Andreas Gruenbacher
2019-04-07  7:32     ` Christoph Hellwig
2019-04-07  7:32       ` [Cluster-devel] " Christoph Hellwig
2019-04-08  8:53       ` Andreas Gruenbacher
2019-04-08  8:53         ` [Cluster-devel] " Andreas Gruenbacher
2019-04-08 13:44         ` Jan Kara
2019-04-08 13:44           ` [Cluster-devel] " Jan Kara
2019-04-09 12:15           ` Christoph Hellwig
2019-04-09 12:15             ` [Cluster-devel] " Christoph Hellwig
2019-04-09 12:27             ` Andreas Gruenbacher
2019-04-09 12:27               ` [Cluster-devel] " Andreas Gruenbacher

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190321214345.GE26298@dastard \
    --to=david@fromorbit.com \
    --cc=Mark.Syms@citrix.com \
    --cc=agruenba@redhat.com \
    --cc=cluster-devel@redhat.com \
    --cc=edvin.torok@citrix.com \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=ross.lagerwall@citrix.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.