linux-fsdevel.vger.kernel.org archive mirror
* gfs2 iomap deadlock, IOMAP_F_UNBALANCED
@ 2019-03-21 13:13 Andreas Gruenbacher
  2019-03-21 21:43 ` Dave Chinner
  2019-03-28 16:51 ` Christoph Hellwig
  0 siblings, 2 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2019-03-21 13:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Gruenbacher, cluster-devel, Dave Chinner, Ross Lagerwall,
	Mark Syms, Edwin Török, linux-fsdevel

Hi Christoph,

we need your help fixing a gfs2 deadlock involving iomap.  What's going
on is the following:

* During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
  lock and keeps it until gfs2_iomap_end.  It currently always does that
  even though there is no point other than for journaled data writes.

* iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
  If that ends up calling gfs2_write_inode, gfs2 will try to grab the
  log flush lock again and deadlock.

We can stop gfs2_iomap_begin from keeping the log flush lock held for
non-journaled data writes, but that still leaves us with the deadlock
for journaled data writes: for them, we need to add the data pages to
the running transaction, so dropping the log flush lock wouldn't work.

I've tried to work around the unwanted balance_dirty_pages_ratelimited
in gfs2_write_inode as originally suggested by Ross.  That works to a
certain degree, but only if we always skip inodes in the WB_SYNC_NONE
case, and that means that gfs2_write_inode becomes quite ineffective.

So it seems that it would be more reasonable to avoid the dirty page
balancing under the log flush lock in the first place.

The attached patch changes gfs2_iomap_begin to only hold on to the log
flush lock for journaled data writes.  In that case, we also make sure
to limit the write size so that it doesn't overrun the dirty limits,
using logic similar to that in balance_dirty_pages_ratelimited; there
is precedent for that approach in btrfs_buffered_write.  Then, we
prevent iomap from balancing dirty pages via the new IOMAP_F_UNBALANCED
flag.

Does that seem like an acceptable approach?

Thanks,
Andreas

---
 fs/gfs2/bmap.c        | 62 +++++++++++++++++++++++++++++++++----------
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            |  3 ++-
 include/linux/iomap.h |  1 +
 4 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 02b2646d84b3a..dad7d3810533e 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -862,6 +862,24 @@ static int gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 	iomap->offset = lblock << inode->i_blkbits;
 	lblock_stop = (pos + length - 1) >> inode->i_blkbits;
 	len = lblock_stop - lblock + 1;
+
+	if ((flags & IOMAP_WRITE) && gfs2_is_jdata(ip)) {
+		int max_pages;
+		u64 max_blocks;
+
+		/*
+		 * Limit the write size: this ensures that write throttling
+		 * will kick in fast enough even when we don't call
+		 * balance_dirty_pages_ratelimited for each page written.
+		 */
+		max_pages = min((int)DIV_ROUND_UP(length, PAGE_SIZE),
+				current->nr_dirtied_pause - current->nr_dirtied);
+		max_pages = max(max_pages, 8);
+		max_blocks = (u64)max_pages << (PAGE_SHIFT - inode->i_blkbits);
+		if (len > max_blocks)
+			len = max_blocks;
+	}
+
 	iomap->length = len << inode->i_blkbits;
 
 	height = ip->i_height;
@@ -974,6 +992,19 @@ static void gfs2_iomap_journaled_page_done(struct inode *inode, loff_t pos,
 	gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
 }
 
+static void gfs2_write_end(struct inode *inode, struct buffer_head *dibh)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_trans *tr = current->journal_info;
+
+	gfs2_ordered_add_inode(ip);
+
+	if (tr->tr_num_buf_new)
+		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	else
+		gfs2_trans_add_meta(ip->i_gl, dibh);
+}
+
 static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 				  loff_t length, unsigned flags,
 				  struct iomap *iomap,
@@ -1052,6 +1083,13 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 	if (!gfs2_is_stuffed(ip) && gfs2_is_jdata(ip))
 		iomap->page_done = gfs2_iomap_journaled_page_done;
+
+	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip) || inode == sdp->sd_rindex) {
+		iomap->flags |= IOMAP_F_UNBALANCED;
+	} else {
+		gfs2_write_end(inode, mp->mp_bh[0]);
+		gfs2_trans_end(sdp);
+	}
 	return 0;
 
 out_trans_end:
@@ -1103,28 +1141,24 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	struct gfs2_trans *tr = current->journal_info;
 	struct buffer_head *dibh = iomap->private;
 
 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != IOMAP_WRITE)
 		goto out;
 
-	if (iomap->type != IOMAP_INLINE) {
-		gfs2_ordered_add_inode(ip);
+	if (current->journal_info) {
+		struct gfs2_sbd *sdp = GFS2_SB(inode);
 
-		if (tr->tr_num_buf_new)
-			__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-		else
-			gfs2_trans_add_meta(ip->i_gl, dibh);
-	}
+		if (iomap->type != IOMAP_INLINE)
+			gfs2_write_end(inode, dibh);
 
-	if (inode == sdp->sd_rindex) {
-		adjust_fs_space(inode);
-		sdp->sd_rindex_uptodate = 0;
-	}
+		if (inode == sdp->sd_rindex) {
+			adjust_fs_space(inode);
+			sdp->sd_rindex_uptodate = 0;
+		}
 
-	gfs2_trans_end(sdp);
+		gfs2_trans_end(sdp);
+	}
 	gfs2_inplace_release(ip);
 
 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 58a768e59712e..542bd149c4aa3 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			iocb->ki_pos += ret;
 	}
 
+	balance_dirty_pages_ratelimited(file->f_mapping);
+
 out2:
 	current->backing_dev_info = NULL;
 out:
diff --git a/fs/iomap.c b/fs/iomap.c
index 97cb9d486a7da..0d8ea3f29b2ef 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -863,7 +863,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		if (!(iomap->flags & IOMAP_F_UNBALANCED))
+			balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bdaf..e9a04e76a3217 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -35,6 +35,7 @@ struct vm_fault;
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
+#define IOMAP_F_UNBALANCED	0x08	/* don't balance dirty pages */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
-- 
2.20.1



* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-03-21 13:13 gfs2 iomap deadlock, IOMAP_F_UNBALANCED Andreas Gruenbacher
@ 2019-03-21 21:43 ` Dave Chinner
  2019-03-21 23:01   ` Andreas Gruenbacher
  2019-03-22  0:21   ` Andreas Gruenbacher
  2019-03-28 16:51 ` Christoph Hellwig
  1 sibling, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2019-03-21 21:43 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, Ross Lagerwall, Mark Syms,
	Edwin Török, linux-fsdevel

On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
> 
> we need your help fixing a gfs2 deadlock involving iomap.  What's going
> on is the following:
> 
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end.  It currently always does that
>   even though there is no point other than for journaled data writes.
> 
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.
> 
> We can stop gfs2_iomap_begin from keeping the log flush lock held for
> non-journaled data writes, but that still leaves us with the deadlock
> for journaled data writes: for them, we need to add the data pages to
> the running transaction, so dropping the log flush lock wouldn't work.
> 
> I've tried to work around the unwanted balance_dirty_pages_ratelimited
> in gfs2_write_inode as originally suggested by Ross.  That works to a
> certain degree, but only if we always skip inodes in the WB_SYNC_NONE
> case, and that means that gfs2_write_inode becomes quite ineffective.
> 
> So it seems that it would be more reasonable to avoid the dirty page
> balancing under the log flush lock in the first place.
> 
> The attached patch changes gfs2_iomap_begin to only hold on to the log
> flush lock for journaled data writes.  In that case, we also make sure
> to limit the write size so that it doesn't overrun the dirty limits,
> using logic similar to that in balance_dirty_pages_ratelimited; there
> is precedent for that approach in btrfs_buffered_write.  Then, we
> prevent iomap from balancing dirty pages via the new IOMAP_F_UNBALANCED
> flag.

....
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 58a768e59712e..542bd149c4aa3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			iocb->ki_pos += ret;
>  	}
>  
> +	balance_dirty_pages_ratelimited(file->f_mapping);
> +
>  out2:
>  	current->backing_dev_info = NULL;
>  out:

The problem is calling balance_dirty_pages() inside the
->iomap_begin/->iomap_end calls and not that it is called by the
iomap infrastructure itself, right?

If so, I'd prefer to see this in iomap_apply() after the call to
ops->iomap_end because iomap_file_buffered_write() can iterate and
call iomap_apply() multiple times. This would keep the balancing to
a per-iomap granularity, rather than a per-syscall granularity.

i.e. if we do write(2GB), we want more than one balancing call
during that syscall, so it would be up to the filesystem to a) limit
the size of write mappings to something smaller (e.g. 1024 pages)
so that there are still frequent balancing calls for large writes.
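
Roughly, I'm thinking of something like this at the tail of
iomap_apply(), after the ->iomap_end call (untested sketch; the flag
check is just one way to restrict it to buffered writes):

	if (ops->iomap_end) {
		ret = ops->iomap_end(inode, pos, length,
				     written > 0 ? written : 0,
				     flags, &iomap);
	}

	/* sketch: throttle once per mapping, for buffered writes only */
	if ((flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT))
		balance_dirty_pages_ratelimited(inode->i_mapping);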

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-03-21 21:43 ` Dave Chinner
@ 2019-03-21 23:01   ` Andreas Gruenbacher
  2019-03-22  0:21   ` Andreas Gruenbacher
  1 sibling, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2019-03-21 23:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, cluster-devel, Ross Lagerwall, Mark Syms,
	Edwin Török, linux-fsdevel

On Thu, 21 Mar 2019 at 22:43, Dave Chinner <david@fromorbit.com> wrote:
> The problem is calling balance_dirty_pages() inside the
> ->iomap_begin/->iomap_end calls and not that it is called by the
> iomap infrastructure itself, right?
>
> If so, I'd prefer to see this in iomap_apply() after the call to
> ops->iomap_end because iomap_file_buffered_write() can iterate and
> call iomap_apply() multiple times. This would keep the balancing to
> a per-iomap granularity, rather than a per-syscall granularity.
>
> i.e. if we do write(2GB), we want more than one balancing call
> during that syscall, so it would be up to the filesystem to a) limit
> the size of write mappings to something smaller (e.g. 1024 pages)
> so that there are still frequent balancing calls for large writes.

Hmm. The looping across multiple mappings isn't done in iomap_apply
but in iomap_file_buffered_write, so the balancing could go into
iomap_apply or iomap_file_buffered_write, but can't go further up the
stack. Given that, iomap_file_buffered_write seems the better place,
but this is still quite horrible.
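
For illustration, in iomap_file_buffered_write that would amount to
something like the following (sketch only, loop shown from memory):

	while (iov_iter_count(iter)) {
		ret = iomap_apply(inode, pos, iov_iter_count(iter),
				IOMAP_WRITE, ops, iter, iomap_write_actor);
		if (ret <= 0)
			break;
		pos += ret;
		written += ret;

		/* sketch: throttle once per mapping returned by the fs */
		balance_dirty_pages_ratelimited(inode->i_mapping);
	}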

Thanks,
Andreas


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-03-21 21:43 ` Dave Chinner
  2019-03-21 23:01   ` Andreas Gruenbacher
@ 2019-03-22  0:21   ` Andreas Gruenbacher
  2019-03-27 16:49     ` Ross Lagerwall
  1 sibling, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2019-03-22  0:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Gruenbacher, Christoph Hellwig, cluster-devel,
	Ross Lagerwall, Mark Syms, Edwin Török, linux-fsdevel

On Fri, 22 Mar 2019 at 00:01, Andreas Gruenbacher <agruenba@redhat.com> wrote:
> On Thu, 21 Mar 2019 at 22:43, Dave Chinner <david@fromorbit.com> wrote:
> > The problem is calling balance_dirty_pages() inside the
> > ->iomap_begin/->iomap_end calls and not that it is called by the
> > iomap infrastructure itself, right?
> >
> > If so, I'd prefer to see this in iomap_apply() after the call to
> > ops->iomap_end because iomap_file_buffered_write() can iterate and
> > call iomap_apply() multiple times. This would keep the balancing to
> > a per-iomap granularity, rather than a per-syscall granularity.
> >
> > i.e. if we do write(2GB), we want more than one balancing call
> > during that syscall, so it would be up to the filesystem to a) limit
> > the size of write mappings to something smaller (e.g. 1024 pages)
> > so that there are still frequent balancing calls for large writes.
>
> Hmm. The looping across multiple mappings isn't done in iomap_apply
> but in iomap_file_buffered_write, so the balancing could go into
> iomap_apply or iomap_file_buffered_write, but can't go further up the
> stack. Given that, iomap_file_buffered_write seems the better place,
> but this is still quite horrible.

Here's a more reasonable version of my first patch, with a cleaned up
and hopefully fixed gfs2 part.

In addition, this checks for IOMAP_F_UNBALANCED in iomap_dirty_actor,
the actor for iomap_file_dirty.  We don't use iomap_file_dirty in gfs2,
but we should probably allow skipping the dirty page balancing there
as well.

Thanks,
Andreas
---
 fs/gfs2/bmap.c        | 64 +++++++++++++++++++++++++++++++++----------
 fs/iomap.c            |  6 ++--
 include/linux/iomap.h |  1 +
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 02b2646d84b3a..628d66d07fc6c 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -974,6 +974,19 @@ static void gfs2_iomap_journaled_page_done(struct inode *inode, loff_t pos,
 	gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
 }
 
+static void gfs2_write_end(struct inode *inode, struct buffer_head *dibh)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_trans *tr = current->journal_info;
+
+	gfs2_ordered_add_inode(ip);
+
+	if (tr->tr_num_buf_new)
+		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	else
+		gfs2_trans_add_meta(ip->i_gl, dibh);
+}
+
 static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 				  loff_t length, unsigned flags,
 				  struct iomap *iomap,
@@ -996,6 +1009,25 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	if (ret)
 		goto out_unlock;
 
+	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip) || inode == sdp->sd_rindex) {
+		int max_pages;
+		u64 max_length;
+
+		iomap->flags |= IOMAP_F_UNBALANCED;
+
+		/*
+		 * Limit the write size: this ensures that write throttling
+		 * will kick in fast enough even when we don't call
+		 * balance_dirty_pages_ratelimited for each page written.
+		 */
+		max_pages = current->nr_dirtied_pause - current->nr_dirtied;
+		if (max_pages < 8)
+			max_pages = 8;
+		max_length = (u64)max_pages << PAGE_SHIFT;
+		if (iomap->length > max_length)
+			iomap->length = max_length;
+	}
+
 	alloc_required = unstuff || iomap->type == IOMAP_HOLE;
 
 	if (alloc_required || gfs2_is_jdata(ip))
@@ -1052,6 +1084,11 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 	if (!gfs2_is_stuffed(ip) && gfs2_is_jdata(ip))
 		iomap->page_done = gfs2_iomap_journaled_page_done;
+
+	if (!(iomap->flags & IOMAP_F_UNBALANCED)) {
+		gfs2_write_end(inode, mp->mp_bh[0]);
+		gfs2_trans_end(sdp);
+	}
 	return 0;
 
 out_trans_end:
@@ -1103,30 +1140,29 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	struct gfs2_trans *tr = current->journal_info;
 	struct buffer_head *dibh = iomap->private;
 
 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != IOMAP_WRITE)
 		goto out;
 
-	if (iomap->type != IOMAP_INLINE) {
-		gfs2_ordered_add_inode(ip);
+	if (current->journal_info) {
+		struct gfs2_sbd *sdp = GFS2_SB(inode);
 
-		if (tr->tr_num_buf_new)
-			__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-		else
-			gfs2_trans_add_meta(ip->i_gl, dibh);
-	}
+		if (iomap->type != IOMAP_INLINE)
+			gfs2_write_end(inode, dibh);
 
-	if (inode == sdp->sd_rindex) {
-		adjust_fs_space(inode);
-		sdp->sd_rindex_uptodate = 0;
-	}
+		if (inode == sdp->sd_rindex) {
+			adjust_fs_space(inode);
+			sdp->sd_rindex_uptodate = 0;
+		}
 
-	gfs2_trans_end(sdp);
+		gfs2_trans_end(sdp);
+	}
 	gfs2_inplace_release(ip);
 
+	if (iomap->flags & IOMAP_F_UNBALANCED)
+		balance_dirty_pages_ratelimited(inode->i_mapping);
+
 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
 		/* Deallocate blocks that were just allocated. */
 		loff_t blockmask = i_blocksize(inode) - 1;
diff --git a/fs/iomap.c b/fs/iomap.c
index 97cb9d486a7da..5f950fee0834f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -863,7 +863,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		if (!(iomap->flags & IOMAP_F_UNBALANCED))
+			balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
@@ -945,7 +946,8 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += status;
 		length -= status;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		if (!(iomap->flags & IOMAP_F_UNBALANCED))
+			balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (length);
 
 	return written;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bdaf..e9a04e76a3217 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -35,6 +35,7 @@ struct vm_fault;
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
+#define IOMAP_F_UNBALANCED	0x08	/* don't balance dirty pages */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
-- 
2.20.1



* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-03-22  0:21   ` Andreas Gruenbacher
@ 2019-03-27 16:49     ` Ross Lagerwall
  0 siblings, 0 replies; 12+ messages in thread
From: Ross Lagerwall @ 2019-03-27 16:49 UTC (permalink / raw)
  To: Andreas Gruenbacher, Dave Chinner
  Cc: Christoph Hellwig, cluster-devel, Mark Syms,
	Edwin Török, linux-fsdevel

On 3/22/19 12:21 AM, Andreas Gruenbacher wrote:
> On Fri, 22 Mar 2019 at 00:01, Andreas Gruenbacher <agruenba@redhat.com> wrote:
>> On Thu, 21 Mar 2019 at 22:43, Dave Chinner <david@fromorbit.com> wrote:
>>> The problem is calling balance_dirty_pages() inside the
>>> ->iomap_begin/->iomap_end calls and not that it is called by the
>>> iomap infrastructure itself, right?
>>>
>>> If so, I'd prefer to see this in iomap_apply() after the call to
>>> ops->iomap_end because iomap_file_buffered_write() can iterate and
>>> call iomap_apply() multiple times. This would keep the balancing to
>>> a per-iomap granularity, rather than a per-syscall granularity.
>>>
>>> i.e. if we do write(2GB), we want more than one balancing call
>>> during that syscall, so it would be up to the filesystem to a) limit
>>> the size of write mappings to something smaller (e.g. 1024 pages)
>>> so that there are still frequent balancing calls for large writes.
>>
>> Hmm. The looping across multiple mappings isn't done in iomap_apply
>> but in iomap_file_buffered_write, so the balancing could go into
>> iomap_apply or iomap_file_buffered_write, but can't go further up the
>> stack. Given that, iomap_file_buffered_write seems the better place,
>> but this is still quite horrible.
> 
> Here's a more reasonable version of my first patch, with a cleaned up
> and hopefully fixed gfs2 part.
> 
> In addition, this checks for IOMAP_F_UNBALANCED in iomap_dirty_actor,
> the actor for iomap_file_dirty.  We don't use iomap_file_dirty in gfs2,
> but we should probably allow skipping the dirty page balancing there
> as well.
> 
> Thanks,
> Andreas
> ---
>   fs/gfs2/bmap.c        | 64 +++++++++++++++++++++++++++++++++----------
>   fs/iomap.c            |  6 ++--
>   include/linux/iomap.h |  1 +
>   3 files changed, 55 insertions(+), 16 deletions(-)
> 
Thanks, this fixes the reported deadlock. I haven't yet checked whether 
it has any performance impact.

Tested-by: Ross Lagerwall <ross.lagerwall@citrix.com>


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-03-21 13:13 gfs2 iomap deadlock, IOMAP_F_UNBALANCED Andreas Gruenbacher
  2019-03-21 21:43 ` Dave Chinner
@ 2019-03-28 16:51 ` Christoph Hellwig
  2019-03-29 22:13   ` Andreas Gruenbacher
  1 sibling, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2019-03-28 16:51 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, Dave Chinner, Ross Lagerwall,
	Mark Syms, Edwin Török, linux-fsdevel

On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
> 
> we need your help fixing a gfs2 deadlock involving iomap.  What's going
> on is the following:
> 
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end.  It currently always does that
>   even though there is no point other than for journaled data writes.
> 
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.

What is the exact call chain?  balance_dirty_pages_ratelimited these
days doesn't start I/O, but just wakes up the flusher threads.  Or
do we have an issue where it is blocking on those threads?

Also why do you need to flush the log for background writeback in
->write_inode?

balance_dirty_pages_ratelimited is by definition not a data integrity
writeback, so there shouldn't be a good reason to flush the log
(which I assume the log flush lock is for).  If we look at gfs2_write_inode,
this seems to be the code:

	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));

        if (flush_all)
		gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
			       GFS2_LOG_HEAD_FLUSH_NORMAL |
			       GFS2_LFC_WRITE_INODE);

But what is the requirement to do this in writeback context?  Can't
we move it out into another context instead?


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-03-28 16:51 ` Christoph Hellwig
@ 2019-03-29 22:13   ` Andreas Gruenbacher
  2019-04-07  7:32     ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2019-03-29 22:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: cluster-devel, Dave Chinner, Ross Lagerwall, Mark Syms,
	Edwin Török, linux-fsdevel

On Thu, 28 Mar 2019 at 17:51, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> > Hi Christoph,
> >
> > we need your help fixing a gfs2 deadlock involving iomap.  What's going
> > on is the following:
> >
> > * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
> >   lock and keeps it until gfs2_iomap_end.  It currently always does that
> >   even though there is no point other than for journaled data writes.
> >
> > * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
> >   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
> >   log flush lock again and deadlock.
>
> What is the exact call chain?

It's laid out here:
https://www.redhat.com/archives/cluster-devel/2019-March/msg00000.html

> balance_dirty_pages_ratelimited these days doesn't start I/O, but just
> wakes up the flusher threads.  Or do we have an issue where it is blocking
> on those threads?

Yes, the writer is holding sd_log_flush_lock at the point where it
ends up kicking the flusher thread and waiting for writeback to
happen. The flusher thread calls gfs2_write_inode, and that tries to
grab sd_log_flush_lock again.
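
Roughly (a simplified sketch from memory, with the intermediate
writeback calls left out; the lock-taking sites are how I read the
current code):

	writer task:
	  iomap_file_buffered_write()
	    gfs2_iomap_begin()
	      gfs2_trans_begin()
	        down_read(&sdp->sd_log_flush_lock)
	    iomap_write_actor()
	      balance_dirty_pages_ratelimited()
	        balance_dirty_pages()
	          kicks the flusher, waits for writeback progress

	flusher thread:
	  writeback_sb_inodes()
	    gfs2_write_inode()
	      gfs2_log_flush()
	        down_write(&sdp->sd_log_flush_lock)   <-- blocks

The writer only releases the lock in gfs2_iomap_end, which it cannot
reach until balance_dirty_pages lets it continue, so neither side can
make progress.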

> Also why do you need to flush the log for background writeback in
> ->write_inode?

If we stop doing that in the (wbc->sync_mode == WB_SYNC_NONE) case,
then inodes will remain dirty until the journal is flushed for some
other reason (or a write_inode with WB_SYNC_ALL). That doesn't seem
right. We could perhaps trigger a background journal flush in the
WB_SYNC_NONE case, but that would remove the back pressure on
balance_dirty_pages. Not sure this is a good idea, either.
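
Concretely, "stop doing that" would amount to something like

	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL);

in gfs2_write_inode, i.e. dropping the gfs2_is_jdata(ip) part of the
condition you quoted above (just a sketch of that option, not a
proposal).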

> balance_dirty_pages_ratelimited is by definition not a data integrity
> writeback, so there shouldn't be a good reason to flush the log
> (which I assume the log flush lock is for).
>
> If we look at gfs2_write_inode, this seems to be the code:
>
>         bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
>
>         if (flush_all)
>                 gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
>                                GFS2_LOG_HEAD_FLUSH_NORMAL |
>                                GFS2_LFC_WRITE_INODE);
>
> But what is the requirement to do this in writeback context?  Can't
> we move it out into another context instead?

Indeed, this isn't for data integrity in this case but because the
dirty limit is exceeded. What other context would you suggest moving
this to?

(The iomap flag I've proposed would save us from getting into this
situation in the first place.)

Thanks,
Andreas


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-03-29 22:13   ` Andreas Gruenbacher
@ 2019-04-07  7:32     ` Christoph Hellwig
  2019-04-08  8:53       ` Andreas Gruenbacher
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2019-04-07  7:32 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, Dave Chinner, Ross Lagerwall,
	Mark Syms, Edwin Török, linux-fsdevel, Jan Kara,
	linux-mm

[adding Jan and linux-mm]

On Fri, Mar 29, 2019 at 11:13:00PM +0100, Andreas Gruenbacher wrote:
> > But what is the requirement to do this in writeback context?  Can't
> > we move it out into another context instead?
> 
> Indeed, this isn't for data integrity in this case but because the
> dirty limit is exceeded. What other context would you suggest moving
> this to?
> 
> (The iomap flag I've proposed would save us from getting into this
> situation in the first place.)

Your patch does two things:

 - it only calls balance_dirty_pages_ratelimited once per write
   operation instead of once per page.  In the past btrfs did
   hacks like that, but IIRC they caused VM balancing issues.
   That is why everyone now calls balance_dirty_pages_ratelimited
   once per page.  If calling it at a coarse granularity would
   be fine, we should do it everywhere instead of just in gfs2
   in journaled mode
 - it artificially reduces the size of writes to a low value,
   which I suspect is going to break real-life applications

So I really think we need to fix this properly.  And if that means
that you can't make use of the iomap batching for gfs2 in journaled
mode, that is still a better option.  But I really think you need
to look into the scope of your log flush and figure out a good way
to reduce that and solve the root cause.


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-04-07  7:32     ` Christoph Hellwig
@ 2019-04-08  8:53       ` Andreas Gruenbacher
  2019-04-08 13:44         ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2019-04-08  8:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: cluster-devel, Dave Chinner, Ross Lagerwall, Mark Syms,
	Edwin Török, linux-fsdevel, Jan Kara, linux-mm

On Sun, 7 Apr 2019 at 09:32, Christoph Hellwig <hch@lst.de> wrote:
>
> [adding Jan and linux-mm]
>
> On Fri, Mar 29, 2019 at 11:13:00PM +0100, Andreas Gruenbacher wrote:
> > > But what is the requirement to do this in writeback context?  Can't
> > > we move it out into another context instead?
> >
> > Indeed, this isn't for data integrity in this case but because the
> > dirty limit is exceeded. What other context would you suggest moving
> > this to?
> >
> > (The iomap flag I've proposed would save us from getting into this
> > situation in the first place.)
>
> Your patch does two things:
>
>  - it only calls balance_dirty_pages_ratelimited once per write
>    operation instead of once per page.  In the past btrfs did
>    hacks like that, but IIRC they caused VM balancing issues.
>    That is why everyone now calls balance_dirty_pages_ratelimited
>    once per page.  If calling it at a coarse granularity would
>    be fine, we should do it everywhere instead of just in gfs2
>    in journaled mode
>  - it artificially reduces the size of writes to a low value,
>    which I suspect is going to break real-life applications

Not quite, balance_dirty_pages_ratelimited is called from iomap_end,
so once per iomap mapping returned, not per write. (The first version
of this patch got that wrong by accident, but not the second.)

We can limit the size of the mappings returned just in that case. I'm
aware that there is a risk of balancing problems, but I just don't
have any better ideas.

This is a problem all filesystems with data-journaling will have with
iomap; it's not that gfs2 is doing anything particularly stupid.

> So I really think we need to fix this properly.  And if that means
> that you can't make use of the iomap batching for gfs2 in journaled
> mode, that is still a better option.

That would mean using the old-style, page-size allocations, and a
completely separate write path in that case. That would be quite a
nightmare.

> But I really think you need
> to look into the scope of your log flush and figure out a good way
> to reduce that and solve the root cause.

We won't be able to do a log flush while another transaction is
active, but that's what's needed to clean dirty pages. iomap doesn't
allow us to put the block allocation into a separate transaction from
the page writes; for that, the opposite to the page_done hook would
probably be needed.
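
Something along these lines in struct iomap, mirroring the existing
page_done member (signatures from memory; only a sketch of the idea,
the name and exact signature are placeholders rather than a worked-out
interface):

	/* existing: called after data has been copied into a page */
	void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
			struct page *page, struct iomap *iomap);

	/*
	 * Sketched counterpart: called before the copy, so that the
	 * filesystem could open a transaction per page instead of
	 * holding one across the whole mapping.
	 */
	int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len,
			struct iomap *iomap);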

Thanks,
Andreas


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-04-08  8:53       ` Andreas Gruenbacher
@ 2019-04-08 13:44         ` Jan Kara
  2019-04-09 12:15           ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2019-04-08 13:44 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, Dave Chinner, Ross Lagerwall,
	Mark Syms, Edwin Török, linux-fsdevel, Jan Kara,
	linux-mm

On Mon 08-04-19 10:53:34, Andreas Gruenbacher wrote:
> On Sun, 7 Apr 2019 at 09:32, Christoph Hellwig <hch@lst.de> wrote:
> >
> > [adding Jan and linux-mm]
> >
> > On Fri, Mar 29, 2019 at 11:13:00PM +0100, Andreas Gruenbacher wrote:
> > > > But what is the requirement to do this in writeback context?  Can't
> > > > we move it out into another context instead?
> > >
> > > Indeed, this isn't for data integrity in this case but because the
> > > dirty limit is exceeded. What other context would you suggest moving
> > > this to?
> > >
> > > (The iomap flag I've proposed would save us from getting into this
> > > situation in the first place.)
> >
> > Your patch does two things:
> >
> >  - it only calls balance_dirty_pages_ratelimited once per write
> >    operation instead of once per page.  In the past btrfs did
> >    hacks like that, but IIRC they caused VM balancing issues.
> >    That is why everyone now calls balance_dirty_pages_ratelimited
> >    once per page.  If calling it at a coarse granularity would
> >    be fine, we should do it everywhere instead of just in gfs2
> >    in journaled mode
> >  - it artificially reduces the size of writes to a low value,
> >    which I suspect is going to break real-life applications
> 
> Not quite, balance_dirty_pages_ratelimited is called from iomap_end,
> so once per iomap mapping returned, not per write. (The first version
> of this patch got that wrong by accident, but not the second.)
> 
> We can limit the size of the mappings returned just in that case. I'm
> aware that there is a risk of balancing problems, but I just don't
> have any better ideas.
> 
> This is a problem all filesystems with data-journaling will have with
> iomap; it's not that gfs2 is doing anything particularly stupid.

I agree that if ext4 would be using iomap, it would have similar issues.

> > So I really think we need to fix this properly.  And if that means
> > that you can't make use of the iomap batching for gfs2 in journaled
> > mode, that is still a better option.
> 
> That would mean using the old-style, page-size allocations, and a
> completely separate write path in that case. That would be quite a
> nightmare.
> 
> > But I really think you need
> > to look into the scope of your log flush and figure out a good way
> > to reduce that and solve the root cause.
> 
> We won't be able to do a log flush while another transaction is
> active, but that's what's needed to clean dirty pages. iomap doesn't
> allow us to put the block allocation into a separate transaction from
> the page writes; for that, the opposite to the page_done hook would
> probably be needed.

I agree that a ->page_prepare() hook would probably be the cleanest
solution for this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-04-08 13:44         ` Jan Kara
@ 2019-04-09 12:15           ` Christoph Hellwig
  2019-04-09 12:27             ` Andreas Gruenbacher
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2019-04-09 12:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Gruenbacher, Christoph Hellwig, cluster-devel,
	Dave Chinner, Ross Lagerwall, Mark Syms, Edwin Török,
	linux-fsdevel, linux-mm

On Mon, Apr 08, 2019 at 03:44:05PM +0200, Jan Kara wrote:
> > We won't be able to do a log flush while another transaction is
> > active, but that's what's needed to clean dirty pages. iomap doesn't
> > allow us to put the block allocation into a separate transaction from
> > the page writes; for that, the opposite to the page_done hook would
> > probably be needed.
> 
> I agree that a ->page_prepare() hook would probably be the cleanest
> solution for this.

That doesn't sound too bad to me.


* Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
  2019-04-09 12:15           ` Christoph Hellwig
@ 2019-04-09 12:27             ` Andreas Gruenbacher
  0 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2019-04-09 12:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, cluster-devel, Dave Chinner, Ross Lagerwall, Mark Syms,
	Edwin Török, linux-fsdevel, linux-mm

On Tue, 9 Apr 2019 at 14:15, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Apr 08, 2019 at 03:44:05PM +0200, Jan Kara wrote:
> > > We won't be able to do a log flush while another transaction is
> > > active, but that's what's needed to clean dirty pages. iomap doesn't
> > > allow us to put the block allocation into a separate transaction from
> > > the page writes; for that, the opposite to the page_done hook would
> > > probably be needed.
> >
> > I agree that a ->page_prepare() hook would probably be the cleanest
> > solution for this.
>
> That doesn't sound too bad to me.

Okay, I'll see how the code for that will turn out.

Thanks,
Andreas


Thread overview: 12+ messages
2019-03-21 13:13 gfs2 iomap deadlock, IOMAP_F_UNBALANCED Andreas Gruenbacher
2019-03-21 21:43 ` Dave Chinner
2019-03-21 23:01   ` Andreas Gruenbacher
2019-03-22  0:21   ` Andreas Gruenbacher
2019-03-27 16:49     ` Ross Lagerwall
2019-03-28 16:51 ` Christoph Hellwig
2019-03-29 22:13   ` Andreas Gruenbacher
2019-04-07  7:32     ` Christoph Hellwig
2019-04-08  8:53       ` Andreas Gruenbacher
2019-04-08 13:44         ` Jan Kara
2019-04-09 12:15           ` Christoph Hellwig
2019-04-09 12:27             ` Andreas Gruenbacher
