From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 22 Mar 2019 08:43:45 +1100
From: Dave Chinner
To: Andreas Gruenbacher
Cc: Christoph Hellwig, cluster-devel@redhat.com, Ross Lagerwall,
	Mark Syms, Edwin Török, linux-fsdevel@vger.kernel.org
Subject: Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
Message-ID: <20190321214345.GE26298@dastard>
References: <20190321131304.21618-1-agruenba@redhat.com>
In-Reply-To: <20190321131304.21618-1-agruenba@redhat.com>
List-ID: linux-fsdevel.vger.kernel.org
On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
>
> we need your help fixing a gfs2 deadlock involving iomap. What's going
> on is the following:
>
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end. It currently always does that
>   even though there is no point other than for journaled data writes.
>
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.
>
> We can stop gfs2_iomap_begin from keeping the log flush lock held for
> non-journaled data writes, but that still leaves us with the deadlock
> for journaled data writes: for them, we need to add the data pages to
> the running transaction, so dropping the log flush lock wouldn't work.
>
> I've tried to work around the unwanted balance_dirty_pages_ratelimited
> in gfs2_write_inode as originally suggested by Ross. That works to a
> certain degree, but only if we always skip inodes in the WB_SYNC_NONE
> case, and that means that gfs2_write_inode becomes quite ineffective.
>
> So it seems that it would be more reasonable to avoid the dirty page
> balancing under the log flush lock in the first place.
>
> The attached patch changes gfs2_iomap_begin to only hold on to the log
> flush lock for journaled data writes. In that case, we also make sure
> to limit the write size to not overrun dirty limits using a similar
> logic as in balance_dirty_pages_ratelimited; there is precedent for that
> approach in btrfs_buffered_write. Then, we prevent iomap from balancing
> dirty pages via the new IOMAP_F_UNBALANCED flag.

....
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 58a768e59712e..542bd149c4aa3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iocb->ki_pos += ret;
>  	}
>  
> +	balance_dirty_pages_ratelimited(file->f_mapping);
> +
>  out2:
>  	current->backing_dev_info = NULL;
>  out:

The problem is calling balance_dirty_pages() inside the
->iomap_begin/->iomap_end calls and not that it is called by the
iomap infrastructure itself, right?

If so, I'd prefer to see this in iomap_apply() after the call to
ops->iomap_end because iomap_file_buffered_write() can iterate and
call iomap_apply() multiple times. This would keep the balancing to
a per-iomap granularity, rather than a per-syscall granularity. i.e.
if we do write(2GB), we want more than one balancing call during
that syscall, so it would be up to the filesystem to limit the size
of write mappings to something smaller (e.g. 1024 pages) so that
there are still frequent balancing calls for large writes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com