From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 22 Mar 2019 08:43:45 +1100
From: Dave Chinner
To: Andreas Gruenbacher
Cc: Christoph Hellwig, cluster-devel@redhat.com, Ross Lagerwall,
	Mark Syms, Edwin Török, linux-fsdevel@vger.kernel.org
Subject: Re: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
Message-ID: <20190321214345.GE26298@dastard>
References: <20190321131304.21618-1-agruenba@redhat.com>
In-Reply-To: <20190321131304.21618-1-agruenba@redhat.com>
List-ID: linux-fsdevel.vger.kernel.org
On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
>
> we need your help fixing a gfs2 deadlock involving iomap. What's going
> on is the following:
>
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end. It currently always does that
>   even though there is no point other than for journaled data writes.
>
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.
>
> We can stop gfs2_iomap_begin from keeping the log flush lock held for
> non-journaled data writes, but that still leaves us with the deadlock
> for journaled data writes: for them, we need to add the data pages to
> the running transaction, so dropping the log flush lock wouldn't work.
>
> I've tried to work around the unwanted balance_dirty_pages_ratelimited
> in gfs2_write_inode as originally suggested by Ross. That works to a
> certain degree, but only if we always skip inodes in the WB_SYNC_NONE
> case, and that means that gfs2_write_inode becomes quite ineffective.
>
> So it seems that it would be more reasonable to avoid the dirty page
> balancing under the log flush lock in the first place.
>
> The attached patch changes gfs2_iomap_begin to only hold on to the log
> flush lock for journaled data writes. In that case, we also make sure
> to limit the write size to not overrun dirty limits using a similar
> logic as in balance_dirty_pages_ratelimited; there is precedent for that
> approach in btrfs_buffered_write. Then, we prevent iomap from balancing
> dirty pages via the new IOMAP_F_UNBALANCED flag.

....
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 58a768e59712e..542bd149c4aa3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iocb->ki_pos += ret;
>  	}
>  
> +	balance_dirty_pages_ratelimited(file->f_mapping);
> +
>  out2:
>  	current->backing_dev_info = NULL;
>  out:

The problem is calling balance_dirty_pages() inside the
->iomap_begin/->iomap_end calls and not that it is called by the
iomap infrastructure itself, right?

If so, I'd prefer to see this in iomap_apply() after the call to
ops->iomap_end because iomap_file_buffered_write() can iterate and
call iomap_apply() multiple times. This would keep the balancing to
a per-iomap granularity, rather than a per-syscall granularity. i.e.
if we do write(2GB), we want more than one balancing call during
that syscall, so it would be up to the filesystem to limit the size
of write mappings to something smaller (e.g. 1024 pages) so that
there are still frequent balancing calls for large writes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com