From: Andreas Gruenbacher
To: Christoph Hellwig
Cc: Andreas Gruenbacher, cluster-devel@redhat.com, Dave Chinner,
    Ross Lagerwall, Mark Syms, Edwin Török, linux-fsdevel@vger.kernel.org
Subject: gfs2 iomap deadlock, IOMAP_F_UNBALANCED
Date: Thu, 21 Mar 2019 14:13:04 +0100
Message-Id: <20190321131304.21618-1-agruenba@redhat.com>

Hi Christoph,

we need your help fixing a gfs2 deadlock involving iomap.  What's going on
is the following:

* During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
  lock and keeps it until gfs2_iomap_end.  It currently always does that
  even though holding the lock is only needed for journaled data writes.

* iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
  If that ends up calling gfs2_write_inode, gfs2 will try to grab the log
  flush lock again and deadlock (see the sketch below).
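To make the lock recursion easier to see, here is a minimal userspace
sketch of that call chain (not the real kernel code): the function names
mirror the paths described above, a pthread rwlock stands in for the log
flush lock, and the shared-then-exclusive lock modes are only an
assumption for illustration.

/*
 * Not kernel code: a stripped-down userspace model of the deadlock.
 * A pthread rwlock stands in for the gfs2 log flush lock; the point is
 * that the same task tries to take the lock again while holding it.
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t log_flush_lock = PTHREAD_RWLOCK_INITIALIZER;

/* gfs2_iomap_begin: takes the log flush lock and keeps it held */
static void model_iomap_begin(void)
{
	pthread_rwlock_rdlock(&log_flush_lock);
}

/* gfs2_write_inode: flushing the log needs the same lock exclusively */
static void model_write_inode(void)
{
	pthread_rwlock_wrlock(&log_flush_lock);	/* blocks: this task already holds the lock */
	pthread_rwlock_unlock(&log_flush_lock);
}

/* balance_dirty_pages_ratelimited: may end up writing back our own inode */
static void model_balance_dirty_pages(void)
{
	model_write_inode();
}

/* gfs2_iomap_end: would release the lock, but we never get here */
static void model_iomap_end(void)
{
	pthread_rwlock_unlock(&log_flush_lock);
}

int main(void)
{
	/* iomap_file_buffered_write, reduced to the steps that matter */
	model_iomap_begin();
	model_balance_dirty_pages();
	model_iomap_end();
	printf("finished without deadlocking\n");	/* not reached */
	return 0;
}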
We can stop gfs2_iomap_begin from keeping the log flush lock held for
non-journaled data writes, but that still leaves us with the deadlock for
journaled data writes: for those, we need to add the data pages to the
running transaction, so dropping the log flush lock wouldn't work.

I've tried to work around the unwanted balance_dirty_pages_ratelimited
call by changing gfs2_write_inode, as originally suggested by Ross.  That
works to a certain degree, but only if we always skip inodes in the
WB_SYNC_NONE case, and that makes gfs2_write_inode quite ineffective.  So
it seems more reasonable to avoid the dirty page balancing under the log
flush lock in the first place.

The attached patch changes gfs2_iomap_begin to only hold on to the log
flush lock for journaled data writes.  In that case, we also make sure to
limit the write size so that we don't overrun the dirty limits, using
logic similar to balance_dirty_pages_ratelimited; there is precedent for
that approach in btrfs_buffered_write.  Then, we prevent iomap from
balancing dirty pages via the new IOMAP_F_UNBALANCED flag.
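To put made-up numbers on that limit (assuming 4k pages and 4k filesystem
blocks): if a task's remaining dirty budget, current->nr_dirtied_pause -
current->nr_dirtied, is 20 pages, the first mapping for a 1 MiB jdata
write gets capped at 20 blocks (80k) and iomap_file_buffered_write loops
back into gfs2_iomap_begin for the rest; once the budget is exhausted,
the cap never drops below 8 pages (32k), so each iteration still makes
forward progress.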
Does that seem like an acceptable approach?

Thanks,
Andreas

---
 fs/gfs2/bmap.c        | 62 +++++++++++++++++++++++++++++++++----------
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            |  3 ++-
 include/linux/iomap.h |  1 +
 4 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 02b2646d84b3a..dad7d3810533e 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -862,6 +862,24 @@ static int gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 	iomap->offset = lblock << inode->i_blkbits;
 	lblock_stop = (pos + length - 1) >> inode->i_blkbits;
 	len = lblock_stop - lblock + 1;
+
+	if ((flags & IOMAP_WRITE) && gfs2_is_jdata(ip)) {
+		int max_pages;
+		u64 max_blocks;
+
+		/*
+		 * Limit the write size: this ensures that write throttling
+		 * will kick in fast enough even when we don't call
+		 * balance_dirty_pages_ratelimited for each page written.
+		 */
+		max_pages = min((int)DIV_ROUND_UP(length, PAGE_SIZE),
+				current->nr_dirtied_pause - current->nr_dirtied);
+		max_pages = max(max_pages, 8);
+		max_blocks = (u64)max_pages << (PAGE_SHIFT - inode->i_blkbits);
+		if (len > max_blocks)
+			len = max_blocks;
+	}
+
 	iomap->length = len << inode->i_blkbits;
 
 	height = ip->i_height;
@@ -974,6 +992,19 @@ static void gfs2_iomap_journaled_page_done(struct inode *inode, loff_t pos,
 	gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
 }
 
+static void gfs2_write_end(struct inode *inode, struct buffer_head *dibh)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_trans *tr = current->journal_info;
+
+	gfs2_ordered_add_inode(ip);
+
+	if (tr->tr_num_buf_new)
+		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	else
+		gfs2_trans_add_meta(ip->i_gl, dibh);
+}
+
 static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 				  loff_t length, unsigned flags,
 				  struct iomap *iomap,
@@ -1052,6 +1083,13 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 	if (!gfs2_is_stuffed(ip) && gfs2_is_jdata(ip))
 		iomap->page_done = gfs2_iomap_journaled_page_done;
+
+	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip) || inode == sdp->sd_rindex) {
+		iomap->flags |= IOMAP_F_UNBALANCED;
+	} else {
+		gfs2_write_end(inode, mp->mp_bh[0]);
+		gfs2_trans_end(sdp);
+	}
 	return 0;
 
 out_trans_end:
@@ -1103,28 +1141,24 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	struct gfs2_trans *tr = current->journal_info;
 	struct buffer_head *dibh = iomap->private;
 
 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != IOMAP_WRITE)
 		goto out;
 
-	if (iomap->type != IOMAP_INLINE) {
-		gfs2_ordered_add_inode(ip);
+	if (current->journal_info) {
+		struct gfs2_sbd *sdp = GFS2_SB(inode);
 
-		if (tr->tr_num_buf_new)
-			__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-		else
-			gfs2_trans_add_meta(ip->i_gl, dibh);
-	}
+		if (iomap->type != IOMAP_INLINE)
+			gfs2_write_end(inode, dibh);
 
-	if (inode == sdp->sd_rindex) {
-		adjust_fs_space(inode);
-		sdp->sd_rindex_uptodate = 0;
-	}
+		if (inode == sdp->sd_rindex) {
+			adjust_fs_space(inode);
+			sdp->sd_rindex_uptodate = 0;
+		}
 
-	gfs2_trans_end(sdp);
+		gfs2_trans_end(sdp);
+	}
 	gfs2_inplace_release(ip);
 
 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 58a768e59712e..542bd149c4aa3 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		iocb->ki_pos += ret;
 	}
 
+	balance_dirty_pages_ratelimited(file->f_mapping);
+
 out2:
 	current->backing_dev_info = NULL;
 out:
diff --git a/fs/iomap.c b/fs/iomap.c
index 97cb9d486a7da..0d8ea3f29b2ef 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -863,7 +863,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		if (!(iomap->flags & IOMAP_F_UNBALANCED))
+			balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bdaf..e9a04e76a3217 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -35,6 +35,7 @@ struct vm_fault;
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
+#define IOMAP_F_UNBALANCED	0x08	/* don't balance dirty pages */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
-- 
2.20.1