From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f199.google.com (mail-qk0-f199.google.com [209.85.220.199]) by kanga.kvack.org (Postfix) with ESMTP id 516B06B0388 for ; Sun, 5 Mar 2017 08:35:39 -0500 (EST) Received: by mail-qk0-f199.google.com with SMTP id n127so206766575qkf.3 for ; Sun, 05 Mar 2017 05:35:39 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q68si4667789qkl.182.2017.03.05.05.35.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 05 Mar 2017 05:35:37 -0800 (PST) From: Jeff Layton Subject: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Date: Sun, 5 Mar 2017 08:35:32 -0500 Message-Id: <20170305133535.6516-1-jlayton@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org I recently did some work to wire up -ENOSPC handling in ceph, and found I could get back -EIO errors in some cases when I should have instead gotten -ENOSPC. The problem was that the ceph writeback code would set PG_error on a writeback error, and that error would clobber the mapping error. While I fixed that problem by simply not setting that bit on errors, that led me down a rabbit hole of looking at how PG_error is being handled in the kernel. This patch series is a few fixes for things that I 100% noticed by inspection. I don't have a great way to test these since they involve error handling. I can certainly doctor up a kernel to inject errors in this code and test by hand however if these look plausible up front. Jeff Layton (3): nilfs2: set the mapping error when calling SetPageError on writeback mm: don't TestClearPageError in __filemap_fdatawait_range mm: set mapping error when launder_pages fails fs/nilfs2/segment.c | 1 + mm/filemap.c | 19 ++++--------------- mm/truncate.c | 6 +++++- 3 files changed, 10 insertions(+), 16 deletions(-) -- 2.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f198.google.com (mail-qk0-f198.google.com [209.85.220.198]) by kanga.kvack.org (Postfix) with ESMTP id EF4B86B038A for ; Sun, 5 Mar 2017 08:35:39 -0500 (EST) Received: by mail-qk0-f198.google.com with SMTP id n127so206767274qkf.3 for ; Sun, 05 Mar 2017 05:35:39 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id i58si13464498qti.175.2017.03.05.05.35.39 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 05 Mar 2017 05:35:39 -0800 (PST) From: Jeff Layton Subject: [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback Date: Sun, 5 Mar 2017 08:35:33 -0500 Message-Id: <20170305133535.6516-2-jlayton@redhat.com> In-Reply-To: <20170305133535.6516-1-jlayton@redhat.com> References: <20170305133535.6516-1-jlayton@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org In a later patch, we're going to want to make the fsync codepath not do a TestClearPageError call as that can override the error set in the address space. To do that though, we need to ensure that filesystems that are relying on the PG_error bit for reporting writeback errors also set an error in the address space. The only place I've found that looks potentially problematic is this spot in nilfs2. Ensure that it sets an error in the mapping in addition to setting PageError. Signed-off-by: Jeff Layton --- fs/nilfs2/segment.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index bedcae2c28e6..c1041b07060e 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1743,6 +1743,7 @@ static void nilfs_end_page_io(struct page *page, int err) } else { __set_page_dirty_nobuffers(page); SetPageError(page); + mapping_set_error(page_mapping(page), err); } end_page_writeback(page); -- 2.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f197.google.com (mail-qk0-f197.google.com [209.85.220.197]) by kanga.kvack.org (Postfix) with ESMTP id 311E96B038B for ; Sun, 5 Mar 2017 08:35:40 -0500 (EST) Received: by mail-qk0-f197.google.com with SMTP id n127so206767363qkf.3 for ; Sun, 05 Mar 2017 05:35:40 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id l76si13476268qkh.86.2017.03.05.05.35.39 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 05 Mar 2017 05:35:39 -0800 (PST) From: Jeff Layton Subject: [PATCH 3/3] mm: set mapping error when launder_pages fails Date: Sun, 5 Mar 2017 08:35:35 -0500 Message-Id: <20170305133535.6516-4-jlayton@redhat.com> In-Reply-To: <20170305133535.6516-1-jlayton@redhat.com> References: <20170305133535.6516-1-jlayton@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org If launder_page fails, then we hit a problem writing back some inode data. Ensure that we communicate that fact in a subsequent fsync since another task could still have it open for write. Signed-off-by: Jeff Layton --- mm/truncate.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/truncate.c b/mm/truncate.c index dd7b24e083c5..49ad4e2a6cb6 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -593,11 +593,15 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) static int do_launder_page(struct address_space *mapping, struct page *page) { + int ret; + if (!PageDirty(page)) return 0; if (page->mapping != mapping || mapping->a_ops->launder_page == NULL) return 0; - return mapping->a_ops->launder_page(page); + ret = mapping->a_ops->launder_page(page); + mapping_set_error(mapping, ret); + return ret; } /** -- 2.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f197.google.com (mail-qk0-f197.google.com [209.85.220.197]) by kanga.kvack.org (Postfix) with ESMTP id B55156B038C for ; Sun, 5 Mar 2017 08:35:40 -0500 (EST) Received: by mail-qk0-f197.google.com with SMTP id a189so206637607qkc.4 for ; Sun, 05 Mar 2017 05:35:40 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id a16si13482801qkg.74.2017.03.05.05.35.40 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 05 Mar 2017 05:35:40 -0800 (PST) From: Jeff Layton Subject: [PATCH 2/3] mm: don't TestClearPageError in __filemap_fdatawait_range Date: Sun, 5 Mar 2017 08:35:34 -0500 Message-Id: <20170305133535.6516-3-jlayton@redhat.com> In-Reply-To: <20170305133535.6516-1-jlayton@redhat.com> References: <20170305133535.6516-1-jlayton@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org The -EIO returned here can end up overriding whatever error is marked in the address space. This means that an -ENOSPC error (AS_ENOSPC) can end up being turned into -EIO if a page gets PG_error set on it during error handling. Arguably, that's a bug in the writeback code, but... Read errors are also tracked on a per page level using PG_error. Suppose we have a read error on a page, and then that page is subsequently dirtied by overwriting the whole page. Writeback doesn't clear PG_error, so we can then end up successfully writing back that page and still return -EIO on fsync. Since the handling of this bit is somewhat inconsistent across subsystems, let's just rely on marking the address space when there are writeback errors. Signed-off-by: Jeff Layton --- mm/filemap.c | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 3f9afded581b..2b0b4ff4668b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -375,17 +375,16 @@ int filemap_flush(struct address_space *mapping) } EXPORT_SYMBOL(filemap_flush); -static int __filemap_fdatawait_range(struct address_space *mapping, +static void __filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { pgoff_t index = start_byte >> PAGE_SHIFT; pgoff_t end = end_byte >> PAGE_SHIFT; struct pagevec pvec; int nr_pages; - int ret = 0; if (end_byte < start_byte) - goto out; + return; pagevec_init(&pvec, 0); while ((index <= end) && @@ -402,14 +401,10 @@ static int __filemap_fdatawait_range(struct address_space *mapping, continue; wait_on_page_writeback(page); - if (TestClearPageError(page)) - ret = -EIO; } pagevec_release(&pvec); cond_resched(); } -out: - return ret; } /** @@ -429,14 +424,8 @@ static int __filemap_fdatawait_range(struct address_space *mapping, int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { - int ret, ret2; - - ret = __filemap_fdatawait_range(mapping, start_byte, end_byte); - ret2 = filemap_check_errors(mapping); - if (!ret) - ret = ret2; - - return ret; + __filemap_fdatawait_range(mapping, start_byte, end_byte); + return filemap_check_errors(mapping); } EXPORT_SYMBOL(filemap_fdatawait_range); -- 2.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f200.google.com (mail-qk0-f200.google.com [209.85.220.200]) by kanga.kvack.org (Postfix) with ESMTP id 8213D6B0038 for ; Sun, 5 Mar 2017 09:40:58 -0500 (EST) Received: by mail-qk0-f200.google.com with SMTP id j127so110718194qke.2 for ; Sun, 05 Mar 2017 06:40:58 -0800 (PST) Received: from mail-qk0-f180.google.com (mail-qk0-f180.google.com. [209.85.220.180]) by mx.google.com with ESMTPS id f4si13555995qke.184.2017.03.05.06.40.57 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 05 Mar 2017 06:40:57 -0800 (PST) Received: by mail-qk0-f180.google.com with SMTP id p64so1149716qke.1 for ; Sun, 05 Mar 2017 06:40:57 -0800 (PST) Message-ID: <1488724854.2925.6.camel@redhat.com> Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business From: Jeff Layton Date: Sun, 05 Mar 2017 09:40:54 -0500 In-Reply-To: <20170305133535.6516-1-jlayton@redhat.com> References: <20170305133535.6516-1-jlayton@redhat.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown , Ross Zwisler On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote: > I recently did some work to wire up -ENOSPC handling in ceph, and found > I could get back -EIO errors in some cases when I should have instead > gotten -ENOSPC. The problem was that the ceph writeback code would set > PG_error on a writeback error, and that error would clobber the mapping > error. > I should also note that relying on PG_error to report writeback errors is inherently unreliable as well. If someone calls sync() before your fsync gets in there, then you'll likely lose it anyway. filemap_fdatawait_keep_errors will preserve the error in the mapping, but not the individual PG_error flags, so I think we do want to ensure that the mapping error is set when there is a writeback error and not rely on PG_error bit for that. > While I fixed that problem by simply not setting that bit on errors, > that led me down a rabbit hole of looking at how PG_error is being > handled in the kernel. > > This patch series is a few fixes for things that I 100% noticed by > inspection. I don't have a great way to test these since they involve > error handling. I can certainly doctor up a kernel to inject errors > in this code and test by hand however if these look plausible up front. > > Jeff Layton (3): > nilfs2: set the mapping error when calling SetPageError on writeback > mm: don't TestClearPageError in __filemap_fdatawait_range > mm: set mapping error when launder_pages fails > > fs/nilfs2/segment.c | 1 + > mm/filemap.c | 19 ++++--------------- > mm/truncate.c | 6 +++++- > 3 files changed, 10 insertions(+), 16 deletions(-) > (cc'ing Ross...) Just when I thought that only NILFS2 needed a little work here, I see another spot... I think that we should also need to fix dax_writeback_mapping_range to set a mapping error on writeback as well. It looks like that's not happening today. Something like the patch below (obviously untested). I'll also plan to follow up with a patch to vfs.txt to outline how writeback errors should be handled by filesystems, assuming that this patchset isn't completely off base. -------------------8<----------------------- [PATCH] dax: set error in mapping when writeback fails In order to get proper error codes from fsync, we must set an error in the mapping range when writeback fails. Signed-off-by: Jeff Layton --- fs/dax.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index c45598b912e1..9005d90deeda 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping, ret = dax_writeback_one(bdev, mapping, indices[i], pvec.pages[i]); - if (ret < 0) + if (ret < 0) { + mapping_set_error(mapping, ret); return ret; + } } } return 0; -- 2.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id ACECC6B0388 for ; Sun, 5 Mar 2017 22:06:23 -0500 (EST) Received: by mail-wm0-f71.google.com with SMTP id n11so25417122wma.5 for ; Sun, 05 Mar 2017 19:06:23 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id w110si24657425wrb.207.2017.03.05.19.06.22 for (version=TLS1 cipher=AES128-SHA bits=128/128); Sun, 05 Mar 2017 19:06:22 -0800 (PST) From: NeilBrown Date: Mon, 06 Mar 2017 14:06:11 +1100 Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business In-Reply-To: <20170305133535.6516-1-jlayton@redhat.com> References: <20170305133535.6516-1-jlayton@redhat.com> Message-ID: <871subkst8.fsf@notabene.neil.brown.name> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Sender: owner-linux-mm@kvack.org List-ID: To: Jeff Layton , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org --=-=-= Content-Type: text/plain Content-Transfer-Encoding: quoted-printable On Sun, Mar 05 2017, Jeff Layton wrote: > I recently did some work to wire up -ENOSPC handling in ceph, and found > I could get back -EIO errors in some cases when I should have instead > gotten -ENOSPC. The problem was that the ceph writeback code would set > PG_error on a writeback error, and that error would clobber the mapping > error. > > While I fixed that problem by simply not setting that bit on errors, > that led me down a rabbit hole of looking at how PG_error is being > handled in the kernel. Speaking of rabbit holes... I thought to wonder how IO error propagate up from NFS. It doesn't use SetPageError or mapping_set_error() for files (except in one case that looks a bit odd). It has an "nfs_open_context" and store the latest error in ctx->error. So when you get around to documenting how this is supposed to work, it would be worth while describing the required observable behaviour, and note that while filesystems can use mapping_set_error() to achieve this, they don't have to. I notice that drivers/staging/lustre/lustre/llite/rw.c fs/afs/write.c fs/btrfs/extent_io.c fs/cifs/file.c fs/jffs2/file.c fs/jfs/jfs_metapage.c fs/ntfs/aops.c (and possible others) all have SetPageError() calls that seem to be in response to a write error to a file, but don't appear to have matching mapping_set_error() calls. Did you look at these? Did I miss something? Thanks, NeilBrown > > This patch series is a few fixes for things that I 100% noticed by > inspection. I don't have a great way to test these since they involve > error handling. I can certainly doctor up a kernel to inject errors > in this code and test by hand however if these look plausible up front. > > Jeff Layton (3): > nilfs2: set the mapping error when calling SetPageError on writeback > mm: don't TestClearPageError in __filemap_fdatawait_range > mm: set mapping error when launder_pages fails > > fs/nilfs2/segment.c | 1 + > mm/filemap.c | 19 ++++--------------- > mm/truncate.c | 6 +++++- > 3 files changed, 10 insertions(+), 16 deletions(-) > > --=20 > 2.9.3 --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEG8Yp69OQ2HB7X0l6Oeye3VZigbkFAli80iQACgkQOeye3VZi gblO6hAAhio2Ndl5HR0tR+Ao1Kmj9lz3P41Gmjv4+PTuHhYn0311U38WQ42hvI7U CUu4mMv+b2M508vEk6w2VSgwYJdA6Mfs6yvEqEBkceKfWfgfcHjQ40Hk5xWuQ5kZ V0JFVMHK6vxKOn/GhZsVZ9Y7BllG6iY2gqRXwIoTyqqS7fxPt40KL+jEj1Xg9OMT nLBqfhaofimah+WlxPdajveOAAXbjRws/7NH+GEYaD/Z6zjhletSMKumNgkN6i1k kgoHNsN+wHMKULef7AoE3BS9mNQjxvu86VBXnS36MmFIHkVAiTcWXZTaUp98ye3a WvxapiMfcBl9x7tfQHV0QkK2aV0Xw/Y1KlqyO6f649ZQAMLsAvgyWOJd/2T/E7u1 M+pa7GhVKFwBC5Sa7qm5WrriNsVR9Ue0+kMgPmjJxBXBZtuRh7idNxP+5Da7GRO2 ySNK359Lv2ZpcbT9Z8H+fT3siueaMW835+dSCWqCaC/S2daEq5D6f9qB5V5hDmRD Hz/9TYDiyCvQt4U8Vr/2ne73Zu8hfpgiP8c7FtpUtcFgvo4sblfs1RTnoRz5CHWC gNG1K/04CLZgnR+Ya0SO4lU24lvP0NGId7Cyr6WfTA2zMxRRAR5wvG4zve3SOk8j DPjhAIyWf2mDhXSvKC0NVy0umRCPr/Kx03veJ+w88P64hbtyJvQ= =Z5g1 -----END PGP SIGNATURE----- --=-=-=-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f198.google.com (mail-qk0-f198.google.com [209.85.220.198]) by kanga.kvack.org (Postfix) with ESMTP id 08BC26B0038 for ; Mon, 6 Mar 2017 06:43:38 -0500 (EST) Received: by mail-qk0-f198.google.com with SMTP id o135so14272635qke.3 for ; Mon, 06 Mar 2017 03:43:38 -0800 (PST) Received: from mail-qk0-f174.google.com (mail-qk0-f174.google.com. [209.85.220.174]) by mx.google.com with ESMTPS id v1si15257224qki.194.2017.03.06.03.43.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 06 Mar 2017 03:43:36 -0800 (PST) Received: by mail-qk0-f174.google.com with SMTP id v125so87083650qkh.2 for ; Mon, 06 Mar 2017 03:43:36 -0800 (PST) Message-ID: <1488800614.2989.4.camel@redhat.com> Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business From: Jeff Layton Date: Mon, 06 Mar 2017 06:43:34 -0500 In-Reply-To: <871subkst8.fsf@notabene.neil.brown.name> References: <20170305133535.6516-1-jlayton@redhat.com> <871subkst8.fsf@notabene.neil.brown.name> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: NeilBrown , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, Andrew Morton On Mon, 2017-03-06 at 14:06 +1100, NeilBrown wrote: > On Sun, Mar 05 2017, Jeff Layton wrote: > > > I recently did some work to wire up -ENOSPC handling in ceph, and found > > I could get back -EIO errors in some cases when I should have instead > > gotten -ENOSPC. The problem was that the ceph writeback code would set > > PG_error on a writeback error, and that error would clobber the mapping > > error. > > > > While I fixed that problem by simply not setting that bit on errors, > > that led me down a rabbit hole of looking at how PG_error is being > > handled in the kernel. > > Speaking of rabbit holes... I thought to wonder how IO error propagate > up from NFS. > It doesn't use SetPageError or mapping_set_error() for files (except in > one case that looks a bit odd). > It has an "nfs_open_context" and store the latest error in ctx->error. > > So when you get around to documenting how this is supposed to work, it > would be worth while describing the required observable behaviour, and > note that while filesystems can use mapping_set_error() to achieve this, > they don't have to. > > I notice that > drivers/staging/lustre/lustre/llite/rw.c > fs/afs/write.c > fs/btrfs/extent_io.c > fs/cifs/file.c > fs/jffs2/file.c > fs/jfs/jfs_metapage.c > fs/ntfs/aops.c > > (and possible others) all have SetPageError() calls that seem to be > in response to a write error to a file, but don't appear to have > matching mapping_set_error() calls. Did you look at these? Did I miss > something? > Those are all in writepage implementations, and the callers of writepage all call mapping_set_error if it returns error. The exception is write_one_page, which is typically used for writing out dir info and such, and so it's not really necessary there. Now that I look though, I think I may have gotten the page migration codepath wrong. I had convinced myself it was ok before, but looking again, I think we probably need to add a mapping_set_error call to writeout() as well. I'll go over that more carefully in a little while. > > > > This patch series is a few fixes for things that I 100% noticed by > > inspection. I don't have a great way to test these since they involve > > error handling. I can certainly doctor up a kernel to inject errors > > in this code and test by hand however if these look plausible up front. > > > > Jeff Layton (3): > > nilfs2: set the mapping error when calling SetPageError on writeback > > mm: don't TestClearPageError in __filemap_fdatawait_range > > mm: set mapping error when launder_pages fails > > > > fs/nilfs2/segment.c | 1 + > > mm/filemap.c | 19 ++++--------------- > > mm/truncate.c | 6 +++++- > > 3 files changed, 10 insertions(+), 16 deletions(-) > > > > -- > > 2.9.3 -- Jeff Layton -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id E2E096B0387 for ; Mon, 6 Mar 2017 18:08:22 -0500 (EST) Received: by mail-pf0-f200.google.com with SMTP id v190so85455331pfb.5 for ; Mon, 06 Mar 2017 15:08:22 -0800 (PST) Received: from mga06.intel.com (mga06.intel.com. [134.134.136.31]) by mx.google.com with ESMTPS id n67si20390069pfk.77.2017.03.06.15.08.21 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 06 Mar 2017 15:08:21 -0800 (PST) Date: Mon, 6 Mar 2017 16:08:01 -0700 From: Ross Zwisler Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170306230801.GA28111@linux.intel.com> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1488724854.2925.6.camel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jeff Layton Cc: viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown , Ross Zwisler , Jan Kara On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote: > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote: > > I recently did some work to wire up -ENOSPC handling in ceph, and found > > I could get back -EIO errors in some cases when I should have instead > > gotten -ENOSPC. The problem was that the ceph writeback code would set > > PG_error on a writeback error, and that error would clobber the mapping > > error. > > > > I should also note that relying on PG_error to report writeback errors > is inherently unreliable as well. If someone calls sync() before your > fsync gets in there, then you'll likely lose it anyway. > > filemap_fdatawait_keep_errors will preserve the error in the mapping, > but not the individual PG_error flags, so I think we do want to ensure > that the mapping error is set when there is a writeback error and not > rely on PG_error bit for that. > > > While I fixed that problem by simply not setting that bit on errors, > > that led me down a rabbit hole of looking at how PG_error is being > > handled in the kernel. > > > > This patch series is a few fixes for things that I 100% noticed by > > inspection. I don't have a great way to test these since they involve > > error handling. I can certainly doctor up a kernel to inject errors > > in this code and test by hand however if these look plausible up front. > > > > Jeff Layton (3): > > nilfs2: set the mapping error when calling SetPageError on writeback > > mm: don't TestClearPageError in __filemap_fdatawait_range > > mm: set mapping error when launder_pages fails > > > > fs/nilfs2/segment.c | 1 + > > mm/filemap.c | 19 ++++--------------- > > mm/truncate.c | 6 +++++- > > 3 files changed, 10 insertions(+), 16 deletions(-) > > > > (cc'ing Ross...) > > Just when I thought that only NILFS2 needed a little work here, I see > another spot... > > I think that we should also need to fix dax_writeback_mapping_range to > set a mapping error on writeback as well. It looks like that's not > happening today. Something like the patch below (obviously untested). > > I'll also plan to follow up with a patch to vfs.txt to outline how > writeback errors should be handled by filesystems, assuming that this > patchset isn't completely off base. > > -------------------8<----------------------- > > [PATCH] dax: set error in mapping when writeback fails > > In order to get proper error codes from fsync, we must set an error in > the mapping range when writeback fails. > > Signed-off-by: Jeff Layton > --- > fs/dax.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/dax.c b/fs/dax.c > index c45598b912e1..9005d90deeda 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping, > > ret = dax_writeback_one(bdev, mapping, indices[i], > pvec.pages[i]); > - if (ret < 0) > + if (ret < 0) { > + mapping_set_error(mapping, ret); > return ret; > + } (Adding Jan) I tested this a bit, and for the DAX case at least I don't think this does what you want. The current code already returns -EIO if dax_writeback_one() hits an error, which bubbles up through the call stack and makes the fsync() call in userspace fail with EIO, as we want. With both ext4 and xfs this patch (applied to v4.10) makes it so that we fail the current fsync() due to the return value of -EIO, then we fail the next fsync() as well because only then do we actually process the AS_EIO flag inside of filemap_check_errors(). I think maybe the missing piece is that our normal DAX fsync call stack doesn't include a call to filemap_check_errors() if we return -EIO. Here's our stack in xfs: dax_writeback_mapping_range+0x32/0x70 xfs_vm_writepages+0x8c/0xf0 do_writepages+0x21/0x30 __filemap_fdatawrite_range+0xc6/0x100 filemap_write_and_wait_range+0x44/0x90 xfs_file_fsync+0x7a/0x2c0 vfs_fsync_range+0x4b/0xb0 ? trace_hardirqs_on_caller+0xf5/0x1b0 do_fsync+0x3d/0x70 SyS_fsync+0x10/0x20 entry_SYSCALL_64_fastpath+0x1f/0xc2 On the subsequent fsync() call we *do* end up calling filemap_check_errors() via filemap_fdatawrite_range(), which tests & clears the AS_EIO flag in the mapping: filemap_fdatawait_range+0x3b/0x80 filemap_write_and_wait_range+0x5a/0x90 xfs_file_fsync+0x7a/0x2c0 vfs_fsync_range+0x4b/0xb0 ? trace_hardirqs_on_caller+0xf5/0x1b0 do_fsync+0x3d/0x70 SyS_fsync+0x10/0x20 entry_SYSCALL_64_fastpath+0x1f/0xc2 Was your concern just that you didn't think that fsync() was properly returning an error when dax_writeback_one() hit an error? Or is there another path by which we need to report the error, where it is actually important that we set AS_EIO? If it's the latter, then I think we need to rework the fsync call path so that we both generate and consume AS_EIO on the same call, probably in filemap_write_and_wait_range(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 646506B038B for ; Tue, 7 Mar 2017 05:26:27 -0500 (EST) Received: by mail-wm0-f72.google.com with SMTP id u9so109315wme.6 for ; Tue, 07 Mar 2017 02:26:27 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id y63si30172280wrb.176.2017.03.07.02.26.25 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 07 Mar 2017 02:26:25 -0800 (PST) Date: Tue, 7 Mar 2017 11:26:22 +0100 From: Jan Kara Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170307102622.GB2578@quack2.suse.cz> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170306230801.GA28111@linux.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ross Zwisler Cc: Jeff Layton , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown , Jan Kara On Mon 06-03-17 16:08:01, Ross Zwisler wrote: > On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote: > > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote: > > > I recently did some work to wire up -ENOSPC handling in ceph, and found > > > I could get back -EIO errors in some cases when I should have instead > > > gotten -ENOSPC. The problem was that the ceph writeback code would set > > > PG_error on a writeback error, and that error would clobber the mapping > > > error. > > > > > > > I should also note that relying on PG_error to report writeback errors > > is inherently unreliable as well. If someone calls sync() before your > > fsync gets in there, then you'll likely lose it anyway. > > > > filemap_fdatawait_keep_errors will preserve the error in the mapping, > > but not the individual PG_error flags, so I think we do want to ensure > > that the mapping error is set when there is a writeback error and not > > rely on PG_error bit for that. > > > > > While I fixed that problem by simply not setting that bit on errors, > > > that led me down a rabbit hole of looking at how PG_error is being > > > handled in the kernel. > > > > > > This patch series is a few fixes for things that I 100% noticed by > > > inspection. I don't have a great way to test these since they involve > > > error handling. I can certainly doctor up a kernel to inject errors > > > in this code and test by hand however if these look plausible up front. > > > > > > Jeff Layton (3): > > > nilfs2: set the mapping error when calling SetPageError on writeback > > > mm: don't TestClearPageError in __filemap_fdatawait_range > > > mm: set mapping error when launder_pages fails > > > > > > fs/nilfs2/segment.c | 1 + > > > mm/filemap.c | 19 ++++--------------- > > > mm/truncate.c | 6 +++++- > > > 3 files changed, 10 insertions(+), 16 deletions(-) > > > > > > > (cc'ing Ross...) > > > > Just when I thought that only NILFS2 needed a little work here, I see > > another spot... > > > > I think that we should also need to fix dax_writeback_mapping_range to > > set a mapping error on writeback as well. It looks like that's not > > happening today. Something like the patch below (obviously untested). > > > > I'll also plan to follow up with a patch to vfs.txt to outline how > > writeback errors should be handled by filesystems, assuming that this > > patchset isn't completely off base. > > > > -------------------8<----------------------- > > > > [PATCH] dax: set error in mapping when writeback fails > > > > In order to get proper error codes from fsync, we must set an error in > > the mapping range when writeback fails. > > > > Signed-off-by: Jeff Layton > > --- > > fs/dax.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index c45598b912e1..9005d90deeda 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping, > > > > ret = dax_writeback_one(bdev, mapping, indices[i], > > pvec.pages[i]); > > - if (ret < 0) > > + if (ret < 0) { > > + mapping_set_error(mapping, ret); > > return ret; > > + } > > (Adding Jan) > > I tested this a bit, and for the DAX case at least I don't think this does > what you want. The current code already returns -EIO if dax_writeback_one() > hits an error, which bubbles up through the call stack and makes the fsync() > call in userspace fail with EIO, as we want. With both ext4 and xfs this > patch (applied to v4.10) makes it so that we fail the current fsync() due to > the return value of -EIO, then we fail the next fsync() as well because only > then do we actually process the AS_EIO flag inside of filemap_check_errors(). > > I think maybe the missing piece is that our normal DAX fsync call stack > doesn't include a call to filemap_check_errors() if we return -EIO. Here's > our stack in xfs: > > dax_writeback_mapping_range+0x32/0x70 > xfs_vm_writepages+0x8c/0xf0 > do_writepages+0x21/0x30 > __filemap_fdatawrite_range+0xc6/0x100 > filemap_write_and_wait_range+0x44/0x90 > xfs_file_fsync+0x7a/0x2c0 > vfs_fsync_range+0x4b/0xb0 > ? trace_hardirqs_on_caller+0xf5/0x1b0 > do_fsync+0x3d/0x70 > SyS_fsync+0x10/0x20 > entry_SYSCALL_64_fastpath+0x1f/0xc2 > > On the subsequent fsync() call we *do* end up calling filemap_check_errors() > via filemap_fdatawrite_range(), which tests & clears the AS_EIO flag in the > mapping: > > filemap_fdatawait_range+0x3b/0x80 > filemap_write_and_wait_range+0x5a/0x90 > xfs_file_fsync+0x7a/0x2c0 > vfs_fsync_range+0x4b/0xb0 > ? trace_hardirqs_on_caller+0xf5/0x1b0 > do_fsync+0x3d/0x70 > SyS_fsync+0x10/0x20 > entry_SYSCALL_64_fastpath+0x1f/0xc2 > > Was your concern just that you didn't think that fsync() was properly > returning an error when dax_writeback_one() hit an error? Or is there another > path by which we need to report the error, where it is actually important that > we set AS_EIO? If it's the latter, then I think we need to rework the fsync > call path so that we both generate and consume AS_EIO on the same call, > probably in filemap_write_and_wait_range(). So I believe this is due to the special handling of EIO inside filemap_write_and_wait(). Normally, filemap_check_errors() happens inside filemap_fdatawait() there however not for EIO returned from filemap_fdatawrite(). In that case we bail out immediately. So I think Jeff's patch is correct but we need to change filemap_write_and_wait() to call also filemap_check_errors() directly on EIO from filemap_fdatawrite(). On a more general note (DAX is actually fine here), I find the current practice of clearing page dirty bits on error and reporting it just once problematic. It keeps the system running but data is lost and possibly without getting the error anywhere where it is useful. We get away with this because it is a rare event but it seems like a problematic behavior. But this is more for the discussion at LSF. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id C2DA06B0387 for ; Tue, 7 Mar 2017 08:46:17 -0500 (EST) Received: by mail-pf0-f197.google.com with SMTP id o126so3477054pfb.2 for ; Tue, 07 Mar 2017 05:46:17 -0800 (PST) Received: from mail-pg0-x241.google.com (mail-pg0-x241.google.com. [2607:f8b0:400e:c05::241]) by mx.google.com with ESMTPS id r20si78303pgo.110.2017.03.07.05.46.16 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 07 Mar 2017 05:46:16 -0800 (PST) Received: by mail-pg0-x241.google.com with SMTP id 187so191706pgb.2 for ; Tue, 07 Mar 2017 05:46:16 -0800 (PST) Date: Tue, 07 Mar 2017 22:46:12 +0900 (JST) Message-Id: <20170307.224612.801707040634574055.konishi.ryusuke@lab.ntt.co.jp> Subject: Re: [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback From: Ryusuke Konishi In-Reply-To: <20170305133535.6516-2-jlayton@redhat.com> References: <20170305133535.6516-1-jlayton@redhat.com> <20170305133535.6516-2-jlayton@redhat.com> Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jeff Layton Cc: viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org On Sun, 5 Mar 2017 08:35:33 -0500, Jeff Layton wrote: > In a later patch, we're going to want to make the fsync codepath not do > a TestClearPageError call as that can override the error set in the > address space. To do that though, we need to ensure that filesystems > that are relying on the PG_error bit for reporting writeback errors > also set an error in the address space. > > The only place I've found that looks potentially problematic is this > spot in nilfs2. Ensure that it sets an error in the mapping in addition > to setting PageError. > > Signed-off-by: Jeff Layton Acked-by: Ryusuke Konishi Agreed that nilfs2 needs this if the successive patch is applied. Thanks, Ryusuke Konishi > --- > fs/nilfs2/segment.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c > index bedcae2c28e6..c1041b07060e 100644 > --- a/fs/nilfs2/segment.c > +++ b/fs/nilfs2/segment.c > @@ -1743,6 +1743,7 @@ static void nilfs_end_page_io(struct page *page, int err) > } else { > __set_page_dirty_nobuffers(page); > SetPageError(page); > + mapping_set_error(page_mapping(page), err); > } > > end_page_writeback(page); > -- > 2.9.3 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f199.google.com (mail-qk0-f199.google.com [209.85.220.199]) by kanga.kvack.org (Postfix) with ESMTP id BC9AA6B0387 for ; Tue, 7 Mar 2017 09:03:21 -0500 (EST) Received: by mail-qk0-f199.google.com with SMTP id f191so4598025qka.7 for ; Tue, 07 Mar 2017 06:03:21 -0800 (PST) Received: from mail-qk0-f178.google.com (mail-qk0-f178.google.com. [209.85.220.178]) by mx.google.com with ESMTPS id m187si103895qkf.95.2017.03.07.06.03.20 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 07 Mar 2017 06:03:20 -0800 (PST) Received: by mail-qk0-f178.google.com with SMTP id 1so4794751qkl.3 for ; Tue, 07 Mar 2017 06:03:20 -0800 (PST) Message-ID: <1488895397.2788.3.camel@redhat.com> Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business From: Jeff Layton Date: Tue, 07 Mar 2017 09:03:17 -0500 In-Reply-To: <20170307102622.GB2578@quack2.suse.cz> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara , Ross Zwisler Cc: viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown , Kevin Wolf On Tue, 2017-03-07 at 11:26 +0100, Jan Kara wrote: > On Mon 06-03-17 16:08:01, Ross Zwisler wrote: > > On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote: > > > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote: > > > > I recently did some work to wire up -ENOSPC handling in ceph, and found > > > > I could get back -EIO errors in some cases when I should have instead > > > > gotten -ENOSPC. The problem was that the ceph writeback code would set > > > > PG_error on a writeback error, and that error would clobber the mapping > > > > error. > > > > > > > > > > I should also note that relying on PG_error to report writeback errors > > > is inherently unreliable as well. If someone calls sync() before your > > > fsync gets in there, then you'll likely lose it anyway. > > > > > > filemap_fdatawait_keep_errors will preserve the error in the mapping, > > > but not the individual PG_error flags, so I think we do want to ensure > > > that the mapping error is set when there is a writeback error and not > > > rely on PG_error bit for that. > > > > > > > While I fixed that problem by simply not setting that bit on errors, > > > > that led me down a rabbit hole of looking at how PG_error is being > > > > handled in the kernel. > > > > > > > > This patch series is a few fixes for things that I 100% noticed by > > > > inspection. I don't have a great way to test these since they involve > > > > error handling. I can certainly doctor up a kernel to inject errors > > > > in this code and test by hand however if these look plausible up front. > > > > > > > > Jeff Layton (3): > > > > nilfs2: set the mapping error when calling SetPageError on writeback > > > > mm: don't TestClearPageError in __filemap_fdatawait_range > > > > mm: set mapping error when launder_pages fails > > > > > > > > fs/nilfs2/segment.c | 1 + > > > > mm/filemap.c | 19 ++++--------------- > > > > mm/truncate.c | 6 +++++- > > > > 3 files changed, 10 insertions(+), 16 deletions(-) > > > > > > > > > > (cc'ing Ross...) > > > > > > Just when I thought that only NILFS2 needed a little work here, I see > > > another spot... > > > > > > I think that we should also need to fix dax_writeback_mapping_range to > > > set a mapping error on writeback as well. It looks like that's not > > > happening today. Something like the patch below (obviously untested). > > > > > > I'll also plan to follow up with a patch to vfs.txt to outline how > > > writeback errors should be handled by filesystems, assuming that this > > > patchset isn't completely off base. > > > > > > -------------------8<----------------------- > > > > > > [PATCH] dax: set error in mapping when writeback fails > > > > > > In order to get proper error codes from fsync, we must set an error in > > > the mapping range when writeback fails. > > > > > > Signed-off-by: Jeff Layton > > > --- > > > fs/dax.c | 4 +++- > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > index c45598b912e1..9005d90deeda 100644 > > > --- a/fs/dax.c > > > +++ b/fs/dax.c > > > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping, > > > > > > ret = dax_writeback_one(bdev, mapping, indices[i], > > > pvec.pages[i]); > > > - if (ret < 0) > > > + if (ret < 0) { > > > + mapping_set_error(mapping, ret); > > > return ret; > > > + } > > > > (Adding Jan) > > > > I tested this a bit, and for the DAX case at least I don't think this does > > what you want. The current code already returns -EIO if dax_writeback_one() > > hits an error, which bubbles up through the call stack and makes the fsync() > > call in userspace fail with EIO, as we want. With both ext4 and xfs this > > patch (applied to v4.10) makes it so that we fail the current fsync() due to > > the return value of -EIO, then we fail the next fsync() as well because only > > then do we actually process the AS_EIO flag inside of filemap_check_errors(). > > > > I think maybe the missing piece is that our normal DAX fsync call stack > > doesn't include a call to filemap_check_errors() if we return -EIO. Here's > > our stack in xfs: > > > > dax_writeback_mapping_range+0x32/0x70 > > xfs_vm_writepages+0x8c/0xf0 > > do_writepages+0x21/0x30 > > __filemap_fdatawrite_range+0xc6/0x100 > > filemap_write_and_wait_range+0x44/0x90 > > xfs_file_fsync+0x7a/0x2c0 > > vfs_fsync_range+0x4b/0xb0 > > ? trace_hardirqs_on_caller+0xf5/0x1b0 > > do_fsync+0x3d/0x70 > > SyS_fsync+0x10/0x20 > > entry_SYSCALL_64_fastpath+0x1f/0xc2 > > > > On the subsequent fsync() call we *do* end up calling filemap_check_errors() > > via filemap_fdatawrite_range(), which tests & clears the AS_EIO flag in the > > mapping: > > > > filemap_fdatawait_range+0x3b/0x80 > > filemap_write_and_wait_range+0x5a/0x90 > > xfs_file_fsync+0x7a/0x2c0 > > vfs_fsync_range+0x4b/0xb0 > > ? trace_hardirqs_on_caller+0xf5/0x1b0 > > do_fsync+0x3d/0x70 > > SyS_fsync+0x10/0x20 > > entry_SYSCALL_64_fastpath+0x1f/0xc2 > > > > Was your concern just that you didn't think that fsync() was properly > > returning an error when dax_writeback_one() hit an error? Or is there another > > path by which we need to report the error, where it is actually important that > > we set AS_EIO? If it's the latter, then I think we need to rework the fsync > > call path so that we both generate and consume AS_EIO on the same call, > > probably in filemap_write_and_wait_range(). > > So I believe this is due to the special handling of EIO inside > filemap_write_and_wait(). Normally, filemap_check_errors() happens inside > filemap_fdatawait() there however not for EIO returned from > filemap_fdatawrite(). In that case we bail out immediately. So I think > Jeff's patch is correct but we need to change filemap_write_and_wait() to > call also filemap_check_errors() directly on EIO from filemap_fdatawrite(). > Yes that makes total sense. I've got a filemap_write_and_wait patch in my pile now that does this. I'll run what I have through an xfstests run, and see how it does, and will plan to post a v2 set once I do. > On a more general note (DAX is actually fine here), I find the current > practice of clearing page dirty bits on error and reporting it just once > problematic. It keeps the system running but data is lost and possibly > without getting the error anywhere where it is useful. We get away with > this because it is a rare event but it seems like a problematic behavior. > But this is more for the discussion at LSF. > That really is the crux of the matter. Unfortunately, that's sort of how the POSIX write/fsync model is designed. If we want to change that, then I think that we have to consider what a new interface for this would look like. Maybe we can do something there with new sync_file_range flags? I think this probably also dovetails with Kevin Wolf's proposed LSF topic too, so maybe we can discuss all of this together there. -- Jeff Layton -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 8706B6B038E for ; Tue, 7 Mar 2017 10:59:18 -0500 (EST) Received: by mail-pf0-f198.google.com with SMTP id v190so9195249pfb.5 for ; Tue, 07 Mar 2017 07:59:18 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTPS id k195si412598pgc.159.2017.03.07.07.59.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 07 Mar 2017 07:59:17 -0800 (PST) Date: Tue, 7 Mar 2017 08:59:16 -0700 From: Ross Zwisler Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170307155916.GA31882@linux.intel.com> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170307102622.GB2578@quack2.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara Cc: Ross Zwisler , Jeff Layton , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > On Mon 06-03-17 16:08:01, Ross Zwisler wrote: > > On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote: > > > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote: > > > > I recently did some work to wire up -ENOSPC handling in ceph, and found > > > > I could get back -EIO errors in some cases when I should have instead > > > > gotten -ENOSPC. The problem was that the ceph writeback code would set > > > > PG_error on a writeback error, and that error would clobber the mapping > > > > error. > > > > > > > > > > I should also note that relying on PG_error to report writeback errors > > > is inherently unreliable as well. If someone calls sync() before your > > > fsync gets in there, then you'll likely lose it anyway. > > > > > > filemap_fdatawait_keep_errors will preserve the error in the mapping, > > > but not the individual PG_error flags, so I think we do want to ensure > > > that the mapping error is set when there is a writeback error and not > > > rely on PG_error bit for that. > > > > > > > While I fixed that problem by simply not setting that bit on errors, > > > > that led me down a rabbit hole of looking at how PG_error is being > > > > handled in the kernel. > > > > > > > > This patch series is a few fixes for things that I 100% noticed by > > > > inspection. I don't have a great way to test these since they involve > > > > error handling. I can certainly doctor up a kernel to inject errors > > > > in this code and test by hand however if these look plausible up front. > > > > > > > > Jeff Layton (3): > > > > nilfs2: set the mapping error when calling SetPageError on writeback > > > > mm: don't TestClearPageError in __filemap_fdatawait_range > > > > mm: set mapping error when launder_pages fails > > > > > > > > fs/nilfs2/segment.c | 1 + > > > > mm/filemap.c | 19 ++++--------------- > > > > mm/truncate.c | 6 +++++- > > > > 3 files changed, 10 insertions(+), 16 deletions(-) > > > > > > > > > > (cc'ing Ross...) > > > > > > Just when I thought that only NILFS2 needed a little work here, I see > > > another spot... > > > > > > I think that we should also need to fix dax_writeback_mapping_range to > > > set a mapping error on writeback as well. It looks like that's not > > > happening today. Something like the patch below (obviously untested). > > > > > > I'll also plan to follow up with a patch to vfs.txt to outline how > > > writeback errors should be handled by filesystems, assuming that this > > > patchset isn't completely off base. > > > > > > -------------------8<----------------------- > > > > > > [PATCH] dax: set error in mapping when writeback fails > > > > > > In order to get proper error codes from fsync, we must set an error in > > > the mapping range when writeback fails. > > > > > > Signed-off-by: Jeff Layton > > > --- > > > fs/dax.c | 4 +++- > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > index c45598b912e1..9005d90deeda 100644 > > > --- a/fs/dax.c > > > +++ b/fs/dax.c > > > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping, > > > > > > ret = dax_writeback_one(bdev, mapping, indices[i], > > > pvec.pages[i]); > > > - if (ret < 0) > > > + if (ret < 0) { > > > + mapping_set_error(mapping, ret); > > > return ret; > > > + } > > > > (Adding Jan) > > > > I tested this a bit, and for the DAX case at least I don't think this does > > what you want. The current code already returns -EIO if dax_writeback_one() > > hits an error, which bubbles up through the call stack and makes the fsync() > > call in userspace fail with EIO, as we want. With both ext4 and xfs this > > patch (applied to v4.10) makes it so that we fail the current fsync() due to > > the return value of -EIO, then we fail the next fsync() as well because only > > then do we actually process the AS_EIO flag inside of filemap_check_errors(). > > > > I think maybe the missing piece is that our normal DAX fsync call stack > > doesn't include a call to filemap_check_errors() if we return -EIO. Here's > > our stack in xfs: > > > > dax_writeback_mapping_range+0x32/0x70 > > xfs_vm_writepages+0x8c/0xf0 > > do_writepages+0x21/0x30 > > __filemap_fdatawrite_range+0xc6/0x100 > > filemap_write_and_wait_range+0x44/0x90 > > xfs_file_fsync+0x7a/0x2c0 > > vfs_fsync_range+0x4b/0xb0 > > ? trace_hardirqs_on_caller+0xf5/0x1b0 > > do_fsync+0x3d/0x70 > > SyS_fsync+0x10/0x20 > > entry_SYSCALL_64_fastpath+0x1f/0xc2 > > > > On the subsequent fsync() call we *do* end up calling filemap_check_errors() > > via filemap_fdatawrite_range(), which tests & clears the AS_EIO flag in the > > mapping: > > > > filemap_fdatawait_range+0x3b/0x80 > > filemap_write_and_wait_range+0x5a/0x90 > > xfs_file_fsync+0x7a/0x2c0 > > vfs_fsync_range+0x4b/0xb0 > > ? trace_hardirqs_on_caller+0xf5/0x1b0 > > do_fsync+0x3d/0x70 > > SyS_fsync+0x10/0x20 > > entry_SYSCALL_64_fastpath+0x1f/0xc2 > > > > Was your concern just that you didn't think that fsync() was properly > > returning an error when dax_writeback_one() hit an error? Or is there another > > path by which we need to report the error, where it is actually important that > > we set AS_EIO? If it's the latter, then I think we need to rework the fsync > > call path so that we both generate and consume AS_EIO on the same call, > > probably in filemap_write_and_wait_range(). > > So I believe this is due to the special handling of EIO inside > filemap_write_and_wait(). Normally, filemap_check_errors() happens inside s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular case, but we definitely want to make changes that keep them consistent. > filemap_fdatawait() there however not for EIO returned from > filemap_fdatawrite(). In that case we bail out immediately. So I think > Jeff's patch is correct but we need to change filemap_write_and_wait() to > call also filemap_check_errors() directly on EIO from filemap_fdatawrite(). So I guess my question was: why is it important that we set AS_EIO, if the EIO is already being reported correctly? Is it just for consistency with the buffered fsync case, or is there currently a path where the -EIO from DAX will be missed, and we actually need AS_EIO to be set in mapping->flags so that we correctly report an error? > On a more general note (DAX is actually fine here), I find the current > practice of clearing page dirty bits on error and reporting it just once > problematic. It keeps the system running but data is lost and possibly > without getting the error anywhere where it is useful. We get away with > this because it is a rare event but it seems like a problematic behavior. > But this is more for the discussion at LSF. > > Honza > -- > Jan Kara > SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id D7CA56B0397 for ; Tue, 7 Mar 2017 11:17:20 -0500 (EST) Received: by mail-wm0-f71.google.com with SMTP id n11so2946818wma.5 for ; Tue, 07 Mar 2017 08:17:20 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id o191si1087016wme.129.2017.03.07.08.17.19 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 07 Mar 2017 08:17:19 -0800 (PST) Date: Tue, 7 Mar 2017 17:17:16 +0100 From: Jan Kara Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170307161716.GA11837@quack2.suse.cz> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> <20170307155916.GA31882@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170307155916.GA31882@linux.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ross Zwisler Cc: Jan Kara , Jeff Layton , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Tue 07-03-17 08:59:16, Ross Zwisler wrote: > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > > On Mon 06-03-17 16:08:01, Ross Zwisler wrote: > > > On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote: > > > > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote: > > > > > I recently did some work to wire up -ENOSPC handling in ceph, and found > > > > > I could get back -EIO errors in some cases when I should have instead > > > > > gotten -ENOSPC. The problem was that the ceph writeback code would set > > > > > PG_error on a writeback error, and that error would clobber the mapping > > > > > error. > > > > > > > > > > > > > I should also note that relying on PG_error to report writeback errors > > > > is inherently unreliable as well. If someone calls sync() before your > > > > fsync gets in there, then you'll likely lose it anyway. > > > > > > > > filemap_fdatawait_keep_errors will preserve the error in the mapping, > > > > but not the individual PG_error flags, so I think we do want to ensure > > > > that the mapping error is set when there is a writeback error and not > > > > rely on PG_error bit for that. > > > > > > > > > While I fixed that problem by simply not setting that bit on errors, > > > > > that led me down a rabbit hole of looking at how PG_error is being > > > > > handled in the kernel. > > > > > > > > > > This patch series is a few fixes for things that I 100% noticed by > > > > > inspection. I don't have a great way to test these since they involve > > > > > error handling. I can certainly doctor up a kernel to inject errors > > > > > in this code and test by hand however if these look plausible up front. > > > > > > > > > > Jeff Layton (3): > > > > > nilfs2: set the mapping error when calling SetPageError on writeback > > > > > mm: don't TestClearPageError in __filemap_fdatawait_range > > > > > mm: set mapping error when launder_pages fails > > > > > > > > > > fs/nilfs2/segment.c | 1 + > > > > > mm/filemap.c | 19 ++++--------------- > > > > > mm/truncate.c | 6 +++++- > > > > > 3 files changed, 10 insertions(+), 16 deletions(-) > > > > > > > > > > > > > (cc'ing Ross...) > > > > > > > > Just when I thought that only NILFS2 needed a little work here, I see > > > > another spot... > > > > > > > > I think that we should also need to fix dax_writeback_mapping_range to > > > > set a mapping error on writeback as well. It looks like that's not > > > > happening today. Something like the patch below (obviously untested). > > > > > > > > I'll also plan to follow up with a patch to vfs.txt to outline how > > > > writeback errors should be handled by filesystems, assuming that this > > > > patchset isn't completely off base. > > > > > > > > -------------------8<----------------------- > > > > > > > > [PATCH] dax: set error in mapping when writeback fails > > > > > > > > In order to get proper error codes from fsync, we must set an error in > > > > the mapping range when writeback fails. > > > > > > > > Signed-off-by: Jeff Layton > > > > --- > > > > fs/dax.c | 4 +++- > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > > index c45598b912e1..9005d90deeda 100644 > > > > --- a/fs/dax.c > > > > +++ b/fs/dax.c > > > > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping, > > > > > > > > ret = dax_writeback_one(bdev, mapping, indices[i], > > > > pvec.pages[i]); > > > > - if (ret < 0) > > > > + if (ret < 0) { > > > > + mapping_set_error(mapping, ret); > > > > return ret; > > > > + } > > > > > > (Adding Jan) > > > > > > I tested this a bit, and for the DAX case at least I don't think this does > > > what you want. The current code already returns -EIO if dax_writeback_one() > > > hits an error, which bubbles up through the call stack and makes the fsync() > > > call in userspace fail with EIO, as we want. With both ext4 and xfs this > > > patch (applied to v4.10) makes it so that we fail the current fsync() due to > > > the return value of -EIO, then we fail the next fsync() as well because only > > > then do we actually process the AS_EIO flag inside of filemap_check_errors(). > > > > > > I think maybe the missing piece is that our normal DAX fsync call stack > > > doesn't include a call to filemap_check_errors() if we return -EIO. Here's > > > our stack in xfs: > > > > > > dax_writeback_mapping_range+0x32/0x70 > > > xfs_vm_writepages+0x8c/0xf0 > > > do_writepages+0x21/0x30 > > > __filemap_fdatawrite_range+0xc6/0x100 > > > filemap_write_and_wait_range+0x44/0x90 > > > xfs_file_fsync+0x7a/0x2c0 > > > vfs_fsync_range+0x4b/0xb0 > > > ? trace_hardirqs_on_caller+0xf5/0x1b0 > > > do_fsync+0x3d/0x70 > > > SyS_fsync+0x10/0x20 > > > entry_SYSCALL_64_fastpath+0x1f/0xc2 > > > > > > On the subsequent fsync() call we *do* end up calling filemap_check_errors() > > > via filemap_fdatawrite_range(), which tests & clears the AS_EIO flag in the > > > mapping: > > > > > > filemap_fdatawait_range+0x3b/0x80 > > > filemap_write_and_wait_range+0x5a/0x90 > > > xfs_file_fsync+0x7a/0x2c0 > > > vfs_fsync_range+0x4b/0xb0 > > > ? trace_hardirqs_on_caller+0xf5/0x1b0 > > > do_fsync+0x3d/0x70 > > > SyS_fsync+0x10/0x20 > > > entry_SYSCALL_64_fastpath+0x1f/0xc2 > > > > > > Was your concern just that you didn't think that fsync() was properly > > > returning an error when dax_writeback_one() hit an error? Or is there another > > > path by which we need to report the error, where it is actually important that > > > we set AS_EIO? If it's the latter, then I think we need to rework the fsync > > > call path so that we both generate and consume AS_EIO on the same call, > > > probably in filemap_write_and_wait_range(). > > > > So I believe this is due to the special handling of EIO inside > > filemap_write_and_wait(). Normally, filemap_check_errors() happens inside > > s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular > case, but we definitely want to make changes that keep them consistent. > > > filemap_fdatawait() there however not for EIO returned from > > filemap_fdatawrite(). In that case we bail out immediately. So I think > > Jeff's patch is correct but we need to change filemap_write_and_wait() to > > call also filemap_check_errors() directly on EIO from filemap_fdatawrite(). > > So I guess my question was: why is it important that we set AS_EIO, if the EIO > is already being reported correctly? Is it just for consistency with the > buffered fsync case, or is there currently a path where the -EIO from DAX will > be missed, and we actually need AS_EIO to be set in mapping->flags so that we > correctly report an error? It is just for consistency and future-proofing. E.g. if we ever decided to do some background flushing of CPU caches, current DAX behavior would suddently result in missed reports of EIO errors. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ua0-f197.google.com (mail-ua0-f197.google.com [209.85.217.197]) by kanga.kvack.org (Postfix) with ESMTP id 75ECB6B0391 for ; Wed, 8 Mar 2017 21:57:33 -0500 (EST) Received: by mail-ua0-f197.google.com with SMTP id w33so78434661uaw.4 for ; Wed, 08 Mar 2017 18:57:33 -0800 (PST) Received: from imap.thunk.org (imap.thunk.org. [2600:3c02::f03c:91ff:fe96:be03]) by mx.google.com with ESMTPS id j20si2425141vkf.122.2017.03.08.18.57.31 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 08 Mar 2017 18:57:32 -0800 (PST) Date: Wed, 8 Mar 2017 21:57:25 -0500 From: Theodore Ts'o Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170309025725.5wrszri462zipiix@thunk.org> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170307102622.GB2578@quack2.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara Cc: Ross Zwisler , Jeff Layton , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > On a more general note (DAX is actually fine here), I find the current > practice of clearing page dirty bits on error and reporting it just once > problematic. It keeps the system running but data is lost and possibly > without getting the error anywhere where it is useful. We get away with > this because it is a rare event but it seems like a problematic behavior. > But this is more for the discussion at LSF. I'm actually running into this in the last day or two because some MM folks at $WORK have been trying to push hard for GFP_NOFS removal in ext4 (at least when we are holding some mutex/semaphore like i_data_sem) because otherwise it's possible for the OOM killer to be unable to kill processes because they are holding on to locks that ext4 is holding. I've done some initial investigation, and while it's not that hard to remove GFP_NOFS from certain parts of the writepages() codepath (which is where we had been are running into problems), a really, REALLY big problem is if any_filesystem->writepages() returns ENOMEM, it causes silent data loss, because the pages are marked clean, and so data written using buffered writeback goes *poof*. I confirmed this by creating a test kernel with a simple patch such that if the ext4 file system is mounted with -o debug, there was a 1 in 16 chance that ext4_writepages will immediately return with ENOMEM (and printk the inode number, so I knew which inodes had gotten the ENOMEM treatment). The result was **NOT** pretty. What I think we should strongly consider is at the very least, special case ENOMEM being returned by writepages() during background writeback, and *not* mark the pages clean, and make sure the inode stays on the dirty inode list, so we can retry the write later. This is especially important since the process that issued the write may have gone away, so there might not even be a userspace process to complain to. By converting certain page allocations (most notably in ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to release the i_data_sem lock and return an error. This should allow allow the OOM killer to do its dirty deed, and hopefully we can retry the writepages() for that inode later. In the case of a data integrity sync being issued by fsync() or umount(), we could allow ENOMEM to get returned to userspace in that case as well. I'm not convinced all userspace code will handle an ENOMEM correctly or sanely, but at least they people will be (less likely) to blame file system developers. :-) The real problem that's going on here, by the way, is that people are trying to run programs in insanely tight containers, and then when the kernel locks up, they blame the mm developers. But if there is silent data corruption, they will blame the fs developers instead. And while kernel lockups are temporary (all you have to do is let the watchdog reboot the system :-), silent data corruption is *forever*. So what we really need to do is to allow the OOM killer do its work, and if job owners are unhappy that their processes are getting OOM killed, maybe they will be suitably incentivized to pay for more memory in their containers.... - Ted P.S. Michael Hocko, apologies for not getting back to you with your GFP_NOFS removal patches. But the possibility of fs malfunctions that might lead to silent data corruption is why I'm being very cautious, and I now have rather strong confirmation that this is not just an irrational concern on my part. (This is also performance review season, FAST conference was last week, and Usenix ATC program committee reviews are due this week. So apologies for any reply latency.) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id E31D82808AC for ; Thu, 9 Mar 2017 04:04:53 -0500 (EST) Received: by mail-wr0-f199.google.com with SMTP id u48so18416954wrc.0 for ; Thu, 09 Mar 2017 01:04:53 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id c63si3518132wmf.132.2017.03.09.01.04.52 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 09 Mar 2017 01:04:52 -0800 (PST) Date: Thu, 9 Mar 2017 10:04:49 +0100 From: Jan Kara Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170309090449.GD15874@quack2.suse.cz> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> <20170309025725.5wrszri462zipiix@thunk.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170309025725.5wrszri462zipiix@thunk.org> Sender: owner-linux-mm@kvack.org List-ID: To: Theodore Ts'o Cc: Jan Kara , Ross Zwisler , Jeff Layton , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Wed 08-03-17 21:57:25, Ted Tso wrote: > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > > On a more general note (DAX is actually fine here), I find the current > > practice of clearing page dirty bits on error and reporting it just once > > problematic. It keeps the system running but data is lost and possibly > > without getting the error anywhere where it is useful. We get away with > > this because it is a rare event but it seems like a problematic behavior. > > But this is more for the discussion at LSF. > > I'm actually running into this in the last day or two because some MM > folks at $WORK have been trying to push hard for GFP_NOFS removal in > ext4 (at least when we are holding some mutex/semaphore like > i_data_sem) because otherwise it's possible for the OOM killer to be > unable to kill processes because they are holding on to locks that > ext4 is holding. > > I've done some initial investigation, and while it's not that hard to > remove GFP_NOFS from certain parts of the writepages() codepath (which > is where we had been are running into problems), a really, REALLY big > problem is if any_filesystem->writepages() returns ENOMEM, it causes > silent data loss, because the pages are marked clean, and so data > written using buffered writeback goes *poof*. > > I confirmed this by creating a test kernel with a simple patch such > that if the ext4 file system is mounted with -o debug, there was a 1 > in 16 chance that ext4_writepages will immediately return with ENOMEM > (and printk the inode number, so I knew which inodes had gotten the > ENOMEM treatment). The result was **NOT** pretty. > > What I think we should strongly consider is at the very least, special > case ENOMEM being returned by writepages() during background > writeback, and *not* mark the pages clean, and make sure the inode > stays on the dirty inode list, so we can retry the write later. This > is especially important since the process that issued the write may > have gone away, so there might not even be a userspace process to > complain to. By converting certain page allocations (most notably in > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to > release the i_data_sem lock and return an error. This should allow > allow the OOM killer to do its dirty deed, and hopefully we can retry > the writepages() for that inode later. Yeah, so if we can hope the error is transient, keeping pages dirty and retrying the write is definitely better option. For start we can say that ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means there's no hope of getting data to disk and so we just discard them. It will be somewhat rough distinction but probably better than what we have now. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f199.google.com (mail-qk0-f199.google.com [209.85.220.199]) by kanga.kvack.org (Postfix) with ESMTP id A31576B0452 for ; Thu, 9 Mar 2017 05:47:55 -0500 (EST) Received: by mail-qk0-f199.google.com with SMTP id f191so111298458qka.7 for ; Thu, 09 Mar 2017 02:47:55 -0800 (PST) Received: from mail-qk0-f178.google.com (mail-qk0-f178.google.com. [209.85.220.178]) by mx.google.com with ESMTPS id r2si5371868qta.142.2017.03.09.02.47.54 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 Mar 2017 02:47:54 -0800 (PST) Received: by mail-qk0-f178.google.com with SMTP id p64so112559942qke.1 for ; Thu, 09 Mar 2017 02:47:54 -0800 (PST) Message-ID: <1489056471.2791.2.camel@redhat.com> Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business From: Jeff Layton Date: Thu, 09 Mar 2017 05:47:51 -0500 In-Reply-To: <20170309090449.GD15874@quack2.suse.cz> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> <20170309025725.5wrszri462zipiix@thunk.org> <20170309090449.GD15874@quack2.suse.cz> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara , Theodore Ts'o Cc: Ross Zwisler , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote: > On Wed 08-03-17 21:57:25, Ted Tso wrote: > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > > > On a more general note (DAX is actually fine here), I find the current > > > practice of clearing page dirty bits on error and reporting it just once > > > problematic. It keeps the system running but data is lost and possibly > > > without getting the error anywhere where it is useful. We get away with > > > this because it is a rare event but it seems like a problematic behavior. > > > But this is more for the discussion at LSF. > > > > I'm actually running into this in the last day or two because some MM > > folks at $WORK have been trying to push hard for GFP_NOFS removal in > > ext4 (at least when we are holding some mutex/semaphore like > > i_data_sem) because otherwise it's possible for the OOM killer to be > > unable to kill processes because they are holding on to locks that > > ext4 is holding. > > > > I've done some initial investigation, and while it's not that hard to > > remove GFP_NOFS from certain parts of the writepages() codepath (which > > is where we had been are running into problems), a really, REALLY big > > problem is if any_filesystem->writepages() returns ENOMEM, it causes > > silent data loss, because the pages are marked clean, and so data > > written using buffered writeback goes *poof*. > > > > I confirmed this by creating a test kernel with a simple patch such > > that if the ext4 file system is mounted with -o debug, there was a 1 > > in 16 chance that ext4_writepages will immediately return with ENOMEM > > (and printk the inode number, so I knew which inodes had gotten the > > ENOMEM treatment). The result was **NOT** pretty. > > > > What I think we should strongly consider is at the very least, special > > case ENOMEM being returned by writepages() during background > > writeback, and *not* mark the pages clean, and make sure the inode > > stays on the dirty inode list, so we can retry the write later. This > > is especially important since the process that issued the write may > > have gone away, so there might not even be a userspace process to > > complain to. By converting certain page allocations (most notably in > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to > > release the i_data_sem lock and return an error. This should allow > > allow the OOM killer to do its dirty deed, and hopefully we can retry > > the writepages() for that inode later. > > Yeah, so if we can hope the error is transient, keeping pages dirty and > retrying the write is definitely better option. For start we can say that > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means > there's no hope of getting data to disk and so we just discard them. It > will be somewhat rough distinction but probably better than what we have > now. > > Honza I'm not sure about ENOSPC there. That's a return code that is specifically expected to be returned by fsync. It seems like that ought not be considered a transient error? -- Jeff Layton -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id BBED02808AC for ; Thu, 9 Mar 2017 06:02:31 -0500 (EST) Received: by mail-wm0-f72.google.com with SMTP id d66so19699967wmi.2 for ; Thu, 09 Mar 2017 03:02:31 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 62si8296411wrj.58.2017.03.09.03.02.27 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 09 Mar 2017 03:02:28 -0800 (PST) Date: Thu, 9 Mar 2017 12:02:25 +0100 From: Jan Kara Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170309110225.GF15874@quack2.suse.cz> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> <20170309025725.5wrszri462zipiix@thunk.org> <20170309090449.GD15874@quack2.suse.cz> <1489056471.2791.2.camel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1489056471.2791.2.camel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jeff Layton Cc: Jan Kara , Theodore Ts'o , Ross Zwisler , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Thu 09-03-17 05:47:51, Jeff Layton wrote: > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote: > > On Wed 08-03-17 21:57:25, Ted Tso wrote: > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > > > > On a more general note (DAX is actually fine here), I find the current > > > > practice of clearing page dirty bits on error and reporting it just once > > > > problematic. It keeps the system running but data is lost and possibly > > > > without getting the error anywhere where it is useful. We get away with > > > > this because it is a rare event but it seems like a problematic behavior. > > > > But this is more for the discussion at LSF. > > > > > > I'm actually running into this in the last day or two because some MM > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in > > > ext4 (at least when we are holding some mutex/semaphore like > > > i_data_sem) because otherwise it's possible for the OOM killer to be > > > unable to kill processes because they are holding on to locks that > > > ext4 is holding. > > > > > > I've done some initial investigation, and while it's not that hard to > > > remove GFP_NOFS from certain parts of the writepages() codepath (which > > > is where we had been are running into problems), a really, REALLY big > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes > > > silent data loss, because the pages are marked clean, and so data > > > written using buffered writeback goes *poof*. > > > > > > I confirmed this by creating a test kernel with a simple patch such > > > that if the ext4 file system is mounted with -o debug, there was a 1 > > > in 16 chance that ext4_writepages will immediately return with ENOMEM > > > (and printk the inode number, so I knew which inodes had gotten the > > > ENOMEM treatment). The result was **NOT** pretty. > > > > > > What I think we should strongly consider is at the very least, special > > > case ENOMEM being returned by writepages() during background > > > writeback, and *not* mark the pages clean, and make sure the inode > > > stays on the dirty inode list, so we can retry the write later. This > > > is especially important since the process that issued the write may > > > have gone away, so there might not even be a userspace process to > > > complain to. By converting certain page allocations (most notably in > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to > > > release the i_data_sem lock and return an error. This should allow > > > allow the OOM killer to do its dirty deed, and hopefully we can retry > > > the writepages() for that inode later. > > > > Yeah, so if we can hope the error is transient, keeping pages dirty and > > retrying the write is definitely better option. For start we can say that > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means > > there's no hope of getting data to disk and so we just discard them. It > > will be somewhat rough distinction but probably better than what we have > > now. > > > > Honza > > I'm not sure about ENOSPC there. That's a return code that is > specifically expected to be returned by fsync. It seems like that ought > not be considered a transient error? Yeah, for start we should probably keep ENOSPC as is to prevent surprises. Long term, we may need to make at least some ENOSPC situations behave as transient to make thin provisioned storage not loose data in case admin does not supply additional space fast enough (i.e., before ENOSPC is actually hit). EIO is actually in a similar bucket although probably more on the "hard failure" side - I can imagine there can by types of storage and situations where the loss of connectivity to the storage is only transient. But for start I would not bother with this. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f200.google.com (mail-qk0-f200.google.com [209.85.220.200]) by kanga.kvack.org (Postfix) with ESMTP id 9D19A2808C6 for ; Thu, 9 Mar 2017 07:43:16 -0500 (EST) Received: by mail-qk0-f200.google.com with SMTP id a189so135563025qkc.4 for ; Thu, 09 Mar 2017 04:43:16 -0800 (PST) Received: from mail-qk0-f171.google.com (mail-qk0-f171.google.com. [209.85.220.171]) by mx.google.com with ESMTPS id z54si5572810qtz.34.2017.03.09.04.43.15 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 Mar 2017 04:43:15 -0800 (PST) Received: by mail-qk0-f171.google.com with SMTP id p64so116587789qke.1 for ; Thu, 09 Mar 2017 04:43:15 -0800 (PST) Message-ID: <1489063392.2791.8.camel@redhat.com> Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business From: Jeff Layton Date: Thu, 09 Mar 2017 07:43:12 -0500 In-Reply-To: <20170309110225.GF15874@quack2.suse.cz> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> <20170309025725.5wrszri462zipiix@thunk.org> <20170309090449.GD15874@quack2.suse.cz> <1489056471.2791.2.camel@redhat.com> <20170309110225.GF15874@quack2.suse.cz> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara Cc: Theodore Ts'o , Ross Zwisler , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote: > On Thu 09-03-17 05:47:51, Jeff Layton wrote: > > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote: > > > On Wed 08-03-17 21:57:25, Ted Tso wrote: > > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > > > > > On a more general note (DAX is actually fine here), I find the current > > > > > practice of clearing page dirty bits on error and reporting it just once > > > > > problematic. It keeps the system running but data is lost and possibly > > > > > without getting the error anywhere where it is useful. We get away with > > > > > this because it is a rare event but it seems like a problematic behavior. > > > > > But this is more for the discussion at LSF. > > > > > > > > I'm actually running into this in the last day or two because some MM > > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in > > > > ext4 (at least when we are holding some mutex/semaphore like > > > > i_data_sem) because otherwise it's possible for the OOM killer to be > > > > unable to kill processes because they are holding on to locks that > > > > ext4 is holding. > > > > > > > > I've done some initial investigation, and while it's not that hard to > > > > remove GFP_NOFS from certain parts of the writepages() codepath (which > > > > is where we had been are running into problems), a really, REALLY big > > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes > > > > silent data loss, because the pages are marked clean, and so data > > > > written using buffered writeback goes *poof*. > > > > > > > > I confirmed this by creating a test kernel with a simple patch such > > > > that if the ext4 file system is mounted with -o debug, there was a 1 > > > > in 16 chance that ext4_writepages will immediately return with ENOMEM > > > > (and printk the inode number, so I knew which inodes had gotten the > > > > ENOMEM treatment). The result was **NOT** pretty. > > > > > > > > What I think we should strongly consider is at the very least, special > > > > case ENOMEM being returned by writepages() during background > > > > writeback, and *not* mark the pages clean, and make sure the inode > > > > stays on the dirty inode list, so we can retry the write later. This > > > > is especially important since the process that issued the write may > > > > have gone away, so there might not even be a userspace process to > > > > complain to. By converting certain page allocations (most notably in > > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to > > > > release the i_data_sem lock and return an error. This should allow > > > > allow the OOM killer to do its dirty deed, and hopefully we can retry > > > > the writepages() for that inode later. > > > > > > Yeah, so if we can hope the error is transient, keeping pages dirty and > > > retrying the write is definitely better option. For start we can say that > > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means > > > there's no hope of getting data to disk and so we just discard them. It > > > will be somewhat rough distinction but probably better than what we have > > > now. > > > > > > Honza > > > > I'm not sure about ENOSPC there. That's a return code that is > > specifically expected to be returned by fsync. It seems like that ought > > not be considered a transient error? > > Yeah, for start we should probably keep ENOSPC as is to prevent surprises. > Long term, we may need to make at least some ENOSPC situations behave as > transient to make thin provisioned storage not loose data in case admin > does not supply additional space fast enough (i.e., before ENOSPC is > actually hit). > Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a transient error? Just have it hang until we get enough space when that tunable is enabled? > EIO is actually in a similar bucket although probably more on the "hard > failure" side - I can imagine there can by types of storage and situations > where the loss of connectivity to the storage is only transient. But for > start I would not bother with this. > > Honza I don't see what we can reasonably do with -EIO other than return a hard error. If we want to deal with loss of connectivity to storage as a transient failure, I think that we'd need to ensure that the lower layers return more distinct error codes in those cases (ENODEV or ENXIO maybe? Or declare a new kernel-internal code -- EDEVGONE?). In any case, I think that the basic idea of marking certain writepage/writepages/launder_page errors as transient might be a reasonable approach to handling this sanely. The problem with all of this though is that we have a pile of existing code that will likely need to be reworked for the new error handling. I expect that we'll have to walk all of the writepage/writepages/launder_page implementations and fix them up one by one once we sort out the rules for this. -- Jeff Layton -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f200.google.com (mail-qk0-f200.google.com [209.85.220.200]) by kanga.kvack.org (Postfix) with ESMTP id 83AEC6B0418 for ; Thu, 9 Mar 2017 08:22:17 -0500 (EST) Received: by mail-qk0-f200.google.com with SMTP id a189so136978199qkc.4 for ; Thu, 09 Mar 2017 05:22:17 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id z53si5625492qta.146.2017.03.09.05.22.16 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 Mar 2017 05:22:16 -0800 (PST) Date: Thu, 9 Mar 2017 08:22:14 -0500 From: Brian Foster Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170309132214.GB16713@bfoster.bfoster> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> <20170309025725.5wrszri462zipiix@thunk.org> <20170309090449.GD15874@quack2.suse.cz> <1489056471.2791.2.camel@redhat.com> <20170309110225.GF15874@quack2.suse.cz> <1489063392.2791.8.camel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1489063392.2791.8.camel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jeff Layton Cc: Jan Kara , Theodore Ts'o , Ross Zwisler , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote: > On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote: > > On Thu 09-03-17 05:47:51, Jeff Layton wrote: > > > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote: > > > > On Wed 08-03-17 21:57:25, Ted Tso wrote: > > > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > > > > > > On a more general note (DAX is actually fine here), I find the current > > > > > > practice of clearing page dirty bits on error and reporting it just once > > > > > > problematic. It keeps the system running but data is lost and possibly > > > > > > without getting the error anywhere where it is useful. We get away with > > > > > > this because it is a rare event but it seems like a problematic behavior. > > > > > > But this is more for the discussion at LSF. > > > > > > > > > > I'm actually running into this in the last day or two because some MM > > > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in > > > > > ext4 (at least when we are holding some mutex/semaphore like > > > > > i_data_sem) because otherwise it's possible for the OOM killer to be > > > > > unable to kill processes because they are holding on to locks that > > > > > ext4 is holding. > > > > > > > > > > I've done some initial investigation, and while it's not that hard to > > > > > remove GFP_NOFS from certain parts of the writepages() codepath (which > > > > > is where we had been are running into problems), a really, REALLY big > > > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes > > > > > silent data loss, because the pages are marked clean, and so data > > > > > written using buffered writeback goes *poof*. > > > > > > > > > > I confirmed this by creating a test kernel with a simple patch such > > > > > that if the ext4 file system is mounted with -o debug, there was a 1 > > > > > in 16 chance that ext4_writepages will immediately return with ENOMEM > > > > > (and printk the inode number, so I knew which inodes had gotten the > > > > > ENOMEM treatment). The result was **NOT** pretty. > > > > > > > > > > What I think we should strongly consider is at the very least, special > > > > > case ENOMEM being returned by writepages() during background > > > > > writeback, and *not* mark the pages clean, and make sure the inode > > > > > stays on the dirty inode list, so we can retry the write later. This > > > > > is especially important since the process that issued the write may > > > > > have gone away, so there might not even be a userspace process to > > > > > complain to. By converting certain page allocations (most notably in > > > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to > > > > > release the i_data_sem lock and return an error. This should allow > > > > > allow the OOM killer to do its dirty deed, and hopefully we can retry > > > > > the writepages() for that inode later. > > > > > > > > Yeah, so if we can hope the error is transient, keeping pages dirty and > > > > retrying the write is definitely better option. For start we can say that > > > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means > > > > there's no hope of getting data to disk and so we just discard them. It > > > > will be somewhat rough distinction but probably better than what we have > > > > now. > > > > > > > > Honza > > > > > > I'm not sure about ENOSPC there. That's a return code that is > > > specifically expected to be returned by fsync. It seems like that ought > > > not be considered a transient error? > > > > Yeah, for start we should probably keep ENOSPC as is to prevent surprises. > > Long term, we may need to make at least some ENOSPC situations behave as > > transient to make thin provisioned storage not loose data in case admin > > does not supply additional space fast enough (i.e., before ENOSPC is > > actually hit). > > > > Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a > transient error? Just have it hang until we get enough space when that > tunable is enabled? > Just FYI, XFS has a similar error configuration mechanism that we use for essentially this purpose when dealing with metadata buffer I/O errors. The original motivation was to help us deal with the varying requirements for thin provisioning. I.e., whether a particular error is permanent or transient depends on the admin's preference and affects whether the filesystem continuously retries failed I/Os in anticipation of future success or gives up quickly and shuts down (or something in between). See roughly commits 192852be8b5 through e6b3bb7896 for the initial code, /sys/fs/xfs//error on an XFS filesystem for the user interface, and the "Error handling" section of Documentation/filesystems/xfs.txt for information on how the interface works. Brian > > EIO is actually in a similar bucket although probably more on the "hard > > failure" side - I can imagine there can by types of storage and situations > > where the loss of connectivity to the storage is only transient. But for > > start I would not bother with this. > > > > Honza > > I don't see what we can reasonably do with -EIO other than return a hard > error. If we want to deal with loss of connectivity to storage as a > transient failure, I think that we'd need to ensure that the lower > layers return more distinct error codes in those cases (ENODEV or ENXIO > maybe? Or declare a new kernel-internal code -- EDEVGONE?). > > In any case, I think that the basic idea of marking certain > writepage/writepages/launder_page errors as transient might be a > reasonable approach to handling this sanely. > > The problem with all of this though is that we have a pile of existing > code that will likely need to be reworked for the new error handling. I > expect that we'll have to walk all of the > writepage/writepages/launder_page implementations and fix them up one by > one once we sort out the rules for this. > > -- > Jeff Layton -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yb0-f199.google.com (mail-yb0-f199.google.com [209.85.213.199]) by kanga.kvack.org (Postfix) with ESMTP id 2E93B2808C5 for ; Thu, 9 Mar 2017 09:21:46 -0500 (EST) Received: by mail-yb0-f199.google.com with SMTP id d88so19466757ybi.3 for ; Thu, 09 Mar 2017 06:21:46 -0800 (PST) Received: from imap.thunk.org (imap.thunk.org. [2600:3c02::f03c:91ff:fe96:be03]) by mx.google.com with ESMTPS id e184si862557ywh.383.2017.03.09.06.21.45 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 Mar 2017 06:21:45 -0800 (PST) Date: Thu, 9 Mar 2017 09:21:37 -0500 From: Theodore Ts'o Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Message-ID: <20170309142137.lz7cba4was3jfyyt@thunk.org> References: <20170305133535.6516-1-jlayton@redhat.com> <1488724854.2925.6.camel@redhat.com> <20170306230801.GA28111@linux.intel.com> <20170307102622.GB2578@quack2.suse.cz> <20170309025725.5wrszri462zipiix@thunk.org> <20170309090449.GD15874@quack2.suse.cz> <1489056471.2791.2.camel@redhat.com> <20170309110225.GF15874@quack2.suse.cz> <1489063392.2791.8.camel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1489063392.2791.8.camel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jeff Layton Cc: Jan Kara , Ross Zwisler , viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org, NeilBrown On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote: > > Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a > transient error? Just have it hang until we get enough space when that > tunable is enabled? Or maybe we need a new kernel-internal errno (ala ERESTARSYS) which means it's a "soft ENOSPC"? It would get translated to ENOSPC if it gets propagated to userspace, but that way for devices like dm-thin or other storage array with thin volumes, it could send back a soft ENOSPC, while for file systems where "ENOSPC means ENOSPC", we can treat those as a hard ENOSPC. - Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yw0-f199.google.com (mail-yw0-f199.google.com [209.85.161.199]) by kanga.kvack.org (Postfix) with ESMTP id 7AE6B6B0389 for ; Wed, 15 Mar 2017 01:07:54 -0400 (EDT) Received: by mail-yw0-f199.google.com with SMTP id k13so24137691ywk.2 for ; Tue, 14 Mar 2017 22:07:54 -0700 (PDT) Received: from imap.thunk.org (imap.thunk.org. [2600:3c02::f03c:91ff:fe96:be03]) by mx.google.com with ESMTPS id b17si258215ybj.323.2017.03.14.22.07.53 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 14 Mar 2017 22:07:53 -0700 (PDT) From: Theodore Ts'o Subject: [RFC PATCH] mm: retry writepages() on ENOMEM when doing an data integrity writeback Date: Wed, 15 Mar 2017 01:07:43 -0400 Message-Id: <20170315050743.5539-1-tytso@mit.edu> In-Reply-To: <20170309090449.GD15874@quack2.suse.cz> References: <20170309090449.GD15874@quack2.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-fsdevel@vger.kernel.org Cc: Theodore Ts'o , Jan Kara , Michal Hocko , linux-mm@kvack.org Currently, file system's writepages() function must not fail with an ENOMEM, since if they do, it's possible for buffered data to be lost. This is because on a data integrity writeback writepages() gets called but once, and if it returns ENOMEM and you're lucky the error will get reflected back to the userspace process calling fsync() --- at which point the application may or may not be properly checking error codes. If you aren't lucky, the user is unmounting the file system, and the dirty pages will simply be lost. For this reason, file system code generally will use GFP_NOFS, and in some cases, will retry the allocation in a loop, on the theory that "kernel livelocks are temporary; data loss is forever". Unfortunately, this can indeed cause livelocks, since inside the writepages() call, the file system is holding various mutexes, and these mutexes may prevent the OOM killer from killing its targetted victim if it is also holding on to those mutexes. A better solution would be to allow writepages() to call the memory allocator with flags that give greater latitude to the allocator to fail, and then release its locks and return ENOMEM, and in the case of background writeback, the writes can be retried at a later time. In the case of data-integrity writeback retry after waiting a brief amount of time. Signed-off-by: Theodore Ts'o --- As we had discussed in an e-mail thread last week, I'm interested in allowing ext4_writepages() to return ENOMEM without causing dirty pages from buffered writes getting list. It looks like doing so should be fairly straightforward. What do folks think? mm/page-writeback.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 290e8b7d3181..8666d3f3c57a 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2352,10 +2352,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) if (wbc->nr_to_write <= 0) return 0; - if (mapping->a_ops->writepages) - ret = mapping->a_ops->writepages(mapping, wbc); - else - ret = generic_writepages(mapping, wbc); + while (1) { + if (mapping->a_ops->writepages) + ret = mapping->a_ops->writepages(mapping, wbc); + else + ret = generic_writepages(mapping, wbc); + if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL)) + break; + cond_resched(); + congestion_wait(BLK_RW_ASYNC, HZ/50); + } return ret; } -- 2.11.0.rc0.7.gbe5a750 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 3DE4B6B0389 for ; Wed, 15 Mar 2017 07:59:37 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id y90so2601313wrb.1 for ; Wed, 15 Mar 2017 04:59:37 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id v2si2348147wrd.12.2017.03.15.04.59.35 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 15 Mar 2017 04:59:35 -0700 (PDT) Date: Wed, 15 Mar 2017 12:59:33 +0100 From: Jan Kara Subject: Re: [RFC PATCH] mm: retry writepages() on ENOMEM when doing an data integrity writeback Message-ID: <20170315115933.GF12989@quack2.suse.cz> References: <20170309090449.GD15874@quack2.suse.cz> <20170315050743.5539-1-tytso@mit.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170315050743.5539-1-tytso@mit.edu> Sender: owner-linux-mm@kvack.org List-ID: To: Theodore Ts'o Cc: linux-fsdevel@vger.kernel.org, Jan Kara , Michal Hocko , linux-mm@kvack.org On Wed 15-03-17 01:07:43, Ted Tso wrote: > Currently, file system's writepages() function must not fail with an > ENOMEM, since if they do, it's possible for buffered data to be lost. > This is because on a data integrity writeback writepages() gets called > but once, and if it returns ENOMEM and you're lucky the error will get > reflected back to the userspace process calling fsync() --- at which > point the application may or may not be properly checking error codes. > If you aren't lucky, the user is unmounting the file system, and the > dirty pages will simply be lost. > > For this reason, file system code generally will use GFP_NOFS, and in > some cases, will retry the allocation in a loop, on the theory that > "kernel livelocks are temporary; data loss is forever". > Unfortunately, this can indeed cause livelocks, since inside the > writepages() call, the file system is holding various mutexes, and > these mutexes may prevent the OOM killer from killing its targetted > victim if it is also holding on to those mutexes. > > A better solution would be to allow writepages() to call the memory > allocator with flags that give greater latitude to the allocator to > fail, and then release its locks and return ENOMEM, and in the case of > background writeback, the writes can be retried at a later time. In > the case of data-integrity writeback retry after waiting a brief > amount of time. > > Signed-off-by: Theodore Ts'o > --- > > As we had discussed in an e-mail thread last week, I'm interested in > allowing ext4_writepages() to return ENOMEM without causing dirty > pages from buffered writes getting list. It looks like doing so > should be fairly straightforward. What do folks think? Makes sense to me. One comment below: > + while (1) { > + if (mapping->a_ops->writepages) > + ret = mapping->a_ops->writepages(mapping, wbc); > + else > + ret = generic_writepages(mapping, wbc); > + if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL)) -ENOMEM I guess... > + break; > + cond_resched(); > + congestion_wait(BLK_RW_ASYNC, HZ/50); > + } > return ret; > } Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f70.google.com (mail-wm0-f70.google.com [74.125.82.70]) by kanga.kvack.org (Postfix) with ESMTP id 036276B0389 for ; Wed, 15 Mar 2017 09:03:08 -0400 (EDT) Received: by mail-wm0-f70.google.com with SMTP id v190so4842045wme.0 for ; Wed, 15 Mar 2017 06:03:07 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id u16si2525615wrc.200.2017.03.15.06.03.06 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 15 Mar 2017 06:03:06 -0700 (PDT) Date: Wed, 15 Mar 2017 14:03:05 +0100 From: Michal Hocko Subject: Re: [RFC PATCH] mm: retry writepages() on ENOMEM when doing an data integrity writeback Message-ID: <20170315130305.GJ32620@dhcp22.suse.cz> References: <20170309090449.GD15874@quack2.suse.cz> <20170315050743.5539-1-tytso@mit.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170315050743.5539-1-tytso@mit.edu> Sender: owner-linux-mm@kvack.org List-ID: To: Theodore Ts'o Cc: linux-fsdevel@vger.kernel.org, Jan Kara , linux-mm@kvack.org On Wed 15-03-17 01:07:43, Theodore Ts'o wrote: > Currently, file system's writepages() function must not fail with an > ENOMEM, since if they do, it's possible for buffered data to be lost. > This is because on a data integrity writeback writepages() gets called > but once, and if it returns ENOMEM and you're lucky the error will get > reflected back to the userspace process calling fsync() --- at which > point the application may or may not be properly checking error codes. > If you aren't lucky, the user is unmounting the file system, and the > dirty pages will simply be lost. > > For this reason, file system code generally will use GFP_NOFS, and in > some cases, will retry the allocation in a loop, on the theory that > "kernel livelocks are temporary; data loss is forever". > Unfortunately, this can indeed cause livelocks, since inside the > writepages() call, the file system is holding various mutexes, and > these mutexes may prevent the OOM killer from killing its targetted > victim if it is also holding on to those mutexes. The victim might be looping inside do_writepages now instead (especially when the memory reserves are depleted), though. On the other hand the recent OOM killer changes do not rely on the oom victim exiting anymore. We try to reap as much memory from its address space as possible which alone should help us to move on. Even if that is not sufficient we will move on to another victim. So unless everything is in this path and all the memory is sitting unreachable from the reapable address space we should be safe. > A better solution would be to allow writepages() to call the memory > allocator with flags that give greater latitude to the allocator to > fail, and then release its locks and return ENOMEM, and in the case of > background writeback, the writes can be retried at a later time. In > the case of data-integrity writeback retry after waiting a brief > amount of time. yes that sounds reasonable to me. Btw. I was proposing __GFP_RETRY_MAYFAIL recently [1] which sounds like a good fit here. [1] http://lkml.kernel.org/r/20170307154843.32516-1-mhocko@kernel.org > Signed-off-by: Theodore Ts'o The patch looks good to me be I am not familiar with all the callers to be fully qualified to give my Acked-by > --- > > As we had discussed in an e-mail thread last week, I'm interested in > allowing ext4_writepages() to return ENOMEM without causing dirty > pages from buffered writes getting list. It looks like doing so > should be fairly straightforward. What do folks think? > > mm/page-writeback.c | 14 ++++++++++---- > 1 file changed, 10 insertions(+), 4 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 290e8b7d3181..8666d3f3c57a 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2352,10 +2352,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) > > if (wbc->nr_to_write <= 0) > return 0; > - if (mapping->a_ops->writepages) > - ret = mapping->a_ops->writepages(mapping, wbc); > - else > - ret = generic_writepages(mapping, wbc); > + while (1) { > + if (mapping->a_ops->writepages) > + ret = mapping->a_ops->writepages(mapping, wbc); > + else > + ret = generic_writepages(mapping, wbc); > + if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL)) > + break; > + cond_resched(); > + congestion_wait(BLK_RW_ASYNC, HZ/50); > + } > return ret; > } > > -- > 2.11.0.rc0.7.gbe5a750 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yw0-f200.google.com (mail-yw0-f200.google.com [209.85.161.200]) by kanga.kvack.org (Postfix) with ESMTP id 5571A6B0389 for ; Wed, 15 Mar 2017 10:09:30 -0400 (EDT) Received: by mail-yw0-f200.google.com with SMTP id u138so55507836ywg.0 for ; Wed, 15 Mar 2017 07:09:30 -0700 (PDT) Received: from imap.thunk.org (imap.thunk.org. [2600:3c02::f03c:91ff:fe96:be03]) by mx.google.com with ESMTPS id t6si654333ybf.65.2017.03.15.07.09.28 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 15 Mar 2017 07:09:29 -0700 (PDT) Date: Wed, 15 Mar 2017 10:09:27 -0400 From: Theodore Ts'o Subject: Re: [RFC PATCH] mm: retry writepages() on ENOMEM when doing an data integrity writeback Message-ID: <20170315140927.g5ylzcbxrvjqune3@thunk.org> References: <20170309090449.GD15874@quack2.suse.cz> <20170315050743.5539-1-tytso@mit.edu> <20170315115933.GF12989@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170315115933.GF12989@quack2.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara Cc: linux-fsdevel@vger.kernel.org, Michal Hocko , linux-mm@kvack.org On Wed, Mar 15, 2017 at 12:59:33PM +0100, Jan Kara wrote: > > + while (1) { > > + if (mapping->a_ops->writepages) > > + ret = mapping->a_ops->writepages(mapping, wbc); > > + else > > + ret = generic_writepages(mapping, wbc); > > + if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL)) > > -ENOMEM I guess... Oops. Thanks for noticing! Unless anyone has any objections I plan to carry this in the ext4 tree. - Ted From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 48C856B038A for ; Thu, 16 Mar 2017 06:18:34 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id e129so78310546pfh.1 for ; Thu, 16 Mar 2017 03:18:34 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id 20si3427938pfo.372.2017.03.16.03.18.32 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 16 Mar 2017 03:18:33 -0700 (PDT) Subject: Re: [RFC PATCH] mm: retry writepages() on ENOMEM when doing an data integrity writeback References: <20170309090449.GD15874@quack2.suse.cz> <20170315050743.5539-1-tytso@mit.edu> <20170315130305.GJ32620@dhcp22.suse.cz> From: Tetsuo Handa Message-ID: Date: Thu, 16 Mar 2017 19:18:19 +0900 MIME-Version: 1.0 In-Reply-To: <20170315130305.GJ32620@dhcp22.suse.cz> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko , Theodore Ts'o Cc: linux-fsdevel@vger.kernel.org, Jan Kara , linux-mm@kvack.org On 2017/03/15 22:03, Michal Hocko wrote: > On Wed 15-03-17 01:07:43, Theodore Ts'o wrote: >> Unfortunately, this can indeed cause livelocks, since inside the >> writepages() call, the file system is holding various mutexes, and >> these mutexes may prevent the OOM killer from killing its targetted >> victim if it is also holding on to those mutexes. > > The victim might be looping inside do_writepages now instead (especially > when the memory reserves are depleted), though. On the other hand the > recent OOM killer changes do not rely on the oom victim exiting anymore. True only if CONFIG_MMU=y. > We try to reap as much memory from its address space as possible > which alone should help us to move on. Even if that is not sufficient we > will move on to another victim. So unless everything is in this path and > all the memory is sitting unreachable from the reapable address space we > should be safe. If the caller is doing sync() or umount() syscall, isn't it reasonable to bail out if fatal_signal_pending() is true because it is caller's responsibility to check whether sync() or umount() succeeded? Though, I don't know whether writepages() can preserve data for later retry by other callers. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org