From: Jeff Layton <jlayton@redhat.com>
To: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org,
	NeilBrown <neilb@suse.com>
Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
Date: Thu, 09 Mar 2017 07:43:12 -0500	[thread overview]
Message-ID: <1489063392.2791.8.camel@redhat.com> (raw)
In-Reply-To: <20170309110225.GF15874@quack2.suse.cz>

On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > > On a more general note (DAX is actually fine here), I find the current
> > > > > practice of clearing page dirty bits on error and reporting it just once
> > > > > problematic. It keeps the system running but data is lost and possibly
> > > > > without getting the error anywhere where it is useful. We get away with
> > > > > this because it is a rare event but it seems like a problematic behavior.
> > > > > But this is more for the discussion at LSF.
> > > > 
> > > > I'm actually running into this in the last day or two because some MM
> > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > > ext4 (at least when we are holding some mutex/semaphore like
> > > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > > unable to kill processes because they are holding on to locks that
> > > > ext4 is holding.
> > > > 
> > > > I've done some initial investigation, and while it's not that hard to
> > > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > > is where we had been running into problems), a really, REALLY big
> > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > > > silent data loss, because the pages are marked clean, and so data
> > > > written using buffered writeback goes *poof*.
> > > > 
> > > > I confirmed this by creating a test kernel with a simple patch such
> > > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > > (and printk the inode number, so I knew which inodes had gotten the
> > > > ENOMEM treatment).  The result was **NOT** pretty.
> > > > 
> > > > What I think we should strongly consider is, at the very least, to
> > > > special-case ENOMEM being returned by writepages() during background
> > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > stays on the dirty inode list, so we can retry the write later.  This
> > > > is especially important since the process that issued the write may
> > > > have gone away, so there might not even be a userspace process to
> > > > complain to.  By converting certain page allocations (most notably in
> > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > > > release the i_data_sem lock and return an error.  This should allow
> > > > the OOM killer to do its dirty deed, and hopefully we can retry
> > > > the writepages() for that inode later.
> > > 
> > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > retrying the write is definitely a better option. For a start we can say that
> > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > > there's no hope of getting data to disk and so we just discard them. It
> > > will be somewhat rough distinction but probably better than what we have
> > > now.
> > > 
> > > 								Honza
> > 
> > I'm not sure about ENOSPC there. That's a return code that is
> > specifically expected to be returned by fsync. It seems like that ought
> > not be considered a transient error?
> 
> Yeah, for a start we should probably keep ENOSPC as is to prevent surprises.
> Long term, we may need to make at least some ENOSPC situations behave as
> transient to make thin-provisioned storage not lose data in case the admin
> does not supply additional space fast enough (i.e., before ENOSPC is
> actually hit).
> 

Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
transient error? Just have it hang until we get enough space when that
tunable is enabled?
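
Something along these lines is what I have in mind -- strictly a
sketch, and the flag name/bit below is made up for illustration, not an
existing kernel interface:

#include <linux/fs.h>

/* Hypothetical: per-sb knob making ENOSPC a transient writeback error */
#define SB_ENOSPC_TRANSIENT	(1 << 30)	/* made-up bit, illustration only */

static inline bool sb_enospc_is_transient(struct super_block *sb)
{
	return sb->s_flags & SB_ENOSPC_TRANSIENT;
}

When that bit is set, writeback would keep the pages dirty and keep
retrying instead of reporting a hard error, on the assumption that the
admin will eventually add space.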

> EIO is actually in a similar bucket although probably more on the "hard
> failure" side - I can imagine there can by types of storage and situations
> where the loss of connectivity to the storage is only transient. But for
> start I would not bother with this.
> 
> 								Honza

I don't see what we can reasonably do with -EIO other than return a hard
error. If we want to deal with loss of connectivity to storage as a
transient failure, I think that we'd need to ensure that the lower
layers return more distinct error codes in those cases (ENODEV or ENXIO
maybe? Or declare a new kernel-internal code -- EDEVGONE?).

In any case, I think that the basic idea of marking certain
writepage/writepages/launder_page errors as transient might be a
reasonable approach to handling this sanely.
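
To make that a little more concrete, here's a minimal sketch of the
kind of classification I mean (the helper name and the exact error list
are placeholders, roughly following Jan's list with ENOSPC left out per
the discussion above):

/*
 * Sketch only: decide whether a writeback error is transient (keep the
 * pages dirty and let writeback retry later) or a hard failure (record
 * it with mapping_set_error() and give up on the dirty data).
 */
static inline bool wb_err_is_transient(int err)
{
	switch (err) {
	case -ENOMEM:
	case -EINTR:
	case -EAGAIN:
		return true;
	default:
		return false;
	}
}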

The problem with all of this, though, is that we have a pile of existing
code that will likely need to be reworked for the new error handling. I
expect that we'll have to walk all of the
writepage/writepages/launder_page implementations and fix them up one by
one once we sort out the rules for this.
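
As a rough illustration of the kind of rework I mean, the error path in
a typical ->writepage() might end up looking something like this.
example_submit_io() is just a stand-in for the fs-specific I/O
submission, and this assumes the wb_err_is_transient() helper sketched
above:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	int err;

	set_page_writeback(page);
	err = example_submit_io(page, wbc);	/* hypothetical, fs-specific */
	if (err) {
		end_page_writeback(page);
		if (wb_err_is_transient(err)) {
			/* Keep the data dirty so writeback retries later. */
			redirty_page_for_writepage(wbc, page);
			err = 0;
		} else {
			/* Hard failure: record it so fsync() can report it. */
			mapping_set_error(page->mapping, err);
		}
	}
	unlock_page(page);
	return err;
}

On a successful submission the I/O completion would end writeback on
the page as usual; the interesting part is just that the transient case
redirties the page instead of letting it go clean.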

-- 
Jeff Layton <jlayton@redhat.com>

Thread overview: 50+ messages

2017-03-05 13:35 [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
2017-03-05 13:35 ` [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback Jeff Layton
2017-03-07 13:46   ` Ryusuke Konishi
2017-03-05 13:35 ` [PATCH 2/3] mm: don't TestClearPageError in __filemap_fdatawait_range Jeff Layton
2017-03-05 13:35 ` [PATCH 3/3] mm: set mapping error when launder_pages fails Jeff Layton
2017-03-05 14:40 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
2017-03-06 23:08   ` Ross Zwisler
2017-03-07 10:26     ` Jan Kara
2017-03-07 14:03       ` Jeff Layton
2017-03-07 15:59       ` Ross Zwisler
2017-03-07 16:17         ` Jan Kara
2017-03-09  2:57       ` Theodore Ts'o
2017-03-09  9:04         ` Jan Kara
2017-03-09 10:47           ` Jeff Layton
2017-03-09 11:02             ` Jan Kara
2017-03-09 12:43               ` Jeff Layton [this message]
2017-03-09 13:22                 ` Brian Foster
2017-03-09 14:21                 ` Theodore Ts'o
2017-03-15  5:07           ` [RFC PATCH] mm: retry writepages() on ENOMEM when doing an data integrity writeback Theodore Ts'o
2017-03-15 11:59             ` Jan Kara
2017-03-15 14:09               ` Theodore Ts'o
2017-03-15 13:03             ` Michal Hocko
2017-03-16 10:18               ` Tetsuo Handa
2017-03-06  3:06 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business NeilBrown
2017-03-06 11:43   ` Jeff Layton
