linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [LSF/MM TOPIC] do we really need PG_error at all?
@ 2017-02-26 14:42 Jeff Layton
  2017-02-26 17:10 ` James Bottomley
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2017-02-26 14:42 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, LKML; +Cc: lsf-pc, Neil Brown

Proposing this as a LSF/MM TOPIC, but it may turn out to be me just not
understanding the semantics here.

As I was looking into -ENOSPC handling in cephfs, I noticed that
PG_error is only ever tested in one place [1] __filemap_fdatawait_range,
which does this:

	if (TestClearPageError(page))
		ret = -EIO;

This error code will override any AS_* error that was set in the
mapping. Which makes me wonder...why don't we just set this error in the
mapping and not bother with a per-page flag? Could we potentially free
up a page flag by eliminating this?

The main argument I could see for keeping it is that removing it might
subtly change the behavior of sync_file_range if you have tasks syncing
different ranges in a file concurrently. I'm not sure if that would
break any guarantees though.

Even if we do need it, I think we might need some cleanup here anyway. A
lot of readpage operations end up setting that flag when they hit an
error. Isn't it wrong to return an error on fsync, just because we had a
read error somewhere in the file in a range that was never dirtied?

--
[1]: there is another place in f2fs, but it's more or less equivalent to
the call site in __filemap_fdatawait_range.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-26 14:42 [LSF/MM TOPIC] do we really need PG_error at all? Jeff Layton
@ 2017-02-26 17:10 ` James Bottomley
  2017-02-26 21:03   ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2017-02-26 17:10 UTC (permalink / raw)
  To: Jeff Layton, linux-mm, linux-fsdevel, LKML
  Cc: lsf-pc, Neil Brown, linux-scsi, linux-block

[added linux-scsi and linux-block because this is part of our error
handling as well]
On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
> Proposing this as a LSF/MM TOPIC, but it may turn out to be me just 
> not understanding the semantics here.
> 
> As I was looking into -ENOSPC handling in cephfs, I noticed that
> PG_error is only ever tested in one place [1] 
> __filemap_fdatawait_range, which does this:
> 
> 	if (TestClearPageError(page))
> 		ret = -EIO;
> 
> This error code will override any AS_* error that was set in the
> mapping. Which makes me wonder...why don't we just set this error in 
> the mapping and not bother with a per-page flag? Could we potentially
> free up a page flag by eliminating this?

Note that currently the AS_* codes are only set for write errors not
for reads and we have no mapping error handling at all for swap pages,
but I'm sure this is fixable.

>From the I/O layer point of view we take great pains to try to pinpoint
the error exactly to the sector.  We reflect this up by setting the
PG_error flag on the page where the error occurred.  If we only set the
error on the mapping, we lose that granularity, because the mapping is
mostly at the file level (or VMA level for anon pages).

So I think the question for filesystem people from us would be do you
care about this accuracy?  If it's OK just to know an error occurred
somewhere in this file, then perhaps we don't need it.

James

> The main argument I could see for keeping it is that removing it 
> might subtly change the behavior of sync_file_range if you have tasks
> syncing different ranges in a file concurrently. I'm not sure if that 
> would break any guarantees though.
> 
> Even if we do need it, I think we might need some cleanup here 
> anyway. A lot of readpage operations end up setting that flag when 
> they hit an error. Isn't it wrong to return an error on fsync, just 
> because we had a read error somewhere in the file in a range that was
> never dirtied?
> 
> --
> [1]: there is another place in f2fs, but it's more or less equivalent 
> to the call site in __filemap_fdatawait_range.
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-26 17:10 ` James Bottomley
@ 2017-02-26 21:03   ` NeilBrown
  2017-02-26 22:43     ` Jeff Layton
  2017-02-26 23:30     ` James Bottomley
  0 siblings, 2 replies; 15+ messages in thread
From: NeilBrown @ 2017-02-26 21:03 UTC (permalink / raw)
  To: James Bottomley, Jeff Layton, linux-mm, linux-fsdevel, LKML
  Cc: lsf-pc, Neil Brown, linux-scsi, linux-block

[-- Attachment #1: Type: text/plain, Size: 3158 bytes --]

On Sun, Feb 26 2017, James Bottomley wrote:

> [added linux-scsi and linux-block because this is part of our error
> handling as well]
> On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
>> Proposing this as a LSF/MM TOPIC, but it may turn out to be me just 
>> not understanding the semantics here.
>> 
>> As I was looking into -ENOSPC handling in cephfs, I noticed that
>> PG_error is only ever tested in one place [1] 
>> __filemap_fdatawait_range, which does this:
>> 
>> 	if (TestClearPageError(page))
>> 		ret = -EIO;
>> 
>> This error code will override any AS_* error that was set in the
>> mapping. Which makes me wonder...why don't we just set this error in 
>> the mapping and not bother with a per-page flag? Could we potentially
>> free up a page flag by eliminating this?
>
> Note that currently the AS_* codes are only set for write errors not
> for reads and we have no mapping error handling at all for swap pages,
> but I'm sure this is fixable.

How is a read error different from a failure to set PG_uptodate?
Does PG_error suppress retries?

>
> From the I/O layer point of view we take great pains to try to pinpoint
> the error exactly to the sector.  We reflect this up by setting the
> PG_error flag on the page where the error occurred.  If we only set the
> error on the mapping, we lose that granularity, because the mapping is
> mostly at the file level (or VMA level for anon pages).

Are you saying that the IO layer finds the page in the bi_io_vec and
explicitly sets PG_error, rather than just passing an error indication
to bi_end_io ??  That would seem to be wrong as the page may not be in
the page cache. So I guess I misunderstand you.

>
> So I think the question for filesystem people from us would be do you
> care about this accuracy?  If it's OK just to know an error occurred
> somewhere in this file, then perhaps we don't need it.

I had always assumed that a bio would either succeed or fail, and that
no finer granularity could be available.

I think the question here is: Do filesystems need the pagecache to
record which pages have seen an IO error?
I think that for write errors, there is no value in recording
block-oriented error status - only file-oriented status.
For read errors, it might if help to avoid indefinite read retries, but
I don't know the code well enough to be sure if this is an issue.

NeilBrown


>
> James
>
>> The main argument I could see for keeping it is that removing it 
>> might subtly change the behavior of sync_file_range if you have tasks
>> syncing different ranges in a file concurrently. I'm not sure if that 
>> would break any guarantees though.
>> 
>> Even if we do need it, I think we might need some cleanup here 
>> anyway. A lot of readpage operations end up setting that flag when 
>> they hit an error. Isn't it wrong to return an error on fsync, just 
>> because we had a read error somewhere in the file in a range that was
>> never dirtied?
>> 
>> --
>> [1]: there is another place in f2fs, but it's more or less equivalent 
>> to the call site in __filemap_fdatawait_range.
>> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-26 21:03   ` NeilBrown
@ 2017-02-26 22:43     ` Jeff Layton
  2017-02-26 23:30     ` James Bottomley
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2017-02-26 22:43 UTC (permalink / raw)
  To: NeilBrown, James Bottomley, linux-mm, linux-fsdevel, LKML
  Cc: lsf-pc, Neil Brown, linux-scsi, linux-block

On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
> On Sun, Feb 26 2017, James Bottomley wrote:
> 
> > [added linux-scsi and linux-block because this is part of our error
> > handling as well]
> > On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
> > > Proposing this as a LSF/MM TOPIC, but it may turn out to be me just 
> > > not understanding the semantics here.
> > > 
> > > As I was looking into -ENOSPC handling in cephfs, I noticed that
> > > PG_error is only ever tested in one place [1] 
> > > __filemap_fdatawait_range, which does this:
> > > 
> > > 	if (TestClearPageError(page))
> > > 		ret = -EIO;
> > > 
> > > This error code will override any AS_* error that was set in the
> > > mapping. Which makes me wonder...why don't we just set this error in 
> > > the mapping and not bother with a per-page flag? Could we potentially
> > > free up a page flag by eliminating this?
> > 
> > Note that currently the AS_* codes are only set for write errors not
> > for reads and we have no mapping error handling at all for swap pages,
> > but I'm sure this is fixable.
> 
> How is a read error different from a failure to set PG_uptodate?
> Does PG_error suppress retries?
> 

The thing is that we only ever TestClearError in write and fsync type
codepaths. Does it make sense to report read errors _at_all_ in fsync?

> > 
> > From the I/O layer point of view we take great pains to try to pinpoint
> > the error exactly to the sector.  We reflect this up by setting the
> > PG_error flag on the page where the error occurred.  If we only set the
> > error on the mapping, we lose that granularity, because the mapping is
> > mostly at the file level (or VMA level for anon pages).
> 
> Are you saying that the IO layer finds the page in the bi_io_vec and
> explicitly sets PG_error, rather than just passing an error indication
> to bi_end_io ??  That would seem to be wrong as the page may not be in
> the page cache. So I guess I misunderstand you.
> 
> > 
> > So I think the question for filesystem people from us would be do you
> > care about this accuracy?  If it's OK just to know an error occurred
> > somewhere in this file, then perhaps we don't need it.
> 
> I had always assumed that a bio would either succeed or fail, and that
> no finer granularity could be available.
> 
> I think the question here is: Do filesystems need the pagecache to
> record which pages have seen an IO error?
> I think that for write errors, there is no value in recording
> block-oriented error status - only file-oriented status.
> For read errors, it might if help to avoid indefinite read retries, but
> I don't know the code well enough to be sure if this is an issue.
> 

Yeah, it might be useful for preventing failing read retries, but I
don't see that it's being used in that way today, unless I'm missing
something.

If PG_error is ultimately needed, I'd like to have some more clearly
defined semantics for it. It's sprinkled all over the place today, and
it's not clear to me that it's being used correctly everywhere.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-26 21:03   ` NeilBrown
  2017-02-26 22:43     ` Jeff Layton
@ 2017-02-26 23:30     ` James Bottomley
  2017-02-26 23:57       ` Jeff Layton
  2017-02-27  0:27       ` NeilBrown
  1 sibling, 2 replies; 15+ messages in thread
From: James Bottomley @ 2017-02-26 23:30 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, linux-mm, linux-fsdevel, LKML
  Cc: lsf-pc, Neil Brown, linux-scsi, linux-block

[-- Attachment #1: Type: text/plain, Size: 4244 bytes --]

On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
> On Sun, Feb 26 2017, James Bottomley wrote:
> 
> > [added linux-scsi and linux-block because this is part of our error
> > handling as well]
> > On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
> > > Proposing this as a LSF/MM TOPIC, but it may turn out to be me 
> > > just not understanding the semantics here.
> > > 
> > > As I was looking into -ENOSPC handling in cephfs, I noticed that
> > > PG_error is only ever tested in one place [1] 
> > > __filemap_fdatawait_range, which does this:
> > > 
> > > 	if (TestClearPageError(page))
> > > 		ret = -EIO;
> > > 
> > > This error code will override any AS_* error that was set in the
> > > mapping. Which makes me wonder...why don't we just set this error 
> > > in the mapping and not bother with a per-page flag? Could we
> > > potentially free up a page flag by eliminating this?
> > 
> > Note that currently the AS_* codes are only set for write errors 
> > not for reads and we have no mapping error handling at all for swap
> > pages, but I'm sure this is fixable.
> 
> How is a read error different from a failure to set PG_uptodate?
> Does PG_error suppress retries?

We don't do any retries in the code above the block layer (or at least
we shouldn't).  

> > 
> > From the I/O layer point of view we take great pains to try to 
> > pinpoint the error exactly to the sector.  We reflect this up by 
> > setting the PG_error flag on the page where the error occurred.  If 
> > we only set the error on the mapping, we lose that granularity, 
> > because the mapping is mostly at the file level (or VMA level for
> > anon pages).
> 
> Are you saying that the IO layer finds the page in the bi_io_vec and
> explicitly sets PG_error,

I didn't say anything about the mechanism.  I think the function you're
looking for is fs/mpage.c:mpage_end_io().  layers below block indicate
the position in the request.  Block maps the position to bio and the
bio completion maps to page.  So the actual granularity seen in the
upper layer depends on how the page to bio mapping is done.

>  rather than just passing an error indication to bi_end_io ??  That
> would seem to be wrong as the page may not be in the page cache.

Usually pages in the mpage_end_io path are pinned, I think.

>  So I guess I misunderstand you.
> 
> > 
> > So I think the question for filesystem people from us would be do 
> > you care about this accuracy?  If it's OK just to know an error
> > occurred somewhere in this file, then perhaps we don't need it.
> 
> I had always assumed that a bio would either succeed or fail, and 
> that no finer granularity could be available.

It does ... but a bio can be as small as a single page.

> I think the question here is: Do filesystems need the pagecache to
> record which pages have seen an IO error?

It's not just filesystems.  The partition code uses PageError() ... the
metadata code might as well (those are things with no mapping).  I'm
not saying we can't remove PG_error; I am saying it's not going to be
quite as simple as using the AS_ flags.

James

> I think that for write errors, there is no value in recording
> block-oriented error status - only file-oriented status.
> For read errors, it might if help to avoid indefinite read retries, 
> but I don't know the code well enough to be sure if this is an issue.
> 
> NeilBrown
> 
> 
> > 
> > James
> > 
> > > The main argument I could see for keeping it is that removing it 
> > > might subtly change the behavior of sync_file_range if you have 
> > > tasks syncing different ranges in a file concurrently. I'm not 
> > > sure if that would break any guarantees though.
> > > 
> > > Even if we do need it, I think we might need some cleanup here 
> > > anyway. A lot of readpage operations end up setting that flag 
> > > when they hit an error. Isn't it wrong to return an error on 
> > > fsync, just because we had a read error somewhere in the file in 
> > > a range that was never dirtied?
> > > 
> > > --
> > > [1]: there is another place in f2fs, but it's more or less 
> > > equivalent to the call site in __filemap_fdatawait_range.
> > > 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-26 23:30     ` James Bottomley
@ 2017-02-26 23:57       ` Jeff Layton
  2017-02-27  0:27       ` NeilBrown
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2017-02-26 23:57 UTC (permalink / raw)
  To: James Bottomley, NeilBrown, linux-mm, linux-fsdevel, LKML
  Cc: lsf-pc, Neil Brown, linux-scsi, linux-block

On Sun, 2017-02-26 at 15:30 -0800, James Bottomley wrote:
> On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
> > On Sun, Feb 26 2017, James Bottomley wrote:
> > 
> > > [added linux-scsi and linux-block because this is part of our error
> > > handling as well]
> > > On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
> > > > Proposing this as a LSF/MM TOPIC, but it may turn out to be me 
> > > > just not understanding the semantics here.
> > > > 
> > > > As I was looking into -ENOSPC handling in cephfs, I noticed that
> > > > PG_error is only ever tested in one place [1] 
> > > > __filemap_fdatawait_range, which does this:
> > > > 
> > > > 	if (TestClearPageError(page))
> > > > 		ret = -EIO;
> > > > 
> > > > This error code will override any AS_* error that was set in the
> > > > mapping. Which makes me wonder...why don't we just set this error 
> > > > in the mapping and not bother with a per-page flag? Could we
> > > > potentially free up a page flag by eliminating this?
> > > 
> > > Note that currently the AS_* codes are only set for write errors 
> > > not for reads and we have no mapping error handling at all for swap
> > > pages, but I'm sure this is fixable.
> > 
> > How is a read error different from a failure to set PG_uptodate?
> > Does PG_error suppress retries?
> 
> We don't do any retries in the code above the block layer (or at least
> we shouldn't).  
> 
> > > 
> > > From the I/O layer point of view we take great pains to try to 
> > > pinpoint the error exactly to the sector.  We reflect this up by 
> > > setting the PG_error flag on the page where the error occurred.  If 
> > > we only set the error on the mapping, we lose that granularity, 
> > > because the mapping is mostly at the file level (or VMA level for
> > > anon pages).
> > 
> > Are you saying that the IO layer finds the page in the bi_io_vec and
> > explicitly sets PG_error,
> 
> I didn't say anything about the mechanism.  I think the function you're
> looking for is fs/mpage.c:mpage_end_io().  layers below block indicate
> the position in the request.  Block maps the position to bio and the
> bio completion maps to page.  So the actual granularity seen in the
> upper layer depends on how the page to bio mapping is done.
> 
> >  rather than just passing an error indication to bi_end_io ??  That
> > would seem to be wrong as the page may not be in the page cache.
> 
> Usually pages in the mpage_end_io path are pinned, I think.
> 
> >  So I guess I misunderstand you.
> > 
> > > 
> > > So I think the question for filesystem people from us would be do 
> > > you care about this accuracy?  If it's OK just to know an error
> > > occurred somewhere in this file, then perhaps we don't need it.
> > 
> > I had always assumed that a bio would either succeed or fail, and 
> > that no finer granularity could be available.
> 
> It does ... but a bio can be as small as a single page.
> 
> > I think the question here is: Do filesystems need the pagecache to
> > record which pages have seen an IO error?
> 
> It's not just filesystems.  The partition code uses PageError() ... the
> metadata code might as well (those are things with no mapping).  I'm
> not saying we can't remove PG_error; I am saying it's not going to be
> quite as simple as using the AS_ flags.
> 
> James
> 

Ok, I see what you mean about the partition code. We may very well need
to keep PG_error for things like that.

If we do, then it'd be good to spell out when and how filesystems should
use it. Most of the usage seems to be the result of cargo-cult copying
all over the place.

In particular, I think we might be best off not using PG_error for
writeback errors in filesystems. It complicates the error path there and
I don't see how it adds much benefit. It's also inconsistent as a stray
sync() syscall can wipe the flag in most cases without reporting
anything.

For filesystem read errors, it might make sense to keep it, but it'd be
nice to see some guidelines about how it should be used there.

> > I think that for write errors, there is no value in recording
> > block-oriented error status - only file-oriented status.
> > For read errors, it might if help to avoid indefinite read retries, 
> > but I don't know the code well enough to be sure if this is an issue.
> > 
> > NeilBrown
> > 
> > 
> > > 
> > > James
> > > 
> > > > The main argument I could see for keeping it is that removing it 
> > > > might subtly change the behavior of sync_file_range if you have 
> > > > tasks syncing different ranges in a file concurrently. I'm not 
> > > > sure if that would break any guarantees though.
> > > > 
> > > > Even if we do need it, I think we might need some cleanup here 
> > > > anyway. A lot of readpage operations end up setting that flag 
> > > > when they hit an error. Isn't it wrong to return an error on 
> > > > fsync, just because we had a read error somewhere in the file in 
> > > > a range that was never dirtied?
> > > > 
> > > > --
> > > > [1]: there is another place in f2fs, but it's more or less 
> > > > equivalent to the call site in __filemap_fdatawait_range.
> > > > 

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-26 23:30     ` James Bottomley
  2017-02-26 23:57       ` Jeff Layton
@ 2017-02-27  0:27       ` NeilBrown
  2017-02-27 15:07         ` Jeff Layton
  1 sibling, 1 reply; 15+ messages in thread
From: NeilBrown @ 2017-02-27  0:27 UTC (permalink / raw)
  To: James Bottomley, Jeff Layton, linux-mm, linux-fsdevel, LKML
  Cc: lsf-pc, Neil Brown, linux-scsi, linux-block

[-- Attachment #1: Type: text/plain, Size: 5847 bytes --]

On Sun, Feb 26 2017, James Bottomley wrote:

> On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
>> On Sun, Feb 26 2017, James Bottomley wrote:
>> 
>> > [added linux-scsi and linux-block because this is part of our error
>> > handling as well]
>> > On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
>> > > Proposing this as a LSF/MM TOPIC, but it may turn out to be me 
>> > > just not understanding the semantics here.
>> > > 
>> > > As I was looking into -ENOSPC handling in cephfs, I noticed that
>> > > PG_error is only ever tested in one place [1] 
>> > > __filemap_fdatawait_range, which does this:
>> > > 
>> > > 	if (TestClearPageError(page))
>> > > 		ret = -EIO;
>> > > 
>> > > This error code will override any AS_* error that was set in the
>> > > mapping. Which makes me wonder...why don't we just set this error 
>> > > in the mapping and not bother with a per-page flag? Could we
>> > > potentially free up a page flag by eliminating this?
>> > 
>> > Note that currently the AS_* codes are only set for write errors 
>> > not for reads and we have no mapping error handling at all for swap
>> > pages, but I'm sure this is fixable.
>> 
>> How is a read error different from a failure to set PG_uptodate?
>> Does PG_error suppress retries?
>
> We don't do any retries in the code above the block layer (or at least
> we shouldn't).

I was wondering about what would/should happen if a read request was
re-issued for some reason.  Should the error flag on the page cause an
immediate failure, or should it try again.
If read-ahead sees a read-error on some future page, is it necessary to
record the error so subsequent read-aheads don't notice the page is
missing and repeatedly try to re-load it?
When the application eventually gets to the faulty page, should a read
be tried then, or is the read-ahead failure permanent?



>
>> > 
>> > From the I/O layer point of view we take great pains to try to 
>> > pinpoint the error exactly to the sector.  We reflect this up by 
>> > setting the PG_error flag on the page where the error occurred.  If 
>> > we only set the error on the mapping, we lose that granularity, 
>> > because the mapping is mostly at the file level (or VMA level for
>> > anon pages).
>> 
>> Are you saying that the IO layer finds the page in the bi_io_vec and
>> explicitly sets PG_error,
>
> I didn't say anything about the mechanism.  I think the function you're
> looking for is fs/mpage.c:mpage_end_io().  layers below block indicate
> the position in the request.  Block maps the position to bio and the
> bio completion maps to page.  So the actual granularity seen in the
> upper layer depends on how the page to bio mapping is done.

If the block layer is just returning the status at a per-bio level (which
makes perfect sense), then this has nothing directly to do with the
PG_error flag.

The page cache needs to do something with bi_error, but it isn't
immediately clear that it needs to set PG_error.

>
>>  rather than just passing an error indication to bi_end_io ??  That
>> would seem to be wrong as the page may not be in the page cache.
>
> Usually pages in the mpage_end_io path are pinned, I think.
>
>>  So I guess I misunderstand you.
>> 
>> > 
>> > So I think the question for filesystem people from us would be do 
>> > you care about this accuracy?  If it's OK just to know an error
>> > occurred somewhere in this file, then perhaps we don't need it.
>> 
>> I had always assumed that a bio would either succeed or fail, and 
>> that no finer granularity could be available.
>
> It does ... but a bio can be as small as a single page.
>
>> I think the question here is: Do filesystems need the pagecache to
>> record which pages have seen an IO error?
>
> It's not just filesystems.  The partition code uses PageError() ... the
> metadata code might as well (those are things with no mapping).  I'm
> not saying we can't remove PG_error; I am saying it's not going to be
> quite as simple as using the AS_ flags.

The partition code could use PageUptodate().
mpage_end_io() calls page_endio() on each page, and on read error that
calls:

			ClearPageUptodate(page);
			SetPageError(page);

are both of these necessary?

fs/buffer.c can use several bios to read a single page.
If any one returns an error, PG_error is set.  When all of them have
completed, if PG_error is clear, PG_uptodate is then set.
This is an opportunistic use of PG_error, rather than an essential use.
It could be "fixed", and would need to be fixed if we were to deprecate
use of PG_error for read errors.
There are probably other usages like this.

Thanks,
NeilBrown


>
> James
>
>> I think that for write errors, there is no value in recording
>> block-oriented error status - only file-oriented status.
>> For read errors, it might if help to avoid indefinite read retries, 
>> but I don't know the code well enough to be sure if this is an issue.
>> 
>> NeilBrown
>> 
>> 
>> > 
>> > James
>> > 
>> > > The main argument I could see for keeping it is that removing it 
>> > > might subtly change the behavior of sync_file_range if you have 
>> > > tasks syncing different ranges in a file concurrently. I'm not 
>> > > sure if that would break any guarantees though.
>> > > 
>> > > Even if we do need it, I think we might need some cleanup here 
>> > > anyway. A lot of readpage operations end up setting that flag 
>> > > when they hit an error. Isn't it wrong to return an error on 
>> > > fsync, just because we had a read error somewhere in the file in 
>> > > a range that was never dirtied?
>> > > 
>> > > --
>> > > [1]: there is another place in f2fs, but it's more or less 
>> > > equivalent to the call site in __filemap_fdatawait_range.
>> > > 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-27  0:27       ` NeilBrown
@ 2017-02-27 15:07         ` Jeff Layton
  2017-02-27 22:51           ` Andreas Dilger
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2017-02-27 15:07 UTC (permalink / raw)
  To: NeilBrown, James Bottomley, linux-mm, linux-fsdevel, LKML
  Cc: lsf-pc, Neil Brown, linux-scsi, linux-block

On Mon, 2017-02-27 at 11:27 +1100, NeilBrown wrote:
> On Sun, Feb 26 2017, James Bottomley wrote:
> 
> > On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
> > > On Sun, Feb 26 2017, James Bottomley wrote:
> > > 
> > > > [added linux-scsi and linux-block because this is part of our error
> > > > handling as well]
> > > > On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
> > > > > Proposing this as a LSF/MM TOPIC, but it may turn out to be me 
> > > > > just not understanding the semantics here.
> > > > > 
> > > > > As I was looking into -ENOSPC handling in cephfs, I noticed that
> > > > > PG_error is only ever tested in one place [1] 
> > > > > __filemap_fdatawait_range, which does this:
> > > > > 
> > > > > 	if (TestClearPageError(page))
> > > > > 		ret = -EIO;
> > > > > 
> > > > > This error code will override any AS_* error that was set in the
> > > > > mapping. Which makes me wonder...why don't we just set this error 
> > > > > in the mapping and not bother with a per-page flag? Could we
> > > > > potentially free up a page flag by eliminating this?
> > > > 
> > > > Note that currently the AS_* codes are only set for write errors 
> > > > not for reads and we have no mapping error handling at all for swap
> > > > pages, but I'm sure this is fixable.
> > > 
> > > How is a read error different from a failure to set PG_uptodate?
> > > Does PG_error suppress retries?
> > 
> > We don't do any retries in the code above the block layer (or at least
> > we shouldn't).
> 
> I was wondering about what would/should happen if a read request was
> re-issued for some reason.  Should the error flag on the page cause an
> immediate failure, or should it try again.
> If read-ahead sees a read-error on some future page, is it necessary to
> record the error so subsequent read-aheads don't notice the page is
> missing and repeatedly try to re-load it?
> When the application eventually gets to the faulty page, should a read
> be tried then, or is the read-ahead failure permanent?
> 
> 
> 
> > 
> > > > 
> > > > From the I/O layer point of view we take great pains to try to 
> > > > pinpoint the error exactly to the sector.  We reflect this up by 
> > > > setting the PG_error flag on the page where the error occurred.  If 
> > > > we only set the error on the mapping, we lose that granularity, 
> > > > because the mapping is mostly at the file level (or VMA level for
> > > > anon pages).
> > > 
> > > Are you saying that the IO layer finds the page in the bi_io_vec and
> > > explicitly sets PG_error,
> > 
> > I didn't say anything about the mechanism.  I think the function you're
> > looking for is fs/mpage.c:mpage_end_io().  layers below block indicate
> > the position in the request.  Block maps the position to bio and the
> > bio completion maps to page.  So the actual granularity seen in the
> > upper layer depends on how the page to bio mapping is done.
> 
> If the block layer is just returning the status at a per-bio level (which
> makes perfect sense), then this has nothing directly to do with the
> PG_error flag.
> 
> The page cache needs to do something with bi_error, but it isn't
> immediately clear that it needs to set PG_error.
> 
> > :q
> > >  rather than just passing an error indication to bi_end_io ??  That
> > > would seem to be wrong as the page may not be in the page cache.
> > 
> > Usually pages in the mpage_end_io path are pinned, I think.
> > 
> > >  So I guess I misunderstand you.
> > > 
> > > > 
> > > > So I think the question for filesystem people from us would be do 
> > > > you care about this accuracy?  If it's OK just to know an error
> > > > occurred somewhere in this file, then perhaps we don't need it.
> > > 
> > > I had always assumed that a bio would either succeed or fail, and 
> > > that no finer granularity could be available.
> > 
> > It does ... but a bio can be as small as a single page.
> > 
> > > I think the question here is: Do filesystems need the pagecache to
> > > record which pages have seen an IO error?
> > 
> > It's not just filesystems.  The partition code uses PageError() ... the
> > metadata code might as well (those are things with no mapping).  I'm
> > not saying we can't remove PG_error; I am saying it's not going to be
> > quite as simple as using the AS_ flags.
> 
> The partition code could use PageUptodate().
> mpage_end_io() calls page_endio() on each page, and on read error that
> calls:
> 
> 			ClearPageUptodate(page);
> 			SetPageError(page);
> 
> are both of these necessary?
> 

> fs/buffer.c can use several bios to read a single page.
> If any one returns an error, PG_error is set.  When all of them have
> completed, if PG_error is clear, PG_uptodate is then set.
> This is an opportunistic use of PG_error, rather than an essential use.
> It could be "fixed", and would need to be fixed if we were to deprecate
> use of PG_error for read errors.
> There are probably other usages like this.
> 

Ok, I think I get it (somewhat):

The tricky part there is how to handle the PageError check in
read_dev_sector if you don't use SetPageError in the result handler.

If we can count on read_pagecache_sector and read_dax_sector reliably
returning an error when the page is considered to be in the cache
(PageUpToDate) but had a read error, then that would work. I'm not sure
how you'd indicate that without something like PG_error though if you
don't want to retry on every attempt.

OTOH, if we want to always retry to read in pages that have had read
errors when someone requests them, then we can simply not set
PageUpToDate when readahead fails.

To chip away at the edges of this, what may make sense is to get this
flag out of the writeback code as much as we can. When a write fails and
SetPageError is called, we should also mark the mapping with an error.
Then, we should be able to stop overriding the mapping error with -EIO
in that codepath. Maybe call ClearPageError, or maybe leave it alone
there?

> > 
> > James
> > 
> > > I think that for write errors, there is no value in recording
> > > block-oriented error status - only file-oriented status.
> > > For read errors, it might if help to avoid indefinite read retries, 
> > > but I don't know the code well enough to be sure if this is an issue.
> > > 
> > > NeilBrown
> > > 
> > > 
> > > > 
> > > > James
> > > > 
> > > > > The main argument I could see for keeping it is that removing it 
> > > > > might subtly change the behavior of sync_file_range if you have 
> > > > > tasks syncing different ranges in a file concurrently. I'm not 
> > > > > sure if that would break any guarantees though.
> > > > > 
> > > > > Even if we do need it, I think we might need some cleanup here 
> > > > > anyway. A lot of readpage operations end up setting that flag 
> > > > > when they hit an error. Isn't it wrong to return an error on 
> > > > > fsync, just because we had a read error somewhere in the file in 
> > > > > a range that was never dirtied?
> > > > > 
> > > > > --
> > > > > [1]: there is another place in f2fs, but it's more or less 
> > > > > equivalent to the call site in __filemap_fdatawait_range.
> > > > > 

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-27 15:07         ` Jeff Layton
@ 2017-02-27 22:51           ` Andreas Dilger
  2017-02-27 23:02             ` Jeff Layton
  2017-02-27 23:32             ` NeilBrown
  0 siblings, 2 replies; 15+ messages in thread
From: Andreas Dilger @ 2017-02-27 22:51 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, James Bottomley, linux-mm, linux-fsdevel, LKML,
	lsf-pc, Neil Brown, linux-scsi, linux-block

[-- Attachment #1: Type: text/plain, Size: 7794 bytes --]

On Feb 27, 2017, at 8:07 AM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Mon, 2017-02-27 at 11:27 +1100, NeilBrown wrote:
>> On Sun, Feb 26 2017, James Bottomley wrote:
>> 
>>> On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
>>>> On Sun, Feb 26 2017, James Bottomley wrote:
>>>> 
>>>>> [added linux-scsi and linux-block because this is part of our error
>>>>> handling as well]
>>>>> On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
>>>>>> Proposing this as a LSF/MM TOPIC, but it may turn out to be me
>>>>>> just not understanding the semantics here.
>>>>>> 
>>>>>> As I was looking into -ENOSPC handling in cephfs, I noticed that
>>>>>> PG_error is only ever tested in one place [1]
>>>>>> __filemap_fdatawait_range, which does this:
>>>>>> 
>>>>>> 	if (TestClearPageError(page))
>>>>>> 		ret = -EIO;
>>>>>> 
>>>>>> This error code will override any AS_* error that was set in the
>>>>>> mapping. Which makes me wonder...why don't we just set this error
>>>>>> in the mapping and not bother with a per-page flag? Could we
>>>>>> potentially free up a page flag by eliminating this?
>>>>> 
>>>>> Note that currently the AS_* codes are only set for write errors
>>>>> not for reads and we have no mapping error handling at all for swap
>>>>> pages, but I'm sure this is fixable.
>>>> 
>>>> How is a read error different from a failure to set PG_uptodate?
>>>> Does PG_error suppress retries?
>>> 
>>> We don't do any retries in the code above the block layer (or at least
>>> we shouldn't).
>> 
>> I was wondering about what would/should happen if a read request was
>> re-issued for some reason.  Should the error flag on the page cause an
>> immediate failure, or should it try again.
>> If read-ahead sees a read-error on some future page, is it necessary to
>> record the error so subsequent read-aheads don't notice the page is
>> missing and repeatedly try to re-load it?
>> When the application eventually gets to the faulty page, should a read
>> be tried then, or is the read-ahead failure permanent?
>> 
>> 
>> 
>>> 
>>>>> 
>>>>> From the I/O layer point of view we take great pains to try to
>>>>> pinpoint the error exactly to the sector.  We reflect this up by
>>>>> setting the PG_error flag on the page where the error occurred.  If
>>>>> we only set the error on the mapping, we lose that granularity,
>>>>> because the mapping is mostly at the file level (or VMA level for
>>>>> anon pages).
>>>> 
>>>> Are you saying that the IO layer finds the page in the bi_io_vec and
>>>> explicitly sets PG_error,
>>> 
>>> I didn't say anything about the mechanism.  I think the function you're
>>> looking for is fs/mpage.c:mpage_end_io().  layers below block indicate
>>> the position in the request.  Block maps the position to bio and the
>>> bio completion maps to page.  So the actual granularity seen in the
>>> upper layer depends on how the page to bio mapping is done.
>> 
>> If the block layer is just returning the status at a per-bio level (which
>> makes perfect sense), then this has nothing directly to do with the
>> PG_error flag.
>> 
>> The page cache needs to do something with bi_error, but it isn't
>> immediately clear that it needs to set PG_error.
>> 
>>> :q
>>>> rather than just passing an error indication to bi_end_io ??  That
>>>> would seem to be wrong as the page may not be in the page cache.
>>> 
>>> Usually pages in the mpage_end_io path are pinned, I think.
>>> 
>>>> So I guess I misunderstand you.
>>>> 
>>>>> 
>>>>> So I think the question for filesystem people from us would be do
>>>>> you care about this accuracy?  If it's OK just to know an error
>>>>> occurred somewhere in this file, then perhaps we don't need it.
>>>> 
>>>> I had always assumed that a bio would either succeed or fail, and
>>>> that no finer granularity could be available.
>>> 
>>> It does ... but a bio can be as small as a single page.
>>> 
>>>> I think the question here is: Do filesystems need the pagecache to
>>>> record which pages have seen an IO error?
>>> 
>>> It's not just filesystems.  The partition code uses PageError() ... the
>>> metadata code might as well (those are things with no mapping).  I'm
>>> not saying we can't remove PG_error; I am saying it's not going to be
>>> quite as simple as using the AS_ flags.
>> 
>> The partition code could use PageUptodate().
>> mpage_end_io() calls page_endio() on each page, and on read error that
>> calls:
>> 
>> 			ClearPageUptodate(page);
>> 			SetPageError(page);
>> 
>> are both of these necessary?
>> 
> 
>> fs/buffer.c can use several bios to read a single page.
>> If any one returns an error, PG_error is set.  When all of them have
>> completed, if PG_error is clear, PG_uptodate is then set.
>> This is an opportunistic use of PG_error, rather than an essential use.
>> It could be "fixed", and would need to be fixed if we were to deprecate
>> use of PG_error for read errors.
>> There are probably other usages like this.
>> 
> 
> Ok, I think I get it (somewhat):
> 
> The tricky part there is how to handle the PageError check in
> read_dev_sector if you don't use SetPageError in the result handler.
> 
> If we can count on read_pagecache_sector and read_dax_sector reliably
> returning an error when the page is considered to be in the cache
> (PageUpToDate) but had a read error, then that would work. I'm not sure
> how you'd indicate that without something like PG_error though if you
> don't want to retry on every attempt.
> 
> OTOH, if we want to always retry to read in pages that have had read
> errors when someone requests them, then we can simply not set
> PageUpToDate when readahead fails.
> 
> To chip away at the edges of this, what may make sense is to get this
> flag out of the writeback code as much as we can. When a write fails and
> SetPageError is called, we should also mark the mapping with an error.
> Then, we should be able to stop overriding the mapping error with -EIO
> in that codepath. Maybe call ClearPageError, or maybe leave it alone
> there?

My thought is that PG_error is definitely useful for applications to get
correct errors back when doing write()/sync_file_range() so that they know
there is an error in the data that _they_ wrote, rather than receiving an
error for data that may have been written by another thread, and in turn
clearing the error from another thread so it *doesn't* know it had a write
error.

As for stray sync() clearing PG_error from underneath an application, that
shouldn't happen since filemap_fdatawait_keep_errors() doesn't clear errors
and is used by device flushing code (fdatawait_one_bdev(), wait_sb_inodes()).

Cheers, Andreas

>>> 
>>> James
>>> 
>>>> I think that for write errors, there is no value in recording
>>>> block-oriented error status - only file-oriented status.
>>>> For read errors, it might if help to avoid indefinite read retries,
>>>> but I don't know the code well enough to be sure if this is an issue.
>>>> 
>>>> NeilBrown
>>>> 
>>>> 
>>>>> 
>>>>> James
>>>>> 
>>>>>> The main argument I could see for keeping it is that removing it
>>>>>> might subtly change the behavior of sync_file_range if you have
>>>>>> tasks syncing different ranges in a file concurrently. I'm not
>>>>>> sure if that would break any guarantees though.
>>>>>> 
>>>>>> Even if we do need it, I think we might need some cleanup here
>>>>>> anyway. A lot of readpage operations end up setting that flag
>>>>>> when they hit an error. Isn't it wrong to return an error on
>>>>>> fsync, just because we had a read error somewhere in the file in
>>>>>> a range that was never dirtied?
>>>>>> 
>>>>>> --
>>>>>> [1]: there is another place in f2fs, but it's more or less
>>>>>> equivalent to the call site in __filemap_fdatawait_range.
>>>>>> 
> 
> --
> Jeff Layton <jlayton@redhat.com>


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-27 22:51           ` Andreas Dilger
@ 2017-02-27 23:02             ` Jeff Layton
  2017-02-27 23:32             ` NeilBrown
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2017-02-27 23:02 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: NeilBrown, James Bottomley, linux-mm, linux-fsdevel, LKML,
	lsf-pc, Neil Brown, linux-scsi, linux-block

On Mon, 2017-02-27 at 15:51 -0700, Andreas Dilger wrote:
> On Feb 27, 2017, at 8:07 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Mon, 2017-02-27 at 11:27 +1100, NeilBrown wrote:
> > > On Sun, Feb 26 2017, James Bottomley wrote:
> > > 
> > > > On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
> > > > > On Sun, Feb 26 2017, James Bottomley wrote:
> > > > > 
> > > > > > [added linux-scsi and linux-block because this is part of our error
> > > > > > handling as well]
> > > > > > On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
> > > > > > > Proposing this as a LSF/MM TOPIC, but it may turn out to be me
> > > > > > > just not understanding the semantics here.
> > > > > > > 
> > > > > > > As I was looking into -ENOSPC handling in cephfs, I noticed that
> > > > > > > PG_error is only ever tested in one place [1]
> > > > > > > __filemap_fdatawait_range, which does this:
> > > > > > > 
> > > > > > > 	if (TestClearPageError(page))
> > > > > > > 		ret = -EIO;
> > > > > > > 
> > > > > > > This error code will override any AS_* error that was set in the
> > > > > > > mapping. Which makes me wonder...why don't we just set this error
> > > > > > > in the mapping and not bother with a per-page flag? Could we
> > > > > > > potentially free up a page flag by eliminating this?
> > > > > > 
> > > > > > Note that currently the AS_* codes are only set for write errors
> > > > > > not for reads and we have no mapping error handling at all for swap
> > > > > > pages, but I'm sure this is fixable.
> > > > > 
> > > > > How is a read error different from a failure to set PG_uptodate?
> > > > > Does PG_error suppress retries?
> > > > 
> > > > We don't do any retries in the code above the block layer (or at least
> > > > we shouldn't).
> > > 
> > > I was wondering about what would/should happen if a read request was
> > > re-issued for some reason.  Should the error flag on the page cause an
> > > immediate failure, or should it try again.
> > > If read-ahead sees a read-error on some future page, is it necessary to
> > > record the error so subsequent read-aheads don't notice the page is
> > > missing and repeatedly try to re-load it?
> > > When the application eventually gets to the faulty page, should a read
> > > be tried then, or is the read-ahead failure permanent?
> > > 
> > > 
> > > 
> > > > 
> > > > > > 
> > > > > > From the I/O layer point of view we take great pains to try to
> > > > > > pinpoint the error exactly to the sector.  We reflect this up by
> > > > > > setting the PG_error flag on the page where the error occurred.  If
> > > > > > we only set the error on the mapping, we lose that granularity,
> > > > > > because the mapping is mostly at the file level (or VMA level for
> > > > > > anon pages).
> > > > > 
> > > > > Are you saying that the IO layer finds the page in the bi_io_vec and
> > > > > explicitly sets PG_error,
> > > > 
> > > > I didn't say anything about the mechanism.  I think the function you're
> > > > looking for is fs/mpage.c:mpage_end_io().  layers below block indicate
> > > > the position in the request.  Block maps the position to bio and the
> > > > bio completion maps to page.  So the actual granularity seen in the
> > > > upper layer depends on how the page to bio mapping is done.
> > > 
> > > If the block layer is just returning the status at a per-bio level (which
> > > makes perfect sense), then this has nothing directly to do with the
> > > PG_error flag.
> > > 
> > > The page cache needs to do something with bi_error, but it isn't
> > > immediately clear that it needs to set PG_error.
> > > 
> > > > :q
> > > > > rather than just passing an error indication to bi_end_io ??  That
> > > > > would seem to be wrong as the page may not be in the page cache.
> > > > 
> > > > Usually pages in the mpage_end_io path are pinned, I think.
> > > > 
> > > > > So I guess I misunderstand you.
> > > > > 
> > > > > > 
> > > > > > So I think the question for filesystem people from us would be do
> > > > > > you care about this accuracy?  If it's OK just to know an error
> > > > > > occurred somewhere in this file, then perhaps we don't need it.
> > > > > 
> > > > > I had always assumed that a bio would either succeed or fail, and
> > > > > that no finer granularity could be available.
> > > > 
> > > > It does ... but a bio can be as small as a single page.
> > > > 
> > > > > I think the question here is: Do filesystems need the pagecache to
> > > > > record which pages have seen an IO error?
> > > > 
> > > > It's not just filesystems.  The partition code uses PageError() ... the
> > > > metadata code might as well (those are things with no mapping).  I'm
> > > > not saying we can't remove PG_error; I am saying it's not going to be
> > > > quite as simple as using the AS_ flags.
> > > 
> > > The partition code could use PageUptodate().
> > > mpage_end_io() calls page_endio() on each page, and on read error that
> > > calls:
> > > 
> > > 			ClearPageUptodate(page);
> > > 			SetPageError(page);
> > > 
> > > are both of these necessary?
> > > 
> > > fs/buffer.c can use several bios to read a single page.
> > > If any one returns an error, PG_error is set.  When all of them have
> > > completed, if PG_error is clear, PG_uptodate is then set.
> > > This is an opportunistic use of PG_error, rather than an essential use.
> > > It could be "fixed", and would need to be fixed if we were to deprecate
> > > use of PG_error for read errors.
> > > There are probably other usages like this.
> > > 
> > 
> > Ok, I think I get it (somewhat):
> > 
> > The tricky part there is how to handle the PageError check in
> > read_dev_sector if you don't use SetPageError in the result handler.
> > 
> > If we can count on read_pagecache_sector and read_dax_sector reliably
> > returning an error when the page is considered to be in the cache
> > (PageUpToDate) but had a read error, then that would work. I'm not sure
> > how you'd indicate that without something like PG_error though if you
> > don't want to retry on every attempt.
> > 
> > OTOH, if we want to always retry to read in pages that have had read
> > errors when someone requests them, then we can simply not set
> > PageUpToDate when readahead fails.
> > 
> > To chip away at the edges of this, what may make sense is to get this
> > flag out of the writeback code as much as we can. When a write fails and
> > SetPageError is called, we should also mark the mapping with an error.
> > Then, we should be able to stop overriding the mapping error with -EIO
> > in that codepath. Maybe call ClearPageError, or maybe leave it alone
> > there?
> 
> My thought is that PG_error is definitely useful for applications to get
> correct errors back when doing write()/sync_file_range() so that they know
> there is an error in the data that _they_ wrote, rather than receiving an
> error for data that may have been written by another thread, and in turn
> clearing the error from another thread so it *doesn't* know it had a write
> error.
> 

Right, that was my point about sync_file_range. Today I think you can
call sync_file_range and if your range didn't hit errors (none of you
PG_error bits are set), then you might get back 0 (iff the mapping had
no error flagged on it).

The question I have is: is that the semantics that sync_file_range is
supposed to have? It's not clear from the manpage whether errors are
supposed to be that granular or not.

It's also the case that not all writepage implementations set the
PG_error flag. There is a lot of variation here, and so we end up with
different semantics on different filesystems. That's less than ideal.

> As for stray sync() clearing PG_error from underneath an application, that
> shouldn't happen since filemap_fdatawait_keep_errors() doesn't clear errors
> and is used by device flushing code (fdatawait_one_bdev(), wait_sb_inodes()).
> 
> 

It sure looks like it does to me. Am I missing something?

filemap_fdatawait_keep_errors calls __filemap_fdatawait_range, which
finds pages and waits on their writeback bit to clear. Once it does
that, it calls TestClearPageError and sets the return to -EIO if the bit
was set. That should result in all of the PG_error bits being cleared on
a sync() call, right?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-27 22:51           ` Andreas Dilger
  2017-02-27 23:02             ` Jeff Layton
@ 2017-02-27 23:32             ` NeilBrown
  2017-02-28  1:11               ` [Lsf-pc] " Jeff Layton
  1 sibling, 1 reply; 15+ messages in thread
From: NeilBrown @ 2017-02-27 23:32 UTC (permalink / raw)
  To: Andreas Dilger, Jeff Layton
  Cc: James Bottomley, linux-mm, linux-fsdevel, LKML, lsf-pc,
	Neil Brown, linux-scsi, linux-block

[-- Attachment #1: Type: text/plain, Size: 1230 bytes --]

On Mon, Feb 27 2017, Andreas Dilger wrote:

>
> My thought is that PG_error is definitely useful for applications to get
> correct errors back when doing write()/sync_file_range() so that they know
> there is an error in the data that _they_ wrote, rather than receiving an
> error for data that may have been written by another thread, and in turn
> clearing the error from another thread so it *doesn't* know it had a write
> error.

It might be useful in that way, but it is not currently used that way.
Such usage would be a change in visible behaviour.

sync_file_range() calls filemap_fdatawait_range(), which calls
filemap_check_errors().
If there have been any errors in the file recently, inside or outside
the range, the latter will return an error which will propagate up.

>
> As for stray sync() clearing PG_error from underneath an application, that
> shouldn't happen since filemap_fdatawait_keep_errors() doesn't clear errors
> and is used by device flushing code (fdatawait_one_bdev(), wait_sb_inodes()).

filemap_fdatawait_keep_errors() calls __filemap_fdatawait_range() which
clears PG_error on every page.
What it doesn't do is call filemap_check_errors(), and so doesn't clear
AS_ENOSPC or AS_EIO.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-27 23:32             ` NeilBrown
@ 2017-02-28  1:11               ` Jeff Layton
  2017-02-28 10:12                 ` Boaz Harrosh
  2017-02-28 20:45                 ` NeilBrown
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff Layton @ 2017-02-28  1:11 UTC (permalink / raw)
  To: NeilBrown, Andreas Dilger
  Cc: linux-block, linux-scsi, lsf-pc, Neil Brown, LKML,
	James Bottomley, linux-mm, linux-fsdevel

On Tue, 2017-02-28 at 10:32 +1100, NeilBrown wrote:
> On Mon, Feb 27 2017, Andreas Dilger wrote:
> 
> > 
> > My thought is that PG_error is definitely useful for applications to get
> > correct errors back when doing write()/sync_file_range() so that they know
> > there is an error in the data that _they_ wrote, rather than receiving an
> > error for data that may have been written by another thread, and in turn
> > clearing the error from another thread so it *doesn't* know it had a write
> > error.
> 
> It might be useful in that way, but it is not currently used that way.
> Such usage would be a change in visible behaviour.
> 
> sync_file_range() calls filemap_fdatawait_range(), which calls
> filemap_check_errors().
> If there have been any errors in the file recently, inside or outside
> the range, the latter will return an error which will propagate up.
> 
> > 
> > As for stray sync() clearing PG_error from underneath an application, that
> > shouldn't happen since filemap_fdatawait_keep_errors() doesn't clear errors
> > and is used by device flushing code (fdatawait_one_bdev(), wait_sb_inodes()).
> 
> filemap_fdatawait_keep_errors() calls __filemap_fdatawait_range() which
> clears PG_error on every page.
> What it doesn't do is call filemap_check_errors(), and so doesn't clear
> AS_ENOSPC or AS_EIO.
> 
> 

I think it's helpful to get a clear idea of what happens now in the face
of errors and what we expect to happen, and I don't quite have that yet:

--------------------------8<-----------------------------
void page_endio(struct page *page, bool is_write, int err)
{
        if (!is_write) {
                if (!err) {
                        SetPageUptodate(page);
                } else {
                        ClearPageUptodate(page);
                        SetPageError(page);
                }
                unlock_page(page);
        } else {
                if (err) {
                        SetPageError(page);
                        if (page->mapping)
                                mapping_set_error(page->mapping, err);
                }
                end_page_writeback(page);
        }
}
--------------------------8<-----------------------------

...not everything uses page_endio, but it's a good place to look since
we have both flavors of error handling in one place.

On a write error, we SetPageError and set the error in the mapping.

What I'm not clear on is:

1) what happens to the page at that point when we get a writeback error?
Does it just remain in-core and is allowed to service reads (assuming
that it was uptodate before)?

Can I redirty it and have it retry the write? Is there standard behavior
for this or is it just up to the whim of the filesystem?

I'll probably have questions about the read side as well, but for now it
looks like it's mostly used in an ad-hoc way to communicate errors
across subsystems (block to fs layer, for instance).
--
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-28  1:11               ` [Lsf-pc] " Jeff Layton
@ 2017-02-28 10:12                 ` Boaz Harrosh
  2017-02-28 11:32                   ` Jeff Layton
  2017-02-28 20:45                 ` NeilBrown
  1 sibling, 1 reply; 15+ messages in thread
From: Boaz Harrosh @ 2017-02-28 10:12 UTC (permalink / raw)
  To: Jeff Layton, NeilBrown, Andreas Dilger
  Cc: linux-block, linux-scsi, lsf-pc, Neil Brown, LKML,
	James Bottomley, linux-mm, linux-fsdevel

On 02/28/2017 03:11 AM, Jeff Layton wrote:
<>
> 
> I'll probably have questions about the read side as well, but for now it
> looks like it's mostly used in an ad-hoc way to communicate errors
> across subsystems (block to fs layer, for instance).

If memory does not fail me it used to be checked long time ago in the
read-ahead case. On the buffered read case, the first page is read synchronous
and any error is returned to the caller, but then a read-ahead chunk is
read async all the while the original thread returned to the application.
So any errors are only recorded on the page-bit, since otherwise the uptodate
is off and the IO will be retransmitted. Then the move to read_iter changed
all that I think.
But again this is like 5-6 years ago, and maybe I didn't even understand
very well.

> --
> Jeff Layton <jlayton@redhat.com>
> 

I would like a Documentation of all this as well please. Where are the
tests for this?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-28 10:12                 ` Boaz Harrosh
@ 2017-02-28 11:32                   ` Jeff Layton
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2017-02-28 11:32 UTC (permalink / raw)
  To: Boaz Harrosh, NeilBrown, Andreas Dilger
  Cc: linux-block, linux-scsi, lsf-pc, Neil Brown, LKML,
	James Bottomley, linux-mm, linux-fsdevel

On Tue, 2017-02-28 at 12:12 +0200, Boaz Harrosh wrote:
> On 02/28/2017 03:11 AM, Jeff Layton wrote:
> <>
> > 
> > I'll probably have questions about the read side as well, but for now it
> > looks like it's mostly used in an ad-hoc way to communicate errors
> > across subsystems (block to fs layer, for instance).
> 
> If memory does not fail me it used to be checked long time ago in the
> read-ahead case. On the buffered read case, the first page is read synchronous
> and any error is returned to the caller, but then a read-ahead chunk is
> read async all the while the original thread returned to the application.
> So any errors are only recorded on the page-bit, since otherwise the uptodate
> is off and the IO will be retransmitted. Then the move to read_iter changed
> all that I think.
> But again this is like 5-6 years ago, and maybe I didn't even understand
> very well.
> 

Yep, that's what I meant about using it to communicate errors between
layers. e.g. end_buffer_async_read will check PageError and only
SetPageUptodate if it's not set. That has morphed a lot in the last few
years though and it looks like it may rely on PG_error less than it used
to.

> 
> I would like a Documentation of all this as well please. Where are the
> tests for this?
> 

Documentation is certainly doable (and I'd like to write some once we
have this all straightened out). In particular, I think we need clear
guidelines for fs authors on how to handle pagecache read and write
errors. Tests are a little tougher -- this is all kernel-internal stuff
and not easily visible to userland.

The one thing I have noticed is that even if you set AS_ENOSPC in the
mapping, you'll still get back -EIO on the first fsync if any PG_error
bits are set. I think we ought to fix that by not doing the
TestClearPageError call in __filemap_fdatawait_range, and just rely on
the mapping error there.

We could maybe roll a test for that, but it's rather hard to test ENOSPC
conditions in a fs-agnostic way. I'm open to suggestions here though.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] do we really need PG_error at all?
  2017-02-28  1:11               ` [Lsf-pc] " Jeff Layton
  2017-02-28 10:12                 ` Boaz Harrosh
@ 2017-02-28 20:45                 ` NeilBrown
  1 sibling, 0 replies; 15+ messages in thread
From: NeilBrown @ 2017-02-28 20:45 UTC (permalink / raw)
  To: Jeff Layton, Andreas Dilger
  Cc: linux-block, linux-scsi, lsf-pc, Neil Brown, LKML,
	James Bottomley, linux-mm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3827 bytes --]

On Mon, Feb 27 2017, Jeff Layton wrote:

> On Tue, 2017-02-28 at 10:32 +1100, NeilBrown wrote:
>> On Mon, Feb 27 2017, Andreas Dilger wrote:
>> 
>> > 
>> > My thought is that PG_error is definitely useful for applications to get
>> > correct errors back when doing write()/sync_file_range() so that they know
>> > there is an error in the data that _they_ wrote, rather than receiving an
>> > error for data that may have been written by another thread, and in turn
>> > clearing the error from another thread so it *doesn't* know it had a write
>> > error.
>> 
>> It might be useful in that way, but it is not currently used that way.
>> Such usage would be a change in visible behaviour.
>> 
>> sync_file_range() calls filemap_fdatawait_range(), which calls
>> filemap_check_errors().
>> If there have been any errors in the file recently, inside or outside
>> the range, the latter will return an error which will propagate up.
>> 
>> > 
>> > As for stray sync() clearing PG_error from underneath an application, that
>> > shouldn't happen since filemap_fdatawait_keep_errors() doesn't clear errors
>> > and is used by device flushing code (fdatawait_one_bdev(), wait_sb_inodes()).
>> 
>> filemap_fdatawait_keep_errors() calls __filemap_fdatawait_range() which
>> clears PG_error on every page.
>> What it doesn't do is call filemap_check_errors(), and so doesn't clear
>> AS_ENOSPC or AS_EIO.
>> 
>> 
>
> I think it's helpful to get a clear idea of what happens now in the face
> of errors and what we expect to happen, and I don't quite have that yet:
>
> --------------------------8<-----------------------------
> void page_endio(struct page *page, bool is_write, int err)
> {
>         if (!is_write) {
>                 if (!err) {
>                         SetPageUptodate(page);
>                 } else {
>                         ClearPageUptodate(page);
>                         SetPageError(page);
>                 }
>                 unlock_page(page);
>         } else {
>                 if (err) {
>                         SetPageError(page);
>                         if (page->mapping)
>                                 mapping_set_error(page->mapping, err);
>                 }
>                 end_page_writeback(page);
>         }
> }
> --------------------------8<-----------------------------
>
> ...not everything uses page_endio, but it's a good place to look since
> we have both flavors of error handling in one place.
>
> On a write error, we SetPageError and set the error in the mapping.
>
> What I'm not clear on is:
>
> 1) what happens to the page at that point when we get a writeback error?
> Does it just remain in-core and is allowed to service reads (assuming
> that it was uptodate before)?

Yes, it remains in core and can service reads.  It is no different from
a page on which a write recent succeeded, except that the write didn't
succeed so the contents of backing store might be different from the
contents of the page.

>
> Can I redirty it and have it retry the write? Is there standard behavior
> for this or is it just up to the whim of the filesystem?

Everything is at the whim of the filesystem, but I doubt if many differ
from the above.

NeilBrown

>
> I'll probably have questions about the read side as well, but for now it
> looks like it's mostly used in an ad-hoc way to communicate errors
> across subsystems (block to fs layer, for instance).
> --
> Jeff Layton <jlayton@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-02-28 20:59 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-26 14:42 [LSF/MM TOPIC] do we really need PG_error at all? Jeff Layton
2017-02-26 17:10 ` James Bottomley
2017-02-26 21:03   ` NeilBrown
2017-02-26 22:43     ` Jeff Layton
2017-02-26 23:30     ` James Bottomley
2017-02-26 23:57       ` Jeff Layton
2017-02-27  0:27       ` NeilBrown
2017-02-27 15:07         ` Jeff Layton
2017-02-27 22:51           ` Andreas Dilger
2017-02-27 23:02             ` Jeff Layton
2017-02-27 23:32             ` NeilBrown
2017-02-28  1:11               ` [Lsf-pc] " Jeff Layton
2017-02-28 10:12                 ` Boaz Harrosh
2017-02-28 11:32                   ` Jeff Layton
2017-02-28 20:45                 ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).