* [PATCH] zram: fix writeback_store returning zero in most situations
@ 2020-04-15  5:24 Justin Gottula
  2020-04-16 21:47 ` Minchan Kim
  0 siblings, 1 reply; 5+ messages in thread
From: Justin Gottula @ 2020-04-15  5:24 UTC (permalink / raw)
  To: Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Jens Axboe
  Cc: linux-block, linux-kernel

Commit 3b82a051c101 ("zram: fix error return codes not being returned in
writeback_store") altered the return value logic in writeback_store such
that ret = len is done at the beginning of the function, which alleviated
an issue where the -EIO and -ENOSPC break-outs from the for-loop would
unintentionally hit the ret = len previously located at the end of the
function. That change in itself was largely fine; however, it exposed
another return value logic problem.

Currently, when the writeback_store for-loop does _not_ encounter the -EIO
or -ENOSPC situations, and _does_ attempt to write at least one block to
the backing device, the function will simply return whatever the most
recent submit_bio_wait return value was. This means that if at least one
block write is attempted, the function will never actually return len; and,
if the last written block's call to submit_bio_wait was successful, then
the function will actually return _zero_. (Prior to the above-mentioned
patch, this situation was masked by the fact that the ret = len assignment
occurred _after_ all calls to submit_bio_wait.)

So, if writeback of one or more blocks is attempted, and the last
submit_bio_wait is successful (which will usually be the case), userspace
programs performing write(3) syscalls to /sys/block/zramN/writeback will
receive a return value of zero. Well-written programs use non-negative
return values from write(3) to determine if a partial write occurred, and
in that circumstance, they generally add the value to their buffer pointer
and call write(3) again. Programs written this way will, as a result,
effectively spin in a loop calling write(3) forever (or at least until an
actual error value does finally crop up), repeatedly writing the entire
string buffer ("huge" or "idle") to the writeback file and getting the
false impression that no forward progress is being made, since zero bytes
of the buffer are apparently being consumed with each syscall.
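
For illustration, here's roughly what that retry idiom looks like (a
minimal made-up helper, not code from any particular program; write_cmd
and fd are just placeholder names):

#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Minimal sketch of the usual "handle partial writes" retry loop. */
static int write_cmd(int fd, const char *buf)
{
        size_t remaining = strlen(buf);

        while (remaining > 0) {
                ssize_t n = write(fd, buf, remaining);

                if (n < 0) {
                        if (errno == EINTR || errno == EAGAIN)
                                continue;
                        return -1;      /* genuine error */
                }
                /*
                 * If write() keeps returning 0, remaining never shrinks
                 * and this loop spins forever -- which is exactly what
                 * happens against the current writeback_store.
                 */
                buf += n;
                remaining -= n;
        }
        return 0;
}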

I came across this problem while adding a new parameter to the zramctl
program, in a personal fork of the util-linux suite, that would allow the
user to easily request writeback on a zram device. In zramctl (and the
util-linux programs in general), sysfs writes are typically done via
ul_path_write_string, which invokes write_all. The write_all utility
function is written in such a way that write(3) return values _not greater
than zero_ are treated as error conditions; upon which, any errno value
other than EINTR or EAGAIN will cause the function to bail out and indicate
an error. A return value of zero is indeed not greater than zero; and an errno
value of zero is neither EINTR nor EAGAIN; so write_all bails and indicates
an error. (And thanks to automatic strerror-based printout functions, this
of course produces the message "error: Success", a personal favorite of
mine.)
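
For reference, the relevant write_all logic looks roughly like this (my
paraphrase from memory, not the verbatim util-linux source):

#include <errno.h>
#include <unistd.h>

/* Rough paraphrase of util-linux write_all(); not the exact source. */
static int write_all_approx(int fd, const void *buf, size_t count)
{
        while (count) {
                ssize_t tmp;

                errno = 0;
                tmp = write(fd, buf, count);
                if (tmp > 0) {
                        count -= tmp;
                        buf = (const char *) buf + tmp;
                } else if (errno != EINTR && errno != EAGAIN) {
                        /*
                         * A return value of 0 lands here with errno still 0,
                         * so the caller ends up printing "error: Success".
                         */
                        return -1;
                }
        }
        return 0;
}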

The writeback_store return value problem is addressed by storing the
submit_bio_wait result in a separate local variable, which is only assigned
to ret if it is actually an errno value.

The behavior should now be that, if an error arises from one or more of the
calls to submit_bio_wait during writeback_store, and none of the other
explicit error cases in the function are triggered, then the most recent
submit_bio_wait error will be returned. Otherwise, the function will return
len.

Fixes: a939888ec38b ("zram: support idle/huge page writeback")
Signed-off-by: Justin Gottula <justin@jgottula.com>
---
Side Note: My understanding of whether a return value of zero from
write(3) is actually even valid--given a nonzero length parameter--became a
whole lot murkier when I took a really close look at the Linux and POSIX
man pages, as well as Internet discussions on the topic. Initially I was
sure that a zero return value indicated that no error had occurred, but
that no forward progress on the write had been made either; and that as
such, the implementation of write_all in util-linux was faulty. But now
I've come to understand that apparently some historical implementations of
write(3) have actually returned zero as an indication that a non-blocking
write on certain special files would block (though modern POSIX specifies
returning -1 and setting errno to EAGAIN for this case); and that a return
value of zero may apparently be used as an EOF indication. So perhaps
treating a zero result as an error in write_all, implicit and accidental
though it may have appeared, was actually intentional and well-founded all
along; I don't entirely know. In any case, though, it's clear that the zram
driver shouldn't be returning 0 in this circumstance.

 drivers/block/zram/zram_drv.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 1bf4a908a0bd..e55330a615c2 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -627,6 +627,7 @@ static ssize_t writeback_store(struct device *dev,
    struct bio_vec bio_vec;
    struct page *page;
    ssize_t ret = len;
+   int bio_ret;
    int mode;
    unsigned long blk_idx = 0;

@@ -719,8 +720,9 @@ static ssize_t writeback_store(struct device *dev,
         * XXX: A single page IO would be inefficient for write
         * but it would be not bad as starter.
         */
-       ret = submit_bio_wait(&bio);
-       if (ret) {
+       bio_ret = submit_bio_wait(&bio);
+       if (bio_ret) {
+           ret = bio_ret;
            zram_slot_lock(zram, index);
            zram_clear_flag(zram, index, ZRAM_UNDER_WB);
            zram_clear_flag(zram, index, ZRAM_IDLE);
--
2.25.1


* Re: [PATCH] zram: fix writeback_store returning zero in most situations
  2020-04-15  5:24 [PATCH] zram: fix writeback_store returning zero in most situations Justin Gottula
@ 2020-04-16 21:47 ` Minchan Kim
  2020-04-17  0:45   ` Justin Gottula
  0 siblings, 1 reply; 5+ messages in thread
From: Minchan Kim @ 2020-04-16 21:47 UTC (permalink / raw)
  To: Justin Gottula, sergey.senozhatsky.work
  Cc: Nitin Gupta, Sergey Senozhatsky, Jens Axboe, linux-block, linux-kernel

On Tue, Apr 14, 2020 at 10:24:18PM -0700, Justin Gottula wrote:
> Commit 3b82a051c101 ("zram: fix error return codes not being returned in
> writeback_store") altered the return value logic in writeback_store such
> that ret = len is done at the beginning of the function, which alleviated
> an issue where the -EIO and -ENOSPC break-outs from the for-loop would
> unintentionally hit the ret = len previously located at the end of the
> function. That change in itself was largely fine; however, it exposed
> another return value logic problem.
> 
> Currently, when the writeback_store for-loop does _not_ encounter the -EIO
> or -ENOSPC situations, and _does_ attempt to write at least one block to
> the backing device, the function will simply return whatever the most
> recent submit_bio_wait return value was. This means that if at least one
> block write is attempted, the function will never actually return len; and,
> if the last written block's call to submit_bio_wait was successful, then
> the function will actually return _zero_. (Prior to the above-mentioned
> patch, this situation was masked by the fact that the ret = len assignment
> occurred _after_ all calls to submit_bio_wait.)
> 
> So, if writeback of one or more blocks is attempted, and the last
> submit_bio_wait is successful (which will usually be the case), userspace
> programs performing write(3) syscalls to /sys/block/zramN/writeback will
> receive a return value of zero. Well-written programs use non-negative
> return values from write(3) to determine if a partial write occurred, and
> in that circumstance, they generally add the value to their buffer pointer
> and call write(3) again. Programs written this way will, as a result,
> effectively spin in a loop calling write(3) forever (or at least until an
> actual error value does finally crop up), repeatedly writing the entire
> string buffer ("huge" or "idle") to the writeback file and getting the
> false impression that no forward progress is being made, since zero bytes
> of the buffer are apparently being consumed with each syscall.
> 
> I came across this problem while adding a new parameter to the zramctl
> program, in a personal fork of the util-linux suite, that would allow the
> user to easily request writeback on a zram device. In zramctl (and the
> util-linux programs in general), sysfs writes are typically done via
> ul_path_write_string, which invokes write_all. The write_all utility
> function is written in such a way that write(3) return values _not greater
> than zero_ are treated as error conditions; upon which, any errno value
> other than EINTR or EAGAIN will cause the function to bail out and indicate
> an error. A return value of zero is indeed not greater than zero; and an errno
> value of zero is neither EINTR nor EAGAIN; so write_all bails and indicates
> an error. (And thanks to automatic strerror-based printout functions, this
> of course produces the message "error: Success", a personal favorite of
> mine.)
> 
> The writeback_store return value problem is addressed by storing the
> submit_bio_wait result in a separate local variable, which is only assigned
> to ret if it is actually an errno value.
> 
> The behavior should now be that, if an error arises from one or more of the
> calls to submit_bio_wait during writeback_store, and none of the other
> explicit error cases in the function are triggered, then the most recent
> submit_bio_wait error will be returned. Otherwise, the function will return
> len.
> 
> Fixes: a939888ec38b ("zram: support idle/huge page writeback")
> Signed-off-by: Justin Gottula <justin@jgottula.com>

Thanks for the patch, Justin.

I couldn't remember why I wanted to do a continue here even though we knew
the write had failed from the beginning.

Couldn't we just bail out whenever we encounter the error?
Sergey, Justin, what do you think?

IMO, it would be more consistent with other error handling.

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 1bdb5793842b..e444359edaf5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -703,12 +703,13 @@ static ssize_t writeback_store(struct device *dev,
                /* Need for hugepage writeback racing */
                zram_set_flag(zram, index, ZRAM_IDLE);
                zram_slot_unlock(zram, index);
-               if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
+               ret = zram_bvec_read(zram, &bvec, index, 0, NULL);
+               if (ret) {
                        zram_slot_lock(zram, index);
                        zram_clear_flag(zram, index, ZRAM_UNDER_WB);
                        zram_clear_flag(zram, index, ZRAM_IDLE);
                        zram_slot_unlock(zram, index);
-                       continue;
+                       break;
                }

                bio_init(&bio, &bio_vec, 1);
@@ -728,7 +729,7 @@ static ssize_t writeback_store(struct device *dev,
                        zram_clear_flag(zram, index, ZRAM_UNDER_WB);
                        zram_clear_flag(zram, index, ZRAM_IDLE);
                        zram_slot_unlock(zram, index);
-                       continue;
+                       break;
                }

                atomic64_inc(&zram->stats.bd_writes);


> ---
> Side Note: My understanding of whether a return value of zero from
> write(3) is actually even valid--given a nonzero length parameter--became a
> whole lot murkier when I took a really close look at the Linux and POSIX
> man pages, as well as Internet discussions on the topic. Initially I was
> sure that a zero return value indicated that no error had occurred, but
> that no forward progress on the write had been made either; and that as
> such, the implementation of write_all in util-linux was faulty. But now
> I've come to understand that apparently some historical implementations of
> write(3) have actually returned zero as an indication that a non-blocking
> write on certain special files would block (though modern POSIX specifies
> returning -1 and setting errno to EAGAIN for this case); and that a return
> value of zero may apparently be used as an EOF indication. So perhaps
> treating a zero result as an error in write_all, implicit and accidental
> though it may have appeared, was actually intentional and well-founded all
> along; I don't entirely know. In any case, though, it's clear that the zram
> driver shouldn't be returning 0 in this circumstance.
> 
>  drivers/block/zram/zram_drv.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 1bf4a908a0bd..e55330a615c2 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -627,6 +627,7 @@ static ssize_t writeback_store(struct device *dev,
>     struct bio_vec bio_vec;
>     struct page *page;
>     ssize_t ret = len;
> +   int bio_ret;
>     int mode;
>     unsigned long blk_idx = 0;
> 
> @@ -719,8 +720,9 @@ static ssize_t writeback_store(struct device *dev,
>          * XXX: A single page IO would be inefficient for write
>          * but it would be not bad as starter.
>          */
> -       ret = submit_bio_wait(&bio);
> -       if (ret) {
> +       bio_ret = submit_bio_wait(&bio);
> +       if (bio_ret) {
> +           ret = bio_ret;
>             zram_slot_lock(zram, index);
>             zram_clear_flag(zram, index, ZRAM_UNDER_WB);
>             zram_clear_flag(zram, index, ZRAM_IDLE);
> --
> 2.25.1


* Re: [PATCH] zram: fix writeback_store returning zero in most situations
  2020-04-16 21:47 ` Minchan Kim
@ 2020-04-17  0:45   ` Justin Gottula
  2020-04-17  0:52     ` Justin Gottula
  2020-04-17 15:35     ` Minchan Kim
  0 siblings, 2 replies; 5+ messages in thread
From: Justin Gottula @ 2020-04-17  0:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Nitin Gupta, Jens Axboe, linux-block, linux-kernel

On Thu, Apr 16, 2020 at 2:47 PM Minchan Kim <minchan@kernel.org> wrote:
> I couldn't remember why I wanted to do a continue here even though we knew
> the write had failed from the beginning.
>
> Couldn't we just bail out whenever we encounter the error?
> Sergey, Justin, what do you think?
>
> IMO, it would be more consistent with other error handling.

As far as being consistent with the rest of the error handling in the
loop, I think there's a reasonable distinction that can be drawn between
some of the error cases and others. With some, breaking out immediately is
obviously the correct action; but with others, it's not as clear-cut.

For the zram->wb_limit_enable && !zram->bd_wb_limit case, breaking out of
the loop and immediately returning makes complete sense: once we've hit
the writeback limit, there's nothing that could happen in future loop
iterations that could cause us to somehow _not_ be at the limit anymore.
So there's no reason to continue the loop.

Similarly, for the alloc_block_bdev(zram) failure case, breaking out of
the loop and immediately returning also makes complete sense: if we've
already run out of blocks on the backing device, then continuing to loop
isn't going to result in us somehow ending up with more backing device
blocks available. So there's no reason to continue the loop in that case
either.

With the zram_bvec_read and submit_bio_wait failure cases, though, it's
not nearly as clear that a failure on the current slot _automatically_
guarantees that all future slots would also have errors. For example, in
the zram_bvec_read failure case, it's entirely possible that the
decompression code in the currently-used compression backend might have
had a problem with the current buffer; but that doesn't necessarily seem
like a clear-cut indication that the same decompression error would
definitely happen on later iterations of the loop: in fact, many of the
later slots could very well be ZRAM_HUGE or ZRAM_SAME, and so they would
avoid that type of error entirely. And in the submit_bio_wait failure
case, it's also conceivable that the backing device may have had some sort
of write error with the current block and data buffer, but that the same
error might not happen again on future iterations of the loop when writing
a different buffer to potentially an entirely different block. (Especially
if the backing device happens to be using some kind of complicated driver,
like one of the weirder LVM/device-mapper backends or a network block
device.)

So, at least when it comes to "are these particular failure cases the kind
where it would certainly make no sense to continue with future loop
iterations because we would definitely have the same failures then too",
there's a reasonable argument that the zram_bvec_read and submit_bio_wait
error cases don't necessarily fit that logic and so potentially shouldn't
immediately break and return.

A couple of other considerations came to mind while thinking about this:

1. From the standpoint of the user, what should they reasonably be
expecting to happen when writing "huge"/"idle" to the writeback file? A
best-effort attempt at doing all possible block writebacks (of the
requested type) that can possibly be successfully accomplished? Or,
writeback of only as many blocks as can be done until a (possibly
transient or not-applicable-to-all-blocks) error pops up, and then
stopping at that point? I tend to think that the former makes a bit more
sense, but I don't know for sure if it's the definite Right Answer.

2. There is a bit of a predicament with the keep-going-even-if-an-error-
happened approach. There could be multiple submit_bio_wait failures (e.g.
due to a few blocks that couldn't be written-back due to transient write
errors on the backing device), each with their own error return value; but
we can only return one error code from writeback_store. So how do we even
communicate a multiple-errors-happened situation to the user properly? My
patch, as originally written, would give the user an errno corresponding
to the last submit_bio_wait failure to happen; but of course that leaves
out any other error codes that may have also come up on prior write
attempts that failed (and those failures could have been for different
reasons). And then also, it's entirely possible that submit_bio_wait may
return -EIO or -ENOSPC or -ENOMEM, which would be ambiguous with the other
meanings of those error codes from the writeback_store function itself
(-EIO for WB limit, -ENOSPC for no free blocks on backing device, -ENOMEM
if couldn't allocate the temporary page).

As for the main issue of whether or not to break out of the loop in the
event of a backing device write error, I think it probably makes more
sense not to break out. But it could probably be argued either way.


* Re: [PATCH] zram: fix writeback_store returning zero in most situations
  2020-04-17  0:45   ` Justin Gottula
@ 2020-04-17  0:52     ` Justin Gottula
  2020-04-17 15:35     ` Minchan Kim
  1 sibling, 0 replies; 5+ messages in thread
From: Justin Gottula @ 2020-04-17  0:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Nitin Gupta, Jens Axboe, linux-block, linux-kernel

On Thu, Apr 16, 2020 at 5:45 PM Justin Gottula <justin@jgottula.com> wrote:
>
> 2. There is a bit of a predicament with the keep-going-even-if-an-error-
> happened approach. [...]

Oops. I meant to also include this in my reply:

I suppose a reasonable way to address #2 might be to condense any error(s)
arising from submit_bio_wait into one writeback_store error return value
whose purpose is to indicate that the backing device itself had some sort
of write problem(s).
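
In code, I'm imagining something like this (untested sketch, not a proper
patch; bd_write_error would be a new local, and the -EIO at the end is just
a placeholder given the error-code overloading I mentioned in #2):

        bool bd_write_error = false;    /* new local near the top */

        /* ... inside the main slot loop, at the submit_bio_wait call ... */
        bio_ret = submit_bio_wait(&bio);
        if (bio_ret) {
                bd_write_error = true;  /* remember it, but keep going */
                zram_slot_lock(zram, index);
                zram_clear_flag(zram, index, ZRAM_UNDER_WB);
                zram_clear_flag(zram, index, ZRAM_IDLE);
                zram_slot_unlock(zram, index);
                continue;
        }

        /* ... after the loop, before returning ... */
        if (bd_write_error)
                ret = -EIO;     /* placeholder; a dedicated code would avoid
                                   the overloading mentioned in #2 */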


* Re: [PATCH] zram: fix writeback_store returning zero in most situations
  2020-04-17  0:45   ` Justin Gottula
  2020-04-17  0:52     ` Justin Gottula
@ 2020-04-17 15:35     ` Minchan Kim
  1 sibling, 0 replies; 5+ messages in thread
From: Minchan Kim @ 2020-04-17 15:35 UTC (permalink / raw)
  To: Justin Gottula
  Cc: Sergey Senozhatsky, Nitin Gupta, Jens Axboe, linux-block, linux-kernel

Hi Justin,

On Thu, Apr 16, 2020 at 05:45:31PM -0700, Justin Gottula wrote:
> On Thu, Apr 16, 2020 at 2:47 PM Minchan Kim <minchan@kernel.org> wrote:
> > I couldn't remember why I wanted to do a continue here even though we knew
> > the write had failed from the beginning.
> >
> > Couldn't we just bail out whenever we encounter the error?
> > Sergey, Justin, what do you think?
> >
> > IMO, it would be more consistent with other error handling.
> 
> As far as being consistent with the rest of the error handling in the
> loop, I think there's a reasonable distinction that can be drawn between
> some of the error cases and others. With some, breaking out immediately is
> obviously the correct action; but with others, it's not as clear-cut.
> 
> For the zram->wb_limit_enable && !zram->bd_wb_limit case, breaking out of
> the loop and immediately returning makes complete sense: once we've hit
> the writeback limit, there's nothing that could happen in future loop
> iterations that could cause us to somehow _not_ be at the limit anymore.
> So there's no reason to continue the loop.
> 
> Similarly, for the alloc_block_bdev(zram) failure case, breaking out of
> the loop and immediately returning also makes complete sense: if we've
> already run out of blocks on the backing device, then continuing to loop
> isn't going to result in us somehow ending up with more backing device
> blocks available. So there's no reason to continue the loop in that case
> either.
> 
> With the zram_bvec_read and submit_bio_wait failure cases, though, it's
> not nearly as clear that a failure on the current slot _automatically_
> guarantees that all future slots would also have errors. For example, in
> the zram_bvec_read failure case, it's entirely possible that the
> decompression code in the currently-used compression backend might have
> had a problem with the current buffer; but that doesn't necessarily seem
> like a clear-cut indication that the same decompression error would
> definitely happen on later iterations of the loop: in fact, many of the
> later slots could very well be ZRAM_HUGE or ZRAM_SAME, and so they would
> avoid that type of error entirely. And in the submit_bio_wait failure
> case, it's also conceivable that the backing device may have had some sort
> of write error with the current block and data buffer, but that the same
> error might not happen again on future iterations of the loop when writing
> a different buffer to potentially an entirely different block. (Especially
> if the backing device happens to be using some kind of complicated driver,
> like one of the weirder LVM/device-mapper backends or a network block
> device.)
> 
> So, at least when it comes to "are these particular failure cases the kind
> where it would certainly make no sense to continue with future loop
> iterations because we would definitely have the same failures then too",
> there's a reasonable argument that the zram_bvec_read and submit_bio_wait
> error cases don't necessarily fit that logic and so potentially shouldn't
> immediately break and return.
> 
> A couple of other considerations came to mind while thinking about this:
> 
> 1. From the standpoint of the user, what should they reasonably be
> expecting to happen when writing "huge"/"idle" to the writeback file? A
> best-effort attempt at doing all possible block writebacks (of the
> requested type) that can possibly be successfully accomplished? Or,
> writeback of only as many blocks as can be done until a (possibly
> transient or not-applicable-to-all-blocks) error pops up, and then
> stopping at that point? I tend to think that the former makes a bit more
> sense, but I don't know for sure if it's the definite Right Answer.

I agree it should be best effort.

> 
> 2. There is a bit of a predicament with the keep-going-even-if-an-error-
> happened approach. There could be multiple submit_bio_wait failures (e.g.
> due to a few blocks that couldn't be written-back due to transient write
> errors on the backing device), each with their own error return value; but
> we can only return one error code from writeback_store. So how do we even
> communicate a multiple-errors-happened situation to the user properly? My
> patch, as originally written, would give the user an errno corresponding
> to the last submit_bio_wait failure to happen; but of course that leaves
> out any other error codes that may have also come up on prior write
> attempts that failed (and those failures could have been for different
> reasons). And then also, it's entirely possible that submit_bio_wait may
> return -EIO or -ENOSPC or -ENOMEM, which would be ambiguous with the other
> meanings of those error codes from the writeback_store function itself
> (-EIO for WB limit, -ENOSPC for no free blocks on backing device, -ENOMEM
> if couldn't allocate the temporary page).

Indeed.

What I wanted with my suggestion was for the user to see the error as soon
as something popped up, so they could judge what to do about it, as well as
more consistent code structuring. But based on your comment, they couldn't,
since several error codes are overloaded here. That makes my suggestion
moot. Another thought: if we ever changed writeback to work asynchronously
(and synced right before returning to the user), that would also make my
suggestion moot. So, yes, your approach is much saner.

Could you resend it, dealing with the zram_bvec_read failure as well
(something like the sketch below)? And change the footer to the following:
Fixes: 3b82a051c101 ("zram: fix error return codes not being returned in writeback_store")
Cc: stable@vger.kernel.org
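
For the zram_bvec_read part, I mean roughly the following shape (just an
untested sketch; err would be a new local):

        /* sketch: record the failure but keep going, as with submit_bio_wait */
        err = zram_bvec_read(zram, &bvec, index, 0, NULL);
        if (err) {
                ret = err;
                zram_slot_lock(zram, index);
                zram_clear_flag(zram, index, ZRAM_UNDER_WB);
                zram_clear_flag(zram, index, ZRAM_IDLE);
                zram_slot_unlock(zram, index);
                continue;
        }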

