Linux-Block Archive on lore.kernel.org
* BLKSECDISCARD ioctl and hung tasks
@ 2020-02-12 22:27 Salman Qazi
  2020-02-12 23:06 ` Theodore Y. Ts'o
  2020-02-13  8:26 ` Ming Lei
  0 siblings, 2 replies; 19+ messages in thread
From: Salman Qazi @ 2020-02-12 22:27 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Bart Van Assche, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block
  Cc: Gwendal Grignou, Jesse Barnes

Hi,

So, here's another issue that we are grappling with, one for which we
have a root cause but don't currently have a good fix.  BLKSECDISCARD is
an operation used for securely destroying a subset of the data on a
device.  Unfortunately, on SSDs, this is an operation with variable
performance.  It can be O(minutes) in the worst case.  The
pathological case is when many erase blocks on the flash contain a
small amount of data that is part of the discard and a large amount of
data that isn't.  In such cases, the erase blocks have to be copied
almost in entirety to fresh blocks, in order to erase the sectors to
be discarded. This can be thought of as a defragmentation operation on
the drive and can be expected to cost in the same ballpark as
rewriting most of the contents of the drive.

Therefore, it is possible for the thread in the IOCTL, waiting in the
submit_bio_wait() call in blkdev_issue_discard(), to wait for several
minutes.  The hung task watchdog is usually configured for 2 minutes,
and this can expire before the operation finishes.

This operation is very important to the security model of Chrome OS
devices.  Therefore, we would like the kernel to survive this even if
it takes several minutes.

Three approaches come to mind:

One approach is to somehow avoid waiting for a single monolithic
operation and instead wait on bits and pieces of the operation.  These
can be sized to finish within a reasonable timeframe.  The exact size
is likely device-specific.  We already split these operations before
issuing to the device, but the IOCTL thread is waiting for the whole
rather than the parts. The hung task watchdog only sees the total
amount of time the thread slept and not the forward progress taking
place quietly.
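For illustration, this chunking can even be approximated from user
space today by issuing the ioctl over bounded pieces.  A sketch (the
chunk size is an arbitrary assumption, and BLKSECDISCARD gives no
per-chunk latency guarantee, so this only shortens each individual
sleep rather than bounding it):

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Arbitrary chunk size, assumed small enough that one ioctl usually
 * finishes well inside the watchdog budget on the devices we care about. */
#define CHUNK_BYTES (64ULL * 1024 * 1024)

/* Length of the next chunk to discard out of 'len' remaining bytes. */
static uint64_t next_chunk(uint64_t len)
{
	return len < CHUNK_BYTES ? len : CHUNK_BYTES;
}

/* Issue BLKSECDISCARD over [start, start + len) one chunk at a time,
 * so the thread never sleeps on a single monolithic operation. */
static int secdiscard_chunked(int fd, uint64_t start, uint64_t len)
{
	while (len) {
		uint64_t range[2] = { start, next_chunk(len) };

		if (ioctl(fd, BLKSECDISCARD, &range) < 0)
			return -1;
		start += range[1];
		len -= range[1];
	}
	return 0;
}
```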

Another approach might be to do something in the spirit of the write
system call: complete the partial operation (whatever the kernel
thinks is reasonable), adjust the IOCTL argument and have the
userspace reissue the syscall to continue the operation.  The second
option should probably be done with a different IOCTL name to avoid
breaking userspace.
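A sketch of the contract this second approach implies, with a simulated
stand-in for the kernel side (BLKSECDISCARD2 and the per-call quantum
are invented for illustration; a real implementation would size each
call so it stays well under the hung task timeout):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative quantum the "kernel" completes per call before
 * returning control to user space. */
#define QUANTUM (16ULL * 1024 * 1024)

/* Stand-in for a hypothetical BLKSECDISCARD2 ioctl: discard part of
 * the range, then advance range[0] and shrink range[1] so user space
 * can reissue the call with the adjusted argument. */
static int fake_secdiscard2(uint64_t range[2])
{
	uint64_t done = range[1] < QUANTUM ? range[1] : QUANTUM;

	range[0] += done;
	range[1] -= done;
	return 0;
}

/* The user-space reissue loop; returns the number of calls needed. */
static unsigned secdiscard_reissue(uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };
	unsigned calls = 0;

	while (range[1]) {
		if (fake_secdiscard2(range) < 0)
			break;
		calls++;
	}
	return calls;
}
```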

A third approach, which is perhaps more adventurous, is to have a
notion of forward progress that a thread can export and the hung task
watchdog can evaluate.  This can take the form of a function pointer
and an argument.  The result of the function is a monotonically
decreasing unsigned value.  When this value stops changing, we can
conclude that the thread is hung.  This can be used in place of
context switch count for tasks where this function is available.  This
can potentially solve other similar issues where there is a way to tell
whether forward progress is being made, but it is not as
straightforward as the context switch count.
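A minimal model of that idea (all names are hypothetical; the real
watchdog would sample this from hung task detector context in place of
the context switch count):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task hook: remaining() must return a monotonically
 * decreasing value while the task makes progress, reaching zero when
 * the operation completes. */
struct progress_hook {
	unsigned long (*remaining)(void *arg);
	void *arg;
	unsigned long last_seen;	/* value at the previous watchdog pass */
};

/* One watchdog pass: flag the task as hung only if the reported value
 * has not moved since the last pass and work is still outstanding. */
static int watchdog_pass(struct progress_hook *h)
{
	unsigned long now = h->remaining(h->arg);
	int hung = (now == h->last_seen && now != 0);

	h->last_seen = now;
	return hung;
}

/* Example progress source: bytes of a discard still outstanding. */
static unsigned long discard_bytes_left;

static unsigned long discard_remaining(void *arg)
{
	(void)arg;
	return discard_bytes_left;
}
```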

What are your thoughts?

Thanks in advance,

Salman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-12 22:27 BLKSECDISCARD ioctl and hung tasks Salman Qazi
@ 2020-02-12 23:06 ` Theodore Y. Ts'o
  2020-02-13  1:20   ` Salman Qazi
  2020-02-13  8:26 ` Ming Lei
  1 sibling, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-12 23:06 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Jens Axboe, Ming Lei, Bart Van Assche, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

This is a problem we've been struggling with in other contexts.  For
example, if you have the hung task timer set to 2 minutes, the system
configured to panic if the hung task timeout is exceeded, and an NFS
server which a client is writing to crashes and takes longer than that
to come back, that is a situation where we might want to exempt the
hung task warning from panicking the system.  On the
other hand, if the process is failing to schedule for other reasons,
maybe we would still want the hung task timeout to go off.

So I've been meditating over whether the right answer is to just
globally configure the hung task timer to something like 5 or 10
minutes (which would require no kernel changes, yay?), or have some
way of telling the hung task timeout logic that it shouldn't apply, or
should have a different timeout, when we're waiting for I/O to
complete.

It seems to me that perhaps there's a different solution here for your
specific case, which is: what if there were an asynchronous version of
BLKSECDISCARD, either using io_uring or some other interface?  That
bypasses the whole issue of how do we modulate the hung task timeout
when it's a situation where maybe it's OK for a userspace thread to
block for more than 120 seconds, without having to either completely
disable the hung task timeout, or change it globally to some much
larger value.

					- Ted


* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-12 23:06 ` Theodore Y. Ts'o
@ 2020-02-13  1:20   ` Salman Qazi
  2020-02-13  1:24     ` Jesse Barnes
  0 siblings, 1 reply; 19+ messages in thread
From: Salman Qazi @ 2020-02-13  1:20 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Jens Axboe, Ming Lei, Bart Van Assche, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On Wed, Feb 12, 2020 at 3:07 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> This is a problem we've been strugging with in other contexts.  For
> example, if you have the hung task timer set to 2 minutes, and the
> system to panic if the hung task timer exceeds that, and an NFS server
> which the client is writing to crashes, and it takes longer for the
> NFS server to come back, that might be a situation where we might want
> to exempt the hung task warning from panic'ing the system.  On the
> other hand, if the process is failing to schedule for other reasons,
> maybe we would still want the hung task timeout to go off.
>
> So I've been meditating over whether the right answer is to just
> globally configure the hung task timer to something like 5 or 10
> minutes (which would require no kernel changes, yay?), or have some
> way of telling the hung task timeout logic that it shouldn't apply, or
> should have a different timeout, when we're waiting for I/O to
> complete.

The problem that I anticipate in our space is that a generous timeout
will make impatient people reboot their chromebooks, losing us
information about hangs.  But this can be worked around by having
multiple different timeouts.  For instance, a thread that expects to
do something slow can set a flag to indicate that it wishes to be held
against the more generous criteria.  This is something I am tempted to
do on older kernels where we might not feel comfortable backporting
io_uring.

>
> It seems to me that perhaps there's a different solution here for your
> specific case, which is what if there is a asynchronous version of
> BLKSECDISCARD, either using io_uring or some other interface?  That
> bypasses the whole issue of how do we modulate the hung task timeout
> when it's a situation where maybe it's OK for a userspace thread to
> block for more than 120 seconds, without having to either completely
> disable the hung task timeout, or changing it globally to some much
> larger value.

This is worth evaluating.

Thanks,

Salman

>
>                                         - Ted


* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-13  1:20   ` Salman Qazi
@ 2020-02-13  1:24     ` Jesse Barnes
  0 siblings, 0 replies; 19+ messages in thread
From: Jesse Barnes @ 2020-02-13  1:24 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Theodore Y. Ts'o, Jens Axboe, Ming Lei, Bart Van Assche,
	Christoph Hellwig, Linux Kernel Mailing List, linux-block,
	Gwendal Grignou

On Wed, Feb 12, 2020 at 5:20 PM Salman Qazi <sqazi@google.com> wrote:
>
> On Wed, Feb 12, 2020 at 3:07 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
> >
> > This is a problem we've been strugging with in other contexts.  For
> > example, if you have the hung task timer set to 2 minutes, and the
> > system to panic if the hung task timer exceeds that, and an NFS server
> > which the client is writing to crashes, and it takes longer for the
> > NFS server to come back, that might be a situation where we might want
> > to exempt the hung task warning from panic'ing the system.  On the
> > other hand, if the process is failing to schedule for other reasons,
> > maybe we would still want the hung task timeout to go off.
> >
> > So I've been meditating over whether the right answer is to just
> > globally configure the hung task timer to something like 5 or 10
> > minutes (which would require no kernel changes, yay?), or have some
> > way of telling the hung task timeout logic that it shouldn't apply, or
> > should have a different timeout, when we're waiting for I/O to
> > complete.
>
> The problem that I anticipate in our space is that a generous timeout
> will make impatient people reboot their chromebooks, losing us
> information
> about hangs.  But, this can be worked around by having multiple
> different timeouts.  For instance, a thread that is expecting to do
> something slow, can set a flag
> to indicate that it wishes to be held against the more generous
> criteria.  This is something I am tempted to do on older kernels where
> we might not feel
> comfortable backporting io_uring.

I was going to reply along the same lines when I got distracted by a
mtg.  If anything I'd like to see a LOWER hung task timeout, generally
speaking.  And maybe that means having more operations be asynchronous
like Ted suggests (I'm generally a fan of that anyway).

[snipped good suggestion about async interface]

Jesse


* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-12 22:27 BLKSECDISCARD ioctl and hung tasks Salman Qazi
  2020-02-12 23:06 ` Theodore Y. Ts'o
@ 2020-02-13  8:26 ` Ming Lei
  2020-02-13 17:48   ` Bart Van Assche
  1 sibling, 1 reply; 19+ messages in thread
From: Ming Lei @ 2020-02-13  8:26 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On Wed, Feb 12, 2020 at 02:27:09PM -0800, Salman Qazi wrote:
> Hi,
> 
> So, here's another issue that we are grappling with, where we have a
> root-cause but don't currently have a good fix for.  BLKSECDISCARD is
> an operation used for securely destroying a subset of the data on a
> device.  Unfortunately, on SSDs, this is an operation with variable
> performance.  It can be O(minutes) in the worst case.  The
> pathological case is when many erase blocks on the flash contain a
> small amount of data that is part of the discard and a large amount of
> data that isn't.  In such cases, the erase blocks have to be copied
> almost in entirety to fresh blocks, in order to erase the sectors to
> be discarded. This can be thought of as a defragmentation operation on
> the drive and can be expected to cost in the same ballpark as
> rewriting most of the contents of the drive.
> 
> Therefore, it is possible for the thread waiting in the IOCTL in
> submit_bio_wait call in blkdev_issue_discard to wait for several
> minutes.  The hung task watchdog is usually configured for 2 minutes,
> and this can expire before the operation finishes.
> 
> This operation is very important to the security model of Chrome OS
> devices.  Therefore, we would like the kernel to survive this even if
> it takes several minutes.
> 
> Three approaches come to mind:
> 
> One approach is to somehow avoid waiting for a single monolithic
> operation and instead wait on bits and pieces of the operation.  These
> can be sized to finish within a reasonable timeframe.  The exact size
> is likely device-specific.  We already split these operations before
> issuing to the device, but the IOCTL thread is waiting for the whole
> rather than the parts. The hung task watchdog only sees the total
> amount of time the thread slept and not the forward progress taking
> place quietly.
> 
> Another approach might be to do something in the spirit of the write
> system call: complete the partial operation (whatever the kernel
> thinks is reasonable), adjust the IOCTL argument and have the
> userspace reissue the syscall to continue the operation.  The second
> option should probably be done with a different IOCTL name to avoid
> breaking userspace.
> 
> A third approach, which is perhaps more adventurous, is to have a
> notion of forward progress that a thread can export and the hung task
> watchdog can evaluate.  This can take the form of a function pointer
> and an argument.  The result of the function is a monotonically
> decreasing unsigned value.  When this value stops changing, we can
> conclude that the thread is hung.  This can be used in place of
> context switch count for tasks where this function is available.  This
> can potentially solve other similar issues, there is a way to tell if
> there is forward progress, but it is not as straightforward as the
> context switch count.
> 
> What are your thoughts?

The approach used in blk_execute_rq() can be borrowed to work around
the issue, such as:

diff --git a/block/bio.c b/block/bio.c
index 94d697217887..c9ce19a86de7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -17,6 +17,7 @@
 #include <linux/cgroup.h>
 #include <linux/blk-cgroup.h>
 #include <linux/highmem.h>
+#include <linux/sched/sysctl.h>
 
 #include <trace/events/block.h>
 #include "blk.h"
@@ -1019,12 +1020,19 @@ static void submit_bio_wait_endio(struct bio *bio)
 int submit_bio_wait(struct bio *bio)
 {
 	DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
+	unsigned long hang_check;
 
 	bio->bi_private = &done;
 	bio->bi_end_io = submit_bio_wait_endio;
 	bio->bi_opf |= REQ_SYNC;
 	submit_bio(bio);
-	wait_for_completion_io(&done);
+
+	/* Prevent hang_check timer from firing at us during very long I/O */
+	hang_check = sysctl_hung_task_timeout_secs;
+	if (hang_check)
+		while (!wait_for_completion_io_timeout(&done, hang_check * (HZ/2)));
+	else
+		wait_for_completion_io(&done);
 
 	return blk_status_to_errno(bio->bi_status);
 }

thanks,
Ming



* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-13  8:26 ` Ming Lei
@ 2020-02-13 17:48   ` Bart Van Assche
  2020-02-13 19:21     ` Salman Qazi
  0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2020-02-13 17:48 UTC (permalink / raw)
  To: Ming Lei, Salman Qazi
  Cc: Jens Axboe, Christoph Hellwig, Linux Kernel Mailing List,
	linux-block, Gwendal Grignou, Jesse Barnes

On 2/13/20 12:26 AM, Ming Lei wrote:
> The approach used in blk_execute_rq() can be borrowed for workaround the
> issue, such as:
> 
> diff --git a/block/bio.c b/block/bio.c
> index 94d697217887..c9ce19a86de7 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -17,6 +17,7 @@
>   #include <linux/cgroup.h>
>   #include <linux/blk-cgroup.h>
>   #include <linux/highmem.h>
> +#include <linux/sched/sysctl.h>
>   
>   #include <trace/events/block.h>
>   #include "blk.h"
> @@ -1019,12 +1020,19 @@ static void submit_bio_wait_endio(struct bio *bio)
>   int submit_bio_wait(struct bio *bio)
>   {
>   	DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
> +	unsigned long hang_check;
>   
>   	bio->bi_private = &done;
>   	bio->bi_end_io = submit_bio_wait_endio;
>   	bio->bi_opf |= REQ_SYNC;
>   	submit_bio(bio);
> -	wait_for_completion_io(&done);
> +
> +	/* Prevent hang_check timer from firing at us during very long I/O */
> +	hang_check = sysctl_hung_task_timeout_secs;
> +	if (hang_check)
> +		while (!wait_for_completion_io_timeout(&done, hang_check * (HZ/2)));
> +	else
> +		wait_for_completion_io(&done);
>   
>   	return blk_status_to_errno(bio->bi_status);
>   }

Instead of suppressing the hung task complaints, has it been considered 
to use the bio splitting mechanism to make discard bios smaller? Block 
drivers may set a limit by calling blk_queue_max_discard_segments(). 
 From block/blk-settings.c:

/**
  * blk_queue_max_discard_segments - set max segments for discard
  * requests
  * @q:  the request queue for the device
  * @max_segments:  max number of segments
  *
  * Description:
  *    Enables a low level driver to set an upper limit on the number of
  *    segments in a discard request.
  **/
void blk_queue_max_discard_segments(struct request_queue *q,
		unsigned short max_segments)
{
	q->limits.max_discard_segments = max_segments;
}
EXPORT_SYMBOL_GPL(blk_queue_max_discard_segments);

Thanks,

Bart.


* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-13 17:48   ` Bart Van Assche
@ 2020-02-13 19:21     ` Salman Qazi
  2020-02-13 22:08       ` Salman Qazi
                         ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Salman Qazi @ 2020-02-13 19:21 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On Thu, Feb 13, 2020 at 9:48 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/13/20 12:26 AM, Ming Lei wrote:
> > The approach used in blk_execute_rq() can be borrowed for workaround the
> > issue, such as:
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 94d697217887..c9ce19a86de7 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -17,6 +17,7 @@
> >   #include <linux/cgroup.h>
> >   #include <linux/blk-cgroup.h>
> >   #include <linux/highmem.h>
> > +#include <linux/sched/sysctl.h>
> >
> >   #include <trace/events/block.h>
> >   #include "blk.h"
> > @@ -1019,12 +1020,19 @@ static void submit_bio_wait_endio(struct bio *bio)
> >   int submit_bio_wait(struct bio *bio)
> >   {
> >       DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
> > +     unsigned long hang_check;
> >
> >       bio->bi_private = &done;
> >       bio->bi_end_io = submit_bio_wait_endio;
> >       bio->bi_opf |= REQ_SYNC;
> >       submit_bio(bio);
> > -     wait_for_completion_io(&done);
> > +
> > +     /* Prevent hang_check timer from firing at us during very long I/O */
> > +     hang_check = sysctl_hung_task_timeout_secs;
> > +     if (hang_check)
> > +             while (!wait_for_completion_io_timeout(&done, hang_check * (HZ/2)));
> > +     else
> > +             wait_for_completion_io(&done);
> >
> >       return blk_status_to_errno(bio->bi_status);
> >   }
>
> Instead of suppressing the hung task complaints, has it been considered
> to use the bio splitting mechanism to make discard bios smaller? Block
> drivers may set a limit by calling blk_queue_max_discard_segments().
>  From block/blk-settings.c:
>
> /**
>   * blk_queue_max_discard_segments - set max segments for discard
>   * requests
>   * @q:  the request queue for the device
>   * @max_segments:  max number of segments
>   *
>   * Description:
>   *    Enables a low level driver to set an upper limit on the number of
>   *    segments in a discard request.
>   **/
> void blk_queue_max_discard_segments(struct request_queue *q,
>                 unsigned short max_segments)
> {
>         q->limits.max_discard_segments = max_segments;
> }
> EXPORT_SYMBOL_GPL(blk_queue_max_discard_segments);
>

AFAICT, this is not actually sufficient, because the issuer of the bio
is waiting for the entire bio, regardless of how it is split later.
But there also isn't a good mapping between the size of the secure
discard and how long it will take.  Given the geometry of a flash
device, it is not hard to construct a scenario where a relatively
small secure discard (a few thousand sectors) will take a very long
time (multiple seconds).

Having said that, I don't like neutering the hung task timer either.

> Thanks,
>
> Bart.


* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-13 19:21     ` Salman Qazi
@ 2020-02-13 22:08       ` Salman Qazi
  2020-02-14  0:25       ` Ming Lei
  2020-02-14  5:49       ` Bart Van Assche
  2 siblings, 0 replies; 19+ messages in thread
From: Salman Qazi @ 2020-02-13 22:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On Thu, Feb 13, 2020 at 11:21 AM Salman Qazi <sqazi@google.com> wrote:
>
> On Thu, Feb 13, 2020 at 9:48 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 2/13/20 12:26 AM, Ming Lei wrote:
> > > The approach used in blk_execute_rq() can be borrowed for workaround the
> > > issue, such as:
> > >
> > > diff --git a/block/bio.c b/block/bio.c
> > > index 94d697217887..c9ce19a86de7 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -17,6 +17,7 @@
> > >   #include <linux/cgroup.h>
> > >   #include <linux/blk-cgroup.h>
> > >   #include <linux/highmem.h>
> > > +#include <linux/sched/sysctl.h>
> > >
> > >   #include <trace/events/block.h>
> > >   #include "blk.h"
> > > @@ -1019,12 +1020,19 @@ static void submit_bio_wait_endio(struct bio *bio)
> > >   int submit_bio_wait(struct bio *bio)
> > >   {
> > >       DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
> > > +     unsigned long hang_check;
> > >
> > >       bio->bi_private = &done;
> > >       bio->bi_end_io = submit_bio_wait_endio;
> > >       bio->bi_opf |= REQ_SYNC;
> > >       submit_bio(bio);
> > > -     wait_for_completion_io(&done);
> > > +
> > > +     /* Prevent hang_check timer from firing at us during very long I/O */
> > > +     hang_check = sysctl_hung_task_timeout_secs;
> > > +     if (hang_check)
> > > +             while (!wait_for_completion_io_timeout(&done, hang_check * (HZ/2)));
> > > +     else
> > > +             wait_for_completion_io(&done);
> > >
> > >       return blk_status_to_errno(bio->bi_status);
> > >   }
> >
> > Instead of suppressing the hung task complaints, has it been considered
> > to use the bio splitting mechanism to make discard bios smaller? Block
> > drivers may set a limit by calling blk_queue_max_discard_segments().
> >  From block/blk-settings.c:
> >
> > /**
> >   * blk_queue_max_discard_segments - set max segments for discard
> >   * requests
> >   * @q:  the request queue for the device
> >   * @max_segments:  max number of segments
> >   *
> >   * Description:
> >   *    Enables a low level driver to set an upper limit on the number of
> >   *    segments in a discard request.
> >   **/
> > void blk_queue_max_discard_segments(struct request_queue *q,
> >                 unsigned short max_segments)
> > {
> >         q->limits.max_discard_segments = max_segments;
> > }
> > EXPORT_SYMBOL_GPL(blk_queue_max_discard_segments);
> >
>
> AFAICT, This is not actually sufficient, because the issuer of the bio
> is waiting for the entire bio, regardless of how it is split later.
> But, also there isn't a good mapping between the size of the secure
> discard and how long it will take.  If given the geometry of a flash
> device, it is not hard to construct a scenario where a relatively
> small secure discard (few thousand sectors) will take a very long time
> (multiple seconds).
>
> Having said that, I don't like neutering the hung task timer either.

In fact, it's worse than that.  Today, I was able to construct a case
of a 4K discard on a particular device that took 100 seconds.  I did
this by writing a single copy of page 0 once per erase unit of the
device, and writing random LBAs to the rest of each erase unit.  I
suspect the slow speed comes from the need to copy almost the entire
device in order to erase all the stale copies of page 0.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <assert.h>
#include <unistd.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

char page[8192];

int main(int argc, char **argv)
{
        unsigned long start;
        int fd;
        unsigned long i;	/* device may be larger than 2^31 bytes */
        char *page_aligned = (char *)(((unsigned long)page + 4095) & ~4095UL);
        unsigned long range[2];
        fd = open(argv[1], O_RDWR | O_DIRECT);
        assert(fd >= 0);
        range[0] = 0;
        assert(ioctl(fd, BLKGETSIZE64, &range[1]) >= 0);
        for (i = 0; i < range[1]; i += 4096) {
                /* paranoia: in case there is any deduping */
                page_aligned[0] = i;
                /*
                 * Almost always write randomly
                 */
                if (i % (4*1024*1024) != 0)
                        assert(pwrite(fd, page_aligned, 4096,
                              (lrand48() % range[1]) & ~4095UL) == 4096);
                else
                        /* except, once per erase block, write page 0 */
                        assert(pwrite(fd, page_aligned, 4096, 0) == 4096);
        }
        start = time(NULL);

        /* discard exactly one page */
        range[1] = 4096;
        printf("Starting discard %lu!\n", start);
        assert(ioctl(fd, BLKSECDISCARD, &range) >= 0);
        printf("Finished discard.  Took %lu!\n", time(NULL) - start);
        close(fd);
        return 0;
}


>
> > Thanks,
> >
> > Bart.


* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-13 19:21     ` Salman Qazi
  2020-02-13 22:08       ` Salman Qazi
@ 2020-02-14  0:25       ` Ming Lei
  2020-02-14  5:49       ` Bart Van Assche
  2 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-02-14  0:25 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Bart Van Assche, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On Thu, Feb 13, 2020 at 11:21:37AM -0800, Salman Qazi wrote:
> On Thu, Feb 13, 2020 at 9:48 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 2/13/20 12:26 AM, Ming Lei wrote:
> > > The approach used in blk_execute_rq() can be borrowed for workaround the
> > > issue, such as:
> > >
> > > diff --git a/block/bio.c b/block/bio.c
> > > index 94d697217887..c9ce19a86de7 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -17,6 +17,7 @@
> > >   #include <linux/cgroup.h>
> > >   #include <linux/blk-cgroup.h>
> > >   #include <linux/highmem.h>
> > > +#include <linux/sched/sysctl.h>
> > >
> > >   #include <trace/events/block.h>
> > >   #include "blk.h"
> > > @@ -1019,12 +1020,19 @@ static void submit_bio_wait_endio(struct bio *bio)
> > >   int submit_bio_wait(struct bio *bio)
> > >   {
> > >       DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
> > > +     unsigned long hang_check;
> > >
> > >       bio->bi_private = &done;
> > >       bio->bi_end_io = submit_bio_wait_endio;
> > >       bio->bi_opf |= REQ_SYNC;
> > >       submit_bio(bio);
> > > -     wait_for_completion_io(&done);
> > > +
> > > +     /* Prevent hang_check timer from firing at us during very long I/O */
> > > +     hang_check = sysctl_hung_task_timeout_secs;
> > > +     if (hang_check)
> > > +             while (!wait_for_completion_io_timeout(&done, hang_check * (HZ/2)));
> > > +     else
> > > +             wait_for_completion_io(&done);
> > >
> > >       return blk_status_to_errno(bio->bi_status);
> > >   }
> >
> > Instead of suppressing the hung task complaints, has it been considered
> > to use the bio splitting mechanism to make discard bios smaller? Block
> > drivers may set a limit by calling blk_queue_max_discard_segments().
> >  From block/blk-settings.c:
> >
> > /**
> >   * blk_queue_max_discard_segments - set max segments for discard
> >   * requests
> >   * @q:  the request queue for the device
> >   * @max_segments:  max number of segments
> >   *
> >   * Description:
> >   *    Enables a low level driver to set an upper limit on the number of
> >   *    segments in a discard request.
> >   **/
> > void blk_queue_max_discard_segments(struct request_queue *q,
> >                 unsigned short max_segments)
> > {
> >         q->limits.max_discard_segments = max_segments;
> > }
> > EXPORT_SYMBOL_GPL(blk_queue_max_discard_segments);
> >
> 
> AFAICT, This is not actually sufficient, because the issuer of the bio
> is waiting for the entire bio, regardless of how it is split later.

Right.

> But, also there isn't a good mapping between the size of the secure
> discard and how long it will take.  If given the geometry of a flash
> device, it is not hard to construct a scenario where a relatively
> small secure discard (few thousand sectors) will take a very long time
> (multiple seconds).

It isn't strange to see such an implementation, and I have also seen
discard request timeouts.

> 
> Having said that, I don't like neutering the hung task timer either.

But it does fix the hung task warning, doesn't it?

thanks,
Ming



* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-13 19:21     ` Salman Qazi
  2020-02-13 22:08       ` Salman Qazi
  2020-02-14  0:25       ` Ming Lei
@ 2020-02-14  5:49       ` Bart Van Assche
  2020-02-14  9:22         ` Ming Lei
  2 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2020-02-14  5:49 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Ming Lei, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On 2020-02-13 11:21, Salman Qazi wrote:
> AFAICT, This is not actually sufficient, because the issuer of the bio
> is waiting for the entire bio, regardless of how it is split later.
> But, also there isn't a good mapping between the size of the secure
> discard and how long it will take.  If given the geometry of a flash
> device, it is not hard to construct a scenario where a relatively
> small secure discard (few thousand sectors) will take a very long time
> (multiple seconds).
> 
> Having said that, I don't like neutering the hung task timer either.

Hi Salman,

How about modifying the block layer such that completions of bio
fragments are considered as task activity? I think that bio splitting is
rare enough for such a change not to affect performance of the hot path.

How about setting max_discard_segments such that a discard always
completes in less than half the hung task timeout? This may make
discards a bit slower for one particular block driver but I think that's
better than hung task complaints.

Thanks,

Bart.


* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-14  5:49       ` Bart Van Assche
@ 2020-02-14  9:22         ` Ming Lei
  2020-02-14 19:42           ` Salman Qazi
  0 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2020-02-14  9:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Salman Qazi, Ming Lei, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On Fri, Feb 14, 2020 at 1:50 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2020-02-13 11:21, Salman Qazi wrote:
> > AFAICT, This is not actually sufficient, because the issuer of the bio
> > is waiting for the entire bio, regardless of how it is split later.
> > But, also there isn't a good mapping between the size of the secure
> > discard and how long it will take.  If given the geometry of a flash
> > device, it is not hard to construct a scenario where a relatively
> > small secure discard (few thousand sectors) will take a very long time
> > (multiple seconds).
> >
> > Having said that, I don't like neutering the hung task timer either.
>
> Hi Salman,
>
> How about modifying the block layer such that completions of bio
> fragments are considered as task activity? I think that bio splitting is
> rare enough for such a change not to affect performance of the hot path.

Are you sure the hung task warning won't be triggered in the
non-splitting case?

>
> How about setting max_discard_segments such that a discard always
> completes in less than half the hung task timeout? This may make
> discards a bit slower for one particular block driver but I think that's
> better than hung task complaints.

I am afraid you can't find a golden max_discard_segments value that works
for every driver. Even if one is found, performance may be affected.

So I am just wondering: why not take the simple approach used in blk_execute_rq()?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-14  9:22         ` Ming Lei
@ 2020-02-14 19:42           ` Salman Qazi
  2020-02-15  3:46             ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Salman Qazi @ 2020-02-14 19:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Ming Lei, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On Fri, Feb 14, 2020 at 1:23 AM Ming Lei <tom.leiming@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 1:50 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 2020-02-13 11:21, Salman Qazi wrote:
> > > AFAICT, This is not actually sufficient, because the issuer of the bio
> > > is waiting for the entire bio, regardless of how it is split later.
> > > But, also there isn't a good mapping between the size of the secure
> > > discard and how long it will take.  If given the geometry of a flash
> > > device, it is not hard to construct a scenario where a relatively
> > > small secure discard (few thousand sectors) will take a very long time
> > > (multiple seconds).
> > >
> > > Having said that, I don't like neutering the hung task timer either.
> >
> > Hi Salman,
> >
> > How about modifying the block layer such that completions of bio
> > fragments are considered as task activity? I think that bio splitting is
> > rare enough for such a change not to affect performance of the hot path.
>
> Are you sure that the task hung warning won't be triggered in case of
> non-splitting?

I demonstrated a few emails ago that it doesn't take a very large
secure discard command to trigger this.  So, I am sceptical that we
will be able to use splitting to solve this.

>
> >
> > How about setting max_discard_segments such that a discard always
> > completes in less than half the hung task timeout? This may make
> > discards a bit slower for one particular block driver but I think that's
> > better than hung task complaints.
>
> I am afraid you can't find a golden max_discard_segments value that works
> for every driver. Even if one is found, performance may be affected.
>
> So just wondering why not take the simple approach used in blk_execute_rq()?

My colleague Gwendal pointed out another issue that I had missed:
secure discard is an exclusive command; it monopolizes the device.
Even if we fix this via your approach, the problem will show up
somewhere else, because other operations on the drive will not make
progress for that length of time.

For Chromium OS purposes, if we had a blank slate, this is how I would solve it:

* Under the assumption that the truly sensitive data is not very big:
    * Keep secure data on a separate partition to make sure that those
LBAs have controlled history
    * Treat the files in that partition as immutable (i.e. no
overwriting the contents of the file without first secure erasing the
existing contents).
    * By never letting more than one version of the file accumulate,
we can guarantee that the secure erase will always be fast for
moderate-sized files.

But for all the existing machines with keys on them, we will need to
do something else.



>
> Thanks,
> Ming Lei

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-14 19:42           ` Salman Qazi
@ 2020-02-15  3:46             ` Ming Lei
  2020-02-18 16:11               ` Jesse Barnes
  0 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2020-02-15  3:46 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Ming Lei, Bart Van Assche, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, linux-block, Gwendal Grignou,
	Jesse Barnes

On Fri, Feb 14, 2020 at 11:42:32AM -0800, Salman Qazi wrote:
> On Fri, Feb 14, 2020 at 1:23 AM Ming Lei <tom.leiming@gmail.com> wrote:
> >
> > On Fri, Feb 14, 2020 at 1:50 PM Bart Van Assche <bvanassche@acm.org> wrote:
> > >
> > > On 2020-02-13 11:21, Salman Qazi wrote:
> > > > AFAICT, This is not actually sufficient, because the issuer of the bio
> > > > is waiting for the entire bio, regardless of how it is split later.
> > > > But, also there isn't a good mapping between the size of the secure
> > > > discard and how long it will take.  If given the geometry of a flash
> > > > device, it is not hard to construct a scenario where a relatively
> > > > small secure discard (few thousand sectors) will take a very long time
> > > > (multiple seconds).
> > > >
> > > > Having said that, I don't like neutering the hung task timer either.
> > >
> > > Hi Salman,
> > >
> > > How about modifying the block layer such that completions of bio
> > > fragments are considered as task activity? I think that bio splitting is
> > > rare enough for such a change not to affect performance of the hot path.
> >
> > Are you sure that the task hung warning won't be triggered in case of
> > non-splitting?
> 
> I demonstrated a few emails ago that it doesn't take a very large
> secure discard command to trigger this.  So, I am sceptical that we
> will be able to use splitting to solve this.
> 
> >
> > >
> > > How about setting max_discard_segments such that a discard always
> > > completes in less than half the hung task timeout? This may make
> > > discards a bit slower for one particular block driver but I think that's
> > > better than hung task complaints.
> >
> > I am afraid you can't find a golden max_discard_segments value that works
> > for every driver. Even if one is found, performance may be affected.
> >
> > So just wondering why not take the simple approach used in blk_execute_rq()?
> 
> My colleague Gwendal pointed out another issue which I had missed:
> secure discard is an exclusive command: it monopolizes the device.
> Even if we fix this via your approach, it will show up somewhere else,
> because other operations to the drive will not make progress for that
> length of time.

What are the 'other operations'? Are they block IOs?

If so, that is why I suggest fixing submit_bio_wait(), which should cover
most synchronous bio submissions.

Anyway, the fix is simple and generic enough; I plan to post a formal
patch if no one comes up with a better approach.

> 
> For Chromium OS purposes, if we had a blank slate, this is how I would solve it:
> 
> * Under the assumption that the truly sensitive data is not very big:
>     * Keep secure data on a separate partition to make sure that those
> LBAs have controlled history
>     * Treat the files in that partition as immutable (i.e. no
> overwriting the contents of the file without first secure erasing the
> existing contents).
>     * By never letting more than one version of the file accumulate,
> we can guarantee that the secure erase will always be fast for
> moderate sized files.
> 
> But for all the existing machines with keys on them, we will need to
> do something else.

The issue you reported is a generic one, not Chromium-only.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-15  3:46             ` Ming Lei
@ 2020-02-18 16:11               ` Jesse Barnes
  2020-02-19  1:37                 ` Ming Lei
  2020-02-19  2:54                 ` Ming Lei
  0 siblings, 2 replies; 19+ messages in thread
From: Jesse Barnes @ 2020-02-18 16:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Salman Qazi, Ming Lei, Bart Van Assche, Jens Axboe,
	Christoph Hellwig, Linux Kernel Mailing List, linux-block,
	Gwendal Grignou

On Fri, Feb 14, 2020 at 7:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> What are the 'other operations'? Are they block IOs?
>
> If yes, that is why I suggest to fix submit_bio_wait(), which should cover
> most of sync bio submission.
>
> Anyway, the fix is simple & generic enough, I'd plan to post a formal
> patch if no one figures out better doable approaches.

Yeah I think any block I/O operation that occurs after the
BLKSECDISCARD is submitted will also potentially be affected by the
hung task timeouts, and I think your patch will address that.  My only
concern with it is that it might hide some other I/O "hangs" that are
due to device misbehavior instead.  Yes driver and device timeouts
should generally catch those, but with this in place we might miss a
few bugs.

Given the nature of these types of storage devices though, I think
that's a minor issue and not worth blocking the patch on, given that
it should prevent a lot of false positive hang reports as Salman
demonstrated.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-18 16:11               ` Jesse Barnes
@ 2020-02-19  1:37                 ` Ming Lei
  2020-02-19  2:54                 ` Ming Lei
  1 sibling, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-02-19  1:37 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Salman Qazi, Ming Lei, Bart Van Assche, Jens Axboe,
	Christoph Hellwig, Linux Kernel Mailing List, linux-block,
	Gwendal Grignou

On Tue, Feb 18, 2020 at 08:11:53AM -0800, Jesse Barnes wrote:
> On Fri, Feb 14, 2020 at 7:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> > What are the 'other operations'? Are they block IOs?
> >
> > If yes, that is why I suggest to fix submit_bio_wait(), which should cover
> > most of sync bio submission.
> >
> > Anyway, the fix is simple & generic enough, I'd plan to post a formal
> > patch if no one figures out better doable approaches.
> 
> Yeah I think any block I/O operation that occurs after the
> BLKSECDISCARD is submitted will also potentially be affected by the
> hung task timeouts, and I think your patch will address that.  My only
> concern with it is that it might hide some other I/O "hangs" that are
> due to device misbehavior instead.  Yes driver and device timeouts
> should generally catch those, but with this in place we might miss a
> few bugs.

We could add a warning (device name, request, ...) when an IO isn't
completed within sysctl_hung_task_timeout_secs, so such device
misbehavior can still be caught.

> 
> Given the nature of these types of storage devices though, I think
> that's a minor issue and not worth blocking the patch on, given that
> it should prevent a lot of false positive hang reports as Salman
> demonstrated.

OK, I will post it soon.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-18 16:11               ` Jesse Barnes
  2020-02-19  1:37                 ` Ming Lei
@ 2020-02-19  2:54                 ` Ming Lei
  2020-02-19 17:54                   ` Salman Qazi
  1 sibling, 1 reply; 19+ messages in thread
From: Ming Lei @ 2020-02-19  2:54 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Salman Qazi, Ming Lei, Bart Van Assche, Jens Axboe,
	Christoph Hellwig, Linux Kernel Mailing List, linux-block,
	Gwendal Grignou

On Tue, Feb 18, 2020 at 08:11:53AM -0800, Jesse Barnes wrote:
> On Fri, Feb 14, 2020 at 7:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> > What are the 'other operations'? Are they block IOs?
> >
> > If yes, that is why I suggest to fix submit_bio_wait(), which should cover
> > most of sync bio submission.
> >
> > Anyway, the fix is simple & generic enough, I'd plan to post a formal
> > patch if no one figures out better doable approaches.
> 
> Yeah I think any block I/O operation that occurs after the
> BLKSECDISCARD is submitted will also potentially be affected by the
> hung task timeouts, and I think your patch will address that.  My only
> concern with it is that it might hide some other I/O "hangs" that are
> due to device misbehavior instead.  Yes driver and device timeouts
> should generally catch those, but with this in place we might miss a
> few bugs.
> 
> Given the nature of these types of storage devices though, I think
> that's a minor issue and not worth blocking the patch on, given that
> it should prevent a lot of false positive hang reports as Salman
> demonstrated.

Hello Jesse and Salman,

One more question about this issue: is BLK_WBT enabled on your test
kernel?

Thanks,
Ming


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-19  2:54                 ` Ming Lei
@ 2020-02-19 17:54                   ` Salman Qazi
  2020-02-19 22:22                     ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Salman Qazi @ 2020-02-19 17:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jesse Barnes, Ming Lei, Bart Van Assche, Jens Axboe,
	Christoph Hellwig, Linux Kernel Mailing List, linux-block,
	Gwendal Grignou

On Tue, Feb 18, 2020 at 6:55 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Tue, Feb 18, 2020 at 08:11:53AM -0800, Jesse Barnes wrote:
> > On Fri, Feb 14, 2020 at 7:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > What are the 'other operations'? Are they block IOs?
> > >
> > > If yes, that is why I suggest to fix submit_bio_wait(), which should cover
> > > most of sync bio submission.
> > >
> > > Anyway, the fix is simple & generic enough, I'd plan to post a formal
> > > patch if no one figures out better doable approaches.
> >
> > Yeah I think any block I/O operation that occurs after the
> > BLKSECDISCARD is submitted will also potentially be affected by the
> > hung task timeouts, and I think your patch will address that.  My only
> > concern with it is that it might hide some other I/O "hangs" that are
> > due to device misbehavior instead.  Yes driver and device timeouts
> > should generally catch those, but with this in place we might miss a
> > few bugs.
> >
> > Given the nature of these types of storage devices though, I think
> > that's a minor issue and not worth blocking the patch on, given that
> > it should prevent a lot of false positive hang reports as Salman
> > demonstrated.
>
> Hello Jesse and Salman,
>
> One more question about this issue, do you enable BLK_WBT on your test
> kernel?

It doesn't exist on the original 4.4-based kernel where we reproduced
this bug.  I am curious how it would interact with this bug.

>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-19 17:54                   ` Salman Qazi
@ 2020-02-19 22:22                     ` Ming Lei
  2020-02-19 22:26                       ` Salman Qazi
  0 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2020-02-19 22:22 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Jesse Barnes, Ming Lei, Bart Van Assche, Jens Axboe,
	Christoph Hellwig, Linux Kernel Mailing List, linux-block,
	Gwendal Grignou

On Wed, Feb 19, 2020 at 09:54:31AM -0800, Salman Qazi wrote:
> On Tue, Feb 18, 2020 at 6:55 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Tue, Feb 18, 2020 at 08:11:53AM -0800, Jesse Barnes wrote:
> > > On Fri, Feb 14, 2020 at 7:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > What are the 'other operations'? Are they block IOs?
> > > >
> > > > If yes, that is why I suggest to fix submit_bio_wait(), which should cover
> > > > most of sync bio submission.
> > > >
> > > > Anyway, the fix is simple & generic enough, I'd plan to post a formal
> > > > patch if no one figures out better doable approaches.
> > >
> > > Yeah I think any block I/O operation that occurs after the
> > > BLKSECDISCARD is submitted will also potentially be affected by the
> > > hung task timeouts, and I think your patch will address that.  My only
> > > concern with it is that it might hide some other I/O "hangs" that are
> > > due to device misbehavior instead.  Yes driver and device timeouts
> > > should generally catch those, but with this in place we might miss a
> > > few bugs.
> > >
> > > Given the nature of these types of storage devices though, I think
> > > that's a minor issue and not worth blocking the patch on, given that
> > > it should prevent a lot of false positive hang reports as Salman
> > > demonstrated.
> >
> > Hello Jesse and Salman,
> >
> > One more question about this issue, do you enable BLK_WBT on your test
> > kernel?
> 
> It doesn't exist on the original 4.4-based kernel where we reproduced
> this bug.  I am curious how this interacts with this bug.

blk-wbt can throttle discard requests and keep the discard queue from
growing too deep.

However, given that block erase is involved in BLKSECDISCARD, I guess
blk-wbt may not avoid this hung task issue completely.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: BLKSECDISCARD ioctl and hung tasks
  2020-02-19 22:22                     ` Ming Lei
@ 2020-02-19 22:26                       ` Salman Qazi
  0 siblings, 0 replies; 19+ messages in thread
From: Salman Qazi @ 2020-02-19 22:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jesse Barnes, Ming Lei, Bart Van Assche, Jens Axboe,
	Christoph Hellwig, Linux Kernel Mailing List, linux-block,
	Gwendal Grignou

On Wed, Feb 19, 2020 at 2:23 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Wed, Feb 19, 2020 at 09:54:31AM -0800, Salman Qazi wrote:
> > On Tue, Feb 18, 2020 at 6:55 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Tue, Feb 18, 2020 at 08:11:53AM -0800, Jesse Barnes wrote:
> > > > On Fri, Feb 14, 2020 at 7:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > What are the 'other operations'? Are they block IOs?
> > > > >
> > > > > If yes, that is why I suggest to fix submit_bio_wait(), which should cover
> > > > > most of sync bio submission.
> > > > >
> > > > > Anyway, the fix is simple & generic enough, I'd plan to post a formal
> > > > > patch if no one figures out better doable approaches.
> > > >
> > > > Yeah I think any block I/O operation that occurs after the
> > > > BLKSECDISCARD is submitted will also potentially be affected by the
> > > > hung task timeouts, and I think your patch will address that.  My only
> > > > concern with it is that it might hide some other I/O "hangs" that are
> > > > due to device misbehavior instead.  Yes driver and device timeouts
> > > > should generally catch those, but with this in place we might miss a
> > > > few bugs.
> > > >
> > > > Given the nature of these types of storage devices though, I think
> > > > that's a minor issue and not worth blocking the patch on, given that
> > > > it should prevent a lot of false positive hang reports as Salman
> > > > demonstrated.
> > >
> > > Hello Jesse and Salman,
> > >
> > > One more question about this issue, do you enable BLK_WBT on your test
> > > kernel?
> >
> > It doesn't exist on the original 4.4-based kernel where we reproduced
> > this bug.  I am curious how this interacts with this bug.
>
> blk-wbt can throttle discard request and keep discard queue not too
> deep.
>
> However, given block erase is involved in BLKSECDISCARD, I guess blk-wbt
> may not avoid this task hung issue completely.

Thanks for the explanation.

As I said, a single 4K BLKSECDISCARD can produce a 100-second delay
during which the entire device is unusable.  So the queue doesn't have
to be deep at all; it's one tiny ioctl.

>
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, back to index

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-12 22:27 BLKSECDISCARD ioctl and hung tasks Salman Qazi
2020-02-12 23:06 ` Theodore Y. Ts'o
2020-02-13  1:20   ` Salman Qazi
2020-02-13  1:24     ` Jesse Barnes
2020-02-13  8:26 ` Ming Lei
2020-02-13 17:48   ` Bart Van Assche
2020-02-13 19:21     ` Salman Qazi
2020-02-13 22:08       ` Salman Qazi
2020-02-14  0:25       ` Ming Lei
2020-02-14  5:49       ` Bart Van Assche
2020-02-14  9:22         ` Ming Lei
2020-02-14 19:42           ` Salman Qazi
2020-02-15  3:46             ` Ming Lei
2020-02-18 16:11               ` Jesse Barnes
2020-02-19  1:37                 ` Ming Lei
2020-02-19  2:54                 ` Ming Lei
2020-02-19 17:54                   ` Salman Qazi
2020-02-19 22:22                     ` Ming Lei
2020-02-19 22:26                       ` Salman Qazi
