All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] md: make suspend range wait timed out
@ 2017-06-09 19:41 Shaohua Li
  2017-06-16  3:26 ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Shaohua Li @ 2017-06-09 19:41 UTC (permalink / raw)
  To: linux-raid; +Cc: Shaohua Li, NeilBrown, Mikulas Patocka

From: Shaohua Li <shli@fb.com>

suspend range is controlled by userspace. If userspace doesn't clear suspend
range, it's possible a thread will wait for the range forever, we can't even
kill it. This is bad behavior. Add a timeout in the wait. If timeout happens,
we return IO error. The app controlling suspend range looks like part of disk
firmware, if disk isn't responded for a long time, timed out IO error is
returned.

A simple search in SCSI code shows maximum IO timeout is 120s, so I use this
value here too.

Cc: NeilBrown <neilb@suse.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.h    |  1 +
 drivers/md/raid1.c | 12 +++++++++++-
 drivers/md/raid5.c | 10 +++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 63d342d560b8..11a0ec33e79b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -29,6 +29,7 @@
 
 #define MaxSector (~(sector_t)0)
 
+#define MD_SUSPEND_TIMEOUT (120 * HZ)
 /*
  * These flags should really be called "NO_RETRY" rather than
  * "FAILFAST" because they don't make any promise about time lapse,
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d9e5373444d2..bc6dee0259df 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1326,6 +1326,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	    (mddev_is_clustered(mddev) &&
 	     md_cluster_ops->area_resyncing(mddev, WRITE,
 		     bio->bi_iter.bi_sector, bio_end_sector(bio)))) {
+		long remaining = -1;
 
 		/*
 		 * As the suspend_* range is controlled by userspace, we want
@@ -1345,10 +1346,19 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 				break;
 			sigfillset(&full);
 			sigprocmask(SIG_BLOCK, &full, &old);
-			schedule();
+			remaining = schedule_timeout(MD_SUSPEND_TIMEOUT);
 			sigprocmask(SIG_SETMASK, &old, NULL);
+			if (remaining == 0)
+				break;
 		}
 		finish_wait(&conf->wait_barrier, &w);
+		if (remaining == 0) {
+			pr_err("md/raid1:%s: suspend range is locked\n",
+				mdname(mddev));
+			bio->bi_error = -ETIMEDOUT;
+			bio_endio(bio);
+			return;
+		}
 	}
 	wait_barrier(conf, bio->bi_iter.bi_sector);
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cf1ac2e0f4c8..24297f1530d1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5685,6 +5685,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			if (rw == WRITE &&
 			    logical_sector >= mddev->suspend_lo &&
 			    logical_sector < mddev->suspend_hi) {
+				long remaining = -1;
 				raid5_release_stripe(sh);
 				/* As the suspend_* range is controlled by
 				 * userspace, we want an interruptible
@@ -5697,10 +5698,17 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 					sigset_t full, old;
 					sigfillset(&full);
 					sigprocmask(SIG_BLOCK, &full, &old);
-					schedule();
+					remaining = schedule_timeout(
+							MD_SUSPEND_TIMEOUT);
 					sigprocmask(SIG_SETMASK, &old, NULL);
 					do_prepare = true;
 				}
+				if (remaining == 0) {
+					pr_err("md/raid5:%s: suspend range is locked\n",
+						mdname(mddev));
+					bi->bi_error = -ETIMEDOUT;
+					break;
+				}
 				goto retry;
 			}
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-09 19:41 [PATCH] md: make suspend range wait timed out Shaohua Li
@ 2017-06-16  3:26 ` NeilBrown
  2017-06-16 15:52   ` Shaohua Li
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2017-06-16  3:26 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Shaohua Li, Mikulas Patocka

[-- Attachment #1: Type: text/plain, Size: 4431 bytes --]

On Fri, Jun 09 2017, Shaohua Li wrote:

> From: Shaohua Li <shli@fb.com>
>
> suspend range is controlled by userspace. If userspace doesn't clear suspend
> range, it's possible a thread will wait for the range forever, we can't even
> kill it. This is bad behavior. Add a timeout in the wait. If timeout happens,
> we return IO error. The app controlling suspend range looks like part of disk
> firmware, if disk isn't responded for a long time, timed out IO error is
> returned.
>
> A simple search in SCSI code shows maximum IO timeout is 120s, so I use this
> value here too.

I really don't like this.  It is an API change with no really
justification.  Has the current behavior caused a problem?

Both md and dm export  APIs which allow IO to be suspended while
changes are made.  Changing that to a timed-out period needs, I think,
to be clearly justified.

If it is changed to a timed-out period, then that should be explicit,
rather than having each request independently time out.
i.e. when the suspend is initiated, the end-time should be computed, and
any IO would block until that time, not block for some number of
seconds.

If an md device is left suspended, then the current code will block IO
indefinitely.  This patch will at a 20minute times to every single
request, which will mean IO proceeds, but extremely slowly.  I don't see
that as a useful improvement.

Thanks,
NeilBrown


>
> Cc: NeilBrown <neilb@suse.com>
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/md.h    |  1 +
>  drivers/md/raid1.c | 12 +++++++++++-
>  drivers/md/raid5.c | 10 +++++++++-
>  3 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 63d342d560b8..11a0ec33e79b 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -29,6 +29,7 @@
>  
>  #define MaxSector (~(sector_t)0)
>  
> +#define MD_SUSPEND_TIMEOUT (120 * HZ)
>  /*
>   * These flags should really be called "NO_RETRY" rather than
>   * "FAILFAST" because they don't make any promise about time lapse,
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d9e5373444d2..bc6dee0259df 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1326,6 +1326,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>  	    (mddev_is_clustered(mddev) &&
>  	     md_cluster_ops->area_resyncing(mddev, WRITE,
>  		     bio->bi_iter.bi_sector, bio_end_sector(bio)))) {
> +		long remaining = -1;
>  
>  		/*
>  		 * As the suspend_* range is controlled by userspace, we want
> @@ -1345,10 +1346,19 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>  				break;
>  			sigfillset(&full);
>  			sigprocmask(SIG_BLOCK, &full, &old);
> -			schedule();
> +			remaining = schedule_timeout(MD_SUSPEND_TIMEOUT);
>  			sigprocmask(SIG_SETMASK, &old, NULL);
> +			if (remaining == 0)
> +				break;
>  		}
>  		finish_wait(&conf->wait_barrier, &w);
> +		if (remaining == 0) {
> +			pr_err("md/raid1:%s: suspend range is locked\n",
> +				mdname(mddev));
> +			bio->bi_error = -ETIMEDOUT;
> +			bio_endio(bio);
> +			return;
> +		}
>  	}
>  	wait_barrier(conf, bio->bi_iter.bi_sector);
>  
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cf1ac2e0f4c8..24297f1530d1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5685,6 +5685,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>  			if (rw == WRITE &&
>  			    logical_sector >= mddev->suspend_lo &&
>  			    logical_sector < mddev->suspend_hi) {
> +				long remaining = -1;
>  				raid5_release_stripe(sh);
>  				/* As the suspend_* range is controlled by
>  				 * userspace, we want an interruptible
> @@ -5697,10 +5698,17 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>  					sigset_t full, old;
>  					sigfillset(&full);
>  					sigprocmask(SIG_BLOCK, &full, &old);
> -					schedule();
> +					remaining = schedule_timeout(
> +							MD_SUSPEND_TIMEOUT);
>  					sigprocmask(SIG_SETMASK, &old, NULL);
>  					do_prepare = true;
>  				}
> +				if (remaining == 0) {
> +					pr_err("md/raid5:%s: suspend range is locked\n",
> +						mdname(mddev));
> +					bi->bi_error = -ETIMEDOUT;
> +					break;
> +				}
>  				goto retry;
>  			}
>  
> -- 
> 2.11.0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-16  3:26 ` NeilBrown
@ 2017-06-16 15:52   ` Shaohua Li
  2017-06-16 18:06     ` Brad Campbell
  2017-06-18 21:30     ` NeilBrown
  0 siblings, 2 replies; 12+ messages in thread
From: Shaohua Li @ 2017-06-16 15:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Mikulas Patocka

On Fri, Jun 16, 2017 at 01:26:00PM +1000, Neil Brown wrote:
> On Fri, Jun 09 2017, Shaohua Li wrote:
> 
> > From: Shaohua Li <shli@fb.com>
> >
> > suspend range is controlled by userspace. If userspace doesn't clear suspend
> > range, it's possible a thread will wait for the range forever, we can't even
> > kill it. This is bad behavior. Add a timeout in the wait. If timeout happens,
> > we return IO error. The app controlling suspend range looks like part of disk
> > firmware, if disk isn't responded for a long time, timed out IO error is
> > returned.
> >
> > A simple search in SCSI code shows maximum IO timeout is 120s, so I use this
> > value here too.
> 
> I really don't like this.  It is an API change with no really
> justification.  Has the current behavior caused a problem?

This centainly causes problem. Set the suspend range will make application
stall for ever, don't you think this is a problem?

> Both md and dm export  APIs which allow IO to be suspended while
> changes are made.  Changing that to a timed-out period needs, I think,
> to be clearly justified.
> 
> If it is changed to a timed-out period, then that should be explicit,
> rather than having each request independently time out.
> i.e. when the suspend is initiated, the end-time should be computed, and
> any IO would block until that time, not block for some number of
> seconds.

Ok, this makes sense. We can add a timeout. If it's expired, we clear suspend
range. Userspace should know what they are doing.

> 
> If an md device is left suspended, then the current code will block IO
> indefinitely.  This patch will at a 20minute times to every single
> request, which will mean IO proceeds, but extremely slowly.  I don't see
> that as a useful improvement.

It returns error, so application will not dispatch more IO. But I agree a
timeout to clear the suspend looks a better policy.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-16 15:52   ` Shaohua Li
@ 2017-06-16 18:06     ` Brad Campbell
  2017-06-16 19:07       ` Shaohua Li
  2017-06-18 21:30     ` NeilBrown
  1 sibling, 1 reply; 12+ messages in thread
From: Brad Campbell @ 2017-06-16 18:06 UTC (permalink / raw)
  To: Shaohua Li, NeilBrown; +Cc: linux-raid, Shaohua Li, Mikulas Patocka

On 16/06/17 23:52, Shaohua Li wrote:
> On Fri, Jun 16, 2017 at 01:26:00PM +1000, Neil Brown wrote:
>> If an md device is left suspended, then the current code will block IO
>> indefinitely.  This patch will at a 20minute times to every single
>> request, which will mean IO proceeds, but extremely slowly.  I don't see
>> that as a useful improvement.
>
> It returns error, so application will not dispatch more IO. But I agree a
> timeout to clear the suspend looks a better policy.

If you insist on doing this, make the default timeout configurable and 
set it to wait forever by default. That way there is no change in 
current behaviour.

There are a number of times I've tickled those values to suspend things 
while doing something sinister to the array. I'd be mighty upset if I 
got called away and came back to find the kernel had resumed underneath 
me because it thinks it knows best.

Regards,
Brad
-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-16 18:06     ` Brad Campbell
@ 2017-06-16 19:07       ` Shaohua Li
  2017-06-16 19:22         ` Brad Campbell
  2017-06-16 19:37         ` Brad Campbell
  0 siblings, 2 replies; 12+ messages in thread
From: Shaohua Li @ 2017-06-16 19:07 UTC (permalink / raw)
  To: Brad Campbell; +Cc: NeilBrown, linux-raid, Shaohua Li, Mikulas Patocka

On Sat, Jun 17, 2017 at 02:06:23AM +0800, Brad Campbell wrote:
> On 16/06/17 23:52, Shaohua Li wrote:
> > On Fri, Jun 16, 2017 at 01:26:00PM +1000, Neil Brown wrote:
> > > If an md device is left suspended, then the current code will block IO
> > > indefinitely.  This patch will at a 20minute times to every single
> > > request, which will mean IO proceeds, but extremely slowly.  I don't see
> > > that as a useful improvement.
> > 
> > It returns error, so application will not dispatch more IO. But I agree a
> > timeout to clear the suspend looks a better policy.
> 
> If you insist on doing this, make the default timeout configurable and set
> it to wait forever by default. That way there is no change in current
> behaviour.

That's meaningless if the default is wait forever
 
> There are a number of times I've tickled those values to suspend things
> while doing something sinister to the array. I'd be mighty upset if I got
> called away and came back to find the kernel had resumed underneath me
> because it thinks it knows best.

I'd like to know your use case, can you describe it in details? I thought only
raid6check uses this function. If there are other usage cases which depend on
current behavior, we certaily can't change it.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-16 19:07       ` Shaohua Li
@ 2017-06-16 19:22         ` Brad Campbell
  2017-06-16 19:37         ` Brad Campbell
  1 sibling, 0 replies; 12+ messages in thread
From: Brad Campbell @ 2017-06-16 19:22 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, Shaohua Li, Mikulas Patocka


> I'd like to know your use case, can you describe it in details? I thought only
> raid6check uses this function. If there are other usage cases which depend on
> current behavior, we certaily can't change it.
>
>
It's entirely manual. I have a series of test arrays I use when people 
come to the list with issues and it's not uncommon to get into sysfs to 
manipulate the arrays to either re-create a situation or get the array 
into a state where I can try and figure out what is going on. To be 
honest all the use cases for me are entirely pathological, but it's a 
change to the current behavior that I have used in the past and expect 
to use again in the future.

The issue being rectified is not one I've seen come up frequently (if at 
all) so I thought I'd pipe up and say "I have used this in the past and 
will probably use it again in the future, please don't break it of you 
can avoid it".

I am currently 17,600KM away from my test stuff, so I can't elaborate 
further at the moment.

Regards,
Brad
-- 

Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-16 19:07       ` Shaohua Li
  2017-06-16 19:22         ` Brad Campbell
@ 2017-06-16 19:37         ` Brad Campbell
  1 sibling, 0 replies; 12+ messages in thread
From: Brad Campbell @ 2017-06-16 19:37 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, Shaohua Li, Mikulas Patocka



On 17/06/17 03:07, Shaohua Li wrote:
>
> I'd like to know your use case, can you describe it in details? I thought only
> raid6check uses this function. If there are other usage cases which depend on
> current behavior, we certaily can't change it.
>
It occurs to me that my objection is based on a particular workflow and 
there is probably more benefit to the general population to actually 
make the change. I can adapt to whatever changes are made. Sorry for the 
noise.

Regards,
Brad

-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-16 15:52   ` Shaohua Li
  2017-06-16 18:06     ` Brad Campbell
@ 2017-06-18 21:30     ` NeilBrown
  2017-06-20  0:54       ` Shaohua Li
  1 sibling, 1 reply; 12+ messages in thread
From: NeilBrown @ 2017-06-18 21:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Shaohua Li, Mikulas Patocka

[-- Attachment #1: Type: text/plain, Size: 2939 bytes --]

On Fri, Jun 16 2017, Shaohua Li wrote:

> On Fri, Jun 16, 2017 at 01:26:00PM +1000, Neil Brown wrote:
>> On Fri, Jun 09 2017, Shaohua Li wrote:
>> 
>> > From: Shaohua Li <shli@fb.com>
>> >
>> > suspend range is controlled by userspace. If userspace doesn't clear suspend
>> > range, it's possible a thread will wait for the range forever, we can't even
>> > kill it. This is bad behavior. Add a timeout in the wait. If timeout happens,
>> > we return IO error. The app controlling suspend range looks like part of disk
>> > firmware, if disk isn't responded for a long time, timed out IO error is
>> > returned.
>> >
>> > A simple search in SCSI code shows maximum IO timeout is 120s, so I use this
>> > value here too.
>> 
>> I really don't like this.  It is an API change with no really
>> justification.  Has the current behavior caused a problem?
>
> This centainly causes problem. Set the suspend range will make application
> stall for ever, don't you think this is a problem?

I agree that it could cause a problem.  I'm asking it is actually, in
practice, causes a problem.  Do you have reports from people saying "the
IO to my RAID array is hanging, what is wrong?" and you look into it and
find out that suspend_hi is larger than suspend_lo?

And if that does happen, is this really the best way to fix it?

>
>> Both md and dm export  APIs which allow IO to be suspended while
>> changes are made.  Changing that to a timed-out period needs, I think,
>> to be clearly justified.
>> 
>> If it is changed to a timed-out period, then that should be explicit,
>> rather than having each request independently time out.
>> i.e. when the suspend is initiated, the end-time should be computed, and
>> any IO would block until that time, not block for some number of
>> seconds.
>
> Ok, this makes sense. We can add a timeout. If it's expired, we clear suspend
> range. Userspace should know what they are doing.
>
>> 
>> If an md device is left suspended, then the current code will block IO
>> indefinitely.  This patch will at a 20minute times to every single
>> request, which will mean IO proceeds, but extremely slowly.  I don't see
>> that as a useful improvement.
>
> It returns error, so application will not dispatch more IO. But I agree a
> timeout to clear the suspend looks a better policy.

Write errors only get back to the application if it calls fsync(), and
many don't do that.  Write errors can easily cause a filesystem to go
read-only, and require an fsck.  I think we should be very cautious
about triggering write errors.

NFS will hang indefinitely rather then return an error if the server is
not available.  That can certainly be annoying, but the alternative has
been tried, and it leads to random data corruption.
The two cases are only comparable at a very high level, but I think
this result should encourage substantial caution.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-18 21:30     ` NeilBrown
@ 2017-06-20  0:54       ` Shaohua Li
  2017-06-21 14:09         ` Mikulas Patocka
  0 siblings, 1 reply; 12+ messages in thread
From: Shaohua Li @ 2017-06-20  0:54 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Mikulas Patocka

On Mon, Jun 19, 2017 at 07:30:50AM +1000, Neil Brown wrote:
> On Fri, Jun 16 2017, Shaohua Li wrote:
> 
> > On Fri, Jun 16, 2017 at 01:26:00PM +1000, Neil Brown wrote:
> >> On Fri, Jun 09 2017, Shaohua Li wrote:
> >> 
> >> > From: Shaohua Li <shli@fb.com>
> >> >
> >> > suspend range is controlled by userspace. If userspace doesn't clear suspend
> >> > range, it's possible a thread will wait for the range forever, we can't even
> >> > kill it. This is bad behavior. Add a timeout in the wait. If timeout happens,
> >> > we return IO error. The app controlling suspend range looks like part of disk
> >> > firmware, if disk isn't responded for a long time, timed out IO error is
> >> > returned.
> >> >
> >> > A simple search in SCSI code shows maximum IO timeout is 120s, so I use this
> >> > value here too.
> >> 
> >> I really don't like this.  It is an API change with no really
> >> justification.  Has the current behavior caused a problem?
> >
> > This centainly causes problem. Set the suspend range will make application
> > stall for ever, don't you think this is a problem?
> 
> I agree that it could cause a problem.  I'm asking it is actually, in
> practice, causes a problem.  Do you have reports from people saying "the
> IO to my RAID array is hanging, what is wrong?" and you look into it and
> find out that suspend_hi is larger than suspend_lo?
> 
> And if that does happen, is this really the best way to fix it?

I don't have reports about this issue. Just think the behavior is bad.
 
> >
> >> Both md and dm export  APIs which allow IO to be suspended while
> >> changes are made.  Changing that to a timed-out period needs, I think,
> >> to be clearly justified.
> >> 
> >> If it is changed to a timed-out period, then that should be explicit,
> >> rather than having each request independently time out.
> >> i.e. when the suspend is initiated, the end-time should be computed, and
> >> any IO would block until that time, not block for some number of
> >> seconds.
> >
> > Ok, this makes sense. We can add a timeout. If it's expired, we clear suspend
> > range. Userspace should know what they are doing.
> >
> >> 
> >> If an md device is left suspended, then the current code will block IO
> >> indefinitely.  This patch will at a 20minute times to every single
> >> request, which will mean IO proceeds, but extremely slowly.  I don't see
> >> that as a useful improvement.
> >
> > It returns error, so application will not dispatch more IO. But I agree a
> > timeout to clear the suspend looks a better policy.
> 
> Write errors only get back to the application if it calls fsync(), and
> many don't do that.  Write errors can easily cause a filesystem to go
> read-only, and require an fsck.  I think we should be very cautious
> about triggering write errors.
> 
> NFS will hang indefinitely rather then return an error if the server is
> not available.  That can certainly be annoying, but the alternative has
> been tried, and it leads to random data corruption.
> The two cases are only comparable at a very high level, but I think
> this result should encourage substantial caution.

It's hard to say if an IO error or an infinite wait is better, but since there
is better option in this case, I don't want to argue. I'll repost a patch to
reset suspend range after a timeout, assume this is your suggestion.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-20  0:54       ` Shaohua Li
@ 2017-06-21 14:09         ` Mikulas Patocka
  2017-06-21 16:07           ` Shaohua Li
  0 siblings, 1 reply; 12+ messages in thread
From: Mikulas Patocka @ 2017-06-21 14:09 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, Shaohua Li



On Mon, 19 Jun 2017, Shaohua Li wrote:

> > Write errors only get back to the application if it calls fsync(), and
> > many don't do that.  Write errors can easily cause a filesystem to go
> > read-only, and require an fsck.  I think we should be very cautious
> > about triggering write errors.
> > 
> > NFS will hang indefinitely rather then return an error if the server is
> > not available.  That can certainly be annoying, but the alternative has
> > been tried, and it leads to random data corruption.
> > The two cases are only comparable at a very high level, but I think
> > this result should encourage substantial caution.
> 
> It's hard to say if an IO error or an infinite wait is better, but since there
> is better option in this case, I don't want to argue. I'll repost a patch to
> reset suspend range after a timeout, assume this is your suggestion.
> 
> Thanks,
> Shaohua

Automatically resetting the suspend range could result in data corruption, 
so it is even worse than a deadlock.

Mikulas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-21 14:09         ` Mikulas Patocka
@ 2017-06-21 16:07           ` Shaohua Li
  2017-06-22 21:54             ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Shaohua Li @ 2017-06-21 16:07 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: NeilBrown, linux-raid, Shaohua Li

On Wed, Jun 21, 2017 at 10:09:08AM -0400, Mikulas Patocka wrote:
> 
> 
> On Mon, 19 Jun 2017, Shaohua Li wrote:
> 
> > > Write errors only get back to the application if it calls fsync(), and
> > > many don't do that.  Write errors can easily cause a filesystem to go
> > > read-only, and require an fsck.  I think we should be very cautious
> > > about triggering write errors.
> > > 
> > > NFS will hang indefinitely rather then return an error if the server is
> > > not available.  That can certainly be annoying, but the alternative has
> > > been tried, and it leads to random data corruption.
> > > The two cases are only comparable at a very high level, but I think
> > > this result should encourage substantial caution.
> > 
> > It's hard to say if an IO error or an infinite wait is better, but since there
> > is better option in this case, I don't want to argue. I'll repost a patch to
> > reset suspend range after a timeout, assume this is your suggestion.
> > 
> > Thanks,
> > Shaohua
> 
> Automatically resetting the suspend range could result in data corruption, 
> so it is even worse than a deadlock.

depending on how you look at this. a deadlock means you will eventually hard
reset the system, and that will result in data corruption.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] md: make suspend range wait timed out
  2017-06-21 16:07           ` Shaohua Li
@ 2017-06-22 21:54             ` NeilBrown
  0 siblings, 0 replies; 12+ messages in thread
From: NeilBrown @ 2017-06-22 21:54 UTC (permalink / raw)
  To: Shaohua Li, Mikulas Patocka; +Cc: linux-raid, Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 2535 bytes --]

On Wed, Jun 21 2017, Shaohua Li wrote:

> On Wed, Jun 21, 2017 at 10:09:08AM -0400, Mikulas Patocka wrote:
>> 
>> 
>> On Mon, 19 Jun 2017, Shaohua Li wrote:
>> 
>> > > Write errors only get back to the application if it calls fsync(), and
>> > > many don't do that.  Write errors can easily cause a filesystem to go
>> > > read-only, and require an fsck.  I think we should be very cautious
>> > > about triggering write errors.
>> > > 
>> > > NFS will hang indefinitely rather then return an error if the server is
>> > > not available.  That can certainly be annoying, but the alternative has
>> > > been tried, and it leads to random data corruption.
>> > > The two cases are only comparable at a very high level, but I think
>> > > this result should encourage substantial caution.
>> > 
>> > It's hard to say if an IO error or an infinite wait is better, but since there
>> > is better option in this case, I don't want to argue. I'll repost a patch to
>> > reset suspend range after a timeout, assume this is your suggestion.
>> > 
>> > Thanks,
>> > Shaohua
>> 
>> Automatically resetting the suspend range could result in data corruption, 
>> so it is even worse than a deadlock.
>
> depending on how you look at this. a deadlock means you will eventually hard
> reset the system, and that will result in data corruption.

But in that circumstance (purely hypothetical at this stage) the
sysadmin knows that something has gone wrong.  In the other, they might
not.
Invisible data corruption is much worse that visible.

The suspend functionality is used by user-space programs.  If you think
the current interface is not ideal, maybe it would be better to design
an interface that a user-space program can use which is safe to use, but
also fails safe.  Maybe it could give the kernel a PID with the meaning
"if you have to invalidate my suspend request, please kill the pid
first".  That would have risks of its own of course.

The suspend interface is currently used:
 - to enable backup of a region which it is being reshaped in-place.
   As time goes on, this will be used less and less as the
   change-data-offset approach to reshape doesn't need this.
 - to stablise a region while raid6check performs parity calculations
   and possibly "corrects" one block.

You at least need to analyse the failure modes of these before you can
justify making any change to the interface they use.
But again, I really don't think there is a problem that needs fixing.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-06-22 21:54 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-09 19:41 [PATCH] md: make suspend range wait timed out Shaohua Li
2017-06-16  3:26 ` NeilBrown
2017-06-16 15:52   ` Shaohua Li
2017-06-16 18:06     ` Brad Campbell
2017-06-16 19:07       ` Shaohua Li
2017-06-16 19:22         ` Brad Campbell
2017-06-16 19:37         ` Brad Campbell
2017-06-18 21:30     ` NeilBrown
2017-06-20  0:54       ` Shaohua Li
2017-06-21 14:09         ` Mikulas Patocka
2017-06-21 16:07           ` Shaohua Li
2017-06-22 21:54             ` NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.