linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] nbd: cmd timeout fixes
@ 2019-08-09 21:26 Mike Christie
  2019-08-09 21:26 ` [PATCH 1/4] nbd: add set cmd timeout helper Mike Christie
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Mike Christie @ 2019-08-09 21:26 UTC (permalink / raw)
  To: josef, linux-block

The following patches fix bugs related to the nbd cmd timeout feature.

The patches were made over Linus's current tree and also apply over
this nbd patch:

https://marc.info/?l=linux-block&m=156494582914495&w=2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] nbd: add set cmd timeout helper
  2019-08-09 21:26 [PATCH 0/4] nbd: cmd timeout fixes Mike Christie
@ 2019-08-09 21:26 ` Mike Christie
  2019-08-09 21:32   ` Mike Christie
  2019-08-13 13:11   ` Josef Bacik
  2019-08-09 21:26 ` [PATCH 2/4] nbd: add function to convert blk req op to nbd cmd Mike Christie
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Mike Christie @ 2019-08-09 21:26 UTC (permalink / raw)
  To: josef, linux-block; +Cc: Mike Christie

Add a helper to set the cmd timeout. It does not really do a lot now,
but will be more useful in the next patches.

Signed-off-by: Mike Christie <mchristi@redhat.com>
---
 drivers/block/nbd.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 5af1988785b0..cb2b98392ae5 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1245,6 +1245,12 @@ static bool nbd_is_valid_blksize(unsigned long blksize)
 	return true;
 }
 
+static void nbd_set_cmd_timeout(struct nbd_device *nbd, u64 timeout)
+{
+	nbd->tag_set.timeout = timeout * HZ;
+	blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
+}
+
 /* Must be called with config_lock held */
 static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		       unsigned int cmd, unsigned long arg)
@@ -1275,10 +1281,8 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		nbd_size_set(nbd, config->blksize, arg);
 		return 0;
 	case NBD_SET_TIMEOUT:
-		if (arg) {
-			nbd->tag_set.timeout = arg * HZ;
-			blk_queue_rq_timeout(nbd->disk->queue, arg * HZ);
-		}
+		if (arg)
+			nbd_set_cmd_timeout(nbd, arg);
 		return 0;
 
 	case NBD_SET_FLAGS:
@@ -1798,11 +1802,9 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 	if (ret)
 		goto out;
 
-	if (info->attrs[NBD_ATTR_TIMEOUT]) {
-		u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]);
-		nbd->tag_set.timeout = timeout * HZ;
-		blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
-	}
+	if (info->attrs[NBD_ATTR_TIMEOUT])
+		nbd_set_cmd_timeout(nbd,
+				    nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]));
 	if (info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]) {
 		config->dead_conn_timeout =
 			nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]);
@@ -1976,11 +1978,9 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
 	if (ret)
 		goto out;
 
-	if (info->attrs[NBD_ATTR_TIMEOUT]) {
-		u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]);
-		nbd->tag_set.timeout = timeout * HZ;
-		blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
-	}
+	if (info->attrs[NBD_ATTR_TIMEOUT])
+		nbd_set_cmd_timeout(nbd,
+				    nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]));
 	if (info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]) {
 		config->dead_conn_timeout =
 			nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] nbd: add function to convert blk req op to nbd cmd
  2019-08-09 21:26 [PATCH 0/4] nbd: cmd timeout fixes Mike Christie
  2019-08-09 21:26 ` [PATCH 1/4] nbd: add set cmd timeout helper Mike Christie
@ 2019-08-09 21:26 ` Mike Christie
  2019-08-13 13:11   ` Josef Bacik
  2019-08-09 21:26 ` [PATCH 3/4] nbd: add missing config put Mike Christie
  2019-08-09 21:26 ` [PATCH 4/4] nbd: fix zero cmd timeout handling Mike Christie
  3 siblings, 1 reply; 14+ messages in thread
From: Mike Christie @ 2019-08-09 21:26 UTC (permalink / raw)
  To: josef, linux-block; +Cc: Mike Christie

This adds a helper function to convert a block req op to a nbd cmd type.
It will be used in the last patch to log the type in the timeout
handler.

Signed-off-by: Mike Christie <mchristi@redhat.com>
---
 drivers/block/nbd.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index cb2b98392ae5..131b541d07c3 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -342,6 +342,22 @@ static void sock_shutdown(struct nbd_device *nbd)
 	dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
 }
 
+static u32 req_to_nbd_cmd_type(struct request *req)
+{
+	switch (req_op(req)) {
+	case REQ_OP_DISCARD:
+		return NBD_CMD_TRIM;
+	case REQ_OP_FLUSH:
+		return NBD_CMD_FLUSH;
+	case REQ_OP_WRITE:
+		return NBD_CMD_WRITE;
+	case REQ_OP_READ:
+		return NBD_CMD_READ;
+	default:
+		return U32_MAX;
+	}
+}
+
 static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 						 bool reserved)
 {
@@ -476,22 +492,9 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 
 	iov_iter_kvec(&from, WRITE, &iov, 1, sizeof(request));
 
-	switch (req_op(req)) {
-	case REQ_OP_DISCARD:
-		type = NBD_CMD_TRIM;
-		break;
-	case REQ_OP_FLUSH:
-		type = NBD_CMD_FLUSH;
-		break;
-	case REQ_OP_WRITE:
-		type = NBD_CMD_WRITE;
-		break;
-	case REQ_OP_READ:
-		type = NBD_CMD_READ;
-		break;
-	default:
+	type = req_to_nbd_cmd_type(req);
+	if (type == U32_MAX)
 		return -EIO;
-	}
 
 	if (rq_data_dir(req) == WRITE &&
 	    (config->flags & NBD_FLAG_READ_ONLY)) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] nbd: add missing config put
  2019-08-09 21:26 [PATCH 0/4] nbd: cmd timeout fixes Mike Christie
  2019-08-09 21:26 ` [PATCH 1/4] nbd: add set cmd timeout helper Mike Christie
  2019-08-09 21:26 ` [PATCH 2/4] nbd: add function to convert blk req op to nbd cmd Mike Christie
@ 2019-08-09 21:26 ` Mike Christie
  2019-08-13 13:11   ` Josef Bacik
  2019-08-09 21:26 ` [PATCH 4/4] nbd: fix zero cmd timeout handling Mike Christie
  3 siblings, 1 reply; 14+ messages in thread
From: Mike Christie @ 2019-08-09 21:26 UTC (permalink / raw)
  To: josef, linux-block; +Cc: Mike Christie

Fix bug added with the patch:

commit 8f3ea35929a0806ad1397db99a89ffee0140822a
Author: Josef Bacik <josef@toxicpanda.com>
Date:   Mon Jul 16 12:11:35 2018 -0400

    nbd: handle unexpected replies better

where if the timeout handler runs when the completion path is and we fail
to grab the mutex in the timeout handler we will leave a config reference
and cannot free the config later.

Signed-off-by: Mike Christie <mchristi@redhat.com>
---
 drivers/block/nbd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 131b541d07c3..93294000ce55 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -371,8 +371,10 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 	}
 	config = nbd->config;
 
-	if (!mutex_trylock(&cmd->lock))
+	if (!mutex_trylock(&cmd->lock)) {
+		nbd_config_put(nbd);
 		return BLK_EH_RESET_TIMER;
+	}
 
 	if (config->num_connections > 1) {
 		dev_err_ratelimited(nbd_to_dev(nbd),
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] nbd: fix zero cmd timeout handling
  2019-08-09 21:26 [PATCH 0/4] nbd: cmd timeout fixes Mike Christie
                   ` (2 preceding siblings ...)
  2019-08-09 21:26 ` [PATCH 3/4] nbd: add missing config put Mike Christie
@ 2019-08-09 21:26 ` Mike Christie
  2019-08-13 13:13   ` Josef Bacik
  3 siblings, 1 reply; 14+ messages in thread
From: Mike Christie @ 2019-08-09 21:26 UTC (permalink / raw)
  To: josef, linux-block; +Cc: Mike Christie

This fixes a regression added in 4.9 with commit:

commit 0eadf37afc2500e1162c9040ec26a705b9af8d47
Author: Josef Bacik <jbacik@fb.com>
Date:   Thu Sep 8 12:33:40 2016 -0700

    nbd: allow block mq to deal with timeouts

where before the patch userspace would set the timeout to 0 to disable
it. With the above patch, a zero timeout tells the block layer to use
the default value of 30 seconds. For setups where commands can take a
long time or experience transient issues like network disruptions this
then results in IO errors being sent to the application.

To fix this, the patch still uses the common block layer timeout
framework, but if zero is set, nbd just logs a message and then resets
the timer when it expires.

Josef,

I did not cc stable, but I think we want to port the patches to some
releases. We originally hit this with users using the longterm kernels
with ceph. The patch does not apply anywhere cleanly with older ones
like 4.9, so I was not sure how we wanted to handle it.

Signed-off-by: Mike Christie <mchristi@redhat.com>
---
 drivers/block/nbd.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 93294000ce55..8459adce713a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -120,6 +120,7 @@ struct nbd_cmd {
 	struct mutex lock;
 	int index;
 	int cookie;
+	int retries;
 	blk_status_t status;
 	unsigned long flags;
 	u32 cmd_cookie;
@@ -405,10 +406,25 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 			nbd_config_put(nbd);
 			return BLK_EH_DONE;
 		}
-	} else {
-		dev_err_ratelimited(nbd_to_dev(nbd),
-				    "Connection timed out\n");
 	}
+
+	if (!nbd->tag_set.timeout) {
+		/*
+		 * Userspace sets timeout=0 to disable socket disconnection,
+		 * so just warn and reset the timer.
+		 */
+		cmd->retries++;
+		dev_info(nbd_to_dev(nbd), "Possible stuck request %p: control (%s@%llu,%uB). Runtime %u seconds\n",
+			req, nbdcmd_to_ascii(req_to_nbd_cmd_type(req)),
+			(unsigned long long)blk_rq_pos(req) << 9,
+			blk_rq_bytes(req), (req->timeout / HZ) * cmd->retries);
+
+		mutex_unlock(&cmd->lock);
+		nbd_config_put(nbd);
+		return BLK_EH_RESET_TIMER;
+	}
+
+	dev_err_ratelimited(nbd_to_dev(nbd), "Connection timed out\n");
 	set_bit(NBD_TIMEDOUT, &config->runtime_flags);
 	cmd->status = BLK_STS_IOERR;
 	mutex_unlock(&cmd->lock);
@@ -527,6 +543,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 	}
 	cmd->index = index;
 	cmd->cookie = nsock->cookie;
+	cmd->retries = 0;
 	request.type = htonl(type | nbd_cmd_flags);
 	if (type != NBD_CMD_FLUSH) {
 		request.from = cpu_to_be64((u64)blk_rq_pos(req) << 9);
@@ -1253,7 +1270,8 @@ static bool nbd_is_valid_blksize(unsigned long blksize)
 static void nbd_set_cmd_timeout(struct nbd_device *nbd, u64 timeout)
 {
 	nbd->tag_set.timeout = timeout * HZ;
-	blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
+	if (timeout)
+		blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
 }
 
 /* Must be called with config_lock held */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] nbd: add set cmd timeout helper
  2019-08-09 21:26 ` [PATCH 1/4] nbd: add set cmd timeout helper Mike Christie
@ 2019-08-09 21:32   ` Mike Christie
  2019-08-13 13:11   ` Josef Bacik
  1 sibling, 0 replies; 14+ messages in thread
From: Mike Christie @ 2019-08-09 21:32 UTC (permalink / raw)
  To: josef, linux-block

On 08/09/2019 04:26 PM, Mike Christie wrote:
>  
> +static void nbd_set_cmd_timeout(struct nbd_device *nbd, u64 timeout)
> +{
> +	nbd->tag_set.timeout = timeout * HZ;
> +	blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
> +}
> +
>  /* Must be called with config_lock held */
>  static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
>  		       unsigned int cmd, unsigned long arg)
> @@ -1275,10 +1281,8 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
>  		nbd_size_set(nbd, config->blksize, arg);
>  		return 0;
>  	case NBD_SET_TIMEOUT:
> -		if (arg) {
> -			nbd->tag_set.timeout = arg * HZ;
> -			blk_queue_rq_timeout(nbd->disk->queue, arg * HZ);
> -		}
> +		if (arg)
> +			nbd_set_cmd_timeout(nbd, arg);
>  		return 0;

Josef,

These patches still leave one regression where you used to be able to
disable the timer after it has already been set. My patches did not fix
that yet. I will fix it on the resend after you give me your comments.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] nbd: add set cmd timeout helper
  2019-08-09 21:26 ` [PATCH 1/4] nbd: add set cmd timeout helper Mike Christie
  2019-08-09 21:32   ` Mike Christie
@ 2019-08-13 13:11   ` Josef Bacik
  1 sibling, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2019-08-13 13:11 UTC (permalink / raw)
  To: Mike Christie; +Cc: josef, linux-block

On Fri, Aug 09, 2019 at 04:26:07PM -0500, Mike Christie wrote:
> Add a helper to set the cmd timeout. It does not really do a lot now,
> but will be more useful in the next patches.
> 
> Signed-off-by: Mike Christie <mchristi@redhat.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] nbd: add function to convert blk req op to nbd cmd
  2019-08-09 21:26 ` [PATCH 2/4] nbd: add function to convert blk req op to nbd cmd Mike Christie
@ 2019-08-13 13:11   ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2019-08-13 13:11 UTC (permalink / raw)
  To: Mike Christie; +Cc: josef, linux-block

On Fri, Aug 09, 2019 at 04:26:08PM -0500, Mike Christie wrote:
> This adds a helper function to convert a block req op to a nbd cmd type.
> It will be used in the last patch to log the type in the timeout
> handler.
> 
> Signed-off-by: Mike Christie <mchristi@redhat.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] nbd: add missing config put
  2019-08-09 21:26 ` [PATCH 3/4] nbd: add missing config put Mike Christie
@ 2019-08-13 13:11   ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2019-08-13 13:11 UTC (permalink / raw)
  To: Mike Christie; +Cc: josef, linux-block

On Fri, Aug 09, 2019 at 04:26:09PM -0500, Mike Christie wrote:
> Fix bug added with the patch:
> 
> commit 8f3ea35929a0806ad1397db99a89ffee0140822a
> Author: Josef Bacik <josef@toxicpanda.com>
> Date:   Mon Jul 16 12:11:35 2018 -0400
> 
>     nbd: handle unexpected replies better
> 
> where if the timeout handler runs when the completion path is and we fail
> to grab the mutex in the timeout handler we will leave a config reference
> and cannot free the config later.
> 
> Signed-off-by: Mike Christie <mchristi@redhat.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/4] nbd: fix zero cmd timeout handling
  2019-08-09 21:26 ` [PATCH 4/4] nbd: fix zero cmd timeout handling Mike Christie
@ 2019-08-13 13:13   ` Josef Bacik
  2019-08-13 15:45     ` Mike Christie
  0 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2019-08-13 13:13 UTC (permalink / raw)
  To: Mike Christie; +Cc: josef, linux-block

On Fri, Aug 09, 2019 at 04:26:10PM -0500, Mike Christie wrote:
> This fixes a regression added in 4.9 with commit:
> 
> commit 0eadf37afc2500e1162c9040ec26a705b9af8d47
> Author: Josef Bacik <jbacik@fb.com>
> Date:   Thu Sep 8 12:33:40 2016 -0700
> 
>     nbd: allow block mq to deal with timeouts
> 
> where before the patch userspace would set the timeout to 0 to disable
> it. With the above patch, a zero timeout tells the block layer to use
> the default value of 30 seconds. For setups where commands can take a
> long time or experience transient issues like network disruptions this
> then results in IO errors being sent to the application.
> 
> To fix this, the patch still uses the common block layer timeout
> framework, but if zero is set, nbd just logs a message and then resets
> the timer when it expires.
> 
> Josef,
> 
> I did not cc stable, but I think we want to port the patches to some
> releases. We originally hit this with users using the longterm kernels
> with ceph. The patch does not apply anywhere cleanly with older ones
> like 4.9, so I was not sure how we wanted to handle it.
> 

I assume you tested this?  IIRC there was a problem where 0 really meant 0 and
commands would insta-timeout.  But my memory is foggy here, so I'm not sure if
it was setting the tag_set timeout to 0 that made things go wrong, or what.  Or
I could be making it all up, who knows.

There's a blktest that just runs fio on a normal device with no timeouts or
anything, that's where I would see the problem since it was a little racy.
Basically have the timeout set to 0 and put load on the disk and eventually
you'd start seeing timeouts.  If that all goes fine then you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/4] nbd: fix zero cmd timeout handling
  2019-08-13 13:13   ` Josef Bacik
@ 2019-08-13 15:45     ` Mike Christie
  2019-08-13 15:54       ` Mike Christie
  2019-08-13 16:15       ` Josef Bacik
  0 siblings, 2 replies; 14+ messages in thread
From: Mike Christie @ 2019-08-13 15:45 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-block

On 08/13/2019 08:13 AM, Josef Bacik wrote:
> On Fri, Aug 09, 2019 at 04:26:10PM -0500, Mike Christie wrote:
>> This fixes a regression added in 4.9 with commit:
>>
>> commit 0eadf37afc2500e1162c9040ec26a705b9af8d47
>> Author: Josef Bacik <jbacik@fb.com>
>> Date:   Thu Sep 8 12:33:40 2016 -0700
>>
>>     nbd: allow block mq to deal with timeouts
>>
>> where before the patch userspace would set the timeout to 0 to disable
>> it. With the above patch, a zero timeout tells the block layer to use
>> the default value of 30 seconds. For setups where commands can take a
>> long time or experience transient issues like network disruptions this
>> then results in IO errors being sent to the application.
>>
>> To fix this, the patch still uses the common block layer timeout
>> framework, but if zero is set, nbd just logs a message and then resets
>> the timer when it expires.
>>
>> Josef,
>>
>> I did not cc stable, but I think we want to port the patches to some
>> releases. We originally hit this with users using the longterm kernels
>> with ceph. The patch does not apply anywhere cleanly with older ones
>> like 4.9, so I was not sure how we wanted to handle it.
>>
> 
> I assume you tested this?  IIRC there was a problem where 0 really meant 0 and

Yes.

> commands would insta-timeout.  But my memory is foggy here, so I'm not sure if
> it was setting the tag_set timeout to 0 that made things go wrong, or what.  Or
> I could be making it all up, who knows.

Yes, if you call blk_queue_rq_timeout with 0, then the command will
timeout almost immediately. I added a check for this in the first patch.

If blk_mq_tag_set.timeout is 0, blk_mq_init_allocated_queue uses the
default 30 second value.

So with the patch if the user sets the timeout to 0, then we will just
log a message every 30 seconds that the command is stuck.

> 
> There's a blktest that just runs fio on a normal device with no timeouts or
> anything, that's where I would see the problem since it was a little racy.
> Basically have the timeout set to 0 and put load on the disk and eventually
> you'd start seeing timeouts.  If that all goes fine then you can add
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 

Ok.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/4] nbd: fix zero cmd timeout handling
  2019-08-13 15:45     ` Mike Christie
@ 2019-08-13 15:54       ` Mike Christie
  2019-08-13 15:59         ` Mike Christie
  2019-08-13 16:15       ` Josef Bacik
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Christie @ 2019-08-13 15:54 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-block

On 08/13/2019 10:45 AM, Mike Christie wrote:
> On 08/13/2019 08:13 AM, Josef Bacik wrote:
>> On Fri, Aug 09, 2019 at 04:26:10PM -0500, Mike Christie wrote:
>>> This fixes a regression added in 4.9 with commit:
>>>
>>> commit 0eadf37afc2500e1162c9040ec26a705b9af8d47
>>> Author: Josef Bacik <jbacik@fb.com>
>>> Date:   Thu Sep 8 12:33:40 2016 -0700
>>>
>>>     nbd: allow block mq to deal with timeouts
>>>
>>> where before the patch userspace would set the timeout to 0 to disable
>>> it. With the above patch, a zero timeout tells the block layer to use
>>> the default value of 30 seconds. For setups where commands can take a
>>> long time or experience transient issues like network disruptions this
>>> then results in IO errors being sent to the application.
>>>
>>> To fix this, the patch still uses the common block layer timeout
>>> framework, but if zero is set, nbd just logs a message and then resets
>>> the timer when it expires.
>>>
>>> Josef,
>>>
>>> I did not cc stable, but I think we want to port the patches to some
>>> releases. We originally hit this with users using the longterm kernels
>>> with ceph. The patch does not apply anywhere cleanly with older ones
>>> like 4.9, so I was not sure how we wanted to handle it.
>>>
>>
>> I assume you tested this?  IIRC there was a problem where 0 really meant 0 and
> 
> Yes.
> 
>> commands would insta-timeout.  But my memory is foggy here, so I'm not sure if
>> it was setting the tag_set timeout to 0 that made things go wrong, or what.  Or
>> I could be making it all up, who knows.
> 
> Yes, if you call blk_queue_rq_timeout with 0, then the command will
> timeout almost immediately. I added a check for this in the first patch.
> 
> If blk_mq_tag_set.timeout is 0, blk_mq_init_allocated_queue uses the
> default 30 second value.
> 
> So with the patch if the user sets the timeout to 0, then we will just
> log a message every 30 seconds that the command is stuck.
> 
>>
>> There's a blktest that just runs fio on a normal device with no timeouts or
>> anything, that's where I would see the problem since it was a little racy.
>> Basically have the timeout set to 0 and put load on the disk and eventually
>> you'd start seeing timeouts.  If that all goes fine then you can add

Oh yeah just to be clear that is another issue that you can hit with any
driver.

If a app/user sets the timeout value in sysfs:

/sys/block/$dev/queue/io_timeout

then it bypasses the driver completely because it just does

queue_io_timeout_store -> blk_queue_rq_timeout

and that function/interface lets you set the timeout to anything.

My patches just fix up the nbd interface that existing tools were using
and hitting regressions with.

I was debating about sending a patch for not allowing

blk_queue_rq_timeout(q, 9)

in a separate patchset, but I was not sure if people use that for
testing fast timeouts.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/4] nbd: fix zero cmd timeout handling
  2019-08-13 15:54       ` Mike Christie
@ 2019-08-13 15:59         ` Mike Christie
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Christie @ 2019-08-13 15:59 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-block

On 08/13/2019 10:54 AM, Mike Christie wrote:
> I was debating about sending a patch for not allowing
> 
> blk_queue_rq_timeout(q, 9)

I meant zero

blk_queue_rq_timeout(q, 0)

> 
> in a separate patchset, but I was not sure if people use that for
> testing fast timeouts.
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/4] nbd: fix zero cmd timeout handling
  2019-08-13 15:45     ` Mike Christie
  2019-08-13 15:54       ` Mike Christie
@ 2019-08-13 16:15       ` Josef Bacik
  1 sibling, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2019-08-13 16:15 UTC (permalink / raw)
  To: Mike Christie; +Cc: Josef Bacik, linux-block

On Tue, Aug 13, 2019 at 10:45:55AM -0500, Mike Christie wrote:
> On 08/13/2019 08:13 AM, Josef Bacik wrote:
> > On Fri, Aug 09, 2019 at 04:26:10PM -0500, Mike Christie wrote:
> >> This fixes a regression added in 4.9 with commit:
> >>
> >> commit 0eadf37afc2500e1162c9040ec26a705b9af8d47
> >> Author: Josef Bacik <jbacik@fb.com>
> >> Date:   Thu Sep 8 12:33:40 2016 -0700
> >>
> >>     nbd: allow block mq to deal with timeouts
> >>
> >> where before the patch userspace would set the timeout to 0 to disable
> >> it. With the above patch, a zero timeout tells the block layer to use
> >> the default value of 30 seconds. For setups where commands can take a
> >> long time or experience transient issues like network disruptions this
> >> then results in IO errors being sent to the application.
> >>
> >> To fix this, the patch still uses the common block layer timeout
> >> framework, but if zero is set, nbd just logs a message and then resets
> >> the timer when it expires.
> >>
> >> Josef,
> >>
> >> I did not cc stable, but I think we want to port the patches to some
> >> releases. We originally hit this with users using the longterm kernels
> >> with ceph. The patch does not apply anywhere cleanly with older ones
> >> like 4.9, so I was not sure how we wanted to handle it.
> >>
> > 
> > I assume you tested this?  IIRC there was a problem where 0 really meant 0 and
> 
> Yes.
> 
> > commands would insta-timeout.  But my memory is foggy here, so I'm not sure if
> > it was setting the tag_set timeout to 0 that made things go wrong, or what.  Or
> > I could be making it all up, who knows.
> 
> Yes, if you call blk_queue_rq_timeout with 0, then the command will
> timeout almost immediately. I added a check for this in the first patch.
> 

Ahhh that's what it was, thank you.  I'm cool then, you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-08-13 16:15 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-09 21:26 [PATCH 0/4] nbd: cmd timeout fixes Mike Christie
2019-08-09 21:26 ` [PATCH 1/4] nbd: add set cmd timeout helper Mike Christie
2019-08-09 21:32   ` Mike Christie
2019-08-13 13:11   ` Josef Bacik
2019-08-09 21:26 ` [PATCH 2/4] nbd: add function to convert blk req op to nbd cmd Mike Christie
2019-08-13 13:11   ` Josef Bacik
2019-08-09 21:26 ` [PATCH 3/4] nbd: add missing config put Mike Christie
2019-08-13 13:11   ` Josef Bacik
2019-08-09 21:26 ` [PATCH 4/4] nbd: fix zero cmd timeout handling Mike Christie
2019-08-13 13:13   ` Josef Bacik
2019-08-13 15:45     ` Mike Christie
2019-08-13 15:54       ` Mike Christie
2019-08-13 15:59         ` Mike Christie
2019-08-13 16:15       ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).