* [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Keith Busch @ 2021-12-27 16:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, axboe; +Cc: hch, sagi, Keith Busch

While iterating a list, a particular request may need to be removed for
special handling. Provide an iterator that can safely handle that.
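
For example, a driver could complete each request while walking the list
(an illustrative sketch only, not code from this series; 'rqlist' is a
hypothetical driver-side list head, assumed non-empty as at the existing
call sites):

	struct request *req, *next;

	rq_list_for_each_safe(&rqlist, req, next) {
		/* 'next' was saved before this iteration, so completing
		 * (and thereby invalidating) 'req' here is safe. */
		blk_mq_end_request(req, BLK_STS_OK);
	}
	rqlist = NULL;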

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 include/linux/blkdev.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 22746b2d6825..c4597ccdaf26 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1365,6 +1365,10 @@ struct io_comp_batch {
 #define rq_list_for_each(listptr, pos)			\
 	for (pos = rq_list_peek((listptr)); pos; pos = rq_list_next(pos))
 
+#define rq_list_for_each_safe(listptr, pos, nxt)			\
+	for (pos = rq_list_peek((listptr)), nxt = rq_list_next(pos);	\
+		pos; pos = nxt, nxt = pos ? rq_list_next(pos) : NULL)
+
 #define rq_list_next(rq)	(rq)->rq_next
 #define rq_list_empty(list)	((list) == (struct request *) NULL)
 
-- 
2.25.4


* [PATCHv2 2/3] block: introduce rq_list_move
From: Keith Busch @ 2021-12-27 16:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, axboe; +Cc: hch, sagi, Keith Busch

When iterating a list, a particular request may need to be moved for
special handling. Provide a helper function to achieve that so drivers
don't need to reimplement rqlist manipulation.
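
For example (an illustrative sketch only, not code from this series;
should_requeue() is a hypothetical predicate, and 'rqlist' the
struct request ** being iterated):

	struct request *req, *next, *prev = NULL;
	struct request *requeue_list = NULL;

	rq_list_for_each_safe(rqlist, req, next) {
		if (should_requeue(req)) {
			/* 'prev' is unchanged: it still precedes 'next' */
			rq_list_move(rqlist, &requeue_list, req, prev, next);
			continue;
		}
		prev = req;
	}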

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 include/linux/blk-mq.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 550996cf419c..0efa25abcc1c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -216,6 +216,25 @@ static inline unsigned short req_get_ioprio(struct request *req)
 #define rq_dma_dir(rq) \
 	(op_is_write(req_op(rq)) ? DMA_TO_DEVICE : DMA_FROM_DEVICE)
 
+/**
+ * rq_list_move() - move a struct request from one list to another
+ * @src: The source list @rq is currently in
+ * @dst: The destination list that @rq will be appended to
+ * @rq: The request to move
+ * @prv: The request preceding @rq in @src (NULL if @rq is the head)
+ * @nxt: The request following @rq in @src (NULL if @rq is the tail)
+ */
+static void inline rq_list_move(struct request **src, struct request **dst,
+				struct request *rq, struct request *prv,
+				struct request *nxt)
+{
+	if (prv)
+		prv->rq_next = nxt;
+	else
+		*src = nxt;
+	rq_list_add(dst, rq);
+}
+
 enum blk_eh_timer_return {
 	BLK_EH_DONE,		/* drivers has completed the command */
 	BLK_EH_RESET_TIMER,	/* reset timer and try again */
-- 
2.25.4


* [PATCHv2 3/3] nvme-pci: fix queue_rqs list splitting
From: Keith Busch @ 2021-12-27 16:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, axboe; +Cc: hch, sagi, Keith Busch

If command prep fails, current handling will orphan subsequent requests
in the list. Consider a simple example:

  rqlist = [ 1 -> 2 ]

When prep for request '1' fails, it will be appended to the
'requeue_list', leaving request '2' disconnected from the original
rqlist and no longer tracked. Meanwhile, rqlist is still pointing to the
failed request '1' and will attempt to submit an unprepped command.

Fix this by updating the rqlist accordingly using the request list
helper functions.
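
Spelled out with the example above, the list states are (illustration
only):

  initial:          rqlist = [ 1 -> 2 ]   requeue_list = [ ]
  buggy handling:   rqlist = [ 1 ]        requeue_list = [ 1 ]  ('2' is lost,
                                                                 '1' in both)
  fixed handling:   rqlist = [ 2 ]        requeue_list = [ 1 ]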

Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v1->v2:

  Replaced the backward looking iterator with the forward looking
  version implemented in PATCH 1/3. This is a little easier to read.

  Replaced the driver's list manipulation with the helper function
  provided in PATCH 2/3.

 drivers/nvme/host/pci.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 50deb8b69c40..992ee314e91b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -999,30 +999,30 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
 
 static void nvme_queue_rqs(struct request **rqlist)
 {
-	struct request *req = rq_list_peek(rqlist), *prev = NULL;
+	struct request *req, *next, *prev = NULL;
 	struct request *requeue_list = NULL;
 
-	do {
+	rq_list_for_each_safe(rqlist, req, next) {
 		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 
 		if (!nvme_prep_rq_batch(nvmeq, req)) {
 			/* detach 'req' and add to remainder list */
-			if (prev)
-				prev->rq_next = req->rq_next;
-			rq_list_add(&requeue_list, req);
-		} else {
-			prev = req;
+			rq_list_move(rqlist, &requeue_list, req, prev, next);
+
+			req = prev;
+			if (!req)
+				continue;
 		}
 
-		req = rq_list_next(req);
-		if (!req || (prev && req->mq_hctx != prev->mq_hctx)) {
+		if (!next || req->mq_hctx != next->mq_hctx) {
 			/* detach rest of list, and submit */
-			if (prev)
-				prev->rq_next = NULL;
+			req->rq_next = NULL;
 			nvme_submit_cmds(nvmeq, rqlist);
-			*rqlist = req;
-		}
-	} while (req);
+			*rqlist = next;
+			prev = NULL;
+		} else
+			prev = req;
+	}
 
 	*rqlist = requeue_list;
 }
-- 
2.25.4


* Re: [PATCHv2 2/3] block: introduce rq_list_move
From: kernel test robot @ 2021-12-27 18:49 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, axboe
  Cc: kbuild-all, hch, sagi, Keith Busch

Hi Keith,

I love your patch! Perhaps something to improve:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linux-review/Xie-Yongji/nbd-Don-t-use-workqueue-to-handle-recv-work/20211227-171406 linus/master v5.16-rc7 next-20211224]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Keith-Busch/block-introduce-rq_list_for_each_safe-macro/20211228-004304
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20211228/202112280210.F89j103t-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/81098ed2c64adf477eae9c21a4188916e0ef5918
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Keith-Busch/block-introduce-rq_list_for_each_safe-macro/20211228-004304
        git checkout 81098ed2c64adf477eae9c21a4188916e0ef5918
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from arch/um/drivers/ubd_kern.c:27:
>> include/linux/blk-mq.h:227:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
     227 | static void inline rq_list_move(struct request **src, struct request **dst,
         | ^~~~~~
--
   In file included from include/linux/blktrace_api.h:5,
                    from block/bfq-iosched.h:9,
                    from block/bfq-cgroup.c:16:
>> include/linux/blk-mq.h:227:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
     227 | static void inline rq_list_move(struct request **src, struct request **dst,
         | ^~~~~~
   block/bfq-cgroup.c:1437:6: warning: no previous prototype for 'bfqg_and_blkg_get' [-Wmissing-prototypes]
    1437 | void bfqg_and_blkg_get(struct bfq_group *bfqg) {}
         |      ^~~~~~~~~~~~~~~~~


vim +/inline +227 include/linux/blk-mq.h

   215	
   216	#define rq_dma_dir(rq) \
   217		(op_is_write(req_op(rq)) ? DMA_TO_DEVICE : DMA_FROM_DEVICE)
   218	
   219	/**
   220	 * rq_list_move() - move a struct request from one list to another
   221	 * @src: The source list @rq is currently in
   222	 * @dst: The destination list that @rq will be appended to
   223	 * @rq: The request to move
   224	 * @prv: The request preceding @rq in @src (NULL if @rq is the head)
   225	 * @nxt: The request following @rq in @src (NULL if @rq is the tail)
   226	 */
 > 227	static void inline rq_list_move(struct request **src, struct request **dst,
   228					struct request *rq, struct request *prv,
   229					struct request *nxt)
   230	{
   231		if (prv)
   232			prv->rq_next = nxt;
   233		else
   234			*src = nxt;
   235		rq_list_add(dst, rq);
   236	}
   237	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Christoph Hellwig @ 2021-12-29 17:39 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, linux-block, axboe, hch, sagi

On Mon, Dec 27, 2021 at 08:41:36AM -0800, Keith Busch wrote:
> While iterating a list, a particular request may need to be removed for
> special handling. Provide an iterator that can safely handle that.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

(except for the fact that it, just like the other rq_list helpers
really should go into blk-mq.h, where all request related bits moved
early in this cycle)


* Re: [PATCHv2 2/3] block: introduce rq_list_move
From: Christoph Hellwig @ 2021-12-29 17:41 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, linux-block, axboe, hch, sagi

On Mon, Dec 27, 2021 at 08:41:37AM -0800, Keith Busch wrote:
> +/**
> + * rq_list_move() - move a struct request from one list to another
> + * @src: The source list @rq is currently in
> + * @dst: The destination list that @rq will be appended to
> + * @rq: The request to move
> + * @prv: The request preceding @rq in @src (NULL if @rq is the head)
> + * @nxt: The request following @rq in @src (NULL if @rq is the tail)
> + */
> +static void inline rq_list_move(struct request **src, struct request **dst,
> +				struct request *rq, struct request *prv,
> +				struct request *nxt)
> +{
> +	if (prv)
> +		prv->rq_next = nxt;
> +	else
> +		*src = nxt;
> +	rq_list_add(dst, rq);
> +}

Do we even need the nxt argument?  I think it should always be
rq->rq_next?

Also I'd spell out prev and next for a little more readability.
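
Something like this sketch, perhaps (illustrative only; it also puts
'inline' before the return type, which would address the kbuild warning
reported earlier in the thread):

	static inline void rq_list_move(struct request **src, struct request **dst,
					struct request *rq, struct request *prev)
	{
		if (prev)
			prev->rq_next = rq->rq_next;
		else
			*src = rq->rq_next;
		rq_list_add(dst, rq);
	}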


* Re: [PATCHv2 3/3] nvme-pci: fix queue_rqs list splitting
From: Christoph Hellwig @ 2021-12-29 17:46 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, linux-block, axboe, hch, sagi

> +			rq_list_move(rqlist, &requeue_list, req, prev, next);
> +
> +			req = prev;
> +			if (!req)
> +				continue;

Shouldn't this be a break?

> +			*rqlist = next;
> +			prev = NULL;
> +		} else
> +			prev = req;
> +	}

I wonder if a restart label here would be a little cleaner, something
like:


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 992ee314e91ba..29b433fd12bdd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -999,9 +999,11 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
 
 static void nvme_queue_rqs(struct request **rqlist)
 {
-	struct request *req, *next, *prev = NULL;
+	struct request *req, *next, *prev;
 	struct request *requeue_list = NULL;
 
+restart:
+	prev = NULL;
 	rq_list_for_each_safe(rqlist, req, next) {
 		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 
@@ -1009,9 +1011,9 @@ static void nvme_queue_rqs(struct request **rqlist)
 			/* detach 'req' and add to remainder list */
 			rq_list_move(rqlist, &requeue_list, req, prev, next);
 
+			if (!prev)
+				break;
 			req = prev;
-			if (!req)
-				continue;
 		}
 
 		if (!next || req->mq_hctx != next->mq_hctx) {
@@ -1019,9 +1021,9 @@ static void nvme_queue_rqs(struct request **rqlist)
 			req->rq_next = NULL;
 			nvme_submit_cmds(nvmeq, rqlist);
 			*rqlist = next;
-			prev = NULL;
-		} else
-			prev = req;
+			goto restart;
+		}
+		prev = req;
 	}
 
 	*rqlist = requeue_list;


* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Keith Busch @ 2021-12-29 20:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, linux-block, axboe, sagi

On Wed, Dec 29, 2021 at 06:39:02PM +0100, Christoph Hellwig wrote:
> (except for the fact that it, just like the other rq_list helpers
> really should go into blk-mq.h, where all request related bits moved
> early in this cycle)

Agreed, I just put it here because it's where the other macros live. But
'struct request' doesn't exist in this header, so it does seem out of
place.  For v3, I'll add a preceding patch to move them all to blk-mq.h.


* Re: [PATCHv2 2/3] block: introduce rq_list_move
From: Keith Busch @ 2021-12-29 20:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, linux-block, axboe, sagi

On Wed, Dec 29, 2021 at 06:41:09PM +0100, Christoph Hellwig wrote:
> On Mon, Dec 27, 2021 at 08:41:37AM -0800, Keith Busch wrote:
> > +/**
> > + * rq_list_move() - move a struct request from one list to another
> > + * @src: The source list @rq is currently in
> > + * @dst: The destination list that @rq will be appended to
> > + * @rq: The request to move
> > + * @prv: The request preceding @rq in @src (NULL if @rq is the head)
> > + * @nxt: The request following @rq in @src (NULL if @rq is the tail)
> > + */
> > +static void inline rq_list_move(struct request **src, struct request **dst,
> > +				struct request *rq, struct request *prv,
> > +				struct request *nxt)
> > +{
> > +	if (prv)
> > +		prv->rq_next = nxt;
> > +	else
> > +		*src = nxt;
> > +	rq_list_add(dst, rq);
> > +}
> 
> Do we even need the nxt argument?  I think it should always be
> rq->rq_next?

Sure. I only used it here because the safe iterator already has rq_next.
It's not an optimization, so I'll remove it.


* Re: [PATCHv2 3/3] nvme-pci: fix queue_rqs list splitting
From: Keith Busch @ 2021-12-29 21:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, linux-block, axboe, sagi

On Wed, Dec 29, 2021 at 06:46:02PM +0100, Christoph Hellwig wrote:
> > +			rq_list_move(rqlist, &requeue_list, req, prev, next);
> > +
> > +			req = prev;
> > +			if (!req)
> > +				continue;
> 
> Shouldn't this be a break?

The condition just means we're at the beginning of the rqlist. There may
be more requests to consider, so we have to continue.

Or are you saying any failed prep should just abandon the batched
sequence? If so, we would need to concat the return list with the rest
of rqlist before breaking.
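
For illustration, one way that concatenation could look (a hypothetical
sketch, not code from this series):

	/* Hypothetical: splice the unsubmitted remainder of *rqlist onto
	 * the requeue list before abandoning the batch. */
	static void requeue_rest(struct request **rqlist,
				 struct request **requeue_list)
	{
		struct request *tail = rq_list_peek(requeue_list);

		if (!tail) {
			*requeue_list = *rqlist;
		} else {
			while (rq_list_next(tail))
				tail = rq_list_next(tail);
			tail->rq_next = *rqlist;
		}
		*rqlist = NULL;
	}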
 
> > +			*rqlist = next;
> > +			prev = NULL;
> > +		} else
> > +			prev = req;
> > +	}
> 
> I wonder if a restart label here would be a little cleaner, something
> like:

I applied your suggestion to give it a look, and I agree. Will use that
for the next version.


* Re: [PATCHv2 3/3] nvme-pci: fix queue_rqs list splitting
From: Christoph Hellwig @ 2021-12-30  7:53 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, linux-nvme, linux-block, axboe, sagi

On Wed, Dec 29, 2021 at 01:04:46PM -0800, Keith Busch wrote:
> On Wed, Dec 29, 2021 at 06:46:02PM +0100, Christoph Hellwig wrote:
> > > +			rq_list_move(rqlist, &requeue_list, req, prev, next);
> > > +
> > > +			req = prev;
> > > +			if (!req)
> > > +				continue;
> > 
> > Shouldn't this be a break?
> 
> The condition just means we're at the beginning of the rqlist. There may
> be more requests to consider, so we have to continue.
> 
> Or are you saying any failed prep should just abandon the batched
> sequence? If so, we would need to concat the return list with the rest
> of rqlist before breaking.

No, I misunderstood the check,


* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Max Gurtovoy @ 2021-12-30 14:38 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig; +Cc: linux-nvme, linux-block, axboe, sagi


On 12/29/2021 10:57 PM, Keith Busch wrote:
> On Wed, Dec 29, 2021 at 06:39:02PM +0100, Christoph Hellwig wrote:
>> (except for the fact that it, just like the other rq_list helpers
>> really should go into blk-mq.h, where all request related bits moved
>> early in this cycle)
> Agreed, I just put it here because it's where the other macros live. But
> 'struct request' doesn't exist in this header, so it does seem out of
> place.  For v3, I'll add a preceding patch to move them all to blk-mq.h.

Did you see the discussion I had with Jens?

It seems this works only for non-shared tagsets, which means it will
work only for NVMe devices that have a single namespace, and will not
work for NVMf drivers.

Unless we make some changes in the block layer and/or remove this
condition.



* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Keith Busch @ 2021-12-30 15:30 UTC (permalink / raw)
  To: Max Gurtovoy; +Cc: Christoph Hellwig, linux-nvme, linux-block, axboe, sagi

On Thu, Dec 30, 2021 at 04:38:57PM +0200, Max Gurtovoy wrote:
> 
> On 12/29/2021 10:57 PM, Keith Busch wrote:
> > On Wed, Dec 29, 2021 at 06:39:02PM +0100, Christoph Hellwig wrote:
> > > (except for the fact that it, just like the other rq_list helpers
> > > really should go into blk-mq.h, where all request related bits moved
> > > early in this cycle)
> > Agreed, I just put it here because it's where the other macros live. But
> > 'struct request' doesn't exist in this header, so it does seem out of
> > place.  For v3, I'll add a preceding patch to move them all to blk-mq.h.
> 
> Did you see the discussion I had with Jens?

I did, yes. That's when I noticed the error handling wasn't right, and
doing it correctly was a bit clunky. This series was supposed to make it
easier for drivers to use the new interface.
 
> It seems this works only for non-shared tagsets, which means it will
> work only for NVMe devices that have a single namespace, and will not
> work for NVMf drivers.

I am aware shared tags don't get to use the batched queueing, but that's
orthogonal to this series.

Most PCIe NVMe SSDs only have 1 namespace, so it generally works out for
those. 

> Unless we make some changes in the block layer and/or remove this
> condition.

I think it just may work if we export blk_mq_get_driver_tag().


* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Max Gurtovoy @ 2022-01-03 15:23 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, linux-nvme, linux-block, axboe, sagi


On 12/30/2021 5:30 PM, Keith Busch wrote:
> On Thu, Dec 30, 2021 at 04:38:57PM +0200, Max Gurtovoy wrote:
>> On 12/29/2021 10:57 PM, Keith Busch wrote:
>>> On Wed, Dec 29, 2021 at 06:39:02PM +0100, Christoph Hellwig wrote:
>>>> (except for the fact that it, just like the other rq_list helpers
>>>> really should go into blk-mq.h, where all request related bits moved
>>>> early in this cycle)
>>> Agreed, I just put it here because it's where the other macros live. But
>>> 'struct request' doesn't exist in this header, so it does seem out of
>>> place.  For v3, I'll add a preceding patch to move them all to blk-mq.h.
>> Did you see the discussion I had with Jens?
> I did, yes. That's when I noticed the error handling wasn't right, and
> doing it correctly was a bit clunky. This series was supposed to make it
> easier for drivers to use the new interface.
>   
>> It seems this works only for non-shared tagsets, which means it will
>> work only for NVMe devices that have a single namespace, and will not
>> work for NVMf drivers.
> I am aware shared tags don't get to use the batched queueing, but that's
> orthogonal to this series.

Yes. This series should probably be squashed into Jens's and merged together.

> Most PCIe NVMe SSDs only have 1 namespace, so it generally works out for
> those.

There are SSDs that support NS management and also devices that support
multiple namespaces (NVIDIA's NVMe SNAP device supports many NSs).

Also, all the fabrics controllers support it.

I don't think we should prevent capable controllers from using the
batching feature from day 1.

>
>> Unless, we'll do some changes in the block layer and/or remove this
>> condition.
> I think it just may work if we export blk_mq_get_driver_tag().

Do you have a suggestion for the NVMe/PCI driver?

I can run it in my lab if you don't have an SSD with more than 1 NS.




* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Keith Busch @ 2022-01-03 18:15 UTC (permalink / raw)
  To: Max Gurtovoy; +Cc: Christoph Hellwig, linux-nvme, linux-block, axboe, sagi

On Mon, Jan 03, 2022 at 05:23:08PM +0200, Max Gurtovoy wrote:
> On 12/30/2021 5:30 PM, Keith Busch wrote:
> > I think it just may work if we export blk_mq_get_driver_tag().
> 
> Do you have a suggestion for the NVMe/PCI driver?

The following tests fine with my multi-namespace setups. I have real
hardware with namespace management capabilities, but qemu can also
easily emulate it too for anyone who doesn't have one.
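
For instance, something along these lines should present two namespaces
on one emulated controller (an untested sketch; image names and sizes
are placeholders):

    qemu-img create -f raw ns1.img 1G
    qemu-img create -f raw ns2.img 1G
    qemu-system-x86_64 ... \
        -drive file=ns1.img,if=none,format=raw,id=nvm1 \
        -drive file=ns2.img,if=none,format=raw,id=nvm2 \
        -device nvme,id=nvme0,serial=deadbeef \
        -device nvme-ns,drive=nvm1,bus=nvme0 \
        -device nvme-ns,drive=nvm2,bus=nvme0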

---
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 5668e28be0b7..84f2e73d0c7c 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -41,12 +41,6 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 	return sbq_wait_ptr(bt, &hctx->wait_index);
 }
 
-enum {
-	BLK_MQ_NO_TAG		= -1U,
-	BLK_MQ_TAG_MIN		= 1,
-	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
-};
-
 extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
 extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0d7c9d3e0329..b4540723077a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1589,6 +1589,7 @@ bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	hctx->tags->rqs[rq->tag] = rq;
 	return true;
 }
+EXPORT_SYMBOL_GPL(__blk_mq_get_driver_tag);
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 				int flags, void *key)
@@ -2582,11 +2583,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		 * same queue, caller must ensure that's the case.
 		 *
 		 * Since we pass off the full list to the driver at this point,
-		 * we do not increment the active request count for the queue.
-		 * Bypass shared tags for now because of that.
+		 * we are counting on the driver to increment the active
+		 * request count for the queue.
 		 */
-		if (q->mq_ops->queue_rqs &&
-		    !(rq->mq_hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+		if (q->mq_ops->queue_rqs) {
 			blk_mq_run_dispatch_ops(q,
 				__blk_mq_flush_plug_list(q, plug));
 			if (rq_list_empty(plug->mq_list))
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 948791ea2a3e..0f37ae906901 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -268,21 +268,6 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
 	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
 }
 
-bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq);
-
-static inline bool blk_mq_get_driver_tag(struct request *rq)
-{
-	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
-
-	if (rq->tag != BLK_MQ_NO_TAG &&
-	    !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
-		hctx->tags->rqs[rq->tag] = rq;
-		return true;
-	}
-
-	return __blk_mq_get_driver_tag(hctx, rq);
-}
-
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 {
 	int cpu;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 50deb8b69c40..f50483475c12 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -992,8 +992,9 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
 		return false;
 	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
 		return false;
+	if (!blk_mq_get_driver_tag(req))
+		return false;
 
-	req->mq_hctx->tags->rqs[req->tag] = req;
 	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 550996cf419c..8fb544a35330 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1072,6 +1072,27 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+enum {
+	BLK_MQ_NO_TAG		= -1U,
+	BLK_MQ_TAG_MIN		= 1,
+	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
+};
+
+bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq);
+
+static inline bool blk_mq_get_driver_tag(struct request *rq)
+{
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+
+	if (rq->tag != BLK_MQ_NO_TAG &&
+	    !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+		hctx->tags->rqs[rq->tag] = rq;
+		return true;
+	}
+
+	return __blk_mq_get_driver_tag(hctx, rq);
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline unsigned int blk_rq_zone_no(struct request *rq)
 {
--


* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Max Gurtovoy @ 2022-01-04 12:15 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, linux-nvme, linux-block, axboe, sagi


On 1/3/2022 8:15 PM, Keith Busch wrote:
> On Mon, Jan 03, 2022 at 05:23:08PM +0200, Max Gurtovoy wrote:
>> On 12/30/2021 5:30 PM, Keith Busch wrote:
>>> I think it just may work if we export blk_mq_get_driver_tag().
> > Do you have a suggestion for the NVMe/PCI driver?
> The following tests fine with my multi-namespace setups. I have real
> hardware with namespace management capabilities, but qemu can also
> easily emulate it too for anyone who doesn't have one.
>
> ---
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 5668e28be0b7..84f2e73d0c7c 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -41,12 +41,6 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
>   	return sbq_wait_ptr(bt, &hctx->wait_index);
>   }
>   
> -enum {
> -	BLK_MQ_NO_TAG		= -1U,
> -	BLK_MQ_TAG_MIN		= 1,
> -	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
> -};
> -
>   extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
>   extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0d7c9d3e0329..b4540723077a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1589,6 +1589,7 @@ bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq)
>   	hctx->tags->rqs[rq->tag] = rq;
>   	return true;
>   }
> +EXPORT_SYMBOL_GPL(__blk_mq_get_driver_tag);
>   
>   static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
>   				int flags, void *key)
> @@ -2582,11 +2583,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>   		 * same queue, caller must ensure that's the case.
>   		 *
>   		 * Since we pass off the full list to the driver at this point,
> -		 * we do not increment the active request count for the queue.
> -		 * Bypass shared tags for now because of that.
> +		 * we are counting on the driver to increment the active
> +		 * request count for the queue.
>   		 */
> -		if (q->mq_ops->queue_rqs &&
> -		    !(rq->mq_hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> +		if (q->mq_ops->queue_rqs) {
>   			blk_mq_run_dispatch_ops(q,
>   				__blk_mq_flush_plug_list(q, plug));
>   			if (rq_list_empty(plug->mq_list))
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 948791ea2a3e..0f37ae906901 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -268,21 +268,6 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
>   	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
>   }
>   
> -bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq);
> -
> -static inline bool blk_mq_get_driver_tag(struct request *rq)
> -{
> -	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> -
> -	if (rq->tag != BLK_MQ_NO_TAG &&
> -	    !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> -		hctx->tags->rqs[rq->tag] = rq;
> -		return true;
> -	}
> -
> -	return __blk_mq_get_driver_tag(hctx, rq);
> -}
> -
>   static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
>   {
>   	int cpu;
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 50deb8b69c40..f50483475c12 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -992,8 +992,9 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
>   		return false;
>   	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
>   		return false;
> +	if (!blk_mq_get_driver_tag(req))
> +		return false;
>   
> -	req->mq_hctx->tags->rqs[req->tag] = req;
>   	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
>   }
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 550996cf419c..8fb544a35330 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1072,6 +1072,27 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
>   }
>   void blk_dump_rq_flags(struct request *, char *);
>   
> +enum {
> +	BLK_MQ_NO_TAG		= -1U,
> +	BLK_MQ_TAG_MIN		= 1,
> +	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
> +};
> +
> +bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq);
> +
> +static inline bool blk_mq_get_driver_tag(struct request *rq)
> +{
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> +
> +	if (rq->tag != BLK_MQ_NO_TAG &&
> +	    !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> +		hctx->tags->rqs[rq->tag] = rq;
> +		return true;
> +	}
> +
> +	return __blk_mq_get_driver_tag(hctx, rq);
> +}
> +
>   #ifdef CONFIG_BLK_DEV_ZONED
>   static inline unsigned int blk_rq_zone_no(struct request *rq)
>   {
> --

This patch worked for me with 2 namespaces for NVMe PCI.

I'll check it later on with my RDMA queue_rqs patches as well. There we
also have tagset sharing with the connect_q (and not only with multiple
namespaces).

But the connect_q is using reserved tags only (for the connect commands).

I saw some strange things that I couldn't understand:

1. running randread fio with the libaio ioengine didn't call
nvme_queue_rqs - expected

2. running randwrite fio with the libaio ioengine did call
nvme_queue_rqs - not expected!

3. running randread fio with the io_uring ioengine (and
--iodepth_batch=32) didn't call nvme_queue_rqs - not expected!

4. running randwrite fio with the io_uring ioengine (and
--iodepth_batch=32) did call nvme_queue_rqs - expected

5. running randread fio with the io_uring ioengine (and
--iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and got
stuck for 300 seconds (the fio jobs required "kill -9 fio" to drop
refcounts on nvme_core) - not expected!

debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
seconds, it appears to be stuck. Doing forceful exit of this job.

6. running randwrite fio with the io_uring ioengine (and
--iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and got
stuck for 300 seconds (the fio jobs required "kill -9 fio" to drop
refcounts on nvme_core) - not expected!

debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
seconds, it appears to be stuck. Doing forceful exit of this job.

Any idea what could cause these unexpected scenarios? At least
unexpected for me :)



* Re: [PATCHv2 3/3] nvme-pci: fix queue_rqs list splitting
From: Keith Busch @ 2022-01-04 19:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, linux-block, axboe, sagi

On Wed, Dec 29, 2021 at 06:46:02PM +0100, Christoph Hellwig wrote:
> I wonder if a restart label here would be a little cleaner, something
> like:

On second thought, this requires another check that the *rqlist is not
NULL before 'goto restart' since the safe iterator dereferences it. I'm
not sure this is cleaner than the original anymore.
 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 992ee314e91ba..29b433fd12bdd 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -999,9 +999,11 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
>  
>  static void nvme_queue_rqs(struct request **rqlist)
>  {
> -	struct request *req, *next, *prev = NULL;
> +	struct request *req, *next, *prev;
>  	struct request *requeue_list = NULL;
>  
> +restart:
> +	prev = NULL;
>  	rq_list_for_each_safe(rqlist, req, next) {
>  		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>  
> @@ -1009,9 +1011,9 @@ static void nvme_queue_rqs(struct request **rqlist)
>  			/* detach 'req' and add to remainder list */
>  			rq_list_move(rqlist, &requeue_list, req, prev, next);
>  
> +			if (!prev)
> +				break;
>  			req = prev;
> -			if (!req)
> -				continue;
>  		}
>  
>  		if (!next || req->mq_hctx != next->mq_hctx) {
> @@ -1019,9 +1021,9 @@ static void nvme_queue_rqs(struct request **rqlist)
>  			req->rq_next = NULL;
>  			nvme_submit_cmds(nvmeq, rqlist);
>  			*rqlist = next;
> -			prev = NULL;
> -		} else
> -			prev = req;
> +			goto restart;
> +		}
> +		prev = req;
>  	}
>  
>  	*rqlist = requeue_list;


* Re: [PATCHv2 3/3] nvme-pci: fix queue_rqs list splitting
From: Christoph Hellwig @ 2022-01-05  7:35 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, linux-nvme, linux-block, axboe, sagi

On Tue, Jan 04, 2022 at 11:38:07AM -0800, Keith Busch wrote:
> On Wed, Dec 29, 2021 at 06:46:02PM +0100, Christoph Hellwig wrote:
> > I wonder if a restart label here would be a little cleaner, something
> > like:
> 
> On second thought, this requires another check that the *rqlist is not
> NULL before 'goto restart' since the safe iterator dereferences it. I'm
> not sure this is cleaner than the original anymore.

Ok, let's stick to the original then.


* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Keith Busch @ 2022-01-05 17:26 UTC (permalink / raw)
  To: Max Gurtovoy; +Cc: Christoph Hellwig, linux-nvme, linux-block, axboe, sagi

On Tue, Jan 04, 2022 at 02:15:58PM +0200, Max Gurtovoy wrote:
> 
> This patch worked for me with 2 namespaces for NVMe PCI.
> 
> I'll check it later on with my RDMA queue_rqs patches as well. There we
> also have tagset sharing with the connect_q (and not only with multiple
> namespaces).
>
> But the connect_q is using reserved tags only (for the connect commands).
>
> I saw some strange things that I couldn't understand:
>
> 1. running randread fio with the libaio ioengine didn't call
> nvme_queue_rqs - expected
>
> 2. running randwrite fio with the libaio ioengine did call
> nvme_queue_rqs - not expected!
>
> 3. running randread fio with the io_uring ioengine (and
> --iodepth_batch=32) didn't call nvme_queue_rqs - not expected!
>
> 4. running randwrite fio with the io_uring ioengine (and
> --iodepth_batch=32) did call nvme_queue_rqs - expected
>
> 5. running randread fio with the io_uring ioengine (and
> --iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and got
> stuck for 300 seconds (the fio jobs required "kill -9 fio" to drop
> refcounts on nvme_core) - not expected!
>
> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
> seconds, it appears to be stuck. Doing forceful exit of this job.
>
> 6. running randwrite fio with the io_uring ioengine (and
> --iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and got
> stuck for 300 seconds (the fio jobs required "kill -9 fio" to drop
> refcounts on nvme_core) - not expected!
>
> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
> seconds, it appears to be stuck. Doing forceful exit of this job.
>
> Any idea what could cause these unexpected scenarios? At least
> unexpected for me :)

Not sure about all the scenarios. I believe it should call queue_rqs
anytime we finish a plugged list of requests as long as the requests
come from the same request_queue, and it's not being flushed from
io_schedule().

The stuck fio job might be a lost request, which is what this series
should address. It would be unusual to see such an error happen in
normal operation, though. I had to synthesize errors to verify the bug
and fix.
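
One hypothetical way to synthesize such an error (not necessarily what
was done here) is to randomly fail the batch prep so the requeue path
gets exercised:

	/* Hypothetical fault injection in nvme_prep_rq_batch(),
	 * testing only: fail roughly one in 64 preps. */
	if (!(prandom_u32() % 64))
		return false;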

In any case, I'll run more multi-namespace tests to see if I can find
any other issues with shared tags.


* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Max Gurtovoy @ 2022-01-06 11:54 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, linux-nvme, linux-block, axboe, sagi


On 1/5/2022 7:26 PM, Keith Busch wrote:
> On Tue, Jan 04, 2022 at 02:15:58PM +0200, Max Gurtovoy wrote:
>> This patch worked for me with 2 namespaces for NVMe PCI.
>>
>> I'll check it later on with my RDMA queue_rqs patches as well. There we
>> also have tagset sharing with the connect_q (and not only with multiple
>> namespaces).
>>
>> But the connect_q is using reserved tags only (for the connect commands).
>>
>> I saw some strange things that I couldn't understand:
>>
>> 1. running randread fio with the libaio ioengine didn't call
>> nvme_queue_rqs - expected
>>
>> 2. running randwrite fio with the libaio ioengine did call
>> nvme_queue_rqs - not expected!
>>
>> 3. running randread fio with the io_uring ioengine (and
>> --iodepth_batch=32) didn't call nvme_queue_rqs - not expected!
>>
>> 4. running randwrite fio with the io_uring ioengine (and
>> --iodepth_batch=32) did call nvme_queue_rqs - expected
>>
>> 5. running randread fio with the io_uring ioengine (and
>> --iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and got
>> stuck for 300 seconds (the fio jobs required "kill -9 fio" to drop
>> refcounts on nvme_core) - not expected!
>>
>> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
>> seconds, it appears to be stuck. Doing forceful exit of this job.
>>
>> 6. running randwrite fio with the io_uring ioengine (and
>> --iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and got
>> stuck for 300 seconds (the fio jobs required "kill -9 fio" to drop
>> refcounts on nvme_core) - not expected!
>>
>> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
>> seconds, it appears to be stuck. Doing forceful exit of this job.
>>
>> Any idea what could cause these unexpected scenarios? At least
>> unexpected for me :)
> Not sure about all the scenarios. I believe it should call queue_rqs
> anytime we finish a plugged list of requests as long as the requests
> come from the same request_queue, and it's not being flushed from
> io_schedule().

I also see we have batch > 1 only at the start of the fio run. After X
IO operations the batch size is 1 until the end of the run.

>
> The stuck fio job might be a lost request, which is what this series
> should address. It would be unusual to see such an error happen in
> normal operation, though. I had to synthesize errors to verify the bug
> and fix.

But there are no timeout errors or prints in dmesg.

If there were timeout prints I would expect the issue to be in the
local NVMe device, but there aren't any.

Also, this phenomenon doesn't happen with the NVMf/RDMA code I developed
locally.

>
> In any case, I'll run more multi-namespace tests to see if I can find
> any other issues with shared tags.

I believe the above concerns are not related to shared tags but to the
entire mechanism.



* Re: [PATCHv2 1/3] block: introduce rq_list_for_each_safe macro
From: Jens Axboe @ 2022-01-06 13:41 UTC (permalink / raw)
  To: Max Gurtovoy, Keith Busch
  Cc: Christoph Hellwig, linux-nvme, linux-block, sagi

On 1/6/22 3:54 AM, Max Gurtovoy wrote:
> 
> On 1/5/2022 7:26 PM, Keith Busch wrote:
>> On Tue, Jan 04, 2022 at 02:15:58PM +0200, Max Gurtovoy wrote:
>>> This patch worked for me with 2 namespaces for NVMe PCI.
>>>
>>> I'll check it later on with my RDMA queue_rqs patches as well. There we
>>> also have tagset sharing with the connect_q (and not only with multiple
>>> namespaces).
>>>
>>> But the connect_q is using reserved tags only (for the connect commands).
>>>
>>> I saw some strange things that I couldn't understand:
>>>
>>> 1. running randread fio with the libaio ioengine didn't call
>>> nvme_queue_rqs - expected
>>>
>>> 2. running randwrite fio with the libaio ioengine did call
>>> nvme_queue_rqs - not expected!
>>>
>>> 3. running randread fio with the io_uring ioengine (and
>>> --iodepth_batch=32) didn't call nvme_queue_rqs - not expected!
>>>
>>> 4. running randwrite fio with the io_uring ioengine (and
>>> --iodepth_batch=32) did call nvme_queue_rqs - expected
>>>
>>> 5. running randread fio with the io_uring ioengine (and
>>> --iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and got
>>> stuck for 300 seconds (the fio jobs required "kill -9 fio" to drop
>>> refcounts on nvme_core) - not expected!
>>>
>>> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
>>> seconds, it appears to be stuck. Doing forceful exit of this job.
>>>
>>> 6. running randwrite fio with the io_uring ioengine (and
>>> --iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and got
>>> stuck for 300 seconds (the fio jobs required "kill -9 fio" to drop
>>> refcounts on nvme_core) - not expected!
>>>
>>> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
>>> seconds, it appears to be stuck. Doing forceful exit of this job.
>>>
>>> Any idea what could cause these unexpected scenarios? At least
>>> unexpected for me :)
>> Not sure about all the scenarios. I believe it should call queue_rqs
>> anytime we finish a plugged list of requests as long as the requests
>> come from the same request_queue, and it's not being flushed from
>> io_schedule().
> 
> I also see we have batch > 1 only at the start of the fio run. After X
> IO operations the batch size is 1 until the end of the run.

There are two settings for completion batching; you're likely not setting
them? That in turn will prevent the submit side from submitting more
than one, as that's all that's left.
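
For reference, the two knobs are presumably fio's completion-batch
options; a hypothetical invocation (values are illustrative only):

    fio --name=batched --filename=/dev/nvme0n1 --ioengine=io_uring \
        --rw=randread --iodepth=32 --iodepth_batch=32 \
        --iodepth_batch_complete_min=16 --iodepth_batch_complete_max=32 \
        --runtime=30 --time_based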

>> The stuck fio job might be a lost request, which is what this series
>> should address. It would be unusual to see such an error happen in
>> normal operation, though. I had to synthesize errors to verify the bug
>> and fix.
> 
> But there are no timeout errors or prints in dmesg.
>
> If there were timeout prints I would expect the issue to be in the
> local NVMe device, but there aren't any.
>
> Also, this phenomenon doesn't happen with the NVMf/RDMA code I developed
> locally.

There would only be a timeout if it wasn't lost. Keith's patches fixed a
case where it was simply dropped from the list. As it was never started,
it won't get timed out.

-- 
Jens Axboe

