Linux-Block Archive on lore.kernel.org
 help / Atom feed
* [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list
@ 2019-04-28  7:39 Ming Lei
  2019-04-28  7:39 ` [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool Ming Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Ming Lei @ 2019-04-28  7:39 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche,
	Ewan D . Milne, Hannes Reinecke, Sagi Grimberg, Chuck Lever,
	netdev, linux-nvme

Hi,

Since supporting to blk-mq, big pre-allocation for sg list is introduced,
this way is very unfriendly wrt. memory consumption.

There were Red Hat internal reports that some scsi_debug based tests
can't be run any more because of too big pre-allocation.

Also lpfc users commplained that 1GB+ ram is pre-allocatd for single
HBA.

sg_alloc_table_chained() is improved to support variant size of 1st
pre-allocated SGL in the 1st patch as suggested by Christoph.

The other two patches try to address this issue by allocating sg list runtime,
meantime pre-allocating one or two inline sg entries for small IO. This
ways follows NVMe's approach wrt. sg list allocation.

V4:
	- add parameter to sg_alloc_table_chained()/sg_free_table_chained()
	directly, and update current callers

V3:
	- improve sg_alloc_table_chained() to accept variant size of
	the 1st pre-allocated SGL
	- applies the improved sg API to address the big pre-allocation
	issue

V2:
	- move inline sg table initializetion into one helper
	- introduce new helper for getting inline sg
	- comment log fix


Ming Lei (3):
  lib/sg_pool.c: improve APIs for allocating sg pool
  scsi: core: avoid to pre-allocate big chunk for protection meta data
  scsi: core: avoid to pre-allocate big chunk for sg list

 drivers/nvme/host/fc.c            |  7 ++++---
 drivers/nvme/host/rdma.c          |  7 ++++---
 drivers/nvme/target/loop.c        |  4 ++--
 drivers/scsi/scsi_lib.c           | 31 ++++++++++++++++++++++---------
 include/linux/scatterlist.h       | 11 +++++++----
 lib/scatterlist.c                 | 36 +++++++++++++++++++++++-------------
 lib/sg_pool.c                     | 37 +++++++++++++++++++++++++++----------
 net/sunrpc/xprtrdma/svc_rdma_rw.c |  5 +++--
 8 files changed, 92 insertions(+), 46 deletions(-)

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: netdev@vger.kernel.org
Cc: linux-nvme@lists.infradead.org

-- 
2.9.5


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool
  2019-04-28  7:39 [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list Ming Lei
@ 2019-04-28  7:39 ` Ming Lei
  2019-04-28 12:04   ` Christoph Hellwig
  2019-04-28  7:39 ` [PATCH V4 2/3] scsi: core: avoid to pre-allocate big chunk for protection meta data Ming Lei
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2019-04-28  7:39 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche,
	Ewan D . Milne, Hannes Reinecke, Sagi Grimberg, Chuck Lever,
	netdev, linux-nvme

Now sg_alloc_table_chained() allows user to provide one preallocated
SGL, and returns simply if the requested number isn't bigger than size of
that SGL. This way is nice for inline small SGL to small IO request.

However, scattergather code only allows that size of the 1st preallocated
SGL is SG_CHUNK_SIZE(128), and this way isn't flexiable and useful, because
it may take too much memory(4KB) to pre-allocat one such size SGL for each
IO request, especially block layer always pre-allocates IO request structure.
Instead it is more friendly to pre-allocate one small size inline SGL just
for small IO.

Introduces one extra parameter to sg_alloc_table_chained() and sg_free_table_chained()
for specifying size of the pre-allocated SGL, then the 'first_chunk' SGL can
include any number of entries.

Both __sg_free_table() and __sg_alloc_table() assumes that each SGL has
same size except for the last one, changes code to allow both to accept
variant size for the 1st preallocated SGL.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: netdev@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/fc.c            |  7 ++++---
 drivers/nvme/host/rdma.c          |  7 ++++---
 drivers/nvme/target/loop.c        |  4 ++--
 drivers/scsi/scsi_lib.c           | 10 ++++++----
 include/linux/scatterlist.h       | 11 +++++++----
 lib/scatterlist.c                 | 36 +++++++++++++++++++++++-------------
 lib/sg_pool.c                     | 37 +++++++++++++++++++++++++++----------
 net/sunrpc/xprtrdma/svc_rdma_rw.c |  5 +++--
 8 files changed, 76 insertions(+), 41 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 31637f8ef22e..9b42ef82a273 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2112,7 +2112,8 @@ nvme_fc_map_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 
 	freq->sg_table.sgl = freq->first_sgl;
 	ret = sg_alloc_table_chained(&freq->sg_table,
-			blk_rq_nr_phys_segments(rq), freq->sg_table.sgl);
+			blk_rq_nr_phys_segments(rq), freq->sg_table.sgl,
+			SG_CHUNK_SIZE);
 	if (ret)
 		return -ENOMEM;
 
@@ -2122,7 +2123,7 @@ nvme_fc_map_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 	freq->sg_cnt = fc_dma_map_sg(ctrl->lport->dev, freq->sg_table.sgl,
 				op->nents, dir);
 	if (unlikely(freq->sg_cnt <= 0)) {
-		sg_free_table_chained(&freq->sg_table, true);
+		sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE);
 		freq->sg_cnt = 0;
 		return -EFAULT;
 	}
@@ -2148,7 +2149,7 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 
 	nvme_cleanup_cmd(rq);
 
-	sg_free_table_chained(&freq->sg_table, true);
+	sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE);
 
 	freq->sg_cnt = 0;
 }
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 11a5ecae78c8..2022a196d2a7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1155,7 +1155,7 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
 				    WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
 
 	nvme_cleanup_cmd(rq);
-	sg_free_table_chained(&req->sg_table, true);
+	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
 }
 
 static int nvme_rdma_set_sg_null(struct nvme_command *c)
@@ -1270,7 +1270,8 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 
 	req->sg_table.sgl = req->first_sgl;
 	ret = sg_alloc_table_chained(&req->sg_table,
-			blk_rq_nr_phys_segments(rq), req->sg_table.sgl);
+			blk_rq_nr_phys_segments(rq), req->sg_table.sgl,
+			SG_CHUNK_SIZE);
 	if (ret)
 		return -ENOMEM;
 
@@ -1310,7 +1311,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 			req->nents, rq_data_dir(rq) ==
 			WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
 out_free_table:
-	sg_free_table_chained(&req->sg_table, true);
+	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
 	return ret;
 }
 
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index b9f623ab01f3..b3e2f42a3644 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -77,7 +77,7 @@ static void nvme_loop_complete_rq(struct request *req)
 	struct nvme_loop_iod *iod = blk_mq_rq_to_pdu(req);
 
 	nvme_cleanup_cmd(req);
-	sg_free_table_chained(&iod->sg_table, true);
+	sg_free_table_chained(&iod->sg_table, SG_CHUNK_SIZE);
 	nvme_complete_rq(req);
 }
 
@@ -171,7 +171,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		iod->sg_table.sgl = iod->first_sgl;
 		if (sg_alloc_table_chained(&iod->sg_table,
 				blk_rq_nr_phys_segments(req),
-				iod->sg_table.sgl))
+				iod->sg_table.sgl, SG_CHUNK_SIZE))
 			return BLK_STS_RESOURCE;
 
 		iod->req.sg = iod->sg_table.sgl;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6b2570a5642d..c37263c123eb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -541,9 +541,9 @@ static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
 static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
-		sg_free_table_chained(&cmd->sdb.table, true);
+		sg_free_table_chained(&cmd->sdb.table, SG_CHUNK_SIZE);
 	if (scsi_prot_sg_count(cmd))
-		sg_free_table_chained(&cmd->prot_sdb->table, true);
+		sg_free_table_chained(&cmd->prot_sdb->table, SG_CHUNK_SIZE);
 }
 
 static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
@@ -976,7 +976,8 @@ static blk_status_t scsi_init_sgtable(struct request *req,
 	 * If sg table allocation fails, requeue request later.
 	 */
 	if (unlikely(sg_alloc_table_chained(&sdb->table,
-			blk_rq_nr_phys_segments(req), sdb->table.sgl)))
+			blk_rq_nr_phys_segments(req), sdb->table.sgl,
+			SG_CHUNK_SIZE)))
 		return BLK_STS_RESOURCE;
 
 	/* 
@@ -1030,7 +1031,8 @@ blk_status_t scsi_init_io(struct scsi_cmnd *cmd)
 		ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
 
 		if (sg_alloc_table_chained(&prot_sdb->table, ivecs,
-				prot_sdb->table.sgl)) {
+				prot_sdb->table.sgl,
+				SG_CHUNK_SIZE)) {
 			ret = BLK_STS_RESOURCE;
 			goto out_free_sgtables;
 		}
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index b4be960c7e5d..e1d4cf5ab28a 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -266,10 +266,11 @@ int sg_split(struct scatterlist *in, const int in_mapped_nents,
 typedef struct scatterlist *(sg_alloc_fn)(unsigned int, gfp_t);
 typedef void (sg_free_fn)(struct scatterlist *, unsigned int);
 
-void __sg_free_table(struct sg_table *, unsigned int, bool, sg_free_fn *);
+void __sg_free_table(struct sg_table *, unsigned int, unsigned int,
+		     sg_free_fn *);
 void sg_free_table(struct sg_table *);
 int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int,
-		     struct scatterlist *, gfp_t, sg_alloc_fn *);
+		     struct scatterlist *, unsigned int, gfp_t, sg_alloc_fn *);
 int sg_alloc_table(struct sg_table *, unsigned int, gfp_t);
 int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
 				unsigned int n_pages, unsigned int offset,
@@ -331,9 +332,11 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 #endif
 
 #ifdef CONFIG_SG_POOL
-void sg_free_table_chained(struct sg_table *table, bool first_chunk);
+void sg_free_table_chained(struct sg_table *table,
+			   unsigned nents_first_chunk);
 int sg_alloc_table_chained(struct sg_table *table, int nents,
-			   struct scatterlist *first_chunk);
+			   struct scatterlist *first_chunk,
+			   unsigned nents_first_chunk);
 #endif
 
 /*
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 739dc9fe2c55..77ec8eec3fd0 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -181,7 +181,8 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
  * __sg_free_table - Free a previously mapped sg table
  * @table:	The sg table header to use
  * @max_ents:	The maximum number of entries per single scatterlist
- * @skip_first_chunk: don't free the (preallocated) first scatterlist chunk
+ * @nents_first_chunk: Number of entries int the (preallocated) first
+ * 	scatterlist chunk, 0 means no such preallocated first chunk
  * @free_fn:	Free function
  *
  *  Description:
@@ -191,9 +192,10 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
  *
  **/
 void __sg_free_table(struct sg_table *table, unsigned int max_ents,
-		     bool skip_first_chunk, sg_free_fn *free_fn)
+		     unsigned int nents_first_chunk, sg_free_fn *free_fn)
 {
 	struct scatterlist *sgl, *next;
+	unsigned curr_max_ents = nents_first_chunk ?: max_ents;
 
 	if (unlikely(!table->sgl))
 		return;
@@ -209,9 +211,9 @@ void __sg_free_table(struct sg_table *table, unsigned int max_ents,
 		 * sg_size is then one less than alloc size, since the last
 		 * element is the chain pointer.
 		 */
-		if (alloc_size > max_ents) {
-			next = sg_chain_ptr(&sgl[max_ents - 1]);
-			alloc_size = max_ents;
+		if (alloc_size > curr_max_ents) {
+			next = sg_chain_ptr(&sgl[curr_max_ents - 1]);
+			alloc_size = curr_max_ents;
 			sg_size = alloc_size - 1;
 		} else {
 			sg_size = alloc_size;
@@ -219,11 +221,12 @@ void __sg_free_table(struct sg_table *table, unsigned int max_ents,
 		}
 
 		table->orig_nents -= sg_size;
-		if (skip_first_chunk)
-			skip_first_chunk = false;
+		if (nents_first_chunk)
+			nents_first_chunk = 0;
 		else
 			free_fn(sgl, alloc_size);
 		sgl = next;
+		curr_max_ents = max_ents;
 	}
 
 	table->sgl = NULL;
@@ -246,6 +249,8 @@ EXPORT_SYMBOL(sg_free_table);
  * @table:	The sg table header to use
  * @nents:	Number of entries in sg list
  * @max_ents:	The maximum number of entries the allocator returns per call
+ * @nents_first_chunk: Number of entries int the (preallocated) first
+ * 	scatterlist chunk, 0 means no such preallocated chunk provided by user
  * @gfp_mask:	GFP allocation mask
  * @alloc_fn:	Allocator to use
  *
@@ -262,10 +267,13 @@ EXPORT_SYMBOL(sg_free_table);
  **/
 int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 		     unsigned int max_ents, struct scatterlist *first_chunk,
-		     gfp_t gfp_mask, sg_alloc_fn *alloc_fn)
+		     unsigned int nents_first_chunk, gfp_t gfp_mask,
+		     sg_alloc_fn *alloc_fn)
 {
 	struct scatterlist *sg, *prv;
 	unsigned int left;
+	unsigned curr_max_ents = nents_first_chunk ?: max_ents;
+	unsigned prv_max_ents;
 
 	memset(table, 0, sizeof(*table));
 
@@ -281,8 +289,8 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 	do {
 		unsigned int sg_size, alloc_size = left;
 
-		if (alloc_size > max_ents) {
-			alloc_size = max_ents;
+		if (alloc_size > curr_max_ents) {
+			alloc_size = curr_max_ents;
 			sg_size = alloc_size - 1;
 		} else
 			sg_size = alloc_size;
@@ -316,7 +324,7 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 		 * If this is not the first mapping, chain previous part.
 		 */
 		if (prv)
-			sg_chain(prv, max_ents, sg);
+			sg_chain(prv, prv_max_ents, sg);
 		else
 			table->sgl = sg;
 
@@ -327,6 +335,8 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 			sg_mark_end(&sg[sg_size - 1]);
 
 		prv = sg;
+		prv_max_ents = curr_max_ents;
+		curr_max_ents = max_ents;
 	} while (left);
 
 	return 0;
@@ -349,9 +359,9 @@ int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask)
 	int ret;
 
 	ret = __sg_alloc_table(table, nents, SG_MAX_SINGLE_ALLOC,
-			       NULL, gfp_mask, sg_kmalloc);
+			       NULL, 0, gfp_mask, sg_kmalloc);
 	if (unlikely(ret))
-		__sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);
+		__sg_free_table(table, SG_MAX_SINGLE_ALLOC, 0, sg_kfree);
 
 	return ret;
 }
diff --git a/lib/sg_pool.c b/lib/sg_pool.c
index d1c1e6388eaa..b3b8cf62ff49 100644
--- a/lib/sg_pool.c
+++ b/lib/sg_pool.c
@@ -69,18 +69,27 @@ static struct scatterlist *sg_pool_alloc(unsigned int nents, gfp_t gfp_mask)
 /**
  * sg_free_table_chained - Free a previously mapped sg table
  * @table:	The sg table header to use
- * @first_chunk: was first_chunk not NULL in sg_alloc_table_chained?
+ * @nents_first_chunk: size of the first_chunk SGL passed to
+ *		sg_alloc_table_chained
  *
  *  Description:
  *    Free an sg table previously allocated and setup with
  *    sg_alloc_table_chained().
  *
+ *    @nents_first_chunk has to be same with that same parameter passed
+ *    to sg_alloc_table_chained().
+ *
  **/
-void sg_free_table_chained(struct sg_table *table, bool first_chunk)
+void sg_free_table_chained(struct sg_table *table,
+		unsigned nents_first_chunk)
 {
-	if (first_chunk && table->orig_nents <= SG_CHUNK_SIZE)
+	if (table->orig_nents <= nents_first_chunk)
 		return;
-	__sg_free_table(table, SG_CHUNK_SIZE, first_chunk, sg_pool_free);
+
+	if (nents_first_chunk == 1)
+		nents_first_chunk = 0;
+
+	__sg_free_table(table, SG_CHUNK_SIZE, nents_first_chunk, sg_pool_free);
 }
 EXPORT_SYMBOL_GPL(sg_free_table_chained);
 
@@ -89,31 +98,39 @@ EXPORT_SYMBOL_GPL(sg_free_table_chained);
  * @table:	The sg table header to use
  * @nents:	Number of entries in sg list
  * @first_chunk: first SGL
+ * @nents_first_chunk: number of the SGL of @first_chunk
  *
  *  Description:
  *    Allocate and chain SGLs in an sg table. If @nents@ is larger than
- *    SG_CHUNK_SIZE a chained sg table will be setup.
+ *    @nents_first_chunk a chained sg table will be setup.
  *
  **/
 int sg_alloc_table_chained(struct sg_table *table, int nents,
-		struct scatterlist *first_chunk)
+		struct scatterlist *first_chunk, unsigned nents_first_chunk)
 {
 	int ret;
 
 	BUG_ON(!nents);
 
-	if (first_chunk) {
-		if (nents <= SG_CHUNK_SIZE) {
+	if (first_chunk && nents_first_chunk) {
+		if (nents <= nents_first_chunk) {
 			table->nents = table->orig_nents = nents;
 			sg_init_table(table->sgl, nents);
 			return 0;
 		}
 	}
 
+	/* User supposes that the 1st SGL includes real entry */
+	if (nents_first_chunk == 1) {
+		first_chunk = NULL;
+		nents_first_chunk = 0;
+	}
+
 	ret = __sg_alloc_table(table, nents, SG_CHUNK_SIZE,
-			       first_chunk, GFP_ATOMIC, sg_pool_alloc);
+			       first_chunk, nents_first_chunk,
+			       GFP_ATOMIC, sg_pool_alloc);
 	if (unlikely(ret))
-		sg_free_table_chained(table, (bool)first_chunk);
+		sg_free_table_chained(table, nents_first_chunk);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(sg_alloc_table_chained);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 2121c9b4d275..48fe3b16b0d9 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -73,7 +73,8 @@ svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
 
 	ctxt->rw_sg_table.sgl = ctxt->rw_first_sgl;
 	if (sg_alloc_table_chained(&ctxt->rw_sg_table, sges,
-				   ctxt->rw_sg_table.sgl)) {
+				   ctxt->rw_sg_table.sgl,
+				   SG_CHUNK_SIZE)) {
 		kfree(ctxt);
 		ctxt = NULL;
 	}
@@ -84,7 +85,7 @@ svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
 static void svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
 				 struct svc_rdma_rw_ctxt *ctxt)
 {
-	sg_free_table_chained(&ctxt->rw_sg_table, true);
+	sg_free_table_chained(&ctxt->rw_sg_table, SG_CHUNK_SIZE);
 
 	spin_lock(&rdma->sc_rw_ctxt_lock);
 	list_add(&ctxt->rw_list, &rdma->sc_rw_ctxts);
-- 
2.9.5


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH V4 2/3] scsi: core: avoid to pre-allocate big chunk for protection meta data
  2019-04-28  7:39 [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list Ming Lei
  2019-04-28  7:39 ` [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool Ming Lei
@ 2019-04-28  7:39 ` Ming Lei
  2019-04-29 18:15   ` Bart Van Assche
  2019-04-28  7:39 ` [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list Ming Lei
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2019-04-28  7:39 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche,
	Ewan D . Milne, Hannes Reinecke

Now scsi_mq_setup_tags() pre-allocates a big buffer for protection
sg entries, and the buffer size is scsi_mq_sgl_size().

This way isn't correct, scsi_mq_sgl_size() is used to pre-allocate
sg entries for IO data. And the protection data buffer is much less,
for example, one 512byte sector needs 8byte protection data, and
the max sector number for one request is 2560(BLK_DEF_MAX_SECTORS),
so the max protection data size is just 20k.

The usual case is that one bio builds one single bip segment. Attribute
to bio split, bio merge is seldom done for big IO, and it is only done
in case of small bios. And protection data segment number is usually
same with bio count in the request, so the number won't be very big,
and allocating from slab is fast enough.

Reduce to pre-allocate one sg entry for protection data, and switch
to runtime allocation in case that the protection data segment number
is bigger than 1. Then we can save huge pre-alocation, for example,
500+MB is saved on single lpfc HBA.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c37263c123eb..2eaba41655de 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -39,6 +39,12 @@
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
+/*
+ * Size of integrity metadata is usually small, 1 inline sg should
+ * cover normal cases.
+ */
+#define  SCSI_INLINE_PROT_SG_CNT  1
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -543,7 +549,8 @@ static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 	if (cmd->sdb.table.nents)
 		sg_free_table_chained(&cmd->sdb.table, SG_CHUNK_SIZE);
 	if (scsi_prot_sg_count(cmd))
-		sg_free_table_chained(&cmd->prot_sdb->table, SG_CHUNK_SIZE);
+		sg_free_table_chained(&cmd->prot_sdb->table,
+				SCSI_INLINE_PROT_SG_CNT);
 }
 
 static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
@@ -1032,7 +1039,7 @@ blk_status_t scsi_init_io(struct scsi_cmnd *cmd)
 
 		if (sg_alloc_table_chained(&prot_sdb->table, ivecs,
 				prot_sdb->table.sgl,
-				SG_CHUNK_SIZE)) {
+				SCSI_INLINE_PROT_SG_CNT)) {
 			ret = BLK_STS_RESOURCE;
 			goto out_free_sgtables;
 		}
@@ -1820,7 +1827,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	sgl_size = scsi_mq_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
-		cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
+		cmd_size += sizeof(struct scsi_data_buffer) +
+			sizeof(struct scatterlist) * SCSI_INLINE_PROT_SG_CNT;
 
 	memset(&shost->tag_set, 0, sizeof(shost->tag_set));
 	shost->tag_set.ops = &scsi_mq_ops;
-- 
2.9.5


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
  2019-04-28  7:39 [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list Ming Lei
  2019-04-28  7:39 ` [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool Ming Lei
  2019-04-28  7:39 ` [PATCH V4 2/3] scsi: core: avoid to pre-allocate big chunk for protection meta data Ming Lei
@ 2019-04-28  7:39 ` Ming Lei
  2019-04-29 18:16   ` Bart Van Assche
  2019-06-03 20:44   ` Guenter Roeck
  2019-05-05  1:10 ` [PATCH V4 0/3] scsi: core: avoid big pre-allocation " Ming Lei
  2019-05-14  2:06 ` Martin K. Petersen
  4 siblings, 2 replies; 15+ messages in thread
From: Ming Lei @ 2019-04-28  7:39 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Ming Lei, Christoph Hellwig, Bart Van Assche,
	Ewan D . Milne, Hannes Reinecke

Now scsi_mq_setup_tags() pre-allocates a big buffer for IO sg list,
and the buffer size is scsi_mq_sgl_size() which depends on smaller
value between shost->sg_tablesize and SG_CHUNK_SIZE.

Modern HBA's DMA is often capable of deadling with very big segment
number, so scsi_mq_sgl_size() is often big. Suppose the max sg number
of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.

Then if one HBA has lots of queues, and each hw queue's depth is
high, pre-allocation for sg list can consume huge memory.
For example of lpfc, nr_hw_queues can be 70, each queue's depth
can be 3781, so the pre-allocation for data sg list is 70*3781*2k
=517MB for single HBA.

There is Red Hat internal report that scsi_debug based tests can't
be run any more since legacy io path is killed because too big
pre-allocation.

So switch to runtime allocation for sg list, meantime pre-allocate 2
inline sg entries. This way has been applied to NVMe PCI for a while,
so it should be fine for SCSI too. Also runtime sg entries allocation
has verified and run always in the original legacy io path.

Not see performance effect in my big BS test on scsi_debug.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2eaba41655de..472d848f1778 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -45,6 +45,8 @@
  */
 #define  SCSI_INLINE_PROT_SG_CNT  1
 
+#define  SCSI_INLINE_SG_CNT  2
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -547,7 +549,8 @@ static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
 static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
-		sg_free_table_chained(&cmd->sdb.table, SG_CHUNK_SIZE);
+		sg_free_table_chained(&cmd->sdb.table,
+				SCSI_INLINE_SG_CNT);
 	if (scsi_prot_sg_count(cmd))
 		sg_free_table_chained(&cmd->prot_sdb->table,
 				SCSI_INLINE_PROT_SG_CNT);
@@ -984,7 +987,7 @@ static blk_status_t scsi_init_sgtable(struct request *req,
 	 */
 	if (unlikely(sg_alloc_table_chained(&sdb->table,
 			blk_rq_nr_phys_segments(req), sdb->table.sgl,
-			SG_CHUNK_SIZE)))
+			SCSI_INLINE_SG_CNT)))
 		return BLK_STS_RESOURCE;
 
 	/* 
@@ -1550,9 +1553,9 @@ static int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 }
 
 /* Size in bytes of the sg-list stored in the scsi-mq command-private data. */
-static unsigned int scsi_mq_sgl_size(struct Scsi_Host *shost)
+static unsigned int scsi_mq_inline_sgl_size(struct Scsi_Host *shost)
 {
-	return min_t(unsigned int, shost->sg_tablesize, SG_CHUNK_SIZE) *
+	return min_t(unsigned int, shost->sg_tablesize, SCSI_INLINE_SG_CNT) *
 		sizeof(struct scatterlist);
 }
 
@@ -1730,7 +1733,7 @@ static int scsi_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	if (scsi_host_get_prot(shost)) {
 		sg = (void *)cmd + sizeof(struct scsi_cmnd) +
 			shost->hostt->cmd_size;
-		cmd->prot_sdb = (void *)sg + scsi_mq_sgl_size(shost);
+		cmd->prot_sdb = (void *)sg + scsi_mq_inline_sgl_size(shost);
 	}
 
 	return 0;
@@ -1824,7 +1827,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 {
 	unsigned int cmd_size, sgl_size;
 
-	sgl_size = scsi_mq_sgl_size(shost);
+	sgl_size = scsi_mq_inline_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
 		cmd_size += sizeof(struct scsi_data_buffer) +
-- 
2.9.5


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool
  2019-04-28  7:39 ` [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool Ming Lei
@ 2019-04-28 12:04   ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2019-04-28 12:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke, Sagi Grimberg, Chuck Lever, netdev, linux-nvme

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 2/3] scsi: core: avoid to pre-allocate big chunk for protection meta data
  2019-04-28  7:39 ` [PATCH V4 2/3] scsi: core: avoid to pre-allocate big chunk for protection meta data Ming Lei
@ 2019-04-29 18:15   ` Bart Van Assche
  0 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2019-04-29 18:15 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Christoph Hellwig, Ewan D . Milne, Hannes Reinecke

On Sun, 2019-04-28 at 15:39 +-0800, Ming Lei wrote:
+AD4 Now scsi+AF8-mq+AF8-setup+AF8-tags() pre-allocates a big buffer for protection
+AD4 sg entries, and the buffer size is scsi+AF8-mq+AF8-sgl+AF8-size().
+AD4 
+AD4 This way isn't correct, scsi+AF8-mq+AF8-sgl+AF8-size() is used to pre-allocate
+AD4 sg entries for IO data. And the protection data buffer is much less,
+AD4 for example, one 512byte sector needs 8byte protection data, and
+AD4 the max sector number for one request is 2560(BLK+AF8-DEF+AF8-MAX+AF8-SECTORS),
+AD4 so the max protection data size is just 20k.
+AD4 
+AD4 The usual case is that one bio builds one single bip segment. Attribute
+AD4 to bio split, bio merge is seldom done for big IO, and it is only done
+AD4 in case of small bios. And protection data segment number is usually
+AD4 same with bio count in the request, so the number won't be very big,
+AD4 and allocating from slab is fast enough.
+AD4 
+AD4 Reduce to pre-allocate one sg entry for protection data, and switch
+AD4 to runtime allocation in case that the protection data segment number
+AD4 is bigger than 1. Then we can save huge pre-alocation, for example,
+AD4 500

Reviewed-by: Bart Van Assche +ADw-bvanassche+AEA-acm.org+AD4



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
  2019-04-28  7:39 ` [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list Ming Lei
@ 2019-04-29 18:16   ` Bart Van Assche
  2019-06-03 20:44   ` Guenter Roeck
  1 sibling, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2019-04-29 18:16 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Christoph Hellwig, Ewan D . Milne, Hannes Reinecke

On Sun, 2019-04-28 at 15:39 +-0800, Ming Lei wrote:
+AD4 Now scsi+AF8-mq+AF8-setup+AF8-tags() pre-allocates a big buffer for IO sg list,
+AD4 and the buffer size is scsi+AF8-mq+AF8-sgl+AF8-size() which depends on smaller
+AD4 value between shost-+AD4-sg+AF8-tablesize and SG+AF8-CHUNK+AF8-SIZE.
+AD4 
+AD4 Modern HBA's DMA is often capable of deadling with very big segment
+AD4 number, so scsi+AF8-mq+AF8-sgl+AF8-size() is often big. Suppose the max sg number
+AD4 of SG+AF8-CHUNK+AF8-SIZE is taken, scsi+AF8-mq+AF8-sgl+AF8-size() will be 4KB.
+AD4 
+AD4 Then if one HBA has lots of queues, and each hw queue's depth is
+AD4 high, pre-allocation for sg list can consume huge memory.
+AD4 For example of lpfc, nr+AF8-hw+AF8-queues can be 70, each queue's depth
+AD4 can be 3781, so the pre-allocation for data sg list is 70+ACo-3781+ACo-2k
+AD4 +AD0-517MB for single HBA.
+AD4 
+AD4 There is Red Hat internal report that scsi+AF8-debug based tests can't
+AD4 be run any more since legacy io path is killed because too big
+AD4 pre-allocation.
+AD4 
+AD4 So switch to runtime allocation for sg list, meantime pre-allocate 2
+AD4 inline sg entries. This way has been applied to NVMe PCI for a while,
+AD4 so it should be fine for SCSI too. Also runtime sg entries allocation
+AD4 has verified and run always in the original legacy io path.
+AD4 
+AD4 Not see performance effect in my big BS test on scsi+AF8-debug.

Reviewed-by: Bart Van Assche +ADw-bvanassche+AEA-acm.org+AD4



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list
  2019-04-28  7:39 [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list Ming Lei
                   ` (2 preceding siblings ...)
  2019-04-28  7:39 ` [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list Ming Lei
@ 2019-05-05  1:10 ` " Ming Lei
  2019-05-14  2:06 ` Martin K. Petersen
  4 siblings, 0 replies; 15+ messages in thread
From: Ming Lei @ 2019-05-05  1:10 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke, Sagi Grimberg, Chuck Lever, netdev, linux-nvme

On Sun, Apr 28, 2019 at 03:39:29PM +0800, Ming Lei wrote:
> Hi,
> 
> Since supporting to blk-mq, big pre-allocation for sg list is introduced,
> this way is very unfriendly wrt. memory consumption.
> 
> There were Red Hat internal reports that some scsi_debug based tests
> can't be run any more because of too big pre-allocation.
> 
> Also lpfc users commplained that 1GB+ ram is pre-allocatd for single
> HBA.
> 
> sg_alloc_table_chained() is improved to support variant size of 1st
> pre-allocated SGL in the 1st patch as suggested by Christoph.
> 
> The other two patches try to address this issue by allocating sg list runtime,
> meantime pre-allocating one or two inline sg entries for small IO. This
> ways follows NVMe's approach wrt. sg list allocation.
> 
> V4:
> 	- add parameter to sg_alloc_table_chained()/sg_free_table_chained()
> 	directly, and update current callers
> 
> V3:
> 	- improve sg_alloc_table_chained() to accept variant size of
> 	the 1st pre-allocated SGL
> 	- applies the improved sg API to address the big pre-allocation
> 	issue
> 
> V2:
> 	- move inline sg table initializetion into one helper
> 	- introduce new helper for getting inline sg
> 	- comment log fix
> 
> 
> Ming Lei (3):
>   lib/sg_pool.c: improve APIs for allocating sg pool
>   scsi: core: avoid to pre-allocate big chunk for protection meta data
>   scsi: core: avoid to pre-allocate big chunk for sg list
> 
>  drivers/nvme/host/fc.c            |  7 ++++---
>  drivers/nvme/host/rdma.c          |  7 ++++---
>  drivers/nvme/target/loop.c        |  4 ++--
>  drivers/scsi/scsi_lib.c           | 31 ++++++++++++++++++++++---------
>  include/linux/scatterlist.h       | 11 +++++++----
>  lib/scatterlist.c                 | 36 +++++++++++++++++++++++-------------
>  lib/sg_pool.c                     | 37 +++++++++++++++++++++++++++----------
>  net/sunrpc/xprtrdma/svc_rdma_rw.c |  5 +++--
>  8 files changed, 92 insertions(+), 46 deletions(-)
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Chuck Lever <chuck.lever@oracle.com>
> Cc: netdev@vger.kernel.org
> Cc: linux-nvme@lists.infradead.org

Hi Martin,

Could you consider to merge this patchset to 5.2 if you are fine?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list
  2019-04-28  7:39 [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list Ming Lei
                   ` (3 preceding siblings ...)
  2019-05-05  1:10 ` [PATCH V4 0/3] scsi: core: avoid big pre-allocation " Ming Lei
@ 2019-05-14  2:06 ` Martin K. Petersen
  4 siblings, 0 replies; 15+ messages in thread
From: Martin K. Petersen @ 2019-05-14  2:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke, Sagi Grimberg, Chuck Lever, netdev, linux-nvme


Ming,

> Since supporting to blk-mq, big pre-allocation for sg list is
> introduced, this way is very unfriendly wrt. memory consumption.

Applied to 5.3/scsi-queue with some clarifications to the commit
descriptions.

I am not entirely sold on 1 for the inline protection SGL size. NVMe
over PCIe is pretty constrained thanks to the metadata pointer whereas
SCSI DIX uses a real SGL for the PI. Consequently, straddling a page is
not that uncommon for large, sequential I/Os.

But let's try it out. If performance suffers substantially, we may want
to bump it to 2.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
  2019-04-28  7:39 ` [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list Ming Lei
  2019-04-29 18:16   ` Bart Van Assche
@ 2019-06-03 20:44   ` Guenter Roeck
  2019-06-04  1:00     ` Ming Lei
  1 sibling, 1 reply; 15+ messages in thread
From: Guenter Roeck @ 2019-06-03 20:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke

On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> Now scsi_mq_setup_tags() pre-allocates a big buffer for IO sg list,
> and the buffer size is scsi_mq_sgl_size() which depends on smaller
> value between shost->sg_tablesize and SG_CHUNK_SIZE.
> 
> Modern HBA's DMA is often capable of deadling with very big segment
> number, so scsi_mq_sgl_size() is often big. Suppose the max sg number
> of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.
> 
> Then if one HBA has lots of queues, and each hw queue's depth is
> high, pre-allocation for sg list can consume huge memory.
> For example of lpfc, nr_hw_queues can be 70, each queue's depth
> can be 3781, so the pre-allocation for data sg list is 70*3781*2k
> =517MB for single HBA.
> 
> There is Red Hat internal report that scsi_debug based tests can't
> be run any more since legacy io path is killed because too big
> pre-allocation.
> 
> So switch to runtime allocation for sg list, meantime pre-allocate 2
> inline sg entries. This way has been applied to NVMe PCI for a while,
> so it should be fine for SCSI too. Also runtime sg entries allocation
> has verified and run always in the original legacy io path.
> 
> Not see performance effect in my big BS test on scsi_debug.
> 

This patch causes a variety of boot failures in -next. Typical failure
pattern is scsi hangs or failure to find a root file system. For example,
on alpha, trying to boot from usb:

scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
sd 0:0:0:0: Power-on or device reset occurred
sd 0:0:0:0: [sda] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] Attached SCSI disk
usb 1-1: reset full-speed USB device number 2 using ohci-pci
sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x00
sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 00 00 58 00 00 42 00
print_req_error: I/O error, dev sda, sector 88 flags 80000
EXT4-fs error (device sda): __ext4_get_inode_loc:4703: inode #2: block 44: comm
swapper: unable to read itable block
EXT4-fs (sda): get root inode failed
EXT4-fs (sda): mount failed
VFS: Cannot open root device "sda" or unknown-block(8,0): error -5

Similar problems are seen with other architectures.
Reverting the patch fixes the problem. Bisect log attached.

Guenter

---
# bad: [ae3cad8f39ccf8d31775d9737488bccf0e44d370] Add linux-next specific files for 20190603
# good: [f2c7c76c5d0a443053e94adb9f0918fa2fb85c3a] Linux 5.2-rc3
git bisect start 'HEAD' 'v5.2-rc3'
# good: [8ff6f4c6e067a9d3f3bbacf02c4ea5eb81fe2c6a] Merge remote-tracking branch 'crypto/master'
git bisect good 8ff6f4c6e067a9d3f3bbacf02c4ea5eb81fe2c6a
# good: [6c93755861ce6a6dd904df9cdae9f08671132dbe] Merge remote-tracking branch 'iommu/next'
git bisect good 6c93755861ce6a6dd904df9cdae9f08671132dbe
# good: [1a567956cb3be5754d94ce9380a2151e57e204a7] Merge remote-tracking branch 'cgroup/for-next'
git bisect good 1a567956cb3be5754d94ce9380a2151e57e204a7
# bad: [a6878ca73cf30b83efbdfb1ecc443d7cfb2d8193] Merge remote-tracking branch 'rtc/rtc-next'
git bisect bad a6878ca73cf30b83efbdfb1ecc443d7cfb2d8193
# bad: [25e7f57cf9f86076704b543628a0f02d3d733726] Merge remote-tracking branch 'gpio/for-next'
git bisect bad 25e7f57cf9f86076704b543628a0f02d3d733726
# bad: [28e85db94534c92fa00d787264e3ea1843d1dc42] scsi: lpfc: Fix nvmet handling of received ABTS for unmapped frames
git bisect bad 28e85db94534c92fa00d787264e3ea1843d1dc42
# bad: [cf9eddf616bb03387e597629142b0be34111e8d0] scsi: hpsa: check for tag collision
git bisect bad cf9eddf616bb03387e597629142b0be34111e8d0
# good: [4e74166c52a836f05d4bd8270835703908b34d3e] scsi: libsas: switch sas_ata.[ch] to SPDX tags
git bisect good 4e74166c52a836f05d4bd8270835703908b34d3e
# good: [f186090846c29fd9760917fb3d01f095c39262e0] scsi: lib/sg_pool.c: improve APIs for allocating sg pool
git bisect good f186090846c29fd9760917fb3d01f095c39262e0
# bad: [12b6b55806928690359bb21de3a19199412289fd] scsi: sd: Inline sd_probe_part2()
git bisect bad 12b6b55806928690359bb21de3a19199412289fd
# bad: [c3288dd8c2328a74cb2157dff18d13f6a75cedd2] scsi: core: avoid pre-allocating big SGL for data
git bisect bad c3288dd8c2328a74cb2157dff18d13f6a75cedd2
# good: [0f0e744eae6c8af361d227d3a2286973351e5a2a] scsi: core: avoid pre-allocating big SGL for protection information
git bisect good 0f0e744eae6c8af361d227d3a2286973351e5a2a
# first bad commit: [c3288dd8c2328a74cb2157dff18d13f6a75cedd2] scsi: core: avoid pre-allocating big SGL for data

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
  2019-06-03 20:44   ` Guenter Roeck
@ 2019-06-04  1:00     ` Ming Lei
  2019-06-04  3:49       ` Guenter Roeck
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2019-06-04  1:00 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke

On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
> On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> > Now scsi_mq_setup_tags() pre-allocates a big buffer for IO sg list,
> > and the buffer size is scsi_mq_sgl_size() which depends on smaller
> > value between shost->sg_tablesize and SG_CHUNK_SIZE.
> > 
> > Modern HBA's DMA is often capable of deadling with very big segment
> > number, so scsi_mq_sgl_size() is often big. Suppose the max sg number
> > of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.
> > 
> > Then if one HBA has lots of queues, and each hw queue's depth is
> > high, pre-allocation for sg list can consume huge memory.
> > For example of lpfc, nr_hw_queues can be 70, each queue's depth
> > can be 3781, so the pre-allocation for data sg list is 70*3781*2k
> > =517MB for single HBA.
> > 
> > There is Red Hat internal report that scsi_debug based tests can't
> > be run any more since legacy io path is killed because too big
> > pre-allocation.
> > 
> > So switch to runtime allocation for sg list, meantime pre-allocate 2
> > inline sg entries. This way has been applied to NVMe PCI for a while,
> > so it should be fine for SCSI too. Also runtime sg entries allocation
> > has verified and run always in the original legacy io path.
> > 
> > Not see performance effect in my big BS test on scsi_debug.
> > 
> 
> This patch causes a variety of boot failures in -next. Typical failure
> pattern is scsi hangs or failure to find a root file system. For example,
> on alpha, trying to boot from usb:

I guess it is because alpha doesn't support sg chaining, and
CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
can only be arm, alpha and parisc.

Please test the following patch and see if it makes a difference:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6e81258471fa..9ef632963740 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -44,9 +44,13 @@
  * Size of integrity metadata is usually small, 1 inline sg should
  * cover normal cases.
  */
+#ifndef CONFIG_ARCH_NO_SG_CHAIN
 #define  SCSI_INLINE_PROT_SG_CNT  1
-
 #define  SCSI_INLINE_SG_CNT  2
+#else
+#define  SCSI_INLINE_PROT_SG_CNT  0
+#define  SCSI_INLINE_SG_CNT  0
+#endif
 
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;


Thanks,
Ming

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
  2019-06-04  1:00     ` Ming Lei
@ 2019-06-04  3:49       ` Guenter Roeck
  2019-06-04  4:10         ` Ming Lei
  2019-06-04  6:55         ` Ming Lei
  0 siblings, 2 replies; 15+ messages in thread
From: Guenter Roeck @ 2019-06-04  3:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke

On 6/3/19 6:00 PM, Ming Lei wrote:
> On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
>> On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
>>> Now scsi_mq_setup_tags() pre-allocates a big buffer for IO sg list,
>>> and the buffer size is scsi_mq_sgl_size() which depends on smaller
>>> value between shost->sg_tablesize and SG_CHUNK_SIZE.
>>>
>>> Modern HBA's DMA is often capable of deadling with very big segment
>>> number, so scsi_mq_sgl_size() is often big. Suppose the max sg number
>>> of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.
>>>
>>> Then if one HBA has lots of queues, and each hw queue's depth is
>>> high, pre-allocation for sg list can consume huge memory.
>>> For example of lpfc, nr_hw_queues can be 70, each queue's depth
>>> can be 3781, so the pre-allocation for data sg list is 70*3781*2k
>>> =517MB for single HBA.
>>>
>>> There is Red Hat internal report that scsi_debug based tests can't
>>> be run any more since legacy io path is killed because too big
>>> pre-allocation.
>>>
>>> So switch to runtime allocation for sg list, meantime pre-allocate 2
>>> inline sg entries. This way has been applied to NVMe PCI for a while,
>>> so it should be fine for SCSI too. Also runtime sg entries allocation
>>> has verified and run always in the original legacy io path.
>>>
>>> Not see performance effect in my big BS test on scsi_debug.
>>>
>>
>> This patch causes a variety of boot failures in -next. Typical failure
>> pattern is scsi hangs or failure to find a root file system. For example,
>> on alpha, trying to boot from usb:
> 
> I guess it is because alpha doesn't support sg chaining, and
> CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
> can only be arm, alpha and parisc.
> 

I don't think it is that simple. I do see the problem on x86 (32 and 64 bit)
sparc, ppc, and m68k as well, and possibly others (I didn't check all because
-next is in terrible shape right now). Error log is always a bit different
but similar.

On sparc:

scsi host0: Data transfer overflow.
scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
scsi host0: DMA length is zero!
scsi host0: cur adr[f000f000] len[00000000]
scsi host0: Data transfer overflow.
scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
scsi host0: DMA length is zero!

On ppc:

scsi host0: DMA length is zero!
scsi host0: cur adr[0fd21000] len[00000000]
scsi host0: Aborting command [(ptrval):28]
scsi host0: Current command [(ptrval):28]
scsi host0:  Active command [(ptrval):28]

On x86, x86_64 (after reverting a different crash-causing patch):

[   20.226809] scsi host0: DMA length is zero!
[   20.227459] scsi host0: cur adr[00000000] len[00000000]
[   50.588814] scsi host0: Aborting command [(____ptrval____):28]
[   50.589210] scsi host0: Current command [(____ptrval____):28]
[   50.589447] scsi host0:  Active command [(____ptrval____):28]
[   50.589674] scsi host0: Dumping command log

On m68k there is a crash.

Unable to handle kernel NULL pointer dereference at virtual address (ptrval)
Oops: 00000000
Modules linked in:
PC: [<00203a9e>] esp_maybe_execute_command+0x31e/0x46c
SR: 2704  SP: (ptrval)  a2: 07c1ea20
d0: 00000002    d1: 00000400    d2: 00000000    d3: 00002000
d4: 00000000    d5: 00000000    a0: 07db753c    a1: 00000000
Process kworker/0:0H (pid: 4, task=(ptrval))
Frame format=7 eff addr=00000020 ssw=0505 faddr=00000020
wb 1 stat/addr/data: 0000 00000000 00000000
wb 2 stat/addr/data: 0000 00000000 00000000
wb 3 stat/addr/data: 0000 00000020 00000000
push data: 00000000 00000000 00000000 00000000
Stack from 07c2be08:
...
Call Trace: [<00002000>] _start+0x0/0x8
  [<001f65ca>] scsi_mq_done+0x0/0x2c
  [<001887fe>] blk_mq_get_driver_tag+0x0/0xba
  [<00188800>] blk_mq_get_driver_tag+0x2/0xba
  [<00025aec>] create_worker+0x0/0x14e
  [<00203f64>] esp_queuecommand+0x9c/0xa2
  [<00203f3a>] esp_queuecommand+0x72/0xa2
  [<001f7fd4>] scsi_queue_rq+0x54e/0x5ba
  [<00188b4a>] blk_mq_dispatch_rq_list+0x292/0x38a
  [<00188656>] blk_mq_dequeue_from_ctx+0x0/0x1a8
  [<001888b8>] blk_mq_dispatch_rq_list+0x0/0x38a
  [<00025aec>] create_worker+0x0/0x14e
...

All those problems are fixed by reverting this patch.

Guenter

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
  2019-06-04  3:49       ` Guenter Roeck
@ 2019-06-04  4:10         ` Ming Lei
  2019-06-04 14:51           ` Guenter Roeck
  2019-06-04  6:55         ` Ming Lei
  1 sibling, 1 reply; 15+ messages in thread
From: Ming Lei @ 2019-06-04  4:10 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke

On Mon, Jun 03, 2019 at 08:49:10PM -0700, Guenter Roeck wrote:
> On 6/3/19 6:00 PM, Ming Lei wrote:
> > On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
> > > On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> > > > Now scsi_mq_setup_tags() pre-allocates a big buffer for IO sg list,
> > > > and the buffer size is scsi_mq_sgl_size() which depends on smaller
> > > > value between shost->sg_tablesize and SG_CHUNK_SIZE.
> > > > 
> > > > Modern HBA's DMA is often capable of deadling with very big segment
> > > > number, so scsi_mq_sgl_size() is often big. Suppose the max sg number
> > > > of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.
> > > > 
> > > > Then if one HBA has lots of queues, and each hw queue's depth is
> > > > high, pre-allocation for sg list can consume huge memory.
> > > > For example of lpfc, nr_hw_queues can be 70, each queue's depth
> > > > can be 3781, so the pre-allocation for data sg list is 70*3781*2k
> > > > =517MB for single HBA.
> > > > 
> > > > There is Red Hat internal report that scsi_debug based tests can't
> > > > be run any more since legacy io path is killed because too big
> > > > pre-allocation.
> > > > 
> > > > So switch to runtime allocation for sg list, meantime pre-allocate 2
> > > > inline sg entries. This way has been applied to NVMe PCI for a while,
> > > > so it should be fine for SCSI too. Also runtime sg entries allocation
> > > > has verified and run always in the original legacy io path.
> > > > 
> > > > Not see performance effect in my big BS test on scsi_debug.
> > > > 
> > > 
> > > This patch causes a variety of boot failures in -next. Typical failure
> > > pattern is scsi hangs or failure to find a root file system. For example,
> > > on alpha, trying to boot from usb:
> > 
> > I guess it is because alpha doesn't support sg chaining, and
> > CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
> > can only be arm, alpha and parisc.
> > 
> 
> I don't think it is that simple. I do see the problem on x86 (32 and 64 bit)
> sparc, ppc, and m68k as well, and possibly others (I didn't check all because
> -next is in terrible shape right now). Error log is always a bit different
> but similar.
> 
> On sparc:
> 
> scsi host0: Data transfer overflow.
> scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> scsi host0: DMA length is zero!
> scsi host0: cur adr[f000f000] len[00000000]
> scsi host0: Data transfer overflow.
> scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> scsi host0: DMA length is zero!
> 
> On ppc:
> 
> scsi host0: DMA length is zero!
> scsi host0: cur adr[0fd21000] len[00000000]
> scsi host0: Aborting command [(ptrval):28]
> scsi host0: Current command [(ptrval):28]
> scsi host0:  Active command [(ptrval):28]
> 
> On x86, x86_64 (after reverting a different crash-causing patch):
> 
> [   20.226809] scsi host0: DMA length is zero!
> [   20.227459] scsi host0: cur adr[00000000] len[00000000]
> [   50.588814] scsi host0: Aborting command [(____ptrval____):28]
> [   50.589210] scsi host0: Current command [(____ptrval____):28]
> [   50.589447] scsi host0:  Active command [(____ptrval____):28]
> [   50.589674] scsi host0: Dumping command log

OK, I did see one boot crash issue on x86_64 with -next, so could
you share us that patch which needs to be reverted? Meantime, please
provide me your steps for reproducing this issue? (rootfs image, kernel
config, qemu command)

BTW, the patch has been tested in RH QE lab, so far not see such reports
yet.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
  2019-06-04  3:49       ` Guenter Roeck
  2019-06-04  4:10         ` Ming Lei
@ 2019-06-04  6:55         ` Ming Lei
  1 sibling, 0 replies; 15+ messages in thread
From: Ming Lei @ 2019-06-04  6:55 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke

On Mon, Jun 03, 2019 at 08:49:10PM -0700, Guenter Roeck wrote:
> On 6/3/19 6:00 PM, Ming Lei wrote:
> > On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
> > > On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> > > > Now scsi_mq_setup_tags() pre-allocates a big buffer for IO sg list,
> > > > and the buffer size is scsi_mq_sgl_size() which depends on smaller
> > > > value between shost->sg_tablesize and SG_CHUNK_SIZE.
> > > > 
> > > > Modern HBA's DMA is often capable of deadling with very big segment
> > > > number, so scsi_mq_sgl_size() is often big. Suppose the max sg number
> > > > of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.
> > > > 
> > > > Then if one HBA has lots of queues, and each hw queue's depth is
> > > > high, pre-allocation for sg list can consume huge memory.
> > > > For example of lpfc, nr_hw_queues can be 70, each queue's depth
> > > > can be 3781, so the pre-allocation for data sg list is 70*3781*2k
> > > > =517MB for single HBA.
> > > > 
> > > > There is Red Hat internal report that scsi_debug based tests can't
> > > > be run any more since legacy io path is killed because too big
> > > > pre-allocation.
> > > > 
> > > > So switch to runtime allocation for sg list, meantime pre-allocate 2
> > > > inline sg entries. This way has been applied to NVMe PCI for a while,
> > > > so it should be fine for SCSI too. Also runtime sg entries allocation
> > > > has verified and run always in the original legacy io path.
> > > > 
> > > > Not see performance effect in my big BS test on scsi_debug.
> > > > 
> > > 
> > > This patch causes a variety of boot failures in -next. Typical failure
> > > pattern is scsi hangs or failure to find a root file system. For example,
> > > on alpha, trying to boot from usb:
> > 
> > I guess it is because alpha doesn't support sg chaining, and
> > CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
> > can only be arm, alpha and parisc.
> > 
> 
> I don't think it is that simple. I do see the problem on x86 (32 and 64 bit)
> sparc, ppc, and m68k as well, and possibly others (I didn't check all because
> -next is in terrible shape right now). Error log is always a bit different
> but similar.
> 
> On sparc:
> 
> scsi host0: Data transfer overflow.
> scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> scsi host0: DMA length is zero!
> scsi host0: cur adr[f000f000] len[00000000]
> scsi host0: Data transfer overflow.
> scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> scsi host0: DMA length is zero!
> 
> On ppc:
> 
> scsi host0: DMA length is zero!
> scsi host0: cur adr[0fd21000] len[00000000]
> scsi host0: Aborting command [(ptrval):28]
> scsi host0: Current command [(ptrval):28]
> scsi host0:  Active command [(ptrval):28]
> 
> On x86, x86_64 (after reverting a different crash-causing patch):
> 
> [   20.226809] scsi host0: DMA length is zero!
> [   20.227459] scsi host0: cur adr[00000000] len[00000000]
> [   50.588814] scsi host0: Aborting command [(____ptrval____):28]
> [   50.589210] scsi host0: Current command [(____ptrval____):28]
> [   50.589447] scsi host0:  Active command [(____ptrval____):28]
> [   50.589674] scsi host0: Dumping command log
> 
> On m68k there is a crash.
> 
> Unable to handle kernel NULL pointer dereference at virtual address (ptrval)
> Oops: 00000000
> Modules linked in:
> PC: [<00203a9e>] esp_maybe_execute_command+0x31e/0x46c
> SR: 2704  SP: (ptrval)  a2: 07c1ea20
> d0: 00000002    d1: 00000400    d2: 00000000    d3: 00002000
> d4: 00000000    d5: 00000000    a0: 07db753c    a1: 00000000
> Process kworker/0:0H (pid: 4, task=(ptrval))
> Frame format=7 eff addr=00000020 ssw=0505 faddr=00000020
> wb 1 stat/addr/data: 0000 00000000 00000000
> wb 2 stat/addr/data: 0000 00000000 00000000
> wb 3 stat/addr/data: 0000 00000020 00000000
> push data: 00000000 00000000 00000000 00000000
> Stack from 07c2be08:
> ...
> Call Trace: [<00002000>] _start+0x0/0x8
>  [<001f65ca>] scsi_mq_done+0x0/0x2c
>  [<001887fe>] blk_mq_get_driver_tag+0x0/0xba
>  [<00188800>] blk_mq_get_driver_tag+0x2/0xba
>  [<00025aec>] create_worker+0x0/0x14e
>  [<00203f64>] esp_queuecommand+0x9c/0xa2
>  [<00203f3a>] esp_queuecommand+0x72/0xa2
>  [<001f7fd4>] scsi_queue_rq+0x54e/0x5ba
>  [<00188b4a>] blk_mq_dispatch_rq_list+0x292/0x38a
>  [<00188656>] blk_mq_dequeue_from_ctx+0x0/0x1a8
>  [<001888b8>] blk_mq_dispatch_rq_list+0x0/0x38a
>  [<00025aec>] create_worker+0x0/0x14e
> ...
> 
> All those problems are fixed by reverting this patch.

All above problem is esp_scsi specific, and esp_scsi does not support
sg chain.

I will try to figure out one patch to cover all.

esp_map_dma():
        if (esp->flags & ESP_FLAG_NO_DMA_MAP) {
                /*
                 * For pseudo DMA and PIO we need the virtual address instead of
                 * a dma address, so perform an identity mapping.
                 */
                spriv->num_sg = scsi_sg_count(cmd);
                for (i = 0; i < spriv->num_sg; i++) {
                        sg[i].dma_address = (uintptr_t)sg_virt(&sg[i]);
                        total += sg_dma_len(&sg[i]);
                }
        } else {
                spriv->num_sg = scsi_dma_map(cmd);
                for (i = 0; i < spriv->num_sg; i++)
                        total += sg_dma_len(&sg[i]);
        }

esp_advance_dma():
	p->cur_sg++;

esp_msgin_process():
	spriv->cur_sg--;

Thanks,
Ming

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
  2019-06-04  4:10         ` Ming Lei
@ 2019-06-04 14:51           ` Guenter Roeck
  0 siblings, 0 replies; 15+ messages in thread
From: Guenter Roeck @ 2019-06-04 14:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Christoph Hellwig, Bart Van Assche, Ewan D . Milne,
	Hannes Reinecke

On Tue, Jun 04, 2019 at 12:10:00PM +0800, Ming Lei wrote:
> On Mon, Jun 03, 2019 at 08:49:10PM -0700, Guenter Roeck wrote:
> > On 6/3/19 6:00 PM, Ming Lei wrote:
> > > On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
> > > > On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> > > > > Now scsi_mq_setup_tags() pre-allocates a big buffer for IO sg list,
> > > > > and the buffer size is scsi_mq_sgl_size() which depends on smaller
> > > > > value between shost->sg_tablesize and SG_CHUNK_SIZE.
> > > > > 
> > > > > Modern HBA's DMA is often capable of deadling with very big segment
> > > > > number, so scsi_mq_sgl_size() is often big. Suppose the max sg number
> > > > > of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.
> > > > > 
> > > > > Then if one HBA has lots of queues, and each hw queue's depth is
> > > > > high, pre-allocation for sg list can consume huge memory.
> > > > > For example of lpfc, nr_hw_queues can be 70, each queue's depth
> > > > > can be 3781, so the pre-allocation for data sg list is 70*3781*2k
> > > > > =517MB for single HBA.
> > > > > 
> > > > > There is Red Hat internal report that scsi_debug based tests can't
> > > > > be run any more since legacy io path is killed because too big
> > > > > pre-allocation.
> > > > > 
> > > > > So switch to runtime allocation for sg list, meantime pre-allocate 2
> > > > > inline sg entries. This way has been applied to NVMe PCI for a while,
> > > > > so it should be fine for SCSI too. Also runtime sg entries allocation
> > > > > has verified and run always in the original legacy io path.
> > > > > 
> > > > > Not see performance effect in my big BS test on scsi_debug.
> > > > > 
> > > > 
> > > > This patch causes a variety of boot failures in -next. Typical failure
> > > > pattern is scsi hangs or failure to find a root file system. For example,
> > > > on alpha, trying to boot from usb:
> > > 
> > > I guess it is because alpha doesn't support sg chaining, and
> > > CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
> > > can only be arm, alpha and parisc.
> > > 
> > 
> > I don't think it is that simple. I do see the problem on x86 (32 and 64 bit)
> > sparc, ppc, and m68k as well, and possibly others (I didn't check all because
> > -next is in terrible shape right now). Error log is always a bit different
> > but similar.
> > 
> > On sparc:
> > 
> > scsi host0: Data transfer overflow.
> > scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> > scsi host0: DMA length is zero!
> > scsi host0: cur adr[f000f000] len[00000000]
> > scsi host0: Data transfer overflow.
> > scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> > scsi host0: DMA length is zero!
> > 
> > On ppc:
> > 
> > scsi host0: DMA length is zero!
> > scsi host0: cur adr[0fd21000] len[00000000]
> > scsi host0: Aborting command [(ptrval):28]
> > scsi host0: Current command [(ptrval):28]
> > scsi host0:  Active command [(ptrval):28]
> > 
> > On x86, x86_64 (after reverting a different crash-causing patch):
> > 
> > [   20.226809] scsi host0: DMA length is zero!
> > [   20.227459] scsi host0: cur adr[00000000] len[00000000]
> > [   50.588814] scsi host0: Aborting command [(____ptrval____):28]
> > [   50.589210] scsi host0: Current command [(____ptrval____):28]
> > [   50.589447] scsi host0:  Active command [(____ptrval____):28]
> > [   50.589674] scsi host0: Dumping command log
> 
> OK, I did see one boot crash issue on x86_64 with -next, so could
> you share us that patch which needs to be reverted? Meantime, please
> provide me your steps for reproducing this issue? (rootfs image, kernel
> config, qemu command)
> 
The patch to be reverted is this one. I'll prepare the rest of the
information later today.

> BTW, the patch has been tested in RH QE lab, so far not see such reports
> yet.
> 
FWIW, I don't think the RE QE lab tests any of the affected configurations.

Guenter

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, back to index

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-28  7:39 [PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list Ming Lei
2019-04-28  7:39 ` [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool Ming Lei
2019-04-28 12:04   ` Christoph Hellwig
2019-04-28  7:39 ` [PATCH V4 2/3] scsi: core: avoid to pre-allocate big chunk for protection meta data Ming Lei
2019-04-29 18:15   ` Bart Van Assche
2019-04-28  7:39 ` [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list Ming Lei
2019-04-29 18:16   ` Bart Van Assche
2019-06-03 20:44   ` Guenter Roeck
2019-06-04  1:00     ` Ming Lei
2019-06-04  3:49       ` Guenter Roeck
2019-06-04  4:10         ` Ming Lei
2019-06-04 14:51           ` Guenter Roeck
2019-06-04  6:55         ` Ming Lei
2019-05-05  1:10 ` [PATCH V4 0/3] scsi: core: avoid big pre-allocation " Ming Lei
2019-05-14  2:06 ` Martin K. Petersen

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org linux-block@archiver.kernel.org
	public-inbox-index linux-block


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/ public-inbox