linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/7] Add MMC packed request support
@ 2020-04-26  9:38 Baolin Wang
  2020-04-26  9:38 ` [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing Baolin Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Baolin Wang @ 2020-04-26  9:38 UTC (permalink / raw)
  To: axboe, ulf.hansson, adrian.hunter
  Cc: arnd, linus.walleij, paolo.valente, ming.lei, orsonzhai,
	zhang.lyra, baolin.wang7, linux-mmc, linux-block, linux-kernel

Hi All,

Now some SD/MMC controllers can support packed command or packed request,
that means it can package multiple requests to host controller to be handled
at one time, which can improve the I/O performence. Thus this patch set
tries to add the MMC packed request function to support packed request or
packed command.

In this patch set, I extanded commit_rqs() to do batch processing suggested
by Ming, to allow dispatching a batch of requests to hardware and expanded
the MMC software queue to support packed request. I also implemented the
SD host ADMA3 transfer mode to support packed request. The ADMA3 transfer
mode can process a multi-block data transfer by using a pair of command
descriptor and ADMA2 descriptor. In future we can easily expand the MMC
packed function to support packed command.

Below are some comparison data between packed request and non-packed request
with fio tool. The fio command I used is like below with changing the
'--rw' parameter and enabling the direct IO flag to measure the actual hardware
transfer speed. I tested 5 times for each case and output a average speed.

./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=512M --group_reporting --numjobs=20 --name=test_read

My eMMC card working at HS400 Enhanced strobe mode:
[    2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
[    2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
[    2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
[    2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
[    2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)

1. Non-packed request
1) Sequential read:
Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
Average speed: 60.68MiB/s

2) Random read:
Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
Average speed: 31.36MiB/s

3) Sequential write:
Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
Average speed: 71.66MiB/s

4) Random write:
Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
Average speed: 68.76MiB/s

2. Packed request
1) Sequential read:
Speed: 230MiB/s, 230MiB/s, 229MiB/s, 230MiB/s, 229MiB/s
Average speed: 229.6MiB/s

2) Random read:
Speed: 181MiB/s, 181MiB/s, 181MiB/s, 180MiB/s, 181MiB/s
Average speed: 180.8MiB/s

3) Sequential write:
Speed: 175MiB/s, 171MiB/s, 171MiB/s, 172MiB/s, 171MiB/s
Average speed: 172MiB/s

4) Random write:
Speed: 169MiB/s, 169MiB/s, 171MiB/s, 167MiB/s, 170MiB/s
Average speed: 169.2MiB/s

From above data, we can see the packed request can improve the performance
greatly. Any comments are welcome. Thanks a lot.

Changes from RFC v1:
 - Re-implement the batch processing according to Ming's suggestion
 - Remove the bd.last validation in MMC block.c, since we always get bd.last == false
 according to the new batch processing method.

Baolin Wang (6):
  mmc: Add MMC packed request support for MMC software queue
  mmc: host: sdhci: Introduce ADMA3 transfer mode
  mmc: host: sdhci: Factor out the command configuration
  mmc: host: sdhci: Remove redundant sg_count member of struct
    sdhci_host
  mmc: host: sdhci: Add MMC packed request support
  mmc: host: sdhci-sprd: Add MMC packed request support

Ming Lei (1):
  block: Extand commit_rqs() to do batch processing

 block/blk-mq-sched.c          |  29 +-
 block/blk-mq.c                |  15 +-
 drivers/mmc/core/block.c      |  14 +
 drivers/mmc/core/core.c       |  26 ++
 drivers/mmc/core/core.h       |   2 +
 drivers/mmc/core/queue.c      |  19 +-
 drivers/mmc/host/mmc_hsq.c    | 292 +++++++++++++++++---
 drivers/mmc/host/mmc_hsq.h    |  25 +-
 drivers/mmc/host/sdhci-sprd.c |  30 +-
 drivers/mmc/host/sdhci.c      | 504 +++++++++++++++++++++++++++++-----
 drivers/mmc/host/sdhci.h      |  61 +++-
 include/linux/blk-mq.h        |   1 +
 include/linux/mmc/core.h      |   6 +
 include/linux/mmc/host.h      |   9 +
 14 files changed, 900 insertions(+), 133 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-04-26  9:38 [RFC PATCH v2 0/7] Add MMC packed request support Baolin Wang
@ 2020-04-26  9:38 ` Baolin Wang
  2020-04-27 15:46   ` Christoph Hellwig
  2020-04-26  9:38 ` [RFC PATCH v2 2/7] mmc: Add MMC packed request support for MMC software queue Baolin Wang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Baolin Wang @ 2020-04-26  9:38 UTC (permalink / raw)
  To: axboe, ulf.hansson, adrian.hunter
  Cc: arnd, linus.walleij, paolo.valente, ming.lei, orsonzhai,
	zhang.lyra, baolin.wang7, linux-mmc, linux-block, linux-kernel

From: Ming Lei <ming.lei@redhat.com>

Now some SD/MMC host controllers can support packed command or packed request,
that means we can package several requests to host controller at one time
to improve performence.

But the blk-mq always takes one request from the scheduler and dispatch it to
the device, regardless of the driver or the scheduler, so there should only
ever be one request in the local list in blk_mq_dispatch_rq_list(), that means
the bd.last is always true and the driver can not use bd.last to decide if
there are requests are pending now in hardware queue to help to package
requests.

Thus this patch introduces a new 'BLK_MQ_F_FORCE_COMMIT_RQS' flag to call
.commit_rqs() to do batch processing if necessary.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
---
 block/blk-mq-sched.c   | 29 ++++++++++++++++++++---------
 block/blk-mq.c         | 15 ++++++++++-----
 include/linux/blk-mq.h |  1 +
 3 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74cedea56034..3429a71a7364 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -85,11 +85,12 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
  * its queue by itself in its completion handler, so we don't need to
  * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
  */
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);
+	bool ret = false;
 
 	do {
 		struct request *rq;
@@ -112,7 +113,10 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		 * in blk_mq_dispatch_rq_list().
 		 */
 		list_add(&rq->queuelist, &rq_list);
-	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+		ret = blk_mq_dispatch_rq_list(q, &rq_list, true);
+	} while (ret);
+
+	return ret;
 }
 
 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -131,11 +135,12 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
  * its queue by itself in its completion handler, so we don't need to
  * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
  */
-static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	LIST_HEAD(rq_list);
 	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+	bool ret = false;
 
 	do {
 		struct request *rq;
@@ -161,10 +166,11 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 
 		/* round robin for fair dispatch */
 		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
-
-	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+		ret = blk_mq_dispatch_rq_list(q, &rq_list, true);
+	} while (ret);
 
 	WRITE_ONCE(hctx->dispatch_from, ctx);
+	return ret;
 }
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -173,6 +179,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
 	LIST_HEAD(rq_list);
+	bool dispatch_ret;
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
 	if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
@@ -206,21 +213,25 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
+		dispatch_ret = blk_mq_dispatch_rq_list(q, &rq_list, false);
+		if (dispatch_ret) {
 			if (has_sched_dispatch)
 				blk_mq_do_dispatch_sched(hctx);
 			else
 				blk_mq_do_dispatch_ctx(hctx);
 		}
 	} else if (has_sched_dispatch) {
-		blk_mq_do_dispatch_sched(hctx);
+		dispatch_ret = blk_mq_do_dispatch_sched(hctx);
 	} else if (hctx->dispatch_busy) {
 		/* dequeue request one by one from sw queue if queue is busy */
-		blk_mq_do_dispatch_ctx(hctx);
+		dispatch_ret = blk_mq_do_dispatch_ctx(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(q, &rq_list, false);
+		dispatch_ret = blk_mq_dispatch_rq_list(q, &rq_list, false);
 	}
+
+	if (dispatch_ret && (hctx->flags & BLK_MQ_F_FORCE_COMMIT_RQS))
+		hctx->queue->mq_ops->commit_rqs(hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8e56884fd2e9..bde122feef01 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1253,11 +1253,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * Flag last if we have no more requests, or if we have more
 		 * but can't assign a driver tag to it.
 		 */
-		if (list_empty(list))
-			bd.last = true;
-		else {
-			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt);
+		if (!(hctx->flags & BLK_MQ_F_FORCE_COMMIT_RQS)) {
+			if (list_empty(list)) {
+				bd.last = true;
+			} else {
+				nxt = list_first_entry(list, struct request,
+						       queuelist);
+				bd.last = !blk_mq_get_driver_tag(nxt);
+			}
+		} else {
+			bd.last = false;
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f389d7c724bd..6a20f8e8eb85 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -391,6 +391,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_FORCE_COMMIT_RQS = 1 << 3,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 2/7] mmc: Add MMC packed request support for MMC software queue
  2020-04-26  9:38 [RFC PATCH v2 0/7] Add MMC packed request support Baolin Wang
  2020-04-26  9:38 ` [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing Baolin Wang
@ 2020-04-26  9:38 ` Baolin Wang
  2020-04-26  9:38 ` [RFC PATCH v2 3/7] mmc: host: sdhci: Introduce ADMA3 transfer mode Baolin Wang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2020-04-26  9:38 UTC (permalink / raw)
  To: axboe, ulf.hansson, adrian.hunter
  Cc: arnd, linus.walleij, paolo.valente, ming.lei, orsonzhai,
	zhang.lyra, baolin.wang7, linux-mmc, linux-block, linux-kernel

Some SD controllers can support packed command or packed request, that
means it can package several requests to host controller to be handled
at one time, which can reduce interrutps and improve the DMA transfer.
As a result, the I/O performence can be improved.

Thus this patch adds MMC packed function to support packed requests or
packed command based on the MMC software queue.

The basic concept of this function is that, we try to collect more
requests from block layer as much as possible to be linked into
MMC packed queue by mmc_blk_hsq_issue_rw_rq(). When the last
request of the hardware queue comes, or the collected request numbers
are larger than 16, or a larger request comes, then we can start to
pakage a packed request to host controller. The MMC packed function
also supplies packed algorithm operations to help to package qualified
requests. After finishing the packed request, the MMC packed function
will help to complete each request, at the same time, the MMC packed
queue will allow to collect more requests from block layer. After
completing each request, the MMC packed function can try to package
another packed request to host controller directly in the complete path,
if there are enough requests in MMC packed queue or the request pending
flag is not set. If the pending flag was set, we should let the
mmc_blk_hsq_issue_rw_rq() collect more request as much as possible.

Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
---
 drivers/mmc/core/block.c      |  14 ++
 drivers/mmc/core/core.c       |  26 +++
 drivers/mmc/core/core.h       |   2 +
 drivers/mmc/core/queue.c      |  19 ++-
 drivers/mmc/host/mmc_hsq.c    | 292 +++++++++++++++++++++++++++++-----
 drivers/mmc/host/mmc_hsq.h    |  25 ++-
 drivers/mmc/host/sdhci-sprd.c |   2 +-
 include/linux/mmc/core.h      |   6 +
 include/linux/mmc/host.h      |   9 ++
 9 files changed, 345 insertions(+), 50 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 8499b56a15a8..528db34c60b0 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1907,6 +1907,19 @@ static void mmc_blk_hsq_req_done(struct mmc_request *mrq)
 	if (mmc_blk_rq_error(&mqrq->brq) ||
 	    mmc_blk_urgent_bkops_needed(mq, mqrq)) {
 		spin_lock_irqsave(&mq->lock, flags);
+
+		/*
+		 * The HSQ may complete more than one requests at one time
+		 * for the packed request mode. So if there is one recovery
+		 * request is pending, the following error requests just
+		 * should be completed directly, since we should not do
+		 * recovery continuously.
+		 */
+		if (mq->recovery_needed) {
+			spin_unlock_irqrestore(&mq->lock, flags);
+			goto out;
+		}
+
 		mq->recovery_needed = true;
 		mq->recovery_req = req;
 		spin_unlock_irqrestore(&mq->lock, flags);
@@ -1919,6 +1932,7 @@ static void mmc_blk_hsq_req_done(struct mmc_request *mrq)
 
 	mmc_blk_rw_reset_success(mq, req);
 
+out:
 	/*
 	 * Block layer timeouts race with completions which means the normal
 	 * completion path cannot be used during recovery.
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 4c5de6d37ac7..85d40dbc204e 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -329,6 +329,7 @@ static int mmc_mrq_prep(struct mmc_host *host, struct mmc_request *mrq)
 		}
 	}
 
+	INIT_LIST_HEAD(&mrq->list);
 	return 0;
 }
 
@@ -519,6 +520,31 @@ void mmc_cqe_post_req(struct mmc_host *host, struct mmc_request *mrq)
 }
 EXPORT_SYMBOL(mmc_cqe_post_req);
 
+/**
+ *	mmc_cqe_commit_rqs - Commit requests pending in CQE
+ *	@host: MMC host
+ *	@last: Indicate if the last request from block layer
+ */
+void mmc_cqe_commit_rqs(struct mmc_host *host)
+{
+	if (host->cqe_ops->cqe_commit_rqs)
+		host->cqe_ops->cqe_commit_rqs(host);
+}
+EXPORT_SYMBOL(mmc_cqe_commit_rqs);
+
+/**
+ *	mmc_cqe_is_busy - If CQE is busy or not
+ *	@host: MMC host
+ */
+bool mmc_cqe_is_busy(struct mmc_host *host)
+{
+	if (host->cqe_ops->cqe_is_busy)
+		return host->cqe_ops->cqe_is_busy(host);
+
+	return false;
+}
+EXPORT_SYMBOL(mmc_cqe_is_busy);
+
 /* Arbitrary 1 second timeout */
 #define MMC_CQE_RECOVERY_TIMEOUT	1000
 
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 575ac0257af2..db81ba27bcf4 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -139,6 +139,8 @@ static inline void mmc_claim_host(struct mmc_host *host)
 int mmc_cqe_start_req(struct mmc_host *host, struct mmc_request *mrq);
 void mmc_cqe_post_req(struct mmc_host *host, struct mmc_request *mrq);
 int mmc_cqe_recovery(struct mmc_host *host);
+void mmc_cqe_commit_rqs(struct mmc_host *host);
+bool mmc_cqe_is_busy(struct mmc_host *host);
 
 /**
  *	mmc_pre_req - Prepare for a new request
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 25bee3daf9e2..8e63e3d1c8d1 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -285,11 +285,7 @@ static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		}
 		break;
 	case MMC_ISSUE_ASYNC:
-		/*
-		 * For MMC host software queue, we only allow 2 requests in
-		 * flight to avoid a long latency.
-		 */
-		if (host->hsq_enabled && mq->in_flight[issue_type] > 2) {
+		if (mq->use_cqe && mmc_cqe_is_busy(host)) {
 			spin_unlock_irq(&mq->lock);
 			return BLK_STS_RESOURCE;
 		}
@@ -362,8 +358,19 @@ static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static void mmc_mq_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct mmc_queue *mq = hctx->queue->queuedata;
+	struct mmc_card *card = mq->card;
+	struct mmc_host *host = card->host;
+
+	if (mq->use_cqe)
+		mmc_cqe_commit_rqs(host);
+}
+
 static const struct blk_mq_ops mmc_mq_ops = {
 	.queue_rq	= mmc_mq_queue_rq,
+	.commit_rqs	= mmc_mq_commit_rqs,
 	.init_request	= mmc_mq_init_request,
 	.exit_request	= mmc_mq_exit_request,
 	.complete	= mmc_blk_mq_complete,
@@ -451,6 +458,8 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card)
 		mq->tag_set.queue_depth = MMC_QUEUE_DEPTH;
 	mq->tag_set.numa_node = NUMA_NO_NODE;
 	mq->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;
+	if (host->max_packed_reqs > 0)
+		mq->tag_set.flags |= BLK_MQ_F_FORCE_COMMIT_RQS;
 	mq->tag_set.nr_hw_queues = 1;
 	mq->tag_set.cmd_size = sizeof(struct mmc_queue_req);
 	mq->tag_set.driver_data = mq;
diff --git a/drivers/mmc/host/mmc_hsq.c b/drivers/mmc/host/mmc_hsq.c
index a5e05ed0fda3..d1e871bf4fc2 100644
--- a/drivers/mmc/host/mmc_hsq.c
+++ b/drivers/mmc/host/mmc_hsq.c
@@ -9,6 +9,7 @@
 
 #include <linux/mmc/card.h>
 #include <linux/mmc/host.h>
+#include <linux/mmc/mmc.h>
 #include <linux/module.h>
 
 #include "mmc_hsq.h"
@@ -16,6 +17,33 @@
 #define HSQ_NUM_SLOTS	64
 #define HSQ_INVALID_TAG	HSQ_NUM_SLOTS
 
+#define HSQ_PACKED_FLUSH_BLOCKS	256
+
+/**
+ * mmc_hsq_packed_algo_rw - the algorithm to package read or write requests
+ * @mmc: the host controller
+ *
+ * TODO: we can add more condition to decide if we can package this
+ * request or not.
+ */
+void mmc_hsq_packed_algo_rw(struct mmc_host *mmc)
+{
+	struct mmc_hsq *hsq = mmc->cqe_private;
+	struct hsq_packed *packed = hsq->packed;
+	struct mmc_packed_request *prq = &packed->prq;
+	struct mmc_request *mrq, *t;
+	u32 i = 0;
+
+	list_for_each_entry_safe(mrq, t, &packed->list, list) {
+		if (++i > packed->max_entries)
+			break;
+
+		list_move_tail(&mrq->list, &prq->list);
+		prq->nr_reqs++;
+	}
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_packed_algo_rw);
+
 static void mmc_hsq_retry_handler(struct work_struct *work)
 {
 	struct mmc_hsq *hsq = container_of(work, struct mmc_hsq, retry_work);
@@ -26,15 +54,17 @@ static void mmc_hsq_retry_handler(struct work_struct *work)
 
 static void mmc_hsq_pump_requests(struct mmc_hsq *hsq)
 {
+	struct hsq_packed *packed = hsq->packed;
 	struct mmc_host *mmc = hsq->mmc;
 	struct hsq_slot *slot;
+	struct mmc_request *mrq;
 	unsigned long flags;
 	int ret = 0;
 
 	spin_lock_irqsave(&hsq->lock, flags);
 
 	/* Make sure we are not already running a request now */
-	if (hsq->mrq) {
+	if (hsq->mrq || (packed && packed->prq.nr_reqs)) {
 		spin_unlock_irqrestore(&hsq->lock, flags);
 		return;
 	}
@@ -45,30 +75,72 @@ static void mmc_hsq_pump_requests(struct mmc_hsq *hsq)
 		return;
 	}
 
-	slot = &hsq->slot[hsq->next_tag];
-	hsq->mrq = slot->mrq;
-	hsq->qcnt--;
+	if (packed) {
+		/* Try to package requests */
+		packed->ops->packed_algo(mmc);
+
+		packed->busy = true;
+		hsq->qcnt -= packed->prq.nr_reqs;
+	} else {
+		slot = &hsq->slot[hsq->next_tag];
+		hsq->mrq = slot->mrq;
+		hsq->qcnt--;
+	}
 
 	spin_unlock_irqrestore(&hsq->lock, flags);
 
-	if (mmc->ops->request_atomic)
-		ret = mmc->ops->request_atomic(mmc, hsq->mrq);
-	else
-		mmc->ops->request(mmc, hsq->mrq);
+	if (!packed) {
+		if (mmc->ops->request_atomic)
+			ret = mmc->ops->request_atomic(mmc, hsq->mrq);
+		else
+			mmc->ops->request(mmc, hsq->mrq);
+
+		/*
+		 * If returning BUSY from request_atomic(), which means the card
+		 * may be busy now, and we should change to non-atomic context to
+		 * try again for this unusual case, to avoid time-consuming operations
+		 * in the atomic context.
+		 *
+		 * Note: we just give a warning for other error cases, since the host
+		 * driver will handle them.
+		 */
+		if (ret == -EBUSY)
+			schedule_work(&hsq->retry_work);
+		else
+			WARN_ON_ONCE(ret);
 
-	/*
-	 * If returning BUSY from request_atomic(), which means the card
-	 * may be busy now, and we should change to non-atomic context to
-	 * try again for this unusual case, to avoid time-consuming operations
-	 * in the atomic context.
-	 *
-	 * Note: we just give a warning for other error cases, since the host
-	 * driver will handle them.
-	 */
-	if (ret == -EBUSY)
-		schedule_work(&hsq->retry_work);
-	else
-		WARN_ON_ONCE(ret);
+		return;
+	}
+
+	if (packed->ops->prepare_hardware) {
+		ret = packed->ops->prepare_hardware(mmc);
+		if (ret) {
+			pr_err("failed to prepare hardware\n");
+			goto error;
+		}
+	}
+
+	ret = packed->ops->packed_request(mmc, &packed->prq);
+	if (ret) {
+		pr_err("failed to packed requests\n");
+		goto error;
+	}
+
+	return;
+
+error:
+	spin_lock_irqsave(&hsq->lock, flags);
+
+	list_for_each_entry(mrq, &packed->prq.list, list) {
+		struct mmc_data *data = mrq->data;
+
+		data->error = ret;
+		data->bytes_xfered = 0;
+	}
+
+	spin_unlock_irqrestore(&hsq->lock, flags);
+
+	mmc_hsq_finalize_packed_request(mmc, &packed->prq);
 }
 
 static void mmc_hsq_update_next_tag(struct mmc_hsq *hsq, int remains)
@@ -110,16 +182,21 @@ static void mmc_hsq_update_next_tag(struct mmc_hsq *hsq, int remains)
 
 static void mmc_hsq_post_request(struct mmc_hsq *hsq)
 {
+	struct hsq_packed *packed = hsq->packed;
 	unsigned long flags;
 	int remains;
 
 	spin_lock_irqsave(&hsq->lock, flags);
 
 	remains = hsq->qcnt;
-	hsq->mrq = NULL;
+	if (packed) {
+		packed->prq.nr_reqs = 0;
+	} else {
+		hsq->mrq = NULL;
 
-	/* Update the next available tag to be queued. */
-	mmc_hsq_update_next_tag(hsq, remains);
+		/* Update the next available tag to be queued. */
+		mmc_hsq_update_next_tag(hsq, remains);
+	}
 
 	if (hsq->waiting_for_idle && !remains) {
 		hsq->waiting_for_idle = false;
@@ -177,6 +254,91 @@ bool mmc_hsq_finalize_request(struct mmc_host *mmc, struct mmc_request *mrq)
 }
 EXPORT_SYMBOL_GPL(mmc_hsq_finalize_request);
 
+/**
+ * mmc_hsq_finalize_packed_request - finalize one packed request
+ * @mmc: the host controller
+ * @prq: the packed request need to be finalized
+ */
+void mmc_hsq_finalize_packed_request(struct mmc_host *mmc,
+				     struct mmc_packed_request *prq)
+{
+	struct mmc_hsq *hsq = mmc->cqe_private;
+	struct hsq_packed *packed = hsq->packed;
+	struct mmc_request *mrq, *t;
+	LIST_HEAD(head);
+	unsigned long flags;
+
+	if (!packed || !prq)
+		return;
+
+	if (packed->ops->unprepare_hardware &&
+	    packed->ops->unprepare_hardware(mmc))
+		pr_err("failed to unprepare hardware\n");
+
+	/*
+	 * Clear busy flag to allow collecting more requests into command
+	 * queue, but now we can not pump them to controller, we should wait
+	 * for all requests are completed. During the period of completing
+	 * requests, we should collect more requests from block layer as much
+	 * as possible.
+	 */
+	spin_lock_irqsave(&hsq->lock, flags);
+	list_splice_tail_init(&prq->list, &head);
+	packed->busy = false;
+	spin_unlock_irqrestore(&hsq->lock, flags);
+
+	list_for_each_entry_safe(mrq, t, &head, list) {
+		list_del(&mrq->list);
+
+		mmc_cqe_request_done(mmc, mrq);
+	}
+
+	mmc_hsq_post_request(hsq);
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_finalize_packed_request);
+
+static void mmc_hsq_commit_rqs(struct mmc_host *mmc)
+{
+	struct mmc_hsq *hsq = mmc->cqe_private;
+	struct hsq_packed *packed = hsq->packed;
+
+	if (!packed)
+		return;
+
+	mmc_hsq_pump_requests(hsq);
+}
+
+static bool mmc_hsq_is_busy(struct mmc_host *mmc)
+{
+	struct mmc_hsq *hsq = mmc->cqe_private;
+	struct hsq_packed *packed = hsq->packed;
+	unsigned long flags;
+	bool busy;
+
+	spin_lock_irqsave(&hsq->lock, flags);
+
+	/*
+	 * For packed mode, when hardware is busy, we can only allow maximum
+	 * packed number requests to be ready in software queue to be queued
+	 * after previous packed request is completed, which avoiding long
+	 * latency.
+	 *
+	 * For non-packed mode, we can only allow 2 requests in flight to avoid
+	 * long latency.
+	 *
+	 * Otherwise return BLK_STS_RESOURCE to tell block layer to dispatch
+	 * requests later.
+	 */
+	if (packed)
+		busy = packed->busy && hsq->qcnt >= packed->max_entries;
+	else
+		busy = hsq->qcnt > 1;
+
+	spin_unlock_irqrestore(&hsq->lock, flags);
+
+	return busy;
+}
+
 static void mmc_hsq_recovery_start(struct mmc_host *mmc)
 {
 	struct mmc_hsq *hsq = mmc->cqe_private;
@@ -212,7 +374,8 @@ static void mmc_hsq_recovery_finish(struct mmc_host *mmc)
 static int mmc_hsq_request(struct mmc_host *mmc, struct mmc_request *mrq)
 {
 	struct mmc_hsq *hsq = mmc->cqe_private;
-	int tag = mrq->tag;
+	struct hsq_packed *packed = hsq->packed;
+	int nr_rqs = 0, tag = mrq->tag;
 
 	spin_lock_irq(&hsq->lock);
 
@@ -227,20 +390,37 @@ static int mmc_hsq_request(struct mmc_host *mmc, struct mmc_request *mrq)
 		return -EBUSY;
 	}
 
-	hsq->slot[tag].mrq = mrq;
+	hsq->qcnt++;
 
-	/*
-	 * Set the next tag as current request tag if no available
-	 * next tag.
-	 */
-	if (hsq->next_tag == HSQ_INVALID_TAG)
-		hsq->next_tag = tag;
+	if (packed) {
+		list_add_tail(&mrq->list, &packed->list);
 
-	hsq->qcnt++;
+		nr_rqs = hsq->qcnt;
+	} else {
+		hsq->slot[tag].mrq = mrq;
+
+		/*
+		 * Set the next tag as current request tag if no available
+		 * next tag.
+		 */
+		if (hsq->next_tag == HSQ_INVALID_TAG)
+			hsq->next_tag = tag;
+	}
 
 	spin_unlock_irq(&hsq->lock);
 
-	mmc_hsq_pump_requests(hsq);
+	/*
+	 * For non-packed request mode, we should pump requests as soon as
+	 * possible.
+	 *
+	 * For the packed request mode, if it is a larger request or the
+	 * request count is larger than the maximum packed number, we
+	 * should pump requests to controller. Otherwise we should try to
+	 * combine requests as much as we can.
+	 */
+	if (!packed || mrq->data->blocks > HSQ_PACKED_FLUSH_BLOCKS ||
+	    nr_rqs >= packed->max_entries)
+		mmc_hsq_pump_requests(hsq);
 
 	return 0;
 }
@@ -253,12 +433,17 @@ static void mmc_hsq_post_req(struct mmc_host *mmc, struct mmc_request *mrq)
 
 static bool mmc_hsq_queue_is_idle(struct mmc_hsq *hsq, int *ret)
 {
+	struct hsq_packed *packed = hsq->packed;
 	bool is_idle;
 
 	spin_lock_irq(&hsq->lock);
 
-	is_idle = (!hsq->mrq && !hsq->qcnt) ||
-		hsq->recovery_halt;
+	if (packed)
+		is_idle = (!packed->prq.nr_reqs && !hsq->qcnt) ||
+			hsq->recovery_halt;
+	else
+		is_idle = (!hsq->mrq && !hsq->qcnt) ||
+			hsq->recovery_halt;
 
 	*ret = hsq->recovery_halt ? -EBUSY : 0;
 	hsq->waiting_for_idle = !is_idle;
@@ -335,17 +520,38 @@ static const struct mmc_cqe_ops mmc_hsq_ops = {
 	.cqe_wait_for_idle = mmc_hsq_wait_for_idle,
 	.cqe_recovery_start = mmc_hsq_recovery_start,
 	.cqe_recovery_finish = mmc_hsq_recovery_finish,
+	.cqe_is_busy = mmc_hsq_is_busy,
+	.cqe_commit_rqs = mmc_hsq_commit_rqs,
 };
 
-int mmc_hsq_init(struct mmc_hsq *hsq, struct mmc_host *mmc)
+int mmc_hsq_init(struct mmc_hsq *hsq, struct mmc_host *mmc,
+		 const struct hsq_packed_ops *ops, int max_packed)
 {
-	hsq->num_slots = HSQ_NUM_SLOTS;
-	hsq->next_tag = HSQ_INVALID_TAG;
+	if (ops && max_packed > 1) {
+		struct hsq_packed *packed;
+
+		packed = devm_kzalloc(mmc_dev(mmc), sizeof(struct hsq_packed),
+				      GFP_KERNEL);
+		if (!packed)
+			return -ENOMEM;
+
+		packed->ops = ops;
+		packed->max_entries = max_packed;
+		INIT_LIST_HEAD(&packed->list);
+		INIT_LIST_HEAD(&packed->prq.list);
+
+		hsq->packed = packed;
+		mmc->max_packed_reqs = max_packed;
+	} else {
+		hsq->num_slots = HSQ_NUM_SLOTS;
+		hsq->next_tag = HSQ_INVALID_TAG;
 
-	hsq->slot = devm_kcalloc(mmc_dev(mmc), hsq->num_slots,
-				 sizeof(struct hsq_slot), GFP_KERNEL);
-	if (!hsq->slot)
-		return -ENOMEM;
+		hsq->slot = devm_kcalloc(mmc_dev(mmc), hsq->num_slots,
+					 sizeof(struct hsq_slot),
+					 GFP_KERNEL);
+		if (!hsq->slot)
+			return -ENOMEM;
+	}
 
 	hsq->mmc = mmc;
 	hsq->mmc->cqe_private = hsq;
diff --git a/drivers/mmc/host/mmc_hsq.h b/drivers/mmc/host/mmc_hsq.h
index ffdd9cd172c3..8b416ec796a9 100644
--- a/drivers/mmc/host/mmc_hsq.h
+++ b/drivers/mmc/host/mmc_hsq.h
@@ -2,6 +2,23 @@
 #ifndef LINUX_MMC_HSQ_H
 #define LINUX_MMC_HSQ_H
 
+struct hsq_packed_ops {
+	void (*packed_algo)(struct mmc_host *mmc);
+	int (*prepare_hardware)(struct mmc_host *mmc);
+	int (*unprepare_hardware)(struct mmc_host *mmc);
+	int (*packed_request)(struct mmc_host *mmc,
+			      struct mmc_packed_request *prq);
+};
+
+struct hsq_packed {
+	bool busy;
+	int max_entries;
+
+	struct list_head list;
+	struct mmc_packed_request prq;
+	const struct hsq_packed_ops *ops;
+};
+
 struct hsq_slot {
 	struct mmc_request *mrq;
 };
@@ -21,11 +38,17 @@ struct mmc_hsq {
 	bool enabled;
 	bool waiting_for_idle;
 	bool recovery_halt;
+
+	struct hsq_packed *packed;
 };
 
-int mmc_hsq_init(struct mmc_hsq *hsq, struct mmc_host *mmc);
+int mmc_hsq_init(struct mmc_hsq *hsq, struct mmc_host *mmc,
+		 const struct hsq_packed_ops *ops, int max_packed);
 void mmc_hsq_suspend(struct mmc_host *mmc);
 int mmc_hsq_resume(struct mmc_host *mmc);
 bool mmc_hsq_finalize_request(struct mmc_host *mmc, struct mmc_request *mrq);
+void mmc_hsq_finalize_packed_request(struct mmc_host *mmc,
+				     struct mmc_packed_request *prq);
+void mmc_hsq_packed_algo_rw(struct mmc_host *mmc);
 
 #endif
diff --git a/drivers/mmc/host/sdhci-sprd.c b/drivers/mmc/host/sdhci-sprd.c
index bc7a8cb84862..ad0981e19571 100644
--- a/drivers/mmc/host/sdhci-sprd.c
+++ b/drivers/mmc/host/sdhci-sprd.c
@@ -669,7 +669,7 @@ static int sdhci_sprd_probe(struct platform_device *pdev)
 		goto err_cleanup_host;
 	}
 
-	ret = mmc_hsq_init(hsq, host->mmc);
+	ret = mmc_hsq_init(hsq, host->mmc, NULL, 0);
 	if (ret)
 		goto err_cleanup_host;
 
diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
index 29aa50711626..4267e90905f2 100644
--- a/include/linux/mmc/core.h
+++ b/include/linux/mmc/core.h
@@ -162,6 +162,12 @@ struct mmc_request {
 	bool			cap_cmd_during_tfr;
 
 	int			tag;
+	struct list_head	list;
+};
+
+struct mmc_packed_request {
+	struct list_head list;
+	u32 nr_reqs;
 };
 
 struct mmc_card;
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index d4a50e5dc111..bb8bcc4b0687 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -216,6 +216,14 @@ struct mmc_cqe_ops {
 	 * will have zero data bytes transferred.
 	 */
 	void	(*cqe_recovery_finish)(struct mmc_host *host);
+
+	/* If CQE is busy or not. */
+	bool	(*cqe_is_busy)(struct mmc_host *host);
+	/*
+	 * Serve the purpose of kicking the hardware to handle pending
+	 * requests.
+	 */
+	void	(*cqe_commit_rqs)(struct mmc_host *host);
 };
 
 struct mmc_async_req {
@@ -388,6 +396,7 @@ struct mmc_host {
 	unsigned int		max_blk_size;	/* maximum size of one mmc block */
 	unsigned int		max_blk_count;	/* maximum number of blocks in one req */
 	unsigned int		max_busy_timeout; /* max busy timeout in ms */
+	unsigned int		max_packed_reqs;  /* max number of requests can be packed */
 
 	/* private data */
 	spinlock_t		lock;		/* lock for claim and bus ops */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 3/7] mmc: host: sdhci: Introduce ADMA3 transfer mode
  2020-04-26  9:38 [RFC PATCH v2 0/7] Add MMC packed request support Baolin Wang
  2020-04-26  9:38 ` [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing Baolin Wang
  2020-04-26  9:38 ` [RFC PATCH v2 2/7] mmc: Add MMC packed request support for MMC software queue Baolin Wang
@ 2020-04-26  9:38 ` Baolin Wang
  2020-04-26  9:38 ` [RFC PATCH v2 4/7] mmc: host: sdhci: Factor out the command configuration Baolin Wang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2020-04-26  9:38 UTC (permalink / raw)
  To: axboe, ulf.hansson, adrian.hunter
  Cc: arnd, linus.walleij, paolo.valente, ming.lei, orsonzhai,
	zhang.lyra, baolin.wang7, linux-mmc, linux-block, linux-kernel

The standard SD host controller can support ADMA3 transfer mode optionally.
The ADMA3 uses command descriptor to issue an SD command, and a multi-block
data transfer is programmed by using a pair of command descriptor and ADMA2
descriptor. ADMA3 performs multiple of multi-block data transfer by using
integrated descriptor.

This is a preparation patch to add ADMA3 structures and help to expand the
ADMA buffer for support ADMA3 transfer mode.

Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
---
 drivers/mmc/host/sdhci.c | 106 ++++++++++++++++++++++++++++++++-------
 drivers/mmc/host/sdhci.h |  48 ++++++++++++++++++
 2 files changed, 137 insertions(+), 17 deletions(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 0baef595de26..afea8393c71f 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -773,7 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
 		 * If this triggers then we have a calculation bug
 		 * somewhere. :/
 		 */
-		WARN_ON((desc - host->adma_table) >= host->adma_table_sz);
+		WARN_ON((desc - host->adma_table) >=
+			host->adma_table_sz * host->adma3_table_cnt);
 	}
 
 	if (host->quirks & SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC) {
@@ -4148,10 +4149,17 @@ int sdhci_setup_host(struct sdhci_host *host)
 		(host->caps & SDHCI_CAN_DO_ADMA2))
 		host->flags |= SDHCI_USE_ADMA;
 
+	if ((host->quirks2 & SDHCI_QUIRK2_USE_ADMA3_SUPPORT) &&
+	    (host->flags & SDHCI_USE_ADMA) &&
+	    (host->caps1 & SDHCI_CAN_DO_ADMA3)) {
+		DBG("Enable ADMA3 mode for data transfer\n");
+		host->flags |= SDHCI_USE_ADMA3;
+	}
+
 	if ((host->quirks & SDHCI_QUIRK_BROKEN_ADMA) &&
 		(host->flags & SDHCI_USE_ADMA)) {
 		DBG("Disabling ADMA as it is marked broken\n");
-		host->flags &= ~SDHCI_USE_ADMA;
+		host->flags &= ~(SDHCI_USE_ADMA | SDHCI_USE_ADMA3);
 	}
 
 	if (sdhci_can_64bit_dma(host))
@@ -4184,7 +4192,7 @@ int sdhci_setup_host(struct sdhci_host *host)
 		if (ret) {
 			pr_warn("%s: No suitable DMA available - falling back to PIO\n",
 				mmc_hostname(mmc));
-			host->flags &= ~(SDHCI_USE_SDMA | SDHCI_USE_ADMA);
+			host->flags &= ~(SDHCI_USE_SDMA | SDHCI_USE_ADMA | SDHCI_USE_ADMA3);
 
 			ret = 0;
 		}
@@ -4206,31 +4214,68 @@ int sdhci_setup_host(struct sdhci_host *host)
 		host->desc_sz = host->alloc_desc_sz;
 		host->adma_table_sz = host->adma_table_cnt * host->desc_sz;
 
+		host->adma3_table_cnt = 1;
+
+		if (host->flags & SDHCI_USE_ADMA3) {
+			/* We can package maximum 16 requests once */
+			host->adma3_table_cnt = SDHCI_MAX_ADMA3_ENTRIES;
+
+			if (host->flags & SDHCI_USE_64_BIT_DMA)
+				host->integr_desc_sz = SDHCI_INTEGR_64_DESC_SZ;
+			else
+				host->integr_desc_sz = SDHCI_INTEGR_32_DESC_SZ;
+
+			host->cmd_desc_sz = SDHCI_ADMA3_CMD_DESC_SZ;
+			host->cmd_table_sz = host->adma3_table_cnt *
+				SDHCI_ADMA3_CMD_DESC_SZ * SDHCI_ADMA3_CMD_DESC_ENTRIES;
+
+			buf = dma_alloc_coherent(mmc_dev(mmc),
+						 host->adma3_table_cnt *
+						 host->integr_desc_sz,
+						 &dma, GFP_KERNEL);
+			if (!buf) {
+				pr_warn("%s: Unable to allocate ADMA3 integrated buffers - falling back to ADMA\n",
+					mmc_hostname(mmc));
+				host->flags &= ~SDHCI_USE_ADMA3;
+				host->adma3_table_cnt = 1;
+			} else {
+				host->integr_table = buf;
+				host->integr_addr = dma;
+			}
+		}
+
 		host->align_buffer_sz = SDHCI_MAX_SEGS * SDHCI_ADMA2_ALIGN;
 		/*
 		 * Use zalloc to zero the reserved high 32-bits of 128-bit
 		 * descriptors so that they never need to be written.
 		 */
 		buf = dma_alloc_coherent(mmc_dev(mmc),
-					 host->align_buffer_sz + host->adma_table_sz,
+					 host->align_buffer_sz *
+					 host->adma3_table_cnt +
+					 host->cmd_table_sz +
+					 host->adma_table_sz *
+					 host->adma3_table_cnt,
 					 &dma, GFP_KERNEL);
 		if (!buf) {
 			pr_warn("%s: Unable to allocate ADMA buffers - falling back to standard DMA\n",
 				mmc_hostname(mmc));
-			host->flags &= ~SDHCI_USE_ADMA;
-		} else if ((dma + host->align_buffer_sz) &
+			host->flags &= ~(SDHCI_USE_ADMA | SDHCI_USE_ADMA3);
+		} else if ((dma + host->align_buffer_sz * host->adma3_table_cnt) &
 			   (SDHCI_ADMA2_DESC_ALIGN - 1)) {
 			pr_warn("%s: unable to allocate aligned ADMA descriptor\n",
 				mmc_hostname(mmc));
-			host->flags &= ~SDHCI_USE_ADMA;
-			dma_free_coherent(mmc_dev(mmc), host->align_buffer_sz +
-					  host->adma_table_sz, buf, dma);
+			host->flags &= ~(SDHCI_USE_ADMA | SDHCI_USE_ADMA3);
+			dma_free_coherent(mmc_dev(mmc), host->align_buffer_sz *
+					  host->adma3_table_cnt +
+					  host->cmd_table_sz +
+					  host->adma_table_sz *
+					  host->adma3_table_cnt, buf, dma);
 		} else {
 			host->align_buffer = buf;
 			host->align_addr = dma;
 
-			host->adma_table = buf + host->align_buffer_sz;
-			host->adma_addr = dma + host->align_buffer_sz;
+			host->adma_table = buf + host->align_buffer_sz * host->adma3_table_cnt;
+			host->adma_addr = dma + host->align_buffer_sz * host->adma3_table_cnt;
 		}
 	}
 
@@ -4631,12 +4676,21 @@ int sdhci_setup_host(struct sdhci_host *host)
 		regulator_disable(mmc->supply.vqmmc);
 undma:
 	if (host->align_buffer)
-		dma_free_coherent(mmc_dev(mmc), host->align_buffer_sz +
-				  host->adma_table_sz, host->align_buffer,
+		dma_free_coherent(mmc_dev(mmc),
+				  host->align_buffer_sz * host->adma3_table_cnt +
+				  host->cmd_table_sz +
+				  host->adma_table_sz * host->adma3_table_cnt,
+				  host->align_buffer,
 				  host->align_addr);
 	host->adma_table = NULL;
 	host->align_buffer = NULL;
 
+	if (host->integr_table)
+		dma_free_coherent(mmc_dev(mmc),
+				  host->adma3_table_cnt * host->integr_desc_sz,
+				  host->integr_table, host->integr_addr);
+	host->integr_table = NULL;
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(sdhci_setup_host);
@@ -4649,8 +4703,11 @@ void sdhci_cleanup_host(struct sdhci_host *host)
 		regulator_disable(mmc->supply.vqmmc);
 
 	if (host->align_buffer)
-		dma_free_coherent(mmc_dev(mmc), host->align_buffer_sz +
-				  host->adma_table_sz, host->align_buffer,
+		dma_free_coherent(mmc_dev(mmc),
+				  host->align_buffer_sz * host->adma3_table_cnt +
+				  host->cmd_table_sz +
+				  host->adma_table_sz * host->adma3_table_cnt,
+				  host->align_buffer,
 				  host->align_addr);
 
 	if (host->use_external_dma)
@@ -4658,6 +4715,12 @@ void sdhci_cleanup_host(struct sdhci_host *host)
 
 	host->adma_table = NULL;
 	host->align_buffer = NULL;
+
+	if (host->integr_table)
+		dma_free_coherent(mmc_dev(mmc),
+				  host->adma3_table_cnt * host->integr_desc_sz,
+				  host->integr_table, host->integr_addr);
+	host->integr_table = NULL;
 }
 EXPORT_SYMBOL_GPL(sdhci_cleanup_host);
 
@@ -4786,8 +4849,11 @@ void sdhci_remove_host(struct sdhci_host *host, int dead)
 		regulator_disable(mmc->supply.vqmmc);
 
 	if (host->align_buffer)
-		dma_free_coherent(mmc_dev(mmc), host->align_buffer_sz +
-				  host->adma_table_sz, host->align_buffer,
+		dma_free_coherent(mmc_dev(mmc),
+				  host->align_buffer_sz * host->adma3_table_cnt +
+				  host->cmd_table_sz +
+				  host->adma_table_sz * host->adma3_table_cnt,
+				  host->align_buffer,
 				  host->align_addr);
 
 	if (host->use_external_dma)
@@ -4795,6 +4861,12 @@ void sdhci_remove_host(struct sdhci_host *host, int dead)
 
 	host->adma_table = NULL;
 	host->align_buffer = NULL;
+
+	if (host->integr_table)
+		dma_free_coherent(mmc_dev(mmc),
+				  host->adma3_table_cnt * host->integr_desc_sz,
+				  host->integr_table, host->integr_addr);
+	host->integr_table = NULL;
 }
 
 EXPORT_SYMBOL_GPL(sdhci_remove_host);
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index 4bd70da7aa00..5cade9a7b53a 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -272,6 +272,9 @@
 #define SDHCI_PRESET_CLKGEN_SEL		BIT(10)
 #define SDHCI_PRESET_SDCLK_FREQ_MASK	GENMASK(9, 0)
 
+#define SDHCI_ADMA3_ADDRESS	0x78
+#define SDHCI_ADMA3_ADDRESS_HI	0x7c
+
 #define SDHCI_SLOT_INT_STATUS	0xFC
 
 #define SDHCI_HOST_VERSION	0xFE
@@ -344,6 +347,41 @@ struct sdhci_adma2_64_desc {
 #define ADMA2_NOP_END_VALID	0x3
 #define ADMA2_END		0x2
 
+#define SDHCI_MAX_ADMA3_ENTRIES	16
+
+/* ADMA3 command descriptor */
+struct sdhci_adma3_cmd_desc {
+	__le32	cmd;
+	__le32	reg;
+}  __packed __aligned(4);
+
+#define ADMA3_TRAN_VALID	0x9
+#define ADMA3_TRAN_END		0xb
+
+/* ADMA3 command descriptor size */
+#define SDHCI_ADMA3_CMD_DESC_ENTRIES	4
+#define SDHCI_ADMA3_CMD_DESC_SZ		8
+
+/* ADMA3 integrated 32-bit descriptor */
+struct sdhci_integr_32_desc {
+	__le32	cmd;
+	__le32	addr;
+}  __packed __aligned(4);
+
+#define SDHCI_INTEGR_32_DESC_SZ		8
+
+/* ADMA3 integrated 64-bit descriptor. */
+struct sdhci_integr_64_desc {
+	__le32	cmd;
+	__le32	addr_lo;
+	__le32	addr_hi;
+}  __packed __aligned(4);
+
+#define SDHCI_INTEGR_64_DESC_SZ		16
+
+#define ADMA3_INTEGR_TRAN_VALID		0x39
+#define ADMA3_INTEGR_TRAN_END		0x3b
+
 /*
  * Maximum segments assuming a 512KiB maximum requisition size and a minimum
  * 4KiB page size.
@@ -482,6 +520,8 @@ struct sdhci_host {
  * block count.
  */
 #define SDHCI_QUIRK2_USE_32BIT_BLK_CNT			(1<<18)
+/* use ADMA3 for data read/write if hardware supports */
+#define SDHCI_QUIRK2_USE_ADMA3_SUPPORT			(1<<19)
 
 	int irq;		/* Device IRQ */
 	void __iomem *ioaddr;	/* Mapped address */
@@ -518,6 +558,7 @@ struct sdhci_host {
 #define SDHCI_SIGNALING_330	(1<<14)	/* Host is capable of 3.3V signaling */
 #define SDHCI_SIGNALING_180	(1<<15)	/* Host is capable of 1.8V signaling */
 #define SDHCI_SIGNALING_120	(1<<16)	/* Host is capable of 1.2V signaling */
+#define SDHCI_USE_ADMA3		(1<<17)	/* Host is ADMA3 capable */
 
 	unsigned int version;	/* SDHCI spec. version */
 
@@ -551,15 +592,20 @@ struct sdhci_host {
 
 	void *adma_table;	/* ADMA descriptor table */
 	void *align_buffer;	/* Bounce buffer */
+	void *integr_table;	/* ADMA3 intergrate descriptor table */
 
 	size_t adma_table_sz;	/* ADMA descriptor table size */
 	size_t align_buffer_sz;	/* Bounce buffer size */
+	size_t cmd_table_sz;	/* ADMA3 command descriptor table size */
 
 	dma_addr_t adma_addr;	/* Mapped ADMA descr. table */
 	dma_addr_t align_addr;	/* Mapped bounce buffer */
+	dma_addr_t integr_addr;	/* Mapped ADMA3 intergrate descr. table */
 
 	unsigned int desc_sz;	/* ADMA current descriptor size */
 	unsigned int alloc_desc_sz;	/* ADMA descr. max size host supports */
+	unsigned int cmd_desc_sz;	/* ADMA3 command descriptor size */
+	unsigned int integr_desc_sz;	/* ADMA3 intergrate descriptor size */
 
 	struct workqueue_struct *complete_wq;	/* Request completion wq */
 	struct work_struct	complete_work;	/* Request completion work */
@@ -610,6 +656,8 @@ struct sdhci_host {
 
 	/* Host ADMA table count */
 	u32			adma_table_cnt;
+	/* Host ADMA3 table count */
+	u32			adma3_table_cnt;
 
 	u64			data_timeout;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 4/7] mmc: host: sdhci: Factor out the command configuration
  2020-04-26  9:38 [RFC PATCH v2 0/7] Add MMC packed request support Baolin Wang
                   ` (2 preceding siblings ...)
  2020-04-26  9:38 ` [RFC PATCH v2 3/7] mmc: host: sdhci: Introduce ADMA3 transfer mode Baolin Wang
@ 2020-04-26  9:38 ` Baolin Wang
  2020-04-26  9:38 ` [RFC PATCH v2 5/7] mmc: host: sdhci: Remove redundant sg_count member of struct sdhci_host Baolin Wang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2020-04-26  9:38 UTC (permalink / raw)
  To: axboe, ulf.hansson, adrian.hunter
  Cc: arnd, linus.walleij, paolo.valente, ming.lei, orsonzhai,
	zhang.lyra, baolin.wang7, linux-mmc, linux-block, linux-kernel

Move the SD command configuration into one separate function to simplify
the sdhci_send_command(). Moreover this function can be used to support
ADMA3 transfer mode in following patches.

Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
---
 drivers/mmc/host/sdhci.c | 69 +++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index afea8393c71f..3d4cdf0e0535 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -1579,9 +1579,45 @@ static void sdhci_finish_data(struct sdhci_host *host)
 	__sdhci_finish_data(host, false);
 }
 
-static bool sdhci_send_command(struct sdhci_host *host, struct mmc_command *cmd)
+static int sdhci_get_command(struct sdhci_host *host, struct mmc_command *cmd)
 {
 	int flags;
+
+	if ((cmd->flags & MMC_RSP_136) && (cmd->flags & MMC_RSP_BUSY)) {
+		WARN_ONCE(1, "Unsupported response type!\n");
+		/*
+		 * This does not happen in practice because 136-bit response
+		 * commands never have busy waiting, so rather than complicate
+		 * the error path, just remove busy waiting and continue.
+		 */
+		cmd->flags &= ~MMC_RSP_BUSY;
+	}
+
+	if (!(cmd->flags & MMC_RSP_PRESENT))
+		flags = SDHCI_CMD_RESP_NONE;
+	else if (cmd->flags & MMC_RSP_136)
+		flags = SDHCI_CMD_RESP_LONG;
+	else if (cmd->flags & MMC_RSP_BUSY)
+		flags = SDHCI_CMD_RESP_SHORT_BUSY;
+	else
+		flags = SDHCI_CMD_RESP_SHORT;
+
+	if (cmd->flags & MMC_RSP_CRC)
+		flags |= SDHCI_CMD_CRC;
+	if (cmd->flags & MMC_RSP_OPCODE)
+		flags |= SDHCI_CMD_INDEX;
+
+	/* CMD19 is special in that the Data Present Select should be set */
+	if (cmd->data || cmd->opcode == MMC_SEND_TUNING_BLOCK ||
+	    cmd->opcode == MMC_SEND_TUNING_BLOCK_HS200)
+		flags |= SDHCI_CMD_DATA;
+
+	return SDHCI_MAKE_CMD(cmd->opcode, flags);
+}
+
+static bool sdhci_send_command(struct sdhci_host *host, struct mmc_command *cmd)
+{
+	int command;
 	u32 mask;
 	unsigned long timeout;
 
@@ -1625,34 +1661,7 @@ static bool sdhci_send_command(struct sdhci_host *host, struct mmc_command *cmd)
 
 	sdhci_set_transfer_mode(host, cmd);
 
-	if ((cmd->flags & MMC_RSP_136) && (cmd->flags & MMC_RSP_BUSY)) {
-		WARN_ONCE(1, "Unsupported response type!\n");
-		/*
-		 * This does not happen in practice because 136-bit response
-		 * commands never have busy waiting, so rather than complicate
-		 * the error path, just remove busy waiting and continue.
-		 */
-		cmd->flags &= ~MMC_RSP_BUSY;
-	}
-
-	if (!(cmd->flags & MMC_RSP_PRESENT))
-		flags = SDHCI_CMD_RESP_NONE;
-	else if (cmd->flags & MMC_RSP_136)
-		flags = SDHCI_CMD_RESP_LONG;
-	else if (cmd->flags & MMC_RSP_BUSY)
-		flags = SDHCI_CMD_RESP_SHORT_BUSY;
-	else
-		flags = SDHCI_CMD_RESP_SHORT;
-
-	if (cmd->flags & MMC_RSP_CRC)
-		flags |= SDHCI_CMD_CRC;
-	if (cmd->flags & MMC_RSP_OPCODE)
-		flags |= SDHCI_CMD_INDEX;
-
-	/* CMD19 is special in that the Data Present Select should be set */
-	if (cmd->data || cmd->opcode == MMC_SEND_TUNING_BLOCK ||
-	    cmd->opcode == MMC_SEND_TUNING_BLOCK_HS200)
-		flags |= SDHCI_CMD_DATA;
+	command = sdhci_get_command(host, cmd);
 
 	timeout = jiffies;
 	if (host->data_timeout)
@@ -1666,7 +1675,7 @@ static bool sdhci_send_command(struct sdhci_host *host, struct mmc_command *cmd)
 	if (host->use_external_dma)
 		sdhci_external_dma_pre_transfer(host, cmd);
 
-	sdhci_writew(host, SDHCI_MAKE_CMD(cmd->opcode, flags), SDHCI_COMMAND);
+	sdhci_writew(host, command, SDHCI_COMMAND);
 
 	return true;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 5/7] mmc: host: sdhci: Remove redundant sg_count member of struct sdhci_host
  2020-04-26  9:38 [RFC PATCH v2 0/7] Add MMC packed request support Baolin Wang
                   ` (3 preceding siblings ...)
  2020-04-26  9:38 ` [RFC PATCH v2 4/7] mmc: host: sdhci: Factor out the command configuration Baolin Wang
@ 2020-04-26  9:38 ` Baolin Wang
  2020-04-26  9:38 ` [RFC PATCH v2 6/7] mmc: host: sdhci: Add MMC packed request support Baolin Wang
  2020-04-26  9:39 ` [RFC PATCH v2 7/7] mmc: host: sdhci-sprd: " Baolin Wang
  6 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2020-04-26  9:38 UTC (permalink / raw)
  To: axboe, ulf.hansson, adrian.hunter
  Cc: arnd, linus.walleij, paolo.valente, ming.lei, orsonzhai,
	zhang.lyra, baolin.wang7, linux-mmc, linux-block, linux-kernel

The mmc_data structure has a member to save the mapped sg count, so
no need introduce a redundant sg_count of struct sdhci_host, remove it.
This is also a preparation patch to support ADMA3 transfer mode.

Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
---
 drivers/mmc/host/sdhci.c | 12 +++++-------
 drivers/mmc/host/sdhci.h |  2 --
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 3d4cdf0e0535..ad7e2442f120 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -709,7 +709,7 @@ static void sdhci_adma_mark_end(void *desc)
 }
 
 static void sdhci_adma_table_pre(struct sdhci_host *host,
-	struct mmc_data *data, int sg_count)
+	struct mmc_data *data)
 {
 	struct scatterlist *sg;
 	unsigned long flags;
@@ -723,14 +723,12 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
 	 * We currently guess that it is LE.
 	 */
 
-	host->sg_count = sg_count;
-
 	desc = host->adma_table;
 	align = host->align_buffer;
 
 	align_addr = host->align_addr;
 
-	for_each_sg(data->sg, sg, host->sg_count, i) {
+	for_each_sg(data->sg, sg, data->sg_count, i) {
 		addr = sg_dma_address(sg);
 		len = sg_dma_len(sg);
 
@@ -802,7 +800,7 @@ static void sdhci_adma_table_post(struct sdhci_host *host,
 		bool has_unaligned = false;
 
 		/* Do a quick scan of the SG list for any unaligned mappings */
-		for_each_sg(data->sg, sg, host->sg_count, i)
+		for_each_sg(data->sg, sg, data->sg_count, i)
 			if (sg_dma_address(sg) & SDHCI_ADMA2_MASK) {
 				has_unaligned = true;
 				break;
@@ -814,7 +812,7 @@ static void sdhci_adma_table_post(struct sdhci_host *host,
 
 			align = host->align_buffer;
 
-			for_each_sg(data->sg, sg, host->sg_count, i) {
+			for_each_sg(data->sg, sg, data->sg_count, i) {
 				if (sg_dma_address(sg) & SDHCI_ADMA2_MASK) {
 					size = SDHCI_ADMA2_ALIGN -
 					       (sg_dma_address(sg) & SDHCI_ADMA2_MASK);
@@ -1134,7 +1132,7 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd)
 			WARN_ON(1);
 			host->flags &= ~SDHCI_REQ_USE_DMA;
 		} else if (host->flags & SDHCI_USE_ADMA) {
-			sdhci_adma_table_pre(host, data, sg_cnt);
+			sdhci_adma_table_pre(host, data);
 			sdhci_set_adma_addr(host, host->adma_addr);
 		} else {
 			WARN_ON(sg_cnt != 1);
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index 5cade9a7b53a..51207072d1ec 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -588,8 +588,6 @@ struct sdhci_host {
 	struct sg_mapping_iter sg_miter;	/* SG state for PIO */
 	unsigned int blocks;	/* remaining PIO blocks */
 
-	int sg_count;		/* Mapped sg entries */
-
 	void *adma_table;	/* ADMA descriptor table */
 	void *align_buffer;	/* Bounce buffer */
 	void *integr_table;	/* ADMA3 intergrate descriptor table */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 6/7] mmc: host: sdhci: Add MMC packed request support
  2020-04-26  9:38 [RFC PATCH v2 0/7] Add MMC packed request support Baolin Wang
                   ` (4 preceding siblings ...)
  2020-04-26  9:38 ` [RFC PATCH v2 5/7] mmc: host: sdhci: Remove redundant sg_count member of struct sdhci_host Baolin Wang
@ 2020-04-26  9:38 ` Baolin Wang
  2020-04-26  9:39 ` [RFC PATCH v2 7/7] mmc: host: sdhci-sprd: " Baolin Wang
  6 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2020-04-26  9:38 UTC (permalink / raw)
  To: axboe, ulf.hansson, adrian.hunter
  Cc: arnd, linus.walleij, paolo.valente, ming.lei, orsonzhai,
	zhang.lyra, baolin.wang7, linux-mmc, linux-block, linux-kernel

This patch adds MMC packed operations to support packed requests,
and enables ADMA3 transfer mode to support this feature.

Enable ADMA3 transfer mode only for read and write commands, and
we will disable command interrupt and data timeout interrupt,
instead we will use software data timeout for ADMA3 fransfer mode.
For other non-data commands, we still use the ADMA2 transfer, since
no bebefits using ADMA3 transfer.

Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
---
 drivers/mmc/host/sdhci.c | 317 +++++++++++++++++++++++++++++++++++++--
 drivers/mmc/host/sdhci.h |  11 ++
 2 files changed, 316 insertions(+), 12 deletions(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index ad7e2442f120..22935bbb0538 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -111,6 +111,19 @@ void sdhci_dumpregs(struct sdhci_host *host)
 		}
 	}
 
+	if (host->adma3_enabled) {
+		if (host->flags & SDHCI_USE_64_BIT_DMA) {
+			SDHCI_DUMP("ADMA3 Err:  0x%08x | ADMA3 Ptr: 0x%08x%08x\n",
+				   sdhci_readl(host, SDHCI_ADMA_ERROR),
+				   sdhci_readl(host, SDHCI_ADMA3_ADDRESS_HI),
+				   sdhci_readl(host, SDHCI_ADMA3_ADDRESS));
+		} else {
+			SDHCI_DUMP("ADMA3 Err:  0x%08x | ADMA3 Ptr: 0x%08x\n",
+				   sdhci_readl(host, SDHCI_ADMA_ERROR),
+				   sdhci_readl(host, SDHCI_ADMA3_ADDRESS));
+		}
+	}
+
 	SDHCI_DUMP("============================================\n");
 }
 EXPORT_SYMBOL_GPL(sdhci_dumpregs);
@@ -288,7 +301,9 @@ static void sdhci_config_dma(struct sdhci_host *host)
 		goto out;
 
 	/* Note if DMA Select is zero then SDMA is selected */
-	if (host->flags & SDHCI_USE_ADMA)
+	if (host->adma3_enabled)
+		ctrl |= SDHCI_CTRL_ADMA3;
+	else if (host->flags & SDHCI_USE_ADMA)
 		ctrl |= SDHCI_CTRL_ADMA32;
 
 	if (host->flags & SDHCI_USE_64_BIT_DMA) {
@@ -458,7 +473,7 @@ static inline void sdhci_led_deactivate(struct sdhci_host *host)
 static void sdhci_mod_timer(struct sdhci_host *host, struct mmc_request *mrq,
 			    unsigned long timeout)
 {
-	if (sdhci_data_line_cmd(mrq->cmd))
+	if (host->prq || sdhci_data_line_cmd(mrq->cmd))
 		mod_timer(&host->data_timer, timeout);
 	else
 		mod_timer(&host->timer, timeout);
@@ -466,7 +481,7 @@ static void sdhci_mod_timer(struct sdhci_host *host, struct mmc_request *mrq,
 
 static void sdhci_del_timer(struct sdhci_host *host, struct mmc_request *mrq)
 {
-	if (sdhci_data_line_cmd(mrq->cmd))
+	if (host->prq || sdhci_data_line_cmd(mrq->cmd))
 		del_timer(&host->data_timer);
 	else
 		del_timer(&host->timer);
@@ -723,10 +738,16 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
 	 * We currently guess that it is LE.
 	 */
 
-	desc = host->adma_table;
-	align = host->align_buffer;
-
-	align_addr = host->align_addr;
+	if (host->adma3_enabled) {
+		desc = host->adma3_pos;
+		align = host->adma3_align_pos;
+		align_addr = host->align_addr +
+			host->adma3_align_pos - host->align_buffer;
+	} else {
+		desc = host->adma_table;
+		align = host->align_buffer;
+		align_addr = host->align_addr;
+	}
 
 	for_each_sg(data->sg, sg, data->sg_count, i) {
 		addr = sg_dma_address(sg);
@@ -785,6 +806,11 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
 		/* Add a terminating entry - nop, end, valid */
 		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
 	}
+
+	if (host->adma3_enabled) {
+		host->adma3_pos = desc;
+		host->adma3_align_pos = align;
+	}
 }
 
 static void sdhci_adma_table_post(struct sdhci_host *host,
@@ -810,7 +836,10 @@ static void sdhci_adma_table_post(struct sdhci_host *host,
 			dma_sync_sg_for_cpu(mmc_dev(host->mmc), data->sg,
 					    data->sg_len, DMA_FROM_DEVICE);
 
-			align = host->align_buffer;
+			if (host->adma3_enabled)
+				align = host->adma3_align_pos;
+			else
+				align = host->align_buffer;
 
 			for_each_sg(data->sg, sg, data->sg_count, i) {
 				if (sg_dma_address(sg) & SDHCI_ADMA2_MASK) {
@@ -824,6 +853,9 @@ static void sdhci_adma_table_post(struct sdhci_host *host,
 					align += SDHCI_ADMA2_ALIGN;
 				}
 			}
+
+			if (host->adma3_enabled)
+				host->adma3_align_pos = align;
 		}
 	}
 }
@@ -1032,7 +1064,7 @@ static void sdhci_set_timeout(struct sdhci_host *host, struct mmc_command *cmd)
 static void sdhci_initialize_data(struct sdhci_host *host,
 				  struct mmc_data *data)
 {
-	WARN_ON(host->data);
+	WARN_ON(!host->prq && host->data);
 
 	/* Sanity checks */
 	BUG_ON(data->blksz * data->blocks > 524288);
@@ -1133,7 +1165,8 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd)
 			host->flags &= ~SDHCI_REQ_USE_DMA;
 		} else if (host->flags & SDHCI_USE_ADMA) {
 			sdhci_adma_table_pre(host, data);
-			sdhci_set_adma_addr(host, host->adma_addr);
+			if (!host->adma3_enabled)
+				sdhci_set_adma_addr(host, host->adma_addr);
 		} else {
 			WARN_ON(sg_cnt != 1);
 			sdhci_set_sdma_addr(host, sdhci_sdma_address(host));
@@ -1156,6 +1189,9 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd)
 
 	sdhci_set_transfer_irqs(host);
 
+	if (host->adma3_enabled)
+		return;
+
 	sdhci_set_block_info(host, data);
 }
 
@@ -1501,6 +1537,36 @@ static void sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
 	queue_work(host->complete_wq, &host->complete_work);
 }
 
+static void sdhci_finish_packed_data(struct sdhci_host *host, int error)
+{
+	struct mmc_request *mrq;
+
+	host->data = NULL;
+	/*
+	 * Reset the align buffer pointer address for unaligned mappings after
+	 * finishing the transfer.
+	 */
+	host->adma3_align_pos = host->align_buffer;
+
+	if (error)
+		sdhci_do_reset(host, SDHCI_RESET_CMD | SDHCI_RESET_DATA);
+
+	list_for_each_entry(mrq, &host->prq->list, list) {
+		struct mmc_data *data = mrq->data;
+
+		sdhci_adma_table_post(host, data);
+		data->error = error;
+
+		if (data->error)
+			data->bytes_xfered = 0;
+		else
+			data->bytes_xfered = data->blksz * data->blocks;
+	}
+
+	sdhci_del_timer(host, NULL);
+	sdhci_led_deactivate(host);
+}
+
 static void __sdhci_finish_data(struct sdhci_host *host, bool sw_data_timeout)
 {
 	struct mmc_command *data_cmd = host->data_cmd;
@@ -1642,7 +1708,7 @@ static bool sdhci_send_command(struct sdhci_host *host, struct mmc_command *cmd)
 
 	host->cmd = cmd;
 	host->data_timeout = 0;
-	if (sdhci_data_line_cmd(cmd)) {
+	if (!host->prq && sdhci_data_line_cmd(cmd)) {
 		WARN_ON(host->data_cmd);
 		host->data_cmd = cmd;
 		sdhci_set_timeout(host, cmd);
@@ -2120,6 +2186,206 @@ EXPORT_SYMBOL_GPL(sdhci_set_power_and_bus_voltage);
  *                                                                           *
 \*****************************************************************************/
 
+static void sdhci_adma3_write_cmd_desc(struct sdhci_host *host,
+				       struct mmc_command *cmd)
+{
+	struct mmc_data *data = cmd->data;
+	struct sdhci_adma3_cmd_desc *cmd_desc = host->adma3_pos;
+	int blksz, command;
+	u16 mode = 0;
+
+	/* Set block count */
+	cmd_desc->cmd = cpu_to_le32(ADMA3_TRAN_VALID);
+	cmd_desc->reg = cpu_to_le32(data->blocks);
+	cmd_desc++;
+
+	/* Set block size */
+	cmd_desc->cmd = cpu_to_le32(ADMA3_TRAN_VALID);
+	blksz = SDHCI_MAKE_BLKSZ(host->sdma_boundary, data->blksz);
+	cmd_desc->reg = cpu_to_le32(blksz);
+	cmd_desc++;
+
+	/* Set argument */
+	cmd_desc->cmd = cpu_to_le32(ADMA3_TRAN_VALID);
+	cmd_desc->reg = cpu_to_le32(cmd->arg);
+	cmd_desc++;
+
+	/* set command and transfer mode */
+	if (data->flags & MMC_DATA_READ)
+		mode |= SDHCI_TRNS_READ;
+
+	if (!(host->quirks2 & SDHCI_QUIRK2_SUPPORT_SINGLE))
+		mode |= SDHCI_TRNS_BLK_CNT_EN;
+
+	if (mmc_op_multi(cmd->opcode) || data->blocks > 1)
+		mode |= SDHCI_TRNS_MULTI;
+
+	sdhci_auto_cmd_select(host, cmd, &mode);
+	mode |= SDHCI_TRNS_DMA;
+
+	command = sdhci_get_command(host, cmd);
+	command = (command << 16) | mode;
+	cmd_desc->cmd = cpu_to_le32(ADMA3_TRAN_END);
+	cmd_desc->reg = cpu_to_le32(command);
+
+	host->adma3_pos +=
+		SDHCI_ADMA3_CMD_DESC_SZ * SDHCI_ADMA3_CMD_DESC_ENTRIES;
+}
+
+static void sdhci_adma3_write_integr_desc(struct sdhci_host *host,
+					  dma_addr_t addr)
+{
+	struct sdhci_integr_64_desc *integr_desc = host->integr_table;
+
+	integr_desc->cmd = cpu_to_le32(ADMA3_INTEGR_TRAN_END);
+	integr_desc->addr_lo = cpu_to_le32((u32)addr);
+
+	if (host->flags & SDHCI_USE_64_BIT_DMA)
+		integr_desc->addr_hi = cpu_to_le32((u64)addr >> 32);
+}
+
+static void sdhci_set_adma3_addr(struct sdhci_host *host, dma_addr_t addr)
+{
+	sdhci_writel(host, addr, SDHCI_ADMA3_ADDRESS);
+	if (host->flags & SDHCI_USE_64_BIT_DMA)
+		sdhci_writel(host, (u64)addr >> 32, SDHCI_ADMA3_ADDRESS_HI);
+}
+
+int sdhci_prepare_packed(struct mmc_host *mmc)
+{
+	struct sdhci_host *host = mmc_priv(mmc);
+	unsigned long timeout, flags;
+	u32 mask;
+
+	spin_lock_irqsave(&host->lock, flags);
+
+	if (!(host->flags & SDHCI_USE_ADMA3) ||
+	    !(host->flags & (SDHCI_AUTO_CMD23 | SDHCI_AUTO_CMD12))) {
+		spin_unlock_irqrestore(&host->lock, flags);
+		pr_err("%s: Unsupported packed request\n",
+		       mmc_hostname(host->mmc));
+		return -EOPNOTSUPP;
+	}
+
+	/* Wait max 10 ms */
+	timeout = 10;
+	mask = SDHCI_CMD_INHIBIT | SDHCI_DATA_INHIBIT;
+
+	while (sdhci_readl(host, SDHCI_PRESENT_STATE) & mask) {
+		if (timeout == 0) {
+			sdhci_dumpregs(host);
+			spin_unlock_irqrestore(&host->lock, flags);
+
+			pr_err("%s: Controller never released inhibit bit(s).\n",
+			       mmc_hostname(host->mmc));
+			return -EIO;
+		}
+
+		timeout--;
+		mdelay(1);
+	}
+
+	/* Disable command complete event for ADMA3 mode */
+	host->ier &= ~SDHCI_INT_RESPONSE;
+	sdhci_writel(host, host->ier, SDHCI_INT_ENABLE);
+	sdhci_writel(host, host->ier, SDHCI_SIGNAL_ENABLE);
+
+	/*
+	 * Disable data timeout interrupt, and will use software timeout for
+	 * packed request.
+	 */
+	sdhci_set_data_timeout_irq(host, false);
+
+	/* Enable ADMA3 mode for packed request */
+	host->adma3_enabled = true;
+
+	spin_unlock_irqrestore(&host->lock, flags);
+
+	return 0;
+}
+
+int sdhci_unprepare_packed(struct mmc_host *mmc)
+{
+	struct sdhci_host *host = mmc_priv(mmc);
+	unsigned long flags;
+
+	spin_lock_irqsave(&host->lock, flags);
+
+	/* Disable ADMA3 mode after finishing packed request */
+	host->adma3_enabled = false;
+
+	/* Re-enable command complete event after ADMA3 mode */
+	host->ier |= SDHCI_INT_RESPONSE;
+
+	sdhci_writel(host, host->ier, SDHCI_INT_ENABLE);
+	sdhci_writel(host, host->ier, SDHCI_SIGNAL_ENABLE);
+	spin_unlock_irqrestore(&host->lock, flags);
+
+	return 0;
+}
+
+int sdhci_packed_request(struct mmc_host *mmc,
+			 struct mmc_packed_request *prq)
+{
+	struct sdhci_host *host = mmc_priv(mmc);
+	struct mmc_request *mrq;
+	unsigned long timeout, flags;
+	u64 data_timeout = 0;
+	dma_addr_t integr_addr;
+	int present;
+
+	/* Firstly check card presence */
+	present = mmc->ops->get_cd(mmc);
+
+	spin_lock_irqsave(&host->lock, flags);
+
+	sdhci_led_activate(host);
+
+	if (!present || host->flags & SDHCI_DEVICE_DEAD) {
+		spin_unlock_irqrestore(&host->lock, flags);
+		return -ENOMEDIUM;
+	}
+
+	host->prq = prq;
+	host->adma3_pos = host->adma_table;
+	host->adma3_align_pos = host->align_buffer;
+	integr_addr = host->adma_addr;
+
+	list_for_each_entry(mrq, &prq->list, list) {
+		struct mmc_command *cmd = mrq->cmd;
+
+		/* Set command descriptor */
+		sdhci_adma3_write_cmd_desc(host, cmd);
+		/* Set ADMA2 descriptors */
+		sdhci_prepare_data(host, cmd);
+		/* Set integrated descriptor */
+		sdhci_adma3_write_integr_desc(host, integr_addr);
+
+		/* Update the integrated descriptor address */
+		integr_addr =
+			host->adma_addr + (host->adma3_pos - host->adma_table);
+
+		/* Calculate each command's data timeout */
+		sdhci_calc_sw_timeout(host, cmd);
+		data_timeout += host->data_timeout;
+	}
+
+	timeout = jiffies;
+	if (data_timeout)
+		timeout += nsecs_to_jiffies(data_timeout);
+	else
+		timeout += 10 * HZ * prq->nr_reqs;
+	sdhci_mod_timer(host, NULL, timeout);
+
+	/* Start ADMA3 transfer */
+	sdhci_set_adma3_addr(host, host->integr_addr);
+
+	spin_unlock_irqrestore(&host->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(sdhci_packed_request);
+
 void sdhci_request(struct mmc_host *mmc, struct mmc_request *mrq)
 {
 	struct sdhci_host *host = mmc_priv(mmc);
@@ -2978,9 +3244,19 @@ static bool sdhci_request_done(struct sdhci_host *host)
 {
 	unsigned long flags;
 	struct mmc_request *mrq;
+	struct mmc_packed_request *prq;
 	int i;
 
 	spin_lock_irqsave(&host->lock, flags);
+	prq = host->prq;
+
+	if (prq) {
+		host->prq = NULL;
+		spin_unlock_irqrestore(&host->lock, flags);
+
+		host->ops->packed_request_done(host, prq);
+		return true;
+	}
 
 	for (i = 0; i < SDHCI_MAX_MRQS; i++) {
 		mrq = host->mrqs_done[i];
@@ -3136,6 +3412,17 @@ static void sdhci_timeout_data_timer(struct timer_list *t)
 
 	spin_lock_irqsave(&host->lock, flags);
 
+	if (host->prq) {
+		pr_err("%s: Packed requests timeout for hardware interrupt.\n",
+		       mmc_hostname(host->mmc));
+		sdhci_dumpregs(host);
+		sdhci_finish_packed_data(host, -ETIMEDOUT);
+		queue_work(host->complete_wq, &host->complete_work);
+		spin_unlock_irqrestore(&host->lock, flags);
+
+		return;
+	}
+
 	if (host->data || host->data_cmd ||
 	    (host->cmd && sdhci_data_line_cmd(host->cmd))) {
 		pr_err("%s: Timeout waiting for hardware interrupt.\n",
@@ -3343,7 +3630,9 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
 			host->ops->adma_workaround(host, intmask);
 	}
 
-	if (host->data->error)
+	if (host->prq)
+		sdhci_finish_packed_data(host, host->data->error);
+	else if (host->data->error)
 		sdhci_finish_data(host);
 	else {
 		if (intmask & (SDHCI_INT_DATA_AVAIL | SDHCI_INT_SPACE_AVAIL))
@@ -3515,6 +3804,10 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
 			host->mrqs_done[i] = NULL;
 		}
 	}
+
+	if (host->prq)
+		result = IRQ_WAKE_THREAD;
+
 out:
 	if (host->deferred_cmd)
 		result = IRQ_WAKE_THREAD;
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index 51207072d1ec..012958611790 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -577,6 +577,7 @@ struct sdhci_host {
 	bool v4_mode;		/* Host Version 4 Enable */
 	bool use_external_dma;	/* Host selects to use external DMA */
 	bool always_defer_done;	/* Always defer to complete requests */
+	bool adma3_enabled;	/* ADMA3 mode enabled */
 
 	struct mmc_request *mrqs_done[SDHCI_MAX_MRQS];	/* Requests done */
 	struct mmc_command *cmd;	/* Current command */
@@ -585,12 +586,15 @@ struct sdhci_host {
 	struct mmc_data *data;	/* Current data request */
 	unsigned int data_early:1;	/* Data finished before cmd */
 
+	struct mmc_packed_request *prq;	/* Current packed request */
 	struct sg_mapping_iter sg_miter;	/* SG state for PIO */
 	unsigned int blocks;	/* remaining PIO blocks */
 
 	void *adma_table;	/* ADMA descriptor table */
 	void *align_buffer;	/* Bounce buffer */
 	void *integr_table;	/* ADMA3 intergrate descriptor table */
+	void *adma3_pos;	/* ADMA3 buffer position */
+	void *adma3_align_pos;	/* ADMA3 Bounce buffer position */
 
 	size_t adma_table_sz;	/* ADMA descriptor table size */
 	size_t align_buffer_sz;	/* Bounce buffer size */
@@ -702,6 +706,8 @@ struct sdhci_ops {
 				   dma_addr_t addr, int len, unsigned int cmd);
 	void	(*request_done)(struct sdhci_host *host,
 				struct mmc_request *mrq);
+	void	(*packed_request_done)(struct sdhci_host *host,
+				       struct mmc_packed_request *prq);
 };
 
 #ifdef CONFIG_MMC_SDHCI_IO_ACCESSORS
@@ -858,4 +864,9 @@ void sdhci_switch_external_dma(struct sdhci_host *host, bool en);
 void sdhci_set_data_timeout_irq(struct sdhci_host *host, bool enable);
 void __sdhci_set_timeout(struct sdhci_host *host, struct mmc_command *cmd);
 
+int sdhci_prepare_packed(struct mmc_host *mmc);
+int sdhci_unprepare_packed(struct mmc_host *mmc);
+int sdhci_packed_request(struct mmc_host *mmc,
+			 struct mmc_packed_request *prq);
+
 #endif /* __SDHCI_HW_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 7/7] mmc: host: sdhci-sprd: Add MMC packed request support
  2020-04-26  9:38 [RFC PATCH v2 0/7] Add MMC packed request support Baolin Wang
                   ` (5 preceding siblings ...)
  2020-04-26  9:38 ` [RFC PATCH v2 6/7] mmc: host: sdhci: Add MMC packed request support Baolin Wang
@ 2020-04-26  9:39 ` Baolin Wang
  6 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2020-04-26  9:39 UTC (permalink / raw)
  To: axboe, ulf.hansson, adrian.hunter
  Cc: arnd, linus.walleij, paolo.valente, ming.lei, orsonzhai,
	zhang.lyra, baolin.wang7, linux-mmc, linux-block, linux-kernel

Enable the ADMA3 transfer mode as well as adding packed operations
to support MMC packed requests to improve IO performance.

Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
---
 drivers/mmc/host/sdhci-sprd.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/host/sdhci-sprd.c b/drivers/mmc/host/sdhci-sprd.c
index ad0981e19571..3409fa64ba37 100644
--- a/drivers/mmc/host/sdhci-sprd.c
+++ b/drivers/mmc/host/sdhci-sprd.c
@@ -390,6 +390,12 @@ static void sdhci_sprd_request_done(struct sdhci_host *host,
 	 mmc_request_done(host->mmc, mrq);
 }
 
+static void sdhci_sprd_packed_request_done(struct sdhci_host *host,
+					   struct mmc_packed_request *prq)
+{
+	mmc_hsq_finalize_packed_request(host->mmc, prq);
+}
+
 static struct sdhci_ops sdhci_sprd_ops = {
 	.read_l = sdhci_sprd_readl,
 	.write_l = sdhci_sprd_writel,
@@ -404,6 +410,7 @@ static struct sdhci_ops sdhci_sprd_ops = {
 	.get_max_timeout_count = sdhci_sprd_get_max_timeout_count,
 	.get_ro = sdhci_sprd_get_ro,
 	.request_done = sdhci_sprd_request_done,
+	.packed_request_done = sdhci_sprd_packed_request_done,
 };
 
 static void sdhci_sprd_check_auto_cmd23(struct mmc_host *mmc,
@@ -539,10 +546,18 @@ static const struct sdhci_pltfm_data sdhci_sprd_pdata = {
 		  SDHCI_QUIRK_MISSING_CAPS,
 	.quirks2 = SDHCI_QUIRK2_BROKEN_HS200 |
 		   SDHCI_QUIRK2_USE_32BIT_BLK_CNT |
-		   SDHCI_QUIRK2_PRESET_VALUE_BROKEN,
+		   SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
+		   SDHCI_QUIRK2_USE_ADMA3_SUPPORT,
 	.ops = &sdhci_sprd_ops,
 };
 
+static const struct hsq_packed_ops packed_ops = {
+	.packed_algo = mmc_hsq_packed_algo_rw,
+	.prepare_hardware = sdhci_prepare_packed,
+	.unprepare_hardware = sdhci_unprepare_packed,
+	.packed_request = sdhci_packed_request,
+};
+
 static int sdhci_sprd_probe(struct platform_device *pdev)
 {
 	struct sdhci_host *host;
@@ -669,7 +684,18 @@ static int sdhci_sprd_probe(struct platform_device *pdev)
 		goto err_cleanup_host;
 	}
 
-	ret = mmc_hsq_init(hsq, host->mmc, NULL, 0);
+	/*
+	 * If the host controller can support ADMA3 mode, we can enable the
+	 * packed request mode to improve the read/write performance.
+	 *
+	 * Considering the maximum ADMA3 entries (default is 16) and the request
+	 * latency, we set the default maximum packed requests number is 8.
+	 */
+	if (host->flags & SDHCI_USE_ADMA3)
+		ret = mmc_hsq_init(hsq, host->mmc, &packed_ops,
+				   SDHCI_MAX_ADMA3_ENTRIES / 2);
+	else
+		ret = mmc_hsq_init(hsq, host->mmc, NULL, 0);
 	if (ret)
 		goto err_cleanup_host;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-04-26  9:38 ` [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing Baolin Wang
@ 2020-04-27 15:46   ` Christoph Hellwig
  2020-04-28  8:02     ` Baolin Wang
  2020-05-08 21:35     ` Sagi Grimberg
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-27 15:46 UTC (permalink / raw)
  To: Baolin Wang
  Cc: axboe, ulf.hansson, adrian.hunter, arnd, linus.walleij,
	paolo.valente, ming.lei, orsonzhai, zhang.lyra, linux-mmc,
	linux-block, linux-kernel

extand in the subject really shpuld be 'extend'

On Sun, Apr 26, 2020 at 05:38:54PM +0800, Baolin Wang wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Now some SD/MMC host controllers can support packed command or packed request,
> that means we can package several requests to host controller at one time
> to improve performence.
> 
> But the blk-mq always takes one request from the scheduler and dispatch it to
> the device, regardless of the driver or the scheduler, so there should only
> ever be one request in the local list in blk_mq_dispatch_rq_list(), that means
> the bd.last is always true and the driver can not use bd.last to decide if
> there are requests are pending now in hardware queue to help to package
> requests.
> 
> Thus this patch introduces a new 'BLK_MQ_F_FORCE_COMMIT_RQS' flag to call
> .commit_rqs() to do batch processing if necessary.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> Tested-by: Baolin Wang <baolin.wang7@gmail.com>
> Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
> ---
>  block/blk-mq-sched.c   | 29 ++++++++++++++++++++---------
>  block/blk-mq.c         | 15 ++++++++++-----
>  include/linux/blk-mq.h |  1 +
>  3 files changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 74cedea56034..3429a71a7364 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -85,11 +85,12 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>   * its queue by itself in its completion handler, so we don't need to
>   * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
>   */
> -static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)

This function already returns an int in the current for-5.8/block tree.

> +		if (!(hctx->flags & BLK_MQ_F_FORCE_COMMIT_RQS)) {
> +			if (list_empty(list)) {
> +				bd.last = true;
> +			} else {
> +				nxt = list_first_entry(list, struct request,
> +						       queuelist);
> +				bd.last = !blk_mq_get_driver_tag(nxt);
> +			}
> +		} else {
> +			bd.last = false;
>  		}

This seems a little odd in terms of code flow.  Why not:

		if (hctx->flags & BLK_MQ_F_FORCE_COMMIT_RQS) {
			bd.last = false;
		} else if (list_empty(list)) {
			bd.last = true;
		} else {
			nxt = list_first_entry(list, struct request, queuelist);
			bd.last = !blk_mq_get_driver_tag(nxt);
		}

> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index f389d7c724bd..6a20f8e8eb85 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -391,6 +391,7 @@ struct blk_mq_ops {
>  enum {
>  	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>  	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> +	BLK_MQ_F_FORCE_COMMIT_RQS = 1 << 3,

Maybe BLK_MQ_F_ALWAYS_COMMIT might be a better name?  Also this
flag (just like the existing ones..) could really use a comment
explaining it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-04-27 15:46   ` Christoph Hellwig
@ 2020-04-28  8:02     ` Baolin Wang
  2020-05-08 21:35     ` Sagi Grimberg
  1 sibling, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2020-04-28  8:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, Ulf Hansson, Adrian Hunter, Arnd Bergmann, Linus Walleij,
	Paolo Valente, Ming Lei, Orson Zhai, Chunyan Zhang, linux-mmc,
	linux-block, LKML

On Mon, Apr 27, 2020 at 11:46 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> extand in the subject really shpuld be 'extend'

Sorry for typo, and will fix in next version.

>
> On Sun, Apr 26, 2020 at 05:38:54PM +0800, Baolin Wang wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> >
> > Now some SD/MMC host controllers can support packed command or packed request,
> > that means we can package several requests to host controller at one time
> > to improve performence.
> >
> > But the blk-mq always takes one request from the scheduler and dispatch it to
> > the device, regardless of the driver or the scheduler, so there should only
> > ever be one request in the local list in blk_mq_dispatch_rq_list(), that means
> > the bd.last is always true and the driver can not use bd.last to decide if
> > there are requests are pending now in hardware queue to help to package
> > requests.
> >
> > Thus this patch introduces a new 'BLK_MQ_F_FORCE_COMMIT_RQS' flag to call
> > .commit_rqs() to do batch processing if necessary.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > Tested-by: Baolin Wang <baolin.wang7@gmail.com>
> > Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
> > ---
> >  block/blk-mq-sched.c   | 29 ++++++++++++++++++++---------
> >  block/blk-mq.c         | 15 ++++++++++-----
> >  include/linux/blk-mq.h |  1 +
> >  3 files changed, 31 insertions(+), 14 deletions(-)
> >
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 74cedea56034..3429a71a7364 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -85,11 +85,12 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> >   * its queue by itself in its completion handler, so we don't need to
> >   * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> >   */
> > -static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > +static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>
> This function already returns an int in the current for-5.8/block tree.

Thanks for pointing this out, and seems I should re-modify the return
values of the functions.

>
> > +             if (!(hctx->flags & BLK_MQ_F_FORCE_COMMIT_RQS)) {
> > +                     if (list_empty(list)) {
> > +                             bd.last = true;
> > +                     } else {
> > +                             nxt = list_first_entry(list, struct request,
> > +                                                    queuelist);
> > +                             bd.last = !blk_mq_get_driver_tag(nxt);
> > +                     }
> > +             } else {
> > +                     bd.last = false;
> >               }
>
> This seems a little odd in terms of code flow.  Why not:
>
>                 if (hctx->flags & BLK_MQ_F_FORCE_COMMIT_RQS) {
>                         bd.last = false;
>                 } else if (list_empty(list)) {
>                         bd.last = true;
>                 } else {
>                         nxt = list_first_entry(list, struct request, queuelist);
>                         bd.last = !blk_mq_get_driver_tag(nxt);
>                 }

Yes, looks better.

> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index f389d7c724bd..6a20f8e8eb85 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -391,6 +391,7 @@ struct blk_mq_ops {
> >  enum {
> >       BLK_MQ_F_SHOULD_MERGE   = 1 << 0,
> >       BLK_MQ_F_TAG_SHARED     = 1 << 1,
> > +     BLK_MQ_F_FORCE_COMMIT_RQS = 1 << 3,
>
> Maybe BLK_MQ_F_ALWAYS_COMMIT might be a better name?  Also this

Looks reasonable to me, and will do.

> flag (just like the existing ones..) could really use a comment
> explaining it.

OK, will add some comments. Thanks for your comments.

-- 
Baolin Wang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-04-27 15:46   ` Christoph Hellwig
  2020-04-28  8:02     ` Baolin Wang
@ 2020-05-08 21:35     ` Sagi Grimberg
  2020-05-08 21:46       ` Ming Lei
  1 sibling, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2020-05-08 21:35 UTC (permalink / raw)
  To: Christoph Hellwig, Baolin Wang
  Cc: axboe, ulf.hansson, adrian.hunter, arnd, linus.walleij,
	paolo.valente, ming.lei, orsonzhai, zhang.lyra, linux-mmc,
	linux-block, linux-kernel


>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index f389d7c724bd..6a20f8e8eb85 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -391,6 +391,7 @@ struct blk_mq_ops {
>>   enum {
>>   	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>>   	BLK_MQ_F_TAG_SHARED	= 1 << 1,
>> +	BLK_MQ_F_FORCE_COMMIT_RQS = 1 << 3,
> 
> Maybe BLK_MQ_F_ALWAYS_COMMIT might be a better name?  Also this
> flag (just like the existing ones..) could really use a comment
> explaining it.

Would it make sense to elevate this flag to a request_queue flag
(QUEUE_FLAG_ALWAYS_COMMIT)?

I'm thinking of a possibility that an I/O scheduler may be used
to activate this functionality rather than having the driver set
it necessarily...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-08 21:35     ` Sagi Grimberg
@ 2020-05-08 21:46       ` Ming Lei
  2020-05-08 22:19         ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-05-08 21:46 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Baolin Wang, axboe, ulf.hansson,
	adrian.hunter, arnd, linus.walleij, paolo.valente, orsonzhai,
	zhang.lyra, linux-mmc, linux-block, linux-kernel

On Fri, May 08, 2020 at 02:35:35PM -0700, Sagi Grimberg wrote:
> 
> > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > > index f389d7c724bd..6a20f8e8eb85 100644
> > > --- a/include/linux/blk-mq.h
> > > +++ b/include/linux/blk-mq.h
> > > @@ -391,6 +391,7 @@ struct blk_mq_ops {
> > >   enum {
> > >   	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
> > >   	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> > > +	BLK_MQ_F_FORCE_COMMIT_RQS = 1 << 3,
> > 
> > Maybe BLK_MQ_F_ALWAYS_COMMIT might be a better name?  Also this
> > flag (just like the existing ones..) could really use a comment
> > explaining it.
> 
> Would it make sense to elevate this flag to a request_queue flag
> (QUEUE_FLAG_ALWAYS_COMMIT)?

request queue flag usually is writable, however this case just needs
one read-only flag, so I think it may be better to make it as
tagset/hctx flag.

> 
> I'm thinking of a possibility that an I/O scheduler may be used
> to activate this functionality rather than having the driver set
> it necessarily...

Could you explain a bit why I/O scheduler should activate this
functionality?

batching submission may be good for some drivers, and currently
we only do it in limited way. One reason is that there is extra
cost for full batching submission, such as this patch requires
one extra .commit_rqs() for each dispatch, and lock is often needed
in this callback.

IMO it can be a win for some slow driver or device, but may cause
a little performance drop for fast driver/device especially in workload
of not-batching submission.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-08 21:46       ` Ming Lei
@ 2020-05-08 22:19         ` Sagi Grimberg
  2020-05-08 23:22           ` Ming Lei
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2020-05-08 22:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Baolin Wang, axboe, ulf.hansson,
	adrian.hunter, arnd, linus.walleij, paolo.valente, orsonzhai,
	zhang.lyra, linux-mmc, linux-block, linux-kernel

Hey Ming,

>> Would it make sense to elevate this flag to a request_queue flag
>> (QUEUE_FLAG_ALWAYS_COMMIT)?
> 
> request queue flag usually is writable, however this case just needs
> one read-only flag, so I think it may be better to make it as
> tagset/hctx flag.

I actually intended it to be writable.

>> I'm thinking of a possibility that an I/O scheduler may be used
>> to activate this functionality rather than having the driver set
>> it necessarily...
> 
> Could you explain a bit why I/O scheduler should activate this
> functionality?

Sure, I've recently seen some academic work showing the benefits
of batching in tcp/ip based block drivers. The problem with the
approaches taken is that I/O scheduling is exercised deep down in the
driver, which is not the direction I'd like to go if we are want
to adopt some of the batching concepts.

I spent some (limited) time thinking about this, and it seems to
me that there is an opportunity to implement this as a dedicated
I/O scheduler, and tie it to driver specific LLD stack optimizations
(net-stack for example) relying on the commit_rq/bd->last hints.

When scanning the scheduler code, I noticed exactly the phenomenon that
this patchset is attempting to solve and Christoph referred me to it.
Now I'm thinking if we can extend this batching optimization for both
use-cases.

> batching submission may be good for some drivers, and currently
> we only do it in limited way. One reason is that there is extra
> cost for full batching submission, such as this patch requires
> one extra .commit_rqs() for each dispatch, and lock is often needed
> in this callback.

That is not necessarily the case at all.

> IMO it can be a win for some slow driver or device, but may cause
> a little performance drop for fast driver/device especially in workload
> of not-batching submission.

You're mostly correct. This is exactly why an I/O scheduler may be
applicable here IMO. Mostly because I/O schedulers tend to optimize for
something specific and always present tradeoffs. Users need to
understand what they are optimizing for.

Hence I'd say this functionality can definitely be available to an I/O
scheduler should one exist.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-08 22:19         ` Sagi Grimberg
@ 2020-05-08 23:22           ` Ming Lei
  2020-05-09  8:57             ` Baolin Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-05-08 23:22 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Baolin Wang, axboe, ulf.hansson,
	adrian.hunter, arnd, linus.walleij, paolo.valente, orsonzhai,
	zhang.lyra, linux-mmc, linux-block, linux-kernel

Hi Sagi,

On Fri, May 08, 2020 at 03:19:45PM -0700, Sagi Grimberg wrote:
> Hey Ming,
> 
> > > Would it make sense to elevate this flag to a request_queue flag
> > > (QUEUE_FLAG_ALWAYS_COMMIT)?
> > 
> > request queue flag usually is writable, however this case just needs
> > one read-only flag, so I think it may be better to make it as
> > tagset/hctx flag.
> 
> I actually intended it to be writable.
> 
> > > I'm thinking of a possibility that an I/O scheduler may be used
> > > to activate this functionality rather than having the driver set
> > > it necessarily...
> > 
> > Could you explain a bit why I/O scheduler should activate this
> > functionality?
> 
> Sure, I've recently seen some academic work showing the benefits
> of batching in tcp/ip based block drivers. The problem with the
> approaches taken is that I/O scheduling is exercised deep down in the
> driver, which is not the direction I'd like to go if we are want
> to adopt some of the batching concepts.
> 
> I spent some (limited) time thinking about this, and it seems to
> me that there is an opportunity to implement this as a dedicated
> I/O scheduler, and tie it to driver specific LLD stack optimizations
> (net-stack for example) relying on the commit_rq/bd->last hints.
> 
> When scanning the scheduler code, I noticed exactly the phenomenon that
> this patchset is attempting to solve and Christoph referred me to it.
> Now I'm thinking if we can extend this batching optimization for both
> use-cases.

Got it, thanks for the sharing.

> 
> > batching submission may be good for some drivers, and currently
> > we only do it in limited way. One reason is that there is extra
> > cost for full batching submission, such as this patch requires
> > one extra .commit_rqs() for each dispatch, and lock is often needed
> > in this callback.
> 
> That is not necessarily the case at all.

So far, all in-tree .commit_rqs() implementation requires lock.

> 
> > IMO it can be a win for some slow driver or device, but may cause
> > a little performance drop for fast driver/device especially in workload
> > of not-batching submission.
> 
> You're mostly correct. This is exactly why an I/O scheduler may be
> applicable here IMO. Mostly because I/O schedulers tend to optimize for
> something specific and always present tradeoffs. Users need to
> understand what they are optimizing for.
> 
> Hence I'd say this functionality can definitely be available to an I/O
> scheduler should one exist.
> 

I guess it is just that there can be multiple requests available from
scheduler queue. Actually it can be so for other non-nvme drivers in
case of none, such as SCSI.

Another way is to use one per-task list(such as plug list) to hold the
requests for dispatch, then every drivers may see real .last flag, so they
may get chance for optimizing batch queuing. I will think about the
idea further and see if it is really doable.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-08 23:22           ` Ming Lei
@ 2020-05-09  8:57             ` Baolin Wang
  2020-05-09  9:43               ` Ming Lei
  0 siblings, 1 reply; 22+ messages in thread
From: Baolin Wang @ 2020-05-09  8:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Christoph Hellwig, axboe, Ulf Hansson,
	Adrian Hunter, Arnd Bergmann, Linus Walleij, Paolo Valente,
	Orson Zhai, Chunyan Zhang, linux-mmc, linux-block, LKML

On Sat, May 9, 2020 at 7:22 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hi Sagi,
>
> On Fri, May 08, 2020 at 03:19:45PM -0700, Sagi Grimberg wrote:
> > Hey Ming,
> >
> > > > Would it make sense to elevate this flag to a request_queue flag
> > > > (QUEUE_FLAG_ALWAYS_COMMIT)?
> > >
> > > request queue flag usually is writable, however this case just needs
> > > one read-only flag, so I think it may be better to make it as
> > > tagset/hctx flag.
> >
> > I actually intended it to be writable.
> >
> > > > I'm thinking of a possibility that an I/O scheduler may be used
> > > > to activate this functionality rather than having the driver set
> > > > it necessarily...
> > >
> > > Could you explain a bit why I/O scheduler should activate this
> > > functionality?
> >
> > Sure, I've recently seen some academic work showing the benefits
> > of batching in tcp/ip based block drivers. The problem with the
> > approaches taken is that I/O scheduling is exercised deep down in the
> > driver, which is not the direction I'd like to go if we are want
> > to adopt some of the batching concepts.
> >
> > I spent some (limited) time thinking about this, and it seems to
> > me that there is an opportunity to implement this as a dedicated
> > I/O scheduler, and tie it to driver specific LLD stack optimizations
> > (net-stack for example) relying on the commit_rq/bd->last hints.
> >
> > When scanning the scheduler code, I noticed exactly the phenomenon that
> > this patchset is attempting to solve and Christoph referred me to it.
> > Now I'm thinking if we can extend this batching optimization for both
> > use-cases.
>
> Got it, thanks for the sharing.
>
> >
> > > batching submission may be good for some drivers, and currently
> > > we only do it in limited way. One reason is that there is extra
> > > cost for full batching submission, such as this patch requires
> > > one extra .commit_rqs() for each dispatch, and lock is often needed
> > > in this callback.
> >
> > That is not necessarily the case at all.
>
> So far, all in-tree .commit_rqs() implementation requires lock.
>
> >
> > > IMO it can be a win for some slow driver or device, but may cause
> > > a little performance drop for fast driver/device especially in workload
> > > of not-batching submission.
> >
> > You're mostly correct. This is exactly why an I/O scheduler may be
> > applicable here IMO. Mostly because I/O schedulers tend to optimize for
> > something specific and always present tradeoffs. Users need to
> > understand what they are optimizing for.
> >
> > Hence I'd say this functionality can definitely be available to an I/O
> > scheduler should one exist.
> >
>
> I guess it is just that there can be multiple requests available from
> scheduler queue. Actually it can be so for other non-nvme drivers in
> case of none, such as SCSI.
>
> Another way is to use one per-task list(such as plug list) to hold the
> requests for dispatch, then every drivers may see real .last flag, so they
> may get chance for optimizing batch queuing. I will think about the
> idea further and see if it is really doable.

How about my RFC v1 patch set[1], which allows dispatching more than
one request from the scheduler to support batch requests?

[1]
https://lore.kernel.org/patchwork/patch/1210034/
https://lore.kernel.org/patchwork/patch/1210035/

-- 
Baolin Wang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-09  8:57             ` Baolin Wang
@ 2020-05-09  9:43               ` Ming Lei
  2020-05-10  7:44                 ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-05-09  9:43 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Sagi Grimberg, Christoph Hellwig, axboe, Ulf Hansson,
	Adrian Hunter, Arnd Bergmann, Linus Walleij, Paolo Valente,
	Orson Zhai, Chunyan Zhang, linux-mmc, linux-block, LKML

On Sat, May 09, 2020 at 04:57:48PM +0800, Baolin Wang wrote:
> On Sat, May 9, 2020 at 7:22 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Hi Sagi,
> >
> > On Fri, May 08, 2020 at 03:19:45PM -0700, Sagi Grimberg wrote:
> > > Hey Ming,
> > >
> > > > > Would it make sense to elevate this flag to a request_queue flag
> > > > > (QUEUE_FLAG_ALWAYS_COMMIT)?
> > > >
> > > > request queue flag usually is writable, however this case just needs
> > > > one read-only flag, so I think it may be better to make it as
> > > > tagset/hctx flag.
> > >
> > > I actually intended it to be writable.
> > >
> > > > > I'm thinking of a possibility that an I/O scheduler may be used
> > > > > to activate this functionality rather than having the driver set
> > > > > it necessarily...
> > > >
> > > > Could you explain a bit why I/O scheduler should activate this
> > > > functionality?
> > >
> > > Sure, I've recently seen some academic work showing the benefits
> > > of batching in tcp/ip based block drivers. The problem with the
> > > approaches taken is that I/O scheduling is exercised deep down in the
> > > driver, which is not the direction I'd like to go if we are want
> > > to adopt some of the batching concepts.
> > >
> > > I spent some (limited) time thinking about this, and it seems to
> > > me that there is an opportunity to implement this as a dedicated
> > > I/O scheduler, and tie it to driver specific LLD stack optimizations
> > > (net-stack for example) relying on the commit_rq/bd->last hints.
> > >
> > > When scanning the scheduler code, I noticed exactly the phenomenon that
> > > this patchset is attempting to solve and Christoph referred me to it.
> > > Now I'm thinking if we can extend this batching optimization for both
> > > use-cases.
> >
> > Got it, thanks for the sharing.
> >
> > >
> > > > batching submission may be good for some drivers, and currently
> > > > we only do it in limited way. One reason is that there is extra
> > > > cost for full batching submission, such as this patch requires
> > > > one extra .commit_rqs() for each dispatch, and lock is often needed
> > > > in this callback.
> > >
> > > That is not necessarily the case at all.
> >
> > So far, all in-tree .commit_rqs() implementation requires lock.
> >
> > >
> > > > IMO it can be a win for some slow driver or device, but may cause
> > > > a little performance drop for fast driver/device especially in workload
> > > > of not-batching submission.
> > >
> > > You're mostly correct. This is exactly why an I/O scheduler may be
> > > applicable here IMO. Mostly because I/O schedulers tend to optimize for
> > > something specific and always present tradeoffs. Users need to
> > > understand what they are optimizing for.
> > >
> > > Hence I'd say this functionality can definitely be available to an I/O
> > > scheduler should one exist.
> > >
> >
> > I guess it is just that there can be multiple requests available from
> > scheduler queue. Actually it can be so for other non-nvme drivers in
> > case of none, such as SCSI.
> >
> > Another way is to use one per-task list(such as plug list) to hold the
> > requests for dispatch, then every drivers may see real .last flag, so they
> > may get chance for optimizing batch queuing. I will think about the
> > idea further and see if it is really doable.
> 
> How about my RFC v1 patch set[1], which allows dispatching more than
> one request from the scheduler to support batch requests?
> 
> [1]
> https://lore.kernel.org/patchwork/patch/1210034/
> https://lore.kernel.org/patchwork/patch/1210035/

Basically, my idea is to dequeue request one by one, and for each
dequeued request:

- we try to get a budget and driver tag, if both succeed, add the
request to one per-task list which can be stored in stack variable,
then continue to dequeue more request

- if either budget or driver tag can't be allocated for this request,
marks the last request in the per-task list as .last, and send the
batching requests stored in the list to LLD

- when queueing batching requests to LLD, if one request isn't queued
to driver successfully, calling .commit_rqs() like before, meantime
adding the remained requests in the per-task list back to scheduler
queue or hctx->dispatch.

One issue is that this way might degrade sequential IO performance if
the LLD just tells queue busy to blk-mq via return value of .queue_rq(),
so I guess we still may need one flag, such as BLK_MQ_F_BATCHING_SUBMISSION.


thanks, 
Ming


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-09  9:43               ` Ming Lei
@ 2020-05-10  7:44                 ` Sagi Grimberg
  2020-05-11  1:29                   ` Ming Lei
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2020-05-10  7:44 UTC (permalink / raw)
  To: Ming Lei, Baolin Wang
  Cc: Christoph Hellwig, axboe, Ulf Hansson, Adrian Hunter,
	Arnd Bergmann, Linus Walleij, Paolo Valente, Orson Zhai,
	Chunyan Zhang, linux-mmc, linux-block, LKML


>>>> You're mostly correct. This is exactly why an I/O scheduler may be
>>>> applicable here IMO. Mostly because I/O schedulers tend to optimize for
>>>> something specific and always present tradeoffs. Users need to
>>>> understand what they are optimizing for.
>>>>
>>>> Hence I'd say this functionality can definitely be available to an I/O
>>>> scheduler should one exist.
>>>>
>>>
>>> I guess it is just that there can be multiple requests available from
>>> scheduler queue. Actually it can be so for other non-nvme drivers in
>>> case of none, such as SCSI.
>>>
>>> Another way is to use one per-task list(such as plug list) to hold the
>>> requests for dispatch, then every drivers may see real .last flag, so they
>>> may get chance for optimizing batch queuing. I will think about the
>>> idea further and see if it is really doable.
>>
>> How about my RFC v1 patch set[1], which allows dispatching more than
>> one request from the scheduler to support batch requests?
>>
>> [1]
>> https://lore.kernel.org/patchwork/patch/1210034/
>> https://lore.kernel.org/patchwork/patch/1210035/
> 
> Basically, my idea is to dequeue request one by one, and for each
> dequeued request:
> 
> - we try to get a budget and driver tag, if both succeed, add the
> request to one per-task list which can be stored in stack variable,
> then continue to dequeue more request
> 
> - if either budget or driver tag can't be allocated for this request,
> marks the last request in the per-task list as .last, and send the
> batching requests stored in the list to LLD
> 
> - when queueing batching requests to LLD, if one request isn't queued
> to driver successfully, calling .commit_rqs() like before, meantime
> adding the remained requests in the per-task list back to scheduler
> queue or hctx->dispatch.

Sounds good to me.

> One issue is that this way might degrade sequential IO performance if
> the LLD just tells queue busy to blk-mq via return value of .queue_rq(),
> so I guess we still may need one flag, such as BLK_MQ_F_BATCHING_SUBMISSION.

Why is that degrading sequential I/O performance? because the specific
device will do better without batching submissions? If so, the driver
is not obligated to respect the bd->last/.commit_rqs, so if that is the
case, it should just ignore it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-10  7:44                 ` Sagi Grimberg
@ 2020-05-11  1:29                   ` Ming Lei
  2020-05-11  9:23                     ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-05-11  1:29 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Baolin Wang, Christoph Hellwig, axboe, Ulf Hansson,
	Adrian Hunter, Arnd Bergmann, Linus Walleij, Paolo Valente,
	Orson Zhai, Chunyan Zhang, linux-mmc, linux-block, LKML

On Sun, May 10, 2020 at 12:44:53AM -0700, Sagi Grimberg wrote:
> 
> > > > > You're mostly correct. This is exactly why an I/O scheduler may be
> > > > > applicable here IMO. Mostly because I/O schedulers tend to optimize for
> > > > > something specific and always present tradeoffs. Users need to
> > > > > understand what they are optimizing for.
> > > > > 
> > > > > Hence I'd say this functionality can definitely be available to an I/O
> > > > > scheduler should one exist.
> > > > > 
> > > > 
> > > > I guess it is just that there can be multiple requests available from
> > > > scheduler queue. Actually it can be so for other non-nvme drivers in
> > > > case of none, such as SCSI.
> > > > 
> > > > Another way is to use one per-task list(such as plug list) to hold the
> > > > requests for dispatch, then every drivers may see real .last flag, so they
> > > > may get chance for optimizing batch queuing. I will think about the
> > > > idea further and see if it is really doable.
> > > 
> > > How about my RFC v1 patch set[1], which allows dispatching more than
> > > one request from the scheduler to support batch requests?
> > > 
> > > [1]
> > > https://lore.kernel.org/patchwork/patch/1210034/
> > > https://lore.kernel.org/patchwork/patch/1210035/
> > 
> > Basically, my idea is to dequeue request one by one, and for each
> > dequeued request:
> > 
> > - we try to get a budget and driver tag, if both succeed, add the
> > request to one per-task list which can be stored in stack variable,
> > then continue to dequeue more request
> > 
> > - if either budget or driver tag can't be allocated for this request,
> > marks the last request in the per-task list as .last, and send the
> > batching requests stored in the list to LLD
> > 
> > - when queueing batching requests to LLD, if one request isn't queued
> > to driver successfully, calling .commit_rqs() like before, meantime
> > adding the remained requests in the per-task list back to scheduler
> > queue or hctx->dispatch.
> 
> Sounds good to me.
> 
> > One issue is that this way might degrade sequential IO performance if
> > the LLD just tells queue busy to blk-mq via return value of .queue_rq(),
> > so I guess we still may need one flag, such as BLK_MQ_F_BATCHING_SUBMISSION.
> 
> Why is that degrading sequential I/O performance? because the specific

Some devices may only return BLK_STS_RESOURCE from .queue_rq(), then more
requests are dequeued from scheduler queue if we always queue batching IOs
to LLD, and chance of IO merge is reduced, so sequential IO performance will
be effected.

Such as some scsi device which doesn't use sdev->queue_depth for
throttling IOs.

For virtio-scsi or virtio-blk, we may stop queue for avoiding the
potential affect.

> device will do better without batching submissions? If so, the driver

It isn't related with batching submission, IMO.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-11  1:29                   ` Ming Lei
@ 2020-05-11  9:23                     ` Sagi Grimberg
  2020-05-11 11:47                       ` Ming Lei
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2020-05-11  9:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Baolin Wang, Christoph Hellwig, axboe, Ulf Hansson,
	Adrian Hunter, Arnd Bergmann, Linus Walleij, Paolo Valente,
	Orson Zhai, Chunyan Zhang, linux-mmc, linux-block, LKML


>>> Basically, my idea is to dequeue request one by one, and for each
>>> dequeued request:
>>>
>>> - we try to get a budget and driver tag, if both succeed, add the
>>> request to one per-task list which can be stored in stack variable,
>>> then continue to dequeue more request
>>>
>>> - if either budget or driver tag can't be allocated for this request,
>>> marks the last request in the per-task list as .last, and send the
>>> batching requests stored in the list to LLD
>>>
>>> - when queueing batching requests to LLD, if one request isn't queued
>>> to driver successfully, calling .commit_rqs() like before, meantime
>>> adding the remained requests in the per-task list back to scheduler
>>> queue or hctx->dispatch.
>>
>> Sounds good to me.
>>
>>> One issue is that this way might degrade sequential IO performance if
>>> the LLD just tells queue busy to blk-mq via return value of .queue_rq(),
>>> so I guess we still may need one flag, such as BLK_MQ_F_BATCHING_SUBMISSION.
>>
>> Why is that degrading sequential I/O performance? because the specific
> 
> Some devices may only return BLK_STS_RESOURCE from .queue_rq(), then more
> requests are dequeued from scheduler queue if we always queue batching IOs
> to LLD, and chance of IO merge is reduced, so sequential IO performance will
> be effected.
> 
> Such as some scsi device which doesn't use sdev->queue_depth for
> throttling IOs.
> 
> For virtio-scsi or virtio-blk, we may stop queue for avoiding the
> potential affect.

Do we have a way to characterize such devices? I'd assume that most
devices will benefit from the batching so maybe the flag needs to be
inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-11  9:23                     ` Sagi Grimberg
@ 2020-05-11 11:47                       ` Ming Lei
  2020-05-12  6:26                         ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-05-11 11:47 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Baolin Wang, Christoph Hellwig, axboe, Ulf Hansson,
	Adrian Hunter, Arnd Bergmann, Linus Walleij, Paolo Valente,
	Orson Zhai, Chunyan Zhang, linux-mmc, linux-block, LKML

On Mon, May 11, 2020 at 02:23:14AM -0700, Sagi Grimberg wrote:
> 
> > > > Basically, my idea is to dequeue request one by one, and for each
> > > > dequeued request:
> > > > 
> > > > - we try to get a budget and driver tag, if both succeed, add the
> > > > request to one per-task list which can be stored in stack variable,
> > > > then continue to dequeue more request
> > > > 
> > > > - if either budget or driver tag can't be allocated for this request,
> > > > marks the last request in the per-task list as .last, and send the
> > > > batching requests stored in the list to LLD
> > > > 
> > > > - when queueing batching requests to LLD, if one request isn't queued
> > > > to driver successfully, calling .commit_rqs() like before, meantime
> > > > adding the remained requests in the per-task list back to scheduler
> > > > queue or hctx->dispatch.
> > > 
> > > Sounds good to me.
> > > 
> > > > One issue is that this way might degrade sequential IO performance if
> > > > the LLD just tells queue busy to blk-mq via return value of .queue_rq(),
> > > > so I guess we still may need one flag, such as BLK_MQ_F_BATCHING_SUBMISSION.
> > > 
> > > Why is that degrading sequential I/O performance? because the specific
> > 
> > Some devices may only return BLK_STS_RESOURCE from .queue_rq(), then more
> > requests are dequeued from scheduler queue if we always queue batching IOs
> > to LLD, and chance of IO merge is reduced, so sequential IO performance will
> > be effected.
> > 
> > Such as some scsi device which doesn't use sdev->queue_depth for
> > throttling IOs.
> > 
> > For virtio-scsi or virtio-blk, we may stop queue for avoiding the
> > potential affect.
> 
> Do we have a way to characterize such devices? I'd assume that most

It may not be easy.

> devices will benefit from the batching so maybe the flag needs to be
> inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION?

Actually I'd rather to not add any flag, and we may use some algorithm
(maybe EWMA or other intelligent stuff, or use hctx->dispatch_busy directly)
to figure out one dynamic batching number which is used to dequeue requests
from scheduler or sw queue.

Then just one single dispatch io code path is enough.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-11 11:47                       ` Ming Lei
@ 2020-05-12  6:26                         ` Sagi Grimberg
  2020-05-12  7:55                           ` Ming Lei
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2020-05-12  6:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Baolin Wang, Christoph Hellwig, axboe, Ulf Hansson,
	Adrian Hunter, Arnd Bergmann, Linus Walleij, Paolo Valente,
	Orson Zhai, Chunyan Zhang, linux-mmc, linux-block, LKML


>> devices will benefit from the batching so maybe the flag needs to be
>> inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION?
> 
> Actually I'd rather to not add any flag, and we may use some algorithm
> (maybe EWMA or other intelligent stuff, or use hctx->dispatch_busy directly)
> to figure out one dynamic batching number which is used to dequeue requests
> from scheduler or sw queue.
> 
> Then just one single dispatch io code path is enough.

Sounds good to me, do you plan to submit a patchset?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
  2020-05-12  6:26                         ` Sagi Grimberg
@ 2020-05-12  7:55                           ` Ming Lei
  0 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2020-05-12  7:55 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Baolin Wang, Christoph Hellwig, axboe, Ulf Hansson,
	Adrian Hunter, Arnd Bergmann, Linus Walleij, Paolo Valente,
	Orson Zhai, Chunyan Zhang, linux-mmc, linux-block, LKML

On Mon, May 11, 2020 at 11:26:07PM -0700, Sagi Grimberg wrote:
> 
> > > devices will benefit from the batching so maybe the flag needs to be
> > > inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION?
> > 
> > Actually I'd rather to not add any flag, and we may use some algorithm
> > (maybe EWMA or other intelligent stuff, or use hctx->dispatch_busy directly)
> > to figure out one dynamic batching number which is used to dequeue requests
> > from scheduler or sw queue.
> > 
> > Then just one single dispatch io code path is enough.
> 
> Sounds good to me, do you plan to submit a patchset?
> 

Yeah, I am working on that.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-05-12  7:55 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-26  9:38 [RFC PATCH v2 0/7] Add MMC packed request support Baolin Wang
2020-04-26  9:38 ` [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing Baolin Wang
2020-04-27 15:46   ` Christoph Hellwig
2020-04-28  8:02     ` Baolin Wang
2020-05-08 21:35     ` Sagi Grimberg
2020-05-08 21:46       ` Ming Lei
2020-05-08 22:19         ` Sagi Grimberg
2020-05-08 23:22           ` Ming Lei
2020-05-09  8:57             ` Baolin Wang
2020-05-09  9:43               ` Ming Lei
2020-05-10  7:44                 ` Sagi Grimberg
2020-05-11  1:29                   ` Ming Lei
2020-05-11  9:23                     ` Sagi Grimberg
2020-05-11 11:47                       ` Ming Lei
2020-05-12  6:26                         ` Sagi Grimberg
2020-05-12  7:55                           ` Ming Lei
2020-04-26  9:38 ` [RFC PATCH v2 2/7] mmc: Add MMC packed request support for MMC software queue Baolin Wang
2020-04-26  9:38 ` [RFC PATCH v2 3/7] mmc: host: sdhci: Introduce ADMA3 transfer mode Baolin Wang
2020-04-26  9:38 ` [RFC PATCH v2 4/7] mmc: host: sdhci: Factor out the command configuration Baolin Wang
2020-04-26  9:38 ` [RFC PATCH v2 5/7] mmc: host: sdhci: Remove redundant sg_count member of struct sdhci_host Baolin Wang
2020-04-26  9:38 ` [RFC PATCH v2 6/7] mmc: host: sdhci: Add MMC packed request support Baolin Wang
2020-04-26  9:39 ` [RFC PATCH v2 7/7] mmc: host: sdhci-sprd: " Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).