* [PATCH 0/5] mmc: add double buffering for mmc block requests
@ 2011-01-12 18:13 ` Per Forlin
From: Per Forlin @ 2011-01-12 18:13 UTC (permalink / raw)
  To: linux-mmc, linux-arm-kernel, linux-kernel, dev; +Cc: Chris Ball, Per Forlin

Add support for preparing one MMC request while another is active on
the host. This is done by making issue_rw_rq() asynchronous. The
increase in throughput is proportional to the time it takes to prepare
a request and to how fast the memory is: the faster the MMC/SD device,
the more significant the request preparation time becomes. Measurements
on U5500 and U8500 with eMMC show a significant performance gain for
DMA on MMC for large reads. In the PIO case there is some gain for
large reads too. There seems to be little or no performance gain for
writes; I don't have a good explanation for this yet.
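
Below is a rough sketch of the pipelining idea (illustrative pseudo-C
only; fetch_next_request(), prepare_request(), wait_for_transfer(),
finish_request() and start_transfer() are placeholder names, not
functions added by this series):

	/*
	 * While request N is in flight on the host, the CPU prepares
	 * request N+1 (sg mapping, descriptor setup, bounce copy).
	 */
	struct request *prev = NULL;
	struct request *req;

	while ((req = fetch_next_request(queue))) {
		prepare_request(req);	/* done while 'prev' is in flight */
		if (prev) {
			wait_for_transfer(prev);
			finish_request(prev);	/* unmap + __blk_end_request() */
		}
		start_transfer(req);	/* hand 'req' over to the controller */
		prev = req;
	}
	if (prev) {
		wait_for_transfer(prev);
		finish_request(prev);
	}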

There are two optional hooks, pre_req() and post_req(), that the host
driver may implement in order to improve double buffering. In the DMA
case pre_req() may do dma_map_sg() and prepare the DMA descriptor, and
post_req() runs dma_unmap_sg().
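
For illustration only, a host driver's hooks could look roughly like
the sketch below. The exact hook signatures are defined by this series;
the names and the direction handling used here are assumptions made for
the example, not the mmci code:

	/* Hypothetical host driver hooks (sketch only). */
	static void my_host_pre_req(struct mmc_host *mmc, struct mmc_request *mrq)
	{
		struct mmc_data *data = mrq->data;
		int dir = (data->flags & MMC_DATA_READ) ?
				DMA_FROM_DEVICE : DMA_TO_DEVICE;

		/*
		 * Map the sg list up front, while another request is still
		 * running. A real driver must check the returned nent count
		 * and use it when building the dma descriptor.
		 */
		dma_map_sg(mmc_dev(mmc), data->sg, data->sg_len, dir);
		/* ...build the dma descriptor from the mapped sg list... */
	}

	static void my_host_post_req(struct mmc_host *mmc, struct mmc_request *mrq)
	{
		struct mmc_data *data = mrq->data;
		int dir = (data->flags & MMC_DATA_READ) ?
				DMA_FROM_DEVICE : DMA_TO_DEVICE;

		/* Undo the mapping done in pre_req(). */
		dma_unmap_sg(mmc_dev(mmc), data->sg, data->sg_len, dir);
	}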

The mmci host driver implementation of double buffering is neither
intended nor ready for mainline yet; it is only an example of how to
implement pre_req() and post_req(). The reason is that the basic DMA
support for MMCI is not complete yet. The mmci patches are sent in a
separate patch series, "[FYI 0/4] arm: mmci: example implementation of
double buffering".

Issues/Questions for issue_rw_rq() in block.c:
* Is it safe to claim the host for the first MMC request and wait to
  release it until the MMC queue is empty again, or must the host be
  claimed and released for every request?
* Is it possible to predict the result of __blk_end_request()?
  If a completed MMC request has no errors and
  blk_rq_bytes(req) == data.bytes_xfered, is it guaranteed that
  __blk_end_request() returns 0? (See the sketch after this list.)
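
The pattern in question looks roughly like this (sketch only; in
block.c the call is made with the queue lock held):

	/*
	 * If the completed request had no errors and all bytes were
	 * transferred, is __blk_end_request() guaranteed to report the
	 * request as fully completed, i.e. return 0?
	 */
	if (!brq.cmd.error && !brq.data.error && !brq.stop.error &&
	    blk_rq_bytes(req) == brq.data.bytes_xfered)
		ret = __blk_end_request(req, 0, brq.data.bytes_xfered);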

Here follow the IOZone results for u8500 v1.1 on eMMC.
The numbers for DMA are a bit too good here because the CPU speed is
lower than on u8500 v2, which makes the cache handling even more
significant.

Command line used: ./iozone -az -i0 -i1 -i2 -s 50m -I -f /iozone.tmp -e -R -+u

Relative diff: VANILLA-MMC-PIO -> 2BUF-MMC-PIO
cpu load is abs diff
                                                        random  random
        KB      reclen  write   rewrite read    reread  read    write
        51200   4       +0%     +0%     +0%     +0%     +0%     +0%
        cpu:            +0.1    -0.1    -0.5    -0.3    -0.1    -0.0

        51200   8       +0%     +0%     +6%     +6%     +8%     +0%
        cpu:            +0.1    -0.1    -0.3    -0.4    -0.8    +0.0

        51200   16      +0%     -2%     +0%     +0%     -3%     +0%
        cpu:            +0.0    -0.2    +0.0    +0.0    -0.2    +0.0

        51200   32      +0%     +1%     +0%     +0%     +0%     +0%
        cpu:            +0.1    +0.0    -0.3    +0.0    +0.0    +0.0

        51200   64      +0%     +0%     +0%     +0%     +0%     +0%
        cpu:            +0.1    +0.0    +0.0    +0.0    +0.0    +0.0

        51200   128     +0%     +1%     +1%     +1%     +1%     +0%
        cpu:            +0.0    +0.2    +0.1    -0.3    +0.4    +0.0

        51200   256     +0%     +0%     +1%     +1%     +1%     +0%
        cpu:            +0.0    -0.0    +0.1    +0.1    +0.1    +0.0

        51200   512     +0%     +1%     +2%     +2%     +2%     +0%
        cpu:            +0.1    +0.0    +0.2    +0.2    +0.2    +0.1

        51200   1024    +0%     +2%     +2%     +2%     +3%     +0%
        cpu:            +0.2    +0.1    +0.2    +0.5    -0.8    +0.0

        51200   2048    +0%     +2%     +3%     +3%     +3%     +0%
        cpu:            +0.0    -0.2    +0.4    +0.8    -0.5    +0.2

        51200   4096    +0%     +1%     +3%     +3%     +3%     +1%
        cpu:            +0.2    +0.1    +0.9    +0.9    +0.5    +0.1

        51200   8192    +1%     +0%     +3%     +3%     +3%     +1%
        cpu:            +0.2    +0.2    +1.3    +1.3    +1.0    +0.0

        51200   16384   +0%     +1%     +3%     +3%     +3%     +1%
        cpu:            +0.2    +0.1    +1.0    +1.3    +1.0    +0.5

Relative diff: VANILLA-MMC-DMA -> 2BUF-MMC-MMCI-DMA
cpu load is abs diff
                                                        random  random
        KB      reclen  write   rewrite read    reread  read    write
        51200   4       +0%     -3%     +6%     +5%     +5%     +0%
        cpu:            +0.0    -0.2    -0.6    -0.1    +0.3    +0.0

        51200   8       +0%     +0%     +7%     +7%     +7%     +0%
        cpu:            +0.0    +0.1    +0.8    +0.6    +0.9    +0.0

        51200   16      +0%     +0%     +7%     +7%     +8%     +0%
        cpu:            +0.0    -0.0    +0.7    +0.7    +0.8    +0.0

        51200   32      +0%     +0%     +8%     +8%     +9%     +0%
        cpu:            +0.0    +0.1    +0.7    +0.7    +0.3    +0.0

        51200   64      +0%     +1%     +9%     +9%     +9%     +0%
        cpu:            +0.0    +0.0    +0.8    +0.7    +0.8    +0.0

        51200   128     +1%     +0%     +13%    +13%    +14%    +0%
        cpu:            +0.2    +0.0    +1.0    +1.0    +1.1    +0.0

        51200   256     +1%     +2%     +8%     +8%     +11%    +0%
        cpu:            +0.0    +0.3    +0.0    +0.7    +1.5    +0.0

        51200   512     +1%     +2%     +16%    +16%    +17%    +0%
        cpu:            +0.2    +0.2    +2.2    +2.1    +2.2    +0.1

        51200   1024    +1%     +2%     +20%    +20%    +20%    +1%
        cpu:            +0.2    +0.1    +2.6    +1.9    +2.6    +0.0

        51200   2048    +0%     +2%     +22%    +22%    +21%    +0%
        cpu:            +0.0    +0.3    +2.3    +2.9    +2.1    -0.0

        51200   4096    +1%     +2%     +23%    +23%    +23%    +1%
        cpu:            +0.2    +0.1    +2.0    +3.2    +3.1    +0.0

        51200   8192    +1%     +5%     +24%    +24%    +24%    +1%
        cpu:            +1.4    -0.0    +4.2    +3.0    +2.8    +0.1

        51200   16384   +1%     +3%     +24%    +24%    +24%    +2%
        cpu:            +0.0    +0.3    +3.4    +3.8    +3.7    +0.1

Here follow the IOZone results for u5500 on eMMC.
These DMA numbers are more in line with expectations.

Command line used: ./iozone -az -i0 -i1 -i2 -s 50m -I -f /iozone.tmp -e -R -+u

Relative diff: VANILLA-MMC-DMA -> 2BUF-MMC-MMCI-DMA
cpu load is abs diff
                                                        random  random
        KB      reclen  write   rewrite read    reread  read    write
        51200   128     +1%     +1%     +10%    +9%     +10%    +0%
        cpu:            +0.1    +0.0    +1.3    +0.1    +0.8    +0.1

        51200   256     +2%     +2%     +7%     +7%     +9%     +0%
        cpu:            +0.1    +0.4    +0.5    +0.6    +0.7    +0.0

        51200   512     +2%     +2%     +12%    +12%    +12%    +1%
        cpu:            +0.4    +0.6    +1.8    +2.4    +2.4    +0.2

        51200   1024    +2%     +3%     +14%    +14%    +14%    +0%
        cpu:            +0.3    +0.1    +2.1    +1.4    +1.4    +0.2

        51200   2048    +3%     +3%     +16%    +16%    +16%    +1%
        cpu:            +0.2    +0.2    +2.5    +1.8    +2.4    -0.2

        51200   4096    +3%     +3%     +17%    +17%    +18%    +3%
        cpu:            +0.1    -0.1    +2.7    +2.0    +2.7    -0.1

        51200   8192    +3%     +3%     +18%    +18%    +18%    +3%
        cpu:            -0.1    +0.2    +3.0    +2.3    +2.2    +0.2

        51200   16384   +3%     +3%     +18%    +18%    +18%    +4%
        cpu:            +0.2    +0.2    +2.8    +3.5    +2.4    -0.0

Per Forlin (5):
  mmc: add member in mmc queue struct to hold request data
  mmc: Add a block request prepare function
  mmc: Add a second mmc queue request member
  mmc: Store the mmc block request struct in mmc queue
  mmc: Add double buffering for mmc block requests

 drivers/mmc/card/block.c |  337 ++++++++++++++++++++++++++++++----------------
 drivers/mmc/card/queue.c |  171 +++++++++++++++---------
 drivers/mmc/card/queue.h |   31 +++-
 drivers/mmc/core/core.c  |   77 +++++++++--
 include/linux/mmc/core.h |    7 +-
 include/linux/mmc/host.h |    8 +
 6 files changed, 432 insertions(+), 199 deletions(-)



* [PATCH 1/5] mmc: add member in mmc queue struct to hold request data
  2011-01-12 18:13 ` Per Forlin
@ 2011-01-12 18:13   ` Per Forlin
From: Per Forlin @ 2011-01-12 18:13 UTC (permalink / raw)
  To: linux-mmc, linux-arm-kernel, linux-kernel, dev; +Cc: Chris Ball, Per Forlin

The way the request data is organized in the mmc queue struct allows
processing of only one request at a time. This patch adds a new struct
to hold mmc queue request data such as the sg list, request and bounce
buffers, and updates the functions that depend on the mmc queue struct.
This lays the groundwork for using multiple active requests on one mmc
queue.

Signed-off-by: Per Forlin <per.forlin@linaro.org>
---
 drivers/mmc/card/block.c |    8 ++--
 drivers/mmc/card/queue.c |  129 ++++++++++++++++++++++++----------------------
 drivers/mmc/card/queue.h |   22 +++++---
 3 files changed, 85 insertions(+), 74 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 217f820..be51bde 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -398,8 +398,8 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 
 		mmc_set_data_timeout(&brq.data, card);
 
-		brq.data.sg = mq->sg;
-		brq.data.sg_len = mmc_queue_map_sg(mq);
+		brq.data.sg = mq->mqrq_cur->sg;
+		brq.data.sg_len = mmc_queue_map_sg(mq, mq->mqrq_cur);
 
 		/*
 		 * Adjust the sg list so it is the same size as the
@@ -420,11 +420,11 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 			brq.data.sg_len = i;
 		}
 
-		mmc_queue_bounce_pre(mq);
+		mmc_queue_bounce_pre(mq->mqrq_cur);
 
 		mmc_wait_for_req(card->host, &brq.mrq);
 
-		mmc_queue_bounce_post(mq);
+		mmc_queue_bounce_post(mq->mqrq_cur);
 
 		/*
 		 * Check for errors here, but don't jump to cmd_err
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 4e42d03..8a8d88b 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -57,7 +57,7 @@ static int mmc_queue_thread(void *d)
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (!blk_queue_plugged(q))
 			req = blk_fetch_request(q);
-		mq->req = req;
+		mq->mqrq_cur->req = req;
 		spin_unlock_irq(q->queue_lock);
 
 		if (!req) {
@@ -98,10 +98,25 @@ static void mmc_request(struct request_queue *q)
 		return;
 	}
 
-	if (!mq->req)
+	if (!mq->mqrq_cur->req)
 		wake_up_process(mq->thread);
 }
 
+struct scatterlist *mmc_alloc_sg(int sg_len, int *err)
+{
+	struct scatterlist *sg;
+
+	sg = kmalloc(sizeof(struct scatterlist)*sg_len, GFP_KERNEL);
+	if (!sg)
+		*err = -ENOMEM;
+	else {
+		*err = 0;
+		sg_init_table(sg, sg_len);
+	}
+
+	return sg;
+}
+
 /**
  * mmc_init_queue - initialise a queue structure.
  * @mq: mmc queue
@@ -115,6 +130,7 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 	struct mmc_host *host = card->host;
 	u64 limit = BLK_BOUNCE_HIGH;
 	int ret;
+	struct mmc_queue_req *mqrq_cur = &mq->mqrq[0];
 
 	if (mmc_dev(host)->dma_mask && *mmc_dev(host)->dma_mask)
 		limit = *mmc_dev(host)->dma_mask;
@@ -124,8 +140,9 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 	if (!mq->queue)
 		return -ENOMEM;
 
+	memset(&mq->mqrq_cur, 0, sizeof(mq->mqrq_cur));
+	mq->mqrq_cur = mqrq_cur;
 	mq->queue->queuedata = mq;
-	mq->req = NULL;
 
 	blk_queue_prep_rq(mq->queue, mmc_prep_request);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
@@ -159,53 +176,44 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 			bouncesz = host->max_blk_count * 512;
 
 		if (bouncesz > 512) {
-			mq->bounce_buf = kmalloc(bouncesz, GFP_KERNEL);
-			if (!mq->bounce_buf) {
+			mqrq_cur->bounce_buf = kmalloc(bouncesz, GFP_KERNEL);
+			if (!mqrq_cur->bounce_buf) {
 				printk(KERN_WARNING "%s: unable to "
-					"allocate bounce buffer\n",
+					"allocate bounce cur buffer\n",
 					mmc_card_name(card));
 			}
 		}
 
-		if (mq->bounce_buf) {
+		if (mqrq_cur->bounce_buf) {
 			blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_ANY);
 			blk_queue_max_hw_sectors(mq->queue, bouncesz / 512);
 			blk_queue_max_segments(mq->queue, bouncesz / 512);
 			blk_queue_max_segment_size(mq->queue, bouncesz);
 
-			mq->sg = kmalloc(sizeof(struct scatterlist),
-				GFP_KERNEL);
-			if (!mq->sg) {
-				ret = -ENOMEM;
+			mqrq_cur->sg = mmc_alloc_sg(1, &ret);
+			if (ret)
 				goto cleanup_queue;
-			}
-			sg_init_table(mq->sg, 1);
 
-			mq->bounce_sg = kmalloc(sizeof(struct scatterlist) *
-				bouncesz / 512, GFP_KERNEL);
-			if (!mq->bounce_sg) {
-				ret = -ENOMEM;
+			mqrq_cur->bounce_sg =
+				mmc_alloc_sg(bouncesz / 512, &ret);
+			if (ret)
 				goto cleanup_queue;
-			}
-			sg_init_table(mq->bounce_sg, bouncesz / 512);
+
 		}
 	}
 #endif
 
-	if (!mq->bounce_buf) {
+	if (!mqrq_cur->bounce_buf) {
 		blk_queue_bounce_limit(mq->queue, limit);
 		blk_queue_max_hw_sectors(mq->queue,
 			min(host->max_blk_count, host->max_req_size / 512));
 		blk_queue_max_segments(mq->queue, host->max_segs);
 		blk_queue_max_segment_size(mq->queue, host->max_seg_size);
 
-		mq->sg = kmalloc(sizeof(struct scatterlist) *
-			host->max_segs, GFP_KERNEL);
-		if (!mq->sg) {
-			ret = -ENOMEM;
+		mqrq_cur->sg = mmc_alloc_sg(host->max_segs, &ret);
+		if (ret)
 			goto cleanup_queue;
-		}
-		sg_init_table(mq->sg, host->max_segs);
+
 	}
 
 	sema_init(&mq->thread_sem, 1);
@@ -220,16 +228,15 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 
 	return 0;
  free_bounce_sg:
- 	if (mq->bounce_sg)
- 		kfree(mq->bounce_sg);
- 	mq->bounce_sg = NULL;
+	kfree(mqrq_cur->bounce_sg);
+	mqrq_cur->bounce_sg = NULL;
+
  cleanup_queue:
- 	if (mq->sg)
-		kfree(mq->sg);
-	mq->sg = NULL;
-	if (mq->bounce_buf)
-		kfree(mq->bounce_buf);
-	mq->bounce_buf = NULL;
+	kfree(mqrq_cur->sg);
+	mqrq_cur->sg = NULL;
+	kfree(mqrq_cur->bounce_buf);
+	mqrq_cur->bounce_buf = NULL;
+
 	blk_cleanup_queue(mq->queue);
 	return ret;
 }
@@ -238,6 +245,7 @@ void mmc_cleanup_queue(struct mmc_queue *mq)
 {
 	struct request_queue *q = mq->queue;
 	unsigned long flags;
+	struct mmc_queue_req *mqrq_cur = mq->mqrq_cur;
 
 	/* Make sure the queue isn't suspended, as that will deadlock */
 	mmc_queue_resume(mq);
@@ -251,16 +259,14 @@ void mmc_cleanup_queue(struct mmc_queue *mq)
 	blk_start_queue(q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
- 	if (mq->bounce_sg)
- 		kfree(mq->bounce_sg);
- 	mq->bounce_sg = NULL;
+	kfree(mqrq_cur->bounce_sg);
+	mqrq_cur->bounce_sg = NULL;
 
-	kfree(mq->sg);
-	mq->sg = NULL;
+	kfree(mqrq_cur->sg);
+	mqrq_cur->sg = NULL;
 
-	if (mq->bounce_buf)
-		kfree(mq->bounce_buf);
-	mq->bounce_buf = NULL;
+	kfree(mqrq_cur->bounce_buf);
+	mqrq_cur->bounce_buf = NULL;
 
 	mq->card = NULL;
 }
@@ -313,27 +319,27 @@ void mmc_queue_resume(struct mmc_queue *mq)
 /*
  * Prepare the sg list(s) to be handed of to the host driver
  */
-unsigned int mmc_queue_map_sg(struct mmc_queue *mq)
+unsigned int mmc_queue_map_sg(struct mmc_queue *mq, struct mmc_queue_req *mqrq)
 {
 	unsigned int sg_len;
 	size_t buflen;
 	struct scatterlist *sg;
 	int i;
 
-	if (!mq->bounce_buf)
-		return blk_rq_map_sg(mq->queue, mq->req, mq->sg);
+	if (!mqrq->bounce_buf)
+		return blk_rq_map_sg(mq->queue, mqrq->req, mqrq->sg);
 
-	BUG_ON(!mq->bounce_sg);
+	BUG_ON(!mqrq->bounce_sg);
 
-	sg_len = blk_rq_map_sg(mq->queue, mq->req, mq->bounce_sg);
+	sg_len = blk_rq_map_sg(mq->queue, mqrq->req, mqrq->bounce_sg);
 
-	mq->bounce_sg_len = sg_len;
+	mqrq->bounce_sg_len = sg_len;
 
 	buflen = 0;
-	for_each_sg(mq->bounce_sg, sg, sg_len, i)
+	for_each_sg(mqrq->bounce_sg, sg, sg_len, i)
 		buflen += sg->length;
 
-	sg_init_one(mq->sg, mq->bounce_buf, buflen);
+	sg_init_one(mqrq->sg, mqrq->bounce_buf, buflen);
 
 	return 1;
 }
@@ -342,19 +348,19 @@ unsigned int mmc_queue_map_sg(struct mmc_queue *mq)
  * If writing, bounce the data to the buffer before the request
  * is sent to the host driver
  */
-void mmc_queue_bounce_pre(struct mmc_queue *mq)
+void mmc_queue_bounce_pre(struct mmc_queue_req *mqrq)
 {
 	unsigned long flags;
 
-	if (!mq->bounce_buf)
+	if (!mqrq->bounce_buf)
 		return;
 
-	if (rq_data_dir(mq->req) != WRITE)
+	if (rq_data_dir(mqrq->req) != WRITE)
 		return;
 
 	local_irq_save(flags);
-	sg_copy_to_buffer(mq->bounce_sg, mq->bounce_sg_len,
-		mq->bounce_buf, mq->sg[0].length);
+	sg_copy_to_buffer(mqrq->bounce_sg, mqrq->bounce_sg_len,
+		mqrq->bounce_buf, mqrq->sg[0].length);
 	local_irq_restore(flags);
 }
 
@@ -362,19 +368,18 @@ void mmc_queue_bounce_pre(struct mmc_queue *mq)
  * If reading, bounce the data from the buffer after the request
  * has been handled by the host driver
  */
-void mmc_queue_bounce_post(struct mmc_queue *mq)
+void mmc_queue_bounce_post(struct mmc_queue_req *mqrq)
 {
 	unsigned long flags;
 
-	if (!mq->bounce_buf)
+	if (!mqrq->bounce_buf)
 		return;
 
-	if (rq_data_dir(mq->req) != READ)
+	if (rq_data_dir(mqrq->req) != READ)
 		return;
 
 	local_irq_save(flags);
-	sg_copy_from_buffer(mq->bounce_sg, mq->bounce_sg_len,
-		mq->bounce_buf, mq->sg[0].length);
+	sg_copy_from_buffer(mqrq->bounce_sg, mqrq->bounce_sg_len,
+		mqrq->bounce_buf, mqrq->sg[0].length);
 	local_irq_restore(flags);
 }
-
diff --git a/drivers/mmc/card/queue.h b/drivers/mmc/card/queue.h
index 64e66e0..96c440d 100644
--- a/drivers/mmc/card/queue.h
+++ b/drivers/mmc/card/queue.h
@@ -4,19 +4,24 @@
 struct request;
 struct task_struct;
 
+struct mmc_queue_req {
+	struct request		*req;
+	struct scatterlist	*sg;
+	char			*bounce_buf;
+	struct scatterlist	*bounce_sg;
+	unsigned int		bounce_sg_len;
+};
+
 struct mmc_queue {
 	struct mmc_card		*card;
 	struct task_struct	*thread;
 	struct semaphore	thread_sem;
 	unsigned int		flags;
-	struct request		*req;
 	int			(*issue_fn)(struct mmc_queue *, struct request *);
 	void			*data;
 	struct request_queue	*queue;
-	struct scatterlist	*sg;
-	char			*bounce_buf;
-	struct scatterlist	*bounce_sg;
-	unsigned int		bounce_sg_len;
+	struct mmc_queue_req	mqrq[1];
+	struct mmc_queue_req	*mqrq_cur;
 };
 
 extern int mmc_init_queue(struct mmc_queue *, struct mmc_card *, spinlock_t *);
@@ -24,8 +29,9 @@ extern void mmc_cleanup_queue(struct mmc_queue *);
 extern void mmc_queue_suspend(struct mmc_queue *);
 extern void mmc_queue_resume(struct mmc_queue *);
 
-extern unsigned int mmc_queue_map_sg(struct mmc_queue *);
-extern void mmc_queue_bounce_pre(struct mmc_queue *);
-extern void mmc_queue_bounce_post(struct mmc_queue *);
+extern unsigned int mmc_queue_map_sg(struct mmc_queue *,
+				     struct mmc_queue_req *);
+extern void mmc_queue_bounce_pre(struct mmc_queue_req *);
+extern void mmc_queue_bounce_post(struct mmc_queue_req *);
 
 #endif
-- 
1.7.1



* [PATCH 2/5] mmc: Add a block request prepare function
  2011-01-12 18:13 ` Per Forlin
@ 2011-01-12 18:14   ` Per Forlin
From: Per Forlin @ 2011-01-12 18:14 UTC (permalink / raw)
  To: linux-mmc, linux-arm-kernel, linux-kernel, dev; +Cc: Chris Ball, Per Forlin

Break out code from mmc_blk_issue_rw_rq() to create a
block request prepare function. This doesn't change
any functionality.

Signed-off-by: Per Forlin <per.forlin@linaro.org>
---
 drivers/mmc/card/block.c |  173 +++++++++++++++++++++++++---------------------
 1 files changed, 94 insertions(+), 79 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index be51bde..3f98b15 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -331,97 +331,112 @@ out:
 	return err ? 0 : 1;
 }
 
-static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
+static void mmc_blk_issue_rw_rq_prep(struct mmc_blk_request *brq,
+				     struct mmc_queue_req *mqrq,
+				     struct request *req,
+				     struct mmc_card *card,
+				     int disable_multi,
+				     struct mmc_queue *mq)
 {
-	struct mmc_blk_data *md = mq->data;
-	struct mmc_card *card = md->queue.card;
-	struct mmc_blk_request brq;
-	int ret = 1, disable_multi = 0;
+	u32 readcmd, writecmd;
 
-	mmc_claim_host(card->host);
 
-	do {
-		struct mmc_command cmd;
-		u32 readcmd, writecmd, status = 0;
-
-		memset(&brq, 0, sizeof(struct mmc_blk_request));
-		brq.mrq.cmd = &brq.cmd;
-		brq.mrq.data = &brq.data;
-
-		brq.cmd.arg = blk_rq_pos(req);
-		if (!mmc_card_blockaddr(card))
-			brq.cmd.arg <<= 9;
-		brq.cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_ADTC;
-		brq.data.blksz = 512;
-		brq.stop.opcode = MMC_STOP_TRANSMISSION;
-		brq.stop.arg = 0;
-		brq.stop.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
-		brq.data.blocks = blk_rq_sectors(req);
+	memset(brq, 0, sizeof(struct mmc_blk_request));
 
-		/*
-		 * The block layer doesn't support all sector count
-		 * restrictions, so we need to be prepared for too big
-		 * requests.
-		 */
-		if (brq.data.blocks > card->host->max_blk_count)
-			brq.data.blocks = card->host->max_blk_count;
+	brq->mrq.cmd = &brq->cmd;
+	brq->mrq.data = &brq->data;
 
-		/*
-		 * After a read error, we redo the request one sector at a time
-		 * in order to accurately determine which sectors can be read
-		 * successfully.
-		 */
-		if (disable_multi && brq.data.blocks > 1)
-			brq.data.blocks = 1;
-
-		if (brq.data.blocks > 1) {
-			/* SPI multiblock writes terminate using a special
-			 * token, not a STOP_TRANSMISSION request.
-			 */
-			if (!mmc_host_is_spi(card->host)
-					|| rq_data_dir(req) == READ)
-				brq.mrq.stop = &brq.stop;
-			readcmd = MMC_READ_MULTIPLE_BLOCK;
-			writecmd = MMC_WRITE_MULTIPLE_BLOCK;
-		} else {
-			brq.mrq.stop = NULL;
-			readcmd = MMC_READ_SINGLE_BLOCK;
-			writecmd = MMC_WRITE_BLOCK;
-		}
-		if (rq_data_dir(req) == READ) {
-			brq.cmd.opcode = readcmd;
-			brq.data.flags |= MMC_DATA_READ;
-		} else {
-			brq.cmd.opcode = writecmd;
-			brq.data.flags |= MMC_DATA_WRITE;
-		}
+	brq->cmd.arg = blk_rq_pos(req);
+	if (!mmc_card_blockaddr(card))
+		brq->cmd.arg <<= 9;
+	brq->cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_ADTC;
+	brq->data.blksz = 512;
+	brq->stop.opcode = MMC_STOP_TRANSMISSION;
+	brq->stop.arg = 0;
+	brq->stop.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
+	brq->data.blocks = blk_rq_sectors(req);
 
-		mmc_set_data_timeout(&brq.data, card);
+	/*
+	 * The block layer doesn't support all sector count
+	 * restrictions, so we need to be prepared for too big
+	 * requests.
+	 */
+	if (brq->data.blocks > card->host->max_blk_count)
+		brq->data.blocks = card->host->max_blk_count;
 
-		brq.data.sg = mq->mqrq_cur->sg;
-		brq.data.sg_len = mmc_queue_map_sg(mq, mq->mqrq_cur);
+	/*
+	 * After a read error, we redo the request one sector at a time
+	 * in order to accurately determine which sectors can be read
+	 * successfully.
+	 */
+	if (disable_multi && brq->data.blocks > 1)
+		brq->data.blocks = 1;
 
-		/*
-		 * Adjust the sg list so it is the same size as the
-		 * request.
+
+	if (brq->data.blocks > 1) {
+		/* SPI multiblock writes terminate using a special
+		 * token, not a STOP_TRANSMISSION request.
 		 */
-		if (brq.data.blocks != blk_rq_sectors(req)) {
-			int i, data_size = brq.data.blocks << 9;
-			struct scatterlist *sg;
-
-			for_each_sg(brq.data.sg, sg, brq.data.sg_len, i) {
-				data_size -= sg->length;
-				if (data_size <= 0) {
-					sg->length += data_size;
-					i++;
-					break;
-				}
+		if (!mmc_host_is_spi(card->host)
+		    || rq_data_dir(req) == READ)
+			brq->mrq.stop = &brq->stop;
+		readcmd = MMC_READ_MULTIPLE_BLOCK;
+		writecmd = MMC_WRITE_MULTIPLE_BLOCK;
+	} else {
+		brq->mrq.stop = NULL;
+		readcmd = MMC_READ_SINGLE_BLOCK;
+		writecmd = MMC_WRITE_BLOCK;
+	}
+	if (rq_data_dir(req) == READ) {
+		brq->cmd.opcode = readcmd;
+		brq->data.flags |= MMC_DATA_READ;
+	} else {
+		brq->cmd.opcode = writecmd;
+		brq->data.flags |= MMC_DATA_WRITE;
+	}
+
+	mmc_set_data_timeout(&brq->data, card);
+
+	brq->data.sg = mqrq->sg;
+	brq->data.sg_len = mmc_queue_map_sg(mq, mqrq);
+
+	/*
+	 * Adjust the sg list so it is the same size as the
+	 * request.
+	 */
+	if (brq->data.blocks != blk_rq_sectors(req)) {
+		int i, data_size = brq->data.blocks << 9;
+		struct scatterlist *sg;
+
+		for_each_sg(brq->data.sg, sg, brq->data.sg_len, i) {
+			data_size -= sg->length;
+			if (data_size <= 0) {
+				sg->length += data_size;
+				i++;
+				break;
 			}
-			brq.data.sg_len = i;
+			brq->data.sg_len = i;
 		}
+	}
+
+	mmc_queue_bounce_pre(mqrq);
+}
 
-		mmc_queue_bounce_pre(mq->mqrq_cur);
+static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
+{
+	struct mmc_blk_data *md = mq->data;
+	struct mmc_card *card = md->queue.card;
+	struct mmc_blk_request brq;
+	int ret = 1, disable_multi = 0;
+
+	mmc_claim_host(card->host);
+
+	do {
+		struct mmc_command cmd;
+		u32 status = 0;
 
+		mmc_blk_issue_rw_rq_prep(&brq, mq->mqrq_cur, req, card,
+					 disable_multi, mq);
 		mmc_wait_for_req(card->host, &brq.mrq);
 
 		mmc_queue_bounce_post(mq->mqrq_cur);
-- 
1.7.1



* [PATCH 3/5] mmc: Add a second mmc queue request member
  2011-01-12 18:13 ` Per Forlin
@ 2011-01-12 18:14   ` Per Forlin
From: Per Forlin @ 2011-01-12 18:14 UTC (permalink / raw)
  To: linux-mmc, linux-arm-kernel, linux-kernel, dev; +Cc: Chris Ball, Per Forlin

Add an additional mmc queue request instance to make way for
double buffering. One request may be active while the
other request is being prepared.

Signed-off-by: Per Forlin <per.forlin@linaro.org>
---
 drivers/mmc/card/queue.c |   44 ++++++++++++++++++++++++++++++++++++++++++--
 drivers/mmc/card/queue.h |    4 +++-
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 8a8d88b..30d4707 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -131,6 +131,7 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 	u64 limit = BLK_BOUNCE_HIGH;
 	int ret;
 	struct mmc_queue_req *mqrq_cur = &mq->mqrq[0];
+	struct mmc_queue_req *mqrq_prev = &mq->mqrq[1];
 
 	if (mmc_dev(host)->dma_mask && *mmc_dev(host)->dma_mask)
 		limit = *mmc_dev(host)->dma_mask;
@@ -141,7 +142,9 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 		return -ENOMEM;
 
 	memset(&mq->mqrq_cur, 0, sizeof(mq->mqrq_cur));
+	memset(&mq->mqrq_prev, 0, sizeof(mq->mqrq_prev));
 	mq->mqrq_cur = mqrq_cur;
+	mq->mqrq_prev = mqrq_prev;
 	mq->queue->queuedata = mq;
 
 	blk_queue_prep_rq(mq->queue, mmc_prep_request);
@@ -182,9 +185,17 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 					"allocate bounce cur buffer\n",
 					mmc_card_name(card));
 			}
+			mqrq_prev->bounce_buf = kmalloc(bouncesz, GFP_KERNEL);
+			if (!mqrq_prev->bounce_buf) {
+				printk(KERN_WARNING "%s: unable to "
+					"allocate bounce prev buffer\n",
+					mmc_card_name(card));
+				kfree(mqrq_cur->bounce_buf);
+				mqrq_cur->bounce_buf = NULL;
+			}
 		}
 
-		if (mqrq_cur->bounce_buf) {
+		if (mqrq_cur->bounce_buf && mqrq_prev->bounce_buf) {
 			blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_ANY);
 			blk_queue_max_hw_sectors(mq->queue, bouncesz / 512);
 			blk_queue_max_segments(mq->queue, bouncesz / 512);
@@ -199,11 +210,19 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 			if (ret)
 				goto cleanup_queue;
 
+			mqrq_prev->sg = mmc_alloc_sg(1, &ret);
+			if (ret)
+				goto cleanup_queue;
+
+			mqrq_prev->bounce_sg =
+				mmc_alloc_sg(bouncesz / 512, &ret);
+			if (ret)
+				goto cleanup_queue;
 		}
 	}
 #endif
 
-	if (!mqrq_cur->bounce_buf) {
+	if (!mqrq_cur->bounce_buf && !mqrq_prev->bounce_buf) {
 		blk_queue_bounce_limit(mq->queue, limit);
 		blk_queue_max_hw_sectors(mq->queue,
 			min(host->max_blk_count, host->max_req_size / 512));
@@ -214,6 +233,10 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 		if (ret)
 			goto cleanup_queue;
 
+
+		mqrq_prev->sg = mmc_alloc_sg(host->max_segs, &ret);
+		if (ret)
+			goto cleanup_queue;
 	}
 
 	sema_init(&mq->thread_sem, 1);
@@ -230,6 +253,8 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
  free_bounce_sg:
 	kfree(mqrq_cur->bounce_sg);
 	mqrq_cur->bounce_sg = NULL;
+	kfree(mqrq_prev->bounce_sg);
+	mqrq_prev->bounce_sg = NULL;
 
  cleanup_queue:
 	kfree(mqrq_cur->sg);
@@ -237,6 +262,11 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 	kfree(mqrq_cur->bounce_buf);
 	mqrq_cur->bounce_buf = NULL;
 
+	kfree(mqrq_prev->sg);
+	mqrq_prev->sg = NULL;
+	kfree(mqrq_prev->bounce_buf);
+	mqrq_prev->bounce_buf = NULL;
+
 	blk_cleanup_queue(mq->queue);
 	return ret;
 }
@@ -246,6 +276,7 @@ void mmc_cleanup_queue(struct mmc_queue *mq)
 	struct request_queue *q = mq->queue;
 	unsigned long flags;
 	struct mmc_queue_req *mqrq_cur = mq->mqrq_cur;
+	struct mmc_queue_req *mqrq_prev = mq->mqrq_prev;
 
 	/* Make sure the queue isn't suspended, as that will deadlock */
 	mmc_queue_resume(mq);
@@ -268,6 +299,15 @@ void mmc_cleanup_queue(struct mmc_queue *mq)
 	kfree(mqrq_cur->bounce_buf);
 	mqrq_cur->bounce_buf = NULL;
 
+	kfree(mqrq_prev->bounce_sg);
+	mqrq_prev->bounce_sg = NULL;
+
+	kfree(mqrq_prev->sg);
+	mqrq_prev->sg = NULL;
+
+	kfree(mqrq_prev->bounce_buf);
+	mqrq_prev->bounce_buf = NULL;
+
 	mq->card = NULL;
 }
 EXPORT_SYMBOL(mmc_cleanup_queue);
diff --git a/drivers/mmc/card/queue.h b/drivers/mmc/card/queue.h
index 96c440d..f65eb88 100644
--- a/drivers/mmc/card/queue.h
+++ b/drivers/mmc/card/queue.h
@@ -20,8 +20,10 @@ struct mmc_queue {
 	int			(*issue_fn)(struct mmc_queue *, struct request *);
 	void			*data;
 	struct request_queue	*queue;
-	struct mmc_queue_req	mqrq[1];
+
+	struct mmc_queue_req	mqrq[2];
 	struct mmc_queue_req	*mqrq_cur;
+	struct mmc_queue_req	*mqrq_prev;
 };
 
 extern int mmc_init_queue(struct mmc_queue *, struct mmc_card *, spinlock_t *);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/5] mmc: Store the mmc block request struct in mmc queue
  2011-01-12 18:13 ` Per Forlin
@ 2011-01-12 18:14   ` Per Forlin
  -1 siblings, 0 replies; 27+ messages in thread
From: Per Forlin @ 2011-01-12 18:14 UTC (permalink / raw)
  To: linux-mmc, linux-arm-kernel, linux-kernel, dev; +Cc: Chris Ball, Per Forlin

Move the mmc block request to the mmc queue struct in
order to make way for processing two brqs simultaneously.
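
As a rough illustration only (not part of the change itself), with the
block request stored per queue slot the issue code can reference two
brqs at once, one per slot:

	/* Illustration: each mmc_queue_req slot now owns a block request. */
	struct mmc_blk_request *brq_cur  = &mq->mqrq_cur->brq;   /* being prepared */
	struct mmc_blk_request *brq_prev = &mq->mqrq_prev->brq;  /* active on the host */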

Signed-off-by: Per Forlin <per.forlin@linaro.org>
---
 drivers/mmc/card/block.c |   68 +++++++++++++++++++++-------------------------
 drivers/mmc/card/queue.h |    9 +++++-
 2 files changed, 39 insertions(+), 38 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 3f98b15..028b2b8 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -165,13 +165,6 @@ static const struct block_device_operations mmc_bdops = {
 	.owner			= THIS_MODULE,
 };
 
-struct mmc_blk_request {
-	struct mmc_request	mrq;
-	struct mmc_command	cmd;
-	struct mmc_command	stop;
-	struct mmc_data		data;
-};
-
 static u32 mmc_sd_num_wr_blocks(struct mmc_card *card)
 {
 	int err;
@@ -422,11 +415,11 @@ static void mmc_blk_issue_rw_rq_prep(struct mmc_blk_request *brq,
 	mmc_queue_bounce_pre(mqrq);
 }
 
-static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
+static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 {
 	struct mmc_blk_data *md = mq->data;
 	struct mmc_card *card = md->queue.card;
-	struct mmc_blk_request brq;
+	struct mmc_blk_request *brqc = &mq->mqrq_cur->brq;
 	int ret = 1, disable_multi = 0;
 
 	mmc_claim_host(card->host);
@@ -435,9 +428,9 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 		struct mmc_command cmd;
 		u32 status = 0;
 
-		mmc_blk_issue_rw_rq_prep(&brq, mq->mqrq_cur, req, card,
+		mmc_blk_issue_rw_rq_prep(brqc, mq->mqrq_cur, rqc, card,
 					 disable_multi, mq);
-		mmc_wait_for_req(card->host, &brq.mrq);
+		mmc_wait_for_req(card->host, &brqc->mrq);
 
 		mmc_queue_bounce_post(mq->mqrq_cur);
 
@@ -446,43 +439,43 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 		 * until later as we need to wait for the card to leave
 		 * programming mode even when things go wrong.
 		 */
-		if (brq.cmd.error || brq.data.error || brq.stop.error) {
-			if (brq.data.blocks > 1 && rq_data_dir(req) == READ) {
+		if (brqc->cmd.error || brqc->data.error || brqc->stop.error) {
+			if (brqc->data.blocks > 1 && rq_data_dir(rqc) == READ) {
 				/* Redo read one sector at a time */
 				printk(KERN_WARNING "%s: retrying using single "
-				       "block read\n", req->rq_disk->disk_name);
+				       "block read\n", rqc->rq_disk->disk_name);
 				disable_multi = 1;
 				continue;
 			}
-			status = get_card_status(card, req);
+			status = get_card_status(card, rqc);
 		}
 
-		if (brq.cmd.error) {
+		if (brqc->cmd.error) {
 			printk(KERN_ERR "%s: error %d sending read/write "
 			       "command, response %#x, card status %#x\n",
-			       req->rq_disk->disk_name, brq.cmd.error,
-			       brq.cmd.resp[0], status);
+			       rqc->rq_disk->disk_name, brqc->cmd.error,
+			       brqc->cmd.resp[0], status);
 		}
 
-		if (brq.data.error) {
-			if (brq.data.error == -ETIMEDOUT && brq.mrq.stop)
+		if (brqc->data.error) {
+			if (brqc->data.error == -ETIMEDOUT && brqc->mrq.stop)
 				/* 'Stop' response contains card status */
-				status = brq.mrq.stop->resp[0];
+				status = brqc->mrq.stop->resp[0];
 			printk(KERN_ERR "%s: error %d transferring data,"
 			       " sector %u, nr %u, card status %#x\n",
-			       req->rq_disk->disk_name, brq.data.error,
-			       (unsigned)blk_rq_pos(req),
-			       (unsigned)blk_rq_sectors(req), status);
+			       rqc->rq_disk->disk_name, brqc->data.error,
+			       (unsigned)blk_rq_pos(rqc),
+			       (unsigned)blk_rq_sectors(rqc), status);
 		}
 
-		if (brq.stop.error) {
+		if (brqc->stop.error) {
 			printk(KERN_ERR "%s: error %d sending stop command, "
 			       "response %#x, card status %#x\n",
-			       req->rq_disk->disk_name, brq.stop.error,
-			       brq.stop.resp[0], status);
+			       rqc->rq_disk->disk_name, brqc->stop.error,
+			       brqc->stop.resp[0], status);
 		}
 
-		if (!mmc_host_is_spi(card->host) && rq_data_dir(req) != READ) {
+		if (!mmc_host_is_spi(card->host) && rq_data_dir(rqc) != READ) {
 			do {
 				int err;
 
@@ -492,7 +485,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 				err = mmc_wait_for_cmd(card->host, &cmd, 5);
 				if (err) {
 					printk(KERN_ERR "%s: error %d requesting status\n",
-					       req->rq_disk->disk_name, err);
+					       rqc->rq_disk->disk_name, err);
 					goto cmd_err;
 				}
 				/*
@@ -506,21 +499,22 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 #if 0
 			if (cmd.resp[0] & ~0x00000900)
 				printk(KERN_ERR "%s: status = %08x\n",
-				       req->rq_disk->disk_name, cmd.resp[0]);
+				       rqc->rq_disk->disk_name, cmd.resp[0]);
 			if (mmc_decode_status(cmd.resp))
 				goto cmd_err;
 #endif
 		}
 
-		if (brq.cmd.error || brq.stop.error || brq.data.error) {
-			if (rq_data_dir(req) == READ) {
+		if (brqc->cmd.error || brqc->stop.error || brqc->data.error) {
+			if (rq_data_dir(rqc) == READ) {
 				/*
 				 * After an error, we redo I/O one sector at a
 				 * time, so we only reach here after trying to
 				 * read a single sector.
 				 */
 				spin_lock_irq(&md->lock);
-				ret = __blk_end_request(req, -EIO, brq.data.blksz);
+				ret = __blk_end_request(rqc, -EIO,
+							brqc->data.blksz);
 				spin_unlock_irq(&md->lock);
 				continue;
 			}
@@ -531,7 +525,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 		 * A block was successfully transferred.
 		 */
 		spin_lock_irq(&md->lock);
-		ret = __blk_end_request(req, 0, brq.data.bytes_xfered);
+		ret = __blk_end_request(rqc, 0, brqc->data.bytes_xfered);
 		spin_unlock_irq(&md->lock);
 	} while (ret);
 
@@ -554,12 +548,12 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 		blocks = mmc_sd_num_wr_blocks(card);
 		if (blocks != (u32)-1) {
 			spin_lock_irq(&md->lock);
-			ret = __blk_end_request(req, 0, blocks << 9);
+			ret = __blk_end_request(rqc, 0, blocks << 9);
 			spin_unlock_irq(&md->lock);
 		}
 	} else {
 		spin_lock_irq(&md->lock);
-		ret = __blk_end_request(req, 0, brq.data.bytes_xfered);
+		ret = __blk_end_request(rqc, 0, brqc->data.bytes_xfered);
 		spin_unlock_irq(&md->lock);
 	}
 
@@ -567,7 +561,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 
 	spin_lock_irq(&md->lock);
 	while (ret)
-		ret = __blk_end_request(req, -EIO, blk_rq_cur_bytes(req));
+		ret = __blk_end_request(rqc, -EIO, blk_rq_cur_bytes(rqc));
 	spin_unlock_irq(&md->lock);
 
 	return 0;
diff --git a/drivers/mmc/card/queue.h b/drivers/mmc/card/queue.h
index f65eb88..bf3dee9 100644
--- a/drivers/mmc/card/queue.h
+++ b/drivers/mmc/card/queue.h
@@ -4,12 +4,20 @@
 struct request;
 struct task_struct;
 
+struct mmc_blk_request {
+	struct mmc_request	mrq;
+	struct mmc_command	cmd;
+	struct mmc_command	stop;
+	struct mmc_data		data;
+};
+
 struct mmc_queue_req {
 	struct request		*req;
 	struct scatterlist	*sg;
 	char			*bounce_buf;
 	struct scatterlist	*bounce_sg;
 	unsigned int		bounce_sg_len;
+	struct mmc_blk_request	brq;
 };
 
 struct mmc_queue {
@@ -20,7 +28,6 @@ struct mmc_queue {
 	int			(*issue_fn)(struct mmc_queue *, struct request *);
 	void			*data;
 	struct request_queue	*queue;
-
 	struct mmc_queue_req	mqrq[2];
 	struct mmc_queue_req	*mqrq_cur;
 	struct mmc_queue_req	*mqrq_prev;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/5] mmc: Add double buffering for mmc block requests
  2011-01-12 18:13 ` Per Forlin
@ 2011-01-12 18:14   ` Per Forlin
  -1 siblings, 0 replies; 27+ messages in thread
From: Per Forlin @ 2011-01-12 18:14 UTC (permalink / raw)
  To: linux-mmc, linux-arm-kernel, linux-kernel, dev; +Cc: Chris Ball, Per Forlin

Change mmc_blk_issue_rw_rq() to become asynchronous.
The execution flow looks like this:
The mmc-queue calls issue_rw_rq(), which sends the request
to the host and returns to the mmc-queue. The mmc-queue calls
issue_rw_rq() again with a new request. This new request is prepared
in issue_rw_rq(), which then waits for the active request to complete
before pushing the new request to the host. When the mmc-queue is
empty it calls issue_rw_rq() with req=NULL to finish off the active
request without starting a new one.

Signed-off-by: Per Forlin <per.forlin@linaro.org>
---
 drivers/mmc/card/block.c |  170 +++++++++++++++++++++++++++++++++++----------
 drivers/mmc/card/queue.c |    2 +-
 drivers/mmc/core/core.c  |   77 ++++++++++++++++++---
 include/linux/mmc/core.h |    7 ++-
 include/linux/mmc/host.h |    8 ++
 5 files changed, 214 insertions(+), 50 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 028b2b8..11e6e97 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -420,62 +420,98 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 	struct mmc_blk_data *md = mq->data;
 	struct mmc_card *card = md->queue.card;
 	struct mmc_blk_request *brqc = &mq->mqrq_cur->brq;
-	int ret = 1, disable_multi = 0;
+	struct mmc_blk_request *brqp = &mq->mqrq_prev->brq;
+	struct mmc_queue_req  *mqrqp = mq->mqrq_prev;
+	struct request *rqp = mqrqp->req;
+	int ret = 0;
+	int disable_multi = 0;
+	bool complete_transfer = true;
+
+	if (!rqc && !rqp) {
+		brqc->mrq.data = NULL;
+		brqp->mrq.data = NULL;
+		return 0;
+	}
 
-	mmc_claim_host(card->host);
+	/*
+	 * TODO: Find out if it is OK to only claim host for the first request.
+	 *       For the first request the previous request is NULL
+	 */
+	if (!rqp && rqc)
+		mmc_claim_host(card->host);
+
+	if (rqc) {
+		/* Prepare a new request */
+		mmc_blk_issue_rw_rq_prep(brqc, mq->mqrq_cur,
+					 rqc, card, 0, mq);
+		mmc_pre_req(card->host, &brqc->mrq, !rqp);
+	}
 
 	do {
 		struct mmc_command cmd;
 		u32 status = 0;
 
-		mmc_blk_issue_rw_rq_prep(brqc, mq->mqrq_cur, rqc, card,
-					 disable_multi, mq);
-		mmc_wait_for_req(card->host, &brqc->mrq);
-
-		mmc_queue_bounce_post(mq->mqrq_cur);
+		/* In case of error redo prepare and resend */
+		if (ret) {
+			mmc_blk_issue_rw_rq_prep(brqp, mqrqp, rqp, card,
+						 disable_multi, mq);
+			mmc_pre_req(card->host, &brqc->mrq, !rqp);
+			mmc_start_req(card->host, &brqp->mrq);
+		}
+		/*
+		 * If there is an ongoing request, indicated by rqp, wait for
+		 * it to finish before starting a new one.
+		 */
+		if (rqp) {
+			mmc_wait_for_req_done(&brqp->mrq);
+		} else {
+			/* start a new asynchronous request */
+			mmc_start_req(card->host, &brqc->mrq);
+			goto out;
+		}
 
 		/*
 		 * Check for errors here, but don't jump to cmd_err
 		 * until later as we need to wait for the card to leave
 		 * programming mode even when things go wrong.
 		 */
-		if (brqc->cmd.error || brqc->data.error || brqc->stop.error) {
-			if (brqc->data.blocks > 1 && rq_data_dir(rqc) == READ) {
+		if (brqp->cmd.error || brqp->data.error || brqp->stop.error) {
+			if (brqp->data.blocks > 1 && rq_data_dir(rqp) == READ) {
 				/* Redo read one sector at a time */
 				printk(KERN_WARNING "%s: retrying using single "
-				       "block read\n", rqc->rq_disk->disk_name);
+				       "block read\n", rqp->rq_disk->disk_name);
 				disable_multi = 1;
 				continue;
 			}
-			status = get_card_status(card, rqc);
+			status = get_card_status(card, rqp);
 		}
 
-		if (brqc->cmd.error) {
+		if (brqp->cmd.error) {
 			printk(KERN_ERR "%s: error %d sending read/write "
 			       "command, response %#x, card status %#x\n",
-			       rqc->rq_disk->disk_name, brqc->cmd.error,
-			       brqc->cmd.resp[0], status);
+			       rqp->rq_disk->disk_name, brqp->cmd.error,
+			       brqp->cmd.resp[0], status);
 		}
 
-		if (brqc->data.error) {
-			if (brqc->data.error == -ETIMEDOUT && brqc->mrq.stop)
+		if (brqp->data.error) {
+			if (brqp->data.error == -ETIMEDOUT && brqp->mrq.stop)
 				/* 'Stop' response contains card status */
-				status = brqc->mrq.stop->resp[0];
+				status = brqp->mrq.stop->resp[0];
 			printk(KERN_ERR "%s: error %d transferring data,"
 			       " sector %u, nr %u, card status %#x\n",
-			       rqc->rq_disk->disk_name, brqc->data.error,
-			       (unsigned)blk_rq_pos(rqc),
-			       (unsigned)blk_rq_sectors(rqc), status);
+			       rqp->rq_disk->disk_name, brqp->data.error,
+			       (unsigned)blk_rq_pos(rqp),
+			       (unsigned)blk_rq_sectors(rqp), status);
 		}
 
-		if (brqc->stop.error) {
+		if (brqp->stop.error) {
 			printk(KERN_ERR "%s: error %d sending stop command, "
 			       "response %#x, card status %#x\n",
-			       rqc->rq_disk->disk_name, brqc->stop.error,
-			       brqc->stop.resp[0], status);
+			       rqp->rq_disk->disk_name, brqp->stop.error,
+			       brqp->stop.resp[0], status);
 		}
 
-		if (!mmc_host_is_spi(card->host) && rq_data_dir(rqc) != READ) {
+		if (!mmc_host_is_spi(card->host) && rq_data_dir(rqp) != READ) {
 			do {
 				int err;
 
@@ -485,7 +521,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 				err = mmc_wait_for_cmd(card->host, &cmd, 5);
 				if (err) {
 					printk(KERN_ERR "%s: error %d requesting status\n",
-					       rqc->rq_disk->disk_name, err);
+					       rqp->rq_disk->disk_name, err);
 					goto cmd_err;
 				}
 				/*
@@ -499,22 +535,22 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 #if 0
 			if (cmd.resp[0] & ~0x00000900)
 				printk(KERN_ERR "%s: status = %08x\n",
-				       rqc->rq_disk->disk_name, cmd.resp[0]);
+				       rqp->rq_disk->disk_name, cmd.resp[0]);
 			if (mmc_decode_status(cmd.resp))
 				goto cmd_err;
 #endif
 		}
 
-		if (brqc->cmd.error || brqc->stop.error || brqc->data.error) {
-			if (rq_data_dir(rqc) == READ) {
+		if (brqp->cmd.error || brqp->stop.error || brqp->data.error) {
+			if (rq_data_dir(rqp) == READ) {
 				/*
 				 * After an error, we redo I/O one sector at a
 				 * time, so we only reach here after trying to
 				 * read a single sector.
 				 */
 				spin_lock_irq(&md->lock);
-				ret = __blk_end_request(rqc, -EIO,
-							brqc->data.blksz);
+				ret = __blk_end_request(rqp, -EIO,
+							brqp->data.blksz);
 				spin_unlock_irq(&md->lock);
 				continue;
 			}
@@ -524,14 +560,72 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 		/*
 		 * A block was successfully transferred.
 		 */
+		/*
+		 * TODO: Find out if it is safe to only check if
+		 * blk_rq_bytes(req) == data.bytes_xfered to make sure
+		 * the entire request is completed. If equal, defer
+		 * __blk_end_request until after the new request is started.
+		 */
+		if (blk_rq_bytes(rqp) != brqp->data.bytes_xfered ||
+		    !complete_transfer) {
+			complete_transfer = false;
+			mmc_post_req(card->host, &brqp->mrq);
+			mmc_queue_bounce_post(mqrqp);
+
+			spin_lock_irq(&md->lock);
+			ret = __blk_end_request(rqp, 0,
+						brqp->data.bytes_xfered);
+			spin_unlock_irq(&md->lock);
+		}
+	} while (ret);
+
+	/* Previous request is completed, start the new request if any */
+	if (rqc)
+		mmc_start_req(card->host, &brqc->mrq);
+
+	/* Post process the previous request while the new request is active */
+	if (complete_transfer) {
+		mmc_post_req(card->host, &brqp->mrq);
+		mmc_queue_bounce_post(mqrqp);
+
 		spin_lock_irq(&md->lock);
-		ret = __blk_end_request(rqc, 0, brqc->data.bytes_xfered);
+		ret = __blk_end_request(rqp, 0, brqp->data.bytes_xfered);
 		spin_unlock_irq(&md->lock);
-	} while (ret);
 
-	mmc_release_host(card->host);
+		/*
+		 * TODO: Make sure "ret" can never be true and remove the
+		 * if-statement and the code inside it.
+		 */
+		if (ret) {
+			/* This should never happen */
+			printk(KERN_ERR "[%s] BUG: rq_bytes %d xfered %d\n",
+			       __func__, blk_rq_bytes(rqp),
+			       brqp->data.bytes_xfered);
+			BUG();
+		}
+	}
+	/* 1 indicates one request has been completed */
+	ret = 1;
+ out:
+	/*
+	 * TODO: Find out if it is OK to only release host after the
+	 *       last request. For the last request the current request
+	 *        is NULL, which means no requests are pending.
+	 */
+	if (!rqc)
+		mmc_release_host(card->host);
+
+	do {
+		/* Current request becomes previous request and vice versa. */
+		struct mmc_queue_req  *tmp;
+		mq->mqrq_prev->brq.mrq.data = NULL;
+		mq->mqrq_prev->req = NULL;
+		tmp = mq->mqrq_prev;
+		mq->mqrq_prev = mq->mqrq_cur;
+		mq->mqrq_cur = tmp;
+	} while (0);
 
-	return 1;
+	return ret;
 
  cmd_err:
  	/*
@@ -548,12 +642,12 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 		blocks = mmc_sd_num_wr_blocks(card);
 		if (blocks != (u32)-1) {
 			spin_lock_irq(&md->lock);
-			ret = __blk_end_request(rqc, 0, blocks << 9);
+			ret = __blk_end_request(rqp, 0, blocks << 9);
 			spin_unlock_irq(&md->lock);
 		}
 	} else {
 		spin_lock_irq(&md->lock);
-		ret = __blk_end_request(rqc, 0, brqc->data.bytes_xfered);
+		ret = __blk_end_request(rqp, 0, brqp->data.bytes_xfered);
 		spin_unlock_irq(&md->lock);
 	}
 
@@ -561,7 +655,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 
 	spin_lock_irq(&md->lock);
 	while (ret)
-		ret = __blk_end_request(rqc, -EIO, blk_rq_cur_bytes(rqc));
+		ret = __blk_end_request(rqp, -EIO, blk_rq_cur_bytes(rqp));
 	spin_unlock_irq(&md->lock);
 
 	return 0;
@@ -569,7 +663,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 
 static int mmc_blk_issue_rq(struct mmc_queue *mq, struct request *req)
 {
-	if (req->cmd_flags & REQ_DISCARD) {
+	if (req && req->cmd_flags & REQ_DISCARD) {
 		if (req->cmd_flags & REQ_SECURE)
 			return mmc_blk_issue_secdiscard_rq(mq, req);
 		else
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 30d4707..30f8ae9 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -60,6 +60,7 @@ static int mmc_queue_thread(void *d)
 		mq->mqrq_cur->req = req;
 		spin_unlock_irq(q->queue_lock);
 
+		mq->issue_fn(mq, req);
 		if (!req) {
 			if (kthread_should_stop()) {
 				set_current_state(TASK_RUNNING);
@@ -72,7 +73,6 @@ static int mmc_queue_thread(void *d)
 		}
 		set_current_state(TASK_RUNNING);
 
-		mq->issue_fn(mq, req);
 	} while (1);
 	up(&mq->thread_sem);
 
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index a3a780f..63b4684 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -195,30 +195,87 @@ mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
 
 static void mmc_wait_done(struct mmc_request *mrq)
 {
-	complete(mrq->done_data);
+	complete(&mrq->completion);
 }
 
 /**
- *	mmc_wait_for_req - start a request and wait for completion
+ *	mmc_pre_req - Prepare for a new request
+ *	@host: MMC host to prepare command
+ *	@mrq: MMC request to prepare for
+ *	@host_is_idle: true if the host is not processing a request,
+ *		       false if a request may be active on the host.
+ *
+ *	mmc_pre_req() is called prior to mmc_start_req() to let the
+ *	host prepare for the new request. Preparation of a request may be
+ *	performed while another request is running on the host.
+ */
+void mmc_pre_req(struct mmc_host *host, struct mmc_request *mrq,
+		 bool host_is_idle)
+{
+	if (host->ops->pre_req)
+		host->ops->pre_req(host, mrq, host_is_idle);
+}
+EXPORT_SYMBOL(mmc_pre_req);
+
+/**
+ *	mmc_post_req - Post process a completed request
+ *	@host: MMC host to post process command
+ *	@mrq: MMC request to post process for
+ *
+ *	Let the host post process a completed request. Post processing of
+ *	a request may be performed while another request is running.
+ */
+void mmc_post_req(struct mmc_host *host, struct mmc_request *mrq)
+{
+	if (host->ops->post_req)
+		host->ops->post_req(host, mrq);
+}
+EXPORT_SYMBOL(mmc_post_req);
+
+/**
+ *	mmc_start_req - start a request
  *	@host: MMC host to start command
  *	@mrq: MMC request to start
  *
- *	Start a new MMC custom command request for a host, and wait
- *	for the command to complete. Does not attempt to parse the
- *	response.
+ *	Start a new MMC custom command request for a host.
+ *	Does not wait for the command to complete.
  */
-void mmc_wait_for_req(struct mmc_host *host, struct mmc_request *mrq)
+void mmc_start_req(struct mmc_host *host, struct mmc_request *mrq)
 {
-	DECLARE_COMPLETION_ONSTACK(complete);
-
-	mrq->done_data = &complete;
+	init_completion(&mrq->completion);
 	mrq->done = mmc_wait_done;
 
 	mmc_start_request(host, mrq);
+}
+EXPORT_SYMBOL(mmc_start_req);
 
-	wait_for_completion(&complete);
+/**
+ *	mmc_wait_for_req_done - wait for completion of request
+ *	@mrq: MMC request to wait for
+ *
+ *	Wait for the command to complete. Does not attempt to parse the
+ *	response.
+ */
+void mmc_wait_for_req_done(struct mmc_request *mrq)
+{
+	wait_for_completion(&mrq->completion);
 }
+EXPORT_SYMBOL(mmc_wait_for_req_done);
 
+/**
+ *	mmc_wait_for_req - start a request and wait for completion
+ *	@host: MMC host to start command
+ *	@mrq: MMC request to start
+ *
+ *	Start a new MMC custom command request for a host, and wait
+ *	for the command to complete. Does not attempt to parse the
+ *	response.
+ */
+void mmc_wait_for_req(struct mmc_host *host, struct mmc_request *mrq)
+{
+	mmc_start_req(host, mrq);
+	mmc_wait_for_req_done(mrq);
+}
 EXPORT_SYMBOL(mmc_wait_for_req);
 
 /**
diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
index 64e013f..da504f7 100644
--- a/include/linux/mmc/core.h
+++ b/include/linux/mmc/core.h
@@ -124,13 +124,18 @@ struct mmc_request {
 	struct mmc_data		*data;
 	struct mmc_command	*stop;
 
-	void			*done_data;	/* completion data */
+	struct completion	completion;
 	void			(*done)(struct mmc_request *);/* completion function */
 };
 
 struct mmc_host;
 struct mmc_card;
 
+extern void mmc_pre_req(struct mmc_host *host, struct mmc_request *mrq,
+			bool host_is_idle);
+extern void mmc_post_req(struct mmc_host *host, struct mmc_request *mrq);
+extern void mmc_start_req(struct mmc_host *host, struct mmc_request *mrq);
+extern void mmc_wait_for_req_done(struct mmc_request *mrq);
 extern void mmc_wait_for_req(struct mmc_host *, struct mmc_request *);
 extern int mmc_wait_for_cmd(struct mmc_host *, struct mmc_command *, int);
 extern int mmc_wait_for_app_cmd(struct mmc_host *, struct mmc_card *,
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index 30f6fad..b85463b 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -88,6 +88,14 @@ struct mmc_host_ops {
 	 */
 	int (*enable)(struct mmc_host *host);
 	int (*disable)(struct mmc_host *host, int lazy);
+	/*
+	 * It is optional for the host to implement pre_req and post_req in
+	 * order to support double buffering of requests (prepare one
+	 * request while another request is active).
+	 */
+	void	(*post_req)(struct mmc_host *host, struct mmc_request *req);
+	void	(*pre_req)(struct mmc_host *host, struct mmc_request *req,
+			   bool host_is_idle);
 	void	(*request)(struct mmc_host *host, struct mmc_request *req);
 	/*
 	 * Avoid calling these three functions too often or in a "fast path",
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5] mmc: add double buffering for mmc block requests
  2011-01-12 18:13 ` Per Forlin
@ 2011-01-12 18:24   ` Per Forlin
  -1 siblings, 0 replies; 27+ messages in thread
From: Per Forlin @ 2011-01-12 18:24 UTC (permalink / raw)
  To: linux-mmc, linux-arm-kernel, linux-kernel, linaro-dev
  Cc: Chris Ball, Per Forlin

I mistyped the linaro email in this patch series.

Sorry for the mess
/Per

On 12 January 2011 19:13, Per Forlin <per.forlin@linaro.org> wrote:
> Add support to prepare one MMC request while another is active on
> the host. This is done by making the issue_rw_rq() asynchronous.
> The increase in throughput is proportional to the time it takes to
> prepare a request and how fast the memory is. The faster the MMC/SD is
> the more significant the prepare request time becomes. Measurements on U5500
> and U8500 on eMMC shows significant performance gain for DMA on MMC for large
> reads. In the PIO case there is some gain in performance for large reads too.
> There seems to be no or small performance gain for write, don't have a good
> explanation for this yet.
>
> There are two optional hooks pre_req() and post_req() that the host driver
> may implement in order to improve double buffering. In the DMA case pre_req()
> may do dma_map_sg() and prepare the dma descriptor and post_req runs the
> dma_unmap_sg.
>
> The mmci host driver implementation for double buffering is not intended
> nor ready for mainline yet. It is only an example of how to implement
> pre_req() and post_req(). The reason for this is that the basic DMA support
> for MMCI is not complete yet. The mmci patches are sent in a separate patch
> series "[FYI 0/4] arm: mmci: example implementation of double buffering".
>
> Issues/Questions for issue_rw_rq() in block.c:
> * Is it safe to claim the host for the first MMC request and wait to release
>  it until the MMC queue is empty again? Or must the host be claimed and
>  released for every request?
> * Is it possible to predict the result from __blk_end_request().
>  If there are no errors for a completed MMC request and the
>  blk_rq_bytes(req) == data.bytes_xfered, will it be guaranteed that
>  __blk_end_request will return 0?
>
> Here follows the IOZone results for u8500 v1.1 on eMMC.
> The numbers for DMA are a bit to good here due to the fact that the
> CPU speed is decreased compared to u8500 v2. This makes the cache handling
> even more significant.
>
> Command line used: ./iozone -az -i0 -i1 -i2 -s 50m -I -f /iozone.tmp -e -R -+u
>
> Relative diff: VANILLA-MMC-PIO -> 2BUF-MMC-PIO
> cpu load is abs diff
>                                                        random  random
>        KB      reclen  write   rewrite read    reread  read    write
>        51200   4       +0%     +0%     +0%     +0%     +0%     +0%
>        cpu:            +0.1    -0.1    -0.5    -0.3    -0.1    -0.0
>
>        51200   8       +0%     +0%     +6%     +6%     +8%     +0%
>        cpu:            +0.1    -0.1    -0.3    -0.4    -0.8    +0.0
>
>        51200   16      +0%     -2%     +0%     +0%     -3%     +0%
>        cpu:            +0.0    -0.2    +0.0    +0.0    -0.2    +0.0
>
>        51200   32      +0%     +1%     +0%     +0%     +0%     +0%
>        cpu:            +0.1    +0.0    -0.3    +0.0    +0.0    +0.0
>
>        51200   64      +0%     +0%     +0%     +0%     +0%     +0%
>        cpu:            +0.1    +0.0    +0.0    +0.0    +0.0    +0.0
>
>        51200   128     +0%     +1%     +1%     +1%     +1%     +0%
>        cpu:            +0.0    +0.2    +0.1    -0.3    +0.4    +0.0
>
>        51200   256     +0%     +0%     +1%     +1%     +1%     +0%
>        cpu:            +0.0    -0.0    +0.1    +0.1    +0.1    +0.0
>
>        51200   512     +0%     +1%     +2%     +2%     +2%     +0%
>        cpu:            +0.1    +0.0    +0.2    +0.2    +0.2    +0.1
>
>        51200   1024    +0%     +2%     +2%     +2%     +3%     +0%
>        cpu:            +0.2    +0.1    +0.2    +0.5    -0.8    +0.0
>
>        51200   2048    +0%     +2%     +3%     +3%     +3%     +0%
>        cpu:            +0.0    -0.2    +0.4    +0.8    -0.5    +0.2
>
>        51200   4096    +0%     +1%     +3%     +3%     +3%     +1%
>        cpu:            +0.2    +0.1    +0.9    +0.9    +0.5    +0.1
>
>        51200   8192    +1%     +0%     +3%     +3%     +3%     +1%
>        cpu:            +0.2    +0.2    +1.3    +1.3    +1.0    +0.0
>
>        51200   16384   +0%     +1%     +3%     +3%     +3%     +1%
>        cpu:            +0.2    +0.1    +1.0    +1.3    +1.0    +0.5
>
> Relative diff: VANILLA-MMC-DMA -> 2BUF-MMC-MMCI-DMA
> cpu load is abs diff
>                                                        random  random
>        KB      reclen  write   rewrite read    reread  read    write
>        51200   4       +0%     -3%     +6%     +5%     +5%     +0%
>        cpu:            +0.0    -0.2    -0.6    -0.1    +0.3    +0.0
>
>        51200   8       +0%     +0%     +7%     +7%     +7%     +0%
>        cpu:            +0.0    +0.1    +0.8    +0.6    +0.9    +0.0
>
>        51200   16      +0%     +0%     +7%     +7%     +8%     +0%
>        cpu:            +0.0    -0.0    +0.7    +0.7    +0.8    +0.0
>
>        51200   32      +0%     +0%     +8%     +8%     +9%     +0%
>        cpu:            +0.0    +0.1    +0.7    +0.7    +0.3    +0.0
>
>        51200   64      +0%     +1%     +9%     +9%     +9%     +0%
>        cpu:            +0.0    +0.0    +0.8    +0.7    +0.8    +0.0
>
>        51200   128     +1%     +0%     +13%    +13%    +14%    +0%
>        cpu:            +0.2    +0.0    +1.0    +1.0    +1.1    +0.0
>
>        51200   256     +1%     +2%     +8%     +8%     +11%    +0%
>        cpu:            +0.0    +0.3    +0.0    +0.7    +1.5    +0.0
>
>        51200   512     +1%     +2%     +16%    +16%    +17%    +0%
>        cpu:            +0.2    +0.2    +2.2    +2.1    +2.2    +0.1
>
>        51200   1024    +1%     +2%     +20%    +20%    +20%    +1%
>        cpu:            +0.2    +0.1    +2.6    +1.9    +2.6    +0.0
>
>        51200   2048    +0%     +2%     +22%    +22%    +21%    +0%
>        cpu:            +0.0    +0.3    +2.3    +2.9    +2.1    -0.0
>
>        51200   4096    +1%     +2%     +23%    +23%    +23%    +1%
>        cpu:            +0.2    +0.1    +2.0    +3.2    +3.1    +0.0
>
>        51200   8192    +1%     +5%     +24%    +24%    +24%    +1%
>        cpu:            +1.4    -0.0    +4.2    +3.0    +2.8    +0.1
>
>        51200   16384   +1%     +3%     +24%    +24%    +24%    +2%
>        cpu:            +0.0    +0.3    +3.4    +3.8    +3.7    +0.1
>
> Here follows the IOZone results for u5500 on eMMC.
> These numbers for DMA are more as expected.
>
> Command line used: ./iozone -az -i0 -i1 -i2 -s 50m -I -f /iozone.tmp -e -R -+u
>
> Relative diff: VANILLA-MMC-DMA -> 2BUF-MMC-MMCI-DMA
> cpu load is abs diff
>                                                        random  random
>        KB      reclen  write   rewrite read    reread  read    write
>        51200   128     +1%     +1%     +10%    +9%     +10%    +0%
>        cpu:            +0.1    +0.0    +1.3    +0.1    +0.8    +0.1
>
>        51200   256     +2%     +2%     +7%     +7%     +9%     +0%
>        cpu:            +0.1    +0.4    +0.5    +0.6    +0.7    +0.0
>
>        51200   512     +2%     +2%     +12%    +12%    +12%    +1%
>        cpu:            +0.4    +0.6    +1.8    +2.4    +2.4    +0.2
>
>        51200   1024    +2%     +3%     +14%    +14%    +14%    +0%
>        cpu:            +0.3    +0.1    +2.1    +1.4    +1.4    +0.2
>
>        51200   2048    +3%     +3%     +16%    +16%    +16%    +1%
>        cpu:            +0.2    +0.2    +2.5    +1.8    +2.4    -0.2
>
>        51200   4096    +3%     +3%     +17%    +17%    +18%    +3%
>        cpu:            +0.1    -0.1    +2.7    +2.0    +2.7    -0.1
>
>        51200   8192    +3%     +3%     +18%    +18%    +18%    +3%
>        cpu:            -0.1    +0.2    +3.0    +2.3    +2.2    +0.2
>
>        51200   16384   +3%     +3%     +18%    +18%    +18%    +4%
>        cpu:            +0.2    +0.2    +2.8    +3.5    +2.4    -0.0
>
> Per Forlin (5):
>  mmc: add member in mmc queue struct to hold request data
>  mmc: Add a block request prepare function
>  mmc: Add a second mmc queue request member
>  mmc: Store the mmc block request struct in mmc queue
>  mmc: Add double buffering for mmc block requests
>
>  drivers/mmc/card/block.c |  337 ++++++++++++++++++++++++++++++----------------
>  drivers/mmc/card/queue.c |  171 +++++++++++++++---------
>  drivers/mmc/card/queue.h |   31 +++-
>  drivers/mmc/core/core.c  |   77 +++++++++--
>  include/linux/mmc/core.h |    7 +-
>  include/linux/mmc/host.h |    8 +
>  6 files changed, 432 insertions(+), 199 deletions(-)
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 0/5] mmc: add double buffering for mmc block requests
@ 2011-01-12 18:24   ` Per Forlin
  0 siblings, 0 replies; 27+ messages in thread
From: Per Forlin @ 2011-01-12 18:24 UTC (permalink / raw)
  To: linux-arm-kernel

I mistyped the Linaro email address in this patch series.

Sorry for the mess.
/Per

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5] mmc: add double buffering for mmc block requests
  2011-01-12 18:13 ` Per Forlin
@ 2011-01-18  2:35   ` Jaehoon Chung
  -1 siblings, 0 replies; 27+ messages in thread
From: Jaehoon Chung @ 2011-01-18  2:35 UTC (permalink / raw)
  To: Per Forlin
  Cc: linux-mmc, linux-arm-kernel, linux-kernel, dev, Chris Ball,
	Kyungmin Park

Hi Per..

This is an interesting approach, so
we want to test your double buffering in our environment (Samsung SoC).

Did you test with SDHCI?
If you tested with SDHCI, I would like to know how much it improves the performance.

Thanks,
Jaehoon Chung

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5] mmc: add double buffering for mmc block requests
  2011-01-18  2:35   ` Jaehoon Chung
@ 2011-01-18  8:12     ` Per Forlin
  -1 siblings, 0 replies; 27+ messages in thread
From: Per Forlin @ 2011-01-18  8:12 UTC (permalink / raw)
  To: Jaehoon Chung
  Cc: linux-mmc, linux-arm-kernel, linux-kernel, linaro-dev,
	Chris Ball, Kyungmin Park

On 18 January 2011 03:35, Jaehoon Chung <jh80.chung@samsung.com> wrote:
> Hi Per..
>
> it is interesting approach..so
> we want to test your double buffering in our environment(Samsung SoC).
>
> Did you test with SDHCI?
So far I have only tested with the mmci driver on the U5500 and U8500
boards. I will happily test on a different board if I can get hold of one.

> If you tested with SDHCI, i want to know how much increase the performance.
>
> Thanks,
> Jaehoon Chung

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5] mmc: add double buffering for mmc block requests
@ 2011-01-28  8:28       ` Per Forlin
  0 siblings, 0 replies; 27+ messages in thread
From: Per Forlin @ 2011-01-28  8:28 UTC (permalink / raw)
  To: Jaehoon Chung
  Cc: linux-mmc, linux-arm-kernel, linux-kernel, linaro-dev,
	Chris Ball, Kyungmin Park

Hi Jaehoon,

Have you had a chance to test the patches on the Samsung SoC? I can
sketch an implementation for SDHCI if that helps. Unfortunately I don't
have any hardware to test the SDHCI driver on, unless there is support
in QEMU.
I really think I need to test this on more hardware in order to get
more results for the patch set, and hopefully more attention as well.
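
For reference, this is roughly the shape I have in mind for such hooks.
It is only an illustrative sketch of the pre_req()/post_req() pattern
described in the cover letter (map the sg list and prepare the descriptor
ahead of time, unmap it afterwards); the hook signatures, the example_*
names and the driver-state handling are assumptions, not the actual
patchset API:

static void example_pre_req(struct mmc_host *mmc, struct mmc_request *mrq)
{
	struct mmc_data *data = mrq->data;
	int sg_count;

	if (!data)
		return;

	/* Map the sg list and build the controller descriptor while the
	 * previous request is still in flight on the host. */
	sg_count = dma_map_sg(mmc_dev(mmc), data->sg, data->sg_len,
			      data->flags & MMC_DATA_WRITE ?
			      DMA_TO_DEVICE : DMA_FROM_DEVICE);
	/* stash sg_count and the prepared descriptor in driver state */
}

static void example_post_req(struct mmc_host *mmc, struct mmc_request *mrq)
{
	struct mmc_data *data = mrq->data;

	if (!data)
		return;

	/* Tear down the mapping once the transfer has completed. */
	dma_unmap_sg(mmc_dev(mmc), data->sg, data->sg_len,
		     data->flags & MMC_DATA_WRITE ?
		     DMA_TO_DEVICE : DMA_FROM_DEVICE);
}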

BR
Per

On 18 January 2011 09:12, Per Forlin <per.forlin@linaro.org> wrote:
> On 18 January 2011 03:35, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>> Hi Per..
>>
>> it is interesting approach..so
>> we want to test your double buffering in our environment(Samsung SoC).
>>
>> Did you test with SDHCI?
> So far I have only tested on mmci for board u5500 and u8500. I happily
> test on a different board if I can get hold of one.
>
>> If you tested with SDHCI, i want to know how much increase the performance.
>>
>> Thanks,
>> Jaehoon Chung
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5] mmc: add double buffering for mmc block requests
  2011-01-28  8:28       ` Per Forlin
@ 2011-01-30  8:23         ` Jaehoon Chung
  -1 siblings, 0 replies; 27+ messages in thread
From: Jaehoon Chung @ 2011-01-30  8:23 UTC (permalink / raw)
  To: Per Forlin
  Cc: Jaehoon Chung, linux-mmc, linux-arm-kernel, linux-kernel,
	linaro-dev, Chris Ball, Kyungmin Park

Hi Per.

If you sketch an implementation for SDHCI, I can test the patches on the Samsung SoC.
Your help is much appreciated :). I want to know how much the performance improves
after applying your patch.

Regards,
Jaehoon Chung

Per Forlin wrote:
> Hi Jaehoon,
> 
> Have you had the chance to test the patches on the Samsung Soc. I can
> sketch a implementation for sdhci if that helps? Unfortunately I don't
> have any hardware to test the SDHCI driver unless there is support in
> QEMU?
> I really think I need to test this on more hardware in order to get
> more results for the patchset and hopefully more attention as well.
> 
> BR
> Per
> 
> On 18 January 2011 09:12, Per Forlin <per.forlin@linaro.org> wrote:
>> On 18 January 2011 03:35, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>>> Hi Per..
>>>
>>> it is interesting approach..so
>>> we want to test your double buffering in our environment(Samsung SoC).
>>>
>>> Did you test with SDHCI?
>> So far I have only tested on mmci for board u5500 and u8500. I happily
>> test on a different board if I can get hold of one.
>>
>>> If you tested with SDHCI, i want to know how much increase the performance.
>>>
>>> Thanks,
>>> Jaehoon Chung
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5] mmc: add double buffering for mmc block requests
  2011-01-12 18:13 ` Per Forlin
@ 2011-02-05 17:02   ` Russell King - ARM Linux
  -1 siblings, 0 replies; 27+ messages in thread
From: Russell King - ARM Linux @ 2011-02-05 17:02 UTC (permalink / raw)
  To: Per Forlin, Catalin Marinas
  Cc: linux-mmc, linux-arm-kernel, linux-kernel, dev, Chris Ball

On Wed, Jan 12, 2011 at 07:13:58PM +0100, Per Forlin wrote:
> Add support to prepare one MMC request while another is active on
> the host. This is done by making the issue_rw_rq() asynchronous.
> The increase in throughput is proportional to the time it takes to
> prepare a request and how fast the memory is. The faster the MMC/SD is
> the more significant the prepare request time becomes. Measurements on U5500
> and U8500 on eMMC shows significant performance gain for DMA on MMC for large
> reads. In the PIO case there is some gain in performance for large reads too.
> There seems to be no or small performance gain for write, don't have a good
> explanation for this yet.

It might be worth seeing what effect the following patch has.  This
moves the dsb out of the cache operations into a separate function,
so we only do one dsb per DMA mapping/unmapping operation.  That's
particularly significant for the scatter-gather code.

I don't remember why this was dropped as a candidate for merging -
could that be because the dsb needs to come before the outer cache
maintenance?  Adding Catalin for comment on that.
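
To make the intent concrete, a rough C-level sketch of what dma_map_sg()
ends up doing once the barrier is pulled out of the per-page maintenance
(error handling and the debug hooks are omitted here; the real change is
the dma-mapping.c hunk below):

int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
	       enum dma_data_direction dir)
{
	struct scatterlist *s;
	int i;

	for_each_sg(sg, s, nents, i)
		s->dma_address = __dma_map_page(dev, sg_page(s), s->offset,
						s->length, dir);

	/* One dsb/write buffer drain for the whole scatterlist instead
	 * of one per cache maintenance call. */
	__dma_barrier(dir);

	return nents;
}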

 arch/arm/include/asm/cacheflush.h  |    4 ++++
 arch/arm/include/asm/dma-mapping.h |    8 ++++++++
 arch/arm/mm/cache-fa.S             |   13 +++++++------
 arch/arm/mm/cache-v3.S             |    3 +++
 arch/arm/mm/cache-v4.S             |    3 +++
 arch/arm/mm/cache-v4wb.S           |    9 +++++++--
 arch/arm/mm/cache-v4wt.S           |    3 +++
 arch/arm/mm/cache-v6.S             |   13 +++++++------
 arch/arm/mm/cache-v7.S             |    9 ++++++---
 arch/arm/mm/dma-mapping.c          |   12 ++++++++++++
 arch/arm/mm/proc-arm1020e.S        |   10 +++++++---
 arch/arm/mm/proc-arm1022.S         |   10 +++++++---
 arch/arm/mm/proc-arm1026.S         |   10 +++++++---
 arch/arm/mm/proc-arm920.S          |   10 +++++++---
 arch/arm/mm/proc-arm922.S          |   10 +++++++---
 arch/arm/mm/proc-arm925.S          |   10 +++++++---
 arch/arm/mm/proc-arm926.S          |   10 +++++++---
 arch/arm/mm/proc-arm940.S          |   10 +++++++---
 arch/arm/mm/proc-arm946.S          |   10 +++++++---
 arch/arm/mm/proc-feroceon.S        |   13 ++++++++-----
 arch/arm/mm/proc-mohawk.S          |   10 +++++++---
 arch/arm/mm/proc-xsc3.S            |   10 +++++++---
 arch/arm/mm/proc-xscale.S          |   10 +++++++---
 23 files changed, 152 insertions(+), 58 deletions(-)

diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index e290885..5928e78 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -223,6 +223,7 @@ struct cpu_cache_fns {
 
 	void (*dma_map_area)(const void *, size_t, int);
 	void (*dma_unmap_area)(const void *, size_t, int);
+	void (*dma_barrier)(void);
 
 	void (*dma_flush_range)(const void *, const void *);
 };
@@ -250,6 +251,7 @@ extern struct cpu_cache_fns cpu_cache;
  */
 #define dmac_map_area			cpu_cache.dma_map_area
 #define dmac_unmap_area		cpu_cache.dma_unmap_area
+#define dmac_barrier			cpu_cache.dma_barrier
 #define dmac_flush_range		cpu_cache.dma_flush_range
 
 #else
@@ -278,10 +280,12 @@ extern void __cpuc_flush_dcache_area(void *, size_t);
  */
 #define dmac_map_area			__glue(_CACHE,_dma_map_area)
 #define dmac_unmap_area		__glue(_CACHE,_dma_unmap_area)
+#define dmac_barrier			__glue(_CACHE,_dma_barrier)
 #define dmac_flush_range		__glue(_CACHE,_dma_flush_range)
 
 extern void dmac_map_area(const void *, size_t, int);
 extern void dmac_unmap_area(const void *, size_t, int);
+extern void dmac_barrier(void);
 extern void dmac_flush_range(const void *, const void *);
 
 #endif
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 256ee1c..1371db7 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -115,6 +115,8 @@ static inline void __dma_page_dev_to_cpu(struct page *page, unsigned long off,
 		___dma_page_dev_to_cpu(page, off, size, dir);
 }
 
+extern void __dma_barrier(enum dma_data_direction);
+
 /*
  * Return whether the given device DMA address mask can be supported
  * properly.  For example, if your device can only drive the low 24-bits
@@ -378,6 +380,7 @@ static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
 	BUG_ON(!valid_dma_direction(dir));
 
 	addr = __dma_map_single(dev, cpu_addr, size, dir);
+	__dma_barrier(dir);
 	debug_dma_map_page(dev, virt_to_page(cpu_addr),
 			(unsigned long)cpu_addr & ~PAGE_MASK, size,
 			dir, addr, true);
@@ -407,6 +410,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
 	BUG_ON(!valid_dma_direction(dir));
 
 	addr = __dma_map_page(dev, page, offset, size, dir);
+	__dma_barrier(dir);
 	debug_dma_map_page(dev, page, offset, size, dir, addr, false);
 
 	return addr;
@@ -431,6 +435,7 @@ static inline void dma_unmap_single(struct device *dev, dma_addr_t handle,
 {
 	debug_dma_unmap_page(dev, handle, size, dir, true);
 	__dma_unmap_single(dev, handle, size, dir);
+	__dma_barrier(dir);
 }
 
 /**
@@ -452,6 +457,7 @@ static inline void dma_unmap_page(struct device *dev, dma_addr_t handle,
 {
 	debug_dma_unmap_page(dev, handle, size, dir, false);
 	__dma_unmap_page(dev, handle, size, dir);
+	__dma_barrier(dir);
 }
 
 /**
@@ -484,6 +490,7 @@ static inline void dma_sync_single_range_for_cpu(struct device *dev,
 		return;
 
 	__dma_single_dev_to_cpu(dma_to_virt(dev, handle) + offset, size, dir);
+	__dma_barrier(dir);
 }
 
 static inline void dma_sync_single_range_for_device(struct device *dev,
@@ -498,6 +505,7 @@ static inline void dma_sync_single_range_for_device(struct device *dev,
 		return;
 
 	__dma_single_cpu_to_dev(dma_to_virt(dev, handle) + offset, size, dir);
+	__dma_barrier(dir);
 }
 
 static inline void dma_sync_single_for_cpu(struct device *dev,
diff --git a/arch/arm/mm/cache-fa.S b/arch/arm/mm/cache-fa.S
index 7148e53..cdcfae2 100644
--- a/arch/arm/mm/cache-fa.S
+++ b/arch/arm/mm/cache-fa.S
@@ -179,8 +179,6 @@ fa_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -197,8 +195,6 @@ fa_dma_clean_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mov	r0, #0	
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -212,8 +208,6 @@ ENTRY(fa_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mov	r0, #0	
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -240,6 +234,12 @@ ENTRY(fa_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(fa_dma_unmap_area)
 
+ENTRY(fa_dma_barrier)
+	mov	r0, #0	
+	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
+	mov	pc, lr
+ENDPROC(fa_dma_barrier)
+
 	__INITDATA
 
 	.type	fa_cache_fns, #object
@@ -253,5 +253,6 @@ ENTRY(fa_cache_fns)
 	.long	fa_flush_kern_dcache_area
 	.long	fa_dma_map_area
 	.long	fa_dma_unmap_area
+	.long	fa_dma_barrier
 	.long	fa_dma_flush_range
 	.size	fa_cache_fns, . - fa_cache_fns
diff --git a/arch/arm/mm/cache-v3.S b/arch/arm/mm/cache-v3.S
index c2ff3c5..df34458 100644
--- a/arch/arm/mm/cache-v3.S
+++ b/arch/arm/mm/cache-v3.S
@@ -123,9 +123,11 @@ ENTRY(v3_dma_unmap_area)
  *	- dir	- DMA direction
  */
 ENTRY(v3_dma_map_area)
+ENTRY(v3_dma_barrier)
 	mov	pc, lr
 ENDPROC(v3_dma_unmap_area)
 ENDPROC(v3_dma_map_area)
+ENDPROC(v3_dma_barrier)
 
 	__INITDATA
 
@@ -140,5 +142,6 @@ ENTRY(v3_cache_fns)
 	.long	v3_flush_kern_dcache_area
 	.long	v3_dma_map_area
 	.long	v3_dma_unmap_area
+	.long	v3_dma_barrier
 	.long	v3_dma_flush_range
 	.size	v3_cache_fns, . - v3_cache_fns
diff --git a/arch/arm/mm/cache-v4.S b/arch/arm/mm/cache-v4.S
index 4810f7e..20260b1 100644
--- a/arch/arm/mm/cache-v4.S
+++ b/arch/arm/mm/cache-v4.S
@@ -135,9 +135,11 @@ ENTRY(v4_dma_unmap_area)
  *	- dir	- DMA direction
  */
 ENTRY(v4_dma_map_area)
+ENTRY(v4_dma_barrier)
 	mov	pc, lr
 ENDPROC(v4_dma_unmap_area)
 ENDPROC(v4_dma_map_area)
+ENDPROC(v4_dma_barrier)
 
 	__INITDATA
 
@@ -152,5 +154,6 @@ ENTRY(v4_cache_fns)
 	.long	v4_flush_kern_dcache_area
 	.long	v4_dma_map_area
 	.long	v4_dma_unmap_area
+	.long	v4_dma_barrier
 	.long	v4_dma_flush_range
 	.size	v4_cache_fns, . - v4_cache_fns
diff --git a/arch/arm/mm/cache-v4wb.S b/arch/arm/mm/cache-v4wb.S
index df8368a..9c9c875 100644
--- a/arch/arm/mm/cache-v4wb.S
+++ b/arch/arm/mm/cache-v4wb.S
@@ -194,7 +194,6 @@ v4wb_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -211,7 +210,6 @@ v4wb_dma_clean_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -251,6 +249,12 @@ ENTRY(v4wb_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(v4wb_dma_unmap_area)
 
+ENTRY(v4wb_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
+	mov	pc, lr
+ENDPROC(v4wb_dma_barrier)
+
 	__INITDATA
 
 	.type	v4wb_cache_fns, #object
@@ -264,5 +268,6 @@ ENTRY(v4wb_cache_fns)
 	.long	v4wb_flush_kern_dcache_area
 	.long	v4wb_dma_map_area
 	.long	v4wb_dma_unmap_area
+	.long	v4wb_dma_barrier
 	.long	v4wb_dma_flush_range
 	.size	v4wb_cache_fns, . - v4wb_cache_fns
diff --git a/arch/arm/mm/cache-v4wt.S b/arch/arm/mm/cache-v4wt.S
index 45c7031..223eea4 100644
--- a/arch/arm/mm/cache-v4wt.S
+++ b/arch/arm/mm/cache-v4wt.S
@@ -191,9 +191,11 @@ ENTRY(v4wt_dma_unmap_area)
  *	- dir	- DMA direction
  */
 ENTRY(v4wt_dma_map_area)
+ENTRY(v4wt_dma_barrier)
 	mov	pc, lr
 ENDPROC(v4wt_dma_unmap_area)
 ENDPROC(v4wt_dma_map_area)
+ENDPROC(v4wt_dma_barrier)
 
 	__INITDATA
 
@@ -208,5 +210,6 @@ ENTRY(v4wt_cache_fns)
 	.long	v4wt_flush_kern_dcache_area
 	.long	v4wt_dma_map_area
 	.long	v4wt_dma_unmap_area
+	.long	v4wt_dma_barrier
 	.long	v4wt_dma_flush_range
 	.size	v4wt_cache_fns, . - v4wt_cache_fns
diff --git a/arch/arm/mm/cache-v6.S b/arch/arm/mm/cache-v6.S
index 9d89c67..b294854 100644
--- a/arch/arm/mm/cache-v6.S
+++ b/arch/arm/mm/cache-v6.S
@@ -238,8 +238,6 @@ v6_dma_inv_range:
 	strlo	r2, [r0]			@ write for ownership
 #endif
 	blo	1b
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -261,8 +259,6 @@ v6_dma_clean_range:
 	add	r0, r0, #D_CACHE_LINE_SIZE
 	cmp	r0, r1
 	blo	1b
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -289,8 +285,6 @@ ENTRY(v6_dma_flush_range)
 	strlob	r2, [r0]			@ write for ownership
 #endif
 	blo	1b
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -327,6 +321,12 @@ ENTRY(v6_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(v6_dma_unmap_area)
 
+ENTRY(v6_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
+	mov	pc, lr
+ENDPROC(v6_dma_barrier)
+
 	__INITDATA
 
 	.type	v6_cache_fns, #object
@@ -340,5 +340,6 @@ ENTRY(v6_cache_fns)
 	.long	v6_flush_kern_dcache_area
 	.long	v6_dma_map_area
 	.long	v6_dma_unmap_area
+	.long	v6_dma_barrier
 	.long	v6_dma_flush_range
 	.size	v6_cache_fns, . - v6_cache_fns
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index bcd64f2..d89d55a 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -255,7 +255,6 @@ v7_dma_inv_range:
 	add	r0, r0, r2
 	cmp	r0, r1
 	blo	1b
-	dsb
 	mov	pc, lr
 ENDPROC(v7_dma_inv_range)
 
@@ -273,7 +272,6 @@ v7_dma_clean_range:
 	add	r0, r0, r2
 	cmp	r0, r1
 	blo	1b
-	dsb
 	mov	pc, lr
 ENDPROC(v7_dma_clean_range)
 
@@ -291,7 +289,6 @@ ENTRY(v7_dma_flush_range)
 	add	r0, r0, r2
 	cmp	r0, r1
 	blo	1b
-	dsb
 	mov	pc, lr
 ENDPROC(v7_dma_flush_range)
 
@@ -321,6 +318,11 @@ ENTRY(v7_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(v7_dma_unmap_area)
 
+ENTRY(v7_dma_barrier)
+	dsb
+	mov	pc, lr
+ENDPROC(v7_dma_barrier)
+
 	__INITDATA
 
 	.type	v7_cache_fns, #object
@@ -334,5 +336,6 @@ ENTRY(v7_cache_fns)
 	.long	v7_flush_kern_dcache_area
 	.long	v7_dma_map_area
 	.long	v7_dma_unmap_area
+	.long	v7_dma_barrier
 	.long	v7_dma_flush_range
 	.size	v7_cache_fns, . - v7_cache_fns
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 64daef2..d807f38 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -97,6 +97,7 @@ static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gf
 	memset(ptr, 0, size);
 	dmac_flush_range(ptr, ptr + size);
 	outer_flush_range(__pa(ptr), __pa(ptr) + size);
+	dmac_barrier();
 
 	return page;
 }
@@ -542,6 +543,12 @@ void ___dma_page_dev_to_cpu(struct page *page, unsigned long off,
 }
 EXPORT_SYMBOL(___dma_page_dev_to_cpu);
 
+void __dma_barrier(enum dma_data_direction dir)
+{
+	dmac_barrier();
+}
+EXPORT_SYMBOL(__dma_barrier);
+
 /**
  * dma_map_sg - map a set of SG buffers for streaming mode DMA
  * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
@@ -572,6 +579,7 @@ int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 		if (dma_mapping_error(dev, s->dma_address))
 			goto bad_mapping;
 	}
+	__dma_barrier(dir);
 	debug_dma_map_sg(dev, sg, nents, nents, dir);
 	return nents;
 
@@ -602,6 +610,8 @@ void dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
 
 	for_each_sg(sg, s, nents, i)
 		__dma_unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir);
+
+	__dma_barrier(dir);
 }
 EXPORT_SYMBOL(dma_unmap_sg);
 
@@ -627,6 +637,7 @@ void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
 				      s->length, dir);
 	}
 
+	__dma_barrier(dir);
 	debug_dma_sync_sg_for_cpu(dev, sg, nents, dir);
 }
 EXPORT_SYMBOL(dma_sync_sg_for_cpu);
@@ -653,6 +664,7 @@ void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
 				      s->length, dir);
 	}
 
+	__dma_barrier(dir);
	debug_dma_sync_sg_for_device(dev, sg, nents, dir);
 }
 EXPORT_SYMBOL(dma_sync_sg_for_device);
diff --git a/arch/arm/mm/proc-arm1020e.S b/arch/arm/mm/proc-arm1020e.S
index d278298..fea33c9 100644
--- a/arch/arm/mm/proc-arm1020e.S
+++ b/arch/arm/mm/proc-arm1020e.S
@@ -281,7 +281,6 @@ arm1020e_dma_inv_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -303,7 +302,6 @@ arm1020e_dma_clean_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -323,7 +321,6 @@ ENTRY(arm1020e_dma_flush_range)
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -350,6 +347,12 @@ ENTRY(arm1020e_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm1020e_dma_unmap_area)
 
+ENTRY(arm1020e_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm1020e_dma_barrier)
+
 ENTRY(arm1020e_cache_fns)
 	.long	arm1020e_flush_icache_all
 	.long	arm1020e_flush_kern_cache_all
@@ -360,6 +363,7 @@ ENTRY(arm1020e_cache_fns)
 	.long	arm1020e_flush_kern_dcache_area
 	.long	arm1020e_dma_map_area
 	.long	arm1020e_dma_unmap_area
+	.long	arm1020e_dma_barrier
 	.long	arm1020e_dma_flush_range
 
 	.align	5
diff --git a/arch/arm/mm/proc-arm1022.S b/arch/arm/mm/proc-arm1022.S
index ce13e4a..ba1a7df 100644
--- a/arch/arm/mm/proc-arm1022.S
+++ b/arch/arm/mm/proc-arm1022.S
@@ -270,7 +270,6 @@ arm1022_dma_inv_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -292,7 +291,6 @@ arm1022_dma_clean_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -312,7 +310,6 @@ ENTRY(arm1022_dma_flush_range)
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -339,6 +336,12 @@ ENTRY(arm1022_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm1022_dma_unmap_area)
 
+ENTRY(arm1022_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm1022_dma_barrier)
+
 ENTRY(arm1022_cache_fns)
 	.long	arm1022_flush_icache_all
 	.long	arm1022_flush_kern_cache_all
@@ -349,6 +352,7 @@ ENTRY(arm1022_cache_fns)
 	.long	arm1022_flush_kern_dcache_area
 	.long	arm1022_dma_map_area
 	.long	arm1022_dma_unmap_area
+	.long	arm1022_dma_barrier
 	.long	arm1022_dma_flush_range
 
 	.align	5
diff --git a/arch/arm/mm/proc-arm1026.S b/arch/arm/mm/proc-arm1026.S
index 636672a..de648f1 100644
--- a/arch/arm/mm/proc-arm1026.S
+++ b/arch/arm/mm/proc-arm1026.S
@@ -264,7 +264,6 @@ arm1026_dma_inv_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -286,7 +285,6 @@ arm1026_dma_clean_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -306,7 +304,6 @@ ENTRY(arm1026_dma_flush_range)
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -333,6 +330,12 @@ ENTRY(arm1026_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm1026_dma_unmap_area)
 
+ENTRY(arm1026_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm1026_dma_barrier)
+
 ENTRY(arm1026_cache_fns)
 	.long	arm1026_flush_icache_all
 	.long	arm1026_flush_kern_cache_all
@@ -343,6 +346,7 @@ ENTRY(arm1026_cache_fns)
 	.long	arm1026_flush_kern_dcache_area
 	.long	arm1026_dma_map_area
 	.long	arm1026_dma_unmap_area
+	.long	arm1026_dma_barrier
 	.long	arm1026_dma_flush_range
 
 	.align	5
diff --git a/arch/arm/mm/proc-arm920.S b/arch/arm/mm/proc-arm920.S
index 8be8199..ec74093 100644
--- a/arch/arm/mm/proc-arm920.S
+++ b/arch/arm/mm/proc-arm920.S
@@ -252,7 +252,6 @@ arm920_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -271,7 +270,6 @@ arm920_dma_clean_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -288,7 +286,6 @@ ENTRY(arm920_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -315,6 +312,12 @@ ENTRY(arm920_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm920_dma_unmap_area)
 
+ENTRY(arm920_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm920_dma_barrier)
+
 ENTRY(arm920_cache_fns)
 	.long	arm920_flush_icache_all
 	.long	arm920_flush_kern_cache_all
@@ -325,6 +328,7 @@ ENTRY(arm920_cache_fns)
 	.long	arm920_flush_kern_dcache_area
 	.long	arm920_dma_map_area
 	.long	arm920_dma_unmap_area
+	.long	arm920_dma_barrier
 	.long	arm920_dma_flush_range
 
 #endif
diff --git a/arch/arm/mm/proc-arm922.S b/arch/arm/mm/proc-arm922.S
index c0ff8e4..474d4c6 100644
--- a/arch/arm/mm/proc-arm922.S
+++ b/arch/arm/mm/proc-arm922.S
@@ -254,7 +254,6 @@ arm922_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -273,7 +272,6 @@ arm922_dma_clean_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -290,7 +288,6 @@ ENTRY(arm922_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -317,6 +314,12 @@ ENTRY(arm922_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm922_dma_unmap_area)
 
+ENTRY(arm922_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm922_dma_barrier)
+
 ENTRY(arm922_cache_fns)
 	.long	arm922_flush_icache_all
 	.long	arm922_flush_kern_cache_all
@@ -327,6 +330,7 @@ ENTRY(arm922_cache_fns)
 	.long	arm922_flush_kern_dcache_area
 	.long	arm922_dma_map_area
 	.long	arm922_dma_unmap_area
+	.long	arm922_dma_barrier
 	.long	arm922_dma_flush_range
 
 #endif
diff --git a/arch/arm/mm/proc-arm925.S b/arch/arm/mm/proc-arm925.S
index 3c6cffe..0336ae3 100644
--- a/arch/arm/mm/proc-arm925.S
+++ b/arch/arm/mm/proc-arm925.S
@@ -302,7 +302,6 @@ arm925_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -323,7 +322,6 @@ arm925_dma_clean_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -345,7 +343,6 @@ ENTRY(arm925_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -372,6 +369,12 @@ ENTRY(arm925_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm925_dma_unmap_area)
 
+ENTRY(arm925_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm925_dma_barrier)
+
 ENTRY(arm925_cache_fns)
 	.long	arm925_flush_icache_all
 	.long	arm925_flush_kern_cache_all
@@ -382,6 +385,7 @@ ENTRY(arm925_cache_fns)
 	.long	arm925_flush_kern_dcache_area
 	.long	arm925_dma_map_area
 	.long	arm925_dma_unmap_area
+	.long	arm925_dma_barrier
 	.long	arm925_dma_flush_range
 
 ENTRY(cpu_arm925_dcache_clean_area)
diff --git a/arch/arm/mm/proc-arm926.S b/arch/arm/mm/proc-arm926.S
index 75b707c..473bbe6 100644
--- a/arch/arm/mm/proc-arm926.S
+++ b/arch/arm/mm/proc-arm926.S
@@ -265,7 +265,6 @@ arm926_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -286,7 +285,6 @@ arm926_dma_clean_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -308,7 +306,6 @@ ENTRY(arm926_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -335,6 +332,12 @@ ENTRY(arm926_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm926_dma_unmap_area)
 
+ENTRY(arm926_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm926_dma_barrier)
+
 ENTRY(arm926_cache_fns)
 	.long	arm926_flush_icache_all
 	.long	arm926_flush_kern_cache_all
@@ -345,6 +348,7 @@ ENTRY(arm926_cache_fns)
 	.long	arm926_flush_kern_dcache_area
 	.long	arm926_dma_map_area
 	.long	arm926_dma_unmap_area
+	.long	arm926_dma_barrier
 	.long	arm926_dma_flush_range
 
 ENTRY(cpu_arm926_dcache_clean_area)
diff --git a/arch/arm/mm/proc-arm940.S b/arch/arm/mm/proc-arm940.S
index 1af1657..c44c963 100644
--- a/arch/arm/mm/proc-arm940.S
+++ b/arch/arm/mm/proc-arm940.S
@@ -187,7 +187,6 @@ arm940_dma_inv_range:
 	bcs	2b				@ entries 63 to 0
 	subs	r1, r1, #1 << 4
 	bcs	1b				@ segments 7 to 0
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -211,7 +210,6 @@ ENTRY(cpu_arm940_dcache_clean_area)
 	subs	r1, r1, #1 << 4
 	bcs	1b				@ segments 7 to 0
 #endif
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -237,7 +235,6 @@ ENTRY(arm940_dma_flush_range)
 	bcs	2b				@ entries 63 to 0
 	subs	r1, r1, #1 << 4
 	bcs	1b				@ segments 7 to 0
-	mcr	p15, 0, ip, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -264,6 +261,12 @@ ENTRY(arm940_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm940_dma_unmap_area)
 
+ENTRY(arm940_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm940_dma_barrier)
+
 ENTRY(arm940_cache_fns)
 	.long	arm940_flush_icache_all
 	.long	arm940_flush_kern_cache_all
@@ -274,6 +277,7 @@ ENTRY(arm940_cache_fns)
 	.long	arm940_flush_kern_dcache_area
 	.long	arm940_dma_map_area
 	.long	arm940_dma_unmap_area
+	.long	arm940_dma_barrier
 	.long	arm940_dma_flush_range
 
 	__CPUINIT
diff --git a/arch/arm/mm/proc-arm946.S b/arch/arm/mm/proc-arm946.S
index 1664b6a..11e9ad7 100644
--- a/arch/arm/mm/proc-arm946.S
+++ b/arch/arm/mm/proc-arm946.S
@@ -234,7 +234,6 @@ arm946_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -255,7 +254,6 @@ arm946_dma_clean_range:
 	cmp	r0, r1
 	blo	1b
 #endif
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -279,7 +277,6 @@ ENTRY(arm946_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -306,6 +303,12 @@ ENTRY(arm946_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(arm946_dma_unmap_area)
 
+ENTRY(arm946_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(arm946_dma_barrier)
+
 ENTRY(arm946_cache_fns)
 	.long	arm946_flush_icache_all
 	.long	arm946_flush_kern_cache_all
@@ -316,6 +319,7 @@ ENTRY(arm946_cache_fns)
 	.long	arm946_flush_kern_dcache_area
 	.long	arm946_dma_map_area
 	.long	arm946_dma_unmap_area
+	.long	arm946_dma_barrier
 	.long	arm946_dma_flush_range
 
 
diff --git a/arch/arm/mm/proc-feroceon.S b/arch/arm/mm/proc-feroceon.S
index 53e6323..50a309e 100644
--- a/arch/arm/mm/proc-feroceon.S
+++ b/arch/arm/mm/proc-feroceon.S
@@ -290,7 +290,6 @@ feroceon_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 	.align	5
@@ -326,7 +325,6 @@ feroceon_dma_clean_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 	.align	5
@@ -339,7 +337,6 @@ feroceon_range_dma_clean_range:
 	mcr	p15, 5, r0, c15, c13, 0		@ D clean range start
 	mcr	p15, 5, r1, c15, c13, 1		@ D clean range top
 	msr	cpsr_c, r2			@ restore interrupts
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -357,7 +354,6 @@ ENTRY(feroceon_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 	.align	5
@@ -370,7 +366,6 @@ ENTRY(feroceon_range_dma_flush_range)
 	mcr	p15, 5, r0, c15, c15, 0		@ D clean/inv range start
 	mcr	p15, 5, r1, c15, c15, 1		@ D clean/inv range top
 	msr	cpsr_c, r2			@ restore interrupts
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -411,6 +406,12 @@ ENTRY(feroceon_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(feroceon_dma_unmap_area)
 
+ENTRY(feroceon_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(feroceon_dma_barrier)
+
 ENTRY(feroceon_cache_fns)
 	.long	feroceon_flush_icache_all
 	.long	feroceon_flush_kern_cache_all
@@ -421,6 +422,7 @@ ENTRY(feroceon_cache_fns)
 	.long	feroceon_flush_kern_dcache_area
 	.long	feroceon_dma_map_area
 	.long	feroceon_dma_unmap_area
+	.long	feroceon_dma_barrier
 	.long	feroceon_dma_flush_range
 
 ENTRY(feroceon_range_cache_fns)
@@ -433,6 +435,7 @@ ENTRY(feroceon_range_cache_fns)
 	.long	feroceon_range_flush_kern_dcache_area
 	.long	feroceon_range_dma_map_area
 	.long	feroceon_dma_unmap_area
+	.long	feroceon_dma_barrier
 	.long	feroceon_range_dma_flush_range
 
 	.align	5
diff --git a/arch/arm/mm/proc-mohawk.S b/arch/arm/mm/proc-mohawk.S
index caa3115..09e8883 100644
--- a/arch/arm/mm/proc-mohawk.S
+++ b/arch/arm/mm/proc-mohawk.S
@@ -224,7 +224,6 @@ mohawk_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -243,7 +242,6 @@ mohawk_dma_clean_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -261,7 +259,6 @@ ENTRY(mohawk_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
 	mov	pc, lr
 
 /*
@@ -288,6 +285,12 @@ ENTRY(mohawk_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(mohawk_dma_unmap_area)
 
+ENTRY(mohawk_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
+	mov	pc, lr
+ENDPROC(mohawk_dma_barrier)
+
 ENTRY(mohawk_cache_fns)
 	.long	mohawk_flush_kern_cache_all
 	.long	mohawk_flush_user_cache_all
@@ -297,6 +300,7 @@ ENTRY(mohawk_cache_fns)
 	.long	mohawk_flush_kern_dcache_area
 	.long	mohawk_dma_map_area
 	.long	mohawk_dma_unmap_area
+	.long	mohawk_dma_barrier
 	.long	mohawk_dma_flush_range
 
 ENTRY(cpu_mohawk_dcache_clean_area)
diff --git a/arch/arm/mm/proc-xsc3.S b/arch/arm/mm/proc-xsc3.S
index 046b3d8..d033ed4 100644
--- a/arch/arm/mm/proc-xsc3.S
+++ b/arch/arm/mm/proc-xsc3.S
@@ -274,7 +274,6 @@ xsc3_dma_inv_range:
 	add	r0, r0, #CACHELINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ data write barrier
 	mov	pc, lr
 
 /*
@@ -291,7 +290,6 @@ xsc3_dma_clean_range:
 	add	r0, r0, #CACHELINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ data write barrier
 	mov	pc, lr
 
 /*
@@ -308,7 +306,6 @@ ENTRY(xsc3_dma_flush_range)
 	add	r0, r0, #CACHELINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ data write barrier
 	mov	pc, lr
 
 /*
@@ -335,6 +332,12 @@ ENTRY(xsc3_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(xsc3_dma_unmap_area)
 
+ENTRY(xsc3_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ data write barrier
+	mov	pc, lr
+ENDPROC(xsc3_dma_barrier)
+
 ENTRY(xsc3_cache_fns)
 	.long	xsc3_flush_icache_all
 	.long	xsc3_flush_kern_cache_all
@@ -345,6 +348,7 @@ ENTRY(xsc3_cache_fns)
 	.long	xsc3_flush_kern_dcache_area
 	.long	xsc3_dma_map_area
 	.long	xsc3_dma_unmap_area
+	.long	xsc3_dma_barrier
 	.long	xsc3_dma_flush_range
 
 ENTRY(cpu_xsc3_dcache_clean_area)
diff --git a/arch/arm/mm/proc-xscale.S b/arch/arm/mm/proc-xscale.S
index 63037e2..e390ae6 100644
--- a/arch/arm/mm/proc-xscale.S
+++ b/arch/arm/mm/proc-xscale.S
@@ -332,7 +332,6 @@ xscale_dma_inv_range:
 	add	r0, r0, #CACHELINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ Drain Write (& Fill) Buffer
 	mov	pc, lr
 
 /*
@@ -349,7 +348,6 @@ xscale_dma_clean_range:
 	add	r0, r0, #CACHELINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ Drain Write (& Fill) Buffer
 	mov	pc, lr
 
 /*
@@ -367,7 +365,6 @@ ENTRY(xscale_dma_flush_range)
 	add	r0, r0, #CACHELINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ Drain Write (& Fill) Buffer
 	mov	pc, lr
 
 /*
@@ -407,6 +404,12 @@ ENTRY(xscale_dma_unmap_area)
 	mov	pc, lr
 ENDPROC(xscale_dma_unmap_area)
 
+ENTRY(xscale_dma_barrier)
+	mov	r0, #0
+	mcr	p15, 0, r0, c7, c10, 4		@ Drain Write (& Fill) Buffer
+	mov	pc, lr
+ENDPROC(xscale_dma_barrier)
+
 ENTRY(xscale_cache_fns)
 	.long	xscale_flush_icache_all
 	.long	xscale_flush_kern_cache_all
@@ -417,6 +420,7 @@ ENTRY(xscale_cache_fns)
 	.long	xscale_flush_kern_dcache_area
 	.long	xscale_dma_map_area
 	.long	xscale_dma_unmap_area
+	.long	xscale_dma_barrier
 	.long	xscale_dma_flush_range
 
 /*


^ permalink raw reply related	[flat|nested] 27+ messages in thread
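
To see what the patch changes in practice, here is a minimal C sketch of
the resulting pattern in the scatterlist mapping path. It is not taken
from the patch: sketch_map_sg() is an illustrative name, the per-entry
page/offset handling and error path of the real dma_map_sg() are omitted,
and only dmac_map_area(), __dma_barrier() and the usual scatterlist
helpers are assumed, as in the patch and the mainline headers.

/*
 * Sketch only: with the patch, the per-range "drain write buffer"/dsb
 * disappears from the cache maintenance loops, and a single barrier is
 * issued once the whole scatterlist has been cleaned/invalidated.
 */
static int sketch_map_sg(struct device *dev, struct scatterlist *sg,
			 int nents, enum dma_data_direction dir)
{
	struct scatterlist *s;
	int i;

	for_each_sg(sg, s, nents, i)
		dmac_map_area(sg_virt(s), s->length, dir);	/* no drain here */

	__dma_barrier(dir);	/* one dsb/drain WB for all nents entries */

	return nents;
}

For an N-entry scatterlist this replaces N write-buffer drains (one per
range operation) with a single one, which is the saving referred to for
the scatter-gather code.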

* Re: [PATCH 0/5] mmc: add double buffering for mmc block requests
  2011-02-05 17:02   ` Russell King - ARM Linux
@ 2011-02-05 20:36     ` Russell King - ARM Linux
  -1 siblings, 0 replies; 27+ messages in thread
From: Russell King - ARM Linux @ 2011-02-05 20:36 UTC (permalink / raw)
  To: Per Forlin, Catalin Marinas
  Cc: Chris Ball, linux-mmc, linux-kernel, linux-arm-kernel, dev

On Sat, Feb 05, 2011 at 05:02:55PM +0000, Russell King - ARM Linux wrote:
> On Wed, Jan 12, 2011 at 07:13:58PM +0100, Per Forlin wrote:
> > Add support to prepare one MMC request while another is active on
> > the host. This is done by making the issue_rw_rq() asynchronous.
> > The increase in throughput is proportional to the time it takes to
> > prepare a request and how fast the memory is. The faster the MMC/SD is
> > the more significant the prepare request time becomes. Measurements on U5500
> > and U8500 on eMMC shows significant performance gain for DMA on MMC for large
> > reads. In the PIO case there is some gain in performance for large reads too.
> > There seems to be no or small performance gain for write, don't have a good
> > explanation for this yet.
> 
> It might be worth seeing what effect the following patch has.  This
> moves the dsb out of the cache operations into a separate function,
> so we only do one dsb per DMA mapping/unmapping operation.  That's
> particularly significant for the scattergather code.
> 
> I don't remember the reason why this was dropped as a candidate for
> merging - could that be because the dsb needs to be before the outer
> cache maintainence?  Adding Catalin for comment on that.

FWIW, trying this with MMC on OMAP4, I see no measurable difference in
performance or CPU usage.

^ permalink raw reply	[flat|nested] 27+ messages in thread
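
The question quoted above, whether the dsb must precede the outer cache
maintenance, can be seen in a simplified sketch of the CPU-to-device
direction of the streaming-DMA path. This is an illustration only:
sketch_page_cpu_to_dev() is not a kernel function, highmem and the
device-to-CPU direction are ignored, and whether the deferred barrier is
actually unsafe here was left open in the thread.

static void sketch_page_cpu_to_dev(struct page *page, unsigned long off,
				   size_t size, enum dma_data_direction dir)
{
	unsigned long paddr = page_to_phys(page) + off;

	/* inner (L1) maintenance; with the patch this no longer drains */
	dmac_map_area(page_address(page) + off, size, dir);

	/*
	 * Outer (e.g. L2) maintenance.  The concern is that the inner
	 * clean may still sit in the write buffer at this point if the
	 * only barrier is the deferred __dma_barrier() below.
	 */
	outer_clean_range(paddr, paddr + size);

	__dma_barrier(dir);
}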

end of thread, other threads:[~2011-02-05 20:36 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-12 18:13 [PATCH 0/5] mmc: add double buffering for mmc block requests Per Forlin
2011-01-12 18:13 ` Per Forlin
2011-01-12 18:13 ` [PATCH 1/5] mmc: add member in mmc queue struct to hold request data Per Forlin
2011-01-12 18:13   ` Per Forlin
2011-01-12 18:14 ` [PATCH 2/5] mmc: Add a block request prepare function Per Forlin
2011-01-12 18:14   ` Per Forlin
2011-01-12 18:14 ` [PATCH 3/5] mmc: Add a second mmc queue request member Per Forlin
2011-01-12 18:14   ` Per Forlin
2011-01-12 18:14 ` [PATCH 4/5] mmc: Store the mmc block request struct in mmc queue Per Forlin
2011-01-12 18:14   ` Per Forlin
2011-01-12 18:14 ` [PATCH 5/5] mmc: Add double buffering for mmc block requests Per Forlin
2011-01-12 18:14   ` Per Forlin
2011-01-12 18:24 ` [PATCH 0/5] mmc: add " Per Forlin
2011-01-12 18:24   ` Per Forlin
2011-01-18  2:35 ` Jaehoon Chung
2011-01-18  2:35   ` Jaehoon Chung
2011-01-18  8:12   ` Per Forlin
2011-01-18  8:12     ` Per Forlin
2011-01-28  8:28     ` Per Forlin
2011-01-28  8:28       ` Per Forlin
2011-01-28  8:28       ` Per Forlin
2011-01-30  8:23       ` Jaehoon Chung
2011-01-30  8:23         ` Jaehoon Chung
2011-02-05 17:02 ` Russell King - ARM Linux
2011-02-05 17:02   ` Russell King - ARM Linux
2011-02-05 20:36   ` Russell King - ARM Linux
2011-02-05 20:36     ` Russell King - ARM Linux
