* [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
@ 2006-09-11 23:00 Dan Williams
  2006-09-11 23:17 ` [PATCH 01/19] raid5: raid5_do_soft_block_ops Dan Williams
                   ` (21 more replies)
  0 siblings, 22 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:00 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

Neil,

The following patches implement hardware accelerated raid5 for the Intel
Xscale® series of I/O Processors.  The MD changes allow stripe
operations to run outside the spin lock in a work queue.  Hardware
acceleration is achieved by using a dma-engine-aware work queue routine
instead of the default software-only routine.
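
The hook that makes that swap possible is a per-array function pointer,
conf->do_block_ops, added in patch 2; an accelerated back end only has to
supply a routine with the same signature.  A minimal sketch, assuming the
raid5-dma client exports a routine named raid5_dma_do_block_ops (an
illustrative name, not the literal symbol from patch 15):

	/* sketch: selecting the stripe operations routine at run() time */
	#ifdef CONFIG_RAID5_DMA
	conf->do_block_ops = raid5_dma_do_block_ops;	/* dma-engine aware */
	#else
	conf->do_block_ops = raid5_do_soft_block_ops;	/* software only */
	#endif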

Since the last release of the raid5 changes, many bug fixes and other
improvements have been made as a result of stress testing.  See the per
patch change logs for more information about what was fixed.  This is
the first release of the full dma implementation.

The patches touch three areas: the md-raid5 driver, the generic dmaengine
interface, and a platform device driver for IOPs.  The raid5 changes
follow your comments concerning making the acceleration implementation
similar to how the stripe cache handles I/O requests.  The dmaengine
changes are the second release of this code.  They expand the interface
to handle more than memcpy operations, and add a generic raid5-dma
client.  The iop-adma driver supports dma memcpy, xor, xor zero sum, and
memset across all IOP architectures (32x, 33x, and 13xx).

Regarding the context-switching performance concerns raised at the
previous release, I have observed the following.
accelerated case it appears that performance is always better with the
work queue than without since it allows multiple stripes to be operated
on simultaneously.  I expect the same for an SMP platform, but so far my
testing has been limited to IOPs.  For a single-processor,
non-accelerated configuration I have not observed performance
degradation with work queue support enabled, but the Kconfig help text
for CONFIG_MD_RAID456_WORKQUEUE still recommends disabling it in that case.

Please consider the patches for -mm.

-Dan

[PATCH 01/19] raid5: raid5_do_soft_block_ops
[PATCH 02/19] raid5: move write operations to a workqueue
[PATCH 03/19] raid5: move check parity operations to a workqueue
[PATCH 04/19] raid5: move compute block operations to a workqueue
[PATCH 05/19] raid5: move read completion copies to a workqueue
[PATCH 06/19] raid5: move the reconstruct write expansion operation to a workqueue
[PATCH 07/19] raid5: remove compute_block and compute_parity5
[PATCH 08/19] dmaengine: enable multiple clients and operations
[PATCH 09/19] dmaengine: reduce backend address permutations
[PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients
[PATCH 11/19] dmaengine: add memset as an asynchronous dma operation
[PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy
[PATCH 13/19] dmaengine: add support for dma xor zero sum operations
[PATCH 14/19] dmaengine: add dma_sync_wait
[PATCH 15/19] dmaengine: raid5 dma client
[PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines
[PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs
[PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization
[PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver

Note: the iop3xx patches apply against the iop3xx platform code
re-factoring done by Lennert Buytenhek.  His patches are reproduced,
with permission, on the Xscale IOP SourceForge site.

Also available on SourceForge:

Linux Symposium Paper: MD RAID Acceleration Support for Asynchronous
DMA/XOR Engines
http://prdownloads.sourceforge.net/xscaleiop/ols_paper_2006.pdf?download

Tar archive of the patch set
http://prdownloads.sourceforge.net/xscaleiop/md_raid_accel-2.6.18-rc6.tar.gz?download

[PATCH 01/19] http://prdownloads.sourceforge.net/xscaleiop/md-add-raid5-do-soft-block-ops.patch?download
[PATCH 02/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-write-operations-to-a-workqueue.patch?download
[PATCH 03/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-check-parity-operations-to-a-workqueue.patch?download
[PATCH 04/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-compute-block-operations-to-a-workqueue.patch?download
[PATCH 05/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-read-completion-copies-to-a-workqueue.patch?download
[PATCH 06/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-expansion-operations-to-a-workqueue.patch?download
[PATCH 07/19] http://prdownloads.sourceforge.net/xscaleiop/md-remove-compute_block-and-compute_parity5.patch?download
[PATCH 08/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-multiple-clients-and-multiple-operations.patch?download
[PATCH 09/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-unite-backend-address-types.patch?download
[PATCH 10/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-dma-async-map-page.patch?download
[PATCH 11/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-dma-async-memset.patch?download
[PATCH 12/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-dma-async-memcpy-err.patch?download
[PATCH 13/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-dma-async-zero-sum.patch?download
[PATCH 14/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-dma-sync-wait.patch?download
[PATCH 15/19] http://prdownloads.sourceforge.net/xscaleiop/md-raid5-dma-client.patch?download
[PATCH 16/19] http://prdownloads.sourceforge.net/xscaleiop/iop-adma-device-driver.patch?download
[PATCH 17/19] http://prdownloads.sourceforge.net/xscaleiop/iop3xx-register-macro-cleanup.patch?download
[PATCH 18/19] http://prdownloads.sourceforge.net/xscaleiop/iop3xx-pci-initialization.patch?download
[PATCH 19/19] http://prdownloads.sourceforge.net/xscaleiop/iop3xx-adma-support.patch?download

Optimal performance on IOPs is obtained with:
CONFIG_MD_RAID456_WORKQUEUE=y
CONFIG_MD_RAID5_HW_OFFLOAD=y
CONFIG_RAID5_DMA=y
CONFIG_INTEL_IOP_ADMA=y


* [PATCH 01/19] raid5: raid5_do_soft_block_ops
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
@ 2006-09-11 23:17 ` Dan Williams
  2006-09-11 23:34   ` Jeff Garzik
  2006-09-11 23:17 ` [PATCH 02/19] raid5: move write operations to a workqueue Dan Williams
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:17 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

raid5_do_soft_block_ops consolidates all the stripe cache maintenance
operations into a single routine.  The stripe operations are:
* copying data between the stripe cache and user application buffers
* computing blocks to save a disk access, or to recover a missing block
* updating the parity on a write operation (reconstruct write and
read-modify-write)
* checking parity correctness
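
The routine follows a snapshot/process/merge pattern: it copies the
requested work out of sh->ops.state under the stripe lock, performs the
copy and xor operations lock-free, then merges the completion bits back
under the lock.  Condensed from the hunk below (details elided, no new
logic):

	static void raid5_do_soft_block_ops(void *stripe_head_ref)
	{
		struct stripe_head *sh = stripe_head_ref;
		unsigned long state, ops_state, ops_state_orig;

		/* snapshot what has been requested so far */
		spin_lock(&sh->lock);
		state = sh->state;
		ops_state_orig = ops_state = sh->ops.state;
		spin_unlock(&sh->lock);

		/* ... biofill / compute / rmw / rcw / check work ... */

		/* clear the requests we serviced, preserve completion bits
		 * and any requests that arrived while we were working
		 */
		spin_lock(&sh->lock);
		sh->ops.state ^= (ops_state_orig & ~STRIPE_OP_COMPLETION_MASK);
		sh->ops.state |= ops_state;
		clear_bit(STRIPE_OP_QUEUED, &sh->state);
		set_bit(STRIPE_HANDLE, &sh->state);
		queue_raid_work(sh);
		spin_unlock(&sh->lock);

		release_stripe(sh);
	}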

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         |  289 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/raid/raid5.h |  129 +++++++++++++++++++-
 2 files changed, 415 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4500660..8fde62b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1362,6 +1362,295 @@ static int stripe_to_pdidx(sector_t stri
 	return pd_idx;
 }
 
+/*
+ * raid5_do_soft_block_ops - perform block memory operations on stripe data
+ * outside the spin lock.
+ */
+static void raid5_do_soft_block_ops(void *stripe_head_ref)
+{
+	struct stripe_head *sh = stripe_head_ref;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks;
+	void *ptr[MAX_XOR_BLOCKS];
+	int overlap=0, work=0, written=0, compute=0, dd_idx=0;
+	int pd_uptodate=0;
+	unsigned long state, ops_state, ops_state_orig;
+	raid5_conf_t *conf = sh->raid_conf;
+
+	/* take a snapshot of what needs to be done at this point in time */
+	spin_lock(&sh->lock);
+	state = sh->state;
+	ops_state_orig = ops_state = sh->ops.state;
+	spin_unlock(&sh->lock);
+
+	if (test_bit(STRIPE_OP_BIOFILL, &state)) {
+		struct bio *return_bi=NULL;
+
+		for (i=disks ; i-- ;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_ReadReq, &dev->flags)) {
+				struct bio *rbi, *rbi2;
+				PRINTK("%s: stripe %llu STRIPE_OP_BIOFILL op_state: %lx disk: %d\n",
+					__FUNCTION__, (unsigned long long)sh->sector,
+					ops_state, i);
+				spin_lock_irq(&conf->device_lock);
+				rbi = dev->toread;
+				dev->toread = NULL;
+				spin_unlock_irq(&conf->device_lock);
+				overlap++;
+				while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+					copy_data(0, rbi, dev->page, dev->sector);
+					rbi2 = r5_next_bio(rbi, dev->sector);
+					spin_lock_irq(&conf->device_lock);
+					if (--rbi->bi_phys_segments == 0) {
+						rbi->bi_next = return_bi;
+						return_bi = rbi;
+					}
+					spin_unlock_irq(&conf->device_lock);
+					rbi = rbi2;
+				}
+				dev->read = return_bi;
+			}
+		}
+		if (overlap) {
+			set_bit(STRIPE_OP_BIOFILL_Done, &ops_state);
+			work++;
+		}
+	}
+
+	if (test_bit(STRIPE_OP_COMPUTE, &state)) {
+		for (i=disks ; i-- ;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_ComputeReq, &dev->flags)) {
+				dd_idx = i;
+				i = -1;
+				break;
+			}
+		}
+		BUG_ON(i >= 0);
+		PRINTK("%s: stripe %llu STRIPE_OP_COMPUTE op_state: %lx block: %d\n",
+			__FUNCTION__, (unsigned long long)sh->sector,
+			ops_state, dd_idx);
+		ptr[0] = page_address(sh->dev[dd_idx].page);
+
+		if (test_and_clear_bit(STRIPE_OP_COMPUTE_Prep, &ops_state)) {
+			memset(ptr[0], 0, STRIPE_SIZE);
+			set_bit(STRIPE_OP_COMPUTE_Parity, &ops_state);
+		}
+
+		if (test_and_clear_bit(STRIPE_OP_COMPUTE_Parity, &ops_state)) {
+			int count = 1;
+			for (i = disks ; i--; ) {
+				struct r5dev *dev = &sh->dev[i];
+				void *p;
+				if (i == dd_idx)
+					continue;
+				p = page_address(dev->page);
+				ptr[count++] = p;
+
+				check_xor();
+			}
+			if (count != 1)
+				xor_block(count, STRIPE_SIZE, ptr);
+
+			work++;
+			compute++;
+			set_bit(STRIPE_OP_COMPUTE_Done, &ops_state);
+		}
+	}
+
+	if (test_bit(STRIPE_OP_RMW, &state)) {
+		BUG_ON(test_bit(STRIPE_OP_RCW, &state));
+
+		PRINTK("%s: stripe %llu STRIPE_OP_RMW op_state: %lx\n",
+			__FUNCTION__, (unsigned long long)sh->sector,
+			ops_state);
+
+		ptr[0] = page_address(sh->dev[pd_idx].page);
+
+		if (test_and_clear_bit(STRIPE_OP_RMW_ParityPre, &ops_state)) {
+			int count = 1;
+
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				struct bio *chosen;
+
+				/* Only process blocks that are known to be uptodate */
+				if (dev->towrite && test_bit(R5_RMWReq, &dev->flags)) {
+					ptr[count++] = page_address(dev->page);
+
+					spin_lock(&sh->lock);
+					chosen = dev->towrite;
+					dev->towrite = NULL;
+					BUG_ON(dev->written);
+					dev->written = chosen;
+					spin_unlock(&sh->lock);
+
+					overlap++;
+
+					check_xor();
+				}
+			}
+			if (count != 1)
+				xor_block(count, STRIPE_SIZE, ptr);
+			set_bit(STRIPE_OP_RMW_Drain, &ops_state);
+		}
+		if (test_and_clear_bit(STRIPE_OP_RMW_Drain, &ops_state)) {
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				struct bio *wbi = dev->written;
+
+				if (dev->written)
+					written++;
+
+				while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+					copy_data(1, wbi, dev->page, dev->sector);
+					wbi = r5_next_bio(wbi, dev->sector);
+				}
+			}
+			set_bit(STRIPE_OP_RMW_ParityPost, &ops_state);
+		}
+		if (test_and_clear_bit(STRIPE_OP_RMW_ParityPost, &ops_state)) {
+			int count = 1;
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				if (dev->written) {
+					ptr[count++] = page_address(dev->page);
+					check_xor();
+				}
+			}
+			if (count != 1)
+				xor_block(count, STRIPE_SIZE, ptr);
+
+			work++;
+			pd_uptodate++;
+			set_bit(STRIPE_OP_RMW_Done, &ops_state);
+		}
+
+	}
+
+	if (test_bit(STRIPE_OP_RCW, &state)) {
+		BUG_ON(test_bit(STRIPE_OP_RMW, &state));
+
+		PRINTK("%s: stripe %llu STRIPE_OP_RCW op_state: %lx\n",
+			__FUNCTION__, (unsigned long long)sh->sector,
+			ops_state);
+
+		ptr[0] = page_address(sh->dev[pd_idx].page);
+
+		if (test_and_clear_bit(STRIPE_OP_RCW_Drain, &ops_state)) {
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				struct bio *chosen;
+				struct bio *wbi;
+
+				if (i!=pd_idx && dev->towrite &&
+					test_bit(R5_LOCKED, &dev->flags)) {
+
+					spin_lock(&sh->lock);
+					chosen = dev->towrite;
+					dev->towrite = NULL;
+					spin_unlock(&sh->lock);
+
+					BUG_ON(dev->written);
+					wbi = dev->written = chosen;
+
+					overlap++;
+					written++;
+
+					while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+						copy_data(1, wbi, dev->page, dev->sector);
+						wbi = r5_next_bio(wbi, dev->sector);
+					}
+				} else if (i==pd_idx)
+					memset(ptr[0], 0, STRIPE_SIZE);
+			}
+			set_bit(STRIPE_OP_RCW_Parity, &ops_state);
+		}
+		if (test_and_clear_bit(STRIPE_OP_RCW_Parity, &ops_state)) {
+			int count = 1;
+			for (i=disks; i--;)
+				if (i != pd_idx) {
+					ptr[count++] = page_address(sh->dev[i].page);
+					check_xor();
+				}
+			if (count != 1)
+				xor_block(count, STRIPE_SIZE, ptr);
+
+			work++;
+			pd_uptodate++;
+			set_bit(STRIPE_OP_RCW_Done, &ops_state);
+
+		}
+	}
+
+	if (test_bit(STRIPE_OP_CHECK, &state)) {
+		PRINTK("%s: stripe %llu STRIPE_OP_CHECK op_state: %lx\n",
+		__FUNCTION__, (unsigned long long)sh->sector,
+		ops_state);
+
+		ptr[0] = page_address(sh->dev[pd_idx].page);
+
+		if (test_and_clear_bit(STRIPE_OP_CHECK_Gen, &ops_state)) {
+			int count = 1;
+			for (i=disks; i--;)
+				if (i != pd_idx) {
+					ptr[count++] = page_address(sh->dev[i].page);
+					check_xor();
+				}
+			if (count != 1)
+				xor_block(count, STRIPE_SIZE, ptr);
+
+			set_bit(STRIPE_OP_CHECK_Verify, &ops_state);
+		}
+		if (test_and_clear_bit(STRIPE_OP_CHECK_Verify, &ops_state)) {
+			if (page_is_zero(sh->dev[pd_idx].page))
+				set_bit(STRIPE_OP_CHECK_IsZero, &ops_state);
+
+			work++;
+			set_bit(STRIPE_OP_CHECK_Done, &ops_state);
+		}
+	}
+
+	spin_lock(&sh->lock);
+	/* Update the state of operations:
+	 * -clear incoming requests
+	 * -preserve output status (i.e. done status / check result)
+	 * -preserve requests added since 'ops_state_orig' was set
+	 */
+	sh->ops.state ^= (ops_state_orig & ~STRIPE_OP_COMPLETION_MASK);
+	sh->ops.state |= ops_state;
+
+	if (pd_uptodate)
+		set_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+
+	if (written)
+		for (i=disks ; i-- ;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (dev->written)
+				set_bit(R5_UPTODATE, &dev->flags);
+		}
+
+	if (overlap)
+		for (i= disks; i-- ;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_and_clear_bit(R5_Overlap, &dev->flags))
+				wake_up(&sh->raid_conf->wait_for_overlap);
+		}
+
+	if (compute) {
+		clear_bit(R5_ComputeReq, &sh->dev[dd_idx].flags);
+		set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
+	}
+
+	sh->ops.pending -= work;
+	BUG_ON(sh->ops.pending < 0);
+	clear_bit(STRIPE_OP_QUEUED, &sh->state);
+	set_bit(STRIPE_HANDLE, &sh->state);
+	queue_raid_work(sh);
+	spin_unlock(&sh->lock);
+
+	release_stripe(sh);
+}
 
 /*
  * handle_stripe - do things to a stripe.
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 20ed4c9..c8a315b 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -116,13 +116,39 @@ #include <linux/raid/xor.h>
  *  attach a request to an active stripe (add_stripe_bh())
  *     lockdev attach-buffer unlockdev
  *  handle a stripe (handle_stripe())
- *     lockstripe clrSTRIPE_HANDLE ... (lockdev check-buffers unlockdev) .. change-state .. record io needed unlockstripe schedule io
+ *     lockstripe clrSTRIPE_HANDLE ... (lockdev check-buffers unlockdev) .. change-state .. record io/ops needed unlockstripe schedule io/ops
  *  release an active stripe (release_stripe())
  *     lockdev if (!--cnt) { if  STRIPE_HANDLE, add to handle_list else add to inactive-list } unlockdev
  *
  * The refcount counts each thread that have activated the stripe,
  * plus raid5d if it is handling it, plus one for each active request
- * on a cached buffer.
+ * on a cached buffer, and plus one if the stripe is undergoing stripe
+ * operations.
+ *
+ * Stripe operations are performed outside the stripe lock,
+ * the stripe operations are:
+ * -copying data between the stripe cache and user application buffers
+ * -computing blocks to save a disk access, or to recover a missing block
+ * -updating the parity on a write operation (reconstruct write and read-modify-write)
+ * -checking parity correctness
+ * These operations are carried out by either a software routine,
+ * raid5_do_soft_block_ops, or by a routine that arranges for the work to be
+ * done by dedicated DMA engines.
+ * When requesting an operation handle_stripe sets the proper state and work
+ * request flags, it then hands control to the operations routine.  There are
+ * some critical dependencies between the operations that prevent some
+ * operations from being requested while another is in flight.
+ * Here are the inter-dependencies:
+ * -parity check operations destroy the in cache version of the parity block,
+ *  so we prevent parity dependent operations like writes and compute_blocks
+ *  from starting while a check is in progress.
+ * -when a write operation is requested we immediately lock the affected blocks,
+ *  and mark them as not up to date.  This causes new read requests to be held
+ *  off, as well as parity checks and compute block operations.
+ * -once a compute block operation has been requested handle_stripe treats that
+ *  block as if it is immediately up to date.  The routine carrying out the
+ *  operation guarantees that any operation that is dependent on the
+ *  compute block result is initiated after the computation completes.
  */
 
 struct stripe_head {
@@ -136,11 +162,18 @@ struct stripe_head {
 	spinlock_t		lock;
 	int			bm_seq;	/* sequence number for bitmap flushes */
 	int			disks;			/* disks in stripe */
+	struct stripe_operations {
+		int			pending;	/* number of operations requested */
+		unsigned long		state;		/* state of block operations */
+		#ifdef CONFIG_MD_RAID456_WORKQUEUE
+		struct work_struct	work;		/* work queue descriptor */
+		#endif
+	} ops;
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
 		struct page	*page;
-		struct bio	*toread, *towrite, *written;
+		struct bio	*toread, *read, *towrite, *written;
 		sector_t	sector;			/* sector of this page */
 		unsigned long	flags;
 	} dev[1]; /* allocated with extra space depending of RAID geometry */
@@ -158,6 +191,11 @@ #define	R5_ReadError	8	/* seen a read er
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */
 
 #define	R5_Expanded	10	/* This block now has post-expand data */
+#define	R5_Consistent	11	/* Block is HW DMA-able without a cache flush */
+#define	R5_ComputeReq	12	/* compute_block in progress treat as uptodate */
+#define	R5_ReadReq	13	/* dev->toread contains a bio that needs filling */
+#define	R5_RMWReq	14	/* distinguish blocks ready for rmw from other "towrites" */
+
 /*
  * Write method
  */
@@ -179,6 +217,72 @@ #define	STRIPE_BIT_DELAY	8
 #define	STRIPE_EXPANDING	9
 #define	STRIPE_EXPAND_SOURCE	10
 #define	STRIPE_EXPAND_READY	11
+#define	STRIPE_OP_RCW		12
+#define	STRIPE_OP_RMW		13 /* RAID-5 only */
+#define	STRIPE_OP_UPDATE	14 /* RAID-6 only */
+#define	STRIPE_OP_CHECK		15
+#define	STRIPE_OP_COMPUTE	16
+#define	STRIPE_OP_COMPUTE2	17 /* RAID-6 only */
+#define	STRIPE_OP_BIOFILL	18
+#define	STRIPE_OP_QUEUED	19
+#define	STRIPE_OP_DMA		20
+
+/*
+ * These flags are communication markers between the handle_stripe[5|6]
+ * routine and the block operations work queue
+ * - The *_Done definitions signal completion from work queue to handle_stripe
+ * - STRIPE_OP_CHECK_IsZero: signals parity correctness to handle_stripe
+ * - STRIPE_OP_RCW_Expand: expansion operations perform a modified RCW sequence
+ * - STRIPE_OP_COMPUTE_Recover_pd: recovering the parity disk involves an extra
+ *	write back step
+ * - STRIPE_OP_*_Dma: flag operations that will be done once the DMA engine
+ *	goes idle
+ * - All other definitions are service requests for the work queue
+ */
+#define	STRIPE_OP_RCW_Drain		0
+#define	STRIPE_OP_RCW_Parity		1
+#define	STRIPE_OP_RCW_Done		2
+#define	STRIPE_OP_RCW_Expand		3
+#define	STRIPE_OP_RMW_ParityPre		4
+#define	STRIPE_OP_RMW_Drain		5
+#define	STRIPE_OP_RMW_ParityPost	6
+#define	STRIPE_OP_RMW_Done		7
+#define	STRIPE_OP_CHECK_Gen   		8
+#define	STRIPE_OP_CHECK_Verify		9
+#define	STRIPE_OP_CHECK_Done		10
+#define	STRIPE_OP_CHECK_IsZero		11
+#define	STRIPE_OP_COMPUTE_Prep		12
+#define	STRIPE_OP_COMPUTE_Parity	13
+#define	STRIPE_OP_COMPUTE_Done		14
+#define	STRIPE_OP_COMPUTE_Recover_pd	15
+#define	STRIPE_OP_BIOFILL_Copy		16
+#define	STRIPE_OP_BIOFILL_Done		17
+#define	STRIPE_OP_RCW_Dma		18
+#define	STRIPE_OP_RMW_Dma		19
+#define	STRIPE_OP_UPDATE_Dma		20
+#define	STRIPE_OP_CHECK_Dma		21
+#define	STRIPE_OP_COMPUTE_Dma		22
+#define	STRIPE_OP_COMPUTE2_Dma		23
+#define	STRIPE_OP_BIOFILL_Dma		24
+
+/*
+ * Bit mask for status bits not to be auto-cleared by the work queue thread
+ */
+#define	STRIPE_OP_COMPLETION_MASK 	(1 << STRIPE_OP_RCW_Done |\
+						1 << STRIPE_OP_RMW_Done |\
+						1 << STRIPE_OP_CHECK_Done |\
+						1 << STRIPE_OP_CHECK_IsZero |\
+						1 << STRIPE_OP_COMPUTE_Done |\
+						1 << STRIPE_OP_COMPUTE_Recover_pd |\
+						1 << STRIPE_OP_BIOFILL_Done |\
+						1 << STRIPE_OP_RCW_Dma |\
+						1 << STRIPE_OP_RMW_Dma |\
+						1 << STRIPE_OP_UPDATE_Dma |\
+						1 << STRIPE_OP_CHECK_Dma |\
+						1 << STRIPE_OP_COMPUTE_Dma |\
+						1 << STRIPE_OP_COMPUTE2_Dma |\
+						1 << STRIPE_OP_BIOFILL_Dma)
+
 /*
  * Plugging:
  *
@@ -229,11 +333,19 @@ struct raid5_private_data {
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
+	#ifdef CONFIG_MD_RAID456_WORKQUEUE
+	struct workqueue_struct *block_ops_queue;
+	#endif
+	void (*do_block_ops)(void *);
+
 	/* unfortunately we need two cache names as we temporarily have
 	 * two caches.
 	 */
 	int			active_name;
 	char			cache_name[2][20];
+	#ifdef CONFIG_MD_RAID456_WORKQUEUE
+	char			workqueue_name[20];
+	#endif
 	kmem_cache_t		*slab_cache; /* for allocating stripes */
 
 	int			seq_flush, seq_write;
@@ -264,6 +376,17 @@ struct raid5_private_data {
 typedef struct raid5_private_data raid5_conf_t;
 
 #define mddev_to_conf(mddev) ((raid5_conf_t *) mddev->private)
+/* must be called under the stripe lock */
+static inline void queue_raid_work(struct stripe_head *sh)
+{
+	if (sh->ops.pending != 0 && !test_bit(STRIPE_OP_QUEUED, &sh->state)) {
+		set_bit(STRIPE_OP_QUEUED, &sh->state);
+		atomic_inc(&sh->count);
+		#ifdef CONFIG_MD_RAID456_WORKQUEUE
+		queue_work(sh->raid_conf->block_ops_queue, &sh->ops.work);
+		#endif
+	}
+}
 
 /*
  * Our supported algorithms


* [PATCH 02/19] raid5: move write operations to a workqueue
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
  2006-09-11 23:17 ` [PATCH 01/19] raid5: raid5_do_soft_block_ops Dan Williams
@ 2006-09-11 23:17 ` Dan Williams
  2006-09-11 23:36   ` Jeff Garzik
  2006-09-11 23:17 ` [PATCH 03/19] raid5: move check parity " Dan Williams
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:17 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Enable handle_stripe5 to pass off write operations to
raid5_do_soft_block_ops (which can be run from a workqueue).  The operations
moved are reconstruct-writes and read-modify-writes, formerly handled by
compute_parity5.

Changelog:
* moved raid5_do_soft_block_ops changes into a separate patch
* changed handle_write_operations5 to only initiate write operations, which
prevents new writes from being requested while the current one is in flight
* all blocks undergoing a write are now marked locked and !uptodate at the
beginning of the write operation
* blocks undergoing a read-modify-write need a request flag to distinguish
them from blocks that are locked for reading. Reconstruct-writes still use
the R5_LOCKED bit to select blocks for the operation
* integrated the work queue Kconfig option
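
For orientation, handle_write_operations5 (in the hunk below) only stages a
write; the work queue advances the per-operation state afterwards.  A
condensed view of the two entry points, taken from the code below rather
than added by it:

	if (rcw == 0) {			/* reconstruct write */
		set_bit(STRIPE_OP_RCW, &sh->state);
		set_bit(STRIPE_OP_RCW_Drain, &sh->ops.state);
		/* lock every block being written */
	} else {			/* read-modify-write */
		set_bit(STRIPE_OP_RMW, &sh->state);
		set_bit(STRIPE_OP_RMW_ParityPre, &sh->ops.state);
		/* lock the uptodate blocks flagged R5_RMWReq */
	}
	set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
	clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
	sh->ops.pending++;

The work queue then steps through RCW_Drain -> RCW_Parity -> RCW_Done, or
RMW_ParityPre -> RMW_Drain -> RMW_ParityPost -> RMW_Done, and a later pass
of handle_stripe5 consumes the _Done bits.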

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/Kconfig         |   21 +++++
 drivers/md/raid5.c         |  192 ++++++++++++++++++++++++++++++++++++++------
 include/linux/raid/raid5.h |    3 +
 3 files changed, 190 insertions(+), 26 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index bf869ed..2a16b3b 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -162,6 +162,27 @@ config MD_RAID5_RESHAPE
 	  There should be enough spares already present to make the new
 	  array workable.
 
+config MD_RAID456_WORKQUEUE
+	depends on MD_RAID456
+	bool "Offload raid work to a workqueue from raid5d"
+	---help---
+	  This option enables raid work (block copy and xor operations)
+	  to run in a workqueue.  If your platform has a high context
+	  switch penalty say N.  If you are using hardware offload or
+	  are running on an SMP platform say Y.
+
+	  If unsure say, Y.
+
+config MD_RAID456_WORKQUEUE_MULTITHREAD
+	depends on MD_RAID456_WORKQUEUE && SMP
+	bool "Enable multi-threaded raid processing"
+	default y
+	---help---
+	  This option controls whether the raid workqueue will be multi-
+	  threaded or single threaded.
+
+	  If unsure say, Y.
+
 config MD_MULTIPATH
 	tristate "Multipath I/O support"
 	depends on BLK_DEV_MD
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8fde62b..e39d248 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -222,6 +222,8 @@ static void init_stripe(struct stripe_he
 
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
+	BUG_ON(sh->ops.state);
+	BUG_ON(sh->ops.pending);
 	
 	CHECK_DEVLOCK();
 	PRINTK("init_stripe called, stripe %llu\n", 
@@ -331,6 +333,9 @@ static int grow_one_stripe(raid5_conf_t 
 	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
 	sh->raid_conf = conf;
 	spin_lock_init(&sh->lock);
+	#ifdef CONFIG_MD_RAID456_WORKQUEUE
+	INIT_WORK(&sh->ops.work, conf->do_block_ops, sh);
+	#endif
 
 	if (grow_buffers(sh, conf->raid_disks)) {
 		shrink_buffers(sh, conf->raid_disks);
@@ -1266,7 +1271,72 @@ static void compute_block_2(struct strip
 	}
 }
 
+static int handle_write_operations5(struct stripe_head *sh, int rcw)
+{
+	int i, pd_idx = sh->pd_idx, disks = sh->disks;
+	int locked=0;
+
+	if (rcw == 0) {
+		/* skip the drain operation on an expand */
+		if (test_bit(STRIPE_OP_RCW_Expand, &sh->ops.state)) {
+			set_bit(STRIPE_OP_RCW, &sh->state);
+			set_bit(STRIPE_OP_RCW_Parity, &sh->ops.state);
+			for (i=disks ; i-- ;) {
+				set_bit(R5_LOCKED, &sh->dev[i].flags);
+				locked++;
+			}
+		} else { /* enter stage 1 of reconstruct write operation */
+			set_bit(STRIPE_OP_RCW, &sh->state);
+			set_bit(STRIPE_OP_RCW_Drain, &sh->ops.state);
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+
+				if (dev->towrite) {
+					set_bit(R5_LOCKED, &dev->flags);
+					clear_bit(R5_UPTODATE, &dev->flags);
+					locked++;
+				}
+			}
+		}
+	} else {
+		/* enter stage 1 of read modify write operation */
+		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
+
+		set_bit(STRIPE_OP_RMW, &sh->state);
+		set_bit(STRIPE_OP_RMW_ParityPre, &sh->ops.state);
+		for (i=disks ; i-- ;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (i==pd_idx)
+				continue;
+
+			/* For a read-modify write there may be blocks that are
+			 * locked for reading while others are ready to be written
+			 * so we distinguish these blocks by the RMWReq bit
+			 */
+			if (dev->towrite &&
+			    test_bit(R5_UPTODATE, &dev->flags)) {
+				set_bit(R5_RMWReq, &dev->flags);
+				set_bit(R5_LOCKED, &dev->flags);
+				clear_bit(R5_UPTODATE, &dev->flags);
+				locked++;
+			}
+		}
+	}
+
+	/* keep the parity disk locked while asynchronous operations
+	 * are in flight
+	 */
+	set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
+	clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+	locked++;
+	sh->ops.pending++;
 
+	PRINTK("%s: stripe %llu locked: %d op_state: %lx\n",
+		__FUNCTION__, (unsigned long long)sh->sector,
+		locked, sh->ops.state);
+
+	return locked;
+}
 
 /*
  * Each stripe/dev can have one or more bion attached.
@@ -1664,7 +1734,6 @@ static void raid5_do_soft_block_ops(void
  *    schedule a write of some buffers
  *    return confirmation of parity correctness
  *
- * Parity calculations are done inside the stripe lock
  * buffers are taken off read_list or write_list, and bh_cache buffers
  * get BH_Lock set before the stripe lock is released.
  *
@@ -1679,13 +1748,13 @@ static void handle_stripe5(struct stripe
 	int i;
 	int syncing, expanding, expanded;
 	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-	int non_overwrite = 0;
+	int non_overwrite=0, write_complete=0;
 	int failed_num=0;
 	struct r5dev *dev;
 
-	PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
-		(unsigned long long)sh->sector, atomic_read(&sh->count),
-		sh->pd_idx);
+	PRINTK("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d\n",
+	       (unsigned long long)sh->sector, sh->state, atomic_read(&sh->count),
+	       sh->pd_idx);
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
@@ -1926,8 +1995,56 @@ #endif
 		set_bit(STRIPE_HANDLE, &sh->state);
 	}
 
-	/* now to consider writing and what else, if anything should be read */
-	if (to_write) {
+	/* Now we check to see if any write operations have recently
+	 * completed
+	 */
+	if (test_bit(STRIPE_OP_RCW, &sh->state) &&
+		test_bit(STRIPE_OP_RCW_Done, &sh->ops.state)) {
+		clear_bit(STRIPE_OP_RCW, &sh->state);
+		clear_bit(STRIPE_OP_RCW_Done, &sh->ops.state);
+		write_complete++;
+	}
+
+	if (test_bit(STRIPE_OP_RMW, &sh->state) &&
+		test_bit(STRIPE_OP_RMW_Done, &sh->ops.state)) {
+		clear_bit(STRIPE_OP_RMW, &sh->state);
+		clear_bit(STRIPE_OP_RMW_Done, &sh->ops.state);
+		BUG_ON(++write_complete > 1);
+		for (i=disks; i--;)
+			clear_bit(R5_RMWReq, &sh->dev[i].flags);
+	}
+
+	/* All the 'written' buffers and the parity block are ready to be
+	 * written back to disk
+	 */
+	if (write_complete) {
+		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
+		for (i=disks; i--;) {
+			dev = &sh->dev[i];
+			if (test_bit(R5_LOCKED, &dev->flags) &&
+				(i == sh->pd_idx || dev->written)) {
+				PRINTK("Writing block %d\n", i);
+				set_bit(R5_Wantwrite, &dev->flags);
+				if (!test_bit(R5_Insync, &dev->flags)
+				    || (i==sh->pd_idx && failed == 0))
+					set_bit(STRIPE_INSYNC, &sh->state);
+			}
+		}
+		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
+			atomic_dec(&conf->preread_active_stripes);
+			if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
+				md_wakeup_thread(conf->mddev->thread);
+		}
+	}
+
+	/* 1/ Now to consider new write requests and what else, if anything should be read
+	 * 2/ Check operations clobber the parity block so do not start new writes while
+	 *    a check is in flight
+	 * 3/ Write operations do not stack
+	 */
+	if (to_write && !test_bit(STRIPE_OP_RCW, &sh->state) &&
+		!test_bit(STRIPE_OP_RMW, &sh->state) &&
+		!test_bit(STRIPE_OP_CHECK, &sh->state)) {
 		int rmw=0, rcw=0;
 		for (i=disks ; i--;) {
 			/* would I have to read this buffer for read_modify_write */
@@ -2000,25 +2117,8 @@ #endif
 			}
 		/* now if nothing is locked, and if we have enough data, we can start a write request */
 		if (locked == 0 && (rcw == 0 ||rmw == 0) &&
-		    !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
-			PRINTK("Computing parity...\n");
-			compute_parity5(sh, rcw==0 ? RECONSTRUCT_WRITE : READ_MODIFY_WRITE);
-			/* now every locked buffer is ready to be written */
-			for (i=disks; i--;)
-				if (test_bit(R5_LOCKED, &sh->dev[i].flags)) {
-					PRINTK("Writing block %d\n", i);
-					locked++;
-					set_bit(R5_Wantwrite, &sh->dev[i].flags);
-					if (!test_bit(R5_Insync, &sh->dev[i].flags)
-					    || (i==sh->pd_idx && failed == 0))
-						set_bit(STRIPE_INSYNC, &sh->state);
-				}
-			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-				atomic_dec(&conf->preread_active_stripes);
-				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
-					md_wakeup_thread(conf->mddev->thread);
-			}
-		}
+		    !test_bit(STRIPE_BIT_DELAY, &sh->state))
+			locked += handle_write_operations5(sh, rcw);
 	}
 
 	/* maybe we need to check and possibly fix the parity for this stripe
@@ -2150,8 +2250,17 @@ #endif
 			}
 	}
 
+	queue_raid_work(sh);
+
 	spin_unlock(&sh->lock);
 
+	#ifndef CONFIG_MD_RAID456_WORKQUEUE
+	while (test_bit(STRIPE_OP_QUEUED, &sh->state)) {
+		PRINTK("run do_block_ops\n");
+		conf->do_block_ops(sh);
+	}
+	#endif
+
 	while ((bi=return_bi)) {
 		int bytes = bi->bi_size;
 
@@ -3439,6 +3548,30 @@ static int run(mddev_t *mddev)
 		if (!conf->spare_page)
 			goto abort;
 	}
+
+	#ifdef CONFIG_MD_RAID456_WORKQUEUE
+	sprintf(conf->workqueue_name, "%s_raid5_ops",
+		mddev->gendisk->disk_name);
+
+	#ifdef CONFIG_MD_RAID456_WORKQUEUE_MULTITHREAD
+	if ((conf->block_ops_queue = create_workqueue(conf->workqueue_name))
+				     == NULL)
+		goto abort;
+	#else
+	if ((conf->block_ops_queue = create_singlethread_workqueue(
+					conf->workqueue_name)) == NULL)
+		goto abort;
+	#endif
+	#endif
+
+	/* To Do:
+	 * 1/ Offload to asynchronous copy / xor engines
+	 * 2/ Automated selection of optimal do_block_ops
+	 *	routine similar to the xor template selection
+	 */
+	conf->do_block_ops = raid5_do_soft_block_ops;
+
+
 	spin_lock_init(&conf->device_lock);
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
@@ -3598,6 +3731,10 @@ abort:
 		safe_put_page(conf->spare_page);
 		kfree(conf->disks);
 		kfree(conf->stripe_hashtbl);
+		#ifdef CONFIG_MD_RAID456_WORKQUEUE
+		if (conf->do_block_ops)
+			destroy_workqueue(conf->block_ops_queue);
+		#endif
 		kfree(conf);
 	}
 	mddev->private = NULL;
@@ -3618,6 +3755,9 @@ static int stop(mddev_t *mddev)
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
 	sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
 	kfree(conf->disks);
+	#ifdef CONFIG_MD_RAID456_WORKQUEUE
+	destroy_workqueue(conf->block_ops_queue);
+	#endif
 	kfree(conf);
 	mddev->private = NULL;
 	return 0;
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index c8a315b..31ae55c 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -3,6 +3,7 @@ #define _RAID5_H
 
 #include <linux/raid/md.h>
 #include <linux/raid/xor.h>
+#include <linux/workqueue.h>
 
 /*
  *
@@ -333,6 +334,7 @@ struct raid5_private_data {
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
+
 	#ifdef CONFIG_MD_RAID456_WORKQUEUE
 	struct workqueue_struct *block_ops_queue;
 	#endif
@@ -376,6 +378,7 @@ struct raid5_private_data {
 typedef struct raid5_private_data raid5_conf_t;
 
 #define mddev_to_conf(mddev) ((raid5_conf_t *) mddev->private)
+
 /* must be called under the stripe lock */
 static inline void queue_raid_work(struct stripe_head *sh)
 {


* [PATCH 03/19] raid5: move check parity operations to a workqueue
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
  2006-09-11 23:17 ` [PATCH 01/19] raid5: raid5_do_soft_block_ops Dan Williams
  2006-09-11 23:17 ` [PATCH 02/19] raid5: move write operations to a workqueue Dan Williams
@ 2006-09-11 23:17 ` Dan Williams
  2006-09-11 23:17 ` [PATCH 04/19] raid5: move compute block " Dan Williams
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:17 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Enable handle_stripe5 to pass off check parity operations to
raid5_do_soft_block_ops; these were formerly handled by compute_parity5.

Changelog:
* removed handle_check_operations5.  All logic moved into handle_stripe5 so
that we do not need to go through the initiation logic to end the
operation.
* clear the uptodate bit on the parity block
* hold off check operations if a parity-dependent operation, such as a write,
is in flight

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c |   60 ++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e39d248..24ed4d8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2121,35 +2121,59 @@ #endif
 			locked += handle_write_operations5(sh, rcw);
 	}
 
-	/* maybe we need to check and possibly fix the parity for this stripe
-	 * Any reads will already have been scheduled, so we just see if enough data
-	 * is available
+	/* 1/ Maybe we need to check and possibly fix the parity for this stripe.
+	 *    Any reads will already have been scheduled, so we just see if enough data
+	 *    is available.
+	 * 2/ Hold off parity checks while parity dependent operations are in flight
+	 *    (RCW and RMW are protected by 'locked')
 	 */
-	if (syncing && locked == 0 &&
-	    !test_bit(STRIPE_INSYNC, &sh->state)) {
+	if ((syncing && locked == 0 &&
+	    !test_bit(STRIPE_INSYNC, &sh->state)) ||
+	    	test_bit(STRIPE_OP_CHECK, &sh->state)) {
+
 		set_bit(STRIPE_HANDLE, &sh->state);
+		/* Take one of the following actions:
+		 * 1/ start a check parity operation if (uptodate == disks)
+		 * 2/ finish a check parity operation and act on the result
+		 */
 		if (failed == 0) {
-			BUG_ON(uptodate != disks);
-			compute_parity5(sh, CHECK_PARITY);
-			uptodate--;
-			if (page_is_zero(sh->dev[sh->pd_idx].page)) {
-				/* parity is correct (on disc, not in buffer any more) */
-				set_bit(STRIPE_INSYNC, &sh->state);
-			} else {
-				conf->mddev->resync_mismatches += STRIPE_SECTORS;
-				if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
-					/* don't try to repair!! */
+			if (!test_bit(STRIPE_OP_CHECK, &sh->state)) {
+				BUG_ON(uptodate != disks);
+				set_bit(STRIPE_OP_CHECK, &sh->state);
+				set_bit(STRIPE_OP_CHECK_Gen, &sh->ops.state);
+				clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+				sh->ops.pending++;
+				uptodate--;
+			} else if (test_and_clear_bit(STRIPE_OP_CHECK_Done, &sh->ops.state)) {
+				clear_bit(STRIPE_OP_CHECK, &sh->state);
+
+				if (test_and_clear_bit(STRIPE_OP_CHECK_IsZero,
+							&sh->ops.state))
+					/* parity is correct (on disc, not in buffer any more) */
 					set_bit(STRIPE_INSYNC, &sh->state);
 				else {
-					compute_block(sh, sh->pd_idx);
-					uptodate++;
+					conf->mddev->resync_mismatches += STRIPE_SECTORS;
+					if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
+						/* don't try to repair!! */
+						set_bit(STRIPE_INSYNC, &sh->state);
+					else {
+						compute_block(sh, sh->pd_idx);
+						uptodate++;
+					}
 				}
 			}
 		}
-		if (!test_bit(STRIPE_INSYNC, &sh->state)) {
+
+		/* Wait for check parity operations to complete
+		 * before write-back
+		 */
+		if (!test_bit(STRIPE_INSYNC, &sh->state) &&
+			!test_bit(STRIPE_OP_CHECK, &sh->state)) {
+
 			/* either failed parity check, or recovery is happening */
 			if (failed==0)
 				failed_num = sh->pd_idx;
+
 			dev = &sh->dev[failed_num];
 			BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
 			BUG_ON(uptodate != disks);


* [PATCH 04/19] raid5: move compute block operations to a workqueue
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (2 preceding siblings ...)
  2006-09-11 23:17 ` [PATCH 03/19] raid5: move check parity " Dan Williams
@ 2006-09-11 23:17 ` Dan Williams
  2006-09-11 23:18 ` [PATCH 05/19] raid5: move read completion copies " Dan Williams
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:17 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Enable handle_stripe5 to pass off compute block operations to
raid5_do_soft_block_ops, formerly handled by compute_block.

Here are a few notes about the new flags R5_ComputeReq and
STRIPE_OP_COMPUTE_Recover_pd:

Previously, when handle_stripe5 found a block that needed to be computed, it
updated it in the same step.  Now that these operations are separated
(across multiple calls to handle_stripe5), an R5_ComputeReq flag is needed
to tell other parts of handle_stripe5 to treat the block under computation
as if it were up to date.  The order of events in the work queue ensures that the
block is indeed up to date before performing further operations.

STRIPE_OP_COMPUTE_Recover_pd was added to track when the parity block is being
computed due to a failed parity check.  This allows the code in
handle_stripe5 that produces requests for check_parity and compute_block
operations to be separate from the code that consumes the result.

Changelog:
* count blocks under computation as uptodate
* removed handle_compute_operations5.  All logic moved into handle_stripe5
so that we do not need to go through the initiation logic to end the
operation.
* since the write operations mark blocks !uptodate, we hold off the code that
computes/reads blocks until the write completes.
* new compute block operations and reads are held off while a compute is in
flight
* do not compute a block while a check parity operation is pending, and do
not start a new check parity operation while a compute operation is pending
* STRIPE_OP_COMPUTE_Recover_pd holds off the clearing of the STRIPE_OP_COMPUTE state.
This allows the transition to be handled by the check parity logic that
writes recomputed parity to disk.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c |  153 ++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 107 insertions(+), 46 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 24ed4d8..0c39203 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1300,7 +1300,8 @@ static int handle_write_operations5(stru
 		}
 	} else {
 		/* enter stage 1 of read modify write operation */
-		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
+		BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
+			test_bit(R5_ComputeReq, &sh->dev[pd_idx].flags)));
 
 		set_bit(STRIPE_OP_RMW, &sh->state);
 		set_bit(STRIPE_OP_RMW_ParityPre, &sh->ops.state);
@@ -1314,7 +1315,8 @@ static int handle_write_operations5(stru
 			 * so we distinguish these blocks by the RMWReq bit
 			 */
 			if (dev->towrite &&
-			    test_bit(R5_UPTODATE, &dev->flags)) {
+			    (test_bit(R5_UPTODATE, &dev->flags) ||
+			    test_bit(R5_ComputeReq, &dev->flags))) {
 				set_bit(R5_RMWReq, &dev->flags);
 				set_bit(R5_LOCKED, &dev->flags);
 				clear_bit(R5_UPTODATE, &dev->flags);
@@ -1748,7 +1750,7 @@ static void handle_stripe5(struct stripe
 	int i;
 	int syncing, expanding, expanded;
 	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-	int non_overwrite=0, write_complete=0;
+	int compute=0, non_overwrite=0, write_complete=0;
 	int failed_num=0;
 	struct r5dev *dev;
 
@@ -1799,7 +1801,7 @@ static void handle_stripe5(struct stripe
 		/* now count some things */
 		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
-
+		if (test_bit(R5_ComputeReq, &dev->flags)) BUG_ON(++compute > 1);
 		
 		if (dev->toread) to_read++;
 		if (dev->towrite) {
@@ -1955,40 +1957,83 @@ static void handle_stripe5(struct stripe
 	 * parity, or to satisfy requests
 	 * or to load a block that is being partially written.
 	 */
-	if (to_read || non_overwrite || (syncing && (uptodate < disks)) || expanding) {
-		for (i=disks; i--;) {
-			dev = &sh->dev[i];
-			if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
-			    (dev->toread ||
-			     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
-			     syncing ||
-			     expanding ||
-			     (failed && (sh->dev[failed_num].toread ||
-					 (sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
-				    )
-				) {
-				/* we would like to get this block, possibly
-				 * by computing it, but we might not be able to
+	if (to_read || non_overwrite || (syncing && (uptodate + compute < disks)) || expanding ||
+		test_bit(STRIPE_OP_COMPUTE, &sh->state)) {
+		/* Finish any pending compute operations.  Parity recovery implies
+		 * a write-back which is handled later on in this routine
+		 */
+		if (test_bit(STRIPE_OP_COMPUTE, &sh->state) &&
+			test_bit(STRIPE_OP_COMPUTE_Done, &sh->ops.state) &&
+			!test_bit(STRIPE_OP_COMPUTE_Recover_pd, &sh->ops.state)) {
+			clear_bit(STRIPE_OP_COMPUTE, &sh->state);
+			clear_bit(STRIPE_OP_COMPUTE_Done, &sh->ops.state);
+		}
+		
+		/* blocks being written are temporarily !UPTODATE */
+		if (!test_bit(STRIPE_OP_COMPUTE, &sh->state) &&
+			!test_bit(STRIPE_OP_RCW, &sh->state) &&
+			!test_bit(STRIPE_OP_RMW, &sh->state)) {
+			for (i=disks; i--;) {
+				dev = &sh->dev[i];
+
+				/* don't schedule compute operations or reads on
+				 * the parity block while a check is in flight
 				 */
-				if (uptodate == disks-1) {
-					PRINTK("Computing block %d\n", i);
-					compute_block(sh, i);
-					uptodate++;
-				} else if (test_bit(R5_Insync, &dev->flags)) {
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
+				if ((i == sh->pd_idx) && test_bit(STRIPE_OP_CHECK, &sh->state))
+					continue;
+
+				if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
+				     (dev->toread ||
+				     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
+				     syncing ||
+				     expanding ||
+				     (failed && (sh->dev[failed_num].toread ||
+						 (sh->dev[failed_num].towrite &&
+						 	!test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
+					    )
+					) {
+					/* 1/ We would like to get this block, possibly
+					 * by computing it, but we might not be able to.
+					 *
+					 * 2/ Since parity check operations make the parity
+					 * block !uptodate it will need to be refreshed
+					 * before any compute operations on data disks are
+					 * scheduled.
+					 *
+					 * 3/ We hold off parity block re-reads until check
+					 * operations have quiesced.
+					 */
+					if ((uptodate == disks-1) && !test_bit(STRIPE_OP_CHECK, &sh->state)) {
+						set_bit(STRIPE_OP_COMPUTE, &sh->state);
+						set_bit(STRIPE_OP_COMPUTE_Prep, &sh->ops.state);
+						set_bit(R5_ComputeReq, &dev->flags);
+						sh->ops.pending++;
+						/* Careful: from this point on 'uptodate' is in the eye of the
+						 * workqueue which services 'compute' operations before writes.
+						 * R5_ComputeReq flags blocks that will be R5_UPTODATE
+						 * in the work queue.
+						 */
+						uptodate++;
+					} else if ((uptodate < disks-1) && test_bit(R5_Insync, &dev->flags)) {
+						/* Note: we hold off compute operations while checks are in flight,
+						 * but we still prefer 'compute' over 'read' hence we only read if
+						 * (uptodate < disks-1)
+						 */
+						set_bit(R5_LOCKED, &dev->flags);
+						set_bit(R5_Wantread, &dev->flags);
 #if 0
-					/* if I am just reading this block and we don't have
-					   a failed drive, or any pending writes then sidestep the cache */
-					if (sh->bh_read[i] && !sh->bh_read[i]->b_reqnext &&
-					    ! syncing && !failed && !to_write) {
-						sh->bh_cache[i]->b_page =  sh->bh_read[i]->b_page;
-						sh->bh_cache[i]->b_data =  sh->bh_read[i]->b_data;
-					}
+						/* if I am just reading this block and we don't have
+						   a failed drive, or any pending writes then sidestep the cache */
+						if (sh->bh_read[i] && !sh->bh_read[i]->b_reqnext &&
+						    ! syncing && !failed && !to_write) {
+							sh->bh_cache[i]->b_page =  sh->bh_read[i]->b_page;
+							sh->bh_cache[i]->b_data =  sh->bh_read[i]->b_data;
+						}
 #endif
-					locked++;
-					PRINTK("Reading block %d (sync=%d)\n", 
-						i, syncing);
+						locked++;
+						PRINTK("Reading block %d (sync=%d)\n", 
+							i, syncing);
+					}
 				}
 			}
 		}
@@ -2055,7 +2100,7 @@ #if 0
 || sh->bh_page[i]!=bh->b_page
 #endif
 				    ) &&
-			    !test_bit(R5_UPTODATE, &dev->flags)) {
+			    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_ComputeReq, &dev->flags))) {
 				if (test_bit(R5_Insync, &dev->flags)
 /*				    && !(!mddev->insync && i == sh->pd_idx) */
 					)
@@ -2069,7 +2114,7 @@ #if 0
 || sh->bh_page[i] != bh->b_page
 #endif
 				    ) &&
-			    !test_bit(R5_UPTODATE, &dev->flags)) {
+			    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_ComputeReq, &dev->flags))) {
 				if (test_bit(R5_Insync, &dev->flags)) rcw++;
 				else rcw += 2*disks;
 			}
@@ -2082,7 +2127,8 @@ #endif
 			for (i=disks; i--;) {
 				dev = &sh->dev[i];
 				if ((dev->towrite || i == sh->pd_idx) &&
-				    !test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
+				    !test_bit(R5_LOCKED, &dev->flags) &&
+				    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_ComputeReq, &dev->flags)) &&
 				    test_bit(R5_Insync, &dev->flags)) {
 					if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 					{
@@ -2101,7 +2147,8 @@ #endif
 			for (i=disks; i--;) {
 				dev = &sh->dev[i];
 				if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
-				    !test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
+				    !test_bit(R5_LOCKED, &dev->flags) &&
+				    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_ComputeReq, &dev->flags)) &&
 				    test_bit(R5_Insync, &dev->flags)) {
 					if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 					{
@@ -2127,16 +2174,19 @@ #endif
 	 * 2/ Hold off parity checks while parity dependent operations are in flight
 	 *    (RCW and RMW are protected by 'locked')
 	 */
-	if ((syncing && locked == 0 &&
-	    !test_bit(STRIPE_INSYNC, &sh->state)) ||
-	    	test_bit(STRIPE_OP_CHECK, &sh->state)) {
+	if ((syncing && locked == 0 && !test_bit(STRIPE_OP_COMPUTE, &sh->state) &&
+		!test_bit(STRIPE_INSYNC, &sh->state)) ||
+	    	test_bit(STRIPE_OP_CHECK, &sh->state) ||
+	    	test_bit(STRIPE_OP_COMPUTE_Recover_pd, &sh->ops.state)) {
 
 		set_bit(STRIPE_HANDLE, &sh->state);
 		/* Take one of the following actions:
 		 * 1/ start a check parity operation if (uptodate == disks)
 		 * 2/ finish a check parity operation and act on the result
+		 * 3/ skip to the writeback section if we previously 
+		 *    initiated a recovery operation
 		 */
-		if (failed == 0) {
+		if (failed == 0 && !test_bit(STRIPE_OP_COMPUTE_Recover_pd, &sh->ops.state)) {
 			if (!test_bit(STRIPE_OP_CHECK, &sh->state)) {
 				BUG_ON(uptodate != disks);
 				set_bit(STRIPE_OP_CHECK, &sh->state);
@@ -2157,18 +2207,29 @@ #endif
 						/* don't try to repair!! */
 						set_bit(STRIPE_INSYNC, &sh->state);
 					else {
-						compute_block(sh, sh->pd_idx);
+						set_bit(STRIPE_OP_COMPUTE, &sh->state);
+						set_bit(STRIPE_OP_COMPUTE_Recover_pd, &sh->ops.state);
+						set_bit(STRIPE_OP_COMPUTE_Prep, &sh->ops.state);
+						set_bit(R5_ComputeReq, &sh->dev[sh->pd_idx].flags);
+						sh->ops.pending++;
 						uptodate++;
 					}
 				}
 			}
 		}
+		if (test_bit(STRIPE_OP_COMPUTE_Done, &sh->ops.state) &&
+			test_bit(STRIPE_OP_COMPUTE_Recover_pd, &sh->ops.state)) {
+			clear_bit(STRIPE_OP_COMPUTE, &sh->state);
+			clear_bit(STRIPE_OP_COMPUTE_Done, &sh->ops.state);
+			clear_bit(STRIPE_OP_COMPUTE_Recover_pd, &sh->ops.state);
+		}
 
-		/* Wait for check parity operations to complete
+		/* Wait for check parity and compute block operations to complete
 		 * before write-back
 		 */
 		if (!test_bit(STRIPE_INSYNC, &sh->state) &&
-			!test_bit(STRIPE_OP_CHECK, &sh->state)) {
+			!test_bit(STRIPE_OP_CHECK, &sh->state) &&
+			!test_bit(STRIPE_OP_COMPUTE, &sh->state)) {
 
 			/* either failed parity check, or recovery is happening */
 			if (failed==0)


* [PATCH 05/19] raid5: move read completion copies to a workqueue
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (3 preceding siblings ...)
  2006-09-11 23:17 ` [PATCH 04/19] raid5: move compute block " Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:18 ` [PATCH 06/19] raid5: move the reconstruct write expansion operation " Dan Williams
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Enable handle_stripe5 to hand off the memory copy operations that satisfy
read requests to raid5_do_soft_block_ops; formerly this was handled inline
within handle_stripe5.

It adds a 'read' (past tense) pointer to the r5dev structure
to track reads that have been offloaded to the workqueue.  When the copy
operation is complete the 'read' pointer is reused as the return_bi for the
bi_end_io() call.
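
Putting the two halves of the handoff together (the producer side is the
biofill section added in patch 1, the consumer side is in the hunk below):

	/* work queue (patch 1): chain the serviced bios onto dev->read */
	dev->read = return_bi;

	/* handle_stripe5 (below): once the biofill operation has completed
	 * and R5_ReadReq is clear, move the chain to return_bi so the
	 * normal bi_end_io() completion path picks it up
	 */
	if (dev->read && !test_bit(R5_ReadReq, &dev->flags) &&
	    !test_bit(STRIPE_OP_BIOFILL, &sh->state)) {
		return_bi = dev->read;
		dev->read = NULL;
	}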

Changelog:
* dev->read only holds reads that have been satisfied, previously it
doubled as a request queue to the operations routine
* added R5_ReadReq to mark the blocks that belong to a given bio fill
operation
* requested reads no longer count towards the 'to_read' count; 'to_fill'
tracks the number of requested reads

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c |   67 +++++++++++++++++++++++++++++-----------------------
 1 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0c39203..1a8dfd2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -240,11 +240,11 @@ static void init_stripe(struct stripe_he
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
-		if (dev->toread || dev->towrite || dev->written ||
+		if (dev->toread || dev->read || dev->towrite || dev->written ||
 		    test_bit(R5_LOCKED, &dev->flags)) {
-			printk("sector=%llx i=%d %p %p %p %d\n",
+			printk("sector=%llx i=%d %p %p %p %p %d\n",
 			       (unsigned long long)sh->sector, i, dev->toread,
-			       dev->towrite, dev->written,
+			       dev->read, dev->towrite, dev->written,
 			       test_bit(R5_LOCKED, &dev->flags));
 			BUG();
 		}
@@ -1749,7 +1749,7 @@ static void handle_stripe5(struct stripe
 	struct bio *bi;
 	int i;
 	int syncing, expanding, expanded;
-	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
+	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0, to_fill=0;
 	int compute=0, non_overwrite=0, write_complete=0;
 	int failed_num=0;
 	struct r5dev *dev;
@@ -1765,44 +1765,47 @@ static void handle_stripe5(struct stripe
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
-	/* Now to look around and see what can be done */
 
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->state) &&
+		test_bit(STRIPE_OP_BIOFILL_Done, &sh->ops.state)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->state);
+		clear_bit(STRIPE_OP_BIOFILL_Done, &sh->ops.state);
+	}
+
+	/* Now to look around and see what can be done */
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
-		PRINTK("check %d: state 0x%lx read %p write %p written %p\n",
-			i, dev->flags, dev->toread, dev->towrite, dev->written);
+		PRINTK("check %d: state 0x%lx toread %p read %p write %p written %p\n",
+		i, dev->flags, dev->toread, dev->read, dev->towrite, dev->written);
+
+		/* maybe we can acknowledge completion of a biofill operation */
+		if (test_bit(R5_ReadReq, &dev->flags) && !dev->toread)
+			clear_bit(R5_ReadReq, &dev->flags);
+
 		/* maybe we can reply to a read */
+		if (dev->read && !test_bit(R5_ReadReq, &dev->flags) &&
+			!test_bit(STRIPE_OP_BIOFILL, &sh->state)) {
+			return_bi = dev->read;
+			dev->read = NULL;
+		}
+
+		/* maybe we can start a biofill operation */
 		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
-			struct bio *rbi, *rbi2;
-			PRINTK("Return read for disc %d\n", i);
-			spin_lock_irq(&conf->device_lock);
-			rbi = dev->toread;
-			dev->toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&conf->wait_for_overlap);
-			spin_unlock_irq(&conf->device_lock);
-			while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				copy_data(0, rbi, dev->page, dev->sector);
-				rbi2 = r5_next_bio(rbi, dev->sector);
-				spin_lock_irq(&conf->device_lock);
-				if (--rbi->bi_phys_segments == 0) {
-					rbi->bi_next = return_bi;
-					return_bi = rbi;
-				}
-				spin_unlock_irq(&conf->device_lock);
-				rbi = rbi2;
-			}
+			to_read--;
+			if (!test_bit(STRIPE_OP_BIOFILL, &sh->state))
+				set_bit(R5_ReadReq, &dev->flags);
 		}
 
 		/* now count some things */
 		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+		if (test_bit(R5_ReadReq, &dev->flags)) to_fill++;
 		if (test_bit(R5_ComputeReq, &dev->flags)) BUG_ON(++compute > 1);
-		
+
 		if (dev->toread) to_read++;
 		if (dev->towrite) {
 			to_write++;
@@ -1824,9 +1827,15 @@ static void handle_stripe5(struct stripe
 			set_bit(R5_Insync, &dev->flags);
 	}
 	rcu_read_unlock();
+
+	if (to_fill && !test_bit(STRIPE_OP_BIOFILL, &sh->state)) {
+		set_bit(STRIPE_OP_BIOFILL, &sh->state);
+		sh->ops.pending++;
+	}
+
 	PRINTK("locked=%d uptodate=%d to_read=%d"
-		" to_write=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, failed, failed_num);
+		" to_write=%d to_fill=%d failed=%d failed_num=%d\n",
+		locked, uptodate, to_read, to_write, to_fill, failed, failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
 	 */

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 06/19] raid5: move the reconstruct write expansion operation to a workqueue
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (4 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 05/19] raid5: move read completion copies " Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:18 ` [PATCH 07/19] raid5: remove compute_block and compute_parity5 Dan Williams
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Enable handle_stripe5 to use the reconstruct write operations capability
for expansion operations.  

However, this does not move the copy operation associated with an expand to
the workqueue.  First, it was difficult to find a clean way to pass the
parameters of this operation to the queue.  Second, this section of code is
a good candidate for performing the copies with inline calls to the dma
routines.
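
As an illustration of that second point, the per-page expansion copy could
be issued directly on a memcpy-capable dma channel instead of being queued;
a rough sketch using the existing dmaengine memcpy interface (illustration
only -- 'chan' is assumed to be a channel already obtained from the
dmaengine core, and later patches in this series rename the issue_pending
call):

	dma_cookie_t cookie;

	/* copy one stripe page from the source stripe to the expanded stripe */
	cookie = dma_async_memcpy_pg_to_pg(chan,
			sh2->dev[dd_idx].page, 0,	/* dest page, offset 0 */
			sh->dev[i].page, 0,		/* src page, offset 0 */
			STRIPE_SIZE);
	dma_async_memcpy_issue_pending(chan);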

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c |   36 +++++++++++++++++++++++++++---------
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1a8dfd2..a07b52b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2053,6 +2053,7 @@ #endif
 	 * completed
 	 */
 	if (test_bit(STRIPE_OP_RCW, &sh->state) &&
+		!test_bit(STRIPE_OP_RCW_Expand, &sh->ops.state) &&
 		test_bit(STRIPE_OP_RCW_Done, &sh->ops.state)) {
 		clear_bit(STRIPE_OP_RCW, &sh->state);
 		clear_bit(STRIPE_OP_RCW_Done, &sh->ops.state);
@@ -2226,6 +2227,7 @@ #endif
 				}
 			}
 		}
+
 		if (test_bit(STRIPE_OP_COMPUTE_Done, &sh->ops.state) &&
 			test_bit(STRIPE_OP_COMPUTE_Recover_pd, &sh->ops.state)) {
 			clear_bit(STRIPE_OP_COMPUTE, &sh->state);
@@ -2282,18 +2284,28 @@ #endif
 		}
 	}
 
-	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
+	/* Finish 'rcw' operations initiated by the expansion
+	 * process
+	 */
+	if (test_bit(STRIPE_OP_RCW, &sh->state) &&
+		test_bit(STRIPE_OP_RCW_Expand, &sh->ops.state) &&
+		test_bit(STRIPE_OP_RCW_Done, &sh->ops.state)) {
+		clear_bit(STRIPE_OP_RCW, &sh->state);
+		clear_bit(STRIPE_OP_RCW_Done, &sh->ops.state);
+		clear_bit(STRIPE_OP_RCW_Expand, &sh->ops.state);
+		clear_bit(STRIPE_EXPANDING, &sh->state);
+		for (i= conf->raid_disks; i--;)
+			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+	}
+
+	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+		!test_bit(STRIPE_OP_RCW, &sh->state)) {
 		/* Need to write out all blocks after computing parity */
 		sh->disks = conf->raid_disks;
 		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
-		compute_parity5(sh, RECONSTRUCT_WRITE);
-		for (i= conf->raid_disks; i--;) {
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			locked++;
-			set_bit(R5_Wantwrite, &sh->dev[i].flags);
-		}
-		clear_bit(STRIPE_EXPANDING, &sh->state);
-	} else if (expanded) {
+		set_bit(STRIPE_OP_RCW_Expand, &sh->ops.state);
+		locked += handle_write_operations5(sh, 0);
+	} else if (expanded && !test_bit(STRIPE_OP_RCW, &sh->state)) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
 		wake_up(&conf->wait_for_overlap);
@@ -2327,9 +2339,15 @@ #endif
 					release_stripe(sh2);
 					continue;
 				}
+				/* to do: perform these operations with a dma engine
+				 * inline (rather than pushing to the workqueue)
+				 */
+				/*#ifdef CONFIG_RAID5_DMA*/
+				/*#else*/
 				memcpy(page_address(sh2->dev[dd_idx].page),
 				       page_address(sh->dev[i].page),
 				       STRIPE_SIZE);
+				/*#endif*/
 				set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
 				set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
 				for (j=0; j<conf->raid_disks; j++)

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 07/19] raid5: remove compute_block and compute_parity5
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (5 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 06/19] raid5: move the reconstruct write expansion operation " Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:18 ` [PATCH 08/19] dmaengine: enable multiple clients and operations Dan Williams
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

compute_block and compute_parity5 are replaced by the workqueue implementation.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c |  123 ----------------------------------------------------
 1 files changed, 0 insertions(+), 123 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a07b52b..ad6883b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -964,129 +964,6 @@ #define check_xor() 	do { 						\
 			} while(0)
 
 
-static void compute_block(struct stripe_head *sh, int dd_idx)
-{
-	int i, count, disks = sh->disks;
-	void *ptr[MAX_XOR_BLOCKS], *p;
-
-	PRINTK("compute_block, stripe %llu, idx %d\n", 
-		(unsigned long long)sh->sector, dd_idx);
-
-	ptr[0] = page_address(sh->dev[dd_idx].page);
-	memset(ptr[0], 0, STRIPE_SIZE);
-	count = 1;
-	for (i = disks ; i--; ) {
-		if (i == dd_idx)
-			continue;
-		p = page_address(sh->dev[i].page);
-		if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
-			ptr[count++] = p;
-		else
-			printk(KERN_ERR "compute_block() %d, stripe %llu, %d"
-				" not present\n", dd_idx,
-				(unsigned long long)sh->sector, i);
-
-		check_xor();
-	}
-	if (count != 1)
-		xor_block(count, STRIPE_SIZE, ptr);
-	set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
-}
-
-static void compute_parity5(struct stripe_head *sh, int method)
-{
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
-	void *ptr[MAX_XOR_BLOCKS];
-	struct bio *chosen;
-
-	PRINTK("compute_parity5, stripe %llu, method %d\n",
-		(unsigned long long)sh->sector, method);
-
-	count = 1;
-	ptr[0] = page_address(sh->dev[pd_idx].page);
-	switch(method) {
-	case READ_MODIFY_WRITE:
-		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
-		for (i=disks ; i-- ;) {
-			if (i==pd_idx)
-				continue;
-			if (sh->dev[i].towrite &&
-			    test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-				BUG_ON(sh->dev[i].written);
-				sh->dev[i].written = chosen;
-				check_xor();
-			}
-		}
-		break;
-	case RECONSTRUCT_WRITE:
-		memset(ptr[0], 0, STRIPE_SIZE);
-		for (i= disks; i-- ;)
-			if (i!=pd_idx && sh->dev[i].towrite) {
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-				BUG_ON(sh->dev[i].written);
-				sh->dev[i].written = chosen;
-			}
-		break;
-	case CHECK_PARITY:
-		break;
-	}
-	if (count>1) {
-		xor_block(count, STRIPE_SIZE, ptr);
-		count = 1;
-	}
-	
-	for (i = disks; i--;)
-		if (sh->dev[i].written) {
-			sector_t sector = sh->dev[i].sector;
-			struct bio *wbi = sh->dev[i].written;
-			while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
-				copy_data(1, wbi, sh->dev[i].page, sector);
-				wbi = r5_next_bio(wbi, sector);
-			}
-
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			set_bit(R5_UPTODATE, &sh->dev[i].flags);
-		}
-
-	switch(method) {
-	case RECONSTRUCT_WRITE:
-	case CHECK_PARITY:
-		for (i=disks; i--;)
-			if (i != pd_idx) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				check_xor();
-			}
-		break;
-	case READ_MODIFY_WRITE:
-		for (i = disks; i--;)
-			if (sh->dev[i].written) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				check_xor();
-			}
-	}
-	if (count != 1)
-		xor_block(count, STRIPE_SIZE, ptr);
-	
-	if (method != CHECK_PARITY) {
-		set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
-		set_bit(R5_LOCKED,   &sh->dev[pd_idx].flags);
-	} else
-		clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
-}
-
 static void compute_parity6(struct stripe_head *sh, int method)
 {
 	raid6_conf_t *conf = sh->raid_conf;

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 08/19] dmaengine: enable multiple clients and operations
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (6 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 07/19] raid5: remove compute_block and compute_parity5 Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:44   ` Jeff Garzik
  2006-09-11 23:18 ` [PATCH 09/19] dmaengine: reduce backend address permutations Dan Williams
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Enable the dmaengine interface to allow multiple clients to share a
channel, and enable clients to request channels based on an operations
capability mask.  This prepares the interface for use with the RAID5 client
and the future RAID6 client.

Multi-client support is achieved by modifying channels to maintain a list
of peer clients.

Multi-operation support is achieved by modifying clients to maintain lists
of channel references.  Channel references in a given request list satisfy
a client-specified capability mask.
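
Concretely, a client now names the capabilities it needs when requesting
channels.  A minimal usage sketch (my_event_callback stands in for the
client's dma_event_callback and is not part of this patch):

	struct dma_client *client;

	client = dma_async_client_register(my_event_callback);
	if (client)
		/* ask for one channel that can do both memcpy and xor;
		 * allocations/frees are reported via the event callback
		 */
		dma_async_client_chan_request(client, 1, DMA_MEMCPY | DMA_XOR);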

Changelog:
* make the dmaengine api EXPORT_SYMBOL_GPL
* zero sum support should be standalone, not integrated into xor

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/dma/dmaengine.c   |  357 ++++++++++++++++++++++++++++++++++++---------
 drivers/dma/ioatdma.c     |   12 +-
 include/linux/dmaengine.h |  164 ++++++++++++++++++---
 net/core/dev.c            |   21 +--
 net/ipv4/tcp.c            |    4 -
 5 files changed, 443 insertions(+), 115 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 1527804..e10f19d 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -37,8 +37,13 @@
  * Each device has a channels list, which runs unlocked but is never modified
  * once the device is registered, it's just setup by the driver.
  *
- * Each client has a channels list, it's only modified under the client->lock
- * and in an RCU callback, so it's safe to read under rcu_read_lock().
+ * Each client has 'n' lists of channel references where
+ * n == DMA_MAX_CHAN_TYPE_REQ.  These lists are only modified under the
+ * client->lock and in an RCU callback, so they are safe to read under
+ * rcu_read_lock().
+ *
+ * Each channel has a list of peer clients, it's only modified under the
+ * chan->lock.  This allows a channel to be shared amongst several clients
  *
  * Each device has a kref, which is initialized to 1 when the device is
  * registered. A kref_put is done for each class_device registered.  When the
@@ -85,6 +90,18 @@ static ssize_t show_memcpy_count(struct 
 	return sprintf(buf, "%lu\n", count);
 }
 
+static ssize_t show_xor_count(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+	unsigned long count = 0;
+	int i;
+
+	for_each_possible_cpu(i)
+		count += per_cpu_ptr(chan->local, i)->xor_count;
+
+	return sprintf(buf, "%lu\n", count);
+}
+
 static ssize_t show_bytes_transferred(struct class_device *cd, char *buf)
 {
 	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
@@ -97,16 +114,37 @@ static ssize_t show_bytes_transferred(st
 	return sprintf(buf, "%lu\n", count);
 }
 
+static ssize_t show_bytes_xor(struct class_device *cd, char *buf)
+{
+	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+	unsigned long count = 0;
+	int i;
+
+	for_each_possible_cpu(i)
+		count += per_cpu_ptr(chan->local, i)->bytes_xor;
+
+	return sprintf(buf, "%lu\n", count);
+}
+
 static ssize_t show_in_use(struct class_device *cd, char *buf)
 {
+	unsigned int clients = 0;
+	struct list_head *peer;
 	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
 
-	return sprintf(buf, "%d\n", (chan->client ? 1 : 0));
+	rcu_read_lock();
+	list_for_each_rcu(peer, &chan->peers)
+		clients++;
+	rcu_read_unlock();
+
+	return sprintf(buf, "%d\n", clients);
 }
 
 static struct class_device_attribute dma_class_attrs[] = {
 	__ATTR(memcpy_count, S_IRUGO, show_memcpy_count, NULL),
+	__ATTR(xor_count, S_IRUGO, show_xor_count, NULL),
 	__ATTR(bytes_transferred, S_IRUGO, show_bytes_transferred, NULL),
+	__ATTR(bytes_xor, S_IRUGO, show_bytes_xor, NULL),
 	__ATTR(in_use, S_IRUGO, show_in_use, NULL),
 	__ATTR_NULL
 };
@@ -130,34 +168,79 @@ static struct class dma_devclass = {
 /**
  * dma_client_chan_alloc - try to allocate a channel to a client
  * @client: &dma_client
+ * @req: request descriptor
  *
  * Called with dma_list_mutex held.
  */
-static struct dma_chan *dma_client_chan_alloc(struct dma_client *client)
+static struct dma_chan *dma_client_chan_alloc(struct dma_client *client,
+	struct dma_req *req)
 {
 	struct dma_device *device;
 	struct dma_chan *chan;
+	struct dma_client_chan_peer *peer;
+	struct dma_chan_client_ref *chan_ref;
 	unsigned long flags;
 	int desc;	/* allocated descriptor count */
+	int allocated;	/* flag re-allocations */
 
-	/* Find a channel, any DMA engine will do */
+	/* Find a channel */
 	list_for_each_entry(device, &dma_device_list, global_node) {
+		if ((req->cap_mask & device->capabilities)
+			!= req->cap_mask)
+			continue;
 		list_for_each_entry(chan, &device->channels, device_node) {
-			if (chan->client)
+			allocated = 0;
+			rcu_read_lock();
+			list_for_each_entry_rcu(chan_ref, &req->channels, req_node) {
+				if (chan_ref->chan == chan) {
+					allocated = 1;
+					break;
+				}
+			}
+			rcu_read_unlock();
+
+			if (allocated)
 				continue;
 
+			/* can the channel be shared between multiple clients */
+			if ((req->exclusive && !list_empty(&chan->peers)) ||
+				chan->exclusive)
+				continue;
+
+			chan_ref = kmalloc(sizeof(*chan_ref), GFP_KERNEL);
+			if (!chan_ref)
+				continue;
+
+			peer = kmalloc(sizeof(*peer), GFP_KERNEL);
+			if (!peer) {
+				kfree(chan_ref);
+				continue;
+			}
+
 			desc = chan->device->device_alloc_chan_resources(chan);
-			if (desc >= 0) {
+			if (desc) {
 				kref_get(&device->refcount);
-				kref_init(&chan->refcount);
-				chan->slow_ref = 0;
-				INIT_RCU_HEAD(&chan->rcu);
-				chan->client = client;
+				kref_get(&chan->refcount);
+				INIT_RCU_HEAD(&peer->rcu);
+				INIT_RCU_HEAD(&chan_ref->rcu);
+				INIT_LIST_HEAD(&peer->peer_node);
+				INIT_LIST_HEAD(&chan_ref->req_node);
+				peer->client = client;
+				chan_ref->chan = chan;
+
+				spin_lock_irqsave(&chan->lock, flags);
+				list_add_tail_rcu(&peer->peer_node, &chan->peers);
+				spin_unlock_irqrestore(&chan->lock, flags);
+
 				spin_lock_irqsave(&client->lock, flags);
-				list_add_tail_rcu(&chan->client_node,
-				                  &client->channels);
+				chan->exclusive = req->exclusive ? client : NULL;
+				list_add_tail_rcu(&chan_ref->req_node,
+						&req->channels);
 				spin_unlock_irqrestore(&client->lock, flags);
 				return chan;
+			} else {
+				kfree(peer);
+				kfree(chan_ref);
 			}
 		}
 	}
@@ -173,7 +256,6 @@ void dma_chan_cleanup(struct kref *kref)
 {
 	struct dma_chan *chan = container_of(kref, struct dma_chan, refcount);
 	chan->device->device_free_chan_resources(chan);
-	chan->client = NULL;
 	kref_put(&chan->device->refcount, dma_async_device_cleanup);
 }
 
@@ -186,51 +268,93 @@ static void dma_chan_free_rcu(struct rcu
 		bias -= local_read(&per_cpu_ptr(chan->local, i)->refcount);
 	atomic_sub(bias, &chan->refcount.refcount);
 	kref_put(&chan->refcount, dma_chan_cleanup);
+	kref_put(&chan->device->refcount, dma_async_device_cleanup);
+}
+
+static void dma_peer_free_rcu(struct rcu_head *rcu)
+{
+	struct dma_client_chan_peer *peer =
+		container_of(rcu, struct dma_client_chan_peer, rcu);
+
+	kfree(peer);
+}
+
+static void dma_chan_ref_free_rcu(struct rcu_head *rcu)
+{
+	struct dma_chan_client_ref *chan_ref =
+		container_of(rcu, struct dma_chan_client_ref, rcu);
+
+	kfree(chan_ref);
 }
 
-static void dma_client_chan_free(struct dma_chan *chan)
+static void dma_client_chan_free(struct dma_client *client,
+				struct dma_chan_client_ref *chan_ref)
 {
+	struct dma_client_chan_peer *peer;
+	struct dma_chan *chan = chan_ref->chan;
 	atomic_add(0x7FFFFFFF, &chan->refcount.refcount);
 	chan->slow_ref = 1;
-	call_rcu(&chan->rcu, dma_chan_free_rcu);
+	rcu_read_lock();
+	list_for_each_entry_rcu(peer, &chan->peers, peer_node)
+		if (peer->client == client) {
+			unsigned long flags;
+
+			spin_lock_irqsave(&chan->lock, flags);
+			list_del_rcu(&peer->peer_node);
+			if (list_empty(&chan->peers))
+				chan->exclusive = NULL;
+			spin_unlock_irqrestore(&chan->lock, flags);
+			call_rcu(&peer->rcu, dma_peer_free_rcu);
+			call_rcu(&chan_ref->rcu, dma_chan_ref_free_rcu);
+			call_rcu(&chan->rcu, dma_chan_free_rcu);
+			break;
+		}
+	rcu_read_unlock();
 }
 
 /**
  * dma_chans_rebalance - reallocate channels to clients
  *
- * When the number of DMA channel in the system changes,
- * channels need to be rebalanced among clients.
+ * When the number of DMA channels in the system changes,
+ * channels need to be rebalanced among clients
  */
 static void dma_chans_rebalance(void)
 {
 	struct dma_client *client;
 	struct dma_chan *chan;
+	struct dma_chan_client_ref *chan_ref;
+
 	unsigned long flags;
+	int i;
 
 	mutex_lock(&dma_list_mutex);
 
 	list_for_each_entry(client, &dma_client_list, global_node) {
-		while (client->chans_desired > client->chan_count) {
-			chan = dma_client_chan_alloc(client);
-			if (!chan)
-				break;
-			client->chan_count++;
-			client->event_callback(client,
-	                                       chan,
-	                                       DMA_RESOURCE_ADDED);
-		}
-		while (client->chans_desired < client->chan_count) {
-			spin_lock_irqsave(&client->lock, flags);
-			chan = list_entry(client->channels.next,
-			                  struct dma_chan,
-			                  client_node);
-			list_del_rcu(&chan->client_node);
-			spin_unlock_irqrestore(&client->lock, flags);
-			client->chan_count--;
-			client->event_callback(client,
-			                       chan,
-			                       DMA_RESOURCE_REMOVED);
-			dma_client_chan_free(chan);
+		for (i = 0; i < DMA_MAX_CHAN_TYPE_REQ; i++) {
+			struct dma_req *req = &client->req[i];
+			while (req->chans_desired > atomic_read(&req->chan_count)) {
+				chan = dma_client_chan_alloc(client, req);
+				if (!chan)
+					break;
+				atomic_inc(&req->chan_count);
+				client->event_callback(client,
+		                                       chan,
+		                                       DMA_RESOURCE_ADDED);
+			}
+			while (req->chans_desired < atomic_read(&req->chan_count)) {
+				spin_lock_irqsave(&client->lock, flags);
+				chan_ref = list_entry(req->channels.next,
+				                  struct dma_chan_client_ref,
+				                  req_node);
+				list_del_rcu(&chan_ref->req_node);
+				spin_unlock_irqrestore(&client->lock, flags);
+				atomic_dec(&req->chan_count);
+
+				client->event_callback(client,
+				                       chan_ref->chan,
+				                       DMA_RESOURCE_REMOVED);
+				dma_client_chan_free(client, chan_ref);
+			}
 		}
 	}
 
@@ -244,15 +368,18 @@ static void dma_chans_rebalance(void)
 struct dma_client *dma_async_client_register(dma_event_callback event_callback)
 {
 	struct dma_client *client;
+	int i;
 
 	client = kzalloc(sizeof(*client), GFP_KERNEL);
 	if (!client)
 		return NULL;
 
-	INIT_LIST_HEAD(&client->channels);
+	for (i = 0; i < DMA_MAX_CHAN_TYPE_REQ; i++) {
+		INIT_LIST_HEAD(&client->req[i].channels);
+		atomic_set(&client->req[i].chan_count, 0);
+	}
+
 	spin_lock_init(&client->lock);
-	client->chans_desired = 0;
-	client->chan_count = 0;
 	client->event_callback = event_callback;
 
 	mutex_lock(&dma_list_mutex);
@@ -270,14 +397,16 @@ struct dma_client *dma_async_client_regi
  */
 void dma_async_client_unregister(struct dma_client *client)
 {
-	struct dma_chan *chan;
+	struct dma_chan_client_ref *chan_ref;
+	int i;
 
 	if (!client)
 		return;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(chan, &client->channels, client_node)
-		dma_client_chan_free(chan);
+	for (i = 0; i < DMA_MAX_CHAN_TYPE_REQ; i++)
+		list_for_each_entry_rcu(chan_ref, &client->req[i].channels, req_node)
+			dma_client_chan_free(client, chan_ref);
 	rcu_read_unlock();
 
 	mutex_lock(&dma_list_mutex);
@@ -292,17 +421,46 @@ void dma_async_client_unregister(struct 
  * dma_async_client_chan_request - request DMA channels
  * @client: &dma_client
  * @number: count of DMA channels requested
+ * @mask: limits the DMA channels returned to those that
+ *	have the requisite capabilities
  *
  * Clients call dma_async_client_chan_request() to specify how many
  * DMA channels they need, 0 to free all currently allocated.
  * The resulting allocations/frees are indicated to the client via the
- * event callback.
+ * event callback.  If the client has exhausted the number of distinct
+ * requests allowed (DMA_MAX_CHAN_TYPE_REQ) this function will return 0.
  */
-void dma_async_client_chan_request(struct dma_client *client,
-			unsigned int number)
+int dma_async_client_chan_request(struct dma_client *client,
+			unsigned int number, unsigned int mask)
 {
-	client->chans_desired = number;
-	dma_chans_rebalance();
+	int request_slot_found = 0, i;
+
+	/* adjust an outstanding request */
+	for (i = 0; i < DMA_MAX_CHAN_TYPE_REQ; i++) {
+		struct dma_req *req = &client->req[i];
+		if (req->cap_mask == mask) {
+			req->chans_desired = number;
+			request_slot_found = 1;
+			break;
+		}
+	}
+
+	/* start a new request */
+	if (!request_slot_found)
+		for (i = 0; i < DMA_MAX_CHAN_TYPE_REQ; i++) {
+			struct dma_req *req = &client->req[i];
+			if (!req->chans_desired) {
+				req->chans_desired = number;
+				req->cap_mask = mask;
+				request_slot_found = 1;
+				break;
+			}
+		}
+
+	if (request_slot_found)
+		dma_chans_rebalance();
+
+	return request_slot_found;
 }
 
 /**
@@ -335,6 +493,7 @@ int dma_async_device_register(struct dma
 		         device->dev_id, chan->chan_id);
 
 		kref_get(&device->refcount);
+		kref_init(&chan->refcount);
 		class_device_register(&chan->class_dev);
 	}
 
@@ -348,6 +507,20 @@ int dma_async_device_register(struct dma
 }
 
 /**
+ * dma_async_chan_init - common channel initialization
+ * @chan: &dma_chan
+ * @device: &dma_device
+ */
+void dma_async_chan_init(struct dma_chan *chan, struct dma_device *device)
+{
+	INIT_LIST_HEAD(&chan->peers);
+	INIT_RCU_HEAD(&chan->rcu);
+	spin_lock_init(&chan->lock);
+	chan->device = device;
+	list_add_tail(&chan->device_node, &device->channels);
+}
+
+/**
  * dma_async_device_cleanup - function called when all references are released
  * @kref: kernel reference object
  */
@@ -366,31 +539,70 @@ static void dma_async_device_cleanup(str
 void dma_async_device_unregister(struct dma_device *device)
 {
 	struct dma_chan *chan;
+	struct dma_client_chan_peer *peer;
+	struct dma_req *req;
+	struct dma_chan_client_ref *chan_ref;
+	struct dma_client *client;
+	int i;
 	unsigned long flags;
 
 	mutex_lock(&dma_list_mutex);
 	list_del(&device->global_node);
 	mutex_unlock(&dma_list_mutex);
 
+	/* look up and free each reference to a channel
+	 * note: a channel can be allocated to a client once per
+	 * request type (DMA_MAX_CHAN_TYPE_REQ)
+	 */
 	list_for_each_entry(chan, &device->channels, device_node) {
-		if (chan->client) {
-			spin_lock_irqsave(&chan->client->lock, flags);
-			list_del(&chan->client_node);
-			chan->client->chan_count--;
-			spin_unlock_irqrestore(&chan->client->lock, flags);
-			chan->client->event_callback(chan->client,
-			                             chan,
-			                             DMA_RESOURCE_REMOVED);
-			dma_client_chan_free(chan);
+		rcu_read_lock();
+		list_for_each_entry_rcu(peer, &chan->peers, peer_node) {
+			client = peer->client;
+			for (i = 0; i < DMA_MAX_CHAN_TYPE_REQ; i++) {
+				req = &client->req[i];
+				list_for_each_entry_rcu(chan_ref,
+							&req->channels,
+							req_node) {
+					if (chan_ref->chan == chan) {
+						spin_lock_irqsave(&client->lock, flags);
+						list_del_rcu(&chan_ref->req_node);
+						spin_unlock_irqrestore(&client->lock, flags);
+						atomic_dec(&req->chan_count);
+						client->event_callback(
+						client,
+						chan,
+						DMA_RESOURCE_REMOVED);
+						dma_client_chan_free(client,
+								     chan_ref);
+						break;
+					}
+				}
+			}
 		}
-		class_device_unregister(&chan->class_dev);
+		rcu_read_unlock();
+		kref_put(&chan->refcount, dma_chan_cleanup);
+		kref_put(&device->refcount, dma_async_device_cleanup);
 	}
+
+	class_device_unregister(&chan->class_dev);
+
 	dma_chans_rebalance();
 
 	kref_put(&device->refcount, dma_async_device_cleanup);
 	wait_for_completion(&device->done);
 }
 
+/**
+ * dma_async_xor_pgs_to_pg_err - default function for dma devices that
+ *	do not support xor
+ */
+dma_cookie_t dma_async_xor_pgs_to_pg_err(struct dma_chan *chan,
+	struct page *dest_pg, unsigned int dest_off, struct page *src_pgs,
+	unsigned int src_cnt, unsigned int src_off, size_t len)
+{
+	return -ENXIO;
+}
+
 static int __init dma_bus_init(void)
 {
 	mutex_init(&dma_list_mutex);
@@ -399,14 +611,17 @@ static int __init dma_bus_init(void)
 
 subsys_initcall(dma_bus_init);
 
-EXPORT_SYMBOL(dma_async_client_register);
-EXPORT_SYMBOL(dma_async_client_unregister);
-EXPORT_SYMBOL(dma_async_client_chan_request);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_complete);
-EXPORT_SYMBOL(dma_async_memcpy_issue_pending);
-EXPORT_SYMBOL(dma_async_device_register);
-EXPORT_SYMBOL(dma_async_device_unregister);
-EXPORT_SYMBOL(dma_chan_cleanup);
+EXPORT_SYMBOL_GPL(dma_async_client_register);
+EXPORT_SYMBOL_GPL(dma_async_client_unregister);
+EXPORT_SYMBOL_GPL(dma_async_client_chan_request);
+EXPORT_SYMBOL_GPL(dma_async_memcpy_buf_to_buf);
+EXPORT_SYMBOL_GPL(dma_async_memcpy_buf_to_pg);
+EXPORT_SYMBOL_GPL(dma_async_memcpy_pg_to_pg);
+EXPORT_SYMBOL_GPL(dma_async_xor_pgs_to_pg);
+EXPORT_SYMBOL_GPL(dma_async_operation_complete);
+EXPORT_SYMBOL_GPL(dma_async_issue_pending);
+EXPORT_SYMBOL_GPL(dma_async_device_register);
+EXPORT_SYMBOL_GPL(dma_async_device_unregister);
+EXPORT_SYMBOL_GPL(dma_chan_cleanup);
+EXPORT_SYMBOL_GPL(dma_async_xor_pgs_to_pg_err);
+EXPORT_SYMBOL_GPL(dma_async_chan_init);
diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c
index dbd4d6c..415de03 100644
--- a/drivers/dma/ioatdma.c
+++ b/drivers/dma/ioatdma.c
@@ -69,11 +69,7 @@ static int enumerate_dma_channels(struct
 		spin_lock_init(&ioat_chan->desc_lock);
 		INIT_LIST_HEAD(&ioat_chan->free_desc);
 		INIT_LIST_HEAD(&ioat_chan->used_desc);
-		/* This should be made common somewhere in dmaengine.c */
-		ioat_chan->common.device = &device->common;
-		ioat_chan->common.client = NULL;
-		list_add_tail(&ioat_chan->common.device_node,
-		              &device->common.channels);
+		dma_async_chan_init(&ioat_chan->common, &device->common);
 	}
 	return device->common.chancnt;
 }
@@ -759,8 +755,10 @@ #endif
 	device->common.device_memcpy_buf_to_buf = ioat_dma_memcpy_buf_to_buf;
 	device->common.device_memcpy_buf_to_pg = ioat_dma_memcpy_buf_to_pg;
 	device->common.device_memcpy_pg_to_pg = ioat_dma_memcpy_pg_to_pg;
-	device->common.device_memcpy_complete = ioat_dma_is_complete;
-	device->common.device_memcpy_issue_pending = ioat_dma_memcpy_issue_pending;
+	device->common.device_operation_complete = ioat_dma_is_complete;
+	device->common.device_xor_pgs_to_pg = dma_async_xor_pgs_to_pg_err;
+	device->common.device_issue_pending = ioat_dma_memcpy_issue_pending;
+	device->common.capabilities = DMA_MEMCPY;
 	printk(KERN_INFO "Intel(R) I/OAT DMA Engine found, %d channels\n",
 		device->common.chancnt);
 
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index c94d8f1..3599472 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -20,7 +20,7 @@
  */
 #ifndef DMAENGINE_H
 #define DMAENGINE_H
-
+#include <linux/config.h>
 #ifdef CONFIG_DMA_ENGINE
 
 #include <linux/device.h>
@@ -65,6 +65,27 @@ enum dma_status {
 };
 
 /**
+ * enum dma_capabilities - DMA operational capabilities
+ * @DMA_MEMCPY: src to dest copy
+ * @DMA_XOR: src*n to dest xor
+ * @DMA_DUAL_XOR: src*n to dest_diag and dest_horiz xor
+ * @DMA_PQ_XOR: src*n to dest_q and dest_p gf/xor
+ * @DMA_MEMCPY_CRC32C: src to dest copy and crc-32c sum
+ * @DMA_SHARE: multiple clients can use this channel
+ */
+enum dma_capabilities {
+	DMA_MEMCPY		= 0x1,
+	DMA_XOR			= 0x2,
+	DMA_PQ_XOR		= 0x4,
+	DMA_DUAL_XOR		= 0x8,
+	DMA_PQ_UPDATE		= 0x10,
+	DMA_ZERO_SUM		= 0x20,
+	DMA_PQ_ZERO_SUM		= 0x40,
+	DMA_MEMSET		= 0x80,
+	DMA_MEMCPY_CRC32C	= 0x100,
+};
+
+/**
  * struct dma_chan_percpu - the per-CPU part of struct dma_chan
  * @refcount: local_t used for open-coded "bigref" counting
  * @memcpy_count: transaction counter
@@ -75,27 +96,32 @@ struct dma_chan_percpu {
 	local_t refcount;
 	/* stats */
 	unsigned long memcpy_count;
+	unsigned long xor_count;
 	unsigned long bytes_transferred;
+	unsigned long bytes_xor;
 };
 
 /**
  * struct dma_chan - devices supply DMA channels, clients use them
- * @client: ptr to the client user of this chan, will be %NULL when unused
+ * @peers: list of the clients of this chan, will be 'empty' when unused
  * @device: ptr to the dma device who supplies this channel, always !%NULL
  * @cookie: last cookie value returned to client
+ * @exclusive: ptr to the client that is exclusively using this channel
+ * @lock: protects access to the peer list
  * @chan_id: channel ID for sysfs
  * @class_dev: class device for sysfs
  * @refcount: kref, used in "bigref" slow-mode
  * @slow_ref: indicates that the DMA channel is free
  * @rcu: the DMA channel's RCU head
- * @client_node: used to add this to the client chan list
  * @device_node: used to add this to the device chan list
  * @local: per-cpu pointer to a struct dma_chan_percpu
  */
 struct dma_chan {
-	struct dma_client *client;
+	struct list_head peers;
 	struct dma_device *device;
 	dma_cookie_t cookie;
+	struct dma_client *exclusive;
+	spinlock_t lock;
 
 	/* sysfs */
 	int chan_id;
@@ -105,7 +131,6 @@ struct dma_chan {
 	int slow_ref;
 	struct rcu_head rcu;
 
-	struct list_head client_node;
 	struct list_head device_node;
 	struct dma_chan_percpu *local;
 };
@@ -139,29 +164,66 @@ typedef void (*dma_event_callback) (stru
 		struct dma_chan *chan, enum dma_event event);
 
 /**
- * struct dma_client - info on the entity making use of DMA services
- * @event_callback: func ptr to call when something happens
+ * struct dma_req - info on the type and number of channels allocated to a client
  * @chan_count: number of chans allocated
  * @chans_desired: number of chans requested. Can be +/- chan_count
+ * @cap_mask: DMA capabilities required to satisfy this request
+ * @exclusive: Whether this client would like exclusive use of the channel(s)
+ */
+struct dma_req {
+	atomic_t	chan_count;
+	unsigned int	chans_desired;
+	unsigned int	cap_mask;
+	int		exclusive;
+	struct list_head channels;
+};
+
+/**
+ * struct dma_client - info on the entity making use of DMA services
+ * @event_callback: func ptr to call when something happens
+ * @dma_req: tracks client channel requests per capability mask
  * @lock: protects access to the channels list
  * @channels: the list of DMA channels allocated
  * @global_node: list_head for global dma_client_list
  */
+#define DMA_MAX_CHAN_TYPE_REQ 2
 struct dma_client {
 	dma_event_callback	event_callback;
-	unsigned int		chan_count;
-	unsigned int		chans_desired;
-
+	struct dma_req		req[DMA_MAX_CHAN_TYPE_REQ];
 	spinlock_t		lock;
-	struct list_head	channels;
 	struct list_head	global_node;
 };
 
 /**
+ * struct dma_client_chan_peer - info on the entities sharing a DMA channel
+ * @client: &dma_client
+ * @peer_node: node list of other clients on the channel
+ * @rcu: rcu head for the peer object
+ */
+struct dma_client_chan_peer {
+	struct dma_client *client;
+	struct list_head peer_node;
+	struct rcu_head rcu;
+};
+
+/**
+ * struct dma_chan_client_ref - reference object for clients to track channels
+ * @chan: channel reference
+ * @chan_node: node in the list of other channels on the client
+ * @rcu: rcu head for the chan_ref object
+ */
+struct dma_chan_client_ref {
+	struct dma_chan *chan;
+	struct list_head req_node;
+	struct rcu_head rcu;
+};
+
+/**
  * struct dma_device - info on the entity supplying DMA services
  * @chancnt: how many DMA channels are supported
  * @channels: the list of struct dma_chan
  * @global_node: list_head for global dma_device_list
+ * @capabilities: channel operations capabilities
  * @refcount: reference count
  * @done: IO completion struct
  * @dev_id: unique device ID
@@ -179,6 +241,7 @@ struct dma_device {
 	unsigned int chancnt;
 	struct list_head channels;
 	struct list_head global_node;
+	unsigned long capabilities;
 
 	struct kref refcount;
 	struct completion done;
@@ -195,18 +258,26 @@ struct dma_device {
 	dma_cookie_t (*device_memcpy_pg_to_pg)(struct dma_chan *chan,
 			struct page *dest_pg, unsigned int dest_off,
 			struct page *src_pg, unsigned int src_off, size_t len);
-	enum dma_status (*device_memcpy_complete)(struct dma_chan *chan,
+	dma_cookie_t (*device_xor_pgs_to_pg)(struct dma_chan *chan,
+			struct page *dest_pg, unsigned int dest_off,
+			struct page **src_pgs, unsigned int src_cnt,
+			unsigned int src_off, size_t len);
+	enum dma_status (*device_operation_complete)(struct dma_chan *chan,
 			dma_cookie_t cookie, dma_cookie_t *last,
 			dma_cookie_t *used);
-	void (*device_memcpy_issue_pending)(struct dma_chan *chan);
+	void (*device_issue_pending)(struct dma_chan *chan);
 };
 
 /* --- public DMA engine API --- */
 
 struct dma_client *dma_async_client_register(dma_event_callback event_callback);
 void dma_async_client_unregister(struct dma_client *client);
-void dma_async_client_chan_request(struct dma_client *client,
-		unsigned int number);
+int dma_async_client_chan_request(struct dma_client *client,
+		unsigned int number, unsigned int mask);
+void dma_async_chan_init(struct dma_chan *chan, struct dma_device *device);
+dma_cookie_t dma_async_xor_pgs_to_pg_err(struct dma_chan *chan,
+	struct page *dest_pg, unsigned int dest_off, struct page *src_pgs,
+	unsigned int src_cnt, unsigned int src_off, size_t len);
 
 /**
  * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
@@ -284,19 +355,65 @@ static inline dma_cookie_t dma_async_mem
 }
 
 /**
- * dma_async_memcpy_issue_pending - flush pending copies to HW
+ * dma_async_xor_pgs_to_pg - offloaded xor from pages to page
+ * @chan: DMA channel to offload xor to
+ * @dest_page: destination page
+ * @dest_off: offset in page to xor to
+ * @src_pgs: array of source pages
+ * @src_cnt: number of source pages
+ * @src_off: offset in pages to xor from
+ * @len: length
+ *
+ * Both @dest_page/@dest_off and @src_page/@src_off must be mappable to a bus
+ * address according to the DMA mapping API rules for streaming mappings.
+ * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
+ * (kernel memory or locked user space pages)
+ */
+static inline dma_cookie_t dma_async_xor_pgs_to_pg(struct dma_chan *chan,
+	struct page *dest_pg, unsigned int dest_off, struct page **src_pgs,
+	unsigned int src_cnt, unsigned int src_off, size_t len)
+{
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_xor += len * src_cnt;
+	per_cpu_ptr(chan->local, cpu)->xor_count++;
+	put_cpu();
+
+	return chan->device->device_xor_pgs_to_pg(chan, dest_pg, dest_off,
+		src_pgs, src_cnt, src_off, len);
+}
+
+/**
+ * dma_async_issue_pending - flush pending copies to HW
  * @chan: target DMA channel
  *
- * This allows drivers to push copies to HW in batches,
+ * This allows drivers to push operations to HW in batches,
  * reducing MMIO writes where possible.
  */
-static inline void dma_async_memcpy_issue_pending(struct dma_chan *chan)
+static inline void dma_async_issue_pending(struct dma_chan *chan)
+{
+	return chan->device->device_issue_pending(chan);
+}
+
+/**
+ * dma_async_issue_all - call dma_async_issue_pending on all channels
+ * @client: &dma_client
+ */
+static inline void dma_async_issue_all(struct dma_client *client)
 {
-	return chan->device->device_memcpy_issue_pending(chan);
+	int i;
+	struct dma_chan_client_ref *chan_ref;
+	struct dma_req *req;
+	for (i = 0; i < DMA_MAX_CHAN_TYPE_REQ; i++) {
+		req = &client->req[i];
+		rcu_read_lock();
+		list_for_each_entry_rcu(chan_ref, &req->channels, req_node)
+			dma_async_issue_pending(chan_ref->chan);
+		rcu_read_unlock();
+	}
 }
 
 /**
- * dma_async_memcpy_complete - poll for transaction completion
+ * dma_async_operations_complete - poll for transaction completion
  * @chan: DMA channel
  * @cookie: transaction identifier to check status of
  * @last: returns last completed cookie, can be NULL
@@ -306,10 +423,10 @@ static inline void dma_async_memcpy_issu
  * internal state and can be used with dma_async_is_complete() to check
  * the status of multiple cookies without re-checking hardware state.
  */
-static inline enum dma_status dma_async_memcpy_complete(struct dma_chan *chan,
+static inline enum dma_status dma_async_operation_complete(struct dma_chan *chan,
 	dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used)
 {
-	return chan->device->device_memcpy_complete(chan, cookie, last, used);
+	return chan->device->device_operation_complete(chan, cookie, last, used);
 }
 
 /**
@@ -318,7 +435,7 @@ static inline enum dma_status dma_async_
  * @last_complete: last know completed transaction
  * @last_used: last cookie value handed out
  *
- * dma_async_is_complete() is used in dma_async_memcpy_complete()
+ * dma_async_is_complete() is used in dma_async_operation_complete()
  * the test logic is seperated for lightweight testing of multiple cookies
  */
 static inline enum dma_status dma_async_is_complete(dma_cookie_t cookie,
@@ -334,7 +451,6 @@ static inline enum dma_status dma_async_
 	return DMA_IN_PROGRESS;
 }
 
-
 /* --- DMA device --- */
 
 int dma_async_device_register(struct dma_device *device);
diff --git a/net/core/dev.c b/net/core/dev.c
index d4a1ec3..9447f94 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1941,13 +1941,8 @@ #ifdef CONFIG_NET_DMA
 	 * There may not be any more sk_buffs coming right now, so push
 	 * any pending DMA copies to hardware
 	 */
-	if (net_dma_client) {
-		struct dma_chan *chan;
-		rcu_read_lock();
-		list_for_each_entry_rcu(chan, &net_dma_client->channels, client_node)
-			dma_async_memcpy_issue_pending(chan);
-		rcu_read_unlock();
-	}
+	if (net_dma_client)
+		dma_async_issue_all(net_dma_client);
 #endif
 	local_irq_enable();
 	return;
@@ -3410,7 +3405,8 @@ #ifdef CONFIG_NET_DMA
 static void net_dma_rebalance(void)
 {
 	unsigned int cpu, i, n;
-	struct dma_chan *chan;
+	struct dma_chan_client_ref *chan_ref;
+	struct dma_req *req;
 
 	if (net_dma_count == 0) {
 		for_each_online_cpu(cpu)
@@ -3421,13 +3417,16 @@ static void net_dma_rebalance(void)
 	i = 0;
 	cpu = first_cpu(cpu_online_map);
 
+	/* NET_DMA only requests one type of dma channel (memcpy) */
+	req = &net_dma_client->req[0];
+
 	rcu_read_lock();
-	list_for_each_entry(chan, &net_dma_client->channels, client_node) {
+	list_for_each_entry(chan_ref, &req->channels, req_node) {
 		n = ((num_online_cpus() / net_dma_count)
 		   + (i < (num_online_cpus() % net_dma_count) ? 1 : 0));
 
 		while(n) {
-			per_cpu(softnet_data, cpu).net_dma = chan;
+			per_cpu(softnet_data, cpu).net_dma = chan_ref->chan;
 			cpu = next_cpu(cpu, cpu_online_map);
 			n--;
 		}
@@ -3471,7 +3470,7 @@ static int __init netdev_dma_register(vo
 	if (net_dma_client == NULL)
 		return -ENOMEM;
 
-	dma_async_client_chan_request(net_dma_client, num_online_cpus());
+	dma_async_client_chan_request(net_dma_client, num_online_cpus(), DMA_MEMCPY);
 	return 0;
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 934396b..cd8ad41 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1431,9 +1431,9 @@ #ifdef CONFIG_NET_DMA
 		struct sk_buff *skb;
 		dma_cookie_t done, used;
 
-		dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
+		dma_async_issue_pending(tp->ucopy.dma_chan);
 
-		while (dma_async_memcpy_complete(tp->ucopy.dma_chan,
+		while (dma_async_operation_complete(tp->ucopy.dma_chan,
 		                                 tp->ucopy.dma_cookie, &done,
 		                                 &used) == DMA_IN_PROGRESS) {
 			/* do partial cleanup of sk_async_wait_queue */

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 09/19] dmaengine: reduce backend address permutations
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (7 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 08/19] dmaengine: enable multiple clients and operations Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-15 14:46   ` Olof Johansson
  2006-09-11 23:18 ` [PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients Dan Williams
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Change the backend dma driver API to accept a 'union dmaengine_addr'.  The
intent is to be able to support a wide range of frontend address type
permutations without needing an equal number of function type permutations
on the backend.
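
For example, a frontend wrapper that copies from a kernel buffer into a
page now reduces to filling in the union and passing the matching flags;
a hedged sketch of the new backend hook (illustration only -- 'chan',
'dest_page', 'kdata', 'dest_off' and 'len' are assumed locals, and this
mirrors the dma_async_memcpy_buf_to_pg wrapper below):

	dma_cookie_t cookie;
	union dmaengine_addr dest = { .pg = dest_page };
	union dmaengine_addr src = { .buf = kdata };

	/* page destination, kernel-buffer source; the driver performs the
	 * dma_map_* step implied by these flags
	 */
	cookie = chan->device->device_do_dma_memcpy(chan, dest, dest_off,
					src, 0, len,
					DMA_DEST_PAGE | DMA_SRC_BUF);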

Changelog:
* make the dmaengine api EXPORT_SYMBOL_GPL
* zero sum support should be standalone, not integrated into xor

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/dma/dmaengine.c   |   15 ++-
 drivers/dma/ioatdma.c     |  186 +++++++++++++++++--------------------------
 include/linux/dmaengine.h |  193 +++++++++++++++++++++++++++++++++++++++------
 3 files changed, 249 insertions(+), 145 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index e10f19d..9b02afa 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -593,12 +593,13 @@ void dma_async_device_unregister(struct 
 }
 
 /**
- * dma_async_xor_pgs_to_pg_err - default function for dma devices that
+ * dma_async_do_xor_err - default function for dma devices that
  *	do not support xor
  */
-dma_cookie_t dma_async_xor_pgs_to_pg_err(struct dma_chan *chan,
-	struct page *dest_pg, unsigned int dest_off, struct page *src_pgs,
-	unsigned int src_cnt, unsigned int src_off, size_t len)
+dma_cookie_t dma_async_do_xor_err(struct dma_chan *chan,
+		union dmaengine_addr dest, unsigned int dest_off,
+		union dmaengine_addr src, unsigned int src_cnt,
+		unsigned int src_off, size_t len, unsigned long flags)
 {
 	return -ENXIO;
 }
@@ -617,11 +618,15 @@ EXPORT_SYMBOL_GPL(dma_async_client_chan_
 EXPORT_SYMBOL_GPL(dma_async_memcpy_buf_to_buf);
 EXPORT_SYMBOL_GPL(dma_async_memcpy_buf_to_pg);
 EXPORT_SYMBOL_GPL(dma_async_memcpy_pg_to_pg);
+EXPORT_SYMBOL_GPL(dma_async_memcpy_dma_to_dma);
+EXPORT_SYMBOL_GPL(dma_async_memcpy_pg_to_dma);
+EXPORT_SYMBOL_GPL(dma_async_memcpy_dma_to_pg);
 EXPORT_SYMBOL_GPL(dma_async_xor_pgs_to_pg);
+EXPORT_SYMBOL_GPL(dma_async_xor_dma_list_to_dma);
 EXPORT_SYMBOL_GPL(dma_async_operation_complete);
 EXPORT_SYMBOL_GPL(dma_async_issue_pending);
 EXPORT_SYMBOL_GPL(dma_async_device_register);
 EXPORT_SYMBOL_GPL(dma_async_device_unregister);
 EXPORT_SYMBOL_GPL(dma_chan_cleanup);
-EXPORT_SYMBOL_GPL(dma_async_xor_pgs_to_pg_err);
+EXPORT_SYMBOL_GPL(dma_async_do_xor_err);
 EXPORT_SYMBOL_GPL(dma_async_chan_init);
diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c
index 415de03..dd5b9f0 100644
--- a/drivers/dma/ioatdma.c
+++ b/drivers/dma/ioatdma.c
@@ -213,20 +213,25 @@ static void ioat_dma_free_chan_resources
 
 /**
  * do_ioat_dma_memcpy - actual function that initiates a IOAT DMA transaction
- * @ioat_chan: IOAT DMA channel handle
- * @dest: DMA destination address
- * @src: DMA source address
+ * @chan: IOAT DMA channel handle
+ * @dest: DMAENGINE destination address
+ * @dest_off: Page offset
+ * @src: DMAENGINE source address
+ * @src_off: Page offset
  * @len: transaction length in bytes
  */
 
-static dma_cookie_t do_ioat_dma_memcpy(struct ioat_dma_chan *ioat_chan,
-                                       dma_addr_t dest,
-                                       dma_addr_t src,
-                                       size_t len)
+static dma_cookie_t do_ioat_dma_memcpy(struct dma_chan *dma_chan,
+                                       union dmaengine_addr dest,
+					unsigned int dest_off,
+                                       union dmaengine_addr src,
+					unsigned int src_off,
+                                       size_t len,
+                                       unsigned long flags)
 {
 	struct ioat_desc_sw *first;
 	struct ioat_desc_sw *prev;
-	struct ioat_desc_sw *new;
+	struct ioat_desc_sw *new = 0;
 	dma_cookie_t cookie;
 	LIST_HEAD(new_chain);
 	u32 copy;
@@ -234,16 +239,47 @@ static dma_cookie_t do_ioat_dma_memcpy(s
 	dma_addr_t orig_src, orig_dst;
 	unsigned int desc_count = 0;
 	unsigned int append = 0;
+	struct ioat_dma_chan *ioat_chan = to_ioat_chan(dma_chan);
 
-	if (!ioat_chan || !dest || !src)
+	if (!dma_chan || !dest.dma || !src.dma)
 		return -EFAULT;
 
 	if (!len)
 		return ioat_chan->common.cookie;
 
+	switch (flags & (DMA_SRC_BUF | DMA_SRC_PAGE | DMA_SRC_DMA)) {
+	case DMA_SRC_BUF:
+		src.dma = pci_map_single(ioat_chan->device->pdev,
+			src.buf, len, PCI_DMA_TODEVICE);
+		break;
+	case DMA_SRC_PAGE:
+		src.dma = pci_map_page(ioat_chan->device->pdev,
+			src.pg, src_off, len, PCI_DMA_TODEVICE);
+		break;
+	case DMA_SRC_DMA:
+		break;
+	default:
+		return -EFAULT;
+	}
+
+	switch (flags & (DMA_DEST_BUF | DMA_DEST_PAGE | DMA_DEST_DMA)) {
+	case DMA_DEST_BUF:
+		dest.dma = pci_map_single(ioat_chan->device->pdev,
+			dest.buf, len, PCI_DMA_FROMDEVICE);
+		break;
+	case DMA_DEST_PAGE:
+		dest.dma = pci_map_page(ioat_chan->device->pdev,
+			dest.pg, dest_off, len, PCI_DMA_FROMDEVICE);
+		break;
+	case DMA_DEST_DMA:
+		break;
+	default:
+		return -EFAULT;
+	}
+
 	orig_len = len;
-	orig_src = src;
-	orig_dst = dest;
+	orig_src = src.dma;
+	orig_dst = dest.dma;
 
 	first = NULL;
 	prev = NULL;
@@ -266,8 +302,8 @@ static dma_cookie_t do_ioat_dma_memcpy(s
 
 		new->hw->size = copy;
 		new->hw->ctl = 0;
-		new->hw->src_addr = src;
-		new->hw->dst_addr = dest;
+		new->hw->src_addr = src.dma;
+		new->hw->dst_addr = dest.dma;
 		new->cookie = 0;
 
 		/* chain together the physical address list for the HW */
@@ -279,8 +315,8 @@ static dma_cookie_t do_ioat_dma_memcpy(s
 		prev = new;
 
 		len  -= copy;
-		dest += copy;
-		src  += copy;
+		dest.dma += copy;
+		src.dma  += copy;
 
 		list_add_tail(&new->node, &new_chain);
 		desc_count++;
@@ -321,89 +357,7 @@ static dma_cookie_t do_ioat_dma_memcpy(s
 }
 
 /**
- * ioat_dma_memcpy_buf_to_buf - wrapper that takes src & dest bufs
- * @chan: IOAT DMA channel handle
- * @dest: DMA destination address
- * @src: DMA source address
- * @len: transaction length in bytes
- */
-
-static dma_cookie_t ioat_dma_memcpy_buf_to_buf(struct dma_chan *chan,
-                                               void *dest,
-                                               void *src,
-                                               size_t len)
-{
-	dma_addr_t dest_addr;
-	dma_addr_t src_addr;
-	struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan);
-
-	dest_addr = pci_map_single(ioat_chan->device->pdev,
-		dest, len, PCI_DMA_FROMDEVICE);
-	src_addr = pci_map_single(ioat_chan->device->pdev,
-		src, len, PCI_DMA_TODEVICE);
-
-	return do_ioat_dma_memcpy(ioat_chan, dest_addr, src_addr, len);
-}
-
-/**
- * ioat_dma_memcpy_buf_to_pg - wrapper, copying from a buf to a page
- * @chan: IOAT DMA channel handle
- * @page: pointer to the page to copy to
- * @offset: offset into that page
- * @src: DMA source address
- * @len: transaction length in bytes
- */
-
-static dma_cookie_t ioat_dma_memcpy_buf_to_pg(struct dma_chan *chan,
-                                              struct page *page,
-                                              unsigned int offset,
-                                              void *src,
-                                              size_t len)
-{
-	dma_addr_t dest_addr;
-	dma_addr_t src_addr;
-	struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan);
-
-	dest_addr = pci_map_page(ioat_chan->device->pdev,
-		page, offset, len, PCI_DMA_FROMDEVICE);
-	src_addr = pci_map_single(ioat_chan->device->pdev,
-		src, len, PCI_DMA_TODEVICE);
-
-	return do_ioat_dma_memcpy(ioat_chan, dest_addr, src_addr, len);
-}
-
-/**
- * ioat_dma_memcpy_pg_to_pg - wrapper, copying between two pages
- * @chan: IOAT DMA channel handle
- * @dest_pg: pointer to the page to copy to
- * @dest_off: offset into that page
- * @src_pg: pointer to the page to copy from
- * @src_off: offset into that page
- * @len: transaction length in bytes. This is guaranteed not to make a copy
- *	 across a page boundary.
- */
-
-static dma_cookie_t ioat_dma_memcpy_pg_to_pg(struct dma_chan *chan,
-                                             struct page *dest_pg,
-                                             unsigned int dest_off,
-                                             struct page *src_pg,
-                                             unsigned int src_off,
-                                             size_t len)
-{
-	dma_addr_t dest_addr;
-	dma_addr_t src_addr;
-	struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan);
-
-	dest_addr = pci_map_page(ioat_chan->device->pdev,
-		dest_pg, dest_off, len, PCI_DMA_FROMDEVICE);
-	src_addr = pci_map_page(ioat_chan->device->pdev,
-		src_pg, src_off, len, PCI_DMA_TODEVICE);
-
-	return do_ioat_dma_memcpy(ioat_chan, dest_addr, src_addr, len);
-}
-
-/**
- * ioat_dma_memcpy_issue_pending - push potentially unrecognized appended descriptors to hw
+ * ioat_dma_memcpy_issue_pending - push potentially unrecognoized appended descriptors to hw
  * @chan: DMA channel handle
  */
 
@@ -626,24 +580,24 @@ #define IOAT_TEST_SIZE 2000
 static int ioat_self_test(struct ioat_device *device)
 {
 	int i;
-	u8 *src;
-	u8 *dest;
+	union dmaengine_addr src;
+	union dmaengine_addr dest;
 	struct dma_chan *dma_chan;
 	dma_cookie_t cookie;
 	int err = 0;
 
-	src = kzalloc(sizeof(u8) * IOAT_TEST_SIZE, SLAB_KERNEL);
-	if (!src)
+	src.buf = kzalloc(sizeof(u8) * IOAT_TEST_SIZE, SLAB_KERNEL);
+	if (!src.buf)
 		return -ENOMEM;
-	dest = kzalloc(sizeof(u8) * IOAT_TEST_SIZE, SLAB_KERNEL);
-	if (!dest) {
-		kfree(src);
+	dest.buf = kzalloc(sizeof(u8) * IOAT_TEST_SIZE, SLAB_KERNEL);
+	if (!dest.buf) {
+		kfree(src.buf);
 		return -ENOMEM;
 	}
 
 	/* Fill in src buffer */
 	for (i = 0; i < IOAT_TEST_SIZE; i++)
-		src[i] = (u8)i;
+		((u8 *) src.buf)[i] = (u8)i;
 
 	/* Start copy, using first DMA channel */
 	dma_chan = container_of(device->common.channels.next,
@@ -654,7 +608,8 @@ static int ioat_self_test(struct ioat_de
 		goto out;
 	}
 
-	cookie = ioat_dma_memcpy_buf_to_buf(dma_chan, dest, src, IOAT_TEST_SIZE);
+	cookie = do_ioat_dma_memcpy(dma_chan, dest, 0, src, 0,
+		IOAT_TEST_SIZE, DMA_SRC_BUF | DMA_DEST_BUF);
 	ioat_dma_memcpy_issue_pending(dma_chan);
 	msleep(1);
 
@@ -663,7 +618,7 @@ static int ioat_self_test(struct ioat_de
 		err = -ENODEV;
 		goto free_resources;
 	}
-	if (memcmp(src, dest, IOAT_TEST_SIZE)) {
+	if (memcmp(src.buf, dest.buf, IOAT_TEST_SIZE)) {
 		printk(KERN_ERR "ioatdma: Self-test copy failed compare, disabling\n");
 		err = -ENODEV;
 		goto free_resources;
@@ -672,11 +627,16 @@ static int ioat_self_test(struct ioat_de
 free_resources:
 	ioat_dma_free_chan_resources(dma_chan);
 out:
-	kfree(src);
-	kfree(dest);
+	kfree(src.buf);
+	kfree(dest.buf);
 	return err;
 }
 
+extern dma_cookie_t dma_async_do_xor_err(struct dma_chan *chan,
+	union dmaengine_addr dest, unsigned int dest_off,
+	union dmaengine_addr src, unsigned int src_cnt,
+	unsigned int src_off, size_t len, unsigned long flags);
+
 static int __devinit ioat_probe(struct pci_dev *pdev,
                                 const struct pci_device_id *ent)
 {
@@ -752,13 +712,11 @@ #endif
 
 	device->common.device_alloc_chan_resources = ioat_dma_alloc_chan_resources;
 	device->common.device_free_chan_resources = ioat_dma_free_chan_resources;
-	device->common.device_memcpy_buf_to_buf = ioat_dma_memcpy_buf_to_buf;
-	device->common.device_memcpy_buf_to_pg = ioat_dma_memcpy_buf_to_pg;
-	device->common.device_memcpy_pg_to_pg = ioat_dma_memcpy_pg_to_pg;
 	device->common.device_operation_complete = ioat_dma_is_complete;
-	device->common.device_xor_pgs_to_pg = dma_async_xor_pgs_to_pg_err;
 	device->common.device_issue_pending = ioat_dma_memcpy_issue_pending;
 	device->common.capabilities = DMA_MEMCPY;
+	device->common.device_do_dma_memcpy = do_ioat_dma_memcpy;
+	device->common.device_do_dma_xor = dma_async_do_xor_err;
 	printk(KERN_INFO "Intel(R) I/OAT DMA Engine found, %d channels\n",
 		device->common.chancnt);
 
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 3599472..df055cc 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -86,6 +86,32 @@ enum dma_capabilities {
 };
 
 /**
+ * union dmaengine_addr - Private address types
+ * -passing a dma address to the hardware engine
+ *  implies skipping the dma_map* operation
+ */
+union dmaengine_addr {
+	void *buf;
+	struct page *pg;
+	struct page **pgs;
+	dma_addr_t dma;
+	dma_addr_t *dma_list;
+};
+
+enum dmaengine_flags {
+	DMA_SRC_BUF		= 0x1,
+	DMA_SRC_PAGE		= 0x2,
+	DMA_SRC_PAGES		= 0x4,
+	DMA_SRC_DMA		= 0x8,
+	DMA_SRC_DMA_LIST	= 0x10,
+	DMA_DEST_BUF		= 0x20,
+	DMA_DEST_PAGE		= 0x40,
+	DMA_DEST_PAGES		= 0x80,
+	DMA_DEST_DMA		= 0x100,
+	DMA_DEST_DMA_LIST	= 0x200,
+};
+
+/**
  * struct dma_chan_percpu - the per-CPU part of struct dma_chan
  * @refcount: local_t used for open-coded "bigref" counting
  * @memcpy_count: transaction counter
@@ -230,11 +256,10 @@ struct dma_chan_client_ref {
  * @device_alloc_chan_resources: allocate resources and return the
  *	number of allocated descriptors
  * @device_free_chan_resources: release DMA channel's resources
- * @device_memcpy_buf_to_buf: memcpy buf pointer to buf pointer
- * @device_memcpy_buf_to_pg: memcpy buf pointer to struct page
- * @device_memcpy_pg_to_pg: memcpy struct page/offset to struct page/offset
  * @device_memcpy_complete: poll the status of an IOAT DMA transaction
- * @device_memcpy_issue_pending: push appended descriptors to hardware
+ * @device_issue_pending: push appended descriptors to hardware
+ * @device_do_dma_memcpy: perform memcpy with a dma engine
+ * @device_do_dma_xor: perform block xor with a dma engine
  */
 struct dma_device {
 
@@ -250,18 +275,15 @@ struct dma_device {
 
 	int (*device_alloc_chan_resources)(struct dma_chan *chan);
 	void (*device_free_chan_resources)(struct dma_chan *chan);
-	dma_cookie_t (*device_memcpy_buf_to_buf)(struct dma_chan *chan,
-			void *dest, void *src, size_t len);
-	dma_cookie_t (*device_memcpy_buf_to_pg)(struct dma_chan *chan,
-			struct page *page, unsigned int offset, void *kdata,
-			size_t len);
-	dma_cookie_t (*device_memcpy_pg_to_pg)(struct dma_chan *chan,
-			struct page *dest_pg, unsigned int dest_off,
-			struct page *src_pg, unsigned int src_off, size_t len);
-	dma_cookie_t (*device_xor_pgs_to_pg)(struct dma_chan *chan,
-			struct page *dest_pg, unsigned int dest_off,
-			struct page **src_pgs, unsigned int src_cnt,
-			unsigned int src_off, size_t len);
+	dma_cookie_t (*device_do_dma_memcpy)(struct dma_chan *chan,
+			union dmaengine_addr dest, unsigned int dest_off,
+			union dmaengine_addr src, unsigned int src_off,
+			size_t len, unsigned long flags);
+	dma_cookie_t (*device_do_dma_xor)(struct dma_chan *chan,
+			union dmaengine_addr dest, unsigned int dest_off,
+			union dmaengine_addr src, unsigned int src_cnt,
+			unsigned int src_off, size_t len,
+			unsigned long flags);
 	enum dma_status (*device_operation_complete)(struct dma_chan *chan,
 			dma_cookie_t cookie, dma_cookie_t *last,
 			dma_cookie_t *used);
@@ -275,9 +297,6 @@ void dma_async_client_unregister(struct 
 int dma_async_client_chan_request(struct dma_client *client,
 		unsigned int number, unsigned int mask);
 void dma_async_chan_init(struct dma_chan *chan, struct dma_device *device);
-dma_cookie_t dma_async_xor_pgs_to_pg_err(struct dma_chan *chan,
-	struct page *dest_pg, unsigned int dest_off, struct page *src_pgs,
-	unsigned int src_cnt, unsigned int src_off, size_t len);
 
 /**
  * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
@@ -294,12 +313,16 @@ dma_cookie_t dma_async_xor_pgs_to_pg_err
 static inline dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
 	void *dest, void *src, size_t len)
 {
+	unsigned long flags = DMA_DEST_BUF | DMA_SRC_BUF;
+	union dmaengine_addr dest_addr = { .buf = dest };
+	union dmaengine_addr src_addr = { .buf = src };
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
 	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_buf_to_buf(chan, dest, src, len);
+	return chan->device->device_do_dma_memcpy(chan, dest_addr, 0,
+						src_addr, 0, len, flags);
 }
 
 /**
@@ -318,13 +341,16 @@ static inline dma_cookie_t dma_async_mem
 static inline dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
 	struct page *page, unsigned int offset, void *kdata, size_t len)
 {
+	unsigned long flags = DMA_DEST_PAGE | DMA_SRC_BUF;
+	union dmaengine_addr dest_addr = { .pg = page };
+	union dmaengine_addr src_addr = { .buf = kdata };
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
 	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_buf_to_pg(chan, page, offset,
-	                                             kdata, len);
+	return chan->device->device_do_dma_memcpy(chan, dest_addr, offset,
+						src_addr, 0, len, flags);
 }
 
 /**
@@ -345,13 +371,101 @@ static inline dma_cookie_t dma_async_mem
 	struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
 	unsigned int src_off, size_t len)
 {
+	unsigned long flags = DMA_DEST_PAGE | DMA_SRC_PAGE;
+	union dmaengine_addr dest_addr = { .pg = dest_pg };
+	union dmaengine_addr src_addr = { .pg = src_pg };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_memcpy(chan, dest_addr, dest_off,
+						src_addr, src_off, len, flags);
+}
+
+/**
+ * dma_async_memcpy_dma_to_dma - offloaded copy from dma to dma
+ * @chan: DMA channel to offload copy to
+ * @dest: destination already mapped and consistent
+ * @src: source already mapped and consistent
+ * @len: length
+ *
+ * Both @dest and @src are bus addresses obtained via the DMA mapping API
+ * (streaming mappings) and must remain mapped until the operation
+ * completes
+ */
+static inline dma_cookie_t dma_async_memcpy_dma_to_dma(struct dma_chan *chan,
+	dma_addr_t dest, dma_addr_t src, size_t len)
+{
+	unsigned long flags = DMA_DEST_DMA | DMA_SRC_DMA;
+	union dmaengine_addr dest_addr = { .dma = dest };
+	union dmaengine_addr src_addr = { .dma = src };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_memcpy(chan, dest_addr, 0,
+						src_addr, 0, len, flags);
+}
+
+/**
+ * dma_async_memcpy_pg_to_dma - offloaded copy from page to dma
+ * @chan: DMA channel to offload copy to
+ * @dest: destination already mapped and consistent
+ * @src_pg: source page
+ * @src_off: offset in page to copy from
+ * @len: length
+ *
+ * @dest is a bus address that must remain mapped until the operation
+ * completes.  @src_pg/@src_off must be mappable to a bus address according
+ * to the DMA mapping API rules for streaming mappings and must stay
+ * memory resident (kernel memory or locked user space pages)
+ */
+static inline dma_cookie_t dma_async_memcpy_pg_to_dma(struct dma_chan *chan,
+	dma_addr_t dest, struct page *src_pg,
+	unsigned int src_off, size_t len)
+{
+	unsigned long flags = DMA_DEST_DMA | DMA_SRC_PAGE;
+	union dmaengine_addr dest_addr = { .dma = dest };
+	union dmaengine_addr src_addr = { .pg = src_pg };
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
 	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_pg_to_pg(chan, dest_pg, dest_off,
-	                                            src_pg, src_off, len);
+	return chan->device->device_do_dma_memcpy(chan, dest_addr, 0,
+						src_addr, src_off, len, flags);
+}
+
+/**
+ * dma_async_memcpy_dma_to_pg - offloaded copy from dma to page
+ * @chan: DMA channel to offload copy to
+ * @dest_pg: destination page
+ * @dest_off: offset in page to copy to
+ * @src: source already mapped and consistent
+ * @len: length
+ *
+ * @dest_pg/@dest_off must be mappable to a bus address according to the
+ * DMA mapping API rules for streaming mappings and must stay memory
+ * resident (kernel memory or locked user space pages).  @src is a bus
+ * address that must remain mapped until the operation completes
+ */
+static inline dma_cookie_t dma_async_memcpy_dma_to_pg(struct dma_chan *chan,
+	struct page *dest_pg, unsigned int dest_off, dma_addr_t src,
+	size_t len)
+{
+	unsigned long flags = DMA_DEST_PAGE | DMA_SRC_DMA;
+	union dmaengine_addr dest_addr = { .pg = dest_pg };
+	union dmaengine_addr src_addr = { .dma = src };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_memcpy(chan, dest_addr, dest_off,
+						src_addr, 0, len, flags);
 }
 
 /**
@@ -373,13 +487,40 @@ static inline dma_cookie_t dma_async_xor
 	struct page *dest_pg, unsigned int dest_off, struct page **src_pgs,
 	unsigned int src_cnt, unsigned int src_off, size_t len)
 {
+	unsigned long flags = DMA_DEST_PAGE | DMA_SRC_PAGES;
+	union dmaengine_addr dest_addr = { .pg = dest_pg };
+	union dmaengine_addr src_addr = { .pgs = src_pgs };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_xor += len * src_cnt;
+	per_cpu_ptr(chan->local, cpu)->xor_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_xor(chan, dest_addr, dest_off,
+		src_addr, src_cnt, src_off, len, flags);
+}
+
+/**
+ * dma_async_xor_dma_list_to_dma - offloaded xor of dma blocks
+ * @chan: DMA channel to offload xor to
+ * @dest: destination already mapped and consistent
+ * @src_list: array of sources already mapped and consistent
+ * @src_cnt: number of sources
+ * @len: length
+ */
+static inline dma_cookie_t dma_async_xor_dma_list_to_dma(struct dma_chan *chan,
+	dma_addr_t dest, dma_addr_t *src_list, unsigned int src_cnt,
+	size_t len)
+{
+	unsigned long flags = DMA_DEST_DMA | DMA_SRC_DMA_LIST;
+	union dmaengine_addr dest_addr = { .dma = dest };
+	union dmaengine_addr src_addr = { .dma_list = src_list };
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_xor += len * src_cnt;
 	per_cpu_ptr(chan->local, cpu)->xor_count++;
 	put_cpu();
 
-	return chan->device->device_xor_pgs_to_pg(chan, dest_pg, dest_off,
-		src_pgs, src_cnt, src_off, len);
+	return chan->device->device_do_dma_xor(chan, dest_addr, 0,
+		src_addr, src_cnt, 0, len, flags);
 }
 
 /**


* [PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (8 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 09/19] dmaengine: reduce backend address permutations Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:18 ` [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation Dan Williams
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Allow a client to ensure that the dma channel it has selected can
dma to the specified buffer or page address.  Also allow the client to
pre-map address ranges to be passed to the operations API.
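
For illustration (not part of the patch), a client could pre-map a pair of
buffers once and then hand the resulting bus addresses straight to the
operations API, assuming 'chan' is a channel it already holds and
'src'/'dest' are resident kernel buffers of 'len' bytes:

	dma_addr_t src_dma, dest_dma;
	dma_cookie_t cookie;

	src_dma = dma_async_map_single(chan, src, len, DMA_TO_DEVICE);
	dest_dma = dma_async_map_single(chan, dest, len, DMA_FROM_DEVICE);

	cookie = dma_async_memcpy_dma_to_dma(chan, dest_dma, src_dma, len);
	dma_async_issue_pending(chan);
	while (dma_async_operation_complete(chan, cookie, NULL, NULL) ==
			DMA_IN_PROGRESS)
		cpu_relax();

	dma_async_unmap_single(chan, dest_dma, len, DMA_FROM_DEVICE);
	dma_async_unmap_single(chan, src_dma, len, DMA_TO_DEVICE);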

Changelog:
* make the dmaengine api EXPORT_SYMBOL_GPL
* zero sum support should be standalone, not integrated into xor

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/dma/dmaengine.c   |    4 ++++
 drivers/dma/ioatdma.c     |   35 +++++++++++++++++++++++++++++++++++
 include/linux/dmaengine.h |   34 ++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 9b02afa..e78ce89 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -630,3 +630,7 @@ EXPORT_SYMBOL_GPL(dma_async_device_unreg
 EXPORT_SYMBOL_GPL(dma_chan_cleanup);
 EXPORT_SYMBOL_GPL(dma_async_do_xor_err);
 EXPORT_SYMBOL_GPL(dma_async_chan_init);
+EXPORT_SYMBOL_GPL(dma_async_map_page);
+EXPORT_SYMBOL_GPL(dma_async_map_single);
+EXPORT_SYMBOL_GPL(dma_async_unmap_page);
+EXPORT_SYMBOL_GPL(dma_async_unmap_single);
diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c
index dd5b9f0..0159d14 100644
--- a/drivers/dma/ioatdma.c
+++ b/drivers/dma/ioatdma.c
@@ -637,6 +637,37 @@ extern dma_cookie_t dma_async_do_xor_err
 	union dmaengine_addr src, unsigned int src_cnt,
 	unsigned int src_off, size_t len, unsigned long flags);
 
+static dma_addr_t ioat_map_page(struct dma_chan *chan, struct page *page,
+					unsigned long offset, size_t size,
+					int direction)
+{
+	struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan);
+	return pci_map_page(ioat_chan->device->pdev, page, offset, size,
+			direction);
+}
+
+static dma_addr_t ioat_map_single(struct dma_chan *chan, void *cpu_addr,
+					size_t size, int direction)
+{
+	struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan);
+	return pci_map_single(ioat_chan->device->pdev, cpu_addr, size,
+			direction);
+}
+
+static void ioat_unmap_page(struct dma_chan *chan, dma_addr_t handle,
+				size_t size, int direction)
+{
+	struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan);
+	pci_unmap_page(ioat_chan->device->pdev, handle, size, direction);
+}
+
+static void ioat_unmap_single(struct dma_chan *chan, dma_addr_t handle,
+				size_t size, int direction)
+{
+	struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan);
+	pci_unmap_single(ioat_chan->device->pdev, handle, size,	direction);
+}
+
 static int __devinit ioat_probe(struct pci_dev *pdev,
                                 const struct pci_device_id *ent)
 {
@@ -717,6 +748,10 @@ #endif
 	device->common.capabilities = DMA_MEMCPY;
 	device->common.device_do_dma_memcpy = do_ioat_dma_memcpy;
 	device->common.device_do_dma_xor = dma_async_do_xor_err;
+	device->common.map_page = ioat_map_page;
+	device->common.map_single = ioat_map_single;
+	device->common.unmap_page = ioat_unmap_page;
+	device->common.unmap_single = ioat_unmap_single;
 	printk(KERN_INFO "Intel(R) I/OAT DMA Engine found, %d channels\n",
 		device->common.chancnt);
 
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index df055cc..cb4cfcf 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -287,6 +287,15 @@ struct dma_device {
 	enum dma_status (*device_operation_complete)(struct dma_chan *chan,
 			dma_cookie_t cookie, dma_cookie_t *last,
 			dma_cookie_t *used);
+	dma_addr_t (*map_page)(struct dma_chan *chan, struct page *page,
+				unsigned long offset, size_t size,
+				int direction);
+	dma_addr_t (*map_single)(struct dma_chan *chan, void *cpu_addr,
+				size_t size, int direction);
+	void (*unmap_page)(struct dma_chan *chan, dma_addr_t handle,
+				size_t size, int direction);
+	void (*unmap_single)(struct dma_chan *chan, dma_addr_t handle,
+				size_t size, int direction);
 	void (*device_issue_pending)(struct dma_chan *chan);
 };
 
@@ -592,6 +601,31 @@ static inline enum dma_status dma_async_
 	return DMA_IN_PROGRESS;
 }
 
+static inline dma_addr_t dma_async_map_page(struct dma_chan *chan,
+			struct page *page, unsigned long offset, size_t size,
+			int direction)
+{
+	return chan->device->map_page(chan, page, offset, size, direction);
+}
+
+static inline dma_addr_t dma_async_map_single(struct dma_chan *chan,
+			void *cpu_addr,	size_t size, int direction)
+{
+	return chan->device->map_single(chan, cpu_addr, size, direction);
+}
+
+static inline void dma_async_unmap_page(struct dma_chan *chan,
+			dma_addr_t handle, size_t size, int direction)
+{
+	chan->device->unmap_page(chan, handle, size, direction);
+}
+
+static inline void dma_async_unmap_single(struct dma_chan *chan,
+			dma_addr_t handle, size_t size, int direction)
+{
+	chan->device->unmap_single(chan, handle, size, direction);
+}
+
 /* --- DMA device --- */
 
 int dma_async_device_register(struct dma_device *device);


* [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (9 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:50   ` Jeff Garzik
  2006-09-11 23:18 ` [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy Dan Williams
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Changelog:
* make the dmaengine api EXPORT_SYMBOL_GPL
* zero sum support should be standalone, not integrated into xor

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/dma/dmaengine.c   |   15 ++++++++++
 drivers/dma/ioatdma.c     |    5 +++
 include/linux/dmaengine.h |   68 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index e78ce89..fe62237 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -604,6 +604,17 @@ dma_cookie_t dma_async_do_xor_err(struct
 	return -ENXIO;
 }
 
+/**
+ * dma_async_do_memset_err - default function for dma devices that
+ *	do not support memset
+ */
+dma_cookie_t dma_async_do_memset_err(struct dma_chan *chan,
+		union dmaengine_addr dest, unsigned int dest_off,
+		int val, size_t len, unsigned long flags)
+{
+	return -ENXIO;
+}
+
 static int __init dma_bus_init(void)
 {
 	mutex_init(&dma_list_mutex);
@@ -621,6 +632,9 @@ EXPORT_SYMBOL_GPL(dma_async_memcpy_pg_to
 EXPORT_SYMBOL_GPL(dma_async_memcpy_dma_to_dma);
 EXPORT_SYMBOL_GPL(dma_async_memcpy_pg_to_dma);
 EXPORT_SYMBOL_GPL(dma_async_memcpy_dma_to_pg);
+EXPORT_SYMBOL_GPL(dma_async_memset_buf);
+EXPORT_SYMBOL_GPL(dma_async_memset_page);
+EXPORT_SYMBOL_GPL(dma_async_memset_dma);
 EXPORT_SYMBOL_GPL(dma_async_xor_pgs_to_pg);
 EXPORT_SYMBOL_GPL(dma_async_xor_dma_list_to_dma);
 EXPORT_SYMBOL_GPL(dma_async_operation_complete);
@@ -629,6 +643,7 @@ EXPORT_SYMBOL_GPL(dma_async_device_regis
 EXPORT_SYMBOL_GPL(dma_async_device_unregister);
 EXPORT_SYMBOL_GPL(dma_chan_cleanup);
 EXPORT_SYMBOL_GPL(dma_async_do_xor_err);
+EXPORT_SYMBOL_GPL(dma_async_do_memset_err);
 EXPORT_SYMBOL_GPL(dma_async_chan_init);
 EXPORT_SYMBOL_GPL(dma_async_map_page);
 EXPORT_SYMBOL_GPL(dma_async_map_single);
diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c
index 0159d14..231247c 100644
--- a/drivers/dma/ioatdma.c
+++ b/drivers/dma/ioatdma.c
@@ -637,6 +637,10 @@ extern dma_cookie_t dma_async_do_xor_err
 	union dmaengine_addr src, unsigned int src_cnt,
 	unsigned int src_off, size_t len, unsigned long flags);
 
+extern dma_cookie_t dma_async_do_memset_err(struct dma_chan *chan,
+	union dmaengine_addr dest, unsigned int dest_off,
+	int val, size_t size, unsigned long flags);
+
 static dma_addr_t ioat_map_page(struct dma_chan *chan, struct page *page,
 					unsigned long offset, size_t size,
 					int direction)
@@ -748,6 +752,7 @@ #endif
 	device->common.capabilities = DMA_MEMCPY;
 	device->common.device_do_dma_memcpy = do_ioat_dma_memcpy;
 	device->common.device_do_dma_xor = dma_async_do_xor_err;
+	device->common.device_do_dma_memset = dma_async_do_memset_err;
 	device->common.map_page = ioat_map_page;
 	device->common.map_single = ioat_map_single;
 	device->common.unmap_page = ioat_unmap_page;
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index cb4cfcf..8d53b08 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -260,6 +260,7 @@ struct dma_chan_client_ref {
  * @device_issue_pending: push appended descriptors to hardware
  * @device_do_dma_memcpy: perform memcpy with a dma engine
  * @device_do_dma_xor: perform block xor with a dma engine
+ * @device_do_dma_memset: perform block fill with a dma engine
  */
 struct dma_device {
 
@@ -284,6 +285,9 @@ struct dma_device {
 			union dmaengine_addr src, unsigned int src_cnt,
 			unsigned int src_off, size_t len,
 			unsigned long flags);
+	dma_cookie_t (*device_do_dma_memset)(struct dma_chan *chan,
+			union dmaengine_addr dest, unsigned int dest_off,
+			int value, size_t len, unsigned long flags);
 	enum dma_status (*device_operation_complete)(struct dma_chan *chan,
 			dma_cookie_t cookie, dma_cookie_t *last,
 			dma_cookie_t *used);
@@ -478,6 +482,70 @@ static inline dma_cookie_t dma_async_mem
 }
 
 /**
+ * dma_async_memset_buf - offloaded memset
+ * @chan: DMA channel to offload memset to
+ * @buf: destination buffer
+ * @val: value to initialize the buffer
+ * @len: length
+ */
+static inline dma_cookie_t dma_async_memset_buf(struct dma_chan *chan,
+	void *buf, int val, size_t len)
+{
+	unsigned long flags = DMA_DEST_BUF;
+	union dmaengine_addr dest_addr = { .buf = buf };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_memset(chan, dest_addr, 0, val,
+						len, flags);
+}
+
+/**
+ * dma_async_memset_page - offloaded memset
+ * @chan: DMA channel to offload memset to
+ * @page: destination page
+ * @offset: offset into the destination
+ * @val: value to initialize the buffer
+ * @len: length
+ */
+static inline dma_cookie_t dma_async_memset_page(struct dma_chan *chan,
+	struct page *page, unsigned int offset, int val, size_t len)
+{
+	unsigned long flags = DMA_DEST_PAGE;
+	union dmaengine_addr dest_addr = { .pg = page };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_memset(chan, dest_addr, offset, val,
+						len, flags);
+}
+
+/**
+ * dma_async_memset_dma - offloaded memset
+ * @chan: DMA channel to offload memset to
+ * @dma: destination dma address
+ * @val: value to initialize the buffer
+ * @len: length
+ */
+static inline dma_cookie_t dma_async_memset_dma(struct dma_chan *chan,
+	dma_addr_t dma, int val, size_t len)
+{
+	unsigned long flags = DMA_DEST_DMA;
+	union dmaengine_addr dest_addr = { .dma = dma };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_memset(chan, dest_addr, 0, val,
+						len, flags);
+}
+
+/**
  * dma_async_xor_pgs_to_pg - offloaded xor from pages to page
  * @chan: DMA channel to offload xor to
  * @dest_page: destination page


* [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (10 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:51   ` Jeff Garzik
  2006-09-11 23:18 ` [PATCH 13/19] dmaengine: add support for dma xor zero sum operations Dan Williams
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Default virtual function that returns an error if the user attempts a
memcpy operation.  An XOR engine is an example of a DMA engine that does
not support memcpy.
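
For illustration (not part of the patch), a driver for such an xor-only
engine would plug the stub into its struct dma_device at probe time,
mirroring what ioat_probe does for xor; 'my_adma_do_xor' below is a
hypothetical stand-in for the driver's real xor routine:

	device->common.capabilities = DMA_XOR;
	device->common.device_do_dma_memcpy = dma_async_do_memcpy_err;
	device->common.device_do_dma_xor = my_adma_do_xor;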

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/dma/dmaengine.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index fe62237..33ad690 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -593,6 +593,18 @@ void dma_async_device_unregister(struct 
 }
 
 /**
+ * dma_async_do_memcpy_err - default function for dma devices that
+ *	do not support memcpy
+ */
+dma_cookie_t dma_async_do_memcpy_err(struct dma_chan *chan,
+		union dmaengine_addr dest, unsigned int dest_off,
+		union dmaengine_addr src, unsigned int src_off,
+		size_t len, unsigned long flags)
+{
+	return -ENXIO;
+}
+
+/**
  * dma_async_do_xor_err - default function for dma devices that
  *	do not support xor
  */
@@ -642,6 +654,7 @@ EXPORT_SYMBOL_GPL(dma_async_issue_pendin
 EXPORT_SYMBOL_GPL(dma_async_device_register);
 EXPORT_SYMBOL_GPL(dma_async_device_unregister);
 EXPORT_SYMBOL_GPL(dma_chan_cleanup);
+EXPORT_SYMBOL_GPL(dma_async_do_memcpy_err);
 EXPORT_SYMBOL_GPL(dma_async_do_xor_err);
 EXPORT_SYMBOL_GPL(dma_async_do_memset_err);
 EXPORT_SYMBOL_GPL(dma_async_chan_init);


* [PATCH 13/19] dmaengine: add support for dma xor zero sum operations
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (11 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:18 ` [PATCH 14/19] dmaengine: add dma_sync_wait Dan Williams
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/dma/dmaengine.c   |   15 ++++++++++++
 drivers/dma/ioatdma.c     |    6 +++++
 include/linux/dmaengine.h |   56 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 33ad690..190c612 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -617,6 +617,18 @@ dma_cookie_t dma_async_do_xor_err(struct
 }
 
 /**
+ * dma_async_do_zero_sum_err - default function for dma devices that
+ *	do not support xor zero sum
+ */
+dma_cookie_t dma_async_do_zero_sum_err(struct dma_chan *chan,
+		union dmaengine_addr src, unsigned int src_cnt,
+		unsigned int src_off, size_t len, u32 *result,
+		unsigned long flags)
+{
+	return -ENXIO;
+}
+
+/**
  * dma_async_do_memset_err - default function for dma devices that
  *      do not support memset
  */
@@ -649,6 +661,8 @@ EXPORT_SYMBOL_GPL(dma_async_memset_page)
 EXPORT_SYMBOL_GPL(dma_async_memset_dma);
 EXPORT_SYMBOL_GPL(dma_async_xor_pgs_to_pg);
 EXPORT_SYMBOL_GPL(dma_async_xor_dma_list_to_dma);
+EXPORT_SYMBOL_GPL(dma_async_zero_sum_pgs);
+EXPORT_SYMBOL_GPL(dma_async_zero_sum_dma_list);
 EXPORT_SYMBOL_GPL(dma_async_operation_complete);
 EXPORT_SYMBOL_GPL(dma_async_issue_pending);
 EXPORT_SYMBOL_GPL(dma_async_device_register);
@@ -656,6 +670,7 @@ EXPORT_SYMBOL_GPL(dma_async_device_unreg
 EXPORT_SYMBOL_GPL(dma_chan_cleanup);
 EXPORT_SYMBOL_GPL(dma_async_do_memcpy_err);
 EXPORT_SYMBOL_GPL(dma_async_do_xor_err);
+EXPORT_SYMBOL_GPL(dma_async_do_zero_sum_err);
 EXPORT_SYMBOL_GPL(dma_async_do_memset_err);
 EXPORT_SYMBOL_GPL(dma_async_chan_init);
 EXPORT_SYMBOL_GPL(dma_async_map_page);
diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c
index 231247c..4e90b02 100644
--- a/drivers/dma/ioatdma.c
+++ b/drivers/dma/ioatdma.c
@@ -637,6 +637,11 @@ extern dma_cookie_t dma_async_do_xor_err
 	union dmaengine_addr src, unsigned int src_cnt,
 	unsigned int src_off, size_t len, unsigned long flags);
 
+extern dma_cookie_t dma_async_do_zero_sum_err(struct dma_chan *chan,
+	union dmaengine_addr src, unsigned int src_cnt,
+	unsigned int src_off, size_t len, u32 *result,
+	unsigned long flags);
+
 extern dma_cookie_t dma_async_do_memset_err(struct dma_chan *chan,
 	union dmaengine_addr dest, unsigned int dest_off,
 	int val, size_t size, unsigned long flags);
@@ -752,6 +757,7 @@ #endif
 	device->common.capabilities = DMA_MEMCPY;
 	device->common.device_do_dma_memcpy = do_ioat_dma_memcpy;
 	device->common.device_do_dma_xor = dma_async_do_xor_err;
+	device->common.device_do_dma_zero_sum = dma_async_do_zero_sum_err;
 	device->common.device_do_dma_memset = dma_async_do_memset_err;
 	device->common.map_page = ioat_map_page;
 	device->common.map_single = ioat_map_single;
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 8d53b08..9fd6cbd 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -260,6 +260,7 @@ struct dma_chan_client_ref {
  * @device_issue_pending: push appended descriptors to hardware
  * @device_do_dma_memcpy: perform memcpy with a dma engine
  * @device_do_dma_xor: perform block xor with a dma engine
+ * @device_do_dma_zero_sum: perform block xor zero sum with a dma engine
  * @device_do_dma_memset: perform block fill with a dma engine
  */
 struct dma_device {
@@ -285,6 +286,10 @@ struct dma_device {
 			union dmaengine_addr src, unsigned int src_cnt,
 			unsigned int src_off, size_t len,
 			unsigned long flags);
+	dma_cookie_t (*device_do_dma_zero_sum)(struct dma_chan *chan,
+			union dmaengine_addr src, unsigned int src_cnt,
+			unsigned int src_off, size_t len, u32 *result,
+			unsigned long flags);
 	dma_cookie_t (*device_do_dma_memset)(struct dma_chan *chan,
 			union dmaengine_addr dest, unsigned int dest_off,
 			int value, size_t len, unsigned long flags);
@@ -601,6 +606,57 @@ static inline dma_cookie_t dma_async_xor
 }
 
 /**
+ * dma_async_zero_sum_pgs - offloaded xor zero sum from a list of pages
+ * @chan: DMA channel to offload zero sum to
+ * @src_pgs: array of source pages
+ * @src_cnt: number of source pages
+ * @src_off: offset in pages to xor from
+ * @len: length
+ * @result: set to 1 if sum is zero else 0
+ *
+ * @src_pgs/@src_off must be mappable to a bus address according to the
+ * DMA mapping API rules for streaming mappings and must stay memory
+ * resident (kernel memory or locked user space pages)
+ */
+static inline dma_cookie_t dma_async_zero_sum_pgs(struct dma_chan *chan,
+	struct page **src_pgs, unsigned int src_cnt, unsigned int src_off,
+	size_t len, u32 *result)
+{
+	unsigned long flags = DMA_DEST_PAGE | DMA_SRC_PAGES;
+	union dmaengine_addr src_addr = { .pgs = src_pgs };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_xor += len * src_cnt;
+	per_cpu_ptr(chan->local, cpu)->xor_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_zero_sum(chan,
+		src_addr, src_cnt, src_off, len, result, flags);
+}
+
+/**
+ * dma_async_zero_sum_dma_list - offloaded xor zero sum from a dma list
+ * @chan: DMA channel to offload zero sum to
+ * @src_list: array of sources already mapped and consistent
+ * @src_cnt: number of sources
+ * @len: length
+ * @result: set to 1 if sum is zero else 0
+ */
+static inline dma_cookie_t dma_async_zero_sum_dma_list(struct dma_chan *chan,
+	dma_addr_t *src_list, unsigned int src_cnt, size_t len, u32 *result)
+{
+	unsigned long flags = DMA_DEST_DMA | DMA_SRC_DMA_LIST;
+	union dmaengine_addr src_addr = { .dma_list = src_list };
+	int cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_xor += len * src_cnt;
+	per_cpu_ptr(chan->local, cpu)->xor_count++;
+	put_cpu();
+
+	return chan->device->device_do_dma_zero_sum(chan,
+		src_addr, src_cnt, 0, len, result, flags);
+}
+
+/**
  * dma_async_issue_pending - flush pending copies to HW
  * @chan: target DMA channel
  *


* [PATCH 14/19] dmaengine: add dma_sync_wait
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (12 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 13/19] dmaengine: add support for dma xor zero sum operations Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:52   ` Jeff Garzik
  2006-09-11 23:18 ` [PATCH 15/19] dmaengine: raid5 dma client Dan Williams
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

dma_sync_wait is a common routine to busy-wait for a dma operation to
complete.
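
For illustration (not part of the patch), a caller that cannot proceed until
a copy has retired could pair it with any of the memcpy helpers, assuming
'chan' is a channel it holds a reference on:

	dma_cookie_t cookie;
	enum dma_status status;

	cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
	status = dma_sync_wait(chan, cookie);
	if (status != DMA_SUCCESS)
		printk(KERN_ERR "offloaded copy failed\n");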

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 include/linux/dmaengine.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 9fd6cbd..0a70c9e 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -750,6 +750,18 @@ static inline void dma_async_unmap_singl
 	chan->device->unmap_single(chan, handle, size, direction);
 }
 
+static inline enum dma_status dma_sync_wait(struct dma_chan *chan,
+						dma_cookie_t cookie)
+{
+	enum dma_status status;
+	dma_async_issue_pending(chan);
+	do {
+		status = dma_async_operation_complete(chan, cookie, NULL, NULL);
+	} while (status == DMA_IN_PROGRESS);
+
+	return status;
+}
+
 /* --- DMA device --- */
 
 int dma_async_device_register(struct dma_device *device);


* [PATCH 15/19] dmaengine: raid5 dma client
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (13 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 14/19] dmaengine: add dma_sync_wait Dan Williams
@ 2006-09-11 23:18 ` Dan Williams
  2006-09-11 23:54   ` Jeff Garzik
  2006-09-11 23:19 ` [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines Dan Williams
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:18 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Adds a dmaengine client that is the hardware accelerated version of
raid5_do_soft_block_ops.  It utilizes the raid5 workqueue implementation to
operate on multiple stripes simultaneously.  See the iop-adma.c driver for
an example of a driver that enables hardware accelerated raid5.

Changelog:
* mark operations as _Dma rather than _Done until all outstanding
operations have completed.  Once all operations have completed, update the
state and return it to the handle list
* add a helper routine to retrieve the last used cookie
* use dma_async_zero_sum_dma_list for checking parity, which optionally
allows parity check operations to not dirty the parity block in the cache
(if 'disks' is less than 'MAX_HW_XOR_SRCS')
* remove dependencies on iop13xx
* take into account the fact that dma engines have a staging buffer so we
can perform 1 less block operation compared to software xor
* added __arch_raid5_dma_chan_request, __arch_raid5_dma_next_channel, and
__arch_raid5_dma_check_channel to make the driver architecture independent
(a minimal sketch of these hooks follows this changelog)
* added channel switching capability for architectures that implement
different operations (i.e. copy & xor) on individual channels
* added initial support for "non-blocking" channel switching
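
As a rough sketch (not part of the patch), an architecture backend that
exposes a single channel capable of both memcpy and xor, and therefore does
not select CONFIG_RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH, would provide something
like the following; 'raid5_chan' is assumed to be populated from the
backend's dma_client event callback when DMA_RESOURCE_ADDED fires:

	static struct dma_chan *raid5_chan;

	void __arch_raid5_dma_chan_request(struct dma_client *client)
	{
		/* one channel that can do both copies and xor */
		dma_async_client_chan_request(client, 1, DMA_MEMCPY | DMA_XOR);
	}

	struct dma_chan *__arch_raid5_dma_next_channel(struct dma_client *client)
	{
		return raid5_chan;
	}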

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/dma/Kconfig        |    9 +
 drivers/dma/Makefile       |    1 
 drivers/dma/raid5-dma.c    |  730 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/Kconfig         |   11 +
 drivers/md/raid5.c         |   66 ++++
 include/linux/dmaengine.h  |    5 
 include/linux/raid/raid5.h |   24 +
 7 files changed, 839 insertions(+), 7 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 30d021d..fced8c3 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -22,6 +22,15 @@ config NET_DMA
 	  Since this is the main user of the DMA engine, it should be enabled;
 	  say Y here.
 
+config RAID5_DMA
+	tristate "MD raid5: block operations offload"
+	depends on INTEL_IOP_ADMA && MD_RAID456
+	default y
+	---help---
+	  This enables the use of DMA engines in the MD-RAID5 driver to
+	  offload stripe cache operations, freeing CPU cycles.
+	  Say Y here.
+
 comment "DMA Devices"
 
 config INTEL_IOATDMA
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index bdcfdbd..4e36d6e 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_NET_DMA) += iovlock.o
+obj-$(CONFIG_RAID5_DMA) += raid5-dma.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
diff --git a/drivers/dma/raid5-dma.c b/drivers/dma/raid5-dma.c
new file mode 100644
index 0000000..04a1790
--- /dev/null
+++ b/drivers/dma/raid5-dma.c
@@ -0,0 +1,730 @@
+/*
+ * Offload raid5 operations to hardware RAID engines
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+
+#include <linux/raid/raid5.h>
+#include <linux/dmaengine.h>
+
+static struct dma_client *raid5_dma_client;
+static atomic_t raid5_count;
+extern void release_stripe(struct stripe_head *sh);
+extern void __arch_raid5_dma_chan_request(struct dma_client *client);
+extern struct dma_chan *__arch_raid5_dma_next_channel(struct dma_client *client);
+
+#define MAX_HW_XOR_SRCS 16
+
+#ifndef STRIPE_SIZE
+#define STRIPE_SIZE PAGE_SIZE
+#endif
+
+#ifndef STRIPE_SECTORS
+#define STRIPE_SECTORS		(STRIPE_SIZE>>9)
+#endif
+
+#ifndef r5_next_bio
+#define r5_next_bio(bio, sect) ( ( (bio)->bi_sector + ((bio)->bi_size>>9) < sect + STRIPE_SECTORS) ? (bio)->bi_next : NULL)
+#endif
+
+#define DMA_RAID5_DEBUG 0
+#define PRINTK(x...) ((void)(DMA_RAID5_DEBUG && printk(x)))
+
+/*
+ * Copy data between a page in the stripe cache, and one or more bion
+ * The page could align with the middle of the bio, or there could be
+ * several bion, each with several bio_vecs, which cover part of the page
+ * Multiple bion are linked together on bi_next.  There may be extras
+ * at the end of this list.  We ignore them.
+ */
+static dma_cookie_t dma_raid_copy_data(int frombio, struct bio *bio,
+		     dma_addr_t dma, sector_t sector, struct dma_chan *chan,
+		     dma_cookie_t cookie)
+{
+	struct bio_vec *bvl;
+	struct page *bio_page;
+	int i;
+	int dma_offset;
+	dma_cookie_t last_cookie = cookie;
+
+	if (bio->bi_sector >= sector)
+		dma_offset = (signed)(bio->bi_sector - sector) * 512;
+	else
+		dma_offset = (signed)(sector - bio->bi_sector) * -512;
+	bio_for_each_segment(bvl, bio, i) {
+		int len = bio_iovec_idx(bio,i)->bv_len;
+		int clen;
+		int b_offset = 0;
+
+		if (dma_offset < 0) {
+			b_offset = -dma_offset;
+			dma_offset += b_offset;
+			len -= b_offset;
+		}
+
+		if (len > 0 && dma_offset + len > STRIPE_SIZE)
+			clen = STRIPE_SIZE - dma_offset;
+		else clen = len;
+
+		if (clen > 0) {
+			b_offset += bio_iovec_idx(bio,i)->bv_offset;
+			bio_page = bio_iovec_idx(bio,i)->bv_page;
+			if (frombio)
+				do {
+					cookie = dma_async_memcpy_pg_to_dma(chan,
+								dma + dma_offset,
+								bio_page,
+								b_offset,
+								clen);
+					if (cookie == -ENOMEM)
+						dma_sync_wait(chan, last_cookie);
+					else
+						WARN_ON(cookie <= 0);
+				} while (cookie == -ENOMEM);
+			else
+				do {
+					cookie = dma_async_memcpy_dma_to_pg(chan,
+								bio_page,
+								b_offset,
+								dma + dma_offset,
+								clen);
+					if (cookie == -ENOMEM)
+						dma_sync_wait(chan, last_cookie);
+					else
+						WARN_ON(cookie <= 0);
+				} while (cookie == -ENOMEM);
+		}
+		last_cookie = cookie;
+		if (clen < len) /* hit end of page */
+			break;
+		dma_offset +=  len;
+	}
+
+	return last_cookie;
+}
+
+#define issue_xor() do {					          \
+			 do {					          \
+			 	cookie = dma_async_xor_dma_list_to_dma(   \
+			 		sh->ops.dma_chan,	          \
+			 		xor_destination_addr,	          \
+			 		dma,			          \
+			 		count,			          \
+			 		STRIPE_SIZE);		          \
+			 	if (cookie == -ENOMEM)		          \
+			 		dma_sync_wait(sh->ops.dma_chan,	  \
+			 			sh->ops.dma_cookie);      \
+			 	else				          \
+			 		WARN_ON(cookie <= 0);	          \
+			 } while (cookie == -ENOMEM);		          \
+			 sh->ops.dma_cookie = cookie;		          \
+			 dma[0] = xor_destination_addr;			  \
+			 count = 1;					  \
+			} while(0)
+#define check_xor() do {						\
+			if (count == MAX_HW_XOR_SRCS)			\
+				issue_xor();				\
+		     } while (0)
+
+#ifdef CONFIG_RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH
+extern struct dma_chan *__arch_raid5_dma_check_channel(struct dma_chan *chan,
+						dma_cookie_t cookie,
+						struct dma_client *client,
+						unsigned long capabilities);
+
+#ifdef CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE
+#define check_channel(cap, bookmark) do {			     \
+bookmark:							     \
+	next_chan = __arch_raid5_dma_check_channel(sh->ops.dma_chan, \
+						sh->ops.dma_cookie,  \
+						raid5_dma_client,    \
+						(cap));		     \
+	if (!next_chan) {					     \
+		BUG_ON(sh->ops.ops_bookmark);			     \
+		sh->ops.ops_bookmark = &&bookmark;		     \
+		goto raid5_dma_retry;				     \
+	} else {						     \
+		sh->ops.dma_chan = next_chan;			     \
+		sh->ops.dma_cookie = dma_async_get_last_cookie(	     \
+							next_chan);  \
+		sh->ops.ops_bookmark = NULL;			     \
+	}							     \
+} while (0)
+#else
+#define check_channel(cap, bookmark) do {			     \
+bookmark:							     \
+	next_chan = __arch_raid5_dma_check_channel(sh->ops.dma_chan, \
+						sh->ops.dma_cookie,  \
+						raid5_dma_client,    \
+						(cap));		     \
+	if (!next_chan) {					     \
+		dma_sync_wait(sh->ops.dma_chan, sh->ops.dma_cookie); \
+		goto bookmark;					     \
+	} else {						     \
+		sh->ops.dma_chan = next_chan;			     \
+		sh->ops.dma_cookie = dma_async_get_last_cookie(	     \
+							next_chan);  \
+	}							     \
+} while (0)
+#endif /* CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE */
+#else
+#define check_channel(cap, bookmark) do { } while (0)
+#endif /* CONFIG_RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH */
+
+/*
+ * dma_do_raid5_block_ops - perform block memory operations on stripe data
+ * outside the spin lock with dma engines
+ *
+ * A note about the need for __arch_raid5_dma_check_channel:
+ * This function is only needed to support architectures where a single raid
+ * operation spans multiple hardware channels.  For example on a reconstruct
+ * write, memory copy operations are submitted to a memcpy channel and then
+ * the routine must switch to the xor channel to complete the raid operation.
+ * __arch_raid5_dma_check_channel makes sure the previous operation has
+ * completed before returning the new channel.
+ * Some efficiency can be gained by putting the stripe back on the work
+ * queue rather than spin waiting.  This code is a work in progress and is
+ * available via the 'broken' option CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE.
+ * If 'wait via requeue' is not defined, the check_channel macro busy-waits
+ * for the next channel.
+ */
+static void dma_do_raid5_block_ops(void *stripe_head_ref)
+{
+	struct stripe_head *sh = stripe_head_ref;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks;
+	dma_addr_t dma[MAX_HW_XOR_SRCS];
+	int overlap=0;
+	unsigned long state, ops_state, ops_state_orig;
+	raid5_conf_t *conf = sh->raid_conf;
+	dma_cookie_t cookie;
+	#ifdef CONFIG_RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH
+	struct dma_chan *next_chan;
+	#endif
+
+	if (!sh->ops.dma_chan) {
+		sh->ops.dma_chan = __arch_raid5_dma_next_channel(raid5_dma_client);
+		dma_chan_get(sh->ops.dma_chan);
+		/* retrieve the last used cookie on this channel */
+		sh->ops.dma_cookie = dma_async_get_last_cookie(sh->ops.dma_chan);
+	}
+
+	/* take a snapshot of what needs to be done at this point in time */
+	spin_lock(&sh->lock);
+	state = sh->state;
+	ops_state_orig = ops_state = sh->ops.state;
+	spin_unlock(&sh->lock);
+
+	#ifdef CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE
+	/* pick up where we left off */
+	if (sh->ops.ops_bookmark)
+		goto *sh->ops.ops_bookmark;
+	#endif
+
+	if (test_bit(STRIPE_OP_BIOFILL, &state) &&
+		!test_bit(STRIPE_OP_BIOFILL_Dma, &ops_state)) {
+		struct bio *return_bi;
+		PRINTK("%s: stripe %llu STRIPE_OP_BIOFILL op_state: %lx\n",
+			__FUNCTION__, (unsigned long long)sh->sector,
+			ops_state);
+
+		check_channel(DMA_MEMCPY, stripe_op_biofill);
+		return_bi = NULL;
+
+		for (i=disks ; i-- ;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_ReadReq, &dev->flags)) {
+				struct bio *rbi, *rbi2;
+				spin_lock_irq(&conf->device_lock);
+				rbi = dev->toread;
+				dev->toread = NULL;
+				spin_unlock_irq(&conf->device_lock);
+				overlap++;
+				while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+					sh->ops.dma_cookie = dma_raid_copy_data(0,
+								rbi, dev->dma, dev->sector,
+								sh->ops.dma_chan,
+								sh->ops.dma_cookie);
+					rbi2 = r5_next_bio(rbi, dev->sector);
+					spin_lock_irq(&conf->device_lock);
+					if (--rbi->bi_phys_segments == 0) {
+						rbi->bi_next = return_bi;
+						return_bi = rbi;
+					}
+					spin_unlock_irq(&conf->device_lock);
+					rbi = rbi2;
+				}
+				dev->read = return_bi;
+			}
+		}
+		if (overlap)
+			set_bit(STRIPE_OP_BIOFILL_Dma, &ops_state);
+	}
+
+	if (test_bit(STRIPE_OP_COMPUTE, &state) &&
+		!test_bit(STRIPE_OP_COMPUTE_Dma, &ops_state)) {
+
+		/* dma engines do not need to pre-zero the destination */
+		if (test_and_clear_bit(STRIPE_OP_COMPUTE_Prep, &ops_state))
+			set_bit(STRIPE_OP_COMPUTE_Parity, &ops_state);
+
+		if (test_and_clear_bit(STRIPE_OP_COMPUTE_Parity, &ops_state)) {
+			dma_addr_t xor_destination_addr;
+			int dd_idx;
+			int count;
+
+			check_channel(DMA_XOR, stripe_op_compute_parity);
+			dd_idx = -1;
+			count = 0;
+
+			for (i=disks ; i-- ; )
+				if (test_bit(R5_ComputeReq, &sh->dev[i].flags)) {
+					dd_idx = i;
+					PRINTK("%s: stripe %llu STRIPE_OP_COMPUTE "
+					       "op_state: %lx block: %d\n",
+						__FUNCTION__,
+						(unsigned long long)sh->sector,
+						ops_state, dd_idx);
+					break;
+				}
+
+			BUG_ON(dd_idx < 0);
+
+			xor_destination_addr = sh->dev[dd_idx].dma;
+
+			for (i=disks ; i-- ; )
+				if (i != dd_idx) {
+					dma[count++] = sh->dev[i].dma;
+					check_xor();
+				}
+
+			if (count > 1)
+				issue_xor();
+
+			set_bit(STRIPE_OP_COMPUTE_Dma, &ops_state);
+		}
+	}
+
+	if (test_bit(STRIPE_OP_RMW, &state) &&
+		!test_bit(STRIPE_OP_RMW_Dma, &ops_state)) {
+		BUG_ON(test_bit(STRIPE_OP_RCW, &state));
+
+		PRINTK("%s: stripe %llu STRIPE_OP_RMW op_state: %lx\n",
+			__FUNCTION__, (unsigned long long)sh->sector,
+			ops_state);
+
+		if (test_and_clear_bit(STRIPE_OP_RMW_ParityPre, &ops_state)) {
+			dma_addr_t xor_destination_addr;
+			int count;
+
+			check_channel(DMA_XOR, stripe_op_rmw_paritypre);
+			count = 0;
+
+			/* existing parity data is used in the xor subtraction */
+			xor_destination_addr = dma[count++] = sh->dev[pd_idx].dma;
+
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				struct bio *chosen;
+
+				/* Only process blocks that are known to be uptodate */
+				if (dev->towrite && test_bit(R5_RMWReq, &dev->flags)) {
+					dma[count++] = dev->dma;
+
+					spin_lock(&sh->lock);
+					chosen = dev->towrite;
+					dev->towrite = NULL;
+					BUG_ON(dev->written);
+					dev->written = chosen;
+					spin_unlock(&sh->lock);
+
+					overlap++;
+
+					check_xor();
+				}
+			}
+			if (count > 1)
+				issue_xor();
+
+			set_bit(STRIPE_OP_RMW_Drain, &ops_state);
+		}
+
+		if (test_and_clear_bit(STRIPE_OP_RMW_Drain, &ops_state)) {
+
+			check_channel(DMA_MEMCPY, stripe_op_rmw_drain);
+
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				struct bio *wbi = dev->written;
+
+				while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+					sh->ops.dma_cookie = dma_raid_copy_data(1,
+							     wbi, dev->dma, dev->sector,
+							     sh->ops.dma_chan,
+							     sh->ops.dma_cookie);
+					wbi = r5_next_bio(wbi, dev->sector);
+				}
+			}
+			set_bit(STRIPE_OP_RMW_ParityPost, &ops_state);
+		}
+
+		if (test_and_clear_bit(STRIPE_OP_RMW_ParityPost, &ops_state)) {
+			dma_addr_t xor_destination_addr;
+			int count;
+
+			check_channel(DMA_XOR, stripe_op_rmw_paritypost);
+			count = 0;
+
+			xor_destination_addr = dma[count++] = sh->dev[pd_idx].dma;
+
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				if (dev->written) {
+					dma[count++] = dev->dma;
+					check_xor();
+				}
+			}
+			if (count > 1)
+				issue_xor();
+
+			set_bit(STRIPE_OP_RMW_Dma, &ops_state);
+		}
+	}
+
+	if (test_bit(STRIPE_OP_RCW, &state) &&
+		!test_bit(STRIPE_OP_RCW_Dma, &ops_state)) {
+		BUG_ON(test_bit(STRIPE_OP_RMW, &state));
+
+		PRINTK("%s: stripe %llu STRIPE_OP_RCW op_state: %lx\n",
+			__FUNCTION__, (unsigned long long)sh->sector,
+			ops_state);
+
+
+		if (test_and_clear_bit(STRIPE_OP_RCW_Drain, &ops_state)) {
+
+			check_channel(DMA_MEMCPY, stripe_op_rcw_drain);
+
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				struct bio *chosen;
+				struct bio *wbi;
+
+				if (i!=pd_idx && dev->towrite &&
+					test_bit(R5_LOCKED, &dev->flags)) {
+
+					spin_lock(&sh->lock);
+					chosen = dev->towrite;
+					dev->towrite = NULL;
+					BUG_ON(dev->written);
+					wbi = dev->written = chosen;
+					spin_unlock(&sh->lock);
+
+					overlap++;
+
+					while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+						sh->ops.dma_cookie = dma_raid_copy_data(1,
+								     wbi, dev->dma, dev->sector,
+								     sh->ops.dma_chan,
+								     sh->ops.dma_cookie);
+						wbi = r5_next_bio(wbi, dev->sector);
+					}
+				}
+			}
+			set_bit(STRIPE_OP_RCW_Parity, &ops_state);
+		}
+
+		if (test_and_clear_bit(STRIPE_OP_RCW_Parity, &ops_state)) {
+			dma_addr_t xor_destination_addr;
+			int count;
+
+			check_channel(DMA_XOR, stripe_op_rcw_parity);
+			count = 0;
+
+			xor_destination_addr = sh->dev[pd_idx].dma;
+
+			for (i=disks; i--;)
+				if (i != pd_idx) {
+					dma[count++] = sh->dev[i].dma;
+					check_xor();
+				}
+			if (count > 1)
+				issue_xor();
+
+			set_bit(STRIPE_OP_RCW_Dma, &ops_state);
+		}
+	}
+
+	if (test_bit(STRIPE_OP_CHECK, &state) &&
+		!test_bit(STRIPE_OP_CHECK_Dma, &ops_state)) {
+		PRINTK("%s: stripe %llu STRIPE_OP_CHECK op_state: %lx\n",
+		__FUNCTION__, (unsigned long long)sh->sector,
+		ops_state);
+
+		if (test_and_clear_bit(STRIPE_OP_CHECK_Gen, &ops_state)) {
+
+			check_channel(DMA_XOR | DMA_ZERO_SUM, stripe_op_check_gen);
+
+			if (disks > MAX_HW_XOR_SRCS) {
+				/* we need to do a destructive xor 
+				 * i.e. the result needs to be temporarily stored in memory
+				 */
+				dma_addr_t xor_destination_addr;
+				int count = 0;
+				int skip = -1;
+
+				xor_destination_addr = dma[count++] = sh->dev[pd_idx].dma;
+
+				/* xor all but one block */
+				for (i=disks; i--;)
+					if (i != pd_idx) {
+						if (skip < 0) {
+							skip = i;
+							continue;
+						}
+						dma[count++] = sh->dev[i].dma;
+						check_xor();
+					}
+				if (count > 1)
+					issue_xor();
+
+				/* zero result check the skipped block with
+				 * the new parity
+				 */
+				count = 2;
+				dma[1] = sh->dev[skip].dma;
+				do {
+					cookie = dma_async_zero_sum_dma_list(
+						sh->ops.dma_chan,
+						dma,
+						count,
+						STRIPE_SIZE,
+						&sh->ops.dma_result);
+					if (cookie == -ENOMEM)
+						dma_sync_wait(sh->ops.dma_chan,
+						sh->ops.dma_cookie);
+					else
+						WARN_ON(cookie <= 0);
+				} while (cookie == -ENOMEM);
+				sh->ops.dma_cookie = cookie;
+			} else {
+				int count = 0;
+				for (i=disks; i--;)
+					dma[count++] = sh->dev[i].dma;
+				do {
+					cookie = dma_async_zero_sum_dma_list(
+						sh->ops.dma_chan,
+						dma,
+						count,
+						STRIPE_SIZE,
+						&sh->ops.dma_result);
+					if (cookie == -ENOMEM)
+						dma_sync_wait(sh->ops.dma_chan,
+						sh->ops.dma_cookie);
+					else
+						WARN_ON(cookie <= 0);
+				} while (cookie == -ENOMEM);
+				sh->ops.dma_cookie = cookie;
+			}
+			set_bit(STRIPE_OP_CHECK_Verify, &ops_state);
+			set_bit(STRIPE_OP_CHECK_Dma, &ops_state);
+		}
+	}
+
+#ifdef CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE
+raid5_dma_retry:
+#endif
+	spin_lock(&sh->lock);
+	/* Update the state of operations:
+	 * -clear incoming requests
+	 * -preserve output status (i.e. done status / check result / dma)
+	 * -preserve requests added since 'ops_state_orig' was set
+	 */
+	sh->ops.state ^= (ops_state_orig & ~STRIPE_OP_COMPLETION_MASK);
+	sh->ops.state |= ops_state;
+
+	/* if we cleared an overlap condition wake up threads in make_request */
+	if (overlap)
+		for (i= disks; i-- ;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_and_clear_bit(R5_Overlap, &dev->flags))
+				wake_up(&sh->raid_conf->wait_for_overlap);
+		}
+
+	if (dma_async_operation_complete(sh->ops.dma_chan, sh->ops.dma_cookie,
+					NULL, NULL) == DMA_IN_PROGRESS)
+		dma_async_issue_pending(sh->ops.dma_chan);
+	else { /* now that dma operations have quiesced update the stripe state */
+		int written, work;
+		written = 0;
+		work = 0;
+
+		if (test_and_clear_bit(STRIPE_OP_BIOFILL_Dma, &sh->ops.state)) {
+			work++;
+			set_bit(STRIPE_OP_BIOFILL_Done, &sh->ops.state);
+		}
+		if (test_and_clear_bit(STRIPE_OP_COMPUTE_Dma, &sh->ops.state)) {
+			for (i=disks ; i-- ;)
+				if (test_and_clear_bit(R5_ComputeReq,
+						       &sh->dev[i].flags)) {
+					set_bit(R5_UPTODATE,
+						&sh->dev[i].flags);
+					break;
+				}
+			work++;
+			set_bit(STRIPE_OP_COMPUTE_Done, &sh->ops.state);
+		}
+		if (test_and_clear_bit(STRIPE_OP_RCW_Dma, &sh->ops.state)) {
+			work++;
+			written++;
+			set_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+			set_bit(STRIPE_OP_RCW_Done, &sh->ops.state);
+		}
+		if (test_and_clear_bit(STRIPE_OP_RMW_Dma, &sh->ops.state)) {
+			work++;
+			written++;
+			set_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+			set_bit(STRIPE_OP_RMW_Done, &sh->ops.state);
+		}
+		if (written)
+			for (i=disks ; i-- ;) {
+				struct r5dev *dev = &sh->dev[i];
+				if (dev->written)
+					set_bit(R5_UPTODATE, &dev->flags);
+			}
+		if (test_and_clear_bit(STRIPE_OP_CHECK_Dma, &sh->ops.state)) {
+			if (test_and_clear_bit(STRIPE_OP_CHECK_Verify,
+				&sh->ops.state)) {
+				work++;
+				if (sh->ops.dma_result == 0) {
+					set_bit(STRIPE_OP_CHECK_IsZero,
+						&sh->ops.state);
+
+					/* if the parity is correct and we
+					 * performed the check without dirtying
+					 * the parity block, mark it up to date.
+					 */
+					if (disks <= MAX_HW_XOR_SRCS)
+						set_bit(R5_UPTODATE, 
+						&sh->dev[sh->pd_idx].flags);
+
+				} else
+					clear_bit(STRIPE_OP_CHECK_IsZero,
+						&sh->ops.state);
+
+				set_bit(STRIPE_OP_CHECK_Done, &sh->ops.state);
+
+			} else
+				BUG();
+		}
+
+		sh->ops.pending -= work;
+		BUG_ON(sh->ops.pending < 0);
+
+		#ifdef CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE
+		/* return to the bookmark to continue the operation */
+		if (sh->ops.ops_bookmark) {
+			overlap = 0;
+			state = sh->state;
+			ops_state_orig = ops_state = sh->ops.state;
+			spin_unlock(&sh->lock);
+			goto *sh->ops.ops_bookmark;
+		}
+		#endif
+
+		/* the stripe is done with the channel */
+		dma_chan_put(sh->ops.dma_chan);
+		sh->ops.dma_chan = NULL;
+		sh->ops.dma_cookie = 0;
+	}
+
+	BUG_ON(sh->ops.pending == 0 && sh->ops.dma_chan);
+	clear_bit(STRIPE_OP_QUEUED, &sh->state);
+	set_bit(STRIPE_HANDLE, &sh->state);
+	queue_raid_work(sh);
+	spin_unlock(&sh->lock);
+
+	release_stripe(sh);
+}
+
+static void raid5_dma_event_callback(struct dma_client *client,
+			struct dma_chan *chan, enum dma_event event)
+{
+	switch (event) {
+	case DMA_RESOURCE_SUSPEND:
+		PRINTK("%s: DMA_RESOURCE_SUSPEND\n", __FUNCTION__);
+		break;
+	case DMA_RESOURCE_RESUME:
+		PRINTK("%s: DMA_RESOURCE_RESUME\n", __FUNCTION__);
+		break;
+	case DMA_RESOURCE_ADDED:
+		PRINTK("%s: DMA_RESOURCE_ADDED\n", __FUNCTION__);
+		break;
+	case DMA_RESOURCE_REMOVED:
+		PRINTK("%s: DMA_RESOURCE_REMOVED\n", __FUNCTION__);
+		break;
+	default:
+		PRINTK("%s: unknown\n", __FUNCTION__);
+		break;
+	}
+
+}
+
+static int __init raid5_dma_init(void)
+{
+	raid5_dma_client = dma_async_client_register(
+				&raid5_dma_event_callback);
+
+	if (raid5_dma_client == NULL)
+		return -ENOMEM;
+
+	__arch_raid5_dma_chan_request(raid5_dma_client);
+
+	printk("raid5-dma: driver initialized\n");
+	return 0;
+
+}
+
+static void __exit raid5_dma_exit(void)
+{
+	if (raid5_dma_client)
+		dma_async_client_unregister(raid5_dma_client);
+
+	raid5_dma_client = NULL;
+}
+
+static struct dma_chan *raid5_dma_next_channel(void)
+{
+	return __arch_raid5_dma_next_channel(raid5_dma_client);
+}
+
+void raid5_dma_get_dma(struct raid5_dma *dma)
+{
+	dma->owner = THIS_MODULE;
+	dma->channel_iterate = raid5_dma_next_channel;
+	dma->do_block_ops = dma_do_raid5_block_ops;
+	atomic_inc(&raid5_count);
+}
+
+EXPORT_SYMBOL_GPL(raid5_dma_get_dma);
+
+module_init(raid5_dma_init);
+module_exit(raid5_dma_exit);
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_DESCRIPTION("RAID5-DMA Offload Driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2a16b3b..dbd3ddc 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -183,6 +183,17 @@ config MD_RAID456_WORKQUEUE_MULTITHREAD
 
 	  If unsure say, Y.
 
+config MD_RAID5_HW_OFFLOAD
+	depends on MD_RAID456 && RAID5_DMA
+	bool "Execute raid5 xor/copy operations with hardware engines"
+	default y
+	---help---
+	  On platforms with the requisite hardware capabilities MD
+	  can offload RAID5 stripe cache operations (i.e. parity
+	  maintenance and bio buffer copies).
+
+	  If unsure, say Y.
+
 config MD_MULTIPATH
 	tristate "Multipath I/O support"
 	depends on BLK_DEV_MD
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ad6883b..4daa335 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -53,6 +53,16 @@ #include "raid6.h"
 
 #include <linux/raid/bitmap.h>
 
+#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+#include <linux/dma-mapping.h>
+extern void raid5_dma_get_dma(struct raid5_dma *dma);
+#endif /* CONFIG_MD_RAID5_HW_OFFLOAD */
+
+#ifdef CONFIG_MD_RAID6_HW_OFFLOAD
+#include <linux/dma-mapping.h>
+extern void raid6_dma_get_dma(struct raid6_dma *dma);
+#endif /* CONFIG_MD_RAID6_HW_OFFLOAD */
+
 /*
  * Stripe cache
  */
@@ -138,7 +148,7 @@ static void __release_stripe(raid5_conf_
 		}
 	}
 }
-static void release_stripe(struct stripe_head *sh)
+void release_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
 	unsigned long flags;
@@ -193,6 +203,17 @@ static void shrink_buffers(struct stripe
 		p = sh->dev[i].page;
 		if (!p)
 			continue;
+		#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+		do {
+			raid5_conf_t *conf = sh->raid_conf;
+			struct dma_chan *chan = conf->dma.channel_iterate();
+			/* assumes that all channels share the same mapping
+			 * characteristics
+			 */
+			dma_async_unmap_page(chan, sh->dev[i].dma,
+					PAGE_SIZE, DMA_FROM_DEVICE);
+		} while (0);
+		#endif
 		sh->dev[i].page = NULL;
 		put_page(p);
 	}
@@ -209,6 +230,20 @@ static int grow_buffers(struct stripe_he
 			return 1;
 		}
 		sh->dev[i].page = page;
+		#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+		do {
+			raid5_conf_t *conf = sh->raid_conf;
+			struct dma_chan *chan = conf->dma.channel_iterate();
+			/* assumes that all channels share the same mapping
+			 * characteristics
+			 */
+			sh->dev[i].dma = dma_async_map_page(chan,
+							sh->dev[i].page,
+							0,
+							PAGE_SIZE,
+							DMA_FROM_DEVICE);
+		} while (0);
+		#endif
 	}
 	return 0;
 }
@@ -576,6 +611,13 @@ #if 0
 #else
 		set_bit(R5_UPTODATE, &sh->dev[i].flags);
 #endif
+#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+		/* If the backing block device driver performed a pio
+		 * read then the buffer needs to be cleaned
+		 */
+		consistent_sync(page_address(sh->dev[i].page), PAGE_SIZE,
+				DMA_TO_DEVICE);
+#endif
 		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
 			rdev = conf->disks[i].rdev;
 			printk(KERN_INFO "raid5:%s: read error corrected (%lu sectors at %llu on %s)\n",
@@ -666,6 +708,15 @@ static int raid5_end_write_request (stru
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
 	
 	clear_bit(R5_LOCKED, &sh->dev[i].flags);
+
+	#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+	/* If the backing block device driver performed a pio
+	 * write then the buffer needs to be invalidated
+	 */
+	consistent_sync(page_address(sh->dev[i].page), PAGE_SIZE,
+			DMA_FROM_DEVICE);
+	#endif
+
 	set_bit(STRIPE_HANDLE, &sh->state);
 	__release_stripe(conf, sh);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
@@ -1311,6 +1362,7 @@ static int stripe_to_pdidx(sector_t stri
 	return pd_idx;
 }
 
+#ifndef CONFIG_MD_RAID5_HW_OFFLOAD
 /*
  * raid5_do_soft_block_ops - perform block memory operations on stripe data
  * outside the spin lock.
@@ -1600,6 +1652,7 @@ static void raid5_do_soft_block_ops(void
 
 	release_stripe(sh);
 }
+#endif /* #ifndef CONFIG_MD_RAID5_HW_OFFLOAD*/
 
 /*
  * handle_stripe - do things to a stripe.
@@ -3553,12 +3606,12 @@ static int run(mddev_t *mddev)
 	#endif
 	#endif
 
-	/* To Do:
-	 * 1/ Offload to asynchronous copy / xor engines
-	 * 2/ Automated selection of optimal do_block_ops
-	 *	routine similar to the xor template selection
-	 */
+	#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+	raid5_dma_get_dma(&conf->dma);
+	conf->do_block_ops = conf->dma.do_block_ops;
+	#else
 	conf->do_block_ops = raid5_do_soft_block_ops;
+	#endif
 
 
 	spin_lock_init(&conf->device_lock);
@@ -4184,6 +4237,7 @@ static void raid5_exit(void)
 	unregister_md_personality(&raid4_personality);
 }
 
+EXPORT_SYMBOL(release_stripe);
 module_init(raid5_init);
 module_exit(raid5_exit);
 MODULE_LICENSE("GPL");
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 0a70c9e..7fd5aaf 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -762,6 +762,11 @@ static inline enum dma_status dma_sync_w
 	return status;
 }
 
+static inline dma_cookie_t dma_async_get_last_cookie(struct dma_chan *chan)
+{
+	return chan->cookie;
+}
+
 /* --- DMA device --- */
 
 int dma_async_device_register(struct dma_device *device);
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 31ae55c..f5b021d 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -4,6 +4,9 @@ #define _RAID5_H
 #include <linux/raid/md.h>
 #include <linux/raid/xor.h>
 #include <linux/workqueue.h>
+#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+#include <linux/dmaengine.h>
+#endif
 
 /*
  *
@@ -169,16 +172,28 @@ struct stripe_head {
 		#ifdef CONFIG_MD_RAID456_WORKQUEUE
 		struct work_struct	work;		/* work queue descriptor */
 		#endif
+		#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+		u32 			dma_result;	/* storage for dma engine zero sum results */
+		dma_cookie_t		dma_cookie;	/* last issued dma operation */
+		struct dma_chan		*dma_chan;	/* dma channel for ops offload */
+		#ifdef CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE
+		void			*ops_bookmark;	/* place holder for requeued stripes */
+		#endif /* CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE */
+		#endif /* CONFIG_MD_RAID5_HW_OFFLOAD */
 	} ops;
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
 		struct page	*page;
+		#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+		dma_addr_t	dma;
+		#endif
 		struct bio	*toread, *read, *towrite, *written;
 		sector_t	sector;			/* sector of this page */
 		unsigned long	flags;
 	} dev[1]; /* allocated with extra space depending of RAID geometry */
 };
+
 /* Flags */
 #define	R5_UPTODATE	0	/* page contains current data */
 #define	R5_LOCKED	1	/* IO has been submitted on "req" */
@@ -190,7 +205,6 @@ #define	R5_Wantwrite	5
 #define	R5_Overlap	7	/* There is a pending overlapping request on this block */
 #define	R5_ReadError	8	/* seen a read error here recently */
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */
-
 #define	R5_Expanded	10	/* This block now has post-expand data */
 #define	R5_Consistent	11	/* Block is HW DMA-able without a cache flush */
 #define	R5_ComputeReq	12	/* compute_block in progress treat as uptodate */
@@ -373,6 +387,14 @@ struct raid5_private_data {
 	int			pool_size; /* number of disks in stripeheads in pool */
 	spinlock_t		device_lock;
 	struct disk_info	*disks;
+#ifdef CONFIG_MD_RAID5_HW_OFFLOAD
+	struct raid5_dma {
+		struct module *owner;
+		void (*do_block_ops)(void *stripe_ref);
+		struct dma_chan * (*channel_iterate)(void);
+	} dma;
+#endif
+
 };
 
 typedef struct raid5_private_data raid5_conf_t;


* [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (14 preceding siblings ...)
  2006-09-11 23:18 ` [PATCH 15/19] dmaengine: raid5 dma client Dan Williams
@ 2006-09-11 23:19 ` Dan Williams
  2006-09-15 14:57   ` Olof Johansson
  2006-09-11 23:19 ` [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs Dan Williams
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:19 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

This is a driver for the IOP DMA/AAU/ADMA units, which are capable of pq_xor,
pq_update, pq_zero_sum, xor, dual_xor, xor_zero_sum, fill, copy+crc, and copy
operations.
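
For orientation, here is a minimal sketch of how the xor path is exercised.
It mirrors the self-test in the patch body; the do_iop_adma_* entry points
are internal to this driver and would normally be reached through the
dmaengine client wrappers, and the chan/src_cnt/dest_page/xor_src_pages
names are illustrative only:

	union dmaengine_addr dest, src;
	dma_cookie_t cookie;

	dest.pg = dest_page;		/* destination page (assumed) */
	src.pgs = xor_src_pages;	/* array of src_cnt source pages (assumed) */

	cookie = do_iop_adma_xor(chan, dest, 0, src, src_cnt, 0, PAGE_SIZE,
				DMA_DEST_PAGE | DMA_SRC_PAGES);
	iop_adma_issue_pending(chan);

	/* poll for completion, as the self-test does */
	while (iop_adma_is_complete(chan, cookie, NULL, NULL) != DMA_SUCCESS)
		cpu_relax();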

Changelog:
* fixed a slot allocation bug in do_iop13xx_adma_xor that caused too few
slots to be requested, eventually leading to data corruption
* enabled the slot allocation routine to attempt to free slots before
returning -ENOMEM
* switched the cleanup routine to use only the software chain and the
status register to determine whether a descriptor is complete.  This is
necessary to support other IOP engines that do not have status writeback
capability
* made the driver IOP generic
* modified the allocation routines to understand allocating a group of
slots for a single operation
* added a null xor initialization operation for the xor-only channel on
iop3xx
* added software emulation of zero sum on iop32x
* added support for xor operations on buffers larger than the hardware maximum
* added architecture-specific raid5-dma support functions

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/dma/Kconfig                 |   27 +
 drivers/dma/Makefile                |    1 
 drivers/dma/iop-adma.c              | 1501 +++++++++++++++++++++++++++++++++++
 include/asm-arm/hardware/iop_adma.h |   98 ++
 4 files changed, 1624 insertions(+), 3 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index fced8c3..3556143 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -7,8 +7,8 @@ menu "DMA Engine support"
 config DMA_ENGINE
 	bool "Support for DMA engines"
 	---help---
-	  DMA engines offload copy operations from the CPU to dedicated
-	  hardware, allowing the copies to happen asynchronously.
+	  DMA engines offload block memory operations from the CPU to dedicated
+	  hardware, allowing the operations to happen asynchronously.
 
 comment "DMA Clients"
 
@@ -28,9 +28,19 @@ config RAID5_DMA
 	default y
 	---help---
 	  This enables the use of DMA engines in the MD-RAID5 driver to
-	  offload stripe cache operations, freeing CPU cycles.
+	  offload stripe cache operations (i.e. xor, memcpy), freeing CPU cycles.
 	  say Y here
 
+config RAID5_DMA_WAIT_VIA_REQUEUE
+	bool "raid5-dma: Non-blocking channel switching"
+	depends on RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH && RAID5_DMA && BROKEN
+	default n
+	---help---
+	  This enables the raid5-dma driver to continue to operate on incoming
+	  stripes when it determines that the current stripe must wait for a
+	  hardware channel to finish operations.  This code is a work in
+	  progress; only say Y to debug the implementation, otherwise say N.
+
 comment "DMA Devices"
 
 config INTEL_IOATDMA
@@ -40,4 +50,15 @@ config INTEL_IOATDMA
 	---help---
 	  Enable support for the Intel(R) I/OAT DMA engine.
 
+config INTEL_IOP_ADMA
+	tristate "Intel IOP ADMA support"
+	depends on DMA_ENGINE && (ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX)
+	select RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH if (ARCH_IOP32X || ARCH_IOP33X)
+	default m
+	---help---
+	  Enable support for the Intel(R) IOP Series RAID engines.
+
+config RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH
+	bool
+
 endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 4e36d6e..233eae7 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_NET_DMA) += iovlock.o
 obj-$(CONFIG_RAID5_DMA) += raid5-dma.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
+obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
new file mode 100644
index 0000000..51f1c54
--- /dev/null
+++ b/drivers/dma/iop-adma.c
@@ -0,0 +1,1501 @@
+/*
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+
+/*
+ * This driver supports the asynchronous DMA copy and RAID engines available
+ * on the Intel Xscale(R) family of I/O Processors (IOP 32x, 33x, and 13xx)
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/dmaengine.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/platform_device.h>
+#include <asm/arch/adma.h>
+#include <asm/memory.h>
+
+#define to_iop_adma_chan(chan) container_of(chan, struct iop_adma_chan, common)
+#define to_iop_adma_device(dev) container_of(dev, struct iop_adma_device, common)
+#define to_iop_adma_slot(lh) container_of(lh, struct iop_adma_desc_slot, slot_node)
+
+#define IOP_ADMA_DEBUG 0
+#define PRINTK(x...) ((void)(IOP_ADMA_DEBUG && printk(x)))
+
+/* software zero sum implementation bits for iop32x */
+#ifdef CONFIG_ARCH_IOP32X
+char iop32x_zero_result_buffer[PAGE_SIZE] __attribute__((aligned(256)));
+u32 *iop32x_zero_sum_output;
+#endif
+
+/**
+ * iop_adma_free_slots - flags descriptor slots for reuse
+ * @slot: Slot to free
+ * Caller must hold &iop_chan->lock while calling this function
+ */
+static inline void iop_adma_free_slots(struct iop_adma_desc_slot *slot)
+{
+	int stride = slot->stride;
+	while (stride--) {
+		slot->stride = 0;
+		slot = list_entry(slot->slot_node.next,
+				struct iop_adma_desc_slot,
+				slot_node);
+	}
+}
+
+static void __iop_adma_slot_cleanup(struct iop_adma_chan *iop_chan)
+{
+	struct iop_adma_desc_slot *iter, *_iter;
+	dma_cookie_t cookie = 0;
+	struct device *dev = &iop_chan->device->pdev->dev;
+	u32 current_desc = iop_chan_get_current_descriptor(iop_chan);
+	int busy = iop_chan_is_busy(iop_chan);
+	int seen_current = 0;
+
+	/* free completed slots from the chain starting with
+	 * the oldest descriptor
+	 */
+	list_for_each_entry_safe(iter, _iter, &iop_chan->chain,
+					chain_node) {
+		PRINTK("%s: [%d] cookie: %d busy: %x next: %x\n",
+			__FUNCTION__, iter->idx, iter->cookie, busy,
+			iop_desc_get_next_desc(iter, iop_chan));
+
+		/* do not advance past the current descriptor loaded into the
+		 * hardware channel; subsequent descriptors are either in
+		 * process or have not been submitted
+		 */
+		if (seen_current)
+			break;
+
+		/* stop the search if we reach the current descriptor and the
+		 * channel is busy, or if it appears that the current descriptor
+		 * needs to be re-read (i.e. has been appended to)
+		 */
+		if (iter->phys == current_desc) {
+			BUG_ON(seen_current++);
+			if (busy || iop_desc_get_next_desc(iter, iop_chan))
+				break;
+		}
+
+		/* if we are tracking a group of zero-result descriptors add
+		 * the current result to the accumulator
+		 */
+		if (iop_chan->zero_sum_group) {
+			iop_chan->result_accumulator |=
+				iop_desc_get_zero_result(iter);
+			PRINTK("%s: add to zero sum group acc: %d this: %d\n", __FUNCTION__,
+				iop_chan->result_accumulator, iop_desc_get_zero_result(iter));
+		}
+
+		if (iter->cookie) {
+			u32 src_cnt = iter->src_cnt;
+			u32 len = iop_desc_get_byte_count(iter, iop_chan);
+			dma_addr_t addr;
+
+			cookie = iter->cookie;
+			iter->cookie = 0;
+
+			/* the first and last descriptor in a zero sum group
+			 * will have 'xor_check_result' set
+			 */
+			if (iter->xor_check_result) {
+				if (iter->slot_cnt > iter->slots_per_op) {
+					if (!iop_chan->zero_sum_group) {
+						iop_chan->zero_sum_group = 1;
+						iop_chan->result_accumulator |=
+							iop_desc_get_zero_result(iter);
+					}
+					PRINTK("%s: start zero sum group acc: %d this: %d\n", __FUNCTION__,
+						iop_chan->result_accumulator, iop_desc_get_zero_result(iter));
+				} else {
+					if (!iop_chan->zero_sum_group)
+						iop_chan->result_accumulator |=
+							iop_desc_get_zero_result(iter);
+					else
+						iop_chan->zero_sum_group = 0;
+	
+					*iter->xor_check_result = iop_chan->result_accumulator;
+					iop_chan->result_accumulator = 0;
+
+					PRINTK("%s: end zero sum group acc: %d this: %d\n", __FUNCTION__,
+						*iter->xor_check_result, iop_desc_get_zero_result(iter));
+				}
+			}
+
+			/* unmap dma ranges */
+			switch (iter->flags & (DMA_DEST_BUF | DMA_DEST_PAGE |
+				DMA_DEST_DMA)) {
+			case DMA_DEST_BUF:
+				addr = iop_desc_get_dest_addr(iter, iop_chan);
+				dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);
+				break;
+			case DMA_DEST_PAGE:
+				addr = iop_desc_get_dest_addr(iter, iop_chan);
+				dma_unmap_page(dev, addr, len, DMA_FROM_DEVICE);
+				break;
+			case DMA_DEST_DMA:
+				break;
+			}
+
+			switch (iter->flags & (DMA_SRC_BUF |
+					DMA_SRC_PAGE | DMA_SRC_DMA |
+					DMA_SRC_PAGES | DMA_SRC_DMA_LIST)) {
+			case DMA_SRC_BUF:
+				addr = iop_desc_get_src_addr(iter, iop_chan, 0);
+				dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);
+				break;
+			case DMA_SRC_PAGE:
+				addr = iop_desc_get_src_addr(iter, iop_chan, 0);
+				dma_unmap_page(dev, addr, len, DMA_TO_DEVICE);
+				break;
+			case DMA_SRC_PAGES:
+				while(src_cnt--) {
+					addr = iop_desc_get_src_addr(iter,
+								iop_chan,
+								src_cnt);
+					dma_unmap_page(dev, addr, len,
+						DMA_TO_DEVICE);
+				}
+				break;
+			case DMA_SRC_DMA:
+			case DMA_SRC_DMA_LIST:
+				break;
+			}
+		}
+
+		/* leave the last descriptor in the chain 
+		 * so we can append to it
+		 */
+		if (iter->chain_node.next == &iop_chan->chain)
+			break;
+
+		PRINTK("iop adma%d: cleanup %d stride %d\n",
+		iop_chan->device->id, iter->idx, iter->stride);
+
+		list_del(&iter->chain_node);
+		iop_adma_free_slots(iter);
+	}
+
+	BUG_ON(!seen_current);
+
+	if (cookie) {
+		iop_chan->completed_cookie = cookie;
+
+		PRINTK("iop adma%d: completed cookie %d\n",
+		iop_chan->device->id, cookie);
+	}
+}
+
+static inline void iop_adma_slot_cleanup(struct iop_adma_chan *iop_chan)
+{
+	spin_lock_bh(&iop_chan->lock);
+	__iop_adma_slot_cleanup(iop_chan);
+	spin_unlock_bh(&iop_chan->lock);
+}
+
+static struct iop_adma_desc_slot *
+__iop_adma_alloc_slots(struct iop_adma_chan *iop_chan, int num_slots,
+			int slots_per_op, int recurse)
+{
+	struct iop_adma_desc_slot *iter = NULL, *alloc_start = NULL;
+	int i;
+
+	/* start the search from the last allocated descriptor; if a
+	 * contiguous allocation cannot be found, start searching
+	 * from the beginning of the list
+	 */
+	for (i = 0; i < 2; i++) {
+		int slots_found = 0;
+		if (i == 0)
+			iter = iop_chan->last_used;
+		else {
+			iter = list_entry(&iop_chan->all_slots,
+				struct iop_adma_desc_slot,
+				slot_node);
+		}
+
+		list_for_each_entry_continue(iter, &iop_chan->all_slots, slot_node) {
+			if (iter->stride) {
+				/* give up after finding the first busy slot
+				 * on the second pass through the list
+				 */
+				if (i == 1)
+					break;
+
+				slots_found = 0;
+				continue;
+			}
+
+			/* start the allocation if the slot is correctly aligned */
+			if (!slots_found++) {
+				if (iop_desc_is_aligned(iter, slots_per_op))
+					alloc_start = iter;
+				else {
+					slots_found = 0;
+					continue;
+				}
+			}
+
+			if (slots_found == num_slots) {
+				iter = alloc_start;
+				while (num_slots) {
+					PRINTK("iop adma%d: allocated [%d] "
+						"(desc %p phys: %#x) stride %d\n",
+						iop_chan->device->id,
+						iter->idx, iter->hw_desc, iter->phys,
+						slots_per_op);
+					iop_chan->last_used = iter;
+					list_add_tail(&iter->chain_node,
+							&iop_chan->chain);
+					iter->slot_cnt = num_slots;
+					iter->slots_per_op = slots_per_op;
+					iter->xor_check_result = NULL;
+					iter->cookie = 0;
+					for (i = 0; i < slots_per_op; i++) {
+						iter->stride = slots_per_op - i;
+						iter = list_entry(iter->slot_node.next,
+								struct iop_adma_desc_slot,
+								slot_node);
+					}
+					num_slots -= slots_per_op;
+				}
+				return alloc_start;
+			}
+		}
+	}
+
+	/* try once to free some slots if the allocation fails */
+	if (recurse) {
+		__iop_adma_slot_cleanup(iop_chan);
+		return __iop_adma_alloc_slots(iop_chan, num_slots, slots_per_op, 0);
+	} else
+		return NULL;
+}
+
+static struct iop_adma_desc_slot *
+iop_adma_alloc_slots(struct iop_adma_chan *iop_chan,
+			int num_slots,
+			int slots_per_op)
+{
+	return __iop_adma_alloc_slots(iop_chan, num_slots, slots_per_op, 1);
+}
+
+static void iop_chan_start_null_memcpy(struct iop_adma_chan *iop_chan);
+static void iop_chan_start_null_xor(struct iop_adma_chan *iop_chan);
+
+/* returns the actual number of allocated descriptors */
+static int iop_adma_alloc_chan_resources(struct dma_chan *chan)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	struct iop_adma_desc_slot *slot = NULL;
+	char *hw_desc;
+	int i;
+	int init = iop_chan->slots_allocated ? 0 : 1;
+	struct iop_adma_platform_data *plat_data;
+	
+	plat_data = iop_chan->device->pdev->dev.platform_data;
+
+	spin_lock_bh(&iop_chan->lock);
+	/* Allocate descriptor slots */
+	i = iop_chan->slots_allocated;
+	for (; i < (plat_data->pool_size/IOP_ADMA_SLOT_SIZE); i++) {
+		slot = kmalloc(sizeof(*slot), GFP_KERNEL);
+		if (!slot) {
+			printk(KERN_INFO "IOP ADMA Channel only initialized"
+				" %d descriptor slots\n", i--);
+			break;
+		}
+		hw_desc = (char *) iop_chan->device->dma_desc_pool_virt;
+		slot->hw_desc = (void *) &hw_desc[i * IOP_ADMA_SLOT_SIZE];
+
+		INIT_LIST_HEAD(&slot->chain_node);
+		INIT_LIST_HEAD(&slot->slot_node);
+		hw_desc = (char *) iop_chan->device->dma_desc_pool;
+		slot->phys = (dma_addr_t) &hw_desc[i * IOP_ADMA_SLOT_SIZE];
+		slot->stride = 0;
+		slot->cookie = 0;
+		slot->xor_check_result = NULL;
+		slot->idx = i;
+		list_add_tail(&slot->slot_node, &iop_chan->all_slots);
+	}
+	if (i && !iop_chan->last_used)
+		iop_chan->last_used = list_entry(iop_chan->all_slots.next,
+					struct iop_adma_desc_slot,
+					slot_node);
+
+	iop_chan->slots_allocated = i;
+	PRINTK("iop adma%d: allocated %d descriptor slots last_used: %p\n",
+		iop_chan->device->id, i, iop_chan->last_used);
+	spin_unlock_bh(&iop_chan->lock);
+
+	/* initialize the channel and the chain with a null operation */
+	if (init) {
+		if (iop_chan->device->common.capabilities & DMA_MEMCPY)
+			iop_chan_start_null_memcpy(iop_chan);
+		else if (iop_chan->device->common.capabilities & DMA_XOR)
+			iop_chan_start_null_xor(iop_chan);
+		else
+			BUG();
+	}
+
+	return (i > 0) ? i : -ENOMEM;
+}
+
+/* chain the descriptors */
+static inline void iop_chan_chain_desc(struct iop_adma_chan *iop_chan,
+					struct iop_adma_desc_slot *desc)
+{
+	struct iop_adma_desc_slot *prev = list_entry(desc->chain_node.prev,
+						struct iop_adma_desc_slot,
+						chain_node);
+	iop_desc_set_next_desc(prev, iop_chan, desc->phys);
+}
+
+static inline void iop_desc_assign_cookie(struct iop_adma_chan *iop_chan,
+					struct iop_adma_desc_slot *desc)
+{
+	dma_cookie_t cookie = iop_chan->common.cookie;
+	cookie++;
+	if (cookie < 0)
+		cookie = 1;
+	iop_chan->common.cookie = desc->cookie = cookie;
+	PRINTK("iop adma%d: %s cookie %d slot %d\n",
+	iop_chan->device->id, __FUNCTION__, cookie, desc->idx);
+}
+
+static inline void iop_adma_check_threshold(struct iop_adma_chan *iop_chan)
+{
+	if (iop_chan->pending >= IOP_ADMA_THRESHOLD) {
+		iop_chan->pending = 0;
+		iop_chan_append(iop_chan);
+	}
+}
+
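+/**
+ * do_iop_adma_memcpy - copy from a source buffer/page to a destination
+ * @chan: common channel handle
+ * @dest: DMAENGINE destination address
+ * @dest_off: offset into the destination page
+ * @src: DMAENGINE source address
+ * @src_off: offset into the source page
+ * @len: transaction length in bytes
+ * @flags: DMAENGINE address type flags
+ */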
+static dma_cookie_t do_iop_adma_memcpy(struct dma_chan *chan,
+                                       union dmaengine_addr dest,
+					unsigned int dest_off,
+                                       union dmaengine_addr src,
+					unsigned int src_off,
+                                       size_t len,
+                                       unsigned long flags)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	dma_cookie_t ret = -ENOMEM;
+	struct iop_adma_desc_slot *sw_desc;
+	int slot_cnt, slots_per_op;
+
+	if (!chan || !dest.dma || !src.dma)
+		return -EFAULT;
+	if (!len)
+		return iop_chan->common.cookie;
+
+	PRINTK("iop adma%d: %s len: %u flags: %#lx\n",
+	iop_chan->device->id, __FUNCTION__, len, flags);
+
+	switch (flags & (DMA_SRC_BUF | DMA_SRC_PAGE | DMA_SRC_DMA)) {
+	case DMA_SRC_BUF:
+		src.dma = dma_map_single(&iop_chan->device->pdev->dev,
+			src.buf, len, DMA_TO_DEVICE);
+		break;
+	case DMA_SRC_PAGE:
+		src.dma = dma_map_page(&iop_chan->device->pdev->dev,
+			src.pg, src_off, len, DMA_TO_DEVICE);
+		break;
+	case DMA_SRC_DMA:
+		break;
+	default:
+		return -EFAULT;
+	}
+
+	switch (flags & (DMA_DEST_BUF | DMA_DEST_PAGE | DMA_DEST_DMA)) {
+	case DMA_DEST_BUF:
+		dest.dma = dma_map_single(&iop_chan->device->pdev->dev,
+			dest.buf, len, DMA_FROM_DEVICE);
+		break;
+	case DMA_DEST_PAGE:
+		dest.dma = dma_map_page(&iop_chan->device->pdev->dev,
+			dest.pg, dest_off, len, DMA_FROM_DEVICE);
+		break;
+	case DMA_DEST_DMA:
+		break;
+	default:
+		return -EFAULT;
+	}
+
+	spin_lock_bh(&iop_chan->lock);
+	slot_cnt = iop_chan_memcpy_slot_count(len, &slots_per_op);
+	sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
+	if (sw_desc) {
+		iop_desc_init_memcpy(sw_desc);
+		iop_desc_set_byte_count(sw_desc, iop_chan, len);
+		iop_desc_set_dest_addr(sw_desc, iop_chan, dest.dma);
+		iop_desc_set_memcpy_src_addr(sw_desc, src.dma, slot_cnt, slots_per_op);
+
+		iop_chan_chain_desc(iop_chan, sw_desc);
+		iop_desc_assign_cookie(iop_chan, sw_desc);
+
+		sw_desc->flags = flags;
+		iop_chan->pending++;
+		ret = sw_desc->cookie;
+	}
+	spin_unlock_bh(&iop_chan->lock);
+
+	iop_adma_check_threshold(iop_chan);
+
+	return ret;
+}
+
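+/**
+ * do_iop_adma_memset - fill the destination buffer/page with a byte value
+ * @chan: common channel handle
+ * @dest: DMAENGINE destination address
+ * @dest_off: offset into the destination page
+ * @val: fill value
+ * @len: transaction length in bytes
+ * @flags: DMAENGINE address type flags
+ */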
+static dma_cookie_t do_iop_adma_memset(struct dma_chan *chan,
+                                       union dmaengine_addr dest,
+					unsigned int dest_off,
+                                       int val,
+                                       size_t len,
+                                       unsigned long flags)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	dma_cookie_t ret = -ENOMEM;
+	struct iop_adma_desc_slot *sw_desc;
+	int slot_cnt, slots_per_op;
+
+	if (!chan || !dest.dma)
+		return -EFAULT;
+	if (!len)
+		return iop_chan->common.cookie;
+
+	PRINTK("iop adma%d: %s len: %u flags: %#lx\n",
+	iop_chan->device->id, __FUNCTION__, len, flags);
+
+	switch (flags & (DMA_DEST_BUF | DMA_DEST_PAGE | DMA_DEST_DMA)) {
+	case DMA_DEST_BUF:
+		dest.dma = dma_map_single(&iop_chan->device->pdev->dev,
+			dest.buf, len, DMA_FROM_DEVICE);
+		break;
+	case DMA_DEST_PAGE:
+		dest.dma = dma_map_page(&iop_chan->device->pdev->dev,
+			dest.pg, dest_off, len, DMA_FROM_DEVICE);
+		break;
+	case DMA_DEST_DMA:
+		break;
+	default:
+		return -EFAULT;
+	}
+
+	spin_lock_bh(&iop_chan->lock);
+	slot_cnt = iop_chan_memset_slot_count(len, &slots_per_op);
+	sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
+	if (sw_desc) {
+		iop_desc_init_memset(sw_desc);
+		iop_desc_set_byte_count(sw_desc, iop_chan, len);
+		iop_desc_set_block_fill_val(sw_desc, val);
+		iop_desc_set_dest_addr(sw_desc, iop_chan, dest.dma);
+
+		iop_chan_chain_desc(iop_chan, sw_desc);
+		iop_desc_assign_cookie(iop_chan, sw_desc);
+
+		sw_desc->flags = flags;
+		iop_chan->pending++;
+		ret = sw_desc->cookie;
+	}
+	spin_unlock_bh(&iop_chan->lock);
+
+	iop_adma_check_threshold(iop_chan);
+
+	return ret;
+}
+
+/**
+ * do_iop_adma_xor - xor from source pages to a dest page
+ * @chan: common channel handle
+ * @dest: DMAENGINE destination address
+ * @dest_off: offset into the destination page
+ * @src: DMAENGINE source addresses
+ * @src_cnt: number of source pages
+ * @src_off: offset into the source pages
+ * @len: transaction length in bytes
+ * @flags: DMAENGINE address type flags
+ */
+static dma_cookie_t do_iop_adma_xor(struct dma_chan *chan,
+					union dmaengine_addr dest,
+					unsigned int dest_off,
+					union dmaengine_addr src,
+					unsigned int src_cnt,
+					unsigned int src_off,
+					size_t len,
+					unsigned long flags)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	struct device *dev = &iop_chan->device->pdev->dev;
+	struct iop_adma_desc_slot *sw_desc;
+	dma_cookie_t ret = -ENOMEM;
+	int slot_cnt, slots_per_op;
+
+	if (!chan || !dest.dma || !src.dma_list)
+		return -EFAULT;
+
+	if (!len)
+		return iop_chan->common.cookie;
+
+	PRINTK("iop adma%d: %s src_cnt: %d len: %u flags: %lx\n",
+	iop_chan->device->id, __FUNCTION__, src_cnt, len, flags);
+
+	spin_lock_bh(&iop_chan->lock);
+	slot_cnt = iop_chan_xor_slot_count(len, src_cnt, &slots_per_op);
+	sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
+	if (sw_desc) {
+		#ifdef CONFIG_ARCH_IOP32X
+		if ((flags & DMA_DEST_BUF) &&
+			dest.buf == (void *) iop32x_zero_result_buffer) {
+			PRINTK("%s: iop32x zero sum emulation requested\n",
+				__FUNCTION__);
+			sw_desc->xor_check_result = iop32x_zero_sum_output;
+		}
+		#endif
+
+		iop_desc_init_xor(sw_desc, src_cnt);
+		iop_desc_set_byte_count(sw_desc, iop_chan, len);
+
+		switch (flags & (DMA_DEST_BUF | DMA_DEST_PAGE |
+				DMA_DEST_PAGES | DMA_DEST_DMA |
+				DMA_DEST_DMA_LIST)) {
+		case DMA_DEST_PAGE:
+			dest.dma = dma_map_page(dev, dest.pg, dest_off, len,
+					DMA_FROM_DEVICE);
+			break;
+		case DMA_DEST_BUF:
+			dest.dma = dma_map_single(dev, dest.buf, len,
+					DMA_FROM_DEVICE);
+			break;
+		}
+
+		iop_desc_set_dest_addr(sw_desc, iop_chan, dest.dma);
+
+		switch (flags & (DMA_SRC_BUF | DMA_SRC_PAGE |
+				DMA_SRC_PAGES | DMA_SRC_DMA |
+				DMA_SRC_DMA_LIST)) {
+		case DMA_SRC_PAGES:
+			while (src_cnt--) {
+				dma_addr_t addr = dma_map_page(dev,
+							src.pgs[src_cnt],
+							src_off, len,
+							DMA_TO_DEVICE);
+				iop_desc_set_xor_src_addr(sw_desc,
+							src_cnt,
+							addr,
+							slot_cnt,
+							slots_per_op);
+			}
+			break;
+		case DMA_SRC_DMA_LIST:
+			while (src_cnt--) {
+				iop_desc_set_xor_src_addr(sw_desc,
+							src_cnt,
+							src.dma_list[src_cnt],
+							slot_cnt,
+							slots_per_op);
+			}
+			break;
+		}
+
+		iop_chan_chain_desc(iop_chan, sw_desc);
+		iop_desc_assign_cookie(iop_chan, sw_desc);
+
+		sw_desc->flags = flags;
+		iop_chan->pending++;
+		ret = sw_desc->cookie;
+	}
+	spin_unlock_bh(&iop_chan->lock);
+
+	iop_adma_check_threshold(iop_chan);
+
+	return ret;
+}
+
+/**
+ * do_iop_adma_zero_sum - xor the sources together and report whether
+ *				the sum is zero
+ * @chan: common channel handle
+ * @src: DMAENGINE source addresses
+ * @src_cnt: number of sources
+ * @src_off: offset into the sources
+ * @len: transaction length in bytes
+ * @flags: DMAENGINE address type flags
+ * @result: set to 1 if sum is zero else 0
+ */
+#ifndef CONFIG_ARCH_IOP32X
+static dma_cookie_t do_iop_adma_zero_sum(struct dma_chan *chan,
+					union dmaengine_addr src,
+					unsigned int src_cnt,
+					unsigned int src_off,
+					size_t len,
+					u32 *result,
+					unsigned long flags)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	struct iop_adma_desc_slot *sw_desc;
+	dma_cookie_t ret = -ENOMEM;
+	int slot_cnt, slots_per_op;
+
+	if (!chan || !src.dma_list || !result)
+		return -EFAULT;
+
+	if (!len)
+		return iop_chan->common.cookie;
+
+	PRINTK("iop adma%d: %s src_cnt: %d len: %u flags: %lx\n",
+	iop_chan->device->id, __FUNCTION__, src_cnt, len, flags);
+
+	spin_lock_bh(&iop_chan->lock);
+	slot_cnt = iop_chan_zero_sum_slot_count(len, src_cnt, &slots_per_op);
+	sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
+	if (sw_desc) {
+		struct device *dev = &iop_chan->device->pdev->dev;
+		iop_chan->pending += iop_desc_init_zero_sum(sw_desc, src_cnt,
+					slot_cnt, slots_per_op);
+
+		switch (flags & (DMA_SRC_BUF | DMA_SRC_PAGE |
+				DMA_SRC_PAGES | DMA_SRC_DMA |
+				DMA_SRC_DMA_LIST)) {
+		case DMA_SRC_PAGES:
+			while (src_cnt--) {
+				dma_addr_t addr = dma_map_page(dev,
+						src.pgs[src_cnt],
+						src_off, len,
+						DMA_TO_DEVICE);
+				iop_desc_set_zero_sum_src_addr(sw_desc,
+						src_cnt,
+						addr,
+						slot_cnt,
+						slots_per_op);
+			}
+			break;
+		case DMA_SRC_DMA_LIST:
+			while (src_cnt--) {
+				iop_desc_set_zero_sum_src_addr(sw_desc,
+						src_cnt,
+						src.dma_list[src_cnt],
+						slot_cnt,
+						slots_per_op);
+			}
+			break;
+		}
+
+		iop_desc_set_zero_sum_byte_count(sw_desc, len, slots_per_op);
+
+		/* assign a cookie to the first descriptor so
+		 * the buffers are unmapped
+		 */
+		iop_desc_assign_cookie(iop_chan, sw_desc);
+		sw_desc->flags = flags;
+
+		/* assign cookie to the last descriptor in the group
+		 * so the xor_check_result is updated. Also, set the
+		 * xor_check_result ptr of the first and last descriptor
+		 * so the cleanup routine can sum the group of results
+		 */
+		if (slot_cnt > slots_per_op) {
+			struct iop_adma_desc_slot *desc;
+			desc = list_entry(iop_chan->chain.prev,
+				struct iop_adma_desc_slot,
+				chain_node);
+			iop_desc_assign_cookie(iop_chan, desc);
+			sw_desc->xor_check_result = result;
+			desc->xor_check_result = result;
+			ret = desc->cookie;
+		} else {
+			sw_desc->xor_check_result = result;
+			ret = sw_desc->cookie;
+		}
+
+		/* add the group to the chain */
+		iop_chan_chain_desc(iop_chan, sw_desc);
+	}
+	spin_unlock_bh(&iop_chan->lock);
+
+	iop_adma_check_threshold(iop_chan);
+
+	return ret;
+}
+#else
+/* iop32x does not support zero sum in hardware, so we simulate 
+ * it in software.  It only supports a PAGE_SIZE length which is
+ * enough to support md raid.
+ */
+static dma_cookie_t do_iop_adma_zero_sum(struct dma_chan *chan,
+					union dmaengine_addr src,
+					unsigned int src_cnt,
+					unsigned int src_off,
+					size_t len,
+					u32 *result,
+					unsigned long flags)
+{
+	static union dmaengine_addr dest_addr = { .buf = iop32x_zero_result_buffer };
+	static dma_cookie_t last_zero_result_cookie = 0;
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	dma_cookie_t ret;
+
+	if (!chan || !src.dma_list || !result)
+		return -EFAULT;
+
+	if (!len)
+		return iop_chan->common.cookie;
+
+	if (len > sizeof(iop32x_zero_result_buffer)) {
+		printk(KERN_ERR "iop32x performs zero sum with a %zu byte buffer, %zu"
+		  	" bytes is too large\n", sizeof(iop32x_zero_result_buffer),
+		  	len);
+		BUG();
+		return -EFAULT;
+	}
+
+	/* we only have one result buffer; it cannot be shared */
+	if (last_zero_result_cookie) {
+		PRINTK("%s: waiting for last_zero_result_cookie: %d\n",
+			__FUNCTION__, last_zero_result_cookie);
+		dma_sync_wait(chan, last_zero_result_cookie);
+		last_zero_result_cookie = 0;
+	}
+
+	PRINTK("iop adma%d: %s src_cnt: %d len: %u flags: %lx\n",
+	iop_chan->device->id, __FUNCTION__, src_cnt, len, flags);
+
+	flags |= DMA_DEST_BUF;
+	iop32x_zero_sum_output = result;
+
+	ret = do_iop_adma_xor(chan, dest_addr, 0, src, src_cnt, src_off,
+				len, flags);
+
+	if (ret > 0)
+		last_zero_result_cookie = ret;
+
+	return ret;
+}
+#endif
+
+static void iop_adma_free_chan_resources(struct dma_chan *chan)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	struct iop_adma_desc_slot *iter, *_iter;
+	int in_use_descs = 0;
+
+	iop_adma_slot_cleanup(iop_chan);
+
+	spin_lock_bh(&iop_chan->lock);
+	list_for_each_entry_safe(iter, _iter, &iop_chan->chain,
+					chain_node) {
+		in_use_descs++;
+		list_del(&iter->chain_node);
+	}
+	list_for_each_entry_safe_reverse(iter, _iter, &iop_chan->all_slots, slot_node) {
+		list_del(&iter->slot_node);
+		kfree(iter);
+		iop_chan->slots_allocated--;
+	}
+	iop_chan->last_used = NULL;
+
+	PRINTK("iop adma%d %s slots_allocated %d\n", iop_chan->device->id,
+		__FUNCTION__, iop_chan->slots_allocated);
+	spin_unlock_bh(&iop_chan->lock);
+
+	/* one is ok since we left it on there on purpose */
+	if (in_use_descs > 1)
+		printk(KERN_ERR "IOP: Freeing %d in use descriptors!\n",
+			in_use_descs - 1);
+}
+
+/**
+ * iop_adma_is_complete - poll the status of an ADMA transaction
+ * @chan: ADMA channel handle
+ * @cookie: ADMA transaction identifier
+ */
+static enum dma_status iop_adma_is_complete(struct dma_chan *chan,
+                                            dma_cookie_t cookie,
+                                            dma_cookie_t *done,
+                                            dma_cookie_t *used)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	dma_cookie_t last_used;
+	dma_cookie_t last_complete;
+	enum dma_status ret;
+
+	last_used = chan->cookie;
+	last_complete = iop_chan->completed_cookie;
+
+	if (done)
+		*done= last_complete;
+	if (used)
+		*used = last_used;
+
+	ret = dma_async_is_complete(cookie, last_complete, last_used);
+	if (ret == DMA_SUCCESS)
+		return ret;
+
+	iop_adma_slot_cleanup(iop_chan);
+
+	last_used = chan->cookie;
+	last_complete = iop_chan->completed_cookie;
+
+	if (done)
+		*done= last_complete;
+	if (used)
+		*used = last_used;
+
+	return dma_async_is_complete(cookie, last_complete, last_used);
+}
+
+/* to do: can we use these interrupts to implement 'sleep until completed' */
+static irqreturn_t iop_adma_eot_handler(int irq, void *data, struct pt_regs *regs)
+{
+	return IRQ_NONE;
+}
+
+static irqreturn_t iop_adma_eoc_handler(int irq, void *data, struct pt_regs *regs)
+{
+	return IRQ_NONE;
+}
+
+static irqreturn_t iop_adma_err_handler(int irq, void *data, struct pt_regs *regs)
+{
+	return IRQ_NONE;
+}
+
+static void iop_adma_issue_pending(struct dma_chan *chan)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	spin_lock(&iop_chan->lock);
+	if (iop_chan->pending) {
+		iop_chan->pending = 0;
+		iop_chan_append(iop_chan);
+	}
+	spin_unlock(&iop_chan->lock);
+}
+
+/*
+ * Perform a transaction to verify the HW works.
+ */
+#define IOP_ADMA_TEST_SIZE 2000
+
+static int __devinit iop_adma_memcpy_self_test(struct iop_adma_device *device)
+{
+	int i;
+	union dmaengine_addr src;
+	union dmaengine_addr dest;
+	struct dma_chan *dma_chan;
+	dma_cookie_t cookie;
+	int err = 0;
+
+	src.buf = kzalloc(sizeof(u8) * IOP_ADMA_TEST_SIZE, SLAB_KERNEL);
+	if (!src.buf)
+		return -ENOMEM;
+	dest.buf = kzalloc(sizeof(u8) * IOP_ADMA_TEST_SIZE, SLAB_KERNEL);
+	if (!dest.buf) {
+		kfree(src.buf);
+		return -ENOMEM;
+	}
+
+	/* Fill in src buffer */
+	for (i = 0; i < IOP_ADMA_TEST_SIZE; i++)
+		((u8 *) src.buf)[i] = (u8)i;
+
+	memset(dest.buf, 0, IOP_ADMA_TEST_SIZE);
+
+	/* Start copy, using first DMA channel */
+	dma_chan = container_of(device->common.channels.next,
+	                        struct dma_chan,
+	                        device_node);
+	if (iop_adma_alloc_chan_resources(dma_chan) < 1) {
+		err = -ENODEV;
+		goto out;
+	}
+
+	cookie = do_iop_adma_memcpy(dma_chan, dest, 0, src, 0,
+		IOP_ADMA_TEST_SIZE, DMA_SRC_BUF | DMA_DEST_BUF);
+	iop_adma_issue_pending(dma_chan);
+	msleep(1);
+
+	if (iop_adma_is_complete(dma_chan, cookie, NULL, NULL) != DMA_SUCCESS) {
+		printk(KERN_ERR "iop adma%d: Self-test copy timed out, disabling\n",
+			device->id);
+		err = -ENODEV;
+		goto free_resources;
+	}
+
+	consistent_sync(dest.buf, IOP_ADMA_TEST_SIZE, DMA_FROM_DEVICE);
+	if (memcmp(src.buf, dest.buf, IOP_ADMA_TEST_SIZE)) {
+		printk(KERN_ERR "iop adma%d: Self-test copy failed compare, disabling\n",
+			device->id);
+		err = -ENODEV;
+		goto free_resources;
+	}
+
+free_resources:
+	iop_adma_free_chan_resources(dma_chan);
+out:
+	kfree(src.buf);
+	kfree(dest.buf);
+	return err;
+}
+
+#define IOP_ADMA_NUM_SRC_TST 4 /* must be <= 15 */
+static int __devinit iop_adma_xor_zero_sum_self_test(struct iop_adma_device *device)
+{
+	int i, src_idx;
+	struct page *xor_srcs[IOP_ADMA_NUM_SRC_TST];
+	struct page *zero_sum_srcs[IOP_ADMA_NUM_SRC_TST + 1];
+	union dmaengine_addr dest;
+	union dmaengine_addr src;
+	struct dma_chan *dma_chan;
+	dma_cookie_t cookie;
+	u8 cmp_byte = 0;
+	u32 cmp_word;
+	u32 zero_sum_result;
+	int err = 0;
+
+	for (src_idx = 0; src_idx < IOP_ADMA_NUM_SRC_TST; src_idx++) {
+		xor_srcs[src_idx] = alloc_page(GFP_KERNEL);
+		if (!xor_srcs[src_idx]) {
+			while (src_idx--)
+				__free_page(xor_srcs[src_idx]);
+			return -ENOMEM;
+		}
+	}
+
+	dest.pg = alloc_page(GFP_KERNEL);
+	if (!dest.pg) {
+		while (src_idx--)
+			__free_page(xor_srcs[src_idx]);
+		return -ENOMEM;
+	}
+
+	/* Fill in src buffers */
+	for (src_idx = 0; src_idx < IOP_ADMA_NUM_SRC_TST; src_idx++) {
+		u8 *ptr = page_address(xor_srcs[src_idx]);
+		for (i = 0; i < PAGE_SIZE; i++)
+			ptr[i] = (1 << src_idx);
+	}
+
+	for (src_idx = 0; src_idx < IOP_ADMA_NUM_SRC_TST; src_idx++)
+		cmp_byte ^= (u8) (1 << src_idx);
+
+	cmp_word = (cmp_byte << 24) | (cmp_byte << 16) | (cmp_byte << 8) | cmp_byte;
+
+	memset(page_address(dest.pg), 0, PAGE_SIZE);
+
+	dma_chan = container_of(device->common.channels.next,
+	                        struct dma_chan,
+	                        device_node);
+	if (iop_adma_alloc_chan_resources(dma_chan) < 1) {
+		err = -ENODEV;
+		goto out;
+	}
+
+	/* test xor */
+	src.pgs = xor_srcs;
+	cookie = do_iop_adma_xor(dma_chan, dest, 0, src,
+		IOP_ADMA_NUM_SRC_TST, 0, PAGE_SIZE, DMA_DEST_PAGE | DMA_SRC_PAGES);
+	iop_adma_issue_pending(dma_chan);
+	msleep(8);
+
+	if (iop_adma_is_complete(dma_chan, cookie, NULL, NULL) != DMA_SUCCESS) {
+		printk(KERN_ERR "iop_adma: Self-test xor timed out, disabling\n");
+		err = -ENODEV;
+		goto free_resources;
+	}
+
+	consistent_sync(page_address(dest.pg), PAGE_SIZE, DMA_FROM_DEVICE);
+	for (i = 0; i < (PAGE_SIZE / sizeof(u32)); i++) {
+		u32 *ptr = page_address(dest.pg);
+		if (ptr[i] != cmp_word) {
+			printk(KERN_ERR "iop_adma: Self-test xor failed compare, disabling\n");
+			err = -ENODEV;
+			goto free_resources;
+		}
+	}
+
+	/* zero sum the sources with the destination page */
+	for (i = 0; i < IOP_ADMA_NUM_SRC_TST; i++)
+		zero_sum_srcs[i] = xor_srcs[i];
+	zero_sum_srcs[i] = dest.pg;
+	src.pgs = zero_sum_srcs;
+
+	zero_sum_result = 1;
+	cookie = do_iop_adma_zero_sum(dma_chan, src, IOP_ADMA_NUM_SRC_TST + 1,
+				0, PAGE_SIZE, &zero_sum_result, DMA_SRC_PAGES);
+	iop_adma_issue_pending(dma_chan);
+	msleep(8);
+
+	if (iop_adma_is_complete(dma_chan, cookie, NULL, NULL) != DMA_SUCCESS) {
+		printk(KERN_ERR "iop_adma: Self-test zero sum timed out, disabling\n");
+		err = -ENODEV;
+		goto free_resources;
+	}
+
+	if (zero_sum_result != 0) {
+		printk(KERN_ERR "iop_adma: Self-test zero sum failed compare, disabling\n");
+		err = -ENODEV;
+		goto free_resources;
+	}
+
+	/* test memset */
+	cookie = do_iop_adma_memset(dma_chan, dest, 0, 0, PAGE_SIZE, DMA_DEST_PAGE);
+	iop_adma_issue_pending(dma_chan);
+	msleep(8);
+
+	if (iop_adma_is_complete(dma_chan, cookie, NULL, NULL) != DMA_SUCCESS) {
+		printk(KERN_ERR "iop_adma: Self-test memset timed out, disabling\n");
+		err = -ENODEV;
+		goto free_resources;
+	}
+
+	consistent_sync(page_address(dest.pg), PAGE_SIZE, DMA_FROM_DEVICE);
+	for (i = 0; i < PAGE_SIZE/sizeof(u32); i++) {
+		u32 *ptr = page_address(dest.pg);
+		if (ptr[i]) {
+			printk(KERN_ERR "iop_adma: Self-test memset failed compare, disabling\n");
+			err = -ENODEV;
+			goto free_resources;
+		}
+	}
+
+	/* test for non-zero parity sum */
+	zero_sum_result = 0;
+	cookie = do_iop_adma_zero_sum(dma_chan, src, IOP_ADMA_NUM_SRC_TST + 1,
+				0, PAGE_SIZE, &zero_sum_result, DMA_SRC_PAGES);
+	iop_adma_issue_pending(dma_chan);
+	msleep(8);
+
+	if (iop_adma_is_complete(dma_chan, cookie, NULL, NULL) != DMA_SUCCESS) {
+		printk(KERN_ERR "iop_adma: Self-test non-zero sum timed out, disabling\n");
+		err = -ENODEV;
+		goto free_resources;
+	}
+
+	if (zero_sum_result != 1) {
+		printk(KERN_ERR "iop_adma: Self-test non-zero sum failed compare, disabling\n");
+		err = -ENODEV;
+		goto free_resources;
+	}
+
+free_resources:
+	iop_adma_free_chan_resources(dma_chan);
+out:
+	src_idx = IOP_ADMA_NUM_SRC_TST;
+	while (src_idx--)
+		__free_page(xor_srcs[src_idx]);
+	__free_page(dest.pg);
+	return err;
+}
+
+static int __devexit iop_adma_remove(struct platform_device *dev)
+{
+	struct iop_adma_device *device = platform_get_drvdata(dev);
+	struct dma_chan *chan, *_chan;
+	struct iop_adma_chan *iop_chan;
+	int i;
+	struct iop_adma_platform_data *plat_data = dev->dev.platform_data;
+
+
+	dma_async_device_unregister(&device->common);
+
+	for (i = 0; i < 3; i++) {
+		unsigned int irq;
+		irq = platform_get_irq(dev, i);
+		free_irq(irq, device);
+	}
+
+	dma_free_coherent(&dev->dev, plat_data->pool_size,
+			device->dma_desc_pool_virt, device->dma_desc_pool);
+
+	do {
+		struct resource *res;
+		res = platform_get_resource(dev, IORESOURCE_MEM, 0);
+		release_mem_region(res->start, res->end - res->start);
+	} while (0);
+
+	list_for_each_entry_safe(chan, _chan, &device->common.channels,
+				device_node) {
+		iop_chan = to_iop_adma_chan(chan);
+		list_del(&chan->device_node);
+		kfree(iop_chan);
+	}
+	kfree(device);
+
+	return 0;
+}
+
+static dma_addr_t iop_adma_map_page(struct dma_chan *chan, struct page *page,
+					unsigned long offset, size_t size,
+					int direction)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	return dma_map_page(&iop_chan->device->pdev->dev, page, offset, size,
+			direction);
+}
+
+static dma_addr_t iop_adma_map_single(struct dma_chan *chan, void *cpu_addr,
+					size_t size, int direction)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	return dma_map_single(&iop_chan->device->pdev->dev, cpu_addr, size,
+			direction);
+}
+
+static void iop_adma_unmap_page(struct dma_chan *chan, dma_addr_t handle,
+				size_t size, int direction)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	dma_unmap_page(&iop_chan->device->pdev->dev, handle, size, direction);
+}
+
+static void iop_adma_unmap_single(struct dma_chan *chan, dma_addr_t handle,
+				size_t size, int direction)
+{
+	struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+	dma_unmap_single(&iop_chan->device->pdev->dev, handle, size, direction);
+}
+
+extern dma_cookie_t dma_async_do_memcpy_err(struct dma_chan *chan,
+		union dmaengine_addr dest, unsigned int dest_off,
+		union dmaengine_addr src, unsigned int src_off,
+                size_t len, unsigned long flags);
+
+extern dma_cookie_t dma_async_do_xor_err(struct dma_chan *chan,
+		union dmaengine_addr dest, unsigned int dest_off,
+		union dmaengine_addr src, unsigned int src_cnt,
+		unsigned int src_off, size_t len, unsigned long flags);
+
+extern dma_cookie_t dma_async_do_zero_sum_err(struct dma_chan *chan,
+		union dmaengine_addr src, unsigned int src_cnt,
+		unsigned int src_off, size_t len, u32 *result,
+		unsigned long flags);
+
+extern dma_cookie_t dma_async_do_memset_err(struct dma_chan *chan,
+                union dmaengine_addr dest, unsigned int dest_off,
+                int val, size_t len, unsigned long flags);
+
+static int __devinit iop_adma_probe(struct platform_device *pdev)
+{
+	struct resource *res;
+	int ret=0, irq_eot=0, irq_eoc=0, irq_err=0, irq, i;
+	struct iop_adma_device *adev;
+	struct iop_adma_chan *iop_chan;
+	struct iop_adma_platform_data *plat_data = pdev->dev.platform_data;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENODEV;
+
+	if (!request_mem_region(res->start, res->end - res->start, pdev->name))
+		return -EBUSY;
+
+	if ((adev = kzalloc(sizeof(*adev), GFP_KERNEL)) == NULL) {
+		ret = -ENOMEM;
+		goto err_adev_alloc;
+	}
+
+	if ((adev->dma_desc_pool_virt = dma_alloc_writecombine(&pdev->dev,
+					plat_data->pool_size,
+					&adev->dma_desc_pool,
+					GFP_KERNEL)) == NULL) {
+		ret = -ENOMEM;
+		goto err_dma_alloc;
+	}
+
+	PRINTK("%s: allocated descriptor pool virt %p phys %p\n",
+	__FUNCTION__, adev->dma_desc_pool_virt, (void *) adev->dma_desc_pool);
+
+	adev->id = plat_data->hw_id;
+	adev->common.capabilities = plat_data->capabilities;
+
+	for (i = 0; i < 3; i++) {
+		irq = platform_get_irq(pdev, i);
+		if (irq < 0)
+			ret = -ENXIO;
+		else {
+			switch (i) {
+			case 0:
+				irq_eot = irq;
+				ret = request_irq(irq, iop_adma_eot_handler,
+					 0, pdev->name, adev);
+				if (ret) {
+					ret = -EIO;
+					goto err_irq0;
+				}
+				break;
+			case 1:
+				irq_eoc = irq;
+				ret = request_irq(irq, iop_adma_eoc_handler,
+					0, pdev->name, adev);
+				if (ret) {
+					ret = -EIO;
+					goto err_irq1;
+				}
+				break;
+			case 2:
+				irq_err = irq;
+				ret = request_irq(irq, iop_adma_err_handler,
+					0, pdev->name, adev);
+				if (ret) {
+					ret = -EIO;
+					goto err_irq2;
+				}
+				break;
+			}
+		}
+	}
+
+	adev->pdev = pdev;
+	platform_set_drvdata(pdev, adev);
+
+	INIT_LIST_HEAD(&adev->common.channels);
+	adev->common.device_alloc_chan_resources = iop_adma_alloc_chan_resources;
+	adev->common.device_free_chan_resources = iop_adma_free_chan_resources;
+	adev->common.device_operation_complete = iop_adma_is_complete;
+	adev->common.device_issue_pending = iop_adma_issue_pending;
+	adev->common.map_page = iop_adma_map_page;
+	adev->common.map_single = iop_adma_map_single;
+	adev->common.unmap_page = iop_adma_unmap_page;
+	adev->common.unmap_single = iop_adma_unmap_single;
+
+	if (adev->common.capabilities & DMA_MEMCPY)
+		adev->common.device_do_dma_memcpy = do_iop_adma_memcpy;
+	else
+		adev->common.device_do_dma_memcpy = dma_async_do_memcpy_err;
+
+	if (adev->common.capabilities & DMA_MEMSET)
+		adev->common.device_do_dma_memset = do_iop_adma_memset;
+	else
+		adev->common.device_do_dma_memset = dma_async_do_memset_err;
+
+	if (adev->common.capabilities & DMA_XOR)
+		adev->common.device_do_dma_xor = do_iop_adma_xor;
+	else
+		adev->common.device_do_dma_xor = dma_async_do_xor_err;
+
+	if (adev->common.capabilities & DMA_ZERO_SUM)
+		adev->common.device_do_dma_zero_sum = do_iop_adma_zero_sum;
+	else
+		adev->common.device_do_dma_zero_sum = dma_async_do_zero_sum_err;
+
+	if ((iop_chan = kzalloc(sizeof(*iop_chan), GFP_KERNEL)) == NULL) {
+		ret = -ENOMEM;
+		goto err_chan_alloc;
+	}
+
+	spin_lock_init(&iop_chan->lock);
+	iop_chan->device = adev;
+	INIT_LIST_HEAD(&iop_chan->chain);
+	INIT_LIST_HEAD(&iop_chan->all_slots);
+	iop_chan->last_used = NULL;
+	dma_async_chan_init(&iop_chan->common, &adev->common);
+
+	if (adev->common.capabilities & DMA_MEMCPY) {
+		ret = iop_adma_memcpy_self_test(adev);
+		PRINTK("iop adma%d: memcpy self test returned %d\n", adev->id, ret);
+		if (ret)
+			goto err_self_test;
+	}
+
+	if (adev->common.capabilities & (DMA_XOR | DMA_ZERO_SUM | DMA_MEMSET)) {
+		ret = iop_adma_xor_zero_sum_self_test(adev);
+		PRINTK("iop adma%d: xor self test returned %d\n", adev->id, ret);
+		if (ret)
+			goto err_self_test;
+	}
+
+	printk(KERN_INFO "Intel(R) IOP ADMA Engine found [%d]: "
+		"( %s%s%s%s%s%s%s%s%s)\n",
+		adev->id,
+		adev->common.capabilities & DMA_PQ_XOR	      ? "pq_xor " : "",
+		adev->common.capabilities & DMA_PQ_UPDATE     ? "pq_update " : "",
+		adev->common.capabilities & DMA_PQ_ZERO_SUM   ? "pq_zero_sum " : "",
+		adev->common.capabilities & DMA_XOR	      ? "xor " : "",
+		adev->common.capabilities & DMA_DUAL_XOR      ? "dual_xor " : "",
+		adev->common.capabilities & DMA_ZERO_SUM      ? "xor_zero_sum " : "",
+		adev->common.capabilities & DMA_MEMSET	      ? "memset " : "",
+		adev->common.capabilities & DMA_MEMCPY_CRC32C ? "memcpy+crc " : "",
+		adev->common.capabilities & DMA_MEMCPY	      ? "memcpy " : "");
+
+	dma_async_device_register(&adev->common);
+	goto out;
+
+err_self_test:
+	kfree(iop_chan);
+err_chan_alloc:
+err_irq2:
+	free_irq(irq_eoc, adev);
+err_irq1:
+	free_irq(irq_eot, adev);
+err_irq0:
+	dma_free_coherent(&adev->pdev->dev, plat_data->pool_size,
+			adev->dma_desc_pool_virt, adev->dma_desc_pool);
+err_dma_alloc:
+	kfree(adev);
+err_adev_alloc:
+	release_mem_region(res->start, res->end - res->start);
+out:
+	return ret;
+}
+
+static void iop_chan_start_null_memcpy(struct iop_adma_chan *iop_chan)
+{
+	struct iop_adma_desc_slot *sw_desc;
+	dma_cookie_t cookie;
+	int slot_cnt, slots_per_op;
+
+	spin_lock_bh(&iop_chan->lock);
+	slot_cnt = iop_chan_memcpy_slot_count(0, &slots_per_op);
+	sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
+	if (sw_desc) {
+		iop_desc_init_memcpy(sw_desc);
+		iop_desc_set_byte_count(sw_desc, iop_chan, 0);
+		iop_desc_set_dest_addr(sw_desc, iop_chan, 0);
+		iop_desc_set_memcpy_src_addr(sw_desc, 0, slot_cnt, slots_per_op);
+
+		cookie = iop_chan->common.cookie;
+		cookie++;
+		if (cookie <= 1)
+			cookie = 2;
+
+		/* initialize the completed cookie to be less than
+		 * the most recently used cookie
+		 */
+		iop_chan->completed_cookie = cookie - 1;
+		iop_chan->common.cookie = sw_desc->cookie = cookie;
+
+		/* channel should not be busy */
+		BUG_ON(iop_chan_is_busy(iop_chan));
+
+		/* clear any prior error-status bits */
+		iop_chan_clear_status(iop_chan);
+
+		/* disable operation */
+		iop_chan_disable(iop_chan);
+
+		/* set the descriptor address */
+		iop_chan_set_next_descriptor(iop_chan, sw_desc->phys);
+
+		/* run the descriptor */
+		iop_chan_enable(iop_chan);
+	} else
+		printk(KERN_ERR "iop adma%d failed to allocate null descriptor\n",
+			iop_chan->device->id);
+	spin_unlock_bh(&iop_chan->lock);
+}
+
+static void iop_chan_start_null_xor(struct iop_adma_chan *iop_chan)
+{
+	struct iop_adma_desc_slot *sw_desc;
+	dma_cookie_t cookie;
+	int slot_cnt, slots_per_op;
+
+	spin_lock_bh(&iop_chan->lock);
+	slot_cnt = iop_chan_xor_slot_count(0, 2, &slots_per_op);
+	sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
+	if (sw_desc) {
+		iop_desc_init_null_xor(sw_desc, 2);
+		iop_desc_set_byte_count(sw_desc, iop_chan, 0);
+		iop_desc_set_dest_addr(sw_desc, iop_chan, 0);
+		iop_desc_set_xor_src_addr(sw_desc, 0, 0, slot_cnt, slots_per_op);
+		iop_desc_set_xor_src_addr(sw_desc, 1, 0, slot_cnt, slots_per_op);
+
+		cookie = iop_chan->common.cookie;
+		cookie++;
+		if (cookie <= 1)
+			cookie = 2;
+
+		/* initialize the completed cookie to be less than
+		 * the most recently used cookie
+		 */
+		iop_chan->completed_cookie = cookie - 1;
+		iop_chan->common.cookie = sw_desc->cookie = cookie;
+
+		/* channel should not be busy */
+		BUG_ON(iop_chan_is_busy(iop_chan));
+
+		/* clear any prior error-status bits */
+		iop_chan_clear_status(iop_chan);
+
+		/* disable operation */
+		iop_chan_disable(iop_chan);
+
+		/* set the descriptor address */
+		iop_chan_set_next_descriptor(iop_chan, sw_desc->phys);
+
+		/* run the descriptor */
+		iop_chan_enable(iop_chan);
+	} else
+		printk(KERN_ERR "iop adma%d failed to allocate null descriptor\n",
+			iop_chan->device->id);
+	spin_unlock_bh(&iop_chan->lock);
+}
+
+static struct platform_driver iop_adma_driver = {
+	.probe		= iop_adma_probe,
+	.remove		= iop_adma_remove,
+	.driver		= {
+		.owner	= THIS_MODULE,
+		.name	= "IOP-ADMA",
+	},
+};
+
+static int __init iop_adma_init (void)
+{
+	return platform_driver_register(&iop_adma_driver);
+}
+
+static void __exit iop_adma_exit (void)
+{
+	platform_driver_unregister(&iop_adma_driver);
+	return;
+}
+
+void __arch_raid5_dma_chan_request(struct dma_client *client)
+{
+	iop_raid5_dma_chan_request(client);
+}
+
+struct dma_chan *__arch_raid5_dma_next_channel(struct dma_client *client)
+{
+	return iop_raid5_dma_next_channel(client);
+}
+
+struct dma_chan *__arch_raid5_dma_check_channel(struct dma_chan *chan,
+						dma_cookie_t cookie,
+						struct dma_client *client,
+						unsigned long capabilities)
+{
+	return iop_raid5_dma_check_channel(chan, cookie, client, capabilities);
+}
+
+EXPORT_SYMBOL_GPL(__arch_raid5_dma_chan_request);
+EXPORT_SYMBOL_GPL(__arch_raid5_dma_next_channel);
+EXPORT_SYMBOL_GPL(__arch_raid5_dma_check_channel);
+
+module_init(iop_adma_init);
+module_exit(iop_adma_exit);
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_DESCRIPTION("IOP ADMA Engine Driver");
+MODULE_LICENSE("GPL");
diff --git a/include/asm-arm/hardware/iop_adma.h b/include/asm-arm/hardware/iop_adma.h
new file mode 100644
index 0000000..62bbbdf
--- /dev/null
+++ b/include/asm-arm/hardware/iop_adma.h
@@ -0,0 +1,98 @@
+/*
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#ifndef IOP_ADMA_H
+#define IOP_ADMA_H
+#include <linux/types.h>
+#include <linux/dmaengine.h>
+
+#define IOP_ADMA_SLOT_SIZE 32
+#define IOP_ADMA_THRESHOLD 20
+
+/**
+ * struct iop_adma_device - internal representation of an ADMA device
+ * @pdev: Platform device
+ * @id: HW ADMA Device selector
+ * @dma_desc_pool: base of DMA descriptor region (DMA address)
+ * @dma_desc_pool_virt: base of DMA descriptor region (CPU address)
+ * @common: embedded struct dma_device
+ */
+struct iop_adma_device {
+	struct platform_device *pdev;
+	int id;
+	dma_addr_t dma_desc_pool;
+	void *dma_desc_pool_virt;
+	struct dma_device common;
+};
+
+/**
+ * struct iop_adma_chan - internal representation of an ADMA channel
+ * @lock: serializes enqueue/dequeue operations to the slot pool
+ * @device: parent device
+ * @chain: device chain view of the descriptors
+ * @common: common dmaengine channel object members
+ * @all_slots: complete domain of slots usable by the channel
+ * @pending: allows batching of hardware operations
+ * @result_accumulator: allows zero result sums of buffers > the hw maximum
+ * @zero_sum_group: flag to the clean up routine to collect zero sum results
+ * @completed_cookie: identifier for the most recently completed operation
+ * @slots_allocated: records the actual size of the descriptor slot pool
+ */
+struct iop_adma_chan {
+	spinlock_t lock;
+	struct iop_adma_device *device;
+	struct list_head chain;
+	struct dma_chan common;
+	struct list_head all_slots;
+	struct iop_adma_desc_slot *last_used;
+	int pending;
+	u8 result_accumulator;
+	u8 zero_sum_group;
+	dma_cookie_t completed_cookie;
+	int slots_allocated;
+};
+
+struct iop_adma_desc_slot {
+	void *hw_desc;
+	struct list_head slot_node;
+	struct list_head chain_node;
+	dma_cookie_t cookie;
+	dma_addr_t phys;
+	u16 stride;
+	u16 idx;
+	u16 slot_cnt;
+	u8 src_cnt;
+	u8 slots_per_op;
+	unsigned long flags;
+	union {
+		u32 *xor_check_result;
+		u32 *crc32_result;
+	};
+};
+
+struct iop_adma_platform_data {
+	int hw_id;
+	unsigned long capabilities;
+	size_t pool_size;
+};
+
+#define to_iop_sw_desc(addr_hw_desc) container_of(addr_hw_desc, struct iop_adma_desc_slot, hw_desc)
+#define iop_hw_desc_slot_idx(hw_desc, idx) ( (void *) (((unsigned long) hw_desc) + ((idx) << 5)) )
+#endif


* [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (15 preceding siblings ...)
  2006-09-11 23:19 ` [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines Dan Williams
@ 2006-09-11 23:19 ` Dan Williams
  2006-09-11 23:55   ` Jeff Garzik
  2006-09-11 23:19 ` [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization Dan Williams
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:19 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

In addition to defining IOP3XX_REG_ADDR[32|16|8] and cleaning up the
DMA/AAU definitions, this brings the iop3xx register definitions in line
with the format used by iop13xx.
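
For reference, a minimal sketch (not part of the patch) of what a call
site looks like once the size-typed accessors exist; IOP3XX_ATUCMD and
PCI_COMMAND_MEMORY are real definitions, the example functions are
hypothetical:

#include <linux/pci.h>
#include <asm/hardware/iop3xx.h>

/* The typed macros already expand to a volatile pointer of the correct
 * width, so a caller dereferences the register name directly instead of
 * open-coding a (volatile u16 *) cast at every use.
 */
static u16 example_read_atu_command(void)
{
	return *IOP3XX_ATUCMD;			/* 16-bit read */
}

static void example_enable_atu_memory_space(void)
{
	*IOP3XX_ATUCMD = example_read_atu_command() | PCI_COMMAND_MEMORY;
}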

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 include/asm-arm/arch-iop32x/entry-macro.S |    2 
 include/asm-arm/arch-iop32x/iop32x.h      |   14 +
 include/asm-arm/arch-iop33x/entry-macro.S |    2 
 include/asm-arm/arch-iop33x/iop33x.h      |   38 ++-
 include/asm-arm/hardware/iop3xx.h         |  347 +++++++++++++----------------
 5 files changed, 188 insertions(+), 215 deletions(-)

diff --git a/include/asm-arm/arch-iop32x/entry-macro.S b/include/asm-arm/arch-iop32x/entry-macro.S
index 1500cbb..f357be4 100644
--- a/include/asm-arm/arch-iop32x/entry-macro.S
+++ b/include/asm-arm/arch-iop32x/entry-macro.S
@@ -13,7 +13,7 @@ #include <asm/arch/iop32x.h>
 		.endm
 
 		.macro	get_irqnr_and_base, irqnr, irqstat, base, tmp
-		ldr	\base, =IOP3XX_REG_ADDR(0x07D8)
+		ldr	\base, =0xfeffe7d8
 		ldr	\irqstat, [\base]		@ Read IINTSRC
 		cmp	\irqstat, #0
 		clzne	\irqnr, \irqstat
diff --git a/include/asm-arm/arch-iop32x/iop32x.h b/include/asm-arm/arch-iop32x/iop32x.h
index 15b4d6a..904a14d 100644
--- a/include/asm-arm/arch-iop32x/iop32x.h
+++ b/include/asm-arm/arch-iop32x/iop32x.h
@@ -19,16 +19,18 @@ #define __IOP32X_H
  * Peripherals that are shared between the iop32x and iop33x but
  * located at different addresses.
  */
-#define IOP3XX_GPIO_REG(reg)	(IOP3XX_PERIPHERAL_VIRT_BASE + 0x07c0 + (reg))
-#define IOP3XX_TIMER_REG(reg)	(IOP3XX_PERIPHERAL_VIRT_BASE + 0x07e0 + (reg))
+#define IOP3XX_GPIO_REG32(reg)	 (volatile u32 *)(IOP3XX_PERIPHERAL_VIRT_BASE +\
+						  0x07c0 + (reg))
+#define IOP3XX_TIMER_REG32(reg) (volatile u32 *)(IOP3XX_PERIPHERAL_VIRT_BASE +\
+						  0x07e0 + (reg))
 
 #include <asm/hardware/iop3xx.h>
 
 /* Interrupt Controller  */
-#define IOP32X_INTCTL		(volatile u32 *)IOP3XX_REG_ADDR(0x07d0)
-#define IOP32X_INTSTR		(volatile u32 *)IOP3XX_REG_ADDR(0x07d4)
-#define IOP32X_IINTSRC		(volatile u32 *)IOP3XX_REG_ADDR(0x07d8)
-#define IOP32X_FINTSRC		(volatile u32 *)IOP3XX_REG_ADDR(0x07dc)
+#define IOP32X_INTCTL		IOP3XX_REG_ADDR32(0x07d0)
+#define IOP32X_INTSTR		IOP3XX_REG_ADDR32(0x07d4)
+#define IOP32X_IINTSRC		IOP3XX_REG_ADDR32(0x07d8)
+#define IOP32X_FINTSRC		IOP3XX_REG_ADDR32(0x07dc)
 
 
 #endif
diff --git a/include/asm-arm/arch-iop33x/entry-macro.S b/include/asm-arm/arch-iop33x/entry-macro.S
index 92b7917..eb207d2 100644
--- a/include/asm-arm/arch-iop33x/entry-macro.S
+++ b/include/asm-arm/arch-iop33x/entry-macro.S
@@ -13,7 +13,7 @@ #include <asm/arch/iop33x.h>
 		.endm
 
 		.macro	get_irqnr_and_base, irqnr, irqstat, base, tmp
-		ldr	\base, =IOP3XX_REG_ADDR(0x07C8)
+		ldr	\base, =0xfeffe7c8
 		ldr	\irqstat, [\base]		@ Read IINTVEC
 		cmp	\irqstat, #0
 		ldreq	\irqstat, [\base]		@ erratum 63 workaround
diff --git a/include/asm-arm/arch-iop33x/iop33x.h b/include/asm-arm/arch-iop33x/iop33x.h
index 9b38fde..c171383 100644
--- a/include/asm-arm/arch-iop33x/iop33x.h
+++ b/include/asm-arm/arch-iop33x/iop33x.h
@@ -18,28 +18,30 @@ #define __IOP33X_H
  * Peripherals that are shared between the iop32x and iop33x but
  * located at different addresses.
  */
-#define IOP3XX_GPIO_REG(reg)	(IOP3XX_PERIPHERAL_VIRT_BASE + 0x1780 + (reg))
-#define IOP3XX_TIMER_REG(reg)	(IOP3XX_PERIPHERAL_VIRT_BASE + 0x07d0 + (reg))
+#define IOP3XX_GPIO_REG32(reg)	 (volatile u32 *)(IOP3XX_PERIPHERAL_VIRT_BASE +\
+						  0x1780 + (reg))
+#define IOP3XX_TIMER_REG32(reg) (volatile u32 *)(IOP3XX_PERIPHERAL_VIRT_BASE +\
+						  0x07d0 + (reg))
 
 #include <asm/hardware/iop3xx.h>
 
 /* Interrupt Controller  */
-#define IOP33X_INTCTL0		(volatile u32 *)IOP3XX_REG_ADDR(0x0790)
-#define IOP33X_INTCTL1		(volatile u32 *)IOP3XX_REG_ADDR(0x0794)
-#define IOP33X_INTSTR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0798)
-#define IOP33X_INTSTR1		(volatile u32 *)IOP3XX_REG_ADDR(0x079c)
-#define IOP33X_IINTSRC0		(volatile u32 *)IOP3XX_REG_ADDR(0x07a0)
-#define IOP33X_IINTSRC1		(volatile u32 *)IOP3XX_REG_ADDR(0x07a4)
-#define IOP33X_FINTSRC0		(volatile u32 *)IOP3XX_REG_ADDR(0x07a8)
-#define IOP33X_FINTSRC1		(volatile u32 *)IOP3XX_REG_ADDR(0x07ac)
-#define IOP33X_IPR0		(volatile u32 *)IOP3XX_REG_ADDR(0x07b0)
-#define IOP33X_IPR1		(volatile u32 *)IOP3XX_REG_ADDR(0x07b4)
-#define IOP33X_IPR2		(volatile u32 *)IOP3XX_REG_ADDR(0x07b8)
-#define IOP33X_IPR3		(volatile u32 *)IOP3XX_REG_ADDR(0x07bc)
-#define IOP33X_INTBASE		(volatile u32 *)IOP3XX_REG_ADDR(0x07c0)
-#define IOP33X_INTSIZE		(volatile u32 *)IOP3XX_REG_ADDR(0x07c4)
-#define IOP33X_IINTVEC		(volatile u32 *)IOP3XX_REG_ADDR(0x07c8)
-#define IOP33X_FINTVEC		(volatile u32 *)IOP3XX_REG_ADDR(0x07cc)
+#define IOP33X_INTCTL0		IOP3XX_REG_ADDR32(0x0790)
+#define IOP33X_INTCTL1		IOP3XX_REG_ADDR32(0x0794)
+#define IOP33X_INTSTR0		IOP3XX_REG_ADDR32(0x0798)
+#define IOP33X_INTSTR1		IOP3XX_REG_ADDR32(0x079c)
+#define IOP33X_IINTSRC0		IOP3XX_REG_ADDR32(0x07a0)
+#define IOP33X_IINTSRC1		IOP3XX_REG_ADDR32(0x07a4)
+#define IOP33X_FINTSRC0		IOP3XX_REG_ADDR32(0x07a8)
+#define IOP33X_FINTSRC1		IOP3XX_REG_ADDR32(0x07ac)
+#define IOP33X_IPR0		IOP3XX_REG_ADDR32(0x07b0)
+#define IOP33X_IPR1		IOP3XX_REG_ADDR32(0x07b4)
+#define IOP33X_IPR2		IOP3XX_REG_ADDR32(0x07b8)
+#define IOP33X_IPR3		IOP3XX_REG_ADDR32(0x07bc)
+#define IOP33X_INTBASE		IOP3XX_REG_ADDR32(0x07c0)
+#define IOP33X_INTSIZE		IOP3XX_REG_ADDR32(0x07c4)
+#define IOP33X_IINTVEC		IOP3XX_REG_ADDR32(0x07c8)
+#define IOP33X_FINTVEC		IOP3XX_REG_ADDR32(0x07cc)
 
 /* UARTs  */
 #define IOP33X_UART0_PHYS	(IOP3XX_PERIPHERAL_PHYS_BASE + 0x1700)
diff --git a/include/asm-arm/hardware/iop3xx.h b/include/asm-arm/hardware/iop3xx.h
index b5c12ef..295789a 100644
--- a/include/asm-arm/hardware/iop3xx.h
+++ b/include/asm-arm/hardware/iop3xx.h
@@ -34,153 +34,166 @@ #endif
 /*
  * IOP3XX processor registers
  */
-#define IOP3XX_PERIPHERAL_PHYS_BASE	0xffffe000
-#define IOP3XX_PERIPHERAL_VIRT_BASE	0xfeffe000
-#define IOP3XX_PERIPHERAL_SIZE		0x00002000
-#define IOP3XX_REG_ADDR(reg)		(IOP3XX_PERIPHERAL_VIRT_BASE + (reg))
+#define IOP3XX_PERIPHERAL_PHYS_BASE 0xffffe000
+#define IOP3XX_PERIPHERAL_VIRT_BASE 0xfeffe000
+#define IOP3XX_PERIPHERAL_SIZE	     0x00002000
+#define IOP3XX_REG_ADDR32(reg)	     (volatile u32 *)(IOP3XX_PERIPHERAL_VIRT_BASE + (reg))
+#define IOP3XX_REG_ADDR16(reg)	     (volatile u16 *)(IOP3XX_PERIPHERAL_VIRT_BASE + (reg))
+#define IOP3XX_REG_ADDR8(reg)	     (volatile u8 *)(IOP3XX_PERIPHERAL_VIRT_BASE + (reg))
 
 /* Address Translation Unit  */
-#define IOP3XX_ATUVID		(volatile u16 *)IOP3XX_REG_ADDR(0x0100)
-#define IOP3XX_ATUDID		(volatile u16 *)IOP3XX_REG_ADDR(0x0102)
-#define IOP3XX_ATUCMD		(volatile u16 *)IOP3XX_REG_ADDR(0x0104)
-#define IOP3XX_ATUSR		(volatile u16 *)IOP3XX_REG_ADDR(0x0106)
-#define IOP3XX_ATURID		(volatile u8  *)IOP3XX_REG_ADDR(0x0108)
-#define IOP3XX_ATUCCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0109)
-#define IOP3XX_ATUCLSR		(volatile u8  *)IOP3XX_REG_ADDR(0x010c)
-#define IOP3XX_ATULT		(volatile u8  *)IOP3XX_REG_ADDR(0x010d)
-#define IOP3XX_ATUHTR		(volatile u8  *)IOP3XX_REG_ADDR(0x010e)
-#define IOP3XX_ATUBIST		(volatile u8  *)IOP3XX_REG_ADDR(0x010f)
-#define IOP3XX_IABAR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0110)
-#define IOP3XX_IAUBAR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0114)
-#define IOP3XX_IABAR1		(volatile u32 *)IOP3XX_REG_ADDR(0x0118)
-#define IOP3XX_IAUBAR1		(volatile u32 *)IOP3XX_REG_ADDR(0x011c)
-#define IOP3XX_IABAR2		(volatile u32 *)IOP3XX_REG_ADDR(0x0120)
-#define IOP3XX_IAUBAR2		(volatile u32 *)IOP3XX_REG_ADDR(0x0124)
-#define IOP3XX_ASVIR		(volatile u16 *)IOP3XX_REG_ADDR(0x012c)
-#define IOP3XX_ASIR		(volatile u16 *)IOP3XX_REG_ADDR(0x012e)
-#define IOP3XX_ERBAR		(volatile u32 *)IOP3XX_REG_ADDR(0x0130)
-#define IOP3XX_ATUILR		(volatile u8  *)IOP3XX_REG_ADDR(0x013c)
-#define IOP3XX_ATUIPR		(volatile u8  *)IOP3XX_REG_ADDR(0x013d)
-#define IOP3XX_ATUMGNT		(volatile u8  *)IOP3XX_REG_ADDR(0x013e)
-#define IOP3XX_ATUMLAT		(volatile u8  *)IOP3XX_REG_ADDR(0x013f)
-#define IOP3XX_IALR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0140)
-#define IOP3XX_IATVR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0144)
-#define IOP3XX_ERLR		(volatile u32 *)IOP3XX_REG_ADDR(0x0148)
-#define IOP3XX_ERTVR		(volatile u32 *)IOP3XX_REG_ADDR(0x014c)
-#define IOP3XX_IALR1		(volatile u32 *)IOP3XX_REG_ADDR(0x0150)
-#define IOP3XX_IALR2		(volatile u32 *)IOP3XX_REG_ADDR(0x0154)
-#define IOP3XX_IATVR2		(volatile u32 *)IOP3XX_REG_ADDR(0x0158)
-#define IOP3XX_OIOWTVR		(volatile u32 *)IOP3XX_REG_ADDR(0x015c)
-#define IOP3XX_OMWTVR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0160)
-#define IOP3XX_OUMWTVR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0164)
-#define IOP3XX_OMWTVR1		(volatile u32 *)IOP3XX_REG_ADDR(0x0168)
-#define IOP3XX_OUMWTVR1		(volatile u32 *)IOP3XX_REG_ADDR(0x016c)
-#define IOP3XX_OUDWTVR		(volatile u32 *)IOP3XX_REG_ADDR(0x0178)
-#define IOP3XX_ATUCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0180)
-#define IOP3XX_PCSR		(volatile u32 *)IOP3XX_REG_ADDR(0x0184)
-#define IOP3XX_ATUISR		(volatile u32 *)IOP3XX_REG_ADDR(0x0188)
-#define IOP3XX_ATUIMR		(volatile u32 *)IOP3XX_REG_ADDR(0x018c)
-#define IOP3XX_IABAR3		(volatile u32 *)IOP3XX_REG_ADDR(0x0190)
-#define IOP3XX_IAUBAR3		(volatile u32 *)IOP3XX_REG_ADDR(0x0194)
-#define IOP3XX_IALR3		(volatile u32 *)IOP3XX_REG_ADDR(0x0198)
-#define IOP3XX_IATVR3		(volatile u32 *)IOP3XX_REG_ADDR(0x019c)
-#define IOP3XX_OCCAR		(volatile u32 *)IOP3XX_REG_ADDR(0x01a4)
-#define IOP3XX_OCCDR		(volatile u32 *)IOP3XX_REG_ADDR(0x01ac)
-#define IOP3XX_PDSCR		(volatile u32 *)IOP3XX_REG_ADDR(0x01bc)
-#define IOP3XX_PMCAPID		(volatile u8  *)IOP3XX_REG_ADDR(0x01c0)
-#define IOP3XX_PMNEXT		(volatile u8  *)IOP3XX_REG_ADDR(0x01c1)
-#define IOP3XX_APMCR		(volatile u16 *)IOP3XX_REG_ADDR(0x01c2)
-#define IOP3XX_APMCSR		(volatile u16 *)IOP3XX_REG_ADDR(0x01c4)
-#define IOP3XX_PCIXCAPID	(volatile u8  *)IOP3XX_REG_ADDR(0x01e0)
-#define IOP3XX_PCIXNEXT		(volatile u8  *)IOP3XX_REG_ADDR(0x01e1)
-#define IOP3XX_PCIXCMD		(volatile u16 *)IOP3XX_REG_ADDR(0x01e2)
-#define IOP3XX_PCIXSR		(volatile u32 *)IOP3XX_REG_ADDR(0x01e4)
-#define IOP3XX_PCIIRSR		(volatile u32 *)IOP3XX_REG_ADDR(0x01ec)
+#define IOP3XX_ATUVID		IOP3XX_REG_ADDR16(0x0100)
+#define IOP3XX_ATUDID		IOP3XX_REG_ADDR16(0x0102)
+#define IOP3XX_ATUCMD		IOP3XX_REG_ADDR16(0x0104)
+#define IOP3XX_ATUSR		IOP3XX_REG_ADDR16(0x0106)
+#define IOP3XX_ATURID		IOP3XX_REG_ADDR8(0x0108)
+#define IOP3XX_ATUCCR		IOP3XX_REG_ADDR32(0x0109)
+#define IOP3XX_ATUCLSR		IOP3XX_REG_ADDR8(0x010c)
+#define IOP3XX_ATULT		IOP3XX_REG_ADDR8(0x010d)
+#define IOP3XX_ATUHTR		IOP3XX_REG_ADDR8(0x010e)
+#define IOP3XX_ATUBIST		IOP3XX_REG_ADDR8(0x010f)
+#define IOP3XX_IABAR0		IOP3XX_REG_ADDR32(0x0110)
+#define IOP3XX_IAUBAR0		IOP3XX_REG_ADDR32(0x0114)
+#define IOP3XX_IABAR1		IOP3XX_REG_ADDR32(0x0118)
+#define IOP3XX_IAUBAR1		IOP3XX_REG_ADDR32(0x011c)
+#define IOP3XX_IABAR2		IOP3XX_REG_ADDR32(0x0120)
+#define IOP3XX_IAUBAR2		IOP3XX_REG_ADDR32(0x0124)
+#define IOP3XX_ASVIR		IOP3XX_REG_ADDR16(0x012c)
+#define IOP3XX_ASIR		IOP3XX_REG_ADDR16(0x012e)
+#define IOP3XX_ERBAR		IOP3XX_REG_ADDR32(0x0130)
+#define IOP3XX_ATUILR		IOP3XX_REG_ADDR8(0x013c)
+#define IOP3XX_ATUIPR		IOP3XX_REG_ADDR8(0x013d)
+#define IOP3XX_ATUMGNT		IOP3XX_REG_ADDR8(0x013e)
+#define IOP3XX_ATUMLAT		IOP3XX_REG_ADDR8(0x013f)
+#define IOP3XX_IALR0		IOP3XX_REG_ADDR32(0x0140)
+#define IOP3XX_IATVR0		IOP3XX_REG_ADDR32(0x0144)
+#define IOP3XX_ERLR		IOP3XX_REG_ADDR32(0x0148)
+#define IOP3XX_ERTVR		IOP3XX_REG_ADDR32(0x014c)
+#define IOP3XX_IALR1		IOP3XX_REG_ADDR32(0x0150)
+#define IOP3XX_IALR2		IOP3XX_REG_ADDR32(0x0154)
+#define IOP3XX_IATVR2		IOP3XX_REG_ADDR32(0x0158)
+#define IOP3XX_OIOWTVR		IOP3XX_REG_ADDR32(0x015c)
+#define IOP3XX_OMWTVR0		IOP3XX_REG_ADDR32(0x0160)
+#define IOP3XX_OUMWTVR0		IOP3XX_REG_ADDR32(0x0164)
+#define IOP3XX_OMWTVR1		IOP3XX_REG_ADDR32(0x0168)
+#define IOP3XX_OUMWTVR1		IOP3XX_REG_ADDR32(0x016c)
+#define IOP3XX_OUDWTVR		IOP3XX_REG_ADDR32(0x0178)
+#define IOP3XX_ATUCR		IOP3XX_REG_ADDR32(0x0180)
+#define IOP3XX_PCSR		IOP3XX_REG_ADDR32(0x0184)
+#define IOP3XX_ATUISR		IOP3XX_REG_ADDR32(0x0188)
+#define IOP3XX_ATUIMR		IOP3XX_REG_ADDR32(0x018c)
+#define IOP3XX_IABAR3		IOP3XX_REG_ADDR32(0x0190)
+#define IOP3XX_IAUBAR3		IOP3XX_REG_ADDR32(0x0194)
+#define IOP3XX_IALR3		IOP3XX_REG_ADDR32(0x0198)
+#define IOP3XX_IATVR3		IOP3XX_REG_ADDR32(0x019c)
+#define IOP3XX_OCCAR		IOP3XX_REG_ADDR32(0x01a4)
+#define IOP3XX_OCCDR		IOP3XX_REG_ADDR32(0x01ac)
+#define IOP3XX_PDSCR		IOP3XX_REG_ADDR32(0x01bc)
+#define IOP3XX_PMCAPID		IOP3XX_REG_ADDR8(0x01c0)
+#define IOP3XX_PMNEXT		IOP3XX_REG_ADDR8(0x01c1)
+#define IOP3XX_APMCR		IOP3XX_REG_ADDR16(0x01c2)
+#define IOP3XX_APMCSR		IOP3XX_REG_ADDR16(0x01c4)
+#define IOP3XX_PCIXCAPID	IOP3XX_REG_ADDR8(0x01e0)
+#define IOP3XX_PCIXNEXT		IOP3XX_REG_ADDR8(0x01e1)
+#define IOP3XX_PCIXCMD		IOP3XX_REG_ADDR16(0x01e2)
+#define IOP3XX_PCIXSR		IOP3XX_REG_ADDR32(0x01e4)
+#define IOP3XX_PCIIRSR		IOP3XX_REG_ADDR32(0x01ec)
 
 /* Messaging Unit  */
-#define IOP3XX_IMR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0310)
-#define IOP3XX_IMR1		(volatile u32 *)IOP3XX_REG_ADDR(0x0314)
-#define IOP3XX_OMR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0318)
-#define IOP3XX_OMR1		(volatile u32 *)IOP3XX_REG_ADDR(0x031c)
-#define IOP3XX_IDR		(volatile u32 *)IOP3XX_REG_ADDR(0x0320)
-#define IOP3XX_IISR		(volatile u32 *)IOP3XX_REG_ADDR(0x0324)
-#define IOP3XX_IIMR		(volatile u32 *)IOP3XX_REG_ADDR(0x0328)
-#define IOP3XX_ODR		(volatile u32 *)IOP3XX_REG_ADDR(0x032c)
-#define IOP3XX_OISR		(volatile u32 *)IOP3XX_REG_ADDR(0x0330)
-#define IOP3XX_OIMR		(volatile u32 *)IOP3XX_REG_ADDR(0x0334)
-#define IOP3XX_MUCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0350)
-#define IOP3XX_QBAR		(volatile u32 *)IOP3XX_REG_ADDR(0x0354)
-#define IOP3XX_IFHPR		(volatile u32 *)IOP3XX_REG_ADDR(0x0360)
-#define IOP3XX_IFTPR		(volatile u32 *)IOP3XX_REG_ADDR(0x0364)
-#define IOP3XX_IPHPR		(volatile u32 *)IOP3XX_REG_ADDR(0x0368)
-#define IOP3XX_IPTPR		(volatile u32 *)IOP3XX_REG_ADDR(0x036c)
-#define IOP3XX_OFHPR		(volatile u32 *)IOP3XX_REG_ADDR(0x0370)
-#define IOP3XX_OFTPR		(volatile u32 *)IOP3XX_REG_ADDR(0x0374)
-#define IOP3XX_OPHPR		(volatile u32 *)IOP3XX_REG_ADDR(0x0378)
-#define IOP3XX_OPTPR		(volatile u32 *)IOP3XX_REG_ADDR(0x037c)
-#define IOP3XX_IAR		(volatile u32 *)IOP3XX_REG_ADDR(0x0380)
+#define IOP3XX_IMR0		IOP3XX_REG_ADDR32(0x0310)
+#define IOP3XX_IMR1		IOP3XX_REG_ADDR32(0x0314)
+#define IOP3XX_OMR0		IOP3XX_REG_ADDR32(0x0318)
+#define IOP3XX_OMR1		IOP3XX_REG_ADDR32(0x031c)
+#define IOP3XX_IDR		IOP3XX_REG_ADDR32(0x0320)
+#define IOP3XX_IISR		IOP3XX_REG_ADDR32(0x0324)
+#define IOP3XX_IIMR		IOP3XX_REG_ADDR32(0x0328)
+#define IOP3XX_ODR		IOP3XX_REG_ADDR32(0x032c)
+#define IOP3XX_OISR		IOP3XX_REG_ADDR32(0x0330)
+#define IOP3XX_OIMR		IOP3XX_REG_ADDR32(0x0334)
+#define IOP3XX_MUCR		IOP3XX_REG_ADDR32(0x0350)
+#define IOP3XX_QBAR		IOP3XX_REG_ADDR32(0x0354)
+#define IOP3XX_IFHPR		IOP3XX_REG_ADDR32(0x0360)
+#define IOP3XX_IFTPR		IOP3XX_REG_ADDR32(0x0364)
+#define IOP3XX_IPHPR		IOP3XX_REG_ADDR32(0x0368)
+#define IOP3XX_IPTPR		IOP3XX_REG_ADDR32(0x036c)
+#define IOP3XX_OFHPR		IOP3XX_REG_ADDR32(0x0370)
+#define IOP3XX_OFTPR		IOP3XX_REG_ADDR32(0x0374)
+#define IOP3XX_OPHPR		IOP3XX_REG_ADDR32(0x0378)
+#define IOP3XX_OPTPR		IOP3XX_REG_ADDR32(0x037c)
+#define IOP3XX_IAR		IOP3XX_REG_ADDR32(0x0380)
 
-/* DMA Controller  */
-#define IOP3XX_DMA0_CCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0400)
-#define IOP3XX_DMA0_CSR		(volatile u32 *)IOP3XX_REG_ADDR(0x0404)
-#define IOP3XX_DMA0_DAR		(volatile u32 *)IOP3XX_REG_ADDR(0x040c)
-#define IOP3XX_DMA0_NDAR	(volatile u32 *)IOP3XX_REG_ADDR(0x0410)
-#define IOP3XX_DMA0_PADR	(volatile u32 *)IOP3XX_REG_ADDR(0x0414)
-#define IOP3XX_DMA0_PUADR	(volatile u32 *)IOP3XX_REG_ADDR(0x0418)
-#define IOP3XX_DMA0_LADR	(volatile u32 *)IOP3XX_REG_ADDR(0x041c)
-#define IOP3XX_DMA0_BCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0420)
-#define IOP3XX_DMA0_DCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0424)
-#define IOP3XX_DMA1_CCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0440)
-#define IOP3XX_DMA1_CSR		(volatile u32 *)IOP3XX_REG_ADDR(0x0444)
-#define IOP3XX_DMA1_DAR		(volatile u32 *)IOP3XX_REG_ADDR(0x044c)
-#define IOP3XX_DMA1_NDAR	(volatile u32 *)IOP3XX_REG_ADDR(0x0450)
-#define IOP3XX_DMA1_PADR	(volatile u32 *)IOP3XX_REG_ADDR(0x0454)
-#define IOP3XX_DMA1_PUADR	(volatile u32 *)IOP3XX_REG_ADDR(0x0458)
-#define IOP3XX_DMA1_LADR	(volatile u32 *)IOP3XX_REG_ADDR(0x045c)
-#define IOP3XX_DMA1_BCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0460)
-#define IOP3XX_DMA1_DCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0464)
+/* DMA Controllers  */
+#define IOP3XX_DMA_OFFSET(chan, ofs) 	IOP3XX_REG_ADDR32((chan << 6) + (ofs))
+
+#define IOP3XX_DMA_CCR(chan)		IOP3XX_DMA_OFFSET(chan, 0x0400)
+#define IOP3XX_DMA_CSR(chan)		IOP3XX_DMA_OFFSET(chan, 0x0404)
+#define IOP3XX_DMA_DAR(chan)		IOP3XX_DMA_OFFSET(chan, 0x040c)
+#define IOP3XX_DMA_NDAR(chan)		IOP3XX_DMA_OFFSET(chan, 0x0410)
+#define IOP3XX_DMA_PADR(chan)		IOP3XX_DMA_OFFSET(chan, 0x0414)
+#define IOP3XX_DMA_PUADR(chan)		IOP3XX_DMA_OFFSET(chan, 0x0418)
+#define IOP3XX_DMA_LADR(chan)		IOP3XX_DMA_OFFSET(chan, 0x041c)
+#define IOP3XX_DMA_BCR(chan)		IOP3XX_DMA_OFFSET(chan, 0x0420)
+#define IOP3XX_DMA_DCR(chan)		IOP3XX_DMA_OFFSET(chan, 0x0424)
+
+/* Application accelerator unit  */
+#define IOP3XX_AAU_ACR		IOP3XX_REG_ADDR32(0x0800)
+#define IOP3XX_AAU_ASR		IOP3XX_REG_ADDR32(0x0804)
+#define IOP3XX_AAU_ADAR		IOP3XX_REG_ADDR32(0x0808)
+#define IOP3XX_AAU_ANDAR	IOP3XX_REG_ADDR32(0x080c)
+#define IOP3XX_AAU_SAR(src)	IOP3XX_REG_ADDR32(0x0810 + ((src) << 2))
+#define IOP3XX_AAU_DAR		IOP3XX_REG_ADDR32(0x0820)
+#define IOP3XX_AAU_ABCR		IOP3XX_REG_ADDR32(0x0824)
+#define IOP3XX_AAU_ADCR		IOP3XX_REG_ADDR32(0x0828)
+#define IOP3XX_AAU_SAR_EDCR(src_edc) IOP3XX_REG_ADDR32(0x082c + ((src_edc - 4) << 2))
+#define IOP3XX_AAU_EDCR0_IDX	8
+#define IOP3XX_AAU_EDCR1_IDX	17
+#define IOP3XX_AAU_EDCR2_IDX	26
+
+#define IOP3XX_DMA0_ID 0
+#define IOP3XX_DMA1_ID 1
+#define IOP3XX_AAU_ID 2
 
 /* Peripheral bus interface  */
-#define IOP3XX_PBCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0680)
-#define IOP3XX_PBISR		(volatile u32 *)IOP3XX_REG_ADDR(0x0684)
-#define IOP3XX_PBBAR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0688)
-#define IOP3XX_PBLR0		(volatile u32 *)IOP3XX_REG_ADDR(0x068c)
-#define IOP3XX_PBBAR1		(volatile u32 *)IOP3XX_REG_ADDR(0x0690)
-#define IOP3XX_PBLR1		(volatile u32 *)IOP3XX_REG_ADDR(0x0694)
-#define IOP3XX_PBBAR2		(volatile u32 *)IOP3XX_REG_ADDR(0x0698)
-#define IOP3XX_PBLR2		(volatile u32 *)IOP3XX_REG_ADDR(0x069c)
-#define IOP3XX_PBBAR3		(volatile u32 *)IOP3XX_REG_ADDR(0x06a0)
-#define IOP3XX_PBLR3		(volatile u32 *)IOP3XX_REG_ADDR(0x06a4)
-#define IOP3XX_PBBAR4		(volatile u32 *)IOP3XX_REG_ADDR(0x06a8)
-#define IOP3XX_PBLR4		(volatile u32 *)IOP3XX_REG_ADDR(0x06ac)
-#define IOP3XX_PBBAR5		(volatile u32 *)IOP3XX_REG_ADDR(0x06b0)
-#define IOP3XX_PBLR5		(volatile u32 *)IOP3XX_REG_ADDR(0x06b4)
-#define IOP3XX_PMBR0		(volatile u32 *)IOP3XX_REG_ADDR(0x06c0)
-#define IOP3XX_PMBR1		(volatile u32 *)IOP3XX_REG_ADDR(0x06e0)
-#define IOP3XX_PMBR2		(volatile u32 *)IOP3XX_REG_ADDR(0x06e4)
+#define IOP3XX_PBCR		IOP3XX_REG_ADDR32(0x0680)
+#define IOP3XX_PBISR		IOP3XX_REG_ADDR32(0x0684)
+#define IOP3XX_PBBAR0		IOP3XX_REG_ADDR32(0x0688)
+#define IOP3XX_PBLR0		IOP3XX_REG_ADDR32(0x068c)
+#define IOP3XX_PBBAR1		IOP3XX_REG_ADDR32(0x0690)
+#define IOP3XX_PBLR1		IOP3XX_REG_ADDR32(0x0694)
+#define IOP3XX_PBBAR2		IOP3XX_REG_ADDR32(0x0698)
+#define IOP3XX_PBLR2		IOP3XX_REG_ADDR32(0x069c)
+#define IOP3XX_PBBAR3		IOP3XX_REG_ADDR32(0x06a0)
+#define IOP3XX_PBLR3		IOP3XX_REG_ADDR32(0x06a4)
+#define IOP3XX_PBBAR4		IOP3XX_REG_ADDR32(0x06a8)
+#define IOP3XX_PBLR4		IOP3XX_REG_ADDR32(0x06ac)
+#define IOP3XX_PBBAR5		IOP3XX_REG_ADDR32(0x06b0)
+#define IOP3XX_PBLR5		IOP3XX_REG_ADDR32(0x06b4)
+#define IOP3XX_PMBR0		IOP3XX_REG_ADDR32(0x06c0)
+#define IOP3XX_PMBR1		IOP3XX_REG_ADDR32(0x06e0)
+#define IOP3XX_PMBR2		IOP3XX_REG_ADDR32(0x06e4)
 
 /* Peripheral performance monitoring unit  */
-#define IOP3XX_GTMR		(volatile u32 *)IOP3XX_REG_ADDR(0x0700)
-#define IOP3XX_ESR		(volatile u32 *)IOP3XX_REG_ADDR(0x0704)
-#define IOP3XX_EMISR		(volatile u32 *)IOP3XX_REG_ADDR(0x0708)
-#define IOP3XX_GTSR		(volatile u32 *)IOP3XX_REG_ADDR(0x0710)
+#define IOP3XX_GTMR		IOP3XX_REG_ADDR32(0x0700)
+#define IOP3XX_ESR		IOP3XX_REG_ADDR32(0x0704)
+#define IOP3XX_EMISR		IOP3XX_REG_ADDR32(0x0708)
+#define IOP3XX_GTSR		IOP3XX_REG_ADDR32(0x0710)
 /* PERCR0 DOESN'T EXIST - index from 1! */
-#define IOP3XX_PERCR0		(volatile u32 *)IOP3XX_REG_ADDR(0x0710)
+#define IOP3XX_PERCR0		IOP3XX_REG_ADDR32(0x0710)
 
 /* General Purpose I/O  */
-#define IOP3XX_GPOE		(volatile u32 *)IOP3XX_GPIO_REG(0x0004)
-#define IOP3XX_GPID		(volatile u32 *)IOP3XX_GPIO_REG(0x0008)
-#define IOP3XX_GPOD		(volatile u32 *)IOP3XX_GPIO_REG(0x000c)
+#define IOP3XX_GPOE		IOP3XX_GPIO_REG32(0x0004)
+#define IOP3XX_GPID		IOP3XX_GPIO_REG32(0x0008)
+#define IOP3XX_GPOD		IOP3XX_GPIO_REG32(0x000c)
 
 /* Timers  */
-#define IOP3XX_TU_TMR0		(volatile u32 *)IOP3XX_TIMER_REG(0x0000)
-#define IOP3XX_TU_TMR1		(volatile u32 *)IOP3XX_TIMER_REG(0x0004)
-#define IOP3XX_TU_TCR0		(volatile u32 *)IOP3XX_TIMER_REG(0x0008)
-#define IOP3XX_TU_TCR1		(volatile u32 *)IOP3XX_TIMER_REG(0x000c)
-#define IOP3XX_TU_TRR0		(volatile u32 *)IOP3XX_TIMER_REG(0x0010)
-#define IOP3XX_TU_TRR1		(volatile u32 *)IOP3XX_TIMER_REG(0x0014)
-#define IOP3XX_TU_TISR		(volatile u32 *)IOP3XX_TIMER_REG(0x0018)
-#define IOP3XX_TU_WDTCR		(volatile u32 *)IOP3XX_TIMER_REG(0x001c)
+#define IOP3XX_TU_TMR0		IOP3XX_TIMER_REG32(0x0000)
+#define IOP3XX_TU_TMR1		IOP3XX_TIMER_REG32(0x0004)
+#define IOP3XX_TU_TCR0		IOP3XX_TIMER_REG32(0x0008)
+#define IOP3XX_TU_TCR1		IOP3XX_TIMER_REG32(0x000c)
+#define IOP3XX_TU_TRR0		IOP3XX_TIMER_REG32(0x0010)
+#define IOP3XX_TU_TRR1		IOP3XX_TIMER_REG32(0x0014)
+#define IOP3XX_TU_TISR		IOP3XX_TIMER_REG32(0x0018)
+#define IOP3XX_TU_WDTCR		IOP3XX_TIMER_REG32(0x001c)
 #define IOP3XX_TMR_TC		0x01
 #define IOP3XX_TMR_EN		0x02
 #define IOP3XX_TMR_RELOAD	0x04
@@ -190,69 +203,25 @@ #define IOP3XX_TMR_RATIO_4_1	0x10
 #define IOP3XX_TMR_RATIO_8_1	0x20
 #define IOP3XX_TMR_RATIO_16_1	0x30
 
-/* Application accelerator unit  */
-#define IOP3XX_AAU_ACR		(volatile u32 *)IOP3XX_REG_ADDR(0x0800)
-#define IOP3XX_AAU_ASR		(volatile u32 *)IOP3XX_REG_ADDR(0x0804)
-#define IOP3XX_AAU_ADAR		(volatile u32 *)IOP3XX_REG_ADDR(0x0808)
-#define IOP3XX_AAU_ANDAR	(volatile u32 *)IOP3XX_REG_ADDR(0x080c)
-#define IOP3XX_AAU_SAR1		(volatile u32 *)IOP3XX_REG_ADDR(0x0810)
-#define IOP3XX_AAU_SAR2		(volatile u32 *)IOP3XX_REG_ADDR(0x0814)
-#define IOP3XX_AAU_SAR3		(volatile u32 *)IOP3XX_REG_ADDR(0x0818)
-#define IOP3XX_AAU_SAR4		(volatile u32 *)IOP3XX_REG_ADDR(0x081c)
-#define IOP3XX_AAU_DAR		(volatile u32 *)IOP3XX_REG_ADDR(0x0820)
-#define IOP3XX_AAU_ABCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0824)
-#define IOP3XX_AAU_ADCR		(volatile u32 *)IOP3XX_REG_ADDR(0x0828)
-#define IOP3XX_AAU_SAR5		(volatile u32 *)IOP3XX_REG_ADDR(0x082c)
-#define IOP3XX_AAU_SAR6		(volatile u32 *)IOP3XX_REG_ADDR(0x0830)
-#define IOP3XX_AAU_SAR7		(volatile u32 *)IOP3XX_REG_ADDR(0x0834)
-#define IOP3XX_AAU_SAR8		(volatile u32 *)IOP3XX_REG_ADDR(0x0838)
-#define IOP3XX_AAU_EDCR0	(volatile u32 *)IOP3XX_REG_ADDR(0x083c)
-#define IOP3XX_AAU_SAR9		(volatile u32 *)IOP3XX_REG_ADDR(0x0840)
-#define IOP3XX_AAU_SAR10	(volatile u32 *)IOP3XX_REG_ADDR(0x0844)
-#define IOP3XX_AAU_SAR11	(volatile u32 *)IOP3XX_REG_ADDR(0x0848)
-#define IOP3XX_AAU_SAR12	(volatile u32 *)IOP3XX_REG_ADDR(0x084c)
-#define IOP3XX_AAU_SAR13	(volatile u32 *)IOP3XX_REG_ADDR(0x0850)
-#define IOP3XX_AAU_SAR14	(volatile u32 *)IOP3XX_REG_ADDR(0x0854)
-#define IOP3XX_AAU_SAR15	(volatile u32 *)IOP3XX_REG_ADDR(0x0858)
-#define IOP3XX_AAU_SAR16	(volatile u32 *)IOP3XX_REG_ADDR(0x085c)
-#define IOP3XX_AAU_EDCR1	(volatile u32 *)IOP3XX_REG_ADDR(0x0860)
-#define IOP3XX_AAU_SAR17	(volatile u32 *)IOP3XX_REG_ADDR(0x0864)
-#define IOP3XX_AAU_SAR18	(volatile u32 *)IOP3XX_REG_ADDR(0x0868)
-#define IOP3XX_AAU_SAR19	(volatile u32 *)IOP3XX_REG_ADDR(0x086c)
-#define IOP3XX_AAU_SAR20	(volatile u32 *)IOP3XX_REG_ADDR(0x0870)
-#define IOP3XX_AAU_SAR21	(volatile u32 *)IOP3XX_REG_ADDR(0x0874)
-#define IOP3XX_AAU_SAR22	(volatile u32 *)IOP3XX_REG_ADDR(0x0878)
-#define IOP3XX_AAU_SAR23	(volatile u32 *)IOP3XX_REG_ADDR(0x087c)
-#define IOP3XX_AAU_SAR24	(volatile u32 *)IOP3XX_REG_ADDR(0x0880)
-#define IOP3XX_AAU_EDCR2	(volatile u32 *)IOP3XX_REG_ADDR(0x0884)
-#define IOP3XX_AAU_SAR25	(volatile u32 *)IOP3XX_REG_ADDR(0x0888)
-#define IOP3XX_AAU_SAR26	(volatile u32 *)IOP3XX_REG_ADDR(0x088c)
-#define IOP3XX_AAU_SAR27	(volatile u32 *)IOP3XX_REG_ADDR(0x0890)
-#define IOP3XX_AAU_SAR28	(volatile u32 *)IOP3XX_REG_ADDR(0x0894)
-#define IOP3XX_AAU_SAR29	(volatile u32 *)IOP3XX_REG_ADDR(0x0898)
-#define IOP3XX_AAU_SAR30	(volatile u32 *)IOP3XX_REG_ADDR(0x089c)
-#define IOP3XX_AAU_SAR31	(volatile u32 *)IOP3XX_REG_ADDR(0x08a0)
-#define IOP3XX_AAU_SAR32	(volatile u32 *)IOP3XX_REG_ADDR(0x08a4)
-
 /* I2C bus interface unit  */
-#define IOP3XX_ICR0		(volatile u32 *)IOP3XX_REG_ADDR(0x1680)
-#define IOP3XX_ISR0		(volatile u32 *)IOP3XX_REG_ADDR(0x1684)
-#define IOP3XX_ISAR0		(volatile u32 *)IOP3XX_REG_ADDR(0x1688)
-#define IOP3XX_IDBR0		(volatile u32 *)IOP3XX_REG_ADDR(0x168c)
-#define IOP3XX_IBMR0		(volatile u32 *)IOP3XX_REG_ADDR(0x1694)
-#define IOP3XX_ICR1		(volatile u32 *)IOP3XX_REG_ADDR(0x16a0)
-#define IOP3XX_ISR1		(volatile u32 *)IOP3XX_REG_ADDR(0x16a4)
-#define IOP3XX_ISAR1		(volatile u32 *)IOP3XX_REG_ADDR(0x16a8)
-#define IOP3XX_IDBR1		(volatile u32 *)IOP3XX_REG_ADDR(0x16ac)
-#define IOP3XX_IBMR1		(volatile u32 *)IOP3XX_REG_ADDR(0x16b4)
+#define IOP3XX_ICR0		IOP3XX_REG_ADDR32(0x1680)
+#define IOP3XX_ISR0		IOP3XX_REG_ADDR32(0x1684)
+#define IOP3XX_ISAR0		IOP3XX_REG_ADDR32(0x1688)
+#define IOP3XX_IDBR0		IOP3XX_REG_ADDR32(0x168c)
+#define IOP3XX_IBMR0		IOP3XX_REG_ADDR32(0x1694)
+#define IOP3XX_ICR1		IOP3XX_REG_ADDR32(0x16a0)
+#define IOP3XX_ISR1		IOP3XX_REG_ADDR32(0x16a4)
+#define IOP3XX_ISAR1		IOP3XX_REG_ADDR32(0x16a8)
+#define IOP3XX_IDBR1		IOP3XX_REG_ADDR32(0x16ac)
+#define IOP3XX_IBMR1		IOP3XX_REG_ADDR32(0x16b4)
 
 
 /*
  * IOP3XX I/O and Mem space regions for PCI autoconfiguration
  */
 #define IOP3XX_PCI_MEM_WINDOW_SIZE	0x04000000
-#define IOP3XX_PCI_LOWER_MEM_PA		0x80000000
-#define IOP3XX_PCI_LOWER_MEM_BA		(*IOP3XX_OMWTVR0)
+#define IOP3XX_PCI_LOWER_MEM_PA		0x80000000
+#define IOP3XX_PCI_LOWER_MEM_BA		(*IOP3XX_OMWTVR0)
 
 #define IOP3XX_PCI_IO_WINDOW_SIZE	0x00010000
 #define IOP3XX_PCI_LOWER_IO_PA		0x90000000


* [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (16 preceding siblings ...)
  2006-09-11 23:19 ` [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs Dan Williams
@ 2006-09-11 23:19 ` Dan Williams
  2006-09-11 23:56   ` Jeff Garzik
  2006-09-11 23:19 ` [PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver Dan Williams
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:19 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Currently the iop3xx platform support code assumes that RedBoot is the
bootloader and has already initialized the ATU.  Linux should handle this
initialization for three reasons:

1/ The memory map that RedBoot sets up is not optimal (page_to_dma and
virt_to_phys return different addresses).  As a result, using the DMA
mapping API for the internal-bus DMA units generates PCI bus addresses
that are incorrect for the internal bus.

2/ Not all IOP platforms use RedBoot.

3/ If the ATU is already initialized, the IOP is an add-in card in another
host; it does not own the PCI bus, and the ATU should not be
re-initialized.
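
To make the run-time decision concrete, here is a minimal sketch of the
pattern the board files below now follow; my_board_pci,
my_board_pci_map_irq and machine_is_my_board() are hypothetical
placeholders, the rest comes from this patch:

static struct hw_pci my_board_pci __initdata = {
	.swizzle	= pci_std_swizzle,
	.nr_controllers	= 0,	/* decided at run time, not build time */
	.setup		= iop3xx_pci_setup,
	.preinit	= iop3xx_pci_preinit,
	.scan		= iop3xx_pci_scan_bus,
	.map_irq	= my_board_pci_map_irq,
};

static int __init my_board_pci_init(void)
{
	/* Only claim the bus when the ATU policy says this IOP owns it;
	 * the policy defaults to CONFIG_IOP3XX_ATU and can be overridden
	 * on the command line with iop3xx_init_atu=y or =n.
	 */
	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
		my_board_pci.nr_controllers = 1;

	if (machine_is_my_board())
		pci_common_init(&my_board_pci);

	return 0;
}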

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 arch/arm/mach-iop32x/Kconfig         |    8 ++
 arch/arm/mach-iop32x/ep80219.c       |    4 +
 arch/arm/mach-iop32x/iq31244.c       |    5 +
 arch/arm/mach-iop32x/iq80321.c       |    5 +
 arch/arm/mach-iop33x/Kconfig         |    8 ++
 arch/arm/mach-iop33x/iq80331.c       |    5 +
 arch/arm/mach-iop33x/iq80332.c       |    4 +
 arch/arm/plat-iop/pci.c              |  140 ++++++++++++++++++++++++++++++++++
 include/asm-arm/arch-iop32x/iop32x.h |    9 ++
 include/asm-arm/arch-iop32x/memory.h |    4 -
 include/asm-arm/arch-iop33x/iop33x.h |   10 ++
 include/asm-arm/arch-iop33x/memory.h |    4 -
 include/asm-arm/hardware/iop3xx.h    |   20 ++++-
 13 files changed, 214 insertions(+), 12 deletions(-)

diff --git a/arch/arm/mach-iop32x/Kconfig b/arch/arm/mach-iop32x/Kconfig
index 05549a5..b2788e3 100644
--- a/arch/arm/mach-iop32x/Kconfig
+++ b/arch/arm/mach-iop32x/Kconfig
@@ -22,6 +22,14 @@ config ARCH_IQ80321
 	  Say Y here if you want to run your kernel on the Intel IQ80321
 	  evaluation kit for the IOP321 processor.
 
+config IOP3XX_ATU
+	bool "Enable the PCI Controller"
+	default y
+	help
+	  Say Y here if you want the IOP to initialize its PCI controller.
+	  Say N if the IOP is an add-in card; in that case the host system
+	  owns the PCI bus.
+
 endmenu
 
 endif
diff --git a/arch/arm/mach-iop32x/ep80219.c b/arch/arm/mach-iop32x/ep80219.c
index f616d3e..1a5c586 100644
--- a/arch/arm/mach-iop32x/ep80219.c
+++ b/arch/arm/mach-iop32x/ep80219.c
@@ -100,7 +100,7 @@ ep80219_pci_map_irq(struct pci_dev *dev,
 
 static struct hw_pci ep80219_pci __initdata = {
 	.swizzle	= pci_std_swizzle,
-	.nr_controllers = 1,
+	.nr_controllers = 0,
 	.setup		= iop3xx_pci_setup,
 	.preinit	= iop3xx_pci_preinit,
 	.scan		= iop3xx_pci_scan_bus,
@@ -109,6 +109,8 @@ static struct hw_pci ep80219_pci __initd
 
 static int __init ep80219_pci_init(void)
 {
+	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
+		ep80219_pci.nr_controllers = 1;
 #if 0
 	if (machine_is_ep80219())
 		pci_common_init(&ep80219_pci);
diff --git a/arch/arm/mach-iop32x/iq31244.c b/arch/arm/mach-iop32x/iq31244.c
index 967a696..25d5d62 100644
--- a/arch/arm/mach-iop32x/iq31244.c
+++ b/arch/arm/mach-iop32x/iq31244.c
@@ -97,7 +97,7 @@ iq31244_pci_map_irq(struct pci_dev *dev,
 
 static struct hw_pci iq31244_pci __initdata = {
 	.swizzle	= pci_std_swizzle,
-	.nr_controllers = 1,
+	.nr_controllers = 0,
 	.setup		= iop3xx_pci_setup,
 	.preinit	= iop3xx_pci_preinit,
 	.scan		= iop3xx_pci_scan_bus,
@@ -106,6 +106,9 @@ static struct hw_pci iq31244_pci __initd
 
 static int __init iq31244_pci_init(void)
 {
+	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
+		iq31244_pci.nr_controllers = 1;
+
 	if (machine_is_iq31244())
 		pci_common_init(&iq31244_pci);
 
diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c
index ef4388c..cdd2265 100644
--- a/arch/arm/mach-iop32x/iq80321.c
+++ b/arch/arm/mach-iop32x/iq80321.c
@@ -97,7 +97,7 @@ iq80321_pci_map_irq(struct pci_dev *dev,
 
 static struct hw_pci iq80321_pci __initdata = {
 	.swizzle	= pci_std_swizzle,
-	.nr_controllers = 1,
+	.nr_controllers = 0,
 	.setup		= iop3xx_pci_setup,
 	.preinit	= iop3xx_pci_preinit,
 	.scan		= iop3xx_pci_scan_bus,
@@ -106,6 +106,9 @@ static struct hw_pci iq80321_pci __initd
 
 static int __init iq80321_pci_init(void)
 {
+	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
+		iq80321_pci.nr_controllers = 1;
+
 	if (machine_is_iq80321())
 		pci_common_init(&iq80321_pci);
 
diff --git a/arch/arm/mach-iop33x/Kconfig b/arch/arm/mach-iop33x/Kconfig
index 9aa016b..45598e0 100644
--- a/arch/arm/mach-iop33x/Kconfig
+++ b/arch/arm/mach-iop33x/Kconfig
@@ -16,6 +16,14 @@ config MACH_IQ80332
 	  Say Y here if you want to run your kernel on the Intel IQ80332
 	  evaluation kit for the IOP332 chipset.
 
+config IOP3XX_ATU
+	bool "Enable the PCI Controller"
+	default y
+	help
+	  Say Y here if you want the IOP to initialize its PCI controller.
+	  Say N if the IOP is an add-in card; in that case the host system
+	  owns the PCI bus.
+
 endmenu
 
 endif
diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c
index 7714c94..3807000 100644
--- a/arch/arm/mach-iop33x/iq80331.c
+++ b/arch/arm/mach-iop33x/iq80331.c
@@ -78,7 +78,7 @@ iq80331_pci_map_irq(struct pci_dev *dev,
 
 static struct hw_pci iq80331_pci __initdata = {
 	.swizzle	= pci_std_swizzle,
-	.nr_controllers = 1,
+	.nr_controllers = 0,
 	.setup		= iop3xx_pci_setup,
 	.preinit	= iop3xx_pci_preinit,
 	.scan		= iop3xx_pci_scan_bus,
@@ -87,6 +87,9 @@ static struct hw_pci iq80331_pci __initd
 
 static int __init iq80331_pci_init(void)
 {
+	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
+		iq80331_pci.nr_controllers = 1;
+
 	if (machine_is_iq80331())
 		pci_common_init(&iq80331_pci);
 
diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c
index a3fa7f8..8780d55 100644
--- a/arch/arm/mach-iop33x/iq80332.c
+++ b/arch/arm/mach-iop33x/iq80332.c
@@ -93,6 +93,10 @@ static struct hw_pci iq80332_pci __initd
 
 static int __init iq80332_pci_init(void)
 {
+
+	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
+		iq80332_pci.nr_controllers = 1;
+
 	if (machine_is_iq80332())
 		pci_common_init(&iq80332_pci);
 
diff --git a/arch/arm/plat-iop/pci.c b/arch/arm/plat-iop/pci.c
index e647812..19aace9 100644
--- a/arch/arm/plat-iop/pci.c
+++ b/arch/arm/plat-iop/pci.c
@@ -55,7 +55,7 @@ static u32 iop3xx_cfg_address(struct pci
  * This routine checks the status of the last configuration cycle.  If an error
  * was detected it returns a 1, else it returns a 0.  The errors being checked
  * are parity, master abort, target abort (master and target).  These types of
- * errors occure during a config cycle where there is no device, like during
+ * errors occur during a config cycle where there is no device, like during
  * the discovery stage.
  */
 static int iop3xx_pci_status(void)
@@ -223,8 +223,111 @@ struct pci_bus *iop3xx_pci_scan_bus(int 
 	return pci_scan_bus(sys->busnr, &iop3xx_ops, sys);
 }
 
+void __init iop3xx_atu_setup(void)
+{
+	/* BAR 0 ( Disabled ) */
+	*IOP3XX_IAUBAR0 = 0x0;
+	*IOP3XX_IABAR0  = 0x0;
+	*IOP3XX_IATVR0  = 0x0;
+	*IOP3XX_IALR0   = 0x0;
+
+	/* BAR 1 ( Disabled ) */
+	*IOP3XX_IAUBAR1 = 0x0;
+	*IOP3XX_IABAR1  = 0x0;
+	*IOP3XX_IALR1   = 0x0;
+
+	/* BAR 2 (1:1 mapping with Physical RAM) */
+	/* Set limit and enable */
+	*IOP3XX_IALR2 = ~((u32)IOP3XX_MAX_RAM_SIZE - 1) & ~0x1;
+	*IOP3XX_IAUBAR2 = 0x0;
+
+	/* Align the inbound bar with the base of memory */
+	*IOP3XX_IABAR2 = PHYS_OFFSET |
+			       PCI_BASE_ADDRESS_MEM_TYPE_64 |
+			       PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+	*IOP3XX_IATVR2 = PHYS_OFFSET;
+
+	/* Outbound window 0 */
+	*IOP3XX_OMWTVR0 = IOP3XX_PCI_LOWER_MEM_PA;
+	*IOP3XX_OUMWTVR0 = 0;
+
+	/* Outbound window 1 */
+	*IOP3XX_OMWTVR1 = IOP3XX_PCI_LOWER_MEM_PA + IOP3XX_PCI_MEM_WINDOW_SIZE;
+	*IOP3XX_OUMWTVR1 = 0;
+
+	/* BAR 3 ( Disabled ) */
+	*IOP3XX_IAUBAR3 = 0x0;
+	*IOP3XX_IABAR3  = 0x0;
+	*IOP3XX_IATVR3  = 0x0;
+	*IOP3XX_IALR3   = 0x0;
+
+	/* Set up the I/O BAR
+	 */
+	*IOP3XX_OIOWTVR = IOP3XX_PCI_LOWER_IO_PA;
+
+	/* Enable inbound and outbound cycles
+	 */
+	*IOP3XX_ATUCMD |= PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER |
+			       PCI_COMMAND_PARITY | PCI_COMMAND_SERR;
+	*IOP3XX_ATUCR |= IOP3XX_ATUCR_OUT_EN;
+}
+
+void __init iop3xx_atu_disable(void)
+{
+	*IOP3XX_ATUCMD = 0;
+	*IOP3XX_ATUCR = 0;
+
+	/* wait for cycles to quiesce */
+	while (*IOP3XX_PCSR & (IOP3XX_PCSR_OUT_Q_BUSY |
+				     IOP3XX_PCSR_IN_Q_BUSY))
+		cpu_relax();
+
+	/* BAR 0 ( Disabled ) */
+	*IOP3XX_IAUBAR0 = 0x0;
+	*IOP3XX_IABAR0  = 0x0;
+	*IOP3XX_IATVR0  = 0x0;
+	*IOP3XX_IALR0   = 0x0;
+
+	/* BAR 1 ( Disabled ) */
+	*IOP3XX_IAUBAR1 = 0x0;
+	*IOP3XX_IABAR1  = 0x0;
+	*IOP3XX_IALR1   = 0x0;
+
+	/* BAR 2 ( Disabled ) */
+	*IOP3XX_IAUBAR2 = 0x0;
+	*IOP3XX_IABAR2  = 0x0;
+	*IOP3XX_IATVR2  = 0x0;
+	*IOP3XX_IALR2   = 0x0;
+
+	/* BAR 3 ( Disabled ) */
+	*IOP3XX_IAUBAR3 = 0x0;
+	*IOP3XX_IABAR3  = 0x0;
+	*IOP3XX_IATVR3  = 0x0;
+	*IOP3XX_IALR3   = 0x0;
+
+	/* Clear the outbound windows */
+	*IOP3XX_OIOWTVR  = 0;
+
+	/* Outbound window 0 */
+	*IOP3XX_OMWTVR0 = 0;
+	*IOP3XX_OUMWTVR0 = 0;
+
+	/* Outbound window 1 */
+	*IOP3XX_OMWTVR1 = 0;
+	*IOP3XX_OUMWTVR1 = 0;
+}
+
+/* Flag to determine whether the ATU is initialized and the PCI bus scanned */
+int init_atu;
+
 void iop3xx_pci_preinit(void)
 {
+	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE) {
+		iop3xx_atu_disable();
+		iop3xx_atu_setup();
+	}
+
 	DBG("PCI:  Intel 803xx PCI init code.\n");
 	DBG("ATU: IOP3XX_ATUCMD=0x%04x\n", *IOP3XX_ATUCMD);
 	DBG("ATU: IOP3XX_OMWTVR0=0x%04x, IOP3XX_OIOWTVR=0x%04x\n",
@@ -245,3 +348,38 @@ void iop3xx_pci_preinit(void)
 
 	hook_fault_code(16+6, iop3xx_pci_abort, SIGBUS, "imprecise external abort");
 }
+
+/* allow init_atu to be user overridden */
+static int __init iop3xx_init_atu_setup(char *str)
+{
+	init_atu = IOP3XX_INIT_ATU_DEFAULT;
+	if (str) {
+		while (*str != '\0') {
+			switch (*str) {
+			case 'y':
+			case 'Y':
+				init_atu = IOP3XX_INIT_ATU_ENABLE;
+				break;
+			case 'n':
+			case 'N':
+				init_atu = IOP3XX_INIT_ATU_DISABLE;
+				break;
+			case ',':
+			case '=':
+				break;
+			default:
+				printk(KERN_DEBUG "\"%s\" malformed at "
+					    "character: '%c'\n",
+					    __FUNCTION__,
+					    *str);
+				*(str + 1) = '\0';
+			}
+			str++;
+		}
+	}
+
+	return 1;
+}
+
+__setup("iop3xx_init_atu", iop3xx_init_atu_setup);
+
diff --git a/include/asm-arm/arch-iop32x/iop32x.h b/include/asm-arm/arch-iop32x/iop32x.h
index 904a14d..93209c7 100644
--- a/include/asm-arm/arch-iop32x/iop32x.h
+++ b/include/asm-arm/arch-iop32x/iop32x.h
@@ -32,5 +32,14 @@ #define IOP32X_INTSTR		IOP3XX_REG_ADDR32
 #define IOP32X_IINTSRC		IOP3XX_REG_ADDR32(0x07d8)
 #define IOP32X_FINTSRC		IOP3XX_REG_ADDR32(0x07dc)
 
+/* ATU Parameters
+ * set up a 1:1 bus to physical ram relationship
+ * w/ physical ram on top of pci in the memory map
+ */
+#define IOP32X_MAX_RAM_SIZE            0x40000000UL
+#define IOP3XX_MAX_RAM_SIZE            IOP32X_MAX_RAM_SIZE
+#define IOP3XX_PCI_LOWER_MEM_BA        0x80000000
+#define IOP32X_PCI_MEM_WINDOW_SIZE     0x04000000
+#define IOP3XX_PCI_MEM_WINDOW_SIZE     IOP32X_PCI_MEM_WINDOW_SIZE
 
 #endif
diff --git a/include/asm-arm/arch-iop32x/memory.h b/include/asm-arm/arch-iop32x/memory.h
index 764cd3f..c51072a 100644
--- a/include/asm-arm/arch-iop32x/memory.h
+++ b/include/asm-arm/arch-iop32x/memory.h
@@ -19,8 +19,8 @@ #define PHYS_OFFSET	UL(0xa0000000)
  * bus_to_virt: Used to convert an address for DMA operations
  *		to an address that the kernel can use.
  */
-#define __virt_to_bus(x)	(((__virt_to_phys(x)) & ~(*IOP3XX_IATVR2)) | ((*IOP3XX_IABAR2) & 0xfffffff0))
-#define __bus_to_virt(x)	(__phys_to_virt(((x) & ~(*IOP3XX_IALR2)) | ( *IOP3XX_IATVR2)))
+#define __virt_to_bus(x)	(__virt_to_phys(x))
+#define __bus_to_virt(x)	(__phys_to_virt(x))
 
 
 #endif
diff --git a/include/asm-arm/arch-iop33x/iop33x.h b/include/asm-arm/arch-iop33x/iop33x.h
index c171383..e106b80 100644
--- a/include/asm-arm/arch-iop33x/iop33x.h
+++ b/include/asm-arm/arch-iop33x/iop33x.h
@@ -49,5 +49,15 @@ #define IOP33X_UART0_VIRT	(IOP3XX_PERIPH
 #define IOP33X_UART1_PHYS	(IOP3XX_PERIPHERAL_PHYS_BASE + 0x1740)
 #define IOP33X_UART1_VIRT	(IOP3XX_PERIPHERAL_VIRT_BASE + 0x1740)
 
+/* ATU Parameters
+ * set up a 1:1 bus to physical ram relationship
+ * w/ pci on top of physical ram in memory map
+ */
+#define IOP33X_MAX_RAM_SIZE		0x80000000UL
+#define IOP3XX_MAX_RAM_SIZE		IOP33X_MAX_RAM_SIZE
+#define IOP3XX_PCI_LOWER_MEM_BA		(PHYS_OFFSET + IOP33X_MAX_RAM_SIZE)
+#define IOP33X_PCI_MEM_WINDOW_SIZE	0x08000000
+#define IOP3XX_PCI_MEM_WINDOW_SIZE	IOP33X_PCI_MEM_WINDOW_SIZE
+
 
 #endif
diff --git a/include/asm-arm/arch-iop33x/memory.h b/include/asm-arm/arch-iop33x/memory.h
index 0d39139..c874912 100644
--- a/include/asm-arm/arch-iop33x/memory.h
+++ b/include/asm-arm/arch-iop33x/memory.h
@@ -19,8 +19,8 @@ #define PHYS_OFFSET	UL(0x00000000)
  * bus_to_virt: Used to convert an address for DMA operations
  *		to an address that the kernel can use.
  */
-#define __virt_to_bus(x)	(((__virt_to_phys(x)) & ~(*IOP3XX_IATVR2)) | ((*IOP3XX_IABAR2) & 0xfffffff0))
-#define __bus_to_virt(x)	(__phys_to_virt(((x) & ~(*IOP3XX_IALR2)) | ( *IOP3XX_IATVR2)))
+#define __virt_to_bus(x)	(__virt_to_phys(x))
+#define __bus_to_virt(x)	(__phys_to_virt(x))
 
 
 #endif
diff --git a/include/asm-arm/hardware/iop3xx.h b/include/asm-arm/hardware/iop3xx.h
index 295789a..5a084c8 100644
--- a/include/asm-arm/hardware/iop3xx.h
+++ b/include/asm-arm/hardware/iop3xx.h
@@ -28,6 +28,7 @@ #ifndef __ASSEMBLY__
 extern void gpio_line_config(int line, int direction);
 extern int  gpio_line_get(int line);
 extern void gpio_line_set(int line, int value);
+extern int init_atu;
 #endif
 
 
@@ -98,6 +99,21 @@ #define IOP3XX_PCIXNEXT	IOP3XX_REG_ADDR8
 #define IOP3XX_PCIXCMD		IOP3XX_REG_ADDR16(0x01e2)
 #define IOP3XX_PCIXSR		IOP3XX_REG_ADDR32(0x01e4)
 #define IOP3XX_PCIIRSR		IOP3XX_REG_ADDR32(0x01ec)
+#define IOP3XX_PCSR_OUT_Q_BUSY	(1 << 15)
+#define IOP3XX_PCSR_IN_Q_BUSY	(1 << 14)
+#define IOP3XX_ATUCR_OUT_EN	(1 << 1)
+
+#define IOP3XX_INIT_ATU_DEFAULT 0
+#define IOP3XX_INIT_ATU_DISABLE -1
+#define IOP3XX_INIT_ATU_ENABLE	 1
+
+#ifdef CONFIG_IOP3XX_ATU
+#define iop3xx_get_init_atu() (init_atu == IOP3XX_INIT_ATU_DEFAULT ?\
+				IOP3XX_INIT_ATU_ENABLE : init_atu)
+#else
+#define iop3xx_get_init_atu() (init_atu == IOP3XX_INIT_ATU_DEFAULT ?\
+				IOP3XX_INIT_ATU_DISABLE : init_atu)
+#endif
 
 /* Messaging Unit  */
 #define IOP3XX_IMR0		IOP3XX_REG_ADDR32(0x0310)
@@ -219,14 +235,12 @@ #define IOP3XX_IBMR1		IOP3XX_REG_ADDR32(
 /*
  * IOP3XX I/O and Mem space regions for PCI autoconfiguration
  */
-#define IOP3XX_PCI_MEM_WINDOW_SIZE	0x04000000
 #define IOP3XX_PCI_LOWER_MEM_PA	0x80000000
-#define IOP3XX_PCI_LOWER_MEM_BA	(*IOP3XX_OMWTVR0)
 
 #define IOP3XX_PCI_IO_WINDOW_SIZE	0x00010000
 #define IOP3XX_PCI_LOWER_IO_PA		0x90000000
 #define IOP3XX_PCI_LOWER_IO_VA		0xfe000000
-#define IOP3XX_PCI_LOWER_IO_BA		(*IOP3XX_OIOWTVR)
+#define IOP3XX_PCI_LOWER_IO_BA		0x90000000
 
 
 #ifndef __ASSEMBLY__


* [PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (17 preceding siblings ...)
  2006-09-11 23:19 ` [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization Dan Williams
@ 2006-09-11 23:19 ` Dan Williams
  2006-09-11 23:38 ` [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Jeff Garzik
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:19 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: akpm, linux-kernel, christopher.leech

From: Dan Williams <dan.j.williams@intel.com>

Adds the platform device definitions and the architecture-specific support
routines (i.e., register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* add support for > 1k zero sum buffer sizes
* added dma/aau platform devices to iq80321 and iq80332 setup
* fixed the calculation in iop_desc_is_aligned
* support xor buffer sizes larger than 16MB
* fix places where software descriptors are assumed to be contiguous; only
hardware descriptors are contiguous
* iop32x does not support hardware zero sum; add software emulation support
for buffers up to PAGE_SIZE in size
* added raid5 dma driver support functions
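
For orientation, a minimal sketch of the per-channel platform data these
board changes feed to the iop-adma driver; the structure fields, IDs and
capability flags are taken from the series (iop_adma.h and the dmaengine
patches), while the example device itself is illustrative and omits the
MEM/IRQ resources and DMA masks a real board supplies:

#include <linux/platform_device.h>
#include <linux/dmaengine.h>
#include <asm/hardware/iop3xx.h>
#include <asm/hardware/iop_adma.h>

static struct iop_adma_platform_data example_aau_data = {
	.hw_id		= IOP3XX_AAU_ID,
	.capabilities	= DMA_XOR | DMA_ZERO_SUM | DMA_MEMSET,
	.pool_size	= 3 * PAGE_SIZE,	/* descriptor slot pool */
};

/* Registered by the board so the "IOP-ADMA" platform driver can probe
 * this channel as device id 2 (the AAU).
 */
static struct platform_device example_aau_channel = {
	.name	= "IOP-ADMA",
	.id	= 2,
	.dev	= {
		.platform_data = &example_aau_data,
	},
};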

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 arch/arm/mach-iop32x/iq80321.c         |  141 +++++
 arch/arm/mach-iop33x/iq80331.c         |    9 
 arch/arm/mach-iop33x/iq80332.c         |    8 
 arch/arm/mach-iop33x/setup.c           |  132 +++++
 include/asm-arm/arch-iop32x/adma.h     |    5 
 include/asm-arm/arch-iop33x/adma.h     |    5 
 include/asm-arm/hardware/iop3xx-adma.h |  901 ++++++++++++++++++++++++++++++++
 7 files changed, 1201 insertions(+), 0 deletions(-)

diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c
index cdd2265..79d6514 100644
--- a/arch/arm/mach-iop32x/iq80321.c
+++ b/arch/arm/mach-iop32x/iq80321.c
@@ -33,6 +33,9 @@ #include <asm/mach/time.h>
 #include <asm/mach-types.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
+#ifdef CONFIG_DMA_ENGINE
+#include <asm/hardware/iop_adma.h>
+#endif
 
 /*
  * IQ80321 timer tick configuration.
@@ -170,12 +173,150 @@ static struct platform_device iq80321_se
 	.resource	= &iq80321_uart_resource,
 };
 
+#ifdef CONFIG_DMA_ENGINE
+/* AAU and DMA Channels */
+static struct resource iop3xx_dma_0_resources[] = {
+	[0] = {
+		.start = (unsigned long) IOP3XX_DMA_CCR(0),
+		.end = ((unsigned long) IOP3XX_DMA_DCR(0)) + 4,
+		.flags = IORESOURCE_MEM,
+	},
+	[1] = {
+		.start = IRQ_IOP32X_DMA0_EOT,
+		.end = IRQ_IOP32X_DMA0_EOT,
+		.flags = IORESOURCE_IRQ
+	},
+	[2] = {
+		.start = IRQ_IOP32X_DMA0_EOC,
+		.end = IRQ_IOP32X_DMA0_EOC,
+		.flags = IORESOURCE_IRQ
+	},
+	[3] = {
+		.start = IRQ_IOP32X_DMA0_ERR,
+		.end = IRQ_IOP32X_DMA0_ERR,
+		.flags = IORESOURCE_IRQ
+	}
+};
+
+static struct resource iop3xx_dma_1_resources[] = {
+	[0] = {
+		.start = (unsigned long) IOP3XX_DMA_CCR(1),
+		.end = ((unsigned long) IOP3XX_DMA_DCR(1)) + 4,
+		.flags = IORESOURCE_MEM,
+	},
+	[1] = {
+		.start = IRQ_IOP32X_DMA1_EOT,
+		.end = IRQ_IOP32X_DMA1_EOT,
+		.flags = IORESOURCE_IRQ
+	},
+	[2] = {
+		.start = IRQ_IOP32X_DMA1_EOC,
+		.end = IRQ_IOP32X_DMA1_EOC,
+		.flags = IORESOURCE_IRQ
+	},
+	[3] = {
+		.start = IRQ_IOP32X_DMA1_ERR,
+		.end = IRQ_IOP32X_DMA1_ERR,
+		.flags = IORESOURCE_IRQ
+	}
+};
+
+
+static struct resource iop3xx_aau_resources[] = {
+	[0] = {
+		.start = (unsigned long) IOP3XX_AAU_ACR,
+		.end = (unsigned long) IOP3XX_AAU_SAR_EDCR(32),
+		.flags = IORESOURCE_MEM,
+	},
+	[1] = {
+		.start = IRQ_IOP32X_AA_EOT,
+		.end = IRQ_IOP32X_AA_EOT,
+		.flags = IORESOURCE_IRQ
+	},
+	[2] = {
+		.start = IRQ_IOP32X_AA_EOC,
+		.end = IRQ_IOP32X_AA_EOC,
+		.flags = IORESOURCE_IRQ
+	},
+	[3] = {
+		.start = IRQ_IOP32X_AA_ERR,
+		.end = IRQ_IOP32X_AA_ERR,
+		.flags = IORESOURCE_IRQ
+	}
+};
+
+static u64 iop3xx_adma_dmamask = DMA_32BIT_MASK;
+
+static struct iop_adma_platform_data iop3xx_dma_0_data = {
+	.hw_id = IOP3XX_DMA0_ID,
+	.capabilities =	DMA_MEMCPY | DMA_MEMCPY_CRC32C,
+	.pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop3xx_dma_1_data = {
+	.hw_id = IOP3XX_DMA1_ID,
+	.capabilities =	DMA_MEMCPY | DMA_MEMCPY_CRC32C,
+	.pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop3xx_aau_data = {
+	.hw_id = IOP3XX_AAU_ID,
+	.capabilities =	DMA_XOR | DMA_ZERO_SUM | DMA_MEMSET,
+	.pool_size = 3 * PAGE_SIZE,
+};
+
+struct platform_device iop3xx_dma_0_channel = {
+	.name = "IOP-ADMA",
+	.id = 0,
+	.num_resources = 4,
+	.resource = iop3xx_dma_0_resources,
+	.dev = {
+		.dma_mask = &iop3xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop3xx_dma_0_data,
+	},
+};
+
+struct platform_device iop3xx_dma_1_channel = {
+	.name = "IOP-ADMA",
+	.id = 1,
+	.num_resources = 4,
+	.resource = iop3xx_dma_1_resources,
+	.dev = {
+		.dma_mask = &iop3xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop3xx_dma_1_data,
+	},
+};
+
+struct platform_device iop3xx_aau_channel = {
+	.name = "IOP-ADMA",
+	.id = 2,
+	.num_resources = 4,
+	.resource = iop3xx_aau_resources,
+	.dev = {
+		.dma_mask = &iop3xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop3xx_aau_data,
+	},
+};
+#endif /* CONFIG_DMA_ENGINE */
+
+extern struct platform_device iop3xx_dma_0_channel;
+extern struct platform_device iop3xx_dma_1_channel;
+extern struct platform_device iop3xx_aau_channel;
 static void __init iq80321_init_machine(void)
 {
 	platform_device_register(&iop3xx_i2c0_device);
 	platform_device_register(&iop3xx_i2c1_device);
 	platform_device_register(&iq80321_flash_device);
 	platform_device_register(&iq80321_serial_device);
+#ifdef CONFIG_DMA_ENGINE
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
+	platform_device_register(&iop3xx_aau_channel);
+#endif
+
 }
 
 MACHINE_START(IQ80321, "Intel IQ80321")
diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c
index 3807000..34bedc6 100644
--- a/arch/arm/mach-iop33x/iq80331.c
+++ b/arch/arm/mach-iop33x/iq80331.c
@@ -122,6 +122,10 @@ static struct platform_device iq80331_fl
 	.resource	= &iq80331_flash_resource,
 };
 
+
+extern struct platform_device iop3xx_dma_0_channel;
+extern struct platform_device iop3xx_dma_1_channel;
+extern struct platform_device iop3xx_aau_channel;
 static void __init iq80331_init_machine(void)
 {
 	platform_device_register(&iop3xx_i2c0_device);
@@ -129,6 +133,11 @@ static void __init iq80331_init_machine(
 	platform_device_register(&iop33x_uart0_device);
 	platform_device_register(&iop33x_uart1_device);
 	platform_device_register(&iq80331_flash_device);
+#ifdef CONFIG_DMA_ENGINE
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
+	platform_device_register(&iop3xx_aau_channel);
+#endif
 }
 
 MACHINE_START(IQ80331, "Intel IQ80331")
diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c
index 8780d55..ed36016 100644
--- a/arch/arm/mach-iop33x/iq80332.c
+++ b/arch/arm/mach-iop33x/iq80332.c
@@ -129,6 +129,9 @@ static struct platform_device iq80332_fl
 	.resource	= &iq80332_flash_resource,
 };
 
+extern struct platform_device iop3xx_dma_0_channel;
+extern struct platform_device iop3xx_dma_1_channel;
+extern struct platform_device iop3xx_aau_channel;
 static void __init iq80332_init_machine(void)
 {
 	platform_device_register(&iop3xx_i2c0_device);
@@ -136,6 +139,11 @@ static void __init iq80332_init_machine(
 	platform_device_register(&iop33x_uart0_device);
 	platform_device_register(&iop33x_uart1_device);
 	platform_device_register(&iq80332_flash_device);
+#ifdef CONFIG_DMA_ENGINE
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
+	platform_device_register(&iop3xx_aau_channel);
+#endif
 }
 
 MACHINE_START(IQ80332, "Intel IQ80332")
diff --git a/arch/arm/mach-iop33x/setup.c b/arch/arm/mach-iop33x/setup.c
index e72face..fbdb998 100644
--- a/arch/arm/mach-iop33x/setup.c
+++ b/arch/arm/mach-iop33x/setup.c
@@ -28,6 +28,9 @@ #include <asm/hardware.h>
 #include <asm/hardware/iop3xx.h>
 #include <asm/mach-types.h>
 #include <asm/mach/arch.h>
+#include <linux/dmaengine.h>
+#include <linux/dma-mapping.h>
+#include <asm/hardware/iop_adma.h>
 
 #define IOP33X_UART_XTAL 33334000
 
@@ -103,3 +106,132 @@ struct platform_device iop33x_uart1_devi
 	.num_resources	= 2,
 	.resource	= iop33x_uart1_resources,
 };
+
+#ifdef CONFIG_DMA_ENGINE
+/* AAU and DMA Channels */
+static struct resource iop3xx_dma_0_resources[] = {
+	[0] = {
+		.start = (unsigned long) IOP3XX_DMA_CCR(0),
+		.end = ((unsigned long) IOP3XX_DMA_DCR(0)) + 4,
+		.flags = IORESOURCE_MEM,
+	},
+	[1] = {
+		.start = IRQ_IOP33X_DMA0_EOT,
+		.end = IRQ_IOP33X_DMA0_EOT,
+		.flags = IORESOURCE_IRQ
+	},
+	[2] = {
+		.start = IRQ_IOP33X_DMA0_EOC,
+		.end = IRQ_IOP33X_DMA0_EOC,
+		.flags = IORESOURCE_IRQ
+	},
+	[3] = {
+		.start = IRQ_IOP33X_DMA0_ERR,
+		.end = IRQ_IOP33X_DMA0_ERR,
+		.flags = IORESOURCE_IRQ
+	}
+};
+
+static struct resource iop3xx_dma_1_resources[] = {
+	[0] = {
+		.start = (unsigned long) IOP3XX_DMA_CCR(1),
+		.end = ((unsigned long) IOP3XX_DMA_DCR(1)) + 4,
+		.flags = IORESOURCE_MEM,
+	},
+	[1] = {
+		.start = IRQ_IOP33X_DMA1_EOT,
+		.end = IRQ_IOP33X_DMA1_EOT,
+		.flags = IORESOURCE_IRQ
+	},
+	[2] = {
+		.start = IRQ_IOP33X_DMA1_EOC,
+		.end = IRQ_IOP33X_DMA1_EOC,
+		.flags = IORESOURCE_IRQ
+	},
+	[3] = {
+		.start = IRQ_IOP33X_DMA1_ERR,
+		.end = IRQ_IOP33X_DMA1_ERR,
+		.flags = IORESOURCE_IRQ
+	}
+};
+
+
+static struct resource iop3xx_aau_resources[] = {
+	[0] = {
+		.start = (unsigned long) IOP3XX_AAU_ACR,
+		.end = (unsigned long) IOP3XX_AAU_SAR_EDCR(32),
+		.flags = IORESOURCE_MEM,
+	},
+	[1] = {
+		.start = IRQ_IOP33X_AA_EOT,
+		.end = IRQ_IOP33X_AA_EOT,
+		.flags = IORESOURCE_IRQ
+	},
+	[2] = {
+		.start = IRQ_IOP33X_AA_EOC,
+		.end = IRQ_IOP33X_AA_EOC,
+		.flags = IORESOURCE_IRQ
+	},
+	[3] = {
+		.start = IRQ_IOP33X_AA_ERR,
+		.end = IRQ_IOP33X_AA_ERR,
+		.flags = IORESOURCE_IRQ
+	}
+};
+
+static u64 iop3xx_adma_dmamask = DMA_32BIT_MASK;
+
+static struct iop_adma_platform_data iop3xx_dma_0_data = {
+	.hw_id = IOP3XX_DMA0_ID,
+	.capabilities =	DMA_MEMCPY | DMA_MEMCPY_CRC32C,
+	.pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop3xx_dma_1_data = {
+	.hw_id = IOP3XX_DMA1_ID,
+	.capabilities =	DMA_MEMCPY | DMA_MEMCPY_CRC32C,
+	.pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop3xx_aau_data = {
+	.hw_id = IOP3XX_AAU_ID,
+	.capabilities =	DMA_XOR | DMA_ZERO_SUM | DMA_MEMSET,
+	.pool_size = 3 * PAGE_SIZE,
+};
+
+struct platform_device iop3xx_dma_0_channel = {
+	.name = "IOP-ADMA",
+	.id = 0,
+	.num_resources = 4,
+	.resource = iop3xx_dma_0_resources,
+	.dev = {
+		.dma_mask = &iop3xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop3xx_dma_0_data,
+	},
+};
+
+struct platform_device iop3xx_dma_1_channel = {
+	.name = "IOP-ADMA",
+	.id = 1,
+	.num_resources = 4,
+	.resource = iop3xx_dma_1_resources,
+	.dev = {
+		.dma_mask = &iop3xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop3xx_dma_1_data,
+	},
+};
+
+struct platform_device iop3xx_aau_channel = {
+	.name = "IOP-ADMA",
+	.id = 2,
+	.num_resources = 4,
+	.resource = iop3xx_aau_resources,
+	.dev = {
+		.dma_mask = &iop3xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop3xx_aau_data,
+	},
+};
+#endif /* CONFIG_DMA_ENGINE */
diff --git a/include/asm-arm/arch-iop32x/adma.h b/include/asm-arm/arch-iop32x/adma.h
new file mode 100644
index 0000000..5ed9203
--- /dev/null
+++ b/include/asm-arm/arch-iop32x/adma.h
@@ -0,0 +1,5 @@
+#ifndef IOP32X_ADMA_H
+#define IOP32X_ADMA_H
+#include <asm/hardware/iop3xx-adma.h>
+#endif
+
diff --git a/include/asm-arm/arch-iop33x/adma.h b/include/asm-arm/arch-iop33x/adma.h
new file mode 100644
index 0000000..4b92f79
--- /dev/null
+++ b/include/asm-arm/arch-iop33x/adma.h
@@ -0,0 +1,5 @@
+#ifndef IOP33X_ADMA_H
+#define IOP33X_ADMA_H
+#include <asm/hardware/iop3xx-adma.h>
+#endif
+
diff --git a/include/asm-arm/hardware/iop3xx-adma.h b/include/asm-arm/hardware/iop3xx-adma.h
new file mode 100644
index 0000000..34624b6
--- /dev/null
+++ b/include/asm-arm/hardware/iop3xx-adma.h
@@ -0,0 +1,901 @@
+/*
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#ifndef _IOP3XX_ADMA_H
+#define _IOP3XX_ADMA_H
+#include <linux/types.h>
+#include <asm/hardware.h>
+#include <asm/hardware/iop_adma.h>
+
+struct iop3xx_aau_desc_ctrl {
+	unsigned int int_en:1;
+	unsigned int blk1_cmd_ctrl:3;
+	unsigned int blk2_cmd_ctrl:3;
+	unsigned int blk3_cmd_ctrl:3;
+	unsigned int blk4_cmd_ctrl:3;
+	unsigned int blk5_cmd_ctrl:3;
+	unsigned int blk6_cmd_ctrl:3;
+	unsigned int blk7_cmd_ctrl:3;
+	unsigned int blk8_cmd_ctrl:3;
+	unsigned int blk_ctrl:2;
+	unsigned int dual_xor_en:1;
+	unsigned int tx_complete:1;
+	unsigned int zero_result_err:1;
+	unsigned int zero_result_en:1;
+	unsigned int dest_write_en:1;
+};
+
+struct iop3xx_aau_e_desc_ctrl {
+	unsigned int reserved:1;
+	unsigned int blk1_cmd_ctrl:3;
+	unsigned int blk2_cmd_ctrl:3;
+	unsigned int blk3_cmd_ctrl:3;
+	unsigned int blk4_cmd_ctrl:3;
+	unsigned int blk5_cmd_ctrl:3;
+	unsigned int blk6_cmd_ctrl:3;
+	unsigned int blk7_cmd_ctrl:3;
+	unsigned int blk8_cmd_ctrl:3;
+	unsigned int reserved2:7;
+};
+
+struct iop3xx_dma_desc_ctrl {
+	unsigned int pci_transaction:4;
+	unsigned int int_en:1;
+	unsigned int dac_cycle_en:1;
+	unsigned int mem_to_mem_en:1;
+	unsigned int crc_data_tx_en:1;
+	unsigned int crc_gen_en:1;
+	unsigned int crc_seed_dis:1;
+	unsigned int reserved:21;
+	unsigned int crc_tx_complete:1;
+};
+
+struct iop3xx_desc_dma {
+	u32 next_desc;
+	union {
+		u32 pci_src_addr;
+		u32 pci_dest_addr;
+		u32 src_addr;
+	};
+	union {
+		u32 upper_pci_src_addr;
+		u32 upper_pci_dest_addr;
+	};
+	union {
+		u32 local_pci_src_addr;
+		u32 local_pci_dest_addr;
+		u32 dest_addr;
+	};
+	u32 byte_count;
+	union {
+		u32 desc_ctrl;
+		struct iop3xx_dma_desc_ctrl desc_ctrl_field;
+	};
+	u32 crc_addr;
+};
+
+struct iop3xx_desc_aau {
+	u32 next_desc;
+	u32 src[4];
+	u32 dest_addr;
+	u32 byte_count;
+	union {
+		u32 desc_ctrl;
+		struct iop3xx_aau_desc_ctrl desc_ctrl_field;
+	};
+	union {
+		u32 src_addr;
+		u32 e_desc_ctrl;
+		struct iop3xx_aau_e_desc_ctrl e_desc_ctrl_field;
+	} src_edc[31];
+};
+
+
+struct iop3xx_aau_gfmr {
+	unsigned int gfmr1:8;
+	unsigned int gfmr2:8;
+	unsigned int gfmr3:8;
+	unsigned int gfmr4:8;
+};
+
+struct iop3xx_desc_pq_xor {
+	u32 next_desc;
+	u32 src[3];
+	union {
+		u32 data_mult1;
+		struct iop3xx_aau_gfmr data_mult1_field;
+	};
+	u32 dest_addr;
+	u32 byte_count;
+	union {
+		u32 desc_ctrl;
+		struct iop3xx_aau_desc_ctrl desc_ctrl_field;
+	};
+	union {
+		u32 src_addr;
+		u32 e_desc_ctrl;
+		struct iop3xx_aau_e_desc_ctrl e_desc_ctrl_field;
+		u32 data_multiplier;
+		struct iop3xx_aau_gfmr data_mult_field;
+		u32 reserved;
+	} src_edc_gfmr[19];
+};
+
+struct iop3xx_desc_dual_xor {
+	u32 next_desc;
+	u32 src0_addr;
+	u32 src1_addr;
+	u32 h_src_addr;
+	u32 d_src_addr;
+	u32 h_dest_addr;
+	u32 byte_count;
+	union {
+		u32 desc_ctrl;
+		struct iop3xx_aau_desc_ctrl desc_ctrl_field;
+	};
+	u32 d_dest_addr;
+};
+
+union iop3xx_desc {
+	struct iop3xx_desc_aau *aau;
+	struct iop3xx_desc_dma *dma;
+	struct iop3xx_desc_pq_xor *pq_xor;
+	struct iop3xx_desc_dual_xor *dual_xor;
+	void *ptr;
+};
+
+static inline u32 iop_chan_get_current_descriptor(struct iop_adma_chan *chan)
+{
+	int id = chan->device->id;
+
+	switch (id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		return *IOP3XX_DMA_DAR(id);
+	case IOP3XX_AAU_ID:
+		return *IOP3XX_AAU_ADAR;
+	default:
+		BUG();
+	}
+	return 0;
+}
+
+static inline void iop_chan_set_next_descriptor(struct iop_adma_chan *chan,
+						u32 next_desc_addr)
+{
+	int id = chan->device->id;
+
+	switch (id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		*IOP3XX_DMA_NDAR(id) = next_desc_addr;
+		break;
+	case IOP3XX_AAU_ID:
+		*IOP3XX_AAU_ANDAR = next_desc_addr;
+		break;
+	}
+
+}
+
+#define IOP3XX_ADMA_STATUS_BUSY (1 << 10)
+#define IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT (1024)
+#define IOP_ADMA_XOR_MAX_BYTE_COUNT (16 * 1024 * 1024)
+
+static inline int iop_chan_is_busy(struct iop_adma_chan *chan)
+{
+	int id = chan->device->id;
+	int busy;
+
+	switch (id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		busy = (*IOP3XX_DMA_CSR(id) & IOP3XX_ADMA_STATUS_BUSY) ? 1 : 0;
+		break;
+	case IOP3XX_AAU_ID:
+		busy = (*IOP3XX_AAU_ASR & IOP3XX_ADMA_STATUS_BUSY) ? 1 : 0;
+		break;
+	default:
+		busy = 0;
+		BUG();
+	}
+
+	return busy;
+}
+
+static inline int iop_desc_is_aligned(struct iop_adma_desc_slot *desc,
+					int num_slots)
+{
+	/* num_slots will only ever be 1, 2, 4, or 8 */
+	return (desc->idx & (num_slots - 1)) ? 0 : 1;
+}
+
+/* to do: support large (i.e. > hw max) buffer sizes */
+static inline int iop_chan_memcpy_slot_count(size_t len, int *slots_per_op)
+{
+	*slots_per_op = 1;
+	return 1;
+}
+
+/* to do: support large (i.e. > hw max) buffer sizes */
+static inline int iop_chan_memset_slot_count(size_t len, int *slots_per_op)
+{
+	*slots_per_op = 1;
+	return 1;
+}
+
+static inline int iop3xx_aau_xor_slot_count(size_t len, int src_cnt,
+					int *slots_per_op)
+{
+	static const int slot_count_table[] = { 0,
+					        1, 1, 1, 1, /* 01 - 04 */
+					        2, 2, 2, 2, /* 05 - 08 */
+					        4, 4, 4, 4, /* 09 - 12 */
+					        4, 4, 4, 4, /* 13 - 16 */
+					        8, 8, 8, 8, /* 17 - 20 */
+					        8, 8, 8, 8, /* 21 - 24 */
+					        8, 8, 8, 8, /* 25 - 28 */
+					        8, 8, 8, 8, /* 29 - 32 */
+					      };
+	*slots_per_op = slot_count_table[src_cnt];
+	return *slots_per_op;
+}
+
+static inline int iop_chan_xor_slot_count(size_t len, int src_cnt,
+						int *slots_per_op)
+{
+	int slot_cnt = iop3xx_aau_xor_slot_count(len, src_cnt, slots_per_op);
+
+	if (len <= IOP_ADMA_XOR_MAX_BYTE_COUNT)
+		return slot_cnt;
+
+	len -= IOP_ADMA_XOR_MAX_BYTE_COUNT;
+	while (len > IOP_ADMA_XOR_MAX_BYTE_COUNT) {
+		len -= IOP_ADMA_XOR_MAX_BYTE_COUNT;
+		slot_cnt += *slots_per_op;
+	}
+
+	if (len)
+		slot_cnt += *slots_per_op;
+
+	return slot_cnt;
+}
+
+/* zero sum on iop3xx is limited to 1k at a time so it requires multiple
+ * descriptors
+ */
+static inline int iop_chan_zero_sum_slot_count(size_t len, int src_cnt,
+						int *slots_per_op)
+{
+	int slot_cnt = iop3xx_aau_xor_slot_count(len, src_cnt, slots_per_op);
+
+	if (len <= IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT)
+		return slot_cnt;
+
+	len -= IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT;
+	while (len > IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT) {
+		len -= IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT;
+		slot_cnt += *slots_per_op;
+	}
+
+	if (len)
+		slot_cnt += *slots_per_op;
+
+	return slot_cnt;
+}
+
+static inline u32 iop_desc_get_dest_addr(struct iop_adma_desc_slot *desc,
+					struct iop_adma_chan *chan)
+{
+	union iop3xx_desc hw_desc = { .ptr = desc->hw_desc, };
+
+	switch (chan->device->id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		return hw_desc.dma->dest_addr;
+	case IOP3XX_AAU_ID:
+		return hw_desc.aau->dest_addr;
+	default:
+		BUG();
+	}
+	return 0;
+}
+
+static inline u32 iop_desc_get_byte_count(struct iop_adma_desc_slot *desc,
+					struct iop_adma_chan *chan)
+{
+	union iop3xx_desc hw_desc = { .ptr = desc->hw_desc, };
+
+	switch (chan->device->id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		return hw_desc.dma->byte_count;
+	case IOP3XX_AAU_ID:
+		return hw_desc.aau->byte_count;
+	default:
+		BUG();
+	}
+	return 0;
+}
+
+static inline int iop3xx_src_edc_idx(int src_idx)
+{
+	static const int src_edc_idx_table[] = { 0, 0, 0, 0,
+						 0, 1, 2, 3,
+						 5, 6, 7, 8,
+						 9, 10, 11, 12,
+						 14, 15, 16, 17,
+						 18, 19, 20, 21,
+						 23, 24, 25, 26,
+						 27, 28, 29, 30,
+					       };
+
+	return src_edc_idx_table[src_idx];
+}
+
+static inline u32 iop_desc_get_src_addr(struct iop_adma_desc_slot *desc,
+					struct iop_adma_chan *chan,
+					int src_idx)
+{
+	union iop3xx_desc hw_desc = { .ptr = desc->hw_desc, };
+
+	switch (chan->device->id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		return hw_desc.dma->src_addr;
+	case IOP3XX_AAU_ID:
+		break;
+	default:
+		BUG();
+	}
+
+	if (src_idx < 4)
+		return hw_desc.aau->src[src_idx];
+	else
+		return hw_desc.aau->src_edc[iop3xx_src_edc_idx(src_idx)].src_addr;
+}
+
+static inline void iop3xx_aau_desc_set_src_addr(struct iop3xx_desc_aau *hw_desc,
+					int src_idx, dma_addr_t addr)
+{
+	if (src_idx < 4)
+		hw_desc->src[src_idx] = addr;
+	else
+		hw_desc->src_edc[iop3xx_src_edc_idx(src_idx)].src_addr = addr;
+}
+
+static inline void iop_desc_init_memcpy(struct iop_adma_desc_slot *desc)
+{
+	struct iop3xx_desc_dma *hw_desc = desc->hw_desc;
+	union {
+		u32 value;
+		struct iop3xx_dma_desc_ctrl field;
+	} u_desc_ctrl;
+
+	desc->src_cnt = 1;
+	u_desc_ctrl.value = 0;
+	u_desc_ctrl.field.mem_to_mem_en = 1;
+	u_desc_ctrl.field.pci_transaction = 0xe; /* memory read block */
+	hw_desc->desc_ctrl = u_desc_ctrl.value;
+	hw_desc->upper_pci_src_addr = 0;
+	hw_desc->crc_addr = 0;
+	hw_desc->next_desc = 0;
+}
+
+static inline void iop_desc_init_memset(struct iop_adma_desc_slot *desc)
+{
+	struct iop3xx_desc_aau *hw_desc = desc->hw_desc;
+	union {
+		u32 value;
+		struct iop3xx_aau_desc_ctrl field;
+	} u_desc_ctrl;
+
+	desc->src_cnt = 1;
+	u_desc_ctrl.value = 0;
+	u_desc_ctrl.field.blk1_cmd_ctrl = 0x2; /* memory block fill */
+	u_desc_ctrl.field.dest_write_en = 1;
+	hw_desc->desc_ctrl = u_desc_ctrl.value;
+	hw_desc->next_desc = 0;
+}
+
+static inline u32 iop3xx_desc_init_xor(struct iop3xx_desc_aau *hw_desc,
+				int src_cnt)
+{
+	int i, shift;
+	u32 edcr;
+	union {
+		u32 value;
+		struct iop3xx_aau_desc_ctrl field;
+	} u_desc_ctrl;
+
+	u_desc_ctrl.value = 0;
+	switch (src_cnt) {
+	case 25 ... 32:
+		u_desc_ctrl.field.blk_ctrl = 0x3; /* use EDCR[2:0] */
+		edcr = 0;
+		shift = 1;
+		for (i = 24; i < src_cnt; i++) {
+			edcr |= (1 << shift);
+			shift += 3;
+		}
+		hw_desc->src_edc[IOP3XX_AAU_EDCR2_IDX].e_desc_ctrl = edcr;
+		src_cnt = 24;
+		/* fall through */
+	case 17 ... 24:
+		if (!u_desc_ctrl.field.blk_ctrl) {
+			hw_desc->src_edc[IOP3XX_AAU_EDCR2_IDX].e_desc_ctrl = 0;
+			u_desc_ctrl.field.blk_ctrl = 0x3; /* use EDCR[2:0] */
+		}
+		edcr = 0;
+		shift = 1;
+		for (i = 16; i < src_cnt; i++) {
+			edcr |= (1 << shift);
+			shift += 3;
+		}
+		hw_desc->src_edc[IOP3XX_AAU_EDCR1_IDX].e_desc_ctrl = edcr;
+		src_cnt = 16;
+		/* fall through */
+	case 9 ... 16:
+		if (!u_desc_ctrl.field.blk_ctrl)
+			u_desc_ctrl.field.blk_ctrl = 0x2; /* use EDCR0 */
+		edcr = 0;
+		shift = 1;
+		for (i = 8; i < src_cnt; i++) {
+			edcr |= (1 << shift);
+			shift += 3;
+		}
+		hw_desc->src_edc[IOP3XX_AAU_EDCR0_IDX].e_desc_ctrl = edcr;
+		src_cnt = 8;
+		/* fall through */
+	case 2 ... 8:
+		shift = 1;
+		for (i = 0; i < src_cnt; i++) {
+			u_desc_ctrl.value |= (1 << shift);
+			shift += 3;
+		}
+
+		if (!u_desc_ctrl.field.blk_ctrl && src_cnt > 4)
+			u_desc_ctrl.field.blk_ctrl = 0x1; /* use mini-desc */
+	}
+
+	u_desc_ctrl.field.dest_write_en = 1;
+	u_desc_ctrl.field.blk1_cmd_ctrl = 0x7; /* direct fill */
+	hw_desc->desc_ctrl = u_desc_ctrl.value;
+	hw_desc->next_desc = 0;
+
+	return u_desc_ctrl.value;
+}
+
+static inline void iop_desc_init_xor(struct iop_adma_desc_slot *desc,
+				int src_cnt)
+{
+	desc->src_cnt = src_cnt;
+	iop3xx_desc_init_xor(desc->hw_desc, src_cnt);
+}
+
+/* return the number of operations */
+static inline int iop_desc_init_zero_sum(struct iop_adma_desc_slot *desc,
+					int src_cnt,
+					int slot_cnt,
+					int slots_per_op)
+{
+	struct iop3xx_desc_aau *hw_desc, *prev_hw_desc, *iter;
+	union {
+		u32 value;
+		struct iop3xx_aau_desc_ctrl field;
+	} u_desc_ctrl;
+	int i = 0, j = 0;
+	hw_desc = desc->hw_desc;
+	desc->src_cnt = src_cnt;
+
+	do {
+		iter = iop_hw_desc_slot_idx(hw_desc, i);
+		u_desc_ctrl.value = iop3xx_desc_init_xor(iter, src_cnt);
+		u_desc_ctrl.field.dest_write_en = 0;
+		u_desc_ctrl.field.zero_result_en = 1;
+		/* for the subsequent descriptors preserve the store queue
+		 * and chain them together
+		 */
+		if (i) {
+			prev_hw_desc = iop_hw_desc_slot_idx(hw_desc, i - slots_per_op);
+			prev_hw_desc->next_desc = (u32) (desc->phys + (i << 5));
+		}
+		iter->desc_ctrl = u_desc_ctrl.value;
+		slot_cnt -= slots_per_op;
+		i += slots_per_op;
+		j++;
+	} while (slot_cnt);
+
+	return j;
+}
+
+static inline void iop_desc_init_null_xor(struct iop_adma_desc_slot *desc,
+				int src_cnt)
+{
+	struct iop3xx_desc_aau *hw_desc = desc->hw_desc;
+	union {
+		u32 value;
+		struct iop3xx_aau_desc_ctrl field;
+	} u_desc_ctrl;
+
+	u_desc_ctrl.value = 0;
+	switch (src_cnt) {
+	case 25 ... 32:
+		u_desc_ctrl.field.blk_ctrl = 0x3; /* use EDCR[2:0] */
+		hw_desc->src_edc[IOP3XX_AAU_EDCR2_IDX].e_desc_ctrl = 0;
+		/* fall through */
+	case 17 ... 24:
+		if (!u_desc_ctrl.field.blk_ctrl) {
+			hw_desc->src_edc[IOP3XX_AAU_EDCR2_IDX].e_desc_ctrl = 0;
+			u_desc_ctrl.field.blk_ctrl = 0x3; /* use EDCR[2:0] */
+		}
+		hw_desc->src_edc[IOP3XX_AAU_EDCR1_IDX].e_desc_ctrl = 0;
+		/* fall through */
+	case 9 ... 16:
+		if (!u_desc_ctrl.field.blk_ctrl)
+			u_desc_ctrl.field.blk_ctrl = 0x2; /* use EDCR0 */
+		hw_desc->src_edc[IOP3XX_AAU_EDCR0_IDX].e_desc_ctrl = 0;
+		/* fall through */
+	case 1 ... 8:
+		if (!u_desc_ctrl.field.blk_ctrl && src_cnt > 4)
+			u_desc_ctrl.field.blk_ctrl = 0x1; /* use mini-desc */
+	}
+
+	desc->src_cnt = src_cnt;
+	u_desc_ctrl.field.dest_write_en = 0;
+	hw_desc->desc_ctrl = u_desc_ctrl.value;
+	hw_desc->next_desc = 0;
+}
+
+static inline void iop_desc_set_byte_count(struct iop_adma_desc_slot *desc,
+					struct iop_adma_chan *chan,
+					u32 byte_count)
+{
+	union iop3xx_desc hw_desc = { .ptr = desc->hw_desc, };
+
+	switch (chan->device->id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		hw_desc.dma->byte_count = byte_count;
+		break;
+	case IOP3XX_AAU_ID:
+		hw_desc.aau->byte_count = byte_count;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline void iop_desc_set_zero_sum_byte_count(struct iop_adma_desc_slot *desc,
+					u32 len,
+					int slots_per_op)
+{
+	struct iop3xx_desc_aau *hw_desc = desc->hw_desc, *iter;
+	int i = 0;
+
+	if (len <= IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT) {
+		hw_desc->byte_count = len;
+	} else {
+		do {
+			iter = iop_hw_desc_slot_idx(hw_desc, i);
+			iter->byte_count = IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT;
+			len -= IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT;
+			i += slots_per_op;
+		} while (len > IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT);
+
+		if (len) {
+			iter = iop_hw_desc_slot_idx(hw_desc, i);
+			iter->byte_count = len;
+		}
+	}
+}
+
+static inline void iop_desc_set_dest_addr(struct iop_adma_desc_slot *desc,
+					struct iop_adma_chan *chan,
+					dma_addr_t addr)
+{
+	union iop3xx_desc hw_desc = { .ptr = desc->hw_desc, };
+
+	switch (chan->device->id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		hw_desc.dma->dest_addr = addr;
+		break;
+	case IOP3XX_AAU_ID:
+		hw_desc.aau->dest_addr = addr;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline void iop_desc_set_memcpy_src_addr(struct iop_adma_desc_slot *desc,
+					dma_addr_t addr, int slot_cnt,
+					int slots_per_op)
+{
+	struct iop3xx_desc_dma *hw_desc = desc->hw_desc;
+	hw_desc->src_addr = addr;
+}
+
+static inline void iop_desc_set_zero_sum_src_addr(struct iop_adma_desc_slot *desc,
+					int src_idx, dma_addr_t addr, int slot_cnt,
+					int slots_per_op)
+{
+
+	struct iop3xx_desc_aau *hw_desc = desc->hw_desc, *iter;
+	int i = 0;
+
+	do {
+		iter = iop_hw_desc_slot_idx(hw_desc, i);
+		iop3xx_aau_desc_set_src_addr(iter, src_idx, addr);
+		slot_cnt -= slots_per_op;
+		i += slots_per_op;
+		addr += IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT;
+	} while (slot_cnt);
+}
+
+static inline void iop_desc_set_xor_src_addr(struct iop_adma_desc_slot *desc,
+					int src_idx, dma_addr_t addr, int slot_cnt,
+					int slots_per_op)
+{
+
+	struct iop3xx_desc_aau *hw_desc = desc->hw_desc, *iter;
+	int i = 0;
+
+	do {
+		iter = iop_hw_desc_slot_idx(hw_desc, i);
+		iop3xx_aau_desc_set_src_addr(iter, src_idx, addr);
+		slot_cnt -= slots_per_op;
+		i += slots_per_op;
+		addr += IOP_ADMA_XOR_MAX_BYTE_COUNT;
+	} while (slot_cnt);
+}
+
+static inline void iop_desc_set_next_desc(struct iop_adma_desc_slot *desc,
+					struct iop_adma_chan *chan,
+					u32 next_desc_addr)
+{
+	union iop3xx_desc hw_desc = { .ptr = desc->hw_desc, };
+
+	switch (chan->device->id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		BUG_ON(hw_desc.dma->next_desc);
+		hw_desc.dma->next_desc = next_desc_addr;
+		break;
+	case IOP3XX_AAU_ID:
+		BUG_ON(hw_desc.aau->next_desc);
+		hw_desc.aau->next_desc = next_desc_addr;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline u32 iop_desc_get_next_desc(struct iop_adma_desc_slot *desc,
+					struct iop_adma_chan *chan)
+{
+	union iop3xx_desc hw_desc = { .ptr = desc->hw_desc, };
+
+	switch (chan->device->id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		return hw_desc.dma->next_desc;
+	case IOP3XX_AAU_ID:
+		return hw_desc.aau->next_desc;
+	default:
+		BUG();
+	}
+
+	return 0;
+}
+
+static inline void iop_desc_set_block_fill_val(struct iop_adma_desc_slot *desc,
+						u32 val)
+{
+	struct iop3xx_desc_aau *hw_desc = desc->hw_desc;
+	hw_desc->src[0] = val;
+}
+
+#ifndef CONFIG_ARCH_IOP32X
+static inline int iop_desc_get_zero_result(struct iop_adma_desc_slot *desc)
+{
+	struct iop3xx_desc_aau *hw_desc = desc->hw_desc;
+	struct iop3xx_aau_desc_ctrl desc_ctrl = hw_desc->desc_ctrl_field;
+
+	BUG_ON(!(desc_ctrl.tx_complete && desc_ctrl.zero_result_en));
+	return desc_ctrl.zero_result_err;
+}
+#else
+extern char iop32x_zero_result_buffer[PAGE_SIZE];
+static inline int iop_desc_get_zero_result(struct iop_adma_desc_slot *desc)
+{
+	int i;
+
+	consistent_sync(iop32x_zero_result_buffer,
+			sizeof(iop32x_zero_result_buffer),
+			DMA_FROM_DEVICE);
+
+	for (i = 0; i < sizeof(iop32x_zero_result_buffer)/sizeof(u32); i++)
+		if (((u32 *) iop32x_zero_result_buffer)[i])
+			return 1;
+		else if ((i & 0x7) == 0) /* prefetch the next cache line */
+			prefetch(((u32 *) iop32x_zero_result_buffer) + i + 8);
+
+	return 0;
+}
+#endif
+
+static inline void iop_chan_append(struct iop_adma_chan *chan)
+{
+	int id = chan->device->id;
+	/* drain write buffer so ADMA can see updated descriptor */
+	asm volatile ("mcr p15, 0, r1, c7, c10, 4" : : : "%r1");
+
+	switch (id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		*IOP3XX_DMA_CCR(id) |= 0x2;
+		break;
+	case IOP3XX_AAU_ID:
+		*IOP3XX_AAU_ACR |= 0x2;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline void iop_chan_clear_status(struct iop_adma_chan *chan)
+{
+	int id = chan->device->id;
+	u32 status;
+
+	switch (id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		status = *IOP3XX_DMA_CSR(id);
+		*IOP3XX_DMA_CSR(id) = status;
+		break;
+	case IOP3XX_AAU_ID:
+		status = *IOP3XX_AAU_ASR;
+		*IOP3XX_AAU_ASR = status;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline u32 iop_chan_get_status(struct iop_adma_chan *chan)
+{
+	int id = chan->device->id;
+
+	switch (id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		return *IOP3XX_DMA_CSR(id);
+	case IOP3XX_AAU_ID:
+		return *IOP3XX_AAU_ASR;
+	default:
+		BUG();
+	}
+}
+
+static inline void iop_chan_disable(struct iop_adma_chan *chan)
+{
+	int id = chan->device->id;
+
+	switch (id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		*IOP3XX_DMA_CCR(id) &= ~0x1;
+		break;
+	case IOP3XX_AAU_ID:
+		*IOP3XX_AAU_ACR &= ~0x1;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline void iop_chan_enable(struct iop_adma_chan *chan)
+{
+	int id = chan->device->id;
+
+	/* drain write buffer */
+	asm volatile ("mcr p15, 0, r1, c7, c10, 4" : : : "%r1");
+
+	switch (id) {
+	case IOP3XX_DMA0_ID:
+	case IOP3XX_DMA1_ID:
+		*IOP3XX_DMA_CCR(id) |= 0x1;
+		break;
+	case IOP3XX_AAU_ID:
+		*IOP3XX_AAU_ACR |= 0x1;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline void iop_raid5_dma_chan_request(struct dma_client *client)
+{
+	dma_async_client_chan_request(client, 2, DMA_MEMCPY);
+	dma_async_client_chan_request(client, 1, DMA_XOR | DMA_ZERO_SUM);
+}
+
+static inline struct dma_chan *iop_raid5_dma_next_channel(struct dma_client *client)
+{
+	static struct dma_chan_client_ref *chan_ref = NULL;
+	static int req_idx = -1;
+	static struct dma_req *req[2];
+
+	if (unlikely(req_idx < 0)) {
+		req[0] = &client->req[0];
+		req[1] = &client->req[1];
+	}
+
+	if (++req_idx > 1)
+		req_idx = 0;
+
+	spin_lock(&client->lock);
+	if (unlikely(list_empty(&req[req_idx]->channels)))
+		chan_ref = NULL;
+	else if (!chan_ref || chan_ref->req_node.next == &req[req_idx]->channels)
+		chan_ref = list_entry(req[req_idx]->channels.next, typeof(*chan_ref),
+					req_node);
+	else
+		chan_ref = list_entry(chan_ref->req_node.next,
+					typeof(*chan_ref), req_node);
+	spin_unlock(&client->lock);
+
+	return chan_ref ? chan_ref->chan : NULL;
+}
+
+static inline struct dma_chan *iop_raid5_dma_check_channel(struct dma_chan *chan,
+						dma_cookie_t cookie,
+						struct dma_client *client,
+						unsigned long capabilities)
+{
+	struct dma_chan_client_ref *chan_ref;
+
+	if ((chan->device->capabilities & capabilities) == capabilities)
+		return chan;
+	else if (dma_async_operation_complete(chan,
+					      cookie,
+					      NULL,
+					      NULL) == DMA_SUCCESS) {
+		/* dma channels on req[0] */
+		if (capabilities & (DMA_MEMCPY | DMA_MEMCPY_CRC32C))
+			chan_ref = list_entry(client->req[0].channels.next,
+						typeof(*chan_ref),
+						req_node);
+		/* aau channel on req[1] */
+		else
+			chan_ref = list_entry(client->req[1].channels.next,
+						typeof(*chan_ref),
+						req_node);
+		/* switch to the new channel */
+		dma_chan_put(chan);
+		dma_chan_get(chan_ref->chan);
+
+		return chan_ref->chan;
+	} else
+		return NULL;
+}
+#endif /* _IOP3XX_ADMA_H */

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/19] raid5: raid5_do_soft_block_ops
  2006-09-11 23:17 ` [PATCH 01/19] raid5: raid5_do_soft_block_ops Dan Williams
@ 2006-09-11 23:34   ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:34 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> raid5_do_soft_block_ops consolidates all the stripe cache maintenance
> operations into a single routine.  The stripe operations are:
> * copying data between the stripe cache and user application buffers
> * computing blocks to save a disk access, or to recover a missing block
> * updating the parity on a write operation (reconstruct write and
> read-modify-write)
> * checking parity correctness
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  drivers/md/raid5.c         |  289 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/raid/raid5.h |  129 +++++++++++++++++++-
>  2 files changed, 415 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 4500660..8fde62b 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1362,6 +1362,295 @@ static int stripe_to_pdidx(sector_t stri
>  	return pd_idx;
>  }
>  
> +/*
> + * raid5_do_soft_block_ops - perform block memory operations on stripe data
> + * outside the spin lock.
> + */
> +static void raid5_do_soft_block_ops(void *stripe_head_ref)

This function absolutely must be broken up into multiple functions, 
presumably one per operation.
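
A rough sketch of the kind of split being suggested, with one helper per
stripe operation and the original entry point reduced to a dispatcher
(the helper names and the ops.pending flags are hypothetical, not taken
from the patch):

	static void raid5_run_biofill(struct stripe_head *sh)
	{
		/* copy data between the stripe cache and bio buffers */
	}

	static void raid5_run_compute_block(struct stripe_head *sh)
	{
		/* rebuild a missing or stale block from the other blocks */
	}

	static void raid5_run_parity_update(struct stripe_head *sh)
	{
		/* read-modify-write or reconstruct-write parity update */
	}

	static void raid5_do_soft_block_ops(void *stripe_head_ref)
	{
		struct stripe_head *sh = stripe_head_ref;

		if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
			raid5_run_biofill(sh);
		if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending))
			raid5_run_compute_block(sh);
		if (test_bit(STRIPE_OP_PARITY, &sh->ops.pending))
			raid5_run_parity_update(sh);
	}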

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/19] raid5: move write operations to a workqueue
  2006-09-11 23:17 ` [PATCH 02/19] raid5: move write operations to a workqueue Dan Williams
@ 2006-09-11 23:36   ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:36 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Enable handle_stripe5 to pass off write operations to
> raid5_do_soft_block_ops (which can be run as a workqueue).  The operations
> moved are reconstruct-writes and read-modify-writes formerly handled by
> compute_parity5.
> 
> Changelog:
> * moved raid5_do_soft_block_ops changes into a separate patch
> * changed handle_write_operations5 to only initiate write operations, which
> prevents new writes from being requested while the current one is in flight
> * all blocks undergoing a write are now marked locked and !uptodate at the
> beginning of the write operation
> * blocks undergoing a read-modify-write need a request flag to distinguish
> them from blocks that are locked for reading. Reconstruct-writes still use
> the R5_LOCKED bit to select blocks for the operation
> * integrated the work queue Kconfig option
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  drivers/md/Kconfig         |   21 +++++
>  drivers/md/raid5.c         |  192 ++++++++++++++++++++++++++++++++++++++------
>  include/linux/raid/raid5.h |    3 +
>  3 files changed, 190 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index bf869ed..2a16b3b 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -162,6 +162,27 @@ config MD_RAID5_RESHAPE
>  	  There should be enough spares already present to make the new
>  	  array workable.
>  
> +config MD_RAID456_WORKQUEUE
> +	depends on MD_RAID456
> +	bool "Offload raid work to a workqueue from raid5d"
> +	---help---
> +	  This option enables raid work (block copy and xor operations)
> +	  to run in a workqueue.  If your platform has a high context
> +	  switch penalty, say N.  If you are using hardware offload or
> +	  are running on an SMP platform, say Y.
> +
> +	  If unsure, say Y.
> +
> +config MD_RAID456_WORKQUEUE_MULTITHREAD
> +	depends on MD_RAID456_WORKQUEUE && SMP
> +	bool "Enable multi-threaded raid processing"
> +	default y
> +	---help---
> +	  This option controls whether the raid workqueue will be multi-
> +	  threaded or single threaded.
> +
> +	  If unsure, say Y.

In the final patch that gets merged, these configuration options should 
go away.  We are very anti-#ifdef in Linux, for a variety of reasons. 
In this particular instance, code complexity increases and 
maintainability decreases as the #ifdef forest grows.
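
One way to drop the #ifdefs would be to always build the workqueue path
and choose it at run time, e.g. via a module parameter (a sketch only;
the parameter name, conf->block_ops_queue and sh->ops.work are
hypothetical):

	static int raid5_offload_ops = 1;
	module_param(raid5_offload_ops, int, 0644);
	MODULE_PARM_DESC(raid5_offload_ops,
			 "run stripe block ops from a workqueue");

	static void raid5_queue_block_ops(raid5_conf_t *conf,
					  struct stripe_head *sh)
	{
		if (raid5_offload_ops)
			queue_work(conf->block_ops_queue, &sh->ops.work);
		else
			raid5_do_soft_block_ops(sh);	/* run inline */
	}

The runtime check is a predictable branch, so the cost of keeping both
paths compiled in should be negligible.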

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (18 preceding siblings ...)
  2006-09-11 23:19 ` [PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver Dan Williams
@ 2006-09-11 23:38 ` Jeff Garzik
  2006-09-11 23:53   ` Dan Williams
  2006-09-13  7:15 ` Jakob Oestergaard
  2006-10-08 22:18 ` Neil Brown
  21 siblings, 1 reply; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:38 UTC (permalink / raw)
  To: Dan Williams; +Cc: NeilBrown, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> Neil,
> 
> The following patches implement hardware accelerated raid5 for the Intel
> Xscale® series of I/O Processors.  The MD changes allow stripe
> operations to run outside the spin lock in a work queue.  Hardware
> acceleration is achieved by using a dma-engine-aware work queue routine
> instead of the default software only routine.
> 
> Since the last release of the raid5 changes many bug fixes and other
> improvements have been made as a result of stress testing.  See the per
> patch change logs for more information about what was fixed.  This
> release is the first release of the full dma implementation.
> 
> The patches touch 3 areas, the md-raid5 driver, the generic dmaengine
> interface, and a platform device driver for IOPs.  The raid5 changes
> follow your comments concerning making the acceleration implementation
> similar to how the stripe cache handles I/O requests.  The dmaengine
> changes are the second release of this code.  They expand the interface
> to handle more than memcpy operations, and add a generic raid5-dma
> client.  The iop-adma driver supports dma memcpy, xor, xor zero sum, and
> memset across all IOP architectures (32x, 33x, and 13xx).
> 
> Concerning the context switching performance concerns raised at the
> previous release, I have observed the following.  For the hardware
> accelerated case it appears that performance is always better with the
> work queue than without since it allows multiple stripes to be operated
> on simultaneously.  I expect the same for an SMP platform, but so far my
> testing has been limited to IOPs.  For a single-processor
> non-accelerated configuration I have not observed performance
> degradation with work queue support enabled, but in the Kconfig option
> help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).
> 
> Please consider the patches for -mm.
> 
> -Dan
> 
> [PATCH 01/19] raid5: raid5_do_soft_block_ops
> [PATCH 02/19] raid5: move write operations to a workqueue
> [PATCH 03/19] raid5: move check parity operations to a workqueue
> [PATCH 04/19] raid5: move compute block operations to a workqueue
> [PATCH 05/19] raid5: move read completion copies to a workqueue
> [PATCH 06/19] raid5: move the reconstruct write expansion operation to a workqueue
> [PATCH 07/19] raid5: remove compute_block and compute_parity5
> [PATCH 08/19] dmaengine: enable multiple clients and operations
> [PATCH 09/19] dmaengine: reduce backend address permutations
> [PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients
> [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation
> [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy
> [PATCH 13/19] dmaengine: add support for dma xor zero sum operations
> [PATCH 14/19] dmaengine: add dma_sync_wait
> [PATCH 15/19] dmaengine: raid5 dma client
> [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines
> [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs
> [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization
> [PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver

Can devices like drivers/scsi/sata_sx4.c or drivers/scsi/sata_promise.c 
take advantage of this?  Promise silicon supports RAID5 XOR offload.

If so, how?  If not, why not?  :)

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/19] dmaengine: enable multiple clients and operations
  2006-09-11 23:18 ` [PATCH 08/19] dmaengine: enable multiple clients and operations Dan Williams
@ 2006-09-11 23:44   ` Jeff Garzik
  2006-09-12  0:14     ` Dan Williams
  2006-09-15 16:38     ` Olof Johansson
  0 siblings, 2 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:44 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> @@ -759,8 +755,10 @@ #endif
>  	device->common.device_memcpy_buf_to_buf = ioat_dma_memcpy_buf_to_buf;
>  	device->common.device_memcpy_buf_to_pg = ioat_dma_memcpy_buf_to_pg;
>  	device->common.device_memcpy_pg_to_pg = ioat_dma_memcpy_pg_to_pg;
> -	device->common.device_memcpy_complete = ioat_dma_is_complete;
> -	device->common.device_memcpy_issue_pending = ioat_dma_memcpy_issue_pending;
> +	device->common.device_operation_complete = ioat_dma_is_complete;
> +	device->common.device_xor_pgs_to_pg = dma_async_xor_pgs_to_pg_err;
> +	device->common.device_issue_pending = ioat_dma_memcpy_issue_pending;
> +	device->common.capabilities = DMA_MEMCPY;


Are we really going to add a set of hooks for each DMA engine whizbang 
feature?

That will get ugly when DMA engines support memcpy, xor, crc32, sha1, 
aes, and a dozen other transforms.


> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index c94d8f1..3599472 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -20,7 +20,7 @@
>   */
>  #ifndef DMAENGINE_H
>  #define DMAENGINE_H
> -
> +#include <linux/config.h>
>  #ifdef CONFIG_DMA_ENGINE
>  
>  #include <linux/device.h>
> @@ -65,6 +65,27 @@ enum dma_status {
>  };
>  
>  /**
> + * enum dma_capabilities - DMA operational capabilities
> + * @DMA_MEMCPY: src to dest copy
> + * @DMA_XOR: src*n to dest xor
> + * @DMA_DUAL_XOR: src*n to dest_diag and dest_horiz xor
> + * @DMA_PQ_XOR: src*n to dest_q and dest_p gf/xor
> + * @DMA_MEMCPY_CRC32C: src to dest copy and crc-32c sum
> + * @DMA_SHARE: multiple clients can use this channel
> + */
> +enum dma_capabilities {
> +	DMA_MEMCPY		= 0x1,
> +	DMA_XOR			= 0x2,
> +	DMA_PQ_XOR		= 0x4,
> +	DMA_DUAL_XOR		= 0x8,
> +	DMA_PQ_UPDATE		= 0x10,
> +	DMA_ZERO_SUM		= 0x20,
> +	DMA_PQ_ZERO_SUM		= 0x40,
> +	DMA_MEMSET		= 0x80,
> +	DMA_MEMCPY_CRC32C	= 0x100,

Please use the more readable style that explicitly lists bits:

	DMA_MEMCPY		= (1 << 0),
	DMA_XOR			= (1 << 1),
	...


> +/**
>   * struct dma_chan_percpu - the per-CPU part of struct dma_chan
>   * @refcount: local_t used for open-coded "bigref" counting
>   * @memcpy_count: transaction counter
> @@ -75,27 +96,32 @@ struct dma_chan_percpu {
>  	local_t refcount;
>  	/* stats */
>  	unsigned long memcpy_count;
> +	unsigned long xor_count;
>  	unsigned long bytes_transferred;
> +	unsigned long bytes_xor;

Clearly, each operation needs to be more compartmentalized.

This just isn't scalable, when you consider all the possible transforms.
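
For the per-cpu stats above, for example, an array indexed by operation
type would scale better than adding a new pair of fields for every
transform (a sketch; the enum is hypothetical):

	enum dma_op_type {
		DMA_OP_MEMCPY,
		DMA_OP_XOR,
		DMA_OP_MEMSET,
		DMA_OP_NR,	/* keep last */
	};

	struct dma_chan_percpu {
		local_t refcount;
		/* stats */
		unsigned long op_count[DMA_OP_NR];
		unsigned long bytes_transferred[DMA_OP_NR];
	};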

	Jeff



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation
  2006-09-11 23:18 ` [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation Dan Williams
@ 2006-09-11 23:50   ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:50 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Changelog:
> * make the dmaengine api EXPORT_SYMBOL_GPL
> * zero sum support should be standalone, not integrated into xor
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  drivers/dma/dmaengine.c   |   15 ++++++++++
>  drivers/dma/ioatdma.c     |    5 +++
>  include/linux/dmaengine.h |   68 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 88 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
> index e78ce89..fe62237 100644
> --- a/drivers/dma/dmaengine.c
> +++ b/drivers/dma/dmaengine.c
> @@ -604,6 +604,17 @@ dma_cookie_t dma_async_do_xor_err(struct
>  	return -ENXIO;
>  }
>  
> +/**
> + * dma_async_do_memset_err - default function for dma devices that
> + *      do not support memset
> + */
> +dma_cookie_t dma_async_do_memset_err(struct dma_chan *chan,
> +                union dmaengine_addr dest, unsigned int dest_off,
> +                int val, size_t len, unsigned long flags)
> +{
> +        return -ENXIO;
> +}
> +
>  static int __init dma_bus_init(void)
>  {
>  	mutex_init(&dma_list_mutex);
> @@ -621,6 +632,9 @@ EXPORT_SYMBOL_GPL(dma_async_memcpy_pg_to
>  EXPORT_SYMBOL_GPL(dma_async_memcpy_dma_to_dma);
>  EXPORT_SYMBOL_GPL(dma_async_memcpy_pg_to_dma);
>  EXPORT_SYMBOL_GPL(dma_async_memcpy_dma_to_pg);
> +EXPORT_SYMBOL_GPL(dma_async_memset_buf);
> +EXPORT_SYMBOL_GPL(dma_async_memset_page);
> +EXPORT_SYMBOL_GPL(dma_async_memset_dma);
>  EXPORT_SYMBOL_GPL(dma_async_xor_pgs_to_pg);
>  EXPORT_SYMBOL_GPL(dma_async_xor_dma_list_to_dma);
>  EXPORT_SYMBOL_GPL(dma_async_operation_complete);
> @@ -629,6 +643,7 @@ EXPORT_SYMBOL_GPL(dma_async_device_regis
>  EXPORT_SYMBOL_GPL(dma_async_device_unregister);
>  EXPORT_SYMBOL_GPL(dma_chan_cleanup);
>  EXPORT_SYMBOL_GPL(dma_async_do_xor_err);
> +EXPORT_SYMBOL_GPL(dma_async_do_memset_err);
>  EXPORT_SYMBOL_GPL(dma_async_chan_init);
>  EXPORT_SYMBOL_GPL(dma_async_map_page);
>  EXPORT_SYMBOL_GPL(dma_async_map_single);
> diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c
> index 0159d14..231247c 100644
> --- a/drivers/dma/ioatdma.c
> +++ b/drivers/dma/ioatdma.c
> @@ -637,6 +637,10 @@ extern dma_cookie_t dma_async_do_xor_err
>  	union dmaengine_addr src, unsigned int src_cnt,
>  	unsigned int src_off, size_t len, unsigned long flags);
>  
> +extern dma_cookie_t dma_async_do_memset_err(struct dma_chan *chan,
> +	union dmaengine_addr dest, unsigned int dest_off,
> +	int val, size_t size, unsigned long flags);
> +
>  static dma_addr_t ioat_map_page(struct dma_chan *chan, struct page *page,
>  					unsigned long offset, size_t size,
>  					int direction)
> @@ -748,6 +752,7 @@ #endif
>  	device->common.capabilities = DMA_MEMCPY;
>  	device->common.device_do_dma_memcpy = do_ioat_dma_memcpy;
>  	device->common.device_do_dma_xor = dma_async_do_xor_err;
> +	device->common.device_do_dma_memset = dma_async_do_memset_err;
>  	device->common.map_page = ioat_map_page;
>  	device->common.map_single = ioat_map_single;
>  	device->common.unmap_page = ioat_unmap_page;
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index cb4cfcf..8d53b08 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -260,6 +260,7 @@ struct dma_chan_client_ref {
>   * @device_issue_pending: push appended descriptors to hardware
>   * @device_do_dma_memcpy: perform memcpy with a dma engine
>   * @device_do_dma_xor: perform block xor with a dma engine
> + * @device_do_dma_memset: perform block fill with a dma engine
>   */
>  struct dma_device {
>  
> @@ -284,6 +285,9 @@ struct dma_device {
>  			union dmaengine_addr src, unsigned int src_cnt,
>  			unsigned int src_off, size_t len,
>  			unsigned long flags);
> +	dma_cookie_t (*device_do_dma_memset)(struct dma_chan *chan,
> +			union dmaengine_addr dest, unsigned int dest_off,
> +			int value, size_t len, unsigned long flags);

Same comment as for XOR:  adding operations in this way just isn't scalable.

Operations need to be more compartmentalized.

Maybe a client could do:

	struct adma_transaction adma_xact;

	/* fill in hooks with XOR-specific info */
	init_XScale_xor(adma_device, &adma_xact, my_completion_func);

	/* initiate transaction */	
	adma_go(&adma_xact);

	/* callback signals completion asynchronously */
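
A transaction descriptor along those lines could be quite small (all
names hypothetical):

	struct adma_transaction {
		struct dma_chan *chan;
		void (*complete)(struct adma_transaction *xact);
		void *op_params;	/* filled in by the init_*() helper */
	};

Each new transform would then only add an init_*() helper and an
op_params layout, rather than another set of hooks on struct dma_device.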


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy
  2006-09-11 23:18 ` [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy Dan Williams
@ 2006-09-11 23:51   ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:51 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Default virtual function that returns an error if the user attempts a
> memcpy operation.  An XOR engine is an example of a DMA engine that does
> not support memcpy.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  drivers/dma/dmaengine.c |   13 +++++++++++++
>  1 files changed, 13 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
> index fe62237..33ad690 100644
> --- a/drivers/dma/dmaengine.c
> +++ b/drivers/dma/dmaengine.c
> @@ -593,6 +593,18 @@ void dma_async_device_unregister(struct 
>  }
>  
>  /**
> + * dma_async_do_memcpy_err - default function for dma devices that
> + *	do not support memcpy
> + */
> +dma_cookie_t dma_async_do_memcpy_err(struct dma_chan *chan,
> +		union dmaengine_addr dest, unsigned int dest_off,
> +		union dmaengine_addr src, unsigned int src_off,
> +                size_t len, unsigned long flags)
> +{
> +	return -ENXIO;
> +}

Further illustration of how this API growth is going wrong.  You should 
create an API such that it is impossible for an XOR transform to ever 
call non-XOR-transform hooks.
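
For example, grouping the hooks by capability would mean an XOR-only
engine simply never exposes a memcpy entry point, instead of exporting a
stub that returns -ENXIO (a sketch, not an existing interface):

	struct dma_memcpy_ops {
		dma_cookie_t (*pg_to_pg)(struct dma_chan *chan,
					 struct page *dest, unsigned int dest_off,
					 struct page *src, unsigned int src_off,
					 size_t len);
	};

	struct dma_xor_ops {
		dma_cookie_t (*pgs_to_pg)(struct dma_chan *chan,
					  struct page *dest,
					  struct page **src_list, int src_cnt,
					  size_t len);
	};

	struct dma_device {
		/* ... */
		const struct dma_memcpy_ops *memcpy_ops; /* NULL if unsupported */
		const struct dma_xor_ops *xor_ops;	 /* NULL if unsupported */
	};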

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/19] dmaengine: add dma_sync_wait
  2006-09-11 23:18 ` [PATCH 14/19] dmaengine: add dma_sync_wait Dan Williams
@ 2006-09-11 23:52   ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:52 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> dma_sync_wait is a common routine to live wait for a dma operation to
> complete.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  include/linux/dmaengine.h |   12 ++++++++++++
>  1 files changed, 12 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index 9fd6cbd..0a70c9e 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -750,6 +750,18 @@ static inline void dma_async_unmap_singl
>  	chan->device->unmap_single(chan, handle, size, direction);
>  }
>  
> +static inline enum dma_status dma_sync_wait(struct dma_chan *chan,
> +						dma_cookie_t cookie)
> +{
> +	enum dma_status status;
> +	dma_async_issue_pending(chan);
> +	do {
> +		status = dma_async_operation_complete(chan, cookie, NULL, NULL);
> +	} while (status == DMA_IN_PROGRESS);
> +
> +	return status;

Where are the timeouts, etc.?  Looks like an infinite loop to me, in the 
worst case.
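
A bounded variant might look something like this (the 5 second budget is
arbitrary, and returning DMA_ERROR on timeout is an assumption about how
callers would want to see the failure):

	static inline enum dma_status dma_sync_wait(struct dma_chan *chan,
						    dma_cookie_t cookie)
	{
		unsigned long timeout = jiffies + msecs_to_jiffies(5000);
		enum dma_status status;

		dma_async_issue_pending(chan);
		do {
			status = dma_async_operation_complete(chan, cookie,
							      NULL, NULL);
			if (status == DMA_IN_PROGRESS &&
			    time_after(jiffies, timeout)) {
				WARN_ON(1);
				return DMA_ERROR;
			}
			cpu_relax();
		} while (status == DMA_IN_PROGRESS);

		return status;
	}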

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-11 23:38 ` [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Jeff Garzik
@ 2006-09-11 23:53   ` Dan Williams
  2006-09-12  2:41     ` Jeff Garzik
  0 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-11 23:53 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: NeilBrown, linux-raid, akpm, linux-kernel, christopher.leech

On 9/11/06, Jeff Garzik <jeff@garzik.org> wrote:
> Dan Williams wrote:
> > Neil,
> >
> > The following patches implement hardware accelerated raid5 for the Intel
> > Xscale(r) series of I/O Processors.  The MD changes allow stripe
> > operations to run outside the spin lock in a work queue.  Hardware
> > acceleration is achieved by using a dma-engine-aware work queue routine
> > instead of the default software only routine.
> >
> > Since the last release of the raid5 changes many bug fixes and other
> > improvements have been made as a result of stress testing.  See the per
> > patch change logs for more information about what was fixed.  This
> > release is the first release of the full dma implementation.
> >
> > The patches touch 3 areas, the md-raid5 driver, the generic dmaengine
> > interface, and a platform device driver for IOPs.  The raid5 changes
> > follow your comments concerning making the acceleration implementation
> > similar to how the stripe cache handles I/O requests.  The dmaengine
> > changes are the second release of this code.  They expand the interface
> > to handle more than memcpy operations, and add a generic raid5-dma
> > client.  The iop-adma driver supports dma memcpy, xor, xor zero sum, and
> > memset across all IOP architectures (32x, 33x, and 13xx).
> >
> > Concerning the context switching performance concerns raised at the
> > previous release, I have observed the following.  For the hardware
> > accelerated case it appears that performance is always better with the
> > work queue than without since it allows multiple stripes to be operated
> > on simultaneously.  I expect the same for an SMP platform, but so far my
> > testing has been limited to IOPs.  For a single-processor
> > non-accelerated configuration I have not observed performance
> > degradation with work queue support enabled, but in the Kconfig option
> > help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).
> >
> > Please consider the patches for -mm.
> >
> > -Dan
> >
> > [PATCH 01/19] raid5: raid5_do_soft_block_ops
> > [PATCH 02/19] raid5: move write operations to a workqueue
> > [PATCH 03/19] raid5: move check parity operations to a workqueue
> > [PATCH 04/19] raid5: move compute block operations to a workqueue
> > [PATCH 05/19] raid5: move read completion copies to a workqueue
> > [PATCH 06/19] raid5: move the reconstruct write expansion operation to a workqueue
> > [PATCH 07/19] raid5: remove compute_block and compute_parity5
> > [PATCH 08/19] dmaengine: enable multiple clients and operations
> > [PATCH 09/19] dmaengine: reduce backend address permutations
> > [PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients
> > [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation
> > [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy
> > [PATCH 13/19] dmaengine: add support for dma xor zero sum operations
> > [PATCH 14/19] dmaengine: add dma_sync_wait
> > [PATCH 15/19] dmaengine: raid5 dma client
> > [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines
> > [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs
> > [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization
> > [PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver
>
> Can devices like drivers/scsi/sata_sx4.c or drivers/scsi/sata_promise.c
> take advantage of this?  Promise silicon supports RAID5 XOR offload.
>
> If so, how?  If not, why not?  :)
This is a frequently asked question; Alan Cox had the same one at OLS.
The answer is "probably."  The only complication I currently see is
where/how the stripe cache is maintained.  With the IOPs it's easy
because the DMA engines operate directly on kernel memory.  With the
Promise card I believe they have memory on the card and it's not clear
to me if the XOR engines on the card can deal with host memory.  Also,
MD would need to be modified to handle a stripe cache located on a
device, or somehow synchronize its local cache with the card in a manner
that is still able to beat software-only MD.

>         Jeff

Dan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 15/19] dmaengine: raid5 dma client
  2006-09-11 23:18 ` [PATCH 15/19] dmaengine: raid5 dma client Dan Williams
@ 2006-09-11 23:54   ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:54 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Adds a dmaengine client that is the hardware accelerated version of
> raid5_do_soft_block_ops.  It utilizes the raid5 workqueue implementation to
> operate on multiple stripes simultaneously.  See the iop-adma.c driver for
> an example of a driver that enables hardware accelerated raid5.
> 
> Changelog:
> * mark operations as _Dma rather than _Done until all outstanding
> operations have completed.  Once all operations have completed update the
> state and return it to the handle list
> * add a helper routine to retrieve the last used cookie
> * use dma_async_zero_sum_dma_list for checking parity which optionally
> allows parity check operations to not dirty the parity block in the cache
> (if 'disks' is less than 'MAX_ADMA_XOR_SOURCES')
> * remove dependencies on iop13xx
> * take into account the fact that dma engines have a staging buffer so we
> can perform 1 less block operation compared to software xor
> * added __arch_raid5_dma_chan_request __arch_raid5_dma_next_channel and
> __arch_raid5_dma_check_channel to make the driver architecture independent
> * added channel switching capability for architectures that implement
> different operations (i.e. copy & xor) on individual channels
> * added initial support for "non-blocking" channel switching
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  drivers/dma/Kconfig        |    9 +
>  drivers/dma/Makefile       |    1 
>  drivers/dma/raid5-dma.c    |  730 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/Kconfig         |   11 +
>  drivers/md/raid5.c         |   66 ++++
>  include/linux/dmaengine.h  |    5 
>  include/linux/raid/raid5.h |   24 +
>  7 files changed, 839 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
> index 30d021d..fced8c3 100644
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@ -22,6 +22,15 @@ config NET_DMA
>  	  Since this is the main user of the DMA engine, it should be enabled;
>  	  say Y here.
>  
> +config RAID5_DMA
> +        tristate "MD raid5: block operations offload"
> +	depends on INTEL_IOP_ADMA && MD_RAID456
> +	default y
> +	---help---
> +	  This enables the use of DMA engines in the MD-RAID5 driver to
> +	  offload stripe cache operations, freeing CPU cycles.
> +	  say Y here
> +
>  comment "DMA Devices"
>  
>  config INTEL_IOATDMA
> diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
> index bdcfdbd..4e36d6e 100644
> --- a/drivers/dma/Makefile
> +++ b/drivers/dma/Makefile
> @@ -1,3 +1,4 @@
>  obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
>  obj-$(CONFIG_NET_DMA) += iovlock.o
> +obj-$(CONFIG_RAID5_DMA) += raid5-dma.o
>  obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
> diff --git a/drivers/dma/raid5-dma.c b/drivers/dma/raid5-dma.c
> new file mode 100644
> index 0000000..04a1790
> --- /dev/null
> +++ b/drivers/dma/raid5-dma.c
> @@ -0,0 +1,730 @@
> +/*
> + * Offload raid5 operations to hardware RAID engines
> + * Copyright(c) 2006 Intel Corporation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59
> + * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
> + *
> + * The full GNU General Public License is included in this distribution in the
> + * file called COPYING.
> + */
> +
> +#include <linux/raid/raid5.h>
> +#include <linux/dmaengine.h>
> +
> +static struct dma_client *raid5_dma_client;
> +static atomic_t raid5_count;
> +extern void release_stripe(struct stripe_head *sh);
> +extern void __arch_raid5_dma_chan_request(struct dma_client *client);
> +extern struct dma_chan *__arch_raid5_dma_next_channel(struct dma_client *client);
> +
> +#define MAX_HW_XOR_SRCS 16
> +
> +#ifndef STRIPE_SIZE
> +#define STRIPE_SIZE PAGE_SIZE
> +#endif
> +
> +#ifndef STRIPE_SECTORS
> +#define STRIPE_SECTORS		(STRIPE_SIZE>>9)
> +#endif
> +
> +#ifndef r5_next_bio
> +#define r5_next_bio(bio, sect) ( ( (bio)->bi_sector + ((bio)->bi_size>>9) < sect + STRIPE_SECTORS) ? (bio)->bi_next : NULL)
> +#endif
> +
> +#define DMA_RAID5_DEBUG 0
> +#define PRINTK(x...) ((void)(DMA_RAID5_DEBUG && printk(x)))
> +
> +/*
> + * Copy data between a page in the stripe cache, and one or more bion
> + * The page could align with the middle of the bio, or there could be
> + * several bion, each with several bio_vecs, which cover part of the page
> + * Multiple bion are linked together on bi_next.  There may be extras
> + * at the end of this list.  We ignore them.
> + */
> +static dma_cookie_t dma_raid_copy_data(int frombio, struct bio *bio,
> +		     dma_addr_t dma, sector_t sector, struct dma_chan *chan,
> +		     dma_cookie_t cookie)
> +{
> +	struct bio_vec *bvl;
> +	struct page *bio_page;
> +	int i;
> +	int dma_offset;
> +	dma_cookie_t last_cookie = cookie;
> +
> +	if (bio->bi_sector >= sector)
> +		dma_offset = (signed)(bio->bi_sector - sector) * 512;
> +	else
> +		dma_offset = (signed)(sector - bio->bi_sector) * -512;
> +	bio_for_each_segment(bvl, bio, i) {
> +		int len = bio_iovec_idx(bio,i)->bv_len;
> +		int clen;
> +		int b_offset = 0;
> +
> +		if (dma_offset < 0) {
> +			b_offset = -dma_offset;
> +			dma_offset += b_offset;
> +			len -= b_offset;
> +		}
> +
> +		if (len > 0 && dma_offset + len > STRIPE_SIZE)
> +			clen = STRIPE_SIZE - dma_offset;
> +		else clen = len;
> +
> +		if (clen > 0) {
> +			b_offset += bio_iovec_idx(bio,i)->bv_offset;
> +			bio_page = bio_iovec_idx(bio,i)->bv_page;
> +			if (frombio)
> +				do {
> +					cookie = dma_async_memcpy_pg_to_dma(chan,
> +								dma + dma_offset,
> +								bio_page,
> +								b_offset,
> +								clen);
> +					if (cookie == -ENOMEM)
> +						dma_sync_wait(chan, last_cookie);
> +					else
> +						WARN_ON(cookie <= 0);
> +				} while (cookie == -ENOMEM);
> +			else
> +				do {
> +					cookie = dma_async_memcpy_dma_to_pg(chan,
> +								bio_page,
> +								b_offset,
> +								dma + dma_offset,
> +								clen);
> +					if (cookie == -ENOMEM)
> +						dma_sync_wait(chan, last_cookie);
> +					else
> +						WARN_ON(cookie <= 0);
> +				} while (cookie == -ENOMEM);
> +		}
> +		last_cookie = cookie;
> +		if (clen < len) /* hit end of page */
> +			break;
> +		dma_offset +=  len;
> +	}
> +
> +	return last_cookie;
> +}
> +
> +#define issue_xor() do {					          \
> +			 do {					          \
> +			 	cookie = dma_async_xor_dma_list_to_dma(   \
> +			 		sh->ops.dma_chan,	          \
> +			 		xor_destination_addr,	          \
> +			 		dma,			          \
> +			 		count,			          \
> +			 		STRIPE_SIZE);		          \
> +			 	if (cookie == -ENOMEM)		          \
> +			 		dma_sync_wait(sh->ops.dma_chan,	  \
> +			 			sh->ops.dma_cookie);      \
> +			 	else				          \
> +			 		WARN_ON(cookie <= 0);	          \
> +			 } while (cookie == -ENOMEM);		          \
> +			 sh->ops.dma_cookie = cookie;		          \
> +			 dma[0] = xor_destination_addr;			  \
> +			 count = 1;					  \
> +			} while(0)
> +#define check_xor() do {						\
> +			if (count == MAX_HW_XOR_SRCS)			\
> +				issue_xor();				\
> +		     } while (0)
> +
> +#ifdef CONFIG_RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH
> +extern struct dma_chan *__arch_raid5_dma_check_channel(struct dma_chan *chan,
> +						dma_cookie_t cookie,
> +						struct dma_client *client,
> +						unsigned long capabilities);
> +
> +#ifdef CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE
> +#define check_channel(cap, bookmark) do {			     \
> +bookmark:							     \
> +	next_chan = __arch_raid5_dma_check_channel(sh->ops.dma_chan, \
> +						sh->ops.dma_cookie,  \
> +						raid5_dma_client,    \
> +						(cap));		     \
> +	if (!next_chan) {					     \
> +		BUG_ON(sh->ops.ops_bookmark);			     \
> +		sh->ops.ops_bookmark = &&bookmark;		     \
> +		goto raid5_dma_retry;				     \
> +	} else {						     \
> +		sh->ops.dma_chan = next_chan;			     \
> +		sh->ops.dma_cookie = dma_async_get_last_cookie(	     \
> +							next_chan);  \
> +		sh->ops.ops_bookmark = NULL;			     \
> +	}							     \
> +} while (0)
> +#else
> +#define check_channel(cap, bookmark) do {			     \
> +bookmark:							     \
> +	next_chan = __arch_raid5_dma_check_channel(sh->ops.dma_chan, \
> +						sh->ops.dma_cookie,  \
> +						raid5_dma_client,    \
> +						(cap));		     \
> +	if (!next_chan) {					     \
> +		dma_sync_wait(sh->ops.dma_chan, sh->ops.dma_cookie); \
> +		goto bookmark;					     \
> +	} else {						     \
> +		sh->ops.dma_chan = next_chan;			     \
> +		sh->ops.dma_cookie = dma_async_get_last_cookie(	     \
> +							next_chan);  \
> +	}							     \
> +} while (0)
> +#endif /* CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE */
> +#else
> +#define check_channel(cap, bookmark) do { } while (0)
> +#endif /* CONFIG_RAID5_DMA_ARCH_NEEDS_CHAN_SWITCH */

The above seems a bit questionable and overengineered.

Linux mantra:  Do What You Must, And No More.

In this case, just code it and note that it's IOP-specific.  Don't bother
to support cases that don't exist yet.


> + * dma_do_raid5_block_ops - perform block memory operations on stripe data
> + * outside the spin lock with dma engines
> + *
> + * A note about the need for __arch_raid5_dma_check_channel:
> + * This function is only needed to support architectures where a single raid
> + * operation spans multiple hardware channels.  For example on a reconstruct
> + * write, memory copy operations are submitted to a memcpy channel and then
> + * the routine must switch to the xor channel to complete the raid operation.
> + * __arch_raid5_dma_check_channel makes sure the previous operation has
> + * completed before returning the new channel.
> + * Some efficiency can be gained by putting the stripe back on the work
> + * queue rather than spin waiting.  This code is a work in progress and is
> + * available via the 'broken' option CONFIG_RAID5_DMA_WAIT_VIA_REQUEUE.
> + * If 'wait via requeue' is not defined the check_channel macro live waits
> + * for the next channel.
> + */
> +static void dma_do_raid5_block_ops(void *stripe_head_ref)
> +{

Another way-too-big function that should be split up.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs
  2006-09-11 23:19 ` [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs Dan Williams
@ 2006-09-11 23:55   ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:55 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Also brings the iop3xx registers in line with the format of the iop13xx
> register definitions.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  include/asm-arm/arch-iop32x/entry-macro.S |    2 
>  include/asm-arm/arch-iop32x/iop32x.h      |   14 +
>  include/asm-arm/arch-iop33x/entry-macro.S |    2 
>  include/asm-arm/arch-iop33x/iop33x.h      |   38 ++-
>  include/asm-arm/hardware/iop3xx.h         |  347 +++++++++++++----------------
>  5 files changed, 188 insertions(+), 215 deletions(-)

Another Linux mantra:  "volatile" == hiding a bug.  Avoid, please.

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization
  2006-09-11 23:19 ` [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization Dan Williams
@ 2006-09-11 23:56   ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-11 23:56 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Currently the iop3xx platform support code assumes that RedBoot is the
> bootloader and has already initialized the ATU.  Linux should handle this
> initialization for three reasons:
> 
> 1/ The memory map that RedBoot sets up is not optimal (page_to_dma and
> virt_to_phys return different addresses).  The effect of this is that using
> the dma mapping API for the internal bus dma units generates pci bus
> addresses that are incorrect for the internal bus.
> 
> 2/ Not all iop platforms use RedBoot
> 
> 3/ If the ATU is already initialized it indicates that the iop is an add-in
> card in another host, it does not own the PCI bus, and should not be
> re-initialized.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  arch/arm/mach-iop32x/Kconfig         |    8 ++
>  arch/arm/mach-iop32x/ep80219.c       |    4 +
>  arch/arm/mach-iop32x/iq31244.c       |    5 +
>  arch/arm/mach-iop32x/iq80321.c       |    5 +
>  arch/arm/mach-iop33x/Kconfig         |    8 ++
>  arch/arm/mach-iop33x/iq80331.c       |    5 +
>  arch/arm/mach-iop33x/iq80332.c       |    4 +
>  arch/arm/plat-iop/pci.c              |  140 ++++++++++++++++++++++++++++++++++
>  include/asm-arm/arch-iop32x/iop32x.h |    9 ++
>  include/asm-arm/arch-iop32x/memory.h |    4 -
>  include/asm-arm/arch-iop33x/iop33x.h |   10 ++
>  include/asm-arm/arch-iop33x/memory.h |    4 -
>  include/asm-arm/hardware/iop3xx.h    |   20 ++++-
>  13 files changed, 214 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm/mach-iop32x/Kconfig b/arch/arm/mach-iop32x/Kconfig
> index 05549a5..b2788e3 100644
> --- a/arch/arm/mach-iop32x/Kconfig
> +++ b/arch/arm/mach-iop32x/Kconfig
> @@ -22,6 +22,14 @@ config ARCH_IQ80321
>  	  Say Y here if you want to run your kernel on the Intel IQ80321
>  	  evaluation kit for the IOP321 processor.
>  
> +config IOP3XX_ATU
> +        bool "Enable the PCI Controller"
> +        default y
> +        help
> +          Say Y here if you want the IOP to initialize its PCI Controller.
> +          Say N if the IOP is an add in card, the host system owns the PCI
> +          bus in this case.
> +
>  endmenu
>  
>  endif
> diff --git a/arch/arm/mach-iop32x/ep80219.c b/arch/arm/mach-iop32x/ep80219.c
> index f616d3e..1a5c586 100644
> --- a/arch/arm/mach-iop32x/ep80219.c
> +++ b/arch/arm/mach-iop32x/ep80219.c
> @@ -100,7 +100,7 @@ ep80219_pci_map_irq(struct pci_dev *dev,
>  
>  static struct hw_pci ep80219_pci __initdata = {
>  	.swizzle	= pci_std_swizzle,
> -	.nr_controllers = 1,
> +	.nr_controllers = 0,
>  	.setup		= iop3xx_pci_setup,
>  	.preinit	= iop3xx_pci_preinit,
>  	.scan		= iop3xx_pci_scan_bus,
> @@ -109,6 +109,8 @@ static struct hw_pci ep80219_pci __initd
>  
>  static int __init ep80219_pci_init(void)
>  {
> +	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
> +		ep80219_pci.nr_controllers = 1;
>  #if 0
>  	if (machine_is_ep80219())
>  		pci_common_init(&ep80219_pci);
> diff --git a/arch/arm/mach-iop32x/iq31244.c b/arch/arm/mach-iop32x/iq31244.c
> index 967a696..25d5d62 100644
> --- a/arch/arm/mach-iop32x/iq31244.c
> +++ b/arch/arm/mach-iop32x/iq31244.c
> @@ -97,7 +97,7 @@ iq31244_pci_map_irq(struct pci_dev *dev,
>  
>  static struct hw_pci iq31244_pci __initdata = {
>  	.swizzle	= pci_std_swizzle,
> -	.nr_controllers = 1,
> +	.nr_controllers = 0,
>  	.setup		= iop3xx_pci_setup,
>  	.preinit	= iop3xx_pci_preinit,
>  	.scan		= iop3xx_pci_scan_bus,
> @@ -106,6 +106,9 @@ static struct hw_pci iq31244_pci __initd
>  
>  static int __init iq31244_pci_init(void)
>  {
> +	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
> +		iq31244_pci.nr_controllers = 1;
> +
>  	if (machine_is_iq31244())
>  		pci_common_init(&iq31244_pci);
>  
> diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c
> index ef4388c..cdd2265 100644
> --- a/arch/arm/mach-iop32x/iq80321.c
> +++ b/arch/arm/mach-iop32x/iq80321.c
> @@ -97,7 +97,7 @@ iq80321_pci_map_irq(struct pci_dev *dev,
>  
>  static struct hw_pci iq80321_pci __initdata = {
>  	.swizzle	= pci_std_swizzle,
> -	.nr_controllers = 1,
> +	.nr_controllers = 0,
>  	.setup		= iop3xx_pci_setup,
>  	.preinit	= iop3xx_pci_preinit,
>  	.scan		= iop3xx_pci_scan_bus,
> @@ -106,6 +106,9 @@ static struct hw_pci iq80321_pci __initd
>  
>  static int __init iq80321_pci_init(void)
>  {
> +	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
> +		iq80321_pci.nr_controllers = 1;
> +
>  	if (machine_is_iq80321())
>  		pci_common_init(&iq80321_pci);
>  
> diff --git a/arch/arm/mach-iop33x/Kconfig b/arch/arm/mach-iop33x/Kconfig
> index 9aa016b..45598e0 100644
> --- a/arch/arm/mach-iop33x/Kconfig
> +++ b/arch/arm/mach-iop33x/Kconfig
> @@ -16,6 +16,14 @@ config MACH_IQ80332
>  	  Say Y here if you want to run your kernel on the Intel IQ80332
>  	  evaluation kit for the IOP332 chipset.
>  
> +config IOP3XX_ATU
> +	bool "Enable the PCI Controller"
> +	default y
> +	help
> +	  Say Y here if you want the IOP to initialize its PCI Controller.
> +	  Say N if the IOP is an add in card, the host system owns the PCI
> +	  bus in this case.
> +
>  endmenu
>  
>  endif
> diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c
> index 7714c94..3807000 100644
> --- a/arch/arm/mach-iop33x/iq80331.c
> +++ b/arch/arm/mach-iop33x/iq80331.c
> @@ -78,7 +78,7 @@ iq80331_pci_map_irq(struct pci_dev *dev,
>  
>  static struct hw_pci iq80331_pci __initdata = {
>  	.swizzle	= pci_std_swizzle,
> -	.nr_controllers = 1,
> +	.nr_controllers = 0,
>  	.setup		= iop3xx_pci_setup,
>  	.preinit	= iop3xx_pci_preinit,
>  	.scan		= iop3xx_pci_scan_bus,
> @@ -87,6 +87,9 @@ static struct hw_pci iq80331_pci __initd
>  
>  static int __init iq80331_pci_init(void)
>  {
> +	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
> +		iq80331_pci.nr_controllers = 1;
> +
>  	if (machine_is_iq80331())
>  		pci_common_init(&iq80331_pci);
>  
> diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c
> index a3fa7f8..8780d55 100644
> --- a/arch/arm/mach-iop33x/iq80332.c
> +++ b/arch/arm/mach-iop33x/iq80332.c
> @@ -93,6 +93,10 @@ static struct hw_pci iq80332_pci __initd
>  
>  static int __init iq80332_pci_init(void)
>  {
> +
> +	if (iop3xx_get_init_atu() == IOP3XX_INIT_ATU_ENABLE)
> +		iq80332_pci.nr_controllers = 1;
> +
>  	if (machine_is_iq80332())
>  		pci_common_init(&iq80332_pci);
>  
> diff --git a/arch/arm/plat-iop/pci.c b/arch/arm/plat-iop/pci.c
> index e647812..19aace9 100644
> --- a/arch/arm/plat-iop/pci.c
> +++ b/arch/arm/plat-iop/pci.c
> @@ -55,7 +55,7 @@ static u32 iop3xx_cfg_address(struct pci
>   * This routine checks the status of the last configuration cycle.  If an error
>   * was detected it returns a 1, else it returns a 0.  The errors being checked
>   * are parity, master abort, target abort (master and target).  These types of
> - * errors occure during a config cycle where there is no device, like during
> + * errors occur during a config cycle where there is no device, like during
>   * the discovery stage.
>   */
>  static int iop3xx_pci_status(void)
> @@ -223,8 +223,111 @@ struct pci_bus *iop3xx_pci_scan_bus(int 
>  	return pci_scan_bus(sys->busnr, &iop3xx_ops, sys);
>  }
>  
> +void __init iop3xx_atu_setup(void)
> +{
> +	/* BAR 0 ( Disabled ) */
> +	*IOP3XX_IAUBAR0 = 0x0;
> +	*IOP3XX_IABAR0  = 0x0;
> +	*IOP3XX_IATVR0  = 0x0;
> +	*IOP3XX_IALR0   = 0x0;
> +
> +	/* BAR 1 ( Disabled ) */
> +	*IOP3XX_IAUBAR1 = 0x0;
> +	*IOP3XX_IABAR1  = 0x0;
> +	*IOP3XX_IALR1   = 0x0;
> +
> +	/* BAR 2 (1:1 mapping with Physical RAM) */
> +	/* Set limit and enable */
> +	*IOP3XX_IALR2 = ~((u32)IOP3XX_MAX_RAM_SIZE - 1) & ~0x1;
> +	*IOP3XX_IAUBAR2 = 0x0;
> +
> +	/* Align the inbound bar with the base of memory */
> +	*IOP3XX_IABAR2 = PHYS_OFFSET |
> +			       PCI_BASE_ADDRESS_MEM_TYPE_64 |
> +			       PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> +	*IOP3XX_IATVR2 = PHYS_OFFSET;
> +
> +	/* Outbound window 0 */
> +	*IOP3XX_OMWTVR0 = IOP3XX_PCI_LOWER_MEM_PA;
> +	*IOP3XX_OUMWTVR0 = 0;
> +
> +	/* Outbound window 1 */
> +	*IOP3XX_OMWTVR1 = IOP3XX_PCI_LOWER_MEM_PA + IOP3XX_PCI_MEM_WINDOW_SIZE;
> +	*IOP3XX_OUMWTVR1 = 0;
> +
> +	/* BAR 3 ( Disabled ) */
> +	*IOP3XX_IAUBAR3 = 0x0;
> +	*IOP3XX_IABAR3  = 0x0;
> +	*IOP3XX_IATVR3  = 0x0;
> +	*IOP3XX_IALR3   = 0x0;
> +
> +	/* Setup the I/O Bar
> +	 */
> +	*IOP3XX_OIOWTVR = IOP3XX_PCI_LOWER_IO_PA;
> +
> +	/* Enable inbound and outbound cycles
> +	 */
> +	*IOP3XX_ATUCMD |= PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER |
> +			       PCI_COMMAND_PARITY | PCI_COMMAND_SERR;
> +	*IOP3XX_ATUCR |= IOP3XX_ATUCR_OUT_EN;
> +}
> +
> +void __init iop3xx_atu_disable(void)
> +{
> +	*IOP3XX_ATUCMD = 0;
> +	*IOP3XX_ATUCR = 0;
> +
> +	/* wait for cycles to quiesce */
> +	while (*IOP3XX_PCSR & (IOP3XX_PCSR_OUT_Q_BUSY |
> +				     IOP3XX_PCSR_IN_Q_BUSY))
> +		cpu_relax();
> +
> +	/* BAR 0 ( Disabled ) */
> +	*IOP3XX_IAUBAR0 = 0x0;
> +	*IOP3XX_IABAR0  = 0x0;
> +	*IOP3XX_IATVR0  = 0x0;
> +	*IOP3XX_IALR0   = 0x0;
> +
> +	/* BAR 1 ( Disabled ) */
> +	*IOP3XX_IAUBAR1 = 0x0;
> +	*IOP3XX_IABAR1  = 0x0;
> +	*IOP3XX_IALR1   = 0x0;
> +
> +	/* BAR 2 ( Disabled ) */
> +	*IOP3XX_IAUBAR2 = 0x0;
> +	*IOP3XX_IABAR2  = 0x0;
> +	*IOP3XX_IATVR2  = 0x0;
> +	*IOP3XX_IALR2   = 0x0;
> +
> +	/* BAR 3 ( Disabled ) */
> +	*IOP3XX_IAUBAR3 = 0x0;
> +	*IOP3XX_IABAR3  = 0x0;
> +	*IOP3XX_IATVR3  = 0x0;
> +	*IOP3XX_IALR3   = 0x0;
> +
> +	/* Clear the outbound windows */
> +	*IOP3XX_OIOWTVR  = 0;
> +
> +	/* Outbound window 0 */
> +	*IOP3XX_OMWTVR0 = 0;
> +	*IOP3XX_OUMWTVR0 = 0;
> +
> +	/* Outbound window 1 */
> +	*IOP3XX_OMWTVR1 = 0;
> +	*IOP3XX_OUMWTVR1 = 0;

You should be using readl(), writel() variants rather than writing C 
code that appears to be normal, but in reality has hardware side-effects.
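
Roughly, i.e. something like this (the base address and the offsets
below are made up, not the real iop3xx register map):

#include <asm/io.h>

#define ATU_PHYS_BASE	0xffffe100	/* placeholder physical address */

static void atu_disable_sketch(void)
{
	void __iomem *atu = ioremap(ATU_PHYS_BASE, 0x100);

	if (!atu)
		return;

	writel(0, atu + 0x04);		/* the MMIO write is explicit   */
	(void) readl(atu + 0x84);	/* and so is the MMIO read back */

	iounmap(atu);
}

That way nothing in the function looks like an innocent memory store.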

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/19] dmaengine: enable multiple clients and operations
  2006-09-11 23:44   ` Jeff Garzik
@ 2006-09-12  0:14     ` Dan Williams
  2006-09-12  0:52       ` Roland Dreier
  2006-09-15 16:38     ` Olof Johansson
  1 sibling, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-12  0:14 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

On 9/11/06, Jeff Garzik <jeff@garzik.org> wrote:
> Dan Williams wrote:
> > @@ -759,8 +755,10 @@ #endif
> >       device->common.device_memcpy_buf_to_buf = ioat_dma_memcpy_buf_to_buf;
> >       device->common.device_memcpy_buf_to_pg = ioat_dma_memcpy_buf_to_pg;
> >       device->common.device_memcpy_pg_to_pg = ioat_dma_memcpy_pg_to_pg;
> > -     device->common.device_memcpy_complete = ioat_dma_is_complete;
> > -     device->common.device_memcpy_issue_pending = ioat_dma_memcpy_issue_pending;
> > +     device->common.device_operation_complete = ioat_dma_is_complete;
> > +     device->common.device_xor_pgs_to_pg = dma_async_xor_pgs_to_pg_err;
> > +     device->common.device_issue_pending = ioat_dma_memcpy_issue_pending;
> > +     device->common.capabilities = DMA_MEMCPY;
>
>
> Are we really going to add a set of hooks for each DMA engine whizbang
> feature?

What's the alternative?  But also see patch 9, "dmaengine: reduce
backend address permutations"; it relieves some of this pain.

>
> That will get ugly when DMA engines support memcpy, xor, crc32, sha1,
> aes, and a dozen other transforms.
>
>
> > diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> > index c94d8f1..3599472 100644
> > --- a/include/linux/dmaengine.h
> > +++ b/include/linux/dmaengine.h
> > @@ -20,7 +20,7 @@
> >   */
> >  #ifndef DMAENGINE_H
> >  #define DMAENGINE_H
> > -
> > +#include <linux/config.h>
> >  #ifdef CONFIG_DMA_ENGINE
> >
> >  #include <linux/device.h>
> > @@ -65,6 +65,27 @@ enum dma_status {
> >  };
> >
> >  /**
> > + * enum dma_capabilities - DMA operational capabilities
> > + * @DMA_MEMCPY: src to dest copy
> > + * @DMA_XOR: src*n to dest xor
> > + * @DMA_DUAL_XOR: src*n to dest_diag and dest_horiz xor
> > + * @DMA_PQ_XOR: src*n to dest_q and dest_p gf/xor
> > + * @DMA_MEMCPY_CRC32C: src to dest copy and crc-32c sum
> > + * @DMA_SHARE: multiple clients can use this channel
> > + */
> > +enum dma_capabilities {
> > +     DMA_MEMCPY              = 0x1,
> > +     DMA_XOR                 = 0x2,
> > +     DMA_PQ_XOR              = 0x4,
> > +     DMA_DUAL_XOR            = 0x8,
> > +     DMA_PQ_UPDATE           = 0x10,
> > +     DMA_ZERO_SUM            = 0x20,
> > +     DMA_PQ_ZERO_SUM         = 0x40,
> > +     DMA_MEMSET              = 0x80,
> > +     DMA_MEMCPY_CRC32C       = 0x100,
>
> Please use the more readable style that explicitly lists bits:
>
>         DMA_MEMCPY              = (1 << 0),
>         DMA_XOR                 = (1 << 1),
>         ...
I prefer this as well, although at one point I was told (not by you)
that the absolute number was preferred when I was making changes to
drivers/scsi/sata_vsc.c.  In any event, I'll change it...

>
> > +/**
> >   * struct dma_chan_percpu - the per-CPU part of struct dma_chan
> >   * @refcount: local_t used for open-coded "bigref" counting
> >   * @memcpy_count: transaction counter
> > @@ -75,27 +96,32 @@ struct dma_chan_percpu {
> >       local_t refcount;
> >       /* stats */
> >       unsigned long memcpy_count;
> > +     unsigned long xor_count;
> >       unsigned long bytes_transferred;
> > +     unsigned long bytes_xor;
>
> Clearly, each operation needs to be more compartmentalized.
>
> This just isn't scalable, when you consider all the possible transforms.
Ok, one set of counters per op is probably overkill; what about lumping
operations into groups and just tracking at the group level (rough
struct sketch below)?  i.e.

memcpy, memset -> string_count, string_bytes_transferred
crc, sha1, aes -> hash_count, hash_transferred
xor, pq_xor -> sum_count, sum_transferred
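
Laid over the existing per-cpu stats, that would look roughly like this
(sketch only, reusing the names above):

/* sketch against struct dma_chan_percpu in include/linux/dmaengine.h */
struct dma_chan_percpu {
	local_t refcount;
	/* stats, grouped per class of operation */
	unsigned long string_count;		/* memcpy, memset */
	unsigned long string_bytes_transferred;
	unsigned long hash_count;		/* crc, sha1, aes */
	unsigned long hash_transferred;
	unsigned long sum_count;		/* xor, pq_xor */
	unsigned long sum_transferred;
};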

>
>         Jeff

Dan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/19] dmaengine: enable multiple clients and operations
  2006-09-12  0:14     ` Dan Williams
@ 2006-09-12  0:52       ` Roland Dreier
  2006-09-12  6:18         ` Dan Williams
  0 siblings, 1 reply; 55+ messages in thread
From: Roland Dreier @ 2006-09-12  0:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Garzik, neilb, linux-raid, akpm, linux-kernel, christopher.leech

    Jeff> Are we really going to add a set of hooks for each DMA
    Jeff> engine whizbang feature?

    Dan> What's the alternative?  But, also see patch 9 "dmaengine:
    Dan> reduce backend address permutations" it relieves some of this
    Dan> pain.

I guess you can pass an opcode into a common "start operation" function.
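
E.g. (all names below are invented for illustration; only the existing
dmaengine types are real):

enum dma_op {
	DMA_OP_MEMCPY,
	DMA_OP_XOR,
	DMA_OP_MEMSET,
};

struct dma_op_desc {
	enum dma_op	op;
	dma_addr_t	dest;
	dma_addr_t	*src;		/* src_cnt entries */
	unsigned int	src_cnt;
	size_t		len;
};

/* one entry point; a channel returns an error for ops it cannot do */
dma_cookie_t dma_async_start_op(struct dma_chan *chan,
				struct dma_op_desc *desc);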

With all the memcpy / xor / crypto / etc. hardware out there already,
we definitely have to get this interface right.

 - R.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-11 23:53   ` Dan Williams
@ 2006-09-12  2:41     ` Jeff Garzik
  2006-09-12  5:47       ` Dan Williams
  0 siblings, 1 reply; 55+ messages in thread
From: Jeff Garzik @ 2006-09-12  2:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: NeilBrown, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> This is a frequently asked question, Alan Cox had the same one at OLS.
> The answer is "probably."  The only complication I currently see is
> where/how the stripe cache is maintained.  With the IOPs its easy
> because the DMA engines operate directly on kernel memory.  With the
> Promise card I believe they have memory on the card and it's not clear
> to me if the XOR engines on the card can deal with host memory.  Also,
> MD would need to be modified to handle a stripe cache located on a
> device, or somehow synchronize its local cache with card in a manner
> that is still able to beat software only MD.

sata_sx4 operates through [standard PC] memory on the card, and you use 
a DMA engine to copy memory to/from the card.

[select chipsets supported by] sata_promise operates directly on host 
memory.

So, while sata_sx4 is farther away from your direct-host-memory model, 
it also has much more potential for RAID acceleration:  ideally, RAID1 
just copies data to the card once, then copies the data to multiple 
drives from there.  Similarly with RAID5, you can eliminate copies and 
offload XOR, presuming the drives are all connected to the same card.

	Jeff



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-12  2:41     ` Jeff Garzik
@ 2006-09-12  5:47       ` Dan Williams
  2006-09-13  4:05         ` Jeff Garzik
  0 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-12  5:47 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: NeilBrown, linux-raid, akpm, linux-kernel, christopher.leech

On 9/11/06, Jeff Garzik <jeff@garzik.org> wrote:
> Dan Williams wrote:
> > This is a frequently asked question, Alan Cox had the same one at OLS.
> > The answer is "probably."  The only complication I currently see is
> > where/how the stripe cache is maintained.  With the IOPs its easy
> > because the DMA engines operate directly on kernel memory.  With the
> > Promise card I believe they have memory on the card and it's not clear
> > to me if the XOR engines on the card can deal with host memory.  Also,
> > MD would need to be modified to handle a stripe cache located on a
> > device, or somehow synchronize its local cache with card in a manner
> > that is still able to beat software only MD.
>
> sata_sx4 operates through [standard PC] memory on the card, and you use
> a DMA engine to copy memory to/from the card.
>
> [select chipsets supported by] sata_promise operates directly on host
> memory.
>
> So, while sata_sx4 is farther away from your direct-host-memory model,
> it also has much more potential for RAID acceleration:  ideally, RAID1
> just copies data to the card once, then copies the data to multiple
> drives from there.  Similarly with RAID5, you can eliminate copies and
> offload XOR, presuming the drives are all connected to the same card.
In the sata_promise case it's straightforward: all that is needed is
dmaengine drivers for the xor and memcpy engines.  This would be
similar to the current I/OAT model where dma resources are provided by
a PCI function.  The sata_sx4 case would need a different flavor of
the dma_do_raid5_block_ops routine, one that understands where the
cache is located.  MD would also need the capability to bypass the
block layer since the data will have already been transferred to the
card by a stripe cache operation.

The RAID1 case gives me pause because it seems any work along these
lines requires that the implementation work for both MD and DM, which
then eventually leads to being tasked with merging the two.

>         Jeff

Dan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/19] dmaengine: enable multiple clients and operations
  2006-09-12  0:52       ` Roland Dreier
@ 2006-09-12  6:18         ` Dan Williams
  2006-09-12  9:15           ` Evgeniy Polyakov
  2006-09-13  4:04           ` Jeff Garzik
  0 siblings, 2 replies; 55+ messages in thread
From: Dan Williams @ 2006-09-12  6:18 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jeff Garzik, neilb, linux-raid, akpm, linux-kernel,
	christopher.leech, Evgeniy Polyakov

On 9/11/06, Roland Dreier <rdreier@cisco.com> wrote:
>     Jeff> Are we really going to add a set of hooks for each DMA
>     Jeff> engine whizbang feature?
...ok, but at some level we are going to need a file that has:
EXPORT_SYMBOL_GPL(dma_whizbang_op1)
. . .
EXPORT_SYMBOL_GPL(dma_whizbang_opX)
correct?


>     Dan> What's the alternative?  But, also see patch 9 "dmaengine:
>     Dan> reduce backend address permutations" it relieves some of this
>     Dan> pain.
>
> I guess you can pass an opcode into a common "start operation" function.
But then we still have the problem of being able to request a memory
copy operation of a channel that only understands xor, a la Jeff's
comment to patch 12:

"Further illustration of how this API growth is going wrong.  You should
create an API such that it is impossible for an XOR transform to ever
call non-XOR-transform hooks."

> With all the memcpy / xor / crypto / etc. hardware out there already,
> we definitely have to get this interface right.
>
>  - R.

I understand what you are saying Jeff, the implementation can be made
better, but something I think is valuable is the ability to write
clients once like NET_DMA and RAID5_DMA and have them run without
modification on any platform that can provide the engine interface
rather than needing a client per architecture
IOP_RAID5_DMA...FOO_X_RAID5_DMA.

Or is this an example of where "Do What You Must, And No More"
comes in, i.e. don't worry about making a generic RAID5_DMA while
there is only one implementation in existence?

I also want to pose the question of whether the dmaengine interface
should handle cryptographic transforms.  We already have Acrypto:
http://tservice.net.ru/~s0mbre/blog/devel/acrypto/index.html.  At the
same time, since IOPs can do Galois Field multiplication and XOR, it
would be nice to take advantage of that for crypto acceleration, but
this does not fit the model of a device that Acrypto supports.

Dan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/19] dmaengine: enable multiple clients and operations
  2006-09-12  6:18         ` Dan Williams
@ 2006-09-12  9:15           ` Evgeniy Polyakov
  2006-09-13  4:04           ` Jeff Garzik
  1 sibling, 0 replies; 55+ messages in thread
From: Evgeniy Polyakov @ 2006-09-12  9:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Roland Dreier, Jeff Garzik, neilb, linux-raid, akpm,
	linux-kernel, christopher.leech

On Mon, Sep 11, 2006 at 11:18:59PM -0700, Dan Williams (dan.j.williams@gmail.com) wrote:
> Or is this an example of the where "Do What You Must, And No More"
> comes in, i.e. don't worry about making a generic RAID5_DMA while
> there is only one implementation existence?
> 
> I also want to pose the question of whether the dmaengine interface
> should handle cryptographic transforms?  We already have Acrypto:
> http://tservice.net.ru/~s0mbre/blog/devel/acrypto/index.html.  At the
> same time since IOPs can do Galois Field multiplication and XOR it
> would be nice to take advantage of that for crypto acceleration, but
> this does not fit the model of a device that Acrypto supports.

Each acrypto crypto device provides a set of capabilities it supports,
and when a user requests some operation, the acrypto core selects the
device with the maximum speed for the given capabilities, so one can
easily add GF multiplication devices there. Acrypto supports a "sync"
mode too, in case your hardware is synchronous (i.e. it does not
provide an interrupt or other async event when the operation is
completed).

P.S. acrypto homepage with some design notes and supported features
can be found here:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=acrypto

> Dan

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/19] dmaengine: enable multiple clients and operations
  2006-09-12  6:18         ` Dan Williams
  2006-09-12  9:15           ` Evgeniy Polyakov
@ 2006-09-13  4:04           ` Jeff Garzik
  1 sibling, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-13  4:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Roland Dreier, neilb, linux-raid, akpm, linux-kernel,
	christopher.leech, Evgeniy Polyakov

Dan Williams wrote:
> On 9/11/06, Roland Dreier <rdreier@cisco.com> wrote:
>>     Jeff> Are we really going to add a set of hooks for each DMA
>>     Jeff> engine whizbang feature?
> ...ok, but at some level we are going to need a file that has:
> EXPORT_SYMBOL_GPL(dma_whizbang_op1)
> . . .
> EXPORT_SYMBOL_GPL(dma_whizbang_opX)
> correct?

If properly modularized, you'll have multiple files with such exports.

Or perhaps you won't have such exports at all, if it is hidden inside a 
module-specific struct-of-hooks.
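
I.e. something like this, just as a sketch (using the existing
dmaengine types):

/* per-device transform hooks; a backend fills in what it supports and
 * nothing here needs an export, only dma_async_device_register() */
struct dma_xform_hooks {
	dma_cookie_t (*memcpy_buf_to_buf)(struct dma_chan *chan,
					  void *dest, void *src, size_t len);
	dma_cookie_t (*xor_dma_list_to_dma)(struct dma_chan *chan,
					    dma_addr_t dest, dma_addr_t *src,
					    unsigned int src_cnt, size_t len);
};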


> I understand what you are saying Jeff, the implementation can be made
> better, but something I think is valuable is the ability to write
> clients once like NET_DMA and RAID5_DMA and have them run without
> modification on any platform that can provide the engine interface
> rather than needing a client per architecture
> IOP_RAID5_DMA...FOO_X_RAID5_DMA.

It depends on the situation.

The hardware capabilities exported by each platform [or device] vary 
greatly, not only in the raw capabilities provided, but also in the 
level of offload.

In general, we don't want to see hardware-specific stuff in generic 
code, though...


> Or is this an example of the where "Do What You Must, And No More"
> comes in, i.e. don't worry about making a generic RAID5_DMA while
> there is only one implementation existence?

> I also want to pose the question of whether the dmaengine interface
> should handle cryptographic transforms?  We already have Acrypto:
> http://tservice.net.ru/~s0mbre/blog/devel/acrypto/index.html.  At the
> same time since IOPs can do Galois Field multiplication and XOR it
> would be nice to take advantage of that for crypto acceleration, but
> this does not fit the model of a device that Acrypto supports.

It would be quite interesting to see where the synergies are between the 
two, at the very least.  "async [transform|sum]" is a superset of "async 
crypto" after all.

	Jeff



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-12  5:47       ` Dan Williams
@ 2006-09-13  4:05         ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2006-09-13  4:05 UTC (permalink / raw)
  To: Dan Williams; +Cc: NeilBrown, linux-raid, akpm, linux-kernel, christopher.leech

Dan Williams wrote:
> On 9/11/06, Jeff Garzik <jeff@garzik.org> wrote:
>> Dan Williams wrote:
>> > This is a frequently asked question, Alan Cox had the same one at OLS.
>> > The answer is "probably."  The only complication I currently see is
>> > where/how the stripe cache is maintained.  With the IOPs its easy
>> > because the DMA engines operate directly on kernel memory.  With the
>> > Promise card I believe they have memory on the card and it's not clear
>> > to me if the XOR engines on the card can deal with host memory.  Also,
>> > MD would need to be modified to handle a stripe cache located on a
>> > device, or somehow synchronize its local cache with card in a manner
>> > that is still able to beat software only MD.
>>
>> sata_sx4 operates through [standard PC] memory on the card, and you use
>> a DMA engine to copy memory to/from the card.
>>
>> [select chipsets supported by] sata_promise operates directly on host
>> memory.
>>
>> So, while sata_sx4 is farther away from your direct-host-memory model,
>> it also has much more potential for RAID acceleration:  ideally, RAID1
>> just copies data to the card once, then copies the data to multiple
>> drives from there.  Similarly with RAID5, you can eliminate copies and
>> offload XOR, presuming the drives are all connected to the same card.
> In the sata_promise case its straight forward, all that is needed is
> dmaengine drivers for the xor and memcpy engines.  This would be
> similar to the current I/OAT model where dma resources are provided by
> a PCI function.  The sata_sx4 case would need a different flavor of
> the dma_do_raid5_block_ops routine, one that understands where the
> cache is located.  MD would also need the capability to bypass the
> block layer since the data will have already been transferred to the
> card by a stripe cache operation
> 
> The RAID1 case give me pause because it seems any work along these
> lines requires that the implementation work for both MD and DM, which
> then eventually leads to being tasked with merging the two.

RAID5 has similar properties.  If all devices in a RAID5 array are 
attached to a single SX4 card, then a high level write to the RAID5 
array is passed directly to the card, which then performs XOR, striping, 
etc.

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (19 preceding siblings ...)
  2006-09-11 23:38 ` [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Jeff Garzik
@ 2006-09-13  7:15 ` Jakob Oestergaard
  2006-09-13 19:17   ` Dan Williams
  2006-10-08 22:18 ` Neil Brown
  21 siblings, 1 reply; 55+ messages in thread
From: Jakob Oestergaard @ 2006-09-13  7:15 UTC (permalink / raw)
  To: Dan Williams; +Cc: NeilBrown, linux-raid, akpm, linux-kernel, christopher.leech

On Mon, Sep 11, 2006 at 04:00:32PM -0700, Dan Williams wrote:
> Neil,
> 
...
> 
> Concerning the context switching performance concerns raised at the
> previous release, I have observed the following.  For the hardware
> accelerated case it appears that performance is always better with the
> work queue than without since it allows multiple stripes to be operated
> on simultaneously.  I expect the same for an SMP platform, but so far my
> testing has been limited to IOPs.  For a single-processor
> non-accelerated configuration I have not observed performance
> degradation with work queue support enabled, but in the Kconfig option
> help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).

Out of curiosity; how does accelerated compare to non-accelerated?

-- 

 / jakob


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-13  7:15 ` Jakob Oestergaard
@ 2006-09-13 19:17   ` Dan Williams
  2006-09-14  7:42     ` Jakob Oestergaard
  0 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-13 19:17 UTC (permalink / raw)
  To: Jakob Oestergaard, Dan Williams, NeilBrown, linux-raid, akpm,
	linux-kernel, christopher.leech

On 9/13/06, Jakob Oestergaard <jakob@unthought.net> wrote:
> On Mon, Sep 11, 2006 at 04:00:32PM -0700, Dan Williams wrote:
> > Neil,
> >
> ...
> >
> > Concerning the context switching performance concerns raised at the
> > previous release, I have observed the following.  For the hardware
> > accelerated case it appears that performance is always better with the
> > work queue than without since it allows multiple stripes to be operated
> > on simultaneously.  I expect the same for an SMP platform, but so far my
> > testing has been limited to IOPs.  For a single-processor
> > non-accelerated configuration I have not observed performance
> > degradation with work queue support enabled, but in the Kconfig option
> > help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).
>
> Out of curiosity; how does accelerated compare to non-accelerated?

One quick example:
4-disk SATA array rebuild on iop321 without acceleration - 'top'
reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
utilization.

With acceleration - 'top' reports md0_resync cpu utilization at ~90%
with the rest split between md0_raid5 and md0_raid5_ops.

The sync speed reported by /proc/mdstat is ~40% higher in the accelerated case.

That being said, array resync is a special case, so your mileage may
vary with other applications.

I will put together some data from bonnie++, iozone, maybe contest,
and post it on SourceForge.

>  / jakob

Dan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-13 19:17   ` Dan Williams
@ 2006-09-14  7:42     ` Jakob Oestergaard
  2006-10-11  1:46       ` Dan Williams
  0 siblings, 1 reply; 55+ messages in thread
From: Jakob Oestergaard @ 2006-09-14  7:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dan Williams, NeilBrown, linux-raid, akpm, linux-kernel,
	christopher.leech

On Wed, Sep 13, 2006 at 12:17:55PM -0700, Dan Williams wrote:
...
> >Out of curiosity; how does accelerated compare to non-accelerated?
> 
> One quick example:
> 4-disk SATA array rebuild on iop321 without acceleration - 'top'
> reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
> utilization.
> 
> With acceleration - 'top' reports md0_resync cpu utilization at ~90%
> with the rest split between md0_raid5 and md0_raid5_ops.
> 
> The sync speed reported by /proc/mdstat is ~40% higher in the accelerated 
> case.

Ok, nice :)

> 
> That being said, array resync is a special case, so your mileage may
> vary with other applications.

Every-day usage I/O performance data would be nice indeed :)

> I will put together some data from bonnie++, iozone, maybe contest,
> and post it on SourceForge.

Great!

-- 

 / jakob


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/19] dmaengine: reduce backend address permutations
  2006-09-11 23:18 ` [PATCH 09/19] dmaengine: reduce backend address permutations Dan Williams
@ 2006-09-15 14:46   ` Olof Johansson
  0 siblings, 0 replies; 55+ messages in thread
From: Olof Johansson @ 2006-09-15 14:46 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Hi,

On Mon, 11 Sep 2006 16:18:23 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> From: Dan Williams <dan.j.williams@intel.com>
> 
> Change the backend dma driver API to accept a 'union dmaengine_addr'.  The
> intent is to be able to support a wide range of frontend address type
> permutations without needing an equal number of function type permutations
> on the backend.

Please do the cleanup of existing code before you add new functionality.
Earlier patches in this series added code that you're modifying here.
If you modify the existing code first, it's less churn for everyone to
review.


Thanks,

Olof

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines
  2006-09-11 23:19 ` [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines Dan Williams
@ 2006-09-15 14:57   ` Olof Johansson
  0 siblings, 0 replies; 55+ messages in thread
From: Olof Johansson @ 2006-09-15 14:57 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, linux-raid, akpm, linux-kernel, christopher.leech

Hi,

On Mon, 11 Sep 2006 16:19:00 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> From: Dan Williams <dan.j.williams@intel.com>
> 
> This is a driver for the iop DMA/AAU/ADMA units which are capable of pq_xor,
> pq_update, pq_zero_sum, xor, dual_xor, xor_zero_sum, fill, copy+crc, and copy
> operations.

You implement a bunch of different functions here.  I agree with Jeff's
feedback about the lack of scalability in the way the API is going
right now.

Another example of this is that the driver is doing its own self-test
of the functions.  This means that every backend driver will need to
duplicate this code.  Wouldn't it be easier for everyone if the common
infrastructure did a test call at the time a function is registered
instead, and returned failure if it doesn't pass?
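
Roughly, something the core could run per channel at registration time
(sketch only, written against today's memcpy calls; a real version
would need one such check per registered function type):

#include <linux/dmaengine.h>
#include <linux/random.h>
#include <linux/slab.h>

static int dma_chan_selftest(struct dma_chan *chan)
{
	u8 *src, *dest;
	dma_cookie_t cookie, last, used;
	int err = -ENOMEM;

	src = kmalloc(PAGE_SIZE, GFP_KERNEL);
	dest = kzalloc(PAGE_SIZE, GFP_KERNEL);
	if (!src || !dest)
		goto out;

	get_random_bytes(src, PAGE_SIZE);

	cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, PAGE_SIZE);
	if (cookie < 0) {
		err = cookie;
		goto out;
	}
	dma_async_memcpy_issue_pending(chan);

	while (dma_async_memcpy_complete(chan, cookie, &last, &used)
							== DMA_IN_PROGRESS)
		cpu_relax();

	err = memcmp(src, dest, PAGE_SIZE) ? -ENODEV : 0;
out:
	kfree(src);
	kfree(dest);
	return err;
}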

>  drivers/dma/Kconfig                 |   27 +
>  drivers/dma/Makefile                |    1 
>  drivers/dma/iop-adma.c              | 1501 +++++++++++++++++++++++++++++++++++
>  include/asm-arm/hardware/iop_adma.h |   98 ++

ioatdma.h is currently under drivers/dma/.  If the contents are strictly
device-related, please add them under drivers/dma as well.


-Olof

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/19] dmaengine: enable multiple clients and operations
  2006-09-11 23:44   ` Jeff Garzik
  2006-09-12  0:14     ` Dan Williams
@ 2006-09-15 16:38     ` Olof Johansson
  2006-09-15 19:44       ` [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations) Olof Johansson
  1 sibling, 1 reply; 55+ messages in thread
From: Olof Johansson @ 2006-09-15 16:38 UTC (permalink / raw)
  To: Jeff Garzik, Dan Williams, christopher.leech
  Cc: neilb, linux-raid, akpm, linux-kernel

On Mon, 11 Sep 2006 19:44:16 -0400 Jeff Garzik <jeff@garzik.org> wrote:

> Dan Williams wrote:
> > @@ -759,8 +755,10 @@ #endif
> >  	device->common.device_memcpy_buf_to_buf = ioat_dma_memcpy_buf_to_buf;
> >  	device->common.device_memcpy_buf_to_pg = ioat_dma_memcpy_buf_to_pg;
> >  	device->common.device_memcpy_pg_to_pg = ioat_dma_memcpy_pg_to_pg;
> > -	device->common.device_memcpy_complete = ioat_dma_is_complete;
> > -	device->common.device_memcpy_issue_pending = ioat_dma_memcpy_issue_pending;
> > +	device->common.device_operation_complete = ioat_dma_is_complete;
> > +	device->common.device_xor_pgs_to_pg = dma_async_xor_pgs_to_pg_err;
> > +	device->common.device_issue_pending = ioat_dma_memcpy_issue_pending;
> > +	device->common.capabilities = DMA_MEMCPY;
> 
> 
> Are we really going to add a set of hooks for each DMA engine whizbang 
> feature?
> 
> That will get ugly when DMA engines support memcpy, xor, crc32, sha1, 
> aes, and a dozen other transforms.


Yes, it will be unmaintainable. We need some sort of multiplexing with
per-function registrations.

Here's a first cut at it, just very quick. It could be improved further
but it shows that we could exorcise most of the hardcoded things pretty
easily.

Dan, would this fit with your added XOR stuff as well?  If so, would you
mind rebasing on top of something like this (with your further cleanups
going in before the added functionality, please :-)

(Build tested only, since I lack Intel hardware).


It would be nice if we could move the type specification to only be
needed in the channel allocation. I don't know how well that fits the
model for some of the hardware platforms though, since a single channel
might be shared for different types of functions. Maybe we need a
different level of abstraction there instead, i.e. divorce the hardware
channel and software channel model and have several software channels
map onto a hardware one.
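
E.g., roughly (no such structure exists in the patch below; it only
illustrates the idea, reusing the dma_function_type added there):

/* a lightweight per-client channel mapped onto a shared hw channel */
struct dma_sw_chan {
	struct dma_chan		*hw_chan;	/* the physical engine   */
	enum dma_function_type	type;		/* what this client uses */
	struct dma_client	*client;
	struct list_head	node;		/* siblings on hw_chan   */
};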





Clean up the DMA API a bit, allowing each engine to register an array
of supported functions instead of allocating static names for each possible
function.


Signed-off-by: Olof Johansson <olof@lixom.net>


diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 1527804..282ce85 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -80,7 +80,7 @@ static ssize_t show_memcpy_count(struct 
 	int i;
 
 	for_each_possible_cpu(i)
-		count += per_cpu_ptr(chan->local, i)->memcpy_count;
+		count += per_cpu_ptr(chan->local, i)->count;
 
 	return sprintf(buf, "%lu\n", count);
 }
@@ -105,7 +105,7 @@ static ssize_t show_in_use(struct class_
 }
 
 static struct class_device_attribute dma_class_attrs[] = {
-	__ATTR(memcpy_count, S_IRUGO, show_memcpy_count, NULL),
+	__ATTR(count, S_IRUGO, show_memcpy_count, NULL),
 	__ATTR(bytes_transferred, S_IRUGO, show_bytes_transferred, NULL),
 	__ATTR(in_use, S_IRUGO, show_in_use, NULL),
 	__ATTR_NULL
@@ -402,11 +402,11 @@ subsys_initcall(dma_bus_init);
 EXPORT_SYMBOL(dma_async_client_register);
 EXPORT_SYMBOL(dma_async_client_unregister);
 EXPORT_SYMBOL(dma_async_client_chan_request);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_complete);
-EXPORT_SYMBOL(dma_async_memcpy_issue_pending);
+EXPORT_SYMBOL(dma_async_buf_to_buf);
+EXPORT_SYMBOL(dma_async_buf_to_pg);
+EXPORT_SYMBOL(dma_async_pg_to_pg);
+EXPORT_SYMBOL(dma_async_complete);
+EXPORT_SYMBOL(dma_async_issue_pending);
 EXPORT_SYMBOL(dma_async_device_register);
 EXPORT_SYMBOL(dma_async_device_unregister);
 EXPORT_SYMBOL(dma_chan_cleanup);
diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c
index dbd4d6c..6cbed42 100644
--- a/drivers/dma/ioatdma.c
+++ b/drivers/dma/ioatdma.c
@@ -40,6 +40,7 @@
 #define to_ioat_device(dev) container_of(dev, struct ioat_device, common)
 #define to_ioat_desc(lh) container_of(lh, struct ioat_desc_sw, node)
 
+
 /* internal functions */
 static int __devinit ioat_probe(struct pci_dev *pdev, const struct pci_device_id *ent);
 static void __devexit ioat_remove(struct pci_dev *pdev);
@@ -681,6 +682,14 @@ out:
 	return err;
 }
 
+struct dma_function ioat_memcpy_functions = {
+	.buf_to_buf = ioat_dma_memcpy_buf_to_buf,
+	.buf_to_pg = ioat_dma_memcpy_buf_to_pg,
+	.pg_to_pg = ioat_dma_memcpy_pg_to_pg,
+	.complete = ioat_dma_is_complete,
+	.issue_pending = ioat_dma_memcpy_issue_pending,
+};
+
 static int __devinit ioat_probe(struct pci_dev *pdev,
                                 const struct pci_device_id *ent)
 {
@@ -756,11 +765,8 @@ static int __devinit ioat_probe(struct p
 
 	device->common.device_alloc_chan_resources = ioat_dma_alloc_chan_resources;
 	device->common.device_free_chan_resources = ioat_dma_free_chan_resources;
-	device->common.device_memcpy_buf_to_buf = ioat_dma_memcpy_buf_to_buf;
-	device->common.device_memcpy_buf_to_pg = ioat_dma_memcpy_buf_to_pg;
-	device->common.device_memcpy_pg_to_pg = ioat_dma_memcpy_pg_to_pg;
-	device->common.device_memcpy_complete = ioat_dma_is_complete;
-	device->common.device_memcpy_issue_pending = ioat_dma_memcpy_issue_pending;
+	device->common.funcs[DMAFUNC_MEMCPY] = &ioat_memcpy_functions;
+
 	printk(KERN_INFO "Intel(R) I/OAT DMA Engine found, %d channels\n",
 		device->common.chancnt);
 
diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index d637555..8a2f642 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -151,11 +151,8 @@ static dma_cookie_t dma_memcpy_to_kernel
 	while (len > 0) {
 		if (iov->iov_len) {
 			int copy = min_t(unsigned int, iov->iov_len, len);
-			dma_cookie = dma_async_memcpy_buf_to_buf(
-					chan,
-					iov->iov_base,
-					kdata,
-					copy);
+			dma_cookie = dma_async_buf_to_buf(DMAFUNC_MEMCPY, chan,
+					iov->iov_base, kdata, copy);
 			kdata += copy;
 			len -= copy;
 			iov->iov_len -= copy;
@@ -210,7 +207,7 @@ dma_cookie_t dma_memcpy_to_iovec(struct 
 			copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
 			copy = min_t(int, copy, iov[iovec_idx].iov_len);
 
-			dma_cookie = dma_async_memcpy_buf_to_pg(chan,
+			dma_cookie = dma_async_buf_to_pg(DMAFUNC_MEMCPY, chan,
 					page_list->pages[page_idx],
 					iov_byte_offset,
 					kdata,
@@ -274,7 +271,7 @@ dma_cookie_t dma_memcpy_pg_to_iovec(stru
 			copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
 			copy = min_t(int, copy, iov[iovec_idx].iov_len);
 
-			dma_cookie = dma_async_memcpy_pg_to_pg(chan,
+			dma_cookie = dma_async_pg_to_pg(DMAFUNC_MEMCPY, chan,
 					page_list->pages[page_idx],
 					iov_byte_offset,
 					page,
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index c94d8f1..317a7f2 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -67,14 +67,14 @@ enum dma_status {
 /**
  * struct dma_chan_percpu - the per-CPU part of struct dma_chan
  * @refcount: local_t used for open-coded "bigref" counting
- * @memcpy_count: transaction counter
+ * @count: transaction counter
  * @bytes_transferred: byte counter
  */
 
 struct dma_chan_percpu {
 	local_t refcount;
 	/* stats */
-	unsigned long memcpy_count;
+	unsigned long count;
 	unsigned long bytes_transferred;
 };
 
@@ -157,6 +157,34 @@ struct dma_client {
 	struct list_head	global_node;
 };
 
+enum dma_function_type {
+	DMAFUNC_MEMCPY = 0,
+	DMAFUNC_XOR,
+	DMAFUNC_MAX
+};
+
+/* struct dma_function
+ * @buf_to_pg: buf pointer to struct page
+ * @pg_to_pg: struct page/offset to struct page/offset
+ * @complete: poll the status of a DMA transaction
+ * @issue_pending: push appended descriptors to hardware
+ */
+struct dma_function {
+	dma_cookie_t (*buf_to_buf)(struct dma_chan *chan,
+				void *dest, void *src, size_t len);
+	dma_cookie_t (*buf_to_pg)(struct dma_chan *chan,
+				struct page *page, unsigned int offset,
+				void *kdata, size_t len);
+	dma_cookie_t (*pg_to_pg)(struct dma_chan *chan,
+				struct page *dest_pg, unsigned int dest_off,
+				struct page *src_pg, unsigned int src_off,
+				size_t len);
+	enum dma_status (*complete)(struct dma_chan *chan,
+				dma_cookie_t cookie, dma_cookie_t *last,
+				dma_cookie_t *used);
+	void (*issue_pending)(struct dma_chan *chan);
+};
+
 /**
  * struct dma_device - info on the entity supplying DMA services
  * @chancnt: how many DMA channels are supported
@@ -168,14 +196,8 @@ struct dma_client {
  * @device_alloc_chan_resources: allocate resources and return the
  *	number of allocated descriptors
  * @device_free_chan_resources: release DMA channel's resources
- * @device_memcpy_buf_to_buf: memcpy buf pointer to buf pointer
- * @device_memcpy_buf_to_pg: memcpy buf pointer to struct page
- * @device_memcpy_pg_to_pg: memcpy struct page/offset to struct page/offset
- * @device_memcpy_complete: poll the status of an IOAT DMA transaction
- * @device_memcpy_issue_pending: push appended descriptors to hardware
  */
 struct dma_device {
-
 	unsigned int chancnt;
 	struct list_head channels;
 	struct list_head global_node;
@@ -185,20 +207,10 @@ struct dma_device {
 
 	int dev_id;
 
+	struct dma_function *funcs[DMAFUNC_MAX];
+
 	int (*device_alloc_chan_resources)(struct dma_chan *chan);
 	void (*device_free_chan_resources)(struct dma_chan *chan);
-	dma_cookie_t (*device_memcpy_buf_to_buf)(struct dma_chan *chan,
-			void *dest, void *src, size_t len);
-	dma_cookie_t (*device_memcpy_buf_to_pg)(struct dma_chan *chan,
-			struct page *page, unsigned int offset, void *kdata,
-			size_t len);
-	dma_cookie_t (*device_memcpy_pg_to_pg)(struct dma_chan *chan,
-			struct page *dest_pg, unsigned int dest_off,
-			struct page *src_pg, unsigned int src_off, size_t len);
-	enum dma_status (*device_memcpy_complete)(struct dma_chan *chan,
-			dma_cookie_t cookie, dma_cookie_t *last,
-			dma_cookie_t *used);
-	void (*device_memcpy_issue_pending)(struct dma_chan *chan);
 };
 
 /* --- public DMA engine API --- */
@@ -209,7 +221,7 @@ void dma_async_client_chan_request(struc
 		unsigned int number);
 
 /**
- * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
+ * dma_async_buf_to_buf - offloaded copy between virtual addresses
  * @chan: DMA channel to offload copy to
  * @dest: destination address (virtual)
  * @src: source address (virtual)
@@ -220,19 +232,24 @@ void dma_async_client_chan_request(struc
  * Both @dest and @src must stay memory resident (kernel memory or locked
  * user space pages).
  */
-static inline dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
-	void *dest, void *src, size_t len)
+static inline dma_cookie_t dma_async_buf_to_buf(enum dma_function_type type,
+		struct dma_chan *chan, void *dest, void *src, size_t len)
 {
-	int cpu = get_cpu();
+	int cpu;
+
+	if (!chan->device->funcs[type])
+		return -ENXIO;
+
+	cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_buf_to_buf(chan, dest, src, len);
+	return chan->device->funcs[type]->buf_to_buf(chan, dest, src, len);
 }
 
 /**
- * dma_async_memcpy_buf_to_pg - offloaded copy from address to page
+ * dma_async_buf_to_pg - offloaded copy from address to page
  * @chan: DMA channel to offload copy to
  * @page: destination page
  * @offset: offset in page to copy to
@@ -244,20 +261,26 @@ static inline dma_cookie_t dma_async_mem
  * Both @page/@offset and @kdata must stay memory resident (kernel memory or
  * locked user space pages)
  */
-static inline dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
-	struct page *page, unsigned int offset, void *kdata, size_t len)
+static inline dma_cookie_t dma_async_buf_to_pg(enum dma_function_type type,
+		struct dma_chan *chan, struct page *page, unsigned int offset,
+		void *kdata, size_t len)
 {
-	int cpu = get_cpu();
+	int cpu;
+
+	if (!chan->device->funcs[type])
+		return -ENXIO;
+
+	cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_buf_to_pg(chan, page, offset,
-	                                             kdata, len);
+	return chan->device->funcs[type]->buf_to_pg(chan, page, offset,
+							kdata, len);
 }
 
 /**
- * dma_async_memcpy_pg_to_pg - offloaded copy from page to page
+ * dma_async_pg_to_pg - offloaded copy from page to page
  * @chan: DMA channel to offload copy to
  * @dest_pg: destination page
  * @dest_off: offset in page to copy to
@@ -270,33 +293,40 @@ static inline dma_cookie_t dma_async_mem
  * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
  * (kernel memory or locked user space pages).
  */
-static inline dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
-	struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
-	unsigned int src_off, size_t len)
+static inline dma_cookie_t dma_async_pg_to_pg(enum dma_function_type type,
+		struct dma_chan *chan, struct page *dest_pg, unsigned int dest_off,
+		struct page *src_pg, unsigned int src_off, size_t len)
 {
-	int cpu = get_cpu();
+	int cpu;
+
+	if (!chan->device->funcs[type])
+		return -ENXIO;
+
+	cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_pg_to_pg(chan, dest_pg, dest_off,
-	                                            src_pg, src_off, len);
+	return chan->device->funcs[type]->pg_to_pg(chan, dest_pg, dest_off,
+							src_pg, src_off, len);
 }
 
 /**
- * dma_async_memcpy_issue_pending - flush pending copies to HW
+ * dma_async_issue_pending - flush pending copies to HW
  * @chan: target DMA channel
  *
  * This allows drivers to push copies to HW in batches,
  * reducing MMIO writes where possible.
  */
-static inline void dma_async_memcpy_issue_pending(struct dma_chan *chan)
+static inline void dma_async_issue_pending(enum dma_function_type type,
+		struct dma_chan *chan)
 {
-	return chan->device->device_memcpy_issue_pending(chan);
+	if (chan->device->funcs[type])
+		return chan->device->funcs[type]->issue_pending(chan);
 }
 
 /**
- * dma_async_memcpy_complete - poll for transaction completion
+ * dma_async_complete - poll for transaction completion
  * @chan: DMA channel
  * @cookie: transaction identifier to check status of
  * @last: returns last completed cookie, can be NULL
@@ -306,10 +336,14 @@ static inline void dma_async_memcpy_issu
  * internal state and can be used with dma_async_is_complete() to check
  * the status of multiple cookies without re-checking hardware state.
  */
-static inline enum dma_status dma_async_memcpy_complete(struct dma_chan *chan,
-	dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used)
+static inline enum dma_status dma_async_complete(enum dma_function_type type,
+		struct dma_chan *chan, dma_cookie_t cookie, dma_cookie_t *last,
+		dma_cookie_t *used)
 {
-	return chan->device->device_memcpy_complete(chan, cookie, last, used);
+	if (!chan->device->funcs[type])
+		return -ENXIO;
+	else
+		return chan->device->funcs[type]->complete(chan, cookie, last, used);
 }
 
 /**
@@ -318,7 +352,7 @@ static inline enum dma_status dma_async_
  * @last_complete: last know completed transaction
  * @last_used: last cookie value handed out
  *
- * dma_async_is_complete() is used in dma_async_memcpy_complete()
+ * dma_async_is_complete() is used in dma_async_complete()
  * the test logic is seperated for lightweight testing of multiple cookies
  */
 static inline enum dma_status dma_async_is_complete(dma_cookie_t cookie,
diff --git a/net/core/dev.c b/net/core/dev.c
index d4a1ec3..e8a8ee9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1945,7 +1945,7 @@ out:
 		struct dma_chan *chan;
 		rcu_read_lock();
 		list_for_each_entry_rcu(chan, &net_dma_client->channels, client_node)
-			dma_async_memcpy_issue_pending(chan);
+			dma_async_issue_pending(DMAFUNC_MEMCPY, chan);
 		rcu_read_unlock();
 	}
 #endif
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 934396b..c270837 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1431,9 +1431,9 @@ skip_copy:
 		struct sk_buff *skb;
 		dma_cookie_t done, used;
 
-		dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
+		dma_async_issue_pending(DMAFUNC_MEMCPY, tp->ucopy.dma_chan);
 
-		while (dma_async_memcpy_complete(tp->ucopy.dma_chan,
+		while (dma_async_complete(DMAFUNC_MEMCPY, tp->ucopy.dma_chan,
 		                                 tp->ucopy.dma_cookie, &done,
 		                                 &used) == DMA_IN_PROGRESS) {
 			/* do partial cleanup of sk_async_wait_queue */



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations)
  2006-09-15 16:38     ` Olof Johansson
@ 2006-09-15 19:44       ` Olof Johansson
  2006-09-15 20:02         ` [PATCH] [v2] " Olof Johansson
  2006-09-18 22:56         ` [PATCH] " Dan Williams
  0 siblings, 2 replies; 55+ messages in thread
From: Olof Johansson @ 2006-09-15 19:44 UTC (permalink / raw)
  To: Dan Williams, christopher.leech
  Cc: Jeff Garzik, neilb, linux-raid, akpm, linux-kernel

On Fri, 15 Sep 2006 11:38:17 -0500 Olof Johansson <olof@lixom.net> wrote:

> On Mon, 11 Sep 2006 19:44:16 -0400 Jeff Garzik <jeff@garzik.org> wrote:

> > Are we really going to add a set of hooks for each DMA engine whizbang 
> > feature?
> > 
> > That will get ugly when DMA engines support memcpy, xor, crc32, sha1, 
> > aes, and a dozen other transforms.
> 
> 
> Yes, it will be unmaintainable. We need some sort of multiplexing with
> per-function registrations.
> 
> Here's a first cut at it, just very quick. It could be improved further
> but it shows that we could exorcise most of the hardcoded things pretty
> easily.

Ok, that was obviously a naive and not so nice first attempt, but I
figured it was worth it to show how it can be done.

This is a little more proper: specify at client registration time which
function the client will use, and make the channel use it.  This way
most of the per-call error checking can be removed too.

Chris/Dan: Please consider picking this up as a base for the added
functionality and cleanups.





Clean up dmaengine a bit. Make the client registration specify which
channel functions ("type") the client will use. Also, make devices
register which functions they will provide.

Also exorcise most of the memcpy-specific references from the generic
dma engine code. There's still some left in the iov stuff.


Signed-off-by: Olof Johansson <olof@lixom.net>

Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c
+++ linux-2.6/drivers/dma/dmaengine.c
@@ -73,14 +73,14 @@ static LIST_HEAD(dma_client_list);
 
 /* --- sysfs implementation --- */
 
-static ssize_t show_memcpy_count(struct class_device *cd, char *buf)
+static ssize_t show_count(struct class_device *cd, char *buf)
 {
 	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
 	unsigned long count = 0;
 	int i;
 
 	for_each_possible_cpu(i)
-		count += per_cpu_ptr(chan->local, i)->memcpy_count;
+		count += per_cpu_ptr(chan->local, i)->count;
 
 	return sprintf(buf, "%lu\n", count);
 }
@@ -105,7 +105,7 @@ static ssize_t show_in_use(struct class_
 }
 
 static struct class_device_attribute dma_class_attrs[] = {
-	__ATTR(memcpy_count, S_IRUGO, show_memcpy_count, NULL),
+	__ATTR(count, S_IRUGO, show_count, NULL),
 	__ATTR(bytes_transferred, S_IRUGO, show_bytes_transferred, NULL),
 	__ATTR(in_use, S_IRUGO, show_in_use, NULL),
 	__ATTR_NULL
@@ -142,6 +142,10 @@ static struct dma_chan *dma_client_chan_
 
 	/* Find a channel, any DMA engine will do */
 	list_for_each_entry(device, &dma_device_list, global_node) {
+		/* Skip devices that don't provide the right function */
+		if (!device->funcs[client->type])
+			continue;
+
 		list_for_each_entry(chan, &device->channels, device_node) {
 			if (chan->client)
 				continue;
@@ -241,7 +245,8 @@ static void dma_chans_rebalance(void)
  * dma_async_client_register - allocate and register a &dma_client
  * @event_callback: callback for notification of channel addition/removal
  */
-struct dma_client *dma_async_client_register(dma_event_callback event_callback)
+struct dma_client *dma_async_client_register(enum dma_function_type type,
+		dma_event_callback event_callback)
 {
 	struct dma_client *client;
 
@@ -254,6 +259,7 @@ struct dma_client *dma_async_client_regi
 	client->chans_desired = 0;
 	client->chan_count = 0;
 	client->event_callback = event_callback;
+	client->type = type;
 
 	mutex_lock(&dma_list_mutex);
 	list_add_tail(&client->global_node, &dma_client_list);
@@ -402,11 +408,11 @@ subsys_initcall(dma_bus_init);
 EXPORT_SYMBOL(dma_async_client_register);
 EXPORT_SYMBOL(dma_async_client_unregister);
 EXPORT_SYMBOL(dma_async_client_chan_request);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_complete);
-EXPORT_SYMBOL(dma_async_memcpy_issue_pending);
+EXPORT_SYMBOL(dma_async_buf_to_buf);
+EXPORT_SYMBOL(dma_async_buf_to_pg);
+EXPORT_SYMBOL(dma_async_pg_to_pg);
+EXPORT_SYMBOL(dma_async_complete);
+EXPORT_SYMBOL(dma_async_issue_pending);
 EXPORT_SYMBOL(dma_async_device_register);
 EXPORT_SYMBOL(dma_async_device_unregister);
 EXPORT_SYMBOL(dma_chan_cleanup);
Index: linux-2.6/drivers/dma/ioatdma.c
===================================================================
--- linux-2.6.orig/drivers/dma/ioatdma.c
+++ linux-2.6/drivers/dma/ioatdma.c
@@ -681,6 +682,14 @@ out:
 	return err;
 }
 
+struct dma_function ioat_memcpy_functions = {
+	.buf_to_buf = ioat_dma_memcpy_buf_to_buf,
+	.buf_to_pg = ioat_dma_memcpy_buf_to_pg,
+	.pg_to_pg = ioat_dma_memcpy_pg_to_pg,
+	.complete = ioat_dma_is_complete,
+	.issue_pending = ioat_dma_memcpy_issue_pending,
+};
+
 static int __devinit ioat_probe(struct pci_dev *pdev,
                                 const struct pci_device_id *ent)
 {
@@ -756,11 +765,8 @@ static int __devinit ioat_probe(struct p
 
 	device->common.device_alloc_chan_resources = ioat_dma_alloc_chan_resources;
 	device->common.device_free_chan_resources = ioat_dma_free_chan_resources;
-	device->common.device_memcpy_buf_to_buf = ioat_dma_memcpy_buf_to_buf;
-	device->common.device_memcpy_buf_to_pg = ioat_dma_memcpy_buf_to_pg;
-	device->common.device_memcpy_pg_to_pg = ioat_dma_memcpy_pg_to_pg;
-	device->common.device_memcpy_complete = ioat_dma_is_complete;
-	device->common.device_memcpy_issue_pending = ioat_dma_memcpy_issue_pending;
+	device->common.funcs[DMAFUNC_MEMCPY] = &ioat_memcpy_functions;
+
 	printk(KERN_INFO "Intel(R) I/OAT DMA Engine found, %d channels\n",
 		device->common.chancnt);
 
Index: linux-2.6/include/linux/dmaengine.h
===================================================================
--- linux-2.6.orig/include/linux/dmaengine.h
+++ linux-2.6/include/linux/dmaengine.h
@@ -67,14 +67,14 @@ enum dma_status {
 /**
  * struct dma_chan_percpu - the per-CPU part of struct dma_chan
  * @refcount: local_t used for open-coded "bigref" counting
- * @memcpy_count: transaction counter
+ * @count: transaction counter
  * @bytes_transferred: byte counter
  */
 
 struct dma_chan_percpu {
 	local_t refcount;
 	/* stats */
-	unsigned long memcpy_count;
+	unsigned long count;
 	unsigned long bytes_transferred;
 };
 
@@ -138,6 +138,15 @@ static inline void dma_chan_put(struct d
 typedef void (*dma_event_callback) (struct dma_client *client,
 		struct dma_chan *chan, enum dma_event event);
 
+/*
+ * dma_function_type - one entry for every possible function type provided
+ */
+enum dma_function_type {
+	DMAFUNC_MEMCPY = 0,
+	DMAFUNC_XOR,
+	DMAFUNC_MAX
+};
+
 /**
  * struct dma_client - info on the entity making use of DMA services
  * @event_callback: func ptr to call when something happens
@@ -152,11 +161,35 @@ struct dma_client {
 	unsigned int		chan_count;
 	unsigned int		chans_desired;
 
+	enum dma_function_type	type;
+
 	spinlock_t		lock;
 	struct list_head	channels;
 	struct list_head	global_node;
 };
 
+/* struct dma_function
+ * @buf_to_pg: buf pointer to struct page
+ * @pg_to_pg: struct page/offset to struct page/offset
+ * @complete: poll the status of a DMA transaction
+ * @issue_pending: push appended descriptors to hardware
+ */
+struct dma_function {
+	dma_cookie_t (*buf_to_buf)(struct dma_chan *chan,
+				void *dest, void *src, size_t len);
+	dma_cookie_t (*buf_to_pg)(struct dma_chan *chan,
+				struct page *page, unsigned int offset,
+				void *kdata, size_t len);
+	dma_cookie_t (*pg_to_pg)(struct dma_chan *chan,
+				struct page *dest_pg, unsigned int dest_off,
+				struct page *src_pg, unsigned int src_off,
+				size_t len);
+	enum dma_status (*complete)(struct dma_chan *chan,
+				dma_cookie_t cookie, dma_cookie_t *last,
+				dma_cookie_t *used);
+	void (*issue_pending)(struct dma_chan *chan);
+};
+
 /**
  * struct dma_device - info on the entity supplying DMA services
  * @chancnt: how many DMA channels are supported
@@ -168,14 +201,8 @@ struct dma_client {
  * @device_alloc_chan_resources: allocate resources and return the
  *	number of allocated descriptors
  * @device_free_chan_resources: release DMA channel's resources
- * @device_memcpy_buf_to_buf: memcpy buf pointer to buf pointer
- * @device_memcpy_buf_to_pg: memcpy buf pointer to struct page
- * @device_memcpy_pg_to_pg: memcpy struct page/offset to struct page/offset
- * @device_memcpy_complete: poll the status of an IOAT DMA transaction
- * @device_memcpy_issue_pending: push appended descriptors to hardware
  */
 struct dma_device {
-
 	unsigned int chancnt;
 	struct list_head channels;
 	struct list_head global_node;
@@ -185,31 +212,24 @@ struct dma_device {
 
 	int dev_id;
 
+	struct dma_function *funcs[DMAFUNC_MAX];
+
 	int (*device_alloc_chan_resources)(struct dma_chan *chan);
 	void (*device_free_chan_resources)(struct dma_chan *chan);
-	dma_cookie_t (*device_memcpy_buf_to_buf)(struct dma_chan *chan,
-			void *dest, void *src, size_t len);
-	dma_cookie_t (*device_memcpy_buf_to_pg)(struct dma_chan *chan,
-			struct page *page, unsigned int offset, void *kdata,
-			size_t len);
-	dma_cookie_t (*device_memcpy_pg_to_pg)(struct dma_chan *chan,
-			struct page *dest_pg, unsigned int dest_off,
-			struct page *src_pg, unsigned int src_off, size_t len);
-	enum dma_status (*device_memcpy_complete)(struct dma_chan *chan,
-			dma_cookie_t cookie, dma_cookie_t *last,
-			dma_cookie_t *used);
-	void (*device_memcpy_issue_pending)(struct dma_chan *chan);
 };
 
+#define CHAN2FUNCS(chan) (chan->device->funcs[chan->client->type])
+
 /* --- public DMA engine API --- */
 
-struct dma_client *dma_async_client_register(dma_event_callback event_callback);
+struct dma_client *dma_async_client_register(enum dma_function_type type,
+		dma_event_callback event_callback);
 void dma_async_client_unregister(struct dma_client *client);
 void dma_async_client_chan_request(struct dma_client *client,
 		unsigned int number);
 
 /**
- * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
+ * dma_async_buf_to_buf - offloaded copy between virtual addresses
  * @chan: DMA channel to offload copy to
  * @dest: destination address (virtual)
  * @src: source address (virtual)
@@ -220,19 +240,19 @@ void dma_async_client_chan_request(struc
  * Both @dest and @src must stay memory resident (kernel memory or locked
  * user space pages).
  */
-static inline dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
-	void *dest, void *src, size_t len)
+static inline dma_cookie_t dma_async_buf_to_buf(struct dma_chan *chan,
+		void *dest, void *src, size_t len)
 {
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_buf_to_buf(chan, dest, src, len);
+	return CHAN2FUNCS(chan)->buf_to_buf(chan, dest, src, len);
 }
 
 /**
- * dma_async_memcpy_buf_to_pg - offloaded copy from address to page
+ * dma_async_buf_to_pg - offloaded copy from address to page
  * @chan: DMA channel to offload copy to
  * @page: destination page
  * @offset: offset in page to copy to
@@ -244,20 +264,21 @@ static inline dma_cookie_t dma_async_mem
  * Both @page/@offset and @kdata must stay memory resident (kernel memory or
  * locked user space pages)
  */
-static inline dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
-	struct page *page, unsigned int offset, void *kdata, size_t len)
+static inline dma_cookie_t dma_async_buf_to_pg(struct dma_chan *chan,
+		struct page *page, unsigned int offset,
+		void *kdata, size_t len)
 {
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_buf_to_pg(chan, page, offset,
-	                                             kdata, len);
+	return CHAN2FUNCS(chan)->buf_to_pg(chan, page, offset,
+						kdata, len);
 }
 
 /**
- * dma_async_memcpy_pg_to_pg - offloaded copy from page to page
+ * dma_async_pg_to_pg - offloaded copy from page to page
  * @chan: DMA channel to offload copy to
  * @dest_pg: destination page
  * @dest_off: offset in page to copy to
@@ -270,33 +291,33 @@ static inline dma_cookie_t dma_async_mem
  * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
  * (kernel memory or locked user space pages).
  */
-static inline dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
-	struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
-	unsigned int src_off, size_t len)
+static inline dma_cookie_t dma_async_pg_to_pg( struct dma_chan *chan,
+		struct page *dest_pg, unsigned int dest_off,
+		struct page *src_pg, unsigned int src_off, size_t len)
 {
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_pg_to_pg(chan, dest_pg, dest_off,
-	                                            src_pg, src_off, len);
+	return CHAN2FUNCS(chan)->pg_to_pg(chan, dest_pg, dest_off,
+						src_pg, src_off, len);
 }
 
 /**
- * dma_async_memcpy_issue_pending - flush pending copies to HW
+ * dma_async_issue_pending - flush pending copies to HW
  * @chan: target DMA channel
  *
  * This allows drivers to push copies to HW in batches,
  * reducing MMIO writes where possible.
  */
-static inline void dma_async_memcpy_issue_pending(struct dma_chan *chan)
+static inline void dma_async_issue_pending(struct dma_chan *chan)
 {
-	return chan->device->device_memcpy_issue_pending(chan);
+	return CHAN2FUNCS(chan)->issue_pending(chan);
 }
 
 /**
- * dma_async_memcpy_complete - poll for transaction completion
+ * dma_async_complete - poll for transaction completion
  * @chan: DMA channel
  * @cookie: transaction identifier to check status of
  * @last: returns last completed cookie, can be NULL
@@ -306,10 +327,11 @@ static inline void dma_async_memcpy_issu
  * internal state and can be used with dma_async_is_complete() to check
  * the status of multiple cookies without re-checking hardware state.
  */
-static inline enum dma_status dma_async_memcpy_complete(struct dma_chan *chan,
-	dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used)
+static inline enum dma_status dma_async_complete(struct dma_chan *chan,
+		dma_cookie_t cookie, dma_cookie_t *last,
+		dma_cookie_t *used)
 {
-	return chan->device->device_memcpy_complete(chan, cookie, last, used);
+	return CHAN2FUNCS(chan)->complete(chan, cookie, last, used);
 }
 
 /**
@@ -318,7 +340,7 @@ static inline enum dma_status dma_async_
  * @last_complete: last know completed transaction
  * @last_used: last cookie value handed out
  *
- * dma_async_is_complete() is used in dma_async_memcpy_complete()
+ * dma_async_is_complete() is used in dma_async_complete()
  * the test logic is seperated for lightweight testing of multiple cookies
  */
 static inline enum dma_status dma_async_is_complete(dma_cookie_t cookie,
Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -1945,7 +1945,7 @@ out:
 		struct dma_chan *chan;
 		rcu_read_lock();
 		list_for_each_entry_rcu(chan, &net_dma_client->channels, client_node)
-			dma_async_memcpy_issue_pending(chan);
+			dma_async_issue_pending(chan);
 		rcu_read_unlock();
 	}
 #endif
@@ -3467,7 +3467,7 @@ static void netdev_dma_event(struct dma_
 static int __init netdev_dma_register(void)
 {
 	spin_lock_init(&net_dma_event_lock);
-	net_dma_client = dma_async_client_register(netdev_dma_event);
+	net_dma_client = dma_async_client_register(DMAFUNC_MEMCPY, netdev_dma_event);
 	if (net_dma_client == NULL)
 		return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] [v2] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations)
  2006-09-15 19:44       ` [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations) Olof Johansson
@ 2006-09-15 20:02         ` Olof Johansson
  2006-09-18 22:56         ` [PATCH] " Dan Williams
  1 sibling, 0 replies; 55+ messages in thread
From: Olof Johansson @ 2006-09-15 20:02 UTC (permalink / raw)
  To: Dan Williams, christopher.leech
  Cc: Jeff Garzik, neilb, linux-raid, akpm, linux-kernel

[Bad day, forgot a quilt refresh.]




Clean up dmaengine a bit. Make the client registration specify which
channel functions ("type") the client will use. Also, make devices
register which functions they will provide.

Also exorcise most of the memcpy-specific references from the generic
dma engine code. There's still some left in the iov stuff.


Signed-off-by: Olof Johansson <olof@lixom.net>

Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c
+++ linux-2.6/drivers/dma/dmaengine.c
@@ -73,14 +73,14 @@ static LIST_HEAD(dma_client_list);
 
 /* --- sysfs implementation --- */
 
-static ssize_t show_memcpy_count(struct class_device *cd, char *buf)
+static ssize_t show_count(struct class_device *cd, char *buf)
 {
 	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
 	unsigned long count = 0;
 	int i;
 
 	for_each_possible_cpu(i)
-		count += per_cpu_ptr(chan->local, i)->memcpy_count;
+		count += per_cpu_ptr(chan->local, i)->count;
 
 	return sprintf(buf, "%lu\n", count);
 }
@@ -105,7 +105,7 @@ static ssize_t show_in_use(struct class_
 }
 
 static struct class_device_attribute dma_class_attrs[] = {
-	__ATTR(memcpy_count, S_IRUGO, show_memcpy_count, NULL),
+	__ATTR(count, S_IRUGO, show_count, NULL),
 	__ATTR(bytes_transferred, S_IRUGO, show_bytes_transferred, NULL),
 	__ATTR(in_use, S_IRUGO, show_in_use, NULL),
 	__ATTR_NULL
@@ -142,6 +142,10 @@ static struct dma_chan *dma_client_chan_
 
 	/* Find a channel, any DMA engine will do */
 	list_for_each_entry(device, &dma_device_list, global_node) {
+		/* Skip devices that don't provide the right function */
+		if (!device->funcs[client->type])
+			continue;
+
 		list_for_each_entry(chan, &device->channels, device_node) {
 			if (chan->client)
 				continue;
@@ -241,7 +245,8 @@ static void dma_chans_rebalance(void)
  * dma_async_client_register - allocate and register a &dma_client
  * @event_callback: callback for notification of channel addition/removal
  */
-struct dma_client *dma_async_client_register(dma_event_callback event_callback)
+struct dma_client *dma_async_client_register(enum dma_function_type type,
+		dma_event_callback event_callback)
 {
 	struct dma_client *client;
 
@@ -254,6 +259,7 @@ struct dma_client *dma_async_client_regi
 	client->chans_desired = 0;
 	client->chan_count = 0;
 	client->event_callback = event_callback;
+	client->type = type;
 
 	mutex_lock(&dma_list_mutex);
 	list_add_tail(&client->global_node, &dma_client_list);
@@ -402,11 +408,11 @@ subsys_initcall(dma_bus_init);
 EXPORT_SYMBOL(dma_async_client_register);
 EXPORT_SYMBOL(dma_async_client_unregister);
 EXPORT_SYMBOL(dma_async_client_chan_request);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_complete);
-EXPORT_SYMBOL(dma_async_memcpy_issue_pending);
+EXPORT_SYMBOL(dma_async_buf_to_buf);
+EXPORT_SYMBOL(dma_async_buf_to_pg);
+EXPORT_SYMBOL(dma_async_pg_to_pg);
+EXPORT_SYMBOL(dma_async_complete);
+EXPORT_SYMBOL(dma_async_issue_pending);
 EXPORT_SYMBOL(dma_async_device_register);
 EXPORT_SYMBOL(dma_async_device_unregister);
 EXPORT_SYMBOL(dma_chan_cleanup);
Index: linux-2.6/drivers/dma/ioatdma.c
===================================================================
--- linux-2.6.orig/drivers/dma/ioatdma.c
+++ linux-2.6/drivers/dma/ioatdma.c
@@ -40,6 +40,7 @@
 #define to_ioat_device(dev) container_of(dev, struct ioat_device, common)
 #define to_ioat_desc(lh) container_of(lh, struct ioat_desc_sw, node)
 
+
 /* internal functions */
 static int __devinit ioat_probe(struct pci_dev *pdev, const struct pci_device_id *ent);
 static void __devexit ioat_remove(struct pci_dev *pdev);
@@ -681,6 +682,14 @@ out:
 	return err;
 }
 
+struct dma_function ioat_memcpy_functions = {
+	.buf_to_buf = ioat_dma_memcpy_buf_to_buf,
+	.buf_to_pg = ioat_dma_memcpy_buf_to_pg,
+	.pg_to_pg = ioat_dma_memcpy_pg_to_pg,
+	.complete = ioat_dma_is_complete,
+	.issue_pending = ioat_dma_memcpy_issue_pending,
+};
+
 static int __devinit ioat_probe(struct pci_dev *pdev,
                                 const struct pci_device_id *ent)
 {
@@ -756,11 +765,8 @@ static int __devinit ioat_probe(struct p
 
 	device->common.device_alloc_chan_resources = ioat_dma_alloc_chan_resources;
 	device->common.device_free_chan_resources = ioat_dma_free_chan_resources;
-	device->common.device_memcpy_buf_to_buf = ioat_dma_memcpy_buf_to_buf;
-	device->common.device_memcpy_buf_to_pg = ioat_dma_memcpy_buf_to_pg;
-	device->common.device_memcpy_pg_to_pg = ioat_dma_memcpy_pg_to_pg;
-	device->common.device_memcpy_complete = ioat_dma_is_complete;
-	device->common.device_memcpy_issue_pending = ioat_dma_memcpy_issue_pending;
+	device->common.funcs[DMAFUNC_MEMCPY] = &ioat_memcpy_functions;
+
 	printk(KERN_INFO "Intel(R) I/OAT DMA Engine found, %d channels\n",
 		device->common.chancnt);
 
Index: linux-2.6/include/linux/dmaengine.h
===================================================================
--- linux-2.6.orig/include/linux/dmaengine.h
+++ linux-2.6/include/linux/dmaengine.h
@@ -67,14 +67,14 @@ enum dma_status {
 /**
  * struct dma_chan_percpu - the per-CPU part of struct dma_chan
  * @refcount: local_t used for open-coded "bigref" counting
- * @memcpy_count: transaction counter
+ * @count: transaction counter
  * @bytes_transferred: byte counter
  */
 
 struct dma_chan_percpu {
 	local_t refcount;
 	/* stats */
-	unsigned long memcpy_count;
+	unsigned long count;
 	unsigned long bytes_transferred;
 };
 
@@ -138,6 +138,15 @@ static inline void dma_chan_put(struct d
 typedef void (*dma_event_callback) (struct dma_client *client,
 		struct dma_chan *chan, enum dma_event event);
 
+/*
+ * dma_function_type - one entry for every possible function type provided
+ */
+enum dma_function_type {
+	DMAFUNC_MEMCPY = 0,
+	DMAFUNC_XOR,
+	DMAFUNC_MAX
+};
+
 /**
  * struct dma_client - info on the entity making use of DMA services
  * @event_callback: func ptr to call when something happens
@@ -152,11 +161,35 @@ struct dma_client {
 	unsigned int		chan_count;
 	unsigned int		chans_desired;
 
+	enum dma_function_type	type;
+
 	spinlock_t		lock;
 	struct list_head	channels;
 	struct list_head	global_node;
 };
 
+/* struct dma_function
+ * @buf_to_pg: buf pointer to struct page
+ * @pg_to_pg: struct page/offset to struct page/offset
+ * @complete: poll the status of a DMA transaction
+ * @issue_pending: push appended descriptors to hardware
+ */
+struct dma_function {
+	dma_cookie_t (*buf_to_buf)(struct dma_chan *chan,
+				void *dest, void *src, size_t len);
+	dma_cookie_t (*buf_to_pg)(struct dma_chan *chan,
+				struct page *page, unsigned int offset,
+				void *kdata, size_t len);
+	dma_cookie_t (*pg_to_pg)(struct dma_chan *chan,
+				struct page *dest_pg, unsigned int dest_off,
+				struct page *src_pg, unsigned int src_off,
+				size_t len);
+	enum dma_status (*complete)(struct dma_chan *chan,
+				dma_cookie_t cookie, dma_cookie_t *last,
+				dma_cookie_t *used);
+	void (*issue_pending)(struct dma_chan *chan);
+};
+
 /**
  * struct dma_device - info on the entity supplying DMA services
  * @chancnt: how many DMA channels are supported
@@ -168,14 +201,8 @@ struct dma_client {
  * @device_alloc_chan_resources: allocate resources and return the
  *	number of allocated descriptors
  * @device_free_chan_resources: release DMA channel's resources
- * @device_memcpy_buf_to_buf: memcpy buf pointer to buf pointer
- * @device_memcpy_buf_to_pg: memcpy buf pointer to struct page
- * @device_memcpy_pg_to_pg: memcpy struct page/offset to struct page/offset
- * @device_memcpy_complete: poll the status of an IOAT DMA transaction
- * @device_memcpy_issue_pending: push appended descriptors to hardware
  */
 struct dma_device {
-
 	unsigned int chancnt;
 	struct list_head channels;
 	struct list_head global_node;
@@ -185,31 +212,24 @@ struct dma_device {
 
 	int dev_id;
 
+	struct dma_function *funcs[DMAFUNC_MAX];
+
 	int (*device_alloc_chan_resources)(struct dma_chan *chan);
 	void (*device_free_chan_resources)(struct dma_chan *chan);
-	dma_cookie_t (*device_memcpy_buf_to_buf)(struct dma_chan *chan,
-			void *dest, void *src, size_t len);
-	dma_cookie_t (*device_memcpy_buf_to_pg)(struct dma_chan *chan,
-			struct page *page, unsigned int offset, void *kdata,
-			size_t len);
-	dma_cookie_t (*device_memcpy_pg_to_pg)(struct dma_chan *chan,
-			struct page *dest_pg, unsigned int dest_off,
-			struct page *src_pg, unsigned int src_off, size_t len);
-	enum dma_status (*device_memcpy_complete)(struct dma_chan *chan,
-			dma_cookie_t cookie, dma_cookie_t *last,
-			dma_cookie_t *used);
-	void (*device_memcpy_issue_pending)(struct dma_chan *chan);
 };
 
+#define CHAN2FUNCS(chan) (chan->device->funcs[chan->client->type])
+
 /* --- public DMA engine API --- */
 
-struct dma_client *dma_async_client_register(dma_event_callback event_callback);
+struct dma_client *dma_async_client_register(enum dma_function_type type,
+		dma_event_callback event_callback);
 void dma_async_client_unregister(struct dma_client *client);
 void dma_async_client_chan_request(struct dma_client *client,
 		unsigned int number);
 
 /**
- * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
+ * dma_async_buf_to_buf - offloaded copy between virtual addresses
  * @chan: DMA channel to offload copy to
  * @dest: destination address (virtual)
  * @src: source address (virtual)
@@ -220,19 +240,19 @@ void dma_async_client_chan_request(struc
  * Both @dest and @src must stay memory resident (kernel memory or locked
  * user space pages).
  */
-static inline dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
-	void *dest, void *src, size_t len)
+static inline dma_cookie_t dma_async_buf_to_buf(struct dma_chan *chan,
+		void *dest, void *src, size_t len)
 {
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_buf_to_buf(chan, dest, src, len);
+	return CHAN2FUNCS(chan)->buf_to_buf(chan, dest, src, len);
 }
 
 /**
- * dma_async_memcpy_buf_to_pg - offloaded copy from address to page
+ * dma_async_buf_to_pg - offloaded copy from address to page
  * @chan: DMA channel to offload copy to
  * @page: destination page
  * @offset: offset in page to copy to
@@ -244,20 +264,21 @@ static inline dma_cookie_t dma_async_mem
  * Both @page/@offset and @kdata must stay memory resident (kernel memory or
  * locked user space pages)
  */
-static inline dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
-	struct page *page, unsigned int offset, void *kdata, size_t len)
+static inline dma_cookie_t dma_async_buf_to_pg(struct dma_chan *chan,
+		struct page *page, unsigned int offset,
+		void *kdata, size_t len)
 {
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_buf_to_pg(chan, page, offset,
-	                                             kdata, len);
+	return CHAN2FUNCS(chan)->buf_to_pg(chan, page, offset,
+						kdata, len);
 }
 
 /**
- * dma_async_memcpy_pg_to_pg - offloaded copy from page to page
+ * dma_async_pg_to_pg - offloaded copy from page to page
  * @chan: DMA channel to offload copy to
  * @dest_pg: destination page
  * @dest_off: offset in page to copy to
@@ -270,33 +291,33 @@ static inline dma_cookie_t dma_async_mem
  * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
  * (kernel memory or locked user space pages).
  */
-static inline dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
-	struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
-	unsigned int src_off, size_t len)
+static inline dma_cookie_t dma_async_pg_to_pg( struct dma_chan *chan,
+		struct page *dest_pg, unsigned int dest_off,
+		struct page *src_pg, unsigned int src_off, size_t len)
 {
 	int cpu = get_cpu();
 	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	per_cpu_ptr(chan->local, cpu)->count++;
 	put_cpu();
 
-	return chan->device->device_memcpy_pg_to_pg(chan, dest_pg, dest_off,
-	                                            src_pg, src_off, len);
+	return CHAN2FUNCS(chan)->pg_to_pg(chan, dest_pg, dest_off,
+						src_pg, src_off, len);
 }
 
 /**
- * dma_async_memcpy_issue_pending - flush pending copies to HW
+ * dma_async_issue_pending - flush pending copies to HW
  * @chan: target DMA channel
  *
  * This allows drivers to push copies to HW in batches,
  * reducing MMIO writes where possible.
  */
-static inline void dma_async_memcpy_issue_pending(struct dma_chan *chan)
+static inline void dma_async_issue_pending(struct dma_chan *chan)
 {
-	return chan->device->device_memcpy_issue_pending(chan);
+	return CHAN2FUNCS(chan)->issue_pending(chan);
 }
 
 /**
- * dma_async_memcpy_complete - poll for transaction completion
+ * dma_async_complete - poll for transaction completion
  * @chan: DMA channel
  * @cookie: transaction identifier to check status of
  * @last: returns last completed cookie, can be NULL
@@ -306,10 +327,11 @@ static inline void dma_async_memcpy_issu
  * internal state and can be used with dma_async_is_complete() to check
  * the status of multiple cookies without re-checking hardware state.
  */
-static inline enum dma_status dma_async_memcpy_complete(struct dma_chan *chan,
-	dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used)
+static inline enum dma_status dma_async_complete(struct dma_chan *chan,
+		dma_cookie_t cookie, dma_cookie_t *last,
+		dma_cookie_t *used)
 {
-	return chan->device->device_memcpy_complete(chan, cookie, last, used);
+	return CHAN2FUNCS(chan)->complete(chan, cookie, last, used);
 }
 
 /**
@@ -318,7 +340,7 @@ static inline enum dma_status dma_async_
  * @last_complete: last know completed transaction
  * @last_used: last cookie value handed out
  *
- * dma_async_is_complete() is used in dma_async_memcpy_complete()
+ * dma_async_is_complete() is used in dma_async_complete()
  * the test logic is seperated for lightweight testing of multiple cookies
  */
 static inline enum dma_status dma_async_is_complete(dma_cookie_t cookie,
Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -1945,7 +1945,7 @@ out:
 		struct dma_chan *chan;
 		rcu_read_lock();
 		list_for_each_entry_rcu(chan, &net_dma_client->channels, client_node)
-			dma_async_memcpy_issue_pending(chan);
+			dma_async_issue_pending(chan);
 		rcu_read_unlock();
 	}
 #endif
@@ -3467,7 +3467,7 @@ static void netdev_dma_event(struct dma_
 static int __init netdev_dma_register(void)
 {
 	spin_lock_init(&net_dma_event_lock);
-	net_dma_client = dma_async_client_register(netdev_dma_event);
+	net_dma_client = dma_async_client_register(DMAFUNC_MEMCPY, netdev_dma_event);
 	if (net_dma_client == NULL)
 		return -ENOMEM;
 
Index: linux-2.6/drivers/dma/iovlock.c
===================================================================
--- linux-2.6.orig/drivers/dma/iovlock.c
+++ linux-2.6/drivers/dma/iovlock.c
@@ -151,7 +151,7 @@ static dma_cookie_t dma_memcpy_to_kernel
 	while (len > 0) {
 		if (iov->iov_len) {
 			int copy = min_t(unsigned int, iov->iov_len, len);
-			dma_cookie = dma_async_memcpy_buf_to_buf(
+			dma_cookie = dma_async_buf_to_buf(
 					chan,
 					iov->iov_base,
 					kdata,
@@ -210,7 +210,7 @@ dma_cookie_t dma_memcpy_to_iovec(struct 
 			copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
 			copy = min_t(int, copy, iov[iovec_idx].iov_len);
 
-			dma_cookie = dma_async_memcpy_buf_to_pg(chan,
+			dma_cookie = dma_async_buf_to_pg(chan,
 					page_list->pages[page_idx],
 					iov_byte_offset,
 					kdata,
@@ -274,7 +274,7 @@ dma_cookie_t dma_memcpy_pg_to_iovec(stru
 			copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
 			copy = min_t(int, copy, iov[iovec_idx].iov_len);
 
-			dma_cookie = dma_async_memcpy_pg_to_pg(chan,
+			dma_cookie = dma_async_pg_to_pg(chan,
 					page_list->pages[page_idx],
 					iov_byte_offset,
 					page,
Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -1431,11 +1431,11 @@ skip_copy:
 		struct sk_buff *skb;
 		dma_cookie_t done, used;
 
-		dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
+		dma_async_issue_pending(tp->ucopy.dma_chan);
 
-		while (dma_async_memcpy_complete(tp->ucopy.dma_chan,
-		                                 tp->ucopy.dma_cookie, &done,
-		                                 &used) == DMA_IN_PROGRESS) {
+		while (dma_async_complete(tp->ucopy.dma_chan,
+		                          tp->ucopy.dma_cookie, &done,
+		                          &used) == DMA_IN_PROGRESS) {
 			/* do partial cleanup of sk_async_wait_queue */
 			while ((skb = skb_peek(&sk->sk_async_wait_queue)) &&
 			       (dma_async_is_complete(skb->dma_cookie, done,

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations)
  2006-09-15 19:44       ` [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations) Olof Johansson
  2006-09-15 20:02         ` [PATCH] [v2] " Olof Johansson
@ 2006-09-18 22:56         ` Dan Williams
  2006-09-19  1:05           ` Olof Johansson
  1 sibling, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-09-18 22:56 UTC (permalink / raw)
  To: Olof Johansson
  Cc: christopher.leech, Jeff Garzik, neilb, linux-raid, akpm, linux-kernel

On 9/15/06, Olof Johansson <olof@lixom.net> wrote:
> On Fri, 15 Sep 2006 11:38:17 -0500 Olof Johansson <olof@lixom.net> wrote:
>
> > On Mon, 11 Sep 2006 19:44:16 -0400 Jeff Garzik <jeff@garzik.org> wrote:
>
> > > Are we really going to add a set of hooks for each DMA engine whizbang
> > > feature?
> > >
> > > That will get ugly when DMA engines support memcpy, xor, crc32, sha1,
> > > aes, and a dozen other transforms.
> >
> >
> > Yes, it will be unmaintainable. We need some sort of multiplexing with
> > per-function registrations.
> >
> > Here's a first cut at it, just very quick. It could be improved further
> > but it shows that we could exorcise most of the hardcoded things pretty
> > easily.
>
> Ok, that was obviously a naive and not so nice first attempt, but I
> figured it was worth it to show how it can be done.
>
> This is a little more proper: Specify at client registration time what
> the function the client will use is, and make the channel use it. This
> way most of the error checking per call can be removed too.
>
> Chris/Dan: Please consider picking this up as a base for the added
> functionality and cleanups.
>
Thanks for this, Olof; it has sparked some ideas about how to redo
support for multiple operations.

>
>
>
>
> Clean up dmaengine a bit. Make the client registration specify which
> channel functions ("type") the client will use. Also, make devices
> register which functions they will provide.
>
> Also exorcise most of the memcpy-specific references from the generic
> dma engine code. There's still some left in the iov stuff.
I think we should keep the operation type in the function name but
drop all the [buf|pg|dma]_to_[buf|pg|dma] permutations.  The buffer
type can be handled generically across all operation types.  Something
like the following for a pg_to_buf memcpy.

struct dma_async_op_memcpy *op;
struct page *pg;
void *buf;
size_t len;

dma_async_op_init_src_pg(op, pg);
dma_async_op_init_dest_buf(op, buf);
dma_async_memcpy(chan, op, len);
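
(To be clear, the init helpers above are hypothetical; I imagine they
would just fill in a generic source/destination descriptor inside the
op, something along the lines of:

	static inline void dma_async_op_init_src_pg(struct dma_async_op_memcpy *op,
			struct page *pg)
	{
		op->src_type = PG;	/* made-up field/enum names */
		op->src.pg = pg;
	}

with the exact layout still to be worked out.)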

-Dan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations)
  2006-09-18 22:56         ` [PATCH] " Dan Williams
@ 2006-09-19  1:05           ` Olof Johansson
  2006-09-19 11:20             ` Alan Cox
  0 siblings, 1 reply; 55+ messages in thread
From: Olof Johansson @ 2006-09-19  1:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: christopher.leech, Jeff Garzik, neilb, linux-raid, akpm, linux-kernel

On Mon, 18 Sep 2006 15:56:37 -0700 "Dan Williams" <dan.j.williams@gmail.com> wrote:

> On 9/15/06, Olof Johansson <olof@lixom.net> wrote:
> > On Fri, 15 Sep 2006 11:38:17 -0500 Olof Johansson <olof@lixom.net> wrote:

> > Chris/Dan: Please consider picking this up as a base for the added
> > functionality and cleanups.
> >
> Thanks for this Olof it has sparked some ideas about how to redo
> support for multiple operations.

Good. :)

> I think we should keep the operation type in the function name but
> drop all the [buf|pg|dma]_to_[buf|pg|dma] permutations.  The buffer
> type can be handled generically across all operation types.  Something
> like the following for a pg_to_buf memcpy.
> 
> struct dma_async_op_memcpy *op;
> struct page *pg;
> void *buf;
> size_t len;
> 
> dma_async_op_init_src_pg(op, pg);
> dma_async_op_init_dest_buf(op, buf);
> dma_async_memcpy(chan, op, len);

I'm generally for a more generic interface, especially in the address
permutation cases like above. However, I think it'll be a mistake to
keep the association between the API and the function names and types
so close.

What's the benefit of keeping a memcpy-specific dma_async_memcpy()
instead of a more generic dma_async_commit() (or similar)? We'll know
based on how the client/channel was allocated what kind of function is
requested, won't we?

Same goes for the dma_async_op_memcpy. Make it a union that has a type
field if you need per-operation settings. But as before, we'll know
what kind of op structure gets passed in since we'll know what kind of
operation is to be performed on it.

Finally, yet again the same goes for the op_init settings. I would even
prefer it to not be function-based, instead just direct union/struct
assignments.

struct dma_async_op op;
...

op.src_type = PG; op.src = pg;
op.dest_type = BUF; op.dest = buf;
op.len = len;
dma_async_commit(chan, op);

op might have to be dynamically allocated, since it'll outlive the
scope of this function. But the idea would be the same.
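
To spell that out, a rough layout for such an op could be (purely
illustrative, all names are placeholders):

	enum dma_addr_type { BUF, PG, DMA };

	struct dma_async_op {
		enum dma_addr_type	src_type, dest_type;
		union {
			void		*buf;
			struct page	*pg;
			dma_addr_t	dma;
		} src, dest;
		unsigned int		src_off, dest_off;	/* for PG */
		size_t			len;
		dma_cookie_t		cookie;		/* filled in by commit */
	};

	dma_cookie_t dma_async_commit(struct dma_chan *chan,
			struct dma_async_op *op);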


-Olof

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations)
  2006-09-19  1:05           ` Olof Johansson
@ 2006-09-19 11:20             ` Alan Cox
  2006-09-19 16:32               ` Olof Johansson
  0 siblings, 1 reply; 55+ messages in thread
From: Alan Cox @ 2006-09-19 11:20 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Dan Williams, christopher.leech, Jeff Garzik, neilb, linux-raid,
	akpm, linux-kernel

On Mon, 2006-09-18 at 20:05 -0500, Olof Johansson wrote:
> On Mon, 18 Sep 2006 15:56:37 -0700 "Dan Williams" <dan.j.williams@gmail.com> wrote:

> op.src_type = PG; op.src = pg;
> op.dest_type = BUF; op.dest = buf;
> op.len = len;
> dma_async_commit(chan, op);

At OLS Linus suggested it should distinguish between sync and async
events for locking reasons.

	if(dma_async_commit(foo) == SYNC_COMPLETE) {
		finalise_stuff();
	}

	else		/* will call foo->callback(foo->dev_id) */

because otherwise you have locking complexities - the callback wants to
take locks to guard the object it works on but if it is called
synchronously - eg if hardware is busy and we fall back - it might
deadlock with the caller of dma_async_foo() who also needs to hold the
lock.
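
Put concretely, the deadlock case is roughly (illustrative only):

	spin_lock(&obj->lock);
	dma_async_commit(chan, op);	/* hw busy -> runs synchronously, and
					 * the callback tries to take
					 * obj->lock -> deadlock */
	spin_unlock(&obj->lock);

Returning SYNC_COMPLETE instead of calling the callback lets the
submitter do the finalisation itself once it has dropped the lock.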

Alan


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations)
  2006-09-19 11:20             ` Alan Cox
@ 2006-09-19 16:32               ` Olof Johansson
  0 siblings, 0 replies; 55+ messages in thread
From: Olof Johansson @ 2006-09-19 16:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: Dan Williams, christopher.leech, Jeff Garzik, neilb, linux-raid,
	akpm, linux-kernel

On Tue, 19 Sep 2006 12:20:09 +0100 Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> Ar Llu, 2006-09-18 am 20:05 -0500, ysgrifennodd Olof Johansson:
> > On Mon, 18 Sep 2006 15:56:37 -0700 "Dan Williams" <dan.j.williams@gmail.com> wrote:
> 
> > op.src_type = PG; op.src = pg;
> > op.dest_type = BUF; op.dest = buf;
> > op.len = len;
> > dma_async_commit(chan, op);
> 
> At OLS Linus suggested it should distinguish between sync and async
> events for locking reasons.
> 
> 	if(dma_async_commit(foo) == SYNC_COMPLETE) {
> 		finalise_stuff();
> 	}
> 
> 	else		/* will call foo->callback(foo->dev_id) */
> 
> because otherwise you have locking complexities - the callback wants to
> take locks to guard the object it works on but if it is called
> synchronously - eg if hardware is busy and we fall back - it might
> deadlock with the caller of dmaa_async_foo() who also needs to hold the
> lock.

Good point, sounds very reasonable to me.


-Olof

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
                   ` (20 preceding siblings ...)
  2006-09-13  7:15 ` Jakob Oestergaard
@ 2006-10-08 22:18 ` Neil Brown
  2006-10-10 18:23   ` Dan Williams
  21 siblings, 1 reply; 55+ messages in thread
From: Neil Brown @ 2006-10-08 22:18 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid, akpm, linux-kernel, christopher.leech



On Monday September 11, dan.j.williams@intel.com wrote:
> Neil,
> 
> The following patches implement hardware accelerated raid5 for the Intel
> Xscale® series of I/O Processors.  The MD changes allow stripe
> operations to run outside the spin lock in a work queue.  Hardware
> acceleration is achieved by using a dma-engine-aware work queue routine
> instead of the default software only routine.

Hi Dan,
 Sorry for the delay in replying.
 I've looked through these patches at last (mostly the raid-specific
 bits) and while there is clearly a lot of good stuff here, it doesn't
 'feel' right - it just seems too complex.

 The particular issues that stand out to me are:
   - 33 new STRIPE_OP_* flags.  I'm sure there doesn't need to be that
      many new flags.
   - the "raid5 dma client" patch moves far too much internal
     knowledge about raid5 into drivers/dma.

 Clearly there are some complex issues being dealt with and some
 complexity is to be expected, but I feel there must be room for some
 serious simplification.

 Let me try to describe how I envisage it might work.

 As you know, the theory-of-operation of handle_stripe is that it
 assesses the state of a stripe deciding what actions to perform and
 then performs them.  Synchronous actions (e.g. current parity calcs)
 are performed 'in-line'.  Async actions (reads, writes) and actions
 that cannot be performed under a spinlock (->b_end_io) are recorded
 as being needed and then are initiated at the end of handle_stripe
 outside of the sh->lock.

 The proposal is to bring the parity and other bulk-memory operations
 out of the spinlock and make them optionally asynchronous.

 The set of tasks that might be needed to be performed on a stripe
 are:
	Clear a target cache block
	pre-xor various cache blocks into a target
	copy data out of bios into cache blocks. (drain)
	post-xor various cache blocks into a target
	copy data into bios out of cache blocks (fill)
	test if a cache block is all zeros
	start a read on a cache block
	start a write on a cache block

 (There is also a memcpy when expanding raid5.  I think I would try to
  simply avoid that copy and move pointers around instead).

 Some of these steps require sequencing. e.g.
   clear, pre-xor, copy, post-xor, write
 for a rmw cycle.
 We could require handle_stripe to be called again for each step.
 i.e. first call just clears the target and flags it as clear.  Next
 call initiates the pre-xor and flags that as done.  Etc.  However I
 think that would make the non-offloaded case too slow, or at least
 too clumsy.

 So instead we set flags to say what needs to be done and have a
 workqueue system that does it.

 (so far this is all quite similar to what you have done.)

 So handle_stripe would set various flags and other things (like
 identifying which block was the 'target' block) and run the following
 in a workqueue:

raid5_do_stuff(struct stripe_head *sh)
{
	raid5_conf_t *conf = sh->raid_conf;

	if (test_bit(CLEAR_TARGET, &sh->ops.pending)) {
		struct page *p = sh->dev[sh->ops.target].page;
		rv = async_memset(p, 0, 0, PAGE_SIZE, ops_done, sh);
		if (rv != BUSY)
			clear_bit(CLEAR_TARGET, &sh->ops.pending);
		if (rv != COMPLETE)
			goto out;
	}

	while (test_bit(PRE_XOR, &sh->ops.pending)) {
		struct page *plist[XOR_MAX];
		int offset[XOR_MAX];
		int pos = 0;
		int d;

		for (d = sh->ops.nextdev;
		     d < conf->raid_disks && pos < XOR_MAX ;
		     d++) {
			if (d == sh->ops.target)
				continue;
			if (!test_bit(R5_WantPreXor, &sh->dev[d].flags))
				continue;
			plist[pos] = sh->dev[d].page;
			offset[pos++] = 0;
		}
		if (pos) {
			struct page *p = sh->dev[sh->ops.target].page;
			rv = async_xor(p, 0, plist, offset, pos, PAGE_SIZE,
				       ops_done, sh);
			if (rv != BUSY)
				sh->ops.nextdev = d;
			if (rv != COMPLETE)
				goto out;
		} else {
			clear_bit(PRE_XOR, &sh->ops.pending);
			sh->ops.nextdev = 0;
		}
	}
		
	while (test_bit(COPY_IN, &sh->ops.pending)) {
		...
	}
	....

	if (test_bit(START_IO, &sh->ops.pending)) {
		int d;
		for (d = 0 ; d < conf->raid_disks ; d++) {
			/* all that code from the end of handle_stripe */
		}
	}

	release_stripe(conf, sh);
	return;

 out:
	if (rv == BUSY) {
		/* wait on something and try again ???*/
	}
	return;
}

ops_done(struct stripe_head *sh)
{
	queue_work(....whatever..);
}


Things to note:
 - we keep track of where we are up to in sh->ops.
      .pending is flags saying what is left to be done
      .next_dev is the next device to process for operations that
        work on several devices
      .next_bio, .next_iov will be needed for copy operations that
        cross multiple bios and iovecs.

 - Each sh->dev has R5_Want flags reflecting which multi-device
   operations are wanted on each device.

 - async bulk-memory operations take pages, offsets, and lengths,
   and can return COMPLETE (if the operation was performed
   synchronously), IN_PROGRESS (if it has been started, or at least
   queued) or BUSY if it couldn't even be queued.  Exactly what to do
   in that case I'm not sure.  Probably we need a waitqueue to wait
   on.

 - The interface between the client and the ADMA hardware is a
   collection of async_ functions.  async_memcpy, async_xor,
   async_memset etc.

   I gather there needs to be some understanding
   about whether the pages are already appropriately mapped for DMA or
   whether a mapping is needed.  Maybe an extra flag argument should
   be passed.

   I imagine that any piece of ADMA hardware would register with the
   'async_*' subsystem, and a call to async_X would be routed as
   appropriate, or be run in-line.
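
Roughly, and only as a sketch matching the calls in the pseudo-code
above (names and argument order obviously not final), I am imagining
something like:

	enum async_status { COMPLETE, IN_PROGRESS, BUSY };

	typedef void (*async_callback_t)(void *arg);

	enum async_status async_memset(struct page *page, int val,
			unsigned int offset, size_t len,
			async_callback_t done, void *arg);

	enum async_status async_xor(struct page *dest, unsigned int dest_off,
			struct page **src_list, int *src_off,
			int src_cnt, size_t len,
			async_callback_t done, void *arg);

Each routine would run the operation inline (and return COMPLETE) when
no offload engine is available, otherwise queue it to hardware and
return IN_PROGRESS, or BUSY if it cannot even be queued.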

This approach introduces 8 flags for sh->ops.pending and maybe two or
three new R5_Want* flags.  It also keeps the raid5 knowledge firmly in
the raid5 code base.  So it seems to keep the complexity under control.

Would this approach make sense to you?  Is there something really
important I have missed?

(I'll try and be more responsive next time).

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-10-08 22:18 ` Neil Brown
@ 2006-10-10 18:23   ` Dan Williams
  2006-10-11  2:44     ` Neil Brown
  0 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2006-10-10 18:23 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, akpm, linux-kernel, christopher.leech

On 10/8/06, Neil Brown <neilb@suse.de> wrote:
>
>
> On Monday September 11, dan.j.williams@intel.com wrote:
> > Neil,
> >
> > The following patches implement hardware accelerated raid5 for the Intel
> > Xscale(r) series of I/O Processors.  The MD changes allow stripe
> > operations to run outside the spin lock in a work queue.  Hardware
> > acceleration is achieved by using a dma-engine-aware work queue routine
> > instead of the default software only routine.
>
> Hi Dan,
>  Sorry for the delay in replying.
>  I've looked through these patches at last (mostly the raid-specific
>  bits) and while there is clearly a lot of good stuff here, it does
>  'feel' right - it just seems too complex.
>
>  The particular issues that stand out to me are:
>    - 33 new STRIPE_OP_* flags.  I'm sure there doesn't need to be that
>       many new flags.
>    - the "raid5 dma client" patch moves far too much internal
>      knowledge about raid5 into drivers/dma.
>
>  Clearly there are some complex issues being dealt with and some
>  complexity is to be expected, but I feel there must be room for some
>  serious simplification.
A valid criticism.  There was definitely a push to just get it
functional, so I can now see how the complexity crept into the
implementation.  The primary cause was the choice to explicitly handle
channel switching in raid5-dma.  However, relieving "client" code from
this responsibility is something I am taking care of in the async api
changes.

>
>  Let me try to describe how I envisage it might work.
>
>  As you know, the theory-of-operation of handle_stripe is that it
>  assesses the state of a stripe deciding what actions to perform and
>  then performs them.  Synchronous actions (e.g. current parity calcs)
>  are performed 'in-line'.  Async actions (reads, writes) and actions
>  that cannot be performed under a spinlock (->b_end_io) are recorded
>  as being needed and then are initiated at the end of handle_stripe
>  outside of the sh->lock.
>
>  The proposal is to bring the parity and other bulk-memory operations
>  out of the spinlock and make them optionally asynchronous.
>
>  The set of tasks that might be needed to be performed on a stripe
>  are:
>         Clear a target cache block
>         pre-xor various cache blocks into a target
>         copy data out of bios into cache blocks. (drain)
>         post-xor various cache blocks into a target
>         copy data into bios out of cache blocks (fill)
>         test if a cache block is all zeros
>         start a read on a cache block
>         start a write on a cache block
>
>  (There is also a memcpy when expanding raid5.  I think I would try to
>   simply avoid that copy and move pointers around instead).
>
>  Some of these steps require sequencing. e.g.
>    clear, pre-xor, copy, post-xor, write
>  for a rwm cycle.
>  We could require handle_stripe to be called again for each step.
>  i.e. first call just clears the target and flags it as clear.  Next
>  call initiates the pre-xor and flags that as done.  Etc.  However I
>  think that would make the non-offloaded case too slow, or at least
>  too clumsy.
>
>  So instead we set flags to say what needs to be done and have a
>  workqueue system that does it.
>
>  (so far this is all quite similar to what you have done.)
>
>  So handle_stripe would set various flag and other things (like
>  identify which block was the 'target' block) and run the following
>  in a workqueue:
>
> raid5_do_stuff(struct stripe_head *sh)
> {
>         raid5_cont_t *conf = sh->raid_conf;
>
>         if (test_bit(CLEAR_TARGET, &sh->ops.pending)) {
>                 struct page = *p->sh->dev[sh->ops.target].page;
>                 rv = async_memset(p, 0, 0, PAGE_SIZE, ops_done, sh);
>                 if (rv != BUSY)
>                         clear_bit(CLEAR_TARGET, &sh->ops.pending);
>                 if (rv != COMPLETE)
>                         goto out;
>         }
>
>         while (test_bit(PRE_XOR, &sh->ops.pending)) {
>                 struct page *plist[XOR_MAX];
>                 int offset[XOR_MAX];
>                 int pos = 0;
>                 int d;
>
>                 for (d = sh->ops.nextdev;
>                      d < conf->raid_disks && pos < XOR_MAX ;
>                      d++) {
>                         if (sh->ops.nextdev == sh->ops.target)
>                                 continue;
>                         if (!test_bit(R5_WantPreXor, &sh->dev[d].flags))
>                                 continue;
>                         plist[pos] = sh->dev[d].page;
>                         offset[pos++] = 0;
>                 }
>                 if (pos) {
>                         struct page *p = sh->dev[sh->ops.target].page;
>                         rv = async_xor(p, 0, plist, offset, pos, PAGE_SIZE,
>                                        ops_done, sh);
>                         if (rv != BUSY)
>                                 sh->ops.nextdev = d;
>                         if (rv != COMPLETE)
>                                 goto out;
>                 } else {
>                         clear_bit(PRE_XOR, &sh->ops.pending);
>                         sh->ops.nextdev = 0;
>                 }
>         }
>
>         while (test_bit(COPY_IN, &sh0>ops.pending)) {
>                 ...
>         }
>         ....
>
>         if (test_bit(START_IO, &sh->ops.pending)) {
>                 int d;
>                 for (d = 0 ; d < conf->raid_disk ; d++) {
>                         /* all that code from the end of handle_stripe */
>                 }
>
>         release_stripe(conf, sh);
>         return;
>
>  out:
>         if (rv == BUSY) {
>                 /* wait on something and try again ???*/
>         }
>         return;
> }
>
> ops_done(struct stripe_head *sh)
> {
>         queue_work(....whatever..);
> }
>
>
> Things to note:
>  - we keep track of where we are up to in sh->ops.
>       .pending is flags saying what is left to be done
>       .next_dev is the next device to process for operations that
>         work on several devices
>       .next_bio, .next_iov will be needed for copy operations that
>         cross multiple bios and iovecs.
>
>  - Each sh->dev has R5_Want flags reflecting which multi-device
>    operations are wanted on each device.
>
>  - async bulk-memory operations take pages, offsets, and lengths,
>    and can return COMPLETE (if the operation was performed
>    synchronously) IN_PROGRESS (if it has been started, or at least
>    queued) or BUSY if it couldn't even be queued.  Exactly what to do
>    in that case I'm not sure.  Probably we need a waitqueue to wait
>    on.
>
>  - The interface between the client and the ADMA hardware is a
>    collection of async_ functions.  async_memcpy, async_xor,
>    async_memset etc.
>
>    I gather there needs to be some understanding
>    about whether the pages are already appropriately mapped for DMA or
>    whether a mapping is needed.  Maybe an extra flag argument should
>    be passed.
>
>    I imagine that any piece of ADMA hardware would register with the
>    'async_*' subsystem, and a call to async_X would be routed as
>    appropriate, or be run in-line.
>
> This approach introduces 8 flags for sh->ops.pending and maybe two or
> three new R5_Want* flags.  It also keeps the raid5 knowledge firmly in
> the raid5 code base.  So it seems to keep the complexity under control
>
> Would this approach make sense to you?
Definitely.

> Is there something really important I have missed?
No, nothing important jumps out.  Just a follow up question/note about
the details.

You imply that the async path and the sync path are unified in this
implementation.  I think it is doable but it will add some complexity
since the sync case is not a distinct subset of the async case.  For
example "Clear a target cache block" is required for the sync case,
but it can go away when using hardware engines.  Engines typically
have their own accumulator buffer to store the temporary result,
whereas software only operates on memory.

What do you think of adding async tests for these situations?
test_bit(XOR, &conf->async)

Where a flag is set if calls to async_<operation> may be routed to a
hardware engine?  Otherwise skip any async-specific details.
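
To make that concrete, the target-clearing step from your sketch could
become something like (illustrative only, reusing the rv handling from
your pseudo-code):

	if (test_bit(CLEAR_TARGET, &sh->ops.pending)) {
		if (test_bit(XOR, &conf->async)) {
			/* engine has its own accumulator, no memset needed */
			clear_bit(CLEAR_TARGET, &sh->ops.pending);
		} else {
			struct page *p = sh->dev[sh->ops.target].page;

			rv = async_memset(p, 0, 0, PAGE_SIZE, ops_done, sh);
			if (rv != BUSY)
				clear_bit(CLEAR_TARGET, &sh->ops.pending);
			if (rv != COMPLETE)
				goto out;
		}
	}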

>
> (I'll try and be more responsive next time).
Thanks for shepherding this along.

>
> Thanks,
> NeilBrown

Regards,
Dan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-09-14  7:42     ` Jakob Oestergaard
@ 2006-10-11  1:46       ` Dan Williams
  0 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2006-10-11  1:46 UTC (permalink / raw)
  To: Jakob Oestergaard
  Cc: NeilBrown, linux-raid, akpm, linux-kernel, christopher.leech

On 9/14/06, Jakob Oestergaard <jakob@unthought.net> wrote:
> On Wed, Sep 13, 2006 at 12:17:55PM -0700, Dan Williams wrote:
> ...
> > >Out of curiosity; how does accelerated compare to non-accelerated?
> >
> > One quick example:
> > 4-disk SATA array rebuild on iop321 without acceleration - 'top'
> > reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
> > utilization.
> >
> > With acceleration - 'top' reports md0_resync cpu utilization at ~90%
> > with the rest split between md0_raid5 and md0_raid5_ops.
> >
> > The sync speed reported by /proc/mdstat is ~40% higher in the accelerated
> > case.
>
> Ok, nice :)
>
> >
> > That being said, array resync is a special case, so your mileage may
> > vary with other applications.
>
> Every-day usage I/O performance data would be nice indeed :)
>
> > I will put together some data from bonnie++, iozone, maybe contest,
> > and post it on SourceForge.
>
> Great!
>
I have posted some Iozone data and graphs showing the performance
impact of the patches across the three IOP processors: iop321, iop331,
and iop341.  The general takeaway from the data is that using dma
engines extends the region that Iozone calls the "buffer cache
effect".  Write performance benefited the most, as expected, but read
performance showed some modest gains as well.  There are some regions
(smaller file sizes and record lengths) that show a performance
disadvantage, but it is typically less than 5%.

The graphs map the relative performance multiplier produced by the
raid patches ('2.6.18-rc6 performance' x 'performance multiplier' =
'2.6.18-rc6-raid performance').  A value of '1' designates equal
performance.  The large cliff that drops to zero is a "not measured"
region, i.e. the record length is larger than the file size.  Iozone
outputs to Excel, but I have also made PDFs of the graphs available.
Note: OpenOffice Calc can view the data, but it does not support the
3D surface graphs that Iozone uses.

Excel:
http://prdownloads.sourceforge.net/xscaleiop/iozone_raid_accel.xls?download

PDF Graphs:
http://prdownloads.sourceforge.net/xscaleiop/iop-iozone-graphs-20061010.tar.bz2?download

Regards,
Dan


* Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
  2006-10-10 18:23   ` Dan Williams
@ 2006-10-11  2:44     ` Neil Brown
  0 siblings, 0 replies; 55+ messages in thread
From: Neil Brown @ 2006-10-11  2:44 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid, linux-kernel, christopher.leech

[dropped akpm from the Cc: as current discussion isn't directly
relevant to him]
On Tuesday October 10, dan.j.williams@intel.com wrote:
> On 10/8/06, Neil Brown <neilb@suse.de> wrote:
> 
> > Is there something really important I have missed?
> No, nothing important jumps out.  Just a follow up question/note about
> the details.
> 
> You imply that the async path and the sync path are unified in this
> implementation.  I think it is doable but it will add some complexity
> since the sync case is not a distinct subset of the async case.  For
> example "Clear a target cache block" is required for the sync case,
> but it can go away when using hardware engines.  Engines typically
> have their own accumulator buffer to store the temporary result,
> whereas software only operates on memory.
> 
> What do you think of adding async tests for these situations?
> test_bit(XOR, &conf->async)
> 
> Where a flag is set if calls to async_<operation> may be routed to a
> hardware engine?  Otherwise skip any async-specific details.

I'd rather try to come up with an interface that was equally
appropriate to both offload and inline.  I appreciate that it might
not be possible to get an interface that gets the best performance out of
both, but I'd like to explore that direction first.

I'd guess from what you say that the dma engine is given a bunch of
sources and a destination and it xor's all the sources together into
an accumulation buffer, and then writes the accum buffer to the
destination.  Would that be right?  Can you use the destination as one
of the sources?

That can obviously be done inline too with some changes to the xor
code, and avoiding the initial memset might be good for performance
too. 

So I would suggest we drop the memset idea, and define the async_xor
interface to xor a number of sources into a destination, where the
destination is allowed to be the same as the first source, but
doesn't need to be.
Then the inline version could use a memset followed by the current xor
operations, or could use newly written xor operations, and the offload
version could equally do whatever is appropriate.
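
(As a rough illustration of that contract -- a hypothetical helper,
not a proposed patch -- an inline fallback can xor straight into a
destination that aliases the first source, with no initial memset:)

	/* hypothetical software fallback: dest may equal srcs[0] */
	static void xor_blocks_inline(unsigned long *dest,
				      unsigned long **srcs, int src_cnt,
				      size_t len)
	{
		size_t i;
		int s;

		for (i = 0; i < len / sizeof(unsigned long); i++) {
			unsigned long v = srcs[0][i];

			for (s = 1; s < src_cnt; s++)
				v ^= srcs[s][i];
			/* read-before-write, so aliasing dest is safe */
			dest[i] = v;
		}
	}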

Another place where combining operations might make sense is copy-in
and post-xor.  In some cases it might be more efficient to only read
the source once, and both write it to the destination and xor it into
the target.  Would your DMA engine be able to optimise this
combination?  I think current processors could certainly do better if
the two were combined.
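
(A sketch of that combination, assuming word-aligned, equal-length
buffers -- a hypothetical helper, not existing code:)

	/* read the incoming data once, write it into the cache page and
	 * fold it into the parity target in the same pass */
	static void copy_in_and_xor(unsigned long *cache,
				    unsigned long *parity,
				    const unsigned long *src, size_t len)
	{
		size_t i;

		for (i = 0; i < len / sizeof(unsigned long); i++) {
			unsigned long v = src[i];

			cache[i] = v;		/* copy-in */
			parity[i] ^= v;		/* post-xor */
		}
	}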

So there is definitely room to move, but I would rather avoid flags
if I could.

NeilBrown


end of thread, other threads:[~2006-10-11  2:45 UTC | newest]

Thread overview: 55+ messages
2006-09-11 23:00 [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Dan Williams
2006-09-11 23:17 ` [PATCH 01/19] raid5: raid5_do_soft_block_ops Dan Williams
2006-09-11 23:34   ` Jeff Garzik
2006-09-11 23:17 ` [PATCH 02/19] raid5: move write operations to a workqueue Dan Williams
2006-09-11 23:36   ` Jeff Garzik
2006-09-11 23:17 ` [PATCH 03/19] raid5: move check parity " Dan Williams
2006-09-11 23:17 ` [PATCH 04/19] raid5: move compute block " Dan Williams
2006-09-11 23:18 ` [PATCH 05/19] raid5: move read completion copies " Dan Williams
2006-09-11 23:18 ` [PATCH 06/19] raid5: move the reconstruct write expansion operation " Dan Williams
2006-09-11 23:18 ` [PATCH 07/19] raid5: remove compute_block and compute_parity5 Dan Williams
2006-09-11 23:18 ` [PATCH 08/19] dmaengine: enable multiple clients and operations Dan Williams
2006-09-11 23:44   ` Jeff Garzik
2006-09-12  0:14     ` Dan Williams
2006-09-12  0:52       ` Roland Dreier
2006-09-12  6:18         ` Dan Williams
2006-09-12  9:15           ` Evgeniy Polyakov
2006-09-13  4:04           ` Jeff Garzik
2006-09-15 16:38     ` Olof Johansson
2006-09-15 19:44       ` [PATCH] dmaengine: clean up and abstract function types (was Re: [PATCH 08/19] dmaengine: enable multiple clients and operations) Olof Johansson
2006-09-15 20:02         ` [PATCH] [v2] " Olof Johansson
2006-09-18 22:56         ` [PATCH] " Dan Williams
2006-09-19  1:05           ` Olof Johansson
2006-09-19 11:20             ` Alan Cox
2006-09-19 16:32               ` Olof Johansson
2006-09-11 23:18 ` [PATCH 09/19] dmaengine: reduce backend address permutations Dan Williams
2006-09-15 14:46   ` Olof Johansson
2006-09-11 23:18 ` [PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients Dan Williams
2006-09-11 23:18 ` [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation Dan Williams
2006-09-11 23:50   ` Jeff Garzik
2006-09-11 23:18 ` [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy Dan Williams
2006-09-11 23:51   ` Jeff Garzik
2006-09-11 23:18 ` [PATCH 13/19] dmaengine: add support for dma xor zero sum operations Dan Williams
2006-09-11 23:18 ` [PATCH 14/19] dmaengine: add dma_sync_wait Dan Williams
2006-09-11 23:52   ` Jeff Garzik
2006-09-11 23:18 ` [PATCH 15/19] dmaengine: raid5 dma client Dan Williams
2006-09-11 23:54   ` Jeff Garzik
2006-09-11 23:19 ` [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines Dan Williams
2006-09-15 14:57   ` Olof Johansson
2006-09-11 23:19 ` [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs Dan Williams
2006-09-11 23:55   ` Jeff Garzik
2006-09-11 23:19 ` [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization Dan Williams
2006-09-11 23:56   ` Jeff Garzik
2006-09-11 23:19 ` [PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver Dan Williams
2006-09-11 23:38 ` [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction Jeff Garzik
2006-09-11 23:53   ` Dan Williams
2006-09-12  2:41     ` Jeff Garzik
2006-09-12  5:47       ` Dan Williams
2006-09-13  4:05         ` Jeff Garzik
2006-09-13  7:15 ` Jakob Oestergaard
2006-09-13 19:17   ` Dan Williams
2006-09-14  7:42     ` Jakob Oestergaard
2006-10-11  1:46       ` Dan Williams
2006-10-08 22:18 ` Neil Brown
2006-10-10 18:23   ` Dan Williams
2006-10-11  2:44     ` Neil Brown
