Re: [PATCH 1/6] dm raid45 target: export region hash functions and add a needed one

From: Heinz Mauelshagen <heinzm@redhat.com>
To: device-mapper development <dm-devel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Ed Ciechanowski <ed.ciechanowski@intel.com>
Subject: Re: [PATCH 1/6] dm raid45 target: export region hash functions and add a needed one
Date: Thu, 02 Jul 2009 14:52:10 +0200	[thread overview]
Message-ID: <1246539130.20207.12.camel@o> (raw)
In-Reply-To: <1245697837.7390.456.camel@o>

[-- Attachment #1: Type: text/plain, Size: 6495 bytes --]

On Mon, 2009-06-22 at 21:10 +0200, Heinz Mauelshagen wrote:
> On Sun, 2009-06-21 at 22:06 +1000, Neil Brown wrote:
> > On Friday June 19, heinzm@redhat.com wrote:
> > > On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote:
> > > > On Wednesday June 17, neilb@suse.de wrote:
> > > > > 
> > > > > I will try to find time to review your dm-raid5 code with a view to
> > > > > understanding how it plugs in to dm, and then how the md/raid5 engine
> > > > > can be used by dm-raid5.
> > > 
> > > Hi Neil.
> > > 
> > > > 
> > > > I've had a bit of a look through the dm-raid5 patches.
> > > 
> > > Thanks.
> > > 
> > > > 
> > > > Some observations:
> > > > 
> > > > - You have your own 'xor' code against which you do a run-time test of
> > > >   the 'xor_block' code which md/raid5 uses - then choose the fasted.
> > > >   This really should not be necessary.  If you have xor code that runs
> > > >   faster than anything in xor_block, it really would be best to submit
> > > >   it for inclusion in the common xor code base.
> > > 
> > > This is in because it actually shows better performance regularly by
> > > utilizing cache lines etc. more efficiently (tested on Intel, AMD and
> > > Sparc).
> > > 
> > > If xor_block would always have been performed best, I'd dropped that
> > > optimization already.
<SNIP>

Dan, Neil,

like mentioned before I left to LinuxTag last week, here comes an initial
take on dm-raid45 warm/cold CPU cache xor speed optimization metrics.

This shall give us the base to decide to keep or drop the dm-raid45
internal xor optimization magic or move (part of) it into the crypto
subsystem.

Heinz


Howto:
------
I added a loop to walk the list of recovery stripes to dm-raid45.c
in xor_optimize() to allow for over committing the cache and some variables
to be able to display absolute minimum and maximum xor runs performed plus
the number of xor runs achieved per cycle for xor_blocks() and for the dm-raid45
build in xor optimization.

In order to make results more deterministic, I run xor_speed() for <= 5 ticks.

See diff vs. dm-devel dm-raid45 patch (submitted Jun 15th) attached.

Tests being performed on the following 2 systems:

   hostname: a4
   2.6.31-rc1@250HZ timer frequency
   Core i7 920@3.4GHz, 8 MB 3rd Level Cache
   6GB RAM

   hostname: t4
   2.6.31-rc1@250HZ timer frequency
   2 CPU Opeteron 280@2.4GHz, 2*1 MB 2nd Level Cache
   2GB RAM

with the xor optimization being the only load on the systems.


I've performed test runs on each of those with the following mapping tables
and 128 iterations for each of them, which represents a small array case
with 3 drives per set, running the xor optimization on a single core:

Intel:
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 512 10 nosync 1  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0
...
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 512 10 nosync 13  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0

Opteron:
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 256 10 nosync 1  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0
...
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 256 10 nosync 13  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0


Because no actual IO is being performed, I just mapped to error targets
(table used: "0 2199023255552 error"; I know it's large but it ain't matter).

The number following the 2nd nosync parameter is the amount of recovery
stripes with io size of 512 sectors = 256 kilobytes per chunk or
256 sectors = 128 kilobytes per chunk respectively.
I.e. a work set of 768/384 kilobytes per recovery stripe.
These values shall make sure, that results differ in the per mille range
(i.e. more than 100 cycles per test run) where appropriate.


Systems are running out of cache at
~ >= 8 stripes on the Intel (8192 - 2048 code / (512 / 2) / 3)
and
~ >= 0 stripes on the Opteron system (1024 - 768 code) / (256 / 2) / 3).
assuming some cache utilization for code and other data.

See raw kernel log extracts being created by these test runs attached
in a tarball and the script to extract the metrics as well.


Intel results with 128 iterations each:
---------------------------------------

1 stripe  : NB:10 111/80 HM:118 111/82
2 stripes : NB:25 113/87 HM:103 112/91
3 stripes : NB:24 115/93 HM:104 114/93
4 stripes : NB:48 114/93 HM:80 114/93
5 stripes : NB:38 113/94 HM:90 114/94
6 stripes : NB:25 116/94 HM:103 114/94
7 stripes : NB:25 115/95 HM:103 115/95
8 stripes : NB:62 117/96 HM:66 116/95 <<<--- cold cache starts here
9 stripes : NB:66 117/96 HM:62 116/95
10 stripes: NB:73 117/96 HM:55 114/95
11 stripes: NB:63 114/96 HM:65 112/95
12 stripes: NB:51 111/96 HM:77 110/95
13 stripes: NB:65 109/96 HM:63 112/95

NB: number of xor_blocks() parity calculations winning per 128 iterations
HM: number of dm-raid45 xor() parity calculations equal to/winning
    xor_blocks per 128 iterations
NN/MM: count of maximm/minimum calculations achived per iteration in <= 5 ticks.

Opteron results with 128 iterations each:
-----------------------------------------
1 stripe  : NB:0 30/20 HM:128 64/53
2 stripes : NB:0 31/21 HM:128 68/55
3 stripes : NB:0 31/22 HM:128 68/57
4 stripes : NB:0 32/22 HM:128 70/61
5 stripes : NB:0 32/22 HM:128 70/63
6 stripes : NB:0 35/22 HM:128 70/64
7 stripes : NB:0 32/23 HM:128 69/63
8 stripes : NB:0 44/23 HM:128 76/65
9 stripes : NB:0 43/23 HM:128 73/65
10 stripes: NB:0 35/23 HM:128 72/64
11 stripes: NB:0 35/24 HM:128 72/64
12 stripes: NB:0 33/24 HM:128 72/65
13 stripes: NB:0 33/23 HM:128 71/64


Test analysis:
--------------
I must have done something wrong ;-)

On the Opteron, dm-raid45 xor() outperforms xor_blocks() by far.
No warm cache significance visible.

On the Intel, dm-raid45 xor() performs slightly better on warm cache
vs. xor_blocks() performing slightly better on cold cache, which may be
the result of the lag of prefetching in dm-raid45 xor().
xor_blocks() achieves a slightly better maximum in 8 / 13 runs vs.
xor() in 2 test runs. in 3 runs they achieve the same maximum.

This is not deterministic:
min/max varying by up to > 200% on the Opteron
and up to 46% on the Intel.


Questions/Recommendations:
--------------------------
Review the code changes and the data analysis please.

Review the test cases and argue if those are valid
or recommend different ones please.

Can we get this more deterministic (e.g. use prefetching for dm-raid45 xor()) ?

Regards,
Heinz


[-- Attachment #2: xor_performance_metrics --]
[-- Type: application/x-shellscript, Size: 808 bytes --]

[-- Attachment #3: dm-raid45-2.6.31-rc1.patch --]
[-- Type: text/x-patch, Size: 6876 bytes --]

 {drivers => /usr/src/linux/drivers}/md/dm-raid45.c |  139 ++++++++++++++------
 1 files changed, 98 insertions(+), 41 deletions(-)

diff --git a/drivers/md/dm-raid45.c b/drivers/md/dm-raid45.c
index 0c33fea..6ace975 100644
--- a/drivers/md/dm-raid45.c
+++ b/linux/drivers/md/dm-raid45.c
@@ -8,7 +8,28 @@
  *
  * Linux 2.6 Device Mapper RAID4 and RAID5 target.
  *
- * Supports:
+ * Tested-by: Intel; Marcin.Labun@intel.com, krzysztof.wojcik@intel.com
+ *
+ *
+ * Supports the following ATARAID vendor solutions (and SNIA DDF):
+ *
+ * 	Adaptec HostRAID ASR
+ * 	SNIA DDF1
+ * 	Hiphpoint 37x
+ * 	Hiphpoint 45x
+ *	Intel IMSM
+ *	Jmicron ATARAID
+ *	LSI Logic MegaRAID
+ *	NVidia RAID
+ *	Promise FastTrack
+ *	Silicon Image Medley
+ *	VIA Software RAID
+ *
+ * via the dmraid application.
+ *
+ *
+ * Features:
+ *
  *	o RAID4 with dedicated and selectable parity device
  *	o RAID5 with rotating parity (left+right, symmetric+asymmetric)
  *	o recovery of out of sync device for initial
@@ -37,7 +58,7 @@
  * ANALYZEME: recovery bandwidth
  */ 
 
-static const char *version = "v0.2594p";
+static const char *version = "v0.2596l";
 
 #include "dm.h"
 #include "dm-memcache.h"
@@ -101,9 +122,6 @@ static const char *version = "v0.2594p";
 /* Check value in range. */
 #define	range_ok(i, min, max)	(i >= min && i <= max)
 
-/* Check argument is power of 2. */
-#define POWER_OF_2(a) (!(a & (a - 1)))
-
 /* Structure access macros. */
 /* Derive raid_set from stripe_cache pointer. */
 #define	RS(x)	container_of(x, struct raid_set, sc)
@@ -1848,10 +1866,10 @@ struct xor_func {
 	xor_function_t f;
 	const char *name;
 } static xor_funcs[] = {
-	{ xor_8,   "xor_8"  },
-	{ xor_16,  "xor_16" },
-	{ xor_32,  "xor_32" },
 	{ xor_64,  "xor_64" },
+	{ xor_32,  "xor_32" },
+	{ xor_16,  "xor_16" },
+	{ xor_8,   "xor_8"  },
 	{ xor_blocks_wrapper, "xor_blocks" },
 };
 
@@ -3114,10 +3132,10 @@ static void _do_endios(struct raid_set *rs, struct stripe *stripe,
 		SetStripeReconstructed(stripe);
 
 		/* FIXME: reschedule to be written in case of read. */
-		// if (!RSDead && RSDegraded(rs) !StripeRBW(stripe)) {
-		// 	chunk_set(CHUNK(stripe, stripe->idx.recover), DIRTY);
-		// 	stripe_chunks_rw(stripe);
-		// }
+		/* if (!RSDead && RSDegraded(rs) !StripeRBW(stripe)) {
+			chunk_set(CHUNK(stripe, stripe->idx.recover), DIRTY);
+			stripe_chunks_rw(stripe);
+		} */
 
 		stripe->idx.recover = -1;
 	}
@@ -3257,7 +3275,7 @@ static void do_ios(struct raid_set *rs, struct bio_list *ios)
 		/* Check for recovering regions. */
 		sector = _sector(rs, bio);
 		r = region_state(rs, sector, DM_RH_RECOVERING);
-		if (unlikely(r && bio_data_dir(bio) == WRITE)) {
+		if (unlikely(r)) {
 			delay++;
 			/* Wait writing to recovering regions. */
 			dm_rh_delay_by_region(rh, bio,
@@ -3409,64 +3427,104 @@ static unsigned mbpers(struct raid_set *rs, unsigned speed)
 /*
  * Discover fastest xor algorithm and # of chunks combination.
  */
-/* Calculate speed for algorithm and # of chunks. */
+/* Calculate speed of particular algorithm and # of chunks. */
 static unsigned xor_speed(struct stripe *stripe)
 {
+	int ticks = 5;
 	unsigned r = 0;
-	unsigned long j;
 
-	/* Wait for next tick. */
-	for (j = jiffies; j == jiffies; )
-		;
+	/* Do xors for a few ticks. */
+	while (ticks--) {
+		unsigned xors = 0;
+		unsigned long j = jiffies;
+
+		while (j == jiffies) {
+			mb();
+			common_xor(stripe, stripe->io.size, 0, 0);
+			mb();
+			xors++;
+			mb();
+		}
 
-	/* Do xors for a full tick. */
-	for (j = jiffies; j == jiffies; ) {
-		mb();
-		common_xor(stripe, stripe->io.size, 0, 0);
-		mb();
-		r++;
+		if (xors > r)
+			r = xors;
 	}
 
 	return r;
 }
 
+/* Define for xor multi recovery stripe optimization runs. */
+#define DMRAID45_XOR_TEST
+
 /* Optimize xor algorithm for this RAID set. */
 static unsigned xor_optimize(struct raid_set *rs)
 {
-	unsigned chunks_max = 2, p = rs->set.raid_devs, speed_max = 0;
+	unsigned chunks_max = 2, p, speed_max = 0;
 	struct xor_func *f = ARRAY_END(xor_funcs), *f_max = NULL;
 	struct stripe *stripe;
+unsigned speed_hm = 0, speed_min = ~0, speed_xor_blocks = 0;
 
 	BUG_ON(list_empty(&rs->recover.stripes));
+#ifndef DMRAID45_XOR_TEST
 	stripe = list_first_entry(&rs->recover.stripes, struct stripe,
 				  lists[LIST_RECOVER]);
 
 	/* Must set uptodate so that xor() will belabour chunks. */
-	while (p--)
+	for (p = rs->set.raid_devs; p-- ;)
 		SetChunkUptodate(CHUNK(stripe, p));
+#endif
 
 	/* Try all xor functions. */
 	while (f-- > xor_funcs) {
 		unsigned speed;
 
-		/* Set actual xor function for common_xor(). */
-		rs->xor.f = f;
-		rs->xor.chunks = (f->f == xor_blocks_wrapper ?
-				  (MAX_XOR_BLOCKS + 1) : XOR_CHUNKS_MAX) + 1;
-
-		while (rs->xor.chunks-- > 2) {
-			speed = xor_speed(stripe);
-			if (speed > speed_max) {
-				speed_max = speed;
-				chunks_max = rs->xor.chunks;
-				f_max = f;
+#ifdef DMRAID45_XOR_TEST
+		list_for_each_entry(stripe, &rs->recover.stripes,
+				    lists[LIST_RECOVER]) {
+				for (p = rs->set.raid_devs; p-- ;)
+					SetChunkUptodate(CHUNK(stripe, p));
+#endif
+	
+			/* Set actual xor function for common_xor(). */
+			rs->xor.f = f;
+			rs->xor.chunks = (f->f == xor_blocks_wrapper ?
+					  (MAX_XOR_BLOCKS + 1) :
+					  XOR_CHUNKS_MAX) + 1;
+	
+			while (rs->xor.chunks-- > 2) {
+				speed = xor_speed(stripe);
+
+#ifdef DMRAID45_XOR_TEST
+				if (f->f == xor_blocks_wrapper) {
+					if (speed > speed_xor_blocks)
+						speed_xor_blocks = speed;
+				} else if (speed > speed_hm)
+					speed_hm = speed;
+	
+				if (speed < speed_min)
+					speed_min = speed;
+#endif
+
+				if (speed > speed_max) {
+					speed_max = speed;
+					chunks_max = rs->xor.chunks;
+					f_max = f;
+				}
 			}
+#ifdef DMRAID45_XOR_TEST
 		}
+#endif
 	}
 
-	/* Memorize optimum parameters. */
+	/* Memorize optimal parameters. */
 	rs->xor.f = f_max;
 	rs->xor.chunks = chunks_max;
+#ifdef DMRAID45_XOR_TEST
+	DMINFO("%s stripes=%u min=%u xor_blocks=%u hm=%u max=%u",
+	       speed_max == speed_hm ? "HM" : "NB",
+	       rs->recover.recovery_stripes, speed_min,
+	       speed_xor_blocks, speed_hm, speed_max);
+#endif
 	return speed_max;
 }
 
@@ -3786,7 +3844,7 @@ static int get_raid_variable_parms(struct dm_target *ti, char **argv,
 		  "Invalid recovery switch; must be \"sync\" or \"nosync\"" },
 		{ 0,
 		  "Invalid number of recovery stripes;"
-		  "must be -1, > 0 and <= 16384",
+		  "must be -1, > 0 and <= 64",
 		  RECOVERY_STRIPES_MIN, RECOVERY_STRIPES_MAX,
 		  &vp->recovery_stripes_parm, &vp->recovery_stripes, NULL },
 	}, *varp;
@@ -3831,7 +3889,7 @@ static int get_raid_variable_parms(struct dm_target *ti, char **argv,
 
 		if (sscanf(*(argv++), "%d", &value) != 1 ||
 		    (value != -1 &&
-		     ((varp->action && !POWER_OF_2(value)) ||
+		     ((varp->action && !is_power_of_2(value)) ||
 		      !range_ok(value, varp->min, varp->max))))
 			TI_ERR(varp->errmsg);
 

[-- Attachment #4: xor_optomize_test_data.tar.bz2 --]
[-- Type: application/x-bzip-compressed-tar, Size: 17107 bytes --]

[-- Attachment #5: Type: text/plain, Size: 0 bytes --]