* [PATCH] reworked dm-switch target
@ 2012-08-15 22:36 Mikulas Patocka
  2012-08-16  7:09 ` Alasdair G Kergon
From: Mikulas Patocka @ 2012-08-15 22:36 UTC (permalink / raw)
  To: Alasdair G. Kergon, Kevin_OKelley, Jim_Ramsay, Narendran_Ganapathy
  Cc: dm-devel

This is simplified dm-switch target, originally written by Jim Ramsay.

Changes from the original:

Removed the netlink interface and added a dm message interface to change
the mapping table, because the message interface is noticeably simpler.
The table is changed by sending a dm message:
"dmsetup message <device-name> 0 set-table <commands...>"
The message can have multiple commands; each command has the format
"<page>:<device index>" or "<start page>-<end page>:<device index>".
The page or pages in the specified range are remapped to the device with
the given index.
For example "dmsetup message switch 0 set-table 0-15:0 16-31:1 32-33:2"
sets pages 0-15 to device 0, 16-31 to device 1, 32-33 to device 2.

The dm-switch.h file was removed (with the netlink interface gone, there
is no need for this file).

The page table is allocated using vmalloc instead of kmalloc. kmalloc
allocates physically contiguous memory and can fail if memory is
fragmented; vmalloc allocates discontiguous pages and maps them to a
contiguous virtual address range using the MMU.

RCU and page table reallocation were removed. The page table is allocated
in the constructor and stays the same for the lifetime of the device.
Reads on the I/O path use ACCESS_ONCE and each update stores a whole word
at once, so the table can be read and modified concurrently without RCU.

The page table is initialized with a repetitive pattern that uses all
the devices.

One page table entry is 64 bits wide on 64-bit processors and 32 bits wide
on 32-bit processors (in the original it was always 32-bit). Making it
64-bit lets it pack entries slightly more densely in some cases: for
example, with five to eight devices each entry needs 3 bits, so a 64-bit
word holds 21 entries while two 32-bit words hold only 20.

Removed dm status:
- ios_remapped/ios_unmapped counting was removed because all the IOs are
  mapped when a statically allocated page table is used.
- Userspace-supplied numbers that are reported in the status were
  removed because it is not clear what they were used for.
- The device list with 'A' statuses was removed (it could be added back
  if we implement device error tracking); there was just mock code that
  returned 'A' for all devices.

The device limit check was simplified to use i_size_read and fixed to
take the 'start' value into account as well.

do_div was replaced with sector_div - if we have 32-bit sectors, we
don't need to do slow 64-bit math.

The divisions were optimized for the case where the divisor is a power
of two: a shift is used instead of a division.
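
For illustration, the pattern as it appears in the patch below (the
constructor precomputes the log2 once, and the map path picks the fast
case):

	/* constructor: page_size_bits = log2(page_size), or -1 */
	if (!(pctx->page_size & (pctx->page_size - 1)))
		pctx->page_size_bits = __ffs(pctx->page_size);
	else
		pctx->page_size_bits = -1;

	/* map path: p = offset / page_size */
	if (pctx->page_size_bits >= 0)
		p >>= pctx->page_size_bits;
	else
		sector_div(p, pctx->page_size);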

dm_set_target_max_io_len is now called. The original code didn't set a
maximum I/O length, so it could issue I/Os that span page boundaries.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/Kconfig     |   11 +
 drivers/md/Makefile    |    1 
 drivers/md/dm-switch.c |  420 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 432 insertions(+)

Index: linux-3.5.1-fast/drivers/md/Kconfig
===================================================================
--- linux-3.5.1-fast.orig/drivers/md/Kconfig	2012-08-16 00:29:55.000000000 +0200
+++ linux-3.5.1-fast/drivers/md/Kconfig	2012-08-16 00:30:14.000000000 +0200
@@ -417,4 +417,15 @@ config DM_VERITY2
 
 source "drivers/md/enhanceio/Kconfig"
 
+config DM_SWITCH
+	tristate "Switch target support (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+	  Help text needs writing
+
+	  To compile this code as a module, choose M here: the module will
+	  be called dm-switch.
+
+	  If unsure, say N.
+
 endif # MD
Index: linux-3.5.1-fast/drivers/md/Makefile
===================================================================
--- linux-3.5.1-fast.orig/drivers/md/Makefile	2012-08-16 00:29:55.000000000 +0200
+++ linux-3.5.1-fast/drivers/md/Makefile	2012-08-16 00:30:14.000000000 +0200
@@ -48,6 +48,7 @@ obj-$(CONFIG_DM_THIN_PROVISIONING)	+= dm
 obj-$(CONFIG_DM_VERITY)		+= dm-verity.o
 obj-$(CONFIG_DM_ZEROED)		+= dm-zeroed.o
 obj-$(CONFIG_DM_ENHANCEIO)	+= enhanceio/
+obj-$(CONFIG_DM_SWITCH)		+= dm-switch.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
Index: linux-3.5.1-fast/drivers/md/dm-switch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.5.1-fast/drivers/md/dm-switch.c	2012-08-16 00:35:03.000000000 +0200
@@ -0,0 +1,420 @@
+/*
+ * Copyright (c) 2010-2011 by Dell, Inc.  All rights reserved.
+ *
+ * This file is released under the GPL.
+ *
+ * Description:
+ *
+ *     file:    dm-switch.c
+ *     authors: Kevin_OKelley@dell.com
+ *              Jim_Ramsay@dell.com
+ *              Narendran_Ganapathy@dell.com
+ *		mpatocka@redhat.com
+ *
+ * This file implements a "switch" target which efficiently implements a
+ * mapping of IOs to underlying block devices in scenarios where there are:
+ *   (1) a large number of address regions
+ *   (2) a fixed size equal across all address regions
+ *   (3) no pattern that allows for a compact description with something like
+ *       the dm-stripe target.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/device-mapper.h>
+#include <linux/vmalloc.h>
+
+#define DM_MSG_PREFIX "switch"
+
+/*
+ * Switch device context block: A new one is created for each dm device.
+ * Contains an array of devices from which we have taken references.
+ */
+struct switch_dev {
+	struct dm_dev *dmdev;
+	sector_t start;
+};
+
+typedef unsigned long pt_entry;
+
+/* Switch context header */
+struct switch_ctx {
+	unsigned dev_count;		/* Number of devices */
+	unsigned page_size;		/* Page size in 512B sectors */
+	unsigned long n_pages;		/* Number of pages */
+	signed char page_size_bits;	/* log2 of page_size or -1 */
+
+	unsigned char pte_size;		/* Page table entry size in bits */
+	unsigned char pte_fields;	/* Number of entries per pt_entry */
+	signed char pte_fields_bits;	/* log2 of pte_fields or -1 */
+	pt_entry *page_table;		/* Page table */
+
+	/* Array of dm devices to switch between */
+	struct switch_dev dev_list[0];
+};
+
+static inline void switch_get_position(struct switch_ctx *pctx,
+				       unsigned long page,
+				       unsigned long *index,
+				       unsigned *bit)
+
+{
+	if (pctx->pte_fields_bits >= 0) {
+		*index = page >> pctx->pte_fields_bits;
+		*bit = page & (pctx->pte_fields - 1);
+	} else {
+		*index = page / pctx->pte_fields;
+		*bit = page % pctx->pte_fields;
+	}
+	*bit *= pctx->pte_size;
+
+}
+
+static void switch_page_table_write(struct switch_ctx *pctx, unsigned long page,
+				    unsigned value)
+{
+	unsigned long index;
+	unsigned bit;
+	pt_entry pte;
+
+	switch_get_position(pctx, page, &index, &bit);
+
+	pte = pctx->page_table[index];
+	pte &= ~((((pt_entry)1 << pctx->pte_size) - 1) << bit);
+	pte |= (pt_entry)value << bit;
+	pctx->page_table[index] = pte;
+}
+
+/*
+ * Constructor: Called each time a dmsetup command creates a dm device.  The
+ * target parameter will already have the table, type, begin and len fields
+ * filled in.  Arguments are in pairs: <dev_path> <offset>.  Therefore, we get
+ * multiple constructor calls, but we will need to build a list of switch_ctx
+ * blocks so that the page table information gets matched to the correct
+ * device.
+ */
+static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	unsigned a;
+	int n;
+	int r;
+	unsigned dev_count;
+	struct switch_ctx *pctx;
+	sector_t dev_size;
+	unsigned long e;
+
+	if (argc < 4) {
+		ti->error = "Insufficient arguments";
+		r = -EINVAL;
+		goto error;
+	}
+	if (kstrtouint(argv[0], 10, &dev_count) ||
+	    !dev_count ||
+	    dev_count > (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_dev)) {
+		ti->error = "Invalid device count";
+		r = -EINVAL;
+		goto error;
+	}
+	if (dev_count != (argc - 2) / 2) {
+		ti->error = "Invalid argument count";
+		r = -EINVAL;
+		goto error;
+	}
+	pctx = kmalloc(sizeof(struct switch_ctx) + (dev_count * sizeof(struct switch_dev)),
+		       GFP_KERNEL);
+	if (!pctx) {
+		ti->error = "Cannot allocate redirect context";
+		r = -ENOMEM;
+		goto error;
+	}
+	pctx->dev_count = dev_count;
+	if (kstrtouint(argv[1], 10, &pctx->page_size) ||
+	    !pctx->page_size) {
+		ti->error = "Invalid page size";
+		r = -EINVAL;
+		goto error_kfree;
+	}
+
+	if (!(pctx->page_size & (pctx->page_size - 1)))
+		pctx->page_size_bits = __ffs(pctx->page_size);
+	else
+		pctx->page_size_bits = -1;
+
+	pctx->pte_size = 1;
+	while (pctx->pte_size < sizeof(pt_entry) * 8 &&
+	       (pt_entry)1 << pctx->pte_size < pctx->dev_count)
+		pctx->pte_size++;
+
+	pctx->pte_fields = (sizeof(pt_entry) * 8) / pctx->pte_size;
+	if (!(pctx->pte_fields & (pctx->pte_fields - 1)))
+		pctx->pte_fields_bits = __ffs(pctx->pte_fields);
+	else
+		pctx->pte_fields_bits = -1;
+
+	dev_size = ti->len;
+	if (sector_div(dev_size, pctx->page_size))
+		dev_size++;
+
+	pctx->n_pages = dev_size;
+	if (pctx->n_pages != dev_size || pctx->n_pages >= ULONG_MAX) {
+		ti->error = "Too long page table";
+		r = -EINVAL;
+		goto error_kfree;
+	}
+
+	if (sector_div(dev_size, pctx->pte_fields))
+		dev_size++;
+
+	if (dev_size > ULONG_MAX / sizeof(pt_entry)) {
+		ti->error = "Too long page table";
+		r = -EINVAL;
+		goto error_kfree;
+	}
+
+	r = dm_set_target_max_io_len(ti, pctx->page_size);
+	if (r)
+		goto error_kfree;
+
+	pctx->page_table = vmalloc(dev_size * sizeof(pt_entry));
+	if (!pctx->page_table) {
+		ti->error = "Cannot allocate page table";
+		r = -ENOMEM;
+		goto error_kfree;
+	}
+
+	a = 0;
+	for (e = 0; e < pctx->n_pages; e++) {
+		switch_page_table_write(pctx, e, a);
+		a++;
+		if (a >= pctx->dev_count)
+			a = 0;
+	}
+
+	/*
+	 * Check each device beneath the target to ensure that the limits are
+	 * consistent.
+	 */
+	for (n = 0, a = 2; n < pctx->dev_count; n++, a += 2) {
+		struct dm_dev *dm;
+		sector_t dev_size;
+		unsigned long long start;
+
+		if (kstrtoull(argv[a + 1], 10, &start) ||
+		    start != (sector_t)start) {
+			ti->error = "Invalid device starting offset";
+			r = -EINVAL;
+			n--;
+			goto error_release_n;
+		}
+		r = dm_get_device
+		    (ti, argv[a], dm_table_get_mode(ti->table), &dm);
+		if (r) {
+			ti->error = "Device lookup failed";
+			n--;
+			goto error_release_n;
+		}
+		pctx->dev_list[n].dmdev = dm;
+		pctx->dev_list[n].start = start;
+
+		dev_size = i_size_read(dm->bdev->bd_inode) >> SECTOR_SHIFT;
+
+		if (ti->len > start + dev_size) {
+			ti->error = "Device is too small";
+			r = -EINVAL;
+			goto error_release_n;
+		}
+	}
+
+	ti->private = pctx;
+
+	return 0;
+
+error_release_n:		/* De-reference all devices  */
+	for (; n >= 0; n--)
+		dm_put_device(ti, pctx->dev_list[n].dmdev);
+
+	vfree(pctx->page_table);
+error_kfree:
+	kfree(pctx);
+
+error:
+	return r;
+}
+
+/*
+ * Destructor: Don't free the dm_target, just the ti->private data (if any).
+ */
+static void switch_dtr(struct dm_target *ti)
+{
+	int n;
+	struct switch_ctx *pctx = ti->private;
+
+	for (n = 0; n < pctx->dev_count; n++)
+		dm_put_device(ti, pctx->dev_list[n].dmdev);
+
+	vfree(pctx->page_table);
+	kfree(pctx);
+}
+
+static int switch_map(struct dm_target *ti, struct bio *bio,
+		      union map_info *map_context)
+{
+	struct switch_ctx *pctx = ti->private;
+
+	sector_t offset = bio->bi_sector - ti->begin;
+	sector_t p;
+	unsigned long index;
+	unsigned bit, idev;
+
+	p = offset;
+	if (pctx->page_size_bits >= 0)
+		p >>= pctx->page_size_bits;
+	else
+		sector_div(p, pctx->page_size);
+
+	switch_get_position(pctx, p, &index, &bit);
+
+	idev = (ACCESS_ONCE(pctx->page_table[index]) >> bit) & ((1 << pctx->pte_size) - 1);
+	/* This can only happen if the processor uses non-atomic stores. */
+	if (unlikely(idev >= pctx->dev_count))
+		idev = 0;
+
+	bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;
+	bio->bi_sector = pctx->dev_list[idev].start + offset;
+
+	return DM_MAPIO_REMAPPED;
+}
+
+static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	static DEFINE_MUTEX(message_mutex);
+
+	struct switch_ctx *pctx = ti->private;
+	int r;
+
+	mutex_lock(&message_mutex);
+
+	if (!argc) {
+		goto invalid_message;
+	} else if (!strcasecmp(argv[0], "set-table")) {
+		unsigned i;
+		for (i = 1; i < argc; i++) {
+			unsigned long long from, to;
+			unsigned device;
+			char dummy;
+			if (sscanf(argv[i], "%llu-%llu:%u%c", &from, &to, &device, &dummy) == 3)
+				goto do_set_table;
+			if (sscanf(argv[i], "%llu:%u%c", &from, &device, &dummy) == 2) {
+				to = from;
+				goto do_set_table;
+			}
+			DMWARN("invalid set-table argument");
+			r = -EINVAL;
+			goto ret;
+do_set_table:
+			if (from > to || to >= pctx->n_pages) {
+				DMWARN("invalid set-table page");
+				r = -EINVAL;
+				goto ret;
+			}
+			if (device >= pctx->dev_count) {
+				DMWARN("invalid set-table device");
+				r = -EINVAL;
+				goto ret;
+			}
+			for (; from <= to; from++)
+				switch_page_table_write(pctx, from, device);
+		}
+		r = 0;
+	} else {
+invalid_message:
+		DMWARN("unrecognised message received.");
+		r = -EINVAL;
+	}
+ret:
+	mutex_unlock(&message_mutex);
+	return r;
+}
+
+static int switch_status(struct dm_target *ti, status_type_t type,
+			 unsigned status_flags, char *result, unsigned maxlen)
+{
+	struct switch_ctx *pctx = ti->private;
+	unsigned sz = 0;
+	int n;
+
+	result[0] = '\0';
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = 0;
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%u %u", pctx->dev_count, pctx->page_size);
+		for (n = 0; n < pctx->dev_count; n++) {
+			DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,
+			       (unsigned long long)pctx->dev_list[n].start);
+		}
+		break;
+
+	default:
+		return 0;
+	}
+	return 0;
+}
+
+/*
+ * Switch ioctl:
+ *
+ * Passthrough all ioctls to the first path.
+ */
+static int switch_ioctl(struct dm_target *ti, unsigned cmd,
+			unsigned long arg)
+{
+	struct switch_ctx *pctx = ti->private;
+	struct block_device *bdev;
+	fmode_t mode;
+
+	bdev = pctx->dev_list[0].dmdev->bdev;
+	mode = pctx->dev_list[0].dmdev->mode;
+
+	return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
+}
+
+static struct target_type switch_target = {
+	.name = "switch",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr = switch_ctr,
+	.dtr = switch_dtr,
+	.map = switch_map,
+	.message = switch_message,
+	.status = switch_status,
+	.ioctl = switch_ioctl,
+};
+
+int __init dm_switch_init(void)
+{
+	int r;
+
+	r = dm_register_target(&switch_target);
+	if (r) {
+		DMERR("dm_register_target() failed %d", r);
+		return r;
+	}
+
+	return 0;
+}
+
+void dm_switch_exit(void)
+{
+	dm_unregister_target(&switch_target);
+}
+
+module_init(dm_switch_init);
+module_exit(dm_switch_exit);
+
+MODULE_DESCRIPTION(DM_NAME " fixed-size address-region-mapping throughput-oriented path selector");
+MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
+MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
+MODULE_LICENSE("GPL");

* Re: [PATCH] reworked dm-switch target
  2012-08-15 22:36 [PATCH] reworked dm-switch target Mikulas Patocka
@ 2012-08-16  7:09 ` Alasdair G Kergon
  2012-08-17 14:18   ` Jim_Ramsay
From: Alasdair G Kergon @ 2012-08-16  7:09 UTC (permalink / raw)
  To: Mikulas Patocka, Kevin_OKelley, Jim_Ramsay, Narendran_Ganapathy,
	dm-devel

On Wed, Aug 15, 2012 at 06:36:55PM -0400, Mikulas Patocka wrote:
> For example "dmsetup message switch 0 set-table 0-15:0 16-31:1 32-33:2"

What I think we need to know now is how large this makes a typical message
to set up the table.  Set-up can be split across several messages.
Or is the typical data sufficiently random and large that a more compact
format might be a better idea?
 
Alasdair

* Re: [PATCH] reworked dm-switch target
  2012-08-16  7:09 ` Alasdair G Kergon
@ 2012-08-17 14:18   ` Jim_Ramsay
  2012-08-20 19:20     ` Mikulas Patocka
  2012-08-20 20:48     ` [PATCH] reworked dm-switch target Alasdair G Kergon
From: Jim_Ramsay @ 2012-08-17 14:18 UTC (permalink / raw)
  To: agk, mpatocka, Kevin_OKelley, Jason_Shamberger, dm-devel

Hi Alasdair and Mikulas.  It's great to hear this feedback, thanks!

We had a few comments about the latest changes:

1) Uploading large page tables

As Alasdair mentioned, a more compact method of sending the page table will be necessary.  Consider a volume with a page table that consists of 1572864 entries in total.  On our storage solution, the pages are spread out among different group members (i.e. underlying DM devices) and it is uncommon to see long stretches mapping to the same device.  The reason for this is similar to the principle of striping, attempting to maximize the chance of simultaneous accesses on multiple group members.  This means that there is little gain in having a message format that allows a contiguous range of pages to be set to the same value.

Assuming a fairly well-distributed layout of 1572864 pages where 50% of the pages are different every other page, 20% are different every 2 pages, 10% every 5 pages, 10% every 10 pages, and 10% every 20 pages, this would leave us with a dmsetup message with argc=998768

  dmsetup message switch 0 set-table 0-0:1 1-1:0 2-2:2 3-3:1 4-4:0 5-5:2 6-6:0 7-8:1 9-15:2 16-16:1 ... (plus almost 1000000 more arguments...)

We agree that using 'dmsetup message' is a lot cleaner than using a netlink socket in a number of ways, but the fact that it's argc/argv space-delimited shell-parsed data makes it more difficult to send large amounts of binary data like the bit-compressed page table.  We would be fine leaving in the proposed syntax for setting specific pages, as it may be useful to others and in small device testing scenarios, but an additional mechanism to upload larger chunks of binary data all at once would be important for our use of the device.

Perhaps we can work with you on designing an alternate non-netlink mechanism to achieve the same goal... A sysfs file per DM device for userland processes to do direct I/O with?  Base64-encoding larger chunks of the binary page tables and passing those values through 'dmsetup message'?

2) vmalloc and TLB performance

Having a (virtually) contiguous memory range certainly simplifies the allocation and lookup algorithms, but what about the concerns about vmalloc() that are summarized nicely in Documentation/flexible-arrays.txt:

"Large contiguous memory allocations can be unreliable in the Linux kernel.
Kernel programmers will sometimes respond to this problem by allocating
pages with vmalloc().  This solution is not ideal, though.  On 32-bit systems,
memory from vmalloc() must be mapped into a relatively small address space;
it's easy to run out.  On SMP systems, the page table changes required by
vmalloc() allocations can require expensive cross-processor interrupts on
all CPUs.  And, on all systems, use of space in the vmalloc() range
increases pressure on the translation lookaside buffer (TLB), reducing the
performance of the system."

The page table lookup is in the I/O path, so performance is an important consideration.  Do you have any performance comparisons between our existing 2-level lookup of kmalloc'd memory and a single vmalloc'd memory lookup?  Multiple devices of similarly large table size may be in use simultaneously, so this needs consideration as well.

Also, in the example above with 1572864 page table entries, assuming 2 bits per entry requires a table of 384KB.  Would this be a problem for the vmalloc system, especially on 32-bit systems, if there are multiple devices of similarly large size in use at the same time?

It can also be desirable to allow sparsely-populated page tables, when it is known that large chunks are not needed or deemed (by external logic) not important enough to consume kernel memory.  A 2-level kmalloc'd memory scheme can save memory in sparsely-allocated situations.

3) Userland values and counters

The "user saved" values are useful for debugging purposes.  For example, we had been using the 0th field for a timestamp so it's easy to manually validate when the last page table upload succeeded, and the other field for a count of the number of page table entries uploaded so far, but these could be used for other checkpointing or checksumming by userland processes.

Also, while we had not yet implemented a mechanism to retrieve the per-chunk hit counters, this would be valuable to have for a userland process to decide which chunks of the page table are "hot" for a sparsely-populated situation.

4) num_discard_requests, num_flush_requests, and iterate_devices

I have a slightly updated version of the driver that implements these DM target features as well.  I was actually preparing to submit the changes to this list when this conversation began, and will be doing so shortly.

--
Jim Ramsay

* Re: [PATCH] reworked dm-switch target
  2012-08-17 14:18   ` Jim_Ramsay
@ 2012-08-20 19:20     ` Mikulas Patocka
  2012-08-21 16:33       ` Jim Ramsay
  2012-08-20 20:48     ` [PATCH] reworked dm-switch target Alasdair G Kergon
From: Mikulas Patocka @ 2012-08-20 19:20 UTC (permalink / raw)
  To: Jim_Ramsay; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley

Hi

On Fri, 17 Aug 2012, Jim_Ramsay@DELL.com wrote:

> Hi Alasdair and Mikulas.  It's great to hear this feedback, thanks!
> 
> We had a few comments about the latest changes:
> 
> 1) Uploading large page tables
> 
> As Alasdair mentioned, a more compact method of sending the page table 
> will be necessary.  Consider a volume with a page table that consists of 
> 1572864 entries in total.  On our storage solution, the pages are spread 
> out among different group members (i.e. underlying DM devices) and it is 
> uncommon to see long stretches mapping to the same device.  The reason for 
> this is similar to the principle of striping, attempting to maximize the 
> chance of simultaneous accesses on multiple group members.  This means 
> that there is little gain in having a message format that allows a 
> contiguous range of pages to be set to the same value.
> 
> Assuming a fairly well-distributed layout of 1572864 pages where 50% of 
> the pages are different every other page, 20% are different every 2 pages, 
> 10% every 5 pages, 10% every 10 pages, and 10% every 20 pages, this would 
> leave us with a dmsetup message with argc=998768
> 
>   dmsetup message switch 0 set-table 0-0:1 1-1:0 2-2:2 3-3:1 4-4:0 5-5:2 6-6:0 7-8:1 9-15:2 16-16:1 ... (plus almost 1000000 more arguments...)

You don't have to use the dash, you can send:
dmsetup message switch 0 set-table 0:1 1:0 2:2 3:1 4:0 ... etc.

You don't have to send the whole table at once in one message. Using a
message with 998768 arguments is bad (it can trigger allocation failures
in the kernel).

But you can split the initial table load into several messages, each 
having up to 4096 bytes, so that it fits into a single page.
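
For illustration, a userspace loader along these lines can do the split
(a minimal sketch: the device name and the page-to-device mapping are
placeholders, and it shells out to dmsetup; a real loader might use
libdevmapper instead):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MSG 4096

static void flush_msg(const char *dev, const char *msg)
{
	char cmd[MAX_MSG + 128];

	/* assumes dmsetup is in PATH */
	snprintf(cmd, sizeof(cmd), "dmsetup message %s 0 set-table%s", dev, msg);
	if (system(cmd) != 0)
		exit(1);
}

int main(void)
{
	const char *dev = "switch";		/* placeholder device name */
	unsigned long page, n_pages = 1572864;	/* example size from above */
	char msg[MAX_MSG];
	size_t len = 0;

	for (page = 0; page < n_pages; page++) {
		char entry[64];
		unsigned device = page % 3;	/* placeholder mapping */
		int l = snprintf(entry, sizeof(entry), " %lu:%u", page, device);

		/* keep each set-table argument list under one page */
		if (len + l + 1 > sizeof(msg)) {
			flush_msg(dev, msg);
			len = 0;
		}
		memcpy(msg + len, entry, l + 1);
		len += l;
	}
	if (len)
		flush_msg(dev, msg);
	return 0;
}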

> We agree that using 'dmestup message' is a lot cleaner than using a 
> netlink socket in a number of ways, but the fact that it's argc/argv 
> space-delimited shell-parsed data makes it more difficult to send large 
> amounts of binary data like the bit-compressed page table.  We would be 
> fine leaving in the proposed syntax for setting specific pages, as it may 
> be useful to others and in small device testing scenarios, but an 
> additional mechanism to upload larger chunks of binary data all at once 
> would be important for our use of the device.
> 
> Perhaps we can work with you on designing alternate non-netlink mechanism 
> to achieve the same goal... A sysfs file per DM device for userland 
> processes to do direct I/O with?  Base64-encoding larger chunks of the 
> binary page tables and passing those values through 'dmsetup message'?

As I said, you don't have to upload the whole table with one message ... 
or if you really need to update the whole table at once, explain why.

> 2) vmalloc and TLB performance
> 
> Having a (virtually) contiguous memory range certainly simplifies the 
> allocation and lookup algorithms, but what about the concerns about 
> vmalloc() that are summarized nicely in Documentation/flexible-arrays.txt:
> 
> "Large contiguous memory allocations can be unreliable in the Linux kernel.
> Kernel programmers will sometimes respond to this problem by allocating
> pages with vmalloc().  This solution is not ideal, though.  On 32-bit systems,
> memory from vmalloc() must be mapped into a relatively small address space;
> it's easy to run out.

The original code uses a simple kmalloc to allocate the whole table.

The maximum size allocatable with kmalloc is 4MB.

The minimum vmalloc arena is 128MB (on x86) - so the switch from kmalloc 
to vmalloc makes it no worse.

> On SMP systems, the page table changes required by
> vmalloc() allocations can require expensive cross-processor interrupts on
> all CPUs.

vmalloc is used only once when the target is loaded, so performance is not 
an issue here.

> And, on all systems, use of space in the vmalloc() range
> increases pressure on the translation lookaside buffer (TLB), reducing the
> performance of the system."
> 
> The page table lookup is in the I/O path, so performance is an important 
> consideration.  Do you have any performance comparisons between our 
> existing 2-level lookup of kmalloc'd memory versus a single vmalloc'd 

There was just a 1-level lookup in the original dm-switch patch. Did you
add the 2-level lookup recently?

> memory lookup?  Multiple devices of similarly large table size may be in 
> use simultaneously, so this needs consideration as well.
> 
> Also, in the example above with 1572864 page table entries, assuming 2 
> bits per entry requires a table of 384KB.  Would this be a problem for the 
> vmalloc system, especially on 32-bit systems, if there are multiple 
> devices of similarly large size in use at the same time?

384KB is not a problem, the whole vmalloc space has 128MB.

> It can also be desirable to allow sparsely-populated page tables, when it 
> is known that large chunks are not needed or deemed (by external logic) 
> not important enough to consume kernel memory.  A 2-level kmalloc'd memory 
> scheme can save memory in sparsely-allocated situations.

> 3) Userland values and counters
> 
> The "user saved" values are useful for debugging purposes.  For example, 
> we had been using the 0th field for a timestamp so it's easy to manually 
> validate when the last page table upload succeeded, and the other field 
> for a count of the number of page table entries uploaded so far, but these 
> could be used for other checkpointing or checksumming by userland 
> processes.
> 
> Also, while we had not yet implemented a mechanism to retrieve the 
> per-chunk hit counters, this would be valuable to have for a userland 
> process to decide which chunks of the page table are "hot" for a 
> sparsely-populated situation.
> 
> 4) num_discard_requests, num_flush_requests, and iterate_devices
> 
> I have a slightly updated version of driver that implements these DM 
> target features as well.  I was actually preparing to submit the changes 
> to this list when this conversation began, and will be doing so shortly.
> 
> --
> Jim Ramsay

Mikulas

* Re: [PATCH] reworked dm-switch target
  2012-08-17 14:18   ` Jim_Ramsay
  2012-08-20 19:20     ` Mikulas Patocka
@ 2012-08-20 20:48     ` Alasdair G Kergon
From: Alasdair G Kergon @ 2012-08-20 20:48 UTC (permalink / raw)
  To: Jim_Ramsay; +Cc: dm-devel, mpatocka, Jason_Shamberger, Kevin_OKelley

On Fri, Aug 17, 2012 at 02:18:15PM +0000, Jim_Ramsay@dell.com wrote:
>   dmsetup message switch 0 set-table 0-0:1 1-1:0 2-2:2 3-3:1 4-4:0 5-5:2 6-6:0 7-8:1 9-15:2 16-16:1 ... (plus almost 1000000 more arguments...)

Or 0,3,16:1 1,4:0 2,5:2 6:0 -8:1 -15:2
with three short-hands:
0-0 -> 0
6:0 7-8:1  -> 6:0 -8:1  (a missing start of range continues from the last one)
0:0 3:0 -> 0,3:0 (list)

> an additional mechanism to upload larger chunks
> of binary data all at once would be important for our use of the device. 

The message mechanism could probably be extended to accept blobs of binary data
if parsing so many numbers turns out to be too inefficient.  Now we have an
example to work with, we can check the speed.

Alasdair

* Re: [PATCH] reworked dm-switch target
  2012-08-20 19:20     ` Mikulas Patocka
@ 2012-08-21 16:33       ` Jim Ramsay
  2012-08-21 18:14         ` Alasdair G Kergon
                           ` (3 more replies)
From: Jim Ramsay @ 2012-08-21 16:33 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley

On Mon, Aug 20, 2012 at 03:20:42PM -0400, Mikulas Patocka wrote:
> On Fri, 17 Aug 2012, Jim_Ramsay@DELL.com wrote:
> > 1) Uploading large page tables
<snip>
> > Assuming a fairly well-distributed layout of 1572864 pages where 50% of 
> > the pages are different every other page, 20% are different every 2 pages, 
> > 10% every 5 pages, 10% every 10 pages, and 10% every 20 pages, this would 
> > leave us with a dmsetup message with argc=998768
> > 
> >   dmsetup message switch 0 set-table 0-0:1 1-1:0 2-2:2 3-3:1 4-4:0 5-5:2 6-6:0 7-8:1 9-15:2 16-16:1 ... (plus almost 1000000 more arguments...)
> 
> You don't have to use the dash, you can send:
> dmsetup message switch 0 set-table 0:1 1:0 2:2 3:1 4:0 ... etc.
> 
> You don't have to send the whole table at once in one message. Using 
> message with 998768 arguments is bad (it can trigger allocation failures 
> in the kernel).
> 
> But you can split the initial table load into several messages, each 
> having up to 4096 bytes, so that it fits into a single page.

Even removing the '-' for single-page sets, you're looking at having to
send a minimum of 4 bytes per page (and as the page index grows, it takes
more bytes to represent each page), which means that each 4096-byte run
would hold maybe 1000 page table entries at most.

This would mean that to upload an entire page table for my example
volume, we would have to run 'dmsetup message ...' almost 1000 times.

I'm sure we can come up with other syntactical shortcuts like those
Alasdair came up with, but encoding into any ascii format will always be
less space-efficient than a pure binary transfer.

> > Perhaps we can work with you on designing alternate non-netlink mechanism 
> > to achieve the same goal... A sysfs file per DM device for userland 
> > processes to do direct I/O with?  Base64-encoding larger chunks of the 
> > binary page tables and passing those values through 'dmsetup message'?
> 
> As I said, you don't have to upload the whole table with one message ... 
> or if you really need to update the whole table at once, explain why.

At the very least, we would need to update the whole page table in the
following scenarios:

  1) When we first learn the geometry of the volume

  2) When the volume layout changes significantly (for example, if it was
     previously represented by 2 devices and is then later moved onto 3
     devices, or the underlying LUN is resized)

  3) When the protocol used to fetch the data can fetch segments of the
     page table in a dense binary format, it is considerably more work
     for a userland process to keep its own persistent copy of the
     page table, compare a new version with the old version, calculate
     the differences, and send only those differences.  It is much
     simpler to have a binary conduit to upload the entire table at
     once, provided it does not occur too frequently.

Furthermore, if a userland process already has an internal binary
representation of a page map, what is the value in converting this into
a complicated human-readable ascii representation and then having the
kernel do the opposite conversion when it receives the data?

> > 2) vmalloc and TLB performance
<snip>

> The original code uses a simple kmalloc to allocate the whole table.
> 
> The maximum size allocatable with kmalloc is 4MB.
> 
> The minimum vmalloc arena is 128MB (on x86) - so the switch from kmalloc 
> to vmalloc makes it no worse.
> 
> > On SMP systems, the page table changes required by
> > vmalloc() allocations can require expensive cross-processor interrupts on
> > all CPUs.
> 
> vmalloc is used only once when the target is loaded, so performance is not 
> an issue here.

The table would also have to be reallocated on LUN resize, or if the data
is moved across a different number of devices such that the number of
bits per page changes, for example going from a 2-device setup represented
by 1 bit per page to a 3-device setup represented by 2 bits per page.

Granted these are not frequent operations, but we need to continue to
properly handle these cases.

We also need to keep the multiple device scenario in mind (perhaps 100s of
targets in use or being created simultaneously).

> > And, on all systems, use of space in the vmalloc() range
> > increases pressure on the translation lookaside buffer (TLB), reducing the
> > performance of the system."
> > 
> > The page table lookup is in the I/O path, so performance is an important 
> > consideration.  Do you have any performance comparisons between our 
> > existing 2-level lookup of kmalloc'd memory versus a single vmalloc'd 
> 
> There was just 1-level lookup in the original dm-switch patch. Did you add 
> 2-level lookup recently?

In October 2011 I posted a 'v3' version of our driver that did this
2-stage lookup to the dm-devel list:

http://www.redhat.com/archives/dm-devel/2011-October/msg00109.html

The main consideration was to avoid single large kmalloc allocations,
but to also support sparse allocations in the future.
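
For readers who haven't seen the v3 patch, the general shape of such a
2-level, sparsely-populated table is sketched below (the names, block
size and one-byte-per-entry layout are hypothetical; the patches in this
thread pack several entries per word instead):

/* hypothetical block size; the second level is allocated per-block */
#define ENTRIES_PER_BLOCK 1024

struct pt2 {
	unsigned long n_blocks;	/* DIV_ROUND_UP(n_pages, ENTRIES_PER_BLOCK) */
	unsigned char **blocks;	/* blocks[i] unset => unpopulated region */
};

/* Returns 0 and fills *device, or -1 if the page falls in a sparse hole. */
static int pt2_lookup(const struct pt2 *t, unsigned long page,
		      unsigned *device)
{
	const unsigned char *blk = t->blocks[page / ENTRIES_PER_BLOCK];

	if (!blk)
		return -1;
	*device = blk[page % ENTRIES_PER_BLOCK];
	return 0;
}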

> > memory lookup?  Multiple devices of similarly large table size may be in 
> > use simultaneously, so this needs consideration as well.
> > 
> > Also, in the example above with 1572864 page table entries, assuming 2 
> > bits per entry requires a table of 384KB.  Would this be a problem for the 
> > vmalloc system, especially on 32-bit systems, if there are multiple 
> > devices of similarly large size in use at the same time?
> 
> 384KB is not a problem, the whole vmalloc space has 128MB.

This means we could allow ~375 similarly-sized devices in the system,
assuming no other kernel objects are consuming any vmalloc space.  This
could be okay, provided our performance considerations are also
addressed, but allowing sparse allocation may be a good enough reason
to use a 2-level allocation scheme.

> > It can also be desirable to allow sparsely-populated page tables, when it 
> > is known that large chunks are not needed or deemed (by external logic) 
> > not important enough to consume kernel memory.  A 2-level kmalloc'd memory 
> > scheme can save memory in sparsely-allocated situations.

This ability to do sparse allocations may be important depending on what
else is going on in the kernel and using vmalloc space.

Thanks for your comments, and I do hope to send our 'v4' driver code as
well as a demonstration application with the netlink socket interface to
this list in the very near future.

-- 
Jim Ramsay

* Re: [PATCH] reworked dm-switch target
  2012-08-21 16:33       ` Jim Ramsay
@ 2012-08-21 18:14         ` Alasdair G Kergon
  2012-08-22  1:02         ` Mikulas Patocka
                           ` (2 subsequent siblings)
From: Alasdair G Kergon @ 2012-08-21 18:14 UTC (permalink / raw)
  To: device-mapper development
  Cc: Mikulas Patocka, Jason_Shamberger, Kevin_OKelley

Well my order of preference for the interface at the moment would be:
  ioctl: ASCII dm messages
  ioctl: binary dm messages   (i.e. the message content is treated as a binary blob)
  mmap: lookups in shared memory (lockless preferably)
  netlink

Alasdair

* Re: [PATCH] reworked dm-switch target
  2012-08-21 16:33       ` Jim Ramsay
  2012-08-21 18:14         ` Alasdair G Kergon
@ 2012-08-22  1:02         ` Mikulas Patocka
  2012-08-24 18:14           ` Jim Ramsay
  2012-08-22  1:03         ` Mikulas Patocka
  2012-08-22  1:04         ` Mikulas Patocka
From: Mikulas Patocka @ 2012-08-22  1:02 UTC (permalink / raw)
  To: Jim Ramsay; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley



On Tue, 21 Aug 2012, Jim Ramsay wrote:

> On Mon, Aug 20, 2012 at 03:20:42PM -0400, Mikulas Patocka wrote:
> > On Fri, 17 Aug 2012, Jim_Ramsay@DELL.com wrote:
> > > 1) Uploading large page tables
> <snip>
> > > Assuming a fairly well-distributed layout of 1572864 pages where 50% of 
> > > the pages are different every other page, 20% are different every 2 pages, 
> > > 10% every 5 pages, 10% every 10 pages, and 10% every 20 pages, this would 
> > > leave us with a dmsetup message with argc=998768
> > > 
> > >   dmsetup message switch 0 set-table 0-0:1 1-1:0 2-2:2 3-3:1 4-4:0 5-5:2 6-6:0 7-8:1 9-15:2 16-16:1 ... (plus almost 1000000 more arguments...)
> > 
> > You don't have to use the dash, you can send:
> > dmsetup message switch 0 set-table 0:1 1:0 2:2 3:1 4:0 ... etc.
> > 
> > You don't have to send the whole table at once in one message. Using 
> > message with 998768 arguments is bad (it can trigger allocation failures 
> > in the kernel).
> > 
> > But you can split the initial table load into several messages, each 
> > having up to 4096 bytes, so that it fits into a single page.
> 
> Even removing the '-' for single-page sets, you're looking at having to
> send 4 bytes minimum per page (and as the index of the page you're
> indexing increases significantly, it takes many more bytes to represent
> a page), which means that each 4096-byte run would have maybe 1000 page
> table entries in it at most.
> 
> This would mean that to upload an entire page table for my example
> volume, we would have to run 'dmsetup message ...' almost 1000 times.
> 
> I'm sure we can come up with other syntactical shortcuts like those
> Alasdair came up with, but encoding into any ascii format will always be
> less space-efficient than a pure binary transfer.

I converted the format to use hexadecimal numbers (they are faster to
produce and faster to parse) and added an option to omit the page number
(in that case, the previous page plus one is used) - and it takes 0.05s
to load a table with one million entries on a 2.3GHz Opteron.

The table is loaded with 67 dm message calls, each having 45000 bytes 
(the number 45000 was experimentally found to be near the optimum).

So I don't think there are performance problems with this.

I'll send you the program that updates the table with messages.
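
(For reference, a minimal sketch of what such a loader can look like,
using libdevmapper's target-message call; the device name and the mapping
are placeholders, and this is not the actual program:)

#include <stdio.h>
#include <libdevmapper.h>

#define BATCH_BYTES 45000	/* experimentally near-optimal, per above */

static int send_msg(const char *dev, const char *msg)
{
	struct dm_task *dmt = dm_task_create(DM_DEVICE_TARGET_MSG);
	int r = 0;

	if (!dmt)
		return 0;
	if (dm_task_set_name(dmt, dev) &&
	    dm_task_set_sector(dmt, 0) &&
	    dm_task_set_message(dmt, msg))
		r = dm_task_run(dmt);
	dm_task_destroy(dmt);
	return r;
}

int main(void)
{
	const char *dev = "switch";		/* placeholder device name */
	unsigned long page, n_pages = 1048576;	/* one million entries */
	char msg[BATCH_BYTES + 64];
	int len = sprintf(msg, "set-table"), first = 1;

	for (page = 0; page < n_pages; page++) {
		unsigned device = page % 3;	/* placeholder mapping */

		/*
		 * ":%x" uses the previous-page-plus-one shorthand; the first
		 * entry of each message needs an explicit page number because
		 * the kernel parser starts counting anew for every message.
		 */
		if (first)
			len += sprintf(msg + len, " %lx:%x", page, device);
		else
			len += sprintf(msg + len, " :%x", device);
		first = 0;
		if (len >= BATCH_BYTES) {
			if (!send_msg(dev, msg))
				return 1;
			len = sprintf(msg, "set-table");
			first = 1;
		}
	}
	return send_msg(dev, msg) ? 0 : 1;
}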

> > > Perhaps we can work with you on designing alternate non-netlink mechanism 
> > > to achieve the same goal... A sysfs file per DM device for userland 
> > > processes to do direct I/O with?  Base64-encoding larger chunks of the 
> > > binary page tables and passing those values through 'dmsetup message'?
> > 
> > As I said, you don't have to upload the whole table with one message ... 
> > or if you really need to update the whole table at once, explain why.
> 
> At the very least, we would need to update the whole page table in the
> following scenarios:
> 
>   1) When we first learn the geometry of the volume
> 
>   2) When the volume layout changes significantly (for example, if it was
>      previously represented by 2 devices and is then later moved onto 3
>      devices, or the underlying LUN is resized)
> 
>   3) When the protocol used to fetch the data can fetch segments of the
> >      page table in a dense binary format, it is considerably more work
> >      for a userland process to keep its own persistent copy of the
>      page table, compare a new version with the old version, calculate
>      the differences, and send only those differences.  It is much
>      simpler to have a binary conduit to upload the entire table at
>      once, provided it does not occur too frequently.

But you don't have to upload the table at once - you can upload the table 
incrementally with several dm messages.

> Furthermore, if a userland process already has an internal binary
> representation of a page map, what is the value in converting this into
> a complicated human-readable ascii representation then having the kernel
> do the opposite de-conversion when it receives the data?

The reason is simplicity - the dm message code is noticeably smaller than 
the netlink code. It is also less bug-prone because no structures are 
allocated or freed there.

> > > 2) vmalloc and TLB performance
> <snip>
> 
> > The original code uses a simple kmalloc to allocate the whole table.
> > 
> > The maximum size allocatable with kmalloc is 4MB.
> > 
> > The minimum vmalloc arena is 128MB (on x86) - so the switch from kmalloc 
> > to vmalloc makes it no worse.
> > 
> > > On SMP systems, the page table changes required by
> > > vmalloc() allocations can require expensive cross-processor interrupts on
> > > all CPUs.
> > 
> > vmalloc is used only once when the target is loaded, so performance is not 
> > an issue here.
> 
> The table would also have to be reallocated on LUN resize or if the data
> is moved to be across a different number of devices (provided the change
> is such that it causes the number of bits-per-page to be changed), such
> as if you had a 2-device setup represented by 1-bit-per-page change to a
> 3-device setup represented by 2-bit-per-page.
> 
> Granted these are not frequent operations, but we need to continue to
> properly handle these cases.
>
> We also need to keep the multiple device scenario in mind (perhaps 100s of
> targets in use or being created simultaneously).

For these operations (resizing the device or changing the number of
underlying devices), you can load a new table, then suspend and resume the
device (dmsetup reload, dmsetup suspend, dmsetup resume). It will switch
to the new table and destroy the old one.

You have to reload the table anyway if you change the device size, so there
is no need to include code to change the table size in the target driver.

> > > And, on all systems, use of space in the vmalloc() range
> > > increases pressure on the translation lookaside buffer (TLB), reducing the
> > > performance of the system."
> > > 
> > > The page table lookup is in the I/O path, so performance is an important 
> > > consideration.  Do you have any performance comparisons between our 
> > > existing 2-level lookup of kmalloc'd memory versus a single vmalloc'd 
> > 
> > There was just 1-level lookup in the original dm-switch patch. Did you add 
> > 2-level lookup recently?
> 
> In October 2011 I posted a 'v3' version of our driver to the dm-devel
> list that did this 2-stage lookup to the dm-devel list:
> 
> http://www.redhat.com/archives/dm-devel/2011-October/msg00109.html
> 
> The main consideration was to avoid single large kmalloc allocations,
> but to also support sparse allocations in the future.
> 
> > > memory lookup?  Multiple devices of similarly large table size may be in 
> > > use simultaneously, so this needs consideration as well.
> > > 
> > > Also, in the example above with 1572864 page table entries, assuming 2 
> > > bits per entry requires a table of 384KB.  Would this be a problem for the 
> > > vmalloc system, especially on 32-bit systems, if there are multiple 
> > > devices of similarly large size in use at the same time?
> > 
> > 384KB is not a problem, the whole vmalloc space has 128MB.
> 
> This means we could allow ~375 similarly-sized devices in the system,
> assuming no other kernel objects are consuming any vmalloc space.  This
> could be okay, provided our performance considerations are also
> addressed, but allowing sparse allocation may be a good enough reason
> to use a 2-level allocation scheme.
> 
> > > It can also be desirable to allow sparsely-populated page tables, when it 
> > > is known that large chunks are not needed or deemed (by external logic) 
> > > not important enough to consume kernel memory.  A 2-level kmalloc'd memory 
> > > scheme can save memory in sparsely-allocated situations.
> 
> This ability to do sparse allocations may be important depending on what
> else is going on in the kernel and using vmalloc space.

It may be possible to use a radix tree and do sparse allocations, but given
the current usage (tables with a million entries, each entry taking a few
bits), it doesn't seem to be a problem now.

> Thanks for your comments, and I do hope to send our 'v4' driver code as
> well as a demonstration application with the netlink socket interface to
> this list in the very near future.
> 
> -- 
> Jim Ramsay

Mikulas

* Re: [PATCH] reworked dm-switch target
  2012-08-21 16:33       ` Jim Ramsay
  2012-08-21 18:14         ` Alasdair G Kergon
  2012-08-22  1:02         ` Mikulas Patocka
@ 2012-08-22  1:03         ` Mikulas Patocka
  2012-08-22  1:04         ` Mikulas Patocka
From: Mikulas Patocka @ 2012-08-22  1:03 UTC (permalink / raw)
  To: Jim Ramsay; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley

This is a new version that uses hex numbers.

---

This is simplified dm-switch target, originally written by Jim Ramsay.

Changes from the original:

Removed the netlink interface and added a dm message interface to change
the mapping table, because the message interface is noticeably simpler.
The table is changed by sending a dm message:
"dmsetup message <device-name> 0 set-table <commands...>"
The message can have multiple commands; each command has the format
"<page>:<device index>" (sets the specified page) or ":<device index>"
(sets the previous page plus one to the specified index). <page> and
<device index> are in hexadecimal.
For example "dmsetup message switch 0 set-table 3:0 :2 :7 F:4"
sets page 3 to device 0, page 4 to device 2, page 5 to device 7, page 15
to device 4.

The dm-switch.h file was removed (with the netlink interface gone, there
is no need for this file).

The page table is allocated using vmalloc instead of kmalloc. kmalloc
allocates physically contiguous memory and can fail if memory is
fragmented; vmalloc allocates discontiguous pages and maps them to a
contiguous virtual address range using the MMU.

RCU and page table reallocation were removed. The page table is allocated
in the constructor and stays the same for the lifetime of the device.
Reads on the I/O path use ACCESS_ONCE and each update stores a whole word
at once, so the table can be read and modified concurrently without RCU.

The page table is initialized with a repetitive pattern that uses all
the devices.

One page table entry is 64 bits wide on 64-bit processors and 32 bits wide
on 32-bit processors (in the original it was always 32-bit). Making it
64-bit lets it pack entries slightly more densely in some cases: for
example, with five to eight devices each entry needs 3 bits, so a 64-bit
word holds 21 entries while two 32-bit words hold only 20.

Removed dm status:
- ios_remapped/ios_unmapped counting was removed because all the IOs are
  mapped when a statically allocated page table is used.
- Userspace-supplied numbers that are reported in the status were
  removed because it is not clear what they were used for.
- The device list with 'A' statuses was removed (it could be added back
  if we implement device error tracking); there was just mock code that
  returned 'A' for all devices.

The device limit check was simplified to use i_size_read and fixed to
take the 'start' value into account as well.

do_div was replaced with sector_div - if we have 32-bit sectors, we
don't need to do slow 64-bit math.

The divisions were optimized for the case where the divisor is a power
of two: a shift is used instead of a division.

dm_set_target_max_io_len is now called. The original code didn't set a
maximum I/O length, so it could issue I/Os that span page boundaries.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/Kconfig     |   11 +
 drivers/md/Makefile    |    1 
 drivers/md/dm-switch.c |  485 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 497 insertions(+)

Index: linux-3.5.2-fast/drivers/md/Kconfig
===================================================================
--- linux-3.5.2-fast.orig/drivers/md/Kconfig	2012-08-22 02:03:19.000000000 +0200
+++ linux-3.5.2-fast/drivers/md/Kconfig	2012-08-22 02:04:01.000000000 +0200
@@ -417,4 +417,15 @@ config DM_VERITY2
 
 source "drivers/md/enhanceio/Kconfig"
 
+config DM_SWITCH
+	tristate "Switch target support (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+	  Help text needs writing
+
+	  To compile this code as a module, choose M here: the module will
+	  be called dm-switch.
+
+	  If unsure, say N.
+
 endif # MD
Index: linux-3.5.2-fast/drivers/md/Makefile
===================================================================
--- linux-3.5.2-fast.orig/drivers/md/Makefile	2012-08-22 02:03:19.000000000 +0200
+++ linux-3.5.2-fast/drivers/md/Makefile	2012-08-22 02:04:01.000000000 +0200
@@ -48,6 +48,7 @@ obj-$(CONFIG_DM_THIN_PROVISIONING)	+= dm
 obj-$(CONFIG_DM_VERITY)		+= dm-verity.o
 obj-$(CONFIG_DM_ZEROED)		+= dm-zeroed.o
 obj-$(CONFIG_DM_ENHANCEIO)	+= enhanceio/
+obj-$(CONFIG_DM_SWITCH)		+= dm-switch.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
Index: linux-3.5.2-fast/drivers/md/dm-switch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.5.2-fast/drivers/md/dm-switch.c	2012-08-22 03:00:50.000000000 +0200
@@ -0,0 +1,485 @@
+/*
+ * Copyright (c) 2010-2011 by Dell, Inc.  All rights reserved.
+ *
+ * This file is released under the GPL.
+ *
+ * Description:
+ *
+ *     file:    dm-switch.c
+ *     authors: Kevin_OKelley@dell.com
+ *              Jim_Ramsay@dell.com
+ *              Narendran_Ganapathy@dell.com
+ *		mpatocka@redhat.com
+ *
+ * This file implements a "switch" target which efficiently implements a
+ * mapping of IOs to underlying block devices in scenarios where there are:
+ *   (1) a large number of address regions
+ *   (2) a fixed size equal across all address regions
+ *   (3) no pattern that allows for a compact description with something like
+ *       the dm-stripe target.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/device-mapper.h>
+#include <linux/vmalloc.h>
+
+#define DM_MSG_PREFIX "switch"
+
+/*
+ * Switch device context block: A new one is created for each dm device.
+ * Contains an array of devices from which we have taken references.
+ */
+struct switch_dev {
+	struct dm_dev *dmdev;
+	sector_t start;
+};
+
+typedef unsigned long pt_entry;
+
+/* Switch context header */
+struct switch_ctx {
+	unsigned dev_count;		/* Number of devices */
+	unsigned page_size;		/* Page size in 512B sectors */
+	unsigned long n_pages;		/* Number of pages */
+	signed char page_size_bits;	/* log2 of page_size or -1 */
+
+	unsigned char pte_size;		/* Page table entry size in bits */
+	unsigned char pte_fields;	/* Number of entries per pt_entry */
+	signed char pte_fields_bits;	/* log2 of pte_fields or -1 */
+	pt_entry *page_table;		/* Page table */
+
+	/* Array of dm devices to switch between */
+	struct switch_dev dev_list[0];
+};
+
+static inline void switch_get_position(struct switch_ctx *pctx,
+				       unsigned long page,
+				       unsigned long *index,
+				       unsigned *bit)
+
+{
+	if (pctx->pte_fields_bits >= 0) {
+		*index = page >> pctx->pte_fields_bits;
+		*bit = page & (pctx->pte_fields - 1);
+	} else {
+		*index = page / pctx->pte_fields;
+		*bit = page % pctx->pte_fields;
+	}
+	*bit *= pctx->pte_size;
+
+}
+
+static void switch_page_table_write(struct switch_ctx *pctx, unsigned long page,
+				    unsigned value)
+{
+	unsigned long index;
+	unsigned bit;
+	pt_entry pte;
+
+	switch_get_position(pctx, page, &index, &bit);
+
+	pte = pctx->page_table[index];
+	pte &= ~((((pt_entry)1 << pctx->pte_size) - 1) << bit);
+	pte |= (pt_entry)value << bit;
+	pctx->page_table[index] = pte;
+}
+
+/*
+ * Constructor: Called each time a dmsetup command creates a dm device.  The
+ * target parameter will already have the table, type, begin and len fields
+ * filled in.  Arguments are in pairs: <dev_path> <offset>.  Therefore, we get
+ * multiple constructor calls, but we will need to build a list of switch_ctx
+ * blocks so that the page table information gets matched to the correct
+ * device.
+ */
+static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	unsigned a;
+	int n;
+	int r;
+	unsigned dev_count;
+	struct switch_ctx *pctx;
+	sector_t dev_size;
+	unsigned long e;
+
+	if (argc < 4) {
+		ti->error = "Insufficient arguments";
+		r = -EINVAL;
+		goto error;
+	}
+	if (kstrtouint(argv[0], 10, &dev_count) ||
+	    !dev_count ||
+	    dev_count > (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_dev)) {
+		ti->error = "Invalid device count";
+		r = -EINVAL;
+		goto error;
+	}
+	if (dev_count != (argc - 2) / 2) {
+		ti->error = "Invalid argument count";
+		r = -EINVAL;
+		goto error;
+	}
+	pctx = kmalloc(sizeof(struct switch_ctx) + (dev_count * sizeof(struct switch_dev)),
+		       GFP_KERNEL);
+	if (!pctx) {
+		ti->error = "Cannot allocate redirect context";
+		r = -ENOMEM;
+		goto error;
+	}
+	pctx->dev_count = dev_count;
+	if (kstrtouint(argv[1], 10, &pctx->page_size) ||
+	    !pctx->page_size) {
+		ti->error = "Invalid page size";
+		r = -EINVAL;
+		goto error_kfree;
+	}
+
+	if (!(pctx->page_size & (pctx->page_size - 1)))
+		pctx->page_size_bits = __ffs(pctx->page_size);
+	else
+		pctx->page_size_bits = -1;
+
+	pctx->pte_size = 1;
+	while (pctx->pte_size < sizeof(pt_entry) * 8 &&
+	       (pt_entry)1 << pctx->pte_size < pctx->dev_count)
+		pctx->pte_size++;
+
+	pctx->pte_fields = (sizeof(pt_entry) * 8) / pctx->pte_size;
+	if (!(pctx->pte_fields & (pctx->pte_fields - 1)))
+		pctx->pte_fields_bits = __ffs(pctx->pte_fields);
+	else
+		pctx->pte_fields_bits = -1;
+
+	dev_size = ti->len;
+	if (sector_div(dev_size, pctx->page_size))
+		dev_size++;
+
+	pctx->n_pages = dev_size;
+	if (pctx->n_pages != dev_size || pctx->n_pages >= ULONG_MAX) {
+		ti->error = "Too long page table";
+		r = -EINVAL;
+		goto error_kfree;
+	}
+
+	if (sector_div(dev_size, pctx->pte_fields))
+		dev_size++;
+
+	if (dev_size > ULONG_MAX / sizeof(pt_entry)) {
+		ti->error = "Too long page table";
+		r = -EINVAL;
+		goto error_kfree;
+	}
+
+	r = dm_set_target_max_io_len(ti, pctx->page_size);
+	if (r)
+		goto error_kfree;
+
+	pctx->page_table = vmalloc(dev_size * sizeof(pt_entry));
+	if (!pctx->page_table) {
+		ti->error = "Cannot allocate page table";
+		r = -ENOMEM;
+		goto error_kfree;
+	}
+
+	a = 0;
+	for (e = 0; e < pctx->n_pages; e++) {
+		switch_page_table_write(pctx, e, a);
+		a++;
+		if (a >= pctx->dev_count)
+			a = 0;
+	}
+
+	/*
+	 * Check each device beneath the target to ensure that the limits are
+	 * consistent.
+	 */
+	for (n = 0, a = 2; n < pctx->dev_count; n++, a += 2) {
+		struct dm_dev *dm;
+		sector_t dev_size;
+		unsigned long long start;
+
+		if (kstrtoull(argv[a + 1], 10, &start) ||
+		    start != (sector_t)start) {
+			ti->error = "Invalid device starting offset";
+			r = -EINVAL;
+			n--;
+			goto error_release_n;
+		}
+		r = dm_get_device
+		    (ti, argv[a], dm_table_get_mode(ti->table), &dm);
+		if (r) {
+			ti->error = "Device lookup failed";
+			n--;
+			goto error_release_n;
+		}
+		pctx->dev_list[n].dmdev = dm;
+		pctx->dev_list[n].start = start;
+
+		dev_size = i_size_read(dm->bdev->bd_inode) >> SECTOR_SHIFT;
+
+		if (ti->len > start + dev_size) {
+			ti->error = "Device is too small";
+			r = -EINVAL;
+			goto error_release_n;
+		}
+	}
+
+	ti->private = pctx;
+
+	return 0;
+
+error_release_n:		/* De-reference all devices  */
+	for (; n >= 0; n--)
+		dm_put_device(ti, pctx->dev_list[n].dmdev);
+
+	vfree(pctx->page_table);
+error_kfree:
+	kfree(pctx);
+
+error:
+	return r;
+}
+
+/*
+ * Destructor: Don't free the dm_target, just the ti->private data (if any).
+ */
+static void switch_dtr(struct dm_target *ti)
+{
+	int n;
+	struct switch_ctx *pctx = ti->private;
+
+	for (n = 0; n < pctx->dev_count; n++)
+		dm_put_device(ti, pctx->dev_list[n].dmdev);
+
+	vfree(pctx->page_table);
+	kfree(pctx);
+}
+
+static int switch_map(struct dm_target *ti, struct bio *bio,
+		      union map_info *map_context)
+{
+	struct switch_ctx *pctx = ti->private;
+
+	sector_t offset = bio->bi_sector - ti->begin;
+	sector_t p;
+	unsigned long index;
+	unsigned bit, idev;
+
+	p = offset;
+	if (pctx->page_size_bits >= 0)
+		p >>= pctx->page_size_bits;
+	else
+		sector_div(p, pctx->page_size);
+
+	switch_get_position(pctx, p, &index, &bit);
+
+	idev = (ACCESS_ONCE(pctx->page_table[index]) >> bit) & ((1 << pctx->pte_size) - 1);
+	/* This can only happen if the processor uses non-atomic stores. */
+	if (unlikely(idev >= pctx->dev_count))
+		idev = 0;
+
+	bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;
+	bio->bi_sector = pctx->dev_list[idev].start + offset;
+
+	return DM_MAPIO_REMAPPED;
+}
+
+/*
+ * We need to parse hex numbers as fast as possible.
+ * The dm message interface is used to load the whole table.
+ *
+ * This table-based hex parser improves performance.
+ * It improves the time to load 1000000 entries compared to the condition-based
+ * parser.
+ *		table-based parser	condition-based parser
+ * PA-RISC	0.29s			0.31s
+ * Opteron	0.0495s			0.0498s
+ */
+
+static const unsigned char hex_table[256] = {
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+0,1,2,3,4,5,6,7,8,9,255,255,255,255,255,255,
+255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255
+};
+
+static inline void parse_hex(const char *string, sector_t *result, const char **end)
+{
+	unsigned char d;
+	sector_t r = 0;
+#if 1
+	while ((d = hex_table[(unsigned char)*string]) < 16) {
+		r = (r << 4) | d;
+		string++;
+	}
+#else
+	while (1) {
+		d = *string;
+		if (d >= '0' && d <= '9')
+			d -= '0';
+		else if (d >= 'A' && d <= 'F')
+			d -= 'A' - 10;
+		else if (d >= 'a' && d <= 'f')
+			d -= 'a' - 10;
+		else
+			break;
+		r = (r << 4) | d;
+		string++;
+	}
+#endif
+	*end = string;
+	*result = r;
+}
+
+static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	static DEFINE_MUTEX(message_mutex);
+
+	struct switch_ctx *pctx = ti->private;
+	int r;
+
+	mutex_lock(&message_mutex);
+
+	if (!argc) {
+		goto invalid_message;
+	} else if (!strcasecmp(argv[0], "set-table")) {
+		unsigned i;
+		sector_t table_index = 0;
+		for (i = 1; i < argc; i++) {
+			sector_t device;
+			const char *string = argv[i];
+			if (*string == ':')
+				table_index++;
+			else {
+				parse_hex(string, &table_index, &string);
+				if (unlikely(*string != ':')) {
+invalid_table:
+					DMWARN("invalid set-table argument");
+					r = -EINVAL;
+					goto ret;
+				}
+			}
+			string++;
+			if (unlikely(!*string))
+				goto invalid_table;
+			parse_hex(string, &device, &string);
+			if (unlikely(*string))
+				goto invalid_table;
+			if (unlikely(table_index >= pctx->n_pages)) {
+				DMWARN("invalid set-table page");
+				r = -EINVAL;
+				goto ret;
+			}
+			if (unlikely(device >= pctx->dev_count)) {
+				DMWARN("invalid set-table device");
+				r = -EINVAL;
+				goto ret;
+			}
+			switch_page_table_write(pctx, table_index, device);
+		}
+		r = 0;
+	} else {
+invalid_message:
+		DMWARN("unrecognised message received.");
+		r = -EINVAL;
+	}
+ret:
+	mutex_unlock(&message_mutex);
+	return r;
+}
+
+static int switch_status(struct dm_target *ti, status_type_t type,
+			 unsigned status_flags, char *result, unsigned maxlen)
+{
+	struct switch_ctx *pctx = ti->private;
+	unsigned sz = 0;
+	int n;
+
+	result[0] = '\0';
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = 0;
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%u %u", pctx->dev_count, pctx->page_size);
+		for (n = 0; n < pctx->dev_count; n++) {
+			DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,
+			       (unsigned long long)pctx->dev_list[n].start);
+		}
+		break;
+
+	default:
+		return 0;
+	}
+	return 0;
+}
+
+/*
+ * Switch ioctl:
+ *
+ * Pass through all ioctls to the first path.
+ */
+static int switch_ioctl(struct dm_target *ti, unsigned cmd,
+			unsigned long arg)
+{
+	struct switch_ctx *pctx = ti->private;
+	struct block_device *bdev;
+	fmode_t mode;
+
+	bdev = pctx->dev_list[0].dmdev->bdev;
+	mode = pctx->dev_list[0].dmdev->mode;
+
+	return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
+}
+
+static struct target_type switch_target = {
+	.name = "switch",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr = switch_ctr,
+	.dtr = switch_dtr,
+	.map = switch_map,
+	.message = switch_message,
+	.status = switch_status,
+	.ioctl = switch_ioctl,
+};
+
+int __init dm_switch_init(void)
+{
+	int r;
+
+	r = dm_register_target(&switch_target);
+	if (r) {
+		DMERR("dm_register_target() failed %d", r);
+		return r;
+	}
+
+	return 0;
+}
+
+void dm_switch_exit(void)
+{
+	dm_unregister_target(&switch_target);
+}
+
+module_init(dm_switch_init);
+module_exit(dm_switch_exit);
+
+MODULE_DESCRIPTION(DM_NAME " fixed-size address-region-mapping throughput-oriented path selector");
+MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
+MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
+MODULE_LICENSE("GPL");


* Re: [PATCH] reworked dm-switch target
  2012-08-21 16:33       ` Jim Ramsay
                           ` (2 preceding siblings ...)
  2012-08-22  1:03         ` Mikulas Patocka
@ 2012-08-22  1:04         ` Mikulas Patocka
       [not found]           ` <20120822192547.GJ32571@agk-dp.fab.redhat.com>
  3 siblings, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2012-08-22  1:04 UTC (permalink / raw)
  To: Jim Ramsay; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley

/* a sample program that makes a table of a million entries and uploads it
via dm message */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>
#include <libdevmapper.h>

#define N_DEVICES		3
#define TABLE_SIZE		1000000

#define MSG_SIZE_LIMIT		45000
#define MSG_SIZE_RESERVED	20

static int table[TABLE_SIZE];
static char buffer[MSG_SIZE_LIMIT];

/* append 'hex' to *s as uppercase hexadecimal, without leading zeros */
static void print_hex(char **s, unsigned hex)
{
	char *p = *s;
	int b;
	/* find the shift amount of the most significant non-zero nibble */
	for (b = 0; hex >> 4 >> b; b += 4) ;
	do {
		char c = (hex >> b) & 0xf;
		if (c <= 9) c += '0';
		else c += 'A' - 10;
		*p++ = c;
	} while ((b -= 4) >= 0);
	*s = p;
}

int main(void)
{
	int i, index;
	int val;

	/* fill the table with a rotating 0..N_DEVICES-1 pattern, starting at a time-seeded offset */
	val = time(NULL) % N_DEVICES;
	for (i = 0; i < TABLE_SIZE; i++) {
		table[i] = val;
		if (++val >= N_DEVICES) val = 0;
	}

	index = 0;
	/* upload the table in several messages, each at most MSG_SIZE_LIMIT bytes */
	while (index < TABLE_SIZE) {
		struct dm_task *dmt;
		char *ptr;
		strcpy(buffer, "set-table ");
		ptr = strchr(buffer, 0);
		print_hex(&ptr, index);
		/* append ":<device>" entries; the kernel advances the page index by one for each */
		while (index < TABLE_SIZE && ptr - buffer < MSG_SIZE_LIMIT - MSG_SIZE_RESERVED) {
			*ptr++ = ':';
			print_hex(&ptr, table[index]);
			*ptr++ = ' ';
			index++;
		}
		if (ptr - buffer > MSG_SIZE_LIMIT) {
			fprintf(stderr, "buffer overflow\n");
			exit(1);
		}
		ptr[-1] = 0;	/* replace the trailing space with a terminator */

		dmt = dm_task_create(DM_DEVICE_TARGET_MSG);
		if (!dmt) fprintf(stderr, "dm_task_create failed\n"), exit(1);
		if (!dm_task_set_name(dmt, "switch")) fprintf(stderr, "dm_task_set_name failed\n"), exit(1);
		if (!dm_task_set_sector(dmt, 0)) fprintf(stderr, "dm_task_set_sector failed\n"), exit(1);

		if (!dm_task_set_message(dmt, buffer)) fprintf(stderr, "dm_task_set_message failed\n"), exit(1);

		if (!dm_task_run(dmt)) fprintf(stderr, "dm_task_run failed\n"), exit(1);

		dm_task_destroy(dmt);
	}

	return 0;
}
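
(The program links against libdevmapper; it can be built with something
like "cc -O2 -o set-table set-table.c -ldevmapper" - the file names here
are arbitrary.)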


* [PATCH] check sector number in dmsetup message
       [not found]                 ` <Pine.LNX.4.64.1208222122180.7091@file.rdu.redhat.com>
@ 2012-08-23  1:35                   ` Mikulas Patocka
  0 siblings, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2012-08-23  1:35 UTC (permalink / raw)
  To: lvm-devel

atoll doesn't check for errors, so invalid sector numbers were silently
accepted in the "dmsetup message" command.
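
A minimal standalone illustration of the difference (not part of the
patch itself):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	char *end;
	long long a = atoll("123abc");	/* 123 - the trailing garbage is silently ignored */
	unsigned long long s = strtoull("123abc", &end, 10);

	printf("atoll: %lld\n", a);
	printf("strtoull: %llu, unparsed: \"%s\"\n", s, end);	/* unparsed: "abc" */
	return 0;
}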

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 tools/dmsetup.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: LVM2.2.02.96/tools/dmsetup.c
===================================================================
--- LVM2.2.02.96.orig/tools/dmsetup.c	2012-08-23 03:27:51.000000000 +0200
+++ LVM2.2.02.96/tools/dmsetup.c	2012-08-23 03:29:48.000000000 +0200
@@ -769,6 +769,8 @@ static int _message(CMD_ARGS)
 	size_t sz = 1;
 	struct dm_task *dmt;
 	char *str;
+	uint64_t sector;
+	char *endptr;
 
 	if (!(dmt = dm_task_create(DM_DEVICE_TARGET_MSG)))
 		return 0;
@@ -783,7 +785,12 @@ static int _message(CMD_ARGS)
 		argv++;
 	}
 
-	if (!dm_task_set_sector(dmt, (uint64_t) atoll(argv[1])))
+	sector = strtoull(argv[1], &endptr, 10);
+	if (*endptr || endptr == argv[1]) {
+		err("invalid sector");
+		goto out;
+	}
+	if (!dm_task_set_sector(dmt, sector))
 		goto out;
 
 	argc -= 2;




* Re: [PATCH] reworked dm-switch target
  2012-08-22  1:02         ` Mikulas Patocka
@ 2012-08-24 18:14           ` Jim Ramsay
  2012-08-30  0:51             ` Mikulas Patocka
  0 siblings, 1 reply; 18+ messages in thread
From: Jim Ramsay @ 2012-08-24 18:14 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley

On Tue, Aug 21, 2012 at 09:02:35PM -0400, Mikulas Patocka wrote:
> On Tue, 21 Aug 2012, Jim Ramsay wrote:
> > On Mon, Aug 20, 2012 at 03:20:42PM -0400, Mikulas Patocka wrote:
> > > On Fri, 17 Aug 2012, Jim_Ramsay@DELL.com wrote:
> > > > 1) Uploading large page tables
<snip>
> I converted the format to use hexadecimal numbers (they are faster to 
> produce and faster to parse) and made an option to omit the page number 
> (in this case, the previous page plus one is used) - and it takes 0.05s 
> to load a table with one million entries on a 2.3GHz Opteron.
> 
> The table is loaded with 67 dm message calls, each having 45000 bytes 
> (the number 45000 was experimentally found to be near the optimum).
> 
> So I don't think there are performance problems with this.
> 
> I'll send you the program that updates the table with messages.

Thanks for this change, and for doing that performance test.

We are interested in the relative performance of the dm message interface
versus the netlink interface when uploading a million page table entries.
If this dm message-based format is close enough, it would certainly be an
acceptable replacement for the netlink mechanism.
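
For reference, a table-load message in this hexadecimal format might look
like the following (an invented example; page indices and device numbers
are in hex, and a bare ":<device>" entry means the previous page plus one):

  dmsetup message switch 0 set-table 0:0 :1 :2 3e8:1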

> > > > Perhaps we can work with you on designing an alternate non-netlink mechanism 
> > > > to achieve the same goal... A sysfs file per DM device for userland 
> > > > processes to do direct I/O with?  Base64-encoding larger chunks of the 
> > > > binary page tables and passing those values through 'dmsetup message'?
> > > 
> > > As I said, you don't have to upload the whole table with one message ... 
> > > or if you really need to update the whole table at once, explain why.
> > 
> > At the very least, we would need to update the whole page table in the
> > following scenarios:
> > 
> >   1) When we first learn the geometry of the volume
> > 
> >   2) When the volume layout changes significantly (for example, if it was
> >      previously represented by 2 devices and is then later moved onto 3
> >      devices, or the underlying LUN is resized)
> > 
> >   3) When the protocol used to fetch the data can fetch segments of the
> >      page table in a dense binary format, it is considerably more work
> >      for a userland process to keep its own persistent copy of the
> >      page table, compare a new version with the old version, calculate
> >      the differences, and send only those differences.  It is much
> >      simpler to have a binary conduit to upload the entire table at
> >      once, provided it does not occur too frequently.
> 
> But you don't have to upload the table at once - you can upload the table 
> incrementally with several dm messages.

By "all at once" I was talking about the scenarios where you need to push
all 1000000 entries to the kernel driver.  The netlink implementation
also sends the data in chunks.

The question of whether we should do this in one message or multiple
messages (and how many and how large they are) is better answered by
checking relative performance between this dm message code and our
existing netlink code.

> > Furthermore, if a userland process already has an internal binary
> > representation of a page map, what is the value in converting this into
> > a complicated human-readable ASCII representation and then having the
> > kernel undo that conversion when it receives the data?
> 
> The reason is simplicity - the dm message code is noticeably smaller than 
> the netlink code. It is also less bug-prone because no structures are 
> allocated or freed there.

I do like the simplicity of the dm message interface, but the cost of
that simplicity is that it is not well suited for sending large amounts
of packed binary data.  It is also great for crafting test data by hand,
but it is more complicated for userland programs, which now need to
convert binary data into ASCII before sending it.

I think though that as long as the cost of uploading the whole page
table from start to finish isn't too great, a dm message-based mechanism
would be acceptable.

> > > > 2) vmalloc and TLB performance
<snip>
> > The table would also have to be reallocated on LUN resize or if the data
> > is moved across a different number of devices (provided the change
> > causes the number of bits per page to change), such as a change from a
> > 2-device setup represented by 1 bit per page to a 3-device setup
> > represented by 2 bits per page.
> > 
> > Granted these are not frequent operations, but we need to continue to
> > properly handle these cases.
> >
> > We also need to keep the multiple device scenario in mind (perhaps 100s of
> > targets in use or being created simultaneously).
> 
> For these operations (resizing the device or changing the number of 
> underlying devices), you can load a new table, suspend the device and 
> resume the device. It will switch to the new table and destroy the old 
> one.
> 
> You have to reload the table anyway if you change the device size, so there
> is no need to include code to change the table size in the target driver.

Good points.
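
For concreteness, that reload sequence would look something like this
(the device name and table line are invented for illustration):

  dmsetup load switch --table "0 2097152 switch 3 3 /dev/ram0 0 /dev/ram1 0 /dev/ram2 0"
  dmsetup suspend switch
  dmsetup resume switch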

> > > > And, on all systems, use of space in the vmalloc() range
> > > > increases pressure on the translation lookaside buffer (TLB), reducing the
> > > > performance of the system."
> > > > 
> > > > The page table lookup is in the I/O path, so performance is an important 
> > > > consideration.  Do you have any performance comparisons between our 
> > > > existing 2-level lookup of kmalloc'd memory versus a single vmalloc'd 

Besides the performance consideration of uploading a large page table to
the device, the actual I/O performance is another important
consideration we have not yet addressed.

I think a side-by-side comparison of I/O performance would be useful to
see, comparing the vmalloc single table versus the kmalloc 2-step
lookup.  We are curious to see if there is any impact from doing lookups
across a large vmalloc'd area on multiple disks simultaneously.

> > > > Also, in the example above with 1572864 page table entries, assuming 2 
> > > > bits per entry requires a table of 384KB.  Would this be a problem for the 
> > > > vmalloc system, especially on 32-bit systems, if there are multiple 
> > > > devices of similarly large size in use at the same time?
> > > 
> > > 384KB is not a problem, the whole vmalloc space has 128MB.
> > 
> > This means we could allow ~340 similarly-sized devices in the system,
> > assuming no other kernel objects are consuming any vmalloc space.  This
> > could be okay, provided our performance considerations are also
> > addressed, but allowing sparse allocation may be a good enough reason
> > to use a 2-level allocation scheme.
> > 
> > > > It can also be desirable to allow sparsely-populated page tables, when it 
> > > > is known that large chunks are not needed or deemed (by external logic) 
> > > > not important enough to consume kernel memory.  A 2-level kmalloc'd memory 
> > > > scheme can save memory in sparsely-allocated situations.
> > 
> > This ability to do sparse allocations may be important depending on what
> > else is going on in the kernel and using vmalloc space.
> 
> It may be possible to use a radix tree and do sparse allocations, but given
> the current usage (tables with a million entries, each entry having a few
> bits), it doesn't seem to be a problem now.

I suppose it depends on how general we want this driver to be.  The
number of pages could be considerably larger if the underlying volumes
were larger or the page sizes considerably smaller.  For our particular
use of this device, and a reasonable look into the not-too-distant
future, I believe we would be happy if it works well at the
tens-of-millions-of-pages scale.  We are less concerned about
hundreds of millions of pages or more, at least for now.

I'm also not that familiar with what other devices use vmalloc space in
the kernel - with a limited resource like this, we must make sure we can
properly contend with other consumers of the space.

So in conclusion, I think we're converging on something that we're all
going to be happy with - we just need to ensure that the performance of
the proposed changes is acceptable compared to our existing code.

To that end I will be devoting some time next week to getting your
driver working with our userland to test page table uploading and actual
IO performance.  Would you mind taking some time to give the latest 'v4'
version of this driver a try, and doing an independent comparison?

v4 driver code is here:
  http://www.redhat.com/archives/dm-devel/2012-August/msg00299.html
v4 userland example code is here:
  http://www.redhat.com/archives/dm-devel/2012-August/msg00300.html

-- 
Jim Ramsay


* Re: [PATCH] reworked dm-switch target
  2012-08-24 18:14           ` Jim Ramsay
@ 2012-08-30  0:51             ` Mikulas Patocka
  2012-09-07 19:58               ` Jim Ramsay
  0 siblings, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2012-08-30  0:51 UTC (permalink / raw)
  To: Jim Ramsay; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley



On Fri, 24 Aug 2012, Jim Ramsay wrote:

> I'm also not that familiar with what other devices use vmalloc space in
> the kernel - with a limited resource like this, we must make sure we can
> properly contend with other consumers of the space.
> 
> So in conclusion, I think we're converging on something that we're all
> going to be happy with - we just need to ensure that the performance of
> the proposed changes is acceptable compared to our existing code.
> 
> To that end I will be devoting some time next week to getting your
> driver working with our userland to test page table uploading and actual
> IO performance.  Would you mind taking some time to give the latest 'v4'
> version of this driver a try, and doing an independent comparison?
> 
> v4 driver code is here:
>   http://www.redhat.com/archives/dm-devel/2012-August/msg00299.html
> v4 userland example code is here:
>   http://www.redhat.com/archives/dm-devel/2012-August/msg00300.html
> 
> -- 
> Jim Ramsay

Hi

I tested your dm-switch implementation and mine. I created three 1GB
ramdisk devices and filled them with data (so that alloc-on-demand in the
ramdisk driver doesn't influence the benchmark). Then I placed a dm-switch
device on these ramdisks, with a page size of 3 sectors and a page table
with the repetitive pattern 012012012...
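
(For reference, such a device can be created with a table line along
these lines - the device paths are my reconstruction, not copied from the
test setup:

  echo "0 2097152 switch 3 3 /dev/ram0 0 /dev/ram1 0 /dev/ram2 0" | dmsetup create switch2

2097152 sectors is the 1GB covered by each ramdisk - every page can be
remapped to any of the three devices, so each device must span the whole
target - and the constructor initializes the repetitive 012012... pattern
on its own.)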

I ran an fio command performing 8 parallel I/O threads (on two quad-core
Opterons), each reading or writing 512 bytes with direct I/O: "fio 
--rw=randrw --size=1G --bs=512 --filename=/dev/mapper/switch2 --direct=1 
--name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 
--name=job7 --name=job8"

Basically, the biggest performance bottleneck is the spinlock in your code;
otherwise there is not much difference. vmalloc doesn't slow it down
(it was actually slightly faster with vmalloc than with kmalloc).

Your original code
15.15s, stdev 0.05

Your code (with spinlock removed):
14.33s, stdev 0.17

Your code (with spinlock and RCU removed):
14.38s, stdev 0.16

My code (with kmalloc):
14.45s, stdev 0.17

My code (with vmalloc):
14.31s, stdev 0.11

Mikulas


* Re: [PATCH] reworked dm-switch target
  2012-08-30  0:51             ` Mikulas Patocka
@ 2012-09-07 19:58               ` Jim Ramsay
  2012-09-08 16:35                 ` Mikulas Patocka
  0 siblings, 1 reply; 18+ messages in thread
From: Jim Ramsay @ 2012-09-07 19:58 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley

On Wed, Aug 29, 2012 at 08:51:06PM -0400, Mikulas Patocka wrote:
> Basically, the biggest performance bottleneck is the spinlock in your code;
> otherwise there is not much difference. vmalloc doesn't slow it down
> (it was actually slightly faster with vmalloc than with kmalloc).

Thanks very much for your performance testing.  I've completed my
testing too, and my results agree with yours.  When running against our
actual iSCSI storage solution, the IO performance for this vmalloc-based
code is equivalent to the 2-step kmalloc code.

I also did a side-by-side comparison of uploading 1048576 page table
entries.  The netlink code averaged out to 0.0026s, and the DM message
code averaged out to 0.042s.  As expected, the message code is slower,
but well within reasonable performance expectations.

I am attaching below an updated version of your 'dm-switch.c' file,
based on your latest post in
http://www.redhat.com/archives/dm-devel/2012-August/msg00224.html that
makes the following changes:

 1. Support for FLUSH and DISCARD operations by implementing
    target_type.iterate_devices and handling (bio->bi_rw & REQ_FLUSH) in
    switch_map.  Sends DISCARD to one path, FLUSH to each path.

 2. Send IOCTLs to the device that owns sector 0, instead of
    pctx->dev_list[0]

 3. Copyright notice update in header, plus adding myself to
    MODULE_AUTHOR

(Question: Would you prefer these as a patch series against your dm-switch.c
instead?)

------ dm-switch.c ------
/*
 * Copyright (c) 2010-2012 by Dell Inc.  All rights reserved.
 *
 * This file is released under the GPL.
 *
 * Description:
 *
 *     file:    dm-switch.c
 *     authors: Kevin_OKelley@dell.com
 *              Jim_Ramsay@dell.com
 *              Narendran_Ganapathy@dell.com
 *		mpatocka@redhat.com
 *
 * This file implements a "switch" target which efficiently implements a
 * mapping of IOs to underlying block devices in scenarios where there are:
 *   (1) a large number of address regions
 *   (2) a fixed size that is equal across all address regions
 *   (3) no pattern that allows for a compact description with something like
 *       the dm-stripe target.
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/device-mapper.h>
#include <linux/vmalloc.h>

#define DM_MSG_PREFIX "switch"

/*
 * Switch device context block: A new one is created for each dm device.
 * Contains an array of devices from which we have taken references.
 */
struct switch_dev {
	struct dm_dev *dmdev;
	sector_t start;
};

typedef unsigned long pt_entry;

/* Switch context header */
struct switch_ctx {
	unsigned dev_count;		/* Number of devices */
	unsigned page_size;		/* Page size in 512B sectors */
	unsigned long n_pages;		/* Number of pages */
	signed char page_size_bits;	/* log2 of page_size or -1 */

	unsigned char pte_size;		/* Page table entry size in bits */
	unsigned char pte_fields;	/* Number of entries per pt_entry */
	signed char pte_fields_bits;	/* log2 of pte_fields or -1 */
	pt_entry *page_table;		/* Page table */

	/* Array of dm devices to switch between */
	struct switch_dev dev_list[0];
};

static inline void switch_get_position(struct switch_ctx *pctx,
				       unsigned long page,
				       unsigned long *index,
				       unsigned *bit)

{
	if (pctx->pte_fields_bits >= 0) {
		*index = page >> pctx->pte_fields_bits;
		*bit = page & (pctx->pte_fields - 1);
	} else {
		*index = page / pctx->pte_fields;
		*bit = page % pctx->pte_fields;
	}
	*bit *= pctx->pte_size;

}

static inline unsigned switch_get_deviceidx(struct switch_ctx *pctx,
					    sector_t sector)
{
	unsigned long index;
	unsigned bit, idev;
	sector_t p;

	p = sector;
	if (pctx->page_size_bits >= 0)
		p >>= pctx->page_size_bits;
	else
		sector_div(p, pctx->page_size);

	switch_get_position(pctx, p, &index, &bit);
	idev = (ACCESS_ONCE(pctx->page_table[index]) >> bit) &
	       ((1 << pctx->pte_size) - 1);

	/* This can only happen if the processor uses non-atomic stores. */
	if (unlikely(idev >= pctx->dev_count))
		idev = 0;

	return idev;
}

static void switch_page_table_write(struct switch_ctx *pctx, unsigned long page,
				    unsigned value)
{
	unsigned long index;
	unsigned bit;
	pt_entry pte;

	switch_get_position(pctx, page, &index, &bit);

	pte = pctx->page_table[index];
	pte &= ~((((pt_entry)1 << pctx->pte_size) - 1) << bit);
	pte |= (pt_entry)value << bit;
	pctx->page_table[index] = pte;
}
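
/*
 * A worked example (illustrative numbers, not from the original posting):
 * with 3 underlying devices, pte_size is 2 bits, so a 64-bit pt_entry
 * packs pte_fields = 32 page entries.  Page 100 is then stored in
 * page_table[100 / 32 = 3] at bit offset (100 % 32) * 2 = 8.
 */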

/*
 * Constructor: Called each time a dmsetup command creates a dm device.  The
 * target parameter will already have the table, type, begin and len fields
 * filled in.  Arguments: <dev_count> <page_size> followed by pairs of
 * <dev_path> <offset>.  A dm table may contain several switch targets, so
 * we can get multiple constructor calls; each call builds its own switch_ctx
 * so that the page table information gets matched to the correct target.
 */
static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
	unsigned a;
	int n;
	int r;
	unsigned dev_count;
	struct switch_ctx *pctx;
	sector_t dev_size;
	unsigned long e;

	if (argc < 4) {
		ti->error = "Insufficient arguments";
		r = -EINVAL;
		goto error;
	}
	if (kstrtouint(argv[0], 10, &dev_count) ||
	    !dev_count ||
	    dev_count > (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_dev)) {
		ti->error = "Invalid device count";
		r = -EINVAL;
		goto error;
	}
	if (dev_count != (argc - 2) / 2) {
		ti->error = "Invalid argument count";
		r = -EINVAL;
		goto error;
	}
	pctx = kmalloc(sizeof(struct switch_ctx) + (dev_count * sizeof(struct switch_dev)),
		       GFP_KERNEL);
	if (!pctx) {
		ti->error = "Cannot allocate redirect context";
		r = -ENOMEM;
		goto error;
	}
	pctx->dev_count = dev_count;
	if (kstrtouint(argv[1], 10, &pctx->page_size) ||
	    !pctx->page_size) {
		ti->error = "Invalid page size";
		r = -EINVAL;
		goto error_kfree;
	}

	if (!(pctx->page_size & (pctx->page_size - 1)))
		pctx->page_size_bits = __ffs(pctx->page_size);
	else
		pctx->page_size_bits = -1;

	pctx->pte_size = 1;
	while (pctx->pte_size < sizeof(pt_entry) * 8 &&
	       (pt_entry)1 << pctx->pte_size < pctx->dev_count)
		pctx->pte_size++;

	pctx->pte_fields = (sizeof(pt_entry) * 8) / pctx->pte_size;
	if (!(pctx->pte_fields & (pctx->pte_fields - 1)))
		pctx->pte_fields_bits = __ffs(pctx->pte_fields);
	else
		pctx->pte_fields_bits = -1;

	dev_size = ti->len;
	if (sector_div(dev_size, pctx->page_size))
		dev_size++;

	pctx->n_pages = dev_size;
	if (pctx->n_pages != dev_size || pctx->n_pages >= ULONG_MAX) {
		ti->error = "Page table too long";
		r = -EINVAL;
		goto error_kfree;
	}

	if (sector_div(dev_size, pctx->pte_fields))
		dev_size++;

	if (dev_size > ULONG_MAX / sizeof(pt_entry)) {
		ti->error = "Page table too long";
		r = -EINVAL;
		goto error_kfree;
	}

	r = dm_set_target_max_io_len(ti, pctx->page_size);
	if (r)
		goto error_kfree;

	pctx->page_table = vmalloc(dev_size * sizeof(pt_entry));
	if (!pctx->page_table) {
		ti->error = "Cannot allocate page table";
		r = -ENOMEM;
		goto error_kfree;
	}

	a = 0;
	for (e = 0; e < pctx->n_pages; e++) {
		switch_page_table_write(pctx, e, a);
		a++;
		if (a >= pctx->dev_count)
			a = 0;
	}

	/*
	 * Check each device beneath the target to ensure that the limits are
	 * consistent.
	 */
	for (n = 0, a = 2; n < pctx->dev_count; n++, a += 2) {
		struct dm_dev *dm;
		sector_t dev_size;
		unsigned long long start;

		if (kstrtoull(argv[a + 1], 10, &start) ||
		    start != (sector_t)start) {
			ti->error = "Invalid device starting offset";
			r = -EINVAL;
			n--;
			goto error_release_n;
		}
		r = dm_get_device
		    (ti, argv[a], dm_table_get_mode(ti->table), &dm);
		if (r) {
			ti->error = "Device lookup failed";
			n--;
			goto error_release_n;
		}
		pctx->dev_list[n].dmdev = dm;
		pctx->dev_list[n].start = start;

		dev_size = i_size_read(dm->bdev->bd_inode) >> SECTOR_SHIFT;

		if (ti->len > start + dev_size) {
			ti->error = "Device is too small";
			r = -EINVAL;
			goto error_release_n;
		}
	}

	/* For UNMAP, sending the request down any path is sufficient */
	ti->num_discard_requests = 1;
	/* For FLUSH, we should flush each path */
	ti->num_flush_requests = pctx->dev_count;

	ti->private = pctx;

	return 0;

error_release_n:		/* De-reference all devices  */
	for (; n >= 0; n--)
		dm_put_device(ti, pctx->dev_list[n].dmdev);

	vfree(pctx->page_table);
error_kfree:
	kfree(pctx);

error:
	return r;
}

/*
 * Destructor: Don't free the dm_target, just the ti->private data (if any).
 */
static void switch_dtr(struct dm_target *ti)
{
	int n;
	struct switch_ctx *pctx = ti->private;

	for (n = 0; n < pctx->dev_count; n++)
		dm_put_device(ti, pctx->dev_list[n].dmdev);

	vfree(pctx->page_table);
	kfree(pctx);
}

static int switch_map(struct dm_target *ti, struct bio *bio,
		      union map_info *map_context)
{
	struct switch_ctx *pctx = ti->private;

	sector_t offset = bio->bi_sector - ti->begin;
	unsigned idev;

	if (bio->bi_rw & REQ_FLUSH) {
		int request_nr = map_context->target_request_nr;
		BUG_ON(request_nr >= pctx->dev_count);
		bio->bi_bdev = pctx->dev_list[request_nr].dmdev->bdev;
		return DM_MAPIO_REMAPPED;
	}

	idev = switch_get_deviceidx(pctx, offset);

	bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;
	bio->bi_sector = pctx->dev_list[idev].start + offset;

	return DM_MAPIO_REMAPPED;
}

/*
 * We need to parse hex numbers as fast as possible.
 * The dm message interface is used to load the whole table.
 *
 * This table-based hex parser improves performance.
 * It improves the time to load 1000000 entries compared to the condition-based
 * parser.
 *		table-based parser	condition-based parser
 * PA-RISC	0.29s			0.31s
 * Opteron	0.0495s			0.0498s
 */

static const unsigned char hex_table[256] = {
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
0,1,2,3,4,5,6,7,8,9,255,255,255,255,255,255,
255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255
};

static inline void parse_hex(const char *string, sector_t *result, const char **end)
{
	unsigned char d;
	sector_t r = 0;
#if 1
	while ((d = hex_table[(unsigned char)*string]) < 16) {
		r = (r << 4) | d;
		string++;
	}
#else
	while (1) {
		d = *string;
		if (d >= '0' && d <= '9')
			d -= '0';
		else if (d >= 'A' && d <= 'F')
			d -= 'A' - 10;
		else if (d >= 'a' && d <= 'f')
			d -= 'a' - 10;
		else
			break;
		r = (r << 4) | d;
		string++;
	}
#endif
	*end = string;
	*result = r;
}

static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
{
	static DEFINE_MUTEX(message_mutex);

	struct switch_ctx *pctx = ti->private;
	int r;

	mutex_lock(&message_mutex);

	if (!argc) {
		goto invalid_message;
	} else if (!strcasecmp(argv[0], "set-table")) {
		unsigned i;
		sector_t table_index = 0;
		for (i = 1; i < argc; i++) {
			sector_t device;
			const char *string = argv[i];
			if (*string == ':')
				table_index++;
			else {
				parse_hex(string, &table_index, &string);
				if (unlikely(*string != ':')) {
invalid_table:
					DMWARN("invalid set-table argument");
					r = -EINVAL;
					goto ret;
				}
			}
			string++;
			if (unlikely(!*string))
				goto invalid_table;
			parse_hex(string, &device, &string);
			if (unlikely(*string))
				goto invalid_table;
			if (unlikely(table_index >= pctx->n_pages)) {
				DMWARN("invalid set-table page");
				r = -EINVAL;
				goto ret;
			}
			if (unlikely(device >= pctx->dev_count)) {
				DMWARN("invalid set-table device");
				r = -EINVAL;
				goto ret;
			}
			switch_page_table_write(pctx, table_index, device);
		}
		r = 0;
	} else {
invalid_message:
		DMWARN("unrecognised message received.");
		r = -EINVAL;
	}
ret:
	mutex_unlock(&message_mutex);
	return r;
}

static int switch_status(struct dm_target *ti, status_type_t type,
			 unsigned status_flags, char *result, unsigned maxlen)
{
	struct switch_ctx *pctx = ti->private;
	unsigned sz = 0;
	int n;

	result[0] = '\0';
	switch (type) {
	case STATUSTYPE_INFO:
		result[0] = 0;
		break;

	case STATUSTYPE_TABLE:
		DMEMIT("%u %u", pctx->dev_count, pctx->page_size);
		for (n = 0; n < pctx->dev_count; n++) {
			DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,
			       (unsigned long long)pctx->dev_list[n].start);
		}
		break;

	default:
		return 0;
	}
	return 0;
}

/*
 * Switch ioctl:
 *
 * Pass through all ioctls to the path for sector 0.
 */
static int switch_ioctl(struct dm_target *ti, unsigned cmd,
			unsigned long arg)
{
	struct switch_ctx *pctx = ti->private;
	struct block_device *bdev;
	fmode_t mode;
	unsigned idev;

	idev = switch_get_deviceidx(pctx, 0);

	bdev = pctx->dev_list[idev].dmdev->bdev;
	mode = pctx->dev_list[idev].dmdev->mode;

	return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}

static int switch_iterate_devices(struct dm_target *ti,
				  iterate_devices_callout_fn fn, void *data)
{
	struct switch_ctx *pctx = (struct switch_ctx *)ti->private;
	int n, ret = 0;

	for (n = 0; n < pctx->dev_count; n++) {
		ret = fn(ti, pctx->dev_list[n].dmdev, ti->begin, ti->len, data);
		if (ret)
			goto out;
	}

out:
	return ret;
}

static struct target_type switch_target = {
	.name = "switch",
	.version = {1, 0, 0},
	.module = THIS_MODULE,
	.ctr = switch_ctr,
	.dtr = switch_dtr,
	.map = switch_map,
	.message = switch_message,
	.status = switch_status,
	.ioctl = switch_ioctl,
	.iterate_devices = switch_iterate_devices,
};

int __init dm_switch_init(void)
{
	int r;

	r = dm_register_target(&switch_target);
	if (r) {
		DMERR("dm_register_target() failed %d", r);
		return r;
	}

	return 0;
}

void dm_switch_exit(void)
{
	dm_unregister_target(&switch_target);
}

module_init(dm_switch_init);
module_exit(dm_switch_exit);

MODULE_DESCRIPTION(DM_NAME " fixed-size address-region-mapping throughput-oriented path selector");
MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
MODULE_AUTHOR("Jim Ramsay <Jim_Ramsay@dell.com>");
MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
MODULE_LICENSE("GPL");
-------------------------

-- 
Jim Ramsay


* Re: [PATCH] reworked dm-switch target
  2012-09-07 19:58               ` Jim Ramsay
@ 2012-09-08 16:35                 ` Mikulas Patocka
  2012-09-11 13:07                   ` Jim Ramsay
  2012-09-11 13:14                   ` Jim Ramsay
  0 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2012-09-08 16:35 UTC (permalink / raw)
  To: Jim Ramsay; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley

Hi

On Fri, 7 Sep 2012, Jim Ramsay wrote:

> On Wed, Aug 29, 2012 at 08:51:06PM -0400, Mikulas Patocka wrote:
> > Basically, the biggest performance bottleneck is the spinlock in your code;
> > otherwise there is not much difference. vmalloc doesn't slow it down
> > (it was actually slightly faster with vmalloc than with kmalloc).
> 
> Thanks very much for your performance testing.  I've completed my
> testing too, and my results agree with yours.  When running against our
> actual iSCSI storage solution, the IO performance for this vmalloc-based
> code is equivalent to the 2-step kmalloc code.
> 
> I also did a side-by-side comparison of uploading 1048576 page table
> entries.  The netlink code averaged out to 0.0026s, and the DM message
> code averaged out to 0.042s.  As expected, the message code is slower,
> but well within reasonable performance expectations.
> 
> I am attaching below an updated version of your 'dm-switch.c' file,
> based on your latest post in
> http://www.redhat.com/archives/dm-devel/2012-August/msg00224.html that
> makes the following changes:
> 
>  1. Support for FLUSH and DISCARD operations by implementing
>     target_type.iterate_devices and handling (bio->bi_rw & REQ_FLUSH) in
>     switch_map.  Sends DISCARD to one path, FLUSH to each path.
> 
>  2. Send IOCTLs to the device that owns sector 0, instead of
>     pctx->dev_list[0]
> 
>  3. Copyright notice update in header, plus adding myself to
>     MODULE_AUTHOR
> 
> (Question: Would you prefer these as a patch series against your dm-switch.c
> instead?)

You don't have to send it as a patch; I can trivially create the patch on
my own.

Your changes are OK; they could be included in the kernel.

Regarding flushes ... could you please explain how write-back caching 
works in your design?

Is there some write-back cache in the host adapter?

Is there some write-back cache in the individual storage nodes? (i.e. does 
the node signal end of I/O before it finishes writing data to the disk) 

How does the node deal with the write-back cache in the disk? Does the node
turn on the write-back cache in the disk? Does the node send SYNCHRONIZE CACHE
commands to the disk automatically? Or does it send SYNCHRONIZE CACHE
commands only when a flush is received over the network?

I'd like to know these things to better understand the behavior of the 
flush request.


Another thing - please resend your code with "Signed-off-by". Read the 
meaning of "Signed-off-by" in Documentation/SubmittingPatches, agree to 
the terms and append "Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>" to 
the code. It is a legal requirement, so that you certify that the code is 
under the open source license and that you have the right to distribute 
the code.

Mikulas


* Re: [PATCH] reworked dm-switch target
  2012-09-08 16:35                 ` Mikulas Patocka
@ 2012-09-11 13:07                   ` Jim Ramsay
  2012-09-11 17:13                     ` Mikulas Patocka
  2012-09-11 13:14                   ` Jim Ramsay
  1 sibling, 1 reply; 18+ messages in thread
From: Jim Ramsay @ 2012-09-11 13:07 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley


On Sat, Sep 08, 2012 at 12:35:22PM -0400, Mikulas Patocka wrote:
> Your changes are OK; they could be included in the kernel.
> 
> Regarding flushes ... could you please explain how write-back caching 
> works in your design?

We are not actually using write-back caching in our design at all, but
assumed that for a general-purpose DM device like dm-switch, some
underlying storage may rely on REQ_FLUSH to ensure proper data
synchronization, so the switch should do the right thing when required.
The code is modeled on dm-stripe.

> Is there some write-back cache in the host adapter?

No, there is not at this time.

> Is there some write-back cache in the individual storage nodes? (i.e. does 
> the node signal end of I/O before it finishes writing data to the disk) 

No, there is not at this time.  The iSCSI-backed SD devices we are
ultimately using do not advertise any write-back cache ('sdparm
--get=WCE' returns 0).

> Another thing - please resend your code with "Signed-off-by". Read the 
> meaning of "Signed-off-by" in Documentation/SubmittingPatches, agree to 
> the terms and append "Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>" to 
> the code. It is a legal requirement, so that you certify that the code is 
> under the open source license and that you have the right to distribute 
> the code.

Will do shortly.  Thanks!

-- 
Jim Ramsay


* Re: [PATCH] reworked dm-switch target
  2012-09-08 16:35                 ` Mikulas Patocka
  2012-09-11 13:07                   ` Jim Ramsay
@ 2012-09-11 13:14                   ` Jim Ramsay
  1 sibling, 0 replies; 18+ messages in thread
From: Jim Ramsay @ 2012-09-11 13:14 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley

On Sat, Sep 08, 2012 at 12:35:22PM -0400, Mikulas Patocka wrote:
> Another thing - please resend your code with "Signed-off-by". Read the 
> meaning of "Signed-off-by" in Documentation/SubmittingPatches, agree to 
> the terms and append "Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>" to 
> the code. It is a legal requirement, so that you certify that the code is 
> under the open source license and that you have the right to distribute 
> the code.

I am attaching below an updated version of your 'dm-switch.c' file,
based on your latest post in
http://www.redhat.com/archives/dm-devel/2012-August/msg00224.html that
makes the following changes:

 1. Support for FLUSH and DISCARD operations by implementing
    target_type.iterate_devices and handling (bio->bi_rw & REQ_FLUSH) in
    switch_map.  Sends DISCARD to one path, FLUSH to each path.

 2. Send IOCTLs to the device that owns sector 0, instead of
    pctx->dev_list[0]

 3. Copyright notice update in header, plus adding myself to
    MODULE_AUTHOR

Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>

------ dm-switch.c ------
/*
 * Copyright (c) 2010-2012 by Dell Inc.  All rights reserved.
 *
 * This file is released under the GPL.
 *
 * Description:
 *
 *     file:    dm-switch.c
 *     authors: Kevin_OKelley@dell.com
 *              Jim_Ramsay@dell.com
 *              Narendran_Ganapathy@dell.com
 *		mpatocka@redhat.com
 *
 * This file implements a "switch" target which efficiently implements a
 * mapping of IOs to underlying block devices in scenarios where there are:
 *   (1) a large number of address regions
 *   (2) a fixed size that is equal across all address regions
 *   (3) no pattern that allows for a compact description with something like
 *       the dm-stripe target.
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/device-mapper.h>
#include <linux/vmalloc.h>

#define DM_MSG_PREFIX "switch"

/*
 * Switch device context block: A new one is created for each dm device.
 * Contains an array of devices from which we have taken references.
 */
struct switch_dev {
	struct dm_dev *dmdev;
	sector_t start;
};

typedef unsigned long pt_entry;

/* Switch context header */
struct switch_ctx {
	unsigned dev_count;		/* Number of devices */
	unsigned page_size;		/* Page size in 512B sectors */
	unsigned long n_pages;		/* Number of pages */
	signed char page_size_bits;	/* log2 of page_size or -1 */

	unsigned char pte_size;		/* Page table entry size in bits */
	unsigned char pte_fields;	/* Number of entries per pt_entry */
	signed char pte_fields_bits;	/* log2 of pte_fields or -1 */
	pt_entry *page_table;		/* Page table */

	/* Array of dm devices to switch between */
	struct switch_dev dev_list[0];
};

static inline void switch_get_position(struct switch_ctx *pctx,
				       unsigned long page,
				       unsigned long *index,
				       unsigned *bit)

{
	if (pctx->pte_fields_bits >= 0) {
		*index = page >> pctx->pte_fields_bits;
		*bit = page & (pctx->pte_fields - 1);
	} else {
		*index = page / pctx->pte_fields;
		*bit = page % pctx->pte_fields;
	}
	*bit *= pctx->pte_size;

}

static inline unsigned switch_get_deviceidx(struct switch_ctx *pctx,
					    sector_t sector)
{
	unsigned long index;
	unsigned bit, idev;
	sector_t p;

	p = sector;
	if (pctx->page_size_bits >= 0)
		p >>= pctx->page_size_bits;
	else
		sector_div(p, pctx->page_size);

	switch_get_position(pctx, p, &index, &bit);
	idev = (ACCESS_ONCE(pctx->page_table[index]) >> bit) &
	       ((1 << pctx->pte_size) - 1);

	/* This can only happen if the processor uses non-atomic stores. */
	if (unlikely(idev >= pctx->dev_count))
		idev = 0;

	return idev;
}

static void switch_page_table_write(struct switch_ctx *pctx, unsigned long page,
				    unsigned value)
{
	unsigned long index;
	unsigned bit;
	pt_entry pte;

	switch_get_position(pctx, page, &index, &bit);

	pte = pctx->page_table[index];
	pte &= ~((((pt_entry)1 << pctx->pte_size) - 1) << bit);
	pte |= (pt_entry)value << bit;
	pctx->page_table[index] = pte;
}

/*
 * Constructor: Called each time a dmsetup command creates a dm device.  The
 * target parameter will already have the table, type, begin and len fields
 * filled in.  Arguments: <dev_count> <page_size> followed by pairs of
 * <dev_path> <offset>.  A dm table may contain several switch targets, so
 * we can get multiple constructor calls; each call builds its own switch_ctx
 * so that the page table information gets matched to the correct target.
 */
static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
	unsigned a;
	int n;
	int r;
	unsigned dev_count;
	struct switch_ctx *pctx;
	sector_t dev_size;
	unsigned long e;

	if (argc < 4) {
		ti->error = "Insufficient arguments";
		r = -EINVAL;
		goto error;
	}
	if (kstrtouint(argv[0], 10, &dev_count) ||
	    !dev_count ||
	    dev_count > (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_dev)) {
		ti->error = "Invalid device count";
		r = -EINVAL;
		goto error;
	}
	if (dev_count != (argc - 2) / 2) {
		ti->error = "Invalid argument count";
		r = -EINVAL;
		goto error;
	}
	pctx = kmalloc(sizeof(struct switch_ctx) + (dev_count * sizeof(struct switch_dev)),
		       GFP_KERNEL);
	if (!pctx) {
		ti->error = "Cannot allocate redirect context";
		r = -ENOMEM;
		goto error;
	}
	pctx->dev_count = dev_count;
	if (kstrtouint(argv[1], 10, &pctx->page_size) ||
	    !pctx->page_size) {
		ti->error = "Invalid page size";
		r = -EINVAL;
		goto error_kfree;
	}

	if (!(pctx->page_size & (pctx->page_size - 1)))
		pctx->page_size_bits = __ffs(pctx->page_size);
	else
		pctx->page_size_bits = -1;

	pctx->pte_size = 1;
	while (pctx->pte_size < sizeof(pt_entry) * 8 &&
	       (pt_entry)1 << pctx->pte_size < pctx->dev_count)
		pctx->pte_size++;

	pctx->pte_fields = (sizeof(pt_entry) * 8) / pctx->pte_size;
	if (!(pctx->pte_fields & (pctx->pte_fields - 1)))
		pctx->pte_fields_bits = __ffs(pctx->pte_fields);
	else
		pctx->pte_fields_bits = -1;

	dev_size = ti->len;
	if (sector_div(dev_size, pctx->page_size))
		dev_size++;

	pctx->n_pages = dev_size;
	if (pctx->n_pages != dev_size || pctx->n_pages >= ULONG_MAX) {
		ti->error = "Page table too long";
		r = -EINVAL;
		goto error_kfree;
	}

	if (sector_div(dev_size, pctx->pte_fields))
		dev_size++;

	if (dev_size > ULONG_MAX / sizeof(pt_entry)) {
		ti->error = "Page table too long";
		r = -EINVAL;
		goto error_kfree;
	}

	r = dm_set_target_max_io_len(ti, pctx->page_size);
	if (r)
		goto error_kfree;

	pctx->page_table = vmalloc(dev_size * sizeof(pt_entry));
	if (!pctx->page_table) {
		ti->error = "Cannot allocate page table";
		r = -ENOMEM;
		goto error_kfree;
	}

	a = 0;
	for (e = 0; e < pctx->n_pages; e++) {
		switch_page_table_write(pctx, e, a);
		a++;
		if (a >= pctx->dev_count)
			a = 0;
	}

	/*
	 * Check each device beneath the target to ensure that the limits are
	 * consistent.
	 */
	for (n = 0, a = 2; n < pctx->dev_count; n++, a += 2) {
		struct dm_dev *dm;
		sector_t dev_size;
		unsigned long long start;

		if (kstrtoull(argv[a + 1], 10, &start) ||
		    start != (sector_t)start) {
			ti->error = "Invalid device starting offset";
			r = -EINVAL;
			n--;
			goto error_release_n;
		}
		r = dm_get_device
		    (ti, argv[a], dm_table_get_mode(ti->table), &dm);
		if (r) {
			ti->error = "Device lookup failed";
			n--;
			goto error_release_n;
		}
		pctx->dev_list[n].dmdev = dm;
		pctx->dev_list[n].start = start;

		dev_size = i_size_read(dm->bdev->bd_inode) >> SECTOR_SHIFT;

		if (ti->len > start + dev_size) {
			ti->error = "Device is too small";
			r = -EINVAL;
			goto error_release_n;
		}
	}

	/* For UNMAP, sending the request down any path is sufficient */
	ti->num_discard_requests = 1;
	/* For FLUSH, we should flush each path */
	ti->num_flush_requests = pctx->dev_count;

	ti->private = pctx;

	return 0;

error_release_n:		/* De-reference all devices  */
	for (; n >= 0; n--)
		dm_put_device(ti, pctx->dev_list[n].dmdev);

	vfree(pctx->page_table);
error_kfree:
	kfree(pctx);

error:
	return r;
}

/*
 * Destructor: Don't free the dm_target, just the ti->private data (if any).
 */
static void switch_dtr(struct dm_target *ti)
{
	int n;
	struct switch_ctx *pctx = ti->private;

	for (n = 0; n < pctx->dev_count; n++)
		dm_put_device(ti, pctx->dev_list[n].dmdev);

	vfree(pctx->page_table);
	kfree(pctx);
}

static int switch_map(struct dm_target *ti, struct bio *bio,
		      union map_info *map_context)
{
	struct switch_ctx *pctx = ti->private;

	sector_t offset = bio->bi_sector - ti->begin;
	unsigned idev;

	if (bio->bi_rw & REQ_FLUSH) {
		int request_nr = map_context->target_request_nr;
		BUG_ON(request_nr >= pctx->dev_count);
		bio->bi_bdev = pctx->dev_list[request_nr].dmdev->bdev;
		return DM_MAPIO_REMAPPED;
	}

	idev = switch_get_deviceidx(pctx, offset);

	bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;
	bio->bi_sector = pctx->dev_list[idev].start + offset;

	return DM_MAPIO_REMAPPED;
}

/*
 * We need to parse hex numbers as fast as possible.
 * The dm message interface is used to load the whole table.
 *
 * This table-based hex parser improves performance.
 * It improves the time to load 1000000 entries compared to the condition-based
 * parser.
 *		table-based parser	condition-based parser
 * PA-RISC	0.29s			0.31s
 * Opteron	0.0495s			0.0498s
 */

static const unsigned char hex_table[256] = {
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
0,1,2,3,4,5,6,7,8,9,255,255,255,255,255,255,
255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255
};

static inline void parse_hex(const char *string, sector_t *result, const char **end)
{
	unsigned char d;
	sector_t r = 0;
#if 1
	while ((d = hex_table[(unsigned char)*string]) < 16) {
		r = (r << 4) | d;
		string++;
	}
#else
	while (1) {
		d = *string;
		if (d >= '0' && d <= '9')
			d -= '0';
		else if (d >= 'A' && d <= 'F')
			d -= 'A' - 10;
		else if (d >= 'a' && d <= 'f')
			d -= 'a' - 10;
		else
			break;
		r = (r << 4) | d;
		string++;
	}
#endif
	*end = string;
	*result = r;
}

static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
{
	static DEFINE_MUTEX(message_mutex);

	struct switch_ctx *pctx = ti->private;
	int r;

	mutex_lock(&message_mutex);

	if (!argc) {
		goto invalid_message;
	} else if (!strcasecmp(argv[0], "set-table")) {
		unsigned i;
		sector_t table_index = 0;
		for (i = 1; i < argc; i++) {
			sector_t device;
			const char *string = argv[i];
			if (*string == ':')
				table_index++;
			else {
				parse_hex(string, &table_index, &string);
				if (unlikely(*string != ':')) {
invalid_table:
					DMWARN("invalid set-table argument");
					r = -EINVAL;
					goto ret;
				}
			}
			string++;
			if (unlikely(!*string))
				goto invalid_table;
			parse_hex(string, &device, &string);
			if (unlikely(*string))
				goto invalid_table;
			if (unlikely(table_index >= pctx->n_pages)) {
				DMWARN("invalid set-table page");
				r = -EINVAL;
				goto ret;
			}
			if (unlikely(device >= pctx->dev_count)) {
				DMWARN("invalid set-table device");
				r = -EINVAL;
				goto ret;
			}
			switch_page_table_write(pctx, table_index, device);
		}
		r = 0;
	} else {
invalid_message:
		DMWARN("unrecognised message received.");
		r = -EINVAL;
	}
ret:
	mutex_unlock(&message_mutex);
	return r;
}

static int switch_status(struct dm_target *ti, status_type_t type,
			 unsigned status_flags, char *result, unsigned maxlen)
{
	struct switch_ctx *pctx = ti->private;
	unsigned sz = 0;
	int n;

	result[0] = '\0';
	switch (type) {
	case STATUSTYPE_INFO:
		result[0] = 0;
		break;

	case STATUSTYPE_TABLE:
		DMEMIT("%u %u", pctx->dev_count, pctx->page_size);
		for (n = 0; n < pctx->dev_count; n++) {
			DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,
			       (unsigned long long)pctx->dev_list[n].start);
		}
		break;

	default:
		return 0;
	}
	return 0;
}

/*
 * Switch ioctl:
 *
 * Pass through all ioctls to the path for sector 0.
 */
static int switch_ioctl(struct dm_target *ti, unsigned cmd,
			unsigned long arg)
{
	struct switch_ctx *pctx = ti->private;
	struct block_device *bdev;
	fmode_t mode;
	unsigned idev;

	idev = switch_get_deviceidx(pctx, 0);

	bdev = pctx->dev_list[idev].dmdev->bdev;
	mode = pctx->dev_list[idev].dmdev->mode;

	return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}

static int switch_iterate_devices(struct dm_target *ti,
				  iterate_devices_callout_fn fn, void *data)
{
	struct switch_ctx *pctx = (struct switch_ctx *)ti->private;
	int n, ret = 0;

	for (n = 0; n < pctx->dev_count; n++) {
		ret = fn(ti, pctx->dev_list[n].dmdev, ti->begin, ti->len, data);
		if (ret)
			goto out;
	}

out:
	return ret;
}

static struct target_type switch_target = {
	.name = "switch",
	.version = {1, 0, 0},
	.module = THIS_MODULE,
	.ctr = switch_ctr,
	.dtr = switch_dtr,
	.map = switch_map,
	.message = switch_message,
	.status = switch_status,
	.ioctl = switch_ioctl,
	.iterate_devices = switch_iterate_devices,
};

int __init dm_switch_init(void)
{
	int r;

	r = dm_register_target(&switch_target);
	if (r) {
		DMERR("dm_register_target() failed %d", r);
		return r;
	}

	return 0;
}

void dm_switch_exit(void)
{
	dm_unregister_target(&switch_target);
}

module_init(dm_switch_init);
module_exit(dm_switch_exit);

MODULE_DESCRIPTION(DM_NAME " fixed-size address-region-mapping throughput-oriented path selector");
MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
MODULE_AUTHOR("Jim Ramsay <Jim_Ramsay@dell.com>");
MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
MODULE_LICENSE("GPL");
-------------------------

-- 
Jim Ramsay



* Re: [PATCH] reworked dm-switch target
  2012-09-11 13:07                   ` Jim Ramsay
@ 2012-09-11 17:13                     ` Mikulas Patocka
  0 siblings, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2012-09-11 17:13 UTC (permalink / raw)
  To: Jim Ramsay; +Cc: dm-devel, Jason_Shamberger, agk, Kevin_OKelley



On Tue, 11 Sep 2012, Jim Ramsay wrote:

> 
> On Sat, Sep 08, 2012 at 12:35:22PM -0400, Mikulas Patocka wrote:
> > Your changes are OK; they could be included in the kernel.
> > 
> > Regarding flushes ... could you please explain how write-back caching 
> > works in your design?
> 
> We are not actually using write-back caching in our design at all, but
> assumed that for a general-purpose DM device like dm-switch, some
> underlying storage may rely on REQ_FLUSH to ensure proper data
> synchronization, so the switch should do the right thing when required.
> The code is modeled on dm-stripe.

So you don't need to process REQ_FLUSH at all. REQ_FLUSH doesn't impose
any request synchronization; it only flushes the hardware cache - and in
your setup there is no such cache. I would remove REQ_FLUSH support if it
has no use.

dm-switch is used directly on host adapters (or maybe on dm-mpath that is
connected to host adapters); there is no intermediate layer that would
need REQ_FLUSH.
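
(Concretely, dropping it would mean removing the REQ_FLUSH branch from
switch_map and the ti->num_flush_requests assignment in switch_ctr.)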

Mikulas

> > Is there some write-back cache in the host adapter?
> 
> No, there is not at this time.
> 
> > Is there some write-back cache in the individual storage nodes? (i.e. does 
> > the node signal end of I/O before it finishes writing data to the disk) 
> 
> No, there is not at this time.  The iSCSI-backed SD devices we are
> ultimately using do not advertise any write-back cache ('sdparm
> --get=WCE' returns 0).
> 
> > Another thing - please resend your code with "Signed-off-by". Read the 
> > meaning of "Signed-off-by" in Documentation/SubmittingPatches, agree to 
> > the terms and append "Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>" to 
> > the code. It is a legal requirement, so that you certify that the code is 
> > under the open source license and that you have the right to distribute 
> > the code.
> 
> Will do shortly.  Thanks!
> 
> -- 
> Jim Ramsay
> 



Thread overview: 18+ messages
2012-08-15 22:36 [PATCH] reworked dm-switch target Mikulas Patocka
2012-08-16  7:09 ` Alasdair G Kergon
2012-08-17 14:18   ` Jim_Ramsay
2012-08-20 19:20     ` Mikulas Patocka
2012-08-21 16:33       ` Jim Ramsay
2012-08-21 18:14         ` Alasdair G Kergon
2012-08-22  1:02         ` Mikulas Patocka
2012-08-24 18:14           ` Jim Ramsay
2012-08-30  0:51             ` Mikulas Patocka
2012-09-07 19:58               ` Jim Ramsay
2012-09-08 16:35                 ` Mikulas Patocka
2012-09-11 13:07                   ` Jim Ramsay
2012-09-11 17:13                     ` Mikulas Patocka
2012-09-11 13:14                   ` Jim Ramsay
2012-08-22  1:03         ` Mikulas Patocka
2012-08-22  1:04         ` Mikulas Patocka
     [not found]           ` <20120822192547.GJ32571@agk-dp.fab.redhat.com>
     [not found]             ` <Pine.LNX.4.64.1208221813380.12623@file.rdu.redhat.com>
     [not found]               ` <20120822222708.GN32571@agk-dp.fab.redhat.com>
     [not found]                 ` <Pine.LNX.4.64.1208222122180.7091@file.rdu.redhat.com>
2012-08-23  1:35                   ` [PATCH] check sector number in dmsetup message Mikulas Patocka
2012-08-20 20:48     ` [PATCH] reworked dm-switch target Alasdair G Kergon
