All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH} dm-throttle: new device mapper target to throttle reads and writes
@ 2010-08-10 13:42 Heinz Mauelshagen
  2010-08-10 14:44 ` Vivek Goyal
  2010-08-10 18:41 ` Vivek Goyal
  0 siblings, 2 replies; 10+ messages in thread
From: Heinz Mauelshagen @ 2010-08-10 13:42 UTC (permalink / raw)
  To: dm-devel


This is a new device mapper "throttle" target which allows for
throttling reads and writes (ie. enforcing throughput limits) in units
of kilobytes per second.

I've been using it for a while in testing configurations and think it's
valuable for many people requiring simulation of low bandwidth
interconnects or simulating different throughput characteristics on
distinct address segments of a device (eg. fast outer disk spindles vs.
slower inner ones).

Please read Documentation/device-mapper/throttle.txt for how to use it.

Note: this target can be combined with the "delay" target, which is
already upstream in order to set io delays in addition to throttling,
again valuable for long distance transport simulations.


This target should stay separate rather than merged IMO, because it
basically serves testing purposes and hence should not complicate any
production mapping target. A potential merge with the "delay" target is
subject to discussion.


Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>

 Documentation/device-mapper/throttle.txt |   68 ++++++
 drivers/md/Kconfig                       |    8 +
 drivers/md/Makefile                      |    1 +
 drivers/md/dm-throttle.c                 |  389 ++++++++++++++++++++++++++++++
 4 files changed, 466 insertions(+), 0 deletions(-)

diff --git a/Documentation/device-mapper/throttle.txt b/Documentation/device-mapper/throttle.txt
new file mode 100644
index 0000000..9deea6e
--- /dev/null
+++ b/Documentation/device-mapper/throttle.txt
@@ -0,0 +1,68 @@
+dm-throttle
+===========
+
+Device-Mapper's "throttle" target maps a linear range of the Device-Mapper
+device onto a linear range of another device providing the option to throttle
+read and write ios seperately.
+
+This target provides the ability to simulate low bandwidth transports to
+devices or different throughput to seperate address segements of a device.
+
+Parameters: <#variable params> <read kbs> <write kbs> <dev path> <offset>
+    <#variable params> number of variable paramaters to set read and
+		       write throttling kilobytes per second limits.
+		       Range: 0 - 2 with
+		       0 = no throttling,
+		       1 and <read kbs> = read throttling only and
+		       2 and <read kbs> <write kbs> = read and write throttling.
+    <read kbs> read kilobatyes per second limit
+    <write kbs> write kilobatyes per second limit
+    <dev path>: Full pathname to the underlying block-device, or a
+                "major:minor" device-number.
+    <offset>: Starting sector within the device.
+
+Throttling read and write values can be adjusted through the constructor
+by reloading a mapping table with the respective parameters or without
+reloading through the message interface:
+
+dmsetup message <mapped device name> <offset> read_kbs <read kbs>
+dmsetup message <mapped device name> <offset> write_kbs <read kbs>
+
+The target provides status information via its status interface:
+
+dmsetup status <mapped device name>
+
+Output includes the target version, the actual read and write kilobytes
+per second limits used, how many read and write ios have been processed,
+deferred and accounted for.
+
+Status can be reset without reloading the mapping table via the message
+interface as well:
+
+dmsetup message <mapped device name> <offset> stats reset
+
+
+Example scripts
+===============
+[[
+#!/bin/sh
+# Create an identity mapping for a device
+# setting 1MB/s read and write throttling
+echo "0 `blockdev --getsize $1` throttle 2 1024 1024 $1 0" | \
+dmsetup create throttle_identity
+]]
+
+[[
+#!/bin/sh
+# Set different throughput to first and second half of a device
+let size=`blockdev --getsize $1`/2
+echo "0 $size throttle 2 10480 8192 $1 0
+$size $size throttle 2 2048 1024 $1 $size" | \
+dmsetup create throttle_segmented
+]]
+
+[[
+#!/bin/sh
+# Change read throughput on 2nd segment of previous segemented mapping
+dmsetup message throttle_segmented $size 1 4096"
+]]
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 4a6feac..9c3cbe0 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -313,6 +313,14 @@ config DM_DELAY
 
 	If unsure, say N.
 
+config DM_THROTTLE
+	tristate "Throttling target (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+
+	A target that supports device throughput throttling
+	with bandwidth selection for reads and writes.
+
 config DM_UEVENT
 	bool "DM uevents (EXPERIMENTAL)"
 	depends on BLK_DEV_DM && EXPERIMENTAL
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index e355e7f..6ea2598 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -37,6 +37,7 @@ obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
+obj-$(CONFIG_DM_THROTTLE)	+= dm-throttle.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
diff --git a/drivers/md/dm-throttle.c b/drivers/md/dm-throttle.c
new file mode 100644
index 0000000..bc000d0
--- /dev/null
+++ b/drivers/md/dm-throttle.c
@@ -0,0 +1,389 @@
+/*
+ * Copyright (C) 2010 Red Hat GmbH
+ *
+ * Module Author: Heinz Mauelshagen <heinzm@redhat.com>
+ *
+ * This file is released under the GPL.
+ *
+ * Test target to stack on top of arbitrary other block
+ * device to throttle io in units of kilobyes per second.
+ *
+ * Throttling is configurable separately for reads and write
+ * via the constructor and the message interfaces.
+ */
+
+#include "dm.h"
+#include <linux/slab.h>
+
+static const char *version = "1.0";
+
+#define	DM_MSG_PREFIX	"dm-throttle"
+#define	TI_ERR_RET(str, ret) \
+	do { ti->error = DM_MSG_PREFIX ": " str; return ret; } while (0);
+#define	TI_ERR(str)	TI_ERR_RET(str, -EINVAL)
+
+/* Statistics for target status output (see throttle_status()). */
+struct stats {
+	atomic_t accounted[2];
+	atomic_t deferred_io[2];
+	atomic_t io[2];
+};
+
+/* Reset statistics variables. */
+static void stats_reset(struct stats *stats)
+{
+	int i = 2;
+
+	while (i--) {
+		atomic_set(&stats->accounted[i], 0);
+		atomic_set(&stats->deferred_io[i], 0);
+		atomic_set(&stats->io[i], 0);
+	}
+}
+
+/* Throttle context. */
+struct throttle_c {
+	/* Device to throttle. */
+	struct {
+		struct dm_dev *dev;
+		sector_t start;
+	} dev;
+
+	/* ctr parameters. */
+	struct params {
+		unsigned bs[2];		/* Bytes per second. */
+		unsigned kbs_ctr[2];	/* To save kb/s constructor args. */
+		unsigned params;	/* # of variable parameters. */
+	} params;
+
+	struct account {
+		/* Accounting for reads and writes. */
+		struct ac_rw {
+			struct mutex mutex;
+
+			unsigned long end_jiffies;
+			unsigned size;
+		} rw[2];
+
+		unsigned long flags;
+	} account;
+
+	struct stats stats;
+};
+
+/* Return bytes/s value for kilobytes/s. */
+static inline unsigned to_bs(unsigned kbs)
+{
+	return kbs << 10;
+}
+
+static inline unsigned to_kbs(unsigned bs)
+{
+	return bs >> 10;
+}
+
+/* Reset account. */
+static void account_reset(int rw, struct throttle_c *tc)
+{
+	struct account *ac = &tc->account;
+	struct ac_rw *ac_rw = ac->rw + rw;
+
+	ac_rw->size = 0;
+	ac_rw->end_jiffies = jiffies + HZ;
+	clear_bit(rw, &ac->flags);
+	smp_wmb();
+}
+
+/* Decide about throttling (ie. deferring bios). */
+static int throttle(struct throttle_c *tc, struct bio *bio)
+{
+	int rw = (bio_data_dir(bio) == WRITE);
+	unsigned bps; /* Bytes per second. */
+
+	smp_rmb();
+	bps = tc->params.bs[rw];
+	if (bps) {
+		unsigned size;
+		struct account *ac = &tc->account;
+		struct ac_rw *ac_rw = ac->rw + rw;
+
+		if (time_after(jiffies, ac_rw->end_jiffies))
+			/* Measure time exceeded. */
+			account_reset(rw, tc);
+		else if (test_bit(rw, &ac->flags))
+			/* In case we're throttled already. */
+			return 1;
+
+		/* Account I/O size. */
+		size = ac_rw->size + bio->bi_size;
+		if (size > bps) {
+			/* Hit kilobytes per second threshold. */
+			set_bit(rw, &ac->flags);
+			return 1;
+		} else {
+			ac_rw->size = size;
+			smp_wmb();
+		}
+
+		atomic_inc(tc->stats.accounted + rw); /* Statistics. */
+	}
+
+	return 0;
+}
+
+/*
+ * Destruct a throttle mapping.
+ */
+static void throttle_dtr(struct dm_target *ti)
+{
+	struct throttle_c *tc = ti->private;
+
+	if (tc->dev.dev)
+		dm_put_device(ti, tc->dev.dev);
+
+	kfree(tc);
+}
+
+/* Check @arg to be >= @min && <= @max. */
+static inline int range_ok(int arg, int min, int max)
+{
+	return !(arg < min || arg > max);
+}
+
+/* Return "write" or "read" string for @write */
+static const char *rw_str(int write)
+{
+	return write ? "write" : "read";
+}
+
+/*
+ * Construct a throttle mapping:
+ *
+ * <start> <len> throttle \
+ * #throttle_params <throttle_params> \
+ * orig_dev_name orig_dev_start
+ *
+ * #throttle_params = 0 - 2
+ * throttle_parms = [read_kbs [write_kbs]]
+ *
+ */
+static int throttle_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	int i, kbs[] = { 0, 0 }, r, throttle_params;
+	unsigned long long tmp;
+	sector_t start;
+	struct throttle_c *tc;
+	struct params *params;
+
+	if (!range_ok(argc, 3, 5))
+		TI_ERR("Invalid argument count");
+
+	/* Get #throttle_params. */
+	if (sscanf(argv[0], "%d", &throttle_params) != 1 ||
+	    !range_ok(throttle_params, 0, 2))
+		TI_ERR("Invalid throttle parameter number argument");
+
+	/* Handle any variable throttle parameters. */
+	for (i = 0; i < throttle_params; i++) {
+		/* Get throttle read/write kilobytes per second. */
+		if (sscanf(argv[i + 1], "%d", kbs + i) != 1 || kbs[i] < 0) {
+			static char msg[60];
+
+			snprintf(msg, sizeof(msg),
+				 "Invalid throttle %s kilobytes per second",
+				 rw_str(i));
+			ti->error = msg;
+			return -EINVAL;
+		}
+	}
+
+	if (sscanf(argv[2 + throttle_params], "%llu", &tmp) != 1)
+		TI_ERR("Invalid throttle device offset");
+
+	start = tmp;
+
+	/* Allocate throttle context. */
+	tc = ti->private = kzalloc(sizeof(*tc), GFP_KERNEL);
+	if (!tc)
+		TI_ERR_RET("Cannot allocate throttle context", -ENOMEM);
+
+	/* Aquire throttle device. */
+	r = dm_get_device(ti, argv[1 + throttle_params],
+			  dm_table_get_mode(ti->table), &tc->dev.dev);
+	if (r) {
+		DMERR("Throttle device lookup failed");
+		goto err;
+	}
+
+	/* Check throttled device length. */
+	if (ti->len >
+	    i_size_read(tc->dev.dev->bdev->bd_inode) >> SECTOR_SHIFT) {
+		DMERR("Throttled device too small for mapping");
+		goto err;
+	}
+
+	tc->dev.start = start;
+	params = &tc->params;
+	params->params = throttle_params;
+
+	i = ARRAY_SIZE(kbs);
+	while (i--) {
+		params->kbs_ctr[i] = kbs[i];
+		params->bs[i] = to_bs(kbs[i]);
+		mutex_init(&tc->account.rw[i].mutex);
+	}
+
+	stats_reset(&tc->stats);
+	return 0;
+err:
+	throttle_dtr(ti);
+	return -EINVAL;
+}
+
+/* Map a throttle io. */
+static int throttle_map(struct dm_target *ti, struct bio *bio,
+			union map_info *map_context)
+{
+	int r, rw = (bio_data_dir(bio) == WRITE);
+	struct throttle_c *tc = ti->private;
+	struct ac_rw *ac_rw = tc->account.rw + rw;
+
+	mutex_lock(&ac_rw->mutex);
+	do {
+		r = throttle(tc, bio);
+		if (r) {
+			long end = ac_rw->end_jiffies, j = jiffies;
+
+			/* Wait till next second when KB/s reached. */
+			if (j < end)
+				schedule_timeout_uninterruptible(end - j);
+		}
+	} while (r);
+
+	mutex_unlock(&ac_rw->mutex);
+
+	/* Remap. */
+	bio->bi_bdev = tc->dev.dev->bdev;
+	bio->bi_sector = bio->bi_sector - ti->begin + tc->dev.start;
+
+	atomic_inc(&tc->stats.io[rw]); /* Statistics */
+	return 1; /* Done with the bio; let dm core submit it. */
+}
+
+/* Message method. */
+static int throttle_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	int kbs, rw;
+	struct throttle_c *tc = ti->private;
+
+	if (argc == 2) {
+		if (!strcmp(argv[0], "stats") &&
+		    !strcmp(argv[1], "reset")) {
+			/* Reset statistics. */
+			stats_reset(&tc->stats);
+			goto out;
+		} else if (!strcmp(argv[0], "read_kbs"))
+			/* Adjust read kilobytes per second. */
+			rw = 0;
+		else if (!strcmp(argv[0], "write_kbs"))
+			/* Adjust write kilobytes per second. */
+			rw = 1;
+		else
+			goto err;
+
+		/* Read r/w kbs paramater. */
+		if (sscanf(argv[1], "%d", &kbs) != 1 || kbs < 0) {
+			DMWARN("Unrecognised throttle %s_kbs parameter.",
+			       rw_str(rw));
+			return -EINVAL;
+		}
+
+		/* Update settings. */
+		mutex_lock(&tc->account.rw[rw].mutex);
+		tc->params.bs[rw] = to_bs(kbs);
+		account_reset(rw, tc);
+		mutex_unlock(&tc->account.rw[rw].mutex);
+out:
+		return 0;
+	}
+err:
+	DMWARN("Unrecognised throttle message received.");
+	return -EINVAL;
+}
+
+/* Status output method. */
+static int throttle_status(struct dm_target *ti, status_type_t type,
+			   char *result, unsigned maxlen)
+{
+	ssize_t sz = 0;
+	struct throttle_c *tc = ti->private;
+	struct stats *s = &tc->stats;
+	struct params *p = &tc->params;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("v=%s rkb=%u wkb=%u r=%u w=%u rd=%u wd=%u "
+		       "acr=%u acw=%u",
+		       version,
+		       to_kbs(p->bs[0]), to_kbs(p->bs[1]),
+		       atomic_read(s->io), atomic_read(s->io + 1),
+		       atomic_read(s->deferred_io),
+		       atomic_read(s->deferred_io + 1),
+		       atomic_read(s->accounted),
+		       atomic_read(s->accounted + 1));
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%u", p->params);
+
+		if (p->params) {
+			DMEMIT(" %u", p->kbs_ctr[0]);
+
+			if (p->params > 1)
+				DMEMIT(" %u", p->kbs_ctr[1]);
+		}
+
+		DMEMIT(" %s %llu",
+		       tc->dev.dev->name,
+		       (unsigned long long) tc->dev.start);
+	}
+
+	return 0;
+}
+
+static struct target_type throttle_target = {
+	.name		= "throttle",
+	.version	= {1, 0, 0},
+	.module		= THIS_MODULE,
+	.ctr		= throttle_ctr,
+	.dtr		= throttle_dtr,
+	.map		= throttle_map,
+	.message	= throttle_message,
+	.status		= throttle_status,
+};
+
+int __init dm_throttle_init(void)
+{
+	int r = dm_register_target(&throttle_target);
+
+	if (r)
+		DMERR("Failed to register %s [%d]", DM_MSG_PREFIX, r);
+	else
+		DMINFO("registered %s %s", DM_MSG_PREFIX, version);
+
+	return r;
+}
+
+void dm_throttle_exit(void)
+{
+	dm_unregister_target(&throttle_target);
+	DMINFO("unregistered %s %s", DM_MSG_PREFIX, version);
+}
+
+/* Module hooks */
+module_init(dm_throttle_init);
+module_exit(dm_throttle_exit);
+
+MODULE_DESCRIPTION(DM_NAME "device-mapper throttle target");
+MODULE_AUTHOR("Heinz Mauelshagen <heinzm@redhat.com>");
+MODULE_LICENSE("GPL");

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH} dm-throttle: new device mapper target to throttle reads and writes
  2010-08-10 13:42 [PATCH} dm-throttle: new device mapper target to throttle reads and writes Heinz Mauelshagen
@ 2010-08-10 14:44 ` Vivek Goyal
  2010-08-12  9:08   ` Heinz Mauelshagen
  2010-08-10 18:41 ` Vivek Goyal
  1 sibling, 1 reply; 10+ messages in thread
From: Vivek Goyal @ 2010-08-10 14:44 UTC (permalink / raw)
  To: heinzm, device-mapper development, Christoph Hellwig, Alasdair G Kergon

On Tue, Aug 10, 2010 at 03:42:22PM +0200, Heinz Mauelshagen wrote:
> 
> This is a new device mapper "throttle" target which allows for
> throttling reads and writes (ie. enforcing throughput limits) in units
> of kilobytes per second.
> 

Hi Heinz,

How about extending this stuff to handle cgroups also. So instead of
having deivice wide throttling policy, we throttle cgroups. That will
be a much more useful thing and will serve well the use case of throttling
virtual machines in cgroup.

Yesterday I had raised the issue of cgroup IO bandwidth throttling at
Linux Storage and Filesystem session. I thought that a device mapper
target will be the easiest thing to because I can make use of lots
of existing infrastructure.

Christoph did not like it because of configuration concerns. He preferred
something in block layer/request queue. It was also hinted that there
were some ideas floating of better integation of device mapper
infrastructure with request queue and this thing should go behind that.
But the problem is I am not sure how long it is going to take before
this new infrastructure becomes a reality and it will not be practical
to wait for that.

There is a possibility that we can put a hook in __make_request function
and first take out all the bios and subject them to bandwidth limitation
and then pass it to lower layers. But that will mean redoing lots of
common infrastructure which has already been done. For example,

- What happens to queue congestion semantics.

	- Request queue already has it based on requests and device mapper
	  seems to have its own congestion functions.

	- If I go for taking the bio out on request queue and hold them
   	  back then I am not sure how to define congestion semantics.
	  To keep congestion semantcs simple, it would make sense to
 	  create a new request queue (with the help of dm target), and
	  use that.

- I have yet to think through it but I think I wil be doing other common
  operations like holding back requests in internal queues, dispatching
  these later with the help of a kernel thread, allowing some to dispatch
  immediately as these come in, Putting processes to sleep and waking
  them later if we are already holding too many bios etc.

To me it sounds that doing it is lot simpler with the help of device
mapper target. Though the not so nice part is the need of configuring
another device mapper target on every block device we want to control.

Christoph, would it make sense to currently go ahead with device mapper
target and later convert that to whenever request queue and device mapper
fusion thing happens. Or, do you have other ideas which I have not been
able to grasp....

Thanks
Vivek  




> I've been using it for a while in testing configurations and think it's
> valuable for many people requiring simulation of low bandwidth
> interconnects or simulating different throughput characteristics on
> distinct address segments of a device (eg. fast outer disk spindles vs.
> slower inner ones).
> 
> Please read Documentation/device-mapper/throttle.txt for how to use it.
> 
> Note: this target can be combined with the "delay" target, which is
> already upstream in order to set io delays in addition to throttling,
> again valuable for long distance transport simulations.
> 
> 
> This target should stay separate rather than merged IMO, because it
> basically serves testing purposes and hence should not complicate any
> production mapping target. A potential merge with the "delay" target is
> subject to discussion.
> 
> 
> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
> 
>  Documentation/device-mapper/throttle.txt |   68 ++++++
>  drivers/md/Kconfig                       |    8 +
>  drivers/md/Makefile                      |    1 +
>  drivers/md/dm-throttle.c                 |  389 ++++++++++++++++++++++++++++++
>  4 files changed, 466 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/device-mapper/throttle.txt b/Documentation/device-mapper/throttle.txt
> new file mode 100644
> index 0000000..9deea6e
> --- /dev/null
> +++ b/Documentation/device-mapper/throttle.txt
> @@ -0,0 +1,68 @@
> +dm-throttle
> +===========
> +
> +Device-Mapper's "throttle" target maps a linear range of the Device-Mapper
> +device onto a linear range of another device providing the option to throttle
> +read and write ios seperately.
> +
> +This target provides the ability to simulate low bandwidth transports to
> +devices or different throughput to seperate address segements of a device.
> +
> +Parameters: <#variable params> <read kbs> <write kbs> <dev path> <offset>
> +    <#variable params> number of variable paramaters to set read and
> +		       write throttling kilobytes per second limits.
> +		       Range: 0 - 2 with
> +		       0 = no throttling,
> +		       1 and <read kbs> = read throttling only and
> +		       2 and <read kbs> <write kbs> = read and write throttling.
> +    <read kbs> read kilobatyes per second limit
> +    <write kbs> write kilobatyes per second limit
> +    <dev path>: Full pathname to the underlying block-device, or a
> +                "major:minor" device-number.
> +    <offset>: Starting sector within the device.
> +
> +Throttling read and write values can be adjusted through the constructor
> +by reloading a mapping table with the respective parameters or without
> +reloading through the message interface:
> +
> +dmsetup message <mapped device name> <offset> read_kbs <read kbs>
> +dmsetup message <mapped device name> <offset> write_kbs <read kbs>
> +
> +The target provides status information via its status interface:
> +
> +dmsetup status <mapped device name>
> +
> +Output includes the target version, the actual read and write kilobytes
> +per second limits used, how many read and write ios have been processed,
> +deferred and accounted for.
> +
> +Status can be reset without reloading the mapping table via the message
> +interface as well:
> +
> +dmsetup message <mapped device name> <offset> stats reset
> +
> +
> +Example scripts
> +===============
> +[[
> +#!/bin/sh
> +# Create an identity mapping for a device
> +# setting 1MB/s read and write throttling
> +echo "0 `blockdev --getsize $1` throttle 2 1024 1024 $1 0" | \
> +dmsetup create throttle_identity
> +]]
> +
> +[[
> +#!/bin/sh
> +# Set different throughput to first and second half of a device
> +let size=`blockdev --getsize $1`/2
> +echo "0 $size throttle 2 10480 8192 $1 0
> +$size $size throttle 2 2048 1024 $1 $size" | \
> +dmsetup create throttle_segmented
> +]]
> +
> +[[
> +#!/bin/sh
> +# Change read throughput on 2nd segment of previous segemented mapping
> +dmsetup message throttle_segmented $size 1 4096"
> +]]
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 4a6feac..9c3cbe0 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -313,6 +313,14 @@ config DM_DELAY
>  
>  	If unsure, say N.
>  
> +config DM_THROTTLE
> +	tristate "Throttling target (EXPERIMENTAL)"
> +	depends on BLK_DEV_DM && EXPERIMENTAL
> +	---help---
> +
> +	A target that supports device throughput throttling
> +	with bandwidth selection for reads and writes.
> +
>  config DM_UEVENT
>  	bool "DM uevents (EXPERIMENTAL)"
>  	depends on BLK_DEV_DM && EXPERIMENTAL
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index e355e7f..6ea2598 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -37,6 +37,7 @@ obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
>  obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
>  obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
>  obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
> +obj-$(CONFIG_DM_THROTTLE)	+= dm-throttle.o
>  obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
>  obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
>  obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
> diff --git a/drivers/md/dm-throttle.c b/drivers/md/dm-throttle.c
> new file mode 100644
> index 0000000..bc000d0
> --- /dev/null
> +++ b/drivers/md/dm-throttle.c
> @@ -0,0 +1,389 @@
> +/*
> + * Copyright (C) 2010 Red Hat GmbH
> + *
> + * Module Author: Heinz Mauelshagen <heinzm@redhat.com>
> + *
> + * This file is released under the GPL.
> + *
> + * Test target to stack on top of arbitrary other block
> + * device to throttle io in units of kilobyes per second.
> + *
> + * Throttling is configurable separately for reads and write
> + * via the constructor and the message interfaces.
> + */
> +
> +#include "dm.h"
> +#include <linux/slab.h>
> +
> +static const char *version = "1.0";
> +
> +#define	DM_MSG_PREFIX	"dm-throttle"
> +#define	TI_ERR_RET(str, ret) \
> +	do { ti->error = DM_MSG_PREFIX ": " str; return ret; } while (0);
> +#define	TI_ERR(str)	TI_ERR_RET(str, -EINVAL)
> +
> +/* Statistics for target status output (see throttle_status()). */
> +struct stats {
> +	atomic_t accounted[2];
> +	atomic_t deferred_io[2];
> +	atomic_t io[2];
> +};
> +
> +/* Reset statistics variables. */
> +static void stats_reset(struct stats *stats)
> +{
> +	int i = 2;
> +
> +	while (i--) {
> +		atomic_set(&stats->accounted[i], 0);
> +		atomic_set(&stats->deferred_io[i], 0);
> +		atomic_set(&stats->io[i], 0);
> +	}
> +}
> +
> +/* Throttle context. */
> +struct throttle_c {
> +	/* Device to throttle. */
> +	struct {
> +		struct dm_dev *dev;
> +		sector_t start;
> +	} dev;
> +
> +	/* ctr parameters. */
> +	struct params {
> +		unsigned bs[2];		/* Bytes per second. */
> +		unsigned kbs_ctr[2];	/* To save kb/s constructor args. */
> +		unsigned params;	/* # of variable parameters. */
> +	} params;
> +
> +	struct account {
> +		/* Accounting for reads and writes. */
> +		struct ac_rw {
> +			struct mutex mutex;
> +
> +			unsigned long end_jiffies;
> +			unsigned size;
> +		} rw[2];
> +
> +		unsigned long flags;
> +	} account;
> +
> +	struct stats stats;
> +};
> +
> +/* Return bytes/s value for kilobytes/s. */
> +static inline unsigned to_bs(unsigned kbs)
> +{
> +	return kbs << 10;
> +}
> +
> +static inline unsigned to_kbs(unsigned bs)
> +{
> +	return bs >> 10;
> +}
> +
> +/* Reset account. */
> +static void account_reset(int rw, struct throttle_c *tc)
> +{
> +	struct account *ac = &tc->account;
> +	struct ac_rw *ac_rw = ac->rw + rw;
> +
> +	ac_rw->size = 0;
> +	ac_rw->end_jiffies = jiffies + HZ;
> +	clear_bit(rw, &ac->flags);
> +	smp_wmb();
> +}
> +
> +/* Decide about throttling (ie. deferring bios). */
> +static int throttle(struct throttle_c *tc, struct bio *bio)
> +{
> +	int rw = (bio_data_dir(bio) == WRITE);
> +	unsigned bps; /* Bytes per second. */
> +
> +	smp_rmb();
> +	bps = tc->params.bs[rw];
> +	if (bps) {
> +		unsigned size;
> +		struct account *ac = &tc->account;
> +		struct ac_rw *ac_rw = ac->rw + rw;
> +
> +		if (time_after(jiffies, ac_rw->end_jiffies))
> +			/* Measure time exceeded. */
> +			account_reset(rw, tc);
> +		else if (test_bit(rw, &ac->flags))
> +			/* In case we're throttled already. */
> +			return 1;
> +
> +		/* Account I/O size. */
> +		size = ac_rw->size + bio->bi_size;
> +		if (size > bps) {
> +			/* Hit kilobytes per second threshold. */
> +			set_bit(rw, &ac->flags);
> +			return 1;
> +		} else {
> +			ac_rw->size = size;
> +			smp_wmb();
> +		}
> +
> +		atomic_inc(tc->stats.accounted + rw); /* Statistics. */
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Destruct a throttle mapping.
> + */
> +static void throttle_dtr(struct dm_target *ti)
> +{
> +	struct throttle_c *tc = ti->private;
> +
> +	if (tc->dev.dev)
> +		dm_put_device(ti, tc->dev.dev);
> +
> +	kfree(tc);
> +}
> +
> +/* Check @arg to be >= @min && <= @max. */
> +static inline int range_ok(int arg, int min, int max)
> +{
> +	return !(arg < min || arg > max);
> +}
> +
> +/* Return "write" or "read" string for @write */
> +static const char *rw_str(int write)
> +{
> +	return write ? "write" : "read";
> +}
> +
> +/*
> + * Construct a throttle mapping:
> + *
> + * <start> <len> throttle \
> + * #throttle_params <throttle_params> \
> + * orig_dev_name orig_dev_start
> + *
> + * #throttle_params = 0 - 2
> + * throttle_parms = [read_kbs [write_kbs]]
> + *
> + */
> +static int throttle_ctr(struct dm_target *ti, unsigned argc, char **argv)
> +{
> +	int i, kbs[] = { 0, 0 }, r, throttle_params;
> +	unsigned long long tmp;
> +	sector_t start;
> +	struct throttle_c *tc;
> +	struct params *params;
> +
> +	if (!range_ok(argc, 3, 5))
> +		TI_ERR("Invalid argument count");
> +
> +	/* Get #throttle_params. */
> +	if (sscanf(argv[0], "%d", &throttle_params) != 1 ||
> +	    !range_ok(throttle_params, 0, 2))
> +		TI_ERR("Invalid throttle parameter number argument");
> +
> +	/* Handle any variable throttle parameters. */
> +	for (i = 0; i < throttle_params; i++) {
> +		/* Get throttle read/write kilobytes per second. */
> +		if (sscanf(argv[i + 1], "%d", kbs + i) != 1 || kbs[i] < 0) {
> +			static char msg[60];
> +
> +			snprintf(msg, sizeof(msg),
> +				 "Invalid throttle %s kilobytes per second",
> +				 rw_str(i));
> +			ti->error = msg;
> +			return -EINVAL;
> +		}
> +	}
> +
> +	if (sscanf(argv[2 + throttle_params], "%llu", &tmp) != 1)
> +		TI_ERR("Invalid throttle device offset");
> +
> +	start = tmp;
> +
> +	/* Allocate throttle context. */
> +	tc = ti->private = kzalloc(sizeof(*tc), GFP_KERNEL);
> +	if (!tc)
> +		TI_ERR_RET("Cannot allocate throttle context", -ENOMEM);
> +
> +	/* Aquire throttle device. */
> +	r = dm_get_device(ti, argv[1 + throttle_params],
> +			  dm_table_get_mode(ti->table), &tc->dev.dev);
> +	if (r) {
> +		DMERR("Throttle device lookup failed");
> +		goto err;
> +	}
> +
> +	/* Check throttled device length. */
> +	if (ti->len >
> +	    i_size_read(tc->dev.dev->bdev->bd_inode) >> SECTOR_SHIFT) {
> +		DMERR("Throttled device too small for mapping");
> +		goto err;
> +	}
> +
> +	tc->dev.start = start;
> +	params = &tc->params;
> +	params->params = throttle_params;
> +
> +	i = ARRAY_SIZE(kbs);
> +	while (i--) {
> +		params->kbs_ctr[i] = kbs[i];
> +		params->bs[i] = to_bs(kbs[i]);
> +		mutex_init(&tc->account.rw[i].mutex);
> +	}
> +
> +	stats_reset(&tc->stats);
> +	return 0;
> +err:
> +	throttle_dtr(ti);
> +	return -EINVAL;
> +}
> +
> +/* Map a throttle io. */
> +static int throttle_map(struct dm_target *ti, struct bio *bio,
> +			union map_info *map_context)
> +{
> +	int r, rw = (bio_data_dir(bio) == WRITE);
> +	struct throttle_c *tc = ti->private;
> +	struct ac_rw *ac_rw = tc->account.rw + rw;
> +
> +	mutex_lock(&ac_rw->mutex);
> +	do {
> +		r = throttle(tc, bio);
> +		if (r) {
> +			long end = ac_rw->end_jiffies, j = jiffies;
> +
> +			/* Wait till next second when KB/s reached. */
> +			if (j < end)
> +				schedule_timeout_uninterruptible(end - j);
> +		}
> +	} while (r);
> +
> +	mutex_unlock(&ac_rw->mutex);
> +
> +	/* Remap. */
> +	bio->bi_bdev = tc->dev.dev->bdev;
> +	bio->bi_sector = bio->bi_sector - ti->begin + tc->dev.start;
> +
> +	atomic_inc(&tc->stats.io[rw]); /* Statistics */
> +	return 1; /* Done with the bio; let dm core submit it. */
> +}
> +
> +/* Message method. */
> +static int throttle_message(struct dm_target *ti, unsigned argc, char **argv)
> +{
> +	int kbs, rw;
> +	struct throttle_c *tc = ti->private;
> +
> +	if (argc == 2) {
> +		if (!strcmp(argv[0], "stats") &&
> +		    !strcmp(argv[1], "reset")) {
> +			/* Reset statistics. */
> +			stats_reset(&tc->stats);
> +			goto out;
> +		} else if (!strcmp(argv[0], "read_kbs"))
> +			/* Adjust read kilobytes per second. */
> +			rw = 0;
> +		else if (!strcmp(argv[0], "write_kbs"))
> +			/* Adjust write kilobytes per second. */
> +			rw = 1;
> +		else
> +			goto err;
> +
> +		/* Read r/w kbs paramater. */
> +		if (sscanf(argv[1], "%d", &kbs) != 1 || kbs < 0) {
> +			DMWARN("Unrecognised throttle %s_kbs parameter.",
> +			       rw_str(rw));
> +			return -EINVAL;
> +		}
> +
> +		/* Update settings. */
> +		mutex_lock(&tc->account.rw[rw].mutex);
> +		tc->params.bs[rw] = to_bs(kbs);
> +		account_reset(rw, tc);
> +		mutex_unlock(&tc->account.rw[rw].mutex);
> +out:
> +		return 0;
> +	}
> +err:
> +	DMWARN("Unrecognised throttle message received.");
> +	return -EINVAL;
> +}
> +
> +/* Status output method. */
> +static int throttle_status(struct dm_target *ti, status_type_t type,
> +			   char *result, unsigned maxlen)
> +{
> +	ssize_t sz = 0;
> +	struct throttle_c *tc = ti->private;
> +	struct stats *s = &tc->stats;
> +	struct params *p = &tc->params;
> +
> +	switch (type) {
> +	case STATUSTYPE_INFO:
> +		DMEMIT("v=%s rkb=%u wkb=%u r=%u w=%u rd=%u wd=%u "
> +		       "acr=%u acw=%u",
> +		       version,
> +		       to_kbs(p->bs[0]), to_kbs(p->bs[1]),
> +		       atomic_read(s->io), atomic_read(s->io + 1),
> +		       atomic_read(s->deferred_io),
> +		       atomic_read(s->deferred_io + 1),
> +		       atomic_read(s->accounted),
> +		       atomic_read(s->accounted + 1));
> +		break;
> +
> +	case STATUSTYPE_TABLE:
> +		DMEMIT("%u", p->params);
> +
> +		if (p->params) {
> +			DMEMIT(" %u", p->kbs_ctr[0]);
> +
> +			if (p->params > 1)
> +				DMEMIT(" %u", p->kbs_ctr[1]);
> +		}
> +
> +		DMEMIT(" %s %llu",
> +		       tc->dev.dev->name,
> +		       (unsigned long long) tc->dev.start);
> +	}
> +
> +	return 0;
> +}
> +
> +static struct target_type throttle_target = {
> +	.name		= "throttle",
> +	.version	= {1, 0, 0},
> +	.module		= THIS_MODULE,
> +	.ctr		= throttle_ctr,
> +	.dtr		= throttle_dtr,
> +	.map		= throttle_map,
> +	.message	= throttle_message,
> +	.status		= throttle_status,
> +};
> +
> +int __init dm_throttle_init(void)
> +{
> +	int r = dm_register_target(&throttle_target);
> +
> +	if (r)
> +		DMERR("Failed to register %s [%d]", DM_MSG_PREFIX, r);
> +	else
> +		DMINFO("registered %s %s", DM_MSG_PREFIX, version);
> +
> +	return r;
> +}
> +
> +void dm_throttle_exit(void)
> +{
> +	dm_unregister_target(&throttle_target);
> +	DMINFO("unregistered %s %s", DM_MSG_PREFIX, version);
> +}
> +
> +/* Module hooks */
> +module_init(dm_throttle_init);
> +module_exit(dm_throttle_exit);
> +
> +MODULE_DESCRIPTION(DM_NAME "device-mapper throttle target");
> +MODULE_AUTHOR("Heinz Mauelshagen <heinzm@redhat.com>");
> +MODULE_LICENSE("GPL");
> 
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH} dm-throttle: new device mapper target to throttle reads and writes
  2010-08-10 13:42 [PATCH} dm-throttle: new device mapper target to throttle reads and writes Heinz Mauelshagen
  2010-08-10 14:44 ` Vivek Goyal
@ 2010-08-10 18:41 ` Vivek Goyal
  1 sibling, 0 replies; 10+ messages in thread
From: Vivek Goyal @ 2010-08-10 18:41 UTC (permalink / raw)
  To: Heinz Mauelshagen; +Cc: dm-devel

On Tue, Aug 10, 2010 at 03:42:22PM +0200, Heinz Mauelshagen wrote:

[..]
> +/* Decide about throttling (ie. deferring bios). */
> +static int throttle(struct throttle_c *tc, struct bio *bio)
> +{
> +	int rw = (bio_data_dir(bio) == WRITE);
> +	unsigned bps; /* Bytes per second. */
> +
> +	smp_rmb();
> +	bps = tc->params.bs[rw];
> +	if (bps) {
> +		unsigned size;
> +		struct account *ac = &tc->account;
> +		struct ac_rw *ac_rw = ac->rw + rw;
> +
> +		if (time_after(jiffies, ac_rw->end_jiffies))
> +			/* Measure time exceeded. */
> +			account_reset(rw, tc);
> +		else if (test_bit(rw, &ac->flags))
> +			/* In case we're throttled already. */
> +			return 1;
> +
> +		/* Account I/O size. */
> +		size = ac_rw->size + bio->bi_size;
> +		if (size > bps) {
> +			/* Hit kilobytes per second threshold. */
> +			set_bit(rw, &ac->flags);
> +			return 1;

If bio->bi_size is greate than bps, will I always keep on throttling
and hang?

[..]
> +/* Map a throttle io. */
> +static int throttle_map(struct dm_target *ti, struct bio *bio,
> +			union map_info *map_context)
> +{
> +	int r, rw = (bio_data_dir(bio) == WRITE);
> +	struct throttle_c *tc = ti->private;
> +	struct ac_rw *ac_rw = tc->account.rw + rw;
> +
> +	mutex_lock(&ac_rw->mutex);
> +	do {
> +		r = throttle(tc, bio);
> +		if (r) {
> +			long end = ac_rw->end_jiffies, j = jiffies;
> +
> +			/* Wait till next second when KB/s reached. */
> +			if (j < end)
> +				schedule_timeout_uninterruptible(end - j);
> +		}

So a thread is blocked if it crossed the IO rate. There is no such mechanism
to take the bio, statsh away somewhere and dispatch it to disk later. The
way request queue descriptors work. Processes are blocked only if queue
is congested otherwise one allows processes to submit requests and go
back and do other work.

I am assuming that this will be bad for AIO.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH} dm-throttle: new device mapper target to throttle reads and writes
  2010-08-10 14:44 ` Vivek Goyal
@ 2010-08-12  9:08   ` Heinz Mauelshagen
  2010-08-12 16:46     ` Vivek Goyal
  0 siblings, 1 reply; 10+ messages in thread
From: Heinz Mauelshagen @ 2010-08-12  9:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, device-mapper development, Alasdair G Kergon

On Tue, 2010-08-10 at 10:44 -0400, Vivek Goyal wrote:
> On Tue, Aug 10, 2010 at 03:42:22PM +0200, Heinz Mauelshagen wrote:
> > 
> > This is a new device mapper "throttle" target which allows for
> > throttling reads and writes (ie. enforcing throughput limits) in units
> > of kilobytes per second.
> > 
> 
> Hi Heinz,
> 
> How about extending this stuff to handle cgroups also. So instead of
> having deivice wide throttling policy, we throttle cgroups. That will
> be a much more useful thing and will serve well the use case of throttling
> virtual machines in cgroup.


Hi Vivek,

needs a serious design discussion but I think we could leverage it to
allow for throttling of cgroups.

> 
> Yesterday I had raised the issue of cgroup IO bandwidth throttling at
> Linux Storage and Filesystem session. I thought that a device mapper
> target will be the easiest thing to because I can make use of lots
> of existing infrastructure.
> 
> Christoph did not like it because of configuration concerns. He preferred
> something in block layer/request queue. It was also hinted that there
> were some ideas floating of better integation of device mapper
> infrastructure with request queue and this thing should go behind that.

Right, if a block layer change of that kind will be pending, we should
wait for it to settle.

> But the problem is I am not sure how long it is going to take before
> this new infrastructure becomes a reality and it will not be practical
> to wait for that.

Did any reliable plans come out of the discussion or will there be any
in the near future?

> 
> There is a possibility that we can put a hook in __make_request function
> and first take out all the bios and subject them to bandwidth limitation
> and then pass it to lower layers. But that will mean redoing lots of
> common infrastructure which has already been done. For example,
> 
> - What happens to queue congestion semantics.
> 
> 	- Request queue already has it based on requests and device mapper
> 	  seems to have its own congestion functions.

Yes, dm does.

> 
> 	- If I go for taking the bio out on request queue and hold them
>    	  back then I am not sure how to define congestion semantics.
> 	  To keep congestion semantcs simple, it would make sense to
>  	  create a new request queue (with the help of dm target), and
> 	  use that.

Yes, that's an obvious approach to stay with the same congestion
semantics.

> 
> - I have yet to think through it but I think I wil be doing other common
>   operations like holding back requests in internal queues, dispatching
>   these later with the help of a kernel thread, allowing some to dispatch
>   immediately as these come in, Putting processes to sleep and waking
>   them later if we are already holding too many bios etc.
> 
> To me it sounds that doing it is lot simpler with the help of device
> mapper target. Though the not so nice part is the need of configuring
> another device mapper target on every block device we want to control.

Yes, we'd need identity mappings in the stack to be prepared.

Or we need some __generic_make_request() hack ala bcache to hijack the
request function on the fly.

> 
> Christoph, would it make sense to currently go ahead with device mapper
> target and later convert that to whenever request queue and device mapper
> fusion thing happens. Or, do you have other ideas which I have not been
> able to grasp....

Let's see what he wants to fill in.

Cheers,
Heinz

> 
> Thanks
> Vivek  
> 
> 
> 
> 
> > I've been using it for a while in testing configurations and think it's
> > valuable for many people requiring simulation of low bandwidth
> > interconnects or simulating different throughput characteristics on
> > distinct address segments of a device (eg. fast outer disk spindles vs.
> > slower inner ones).
> > 
> > Please read Documentation/device-mapper/throttle.txt for how to use it.
> > 
> > Note: this target can be combined with the "delay" target, which is
> > already upstream in order to set io delays in addition to throttling,
> > again valuable for long distance transport simulations.
> > 
> > 
> > This target should stay separate rather than merged IMO, because it
> > basically serves testing purposes and hence should not complicate any
> > production mapping target. A potential merge with the "delay" target is
> > subject to discussion.
> > 
> > 
> > Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
> > 
> >  Documentation/device-mapper/throttle.txt |   68 ++++++
> >  drivers/md/Kconfig                       |    8 +
> >  drivers/md/Makefile                      |    1 +
> >  drivers/md/dm-throttle.c                 |  389 ++++++++++++++++++++++++++++++
> >  4 files changed, 466 insertions(+), 0 deletions(-)
> > 
> > diff --git a/Documentation/device-mapper/throttle.txt b/Documentation/device-mapper/throttle.txt
> > new file mode 100644
> > index 0000000..9deea6e
> > --- /dev/null
> > +++ b/Documentation/device-mapper/throttle.txt
> > @@ -0,0 +1,68 @@
> > +dm-throttle
> > +===========
> > +
> > +Device-Mapper's "throttle" target maps a linear range of the Device-Mapper
> > +device onto a linear range of another device providing the option to throttle
> > +read and write ios seperately.
> > +
> > +This target provides the ability to simulate low bandwidth transports to
> > +devices or different throughput to seperate address segements of a device.
> > +
> > +Parameters: <#variable params> <read kbs> <write kbs> <dev path> <offset>
> > +    <#variable params> number of variable paramaters to set read and
> > +		       write throttling kilobytes per second limits.
> > +		       Range: 0 - 2 with
> > +		       0 = no throttling,
> > +		       1 and <read kbs> = read throttling only and
> > +		       2 and <read kbs> <write kbs> = read and write throttling.
> > +    <read kbs> read kilobatyes per second limit
> > +    <write kbs> write kilobatyes per second limit
> > +    <dev path>: Full pathname to the underlying block-device, or a
> > +                "major:minor" device-number.
> > +    <offset>: Starting sector within the device.
> > +
> > +Throttling read and write values can be adjusted through the constructor
> > +by reloading a mapping table with the respective parameters or without
> > +reloading through the message interface:
> > +
> > +dmsetup message <mapped device name> <offset> read_kbs <read kbs>
> > +dmsetup message <mapped device name> <offset> write_kbs <read kbs>
> > +
> > +The target provides status information via its status interface:
> > +
> > +dmsetup status <mapped device name>
> > +
> > +Output includes the target version, the actual read and write kilobytes
> > +per second limits used, how many read and write ios have been processed,
> > +deferred and accounted for.
> > +
> > +Status can be reset without reloading the mapping table via the message
> > +interface as well:
> > +
> > +dmsetup message <mapped device name> <offset> stats reset
> > +
> > +
> > +Example scripts
> > +===============
> > +[[
> > +#!/bin/sh
> > +# Create an identity mapping for a device
> > +# setting 1MB/s read and write throttling
> > +echo "0 `blockdev --getsize $1` throttle 2 1024 1024 $1 0" | \
> > +dmsetup create throttle_identity
> > +]]
> > +
> > +[[
> > +#!/bin/sh
> > +# Set different throughput to first and second half of a device
> > +let size=`blockdev --getsize $1`/2
> > +echo "0 $size throttle 2 10480 8192 $1 0
> > +$size $size throttle 2 2048 1024 $1 $size" | \
> > +dmsetup create throttle_segmented
> > +]]
> > +
> > +[[
> > +#!/bin/sh
> > +# Change read throughput on 2nd segment of previous segemented mapping
> > +dmsetup message throttle_segmented $size 1 4096"
> > +]]
> > diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> > index 4a6feac..9c3cbe0 100644
> > --- a/drivers/md/Kconfig
> > +++ b/drivers/md/Kconfig
> > @@ -313,6 +313,14 @@ config DM_DELAY
> >  
> >  	If unsure, say N.
> >  
> > +config DM_THROTTLE
> > +	tristate "Throttling target (EXPERIMENTAL)"
> > +	depends on BLK_DEV_DM && EXPERIMENTAL
> > +	---help---
> > +
> > +	A target that supports device throughput throttling
> > +	with bandwidth selection for reads and writes.
> > +
> >  config DM_UEVENT
> >  	bool "DM uevents (EXPERIMENTAL)"
> >  	depends on BLK_DEV_DM && EXPERIMENTAL
> > diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> > index e355e7f..6ea2598 100644
> > --- a/drivers/md/Makefile
> > +++ b/drivers/md/Makefile
> > @@ -37,6 +37,7 @@ obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
> >  obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
> >  obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
> >  obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
> > +obj-$(CONFIG_DM_THROTTLE)	+= dm-throttle.o
> >  obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
> >  obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
> >  obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
> > diff --git a/drivers/md/dm-throttle.c b/drivers/md/dm-throttle.c
> > new file mode 100644
> > index 0000000..bc000d0
> > --- /dev/null
> > +++ b/drivers/md/dm-throttle.c
> > @@ -0,0 +1,389 @@
> > +/*
> > + * Copyright (C) 2010 Red Hat GmbH
> > + *
> > + * Module Author: Heinz Mauelshagen <heinzm@redhat.com>
> > + *
> > + * This file is released under the GPL.
> > + *
> > + * Test target to stack on top of arbitrary other block
> > + * device to throttle io in units of kilobyes per second.
> > + *
> > + * Throttling is configurable separately for reads and write
> > + * via the constructor and the message interfaces.
> > + */
> > +
> > +#include "dm.h"
> > +#include <linux/slab.h>
> > +
> > +static const char *version = "1.0";
> > +
> > +#define	DM_MSG_PREFIX	"dm-throttle"
> > +#define	TI_ERR_RET(str, ret) \
> > +	do { ti->error = DM_MSG_PREFIX ": " str; return ret; } while (0);
> > +#define	TI_ERR(str)	TI_ERR_RET(str, -EINVAL)
> > +
> > +/* Statistics for target status output (see throttle_status()). */
> > +struct stats {
> > +	atomic_t accounted[2];
> > +	atomic_t deferred_io[2];
> > +	atomic_t io[2];
> > +};
> > +
> > +/* Reset statistics variables. */
> > +static void stats_reset(struct stats *stats)
> > +{
> > +	int i = 2;
> > +
> > +	while (i--) {
> > +		atomic_set(&stats->accounted[i], 0);
> > +		atomic_set(&stats->deferred_io[i], 0);
> > +		atomic_set(&stats->io[i], 0);
> > +	}
> > +}
> > +
> > +/* Throttle context. */
> > +struct throttle_c {
> > +	/* Device to throttle. */
> > +	struct {
> > +		struct dm_dev *dev;
> > +		sector_t start;
> > +	} dev;
> > +
> > +	/* ctr parameters. */
> > +	struct params {
> > +		unsigned bs[2];		/* Bytes per second. */
> > +		unsigned kbs_ctr[2];	/* To save kb/s constructor args. */
> > +		unsigned params;	/* # of variable parameters. */
> > +	} params;
> > +
> > +	struct account {
> > +		/* Accounting for reads and writes. */
> > +		struct ac_rw {
> > +			struct mutex mutex;
> > +
> > +			unsigned long end_jiffies;
> > +			unsigned size;
> > +		} rw[2];
> > +
> > +		unsigned long flags;
> > +	} account;
> > +
> > +	struct stats stats;
> > +};
> > +
> > +/* Return bytes/s value for kilobytes/s. */
> > +static inline unsigned to_bs(unsigned kbs)
> > +{
> > +	return kbs << 10;
> > +}
> > +
> > +static inline unsigned to_kbs(unsigned bs)
> > +{
> > +	return bs >> 10;
> > +}
> > +
> > +/* Reset account. */
> > +static void account_reset(int rw, struct throttle_c *tc)
> > +{
> > +	struct account *ac = &tc->account;
> > +	struct ac_rw *ac_rw = ac->rw + rw;
> > +
> > +	ac_rw->size = 0;
> > +	ac_rw->end_jiffies = jiffies + HZ;
> > +	clear_bit(rw, &ac->flags);
> > +	smp_wmb();
> > +}
> > +
> > +/* Decide about throttling (ie. deferring bios). */
> > +static int throttle(struct throttle_c *tc, struct bio *bio)
> > +{
> > +	int rw = (bio_data_dir(bio) == WRITE);
> > +	unsigned bps; /* Bytes per second. */
> > +
> > +	smp_rmb();
> > +	bps = tc->params.bs[rw];
> > +	if (bps) {
> > +		unsigned size;
> > +		struct account *ac = &tc->account;
> > +		struct ac_rw *ac_rw = ac->rw + rw;
> > +
> > +		if (time_after(jiffies, ac_rw->end_jiffies))
> > +			/* Measure time exceeded. */
> > +			account_reset(rw, tc);
> > +		else if (test_bit(rw, &ac->flags))
> > +			/* In case we're throttled already. */
> > +			return 1;
> > +
> > +		/* Account I/O size. */
> > +		size = ac_rw->size + bio->bi_size;
> > +		if (size > bps) {
> > +			/* Hit kilobytes per second threshold. */
> > +			set_bit(rw, &ac->flags);
> > +			return 1;
> > +		} else {
> > +			ac_rw->size = size;
> > +			smp_wmb();
> > +		}
> > +
> > +		atomic_inc(tc->stats.accounted + rw); /* Statistics. */
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Destruct a throttle mapping.
> > + */
> > +static void throttle_dtr(struct dm_target *ti)
> > +{
> > +	struct throttle_c *tc = ti->private;
> > +
> > +	if (tc->dev.dev)
> > +		dm_put_device(ti, tc->dev.dev);
> > +
> > +	kfree(tc);
> > +}
> > +
> > +/* Check @arg to be >= @min && <= @max. */
> > +static inline int range_ok(int arg, int min, int max)
> > +{
> > +	return !(arg < min || arg > max);
> > +}
> > +
> > +/* Return "write" or "read" string for @write */
> > +static const char *rw_str(int write)
> > +{
> > +	return write ? "write" : "read";
> > +}
> > +
> > +/*
> > + * Construct a throttle mapping:
> > + *
> > + * <start> <len> throttle \
> > + * #throttle_params <throttle_params> \
> > + * orig_dev_name orig_dev_start
> > + *
> > + * #throttle_params = 0 - 2
> > + * throttle_parms = [read_kbs [write_kbs]]
> > + *
> > + */
> > +static int throttle_ctr(struct dm_target *ti, unsigned argc, char **argv)
> > +{
> > +	int i, kbs[] = { 0, 0 }, r, throttle_params;
> > +	unsigned long long tmp;
> > +	sector_t start;
> > +	struct throttle_c *tc;
> > +	struct params *params;
> > +
> > +	if (!range_ok(argc, 3, 5))
> > +		TI_ERR("Invalid argument count");
> > +
> > +	/* Get #throttle_params. */
> > +	if (sscanf(argv[0], "%d", &throttle_params) != 1 ||
> > +	    !range_ok(throttle_params, 0, 2))
> > +		TI_ERR("Invalid throttle parameter number argument");
> > +
> > +	/* Handle any variable throttle parameters. */
> > +	for (i = 0; i < throttle_params; i++) {
> > +		/* Get throttle read/write kilobytes per second. */
> > +		if (sscanf(argv[i + 1], "%d", kbs + i) != 1 || kbs[i] < 0) {
> > +			static char msg[60];
> > +
> > +			snprintf(msg, sizeof(msg),
> > +				 "Invalid throttle %s kilobytes per second",
> > +				 rw_str(i));
> > +			ti->error = msg;
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	if (sscanf(argv[2 + throttle_params], "%llu", &tmp) != 1)
> > +		TI_ERR("Invalid throttle device offset");
> > +
> > +	start = tmp;
> > +
> > +	/* Allocate throttle context. */
> > +	tc = ti->private = kzalloc(sizeof(*tc), GFP_KERNEL);
> > +	if (!tc)
> > +		TI_ERR_RET("Cannot allocate throttle context", -ENOMEM);
> > +
> > +	/* Aquire throttle device. */
> > +	r = dm_get_device(ti, argv[1 + throttle_params],
> > +			  dm_table_get_mode(ti->table), &tc->dev.dev);
> > +	if (r) {
> > +		DMERR("Throttle device lookup failed");
> > +		goto err;
> > +	}
> > +
> > +	/* Check throttled device length. */
> > +	if (ti->len >
> > +	    i_size_read(tc->dev.dev->bdev->bd_inode) >> SECTOR_SHIFT) {
> > +		DMERR("Throttled device too small for mapping");
> > +		goto err;
> > +	}
> > +
> > +	tc->dev.start = start;
> > +	params = &tc->params;
> > +	params->params = throttle_params;
> > +
> > +	i = ARRAY_SIZE(kbs);
> > +	while (i--) {
> > +		params->kbs_ctr[i] = kbs[i];
> > +		params->bs[i] = to_bs(kbs[i]);
> > +		mutex_init(&tc->account.rw[i].mutex);
> > +	}
> > +
> > +	stats_reset(&tc->stats);
> > +	return 0;
> > +err:
> > +	throttle_dtr(ti);
> > +	return -EINVAL;
> > +}
> > +
> > +/* Map a throttle io. */
> > +static int throttle_map(struct dm_target *ti, struct bio *bio,
> > +			union map_info *map_context)
> > +{
> > +	int r, rw = (bio_data_dir(bio) == WRITE);
> > +	struct throttle_c *tc = ti->private;
> > +	struct ac_rw *ac_rw = tc->account.rw + rw;
> > +
> > +	mutex_lock(&ac_rw->mutex);
> > +	do {
> > +		r = throttle(tc, bio);
> > +		if (r) {
> > +			long end = ac_rw->end_jiffies, j = jiffies;
> > +
> > +			/* Wait till next second when KB/s reached. */
> > +			if (j < end)
> > +				schedule_timeout_uninterruptible(end - j);
> > +		}
> > +	} while (r);
> > +
> > +	mutex_unlock(&ac_rw->mutex);
> > +
> > +	/* Remap. */
> > +	bio->bi_bdev = tc->dev.dev->bdev;
> > +	bio->bi_sector = bio->bi_sector - ti->begin + tc->dev.start;
> > +
> > +	atomic_inc(&tc->stats.io[rw]); /* Statistics */
> > +	return 1; /* Done with the bio; let dm core submit it. */
> > +}
> > +
> > +/* Message method. */
> > +static int throttle_message(struct dm_target *ti, unsigned argc, char **argv)
> > +{
> > +	int kbs, rw;
> > +	struct throttle_c *tc = ti->private;
> > +
> > +	if (argc == 2) {
> > +		if (!strcmp(argv[0], "stats") &&
> > +		    !strcmp(argv[1], "reset")) {
> > +			/* Reset statistics. */
> > +			stats_reset(&tc->stats);
> > +			goto out;
> > +		} else if (!strcmp(argv[0], "read_kbs"))
> > +			/* Adjust read kilobytes per second. */
> > +			rw = 0;
> > +		else if (!strcmp(argv[0], "write_kbs"))
> > +			/* Adjust write kilobytes per second. */
> > +			rw = 1;
> > +		else
> > +			goto err;
> > +
> > +		/* Read r/w kbs paramater. */
> > +		if (sscanf(argv[1], "%d", &kbs) != 1 || kbs < 0) {
> > +			DMWARN("Unrecognised throttle %s_kbs parameter.",
> > +			       rw_str(rw));
> > +			return -EINVAL;
> > +		}
> > +
> > +		/* Update settings. */
> > +		mutex_lock(&tc->account.rw[rw].mutex);
> > +		tc->params.bs[rw] = to_bs(kbs);
> > +		account_reset(rw, tc);
> > +		mutex_unlock(&tc->account.rw[rw].mutex);
> > +out:
> > +		return 0;
> > +	}
> > +err:
> > +	DMWARN("Unrecognised throttle message received.");
> > +	return -EINVAL;
> > +}
> > +
> > +/* Status output method. */
> > +static int throttle_status(struct dm_target *ti, status_type_t type,
> > +			   char *result, unsigned maxlen)
> > +{
> > +	ssize_t sz = 0;
> > +	struct throttle_c *tc = ti->private;
> > +	struct stats *s = &tc->stats;
> > +	struct params *p = &tc->params;
> > +
> > +	switch (type) {
> > +	case STATUSTYPE_INFO:
> > +		DMEMIT("v=%s rkb=%u wkb=%u r=%u w=%u rd=%u wd=%u "
> > +		       "acr=%u acw=%u",
> > +		       version,
> > +		       to_kbs(p->bs[0]), to_kbs(p->bs[1]),
> > +		       atomic_read(s->io), atomic_read(s->io + 1),
> > +		       atomic_read(s->deferred_io),
> > +		       atomic_read(s->deferred_io + 1),
> > +		       atomic_read(s->accounted),
> > +		       atomic_read(s->accounted + 1));
> > +		break;
> > +
> > +	case STATUSTYPE_TABLE:
> > +		DMEMIT("%u", p->params);
> > +
> > +		if (p->params) {
> > +			DMEMIT(" %u", p->kbs_ctr[0]);
> > +
> > +			if (p->params > 1)
> > +				DMEMIT(" %u", p->kbs_ctr[1]);
> > +		}
> > +
> > +		DMEMIT(" %s %llu",
> > +		       tc->dev.dev->name,
> > +		       (unsigned long long) tc->dev.start);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static struct target_type throttle_target = {
> > +	.name		= "throttle",
> > +	.version	= {1, 0, 0},
> > +	.module		= THIS_MODULE,
> > +	.ctr		= throttle_ctr,
> > +	.dtr		= throttle_dtr,
> > +	.map		= throttle_map,
> > +	.message	= throttle_message,
> > +	.status		= throttle_status,
> > +};
> > +
> > +int __init dm_throttle_init(void)
> > +{
> > +	int r = dm_register_target(&throttle_target);
> > +
> > +	if (r)
> > +		DMERR("Failed to register %s [%d]", DM_MSG_PREFIX, r);
> > +	else
> > +		DMINFO("registered %s %s", DM_MSG_PREFIX, version);
> > +
> > +	return r;
> > +}
> > +
> > +void dm_throttle_exit(void)
> > +{
> > +	dm_unregister_target(&throttle_target);
> > +	DMINFO("unregistered %s %s", DM_MSG_PREFIX, version);
> > +}
> > +
> > +/* Module hooks */
> > +module_init(dm_throttle_init);
> > +module_exit(dm_throttle_exit);
> > +
> > +MODULE_DESCRIPTION(DM_NAME "device-mapper throttle target");
> > +MODULE_AUTHOR("Heinz Mauelshagen <heinzm@redhat.com>");
> > +MODULE_LICENSE("GPL");
> > 
> > 
> > --
> > dm-devel mailing list
> > dm-devel@redhat.com
> > https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH} dm-throttle: new device mapper target to throttle reads and writes
  2010-08-12  9:08   ` Heinz Mauelshagen
@ 2010-08-12 16:46     ` Vivek Goyal
  2010-08-12 20:32       ` Vivek Goyal
  2010-08-13 11:19       ` Heinz Mauelshagen
  0 siblings, 2 replies; 10+ messages in thread
From: Vivek Goyal @ 2010-08-12 16:46 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Christoph Hellwig, device-mapper development, Alasdair G Kergon

On Thu, Aug 12, 2010 at 11:08:09AM +0200, Heinz Mauelshagen wrote:
> On Tue, 2010-08-10 at 10:44 -0400, Vivek Goyal wrote:
> > On Tue, Aug 10, 2010 at 03:42:22PM +0200, Heinz Mauelshagen wrote:
> > > 
> > > This is a new device mapper "throttle" target which allows for
> > > throttling reads and writes (ie. enforcing throughput limits) in units
> > > of kilobytes per second.
> > > 
> > 
> > Hi Heinz,
> > 
> > How about extending this stuff to handle cgroups also. So instead of
> > having deivice wide throttling policy, we throttle cgroups. That will
> > be a much more useful thing and will serve well the use case of throttling
> > virtual machines in cgroup.
> 
> 
> Hi Vivek,
> 
> needs a serious design discussion but I think we could leverage it to
> allow for throttling of cgroups.
> 

We need to parse cgroup information inside the dm taget (as CFQ does) and
just prepare one queue per group and queue the IO there (If we exceeded
the IO rate of group) and dispatch it later.

We also need to get the per cgroup rules from cgroup interface and not
with the help of static device mapper tables at the device creation
time.

I can write some code for dm-throttle for cgroup functioinality once
basic dm-throttle is in.

I am not sure that how to keep both the modes in dm-throttle taget

- cgroup mode
- the whole device limitation mode. (you just created).

Mike Snitzer suggested that we can have both the modes and specify the
mode at the time of device creation.

> > 
> > Yesterday I had raised the issue of cgroup IO bandwidth throttling at
> > Linux Storage and Filesystem session. I thought that a device mapper
> > target will be the easiest thing to because I can make use of lots
> > of existing infrastructure.
> > 
> > Christoph did not like it because of configuration concerns. He preferred
> > something in block layer/request queue. It was also hinted that there
> > were some ideas floating of better integation of device mapper
> > infrastructure with request queue and this thing should go behind that.
> 
> Right, if a block layer change of that kind will be pending, we should
> wait for it to settle.

I don't have details but to me it sounds as if it is just a concept at
this point of time. So waiting for that to happen might take too long
and we want max bw control feature as soon as possible. IMHO, converting
the dm-throttle later to use that new infrastructure will be a much
better option.

> 
> > But the problem is I am not sure how long it is going to take before
> > this new infrastructure becomes a reality and it will not be practical
> > to wait for that.
> 
> Did any reliable plans come out of the discussion or will there be any
> in the near future?

I am not aware of any. Alasdair will know more about it.

> 
> > 
> > There is a possibility that we can put a hook in __make_request function
> > and first take out all the bios and subject them to bandwidth limitation
> > and then pass it to lower layers. But that will mean redoing lots of
> > common infrastructure which has already been done. For example,
> > 
> > - What happens to queue congestion semantics.
> > 
> > 	- Request queue already has it based on requests and device mapper
> > 	  seems to have its own congestion functions.
> 
> Yes, dm does.

I was looking into the dm code and found dm_any_congested(). So it looks
like that dm just calls underlying devices to find out if any of the device
is congested or not.

Thinking more about it, congestion semantics seem to have been defined
for a thread which does not want to sleep because of request descriptor
allocation. In case of bandwidth control, we will not be allocating
any request descriptors. Bios will be handed to us. No cloning operation
required so no bio allocations required. I might have to do some
allocation of internal structures though like group, queue etc when a new 
request comes in.

So because I will not be putting any artificial restrictions on number
if bios queued for throttling (unlike request descriptors), I probably
don't require any congestion semantics. The only time a thread might
be put to sleep if mempool_alloc() puts it to sleep because of some
memory reclaim taking place. That's how dm seems to be handling it and
if that is acceptable then it should be acceptable for bandwidth
controller on request queue?

> 
> > 
> > 	- If I go for taking the bio out on request queue and hold them
> >    	  back then I am not sure how to define congestion semantics.
> > 	  To keep congestion semantcs simple, it would make sense to
> >  	  create a new request queue (with the help of dm target), and
> > 	  use that.
> 
> Yes, that's an obvious approach to stay with the same congestion
> semantics.

See above? If I am not putting an artificial limit on number of bios that
can be submitted on request queue, then I don't require any additional
congestion semantics. The only time a thread will put to sleep if we
are not able to allocate some objects like group and per group queue etc.
Otherwise, a thread will submit the bio and go back and do something else
or wait for io to finish.

> 
> > 
> > - I have yet to think through it but I think I wil be doing other common
> >   operations like holding back requests in internal queues, dispatching
> >   these later with the help of a kernel thread, allowing some to dispatch
> >   immediately as these come in, Putting processes to sleep and waking
> >   them later if we are already holding too many bios etc.
> > 
> > To me it sounds that doing it is lot simpler with the help of device
> > mapper target. Though the not so nice part is the need of configuring
> > another device mapper target on every block device we want to control.
> 
> Yes, we'd need identity mappings in the stack to be prepared.
> 
> Or we need some __generic_make_request() hack ala bcache to hijack the
> request function on the fly.

I will look at bcache but yes it would be a hook in __generic_make_request()
if bandwidth control has to be done in request queue/block layer and not
as device mapper target.

Vivek

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH} dm-throttle: new device mapper target to throttle reads and writes
  2010-08-12 16:46     ` Vivek Goyal
@ 2010-08-12 20:32       ` Vivek Goyal
  2010-08-13 11:19       ` Heinz Mauelshagen
  1 sibling, 0 replies; 10+ messages in thread
From: Vivek Goyal @ 2010-08-12 20:32 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Christoph Hellwig, device-mapper development, Alasdair G Kergon

On Thu, Aug 12, 2010 at 12:46:36PM -0400, Vivek Goyal wrote:

[..]
> > > There is a possibility that we can put a hook in __make_request function
> > > and first take out all the bios and subject them to bandwidth limitation
> > > and then pass it to lower layers. But that will mean redoing lots of
> > > common infrastructure which has already been done. For example,
> > > 
> > > - What happens to queue congestion semantics.
> > > 
> > > 	- Request queue already has it based on requests and device mapper
> > > 	  seems to have its own congestion functions.
> > 
> > Yes, dm does.
> 
> I was looking into the dm code and found dm_any_congested(). So it looks
> like that dm just calls underlying devices to find out if any of the device
> is congested or not.
> 
> Thinking more about it, congestion semantics seem to have been defined
> for a thread which does not want to sleep because of request descriptor
> allocation. In case of bandwidth control, we will not be allocating
> any request descriptors. Bios will be handed to us. No cloning operation
> required so no bio allocations required. I might have to do some
> allocation of internal structures though like group, queue etc when a new 
> request comes in.
> 
> So because I will not be putting any artificial restrictions on number
> if bios queued for throttling (unlike request descriptors), I probably
> don't require any congestion semantics. The only time a thread might
> be put to sleep if mempool_alloc() puts it to sleep because of some
> memory reclaim taking place. That's how dm seems to be handling it and
> if that is acceptable then it should be acceptable for bandwidth
> controller on request queue?
> 
> > 
> > > 
> > > 	- If I go for taking the bio out on request queue and hold them
> > >    	  back then I am not sure how to define congestion semantics.
> > > 	  To keep congestion semantcs simple, it would make sense to
> > >  	  create a new request queue (with the help of dm target), and
> > > 	  use that.
> > 
> > Yes, that's an obvious approach to stay with the same congestion
> > semantics.
> 
> See above? If I am not putting an artificial limit on number of bios that
> can be submitted on request queue, then I don't require any additional
> congestion semantics. The only time a thread will put to sleep if we
> are not able to allocate some objects like group and per group queue etc.
> Otherwise, a thread will submit the bio and go back and do something else
> or wait for io to finish.

I guess I shall have to define some kind of congestion notion for bw
controlller also and reason being that with thorottling it might happen
that underlying devices are not congested at all and in that case some
process can keep on submitting the bios and consume ton of memory. So
to prevent that at some point of time I shall have to stop accepting
more bios and put the submitting process to sleep.

Vivek

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH} dm-throttle: new device mapper target to throttle reads and writes
  2010-08-12 16:46     ` Vivek Goyal
  2010-08-12 20:32       ` Vivek Goyal
@ 2010-08-13 11:19       ` Heinz Mauelshagen
  1 sibling, 0 replies; 10+ messages in thread
From: Heinz Mauelshagen @ 2010-08-13 11:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, device-mapper development, Alasdair G Kergon

On Thu, 2010-08-12 at 12:46 -0400, Vivek Goyal wrote:
> On Thu, Aug 12, 2010 at 11:08:09AM +0200, Heinz Mauelshagen wrote:
> > On Tue, 2010-08-10 at 10:44 -0400, Vivek Goyal wrote:
> > > On Tue, Aug 10, 2010 at 03:42:22PM +0200, Heinz Mauelshagen wrote:
> > > > 
> > > > This is a new device mapper "throttle" target which allows for
> > > > throttling reads and writes (ie. enforcing throughput limits) in units
> > > > of kilobytes per second.
> > > > 
> > > 
> > > Hi Heinz,
> > > 
> > > How about extending this stuff to handle cgroups also. So instead of
> > > having deivice wide throttling policy, we throttle cgroups. That will
> > > be a much more useful thing and will serve well the use case of throttling
> > > virtual machines in cgroup.
> > 
> > 
> > Hi Vivek,
> > 
> > needs a serious design discussion but I think we could leverage it to
> > allow for throttling of cgroups.
> > 
> 
> We need to parse cgroup information inside the dm taget (as CFQ does) and
> just prepare one queue per group and queue the IO there (If we exceeded
> the IO rate of group) and dispatch it later.
> 
> We also need to get the per cgroup rules from cgroup interface and not
> with the help of static device mapper tables at the device creation
> time.
> 
> I can write some code for dm-throttle for cgroup functioinality once
> basic dm-throttle is in.

Ok.

> 
> I am not sure that how to keep both the modes in dm-throttle taget
> 
> - cgroup mode
> - the whole device limitation mode. (you just created).

Need to learn more about cgroup internals to tell...

> 
> Mike Snitzer suggested that we can have both the modes and specify the
> mode at the time of device creation.

Yes, can be optional selected by a target constructor argument.

> 
> > > 
> > > Yesterday I had raised the issue of cgroup IO bandwidth throttling at
> > > Linux Storage and Filesystem session. I thought that a device mapper
> > > target will be the easiest thing to because I can make use of lots
> > > of existing infrastructure.
> > > 
> > > Christoph did not like it because of configuration concerns. He preferred
> > > something in block layer/request queue. It was also hinted that there
> > > were some ideas floating of better integation of device mapper
> > > infrastructure with request queue and this thing should go behind that.
> > 
> > Right, if a block layer change of that kind will be pending, we should
> > wait for it to settle.
> 
> I don't have details but to me it sounds as if it is just a concept at
> this point of time. So waiting for that to happen might take too long
> and we want max bw control feature as soon as possible. IMHO, converting
> the dm-throttle later to use that new infrastructure will be a much
> better option.

Ok.

BTW:
I added the stashing/dispatching of bios to my repository.
Testing right now.

Tackling the bi_size > bps support next.

> 
> > 
> > > But the problem is I am not sure how long it is going to take before
> > > this new infrastructure becomes a reality and it will not be practical
> > > to wait for that.
> > 
> > Did any reliable plans come out of the discussion or will there be any
> > in the near future?
> 
> I am not aware of any. Alasdair will know more about it.

Ok, waiting for his summary once he got back.

> 
> > 
> > > 
> > > There is a possibility that we can put a hook in __make_request function
> > > and first take out all the bios and subject them to bandwidth limitation
> > > and then pass it to lower layers. But that will mean redoing lots of
> > > common infrastructure which has already been done. For example,
> > > 
> > > - What happens to queue congestion semantics.
> > > 
> > > 	- Request queue already has it based on requests and device mapper
> > > 	  seems to have its own congestion functions.
> > 
> > Yes, dm does.
> 
> I was looking into the dm code and found dm_any_congested(). So it looks
> like that dm just calls underlying devices to find out if any of the device
> is congested or not.

Right.

> 
> Thinking more about it, congestion semantics seem to have been defined
> for a thread which does not want to sleep because of request descriptor
> allocation. In case of bandwidth control, we will not be allocating
> any request descriptors. Bios will be handed to us. No cloning operation
> required so no bio allocations required. I might have to do some
> allocation of internal structures though like group, queue etc when a new 
> request comes in.
> 
> So because I will not be putting any artificial restrictions on number
> if bios queued for throttling (unlike request descriptors), I probably
> don't require any congestion semantics. The only time a thread might
> be put to sleep if mempool_alloc() puts it to sleep because of some
> memory reclaim taking place. That's how dm seems to be handling it and
> if that is acceptable then it should be acceptable for bandwidth
> controller on request queue?

I think so. We'll have to figure it out when we proceed.

> 
> > 
> > > 
> > > 	- If I go for taking the bio out on request queue and hold them
> > >    	  back then I am not sure how to define congestion semantics.
> > > 	  To keep congestion semantcs simple, it would make sense to
> > >  	  create a new request queue (with the help of dm target), and
> > > 	  use that.
> > 
> > Yes, that's an obvious approach to stay with the same congestion
> > semantics.
> 
> See above? If I am not putting an artificial limit on number of bios that
> can be submitted on request queue, then I don't require any additional
> congestion semantics. The only time a thread will put to sleep if we
> are not able to allocate some objects like group and per group queue etc.
> Otherwise, a thread will submit the bio and go back and do something else
> or wait for io to finish.

Ok.

> 
> > 
> > > 
> > > - I have yet to think through it but I think I wil be doing other common
> > >   operations like holding back requests in internal queues, dispatching
> > >   these later with the help of a kernel thread, allowing some to dispatch
> > >   immediately as these come in, Putting processes to sleep and waking
> > >   them later if we are already holding too many bios etc.
> > > 
> > > To me it sounds that doing it is lot simpler with the help of device
> > > mapper target. Though the not so nice part is the need of configuring
> > > another device mapper target on every block device we want to control.
> > 
> > Yes, we'd need identity mappings in the stack to be prepared.
> > 
> > Or we need some __generic_make_request() hack ala bcache to hijack the
> > request function on the fly.
> 
> I will look at bcache but yes it would be a hook in __generic_make_request()
> if bandwidth control has to be done in request queue/block layer and not
> as device mapper target.

This is back to the general discussion where bandwidth control should be
layered. The more generically things like bandwidth control are possible
to do, the better.

dm-throttle OTOH is only meant to be for testing purposes so far.

If the decision goes to do it in the block layer for production, we have
to see if we can drop such testing targets,

Heinz

> 
> Vivek

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] dm-throttle: new device mapper target to throttle reads and writes
  2010-08-12 14:23 ` Vivek Goyal
@ 2010-08-12 21:23   ` Heinz Mauelshagen
  0 siblings, 0 replies; 10+ messages in thread
From: Heinz Mauelshagen @ 2010-08-12 21:23 UTC (permalink / raw)
  To: device-mapper development

On Thu, 2010-08-12 at 10:23 -0400, Vivek Goyal wrote:
> On Wed, Aug 11, 2010 at 05:45:13PM +0200, Heinz Mauelshagen wrote:
> 
> [..]
> > > [..]
> > > > +/* Decide about throttling (ie. deferring bios). */
> > > > +static int throttle(struct throttle_c *tc, struct bio *bio)
> > > > +{
> > > > +	int rw = (bio_data_dir(bio) == WRITE);
> > > > +	unsigned bps; /* Bytes per second. */
> > > > +
> > > > +	smp_rmb();
> > > > +	bps = tc->params.bs[rw];
> > > > +	if (bps) {
> > > > +		unsigned size;
> > > > +		struct account *ac = &tc->account;
> > > > +		struct ac_rw *ac_rw = ac->rw + rw;
> > > > +
> > > > +		if (time_after(jiffies, ac_rw->end_jiffies))
> > > > +			/* Measure time exceeded. */
> > > > +			account_reset(rw, tc);
> > > > +		else if (test_bit(rw, &ac->flags))
> > > > +			/* In case we're throttled already. */
> > > > +			return 1;
> > > > +
> > > > +		/* Account I/O size. */
> > > > +		size = ac_rw->size + bio->bi_size;
> > > > +		if (size > bps) {
> > > > +			/* Hit kilobytes per second threshold. */
> > > > +			set_bit(rw, &ac->flags);
> > > > +			return 1;
> > > 
> > > If bio->bi_size is greate than bps, will I always keep on throttling
> > > and hang?
> > 
> > bps needs to be set larger than the bio maximum size expected with the
> > current implementatio, right. The algorithm needs changing to cope with
> > bi_size larger than bps (see below).
> > 
> > > 
> > > [..]
> > > > +/* Map a throttle io. */
> > > > +static int throttle_map(struct dm_target *ti, struct bio *bio,
> > > > +			union map_info *map_context)
> > > > +{
> > > > +	int r, rw = (bio_data_dir(bio) == WRITE);
> > > > +	struct throttle_c *tc = ti->private;
> > > > +	struct ac_rw *ac_rw = tc->account.rw + rw;
> > > > +
> > > > +	mutex_lock(&ac_rw->mutex);
> > > > +	do {
> > > > +		r = throttle(tc, bio);
> > > > +		if (r) {
> > > > +			long end = ac_rw->end_jiffies, j = jiffies;
> > > > +
> > > > +			/* Wait till next second when KB/s reached. */
> > > > +			if (j < end)
> > > > +				schedule_timeout_uninterruptible(end - j);
> > > > +		}
> > > 
> > > So a thread is blocked if it crossed the IO rate. There is no such
> > mechanism
> > > to take the bio, statsh away somewhere and dispatch it to disk later.
> > The
> > > way request queue descriptors work.
> > 
> > Right, the aim for this testing target was to keep it as simple as
> > possible to solve the purpose of simulating low bandwidth transports or
> > varying device throughput properties.
> > 
> > Cheap approaches to tackle this issue include to set ti->split_io based
> > on bps < BIO_MAX_SIZE (units ignored) in the ctr/message interface,
> > to prohibit bps smaller than BIO_MAX_SIZE altogether or to change the
> > throttle() algorithm to allow for > 1s measurement periods based on
> > bi_size maximum vs. bps ratios.
> > 
> > The 1st one obviously causing more bio splits, the 2nd one prohibiting
> > small bandwidth simulation and the last one causing io peeks, which is
> > actually not what I wanted.
> 
> Can't we just wait for enough number of seconds to allow bigger bio to
> pass. So if bio size is 4MB and rate limit is 1MB/s then wait for 4 
> seconds. That way there are no splits, and no io peeks?

No, that'd cause what I meant with io peeks.
4s w/o io and then a 4MB burst.

Heinz

> 
> Thanks
> Vivek
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] dm-throttle: new device mapper target to throttle reads and writes
  2010-08-11 15:45 [PATCH] " Heinz Mauelshagen
@ 2010-08-12 14:23 ` Vivek Goyal
  2010-08-12 21:23   ` Heinz Mauelshagen
  0 siblings, 1 reply; 10+ messages in thread
From: Vivek Goyal @ 2010-08-12 14:23 UTC (permalink / raw)
  To: Heinz Mauelshagen; +Cc: dm-devel

On Wed, Aug 11, 2010 at 05:45:13PM +0200, Heinz Mauelshagen wrote:

[..]
> > [..]
> > > +/* Decide about throttling (ie. deferring bios). */
> > > +static int throttle(struct throttle_c *tc, struct bio *bio)
> > > +{
> > > +	int rw = (bio_data_dir(bio) == WRITE);
> > > +	unsigned bps; /* Bytes per second. */
> > > +
> > > +	smp_rmb();
> > > +	bps = tc->params.bs[rw];
> > > +	if (bps) {
> > > +		unsigned size;
> > > +		struct account *ac = &tc->account;
> > > +		struct ac_rw *ac_rw = ac->rw + rw;
> > > +
> > > +		if (time_after(jiffies, ac_rw->end_jiffies))
> > > +			/* Measure time exceeded. */
> > > +			account_reset(rw, tc);
> > > +		else if (test_bit(rw, &ac->flags))
> > > +			/* In case we're throttled already. */
> > > +			return 1;
> > > +
> > > +		/* Account I/O size. */
> > > +		size = ac_rw->size + bio->bi_size;
> > > +		if (size > bps) {
> > > +			/* Hit kilobytes per second threshold. */
> > > +			set_bit(rw, &ac->flags);
> > > +			return 1;
> > 
> > If bio->bi_size is greate than bps, will I always keep on throttling
> > and hang?
> 
> bps needs to be set larger than the bio maximum size expected with the
> current implementatio, right. The algorithm needs changing to cope with
> bi_size larger than bps (see below).
> 
> > 
> > [..]
> > > +/* Map a throttle io. */
> > > +static int throttle_map(struct dm_target *ti, struct bio *bio,
> > > +			union map_info *map_context)
> > > +{
> > > +	int r, rw = (bio_data_dir(bio) == WRITE);
> > > +	struct throttle_c *tc = ti->private;
> > > +	struct ac_rw *ac_rw = tc->account.rw + rw;
> > > +
> > > +	mutex_lock(&ac_rw->mutex);
> > > +	do {
> > > +		r = throttle(tc, bio);
> > > +		if (r) {
> > > +			long end = ac_rw->end_jiffies, j = jiffies;
> > > +
> > > +			/* Wait till next second when KB/s reached. */
> > > +			if (j < end)
> > > +				schedule_timeout_uninterruptible(end - j);
> > > +		}
> > 
> > So a thread is blocked if it crossed the IO rate. There is no such
> mechanism
> > to take the bio, statsh away somewhere and dispatch it to disk later.
> The
> > way request queue descriptors work.
> 
> Right, the aim for this testing target was to keep it as simple as
> possible to solve the purpose of simulating low bandwidth transports or
> varying device throughput properties.
> 
> Cheap approaches to tackle this issue include to set ti->split_io based
> on bps < BIO_MAX_SIZE (units ignored) in the ctr/message interface,
> to prohibit bps smaller than BIO_MAX_SIZE altogether or to change the
> throttle() algorithm to allow for > 1s measurement periods based on
> bi_size maximum vs. bps ratios.
> 
> The 1st one obviously causing more bio splits, the 2nd one prohibiting
> small bandwidth simulation and the last one causing io peeks, which is
> actually not what I wanted.

Can't we just wait for enough number of seconds to allow bigger bio to
pass. So if bio size is 4MB and rate limit is 1MB/s then wait for 4 
seconds. That way there are no splits, and no io peeks?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] dm-throttle: new device mapper target to throttle reads and writes
@ 2010-08-11 15:45 Heinz Mauelshagen
  2010-08-12 14:23 ` Vivek Goyal
  0 siblings, 1 reply; 10+ messages in thread
From: Heinz Mauelshagen @ 2010-08-11 15:45 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: dm-devel

From heinzm@redhat.com Wed Aug 11 17:19:52 2010
Subject: Re: [dm-devel] [PATCH} dm-throttle: new device mapper target to
 throttle reads and writes
From: Heinz Mauelshagen <heinzm@redhat.com>
Reply-To: heinzm@redhat.com
To: Vivek Goyal <vgoyal@redhat.comm>
Cc: dm-devel@redhat.comm
In-Reply-To: <20100810184138.GC9028@redhat.com>
References: <1281447742.4964.46.camel@o>
<20100810184138.GC9028@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Organization: Red Hat Inc.
Message-ID: <1281539990.4964.328.camel@o>
Mime-Version: 1.0
X-Mailer: Evolution 2.28.3 (2.28.3-1.fc12) 
Date: Wed, 11 Aug 2010 17:19:52 +0200
X-Evolution-Format: text/plain
X-Evolution-Account: 1270744278.6871.1@o
X-Evolution-Transport:
 smtp://heinzm;auth=PLAIN@smtp.corp.redhat.com/;use_ssl=never
X-Evolution-Fcc: mbox:/home/mauelsha/.evolution/mail/local#Sent
Content-Transfer-Encoding: 8bit

On Tue, 2010-08-10 at 14:41 -0400, Vivek Goyal wrote:
> On Tue, Aug 10, 2010 at 03:42:22PM +0200, Heinz Mauelshagen wrote:
> 
> [..]
> > +/* Decide about throttling (ie. deferring bios). */
> > +static int throttle(struct throttle_c *tc, struct bio *bio)
> > +{
> > +	int rw = (bio_data_dir(bio) == WRITE);
> > +	unsigned bps; /* Bytes per second. */
> > +
> > +	smp_rmb();
> > +	bps = tc->params.bs[rw];
> > +	if (bps) {
> > +		unsigned size;
> > +		struct account *ac = &tc->account;
> > +		struct ac_rw *ac_rw = ac->rw + rw;
> > +
> > +		if (time_after(jiffies, ac_rw->end_jiffies))
> > +			/* Measure time exceeded. */
> > +			account_reset(rw, tc);
> > +		else if (test_bit(rw, &ac->flags))
> > +			/* In case we're throttled already. */
> > +			return 1;
> > +
> > +		/* Account I/O size. */
> > +		size = ac_rw->size + bio->bi_size;
> > +		if (size > bps) {
> > +			/* Hit kilobytes per second threshold. */
> > +			set_bit(rw, &ac->flags);
> > +			return 1;
> 
> If bio->bi_size is greate than bps, will I always keep on throttling
> and hang?

bps needs to be set larger than the bio maximum size expected with the
current implementatio, right. The algorithm needs changing to cope with
bi_size larger than bps (see below).

> 
> [..]
> > +/* Map a throttle io. */
> > +static int throttle_map(struct dm_target *ti, struct bio *bio,
> > +			union map_info *map_context)
> > +{
> > +	int r, rw = (bio_data_dir(bio) == WRITE);
> > +	struct throttle_c *tc = ti->private;
> > +	struct ac_rw *ac_rw = tc->account.rw + rw;
> > +
> > +	mutex_lock(&ac_rw->mutex);
> > +	do {
> > +		r = throttle(tc, bio);
> > +		if (r) {
> > +			long end = ac_rw->end_jiffies, j = jiffies;
> > +
> > +			/* Wait till next second when KB/s reached. */
> > +			if (j < end)
> > +				schedule_timeout_uninterruptible(end - j);
> > +		}
> 
> So a thread is blocked if it crossed the IO rate. There is no such
mechanism
> to take the bio, statsh away somewhere and dispatch it to disk later.
The
> way request queue descriptors work.

Right, the aim for this testing target was to keep it as simple as
possible to solve the purpose of simulating low bandwidth transports or
varying device throughput properties.

Cheap approaches to tackle this issue include to set ti->split_io based
on bps < BIO_MAX_SIZE (units ignored) in the ctr/message interface,
to prohibit bps smaller than BIO_MAX_SIZE altogether or to change the
throttle() algorithm to allow for > 1s measurement periods based on
bi_size maximum vs. bps ratios.

The 1st one obviously causing more bio splits, the 2nd one prohibiting
small bandwidth simulation and the last one causing io peeks, which is
actually not what I wanted.


> Processes are blocked only if queue
> is congested otherwise one allows processes to submit requests and go
> back and do other work.
> 
> I am assuming that this will be bad for AIO.

Yes, bio stashing/dispatching mandatory to make AIO work

Regards,
Heinz

> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-08-13 11:19 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-10 13:42 [PATCH} dm-throttle: new device mapper target to throttle reads and writes Heinz Mauelshagen
2010-08-10 14:44 ` Vivek Goyal
2010-08-12  9:08   ` Heinz Mauelshagen
2010-08-12 16:46     ` Vivek Goyal
2010-08-12 20:32       ` Vivek Goyal
2010-08-13 11:19       ` Heinz Mauelshagen
2010-08-10 18:41 ` Vivek Goyal
2010-08-11 15:45 [PATCH] " Heinz Mauelshagen
2010-08-12 14:23 ` Vivek Goyal
2010-08-12 21:23   ` Heinz Mauelshagen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.