linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] cgroup: fsio throttle controller
@ 2019-01-18 10:31 Andrea Righi
  2019-01-18 10:31 ` [RFC PATCH 1/3] fsio-throttle: documentation Andrea Righi
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Andrea Righi @ 2019-01-18 10:31 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: Jens Axboe, Vivek Goyal, Josef Bacik, Dennis Zhou, cgroups,
	linux-block, linux-kernel

This is a redesign of my old cgroup-io-throttle controller:
https://lwn.net/Articles/330531/

I'm resuming this old patch to point out a problem that I think is still
not solved completely.

= Problem =

The io.max controller works really well at limiting synchronous I/O
(READs), but many I/O requests are initiated outside the context of
the process that is ultimately responsible for their creation (e.g.,
WRITEs).

Throttling at the block layer is, in some cases, too late, and we may
end up slowing down processes that are not responsible for the I/O
being processed at that level.

= Proposed solution =

The main idea of this controller is to split I/O measurement and I/O
throttling: I/O is measured at the block layer for READs and at the page
cache (dirty pages) for WRITEs, and processes are throttled while they're
generating I/O at the VFS level, based on the measured I/O.

= Example =

Here's a trivial example: create 2 cgroups, set an io.max limit of
10MB/s on each, run a write-intensive workload in both, and after a
while, from the root cgroup, run "sync".

 # cat /proc/self/cgroup
 0::/cg1
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30

 # cat /proc/self/cgroup
 0::/cg2
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30

 - io.max controller:

 # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
 # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max

 # cat /proc/self/cgroup
 0::/
 # time sync

 real	0m51,241s
 user	0m0,000s
 sys	0m0,113s

Ideally "sync" should complete almost immediately, because the root
cgroup is unlimited and isn't doing any I/O at all; instead, with io.max
it is blocked for more than 50 seconds, because writeback is throttled
to satisfy the io.max limits.

- fsio controller:

 # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
 # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs

 [you can find details about the syntax in the documentation patch]

 # cat /proc/self/cgroup
 0::/
 # time sync

 real	0m0,146s
 user	0m0,003s
 sys	0m0,001s

= Questions =

Q: Do we need another controller?
A: Probably not. I think it would be better to integrate this policy (or
   something similar) into the current blkio controller; this series is just
   meant to highlight the problem and gather ideas on how to address it.

Q: What about proportional limits / latency?
A: It should be trivial to add latency-based limits if we integrate this into
   the current I/O controller. As for proportional limits (weights): they're
   strictly related to I/O scheduling, and since this controller doesn't touch
   I/O dispatching policies, proportional limits are not trivial to implement
   (bandwidth limiting is definitely more straightforward).

Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
   writeback, right?
A: Correct. The tradeoff here is to tolerate I/O bursts during writeback in
   order to avoid priority inversion problems in the system.

Andrea Righi (3):
  fsio-throttle: documentation
  fsio-throttle: controller infrastructure
  fsio-throttle: instrumentation

 Documentation/cgroup-v1/fsio-throttle.txt | 142 +++++++++
 block/blk-core.c                          |  10 +
 include/linux/cgroup_subsys.h             |   4 +
 include/linux/fsio-throttle.h             |  43 +++
 include/linux/writeback.h                 |   7 +-
 init/Kconfig                              |  11 +
 kernel/cgroup/Makefile                    |   1 +
 kernel/cgroup/fsio-throttle.c             | 501 ++++++++++++++++++++++++++++++
 mm/filemap.c                              |  20 +-
 mm/page-writeback.c                       |  14 +-
 10 files changed, 749 insertions(+), 4 deletions(-)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 1/3] fsio-throttle: documentation
  2019-01-18 10:31 [RFC PATCH 0/3] cgroup: fsio throttle controller Andrea Righi
@ 2019-01-18 10:31 ` Andrea Righi
  2019-01-18 10:31 ` [RFC PATCH 2/3] fsio-throttle: controller infrastructure Andrea Righi
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Andrea Righi @ 2019-01-18 10:31 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: Jens Axboe, Vivek Goyal, Josef Bacik, Dennis Zhou, cgroups,
	linux-block, linux-kernel, Andrea Righi

Document the filesystem I/O controller: description, usage, design,
etc.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 Documentation/cgroup-v1/fsio-throttle.txt | 142 ++++++++++++++++++++++
 1 file changed, 142 insertions(+)
 create mode 100644 Documentation/cgroup-v1/fsio-throttle.txt

diff --git a/Documentation/cgroup-v1/fsio-throttle.txt b/Documentation/cgroup-v1/fsio-throttle.txt
new file mode 100644
index 000000000000..4f33cae2adea
--- /dev/null
+++ b/Documentation/cgroup-v1/fsio-throttle.txt
@@ -0,0 +1,142 @@
+
+                 Filesystem I/O throttling controller
+
+----------------------------------------------------------------------
+1. OVERVIEW
+
+This controller makes it possible to limit the filesystem I/O of specific
+process containers (cgroups [1]) on mounted devices, enforcing delays on the
+processes that exceed the limits defined for their cgroup.
+
+The goal of the filesystem I/O controller is to improve performance
+predictability from the applications' point of view and to provide performance
+isolation between different control groups sharing the same filesystems.
+
+----------------------------------------------------------------------
+2. DESIGN
+
+I/O activity generated by READs is evaluated at the block layer, while WRITEs
+are evaluated when a page changes from clean to dirty (rewriting a page that
+was already dirty doesn't generate extra I/O activity).
+
+Throttling is always performed at the VFS layer.
+
+This solution has the advantage of always being able to determine the
+task/cgroup that originally generated the I/O request, and it prevents
+filesystem locking contention and potential priority inversion problems
+(example: throttled journal I/O slowing down the entire system).
+
+The downside of this solution is that the controller is fuzzier (compared to
+the blkio controller), and it allows I/O bursts that may happen at the I/O
+scheduler layer.
+
+----------------------------------------------------------------------
+2.1. TOKEN BUCKET THROTTLING
+
+            Tokens (I/O rate) - <mb_per_second>
+                o
+                o
+                o
+              ....... <--.
+              \     /    | Bucket size (burst limit)
+               \ooo/     | <bucket_size_in_mb>
+                ---   <--'
+                 |ooo
+    Incoming --->|---> Conforming
+    I/O          |oo   I/O
+    requests  -->|-->  requests
+                 |
+            ---->|
+
+Token bucket [2] throttling: <mb_per_second> tokens are added to the bucket
+every second; the bucket can hold at most <bucket_size_in_mb> tokens; I/O
+requests are accepted if there are available tokens in the bucket; when a
+request of N bytes arrives, N tokens are removed from the bucket; if fewer
+than N tokens are available, the request is delayed until a sufficient number
+of tokens is available again in the bucket.
+
+----------------------------------------------------------------------
+3. USER INTERFACE
+
+A new I/O limit (in MB/s) can be defined using the file:
+- fsio.max_mbs
+
+The syntax of a throttling policy is the following:
+
+"<major>:<minor> <mb_per_second> <bucket_size_in_mb>"
+
+Examples:
+
+- set a maximum I/O rate of 10MB/s on /dev/sda (8:0), bucket size = 10MB:
+
+  # echo "8:0 10 10" > /sys/fs/cgroup/cg1/fsio.max_mbs
+
+- remove the I/O limit defined for /dev/sda (8:0):
+
+  # echo "8:0 0 0" > /sys/fs/cgroup/cg1/fsio.max_mbs
+
+----------------------------------------------------------------------
+4. Additional parameters
+
+----------------------------------------------------------------------
+4.1. Sleep timeslice
+
+The sleep timeslice is a configurable parameter that determines the minimum
+sleep time enforced on throttled tasks. Tasks will never be put to sleep
+for less than the sleep timeslice. Moreover, wakeup timers are always
+aligned to a multiple of the sleep timeslice.
+
+Increasing the sleep timeslice has the advantage of reducing the overhead of
+the controller: with coarser-grained control, fewer timers are created to
+wake up tasks, which means less softirq pressure in the system and less
+overhead introduced. However, a bigger sleep timeslice makes the controller
+fuzzier, since throttled tasks receive fewer throttling events with larger
+sleeps.
+
+The parameter can be changed via:
+/sys/module/fsio_throttle/parameters/throttle_timeslice_ms
+
+The default value is 250 ms.
+
+Example:
+  - set the sleep timeslice to 1s:
+
+    # echo 1000 > /sys/module/fsio_throttle/parameters/throttle_timeslice_ms
+
+----------------------------------------------------------------------
+4.2. Sleep timeframe
+
+This parameter defines the maximum sleep time enforced on a throttled task.
+
+The parameter can be changed via:
+/sys/module/fsio_throttle/parameters/throttle_timeframe_ms
+
+The default value is 2 sec.
+
+Example:
+  - set the sleep timeframe to 5s:
+
+    # echo 5000 > /sys/module/fsio_throttle/parameters/throttle_timeframe_ms
+
+4.3. Throttle kernel threads
+
+By default, kernel threads are never throttled and their I/O activity is
+never accounted. It is possible to change this behavior by writing 1 to:
+
+/sys/module/fsio_throttle/parameters/throttle_kernel_threads
+
+It is strongly recommended not to change this setting unless you know what
+you are doing.
+
+----------------------------------------------------------------------
+5. TODO
+
+- Integration with the blkio controller
+- Provide distinct read/write limits, as well as MBs vs iops
+- Provide additional statistics in cgroupfs
+
+----------------------------------------------------------------------
+6. REFERENCES
+
+[1] Documentation/admin-guide/cgroup-v2.rst
+[2] http://en.wikipedia.org/wiki/Token_bucket
-- 
2.17.1



* [RFC PATCH 2/3] fsio-throttle: controller infrastructure
  2019-01-18 10:31 [RFC PATCH 0/3] cgroup: fsio throttle controller Andrea Righi
  2019-01-18 10:31 ` [RFC PATCH 1/3] fsio-throttle: documentation Andrea Righi
@ 2019-01-18 10:31 ` Andrea Righi
  2019-01-18 10:31 ` [RFC PATCH 3/3] fsio-throttle: instrumentation Andrea Righi
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Andrea Righi @ 2019-01-18 10:31 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: Jens Axboe, Vivek Goyal, Josef Bacik, Dennis Zhou, cgroups,
	linux-block, linux-kernel, Andrea Righi

This is the core of the fsio-throttle controller: it defines the
interface to the cgroup subsystem and implements the I/O measurement and
throttling logic.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 include/linux/cgroup_subsys.h |   4 +
 include/linux/fsio-throttle.h |  43 +++
 init/Kconfig                  |  11 +
 kernel/cgroup/Makefile        |   1 +
 kernel/cgroup/fsio-throttle.c | 501 ++++++++++++++++++++++++++++++++++
 5 files changed, 560 insertions(+)
 create mode 100644 include/linux/fsio-throttle.h
 create mode 100644 kernel/cgroup/fsio-throttle.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcff3b4..33beb70c0eca 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -61,6 +61,10 @@ SUBSYS(pids)
 SUBSYS(rdma)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_FSIO_THROTTLE)
+SUBSYS(fsio)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/include/linux/fsio-throttle.h b/include/linux/fsio-throttle.h
new file mode 100644
index 000000000000..3a46df712475
--- /dev/null
+++ b/include/linux/fsio-throttle.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __FSIO_THROTTLE_H__
+#define __FSIO_THROTTLE_H__
+
+#include <linux/fs.h>
+#include <linux/genhd.h>
+
+#ifdef CONFIG_BLOCK
+static inline dev_t bdev_to_dev(struct block_device *bdev)
+{
+	return bdev ? MKDEV(MAJOR(bdev->bd_inode->i_rdev),
+			    bdev->bd_disk->first_minor) : 0;
+}
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+	return (mapping->host && mapping->host->i_sb->s_bdev) ?
+		mapping->host->i_sb->s_bdev : NULL;
+}
+#else /* CONFIG_BLOCK */
+static dev_t bdev_to_dev(struct block_device *bdev)
+{
+	return 0;
+}
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+	return NULL;
+}
+#endif /* CONFIG_BLOCK */
+
+#ifdef CONFIG_CGROUP_FSIO_THROTTLE
+int fsio_throttle(dev_t dev, ssize_t bytes, int state);
+#else /* CONFIG_CGROUP_FSIO_THROTTLE */
+static inline int
+fsio_throttle(dev_t dev, ssize_t bytes, int state)
+{
+	return 0;
+}
+#endif /* CONFIG_CGROUP_FSIO_THROTTLE */
+
+#endif /* __FSIO_THROTTLE_H__ */
diff --git a/init/Kconfig b/init/Kconfig
index d47cb77a220e..95d7342801eb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -775,6 +775,17 @@ config CGROUP_WRITEBACK
 	depends on MEMCG && BLK_CGROUP
 	default y
 
+config CGROUP_FSIO_THROTTLE
+	bool "Filesystem I/O throttling controller"
+	default n
+	depends on BLOCK
+	help
+	  This option enables the filesystem I/O throttling infrastructure.
+
+	  This makes it possible to properly throttle reads and writes at the
+	  filesystem level, without introducing I/O locking contention or
+	  priority inversion problems.
+
 menuconfig CGROUP_SCHED
 	bool "CPU controller"
 	default n
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index bfcdae896122..12de828b36cd 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -2,6 +2,7 @@
 obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o
 
 obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
+obj-$(CONFIG_CGROUP_FSIO_THROTTLE) += fsio-throttle.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
diff --git a/kernel/cgroup/fsio-throttle.c b/kernel/cgroup/fsio-throttle.c
new file mode 100644
index 000000000000..46f3ffd4015b
--- /dev/null
+++ b/kernel/cgroup/fsio-throttle.c
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * fsio-throttle.c - I/O cgroup controller
+ *
+ * Copyright (C) 2019 Andrea Righi <righi.andrea@gmail.com>
+ */
+
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/jiffies.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/moduleparam.h>
+#include <linux/genhd.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/sched/signal.h>
+#include <linux/cgroup.h>
+#include <linux/fsio-throttle.h>
+
+#define KB(x)   ((x) * 1024)
+#define MB(x)   (KB(KB(x)))
+#define GB(x)   (MB(KB(x)))
+
+static int throttle_kernel_threads __read_mostly;
+module_param(throttle_kernel_threads, int, 0644);
+MODULE_PARM_DESC(throttle_kernel_threads,
+		  "enable/disable I/O throttling for kernel threads");
+
+static int throttle_timeslice_ms __read_mostly = 250;
+module_param(throttle_timeslice_ms, int, 0644);
+MODULE_PARM_DESC(throttle_timeslice_ms,
+		  "throttling time slice (default 250ms)");
+
+static int throttle_timeframe_ms __read_mostly = 2000;
+module_param(throttle_timeframe_ms, int, 0644);
+MODULE_PARM_DESC(throttle_timeframe_ms,
+		  "maximum sleep time enforced (default 2000ms)");
+
+struct iothrottle {
+	struct cgroup_subsys_state css;
+	struct list_head list;
+	/* protect the list of iothrottle_node elements (list) */
+	struct mutex lock;
+	wait_queue_head_t wait;
+	struct timer_list timer;
+	bool timer_cancel;
+	/* protect the wait queue elements */
+	spinlock_t wait_lock;
+};
+
+struct iothrottle_limit {
+	unsigned long long usage;
+	unsigned long long bucket_size;
+	unsigned long long limit;
+	unsigned long long timestamp;
+	/* protect all of the above */
+	spinlock_t lock;
+};
+
+struct iothrottle_node {
+	struct list_head node;
+	struct rcu_head rcu;
+	struct iothrottle_limit bw;
+	dev_t dev;
+};
+
+static inline bool iothrottle_disabled(void)
+{
+	return !cgroup_subsys_enabled(fsio_cgrp_subsys);
+}
+
+static struct iothrottle *css_to_iothrottle(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct iothrottle, css) : NULL;
+}
+
+struct iothrottle *task_to_iothrottle(struct task_struct *p)
+{
+	if (unlikely(!p))
+		return NULL;
+	return css_to_iothrottle(task_css(p, fsio_cgrp_id));
+}
+
+static inline unsigned long long
+iothrottle_limit_delta_t(struct iothrottle_limit *res)
+{
+	return (long long)get_jiffies_64() - (long long)res->timestamp;
+}
+
+static void iothrottle_limit_init(struct iothrottle_limit *res,
+				 unsigned long long limit,
+				 unsigned long long bucket_size)
+{
+	spin_lock_init(&res->lock);
+	res->limit = limit;
+	res->usage = 0;
+	res->bucket_size = bucket_size;
+	res->timestamp = get_jiffies_64();
+}
+
+static unsigned long long
+iothrottle_limit_sleep(struct iothrottle_limit *res, unsigned long long size)
+{
+	unsigned long long delta;
+	long long tok;
+	unsigned long flags;
+
+	spin_lock_irqsave(&res->lock, flags);
+	res->usage -= size;
+	delta = jiffies_to_msecs(iothrottle_limit_delta_t(res));
+	res->timestamp = get_jiffies_64();
+	tok = (long long)res->usage * MSEC_PER_SEC;
+	if (delta) {
+		long long max = (long long)res->bucket_size * MSEC_PER_SEC;
+
+		tok += delta * res->limit;
+		tok = min_t(long long, tok, max);
+		res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC);
+	}
+	spin_unlock_irqrestore(&res->lock, flags);
+
+	return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0;
+}
+
+static void iothrottle_limit_reset(struct iothrottle_limit *res)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&res->lock, flags);
+	res->usage = 0;
+	spin_unlock_irqrestore(&res->lock, flags);
+}
+
+static inline int iothrottle_node_size(void)
+{
+	return sizeof(struct iothrottle_node);
+}
+
+static struct iothrottle_node *iothrottle_node_alloc(gfp_t flags)
+{
+	struct iothrottle_node *n;
+	int size = iothrottle_node_size();
+
+	if (size < PAGE_SIZE)
+		n = kmalloc(size, flags);
+	else
+		n = vmalloc(size);
+	if (n)
+		memset(n, 0, size);
+	return n;
+}
+
+static void iothrottle_node_free(struct iothrottle_node *n)
+{
+	if (iothrottle_node_size() < PAGE_SIZE)
+		kfree(n);
+	else
+		vfree(n);
+}
+
+static struct iothrottle_node *
+iothrottle_node_search(const struct iothrottle *iot, dev_t dev)
+{
+	struct iothrottle_node *n;
+
+	list_for_each_entry_rcu(n, &iot->list, node)
+		if (n->dev == dev)
+			return n;
+	return NULL;
+}
+
+static void iothrottle_node_reclaim(struct rcu_head *rp)
+{
+	struct iothrottle_node *n;
+
+	n = container_of(rp, struct iothrottle_node, rcu);
+	iothrottle_node_free(n);
+}
+
+static int iothrottle_parse_args(char *buf, size_t nbytes,
+				 dev_t *dev,
+				 unsigned long long *io_limit,
+				 unsigned long long *bucket_size)
+{
+	struct gendisk *disk;
+	unsigned int major, minor;
+	unsigned long long limit, size;
+	int part, ret = 0;
+
+	if (sscanf(buf, "%u:%u %llu %llu", &major, &minor, &limit, &size) != 4)
+		return -EINVAL;
+	disk = get_gendisk(MKDEV(major, minor), &part);
+	if (!disk)
+		return -ENODEV;
+	if (part) {
+		ret = -ENODEV;
+		goto out;
+	}
+	*dev = MKDEV(major, minor);
+	*io_limit = MB(limit);
+	*bucket_size = MB(size);
+out:
+	put_disk_and_module(disk);
+
+	return ret;
+}
+
+static ssize_t iothrottle_write(struct kernfs_open_file *of,
+				char *buffer, size_t nbytes, loff_t off)
+{
+	struct iothrottle *iot;
+	struct iothrottle_node *n, *newn = NULL;
+	unsigned long long io_limit, bucket_size;
+	dev_t dev;
+	char *buf;
+	int ret;
+
+	/*
+	 * We need to allocate a new buffer here, because
+	 * iothrottle_parse_args() can modify it and the buffer provided by
+	 * write_string is supposed to be const.
+	 */
+	buf = kmalloc(nbytes + 1, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	memcpy(buf, buffer, nbytes + 1);
+
+	ret = iothrottle_parse_args(buf, nbytes, &dev, &io_limit, &bucket_size);
+	if (ret)
+		goto out_free;
+
+	newn = iothrottle_node_alloc(GFP_KERNEL);
+	if (!newn) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+	newn->dev = dev;
+	iothrottle_limit_init(&newn->bw, io_limit, bucket_size);
+
+	iot = css_to_iothrottle(of_css(of));
+	if (unlikely(!iot)) {
+		WARN_ON_ONCE(1);
+		goto out_free;
+	}
+	mutex_lock(&iot->lock);
+	n = iothrottle_node_search(iot, dev);
+	if (!n) {
+		/* Insert new node */
+		if (io_limit) {
+			list_add_rcu(&newn->node, &iot->list);
+			newn = NULL;
+		}
+	} else if (!io_limit) {
+		/* Delete existing node */
+		list_del_rcu(&n->node);
+	} else {
+		/* Update existing node */
+		list_replace_rcu(&n->node, &newn->node);
+		newn = NULL;
+	}
+	mutex_unlock(&iot->lock);
+	if (n)
+		call_rcu(&n->rcu, iothrottle_node_reclaim);
+	ret = nbytes;
+out_free:
+	if (newn)
+		iothrottle_node_free(newn);
+	kfree(buf);
+	return ret;
+}
+
+static void iothrottle_show_limit(struct seq_file *m,
+				  dev_t dev, struct iothrottle_limit *res)
+{
+	seq_put_decimal_ull(m, "", MAJOR(dev));
+	seq_put_decimal_ull(m, ":", MINOR(dev));
+	seq_put_decimal_ull(m, " ", res->limit);
+	seq_put_decimal_ull(m, " ", res->usage);
+	seq_put_decimal_ull(m, " ", res->bucket_size);
+	seq_put_decimal_ull(m, " ",
+			    jiffies_to_clock_t(iothrottle_limit_delta_t(res)));
+	seq_putc(m, '\n');
+}
+
+static int iothrottle_read(struct seq_file *m, void *v)
+{
+	struct iothrottle *iot = css_to_iothrottle(seq_css(m));
+	struct iothrottle_node *n;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(n, &iot->list, node)
+		iothrottle_show_limit(m, n->dev, &n->bw);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static struct cftype iothrottle_files[] = {
+	{
+		.name = "max_mbs",
+		.seq_show = iothrottle_read,
+		.write = iothrottle_write,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+};
+
+static void iothrottle_wakeup(struct iothrottle *iot, bool timer_cancel)
+{
+	spin_lock_bh(&iot->wait_lock);
+	if (timer_cancel)
+		iot->timer_cancel = true;
+	wake_up_all(&iot->wait);
+	spin_unlock_bh(&iot->wait_lock);
+}
+
+static void iothrottle_timer_wakeup(struct timer_list *t)
+{
+	struct iothrottle *iot = from_timer(iot, t, timer);
+
+	iothrottle_wakeup(iot, false);
+}
+
+static struct cgroup_subsys_state *
+iothrottle_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct iothrottle *iot;
+
+	iot = kzalloc(sizeof(*iot), GFP_KERNEL);
+	if (!iot)
+		return ERR_PTR(-ENOMEM);
+	INIT_LIST_HEAD(&iot->list);
+	mutex_init(&iot->lock);
+	init_waitqueue_head(&iot->wait);
+	spin_lock_init(&iot->wait_lock);
+	iot->timer_cancel = false;
+	timer_setup(&iot->timer, iothrottle_timer_wakeup, 0);
+
+	return &iot->css;
+}
+
+static void iothrottle_css_offline(struct cgroup_subsys_state *css)
+{
+	struct iothrottle *iot = css_to_iothrottle(css);
+
+	spin_lock_bh(&iot->wait_lock);
+	iot->timer_cancel = true;
+	spin_unlock_bh(&iot->wait_lock);
+
+	iothrottle_wakeup(iot, true);
+}
+
+static void iothrottle_css_free(struct cgroup_subsys_state *css)
+{
+	struct iothrottle_node *n, *p;
+	struct iothrottle *iot = css_to_iothrottle(css);
+
+	del_timer_sync(&iot->timer);
+	/*
+	 * don't worry about locking here, at this point there's no reference
+	 * to the list.
+	 */
+	list_for_each_entry_safe(n, p, &iot->list, node)
+		iothrottle_node_free(n);
+	kfree(iot);
+}
+
+static inline bool is_kernel_thread(void)
+{
+	return !!(current->flags & (PF_KTHREAD | PF_KSWAPD));
+}
+
+static inline bool is_urgent_task(void)
+{
+	/* Never throttle tasks that are going to exit */
+	if (current->flags & PF_EXITING)
+		return true;
+	/* Throttle kernel threads only if throttle_kernel_threads is set */
+	return is_kernel_thread() && !throttle_kernel_threads;
+}
+
+static struct iothrottle *try_get_iothrottle_from_task(struct task_struct *p)
+{
+	struct iothrottle *iot = NULL;
+
+	rcu_read_lock();
+	if (!task_css_is_root(p, fsio_cgrp_id)) {
+		do {
+			iot = task_to_iothrottle(p);
+			if (unlikely(!iot))
+				break;
+		} while (!css_tryget_online(&iot->css));
+	}
+	rcu_read_unlock();
+
+	return iot;
+}
+
+static int iothrottle_evaluate_sleep(struct iothrottle *iot, dev_t dev,
+				     ssize_t bytes, int state)
+{
+	struct iothrottle_node *n;
+	unsigned long long sleep = 0;
+
+	rcu_read_lock();
+	n = iothrottle_node_search(iot, dev);
+	if (n) {
+		sleep = iothrottle_limit_sleep(&n->bw, bytes);
+		/*
+		 * state == 0 is used to do only I/O accounting without
+		 * enforcing sleeps.
+		 */
+		if (!state || sleep < msecs_to_jiffies(throttle_timeslice_ms))
+			sleep = 0;
+		if (sleep)
+			iothrottle_limit_reset(&n->bw);
+	}
+	rcu_read_unlock();
+
+	return sleep;
+}
+
+static noinline void iothrottle_force_sleep(struct iothrottle *iot,
+					    unsigned long long sleep,
+					    int state)
+{
+	unsigned long expire, now;
+
+	/*
+	 * Allow small IO bursts, by waking up the throttled task after a
+	 * maximum sleep of throttle_timeframe millisec.
+	 */
+	if (sleep > msecs_to_jiffies(throttle_timeframe_ms))
+		sleep = msecs_to_jiffies(throttle_timeframe_ms);
+
+	now = READ_ONCE(jiffies);
+	expire = now + sleep;
+
+	/*
+	 * Round up the time to sleep to a multiple of the sleep timeslice.
+	 *
+	 * In this way we can strongly reduce timer softirqs and
+	 * context switches in the system even when there are a lot of
+	 * different cgroups.
+	 */
+	expire = roundup(expire, msecs_to_jiffies(throttle_timeslice_ms));
+
+	/* Force sleep */
+	do {
+		DEFINE_WAIT(wait);
+
+		spin_lock_bh(&iot->wait_lock);
+		if (unlikely(iot->timer_cancel)) {
+			spin_unlock_bh(&iot->wait_lock);
+			break;
+		}
+		mod_timer(&iot->timer, expire);
+		spin_unlock_bh(&iot->wait_lock);
+
+		/*
+		 * Do not enforce interruptible sleep if there are pending
+		 * signals, otherwise we'll end up into a busy loop.
+		 */
+		if (signal_pending(current))
+			state = TASK_KILLABLE;
+
+		/* Send to sleep */
+		prepare_to_wait(&iot->wait, &wait, state);
+		schedule();
+		finish_wait(&iot->wait, &wait);
+	} while (!fatal_signal_pending(current) &&
+		 time_is_after_jiffies(expire));
+}
+
+int fsio_throttle(dev_t dev, ssize_t bytes, int state)
+{
+	struct iothrottle *iot;
+	unsigned long long sleep = 0;
+
+	if (iothrottle_disabled() || is_urgent_task())
+		return 0;
+	if (!dev)
+		return 0;
+	iot = try_get_iothrottle_from_task(current);
+	if (!iot)
+		return 0;
+	sleep = iothrottle_evaluate_sleep(iot, dev, bytes, state);
+	if (unlikely(sleep))
+		iothrottle_force_sleep(iot, sleep, state);
+	css_put(&iot->css);
+
+	return sleep;
+}
+
+struct cgroup_subsys fsio_cgrp_subsys = {
+	.css_alloc = iothrottle_css_alloc,
+	.css_free = iothrottle_css_free,
+	.css_offline = iothrottle_css_offline,
+	.dfl_cftypes = iothrottle_files,
+};
-- 
2.17.1



* [RFC PATCH 3/3] fsio-throttle: instrumentation
  2019-01-18 10:31 [RFC PATCH 0/3] cgroup: fsio throttle controller Andrea Righi
  2019-01-18 10:31 ` [RFC PATCH 1/3] fsio-throttle: documentation Andrea Righi
  2019-01-18 10:31 ` [RFC PATCH 2/3] fsio-throttle: controller infrastructure Andrea Righi
@ 2019-01-18 10:31 ` Andrea Righi
  2019-01-18 11:04 ` [RFC PATCH 0/3] cgroup: fsio throttle controller Paolo Valente
  2019-01-18 16:35 ` Josef Bacik
  4 siblings, 0 replies; 19+ messages in thread
From: Andrea Righi @ 2019-01-18 10:31 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: Jens Axboe, Vivek Goyal, Josef Bacik, Dennis Zhou, cgroups,
	linux-block, linux-kernel, Andrea Righi

Apply the fsio controller hooks to the appropriate kernel functions to
evaluate and throttle filesystem I/O.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 block/blk-core.c          | 10 ++++++++++
 include/linux/writeback.h |  7 ++++++-
 mm/filemap.c              | 20 +++++++++++++++++++-
 mm/page-writeback.c       | 14 ++++++++++++--
 4 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3c5f61ceeb67..4b4717f64ac1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -16,6 +16,7 @@
 #include <linux/backing-dev.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
+#include <linux/fsio-throttle.h>
 #include <linux/blk-mq.h>
 #include <linux/highmem.h>
 #include <linux/mm.h>
@@ -956,6 +957,15 @@ generic_make_request_checks(struct bio *bio)
 	 */
 	create_io_context(GFP_ATOMIC, q->node);
 
+	/*
+	 * Account only READs at this layer (WRITEs are accounted and throttled
+	 * in balance_dirty_pages()) and don't enforce sleeps (state=0): in this
+	 * way we can prevent potential lock contentions and priority inversion
+	 * problems at the filesystem layer.
+	 */
+	if (bio_op(bio) == REQ_OP_READ)
+		fsio_throttle(bio_dev(bio), bio->bi_iter.bi_size, 0);
+
 	if (!blkcg_bio_issue_check(q, bio))
 		return false;
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 738a0c24874f..1e161c7969e5 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -356,7 +356,12 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+
+#define balance_dirty_pages_ratelimited(__mapping) \
+	__balance_dirty_pages_ratelimited(__mapping, false)
+void __balance_dirty_pages_ratelimited(struct address_space *mapping,
+				       bool redirty);
+
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..5cc0959274d6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -29,6 +29,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/blkdev.h>
+#include <linux/fsio-throttle.h>
 #include <linux/security.h>
 #include <linux/cpuset.h>
 #include <linux/hugetlb.h>
@@ -2040,6 +2041,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
+	struct block_device *bdev = as_to_bdev(mapping);
 	struct inode *inode = mapping->host;
 	struct file_ra_state *ra = &filp->f_ra;
 	loff_t *ppos = &iocb->ki_pos;
@@ -2068,6 +2070,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		cond_resched();
 find_page:
+		fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
 		if (fatal_signal_pending(current)) {
 			error = -EINTR;
 			goto out;
@@ -2308,11 +2311,17 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		struct file *file = iocb->ki_filp;
 		struct address_space *mapping = file->f_mapping;
+		struct block_device *bdev = as_to_bdev(mapping);
 		struct inode *inode = mapping->host;
 		loff_t size;
 
 		size = i_size_read(inode);
 		if (iocb->ki_flags & IOCB_NOWAIT) {
+			unsigned long long sleep;
+
+			sleep = fsio_throttle(bdev_to_dev(bdev), 0, 0);
+			if (sleep)
+				return -EAGAIN;
 			if (filemap_range_has_page(mapping, iocb->ki_pos,
 						   iocb->ki_pos + count - 1))
 				return -EAGAIN;
@@ -2322,6 +2331,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 					        iocb->ki_pos + count - 1);
 			if (retval < 0)
 				goto out;
+			fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
 		}
 
 		file_accessed(file);
@@ -2366,9 +2376,11 @@ EXPORT_SYMBOL(generic_file_read_iter);
 static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct block_device *bdev = as_to_bdev(mapping);
 	struct page *page;
 	int ret;
 
+	fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
 	do {
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
@@ -2498,11 +2510,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 */
 	page = find_get_page(mapping, offset);
 	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+		struct block_device *bdev = as_to_bdev(mapping);
 		/*
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
 		do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
+		if (unlikely(!PageUptodate(page)))
+			fsio_throttle(bdev_to_dev(bdev), 0,
+				      TASK_INTERRUPTIBLE);
 	} else if (!page) {
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
@@ -3172,6 +3188,7 @@ ssize_t generic_perform_write(struct file *file,
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = 0;
+	unsigned int dirty;
 
 	do {
 		struct page *page;
@@ -3216,6 +3233,7 @@ ssize_t generic_perform_write(struct file *file,
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 		flush_dcache_page(page);
 
+		dirty = PageDirty(page);
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))
@@ -3241,7 +3259,7 @@ ssize_t generic_perform_write(struct file *file,
 		pos += copied;
 		written += copied;
 
-		balance_dirty_pages_ratelimited(mapping);
+		__balance_dirty_pages_ratelimited(mapping, dirty);
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7d1010453fb9..694ede8783f3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -20,6 +20,7 @@
 #include <linux/slab.h>
 #include <linux/pagemap.h>
 #include <linux/writeback.h>
+#include <linux/fsio-throttle.h>
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
@@ -1858,10 +1859,12 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
  * limit we decrease the ratelimiting by a lot, to prevent individual processes
  * from overshooting the limit by (ratelimit_pages) each.
  */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void __balance_dirty_pages_ratelimited(struct address_space *mapping,
+				       bool redirty)
 {
 	struct inode *inode = mapping->host;
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct block_device *bdev = as_to_bdev(mapping);
 	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int *p;
@@ -1878,6 +1881,13 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	if (wb->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
 
+	/*
+	 * Throttle filesystem I/O only if page was initially clean: re-writing
+	 * a dirty page doesn't generate additional I/O.
+	 */
+	if (!redirty)
+		fsio_throttle(bdev_to_dev(bdev), PAGE_SIZE, TASK_KILLABLE);
+
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1911,7 +1921,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 
 	wb_put(wb);
 }
-EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
+EXPORT_SYMBOL(__balance_dirty_pages_ratelimited);
 
 /**
  * wb_over_bg_thresh - does @wb need to be written back?
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 10:31 [RFC PATCH 0/3] cgroup: fsio throttle controller Andrea Righi
                   ` (2 preceding siblings ...)
  2019-01-18 10:31 ` [RFC PATCH 3/3] fsio-throttle: instrumentation Andrea Righi
@ 2019-01-18 11:04 ` Paolo Valente
  2019-01-18 11:10   ` Andrea Righi
  2019-01-18 16:35 ` Josef Bacik
  4 siblings, 1 reply; 19+ messages in thread
From: Paolo Valente @ 2019-01-18 11:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe, Vivek Goyal,
	Josef Bacik, Dennis Zhou, cgroups, linux-block, linux-kernel



> Il giorno 18 gen 2019, alle ore 11:31, Andrea Righi <righi.andrea@gmail.com> ha scritto:
> 
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
> 
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
> 
> = Problem =
> 
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for its creation (e.g.,
> WRITEs).
> 
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.
> 
> = Proposed solution =
> 
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READS, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.
> 

Hi Andrea,
what about the case where two processes are dirtying the same
pages?  Which will be charged?

Thanks,
Paolo

> = Example =
> 
> Here's a trivial example: create 2 cgroups, set an io.max limit of
> 10MB/s, run a write-intensive workload on both and after a while, from a
> root cgroup, run "sync".
> 
> # cat /proc/self/cgroup
> 0::/cg1
> # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30
> 
> # cat /proc/self/cgroup
> 0::/cg2
> # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30
> 
> - io.max controller:
> 
> # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
> # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max
> 
> # cat /proc/self/cgroup
> 0::/
> # time sync
> 
> real	0m51,241s
> user	0m0,000s
> sys	0m0,113s
> 
> Ideally "sync" should complete almost immediately, because the root
> cgroup is unlimited and it's not doing any I/O at all, but instead it's
> blocked for more than 50 sec with io.max, because the writeback is
> throttled to satisfy the io.max limits.
> 
> - fsio controller:
> 
> # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
> # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs
> 
> [you can find details about the syntax in the documentation patch]
> 
> # cat /proc/self/cgroup
> 0::/
> # time sync
> 
> real	0m0,146s
> user	0m0,003s
> sys	0m0,001s
> 
> = Questions =
> 
> Q: Do we need another controller?
> A: Probably no, I think it would be better to integrate this policy (or
>   something similar) in the current blkio controller, this is just to
>   highlight the problem and get some ideas on how to address it.
> 
> Q: What about proportional limits / latency?
> A: It should be trivial to add latency-based limits if we integrate this in the
>   current I/O controller. About proportional limits (weights), they're
>   strictly related to I/O scheduling and since this controller doesn't touch
>   I/O dispatching policies it's not trivial to implement proportional limits
>   (bandwidth limiting is definitely more straightforward).
> 
> Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
>   writeback, right?
> A: Correct, the tradeoff here is to tolerate I/O bursts during writeback to
>   avoid priority inversion problems in the system.
> 
> Andrea Righi (3):
>  fsio-throttle: documentation
>  fsio-throttle: controller infrastructure
>  fsio-throttle: instrumentation
> 
> Documentation/cgroup-v1/fsio-throttle.txt | 142 +++++++++
> block/blk-core.c                          |  10 +
> include/linux/cgroup_subsys.h             |   4 +
> include/linux/fsio-throttle.h             |  43 +++
> include/linux/writeback.h                 |   7 +-
> init/Kconfig                              |  11 +
> kernel/cgroup/Makefile                    |   1 +
> kernel/cgroup/fsio-throttle.c             | 501 ++++++++++++++++++++++++++++++
> mm/filemap.c                              |  20 +-
> mm/page-writeback.c                       |  14 +-
> 10 files changed, 749 insertions(+), 4 deletions(-)
> 



* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 11:04 ` [RFC PATCH 0/3] cgroup: fsio throttle controller Paolo Valente
@ 2019-01-18 11:10   ` Andrea Righi
  2019-01-18 11:11     ` Paolo Valente
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Righi @ 2019-01-18 11:10 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe, Vivek Goyal,
	Josef Bacik, Dennis Zhou, cgroups, linux-block, linux-kernel

On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 18 gen 2019, alle ore 11:31, Andrea Righi <righi.andrea@gmail.com> ha scritto:
> > 
> > This is a redesign of my old cgroup-io-throttle controller:
> > https://lwn.net/Articles/330531/
> > 
> > I'm resuming this old patch to point out a problem that I think is still
> > not solved completely.
> > 
> > = Problem =
> > 
> > The io.max controller works really well at limiting synchronous I/O
> > (READs), but a lot of I/O requests are initiated outside the context of
> > the process that is ultimately responsible for its creation (e.g.,
> > WRITEs).
> > 
> > Throttling at the block layer in some cases is too late and we may end
> > up slowing down processes that are not responsible for the I/O that
> > is being processed at that level.
> > 
> > = Proposed solution =
> > 
> > The main idea of this controller is to split I/O measurement and I/O
> > throttling: I/O is measured at the block layer for READS, at page cache
> > (dirty pages) for WRITEs, and processes are limited while they're
> > generating I/O at the VFS level, based on the measured I/O.
> > 
> 
> Hi Andrea,
> what about the case where two processes are dirtying the same
> pages?  Which will be charged?
> 
> Thanks,
> Paolo

Hi Paolo,

in this case only the first one will be charged for the I/O activity
(the one that changes a page from clean to dirty). This is probably not
totally fair in some cases, but I think it's a good compromise; in the
end, rewriting the same page over and over while it's already dirty
doesn't actually generate I/O activity, until the page is flushed back
to disk.

Obviously I'm open to other better ideas and suggestions.

Thanks!
-Andrea


* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 11:10   ` Andrea Righi
@ 2019-01-18 11:11     ` Paolo Valente
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Valente @ 2019-01-18 11:11 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe, Vivek Goyal,
	Josef Bacik, Dennis Zhou, cgroups, linux-block, linux-kernel



> Il giorno 18 gen 2019, alle ore 12:10, Andrea Righi <righi.andrea@gmail.com> ha scritto:
> 
> On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 18 gen 2019, alle ore 11:31, Andrea Righi <righi.andrea@gmail.com> ha scritto:
>>> 
>>> This is a redesign of my old cgroup-io-throttle controller:
>>> https://lwn.net/Articles/330531/
>>> 
>>> I'm resuming this old patch to point out a problem that I think is still
>>> not solved completely.
>>> 
>>> = Problem =
>>> 
>>> The io.max controller works really well at limiting synchronous I/O
>>> (READs), but a lot of I/O requests are initiated outside the context of
>>> the process that is ultimately responsible for its creation (e.g.,
>>> WRITEs).
>>> 
>>> Throttling at the block layer in some cases is too late and we may end
>>> up slowing down processes that are not responsible for the I/O that
>>> is being processed at that level.
>>> 
>>> = Proposed solution =
>>> 
>>> The main idea of this controller is to split I/O measurement and I/O
>>> throttling: I/O is measured at the block layer for READS, at page cache
>>> (dirty pages) for WRITEs, and processes are limited while they're
>>> generating I/O at the VFS level, based on the measured I/O.
>>> 
>> 
>> Hi Andrea,
>> what about the case where two processes are dirtying the same
>> pages?  Which will be charged?
>> 
>> Thanks,
>> Paolo
> 
> Hi Paolo,
> 
> in this case only the first one will be charged for the I/O activity
> (the one that changes a page from clean to dirty). This is probably not
> totally fair in some cases, but I think it's a good compromise,

Absolutely, I just wanted to better understand this point.

> in the
> end, rewriting the same page over and over while it's already dirty
> doesn't actually generate I/O activity, until the page is flushed back
> to disk.
> 

Right.

Thanks,
Paolo

> Obviously I'm open to other better ideas and suggestions.
> 
> Thanks!
> -Andrea



* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 10:31 [RFC PATCH 0/3] cgroup: fsio throttle controller Andrea Righi
                   ` (3 preceding siblings ...)
  2019-01-18 11:04 ` [RFC PATCH 0/3] cgroup: fsio throttle controller Paolo Valente
@ 2019-01-18 16:35 ` Josef Bacik
  2019-01-18 17:07   ` Paolo Valente
  2019-01-18 18:44   ` Andrea Righi
  4 siblings, 2 replies; 19+ messages in thread
From: Josef Bacik @ 2019-01-18 16:35 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe, Vivek Goyal,
	Josef Bacik, Dennis Zhou, cgroups, linux-block, linux-kernel

On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
> 
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
> 
> = Problem =
> 
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for its creation (e.g.,
> WRITEs).
> 
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.

How so?  The writeback threads are per-cgroup and have the cgroup stuff set
properly.  So if you dirty a bunch of pages, they are associated with your
cgroup, and then writeback happens and it's done in the writeback thread
associated with your cgroup and then that is throttled.  Then you are throttled
at balance_dirty_pages() because the writeout is taking longer.

I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
clearly tie IO to the thing generating the IO, such as readahead and such.  If
you are running into this case that may be something worth using.  Course it
only works for io.latency now but there's no reason you can't add support to it
for io.max or whatever.

> 
> = Proposed solution =
> 
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READS, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.
> 

This is what blk_cgroup_congested() is meant to accomplish, I would suggest
looking into that route and simply changing the existing io controller you are
using to take advantage of that so it will actually throttle things.  Then just
sprinkle it around the areas where we indirectly generate IO.  Thanks,

Josef


* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 16:35 ` Josef Bacik
@ 2019-01-18 17:07   ` Paolo Valente
  2019-01-18 17:12     ` Josef Bacik
  2019-01-18 19:02     ` Andrea Righi
  2019-01-18 18:44   ` Andrea Righi
  1 sibling, 2 replies; 19+ messages in thread
From: Paolo Valente @ 2019-01-18 17:07 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Andrea Righi, Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe,
	Vivek Goyal, Dennis Zhou, cgroups, linux-block, linux-kernel



> Il giorno 18 gen 2019, alle ore 17:35, Josef Bacik <josef@toxicpanda.com> ha scritto:
> 
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
>> This is a redesign of my old cgroup-io-throttle controller:
>> https://lwn.net/Articles/330531/
>> 
>> I'm resuming this old patch to point out a problem that I think is still
>> not solved completely.
>> 
>> = Problem =
>> 
>> The io.max controller works really well at limiting synchronous I/O
>> (READs), but a lot of I/O requests are initiated outside the context of
>> the process that is ultimately responsible for its creation (e.g.,
>> WRITEs).
>> 
>> Throttling at the block layer in some cases is too late and we may end
>> up slowing down processes that are not responsible for the I/O that
>> is being processed at that level.
> 
> How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> properly.  So if you dirty a bunch of pages, they are associated with your
> cgroup, and then writeback happens and it's done in the writeback thread
> associated with your cgroup and then that is throttled.  Then you are throttled
> at balance_dirty_pages() because the writeout is taking longer.
> 

IIUC, Andrea described this problem: certain processes in a certain group dirty a
lot of pages, causing write back to start.  Then some other blameless
process in the same group experiences very high latency, in spite of
the fact that it has to do little I/O.

Does your blk_cgroup_congested() stuff solves this issue?

Or simply I didn't get what Andrea meant at all :)

Thanks,
Paolo

> I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> clearly tie IO to the thing generating the IO, such as readahead and such.  If
> you are running into this case that may be something worth using.  Course it
> only works for io.latency now but there's no reason you can't add support to it
> for io.max or whatever.
> 
>> 
>> = Proposed solution =
>> 
>> The main idea of this controller is to split I/O measurement and I/O
>> throttling: I/O is measured at the block layer for READS, at page cache
>> (dirty pages) for WRITEs, and processes are limited while they're
>> generating I/O at the VFS level, based on the measured I/O.
>> 
> 
> This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> looking into that route and simply changing the existing io controller you are
> using to take advantage of that so it will actually throttle things.  Then just
> sprinkle it around the areas where we indirectly generate IO.  Thanks,
> 
> Josef



* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 17:07   ` Paolo Valente
@ 2019-01-18 17:12     ` Josef Bacik
  2019-01-18 19:02     ` Andrea Righi
  1 sibling, 0 replies; 19+ messages in thread
From: Josef Bacik @ 2019-01-18 17:12 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Josef Bacik, Andrea Righi, Tejun Heo, Li Zefan, Johannes Weiner,
	Jens Axboe, Vivek Goyal, Dennis Zhou, cgroups, linux-block,
	linux-kernel

On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 18 gen 2019, alle ore 17:35, Josef Bacik <josef@toxicpanda.com> ha scritto:
> > 
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >> 
> >> I'm resuming this old patch to point out a problem that I think is still
> >> not solved completely.
> >> 
> >> = Problem =
> >> 
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context of
> >> the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >> 
> >> Throttling at the block layer in some cases is too late and we may end
> >> up slowing down processes that are not responsible for the I/O that
> >> is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> > 
> 
> IIUC, Andrea described this problem: certain processes in a certain group dirty a
> lot of pages, causing write back to start.  Then some other blameless
> process in the same group experiences very high latency, in spite of
> the fact that it has to do little I/O.
> 

In that case the io controller isn't doing its job and needs to be fixed (or
reconfigured).  io.latency guards against this, I assume io.max would keep this
from happening if it were configured properly.

> Does your blk_cgroup_congested() stuff solves this issue?
> 
> Or simply I didn't get what Andrea meant at all :)
> 

I _think_ Andrea is talking about the fact that we can generate IO indirectly
and never get throttled for it, which is what blk_cgroup_congested() is meant to
address.  I added it specifically because some low prio task was just allocating
all of the memory on the system and causing a lot of pressure because of
swapping, but there was no direct feedback loop there.  blk_cgroup_congested()
provides that feedback loop.

Course I could be wrong too and we're all just talking past each other ;).
Thanks,

Josef


* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 16:35 ` Josef Bacik
  2019-01-18 17:07   ` Paolo Valente
@ 2019-01-18 18:44   ` Andrea Righi
  2019-01-18 19:46     ` Josef Bacik
  1 sibling, 1 reply; 19+ messages in thread
From: Andrea Righi @ 2019-01-18 18:44 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe, Vivek Goyal,
	Dennis Zhou, cgroups, linux-block, linux-kernel

On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > This is a redesign of my old cgroup-io-throttle controller:
> > https://lwn.net/Articles/330531/
> > 
> > I'm resuming this old patch to point out a problem that I think is still
> > not solved completely.
> > 
> > = Problem =
> > 
> > The io.max controller works really well at limiting synchronous I/O
> > (READs), but a lot of I/O requests are initiated outside the context of
> > the process that is ultimately responsible for its creation (e.g.,
> > WRITEs).
> > 
> > Throttling at the block layer in some cases is too late and we may end
> > up slowing down processes that are not responsible for the I/O that
> > is being processed at that level.
> 
> How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> properly.  So if you dirty a bunch of pages, they are associated with your
> cgroup, and then writeback happens and it's done in the writeback thread
> associated with your cgroup and then that is throttled.  Then you are throttled
> at balance_dirty_pages() because the writeout is taking longer.

Right, writeback is per-cgroup and slowing down writeback affects only
that specific cgroup, but, there are cases where other processes from
other cgroups may require to wait on that writeback to complete before
doing I/O (for example an fsync() to a file shared among different
cgroups). In this case we may end up blocking cgroups that shouldn't be
blocked, that looks like a priority-inversion problem. This is the
problem that I'm trying to address.

> 
> I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> clearly tie IO to the thing generating the IO, such as readahead and such.  If
> you are running into this case that may be something worth using.  Course it
> only works for io.latency now but there's no reason you can't add support to it
> for io.max or whatever.

IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
memcg), something like this: if the cgroup is already congested don't
generate extra I/O due to readahead. Am I right?

> 
> > 
> > = Proposed solution =
> > 
> > The main idea of this controller is to split I/O measurement and I/O
> > throttling: I/O is measured at the block layer for READS, at page cache
> > (dirty pages) for WRITEs, and processes are limited while they're
> > generating I/O at the VFS level, based on the measured I/O.
> > 
> 
> This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> looking into that route and simply changing the existing io controller you are
> using to take advantage of that so it will actually throttle things.  Then just
> sprinkle it around the areas where we indirectly generate IO.  Thanks,

Absolutely, I can probably use blk_cgroup_congested() as a method to
determine when a cgroup should be throttled (instead of doing my own
I/O measuring), but to prevent the "slow writeback slowing down other
cgroups" issue I still need to apply throttling when pages are dirtied
in page cache.

Thanks,
-Andrea


* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 17:07   ` Paolo Valente
  2019-01-18 17:12     ` Josef Bacik
@ 2019-01-18 19:02     ` Andrea Righi
  1 sibling, 0 replies; 19+ messages in thread
From: Andrea Righi @ 2019-01-18 19:02 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Josef Bacik, Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe,
	Vivek Goyal, Dennis Zhou, cgroups, linux-block, linux-kernel

On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 18 gen 2019, alle ore 17:35, Josef Bacik <josef@toxicpanda.com> ha scritto:
> > 
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >> 
> >> I'm resuming this old patch to point out a problem that I think is still
> >> not solved completely.
> >> 
> >> = Problem =
> >> 
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context of
> >> the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >> 
> >> Throttling at the block layer in some cases is too late and we may end
> >> up slowing down processes that are not responsible for the I/O that
> >> is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> > 
> 
> IIUC, Andrea described this problem: certain processes in a certain group dirty a
> lot of pages, causing write back to start.  Then some other blameless
> process in the same group experiences very high latency, in spite of
> the fact that it has to do little I/O.
> 
> Does your blk_cgroup_congested() stuff solves this issue?
> 
> Or simply I didn't get what Andrea meant at all :)
> 
> Thanks,
> Paolo

Yes, there is also this problem: providing fairness among processes
running inside the same cgroup.

This is a tough one, because once the I/O limit is reached, whichever
process comes next gets punished, even if it hasn't done any I/O before.

Well, my proposal doesn't solve this problem. To solve this one in the
"throttling" scenario, we should probably save some information about
the previously generated I/O activity and apply a delay proportional to
that (like a dynamic weight for each process inside each cgroup).

AFAICS the io.max controller doesn't solve this problem either.

-Andrea


* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 18:44   ` Andrea Righi
@ 2019-01-18 19:46     ` Josef Bacik
  2019-01-19 10:08       ` Andrea Righi
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2019-01-18 19:46 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Josef Bacik, Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe,
	Vivek Goyal, Dennis Zhou, cgroups, linux-block, linux-kernel

On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > This is a redesign of my old cgroup-io-throttle controller:
> > > https://lwn.net/Articles/330531/
> > > 
> > > I'm resuming this old patch to point out a problem that I think is still
> > > not solved completely.
> > > 
> > > = Problem =
> > > 
> > > The io.max controller works really well at limiting synchronous I/O
> > > (READs), but a lot of I/O requests are initiated outside the context of
> > > the process that is ultimately responsible for its creation (e.g.,
> > > WRITEs).
> > > 
> > > Throttling at the block layer in some cases is too late and we may end
> > > up slowing down processes that are not responsible for the I/O that
> > > is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> 
> Right, writeback is per-cgroup and slowing down writeback affects only
> that specific cgroup, but, there are cases where other processes from
> other cgroups may require to wait on that writeback to complete before
> doing I/O (for example an fsync() to a file shared among different
> cgroups). In this case we may end up blocking cgroups that shouldn't be
> blocked, that looks like a priority-inversion problem. This is the
> problem that I'm trying to address.

Well this case is a misconfiguration; you shouldn't be sharing files between
cgroups.  But even if you are, fsync() is synchronous, we should be getting the
context from the process itself and thus should have its own rules applied.
There's nothing we can do for outstanding IO, but that shouldn't be that much.
That would need to be dealt with on a per-controller basis.

> 
> > 
> > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > clearly tie IO to the thing generating the IO, such as readahead and such.  If
> > you are running into this case that may be something worth using.  Course it
> > only works for io.latency now but there's no reason you can't add support to it
> > for io.max or whatever.
> 
> IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> memcg), something like this: if the cgroup is already congested don't
> generate extra I/O due to readahead. Am I right?

Yeah, but that's just how it's currently used, it can be used any which way we
feel like.

> 
> > 
> > > 
> > > = Proposed solution =
> > > 
> > > The main idea of this controller is to split I/O measurement and I/O
> > > throttling: I/O is measured at the block layer for READS, at page cache
> > > (dirty pages) for WRITEs, and processes are limited while they're
> > > generating I/O at the VFS level, based on the measured I/O.
> > > 
> > 
> > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > looking into that route and simply changing the existing io controller you are
> > using to take advantage of that so it will actually throttle things.  Then just
> > sprinkle it around the areas where we indirectly generate IO.  Thanks,
> 
> Absolutely, I can probably use blk_cgroup_congested() as a method to
> determine when a cgroup should be throttled (instead of doing my own
> I/O measuring), but to prevent the "slow writeback slowing down other
> cgroups" issue I still need to apply throttling when pages are dirtied
> in page cache.

Again this is just a fuckup from a configuration standpoint.  The argument
could be made that sync() is probably broken here, but I think the right
solution here is to just pass the cgroup context along with the writeback
information and use that if it's set instead.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-18 19:46     ` Josef Bacik
@ 2019-01-19 10:08       ` Andrea Righi
  2019-01-21 21:47         ` Vivek Goyal
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Righi @ 2019-01-19 10:08 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe, Vivek Goyal,
	Dennis Zhou, cgroups, linux-block, linux-kernel

On Fri, Jan 18, 2019 at 02:46:53PM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > > This is a redesign of my old cgroup-io-throttle controller:
> > > > https://lwn.net/Articles/330531/
> > > > 
> > > > I'm resuming this old patch to point out a problem that I think is still
> > > > not solved completely.
> > > > 
> > > > = Problem =
> > > > 
> > > > The io.max controller works really well at limiting synchronous I/O
> > > > (READs), but a lot of I/O requests are initiated outside the context of
> > > > the process that is ultimately responsible for its creation (e.g.,
> > > > WRITEs).
> > > > 
> > > > Throttling at the block layer in some cases is too late and we may end
> > > > up slowing down processes that are not responsible for the I/O that
> > > > is being processed at that level.
> > > 
> > > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > > properly.  So if you dirty a bunch of pages, they are associated with your
> > > cgroup, and then writeback happens and it's done in the writeback thread
> > > associated with your cgroup and then that is throttled.  Then you are throttled
> > > at balance_dirty_pages() because the writeout is taking longer.
> > 
> > Right, writeback is per-cgroup and slowing down writeback affects only
> > that specific cgroup, but, there are cases where other processes from
> > other cgroups may require to wait on that writeback to complete before
> > doing I/O (for example an fsync() to a file shared among different
> > cgroups). In this case we may end up blocking cgroups that shouldn't be
> > blocked, that looks like a priority-inversion problem. This is the
> > problem that I'm trying to address.
> 
> Well this case is a misconfiguration, you shouldn't be sharing files between
> cgroups.  But even if you are, fsync() is synchronous, we should be getting the
> context from the process itself and thus should have its own rules applied.
> There's nothing we can do for outstanding IO, but that shouldn't be that much.
> That would need to be dealt with on a per-controller basis.

OK, fair point. We shouldn't be sharing files between cgroups.

I'm still not sure whether we can hit similar issues with metadata I/O
(which may introduce latencies like the sync() scenario); I have to
investigate more and run more tests.

> 
> > 
> > > 
> > > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > > clearly tie IO to the thing generating the IO, such as readahead and such.  If
> > > you are running into this case that may be something worth using.  Course it
> > > only works for io.latency now but there's no reason you can't add support to it
> > > for io.max or whatever.
> > 
> > IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> > memcg), something like this: if the cgroup is already congested don't
> > generate extra I/O due to readahead. Am I right?
> 
> Yeah, but that's just how it's currently used, it can be used any which way we
> feel like.

I think it'd be very interesting to be able to throttle I/O either
before writeback or during writeback. Right now we can only throttle
writeback itself. Maybe we could introduce some kind of dirty page
throttling controller using blk_cgroup_congested()... Opinions?

> 
> > 
> > > 
> > > > 
> > > > = Proposed solution =
> > > > 
> > > > The main idea of this controller is to split I/O measurement and I/O
> > > > throttling: I/O is measured at the block layer for READS, at page cache
> > > > (dirty pages) for WRITEs, and processes are limited while they're
> > > > generating I/O at the VFS level, based on the measured I/O.
> > > > 
> > > 
> > > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > > looking into that route and simply changing the existing io controller you are
> > > using to take advantage of that so it will actually throttle things.  Then just
> > > sprinkle it around the areas where we indirectly generate IO.  Thanks,
> > 
> > Absolutely, I can probably use blk_cgroup_congested() as a method to
> > determine when a cgroup should be throttled (instead of doing my own
> > I/O measuring), but to prevent the "slow writeback slowing down other
> > cgroups" issue I still need to apply throttling when pages are dirtied
> > in page cache.
> 
> Again this is just a fuckup from a configuration standpoint.  The argument
> could be made that sync() is probably broken here, but I think the right
> solution here is to just pass the cgroup context along with the writeback
> information and use that if it's set instead.  Thanks,

Alright, let's skip the root cgroup for now. I think the point here is
whether we want to provide sync() isolation among cgroups or not.

According to the manpage:

       sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
       written to the underlying filesystems.

And:
       According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
       may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
       thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
       tem or filesystem respectively.

Excluding the root cgroup, do you think a sync() issued inside a
specific cgroup should wait for I/O completions only for the writes that
have been generated by that cgroup?

Thanks,
-Andrea

* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-19 10:08       ` Andrea Righi
@ 2019-01-21 21:47         ` Vivek Goyal
  2019-01-28 17:41           ` Andrea Righi
  0 siblings, 1 reply; 19+ messages in thread
From: Vivek Goyal @ 2019-01-21 21:47 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Josef Bacik, Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe,
	Dennis Zhou, cgroups, linux-block, linux-kernel

On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:

[..]
> Alright, let's skip the root cgroup for now. I think the point here is
> if we want to provide sync() isolation among cgroups or not.
> 
> According to the manpage:
> 
>        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
>        written to the underlying filesystems.
> 
> And:
>        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
>        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
>        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
>        tem or filesystem respectively.
> 
> Excluding the root cgroup, do you think a sync() issued inside a
> specific cgroup should wait for I/O completions only for the writes that
> have been generated by that cgroup?

Can we account I/O towards the cgroup which issued "sync" only if the
write rate of the sync cgroup is higher than that of the cgroup the page
belongs to? Would that solve the problem, assuming it's doable?

Vivek

* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-21 21:47         ` Vivek Goyal
@ 2019-01-28 17:41           ` Andrea Righi
  2019-01-28 19:26             ` Vivek Goyal
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Righi @ 2019-01-28 17:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Josef Bacik, Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe,
	Dennis Zhou, cgroups, linux-block, linux-kernel

Hi Vivek,

sorry for the late reply.

On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> 
> [..]
> > Alright, let's skip the root cgroup for now. I think the point here is
> > if we want to provide sync() isolation among cgroups or not.
> > 
> > According to the manpage:
> > 
> >        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
> >        written to the underlying filesystems.
> > 
> > And:
> >        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> >        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
> >        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
> >        tem or filesystem respectively.
> > 
> > Excluding the root cgroup, do you think a sync() issued inside a
> > specific cgroup should wait for I/O completions only for the writes that
> > have been generated by that cgroup?
> 
> Can we account I/O towards the cgroup which issued "sync" only if write
> rate of sync cgroup is higher than cgroup to which page belongs to. Will
> that solve problem, assuming its doable?

Maybe this would mitigate the problem, in part, but it doesn't solve it.

The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
issues "sync", the fast cgroup needs to wait a lot, because writeback is
happening at the speed of the slow cgroup.

Ideally in this case we should bump up the writeback speed, maybe even
temporarily inherit the write rate of the sync cgroup, similar to
priority inheritance in a priority-inversion locking scenario, but I
don't think that's doable at the moment without big changes.

Or we could isolate the sync domain, meaning that a cgroup issuing a
sync would only wait for the writeback of the pages that belong to that
cgroup. But this method probably also requires big changes...

-Andrea

* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-28 17:41           ` Andrea Righi
@ 2019-01-28 19:26             ` Vivek Goyal
  2019-01-29 18:39               ` Andrea Righi
  0 siblings, 1 reply; 19+ messages in thread
From: Vivek Goyal @ 2019-01-28 19:26 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Josef Bacik, Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe,
	Dennis Zhou, cgroups, linux-block, linux-kernel

On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> Hi Vivek,
> 
> sorry for the late reply.
> 
> On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > 
> > [..]
> > > Alright, let's skip the root cgroup for now. I think the point here is
> > > if we want to provide sync() isolation among cgroups or not.
> > > 
> > > According to the manpage:
> > > 
> > >        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
> > >        written to the underlying filesystems.
> > > 
> > > And:
> > >        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > >        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
> > >        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
> > >        tem or filesystem respectively.
> > > 
> > > Excluding the root cgroup, do you think a sync() issued inside a
> > > specific cgroup should wait for I/O completions only for the writes that
> > > have been generated by that cgroup?
> > 
> > Can we account I/O towards the cgroup which issued "sync" only if write
> > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > that solve problem, assuming its doable?
> 
> Maybe this would mitigate the problem, in part, but it doesn't solve it.
> 
> The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> issues "sync", the fast cgroup needs to wait a lot, because writeback is
> happening at the speed of the slow cgroup.

Hi Andrea,

But that's true only for I/O which has already been submitted to the
block layer, right? Any new I/O yet to be submitted could still be
attributed to the faster cgroup requesting the sync.

Unless cgroup limits are absurdly low, it should not take very long for
already-submitted I/O to finish. If so, then in practice it might not be
a big problem?

Vivek

> 
> Ideally in this case we should bump up the writeback speed, maybe even
> temporarily inherit the write rate of the sync cgroup, similarly to a
> priority-inversion locking scenario, but I think it's not doable at the
> moment without applying big changes.
> 
> Or we could isolate the sync domain, meaning that a cgroup issuing a
> sync will only wait for the syncing of the pages that belong to that
> sync cgroup. But probably also this method requires big changes...
> 
> -Andrea

* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-28 19:26             ` Vivek Goyal
@ 2019-01-29 18:39               ` Andrea Righi
  2019-01-29 18:50                 ` Josef Bacik
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Righi @ 2019-01-29 18:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Josef Bacik, Tejun Heo, Li Zefan, Johannes Weiner, Jens Axboe,
	Dennis Zhou, cgroups, linux-block, linux-kernel

On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote:
> On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> > Hi Vivek,
> > 
> > sorry for the late reply.
> > 
> > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > > 
> > > [..]
> > > > Alright, let's skip the root cgroup for now. I think the point here is
> > > > if we want to provide sync() isolation among cgroups or not.
> > > > 
> > > > According to the manpage:
> > > > 
> > > >        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
> > > >        written to the underlying filesystems.
> > > > 
> > > > And:
> > > >        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > > >        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
> > > >        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
> > > >        tem or filesystem respectively.
> > > > 
> > > > Excluding the root cgroup, do you think a sync() issued inside a
> > > > specific cgroup should wait for I/O completions only for the writes that
> > > > have been generated by that cgroup?
> > > 
> > > Can we account I/O towards the cgroup which issued "sync" only if write
> > > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > > that solve problem, assuming its doable?
> > 
> > Maybe this would mitigate the problem, in part, but it doesn't solve it.
> > 
> > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> > issues "sync", the fast cgroup needs to wait a lot, because writeback is
> > happening at the speed of the slow cgroup.
> 
> Hi Andrea,
> 
> But that's true only for I/O which has already been submitted to block
> layer, right? Any new I/O yet to be submitted could still be attributed
> to faster cgroup requesting sync.

Right. If we could bump up the rate of the new I/O yet to be submitted,
I think we could effectively prevent the priority-inversion problem (the
ongoing writeback I/O should be negligible).

> 
> Until and unless cgroups limits are absurdly low, it should not take very
> long for already submitted I/O to finish. If yes, then in practice, it
> might not be a big problem?

I was actually doing my tests with a very low limit (1MB/s both for rbps
and wbps), but this shows the problem very well I think.

Here's what I'm doing:

 [ slow cgroup (1Mbps read/write) ]

   $ cat /sys/fs/cgroup/unified/cg1/io.max
   259:0 rbps=1048576 wbps=1048576 riops=max wiops=max
   $ cat /proc/self/cgroup
   0::/cg1

   $ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based --runtime=30

 [ fast cgroup (root cgroup, no limitation) ]

   # cat /proc/self/cgroup
   0::/

   # time sync
   real	9m32,618s
   user	0m0,000s
   sys	0m0,018s

With this simple test I can easily trigger hung task timeout warnings
and make the whole system totally sluggish (even the processes running
in the root cgroup).

When fio ends, writeback is still taking forever to complete, as you can
see from the insane amount of time that sync takes to complete.

-Andrea

* Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
  2019-01-29 18:39               ` Andrea Righi
@ 2019-01-29 18:50                 ` Josef Bacik
  0 siblings, 0 replies; 19+ messages in thread
From: Josef Bacik @ 2019-01-29 18:50 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Vivek Goyal, Josef Bacik, Tejun Heo, Li Zefan, Johannes Weiner,
	Jens Axboe, Dennis Zhou, cgroups, linux-block, linux-kernel

On Tue, Jan 29, 2019 at 07:39:38PM +0100, Andrea Righi wrote:
> On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote:
> > On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> > > Hi Vivek,
> > > 
> > > sorry for the late reply.
> > > 
> > > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > > > 
> > > > [..]
> > > > > Alright, let's skip the root cgroup for now. I think the point here is
> > > > > if we want to provide sync() isolation among cgroups or not.
> > > > > 
> > > > > According to the manpage:
> > > > > 
> > > > >        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
> > > > >        written to the underlying filesystems.
> > > > > 
> > > > > And:
> > > > >        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > > > >        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
> > > > >        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
> > > > >        tem or filesystem respectively.
> > > > > 
> > > > > Excluding the root cgroup, do you think a sync() issued inside a
> > > > > specific cgroup should wait for I/O completions only for the writes that
> > > > > have been generated by that cgroup?
> > > > 
> > > > Can we account I/O towards the cgroup which issued "sync" only if write
> > > > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > > > that solve problem, assuming its doable?
> > > 
> > > Maybe this would mitigate the problem, in part, but it doesn't solve it.
> > > 
> > > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> > > issues "sync", the fast cgroup needs to wait a lot, because writeback is
> > > happening at the speed of the slow cgroup.
> > 
> > Hi Andrea,
> > 
> > But that's true only for I/O which has already been submitted to block
> > layer, right? Any new I/O yet to be submitted could still be attributed
> > to faster cgroup requesting sync.
> 
> Right. If we could bump up the new I/O yet to be submitted I think we
> could effectively prevent the priority inversion problem (the ongoing
> writeback I/O should be negligible).
> 
> > 
> > Until and unless cgroups limits are absurdly low, it should not take very
> > long for already submitted I/O to finish. If yes, then in practice, it
> > might not be a big problem?
> 
> I was actually doing my tests with a very low limit (1MB/s both for rbps
> and wbps), but this shows the problem very well I think.
> 
> Here's what I'm doing:
> 
>  [ slow cgroup (1Mbps read/write) ]
> 
>    $ cat /sys/fs/cgroup/unified/cg1/io.max
>    259:0 rbps=1048576 wbps=1048576 riops=max wiops=max
>    $ cat /proc/self/cgroup
>    0::/cg1
> 
>    $ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based --runtime=30
> 
>  [ fast cgroup (root cgroup, no limitation) ]
> 
>    # cat /proc/self/cgroup
>    0::/
> 
>    # time sync
>    real	9m32,618s
>    user	0m0,000s
>    sys	0m0,018s
> 
> With this simple test I can easily trigger hung task timeout warnings
> and make the whole system totally sluggish (even the processes running
> in the root cgroup).
> 
> When fio ends, writeback is still taking forever to complete, as you can
> see by the insane amount that sync takes to complete.
> 

Yeah sync() needs to be treated differently, but it's kind of special too.  We
don't want the slow cgroup running sync() to back up a fast cgroup doing
sync(), because we make all of the IO go based on the submitting cgroup.  The
problem here is we don't know who's more important until we get to the blk
cgroup layer, and even then sometimes we can't tell (different hierarchies
would make this tricky with io.weight or io.latency).

We could treat it like REQ_META and just let everything go through and
back-charge.  This feels like a way for the slow group to cheat, though,
unless we just throttle the shit out of it before returning to user space.
I'll have to think about this some more.  Thanks,

Josef

end of thread, other threads:[~2019-01-29 18:50 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-18 10:31 [RFC PATCH 0/3] cgroup: fsio throttle controller Andrea Righi
2019-01-18 10:31 ` [RFC PATCH 1/3] fsio-throttle: documentation Andrea Righi
2019-01-18 10:31 ` [RFC PATCH 2/3] fsio-throttle: controller infrastructure Andrea Righi
2019-01-18 10:31 ` [RFC PATCH 3/3] fsio-throttle: instrumentation Andrea Righi
2019-01-18 11:04 ` [RFC PATCH 0/3] cgroup: fsio throttle controller Paolo Valente
2019-01-18 11:10   ` Andrea Righi
2019-01-18 11:11     ` Paolo Valente
2019-01-18 16:35 ` Josef Bacik
2019-01-18 17:07   ` Paolo Valente
2019-01-18 17:12     ` Josef Bacik
2019-01-18 19:02     ` Andrea Righi
2019-01-18 18:44   ` Andrea Righi
2019-01-18 19:46     ` Josef Bacik
2019-01-19 10:08       ` Andrea Righi
2019-01-21 21:47         ` Vivek Goyal
2019-01-28 17:41           ` Andrea Righi
2019-01-28 19:26             ` Vivek Goyal
2019-01-29 18:39               ` Andrea Righi
2019-01-29 18:50                 ` Josef Bacik
