* [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
@ 2017-12-04 16:12 Kirill Tkhai
  2017-12-04 16:12 ` [PATCH 1/5] aio: Move aio_nr increment to separate function Kirill Tkhai
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 16:12 UTC (permalink / raw)
  To: axboe, bcrl, viro, tj, linux-block, linux-kernel, linux-aio,
	oleg, ktkhai

Hi,

this patch set introduces per-blkio-cgroup accounting of aio_nr and aio_max_nr.
It may be used to limit the number of aio requests available to a cgroup,
which could be useful for containers.

The accounting is hierarchical: aio contexts allocated in a child cgroup
are accounted in its parent cgroups too.

Also, a new cgroup file "io.aio_nr" is introduced. It is used to set a cgroup's
aio request limit, and to show the current limit and the number of currently
occupied requests.

Patches 1-3 are refactoring.
Patch 4 is where the accounting is actually introduced.
Patch 5 adds the "io.aio_nr" file.
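
For illustration only (this snippet is not part of the series, and setting up
the cgroup and its limit is assumed to have been done separately): a task in
a cgroup whose "io.aio_nr" limit is exhausted would see io_setup() fail with
EAGAIN, just as it does today when the global aio-max-nr is reached:

	#include <stdio.h>
	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	int main(void)
	{
		unsigned long ctx = 0;

		/* ask for 128 events; fails if the cgroup's aio limit is hit */
		if (syscall(SYS_io_setup, 128, &ctx) < 0 && errno == EAGAIN)
			printf("aio request limit reached for this cgroup\n");
		return 0;
	}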

---

Kirill Tkhai (5):
      aio: Move aio_nr increment to separate function
      aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
      blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
      blkcg: Charge aio requests in blkio cgroup hierarchy
      blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr


 block/blk-cgroup.c         |   88 +++++++++++++++++++++++++-
 fs/aio.c                   |  151 ++++++++++++++++++++++++++++++++++++++++----
 include/linux/aio.h        |   21 ++++++
 include/linux/blk-cgroup.h |    4 +
 4 files changed, 247 insertions(+), 17 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/5] aio: Move aio_nr increment to separate function
  2017-12-04 16:12 [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Kirill Tkhai
@ 2017-12-04 16:12 ` Kirill Tkhai
  2017-12-04 16:13 ` [PATCH 2/5] aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h Kirill Tkhai
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 16:12 UTC (permalink / raw)
  To: axboe, bcrl, viro, tj, linux-block, linux-kernel, linux-aio,
	oleg, ktkhai

There are no functional changes, only preparation
for the next patches.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/aio.c |   44 ++++++++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index e6de7715228c..04209c0561b2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -694,13 +694,39 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
 	}
 }
 
-static void aio_nr_sub(unsigned nr)
+static bool __try_to_charge_aio_nr(unsigned nr)
+{
+	if (aio_nr + nr > aio_max_nr ||
+	    aio_nr + nr < aio_nr)
+		return false;
+
+	aio_nr += nr;
+	return true;
+}
+
+static void __uncharge_aio_nr(unsigned nr)
 {
-	spin_lock(&aio_nr_lock);
 	if (WARN_ON(aio_nr - nr > aio_nr))
 		aio_nr = 0;
 	else
 		aio_nr -= nr;
+}
+
+static bool try_to_charge_aio_nr(unsigned nr)
+{
+	bool ret;
+
+	spin_lock(&aio_nr_lock);
+	ret = __try_to_charge_aio_nr(nr);
+	spin_unlock(&aio_nr_lock);
+
+	return ret;
+}
+
+static void uncharge_aio_nr(unsigned nr)
+{
+	spin_lock(&aio_nr_lock);
+	__uncharge_aio_nr(nr);
 	spin_unlock(&aio_nr_lock);
 }
 
@@ -776,15 +802,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 		ctx->req_batch = 1;
 
 	/* limit the number of system wide aios */
-	spin_lock(&aio_nr_lock);
-	if (aio_nr + ctx->max_reqs > aio_max_nr ||
-	    aio_nr + ctx->max_reqs < aio_nr) {
-		spin_unlock(&aio_nr_lock);
-		err = -EAGAIN;
+	err = -EAGAIN;
+	if (!try_to_charge_aio_nr(ctx->max_reqs))
 		goto err_ctx;
-	}
-	aio_nr += ctx->max_reqs;
-	spin_unlock(&aio_nr_lock);
 
 	percpu_ref_get(&ctx->users);	/* io_setup() will drop this ref */
 	percpu_ref_get(&ctx->reqs);	/* free_ioctx_users() will drop this */
@@ -801,7 +821,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	return ctx;
 
 err_cleanup:
-	aio_nr_sub(ctx->max_reqs);
+	uncharge_aio_nr(ctx->max_reqs);
 err_ctx:
 	atomic_set(&ctx->dead, 1);
 	if (ctx->mmap_size)
@@ -848,7 +868,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	 * -EAGAIN with no ioctxs actually in use (as far as userspace
 	 *  could tell).
 	 */
-	aio_nr_sub(ctx->max_reqs);
+	uncharge_aio_nr(ctx->max_reqs);
 
 	if (ctx->mmap_size)
 		vm_munmap(ctx->mmap_base, ctx->mmap_size);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/5] aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
  2017-12-04 16:12 [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Kirill Tkhai
  2017-12-04 16:12 ` [PATCH 1/5] aio: Move aio_nr increment to separate function Kirill Tkhai
@ 2017-12-04 16:13 ` Kirill Tkhai
  2017-12-04 16:13 ` [PATCH 3/5] blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr Kirill Tkhai
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 16:13 UTC (permalink / raw)
  To: axboe, bcrl, viro, tj, linux-block, linux-kernel, linux-aio,
	oleg, ktkhai

The next patch will use these values in more files, so let's make them externally visible.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/aio.c            |    4 ++--
 include/linux/aio.h |    4 ++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 04209c0561b2..9dc98a29077c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -188,10 +188,10 @@ struct aio_kiocb {
 	struct eventfd_ctx	*ki_eventfd;
 };
 
+DEFINE_SPINLOCK(aio_nr_lock);
 /*------ sysctl variables----*/
-static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;		/* current system wide number of aio requests */
-unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
+unsigned long aio_max_nr = AIO_NR_DEF; /* system wide maximum number of aio requests */
 /*----end sysctl variables---*/
 
 static struct kmem_cache	*kiocb_cachep;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 9d8aabecfe2d..5dda2663802f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -10,6 +10,10 @@ struct mm_struct;
 
 #define KIOCB_KEY		0
 
+#define AIO_NR_INF		(-1UL)
+#define AIO_NR_DEF		0x10000
+
+extern spinlock_t aio_nr_lock;
 typedef int (kiocb_cancel_fn)(struct kiocb *);
 
 /* prototypes */

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/5] blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
  2017-12-04 16:12 [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Kirill Tkhai
  2017-12-04 16:12 ` [PATCH 1/5] aio: Move aio_nr increment to separate function Kirill Tkhai
  2017-12-04 16:13 ` [PATCH 2/5] aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h Kirill Tkhai
@ 2017-12-04 16:13 ` Kirill Tkhai
  2017-12-04 16:13 ` [PATCH 4/5] blkcg: Charge aio requests in blkio cgroup hierarchy Kirill Tkhai
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 16:13 UTC (permalink / raw)
  To: axboe, bcrl, viro, tj, linux-block, linux-kernel, linux-aio,
	oleg, ktkhai

This adds new members to struct blkcg, which will
be used to account the number of a cgroup's aio requests.

Also, blkcg_root is now used to store the sysctl variables
aio_nr and aio_max_nr.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-cgroup.c         |    4 ++++
 fs/aio.c                   |    2 ++
 include/linux/aio.h        |    6 ++++++
 include/linux/blk-cgroup.h |    4 ++++
 4 files changed, 16 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d3f56baee936..774560469b01 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -27,6 +27,7 @@
 #include <linux/atomic.h>
 #include <linux/ctype.h>
 #include <linux/blk-cgroup.h>
+#include <linux/aio.h>
 #include "blk.h"
 
 #define MAX_KEY_LEN 100
@@ -1101,6 +1102,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
+#ifdef CONFIG_AIO
+	blkcg->blkg_aio_max_nr = parent_css ? AIO_NR_INF : AIO_NR_DEF;
 #endif
 	list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);
 
diff --git a/fs/aio.c b/fs/aio.c
index 9dc98a29077c..755f97a42ebe 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -189,10 +189,12 @@ struct aio_kiocb {
 };
 
 DEFINE_SPINLOCK(aio_nr_lock);
+#ifndef CONFIG_BLK_CGROUP
 /*------ sysctl variables----*/
 unsigned long aio_nr;		/* current system wide number of aio requests */
 unsigned long aio_max_nr = AIO_NR_DEF; /* system wide maximum number of aio requests */
 /*----end sysctl variables---*/
+#endif
 
 static struct kmem_cache	*kiocb_cachep;
 static struct kmem_cache	*kioctx_cachep;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 5dda2663802f..de929a8c9c59 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX__AIO_H
 #define __LINUX__AIO_H
 
+#include <linux/blk-cgroup.h>
 #include <linux/aio_abi.h>
 
 struct kioctx;
@@ -26,8 +27,13 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
 				       kiocb_cancel_fn *cancel) { }
 #endif /* CONFIG_AIO */
 
+#if !defined(CONFIG_BLK_CGROUP) || !defined(CONFIG_AIO)
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
+#else
+#define aio_nr		blkcg_root.blkg_aio_nr
+#define aio_max_nr	blkcg_root.blkg_aio_max_nr
+#endif /* !CONFIG_BLK_CGROUP || !CONFIG_AIO */
 
 #endif /* __LINUX__AIO_H */
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 8bbc3716507a..3d9c176fc173 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -55,6 +55,10 @@ struct blkcg {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct list_head		cgwb_list;
 #endif
+#ifdef CONFIG_AIO
+	unsigned long			blkg_aio_nr;
+	unsigned long			blkg_aio_max_nr;
+#endif
 };
 
 /*

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/5] blkcg: Charge aio requests in blkio cgroup hierarchy
  2017-12-04 16:12 [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Kirill Tkhai
                   ` (2 preceding siblings ...)
  2017-12-04 16:13 ` [PATCH 3/5] blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr Kirill Tkhai
@ 2017-12-04 16:13 ` Kirill Tkhai
  2017-12-04 16:13 ` [PATCH 5/5] blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr Kirill Tkhai
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 16:13 UTC (permalink / raw)
  To: axboe, bcrl, viro, tj, linux-block, linux-kernel, linux-aio,
	oleg, ktkhai

This patch adds per-blkio-cgroup accounting of the number of requests
of allocated aio contexts, and aggregates child cgroups' requests
up the hierarchy. This may be used to limit the aio requests available
to containers.

By default, a newly allocated blkcg::blkg_aio_max_nr is set
to the "unlimited" value (see blkcg_css_alloc() in the previous patch).
This guarantees that applications which do not know about
blkcg::blkg_aio_max_nr will be able to run as they did before,
without configuring the child cgroup's blkg_aio_max_nr.

For protection "task attach" vs "io_context create/destroy"
read locked cgroup_threadgroup_rwsem is used. We take it
via cgroup_threadgroup_change_*() interfaces, which are used
around the places we charge kioctx::max_reqs and link a ctx
to mm_struct::ioctx_table.

Single allocations are protected by aio_nr_lock, as before.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-cgroup.c  |   44 +++++++++++++++++++++-
 fs/aio.c            |  101 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/aio.h |   11 ++++++
 3 files changed, 153 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 774560469b01..9cc6e9574946 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1217,8 +1217,8 @@ void blkcg_exit_queue(struct request_queue *q)
  */
 static int blkcg_can_attach(struct cgroup_taskset *tset)
 {
-	struct task_struct *task;
-	struct cgroup_subsys_state *dst_css;
+	struct cgroup_subsys_state *dst_css, *old_css;
+	struct task_struct *task, *p;
 	struct io_context *ioc;
 	int ret = 0;
 
@@ -1230,11 +1230,46 @@ static int blkcg_can_attach(struct cgroup_taskset *tset)
 			ret = -EINVAL;
 		task_unlock(task);
 		if (ret)
-			break;
+			goto err;
+		if (!thread_group_leader(task))
+			continue;
+		ret = charge_task_aio_nr(task, css_to_blkcg(dst_css));
+		if (ret)
+			goto err;
+		old_css = task_css(task, io_cgrp_id);
+		uncharge_task_aio_nr(task, css_to_blkcg(old_css));
+	}
+err:
+	if (ret) {
+		cgroup_taskset_for_each(p, dst_css, tset) {
+			if (p == task)
+				break;
+			if (!thread_group_leader(p))
+				continue;
+			uncharge_task_aio_nr(p, css_to_blkcg(dst_css));
+			old_css = task_css(p, io_cgrp_id);
+			WARN_ON_ONCE(charge_task_aio_nr(p, css_to_blkcg(old_css)));
+		}
 	}
 	return ret;
 }
 
+#ifdef CONFIG_AIO
+static void blkcg_cancel_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *dst_css, *old_css;
+	struct task_struct *p;
+
+	cgroup_taskset_for_each(p, dst_css, tset) {
+		if (!thread_group_leader(p))
+			continue;
+		uncharge_task_aio_nr(p, css_to_blkcg(dst_css));
+		old_css = task_css(p, io_cgrp_id);
+		WARN_ON_ONCE(charge_task_aio_nr(p, css_to_blkcg(old_css)));
+	}
+}
+#endif
+
 static void blkcg_bind(struct cgroup_subsys_state *root_css)
 {
 	int i;
@@ -1260,6 +1295,9 @@ struct cgroup_subsys io_cgrp_subsys = {
 	.css_offline = blkcg_css_offline,
 	.css_free = blkcg_css_free,
 	.can_attach = blkcg_can_attach,
+#ifdef CONFIG_AIO
+	.cancel_attach = blkcg_cancel_attach,
+#endif
 	.bind = blkcg_bind,
 	.dfl_cftypes = blkcg_files,
 	.legacy_cftypes = blkcg_legacy_files,
diff --git a/fs/aio.c b/fs/aio.c
index 755f97a42ebe..2e63f5c582c0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/cgroup-defs.h>
 
 #include <asm/kmap_types.h>
 #include <linux/uaccess.h>
@@ -696,6 +697,97 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
 	}
 }
 
+#ifdef CONFIG_BLK_CGROUP
+static bool try_to_charge_blkcg(unsigned long nr, struct blkcg *blkg)
+{
+	struct blkcg *tmp = blkg;
+
+	while (blkg) {
+		if (nr + blkg->blkg_aio_nr > blkg->blkg_aio_max_nr ||
+		    nr + blkg->blkg_aio_nr < nr)
+			goto fail;
+
+		blkg->blkg_aio_nr += nr;
+		blkg = blkcg_parent(blkg);
+	}
+
+	return true;
+fail:
+	while (tmp != blkg) {
+		tmp->blkg_aio_nr -= nr;
+		tmp = blkcg_parent(tmp);
+	}
+	return false;
+}
+
+
+static void uncharge_blkcg(unsigned long nr, struct blkcg *blkg)
+{
+	while (blkg) {
+		blkg->blkg_aio_nr -= nr;
+		blkg = blkcg_parent(blkg);
+	}
+}
+
+static bool __try_to_charge_aio_nr(unsigned nr)
+{
+	struct blkcg *blkg;
+
+	percpu_rwsem_assert_held(&cgroup_threadgroup_rwsem);
+	blkg = container_of(task_css_check(current, io_cgrp_id, true),
+			     struct blkcg, css);
+	return try_to_charge_blkcg(nr, blkg);
+}
+
+static void __uncharge_aio_nr(unsigned nr)
+{
+	struct blkcg *blkg;
+
+	percpu_rwsem_assert_held(&cgroup_threadgroup_rwsem);
+	blkg = container_of(task_css_check(current, io_cgrp_id, true),
+			     struct blkcg, css);
+	uncharge_blkcg(nr, blkg);
+}
+
+static unsigned long get_task_max_reqs(struct task_struct *p)
+{
+	struct kioctx_table *tbl;
+	unsigned long nr = 0;
+	struct kioctx *ctx;
+	int i;
+
+	if (p->flags & PF_KTHREAD)
+		return 0;
+	/* rwsem must be write locked */
+	tbl = rcu_dereference_protected(p->mm->ioctx_table,
+			percpu_rwsem_is_held(&cgroup_threadgroup_rwsem));
+	if (!tbl)
+		return 0;
+	for (i = 0; i < tbl->nr; i++) {
+		ctx = tbl->table[i];
+		if (!ctx)
+			continue;
+		nr += ctx->max_reqs;
+	}
+	return nr;
+}
+
+int charge_task_aio_nr(struct task_struct *p, struct blkcg *blkg)
+{
+	unsigned long nr = get_task_max_reqs(p);
+
+	if (!nr || try_to_charge_blkcg(nr, blkg))
+		return 0;
+	return -ENOMEM;
+}
+
+void uncharge_task_aio_nr(struct task_struct *p, struct blkcg *blkg)
+{
+	unsigned long nr = get_task_max_reqs(p);
+	if (nr)
+		uncharge_blkcg(nr, blkg);
+}
+#else
 static bool __try_to_charge_aio_nr(unsigned nr)
 {
 	if (aio_nr + nr > aio_max_nr ||
@@ -713,6 +805,7 @@ static void __uncharge_aio_nr(unsigned nr)
 	else
 		aio_nr -= nr;
 }
+#endif /* CONFIG_BLK_CGROUP */
 
 static bool try_to_charge_aio_nr(unsigned nr)
 {
@@ -803,6 +896,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	if (ctx->req_batch < 1)
 		ctx->req_batch = 1;
 
+	cgroup_threadgroup_change_begin(current);
+
 	/* limit the number of system wide aios */
 	err = -EAGAIN;
 	if (!try_to_charge_aio_nr(ctx->max_reqs))
@@ -815,6 +910,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	if (err)
 		goto err_cleanup;
 
+	cgroup_threadgroup_change_end(current);
+
 	/* Release the ring_lock mutex now that all setup is complete. */
 	mutex_unlock(&ctx->ring_lock);
 
@@ -825,6 +922,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 err_cleanup:
 	uncharge_aio_nr(ctx->max_reqs);
 err_ctx:
+	cgroup_threadgroup_change_end(current);
 	atomic_set(&ctx->dead, 1);
 	if (ctx->mmap_size)
 		vm_munmap(ctx->mmap_base, ctx->mmap_size);
@@ -849,9 +947,11 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 {
 	struct kioctx_table *table;
 
+	cgroup_threadgroup_change_begin(current);
 	spin_lock(&mm->ioctx_lock);
 	if (atomic_xchg(&ctx->dead, 1)) {
 		spin_unlock(&mm->ioctx_lock);
+		cgroup_threadgroup_change_end(current);
 		return -EINVAL;
 	}
 
@@ -871,6 +971,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	 *  could tell).
 	 */
 	uncharge_aio_nr(ctx->max_reqs);
+	cgroup_threadgroup_change_end(current);
 
 	if (ctx->mmap_size)
 		vm_munmap(ctx->mmap_base, ctx->mmap_size);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index de929a8c9c59..bf442e562a8f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -31,9 +31,20 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
+
+static inline int charge_task_aio_nr(struct task_struct *p, struct blkcg *g)
+{
+	return 0;
+}
+static inline void uncharge_task_aio_nr(struct task_struct *p, struct blkcg *g)
+{
+}
 #else
 #define aio_nr		blkcg_root.blkg_aio_nr
 #define aio_max_nr	blkcg_root.blkg_aio_max_nr
+
+extern int charge_task_aio_nr(struct task_struct *, struct blkcg *);
+extern void uncharge_task_aio_nr(struct task_struct *, struct blkcg *);
 #endif /* !CONFIG_BLK_CGROUP || !CONFIG_AIO */
 
 #endif /* __LINUX__AIO_H */

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/5] blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
  2017-12-04 16:12 [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Kirill Tkhai
                   ` (3 preceding siblings ...)
  2017-12-04 16:13 ` [PATCH 4/5] blkcg: Charge aio requests in blkio cgroup hierarchy Kirill Tkhai
@ 2017-12-04 16:13 ` Kirill Tkhai
  2017-12-04 16:52 ` [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Benjamin LaHaise
  2017-12-04 20:07 ` Tejun Heo
  6 siblings, 0 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 16:13 UTC (permalink / raw)
  To: axboe, bcrl, viro, tj, linux-block, linux-kernel, linux-aio,
	oleg, ktkhai

Add a file to configure the per-cgroup maximum number of available
aio requests.

Writing a value which is less than the currently allocated number
of requests is not allowed (-EBUSY is returned).

Reading the file shows the numbers of currently allocated and maximum available requests.
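
An illustrative sketch of the intended interaction (not part of the patch;
the cgroup path and name are assumptions, with cgroup2 mounted at
/sys/fs/cgroup):

	#include <stdio.h>

	int main(void)
	{
		char buf[64];
		FILE *f;

		/* set the cgroup's aio request limit */
		f = fopen("/sys/fs/cgroup/ct1/io.aio_nr", "w");
		if (!f)
			return 1;
		fprintf(f, "65536\n");
		fclose(f);

		/* read back the current usage and the limit */
		f = fopen("/sys/fs/cgroup/ct1/io.aio_nr", "r");
		if (!f)
			return 1;
		if (fgets(buf, sizeof(buf), f))
			printf("%s", buf);	/* e.g. "used=0, max=65536" */
		fclose(f);
		return 0;
	}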

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-cgroup.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9cc6e9574946..dc5600ef4523 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -981,12 +981,52 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_AIO
+static int blkcg_write_aio_max_nr(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 val)
+{
+	struct blkcg *blkg = css_to_blkcg(css);
+	int ret = 0;
+
+	percpu_down_read(&cgroup_threadgroup_rwsem);
+	spin_lock(&aio_nr_lock);
+	if (val >= blkg->blkg_aio_nr)
+		blkg->blkg_aio_max_nr = val;
+	else
+		ret = -EBUSY;
+	spin_unlock(&aio_nr_lock);
+	percpu_up_read(&cgroup_threadgroup_rwsem);
+	return ret;
+}
+
+static int blkcg_show_aio_nrs(struct seq_file *sf, void *v)
+{
+	struct blkcg *blkg = css_to_blkcg(seq_css(sf));
+	unsigned long max_nr, nr;
+
+	spin_lock(&aio_nr_lock);
+	max_nr = blkg->blkg_aio_max_nr;
+	nr = blkg->blkg_aio_nr;
+	spin_unlock(&aio_nr_lock);
+
+	seq_printf(sf, "used=%lu, max=%lu\n", nr, max_nr);
+	return 0;
+}
+#endif /* CONFIG_AIO */
+
 static struct cftype blkcg_files[] = {
 	{
 		.name = "stat",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = blkcg_print_stat,
 	},
+#ifdef CONFIG_AIO
+	{
+		.name = "aio_nr",
+		.write_u64 = blkcg_write_aio_max_nr,
+		.seq_show = blkcg_show_aio_nrs,
+	},
+#endif
 	{ }	/* terminate */
 };
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 16:12 [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Kirill Tkhai
                   ` (4 preceding siblings ...)
  2017-12-04 16:13 ` [PATCH 5/5] blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr Kirill Tkhai
@ 2017-12-04 16:52 ` Benjamin LaHaise
  2017-12-04 21:27   ` Kirill Tkhai
  2017-12-04 20:07 ` Tejun Heo
  6 siblings, 1 reply; 28+ messages in thread
From: Benjamin LaHaise @ 2017-12-04 16:52 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: axboe, viro, tj, linux-block, linux-kernel, linux-aio, oleg

Hi Kirill,

On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
> Hi,
> 
> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup.
> It may be used to limit number of aio requests, which are available for
> a cgroup, and could be useful for containers.
> 
> The accounting is hierarchical, and aio contexts, allocated in child cgroup,
> are accounted in parent cgroups too.
> 
> Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup
> aio requests limit, to show current limit and number of currenly occupied
> requests.

Where are your test cases to check this functionality in the libaio test
suite?

		-ben

> Patches 1-3 are refactoring.
> Patch 4 is the place where the accounting actually introduced.
> Patch 5 adds "io.aio_nr" file.
> 
> ---
> 
> Kirill Tkhai (5):
>       aio: Move aio_nr increment to separate function
>       aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
>       blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
>       blkcg: Charge aio requests in blkio cgroup hierarchy
>       blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
> 
> 
>  block/blk-cgroup.c         |   88 +++++++++++++++++++++++++-
>  fs/aio.c                   |  151 ++++++++++++++++++++++++++++++++++++++++----
>  include/linux/aio.h        |   21 ++++++
>  include/linux/blk-cgroup.h |    4 +
>  4 files changed, 247 insertions(+), 17 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> 

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 16:12 [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Kirill Tkhai
                   ` (5 preceding siblings ...)
  2017-12-04 16:52 ` [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Benjamin LaHaise
@ 2017-12-04 20:07 ` Tejun Heo
  2017-12-04 21:44   ` Kirill Tkhai
  6 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2017-12-04 20:07 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: axboe, bcrl, viro, linux-block, linux-kernel, linux-aio, oleg

Hello, Kirill.

On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup.
> It may be used to limit number of aio requests, which are available for
> a cgroup, and could be useful for containers.

Can you please explain how this is a fundamental resource which can't
be controlled otherwise?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 16:52 ` [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Benjamin LaHaise
@ 2017-12-04 21:27   ` Kirill Tkhai
  2017-12-04 21:35     ` Jeff Moyer
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 21:27 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: axboe, viro, tj, linux-block, linux-kernel, linux-aio, oleg

Hi, Benjamin,

On 04.12.2017 19:52, Benjamin LaHaise wrote:
> Hi Kirill,
> 
> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>> Hi,
>>
>> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup.
>> It may be used to limit number of aio requests, which are available for
>> a cgroup, and could be useful for containers.
>>
>> The accounting is hierarchical, and aio contexts, allocated in child cgroup,
>> are accounted in parent cgroups too.
>>
>> Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup
>> aio requests limit, to show current limit and number of currenly occupied
>> requests.
> 
> Where are your test cases to check this functionality in the libaio test
> suite?

I tried to find the actual libaio test suite repository URL on Google,
but there is no definite answer. Also, there is no such information anywhere
in the kernel tree (I hope I grepped right).

Could you please provide the URL where the actual upstream libaio can be obtained?

Kirill

> 
> 		-ben
> 
>> Patches 1-3 are refactoring.
>> Patch 4 is the place where the accounting actually introduced.
>> Patch 5 adds "io.aio_nr" file.
>>
>> ---
>>
>> Kirill Tkhai (5):
>>       aio: Move aio_nr increment to separate function
>>       aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
>>       blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
>>       blkcg: Charge aio requests in blkio cgroup hierarchy
>>       blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
>>
>>
>>  block/blk-cgroup.c         |   88 +++++++++++++++++++++++++-
>>  fs/aio.c                   |  151 ++++++++++++++++++++++++++++++++++++++++----
>>  include/linux/aio.h        |   21 ++++++
>>  include/linux/blk-cgroup.h |    4 +
>>  4 files changed, 247 insertions(+), 17 deletions(-)
>>
>> --
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 21:27   ` Kirill Tkhai
@ 2017-12-04 21:35     ` Jeff Moyer
  2017-12-04 21:48       ` Kirill Tkhai
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Moyer @ 2017-12-04 21:35 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Benjamin LaHaise, axboe, viro, tj, linux-block, linux-kernel,
	linux-aio, oleg

Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> Hi, Benjamin,
>
> On 04.12.2017 19:52, Benjamin LaHaise wrote:
>> Hi Kirill,
>> 
>> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>>> Hi,
>>>
>>> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup.
>>> It may be used to limit number of aio requests, which are available for
>>> a cgroup, and could be useful for containers.
>>>
>>> The accounting is hierarchical, and aio contexts, allocated in child cgroup,
>>> are accounted in parent cgroups too.
>>>
>>> Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup
>>> aio requests limit, to show current limit and number of currenly occupied
>>> requests.
>> 
>> Where are your test cases to check this functionality in the libaio test
>> suite?
>
> I tried to find actual libaio test suite repository url in google,
> but there is no certain answer. Also, there is no information in kernel
> anywhere (I hope I grepped right).
>
> Could you please provide url where actual upstream libaio could be obtained?

https://pagure.io/libaio

Patches can be sent to this list (linux-aio).

Cheers,
Jeff

>>> Patches 1-3 are refactoring.
>>> Patch 4 is the place where the accounting actually introduced.
>>> Patch 5 adds "io.aio_nr" file.
>>>
>>> ---
>>>
>>> Kirill Tkhai (5):
>>>       aio: Move aio_nr increment to separate function
>>>       aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
>>>       blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
>>>       blkcg: Charge aio requests in blkio cgroup hierarchy
>>>       blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
>>>
>>>
>>>  block/blk-cgroup.c         |   88 +++++++++++++++++++++++++-
>>>  fs/aio.c                   |  151 ++++++++++++++++++++++++++++++++++++++++----
>>>  include/linux/aio.h        |   21 ++++++
>>>  include/linux/blk-cgroup.h |    4 +
>>>  4 files changed, 247 insertions(+), 17 deletions(-)
>>>
>>> --
>>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>>
>> 
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majordomo@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 20:07 ` Tejun Heo
@ 2017-12-04 21:44   ` Kirill Tkhai
  2017-12-04 21:52     ` Tejun Heo
  2017-12-05 15:19     ` Oleg Nesterov
  0 siblings, 2 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 21:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, bcrl, viro, linux-block, linux-kernel, linux-aio, oleg

Hello, Tejun,

On 04.12.2017 23:07, Tejun Heo wrote:
> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup.
>> It may be used to limit number of aio requests, which are available for
>> a cgroup, and could be useful for containers.
> 
> Can you please explain how this is a fundamental resource which can't
> be controlled otherwise?

Currently, aio_nr and aio_max_nr are global. In the case of containers this
means that a single container may occupy all the aio requests available
in the system and deprive the others of the possibility to use aio
at all. This may happen because of evil intentions of the container's
user or because of a program error, when the user does this accidentally.

My patch set makes it possible to guarantee that every container or cgroup has
its own number of allowed aios, so nobody can steal it, slow down
other containers, or force their programs to fall back to direct io.

AIO gives certain advantages to its users, and this patch set simply prevents
those advantages from being taken away without any possibility to protect against that.

This could be used by LXC or for starting some critical micro-services,
for example. 

Kirill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 21:35     ` Jeff Moyer
@ 2017-12-04 21:48       ` Kirill Tkhai
  0 siblings, 0 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 21:48 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Benjamin LaHaise, axboe, viro, tj, linux-block, linux-kernel,
	linux-aio, oleg

On 05.12.2017 00:35, Jeff Moyer wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> Hi, Benjamin,
>>
>> On 04.12.2017 19:52, Benjamin LaHaise wrote:
>>> Hi Kirill,
>>>
>>> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>>>> Hi,
>>>>
>>>> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup.
>>>> It may be used to limit number of aio requests, which are available for
>>>> a cgroup, and could be useful for containers.
>>>>
>>>> The accounting is hierarchical, and aio contexts, allocated in child cgroup,
>>>> are accounted in parent cgroups too.
>>>>
>>>> Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup
>>>> aio requests limit, to show current limit and number of currenly occupied
>>>> requests.
>>>
>>> Where are your test cases to check this functionality in the libaio test
>>> suite?
>>
>> I tried to find actual libaio test suite repository url in google,
>> but there is no certain answer. Also, there is no information in kernel
>> anywhere (I hope I grepped right).
>>
>> Could you please provide url where actual upstream libaio could be obtained?
> 
> https://pagure.io/libaio
> 
> Patches can be sent to this list (linux-aio).
> 

Thank you, Jeff.

Kirill

>>>> Patches 1-3 are refactoring.
>>>> Patch 4 is the place where the accounting actually introduced.
>>>> Patch 5 adds "io.aio_nr" file.
>>>>
>>>> ---
>>>>
>>>> Kirill Tkhai (5):
>>>>       aio: Move aio_nr increment to separate function
>>>>       aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
>>>>       blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
>>>>       blkcg: Charge aio requests in blkio cgroup hierarchy
>>>>       blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
>>>>
>>>>
>>>>  block/blk-cgroup.c         |   88 +++++++++++++++++++++++++-
>>>>  fs/aio.c                   |  151 ++++++++++++++++++++++++++++++++++++++++----
>>>>  include/linux/aio.h        |   21 ++++++
>>>>  include/linux/blk-cgroup.h |    4 +
>>>>  4 files changed, 247 insertions(+), 17 deletions(-)
>>>>
>>>> --
>>>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>>>
>>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-aio' in
>> the body to majordomo@kvack.org.  For more info on Linux AIO,
>> see: http://www.kvack.org/aio/
>> Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 21:44   ` Kirill Tkhai
@ 2017-12-04 21:52     ` Tejun Heo
  2017-12-04 22:49       ` Kirill Tkhai
  2017-12-05 15:19     ` Oleg Nesterov
  1 sibling, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2017-12-04 21:52 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: axboe, bcrl, viro, linux-block, linux-kernel, linux-aio, oleg

Hello, Kirill.

On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
> > Can you please explain how this is a fundamental resource which can't
> > be controlled otherwise?
> 
> Currently, aio_nr and aio_max_nr are global. In case of containers this
> means that a single container may occupy all aio requests, which are
> available in the system, and to deprive others possibility to use aio
> at all. This may happen because of evil intentions of the container's
> user or because of the program error, when the user makes this occasionally.

Hmm... I see.  It feels really wrong to me to make this a first-class
resource just because there is a system-wide limit.  The only reason I can
think of for the system-wide limit is to prevent too much kernel
memory being consumed by creating a lot of aios, but that squarely falls
inside cgroup memory controller protection.  If there are other
reasons why the number of aios should be limited system-wide, please
bring them up.

If the only reason is kernel memory consumption protection, the only
thing we need to do is making sure that memory used for aio commands
are accounted against cgroup kernel memory consumption and
relaxing/removing system wide limit.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 21:52     ` Tejun Heo
@ 2017-12-04 22:49       ` Kirill Tkhai
  2017-12-04 22:59         ` Jeff Moyer
  2017-12-04 23:02         ` Tejun Heo
  0 siblings, 2 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 22:49 UTC (permalink / raw)
  To: Tejun Heo, axboe, bcrl; +Cc: viro, linux-block, linux-kernel, linux-aio, oleg

On 05.12.2017 00:52, Tejun Heo wrote:
> Hello, Kirill.
> 
> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>> Can you please explain how this is a fundamental resource which can't
>>> be controlled otherwise?
>>
>> Currently, aio_nr and aio_max_nr are global. In case of containers this
>> means that a single container may occupy all aio requests, which are
>> available in the system, and to deprive others possibility to use aio
>> at all. This may happen because of evil intentions of the container's
>> user or because of the program error, when the user makes this occasionally.
> 
> Hmm... I see.  It feels really wrong to me to make this a first class
> resource because there is a system wide limit.  The only reason I can
> think of for the system wide limit is to prevent too much kernel
> memory consumed by creating a lot of aios but that squarely falls
> inside cgroup memory controller protection.  If there are other
> reasons why the number of aios should be limited system-wide, please
> bring them up.
>
> If the only reason is kernel memory consumption protection, the only
> thing we need to do is making sure that memory used for aio commands
> are accounted against cgroup kernel memory consumption and
> relaxing/removing system wide limit.

So, we just use the GFP_KERNEL_ACCOUNT flag for the allocation of internal aio
structures and pages, and all the memory will be accounted in kmem and
limited by memcg. Looks very good.
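
For concreteness, a rough sketch of where such flags could go (untested and
only illustrative; the call sites are taken from fs/aio.c, but the changes
themselves are just an assumption about how the accounting might be wired up):

	/* in aio_setup_ring(): account the ring-pages array to the memcg */
	ctx->ring_pages = kcalloc(nr_pages, sizeof(struct page *),
				  GFP_KERNEL_ACCOUNT);

	/* in aio_setup(): account aio_kiocb allocations via the slab cache */
	kiocb_cachep = KMEM_CACHE(aio_kiocb,
				  SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);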

One detail about memory consumption: io_submit() calls the
file_operations::write_iter and ::read_iter primitives. It is not clear to me
whether they consume the same memory as if the writev() or readv() system
calls were used instead. writev() may delay the actual write until the dirty
pages limit is reached, so it seems the accounting logic should be the same,
i.e. aio must not use more unaccounted system memory in filesystem internals
than a simple writev() does.

Could you please say whether you have any thoughts about this?

Kirill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 22:49       ` Kirill Tkhai
@ 2017-12-04 22:59         ` Jeff Moyer
  2017-12-04 23:14           ` Kirill Tkhai
  2017-12-04 23:02         ` Tejun Heo
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff Moyer @ 2017-12-04 22:59 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Tejun Heo, axboe, bcrl, viro, linux-block, linux-kernel, linux-aio, oleg

Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 05.12.2017 00:52, Tejun Heo wrote:
>> Hello, Kirill.
>> 
>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>>> Can you please explain how this is a fundamental resource which can't
>>>> be controlled otherwise?
>>>
>>> Currently, aio_nr and aio_max_nr are global. In case of containers this
>>> means that a single container may occupy all aio requests, which are
>>> available in the system, and to deprive others possibility to use aio
>>> at all. This may happen because of evil intentions of the container's
>>> user or because of the program error, when the user makes this occasionally.
>> 
>> Hmm... I see.  It feels really wrong to me to make this a first class
>> resource because there is a system wide limit.  The only reason I can
>> think of for the system wide limit is to prevent too much kernel
>> memory consumed by creating a lot of aios but that squarely falls
>> inside cgroup memory controller protection.  If there are other
>> reasons why the number of aios should be limited system-wide, please
>> bring them up.
>>
>> If the only reason is kernel memory consumption protection, the only
>> thing we need to do is making sure that memory used for aio commands
>> are accounted against cgroup kernel memory consumption and
>> relaxing/removing system wide limit.
>
> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
> structures and pages, and all the memory will be accounted in kmem and
> limited by memcg. Looks very good.
>
> One detail about memory consumption. io_submit() calls primitives
> file_operations::write_iter and read_iter. It's not clear for me whether
> they consume the same memory as if writev() or readv() system calls
> would be used instead. writev() may delay the actual write till dirty
> pages limit will be reached, so it seems logic of the accounting should
> be the same. So aio mustn't use more not accounted system memory in file
> system internals, then simple writev().
>
> Could you please to say if you have thoughts about this?

I think you just need to account the completion ring.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 22:49       ` Kirill Tkhai
  2017-12-04 22:59         ` Jeff Moyer
@ 2017-12-04 23:02         ` Tejun Heo
  2017-12-04 23:05           ` Kirill Tkhai
  1 sibling, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2017-12-04 23:02 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: axboe, bcrl, viro, linux-block, linux-kernel, linux-aio, oleg

Hello, Kirill.

On Tue, Dec 05, 2017 at 01:49:42AM +0300, Kirill Tkhai wrote:
> > If the only reason is kernel memory consumption protection, the only
> > thing we need to do is making sure that memory used for aio commands
> > are accounted against cgroup kernel memory consumption and
> > relaxing/removing system wide limit.
> 
> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
> structures and pages, and all the memory will be accounted in kmem and
> limited by memcg. Looks very good.

Yeah.

> One detail about memory consumption. io_submit() calls primitives
> file_operations::write_iter and read_iter. It's not clear for me whether
> they consume the same memory as if writev() or readv() system calls
> would be used instead. writev() may delay the actual write till dirty
> pages limit will be reached, so it seems logic of the accounting should
> be the same. So aio mustn't use more not accounted system memory in file
> system internals, then simple writev().
> 
> Could you please to say if you have thoughts about this?

I'm not too familiar with vfs / filesystems but I don't think there's
gonna be significant unaccounted memory consumption.  It shouldn't be
too difficult to find out with experiments too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 23:02         ` Tejun Heo
@ 2017-12-04 23:05           ` Kirill Tkhai
  0 siblings, 0 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 23:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, bcrl, viro, linux-block, linux-kernel, linux-aio, oleg

On 05.12.2017 02:02, Tejun Heo wrote:
> Hello, Kirill.
> 
> On Tue, Dec 05, 2017 at 01:49:42AM +0300, Kirill Tkhai wrote:
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is making sure that memory used for aio commands
>>> are accounted against cgroup kernel memory consumption and
>>> relaxing/removing system wide limit.
>>
>> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
>> structures and pages, and all the memory will be accounted in kmem and
>> limited by memcg. Looks very good.
> 
> Yeah.
> 
>> One detail about memory consumption. io_submit() calls primitives
>> file_operations::write_iter and read_iter. It's not clear for me whether
>> they consume the same memory as if writev() or readv() system calls
>> would be used instead. writev() may delay the actual write till dirty
>> pages limit will be reached, so it seems logic of the accounting should
>> be the same. So aio mustn't use more not accounted system memory in file
>> system internals, then simple writev().
>>
>> Could you please to say if you have thoughts about this?
> 
> I'm not too familiar with vfs / filesystems but I don't think there's
> gonna be significant unaccounted memory consumption.  It shouldn't be
> too difficult to find out with experiments too.
> 
> Thanks.

Thanks, Tejun!

Kirill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 22:59         ` Jeff Moyer
@ 2017-12-04 23:14           ` Kirill Tkhai
  2017-12-05 15:41             ` Jeff Moyer
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-04 23:14 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Tejun Heo, axboe, bcrl, viro, linux-block, linux-kernel, linux-aio, oleg

On 05.12.2017 01:59, Jeff Moyer wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 05.12.2017 00:52, Tejun Heo wrote:
>>> Hello, Kirill.
>>>
>>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>>>> Can you please explain how this is a fundamental resource which can't
>>>>> be controlled otherwise?
>>>>
>>>> Currently, aio_nr and aio_max_nr are global. In case of containers this
>>>> means that a single container may occupy all aio requests, which are
>>>> available in the system, and to deprive others possibility to use aio
>>>> at all. This may happen because of evil intentions of the container's
>>>> user or because of the program error, when the user makes this occasionally.
>>>
>>> Hmm... I see.  It feels really wrong to me to make this a first class
>>> resource because there is a system wide limit.  The only reason I can
>>> think of for the system wide limit is to prevent too much kernel
>>> memory consumed by creating a lot of aios but that squarely falls
>>> inside cgroup memory controller protection.  If there are other
>>> reasons why the number of aios should be limited system-wide, please
>>> bring them up.
>>>
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is making sure that memory used for aio commands
>>> are accounted against cgroup kernel memory consumption and
>>> relaxing/removing system wide limit.
>>
>> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
>> structures and pages, and all the memory will be accounted in kmem and
>> limited by memcg. Looks very good.
>>
>> One detail about memory consumption. io_submit() calls primitives
>> file_operations::write_iter and read_iter. It's not clear for me whether
>> they consume the same memory as if writev() or readv() system calls
>> would be used instead. writev() may delay the actual write till dirty
>> pages limit will be reached, so it seems logic of the accounting should
>> be the same. So aio mustn't use more not accounted system memory in file
>> system internals, then simple writev().
>>
>> Could you please to say if you have thoughts about this?
> 
> I think you just need to account the completion ring.

A request of type struct aio_kiocb consumes much more memory than
a struct io_event does. Shouldn't we account for it too?

Kirill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 21:44   ` Kirill Tkhai
  2017-12-04 21:52     ` Tejun Heo
@ 2017-12-05 15:19     ` Oleg Nesterov
  2017-12-05 15:35       ` Benjamin LaHaise
  1 sibling, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2017-12-05 15:19 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Tejun Heo, axboe, bcrl, viro, linux-block, linux-kernel, linux-aio

On 12/05, Kirill Tkhai wrote:
>
> Currently, aio_nr and aio_max_nr are global.

Yeah, I too tried to complain 2 years ago...

> In case of containers this
> means that a single container may occupy all aio requests, which are
> available in the system,

and memory. Let me quote my old emails...


This is off-topic, but the whole "vm" logic in aio_setup_ring()
looks sub-optimal. I do not mean the code; it just seems to me that it
is pointless to pollute the page cache and expose pages we
cannot swap/free to the LRU. Afaics we _only_ need this for migration.

This memory lives in the page cache/LRU and is visible to the shrinker, which
will unmap these pages for no reason on memory shortage. IOW, aio fools
the kernel: this memory looks reclaimable but it is not. And we only do
this for migration.

Even if this is not a problem, it does not look right. So perhaps at
least mapping_set_unevictable() makes sense. But I simply do not know
whether migration will work with this change.
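
For concreteness, a minimal sketch of what I mean (untested, just
illustrating the suggestion; the surrounding code of aio_private_file()
is elided):

	/* in aio_private_file(), fs/aio.c */
	inode->i_mapping->a_ops = &aio_ctx_aops;
	mapping_set_unevictable(inode->i_mapping);	/* proposed addition */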



Perhaps I missed something; it doesn't matter. But this means that
this memory is not accounted, so if I increase aio-max-nr then
this test case

	#include <stdio.h>
	#include <unistd.h>

	#define __NR_io_setup	206	/* x86_64 */

	int main(void)
	{
		int nr;

		/* create one-event aio contexts until io_setup() fails */
		for (nr = 0; ;++nr) {
			void *ctx = NULL;
			int ret = syscall(__NR_io_setup, 1, &ctx);
			if (ret) {
				printf("failed %d %m: ", nr);
				getchar();
			}
		}

		return 0;
	}

triggers the OOM-killer, which kills sshd and other daemons on my machine.
These pages were not even faulted in (or the shrinker can unmap them),
so the kernel cannot know who should be blamed.

Shouldn't we account aio events/pages somehow, say per-user, or in
mm->pinned_vm ?

I do not think this is unknown, and probably this is all fine. IOW,
this is just a question, not a bug report or anything like that.

And of course, this is not exploitable because aio-max-nr limits
the number of pages you can steal.

But otoh, aio_max_nr is system-wide, so an unprivileged user can
DoS (say) mysqld. And this leads to the same question: shouldn't
we account nr_events at least?

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-05 15:19     ` Oleg Nesterov
@ 2017-12-05 15:35       ` Benjamin LaHaise
  2017-12-06 17:32         ` Oleg Nesterov
  0 siblings, 1 reply; 28+ messages in thread
From: Benjamin LaHaise @ 2017-12-05 15:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, Tejun Heo, axboe, viro, linux-block, linux-kernel,
	linux-aio

On Tue, Dec 05, 2017 at 04:19:56PM +0100, Oleg Nesterov wrote:
> On 12/05, Kirill Tkhai wrote:
> >
> > Currently, aio_nr and aio_max_nr are global.
> 
> Yeah, I too tried to complain 2 years ago...
> 
> > In case of containers this
> > means that a single container may occupy all aio requests, which are
> > available in the system,
> 
> and memory. let me quote my old emails...
> 
> 
> This is off-topic, but the whole "vm" logic in aio_setup_ring()
> looks sub-optimal. I do not mean the code, just it seems to me it
> is pointless to pollute the page cache, and expose the pages we
> can not swap/free to lru. Afaics we _only_ need this for migration.

It is needed for migration, which is needed for hot unplug of memory.
There is no way around this.

> This memory lives in page-cache/lru, it is visible for shrinker which
> will unmap these pages for no reason on memory shortage. IOW, aio fools
> the kernel, this memory looks reclaimable but it is not. And we only do
> this for migration.

It's the same as any other memory that's mlock()ed into RAM.

> Even if this is not a problem, this does not look right. So perhaps at
> least mapping_set_unevictable() makes sense. But I simply do not know
> if migration will work with this change.
> 
> 
> 
> Perhaps I missed something, doesn't matter. But this means that
> this memory is not accounted, so if I increase aio-max-nr then
> this test-case
> 
> 	#define __NR_io_setup	206
> 
> 	int main(void)
> 	{
> 		int nr;
> 
> 		for (nr = 0; ;++nr) {
> 			void *ctx = NULL;
> 			int ret = syscall(__NR_io_setup, 1, &ctx);
> 			if (ret) {
> 				printf("failed %d %m: ", nr);
> 				getchar();
> 			}
> 		}
> 
> 		return 0;
> 	}
> 
> triggers OOM-killer which kills sshd and other daemons on my machine.
> These pages were not even faulted in (or the shrinker can unmap them),
> the kernel can not know who should be blamed.

The OOM-killer killed the wrong process: News at 11.  This is not new
behaviour, and has always been an issue.  If it bothered to kill the
process that was doing the allocations, the ioctxes would be freed and all
would be well.  Doesn't the OOM killer take into account which process is
allocating memory?  Seems like a pretty major deficiency if it isn't.

> Shouldn't we account aio events/pages somehow, say per-user, or in
> mm->pinned_vm ?

Sure, I have no objection to that.  Please send a patch!

> I do not think this is unkown, and probably this all is fine. IOW,
> this is just a question, not a bug-report or something like this.
> 
> And of course, this is not exploitable because aio-max-nr limits
> the number of pages you can steal.

Which is why the aio-max-nr limit exists!  That it is imprecise and
imperfect is not being denied.

> But otoh, aio_max_nr is system-wide, so the unpriviliged user can
> ddos (say) mysqld. And this leads to the same question: shouldn't
> we account nr_events at least?

Anyone can DDoS the local system - this is nothing new and has always been
the case.  I'm not opposed to improvements in this area.  The only issue I
have with Kirill's changes is that we need to have test cases for this new
functionality if the code is to be merged.

TBH, using memory accounting to limit things may be a better approach, as
that at least doesn't require changes to container implementations to
obtain a benefit from bounding aio's memory usage.

		-ben

> Oleg.
> 

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-04 23:14           ` Kirill Tkhai
@ 2017-12-05 15:41             ` Jeff Moyer
  2017-12-05 15:51               ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Moyer @ 2017-12-05 15:41 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Tejun Heo, axboe, bcrl, viro, linux-block, linux-kernel, linux-aio, oleg

Kirill Tkhai <ktkhai@virtuozzo.com> writes:

>> I think you just need to account the completion ring.
>
> A request of struct aio_kiocb type consumes much more memory, than
> struct io_event does. Shouldn't we account it too?

Not in my opinion.  The completion ring is the part that gets pinned for
long periods of time.

Just be sure to document this where appropriate.  Users/admins should
know that the aio completion ring now contributes to their memory
budget.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-05 15:41             ` Jeff Moyer
@ 2017-12-05 15:51               ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2017-12-05 15:51 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Kirill Tkhai, axboe, bcrl, viro, linux-block, linux-kernel,
	linux-aio, oleg

Hello, Jeff.

On Tue, Dec 05, 2017 at 10:41:11AM -0500, Jeff Moyer wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
> >> I think you just need to account the completion ring.
> >
> > A request of struct aio_kiocb type consumes much more memory, than
> > struct io_event does. Shouldn't we account it too?
> 
> Not in my opinion.  The completion ring is the part that gets pinned for
> long periods of time.

For memcg, it should account for all potentially significant memory
consumption, whether long term or transitional.  Otherwise, isolation
doesn't really work that well.

> Just be sure to document this where appropriate.  Users/admins should
> know that the aio completion ring now contributes to their memory
> budget.

Yeah, cgroup-v2.txt does have a section on this.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-05 15:35       ` Benjamin LaHaise
@ 2017-12-06 17:32         ` Oleg Nesterov
  2017-12-06 17:44           ` Benjamin LaHaise
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2017-12-06 17:32 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kirill Tkhai, Tejun Heo, axboe, viro, linux-block, linux-kernel,
	linux-aio

On 12/05, Benjamin LaHaise wrote:
>
> On Tue, Dec 05, 2017 at 04:19:56PM +0100, Oleg Nesterov wrote:
> >
> > and memory. let me quote my old emails...
> >
> >
> > This is off-topic, but the whole "vm" logic in aio_setup_ring()
> > looks sub-optimal. I do not mean the code, just it seems to me it
> > is pointless to pollute the page cache, and expose the pages we
> > can not swap/free to lru. Afaics we _only_ need this for migration.
>
> It is needed for migration, which is needed for hot unplug of memory.
> There is no way around this.

I know, and I even mentioned this above.


> > This memory lives in page-cache/lru, it is visible for shrinker which
> > will unmap these pages for no reason on memory shortage. IOW, aio fools
> > the kernel, this memory looks reclaimable but it is not. And we only do
> > this for migration.
>
> It's the same as any other memory that's mlock()ed into RAM.

No. Again, this memory is not properly accounted, and unlike mlock()ed
memory it is visible to shrinker which will do the unnecessary work on
memory shortage which in turn will lead to unnecessary page faults.

So let me repeat, shouldn't we at least do mapping_set_unevictable() in
aio_private_file() ?
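
(A minimal sketch of what that could look like, against the current
anon-inode setup in aio_private_file(); untested:)

	inode->i_mapping->a_ops = &aio_ctx_aops;
	inode->i_mapping->private_data = ctx;
	inode->i_size = PAGE_SIZE * nr_pages;

	/* Sketch: mark the ring's mapping unevictable so vmscan does not
	 * pointlessly scan and unmap these pages under memory pressure;
	 * they are pinned via ctx->ring_pages anyway. */
	mapping_set_unevictable(inode->i_mapping);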


> > triggers OOM-killer which kills sshd and other daemons on my machine.
> > These pages were not even faulted in (or the shrinker can unmap them),
> > the kernel can not know who should be blamed.
>
> The OOM-killer killed the wrong process: News at 11.

Well. I do not think we should blame OOM-killer in this case. But as I
said this is not a bug-report or something like this, I agree this is
a minor issue.

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-06 17:32         ` Oleg Nesterov
@ 2017-12-06 17:44           ` Benjamin LaHaise
  2017-12-06 18:19             ` Kirill Tkhai
  2017-12-07 13:44             ` Oleg Nesterov
  0 siblings, 2 replies; 28+ messages in thread
From: Benjamin LaHaise @ 2017-12-06 17:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, Tejun Heo, axboe, viro, linux-block, linux-kernel,
	linux-aio

On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
> > > This memory lives in page-cache/lru, it is visible for shrinker which
> > > will unmap these pages for no reason on memory shortage. IOW, aio fools
> > > the kernel, this memory looks reclaimable but it is not. And we only do
> > > this for migration.
> >
> > It's the same as any other memory that's mlock()ed into RAM.
> 
> No. Again, this memory is not properly accounted, and unlike mlock()ed
> memory it is visible to shrinker which will do the unnecessary work on
> memory shortage which in turn will lead to unnecessary page faults.
> 
> So let me repeat, shouldn't we at least do mapping_set_unevictable() in
> aio_private_file() ?

Send a patch then!  I don't know why you're asking rather than sending a
patch to do this if you think it is needed.

> > > triggers OOM-killer which kills sshd and other daemons on my machine.
> > > These pages were not even faulted in (or the shrinker can unmap them),
> > > the kernel can not know who should be blamed.
> >
> > The OOM-killer killed the wrong process: News at 11.
> 
> Well. I do not think we should blame OOM-killer in this case. But as I
> said this is not a bug-report or something like this, I agree this is
> a minor issue.

I do think the OOM-killer is doing the wrong thing here.  If process X is
the only one that is allocating gobs of memory, why kill process Y that
hasn't allocated memory in minutes or hours just because it is bigger?
We're not perfect at tracking who owns memory allocations, so why not
factor in memory allocation rate when deciding which process to kill?  We
keep throwing bandaids on the OOM-killer by annotating allocations, and we
keep missing the annotation of allocations.  Doesn't sound like a real fix
for the underlying problem to me when a better heuristic would solve the
current problem and probably several other future instances of the same
problem.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-06 17:44           ` Benjamin LaHaise
@ 2017-12-06 18:19             ` Kirill Tkhai
  2017-12-06 18:30               ` Benjamin LaHaise
  2017-12-07 13:44             ` Oleg Nesterov
  1 sibling, 1 reply; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-06 18:19 UTC (permalink / raw)
  To: Benjamin LaHaise, Oleg Nesterov
  Cc: Tejun Heo, axboe, viro, linux-block, linux-kernel, linux-aio

On 06.12.2017 20:44, Benjamin LaHaise wrote:
> On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
>>>> This memory lives in page-cache/lru, it is visible for shrinker which
>>>> will unmap these pages for no reason on memory shortage. IOW, aio fools
>>>> the kernel, this memory looks reclaimable but it is not. And we only do
>>>> this for migration.
>>>
>>> It's the same as any other memory that's mlock()ed into RAM.
>>
>> No. Again, this memory is not properly accounted, and unlike mlock()ed
>> memory it is visible to shrinker which will do the unnecessary work on
>> memory shortage which in turn will lead to unnecessary page faults.
>>
>> So let me repeat, shouldn't we at least do mapping_set_unevictable() in
>> aio_private_file() ?
> 
> Send a patch then!  I don't know why you're asking rather than sending a
> patch to do this if you think it is needed.
> 
>>>> triggers OOM-killer which kills sshd and other daemons on my machine.
>>>> These pages were not even faulted in (or the shrinker can unmap them),
>>>> the kernel can not know who should be blamed.
>>>
>>> The OOM-killer killed the wrong process: News at 11.
>>
>> Well. I do not think we should blame OOM-killer in this case. But as I
>> said this is not a bug-report or something like this, I agree this is
>> a minor issue.
> 
> I do think the OOM-killer is doing the wrong thing here.  If process X is
> the only one that is allocating gobs of memory, why kill process Y that
> hasn't allocated memory in minutes or hours just because it is bigger?

I assume that if a process hasn't allocated memory in minutes or hours,
then most probably all of its evictable memory has already been moved to
swap, as its pages were last in the lru list.

> We're not perfect at tracking who owns memory allocations, so why not
> factor in memory allocation rate when deciding which process to kill?  We
> keep throwing bandaids on the OOM-killer by annotating allocations, and we
> keep missing the annotation of allocations.  Doesn't sound like a real fix
> for the underlying problem to me when a better heuristic would solve the
> current problem and probably several other future instances of the same
> problem.

Kirill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-06 18:19             ` Kirill Tkhai
@ 2017-12-06 18:30               ` Benjamin LaHaise
  2017-12-06 19:37                 ` Kirill Tkhai
  0 siblings, 1 reply; 28+ messages in thread
From: Benjamin LaHaise @ 2017-12-06 18:30 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Oleg Nesterov, Tejun Heo, axboe, viro, linux-block, linux-kernel,
	linux-aio

On Wed, Dec 06, 2017 at 09:19:09PM +0300, Kirill Tkhai wrote:
> On 06.12.2017 20:44, Benjamin LaHaise wrote:
> > On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
> >>>> This memory lives in page-cache/lru, it is visible for shrinker which
> >>>> will unmap these pages for no reason on memory shortage. IOW, aio fools
> >>>> the kernel, this memory looks reclaimable but it is not. And we only do
> >>>> this for migration.
> >>>
> >>> It's the same as any other memory that's mlock()ed into RAM.
> >>
> >> No. Again, this memory is not properly accounted, and unlike mlock()ed
> >> memory it is visible to shrinker which will do the unnecessary work on
> >> memory shortage which in turn will lead to unnecessary page faults.
> >>
> >> So let me repeat, shouldn't we at least do mapping_set_unevictable() in
> >> aio_private_file() ?
> > 
> > Send a patch then!  I don't know why you're asking rather than sending a
> > patch to do this if you think it is needed.
> > 
> >>>> triggers OOM-killer which kills sshd and other daemons on my machine.
> >>>> These pages were not even faulted in (or the shrinker can unmap them),
> >>>> the kernel can not know who should be blamed.
> >>>
> >>> The OOM-killer killed the wrong process: News at 11.
> >>
> >> Well. I do not think we should blame OOM-killer in this case. But as I
> >> said this is not a bug-report or something like this, I agree this is
> >> a minor issue.
> > 
> > I do think the OOM-killer is doing the wrong thing here.  If process X is
> > the only one that is allocating gobs of memory, why kill process Y that
> > hasn't allocated memory in minutes or hours just because it is bigger?
> 
> I assume that if a process hasn't allocated memory in minutes or hours,
> then most probably all of its evictable memory has already been moved to
> swap, as its pages were last in the lru list.

That assumption would be incorrect.  Just because memory isn't being
allocated doesn't mean that it isn't being used - the two are very
different things.  Most of the embedded systems I work on allocate memory
at startup and then use it until the system gets restarted or rebooted.
By only doing memory allocations at startup, code is simplified,
performance is more predictable and failure modes are constrained.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-06 18:30               ` Benjamin LaHaise
@ 2017-12-06 19:37                 ` Kirill Tkhai
  0 siblings, 0 replies; 28+ messages in thread
From: Kirill Tkhai @ 2017-12-06 19:37 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Oleg Nesterov, Tejun Heo, axboe, viro, linux-block, linux-kernel,
	linux-aio

On 06.12.2017 21:30, Benjamin LaHaise wrote:
> On Wed, Dec 06, 2017 at 09:19:09PM +0300, Kirill Tkhai wrote:
>> On 06.12.2017 20:44, Benjamin LaHaise wrote:
>>> On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
>>>>>> This memory lives in page-cache/lru, it is visible for shrinker which
>>>>>> will unmap these pages for no reason on memory shortage. IOW, aio fools
>>>>>> the kernel, this memory looks reclaimable but it is not. And we only do
>>>>>> this for migration.
>>>>>
>>>>> It's the same as any other memory that's mlock()ed into RAM.
>>>>
>>>> No. Again, this memory is not properly accounted, and unlike mlock()ed
>>>> memory it is visible to shrinker which will do the unnecessary work on
>>>> memory shortage which in turn will lead to unnecessary page faults.
>>>>
>>>> So let me repeat, shouldn't we at least do mapping_set_unevictable() in
>>>> aio_private_file() ?
>>>
>>> Send a patch then!  I don't know why you're asking rather than sending a
>>> patch to do this if you think it is needed.
>>>
>>>>>> triggers OOM-killer which kills sshd and other daemons on my machine.
>>>>>> These pages were not even faulted in (or the shrinker can unmap them),
>>>>>> the kernel can not know who should be blamed.
>>>>>
>>>>> The OOM-killer killed the wrong process: News at 11.
>>>>
>>>> Well. I do not think we should blame OOM-killer in this case. But as I
>>>> said this is not a bug-report or something like this, I agree this is
>>>> a minor issue.
>>>
>>> I do think the OOM-killer is doing the wrong thing here.  If process X is
>>> the only one that is allocating gobs of memory, why kill process Y that
>>> hasn't allocated memory in minutes or hours just because it is bigger?
>>
>> I assume that if a process hasn't allocated memory in minutes or hours,
>> then most probably all of its evictable memory has already been moved to
>> swap, as its pages were last in the lru list.
> 
> That assumption would be incorrect.  Just because memory isn't being
> allocated doesn't mean that it isn't being used - the two are very
> different things.

Sure, allocation and use are different things. But the fact that memory
is being used does not mean it can't be swapped out to disk. And since
the memory is still being used, the swapped pages are read back into RAM
when page faults happen. And as page faults happen, the task allocates
new pages to hold the data read back from swap. So memory allocations do
happen, and honestly this no longer seems to be the case we started out
discussing.

Of course, maybe a tracker of used memory has been implemented which
prevents used pages from being swapped to disk. But when I last dived
into that code, there was only the PROT_NONE-based tracker used for NUMA
balancing. Maybe you or someone else has newer information about this.

> Most of the embedded systems I work on allocate memory
> at startup and then use it until the system gets restarted or rebooted.
> By only doing memory allocations at startup, code is simplified,
> performance is more predictable and failure modes are constrained.

Yeah, and many realtime systems follow the same model. But, strange as it
may seem, aio is also popular in many other configurations too :)

Kirill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
  2017-12-06 17:44           ` Benjamin LaHaise
  2017-12-06 18:19             ` Kirill Tkhai
@ 2017-12-07 13:44             ` Oleg Nesterov
  1 sibling, 0 replies; 28+ messages in thread
From: Oleg Nesterov @ 2017-12-07 13:44 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kirill Tkhai, Tejun Heo, axboe, viro, linux-block, linux-kernel,
	linux-aio

On 12/06, Benjamin LaHaise wrote:
>
> On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
> >
> > No. Again, this memory is not properly accounted, and unlike mlock()ed
> > memory it is visible to shrinker which will do the unnecessary work on
> > memory shortage which in turn will lead to unnecessary page faults.
> >
> > So let me repeat, shouldn't we at least do mapping_set_unevictable() in
> > aio_private_file() ?

... and probably account this memory in ->pinned_vm
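
(A rough, untested sketch of that part; it assumes the kioctx keeps a
pointer to its owner's mm so the charge can be dropped again at teardown:)

	/* Sketch: charge the pinned ring pages to the owner's pinned_vm
	 * counter when the context is set up in ioctx_alloc() ... */
	down_write(&mm->mmap_sem);
	mm->pinned_vm += ctx->nr_pages;
	up_write(&mm->mmap_sem);

	/* ... and drop the same charge again when the ring is torn down. */
	down_write(&mm->mmap_sem);
	mm->pinned_vm -= ctx->nr_pages;
	up_write(&mm->mmap_sem);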

> Send a patch then!

I have no idea how to test this change, and personally I don't really care
about aio,

> I don't know why you're asking rather than sending a
> patch to do this if you think it is needed.

Because you are the maintainer, and I naively thought it is always fine to
ask the maintainer if you think the code is not correct or sub-optimal.
Sorry for bothering you.

> > > > triggers OOM-killer which kills sshd and other daemons on my machine.
> > > > These pages were not even faulted in (or the shrinker can unmap them),
> > > > the kernel can not know who should be blamed.
> > >
> > > The OOM-killer killed the wrong process: News at 11.
> >
> > Well. I do not think we should blame OOM-killer in this case. But as I
> > said this is not a bug-report or something like this, I agree this is
> > a minor issue.
>
> I do think the OOM-killer is doing the wrong thing here.  If process X is
> the only one that is allocating gobs of memory,

aio_setup_ring() does find_or_create_page(file->f_mapping), which adds
the page to the page cache. Again, this memory looks _reclaimable_ but it
is not, because ctx->ring_pages holds a reference.
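
The relevant loop, abridged, is what makes this look like ordinary
reclaimable page cache while keeping the pages pinned:

	for (i = 0; i < nr_pages; i++) {
		struct page *page;

		/* Each ring page goes into the anon inode's page cache,
		 * so it sits on the LRU and looks reclaimable ... */
		page = find_or_create_page(file->f_mapping, i,
					   GFP_HIGHUSER | __GFP_ZERO);
		if (!page)
			break;
		SetPageUptodate(page);
		unlock_page(page);

		/* ... but this reference keeps the page pinned for the
		 * lifetime of the context, so reclaim never frees it. */
		ctx->ring_pages[i] = page;
	}
	ctx->nr_pages = i;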

I do not understand how we can blame the OOM-killer: it should not kill
the task which blows up the page cache, and that is how io_setup() looks
to the vm.

Quite possibly I missed something, please correct me.

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-12-07 13:44 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-04 16:12 [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Kirill Tkhai
2017-12-04 16:12 ` [PATCH 1/5] aio: Move aio_nr increment to separate function Kirill Tkhai
2017-12-04 16:13 ` [PATCH 2/5] aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h Kirill Tkhai
2017-12-04 16:13 ` [PATCH 3/5] blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr Kirill Tkhai
2017-12-04 16:13 ` [PATCH 4/5] blkcg: Charge aio requests in blkio cgroup hierarchy Kirill Tkhai
2017-12-04 16:13 ` [PATCH 5/5] blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr Kirill Tkhai
2017-12-04 16:52 ` [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup Benjamin LaHaise
2017-12-04 21:27   ` Kirill Tkhai
2017-12-04 21:35     ` Jeff Moyer
2017-12-04 21:48       ` Kirill Tkhai
2017-12-04 20:07 ` Tejun Heo
2017-12-04 21:44   ` Kirill Tkhai
2017-12-04 21:52     ` Tejun Heo
2017-12-04 22:49       ` Kirill Tkhai
2017-12-04 22:59         ` Jeff Moyer
2017-12-04 23:14           ` Kirill Tkhai
2017-12-05 15:41             ` Jeff Moyer
2017-12-05 15:51               ` Tejun Heo
2017-12-04 23:02         ` Tejun Heo
2017-12-04 23:05           ` Kirill Tkhai
2017-12-05 15:19     ` Oleg Nesterov
2017-12-05 15:35       ` Benjamin LaHaise
2017-12-06 17:32         ` Oleg Nesterov
2017-12-06 17:44           ` Benjamin LaHaise
2017-12-06 18:19             ` Kirill Tkhai
2017-12-06 18:30               ` Benjamin LaHaise
2017-12-06 19:37                 ` Kirill Tkhai
2017-12-07 13:44             ` Oleg Nesterov
