linux-kernel.vger.kernel.org archive mirror
* [RFC v2 0/5] cgroup-aware unbound workqueues
@ 2019-06-05 13:36 Daniel Jordan
  2019-06-05 13:36 ` [RFC v2 1/5] cgroup: add cgroup v2 interfaces to migrate kernel threads Daniel Jordan
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Daniel Jordan @ 2019-06-05 13:36 UTC (permalink / raw)
  To: hannes, jiangshanlai, lizefan, tj
  Cc: bsd, dan.j.williams, daniel.m.jordan, dave.hansen, juri.lelli,
	mhocko, peterz, steven.sistare, tglx, tom.hromatka, vdavydov.dev,
	cgroups, linux-kernel, linux-mm

Hi,

This series adds cgroup awareness to unbound workqueues.

This is the second version since Bandan Das's post from a few years ago[1].
The design is completely new, but the code is still in development and I'm
posting early to get feedback on the design.  Is this a good direction?

Thanks,
Daniel


Summary
-------

Cgroup controllers don't throttle workqueue workers for the most part, so
resource-intensive works run unchecked.  Fix it by adding a new type of work
that causes the assigned worker to attach to the given cgroup.

Motivation
----------

Workqueue workers are currently always attached to root cgroups.  If a task in
a child cgroup queues a resource-intensive work, the resource limits of the
child cgroup generally don't throttle the worker, with some exceptions such as
writeback.

My use case for this work is kernel multithreading, the series formerly known
as ktask[2], which I'm now trying to combine with padata based on feedback
from the last post.  Helper threads in a multithreaded job may consume lots of
resources that aren't properly accounted to the cgroup of the task that started
the job.

Basic Idea
----------

I know of two basic ways to fix this, with other ideas welcome.  They both use
the existing cgroup migration path to move workers to different cgroups.

  #1  Maintain per-cgroup worker pools and queue works on these pools.  A
      worker in the pool is migrated once to the pool's assigned cgroup when
      the worker is first created.

These days, users can have hundreds or thousands of cgroups on their systems,
which means that #1 could cause as many workers to be created across the pools,
bringing back the problems of MT workqueues.[3]  The concurrency level could be
managed across the pools, but I don't see how to avoid thrashing on worker
creation and destruction when demand for workers is spread evenly across
cgroups.  So #1 doesn't seem like the right way forward.

  #2  Migrate a worker to the desired cgroup before it runs the work.
      Worker pools are shared across cgroups, and workers migrate to
      different cgroups as needed.

#2 has some issues of its own, namely cgroup_mutex and
cgroup_threadgroup_rwsem.  These prevent concurrent worker migrations, so for
this to work scalably, these locks should be fixed.  css_set_lock and
controller-specific locks may then also be a problem.  Nevertheless, #2 keeps
the total number of workers low to accommodate systems with many cgroups.

This RFC implements #2.  If the design looks good, I can start working on
fixing the locks, and I'd be thrilled if others wanted to help with this.


A third alternative arose late in the development of this series that takes
inspiration from proxy execution, in which a task's scheduling context and
execution context are treated separately[4].  The idea is to allow a proxy task
to temporarily assume the cgroup characteristics of another task so that it can
use the other task's cgroup-related task_struct fields.  The worker avoids the
performance and scalability cost of the migration path, but it also doesn't run
the attach callbacks, so controllers wouldn't work as designed without adding
special logic in various places to account for this situation.  That doesn't
sound immediately appealing, but I haven't thought about it for very long.

Data Structures
---------------

Cgroup awareness is implemented per work with a new type of work item:

        struct cgroup_work {
                struct work_struct work;
        #ifdef CONFIG_CGROUPS
                struct cgroup *cgroup;
        #endif
        };

The cgroup field contains the cgroup to which the assigned worker should
attach.  A new type is used so only those users who want cgroup awareness incur
the space overhead of the cgroup pointer.  This feature is supported only for
cgroups on the default hierarchy, so one cgroup pointer is sufficient.  

Workqueues may be created with the WQ_CGROUP flag.  The flag means that cgroup
works, and only cgroup works, may be queued on this workqueue.  Cgroup works
aren't allowed to be queued on !WQ_CGROUP workqueues.

This separation avoids the issues that come from mixing cgroup_works and
regular works on the same workqueue.  Distinguishing between the two on a
worklist would probably require either a new work data bit, increasing memory
usage from higher pool_workqueue alignment, or multiple worklists plus a fair
way to decide which worklist to pick from next.
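
For illustration, here's a minimal usage sketch based on the interfaces added
in patch 2 (the workqueue name, work function, and cgroup variable are made up
for this example):

        static struct workqueue_struct *my_wq;
        static struct cgroup_work my_cwork;

        static void my_work_fn(struct work_struct *work)
        {
                /* Runs while the worker is attached to the queued cgroup. */
        }

        /* WQ_CGROUP is valid only together with WQ_UNBOUND. */
        my_wq = alloc_workqueue("my_wq", WQ_UNBOUND | WQ_CGROUP, 0);

        INIT_CGROUP_WORK(&my_cwork, my_work_fn);
        /* @cgrp must stay valid until queue_cgroup_work() returns. */
        queue_cgroup_work(my_wq, &my_cwork, cgrp);
        flush_work(&my_cwork.work);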

Migrating Kernel Threads
------------------------

Migrated worker pids appear in cgroup.procs and cgroup.threads, and they block
cgroup destruction (cgroup_rmdir) just as user tasks do.  To alleviate this
somewhat, workers that have finished their work migrate themselves back to the
root cgroup before sleeping.

In addition, it's probably best to allow userland to destroy a cgroup when only
kernel threads remain (no user tasks left), with destruction finishing in the
background once all kernel threads have been migrated out.  This is consistent
with current cgroup behavior, in which management apps, libraries, etc. may
expect destruction to succeed once all known tasks have been moved out.  So
that's tentatively on my TODO, but I'm curious what people think.

It's possible for task migration to fail for several reasons.  On failure, the
worker tries migrating itself to the root cgroup.  In case _this_ fails, the
code currently throws a warning, but it seems best to design this so that
migrating a kernel thread to the root can't fail.  Otherwise, with both
failures, we account work to an unrelated, random cgroup.

Priority Inversion
------------------

One concern with cgroup-aware workers may be priority inversion[5].  I couldn't
find where this was discussed in detail, but it seems the issue is that a
worker could be throttled by some resource limit from its attached cgroup,
causing other work items' execution to be delayed a long time.

However, this doesn't seem to be a problem because of how worker pools are
managed.  There's an invariant that at least one idle worker exists in a pool
before a worker begins processing works, so in the worst case there's at least
one worker per work item and other works aren't stuck behind a throttled
worker, avoiding the inversion.

It's possible that works from a large number of different resource-constrained
cgroups could cause as many workers to be created, with creation eventually
failing, for example due to pid exhaustion, but in that extreme case workqueue
will retry repeatedly with a CREATE_COOLDOWN timeout.  This seems good enough,
but I'm open to other ideas.

Testing
-------

A little, not a lot.  I've sanity-checked that some controllers throttle
workers as expected (memory, cpu, pids), "believe" rdma should work, haven't
looked at io yet, and know cpuset is broken.  For cpuset, I need to fix
->can_attach() for bound kthreads and reconcile the controller's cpumasks with
worker cpumasks.

In one experiment on a large Xeon server, a kernel thread was migrated 2M times
back and forth between two cgroups.  The mean time per migration was 1 usec, so
cgroup-aware work items should take much longer than that for the migration to
be worth it.
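
The measurement loop was roughly the following (a reconstructed sketch run
from a kthread, not the actual test code; the cgroup paths, iteration count,
and missing error handling are all simplifications):

        static int cwq_migrate_bench(void *unused)
        {
                /* Assumes both cgroups already exist on the v2 hierarchy. */
                struct cgroup *cg_a = cgroup_get_from_path("/bench_a");
                struct cgroup *cg_b = cgroup_get_from_path("/bench_b");
                ktime_t start = ktime_get();
                int i;

                /* 2M migrations total: two per loop iteration. */
                for (i = 0; i < 1000000; i++) {
                        cgroup_attach_kthread(cg_a);
                        cgroup_attach_kthread(cg_b);
                }

                pr_info("mean ns per migration: %lld\n",
                        ktime_to_ns(ktime_sub(ktime_get(), start)) / 2000000);

                cgroup_attach_kthread_to_dfl_root();
                cgroup_put(cg_a);
                cgroup_put(cg_b);
                return 0;
        }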

TODO
----

 - scale cgroup_mutex and cgroup_threadgroup_rwsem
 - support the cpuset controller, and reconcile that with workqueue NUMA
   awareness and worker_pool cpumasks
 - support the io controller
 - make kernel thread migration to the root cgroup always succeed
 - maybe allow userland to destroy a cgroup with only kernel threads

Dependencies
------------

This series is against 5.1 plus some kernel multithreading patches (formerly
ktask).  A branch with everything is available at

    git://oss.oracle.com/git/linux-dmjordan.git cauwq-rfc-v2

The multithreading patches don't incorporate some of the feedback from the last
post[2] (yet) because I'm in the process of addressing the larger design
comments.

[1] http://lkml.kernel.org/r/1458339291-4093-1-git-send-email-bsd@redhat.com
[2] https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/
[3] https://lore.kernel.org/lkml/4C17C598.7070303@kernel.org/
[4] https://lore.kernel.org/lkml/20181009092434.26221-1-juri.lelli@redhat.com/
[5] https://lore.kernel.org/netdev/4BFE9ABA.6030907@kernel.org/

Daniel Jordan (5):
  cgroup: add cgroup v2 interfaces to migrate kernel threads
  workqueue, cgroup: add cgroup-aware workqueues
  workqueue, memcontrol: make memcg throttle workqueue workers
  workqueue, cgroup: add test module
  ktask, cgroup: attach helper threads to the master thread's cgroup

 include/linux/cgroup.h                        |  43 +++
 include/linux/workqueue.h                     |  85 +++++
 kernel/cgroup/cgroup-internal.h               |   1 -
 kernel/cgroup/cgroup.c                        |  48 ++-
 kernel/ktask.c                                |  32 +-
 kernel/workqueue.c                            | 263 +++++++++++++-
 kernel/workqueue_internal.h                   |  50 +++
 lib/Kconfig.debug                             |  12 +
 lib/Makefile                                  |   1 +
 lib/test_cgroup_workqueue.c                   | 325 ++++++++++++++++++
 mm/memcontrol.c                               |  26 +-
 .../selftests/cgroup_workqueue/Makefile       |   9 +
 .../testing/selftests/cgroup_workqueue/config |   1 +
 .../cgroup_workqueue/test_cgroup_workqueue.sh | 104 ++++++
 14 files changed, 963 insertions(+), 37 deletions(-)
 create mode 100644 lib/test_cgroup_workqueue.c
 create mode 100644 tools/testing/selftests/cgroup_workqueue/Makefile
 create mode 100644 tools/testing/selftests/cgroup_workqueue/config
 create mode 100755 tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh


base-commit: e93c9c99a629c61837d5a7fc2120cd2b6c70dbdd
prerequisite-patch-id: 253830d9ec7ed8f9d10127c1bc61f2489c40f042
prerequisite-patch-id: 0fa4fe0d879ae76f8e16d15982d799d84f67bee3
prerequisite-patch-id: e2e8229b9d1a1efa75262910a597902e136a6214
prerequisite-patch-id: f67900739fe811de1d4d06e19a5aa180b46396d8
prerequisite-patch-id: 7349e563091d065d4ace565f3daca139d7d470ad
prerequisite-patch-id: 6cd37c09bb0902519d574080b5c56d61755935fe
prerequisite-patch-id: a07d6676fbb5ed07486a3160e6eced91ecef1842
prerequisite-patch-id: 17baa0481806fc48dcb32caffb973120ac599c8d
prerequisite-patch-id: 6e629bbeb6efdd69aa733731bc91690346f26f21
prerequisite-patch-id: 59630f99305aa371e024b652e754d67614481c29
prerequisite-patch-id: 46c713fed894a530e9a0a83ca2a7c1ae2c787a5f
prerequisite-patch-id: 4b233711a8bdd1ef9fa82e364f07595526036ff4
prerequisite-patch-id: 9caefc8d5d48d6ec42bad4336fa38a28118296da
prerequisite-patch-id: 04d008e4ffbe499ebc5b466b7777dabff9a6c77e
prerequisite-patch-id: 396a88dad48473d4b54f539fadeb7d601020098d
-- 
2.21.0



* [RFC v2 1/5] cgroup: add cgroup v2 interfaces to migrate kernel threads
  2019-06-05 13:36 [RFC v2 0/5] cgroup-aware unbound workqueues Daniel Jordan
@ 2019-06-05 13:36 ` Daniel Jordan
  2019-06-05 13:36 ` [RFC v2 2/5] workqueue, cgroup: add cgroup-aware workqueues Daniel Jordan
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Daniel Jordan @ 2019-06-05 13:36 UTC (permalink / raw)
  To: hannes, jiangshanlai, lizefan, tj
  Cc: bsd, dan.j.williams, daniel.m.jordan, dave.hansen, juri.lelli,
	mhocko, peterz, steven.sistare, tglx, tom.hromatka, vdavydov.dev,
	cgroups, linux-kernel, linux-mm

Prepare for cgroup-aware workqueues by introducing cgroup_attach_kthread
and a helper, cgroup_attach_kthread_to_dfl_root.

A workqueue worker will always migrate itself, so for now use @current
in the interfaces to avoid handling task references.
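
As an illustration (hypothetical caller, not part of this patch; error
handling omitted), a kernel thread would use these roughly like so:

        /* Attach to @dst_cgrp so the following work is charged to it... */
        if (!cgroup_attach_kthread(dst_cgrp)) {
                do_accounted_work();    /* hypothetical helper */
                /* ...then return to the root of the default hierarchy. */
                cgroup_attach_kthread_to_dfl_root();
        }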

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/cgroup.h |  6 ++++++
 kernel/cgroup/cgroup.c | 48 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 81f58b4a5418..ad78784e3692 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -103,6 +103,7 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
 struct cgroup *cgroup_get_from_path(const char *path);
 struct cgroup *cgroup_get_from_fd(int fd);
 
+int cgroup_attach_kthread(struct cgroup *dst_cgrp);
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
 
@@ -530,6 +531,11 @@ static inline struct cgroup *task_dfl_cgroup(struct task_struct *task)
 	return task_css_set(task)->dfl_cgrp;
 }
 
+static inline int cgroup_attach_kthread_to_dfl_root(void)
+{
+	return cgroup_attach_kthread(&cgrp_dfl_root.cgrp);
+}
+
 static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
 {
 	struct cgroup_subsys_state *parent_css = cgrp->self.parent;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3f2b4bde0f9c..bc8d6a2e529f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2771,21 +2771,59 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
 	return tsk;
 }
 
-void cgroup_procs_write_finish(struct task_struct *task)
-	__releases(&cgroup_threadgroup_rwsem)
+static void __cgroup_procs_write_finish(struct task_struct *task)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
 
-	/* release reference from cgroup_procs_write_start() */
-	put_task_struct(task);
+	lockdep_assert_held(&cgroup_mutex);
 
-	percpu_up_write(&cgroup_threadgroup_rwsem);
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
 			ss->post_attach();
 }
 
+void cgroup_procs_write_finish(struct task_struct *task)
+	__releases(&cgroup_threadgroup_rwsem)
+{
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* release reference from cgroup_procs_write_start() */
+	put_task_struct(task);
+
+	percpu_up_write(&cgroup_threadgroup_rwsem);
+	__cgroup_procs_write_finish(task);
+}
+
+/**
+ * cgroup_attach_kthread - attach the current kernel thread to a cgroup
+ * @dst_cgrp: the cgroup to attach to
+ *
+ * The caller is responsible for ensuring @dst_cgrp is valid until this
+ * function returns.
+ *
+ * Return: 0 on success or negative error code.
+ */
+int cgroup_attach_kthread(struct cgroup *dst_cgrp)
+{
+	int ret;
+
+	if (WARN_ON_ONCE(!(current->flags & PF_KTHREAD)))
+		return -EINVAL;
+
+	mutex_lock(&cgroup_mutex);
+
+	percpu_down_write(&cgroup_threadgroup_rwsem);
+	ret = cgroup_attach_task(dst_cgrp, current, false);
+	percpu_up_write(&cgroup_threadgroup_rwsem);
+
+	__cgroup_procs_write_finish(current);
+
+	mutex_unlock(&cgroup_mutex);
+
+	return ret;
+}
+
 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
 {
 	struct cgroup_subsys *ss;
-- 
2.21.0



* [RFC v2 2/5] workqueue, cgroup: add cgroup-aware workqueues
  2019-06-05 13:36 [RFC v2 0/5] cgroup-aware unbound workqueues Daniel Jordan
  2019-06-05 13:36 ` [RFC v2 1/5] cgroup: add cgroup v2 interfaces to migrate kernel threads Daniel Jordan
@ 2019-06-05 13:36 ` Daniel Jordan
  2019-06-05 13:36 ` [RFC v2 3/5] workqueue, memcontrol: make memcg throttle workqueue workers Daniel Jordan
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Daniel Jordan @ 2019-06-05 13:36 UTC (permalink / raw)
  To: hannes, jiangshanlai, lizefan, tj
  Cc: bsd, dan.j.williams, daniel.m.jordan, dave.hansen, juri.lelli,
	mhocko, peterz, steven.sistare, tglx, tom.hromatka, vdavydov.dev,
	cgroups, linux-kernel, linux-mm

Workqueue workers ignore the cgroup of the queueing task, so a worker's
resource usage normally goes unaccounted (with exceptions such as cgroup
writeback) and can arbitrarily exceed controller limits.

Add cgroup awareness to workqueue workers.  Do it only for unbound
workqueues since these tend to be the most resource-intensive.  There's a
design overview in the cover letter.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/cgroup.h          |  11 ++
 include/linux/workqueue.h       |  85 ++++++++++++
 kernel/cgroup/cgroup-internal.h |   1 -
 kernel/workqueue.c              | 236 +++++++++++++++++++++++++++++---
 kernel/workqueue_internal.h     |  45 ++++++
 5 files changed, 360 insertions(+), 18 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ad78784e3692..de578e29077b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -91,6 +91,7 @@ extern struct css_set init_css_set;
 #define cgroup_subsys_on_dfl(ss)						\
 	static_branch_likely(&ss ## _on_dfl_key)
 
+bool cgroup_on_dfl(const struct cgroup *cgrp);
 bool css_has_online_children(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss);
 struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgroup,
@@ -531,6 +532,11 @@ static inline struct cgroup *task_dfl_cgroup(struct task_struct *task)
 	return task_css_set(task)->dfl_cgrp;
 }
 
+static inline struct cgroup *cgroup_dfl_root(void)
+{
+	return &cgrp_dfl_root.cgrp;
+}
+
 static inline int cgroup_attach_kthread_to_dfl_root(void)
 {
 	return cgroup_attach_kthread(&cgrp_dfl_root.cgrp);
@@ -694,6 +700,11 @@ struct cgroup_subsys_state;
 struct cgroup;
 
 static inline void css_put(struct cgroup_subsys_state *css) {}
+static inline void cgroup_put(struct cgroup *cgrp) {}
+static inline struct cgroup *task_dfl_cgroup(struct task_struct *task)
+{
+	return NULL;
+}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b5bc12cc1dde..c200ab5268df 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -14,7 +14,9 @@
 #include <linux/atomic.h>
 #include <linux/cpumask.h>
 #include <linux/rcupdate.h>
+#include <linux/numa.h>
 
+struct cgroup;
 struct workqueue_struct;
 
 struct work_struct;
@@ -133,6 +135,13 @@ struct rcu_work {
 	struct workqueue_struct *wq;
 };
 
+struct cgroup_work {
+	struct work_struct work;
+#ifdef CONFIG_CGROUPS
+	struct cgroup *cgroup;
+#endif
+};
+
 /**
  * struct workqueue_attrs - A struct for workqueue attributes.
  *
@@ -157,6 +166,12 @@ struct workqueue_attrs {
 	 * doesn't participate in pool hash calculations or equality comparisons.
 	 */
 	bool no_numa;
+
+	/**
+	 * Workers run work items while attached to the work's corresponding
+	 * cgroup.  This is a property of both workqueues and worker pools.
+	 */
+	bool cgroup_aware;
 };
 
 static inline struct delayed_work *to_delayed_work(struct work_struct *work)
@@ -169,6 +184,11 @@ static inline struct rcu_work *to_rcu_work(struct work_struct *work)
 	return container_of(work, struct rcu_work, work);
 }
 
+static inline struct cgroup_work *to_cgroup_work(struct work_struct *work)
+{
+	return container_of(work, struct cgroup_work, work);
+}
+
 struct execute_work {
 	struct work_struct work;
 };
@@ -290,6 +310,12 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
 #define INIT_RCU_WORK_ONSTACK(_work, _func)				\
 	INIT_WORK_ONSTACK(&(_work)->work, (_func))
 
+#define INIT_CGROUP_WORK(_work, _func)					\
+	INIT_WORK(&(_work)->work, (_func))
+
+#define INIT_CGROUP_WORK_ONSTACK(_work, _func)				\
+	INIT_WORK_ONSTACK(&(_work)->work, (_func))
+
 /**
  * work_pending - Find out whether a work item is currently pending
  * @work: The work item in question
@@ -344,6 +370,14 @@ enum {
 	 */
 	WQ_POWER_EFFICIENT	= 1 << 7,
 
+	/*
+	 * Workqueue is cgroup-aware.  Valid only for WQ_UNBOUND workqueues
+	 * since these work items tend to be the most resource-intensive and
+	 * thus worth the accounting overhead.  Only cgroup_work's may be
+	 * queued.
+	 */
+	WQ_CGROUP		= 1 << 8,
+
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
@@ -514,6 +548,57 @@ static inline bool queue_delayed_work(struct workqueue_struct *wq,
 	return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
 }
 
+#ifdef CONFIG_CGROUPS
+
+extern bool queue_cgroup_work_node(int node, struct workqueue_struct *wq,
+				   struct cgroup_work *cwork,
+				   struct cgroup *cgroup);
+
+/**
+ * queue_cgroup_work - queue work to be run in a cgroup
+ * @wq: workqueue to use
+ * @cwork: cgroup_work to queue
+ * @cgroup: cgroup that the worker assigned to @cwork will attach to
+ *
+ * A worker serving @wq will run @cwork while attached to @cgroup.
+ *
+ * Return: %false if @work was already on a queue, %true otherwise.
+ */
+static inline bool queue_cgroup_work(struct workqueue_struct *wq,
+				     struct cgroup_work *cwork,
+				     struct cgroup *cgroup)
+{
+	return queue_cgroup_work_node(NUMA_NO_NODE, wq, cwork, cgroup);
+}
+
+static inline struct cgroup *work_to_cgroup(struct work_struct *work)
+{
+	return to_cgroup_work(work)->cgroup;
+}
+
+#else  /* CONFIG_CGROUPS */
+
+static inline bool queue_cgroup_work_node(int node, struct workqueue_struct *wq,
+					  struct cgroup_work *cwork,
+					  struct cgroup *cgroup)
+{
+	return queue_work_node(node, wq, &cwork->work);
+}
+
+static inline bool queue_cgroup_work(struct workqueue_struct *wq,
+				     struct cgroup_work *cwork,
+				     struct cgroup *cgroup)
+{
+	return queue_work_node(NUMA_NO_NODE, wq, &cwork->work);
+}
+
+static inline struct cgroup *work_to_cgroup(struct work_struct *work)
+{
+	return NULL;
+}
+
+#endif /* CONFIG_CGROUPS */
+
 /**
  * mod_delayed_work - modify delay of or queue a delayed work
  * @wq: workqueue to use
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 30e39f3932ad..575ca2d0a7bc 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -200,7 +200,6 @@ static inline void get_css_set(struct css_set *cset)
 }
 
 bool cgroup_ssid_enabled(int ssid);
-bool cgroup_on_dfl(const struct cgroup *cgrp);
 bool cgroup_is_thread_root(struct cgroup *cgrp);
 bool cgroup_is_threaded(struct cgroup *cgrp);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 51aa010d728e..89b90899bc09 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -49,6 +49,7 @@
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/nmi.h>
+#include <linux/cgroup.h>
 
 #include "workqueue_internal.h"
 
@@ -80,6 +81,11 @@ enum {
 	WORKER_UNBOUND		= 1 << 7,	/* worker is unbound */
 	WORKER_REBOUND		= 1 << 8,	/* worker was rebound */
 	WORKER_NICED		= 1 << 9,	/* worker's nice was adjusted */
+#ifdef CONFIG_CGROUPS
+	WORKER_CGROUP		= 1 << 10,	/* worker is cgroup-aware */
+#else
+	WORKER_CGROUP		= 0,		/* eliminate branches */
+#endif
 
 	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_CPU_INTENSIVE |
 				  WORKER_UNBOUND | WORKER_REBOUND,
@@ -106,6 +112,9 @@ enum {
 	HIGHPRI_NICE_LEVEL	= MIN_NICE,
 
 	WQ_NAME_LEN		= 24,
+
+	/* flags for __queue_work */
+	QUEUE_WORK_CGROUP	= 1,
 };
 
 /*
@@ -1214,6 +1223,8 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
  * @work: work item to steal
  * @is_dwork: @work is a delayed_work
  * @flags: place to store irq state
+ * @is_cwork: set to %true if @work is a cgroup_work and PENDING is stolen
+ *            (ret == 1)
  *
  * Try to grab PENDING bit of @work.  This function can handle @work in any
  * stable state - idle, on timer or on worklist.
@@ -1237,7 +1248,7 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
  * This function is safe to call from any context including IRQ handler.
  */
 static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
-			       unsigned long *flags)
+			       unsigned long *flags, bool *is_cwork)
 {
 	struct worker_pool *pool;
 	struct pool_workqueue *pwq;
@@ -1297,6 +1308,8 @@ static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
 
 		/* work->data points to pwq iff queued, point to pool */
 		set_work_pool_and_keep_pending(work, pool->id);
+		if (unlikely(is_cwork && (pwq->wq->flags & WQ_CGROUP)))
+			*is_cwork = true;
 
 		spin_unlock(&pool->lock);
 		return 1;
@@ -1394,7 +1407,7 @@ static int wq_select_unbound_cpu(int cpu)
 }
 
 static void __queue_work(int cpu, struct workqueue_struct *wq,
-			 struct work_struct *work)
+			 struct work_struct *work, int flags)
 {
 	struct pool_workqueue *pwq;
 	struct worker_pool *last_pool;
@@ -1416,6 +1429,12 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	if (unlikely(wq->flags & __WQ_DRAINING) &&
 	    WARN_ON_ONCE(!is_chained_work(wq)))
 		return;
+
+	/* not allowed to queue regular works on a cgroup-aware workqueue */
+	if (unlikely(wq->flags & WQ_CGROUP) &&
+	    WARN_ON_ONCE(!(flags & QUEUE_WORK_CGROUP)))
+		return;
+
 retry:
 	if (req_cpu == WORK_CPU_UNBOUND)
 		cpu = wq_select_unbound_cpu(raw_smp_processor_id());
@@ -1516,7 +1535,7 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
 	local_irq_save(flags);
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
-		__queue_work(cpu, wq, work);
+		__queue_work(cpu, wq, work, 0);
 		ret = true;
 	}
 
@@ -1600,7 +1619,7 @@ bool queue_work_node(int node, struct workqueue_struct *wq,
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		int cpu = workqueue_select_cpu_near(node);
 
-		__queue_work(cpu, wq, work);
+		__queue_work(cpu, wq, work, 0);
 		ret = true;
 	}
 
@@ -1614,7 +1633,7 @@ void delayed_work_timer_fn(struct timer_list *t)
 	struct delayed_work *dwork = from_timer(dwork, t, timer);
 
 	/* should have been called from irqsafe timer with irq already off */
-	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
+	__queue_work(dwork->cpu, dwork->wq, &dwork->work, 0);
 }
 EXPORT_SYMBOL(delayed_work_timer_fn);
 
@@ -1636,7 +1655,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 	 * on that there's no such delay when @delay is 0.
 	 */
 	if (!delay) {
-		__queue_work(cpu, wq, &dwork->work);
+		__queue_work(cpu, wq, &dwork->work, 0);
 		return;
 	}
 
@@ -1706,7 +1725,7 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	int ret;
 
 	do {
-		ret = try_to_grab_pending(&dwork->work, true, &flags);
+		ret = try_to_grab_pending(&dwork->work, true, &flags, NULL);
 	} while (unlikely(ret == -EAGAIN));
 
 	if (likely(ret >= 0)) {
@@ -1725,7 +1744,7 @@ static void rcu_work_rcufn(struct rcu_head *rcu)
 
 	/* read the comment in __queue_work() */
 	local_irq_disable();
-	__queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
+	__queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work, 0);
 	local_irq_enable();
 }
 
@@ -1753,6 +1772,129 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 }
 EXPORT_SYMBOL(queue_rcu_work);
 
+#ifdef CONFIG_CGROUPS
+
+/**
+ * queue_cgroup_work_node - queue work to be run in a cgroup on a specific node
+ * @node: node to execute work on
+ * @wq: workqueue to use
+ * @cwork: work to queue
+ * @cgroup: cgroup that the assigned worker should attach to
+ *
+ * Queue @cwork to be run by a worker attached to @cgroup.
+ *
+ * It is the caller's responsibility to ensure @cgroup is valid until this
+ * function returns.
+ *
+ * Supports cgroup v2 only.  If @cgroup is on a v1 hierarchy, the assigned
+ * worker runs in the root of the default hierarchy.
+ *
+ * Return: %false if @work was already on a queue, %true otherwise.
+ */
+bool queue_cgroup_work_node(int node, struct workqueue_struct *wq,
+			    struct cgroup_work *cwork, struct cgroup *cgroup)
+{
+	bool ret = false;
+	unsigned long flags;
+
+	if (WARN_ON_ONCE(!(wq->flags & WQ_CGROUP)))
+		return ret;
+
+	local_irq_save(flags);
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
+			      work_data_bits(&cwork->work))) {
+		int cpu = workqueue_select_cpu_near(node);
+
+		if (cgroup_on_dfl(cgroup))
+			cwork->cgroup = cgroup;
+		else
+			cwork->cgroup = cgroup_dfl_root();
+
+		/*
+		 * cgroup_put happens after a worker is assigned to @work and
+		 * migrated into @cgroup, or @work is cancelled.
+		 */
+		cgroup_get(cwork->cgroup);
+		__queue_work(cpu, wq, &cwork->work, QUEUE_WORK_CGROUP);
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+	return ret;
+}
+
+static inline bool worker_in_child_cgroup(struct worker *worker)
+{
+	return (worker->flags & WORKER_CGROUP) && cgroup_parent(worker->cgroup);
+}
+
+static void attach_worker_to_dfl_root(struct worker *worker)
+{
+	int ret;
+
+	if (!worker_in_child_cgroup(worker))
+		return;
+
+	ret = cgroup_attach_kthread_to_dfl_root();
+	if (ret == 0) {
+		rcu_read_lock();
+		worker->cgroup = task_dfl_cgroup(worker->task);
+		rcu_read_unlock();
+	} else {
+		/*
+		 * TODO Modify the cgroup migration path to guarantee that a
+		 * kernel thread can successfully migrate to the default root
+		 * cgroup.
+		 */
+		WARN_ONCE(1, "can't migrate %s to dfl root (%d)\n",
+			  current->comm, ret);
+	}
+}
+
+/**
+ * attach_worker_to_cgroup - attach worker to work's corresponding cgroup
+ * @worker: worker thread to attach
+ * @work: work used to decide which cgroup to attach to
+ *
+ * Attach a cgroup-aware worker to work's corresponding cgroup.
+ */
+static void attach_worker_to_cgroup(struct worker *worker,
+				    struct work_struct *work)
+{
+	struct cgroup_work *cwork;
+	struct cgroup *cgroup;
+
+	if (!(worker->flags & WORKER_CGROUP))
+		return;
+
+	cwork = to_cgroup_work(work);
+
+	if (unlikely(is_wq_barrier_cgroup(cwork)))
+		return;
+
+	cgroup = cwork->cgroup;
+
+	if (cgroup == worker->cgroup)
+		goto out;
+
+	if (cgroup_attach_kthread(cgroup) == 0) {
+		worker->cgroup = cgroup;
+	} else {
+		/*
+		 * Attach failed, so attach to the default root so the
+		 * work isn't accounted to an unrelated cgroup.
+		 */
+		attach_worker_to_dfl_root(worker);
+	}
+
+out:
+	/* Pairs with cgroup_get in queue_cgroup_work_node. */
+	cgroup_put(cgroup);
+}
+
+#endif /* CONFIG_CGROUPS */
+
 /**
  * worker_enter_idle - enter idle state
  * @worker: worker which is entering idle state
@@ -1934,6 +2076,12 @@ static struct worker *create_worker(struct worker_pool *pool)
 
 	set_user_nice(worker->task, pool->attrs->nice);
 	kthread_bind_mask(worker->task, pool->attrs->cpumask);
+	if (pool->attrs->cgroup_aware) {
+		rcu_read_lock();
+		worker->cgroup = task_dfl_cgroup(worker->task);
+		rcu_read_unlock();
+		worker->flags |= WORKER_CGROUP;
+	}
 
 	/* successful, attach the worker to the pool */
 	worker_attach_to_pool(worker, pool);
@@ -2242,6 +2390,8 @@ __acquires(&pool->lock)
 
 	spin_unlock_irq(&pool->lock);
 
+	attach_worker_to_cgroup(worker, work);
+
 	lock_map_acquire(&pwq->wq->lockdep_map);
 	lock_map_acquire(&lockdep_map);
 	/*
@@ -2434,6 +2584,21 @@ static int worker_thread(void *__worker)
 		}
 	} while (keep_working(pool));
 
+	/*
+	 * Migrate a worker attached to a non-root cgroup to the root so a
+	 * sleeping worker won't cause cgroup_rmdir to fail indefinitely.
+	 *
+	 * XXX Should probably also modify cgroup core so that cgroup_rmdir
+	 * fails only if there are user (i.e. non-kthread) tasks in a cgroup;
+	 * otherwise, long-running workers can still cause cgroup_rmdir to fail
+	 * and userspace can't do anything other than wait.
+	 */
+	if (worker_in_child_cgroup(worker)) {
+		spin_unlock_irq(&pool->lock);
+		attach_worker_to_dfl_root(worker);
+		spin_lock_irq(&pool->lock);
+	}
+
 	worker_set_flags(worker, WORKER_PREP);
 sleep:
 	/*
@@ -2619,7 +2784,10 @@ static void check_flush_dependency(struct workqueue_struct *target_wq,
 }
 
 struct wq_barrier {
-	struct work_struct	work;
+	union {
+		struct work_struct	work;
+		struct cgroup_work	cwork;
+	};
 	struct completion	done;
 	struct task_struct	*task;	/* purely informational */
 };
@@ -2660,6 +2828,7 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
 {
 	struct list_head *head;
 	unsigned int linked = 0;
+	struct work_struct *barr_work;
 
 	/*
 	 * debugobject calls are safe here even with pool->lock locked
@@ -2667,8 +2836,17 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
 	 * checks and call back into the fixup functions where we
 	 * might deadlock.
 	 */
-	INIT_WORK_ONSTACK(&barr->work, wq_barrier_func);
-	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
+
+	if (unlikely(pwq->wq->flags & WQ_CGROUP)) {
+		barr_work = &barr->cwork.work;
+		INIT_CGROUP_WORK_ONSTACK(&barr->cwork, wq_barrier_func);
+		set_wq_barrier_cgroup(&barr->cwork);
+	} else {
+		barr_work = &barr->work;
+		INIT_WORK_ONSTACK(barr_work, wq_barrier_func);
+	}
+
+	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(barr_work));
 
 	init_completion_map(&barr->done, &target->lockdep_map);
 
@@ -2689,8 +2867,8 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
 		__set_bit(WORK_STRUCT_LINKED_BIT, bits);
 	}
 
-	debug_work_activate(&barr->work);
-	insert_work(pwq, &barr->work, head,
+	debug_work_activate(barr_work);
+	insert_work(pwq, barr_work, head,
 		    work_color_to_flags(WORK_NO_COLOR) | linked);
 }
 
@@ -3171,10 +3349,11 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
 {
 	static DECLARE_WAIT_QUEUE_HEAD(cancel_waitq);
 	unsigned long flags;
+	bool is_cwork = false;
 	int ret;
 
 	do {
-		ret = try_to_grab_pending(work, is_dwork, &flags);
+		ret = try_to_grab_pending(work, is_dwork, &flags, &is_cwork);
 		/*
 		 * If someone else is already canceling, wait for it to
 		 * finish.  flush_work() doesn't work for PREEMPT_NONE
@@ -3210,6 +3389,10 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
 	mark_work_canceling(work);
 	local_irq_restore(flags);
 
+	/* PENDING stolen, so drop the cgroup ref from queueing @work. */
+	if (ret == 1 && is_cwork)
+		cgroup_put(work_to_cgroup(work));
+
 	/*
 	 * This allows canceling during early boot.  We know that @work
 	 * isn't executing.
@@ -3271,7 +3454,7 @@ bool flush_delayed_work(struct delayed_work *dwork)
 {
 	local_irq_disable();
 	if (del_timer_sync(&dwork->timer))
-		__queue_work(dwork->cpu, dwork->wq, &dwork->work);
+		__queue_work(dwork->cpu, dwork->wq, &dwork->work, 0);
 	local_irq_enable();
 	return flush_work(&dwork->work);
 }
@@ -3300,15 +3483,20 @@ EXPORT_SYMBOL(flush_rcu_work);
 static bool __cancel_work(struct work_struct *work, bool is_dwork)
 {
 	unsigned long flags;
+	bool is_cwork = false;
 	int ret;
 
 	do {
-		ret = try_to_grab_pending(work, is_dwork, &flags);
+		ret = try_to_grab_pending(work, is_dwork, &flags, &is_cwork);
 	} while (unlikely(ret == -EAGAIN));
 
 	if (unlikely(ret < 0))
 		return false;
 
+	/* PENDING stolen, so drop the cgroup ref from queueing @work. */
+	if (ret == 1 && is_cwork)
+		cgroup_put(work_to_cgroup(work));
+
 	set_work_pool_and_clear_pending(work, get_work_pool_id(work));
 	local_irq_restore(flags);
 	return ret;
@@ -3465,12 +3653,13 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
 	 * get_unbound_pool() explicitly clears ->no_numa after copying.
 	 */
 	to->no_numa = from->no_numa;
+	to->cgroup_aware = from->cgroup_aware;
 }
 
 /* hash value of the content of @attr */
 static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
 {
-	u32 hash = 0;
+	u32 hash = attrs->cgroup_aware;
 
 	hash = jhash_1word(attrs->nice, hash);
 	hash = jhash(cpumask_bits(attrs->cpumask),
@@ -3486,6 +3675,8 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
 		return false;
 	if (!cpumask_equal(a->cpumask, b->cpumask))
 		return false;
+	if (a->cgroup_aware != b->cgroup_aware)
+		return false;
 	return true;
 }
 
@@ -4002,6 +4193,8 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 	if (unlikely(cpumask_empty(new_attrs->cpumask)))
 		cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
 
+	new_attrs->cgroup_aware = !!(wq->flags & WQ_CGROUP);
+
 	/*
 	 * We may create multiple pwqs with differing cpumasks.  Make a
 	 * copy of @new_attrs which will be modified and used to obtain
@@ -4323,6 +4516,13 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 	if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
 		flags |= WQ_UNBOUND;
 
+	/*
+	 * cgroup awareness supported only in unbound workqueues since those
+	 * tend to be the most resource-intensive.
+	 */
+	if (WARN_ON_ONCE((flags & WQ_CGROUP) && !(flags & WQ_UNBOUND)))
+		flags &= ~WQ_CGROUP;
+
 	/* allocate wq and format name */
 	if (flags & WQ_UNBOUND)
 		tbl_size = nr_node_ids * sizeof(wq->numa_pwq_tbl[0]);
@@ -5980,6 +6180,7 @@ int __init workqueue_init_early(void)
 
 		BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
 		attrs->nice = std_nice[i];
+		attrs->cgroup_aware = true;
 		unbound_std_wq_attrs[i] = attrs;
 
 		/*
@@ -5990,6 +6191,7 @@ int __init workqueue_init_early(void)
 		BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
 		attrs->nice = std_nice[i];
 		attrs->no_numa = true;
+		attrs->cgroup_aware = true;
 		ordered_wq_attrs[i] = attrs;
 	}
 
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index cb68b03ca89a..3ad5861258ca 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -32,6 +32,7 @@ struct worker {
 	work_func_t		current_func;	/* L: current_work's fn */
 	struct pool_workqueue	*current_pwq; /* L: current_work's pwq */
 	struct list_head	scheduled;	/* L: scheduled works */
+	struct cgroup		*cgroup;	/* private to worker->task */
 
 	/* 64 bytes boundary on 64bit, 32 on 32bit */
 
@@ -76,4 +77,48 @@ void wq_worker_waking_up(struct task_struct *task, int cpu);
 struct task_struct *wq_worker_sleeping(struct task_struct *task);
 work_func_t wq_worker_last_func(struct task_struct *task);
 
+#ifdef CONFIG_CGROUPS
+
+/*
+ * A barrier work running in a cgroup-aware worker pool needs to specify a
+ * cgroup.  For simplicity, WQ_BARRIER_CGROUP makes the worker stay in its
+ * current cgroup, which correctly accounts the barrier work to the cgroup of
+ * the work being flushed in most cases.  The only exception is when the
+ * flushed work is in progress and a worker collision has caused a work from a
+ * different cgroup to be scheduled before the barrier work, but that seems
+ * acceptable since the barrier work isn't resource-intensive anyway.
+ */
+#define WQ_BARRIER_CGROUP	((struct cgroup *)1)
+
+static inline void set_wq_barrier_cgroup(struct cgroup_work *cwork)
+{
+	cwork->cgroup = WQ_BARRIER_CGROUP;
+}
+
+static inline bool is_wq_barrier_cgroup(struct cgroup_work *cwork)
+{
+	return cwork->cgroup == WQ_BARRIER_CGROUP;
+}
+
+#else
+
+static inline void set_wq_barrier_cgroup(struct cgroup_work *cwork) {}
+
+static inline bool is_wq_barrier_cgroup(struct cgroup_work *cwork)
+{
+	return false;
+}
+
+static inline bool worker_in_child_cgroup(struct worker *worker)
+{
+	return false;
+}
+
+static inline void attach_worker_to_cgroup(struct worker *worker,
+					   struct work_struct *work) {}
+
+static inline void attach_worker_to_dfl_root(struct worker *worker) {}
+
+#endif /* CONFIG_CGROUPS */
+
 #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
-- 
2.21.0



* [RFC v2 3/5] workqueue, memcontrol: make memcg throttle workqueue workers
  2019-06-05 13:36 [RFC v2 0/5] cgroup-aware unbound workqueues Daniel Jordan
  2019-06-05 13:36 ` [RFC v2 1/5] cgroup: add cgroup v2 interfaces to migrate kernel threads Daniel Jordan
  2019-06-05 13:36 ` [RFC v2 2/5] workqueue, cgroup: add cgroup-aware workqueues Daniel Jordan
@ 2019-06-05 13:36 ` Daniel Jordan
  2019-06-05 13:36 ` [RFC v2 4/5] workqueue, cgroup: add test module Daniel Jordan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Daniel Jordan @ 2019-06-05 13:36 UTC (permalink / raw)
  To: hannes, jiangshanlai, lizefan, tj
  Cc: bsd, dan.j.williams, daniel.m.jordan, dave.hansen, juri.lelli,
	mhocko, peterz, steven.sistare, tglx, tom.hromatka, vdavydov.dev,
	cgroups, linux-kernel, linux-mm

Attaching a worker to a css_set isn't enough for all controllers to
throttle it.  In particular, the memory controller currently bypasses
accounting for kernel threads.

Support memcg accounting for cgroup-aware workqueue workers so that
they're appropriately throttled.

Another, probably better way to do this is to have kernel threads, or
even specifically cgroup-aware workqueue workers, call
memalloc_use_memcg and memalloc_unuse_memcg during cgroup migration
(memcg attach callback maybe).

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 kernel/workqueue.c          | 26 ++++++++++++++++++++++++++
 kernel/workqueue_internal.h |  5 +++++
 mm/memcontrol.c             | 26 ++++++++++++++++++++++++--
 3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 89b90899bc09..c8cc69e296c0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -50,6 +50,8 @@
 #include <linux/sched/isolation.h>
 #include <linux/nmi.h>
 #include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 
 #include "workqueue_internal.h"
 
@@ -1829,6 +1831,28 @@ static inline bool worker_in_child_cgroup(struct worker *worker)
 	return (worker->flags & WORKER_CGROUP) && cgroup_parent(worker->cgroup);
 }
 
+/* XXX Put this in the memory controller's attach callback. */
+#ifdef CONFIG_MEMCG
+static void worker_unuse_memcg(struct worker *worker)
+{
+	if (worker->task->active_memcg) {
+		struct mem_cgroup *memcg = worker->task->active_memcg;
+
+		memalloc_unuse_memcg();
+		css_put(&memcg->css);
+	}
+}
+
+static void worker_use_memcg(struct worker *worker)
+{
+	struct mem_cgroup *memcg;
+
+	worker_unuse_memcg(worker);
+	memcg = mem_cgroup_from_css(task_get_css(worker->task, memory_cgrp_id));
+	memalloc_use_memcg(memcg);
+}
+#endif /* CONFIG_MEMCG */
+
 static void attach_worker_to_dfl_root(struct worker *worker)
 {
 	int ret;
@@ -1841,6 +1865,7 @@ static void attach_worker_to_dfl_root(struct worker *worker)
 		rcu_read_lock();
 		worker->cgroup = task_dfl_cgroup(worker->task);
 		rcu_read_unlock();
+		worker_unuse_memcg(worker);
 	} else {
 		/*
 		 * TODO Modify the cgroup migration path to guarantee that a
@@ -1880,6 +1905,7 @@ static void attach_worker_to_cgroup(struct worker *worker,
 
 	if (cgroup_attach_kthread(cgroup) == 0) {
 		worker->cgroup = cgroup;
+		worker_use_memcg(worker);
 	} else {
 		/*
 		 * Attach failed, so attach to the default root so the
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 3ad5861258ca..f254b93edc2c 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -79,6 +79,11 @@ work_func_t wq_worker_last_func(struct task_struct *task);
 
 #ifdef CONFIG_CGROUPS
 
+#ifndef CONFIG_MEMCG
+static inline void worker_use_memcg(struct worker *worker) {}
+static inline void worker_unuse_memcg(struct worker *worker) {}
+#endif /* CONFIG_MEMCG */
+
 /*
  * A barrier work running in a cgroup-aware worker pool needs to specify a
  * cgroup.  For simplicity, WQ_BARRIER_CGROUP makes the worker stay in its
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 81a0d3914ec9..1a80931b124a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2513,9 +2513,31 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 
 static inline bool memcg_kmem_bypass(void)
 {
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (in_interrupt())
 		return true;
-	return false;
+
+	if (unlikely(current->flags & PF_WQ_WORKER)) {
+		struct cgroup *parent;
+
+		/*
+		 * memcg should throttle cgroup-aware workers.  Infer the
+		 * worker is cgroup-aware by its presence in a non-root cgroup.
+		 *
+		 * This test won't detect a cgroup-aware worker attached to the
+		 * default root, but in that case memcg doesn't need to
+		 * throttle it anyway.
+		 *
+		 * XXX One alternative to this awkward block is adding a
+		 * cgroup-aware-worker bit to task_struct.
+		 */
+		rcu_read_lock();
+		parent = cgroup_parent(task_dfl_cgroup(current));
+		rcu_read_unlock();
+
+		return !parent;
+	}
+
+	return !current->mm || (current->flags & PF_KTHREAD);
 }
 
 /**
-- 
2.21.0



* [RFC v2 4/5] workqueue, cgroup: add test module
  2019-06-05 13:36 [RFC v2 0/5] cgroup-aware unbound workqueues Daniel Jordan
                   ` (2 preceding siblings ...)
  2019-06-05 13:36 ` [RFC v2 3/5] workqueue, memcontrol: make memcg throttle workqueue workers Daniel Jordan
@ 2019-06-05 13:36 ` Daniel Jordan
  2019-06-05 13:36 ` [RFC v2 5/5] ktask, cgroup: attach helper threads to the master thread's cgroup Daniel Jordan
  2019-06-05 13:53 ` [RFC v2 0/5] cgroup-aware unbound workqueues Tejun Heo
  5 siblings, 0 replies; 12+ messages in thread
From: Daniel Jordan @ 2019-06-05 13:36 UTC (permalink / raw)
  To: hannes, jiangshanlai, lizefan, tj
  Cc: bsd, dan.j.williams, daniel.m.jordan, dave.hansen, juri.lelli,
	mhocko, peterz, steven.sistare, tglx, tom.hromatka, vdavydov.dev,
	cgroups, linux-kernel, linux-mm

Test the basic functionality of cgroup-aware workqueues.  Inspired by
Matthew Wilcox's test_xarray.c.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 kernel/workqueue.c                            |   1 +
 lib/Kconfig.debug                             |  12 +
 lib/Makefile                                  |   1 +
 lib/test_cgroup_workqueue.c                   | 325 ++++++++++++++++++
 .../selftests/cgroup_workqueue/Makefile       |   9 +
 .../testing/selftests/cgroup_workqueue/config |   1 +
 .../cgroup_workqueue/test_cgroup_workqueue.sh | 104 ++++++
 7 files changed, 453 insertions(+)
 create mode 100644 lib/test_cgroup_workqueue.c
 create mode 100644 tools/testing/selftests/cgroup_workqueue/Makefile
 create mode 100644 tools/testing/selftests/cgroup_workqueue/config
 create mode 100755 tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c8cc69e296c0..15459b5bb0bf 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1825,6 +1825,7 @@ bool queue_cgroup_work_node(int node, struct workqueue_struct *wq,
 	local_irq_restore(flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(queue_cgroup_work_node);
 
 static inline bool worker_in_child_cgroup(struct worker *worker)
 {
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d5a4a4036d2f..9909a306c142 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2010,6 +2010,18 @@ config TEST_STACKINIT
 
 	  If unsure, say N.
 
+config TEST_CGROUP_WQ
+	tristate "Test cgroup-aware workqueues at runtime"
+	depends on CGROUPS
+	help
+	  Test cgroup-aware workqueues, in which workers attach to the
+	  cgroup specified by the queueing task.  Basic test coverage
+	  for whether workers attach to the expected cgroup, both for
+	  cgroup-aware and unaware works, and whether workers are
+	  throttled by the memory controller.
+
+	  If unsure, say N.
+
 endif # RUNTIME_TESTING_MENU
 
 config MEMTEST
diff --git a/lib/Makefile b/lib/Makefile
index 18c2be516ab4..d08b4a50bfd1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_TEST_OBJAGG) += test_objagg.o
 obj-$(CONFIG_TEST_STACKINIT) += test_stackinit.o
 
 obj-$(CONFIG_TEST_LIVEPATCH) += livepatch/
+obj-$(CONFIG_TEST_CGROUP_WQ) += test_cgroup_workqueue.o
 
 ifeq ($(CONFIG_DEBUG_KOBJECT),y)
 CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/test_cgroup_workqueue.c b/lib/test_cgroup_workqueue.c
new file mode 100644
index 000000000000..466ec4e6e55b
--- /dev/null
+++ b/lib/test_cgroup_workqueue.c
@@ -0,0 +1,325 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_cgroup_workqueue.c: Test cgroup-aware workqueues
+ * Copyright (c) 2019 Oracle and/or its affiliates. All rights reserved.
+ * Author: Daniel Jordan <daniel.m.jordan@oracle.com>
+ *
+ * Inspired by Matthew Wilcox's test_xarray.c.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/kconfig.h>
+#include <linux/module.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/vmalloc.h>
+#include <linux/workqueue.h>
+
+static atomic_long_t cwq_tests_run = ATOMIC_LONG_INIT(0);
+static atomic_long_t cwq_tests_passed = ATOMIC_LONG_INIT(0);
+
+static struct workqueue_struct *cwq_cgroup_aware_wq;
+static struct workqueue_struct *cwq_cgroup_unaware_wq;
+
+/* cgroup v2 hierarchy mountpoint */
+static char *cwq_cgrp_root_path = "/test_cgroup_workqueue";
+module_param(cwq_cgrp_root_path, charp, 0444);
+static struct cgroup *cwq_cgrp_root;
+
+static char *cwq_cgrp_1_path = "/cwq_1";
+module_param(cwq_cgrp_1_path, charp, 0444);
+static struct cgroup *cwq_cgrp_1;
+
+static char *cwq_cgrp_2_path = "/cwq_2";
+module_param(cwq_cgrp_2_path, charp, 0444);
+static struct cgroup *cwq_cgrp_2;
+
+static char *cwq_cgrp_3_path = "/cwq_3";
+module_param(cwq_cgrp_3_path, charp, 0444);
+static struct cgroup *cwq_cgrp_3;
+
+static size_t cwq_memcg_max = 1ul << 20;
+module_param(cwq_memcg_max, ulong, 0444);
+
+#define CWQ_BUG(x, test_name) ({					       \
+	int __ret_cwq_bug = !!(x);					       \
+	atomic_long_inc(&cwq_tests_run);				       \
+	if (__ret_cwq_bug) {						       \
+		pr_warn("BUG at %s:%d\n", __func__, __LINE__);		       \
+		pr_warn("%s\n", (test_name));				       \
+		dump_stack();						       \
+	} else {							       \
+		atomic_long_inc(&cwq_tests_passed);			       \
+	}								       \
+	__ret_cwq_bug;							       \
+})
+
+#define CWQ_BUG_ON(x) CWQ_BUG((x), "<no test name>")
+
+struct cwq_data {
+	struct cgroup_work cwork;
+	struct cgroup	   *expected_cgroup;
+	const char	   *test_name;
+};
+
+#define CWQ_INIT_DATA(data, func, expected_cgrp) do {			\
+	INIT_CGROUP_WORK_ONSTACK(&(data)->cwork, (func));		\
+	(data)->expected_cgroup = (expected_cgrp);			\
+} while (0)
+
+static void cwq_verify_worker_cgroup(struct work_struct *work)
+{
+	struct cgroup_work *cwork = container_of(work, struct cgroup_work,
+						 work);
+	struct cwq_data *data = container_of(cwork, struct cwq_data, cwork);
+	struct cgroup *worker_cgroup;
+
+	CWQ_BUG(!(current->flags & PF_WQ_WORKER), data->test_name);
+
+	rcu_read_lock();
+	worker_cgroup = task_dfl_cgroup(current);
+	rcu_read_unlock();
+
+	CWQ_BUG(worker_cgroup != data->expected_cgroup, data->test_name);
+}
+
+static noinline void cwq_test_reg_work_on_cgrp_unaware_wq(void)
+{
+	struct cwq_data data;
+
+	data.expected_cgroup = cwq_cgrp_root;
+	data.test_name = __func__;
+	INIT_WORK_ONSTACK(&data.cwork.work, cwq_verify_worker_cgroup);
+
+	CWQ_BUG_ON(!queue_work(cwq_cgroup_unaware_wq, &data.cwork.work));
+	flush_work(&data.cwork.work);
+}
+
+static noinline void cwq_test_cgrp_work_on_cgrp_aware_wq(void)
+{
+	struct cwq_data data;
+
+	data.expected_cgroup = cwq_cgrp_1;
+	data.test_name = __func__;
+	INIT_CGROUP_WORK_ONSTACK(&data.cwork, cwq_verify_worker_cgroup);
+
+	CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq, &data.cwork,
+				      cwq_cgrp_1));
+	flush_work(&data.cwork.work);
+}
+
+static struct cgroup *cwq_get_random_cgroup(void)
+{
+	switch (prandom_u32_max(4)) {
+	case 1:  return cwq_cgrp_1;
+	case 2:  return cwq_cgrp_2;
+	case 3:  return cwq_cgrp_3;
+	default: return cwq_cgrp_root;
+	}
+}
+
+#define CWQ_NWORK 256
+static noinline void cwq_test_many_cgrp_works_on_cgrp_aware_wq(void)
+{
+	int i;
+	struct cwq_data *data_array = kmalloc_array(CWQ_NWORK,
+						    sizeof(struct cwq_data),
+						    GFP_KERNEL);
+	if (CWQ_BUG_ON(!data_array))
+		return;
+
+	for (i = 0; i < CWQ_NWORK; ++i) {
+		struct cgroup *cgrp = cwq_get_random_cgroup();
+
+		data_array[i].expected_cgroup = cgrp;
+		data_array[i].test_name = __func__;
+		INIT_CGROUP_WORK(&data_array[i].cwork,
+				 cwq_verify_worker_cgroup);
+		CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq,
+					      &data_array[i].cwork,
+					      cgrp));
+	}
+
+	for (i = 0; i < CWQ_NWORK; ++i)
+		flush_work(&data_array[i].cwork.work);
+
+	kfree(data_array);
+}
+
+static void cwq_verify_worker_obeys_memcg(struct work_struct *work)
+{
+	struct cgroup_work *cwork = container_of(work, struct cgroup_work,
+						 work);
+	struct cwq_data *data = container_of(cwork, struct cwq_data, cwork);
+	struct cgroup *worker_cgroup;
+	void *mem;
+
+	CWQ_BUG(!(current->flags & PF_WQ_WORKER), data->test_name);
+
+	rcu_read_lock();
+	worker_cgroup = task_dfl_cgroup(current);
+	rcu_read_unlock();
+
+	CWQ_BUG(worker_cgroup != data->expected_cgroup, data->test_name);
+
+	mem = __vmalloc(cwq_memcg_max * 2, __GFP_ACCOUNT | __GFP_NOWARN,
+			PAGE_KERNEL);
+	if (data->expected_cgroup == cwq_cgrp_2) {
+		/*
+		 * cwq_cgrp_2 has its memory.max set to cwq_memcg_max, so the
+		 * allocation should fail.
+		 */
+		CWQ_BUG(mem, data->test_name);
+	} else {
+		/*
+		 * Other cgroups don't have a memory.max limit, so the
+		 * allocation should succeed.
+		 */
+		CWQ_BUG(!mem, data->test_name);
+	}
+	vfree(mem);
+}
+
+static noinline void cwq_test_reg_work_is_not_throttled_by_memcg(void)
+{
+	struct cwq_data data;
+
+	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !cwq_memcg_max)
+		return;
+
+	data.expected_cgroup = cwq_cgrp_root;
+	data.test_name = __func__;
+	INIT_WORK_ONSTACK(&data.cwork.work, cwq_verify_worker_obeys_memcg);
+	CWQ_BUG_ON(!queue_work(cwq_cgroup_unaware_wq, &data.cwork.work));
+	flush_work(&data.cwork.work);
+}
+
+static noinline void cwq_test_cgrp_work_is_throttled_by_memcg(void)
+{
+	struct cwq_data data;
+
+	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !cwq_memcg_max)
+		return;
+
+	/*
+	 * The kselftest shell script enables the memory controller in
+	 * cwq_cgrp_2 and sets memory.max to cwq_memcg_max.
+	 */
+	data.expected_cgroup = cwq_cgrp_2;
+	data.test_name = __func__;
+	INIT_CGROUP_WORK_ONSTACK(&data.cwork, cwq_verify_worker_obeys_memcg);
+
+	CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq, &data.cwork,
+				      cwq_cgrp_2));
+	flush_work(&data.cwork.work);
+}
+
+static noinline void cwq_test_cgrp_work_is_not_throttled_by_memcg(void)
+{
+	struct cwq_data data;
+
+	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !cwq_memcg_max)
+		return;
+
+	/*
+	 * The kselftest shell script doesn't set a memory limit for cwq_cgrp_1
+	 * or _3, so a cgroup work should still be able to allocate memory
+	 * above that limit.
+	 */
+	data.expected_cgroup = cwq_cgrp_1;
+	data.test_name = __func__;
+	INIT_CGROUP_WORK_ONSTACK(&data.cwork, cwq_verify_worker_obeys_memcg);
+
+	CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq, &data.cwork,
+				      cwq_cgrp_1));
+	flush_work(&data.cwork.work);
+
+	/*
+	 * And cgroup workqueues shouldn't choke on a cgroup that's disabled
+	 * the memory controller, such as cwq_cgrp_3.
+	 */
+	data.expected_cgroup = cwq_cgrp_3;
+	data.test_name = __func__;
+	INIT_CGROUP_WORK_ONSTACK(&data.cwork, cwq_verify_worker_obeys_memcg);
+
+	CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq, &data.cwork,
+				      cwq_cgrp_3));
+	flush_work(&data.cwork.work);
+}
+
+static int cwq_init(void)
+{
+	s64 passed, run;
+
+	pr_warn("cgroup workqueue test module\n");
+
+	cwq_cgroup_aware_wq = alloc_workqueue("cwq_cgroup_aware_wq",
+					      WQ_UNBOUND | WQ_CGROUP, 0);
+	if (!cwq_cgroup_aware_wq) {
+		pr_warn("cwq_cgroup_aware_wq allocation failed\n");
+		return -EAGAIN;
+	}
+
+	cwq_cgroup_unaware_wq = alloc_workqueue("cwq_cgroup_unaware_wq",
+						WQ_UNBOUND, 0);
+	if (!cwq_cgroup_unaware_wq) {
+		pr_warn("cwq_cgroup_unaware_wq allocation failed\n");
+		goto alloc_wq_fail;
+	}
+
+	cwq_cgrp_root = cgroup_get_from_path("/");
+	if (IS_ERR(cwq_cgrp_root)) {
+		pr_warn("can't get root cgroup\n");
+		goto cgroup_get_fail;
+	}
+
+	cwq_cgrp_1 = cgroup_get_from_path(cwq_cgrp_1_path);
+	if (IS_ERR(cwq_cgrp_1)) {
+		pr_warn("can't get child cgroup 1\n");
+		goto cgroup_get_fail;
+	}
+
+	cwq_cgrp_2 = cgroup_get_from_path(cwq_cgrp_2_path);
+	if (IS_ERR(cwq_cgrp_2)) {
+		pr_warn("can't get child cgroup 2\n");
+		goto cgroup_get_fail;
+	}
+
+	cwq_cgrp_3 = cgroup_get_from_path(cwq_cgrp_3_path);
+	if (IS_ERR(cwq_cgrp_3)) {
+		pr_warn("can't get child cgroup 3\n");
+		goto cgroup_get_fail;
+	}
+
+	cwq_test_reg_work_on_cgrp_unaware_wq();
+	cwq_test_cgrp_work_on_cgrp_aware_wq();
+	cwq_test_many_cgrp_works_on_cgrp_aware_wq();
+	cwq_test_reg_work_is_not_throttled_by_memcg();
+	cwq_test_cgrp_work_is_throttled_by_memcg();
+	cwq_test_cgrp_work_is_not_throttled_by_memcg();
+
+	passed = atomic_long_read(&cwq_tests_passed);
+	run    = atomic_long_read(&cwq_tests_run);
+	pr_warn("cgroup workqueues: %lld of %lld tests passed\n", passed, run);
+	return (run == passed) ? 0 : -EINVAL;
+
+cgroup_get_fail:
+	destroy_workqueue(cwq_cgroup_unaware_wq);
+alloc_wq_fail:
+	destroy_workqueue(cwq_cgroup_aware_wq);
+	return -EAGAIN;		/* better ideas? */
+}
+
+static void cwq_exit(void)
+{
+	destroy_workqueue(cwq_cgroup_aware_wq);
+	destroy_workqueue(cwq_cgroup_unaware_wq);
+	/* Drop the references taken via cgroup_get_from_path() in cwq_init(). */
+	cgroup_put(cwq_cgrp_3);
+	cgroup_put(cwq_cgrp_2);
+	cgroup_put(cwq_cgrp_1);
+	cgroup_put(cwq_cgrp_root);
+}
+
+module_init(cwq_init);
+module_exit(cwq_exit);
+MODULE_AUTHOR("Daniel Jordan <daniel.m.jordan@oracle.com>");
+MODULE_LICENSE("GPL");
diff --git a/tools/testing/selftests/cgroup_workqueue/Makefile b/tools/testing/selftests/cgroup_workqueue/Makefile
new file mode 100644
index 000000000000..2ba1cd922670
--- /dev/null
+++ b/tools/testing/selftests/cgroup_workqueue/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+
+all:
+
+TEST_PROGS := test_cgroup_workqueue.sh
+
+include ../lib.mk
+
+clean:
diff --git a/tools/testing/selftests/cgroup_workqueue/config b/tools/testing/selftests/cgroup_workqueue/config
new file mode 100644
index 000000000000..ae38b8f3c3db
--- /dev/null
+++ b/tools/testing/selftests/cgroup_workqueue/config
@@ -0,0 +1 @@
+CONFIG_TEST_CGROUP_WQ=m
diff --git a/tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh b/tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh
new file mode 100755
index 000000000000..33251276d2cf
--- /dev/null
+++ b/tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh
@@ -0,0 +1,104 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# Runs cgroup workqueue kernel module tests
+
+# hierarchy:   root
+#             /    \
+#            1      2
+#           /
+#          3
+CG_ROOT='/test_cgroup_workqueue'
+CG_1='/cwq_1'
+CG_2='/cwq_2'
+CG_3='/cwq_3'
+MEMCG_MAX="$((2**16))"	# small but arbitrary amount
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+cg_mnt_created=''
+cg_mnt_mounted=''
+cg_1_created=''
+cg_2_created=''
+cg_3_created=''
+
+cleanup()
+{
+	# Tear down only what was set up; fail if any teardown step fails.
+	[ -n "$cg_3_created" ]   && { rmdir "$CG_ROOT/$CG_1/$CG_3" || exit 1; }
+	[ -n "$cg_2_created" ]   && { rmdir "$CG_ROOT/$CG_2"        || exit 1; }
+	[ -n "$cg_1_created" ]   && { rmdir "$CG_ROOT/$CG_1"        || exit 1; }
+	[ -n "$cg_mnt_mounted" ] && { umount "$CG_ROOT"             || exit 1; }
+	[ -n "$cg_mnt_created" ] && { rmdir "$CG_ROOT"              || exit 1; }
+	exit "$1"
+}
+
+if ! /sbin/modprobe -q -n test_cgroup_workqueue; then
+	echo "cgroup_workqueue: module test_cgroup_workqueue not found [SKIP]"
+	exit $ksft_skip
+fi
+
+# Setup cgroup v2 hierarchy
+if mkdir "$CG_ROOT"; then
+	cg_mnt_created=1
+else
+	echo "cgroup_workqueue: can't create cgroup mountpoint at $CG_ROOT"
+	cleanup 1
+fi
+
+if mount -t cgroup2 none "$CG_ROOT"; then
+	cg_mnt_mounted=1
+else
+	echo "cgroup_workqueue: couldn't mount cgroup hierarchy at $CG_ROOT"
+	cleanup 1
+fi
+
+if grep -q memory "$CG_ROOT/cgroup.controllers"; then
+	/bin/echo +memory > "$CG_ROOT/cgroup.subtree_control"
+	cwq_memcg_max="$MEMCG_MAX"
+else
+	# Tell test module not to run memory.max tests.
+	cwq_memcg_max=0
+fi
+
+if mkdir "$CG_ROOT/$CG_1"; then
+	cg_1_created=1
+else
+	echo "cgroup_workqueue: can't mkdir $CG_ROOT/$CG_1"
+	cleanup 1
+fi
+
+if mkdir "$CG_ROOT/$CG_2"; then
+	cg_2_created=1
+else
+	echo "cgroup_workqueue: can't mkdir $CG_ROOT/$CG_2"
+	cleanup 1
+fi
+
+if mkdir "$CG_ROOT/$CG_1/$CG_3"; then
+	cg_3_created=1
+	# Ensure the memory controller is disabled as expected in $CG_3's
+	# parent, $CG_1, for testing.
+	if grep -q memory "$CG_ROOT/$CG_1/cgroup.subtree_control"; then
+		/bin/echo -memory > "$CG_ROOT/$CG_1/cgroup.subtree_control"
+	fi
+else
+	echo "cgroup_workqueue: can't mkdir $CG_ROOT/$CG_1/$CG_3"
+	cleanup 1
+fi
+
+if [ "$cwq_memcg_max" -ne 0 ]; then
+	/bin/echo "$MEMCG_MAX" > "$CG_ROOT/$CG_2/memory.max"
+fi
+
+if /sbin/modprobe -q test_cgroup_workqueue cwq_cgrp_root_path="$CG_ROOT"       \
+					   cwq_cgrp_1_path="$CG_1"	       \
+					   cwq_cgrp_2_path="$CG_2"	       \
+					   cwq_cgrp_3_path="$CG_1$CG_3"        \
+					   cwq_memcg_max="$cwq_memcg_max"; then
+	echo "cgroup_workqueue: ok"
+	/sbin/modprobe -q -r test_cgroup_workqueue
+	cleanup 0
+else
+	echo "cgroup_workqueue: [FAIL]"
+	cleanup 1
+fi
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC v2 5/5] ktask, cgroup: attach helper threads to the master thread's cgroup
  2019-06-05 13:36 [RFC v2 0/5] cgroup-aware unbound workqueues Daniel Jordan
                   ` (3 preceding siblings ...)
  2019-06-05 13:36 ` [RFC v2 4/5] workqueue, cgroup: add test module Daniel Jordan
@ 2019-06-05 13:36 ` Daniel Jordan
  2019-06-05 13:53 ` [RFC v2 0/5] cgroup-aware unbound workqueues Tejun Heo
  5 siblings, 0 replies; 12+ messages in thread
From: Daniel Jordan @ 2019-06-05 13:36 UTC (permalink / raw)
  To: hannes, jiangshanlai, lizefan, tj
  Cc: bsd, dan.j.williams, daniel.m.jordan, dave.hansen, juri.lelli,
	mhocko, peterz, steven.sistare, tglx, tom.hromatka, vdavydov.dev,
	cgroups, linux-kernel, linux-mm

ktask tasks are expensive, and helper threads are not currently
throttled by the master's cgroup, so helpers' resource usage is
unbounded.

Attach helper threads to the master thread's cgroup to ensure helpers
get this throttling.

It's possible for the master to be migrated to a new cgroup before the
task is finished.  In that case, to keep it simple, the helpers continue
executing in the original cgroup.
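
In code terms, the change boils down to the following pattern (a condensed
sketch of the diff below, not an exact hunk):

	/* master thread, before queueing helper works */
	kt->kt_cgroup = task_get_dfl_cgroup(current);	/* pin master's cgroup */

	/* each helper work is queued against that pinned cgroup */
	queue_cgroup_work_node(nid, wq, &kw->kw_work, kt->kt_cgroup);

	/* once all helpers have finished */
	cgroup_put(kt->kt_cgroup);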

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/cgroup.h | 26 ++++++++++++++++++++++++++
 kernel/ktask.c         | 32 ++++++++++++++++++++------------
 2 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index de578e29077b..67b2c469f17f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -532,6 +532,28 @@ static inline struct cgroup *task_dfl_cgroup(struct task_struct *task)
 	return task_css_set(task)->dfl_cgrp;
 }
 
+/**
+ * task_get_dfl_cgroup - find and get the default hierarchy cgroup for task
+ * @task: the target task
+ *
+ * Find the default hierarchy cgroup for @task, take a reference on it, and
+ * return it.  Guaranteed to return a valid cgroup.
+ */
+static inline struct cgroup *task_get_dfl_cgroup(struct task_struct *task)
+{
+	struct cgroup *cgroup;
+
+	rcu_read_lock();
+	while (true) {
+		cgroup = task_dfl_cgroup(task);
+		if (likely(css_tryget_online(&cgroup->self)))
+			break;
+		cpu_relax();
+	}
+	rcu_read_unlock();
+	return cgroup;
+}
+
 static inline struct cgroup *cgroup_dfl_root(void)
 {
 	return &cgrp_dfl_root.cgrp;
@@ -705,6 +727,10 @@ static inline struct cgroup *task_dfl_cgroup(struct task_struct *task)
 {
 	return NULL;
 }
+static inline struct cgroup *task_get_dfl_cgroup(struct task_struct *task)
+{
+	return NULL;
+}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/kernel/ktask.c b/kernel/ktask.c
index 15d62ed7c67e..b047f30f77fa 100644
--- a/kernel/ktask.c
+++ b/kernel/ktask.c
@@ -14,6 +14,7 @@
 
 #ifdef CONFIG_KTASK
 
+#include <linux/cgroup.h>
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
 #include <linux/init.h>
@@ -49,7 +50,7 @@ enum ktask_work_flags {
 
 /* Used to pass ktask data to the workqueue API. */
 struct ktask_work {
-	struct work_struct	kw_work;
+	struct cgroup_work	kw_work;
 	struct ktask_task	*kw_task;
 	int			kw_ktask_node_i;
 	int			kw_queue_nid;
@@ -76,6 +77,7 @@ struct ktask_task {
 	size_t			kt_nr_nodes;
 	size_t			kt_nr_nodes_left;
 	int			kt_error; /* first error from thread_func */
+	struct cgroup		*kt_cgroup;
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	kt_lockdep_map;
 #endif
@@ -103,16 +105,16 @@ static void ktask_init_work(struct ktask_work *kw, struct ktask_task *kt,
 {
 	/* The master's work is always on the stack--in __ktask_run_numa. */
 	if (flags & KTASK_WORK_MASTER)
-		INIT_WORK_ONSTACK(&kw->kw_work, ktask_thread);
+		INIT_CGROUP_WORK_ONSTACK(&kw->kw_work, ktask_thread);
 	else
-		INIT_WORK(&kw->kw_work, ktask_thread);
+		INIT_CGROUP_WORK(&kw->kw_work, ktask_thread);
 	kw->kw_task = kt;
 	kw->kw_ktask_node_i = ktask_node_i;
 	kw->kw_queue_nid = queue_nid;
 	kw->kw_flags = flags;
 }
 
-static void ktask_queue_work(struct ktask_work *kw)
+static void ktask_queue_work(struct ktask_work *kw, struct cgroup *cgroup)
 {
 	struct workqueue_struct *wq;
 
@@ -128,7 +130,8 @@ static void ktask_queue_work(struct ktask_work *kw)
 	}
 
 	WARN_ON(!wq);
-	WARN_ON(!queue_work_node(kw->kw_queue_nid, wq, &kw->kw_work));
+	WARN_ON(!queue_cgroup_work_node(kw->kw_queue_nid, wq, &kw->kw_work,
+		cgroup));
 }
 
 /* Returns true if we're migrating this part of the task to another node. */
@@ -163,14 +166,15 @@ static bool ktask_node_migrate(struct ktask_node *old_kn, struct ktask_node *kn,
 
 	WARN_ON(kw->kw_flags & (KTASK_WORK_FINISHED | KTASK_WORK_UNDO));
 	ktask_init_work(kw, kt, ktask_node_i, new_queue_nid, kw->kw_flags);
-	ktask_queue_work(kw);
+	ktask_queue_work(kw, kt->kt_cgroup);
 
 	return true;
 }
 
 static void ktask_thread(struct work_struct *work)
 {
-	struct ktask_work  *kw = container_of(work, struct ktask_work, kw_work);
+	struct cgroup_work *cw = container_of(work, struct cgroup_work, work);
+	struct ktask_work  *kw = container_of(cw, struct ktask_work, kw_work);
 	struct ktask_task  *kt = kw->kw_task;
 	struct ktask_ctl   *kc = &kt->kt_ctl;
 	struct ktask_node  *kn = &kt->kt_nodes[kw->kw_ktask_node_i];
@@ -455,7 +459,7 @@ static void __ktask_wait_for_completion(struct ktask_task *kt,
 		while (!(READ_ONCE(work->kw_flags) & KTASK_WORK_FINISHED))
 			cpu_relax();
 	} else {
-		flush_work_at_nice(&work->kw_work, task_nice(current));
+		flush_work_at_nice(&work->kw_work.work, task_nice(current));
 	}
 }
 
@@ -530,15 +534,18 @@ int __ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
 	kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size,
 					    ctl->kc_min_chunk_size, nr_works);
 
+	/* Ensure the master's cgroup throttles helper threads. */
+	kt.kt_cgroup = task_get_dfl_cgroup(current);
 	list_for_each_entry(work, &unfinished_works, kw_list)
-		ktask_queue_work(work);
+		ktask_queue_work(work, kt.kt_cgroup);
 
 	/* Use the current thread, which saves starting a workqueue worker. */
 	ktask_init_work(&kw, &kt, 0, nodes[0].kn_nid, KTASK_WORK_MASTER);
 	INIT_LIST_HEAD(&kw.kw_list);
-	ktask_thread(&kw.kw_work);
+	ktask_thread(&kw.kw_work.work);
 
 	ktask_wait_for_completion(&kt, &unfinished_works, &finished_works);
+	cgroup_put(kt.kt_cgroup);
 
 	if (kt.kt_error != KTASK_RETURN_SUCCESS && ctl->kc_undo_func)
 		ktask_undo(nodes, nr_nodes, ctl, &finished_works, &kw);
@@ -611,13 +618,14 @@ void __init ktask_init(void)
 	if (!ktask_rlim_init())
 		goto out;
 
-	ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND, 0);
+	ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND | WQ_CGROUP, 0);
 	if (!ktask_wq) {
 		pr_warn("disabled (failed to alloc ktask_wq)");
 		goto out;
 	}
 
-	ktask_nonuma_wq = alloc_workqueue("ktask_nonuma_wq", WQ_UNBOUND, 0);
+	ktask_nonuma_wq = alloc_workqueue("ktask_nonuma_wq",
+					  WQ_UNBOUND | WQ_CGROUP, 0);
 	if (!ktask_nonuma_wq) {
 		pr_warn("disabled (failed to alloc ktask_nonuma_wq)");
 		goto alloc_fail;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC v2 0/5] cgroup-aware unbound workqueues
  2019-06-05 13:36 [RFC v2 0/5] cgroup-aware unbound workqueues Daniel Jordan
                   ` (4 preceding siblings ...)
  2019-06-05 13:36 ` [RFC v2 5/5] ktask, cgroup: attach helper threads to the master thread's cgroup Daniel Jordan
@ 2019-06-05 13:53 ` Tejun Heo
  2019-06-05 15:32   ` Daniel Jordan
  2019-06-06  6:15   ` Mike Rapoport
  5 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2019-06-05 13:53 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: hannes, jiangshanlai, lizefan, bsd, dan.j.williams, dave.hansen,
	juri.lelli, mhocko, peterz, steven.sistare, tglx, tom.hromatka,
	vdavydov.dev, cgroups, linux-kernel, linux-mm

Hello, Daniel.

On Wed, Jun 05, 2019 at 09:36:45AM -0400, Daniel Jordan wrote:
> My use case for this work is kernel multithreading, the series formerly known
> as ktask[2] that I'm now trying to combine with padata according to feedback
> from the last post.  Helper threads in a multithreaded job may consume lots of
> resources that aren't properly accounted to the cgroup of the task that started
> the job.

Can you please go into more details on the use cases?

For memory and io, we're generally going for remote charging, where a
kthread explicitly says who the specific io or allocation is for,
combined with selective back-charging, where the resource is charged
and consumed unconditionally even if that would put the usage above
the current limits temporarily.  From what I've been seeing recently,
the combination of the two gives us really good control quality without
being too invasive across the stack.

CPU doesn't have a backcharging mechanism yet and depending on the use
case, we *might* need to put kthreads in different cgroups.  However,
such use cases might not be that abundant and there may be gotchas
which require them to be force-executed and back-charged (e.g. fs
compression from global reclaim).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v2 0/5] cgroup-aware unbound workqueues
  2019-06-05 13:53 ` [RFC v2 0/5] cgroup-aware unbound workqueues Tejun Heo
@ 2019-06-05 15:32   ` Daniel Jordan
  2019-06-11 19:55     ` Tejun Heo
  2019-06-06  6:15   ` Mike Rapoport
  1 sibling, 1 reply; 12+ messages in thread
From: Daniel Jordan @ 2019-06-05 15:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Daniel Jordan, hannes, jiangshanlai, lizefan, bsd,
	dan.j.williams, dave.hansen, juri.lelli, mhocko, peterz,
	steven.sistare, tglx, tom.hromatka, vdavydov.dev, cgroups,
	linux-kernel, linux-mm, shakeelb

Hi Tejun,

On Wed, Jun 05, 2019 at 06:53:19AM -0700, Tejun Heo wrote:
> On Wed, Jun 05, 2019 at 09:36:45AM -0400, Daniel Jordan wrote:
> > My use case for this work is kernel multithreading, the series formerly known
> > as ktask[2] that I'm now trying to combine with padata according to feedback
> > from the last post.  Helper threads in a multithreaded job may consume lots of
> > resources that aren't properly accounted to the cgroup of the task that started
> > the job.
> 
> Can you please go into more details on the use cases?

Sure, quoting from the last ktask post:

  A single CPU can spend an excessive amount of time in the kernel operating
  on large amounts of data.  Often these situations arise during initialization-
  and destruction-related tasks, where the data involved scales with system size.
  These long-running jobs can slow startup and shutdown of applications and the
  system itself while extra CPUs sit idle.
      
  To ensure that applications and the kernel continue to perform well as core
  counts and memory sizes increase, harness these idle CPUs to complete such jobs
  more quickly.
      
  ktask is a generic framework for parallelizing CPU-intensive work in the
  kernel.  The API is generic enough to add concurrency to many different kinds
  of tasks--for example, zeroing a range of pages or evicting a list of
  inodes--and aims to save its clients the trouble of splitting up the work,
  choosing the number of threads to use, maintaining an efficient concurrency
  level, starting these threads, and load balancing the work between them.

So far the users of the framework primarily consume CPU and memory.

> For memory and io, we're generally going for remote charging, where a
> kthread explicitly says who the specific io or allocation is for,
> combined with selective back-charging, where the resource is charged
> and consumed unconditionally even if that would put the usage above
> the current limits temporarily.  From what I've been seeing recently,
> combination of the two give us really good control quality without
> being too invasive across the stack.

Yes, for memory I actually use remote charging.  In patch 3 the worker's
current->active_memcg field is changed to match that of the cgroup associated
with the work.
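
For reference, the gist of that pattern looks roughly like this (a simplified
sketch, not the actual patch 3 hunk; cwork->cgroup stands in for wherever the
work's target cgroup is stored):

	struct cgroup_subsys_state *css;

	/* resolve the memcg that should pay for the work's allocations */
	css = cgroup_get_e_css(cwork->cgroup, &memory_cgrp_subsys);
	memalloc_use_memcg(mem_cgroup_from_css(css));	/* sets active_memcg */
	cwork->work.func(&cwork->work);	/* accounted allocations charge there */
	memalloc_unuse_memcg();
	css_put(css);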

Cc Shakeel, since we're talking about it.

> CPU doesn't have a backcharging mechanism yet and depending on the use
> case, we *might* need to put kthreads in different cgroups.  However,
> such use cases might not be that abundant and there may be gotaches
> which require them to be force-executed and back-charged (e.g. fs
> compression from global reclaim).

The CPU-intensiveness of these works is one of the reasons for actually putting
the workers through the migration path.  I don't know of a way to get the
workers to respect the cpu controller (and even cpuset for that matter) without
doing that.

Thanks for the quick feedback.

Daniel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v2 0/5] cgroup-aware unbound workqueues
  2019-06-05 13:53 ` [RFC v2 0/5] cgroup-aware unbound workqueues Tejun Heo
  2019-06-05 15:32   ` Daniel Jordan
@ 2019-06-06  6:15   ` Mike Rapoport
  2019-06-11 19:52     ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Mike Rapoport @ 2019-06-06  6:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Daniel Jordan, hannes, jiangshanlai, lizefan, bsd,
	dan.j.williams, dave.hansen, juri.lelli, mhocko, peterz,
	steven.sistare, tglx, tom.hromatka, vdavydov.dev, cgroups,
	linux-kernel, linux-mm

Hi Tejun,

On Wed, Jun 05, 2019 at 06:53:19AM -0700, Tejun Heo wrote:
> Hello, Daniel.
> 
> On Wed, Jun 05, 2019 at 09:36:45AM -0400, Daniel Jordan wrote:
> > My use case for this work is kernel multithreading, the series formerly known
> > as ktask[2] that I'm now trying to combine with padata according to feedback
> > from the last post.  Helper threads in a multithreaded job may consume lots of
> > resources that aren't properly accounted to the cgroup of the task that started
> > the job.
> 
> Can you please go into more details on the use cases?

If I remember correctly, Bandan's original work was about using workqueues
instead of kthreads in vhost.
 
> For memory and io, we're generally going for remote charging, where a
> kthread explicitly says who the specific io or allocation is for,
> combined with selective back-charging, where the resource is charged
> and consumed unconditionally even if that would put the usage above
> the current limits temporarily.  From what I've been seeing recently,
> combination of the two give us really good control quality without
> being too invasive across the stack.
> 
> CPU doesn't have a backcharging mechanism yet and depending on the use
> case, we *might* need to put kthreads in different cgroups.  However,
> such use cases might not be that abundant and there may be gotaches
> which require them to be force-executed and back-charged (e.g. fs
> compression from global reclaim).
> 
> Thanks.
> 
> -- 
> tejun
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v2 0/5] cgroup-aware unbound workqueues
  2019-06-06  6:15   ` Mike Rapoport
@ 2019-06-11 19:52     ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2019-06-11 19:52 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Daniel Jordan, hannes, jiangshanlai, lizefan, bsd,
	dan.j.williams, dave.hansen, juri.lelli, mhocko, peterz,
	steven.sistare, tglx, tom.hromatka, vdavydov.dev, cgroups,
	linux-kernel, linux-mm

Hello,

On Thu, Jun 06, 2019 at 09:15:26AM +0300, Mike Rapoport wrote:
> > Can you please go into more details on the use cases?
> 
> If I remember correctly, the original Bandan's work was about using
> workqueues instead of kthreads in vhost. 

For vhost, I think it might be better to stick with kthread or
kthread_worker, given that they can consume lots of cpu cycles over a
long period of time and we want to keep persistent track of their
scheduling state.
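
For illustration, the kthread_worker pattern looks roughly like this (a
sketch only; my_work_fn and the worker name are placeholders):

	struct kthread_worker *w;
	struct kthread_work work;

	w = kthread_create_worker(0, "vhost-worker");
	if (IS_ERR(w))
		return PTR_ERR(w);
	/*
	 * w->task is an ordinary kthread, so it can be attached to a cgroup
	 * once and its scheduling state persists across work items.
	 */
	kthread_init_work(&work, my_work_fn);
	kthread_queue_work(w, &work);
	kthread_flush_work(&work);
	kthread_destroy_worker(w);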

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v2 0/5] cgroup-aware unbound workqueues
  2019-06-05 15:32   ` Daniel Jordan
@ 2019-06-11 19:55     ` Tejun Heo
  2019-06-12 22:29       ` Daniel Jordan
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2019-06-11 19:55 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: hannes, jiangshanlai, lizefan, bsd, dan.j.williams, dave.hansen,
	juri.lelli, mhocko, peterz, steven.sistare, tglx, tom.hromatka,
	vdavydov.dev, cgroups, linux-kernel, linux-mm, shakeelb

Hello, Daniel.

On Wed, Jun 05, 2019 at 11:32:29AM -0400, Daniel Jordan wrote:
> Sure, quoting from the last ktask post:
> 
>   A single CPU can spend an excessive amount of time in the kernel operating
>   on large amounts of data.  Often these situations arise during initialization-
>   and destruction-related tasks, where the data involved scales with system size.
>   These long-running jobs can slow startup and shutdown of applications and the
>   system itself while extra CPUs sit idle.
>       
>   To ensure that applications and the kernel continue to perform well as core
>   counts and memory sizes increase, harness these idle CPUs to complete such jobs
>   more quickly.
>       
>   ktask is a generic framework for parallelizing CPU-intensive work in the
>   kernel.  The API is generic enough to add concurrency to many different kinds
>   of tasks--for example, zeroing a range of pages or evicting a list of
>   inodes--and aims to save its clients the trouble of splitting up the work,
>   choosing the number of threads to use, maintaining an efficient concurrency
>   level, starting these threads, and load balancing the work between them.

Yeah, that rings a bell.

> > For memory and io, we're generally going for remote charging, where a
> > kthread explicitly says who the specific io or allocation is for,
> > combined with selective back-charging, where the resource is charged
> > and consumed unconditionally even if that would put the usage above
> > the current limits temporarily.  From what I've been seeing recently,
> > combination of the two give us really good control quality without
> > being too invasive across the stack.
> 
> Yes, for memory I actually use remote charging.  In patch 3 the worker's
> current->active_memcg field is changed to match that of the cgroup associated
> with the work.

I see.

> > CPU doesn't have a backcharging mechanism yet and depending on the use
> > case, we *might* need to put kthreads in different cgroups.  However,
> > such use cases might not be that abundant and there may be gotaches
> > which require them to be force-executed and back-charged (e.g. fs
> > compression from global reclaim).
> 
> The CPU-intensiveness of these works is one of the reasons for actually putting
> the workers through the migration path.  I don't know of a way to get the
> workers to respect the cpu controller (and even cpuset for that matter) without
> doing that.

So, I still think it'd likely be better to go the back-charging route than
actually putting kworkers in non-root cgroups.  That's gonna be way
cheaper and simpler, and it makes avoiding inadvertent priority inversions
trivial.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v2 0/5] cgroup-aware unbound workqueues
  2019-06-11 19:55     ` Tejun Heo
@ 2019-06-12 22:29       ` Daniel Jordan
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Jordan @ 2019-06-12 22:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Daniel Jordan, hannes, jiangshanlai, lizefan, bsd,
	dan.j.williams, dave.hansen, juri.lelli, mhocko, peterz,
	steven.sistare, tglx, tom.hromatka, vdavydov.dev, cgroups,
	linux-kernel, linux-mm, shakeelb

On Tue, Jun 11, 2019 at 12:55:49PM -0700, Tejun Heo wrote:
> > > CPU doesn't have a backcharging mechanism yet and depending on the use
> > > case, we *might* need to put kthreads in different cgroups.  However,
> > > such use cases might not be that abundant and there may be gotaches
> > > which require them to be force-executed and back-charged (e.g. fs
> > > compression from global reclaim).
> > 
> > The CPU-intensiveness of these works is one of the reasons for actually putting
> > the workers through the migration path.  I don't know of a way to get the
> > workers to respect the cpu controller (and even cpuset for that matter) without
> > doing that.
> 
> So, I still think it'd likely be better to go back-charging route than
> actually putting kworkers in non-root cgroups.  That's gonna be way
> cheaper, simpler and makes avoiding inadvertent priority inversions
> trivial.

Ok, I'll experiment with backcharging in the cpu controller.  My initial plan
is to smooth out resource usage by backcharging after each chunk of work that
each helper thread does, rather than doing one giant backcharge after the
multithreaded job is over.  It may turn out better performance-wise to do it
less often than this.

I'll also experiment with getting workqueue workers to respect cpuset without
migrating.  It seems to make sense to use the intersection of an unbound
worker's cpumask and the cpuset's cpumask, and to make some compromises if the
result is empty.
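
Something along these lines, maybe (a rough sketch only; cs_effective_cpus
stands in for however the cpuset's mask ends up being looked up):

	/* clamp a worker to cpuset mask & pool cpumask for one work item */
	cpumask_var_t mask;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return;
	cpumask_and(mask, pool->attrs->cpumask, cs_effective_cpus);
	if (cpumask_empty(mask))
		cpumask_copy(mask, pool->attrs->cpumask);	/* compromise */
	set_cpus_allowed_ptr(worker->task, mask);
	free_cpumask_var(mask);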

Daniel

^ permalink raw reply	[flat|nested] 12+ messages in thread

