* [RFC PATCH V5 1/5] workqueue: rename system workqueues
@ 2018-01-27  5:15 Wen Yang
  2018-01-27  5:15 ` [RFC PATCH V5 2/5] workqueue: expose attrs for " Wen Yang
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Wen Yang @ 2018-01-27  5:15 UTC (permalink / raw)
  To: tj
  Cc: zhong.weidong, wen.yang99, Jiang Biao, Tan Hu, Lai Jiangshan,
	linux-kernel

Rename system_wq's wq->name from "events" to "system_percpu",
and similarly for the similarly named workqueues.
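
For illustration (not part of this patch; the function and variable names
below are made up): only the name string passed to alloc_workqueue()
changes, so existing users of the exported workqueue pointers are
unaffected, and the new name is simply what wq->name reports in
debugging and sysfs output:

    /* needs <linux/workqueue.h>; illustrative only */
    static void my_work_fn(struct work_struct *work)
    {
            /* ... */
    }
    static DECLARE_WORK(my_work, my_work_fn);

    static void kick_my_work(void)
    {
            /* still queues to system_wq, now named "system_percpu" */
            queue_work(system_wq, &my_work);
    }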

Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f699122..67b68bb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5601,16 +5601,18 @@ int __init workqueue_init_early(void)
 		ordered_wq_attrs[i] = attrs;
 	}
 
-	system_wq = alloc_workqueue("events", 0, 0);
-	system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
-	system_long_wq = alloc_workqueue("events_long", 0, 0);
-	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
+	system_wq = alloc_workqueue("system_percpu", 0, 0);
+	system_highpri_wq = alloc_workqueue("system_percpu_highpri",
+					    WQ_HIGHPRI, 0);
+	system_long_wq = alloc_workqueue("system_percpu_long", 0, 0);
+	system_unbound_wq = alloc_workqueue("system_unbound", WQ_UNBOUND,
 					    WQ_UNBOUND_MAX_ACTIVE);
-	system_freezable_wq = alloc_workqueue("events_freezable",
+	system_freezable_wq = alloc_workqueue("system_percpu_freezable",
 					      WQ_FREEZABLE, 0);
-	system_power_efficient_wq = alloc_workqueue("events_power_efficient",
+	system_power_efficient_wq = alloc_workqueue("system_percpu_power_efficient",
 					      WQ_POWER_EFFICIENT, 0);
-	system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_power_efficient",
+	system_freezable_power_efficient_wq = alloc_workqueue(
+					      "system_percpu_freezable_power_efficient",
 					      WQ_FREEZABLE | WQ_POWER_EFFICIENT,
 					      0);
 	BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
-- 
1.8.3.1


* [RFC PATCH V5 2/5] workqueue: expose attrs for system workqueues
  2018-01-27  5:15 [RFC PATCH V5 1/5] workqueue: rename system workqueues Wen Yang
@ 2018-01-27  5:15 ` Wen Yang
  2018-01-27  5:15 ` [RFC PATCH V5 3/5] workqueue: rename unbound_attrs to attrs Wen Yang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Wen Yang @ 2018-01-27  5:15 UTC (permalink / raw)
  To: tj
  Cc: zhong.weidong, wen.yang99, Jiang Biao, Tan Hu, Lai Jiangshan,
	kernel test robot, linux-kernel

Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: kernel test robot <xiaolong.ye@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 67b68bb..d5a5c76 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5254,7 +5254,11 @@ static int __init wq_sysfs_init(void)
 	if (err)
 		return err;
 
-	return device_create_file(wq_subsys.dev_root, &wq_sysfs_cpumask_attr);
+	err = device_create_file(wq_subsys.dev_root, &wq_sysfs_cpumask_attr);
+	if (err)
+		return err;
+	return workqueue_sysfs_register(system_wq) ||
+		workqueue_sysfs_register(system_highpri_wq);
 }
 core_initcall(wq_sysfs_init);
 
-- 
1.8.3.1


* [RFC PATCH V5 3/5] workqueue: rename unbound_attrs to attrs
  2018-01-27  5:15 [RFC PATCH V5 1/5] workqueue: rename system workqueues Wen Yang
  2018-01-27  5:15 ` [RFC PATCH V5 2/5] workqueue: expose attrs for " Wen Yang
@ 2018-01-27  5:15 ` Wen Yang
  2018-01-27  5:15 ` [RFC PATCH V5 4/5] workqueue: convert ->nice to ->sched_attr Wen Yang
  2018-01-27  5:15 ` [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler Wen Yang
  3 siblings, 0 replies; 11+ messages in thread
From: Wen Yang @ 2018-01-27  5:15 UTC (permalink / raw)
  To: tj
  Cc: zhong.weidong, wen.yang99, Jiang Biao, Tan Hu, Lai Jiangshan,
	kernel test robot, linux-kernel

Rename the workqueue's unbound_attrs field to attrs, so that both
unbound and bound workqueues can use it.
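
For illustration (not part of this patch): after this change, wq->attrs
is allocated for every workqueue in __alloc_workqueue_key(), not only
for WQ_UNBOUND ones, so later patches can attach scheduling parameters
to percpu workqueues as well. A hypothetical helper that now works for
either kind of workqueue:

    /* hypothetical, illustrative only */
    static int wq_nice_of(struct workqueue_struct *wq)
    {
            int nice;

            mutex_lock(&wq->mutex);
            nice = wq->attrs->nice;    /* valid for bound and unbound wqs */
            mutex_unlock(&wq->mutex);
            return nice;
    }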

Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: kernel test robot <xiaolong.ye@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d5a5c76..df22ecb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -255,7 +255,7 @@ struct workqueue_struct {
 	int			nr_drainers;	/* WQ: drain in progress */
 	int			saved_max_active; /* WQ: saved pwq max_active */
 
-	struct workqueue_attrs	*unbound_attrs;	/* PW: only for unbound wqs */
+	struct workqueue_attrs	*attrs;
 	struct pool_workqueue	*dfl_pwq;	/* PW: only for unbound wqs */
 
 #ifdef CONFIG_SYSFS
@@ -3248,9 +3248,8 @@ static void rcu_free_wq(struct rcu_head *rcu)
 
 	if (!(wq->flags & WQ_UNBOUND))
 		free_percpu(wq->cpu_pwqs);
-	else
-		free_workqueue_attrs(wq->unbound_attrs);
 
+	free_workqueue_attrs(wq->attrs);
 	kfree(wq->rescuer);
 	kfree(wq);
 }
@@ -3725,7 +3724,7 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
 	/* all pwqs have been created successfully, let's install'em */
 	mutex_lock(&ctx->wq->mutex);
 
-	copy_workqueue_attrs(ctx->wq->unbound_attrs, ctx->attrs);
+	copy_workqueue_attrs(ctx->wq->attrs, ctx->attrs);
 
 	/* save the previous pwq and install the new one */
 	for_each_node(node)
@@ -3842,7 +3841,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 	lockdep_assert_held(&wq_pool_mutex);
 
 	if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) ||
-	    wq->unbound_attrs->no_numa)
+	    wq->attrs->no_numa)
 		return;
 
 	/*
@@ -3853,7 +3852,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 	target_attrs = wq_update_unbound_numa_attrs_buf;
 	cpumask = target_attrs->cpumask;
 
-	copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
+	copy_workqueue_attrs(target_attrs, wq->attrs);
 	pwq = unbound_pwq_by_node(wq, node);
 
 	/*
@@ -3973,11 +3972,9 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 	if (!wq)
 		return NULL;
 
-	if (flags & WQ_UNBOUND) {
-		wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
-		if (!wq->unbound_attrs)
-			goto err_free_wq;
-	}
+	wq->attrs = alloc_workqueue_attrs(GFP_KERNEL);
+	if (!wq->attrs)
+		goto err_free_wq;
 
 	va_start(args, lock_name);
 	vsnprintf(wq->name, sizeof(wq->name), fmt, args);
@@ -4048,7 +4045,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 	return wq;
 
 err_free_wq:
-	free_workqueue_attrs(wq->unbound_attrs);
+	free_workqueue_attrs(wq->attrs);
 	kfree(wq);
 	return NULL;
 err_destroy:
@@ -4919,7 +4916,7 @@ static int workqueue_apply_unbound_cpumask(void)
 		if (wq->flags & __WQ_ORDERED)
 			continue;
 
-		ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
+		ctx = apply_wqattrs_prepare(wq, wq->attrs);
 		if (!ctx) {
 			ret = -ENOMEM;
 			break;
@@ -5077,7 +5074,7 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
 	int written;
 
 	mutex_lock(&wq->mutex);
-	written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+	written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->attrs->nice);
 	mutex_unlock(&wq->mutex);
 
 	return written;
@@ -5094,7 +5091,7 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
 	if (!attrs)
 		return NULL;
 
-	copy_workqueue_attrs(attrs, wq->unbound_attrs);
+	copy_workqueue_attrs(attrs, wq->attrs);
 	return attrs;
 }
 
@@ -5131,7 +5128,7 @@ static ssize_t wq_cpumask_show(struct device *dev,
 
 	mutex_lock(&wq->mutex);
 	written = scnprintf(buf, PAGE_SIZE, "%*pb\n",
-			    cpumask_pr_args(wq->unbound_attrs->cpumask));
+			    cpumask_pr_args(wq->attrs->cpumask));
 	mutex_unlock(&wq->mutex);
 	return written;
 }
@@ -5168,7 +5165,7 @@ static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
 
 	mutex_lock(&wq->mutex);
 	written = scnprintf(buf, PAGE_SIZE, "%d\n",
-			    !wq->unbound_attrs->no_numa);
+			    !wq->attrs->no_numa);
 	mutex_unlock(&wq->mutex);
 
 	return written;
-- 
1.8.3.1


* [RFC PATCH V5 4/5] workqueue: convert ->nice to ->sched_attr
  2018-01-27  5:15 [RFC PATCH V5 1/5] workqueue: rename system workqueues Wen Yang
  2018-01-27  5:15 ` [RFC PATCH V5 2/5] workqueue: expose attrs for " Wen Yang
  2018-01-27  5:15 ` [RFC PATCH V5 3/5] workqueue: rename unbound_attrs to attrs Wen Yang
@ 2018-01-27  5:15 ` Wen Yang
  2018-01-27  5:15 ` [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler Wen Yang
  3 siblings, 0 replies; 11+ messages in thread
From: Wen Yang @ 2018-01-27  5:15 UTC (permalink / raw)
  To: tj
  Cc: zhong.weidong, wen.yang99, Jiang Biao, Tan Hu, Lai Jiangshan,
	kernel test robot, linux-kernel

Being able to specify more than just a nice value for a workqueue
may be useful for a wide variety of applications.
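
For illustration (not part of this patch): struct sched_attr, from
<uapi/linux/sched/types.h>, bundles policy, RT priority and nice
together. After this patch the workqueue code still only consumes the
nice value; it just lives inside the embedded sched_attr:

    /* illustrative only */
    struct workqueue_attrs *attrs = alloc_workqueue_attrs(GFP_KERNEL);

    if (attrs) {
            /* what used to be attrs->nice */
            attrs->sched_attr.sched_nice = -20;
    }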

Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: kernel test robot <xiaolong.ye@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 include/linux/workqueue.h |  5 +++--
 kernel/workqueue.c        | 27 +++++++++++++++------------
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 4a54ef9..d9d0f36 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -13,6 +13,7 @@
 #include <linux/threads.h>
 #include <linux/atomic.h>
 #include <linux/cpumask.h>
+#include <uapi/linux/sched/types.h>
 
 struct workqueue_struct;
 
@@ -127,9 +128,9 @@ struct delayed_work {
  */
 struct workqueue_attrs {
 	/**
-	 * @nice: nice level
+	 * @sched_attr: kworker's scheduling parameters
 	 */
-	int nice;
+	struct sched_attr sched_attr;
 
 	/**
 	 * @cpumask: allowed CPUs
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index df22ecb..fca0e30 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1772,7 +1772,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 
 	if (pool->cpu >= 0)
 		snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
-			 pool->attrs->nice < 0  ? "H" : "");
+			 pool->attrs->sched_attr.sched_nice < 0  ? "H" : "");
 	else
 		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
@@ -1781,7 +1781,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 	if (IS_ERR(worker->task))
 		goto fail;
 
-	set_user_nice(worker->task, pool->attrs->nice);
+	set_user_nice(worker->task, pool->attrs->sched_attr.sched_nice);
 	kthread_bind_mask(worker->task, pool->attrs->cpumask);
 
 	/* successful, attach the worker to the pool */
@@ -3169,7 +3169,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
 				 const struct workqueue_attrs *from)
 {
-	to->nice = from->nice;
+	to->sched_attr.sched_nice = from->sched_attr.sched_nice;
 	cpumask_copy(to->cpumask, from->cpumask);
 	/*
 	 * Unlike hash and equality test, this function doesn't ignore
@@ -3184,7 +3184,7 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
 {
 	u32 hash = 0;
 
-	hash = jhash_1word(attrs->nice, hash);
+	hash = jhash_1word(attrs->sched_attr.sched_nice, hash);
 	hash = jhash(cpumask_bits(attrs->cpumask),
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
 	return hash;
@@ -3194,7 +3194,7 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
 static bool wqattrs_equal(const struct workqueue_attrs *a,
 			  const struct workqueue_attrs *b)
 {
-	if (a->nice != b->nice)
+	if (a->sched_attr.sched_nice != b->sched_attr.sched_nice)
 		return false;
 	if (!cpumask_equal(a->cpumask, b->cpumask))
 		return false;
@@ -4336,7 +4336,8 @@ static void pr_cont_pool_info(struct worker_pool *pool)
 	pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
 	if (pool->node != NUMA_NO_NODE)
 		pr_cont(" node=%d", pool->node);
-	pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
+	pr_cont(" flags=0x%x nice=%d", pool->flags,
+			pool->attrs->sched_attr.sched_nice);
 }
 
 static void pr_cont_work(bool comma, struct work_struct *work)
@@ -5074,7 +5075,8 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
 	int written;
 
 	mutex_lock(&wq->mutex);
-	written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->attrs->nice);
+	written = scnprintf(buf, PAGE_SIZE, "%d\n",
+			wq->attrs->sched_attr.sched_nice);
 	mutex_unlock(&wq->mutex);
 
 	return written;
@@ -5108,8 +5110,9 @@ static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
 	if (!attrs)
 		goto out_unlock;
 
-	if (sscanf(buf, "%d", &attrs->nice) == 1 &&
-	    attrs->nice >= MIN_NICE && attrs->nice <= MAX_NICE)
+	if (sscanf(buf, "%d", &attrs->sched_attr.sched_nice) == 1 &&
+	    attrs->sched_attr.sched_nice >= MIN_NICE &&
+	    attrs->sched_attr.sched_nice <= MAX_NICE)
 		ret = apply_workqueue_attrs_locked(wq, attrs);
 	else
 		ret = -EINVAL;
@@ -5573,7 +5576,7 @@ int __init workqueue_init_early(void)
 			BUG_ON(init_worker_pool(pool));
 			pool->cpu = cpu;
 			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
-			pool->attrs->nice = std_nice[i++];
+			pool->attrs->sched_attr.sched_nice = std_nice[i++];
 			pool->node = cpu_to_node(cpu);
 
 			/* alloc pool ID */
@@ -5588,7 +5591,7 @@ int __init workqueue_init_early(void)
 		struct workqueue_attrs *attrs;
 
 		BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
-		attrs->nice = std_nice[i];
+		attrs->sched_attr.sched_nice = std_nice[i];
 		unbound_std_wq_attrs[i] = attrs;
 
 		/*
@@ -5597,7 +5600,7 @@ int __init workqueue_init_early(void)
 		 * Turn off NUMA so that dfl_pwq is used for all nodes.
 		 */
 		BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
-		attrs->nice = std_nice[i];
+		attrs->sched_attr.sched_nice = std_nice[i];
 		attrs->no_numa = true;
 		ordered_wq_attrs[i] = attrs;
 	}
-- 
1.8.3.1


* [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
  2018-01-27  5:15 [RFC PATCH V5 1/5] workqueue: rename system workqueues Wen Yang
                   ` (2 preceding siblings ...)
  2018-01-27  5:15 ` [RFC PATCH V5 4/5] workqueue: convert ->nice to ->sched_attr Wen Yang
@ 2018-01-27  5:15 ` Wen Yang
  2018-01-27  9:31   ` Mike Galbraith
  2018-01-29  4:15   ` Lai Jiangshan
  3 siblings, 2 replies; 11+ messages in thread
From: Wen Yang @ 2018-01-27  5:15 UTC (permalink / raw)
  To: tj
  Cc: zhong.weidong, wen.yang99, Jiang Biao, Tan Hu, Lai Jiangshan,
	kernel test robot, linux-kernel

When pinning RT threads to specific cores using CPU affinity, the
kworkers on the same CPU can starve, which may lead to a form of
priority inversion. In that case, the RT threads themselves also
suffer a severe performance impact.

The priority inversion looks like,
CPU 0:  libvirtd acquired cgroup_mutex, and triggered
lru_add_drain_per_cpu, then waiting for all the kworkers to complete:
    PID: 44145  TASK: ffff8807bec7b980  CPU: 0   COMMAND: "libvirtd"
    #0 [ffff8807f2cbb9d0] __schedule at ffffffff816410ed
    #1 [ffff8807f2cbba38] schedule at ffffffff81641789
    #2 [ffff8807f2cbba48] schedule_timeout at ffffffff8163f479
    #3 [ffff8807f2cbbaf8] wait_for_completion at ffffffff81641b56
    #4 [ffff8807f2cbbb58] flush_work at ffffffff8109efdc
    #5 [ffff8807f2cbbbd0] lru_add_drain_all at ffffffff81179002
    #6 [ffff8807f2cbbc08] migrate_prep at ffffffff811c77be
    #7 [ffff8807f2cbbc18] do_migrate_pages at ffffffff811b8010
    #8 [ffff8807f2cbbcf8] cpuset_migrate_mm at ffffffff810fea6c
    #9 [ffff8807f2cbbd10] cpuset_attach at ffffffff810ff91e
    #10 [ffff8807f2cbbd50] cgroup_attach_task at ffffffff810f9972
    #11 [ffff8807f2cbbe08] attach_task_by_pid at ffffffff810fa520
    #12 [ffff8807f2cbbe58] cgroup_tasks_write at ffffffff810fa593
    #13 [ffff8807f2cbbe68] cgroup_file_write at ffffffff810f8773
    #14 [ffff8807f2cbbef8] vfs_write at ffffffff811dfdfd
    #15 [ffff8807f2cbbf38] sys_write at ffffffff811e089f
    #16 [ffff8807f2cbbf80] system_call_fastpath at ffffffff8164c809

CPU 43: kworker/43 starved because of the RT threads:
    CURRENT: PID: 21294  TASK: ffff883fd2d45080  COMMAND: "lwip"
    RT PRIO_ARRAY: ffff883fff3f4950
    [ 79] PID: 21294  TASK: ffff883fd2d45080  COMMAND: "lwip"
    [ 79] PID: 21295  TASK: ffff88276d481700  COMMAND: "ovdk-ovsvswitch"
    [ 79] PID: 21351  TASK: ffff8807be822280  COMMAND: "dispatcher"
    [ 79] PID: 21129  TASK: ffff8807bef0f300  COMMAND: "ovdk-ovsvswitch"
    [ 79] PID: 21337  TASK: ffff88276d482e00  COMMAND: "handler_3"
    [ 79] PID: 21352  TASK: ffff8807be824500  COMMAND: "flow_dumper"
    [ 79] PID: 21336  TASK: ffff88276d480b80  COMMAND: "handler_2"
    [ 79] PID: 21342  TASK: ffff88276d484500  COMMAND: "handler_8"
    [ 79] PID: 21341  TASK: ffff88276d482280  COMMAND: "handler_7"
    [ 79] PID: 21338  TASK: ffff88276d483980  COMMAND: "handler_4"
    [ 79] PID: 21339  TASK: ffff88276d480000  COMMAND: "handler_5"
    [ 79] PID: 21340  TASK: ffff88276d486780  COMMAND: "handler_6"
    CFS RB_ROOT: ffff883fff3f4868
    [120] PID: 37959  TASK: ffff88276e148000  COMMAND: "kworker/43:1"

CPU 28: Systemd(Victim) was blocked by cgroup_mutex:
    PID: 1      TASK: ffff883fd2d40000  CPU: 28  COMMAND: "systemd"
    #0 [ffff881fd317bd60] __schedule at ffffffff816410ed
    #1 [ffff881fd317bdc8] schedule_preempt_disabled at ffffffff81642869
    #2 [ffff881fd317bdd8] __mutex_lock_slowpath at ffffffff81640565
    #3 [ffff881fd317be38] mutex_lock at ffffffff8163f9cf
    #4 [ffff881fd317be50] proc_cgroup_show at ffffffff810fd256
    #5 [ffff881fd317be98] seq_read at ffffffff81203cda
    #6 [ffff881fd317bf08] vfs_read at ffffffff811dfc6c
    #7 [ffff881fd317bf38] sys_read at ffffffff811e07bf
    #8 [ffff881fd317bf80] system_call_fastpath at ffffffff81

The simplest way to fix that is to set the scheduler of the kworkers
to a higher RT priority, for example:
    chrt --fifo -p 61 <kworker_pid>
However, this cannot prevent other WORK_CPU_BOUND worker threads from
running and starving.

This patch introduces a way to set the scheduler (policy and priority)
of the percpu worker_pools, so that users can set a suitable scheduler
policy and priority for a worker_pool as needed, which then applies to
all the WORK_CPU_BOUND workers on the same CPU. On the other hand,
/sys/devices/virtual/workqueue/cpumask can be used for the
WORK_CPU_UNBOUND workers to prevent them from starving.

Tejun Heo suggested:
"* Add scheduler type to wq_attrs so that unbound workqueues can be
 configured.

* Rename system_wq's wq->name from "events" to "system_percpu", and
 similarly for the similarly named workqueues.

* Enable wq_attrs (only the applicable part should show up in the
 interface) for system_percpu and system_percpu_highpri, and use that
 to change the attributes of the percpu pools."

This patch implements the basic infrastructure and /sys interface,
such as:
    # cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
    policy=0 prio=0 nice=0
    # echo "policy=1 prio=1 nice=0" > /sys/devices/virtual/workqueue/system_percpu/sched_attr
    # cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
    policy=1 prio=1 nice=0
    # cat  /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
    policy=0 prio=0 nice=-20
    # echo "policy=1 prio=2 nice=0" > /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
    # cat  /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
    policy=1 prio=2 nice=0

Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: kernel test robot <xiaolong.ye@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 196 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 151 insertions(+), 45 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index fca0e30..e58f9bd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1699,6 +1699,7 @@ static void worker_attach_to_pool(struct worker *worker,
 	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
 	 */
 	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	sched_setattr(worker->task, &pool->attrs->sched_attr);
 
 	/*
 	 * The pool->attach_mutex ensures %POOL_DISASSOCIATED remains
@@ -3166,10 +3167,19 @@ struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
 	return NULL;
 }
 
+static void copy_sched_attr(struct sched_attr *to,
+		const struct sched_attr *from)
+{
+	to->sched_policy = from->sched_policy;
+	to->sched_priority = from->sched_priority;
+	to->sched_nice = from->sched_nice;
+	to->sched_flags = from->sched_flags;
+}
+
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
 				 const struct workqueue_attrs *from)
 {
-	to->sched_attr.sched_nice = from->sched_attr.sched_nice;
+	copy_sched_attr(&to->sched_attr, &from->sched_attr);
 	cpumask_copy(to->cpumask, from->cpumask);
 	/*
 	 * Unlike hash and equality test, this function doesn't ignore
@@ -3184,17 +3194,32 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
 {
 	u32 hash = 0;
 
-	hash = jhash_1word(attrs->sched_attr.sched_nice, hash);
+	hash = jhash_3words(attrs->sched_attr.sched_policy,
+			attrs->sched_attr.sched_priority,
+			attrs->sched_attr.sched_nice,
+			hash);
 	hash = jhash(cpumask_bits(attrs->cpumask),
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
 	return hash;
 }
 
+static bool sched_attr_equal(const struct sched_attr *a,
+		const struct sched_attr *b)
+{
+	if (a->sched_policy != b->sched_policy)
+		return false;
+	if (a->sched_priority != b->sched_priority)
+		return false;
+	if (a->sched_nice != b->sched_nice)
+		return false;
+	return true;
+}
+
 /* content equality test */
 static bool wqattrs_equal(const struct workqueue_attrs *a,
 			  const struct workqueue_attrs *b)
 {
-	if (a->sched_attr.sched_nice != b->sched_attr.sched_nice)
+	if (!sched_attr_equal(&a->sched_attr, &b->sched_attr))
 		return false;
 	if (!cpumask_equal(a->cpumask, b->cpumask))
 		return false;
@@ -3911,6 +3936,11 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 			init_pwq(pwq, wq, &cpu_pools[highpri]);
 
 			mutex_lock(&wq->mutex);
+			wq->attrs->sched_attr.sched_policy = SCHED_NORMAL;
+			wq->attrs->sched_attr.sched_priority = 0;
+			wq->attrs->sched_attr.sched_nice =
+				wq->flags & WQ_HIGHPRI ?
+				HIGHPRI_NICE_LEVEL : 0;
 			link_pwq(pwq);
 			mutex_unlock(&wq->mutex);
 		}
@@ -4336,7 +4366,9 @@ static void pr_cont_pool_info(struct worker_pool *pool)
 	pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
 	if (pool->node != NUMA_NO_NODE)
 		pr_cont(" node=%d", pool->node);
-	pr_cont(" flags=0x%x nice=%d", pool->flags,
+	pr_cont(" flags=0x%x policy=%u prio=%u nice=%d", pool->flags,
+			pool->attrs->sched_attr.sched_policy,
+			pool->attrs->sched_attr.sched_priority,
 			pool->attrs->sched_attr.sched_nice);
 }
 
@@ -5041,9 +5073,124 @@ static ssize_t max_active_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(max_active);
 
+static ssize_t sched_attr_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	size_t written;
+	struct workqueue_struct *wq = dev_to_wq(dev);
+
+	mutex_lock(&wq->mutex);
+	written = scnprintf(buf, PAGE_SIZE,
+			"policy=%u prio=%u nice=%d\n",
+			wq->attrs->sched_attr.sched_policy,
+			wq->attrs->sched_attr.sched_priority,
+			wq->attrs->sched_attr.sched_nice);
+	mutex_unlock(&wq->mutex);
+
+	return written;
+}
+
+static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq);
+
+static int wq_set_unbound_sched_attr(struct workqueue_struct *wq,
+		const struct sched_attr *new)
+{
+	struct workqueue_attrs *attrs;
+	int ret = -ENOMEM;
+
+	apply_wqattrs_lock();
+	attrs = wq_sysfs_prep_attrs(wq);
+	if (!attrs)
+		goto out_unlock;
+	copy_sched_attr(&attrs->sched_attr, new);
+	ret = apply_workqueue_attrs_locked(wq, attrs);
+
+out_unlock:
+	apply_wqattrs_unlock();
+	free_workqueue_attrs(attrs);
+	return ret;
+}
+
+static int wq_set_bound_sched_attr(struct workqueue_struct *wq,
+		const struct sched_attr *new)
+{
+	struct pool_workqueue *pwq;
+	struct worker_pool *pool;
+	struct worker *worker;
+	int ret = 0;
+
+	apply_wqattrs_lock();
+	for_each_pwq(pwq, wq) {
+		pool = pwq->pool;
+		mutex_lock(&pool->attach_mutex);
+		for_each_pool_worker(worker, pool) {
+			ret = sched_setattr(worker->task, new);
+			if (ret) {
+				pr_err("%s:%d err[%d]",
+						__func__, __LINE__, ret);
+				pr_err(" worker[%s] policy[%d] prio[%d] nice[%d]\n",
+						worker->task->comm,
+						new->sched_policy,
+						new->sched_priority,
+						new->sched_nice);
+			}
+		}
+		copy_sched_attr(&pool->attrs->sched_attr, new);
+		mutex_unlock(&pool->attach_mutex);
+	}
+	apply_wqattrs_unlock();
+
+	mutex_lock(&wq->mutex);
+	copy_sched_attr(&wq->attrs->sched_attr, new);
+	mutex_unlock(&wq->mutex);
+
+	return ret;
+}
+
+static ssize_t sched_attr_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	struct sched_attr new = {
+		.size = sizeof(struct sched_attr),
+		.sched_policy = SCHED_NORMAL,
+		.sched_flags = 0,
+		.sched_priority = 0,
+	};
+	int ret = 0;
+
+	if (!capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	if (sscanf(buf, "policy=%u prio=%u nice=%d",
+				&new.sched_policy,
+				&new.sched_priority,
+				&new.sched_nice) != 3)
+		return -EINVAL;
+
+	pr_debug("set wq's sched_attr: policy=%u prio=%u nice=%d\n",
+			new.sched_policy,
+			new.sched_priority,
+			new.sched_nice);
+	mutex_lock(&wq->mutex);
+	if (sched_attr_equal(&wq->attrs->sched_attr, &new)) {
+		mutex_unlock(&wq->mutex);
+		return count;
+	}
+	mutex_unlock(&wq->mutex);
+
+	if (wq->flags & WQ_UNBOUND)
+		ret = wq_set_unbound_sched_attr(wq, &new);
+	else
+		ret = wq_set_bound_sched_attr(wq, &new);
+	return ret ?: count;
+}
+static DEVICE_ATTR_RW(sched_attr);
+
 static struct attribute *wq_sysfs_attrs[] = {
 	&dev_attr_per_cpu.attr,
 	&dev_attr_max_active.attr,
+	&dev_attr_sched_attr.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(wq_sysfs);
@@ -5068,20 +5215,6 @@ static ssize_t wq_pool_ids_show(struct device *dev,
 	return written;
 }
 
-static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
-			    char *buf)
-{
-	struct workqueue_struct *wq = dev_to_wq(dev);
-	int written;
-
-	mutex_lock(&wq->mutex);
-	written = scnprintf(buf, PAGE_SIZE, "%d\n",
-			wq->attrs->sched_attr.sched_nice);
-	mutex_unlock(&wq->mutex);
-
-	return written;
-}
-
 /* prepare workqueue_attrs for sysfs store operations */
 static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
 {
@@ -5097,32 +5230,6 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
 	return attrs;
 }
 
-static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
-			     const char *buf, size_t count)
-{
-	struct workqueue_struct *wq = dev_to_wq(dev);
-	struct workqueue_attrs *attrs;
-	int ret = -ENOMEM;
-
-	apply_wqattrs_lock();
-
-	attrs = wq_sysfs_prep_attrs(wq);
-	if (!attrs)
-		goto out_unlock;
-
-	if (sscanf(buf, "%d", &attrs->sched_attr.sched_nice) == 1 &&
-	    attrs->sched_attr.sched_nice >= MIN_NICE &&
-	    attrs->sched_attr.sched_nice <= MAX_NICE)
-		ret = apply_workqueue_attrs_locked(wq, attrs);
-	else
-		ret = -EINVAL;
-
-out_unlock:
-	apply_wqattrs_unlock();
-	free_workqueue_attrs(attrs);
-	return ret ?: count;
-}
-
 static ssize_t wq_cpumask_show(struct device *dev,
 			       struct device_attribute *attr, char *buf)
 {
@@ -5201,7 +5308,6 @@ static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
 
 static struct device_attribute wq_sysfs_unbound_attrs[] = {
 	__ATTR(pool_ids, 0444, wq_pool_ids_show, NULL),
-	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
 	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
 	__ATTR(numa, 0644, wq_numa_show, wq_numa_store),
 	__ATTR_NULL,
-- 
1.8.3.1


* Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
  2018-01-27  5:15 ` [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler Wen Yang
@ 2018-01-27  9:31   ` Mike Galbraith
  2018-01-27 12:37     ` Mike Galbraith
  2018-01-29  4:15   ` Lai Jiangshan
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Galbraith @ 2018-01-27  9:31 UTC (permalink / raw)
  To: Wen Yang, tj
  Cc: zhong.weidong, Jiang Biao, Tan Hu, Lai Jiangshan,
	kernel test robot, linux-kernel

On Sat, 2018-01-27 at 13:15 +0800, Wen Yang wrote:
> When pinning RT threads to specific cores using CPU affinity, the
> kworkers on the same CPU can starve, which may lead to a form of
> priority inversion. In that case, the RT threads themselves also
> suffer a severe performance impact.

...

> This patch introduces a way to set the scheduler (policy and priority)
> of the percpu worker_pools, so that users can set a suitable scheduler
> policy and priority for a worker_pool as needed, which then applies to
> all the WORK_CPU_BOUND workers on the same CPU.

What happens when a new kworker needs to be spawned?  What guarantees
that kthreadd can run?  Not to mention other kthreads that can be
starved, resulting in severe self inflicted injury.  An interface to
configure workqueues is very nice, but it's only part of the problem.

	-Mike


* Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
  2018-01-27  9:31   ` Mike Galbraith
@ 2018-01-27 12:37     ` Mike Galbraith
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Galbraith @ 2018-01-27 12:37 UTC (permalink / raw)
  To: Wen Yang, tj
  Cc: zhong.weidong, Jiang Biao, Tan Hu, Lai Jiangshan,
	kernel test robot, linux-kernel

On Sat, 2018-01-27 at 10:31 +0100, Mike Galbraith wrote:
> On Sat, 2018-01-27 at 13:15 +0800, Wen Yang wrote:
> > When pinning RT threads to specific cores using CPU affinity, the
> > kworkers on the same CPU can starve, which may lead to a form of
> > priority inversion. In that case, the RT threads themselves also
> > suffer a severe performance impact.
> 
> ...
> 
> > This patch introduces a way to set the scheduler (policy and priority)
> > of the percpu worker_pools, so that users can set a suitable scheduler
> > policy and priority for a worker_pool as needed, which then applies to
> > all the WORK_CPU_BOUND workers on the same CPU.
> 
> What happens when a new kworker needs to be spawned?  What guarantees
> that kthreadd can run?  Not to mention other kthreads that can be
> starved, resulting in severe self inflicted injury.  An interface to
> configure workqueues is very nice, but it's only part of the problem.

P.S. You can also meet inversion expressly due to having excluded
unbound kworkers.  Just yesterday, I was tracing dbench, and both
varieties of kworker were involved in the chain.  An RT task doing
anything at all involving unbound kworkers meets an inversion the
instant an unbound kworker doing work on its behalf has to wait for a
SCHED_OTHER task, if that wait can in any way affect RT progress.

	-Mike


* Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
  2018-01-27  5:15 ` [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler Wen Yang
  2018-01-27  9:31   ` Mike Galbraith
@ 2018-01-29  4:15   ` Lai Jiangshan
  2018-01-29  4:41     ` Mike Galbraith
  1 sibling, 1 reply; 11+ messages in thread
From: Lai Jiangshan @ 2018-01-29  4:15 UTC (permalink / raw)
  To: Wen Yang
  Cc: Tejun Heo, zhong.weidong, Jiang Biao, Tan Hu, kernel test robot, LKML

I think adding priority boost to workqueue(flush_work()) is the best
way to fix the problem.

On Sat, Jan 27, 2018 at 1:15 PM, Wen Yang <wen.yang99@zte.com.cn> wrote:
> When pinning RT threads to specific cores using CPU affinity, the
> kworkers on the same CPU can starve, which may lead to a form of
> priority inversion. In that case, the RT threads themselves also
> suffer a severe performance impact.
>
> The priority inversion looks like,
> CPU 0:  libvirtd acquired cgroup_mutex, and triggered
> lru_add_drain_per_cpu, then waiting for all the kworkers to complete:
>     PID: 44145  TASK: ffff8807bec7b980  CPU: 0   COMMAND: "libvirtd"
>     #0 [ffff8807f2cbb9d0] __schedule at ffffffff816410ed
>     #1 [ffff8807f2cbba38] schedule at ffffffff81641789
>     #2 [ffff8807f2cbba48] schedule_timeout at ffffffff8163f479
>     #3 [ffff8807f2cbbaf8] wait_for_completion at ffffffff81641b56
>     #4 [ffff8807f2cbbb58] flush_work at ffffffff8109efdc
>     #5 [ffff8807f2cbbbd0] lru_add_drain_all at ffffffff81179002
>     #6 [ffff8807f2cbbc08] migrate_prep at ffffffff811c77be
>     #7 [ffff8807f2cbbc18] do_migrate_pages at ffffffff811b8010
>     #8 [ffff8807f2cbbcf8] cpuset_migrate_mm at ffffffff810fea6c
>     #9 [ffff8807f2cbbd10] cpuset_attach at ffffffff810ff91e
>     #10 [ffff8807f2cbbd50] cgroup_attach_task at ffffffff810f9972
>     #11 [ffff8807f2cbbe08] attach_task_by_pid at ffffffff810fa520
>     #12 [ffff8807f2cbbe58] cgroup_tasks_write at ffffffff810fa593
>     #13 [ffff8807f2cbbe68] cgroup_file_write at ffffffff810f8773
>     #14 [ffff8807f2cbbef8] vfs_write at ffffffff811dfdfd
>     #15 [ffff8807f2cbbf38] sys_write at ffffffff811e089f
>     #16 [ffff8807f2cbbf80] system_call_fastpath at ffffffff8164c809
>
> CPU 43: kworker/43 starved because of the RT threads:
>     CURRENT: PID: 21294  TASK: ffff883fd2d45080  COMMAND: "lwip"
>     RT PRIO_ARRAY: ffff883fff3f4950
>     [ 79] PID: 21294  TASK: ffff883fd2d45080  COMMAND: "lwip"
>     [ 79] PID: 21295  TASK: ffff88276d481700  COMMAND: "ovdk-ovsvswitch"
>     [ 79] PID: 21351  TASK: ffff8807be822280  COMMAND: "dispatcher"
>     [ 79] PID: 21129  TASK: ffff8807bef0f300  COMMAND: "ovdk-ovsvswitch"
>     [ 79] PID: 21337  TASK: ffff88276d482e00  COMMAND: "handler_3"
>     [ 79] PID: 21352  TASK: ffff8807be824500  COMMAND: "flow_dumper"
>     [ 79] PID: 21336  TASK: ffff88276d480b80  COMMAND: "handler_2"
>     [ 79] PID: 21342  TASK: ffff88276d484500  COMMAND: "handler_8"
>     [ 79] PID: 21341  TASK: ffff88276d482280  COMMAND: "handler_7"
>     [ 79] PID: 21338  TASK: ffff88276d483980  COMMAND: "handler_4"
>     [ 79] PID: 21339  TASK: ffff88276d480000  COMMAND: "handler_5"
>     [ 79] PID: 21340  TASK: ffff88276d486780  COMMAND: "handler_6"
>     CFS RB_ROOT: ffff883fff3f4868
>     [120] PID: 37959  TASK: ffff88276e148000  COMMAND: "kworker/43:1"
>
> CPU 28: Systemd(Victim) was blocked by cgroup_mutex:
>     PID: 1      TASK: ffff883fd2d40000  CPU: 28  COMMAND: "systemd"
>     #0 [ffff881fd317bd60] __schedule at ffffffff816410ed
>     #1 [ffff881fd317bdc8] schedule_preempt_disabled at ffffffff81642869
>     #2 [ffff881fd317bdd8] __mutex_lock_slowpath at ffffffff81640565
>     #3 [ffff881fd317be38] mutex_lock at ffffffff8163f9cf
>     #4 [ffff881fd317be50] proc_cgroup_show at ffffffff810fd256
>     #5 [ffff881fd317be98] seq_read at ffffffff81203cda
>     #6 [ffff881fd317bf08] vfs_read at ffffffff811dfc6c
>     #7 [ffff881fd317bf38] sys_read at ffffffff811e07bf
>     #8 [ffff881fd317bf80] system_call_fastpath at ffffffff81
>
> The simplest way to fix that is to set the scheduler of the kworkers
> to a higher RT priority, for example:
>     chrt --fifo -p 61 <kworker_pid>
> However, this cannot prevent other WORK_CPU_BOUND worker threads from
> running and starving.
>
> This patch introduces a way to set the scheduler (policy and priority)
> of the percpu worker_pools, so that users can set a suitable scheduler
> policy and priority for a worker_pool as needed, which then applies to
> all the WORK_CPU_BOUND workers on the same CPU. On the other hand,
> /sys/devices/virtual/workqueue/cpumask can be used for the
> WORK_CPU_UNBOUND workers to prevent them from starving.
>
> Tejun Heo suggested:
> "* Add scheduler type to wq_attrs so that unbound workqueues can be
>  configured.
>
> * Rename system_wq's wq->name from "events" to "system_percpu", and
>  similarly for the similarly named workqueues.
>
> * Enable wq_attrs (only the applicable part should show up in the
>  interface) for system_percpu and system_percpu_highpri, and use that
>  to change the attributes of the percpu pools."
>
> This patch implements the basic infrastructure and /sys interface,
> such as:
>     # cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
>     policy=0 prio=0 nice=0
>     # echo "policy=1 prio=1 nice=0" > /sys/devices/virtual/workqueue/system_percpu/sched_attr
>     # cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
>     policy=1 prio=1 nice=0
>     # cat  /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
>     policy=0 prio=0 nice=-20
>     # echo "policy=1 prio=2 nice=0" > /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
>     # cat  /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
>     policy=1 prio=2 nice=0
>
> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
> Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
> Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: kernel test robot <xiaolong.ye@intel.com>
> Cc: linux-kernel@vger.kernel.org
> ---
>  kernel/workqueue.c | 196 +++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 151 insertions(+), 45 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index fca0e30..e58f9bd 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1699,6 +1699,7 @@ static void worker_attach_to_pool(struct worker *worker,
>          * online CPUs.  It'll be re-applied when any of the CPUs come up.
>          */
>         set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> +       sched_setattr(worker->task, &pool->attrs->sched_attr);
>
>         /*
>          * The pool->attach_mutex ensures %POOL_DISASSOCIATED remains
> @@ -3166,10 +3167,19 @@ struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
>         return NULL;
>  }
>
> +static void copy_sched_attr(struct sched_attr *to,
> +               const struct sched_attr *from)
> +{
> +       to->sched_policy = from->sched_policy;
> +       to->sched_priority = from->sched_priority;
> +       to->sched_nice = from->sched_nice;
> +       to->sched_flags = from->sched_flags;
> +}
> +
>  static void copy_workqueue_attrs(struct workqueue_attrs *to,
>                                  const struct workqueue_attrs *from)
>  {
> -       to->sched_attr.sched_nice = from->sched_attr.sched_nice;
> +       copy_sched_attr(&to->sched_attr, &from->sched_attr);
>         cpumask_copy(to->cpumask, from->cpumask);
>         /*
>          * Unlike hash and equality test, this function doesn't ignore
> @@ -3184,17 +3194,32 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
>  {
>         u32 hash = 0;
>
> -       hash = jhash_1word(attrs->sched_attr.sched_nice, hash);
> +       hash = jhash_3words(attrs->sched_attr.sched_policy,
> +                       attrs->sched_attr.sched_priority,
> +                       attrs->sched_attr.sched_nice,
> +                       hash);
>         hash = jhash(cpumask_bits(attrs->cpumask),
>                      BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
>         return hash;
>  }
>
> +static bool sched_attr_equal(const struct sched_attr *a,
> +               const struct sched_attr *b)
> +{
> +       if (a->sched_policy != b->sched_policy)
> +               return false;
> +       if (a->sched_priority != b->sched_priority)
> +               return false;
> +       if (a->sched_nice != b->sched_nice)
> +               return false;
> +       return true;
> +}
> +
>  /* content equality test */
>  static bool wqattrs_equal(const struct workqueue_attrs *a,
>                           const struct workqueue_attrs *b)
>  {
> -       if (a->sched_attr.sched_nice != b->sched_attr.sched_nice)
> +       if (!sched_attr_equal(&a->sched_attr, &b->sched_attr))
>                 return false;
>         if (!cpumask_equal(a->cpumask, b->cpumask))
>                 return false;
> @@ -3911,6 +3936,11 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
>                         init_pwq(pwq, wq, &cpu_pools[highpri]);
>
>                         mutex_lock(&wq->mutex);
> +                       wq->attrs->sched_attr.sched_policy = SCHED_NORMAL;
> +                       wq->attrs->sched_attr.sched_priority = 0;
> +                       wq->attrs->sched_attr.sched_nice =
> +                               wq->flags & WQ_HIGHPRI ?
> +                               HIGHPRI_NICE_LEVEL : 0;
>                         link_pwq(pwq);
>                         mutex_unlock(&wq->mutex);
>                 }
> @@ -4336,7 +4366,9 @@ static void pr_cont_pool_info(struct worker_pool *pool)
>         pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
>         if (pool->node != NUMA_NO_NODE)
>                 pr_cont(" node=%d", pool->node);
> -       pr_cont(" flags=0x%x nice=%d", pool->flags,
> +       pr_cont(" flags=0x%x policy=%u prio=%u nice=%d", pool->flags,
> +                       pool->attrs->sched_attr.sched_policy,
> +                       pool->attrs->sched_attr.sched_priority,
>                         pool->attrs->sched_attr.sched_nice);
>  }
>
> @@ -5041,9 +5073,124 @@ static ssize_t max_active_store(struct device *dev,
>  }
>  static DEVICE_ATTR_RW(max_active);
>
> +static ssize_t sched_attr_show(struct device *dev,
> +               struct device_attribute *attr, char *buf)
> +{
> +       size_t written;
> +       struct workqueue_struct *wq = dev_to_wq(dev);
> +
> +       mutex_lock(&wq->mutex);
> +       written = scnprintf(buf, PAGE_SIZE,
> +                       "policy=%u prio=%u nice=%d\n",
> +                       wq->attrs->sched_attr.sched_policy,
> +                       wq->attrs->sched_attr.sched_priority,
> +                       wq->attrs->sched_attr.sched_nice);
> +       mutex_unlock(&wq->mutex);
> +
> +       return written;
> +}
> +
> +static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq);
> +
> +static int wq_set_unbound_sched_attr(struct workqueue_struct *wq,
> +               const struct sched_attr *new)
> +{
> +       struct workqueue_attrs *attrs;
> +       int ret = -ENOMEM;
> +
> +       apply_wqattrs_lock();
> +       attrs = wq_sysfs_prep_attrs(wq);
> +       if (!attrs)
> +               goto out_unlock;
> +       copy_sched_attr(&attrs->sched_attr, new);
> +       ret = apply_workqueue_attrs_locked(wq, attrs);
> +
> +out_unlock:
> +       apply_wqattrs_unlock();
> +       free_workqueue_attrs(attrs);
> +       return ret;
> +}
> +
> +static int wq_set_bound_sched_attr(struct workqueue_struct *wq,
> +               const struct sched_attr *new)
> +{
> +       struct pool_workqueue *pwq;
> +       struct worker_pool *pool;
> +       struct worker *worker;
> +       int ret = 0;
> +
> +       apply_wqattrs_lock();
> +       for_each_pwq(pwq, wq) {
> +               pool = pwq->pool;
> +               mutex_lock(&pool->attach_mutex);
> +               for_each_pool_worker(worker, pool) {
> +                       ret = sched_setattr(worker->task, new);
> +                       if (ret) {
> +                               pr_err("%s:%d err[%d]",
> +                                               __func__, __LINE__, ret);
> +                               pr_err(" worker[%s] policy[%d] prio[%d] nice[%d]\n",
> +                                               worker->task->comm,
> +                                               new->sched_policy,
> +                                               new->sched_priority,
> +                                               new->sched_nice);
> +                       }
> +               }
> +               copy_sched_attr(&pool->attrs->sched_attr, new);
> +               mutex_unlock(&pool->attach_mutex);
> +       }
> +       apply_wqattrs_unlock();
> +
> +       mutex_lock(&wq->mutex);
> +       copy_sched_attr(&wq->attrs->sched_attr, new);
> +       mutex_unlock(&wq->mutex);
> +
> +       return ret;
> +}
> +
> +static ssize_t sched_attr_store(struct device *dev,
> +               struct device_attribute *attr, const char *buf, size_t count)
> +{
> +       struct workqueue_struct *wq = dev_to_wq(dev);
> +       struct sched_attr new = {
> +               .size = sizeof(struct sched_attr),
> +               .sched_policy = SCHED_NORMAL,
> +               .sched_flags = 0,
> +               .sched_priority = 0,
> +       };
> +       int ret = 0;
> +
> +       if (!capable(CAP_SYS_NICE))
> +               return -EPERM;
> +
> +       if (sscanf(buf, "policy=%u prio=%u nice=%d",
> +                               &new.sched_policy,
> +                               &new.sched_priority,
> +                               &new.sched_nice) != 3)
> +               return -EINVAL;
> +
> +       pr_debug("set wq's sched_attr: policy=%u prio=%u nice=%d\n",
> +                       new.sched_policy,
> +                       new.sched_priority,
> +                       new.sched_nice);
> +       mutex_lock(&wq->mutex);
> +       if (sched_attr_equal(&wq->attrs->sched_attr, &new)) {
> +               mutex_unlock(&wq->mutex);
> +               return count;
> +       }
> +       mutex_unlock(&wq->mutex);
> +
> +       if (wq->flags & WQ_UNBOUND)
> +               ret = wq_set_unbound_sched_attr(wq, &new);
> +       else
> +               ret = wq_set_bound_sched_attr(wq, &new);
> +       return ret ?: count;
> +}
> +static DEVICE_ATTR_RW(sched_attr);
> +
>  static struct attribute *wq_sysfs_attrs[] = {
>         &dev_attr_per_cpu.attr,
>         &dev_attr_max_active.attr,
> +       &dev_attr_sched_attr.attr,
>         NULL,
>  };
>  ATTRIBUTE_GROUPS(wq_sysfs);
> @@ -5068,20 +5215,6 @@ static ssize_t wq_pool_ids_show(struct device *dev,
>         return written;
>  }
>
> -static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
> -                           char *buf)
> -{
> -       struct workqueue_struct *wq = dev_to_wq(dev);
> -       int written;
> -
> -       mutex_lock(&wq->mutex);
> -       written = scnprintf(buf, PAGE_SIZE, "%d\n",
> -                       wq->attrs->sched_attr.sched_nice);
> -       mutex_unlock(&wq->mutex);
> -
> -       return written;
> -}
> -
>  /* prepare workqueue_attrs for sysfs store operations */
>  static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
>  {
> @@ -5097,32 +5230,6 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
>         return attrs;
>  }
>
> -static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
> -                            const char *buf, size_t count)
> -{
> -       struct workqueue_struct *wq = dev_to_wq(dev);
> -       struct workqueue_attrs *attrs;
> -       int ret = -ENOMEM;
> -
> -       apply_wqattrs_lock();
> -
> -       attrs = wq_sysfs_prep_attrs(wq);
> -       if (!attrs)
> -               goto out_unlock;
> -
> -       if (sscanf(buf, "%d", &attrs->sched_attr.sched_nice) == 1 &&
> -           attrs->sched_attr.sched_nice >= MIN_NICE &&
> -           attrs->sched_attr.sched_nice <= MAX_NICE)
> -               ret = apply_workqueue_attrs_locked(wq, attrs);
> -       else
> -               ret = -EINVAL;
> -
> -out_unlock:
> -       apply_wqattrs_unlock();
> -       free_workqueue_attrs(attrs);
> -       return ret ?: count;
> -}
> -
>  static ssize_t wq_cpumask_show(struct device *dev,
>                                struct device_attribute *attr, char *buf)
>  {
> @@ -5201,7 +5308,6 @@ static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
>
>  static struct device_attribute wq_sysfs_unbound_attrs[] = {
>         __ATTR(pool_ids, 0444, wq_pool_ids_show, NULL),
> -       __ATTR(nice, 0644, wq_nice_show, wq_nice_store),
>         __ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
>         __ATTR(numa, 0644, wq_numa_show, wq_numa_store),
>         __ATTR_NULL,
> --
> 1.8.3.1
>


* Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
  2018-01-29  4:15   ` Lai Jiangshan
@ 2018-01-29  4:41     ` Mike Galbraith
  2018-01-29  6:33       ` Lai Jiangshan
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Galbraith @ 2018-01-29  4:41 UTC (permalink / raw)
  To: Lai Jiangshan, Wen Yang
  Cc: Tejun Heo, zhong.weidong, Jiang Biao, Tan Hu, kernel test robot, LKML

On Mon, 2018-01-29 at 12:15 +0800, Lai Jiangshan wrote:
> I think adding priority boost to workqueue(flush_work()) is the best
> way to fix the problem.

I disagree, priority boosting is needlessly invasive, takes control out
of user hands.  The kernel wanting to run a workqueue does not justify
perturbing the user's critical task.

I think "give userspace rope" is always the best option, how rope is
used is none of our business.  Giving the user a means to draw a simple
line in the sand, above which they run only critical stuff, below
which, they can do whatever they want, sane in our opinions or not,
lets users do whatever craziness they want/need to do, and puts the
responsibility for consequences squarely on the right set of shoulders.

	-Mike


* Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
  2018-01-29  4:41     ` Mike Galbraith
@ 2018-01-29  6:33       ` Lai Jiangshan
  2018-01-29  7:27         ` Mike Galbraith
  0 siblings, 1 reply; 11+ messages in thread
From: Lai Jiangshan @ 2018-01-29  6:33 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Wen Yang, Tejun Heo, zhong.weidong, Jiang Biao, Tan Hu,
	kernel test robot, LKML

On Mon, Jan 29, 2018 at 12:41 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Mon, 2018-01-29 at 12:15 +0800, Lai Jiangshan wrote:
>> I think adding priority boost to workqueue(flush_work()) is the best
>> way to fix the problem.
>
> I disagree, priority boosting is needlessly invasive, takes control out
> of user hands.  The kernel wanting to run a workqueue does not justify
> perturbing the user's critical task.

The kworkers don't belong to any user; it is really needlessly invasive
if we give any user the ability to control the priority of the kworkers.

If the user's critical task calls flush_work(), the critical task should
boost one responsible kworker (the kworker scheduled for
the work item, or the first idle kworker, or the manager kworker;
the kworker for the latter two cases changes over time, so the
boosting needs to migrate to a new kworker when needed).

The boosted work items also need to be moved to a prio list in the pool
so that the boosted kworker can pick them up.
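
Purely to illustrate the shape of such boosting (the helpers marked
hypothetical below do not exist; only sched_setattr() and flush_work()
are real in-kernel APIs, and a real implementation would also have to
follow the work item as it migrates between kworkers, as noted above):

    /* sketch: boost the worker handling @work to the flusher's params */
    static void flush_work_boosted(struct work_struct *work)
    {
            struct task_struct *kw;
            struct sched_attr saved, mine;

            kw = work_to_worker_task(work);      /* hypothetical lookup */
            get_task_sched_attr(kw, &saved);     /* hypothetical */
            get_task_sched_attr(current, &mine); /* hypothetical */

            sched_setattr(kw, &mine);  /* inherit flusher's policy/prio */
            flush_work(work);          /* wait for the work item */
            sched_setattr(kw, &saved); /* drop the boost again */
    }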

>
> I think "give userspace rope" is always the best option, how rope is
> used is none of our business.  Giving the user a means to draw a simple
> line in the sand, above which they run only critical stuff, below
> which, they can do whatever they want, sane in our opinions or not,
> lets users do whatever craziness they want/need to do, and puts the
> responsibility for consequences squarely on the right set of shoulders.
>
>         -Mike


* Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
  2018-01-29  6:33       ` Lai Jiangshan
@ 2018-01-29  7:27         ` Mike Galbraith
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Galbraith @ 2018-01-29  7:27 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Wen Yang, Tejun Heo, zhong.weidong, Jiang Biao, Tan Hu,
	kernel test robot, LKML

On Mon, 2018-01-29 at 14:33 +0800, Lai Jiangshan wrote:
> On Mon, Jan 29, 2018 at 12:41 PM, Mike Galbraith <efault@gmx.de> wrote:
> > On Mon, 2018-01-29 at 12:15 +0800, Lai Jiangshan wrote:
> >> I think adding priority boost to workqueue(flush_work()) is the best
> >> way to fix the problem.
> >
> > I disagree, priority boosting is needlessly invasive, takes control out
> > of user hands.  The kernel wanting to run a workqueue does not justify
> > perturbing the user's critical task.
> 
> The kworkers don't belong to any user; it is really needlessly invasive
> if we give any user the ability to control the priority of the kworkers.

In a scenario where the box is being saturated by RT, every last bit of
the box is likely in the (hopefully capable) hands of a solo box pilot.
With a prio-boosting scheme, which user gets to choose the boost
priority for the global resource?
 
> If the user's critical task calls flush_work(), the critical task should
> boost one responsible kworker (the kworker scheduled for
> the work item, or the first idle kworker, or the manager kworker;
> the kworker for the latter two cases changes over time, so the
> boosting needs to migrate to a new kworker when needed).
> 
> The boosted work items also need to be moved to a prio list in the pool
> so that the boosted kworker can pick them up.

Userspace knows which of its actions are wired up to what kernel
mechanism?  New workers are never spawned, stepping on any
prioritization userspace does?

I don't want to argue about it really, I'm just expressing my opinion
on the matter.  I have a mechanism in place to let users safely do
whatever they like, have for years, and it's not going anywhere.  That
mechanism was born from the needs of users, not mine.  First came a
user with a long stable product that suddenly ceased to function due to
workqueues learning to spawn new threads, then came a few cases where
users were absolutely convinced that they really really did need to be
able to safely saturate.  I could have said tough titty, adapt your
product to use a dedicated kthread, to the one, and no, you just think
you need to do that, to the others, but I'm not (quite) that arrogant,
and gave them the control they wanted instead.

	-Mike

