From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH AUTOSEL 4.20 68/72] psi: fix aggregation idle shut-off
Date: Sat, 23 Feb 2019 16:04:18 -0500
Message-ID: <20190223210422.199966-68-sashal@kernel.org>
In-Reply-To: <20190223210422.199966-1-sashal@kernel.org>

From: Johannes Weiner <hannes@cmpxchg.org>

[ Upstream commit 1b69ac6b40ebd85eed73e4dbccde2a36961ab990 ]

psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating.  However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again.  This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!).

Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep.  To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
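
As a rough illustration of the check described above, here is a minimal
userspace sketch.  It is not the kernel code: the sketch_* names, the
flag values, and the struct layout are hypothetical stand-ins for
TSK_RUNNING, PF_WQ_WORKER, the worker's last_func tag, and
psi_update_work.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel flag bits; values are illustrative. */
#define SKETCH_TSK_RUNNING  (1u << 0)
#define SKETCH_PF_WQ_WORKER (1u << 1)

typedef void (*sketch_work_fn)(void);

struct sketch_task {
	unsigned int flags;       /* SKETCH_PF_WQ_WORKER if this is a kworker */
	sketch_work_fn last_func; /* last work function executed, NULL if none */
};

int sketch_agg_runs;
int sketch_other_runs;

/* Stand-in for the psi aggregation work function. */
void sketch_psi_update_work(void) { sketch_agg_runs++; }

/* Stand-in for any unrelated work item. */
void sketch_other_work(void) { sketch_other_runs++; }

/*
 * Decide whether a task state change should re-arm the aggregation clock.
 * The only exemption is a workqueue worker that stops running right after
 * it executed the aggregation work itself.
 */
bool sketch_should_wake_clock(const struct sketch_task *t, unsigned int clear)
{
	if ((clear & SKETCH_TSK_RUNNING) &&
	    (t->flags & SKETCH_PF_WQ_WORKER) &&
	    t->last_func == sketch_psi_update_work)
		return false;

	return true;
}
```

Every other state change, including an ordinary task going to sleep or
a worker that last ran a different work item, still wakes the clock in
this sketch.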

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself.  If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
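
The tag overwrite can be sketched in userspace as follows.  This is an
illustration only: struct sketch_worker and sketch_run_one() are
hypothetical and merely mimic how the worker records current_func into
last_func once a work item finishes.

```c
#include <stddef.h>

typedef void (*sketch_work_fn)(void);

/* Hypothetical, reduced version of a workqueue worker. */
struct sketch_worker {
	sketch_work_fn current_func; /* work function being executed */
	sketch_work_fn last_func;    /* tag: last work function executed */
};

int sketch_agg_calls;
int sketch_misc_calls;

/* Stand-in for the psi aggregation work. */
void sketch_aggregate(void) { sketch_agg_calls++; }

/* Stand-in for an unrelated work item. */
void sketch_misc(void) { sketch_misc_calls++; }

/* Run one work item and tag the worker with the function it just ran. */
void sketch_run_one(struct sketch_worker *w, sketch_work_fn fn)
{
	w->current_func = fn;
	fn();
	w->last_func = w->current_func; /* overwrites any earlier tag */
	w->current_func = NULL;
}
```

If sketch_misc runs after sketch_aggregate, the tag no longer matches
the aggregation function when the worker goes to sleep, so the clock
would be woken as usual.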

If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity.  But that
should not be a problem.  The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation.  If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
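
The slack arithmetic in the paragraph above can be written out as a
small sketch.  The constants restate the 4s bucket capacity and the 2s
wake-up delay from the text; the sketch_* names are hypothetical and do
not exist in the kernel.

```c
/* Figures taken from the text above, in milliseconds. */
#define SKETCH_BUCKET_CAPACITY_MS 4000u /* time a per-cpu bucket can hold */
#define SKETCH_WAKE_DELAY_MS      2000u /* delay before a woken clock runs */

/* Time that may be left behind in the buckets when aggregation shuts off. */
unsigned int sketch_stranded_budget_ms(void)
{
	return SKETCH_BUCKET_CAPACITY_MS - SKETCH_WAKE_DELAY_MS;
}

/*
 * Return 1 if the time between the aggregation run and the worker going
 * idle still fits in the budget, 0 otherwise.
 */
int sketch_stranded_time_fits(unsigned int idle_gap_ms)
{
	return idle_gap_ms <= sketch_stranded_budget_ms();
}
```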

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/sched/psi.c          | 21 +++++++++++++++++----
 kernel/workqueue.c          | 23 +++++++++++++++++++++++
 kernel/workqueue_internal.h |  6 +++++-
 3 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index fe24de3fbc938..c3484785b1795 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -124,6 +124,7 @@
  * sampling of the aggregate task states would be.
  */
 
+#include "../workqueue_internal.h"
 #include <linux/sched/loadavg.h>
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
@@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
 			groupc->tasks[t]++;
 
 	write_seqcount_end(&groupc->seq);
-
-	if (!delayed_work_pending(&group->clock_work))
-		schedule_delayed_work(&group->clock_work, PSI_FREQ);
 }
 
 static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
@@ -513,6 +511,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
+	bool wake_clock = true;
 	void *iter = NULL;
 
 	if (!task->pid)
@@ -530,8 +529,22 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	task->psi_flags &= ~clear;
 	task->psi_flags |= set;
 
-	while ((group = iterate_groups(task, &iter)))
+	/*
+	 * Periodic aggregation shuts off if there is a period of no
+	 * task changes, so we wake it back up if necessary. However,
+	 * don't do this if the task change is the aggregation worker
+	 * itself going to sleep, or we'll ping-pong forever.
+	 */
+	if (unlikely((clear & TSK_RUNNING) &&
+		     (task->flags & PF_WQ_WORKER) &&
+		     wq_worker_last_func(task) == psi_update_work))
+		wake_clock = false;
+
+	while ((group = iterate_groups(task, &iter))) {
 		psi_group_change(group, cpu, clear, set);
+		if (wake_clock && !delayed_work_pending(&group->clock_work))
+			schedule_delayed_work(&group->clock_work, PSI_FREQ);
+	}
 }
 
 void psi_memstall_tick(struct task_struct *task, int cpu)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0280deac392e2..288b2105bbb1a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task)
 	return to_wakeup ? to_wakeup->task : NULL;
 }
 
+/**
+ * wq_worker_last_func - retrieve worker's last work function
+ *
+ * Determine the last function a worker executed. This is called from
+ * the scheduler to get a worker's last known identity.
+ *
+ * CONTEXT:
+ * spin_lock_irq(rq->lock)
+ *
+ * Return:
+ * The last work function %current executed as a worker, NULL if it
+ * hasn't executed any work yet.
+ */
+work_func_t wq_worker_last_func(struct task_struct *task)
+{
+	struct worker *worker = kthread_data(task);
+
+	return worker->last_func;
+}
+
 /**
  * worker_set_flags - set worker flags and adjust nr_running accordingly
  * @worker: self
@@ -2184,6 +2204,9 @@ __acquires(&pool->lock)
 	if (unlikely(cpu_intensive))
 		worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
 
+	/* tag the worker for identification in schedule() */
+	worker->last_func = worker->current_func;
+
 	/* we're done with it, release */
 	hash_del(&worker->hentry);
 	worker->current_work = NULL;
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 66fbb5a9e633b..cb68b03ca89aa 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -53,6 +53,9 @@ struct worker {
 
 	/* used only by rescuers to point to the target workqueue */
 	struct workqueue_struct	*rescue_wq;	/* I: the workqueue to rescue */
+
+	/* used by the scheduler to determine a worker's last known identity */
+	work_func_t		last_func;
 };
 
 /**
@@ -67,9 +70,10 @@ static inline struct worker *current_wq_worker(void)
 
 /*
  * Scheduler hooks for concurrency managed workqueue.  Only to be used from
- * sched/core.c and workqueue.c.
+ * sched/ and workqueue.c.
  */
 void wq_worker_waking_up(struct task_struct *task, int cpu);
 struct task_struct *wq_worker_sleeping(struct task_struct *task);
+work_func_t wq_worker_last_func(struct task_struct *task);
 
 #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
-- 
2.19.1

