From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Phil Auld <pauld@redhat.com>,
Ben Segall <bsegall@google.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@kernel.org>
Subject: [PATCH 4.18 28/34] sched/fair: Fix throttle_list starvation with low CFS quota
Date: Thu, 8 Nov 2018 13:52:57 -0800 [thread overview]
Message-ID: <20181108215140.629248683@linuxfoundation.org> (raw)
In-Reply-To: <20181108215138.892971755@linuxfoundation.org>
4.18-stable review patch. If anyone has any objections, please let me know.
------------------
From: Phil Auld <pauld@redhat.com>
commit baa9be4ffb55876923dc9716abc0a448e510ba30 upstream.
With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c7048 to put throttled entries at the head of the list, later entries
on the list will starve. Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.
Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list. The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.
This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:
crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
1 ffff90b56cb2d200 -976050
2 ffff90b56cb2cc00 -484925
3 ffff90b56cb2bc00 -658814
4 ffff90b56cb2ba00 -275365
5 ffff90b166a45600 -135138
6 ffff90b56cb2da00 -282505
7 ffff90b56cb2e000 -148065
8 ffff90b56cb2fa00 -872591
9 ffff90b56cb2c000 -84687
10 ffff90b56cb2f000 -87237
11 ffff90b166a40a00 -164582
crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
1 ffff90b56cb2d200 -994147
2 ffff90b56cb2cc00 -306051
3 ffff90b56cb2bc00 -961321
4 ffff90b56cb2ba00 -24490
5 ffff90b166a45600 -135138
6 ffff90b56cb2da00 -282505
7 ffff90b56cb2e000 -148065
8 ffff90b56cb2fa00 -872591
9 ffff90b56cb2c000 -84687
10 ffff90b56cb2f000 -87237
11 ffff90b166a40a00 -164582
Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:
crash> task ffff8eb765994500 sched_info
PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
sched_info = {
pcount = 8,
run_delay = 697094208,
last_arrival = 240260125039,
last_queued = 240260327513
},
crash> task ffff8eb765994500 sched_info
PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
sched_info = {
pcount = 8,
run_delay = 697094208,
last_arrival = 240260125039,
last_queued = 240260327513
},
Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
kernel/sched/fair.c | 22 +++++++++++++++++++---
kernel/sched/sched.h | 2 ++
2 files changed, 21 insertions(+), 3 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4797,9 +4797,13 @@ static void throttle_cfs_rq(struct cfs_r
/*
* Add to the _head_ of the list, so that an already-started
- * distribute_cfs_runtime will not see us
+ * distribute_cfs_runtime will not see us. If disribute_cfs_runtime is
+ * not running add to the tail so that later runqueues don't get starved.
*/
- list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+ if (cfs_b->distribute_running)
+ list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+ else
+ list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
/*
* If we're the first throttled task, make sure the bandwidth
@@ -4943,14 +4947,16 @@ static int do_sched_cfs_period_timer(str
* in us over-using our runtime if it is all used during this loop, but
* only by limited amounts in that extreme case.
*/
- while (throttled && cfs_b->runtime > 0) {
+ while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
runtime = cfs_b->runtime;
+ cfs_b->distribute_running = 1;
raw_spin_unlock(&cfs_b->lock);
/* we can't nest cfs_b->lock while distributing bandwidth */
runtime = distribute_cfs_runtime(cfs_b, runtime,
runtime_expires);
raw_spin_lock(&cfs_b->lock);
+ cfs_b->distribute_running = 0;
throttled = !list_empty(&cfs_b->throttled_cfs_rq);
cfs_b->runtime -= min(runtime, cfs_b->runtime);
@@ -5061,6 +5067,11 @@ static void do_sched_cfs_slack_timer(str
/* confirm we're still not at a refresh boundary */
raw_spin_lock(&cfs_b->lock);
+ if (cfs_b->distribute_running) {
+ raw_spin_unlock(&cfs_b->lock);
+ return;
+ }
+
if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
raw_spin_unlock(&cfs_b->lock);
return;
@@ -5070,6 +5081,9 @@ static void do_sched_cfs_slack_timer(str
runtime = cfs_b->runtime;
expires = cfs_b->runtime_expires;
+ if (runtime)
+ cfs_b->distribute_running = 1;
+
raw_spin_unlock(&cfs_b->lock);
if (!runtime)
@@ -5080,6 +5094,7 @@ static void do_sched_cfs_slack_timer(str
raw_spin_lock(&cfs_b->lock);
if (expires == cfs_b->runtime_expires)
cfs_b->runtime -= min(runtime, cfs_b->runtime);
+ cfs_b->distribute_running = 0;
raw_spin_unlock(&cfs_b->lock);
}
@@ -5188,6 +5203,7 @@ void init_cfs_bandwidth(struct cfs_bandw
cfs_b->period_timer.function = sched_cfs_period_timer;
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;
+ cfs_b->distribute_running = 0;
}
static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -346,6 +346,8 @@ struct cfs_bandwidth {
int nr_periods;
int nr_throttled;
u64 throttled_time;
+
+ bool distribute_running;
#endif
};
next prev parent reply other threads:[~2018-11-08 22:10 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-08 21:52 [PATCH 4.18 00/34] 4.18.18-stable review Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 01/34] eeprom: at24: Add support for address-width property Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 02/34] vfs: swap names of {do,vfs}_clone_file_range() Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 03/34] bpf: fix partial copy of map_ptr when dst is scalar Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 04/34] gpio: mxs: Get rid of external API call Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 05/34] clk: sunxi-ng: sun4i: Set VCO and PLL bias current to lowest setting Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 06/34] fscache: Fix incomplete initialisation of inline key space Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 07/34] cachefiles: fix the race between cachefiles_bury_object() and rmdir(2) Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 08/34] fscache: Fix out of bound read in long cookie keys Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 09/34] ptp: fix Spectre v1 vulnerability Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 10/34] drm/edid: Add 6 bpc quirk for BOE panel in HP Pavilion 15-n233sl Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 11/34] drm/edid: VSDB yCBCr420 Deep Color mode bit definitions Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 12/34] drm: fb-helper: Reject all pixel format changing requests Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 13/34] RDMA/ucma: Fix Spectre v1 vulnerability Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 14/34] IB/ucm: " Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 15/34] cdc-acm: do not reset notification buffer index upon urb unlinking Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 16/34] cdc-acm: correct counting of UART states in serial state notification Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 17/34] cdc-acm: fix race between reset and control messaging Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 18/34] usb: usbip: Fix BUG: KASAN: slab-out-of-bounds in vhci_hub_control() Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 19/34] usb: gadget: storage: Fix Spectre v1 vulnerability Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 20/34] usb: roles: intel_xhci: Fix Unbalanced pm_runtime_enable Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 21/34] usb: xhci: pci: Enable Intel USB role mux on Apollo Lake platforms Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 22/34] USB: fix the usbfs flag sanitization for control transfers Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 23/34] block: dont deal with discard limit in blkdev_issue_discard() Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 24/34] tracing: Fix synthetic event to accept unsigned modifier Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 25/34] tracing: Fix synthetic event to allow semicolon at end Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 26/34] Input: elan_i2c - add ACPI ID for Lenovo IdeaPad 330-15IGM Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 27/34] drm/sun4i: Fix an ulong overflow in the dotclock driver Greg Kroah-Hartman
2018-11-08 21:52 ` Greg Kroah-Hartman [this message]
2018-11-08 21:52 ` [PATCH 4.18 29/34] x86/tsc: Force inlining of cyc2ns bits Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 30/34] x86, hibernate: Fix nosave_regions setup for hibernation Greg Kroah-Hartman
2018-11-09 8:13 ` Pavel Machek
2018-11-08 21:53 ` [PATCH 4.18 31/34] x86/percpu: Fix this_cpu_read() Greg Kroah-Hartman
2018-11-08 21:53 ` [PATCH 4.18 32/34] x86/time: Correct the attribute on jiffies definition Greg Kroah-Hartman
2018-11-08 21:53 ` [PATCH 4.18 33/34] x86/swiotlb: Enable swiotlb for > 4GiG RAM on 32-bit kernels Greg Kroah-Hartman
2018-11-08 21:53 ` [PATCH 4.18 34/34] x86/fpu: Fix i486 + no387 boot crash by only saving FPU registers on context switch if there is an FPU Greg Kroah-Hartman
2018-11-09 3:50 ` [PATCH 4.18 00/34] 4.18.18-stable review kernelci.org bot
2018-11-09 14:39 ` Naresh Kamboju
2018-11-10 15:24 ` Greg Kroah-Hartman
2018-11-09 19:41 ` Guenter Roeck
2018-11-09 19:55 ` Shuah Khan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181108215140.629248683@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=bsegall@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=pauld@redhat.com \
--cc=peterz@infradead.org \
--cc=stable@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).