linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Phil Auld <pauld@redhat.com>,
	Ben Segall <bsegall@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>
Subject: [PATCH 4.18 28/34] sched/fair: Fix throttle_list starvation with low CFS quota
Date: Thu,  8 Nov 2018 13:52:57 -0800	[thread overview]
Message-ID: <20181108215140.629248683@linuxfoundation.org> (raw)
In-Reply-To: <20181108215138.892971755@linuxfoundation.org>

4.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Phil Auld <pauld@redhat.com>

commit baa9be4ffb55876923dc9716abc0a448e510ba30 upstream.

With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c7048 to put throttled entries at the head of the list, later entries
on the list will starve.  Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.

Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list.  The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.

This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:

  crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
    1     ffff90b56cb2d200  -976050
    2     ffff90b56cb2cc00  -484925
    3     ffff90b56cb2bc00  -658814
    4     ffff90b56cb2ba00  -275365
    5     ffff90b166a45600  -135138
    6     ffff90b56cb2da00  -282505
    7     ffff90b56cb2e000  -148065
    8     ffff90b56cb2fa00  -872591
    9     ffff90b56cb2c000  -84687
   10     ffff90b56cb2f000  -87237
   11     ffff90b166a40a00  -164582

  crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
    1     ffff90b56cb2d200  -994147
    2     ffff90b56cb2cc00  -306051
    3     ffff90b56cb2bc00  -961321
    4     ffff90b56cb2ba00  -24490
    5     ffff90b166a45600  -135138
    6     ffff90b56cb2da00  -282505
    7     ffff90b56cb2e000  -148065
    8     ffff90b56cb2fa00  -872591
    9     ffff90b56cb2c000  -84687
   10     ffff90b56cb2f000  -87237
   11     ffff90b166a40a00  -164582

Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:

  crash> task ffff8eb765994500 sched_info
  PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
    sched_info = {
      pcount = 8,
      run_delay = 697094208,
      last_arrival = 240260125039,
      last_queued = 240260327513
    },
  crash> task ffff8eb765994500 sched_info
  PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
    sched_info = {
      pcount = 8,
      run_delay = 697094208,
      last_arrival = 240260125039,
      last_queued = 240260327513
    },

Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/sched/fair.c  |   22 +++++++++++++++++++---
 kernel/sched/sched.h |    2 ++
 2 files changed, 21 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4797,9 +4797,13 @@ static void throttle_cfs_rq(struct cfs_r
 
 	/*
 	 * Add to the _head_ of the list, so that an already-started
-	 * distribute_cfs_runtime will not see us
+	 * distribute_cfs_runtime will not see us. If disribute_cfs_runtime is
+	 * not running add to the tail so that later runqueues don't get starved.
 	 */
-	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	if (cfs_b->distribute_running)
+		list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	else
+		list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 
 	/*
 	 * If we're the first throttled task, make sure the bandwidth
@@ -4943,14 +4947,16 @@ static int do_sched_cfs_period_timer(str
 	 * in us over-using our runtime if it is all used during this loop, but
 	 * only by limited amounts in that extreme case.
 	 */
-	while (throttled && cfs_b->runtime > 0) {
+	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
 		runtime = cfs_b->runtime;
+		cfs_b->distribute_running = 1;
 		raw_spin_unlock(&cfs_b->lock);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
 		runtime = distribute_cfs_runtime(cfs_b, runtime,
 						 runtime_expires);
 		raw_spin_lock(&cfs_b->lock);
 
+		cfs_b->distribute_running = 0;
 		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 
 		cfs_b->runtime -= min(runtime, cfs_b->runtime);
@@ -5061,6 +5067,11 @@ static void do_sched_cfs_slack_timer(str
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->distribute_running) {
+		raw_spin_unlock(&cfs_b->lock);
+		return;
+	}
+
 	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
 		raw_spin_unlock(&cfs_b->lock);
 		return;
@@ -5070,6 +5081,9 @@ static void do_sched_cfs_slack_timer(str
 		runtime = cfs_b->runtime;
 
 	expires = cfs_b->runtime_expires;
+	if (runtime)
+		cfs_b->distribute_running = 1;
+
 	raw_spin_unlock(&cfs_b->lock);
 
 	if (!runtime)
@@ -5080,6 +5094,7 @@ static void do_sched_cfs_slack_timer(str
 	raw_spin_lock(&cfs_b->lock);
 	if (expires == cfs_b->runtime_expires)
 		cfs_b->runtime -= min(runtime, cfs_b->runtime);
+	cfs_b->distribute_running = 0;
 	raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -5188,6 +5203,7 @@ void init_cfs_bandwidth(struct cfs_bandw
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
+	cfs_b->distribute_running = 0;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -346,6 +346,8 @@ struct cfs_bandwidth {
 	int			nr_periods;
 	int			nr_throttled;
 	u64			throttled_time;
+
+	bool                    distribute_running;
 #endif
 };
 



  parent reply	other threads:[~2018-11-08 22:10 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-08 21:52 [PATCH 4.18 00/34] 4.18.18-stable review Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 01/34] eeprom: at24: Add support for address-width property Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 02/34] vfs: swap names of {do,vfs}_clone_file_range() Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 03/34] bpf: fix partial copy of map_ptr when dst is scalar Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 04/34] gpio: mxs: Get rid of external API call Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 05/34] clk: sunxi-ng: sun4i: Set VCO and PLL bias current to lowest setting Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 06/34] fscache: Fix incomplete initialisation of inline key space Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 07/34] cachefiles: fix the race between cachefiles_bury_object() and rmdir(2) Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 08/34] fscache: Fix out of bound read in long cookie keys Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 09/34] ptp: fix Spectre v1 vulnerability Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 10/34] drm/edid: Add 6 bpc quirk for BOE panel in HP Pavilion 15-n233sl Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 11/34] drm/edid: VSDB yCBCr420 Deep Color mode bit definitions Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 12/34] drm: fb-helper: Reject all pixel format changing requests Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 13/34] RDMA/ucma: Fix Spectre v1 vulnerability Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 14/34] IB/ucm: " Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 15/34] cdc-acm: do not reset notification buffer index upon urb unlinking Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 16/34] cdc-acm: correct counting of UART states in serial state notification Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 17/34] cdc-acm: fix race between reset and control messaging Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 18/34] usb: usbip: Fix BUG: KASAN: slab-out-of-bounds in vhci_hub_control() Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 19/34] usb: gadget: storage: Fix Spectre v1 vulnerability Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 20/34] usb: roles: intel_xhci: Fix Unbalanced pm_runtime_enable Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 21/34] usb: xhci: pci: Enable Intel USB role mux on Apollo Lake platforms Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 22/34] USB: fix the usbfs flag sanitization for control transfers Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 23/34] block: dont deal with discard limit in blkdev_issue_discard() Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 24/34] tracing: Fix synthetic event to accept unsigned modifier Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 25/34] tracing: Fix synthetic event to allow semicolon at end Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 26/34] Input: elan_i2c - add ACPI ID for Lenovo IdeaPad 330-15IGM Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 27/34] drm/sun4i: Fix an ulong overflow in the dotclock driver Greg Kroah-Hartman
2018-11-08 21:52 ` Greg Kroah-Hartman [this message]
2018-11-08 21:52 ` [PATCH 4.18 29/34] x86/tsc: Force inlining of cyc2ns bits Greg Kroah-Hartman
2018-11-08 21:52 ` [PATCH 4.18 30/34] x86, hibernate: Fix nosave_regions setup for hibernation Greg Kroah-Hartman
2018-11-09  8:13   ` Pavel Machek
2018-11-08 21:53 ` [PATCH 4.18 31/34] x86/percpu: Fix this_cpu_read() Greg Kroah-Hartman
2018-11-08 21:53 ` [PATCH 4.18 32/34] x86/time: Correct the attribute on jiffies definition Greg Kroah-Hartman
2018-11-08 21:53 ` [PATCH 4.18 33/34] x86/swiotlb: Enable swiotlb for > 4GiG RAM on 32-bit kernels Greg Kroah-Hartman
2018-11-08 21:53 ` [PATCH 4.18 34/34] x86/fpu: Fix i486 + no387 boot crash by only saving FPU registers on context switch if there is an FPU Greg Kroah-Hartman
2018-11-09  3:50 ` [PATCH 4.18 00/34] 4.18.18-stable review kernelci.org bot
2018-11-09 14:39 ` Naresh Kamboju
2018-11-10 15:24   ` Greg Kroah-Hartman
2018-11-09 19:41 ` Guenter Roeck
2018-11-09 19:55 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181108215140.629248683@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=bsegall@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).