From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: linux-kernel@vger.kernel.org
Cc: Ben Segall <bsegall@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>
Subject: [PATCH RFC] sched/fair: unthrottle cfs_rq in FIFO order and only once per period
Date: Thu, 12 Jan 2017 21:16:40 +0300	[thread overview]
Message-ID: <148424500006.25010.17562609703137954663.stgit@buzz> (raw)

I'm seeing live-locks on a 32-core machine with 32 busy-loops running inside
a cpu cgroup whose limit is set to 0.33 of a core (cpu.cfs_quota_us = 33000,
cpu.cfs_period_us = 100000): some threads almost never run.
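
For illustration only (my own sketch, not part of the patch), a reproducer
can be as simple as the program below: it just spawns 32 busy-loop threads
and assumes the cpu cgroup it runs in has already been configured with the
limits above.

/*
 * Hypothetical reproducer sketch (illustration only, not part of the patch).
 * Build with "gcc -O2 -pthread repro.c", run it inside a cpu cgroup with
 * cpu.cfs_quota_us = 33000 and cpu.cfs_period_us = 100000, then compare
 * per-thread CPU time, e.g. with "ps -eLo tid,time,comm".
 */
#include <pthread.h>
#include <unistd.h>

#define NTHREADS 32

static void *busy_loop(void *arg)
{
	(void)arg;
	for (;;)
		;			/* burn CPU forever */
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid, NULL, busy_loop, NULL);

	pause();			/* keep the process alive */
	return 0;
}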

This is very nasty behavior: sometimes throttled tasks also hold kernel
locks, for example mmap_sem, and block proc files for a system-wide "ps ax".

This happens because throttle_cfs_rq() inserts the cfs_rq at the head of the
list and unthrottling in distribute_cfs_runtime() starts from the head too.
The cfs_rqs near the tail might never get runtime if the limit is set low
enough.
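
To illustrate the starvation (a userspace toy model of my own, not kernel
code): with head insertion the same few entries at the front of the list are
served every period, while with tail insertion the budget rotates over all of
them.

/*
 * Toy userspace model (illustration only): NR throttled "cfs_rq"s and a
 * per-period budget that only covers a few of them.  Head insertion models
 * list_add_rcu(), tail insertion models list_add_tail_rcu().
 */
#include <stdio.h>
#include <string.h>

#define NR		32
#define INSERT_AT_TAIL	0	/* 0: current behaviour, 1: patched */

int main(void)
{
	int list[NR], served[NR] = {0};
	int len = NR, i, period;

	for (i = 0; i < NR; i++)
		list[i] = i;	/* list[0] is the head */

	for (period = 0; period < 1000; period++) {
		int budget = 3;	/* quota only covers ~3 cfs_rqs per period */
		int unthrottled[NR], n = 0;

		/* distribute_cfs_runtime(): walk the list from the head */
		while (budget-- && len > 0) {
			int rq = list[0];

			served[rq]++;
			unthrottled[n++] = rq;
			memmove(&list[0], &list[1], --len * sizeof(int));
		}

		/* served cfs_rqs run, exhaust their quota and throttle again */
		for (i = 0; i < n; i++) {
			if (INSERT_AT_TAIL) {
				list[len++] = unthrottled[i];
			} else {
				memmove(&list[1], &list[0], len * sizeof(int));
				list[0] = unthrottled[i];
				len++;
			}
		}
	}

	for (i = 0; i < NR; i++)
		printf("cfs_rq %2d served %4d times\n", i, served[i]);
	return 0;
}

With INSERT_AT_TAIL = 0 only three cfs_rqs are ever served; with 1 the
service is spread roughly evenly.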

Originally unthrottling worked in FIFO order, but commit c06f04c70489
("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
switched it to LIFO because distribute_cfs_runtime() could unthrottle
the same cfs_rq endlessly.

That happened because of the loop around distribute_cfs_runtime() in
do_sched_cfs_period_timer() and because cfs_b->runtime was set to zero there
to prevent acquiring runtime from the global pool while the loop was running.
The zeroing was removed by the commit mentioned above; this patch removes the
loop.

This patch reverts unthrottling back to FIFO order and checks the expiration
time during distribution: if a cfs_rq has already been unthrottled during
this period, its expiration equals the one currently being distributed. In
that case we skip it for now and try to unthrottle it again in the next
period.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Cc: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6559d197e08a..b30412e3d7c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4162,10 +4162,9 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	empty = list_empty(&cfs_b->throttled_cfs_rq);
 
 	/*
-	 * Add to the _head_ of the list, so that an already-started
-	 * distribute_cfs_runtime will not see us
+	 * Add to the _tail_ of the list, unthrottle happens in FIFO order
 	 */
-	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 
 	/*
 	 * If we're the first throttled task, make sure the bandwidth
@@ -4240,6 +4239,15 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/*
+		 * Give the throttled cfs_rq 1 ns of runtime. It will take more
+		 * from the global pool when needed. But skip the cfs_rq if its
+		 * expiration time equals ours: this means we already unthrottled
+		 * it in this loop and the global pool is already depleted.
+		 */
+		if (cfs_rq->runtime_expires == expires)
+			goto next;
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
@@ -4302,24 +4310,20 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	runtime_expires = cfs_b->runtime_expires;
 
 	/*
-	 * This check is repeated as we are holding onto the new bandwidth while
-	 * we unthrottle. This can potentially race with an unthrottled group
-	 * trying to acquire new bandwidth from the global pool. This can result
-	 * in us over-using our runtime if it is all used during this loop, but
-	 * only by limited amounts in that extreme case.
+	 * This can potentially race with an unthrottled group trying to
+	 * acquire new bandwidth from the global pool. This can result
+	 * in us over-using our runtime if it is all used during this loop,
+	 * but only by limited amounts in that extreme case.
 	 */
-	while (throttled && cfs_b->runtime > 0) {
-		runtime = cfs_b->runtime;
-		raw_spin_unlock(&cfs_b->lock);
-		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
-		raw_spin_lock(&cfs_b->lock);
+	runtime = cfs_b->runtime;
 
-		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	/* we can't nest cfs_b->lock while distributing bandwidth */
+	raw_spin_unlock(&cfs_b->lock);
+	runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
+	raw_spin_lock(&cfs_b->lock);
 
+	if (runtime_expires == cfs_b->runtime_expires)
 		cfs_b->runtime -= min(runtime, cfs_b->runtime);
-	}
 
 	/*
 	 * While we are ensured activity in the period following an
