LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Zhipeng Xie <xiezhipeng1@huawei.com>,
	Sargun Dhillon <sargun@sargun.me>,
	Xie XiuQi <xiexiuqi@huawei.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Bin Li <huawei.libin@huawei.com>, Mike Galbraith <efault@gmx.de>,
	Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>
Subject: [PATCH 4.20 51/65] sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
Date: Fri, 11 Jan 2019 15:15:37 +0100
Message-ID: <20190111131103.218018795@linuxfoundation.org> (raw)
In-Reply-To: <20190111131055.331350141@linuxfoundation.org>

4.20-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Linus Torvalds <torvalds@linux-foundation.org>

commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream.

Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
scheduler under high loads, starting at around the v4.18 time frame,
and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
manipulation.

Do a (manual) revert of:

  a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

It turns out that the list_del_leaf_cfs_rq() introduced by this commit
is a surprising property that was not considered in followup commits
such as:

  9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

As Vincent Guittot explains:

 "I think that there is a bigger problem with commit a9e7f6544b9c and
  cfs_rq throttling:

  Let take the example of the following topology TG2 --> TG1 --> root:

   1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
      cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
      one path because it has never been used and can't be throttled so
      tmp_alone_branch will point to leaf_cfs_rq_list at the end.

   2) Then TG1 is throttled

   3) and we add TG3 as a new child of TG1.

   4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
      cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.

  With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list.
  So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
  cfs_rq is removed from the list.
  Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
  but tmp_alone_branch still points to TG3 cfs_rq because its throttled
  parent can't be enqueued when the lock is released.
  tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

  So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
  points on another TG cfs_rq, the next TG cfs_rq that will be added,
  will be linked outside rq->leaf_cfs_rq_list - which is bad.

  In addition, we can break the ordering of the cfs_rq in
  rq->leaf_cfs_rq_list but this ordering is used to update and
  propagate the update from leaf down to root."

Instead of trying to work through all these cases and trying to reproduce
the very high loads that produced the lockup to begin with, simplify
the code temporarily by reverting a9e7f6544b9c - which change was clearly
not thought through completely.

This (hopefully) gives us a kernel that doesn't lock up so people
can continue to enjoy their holidays without worrying about regressions. ;-)

[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Reported-by: Sargun Dhillon <sargun@sargun.me>
Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Tested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org> # v4.13+
Cc: Bin Li <huawei.libin@huawei.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/sched/fair.c |   43 +++++++++----------------------------------
 1 file changed, 9 insertions(+), 34 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -352,10 +352,9 @@ static inline void list_del_leaf_cfs_rq(
 	}
 }
 
-/* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
-	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
-				 leaf_cfs_rq_list)
+/* Iterate through all leaf cfs_rq's on a runqueue: */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -447,8 +446,8 @@ static inline void list_del_leaf_cfs_rq(
 {
 }
 
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
-		for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
+#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
@@ -7387,27 +7386,10 @@ static inline bool others_have_blocked(s
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-{
-	if (cfs_rq->load.weight)
-		return false;
-
-	if (cfs_rq->avg.load_sum)
-		return false;
-
-	if (cfs_rq->avg.util_sum)
-		return false;
-
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
-	return true;
-}
-
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
@@ -7419,7 +7401,7 @@ static void update_blocked_averages(int
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7434,13 +7416,6 @@ static void update_blocked_averages(int
 		if (se && !skip_blocked_update(se))
 			update_load_avg(cfs_rq_of(se), se, 0);
 
-		/*
-		 * There can be a lot of idle CPU cgroups.  Don't let fully
-		 * decayed cfs_rqs linger on the list.
-		 */
-		if (cfs_rq_is_decayed(cfs_rq))
-			list_del_leaf_cfs_rq(cfs_rq);
-
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
@@ -10289,10 +10264,10 @@ const struct sched_class fair_sched_clas
 #ifdef CONFIG_SCHED_DEBUG
 void print_cfs_stats(struct seq_file *m, int cpu)
 {
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
+	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
 		print_cfs_rq(m, cpu, cfs_rq);
 	rcu_read_unlock();
 }



  parent reply index

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-11 14:14 [PATCH 4.20 00/65] 4.20.2-stable review Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 01/65] scsi: zfcp: fix posting too many status read buffers leading to adapter shutdown Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 02/65] scsi: lpfc: do not set queue->page_count to 0 if pc_sli4_params.wqpcnt is invalid Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 03/65] fork: record start_time late Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 04/65] zram: fix double free backing device Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 05/65] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 06/65] mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 07/65] mm, devm_memremap_pages: kill mapping "System RAM" support Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 08/65] mm, devm_memremap_pages: fix shutdown handling Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 09/65] memcg, oom: notify on oom killer invocation from the charge path Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 10/65] sunrpc: fix cache_head leak due to queued request Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 11/65] sunrpc: use SVC_NET() in svcauth_gss_* functions Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 12/65] mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.20 13/65] mm, hmm: use devm semantics for hmm_devmem_{add, remove} Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 14/65] mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 15/65] mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 16/65] mm, swap: fix swapoff with KSM pages Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 17/65] media: cx23885: only reset DMA on problematic CPUs Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 18/65] ALSA: cs46xx: Potential NULL dereference in probe Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 19/65] ALSA: usb-audio: Avoid access before bLength check in build_audio_procunit() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 20/65] ALSA: usb-audio: Check mixer unit descriptors more strictly Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 21/65] ALSA: usb-audio: Fix an out-of-bound read in create_composite_quirks Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 22/65] ALSA: usb-audio: Always check descriptor sizes in parser code Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 23/65] srcu: Lock srcu_data structure in srcu_gp_start() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 24/65] driver core: Add missing dev->bus->need_parent_lock checks Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 25/65] Fix failure path in alloc_pid() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 26/65] block: deactivate blk_stat timer in wbt_disable_default() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 27/65] block: mq-deadline: Fix write completion handling Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 28/65] dm: do not allow readahead to limit IO size Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 29/65] dlm: fixed memory leaks after failed ls_remove_names allocation Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 30/65] dlm: possible memory leak on error path in create_lkb() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 31/65] dlm: lost put_lkb on error path in receive_convert() and receive_unlock() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 32/65] dlm: memory leaks on error path in dlm_user_request() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 33/65] gfs2: Get rid of potential double-freeing in gfs2_create_inode Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 34/65] gfs2: Fix loop in gfs2_rbm_find Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 35/65] b43: Fix error in cordic routine Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 36/65] selinux: policydb - fix byte order and alignment issues Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 37/65] PCI / PM: Allow runtime PM without callback functions Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 38/65] lockd: Show pid of lockd for remote locks Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 39/65] xprtrdma: Yet another double DMA-unmap Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 40/65] nfsd4: zero-length WRITE should succeed Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 41/65] Revert "powerpc/tm: Unset MSR[TS] if not recheckpointing" Greg Kroah-Hartman
2019-01-12 21:35   ` Christoph Biedl
2019-01-13  7:11     ` Greg Kroah-Hartman
2019-01-14  0:00       ` Michael Ellerman
2019-01-14  8:10         ` Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 42/65] powerpc/tm: Set MSR[TS] just prior to recheckpoint Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 43/65] iio: adc: qcom-spmi-adc5: Initialize prescale properly Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 44/65] iio: dac: ad5686: fix bit shift read register Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 45/65] 9p/net: put a lower bound on msize Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 46/65] rxe: fix error completion wr_id and qp_num Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 47/65] stm class: Fix a module refcount leak in policy creation error path Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 48/65] RDMA/srpt: Fix a use-after-free in the channel release code Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 49/65] RDMA/iwcm: Dont copy past the end of dev_name() string Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 50/65] iommu/vt-d: Handle domain agaw being less than iommu agaw Greg Kroah-Hartman
2019-01-11 14:15 ` Greg Kroah-Hartman [this message]
2019-01-11 14:15 ` [PATCH 4.20 52/65] ceph: dont update importing caps mseq when handing cap export Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 53/65] video: fbdev: pxafb: Fix "WARNING: invalid free of devm_ allocated data" Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 54/65] drivers/perf: hisi: Fixup one DDRC PMU register offset Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 55/65] powerpc/4xx/ocm: Fix compilation error due to PAGE_KERNEL usage Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 56/65] selftests: Fix test errors related to lib.mk khdr target Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 57/65] genwqe: Fix size check Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 58/65] intel_th: msu: Fix an off-by-one in attribute store Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 59/65] power: supply: olpc_battery: correct the temperature units Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 60/65] of: of_node_get()/of_node_put() nodes held in phandle cache Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 61/65] of: __of_detach_node() - remove node from " Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 62/65] lib: fix build failure in CONFIG_DEBUG_VIRTUAL test Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 63/65] drm/nouveau/drm/nouveau: Check rc from drm_dp_mst_topology_mgr_resume() Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 64/65] drm/vc4: Set ->is_yuv to false when num_planes == 1 Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.20 65/65] drm/rockchip: psr: do not dereference encoder before it is null checked Greg Kroah-Hartman
2019-01-11 21:35 ` [PATCH 4.20 00/65] 4.20.2-stable review shuah
2019-01-12  8:03   ` Greg Kroah-Hartman
2019-01-12  8:28 ` Naresh Kamboju
2019-01-12 17:35   ` Greg Kroah-Hartman
2019-01-12 17:45 ` Guenter Roeck

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190111131103.218018795@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=efault@gmx.de \
    --cc=huawei.libin@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=sargun@sargun.me \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=vincent.guittot@linaro.org \
    --cc=xiexiuqi@huawei.com \
    --cc=xiezhipeng1@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox