linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Borislav Petkov <bp@alien8.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mike Galbraith <efault@gmx.de>, Mel Gorman <mgorman@suse.de>,
	Nikolay Ulyanitsky <lystor@gmail.com>,
	linux-kernel@vger.kernel.org,
	Andreas Herrmann <andreas.herrmann3@amd.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>,
	Suresh Siddha <suresh.b.siddha@intel.com>
Subject: Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
Date: Wed, 26 Sep 2012 11:19:42 -0700	[thread overview]
Message-ID: <CA+55aFzC1GZG8+Gv9KmAcV=RGU+hw39hwC9AjfYVqiJ84qAYkw@mail.gmail.com> (raw)
In-Reply-To: <20120926163233.GA5339@x1.osrc.amd.com>

[-- Attachment #1: Type: text/plain, Size: 3242 bytes --]

On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case. Sad, because it really would be lovely to just remove that
>> thing ;)
>
> Yep, correct. It hurts.

I'm *so* not surprised.

That said, I think your "kill select_idle_sibling()" one was
interesting, but the wrong kind of "get rid of that logic".

It always selected target_cpu, but the fact is, that doesn't really
sound very sane. The target cpu is either the previous cpu or the
current cpu, depending on whether they should be balanced or not. But
that still doesn't make any *sense*.

In fact, the whole select_idle_sibling() logic makes no sense
what-so-ever to me. It seems to be total garbage.

For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at a biggest domain first. If you want to be close to
something, you want to look at the *smallest* domain first. But
because it looks at things in the wrong order, it then needs to have
that inner loop saying "does this group actually cover the cpu I am
interested in?"

Please tell me I am mis-reading this?

But starting from the biggest ("llc" group) is wrong *anyway*, since
it means that it starts looking at the L3 level, and then if it finds
an acceptable cpu inside that level, it's all done. But that's
*crazy*. Once again, it's much better to try to find an idle sibling
*closeby* rather than at the L3 level. No? So once again, we should
start at the inner level and if we can't find something really close,
we work our way out, rather than starting from the outer level and
working our way in.

If I read the code correctly, we can have both "prev" and "cpu" in the
same L2 domain, but because we start looking at the L3 domain, we may
end up picking another "affine" CPU that isn't even sharing L2's
*before* we pick one that actually *is* sharing L2's with the target
CPU. But that code is confusing enough with the scheduler groups inner
loop that maybe I am mis-reading it entirely.

There are other oddities in select_idle_sibling() too, if I read
things correctly.

For example, it uses "cpu_idle(target)", but if we're actively trying
to move to the current CPU (ie wake_affine() returned true), then
target is the current cpu, which is certainly *not* going to be idle
for a sync wakeup. So it should actually check whether it's a sync
wakeup and the only thing pending is that synchronous waker, no?

Maybe I'm missing something really fundamental, but it all really does
look very odd to me.

Attached is a totally untested and probably very buggy patch, so
please consider it a "shouldn't we do something like this instead" RFC
rather than anything serious. So this RFC patch is more a "ok, the
patch tries to fix the above oddnesses, please tell me where I went
wrong" than anything else.

Comments?

                    Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 3060 bytes --]

 kernel/sched/fair.c | 80 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 43 insertions(+), 37 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96e2b18b6283..25817cff72c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2632,53 +2632,53 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, struct sched_domain *affine)
 {
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
-	struct sched_group *sg;
+	struct sched_domain *sd, *llc_sd;
 	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
-
-	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
-	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
+	if (idle_cpu(target))
+		return target;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
-		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
+	llc_sd = rcu_dereference(per_cpu(sd_llc, target));
+	for_each_domain(target, sd) {
+		for_each_cpu(i, sched_domain_span(sd)) {
+			if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
+				continue;
+			if (!idle_cpu(i))
+				continue;
+			return  i;
+		}
+		/* Don't iterate past the last level cache domain */
+		if (sd == llc_sd)
+			break;
+		/* Don't iterate past the affinity level */
+		if (sd == affine)
+			break;
+	}
+	return -1;
+}
 
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
+/*
+ * For synchronous wake-ups: is the currently running
+ * process the only pending process of this CPU runqueue?
+ */
+static inline int single_running(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
-	}
-done:
-	return target;
+#ifdef CONFIG_SMP
+	if (!llist_empty(&rq->wake_list))
+		return 0;
+#endif
+	return cpu_rq(cpu)->nr_running <= 1;
 }
 
 /*
@@ -2759,11 +2759,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	if (affine_sd) {
-		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
+		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync)) {
 			prev_cpu = cpu;
+			if (sync && single_running(cpu)) {
+				new_cpu = cpu;
+				goto unlock;
+			}
+		}
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		new_cpu = select_idle_sibling(p, prev_cpu, affine_sd);
+		if (new_cpu >= 0)
+			goto unlock;
 	}
 
 	while (sd) {

  reply	other threads:[~2012-09-26 18:20 UTC|newest]

Thread overview: 115+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-14  7:47 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets Nikolay Ulyanitsky
2012-09-14 18:40 ` Borislav Petkov
2012-09-14 18:51   ` Borislav Petkov
2012-09-14 21:27 ` 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected Borislav Petkov
2012-09-14 21:40   ` Peter Zijlstra
2012-09-14 21:44     ` Linus Torvalds
2012-09-14 21:56       ` Peter Zijlstra
2012-09-14 21:59         ` Peter Zijlstra
2012-09-15  3:57           ` Mike Galbraith
2012-09-14 22:01         ` Linus Torvalds
2012-09-14 22:10           ` Peter Zijlstra
2012-09-14 22:20             ` Linus Torvalds
2012-09-14 22:14           ` Borislav Petkov
2012-09-14 21:45     ` Borislav Petkov
2012-09-14 21:42   ` Linus Torvalds
2012-09-15  3:33     ` Mike Galbraith
2012-09-15 16:16       ` Andi Kleen
2012-09-15 16:36         ` Mike Galbraith
2012-09-15 17:08           ` richard -rw- weinberger
2012-09-16  4:48             ` Mike Galbraith
2012-09-15 21:32           ` Alan Cox
2012-09-16  4:35             ` Mike Galbraith
2012-09-16 19:57               ` Linus Torvalds
2012-09-17  8:08                 ` Mike Galbraith
2012-09-17 10:07                   ` Ingo Molnar
2012-09-17 10:47                     ` Mike Galbraith
2012-09-17 14:39                     ` Andi Kleen
2012-09-19 12:35               ` Mike Galbraith
2012-09-19 14:54                 ` Ingo Molnar
2012-09-19 15:23                   ` Mike Galbraith
2012-09-24 15:00     ` Mel Gorman
2012-09-24 15:23       ` Nikolay Ulyanitsky
2012-09-24 15:53         ` Borislav Petkov
2012-09-24 15:30       ` Peter Zijlstra
2012-09-24 15:51         ` Mike Galbraith
2012-09-24 15:52         ` Linus Torvalds
2012-09-24 16:07           ` Peter Zijlstra
2012-09-24 16:33             ` Linus Torvalds
2012-09-24 16:54               ` Peter Zijlstra
2012-09-25 12:10                 ` Hillf Danton
2012-09-24 16:12           ` Peter Zijlstra
2012-09-24 16:30             ` Linus Torvalds
2012-09-24 16:52               ` Borislav Petkov
2012-09-24 16:54               ` Peter Zijlstra
2012-09-24 17:44                 ` Peter Zijlstra
2012-09-25 13:23                   ` Mel Gorman
2012-09-25 14:36                     ` Peter Zijlstra
2012-09-24 18:26                 ` Mike Galbraith
2012-09-24 19:12                   ` Linus Torvalds
2012-09-24 19:20                     ` Borislav Petkov
2012-09-25  1:57                       ` Mike Galbraith
2012-09-25  2:11                         ` Linus Torvalds
2012-09-25  2:49                           ` Mike Galbraith
2012-09-25  3:10                             ` Linus Torvalds
2012-09-25  3:20                               ` Mike Galbraith
2012-09-25  3:32                                 ` Linus Torvalds
2012-09-25  3:43                                   ` Mike Galbraith
2012-09-25 11:58                           ` Peter Zijlstra
2012-09-25 13:17                             ` Borislav Petkov
2012-09-25 17:00                               ` Borislav Petkov
2012-09-25 17:21                                 ` Linus Torvalds
2012-09-25 18:42                                   ` Borislav Petkov
2012-09-25 19:08                                     ` Linus Torvalds
2012-09-26  2:23                                     ` Mike Galbraith
2012-09-26 17:17                                       ` Borislav Petkov
2012-09-26  2:00                                   ` Mike Galbraith
2012-09-26  2:22                                     ` Linus Torvalds
2012-09-26  2:42                                       ` Mike Galbraith
2012-09-26 17:15                                       ` Borislav Petkov
2012-09-26 16:32                                   ` Borislav Petkov
2012-09-26 18:19                                     ` Linus Torvalds [this message]
2012-09-26 21:37                                       ` Borislav Petkov
2012-09-27  5:09                                         ` Mike Galbraith
2012-09-27  5:18                                           ` Borislav Petkov
2012-09-27  5:44                                             ` Mike Galbraith
2012-09-27  5:47                                           ` Ingo Molnar
2012-09-27  5:59                                             ` Ingo Molnar
2012-09-27  6:34                                             ` Mike Galbraith
2012-09-27  6:41                                               ` Ingo Molnar
2012-09-27  6:54                                                 ` Mike Galbraith
2012-09-27  7:10                                                   ` Ingo Molnar
2012-09-27 16:25                                                     ` Borislav Petkov
2012-09-27 17:44                                                     ` Linus Torvalds
2012-09-27 18:05                                                       ` Borislav Petkov
2012-09-27 18:19                                                         ` Linus Torvalds
2012-09-27 18:29                                                           ` Peter Zijlstra
2012-09-27 19:24                                                             ` Borislav Petkov
2012-09-28  3:50                                                               ` Mike Galbraith
2012-09-28 12:30                                                                 ` Borislav Petkov
2012-09-27 19:40                                                             ` Linus Torvalds
2012-09-28  4:13                                                               ` Mike Galbraith
2012-09-28  8:37                                                               ` Peter Zijlstra
2012-09-27  7:17                                         ` david
2012-09-27  7:55                                           ` Mike Galbraith
2012-09-27 10:20                                           ` Borislav Petkov
2012-09-27 13:38                                             ` Mike Galbraith
2012-09-27 16:55                                             ` david
2012-09-27  4:32                                       ` Mike Galbraith
2012-09-27  8:21                                       ` Peter Zijlstra
2012-09-27 16:48                                         ` david
2012-09-27 17:38                                           ` Peter Zijlstra
2012-09-27 17:45                                             ` david
2012-09-27 18:09                                               ` Peter Zijlstra
2012-09-27 18:15                                               ` Linus Torvalds
2012-09-27 18:24                                               ` Borislav Petkov
2012-09-25  1:39                     ` Mike Galbraith
2012-09-25 21:11                     ` Suresh Siddha
2012-09-25  4:16       ` Mike Galbraith
2012-09-15  4:11   ` Mike Galbraith
     [not found]     ` <CA+55aFz1A7HbMYS9o-GTS5Zm=Xx8MUD7cR05GMVo--2E34jcgQ@mail.gmail.com>
2012-09-15  4:42       ` Mike Galbraith
2012-09-15 10:44     ` Borislav Petkov
2012-09-15 14:47       ` Mike Galbraith
2012-09-15 15:18         ` Borislav Petkov
2012-09-15 16:13           ` Mike Galbraith
2012-09-15 19:44             ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CA+55aFzC1GZG8+Gv9KmAcV=RGU+hw39hwC9AjfYVqiJ84qAYkw@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=andreas.herrmann3@amd.com \
    --cc=bp@alien8.de \
    --cc=efault@gmx.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lystor@gmail.com \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=suresh.b.siddha@intel.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).