From: Charles Wang <muming.wq@gmail.com>
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Charles Wang <muming.wq@taobao.com>
Subject: [PATCH] sched: Folding nohz load accounting more accurate
Date: Sat,  9 Jun 2012 18:54:55 +0800
Message-ID: <1339239295-18591-1-git-send-email-muming.wq@taobao.com>

After commit 453494c3d4 (sched: Fix nohz load accounting -- again!), idle can
be folded into calc_load_tasks_idle between the last per-cpu load calculation
and the calc_global_load() call. However, a problem still exists between the
first per-cpu load calculation and the last one: every time a cpu calculates
its load, calc_load_tasks_idle is folded into calc_load_tasks, even when that
idle load was contributed by cpus whose load has already been calculated in
the same period. This problem is also described in the following link:

https://lkml.org/lkml/2012/5/24/419

This bug shows up in our workload: the average number of running processes is
about 15, but the reported load is only about 4.
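
To illustrate the cancellation, here is a hypothetical userspace model (not
kernel code, and the numbers are illustrative): each of 16 cpus runs one task,
samples its own load in turn, then goes idle just after being sampled and
wakes again before the next period. Folding the global idle delta on every
per-cpu update cancels load that was already counted:

/* toy model of the pre-patch folding, not actual kernel code */
#include <stdio.h>

#define NCPUS	16

int main(void)
{
	long calc_load_tasks = 0;
	long calc_load_tasks_idle = 0;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		/* periodic update on this cpu: one task is running */
		calc_load_tasks += 1;

		/* pre-patch: every per-cpu update folds the global idle
		 * delta, even when it came from already-sampled cpus */
		calc_load_tasks += calc_load_tasks_idle;
		calc_load_tasks_idle = 0;

		/* the cpu goes idle right after being sampled */
		calc_load_tasks_idle -= 1;
	}

	printf("average runnable tasks over the period: ~%d\n", NCPUS);
	printf("calc_load_tasks seen by calc_global_load(): %ld\n",
	       calc_load_tasks + calc_load_tasks_idle);
	return 0;
}

This prints ~16 versus 0: nearly all of the counted load is cancelled by idle
deltas from cpus whose load had already been taken into account.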

This patch solves the problem by taking the already-calculated cpus' idle load
out of the effective idle. First it adds a cpumask to record the cpus that
have already calculated their load in the current LOAD_FREQ period, then it
adds calc_unmask_cpu_load_idle to record the go-idle load of the cpus that
are not yet marked. calc_unmask_cpu_load_idle takes the place of
calc_load_tasks_idle and is folded into calc_load_tasks every LOAD_FREQ period
as each cpu calculates its load. Go-idle load on cpus whose load has already
been calculated is added only to calc_load_tasks_idle, not to
calc_unmask_cpu_load_idle.
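
To make the intended sequence concrete, one LOAD_FREQ period with this patch
applied looks roughly like this (a simplified illustration, not a trace):

  1) prepare_idle_mask() runs right after calc_global_load(), clearing
     cpu_load_update_mask and calc_unmask_cpu_load_idle;
  2) the first cpu to run calc_load_account_active() sees an empty mask,
     copies the pending calc_load_tasks_idle into calc_unmask_cpu_load_idle,
     marks itself in the mask, and folds its active count plus
     calc_unmask_cpu_load_idle into calc_load_tasks;
  3) if that cpu now goes idle, the idle delta is added only to
     calc_load_tasks_idle, because the cpu is already masked;
  4) the remaining cpus repeat step 2, each folding only the idle load that
     came from cpus not yet masked.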
 
Reported-by: Sha Zhengju <handai.szj@gmail.com>
Signed-off-by: Charles Wang <muming.wq@taobao.com>

---
 include/linux/sched.h     |    1 +
 kernel/sched/core.c       |   83 ++++++++++++++++++++++++++++++++++++++++++++-
 kernel/time/timekeeping.c |    1 +
 3 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6029d8c..a2b8df2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -145,6 +145,7 @@ extern unsigned long this_cpu_load(void);
 
 
 extern void calc_global_load(unsigned long ticks);
+extern void prepare_idle_mask(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
 extern unsigned long get_parent_ip(unsigned long addr);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c46958e..bdfe3c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2164,6 +2164,7 @@ unsigned long this_cpu_load(void)
 /* Variables and functions for calc_load */
 static atomic_long_t calc_load_tasks;
 static unsigned long calc_load_update;
+static unsigned long idle_mask_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
 
@@ -2199,13 +2200,38 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
  */
 static atomic_long_t calc_load_tasks_idle;
 
+/*
+ * Cpus whose load has already been calculated in this LOAD_FREQ
+ * period are masked here.
+ */
+struct cpumask cpu_load_update_mask;
+
+/*
+ * Idle load contributed by cpus whose load is not yet calculated.
+ */
+static atomic_long_t calc_unmask_cpu_load_idle;
+
+
 void calc_load_account_idle(struct rq *this_rq)
 {
 	long delta;
+	int cpu = smp_processor_id();
+
 
 	delta = calc_load_fold_active(this_rq);
 	if (delta)
+	{
 		atomic_long_add(delta, &calc_load_tasks_idle);
+		/*
+		 * calc_unmask_cpu_load_idle is only used between the first
+		 * and the last cpu load accounting in each LOAD_FREQ period,
+		 * and records the idle load on cpus not yet masked.
+		 */
+		if (!cpumask_empty(&cpu_load_update_mask) && !cpumask_test_cpu(cpu, &cpu_load_update_mask))
+		{
+			atomic_long_add(delta, &calc_unmask_cpu_load_idle);
+		}
+	}
 }
 
 static long calc_load_fold_idle(void)
@@ -2221,6 +2247,20 @@ static long calc_load_fold_idle(void)
 	return delta;
 }
 
+static long calc_load_fold_unmask_idle(void)
+{
+	long delta = 0;
+
+	if (atomic_long_read(&calc_unmask_cpu_load_idle))
+	{
+		delta = atomic_long_xchg(&calc_unmask_cpu_load_idle, 0);
+		atomic_long_sub(delta, &calc_load_tasks_idle);
+	}
+
+	return delta;
+}
+
+
 /**
  * fixed_power_int - compute: x^n, in O(log n) time
  *
@@ -2312,6 +2352,9 @@ static void calc_global_nohz(void)
 	if (delta)
 		atomic_long_add(delta, &calc_load_tasks);
 
+	cpumask_clear(&cpu_load_update_mask);
+	atomic_long_xchg(&calc_unmask_cpu_load_idle, 0);
+
 	/*
 	 * It could be the one fold was all it took, we done!
 	 */
@@ -2395,18 +2438,54 @@ void calc_global_load(unsigned long ticks)
 }
 
 /*
+ * Prepare cpu_load_update_mask for the coming per-cpu load calculation.
+ */
+void prepare_idle_mask(unsigned long ticks)
+{
+	if (time_before(jiffies, idle_mask_update - 10))
+		return;
+
+	cpumask_clear(&cpu_load_update_mask);
+	/*
+	 * calc_unmask_cpu_load_idle is part of calc_load_tasks_idle,
+	 * and calc_load_tasks_idle will be folded into calc_load_tasks
+	 * immediately, so there is no need to keep it now.
+	 */
+	atomic_long_xchg(&calc_unmask_cpu_load_idle, 0);
+
+	idle_mask_update += LOAD_FREQ;
+}
+
+/*
  * Called from update_cpu_load() to periodically update this CPU's
  * active count.
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
 	long delta;
+	int cpu = smp_processor_id();
 
 	if (time_before(jiffies, this_rq->calc_load_update))
 		return;
 
+	/*
+	 * An empty cpu_load_update_mask means this is the first
+	 * cpu to calculate its load in this period. The global
+	 * idle should be folded into calc_load_tasks, so we just
+	 * push it to calc_unmask_cpu_load_idle.
+	 */
+	if (cpumask_empty(&cpu_load_update_mask))
+		atomic_long_set(&calc_unmask_cpu_load_idle, atomic_long_read(&calc_load_tasks_idle));
+	/*
+	 * Mark this cpu as already calculated, so that going idle
+	 * on it will no longer take effect on calc_load_tasks in
+	 * this period.
+	 */
+	cpumask_set_cpu(cpu, &cpu_load_update_mask);
+
 	delta  = calc_load_fold_active(this_rq);
-	delta += calc_load_fold_idle();
+	/* Fold unmasked cpus' idle load into calc_load_tasks */
+	delta += calc_load_fold_unmask_idle();
 	if (delta)
 		atomic_long_add(delta, &calc_load_tasks);
 
@@ -7100,6 +7179,8 @@ void __init sched_init(void)
 
 	calc_load_update = jiffies + LOAD_FREQ;
 
+	idle_mask_update = jiffies + LOAD_FREQ;
+
 	/*
 	 * During early bootup we pretend to be a normal task:
 	 */
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6e46cac..afbc06a 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1222,6 +1222,7 @@ void do_timer(unsigned long ticks)
 	jiffies_64 += ticks;
 	update_wall_time();
 	calc_global_load(ticks);
+	prepare_idle_mask(ticks);
 }
 
 /**
-- 
1.7.9.5


Thread overview: 38+ messages
2012-06-09 10:54 Charles Wang [this message]
2012-06-11 15:42 ` [PATCH] sched: Folding nohz load accounting more accurate Peter Zijlstra
     [not found]   ` <4FD6BFC4.1060302@gmail.com>
2012-06-12  8:54     ` Peter Zijlstra
2012-06-12  9:34   ` Charles Wang
2012-06-12  9:56     ` Peter Zijlstra
2012-06-13  5:55       ` Doug Smythies
2012-06-13  7:56         ` Charles Wang
2012-06-14  4:41           ` Doug Smythies
2012-06-14 15:42             ` Charles Wang
2012-06-16  6:42               ` Doug Smythies
2012-06-13  8:16         ` Peter Zijlstra
2012-06-13 15:33           ` Doug Smythies
2012-06-13 21:57             ` Peter Zijlstra
2012-06-14  3:13               ` Doug Smythies
2012-06-18 10:13                 ` Peter Zijlstra
2012-07-20 19:24         ` sched: care and feeding of load-avg code (Re: [PATCH] sched: Folding nohz load accounting more accurate) Jonathan Nieder
2012-06-15 14:27       ` [PATCH] sched: Folding nohz load accounting more accurate Charles Wang
2012-06-15 17:39         ` Peter Zijlstra
2012-06-16 14:53           ` Doug Smythies
2012-06-18  6:41             ` Doug Smythies
2012-06-18 14:41               ` Charles Wang
2012-06-18 10:06           ` Charles Wang
2012-06-18 16:03         ` Peter Zijlstra
2012-06-19  6:08           ` Yong Zhang
2012-06-19  9:18             ` Peter Zijlstra
2012-06-19 15:50               ` Doug Smythies
2012-06-20  9:45                 ` Peter Zijlstra
2012-06-21  4:12                   ` Doug Smythies
2012-06-21  6:35                     ` Charles Wang
2012-06-21  8:48                     ` Peter Zijlstra
2012-06-22 14:03                     ` Peter Zijlstra
2012-06-24 21:45                       ` Doug Smythies
2012-07-03 16:01                         ` Doug Smythies
2012-06-25  2:15                       ` Charles Wang
2012-07-06  6:19                       ` [tip:sched/core] sched/nohz: Rewrite and fix load-avg computation -- again tip-bot for Peter Zijlstra
2012-06-19  6:19           ` [PATCH] sched: Folding nohz load accounting more accurate Doug Smythies
2012-06-19  6:24           ` Charles Wang
2012-06-19  9:57             ` Peter Zijlstra
