* [ANNOUNCEMENT] The Barbershop Load Distribution algorithm for Linux kernel scheduler.
From: Rakib Mullick @ 2012-02-12 18:52 UTC (permalink / raw)
  To: LKML

Hello all,

Here, I'd like to introduce an alternative load distribution
algorithm for the Linux kernel scheduler. The technique is named
"The Barbershop Load Distribution Algorithm", or BLD for short, and
will be referred to as BLD from here on. As its name implies, it only
tries to distribute the load properly by tracking the lowest- and
highest-loaded runqueues of the system. Unlike the current scheduler,
it never tries to balance the system load from idle context. The
motivation behind this technique is to distribute load among the CPUs
properly, as if no load balancer were needed, and to make load
distribution simpler.

Naming reason and its origin:
------------------------------------------
 To keep this short, that section is skipped here. Interested readers
should look at rk-devel.blogspot.com; a BLD release with some
performance results can be found at http://code.google.com/p/bld/.

Description
----------------
 BLD is best described as an O(1) CPU-picking technique, which is done
by ordering the CPU runqueues according to their load. In other words,
it keeps the scheduler aware of load changes, which lets the scheduler
keep the runqueues in order. This technique doesn't depend on scheduler
ticks. It consists of two simple operations: load tracking and runqueue
ordering. Load tracking is done whenever a load change happens on the
system, and based on that change the runqueue list is reordered. So, if
we have runqueues ordered from lowest to highest load, picking the least
(or even the most) loaded runqueue is easy: the scheduler can take the
least-loaded runqueue without any calculation or comparison at the time
of placing a task on a runqueue. When distributing load at sched_exec()
and sched_fork(), the best choice is the least-loaded runqueue of the
system, and in this way the system stays balanced without doing any
load balancing. At try_to_wake_up() time, picking the idlest runqueue
is the top priority, but it is done on a per-domain basis to utilize
the CPU caches properly; this is an area that needs more attention.
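
To make that concrete, below is a tiny, self-contained userspace sketch
(plain C; every name here is illustrative and not part of the patch) of
the central point: if the runqueues are kept on a list ordered by load,
the least-loaded one is always at the head, so placing a task needs no
scan at all.

#include <stdio.h>

#define NR_CPUS 4

/* Toy stand-in for a runqueue: just a CPU id and its load. */
struct toy_rq {
	int cpu;
	unsigned long load;
	struct toy_rq *next;		/* ordered: least to most loaded */
};

/* Head of the ordered list; always the least-loaded runqueue. */
static struct toy_rq *least_loaded;

/* O(1) pick: no per-CPU scan or comparison at task-placement time. */
static struct toy_rq *pick_least_loaded(void)
{
	return least_loaded;
}

int main(void)
{
	static struct toy_rq rqs[NR_CPUS] = {
		{ 2, 0, NULL }, { 0, 1024, NULL },
		{ 3, 2048, NULL }, { 1, 3072, NULL }
	};
	int i;

	/* Pretend the ordering step already sorted these by load. */
	for (i = 0; i < NR_CPUS - 1; i++)
		rqs[i].next = &rqs[i + 1];
	least_loaded = &rqs[0];

	printf("place next task on CPU %d (load %lu)\n",
	       pick_least_loaded()->cpu, pick_least_loaded()->load);
	return 0;
}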


Implementation
----------------------
It has been implemented as a config option which can be found under
"General setup" -> "An alternate CPU load distributor". If this option
is enabled, the current load-balancing code is disabled.

 As described above, BLD needs to do two things:
      a) load tracking, i.e. reacting to load changes;
      b) ordering the runqueues

 a) track load: To track load, hooks are placed in activate_task() and
deactivate_task(). At these points load tracking is done, which simply
recalculates the current rq load and updates it accordingly. Runqueue
loads are found by calling weighted_cpuload(). When trying to keep
balance at task wakeup, weighted_cpuload() alone isn't enough; sleeping
tasks need to be taken into account as well.

 b) ordering runqueues: Ordering the runqueues is simple and is done by
linking them based on load, from lowest to highest. A runqueue is
reordered after its load has been updated, making the necessary
comparisons with the already-ordered runqueues to keep the list in
order. This ordering operation doesn't depend on the number of
runqueues (i.e. CPUs); it is a constant-time operation.

Runqueues are kept on a doubly linked list, ordered by the load we get
from weighted_cpuload(), which makes it easy to access both the least
and the most loaded queue. A linked list is a feasible way to keep the
runqueues in order and to handle a large number of runqueues. The list
operations are protected by a single read-write spinlock.
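
The constant-time claim comes from comparing the updated runqueue only
against the end of the list it might displace, instead of re-sorting
the whole list. Here is a hedged userspace sketch of that rule for the
activate (load went up) side, with a pthread rwlock standing in for the
kernel's read-write spinlock; all names are illustrative and this is
not the patch code. The deactivate side mirrors it by comparing against
the head instead of the tail.

#include <pthread.h>

struct toy_rq {
	unsigned long load;
	struct toy_rq *prev, *next;	/* circular doubly linked list */
};

/* rq_head.next is the least loaded rq, rq_head.prev the most loaded. */
static struct toy_rq rq_head = { 0, &rq_head, &rq_head };
static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Move rq to the tail of the list, i.e. just before rq_head. */
static void move_to_tail(struct toy_rq *rq)
{
	/* unlink from its current position */
	rq->prev->next = rq->next;
	rq->next->prev = rq->prev;
	/* relink at the tail */
	rq->prev = rq_head.prev;
	rq->next = &rq_head;
	rq_head.prev->next = rq;
	rq_head.prev = rq;
}

/* A task was activated on rq, so its load went up. */
static void track_load_increase(struct toy_rq *rq, unsigned long new_load)
{
	struct toy_rq *last;

	rq->load = new_load;
	pthread_rwlock_wrlock(&list_lock);
	last = rq_head.prev;			/* currently the most loaded */
	if (rq != last && rq->load > last->load)
		move_to_tail(rq);		/* one compare, one splice: O(1) */
	pthread_rwlock_unlock(&list_lock);
}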

O(1) CPU picking at sched_fork() and sched_exec()
--------------------------------------------------------------------------
 At this point we have ordered runqueues, which can be accessed through
the runqueue list head pointer. The current implementation doesn't
separate offline CPUs from online CPUs (when HOTPLUG is enabled), so it
checks whether the first CPU on the list is online. CPU picking is
therefore an O(1) operation (with the avoidable side effect of CPUs
going offline) when balancing from sched_fork() and sched_exec(). At
that particular moment we are allowed to move a task to any CPU of the
system, because it has zero cache footprint.
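
In the same spirit as the earlier toy model, here is a hedged sketch of
this pick (the online[] array is a stand-in for cpu_online(); it is
illustrative only, not the patch code): walking starts at the
least-loaded end and stops at the first online CPU, so the common case
is a single dereference.

#define NR_CPUS 4

/* 1 if the CPU may receive tasks, 0 if it has been hot-unplugged. */
static int online[NR_CPUS] = { 1, 1, 0, 1 };

struct toy_rq {
	int cpu;
	struct toy_rq *next;		/* ordered: least to most loaded */
};

/* Usually returns the head; walks further only if that CPU is offline. */
static int pick_cpu_fork_exec(const struct toy_rq *least_loaded)
{
	const struct toy_rq *rq;

	for (rq = least_loaded; rq; rq = rq->next)
		if (online[rq->cpu])
			return rq->cpu;
	return 0;			/* everything offline: fall back to CPU 0 */
}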

CPU picking at task wakeup:
----------------------------------------
While picking a CPU at task wakeup, the lowest-loaded CPU of a
particular sched domain is chosen. In this way it tries to utilize CPU
resources properly; this is a somewhat over-cautious technique and an
area that needs more attention. Simply picking the lowest-loaded
runqueue of the whole system isn't appropriate in this case. When CPU
affinity is set for a particular task, the lowest-loaded CPU is found
within the task's affinity mask.
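
A hedged userspace sketch of the wakeup-time pick (the bitmap mask and
cpu_load[] array are illustrative stand-ins; in the patch the candidate
set comes from the sched domain span, the wakeup sched group, or the
task's affinity mask): only the allowed CPUs are scanned, and the least
loaded among them is returned.

#include <limits.h>

#define NR_CPUS 8

/* Stand-in for each runqueue's load.weight. */
static unsigned long cpu_load[NR_CPUS];

/* mask is a simple bitmap: bit i set means CPU i may be used. */
static int pick_cpu_wakeup(unsigned int mask, int fallback_cpu)
{
	unsigned long min_load = ULONG_MAX;
	int cpu = fallback_cpu;
	int i;

	for (i = 0; i < NR_CPUS; i++) {
		if (!(mask & (1u << i)))
			continue;
		if (cpu_load[i] < min_load) {
			min_load = cpu_load[i];
			cpu = i;
		}
	}
	return cpu;
}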

Remarks
--------------
 The current implementation requires a *lot* of cleanup and is messed
up with #ifdefs, so it is ugly and very basic. It hasn't been tested on
other architectures, so I'm fully unaware of how it will *really* run
on them; if it hangs or crashes your system while testing, don't blame
me too hard ;-). It also hasn't been tested against the various system
settings available in the Linux kernel, like cgroup settings, the
various sched-domain flags, etc. The current locking scheme might be
guilty of heavy lock contention on large systems; perhaps that could be
reduced by introducing a more fine-grained locking scheme. More work is
also required for wakeup balancing. It shows good performance for
kernel-build style loads, but has some downfall for loads where a
single process creates a large number of threads (1:N). It shows better
fairness - the test was made by running "for i in `seq 1 5` ; do
tools/perf/perf bench sched pipe & done; wait" (found in a previous
lkml post by Ingo Molnar).

 Personally, I was worried that CFS would get hurt by this load
distribution technique, but the fairness test (perf bench sched pipe)
shows that it actually improves fairness. The current scheduler is the
result of years of dedicated effort by a number of developers, and a
lot of scheduler modifications have been made by introducing heuristics
that show good results with performance-measuring tools. The case for
BLD isn't the same: it introduces an approach that shows how CPU load
can be distributed evenly without worrying too much about how many CPUs
exist, and some negative impact on system performance wouldn't be
abnormal. So, rather than simply looking at performance results, I'd
suggest readers think about the "current load balancing approach" vs.
the "BLD load distribution approach" (although BLD's performance is not
*bad* in my testing).

A BLD kernel's code size is reduced quite a bit, but as said before a
lot of code cleanup is still required, which is why I'm not showing
those numbers here.

And finally, my writing skills are not that good, so there might be
repetitive sentences, spelling mistakes, etc. - please bear with me. I
have tried to explain all the details of this technique, and if
anything is still confusing, I think looking at the patch will give you
a clear idea.

 So, please test it. Any comments and suggestions are highly welcome.
Anything I should have mentioned? Can't remember ...


Thanks,
Rakib.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
---

diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..98c9622 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -68,6 +68,14 @@ config BROKEN_ON_SMP
 	depends on BROKEN || !SMP
 	default y

+config BLD
+	bool "An alternate CPU load distributor"
+	depends on EXPERIMENTAL && SMP
+	default y
+	help
+	  This is an alternate CPU load distribution technique based
+	  on The Barbershop Load Distribution algorithm.
+
 config INIT_ENV_ARG_LIMIT
 	int
 	default 32 if !UML
diff --git a/kernel/sched/bld.h b/kernel/sched/bld.h
new file mode 100644
index 0000000..31ac23f
--- /dev/null
+++ b/kernel/sched/bld.h
@@ -0,0 +1,112 @@
+#ifdef CONFIG_BLD
+
+static DEFINE_RWLOCK(disp_list_lock);
+static LIST_HEAD(rq_head);
+
+static inline int list_is_first(const struct list_head *list,
+				const struct list_head *head)
+{
+	return list == head->next;
+}
+
+static inline int select_cpu_for_wakeup(struct task_struct *p, int
sd_flags, int wake_flags)
+{
+	int cpu = smp_processor_id(), prev_cpu = task_cpu(p), i;
+	/*bool sync = wake_flags & WF_SYNC; */
+	unsigned long load, min_load = ULONG_MAX;
+	struct cpumask *mask;
+
+	if (wake_flags & WF_SYNC) {
+		if (cpu == prev_cpu)
+			return cpu;
+		mask = sched_group_cpus(cpu_rq(prev_cpu)->sd->groups);
+	} else
+		mask = sched_domain_span(cpu_rq(prev_cpu)->sd);
+	
+	for_each_cpu(i, mask) {
+		load = cpu_rq(i)->load.weight;
+		if (load < min_load) {
+			min_load = load;
+			cpu = i;
+		}
+	}
+	return cpu;
+}
+
+static int bld_select_task_rq(struct task_struct *p, int sd_flags,
int wake_flags)
+{
+	struct rq *tmp;
+	unsigned long flag;
+	unsigned int cpu = smp_processor_id();
+
+	if (&p->cpus_allowed) {
+		struct cpumask *taskmask;
+		unsigned long min_load = ULONG_MAX, load, i;
+		taskmask = tsk_cpus_allowed(p);
+		for_each_cpu(i, taskmask) {
+			load = cpu_rq(i)->load.weight;
+			if (load < min_load) {
+				min_load = load;
+				cpu = i;
+			}
+		}
+	} else	if (sd_flags & SD_BALANCE_WAKE) {
+		cpu = select_cpu_for_wakeup(p, sd_flags, wake_flags);
+		return cpu;
+	} else {
+		read_lock_irqsave(&disp_list_lock, flag);
+		list_for_each_entry(tmp, &rq_head, disp_load_balance) {
+			cpu = cpu_of(tmp);
+			if (cpu_online(cpu))
+				break;
+		}
+		read_unlock_irqrestore(&disp_list_lock, flag);
+	}
+	return cpu;
+}
+
+static void bld_track_load_activate(struct rq *rq)
+{
+	unsigned long  flag;
+	rq->this_cpu_load = rq->load.weight;
+
+	if (rq->pos != 2) {	/* if rq isn't the last one */
+		struct rq *last;
+		write_lock_irqsave(&disp_list_lock, flag);
+		last = list_entry(rq_head.prev, struct rq, disp_load_balance);
+		if (rq->this_cpu_load > last->this_cpu_load) {
+			list_del(&rq->disp_load_balance);
+			list_add_tail(&rq->disp_load_balance, &rq_head);
+			rq->pos = 2; last->pos = 1;
+		}
+		write_unlock_irqrestore(&disp_list_lock, flag);
+	}
+}
+
+static void bld_track_load_deactivate(struct rq *rq)
+{
+	unsigned long flag;
+	
+	rq->this_cpu_load = rq->load.weight;
+
+	if (rq->pos != 0) { /* If rq isn't first one */
+		struct rq *first;
+		first = list_entry(rq_head.prev, struct rq, disp_load_balance);
+		write_lock_irqsave(&disp_list_lock, flag);
+		if (rq->this_cpu_load <= first->this_cpu_load) {
+			list_del(&rq->disp_load_balance);
+			list_add_tail(&rq->disp_load_balance, &rq_head);
+			rq->pos = 0; first->pos = 1;
+		}
+		write_unlock_irqrestore(&disp_list_lock, flag);
+	}
+}
+#else
+static inline void bld_track_load_activate(struct rq *rq)
+{
+}
+
+static inline void bld_track_load_deactivate(struct rq *rq)
+{
+}
+#endif /* CONFIG_BLD */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5255c9d..cff20e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,8 @@
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
  *  2007-11-29  RT balancing improvements by Steven Rostedt, Gregory Haskins,
  *              Thomas Gleixner, Mike Kravetz
+ *  2012-Feb	The Barbershop Load Distribution (BLD) algorithm, an alternate
+ *  		load distribution algorithm by Rakib Mullick.
  */

 #include <linux/mm.h>
@@ -81,6 +83,7 @@

 #include "sched.h"
 #include "../workqueue_sched.h"
+#include "bld.h"

 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
@@ -578,6 +581,7 @@ unlock:
  */
 void wake_up_idle_cpu(int cpu)
 {
+#ifndef CONFIG_BLD
 	struct rq *rq = cpu_rq(cpu);

 	if (cpu == smp_processor_id())
@@ -604,6 +608,7 @@ void wake_up_idle_cpu(int cpu)
 	smp_mb();
 	if (!tsk_is_polling(rq->idle))
 		smp_send_reschedule(cpu);
+#endif
 }

 static inline bool got_nohz_idle_kick(void)
@@ -730,6 +735,7 @@ void activate_task(struct rq *rq, struct
task_struct *p, int flags)
 		rq->nr_uninterruptible--;

 	enqueue_task(rq, p, flags);
+	bld_track_load_activate(rq);
 }

 void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
@@ -738,6 +744,7 @@ void deactivate_task(struct rq *rq, struct
task_struct *p, int flags)
 		rq->nr_uninterruptible++;

 	dequeue_task(rq, p, flags);
+	bld_track_load_deactivate(rq);
 }

 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -1297,7 +1304,12 @@ static int select_fallback_rq(int cpu, struct
task_struct *p)
 static inline
 int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 {
-	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
+	int cpu;
+#ifdef CONFIG_BLD
+	cpu = bld_select_task_rq(p, sd_flags, wake_flags);
+#else
+	cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
+#endif

 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -1453,7 +1465,11 @@ static void sched_ttwu_pending(void)

 void scheduler_ipi(void)
 {
+#ifndef CONFIG_BLD
 	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
+#else
+	if (llist_empty(&this_rq()->wake_list))
+#endif
 		return;

 	/*
@@ -1475,10 +1491,12 @@ void scheduler_ipi(void)
 	/*
 	 * Check if someone kicked us for doing the nohz idle load balance.
 	 */
+#ifndef CONFIG_BLD
 	if (unlikely(got_nohz_idle_kick() && !need_resched())) {
 		this_rq()->idle_balance = 1;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
+#endif
 	irq_exit();
 }

@@ -1518,12 +1536,14 @@ static void ttwu_queue(struct task_struct *p, int cpu)
 	struct rq *rq = cpu_rq(cpu);

 #if defined(CONFIG_SMP)
+#ifndef CONFIG_BLD
 	if (sched_feat(TTWU_QUEUE) && !ttwu_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* sync clocks x-cpu */
 		ttwu_queue_remote(p, cpu);
 		return;
 	}
 #endif
+#endif

 	raw_spin_lock(&rq->lock);
 	ttwu_do_activate(rq, p, 0);
@@ -2269,6 +2289,7 @@ calc_load_n(unsigned long load, unsigned long exp,
  */
 static void calc_global_nohz(unsigned long ticks)
 {
+#ifndef CONFIG_BLD
 	long delta, active, n;

 	if (time_before(jiffies, calc_load_update))
@@ -2310,6 +2331,7 @@ static void calc_global_nohz(unsigned long ticks)
 	 * age us 4 cycles, and the test in calc_global_load() will
 	 * pick up the final one.
 	 */
+#endif
 }
 #else
 void calc_load_account_idle(struct rq *this_rq)
@@ -3003,8 +3025,10 @@ void scheduler_tick(void)

 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
+#ifndef CONFIG_BLD
 	trigger_load_balance(rq, cpu);
 #endif
+#endif
 }

 notrace unsigned long get_parent_ip(unsigned long addr)
@@ -3194,8 +3218,10 @@ need_resched:

 	pre_schedule(rq, prev);

+#ifndef CONFIG_BLD
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
+#endif

 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
@@ -6938,6 +6964,11 @@ void __init sched_init(void)
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
+#ifdef CONFIG_BLD
+		INIT_LIST_HEAD(&rq->disp_load_balance);
+		list_add_tail(&rq->disp_load_balance, &rq_head);
+		rq->pos = 0;
+#endif
 	}

 	set_load_weight(&init_task);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c6414f..f2624ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5609,7 +5609,9 @@ void print_cfs_stats(struct seq_file *m, int cpu)
 __init void init_sched_fair_class(void)
 {
 #ifdef CONFIG_SMP
+#ifndef CONFIG_BLD
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
+#endif /* BLD */

 #ifdef CONFIG_NO_HZ
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 98c0c26..bd7e4c6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -474,6 +474,17 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+#ifdef CONFIG_BLD
+	unsigned long this_cpu_load;
+	struct list_head disp_load_balance;
+	/* It indicates whether, rq is first or last
+	 * or in the middle based on load from rq_head.
+	 * 0 - First rq
+	 * 1 - rq stays middle
+	 * 2 - last rq
+	 */
+	char pos;
+#endif
 };

 static inline int cpu_of(struct rq *rq)


* Re: [ANNOUNCEMENT] The Barbershop Load Distribution algorithm for Linux kernel scheduler.
From: Hillf Danton @ 2012-02-13 14:05 UTC (permalink / raw)
  To: Rakib Mullick; +Cc: LKML

Hello Rakib

Just nitpicks

On Mon, Feb 13, 2012 at 2:52 AM, Rakib Mullick <rakib.mullick@gmail.com> wrote:
[...]
> --- /dev/null
> +++ b/kernel/sched/bld.h
> @@ -0,0 +1,112 @@
> +#ifdef CONFIG_BLD
> +
> +static DEFINE_RWLOCK(disp_list_lock);

What is the advantage of rwlock, compared with spin lock?

> +static LIST_HEAD(rq_head);
> +
> +static inline int list_is_first(const struct list_head *list,

Where is this helper used?

> +                               const struct list_head *head)
> +{
> +       return list == head->next;
> +}
> +
> +static inline int select_cpu_for_wakeup(struct task_struct *p, int
> sd_flags, int wake_flags)

Looks @sd_flags not used. Why is the arch specifics negligible?
Also looks message corrupted due to mail agent?

> +{
> +       int cpu = smp_processor_id(), prev_cpu = task_cpu(p), i;

            int this_cpu = smp_processor_id();
            int prev_cpu = task_cpu(p);
            int cpu;

> +       /*bool sync = wake_flags & WF_SYNC; */
> +       unsigned long load, min_load = ULONG_MAX;
> +       struct cpumask *mask;
> +
> +       if (wake_flags & WF_SYNC) {
> +               if (cpu == prev_cpu)
> +                       return cpu;
> +               mask = sched_group_cpus(cpu_rq(prev_cpu)->sd->groups);
> +       } else
> +               mask = sched_domain_span(cpu_rq(prev_cpu)->sd);
> +
> +       for_each_cpu(i, mask) {
> +               load = cpu_rq(i)->load.weight;
> +               if (load < min_load) {
> +                       min_load = load;
> +                       cpu = i;
> +               }
> +       }
> +       return cpu;
> +}
> +
> +static int bld_select_task_rq(struct task_struct *p, int sd_flags,
> int wake_flags)

Message corrupted?

> +{
> +       struct rq *tmp;
> +       unsigned long flag;
> +       unsigned int cpu = smp_processor_id();
> +
> +       if (&p->cpus_allowed) {
> +               struct cpumask *taskmask;
> +               unsigned long min_load = ULONG_MAX, load, i;
> +               taskmask = tsk_cpus_allowed(p);
> +               for_each_cpu(i, taskmask) {
> +                       load = cpu_rq(i)->load.weight;
> +                       if (load < min_load) {
> +                               min_load = load;
> +                               cpu = i;
> +                       }
> +               }
> +       } else  if (sd_flags & SD_BALANCE_WAKE) {
> +               cpu = select_cpu_for_wakeup(p, sd_flags, wake_flags);
> +               return cpu;
> +       } else {
> +               read_lock_irqsave(&disp_list_lock, flag);
> +               list_for_each_entry(tmp, &rq_head, disp_load_balance) {
> +                       cpu = cpu_of(tmp);
> +                       if (cpu_online(cpu))
> +                               break;
> +               }
> +               read_unlock_irqrestore(&disp_list_lock, flag);
> +       }
> +       return cpu;
> +}
> +
> +static void bld_track_load_activate(struct rq *rq)
> +{
> +       unsigned long  flag;
> +       rq->this_cpu_load = rq->load.weight;

Well ->this_cpu_load looks unnecessary?

> +
> +       if (rq->pos != 2) {     /* if rq isn't the last one */
> +               struct rq *last;
> +               write_lock_irqsave(&disp_list_lock, flag);

                    if (rq->pos != 2)
                             goto out;

> +               last = list_entry(rq_head.prev, struct rq, disp_load_balance);

Could disp_list_lock serialize updating this_cpu_load?

> +               if (rq->this_cpu_load > last->this_cpu_load) {
> +                       list_del(&rq->disp_load_balance);
> +                       list_add_tail(&rq->disp_load_balance, &rq_head);
> +                       rq->pos = 2; last->pos = 1;
> +               }

out:

> +               write_unlock_irqrestore(&disp_list_lock, flag);
> +       }
> +}
> +
> +static void bld_track_load_deactivate(struct rq *rq)
> +{
> +       unsigned long flag;
> +
> +       rq->this_cpu_load = rq->load.weight;
> +
> +       if (rq->pos != 0) { /* If rq isn't first one */
> +               struct rq *first;
> +               first = list_entry(rq_head.prev, struct rq, disp_load_balance);
> +               write_lock_irqsave(&disp_list_lock, flag);
> +               if (rq->this_cpu_load <= first->this_cpu_load) {
> +                       list_del(&rq->disp_load_balance);
> +                       list_add_tail(&rq->disp_load_balance, &rq_head);
> +                       rq->pos = 0; first->pos = 1;
> +               }
> +               write_unlock_irqrestore(&disp_list_lock, flag);
> +       }
> +}
> +#else
> +static inline void bld_track_load_activate(struct rq *rq)
> +{
> +}
> +
> +static inline void bld_track_load_deactivate(struct rq *rq)
> +{
> +}
> +#endif /* CONFIG_BLD */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5255c9d..cff20e1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -24,6 +24,8 @@
>  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
>  *  2007-11-29  RT balancing improvements by Steven Rostedt, Gregory Haskins,
>  *              Thomas Gleixner, Mike Kravetz
> + *  2012-Feb   The Barbershop Load Distribution (BLD) algorithm, an alternate
> + *             load distribution algorithm by Rakib Mullick.
>  */
>
>  #include <linux/mm.h>
> @@ -81,6 +83,7 @@
>
>  #include "sched.h"
>  #include "../workqueue_sched.h"
> +#include "bld.h"
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/sched.h>
> @@ -578,6 +581,7 @@ unlock:
>  */
>  void wake_up_idle_cpu(int cpu)
>  {
> +#ifndef CONFIG_BLD
>        struct rq *rq = cpu_rq(cpu);
>
>        if (cpu == smp_processor_id())
> @@ -604,6 +608,7 @@ void wake_up_idle_cpu(int cpu)
>        smp_mb();
>        if (!tsk_is_polling(rq->idle))
>                smp_send_reschedule(cpu);
> +#endif
>  }
>
>  static inline bool got_nohz_idle_kick(void)
> @@ -730,6 +735,7 @@ void activate_task(struct rq *rq, struct
> task_struct *p, int flags)
>                rq->nr_uninterruptible--;
>
>        enqueue_task(rq, p, flags);
> +       bld_track_load_activate(rq);

Looks better if sorting rq folded in enqueue_task()?

>  }
>
>  void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> @@ -738,6 +744,7 @@ void deactivate_task(struct rq *rq, struct
> task_struct *p, int flags)
>                rq->nr_uninterruptible++;
>
>        dequeue_task(rq, p, flags);
> +       bld_track_load_deactivate(rq);
>  }
>
>  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> @@ -1297,7 +1304,12 @@ static int select_fallback_rq(int cpu, struct
> task_struct *p)
>  static inline
>  int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
>  {
> -       int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
> +       int cpu;
> +#ifdef CONFIG_BLD
> +       cpu = bld_select_task_rq(p, sd_flags, wake_flags);

What if @p is RT?

> +#else
> +       cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
> +#endif
>
>        /*
>         * In order not to call set_task_cpu() on a blocking task we need
> @@ -1453,7 +1465,11 @@ static void sched_ttwu_pending(void)
>
>  void scheduler_ipi(void)
>  {
> +#ifndef CONFIG_BLD
>        if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> +#else
> +       if (llist_empty(&this_rq()->wake_list))
> +#endif
>                return;
>
>        /*
> @@ -1475,10 +1491,12 @@ void scheduler_ipi(void)
>        /*
>         * Check if someone kicked us for doing the nohz idle load balance.
>         */
> +#ifndef CONFIG_BLD
>        if (unlikely(got_nohz_idle_kick() && !need_resched())) {
>                this_rq()->idle_balance = 1;
>                raise_softirq_irqoff(SCHED_SOFTIRQ);
>        }
> +#endif
>        irq_exit();
>  }
>
> @@ -1518,12 +1536,14 @@ static void ttwu_queue(struct task_struct *p, int cpu)
>        struct rq *rq = cpu_rq(cpu);
>
>  #if defined(CONFIG_SMP)
> +#ifndef CONFIG_BLD
>        if (sched_feat(TTWU_QUEUE) && !ttwu_share_cache(smp_processor_id(), cpu)) {
>                sched_clock_cpu(cpu); /* sync clocks x-cpu */
>                ttwu_queue_remote(p, cpu);
>                return;
>        }
>  #endif
> +#endif
>
>        raw_spin_lock(&rq->lock);
>        ttwu_do_activate(rq, p, 0);
> @@ -2269,6 +2289,7 @@ calc_load_n(unsigned long load, unsigned long exp,
>  */
>  static void calc_global_nohz(unsigned long ticks)
>  {
> +#ifndef CONFIG_BLD
>        long delta, active, n;
>
>        if (time_before(jiffies, calc_load_update))
> @@ -2310,6 +2331,7 @@ static void calc_global_nohz(unsigned long ticks)
>         * age us 4 cycles, and the test in calc_global_load() will
>         * pick up the final one.
>         */
> +#endif
>  }
>  #else
>  void calc_load_account_idle(struct rq *this_rq)
> @@ -3003,8 +3025,10 @@ void scheduler_tick(void)
>
>  #ifdef CONFIG_SMP
>        rq->idle_balance = idle_cpu(cpu);
> +#ifndef CONFIG_BLD
>        trigger_load_balance(rq, cpu);
>  #endif
> +#endif
>  }
>
>  notrace unsigned long get_parent_ip(unsigned long addr)
> @@ -3194,8 +3218,10 @@ need_resched:
>
>        pre_schedule(rq, prev);
>
> +#ifndef CONFIG_BLD
>        if (unlikely(!rq->nr_running))
>                idle_balance(cpu, rq);
> +#endif
>
>        put_prev_task(rq, prev);
>        next = pick_next_task(rq);
> @@ -6938,6 +6964,11 @@ void __init sched_init(void)
>  #endif
>                init_rq_hrtick(rq);
>                atomic_set(&rq->nr_iowait, 0);
> +#ifdef CONFIG_BLD
> +               INIT_LIST_HEAD(&rq->disp_load_balance);
> +               list_add_tail(&rq->disp_load_balance, &rq_head);
> +               rq->pos = 0;
> +#endif
>        }
>
>        set_load_weight(&init_task);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7c6414f..f2624ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5609,7 +5609,9 @@ void print_cfs_stats(struct seq_file *m, int cpu)
>  __init void init_sched_fair_class(void)
>  {
>  #ifdef CONFIG_SMP
> +#ifndef CONFIG_BLD
>        open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
> +#endif /* BLD */
>
>  #ifdef CONFIG_NO_HZ
>        zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 98c0c26..bd7e4c6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -474,6 +474,17 @@ struct rq {
>  #ifdef CONFIG_SMP
>        struct llist_head wake_list;
>  #endif
> +#ifdef CONFIG_BLD
> +       unsigned long this_cpu_load;
> +       struct list_head disp_load_balance;
> +       /* It indicates whether, rq is first or last
> +        * or in the middle based on load from rq_head.
> +        * 0 - First rq
> +        * 1 - rq stays middle
> +        * 2 - last rq
> +        */
> +       char pos;
> +#endif
>  };
>
>  static inline int cpu_of(struct rq *rq)


* Re: [ANNOUNCEMENT] The Barbershop Load Distribution algorithm for Linux kernel scheduler.
From: Rakib Mullick @ 2012-02-13 17:22 UTC (permalink / raw)
  To: Hillf Danton; +Cc: LKML

Hi Hillf,

On Mon, Feb 13, 2012 at 8:05 PM, Hillf Danton <dhillf@gmail.com> wrote:
> Hello Rakib
>
> Just nitpicks
>
> On Mon, Feb 13, 2012 at 2:52 AM, Rakib Mullick <rakib.mullick@gmail.com> wrote:
> [...]
>> --- /dev/null
>> +++ b/kernel/sched/bld.h
>> @@ -0,0 +1,112 @@
>> +#ifdef CONFIG_BLD
>> +
>> +static DEFINE_RWLOCK(disp_list_lock);
>
> What is the advantage of rwlock, compared with spin lock?
>
It separates readers from writers and allows multiple readers to be in
the same critical region at the same time.

>> +static LIST_HEAD(rq_head);
>> +
>> +static inline int list_is_first(const struct list_head *list,
>
> Where is this helper used?
>
I forgot to remove this function. Actually, this whole BLD work is
under development; I'm constantly trying to improve it. That helper was
used to find out whether a particular rq is the first (lowest-loaded)
entry in the doubly linked list or not. It wasn't used later on,
though, because of the introduction of the rq->pos field. The purpose
of the ->pos field is to indicate whether an rq is the last, the first,
or in between. This way, we can check whether an rq is the last, the
first, or in between without holding the rwlock.

>> +                               const struct list_head *head)
>> +{
>> +       return list == head->next;
>> +}
>> +
>> +static inline int select_cpu_for_wakeup(struct task_struct *p, int
>> sd_flags, int wake_flags)
>
> Looks @sd_flags not used.

Yes, sd_flag isn't needed here. Will remove it.

> Why is the arch specifics negligible?

I'm not clear what you're trying to say.

> Also looks message corrupted due to mail agent?
>
Perhaps; I'll be more careful later on.

>> +{
>> +       int cpu = smp_processor_id(), prev_cpu = task_cpu(p), i;
>
>            int this_cpu = smp_processor_id();
>            int prev_cpu = task_cpu(p);
>            int cpu;
>
>> +       /*bool sync = wake_flags & WF_SYNC; */
>> +       unsigned long load, min_load = ULONG_MAX;
>> +       struct cpumask *mask;
>> +
>> +       if (wake_flags & WF_SYNC) {
>> +               if (cpu == prev_cpu)
>> +                       return cpu;
>> +               mask = sched_group_cpus(cpu_rq(prev_cpu)->sd->groups);
>> +       } else
>> +               mask = sched_domain_span(cpu_rq(prev_cpu)->sd);
>> +
>> +       for_each_cpu(i, mask) {
>> +               load = cpu_rq(i)->load.weight;
>> +               if (load < min_load) {
>> +                       min_load = load;
>> +                       cpu = i;
>> +               }
>> +       }
>> +       return cpu;
>> +}
>> +
>> +static int bld_select_task_rq(struct task_struct *p, int sd_flags,
>> int wake_flags)
>
> Message corrupted?
>
>> +{
>> +       struct rq *tmp;
>> +       unsigned long flag;
>> +       unsigned int cpu = smp_processor_id();
>> +
>> +       if (&p->cpus_allowed) {
>> +               struct cpumask *taskmask;
>> +               unsigned long min_load = ULONG_MAX, load, i;
>> +               taskmask = tsk_cpus_allowed(p);
>> +               for_each_cpu(i, taskmask) {
>> +                       load = cpu_rq(i)->load.weight;
>> +                       if (load < min_load) {
>> +                               min_load = load;
>> +                               cpu = i;
>> +                       }
>> +               }
>> +       } else  if (sd_flags & SD_BALANCE_WAKE) {
>> +               cpu = select_cpu_for_wakeup(p, sd_flags, wake_flags);
>> +               return cpu;
>> +       } else {
>> +               read_lock_irqsave(&disp_list_lock, flag);
>> +               list_for_each_entry(tmp, &rq_head, disp_load_balance) {
>> +                       cpu = cpu_of(tmp);
>> +                       if (cpu_online(cpu))
>> +                               break;
>> +               }
>> +               read_unlock_irqrestore(&disp_list_lock, flag);
>> +       }
>> +       return cpu;
>> +}
>> +
>> +static void bld_track_load_activate(struct rq *rq)
>> +{
>> +       unsigned long  flag;
>> +       rq->this_cpu_load = rq->load.weight;
>
> Well ->this_cpu_load looks unnecessary?
>
->this_cpu_load was intentionally kept as a separate field because a
cross-rq check is required later, and I'm not sure whether doing that
directly on rq->load.weight is safe or not.

>> +
>> +       if (rq->pos != 2) {     /* if rq isn't the last one */
>> +               struct rq *last;
>> +               write_lock_irqsave(&disp_list_lock, flag);
>
>                    if (rq->pos != 2)
>                             goto out;
>
At this point, we're checking whether the task is being activated on
the rq that is the last (highest-loaded) one or not. If rq->pos != 2,
it means we're not activating a task on the highest-loaded rq, so a
check is made against the highest-loaded rq to make sure this rq's load
didn't exceed it. If the load does exceed it, the rq is removed from
its place on the list and added as the last entry of rq_head, and thus
it becomes the highest-loaded rq. So, what you proposed here isn't what
was intended.

>> +               last = list_entry(rq_head.prev, struct rq, disp_load_balance);
>
> Could disp_list_lock serialize updating this_cpu_load?
>
>> +               if (rq->this_cpu_load > last->this_cpu_load) {
>> +                       list_del(&rq->disp_load_balance);
>> +                       list_add_tail(&rq->disp_load_balance, &rq_head);
>> +                       rq->pos = 2; last->pos = 1;
>> +               }
>
> out:
>
>> +               write_unlock_irqrestore(&disp_list_lock, flag);
>> +       }
>> +}
>> +
>> +static void bld_track_load_deactivate(struct rq *rq)
>> +{
>> +       unsigned long flag;
>> +
>> +       rq->this_cpu_load = rq->load.weight;
>> +
>> +       if (rq->pos != 0) { /* If rq isn't first one */
>> +               struct rq *first;
>> +               first = list_entry(rq_head.prev, struct rq, disp_load_balance);
>> +               write_lock_irqsave(&disp_list_lock, flag);
>> +               if (rq->this_cpu_load <= first->this_cpu_load) {
>> +                       list_del(&rq->disp_load_balance);
>> +                       list_add_tail(&rq->disp_load_balance, &rq_head);
>> +                       rq->pos = 0; first->pos = 1;
>> +               }
>> +               write_unlock_irqrestore(&disp_list_lock, flag);
>> +       }
>> +}
>> +#else
>> +static inline void bld_track_load_activate(struct rq *rq)
>> +{
>> +}
>> +
>> +static inline void bld_track_load_deactivate(struct rq *rq)
>> +{
>> +}
>> +#endif /* CONFIG_BLD */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5255c9d..cff20e1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -24,6 +24,8 @@
>>  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
>>  *  2007-11-29  RT balancing improvements by Steven Rostedt, Gregory Haskins,
>>  *              Thomas Gleixner, Mike Kravetz
>> + *  2012-Feb   The Barbershop Load Distribution (BLD) algorithm, an alternate
>> + *             load distribution algorithm by Rakib Mullick.
>>  */
>>
>>  #include <linux/mm.h>
>> @@ -81,6 +83,7 @@
>>
>>  #include "sched.h"
>>  #include "../workqueue_sched.h"
>> +#include "bld.h"
>>
>>  #define CREATE_TRACE_POINTS
>>  #include <trace/events/sched.h>
>> @@ -578,6 +581,7 @@ unlock:
>>  */
>>  void wake_up_idle_cpu(int cpu)
>>  {
>> +#ifndef CONFIG_BLD
>>        struct rq *rq = cpu_rq(cpu);
>>
>>        if (cpu == smp_processor_id())
>> @@ -604,6 +608,7 @@ void wake_up_idle_cpu(int cpu)
>>        smp_mb();
>>        if (!tsk_is_polling(rq->idle))
>>                smp_send_reschedule(cpu);
>> +#endif
>>  }
>>
>>  static inline bool got_nohz_idle_kick(void)
>> @@ -730,6 +735,7 @@ void activate_task(struct rq *rq, struct
>> task_struct *p, int flags)
>>                rq->nr_uninterruptible--;
>>
>>        enqueue_task(rq, p, flags);
>> +       bld_track_load_activate(rq);
>
> Looks better if sorting rq folded in enqueue_task()?
>
Any particular reason for that?

>>  }
>>
>>  void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>> @@ -738,6 +744,7 @@ void deactivate_task(struct rq *rq, struct
>> task_struct *p, int flags)
>>                rq->nr_uninterruptible++;
>>
>>        dequeue_task(rq, p, flags);
>> +       bld_track_load_deactivate(rq);
>>  }
>>
>>  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> @@ -1297,7 +1304,12 @@ static int select_fallback_rq(int cpu, struct
>> task_struct *p)
>>  static inline
>>  int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
>>  {
>> -       int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
>> +       int cpu;
>> +#ifdef CONFIG_BLD
>> +       cpu = bld_select_task_rq(p, sd_flags, wake_flags);
>
> What if @p is RT?
>
bld_select_task_rq() will be called. :)

Hillf, did you run the patch? I would like to know.

Thanks,
Rakib


* Re: [ANNOUNCEMENT] The Barbershop Load Distribution algorithm for Linux kernel scheduler.
From: Hillf Danton @ 2012-02-14 12:59 UTC (permalink / raw)
  To: Rakib Mullick; +Cc: LKML

On Tue, Feb 14, 2012 at 1:22 AM, Rakib Mullick <rakib.mullick@gmail.com> wrote:
>>> +
>>> +       if (rq->pos != 2) {     /* if rq isn't the last one */
>>> +               struct rq *last;
>>> +               write_lock_irqsave(&disp_list_lock, flag);
>>
>>                    if (rq->pos != 2)
>>                             goto out;
>>
> At this point, we're checking whether this task is activating on a rq
> which is the last (hightest loaded) rq or not. If rq->pos != 2, it
> stands we're not activating a task at the highest loaded rq, so a
> check will be made with the highest loaded rq to make sure - this rq's
> loaded didn't exceed the highest loaded rq. If rq's load
> exceed - list will be removed from it's place and will be placed as a
> last entry of rq_head and thus it becomes the highest loaded rq. So,
> what you proposed here isn't what was intended.
>

I want to say
                    if (rq->pos == 2)
                             goto out;
sorry for the bad suggestion :(

> Hillf, did you run the patch? I would like to know.

I'll try to run it soon.

Best regards
Hillf


