Date: Thu, 4 Jul 2013 23:32:27 +0530
From: Srikar Dronamraju
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML
Subject: Re: [PATCH RFC WIP] Process weights based scheduling for better consolidation
Message-ID: <20130704180227.GA31348@linux.vnet.ibm.com>
In-Reply-To: <1372861300-9973-1-git-send-email-mgorman@suse.de>

Here is an approach to numa balanced scheduling from a non numa-fault
angle. This approach uses process weights instead of faults as the basis
to move or bring tasks together.

Here are the advantages of this approach.

1. Provides excellent consolidation of tasks.

   I have verified this with sched_autonuma_dump_mm(), which was part of
   Andrea's autonuma patches; refer to commit id:

   commit aba373d04251691b5e0987a0fff2fa7007311810
   Author: Andrea Arcangeli
   Date:   Fri Mar 23 20:35:07 2012 +0100

       autonuma: CPU follows memory algorithm

   From limited experiments, I have found that the better the task
   consolidation, the better the memory layout we achieve, which results
   in better performance.

2. Provides good benefit in the limited testing that I have done so far.
   For example, it gives a _20+%_ improvement for numa01
   (autonuma-benchmark).

3. Since it doesn't depend on numa faulting, it doesn't have the overhead
   of having to get the scanning rate right.

4. Extends the load balancer, especially when the cpus are idling.

5. The code looks much simpler and more naive to me. (But yes, this is
   relative!)

Results on a 2 node 12 core system:

KernelVersion: 3.9.0 (with hyper threading)
Testcase:       Min      Max      Avg
  numa01:    220.12   246.96   239.18
  numa02:     41.85    43.02    42.43

KernelVersion: 3.9.0 + code (with hyper threading)
Testcase:       Min      Max      Avg  %Change
  numa01:    174.97   219.38   198.99   20.20%
  numa02:     38.24    38.50    38.38   10.55%

KernelVersion: 3.9.0 (noht)
Testcase:       Min      Max      Avg
  numa01:    118.72   121.04   120.23
  numa02:     36.64    37.56    36.99

KernelVersion: 3.9.0 + code (noht)
Testcase:       Min      Max      Avg  %Change
  numa01:     92.95   113.36   108.21   11.11%
  numa02:     36.76    38.91    37.34   -0.94%

Per-run data from /usr/bin/time -f "%e %S %U %c %w", i.e. elapsed, sys,
user, involuntary and voluntary context switches:

Best case performance for v3.9
numa01                220.12   17.14  5041.27  522147  1273
numa02                 41.91    2.47   887.46   92079     8

Best case performance for v3.9 + code
numa01                174.97   17.46  4102.64  433804  1846
numa01_THREAD_ALLOC   288.04   15.76  6783.86  718220   174
numa02                 38.41    0.75   905.65   95364     5
numa02_SMT             46.43    0.55   487.30   66416     7

Best case memory layout for v3.9
  9    416.44   5728.73
 19    356.42   5788.75
 30    722.49   5422.68
 40   1936.50   4208.67
 50   1372.40   4772.77
 60   1354.39   4790.78
 71   1512.39   4632.78
 81   1598.40   4546.77
 91   2242.40   3902.77
101   2242.40   3902.78
111   2654.41   3490.77
122   2654.40   3490.77
132   2976.30   3168.87
142   2956.30   3188.87
152   2956.30   3188.87
162   2956.30   3188.87
173   3044.30   3100.87
183   3058.30   3086.87
193   3204.20   2942.87
203   3262.20   2884.89
213   3262.18   2884.91

Best case memory layout for v3.9 + code
 10   6140.55      4.64
 20   3728.99   2416.18
 30   3066.45   3078.73
 40   3072.46   3072.73
 51   3072.46   3072.73
 61   3072.46   3072.73
 71   3072.46   3072.73
 81   3072.46   3072.73
 91   3072.46   3072.73
102   3072.46   3072.73
112   3072.46   3072.73
122   3072.46   3072.73
132   3072.46   3072.73
142   3072.46   3072.73
152   3072.46   3072.73
163   3072.46   3072.73
173   3072.44   3072.74

Having said that, I am sure the experts would already have thought of
this approach and might have reasons to discard it. Hence the code is not
yet in patchset form, nor do I have the extensive analysis that Mel has
for his patchset. I thought of posting the code in some form so that I
learn whether there are any obvious pitfalls that would rule this
approach out.

Here is the outline of the approach.

- Every process has a per-node array where we store the weight of all its
  tasks running on that node. This array gets updated on task
  enqueue/dequeue. (A minimal standalone sketch of this bookkeeping
  follows the changelog below.)

- Added a 2-pass mechanism (somewhat taken from numacore, but not exactly
  the same) while choosing tasks to move across nodes. In the first pass,
  only tasks that are ideal to be moved are chosen. While choosing a
  task, look at the per-node process arrays to see if moving the task
  helps. If the first pass fails to move a task, any task can be chosen
  in the second pass.

- If the regular load balancer (rebalance_domains()) fails to balance the
  load (or finds no imbalance) and there is an idle cpu, use that cpu to
  consolidate tasks onto nodes using the information in the per-node
  process arrays.

  Every idle cpu, if it doesn't have tasks queued after load balance,

  - will walk through the cpus in its node and check if there are buddy
    tasks that are not part of the node but should ideally have been part
    of this node.

  - To make sure that we don't pull all buddy tasks and create an
    imbalance, we look at the load on the node, the pinned tasks and the
    process's contribution to the load of this node.

  - Each cpu looks at the node which has the least number of buddy tasks
    running and tries to pull tasks from such nodes.

  - Once it finds the cpu from which to pull the tasks, it triggers
    active balancing. This type of active balancing triggers just one
    pass, i.e. it only fetches tasks that increase numa locality.

Thanks for taking a look and providing your valuable inputs.

---8<---

sched: Using process weights to consolidate tasks

If we consolidate related tasks onto one node, memory tends to follow
them to that node. If the memory and the tasks end up on one node, it
results in better performance.

To achieve this, the code below tries to extend the current load
balancing while idling to move tasks in such a way that related tasks
end up on the same node. Care is taken not to overload a node while
moving tasks.

This code also adds iteration logic to the regular move-task logic to
further consolidate tasks while performing regular load balancing.
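Before the patch itself, here is a minimal userspace sketch (not kernel
code) of the per-node weight bookkeeping and the first-pass migration
test described in the outline above. Every name in it (proc_weights,
weights_enqueue, weights_dequeue, prefers_node, NR_NODES) is invented
for the illustration; in the patch the equivalent work is done by
account_numa_enqueue()/account_numa_dequeue() and preferred_node(),
which additionally checks rq->ab_node_load (omitted here).

/*
 * Minimal userspace sketch of the per-process, per-node weight
 * bookkeeping.  Everything here is invented for the illustration; the
 * patch keeps the same information in mm->numa_weights and updates it
 * from the enqueue/dequeue path.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 2

/* Userspace stand-in for mm->numa_weights: one slot per node plus a total. */
struct proc_weights {
	unsigned long node[NR_NODES];
	unsigned long total;
};

/* A task of weight 'w' starts running on 'node' (enqueue side). */
static void weights_enqueue(struct proc_weights *pw, int node, unsigned long w)
{
	pw->node[node] += w;
	pw->total += w;
}

/* That task stops running on 'node' (dequeue side). */
static void weights_dequeue(struct proc_weights *pw, int node, unsigned long w)
{
	pw->node[node] -= w;
	pw->total -= w;
}

/*
 * First-pass test: moving a task from 'src' to 'dst' is "ideal" only if
 * the process already has more of its weight on 'dst', i.e. the move
 * consolidates the process.
 */
static bool prefers_node(const struct proc_weights *pw, int src, int dst)
{
	return pw->node[dst] > pw->node[src];
}

int main(void)
{
	struct proc_weights pw = { { 0 }, 0 };

	/* Three tasks of weight 1024 running on node 1, one on node 0. */
	weights_enqueue(&pw, 1, 1024);
	weights_enqueue(&pw, 1, 1024);
	weights_enqueue(&pw, 1, 1024);
	weights_enqueue(&pw, 0, 1024);
	printf("move the node 0 task to node 1? %s\n",
	       prefers_node(&pw, 0, 1) ? "yes" : "no");

	/* The node 1 tasks exit: consolidation no longer favours node 1. */
	weights_dequeue(&pw, 1, 1024);
	weights_dequeue(&pw, 1, 1024);
	weights_dequeue(&pw, 1, 1024);
	printf("move the node 0 task to node 1? %s\n",
	       prefers_node(&pw, 0, 1) ? "yes" : "no");
	return 0;
}

Compiled with gcc and run, the first query prints "yes" (the process
already has more weight on node 1, so consolidating there helps) and the
second prints "no" once that weight has been dequeued.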
Not-yet-signed-off-by: Srikar Dronamraju
---
 fs/exec.c                |    4 +
 include/linux/migrate.h  |    1 -
 include/linux/mm_types.h |    1 +
 include/linux/sched.h    |    2 +
 kernel/fork.c            |   10 +-
 kernel/sched/core.c      |    2 +
 kernel/sched/fair.c      |  338 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     |    4 +
 8 files changed, 344 insertions(+), 18 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a96a488..54589d0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -833,6 +833,10 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
+#ifdef CONFIG_NUMA_BALANCING
+	mm->numa_weights = kzalloc(sizeof(unsigned long) * (nr_node_ids + 1), GFP_KERNEL);
+	tsk->task_load = 0;
+#endif
 	if (old_mm) {
 		up_read(&old_mm->mmap_sem);
 		BUG_ON(active_mm != old_mm);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3d..086bd33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -93,7 +93,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
 extern bool migrate_ratelimited(int node);
 #else
 static inline int migrate_misplaced_page(struct page *page, int node)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..bb402d3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,6 +435,7 @@ struct mm_struct {
 	 * a different node than Make PTE Scan Go Now.
 	 */
 	int first_nid;
+	unsigned long *numa_weights;
 #endif
 	struct uprobes_state uprobes_state;
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..2736ec6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
@@ -1505,6 +1506,7 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp */
 	struct callback_head numa_work;
+	unsigned long task_load;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/fork.c b/kernel/fork.c
index 1766d32..14c7aea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,9 @@ void mmput(struct mm_struct *mm)
 		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
+#ifdef CONFIG_NUMA_BALANCING
+		kfree(mm->numa_weights);
+#endif
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
@@ -823,9 +826,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
@@ -844,6 +844,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 	if (mm->binfmt && !try_module_get(mm->binfmt->module))
 		goto free_pt;
 
+#ifdef CONFIG_NUMA_BALANCING
+	mm->first_nid = NUMA_PTE_SCAN_INIT;
+	mm->numa_weights = kzalloc(sizeof(unsigned long) * (nr_node_ids + 1), GFP_KERNEL);
+#endif
 	return mm;
 
 free_pt:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..82f8f79 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->task_load = 0;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
 }
@@ -6136,6 +6137,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..15d71a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,8 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static unsigned long task_h_load(struct task_struct *p);
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * numa task sample period in ms
@@ -791,6 +793,60 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	unsigned long task_load = 0;
+	int curnode = cpu_to_node(cpu_of(rq));
+#ifdef CONFIG_SCHED_AUTOGROUP
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq)];
+	if (!se)
+		return;
+
+	if (cfs_rq->load.weight) {
+		task_load = p->se.load.weight * se->load.weight;
+		task_load /= cfs_rq->load.weight;
+	} else {
+		task_load = 0;
+	}
+#else
+	task_load = p->se.load.weight;
+#endif
+	p->task_load = 0;
+	if (!task_load)
+		return;
+
+	if (p->mm && p->mm->numa_weights) {
+		p->mm->numa_weights[curnode] += task_load;
+		p->mm->numa_weights[nr_node_ids] += task_load;
+	}
+
+	if (p->nr_cpus_allowed != num_online_cpus())
+		rq->pinned_load += task_load;
+	p->task_load = task_load;
+}
+
+static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	unsigned long task_load = p->task_load;
+	int curnode = cpu_to_node(cpu_of(rq));
+
+	p->task_load = 0;
+	if (!task_load)
+		return;
+
+	if (p->mm && p->mm->numa_weights) {
+		p->mm->numa_weights[curnode] -= task_load;
+		p->mm->numa_weights[nr_node_ids] -= task_load;
+	}
+
+	if (p->nr_cpus_allowed != num_online_cpus())
+		rq->pinned_load -= task_load;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq;
@@ -999,6 +1055,12 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+}
+static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1008,8 +1070,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		list_add(&se->group_node, &rq->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1713,6 +1778,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
+	if (entity_is_task(se))
+		account_numa_enqueue(cfs_rq, task_of(se));
 
 	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
@@ -1810,6 +1877,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
+	if (entity_is_task(se))
+		account_numa_dequeue(cfs_rq, task_of(se));
 }
 
 /*
@@ -3292,6 +3361,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int
+check_numa_affinity(struct task_struct *p, int cpu, int prev_cpu)
+{
+	struct mm_struct *mm = p->mm;
+	struct rq *rq = cpu_rq(prev_cpu);
+	int source_node = cpu_to_node(prev_cpu);
+	int target_node = cpu_to_node(cpu);
+
+	if (mm && mm->numa_weights) {
+		unsigned long *weights = mm->numa_weights;
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return 1;
+		}
+	}
+	return 0;
+}
+#else
+static int
+check_numa_affinity(struct task_struct *p, int cpu, int prev_cpu)
+{
+	return 0;
+}
+#endif
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3317,7 +3413,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) && check_numa_affinity(p, cpu, prev_cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3819,6 +3915,7 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	unsigned int		iterations;
 };
 
 /*
@@ -3865,6 +3962,37 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static bool force_migrate(struct lb_env *env, struct task_struct *p)
+{
+	struct mm_struct *mm = p->mm;
+	struct rq *rq = env->src_rq;
+	int source_node = cpu_to_node(env->src_cpu);
+	int target_node = cpu_to_node(env->dst_cpu);
+
+	if (env->sd->nr_balance_failed > env->sd->cache_nice_tries)
+		return true;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return false;
+
+	if (mm && mm->numa_weights) {
+		unsigned long *weights = mm->numa_weights;
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return true;
+		}
+	}
+	return false;
+}
+#else
+static bool force_migrate(struct lb_env *env, struct task_struct *p)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3916,26 +4044,51 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 1) task is cache cold, or
 	 * 2) too many balance attempts have failed.
 	 */
-	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
-	if (!tsk_cache_hot ||
-		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
+	if (tsk_cache_hot) {
+		if (force_migrate(env, p)) {
 #ifdef CONFIG_SCHEDSTATS
-	if (tsk_cache_hot) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
 			schedstat_inc(p, se.statistics.nr_forced_migrations);
-	}
 #endif
-		return 1;
-	}
-
-	if (tsk_cache_hot) {
+			return 1;
+		}
 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
 		return 0;
 	}
 
 	return 1;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int preferred_node(struct task_struct *p, struct lb_env *env)
+{
+	struct mm_struct *mm = p->mm;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return false;
+
+	if (mm && mm->numa_weights) {
+		struct rq *rq = env->src_rq;
+		unsigned long *weights = mm->numa_weights;
+		int target_node = cpu_to_node(env->dst_cpu);
+		int source_node = cpu_to_node(env->src_cpu);
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return 1;
+		}
+	}
+	if (env->iterations)
+		return 1;
+	return 0;
+}
+#else
+static int preferred_node(struct task_struct *p, struct lb_env *env)
+{
+	return 0;
+}
+#endif
+
 /*
  * move_one_task tries to move exactly one task from busiest to this_rq, as
  * part of active balancing operations within "domain".
@@ -3947,7 +4100,11 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+again:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+		if (!preferred_node(p, env))
+			continue;
+
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3955,6 +4112,7 @@ static int move_one_task(struct lb_env *env)
 			continue;
 
 		move_task(p, env);
+
 		/*
 		 * Right now, this is only the second place move_task()
 		 * is called, so we can safely collect move_task()
@@ -3963,11 +4121,12 @@ static int move_one_task(struct lb_env *env)
 		schedstat_inc(env->sd, lb_gained[env->idle]);
 		return 1;
 	}
+	if (!env->iterations++ && env->src_rq->active_balance != 2)
+		goto again;
+
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
@@ -4002,6 +4161,9 @@ static int move_tasks(struct lb_env *env)
 			break;
 		}
 
+		if (!preferred_node(p, env))
+			goto next;
+
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 			goto next;
 
@@ -5005,6 +5167,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.iterations	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -5047,6 +5210,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	env.src_rq    = busiest;
 	env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+	if (sd->flags & SD_NUMA) {
+		if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+			env.iterations = 0;
+	}
+
 	update_h_load(env.src_cpu);
 more_balance:
 	local_irq_save(flags);
@@ -5066,6 +5234,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			goto more_balance;
 		}
 
+		if (!ld_moved && !env.iterations) {
+			env.iterations++;
+			env.loop = 0;
+			env.loop_break = sched_nr_migrate_break;
+			goto more_balance;
+		}
+
 		/*
 		 * some other cpu did the load balance for us.
 		 */
@@ -5152,6 +5327,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (!busiest->active_balance) {
 			busiest->active_balance = 1;
 			busiest->push_cpu = this_cpu;
+#ifdef CONFIG_NUMA_BALANCING
+			busiest->ab_node_load = 0;
+#endif
 			active_balance = 1;
 		}
 		raw_spin_unlock_irqrestore(&busiest->lock, flags);
@@ -5313,8 +5491,14 @@ static int active_load_balance_cpu_stop(void *data)
 		.src_cpu	= busiest_rq->cpu,
 		.src_rq		= busiest_rq,
 		.idle		= CPU_IDLE,
+		.iterations	= 1,
 	};
 
+	if ((sd->flags & SD_NUMA)) {
+		if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+			env.iterations = 0;
+	}
+
 	schedstat_inc(sd, alb_count);
 
 	if (move_one_task(&env))
@@ -5326,6 +5510,9 @@ static int active_load_balance_cpu_stop(void *data)
 	double_unlock_balance(busiest_rq, target_rq);
 out_unlock:
 	busiest_rq->active_balance = 0;
+#ifdef CONFIG_NUMA_BALANCING
+	busiest_rq->ab_node_load = 0;
+#endif
 	raw_spin_unlock_irq(&busiest_rq->lock);
 	return 0;
 }
@@ -5464,6 +5651,59 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int migrate_from_cpu(struct mm_struct *this_mm, int this_cpu, int nid)
+{
+	struct mm_struct *mm;
+	struct rq *rq;
+	int cpu;
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		rq = cpu_rq(cpu);
+		mm = rq->curr->mm;
+
+		if (mm == this_mm) {
+			if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(rq->curr)))
+				return cpu;
+		}
+	}
+	return -1;
+}
+
+static int migrate_from_node(unsigned long *weights, unsigned long load, int nid)
+{
+	unsigned long least_weight = weights[nid];
+	unsigned long node_load;
+	int least_node = -1;
+	int node, cpu;
+
+	for_each_online_node(node) {
+		if (node == nid)
+			continue;
+		if (weights[node] == 0)
+			continue;
+
+		node_load = 0;
+		for_each_cpu(cpu, cpumask_of_node(node)) {
+			node_load += weighted_cpuload(cpu);
+		}
+
+		if (load > node_load) {
+			if (load * nr_node_ids >= node_load * (nr_node_ids + 1))
+				continue;
+			if (weights[node] == least_weight)
+				continue;
+		}
+
+		if (weights[node] <= least_weight) {
+			least_weight = weights[node];
+			least_node = node;
+		}
+	}
+	return least_node;
+}
+#endif
+
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
@@ -5529,6 +5769,76 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 		if (!balance)
 			break;
 	}
+#ifdef CONFIG_NUMA_BALANCING
+	if (!rq->nr_running) {
+		struct mm_struct *prev_mm = NULL;
+		unsigned long load = 0, pinned_load = 0;
+		unsigned long *weights = NULL;
+		int node, nid, dcpu;
+		int this_cpu = -1;
+
+		nid = cpu_to_node(cpu);
+
+		/* Traverse only the allowed CPUs */
+		for_each_cpu(dcpu, cpumask_of_node(nid)) {
+			load += weighted_cpuload(dcpu);
+			pinned_load += cpu_rq(dcpu)->pinned_load;
+		}
+		for_each_cpu(dcpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(dcpu);
+			struct mm_struct *mm = rq->curr->mm;
+
+			if (!mm || !mm->numa_weights)
+				continue;
+
+			weights = mm->numa_weights;
+			if (!weights[nr_node_ids] || !weights[nid])
+				continue;
+
+			if (weights[nid] + pinned_load >= load)
+				break;
+			if (weights[nr_node_ids]/weights[nid] > nr_node_ids)
+				continue;
+
+			if (mm == prev_mm)
+				continue;
+
+			prev_mm = mm;
+			node = migrate_from_node(weights, load, nid);
+			if (node == -1)
+				continue;
+			this_cpu = migrate_from_cpu(mm, cpu, node);
+			if (this_cpu != -1)
+				break;
+		}
+		if (this_cpu != -1) {
+			struct rq *this_rq;
+			unsigned long flags;
+			int active_balance;
+
+			this_rq = cpu_rq(this_cpu);
+			active_balance = 0;
+
+			/*
+			 * ->active_balance synchronizes accesses to
+			 * ->active_balance_work.  Once set, it's cleared
+			 * only after active load balance is finished.
+			 */
+			raw_spin_lock_irqsave(&this_rq->lock, flags);
+			if (!this_rq->active_balance) {
+				this_rq->active_balance = 2;
+				this_rq->push_cpu = cpu;
+				this_rq->ab_node_load = load - pinned_load;
+				active_balance = 1;
+			}
+			raw_spin_unlock_irqrestore(&this_rq->lock, flags);
+
+			if (active_balance) {
+				stop_one_cpu_nowait(this_cpu, active_load_balance_cpu_stop, this_rq, &this_rq->active_balance_work);
+			}
+		}
+	}
+#endif
 	rcu_read_unlock();
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..0011bba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -484,6 +484,10 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long pinned_load;
+	unsigned long ab_node_load;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
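
As a postscript, here is a standalone userspace re-implementation (again,
not kernel code) of the node-selection rule in migrate_from_node() above,
run on fixed sample numbers so the load check is easier to follow.
pick_source_node(), NR_NODES and the sample arrays are made up for this
illustration; only the comparisons mirror the patch.

/*
 * Standalone illustration (not the kernel code) of the node-selection
 * rule in migrate_from_node() above.  pick_source_node(), NR_NODES and
 * the sample arrays are made up for this example.
 */
#include <stdio.h>

#define NR_NODES 2

static int pick_source_node(const unsigned long weights[NR_NODES],
			    const unsigned long node_load[NR_NODES],
			    unsigned long my_load, int my_node)
{
	unsigned long least_weight = weights[my_node];
	int least_node = -1;

	for (int node = 0; node < NR_NODES; node++) {
		/* Skip ourselves and nodes running none of the process's tasks. */
		if (node == my_node || weights[node] == 0)
			continue;

		if (my_load > node_load[node]) {
			/*
			 * Already heavier than that node by a factor of at
			 * least (N+1)/N: pulling more would only worsen the
			 * imbalance, so leave it alone.
			 */
			if (my_load * NR_NODES >= node_load[node] * (NR_NODES + 1))
				continue;
			/*
			 * When we are the heavier side, insist on strictly
			 * fewer buddy tasks than the best candidate so far.
			 */
			if (weights[node] == least_weight)
				continue;
		}

		/* Prefer the node carrying the least buddy weight. */
		if (weights[node] <= least_weight) {
			least_weight = weights[node];
			least_node = node;
		}
	}
	return least_node;
}

int main(void)
{
	/* This process's weight on each node, and each node's total load. */
	unsigned long weights[NR_NODES]   = { 3072, 1024 };
	unsigned long node_load[NR_NODES] = { 4096, 4096 };

	/* Idle cpu on node 0: which node should it pull buddy tasks from? */
	printf("pull from node %d\n",
	       pick_source_node(weights, node_load, node_load[0], 0));
	return 0;
}

With these sample numbers the two nodes carry equal load and node 0
already holds weight 3072 of the process versus 1024 on node 1, so the
idle cpu on node 0 picks node 1 as the node to pull the remaining buddy
tasks from.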