* [PATCH 1/4] sched: make nr_running() return 32-bit
@ 2021-04-22 20:02 Alexey Dobriyan
  2021-04-22 20:02 ` [PATCH 2/4] sched: make nr_iowait() return 32-bit value Alexey Dobriyan
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Alexey Dobriyan @ 2021-04-22 20:02 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel, Alexey Dobriyan

Creating 2**32 tasks is impossible due to futex pid limits and wasteful
anyway. Nobody has done it.

Bring nr_running() into 32-bit world to save on REX prefixes.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/proc/loadavg.c          | 2 +-
 fs/proc/stat.c             | 2 +-
 include/linux/sched/stat.h | 2 +-
 kernel/sched/core.c        | 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 8468baee951d..f32878d9a39f 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -16,7 +16,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
 
 	get_avenrun(avnrun, FIXED_1/200, 0);
 
-	seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
+	seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
 		LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
 		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
 		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index f25e8531fd27..941605de7f9a 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -200,7 +200,7 @@ static int show_stat(struct seq_file *p, void *v)
 		"\nctxt %llu\n"
 		"btime %llu\n"
 		"processes %lu\n"
-		"procs_running %lu\n"
+		"procs_running %u\n"
 		"procs_blocked %lu\n",
 		nr_context_switches(),
 		(unsigned long long)boottime.tv_sec,
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 568286411b43..f742229091ec 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -16,7 +16,7 @@ extern unsigned long total_forks;
 extern int nr_threads;
 DECLARE_PER_CPU(unsigned long, process_counts);
 extern int nr_processes(void);
-extern unsigned long nr_running(void);
+extern unsigned int nr_running(void);
 extern bool single_task_running(void);
 extern unsigned long nr_iowait(void);
 extern unsigned long nr_iowait_cpu(int cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..713ea35cb995 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4331,9 +4331,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 * externally visible scheduler statistics: current number of runnable
 * threads, total number of context switches performed since bootup.
 */
-unsigned long nr_running(void)
+unsigned int nr_running(void)
 {
-	unsigned long i, sum = 0;
+	unsigned int i, sum = 0;
 
 	for_each_online_cpu(i)
 		sum += cpu_rq(i)->nr_running;
-- 
2.30.2

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* [PATCH 2/4] sched: make nr_iowait() return 32-bit value 2021-04-22 20:02 [PATCH 1/4] sched: make nr_running() return 32-bit Alexey Dobriyan @ 2021-04-22 20:02 ` Alexey Dobriyan 2021-05-12 20:01 ` [tip: sched/core] sched: Make " tip-bot2 for Alexey Dobriyan 2021-04-22 20:02 ` [PATCH 3/4] sched: make nr_iowait_cpu() return 32-bit Alexey Dobriyan ` (3 subsequent siblings) 4 siblings, 1 reply; 15+ messages in thread From: Alexey Dobriyan @ 2021-04-22 20:02 UTC (permalink / raw) To: mingo, peterz; +Cc: linux-kernel, Alexey Dobriyan Creating 2**32 tasks to wait in D-state is impossible and wasteful. Return "unsigned int" and save on REX prefixes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> --- fs/proc/stat.c | 2 +- include/linux/sched/stat.h | 2 +- kernel/sched/core.c | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 941605de7f9a..6561a06ef905 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -201,7 +201,7 @@ static int show_stat(struct seq_file *p, void *v) "btime %llu\n" "processes %lu\n" "procs_running %u\n" - "procs_blocked %lu\n", + "procs_blocked %u\n", nr_context_switches(), (unsigned long long)boottime.tv_sec, total_forks, diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h index f742229091ec..8a5b27ae7937 100644 --- a/include/linux/sched/stat.h +++ b/include/linux/sched/stat.h @@ -18,7 +18,7 @@ DECLARE_PER_CPU(unsigned long, process_counts); extern int nr_processes(void); extern unsigned int nr_running(void); extern bool single_task_running(void); -extern unsigned long nr_iowait(void); +extern unsigned int nr_iowait(void); extern unsigned long nr_iowait_cpu(int cpu); static inline int sched_info_on(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 713ea35cb995..2cc8d81cdc75 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4413,9 +4413,9 @@ unsigned long nr_iowait_cpu(int cpu) * Task CPU affinities can make all that even more 'interesting'. 
*/ -unsigned long nr_iowait(void) +unsigned int nr_iowait(void) { - unsigned long i, sum = 0; + unsigned int i, sum = 0; for_each_possible_cpu(i) sum += nr_iowait_cpu(i); -- 2.30.2 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [tip: sched/core] sched: Make nr_iowait() return 32-bit value 2021-04-22 20:02 ` [PATCH 2/4] sched: make nr_iowait() return 32-bit value Alexey Dobriyan @ 2021-05-12 20:01 ` tip-bot2 for Alexey Dobriyan 0 siblings, 0 replies; 15+ messages in thread From: tip-bot2 for Alexey Dobriyan @ 2021-05-12 20:01 UTC (permalink / raw) To: linux-tip-commits; +Cc: Alexey Dobriyan, Ingo Molnar, x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: 9745516841a55c77163a5d549bce1374d776df54 Gitweb: https://git.kernel.org/tip/9745516841a55c77163a5d549bce1374d776df54 Author: Alexey Dobriyan <adobriyan@gmail.com> AuthorDate: Thu, 22 Apr 2021 23:02:26 +03:00 Committer: Ingo Molnar <mingo@kernel.org> CommitterDate: Wed, 12 May 2021 21:34:15 +02:00 sched: Make nr_iowait() return 32-bit value Creating 2**32 tasks to wait in D-state is impossible and wasteful. Return "unsigned int" and save on REX prefixes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-2-adobriyan@gmail.com --- fs/proc/stat.c | 2 +- include/linux/sched/stat.h | 2 +- kernel/sched/core.c | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 941605d..6561a06 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -201,7 +201,7 @@ static int show_stat(struct seq_file *p, void *v) "btime %llu\n" "processes %lu\n" "procs_running %u\n" - "procs_blocked %lu\n", + "procs_blocked %u\n", nr_context_switches(), (unsigned long long)boottime.tv_sec, total_forks, diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h index 73606b3..81d9b53 100644 --- a/include/linux/sched/stat.h +++ b/include/linux/sched/stat.h @@ -19,7 +19,7 @@ DECLARE_PER_CPU(unsigned long, process_counts); extern int nr_processes(void); extern unsigned int nr_running(void); extern bool single_task_running(void); -extern unsigned long nr_iowait(void); 
+extern unsigned int nr_iowait(void); extern unsigned long nr_iowait_cpu(int cpu); static inline int sched_info_on(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2c6cdb0..fadf2bf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4774,9 +4774,9 @@ unsigned long nr_iowait_cpu(int cpu) * Task CPU affinities can make all that even more 'interesting'. */ -unsigned long nr_iowait(void) +unsigned int nr_iowait(void) { - unsigned long i, sum = 0; + unsigned int i, sum = 0; for_each_possible_cpu(i) sum += nr_iowait_cpu(i); ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 3/4] sched: make nr_iowait_cpu() return 32-bit 2021-04-22 20:02 [PATCH 1/4] sched: make nr_running() return 32-bit Alexey Dobriyan 2021-04-22 20:02 ` [PATCH 2/4] sched: make nr_iowait() return 32-bit value Alexey Dobriyan @ 2021-04-22 20:02 ` Alexey Dobriyan 2021-05-12 20:01 ` [tip: sched/core] sched: Make nr_iowait_cpu() return 32-bit value tip-bot2 for Alexey Dobriyan 2021-04-22 20:02 ` [PATCH 4/4] sched: make multiple runqueue task counters 32-bit Alexey Dobriyan ` (2 subsequent siblings) 4 siblings, 1 reply; 15+ messages in thread From: Alexey Dobriyan @ 2021-04-22 20:02 UTC (permalink / raw) To: mingo, peterz; +Cc: linux-kernel, Alexey Dobriyan Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> --- drivers/cpuidle/governors/menu.c | 6 +++--- include/linux/sched/stat.h | 2 +- kernel/sched/core.c | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index b0a7ad566081..ddaaa36af290 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -117,7 +117,7 @@ struct menu_device { int interval_ptr; }; -static inline int which_bucket(u64 duration_ns, unsigned long nr_iowaiters) +static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters) { int bucket = 0; @@ -150,7 +150,7 @@ static inline int which_bucket(u64 duration_ns, unsigned long nr_iowaiters) * to be, the higher this multiplier, and thus the higher * the barrier to go to an expensive C state. */ -static inline int performance_multiplier(unsigned long nr_iowaiters) +static inline int performance_multiplier(unsigned int nr_iowaiters) { /* for IO wait tasks (per cpu!) 
we add 10x each */ return 1 + 10 * nr_iowaiters; @@ -270,7 +270,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, unsigned int predicted_us; u64 predicted_ns; u64 interactivity_req; - unsigned long nr_iowaiters; + unsigned int nr_iowaiters; ktime_t delta_next; int i, idx; diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h index 8a5b27ae7937..0319bc227c4d 100644 --- a/include/linux/sched/stat.h +++ b/include/linux/sched/stat.h @@ -19,7 +19,7 @@ extern int nr_processes(void); extern unsigned int nr_running(void); extern bool single_task_running(void); extern unsigned int nr_iowait(void); -extern unsigned long nr_iowait_cpu(int cpu); +extern unsigned int nr_iowait_cpu(int cpu); static inline int sched_info_on(void) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2cc8d81cdc75..c2012f0ea7a0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4378,7 +4378,7 @@ unsigned long long nr_context_switches(void) * it does become runnable. */ -unsigned long nr_iowait_cpu(int cpu) +unsigned int nr_iowait_cpu(int cpu) { return atomic_read(&cpu_rq(cpu)->nr_iowait); } -- 2.30.2 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [tip: sched/core] sched: Make nr_iowait_cpu() return 32-bit value 2021-04-22 20:02 ` [PATCH 3/4] sched: make nr_iowait_cpu() return 32-bit Alexey Dobriyan @ 2021-05-12 20:01 ` tip-bot2 for Alexey Dobriyan 0 siblings, 0 replies; 15+ messages in thread From: tip-bot2 for Alexey Dobriyan @ 2021-05-12 20:01 UTC (permalink / raw) To: linux-tip-commits; +Cc: Alexey Dobriyan, Ingo Molnar, x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: 8fc2858e572ce761bffcade81a42ac72005e76f9 Gitweb: https://git.kernel.org/tip/8fc2858e572ce761bffcade81a42ac72005e76f9 Author: Alexey Dobriyan <adobriyan@gmail.com> AuthorDate: Thu, 22 Apr 2021 23:02:27 +03:00 Committer: Ingo Molnar <mingo@kernel.org> CommitterDate: Wed, 12 May 2021 21:34:16 +02:00 sched: Make nr_iowait_cpu() return 32-bit value Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-3-adobriyan@gmail.com --- drivers/cpuidle/governors/menu.c | 6 +++--- include/linux/sched/stat.h | 2 +- kernel/sched/core.c | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index c3aa8d6..2e56704 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -117,7 +117,7 @@ struct menu_device { int interval_ptr; }; -static inline int which_bucket(u64 duration_ns, unsigned long nr_iowaiters) +static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters) { int bucket = 0; @@ -150,7 +150,7 @@ static inline int which_bucket(u64 duration_ns, unsigned long nr_iowaiters) * to be, the higher this multiplier, and thus the higher * the barrier to go to an expensive C state. 
*/ -static inline int performance_multiplier(unsigned long nr_iowaiters) +static inline int performance_multiplier(unsigned int nr_iowaiters) { /* for IO wait tasks (per cpu!) we add 10x each */ return 1 + 10 * nr_iowaiters; @@ -270,7 +270,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, unsigned int predicted_us; u64 predicted_ns; u64 interactivity_req; - unsigned long nr_iowaiters; + unsigned int nr_iowaiters; ktime_t delta, delta_tick; int i, idx; diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h index 81d9b53..0108a38 100644 --- a/include/linux/sched/stat.h +++ b/include/linux/sched/stat.h @@ -20,7 +20,7 @@ extern int nr_processes(void); extern unsigned int nr_running(void); extern bool single_task_running(void); extern unsigned int nr_iowait(void); -extern unsigned long nr_iowait_cpu(int cpu); +extern unsigned int nr_iowait_cpu(int cpu); static inline int sched_info_on(void) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fadf2bf..24fd795 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4739,7 +4739,7 @@ unsigned long long nr_context_switches(void) * it does become runnable. */ -unsigned long nr_iowait_cpu(int cpu) +unsigned int nr_iowait_cpu(int cpu) { return atomic_read(&cpu_rq(cpu)->nr_iowait); } ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 4/4] sched: make multiple runqueue task counters 32-bit
  2021-04-22 20:02 [PATCH 1/4] sched: make nr_running() return 32-bit Alexey Dobriyan
  2021-04-22 20:02 ` [PATCH 2/4] sched: make nr_iowait() return 32-bit value Alexey Dobriyan
  2021-04-22 20:02 ` [PATCH 3/4] sched: make nr_iowait_cpu() return 32-bit Alexey Dobriyan
@ 2021-04-22 20:02 ` Alexey Dobriyan
  2021-05-12 19:36   ` Ingo Molnar
  2021-05-12 20:01   ` [tip: sched/core] sched: Make " tip-bot2 for Alexey Dobriyan
  2021-05-12 20:01 ` [tip: sched/core] sched: Make nr_running() return 32-bit value tip-bot2 for Alexey Dobriyan
  2021-05-12 23:58 ` [PATCH 1/4] sched: make nr_running() return 32-bit Thomas Gleixner
  4 siblings, 2 replies; 15+ messages in thread
From: Alexey Dobriyan @ 2021-04-22 20:02 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel, Alexey Dobriyan

Make

	struct dl_rq::dl_nr_migratory
	struct dl_rq::dl_nr_running

	struct rt_rq::rt_nr_boosted
	struct rt_rq::rt_nr_migratory
	struct rt_rq::rt_nr_total

	struct rq::nr_uninterruptible

32-bit.

If total number of tasks can't exceed 2**32 (and less due to futex pid
limits), then per-runqueue counters can't as well.

This patchset has been sponsored by REX Prefix Eradication Society.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 kernel/sched/loadavg.c |  2 +-
 kernel/sched/sched.h   | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index d2a655643a02..aef8072cfebe 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -81,7 +81,7 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	long nr_active, delta = 0;
 
 	nr_active = this_rq->nr_running - adjust;
-	nr_active += (long)this_rq->nr_uninterruptible;
+	nr_active += (int)this_rq->nr_uninterruptible;
 
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10a1522b1e30..730c81a54ed1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -622,8 +622,8 @@ struct rt_rq {
 	} highest_prio;
 #endif
 #ifdef CONFIG_SMP
-	unsigned long		rt_nr_migratory;
-	unsigned long		rt_nr_total;
+	unsigned int		rt_nr_migratory;
+	unsigned int		rt_nr_total;
 	int			overloaded;
 	struct plist_head	pushable_tasks;
 
@@ -637,7 +637,7 @@ struct rt_rq {
 	raw_spinlock_t		rt_runtime_lock;
 
#ifdef CONFIG_RT_GROUP_SCHED
-	unsigned long		rt_nr_boosted;
+	unsigned int		rt_nr_boosted;
 
 	struct rq		*rq;
 	struct task_group	*tg;
@@ -654,7 +654,7 @@ struct dl_rq {
 	/* runqueue is an rbtree, ordered by deadline */
 	struct rb_root_cached	root;
 
-	unsigned long		dl_nr_running;
+	unsigned int		dl_nr_running;
 
 #ifdef CONFIG_SMP
 	/*
@@ -668,7 +668,7 @@ struct dl_rq {
 		u64		next;
 	} earliest_dl;
 
-	unsigned long		dl_nr_migratory;
+	unsigned int		dl_nr_migratory;
 	int			overloaded;
 
 	/*
@@ -946,7 +946,7 @@ struct rq {
 	 * one CPU and if it got migrated afterwards it may decrease
 	 * it on another CPU. Always updated under the runqueue lock:
 	 */
-	unsigned long		nr_uninterruptible;
+	unsigned int		nr_uninterruptible;
 
 	struct task_struct __rcu	*curr;
 	struct task_struct	*idle;
-- 
2.30.2

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] sched: make multiple runqueue task counters 32-bit
  2021-04-22 20:02 ` [PATCH 4/4] sched: make multiple runqueue task counters 32-bit Alexey Dobriyan
@ 2021-05-12 19:36   ` Ingo Molnar
  2021-05-12 20:01   ` [tip: sched/core] sched: Make " tip-bot2 for Alexey Dobriyan
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2021-05-12 19:36 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: mingo, peterz, linux-kernel

* Alexey Dobriyan <adobriyan@gmail.com> wrote:

> Make
>
>	struct dl_rq::dl_nr_migratory
>	struct dl_rq::dl_nr_running
>
>	struct rt_rq::rt_nr_boosted
>	struct rt_rq::rt_nr_migratory
>	struct rt_rq::rt_nr_total
>
>	struct rq::nr_uninterruptible
>
> 32-bit.
>
> If total number of tasks can't exceed 2**32 (and less due to futex pid
> limits), then per-runqueue counters can't as well.

Applied to tip:sched/core, thanks!

There was a bit of a conflict with recent changes in
drivers/cpuidle/governors/menu.c, but I fixed it up, hopefully correctly.

> This patchset has been sponsored by REX Prefix Eradication Society.

A patchset with such an impeccable recommendation letter is impossible
to resist.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [tip: sched/core] sched: Make multiple runqueue task counters 32-bit 2021-04-22 20:02 ` [PATCH 4/4] sched: make multiple runqueue task counters 32-bit Alexey Dobriyan 2021-05-12 19:36 ` Ingo Molnar @ 2021-05-12 20:01 ` tip-bot2 for Alexey Dobriyan 1 sibling, 0 replies; 15+ messages in thread From: tip-bot2 for Alexey Dobriyan @ 2021-05-12 20:01 UTC (permalink / raw) To: linux-tip-commits; +Cc: Alexey Dobriyan, Ingo Molnar, x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: e6fe3f422be128b7d65de607f6ae67bedc55f0ca Gitweb: https://git.kernel.org/tip/e6fe3f422be128b7d65de607f6ae67bedc55f0ca Author: Alexey Dobriyan <adobriyan@gmail.com> AuthorDate: Thu, 22 Apr 2021 23:02:28 +03:00 Committer: Ingo Molnar <mingo@kernel.org> CommitterDate: Wed, 12 May 2021 21:34:17 +02:00 sched: Make multiple runqueue task counters 32-bit Make: struct dl_rq::dl_nr_migratory struct dl_rq::dl_nr_running struct rt_rq::rt_nr_boosted struct rt_rq::rt_nr_migratory struct rt_rq::rt_nr_total struct rq::nr_uninterruptible 32-bit. If total number of tasks can't exceed 2**32 (and less due to futex pid limits), then per-runqueue counters can't as well. This patchset has been sponsored by REX Prefix Eradication Society. 
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-4-adobriyan@gmail.com --- kernel/sched/loadavg.c | 2 +- kernel/sched/sched.h | 12 ++++++------ 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 1c79896..954b229 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -81,7 +81,7 @@ long calc_load_fold_active(struct rq *this_rq, long adjust) long nr_active, delta = 0; nr_active = this_rq->nr_running - adjust; - nr_active += (long)this_rq->nr_uninterruptible; + nr_active += (int)this_rq->nr_uninterruptible; if (nr_active != this_rq->calc_load_active) { delta = nr_active - this_rq->calc_load_active; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 904c52b..8f0194c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -636,8 +636,8 @@ struct rt_rq { } highest_prio; #endif #ifdef CONFIG_SMP - unsigned long rt_nr_migratory; - unsigned long rt_nr_total; + unsigned int rt_nr_migratory; + unsigned int rt_nr_total; int overloaded; struct plist_head pushable_tasks; @@ -651,7 +651,7 @@ struct rt_rq { raw_spinlock_t rt_runtime_lock; #ifdef CONFIG_RT_GROUP_SCHED - unsigned long rt_nr_boosted; + unsigned int rt_nr_boosted; struct rq *rq; struct task_group *tg; @@ -668,7 +668,7 @@ struct dl_rq { /* runqueue is an rbtree, ordered by deadline */ struct rb_root_cached root; - unsigned long dl_nr_running; + unsigned int dl_nr_running; #ifdef CONFIG_SMP /* @@ -682,7 +682,7 @@ struct dl_rq { u64 next; } earliest_dl; - unsigned long dl_nr_migratory; + unsigned int dl_nr_migratory; int overloaded; /* @@ -960,7 +960,7 @@ struct rq { * one CPU and if it got migrated afterwards it may decrease * it on another CPU. 
Always updated under the runqueue lock: */ - unsigned long nr_uninterruptible; + unsigned int nr_uninterruptible; struct task_struct __rcu *curr; struct task_struct *idle; ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [tip: sched/core] sched: Make nr_running() return 32-bit value 2021-04-22 20:02 [PATCH 1/4] sched: make nr_running() return 32-bit Alexey Dobriyan ` (2 preceding siblings ...) 2021-04-22 20:02 ` [PATCH 4/4] sched: make multiple runqueue task counters 32-bit Alexey Dobriyan @ 2021-05-12 20:01 ` tip-bot2 for Alexey Dobriyan 2021-05-12 23:58 ` [PATCH 1/4] sched: make nr_running() return 32-bit Thomas Gleixner 4 siblings, 0 replies; 15+ messages in thread From: tip-bot2 for Alexey Dobriyan @ 2021-05-12 20:01 UTC (permalink / raw) To: linux-tip-commits; +Cc: Alexey Dobriyan, Ingo Molnar, x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: 01aee8fd7fb23049e2b52abadbe1f7b5e94a52d2 Gitweb: https://git.kernel.org/tip/01aee8fd7fb23049e2b52abadbe1f7b5e94a52d2 Author: Alexey Dobriyan <adobriyan@gmail.com> AuthorDate: Thu, 22 Apr 2021 23:02:25 +03:00 Committer: Ingo Molnar <mingo@kernel.org> CommitterDate: Wed, 12 May 2021 21:34:14 +02:00 sched: Make nr_running() return 32-bit value Creating 2**32 tasks is impossible due to futex pid limits and wasteful anyway. Nobody has done it. Bring nr_running() into 32-bit world to save on REX prefixes. 
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-1-adobriyan@gmail.com --- fs/proc/loadavg.c | 2 +- fs/proc/stat.c | 2 +- include/linux/sched/stat.h | 2 +- kernel/sched/core.c | 4 ++-- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c index 8468bae..f32878d 100644 --- a/fs/proc/loadavg.c +++ b/fs/proc/loadavg.c @@ -16,7 +16,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v) get_avenrun(avnrun, FIXED_1/200, 0); - seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n", + seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n", LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]), LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]), LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]), diff --git a/fs/proc/stat.c b/fs/proc/stat.c index f25e853..941605d 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -200,7 +200,7 @@ static int show_stat(struct seq_file *p, void *v) "\nctxt %llu\n" "btime %llu\n" "processes %lu\n" - "procs_running %lu\n" + "procs_running %u\n" "procs_blocked %lu\n", nr_context_switches(), (unsigned long long)boottime.tv_sec, diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h index 939c3ec..73606b3 100644 --- a/include/linux/sched/stat.h +++ b/include/linux/sched/stat.h @@ -17,7 +17,7 @@ extern unsigned long total_forks; extern int nr_threads; DECLARE_PER_CPU(unsigned long, process_counts); extern int nr_processes(void); -extern unsigned long nr_running(void); +extern unsigned int nr_running(void); extern bool single_task_running(void); extern unsigned long nr_iowait(void); extern unsigned long nr_iowait_cpu(int cpu); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ac8882d..2c6cdb0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4692,9 +4692,9 @@ context_switch(struct rq *rq, struct task_struct *prev, * externally visible scheduler statistics: current number of 
runnable * threads, total number of context switches performed since bootup. */ -unsigned long nr_running(void) +unsigned int nr_running(void) { - unsigned long i, sum = 0; + unsigned int i, sum = 0; for_each_online_cpu(i) sum += cpu_rq(i)->nr_running; ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] sched: make nr_running() return 32-bit
  2021-04-22 20:02 [PATCH 1/4] sched: make nr_running() return 32-bit Alexey Dobriyan
                   ` (3 preceding siblings ...)
  2021-05-12 20:01 ` [tip: sched/core] sched: Make nr_running() return 32-bit value tip-bot2 for Alexey Dobriyan
@ 2021-05-12 23:58 ` Thomas Gleixner
  2021-05-13  7:23   ` Alexey Dobriyan
  2021-05-13  9:58   ` Ingo Molnar
  4 siblings, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2021-05-12 23:58 UTC (permalink / raw)
  To: Alexey Dobriyan, mingo, peterz; +Cc: linux-kernel, Alexey Dobriyan, x86

Alexey,

On Thu, Apr 22 2021 at 23:02, Alexey Dobriyan wrote:
> Creating 2**32 tasks is impossible due to futex pid limits and wasteful
> anyway. Nobody has done it.

this whole pile lacks useful numbers. What's the actual benefit of that
churn?

Just with the default config for one of my reference machines:

     text	   data	    bss	     dec	    hex	filename
 16679864	6627950	1671296	24979110	17d26a6	../build/vmlinux-before
 16679894	6627950	1671296	24979140	17d26c4	../build/vmlinux-after
------------------------------------------------------------------------
      +30

I'm truly impressed by the massive savings of this change and I'm even
more impressed by the justification:

> Bring nr_running() into 32-bit world to save on REX prefixes.

Aside from the obvious useless churn, REX prefixes are universally true
for all architectures, right? There is a world outside x86 ...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] sched: make nr_running() return 32-bit
  2021-05-12 23:58 ` [PATCH 1/4] sched: make nr_running() return 32-bit Thomas Gleixner
@ 2021-05-13  7:23   ` Alexey Dobriyan
  2021-05-13  9:58   ` Ingo Molnar
  1 sibling, 0 replies; 15+ messages in thread
From: Alexey Dobriyan @ 2021-05-13 7:23 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: mingo, peterz, linux-kernel, x86

On Thu, May 13, 2021 at 01:58:16AM +0200, Thomas Gleixner wrote:
> Alexey,
>
> On Thu, Apr 22 2021 at 23:02, Alexey Dobriyan wrote:
> > Creating 2**32 tasks is impossible due to futex pid limits and wasteful
> > anyway. Nobody has done it.
>
> this whole pile lacks useful numbers. What's the actual benefit of that
> churn?

The long term goal is to use 32-bit data more. People will see it in
core kernel and copy everywhere else.

> Just with the default config for one of my reference machines:
>
>      text	   data	    bss	     dec	    hex	filename
>  16679864	6627950	1671296	24979110	17d26a6	../build/vmlinux-before
>  16679894	6627950	1671296	24979140	17d26c4	../build/vmlinux-after
> ------------------------------------------------------------------------
>       +30
>
> I'm truly impressed by the massive savings of this change and I'm even
> more impressed by the justification:
>
> > Bring nr_running() into 32-bit world to save on REX prefixes.

I collected numbers initially but then stopped because no one cared and
they can be config and arch dependent.

> Aside from the obvious useless churn,

oh... Sometimes I think churn is the whole point.

> REX prefixes are universally true for
> all architectures, right? There is a world outside x86 ...

In general, 32-bitness is preferred for code generation. 32-bit RISCs
naturally prefer 32-bit. 64-bit RISCs don't care because they remember
their 32-bit roots and have the necessary 32-bit fixed-width(!)
instructions. x86_64 is the only arch where going 64-bit generally adds
more bytes to the instruction stream.

Effects can be smudged by compilers of course; in this case, by the
percpu stuff. That "unsigned int i" is a mistake. Proper diff looks
like this:

-ffffffff811115fa:	8b 44 18 04		mov    eax,DWORD PTR [rax+rbx*1+0x4]
-ffffffff811115fe:	49 01 c4		add    r12,rax
+ffffffff811115fa:	44 03 64 18 04		add    r12d,DWORD PTR [rax+rbx*1+0x4]

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4348,9 +4348,10 @@ context_switch(struct rq *rq, struct task_struct *prev,
 * externally visible scheduler statistics: current number of runnable
 * threads, total number of context switches performed since bootup.
 */
-unsigned long nr_running(void)
+unsigned int nr_running(void)
 {
-	unsigned long i, sum = 0;
+	unsigned int sum = 0;
+	unsigned long i;
 
 	for_each_online_cpu(i)
 		sum += cpu_rq(i)->nr_running;

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] sched: make nr_running() return 32-bit 2021-05-12 23:58 ` [PATCH 1/4] sched: make nr_running() return 32-bit Thomas Gleixner 2021-05-13 7:23 ` Alexey Dobriyan @ 2021-05-13 9:58 ` Ingo Molnar 2021-05-13 21:22 ` Alexey Dobriyan ` (2 more replies) 1 sibling, 3 replies; 15+ messages in thread From: Ingo Molnar @ 2021-05-13 9:58 UTC (permalink / raw) To: Thomas Gleixner Cc: Alexey Dobriyan, mingo, Borislav Petkov, linux-kernel, x86, Linus Torvalds, Greg Kroah-Hartman, Peter Zijlstra * Thomas Gleixner <tglx@linutronix.de> wrote: > Alexey, > > On Thu, Apr 22 2021 at 23:02, Alexey Dobriyan wrote: > > Creating 2**32 tasks is impossible due to futex pid limits and wasteful > > anyway. Nobody has done it. > > > > this whole pile lacks useful numbers. What's the actual benefit of that > churn? I applied the 4 patch series from Alexey Dobriyan because they offer four distinct technical advantages to the scheduler code: - Shorter instructions generated in nr_running(), nr_iowait(), nr_iowait_cpu() due to losing the REX prefix. - Shorter instructions generated by users of these 3 functions as well. - Tighter data packing and structure size reduction in 'struct rt_rq', 'struct dl_rq' and 'struct rq', due to 8-byte 'long' fields shrinking to 4-byte 'int' based fields. - Together these 4 patches clean up all derivative uses of the rq::nr_running base type, which is already 'unsigned int' and that's pretty fundamental given our PID ABI limits. Having type mismatch where we use 64-bit data types for certain APIs while 32-bit data types for others is inconsistent crap I wouldn't accept if it was submitted as new code. 
As to the numbers: The data structure size improvements are IMO obvious, and they are also measurable, here's the before/after Pahole comparison of 'struct rt_rq': --- pahole.struct.rt_rq.before 2021-05-13 11:40:53.207077908 +0200 +++ pahole.struct.rt_rq.after 2021-05-13 11:41:42.257385897 +0200 @@ -7,22 +7,22 @@ struct rt_rq { int curr; /* 1624 4 */ int next; /* 1628 4 */ } highest_prio; /* 1624 8 */ - long unsigned int rt_nr_migratory; /* 1632 8 */ - long unsigned int rt_nr_total; /* 1640 8 */ - int overloaded; /* 1648 4 */ + unsigned int rt_nr_migratory; /* 1632 4 */ + unsigned int rt_nr_total; /* 1636 4 */ + int overloaded; /* 1640 4 */ /* XXX 4 bytes hole, try to pack */ - struct plist_head pushable_tasks; /* 1656 16 */ - /* --- cacheline 26 boundary (1664 bytes) was 8 bytes ago --- */ - int rt_queued; /* 1672 4 */ - int rt_throttled; /* 1676 4 */ - u64 rt_time; /* 1680 8 */ - u64 rt_runtime; /* 1688 8 */ - raw_spinlock_t rt_runtime_lock; /* 1696 4 */ + struct plist_head pushable_tasks; /* 1648 16 */ + /* --- cacheline 26 boundary (1664 bytes) --- */ + int rt_queued; /* 1664 4 */ + int rt_throttled; /* 1668 4 */ + u64 rt_time; /* 1672 8 */ + u64 rt_runtime; /* 1680 8 */ + raw_spinlock_t rt_runtime_lock; /* 1688 4 */ - /* size: 1704, cachelines: 27, members: 13 */ - /* sum members: 1696, holes: 1, sum holes: 4 */ + /* size: 1696, cachelines: 27, members: 13 */ + /* sum members: 1688, holes: 1, sum holes: 4 */ /* padding: 4 */ - /* last cacheline: 40 bytes */ + /* last cacheline: 32 bytes */ 'struct rt_rq' got shrunk from 1704 bytes to 1696 bytes, an 8 byte reduction. 
'struct dl_rq' shrunk by 8 bytes:

- /* size: 104, cachelines: 2, members: 10 */
- /* sum members: 100, holes: 1, sum holes: 4 */
- /* last cacheline: 40 bytes */
+ /* size: 96, cachelines: 2, members: 10 */
+ /* sum members: 92, holes: 1, sum holes: 4 */
+ /* last cacheline: 32 bytes */

'struct rq', which embeds both rt_rq and dl_rq, got 20 bytes smaller:

- /* sum members: 2967, holes: 10, sum holes: 137 */
+ /* sum members: 2947, holes: 11, sum holes: 157 */

Side note: there are now 20 bytes of new padding holes, which could possibly be coalesced a bit more by reordering members sensibly, resulting in even more data footprint savings. (But those should be separate changes and only for fields we truly care about from a performance POV.)

As to code generation improvements:

> Just with the default config for one of my reference machines:
>
>     text    data     bss      dec      hex filename
> 16679864 6627950 1671296 24979110 17d26a6 ../build/vmlinux-before
> 16679894 6627950 1671296 24979140 17d26c4 ../build/vmlinux-after
> ------------------------------------------------------------------------
> +30

Using '/usr/bin/size' to compare before/after generated code is the wrong way to measure code generation improvements for smaller changes, because default function alignment creates a reserve of 'padding' bytes at the end of most functions. You have to look at the low-level generated assembly directly.

For example, the nr_iowait_cpu() commit, with before/after generated code of the callers of the function:

  9745516841a5: ("sched: Make nr_iowait() return 32-bit value")

  ffffffffxxxxxxxx: e8 49 8e fb ff        call ffffffffxxxxxxxx <nr_iowait_cpu>
- ffffffffxxxxxxxx: 48 85 c0              test %rax,%rax

  ffffffffxxxxxxxx: e8 64 8e fb ff        call ffffffffxxxxxxxx <nr_iowait_cpu>
+ ffffffffxxxxxxxx: 85 c0                 test %eax,%eax

Note how the 'test %rax,%rax' lost the 0x48 64-bit REX prefix and the generated instruction got one byte shorter, from "48 85 c0" to "85 c0".
( Note, my asm generation scripts filter out some of the noise to make it easier to diff generated asm, hence the ffffffffxxxxxxxx placeholder. ) The nr_iowait() function itself got shorter by two bytes as well, due to: Before: ffffffffxxxxxxxx <nr_iowait>: ffffffffxxxxxxxx: 41 54 push %r12 ffffffffxxxxxxxx: 48 c7 c7 ff ff ff ff mov $0xffffffffxxxxxxxx,%rdi ffffffffxxxxxxxx: 45 31 e4 xor %r12d,%r12d ffffffffxxxxxxxx: 55 push %rbp ffffffffxxxxxxxx: 8b 2d 01 ea 5d 01 mov 0x15dea01(%rip),%ebp # ffffffffxxxxxxxx <nr_cpu_ids> ffffffffxxxxxxxx: 53 push %rbx ffffffffxxxxxxxx: 48 c7 c3 c0 95 02 00 mov $0x295c0,%rbx ffffffffxxxxxxxx: eb 17 jmp ffffffffxxxxxxxx <nr_iowait+0x34> ffffffffxxxxxxxx: 48 98 cltq ffffffffxxxxxxxx: 48 8b 14 c5 a0 c6 2e mov -0x7dd13960(,%rax,8),%rdx ffffffffxxxxxxxx: 82 ffffffffxxxxxxxx: 48 01 da add %rbx,%rdx ffffffffxxxxxxxx: 48 63 82 98 09 00 00 movslq 0x998(%rdx),%rax ffffffffxxxxxxxx: 49 01 c4 add %rax,%r12 ffffffffxxxxxxxx: 48 c7 c6 18 72 67 82 mov $0xffffffffxxxxxxxx,%rsi ffffffffxxxxxxxx: e8 80 46 39 00 call ffffffffxxxxxxxx <cpumask_next> ffffffffxxxxxxxx: 89 c7 mov %eax,%edi ffffffffxxxxxxxx: 39 e8 cmp %ebp,%eax ffffffffxxxxxxxx: 72 d7 jb ffffffffxxxxxxxx <nr_iowait+0x1d> ffffffffxxxxxxxx: 4c 89 e0 mov %r12,%rax ffffffffxxxxxxxx: 5b pop %rbx ffffffffxxxxxxxx: 5d pop %rbp ffffffffxxxxxxxx: 41 5c pop %r12 ffffffffxxxxxxxx: c3 ret After: ffffffffxxxxxxxx <nr_iowait>: ffffffffxxxxxxxx: 41 54 push %r12 ffffffffxxxxxxxx: bf ff ff ff ff mov $0xffffffff,%edi ffffffffxxxxxxxx: 45 31 e4 xor %r12d,%r12d ffffffffxxxxxxxx: 55 push %rbp ffffffffxxxxxxxx: 8b 2d 03 ea 5d 01 mov 0x15dea03(%rip),%ebp # ffffffffxxxxxxxx <nr_cpu_ids> ffffffffxxxxxxxx: 53 push %rbx ffffffffxxxxxxxx: 48 c7 c3 c0 95 02 00 mov $0x295c0,%rbx ffffffffxxxxxxxx: eb 17 jmp ffffffffxxxxxxxx <nr_iowait+0x32> ffffffffxxxxxxxx: 48 63 c7 movslq %edi,%rax ffffffffxxxxxxxx: 48 8b 14 c5 a0 c6 2e mov -0x7dd13960(,%rax,8),%rdx ffffffffxxxxxxxx: 82 ffffffffxxxxxxxx: 48 01 da add %rbx,%rdx 
ffffffffxxxxxxxx: 8b 82 98 09 00 00 mov 0x998(%rdx),%eax ffffffffxxxxxxxx: 41 01 c4 add %eax,%r12d ffffffffxxxxxxxx: 48 c7 c6 18 72 67 82 mov $0xffffffffxxxxxxxx,%rsi ffffffffxxxxxxxx: e8 a2 46 39 00 call ffffffffxxxxxxxx <cpumask_next> ffffffffxxxxxxxx: 89 c7 mov %eax,%edi ffffffffxxxxxxxx: 39 c5 cmp %eax,%ebp ffffffffxxxxxxxx: 77 d7 ja ffffffffxxxxxxxx <nr_iowait+0x1b> ffffffffxxxxxxxx: 44 89 e0 mov %r12d,%eax ffffffffxxxxxxxx: 5b pop %rbx ffffffffxxxxxxxx: 5d pop %rbp ffffffffxxxxxxxx: 41 5c pop %r12 ffffffffxxxxxxxx: c3 ret The size of nr_iowait() shrunk from 78 bytes to 76 bytes. Or the other commit: 01aee8fd7fb2: ("sched: Make nr_running() return 32-bit value") ffffffffxxxxxxxx: e8 d9 24 e8 ff call ffffffffxxxxxxxx <nr_running> ffffffffxxxxxxxx: 4c 8b 05 cc 92 70 01 mov 0x17092cc(%rip),%r8 # ffffffffxxxxxxxx <total_forks> ffffffffxxxxxxxx: 4c 8b 64 24 50 mov 0x50(%rsp),%r12 - ffffffffxxxxxxxx: 48 89 44 24 08 mov %rax,0x8(%rsp) ffffffffxxxxxxxx: 4c 89 04 24 mov %r8,(%rsp) ffffffffxxxxxxxx: e8 f9 24 e8 ff call ffffffffxxxxxxxx <nr_running> ffffffffxxxxxxxx: 4c 8b 05 ed 92 70 01 mov 0x17092ed(%rip),%r8 # ffffffffxxxxxxxx <total_forks> ffffffffxxxxxxxx: 4c 8b 64 24 50 mov 0x50(%rsp),%r12 + ffffffffxxxxxxxx: 89 44 24 08 mov %eax,0x8(%rsp) ffffffffxxxxxxxx: 4c 89 04 24 mov %r8,(%rsp) Note how "mov %rax,0x8(%rsp)" got shortened by one byte to "mov %eax,0x8(%rsp)": - ffffffffxxxxxxxx: 48 89 44 24 08 mov %rax,0x8(%rsp) + ffffffffxxxxxxxx: 89 44 24 08 mov %eax,0x8(%rsp) The nr_running() function itself got shorter by 2 bytes, due to shorter instruction sequences. 
The third commit improved code generation too:

  8fc2858e572c: ("sched: Make nr_iowait_cpu() return 32-bit value")

Before:

  ffffffffxxxxxxxx <nr_iowait_cpu>:
  ffffffffxxxxxxxx: 48 63 ff              movslq %edi,%rdi
  ffffffffxxxxxxxx: 48 c7 c0 c0 95 02 00  mov    $0x295c0,%rax
  ffffffffxxxxxxxx: 48 03 04 fd a0 46 0f  add    -0x7df0b960(,%rdi,8),%rax
  ffffffffxxxxxxxx: 82
- ffffffffxxxxxxxx: 48 63 80 98 09 00 00  movslq 0x998(%rax),%rax
  ffffffffxxxxxxxx: c3                    ret

After:

  ffffffffxxxxxxxx <nr_iowait_cpu>:
  ffffffffxxxxxxxx: 48 63 ff              movslq %edi,%rdi
  ffffffffxxxxxxxx: 48 c7 c0 c0 95 02 00  mov    $0x295c0,%rax
  ffffffffxxxxxxxx: 48 03 04 fd a0 46 0f  add    -0x7df0b960(,%rdi,8),%rax
  ffffffffxxxxxxxx: 82
+ ffffffffxxxxxxxx: 8b 80 98 09 00 00     mov    0x998(%rax),%eax
  ffffffffxxxxxxxx: c3                    ret

Note how the 'movslq 0x998(%rax),%rax' lost the REX prefix and got one byte shorter. Call sites got shorter too:

  ffffffffxxxxxxxx: e8 e8 73 fa ff        call ffffffffxxxxxxxx <nr_iowait_cpu>
- ffffffffxxxxxxxx: 48 85 c0              test %rax,%rax

  ffffffffxxxxxxxx: e8 c8 73 fa ff        call ffffffffxxxxxxxx <nr_iowait_cpu>
+ ffffffffxxxxxxxx: 85 c0                 test %eax,%eax

You often won't see these effects in the 'size vmlinux' output, because function alignment padding reserves usually hide 1-2 byte size improvements in generated code.

> I'm truly impressed by the massive savings of this change and I'm even
> more impressed by the justification:
>
> > Bring nr_running() into 32-bit world to save on REX prefixes.
>
> Aside of the obvious useless churn, REX prefixes are universally true for
> all architectures, right? There is a world outside x86 ...

Even architectures that have the same instruction length for 32-bit and 64-bit data types benefit:

 - smaller data structures benefit all 64-bit architectures;

 - the nr_running type definitions are now cleaner, more coherent and consistently 'int' based, instead of the previous nonsensical 'int/long' mix that also made the generated code larger on x86.
More importantly, the maintenance benchmark in these cases is not whether a change actively helps every architecture we care about, but whether the change is a *disadvantage* for them - and it isn't here. Changes that primarily benefit one common architecture but not others are still eligible for upstream merging if they otherwise meet the quality threshold and don't hurt the other architectures.

TL;DR:

The benefits from this series are small, but they are far from 'useless churn' - unless we want to arbitrarily cut off technically valid contributions that improve generated code, data structure size and code readability, submitted by a long-time contributor who has already contributed over 1,300 patches to the kernel, just because we don't think they add up to a significant enough benefit. No doubt the quality barrier must be and is higher for smaller changes - but this series IMO passed that barrier.

Anyway, I've Cc:-ed Linus and Greg: if you are advocating for some sort of cut-off threshold for small but measurable improvements from long-time contributors, it should probably be clearly specified & documented in Documentation/SubmittingPatches ...

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] sched: make nr_running() return 32-bit 2021-05-13 9:58 ` Ingo Molnar @ 2021-05-13 21:22 ` Alexey Dobriyan 2021-05-14 12:52 ` Thomas Gleixner 2021-05-14 18:18 ` Thomas Gleixner 2 siblings, 0 replies; 15+ messages in thread From: Alexey Dobriyan @ 2021-05-13 21:22 UTC (permalink / raw) To: Ingo Molnar Cc: Thomas Gleixner, mingo, Borislav Petkov, linux-kernel, x86, Linus Torvalds, Greg Kroah-Hartman, Peter Zijlstra

On Thu, May 13, 2021 at 11:58:38AM +0200, Ingo Molnar wrote:
>
> * Thomas Gleixner <tglx@linutronix.de> wrote:
>
> > Just with the default config for one of my reference machines:
> >
> >     text    data     bss      dec      hex filename
> > 16679864 6627950 1671296 24979110 17d26a6 ../build/vmlinux-before
> > 16679894 6627950 1671296 24979140 17d26c4 ../build/vmlinux-after
> > ------------------------------------------------------------------------
> > +30
>
> Using '/usr/bin/size' to compare before/after generated code is the wrong
> way to measure code generation improvements for smaller changes due to
> default alignment creating a reserve of 'padding' bytes at the end of most
> functions. You have to look at the low level generated assembly directly.

This is bloat-o-meter output with the Fedora 33 .config. This is how the numbers look: some functions get bigger, but the total is smaller (otherwise why would I send it). Apparently something grew by one byte too many and pushed the padding.
add/remove: 0/0 grow/shrink: 6/21 up/down: 18/-42 (-24)
calc_load_fold_active                          50      56      +6
calc_load_nohz_start                          100     103      +3
calc_load_nohz_remote                          85      88      +3
calc_global_load_tick                          86      89      +3
pull_dl_task                                  901     903      +2
switched_from_dl                              613     614      +1
update_rt_migration                           165     164      -1
update_dl_migration                           141     140      -1
ttwu_do_activate                              181     180      -1
tick_nohz_idle_exit                           225     224      -1
tick_irq_enter                                227     226      -1
print_rt_rq.cold                              238     237      -1
print_rt_rq                                   413     412      -1
nr_iowait_cpu                                  31      30      -1
init_rt_rq                                    143     142      -1
show_stat                                    1745    1743      -2
nr_running                                     75      73      -2
nr_iowait                                      83      81      -2
get_cpu_iowait_time_us                        260     258      -2
get_cpu_idle_time_us                          260     258      -2
find_lock_later_rq                            507     505      -2
enqueue_task_rt                               777     775      -2
enqueue_task_dl                              2461    2459      -2
dequeue_rt_stack                              576     574      -2
menu_select                                  1492    1489      -3
__dequeue_dl_entity                           419     414      -5
init_dl_rq                                     88      81      -7
Total: Before=25729849, After=25729825, chg -0.00%

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] sched: make nr_running() return 32-bit 2021-05-13 9:58 ` Ingo Molnar 2021-05-13 21:22 ` Alexey Dobriyan @ 2021-05-14 12:52 ` Thomas Gleixner 2021-05-14 18:18 ` Thomas Gleixner 2 siblings, 0 replies; 15+ messages in thread From: Thomas Gleixner @ 2021-05-14 12:52 UTC (permalink / raw) To: Ingo Molnar Cc: Alexey Dobriyan, mingo, Borislav Petkov, linux-kernel, x86, Linus Torvalds, Greg Kroah-Hartman, Peter Zijlstra Ingo, On Thu, May 13 2021 at 11:58, Ingo Molnar wrote: > * Thomas Gleixner <tglx@linutronix.de> wrote: > > You often won't see these effects in the 'size vmlinux' output, because > function alignment padding reserves usually hide 1-2 byte size improvements > in generated code. I'm not that stupid. And I certainly looked where this comes from. > More importantly, the maintenance benchmark in these cases is not whether a > change actively helps every architecture we care about - but whether the > change is a *disadvantage* for them - and it isn't here. That's clearly documented in the changelogs of these patches, right? > Changes that primarily benefit one common architecture, while not others, > are still eligible for upstream merging if they otherwise meet the quality > threshold and don't hurt the other architectures. That has been proven by compile testing the relevant architectures as documented in the changelog, right? > TL;DR: > > This benefits from this series are small, but are far from 'useless churn', > unless we want to arbitrarily cut off technically valid contributions that > improve generated code, data structure size and code readability, submitted > by a long-time contributor who has contributed over 1,300 patches to the > kernel already, just because we don't think these add up a significant > enough benefit? > > No doubt the quality barrier must be and is higher for smaller changes - > but this series IMO passed that barrier. 
> Anyway, I've Cc:-ed Linus and Greg, if you are advocating for some sort of
> cut-off threshold for small but measurable improvements from long-time
> contributors, it should probably be clearly specified & documented in
> Documentation/SubmittingPatches ...

What I'm arguing about is already documented:

  Quantify optimizations and trade-offs. If you claim improvements in
  performance, memory consumption, stack footprint, or binary size,
  include numbers that back them up.

That series fails to provide any of this, and it does not matter whether it comes from a long-time contributor or from a newbie. Long-time contributors are not exempt from the documented process. In fact, they should lead by example.

If you as a maintainer apply different measures to newbies and long-time contributors, then you have pretty much proven the point the UMN people tried to make (in the wrong way).

Thanks,

	tglx

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] sched: make nr_running() return 32-bit 2021-05-13 9:58 ` Ingo Molnar 2021-05-13 21:22 ` Alexey Dobriyan 2021-05-14 12:52 ` Thomas Gleixner @ 2021-05-14 18:18 ` Thomas Gleixner 2 siblings, 0 replies; 15+ messages in thread From: Thomas Gleixner @ 2021-05-14 18:18 UTC (permalink / raw) To: Ingo Molnar Cc: Alexey Dobriyan, mingo, Borislav Petkov, linux-kernel, x86, Linus Torvalds, Greg Kroah-Hartman, Peter Zijlstra

On Thu, May 13 2021 at 11:58, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
>
> As to the numbers:
>
> - /* size: 1704, cachelines: 27, members: 13 */
> - /* sum members: 1696, holes: 1, sum holes: 4 */
> + /* size: 1696, cachelines: 27, members: 13 */
> + /* sum members: 1688, holes: 1, sum holes: 4 */
>   /* padding: 4 */
> - /* last cacheline: 40 bytes */
> + /* last cacheline: 32 bytes */
>
> 'struct rt_rq' got shrunk from 1704 bytes to 1696 bytes, an 8 byte
> reduction.

Amazing, and it still occupies 27 cache lines ...

>   ffffffffxxxxxxxx: e8 49 8e fb ff        call ffffffffxxxxxxxx <nr_iowait_cpu>
> - ffffffffxxxxxxxx: 48 85 c0              test %rax,%rax
>
>   ffffffffxxxxxxxx: e8 64 8e fb ff        call ffffffffxxxxxxxx <nr_iowait_cpu>
> + ffffffffxxxxxxxx: 85 c0                 test %eax,%eax
>
> Note how the 'test %rax,%rax' lost the 0x48 64-bit REX prefix and the
> generated code got one byte shorter from "48 85 c0" to "85 c0".

Which will surely be noticeable big time. None of this truly matters, because once the data is in L1 the REX prefix is just noise.

> ( Note, my asm generation scripts filter out some of the noise to make it
>   easier to diff generated asm, hence the ffffffffxxxxxxxx placeholder. )
>
> The nr_iowait() function itself got shorter by two bytes as well, due to:
>
> The size of nr_iowait() shrunk from 78 bytes to 76 bytes.

That's important, because nr_iowait() is truly a hotpath function ...

> The nr_running() function itself got shorter by 2 bytes, due to shorter
> instruction sequences.

... along with nr_running(), which both feed /proc/stat.
The latter feeds /proc/loadavg as well. Point well taken.

But looking at the /proc/stat usage, there is obviously way bigger fish to fry:

	seq_printf(...., nr_running(), nr_iowait());

which is outright stupid, because both functions iterate over all CPUs separately instead of doing it once, which definitely would be measurable on large machines.

But looking at nr_running() and nr_iowait():

     nr_running() walks all CPUs in cpu_online_mask
     nr_iowait() walks all CPUs in cpu_possible_mask

The latter is because rq::nr_iowait is not transferred to an online CPU when a CPU goes offline. Oh well.

That aside: I'm not against the change per se, but I'm disagreeing with patches which come with zero information, are clearly focused on one architecture, and where obviously nobody bothered to check whether there is an impact on others.

Thanks,

	tglx

^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2021-05-14 18:18 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-22 20:02 [PATCH 1/4] sched: make nr_running() return 32-bit Alexey Dobriyan
2021-04-22 20:02 ` [PATCH 2/4] sched: make nr_iowait() return 32-bit value Alexey Dobriyan
2021-05-12 20:01   ` [tip: sched/core] sched: Make " tip-bot2 for Alexey Dobriyan
2021-04-22 20:02 ` [PATCH 3/4] sched: make nr_iowait_cpu() return 32-bit Alexey Dobriyan
2021-05-12 20:01   ` [tip: sched/core] sched: Make nr_iowait_cpu() return 32-bit value tip-bot2 for Alexey Dobriyan
2021-04-22 20:02 ` [PATCH 4/4] sched: make multiple runqueue task counters 32-bit Alexey Dobriyan
2021-05-12 19:36   ` Ingo Molnar
2021-05-12 20:01   ` [tip: sched/core] sched: Make " tip-bot2 for Alexey Dobriyan
2021-05-12 20:01 ` [tip: sched/core] sched: Make nr_running() return 32-bit value tip-bot2 for Alexey Dobriyan
2021-05-12 23:58 ` [PATCH 1/4] sched: make nr_running() return 32-bit Thomas Gleixner
2021-05-13  7:23   ` Alexey Dobriyan
2021-05-13  9:58   ` Ingo Molnar
2021-05-13 21:22     ` Alexey Dobriyan
2021-05-14 12:52     ` Thomas Gleixner
2021-05-14 18:18     ` Thomas Gleixner