[1/4] sched: make nr_running() return 32-bit

Message ID 20210422200228.1423391-1-adobriyan@gmail.com
State New, archived

Commit Message

Alexey Dobriyan April 22, 2021, 8:02 p.m. UTC
Creating 2**32 tasks is impossible due to futex pid limits and wasteful
anyway. Nobody has done it.

Bring nr_running() into 32-bit world to save on REX prefixes.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/proc/loadavg.c          | 2 +-
 fs/proc/stat.c             | 2 +-
 include/linux/sched/stat.h | 2 +-
 kernel/sched/core.c        | 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

Comments

Thomas Gleixner May 12, 2021, 11:58 p.m. UTC | #1
Alexey,

On Thu, Apr 22 2021 at 23:02, Alexey Dobriyan wrote:
> Creating 2**32 tasks is impossible due to futex pid limits and wasteful
> anyway. Nobody has done it.
>

this whole pile lacks useful numbers. What's the actual benefit of that
churn?

Just with the default config for one of my reference machines:

   text		data	bss	dec	 hex	 filename
16679864	6627950	1671296	24979110 17d26a6 ../build/vmlinux-before
16679894	6627950	1671296	24979140 17d26c4 ../build/vmlinux-after
------------------------------------------------------------------------
     +30

I'm truly impressed by the massive savings of this change and I'm even
more impressed by the justification:

> Bring nr_running() into 32-bit world to save on REX prefixes.

Aside from the obvious useless churn, REX prefixes are universally true for
all architectures, right? There is a world outside x86 ...

Thanks,

        tglx
Alexey Dobriyan May 13, 2021, 7:23 a.m. UTC | #2
On Thu, May 13, 2021 at 01:58:16AM +0200, Thomas Gleixner wrote:
> Alexey,
> 
> On Thu, Apr 22 2021 at 23:02, Alexey Dobriyan wrote:
> > Creating 2**32 tasks is impossible due to futex pid limits and wasteful
> > anyway. Nobody has done it.
> >
> 
> this whole pile lacks useful numbers. What's the actual benefit of that
> churn?

The long-term goal is to use 32-bit data more. People will see it in the
core kernel and copy it everywhere else.

> Just with the default config for one of my reference machines:
> 
>    text		data	bss	dec	 hex	 filename
> 16679864	6627950	1671296	24979110 17d26a6 ../build/vmlinux-before
> 16679894	6627950	1671296	24979140 17d26c4 ../build/vmlinux-after
> ------------------------------------------------------------------------
>      +30
> 
> I'm truly impressed by the massive savings of this change and I'm even
> more impressed by the justification:
> 
> > Bring nr_running() into 32-bit world to save on REX prefixes.

I collected numbers initially but then stopped because no one cared, and
they are config- and arch-dependent anyway.

> Aside from the obvious useless churn,

oh... Sometimes I think churn is the whole point.

> REX prefixes are universally true for
> all architectures, right? There is a world outside x86 ...

In general, 32-bitness is preferred for code generation.

32-bit RISCs naturally prefer 32-bit.

64-bit RISCs don't care because they remember their 32-bit roots and
have the necessary fixed-width(!) 32-bit instructions.

x86_64 is the only arch where going 64-bit generally adds more bytes
to the instruction stream.

Of course the effects can be smudged by compilers; in this case, by the
percpu code. That "unsigned int i" is a mistake. The proper diff looks
like this:

	-ffffffff811115fa: 8b 44 18 04      mov    eax,DWORD PTR [rax+rbx*1+0x4]
	-ffffffff811115fe: 49 01 c4         add    r12,rax
	+ffffffff811115fa: 44 03 64 18 04   add    r12d,DWORD PTR [rax+rbx*1+0x4]

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4348,9 +4348,10 @@ context_switch(struct rq *rq, struct task_struct *prev,
  * externally visible scheduler statistics: current number of runnable
  * threads, total number of context switches performed since bootup.
  */
-unsigned long nr_running(void)
+unsigned int nr_running(void)
 {
-	unsigned long i, sum = 0;
+	unsigned int sum = 0;
+	unsigned long i;
 
 	for_each_online_cpu(i)
 		sum += cpu_rq(i)->nr_running;
Ingo Molnar May 13, 2021, 9:58 a.m. UTC | #3
* Thomas Gleixner <tglx@linutronix.de> wrote:

> Alexey,
> 
> On Thu, Apr 22 2021 at 23:02, Alexey Dobriyan wrote:
> > Creating 2**32 tasks is impossible due to futex pid limits and wasteful
> > anyway. Nobody has done it.
> >
> 
> this whole pile lacks useful numbers. What's the actual benefit of that
> churn?

I applied the 4-patch series from Alexey Dobriyan because it offers four 
distinct technical advantages to the scheduler code:

 - Shorter instructions generated in nr_running(), nr_iowait(), 
   nr_iowait_cpu() due to losing the REX prefix.

 - Shorter instructions generated by users of these 3 functions as well.

 - Tighter data packing and structure size reduction in 'struct rt_rq', 
   'struct dl_rq' and 'struct rq', due to 8-byte 'long' fields shrinking to 
   4-byte 'int' based fields.

 - Together these 4 patches clean up all derivative uses of the 
   rq::nr_running base type, which is already 'unsigned int' and that's 
   pretty fundamental given our PID ABI limits. Having a type mismatch where 
   we use 64-bit data types for certain APIs and 32-bit data types for 
   others is inconsistent crap I wouldn't accept if it were submitted as 
   new code.

As to the numbers:

The data structure size improvements are IMO obvious, and they are also 
measurable, here's the before/after Pahole comparison of 'struct rt_rq':

--- pahole.struct.rt_rq.before	2021-05-13 11:40:53.207077908 +0200
+++ pahole.struct.rt_rq.after	2021-05-13 11:41:42.257385897 +0200
@@ -7,22 +7,22 @@ struct rt_rq {
 		int                curr;                 /*  1624     4 */
 		int                next;                 /*  1628     4 */
 	} highest_prio;                                  /*  1624     8 */
-	long unsigned int          rt_nr_migratory;      /*  1632     8 */
-	long unsigned int          rt_nr_total;          /*  1640     8 */
-	int                        overloaded;           /*  1648     4 */
+	unsigned int               rt_nr_migratory;      /*  1632     4 */
+	unsigned int               rt_nr_total;          /*  1636     4 */
+	int                        overloaded;           /*  1640     4 */
 
 	/* XXX 4 bytes hole, try to pack */
 
-	struct plist_head          pushable_tasks;       /*  1656    16 */
-	/* --- cacheline 26 boundary (1664 bytes) was 8 bytes ago --- */
-	int                        rt_queued;            /*  1672     4 */
-	int                        rt_throttled;         /*  1676     4 */
-	u64                        rt_time;              /*  1680     8 */
-	u64                        rt_runtime;           /*  1688     8 */
-	raw_spinlock_t             rt_runtime_lock;      /*  1696     4 */
+	struct plist_head          pushable_tasks;       /*  1648    16 */
+	/* --- cacheline 26 boundary (1664 bytes) --- */
+	int                        rt_queued;            /*  1664     4 */
+	int                        rt_throttled;         /*  1668     4 */
+	u64                        rt_time;              /*  1672     8 */
+	u64                        rt_runtime;           /*  1680     8 */
+	raw_spinlock_t             rt_runtime_lock;      /*  1688     4 */
 
-	/* size: 1704, cachelines: 27, members: 13 */
-	/* sum members: 1696, holes: 1, sum holes: 4 */
+	/* size: 1696, cachelines: 27, members: 13 */
+	/* sum members: 1688, holes: 1, sum holes: 4 */
 	/* padding: 4 */
-	/* last cacheline: 40 bytes */
+	/* last cacheline: 32 bytes */

'struct rt_rq' got shrunk from 1704 bytes to 1696 bytes, an 8 byte 
reduction.

'struct dl_rq' shrunk by 8 bytes:

-	/* size: 104, cachelines: 2, members: 10 */
-	/* sum members: 100, holes: 1, sum holes: 4 */
-	/* last cacheline: 40 bytes */
+	/* size: 96, cachelines: 2, members: 10 */
+	/* sum members: 92, holes: 1, sum holes: 4 */
+	/* last cacheline: 32 bytes */

'struct rq', which embeds both rt_rq and dl_rq, got 20 bytes smaller:

-	/* sum members: 2967, holes: 10, sum holes: 137 */
+	/* sum members: 2947, holes: 11, sum holes: 157 */

Side note: there are now 20 more bytes of new padding holes, which could 
possibly be coalesced a bit more by reordering members sensibly, resulting 
in even more data footprint savings. (But those should be separate changes 
and only for fields we truly care about from a performance POV.)

As to code generation improvements:

> Just with the default config for one of my reference machines:
> 
>    text		data	bss	dec	 hex	 filename
> 16679864	6627950	1671296	24979110 17d26a6 ../build/vmlinux-before
> 16679894	6627950	1671296	24979140 17d26c4 ../build/vmlinux-after
> ------------------------------------------------------------------------
>      +30

Using '/usr/bin/size' to compare before/after generated code is the wrong 
way to measure code generation improvements for smaller changes due to 
default alignment creating a reserve of 'padding' bytes at the end of most 
functions. You have to look at the low level generated assembly directly.

For example, the nr_iowait_cpu() commit, with before/after generated code 
of the callers of the function:

  9745516841a5: ("sched: Make nr_iowait() return 32-bit value")

  ffffffffxxxxxxxx:      e8 49 8e fb ff          call   ffffffffxxxxxxxx <nr_iowait_cpu>
- ffffffffxxxxxxxx:      48 85 c0                test   %rax,%rax

  ffffffffxxxxxxxx:      e8 64 8e fb ff          call   ffffffffxxxxxxxx <nr_iowait_cpu>
+ ffffffffxxxxxxxx:      85 c0                   test   %eax,%eax

Note how the 'test %rax,%rax' lost the 0x48 64-bit REX prefix and the 
generated code got one byte shorter from "48 85 c0" to "85 c0".

( Note, my asm generation scripts filter out some of the noise to make it 
  easier to diff generated asm, hence the ffffffffxxxxxxxx placeholder. )

The nr_iowait() function itself got shorter by two bytes as well, due to:

Before:

ffffffffxxxxxxxx <nr_iowait>:
ffffffffxxxxxxxx:       41 54                   push   %r12
ffffffffxxxxxxxx:       48 c7 c7 ff ff ff ff    mov    $0xffffffffxxxxxxxx,%rdi
ffffffffxxxxxxxx:       45 31 e4                xor    %r12d,%r12d
ffffffffxxxxxxxx:       55                      push   %rbp
ffffffffxxxxxxxx:       8b 2d 01 ea 5d 01       mov    0x15dea01(%rip),%ebp        # ffffffffxxxxxxxx <nr_cpu_ids>
ffffffffxxxxxxxx:       53                      push   %rbx
ffffffffxxxxxxxx:       48 c7 c3 c0 95 02 00    mov    $0x295c0,%rbx
ffffffffxxxxxxxx:       eb 17                   jmp    ffffffffxxxxxxxx <nr_iowait+0x34>
ffffffffxxxxxxxx:       48 98                   cltq   
ffffffffxxxxxxxx:       48 8b 14 c5 a0 c6 2e    mov    -0x7dd13960(,%rax,8),%rdx
ffffffffxxxxxxxx:       82 
ffffffffxxxxxxxx:       48 01 da                add    %rbx,%rdx
ffffffffxxxxxxxx:       48 63 82 98 09 00 00    movslq 0x998(%rdx),%rax
ffffffffxxxxxxxx:       49 01 c4                add    %rax,%r12
ffffffffxxxxxxxx:       48 c7 c6 18 72 67 82    mov    $0xffffffffxxxxxxxx,%rsi
ffffffffxxxxxxxx:       e8 80 46 39 00          call   ffffffffxxxxxxxx <cpumask_next>
ffffffffxxxxxxxx:       89 c7                   mov    %eax,%edi
ffffffffxxxxxxxx:       39 e8                   cmp    %ebp,%eax
ffffffffxxxxxxxx:       72 d7                   jb     ffffffffxxxxxxxx <nr_iowait+0x1d>
ffffffffxxxxxxxx:       4c 89 e0                mov    %r12,%rax
ffffffffxxxxxxxx:       5b                      pop    %rbx
ffffffffxxxxxxxx:       5d                      pop    %rbp
ffffffffxxxxxxxx:       41 5c                   pop    %r12
ffffffffxxxxxxxx:       c3                      ret    

After:

ffffffffxxxxxxxx <nr_iowait>:
ffffffffxxxxxxxx:       41 54                   push   %r12
ffffffffxxxxxxxx:       bf ff ff ff ff          mov    $0xffffffff,%edi
ffffffffxxxxxxxx:       45 31 e4                xor    %r12d,%r12d
ffffffffxxxxxxxx:       55                      push   %rbp
ffffffffxxxxxxxx:       8b 2d 03 ea 5d 01       mov    0x15dea03(%rip),%ebp        # ffffffffxxxxxxxx <nr_cpu_ids>
ffffffffxxxxxxxx:       53                      push   %rbx
ffffffffxxxxxxxx:       48 c7 c3 c0 95 02 00    mov    $0x295c0,%rbx
ffffffffxxxxxxxx:       eb 17                   jmp    ffffffffxxxxxxxx <nr_iowait+0x32>
ffffffffxxxxxxxx:       48 63 c7                movslq %edi,%rax
ffffffffxxxxxxxx:       48 8b 14 c5 a0 c6 2e    mov    -0x7dd13960(,%rax,8),%rdx
ffffffffxxxxxxxx:       82 
ffffffffxxxxxxxx:       48 01 da                add    %rbx,%rdx
ffffffffxxxxxxxx:       8b 82 98 09 00 00       mov    0x998(%rdx),%eax
ffffffffxxxxxxxx:       41 01 c4                add    %eax,%r12d
ffffffffxxxxxxxx:       48 c7 c6 18 72 67 82    mov    $0xffffffffxxxxxxxx,%rsi
ffffffffxxxxxxxx:       e8 a2 46 39 00          call   ffffffffxxxxxxxx <cpumask_next>
ffffffffxxxxxxxx:       89 c7                   mov    %eax,%edi
ffffffffxxxxxxxx:       39 c5                   cmp    %eax,%ebp
ffffffffxxxxxxxx:       77 d7                   ja     ffffffffxxxxxxxx <nr_iowait+0x1b>
ffffffffxxxxxxxx:       44 89 e0                mov    %r12d,%eax
ffffffffxxxxxxxx:       5b                      pop    %rbx
ffffffffxxxxxxxx:       5d                      pop    %rbp
ffffffffxxxxxxxx:       41 5c                   pop    %r12
ffffffffxxxxxxxx:       c3                      ret    

The size of nr_iowait() shrunk from 78 bytes to 76 bytes.

Or the other commit:

  01aee8fd7fb2: ("sched: Make nr_running() return 32-bit value")

  ffffffffxxxxxxxx:       e8 d9 24 e8 ff          call   ffffffffxxxxxxxx <nr_running>
  ffffffffxxxxxxxx:       4c 8b 05 cc 92 70 01    mov    0x17092cc(%rip),%r8        # ffffffffxxxxxxxx <total_forks>
  ffffffffxxxxxxxx:       4c 8b 64 24 50          mov    0x50(%rsp),%r12
- ffffffffxxxxxxxx:       48 89 44 24 08          mov    %rax,0x8(%rsp)
  ffffffffxxxxxxxx:       4c 89 04 24             mov    %r8,(%rsp)

  ffffffffxxxxxxxx:       e8 f9 24 e8 ff          call   ffffffffxxxxxxxx <nr_running>
  ffffffffxxxxxxxx:       4c 8b 05 ed 92 70 01    mov    0x17092ed(%rip),%r8        # ffffffffxxxxxxxx <total_forks>
  ffffffffxxxxxxxx:       4c 8b 64 24 50          mov    0x50(%rsp),%r12
+ ffffffffxxxxxxxx:       89 44 24 08             mov    %eax,0x8(%rsp)
  ffffffffxxxxxxxx:       4c 89 04 24             mov    %r8,(%rsp)

Note how "mov %rax,0x8(%rsp)" got shortened by one byte to "mov %eax,0x8(%rsp)":

- ffffffffxxxxxxxx:       48 89 44 24 08          mov    %rax,0x8(%rsp)
+ ffffffffxxxxxxxx:       89 44 24 08             mov    %eax,0x8(%rsp)

The nr_running() function itself got shorter by 2 bytes, due to shorter 
instruction sequences.

The third commit improved code generation too:

  8fc2858e572c: ("sched: Make nr_iowait_cpu() return 32-bit value")

  ffffffffxxxxxxxx <nr_iowait_cpu>:
  ffffffffxxxxxxxx:       48 63 ff                movslq %edi,%rdi
  ffffffffxxxxxxxx:       48 c7 c0 c0 95 02 00    mov    $0x295c0,%rax
  ffffffffxxxxxxxx:       48 03 04 fd a0 46 0f    add    -0x7df0b960(,%rdi,8),%rax
  ffffffffxxxxxxxx:       82 
- ffffffffxxxxxxxx:       48 63 80 98 09 00 00    movslq 0x998(%rax),%rax
  ffffffffxxxxxxxx:       c3                      ret    

  ffffffffxxxxxxxx <nr_iowait_cpu>:
  ffffffffxxxxxxxx:       48 63 ff                movslq %edi,%rdi
  ffffffffxxxxxxxx:       48 c7 c0 c0 95 02 00    mov    $0x295c0,%rax
  ffffffffxxxxxxxx:       48 03 04 fd a0 46 0f    add    -0x7df0b960(,%rdi,8),%rax
  ffffffffxxxxxxxx:       82 
- ffffffffxxxxxxxx:       8b 80 98 09 00 00       mov    0x998(%rax),%eax
  ffffffffxxxxxxxx:       c3                      ret    

Note how the 'movslq 0x998(%rax),%rax' lost the REX prefix and got one byte 
shorter. Call sites got shorter too:

  ffffffffxxxxxxxx:       e8 e8 73 fa ff          call   ffffffffxxxxxxxx <nr_iowait_cpu>
- ffffffffxxxxxxxx:       48 85 c0                test   %rax,%rax

  ffffffffxxxxxxxx:       e8 c8 73 fa ff          call   ffffffffxxxxxxxx <nr_iowait_cpu>
- ffffffffxxxxxxxx:       85 c0                   test   %eax,%eax

You often won't see these effects in the 'size vmlinux' output, because 
function alignment padding reserves usually hide 1-2 byte size improvements 
in generated code.

> I'm truly impressed by the massive savings of this change and I'm even 
> more impressed by the justification:
> 
> > Bring nr_running() into 32-bit world to save on REX prefixes.
> 
> Aside from the obvious useless churn, REX prefixes are universally true for 
> all architectures, right? There is a world outside x86 ...

Even architectures that have the same instruction length for 32-bit and 
64-bit data types benefit:

 - smaller data structures benefit all 64-bit architectures

 - the cleaner, more coherent nr_running type definitions are now more 
   consistently 'int' based, not the previous nonsensical 'int/long' mix 
   that also made the generated code larger on x86.

More importantly, the maintenance benchmark in these cases is not whether a 
change actively helps every architecture we care about - but whether the 
change is a *disadvantage* for them - and it isn't here.

Changes that primarily benefit one common architecture, while not others, 
are still eligible for upstream merging if they otherwise meet the quality 
threshold and don't hurt the other architectures.

TL;DR:

The benefits from this series are small, but they are far from 'useless 
churn'. Or do we want to arbitrarily cut off technically valid contributions 
that improve generated code, data structure size and code readability, 
submitted by a long-time contributor who has contributed over 1,300 patches 
to the kernel already, just because we don't think they add up to a 
significant enough benefit?

No doubt the quality barrier must be and is higher for smaller changes - 
but this series IMO passed that barrier.

Anyway, I've Cc:-ed Linus and Greg; if you are advocating for some sort of 
cut-off threshold for small but measurable improvements from long-time 
contributors, it should probably be clearly specified & documented in 
Documentation/SubmittingPatches ...

Thanks,

	Ingo
Alexey Dobriyan May 13, 2021, 9:22 p.m. UTC | #4
On Thu, May 13, 2021 at 11:58:38AM +0200, Ingo Molnar wrote:
> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 

> > Just with the default config for one of my reference machines:
> > 
> >    text		data	bss	dec	 hex	 filename
> > 16679864	6627950	1671296	24979110 17d26a6 ../build/vmlinux-before
> > 16679894	6627950	1671296	24979140 17d26c4 ../build/vmlinux-after
> > ------------------------------------------------------------------------
> >      +30
> 
> Using '/usr/bin/size' to compare before/after generated code is the wrong 
> way to measure code generation improvements for smaller changes due to 
> default alignment creating a reserve of 'padding' bytes at the end of most 
> functions. You have to look at the low level generated assembly directly.

This is bloat-o-meter output with Fedora 33 .config:

This is what it looks like: some functions get bigger but the total is
smaller (otherwise why would I send it). Apparently something grew one
byte too many and pushed past the alignment padding.

add/remove: 0/0 grow/shrink: 6/21 up/down: 18/-42 (-24)
Function                                     old     new   delta
calc_load_fold_active                         50      56      +6
calc_load_nohz_start                         100     103      +3
calc_load_nohz_remote                         85      88      +3
calc_global_load_tick                         86      89      +3
pull_dl_task                                 901     903      +2
switched_from_dl                             613     614      +1
update_rt_migration                          165     164      -1
update_dl_migration                          141     140      -1
ttwu_do_activate                             181     180      -1
tick_nohz_idle_exit                          225     224      -1
tick_irq_enter                               227     226      -1
print_rt_rq.cold                             238     237      -1
print_rt_rq                                  413     412      -1
nr_iowait_cpu                                 31      30      -1
init_rt_rq                                   143     142      -1
show_stat                                   1745    1743      -2
nr_running                                    75      73      -2
nr_iowait                                     83      81      -2
get_cpu_iowait_time_us                       260     258      -2
get_cpu_idle_time_us                         260     258      -2
find_lock_later_rq                           507     505      -2
enqueue_task_rt                              777     775      -2
enqueue_task_dl                             2461    2459      -2
dequeue_rt_stack                             576     574      -2
menu_select                                 1492    1489      -3
__dequeue_dl_entity                          419     414      -5
init_dl_rq                                    88      81      -7
Total: Before=25729849, After=25729825, chg -0.00%
Thomas Gleixner May 14, 2021, 12:52 p.m. UTC | #5
Ingo,

On Thu, May 13 2021 at 11:58, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
>
> You often won't see these effects in the 'size vmlinux' output, because 
> function alignment padding reserves usually hide 1-2 byte size improvements 
> in generated code.

I'm not that stupid. And I certainly looked at where this comes from.

> More importantly, the maintenance benchmark in these cases is not whether a 
> change actively helps every architecture we care about - but whether the 
> change is a *disadvantage* for them - and it isn't here.

That's clearly documented in the changelogs of these patches, right?

> Changes that primarily benefit one common architecture, while not others, 
> are still eligible for upstream merging if they otherwise meet the quality 
> threshold and don't hurt the other architectures.

That has been proven by compile testing the relevant architectures as
documented in the changelog, right?

> TL;DR:
>
> The benefits from this series are small, but they are far from 'useless 
> churn'. Or do we want to arbitrarily cut off technically valid 
> contributions that improve generated code, data structure size and code 
> readability, submitted by a long-time contributor who has contributed over 
> 1,300 patches to the kernel already, just because we don't think they add 
> up to a significant enough benefit?
>
> No doubt the quality barrier must be and is higher for smaller changes - 
> but this series IMO passed that barrier.
>
> Anyway, I've Cc:-ed Linus and Greg, if you are advocating for some sort of 
> cut-off threshold for small but measurable improvements from long-time 
> contributors, it should probably be clearly specified & documented in 
> Documentation/SubmittingPatches ...

What I'm arguing about is already documented:

  Quantify optimizations and trade-offs.  If you claim improvements in
  performance, memory consumption, stack footprint, or binary size,
  include numbers that back them up.

That series fails to provide any of this, and it does not matter whether
it comes from a long-time contributor or from a newbie.

Long-time contributors are not exempt from the documented process. In fact
they should lead by example.

If you as a maintainer apply different measures to newbies and long-time
contributors, then you have pretty much proven the point the UMN people
tried to make (in the wrong way).

Thanks,

        tglx
Thomas Gleixner May 14, 2021, 6:18 p.m. UTC | #6
On Thu, May 13 2021 at 11:58, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> As to the numbers:
> -	/* size: 1704, cachelines: 27, members: 13 */
> -	/* sum members: 1696, holes: 1, sum holes: 4 */
> +	/* size: 1696, cachelines: 27, members: 13 */
> +	/* sum members: 1688, holes: 1, sum holes: 4 */
>  	/* padding: 4 */
> -	/* last cacheline: 40 bytes */
> +	/* last cacheline: 32 bytes */
>
> 'struct rt_rq' got shrunk from 1704 bytes to 1696 bytes, an 8 byte 
> reduction.

Amazing, and it still occupies 27 cache lines.

>   ffffffffxxxxxxxx:      e8 49 8e fb ff          call   ffffffffxxxxxxxx <nr_iowait_cpu>
> - ffffffffxxxxxxxx:      48 85 c0                test   %rax,%rax
>
>   ffffffffxxxxxxxx:      e8 64 8e fb ff          call   ffffffffxxxxxxxx <nr_iowait_cpu>
> + ffffffffxxxxxxxx:      85 c0                   test   %eax,%eax
>
> Note how the 'test %rax,%rax' lost the 0x48 64-bit REX prefix and the 
> generated code got one byte shorter from "48 85 c0" to "85 c0".

Which will surely be noticeable big time. None of this truly matters
because once the data is in L1 the REX prefix is just noise.

> ( Note, my asm generation scripts filter out some of the noise to make it 
>   easier to diff generated asm, hence the ffffffffxxxxxxxx placeholder. )
>
> The nr_iowait() function itself got shorter by two bytes as well, due to:
>
> The size of nr_iowait() shrunk from 78 bytes to 76 bytes.

That's important because nr_iowait() is truly a hotpath function...

> The nr_running() function itself got shorter by 2 bytes, due to shorter 
> instruction sequences.

along with nr_running(); both feed /proc/stat, and the latter feeds
/proc/loadavg as well.

Point well taken.

But looking at the /proc/stat usage there is obviously way bigger fish
to fry.

   seq_printf(...., nr_running(), nr_iowait());

which is outright stupid because both functions iterate over all CPUs
instead of doing it in one pass, which would definitely be measurable on
large machines.

But looking at nr_running() and nr_iowait():

    nr_running() walks all CPUs in cpu_online_mask

    nr_iowait() walks all CPUs in cpu_possible_mask

The latter is because rq::nr_iowait is not transferred to an online CPU
when a CPU goes offline. Oh well.

That aside:

I'm not against the change per se, but I'm disagreeing with patches
which come with zero information, are clearly focused on one
architecture, and where obviously nobody bothered to check whether there
is an impact on others.

Thanks,

        tglx

Patch

diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 8468baee951d..f32878d9a39f 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -16,7 +16,7 @@  static int loadavg_proc_show(struct seq_file *m, void *v)
 
 	get_avenrun(avnrun, FIXED_1/200, 0);
 
-	seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
+	seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
 		LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
 		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
 		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index f25e8531fd27..941605de7f9a 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -200,7 +200,7 @@  static int show_stat(struct seq_file *p, void *v)
 		"\nctxt %llu\n"
 		"btime %llu\n"
 		"processes %lu\n"
-		"procs_running %lu\n"
+		"procs_running %u\n"
 		"procs_blocked %lu\n",
 		nr_context_switches(),
 		(unsigned long long)boottime.tv_sec,
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 568286411b43..f742229091ec 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -16,7 +16,7 @@  extern unsigned long total_forks;
 extern int nr_threads;
 DECLARE_PER_CPU(unsigned long, process_counts);
 extern int nr_processes(void);
-extern unsigned long nr_running(void);
+extern unsigned int nr_running(void);
 extern bool single_task_running(void);
 extern unsigned long nr_iowait(void);
 extern unsigned long nr_iowait_cpu(int cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..713ea35cb995 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4331,9 +4331,9 @@  context_switch(struct rq *rq, struct task_struct *prev,
  * externally visible scheduler statistics: current number of runnable
  * threads, total number of context switches performed since bootup.
  */
-unsigned long nr_running(void)
+unsigned int nr_running(void)
 {
-	unsigned long i, sum = 0;
+	unsigned int i, sum = 0;
 
 	for_each_online_cpu(i)
 		sum += cpu_rq(i)->nr_running;