linux-kernel.vger.kernel.org archive mirror
* [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
@ 2018-10-11  0:33 Eric Dumazet
  2018-10-11  7:31 ` Peter Zijlstra
  2018-10-11 23:25 ` kbuild test robot
  0 siblings, 2 replies; 6+ messages in thread
From: Eric Dumazet @ 2018-10-11  0:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eric Dumazet, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

While looking at the native_sched_clock() disassembly, I was
surprised to see that the compiler (gcc 7.3 here) had
optimized out the loop, meaning the code is broken.

Using the documented and approved API not only fixes the bug,
it also makes the code more readable.

Replacing five this_cpu_read() calls with one this_cpu_ptr() makes
the generated code smaller.

Fixes: 59eaef78bfea ("x86/tsc: Remodel cyc2ns to use seqcount_latch()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/kernel/tsc.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index b52bd2b6cdb443ba0c89d78aaa52b02b82a10b6e..b039ae0f358a4f78cc830c069ad90e4fb108489e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -60,19 +60,16 @@ static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);
 
 void cyc2ns_read_begin(struct cyc2ns_data *data)
 {
-	int seq, idx;
+	const struct cyc2ns *c2n;
+	int seq;
 
 	preempt_disable_notrace();
 
+	c2n = this_cpu_ptr(&cyc2ns);
 	do {
-		seq = this_cpu_read(cyc2ns.seq.sequence);
-		idx = seq & 1;
-
-		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
-		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
-		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
-
-	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
+		seq = raw_read_seqcount_latch(&c2n->seq);
+		*data = c2n->data[seq & 1];
+	} while (read_seqcount_retry(&c2n->seq, seq));
 }
 
 void cyc2ns_read_end(void)
-- 
2.19.0.605.g01d371f741-goog



* Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
  2018-10-11  0:33 [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin() Eric Dumazet
@ 2018-10-11  7:31 ` Peter Zijlstra
  2018-10-11  8:40   ` Peter Zijlstra
  2018-10-11 15:17   ` Eric Dumazet
  2018-10-11 23:25 ` kbuild test robot
  1 sibling, 2 replies; 6+ messages in thread
From: Peter Zijlstra @ 2018-10-11  7:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, Eric Dumazet, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> While looking at the native_sched_clock() disassembly, I was
> surprised to see that the compiler (gcc 7.3 here) had
> optimized out the loop, meaning the code is broken.
> 
> Using the documented and approved API not only fixes the bug,
> it also makes the code more readable.
> 
> Replacing five this_cpu_read() calls with one this_cpu_ptr() makes
> the generated code smaller.

It does not for me; that is, the resulting asm is actually larger.

You're quite right the loop went missing; no idea wth that compiler is
smoking (gcc-8.2 for me). In order to eliminate that loop it needs to
think that two consecutive loads of this_cpu_read(cyc2ns.seq.sequence)
will return the same value. But this_cpu_read() is an asm() statement,
it _should_ not assume such.

We assume that this_cpu_read() implies READ_ONCE() in a number of
locations; this really should not happen.

The reason it was written using this_cpu_read() is so that it can use
%gs: prefixed instructions and avoid ever loading that percpu offset and
doing manual address computation.
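
To make that concrete, here is a minimal userspace sketch (hypothetical
helper, not the kernel code): gcc treats an asm() that has output
operands but no "volatile" as a pure function of its inputs, so it is
free to merge the two sequence reads below and delete the retry loop:

	/* no "volatile": identical instances of this asm() may be CSE'd */
	static inline unsigned int seq_read(const unsigned int *seq)
	{
		unsigned int ret;

		asm("movl %1, %0" : "=r" (ret) : "m" (*seq));
		return ret;
	}

	unsigned int snapshot(const unsigned int *seq)
	{
		unsigned int s;

		do {
			s = seq_read(seq);
			/* ... copy the protected data out here ... */
		} while (s != seq_read(seq));	/* reads merged: condition
						 * folds to false, loop gone */
		return s;
	}

Making those asm() statements volatile, much like the volatile access
in READ_ONCE(), forbids exactly that merge.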

Let me prod at this with a sharp stick.


* Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
  2018-10-11  7:31 ` Peter Zijlstra
@ 2018-10-11  8:40   ` Peter Zijlstra
       [not found]     ` <CANn89iJdztFEMzeCoXu9J4X6roKKx+fuPkwzi9N9G7aiO=_=FQ@mail.gmail.com>
  2018-10-11 15:17   ` Eric Dumazet
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2018-10-11  8:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, Eric Dumazet, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov

On Thu, Oct 11, 2018 at 09:31:33AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> > While looking at the native_sched_clock() disassembly, I was
> > surprised to see that the compiler (gcc 7.3 here) had
> > optimized out the loop, meaning the code is broken.
> > 
> > Using the documented and approved API not only fixes the bug,
> > it also makes the code more readable.
> > 
> > Replacing five this_cpu_read() calls with one this_cpu_ptr() makes
> > the generated code smaller.
> 
> It does not for me; that is, the resulting asm is actually larger.
> 
> You're quite right the loop went missing; no idea wth that compiler is
> smoking (gcc-8.2 for me). In order to eliminate that loop it needs to
> think that two consecutive loads of this_cpu_read(cyc2ns.seq.sequence)
> will return the same value. But this_cpu_read() is an asm() statement,
> it _should_ not assume such.
> 
> We assume that this_cpu_read() implies READ_ONCE() in a number of
> locations; this really should not happen.
> 
> The reason it was written using this_cpu_read() is so that it can use
> %gs: prefixed instructions and avoid ever loading that percpu offset and
> doing manual address computation.
> 
> Let me prod at this with a sharp stick.

OK, so apart from the inlining being crap, which is fixed by:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6490f618e096..638491062fea 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -57,7 +57,7 @@ struct cyc2ns {
 
 static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);
 
-void cyc2ns_read_begin(struct cyc2ns_data *data)
+void __always_inline cyc2ns_read_begin(struct cyc2ns_data *data)
 {
 	int seq, idx;
 
@@ -74,7 +74,7 @@ void cyc2ns_read_begin(struct cyc2ns_data *data)
 	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
 }
 
-void cyc2ns_read_end(void)
+void __always_inline cyc2ns_read_end(void)
 {
 	preempt_enable_notrace();
 }
@@ -103,7 +103,7 @@ void cyc2ns_read_end(void)
  *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
 	struct cyc2ns_data data;
 	unsigned long long ns;


That gets us:

native_sched_clock:
	pushq   %rbp    #
	movq    %rsp, %rbp      #,

 ... jump label ...

	rdtsc
	salq    $32, %rdx       #, tmp110
	orq     %rax, %rdx      # low, tmp110
	movl %gs:cyc2ns+32(%rip),%ecx   # cyc2ns.seq.sequence, pfo_ret__
	andl    $1, %ecx        #, idx
	salq    $4, %rcx        #, tmp116
	movl %gs:cyc2ns(%rcx),%esi      # cyc2ns.data[idx_14].cyc2ns_mul, pfo_ret__
	movl    %esi, %esi      # pfo_ret__, pfo_ret__
	movq    %rsi, %rax      # pfo_ret__, tmp133
	mulq    %rdx    # _23
	movq %gs:cyc2ns+8(%rcx),%rdi    # cyc2ns.data[idx_14].cyc2ns_offset, pfo_ret__
	addq    $cyc2ns, %rcx   #, tmp117
	movl %gs:4(%rcx),%ecx   # cyc2ns.data[idx_14].cyc2ns_shift, pfo_ret__
	shrdq   %rdx, %rax      # pfo_ret__,, tmp134
	shrq    %cl, %rdx       # pfo_ret__,
	testb   $64, %cl        #, pfo_ret__
	cmovne  %rdx, %rax      #,, tmp134
	addq    %rdi, %rax      # pfo_ret__, <retval>
	popq    %rbp    #
	ret


If we then fix the percpu mess, with:

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index e9202a0de8f0..1a19d11cfbbd 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -185,22 +185,22 @@ do {									\
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b "__percpu_arg(1)",%0"		\
+		asm volatile(op "b "__percpu_arg(1)",%0"\
 		    : "=q" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 2:						\
-		asm(op "w "__percpu_arg(1)",%0"		\
+		asm volatile(op "w "__percpu_arg(1)",%0"\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 4:						\
-		asm(op "l "__percpu_arg(1)",%0"		\
+		asm volatile(op "l "__percpu_arg(1)",%0"\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 8:						\
-		asm(op "q "__percpu_arg(1)",%0"		\
+		asm volatile(op "q "__percpu_arg(1)",%0"\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\


That turns into:

native_sched_clock:
	pushq   %rbp    #
	movq    %rsp, %rbp      #,

 ... jump label ...

	rdtsc
	salq    $32, %rdx       #, tmp110
	orq     %rax, %rdx      # low, tmp110
	movq    %rdx, %r10      # tmp110, _23
	movq    $cyc2ns, %r9    #, tmp136
.L235:
	movl %gs:cyc2ns+32(%rip),%eax   # cyc2ns.seq.sequence, pfo_ret__
	movl    %eax, %ecx      # pfo_ret__, idx
	andl    $1, %ecx        #, idx
	salq    $4, %rcx        #, tmp116
	addq    %r9, %rcx       # tmp136, tmp117
	movq %gs:8(%rcx),%rdi   # cyc2ns.data[idx_14].cyc2ns_offset, pfo_ret__
	movl %gs:(%rcx),%esi    # cyc2ns.data[idx_14].cyc2ns_mul, pfo_ret__
	movl %gs:4(%rcx),%ecx   # cyc2ns.data[idx_14].cyc2ns_shift, pfo_ret__
	movl %gs:cyc2ns+32(%rip),%r8d   # cyc2ns.seq.sequence, pfo_ret__
	cmpl    %r8d, %eax      # pfo_ret__, pfo_ret__
	jne     .L235   #,
	movl    %esi, %esi      # pfo_ret__, pfo_ret__
	movq    %rsi, %rax      # pfo_ret__, tmp133
	mulq    %r10    # _23
	shrdq   %rdx, %rax      # pfo_ret__,, tmp134
	shrq    %cl, %rdx       # pfo_ret__,
	testb   $64, %cl        #, pfo_ret__
	cmovne  %rdx, %rax      #,, tmp134
	addq    %rdi, %rax      # pfo_ret__, <retval>
	popq    %rbp    #
	ret

which is exactly right. Except perhaps for the mess that
mul_u64_u32_shr() turns into.
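
(For the record, mul_u64_u32_shr() on 64-bit is roughly the following,
sketching the 128-bit variant from include/linux/math64.h, modulo
config options:

	static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
	{
		return (u64)(((unsigned __int128)a * mul) >> shift);
	}

which is where the mulq plus the shrd/shr/cmovne sequence above comes
from: a 64x64 multiply followed by a 128-bit right shift.)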


* Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
  2018-10-11  7:31 ` Peter Zijlstra
  2018-10-11  8:40   ` Peter Zijlstra
@ 2018-10-11 15:17   ` Eric Dumazet
  1 sibling, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2018-10-11 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Eric Dumazet, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Thu, Oct 11, 2018 at 12:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> > While looking at the native_sched_clock() disassembly, I was
> > surprised to see that the compiler (gcc 7.3 here) had
> > optimized out the loop, meaning the code is broken.
> >
> > Using the documented and approved API not only fixes the bug,
> > it also makes the code more readable.
> >
> > Replacing five this_cpu_read() calls with one this_cpu_ptr() makes
> > the generated code smaller.
>
> It does not for me; that is, the resulting asm is actually larger.

[resent in non html]

I counted the number of bytes of text, and really, after my patch the
code size is smaller.

%gs prefixes are one byte each, but the 32-bit offsets add up: the old
%gs:0x0(%rip) load of the sequence encodes in 7 bytes, while the new
0x20(%r8) form needs only 4.

Prior version (with the loop reinstated, thanks to an smp_rmb()):

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index b52bd2b6cdb443ba0c89d78aaa52b02b82a10b6e..d4ad30bbaa0cc8bf81da0272829da26f229734bf
100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -71,7 +71,7 @@ void cyc2ns_read_begin(struct cyc2ns_data *data)
 		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
 		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
 		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
-
+		smp_rmb();
 	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
 }


Total length : 0xA7 bytes

00000000000002a0 <native_sched_clock>:
     2a0: 4c 8d 54 24 08        lea    0x8(%rsp),%r10
     2a5: 48 83 e4 f0          and    $0xfffffffffffffff0,%rsp
     2a9: 41 ff 72 f8          pushq  -0x8(%r10)
     2ad: 55                    push   %rbp
     2ae: 48 89 e5              mov    %rsp,%rbp
     2b1: 41 52                push   %r10
     2b3: e9 60 00 00 00        jmpq   318 <native_sched_clock+0x78>
     2b8: 0f 31                rdtsc
     2ba: 48 c1 e2 20          shl    $0x20,%rdx
     2be: 48 09 c2              or     %rax,%rdx
     2c1: 49 89 d2              mov    %rdx,%r10
     2c4: 49 c7 c1 00 00 00 00 mov    $0x0,%r9
		2c7: R_X86_64_32S .data..percpu..shared_aligned
     2cb: 65 8b 05 00 00 00 00 mov    %gs:0x0(%rip),%eax        # 2d2 <native_sched_clock+0x32>
		2ce: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
     2d2: 89 c1                mov    %eax,%ecx
     2d4: 83 e1 01              and    $0x1,%ecx
     2d7: 48 c1 e1 04          shl    $0x4,%rcx
     2db: 4c 01 c9              add    %r9,%rcx
     2de: 65 48 8b 79 08        mov    %gs:0x8(%rcx),%rdi
     2e3: 65 8b 31              mov    %gs:(%rcx),%esi
     2e6: 65 8b 49 04          mov    %gs:0x4(%rcx),%ecx
     2ea: 65 44 8b 05 00 00 00 mov    %gs:0x0(%rip),%r8d        # 2f2 <native_sched_clock+0x52>
     2f1: 00
		2ee: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
     2f2: 44 39 c0              cmp    %r8d,%eax
     2f5: 75 d4                jne    2cb <native_sched_clock+0x2b>
     2f7: 89 f6                mov    %esi,%esi
     2f9: 48 89 f0              mov    %rsi,%rax
     2fc: 49 f7 e2              mul    %r10
     2ff: 48 0f ad d0          shrd   %cl,%rdx,%rax
     303: 48 d3 ea              shr    %cl,%rdx
     306: f6 c1 40              test   $0x40,%cl
     309: 48 0f 45 c2          cmovne %rdx,%rax
     30d: 48 01 f8              add    %rdi,%rax
     310: 41 5a                pop    %r10
     312: 5d                    pop    %rbp
     313: 49 8d 62 f8          lea    -0x8(%r10),%rsp
     317: c3                    retq
     318: 48 69 05 00 00 00 00 imul   $0x3d0900,0x0(%rip),%rax # 323 <native_sched_clock+0x83>
     31f: 00 09 3d 00
		31b: R_X86_64_PC32 jiffies_64-0x8
     323: 41 5a                pop    %r10
     325: 48 ba 00 b8 64 d9 45 movabs $0xffc2f745d964b800,%rdx
     32c: f7 c2 ff
     32f: 5d                    pop    %rbp
     330: 49 8d 62 f8          lea    -0x8(%r10),%rsp
     334: 48 01 d0              add    %rdx,%rax
     337: c3                    retq

New version (my cleanup patch)

Total length = 0x91 bytes

00000000000002a0 <native_sched_clock>:
     2a0: 4c 8d 54 24 08        lea    0x8(%rsp),%r10
     2a5: 48 83 e4 f0          and    $0xfffffffffffffff0,%rsp
     2a9: 41 ff 72 f8          pushq  -0x8(%r10)
     2ad: 55                    push   %rbp
     2ae: 48 89 e5              mov    %rsp,%rbp
     2b1: 41 52                push   %r10
     2b3: e9 59 00 00 00        jmpq   311 <native_sched_clock+0x71>
     2b8: 0f 31                rdtsc
     2ba: 48 c1 e2 20          shl    $0x20,%rdx
     2be: 48 09 c2              or     %rax,%rdx
     2c1: 49 89 d1              mov    %rdx,%r9
     2c4: 49 c7 c0 00 00 00 00 mov    $0x0,%r8
		2c7: R_X86_64_32S .data..percpu..shared_aligned
     2cb: 65 4c 03 05 00 00 00 add    %gs:0x0(%rip),%r8        # 2d3 <native_sched_clock+0x33>
		2cf: R_X86_64_PC32 this_cpu_off-0x4
     2d2: 00
     2d3: 41 8b 40 20          mov    0x20(%r8),%eax
     2d7: 89 c6                mov    %eax,%esi
     2d9: 83 e6 01              and    $0x1,%esi
     2dc: 48 c1 e6 04          shl    $0x4,%rsi
     2e0: 4c 01 c6              add    %r8,%rsi
     2e3: 8b 3e                mov    (%rsi),%edi
     2e5: 8b 4e 04              mov    0x4(%rsi),%ecx
     2e8: 48 8b 76 08          mov    0x8(%rsi),%rsi
     2ec: 41 3b 40 20          cmp    0x20(%r8),%eax
     2f0: 75 e1                jne    2d3 <native_sched_clock+0x33>
     2f2: 48 89 f8              mov    %rdi,%rax
     2f5: 49 f7 e1              mul    %r9
     2f8: 48 0f ad d0          shrd   %cl,%rdx,%rax
     2fc: 48 d3 ea              shr    %cl,%rdx
     2ff: f6 c1 40              test   $0x40,%cl
     302: 48 0f 45 c2          cmovne %rdx,%rax
     306: 48 01 f0              add    %rsi,%rax
     309: 41 5a                pop    %r10
     30b: 5d                    pop    %rbp
     30c: 49 8d 62 f8          lea    -0x8(%r10),%rsp
     310: c3                    retq
     311: 48 69 05 00 00 00 00 imul   $0x3d0900,0x0(%rip),%rax # 31c <native_sched_clock+0x7c>
     318: 00 09 3d 00
		314: R_X86_64_PC32 jiffies_64-0x8
     31c: 41 5a                pop    %r10
     31e: 48 ba 00 b8 64 d9 45 movabs $0xffc2f745d964b800,%rdx
     325: f7 c2 ff
     328: 5d                    pop    %rbp
     329: 49 8d 62 f8          lea    -0x8(%r10),%rsp
     32d: 48 01 d0              add    %rdx,%rax
     330: c3                    retq


* Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
       [not found]     ` <CANn89iJdztFEMzeCoXu9J4X6roKKx+fuPkwzi9N9G7aiO=_=FQ@mail.gmail.com>
@ 2018-10-11 15:24       ` Peter Zijlstra
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Zijlstra @ 2018-10-11 15:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Eric Dumazet, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov

On Thu, Oct 11, 2018 at 08:00:42AM -0700, Eric Dumazet wrote:
> Yes, but the code size is bigger (I have looked at the disassembly).
> 
> All these %gs-plus-offset encodings add up.

> Total length : 0xA7 bytes

effective length: 0x78 bytes

> 00000000000002a0 <native_sched_clock>:
>      2a0: 4c 8d 54 24 08        lea    0x8(%rsp),%r10
>      2a5: 48 83 e4 f0          and    $0xfffffffffffffff0,%rsp
>      2a9: 41 ff 72 f8          pushq  -0x8(%r10)
>      2ad: 55                    push   %rbp
>      2ae: 48 89 e5              mov    %rsp,%rbp
>      2b1: 41 52                push   %r10

       2b3: 66 66 66 66 90	k8_nop5_atomic

>      2b8: 0f 31                rdtsc
>      2ba: 48 c1 e2 20          shl    $0x20,%rdx
>      2be: 48 09 c2              or     %rax,%rdx
>      2c1: 49 89 d2              mov    %rdx,%r10
>      2c4: 49 c7 c1 00 00 00 00 mov    $0x0,%r9
> 		2c7: R_X86_64_32S .data..percpu..shared_aligned
>      2cb: 65 8b 05 00 00 00 00 mov    %gs:0x0(%rip),%eax        # 2d2 <native_sched_clock+0x32>
> 		2ce: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
>      2d2: 89 c1                mov    %eax,%ecx
>      2d4: 83 e1 01              and    $0x1,%ecx
>      2d7: 48 c1 e1 04          shl    $0x4,%rcx
>      2db: 4c 01 c9              add    %r9,%rcx
>      2de: 65 48 8b 79 08        mov    %gs:0x8(%rcx),%rdi
>      2e3: 65 8b 31              mov    %gs:(%rcx),%esi
>      2e6: 65 8b 49 04          mov    %gs:0x4(%rcx),%ecx
>      2ea: 65 44 8b 05 00 00 00 mov    %gs:0x0(%rip),%r8d        # 2f2 <native_sched_clock+0x52>
>      2f1: 00
> 		2ee: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
>      2f2: 44 39 c0              cmp    %r8d,%eax
>      2f5: 75 d4                jne    2cb <native_sched_clock+0x2b>
>      2f7: 89 f6                mov    %esi,%esi
>      2f9: 48 89 f0              mov    %rsi,%rax
>      2fc: 49 f7 e2              mul    %r10
>      2ff: 48 0f ad d0          shrd   %cl,%rdx,%rax
>      303: 48 d3 ea              shr    %cl,%rdx
>      306: f6 c1 40              test   $0x40,%cl
>      309: 48 0f 45 c2          cmovne %rdx,%rax
>      30d: 48 01 f8              add    %rdi,%rax
>      310: 41 5a                pop    %r10
>      312: 5d                    pop    %rbp
>      313: 49 8d 62 f8          lea    -0x8(%r10),%rsp
>      317: c3                    retq
> 
> New version :
> 
> Total length = 0x91 bytes

effective: 0x71

> 
> 00000000000002a0 <native_sched_clock>:
>      2a0: 4c 8d 54 24 08        lea    0x8(%rsp),%r10
>      2a5: 48 83 e4 f0          and    $0xfffffffffffffff0,%rsp
>      2a9: 41 ff 72 f8          pushq  -0x8(%r10)
>      2ad: 55                    push   %rbp
>      2ae: 48 89 e5              mov    %rsp,%rbp
>      2b1: 41 52                push   %r10

       2b3: 66 66 66 66 90	k8_nop5_atomic

>      2b8: 0f 31                rdtsc
>      2ba: 48 c1 e2 20          shl    $0x20,%rdx
>      2be: 48 09 c2              or     %rax,%rdx
>      2c1: 49 89 d1              mov    %rdx,%r9
>      2c4: 49 c7 c0 00 00 00 00 mov    $0x0,%r8
>		2c7: R_X86_64_32S .data..percpu..shared_aligned
>      2cb: 65 4c 03 05 00 00 00 add    %gs:0x0(%rip),%r8        # 2d3 <native_sched_clock+0x33>
>      2d2: 00
>		2cf: R_X86_64_PC32 this_cpu_off-0x4
>      2d3: 41 8b 40 20          mov    0x20(%r8),%eax
>      2d7: 89 c6                mov    %eax,%esi
>      2d9: 83 e6 01              and    $0x1,%esi
>      2dc: 48 c1 e6 04          shl    $0x4,%rsi
>      2e0: 4c 01 c6              add    %r8,%rsi
>      2e3: 8b 3e                mov    (%rsi),%edi
>      2e5: 8b 4e 04              mov    0x4(%rsi),%ecx
>      2e8: 48 8b 76 08          mov    0x8(%rsi),%rsi
>      2ec: 41 3b 40 20          cmp    0x20(%r8),%eax
>      2f0: 75 e1                jne    2d3 <native_sched_clock+0x33>
>      2f2: 48 89 f8              mov    %rdi,%rax
>      2f5: 49 f7 e1              mul    %r9
>      2f8: 48 0f ad d0          shrd   %cl,%rdx,%rax
>      2fc: 48 d3 ea              shr    %cl,%rdx
>      2ff: f6 c1 40              test   $0x40,%cl
>      302: 48 0f 45 c2          cmovne %rdx,%rax
>      306: 48 01 f0              add    %rsi,%rax
>      309: 41 5a                pop    %r10
>      30b: 5d                    pop    %rbp
>      30c: 49 8d 62 f8          lea    -0x8(%r10),%rsp
>      310: c3                    retq

Ah, right you are. But my version only touches the one cacheline,
whereas yours will do that extra cpu offset load, which might or might
not be hot.

Difficult..

You have some weird stack setup though; mine doesn't have that:

$ objdump -dr defconfig-build/arch/x86/kernel/tsc.o | awk '/<native_sched_clock>:$/ {p=1} /^$/ {p=0} {if (p) print $0}'
0000000000000b00 <native_sched_clock>:
b00:   55                      push   %rbp
b01:   48 89 e5                mov    %rsp,%rbp
b04:   66 66 66 66 90          k8_nop5_atomic
b09:   0f 31                   rdtsc
b0b:   48 c1 e2 20             shl    $0x20,%rdx
b0f:   48 09 c2                or     %rax,%rdx
b12:   49 89 d2                mov    %rdx,%r10
b15:   49 c7 c1 00 00 00 00    mov    $0x0,%r9
		b18: R_X86_64_32S       .data..percpu..shared_aligned
b1c:   65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # b23 <native_sched_clock+0x23>
		b1f: R_X86_64_PC32      .data..percpu..shared_aligned+0x1c
b23:   89 c1                   mov    %eax,%ecx
b25:   83 e1 01                and    $0x1,%ecx
b28:   48 c1 e1 04             shl    $0x4,%rcx
b2c:   4c 01 c9                add    %r9,%rcx
b2f:   65 48 8b 79 08          mov    %gs:0x8(%rcx),%rdi
b34:   65 8b 31                mov    %gs:(%rcx),%esi
b37:   65 8b 49 04             mov    %gs:0x4(%rcx),%ecx
b3b:   65 44 8b 05 00 00 00    mov    %gs:0x0(%rip),%r8d        # b43 <native_sched_clock+0x43>
b42:   00
		b3f: R_X86_64_PC32      .data..percpu..shared_aligned+0x1c
b43:   44 39 c0                cmp    %r8d,%eax
b46:   75 d4                   jne    b1c <native_sched_clock+0x1c>
b48:   89 f6                   mov    %esi,%esi
b4a:   48 89 f0                mov    %rsi,%rax
b4d:   49 f7 e2                mul    %r10
b50:   48 0f ad d0             shrd   %cl,%rdx,%rax
b54:   48 d3 ea                shr    %cl,%rdx
b57:   f6 c1 40                test   $0x40,%cl
b5a:   48 0f 45 c2             cmovne %rdx,%rax
b5e:   48 01 f8                add    %rdi,%rax
b61:   5d                      pop    %rbp
b62:   c3                      retq

Which gets me to 0x62 effective bytes.





* Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
  2018-10-11  0:33 [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin() Eric Dumazet
  2018-10-11  7:31 ` Peter Zijlstra
@ 2018-10-11 23:25 ` kbuild test robot
  1 sibling, 0 replies; 6+ messages in thread
From: kbuild test robot @ 2018-10-11 23:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kbuild-all, linux-kernel, Eric Dumazet, Eric Dumazet,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 2298 bytes --]

Hi Eric,

I love your patch! Perhaps something to improve:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on v4.19-rc7 next-20181011]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/x86-tsc-use-real-seqcount_latch-in-cyc2ns_read_begin/20181012-065625
config: x86_64-randconfig-x000-201840 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   arch/x86/kernel/tsc.c: In function 'cyc2ns_read_begin':
>> arch/x86/kernel/tsc.c:69:33: warning: passing argument 1 of 'raw_read_seqcount_latch' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
      seq = raw_read_seqcount_latch(&c2n->seq);
                                    ^
   In file included from include/linux/time.h:6:0,
                    from include/linux/ktime.h:24,
                    from include/linux/timer.h:6,
                    from include/linux/workqueue.h:9,
                    from include/linux/rhashtable-types.h:15,
                    from include/linux/ipc.h:7,
                    from include/uapi/linux/sem.h:5,
                    from include/linux/sem.h:5,
                    from include/linux/sched.h:15,
                    from arch/x86/kernel/tsc.c:4:
   include/linux/seqlock.h:279:19: note: expected 'seqcount_t * {aka struct seqcount *}' but argument is of type 'const seqcount_t * {aka const struct seqcount *}'
    static inline int raw_read_seqcount_latch(seqcount_t *s)
                      ^~~~~~~~~~~~~~~~~~~~~~~

vim +69 arch/x86/kernel/tsc.c

    59	
    60	void cyc2ns_read_begin(struct cyc2ns_data *data)
    61	{
    62		const struct cyc2ns *c2n;
    63		int seq;
    64	
    65		preempt_disable_notrace();
    66	
    67		c2n = this_cpu_ptr(&cyc2ns);
    68		do {
  > 69			seq = raw_read_seqcount_latch(&c2n->seq);
    70			*data = c2n->data[seq & 1];
    71		} while (read_seqcount_retry(&c2n->seq, seq));
    72	}
    73	
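
A respin along these lines would avoid the warning (a sketch only; no
v2 appears in this thread): drop the const qualifier so the pointer
type matches the seqcount_t * that raw_read_seqcount_latch() expects:

	void cyc2ns_read_begin(struct cyc2ns_data *data)
	{
		struct cyc2ns *c2n;	/* non-const for raw_read_seqcount_latch() */
		int seq;

		preempt_disable_notrace();

		c2n = this_cpu_ptr(&cyc2ns);
		do {
			seq = raw_read_seqcount_latch(&c2n->seq);
			*data = c2n->data[seq & 1];
		} while (read_seqcount_retry(&c2n->seq, seq));
	}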

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29762 bytes --]

