* re: locking/atomic: Introduce atomic_try_cmpxchg()
@ 2017-03-24 12:44 Dmitry Vyukov
  2017-03-24 14:21 ` Peter Zijlstra
  2017-03-27 12:16 ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Dmitry Vyukov @ 2017-03-24 12:44 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

Hi,

I've come across:

commit a9ebf306f52c756c4f9e50ee9a60cd6389d71344
Author: Peter Zijlstra
Date:   Wed Feb 1 16:39:38 2017 +0100
    locking/atomic: Introduce atomic_try_cmpxchg()

The primitive has a subtle difference from every other implementation that
I know of, and it can lead to very subtle bugs. Some time ago I spent
several days debugging a memory corruption caused by a similar
implementation. Consider a classical lock-free stack push:

node->next = atomic_read(&head);
do {
} while (!atomic_try_cmpxchg(&head, &node->next, node));

This code is broken with the current implementation; the problem is
the unconditional update of *__po here:

#define __atomic64_try_cmpxchg(type, _p, _po, _n)                       \
({                                                                      \
        typeof(_po) __po = (_po);                                       \
        typeof(*(_po)) __o = *__po;                                     \
        *__po = atomic64_cmpxchg##type((_p), __o, (_n));                \
        (*__po == __o);                                                 \
})

In the success case it writes the same value back into *__po, but by the
time the cmpxchg has succeeded we may already have lost ownership of some
memory locations, potentially including the memory that __po points to.
The same holds for the re-read of *__po.
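
For contrast, the same push is safe when written against the C11 atomics,
because their compare-exchange rewrites the expected value only on failure.
A minimal sketch:

#include <stdatomic.h>

struct node { struct node *next; };
static struct node *_Atomic head;

static void push(struct node *node)
{
	node->next = atomic_load(&head);
	/* on success node->next is left alone; it is rewritten only when
	 * the compare-exchange fails and we retry with the fresh head */
	while (!atomic_compare_exchange_weak(&head, &node->next, node))
		;
}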

The problem is very easy to overlook in user code, especially when the
developer is already used to the different semantics of similar functions
(e.g. the C/C++ atomic operations, or the gcc/clang atomic builtins). For
this reason, cmpxchg implementations usually update the comparand only on
failure. So I would suggest changing it to a safer and less surprising
alternative:

diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index fb961db51a2a..81fb985f51f4 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
        default:                                                        \
                __cmpxchg_wrong_size();                                 \
        }                                                               \
-       *_old = __old;                                                  \
+       if (!success)                                                   \
+               *_old = __old;                                          \
        success;                                                        \
 })

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index aae5953817d6..f8098157f7c8 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -1023,8 +1023,11 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 ({                                                                     \
        typeof(_po) __po = (_po);                                       \
        typeof(*(_po)) __o = *__po;                                     \
-       *__po = atomic64_cmpxchg##type((_p), __o, (_n));                \
-       (*__po == __o);                                                 \
+       typeof(*(_po)) __v = atomic64_cmpxchg##type((_p), __o, (_n));   \
+       if (__v == __o)                                                 \
+               return true;                                            \
+       *__po = __v;                                                    \
+       return false;                                                   \
 })

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 12:44 locking/atomic: Introduce atomic_try_cmpxchg() Dmitry Vyukov
@ 2017-03-24 14:21 ` Peter Zijlstra
  2017-03-24 14:23   ` Dmitry Vyukov
  2017-03-24 16:41   ` Peter Zijlstra
  2017-03-27 12:16 ` Peter Zijlstra
  1 sibling, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 14:21 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf, Linus Torvalds,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
> 
> The primitive has subtle difference with all other implementation that
> I know of, and can lead to very subtle bugs. Some time ago I've spent
> several days debugging a memory corruption caused by similar
> implementation. Consider a classical lock-free stack push:
> 
> node->next = atomic_read(&head);
> do {
> } while (!atomic_try_cmpxchg(&head, &node->next, node));
> 
> This code is broken with the current implementation, the problem is
> with unconditional update of *__po here:

Indeed. I had only considered stack local variables when I wrote that.

> So I would suggest to change it to a safer and less surprising
> alternative:
> 
> diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
> index fb961db51a2a..81fb985f51f4 100644
> --- a/arch/x86/include/asm/cmpxchg.h
> +++ b/arch/x86/include/asm/cmpxchg.h
> @@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
>         default:                                                        \
>                 __cmpxchg_wrong_size();                                 \
>         }                                                               \
> -       *_old = __old;                                                  \
> +       if (!success)                                                   \
> +               *_old = __old;                                          \
>         success;                                                        \
>  })

I've no immediate objection, I'll double check what, if anything, it
does for code gen.

> diff --git a/include/linux/atomic.h b/include/linux/atomic.h
> index aae5953817d6..f8098157f7c8 100644
> --- a/include/linux/atomic.h
> +++ b/include/linux/atomic.h
> @@ -1023,8 +1023,11 @@ static inline int atomic_dec_if_positive(atomic_t *v)
>  ({                                                                     \
>         typeof(_po) __po = (_po);                                       \
>         typeof(*(_po)) __o = *__po;                                     \
> -       *__po = atomic64_cmpxchg##type((_p), __o, (_n));                \
> -       (*__po == __o);                                                 \
> +       typeof(*(_po)) __v = atomic64_cmpxchg##type((_p), __o, (_n));   \
> +       if (__v == __o)                                                 \
> +               return true;                                            \
> +       *__po = __v;                                                    \
> +       return false;                                                   \
>  })

Can you actually use return in statement-expressions?

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 14:21 ` Peter Zijlstra
@ 2017-03-24 14:23   ` Dmitry Vyukov
  2017-03-24 16:41   ` Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Dmitry Vyukov @ 2017-03-24 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf, Linus Torvalds,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Fri, Mar 24, 2017 at 3:21 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
>>
>> The primitive has subtle difference with all other implementation that
>> I know of, and can lead to very subtle bugs. Some time ago I've spent
>> several days debugging a memory corruption caused by similar
>> implementation. Consider a classical lock-free stack push:
>>
>> node->next = atomic_read(&head);
>> do {
>> } while (!atomic_try_cmpxchg(&head, &node->next, node));
>>
>> This code is broken with the current implementation, the problem is
>> with unconditional update of *__po here:
>
> Indeed. I had only considered stack local variables when I wrote that.
>
>> So I would suggest to change it to a safer and less surprising
>> alternative:
>>
>> diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
>> index fb961db51a2a..81fb985f51f4 100644
>> --- a/arch/x86/include/asm/cmpxchg.h
>> +++ b/arch/x86/include/asm/cmpxchg.h
>> @@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
>>         default:                                                        \
>>                 __cmpxchg_wrong_size();                                 \
>>         }                                                               \
>> -       *_old = __old;                                                  \
>> +       if (!success)                                                   \
>> +               *_old = __old;                                          \
>>         success;                                                        \
>>  })
>
> I've no immediate objection, I'll double check what, if anything, it
> does for code gen.
>
>> diff --git a/include/linux/atomic.h b/include/linux/atomic.h
>> index aae5953817d6..f8098157f7c8 100644
>> --- a/include/linux/atomic.h
>> +++ b/include/linux/atomic.h
>> @@ -1023,8 +1023,11 @@ static inline int atomic_dec_if_positive(atomic_t *v)
>>  ({                                                                     \
>>         typeof(_po) __po = (_po);                                       \
>>         typeof(*(_po)) __o = *__po;                                     \
>> -       *__po = atomic64_cmpxchg##type((_p), __o, (_n));                \
>> -       (*__po == __o);                                                 \
>> +       typeof(*(_po)) __v = atomic64_cmpxchg##type((_p), __o, (_n));   \
>> +       if (__v == __o)                                                 \
>> +               return true;                                            \
>> +       *__po = __v;                                                    \
>> +       return false;                                                   \
>>  })
>
> Can you actually use return in statement-expressions?

Dunno. Just wanted to show the idea.

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 14:21 ` Peter Zijlstra
  2017-03-24 14:23   ` Dmitry Vyukov
@ 2017-03-24 16:41   ` Peter Zijlstra
  2017-03-24 16:54     ` Andy Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 16:41 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf, Linus Torvalds,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Fri, Mar 24, 2017 at 03:21:40PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:

> > So I would suggest to change it to a safer and less surprising
> > alternative:
> > 
> > diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
> > index fb961db51a2a..81fb985f51f4 100644
> > --- a/arch/x86/include/asm/cmpxchg.h
> > +++ b/arch/x86/include/asm/cmpxchg.h
> > @@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
> >         default:                                                        \
> >                 __cmpxchg_wrong_size();                                 \
> >         }                                                               \
> > -       *_old = __old;                                                  \
> > +       if (!success)                                                   \
> > +               *_old = __old;                                          \
> >         success;                                                        \
> >  })
> 
> I've no immediate objection, I'll double check what, if anything, it
> does for code gen.

So the first snippet I tested regressed like so:


0000000000000000 <T_refcount_inc>:				0000000000000000 <T_refcount_inc>:
   0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
   2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
   5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
   7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
   9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
   b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
   e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
  12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
  14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
  16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
  18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
  1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
                                                                  1c:   75 03                   jne    21 <T_refcount_inc+0x21>
                                                                  1e:   0f 0b                   ud2
                                                                  20:   c3                      retq
                                                                  21:   c3                      retq

Which is rather unfortunate...

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 16:41   ` Peter Zijlstra
@ 2017-03-24 16:54     ` Andy Lutomirski
  2017-03-24 17:23       ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2017-03-24 16:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Vyukov, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 9:41 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 03:21:40PM +0100, Peter Zijlstra wrote:
>> On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
>
>> > So I would suggest to change it to a safer and less surprising
>> > alternative:
>> >
>> > diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
>> > index fb961db51a2a..81fb985f51f4 100644
>> > --- a/arch/x86/include/asm/cmpxchg.h
>> > +++ b/arch/x86/include/asm/cmpxchg.h
>> > @@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
>> >         default:                                                        \
>> >                 __cmpxchg_wrong_size();                                 \
>> >         }                                                               \
>> > -       *_old = __old;                                                  \
>> > +       if (!success)                                                   \
>> > +               *_old = __old;                                          \
>> >         success;                                                        \
>> >  })
>>
>> I've no immediate objection, I'll double check what, if anything, it
>> does for code gen.
>
> So the first snipped I tested regressed like so:
>
>
> 0000000000000000 <T_refcount_inc>:                              0000000000000000 <T_refcount_inc>:
>    0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
>    2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
>    5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
>    7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
>    9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
>    b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
>    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
>   12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
>   14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
>   16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
>   18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
>   1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
>                                                                   1c:   75 03                   jne    21 <T_refcount_inc+0x21>
>                                                                   1e:   0f 0b                   ud2
>                                                                   20:   c3                      retq
>                                                                   21:   c3                      retq

Can you re-send the better asm you got earlier?

If I pretend to be a dumb compiler, I wonder if you'd get better results with:

if (!success) {
  *_old = __old;
  return false;
} else {
  return true;
}

or however you jam that into a statement expression.  That way you
aren't relying on the compiler to merge the branches.
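
Roughly, as a statement expression it might look like this (just a sketch,
written against the generic cmpxchg() so it stays legal without a return):

#define __try_cmpxchg(_ptr, _pold, _new)				\
({									\
	__typeof__(_pold) ___pold = (_pold);				\
	__typeof__(*(_ptr)) ___old = *___pold;				\
	__typeof__(*(_ptr)) ___ret = cmpxchg((_ptr), ___old, (_new));	\
	bool ___ok = (___ret == ___old);				\
	if (!___ok)							\
		*___pold = ___ret;	/* update comparand on failure only */ \
	___ok;								\
})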

>
> Which is rather unfortunate...



-- 
Andy Lutomirski
AMA Capital Management, LLC

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 16:54     ` Andy Lutomirski
@ 2017-03-24 17:23       ` Peter Zijlstra
  2017-03-24 17:51         ` Dmitry Vyukov
                           ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 17:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Vyukov, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
> > So the first snipped I tested regressed like so:
> >
> >
> > 0000000000000000 <T_refcount_inc>:                              0000000000000000 <T_refcount_inc>:
> >    0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
> >    2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
> >    5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
> >    7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
> >    9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
> >    b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
> >    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
> >   12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
> >   14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
> >   16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
> >   18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
> >   1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
> >                                                                   1c:   75 03                   jne    21 <T_refcount_inc+0x21>
> >                                                                   1e:   0f 0b                   ud2
> >                                                                   20:   c3                      retq
> >                                                                   21:   c3                      retq
> 
> Can you re-send the better asm you got earlier?

On the left?

> If I pretend to be a dumb compiler, I wonder if you'd get better results with:
> 
> if (!success) {
>   *_old = __old;
>   return false;
> } else {
>   return true;
> }
> 
> or however you jam that into a statement expression.  That way you
> aren't relying on the compiler to merge the branches.

I tried a few variants, but nothing really made it better.

Find the tiny.c file below; I'm using:

  gcc (Debian 6.3.0-5) 6.3.0 20170124

it has both an inline and a stmt-expr try_cmpxchg variant to play with;
the 'expected' output is at the bottom (same as above left).

Note that clang doesn't compile this stuff due to missing features.

---

/* gcc -Os -std=gnu99 -fno-strict-overflow -falign-jumps=1 -falign-loops=1 -c tiny.c; objdump -dr tiny.o */

typedef _Bool bool;

static inline bool try_cmpxchg(unsigned int *ptr, unsigned int *val, unsigned int new)
{
	unsigned int old = *val;
	bool success;

	asm volatile("lock cmpxchgl %[new], %[ptr]"
		     : "=@ccz" (success),
		       [ptr] "+m" (*ptr),
		       [old] "+a" (old)
		     : [new] "r" (new)
		     : "memory");

//	if (!success)
		*val = old;

	return success;
}

#define __try_cmpxchg(_ptr, _pold, _new)				\
({									\
 	bool success;							\
	__typeof__(_ptr) _old = (_pold);				\
	__typeof__(*(_ptr)) __old = *_old;				\
	__typeof__(*(_ptr)) __new = (_new);				\
	volatile unsigned int * __ptr = (volatile unsigned int *)(_ptr);\
									\
	asm volatile("lock cmpxchgl %[new], %[ptr]"			\
		     : "=@ccz" (success),				\
		       [ptr] "+m" (*__ptr),				\
		       [old] "+a" (__old)				\
		     : [new] "r" (__new)				\
		     : "memory");					\
	/* if (!success) */							\
		*_old = __old;						\
	success;							\
})

#define EXCEPTION_VALUE(val, handler)	asm volatile ("ud2 # %0" : : "r" (val))

#define UINT_MAX	(~0U)

#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)    __builtin_expect(!!(x), 0)

static inline void refcount_inc(unsigned int *r)
{
	unsigned int new, val = *(unsigned int volatile *)r;

	do {
		if (unlikely(val == UINT_MAX)) /* saturated */
			return;

		if (unlikely(!val)) /* use-after-free */
			goto exception;

		/* cannot overflow because we already checked UINT_MAX */
		new = val + 1;

	} while (unlikely(!try_cmpxchg(r, &val, new)));

	if (unlikely(new == UINT_MAX))
exception:	EXCEPTION_VALUE(val, __refcount_warn);
}

void T_refcount_inc(unsigned int *r)
{
	refcount_inc(r);
}

/*
   0:   8b 07                   mov    (%rdi),%eax
   2:   83 f8 ff                cmp    $0xffffffff,%eax
   5:   74 13                   je     1a <T_refcount_inc+0x1a>
   7:   85 c0                   test   %eax,%eax
   9:   74 0d                   je     18 <T_refcount_inc+0x18>
   b:   8d 50 01                lea    0x1(%rax),%edx
   e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
  12:   75 ee                   jne    2 <T_refcount_inc+0x2>
  14:   ff c2                   inc    %edx
  16:   75 02                   jne    1a <T_refcount_inc+0x1a>
  18:   0f 0b                   ud2
  1a:   c3                      retq
*/

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 17:23       ` Peter Zijlstra
@ 2017-03-24 17:51         ` Dmitry Vyukov
  2017-03-24 18:08           ` Peter Zijlstra
  2017-03-24 18:00         ` Peter Zijlstra
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Dmitry Vyukov @ 2017-03-24 17:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
>> > So the first snipped I tested regressed like so:
>> >
>> >
>> > 0000000000000000 <T_refcount_inc>:                              0000000000000000 <T_refcount_inc>:
>> >    0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
>> >    2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
>> >    5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
>> >    7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
>> >    9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
>> >    b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
>> >    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
>> >   12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
>> >   14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
>> >   16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
>> >   18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
>> >   1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
>> >                                                                   1c:   75 03                   jne    21 <T_refcount_inc+0x21>
>> >                                                                   1e:   0f 0b                   ud2
>> >                                                                   20:   c3                      retq
>> >                                                                   21:   c3                      retq
>>
>> Can you re-send the better asm you got earlier?
>
> On the left?
>
>> If I pretend to be a dumb compiler, I wonder if you'd get better results with:
>>
>> if (!success) {
>>   *_old = __old;
>>   return false;
>> } else {
>>   return true;
>> }
>>
>> or however you jam that into a statement expression.  That way you
>> aren't relying on the compiler to merge the branches.
>
> I tried a few variants, but nothing really made it better.
>
> Find the tiny.c file below; I'm using:
>
>   gcc (Debian 6.3.0-5) 6.3.0 20170124
>
> it has both an inline and an stmt-expr try_cmpxchg variant to play with;
> the 'expected' output is at the bottom (same as above left).
>
> Note that clang doesn't compile this stuff due to missing features.
>
> ---
>
> /* gcc -Os -std=gnu99 -fno-strict-overflow -falign-jumps=1 -falign-loops=1 -c tiny.c; objdump -dr tiny.o */
>
> typedef _Bool bool;
>
> static inline bool try_cmpxchg(unsigned int *ptr, unsigned int *val, unsigned int new)
> {
>         unsigned int old = *val;
>         bool success;
>
>         asm volatile("lock cmpxchgl %[new], %[ptr]"
>                      : "=@ccz" (success),
>                        [ptr] "+m" (*ptr),
>                        [old] "+a" (old)
>                      : [new] "r" (new)
>                      : "memory");
>
> //      if (!success)
>                 *val = old;
>
>         return success;
> }
>
> #define __try_cmpxchg(_ptr, _pold, _new)                                \
> ({                                                                      \
>         bool success;                                                   \
>         __typeof__(_ptr) _old = (_pold);                                \
>         __typeof__(*(_ptr)) __old = *_old;                              \
>         __typeof__(*(_ptr)) __new = (_new);                             \
>         volatile unsigned int * __ptr = (volatile unsigned int *)(_ptr);\
>                                                                         \
>         asm volatile("lock cmpxchgl %[new], %[ptr]"                     \
>                      : "=@ccz" (success),                               \
>                        [ptr] "+m" (*__ptr),                             \
>                        [old] "+a" (__old)                               \
>                      : [new] "r" (__new)                                \
>                      : "memory");                                       \
>         /* if (!success) */                                                     \
>                 *_old = __old;                                          \
>         success;                                                        \
> })
>
> #define EXCEPTION_VALUE(val, handler)   asm volatile ("ud2 # %0" : : "r" (val))
>
> #define UINT_MAX        (~0U)
>
> #define likely(x)    __builtin_expect(!!(x), 1)
> #define unlikely(x)    __builtin_expect(!!(x), 0)
>
> static inline void refcount_inc(unsigned int *r)
> {
>         unsigned int new, val = *(unsigned int volatile *)r;
>
>         do {
>                 if (unlikely(val == UINT_MAX)) /* saturated */
>                         return;
>
>                 if (unlikely(!val)) /* use-after-free */
>                         goto exception;
>
>                 /* cannot overflow because we already checked UINT_MAX */
>                 new = val + 1;
>
>         } while (unlikely(!try_cmpxchg(r, &val, new)));
>
>         if (unlikely(new == UINT_MAX))
> exception:      EXCEPTION_VALUE(val, __refcount_warn);
> }
>
> void T_refcount_inc(unsigned int *r)
> {
>         refcount_inc(r);
> }
>
> /*
>    0:   8b 07                   mov    (%rdi),%eax
>    2:   83 f8 ff                cmp    $0xffffffff,%eax
>    5:   74 13                   je     1a <T_refcount_inc+0x1a>
>    7:   85 c0                   test   %eax,%eax
>    9:   74 0d                   je     18 <T_refcount_inc+0x18>
>    b:   8d 50 01                lea    0x1(%rax),%edx
>    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
>   12:   75 ee                   jne    2 <T_refcount_inc+0x2>
>   14:   ff c2                   inc    %edx
>   16:   75 02                   jne    1a <T_refcount_inc+0x1a>
>   18:   0f 0b                   ud2
>   1a:   c3                      retq
> */



This seems to help ;)

#define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 17:23       ` Peter Zijlstra
  2017-03-24 17:51         ` Dmitry Vyukov
@ 2017-03-24 18:00         ` Peter Zijlstra
  2017-03-24 18:04           ` Peter Zijlstra
  2017-03-24 18:45         ` Andy Lutomirski
  2017-03-24 19:08         ` Linus Torvalds
  3 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 18:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Vyukov, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 06:23:42PM +0100, Peter Zijlstra wrote:
> I tried a few variants, but nothing really made it better.
> 
> Find the tiny.c file below; I'm using:
> 
>   gcc (Debian 6.3.0-5) 6.3.0 20170124
> 
> it has both an inline and an stmt-expr try_cmpxchg variant to play with;
> the 'expected' output is at the bottom (same as above left).
> 
> Note that clang doesn't compile this stuff due to missing features.
> 

static inline bool try_cmpxchg2(unsigned int *ptr, unsigned int *val, unsigned int new)
{
	unsigned int old = *val;
	bool success;

        asm volatile goto("lock cmpxchgl %[new], %[ptr]; "
                          "jnz %[label]"
                     : /* no output */
                     : [ptr] "+m" (*ptr),
                       [old] "+a" (old)
                       [new] "r" (new)
                     : "memory"
                     : label);
        return 1;

label:
	*val = old;
	return 0;
}


generates better code than the @cc output with extra if (!success), but
not as good as the original.

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 18:00         ` Peter Zijlstra
@ 2017-03-24 18:04           ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 18:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Vyukov, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 07:00:25PM +0100, Peter Zijlstra wrote:
> static inline bool try_cmpxchg2(unsigned int *ptr, unsigned int *val, unsigned int new)
> {
> 	unsigned int old = *val;
> 	bool success;
> 
>         asm volatile goto("lock cmpxchgl %[new], %[ptr]; "
>                           "jnz %[label]"
>                      : /* no output */
>                      : [ptr] "+m" (*ptr),
>                        [old] "+a" (old)
>                        [new] "r" (new)
>                      : "memory"
>                      : label);
>         return 1;
> 
> label:
> 	*val = old;
> 	return 0;
> }

N/m, I'm apparently so tired I didn't see the compiler errors :/

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 17:51         ` Dmitry Vyukov
@ 2017-03-24 18:08           ` Peter Zijlstra
  2017-03-24 18:13             ` Peter Zijlstra
  2017-03-24 18:16             ` Dmitry Vyukov
  0 siblings, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 18:08 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andy Lutomirski, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 06:51:15PM +0100, Dmitry Vyukov wrote:
> On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
> >> > So the first snipped I tested regressed like so:
> >> >
> >> >
> >> > 0000000000000000 <T_refcount_inc>:                              0000000000000000 <T_refcount_inc>:
> >> >    0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
> >> >    2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
> >> >    5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
> >> >    7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
> >> >    9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
> >> >    b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
> >> >    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
> >> >   12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
> >> >   14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
> >> >   16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
> >> >   18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
> >> >   1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
> >> >                                                                   1c:   75 03                   jne    21 <T_refcount_inc+0x21>
> >> >                                                                   1e:   0f 0b                   ud2
> >> >                                                                   20:   c3                      retq
> >> >                                                                   21:   c3                      retq
> >>

> This seems to help ;)
> 
> #define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)

That gets me:

0000000000000000 <T_refcount_inc>:
   0:   8b 07                   mov    (%rdi),%eax
   2:   89 44 24 fc             mov    %eax,-0x4(%rsp)
   6:   8b 44 24 fc             mov    -0x4(%rsp),%eax
   a:   83 f8 ff                cmp    $0xffffffff,%eax
   d:   74 1c                   je     2b <T_refcount_inc+0x2b>
   f:   85 c0                   test   %eax,%eax
  11:   75 07                   jne    1a <T_refcount_inc+0x1a>
  13:   8b 44 24 fc             mov    -0x4(%rsp),%eax
  17:   0f 0b                   ud2    
  19:   c3                      retq   
  1a:   8d 50 01                lea    0x1(%rax),%edx
  1d:   8b 44 24 fc             mov    -0x4(%rsp),%eax
  21:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
  25:   75 db                   jne    2 <T_refcount_inc+0x2>
  27:   ff c2                   inc    %edx
  29:   74 e8                   je     13 <T_refcount_inc+0x13>
  2b:   c3                      retq  


Which is even worse... (I did double check it actually compiled)

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 18:08           ` Peter Zijlstra
@ 2017-03-24 18:13             ` Peter Zijlstra
  2017-03-24 19:16               ` Andy Lutomirski
  2017-03-24 18:16             ` Dmitry Vyukov
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 18:13 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andy Lutomirski, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 07:08:38PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 24, 2017 at 06:51:15PM +0100, Dmitry Vyukov wrote:
> > On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
> > >> > So the first snipped I tested regressed like so:
> > >> >
> > >> >
> > >> > 0000000000000000 <T_refcount_inc>:                              0000000000000000 <T_refcount_inc>:
> > >> >    0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
> > >> >    2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
> > >> >    5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
> > >> >    7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
> > >> >    9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
> > >> >    b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
> > >> >    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
> > >> >   12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
> > >> >   14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
> > >> >   16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
> > >> >   18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
> > >> >   1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
> > >> >                                                                   1c:   75 03                   jne    21 <T_refcount_inc+0x21>
> > >> >                                                                   1e:   0f 0b                   ud2
> > >> >                                                                   20:   c3                      retq
> > >> >                                                                   21:   c3                      retq
> > >>
> 
> > This seems to help ;)
> > 
> > #define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
> 
> That gets me:
> 
> 0000000000000000 <T_refcount_inc>:
>    0:   8b 07                   mov    (%rdi),%eax
>    2:   89 44 24 fc             mov    %eax,-0x4(%rsp)
>    6:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>    a:   83 f8 ff                cmp    $0xffffffff,%eax
>    d:   74 1c                   je     2b <T_refcount_inc+0x2b>
>    f:   85 c0                   test   %eax,%eax
>   11:   75 07                   jne    1a <T_refcount_inc+0x1a>
>   13:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>   17:   0f 0b                   ud2    
>   19:   c3                      retq   
>   1a:   8d 50 01                lea    0x1(%rax),%edx
>   1d:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>   21:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
>   25:   75 db                   jne    2 <T_refcount_inc+0x2>
>   27:   ff c2                   inc    %edx
>   29:   74 e8                   je     13 <T_refcount_inc+0x13>
>   2b:   c3                      retq  
> 
> 
> Which is even worse... (I did double check it actually compiled)

Not to mention we cannot use the C11 atomics in the kernel, because we want
to be able to runtime-patch the LOCK prefixes when only 1 CPU is available.
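
The arch cmpxchg wrappers emit that prefix through something like the
LOCK_PREFIX macro below (roughly what arch/x86/include/asm/alternative.h
has; the details vary by kernel version), which records the prefix address
in a .smp_locks section so the alternatives code can patch it out on UP.
The compiler-generated __atomic_* builtins bypass all of that:

#define LOCK_PREFIX_HERE \
		".pushsection .smp_locks,\"a\"\n"	\
		".balign 4\n"				\
		".long 671f - .\n" /* offset */		\
		".popsection\n"				\
		"671:"

#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "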

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 18:08           ` Peter Zijlstra
  2017-03-24 18:13             ` Peter Zijlstra
@ 2017-03-24 18:16             ` Dmitry Vyukov
  1 sibling, 0 replies; 37+ messages in thread
From: Dmitry Vyukov @ 2017-03-24 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 7:08 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 06:51:15PM +0100, Dmitry Vyukov wrote:
>> On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
>> >> > So the first snipped I tested regressed like so:
>> >> >
>> >> >
>> >> > 0000000000000000 <T_refcount_inc>:                              0000000000000000 <T_refcount_inc>:
>> >> >    0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
>> >> >    2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
>> >> >    5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
>> >> >    7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
>> >> >    9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
>> >> >    b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
>> >> >    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
>> >> >   12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
>> >> >   14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
>> >> >   16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
>> >> >   18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
>> >> >   1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
>> >> >                                                                   1c:   75 03                   jne    21 <T_refcount_inc+0x21>
>> >> >                                                                   1e:   0f 0b                   ud2
>> >> >                                                                   20:   c3                      retq
>> >> >                                                                   21:   c3                      retq
>> >>
>
>> This seems to help ;)
>>
>> #define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
>
> That gets me:
>
> 0000000000000000 <T_refcount_inc>:
>    0:   8b 07                   mov    (%rdi),%eax
>    2:   89 44 24 fc             mov    %eax,-0x4(%rsp)
>    6:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>    a:   83 f8 ff                cmp    $0xffffffff,%eax
>    d:   74 1c                   je     2b <T_refcount_inc+0x2b>
>    f:   85 c0                   test   %eax,%eax
>   11:   75 07                   jne    1a <T_refcount_inc+0x1a>
>   13:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>   17:   0f 0b                   ud2
>   19:   c3                      retq
>   1a:   8d 50 01                lea    0x1(%rax),%edx
>   1d:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>   21:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
>   25:   75 db                   jne    2 <T_refcount_inc+0x2>
>   27:   ff c2                   inc    %edx
>   29:   74 e8                   je     13 <T_refcount_inc+0x13>
>   2b:   c3                      retq
>
>
> Which is even worse... (I did double check it actually compiled)


For me gcc 7.0.1 generates:

0000000000000330 <T_refcount_inc1>:
 330:   55                      push   %rbp
 331:   8b 07                   mov    (%rdi),%eax
 333:   48 89 e5                mov    %rsp,%rbp
 336:   83 f8 ff                cmp    $0xffffffff,%eax
 339:   74 12                   je     34d <T_refcount_inc1+0x1d>
 33b:   85 c0                   test   %eax,%eax
 33d:   74 10                   je     34f <T_refcount_inc1+0x1f>
 33f:   8d 50 01                lea    0x1(%rax),%edx
 342:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
 346:   75 ee                   jne    336 <T_refcount_inc1+0x6>
 348:   83 fa ff                cmp    $0xffffffff,%edx
 34b:   74 04                   je     351 <T_refcount_inc1+0x21>
 34d:   5d                      pop    %rbp
 34e:   c3                      retq
 34f:   31 c0                   xor    %eax,%eax
 351:   0f 0b                   ud2
 353:   5d                      pop    %rbp
 354:   c3                      retq

with:

if (!success) \
  *_old = __old; \

0000000000000320 <T_refcount_inc1>:
 320:   8b 0f                   mov    (%rdi),%ecx
 322:   55                      push   %rbp
 323:   48 89 e5                mov    %rsp,%rbp
 326:   83 f9 ff                cmp    $0xffffffff,%ecx
 329:   74 2d                   je     358 <T_refcount_inc1+0x38>
 32b:   85 c9                   test   %ecx,%ecx
 32d:   74 25                   je     354 <T_refcount_inc1+0x34>
 32f:   8d 71 01                lea    0x1(%rcx),%esi
 332:   89 c8                   mov    %ecx,%eax
 334:   f0 0f b1 37             lock cmpxchg %esi,(%rdi)
 338:   89 c2                   mov    %eax,%edx
 33a:   74 20                   je     35c <T_refcount_inc1+0x3c>
 33c:   83 fa ff                cmp    $0xffffffff,%edx
 33f:   74 17                   je     358 <T_refcount_inc1+0x38>
 341:   85 d2                   test   %edx,%edx
 343:   74 0f                   je     354 <T_refcount_inc1+0x34>
 345:   8d 72 01                lea    0x1(%rdx),%esi
 348:   89 d0                   mov    %edx,%eax
 34a:   f0 0f b1 37             lock cmpxchg %esi,(%rdi)
 34e:   74 0a                   je     35a <T_refcount_inc1+0x3a>
 350:   89 c2                   mov    %eax,%edx
 352:   eb e8                   jmp    33c <T_refcount_inc1+0x1c>
 354:   31 c9                   xor    %ecx,%ecx
 356:   0f 0b                   ud2
 358:   5d                      pop    %rbp
 359:   c3                      retq
 35a:   89 d1                   mov    %edx,%ecx
 35c:   83 fe ff                cmp    $0xffffffff,%esi
 35f:   74 f5                   je     356 <T_refcount_inc1+0x36>
 361:   5d                      pop    %rbp
 362:   c3                      retq


with __atomic_compare_exchange_n:

exactly the same as the original code.


But I don't have an answer for runtime patching of LOCK.
Looks like something to fix in gcc.

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 17:23       ` Peter Zijlstra
  2017-03-24 17:51         ` Dmitry Vyukov
  2017-03-24 18:00         ` Peter Zijlstra
@ 2017-03-24 18:45         ` Andy Lutomirski
  2017-03-24 19:17           ` Linus Torvalds
  2017-03-24 20:22           ` Peter Zijlstra
  2017-03-24 19:08         ` Linus Torvalds
  3 siblings, 2 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-03-24 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Vyukov, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 10:23 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
>> > So the first snipped I tested regressed like so:
>> >
>> >
>> > 0000000000000000 <T_refcount_inc>:                              0000000000000000 <T_refcount_inc>:
>> >    0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
>> >    2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
>> >    5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
>> >    7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
>> >    9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
>> >    b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
>> >    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
>> >   12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
>> >   14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
>> >   16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
>> >   18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
>> >   1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
>> >                                                                   1c:   75 03                   jne    21 <T_refcount_inc+0x21>
>> >                                                                   1e:   0f 0b                   ud2
>> >                                                                   20:   c3                      retq
>> >                                                                   21:   c3                      retq
>>
>> Can you re-send the better asm you got earlier?
>
> On the left?

Apparently I'm just blind this morning.

After playing with it a bit, I found some of the problem: you're
passing val into EXCEPTION_VALUE, which keeps it live.  If I get rid
of that, the generated code is great.

I haven't found a way to convince GCC that, in the success case, eax
isn't clobbered.  I wrote this:

static inline bool try_cmpxchg(unsigned int *ptr, unsigned int *val,
unsigned int new)
{
    unsigned int old = *val;
    bool success;

    asm volatile("lock cmpxchgl %[new], %[ptr]"
             : "=@ccz" (success),
               [ptr] "+m" (*ptr),
               [old] "+a" (old)
             : [new] "r" (new)
             : "memory");

    if (!success) {
        *val = old;
    } else {
        if (*val != old) {
            *val = old;
            __builtin_unreachable();
        } else {
            /*
             * Damnit, GCC, I want you to realize that this
             * is happening but to avoid emitting the store.
             */
            *val = old; /* <-- here */
        }
    }

    return success;
}

The "here" line is the problematic code that breaks certain use cases,
and it obviously needn't have any effect in the generated code, but
I'm having trouble getting GCC to generate good code without it.

Is there some hack like if __builtin_is_unescaped(*val) *val = old;
that would work?

--Andy

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 17:23       ` Peter Zijlstra
                           ` (2 preceding siblings ...)
  2017-03-24 18:45         ` Andy Lutomirski
@ 2017-03-24 19:08         ` Linus Torvalds
  2017-03-24 20:46           ` Peter Zijlstra
  3 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2017-03-24 19:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 10:23 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> I tried a few variants, but nothing really made it better.

So I really hate how your thing has two return values, and fakes the
second one using the pointer value. I dislike it for two different
reasons:

 - it's bad for type checking: it makes the "real" pointer value and
the "old value" pointer value both be pointers of the same type

 - I think it's part of the reason why gcc easily generates crap code.

And I think the problem is fundamental to your interface.

So how about we change the interface entirely, with the goal being
both type safety and good code generation?

In particular, let's make the interface something that might be
optimal given possible future gcc improvements, like the ability to
use "asm goto" with outputs. We can't do that today, but maybe some
day...

So I would suggest that the format of the "try_cmpxchg()" thing be
something like this:

#define try_cmpxchg(ptr, value, new, success_label) ({  \
        bool __txchg_success;                           \
        __typeof__(*(ptr)) __old;                       \
        asm volatile("lock cmpxchgl %3, %1"             \
                : "=@ccz" (__txchg_success),            \
                  "+m" (*ptr),                          \
                  "=a" (__old)                          \
                : "r" (new),                            \
                  "2" (value)                           \
                : "memory");                            \
        if (__txchg_success) goto success_label;        \
        __old; })

which seems to be fairly natural.

Then you do your refcount loop (or pretty much *any* cmpxchg loop) with

        for (;;) {
                if (unlikely(val == UINT_MAX))
                        goto saturated;

                if (unlikely(!val))
                        goto use_after_free;

                new = val + 1;
                val = try_cmpxchg(r, val, new, success);
        }

    success:
        return;

    saturated: use_after_free: .. whatever error handling ..

and I think gcc should generate reasonable code.

Hmm?

NOTE! I would suggest that this odd "macro with a success label" model
of try_cmpxchg() never be used for something that people are expected
to use directly. So I don't think "normal" code should use this label
form.

But it's useful as a helper function that is then used to implement
the "real" ABI, ie to implement that "refcount_inc()" and other things
like that..
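
(A hedged sketch of that layering, reusing the label macro above; the
function name and the bare unsigned-int counter are just illustrative:)

static inline bool refcount_dec_if_one(unsigned int *r)
{
        /* try_cmpxchg() jumps to the label on success, falls through on failure */
        try_cmpxchg(r, 1, 0, success);
        return false;
success:
        return true;
}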

                               Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 18:13             ` Peter Zijlstra
@ 2017-03-24 19:16               ` Andy Lutomirski
  2017-03-24 19:20                 ` Linus Torvalds
  2017-03-24 20:14                 ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-03-24 19:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Vyukov, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 11:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 07:08:38PM +0100, Peter Zijlstra wrote:
>> On Fri, Mar 24, 2017 at 06:51:15PM +0100, Dmitry Vyukov wrote:
>> > On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
>> > >> > So the first snipped I tested regressed like so:
>> > >> >
>> > >> >
>> > >> > 0000000000000000 <T_refcount_inc>:                              0000000000000000 <T_refcount_inc>:
>> > >> >    0:   8b 07                   mov    (%rdi),%eax                 0:   8b 17                   mov    (%rdi),%edx
>> > >> >    2:   83 f8 ff                cmp    $0xffffffff,%eax            2:   83 fa ff                cmp    $0xffffffff,%edx
>> > >> >    5:   74 13                   je     1a <T_refcount_inc+0x1a>    5:   74 1a                   je     21 <T_refcount_inc+0x21>
>> > >> >    7:   85 c0                   test   %eax,%eax                   7:   85 d2                   test   %edx,%edx
>> > >> >    9:   74 0d                   je     18 <T_refcount_inc+0x18>    9:   74 13                   je     1e <T_refcount_inc+0x1e>
>> > >> >    b:   8d 50 01                lea    0x1(%rax),%edx              b:   8d 4a 01                lea    0x1(%rdx),%ecx
>> > >> >    e:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)           e:   89 d0                   mov    %edx,%eax
>> > >> >   12:   75 ee                   jne    2 <T_refcount_inc+0x2>     10:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
>> > >> >   14:   ff c2                   inc    %edx                       14:   74 04                   je     1a <T_refcount_inc+0x1a>
>> > >> >   16:   75 02                   jne    1a <T_refcount_inc+0x1a>   16:   89 c2                   mov    %eax,%edx
>> > >> >   18:   0f 0b                   ud2                               18:   eb e8                   jmp    2 <T_refcount_inc+0x2>
>> > >> >   1a:   c3                      retq                              1a:   ff c1                   inc    %ecx
>> > >> >                                                                   1c:   75 03                   jne    21 <T_refcount_inc+0x21>
>> > >> >                                                                   1e:   0f 0b                   ud2
>> > >> >                                                                   20:   c3                      retq
>> > >> >                                                                   21:   c3                      retq
>> > >>
>>
>> > This seems to help ;)
>> >
>> > #define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
>>
>> That gets me:
>>
>> 0000000000000000 <T_refcount_inc>:
>>    0:   8b 07                   mov    (%rdi),%eax
>>    2:   89 44 24 fc             mov    %eax,-0x4(%rsp)
>>    6:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>>    a:   83 f8 ff                cmp    $0xffffffff,%eax
>>    d:   74 1c                   je     2b <T_refcount_inc+0x2b>
>>    f:   85 c0                   test   %eax,%eax
>>   11:   75 07                   jne    1a <T_refcount_inc+0x1a>
>>   13:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>>   17:   0f 0b                   ud2
>>   19:   c3                      retq
>>   1a:   8d 50 01                lea    0x1(%rax),%edx
>>   1d:   8b 44 24 fc             mov    -0x4(%rsp),%eax
>>   21:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
>>   25:   75 db                   jne    2 <T_refcount_inc+0x2>
>>   27:   ff c2                   inc    %edx
>>   29:   74 e8                   je     13 <T_refcount_inc+0x13>
>>   2b:   c3                      retq
>>
>>
>> Which is even worse... (I did double check it actually compiled)
>
> Not to mention we cannot use the C11 atomics in kernel because we want
> to be able to runtime patch LOCK prefixes when only 1 CPU is available.

Is this really a show-stopper?  I bet that objtool could be persuaded
to emit a list of the locations of all those LOCK prefixes.

--Andy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 18:45         ` Andy Lutomirski
@ 2017-03-24 19:17           ` Linus Torvalds
  2017-03-24 21:23             ` Peter Zijlstra
  2017-03-24 20:22           ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2017-03-24 19:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 11:45 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Is there some hack like if __builtin_is_unescaped(*val) *val = old;
> that would work?

See my recent email suggesting a completely different interface, which
avoids this problem.

My interface generates:

0000000000000000 <T_refcount_inc>:
   0: 8b 07                 mov    (%rdi),%eax
   2: 83 f8 ff             cmp    $0xffffffff,%eax
   5: 74 12                 je     19 <T_refcount_inc+0x19>
   7: 85 c0                 test   %eax,%eax
   9: 74 0a                 je     15 <T_refcount_inc+0x15>
   b: 8d 50 01             lea    0x1(%rax),%edx
   e: f0 0f b1 17           lock cmpxchg %edx,(%rdi)
  12: 75 ee                 jne    2 <T_refcount_inc+0x2>
  14: c3                   retq
  15: 31 c0                 xor    %eax,%eax
  17: 0f 0b                 ud2
  19: c3                   retq

for PeterZ's test-case, which seems optimal.

Of course, PeterZ used -Os, which isn't actually very natural for the
kernel. Using -O2 I get something else. It turns out that my macro
should use

        if (likely(__txchg_success)) goto success_label;

(that "likely()" is criticial) to make gcc not try to optimize for the
looping case.

So with that "likely()" fixed, with -O2 I get:

0000000000000000 <T_refcount_inc>:
   0: 8b 07                 mov    (%rdi),%eax
   2: 83 f8 ff             cmp    $0xffffffff,%eax
   5: 74 0d                 je     14 <T_refcount_inc+0x14>
   7: 85 c0                 test   %eax,%eax
   9: 74 12                 je     1d <T_refcount_inc+0x1d>
   b: 8d 50 01             lea    0x1(%rax),%edx
   e: f0 0f b1 17           lock cmpxchg %edx,(%rdi)
  12: 75 02                 jne    16 <T_refcount_inc+0x16>
  14: f3 c3                 repz retq
  16: 83 f8 ff             cmp    $0xffffffff,%eax
  19: 75 ec                 jne    7 <T_refcount_inc+0x7>
  1b: f3 c3                 repz retq
  1d: 31 c0                 xor    %eax,%eax
  1f: 0f 0b                 ud2
  21: c3                   retq

which again looks pretty optimal (it did indeed actually generate
bigger but potentially higher-performance code by making the good case
be a fallthrough, and the unlikely case be a _forward_ jump that will
be predicted not-taken in the absence of other prediction information).

(Of course, this also depends on the exact behavior that PeterZ's code
had, namely an exception for use-after-free, but a silent saturation.)

            Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 19:16               ` Andy Lutomirski
@ 2017-03-24 19:20                 ` Linus Torvalds
  2017-03-24 19:27                   ` Andy Lutomirski
  2017-03-24 20:15                   ` Peter Zijlstra
  2017-03-24 20:14                 ` Peter Zijlstra
  1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2017-03-24 19:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Dmitry Vyukov, Andrew Morton, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Fri, Mar 24, 2017 at 12:16 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> Not to mention we cannot use the C11 atomics in kernel because we want
>> to be able to runtime patch LOCK prefixes when only 1 CPU is available.
>
> Is this really a show-stopper?  I bet that objtool could be persuaded
> to emit a list of the locations of all those LOCK prefixes.

I doubt it's a show-stopper, if only because nobody cares about UP any
more. Not even the embedded world does.

That said, I'm not convinced that there will ever really be a reason
for the kernel to use the C11 atomics. They just aren't any better
than what we can do ourselves.

The reason for C11 atomics is "portably good atomics". We use
"architecture-specific good atomics" instead, and are willing to
maintain that. We will *have* to maintain that in the foreseeable
future anyway, for legacy compiler issues.

So any C11 atomics use by the kernel is realistically at least a
decade away if it *ever* happens.

Don't even worry about it. Planning decades ahead for some detail like
that is stupid.

               Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 19:20                 ` Linus Torvalds
@ 2017-03-24 19:27                   ` Andy Lutomirski
  2017-03-24 20:15                   ` Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-03-24 19:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Peter Zijlstra, Dmitry Vyukov, Andrew Morton,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 12:20 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Mar 24, 2017 at 12:16 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> Not to mention we cannot use the C11 atomics in kernel because we want
>>> to be able to runtime patch LOCK prefixes when only 1 CPU is available.
>>
>> Is this really a show-stopper?  I bet that objtool could be persuaded
>> to emit a list of the locations of all those LOCK prefixes.
>
> I doubt it's a show-stopper, if only because nobody cares about UP any
> more. Not even the embedded world does.
>
> That said, I'm not convinced that there will ever really be a reason
> for the kernel to use the C11 atomics. They just aren't any better
> than what we can do ourselves.
>
> The reason for C11 atomics is "portably good atomics". We use
> "architecture-specific good atomics" instead, and are willing to
> maintain that. We will *have* to maintain that in the forseeable
> future anyway, for legacy compiler issues.

In theory, though, the compiler could optimize based on its knowledge
of what the C11 atomics do.  ISTR reading about a few optimizations
that were already starting to be developed.

Using asm goto seems okay, too, but it's a lot more tedious and is less
friendly to the optimizers.

--Andy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 19:16               ` Andy Lutomirski
  2017-03-24 19:20                 ` Linus Torvalds
@ 2017-03-24 20:14                 ` Peter Zijlstra
  2017-03-24 20:21                   ` Andy Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 20:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Vyukov, Andrew Morton, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf, Linus Torvalds,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Fri, Mar 24, 2017 at 12:16:11PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 24, 2017 at 11:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > Not to mention we cannot use the C11 atomics in kernel because we want
> > to be able to runtime patch LOCK prefixes when only 1 CPU is available.
> 
> Is this really a show-stopper?  I bet that objtool could be persuaded
> to emit a list of the locations of all those LOCK prefixes.

Ah, but it's not _all_ LOCK prefixes. Some are needed even on UP, because
they order against hardware instead of other CPUs. Or again against the
hypervisor instead of other vCPUs.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 19:20                 ` Linus Torvalds
  2017-03-24 19:27                   ` Andy Lutomirski
@ 2017-03-24 20:15                   ` Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 20:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Fri, Mar 24, 2017 at 12:20:37PM -0700, Linus Torvalds wrote:
> I doubt it's a show-stopper, if only because nobody cares about UP any
> more. Not even the embedded world does.

For some obscure reason we recently introduced a variant for virt
people, where it needs memory barriers against the hypervisor but, having
only a single vCPU, not against itself.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 20:14                 ` Peter Zijlstra
@ 2017-03-24 20:21                   ` Andy Lutomirski
  0 siblings, 0 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-03-24 20:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 1:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 12:16:11PM -0700, Andy Lutomirski wrote:
>> On Fri, Mar 24, 2017 at 11:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > Not to mention we cannot use the C11 atomics in kernel because we want
>> > to be able to runtime patch LOCK prefixes when only 1 CPU is available.
>>
>> Is this really a show-stopper?  I bet that objtool could be persuaded
>> to emit a list of the locations of all those LOCK prefixes.
>
> Ah, but its not _all_ LOCK prefixes. Some are needed even on UP, because
> against hardware instead of other CPUs. Or again hypervisor instead of
> other vCPU.
>

Make a table of mandatory lock prefixes and assume all the others (due
to C11 atomics, etc) can be omitted on UP?

I'm curious, though, whether anyone actually compiles an x86 SMP
kernel, runs it on UP, and cares about performance these days.

--Andy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 18:45         ` Andy Lutomirski
  2017-03-24 19:17           ` Linus Torvalds
@ 2017-03-24 20:22           ` Peter Zijlstra
  2017-03-24 20:27             ` Andy Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 20:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Vyukov, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 11:45:46AM -0700, Andy Lutomirski wrote:
> After playing with it a bit, I found some of the problem: you're
> passing val into EXCEPTION_VALUE, which keeps it live.  If I get rid
> of that, the generated code is great.

Right, so I needed that because I land on ud2 through 2 different paths:

 - newly saturated
 - use-after-free

And the exception handler can figure out which of the two by looking at
the variable, but then of course, it needs to be live.
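
Conceptually the handler side then only needs something like this sketch
(the encoding is an assumption here: val == 0 means use-after-free,
anything else means we just saturated; the hard part is fishing the value
back out of pt_regs):

static void refcount_exception(unsigned int val)
{
	/* val was kept live through the asm precisely for this test */
	if (!val)
		WARN(1, "refcount_t: increment on 0; use-after-free.\n");
	else
		WARN(1, "refcount_t: saturated; leaking memory.\n");
}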

For the full horror of how to do this, look here:

  http://paste.debian.net/924190/

But I didn't just show you that, so you can't blame me for any damage
it might've done you.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 20:22           ` Peter Zijlstra
@ 2017-03-24 20:27             ` Andy Lutomirski
  2017-03-24 21:07               ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2017-03-24 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Vyukov, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 1:22 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 11:45:46AM -0700, Andy Lutomirski wrote:
>> After playing with it a bit, I found some of the problem: you're
>> passing val into EXCEPTION_VALUE, which keeps it live.  If I get rid
>> of that, the generated code is great.
>
> Right, so I needed that because I land on ud2 through 2 different paths:
>
>  - newly saturated
>  - use-after-free
>
> And the exception handler can figure out which of the two by looking at
> the variable, but then of course, it needs to be life.
>
> For the full horror of how to do this, look here:
>
>   http://paste.debian.net/924190/
>
> But I didn't just show you that, so you can't blame me for any damage
> that might've done you.

Wow, that's horrible.  Could this not be done by looking at flags
instead of regs?

For that matter, you're effectively comparing to -1 and 0.  I'm not
really sure it would be faster, but you could plausibly add one then
subtract one again and get the full picture just from flags and a
single comparison?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 19:08         ` Linus Torvalds
@ 2017-03-24 20:46           ` Peter Zijlstra
  2017-03-24 20:58             ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 12:08:32PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2017 at 10:23 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > I tried a few variants, but nothing really made it better.
> 
> So I really hate how your thing has two return values, and fakes the
> second one using the pointer value. 

Inspired by C11, I'm afraid...

> So how about we change the interface entirely, with the goal being
> both type safety and good code generation?

I certainly like it better, but so far I'm having trouble reproducing
your results. What compiler version are you on?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 20:46           ` Peter Zijlstra
@ 2017-03-24 20:58             ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2017-03-24 20:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

[-- Attachment #1: Type: text/plain, Size: 423 bytes --]

On Fri, Mar 24, 2017 at 1:46 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> I certainly like it better, but so far I'm having trouble reproducing
> your results. What compiler version are you on?

I have:

    gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC)

from

    gcc-6.3.1-1.fc24.x86_64

and I'm attaching the edited form of your test-case, just so that
we're on the exact same page.

                 Linus

[-- Attachment #2: tiny.c --]
[-- Type: text/x-csrc, Size: 1204 bytes --]

/* gcc -Os -std=gnu99 -fno-strict-overflow -falign-jumps=1 -falign-loops=1 -c tiny.c; objdump -dr tiny.o */

typedef _Bool bool;

#define try_cmpxchg(ptr, value, new, success_label) ({	\
	bool __txchg_success;				\
	__typeof__(*(ptr)) __old;			\
	asm volatile("lock cmpxchgl %3, %1"		\
		: "=@ccz" (__txchg_success),		\
		  "+m" (*ptr),				\
		  "=a" (__old)				\
		: "r" (new),				\
		  "2" (value)				\
		: "memory");				\
	if (likely(__txchg_success)) goto success_label;\
	__old; })


#define EXCEPTION_VALUE(val, handler)   asm volatile ("ud2 # %0" : : "r" (val))

#define UINT_MAX        (~0U)

#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)    __builtin_expect(!!(x), 0)

static inline void refcount_inc(unsigned int *r)
{
	unsigned int new, val = *(unsigned int volatile *)r;

	for (;;) {
		if (unlikely(val == UINT_MAX)) /* saturated */
			return;

		if (unlikely(!val)) /* use-after-free */
			goto exception;

		/* cannot overflow because we already checked UINT_MAX */
		new = val + 1;
		val = try_cmpxchg(r, val, new, success);
	}

success:
	return;

exception:
      EXCEPTION_VALUE(val, __refcount_warn);
}

void T_refcount_inc(unsigned int *r)
{
	refcount_inc(r);
}


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 20:27             ` Andy Lutomirski
@ 2017-03-24 21:07               ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 21:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Vyukov, Andrew Morton, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf, Linus Torvalds,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Fri, Mar 24, 2017 at 01:27:49PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 24, 2017 at 1:22 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Mar 24, 2017 at 11:45:46AM -0700, Andy Lutomirski wrote:
> >> After playing with it a bit, I found some of the problem: you're
> >> passing val into EXCEPTION_VALUE, which keeps it live.  If I get rid
> >> of that, the generated code is great.
> >
> > Right, so I needed that because I land on ud2 through 2 different paths:
> >
> >  - newly saturated
> >  - use-after-free
> >
> > And the exception handler can figure out which of the two by looking at
> > the variable, but then of course, it needs to be life.
> >
> > For the full horror of how to do this, look here:
> >
> >   http://paste.debian.net/924190/
> >
> > But I didn't just show you that, so you can't blame me for any damage
> > that might've done you.
> 
> Wow, that's horrible.  Could this not be done by looking at flags
> instead of regs?

Well, the EXCEPTION_HANDLER() thing is something ARM/ARM64 could also
implement.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 19:17           ` Linus Torvalds
@ 2017-03-24 21:23             ` Peter Zijlstra
  2017-03-25  7:51               ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-24 21:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 12:17:28PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2017 at 11:45 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Is there some hack like if __builtin_is_unescaped(*val) *val = old;
> > that would work?
> 
> See my recent email suggesting a completely different interface, which
> avoids this problem.
> 
> My interface generates:
> 
> 0000000000000000 <T_refcount_inc>:
>    0: 8b 07                 mov    (%rdi),%eax
>    2: 83 f8 ff             cmp    $0xffffffff,%eax
>    5: 74 12                 je     19 <T_refcount_inc+0x19>
>    7: 85 c0                 test   %eax,%eax
>    9: 74 0a                 je     15 <T_refcount_inc+0x15>
>    b: 8d 50 01             lea    0x1(%rax),%edx
>    e: f0 0f b1 17           lock cmpxchg %edx,(%rdi)
>   12: 75 ee                 jne    2 <T_refcount_inc+0x2>
>   14: c3                   retq
>   15: 31 c0                 xor    %eax,%eax
>   17: 0f 0b                 ud2
>   19: c3                   retq
> 
> for PeterZ's test-case, which seems optimal.

Right; now my GCC emits more or less the same code (it's a slightly
different compiler, and instead of '12: jne' it does '12: je; 14: jmp 2').

But maybe that's the likely() you added later.

Also, see how at 7 we test if eax is 0 and then at 9 jump to 15 where we
make eax 0. Pretty daft code-gen.

In any case, you lost one branch into ud2; your 'success: return' should
be 'success: if (new == UINT_MAX) ...', such that when we newly saturate
the count we also raise an exception.
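
A hedged sketch of that amendment, reusing try_cmpxchg(),
EXCEPTION_VALUE() and the other helpers from the tiny.c test-case
earlier in the thread:

static inline void refcount_inc(unsigned int *r)
{
	unsigned int new, val = *(unsigned int volatile *)r;

	for (;;) {
		if (unlikely(val == UINT_MAX))	/* already saturated */
			return;

		if (unlikely(!val))		/* use-after-free */
			goto exception;

		new = val + 1;
		val = try_cmpxchg(r, val, new, success);
	}

success:
	if (unlikely(new == UINT_MAX))		/* newly saturated */
		goto exception;
	return;

exception:
	EXCEPTION_VALUE(val, __refcount_warn);
}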

With that, the code is still larger than it used to be. I'll have a play
around. I do like this interface better, but getting GCC to generate
sensible code seems 'interesting'.

I'll try and redo the patches that landed in tip and see what it does
for total vmlinux size somewhere tomorrow.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 21:23             ` Peter Zijlstra
@ 2017-03-25  7:51               ` Peter Zijlstra
  2017-03-25 18:00                 ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-25  7:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Fri, Mar 24, 2017 at 10:23:29PM +0100, Peter Zijlstra wrote:

> I'll try and redo the patches that landed in tip and see what it does
> for total vmlinux size somewhere tomorrow.

   text    data     bss     dec     hex filename
10726413        4540256  843776 16110445         f5d36d defconfig-build/vmlinux.pre
10730509        4540256  843776 16114541         f5e36d defconfig-build/vmlinux.post

:-(

---
 arch/x86/include/asm/atomic.h      | 18 ++++++-------
 arch/x86/include/asm/atomic64_64.h | 24 ++++++++++--------
 arch/x86/include/asm/cmpxchg.h     | 50 ++++++++++++++++++++----------------
 include/linux/atomic.h             | 52 +++++++++++++++++++++++++-------------
 lib/refcount.c                     | 35 ++++++++++++++-----------
 5 files changed, 104 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h
index caa5798..c80d4914 100644
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -186,11 +186,8 @@ static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
 	return cmpxchg(&v->counter, old, new);
 }
 
-#define atomic_try_cmpxchg atomic_try_cmpxchg
-static __always_inline bool atomic_try_cmpxchg(atomic_t *v, int *old, int new)
-{
-	return try_cmpxchg(&v->counter, old, new);
-}
+#define atomic_try_cmpxchg(_ptr, _old, _new, _label)			\
+	try_cmpxchg(&(_ptr)->counter, _old, _new, _label)
 
 static inline int atomic_xchg(atomic_t *v, int new)
 {
@@ -210,8 +207,9 @@ static inline void atomic_##op(int i, atomic_t *v)			\
 static inline int atomic_fetch_##op(int i, atomic_t *v)			\
 {									\
 	int val = atomic_read(v);					\
-	do {								\
-	} while (!atomic_try_cmpxchg(v, &val, val c_op i));		\
+	for (;;)							\
+		val = atomic_try_cmpxchg(v, val, val c_op i, success);	\
+success:								\
 	return val;							\
 }
 
@@ -239,10 +237,12 @@ ATOMIC_OPS(xor, ^)
 static __always_inline int __atomic_add_unless(atomic_t *v, int a, int u)
 {
 	int c = atomic_read(v);
-	do {
+	for (;;) {
 		if (unlikely(c == u))
 			break;
-	} while (!atomic_try_cmpxchg(v, &c, c + a));
+		c = atomic_try_cmpxchg(v, c, c + a, success);
+	}
+success:
 	return c;
 }
 
diff --git a/arch/x86/include/asm/atomic64_64.h b/arch/x86/include/asm/atomic64_64.h
index 6189a43..489c3e2 100644
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -176,11 +176,8 @@ static inline long atomic64_cmpxchg(atomic64_t *v, long old, long new)
 	return cmpxchg(&v->counter, old, new);
 }
 
-#define atomic64_try_cmpxchg atomic64_try_cmpxchg
-static __always_inline bool atomic64_try_cmpxchg(atomic64_t *v, long *old, long new)
-{
-	return try_cmpxchg(&v->counter, old, new);
-}
+#define atomic64_try_cmpxchg(_ptr, _old, _new, _label)			\
+	try_cmpxchg(&(_ptr)->counter, _old, _new, _label)
 
 static inline long atomic64_xchg(atomic64_t *v, long new)
 {
@@ -199,10 +196,12 @@ static inline long atomic64_xchg(atomic64_t *v, long new)
 static inline bool atomic64_add_unless(atomic64_t *v, long a, long u)
 {
 	long c = atomic64_read(v);
-	do {
+	for (;;) {
 		if (unlikely(c == u))
 			return false;
-	} while (!atomic64_try_cmpxchg(v, &c, c + a));
+		c = atomic64_try_cmpxchg(v, c, c + a, success);
+	}
+success:
 	return true;
 }
 
@@ -218,11 +217,13 @@ static inline bool atomic64_add_unless(atomic64_t *v, long a, long u)
 static inline long atomic64_dec_if_positive(atomic64_t *v)
 {
 	long dec, c = atomic64_read(v);
-	do {
+	for (;;) {
 		dec = c - 1;
 		if (unlikely(dec < 0))
 			break;
-	} while (!atomic64_try_cmpxchg(v, &c, dec));
+		c = atomic64_try_cmpxchg(v, c, dec, success);
+	}
+success:
 	return dec;
 }
 
@@ -239,8 +240,9 @@ static inline void atomic64_##op(long i, atomic64_t *v)			\
 static inline long atomic64_fetch_##op(long i, atomic64_t *v)		\
 {									\
 	long val = atomic64_read(v);					\
-	do {								\
-	} while (!atomic64_try_cmpxchg(v, &val, val c_op i));		\
+	for (;;)							\
+		val = atomic64_try_cmpxchg(v, val, val c_op i, success);\
+success:								\
 	return val;							\
 }
 
diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index fb961db..e6b8a8f 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -154,22 +154,24 @@ extern void __add_wrong_size(void)
 	__cmpxchg_local(ptr, old, new, sizeof(*(ptr)))
 
 
-#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)		\
+#define __raw_try_cmpxchg(_ptr, _old, _new, success_label, size, lock)	\
 ({									\
-	bool success;							\
-	__typeof__(_ptr) _old = (_pold);				\
-	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __old = (_old);				\
 	__typeof__(*(_ptr)) __new = (_new);				\
+	__typeof__(*(_ptr)) __ret;					\
+	bool __success;							\
+									\
 	switch (size) {							\
 	case __X86_CASE_B:						\
 	{								\
 		volatile u8 *__ptr = (volatile u8 *)(_ptr);		\
 		asm volatile(lock "cmpxchgb %[new], %[ptr]"		\
 			     CC_SET(z)					\
-			     : CC_OUT(z) (success),			\
+			     : CC_OUT(z) (__success),			\
 			       [ptr] "+m" (*__ptr),			\
-			       [old] "+a" (__old)			\
-			     : [new] "q" (__new)			\
+			       [old] "=a" (__ret)			\
+			     : [new] "q" (__new),			\
+			       "2" (__old)				\
 			     : "memory");				\
 		break;							\
 	}								\
@@ -178,10 +180,11 @@ extern void __add_wrong_size(void)
 		volatile u16 *__ptr = (volatile u16 *)(_ptr);		\
 		asm volatile(lock "cmpxchgw %[new], %[ptr]"		\
 			     CC_SET(z)					\
-			     : CC_OUT(z) (success),			\
+			     : CC_OUT(z) (__success),			\
 			       [ptr] "+m" (*__ptr),			\
-			       [old] "+a" (__old)			\
-			     : [new] "r" (__new)			\
+			       [old] "=a" (__ret)			\
+			     : [new] "r" (__new),			\
+			       "2" (__old)				\
 			     : "memory");				\
 		break;							\
 	}								\
@@ -190,10 +193,11 @@ extern void __add_wrong_size(void)
 		volatile u32 *__ptr = (volatile u32 *)(_ptr);		\
 		asm volatile(lock "cmpxchgl %[new], %[ptr]"		\
 			     CC_SET(z)					\
-			     : CC_OUT(z) (success),			\
+			     : CC_OUT(z) (__success),			\
 			       [ptr] "+m" (*__ptr),			\
-			       [old] "+a" (__old)			\
-			     : [new] "r" (__new)			\
+			       [old] "=a" (__ret)			\
+			     : [new] "r" (__new),			\
+			       "2" (__old)				\
 			     : "memory");				\
 		break;							\
 	}								\
@@ -202,25 +206,27 @@ extern void __add_wrong_size(void)
 		volatile u64 *__ptr = (volatile u64 *)(_ptr);		\
 		asm volatile(lock "cmpxchgq %[new], %[ptr]"		\
 			     CC_SET(z)					\
-			     : CC_OUT(z) (success),			\
+			     : CC_OUT(z) (__success),			\
 			       [ptr] "+m" (*__ptr),			\
-			       [old] "+a" (__old)			\
-			     : [new] "r" (__new)			\
+			       [old] "=a" (__ret)			\
+			     : [new] "r" (__new),			\
+			       "2" (__old)				\
 			     : "memory");				\
 		break;							\
 	}								\
 	default:							\
 		__cmpxchg_wrong_size();					\
 	}								\
-	*_old = __old;							\
-	success;							\
+									\
+	if (likely(__success)) goto success_label;			\
+	__ret;								\
 })
 
-#define __try_cmpxchg(ptr, pold, new, size)				\
-	__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
+#define __try_cmpxchg(ptr, pold, new, success_label, size)		\
+	__raw_try_cmpxchg((ptr), (pold), (new), success_label, (size), LOCK_PREFIX)
 
-#define try_cmpxchg(ptr, pold, new)					\
-	__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
+#define try_cmpxchg(ptr, pold, new, success_label)			\
+	__try_cmpxchg((ptr), (pold), (new), success_label, sizeof(*(ptr)))
 
 /*
  * xadd() adds "inc" to "*ptr" and atomically returns the previous
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index aae5953..13a6eac 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -425,18 +425,26 @@
 
 #ifndef atomic_try_cmpxchg
 
-#define __atomic_try_cmpxchg(type, _p, _po, _n)				\
+#define __atomic_try_cmpxchg(type, _p, _o, _n, _label)			\
 ({									\
-	typeof(_po) __po = (_po);					\
-	typeof(*(_po)) __o = *__po;					\
-	*__po = atomic_cmpxchg##type((_p), __o, (_n));			\
-	(*__po == __o);							\
+	typeof(*(_p)) __r; 						\
+ 	typeof(*(_p)) __o = (_o);					\
+	__r = atomic_cmpxchg##type((_p), __o, (_n));			\
+ 	if (__r == __o) goto _label;					\
+ 	__r;								\
 })
 
-#define atomic_try_cmpxchg(_p, _po, _n)		__atomic_try_cmpxchg(, _p, _po, _n)
-#define atomic_try_cmpxchg_relaxed(_p, _po, _n)	__atomic_try_cmpxchg(_relaxed, _p, _po, _n)
-#define atomic_try_cmpxchg_acquire(_p, _po, _n)	__atomic_try_cmpxchg(_acquire, _p, _po, _n)
-#define atomic_try_cmpxchg_release(_p, _po, _n)	__atomic_try_cmpxchg(_release, _p, _po, _n)
+#define atomic_try_cmpxchg(_p, _o, _n, _l)				\
+	__atomic_try_cmpxchg(, _p, _o, _n, _l)
+
+#define atomic_try_cmpxchg_relaxed(_p, _o, _n, _l)			\
+	__atomic_try_cmpxchg(_relaxed, _p, _o, _n, _l)
+
+#define atomic_try_cmpxchg_acquire(_p, _o, _n, _l)			\
+	__atomic_try_cmpxchg(_acquire, _p, _o, _n, _l)
+
+#define atomic_try_cmpxchg_release(_p, _o, _n, _l)			\
+	__atomic_try_cmpxchg(_release, _p, _o, _n, _l)
 
 #else /* atomic_try_cmpxchg */
 #define atomic_try_cmpxchg_relaxed	atomic_try_cmpxchg
@@ -1019,18 +1027,26 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 
 #ifndef atomic64_try_cmpxchg
 
-#define __atomic64_try_cmpxchg(type, _p, _po, _n)			\
+#define __atomic64_try_cmpxchg(type, _p, _o, _n, _label)		\
 ({									\
-	typeof(_po) __po = (_po);					\
-	typeof(*(_po)) __o = *__po;					\
-	*__po = atomic64_cmpxchg##type((_p), __o, (_n));		\
-	(*__po == __o);							\
+	typeof(*(_p)) __r;						\
+	typeof(*(_p)) __o = (_o);					\
+	__r = atomic64_cmpxchg##type((_p), __o, (_n));			\
+	if (__r == __o) goto _label;					\
+	__r;								\
 })
 
-#define atomic64_try_cmpxchg(_p, _po, _n)		__atomic64_try_cmpxchg(, _p, _po, _n)
-#define atomic64_try_cmpxchg_relaxed(_p, _po, _n)	__atomic64_try_cmpxchg(_relaxed, _p, _po, _n)
-#define atomic64_try_cmpxchg_acquire(_p, _po, _n)	__atomic64_try_cmpxchg(_acquire, _p, _po, _n)
-#define atomic64_try_cmpxchg_release(_p, _po, _n)	__atomic64_try_cmpxchg(_release, _p, _po, _n)
+#define atomic64_try_cmpxchg(_p, _o, _n, _l)				\
+	__atomic64_try_cmpxchg(, _p, _o, _n, _l)
+
+#define atomic64_try_cmpxchg_relaxed(_p, _o, _n, _l)			\
+	__atomic64_try_cmpxchg(_relaxed, _p, _o, _n, _l)
+
+#define atomic64_try_cmpxchg_acquire(_p, _o, _n, _l)			\
+	__atomic64_try_cmpxchg(_acquire, _p, _o, _n, _l)
+
+#define atomic64_try_cmpxchg_release(_p, _o, _n, _l)			\
+	__atomic64_try_cmpxchg(_release, _p, _o, _n, _l)
 
 #else /* atomic64_try_cmpxchg */
 #define atomic64_try_cmpxchg_relaxed	atomic64_try_cmpxchg
diff --git a/lib/refcount.c b/lib/refcount.c
index f42124c..18b8926 100644
--- a/lib/refcount.c
+++ b/lib/refcount.c
@@ -59,7 +59,7 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
 {
 	unsigned int new, val = atomic_read(&r->refs);
 
-	do {
+	for (;;) {
 		if (!val)
 			return false;
 
@@ -70,8 +70,9 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
 		if (new < val)
 			new = UINT_MAX;
 
-	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &val, new));
-
+		val = atomic_try_cmpxchg_relaxed(&r->refs, val, new, success);
+	}
+success:
 	WARN_ONCE(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");
 
 	return true;
@@ -116,7 +117,7 @@ bool refcount_inc_not_zero(refcount_t *r)
 {
 	unsigned int new, val = atomic_read(&r->refs);
 
-	do {
+	for (;;) {
 		new = val + 1;
 
 		if (!val)
@@ -125,8 +126,9 @@ bool refcount_inc_not_zero(refcount_t *r)
 		if (unlikely(!new))
 			return true;
 
-	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &val, new));
-
+		val = atomic_try_cmpxchg_relaxed(&r->refs, val, new, success);
+	}
+success:
 	WARN_ONCE(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");
 
 	return true;
@@ -175,7 +177,7 @@ bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 {
 	unsigned int new, val = atomic_read(&r->refs);
 
-	do {
+	for (;;) {
 		if (unlikely(val == UINT_MAX))
 			return false;
 
@@ -185,8 +187,9 @@ bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 			return false;
 		}
 
-	} while (!atomic_try_cmpxchg_release(&r->refs, &val, new));
-
+		val = atomic_try_cmpxchg_release(&r->refs, val, new, success);
+	}
+success:
 	return !new;
 }
 EXPORT_SYMBOL_GPL(refcount_sub_and_test);
@@ -244,9 +247,10 @@ EXPORT_SYMBOL_GPL(refcount_dec);
  */
 bool refcount_dec_if_one(refcount_t *r)
 {
-	int val = 1;
-
-	return atomic_try_cmpxchg_release(&r->refs, &val, 0);
+	atomic_try_cmpxchg_release(&r->refs, 1, 0, success);
+	return false;
+success:
+	return true;
 }
 EXPORT_SYMBOL_GPL(refcount_dec_if_one);
 
@@ -265,7 +269,7 @@ bool refcount_dec_not_one(refcount_t *r)
 {
 	unsigned int new, val = atomic_read(&r->refs);
 
-	do {
+	for (;;) {
 		if (unlikely(val == UINT_MAX))
 			return true;
 
@@ -278,8 +282,9 @@ bool refcount_dec_not_one(refcount_t *r)
 			return true;
 		}
 
-	} while (!atomic_try_cmpxchg_release(&r->refs, &val, new));
-
+		val = atomic_try_cmpxchg_release(&r->refs, val, new, success);
+	}
+success:
 	return true;
 }
 EXPORT_SYMBOL_GPL(refcount_dec_not_one);

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-25  7:51               ` Peter Zijlstra
@ 2017-03-25 18:00                 ` Linus Torvalds
  2017-03-25 18:20                   ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2017-03-25 18:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Sat, Mar 25, 2017 at 12:51 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 10:23:29PM +0100, Peter Zijlstra wrote:
>
>> I'll try and redo the patches that landed in tip and see what it does
>> for total vmlinux size somewhere tomorrow.
>
>    text    data     bss     dec     hex filename
> 10726413        4540256  843776 16110445         f5d36d defconfig-build/vmlinux.pre
> 10730509        4540256  843776 16114541         f5e36d defconfig-build/vmlinux.post
>
> :-(

Hmm. But you are comparing against the *broken* version that did the
unconditional store of the result.

You should at least compare against the fixed version with the
conditional store. That's the one that was hard to get good code
generation from, wasn't it?

                        Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-25 18:00                 ` Linus Torvalds
@ 2017-03-25 18:20                   ` Peter Zijlstra
  2017-03-25 18:28                     ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-25 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Sat, Mar 25, 2017 at 11:00:44AM -0700, Linus Torvalds wrote:
> On Sat, Mar 25, 2017 at 12:51 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Mar 24, 2017 at 10:23:29PM +0100, Peter Zijlstra wrote:
> >
> >> I'll try and redo the patches that landed in tip and see what it does
> >> for total vmlinux size somewhere tomorrow.
> >
> >    text    data     bss     dec     hex filename
> > 10726413        4540256  843776 16110445         f5d36d defconfig-build/vmlinux.pre
> > 10730509        4540256  843776 16114541         f5e36d defconfig-build/vmlinux.post
    10730445        4540256  843776 16114477         f5e32d defconfig-build/vmlinux

> >
> > :-(
> 
> Hmm. But you are comparing against the *broken* version that did the
> unconditional store of the result.

Well, only broken if not used on stack local variables, but yes.
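
(For illustration, a hedged sketch of the stack-local case, which is the
shape of all the fetch-op loops with the pointer-based interface; here
the spurious store-back on success cannot corrupt anything because val
is a local:)

static inline int atomic_fetch_or(int i, atomic_t *v)
{
	int val = atomic_read(v);

	do {
	} while (!atomic_try_cmpxchg(v, &val, val | i));

	return val;
}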

> You should at least compare against the fixed version with the
> conditional store. That's the one that was hard to get good code
> generation from, wasn't it?

Added above, a few bytes smaller than the shiny new one actually.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-25 18:20                   ` Peter Zijlstra
@ 2017-03-25 18:28                     ` Linus Torvalds
  2017-03-25 18:34                       ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2017-03-25 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Sat, Mar 25, 2017 at 11:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Added above, a few bytes smaller than the shiny new one actually.

Hmm. Sad. The label approach looked like it would match the semantics
of cmpxchg perfectly, but it's not as optimal as it superficially
would have seemed.

And I assume that register allocation etc is different enough that
there's no sane way to diff the asm to see what changed.

           Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-25 18:28                     ` Linus Torvalds
@ 2017-03-25 18:34                       ` Linus Torvalds
  2017-03-25 21:13                         ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2017-03-25 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Sat, Mar 25, 2017 at 11:28 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. Sad. The label approach looked like it would match the semantics
> of cmpxchg perfectly, but it's not as optimal as it superficially
> would have seemed.

Oh, I just noticed that at least your other one didn't mark "success"
as being likely.

That changed code generation a lot for me for the loops, where gcc
would assume that the loop was likely to be taken, which in turn means
that gcc lays out the loop with a backwards branch. Which is denser,
but also likely slower, since most x86 chips predict backwards
branches taken.

So that might be one difference. Although the size differences in the
last case are so small that it might also just be random noise.

               Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-25 18:34                       ` Linus Torvalds
@ 2017-03-25 21:13                         ` Peter Zijlstra
  2017-03-25 22:08                           ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-25 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Sat, Mar 25, 2017 at 11:34:32AM -0700, Linus Torvalds wrote:
> On Sat, Mar 25, 2017 at 11:28 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Hmm. Sad. The label approach looked like it would match the semantics
> > of cmpxchg perfectly, but it's not as optimal as it superficially
> > would have seemed.
> 
> Oh, I just noticed that at least your other one didn't mark "success"
> as being likely.

10730509        4540256  843776 16114541         f5e36d defconfig-build/vmlinux


---
 arch/x86/include/asm/cmpxchg.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index fb961db..d347abc 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -212,8 +212,9 @@ extern void __add_wrong_size(void)
 	default:							\
 		__cmpxchg_wrong_size();					\
 	}								\
+	if (unlikely(!success)) \
 	*_old = __old;							\
-	success;							\
+	likely(success);						\
 })
 
 #define __try_cmpxchg(ptr, pold, new, size)				\

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-25 21:13                         ` Peter Zijlstra
@ 2017-03-25 22:08                           ` Linus Torvalds
  2017-03-27  9:48                             ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2017-03-25 22:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Sat, Mar 25, 2017 at 2:13 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Mar 25, 2017 at 11:34:32AM -0700, Linus Torvalds wrote:
>>
>> Oh, I just noticed that at least your other one didn't mark "success"
>> as being likely.
>
> 10730509        4540256  843776 16114541         f5e36d defconfig-build/vmlinux

Ok, that seems to be the exact same size as with the patch using the
"goto label" approach. So maybe the code generation is the same now.

              Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-25 22:08                           ` Linus Torvalds
@ 2017-03-27  9:48                             ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-27  9:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dmitry Vyukov, Andrew Morton, Andy Lutomirski,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Josh Poimboeuf, Paul McKenney, Thomas Gleixner, Ingo Molnar,
	LKML

On Sat, Mar 25, 2017 at 03:08:10PM -0700, Linus Torvalds wrote:
> On Sat, Mar 25, 2017 at 2:13 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sat, Mar 25, 2017 at 11:34:32AM -0700, Linus Torvalds wrote:
> >>
> >> Oh, I just noticed that at least your other one didn't mark "success"
> >> as being likely.
> >
> > 10730509        4540256  843776 16114541         f5e36d defconfig-build/vmlinux
> 
> Ok, that seems to be the exact same size as with the patch using the
> "goto label" approach. So maybe the code generation is the same now.


OK, so I went and build myself a GCC-7 compiler and constructed the
below table. From this I would propose we do the "try_cmpxchg + if"
thing, also below. Because, while the interface is icky, it is what C11
does for this construct.


GCC-6.3.0:

10735757        (cmpxchg)
10726413        (try_cmpxchg)
10730701        (try_cmpxchg + likely)
10730509        (try_cmpxchg + if)
10730445        (try_cmpxchg-linus)

GCC-7 (20170327):

10709514        (cmpxchg)
10704266        (try_cmpxchg)
10704458        (try_cmpxchg + likely)
10704266        (try_cmpxchg + if)
10704394        (try_cmpxchg-linus)



---
 arch/x86/include/asm/cmpxchg.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index fb961db..d90296d 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -212,8 +212,9 @@ extern void __add_wrong_size(void)
 	default:							\
 		__cmpxchg_wrong_size();					\
 	}								\
-	*_old = __old;							\
-	success;							\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);						\
 })
 
 #define __try_cmpxchg(ptr, pold, new, size)				\

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-24 12:44 locking/atomic: Introduce atomic_try_cmpxchg() Dmitry Vyukov
  2017-03-24 14:21 ` Peter Zijlstra
@ 2017-03-27 12:16 ` Peter Zijlstra
  2017-03-27 13:45   ` Dmitry Vyukov
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-27 12:16 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf, Linus Torvalds,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
> Hi,
> 
> I've come across:
> 
> commit a9ebf306f52c756c4f9e50ee9a60cd6389d71344
> Author: Peter Zijlstra
> Date:   Wed Feb 1 16:39:38 2017 +0100
>     locking/atomic: Introduce atomic_try_cmpxchg()
> 
> The primitive has subtle difference with all other implementation that
> I know of, and can lead to very subtle bugs. Some time ago I've spent
> several days debugging a memory corruption caused by similar
> implementation. 

OK, so how about this?

---
Subject: atomic: Fix try_cmpxchg semantics
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Mar 27 13:54:38 CEST 2017

Dmitry noted that the new try_cmpxchg() primitive is broken when the
old pointer doesn't point to local stack.

He writes: "Consider a classical lock-free stack push:

  node->next = atomic_read(&head);
  do {
  } while (!atomic_try_cmpxchg(&head, &node->next, node));

This code is broken with the current implementation, the problem is
with unconditional update of *__po.

In case of success it writes the same value back into *__po, but in
case of cmpxchg success we might have lose ownership of some memory
locations and potentially over what __po has pointed to. The same
holds for the re-read of *__po. "

He also points out that this makes it surprisingly different from the
similar C/C++ atomic operation.

After investigating the code-gen differences caused by this patch; and
a number of alternatives (Linus dislikes this interface lots), we
arrived at these results (size x86_64-defconfig/vmlinux):

  GCC-6.3.0:

  10735757        cmpxchg
  10726413        try_cmpxchg
  10730509        try_cmpxchg + patch
  10730445        try_cmpxchg-linus

  GCC-7 (20170327):

  10709514        cmpxchg
  10704266        try_cmpxchg
  10704266        try_cmpxchg + patch
  10704394        try_cmpxchg-linus

From this we see that the patch has the advantage of better code-gen on
GCC-7 and keeps the interface roughly consistent with the C language
variant.

Fixes: a9ebf306f52c ("locking/atomic: Introduce atomic_try_cmpxchg()")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/cmpxchg.h |    5 +++--
 include/linux/atomic.h         |   16 ++++++++++------
 2 files changed, 13 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -212,8 +212,9 @@ extern void __add_wrong_size(void)
 	default:							\
 		__cmpxchg_wrong_size();					\
 	}								\
-	*_old = __old;							\
-	success;							\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);						\
 })
 
 #define __try_cmpxchg(ptr, pold, new, size)				\
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -428,9 +428,11 @@
 #define __atomic_try_cmpxchg(type, _p, _po, _n)				\
 ({									\
 	typeof(_po) __po = (_po);					\
-	typeof(*(_po)) __o = *__po;					\
-	*__po = atomic_cmpxchg##type((_p), __o, (_n));			\
-	(*__po == __o);							\
+	typeof(*(_po)) __r, __o = *__po;				\
+	__r = atomic_cmpxchg##type((_p), __o, (_n));			\
+	if (unlikely(__r != __o))					\
+		*__po = __r;						\
+	likely(__r == __o);						\
 })
 
 #define atomic_try_cmpxchg(_p, _po, _n)		__atomic_try_cmpxchg(, _p, _po, _n)
@@ -1022,9 +1024,11 @@ static inline int atomic_dec_if_positive
 #define __atomic64_try_cmpxchg(type, _p, _po, _n)			\
 ({									\
 	typeof(_po) __po = (_po);					\
-	typeof(*(_po)) __o = *__po;					\
-	*__po = atomic64_cmpxchg##type((_p), __o, (_n));		\
-	(*__po == __o);							\
+	typeof(*(_po)) __r, __o = *__po;				\
+	__r = atomic64_cmpxchg##type((_p), __o, (_n));			\
+	if (unlikely(__r != __o))					\
+		*__po = __r;						\
+	likely(__r == __o);						\
 })
 
 #define atomic64_try_cmpxchg(_p, _po, _n)		__atomic64_try_cmpxchg(, _p, _po, _n)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: locking/atomic: Introduce atomic_try_cmpxchg()
  2017-03-27 12:16 ` Peter Zijlstra
@ 2017-03-27 13:45   ` Dmitry Vyukov
  0 siblings, 0 replies; 37+ messages in thread
From: Dmitry Vyukov @ 2017-03-27 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Josh Poimboeuf, Linus Torvalds,
	Paul McKenney, Thomas Gleixner, Ingo Molnar, LKML

On Mon, Mar 27, 2017 at 2:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
>> Hi,
>>
>> I've come across:
>>
>> commit a9ebf306f52c756c4f9e50ee9a60cd6389d71344
>> Author: Peter Zijlstra
>> Date:   Wed Feb 1 16:39:38 2017 +0100
>>     locking/atomic: Introduce atomic_try_cmpxchg()
>>
>> The primitive has subtle difference with all other implementation that
>> I know of, and can lead to very subtle bugs. Some time ago I've spent
>> several days debugging a memory corruption caused by similar
>> implementation.
>
> OK, so how about this?
>
> ---
> Subject: atomic: Fix try_cmpxchg semantics
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Mar 27 13:54:38 CEST 2017
>
> Dmitry noted that the new try_cmpxchg() primitive is broken when the
> old pointer doesn't point to local stack.
>
> He writes: "Consider a classical lock-free stack push:
>
>   node->next = atomic_read(&head);
>   do {
>   } while (!atomic_try_cmpxchg(&head, &node->next, node));
>
> This code is broken with the current implementation, the problem is
> with unconditional update of *__po.
>
> In case of success it writes the same value back into *__po, but in
> case of cmpxchg success we might have lose ownership of some memory
> locations and potentially over what __po has pointed to. The same
> holds for the re-read of *__po. "
>
> He also points out that this makes it surprisingly different from the
> similar C/C++ atomic operation.
>
> After investigating the code-gen differences caused by this patch; and
> a number of alternatives (Linus dislikes this interface lots), we
> arrived at these results (size x86_64-defconfig/vmlinux):
>
>   GCC-6.3.0:
>
>   10735757        cmpxchg
>   10726413        try_cmpxchg
>   10730509        try_cmpxchg + patch
>   10730445        try_cmpxchg-linus
>
>   GCC-7 (20170327):
>
>   10709514        cmpxchg
>   10704266        try_cmpxchg
>   10704266        try_cmpxchg + patch
>   10704394        try_cmpxchg-linus
>
> From this we see that the patch has the advantage of better code-gen on
> GCC-7 and keeps the interface roughly consistent with the C language
> variant.
>
> Fixes: a9ebf306f52c ("locking/atomic: Introduce atomic_try_cmpxchg()")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/cmpxchg.h |    5 +++--
>  include/linux/atomic.h         |   16 ++++++++++------
>  2 files changed, 13 insertions(+), 8 deletions(-)
>
> --- a/arch/x86/include/asm/cmpxchg.h
> +++ b/arch/x86/include/asm/cmpxchg.h
> @@ -212,8 +212,9 @@ extern void __add_wrong_size(void)
>         default:                                                        \
>                 __cmpxchg_wrong_size();                                 \
>         }                                                               \
> -       *_old = __old;                                                  \
> -       success;                                                        \
> +       if (unlikely(!success))                                         \
> +               *_old = __old;                                          \
> +       likely(success);                                                \
>  })
>
>  #define __try_cmpxchg(ptr, pold, new, size)                            \
> --- a/include/linux/atomic.h
> +++ b/include/linux/atomic.h
> @@ -428,9 +428,11 @@
>  #define __atomic_try_cmpxchg(type, _p, _po, _n)                                \
>  ({                                                                     \
>         typeof(_po) __po = (_po);                                       \
> -       typeof(*(_po)) __o = *__po;                                     \
> -       *__po = atomic_cmpxchg##type((_p), __o, (_n));                  \
> -       (*__po == __o);                                                 \
> +       typeof(*(_po)) __r, __o = *__po;                                \
> +       __r = atomic_cmpxchg##type((_p), __o, (_n));                    \
> +       if (unlikely(__r != __o))                                       \
> +               *__po = __r;                                            \
> +       likely(__r == __o);                                             \
>  })
>
>  #define atomic_try_cmpxchg(_p, _po, _n)                __atomic_try_cmpxchg(, _p, _po, _n)
> @@ -1022,9 +1024,11 @@ static inline int atomic_dec_if_positive
>  #define __atomic64_try_cmpxchg(type, _p, _po, _n)                      \
>  ({                                                                     \
>         typeof(_po) __po = (_po);                                       \
> -       typeof(*(_po)) __o = *__po;                                     \
> -       *__po = atomic64_cmpxchg##type((_p), __o, (_n));                \
> -       (*__po == __o);                                                 \
> +       typeof(*(_po)) __r, __o = *__po;                                \
> +       __r = atomic64_cmpxchg##type((_p), __o, (_n));                  \
> +       if (unlikely(__r != __o))                                       \
> +               *__po = __r;                                            \
> +       likely(__r == __o);                                             \
>  })
>
>  #define atomic64_try_cmpxchg(_p, _po, _n)              __atomic64_try_cmpxchg(, _p, _po, _n)


Acked-by: Dmitry Vyukov <dvyukov@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-03-27 13:45 UTC | newest]

Thread overview: 37+ messages
2017-03-24 12:44 locking/atomic: Introduce atomic_try_cmpxchg() Dmitry Vyukov
2017-03-24 14:21 ` Peter Zijlstra
2017-03-24 14:23   ` Dmitry Vyukov
2017-03-24 16:41   ` Peter Zijlstra
2017-03-24 16:54     ` Andy Lutomirski
2017-03-24 17:23       ` Peter Zijlstra
2017-03-24 17:51         ` Dmitry Vyukov
2017-03-24 18:08           ` Peter Zijlstra
2017-03-24 18:13             ` Peter Zijlstra
2017-03-24 19:16               ` Andy Lutomirski
2017-03-24 19:20                 ` Linus Torvalds
2017-03-24 19:27                   ` Andy Lutomirski
2017-03-24 20:15                   ` Peter Zijlstra
2017-03-24 20:14                 ` Peter Zijlstra
2017-03-24 20:21                   ` Andy Lutomirski
2017-03-24 18:16             ` Dmitry Vyukov
2017-03-24 18:00         ` Peter Zijlstra
2017-03-24 18:04           ` Peter Zijlstra
2017-03-24 18:45         ` Andy Lutomirski
2017-03-24 19:17           ` Linus Torvalds
2017-03-24 21:23             ` Peter Zijlstra
2017-03-25  7:51               ` Peter Zijlstra
2017-03-25 18:00                 ` Linus Torvalds
2017-03-25 18:20                   ` Peter Zijlstra
2017-03-25 18:28                     ` Linus Torvalds
2017-03-25 18:34                       ` Linus Torvalds
2017-03-25 21:13                         ` Peter Zijlstra
2017-03-25 22:08                           ` Linus Torvalds
2017-03-27  9:48                             ` Peter Zijlstra
2017-03-24 20:22           ` Peter Zijlstra
2017-03-24 20:27             ` Andy Lutomirski
2017-03-24 21:07               ` Peter Zijlstra
2017-03-24 19:08         ` Linus Torvalds
2017-03-24 20:46           ` Peter Zijlstra
2017-03-24 20:58             ` Linus Torvalds
2017-03-27 12:16 ` Peter Zijlstra
2017-03-27 13:45   ` Dmitry Vyukov
