SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
@ 2018-07-31 17:01 Andrey Ryabinin
  2018-07-31 17:09 ` Florian Westphal
  2018-07-31 17:36 ` Christopher Lameter
  0 siblings, 2 replies; 31+ messages in thread
From: Andrey Ryabinin @ 2018-07-31 17:01 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller, netfilter-devel, coreteam, netdev,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	David Airlie, intel-gfx, dri-devel, Eric Dumazet,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	linux-kernel, Dmitry Vyukov, Christoph Lameter, Andrew Morton,
	linux-mm, Andrey Konovalov, Linus Torvalds


On 07/31/2018 07:04 PM, Andrey Ryabinin wrote:
>> Somewhat offtopic, but I can't understand how SLAB_TYPESAFE_BY_RCU
>> slabs can be useful without ctors or at least memset(0). Objects in
>> such slabs need to be type-stable, but I can't understand how it's
>> possible to establish type stability without a ctor... Are these bugs?
> 
> Yeah, I puzzled by this too. However, I think it's hard but possible to make it work, at least in theory.
> There must be an initializer, which consists of two parts:
> a) initilize objects fields
> b) expose object to the world (add it to list or something like that)
> 
> (a) part must somehow to be ok to race with another cpu which might already use the object.
> (b) part must must use e.g. barriers to make sure that racy users will see previously inilized fields.
> Racy users must have parring barrier of course.
> 
> But it sound fishy, and very easy to fuck up. I won't be surprised if every single one SLAB_TYPESAFE_BY_RCU user
> without ->ctor is bogus. It certainly would be better to convert those to use ->ctor.
> 
> Such caches seems used by networking subsystem in proto_register():
> 
> 		prot->slab = kmem_cache_create_usercopy(prot->name,
> 					prot->obj_size, 0,
> 					SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT |
> 					prot->slab_flags,
> 					prot->useroffset, prot->usersize,
> 					NULL);
> 
> And certain protocols specify SLAB_TYPESAFE_BY_RCU in ->slab_flags, such as:
> llc_proto, smc_proto, smc_proto6, tcp_prot, tcpv6_prot, dccp_v6_prot, dccp_v4_prot.
> 
> 
> Also nf_conntrack_cachep, kernfs_node_cache, jbd2_journal_head_cache and i915_request cache.
> 


[+CC maintainer of the relevant code.]

Guys, it seems that we have a lot of code using SLAB_TYPESAFE_BY_RCU cache without constructor.
I think it's nearly impossible to use that combination without having bugs.
It's either you don't really need the SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in kmem_cache.

Could you guys, please, verify your code if it's really need SLAB_TYPSAFE or constructor?

E.g. the netlink code look extremely suspicious:

	/*
	 * Do not use kmem_cache_zalloc(), as this cache uses
	 * SLAB_TYPESAFE_BY_RCU.
	 */
	ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
	if (ct == NULL)
		goto out;

	spin_lock_init(&ct->lock);

If nf_conntrack_cachep objects really used in rcu typesafe manner, than 'ct' returned by kmem_cache_alloc might still be
in use by another cpu. So we just reinitialize spin_lock used by someone else?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 17:01 SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Andrey Ryabinin
@ 2018-07-31 17:09 ` Florian Westphal
  2018-07-31 17:32   ` Eric Dumazet
  2018-07-31 17:36 ` Christopher Lameter
  1 sibling, 1 reply; 31+ messages in thread
From: Florian Westphal @ 2018-07-31 17:09 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller, netfilter-devel, coreteam, netdev,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	David Airlie, intel-gfx, dri-devel, Eric Dumazet,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	linux-kernel, Dmitry Vyukov, Christoph Lameter, Andrew Morton,
	linux-mm, Andrey Konovalov, Linus Torvalds

Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
> Guys, it seems that we have a lot of code using SLAB_TYPESAFE_BY_RCU cache without constructor.
> I think it's nearly impossible to use that combination without having bugs.
> It's either you don't really need the SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in kmem_cache.
> 
> Could you guys, please, verify your code if it's really need SLAB_TYPSAFE or constructor?
> 
> E.g. the netlink code look extremely suspicious:
> 
> 	/*
> 	 * Do not use kmem_cache_zalloc(), as this cache uses
> 	 * SLAB_TYPESAFE_BY_RCU.
> 	 */
> 	ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
> 	if (ct == NULL)
> 		goto out;
> 
> 	spin_lock_init(&ct->lock);
> 
> If nf_conntrack_cachep objects really used in rcu typesafe manner, than 'ct' returned by kmem_cache_alloc might still be
> in use by another cpu. So we just reinitialize spin_lock used by someone else?

That would be a bug, nf_conn objects are reference counted.

spinlock can only be used after object had its refcount incremented.

lookup operation on nf_conn object:
1. compare keys
2. attempt to obtain refcount (using _not_zero version)
3. compare keys again after refcount was obtained

if any of that fails, nf_conn candidate is skipped.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 17:09 ` Florian Westphal
@ 2018-07-31 17:32   ` Eric Dumazet
  0 siblings, 0 replies; 31+ messages in thread
From: Eric Dumazet @ 2018-07-31 17:32 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Andrey Ryabinin, Theodore Ts'o, jack, linux-ext4,
	Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker,
	dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied,
	intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Ursula Braun, linux-s390, LKML, Dmitry Vyukov, Christoph Lameter,
	Andrew Morton, linux-mm, Andrey Konovalov, Linus Torvalds

On Tue, Jul 31, 2018 at 10:10 AM Florian Westphal <fw@strlen.de> wrote:
>
> Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
> > Guys, it seems that we have a lot of code using SLAB_TYPESAFE_BY_RCU cache without constructor.
> > I think it's nearly impossible to use that combination without having bugs.
> > It's either you don't really need the SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in kmem_cache.
> >
> > Could you guys, please, verify your code if it's really need SLAB_TYPSAFE or constructor?
> >
> > E.g. the netlink code look extremely suspicious:
> >
> >       /*
> >        * Do not use kmem_cache_zalloc(), as this cache uses
> >        * SLAB_TYPESAFE_BY_RCU.
> >        */
> >       ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
> >       if (ct == NULL)
> >               goto out;
> >
> >       spin_lock_init(&ct->lock);
> >
> > If nf_conntrack_cachep objects really used in rcu typesafe manner, than 'ct' returned by kmem_cache_alloc might still be
> > in use by another cpu. So we just reinitialize spin_lock used by someone else?
>
> That would be a bug, nf_conn objects are reference counted.
>
> spinlock can only be used after object had its refcount incremented.
>
> lookup operation on nf_conn object:
> 1. compare keys
> 2. attempt to obtain refcount (using _not_zero version)
> 3. compare keys again after refcount was obtained
>
> if any of that fails, nf_conn candidate is skipped.


Yes, the key here is the refcount, this is only what we need to clear
after kmem_cache_alloc()

By definition, if an object is being freed/reallocated, the refcount
should be already 0, and clearing it again is a NOP.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 17:01 SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Andrey Ryabinin
  2018-07-31 17:09 ` Florian Westphal
@ 2018-07-31 17:36 ` Christopher Lameter
  2018-07-31 17:41   ` Eric Dumazet
  2018-07-31 17:49   ` Linus Torvalds
  1 sibling, 2 replies; 31+ messages in thread
From: Christopher Lameter @ 2018-07-31 17:36 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller, netfilter-devel, coreteam, netdev,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	David Airlie, intel-gfx, dri-devel, Eric Dumazet,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	linux-kernel, Dmitry Vyukov, Andrew Morton, linux-mm,
	Andrey Konovalov, Linus Torvalds

On Tue, 31 Jul 2018, Andrey Ryabinin wrote:

> Guys, it seems that we have a lot of code using SLAB_TYPESAFE_BY_RCU cache without constructor.
> I think it's nearly impossible to use that combination without having bugs.
> It's either you don't really need the SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in kmem_cache.
>
> Could you guys, please, verify your code if it's really need SLAB_TYPSAFE or constructor?
>
> E.g. the netlink code look extremely suspicious:
>
> 	/*
> 	 * Do not use kmem_cache_zalloc(), as this cache uses
> 	 * SLAB_TYPESAFE_BY_RCU.
> 	 */
> 	ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
> 	if (ct == NULL)
> 		goto out;
>
> 	spin_lock_init(&ct->lock);
>
> If nf_conntrack_cachep objects really used in rcu typesafe manner, than 'ct' returned by kmem_cache_alloc might still be
> in use by another cpu. So we just reinitialize spin_lock used by someone else?

ct may still be read by another cpu in a RCU section but the object was
freed elsewhere so no other processor may modify the object.

The lock must have been released before freeing the slab object and thus
the initialization of the spinlock is unnecessary if it was
initialized in ctor.

If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 17:36 ` Christopher Lameter
@ 2018-07-31 17:41   ` Eric Dumazet
  2018-07-31 17:51     ` Dmitry Vyukov
  2018-07-31 17:49   ` Linus Torvalds
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2018-07-31 17:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrey Ryabinin, Theodore Ts'o, jack, linux-ext4,
	Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David Miller, netfilter-devel, coreteam,
	netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen,
	rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Dmitry Vyukov,
	Andrew Morton, linux-mm, Andrey Konovalov, Linus Torvalds

On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter <cl@linux.com> wrote:

>
> If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU?

To allow fast reuse of objects, without going through call_rcu() and
reducing cache efficiency.

I believe this is mentioned in Documentation/RCU/rculist_nulls.txt

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 17:41   ` Eric Dumazet
@ 2018-07-31 17:51     ` Dmitry Vyukov
  2018-07-31 18:16       ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: Dmitry Vyukov @ 2018-07-31 17:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara,
	linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David Miller,
	netfilter-devel, coreteam, netdev, Gerrit Renker, dccp,
	jani.nikula, joonas.lahtinen, rodrigo.vivi, David Airlie,
	intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm,
	Andrey Konovalov, Linus Torvalds

On Tue, Jul 31, 2018 at 7:41 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter <cl@linux.com> wrote:
>
>>
>> If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU?
>
> To allow fast reuse of objects, without going through call_rcu() and
> reducing cache efficiency.
>
> I believe this is mentioned in Documentation/RCU/rculist_nulls.txt


Is it OK to overwrite ct->status? It seems that are some read and
writes to it right after atomic_inc_not_zero.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 17:51     ` Dmitry Vyukov
@ 2018-07-31 18:16       ` Eric Dumazet
  0 siblings, 0 replies; 31+ messages in thread
From: Eric Dumazet @ 2018-07-31 18:16 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, jack,
	linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David Miller,
	netfilter-devel, coreteam, netdev, Gerrit Renker, dccp,
	jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx,
	dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun,
	linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov,
	Linus Torvalds

On Tue, Jul 31, 2018 at 10:51 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>
>
> Is it OK to overwrite ct->status? It seems that are some read and
> writes to it right after atomic_inc_not_zero.

If it is after a (successful) atomic_inc_not_zero(),
the object is guaranteed to be alive (not freed or about to be freed).

About readind/writing a specific field, all traditional locking rules apply.

For TCP socket, we would generally grab the socket lock before
reading/writing various fields.

ct->status seems to be manipulated with set_bit() and clear_bit()
which are SMP safe.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 17:36 ` Christopher Lameter
  2018-07-31 17:41   ` Eric Dumazet
@ 2018-07-31 17:49   ` Linus Torvalds
  2018-07-31 18:51     ` Linus Torvalds
  1 sibling, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2018-07-31 17:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4,
	Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David Miller, NetFilter, coreteam,
	Network Development, gerrit, dccp, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Dmitry Vyukov, Andrew Morton,
	linux-mm, Andrey Konovalov

On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter <cl@linux.com> wrote:
>
> If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU?

.. because the object can be accessed (by RCU) after the refcount has
gone down to zero, and the thing has been released.

That's the whole and only point of SLAB_TYPESAFE_BY_RCU.

That flag basically says:

  "I may end up accessing this object *after* it has been free'd,
because there may be RCU lookups in flight"

This has nothing to do with constructors. It's ok if the object gets
reused as an object of the same type and does *not* get
re-initialized, because we're perfectly fine seeing old stale data.

What it guarantees is that the slab isn't shared with any other kind
of object, _and_ that the underlying pages are free'd after an RCU
quiescent period (so the pages aren't shared with another kind of
object either during an RCU walk).

And it doesn't necessarily have to have a constructor, because the
thing that a RCU walk will care about is

 (a) guaranteed to be an object that *has* been on some RCU list (so
it's not a "new" object)

 (b) the RCU walk needs to have logic to verify that it's still the
*same* object and hasn't been re-used as something else.

So the re-use might initialize the fields lazily, not necessarily using a ctor.

And the point of using SLAB_TYPESAFE_BY_RCU is that using the more
traditional RCU freeing - where you free each object one by one with
an RCU delay - can be prohibitively slow and have a huge memory
overhead (because of big chunks of memory that are queued for
freeing).

In contrast, a SLAB_TYPESAFE_BY_RCU memory gets free'd and re-used
immediately, but because it gets reused as the same kind of object,
the RCU walker can "know" what parts have meaning for re-use, in a way
it couidn't if the re-use was random.

That said, it *is* subtle, and people should be careful.

                 Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 17:49   ` Linus Torvalds
@ 2018-07-31 18:51     ` Linus Torvalds
  2018-08-01  8:46       ` Dmitry Vyukov
  2018-08-01  9:03       ` Andrey Ryabinin
  0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2018-07-31 18:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4,
	Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David Miller, NetFilter, coreteam,
	Network Development, gerrit, dccp, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Dmitry Vyukov, Andrew Morton,
	linux-mm, Andrey Konovalov

On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the re-use might initialize the fields lazily, not necessarily using a ctor.

In particular, the pattern that nf_conntrack uses looks like it is safe.

If you have a well-defined refcount, and use "atomic_inc_not_zero()"
to guard the speculative RCU access section, and use
"atomic_dec_and_test()" in the freeing section, then you should be
safe wrt new allocations.

If you have a completely new allocation that has "random stale
content", you know that it cannot be on the RCU list, so there is no
speculative access that can ever see that random content.

So the only case you need to worry about is a re-use allocation, and
you know that the refcount will start out as zero even if you don't
have a constructor.

So you can think of the refcount itself as always having a zero
constructor, *BUT* you need to be careful with ordering.

In particular, whoever does the allocation needs to then set the
refcount to a non-zero value *after* it has initialized all the other
fields. And in particular, it needs to make sure that it uses the
proper memory ordering to do so.

And in this case, we have

  static struct nf_conn *
  __nf_conntrack_alloc(struct net *net,
  {
        ...
        atomic_set(&ct->ct_general.use, 0);

which is a no-op for the re-use case (whether racing or not, since any
"inc_not_zero" users won't touch it), but initializes it to zero for
the "completely new object" case.

And then, the thing that actually exposes it to the speculative walkers does:

  int
  nf_conntrack_hash_check_insert(struct nf_conn *ct)
  {
        ...
        smp_wmb();
        /* The caller holds a reference to this object */
        atomic_set(&ct->ct_general.use, 2);

which means that it stays as zero until everything is actually set up,
and then the optimistic walker can use the other fields (including
spinlocks etc) to verify that it's actually the right thing. The
smp_wmb() means that the previous initialization really will be
visible before the object is visible.

Side note: on some architectures it might help to make that "smp_wmb
-> atomic_set()" sequence be am "smp_store_release()" instead. Doesn't
matter on x86, but might matter on arm64.

NOTE! One thing to be very worried about is that re-initializing
whatever RCU lists means that now the RCU walker may be walking on the
wrong list so the walker may do the right thing for this particular
entry, but it may miss walking *other* entries. So then you can get
spurious lookup failures, because the RCU walker never walked all the
way to the end of the right list. That ends up being a much more
subtle bug.

But the nf_conntrack case seems to get that right too, see the restart
in ____nf_conntrack_find().

So I don't see anything wrong in nf_conntrack.

But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of
the subtleties have nothing to do with having a constructor, they are
about those "make sure memory ordering wrt refcount is right" and
"restart speculative RCU walk" issues that actually happen regardless
of having a constructor or not.

                  Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 18:51     ` Linus Torvalds
@ 2018-08-01  8:46       ` Dmitry Vyukov
  2018-08-01  9:10         ` Dmitry Vyukov
  2018-08-06 20:20         ` Jan Kara
  2018-08-01  9:03       ` Andrey Ryabinin
  1 sibling, 2 replies; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01  8:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara,
	linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter,
	coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI,
	Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun,
	linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Tue, Jul 31, 2018 at 8:51 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So the re-use might initialize the fields lazily, not necessarily using a ctor.
>
> In particular, the pattern that nf_conntrack uses looks like it is safe.
>
> If you have a well-defined refcount, and use "atomic_inc_not_zero()"
> to guard the speculative RCU access section, and use
> "atomic_dec_and_test()" in the freeing section, then you should be
> safe wrt new allocations.
>
> If you have a completely new allocation that has "random stale
> content", you know that it cannot be on the RCU list, so there is no
> speculative access that can ever see that random content.
>
> So the only case you need to worry about is a re-use allocation, and
> you know that the refcount will start out as zero even if you don't
> have a constructor.
>
> So you can think of the refcount itself as always having a zero
> constructor, *BUT* you need to be careful with ordering.
>
> In particular, whoever does the allocation needs to then set the
> refcount to a non-zero value *after* it has initialized all the other
> fields. And in particular, it needs to make sure that it uses the
> proper memory ordering to do so.
>
> And in this case, we have
>
>   static struct nf_conn *
>   __nf_conntrack_alloc(struct net *net,
>   {
>         ...
>         atomic_set(&ct->ct_general.use, 0);
>
> which is a no-op for the re-use case (whether racing or not, since any
> "inc_not_zero" users won't touch it), but initializes it to zero for
> the "completely new object" case.
>
> And then, the thing that actually exposes it to the speculative walkers does:
>
>   int
>   nf_conntrack_hash_check_insert(struct nf_conn *ct)
>   {
>         ...
>         smp_wmb();
>         /* The caller holds a reference to this object */
>         atomic_set(&ct->ct_general.use, 2);
>
> which means that it stays as zero until everything is actually set up,
> and then the optimistic walker can use the other fields (including
> spinlocks etc) to verify that it's actually the right thing. The
> smp_wmb() means that the previous initialization really will be
> visible before the object is visible.
>
> Side note: on some architectures it might help to make that "smp_wmb
> -> atomic_set()" sequence be am "smp_store_release()" instead. Doesn't
> matter on x86, but might matter on arm64.
>
> NOTE! One thing to be very worried about is that re-initializing
> whatever RCU lists means that now the RCU walker may be walking on the
> wrong list so the walker may do the right thing for this particular
> entry, but it may miss walking *other* entries. So then you can get
> spurious lookup failures, because the RCU walker never walked all the
> way to the end of the right list. That ends up being a much more
> subtle bug.
>
> But the nf_conntrack case seems to get that right too, see the restart
> in ____nf_conntrack_find().
>
> So I don't see anything wrong in nf_conntrack.
>
> But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of
> the subtleties have nothing to do with having a constructor, they are
> about those "make sure memory ordering wrt refcount is right" and
> "restart speculative RCU walk" issues that actually happen regardless
> of having a constructor or not.


Thank you, this is very enlightening.

So the type-stable invariant is established by initialization of the
object after the first kmem_cache_alloc, and then we rely on the fact
that repeated initialization does not break the invariant, which works
because the state is very simple (including debug builds, i.e. no
ODEBUG nor magic values).

There is a bunch of other SLAB_TYPESAFE_BY_RCU uses without ctor:

https://elixir.bootlin.com/linux/latest/source/fs/jbd2/journal.c#L2395
https://elixir.bootlin.com/linux/latest/source/fs/kernfs/mount.c#L415
https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_core.c#L2065
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/i915_gem.c#L5501
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/selftests/mock_gem_device.c#L212
https://elixir.bootlin.com/linux/latest/source/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c#L1131

Also these in proto structs:
https://elixir.bootlin.com/linux/latest/source/net/dccp/ipv4.c#L959
https://elixir.bootlin.com/linux/latest/source/net/dccp/ipv6.c#L1048
https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L2461
https://elixir.bootlin.com/linux/latest/source/net/ipv6/tcp_ipv6.c#L1980
https://elixir.bootlin.com/linux/latest/source/net/llc/af_llc.c#L145
https://elixir.bootlin.com/linux/latest/source/net/smc/af_smc.c#L105
They later created in net/core/sock.c without ctor.

I guess it would be useful to have such extensive comment for each
SLAB_TYPESAFE_BY_RCU use explaining why it is needed and how all the
tricky aspects are handled.

For example, the one in jbd2 is interesting because it memsets the
whole object before freeing it into SLAB_TYPESAFE_BY_RCU slab:

memset(jh, JBD2_POISON_FREE, sizeof(*jh));
kmem_cache_free(jbd2_journal_head_cache, jh);

I guess there are also tricky ways how it can all work in the end
(type-stable state is only a byte, or we check for all possible
combinations of being overwritten with JBD2_POISON_FREE). But at first
sight it does look fishy.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01  8:46       ` Dmitry Vyukov
@ 2018-08-01  9:10         ` Dmitry Vyukov
  2018-08-01 10:35           ` Florian Westphal
  2018-08-06 20:20         ` Jan Kara
  1 sibling, 1 reply; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01  9:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara,
	linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter,
	coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI,
	Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun,
	linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 10:46 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Tue, Jul 31, 2018 at 8:51 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> So the re-use might initialize the fields lazily, not necessarily using a ctor.
>>
>> In particular, the pattern that nf_conntrack uses looks like it is safe.
>>
>> If you have a well-defined refcount, and use "atomic_inc_not_zero()"
>> to guard the speculative RCU access section, and use
>> "atomic_dec_and_test()" in the freeing section, then you should be
>> safe wrt new allocations.
>>
>> If you have a completely new allocation that has "random stale
>> content", you know that it cannot be on the RCU list, so there is no
>> speculative access that can ever see that random content.
>>
>> So the only case you need to worry about is a re-use allocation, and
>> you know that the refcount will start out as zero even if you don't
>> have a constructor.
>>
>> So you can think of the refcount itself as always having a zero
>> constructor, *BUT* you need to be careful with ordering.
>>
>> In particular, whoever does the allocation needs to then set the
>> refcount to a non-zero value *after* it has initialized all the other
>> fields. And in particular, it needs to make sure that it uses the
>> proper memory ordering to do so.
>>
>> And in this case, we have
>>
>>   static struct nf_conn *
>>   __nf_conntrack_alloc(struct net *net,
>>   {
>>         ...
>>         atomic_set(&ct->ct_general.use, 0);
>>
>> which is a no-op for the re-use case (whether racing or not, since any
>> "inc_not_zero" users won't touch it), but initializes it to zero for
>> the "completely new object" case.
>>
>> And then, the thing that actually exposes it to the speculative walkers does:
>>
>>   int
>>   nf_conntrack_hash_check_insert(struct nf_conn *ct)
>>   {
>>         ...
>>         smp_wmb();
>>         /* The caller holds a reference to this object */
>>         atomic_set(&ct->ct_general.use, 2);
>>
>> which means that it stays as zero until everything is actually set up,
>> and then the optimistic walker can use the other fields (including
>> spinlocks etc) to verify that it's actually the right thing. The
>> smp_wmb() means that the previous initialization really will be
>> visible before the object is visible.
>>
>> Side note: on some architectures it might help to make that "smp_wmb
>> -> atomic_set()" sequence be am "smp_store_release()" instead. Doesn't
>> matter on x86, but might matter on arm64.
>>
>> NOTE! One thing to be very worried about is that re-initializing
>> whatever RCU lists means that now the RCU walker may be walking on the
>> wrong list so the walker may do the right thing for this particular
>> entry, but it may miss walking *other* entries. So then you can get
>> spurious lookup failures, because the RCU walker never walked all the
>> way to the end of the right list. That ends up being a much more
>> subtle bug.
>>
>> But the nf_conntrack case seems to get that right too, see the restart
>> in ____nf_conntrack_find().
>>
>> So I don't see anything wrong in nf_conntrack.
>>
>> But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of
>> the subtleties have nothing to do with having a constructor, they are
>> about those "make sure memory ordering wrt refcount is right" and
>> "restart speculative RCU walk" issues that actually happen regardless
>> of having a constructor or not.
>
>
> Thank you, this is very enlightening.
>
> So the type-stable invariant is established by initialization of the
> object after the first kmem_cache_alloc, and then we rely on the fact
> that repeated initialization does not break the invariant, which works
> because the state is very simple (including debug builds, i.e. no
> ODEBUG nor magic values).


Still can't grasp all details.
There is state that we read without taking ct->ct_general.use ref
first, namely ct->state and what's used by nf_ct_key_equal.
So let's say the entry we want to find is in the list, but
____nf_conntrack_find finds a wrong entry earlier because all state it
looks at is random garbage, so it returns the wrong entry to
__nf_conntrack_find_get.
Now (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use))
check in __nf_conntrack_find_get passes, and it returns NULL to the
caller (which means entry is not present). While in reality the entry
is present, but we were just looking at the wrong one.

Also I am not sure about order of checks in (nf_ct_is_dying(ct) ||
!atomic_inc_not_zero(&ct->ct_general.use)), because checking state
before taking the ref is only a best-effort hint, so it can actually
be a dying entry when we take a ref.

So shouldn't it read something like the following?

        rcu_read_lock();
begin:
        h = ____nf_conntrack_find(net, zone, tuple, hash);
        if (h) {
                ct = nf_ct_tuplehash_to_ctrack(h);
                if (!atomic_inc_not_zero(&ct->ct_general.use))
                        goto begin;
                if (unlikely(nf_ct_is_dying(ct)) ||
                    unlikely(!nf_ct_key_equal(h, tuple, zone, net))) {
                        nf_ct_put(ct);
                        goto begin;
                }
        }
        rcu_read_unlock();
        return h;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01  9:10         ` Dmitry Vyukov
@ 2018-08-01 10:35           ` Florian Westphal
  2018-08-01 10:41             ` Dmitry Vyukov
  0 siblings, 1 reply; 31+ messages in thread
From: Florian Westphal @ 2018-08-01 10:35 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, NetFilter, coreteam, Network Development,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

Dmitry Vyukov <dvyukov@google.com> wrote:
> Still can't grasp all details.
> There is state that we read without taking ct->ct_general.use ref
> first, namely ct->state and what's used by nf_ct_key_equal.
> So let's say the entry we want to find is in the list, but
> ____nf_conntrack_find finds a wrong entry earlier because all state it
> looks at is random garbage, so it returns the wrong entry to
> __nf_conntrack_find_get.

If an entry can be found, it can't be random garbage.
We never link entries into global table until state has been set up.

> Now (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use))
> check in __nf_conntrack_find_get passes, and it returns NULL to the
> caller (which means entry is not present).

So entry is going away or marked as dead which for us is same as
'not present', we need to allocate a new entry.

> While in reality the entry
> is present, but we were just looking at the wrong one.

We never add tuples that are identical to the global table.

If N cores receive identical packets at same time with no prior state, all
will allocate a new conntrack, but we notice this when we try to insert the
nf_conn entries into the table.

Only one will succeed, other cpus have to cope with this.
(worst case: all raced packets are dropped along with their conntrack
 object).

For lookup, we have following scenarios:

1. It doesn't exist -> new allocation needed
2. It exists, not dead, has nonzero refount -> use it
3. It exists, but marked as dying -> new allocation needed
4. It exists but has 0 reference count -> new allocation needed
5. It exists, we get reference, but 2nd nf_ct_key_equal check
   fails.  We saw a matching 'old incarnation' that just got
   re-used on other core.  -> retry lookup

> Also I am not sure about order of checks in (nf_ct_is_dying(ct) ||
> !atomic_inc_not_zero(&ct->ct_general.use)), because checking state
> before taking the ref is only a best-effort hint, so it can actually
> be a dying entry when we take a ref.

Yes, it can also become a dying entry after we took the reference.

> So shouldn't it read something like the following?
> 
>         rcu_read_lock();
> begin:
>         h = ____nf_conntrack_find(net, zone, tuple, hash);
>         if (h) {
>                 ct = nf_ct_tuplehash_to_ctrack(h);
>                 if (!atomic_inc_not_zero(&ct->ct_general.use))
>                         goto begin;
>                 if (unlikely(nf_ct_is_dying(ct)) ||
>                     unlikely(!nf_ct_key_equal(h, tuple, zone, net))) {
>                         nf_ct_put(ct);

It would be ok to make this change, but dying bit can be set
at any time e.g. because userspace tells kernel to flush the conntrack table.
So refcount is always > 0 when the DYING bit is set.

I don't see why it would be a problem.

nf_conn struct will stay valid until all cpus have dropped references.
The check in lookup function only serves to hide the known-to-go-away entry.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 10:35           ` Florian Westphal
@ 2018-08-01 10:41             ` Dmitry Vyukov
  2018-08-01 11:40               ` Florian Westphal
  0 siblings, 1 reply; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01 10:41 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, NetFilter,
	coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI,
	Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun,
	linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 12:35 PM, Florian Westphal <fw@strlen.de> wrote:
> Dmitry Vyukov <dvyukov@google.com> wrote:
>> Still can't grasp all details.
>> There is state that we read without taking ct->ct_general.use ref
>> first, namely ct->state and what's used by nf_ct_key_equal.
>> So let's say the entry we want to find is in the list, but
>> ____nf_conntrack_find finds a wrong entry earlier because all state it
>> looks at is random garbage, so it returns the wrong entry to
>> __nf_conntrack_find_get.
>
> If an entry can be found, it can't be random garbage.
> We never link entries into global table until state has been set up.


But... we don't hold a reference to the entry. So say it's in the
table with valid state, now ____nf_conntrack_find discovers it, now
the entry is removed and reused a dozen of times will all associated
state reinitialization. And nf_ct_key_equal looks at it concurrently
and decides if it's the entry we are looking for or now. I think
unless we hold a ref to the entry, it's state needs to be considered
random garbage for correctness reasoning.


>> Now (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use))
>> check in __nf_conntrack_find_get passes, and it returns NULL to the
>> caller (which means entry is not present).
>
> So entry is going away or marked as dead which for us is same as
> 'not present', we need to allocate a new entry.
>
>> While in reality the entry
>> is present, but we were just looking at the wrong one.
>
> We never add tuples that are identical to the global table.
>
> If N cores receive identical packets at same time with no prior state, all
> will allocate a new conntrack, but we notice this when we try to insert the
> nf_conn entries into the table.
>
> Only one will succeed, other cpus have to cope with this.
> (worst case: all raced packets are dropped along with their conntrack
>  object).
>
> For lookup, we have following scenarios:
>
> 1. It doesn't exist -> new allocation needed
> 2. It exists, not dead, has nonzero refount -> use it
> 3. It exists, but marked as dying -> new allocation needed
> 4. It exists but has 0 reference count -> new allocation needed
> 5. It exists, we get reference, but 2nd nf_ct_key_equal check
>    fails.  We saw a matching 'old incarnation' that just got
>    re-used on other core.  -> retry lookup
>
>> Also I am not sure about order of checks in (nf_ct_is_dying(ct) ||
>> !atomic_inc_not_zero(&ct->ct_general.use)), because checking state
>> before taking the ref is only a best-effort hint, so it can actually
>> be a dying entry when we take a ref.
>
> Yes, it can also become a dying entry after we took the reference.
>
>> So shouldn't it read something like the following?
>>
>>         rcu_read_lock();
>> begin:
>>         h = ____nf_conntrack_find(net, zone, tuple, hash);
>>         if (h) {
>>                 ct = nf_ct_tuplehash_to_ctrack(h);
>>                 if (!atomic_inc_not_zero(&ct->ct_general.use))
>>                         goto begin;
>>                 if (unlikely(nf_ct_is_dying(ct)) ||
>>                     unlikely(!nf_ct_key_equal(h, tuple, zone, net))) {
>>                         nf_ct_put(ct);
>
> It would be ok to make this change, but dying bit can be set
> at any time e.g. because userspace tells kernel to flush the conntrack table.
> So refcount is always > 0 when the DYING bit is set.
>
> I don't see why it would be a problem.
>
> nf_conn struct will stay valid until all cpus have dropped references.
> The check in lookup function only serves to hide the known-to-go-away entry.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 10:41             ` Dmitry Vyukov
@ 2018-08-01 11:40               ` Florian Westphal
  2018-08-01 12:38                 ` Dmitry Vyukov
  0 siblings, 1 reply; 31+ messages in thread
From: Florian Westphal @ 2018-08-01 11:40 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Florian Westphal, Linus Torvalds, Christoph Lameter,
	Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4,
	Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	David Miller, NetFilter, coreteam, Network Development,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Aug 1, 2018 at 12:35 PM, Florian Westphal <fw@strlen.de> wrote:
> > Dmitry Vyukov <dvyukov@google.com> wrote:
> >> Still can't grasp all details.
> >> There is state that we read without taking ct->ct_general.use ref
> >> first, namely ct->state and what's used by nf_ct_key_equal.
> >> So let's say the entry we want to find is in the list, but
> >> ____nf_conntrack_find finds a wrong entry earlier because all state it
> >> looks at is random garbage, so it returns the wrong entry to
> >> __nf_conntrack_find_get.
> >
> > If an entry can be found, it can't be random garbage.
> > We never link entries into global table until state has been set up.
> 
> But... we don't hold a reference to the entry. So say it's in the
> table with valid state, now ____nf_conntrack_find discovers it, now
> the entry is removed and reused a dozen of times will all associated
> state reinitialization. And nf_ct_key_equal looks at it concurrently
> and decides if it's the entry we are looking for or now. I think
> unless we hold a ref to the entry, it's state needs to be considered
> random garbage for correctness reasoning.

It checks if it might be the entry we're looking for.

If this was complete random garbage, scheme would not work, as then
we could have entry that isn't in table, has nonzero refcount, but
has its confirmed bit set.

I don't see how that would be possible, any reallocation
makes sure ct->status has CONFIRMED bit clear before setting refcount
to nonzero value.

I think this is the scenario you hint at is:
1. nf_ct_key_equal is true
2. the entry is free'd (or was already free'd)
3. we return NULL to caller due to atomic_inc_not_zero() failure

but i fail to see how thats wrong?

Sure, we could restart lookup but how would that help?

We'd not find the 'candidate' entry again.

We might find entry that has been inserted at this very instant but
newly allocated entries are only inserted into global table until the skb that
created the nf_conn object has made it through the network stack
(postrouting for fowarded, or input for local delivery).

So, the likelyhood of such restart finding another candidate is close to 0,
and won't prevent 'insert race' from happening.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 11:40               ` Florian Westphal
@ 2018-08-01 12:38                 ` Dmitry Vyukov
  2018-08-01 13:46                   ` Florian Westphal
  0 siblings, 1 reply; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01 12:38 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, NetFilter,
	coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI,
	Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun,
	linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 1:40 PM, Florian Westphal <fw@strlen.de> wrote:
> Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Wed, Aug 1, 2018 at 12:35 PM, Florian Westphal <fw@strlen.de> wrote:
>> > Dmitry Vyukov <dvyukov@google.com> wrote:
>> >> Still can't grasp all details.
>> >> There is state that we read without taking ct->ct_general.use ref
>> >> first, namely ct->state and what's used by nf_ct_key_equal.
>> >> So let's say the entry we want to find is in the list, but
>> >> ____nf_conntrack_find finds a wrong entry earlier because all state it
>> >> looks at is random garbage, so it returns the wrong entry to
>> >> __nf_conntrack_find_get.
>> >
>> > If an entry can be found, it can't be random garbage.
>> > We never link entries into global table until state has been set up.
>>
>> But... we don't hold a reference to the entry. So say it's in the
>> table with valid state, now ____nf_conntrack_find discovers it, now
>> the entry is removed and reused a dozen of times will all associated
>> state reinitialization. And nf_ct_key_equal looks at it concurrently
>> and decides if it's the entry we are looking for or now. I think
>> unless we hold a ref to the entry, it's state needs to be considered
>> random garbage for correctness reasoning.
>
> It checks if it might be the entry we're looking for.
>
> If this was complete random garbage, scheme would not work, as then
> we could have entry that isn't in table, has nonzero refcount, but
> has its confirmed bit set.
>
> I don't see how that would be possible, any reallocation
> makes sure ct->status has CONFIRMED bit clear before setting refcount
> to nonzero value.
>
> I think this is the scenario you hint at is:
> 1. nf_ct_key_equal is true
> 2. the entry is free'd (or was already free'd)
> 3. we return NULL to caller due to atomic_inc_not_zero() failure
>
> but i fail to see how thats wrong?
>
> Sure, we could restart lookup but how would that help?
>
> We'd not find the 'candidate' entry again.
>
> We might find entry that has been inserted at this very instant but
> newly allocated entries are only inserted into global table until the skb that
> created the nf_conn object has made it through the network stack
> (postrouting for fowarded, or input for local delivery).
>
> So, the likelyhood of such restart finding another candidate is close to 0,
> and won't prevent 'insert race' from happening.


The scenario I have in mind is different and it relates to the fact
that ____nf_conntrack_find will return the right entry if it's
present, but it can also return an unrelated entry because when it
looks at entries they change underneath.

Let's take any 2 fields compared by nf_ct_key_equal for simplicity,
for example, src.u3 and dst.u3.
Let's say we are looking for an entry with src.u3=A and dst.u3=B,
let's call it (A,B).
Let's say there is an existing entry 1 (A,B) in the global list. But
there is also entry 2 (A,C) earlier in the list.
Now, ____nf_conntrack_find starts checking entry 2 (A,C), it checks
that src.u3==A, so so far it looks good.
Now another thread deletes, reuses and reinitilizes entry 2 for (C,B).
Now, ____nf_conntrack_find checks that dst.u3==B, so it concludes that
it's the right entry, because it observed (A,B). Entry 2 is returned
to __nf_conntrack_find_get.
Now another thread marks entry 2 as dying.
Now __nf_conntrack_find_get sees that it's dying and returns NULL to
caller, _but_ the matching entry 1 (A,B) was in the list all that time
and we should have been discovered it, but we didn't because we were
deraield by the wrong entry 2.

If that scenario is possible that a fix would be to make
__nf_conntrack_find_get ever return NULL iff it got NULL from
____nf_conntrack_find (not if any of the checks has failed).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 12:38                 ` Dmitry Vyukov
@ 2018-08-01 13:46                   ` Florian Westphal
  2018-08-01 13:52                     ` Dmitry Vyukov
  0 siblings, 1 reply; 31+ messages in thread
From: Florian Westphal @ 2018-08-01 13:46 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Florian Westphal, Linus Torvalds, Christoph Lameter,
	Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4,
	Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	David Miller, NetFilter, coreteam, Network Development,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

Dmitry Vyukov <dvyukov@google.com> wrote:
> If that scenario is possible that a fix would be to make

Looks possible.

> __nf_conntrack_find_get ever return NULL iff it got NULL from
> ____nf_conntrack_find (not if any of the checks has failed).

I don't see why we need to restart on nf_ct_is_dying(), but other
than that this seems ok.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 13:46                   ` Florian Westphal
@ 2018-08-01 13:52                     ` Dmitry Vyukov
  0 siblings, 0 replies; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01 13:52 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, NetFilter,
	coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI,
	Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun,
	linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 3:46 PM, Florian Westphal <fw@strlen.de> wrote:
> Dmitry Vyukov <dvyukov@google.com> wrote:
>> If that scenario is possible that a fix would be to make
>
> Looks possible.
>
>> __nf_conntrack_find_get ever return NULL iff it got NULL from
>> ____nf_conntrack_find (not if any of the checks has failed).
>
> I don't see why we need to restart on nf_ct_is_dying(), but other
> than that this seems ok.

Because it can be a wrong entry dying. When we check dying, we don't
yet know if we are looking at the right entry or not. If we
successfully acquire a reference, then recheck nf_ct_key_equal and
_then_ check dying, then we don't need to restart on dying. But with
the current check order, we need to restart on dying too.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01  8:46       ` Dmitry Vyukov
  2018-08-01  9:10         ` Dmitry Vyukov
@ 2018-08-06 20:20         ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Kara @ 2018-08-06 20:20 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, NetFilter, coreteam, Network Development,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed 01-08-18 10:46:35, Dmitry Vyukov wrote:
> I guess it would be useful to have such extensive comment for each
> SLAB_TYPESAFE_BY_RCU use explaining why it is needed and how all the
> tricky aspects are handled.
> 
> For example, the one in jbd2 is interesting because it memsets the
> whole object before freeing it into SLAB_TYPESAFE_BY_RCU slab:
> 
> memset(jh, JBD2_POISON_FREE, sizeof(*jh));
> kmem_cache_free(jbd2_journal_head_cache, jh);
> 
> I guess there are also tricky ways how it can all work in the end
> (type-stable state is only a byte, or we check for all possible
> combinations of being overwritten with JBD2_POISON_FREE). But at first
> sight it does look fishy.

The RCU access is used from a single place:

fs/jbd2/transaction.c: jbd2_write_access_granted()

There are also quite some comments explaining why what it does is safe. The
overwrite by JBD2_POISON_FREE is much older than this RCU stuff (honestly I
didn't know about it until this moment) and has nothing to do with the
safety of RCU access.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-07-31 18:51     ` Linus Torvalds
  2018-08-01  8:46       ` Dmitry Vyukov
@ 2018-08-01  9:03       ` Andrey Ryabinin
  2018-08-01 10:23         ` Eric Dumazet
  1 sibling, 1 reply; 31+ messages in thread
From: Andrey Ryabinin @ 2018-08-01  9:03 UTC (permalink / raw)
  To: Linus Torvalds, Christoph Lameter
  Cc: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, NetFilter, coreteam, Network Development, gerrit,
	dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie,
	intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Dmitry Vyukov, Andrew Morton,
	linux-mm, Andrey Konovalov



On 07/31/2018 09:51 PM, Linus Torvalds wrote:
> On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So the re-use might initialize the fields lazily, not necessarily using a ctor.
> 
> In particular, the pattern that nf_conntrack uses looks like it is safe.
> 
> If you have a well-defined refcount, and use "atomic_inc_not_zero()"
> to guard the speculative RCU access section, and use
> "atomic_dec_and_test()" in the freeing section, then you should be
> safe wrt new allocations.
> 
> If you have a completely new allocation that has "random stale
> content", you know that it cannot be on the RCU list, so there is no
> speculative access that can ever see that random content.
> 
> So the only case you need to worry about is a re-use allocation, and
> you know that the refcount will start out as zero even if you don't
> have a constructor.
> 
> So you can think of the refcount itself as always having a zero
> constructor, *BUT* you need to be careful with ordering.
> 
> In particular, whoever does the allocation needs to then set the
> refcount to a non-zero value *after* it has initialized all the other
> fields. And in particular, it needs to make sure that it uses the
> proper memory ordering to do so.
> 
> And in this case, we have
> 
>   static struct nf_conn *
>   __nf_conntrack_alloc(struct net *net,
>   {
>         ...
>         atomic_set(&ct->ct_general.use, 0);
> 
> which is a no-op for the re-use case (whether racing or not, since any
> "inc_not_zero" users won't touch it), but initializes it to zero for
> the "completely new object" case.
> 
> And then, the thing that actually exposes it to the speculative walkers does:
> 
>   int
>   nf_conntrack_hash_check_insert(struct nf_conn *ct)
>   {
>         ...
>         smp_wmb();
>         /* The caller holds a reference to this object */
>         atomic_set(&ct->ct_general.use, 2);
> 
> which means that it stays as zero until everything is actually set up,
> and then the optimistic walker can use the other fields (including
> spinlocks etc) to verify that it's actually the right thing. The
> smp_wmb() means that the previous initialization really will be
> visible before the object is visible.
> 
> Side note: on some architectures it might help to make that "smp_wmb
> -> atomic_set()" sequence be am "smp_store_release()" instead. Doesn't
> matter on x86, but might matter on arm64.
> 
> NOTE! One thing to be very worried about is that re-initializing
> whatever RCU lists means that now the RCU walker may be walking on the
> wrong list so the walker may do the right thing for this particular
> entry, but it may miss walking *other* entries. So then you can get
> spurious lookup failures, because the RCU walker never walked all the
> way to the end of the right list. That ends up being a much more
> subtle bug.
> 
> But the nf_conntrack case seems to get that right too, see the restart
> in ____nf_conntrack_find().
> 
> So I don't see anything wrong in nf_conntrack.
> 
> But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of
> the subtleties have nothing to do with having a constructor, they are
> about those "make sure memory ordering wrt refcount is right" and
> "restart speculative RCU walk" issues that actually happen regardless
> of having a constructor or not.
> 

I see, thanks. I just don't see any point or advantage of *not* using the constructor
with SLAB_TYPESAFE_BY_RCU caches.
There is always must be a small part of initialization code that could be placed
in constructor. And it's better to put that part into constructor to avoid initializing
what's already has been initialized. If people use SLAB_TYPESAFE_BY_RCU instead of traditional
kfree_rcu() for cache-efficiency and because they want to reuse the object faster, than avoiding
few writes into cache-hot data doesn't look like a bad idea.
E.g. nf_conntrack case could at least move the spin_lock_init() and atomic_set(&ct->ct_general.use, 0)
in the constructor.

I can't think of any advantage in not having the constructor. 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01  9:03       ` Andrey Ryabinin
@ 2018-08-01 10:23         ` Eric Dumazet
  2018-08-01 10:34           ` Dmitry Vyukov
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2018-08-01 10:23 UTC (permalink / raw)
  To: Andrey Ryabinin, Linus Torvalds, Christoph Lameter
  Cc: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, NetFilter, coreteam, Network Development, gerrit,
	dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie,
	intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Dmitry Vyukov, Andrew Morton,
	linux-mm, Andrey Konovalov



On 08/01/2018 02:03 AM, Andrey Ryabinin wrote:

> I can't think of any advantage in not having the constructor. 
> 


I can't see any advantage adding another indirect call,
in RETPOLINE world.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 10:23         ` Eric Dumazet
@ 2018-08-01 10:34           ` Dmitry Vyukov
  2018-08-01 11:28             ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01 10:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrey Ryabinin, Linus Torvalds, Christoph Lameter,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, NetFilter, coreteam, Network Development,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 08/01/2018 02:03 AM, Andrey Ryabinin wrote:
>
>> I can't think of any advantage in not having the constructor.
>
> I can't see any advantage adding another indirect call,
> in RETPOLINE world.

Can you please elaborate what's the problem here?
If slab ctor call have RETPOLINE, then using ctors more does not
introduce any security problems and they are not _that_ slow.
If slab ctors are not protected, then we have problem that needs to be
fixed already.
What am I missing?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 10:34           ` Dmitry Vyukov
@ 2018-08-01 11:28             ` Eric Dumazet
  2018-08-01 11:35               ` Dmitry Vyukov
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2018-08-01 11:28 UTC (permalink / raw)
  To: Dmitry Vyukov, Eric Dumazet
  Cc: Andrey Ryabinin, Linus Torvalds, Christoph Lameter,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, NetFilter, coreteam, Network Development,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov



On 08/01/2018 03:34 AM, Dmitry Vyukov wrote:
> On Wed, Aug 1, 2018 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On 08/01/2018 02:03 AM, Andrey Ryabinin wrote:
>>
>>> I can't think of any advantage in not having the constructor.
>>
>> I can't see any advantage adding another indirect call,
>> in RETPOLINE world.
> 
> Can you please elaborate what's the problem here?
> If slab ctor call have RETPOLINE, then using ctors more does not
> introduce any security problems and they are not _that_ slow.

They _are_ slow, when we have dozens of them in a code path.

I object "having to add" yet another indirect call, if this can be avoided [*]

If some people want to use ctor, fine, but do not request this.

[*] This can be tricky, but worth the pain.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 11:28             ` Eric Dumazet
@ 2018-08-01 11:35               ` Dmitry Vyukov
  2018-08-01 15:15                 ` Christopher Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01 11:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrey Ryabinin, Linus Torvalds, Christoph Lameter,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, NetFilter, coreteam, Network Development,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 1:28 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 08/01/2018 03:34 AM, Dmitry Vyukov wrote:
>> On Wed, Aug 1, 2018 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On 08/01/2018 02:03 AM, Andrey Ryabinin wrote:
>>>
>>>> I can't think of any advantage in not having the constructor.
>>>
>>> I can't see any advantage adding another indirect call,
>>> in RETPOLINE world.
>>
>> Can you please elaborate what's the problem here?
>> If slab ctor call have RETPOLINE, then using ctors more does not
>> introduce any security problems and they are not _that_ slow.
>
> They _are_ slow, when we have dozens of them in a code path.
>
> I object "having to add" yet another indirect call, if this can be avoided [*]
>
> If some people want to use ctor, fine, but do not request this.
>
> [*] This can be tricky, but worth the pain.

But we are trading 1 indirect call for comparable overhead removed
from much more common path. The path that does ctors is also calling
into page alloc, which is much more expensive.
So ctor should be a net win on performance front, no?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 11:35               ` Dmitry Vyukov
@ 2018-08-01 15:15                 ` Christopher Lameter
  2018-08-01 15:37                   ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: Christopher Lameter @ 2018-08-01 15:15 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Eric Dumazet, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o,
	Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter,
	coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI,
	Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun,
	linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, 1 Aug 2018, Dmitry Vyukov wrote:

> But we are trading 1 indirect call for comparable overhead removed
> from much more common path. The path that does ctors is also calling
> into page alloc, which is much more expensive.
> So ctor should be a net win on performance front, no?

ctor would make it esier to review the flow and guarantee that the object
always has certain fields set as required before any use by the subsystem.

ctors are run once on allocation of the slab page for all objects in it.

ctors are not called duiring allocation and freeing of objects from the
slab page. So we could avoid the intialization of the spinlock on each
object allocation which actually should be faster.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 15:15                 ` Christopher Lameter
@ 2018-08-01 15:37                   ` Eric Dumazet
  2018-08-01 15:51                     ` Misuse of constructors Matthew Wilcox
                                       ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Eric Dumazet @ 2018-08-01 15:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dmitry Vyukov, Eric Dumazet, Andrey Ryabinin, Linus Torvalds,
	Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker,
	dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied,
	intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 8:15 AM Christopher Lameter <cl@linux.com> wrote:
>
> On Wed, 1 Aug 2018, Dmitry Vyukov wrote:
>
> > But we are trading 1 indirect call for comparable overhead removed
> > from much more common path. The path that does ctors is also calling
> > into page alloc, which is much more expensive.
> > So ctor should be a net win on performance front, no?
>
> ctor would make it esier to review the flow and guarantee that the object
> always has certain fields set as required before any use by the subsystem.
>
> ctors are run once on allocation of the slab page for all objects in it.
>
> ctors are not called duiring allocation and freeing of objects from the
> slab page. So we could avoid the intialization of the spinlock on each
> object allocation which actually should be faster.


This strategy might have been a win 30 years ago when cpu had no
caches (or too small anyway)

What probability is that the 60 bytes around the spinlock are not
touched after the object is freshly allocated ?

-> None

Writing 60 bytes  in one cache line instead of 64 has really the same
cost. The cache line miss is the real killer.

Feel free to write the patches, test them,  but I doubt you will have any gain.

Remember btw that TCP sockets can be either completely fresh
(socket() call, using memset() to clear the whole object),
or clones (accept() thus copying the parent socket)

The idea of having a ctor() would only be a win if all the fields that
can be initialized in the ctor are contiguous and fill an integral
number of cache lines.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Misuse of constructors
  2018-08-01 15:37                   ` Eric Dumazet
@ 2018-08-01 15:51                     ` Matthew Wilcox
  2018-08-01 15:53                     ` SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Dmitry Vyukov
  2018-08-01 16:22                     ` Christopher Lameter
  2 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox @ 2018-08-01 15:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Dmitry Vyukov, Eric Dumazet, Andrey Ryabinin,
	Linus Torvalds, Theodore Ts'o, jack, linux-ext4,
	Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David Miller, netfilter-devel, coreteam,
	netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen,
	rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton,
	linux-mm, Andrey Konovalov, linux-fsdevel

On Wed, Aug 01, 2018 at 08:37:07AM -0700, Eric Dumazet wrote:
> The idea of having a ctor() would only be a win if all the fields that
> can be initialized in the ctor are contiguous and fill an integral
> number of cache lines.

Let's state it more generally: Having a ctor is only a win if it allows
the user to avoid reading or writing a significant number of cachelines.

For example, the radix tree node occupies nine cachelines (576 bytes).
The typical user will touch two of those cacheliens (the header and the
cacheline which contains the slot of interest).  That's seven cachelines
which don't even need to be read, let alone written.

I think filesystems are particularly prone to this antipattern of
initialising some of their inode with a ctor and the remainder at
alloc_inode() time (compare the locations of the struct members touched
by inode_init_always() and inode_init_once()).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 15:37                   ` Eric Dumazet
  2018-08-01 15:51                     ` Misuse of constructors Matthew Wilcox
@ 2018-08-01 15:53                     ` Dmitry Vyukov
  2018-08-01 16:22                     ` Christopher Lameter
  2 siblings, 0 replies; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01 15:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Eric Dumazet, Andrey Ryabinin, Linus Torvalds,
	Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, NetFilter, coreteam, netdev, Gerrit Renker, dccp,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, David Airlie,
	intel-gfx, DRI, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 5:37 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Wed, Aug 1, 2018 at 8:15 AM Christopher Lameter <cl@linux.com> wrote:
>>
>> On Wed, 1 Aug 2018, Dmitry Vyukov wrote:
>>
>> > But we are trading 1 indirect call for comparable overhead removed
>> > from much more common path. The path that does ctors is also calling
>> > into page alloc, which is much more expensive.
>> > So ctor should be a net win on performance front, no?
>>
>> ctor would make it esier to review the flow and guarantee that the object
>> always has certain fields set as required before any use by the subsystem.
>>
>> ctors are run once on allocation of the slab page for all objects in it.
>>
>> ctors are not called duiring allocation and freeing of objects from the
>> slab page. So we could avoid the intialization of the spinlock on each
>> object allocation which actually should be faster.
>
>
> This strategy might have been a win 30 years ago when cpu had no
> caches (or too small anyway)
>
> What probability is that the 60 bytes around the spinlock are not
> touched after the object is freshly allocated ?
>
> -> None
>
> Writing 60 bytes  in one cache line instead of 64 has really the same
> cost. The cache line miss is the real killer.
>
> Feel free to write the patches, test them,  but I doubt you will have any gain.
>
> Remember btw that TCP sockets can be either completely fresh
> (socket() call, using memset() to clear the whole object),
> or clones (accept() thus copying the parent socket)
>
> The idea of having a ctor() would only be a win if all the fields that
> can be initialized in the ctor are contiguous and fill an integral
> number of cache lines.

Code size can have some visible performance impact too.

But either way, what you say only means that ctors are not necessary
significantly faster. But your original point was that they are
slower.
If they are not slower, then what Andrey said seems to make sense:
some gain on code comprehension front re type-stability invariant +
some gain on performance front (even if not too big) and no downsides.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 15:37                   ` Eric Dumazet
  2018-08-01 15:51                     ` Misuse of constructors Matthew Wilcox
  2018-08-01 15:53                     ` SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Dmitry Vyukov
@ 2018-08-01 16:22                     ` Christopher Lameter
  2018-08-01 16:25                       ` Eric Dumazet
  2 siblings, 1 reply; 31+ messages in thread
From: Christopher Lameter @ 2018-08-01 16:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dmitry Vyukov, Eric Dumazet, Andrey Ryabinin, Linus Torvalds,
	Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker,
	dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied,
	intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, 1 Aug 2018, Eric Dumazet wrote:

> The idea of having a ctor() would only be a win if all the fields that
> can be initialized in the ctor are contiguous and fill an integral
> number of cache lines.

Ok. Its reducing code size and makes the object status more consistent.
Isn't that enough?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 16:22                     ` Christopher Lameter
@ 2018-08-01 16:25                       ` Eric Dumazet
  2018-08-01 16:47                         ` Dmitry Vyukov
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2018-08-01 16:25 UTC (permalink / raw)
  To: Christopher Lameter, Eric Dumazet
  Cc: Dmitry Vyukov, Eric Dumazet, Andrey Ryabinin, Linus Torvalds,
	Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker,
	dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied,
	intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm,
	Andrey Konovalov



On 08/01/2018 09:22 AM, Christopher Lameter wrote:
> On Wed, 1 Aug 2018, Eric Dumazet wrote:
> 
>> The idea of having a ctor() would only be a win if all the fields that
>> can be initialized in the ctor are contiguous and fill an integral
>> number of cache lines.
> 
> Ok. Its reducing code size and makes the object status more consistent.
> Isn't that enough?
> 

Prove it ;)

I yet have to seen actual numbers.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 16:25                       ` Eric Dumazet
@ 2018-08-01 16:47                         ` Dmitry Vyukov
  2018-08-01 17:18                           ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: Dmitry Vyukov @ 2018-08-01 16:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christopher Lameter, Eric Dumazet, Andrey Ryabinin,
	Linus Torvalds, Theodore Ts'o, Jan Kara, linux-ext4,
	Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David Miller, NetFilter, coreteam, netdev,
	Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	David Airlie, intel-gfx, DRI, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton,
	linux-mm, Andrey Konovalov

On Wed, Aug 1, 2018 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 08/01/2018 09:22 AM, Christopher Lameter wrote:
>> On Wed, 1 Aug 2018, Eric Dumazet wrote:
>>
>>> The idea of having a ctor() would only be a win if all the fields that
>>> can be initialized in the ctor are contiguous and fill an integral
>>> number of cache lines.
>>
>> Ok. Its reducing code size and makes the object status more consistent.
>> Isn't that enough?
>>
>
> Prove it ;)
>
> I yet have to seen actual numbers.

Proving with numbers is required for a claimed performance improvement
at the cost of code degradation/increase. For a win-win change there
is really nothing to prove.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
  2018-08-01 16:47                         ` Dmitry Vyukov
@ 2018-08-01 17:18                           ` Eric Dumazet
  0 siblings, 0 replies; 31+ messages in thread
From: Eric Dumazet @ 2018-08-01 17:18 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Eric Dumazet, Christoph Lameter, Andrey Ryabinin, Linus Torvalds,
	Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker,
	dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied,
	intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm,
	Andrey Konovalov

On Wed, Aug 1, 2018 at 9:47 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> Proving with numbers is required for a claimed performance improvement
> at the cost of code degradation/increase. For a win-win change there
> is really nothing to prove.

You have to _prove_ it is a win-win.

It is not sufficient to claim it is a win-win.

Sorry, but I do have bugs to take care of.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-08-06 20:20 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-31 17:01 SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Andrey Ryabinin
2018-07-31 17:09 ` Florian Westphal
2018-07-31 17:32   ` Eric Dumazet
2018-07-31 17:36 ` Christopher Lameter
2018-07-31 17:41   ` Eric Dumazet
2018-07-31 17:51     ` Dmitry Vyukov
2018-07-31 18:16       ` Eric Dumazet
2018-07-31 17:49   ` Linus Torvalds
2018-07-31 18:51     ` Linus Torvalds
2018-08-01  8:46       ` Dmitry Vyukov
2018-08-01  9:10         ` Dmitry Vyukov
2018-08-01 10:35           ` Florian Westphal
2018-08-01 10:41             ` Dmitry Vyukov
2018-08-01 11:40               ` Florian Westphal
2018-08-01 12:38                 ` Dmitry Vyukov
2018-08-01 13:46                   ` Florian Westphal
2018-08-01 13:52                     ` Dmitry Vyukov
2018-08-06 20:20         ` Jan Kara
2018-08-01  9:03       ` Andrey Ryabinin
2018-08-01 10:23         ` Eric Dumazet
2018-08-01 10:34           ` Dmitry Vyukov
2018-08-01 11:28             ` Eric Dumazet
2018-08-01 11:35               ` Dmitry Vyukov
2018-08-01 15:15                 ` Christopher Lameter
2018-08-01 15:37                   ` Eric Dumazet
2018-08-01 15:51                     ` Misuse of constructors Matthew Wilcox
2018-08-01 15:53                     ` SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Dmitry Vyukov
2018-08-01 16:22                     ` Christopher Lameter
2018-08-01 16:25                       ` Eric Dumazet
2018-08-01 16:47                         ` Dmitry Vyukov
2018-08-01 17:18                           ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).