* SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation)
@ 2018-07-31 17:01 Andrey Ryabinin
2018-07-31 17:09 ` Florian Westphal
2018-07-31 17:36 ` Christopher Lameter
0 siblings, 2 replies; 31+ messages in thread
From: Andrey Ryabinin @ 2018-07-31 17:01 UTC (permalink / raw)
To: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman,
Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
David S. Miller, netfilter-devel, coreteam, netdev,
Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
David Airlie, intel-gfx, dri-devel, Eric Dumazet,
Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390,
linux-kernel, Dmitry Vyukov, Christoph Lameter, Andrew Morton,
linux-mm, Andrey Konovalov, Linus Torvalds
On 07/31/2018 07:04 PM, Andrey Ryabinin wrote:
>> Somewhat offtopic, but I can't understand how SLAB_TYPESAFE_BY_RCU
>> slabs can be useful without ctors or at least memset(0). Objects in
>> such slabs need to be type-stable, but I can't understand how it's
>> possible to establish type stability without a ctor... Are these bugs?
>
> Yeah, I puzzled by this too. However, I think it's hard but possible to make it work, at least in theory.
> There must be an initializer, which consists of two parts:
> a) initilize objects fields
> b) expose object to the world (add it to list or something like that)
>
> (a) part must somehow to be ok to race with another cpu which might already use the object.
> (b) part must must use e.g. barriers to make sure that racy users will see previously inilized fields.
> Racy users must have parring barrier of course.
>
> But it sound fishy, and very easy to fuck up. I won't be surprised if every single one SLAB_TYPESAFE_BY_RCU user
> without ->ctor is bogus. It certainly would be better to convert those to use ->ctor.
>
> Such caches seems used by networking subsystem in proto_register():
>
> prot->slab = kmem_cache_create_usercopy(prot->name,
> prot->obj_size, 0,
> SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT |
> prot->slab_flags,
> prot->useroffset, prot->usersize,
> NULL);
>
> And certain protocols specify SLAB_TYPESAFE_BY_RCU in ->slab_flags, such as:
> llc_proto, smc_proto, smc_proto6, tcp_prot, tcpv6_prot, dccp_v6_prot, dccp_v4_prot.
>
>
> Also nf_conntrack_cachep, kernfs_node_cache, jbd2_journal_head_cache and i915_request cache.
>
[+CC maintainer of the relevant code.]
Guys, it seems that we have a lot of code using SLAB_TYPESAFE_BY_RCU cache without constructor.
I think it's nearly impossible to use that combination without having bugs.
It's either you don't really need the SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in kmem_cache.
Could you guys, please, verify your code if it's really need SLAB_TYPSAFE or constructor?
E.g. the netlink code look extremely suspicious:
/*
* Do not use kmem_cache_zalloc(), as this cache uses
* SLAB_TYPESAFE_BY_RCU.
*/
ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
if (ct == NULL)
goto out;
spin_lock_init(&ct->lock);
If nf_conntrack_cachep objects really used in rcu typesafe manner, than 'ct' returned by kmem_cache_alloc might still be
in use by another cpu. So we just reinitialize spin_lock used by someone else?
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 17:01 SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Andrey Ryabinin @ 2018-07-31 17:09 ` Florian Westphal 2018-07-31 17:32 ` Eric Dumazet 2018-07-31 17:36 ` Christopher Lameter 1 sibling, 1 reply; 31+ messages in thread From: Florian Westphal @ 2018-07-31 17:09 UTC (permalink / raw) To: Andrey Ryabinin Cc: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David S. Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, David Airlie, intel-gfx, dri-devel, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, linux-kernel, Dmitry Vyukov, Christoph Lameter, Andrew Morton, linux-mm, Andrey Konovalov, Linus Torvalds Andrey Ryabinin <aryabinin@virtuozzo.com> wrote: > Guys, it seems that we have a lot of code using SLAB_TYPESAFE_BY_RCU cache without constructor. > I think it's nearly impossible to use that combination without having bugs. > It's either you don't really need the SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in kmem_cache. > > Could you guys, please, verify your code if it's really need SLAB_TYPSAFE or constructor? > > E.g. the netlink code look extremely suspicious: > > /* > * Do not use kmem_cache_zalloc(), as this cache uses > * SLAB_TYPESAFE_BY_RCU. > */ > ct = kmem_cache_alloc(nf_conntrack_cachep, gfp); > if (ct == NULL) > goto out; > > spin_lock_init(&ct->lock); > > If nf_conntrack_cachep objects really used in rcu typesafe manner, than 'ct' returned by kmem_cache_alloc might still be > in use by another cpu. So we just reinitialize spin_lock used by someone else? That would be a bug, nf_conn objects are reference counted. spinlock can only be used after object had its refcount incremented. lookup operation on nf_conn object: 1. compare keys 2. attempt to obtain refcount (using _not_zero version) 3. compare keys again after refcount was obtained if any of that fails, nf_conn candidate is skipped. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 17:09 ` Florian Westphal @ 2018-07-31 17:32 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2018-07-31 17:32 UTC (permalink / raw) To: Florian Westphal Cc: Andrey Ryabinin, Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Dmitry Vyukov, Christoph Lameter, Andrew Morton, linux-mm, Andrey Konovalov, Linus Torvalds On Tue, Jul 31, 2018 at 10:10 AM Florian Westphal <fw@strlen.de> wrote: > > Andrey Ryabinin <aryabinin@virtuozzo.com> wrote: > > Guys, it seems that we have a lot of code using SLAB_TYPESAFE_BY_RCU cache without constructor. > > I think it's nearly impossible to use that combination without having bugs. > > It's either you don't really need the SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in kmem_cache. > > > > Could you guys, please, verify your code if it's really need SLAB_TYPSAFE or constructor? > > > > E.g. the netlink code look extremely suspicious: > > > > /* > > * Do not use kmem_cache_zalloc(), as this cache uses > > * SLAB_TYPESAFE_BY_RCU. > > */ > > ct = kmem_cache_alloc(nf_conntrack_cachep, gfp); > > if (ct == NULL) > > goto out; > > > > spin_lock_init(&ct->lock); > > > > If nf_conntrack_cachep objects really used in rcu typesafe manner, than 'ct' returned by kmem_cache_alloc might still be > > in use by another cpu. So we just reinitialize spin_lock used by someone else? > > That would be a bug, nf_conn objects are reference counted. > > spinlock can only be used after object had its refcount incremented. > > lookup operation on nf_conn object: > 1. compare keys > 2. attempt to obtain refcount (using _not_zero version) > 3. compare keys again after refcount was obtained > > if any of that fails, nf_conn candidate is skipped. Yes, the key here is the refcount, this is only what we need to clear after kmem_cache_alloc() By definition, if an object is being freed/reallocated, the refcount should be already 0, and clearing it again is a NOP. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 17:01 SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Andrey Ryabinin 2018-07-31 17:09 ` Florian Westphal @ 2018-07-31 17:36 ` Christopher Lameter 2018-07-31 17:41 ` Eric Dumazet 2018-07-31 17:49 ` Linus Torvalds 1 sibling, 2 replies; 31+ messages in thread From: Christopher Lameter @ 2018-07-31 17:36 UTC (permalink / raw) To: Andrey Ryabinin Cc: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David S. Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, David Airlie, intel-gfx, dri-devel, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, linux-kernel, Dmitry Vyukov, Andrew Morton, linux-mm, Andrey Konovalov, Linus Torvalds On Tue, 31 Jul 2018, Andrey Ryabinin wrote: > Guys, it seems that we have a lot of code using SLAB_TYPESAFE_BY_RCU cache without constructor. > I think it's nearly impossible to use that combination without having bugs. > It's either you don't really need the SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in kmem_cache. > > Could you guys, please, verify your code if it's really need SLAB_TYPSAFE or constructor? > > E.g. the netlink code look extremely suspicious: > > /* > * Do not use kmem_cache_zalloc(), as this cache uses > * SLAB_TYPESAFE_BY_RCU. > */ > ct = kmem_cache_alloc(nf_conntrack_cachep, gfp); > if (ct == NULL) > goto out; > > spin_lock_init(&ct->lock); > > If nf_conntrack_cachep objects really used in rcu typesafe manner, than 'ct' returned by kmem_cache_alloc might still be > in use by another cpu. So we just reinitialize spin_lock used by someone else? ct may still be read by another cpu in a RCU section but the object was freed elsewhere so no other processor may modify the object. The lock must have been released before freeing the slab object and thus the initialization of the spinlock is unnecessary if it was initialized in ctor. If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 17:36 ` Christopher Lameter @ 2018-07-31 17:41 ` Eric Dumazet 2018-07-31 17:51 ` Dmitry Vyukov 2018-07-31 17:49 ` Linus Torvalds 1 sibling, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2018-07-31 17:41 UTC (permalink / raw) To: Christoph Lameter Cc: Andrey Ryabinin, Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Dmitry Vyukov, Andrew Morton, linux-mm, Andrey Konovalov, Linus Torvalds On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter <cl@linux.com> wrote: > > If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU? To allow fast reuse of objects, without going through call_rcu() and reducing cache efficiency. I believe this is mentioned in Documentation/RCU/rculist_nulls.txt ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 17:41 ` Eric Dumazet @ 2018-07-31 17:51 ` Dmitry Vyukov 2018-07-31 18:16 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Dmitry Vyukov @ 2018-07-31 17:51 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, David Airlie, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov, Linus Torvalds On Tue, Jul 31, 2018 at 7:41 PM, Eric Dumazet <edumazet@google.com> wrote: > On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter <cl@linux.com> wrote: > >> >> If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU? > > To allow fast reuse of objects, without going through call_rcu() and > reducing cache efficiency. > > I believe this is mentioned in Documentation/RCU/rculist_nulls.txt Is it OK to overwrite ct->status? It seems that are some read and writes to it right after atomic_inc_not_zero. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 17:51 ` Dmitry Vyukov @ 2018-07-31 18:16 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2018-07-31 18:16 UTC (permalink / raw) To: Dmitry Vyukov Cc: Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov, Linus Torvalds On Tue, Jul 31, 2018 at 10:51 AM Dmitry Vyukov <dvyukov@google.com> wrote: > > > Is it OK to overwrite ct->status? It seems that are some read and > writes to it right after atomic_inc_not_zero. If it is after a (successful) atomic_inc_not_zero(), the object is guaranteed to be alive (not freed or about to be freed). About readind/writing a specific field, all traditional locking rules apply. For TCP socket, we would generally grab the socket lock before reading/writing various fields. ct->status seems to be manipulated with set_bit() and clear_bit() which are SMP safe. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 17:36 ` Christopher Lameter 2018-07-31 17:41 ` Eric Dumazet @ 2018-07-31 17:49 ` Linus Torvalds 2018-07-31 18:51 ` Linus Torvalds 1 sibling, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2018-07-31 17:49 UTC (permalink / raw) To: Christoph Lameter Cc: Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, gerrit, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Dmitry Vyukov, Andrew Morton, linux-mm, Andrey Konovalov On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter <cl@linux.com> wrote: > > If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU? .. because the object can be accessed (by RCU) after the refcount has gone down to zero, and the thing has been released. That's the whole and only point of SLAB_TYPESAFE_BY_RCU. That flag basically says: "I may end up accessing this object *after* it has been free'd, because there may be RCU lookups in flight" This has nothing to do with constructors. It's ok if the object gets reused as an object of the same type and does *not* get re-initialized, because we're perfectly fine seeing old stale data. What it guarantees is that the slab isn't shared with any other kind of object, _and_ that the underlying pages are free'd after an RCU quiescent period (so the pages aren't shared with another kind of object either during an RCU walk). And it doesn't necessarily have to have a constructor, because the thing that a RCU walk will care about is (a) guaranteed to be an object that *has* been on some RCU list (so it's not a "new" object) (b) the RCU walk needs to have logic to verify that it's still the *same* object and hasn't been re-used as something else. So the re-use might initialize the fields lazily, not necessarily using a ctor. And the point of using SLAB_TYPESAFE_BY_RCU is that using the more traditional RCU freeing - where you free each object one by one with an RCU delay - can be prohibitively slow and have a huge memory overhead (because of big chunks of memory that are queued for freeing). In contrast, a SLAB_TYPESAFE_BY_RCU memory gets free'd and re-used immediately, but because it gets reused as the same kind of object, the RCU walker can "know" what parts have meaning for re-use, in a way it couidn't if the re-use was random. That said, it *is* subtle, and people should be careful. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 17:49 ` Linus Torvalds @ 2018-07-31 18:51 ` Linus Torvalds 2018-08-01 8:46 ` Dmitry Vyukov 2018-08-01 9:03 ` Andrey Ryabinin 0 siblings, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2018-07-31 18:51 UTC (permalink / raw) To: Christoph Lameter Cc: Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, gerrit, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Dmitry Vyukov, Andrew Morton, linux-mm, Andrey Konovalov On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So the re-use might initialize the fields lazily, not necessarily using a ctor. In particular, the pattern that nf_conntrack uses looks like it is safe. If you have a well-defined refcount, and use "atomic_inc_not_zero()" to guard the speculative RCU access section, and use "atomic_dec_and_test()" in the freeing section, then you should be safe wrt new allocations. If you have a completely new allocation that has "random stale content", you know that it cannot be on the RCU list, so there is no speculative access that can ever see that random content. So the only case you need to worry about is a re-use allocation, and you know that the refcount will start out as zero even if you don't have a constructor. So you can think of the refcount itself as always having a zero constructor, *BUT* you need to be careful with ordering. In particular, whoever does the allocation needs to then set the refcount to a non-zero value *after* it has initialized all the other fields. And in particular, it needs to make sure that it uses the proper memory ordering to do so. And in this case, we have static struct nf_conn * __nf_conntrack_alloc(struct net *net, { ... atomic_set(&ct->ct_general.use, 0); which is a no-op for the re-use case (whether racing or not, since any "inc_not_zero" users won't touch it), but initializes it to zero for the "completely new object" case. And then, the thing that actually exposes it to the speculative walkers does: int nf_conntrack_hash_check_insert(struct nf_conn *ct) { ... smp_wmb(); /* The caller holds a reference to this object */ atomic_set(&ct->ct_general.use, 2); which means that it stays as zero until everything is actually set up, and then the optimistic walker can use the other fields (including spinlocks etc) to verify that it's actually the right thing. The smp_wmb() means that the previous initialization really will be visible before the object is visible. Side note: on some architectures it might help to make that "smp_wmb -> atomic_set()" sequence be am "smp_store_release()" instead. Doesn't matter on x86, but might matter on arm64. NOTE! One thing to be very worried about is that re-initializing whatever RCU lists means that now the RCU walker may be walking on the wrong list so the walker may do the right thing for this particular entry, but it may miss walking *other* entries. So then you can get spurious lookup failures, because the RCU walker never walked all the way to the end of the right list. That ends up being a much more subtle bug. But the nf_conntrack case seems to get that right too, see the restart in ____nf_conntrack_find(). So I don't see anything wrong in nf_conntrack. But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of the subtleties have nothing to do with having a constructor, they are about those "make sure memory ordering wrt refcount is right" and "restart speculative RCU walk" issues that actually happen regardless of having a constructor or not. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 18:51 ` Linus Torvalds @ 2018-08-01 8:46 ` Dmitry Vyukov 2018-08-01 9:10 ` Dmitry Vyukov 2018-08-06 20:20 ` Jan Kara 2018-08-01 9:03 ` Andrey Ryabinin 1 sibling, 2 replies; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 8:46 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Tue, Jul 31, 2018 at 8:51 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> So the re-use might initialize the fields lazily, not necessarily using a ctor. > > In particular, the pattern that nf_conntrack uses looks like it is safe. > > If you have a well-defined refcount, and use "atomic_inc_not_zero()" > to guard the speculative RCU access section, and use > "atomic_dec_and_test()" in the freeing section, then you should be > safe wrt new allocations. > > If you have a completely new allocation that has "random stale > content", you know that it cannot be on the RCU list, so there is no > speculative access that can ever see that random content. > > So the only case you need to worry about is a re-use allocation, and > you know that the refcount will start out as zero even if you don't > have a constructor. > > So you can think of the refcount itself as always having a zero > constructor, *BUT* you need to be careful with ordering. > > In particular, whoever does the allocation needs to then set the > refcount to a non-zero value *after* it has initialized all the other > fields. And in particular, it needs to make sure that it uses the > proper memory ordering to do so. > > And in this case, we have > > static struct nf_conn * > __nf_conntrack_alloc(struct net *net, > { > ... > atomic_set(&ct->ct_general.use, 0); > > which is a no-op for the re-use case (whether racing or not, since any > "inc_not_zero" users won't touch it), but initializes it to zero for > the "completely new object" case. > > And then, the thing that actually exposes it to the speculative walkers does: > > int > nf_conntrack_hash_check_insert(struct nf_conn *ct) > { > ... > smp_wmb(); > /* The caller holds a reference to this object */ > atomic_set(&ct->ct_general.use, 2); > > which means that it stays as zero until everything is actually set up, > and then the optimistic walker can use the other fields (including > spinlocks etc) to verify that it's actually the right thing. The > smp_wmb() means that the previous initialization really will be > visible before the object is visible. > > Side note: on some architectures it might help to make that "smp_wmb > -> atomic_set()" sequence be am "smp_store_release()" instead. Doesn't > matter on x86, but might matter on arm64. > > NOTE! One thing to be very worried about is that re-initializing > whatever RCU lists means that now the RCU walker may be walking on the > wrong list so the walker may do the right thing for this particular > entry, but it may miss walking *other* entries. So then you can get > spurious lookup failures, because the RCU walker never walked all the > way to the end of the right list. That ends up being a much more > subtle bug. > > But the nf_conntrack case seems to get that right too, see the restart > in ____nf_conntrack_find(). > > So I don't see anything wrong in nf_conntrack. > > But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of > the subtleties have nothing to do with having a constructor, they are > about those "make sure memory ordering wrt refcount is right" and > "restart speculative RCU walk" issues that actually happen regardless > of having a constructor or not. Thank you, this is very enlightening. So the type-stable invariant is established by initialization of the object after the first kmem_cache_alloc, and then we rely on the fact that repeated initialization does not break the invariant, which works because the state is very simple (including debug builds, i.e. no ODEBUG nor magic values). There is a bunch of other SLAB_TYPESAFE_BY_RCU uses without ctor: https://elixir.bootlin.com/linux/latest/source/fs/jbd2/journal.c#L2395 https://elixir.bootlin.com/linux/latest/source/fs/kernfs/mount.c#L415 https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_core.c#L2065 https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/i915_gem.c#L5501 https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/selftests/mock_gem_device.c#L212 https://elixir.bootlin.com/linux/latest/source/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c#L1131 Also these in proto structs: https://elixir.bootlin.com/linux/latest/source/net/dccp/ipv4.c#L959 https://elixir.bootlin.com/linux/latest/source/net/dccp/ipv6.c#L1048 https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L2461 https://elixir.bootlin.com/linux/latest/source/net/ipv6/tcp_ipv6.c#L1980 https://elixir.bootlin.com/linux/latest/source/net/llc/af_llc.c#L145 https://elixir.bootlin.com/linux/latest/source/net/smc/af_smc.c#L105 They later created in net/core/sock.c without ctor. I guess it would be useful to have such extensive comment for each SLAB_TYPESAFE_BY_RCU use explaining why it is needed and how all the tricky aspects are handled. For example, the one in jbd2 is interesting because it memsets the whole object before freeing it into SLAB_TYPESAFE_BY_RCU slab: memset(jh, JBD2_POISON_FREE, sizeof(*jh)); kmem_cache_free(jbd2_journal_head_cache, jh); I guess there are also tricky ways how it can all work in the end (type-stable state is only a byte, or we check for all possible combinations of being overwritten with JBD2_POISON_FREE). But at first sight it does look fishy. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 8:46 ` Dmitry Vyukov @ 2018-08-01 9:10 ` Dmitry Vyukov 2018-08-01 10:35 ` Florian Westphal 2018-08-06 20:20 ` Jan Kara 1 sibling, 1 reply; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 9:10 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 10:46 AM, Dmitry Vyukov <dvyukov@google.com> wrote: > On Tue, Jul 31, 2018 at 8:51 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> >>> So the re-use might initialize the fields lazily, not necessarily using a ctor. >> >> In particular, the pattern that nf_conntrack uses looks like it is safe. >> >> If you have a well-defined refcount, and use "atomic_inc_not_zero()" >> to guard the speculative RCU access section, and use >> "atomic_dec_and_test()" in the freeing section, then you should be >> safe wrt new allocations. >> >> If you have a completely new allocation that has "random stale >> content", you know that it cannot be on the RCU list, so there is no >> speculative access that can ever see that random content. >> >> So the only case you need to worry about is a re-use allocation, and >> you know that the refcount will start out as zero even if you don't >> have a constructor. >> >> So you can think of the refcount itself as always having a zero >> constructor, *BUT* you need to be careful with ordering. >> >> In particular, whoever does the allocation needs to then set the >> refcount to a non-zero value *after* it has initialized all the other >> fields. And in particular, it needs to make sure that it uses the >> proper memory ordering to do so. >> >> And in this case, we have >> >> static struct nf_conn * >> __nf_conntrack_alloc(struct net *net, >> { >> ... >> atomic_set(&ct->ct_general.use, 0); >> >> which is a no-op for the re-use case (whether racing or not, since any >> "inc_not_zero" users won't touch it), but initializes it to zero for >> the "completely new object" case. >> >> And then, the thing that actually exposes it to the speculative walkers does: >> >> int >> nf_conntrack_hash_check_insert(struct nf_conn *ct) >> { >> ... >> smp_wmb(); >> /* The caller holds a reference to this object */ >> atomic_set(&ct->ct_general.use, 2); >> >> which means that it stays as zero until everything is actually set up, >> and then the optimistic walker can use the other fields (including >> spinlocks etc) to verify that it's actually the right thing. The >> smp_wmb() means that the previous initialization really will be >> visible before the object is visible. >> >> Side note: on some architectures it might help to make that "smp_wmb >> -> atomic_set()" sequence be am "smp_store_release()" instead. Doesn't >> matter on x86, but might matter on arm64. >> >> NOTE! One thing to be very worried about is that re-initializing >> whatever RCU lists means that now the RCU walker may be walking on the >> wrong list so the walker may do the right thing for this particular >> entry, but it may miss walking *other* entries. So then you can get >> spurious lookup failures, because the RCU walker never walked all the >> way to the end of the right list. That ends up being a much more >> subtle bug. >> >> But the nf_conntrack case seems to get that right too, see the restart >> in ____nf_conntrack_find(). >> >> So I don't see anything wrong in nf_conntrack. >> >> But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of >> the subtleties have nothing to do with having a constructor, they are >> about those "make sure memory ordering wrt refcount is right" and >> "restart speculative RCU walk" issues that actually happen regardless >> of having a constructor or not. > > > Thank you, this is very enlightening. > > So the type-stable invariant is established by initialization of the > object after the first kmem_cache_alloc, and then we rely on the fact > that repeated initialization does not break the invariant, which works > because the state is very simple (including debug builds, i.e. no > ODEBUG nor magic values). Still can't grasp all details. There is state that we read without taking ct->ct_general.use ref first, namely ct->state and what's used by nf_ct_key_equal. So let's say the entry we want to find is in the list, but ____nf_conntrack_find finds a wrong entry earlier because all state it looks at is random garbage, so it returns the wrong entry to __nf_conntrack_find_get. Now (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use)) check in __nf_conntrack_find_get passes, and it returns NULL to the caller (which means entry is not present). While in reality the entry is present, but we were just looking at the wrong one. Also I am not sure about order of checks in (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use)), because checking state before taking the ref is only a best-effort hint, so it can actually be a dying entry when we take a ref. So shouldn't it read something like the following? rcu_read_lock(); begin: h = ____nf_conntrack_find(net, zone, tuple, hash); if (h) { ct = nf_ct_tuplehash_to_ctrack(h); if (!atomic_inc_not_zero(&ct->ct_general.use)) goto begin; if (unlikely(nf_ct_is_dying(ct)) || unlikely(!nf_ct_key_equal(h, tuple, zone, net))) { nf_ct_put(ct); goto begin; } } rcu_read_unlock(); return h; ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 9:10 ` Dmitry Vyukov @ 2018-08-01 10:35 ` Florian Westphal 2018-08-01 10:41 ` Dmitry Vyukov 0 siblings, 1 reply; 31+ messages in thread From: Florian Westphal @ 2018-08-01 10:35 UTC (permalink / raw) To: Dmitry Vyukov Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov Dmitry Vyukov <dvyukov@google.com> wrote: > Still can't grasp all details. > There is state that we read without taking ct->ct_general.use ref > first, namely ct->state and what's used by nf_ct_key_equal. > So let's say the entry we want to find is in the list, but > ____nf_conntrack_find finds a wrong entry earlier because all state it > looks at is random garbage, so it returns the wrong entry to > __nf_conntrack_find_get. If an entry can be found, it can't be random garbage. We never link entries into global table until state has been set up. > Now (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use)) > check in __nf_conntrack_find_get passes, and it returns NULL to the > caller (which means entry is not present). So entry is going away or marked as dead which for us is same as 'not present', we need to allocate a new entry. > While in reality the entry > is present, but we were just looking at the wrong one. We never add tuples that are identical to the global table. If N cores receive identical packets at same time with no prior state, all will allocate a new conntrack, but we notice this when we try to insert the nf_conn entries into the table. Only one will succeed, other cpus have to cope with this. (worst case: all raced packets are dropped along with their conntrack object). For lookup, we have following scenarios: 1. It doesn't exist -> new allocation needed 2. It exists, not dead, has nonzero refount -> use it 3. It exists, but marked as dying -> new allocation needed 4. It exists but has 0 reference count -> new allocation needed 5. It exists, we get reference, but 2nd nf_ct_key_equal check fails. We saw a matching 'old incarnation' that just got re-used on other core. -> retry lookup > Also I am not sure about order of checks in (nf_ct_is_dying(ct) || > !atomic_inc_not_zero(&ct->ct_general.use)), because checking state > before taking the ref is only a best-effort hint, so it can actually > be a dying entry when we take a ref. Yes, it can also become a dying entry after we took the reference. > So shouldn't it read something like the following? > > rcu_read_lock(); > begin: > h = ____nf_conntrack_find(net, zone, tuple, hash); > if (h) { > ct = nf_ct_tuplehash_to_ctrack(h); > if (!atomic_inc_not_zero(&ct->ct_general.use)) > goto begin; > if (unlikely(nf_ct_is_dying(ct)) || > unlikely(!nf_ct_key_equal(h, tuple, zone, net))) { > nf_ct_put(ct); It would be ok to make this change, but dying bit can be set at any time e.g. because userspace tells kernel to flush the conntrack table. So refcount is always > 0 when the DYING bit is set. I don't see why it would be a problem. nf_conn struct will stay valid until all cpus have dropped references. The check in lookup function only serves to hide the known-to-go-away entry. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 10:35 ` Florian Westphal @ 2018-08-01 10:41 ` Dmitry Vyukov 2018-08-01 11:40 ` Florian Westphal 0 siblings, 1 reply; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 10:41 UTC (permalink / raw) To: Florian Westphal Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 12:35 PM, Florian Westphal <fw@strlen.de> wrote: > Dmitry Vyukov <dvyukov@google.com> wrote: >> Still can't grasp all details. >> There is state that we read without taking ct->ct_general.use ref >> first, namely ct->state and what's used by nf_ct_key_equal. >> So let's say the entry we want to find is in the list, but >> ____nf_conntrack_find finds a wrong entry earlier because all state it >> looks at is random garbage, so it returns the wrong entry to >> __nf_conntrack_find_get. > > If an entry can be found, it can't be random garbage. > We never link entries into global table until state has been set up. But... we don't hold a reference to the entry. So say it's in the table with valid state, now ____nf_conntrack_find discovers it, now the entry is removed and reused a dozen of times will all associated state reinitialization. And nf_ct_key_equal looks at it concurrently and decides if it's the entry we are looking for or now. I think unless we hold a ref to the entry, it's state needs to be considered random garbage for correctness reasoning. >> Now (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use)) >> check in __nf_conntrack_find_get passes, and it returns NULL to the >> caller (which means entry is not present). > > So entry is going away or marked as dead which for us is same as > 'not present', we need to allocate a new entry. > >> While in reality the entry >> is present, but we were just looking at the wrong one. > > We never add tuples that are identical to the global table. > > If N cores receive identical packets at same time with no prior state, all > will allocate a new conntrack, but we notice this when we try to insert the > nf_conn entries into the table. > > Only one will succeed, other cpus have to cope with this. > (worst case: all raced packets are dropped along with their conntrack > object). > > For lookup, we have following scenarios: > > 1. It doesn't exist -> new allocation needed > 2. It exists, not dead, has nonzero refount -> use it > 3. It exists, but marked as dying -> new allocation needed > 4. It exists but has 0 reference count -> new allocation needed > 5. It exists, we get reference, but 2nd nf_ct_key_equal check > fails. We saw a matching 'old incarnation' that just got > re-used on other core. -> retry lookup > >> Also I am not sure about order of checks in (nf_ct_is_dying(ct) || >> !atomic_inc_not_zero(&ct->ct_general.use)), because checking state >> before taking the ref is only a best-effort hint, so it can actually >> be a dying entry when we take a ref. > > Yes, it can also become a dying entry after we took the reference. > >> So shouldn't it read something like the following? >> >> rcu_read_lock(); >> begin: >> h = ____nf_conntrack_find(net, zone, tuple, hash); >> if (h) { >> ct = nf_ct_tuplehash_to_ctrack(h); >> if (!atomic_inc_not_zero(&ct->ct_general.use)) >> goto begin; >> if (unlikely(nf_ct_is_dying(ct)) || >> unlikely(!nf_ct_key_equal(h, tuple, zone, net))) { >> nf_ct_put(ct); > > It would be ok to make this change, but dying bit can be set > at any time e.g. because userspace tells kernel to flush the conntrack table. > So refcount is always > 0 when the DYING bit is set. > > I don't see why it would be a problem. > > nf_conn struct will stay valid until all cpus have dropped references. > The check in lookup function only serves to hide the known-to-go-away entry. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 10:41 ` Dmitry Vyukov @ 2018-08-01 11:40 ` Florian Westphal 2018-08-01 12:38 ` Dmitry Vyukov 0 siblings, 1 reply; 31+ messages in thread From: Florian Westphal @ 2018-08-01 11:40 UTC (permalink / raw) To: Dmitry Vyukov Cc: Florian Westphal, Linus Torvalds, Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov Dmitry Vyukov <dvyukov@google.com> wrote: > On Wed, Aug 1, 2018 at 12:35 PM, Florian Westphal <fw@strlen.de> wrote: > > Dmitry Vyukov <dvyukov@google.com> wrote: > >> Still can't grasp all details. > >> There is state that we read without taking ct->ct_general.use ref > >> first, namely ct->state and what's used by nf_ct_key_equal. > >> So let's say the entry we want to find is in the list, but > >> ____nf_conntrack_find finds a wrong entry earlier because all state it > >> looks at is random garbage, so it returns the wrong entry to > >> __nf_conntrack_find_get. > > > > If an entry can be found, it can't be random garbage. > > We never link entries into global table until state has been set up. > > But... we don't hold a reference to the entry. So say it's in the > table with valid state, now ____nf_conntrack_find discovers it, now > the entry is removed and reused a dozen of times will all associated > state reinitialization. And nf_ct_key_equal looks at it concurrently > and decides if it's the entry we are looking for or now. I think > unless we hold a ref to the entry, it's state needs to be considered > random garbage for correctness reasoning. It checks if it might be the entry we're looking for. If this was complete random garbage, scheme would not work, as then we could have entry that isn't in table, has nonzero refcount, but has its confirmed bit set. I don't see how that would be possible, any reallocation makes sure ct->status has CONFIRMED bit clear before setting refcount to nonzero value. I think this is the scenario you hint at is: 1. nf_ct_key_equal is true 2. the entry is free'd (or was already free'd) 3. we return NULL to caller due to atomic_inc_not_zero() failure but i fail to see how thats wrong? Sure, we could restart lookup but how would that help? We'd not find the 'candidate' entry again. We might find entry that has been inserted at this very instant but newly allocated entries are only inserted into global table until the skb that created the nf_conn object has made it through the network stack (postrouting for fowarded, or input for local delivery). So, the likelyhood of such restart finding another candidate is close to 0, and won't prevent 'insert race' from happening. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 11:40 ` Florian Westphal @ 2018-08-01 12:38 ` Dmitry Vyukov 2018-08-01 13:46 ` Florian Westphal 0 siblings, 1 reply; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 12:38 UTC (permalink / raw) To: Florian Westphal Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 1:40 PM, Florian Westphal <fw@strlen.de> wrote: > Dmitry Vyukov <dvyukov@google.com> wrote: >> On Wed, Aug 1, 2018 at 12:35 PM, Florian Westphal <fw@strlen.de> wrote: >> > Dmitry Vyukov <dvyukov@google.com> wrote: >> >> Still can't grasp all details. >> >> There is state that we read without taking ct->ct_general.use ref >> >> first, namely ct->state and what's used by nf_ct_key_equal. >> >> So let's say the entry we want to find is in the list, but >> >> ____nf_conntrack_find finds a wrong entry earlier because all state it >> >> looks at is random garbage, so it returns the wrong entry to >> >> __nf_conntrack_find_get. >> > >> > If an entry can be found, it can't be random garbage. >> > We never link entries into global table until state has been set up. >> >> But... we don't hold a reference to the entry. So say it's in the >> table with valid state, now ____nf_conntrack_find discovers it, now >> the entry is removed and reused a dozen of times will all associated >> state reinitialization. And nf_ct_key_equal looks at it concurrently >> and decides if it's the entry we are looking for or now. I think >> unless we hold a ref to the entry, it's state needs to be considered >> random garbage for correctness reasoning. > > It checks if it might be the entry we're looking for. > > If this was complete random garbage, scheme would not work, as then > we could have entry that isn't in table, has nonzero refcount, but > has its confirmed bit set. > > I don't see how that would be possible, any reallocation > makes sure ct->status has CONFIRMED bit clear before setting refcount > to nonzero value. > > I think this is the scenario you hint at is: > 1. nf_ct_key_equal is true > 2. the entry is free'd (or was already free'd) > 3. we return NULL to caller due to atomic_inc_not_zero() failure > > but i fail to see how thats wrong? > > Sure, we could restart lookup but how would that help? > > We'd not find the 'candidate' entry again. > > We might find entry that has been inserted at this very instant but > newly allocated entries are only inserted into global table until the skb that > created the nf_conn object has made it through the network stack > (postrouting for fowarded, or input for local delivery). > > So, the likelyhood of such restart finding another candidate is close to 0, > and won't prevent 'insert race' from happening. The scenario I have in mind is different and it relates to the fact that ____nf_conntrack_find will return the right entry if it's present, but it can also return an unrelated entry because when it looks at entries they change underneath. Let's take any 2 fields compared by nf_ct_key_equal for simplicity, for example, src.u3 and dst.u3. Let's say we are looking for an entry with src.u3=A and dst.u3=B, let's call it (A,B). Let's say there is an existing entry 1 (A,B) in the global list. But there is also entry 2 (A,C) earlier in the list. Now, ____nf_conntrack_find starts checking entry 2 (A,C), it checks that src.u3==A, so so far it looks good. Now another thread deletes, reuses and reinitilizes entry 2 for (C,B). Now, ____nf_conntrack_find checks that dst.u3==B, so it concludes that it's the right entry, because it observed (A,B). Entry 2 is returned to __nf_conntrack_find_get. Now another thread marks entry 2 as dying. Now __nf_conntrack_find_get sees that it's dying and returns NULL to caller, _but_ the matching entry 1 (A,B) was in the list all that time and we should have been discovered it, but we didn't because we were deraield by the wrong entry 2. If that scenario is possible that a fix would be to make __nf_conntrack_find_get ever return NULL iff it got NULL from ____nf_conntrack_find (not if any of the checks has failed). ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 12:38 ` Dmitry Vyukov @ 2018-08-01 13:46 ` Florian Westphal 2018-08-01 13:52 ` Dmitry Vyukov 0 siblings, 1 reply; 31+ messages in thread From: Florian Westphal @ 2018-08-01 13:46 UTC (permalink / raw) To: Dmitry Vyukov Cc: Florian Westphal, Linus Torvalds, Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov Dmitry Vyukov <dvyukov@google.com> wrote: > If that scenario is possible that a fix would be to make Looks possible. > __nf_conntrack_find_get ever return NULL iff it got NULL from > ____nf_conntrack_find (not if any of the checks has failed). I don't see why we need to restart on nf_ct_is_dying(), but other than that this seems ok. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 13:46 ` Florian Westphal @ 2018-08-01 13:52 ` Dmitry Vyukov 0 siblings, 0 replies; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 13:52 UTC (permalink / raw) To: Florian Westphal Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 3:46 PM, Florian Westphal <fw@strlen.de> wrote: > Dmitry Vyukov <dvyukov@google.com> wrote: >> If that scenario is possible that a fix would be to make > > Looks possible. > >> __nf_conntrack_find_get ever return NULL iff it got NULL from >> ____nf_conntrack_find (not if any of the checks has failed). > > I don't see why we need to restart on nf_ct_is_dying(), but other > than that this seems ok. Because it can be a wrong entry dying. When we check dying, we don't yet know if we are looking at the right entry or not. If we successfully acquire a reference, then recheck nf_ct_key_equal and _then_ check dying, then we don't need to restart on dying. But with the current check order, we need to restart on dying too. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 8:46 ` Dmitry Vyukov 2018-08-01 9:10 ` Dmitry Vyukov @ 2018-08-06 20:20 ` Jan Kara 1 sibling, 0 replies; 31+ messages in thread From: Jan Kara @ 2018-08-06 20:20 UTC (permalink / raw) To: Dmitry Vyukov Cc: Linus Torvalds, Christoph Lameter, Andrey Ryabinin, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Wed 01-08-18 10:46:35, Dmitry Vyukov wrote: > I guess it would be useful to have such extensive comment for each > SLAB_TYPESAFE_BY_RCU use explaining why it is needed and how all the > tricky aspects are handled. > > For example, the one in jbd2 is interesting because it memsets the > whole object before freeing it into SLAB_TYPESAFE_BY_RCU slab: > > memset(jh, JBD2_POISON_FREE, sizeof(*jh)); > kmem_cache_free(jbd2_journal_head_cache, jh); > > I guess there are also tricky ways how it can all work in the end > (type-stable state is only a byte, or we check for all possible > combinations of being overwritten with JBD2_POISON_FREE). But at first > sight it does look fishy. The RCU access is used from a single place: fs/jbd2/transaction.c: jbd2_write_access_granted() There are also quite some comments explaining why what it does is safe. The overwrite by JBD2_POISON_FREE is much older than this RCU stuff (honestly I didn't know about it until this moment) and has nothing to do with the safety of RCU access. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-07-31 18:51 ` Linus Torvalds 2018-08-01 8:46 ` Dmitry Vyukov @ 2018-08-01 9:03 ` Andrey Ryabinin 2018-08-01 10:23 ` Eric Dumazet 1 sibling, 1 reply; 31+ messages in thread From: Andrey Ryabinin @ 2018-08-01 9:03 UTC (permalink / raw) To: Linus Torvalds, Christoph Lameter Cc: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, gerrit, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Dmitry Vyukov, Andrew Morton, linux-mm, Andrey Konovalov On 07/31/2018 09:51 PM, Linus Torvalds wrote: > On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> So the re-use might initialize the fields lazily, not necessarily using a ctor. > > In particular, the pattern that nf_conntrack uses looks like it is safe. > > If you have a well-defined refcount, and use "atomic_inc_not_zero()" > to guard the speculative RCU access section, and use > "atomic_dec_and_test()" in the freeing section, then you should be > safe wrt new allocations. > > If you have a completely new allocation that has "random stale > content", you know that it cannot be on the RCU list, so there is no > speculative access that can ever see that random content. > > So the only case you need to worry about is a re-use allocation, and > you know that the refcount will start out as zero even if you don't > have a constructor. > > So you can think of the refcount itself as always having a zero > constructor, *BUT* you need to be careful with ordering. > > In particular, whoever does the allocation needs to then set the > refcount to a non-zero value *after* it has initialized all the other > fields. And in particular, it needs to make sure that it uses the > proper memory ordering to do so. > > And in this case, we have > > static struct nf_conn * > __nf_conntrack_alloc(struct net *net, > { > ... > atomic_set(&ct->ct_general.use, 0); > > which is a no-op for the re-use case (whether racing or not, since any > "inc_not_zero" users won't touch it), but initializes it to zero for > the "completely new object" case. > > And then, the thing that actually exposes it to the speculative walkers does: > > int > nf_conntrack_hash_check_insert(struct nf_conn *ct) > { > ... > smp_wmb(); > /* The caller holds a reference to this object */ > atomic_set(&ct->ct_general.use, 2); > > which means that it stays as zero until everything is actually set up, > and then the optimistic walker can use the other fields (including > spinlocks etc) to verify that it's actually the right thing. The > smp_wmb() means that the previous initialization really will be > visible before the object is visible. > > Side note: on some architectures it might help to make that "smp_wmb > -> atomic_set()" sequence be am "smp_store_release()" instead. Doesn't > matter on x86, but might matter on arm64. > > NOTE! One thing to be very worried about is that re-initializing > whatever RCU lists means that now the RCU walker may be walking on the > wrong list so the walker may do the right thing for this particular > entry, but it may miss walking *other* entries. So then you can get > spurious lookup failures, because the RCU walker never walked all the > way to the end of the right list. That ends up being a much more > subtle bug. > > But the nf_conntrack case seems to get that right too, see the restart > in ____nf_conntrack_find(). > > So I don't see anything wrong in nf_conntrack. > > But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of > the subtleties have nothing to do with having a constructor, they are > about those "make sure memory ordering wrt refcount is right" and > "restart speculative RCU walk" issues that actually happen regardless > of having a constructor or not. > I see, thanks. I just don't see any point or advantage of *not* using the constructor with SLAB_TYPESAFE_BY_RCU caches. There is always must be a small part of initialization code that could be placed in constructor. And it's better to put that part into constructor to avoid initializing what's already has been initialized. If people use SLAB_TYPESAFE_BY_RCU instead of traditional kfree_rcu() for cache-efficiency and because they want to reuse the object faster, than avoiding few writes into cache-hot data doesn't look like a bad idea. E.g. nf_conntrack case could at least move the spin_lock_init() and atomic_set(&ct->ct_general.use, 0) in the constructor. I can't think of any advantage in not having the constructor. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 9:03 ` Andrey Ryabinin @ 2018-08-01 10:23 ` Eric Dumazet 2018-08-01 10:34 ` Dmitry Vyukov 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2018-08-01 10:23 UTC (permalink / raw) To: Andrey Ryabinin, Linus Torvalds, Christoph Lameter Cc: Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, gerrit, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Dmitry Vyukov, Andrew Morton, linux-mm, Andrey Konovalov On 08/01/2018 02:03 AM, Andrey Ryabinin wrote: > I can't think of any advantage in not having the constructor. > I can't see any advantage adding another indirect call, in RETPOLINE world. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 10:23 ` Eric Dumazet @ 2018-08-01 10:34 ` Dmitry Vyukov 2018-08-01 11:28 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 10:34 UTC (permalink / raw) To: Eric Dumazet Cc: Andrey Ryabinin, Linus Torvalds, Christoph Lameter, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On 08/01/2018 02:03 AM, Andrey Ryabinin wrote: > >> I can't think of any advantage in not having the constructor. > > I can't see any advantage adding another indirect call, > in RETPOLINE world. Can you please elaborate what's the problem here? If slab ctor call have RETPOLINE, then using ctors more does not introduce any security problems and they are not _that_ slow. If slab ctors are not protected, then we have problem that needs to be fixed already. What am I missing? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 10:34 ` Dmitry Vyukov @ 2018-08-01 11:28 ` Eric Dumazet 2018-08-01 11:35 ` Dmitry Vyukov 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2018-08-01 11:28 UTC (permalink / raw) To: Dmitry Vyukov, Eric Dumazet Cc: Andrey Ryabinin, Linus Torvalds, Christoph Lameter, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On 08/01/2018 03:34 AM, Dmitry Vyukov wrote: > On Wed, Aug 1, 2018 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> On 08/01/2018 02:03 AM, Andrey Ryabinin wrote: >> >>> I can't think of any advantage in not having the constructor. >> >> I can't see any advantage adding another indirect call, >> in RETPOLINE world. > > Can you please elaborate what's the problem here? > If slab ctor call have RETPOLINE, then using ctors more does not > introduce any security problems and they are not _that_ slow. They _are_ slow, when we have dozens of them in a code path. I object "having to add" yet another indirect call, if this can be avoided [*] If some people want to use ctor, fine, but do not request this. [*] This can be tricky, but worth the pain. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 11:28 ` Eric Dumazet @ 2018-08-01 11:35 ` Dmitry Vyukov 2018-08-01 15:15 ` Christopher Lameter 0 siblings, 1 reply; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 11:35 UTC (permalink / raw) To: Eric Dumazet Cc: Andrey Ryabinin, Linus Torvalds, Christoph Lameter, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 1:28 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On 08/01/2018 03:34 AM, Dmitry Vyukov wrote: >> On Wed, Aug 1, 2018 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >>> On 08/01/2018 02:03 AM, Andrey Ryabinin wrote: >>> >>>> I can't think of any advantage in not having the constructor. >>> >>> I can't see any advantage adding another indirect call, >>> in RETPOLINE world. >> >> Can you please elaborate what's the problem here? >> If slab ctor call have RETPOLINE, then using ctors more does not >> introduce any security problems and they are not _that_ slow. > > They _are_ slow, when we have dozens of them in a code path. > > I object "having to add" yet another indirect call, if this can be avoided [*] > > If some people want to use ctor, fine, but do not request this. > > [*] This can be tricky, but worth the pain. But we are trading 1 indirect call for comparable overhead removed from much more common path. The path that does ctors is also calling into page alloc, which is much more expensive. So ctor should be a net win on performance front, no? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 11:35 ` Dmitry Vyukov @ 2018-08-01 15:15 ` Christopher Lameter 2018-08-01 15:37 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Christopher Lameter @ 2018-08-01 15:15 UTC (permalink / raw) To: Dmitry Vyukov Cc: Eric Dumazet, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, Network Development, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Dave Airlie, intel-gfx, DRI, Eric Dumazet, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, Linux Kernel Mailing List, Andrew Morton, linux-mm, Andrey Konovalov On Wed, 1 Aug 2018, Dmitry Vyukov wrote: > But we are trading 1 indirect call for comparable overhead removed > from much more common path. The path that does ctors is also calling > into page alloc, which is much more expensive. > So ctor should be a net win on performance front, no? ctor would make it esier to review the flow and guarantee that the object always has certain fields set as required before any use by the subsystem. ctors are run once on allocation of the slab page for all objects in it. ctors are not called duiring allocation and freeing of objects from the slab page. So we could avoid the intialization of the spinlock on each object allocation which actually should be faster. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 15:15 ` Christopher Lameter @ 2018-08-01 15:37 ` Eric Dumazet 2018-08-01 15:51 ` Misuse of constructors Matthew Wilcox ` (2 more replies) 0 siblings, 3 replies; 31+ messages in thread From: Eric Dumazet @ 2018-08-01 15:37 UTC (permalink / raw) To: Christoph Lameter Cc: Dmitry Vyukov, Eric Dumazet, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 8:15 AM Christopher Lameter <cl@linux.com> wrote: > > On Wed, 1 Aug 2018, Dmitry Vyukov wrote: > > > But we are trading 1 indirect call for comparable overhead removed > > from much more common path. The path that does ctors is also calling > > into page alloc, which is much more expensive. > > So ctor should be a net win on performance front, no? > > ctor would make it esier to review the flow and guarantee that the object > always has certain fields set as required before any use by the subsystem. > > ctors are run once on allocation of the slab page for all objects in it. > > ctors are not called duiring allocation and freeing of objects from the > slab page. So we could avoid the intialization of the spinlock on each > object allocation which actually should be faster. This strategy might have been a win 30 years ago when cpu had no caches (or too small anyway) What probability is that the 60 bytes around the spinlock are not touched after the object is freshly allocated ? -> None Writing 60 bytes in one cache line instead of 64 has really the same cost. The cache line miss is the real killer. Feel free to write the patches, test them, but I doubt you will have any gain. Remember btw that TCP sockets can be either completely fresh (socket() call, using memset() to clear the whole object), or clones (accept() thus copying the parent socket) The idea of having a ctor() would only be a win if all the fields that can be initialized in the ctor are contiguous and fill an integral number of cache lines. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Misuse of constructors 2018-08-01 15:37 ` Eric Dumazet @ 2018-08-01 15:51 ` Matthew Wilcox 2018-08-01 15:53 ` SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Dmitry Vyukov 2018-08-01 16:22 ` Christopher Lameter 2 siblings, 0 replies; 31+ messages in thread From: Matthew Wilcox @ 2018-08-01 15:51 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Lameter, Dmitry Vyukov, Eric Dumazet, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov, linux-fsdevel On Wed, Aug 01, 2018 at 08:37:07AM -0700, Eric Dumazet wrote: > The idea of having a ctor() would only be a win if all the fields that > can be initialized in the ctor are contiguous and fill an integral > number of cache lines. Let's state it more generally: Having a ctor is only a win if it allows the user to avoid reading or writing a significant number of cachelines. For example, the radix tree node occupies nine cachelines (576 bytes). The typical user will touch two of those cacheliens (the header and the cacheline which contains the slot of interest). That's seven cachelines which don't even need to be read, let alone written. I think filesystems are particularly prone to this antipattern of initialising some of their inode with a ctor and the remainder at alloc_inode() time (compare the locations of the struct members touched by inode_init_always() and inode_init_once()). ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 15:37 ` Eric Dumazet 2018-08-01 15:51 ` Misuse of constructors Matthew Wilcox @ 2018-08-01 15:53 ` Dmitry Vyukov 2018-08-01 16:22 ` Christopher Lameter 2 siblings, 0 replies; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 15:53 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Lameter, Eric Dumazet, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, netdev, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, David Airlie, intel-gfx, DRI, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 5:37 PM, Eric Dumazet <edumazet@google.com> wrote: > On Wed, Aug 1, 2018 at 8:15 AM Christopher Lameter <cl@linux.com> wrote: >> >> On Wed, 1 Aug 2018, Dmitry Vyukov wrote: >> >> > But we are trading 1 indirect call for comparable overhead removed >> > from much more common path. The path that does ctors is also calling >> > into page alloc, which is much more expensive. >> > So ctor should be a net win on performance front, no? >> >> ctor would make it esier to review the flow and guarantee that the object >> always has certain fields set as required before any use by the subsystem. >> >> ctors are run once on allocation of the slab page for all objects in it. >> >> ctors are not called duiring allocation and freeing of objects from the >> slab page. So we could avoid the intialization of the spinlock on each >> object allocation which actually should be faster. > > > This strategy might have been a win 30 years ago when cpu had no > caches (or too small anyway) > > What probability is that the 60 bytes around the spinlock are not > touched after the object is freshly allocated ? > > -> None > > Writing 60 bytes in one cache line instead of 64 has really the same > cost. The cache line miss is the real killer. > > Feel free to write the patches, test them, but I doubt you will have any gain. > > Remember btw that TCP sockets can be either completely fresh > (socket() call, using memset() to clear the whole object), > or clones (accept() thus copying the parent socket) > > The idea of having a ctor() would only be a win if all the fields that > can be initialized in the ctor are contiguous and fill an integral > number of cache lines. Code size can have some visible performance impact too. But either way, what you say only means that ctors are not necessary significantly faster. But your original point was that they are slower. If they are not slower, then what Andrey said seems to make sense: some gain on code comprehension front re type-stability invariant + some gain on performance front (even if not too big) and no downsides. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 15:37 ` Eric Dumazet 2018-08-01 15:51 ` Misuse of constructors Matthew Wilcox 2018-08-01 15:53 ` SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Dmitry Vyukov @ 2018-08-01 16:22 ` Christopher Lameter 2018-08-01 16:25 ` Eric Dumazet 2 siblings, 1 reply; 31+ messages in thread From: Christopher Lameter @ 2018-08-01 16:22 UTC (permalink / raw) To: Eric Dumazet Cc: Dmitry Vyukov, Eric Dumazet, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov On Wed, 1 Aug 2018, Eric Dumazet wrote: > The idea of having a ctor() would only be a win if all the fields that > can be initialized in the ctor are contiguous and fill an integral > number of cache lines. Ok. Its reducing code size and makes the object status more consistent. Isn't that enough? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 16:22 ` Christopher Lameter @ 2018-08-01 16:25 ` Eric Dumazet 2018-08-01 16:47 ` Dmitry Vyukov 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2018-08-01 16:25 UTC (permalink / raw) To: Christopher Lameter, Eric Dumazet Cc: Dmitry Vyukov, Eric Dumazet, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov On 08/01/2018 09:22 AM, Christopher Lameter wrote: > On Wed, 1 Aug 2018, Eric Dumazet wrote: > >> The idea of having a ctor() would only be a win if all the fields that >> can be initialized in the ctor are contiguous and fill an integral >> number of cache lines. > > Ok. Its reducing code size and makes the object status more consistent. > Isn't that enough? > Prove it ;) I yet have to seen actual numbers. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 16:25 ` Eric Dumazet @ 2018-08-01 16:47 ` Dmitry Vyukov 2018-08-01 17:18 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Dmitry Vyukov @ 2018-08-01 16:47 UTC (permalink / raw) To: Eric Dumazet Cc: Christopher Lameter, Eric Dumazet, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o, Jan Kara, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, NetFilter, coreteam, netdev, Gerrit Renker, dccp, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, David Airlie, intel-gfx, DRI, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On 08/01/2018 09:22 AM, Christopher Lameter wrote: >> On Wed, 1 Aug 2018, Eric Dumazet wrote: >> >>> The idea of having a ctor() would only be a win if all the fields that >>> can be initialized in the ctor are contiguous and fill an integral >>> number of cache lines. >> >> Ok. Its reducing code size and makes the object status more consistent. >> Isn't that enough? >> > > Prove it ;) > > I yet have to seen actual numbers. Proving with numbers is required for a claimed performance improvement at the cost of code degradation/increase. For a win-win change there is really nothing to prove. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) 2018-08-01 16:47 ` Dmitry Vyukov @ 2018-08-01 17:18 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2018-08-01 17:18 UTC (permalink / raw) To: Dmitry Vyukov Cc: Eric Dumazet, Christoph Lameter, Andrey Ryabinin, Linus Torvalds, Theodore Ts'o, jack, linux-ext4, Greg Kroah-Hartman, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, David Miller, netfilter-devel, coreteam, netdev, Gerrit Renker, dccp, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, intel-gfx, dri-devel, Alexey Kuznetsov, Hideaki YOSHIFUJI, Ursula Braun, linux-s390, LKML, Andrew Morton, linux-mm, Andrey Konovalov On Wed, Aug 1, 2018 at 9:47 AM Dmitry Vyukov <dvyukov@google.com> wrote: > > Proving with numbers is required for a claimed performance improvement > at the cost of code degradation/increase. For a win-win change there > is really nothing to prove. You have to _prove_ it is a win-win. It is not sufficient to claim it is a win-win. Sorry, but I do have bugs to take care of. ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2018-08-06 20:20 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-07-31 17:01 SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Andrey Ryabinin 2018-07-31 17:09 ` Florian Westphal 2018-07-31 17:32 ` Eric Dumazet 2018-07-31 17:36 ` Christopher Lameter 2018-07-31 17:41 ` Eric Dumazet 2018-07-31 17:51 ` Dmitry Vyukov 2018-07-31 18:16 ` Eric Dumazet 2018-07-31 17:49 ` Linus Torvalds 2018-07-31 18:51 ` Linus Torvalds 2018-08-01 8:46 ` Dmitry Vyukov 2018-08-01 9:10 ` Dmitry Vyukov 2018-08-01 10:35 ` Florian Westphal 2018-08-01 10:41 ` Dmitry Vyukov 2018-08-01 11:40 ` Florian Westphal 2018-08-01 12:38 ` Dmitry Vyukov 2018-08-01 13:46 ` Florian Westphal 2018-08-01 13:52 ` Dmitry Vyukov 2018-08-06 20:20 ` Jan Kara 2018-08-01 9:03 ` Andrey Ryabinin 2018-08-01 10:23 ` Eric Dumazet 2018-08-01 10:34 ` Dmitry Vyukov 2018-08-01 11:28 ` Eric Dumazet 2018-08-01 11:35 ` Dmitry Vyukov 2018-08-01 15:15 ` Christopher Lameter 2018-08-01 15:37 ` Eric Dumazet 2018-08-01 15:51 ` Misuse of constructors Matthew Wilcox 2018-08-01 15:53 ` SLAB_TYPESAFE_BY_RCU without constructors (was Re: [PATCH v4 13/17] khwasan: add hooks implementation) Dmitry Vyukov 2018-08-01 16:22 ` Christopher Lameter 2018-08-01 16:25 ` Eric Dumazet 2018-08-01 16:47 ` Dmitry Vyukov 2018-08-01 17:18 ` Eric Dumazet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).