All of lore.kernel.org
 help / color / mirror / Atom feed
* semantics of rhashtable and sysvipc
@ 2018-05-23 17:25 Davidlohr Bueso
  2018-05-23 17:35 ` Davidlohr Bueso
  2018-05-23 18:35 ` Linus Torvalds
  0 siblings, 2 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2018-05-23 17:25 UTC (permalink / raw)
  To: akpm, torvalds; +Cc: manfred, guillaume.knispel, linux-api, linux-kernel

Hi,

In sysvipc we have an ids->tables_initialized regarding the rhashtable,
introduced in 0cfb6aee70b (ipc: optimize semget/shmget/msgget for lots of keys).
It's there, specifically, to prevent nil pointer dereferences, from using an
uninitialized api. Considering how rhashtable_init() can fail (probably due to
ENOMEM, if anything), this made the overall ipc initialization capable of
failure as well. That alone is ugly, but fine, however I've spotted a few
issues regarding the semantics of tables_initialized (however unlikely they
may be):

- There is inconsistency in what we return to userspace: ipc_addid() returns
ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr() returns
EINVAL.

- After we started using rhashtables, ipc_findkey() can return nil upon
!tables_initialized, but the caller expects nil for when the ipc structure
isn't found, and can therefore call into ipcget() callbacks.

I see two possible fixes. The first is to return the proper error code
if !tables_initialized, however, I'm not sure how we want to deal with the
EINVAL cases when rhashtable_init() fails. Userspace has no reason to know
about this. The ENOMEM case I guess makes sense ok.

The second alternative would be to add a BUG_ON() if the initialization fails
and we get rid of all the tables_initialized hack. I know Linus isn't fond
of this, and in the past ipc has gotten rid of BUG_ON usage in ipc because
of this. However I mention it because there are other core areas that do this
(or call panic() straightaway). Ie in networking: sock_diag_init() and
netlink_proto_init().

Thoughts?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-23 17:25 semantics of rhashtable and sysvipc Davidlohr Bueso
@ 2018-05-23 17:35 ` Davidlohr Bueso
  2018-05-23 18:35 ` Linus Torvalds
  1 sibling, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2018-05-23 17:35 UTC (permalink / raw)
  To: akpm, torvalds; +Cc: manfred, guillaume.knispel, linux-api, linux-kernel

On Wed, 23 May 2018, Davidlohr Bueso wrote:
>I see two possible fixes.

I guess a third option would be to make the hashtable static, but I'm not
against using rhashtables so I'm not really considering this.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-23 18:35 ` Linus Torvalds
@ 2018-05-23 18:31   ` Davidlohr Bueso
  2018-05-23 18:52     ` Linus Torvalds
  2018-05-24 17:07   ` Davidlohr Bueso
  1 sibling, 1 reply; 10+ messages in thread
From: Davidlohr Bueso @ 2018-05-23 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul,
	guillaume.knispel, Linux API, Linux Kernel Mailing List

On Wed, 23 May 2018, Linus Torvalds wrote:

>So I'm perfectly fine with getting rid of 'tables_initialized'. But no, not
>with a BUG_ON().
>
>If you cannot guarantee that the allocation works (using __GFP_NOFAIL is
>ok, for example - but it only works with small allocations), then you need
>to handle the allocation failure.

Note that even if the allocation was guaranteed, there are still param validations
and rhashtable_init() can return -EINVAL.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-23 17:25 semantics of rhashtable and sysvipc Davidlohr Bueso
  2018-05-23 17:35 ` Davidlohr Bueso
@ 2018-05-23 18:35 ` Linus Torvalds
  2018-05-23 18:31   ` Davidlohr Bueso
  2018-05-24 17:07   ` Davidlohr Bueso
  1 sibling, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2018-05-23 18:35 UTC (permalink / raw)
  To: Davidlohr Bueso, Thomas Graf, Herbert Xu
  Cc: Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API,
	Linux Kernel Mailing List

On Wed, May 23, 2018 at 10:41 AM Davidlohr Bueso <dave@stgolabs.net> wrote:

> The second alternative would be to add a BUG_ON() if the initialization
fails
> and we get rid of all the tables_initialized hack.

I see absolutely no value in an early boot BUG_ON().

Either we know the allocation cannot fail - which is perfectly fine at
bootup, and is a common pattern - or it can fail and we need to handle it.

In neither case is the BUG_ON() appropriate.

So I'm perfectly fine with getting rid of 'tables_initialized'. But no, not
with a BUG_ON().

If you cannot guarantee that the allocation works (using __GFP_NOFAIL is
ok, for example - but it only works with small allocations), then you need
to handle the allocation failure.

I refuse to see more of the shit-for-brains kind of "I can't be bothered to
handle error cases" BUG_ON() stuff.

And I also am not in the least interested in "this cannot possibly happen"
BUG_ON() code.

One option is to make rhashtable_alloc() shrink the allocation and try
again if it fails, and then you *can* do __GFP_NOFAIL eventually.

In fact, it can validly be argued that rhashtable_init() is just buggy
as-is. The whole *point* olf that function is to size things appropriately,
and returning -ENOMEM obviously means that it didn't do its job.

               Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-23 18:52     ` Linus Torvalds
@ 2018-05-23 18:52       ` Davidlohr Bueso
  0 siblings, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2018-05-23 18:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul,
	guillaume.knispel, Linux API, Linux Kernel Mailing List

On Wed, 23 May 2018, Linus Torvalds wrote:

>On Wed, May 23, 2018 at 11:47 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
>> Note that even if the allocation was guaranteed, there are still param
>validations
>> and rhashtable_init() can return -EINVAL.
>
>So?
>
>It's not going to happen, because you're not going to give garbage
>parameters.

Maybe EINVAL could be replaced with WARN_ON(). That would grab the programmer's
attention.

>
>Why would you add a BUG_ON() for something that cannot happen? You might as
>well sprinkle them randomly in every damn place.

Not suggesting this. Before I started the thread, I was actually thinking of
ipc using ENOMEM only for rhashtable_init() filure considering the EINVAL case
will never happen.

>
>And even if somebody screws up the parameters because they are being
>stupid, then SO WHAT? rhashtable_init() won't initialize the pointers, and
>we'll get a NULL pointer dereference.
>
>And hey, we'll probably get it later during boot, once the system is
>actually up and running, and that NULL pointer dereference might even get
>logged in the system logs now because the machine booted successfully, and
>mnaybe it will even get sent to a distro and debugged.
>
>So at what point was there _any_ advantage in doing a BUG_ON() for a crazy
>case?

For the record, I'm not arguing in favor of BUG_ON().

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-23 18:31   ` Davidlohr Bueso
@ 2018-05-23 18:52     ` Linus Torvalds
  2018-05-23 18:52       ` Davidlohr Bueso
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2018-05-23 18:52 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul,
	guillaume.knispel, Linux API, Linux Kernel Mailing List

On Wed, May 23, 2018 at 11:47 AM Davidlohr Bueso <dave@stgolabs.net> wrote:

> Note that even if the allocation was guaranteed, there are still param
validations
> and rhashtable_init() can return -EINVAL.

So?

It's not going to happen, because you're not going to give garbage
parameters.

Why would you add a BUG_ON() for something that cannot happen? You might as
well sprinkle them randomly in every damn place.

And even if somebody screws up the parameters because they are being
stupid, then SO WHAT? rhashtable_init() won't initialize the pointers, and
we'll get a NULL pointer dereference.

And hey, we'll probably get it later during boot, once the system is
actually up and running, and that NULL pointer dereference might even get
logged in the system logs now because the machine booted successfully, and
mnaybe it will even get sent to a distro and debugged.

So at what point was there _any_ advantage in doing a BUG_ON() for a crazy
case?

Really.

                Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-23 18:35 ` Linus Torvalds
  2018-05-23 18:31   ` Davidlohr Bueso
@ 2018-05-24 17:07   ` Davidlohr Bueso
  2018-05-24 17:53     ` Linus Torvalds
  1 sibling, 1 reply; 10+ messages in thread
From: Davidlohr Bueso @ 2018-05-24 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul,
	guillaume.knispel, Linux API, Linux Kernel Mailing List

On Wed, 23 May 2018, Linus Torvalds wrote:

>One option is to make rhashtable_alloc() shrink the allocation and try
>again if it fails, and then you *can* do __GFP_NOFAIL eventually.

The below attempts to implements this, along with converting the EINVAL cases
to WARN_ON().

I've refactored bucket_table_alloc() to add a 'retry' param which always
ends up doing GFP_KERNEL|__GFP_NOFAIL for both the tbl as well as
alloc_bucket_spinlocks(). I've arbitrarily shrunk the size in half upon
initial allocation failure. So we default from 64 to 32 buckets, and if
the user is hinting at larger nelems, we simply disregard that and retry
with 32. Similarly, smaller values are also cut in half.

In addition, I think we can also get rid of explicitly passing the gfp flags
to bucket_table_alloc() (we only do GFP_KERNEL or GFP_ATOMIC), and leave it
up to __bucket_table_alloc() by adding a new bucket_table_alloc_noblock() or
something for GFP_ATOMIC.

>
>In fact, it can validly be argued that rhashtable_init() is just buggy
>as-is. The whole *point* olf that function is to size things appropriately,
>and returning -ENOMEM obviously means that it didn't do its job.

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9427b5766134..e86a396aebcf 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -166,16 +166,21 @@ static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht,
 	return tbl;
 }
 
-static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
-					       size_t nbuckets,
-					       gfp_t gfp)
+static struct bucket_table *__bucket_table_alloc(struct rhashtable *ht,
+						 size_t nbuckets,
+						 gfp_t gfp, bool retry)
 {
 	struct bucket_table *tbl = NULL;
 	size_t size, max_locks;
 	int i;
 
 	size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
-	if (gfp != GFP_KERNEL)
+	if (retry) {
+		gfp |= __GFP_NOFAIL;
+		tbl = kzalloc(size, gfp);
+	}
+
+	else if (gfp != GFP_KERNEL)
 		tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
 	else
 		tbl = kvzalloc(size, gfp);
@@ -211,6 +216,20 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 	return tbl;
 }
 
+static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
+					       size_t nbuckets,
+					       gfp_t gfp)
+{
+	return __bucket_table_alloc(ht, nbuckets, gfp, false);
+}
+
+static struct bucket_table *bucket_table_alloc_retry(struct rhashtable *ht,
+						     size_t nbuckets,
+						     gfp_t gfp)
+{
+	return __bucket_table_alloc(ht, nbuckets, gfp, true);
+}
+
 static struct bucket_table *rhashtable_last_table(struct rhashtable *ht,
 						  struct bucket_table *tbl)
 {
@@ -1024,12 +1043,11 @@ int rhashtable_init(struct rhashtable *ht,
 
 	size = HASH_DEFAULT_SIZE;
 
-	if ((!params->key_len && !params->obj_hashfn) ||
-	    (params->obj_hashfn && !params->obj_cmpfn))
-		return -EINVAL;
+	WARN_ON((!params->key_len && !params->obj_hashfn) ||
+		(params->obj_hashfn && !params->obj_cmpfn));
 
-	if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT))
-		return -EINVAL;
+	WARN_ON(params->nulls_base &&
+		params->nulls_base < (1U << RHT_BASE_SHIFT));
 
 	memset(ht, 0, sizeof(*ht));
 	mutex_init(&ht->mutex);
@@ -1068,9 +1086,23 @@ int rhashtable_init(struct rhashtable *ht,
 		}
 	}
 
+	/*
+	 * This is api initialization. We need to guarantee the initial
+	 * rhashtable allocation. Upon failure, retry with a smaller size,
+	 * otherwise we exhaust our options with __GFP_NOFAIL.
+	 *
+	 * The size of the table is shrunk to at least half the original
+	 * value. Users that use large nelem_hint values are lowered to 32
+	 * buckets.
+	 */
 	tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
-	if (tbl == NULL)
-		return -ENOMEM;
+	if (unlikely(tbl == NULL)) {
+		size = min(size, HASH_DEFAULT_SIZE) / 2;
+
+		tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
+		if (tbl == NULL)
+			tbl = bucket_table_alloc_retry(ht, size, GFP_KERNEL);
+	}
 
 	atomic_set(&ht->nelems, 0);
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-24 17:07   ` Davidlohr Bueso
@ 2018-05-24 17:53     ` Linus Torvalds
  2018-05-24 18:51       ` Davidlohr Bueso
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2018-05-24 17:53 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul,
	guillaume.knispel, Linux API, Linux Kernel Mailing List

On Thu, May 24, 2018 at 10:23 AM Davidlohr Bueso <dave@stgolabs.net> wrote:

>          tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> -       if (tbl == NULL)
> -               return -ENOMEM;
> +       if (unlikely(tbl == NULL)) {
> +               size = min(size, HASH_DEFAULT_SIZE) / 2;
> +
> +               tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> +               if (tbl == NULL)
> +                       tbl = bucket_table_alloc_retry(ht, size,
GFP_KERNEL);
> +       }

This doesn't seem to be taking 'param->min_size' into account.

I'm not sure that matters, but right now, if you have nelem_hint set and a
min_size, the min_size is honored (if you have just min_size it's already
ignored because the rhashtable always starts with HASH_DEFAULT_SIZE). So I
could imagine that somebody uses it to guarantee something. The docs say
that "min_size" is the minimum size for *shrinking* not for initializing,
so I guess it's debatable.

Also, wouldn't it make sense to make this all be a while loop? Or are you
just depending on the knowledge that HASH_DEFAULT_SIZE / 2 is already
guaranteed to be so small that there's no point? A comment to that effect
would be good, perhaps.

                    Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-24 17:53     ` Linus Torvalds
@ 2018-05-24 18:51       ` Davidlohr Bueso
  2018-05-24 19:17         ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Davidlohr Bueso @ 2018-05-24 18:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul,
	guillaume.knispel, Linux API, Linux Kernel Mailing List

On Thu, 24 May 2018, Linus Torvalds wrote:

>This doesn't seem to be taking 'param->min_size' into account.

It was in that rounded_hashtable_size() does, however, after more
thought I think we can do better by taking it much more into account.

>
>I'm not sure that matters, but right now, if you have nelem_hint set and a
>min_size, the min_size is honored (if you have just min_size it's already
>ignored because the rhashtable always starts with HASH_DEFAULT_SIZE). So I
>could imagine that somebody uses it to guarantee something. The docs say
>that "min_size" is the minimum size for *shrinking* not for initializing,
>so I guess it's debatable.
>
>Also, wouldn't it make sense to make this all be a while loop? Or are you
>just depending on the knowledge that HASH_DEFAULT_SIZE / 2 is already
>guaranteed to be so small that there's no point? A comment to that effect
>would be good, perhaps.

Yes, this is why I didn't loop. With the default size of 64 buckets, we
allocate 640 + 128 = 768 bytes for the tbl and the lock array, respectively.
By halving this, upon retrying, I was relying on it being to "small to fail".

However, after how about the resize being based on HASH_MIN_SIZE instead of
HASH_DEFAULT_SIZE? That way the initial table would be a _lot_ smaller and
aid the allocator that much more; which is why we're here in the first place.
Any performance costs of collisions would be completely unimportant in this
scenario.

Considering that some users set p.min_size to be rather large-ish (up to 1024
buckets afaict), we'd need the following:

	size = min(ht->p.min_size, HASH_MIN_SIZE);

Which takes into account the min_size = max(ht->p.min_size, HASH_MIN_SIZE)
which came before, thus p.min_size == 0 is already taken into account.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: semantics of rhashtable and sysvipc
  2018-05-24 18:51       ` Davidlohr Bueso
@ 2018-05-24 19:17         ` Linus Torvalds
  0 siblings, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2018-05-24 19:17 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul,
	guillaume.knispel, Linux API, Linux Kernel Mailing List

On Thu, May 24, 2018 at 12:08 PM Davidlohr Bueso <dave@stgolabs.net> wrote:


> However, after how about the resize being based on HASH_MIN_SIZE instead
of
> HASH_DEFAULT_SIZE?

I think that sounds reasonable. We wouldn't expect this to ever happen in
practice, and as you say, if it *does* happen, the size of the hash array
is the last of our problems.

> Considering that some users set p.min_size to be rather large-ish (up to
1024
> buckets afaict), we'd need the following:

>          size = min(ht->p.min_size, HASH_MIN_SIZE);

Bah, let's just go for simplicity, and just make it HASH_MIN_SIZE
unconditionally, and just have a single fallback: if the first "normal"
allocation fails, do one single unconditional allocation with HASH_MIN_SIZE
and GFP_NOFAIL.

I think that should work fine.

                Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-05-24 19:18 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-23 17:25 semantics of rhashtable and sysvipc Davidlohr Bueso
2018-05-23 17:35 ` Davidlohr Bueso
2018-05-23 18:35 ` Linus Torvalds
2018-05-23 18:31   ` Davidlohr Bueso
2018-05-23 18:52     ` Linus Torvalds
2018-05-23 18:52       ` Davidlohr Bueso
2018-05-24 17:07   ` Davidlohr Bueso
2018-05-24 17:53     ` Linus Torvalds
2018-05-24 18:51       ` Davidlohr Bueso
2018-05-24 19:17         ` Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.