linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RFC: memory leak in udp_table_init
@ 2012-02-25  0:55 Tim Bird
  2012-02-25  1:27 ` Paul Gortmaker
  0 siblings, 1 reply; 14+ messages in thread
From: Tim Bird @ 2012-02-25  0:55 UTC (permalink / raw)
  To: David Miller, kuznet, linux kernel, netdev, eric.dumazet

We've uncovered an obscure memory leak in the routine udp_table_init(),
in the file: net/ipv4/udp.c

The allocation sequence is a bit weird, and I've got some questions
about the best way to fix it.

Here's the code:

void __init udp_table_init(struct udp_table *table, const char *name)
{
	unsigned int i;

	if (!CONFIG_BASE_SMALL)
		table->hash = alloc_large_system_hash(name,
			2 * sizeof(struct udp_hslot),
			uhash_entries,
			21, /* one slot per 2 MB */
			0,
			&table->log,
			&table->mask,
			64 * 1024);
	/*
	 * Make sure hash table has the minimum size
	 */
	if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
		if (!table->hash)
			panic(name);
		table->log = ilog2(UDP_HTABLE_SIZE_MIN);
		table->mask = UDP_HTABLE_SIZE_MIN - 1;
	}
	...

We've seen instances where the second allocation of table->hash
is performed, wiping out the first hash table allocation, without a
free.  This ends up leaking the previously allocated hash table.

That is, if we are !CONFIG_BASE_SMALL and for some reason
the alloc_large_system_hash() operation returns less than UDP_HTABLE_SIZE_MIN
hash slots, then it will trigger this.

There's no complementary "free_large_system_hash()" which can be used to
back out of the first allocation, that I can find.

We are currently doing the following to avoid the memory leak, but this
seems like it defeats the purpose of checking for the minimum size
(that is, if the first allocation was too small, we don't re-allocate).

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5d075b5..2524af4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2194,7 +2194,8 @@ void __init udp_table_init(struct udp_table *table, const char *name)
 	/*
 	 * Make sure hash table has the minimum size
 	 */
-	if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
+	if ((CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1)
+		&& !table->hash) {
 		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
 				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
 		if (!table->hash)

Any suggestions for a way to correct for a too-small first allocation, without
a memory leak?

Alternatively - how critical is this UDP_HTABLE_SIZE_MIN for correct operation
of the stack?

Thanks for any information you can provide.

 -- Tim

=============================
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=============================


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: RFC: memory leak in udp_table_init
  2012-02-25  0:55 RFC: memory leak in udp_table_init Tim Bird
@ 2012-02-25  1:27 ` Paul Gortmaker
  2012-02-25  5:19   ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Gortmaker @ 2012-02-25  1:27 UTC (permalink / raw)
  To: Tim Bird; +Cc: David Miller, kuznet, linux kernel, netdev, eric.dumazet

On Fri, Feb 24, 2012 at 7:55 PM, Tim Bird <tim.bird@am.sony.com> wrote:
> We've uncovered an obscure memory leak in the routine udp_table_init(),
> in the file: net/ipv4/udp.c

At a glance, I think what you are seeing is the same as this?

http://lists.openwall.net/netdev/2011/06/22/12

Paul.

>
> The allocation sequence is a bit weird, and I've got some questions
> about the best way to fix it.
>
> Here's the code:
>
> void __init udp_table_init(struct udp_table *table, const char *name)
> {
>        unsigned int i;
>
>        if (!CONFIG_BASE_SMALL)
>                table->hash = alloc_large_system_hash(name,
>                        2 * sizeof(struct udp_hslot),
>                        uhash_entries,
>                        21, /* one slot per 2 MB */
>                        0,
>                        &table->log,
>                        &table->mask,
>                        64 * 1024);
>        /*
>         * Make sure hash table has the minimum size
>         */
>        if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
>                table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
>                                      2 * sizeof(struct udp_hslot), GFP_KERNEL);
>                if (!table->hash)
>                        panic(name);
>                table->log = ilog2(UDP_HTABLE_SIZE_MIN);
>                table->mask = UDP_HTABLE_SIZE_MIN - 1;
>        }
>        ...
>
> We've seen instances where the second allocation of table->hash
> is performed, wiping out the first hash table allocation, without a
> free.  This ends up leaking the previously allocated hash table.
>
> That is, if we are !CONFIG_BASE_SMALL and for some reason
> the alloc_large_system_hash() operation returns less than UDP_HTABLE_SIZE_MIN
> hash slots, then it will trigger this.
>
> There's no complementary "free_large_system_hash()" which can be used to
> back out of the first allocation, that I can find.
>
> We are currently doing the following to avoid the memory leak, but this
> seems like it defeats the purpose of checking for the minimum size
> (that is, if the first allocation was too small, we don't re-allocate).
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 5d075b5..2524af4 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2194,7 +2194,8 @@ void __init udp_table_init(struct udp_table *table, const char *name)
>        /*
>         * Make sure hash table has the minimum size
>         */
> -       if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
> +       if ((CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1)
> +               && !table->hash) {
>                table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
>                                      2 * sizeof(struct udp_hslot), GFP_KERNEL);
>                if (!table->hash)
>
> Any suggestions for a way to correct for a too-small first allocation, without
> a memory leak?
>
> Alternatively - how critical is this UDP_HTABLE_SIZE_MIN for correct operation
> of the stack?
>
> Thanks for any information you can provide.
>
>  -- Tim
>
> =============================
> Tim Bird
> Architecture Group Chair, CE Workgroup of the Linux Foundation
> Senior Staff Engineer, Sony Network Entertainment
> =============================
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC: memory leak in udp_table_init
  2012-02-25  1:27 ` Paul Gortmaker
@ 2012-02-25  5:19   ` Eric Dumazet
  2012-02-26 19:20     ` David Miller
  2012-02-27  6:03     ` [PATCH v2] mm: add a low limit to alloc_large_system_hash Eric Dumazet
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2012-02-25  5:19 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: Tim Bird, David Miller, kuznet, linux kernel, netdev

Le vendredi 24 février 2012 à 20:27 -0500, Paul Gortmaker a écrit :
> On Fri, Feb 24, 2012 at 7:55 PM, Tim Bird <tim.bird@am.sony.com> wrote:
> > We've uncovered an obscure memory leak in the routine udp_table_init(),
> > in the file: net/ipv4/udp.c
> 
> At a glance, I think what you are seeing is the same as this?
> 
> http://lists.openwall.net/netdev/2011/06/22/12
> 

Yes, this issue is somewhat recurrent...

> > Any suggestions for a way to correct for a too-small first allocation, without
> > a memory leak?
> >
> > Alternatively - how critical is this UDP_HTABLE_SIZE_MIN for correct operation
> > of the stack?

Absolutely mandatory, if you read net/ipv4/udp.c


Lets add a new parameter to alloc_large_system_hash() to specify minimum
number of slots in hash table.


David, this patch is based on Linus tree, not on net tree.
(Doesnt apply properly on net tree currently)

Thanks

[PATCH] mm: add a low limit to alloc_large_system_hash

UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation, and 16 for pid_hash.

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 fs/dcache.c             |    2 ++
 fs/inode.c              |    2 ++
 include/linux/bootmem.h |    3 ++-
 kernel/pid.c            |    3 ++-
 mm/page_alloc.c         |    7 +++++--
 net/ipv4/route.c        |    1 +
 net/ipv4/tcp.c          |    2 ++
 net/ipv4/udp.c          |   30 ++++++++++--------------------
 8 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fe19ac1..ef5e72e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2984,6 +2984,7 @@ static void __init dcache_init_early(void)
 					HASH_EARLY,
 					&d_hash_shift,
 					&d_hash_mask,
+					0,
 					0);
 
 	for (loop = 0; loop < (1U << d_hash_shift); loop++)
@@ -3014,6 +3015,7 @@ static void __init dcache_init(void)
 					0,
 					&d_hash_shift,
 					&d_hash_mask,
+					0,
 					0);
 
 	for (loop = 0; loop < (1U << d_hash_shift); loop++)
diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..7acee4c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1667,6 +1667,7 @@ void __init inode_init_early(void)
 					HASH_EARLY,
 					&i_hash_shift,
 					&i_hash_mask,
+					0,
 					0);
 
 	for (loop = 0; loop < (1U << i_hash_shift); loop++)
@@ -1697,6 +1698,7 @@ void __init inode_init(void)
 					0,
 					&i_hash_shift,
 					&i_hash_mask,
+					0,
 					0);
 
 	for (loop = 0; loop < (1U << i_hash_shift); loop++)
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 66d3e95..1a0cd27 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -154,7 +154,8 @@ extern void *alloc_large_system_hash(const char *tablename,
 				     int flags,
 				     unsigned int *_hash_shift,
 				     unsigned int *_hash_mask,
-				     unsigned long limit);
+				     unsigned long low_limit,
+				     unsigned long high_limit);
 
 #define HASH_EARLY	0x00000001	/* Allocating during early boot? */
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
diff --git a/kernel/pid.c b/kernel/pid.c
index 9f08dfa..79884b2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -547,7 +547,8 @@ void __init pidhash_init(void)
 
 	pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
 					   HASH_EARLY | HASH_SMALL,
-					   &pidhash_shift, NULL, 4096);
+					   &pidhash_shift, NULL,
+					   16, 4096);
 	pidhash_size = 1U << pidhash_shift;
 
 	for (i = 0; i < pidhash_size; i++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a13ded1..f037398 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5198,9 +5198,10 @@ void *__init alloc_large_system_hash(const char *tablename,
 				     int flags,
 				     unsigned int *_hash_shift,
 				     unsigned int *_hash_mask,
-				     unsigned long limit)
+				     unsigned long low_limit,
+				     unsigned long high_limit)
 {
-	unsigned long long max = limit;
+	unsigned long long max = high_limit;
 	unsigned long log2qty, size;
 	void *table = NULL;
 
@@ -5238,6 +5239,8 @@ void *__init alloc_large_system_hash(const char *tablename,
 	}
 	max = min(max, 0x80000000ULL);
 
+	if (numentries < low_limit)
+		numentries = low_limit;
 	if (numentries > max)
 		numentries = max;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bcacf54..0a41e38 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -3475,6 +3475,7 @@ int __init ip_rt_init(void)
 					0,
 					&rt_hash_log,
 					&rt_hash_mask,
+					0,
 					rhash_entries ? 0 : 512 * 1024);
 	memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
 	rt_hash_lock_init();
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 22ef5f9..e61a498 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3267,6 +3267,7 @@ void __init tcp_init(void)
 					0,
 					NULL,
 					&tcp_hashinfo.ehash_mask,
+					0,
 					thash_entries ? 0 : 512 * 1024);
 	for (i = 0; i <= tcp_hashinfo.ehash_mask; i++) {
 		INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].chain, i);
@@ -3283,6 +3284,7 @@ void __init tcp_init(void)
 					0,
 					&tcp_hashinfo.bhash_size,
 					NULL,
+					0,
 					64 * 1024);
 	tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
 	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5d075b5..dc68ed2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2182,26 +2182,16 @@ void __init udp_table_init(struct udp_table *table, const char *name)
 {
 	unsigned int i;
 
-	if (!CONFIG_BASE_SMALL)
-		table->hash = alloc_large_system_hash(name,
-			2 * sizeof(struct udp_hslot),
-			uhash_entries,
-			21, /* one slot per 2 MB */
-			0,
-			&table->log,
-			&table->mask,
-			64 * 1024);
-	/*
-	 * Make sure hash table has the minimum size
-	 */
-	if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
-		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
-				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
-		if (!table->hash)
-			panic(name);
-		table->log = ilog2(UDP_HTABLE_SIZE_MIN);
-		table->mask = UDP_HTABLE_SIZE_MIN - 1;
-	}
+	table->hash = alloc_large_system_hash(name,
+					      2 * sizeof(struct udp_hslot),
+					      uhash_entries,
+					      21, /* one slot per 2 MB */
+					      0,
+					      &table->log,
+					      &table->mask,
+					      UDP_HTABLE_SIZE_MIN,
+					      64 * 1024);
+
 	table->hash2 = table->hash + (table->mask + 1);
 	for (i = 0; i <= table->mask; i++) {
 		INIT_HLIST_NULLS_HEAD(&table->hash[i].head, i);



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: RFC: memory leak in udp_table_init
  2012-02-25  5:19   ` Eric Dumazet
@ 2012-02-26 19:20     ` David Miller
  2012-02-27  5:40       ` Eric Dumazet
  2012-02-27  6:03     ` [PATCH v2] mm: add a low limit to alloc_large_system_hash Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2012-02-26 19:20 UTC (permalink / raw)
  To: eric.dumazet; +Cc: paul.gortmaker, tim.bird, kuznet, linux-kernel, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 25 Feb 2012 06:19:42 +0100

> [PATCH] mm: add a low limit to alloc_large_system_hash

I think you should just use zero as the default minimum for all
call sites except this UDP case we are trying to fix.

For example I see you used 16 for kernel/pid.c

Let's not try to do unrelated changes like that now, we can do such
tweaks later.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC: memory leak in udp_table_init
  2012-02-26 19:20     ` David Miller
@ 2012-02-27  5:40       ` Eric Dumazet
  2012-02-27  5:44         ` David Miller
  2012-02-27 11:33         ` David Laight
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2012-02-27  5:40 UTC (permalink / raw)
  To: David Miller; +Cc: paul.gortmaker, tim.bird, kuznet, linux-kernel, netdev

Le dimanche 26 février 2012 à 14:20 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 25 Feb 2012 06:19:42 +0100
> 
> > [PATCH] mm: add a low limit to alloc_large_system_hash
> 
> I think you should just use zero as the default minimum for all
> call sites except this UDP case we are trying to fix.
> 
> For example I see you used 16 for kernel/pid.c
> 
> Let's not try to do unrelated changes like that now, we can do such
> tweaks later.

It was to match the comment we have few lines above :

/*
 * The pid hash table is scaled according to the amount of memory in the
 * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
 * more.
 */



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC: memory leak in udp_table_init
  2012-02-27  5:40       ` Eric Dumazet
@ 2012-02-27  5:44         ` David Miller
  2012-02-27 11:33         ` David Laight
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2012-02-27  5:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: paul.gortmaker, tim.bird, kuznet, linux-kernel, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2012 06:40:23 +0100

> It was to match the comment we have few lines above :
> 
> /*
>  * The pid hash table is scaled according to the amount of memory in the
>  * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
>  * more.
>  */

Granted, but that's not what the actual code did beforehand.

Please, make it a seperate change, thank you.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2] mm: add a low limit to alloc_large_system_hash
  2012-02-25  5:19   ` Eric Dumazet
  2012-02-26 19:20     ` David Miller
@ 2012-02-27  6:03     ` Eric Dumazet
  2012-02-27  6:45       ` David Miller
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2012-02-27  6:03 UTC (permalink / raw)
  To: Paul Gortmaker, David Miller; +Cc: Tim Bird, kuznet, linux kernel, netdev

UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
---
V2: no 16 minimum value for pid hash

 fs/dcache.c             |    2 ++
 fs/inode.c              |    2 ++
 include/linux/bootmem.h |    3 ++-
 kernel/pid.c            |    3 ++-
 mm/page_alloc.c         |    7 +++++--
 net/ipv4/route.c        |    1 +
 net/ipv4/tcp.c          |    2 ++
 net/ipv4/udp.c          |   30 ++++++++++--------------------
 8 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fe19ac1..ef5e72e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2984,6 +2984,7 @@ static void __init dcache_init_early(void)
 					HASH_EARLY,
 					&d_hash_shift,
 					&d_hash_mask,
+					0,
 					0);
 
 	for (loop = 0; loop < (1U << d_hash_shift); loop++)
@@ -3014,6 +3015,7 @@ static void __init dcache_init(void)
 					0,
 					&d_hash_shift,
 					&d_hash_mask,
+					0,
 					0);
 
 	for (loop = 0; loop < (1U << d_hash_shift); loop++)
diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..7acee4c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1667,6 +1667,7 @@ void __init inode_init_early(void)
 					HASH_EARLY,
 					&i_hash_shift,
 					&i_hash_mask,
+					0,
 					0);
 
 	for (loop = 0; loop < (1U << i_hash_shift); loop++)
@@ -1697,6 +1698,7 @@ void __init inode_init(void)
 					0,
 					&i_hash_shift,
 					&i_hash_mask,
+					0,
 					0);
 
 	for (loop = 0; loop < (1U << i_hash_shift); loop++)
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 66d3e95..1a0cd27 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -154,7 +154,8 @@ extern void *alloc_large_system_hash(const char *tablename,
 				     int flags,
 				     unsigned int *_hash_shift,
 				     unsigned int *_hash_mask,
-				     unsigned long limit);
+				     unsigned long low_limit,
+				     unsigned long high_limit);
 
 #define HASH_EARLY	0x00000001	/* Allocating during early boot? */
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
diff --git a/kernel/pid.c b/kernel/pid.c
index 9f08dfa..e86b291 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -547,7 +547,8 @@ void __init pidhash_init(void)
 
 	pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
 					   HASH_EARLY | HASH_SMALL,
-					   &pidhash_shift, NULL, 4096);
+					   &pidhash_shift, NULL,
+					   0, 4096);
 	pidhash_size = 1U << pidhash_shift;
 
 	for (i = 0; i < pidhash_size; i++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a13ded1..b9afccb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5198,9 +5198,10 @@ void *__init alloc_large_system_hash(const char *tablename,
 				     int flags,
 				     unsigned int *_hash_shift,
 				     unsigned int *_hash_mask,
-				     unsigned long limit)
+				     unsigned long low_limit,
+				     unsigned long high_limit)
 {
-	unsigned long long max = limit;
+	unsigned long long max = high_limit;
 	unsigned long log2qty, size;
 	void *table = NULL;
 
@@ -5238,6 +5239,8 @@ void *__init alloc_large_system_hash(const char *tablename,
 	}
 	max = min(max, 0x80000000ULL);
 
+	if (numentries < low_limit)
+		numentries = low_limit;
 	if (numentries > max)
 		numentries = max;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bcacf54..0a41e38 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -3475,6 +3475,7 @@ int __init ip_rt_init(void)
 					0,
 					&rt_hash_log,
 					&rt_hash_mask,
+					0,
 					rhash_entries ? 0 : 512 * 1024);
 	memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
 	rt_hash_lock_init();
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 22ef5f9..e61a498 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3267,6 +3267,7 @@ void __init tcp_init(void)
 					0,
 					NULL,
 					&tcp_hashinfo.ehash_mask,
+					0,
 					thash_entries ? 0 : 512 * 1024);
 	for (i = 0; i <= tcp_hashinfo.ehash_mask; i++) {
 		INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].chain, i);
@@ -3283,6 +3284,7 @@ void __init tcp_init(void)
 					0,
 					&tcp_hashinfo.bhash_size,
 					NULL,
+					0,
 					64 * 1024);
 	tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
 	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5d075b5..dc68ed2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2182,26 +2182,16 @@ void __init udp_table_init(struct udp_table *table, const char *name)
 {
 	unsigned int i;
 
-	if (!CONFIG_BASE_SMALL)
-		table->hash = alloc_large_system_hash(name,
-			2 * sizeof(struct udp_hslot),
-			uhash_entries,
-			21, /* one slot per 2 MB */
-			0,
-			&table->log,
-			&table->mask,
-			64 * 1024);
-	/*
-	 * Make sure hash table has the minimum size
-	 */
-	if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
-		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
-				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
-		if (!table->hash)
-			panic(name);
-		table->log = ilog2(UDP_HTABLE_SIZE_MIN);
-		table->mask = UDP_HTABLE_SIZE_MIN - 1;
-	}
+	table->hash = alloc_large_system_hash(name,
+					      2 * sizeof(struct udp_hslot),
+					      uhash_entries,
+					      21, /* one slot per 2 MB */
+					      0,
+					      &table->log,
+					      &table->mask,
+					      UDP_HTABLE_SIZE_MIN,
+					      64 * 1024);
+
 	table->hash2 = table->hash + (table->mask + 1);
 	for (i = 0; i <= table->mask; i++) {
 		INIT_HLIST_NULLS_HEAD(&table->hash[i].head, i);



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] mm: add a low limit to alloc_large_system_hash
  2012-02-27  6:03     ` [PATCH v2] mm: add a low limit to alloc_large_system_hash Eric Dumazet
@ 2012-02-27  6:45       ` David Miller
  0 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2012-02-27  6:45 UTC (permalink / raw)
  To: eric.dumazet; +Cc: paul.gortmaker, tim.bird, kuznet, linux-kernel, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2012 07:03:17 +0100

> UDP stack needs a minimum hash size value for proper operation and also
> uses alloc_large_system_hash() for proper NUMA distribution of its hash
> tables and automatic sizing depending on available system memory.
> 
> On some low memory situations, udp_table_init() must ignore the
> alloc_large_system_hash() result and reallocs a bigger memory area.
> 
> As we cannot easily free old hash table, we leak it and kmemleak can
> issue a warning.
> 
> This patch adds a low limit parameter to alloc_large_system_hash() to
> solve this problem.
> 
> We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
> allocation.
> 
> Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
> Reported-by: Tim Bird <tim.bird@am.sony.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Acked-by: David S. Miller <davem@davemloft.net>

Who wants to take this?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: RFC: memory leak in udp_table_init
  2012-02-27  5:40       ` Eric Dumazet
  2012-02-27  5:44         ` David Miller
@ 2012-02-27 11:33         ` David Laight
  2012-02-29 18:28           ` Eric W. Biederman
  1 sibling, 1 reply; 14+ messages in thread
From: David Laight @ 2012-02-27 11:33 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: paul.gortmaker, tim.bird, kuznet, linux-kernel, netdev

> > > [PATCH] mm: add a low limit to alloc_large_system_hash
> > 
> > I think you should just use zero as the default minimum for all
> > call sites except this UDP case we are trying to fix.
> > 
> > For example I see you used 16 for kernel/pid.c
> > 
> > Let's not try to do unrelated changes like that now, we can do such
> > tweaks later.
> 
> It was to match the comment we have few lines above :
> 
> /*
>  * The pid hash table is scaled according to the amount of memory in
the
>  * machine.  From a minimum of 16 slots up to 4096 slots at one
gigabyte or
>  * more.
>  */

These large hash tables are, IMHO, an indication that the
algorithm used is, perhaps, suboptimal.

Not least of the problems is actually finding a suitable
(and fast) hash function that will work with the actual
real-life data.

The pid table is a good example of something where a hash
table is unnecessary.
Linux should steal the code I put into NetBSD :-)

	David



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC: memory leak in udp_table_init
  2012-02-27 11:33         ` David Laight
@ 2012-02-29 18:28           ` Eric W. Biederman
  2012-03-01  8:55             ` David Laight
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2012-02-29 18:28 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, David Miller, paul.gortmaker, tim.bird, kuznet,
	linux-kernel, netdev

"David Laight" <David.Laight@ACULAB.COM> writes:

>> It was to match the comment we have few lines above :
>> 
>> /*
>>  * The pid hash table is scaled according to the amount of memory in
> the
>>  * machine.  From a minimum of 16 slots up to 4096 slots at one
> gigabyte or
>>  * more.
>>  */

The comment was actually correct until someone converted the code to use
alloc_large_system_hash.

> These large hash tables are, IMHO, an indication that the
> algorithm used is, perhaps, suboptimal.
>
> Not least of the problems is actually finding a suitable
> (and fast) hash function that will work with the actual
> real-life data.
>
> The pid table is a good example of something where a hash
> table is unnecessary.
> Linux should steal the code I put into NetBSD :-)

On this unrelated topic.  What algorithm did you use on NetBSD for
dealing with pids?

A hash chain length of 1 for common sizes is a pretty attractive
algorithm, as it minimizes the number of cache line misses.  Ages ago
when I modeled the linux situation that is what I was seeing.

The normal case for the linux pid hash table is that it has 4096
entries taking 16k or 32k of memory and the typical process load
has about 200 or so processes.    Making it very easy in the common
case to have a single entry hash chain.

All of the other algorithms I know have a tree structure and thus
more cache misses (as you traverse the tree) and ultimately worse
real world performance.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: RFC: memory leak in udp_table_init
  2012-02-29 18:28           ` Eric W. Biederman
@ 2012-03-01  8:55             ` David Laight
  2012-03-01 12:33               ` Eric Dumazet
  2012-03-02  0:12               ` Eric W. Biederman
  0 siblings, 2 replies; 14+ messages in thread
From: David Laight @ 2012-03-01  8:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Eric Dumazet, David Miller, paul.gortmaker, tim.bird, kuznet,
	linux-kernel, netdev

 
> > The pid table is a good example of something where a hash
> > table is unnecessary.
> > Linux should steal the code I put into NetBSD :-)
> 
> On this unrelated topic.  What algorithm did you use on NetBSD for
> dealing with pids?

Basically I forced the hash chain length to one by allocating
a pid that hit an empty entry in the table.

So you start off with (say) 64 entries and use the low 6
bits to index the table. The higher bits are incremented
each time a 'slot' is reused.
Free entries are kept in a FIFO list.
So each entry either contains a pointer to the process,
or the high bits and the index of the next free slot.
(and the PGID pointer).
When there are only (say) 2 free entries, then the table
size is doubled, the pointers moved to the correct places,
the free ist fixed up, and the minimum number of free entries
doubled.

The overall effect:
- lookup is only ever a mask and index + compare.
- Allocate is always fast and fixed cost (except when
  the table size has to be doubled).
- A pid value will never be reused within (about) 2000
  allocates (for 16bit pids, much larger for 32bit ones).
- Allocated pid numbers tend to be random, certainly
  very difficult to predict.
- Small memory footprint for small systems.
For pids we normally avoid issuing large values, but
will do so to avoid immediate re-use on systems that
have 1000s of active processes.

See lines 580-820 of
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/kern/kern_proc.c?annotate=1.
182&only_with_tag=MAIN

	David



^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: RFC: memory leak in udp_table_init
  2012-03-01  8:55             ` David Laight
@ 2012-03-01 12:33               ` Eric Dumazet
  2012-03-15 17:44                 ` Paul E. McKenney
  2012-03-02  0:12               ` Eric W. Biederman
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2012-03-01 12:33 UTC (permalink / raw)
  To: David Laight
  Cc: Eric W. Biederman, David Miller, paul.gortmaker, tim.bird,
	kuznet, linux-kernel, netdev

Le jeudi 01 mars 2012 à 08:55 +0000, David Laight a écrit :
> > > The pid table is a good example of something where a hash
> > > table is unnecessary.
> > > Linux should steal the code I put into NetBSD :-)
> > 
> > On this unrelated topic.  What algorithm did you use on NetBSD for
> > dealing with pids?
> 
> Basically I forced the hash chain length to one by allocating
> a pid that hit an empty entry in the table.
> 
> So you start off with (say) 64 entries and use the low 6
> bits to index the table. The higher bits are incremented
> each time a 'slot' is reused.
> Free entries are kept in a FIFO list.
> So each entry either contains a pointer to the process,
> or the high bits and the index of the next free slot.
> (and the PGID pointer).
> When there are only (say) 2 free entries, then the table
> size is doubled, the pointers moved to the correct places,
> the free ist fixed up, and the minimum number of free entries
> doubled.
> 
> The overall effect:
> - lookup is only ever a mask and index + compare.
> - Allocate is always fast and fixed cost (except when
>   the table size has to be doubled).
> - A pid value will never be reused within (about) 2000
>   allocates (for 16bit pids, much larger for 32bit ones).
> - Allocated pid numbers tend to be random, certainly
>   very difficult to predict.
> - Small memory footprint for small systems.
> For pids we normally avoid issuing large values, but
> will do so to avoid immediate re-use on systems that
> have 1000s of active processes.
> 


You describe a hash table mechanism still, and you made chain lengthes
be 0 or 1.

This GEN_ID/SLOT schem is the one used in IBM AIX for pid allocations
(with a 31 (or was it 32) bit range). They did not use FIFO, because
they tried to use lower slots of the proc table.

Hashes values in network land are unpredictable, so we cannot make sure
a slot contains at most one entry.

Note: If you manage 4 million tcp sockets in your server, hash table
must be at _least_ 4 million slots. I would not call that suboptimal as
you said in your previous mail.

We could argue that default sizes of these hash tables (ip route,
tcp, ...) are now more suited for high end uses instead of
desktop/embedded uses, because ram sizes increased so much last 10
years.

[ We added in commits 0ccfe61803ad & c9503e0fe05 a 512K limits for
tcp/ip route hash tables to stop insanity, but thats it ]

RCU lookups make dynamic resizes of these hash table a bit complex.
IIRC David Miller have submitted a patch but this work was not
completed. Not sure its an issue these days, as we prefer to work on ip
route cache (and its associated hash table) removal anyway.

For UDP/UDPLite it certainly is not an issue, since max size is 65536
slots. On a 32bit kernel, with at more 1GB of LOWMEM, max is 512 slots.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC: memory leak in udp_table_init
  2012-03-01  8:55             ` David Laight
  2012-03-01 12:33               ` Eric Dumazet
@ 2012-03-02  0:12               ` Eric W. Biederman
  1 sibling, 0 replies; 14+ messages in thread
From: Eric W. Biederman @ 2012-03-02  0:12 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, David Miller, paul.gortmaker, tim.bird, kuznet,
	linux-kernel, netdev

"David Laight" <David.Laight@ACULAB.COM> writes:

>  
>> > The pid table is a good example of something where a hash
>> > table is unnecessary.
>> > Linux should steal the code I put into NetBSD :-)
>> 
>> On this unrelated topic.  What algorithm did you use on NetBSD for
>> dealing with pids?
>
> Basically I forced the hash chain length to one by allocating
> a pid that hit an empty entry in the table.
>
> So you start off with (say) 64 entries and use the low 6
> bits to index the table. The higher bits are incremented
> each time a 'slot' is reused.
> Free entries are kept in a FIFO list.
> So each entry either contains a pointer to the process,
> or the high bits and the index of the next free slot.
> (and the PGID pointer).
> When there are only (say) 2 free entries, then the table
> size is doubled, the pointers moved to the correct places,
> the free ist fixed up, and the minimum number of free entries
> doubled.
>
> The overall effect:
> - lookup is only ever a mask and index + compare.
> - Allocate is always fast and fixed cost (except when
>   the table size has to be doubled).
> - A pid value will never be reused within (about) 2000
>   allocates (for 16bit pids, much larger for 32bit ones).
> - Allocated pid numbers tend to be random, certainly
>   very difficult to predict.
> - Small memory footprint for small systems.
> For pids we normally avoid issuing large values, but
> will do so to avoid immediate re-use on systems that
> have 1000s of active processes.

That is a very nice technique.  Unfortunately doubling a hash table size
when you need large amounts of contiguous memory is difficult, and worse
does not support user space selectable pid numbers which is needed to
support process migration.  So I don't think we can adapt this case for
the linux pid hash table.

I will keep the technique in mind in case I run into a situation where
it is applicable.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC: memory leak in udp_table_init
  2012-03-01 12:33               ` Eric Dumazet
@ 2012-03-15 17:44                 ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2012-03-15 17:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Laight, Eric W. Biederman, David Miller, paul.gortmaker,
	tim.bird, kuznet, linux-kernel, netdev

On Thu, Mar 01, 2012 at 04:33:38AM -0800, Eric Dumazet wrote:
> Le jeudi 01 mars 2012 à 08:55 +0000, David Laight a écrit :
> > > > The pid table is a good example of something where a hash
> > > > table is unnecessary.
> > > > Linux should steal the code I put into NetBSD :-)
> > > 
> > > On this unrelated topic.  What algorithm did you use on NetBSD for
> > > dealing with pids?
> > 
> > Basically I forced the hash chain length to one by allocating
> > a pid that hit an empty entry in the table.
> > 
> > So you start off with (say) 64 entries and use the low 6
> > bits to index the table. The higher bits are incremented
> > each time a 'slot' is reused.
> > Free entries are kept in a FIFO list.
> > So each entry either contains a pointer to the process,
> > or the high bits and the index of the next free slot.
> > (and the PGID pointer).
> > When there are only (say) 2 free entries, then the table
> > size is doubled, the pointers moved to the correct places,
> > the free ist fixed up, and the minimum number of free entries
> > doubled.
> > 
> > The overall effect:
> > - lookup is only ever a mask and index + compare.
> > - Allocate is always fast and fixed cost (except when
> >   the table size has to be doubled).
> > - A pid value will never be reused within (about) 2000
> >   allocates (for 16bit pids, much larger for 32bit ones).
> > - Allocated pid numbers tend to be random, certainly
> >   very difficult to predict.
> > - Small memory footprint for small systems.
> > For pids we normally avoid issuing large values, but
> > will do so to avoid immediate re-use on systems that
> > have 1000s of active processes.
> > 
> 
> 
> You describe a hash table mechanism still, and you made chain lengthes
> be 0 or 1.
> 
> This GEN_ID/SLOT schem is the one used in IBM AIX for pid allocations
> (with a 31 (or was it 32) bit range). They did not use FIFO, because
> they tried to use lower slots of the proc table.
> 
> Hashes values in network land are unpredictable, so we cannot make sure
> a slot contains at most one entry.
> 
> Note: If you manage 4 million tcp sockets in your server, hash table
> must be at _least_ 4 million slots. I would not call that suboptimal as
> you said in your previous mail.
> 
> We could argue that default sizes of these hash tables (ip route,
> tcp, ...) are now more suited for high end uses instead of
> desktop/embedded uses, because ram sizes increased so much last 10
> years.
> 
> [ We added in commits 0ccfe61803ad & c9503e0fe05 a 512K limits for
> tcp/ip route hash tables to stop insanity, but thats it ]
> 
> RCU lookups make dynamic resizes of these hash table a bit complex.
> IIRC David Miller have submitted a patch but this work was not
> completed. Not sure its an issue these days, as we prefer to work on ip
> route cache (and its associated hash table) removal anyway.

One approach is to put a pair of list_head structures in each object, then
proceed as Herbert Xu did: http://lists.openwall.net/netdev/2010/02/28/20.

There is another RCU-protected resizable hash table with the userspace
RCU library: http://lttng.org/urcu.  And another approach is described
in a USENIX paper:

http://www.usenix.org/event/atc11/tech/final_files/Triplett.pdf

						Thanx, Paul

> For UDP/UDPLite it certainly is not an issue, since max size is 65536
> slots. On a 32bit kernel, with at more 1GB of LOWMEM, max is 512 slots.
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-03-15 17:44 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-25  0:55 RFC: memory leak in udp_table_init Tim Bird
2012-02-25  1:27 ` Paul Gortmaker
2012-02-25  5:19   ` Eric Dumazet
2012-02-26 19:20     ` David Miller
2012-02-27  5:40       ` Eric Dumazet
2012-02-27  5:44         ` David Miller
2012-02-27 11:33         ` David Laight
2012-02-29 18:28           ` Eric W. Biederman
2012-03-01  8:55             ` David Laight
2012-03-01 12:33               ` Eric Dumazet
2012-03-15 17:44                 ` Paul E. McKenney
2012-03-02  0:12               ` Eric W. Biederman
2012-02-27  6:03     ` [PATCH v2] mm: add a low limit to alloc_large_system_hash Eric Dumazet
2012-02-27  6:45       ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).