* [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
From: Aruna Ramakrishna @ 2016-08-17 18:20 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Mike Kravetz, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton

On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large, it
results in a dramatic performance improvement. Getting slabinfo statistics
now only requires walking the slabs_free and slabs_partial lists, and
those lists are usually much smaller than slabs_full. We tested this after
growing the dentry cache to 70GB, and the performance improved from 2s to
5ms.

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Note: this has been tested only on x86_64.
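
For reference, the arithmetic get_slabinfo() relies on with this patch
applied can be shown with a small standalone sketch: only slabs_partial
and slabs_free are walked, and the slabs_full contribution is derived
from the new per-node counter. The struct and the sample numbers below
are invented for illustration; this is not the kernel code itself.

#include <stdio.h>

/* Per-node totals gathered under n->list_lock in the patched code. */
struct node_totals {
        unsigned long num_slabs;        /* new counter: every slab on the node */
        unsigned long slabs_partial;    /* counted by walking slabs_partial */
        unsigned long slabs_free;       /* counted by walking slabs_free */
        unsigned long partial_active;   /* sum of page->active over slabs_partial */
};

int main(void)
{
        unsigned long objs_per_slab = 21;       /* stands in for cachep->num */
        struct node_totals n = {
                .num_slabs = 1000000,
                .slabs_partial = 50,
                .slabs_free = 10,
                .partial_active = 700,
        };

        /* slabs_full is never walked; its length is derived from the counter. */
        unsigned long slabs_full = n.num_slabs - (n.slabs_partial + n.slabs_free);
        unsigned long active_slabs = n.num_slabs - n.slabs_free;
        unsigned long num_objs = n.num_slabs * objs_per_slab;
        unsigned long active_objs = n.partial_active + slabs_full * objs_per_slab;

        printf("active_objs %lu, num_objs %lu, active_slabs %lu, num_slabs %lu\n",
               active_objs, num_objs, active_slabs, n.num_slabs);
        return 0;
}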

 mm/slab.c | 26 +++++++++++++++++---------
 mm/slab.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b672710..3da34fe 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	parent->num_slabs = 0;
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)				\
@@ -2326,6 +2327,7 @@ static int drain_freelist(struct kmem_cache *cache,
 
 		page = list_entry(p, struct page, lru);
 		list_del(&page->lru);
+		n->num_slabs--;
 		/*
 		 * Safe to drop the lock. The slab is no longer linked
 		 * to the cache.
@@ -2764,6 +2766,8 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 		list_add_tail(&page->lru, &(n->slabs_free));
 	else
 		fixup_slab_list(cachep, n, page, &list);
+
+	n->num_slabs++;
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num - page->active;
 	spin_unlock(&n->list_lock);
@@ -3455,6 +3459,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 
 		page = list_last_entry(&n->slabs_free, struct page, lru);
 		list_move(&page->lru, list);
+		n->num_slabs--;
 	}
 }
 
@@ -4111,6 +4116,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+	unsigned long num_slabs_full = 0;
 	const char *name;
 	char *error = NULL;
 	int node;
@@ -4123,33 +4130,34 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 		check_irq_on();
 		spin_lock_irq(&n->list_lock);
 
-		list_for_each_entry(page, &n->slabs_full, lru) {
-			if (page->active != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
-		}
+		num_slabs += n->num_slabs;
+
 		list_for_each_entry(page, &n->slabs_partial, lru) {
 			if (page->active == cachep->num && !error)
 				error = "slabs_partial accounting error";
 			if (!page->active && !error)
 				error = "slabs_partial accounting error";
 			active_objs += page->active;
-			active_slabs++;
+			num_slabs_partial++;
 		}
+
 		list_for_each_entry(page, &n->slabs_free, lru) {
 			if (page->active && !error)
 				error = "slabs_free accounting error";
-			num_slabs++;
+			num_slabs_free++;
 		}
+
 		free_objects += n->free_objects;
 		if (n->shared)
 			shared_avail += n->shared->avail;
 
 		spin_unlock_irq(&n->list_lock);
 	}
-	num_slabs += active_slabs;
 	num_objs = num_slabs * cachep->num;
+	active_slabs = num_slabs - num_slabs_free;
+	num_slabs_full = num_slabs - (num_slabs_partial + num_slabs_free);
+	active_objs += (num_slabs_full * cachep->num);
+
 	if (num_objs - active_objs != free_objects && !error)
 		error = "free_objects accounting error";
 
diff --git a/mm/slab.h b/mm/slab.h
index 9653f2e..bc05fdc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -432,6 +432,7 @@ struct kmem_cache_node {
 	struct list_head slabs_partial;	/* partial list first, better asm code */
 	struct list_head slabs_full;
 	struct list_head slabs_free;
+	unsigned long num_slabs;
 	unsigned long free_objects;
 	unsigned int free_limit;
 	unsigned int colour_next;	/* Per-node cache coloring */
-- 
1.8.3.1


* Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
From: Eric Dumazet @ 2016-08-17 19:03 UTC (permalink / raw)
  To: Aruna Ramakrishna
  Cc: linux-mm, linux-kernel, Mike Kravetz, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton

On Wed, 2016-08-17 at 11:20 -0700, Aruna Ramakrishna wrote:
]
> -		list_for_each_entry(page, &n->slabs_full, lru) {
> -			if (page->active != cachep->num && !error)
> -				error = "slabs_full accounting error";
> -			active_objs += cachep->num;
> -			active_slabs++;
> -		}

Since you only removed this loop, you could track only number of
full_slabs.

This would avoid messing with n->num_slabs all over the places in fast
path.

Please also update slab_out_of_memory()


* Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
From: Aruna Ramakrishna @ 2016-08-17 19:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-mm, linux-kernel, Mike Kravetz, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton


On 08/17/2016 12:03 PM, Eric Dumazet wrote:
> On Wed, 2016-08-17 at 11:20 -0700, Aruna Ramakrishna wrote:
> ]
>> -		list_for_each_entry(page, &n->slabs_full, lru) {
>> -			if (page->active != cachep->num && !error)
>> -				error = "slabs_full accounting error";
>> -			active_objs += cachep->num;
>> -			active_slabs++;
>> -		}
>
> Since you only removed this loop, you could track only number of
> full_slabs.
>
> This would avoid messing with n->num_slabs all over the places in fast
> path.
>
> Please also update slab_out_of_memory()
>

Eric,

Right now, n->num_slabs is modified only when a slab is detached from 
slabs_free (i.e. in drain_freelist and free_block) or when a new one is 
attached in cache_grow_end. None of those 3 calls are in the fast path, 
right? Tracking just full_slabs would also involve similar changes: 
decrement when a slab moves from full to partial during free_block, and 
increment when it moves from partial/free to full after allocation in 
fixup_slab_list. So I don't see what the real difference/advantage is.
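
For comparison, here is a minimal userspace model of where each
accounting scheme has to touch its counter. The helper names below are
invented for this sketch; only the function names in the comments
(cache_grow_end, drain_freelist, free_block, fixup_slab_list) refer to
the real mm/slab.c code discussed above.

#include <stdio.h>

struct node_counters {
        unsigned long num_slabs;        /* this patch: count every slab */
        unsigned long full_slabs;       /* alternative: count only full slabs */
};

/* cache_grow_end(): a freshly allocated slab is attached to the node. */
static void slab_attached(struct node_counters *n)
{
        n->num_slabs++;
}

/* drain_freelist()/free_block(): an empty slab is detached for freeing. */
static void slab_detached(struct node_counters *n)
{
        n->num_slabs--;
}

/* fixup_slab_list(): an allocation uses up the last free object. */
static void slab_became_full(struct node_counters *n)
{
        n->full_slabs++;
}

/* free_block(): freeing an object from a full slab makes it partial. */
static void slab_became_partial(struct node_counters *n)
{
        n->full_slabs--;
}

int main(void)
{
        struct node_counters n = { 0, 0 };

        /* One slab's lifecycle: grow, fill up, partially free, drain. */
        slab_attached(&n);
        slab_became_full(&n);
        slab_became_partial(&n);
        slab_detached(&n);

        printf("num_slabs %lu, full_slabs %lu\n", n.num_slabs, n.full_slabs);
        return 0;
}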

I will update slab_out_of_memory and remove the slabs_full list 
traversal there too.

Thanks,
Aruna


* Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
From: Michal Hocko @ 2016-08-18 11:52 UTC (permalink / raw)
  To: Aruna Ramakrishna
  Cc: linux-mm, linux-kernel, Mike Kravetz, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton

On Wed 17-08-16 11:20:50, Aruna Ramakrishna wrote:
> On large systems, when some slab caches grow to millions of objects (and
> many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
> During this time, interrupts are disabled while walking the slab lists
> (slabs_full, slabs_partial, and slabs_free) for each node, and this
> sometimes causes timeouts in other drivers (for instance, Infiniband).
> 
> This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
> total number of allocated slabs per node, per cache. This counter is
> updated when a slab is created or destroyed. This enables us to skip
> traversing the slabs_full list while gathering slabinfo statistics, and
> since slabs_full tends to be the biggest list when the cache is large, it
> results in a dramatic performance improvement. Getting slabinfo statistics
> now only requires walking the slabs_free and slabs_partial lists, and
> those lists are usually much smaller than slabs_full. We tested this after
> growing the dentry cache to 70GB, and the performance improved from 2s to
> 5ms.

I am not opposing the patch (to be honest it is quite neat) but this
is buggering me for quite some time. Sorry for hijacking this email
thread but I couldn't resist. Why are we trying to optimize SLAB and
slowly converge it to SLUB feature-wise. I always thought that SLAB
should remain stable and time challenged solution which works reasonably
well for many/most workloads, while SLUB is an optimized implementation
which experiment with slightly different concepts that might boost the
performance considerably but might also surprise from time to time. If
this is not the case then why do we have both of them in the kernel. It
is a lot of code and some features need tweaking both while only one
gets testing coverage. So this is mainly a question for maintainers. Why
do we maintain both and what is the purpose of them.

> Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> Note: this has been tested only on x86_64.
> 
>  mm/slab.c | 26 +++++++++++++++++---------
>  mm/slab.h |  1 +
>  2 files changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index b672710..3da34fe 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
>  	spin_lock_init(&parent->list_lock);
>  	parent->free_objects = 0;
>  	parent->free_touched = 0;
> +	parent->num_slabs = 0;
>  }
>  
>  #define MAKE_LIST(cachep, listp, slab, nodeid)				\
> @@ -2326,6 +2327,7 @@ static int drain_freelist(struct kmem_cache *cache,
>  
>  		page = list_entry(p, struct page, lru);
>  		list_del(&page->lru);
> +		n->num_slabs--;
>  		/*
>  		 * Safe to drop the lock. The slab is no longer linked
>  		 * to the cache.
> @@ -2764,6 +2766,8 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
>  		list_add_tail(&page->lru, &(n->slabs_free));
>  	else
>  		fixup_slab_list(cachep, n, page, &list);
> +
> +	n->num_slabs++;
>  	STATS_INC_GROWN(cachep);
>  	n->free_objects += cachep->num - page->active;
>  	spin_unlock(&n->list_lock);
> @@ -3455,6 +3459,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
>  
>  		page = list_last_entry(&n->slabs_free, struct page, lru);
>  		list_move(&page->lru, list);
> +		n->num_slabs--;
>  	}
>  }
>  
> @@ -4111,6 +4116,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
>  	unsigned long num_objs;
>  	unsigned long active_slabs = 0;
>  	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
> +	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
> +	unsigned long num_slabs_full = 0;
>  	const char *name;
>  	char *error = NULL;
>  	int node;
> @@ -4123,33 +4130,34 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
>  		check_irq_on();
>  		spin_lock_irq(&n->list_lock);
>  
> -		list_for_each_entry(page, &n->slabs_full, lru) {
> -			if (page->active != cachep->num && !error)
> -				error = "slabs_full accounting error";
> -			active_objs += cachep->num;
> -			active_slabs++;
> -		}
> +		num_slabs += n->num_slabs;
> +
>  		list_for_each_entry(page, &n->slabs_partial, lru) {
>  			if (page->active == cachep->num && !error)
>  				error = "slabs_partial accounting error";
>  			if (!page->active && !error)
>  				error = "slabs_partial accounting error";
>  			active_objs += page->active;
> -			active_slabs++;
> +			num_slabs_partial++;
>  		}
> +
>  		list_for_each_entry(page, &n->slabs_free, lru) {
>  			if (page->active && !error)
>  				error = "slabs_free accounting error";
> -			num_slabs++;
> +			num_slabs_free++;
>  		}
> +
>  		free_objects += n->free_objects;
>  		if (n->shared)
>  			shared_avail += n->shared->avail;
>  
>  		spin_unlock_irq(&n->list_lock);
>  	}
> -	num_slabs += active_slabs;
>  	num_objs = num_slabs * cachep->num;
> +	active_slabs = num_slabs - num_slabs_free;
> +	num_slabs_full = num_slabs - (num_slabs_partial + num_slabs_free);
> +	active_objs += (num_slabs_full * cachep->num);
> +
>  	if (num_objs - active_objs != free_objects && !error)
>  		error = "free_objects accounting error";
>  
> diff --git a/mm/slab.h b/mm/slab.h
> index 9653f2e..bc05fdc 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -432,6 +432,7 @@ struct kmem_cache_node {
>  	struct list_head slabs_partial;	/* partial list first, better asm code */
>  	struct list_head slabs_full;
>  	struct list_head slabs_free;
> +	unsigned long num_slabs;
>  	unsigned long free_objects;
>  	unsigned int free_limit;
>  	unsigned int colour_next;	/* Per-node cache coloring */
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
From: aruna.ramakrishna @ 2016-08-19  5:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Mike Kravetz, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton

On 08/18/2016 04:52 AM, Michal Hocko wrote:
> I am not opposing the patch (to be honest it is quite neat) but this
> is buggering me for quite some time. Sorry for hijacking this email
> thread but I couldn't resist. Why are we trying to optimize SLAB and
> slowly converge it to SLUB feature-wise. I always thought that SLAB
> should remain stable and time challenged solution which works reasonably
> well for many/most workloads, while SLUB is an optimized implementation
> which experiment with slightly different concepts that might boost the
> performance considerably but might also surprise from time to time. If
> this is not the case then why do we have both of them in the kernel. It
> is a lot of code and some features need tweaking both while only one
> gets testing coverage. So this is mainly a question for maintainers. Why
> do we maintain both and what is the purpose of them.

Michal,

Speaking about this patch specifically - I'm not trying to optimize SLAB 
or make it more similar to SLUB. This patch is a bug fix for an issue 
where the slowness of 'cat /proc/slabinfo' caused timeouts in other 
drivers. While optimizing that flow, it became apparent (as Christoph 
pointed out) that one could converge this patch to SLUB's current 
implementation. Though I have not done that in this patch (because that 
warrants a separate patch), I think it makes sense to converge where 
appropriate, since they both do share some common data structures and 
code already.

Thanks,
Aruna


* Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
From: Joonsoo Kim @ 2016-08-23  2:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Aruna Ramakrishna, linux-mm, linux-kernel, Mike Kravetz,
	Christoph Lameter, Pekka Enberg, David Rientjes, Andrew Morton

On Thu, Aug 18, 2016 at 01:52:19PM +0200, Michal Hocko wrote:
> On Wed 17-08-16 11:20:50, Aruna Ramakrishna wrote:
> > On large systems, when some slab caches grow to millions of objects (and
> > many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
> > During this time, interrupts are disabled while walking the slab lists
> > (slabs_full, slabs_partial, and slabs_free) for each node, and this
> > sometimes causes timeouts in other drivers (for instance, Infiniband).
> > 
> > This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
> > total number of allocated slabs per node, per cache. This counter is
> > updated when a slab is created or destroyed. This enables us to skip
> > traversing the slabs_full list while gathering slabinfo statistics, and
> > since slabs_full tends to be the biggest list when the cache is large, it
> > results in a dramatic performance improvement. Getting slabinfo statistics
> > now only requires walking the slabs_free and slabs_partial lists, and
> > those lists are usually much smaller than slabs_full. We tested this after
> > growing the dentry cache to 70GB, and the performance improved from 2s to
> > 5ms.
> 
> I am not opposing the patch (to be honest it is quite neat) but this
> is buggering me for quite some time. Sorry for hijacking this email
> thread but I couldn't resist. Why are we trying to optimize SLAB and
> slowly converge it to SLUB feature-wise. I always thought that SLAB
> should remain stable and time challenged solution which works reasonably
> well for many/most workloads, while SLUB is an optimized implementation
> which experiment with slightly different concepts that might boost the
> performance considerably but might also surprise from time to time. If
> this is not the case then why do we have both of them in the kernel. It
> is a lot of code and some features need tweaking both while only one
> gets testing coverage. So this is mainly a question for maintainers. Why
> do we maintain both and what is the purpose of them.

I don't know the full history since I joined the kernel community
relatively recently. Christoph would be a better candidate for this
topic. Anyway,

AFAIK, the first plan, at the time SLUB was introduced, was to remove
SLAB if SLUB beat it completely. But there are fundamental differences
in implementation detail, so neither can beat the other for all
workloads. It is similar to the filesystem case, where various
filesystems exist for their own workloads.

Then the second plan was started: commonizing the code as much as
possible so that new features can be developed and the code maintained
easily. The code is going in this direction, although slowly. If it is
achieved, we don't need to worry about maintenance overhead.

Anyway, we cannot remove either one without regressions, which is why
neither has been removed so far. Given that, there is no point in
stopping improvements to either of them.

Thanks.


* what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Michal Hocko @ 2016-08-23 15:38 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Aruna Ramakrishna, linux-mm, linux-kernel, Mike Kravetz,
	Christoph Lameter, Pekka Enberg, David Rientjes, Andrew Morton,
	Mel Gorman, Jiri Slaby

On Tue 23-08-16 11:13:03, Joonsoo Kim wrote:
> On Thu, Aug 18, 2016 at 01:52:19PM +0200, Michal Hocko wrote:
[...]
> > I am not opposing the patch (to be honest it is quite neat) but this
> > is buggering me for quite some time. Sorry for hijacking this email
> > thread but I couldn't resist. Why are we trying to optimize SLAB and
> > slowly converge it to SLUB feature-wise. I always thought that SLAB
> > should remain stable and time challenged solution which works reasonably
> > well for many/most workloads, while SLUB is an optimized implementation
> > which experiment with slightly different concepts that might boost the
> > performance considerably but might also surprise from time to time. If
> > this is not the case then why do we have both of them in the kernel. It
> > is a lot of code and some features need tweaking both while only one
> > gets testing coverage. So this is mainly a question for maintainers. Why
> > do we maintain both and what is the purpose of them.
> 
> I don't know full history about it since I joined kernel communitiy
> recently(?). Christoph would be a better candidate for this topic.
> Anyway,
> 
> SLAB if SLUB beats SLAB completely. But, there are fundamental
> differences in implementation detail so they cannot beat each other
> for all the workloads. It is similar with filesystem case that various
> filesystems exist for it's own workload.

Do we have any documentation/study about which particular workloads
benefit from which allocator? It seems that most users will use whatever
the default is or whatever their distribution uses. E.g. the SLES kernel
uses SLAB because this is what we used to have for ages and there was no
strong reason to change that default. From such a perspective, having a
stable allocator with minimum changes - just bug fixes - makes a lot of
sense. I remember Mel doing some benchmarks when "why opensuse kernels
do not use the default SLUB allocator" came up the last time, and he
didn't see any large winner there:
https://lists.opensuse.org/opensuse-kernel/2015-08/msg00098.html
This set of workloads is of course not comprehensive enough to rule out
one or the other, but I am wondering whether there are still any
pathological workloads where we really want to keep SLAB or add new
features to it.

> Then, second plan was started. It is commonizing the code as much
> as possible to develope new feature and maintain the code easily. The
> code goes this direction, although it is slow. If it is achieved, we
> don't need to worry about maintanance overhead.

I fully agree, commonizing the code base makes perfect sense. If a
feature can be made independent of the underlying implementation then I
am all for adding it, but AFAIR kmemcg and kmemleak both need to touch
quite deep internals and that brings the risk of introducing new bugs
which would be SL[AU]B specific. I remember Jiri Slaby fighting kmemleak
false positives recently with SLAB which were not present in SLUB, for
example.

> Anyway, we cannot remove one without regression so we don't remove one
> until now. In this case, there is no point to stop improving one.

I can completely see the reason to not drop SLAB (and I am not suggesting
that) but I would expect that SLAB would be more in a feature freeze
state. Or if both of them need to evolve then at least describe which
workloads pathologically benefit/suffer from one or the other.

-- 
Michal Hocko
SUSE Labs


* Re: what is the purpose of SLAB and SLUB
From: Andi Kleen @ 2016-08-23 15:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Aruna Ramakrishna, linux-mm, linux-kernel,
	Mike Kravetz, Christoph Lameter, Pekka Enberg, David Rientjes,
	Andrew Morton, Mel Gorman, Jiri Slaby

Michal Hocko <mhocko@kernel.org> writes:
>
>> Anyway, we cannot remove one without regression so we don't remove one
>> until now. In this case, there is no point to stop improving one.
>
> I can completely see the reason to not drop SLAB (and I am not suggesting
> that) but I would expect that SLAB would be more in a feature freeze
> state. Or if both of them need to evolve then at least describe which
> workloads pathologically benefit/suffer from one or the other.

Why would you stop someone from working on SLAB if they want to?

Forcibly enforcing a freeze on something can make sense if you're
in charge of a team to conserve resources, but in Linux the situation is
very different.

Everyone works on what they (or their employer wants), not what
someone else wants. So if they want slab that is what they do.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Joonsoo Kim @ 2016-08-24  1:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Aruna Ramakrishna, linux-mm, linux-kernel, Mike Kravetz,
	Christoph Lameter, Pekka Enberg, David Rientjes, Andrew Morton,
	Mel Gorman, Jiri Slaby

On Tue, Aug 23, 2016 at 05:38:08PM +0200, Michal Hocko wrote:
> On Tue 23-08-16 11:13:03, Joonsoo Kim wrote:
> > On Thu, Aug 18, 2016 at 01:52:19PM +0200, Michal Hocko wrote:
> [...]
> > > I am not opposing the patch (to be honest it is quite neat) but this
> > > is buggering me for quite some time. Sorry for hijacking this email
> > > thread but I couldn't resist. Why are we trying to optimize SLAB and
> > > slowly converge it to SLUB feature-wise. I always thought that SLAB
> > > should remain stable and time challenged solution which works reasonably
> > > well for many/most workloads, while SLUB is an optimized implementation
> > > which experiment with slightly different concepts that might boost the
> > > performance considerably but might also surprise from time to time. If
> > > this is not the case then why do we have both of them in the kernel. It
> > > is a lot of code and some features need tweaking both while only one
> > > gets testing coverage. So this is mainly a question for maintainers. Why
> > > do we maintain both and what is the purpose of them.
> > 
> > I don't know full history about it since I joined kernel communitiy
> > recently(?). Christoph would be a better candidate for this topic.
> > Anyway,
> > 
> > SLAB if SLUB beats SLAB completely. But, there are fundamental
> > differences in implementation detail so they cannot beat each other
> > for all the workloads. It is similar with filesystem case that various
> > filesystems exist for it's own workload.
> 
> Do we have any documentation/study about which particular workloads
> benefit from which allocator? It seems that most users will use whatever
> the default or what their distribution uses. E.g. SLES kernel use SLAB
> because this is what we used to have for ages and there was no strong
> reason to change that default. From such a perspective having a stable
> allocator with minimum changes - just bug fixes - makes a lot of sense.

It doesn't make sense to me. Even if someone uses SLAB for conventional
reasons, they would still want to use shiny new features and get
performance improvements.

And that is not the only reason to use SLAB. There could be many
different reasons to use SLAB.

> I remember Mel doing some benchmarks when "why opensuse kernels do not
> use the default SLUB allocator" came the last time and he didn't see any
> large winner there
> https://lists.opensuse.org/opensuse-kernel/2015-08/msg00098.html
> This set of workloads is of course not comprehensive to rule one or
> other but I am wondering whether there are still any pathological
> workloads where we really want to keep SLAB or add new features to it.

AFAIK, some network benchmark still shows regression in SLUB.

http://lkml.kernel.org/r/20150907113026.5bb28ca3@redhat.com

> > Then, second plan was started. It is commonizing the code as much
> > as possible to develope new feature and maintain the code easily. The
> > code goes this direction, although it is slow. If it is achieved, we
> > don't need to worry about maintanance overhead.
> 
> I fully agree, commonizing the code base makes perfect sense. If a
> feature can be made independent on the underlying implementation then I
> am all for adding it but AFAIR kmemcg or kmemleak both need to touch
> quite deep internals and that brings risk for introducing new bugs which
> would be SL[AU]B specific. I remember Jiri Slaby was fighting a kmemlead
> false positives recently with SLAB which were not present in SLUB for
> example.

I guess that once the commonizing work is done, there is little left to
do that is allocator specific.

Thanks.

> > Anyway, we cannot remove one without regression so we don't remove one
> > until now. In this case, there is no point to stop improving one.
> 
> I can completely see the reason to not drop SLAB (and I am not suggesting
> that) but I would expect that SLAB would be more in a feature freeze
> state. Or if both of them need to evolve then at least describe which
> workloads pathologically benefit/suffer from one or the other.
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Michal Hocko @ 2016-08-24  8:05 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Aruna Ramakrishna, linux-mm, linux-kernel, Mike Kravetz,
	Christoph Lameter, Pekka Enberg, David Rientjes, Andrew Morton,
	Mel Gorman, Jiri Slaby

On Wed 24-08-16 10:15:02, Joonsoo Kim wrote:
> On Tue, Aug 23, 2016 at 05:38:08PM +0200, Michal Hocko wrote:
> > On Tue 23-08-16 11:13:03, Joonsoo Kim wrote:
> > > On Thu, Aug 18, 2016 at 01:52:19PM +0200, Michal Hocko wrote:
> > [...]
> > > > I am not opposing the patch (to be honest it is quite neat) but this
> > > > is buggering me for quite some time. Sorry for hijacking this email
> > > > thread but I couldn't resist. Why are we trying to optimize SLAB and
> > > > slowly converge it to SLUB feature-wise. I always thought that SLAB
> > > > should remain stable and time challenged solution which works reasonably
> > > > well for many/most workloads, while SLUB is an optimized implementation
> > > > which experiment with slightly different concepts that might boost the
> > > > performance considerably but might also surprise from time to time. If
> > > > this is not the case then why do we have both of them in the kernel. It
> > > > is a lot of code and some features need tweaking both while only one
> > > > gets testing coverage. So this is mainly a question for maintainers. Why
> > > > do we maintain both and what is the purpose of them.
> > > 
> > > I don't know full history about it since I joined kernel communitiy
> > > recently(?). Christoph would be a better candidate for this topic.
> > > Anyway,
> > > 
> > > SLAB if SLUB beats SLAB completely. But, there are fundamental
> > > differences in implementation detail so they cannot beat each other
> > > for all the workloads. It is similar with filesystem case that various
> > > filesystems exist for it's own workload.
> > 
> > Do we have any documentation/study about which particular workloads
> > benefit from which allocator? It seems that most users will use whatever
> > the default or what their distribution uses. E.g. SLES kernel use SLAB
> > because this is what we used to have for ages and there was no strong
> > reason to change that default. From such a perspective having a stable
> > allocator with minimum changes - just bug fixes - makes a lot of sense.
> 
> It doesn't make sense to me. Even if someone uses SLAB due to
> conventional reason, they would want to use shiny new feature and get
> performance improvement.
> 
> And, it is not only reason to use SLAB. There would be many different
> reasons to use SLAB.

Could you be more specific please? Are there any inherent problems that
would make one allocator unsuitable for specific workloads?

> > I remember Mel doing some benchmarks when "why opensuse kernels do not
> > use the default SLUB allocator" came the last time and he didn't see any
> > large winner there
> > https://lists.opensuse.org/opensuse-kernel/2015-08/msg00098.html
> > This set of workloads is of course not comprehensive to rule one or
> > other but I am wondering whether there are still any pathological
> > workloads where we really want to keep SLAB or add new features to it.
> 
> AFAIK, some network benchmark still shows regression in SLUB.
> 
> http://lkml.kernel.org/r/20150907113026.5bb28ca3@redhat.com

That suggests that this is not an inherent problem of SLUB though.

-- 
Michal Hocko
SUSE Labs


* Re: what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Mel Gorman @ 2016-08-24  8:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Aruna Ramakrishna, linux-mm, linux-kernel,
	Mike Kravetz, Christoph Lameter, Pekka Enberg, David Rientjes,
	Andrew Morton, Jiri Slaby

On Tue, Aug 23, 2016 at 05:38:08PM +0200, Michal Hocko wrote:
> Do we have any documentation/study about which particular workloads
> benefit from which allocator? It seems that most users will use whatever
> the default or what their distribution uses. E.g. SLES kernel use SLAB
> because this is what we used to have for ages and there was no strong
> reason to change that default.

Yes, with the downside that a reliance on high-order allocations
contended on the zone lock, which would not scale and could degrade over
time. If there had been multiple compelling reasons then it would have
been an easier switch.

I did prototype high-order pcp caching up to PAGE_ALLOC_COSTLY_ORDER
but it pushed the size of per_cpu_pages over a cache line which could
be problematic in itself. I never finished off the work as fixing the
allocator for SLUB was not a priority. The prototype no longer applies as
it conflicts with the removal of the fair zone allocation policy.

If/when I get back to the page allocator, the priority would be a bulk
API for faster allocs of batches of order-0 pages instead of allocating
a large page and splitting.
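
To make the shape of such an interface concrete, a hedged sketch
follows. The name alloc_pages_bulk_sketch() is hypothetical, and the
naive loop over alloc_page() only shows the calling convention - the
point of a real bulk API would be to refill the whole batch under a
single lock acquisition rather than taking it once per page.

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm_types.h>

/*
 * Hypothetical bulk allocator: try to put @count order-0 pages on @list
 * and return how many were actually allocated.  A real implementation
 * would fill the batch from the per-cpu lists in one go; this reference
 * loop exists only to illustrate the interface.
 */
static unsigned int alloc_pages_bulk_sketch(gfp_t gfp, unsigned int count,
                                            struct list_head *list)
{
        unsigned int allocated;

        for (allocated = 0; allocated < count; allocated++) {
                struct page *page = alloc_page(gfp);

                if (!page)
                        break;
                list_add_tail(&page->lru, list);
        }
        return allocated;
}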

-- 
Mel Gorman
SUSE Labs


* Re: what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Christoph Lameter @ 2016-08-25  4:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Jiri Slaby

On Wed, 24 Aug 2016, Mel Gorman wrote:
> If/when I get back to the page allocator, the priority would be a bulk
> API for faster allocs of batches of order-0 pages instead of allocating
> a large page and splitting.
>

OMG. Do we really want to continue this? There are billions of Linux
devices out there that require a reboot at least once a week. This is now
standard with certain Android phones. In our company we reboot all
machines every week because fragmentation degrades performance
significantly. We need to finally face up to it and deal with the issue
instead of continuing to produce more half ass-ed solutions.

Managing memory in 4K chunks is not reasonable if you have
machines with terabytes of memory and thus billions of individual page
structs to manage. I/O devices are throttling because they cannot manage
so much meta data and we get grotesque devices.

The kernel needs an effective way to handle large contiguous memory. It
needs the ability to do effective defragmentation for that. And the way
forward has been clear for a while now: all objects must be either
movable or reclaimable, so that things can be moved to allow contiguity
to be restored.


We have support for that for the page cache and interestingly enough for
CMA now. So this is gradually developing because it is necessary. We need
to go with that and provide a full fledged implementation in the kernel
that allows effective handling of large objects in the page allocator and
we need general logic in the kernel for effective handling of large
sized chunks of memory.

Let's stop churning tiny 4k segments in a world where even our cell
phones have capacities measured in gigabytes, which already means
millions of 4k objects whose one-by-one management is a drag on
performance and makes operating system coding extremely complex. The
core of Linux must support larger units for a future in which we will
see even larger memory capacities.


* Re: what is the purpose of SLAB and SLUB
From: Christoph Lameter @ 2016-08-25  4:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Michal Hocko, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Mel Gorman, Jiri Slaby

On Tue, 23 Aug 2016, Andi Kleen wrote:

> Why would you stop someone from working on SLAB if they want to?
>
> Forcibly enforcing a freeze on something can make sense if you're
> in charge of a team to conserve resources, but in Linux the situation is
> very different.

I agree and frankly having multiple allocators is something good.
Features that are good in one are copied to the other and enhanced in the
process. I think this has driven code development quite a bit.

Every allocator has a different basic approach to storage layout and
synchronization, which determines performance in various usage
scenarios. The competition - seeing whether a developer who is a fan of
one allocator can find a way to make performance better or storage use
more effective in a situation where the other shows better numbers - is
good.

There may be more creative ways of laying out storage in the future, and
I would like to have the flexibility in the kernel to explore those if
necessary with additional variations.

The more common code we can isolate the easier it will become to just try
out a new layout and a new form of serialization to see if it provides
advantages.


* Re: what is the purpose of SLAB and SLUB
From: Michal Hocko @ 2016-08-25  7:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Mel Gorman, Jiri Slaby

On Wed 24-08-16 23:10:03, Christoph Lameter wrote:
> On Tue, 23 Aug 2016, Andi Kleen wrote:
> 
> > Why would you stop someone from working on SLAB if they want to?
> >
> > Forcibly enforcing a freeze on something can make sense if you're
> > in charge of a team to conserve resources, but in Linux the situation is
> > very different.
> 
> I agree and frankly having multiple allocators is something good.
> Features that are good in one are copied to the other and enhanced in the
> process. I think this has driven code development quite a bit.
> 
> Every allocator has a different basic approach to storage layout and
> synchronization which determines performance in various usage scenarios.
> The competition of seeing if the developer that is a fan of one can come
> up with a way to make performance better or storage use more effective in
> a situation where another shows better numbers is good.

I can completely see how having multiple allocators (schedulers etc...)
can be good as a playground. But how are users supposed to choose when
we do not help them with any documentation? Most benchmarks which are
referred to (e.g. SLUB doesn't work so well with the networking
workloads) might be really outdated and that just feeds the cargo cult.
Look, I am not suggesting removing SLAB (or SLUB); I am just really
looking to understand their objectives and which users they target.
Because as of now, most users are using whatever is the default (SLUB
for some and never documented reason) or what their distributions come
up with. This means that we have quite a lot of code which only few
people understand deeply. Some features which are added on top need much
more testing to cover both allocators or we are risking subtle
regressions.

> There may be more creative ways of coming up with new ways of laying out
> storage in the future and I would like to have the flexibility in the
> kernel to explore those if necessary with additional variations.

Flexibility is always good but there comes a maintenance burden. Both
should be weighed properly.

> The more common code we can isolate the easier it will become to just try
> out a new layout and a new form of serialization to see if it provides
> advantages.

Sure, but even after attempts to make some code common we are still at
$ wc -l mm/slab.c mm/slub.c 
	4479 mm/slab.c
	5727 mm/slub.c
	10206 total

quite a lot, don't you think?

-- 
Michal Hocko
SUSE Labs


* Re: what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Mel Gorman @ 2016-08-25 10:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Michal Hocko, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Jiri Slaby

On Wed, Aug 24, 2016 at 11:01:43PM -0500, Christoph Lameter wrote:
> On Wed, 24 Aug 2016, Mel Gorman wrote:
> > If/when I get back to the page allocator, the priority would be a bulk
> > API for faster allocs of batches of order-0 pages instead of allocating
> > a large page and splitting.
> >
> 
> OMG. Do we really want to continue this? There are billions of Linux
> devices out there that require a reboot at least once a week. This is now
> standard with certain Android phones. In our company we reboot all
> machines every week because fragmentation degrades performance
> significantly. We need to finally face up to it and deal with the issue
> instead of continuing to produce more half ass-ed solutions.
> 

Flipping the lid aside, there will always be a need for fast management
of 4K pages. The primary use case is networking that sometimes uses
high-order pages to avoid allocator overhead and amortise DMA setup.
Userspace-mapped pages will always be 4K although fault-around may benefit
from bulk allocating the pages. That is relatively low hanging fruit that
would take a few weeks given a free schedule.

Dirty tracking of pages on a 4K boundary will always be required to avoid IO
multiplier effects that cannot be side-stepped by increasing the fundamental
unit of allocation.

Batching of tree_lock during reclaim for large files and swapping is also
relatively low hanging fruit that also is doable in a week or two.

A high-order per-cpu cache for SLUB to reduce zone->lock contention is
also relatively low hanging fruit with the caveat it makes per_cpu_pages
larger than a cache line.

If you want to rework the VM to use a larger fundamental unit, track
sub-units where required and deal with the internal fragmentation issues
then by all means go ahead and deal with it.
 
-- 
Mel Gorman
SUSE Labs


* Re: what is the purpose of SLAB and SLUB
From: Christoph Lameter @ 2016-08-25 19:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andi Kleen, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Mel Gorman, Jiri Slaby

On Thu, 25 Aug 2016, Michal Hocko wrote:
> I can completely see how having multiple allocators (schedulers etc...)
> Because as of now, most users are using whatever is the default (SLUB
> for some and never documented reason) or what their distributions come
> up with. This means that we have quite a lot of code which only few
> people understand deeply. Some features which are added on top need much
> more testing to cover both allocators or we are risking subtle
> regressions.

I think the default is clear and advisable to use. The debugging features
in SLAB f.e. are problematic and I have had to ask at times to retry with
SLUB in order to find a subtle issue.

I think the main activity nowadays is to make SLAB competitive by adopting
methods from SLUB. Maybe that will work. But then concepts from SLAB can
also be used in SLUB and enhance speed there.

> Flexibility is always good but there comes a maintenance burden. Both
> should be weighed properly.

Well I thought we had that under control. SLAB is a legacy issue in many
ways and people are used to the problems with debuggability if they still
use that. There is always the simple way to just switch to SLUB
temporarily in order to find issues.


> Sure, but even after attempts to make some code common we are still at
> $ wc -l mm/slab.c mm/slub.c
> 	4479 mm/slab.c
> 	5727 mm/slub.c
> 	10206 total
>
> quite a lot, don't you think?

Well the code is always growing since features are being added like
cgroups support and the batch allocation/freeing that is used to improve
the network performance. I think this is actually quite reasonable
compared with other parts of our kernel.


* Re: what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Christoph Lameter @ 2016-08-25 19:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Jiri Slaby

On Thu, 25 Aug 2016, Mel Gorman wrote:

> Flipping the lid aside, there will always be a need for fast management
> of 4K pages. The primary use case is networking that sometimes uses
> high-order pages to avoid allocator overhead and amortise DMA setup.
> Userspace-mapped pages will always be 4K although fault-around may benefit
> from bulk allocating the pages. That is relatively low hanging fruit that
> would take a few weeks given a free schedule.

Userspace mapped pages can be hugepages as well as giant pages and that
has been there for a long time. Intermediate sizes would be useful too in
order to avoid having to keep lists of 4k pages around and continually
scan them.

> Dirty tracking of pages on a 4K boundary will always be required to avoid IO
> multiplier effects that cannot be side-stepped by increasing the fundamental
> unit of allocation.

Huge pages cannot be dirtied? This is an issue of hardware support. On
x86 you only have one size. I am pretty sure that even Intel would
support other sizes if needed. The case has been repeatedly made that 64k
pages f.e. would be useful to have on x86.


> Batching of tree_lock during reclaim for large files and swapping is also
> relatively low hanging fruit that also is doable in a week or two.

Ok, these are good incremental improvements, but they do not address the
main issue going forward.

> A high-order per-cpu cache for SLUB to reduce zone->lock contention is
> also relatively low hanging fruit with the caveat it makes per_cpu_pages
> larger than a cache line.

Would be great to have.

> If you want to rework the VM to use a larger fundamental unit, track
> sub-units where required and deal with the internal fragmentation issues
> then by all means go ahead and deal with it.

Hmmm... The time problem is always there. Tried various approaches over
the last decade. Could be a massive project. We really would need a
larger group of developers to effectively do this.


* Re: what is the purpose of SLAB and SLUB
From: Andi Kleen @ 2016-08-26 20:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Michal Hocko, Joonsoo Kim, Aruna Ramakrishna,
	linux-mm, linux-kernel, Mike Kravetz, Pekka Enberg,
	David Rientjes, Andrew Morton, Jiri Slaby

Christoph Lameter <cl@linux.com> writes:
>
>> If you want to rework the VM to use a larger fundamental unit, track
>> sub-units where required and deal with the internal fragmentation issues
>> then by all means go ahead and deal with it.
>
> Hmmm... The time problem is always there. Tried various approaches over
> the last decade. Could be a massive project. We really would need a
> larger group of developers to effectively do this.

I'm surprised that compaction is not able to fix the fragmentation.
Is the problem that there are too many non-movable objects around?

-Andi


* Re: what is the purpose of SLAB and SLUB
From: Michal Hocko @ 2016-08-29 13:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Mel Gorman, Joonsoo Kim, Aruna Ramakrishna,
	linux-mm, linux-kernel, Mike Kravetz, Pekka Enberg,
	David Rientjes, Andrew Morton, Jiri Slaby

On Fri 26-08-16 13:47:47, Andi Kleen wrote:
> Christoph Lameter <cl@linux.com> writes:
> >
> >> If you want to rework the VM to use a larger fundamental unit, track
> >> sub-units where required and deal with the internal fragmentation issues
> >> then by all means go ahead and deal with it.
> >
> > Hmmm... The time problem is always there. Tried various approaches over
> > the last decade. Could be a massive project. We really would need a
> > larger group of developers to effectively do this.
> 
> I'm surprised that compactions is not able to fix the fragmentation.
> Is the problem that there are too many non movable objects around?

Compaction can certainly help and the more proactive we are in that
direction the better. Vlastimil has already done a first step in that
direction and we have a dedicated kcompactd kernel thread for that
purpose. But I guess what Mel had in mind is the latency of higher-order
pages, which is inherently higher with the current page allocator no
matter how well the compaction works. There are other changes, mostly
for the fast path, needed to make higher-order pages less of a
second-class citizen.
-- 
Michal Hocko
SUSE Labs


* Re: what is the purpose of SLAB and SLUB
From: Christoph Lameter @ 2016-08-29 14:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andi Kleen, Mel Gorman, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Jiri Slaby

On Mon, 29 Aug 2016, Michal Hocko wrote:

> Compaction can certainly help and the more we are proactive in that
> direction the better. Vlastimil has already done a first step in that
> direction and we a have a dedicated kcompactd kernel thread for that
> purpose. But I guess what Mel had in mind is the latency of higher
> order pages which is inherently higher with the current page allocator
> no matter how well the compaction works. There are other changes, mostly
> for the fast path, needed to make higher order pages less of a second
> citizen.

Compaction needs to be able to move many more types of kernel objects
out of the way. I think if the callbacks that were merged for the
migration of CMA pages are made usable for slab allocations then we may
make some progress there. This would require the creator of a slab cache
to specify functions that allow the migration of an object, which would
require additional subsystem-specific code. But doing that for inodes
and dentries could be very beneficial for compaction.
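
Purely as an illustration of that idea, a hypothetical sketch of
per-cache migration hooks is below; neither the structure nor the
registration helper mentioned in the comment exists in mainline, and the
names are invented here.

#include <linux/slab.h>

/*
 * Hypothetical per-cache migration hooks, supplied by the cache creator
 * (e.g. the dcache for dentries), so that compaction/defragmentation
 * could empty sparsely used slabs by relocating their live objects.
 */
struct kmem_cache_migrate_ops {
        /* Pin the @nr objects in @objs so they cannot vanish while moving. */
        void *(*isolate)(struct kmem_cache *s, void **objs, int nr);
        /* Re-home the isolated objects (copy and fix up references). */
        void (*migrate)(struct kmem_cache *s, void **objs, int nr,
                        int node, void *private);
};

/*
 * A subsystem would register its ops at cache creation time, roughly:
 *
 *      kmem_cache_set_migrate_ops(dentry_cache, &dentry_migrate_ops);
 *
 * where kmem_cache_set_migrate_ops() is, again, a hypothetical helper.
 */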


* Re: what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Mel Gorman @ 2016-08-30  9:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Michal Hocko, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Jiri Slaby

On Thu, Aug 25, 2016 at 02:55:43PM -0500, Christoph Lameter wrote:
> On Thu, 25 Aug 2016, Mel Gorman wrote:
> 
> > Flipping the lid aside, there will always be a need for fast management
> > of 4K pages. The primary use case is networking that sometimes uses
> > high-order pages to avoid allocator overhead and amortise DMA setup.
> > Userspace-mapped pages will always be 4K although fault-around may benefit
> > from bulk allocating the pages. That is relatively low hanging fruit that
> > would take a few weeks given a free schedule.
> 
> Userspace mapped pages can be hugepages as well as giant pages and that
> has been there for a long time. Intermediate sizes would be useful too in
> order to avoid having to keep lists of 4k pages around and continually
> scan them.
> 

Userspace pages cannot always be mapped as huge or giant. mprotect on a
4K boundary is an obvious example.

> > Dirty tracking of pages on a 4K boundary will always be required to avoid IO
> > multiplier effects that cannot be side-stepped by increasing the fundamental
> > unit of allocation.
> 
> Huge pages cannot be dirtied?

I didn't say that, I said they are required to avoid IO multiplier
effects. If a file is mapped as 2M or 1G then even a 1 byte write requires
2M or 1G of IO to writeback.
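
For a sense of scale, assuming 4 KiB base pages: tracking dirty state at
2 MiB granularity turns a 1-byte change into 2 MiB / 4 KiB = 512 times
the writeback IO of a single dirty 4 KiB page, and at 1 GiB granularity
the multiplier is 1 GiB / 4 KiB = 262144.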

> This is an issue of hardware support. On
> x867 you only have one size. I am pretty such that even intel would
> support other sizes if needed. The case has been repeatedly made that 64k
> pages f.e. would be useful to have on x86.
> 

64K pages are not a universal win even on the arches that do support them.

-- 
Mel Gorman
SUSE Labs


* Re: what is the purpose of SLAB and SLUB (was: Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo) stats
From: Christoph Lameter @ 2016-08-30 19:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Joonsoo Kim, Aruna Ramakrishna, linux-mm,
	linux-kernel, Mike Kravetz, Pekka Enberg, David Rientjes,
	Andrew Morton, Jiri Slaby

On Tue, 30 Aug 2016, Mel Gorman wrote:

> > Userspace mapped pages can be hugepages as well as giant pages and that
> > has been there for a long time. Intermediate sizes would be useful too in
> > order to avoid having to keep lists of 4k pages around and continually
> > scan them.
> >
>
> Userspace pages cannot always be mapped as huge or giant. mprotect on a
> 4K boundary is an obvious example.

Well, if the pages are bigger then the boundaries will also be different.
The problem is that we are trying to keep the 4k illusion alive. This
causes churn in various subsystems. Implementation of a file cache
with arbitrary page order is rather straightforward. See
https://lkml.org/lkml/2007/4/19/261

There we again run into the problem of defragmentation. Avoiding decent
garbage collection in the kernel causes no end of additional trouble. I
think we need to face the issue and solve it. Then a lot of other
workarounds and complexity would no longer be necessary.

> > > Dirty tracking of pages on a 4K boundary will always be required to avoid IO
> > > multiplier effects that cannot be side-stepped by increasing the fundamental
> > > unit of allocation.
> >
> > Huge pages cannot be dirtied?
>
> I didn't say that, I said they are required to avoid IO multiplier
> effects. If a file is mapped as 2M or 1G then even a 1 byte write requires
> 2M or 1G of IO to writeback.

There are numerous use cases that I know of where this would be
acceptable. Some tuning would be required of course, like a minimum
period until writeback occurs.

> > This is an issue of hardware support. On
> > x867 you only have one size. I am pretty such that even intel would
> > support other sizes if needed. The case has been repeatedly made that 64k
> > pages f.e. would be useful to have on x86.
> >
>
> 64K pages are not a universal win even on the arches that do support them.

There are always corner cases that regress with any kernel "enhancement".
A 64k page size was a significant improvement for many of the loads when
I worked at SGI on Altix.

