* [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
@ 2014-12-10 16:30 Christoph Lameter
  2014-12-10 16:30 ` [PATCH 1/7] slub: Remove __slab_alloc code duplication Christoph Lameter
                   ` (9 more replies)
  0 siblings, 10 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:30 UTC (permalink / raw)
  To: akpm
  Cc: rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

We had to insert a preempt disable/enable pair in the fastpath a while
ago. This was mainly due to the amount of state that has to be kept
consistent when allocating from the per cpu freelist. In particular,
the page field is not covered by the this_cpu_cmpxchg used in the
fastpath to do the necessary atomic state change for fastpath
allocation and freeing.
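
For illustration, the allocation fastpath currently has roughly this
shape (a simplified sketch, not the literal mm/slub.c code):

	/*
	 * Only freelist and tid are covered by the per cpu cmpxchg;
	 * c->page is extra state that must stay consistent with them,
	 * hence the preempt disable/enable around sampling the tid.
	 */
	preempt_disable();
	c = this_cpu_ptr(s->cpu_slab);
	tid = c->tid;
	preempt_enable();

	object = c->freelist;
	page = c->page;			/* not covered by the cmpxchg below */

	if (unlikely(!object || !node_match(page, node))) {
		object = __slab_alloc(s, gfpflags, node, addr, c);
	} else if (!this_cpu_cmpxchg_double(
			s->cpu_slab->freelist, s->cpu_slab->tid,
			object, tid,
			get_freepointer_safe(s, object), next_tid(tid))) {
		goto redo;	/* raced with another operation on this cpu */
	}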

This patchset removes the need for the page field to describe the state
of the per cpu list. The freelist pointer can be used to determine the
page struct address if necessary.

However, this currently does not work for the termination value of a
list, which is NULL and therefore the same for all slab pages. If we
instead terminate a list with a valid pointer into the page that has
its lowest bit set, then every freelist pointer can be used to
determine the address of the page struct, and the page field in the
per cpu area for a slab is no longer needed. Testing for the end of
the list becomes a test of that low bit.
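
In sketch form the encoding and the page lookup amount to something
like the following (the actual helpers appear in patches 5 and 6 of
this series):

	static inline void *end_token(const void *address)
	{
		/* a pointer into the slab page, tagged with the low bit */
		return (void *)((unsigned long)address | 1);
	}

	static inline bool is_end_token(const void *freelist)
	{
		return (unsigned long)freelist & 1;
	}

	/*
	 * Every freelist pointer, including the terminator, then points
	 * into its slab page, so the page struct can be derived from it:
	 */
	page = virt_to_head_page(c->freelist);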

So one patch changes the freelist termination pointer to do just that,
a later one removes the page field, and the final patch can then remove
the preempt disable/enable.

Removing the ->page field reduces the cache footprint of the fastpath,
so hopefully overall allocator effectiveness will increase further.
Also, RT uses full preemption, which means that currently pretty
expensive code has to be inserted into the fastpath. This approach
allows the removal of that code and a corresponding performance
increase.

Compared to the RFC, a number of changes were made for V1 to avoid the
overhead of virt_to_page and page_address.

Slab benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
20%-50% in fastpath latency:

Before:

Single thread testing
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 68 cycles kfree -> 107 cycles
10000 times kmalloc(16) -> 69 cycles kfree -> 108 cycles
10000 times kmalloc(32) -> 78 cycles kfree -> 112 cycles
10000 times kmalloc(64) -> 97 cycles kfree -> 112 cycles
10000 times kmalloc(128) -> 111 cycles kfree -> 119 cycles
10000 times kmalloc(256) -> 114 cycles kfree -> 139 cycles
10000 times kmalloc(512) -> 110 cycles kfree -> 142 cycles
10000 times kmalloc(1024) -> 114 cycles kfree -> 156 cycles
10000 times kmalloc(2048) -> 155 cycles kfree -> 174 cycles
10000 times kmalloc(4096) -> 203 cycles kfree -> 209 cycles
10000 times kmalloc(8192) -> 361 cycles kfree -> 265 cycles
10000 times kmalloc(16384) -> 597 cycles kfree -> 286 cycles

2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 114 cycles
10000 times kmalloc(16)/kfree -> 115 cycles
10000 times kmalloc(32)/kfree -> 117 cycles
10000 times kmalloc(64)/kfree -> 115 cycles
10000 times kmalloc(128)/kfree -> 111 cycles
10000 times kmalloc(256)/kfree -> 116 cycles
10000 times kmalloc(512)/kfree -> 110 cycles
10000 times kmalloc(1024)/kfree -> 114 cycles
10000 times kmalloc(2048)/kfree -> 110 cycles
10000 times kmalloc(4096)/kfree -> 107 cycles
10000 times kmalloc(8192)/kfree -> 108 cycles
10000 times kmalloc(16384)/kfree -> 706 cycles


After:


Single thread testing
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 41 cycles kfree -> 81 cycles
10000 times kmalloc(16) -> 47 cycles kfree -> 88 cycles
10000 times kmalloc(32) -> 48 cycles kfree -> 93 cycles
10000 times kmalloc(64) -> 58 cycles kfree -> 89 cycles
10000 times kmalloc(128) -> 84 cycles kfree -> 104 cycles
10000 times kmalloc(256) -> 92 cycles kfree -> 125 cycles
10000 times kmalloc(512) -> 86 cycles kfree -> 129 cycles
10000 times kmalloc(1024) -> 88 cycles kfree -> 125 cycles
10000 times kmalloc(2048) -> 120 cycles kfree -> 159 cycles
10000 times kmalloc(4096) -> 176 cycles kfree -> 183 cycles
10000 times kmalloc(8192) -> 294 cycles kfree -> 233 cycles
10000 times kmalloc(16384) -> 585 cycles kfree -> 291 cycles

2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 100 cycles
10000 times kmalloc(16)/kfree -> 108 cycles
10000 times kmalloc(32)/kfree -> 101 cycles
10000 times kmalloc(64)/kfree -> 109 cycles
10000 times kmalloc(128)/kfree -> 125 cycles
10000 times kmalloc(256)/kfree -> 60 cycles
10000 times kmalloc(512)/kfree -> 60 cycles
10000 times kmalloc(1024)/kfree -> 67 cycles
10000 times kmalloc(2048)/kfree -> 60 cycles
10000 times kmalloc(4096)/kfree -> 65 cycles
10000 times kmalloc(8192)/kfree -> 60 cycles
10000 times kmalloc(16384)/kfree -> 686 cycles



* [PATCH 1/7] slub: Remove __slab_alloc code duplication
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
@ 2014-12-10 16:30 ` Christoph Lameter
  2014-12-10 16:39   ` Pekka Enberg
  2014-12-10 16:30 ` [PATCH 2/7] slub: Use page-mapping to store address of page frame like done in SLAB Christoph Lameter
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:30 UTC (permalink / raw)
  To: akpm
  Cc: rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: simplify_code --]
[-- Type: text/plain, Size: 1276 bytes --]

Somehow the two branches in __slab_alloc ended up doing the same thing.
Unify them.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-12-08 13:24:05.193185492 -0600
+++ linux/mm/slub.c	2014-12-09 12:23:11.927032128 -0600
@@ -2282,10 +2282,7 @@ redo:
 
 		if (unlikely(!node_match(page, searchnode))) {
 			stat(s, ALLOC_NODE_MISMATCH);
-			deactivate_slab(s, page, c->freelist);
-			c->page = NULL;
-			c->freelist = NULL;
-			goto new_slab;
+			goto deactivate;
 		}
 	}
 
@@ -2294,12 +2291,8 @@ redo:
 	 * PFMEMALLOC but right now, we are losing the pfmemalloc
 	 * information when the page leaves the per-cpu allocator
 	 */
-	if (unlikely(!pfmemalloc_match(page, gfpflags))) {
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
-	}
+	if (unlikely(!pfmemalloc_match(page, gfpflags)))
+		goto deactivate;
 
 	/* must check again c->freelist in case of cpu migration or IRQ */
 	freelist = c->freelist;
@@ -2328,6 +2321,11 @@ load_freelist:
 	local_irq_restore(flags);
 	return freelist;
 
+deactivate:
+	deactivate_slab(s, page, c->freelist);
+	c->page = NULL;
+	c->freelist = NULL;
+
 new_slab:
 
 	if (c->partial) {



* [PATCH 2/7] slub: Use page-mapping to store address of page frame like done in SLAB
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
  2014-12-10 16:30 ` [PATCH 1/7] slub: Remove __slab_alloc code duplication Christoph Lameter
@ 2014-12-10 16:30 ` Christoph Lameter
  2014-12-10 16:45   ` Pekka Enberg
  2014-12-10 16:30 ` [PATCH 3/7] slub: Do not use c->page on free Christoph Lameter
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:30 UTC (permalink / raw)
  To: akpm
  Cc: rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: page_address --]
[-- Type: text/plain, Size: 4318 bytes --]

SLAB uses the mapping field of the page struct to store a pointer to
the beginning of the objects in the page frame. Use the same field to
store the address of the objects in SLUB as well. This allows us to
avoid a number of invocations of page_address(). Most of those are
only used for debugging though, so this by itself should bring no
performance benefit.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/include/linux/mm_types.h
===================================================================
--- linux.orig/include/linux/mm_types.h	2014-12-09 12:23:37.374266835 -0600
+++ linux/include/linux/mm_types.h	2014-12-09 12:23:37.370266955 -0600
@@ -54,6 +54,7 @@ struct page {
 						 * see PAGE_MAPPING_ANON below.
 						 */
 		void *s_mem;			/* slab first object */
+		void *address;			/* slub address of page */
 	};
 
 	/* Second double word */
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-12-09 12:23:37.374266835 -0600
+++ linux/mm/slub.c	2014-12-09 12:23:37.370266955 -0600
@@ -232,7 +232,7 @@ static inline int check_valid_pointer(st
 	if (!object)
 		return 1;
 
-	base = page_address(page);
+	base = page->address;
 	if (object < base || object >= base + page->objects * s->size ||
 		(object - base) % s->size) {
 		return 0;
@@ -449,7 +449,7 @@ static inline bool cmpxchg_double_slab(s
 static void get_map(struct kmem_cache *s, struct page *page, unsigned long *map)
 {
 	void *p;
-	void *addr = page_address(page);
+	void *addr = page->address;
 
 	for (p = page->freelist; p; p = get_freepointer(s, p))
 		set_bit(slab_index(p, s, addr), map);
@@ -596,7 +596,7 @@ static void slab_fix(struct kmem_cache *
 static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 {
 	unsigned int off;	/* Offset of last byte */
-	u8 *addr = page_address(page);
+	u8 *addr = page->address;
 
 	print_tracking(s, p);
 
@@ -763,7 +763,7 @@ static int slab_pad_check(struct kmem_ca
 	if (!(s->flags & SLAB_POISON))
 		return 1;
 
-	start = page_address(page);
+	start = page->address;
 	length = (PAGE_SIZE << compound_order(page)) - s->reserved;
 	end = start + length;
 	remainder = length % s->size;
@@ -1387,11 +1387,12 @@ static struct page *new_slab(struct kmem
 	order = compound_order(page);
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab_cache = s;
+	page->address = page_address(page);
 	__SetPageSlab(page);
 	if (page->pfmemalloc)
 		SetPageSlabPfmemalloc(page);
 
-	start = page_address(page);
+	start = page->address;
 
 	if (unlikely(s->flags & SLAB_POISON))
 		memset(start, POISON_INUSE, PAGE_SIZE << order);
@@ -1420,7 +1421,7 @@ static void __free_slab(struct kmem_cach
 		void *p;
 
 		slab_pad_check(s, page);
-		for_each_object(p, s, page_address(page),
+		for_each_object(p, s, page->address,
 						page->objects)
 			check_object(s, page, p, SLUB_RED_INACTIVE);
 	}
@@ -1433,9 +1434,10 @@ static void __free_slab(struct kmem_cach
 		-pages);
 
 	__ClearPageSlabPfmemalloc(page);
-	__ClearPageSlab(page);
 
 	page_mapcount_reset(page);
+	page->mapping = NULL;
+	__ClearPageSlab(page);
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += pages;
 	__free_pages(page, order);
@@ -1467,7 +1469,7 @@ static void free_slab(struct kmem_cache
 			int offset = (PAGE_SIZE << order) - s->reserved;
 
 			VM_BUG_ON(s->reserved != sizeof(*head));
-			head = page_address(page) + offset;
+			head = page->address + offset;
 		} else {
 			/*
 			 * RCU free overloads the RCU head over the LRU
@@ -3135,7 +3137,7 @@ static void list_slab_objects(struct kme
 							const char *text)
 {
 #ifdef CONFIG_SLUB_DEBUG
-	void *addr = page_address(page);
+	void *addr = page->address;
 	void *p;
 	unsigned long *map = kzalloc(BITS_TO_LONGS(page->objects) *
 				     sizeof(long), GFP_ATOMIC);
@@ -3775,7 +3777,7 @@ static int validate_slab(struct kmem_cac
 						unsigned long *map)
 {
 	void *p;
-	void *addr = page_address(page);
+	void *addr = page->address;
 
 	if (!check_slab(s, page) ||
 			!on_freelist(s, page, NULL))
@@ -3986,7 +3988,7 @@ static void process_slab(struct loc_trac
 		struct page *page, enum track_item alloc,
 		unsigned long *map)
 {
-	void *addr = page_address(page);
+	void *addr = page->address;
 	void *p;
 
 	bitmap_zero(map, page->objects);



* [PATCH 3/7] slub: Do not use c->page on free
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
  2014-12-10 16:30 ` [PATCH 1/7] slub: Remove __slab_alloc code duplication Christoph Lameter
  2014-12-10 16:30 ` [PATCH 2/7] slub: Use page-mapping to store address of page frame like done in SLAB Christoph Lameter
@ 2014-12-10 16:30 ` Christoph Lameter
  2014-12-10 16:54   ` Pekka Enberg
  2014-12-15  8:03   ` Joonsoo Kim
  2014-12-10 16:30 ` [PATCH 4/7] slub: Avoid using the page struct address in allocation fastpath Christoph Lameter
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:30 UTC (permalink / raw)
  To: akpm
  Cc: rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: slub_free_compare_address_range --]
[-- Type: text/plain, Size: 1160 bytes --]

Avoid using the page struct address on free by just doing an address
comparison. That is easily doable now that the page address is
available in the page struct and we have already calculated the page
struct address of the object to be freed.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-12-09 12:25:45.770405462 -0600
+++ linux/mm/slub.c	2014-12-09 12:25:45.766405582 -0600
@@ -2625,6 +2625,13 @@ slab_empty:
 	discard_slab(s, page);
 }
 
+static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
+{
+	long d = p - page->address;
+
+	return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
+}
+
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
  * can perform fastpath freeing without additional function calls.
@@ -2658,7 +2665,7 @@ redo:
 	tid = c->tid;
 	preempt_enable();
 
-	if (likely(page == c->page)) {
+	if (likely(same_slab_page(s, page, c->freelist))) {
 		set_freepointer(s, object, c->freelist);
 
 		if (unlikely(!this_cpu_cmpxchg_double(



* [PATCH 4/7] slub: Avoid using the page struct address in allocation fastpath
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
                   ` (2 preceding siblings ...)
  2014-12-10 16:30 ` [PATCH 3/7] slub: Do not use c->page on free Christoph Lameter
@ 2014-12-10 16:30 ` Christoph Lameter
  2014-12-10 16:56   ` Pekka Enberg
  2014-12-10 16:30 ` [PATCH 5/7] slub: Use end_token instead of NULL to terminate freelists Christoph Lameter
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:30 UTC (permalink / raw)
  To: akpm
  Cc: rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: more_c_page --]
[-- Type: text/plain, Size: 1186 bytes --]

We can use virt_to_page there and only invoke that costly function if
a node is actually specified and we have to check NUMA locality.

This increases the cost of allocating on a specific NUMA node, but that
was never cheap since we may have to dump our caches and retrieve
memory from the correct node.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-12-09 12:27:49.414686959 -0600
+++ linux/mm/slub.c	2014-12-09 12:27:49.414686959 -0600
@@ -2097,6 +2097,15 @@ static inline int node_match(struct page
 	return 1;
 }
 
+static inline int node_match_ptr(void *p, int node)
+{
+#ifdef CONFIG_NUMA
+	if (!p || (node != NUMA_NO_NODE && page_to_nid(virt_to_page(p)) != node))
+		return 0;
+#endif
+	return 1;
+}
+
 #ifdef CONFIG_SLUB_DEBUG
 static int count_free(struct page *page)
 {
@@ -2410,7 +2419,7 @@ redo:
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !node_match(page, node))) {
+	if (unlikely(!object || !node_match_ptr(object, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 		stat(s, ALLOC_SLOWPATH);
 	} else {



* [PATCH 5/7] slub: Use end_token instead of NULL to terminate freelists
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
                   ` (3 preceding siblings ...)
  2014-12-10 16:30 ` [PATCH 4/7] slub: Avoid using the page struct address in allocation fastpath Christoph Lameter
@ 2014-12-10 16:30 ` Christoph Lameter
  2014-12-10 16:59   ` Pekka Enberg
  2014-12-10 16:30 ` [PATCH 6/7] slub: Drop ->page field from kmem_cache_cpu Christoph Lameter
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:30 UTC (permalink / raw)
  To: akpm
  Cc: rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: slub_use_end_token_instead_of_null --]
[-- Type: text/plain, Size: 7676 bytes --]

Ending a list with NULL means that the termination of a list is the
same for all slab pages, whereas freelist pointers otherwise always
point into the address space of the page. Make list termination
possible by setting the lowest bit of a freelist address, and use the
start address of the page if no other address is available for list
termination.

This will allow us to determine the page struct address from a
freelist pointer in the future.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-12-09 12:29:47.219144051 -0600
+++ linux/mm/slub.c	2014-12-09 12:29:47.219144051 -0600
@@ -132,6 +132,16 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }
 
+static bool is_end_token(const void *freelist)
+{
+	return ((unsigned long)freelist) & 1;
+}
+
+static void *end_token(const void *address)
+{
+	return (void *)((unsigned long)address | 1);
+}
+
 /*
  * Issues still to be resolved:
  *
@@ -234,7 +244,7 @@ static inline int check_valid_pointer(st
 
 	base = page->address;
 	if (object < base || object >= base + page->objects * s->size ||
-		(object - base) % s->size) {
+		((object - base) % s->size && !is_end_token(object))) {
 		return 0;
 	}
 
@@ -451,7 +461,7 @@ static void get_map(struct kmem_cache *s
 	void *p;
 	void *addr = page->address;
 
-	for (p = page->freelist; p; p = get_freepointer(s, p))
+	for (p = page->freelist; !is_end_token(p); p = get_freepointer(s, p))
 		set_bit(slab_index(p, s, addr), map);
 }
 
@@ -829,7 +839,7 @@ static int check_object(struct kmem_cach
 		 * of the free objects in this slab. May cause
 		 * another error because the object count is now wrong.
 		 */
-		set_freepointer(s, p, NULL);
+		set_freepointer(s, p, end_token(page->address));
 		return 0;
 	}
 	return 1;
@@ -874,7 +884,7 @@ static int on_freelist(struct kmem_cache
 	unsigned long max_objects;
 
 	fp = page->freelist;
-	while (fp && nr <= page->objects) {
+	while (!is_end_token(fp) && nr <= page->objects) {
 		if (fp == search)
 			return 1;
 		if (!check_valid_pointer(s, page, fp)) {
@@ -1033,7 +1043,7 @@ bad:
 		 */
 		slab_fix(s, "Marking all objects used");
 		page->inuse = page->objects;
-		page->freelist = NULL;
+		page->freelist = end_token(page->address);
 	}
 	return 0;
 }
@@ -1402,7 +1412,7 @@ static struct page *new_slab(struct kmem
 		if (likely(idx < page->objects))
 			set_freepointer(s, p, p + s->size);
 		else
-			set_freepointer(s, p, NULL);
+			set_freepointer(s, p, end_token(start));
 	}
 
 	page->freelist = start;
@@ -1546,12 +1556,11 @@ static inline void *acquire_slab(struct
 	freelist = page->freelist;
 	counters = page->counters;
 	new.counters = counters;
+	new.freelist = freelist;
 	*objects = new.objects - new.inuse;
 	if (mode) {
 		new.inuse = page->objects;
-		new.freelist = NULL;
-	} else {
-		new.freelist = freelist;
+		new.freelist = end_token(freelist);
 	}
 
 	VM_BUG_ON(new.frozen);
@@ -1787,7 +1796,7 @@ static void deactivate_slab(struct kmem_
 	struct page new;
 	struct page old;
 
-	if (page->freelist) {
+	if (!is_end_token(page->freelist)) {
 		stat(s, DEACTIVATE_REMOTE_FREES);
 		tail = DEACTIVATE_TO_TAIL;
 	}
@@ -1800,7 +1809,8 @@ static void deactivate_slab(struct kmem_
 	 * There is no need to take the list->lock because the page
 	 * is still frozen.
 	 */
-	while (freelist && (nextfree = get_freepointer(s, freelist))) {
+	if (freelist)
+	    while (!is_end_token(freelist) && (nextfree = get_freepointer(s, freelist))) {
 		void *prior;
 		unsigned long counters;
 
@@ -1818,7 +1828,8 @@ static void deactivate_slab(struct kmem_
 			"drain percpu freelist"));
 
 		freelist = nextfree;
-	}
+	} else
+		freelist = end_token(page->address);
 
 	/*
 	 * Stage two: Ensure that the page is unfrozen while the
@@ -1842,7 +1853,7 @@ redo:
 
 	/* Determine target state of the slab */
 	new.counters = old.counters;
-	if (freelist) {
+	if (!is_end_token(freelist)) {
 		new.inuse--;
 		set_freepointer(s, freelist, old.freelist);
 		new.freelist = freelist;
@@ -1853,7 +1864,7 @@ redo:
 
 	if (!new.inuse && n->nr_partial >= s->min_partial)
 		m = M_FREE;
-	else if (new.freelist) {
+	else if (!is_end_token(new.freelist)) {
 		m = M_PARTIAL;
 		if (!lock) {
 			lock = 1;
@@ -2180,7 +2191,7 @@ static inline void *new_slab_objects(str
 
 	freelist = get_partial(s, flags, node, c);
 
-	if (freelist)
+	if (freelist && !is_end_token(freelist))
 		return freelist;
 
 	page = new_slab(s, flags, node);
@@ -2194,7 +2205,7 @@ static inline void *new_slab_objects(str
 		 * muck around with it freely without cmpxchg
 		 */
 		freelist = page->freelist;
-		page->freelist = NULL;
+		page->freelist = end_token(freelist);
 
 		stat(s, ALLOC_SLAB);
 		c->page = page;
@@ -2217,11 +2228,11 @@ static inline bool pfmemalloc_match(stru
  * Check the page->freelist of a page and either transfer the freelist to the
  * per cpu freelist or deactivate the page.
  *
- * The page is still frozen if the return value is not NULL.
+ * The page is still frozen if the return value is not end_token.
  *
- * If this function returns NULL then the page has been unfrozen.
+ * If this function returns end_token then the page has been unfrozen.
  *
- * This function must be called with interrupt disabled.
+ * This function must be called with interrupts disabled.
  */
 static inline void *get_freelist(struct kmem_cache *s, struct page *page)
 {
@@ -2237,11 +2248,11 @@ static inline void *get_freelist(struct
 		VM_BUG_ON(!new.frozen);
 
 		new.inuse = page->objects;
-		new.frozen = freelist != NULL;
+		new.frozen = !is_end_token(freelist);
 
 	} while (!__cmpxchg_double_slab(s, page,
 		freelist, counters,
-		NULL, new.counters,
+		end_token(freelist), new.counters,
 		"get_freelist"));
 
 	return freelist;
@@ -2307,12 +2318,17 @@ redo:
 
 	/* must check again c->freelist in case of cpu migration or IRQ */
 	freelist = c->freelist;
-	if (freelist)
+	if (freelist && !is_end_token(freelist))
 		goto load_freelist;
 
+	/*
+	 * There is no freelist or its exhausted. See if the page has any
+	 * objects freed to it.
+	 */
 	freelist = get_freelist(s, page);
 
-	if (!freelist) {
+	if (is_end_token(freelist)) {
+		/* page has been deactivated by get_freelist */
 		c->page = NULL;
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
@@ -2343,7 +2359,7 @@ new_slab:
 		page = c->page = c->partial;
 		c->partial = page->next;
 		stat(s, CPU_PARTIAL_ALLOC);
-		c->freelist = NULL;
+		c->freelist = end_token(page->address);
 		goto redo;
 	}
 
@@ -2419,7 +2435,7 @@ redo:
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !node_match_ptr(object, node))) {
+	if (unlikely(!object || is_end_token(object) ||!node_match_ptr(object, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 		stat(s, ALLOC_SLOWPATH);
 	} else {
@@ -2549,9 +2565,9 @@ static void __slab_free(struct kmem_cach
 		new.counters = counters;
 		was_frozen = new.frozen;
 		new.inuse--;
-		if ((!new.inuse || !prior) && !was_frozen) {
+		if ((!new.inuse || is_end_token(prior)) && !was_frozen) {
 
-			if (kmem_cache_has_cpu_partial(s) && !prior) {
+			if (kmem_cache_has_cpu_partial(s) && is_end_token(prior)) {
 
 				/*
 				 * Slab was on no list before and will be
@@ -2608,7 +2624,7 @@ static void __slab_free(struct kmem_cach
 	 * Objects left in the slab. If it was not on the partial list before
 	 * then add it.
 	 */
-	if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
+	if (!kmem_cache_has_cpu_partial(s) && unlikely(is_end_token(prior))) {
 		if (kmem_cache_debug(s))
 			remove_full(s, n, page);
 		add_partial(n, page, DEACTIVATE_TO_TAIL);



* [PATCH 6/7] slub: Drop ->page field from kmem_cache_cpu
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
                   ` (4 preceding siblings ...)
  2014-12-10 16:30 ` [PATCH 5/7] slub: Use end_token instead of NULL to terminate freelists Christoph Lameter
@ 2014-12-10 16:30 ` Christoph Lameter
  2014-12-10 17:29   ` Pekka Enberg
  2014-12-10 16:30 ` [PATCH 7/7] slub: Remove preemption disable/enable from fastpath Christoph Lameter
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:30 UTC (permalink / raw)
  To: akpm
  Cc: rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: slub_drop_kmem_cache_cpu_page_Field --]
[-- Type: text/plain, Size: 5024 bytes --]

Dropping the page field is possible since the page struct address can
now always be calculated from an object or freelist pointer. No
freelist pointer will be NULL anymore, so use NULL to signify that the
current cpu has no per cpu slab attached to it.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h	2014-12-09 12:41:01.150901379 -0600
+++ linux/include/linux/slub_def.h	2014-12-09 12:41:01.150901379 -0600
@@ -40,7 +40,6 @@ enum stat_item {
 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
-	struct page *page;	/* The slab from which we are allocating */
 	struct page *partial;	/* Partially allocated frozen slabs */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-12-09 12:41:01.150901379 -0600
+++ linux/mm/slub.c	2014-12-09 12:41:01.150901379 -0600
@@ -1613,7 +1613,6 @@ static void *get_partial_node(struct kme
 
 		available += objects;
 		if (!object) {
-			c->page = page;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
@@ -2051,10 +2050,9 @@ static void put_cpu_partial(struct kmem_
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	stat(s, CPUSLAB_FLUSH);
-	deactivate_slab(s, c->page, c->freelist);
+	deactivate_slab(s, virt_to_head_page(c->freelist), c->freelist);
 
 	c->tid = next_tid(c->tid);
-	c->page = NULL;
 	c->freelist = NULL;
 }
 
@@ -2068,7 +2066,7 @@ static inline void __flush_cpu_slab(stru
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	if (likely(c)) {
-		if (c->page)
+		if (c->freelist)
 			flush_slab(s, c);
 
 		unfreeze_partials(s, c);
@@ -2087,7 +2085,7 @@ static bool has_cpu_slab(int cpu, void *
 	struct kmem_cache *s = info;
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
-	return c->page || c->partial;
+	return c->freelist || c->partial;
 }
 
 static void flush_all(struct kmem_cache *s)
@@ -2197,7 +2195,7 @@ static inline void *new_slab_objects(str
 	page = new_slab(s, flags, node);
 	if (page) {
 		c = raw_cpu_ptr(s->cpu_slab);
-		if (c->page)
+		if (c->freelist)
 			flush_slab(s, c);
 
 		/*
@@ -2208,7 +2206,6 @@ static inline void *new_slab_objects(str
 		page->freelist = end_token(freelist);
 
 		stat(s, ALLOC_SLAB);
-		c->page = page;
 		*pc = c;
 	} else
 		freelist = NULL;
@@ -2291,9 +2288,10 @@ static void *__slab_alloc(struct kmem_ca
 	c = this_cpu_ptr(s->cpu_slab);
 #endif
 
-	page = c->page;
-	if (!page)
+	if (!c->freelist || is_end_token(c->freelist))
 		goto new_slab;
+
+	page = virt_to_head_page(c->freelist);
 redo:
 
 	if (unlikely(!node_match(page, node))) {
@@ -2329,7 +2327,7 @@ redo:
 
 	if (is_end_token(freelist)) {
 		/* page has been deactivated by get_freelist */
-		c->page = NULL;
+		c->freelist = NULL;
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
 	}
@@ -2342,7 +2340,7 @@ load_freelist:
 	 * page is pointing to the page from which the objects are obtained.
 	 * That page must be frozen for per cpu allocations to work.
 	 */
-	VM_BUG_ON(!c->page->frozen);
+	VM_BUG_ON(!virt_to_head_page(freelist)->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
 	local_irq_restore(flags);
@@ -2350,13 +2348,12 @@ load_freelist:
 
 deactivate:
 	deactivate_slab(s, page, c->freelist);
-	c->page = NULL;
 	c->freelist = NULL;
 
 new_slab:
 
 	if (c->partial) {
-		page = c->page = c->partial;
+		page = c->partial;
 		c->partial = page->next;
 		stat(s, CPU_PARTIAL_ALLOC);
 		c->freelist = end_token(page->address);
@@ -2371,7 +2368,7 @@ new_slab:
 		return NULL;
 	}
 
-	page = c->page;
+	page = virt_to_head_page(freelist);
 	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
 		goto load_freelist;
 
@@ -2381,7 +2378,6 @@ new_slab:
 		goto new_slab;	/* Slab failed checks. Next slab needed */
 
 	deactivate_slab(s, page, get_freepointer(s, freelist));
-	c->page = NULL;
 	c->freelist = NULL;
 	local_irq_restore(flags);
 	return freelist;
@@ -2402,7 +2398,6 @@ static __always_inline void *slab_alloc_
 {
 	void **object;
 	struct kmem_cache_cpu *c;
-	struct page *page;
 	unsigned long tid;
 
 	if (slab_pre_alloc_hook(s, gfpflags))
@@ -2434,7 +2429,6 @@ redo:
 	preempt_enable();
 
 	object = c->freelist;
-	page = c->page;
 	if (unlikely(!object || is_end_token(object) ||!node_match_ptr(object, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 		stat(s, ALLOC_SLOWPATH);
@@ -4216,10 +4210,10 @@ static ssize_t show_slab_objects(struct
 			int node;
 			struct page *page;
 
-			page = ACCESS_ONCE(c->page);
-			if (!page)
+			if (!c->freelist)
 				continue;
 
+			page = virt_to_head_page(c->freelist);
 			node = page_to_nid(page);
 			if (flags & SO_TOTAL)
 				x = page->objects;



* [PATCH 7/7] slub: Remove preemption disable/enable from fastpath
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
                   ` (5 preceding siblings ...)
  2014-12-10 16:30 ` [PATCH 6/7] slub: Drop ->page field from kmem_cache_cpu Christoph Lameter
@ 2014-12-10 16:30 ` Christoph Lameter
  2014-12-11 13:35 ` [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:30 UTC (permalink / raw)
  To: akpm
  Cc: rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: slub_fastpath_remove_preempt --]
[-- Type: text/plain, Size: 3901 bytes --]

We can now use a this_cpu_cmpxchg_double to update the two 64 bit
values that form the entire description of the per cpu freelist.
There is no longer any need to disable preemption.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-12-09 12:31:45.867575731 -0600
+++ linux/mm/slub.c	2014-12-09 12:31:45.867575731 -0600
@@ -2272,21 +2272,15 @@ static inline void *get_freelist(struct
  * a call to the page allocator and the setup of a new slab.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr)
 {
 	void *freelist;
 	struct page *page;
 	unsigned long flags;
+	struct kmem_cache_cpu *c;
 
 	local_irq_save(flags);
-#ifdef CONFIG_PREEMPT
-	/*
-	 * We may have been preempted and rescheduled on a different
-	 * cpu before disabling interrupts. Need to reload cpu area
-	 * pointer.
-	 */
 	c = this_cpu_ptr(s->cpu_slab);
-#endif
 
 	if (!c->freelist || is_end_token(c->freelist))
 		goto new_slab;
@@ -2397,7 +2391,6 @@ static __always_inline void *slab_alloc_
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
-	struct kmem_cache_cpu *c;
 	unsigned long tid;
 
 	if (slab_pre_alloc_hook(s, gfpflags))
@@ -2406,31 +2399,15 @@ static __always_inline void *slab_alloc_
 	s = memcg_kmem_get_cache(s, gfpflags);
 redo:
 	/*
-	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
-	 * enabled. We may switch back and forth between cpus while
-	 * reading from one cpu area. That does not matter as long
-	 * as we end up on the original cpu again when doing the cmpxchg.
-	 *
-	 * Preemption is disabled for the retrieval of the tid because that
-	 * must occur from the current processor. We cannot allow rescheduling
-	 * on a different processor between the determination of the pointer
-	 * and the retrieval of the tid.
-	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
-
-	/*
 	 * The transaction ids are globally unique per cpu and per operation on
 	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
 	 * occurs on the right processor and that there was no operation on the
 	 * linked list in between.
 	 */
-	tid = c->tid;
-	preempt_enable();
-
-	object = c->freelist;
-	if (unlikely(!object || is_end_token(object) ||!node_match_ptr(object, node))) {
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+	tid = this_cpu_read(s->cpu_slab->tid);
+	object = this_cpu_read(s->cpu_slab->freelist);
+	if (unlikely(!object || is_end_token(object) || !node_match_ptr(object, node))) {
+		object = __slab_alloc(s, gfpflags, node, addr);
 		stat(s, ALLOC_SLOWPATH);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
@@ -2666,30 +2643,21 @@ static __always_inline void slab_free(st
 			struct page *page, void *x, unsigned long addr)
 {
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
+	void *freelist;
 	unsigned long tid;
 
 	slab_free_hook(s, x);
 
 redo:
-	/*
-	 * Determine the currently cpus per cpu slab.
-	 * The cpu may change afterward. However that does not matter since
-	 * data is retrieved via this pointer. If we are on the same cpu
-	 * during the cmpxchg then the free will succedd.
-	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
-
-	tid = c->tid;
-	preempt_enable();
+	tid = this_cpu_read(s->cpu_slab->tid);
+	freelist = this_cpu_read(s->cpu_slab->freelist);
 
-	if (likely(same_slab_page(s, page, c->freelist))) {
-		set_freepointer(s, object, c->freelist);
+	if (likely(same_slab_page(s, page, freelist))) {
+		set_freepointer(s, object, freelist);
 
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
-				c->freelist, tid,
+				freelist, tid,
 				object, next_tid(tid)))) {
 
 			note_cmpxchg_failure("slab_free", s, tid);



* Re: [PATCH 1/7] slub: Remove __slab_alloc code duplication
  2014-12-10 16:30 ` [PATCH 1/7] slub: Remove __slab_alloc code duplication Christoph Lameter
@ 2014-12-10 16:39   ` Pekka Enberg
  0 siblings, 0 replies; 50+ messages in thread
From: Pekka Enberg @ 2014-12-10 16:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter <cl@linux.com> wrote:
> Somehow the two branches in __slab_alloc do the same.
> Unify them.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Reviewed-by: Pekka Enberg <penberg@kernel.org>


* Re: [PATCH 2/7] slub: Use page-mapping to store address of page frame like done in SLAB
  2014-12-10 16:30 ` [PATCH 2/7] slub: Use page-mapping to store address of page frame like done in SLAB Christoph Lameter
@ 2014-12-10 16:45   ` Pekka Enberg
  0 siblings, 0 replies; 50+ messages in thread
From: Pekka Enberg @ 2014-12-10 16:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter <cl@linux.com> wrote:
> SLAB uses the mapping field of the page struct to store a pointer to the
> beginning of the objects in the page frame. Use the same field to store
> the address of the objects in SLUB as well. This allows us to avoid a
> number of invocations of page_address(). Those are mostly only used for
> debugging though so this should have no performance benefit.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Reviewed-by: Pekka Enberg <penberg@kernel.org>


* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-10 16:30 ` [PATCH 3/7] slub: Do not use c->page on free Christoph Lameter
@ 2014-12-10 16:54   ` Pekka Enberg
  2014-12-10 17:08     ` Christoph Lameter
  2014-12-15  8:03   ` Joonsoo Kim
  1 sibling, 1 reply; 50+ messages in thread
From: Pekka Enberg @ 2014-12-10 16:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter <cl@linux.com> wrote:
> Avoid using the page struct address on free by just doing an
> address comparison. That is easily doable now that the page address
> is available in the page struct and we already have the page struct
> address of the object to be freed calculated.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c        2014-12-09 12:25:45.770405462 -0600
> +++ linux/mm/slub.c     2014-12-09 12:25:45.766405582 -0600
> @@ -2625,6 +2625,13 @@ slab_empty:
>         discard_slab(s, page);
>  }
>
> +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)

Why are you passing a pointer to struct kmem_cache here? You don't
seem to use it.

> +{
> +       long d = p - page->address;
> +
> +       return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> +}

Can you elaborate on what this is doing? I don't really understand it.

- Pekka


* Re: [PATCH 4/7] slub: Avoid using the page struct address in allocation fastpath
  2014-12-10 16:30 ` [PATCH 4/7] slub: Avoid using the page struct address in allocation fastpath Christoph Lameter
@ 2014-12-10 16:56   ` Pekka Enberg
  0 siblings, 0 replies; 50+ messages in thread
From: Pekka Enberg @ 2014-12-10 16:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter <cl@linux.com> wrote:
> We can use virt_to_page there and only invoke the costly function if
> actually a node is specified and we have to check the NUMA locality.
>
> Increases the cost of allocating on a specific NUMA node but then that
> was never cheap since we may have to dump our caches and retrieve memory
> from the correct node.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c        2014-12-09 12:27:49.414686959 -0600
> +++ linux/mm/slub.c     2014-12-09 12:27:49.414686959 -0600
> @@ -2097,6 +2097,15 @@ static inline int node_match(struct page
>         return 1;
>  }
>
> +static inline int node_match_ptr(void *p, int node)
> +{
> +#ifdef CONFIG_NUMA
> +       if (!p || (node != NUMA_NO_NODE && page_to_nid(virt_to_page(p)) != node))

You already test that object != NULL before calling node_match_ptr().

> +               return 0;
> +#endif
> +       return 1;
> +}
> +
>  #ifdef CONFIG_SLUB_DEBUG
>  static int count_free(struct page *page)
>  {
> @@ -2410,7 +2419,7 @@ redo:
>
>         object = c->freelist;
>         page = c->page;
> -       if (unlikely(!object || !node_match(page, node))) {
> +       if (unlikely(!object || !node_match_ptr(object, node))) {
>                 object = __slab_alloc(s, gfpflags, node, addr, c);
>                 stat(s, ALLOC_SLOWPATH);
>         } else {
>


* Re: [PATCH 5/7] slub: Use end_token instead of NULL to terminate freelists
  2014-12-10 16:30 ` [PATCH 5/7] slub: Use end_token instead of NULL to terminate freelists Christoph Lameter
@ 2014-12-10 16:59   ` Pekka Enberg
  0 siblings, 0 replies; 50+ messages in thread
From: Pekka Enberg @ 2014-12-10 16:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter <cl@linux.com> wrote:
> Ending a list with NULL means that the termination of a list is the same
> for all slab pages. The pointers of freelists otherwise always are
> pointing to the address space of the page. Make termination of a
> list possible by setting the lowest bit in the freelist address
> and use the start address of a page if no other address is available
> for list termination.
>
> This will allow us to determine the page struct address from a
> freelist pointer in the future.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Reviewed-by: Pekka Enberg <penberg@kernel.org>


* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-10 16:54   ` Pekka Enberg
@ 2014-12-10 17:08     ` Christoph Lameter
  2014-12-10 17:32       ` Pekka Enberg
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 17:08 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, 10 Dec 2014, Pekka Enberg wrote:

> > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
>
> Why are you passing a pointer to struct kmem_cache here? You don't
> seem to use it.

True.
> > +{
> > +       long d = p - page->address;
> > +
> > +       return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> > +}
>
> Can you elaborate on what this is doing? I don't really understand it.

Checks if the pointer points to the slab page. Also it tries to avoid
having to call compound_order needlessly. Not sure if that optimization
is worth it.





* Re: [PATCH 6/7] slub: Drop ->page field from kmem_cache_cpu
  2014-12-10 16:30 ` [PATCH 6/7] slub: Drop ->page field from kmem_cache_cpu Christoph Lameter
@ 2014-12-10 17:29   ` Pekka Enberg
  0 siblings, 0 replies; 50+ messages in thread
From: Pekka Enberg @ 2014-12-10 17:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter <cl@linux.com> wrote:
> Dropping the page field is possible since the page struct address
> of an object or a freelist pointer can now always be calculated from
> the address. No freelist pointer will be NULL anymore so use
> NULL to signify the condition that the current cpu has no
> percpu slab attached to it.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Reviewed-by: Pekka Enberg <penberg@kernel.org>


* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-10 17:08     ` Christoph Lameter
@ 2014-12-10 17:32       ` Pekka Enberg
  2014-12-10 17:37         ` Christoph Lameter
  0 siblings, 1 reply; 50+ messages in thread
From: Pekka Enberg @ 2014-12-10 17:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 7:08 PM, Christoph Lameter <cl@linux.com> wrote:
>> > +{
>> > +       long d = p - page->address;
>> > +
>> > +       return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
>> > +}
>>
>> Can you elaborate on what this is doing? I don't really understand it.
>
> Checks if the pointer points to the slab page. Also it tres to avoid
> having to call compound_order needlessly. Not sure if that optimization is
> worth it.

Aah, it's the (1 << MAX_ORDER) optimization that confused me. Perhaps
add a comment there to make it more obvious?

I'm fine with the optimization:

Reviewed-by: Pekka Enberg <penberg@kernel.org>

- Pekka


* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-10 17:32       ` Pekka Enberg
@ 2014-12-10 17:37         ` Christoph Lameter
  2014-12-11 13:19           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-10 17:37 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: akpm, Steven Rostedt, LKML, Thomas Gleixner, linux-mm,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, 10 Dec 2014, Pekka Enberg wrote:

> I'm fine with the optimization:
>
> Reviewed-by: Pekka Enberg <penberg@kernel.org>

There were some other issues so it's now:


Subject: slub: Do not use c->page on free

Avoid using the page struct address on free by just doing an address
comparison. That is easily doable now that the page address is
available in the page struct and we have already calculated the page
struct address of the object to be freed.

Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-12-10 11:35:32.538563734 -0600
+++ linux/mm/slub.c	2014-12-10 11:36:39.032447807 -0600
@@ -2625,6 +2625,17 @@ slab_empty:
 	discard_slab(s, page);
 }

+static bool is_pointer_to_page(struct page *page, void *p)
+{
+	long d = p - page->address;
+
+	/*
+	 * Do a comparison for a MAX_ORDER page first before using
+	 * compound_order() to determine the actual page size.
+	 */
+	return d >= 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
+}
+
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
  * can perform fastpath freeing without additional function calls.
@@ -2658,7 +2669,7 @@ redo:
 	tid = c->tid;
 	preempt_enable();

-	if (likely(page == c->page)) {
+	if (likely(is_pointer_to_page(page, c->freelist))) {
 		set_freepointer(s, object, c->freelist);

 		if (unlikely(!this_cpu_cmpxchg_double(


* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-10 17:37         ` Christoph Lameter
@ 2014-12-11 13:19           ` Jesper Dangaard Brouer
  2014-12-11 15:01             ` Christoph Lameter
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-11 13:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	linux-mm, iamjoonsoo, brouer


On Wed, 10 Dec 2014 11:37:56 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:

[...]
> 
> There were some other issues so its now:
> 
> 
> Subject: slub: Do not use c->page on free
> 
> Avoid using the page struct address on free by just doing an
> address comparison. That is easily doable now that the page address
> is available in the page struct and we already have the page struct
> address of the object to be freed calculated.
> 
> Reviewed-by: Pekka Enberg <penberg@kernel.org>
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-12-10 11:35:32.538563734 -0600
> +++ linux/mm/slub.c	2014-12-10 11:36:39.032447807 -0600
> @@ -2625,6 +2625,17 @@ slab_empty:
>  	discard_slab(s, page);
>  }
> 
> +static bool is_pointer_to_page(struct page *page, void *p)
> +{
> +	long d = p - page->address;
> +
> +	/*
> +	 * Do a comparison for a MAX_ORDER page first before using
> +	 * compound_order() to determine the actual page size.
> +	 */
> +	return d >= 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> +}

My current compiler (gcc 4.9.1) chose not to inline is_pointer_to_page().

 (perf record of [1])
 Samples: 8K of event 'cycles', Event count (approx.): 5737618489
 +   46.13%  modprobe  [kernel.kallsyms]  [k] kmem_cache_free
 +   33.02%  modprobe  [kernel.kallsyms]  [k] kmem_cache_alloc
 +   16.14%  modprobe  [kernel.kallsyms]  [k] is_pointer_to_page

If I explicitly add "inline", then it gets inlined, and performance is good again.

Test[1] cost of kmem_cache_alloc+free:
 * baseline: 47 cycles(tsc) 19.032 ns  (net-next without patchset)
 * patchset: 50 cycles(tsc) 20.028 ns
 * inline  : 45 cycles(tsc) 18.135 ns  (inlined is_pointer_to_page())


>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>   * can perform fastpath freeing without additional function calls.
> @@ -2658,7 +2669,7 @@ redo:
>  	tid = c->tid;
>  	preempt_enable();
> 
> -	if (likely(page == c->page)) {
> +	if (likely(is_pointer_to_page(page, c->freelist))) {
>  		set_freepointer(s, object, c->freelist);
> 
>  		if (unlikely(!this_cpu_cmpxchg_double(


[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_kmem_cache1.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
                   ` (6 preceding siblings ...)
  2014-12-10 16:30 ` [PATCH 7/7] slub: Remove preemption disable/enable from fastpath Christoph Lameter
@ 2014-12-11 13:35 ` Jesper Dangaard Brouer
  2014-12-11 15:03   ` Christoph Lameter
  2014-12-11 17:37 ` Jesper Dangaard Brouer
  2014-12-15  7:59 ` Joonsoo Kim
  9 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-11 13:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo.kim, brouer

On Wed, 10 Dec 2014 10:30:17 -0600
Christoph Lameter <cl@linux.com> wrote:

[...]
> 
> Slab Benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
> 20%-50% of fastpath latency:
> 
> Before:
> 
> Single thread testing
[...]
> 2. Kmalloc: alloc/free test
[...]
> 10000 times kmalloc(256)/kfree -> 116 cycles
[...]
> 
> 
> After:
> 
> Single thread testing
[...]
> 2. Kmalloc: alloc/free test
[...]
> 10000 times kmalloc(256)/kfree -> 60 cycles
[...]

It looks like an impressive saving 116 -> 60 cycles.  I just don't see
the same kind of improvements with my similar tests[1][2].

My test[1] is just a fast-path loop over kmem_cache_alloc+free on
256 byte objects. (Results are after explicitly inlining the new
function is_pointer_to_page().)

 baseline: 47 cycles(tsc) 19.032 ns
 patchset: 45 cycles(tsc) 18.135 ns

I do see the improvement, but it is not as high as I would have expected.

(CPU E5-2695)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_kmem_cache1.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/qmempool_bench.c
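
For reference, the timed loop is essentially of the following shape
(a minimal sketch with made-up names, not the actual
time_bench_kmem_cache1.c harness above, which adds repetitions and
proper statistics):

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/timex.h>	/* get_cycles() */

static int __init slab_bench_init(void)
{
	struct kmem_cache *cache;
	unsigned long i, loops = 10000000;
	cycles_t start, stop;
	void *obj;

	cache = kmem_cache_create("bench-256", 256, 0, 0, NULL);
	if (!cache)
		return -ENOMEM;

	start = get_cycles();
	for (i = 0; i < loops; i++) {
		obj = kmem_cache_alloc(cache, GFP_KERNEL);
		if (!obj)
			break;
		kmem_cache_free(cache, obj);
	}
	stop = get_cycles();

	pr_info("kmem_cache alloc+free: %llu cycles per iteration\n",
		(unsigned long long)(stop - start) / loops);

	kmem_cache_destroy(cache);
	return 0;
}
module_init(slab_bench_init);

static void __exit slab_bench_exit(void)
{
}
module_exit(slab_bench_exit);

MODULE_LICENSE("GPL");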

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-11 13:19           ` Jesper Dangaard Brouer
@ 2014-12-11 15:01             ` Christoph Lameter
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-12-11 15:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pekka Enberg, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	linux-mm, iamjoonsoo

On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:

> If I explicitly add "inline", then it gets inlined, and performance is good again.

Ok adding inline for the next release.



* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-11 13:35 ` [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Jesper Dangaard Brouer
@ 2014-12-11 15:03   ` Christoph Lameter
  2014-12-11 16:50     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-11 15:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo.kim

On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:

> It looks like an impressive saving 116 -> 60 cycles.  I just don't see
> the same kind of improvements with my similar tests[1][2].

This is particularly for a CONFIG_PREEMPT kernel. There will be no effect
on !CONFIG_PREEMPT I hope.

> I do see the improvement, but it is not as high as I would have expected.

Do you have CONFIG_PREEMPT set?



* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-11 15:03   ` Christoph Lameter
@ 2014-12-11 16:50     ` Jesper Dangaard Brouer
  2014-12-11 17:18       ` Christoph Lameter
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-11 16:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo.kim, brouer

On Thu, 11 Dec 2014 09:03:24 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:
> 
> > It looks like an impressive saving 116 -> 60 cycles.  I just don't see
> > the same kind of improvements with my similar tests[1][2].
> 
> This is particularly for a CONFIG_PREEMPT kernel. There will be no effect
> on !CONFIG_PREEMPT I hope.
> 
> > I do see the improvement, but it is not as high as I would have expected.
> 
> Do you have CONFIG_PREEMPT set?

Yes.

$ grep CONFIG_PREEMPT .config
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y

Full config here:
 http://people.netfilter.org/hawk/kconfig/config01-slub-fastpath01

I was expecting to see at least (specifically) 4.291 ns improvement, as
this is the measured[1] cost of preempt_{disable,enable} on my system.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-11 16:50     ` Jesper Dangaard Brouer
@ 2014-12-11 17:18       ` Christoph Lameter
  2014-12-11 18:11         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-11 17:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo.kim

On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:

> I was expecting to see at least (specifically) 4.291 ns improvement, as
> this is the measured[1] cost of preempt_{disable,enable} on my system.

Right. Those calls are taken out of the fastpaths by this patchset for
the CONFIG_PREEMPT case. So the numbers that you got do not make much
sense to me.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
                   ` (7 preceding siblings ...)
  2014-12-11 13:35 ` [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Jesper Dangaard Brouer
@ 2014-12-11 17:37 ` Jesper Dangaard Brouer
  2014-12-12 10:39   ` Jesper Dangaard Brouer
  2014-12-15  7:59 ` Joonsoo Kim
  9 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-11 17:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, brouer


Warning: I'm getting crashes with this patchset during my network load testing.
I don't have a nice crash dump to show yet, but it is in the slub code.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-11 17:18       ` Christoph Lameter
@ 2014-12-11 18:11         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-11 18:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo.kim, brouer

On Thu, 11 Dec 2014 11:18:31 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:
> 
> > I was expecting to see at least (specifically) 4.291 ns improvement, as
> > this is the measured[1] cost of preempt_{disable,enable} on my system.
> 
> Right. Those calls are taken out of the fastpaths by this patchset for
> the CONFIG_PREEMPT case. So the numbers that you got do not make much
> sense to me.

True, that is also what I'm saying.  I'll try to figure out what is
going on tomorrow.

You are welcome to run my test harness:
 http://netoptimizer.blogspot.dk/2014/11/announce-github-repo-prototype-kernel.html
 https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst

Just load module: time_bench_kmem_cache1
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_kmem_cache1.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-11 17:37 ` Jesper Dangaard Brouer
@ 2014-12-12 10:39   ` Jesper Dangaard Brouer
  2014-12-12 18:31     ` Christoph Lameter
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-12 10:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, brouer, Alexander Duyck

On Thu, 11 Dec 2014 18:37:58 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Warning, I'm getting crashes with this patchset, during my network load testing.
> I don't have a nice crash dump to show, yet, but it is in the slub code.

Crash/OOM during an IP-forwarding network overload test[1] with pktgen, using
a single flow, thus activating a single CPU on the target (device under test).

Testing was done with net-next at commit 52c9b12d380, with the patchset applied.
Baseline testing was done without the patchset.

[1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen02_burst.sh

[  135.258503] console [netcon0] enabled
[  164.970377] ixgbe 0000:04:00.0 eth4: detected SFP+: 5
[  165.078455] ixgbe 0000:04:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: None
[  165.266662] ixgbe 0000:04:00.1 eth5: detected SFP+: 6
[  165.396958] ixgbe 0000:04:00.1 eth5: NIC Link is Up 10 Gbps, Flow Control: None
[...]
[  290.298350] ksoftirqd/11: page allocation failure: order:0, mode:0x20
[  290.298632] CPU: 11 PID: 64 Comm: ksoftirqd/11 Not tainted 3.18.0-rc7-net-next+ #852
[  290.299109] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[  290.299377]  0000000000000000 ffff88046c4eba28 ffffffff8164f6f2 ffff88047fd6d1a0
[  290.300169]  0000000000000020 ffff88046c4ebab8 ffffffff8111d241 0000000000000000
[  290.300833]  0000003000000000 ffff88047ffd9b38 ffff880003d86400 0000000000000040
[  290.301496] Call Trace:
[  290.301763]  [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[  290.302035]  [<ffffffff8111d241>] warn_alloc_failed+0xd1/0x130
[  290.302307]  [<ffffffff81120ced>] __alloc_pages_nodemask+0x71d/0xa80
[  290.302572]  [<ffffffff81536b70>] __alloc_page_frag+0x130/0x150
[  290.302840]  [<ffffffff8153b63e>] __alloc_rx_skb+0x5e/0x110
[  290.303112]  [<ffffffff8153b74d>] __napi_alloc_skb+0x1d/0x40
[  290.303383]  [<ffffffffa00f15b1>] ixgbe_clean_rx_irq+0xf1/0x8e0 [ixgbe]
[  290.303655]  [<ffffffffa00f2a7d>] ixgbe_poll+0x41d/0x7c0 [ixgbe]
[  290.303920]  [<ffffffff8154817c>] net_rx_action+0x14c/0x270
[  290.304185]  [<ffffffff8107ad7a>] __do_softirq+0x10a/0x220
[  290.304455]  [<ffffffff8107aeb0>] run_ksoftirqd+0x20/0x50
[  290.304724]  [<ffffffff810962e9>] smpboot_thread_fn+0x159/0x270
[  290.304991]  [<ffffffff81096190>] ? SyS_setgroups+0x180/0x180
[  290.305260]  [<ffffffff81092846>] kthread+0xd6/0xf0
[  290.305525]  [<ffffffff81092770>] ? kthread_create_on_node+0x170/0x170
[  290.305568] rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[  290.305570] rsyslogd cpuset=/ mems_allowed=0
[  290.306534]  [<ffffffff81656a2c>] ret_from_fork+0x7c/0xb0
[  290.306800]  [<ffffffff81092770>] ? kthread_create_on_node+0x170/0x170
[  290.307068] CPU: 1 PID: 2264 Comm: rsyslogd Not tainted 3.18.0-rc7-net-next+ #852
[  290.307553] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[  290.307823]  0000000000000000 ffff88045248f8f8 ffffffff8164f6f2 0000000012a112a1
[  290.308480]  0000000000000000 ffff88045248f978 ffffffff8164c061 ffff88045248f958
[  290.309137]  ffffffff810bd1e9 ffff88045248fa18 ffffffff8112a42b ffff88045248f948
[  290.309805] Call Trace:
[  290.310064]  [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[  290.310326]  [<ffffffff8164c061>] dump_header.isra.8+0x96/0x201
[  290.310593]  [<ffffffff810bd1e9>] ? rcu_oom_notify+0xd9/0xf0
[  290.310863]  [<ffffffff8112a42b>] ? shrink_zones+0x25b/0x390
[  290.316403]  [<ffffffff8111b4c2>] oom_kill_process+0x202/0x370
[  290.316672]  [<ffffffff8107ee72>] ? has_capability_noaudit+0x12/0x20
[  290.316943]  [<ffffffff8111bcbe>] out_of_memory+0x4ee/0x530
[  290.317212]  [<ffffffff8112100e>] __alloc_pages_nodemask+0xa3e/0xa80
[  290.317480]  [<ffffffff8115d5a7>] alloc_pages_current+0x97/0x120
[  290.317749]  [<ffffffff81117dc7>] __page_cache_alloc+0xa7/0xd0
[  290.318010]  [<ffffffff8111a387>] filemap_fault+0x1c7/0x400
[  290.318278]  [<ffffffff8113da06>] __do_fault+0x36/0xd0
[  290.318544]  [<ffffffff8113fc8f>] do_read_fault.isra.78+0x1bf/0x2c0
[  290.318815]  [<ffffffff810ae1c0>] ? autoremove_wake_function+0x40/0x40
[  290.319083]  [<ffffffff8114128e>] handle_mm_fault+0x67e/0xc20
[  290.319346]  [<ffffffff81042dba>] __do_page_fault+0x18a/0x5a0
[  290.319610]  [<ffffffff810ae180>] ? abort_exclusive_wait+0xa0/0xa0
[  290.319877]  [<ffffffff810431dc>] do_page_fault+0xc/0x10
[  290.320142]  [<ffffffff81658062>] page_fault+0x22/0x30
[  290.320441] Mem-Info:
[  290.320703] Node 0 DMA per-cpu:
[  290.321011] CPU    0: hi:    0, btch:   1 usd:   0
[  290.321272] CPU    1: hi:    0, btch:   1 usd:   0
[  290.321532] CPU    2: hi:    0, btch:   1 usd:   0
[  290.321792] CPU    3: hi:    0, btch:   1 usd:   0
[  290.322055] CPU    4: hi:    0, btch:   1 usd:   0
[  290.322319] CPU    5: hi:    0, btch:   1 usd:   0
[  290.322581] CPU    6: hi:    0, btch:   1 usd:   0
[  290.322845] CPU    7: hi:    0, btch:   1 usd:   0
[  290.323108] CPU    8: hi:    0, btch:   1 usd:   0
[  290.323367] CPU    9: hi:    0, btch:   1 usd:   0
[  290.323625] CPU   10: hi:    0, btch:   1 usd:   0
[  290.323885] CPU   11: hi:    0, btch:   1 usd:   0
[  290.324143] Node 0 DMA32 per-cpu:
[  290.324445] CPU    0: hi:  186, btch:  31 usd:   0
[  290.324704] CPU    1: hi:  186, btch:  31 usd:   0
[  290.324962] CPU    2: hi:  186, btch:  31 usd:   0
[  290.325227] CPU    3: hi:  186, btch:  31 usd:   0
[  290.325488] CPU    4: hi:  186, btch:  31 usd:   0
[  290.325753] CPU    5: hi:  186, btch:  31 usd:   0
[  290.326016] CPU    6: hi:  186, btch:  31 usd:   0
[  290.326279] CPU    7: hi:  186, btch:  31 usd:   0
[  290.326546] CPU    8: hi:  186, btch:  31 usd:   0
[  290.326811] CPU    9: hi:  186, btch:  31 usd:   0
[  290.327075] CPU   10: hi:  186, btch:  31 usd:   0
[  290.327344] CPU   11: hi:  186, btch:  31 usd:   0
[  290.327609] Node 0 Normal per-cpu:
[  290.327916] CPU    0: hi:  186, btch:  31 usd:  25
[  290.328179] CPU    1: hi:  186, btch:  31 usd:   0
[  290.328444] CPU    2: hi:  186, btch:  31 usd:   0
[  290.328708] CPU    3: hi:  186, btch:  31 usd:   0
[  290.328969] CPU    4: hi:  186, btch:  31 usd:   0
[  290.329230] CPU    5: hi:  186, btch:  31 usd:   0
[  290.329491] CPU    6: hi:  186, btch:  31 usd:   0
[  290.329753] CPU    7: hi:  186, btch:  31 usd:   0
[  290.330014] CPU    8: hi:  186, btch:  31 usd:   0
[  290.330275] CPU    9: hi:  186, btch:  31 usd:   0
[  290.330536] CPU   10: hi:  186, btch:  31 usd:   0
[  290.330801] CPU   11: hi:  186, btch:  31 usd:   0
[  290.331066] active_anon:109 inactive_anon:0 isolated_anon:0
[  290.331066]  active_file:132 inactive_file:104 isolated_file:0
[  290.331066]  unevictable:2141 dirty:0 writeback:0 unstable:0
[  290.331066]  free:26484 slab_reclaimable:3264 slab_unreclaimable:3985491
[  290.331066]  mapped:1957 shmem:17 pagetables:618 bounce:0
[  290.331066]  free_cma:0
[  290.332411] Node 0 DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  290.333825] lowmem_reserve[]: 0 1917 15995 15995
[  290.334317] Node 0 DMA32 free:64740kB min:8092kB low:10112kB high:12136kB active_anon:296kB inactive_anon:0kB active_file:136kB inactive_file:32kB unevictable:1940kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:1956kB dirty:0kB writeback:0kB mapped:1516kB shmem:0kB slab_reclaimable:1436kB slab_unreclaimable:1864332kB kernel_stack:144kB pagetables:460kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:25708 all_unreclaimable? yes
[  290.335947] lowmem_reserve[]: 0 0 14077 14077
[  290.336439] Node 0 Normal free:24532kB min:59424kB low:74280kB high:89136kB active_anon:140kB inactive_anon:0kB active_file:392kB inactive_file:384kB unevictable:6624kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:6624kB dirty:0kB writeback:0kB mapped:6312kB shmem:68kB slab_reclaimable:11620kB slab_unreclaimable:14078392kB kernel_stack:2864kB pagetables:2012kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  290.338061] lowmem_reserve[]: 0 0 0 0
[  290.338546] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15856kB
[  290.339996] Node 0 DMA32: 473*4kB (UEM) 221*8kB (UEM) 116*16kB (UEM) 86*32kB (UEM) 55*64kB (UEM) 24*128kB (UEM) 12*256kB (UM) 3*512kB (EM) 1*1024kB (E) 2*2048kB (UR) 10*4096kB (MR) = 65548kB
[  290.341804] Node 0 Normal: 994*4kB (UEM) 577*8kB (EM) 203*16kB (EM) 113*32kB (UEM) 47*64kB (UEM) 13*128kB (UM) 4*256kB (UM) 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 25248kB
[  290.343466] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  290.343947] 2081 total pagecache pages
[  290.344210] 13 pages in swap cache
[  290.344473] Swap cache stats: add 4436, delete 4423, find 5/8
[  290.344739] Free swap  = 8198904kB
[  290.345000] Total swap = 8216572kB
[  290.345262] 4184707 pages RAM
[  290.345523] 0 pages HighMem/MovableOnly
[  290.345788] 85688 pages reserved
[  290.346049] 0 pages hwpoisoned
[  290.346307] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  290.346788] [  680]     0   680     2678      264       9      107         -1000 udevd
[  290.347267] [ 1833]     0  1833    10161        0      24       70             0 monitor
[  290.347750] [ 1834]     0  1834    10196      517      27      131             0 ovsdb-server
[  290.348230] [ 1844]     0  1844    10299       50      24       67             0 monitor
[  290.348711] [ 1845]     0  1845    10338     2114      41        0             0 ovs-vswitchd
[  290.349194] [ 2261]     0  2261    62333      386      22      139             0 rsyslogd
[  290.349676] [ 2293]    81  2293     5366      344      13       69             0 dbus-daemon
[  290.350157] [ 2315]    68  2315     9070      403      29      313             0 hald
[  290.350632] [ 2316]     0  2316     5097      339      23       45             0 hald-runner
[  290.351111] [ 2345]     0  2345     5627        0      25       42             0 hald-addon-inpu
[  290.351594] [ 2354]    68  2354     4498      339      20       37             0 hald-addon-acpi
[  290.352069] [ 2363]     0  2363     2677      256       9      106         -1000 udevd
[  290.352540] [ 2471]     0  2471    30430      129      18      558             0 pmqos-static.py
[  290.353011] [ 2486]     0  2486    16672      368      33      179         -1000 sshd
[  290.353481] [ 2497]     0  2497    44314      550      61     1064             0 tuned
[  290.353956] [ 2511]     0  2511    29328      363      16      152             0 crond
[  290.354430] [ 2528]     0  2528     5400        0      14       46             0 atd
[  290.354906] [ 2541]     0  2541    26020      228      12       28             0 rhsmcertd
[  290.355386] [ 2562]     0  2562     1031      308       9       18             0 mingetty
[  290.355858] [ 2564]     0  2564     1031      308       9       18             0 mingetty
[  290.356336] [ 2566]     0  2566     1031      308       9       17             0 mingetty
[  290.356813] [ 2568]     0  2568     1031      308       9       18             0 mingetty
[  290.357291] [ 2570]     0  2570     1031      308       9       18             0 mingetty
[  290.357766] [ 2571]     0  2571     2677      256       9      106         -1000 udevd
[  290.358245] [ 2573]     0  2573     1031      308       9       18             0 mingetty
[  290.358719] [ 2576]     0  2576    25109      985      52      212             0 sshd
[  290.359196] [ 2598]   500  2598    25109      695      50      235             0 sshd
[  290.359673] [ 2611]   500  2611    27820      348      19      806             0 bash
[  290.360147] Out of memory: Kill process 1845 (ovs-vswitchd) score 0 or sacrifice child
[  290.360624] Killed process 1845 (ovs-vswitchd) total-vm:41352kB, anon-rss:732kB, file-rss:7724kB
[  290.450766] ksoftirqd/11: page allocation failure: order:0, mode:0x204020
[  290.451031] ksoftirqd/11: page allocation failure: order:0, mode:0x204020
[  290.451033] CPU: 11 PID: 64 Comm: ksoftirqd/11 Not tainted 3.18.0-rc7-net-next+ #852
[  290.451033] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[  290.451034]  0000000000000000 ffff88046c4eb2e8 ffffffff8164f6f2 0000000014801480
[  290.451035]  0000000000204020 ffff88046c4eb378 ffffffff8111d241 0000000000000000
[  290.451036]  0000003000000000 ffff88047ffd9b38 0000000000000001 0000000000000040
[  290.451037] Call Trace:
[  290.451040]  [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[  290.451042]  [<ffffffff8111d241>] warn_alloc_failed+0xd1/0x130
[  290.451045]  [<ffffffff81137c89>] ? compaction_suitable+0x19/0x20
[  290.451046]  [<ffffffff81120ced>] __alloc_pages_nodemask+0x71d/0xa80
[  290.451049]  [<ffffffff81348aea>] ? vsnprintf+0x3ba/0x590
[  290.451052]  [<ffffffff8115d5a7>] alloc_pages_current+0x97/0x120
[  290.451054]  [<ffffffff8116563d>] new_slab+0x2ad/0x310
[  290.451056]  [<ffffffff811660e7>] __slab_alloc.isra.63+0x207/0x4d0
[  290.451057]  [<ffffffff8116645b>] kmem_cache_alloc_node+0xab/0x110
[  290.451059]  [<ffffffff81536e47>] __alloc_skb+0x47/0x1d0
[  290.451063]  [<ffffffff8138f5a1>] ? vgacon_set_cursor_size.isra.7+0xa1/0x120
[  290.451066]  [<ffffffff815636c4>] netpoll_send_udp+0x84/0x3f0
[  290.451068]  [<ffffffffa028b8bf>] write_msg+0xcf/0x140 [netconsole]
[  290.451070]  [<ffffffff810b3edb>] call_console_drivers.constprop.24+0x9b/0xa0
[  290.451071]  [<ffffffff810b452d>] console_unlock+0x36d/0x450
[  290.451072]  [<ffffffff810b4960>] vprintk_emit+0x350/0x570
[  290.451073]  [<ffffffff8164be24>] printk+0x5c/0x5e
[  290.451075]  [<ffffffff8111d23c>] warn_alloc_failed+0xcc/0x130
[  290.451077]  [<ffffffff8154987c>] ? dev_hard_start_xmit+0x16c/0x320
[  290.451079]  [<ffffffff81120ced>] __alloc_pages_nodemask+0x71d/0xa80
[  290.451081]  [<ffffffff81567c22>] ? sch_direct_xmit+0x112/0x220
[  290.451083]  [<ffffffff8115d5a7>] alloc_pages_current+0x97/0x120
[  290.451084]  [<ffffffff8116563d>] new_slab+0x2ad/0x310
[  290.451085]  [<ffffffff811660e7>] __slab_alloc.isra.63+0x207/0x4d0
[  292.302602] hald-addon-acpi invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[  292.303094] hald-addon-acpi cpuset=/ mems_allowed=0
[  292.303456] CPU: 4 PID: 2354 Comm: hald-addon-acpi Not tainted 3.18.0-rc7-net-next+ #852
[  292.303939] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[  292.304209]  0000000000000000 ffff88044f2af8f8 ffffffff8164f6f2 00000000000038ce
[  292.304884]  0000000000000000 ffff88044f2af978 ffffffff8164c061 ffff88044f2af958
[  292.305560]  ffffffff810bd1e9 ffff88044f2afa18 ffffffff8112a42b ffff88044f2af948
[  292.306231] Call Trace:
[  292.306497]  [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[  292.306768]  [<ffffffff8164c061>] dump_header.isra.8+0x96/0x201
[  292.307038]  [<ffffffff810bd1e9>] ? rcu_oom_notify+0xd9/0xf0
[  292.307305]  [<ffffffff8112a42b>] ? shrink_zones+0x25b/0x390
[  292.307577]  [<ffffffff8111b4c2>] oom_kill_process+0x202/0x370
[  292.307846]  [<ffffffff8107ee72>] ? has_capability_noaudit+0x12/0x20
[  292.308117]  [<ffffffff8111bcbe>] out_of_memory+0x4ee/0x530
[  292.308386]  [<ffffffff8112100e>] __alloc_pages_nodemask+0xa3e/0xa80
[  292.308659]  [<ffffffff8115d5a7>] alloc_pages_current+0x97/0x120
[  292.308930]  [<ffffffff81117dc7>] __page_cache_alloc+0xa7/0xd0
[  292.309200]  [<ffffffff8111a387>] filemap_fault+0x1c7/0x400
[  292.309470]  [<ffffffff8113da06>] __do_fault+0x36/0xd0
[  292.309740]  [<ffffffff8113fc8f>] do_read_fault.isra.78+0x1bf/0x2c0
[  292.310010]  [<ffffffff8114128e>] handle_mm_fault+0x67e/0xc20
[  292.310280]  [<ffffffff810970d5>] ? finish_task_switch+0x45/0xf0
[  292.310551]  [<ffffffff81042dba>] __do_page_fault+0x18a/0x5a0
[  292.310821]  [<ffffffff816559d2>] ? do_nanosleep+0x92/0xe0
[  292.311087]  [<ffffffff810c4d88>] ? hrtimer_nanosleep+0xb8/0x1a0
[  292.311353]  [<ffffffff810c3e90>] ? hrtimer_get_res+0x50/0x50
[  292.311618]  [<ffffffff810431dc>] do_page_fault+0xc/0x10
[  292.311884]  [<ffffffff81658062>] page_fault+0x22/0x30
[  292.312145] Mem-Info:
[  292.312403] Node 0 DMA per-cpu:
[  292.312714] CPU    0: hi:    0, btch:   1 usd:   0
[  292.312977] CPU    1: hi:    0, btch:   1 usd:   0
[  292.313241] CPU    2: hi:    0, btch:   1 usd:   0
[  292.313505] CPU    3: hi:    0, btch:   1 usd:   0
[  292.313772] CPU    4: hi:    0, btch:   1 usd:   0
[  292.314038] CPU    5: hi:    0, btch:   1 usd:   0
[  292.314304] CPU    6: hi:    0, btch:   1 usd:   0
[  292.314571] CPU    7: hi:    0, btch:   1 usd:   0
[  292.314840] CPU    8: hi:    0, btch:   1 usd:   0
[  292.315107] CPU    9: hi:    0, btch:   1 usd:   0
[  292.315372] CPU   10: hi:    0, btch:   1 usd:   0
[  292.315640] CPU   11: hi:    0, btch:   1 usd:   0
[  292.315905] Node 0 DMA32 per-cpu:
[  292.316219] CPU    0: hi:  186, btch:  31 usd:   0
[  292.316487] CPU    1: hi:  186, btch:  31 usd:   0
[  292.316754] CPU    2: hi:  186, btch:  31 usd:   0
[  292.317019] CPU    3: hi:  186, btch:  31 usd:   0
[  292.317284] CPU    4: hi:  186, btch:  31 usd:   0
[  292.317553] CPU    5: hi:  186, btch:  31 usd:   0
[  292.317819] CPU    6: hi:  186, btch:  31 usd:   0
[  292.318086] CPU    7: hi:  186, btch:  31 usd:   0
[  292.318352] CPU    8: hi:  186, btch:  31 usd:   0
[  292.318623] CPU    9: hi:  186, btch:  31 usd:   0
[  292.318892] CPU   10: hi:  186, btch:  31 usd:   0
[  292.319161] CPU   11: hi:  186, btch:  31 usd:   0
[  292.319427] Node 0 Normal per-cpu:
[  292.319742] CPU    0: hi:  186, btch:  31 usd:   2
[  292.320009] CPU    1: hi:  186, btch:  31 usd:   0
[  292.320275] CPU    2: hi:  186, btch:  31 usd:   0
[  292.320542] CPU    3: hi:  186, btch:  31 usd:   0
[  292.320811] CPU    4: hi:  186, btch:  31 usd:   0
[  292.321079] CPU    5: hi:  186, btch:  31 usd:   0
[  292.321346] CPU    6: hi:  186, btch:  31 usd:   0
[  292.321614] CPU    7: hi:  186, btch:  31 usd:   0
[  292.321880] CPU    8: hi:  186, btch:  31 usd:   0
[  292.322146] CPU    9: hi:  186, btch:  31 usd:   0
[  292.322412] CPU   10: hi:  186, btch:  31 usd:   0
[  292.322681] CPU   11: hi:  186, btch:  31 usd:   0
[  292.322947] active_anon:0 inactive_anon:2 isolated_anon:0
[  292.322947]  active_file:81 inactive_file:42 isolated_file:0
[  292.322947]  unevictable:0 dirty:0 writeback:0 unstable:0
[  292.322947]  free:24558 slab_reclaimable:3128 slab_unreclaimable:3989981
[  292.322947]  mapped:39 shmem:0 pagetables:577 bounce:0
[  292.322947]  free_cma:0
[  292.324305] Node 0 DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  292.325716] lowmem_reserve[]: 0 1917 15995 15995
[  292.326216] Node 0 DMA32 free:59736kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:820 all_unreclaimable? yes
[  292.327636] lowmem_reserve[]:Normal free:22900kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:2004kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3288 all_unreclaimable? yes
[  292.360844] monitor invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[  292.361326] monitor cpuset=/ mems_allowed=0
[  292.361687] CPU: 1 PID: 1844 Comm: monitor Not tainted 3.18.0-rc7-net-next+ #852
[  292.362162] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[  292.362429]  0000000000000000 ffff880456aef8f8 ffffffff8164f6f2 0000000000003cf9
[  292.363095]  0000000000000000 ffff880456aef978 ffffffff8164c061 ffff880456aef958
[  292.363764]  ffffffff810bd1e9 ffff880456aefa18 ffffffff8112a42b ffff880456aef988
[  292.364433] Call Trace:
[  292.364695]  [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[  292.381944] active_anon:0 inactive_anon:2 isolated_anon:0
[  292.381944]  active_file:81 inactive_file:42 isolated_file:0
[  292.381944]  unevictable:0 dirty:0 writeback:0 unstable:0
[  292.381944]  free:24864 slab_reclaimable:3128 slab_unreclaimable:3989981
[  292.381944]  mapped:39 shmem:0 pagetables:577 bounce:0
[  292.381944]  free_cma:0
DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
DMA32 free:60068kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes
Normal free:23532kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:2004kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3240 all_unreclaimable? yes
[  292.419725] ovsdb-server invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[  292.420210] ovsdb-server cpuset=/ mems_allowed=0
[  292.420570] CPU: 2 PID: 1834 Comm: ovsdb-server Not tainted 3.18.0-rc7-net-next+ #852
[  292.421053] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[  292.421320]  0000000000000000[  292.440836] active_anon:0 inactive_anon:2 isolated_anon:0
[  292.440836]  active_file:81 inactive_file:42 isolated_file:0
[  292.440836]  unevictable:0 dirty:0 writeback:0 unstable:0
[  292.440836]  free:24864 slab_reclaimable:3128 slab_unreclaimable:3989981
[  292.440836]  mapped:39 shmem:0 pagetables:577 bounce:0
[  292.440836]  free_cma:0
DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
DMA32 free:60068kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes
Normal free:23532kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:2004kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3240 all_unreclaimable? yes
[  292.502168] active_anon:0 inactive_anon:2 isolated_anon:0
[  292.502168]  active_file:81 inactive_file:42 isolated_file:0
[  292.502168]  unevictable:0 dirty:0 writeback:0 unstable:0
[  292.502168]  free:24864 slab_reclaimable:3128 slab_unreclaimable:3989981
[  292.502168]  mapped:39 shmem:0 pagetables:577 bounce:0
[  292.502168]  free_cma:0
DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
DMA32 free:60068kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes
Normal free:23532kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:2004kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3240 all_unreclaimable? yes
[  292.539902] hald invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[  292.540379] hald cpuset=/ mems_allowed=0
[  292.540736] CPU: 9 PID: 2315 Comm: hald Not tainted 3.18.0-rc7-net-next+ #852
[  292.541004] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[  292.541266]  0000000000000000 ffff88044f8eb4a8 ffffffff8164f6f2 0000000000004977
[  292.541934]  0000000000000000 ffff88044f8eb528 ffffffff8164c061 ffff88044f8eb508
[  292.542600]  ffffffff810bd1e9 ffff88044f8eb5c8 ffffffff8112a42b ffff88044f8eb4f8
[  292.543265] Call Trace:
[  292.543530]  [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[  292.543797]  [<ffffffff8164c061>] dump_header.isra.8+0x96/0x201
[  292.544064]  [<ffffffff810bd1e9>] ? rcu_oom_notify+0xd9/0xf0
[  292.544331]  [<ffffffff8112a42b>] ? shrink_zones+0x25b/0x390
[  292.544597]  [<ffffffff8111b4c2>] oom_kill_process+0x202/0x370
[  292.544867]  [<ffffffff8111aff5>] ? oom_unkillable_task.isra.5+0xc5/0xf0
[  292.545134]  [<ffffffff8111bcbe>] out_of_memory+0x4ee/0x530
[  292.545399]  [<ffffffff8112100e>] __alloc_pages_nodemask+0xa3e/0xa80
[  292.545664]  [<ffffffff8115fa1f>] alloc_pages_vma+0x9f/0x1b0
[  292.545934]  [<ffffffff8115283b>] read_swap_cache_async+0x13b/0x1e0
[  292.546202]  [<ffffffff81152a06>] swapin_readahead+0x126/0x190
[  292.546467]  [<ffffffff81118ada>] ? pagecache_get_page+0x2a/0x1e0
[  292.564580] active_anon:0 inactive_anon:2 isolated_anon:0
[  292.564580]  active_file:81 inactive_file:42 isolated_file:0
[  292.564580]  unevictable:0 dirty:0 writeback:0 unstable:0
[  292.564580]  free:24864 slab_reclaimable:3128 slab_unreclaimable:3989981
[  292.564580]  mapped:39 shmem:0 pagetables:480 bounce:0
[  292.564580]  free_cma:0
DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
DMA32 free:60068kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes
Normal free:23532kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:1616kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3240 all_unreclaimable? yes

[  293.207640] Call Trace:
[  293.207903]  [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[  293.208166]  [<ffffffff8164c061>] dump_header.isra.8+0x96/0x201
[  293.208431]  [<ffffffff810bd1e9>] ? rcu_oom_notify+0xd9/0xf0
[  293.208696]  [<ffffffff8112a42b>] ? shrink_zones+0x25b/0x390
[  293.208963]  [<ffffffff8111b4c2>] oom_kill_process+0x202/0x370
[  293.209232]  [<ffffffff8111aff5>] ? oom_unkillable_task.isra.5+0xc5/0xf0
[  293.209500]  [<ffffffff8111bcbe>] out_of_memory+0x4ee/0x530
[  293.209761]  [<ffffffff8112100e>] __alloc_pages_nodemask+0xa3e/0xa80
[  293.210029]  [<ffffffff8115fa1f>] alloc_pages_vma+0x9f/0x1b0
[  293.210297]  [<ffffffff8115283b>] read_swap_cache_async+0x13b/0x1e0
[  293.210562]  [<ffffffff81152a06>] swapin_readahead+0x126/0x190
[  293.210828]  [<ffffffff81118ada>] ? pagecache_get_page+0x2a/0x1e0
[  293.211092]  [<ffffffff811415d8>] handle_mm_fault+0x9c8/0xc20
[  293.211357]  [<ffffffff810a364f>] ? dequeue_entity+0x10f/0x600
[  293.211626]  [<ffffffff81042dba>] __do_page_fault+0x18a/0x5a0
[  293.211892]  [<ffffffff810970d5>] ? finish_task_switch+0x45/0xf0
[  293.212156]  [<ffffffff81651070>] ? __schedule+0x290/0x7f0
[  293.212416]  [<ffffffff810431dc>] do_page_fault+0xc/0x10
[  293.212676]  [<ffffffff81658062>] page_fault+0x22/0x30
[  293.212937]  [<ffffffff8118b189>] ? do_sys_poll+0x179/0x5b0
[  293.213196]  [<ffffffff8118b13d>] ? do_sys_poll+0x12d/0x5b0
[  293.213459]  [<ffffffff815ead03>] ? unix_stream_sendmsg+0x413/0x450
[  293.213724]  [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[  293.213992]  [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[  293.214259]  [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[  293.214528]  [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[  293.214792]  [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[  293.215056]  [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[  293.215321]  [<ffffffff8117ac45>] ? SYSC_newfstat+0x25/0x30
[  293.215586]  [<ffffffff8118b697>] SyS_poll+0x77/0x100
[  293.215851]  [<ffffffff81656ad2>] system_call_fastpath+0x12/0x17
[  293.216115] Mem-Info:
[  293.216374] Node 0 DMA per-cpu:
[  293.216685] CPU    0: hi:    0, btch:   1 usd:   0
[  293.216948] CPU    1: hi:    0, btch:   1 usd:   0
[  293.217209] CPU    2: hi:    0, btch:   1 usd:   0
[  293.217473] CPU    3: hi:    0, btch:   1 usd:   0
[  293.217734] CPU    4: hi:    0, btch:   1 usd:   0
[  293.217999] CPU    5: hi:    0, btch:   1 usd:   0
[  293.218261] CPU    6: hi:    0, btch:   1 usd:   0
[  293.218525] CPU    7: hi:    0, btch:   1 usd:   0
[  293.218787] CPU    8: hi:    0, btch:   1 usd:   0
[  293.219051] CPU    9: hi:    0, btch:   1 usd:   0
[  293.219314] CPU   10: hi:    0, btch:   1 usd:   0
[  293.219580] CPU   11: hi:    0, btch:   1 usd:   0
[  293.219843] Node 0 DMA32 per-cpu:
[  293.220149] CPU    0: hi:  186, btch:  31 usd:   0
[  293.220411] CPU    1: hi:  186, btch:  31 usd:   0
[  293.220677] CPU    2: hi:  186, btch:  31 usd:   0
[  293.220940] CPU    3: hi:  186, btch:  31 usd:   0
[  293.221203] CPU    4: hi:  186, btch:  31 usd:   0
[  293.221467] CPU    5: hi:  186, btch:  31 usd:   0
[  293.221730] CPU    6: hi:  186, btch:  31 usd:   0
[  293.221992] CPU    7: hi:  186, btch:  31 usd:   0
[  293.222253] CPU    8: hi:  186, btch:  31 usd:   0
[  293.222519] CPU    9: hi:  186, btch:  31 usd:   0
[  293.222782] CPU   10: hi:  186, btch:  31 usd:   0
[  293.223043] CPU   11: hi:  186, btch:  31 usd:   0
[  293.223306] Node 0 Normal per-cpu:
[  293.223615] CPU    0: hi:  186, btch:  31 usd:   0
[  293.223878] CPU    1: hi:  186, btch:  31 usd:   0
[  293.224142] CPU    2: hi:  186, btch:  31 usd:   0
[  293.224404] CPU    3: hi:  186, btch:  31 usd:   0
[  293.224672] CPU    4: hi:  186, btch:  31 usd:   0
[  293.224934] CPU    5: hi:  186, btch:  31 usd:   0
[  293.225198] CPU    6: hi:  186, btch:  31 usd:   0
[  293.225462] CPU    7: hi:  186, btch:  31 usd:   0
[  293.225726] CPU    8: hi:  186, btch:  31 usd:   0
[  293.225989] CPU    9: hi:  186, btch:  31 usd:   0
[  293.226249] CPU   10: hi:  186, btch:  31 usd:   0
[  293.226507] CPU   11: hi:  186, btch:  31 usd:   0
[  293.226769] active_anon:0 inactive_anon:0 isolated_anon:0
[  293.226769]  active_file:148 inactive_file:26 isolated_file:0
[  293.226769]  unevictable:0 dirty:0 writeback:0 unstable:0
[  293.226769]  free:25182 slab_reclaimable:3095 slab_unreclaimable:3990011
[  293.226769]  mapped:17 shmem:0 pagetables:427 bounce:0
[  293.226769]  free_cma:0
[  293.228100] Node 0 DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  293.229516] lowmem_reserve[]: 0 1917 15995 15995
[  293.230008] Node 0 DMA32 free:60388kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:68kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872156kB kernel_stack:144kB pagetables:232kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5548 all_unreclaimable? yes
[  293.236698] lowmem_reserve[]: 0 0 14077 14077
[  293.237185] Node 0 Normal free:24484kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:0kB active_file:508kB inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11024kB slab_unreclaimable:14087856kB kernel_stack:2832kB pagetables:1476kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3904 all_unreclaimable? yes
[  293.238816] lowmem_reserve[]: 0 0 0 0
[  293.239309] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15856kB
[  293.240758] Node 0 DMA32: 408*4kB (UEM) 43*8kB (UEM) 23*16kB (EM) 20*32kB (EM) 17*64kB (EM) 10*128kB (UEM) 9*256kB (UM) 5*512kB (EM) 3*1024kB (EM) 3*2048kB (MR) 10*4096kB (MR) = 60392kB
[  293.242558] Node 0 Normal: 912*4kB (UEM) 585*8kB (UEM) 193*16kB (UEM) 100*32kB (UEM) 46*64kB (UEM) 16*128kB (M) 3*256kB (UM) 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 24472kB
[  293.244228] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  293.244707] 172 total pagecache pages
[  293.244964] 0 pages in swap cache
[  293.245226] Swap cache stats: add 4615, delete 4615, find 2631/2669
[  293.245492] Free swap  = 8209840kB
[  293.245751] Total swap = 8216572kB
[  293.246012] 4184707 pages RAM
[  293.246272] 0 pages HighMem/MovableOnly
[  293.246535] 85688 pages reserved
[  293.246797] 0 pages hwpoisoned
[  293.247059] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  293.247544] [  680]     0   680     2678        0       9      107         -1000 udevd
[  293.248022] [ 1833]     0  1833    10161        0      24       70             0 monitor
[  293.248502] [ 1834]     0  1834    10196        0      27      131             0 ovsdb-server
[  293.248984] [ 1844]     0  1844    10299        0      24       82             0 monitor
[  293.249467] [ 2261]     0  2261    62333        0      22      143             0 rsyslogd
[  293.249949] [ 2293]    81  2293     5366        1      13       69             0 dbus-daemon
[  293.250431] [ 2345]     0  2345     5627        0      25       42             0 hald-addon-inpu
[  293.250912] [ 2354]    68  2354     4498        1      20       37             0 hald-addon-acpi
[  293.251396] [ 2363]     0  2363     2677        0       9      106         -1000 udevd
[  293.251870] [ 2486]     0  2486    16672        0      33      179         -1000 sshd
[  293.252346] [ 2511]     0  2511    29328        1      16      152             0 crond
[  293.252825] [ 2528]     0  2528     5400        0      14       46             0 atd
[  293.253304] [ 2541]     0  2541    26020        1      12       28             0 rhsmcertd
[  293.253787] [ 2562]     0  2562     1031        1       9       18             0 mingetty
[  293.254267] [ 2564]     0  2564     1031        1       9       18             0 mingetty
[  293.254747] [ 2566]     0  2566     1031        1       9       17             0 mingetty
[  293.255218] [ 2568]     0  2568     1031        1       9       18             0 mingetty
[  293.255688] [ 2570]     0  2570     1031        1       9       18             0 mingetty
[  293.256157] [ 2571]     0  2571     2677        0       9      106         -1000 udevd
[  293.256634] [ 2573]     0  2573     1031        1       9       18             0 mingetty
[  293.257108] [ 2576]     0  2576    25109        1      52      234             0 sshd
[  293.257585] [ 2598]   500  2598    25109        0      50      247             0 sshd
[  293.258059] Out of memory: Kill process 2598 (sshd) score 0 or sacrifice child
[  293.258536] Killed process 2598 (sshd) total-vm:100436kB, anon-rss:0kB, file-rss:0kB
[... etc ...]

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-12 10:39   ` Jesper Dangaard Brouer
@ 2014-12-12 18:31     ` Christoph Lameter
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-12-12 18:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: akpm, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Alexander Duyck

On Fri, 12 Dec 2014, Jesper Dangaard Brouer wrote:

> Crash/OOM during IP-forwarding network overload test[1] with pktgen,
> single flow thus activating a single CPU on target (device under test).

Hmmm... Bisected it, and the patch that removes the page pointer from
kmem_cache_cpu causes a memory leak. Pretty obvious with hackbench.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
                   ` (8 preceding siblings ...)
  2014-12-11 17:37 ` Jesper Dangaard Brouer
@ 2014-12-15  7:59 ` Joonsoo Kim
  2014-12-17  7:13   ` Joonsoo Kim
  9 siblings, 1 reply; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-15  7:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 10:30:17AM -0600, Christoph Lameter wrote:
> We had to insert a preempt enable/disable in the fastpath a while ago. This
> was mainly due to a lot of state that is kept to be allocating from the per
> cpu freelist. In particular the page field is not covered by
> this_cpu_cmpxchg used in the fastpath to do the necessary atomic state
> change for fast path allocation and freeing.
> 
> This patch removes the need for the page field to describe the state of the
> per cpu list. The freelist pointer can be used to determine the page struct
> address if necessary.
> 
> However, currently this does not work for the termination value of a list
> which is NULL and the same for all slab pages. If we use a valid pointer
> into the page as well as set the last bit then all freelist pointers can
> always be used to determine the address of the page struct and we will not
> need the page field anymore in the per cpu are for a slab. Testing for the
> end of the list is a test if the first bit is set.
> 
> So the first patch changes the termination pointer for freelists to do just
> that. The second removes the page field and then third can then remove the
> preempt enable/disable.
> 
> Removing the ->page field reduces the cache footprint of the fastpath so hopefully overall
> allocator effectiveness will increase further. Also RT uses full preemption which means
> that currently pretty expensive code has to be inserted into the fastpath. This approach
> allows the removal of that code and a corresponding performance increase.
> 
> For V1 a number of changes were made to avoid the overhead of virt_to_page
> and page_address from the RFC.
> 
> Slab Benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
> 20%-50% of fastpath latency:
> 
> Before:
> 
> Single thread testing
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 68 cycles kfree -> 107 cycles
> 10000 times kmalloc(16) -> 69 cycles kfree -> 108 cycles
> 10000 times kmalloc(32) -> 78 cycles kfree -> 112 cycles
> 10000 times kmalloc(64) -> 97 cycles kfree -> 112 cycles
> 10000 times kmalloc(128) -> 111 cycles kfree -> 119 cycles
> 10000 times kmalloc(256) -> 114 cycles kfree -> 139 cycles
> 10000 times kmalloc(512) -> 110 cycles kfree -> 142 cycles
> 10000 times kmalloc(1024) -> 114 cycles kfree -> 156 cycles
> 10000 times kmalloc(2048) -> 155 cycles kfree -> 174 cycles
> 10000 times kmalloc(4096) -> 203 cycles kfree -> 209 cycles
> 10000 times kmalloc(8192) -> 361 cycles kfree -> 265 cycles
> 10000 times kmalloc(16384) -> 597 cycles kfree -> 286 cycles
> 
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 114 cycles
> 10000 times kmalloc(16)/kfree -> 115 cycles
> 10000 times kmalloc(32)/kfree -> 117 cycles
> 10000 times kmalloc(64)/kfree -> 115 cycles
> 10000 times kmalloc(128)/kfree -> 111 cycles
> 10000 times kmalloc(256)/kfree -> 116 cycles
> 10000 times kmalloc(512)/kfree -> 110 cycles
> 10000 times kmalloc(1024)/kfree -> 114 cycles
> 10000 times kmalloc(2048)/kfree -> 110 cycles
> 10000 times kmalloc(4096)/kfree -> 107 cycles
> 10000 times kmalloc(8192)/kfree -> 108 cycles
> 10000 times kmalloc(16384)/kfree -> 706 cycles
> 
> 
> After:
> 
> 
> Single thread testing
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 41 cycles kfree -> 81 cycles
> 10000 times kmalloc(16) -> 47 cycles kfree -> 88 cycles
> 10000 times kmalloc(32) -> 48 cycles kfree -> 93 cycles
> 10000 times kmalloc(64) -> 58 cycles kfree -> 89 cycles
> 10000 times kmalloc(128) -> 84 cycles kfree -> 104 cycles
> 10000 times kmalloc(256) -> 92 cycles kfree -> 125 cycles
> 10000 times kmalloc(512) -> 86 cycles kfree -> 129 cycles
> 10000 times kmalloc(1024) -> 88 cycles kfree -> 125 cycles
> 10000 times kmalloc(2048) -> 120 cycles kfree -> 159 cycles
> 10000 times kmalloc(4096) -> 176 cycles kfree -> 183 cycles
> 10000 times kmalloc(8192) -> 294 cycles kfree -> 233 cycles
> 10000 times kmalloc(16384) -> 585 cycles kfree -> 291 cycles
> 
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 100 cycles
> 10000 times kmalloc(16)/kfree -> 108 cycles
> 10000 times kmalloc(32)/kfree -> 101 cycles
> 10000 times kmalloc(64)/kfree -> 109 cycles
> 10000 times kmalloc(128)/kfree -> 125 cycles
> 10000 times kmalloc(256)/kfree -> 60 cycles
> 10000 times kmalloc(512)/kfree -> 60 cycles
> 10000 times kmalloc(1024)/kfree -> 67 cycles
> 10000 times kmalloc(2048)/kfree -> 60 cycles
> 10000 times kmalloc(4096)/kfree -> 65 cycles
> 10000 times kmalloc(8192)/kfree -> 60 cycles

Hello, Christoph.

I haven't reviewed it in detail, but, at a glance, the overall patchset looks good.
But the result above looks odd. The improvement is beyond what we can expect.
Do you have any idea why allocating objects larger than 256 bytes is so
fast?

Thanks.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-10 16:30 ` [PATCH 3/7] slub: Do not use c->page on free Christoph Lameter
  2014-12-10 16:54   ` Pekka Enberg
@ 2014-12-15  8:03   ` Joonsoo Kim
  2014-12-15 14:16     ` Christoph Lameter
  1 sibling, 1 reply; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-15  8:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	Jesper Dangaard Brouer

On Wed, Dec 10, 2014 at 10:30:20AM -0600, Christoph Lameter wrote:
> Avoid using the page struct address on free by just doing an
> address comparison. That is easily doable now that the page address
> is available in the page struct and we already have the page struct
> address of the object to be freed calculated.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-12-09 12:25:45.770405462 -0600
> +++ linux/mm/slub.c	2014-12-09 12:25:45.766405582 -0600
> @@ -2625,6 +2625,13 @@ slab_empty:
>  	discard_slab(s, page);
>  }
>  
> +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> +{
> +	long d = p - page->address;
> +
> +	return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> +}
> +

Sometimes, compound_order() induces one more cacheline access, because
compound_order() accesses the second struct page in order to get the order. Is
there any way to remove this?
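
(For context, a rough sketch of why that happens: the order of a compound page
is kept in the first tail page, i.e. page[1], so reading it can touch a struct
page -- and possibly a cacheline -- that the free fastpath does not otherwise
need.  The exact field holding the order has varied between kernel versions,
so treat the body below as an illustration, not a copy of any one version:)

static inline unsigned int compound_order_sketch(struct page *page)
{
	if (!PageHead(page))
		return 0;			/* not compound: order 0 */
	return page[1].compound_order;		/* stored in the first tail page */
}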

Thanks.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-15  8:03   ` Joonsoo Kim
@ 2014-12-15 14:16     ` Christoph Lameter
  2014-12-16  2:42       ` Joonsoo Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-15 14:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	Jesper Dangaard Brouer

On Mon, 15 Dec 2014, Joonsoo Kim wrote:

> > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> > +{
> > +	long d = p - page->address;
> > +
> > +	return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> > +}
> > +
>
> Sometimes, compound_order() induces one more cacheline access, because
> compound_order() access second struct page in order to get order. Is there
> any way to remove this?

I already have code there to avoid the access if it's within a MAX_ORDER
page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-15 14:16     ` Christoph Lameter
@ 2014-12-16  2:42       ` Joonsoo Kim
  2014-12-16  7:54         ` Andrey Ryabinin
  0 siblings, 1 reply; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-16  2:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, rostedt, linux-kernel, Thomas Gleixner, linux-mm, penberg,
	Jesper Dangaard Brouer

On Mon, Dec 15, 2014 at 08:16:00AM -0600, Christoph Lameter wrote:
> On Mon, 15 Dec 2014, Joonsoo Kim wrote:
> 
> > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> > > +{
> > > +	long d = p - page->address;
> > > +
> > > +	return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> > > +}
> > > +
> >
> > Sometimes, compound_order() induces one more cacheline access, because
> > compound_order() access second struct page in order to get order. Is there
> > any way to remove this?
> 
> I already have code there to avoid the access if it's within a MAX_ORDER
> page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?

That is a solution to avoid the compound_order() call when the slab of the
object does not match the per cpu slab.

What I'm asking is whether there is a way to avoid the compound_order() call
when the slab of the object does match the per cpu slab.

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16  2:42       ` Joonsoo Kim
@ 2014-12-16  7:54         ` Andrey Ryabinin
  2014-12-16  8:25           ` Joonsoo Kim
  2014-12-16 14:05           ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 50+ messages in thread
From: Andrey Ryabinin @ 2014-12-16  7:54 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Christoph Lameter, akpm, rostedt, LKML, Thomas Gleixner,
	linux-mm, Pekka Enberg, Jesper Dangaard Brouer

2014-12-16 5:42 GMT+03:00 Joonsoo Kim <iamjoonsoo.kim@lge.com>:
> On Mon, Dec 15, 2014 at 08:16:00AM -0600, Christoph Lameter wrote:
>> On Mon, 15 Dec 2014, Joonsoo Kim wrote:
>>
>> > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
>> > > +{
>> > > + long d = p - page->address;
>> > > +
>> > > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
>> > > +}
>> > > +
>> >
>> > Sometimes, compound_order() induces one more cacheline access, because
>> > compound_order() access second struct page in order to get order. Is there
>> > any way to remove this?
>>
>> I already have code there to avoid the access if it's within a MAX_ORDER
>> page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?
>
> That is the solution to avoid compound_order() call when slab of
> object isn't matched with per cpu slab.
>
> What I'm asking is whether there is a way to avoid compound_order() call when slab
> of object is matched with per cpu slab or not.
>

Can we use page->objects for that?

Like this:

        return d > 0 && d < page->objects * s->size;
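
Spelled out, the helper from patch 3 would then be roughly the following (a
sketch only, combining the quoted code with the check above; page->address is
the field introduced by this patchset):

static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
{
	long d = p - page->address;

	/*
	 * Bound the offset by the number of objects in this slab times
	 * the object size instead of by the compound page size, so the
	 * fastpath needs neither compound_order() nor the second
	 * struct page.
	 */
	return d > 0 && d < page->objects * s->size;
}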

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16  7:54         ` Andrey Ryabinin
@ 2014-12-16  8:25           ` Joonsoo Kim
  2014-12-16 14:53             ` Christoph Lameter
  2014-12-16 14:05           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-16  8:25 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Christoph Lameter, akpm, rostedt, LKML, Thomas Gleixner,
	linux-mm, Pekka Enberg, Jesper Dangaard Brouer

On Tue, Dec 16, 2014 at 11:54:12AM +0400, Andrey Ryabinin wrote:
> 2014-12-16 5:42 GMT+03:00 Joonsoo Kim <iamjoonsoo.kim@lge.com>:
> > On Mon, Dec 15, 2014 at 08:16:00AM -0600, Christoph Lameter wrote:
> >> On Mon, 15 Dec 2014, Joonsoo Kim wrote:
> >>
> >> > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> >> > > +{
> >> > > + long d = p - page->address;
> >> > > +
> >> > > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> >> > > +}
> >> > > +
> >> >
> >> > Sometimes, compound_order() induces one more cacheline access, because
> >> > compound_order() access second struct page in order to get order. Is there
> >> > any way to remove this?
> >>
> >> I already have code there to avoid the access if it's within a MAX_ORDER
> >> page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?
> >
> > That is the solution to avoid compound_order() call when slab of
> > object isn't matched with per cpu slab.
> >
> > What I'm asking is whether there is a way to avoid compound_order() call when slab
> > of object is matched with per cpu slab or not.
> >
> 
> Can we use page->objects for that?
> 
> Like this:
> 
>         return d > 0 && d < page->objects * s->size;
> 

Yes! That's what I'm looking for.
Christoph, how about the above change?

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16  7:54         ` Andrey Ryabinin
  2014-12-16  8:25           ` Joonsoo Kim
@ 2014-12-16 14:05           ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-16 14:05 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Joonsoo Kim, Christoph Lameter, akpm, rostedt, LKML,
	Thomas Gleixner, linux-mm, Pekka Enberg, brouer

On Tue, 16 Dec 2014 11:54:12 +0400
Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:

> 2014-12-16 5:42 GMT+03:00 Joonsoo Kim <iamjoonsoo.kim@lge.com>:
> > On Mon, Dec 15, 2014 at 08:16:00AM -0600, Christoph Lameter wrote:
> >> On Mon, 15 Dec 2014, Joonsoo Kim wrote:
> >>
> >> > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> >> > > +{
> >> > > + long d = p - page->address;
> >> > > +
> >> > > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> >> > > +}
> >> > > +
> >> >
> >> > Sometimes, compound_order() induces one more cacheline access, because
> >> > compound_order() accesses the second struct page in order to get the order. Is there
> >> > any way to remove this?
> >>
> >> I already have code there to avoid the access if it's within a MAX_ORDER
> >> page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?
> >
> > That only avoids the compound_order() call when the object's slab
> > doesn't match the per cpu slab.
> >
> > What I'm asking is whether there is a way to avoid the compound_order()
> > call when the object's slab does match the per cpu slab.
> >
> 
> Can we use page->objects for that?
> 
> Like this:
> 
>         return d > 0 && d < page->objects * s->size;

I gave this change a quick micro benchmark spin (with Christoph's
tool); the results are below.

Notice that the "2. Kmalloc: alloc/free test" numbers for small object sizes improve,
which is more "back-to-normal", i.e. closer to what they were before this patchset.

Before (with curr patchset):
============================

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 50 cycles kfree -> 60 cycles
 10000 times kmalloc(16) -> 52 cycles kfree -> 60 cycles
 10000 times kmalloc(32) -> 56 cycles kfree -> 64 cycles
 10000 times kmalloc(64) -> 67 cycles kfree -> 72 cycles
 10000 times kmalloc(128) -> 86 cycles kfree -> 79 cycles
 10000 times kmalloc(256) -> 97 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 88 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 91 cycles kfree -> 115 cycles
 10000 times kmalloc(2048) -> 119 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 159 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 269 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 498 cycles kfree -> 291 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 112 cycles
 10000 times kmalloc(16)/kfree -> 118 cycles
 10000 times kmalloc(32)/kfree -> 117 cycles
 10000 times kmalloc(64)/kfree -> 122 cycles
 10000 times kmalloc(128)/kfree -> 133 cycles
 10000 times kmalloc(256)/kfree -> 79 cycles
 10000 times kmalloc(512)/kfree -> 79 cycles
 10000 times kmalloc(1024)/kfree -> 79 cycles
 10000 times kmalloc(2048)/kfree -> 72 cycles
 10000 times kmalloc(4096)/kfree -> 78 cycles
 10000 times kmalloc(8192)/kfree -> 78 cycles
 10000 times kmalloc(16384)/kfree -> 596 cycles

After (with proposed change):
=============================
 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 53 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 53 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 57 cycles kfree -> 66 cycles
 10000 times kmalloc(64) -> 68 cycles kfree -> 72 cycles
 10000 times kmalloc(128) -> 77 cycles kfree -> 80 cycles
 10000 times kmalloc(256) -> 98 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 87 cycles kfree -> 113 cycles
 10000 times kmalloc(1024) -> 90 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 116 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 160 cycles kfree -> 164 cycles
 10000 times kmalloc(8192) -> 269 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 499 cycles kfree -> 295 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 74 cycles
 10000 times kmalloc(16)/kfree -> 73 cycles
 10000 times kmalloc(32)/kfree -> 73 cycles
 10000 times kmalloc(64)/kfree -> 74 cycles
 10000 times kmalloc(128)/kfree -> 73 cycles
 10000 times kmalloc(256)/kfree -> 72 cycles
 10000 times kmalloc(512)/kfree -> 73 cycles
 10000 times kmalloc(1024)/kfree -> 72 cycles
 10000 times kmalloc(2048)/kfree -> 73 cycles
 10000 times kmalloc(4096)/kfree -> 72 cycles
 10000 times kmalloc(8192)/kfree -> 72 cycles
 10000 times kmalloc(16384)/kfree -> 556 cycles


(kernel 3.18.0-net-next+ SMP PREEMPT on top of f96fe225677)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16  8:25           ` Joonsoo Kim
@ 2014-12-16 14:53             ` Christoph Lameter
  2014-12-16 15:15               ` Jesper Dangaard Brouer
  2014-12-16 15:33               ` Andrey Ryabinin
  0 siblings, 2 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-12-16 14:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrey Ryabinin, akpm, rostedt, LKML, Thomas Gleixner, linux-mm,
	Pekka Enberg, Jesper Dangaard Brouer

On Tue, 16 Dec 2014, Joonsoo Kim wrote:

> > Like this:
> >
> >         return d > 0 && d < page->objects * s->size;
> >
>
> Yes! That's what I'm looking for.
> Christoph, how about the above change?

Ok but now there is a multiplication in the fast path.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16 14:53             ` Christoph Lameter
@ 2014-12-16 15:15               ` Jesper Dangaard Brouer
  2014-12-16 15:34                 ` Andrey Ryabinin
  2014-12-16 15:48                 ` Christoph Lameter
  2014-12-16 15:33               ` Andrey Ryabinin
  1 sibling, 2 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-16 15:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, Andrey Ryabinin, akpm, rostedt, LKML,
	Thomas Gleixner, linux-mm, Pekka Enberg, brouer

On Tue, 16 Dec 2014 08:53:08 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> On Tue, 16 Dec 2014, Joonsoo Kim wrote:
> 
> > > Like this:
> > >
> > >         return d > 0 && d < page->objects * s->size;
> > >
> >
> > Yes! That's what I'm looking for.
> > Christoph, how about the above change?
> 
> Ok but now there is a multiplication in the fast path.

Could we pre-calculate the value (page->objects * s->size) and e.g. store it
in struct kmem_cache, thus saving the imul?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16 14:53             ` Christoph Lameter
  2014-12-16 15:15               ` Jesper Dangaard Brouer
@ 2014-12-16 15:33               ` Andrey Ryabinin
  1 sibling, 0 replies; 50+ messages in thread
From: Andrey Ryabinin @ 2014-12-16 15:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, akpm, rostedt, LKML, Thomas Gleixner, linux-mm,
	Pekka Enberg, Jesper Dangaard Brouer

2014-12-16 17:53 GMT+03:00 Christoph Lameter <cl@linux.com>:
> On Tue, 16 Dec 2014, Joonsoo Kim wrote:
>
>> > Like this:
>> >
>> >         return d > 0 && d < page->objects * s->size;
>> >
>>
>> Yes! That's what I'm looking for.
>> Christoph, how about the above change?
>
> Ok but now there is a multiplication in the fast path.
>

Another idea: store the page's order in the lower bits of page->address.
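
A minimal sketch of that encoding (illustrative only; the helper names
below are made up, and it assumes page->address stays page aligned so the
low PAGE_SHIFT bits are free):

static inline void set_slab_address(struct page *page, void *addr,
				    unsigned int order)
{
	/* addr is page aligned, so the order fits into the unused low bits. */
	page->address = (void *)((unsigned long)addr | order);
}

static inline unsigned int slab_page_order(struct page *page)
{
	return (unsigned int)((unsigned long)page->address & ~PAGE_MASK);
}

static inline void *slab_base_address(struct page *page)
{
	return (void *)((unsigned long)page->address & PAGE_MASK);
}

Offset calculations such as p - page->address would then have to mask the
low bits off first, which puts an extra AND into the fast path.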

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16 15:15               ` Jesper Dangaard Brouer
@ 2014-12-16 15:34                 ` Andrey Ryabinin
  2014-12-16 15:48                 ` Christoph Lameter
  1 sibling, 0 replies; 50+ messages in thread
From: Andrey Ryabinin @ 2014-12-16 15:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Christoph Lameter, Joonsoo Kim, akpm, rostedt, LKML,
	Thomas Gleixner, linux-mm, Pekka Enberg

2014-12-16 18:15 GMT+03:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Tue, 16 Dec 2014 08:53:08 -0600 (CST)
> Christoph Lameter <cl@linux.com> wrote:
>
>> On Tue, 16 Dec 2014, Joonsoo Kim wrote:
>>
>> > > Like this:
>> > >
>> > >         return d > 0 && d < page->objects * s->size;
>> > >
>> >
>> > Yes! That's what I'm looking for.
>> > Christoph, how about above change?
>>
>> Ok but now there is a multiplication in the fast path.
>
> Could we pre-calculate the value (page->objects * s->size) and e.g. store it
> in struct kmem_cache, thus saving the imul?
>

No, one kmem_cache could have several pages with different orders,
and therefore different page->objects values.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Sr. Network Kernel Developer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16 15:15               ` Jesper Dangaard Brouer
  2014-12-16 15:34                 ` Andrey Ryabinin
@ 2014-12-16 15:48                 ` Christoph Lameter
  2014-12-17  7:15                   ` Joonsoo Kim
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-16 15:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Joonsoo Kim, Andrey Ryabinin, akpm, rostedt, LKML,
	Thomas Gleixner, linux-mm, Pekka Enberg

On Tue, 16 Dec 2014, Jesper Dangaard Brouer wrote:

> > Ok but now there is a multiplication in the fast path.
>
> Could we pre-calculate the value (page->objects * s->size) and e.g. store it
> in struct kmem_cache, thus saving the imul?

I think I just used the last available field for the page->address.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-15  7:59 ` Joonsoo Kim
@ 2014-12-17  7:13   ` Joonsoo Kim
  2014-12-17 12:08     ` Jesper Dangaard Brouer
                       ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-17  7:13 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Christoph Lameter, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	Linux Memory Management List, Pekka Enberg,
	Jesper Dangaard Brouer

2014-12-15 16:59 GMT+09:00 Joonsoo Kim <iamjoonsoo.kim@lge.com>:
> On Wed, Dec 10, 2014 at 10:30:17AM -0600, Christoph Lameter wrote:
>> We had to insert a preempt enable/disable in the fastpath a while ago. This
>> was mainly due to a lot of state that is kept to be allocating from the per
>> cpu freelist. In particular the page field is not covered by
>> this_cpu_cmpxchg used in the fastpath to do the necessary atomic state
>> change for fast path allocation and freeing.
>>
>> This patch removes the need for the page field to describe the state of the
>> per cpu list. The freelist pointer can be used to determine the page struct
>> address if necessary.
>>
>> However, currently this does not work for the termination value of a list
>> which is NULL and the same for all slab pages. If we use a valid pointer
>> into the page as well as set the last bit then all freelist pointers can
>> always be used to determine the address of the page struct and we will not
>> need the page field anymore in the per cpu are for a slab. Testing for the
>> end of the list is a test if the first bit is set.
>>
>> So the first patch changes the termination pointer for freelists to do just
>> that. The second removes the page field and then third can then remove the
>> preempt enable/disable.
>>
>> Removing the ->page field reduces the cache footprint of the fastpath so hopefully overall
>> allocator effectiveness will increase further. Also RT uses full preemption which means
>> that currently pretty expensive code has to be inserted into the fastpath. This approach
>> allows the removal of that code and a corresponding performance increase.
>>
>> For V1 a number of changes were made to avoid the overhead of virt_to_page
>> and page_address from the RFC.
>>
>> Slab Benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
>> 20%-50% of fastpath latency:
>>
>> Before:
>>
>> Single thread testing
>> 1. Kmalloc: Repeatedly allocate then free test
>> 10000 times kmalloc(8) -> 68 cycles kfree -> 107 cycles
>> 10000 times kmalloc(16) -> 69 cycles kfree -> 108 cycles
>> 10000 times kmalloc(32) -> 78 cycles kfree -> 112 cycles
>> 10000 times kmalloc(64) -> 97 cycles kfree -> 112 cycles
>> 10000 times kmalloc(128) -> 111 cycles kfree -> 119 cycles
>> 10000 times kmalloc(256) -> 114 cycles kfree -> 139 cycles
>> 10000 times kmalloc(512) -> 110 cycles kfree -> 142 cycles
>> 10000 times kmalloc(1024) -> 114 cycles kfree -> 156 cycles
>> 10000 times kmalloc(2048) -> 155 cycles kfree -> 174 cycles
>> 10000 times kmalloc(4096) -> 203 cycles kfree -> 209 cycles
>> 10000 times kmalloc(8192) -> 361 cycles kfree -> 265 cycles
>> 10000 times kmalloc(16384) -> 597 cycles kfree -> 286 cycles
>>
>> 2. Kmalloc: alloc/free test
>> 10000 times kmalloc(8)/kfree -> 114 cycles
>> 10000 times kmalloc(16)/kfree -> 115 cycles
>> 10000 times kmalloc(32)/kfree -> 117 cycles
>> 10000 times kmalloc(64)/kfree -> 115 cycles
>> 10000 times kmalloc(128)/kfree -> 111 cycles
>> 10000 times kmalloc(256)/kfree -> 116 cycles
>> 10000 times kmalloc(512)/kfree -> 110 cycles
>> 10000 times kmalloc(1024)/kfree -> 114 cycles
>> 10000 times kmalloc(2048)/kfree -> 110 cycles
>> 10000 times kmalloc(4096)/kfree -> 107 cycles
>> 10000 times kmalloc(8192)/kfree -> 108 cycles
>> 10000 times kmalloc(16384)/kfree -> 706 cycles
>>
>>
>> After:
>>
>>
>> Single thread testing
>> 1. Kmalloc: Repeatedly allocate then free test
>> 10000 times kmalloc(8) -> 41 cycles kfree -> 81 cycles
>> 10000 times kmalloc(16) -> 47 cycles kfree -> 88 cycles
>> 10000 times kmalloc(32) -> 48 cycles kfree -> 93 cycles
>> 10000 times kmalloc(64) -> 58 cycles kfree -> 89 cycles
>> 10000 times kmalloc(128) -> 84 cycles kfree -> 104 cycles
>> 10000 times kmalloc(256) -> 92 cycles kfree -> 125 cycles
>> 10000 times kmalloc(512) -> 86 cycles kfree -> 129 cycles
>> 10000 times kmalloc(1024) -> 88 cycles kfree -> 125 cycles
>> 10000 times kmalloc(2048) -> 120 cycles kfree -> 159 cycles
>> 10000 times kmalloc(4096) -> 176 cycles kfree -> 183 cycles
>> 10000 times kmalloc(8192) -> 294 cycles kfree -> 233 cycles
>> 10000 times kmalloc(16384) -> 585 cycles kfree -> 291 cycles
>>
>> 2. Kmalloc: alloc/free test
>> 10000 times kmalloc(8)/kfree -> 100 cycles
>> 10000 times kmalloc(16)/kfree -> 108 cycles
>> 10000 times kmalloc(32)/kfree -> 101 cycles
>> 10000 times kmalloc(64)/kfree -> 109 cycles
>> 10000 times kmalloc(128)/kfree -> 125 cycles
>> 10000 times kmalloc(256)/kfree -> 60 cycles
>> 10000 times kmalloc(512)/kfree -> 60 cycles
>> 10000 times kmalloc(1024)/kfree -> 67 cycles
>> 10000 times kmalloc(2048)/kfree -> 60 cycles
>> 10000 times kmalloc(4096)/kfree -> 65 cycles
>> 10000 times kmalloc(8192)/kfree -> 60 cycles
>
> Hello, Christoph.
>
> I haven't reviewed it in detail, but, at a glance, the overall patchset looks good.
> However, the result above looks odd. The improvement is beyond what we can expect.
> Do you have any idea why allocating objects larger than 256 bytes is so
> fast?

Ping... and I found another way to remove preempt_disable/enable
without complex changes.

What we want to ensure is getting tid and kmem_cache_cpu
on the same cpu. We can achieve that goal with the condition loop below.

I ran Jesper's benchmark and saw a 3~5% win in a fast-path loop over
kmem_cache_alloc+free in CONFIG_PREEMPT.

14.5 ns -> 13.8 ns

See following patch.

Thanks.

----------->8-------------
diff --git a/mm/slub.c b/mm/slub.c
index 95d2142..e537af5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2399,8 +2399,10 @@ redo:
         * on a different processor between the determination of the pointer
         * and the retrieval of the tid.
         */
-       preempt_disable();
-       c = this_cpu_ptr(s->cpu_slab);
+       do {
+               tid = this_cpu_read(s->cpu_slab->tid);
+               c = this_cpu_ptr(s->cpu_slab);
+       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

        /*
         * The transaction ids are globally unique per cpu and per operation on
@@ -2408,8 +2410,6 @@ redo:
         * occurs on the right processor and that there was no operation on the
         * linked list in between.
         */
-       tid = c->tid;
-       preempt_enable();

        object = c->freelist;
        page = c->page;
@@ -2655,11 +2655,10 @@ redo:
         * data is retrieved via this pointer. If we are on the same cpu
         * during the cmpxchg then the free will succedd.
         */
-       preempt_disable();
-       c = this_cpu_ptr(s->cpu_slab);
-
-       tid = c->tid;
-       preempt_enable();
+       do {
+               tid = this_cpu_read(s->cpu_slab->tid);
+               c = this_cpu_ptr(s->cpu_slab);
+       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

        if (likely(page == c->page)) {
                set_freepointer(s, object, c->freelist);
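
For reference, here is the same loop again with comments on why the retry
condition is sufficient (just a commented restatement following the
existing tid comments in slub.c, not a further change):

	do {
		/* tid of whatever cpu we are running on right now */
		tid = this_cpu_read(s->cpu_slab->tid);
		/* kmem_cache_cpu of whatever cpu we are on now (maybe another one) */
		c = this_cpu_ptr(s->cpu_slab);
		/*
		 * tid is bumped on every alloc/free and tid values of
		 * different cpus never collide.  A mismatch therefore means
		 * we were migrated, or raced with another operation on this
		 * cpu, between the two reads; simply try again.  Without
		 * CONFIG_PREEMPT we cannot be migrated here, so the check is
		 * compiled away.
		 */
	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));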

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: Do not use c->page on free
  2014-12-16 15:48                 ` Christoph Lameter
@ 2014-12-17  7:15                   ` Joonsoo Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-17  7:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jesper Dangaard Brouer, Joonsoo Kim, Andrey Ryabinin, akpm,
	Steven Rostedt, LKML, Thomas Gleixner, linux-mm, Pekka Enberg

2014-12-17 0:48 GMT+09:00 Christoph Lameter <cl@linux.com>:
> On Tue, 16 Dec 2014, Jesper Dangaard Brouer wrote:
>
>> > Ok but now there is a multiplication in the fast path.
>>
>> Could we pre-calculate the value (page->objects * s->size) and e.g. store it
>> in struct kmem_cache, thus saving the imul?
>
> I think I just used the last available field for the page->address.

Possibly, we can use the _count field.

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-17  7:13   ` Joonsoo Kim
@ 2014-12-17 12:08     ` Jesper Dangaard Brouer
  2014-12-18 14:34       ` Joonsoo Kim
  2014-12-17 15:36     ` Christoph Lameter
  2014-12-17 16:10     ` Christoph Lameter
  2 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-17 12:08 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Christoph Lameter, akpm, Steven Rostedt, LKML,
	Thomas Gleixner, Linux Memory Management List, Pekka Enberg,
	brouer

On Wed, 17 Dec 2014 16:13:49 +0900 Joonsoo Kim <js1304@gmail.com> wrote:

> Ping... and I found another way to remove preempt_disable/enable
> without complex changes.
> 
> What we want to ensure is getting tid and kmem_cache_cpu
> on the same cpu. We can achieve that goal with the condition loop below.
> 
> I ran Jesper's benchmark and saw a 3~5% win in a fast-path loop over
> kmem_cache_alloc+free in CONFIG_PREEMPT.
> 
> 14.5 ns -> 13.8 ns

Hi Kim,

I've tested your patch.  Full report below the patch.

In summary, I'm seeing 18.599 ns -> 17.523 ns (1.076 ns better).

For network overload tests:

Dropping packets in iptables raw, which is hitting the slub fast-path.
Here I'm seeing an improvement of 3ns.

For IP-forward, which also invokes the slub slower path, I'm seeing
an improvement of 6ns (I was not expecting to see any improvement
here; the kmem_cache_alloc code is 24 bytes smaller, so perhaps it's
saving some icache).

Full report below the patch...
 
> See following patch.
> 
> Thanks.
> 
> ----------->8-------------
> diff --git a/mm/slub.c b/mm/slub.c
> index 95d2142..e537af5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2399,8 +2399,10 @@ redo:
>          * on a different processor between the determination of the pointer
>          * and the retrieval of the tid.
>          */
> -       preempt_disable();
> -       c = this_cpu_ptr(s->cpu_slab);
> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
>         /*
>          * The transaction ids are globally unique per cpu and per operation on
> @@ -2408,8 +2410,6 @@ redo:
>          * occurs on the right processor and that there was no operation on the
>          * linked list in between.
>          */
> -       tid = c->tid;
> -       preempt_enable();
> 
>         object = c->freelist;
>         page = c->page;
> @@ -2655,11 +2655,10 @@ redo:
>          * data is retrieved via this pointer. If we are on the same cpu
>          * during the cmpxchg then the free will succedd.
>          */
> -       preempt_disable();
> -       c = this_cpu_ptr(s->cpu_slab);
> -
> -       tid = c->tid;
> -       preempt_enable();
> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
>         if (likely(page == c->page)) {
>                 set_freepointer(s, object, c->freelist);

SLUB evaluation 03
==================

Testing patch from Joonsoo Kim <iamjoonsoo.kim@lge.com> slub fast-path
preempt_{disable,enable} avoidance.

Kernel
======
Compiler: GCC 4.9.1

Kernel config ::

 $ grep PREEMPT .config
 CONFIG_PREEMPT_RCU=y
 CONFIG_PREEMPT_NOTIFIERS=y
 # CONFIG_PREEMPT_NONE is not set
 # CONFIG_PREEMPT_VOLUNTARY is not set
 CONFIG_PREEMPT=y
 CONFIG_PREEMPT_COUNT=y
 # CONFIG_DEBUG_PREEMPT is not set

 $ egrep -e "SLUB|SLAB" .config
 # CONFIG_SLUB_DEBUG is not set
 # CONFIG_SLAB is not set
 CONFIG_SLUB=y
 # CONFIG_SLUB_CPU_PARTIAL is not set
 # CONFIG_SLUB_STATS is not set

On top of::

 commit f96fe225677b3efb74346ebd56fafe3997b02afa
 Merge: 5543798 eea3e8f
 Author: Linus Torvalds <torvalds@linux-foundation.org>
 Date:   Fri Dec 12 16:11:12 2014 -0800

    Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net


Setup
=====

netfilter_unload_modules.sh
netfilter_unload_modules.sh
sudo rmmod nf_reject_ipv4 nf_reject_ipv6

base_device_setup.sh eth4  # 10G sink/receiving interface (ixgbe)
base_device_setup.sh eth5
sudo ethtool --coalesce eth4 rx-usecs 30
sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0
sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5


# sudo tuned-adm active
Current active profile: latency-performance

Drop in raw
-----------
alias iptables='sudo iptables'
iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple -d 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple

Generator
---------
./pktgen02_burst.sh -d 198.18.0.2 -i eth8 -m 90:E2:BA:0A:56:B4 -b 8 -t 3 -s 64


Patch by Joonsoo Kim to avoid preempt in slub
=============================================

baseline: without patch
-----------------------

baseline kernel v3.18-7016-gf96fe22 at commit f96fe22567

Type:kmem fastpath reuse Per elem: 46 cycles(tsc) 18.599 ns
 - (measurement period time:1.859917529 sec time_interval:1859917529)
 - (invoke count:100000000 tsc_interval:4649791431)

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.077 ns
 - (measurement period time:1.025993290 sec time_interval:1025993290)
 - (invoke count:25600000 tsc_interval:2564981743)

single flow/CPU
 * IP-forward
  - instant rx:0 tx:1165376 pps n:60 average: rx:0 tx:1165928 pps
    (instant variation TX -0.407 ns (min:-0.828 max:0.507) RX 0.000 ns)
 * Drop in RAW (slab fast-path test)
   - instant rx:3245248 tx:0 pps n:60 average: rx:3245325 tx:0 pps
     (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.007 ns)

Christoph's slab_test, baseline kernel (at commit f96fe22567)::

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163bd0 00000000000000e1 T kmem_cache_alloc
ffffffff81163ac0 000000000000010c T kmem_cache_alloc_node
ffffffff81162cb0 000000000000013b T kmem_cache_free


with patch
----------

single flow/CPU
 * IP-forward
  - instant rx:0 tx:1174652 pps n:60 average: rx:0 tx:1174222 pps
    (instant variation TX 0.311 ns (min:-0.230 max:1.018) RX 0.000 ns)
 * compare against baseline:
  - 1174222-1165928 = +8294pps
  - (1/1174222*10^9)-(1/1165928*10^9) = -6.058ns

 * Drop in RAW (slab fast-path test)
  - instant rx:3277440 tx:0 pps n:74 average: rx:3277737 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.028 ns)
 * compare against baseline:
  - 3277737-3245325 = +32412 pps
  - (1/3277737*10^9)-(1/3245325*10^9) = -3.047ns

SLUB fast-path test: time_bench_kmem_cache1
 * modprobe time_bench_kmem_cache1 ; rmmod time_bench_kmem_cache1; sudo dmesg -c

Type:kmem fastpath reuse Per elem: 43 cycles(tsc) 17.523 ns (step:0)
 - (measurement period time:1.752338378 sec time_interval:1752338378)
 - (invoke count:100000000 tsc_interval:4380843588)
  * difference: 17.523 - 18.599 = -1.076ns

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.369 ns (step:0)
 - (measurement period time:1.033447112 sec time_interval:1033447112)
 - (invoke count:25600000 tsc_interval:2583616203)
    * difference: 40.369 - 40.077 = +0.292ns


Christoph's slab_test::

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163ba0 00000000000000c9 T kmem_cache_alloc
ffffffff81163aa0 00000000000000f8 T kmem_cache_alloc_node
ffffffff81162cb0 0000000000000133 T kmem_cache_free



Kernel size change
------------------

 $ scripts/bloat-o-meter vmlinux vmlinux-kim-preempt-avoid
 add/remove: 0/0 grow/shrink: 0/8 up/down: 0/-248 (-248)
 function                                     old     new   delta
 kmem_cache_free                              315     307      -8
 kmem_cache_alloc_node                        268     248     -20
 kmem_cache_alloc                             225     201     -24
 kfree                                        274     250     -24
 __kmalloc_node_track_caller                  356     324     -32
 __kmalloc_node                               340     308     -32
 __kmalloc                                    324     273     -51
 __kmalloc_track_caller                       343     286     -57


Qmempool notes:
---------------

On baseline kernel:

Type:qmempool fastpath reuse SOFTIRQ Per elem: 33 cycles(tsc) 13.287 ns
 - (measurement period time:0.398628965 sec time_interval:398628965)
 - (invoke count:30000000 tsc_interval:996571541)

Type:qmempool fastpath reuse BH-disable Per elem: 47 cycles(tsc) 19.180 ns
 - (measurement period time:0.575425927 sec time_interval:575425927)
 - (invoke count:30000000 tsc_interval:1438563781)

qmempool_bench: N-pattern with 256 elements

Type:qmempool alloc+free N-pattern Per elem: 62 cycles(tsc) 24.955 ns (step:0)
 - (measurement period time:0.638871008 sec time_interval:638871008)
 - (invoke count:25600000 tsc_interval:1597176303)


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-17  7:13   ` Joonsoo Kim
  2014-12-17 12:08     ` Jesper Dangaard Brouer
@ 2014-12-17 15:36     ` Christoph Lameter
  2014-12-18 14:38       ` Joonsoo Kim
  2014-12-17 16:10     ` Christoph Lameter
  2 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-17 15:36 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	Linux Memory Management List, Pekka Enberg,
	Jesper Dangaard Brouer

On Wed, 17 Dec 2014, Joonsoo Kim wrote:

> Ping... and I found another way to remove preempt_disable/enable
> without complex changes.
>
> What we want to ensure is getting tid and kmem_cache_cpu
> on the same cpu. We can achieve that goal with the condition loop below.
>
> I ran Jesper's benchmark and saw a 3~5% win in a fast-path loop over
> kmem_cache_alloc+free in CONFIG_PREEMPT.
>
> 14.5 ns -> 13.8 ns
>
> See following patch.

Good idea. How does this affect the !CONFIG_PREEMPT case?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-17  7:13   ` Joonsoo Kim
  2014-12-17 12:08     ` Jesper Dangaard Brouer
  2014-12-17 15:36     ` Christoph Lameter
@ 2014-12-17 16:10     ` Christoph Lameter
  2014-12-17 19:44       ` Christoph Lameter
  2014-12-18 14:41       ` Joonsoo Kim
  2 siblings, 2 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-12-17 16:10 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	Linux Memory Management List, Pekka Enberg,
	Jesper Dangaard Brouer

On Wed, 17 Dec 2014, Joonsoo Kim wrote:

> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));


The assembly code produced is a bit weird. I think the compiler undoes what
you wanted to do:

 46fb:       49 8b 1e                mov    (%r14),%rbx				rbx = c =s->cpu_slab?
    46fe:       65 4c 8b 6b 08          mov    %gs:0x8(%rbx),%r13		r13 = tid
    4703:       e8 00 00 00 00          callq  4708 <kmem_cache_alloc+0x48>	??
    4708:       89 c0                   mov    %eax,%eax			??
    470a:       48 03 1c c5 00 00 00    add    0x0(,%rax,8),%rbx		??
    4711:       00
    4712:       4c 3b 6b 08             cmp    0x8(%rbx),%r13			tid == c->tid
    4716:       49 89 d8                mov    %rbx,%r8
    4719:       75 e0                   jne    46fb <kmem_cache_alloc+0x3b>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-17 16:10     ` Christoph Lameter
@ 2014-12-17 19:44       ` Christoph Lameter
  2014-12-18 14:41       ` Joonsoo Kim
  1 sibling, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-12-17 19:44 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	Linux Memory Management List, Pekka Enberg,
	Jesper Dangaard Brouer

On Wed, 17 Dec 2014, Christoph Lameter wrote:

> On Wed, 17 Dec 2014, Joonsoo Kim wrote:
>
> > +       do {
> > +               tid = this_cpu_read(s->cpu_slab->tid);
> > +               c = this_cpu_ptr(s->cpu_slab);
> > +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

Here is another one without debugging:

   0xffffffff811d23bb <+59>:	mov    %gs:0x8(%r9),%rdx		tid(rdx) = this_cpu_read()
   0xffffffff811d23c0 <+64>:	mov    %r9,%r8
   0xffffffff811d23c3 <+67>:	add    %gs:0x7ee37d9d(%rip),%r8         c (r8) =
   0xffffffff811d23cb <+75>:	cmp    0x8(%r8),%rdx			c->tid == tid
   0xffffffff811d23cf <+79>:	jne    0xffffffff811d23bb <kmem_cache_alloc+59>

Actually that looks ok.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-17 12:08     ` Jesper Dangaard Brouer
@ 2014-12-18 14:34       ` Joonsoo Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-18 14:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Joonsoo Kim, Christoph Lameter, akpm, Steven Rostedt, LKML,
	Thomas Gleixner, Linux Memory Management List, Pekka Enberg

2014-12-17 21:08 GMT+09:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Wed, 17 Dec 2014 16:13:49 +0900 Joonsoo Kim <js1304@gmail.com> wrote:
>
>> Ping... and I found another way to remove preempt_disable/enable
>> without complex changes.
>>
>> What we want to ensure is getting tid and kmem_cache_cpu
>> on the same cpu. We can achieve that goal with the condition loop below.
>>
>> I ran Jesper's benchmark and saw a 3~5% win in a fast-path loop over
>> kmem_cache_alloc+free in CONFIG_PREEMPT.
>>
>> 14.5 ns -> 13.8 ns
>
> Hi Kim,
>
> I've tested your patch.  Full report below the patch.
>
> In summary, I'm seeing 18.599 ns -> 17.523 ns (1.076 ns better).

Thanks for testing! :)
It will help to convince others.

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-17 15:36     ` Christoph Lameter
@ 2014-12-18 14:38       ` Joonsoo Kim
  2014-12-18 14:57         ` Christoph Lameter
  0 siblings, 1 reply; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-18 14:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	Linux Memory Management List, Pekka Enberg,
	Jesper Dangaard Brouer

2014-12-18 0:36 GMT+09:00 Christoph Lameter <cl@linux.com>:
> On Wed, 17 Dec 2014, Joonsoo Kim wrote:
>
>> Ping... and I found another way to remove preempt_disable/enable
>> without complex changes.
>>
>> What we want to ensure is getting tid and kmem_cache_cpu
>> on the same cpu. We can achieve that goal with the condition loop below.
>>
>> I ran Jesper's benchmark and saw a 3~5% win in a fast-path loop over
>> kmem_cache_alloc+free in CONFIG_PREEMPT.
>>
>> 14.5 ns -> 13.8 ns
>>
>> See following patch.
>
> Good idea. How does this affect the !CONFIG_PREEMPT case?

One more this_cpu_xxx operation makes the fastpath slower with !CONFIG_PREEMPT,
roughly by 3~5%.

We can deal with each case separately, although it looks dirty.

#ifdef CONFIG_PREEMPT
XXX
#else
YYY
#endif
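
Roughly like this, as a sketch of the shape only (not a tested patch):

#ifdef CONFIG_PREEMPT
	/*
	 * Preemption can migrate us between the two reads, so loop until
	 * tid and kmem_cache_cpu were read on the same cpu.
	 */
	do {
		tid = this_cpu_read(s->cpu_slab->tid);
		c = this_cpu_ptr(s->cpu_slab);
	} while (unlikely(tid != c->tid));
#else
	/*
	 * Without preemption we cannot be migrated here, so a single
	 * this_cpu_ptr() is enough and the extra this_cpu_read() is avoided.
	 */
	c = this_cpu_ptr(s->cpu_slab);
	tid = c->tid;
#endif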

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-17 16:10     ` Christoph Lameter
  2014-12-17 19:44       ` Christoph Lameter
@ 2014-12-18 14:41       ` Joonsoo Kim
  1 sibling, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-18 14:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	Linux Memory Management List, Pekka Enberg,
	Jesper Dangaard Brouer

2014-12-18 1:10 GMT+09:00 Christoph Lameter <cl@linux.com>:
> On Wed, 17 Dec 2014, Joonsoo Kim wrote:
>
>> +       do {
>> +               tid = this_cpu_read(s->cpu_slab->tid);
>> +               c = this_cpu_ptr(s->cpu_slab);
>> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>
>
> Assembly code produced is a bit weird. I think the compiler undoes what
> you wanted to do:

I checked my compiled code and there seems to be no problem.
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

Thanks.

>  46fb:       49 8b 1e                mov    (%r14),%rbx                         rbx = c =s->cpu_slab?
>     46fe:       65 4c 8b 6b 08          mov    %gs:0x8(%rbx),%r13               r13 = tid
>     4703:       e8 00 00 00 00          callq  4708 <kmem_cache_alloc+0x48>     ??
>     4708:       89 c0                   mov    %eax,%eax                        ??
>     470a:       48 03 1c c5 00 00 00    add    0x0(,%rax,8),%rbx                ??
>     4711:       00
>     4712:       4c 3b 6b 08             cmp    0x8(%rbx),%r13                   tid == c->tid
>     4716:       49 89 d8                mov    %rbx,%r8
>     4719:       75 e0                   jne    46fb <kmem_cache_alloc+0x3b>
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-18 14:38       ` Joonsoo Kim
@ 2014-12-18 14:57         ` Christoph Lameter
  2014-12-18 15:08           ` Joonsoo Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2014-12-18 14:57 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	Linux Memory Management List, Pekka Enberg,
	Jesper Dangaard Brouer


On Thu, 18 Dec 2014, Joonsoo Kim wrote:
> > Good idea. How does this affect the !CONFIG_PREEMPT case?
>
> One more this_cpu_xxx operation makes the fastpath slower with !CONFIG_PREEMPT,
> roughly by 3~5%.
>
> We can deal with each case separately, although it looks dirty.

Ok maybe you can come up with a solution that is as clean as possible?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
  2014-12-18 14:57         ` Christoph Lameter
@ 2014-12-18 15:08           ` Joonsoo Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-12-18 15:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, akpm, Steven Rostedt, LKML, Thomas Gleixner,
	Linux Memory Management List, Pekka Enberg,
	Jesper Dangaard Brouer

2014-12-18 23:57 GMT+09:00 Christoph Lameter <cl@linux.com>:
>
> On Thu, 18 Dec 2014, Joonsoo Kim wrote:
>> > Good idea. How does this affect the !CONFIG_PREEMPT case?
>>
>> One more this_cpu_xxx operation makes the fastpath slower with !CONFIG_PREEMPT,
>> roughly by 3~5%.
>>
>> We can deal with each case separately, although it looks dirty.
>
> Ok maybe you can come up with a solution that is as clean as possible?

Okay. Will do!

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2014-12-18 15:08 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-10 16:30 [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Christoph Lameter
2014-12-10 16:30 ` [PATCH 1/7] slub: Remove __slab_alloc code duplication Christoph Lameter
2014-12-10 16:39   ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 2/7] slub: Use page-mapping to store address of page frame like done in SLAB Christoph Lameter
2014-12-10 16:45   ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 3/7] slub: Do not use c->page on free Christoph Lameter
2014-12-10 16:54   ` Pekka Enberg
2014-12-10 17:08     ` Christoph Lameter
2014-12-10 17:32       ` Pekka Enberg
2014-12-10 17:37         ` Christoph Lameter
2014-12-11 13:19           ` Jesper Dangaard Brouer
2014-12-11 15:01             ` Christoph Lameter
2014-12-15  8:03   ` Joonsoo Kim
2014-12-15 14:16     ` Christoph Lameter
2014-12-16  2:42       ` Joonsoo Kim
2014-12-16  7:54         ` Andrey Ryabinin
2014-12-16  8:25           ` Joonsoo Kim
2014-12-16 14:53             ` Christoph Lameter
2014-12-16 15:15               ` Jesper Dangaard Brouer
2014-12-16 15:34                 ` Andrey Ryabinin
2014-12-16 15:48                 ` Christoph Lameter
2014-12-17  7:15                   ` Joonsoo Kim
2014-12-16 15:33               ` Andrey Ryabinin
2014-12-16 14:05           ` Jesper Dangaard Brouer
2014-12-10 16:30 ` [PATCH 4/7] slub: Avoid using the page struct address in allocation fastpath Christoph Lameter
2014-12-10 16:56   ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 5/7] slub: Use end_token instead of NULL to terminate freelists Christoph Lameter
2014-12-10 16:59   ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 6/7] slub: Drop ->page field from kmem_cache_cpu Christoph Lameter
2014-12-10 17:29   ` Pekka Enberg
2014-12-10 16:30 ` [PATCH 7/7] slub: Remove preemption disable/enable from fastpath Christoph Lameter
2014-12-11 13:35 ` [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1 Jesper Dangaard Brouer
2014-12-11 15:03   ` Christoph Lameter
2014-12-11 16:50     ` Jesper Dangaard Brouer
2014-12-11 17:18       ` Christoph Lameter
2014-12-11 18:11         ` Jesper Dangaard Brouer
2014-12-11 17:37 ` Jesper Dangaard Brouer
2014-12-12 10:39   ` Jesper Dangaard Brouer
2014-12-12 18:31     ` Christoph Lameter
2014-12-15  7:59 ` Joonsoo Kim
2014-12-17  7:13   ` Joonsoo Kim
2014-12-17 12:08     ` Jesper Dangaard Brouer
2014-12-18 14:34       ` Joonsoo Kim
2014-12-17 15:36     ` Christoph Lameter
2014-12-18 14:38       ` Joonsoo Kim
2014-12-18 14:57         ` Christoph Lameter
2014-12-18 15:08           ` Joonsoo Kim
2014-12-17 16:10     ` Christoph Lameter
2014-12-17 19:44       ` Christoph Lameter
2014-12-18 14:41       ` Joonsoo Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).